© 2020 Strange Loop
Despite what you might think, successful large-scale cloud systems are not designed by an architect. Instead, they grow organically, failing in complex ways. However, by applying Chaos Engineering, you can prevent failures by detecting weaknesses before they do real harm.
Large systems evolve from successful, smaller ones, an observation predicted by the branch of study known as systems theory. Systems theory also predicts that our systems will inevitably behave, and fail, in unforeseen ways. This talk will draw from the ideas of two very different systems theorists to demonstrate that neither quality architecture nor thorough testing can prevent our software from eventually exhibiting pathological behavior. The first is the safety researcher Sidney Dekker, who proposed a theory of "drift into failure" that describes how seemingly reliable safety-critical systems can still lead to accidents. The second is the late pediatrician John Gall, who coined the "Generalized Uncertainty Principle" about how all types of complex systems behave unexpectedly.
Even though failure is inevitable, there is still hope. Chaos Engineering is an approach that can be used to identify system vulnerabilities before they lead to outages. This talk will cover how to design and run Chaos Engineering experiments, drawing examples from our experiences at Netflix.
Lorin Hochstein is a Sr. Software Engineer on the Traffic & Chaos Team at Netflix, where he works on ensuring that Netflix remains available. He was previously Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.