Chaos engineering sounds cool. I blame Jurassic Park. The name conjures images of mid-career Jeff Goldblum running water down Laura Dern’s hand and babbling about the Butterfly Effect. Or if you read the source material, maybe you remember some of the actual mathematical conceits thrown in to ground the Michael Crichton thriller.
Sadly, the relationship of chaos theory to chaos engineering is tangential at best. The two share a general sense of the uncertainty inherent in large systems, but that’s about it. Which isn’t to say that chaos engineering is boring. Far from it.
Modern Apps are Chaotic
It seems a little counterintuitive that modern applications are prone to chaos. After all, breaking up monolithic apps and embracing microservices is supposed to make everything simpler, right? It allows for easier scaling and better visibility into problems.
Microservices are great. You can isolate them down to specific functions and get a deep understanding of how each one works. The issue is that they inevitably depend on other microservices. As these dependencies compound in a scale-out environment, they produce unpredictable outcomes. It is a tenet of Chaos Engineering that distributed systems are by their nature chaotic.
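To get a rough feel for how dependencies compound, here is a back-of-the-envelope sketch in Python. It assumes a request must pass through a chain of services in series and that each one fails independently with the same availability. Both are simplifying assumptions of mine, not claims about any real system, and the function name is made up for illustration. It only captures the predictable part of the problem; real interactions between services are messier.

```python
def chain_availability(per_service: float, depth: int) -> float:
    """Availability of a request that must pass through `depth` services
    in series, assuming independent failures (a simplification)."""
    return per_service ** depth


if __name__ == "__main__":
    # Each service on its own looks very reliable at 99.9%.
    for depth in (1, 10, 50):
        print(depth, round(chain_availability(0.999, depth), 4))
```

Even in this idealized model, the end-to-end number drops as the chain gets longer, before any of the truly unpredictable interactions enter the picture.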
Burn It Down
With that as an operating assumption, chaos engineering looks to offer a proactive way to gauge the stability of a distributed system. In some ways, it reminds me of doing pen testing on your security procedures. But there are some notable differences. Chaos Engineering is kind of hung up on the whole engineering aspect of its name. It’s not just about breaking everything you can as fast as you can. Pen testing is often opportunistic in that regard. Chaos Engineering is built around making this a systematic approach with a repeatable procedure.
The other major difference comes down to the outcome for the distributed system. Often when a security flaw is discovered, the discovery is not a disruptive event in and of itself. A security exploit in the hands of a malicious actor most definitely will be disruptive. But discovery of that exploit by a pen tester often doesn’t require a disruption of production.
Because Chaos Engineering is concerned with finding the unpredictable flaws in large distributed systems, it is often forced to work on production systems. Quite often it is impossible to set up a test environment that operates at the same complexity, which makes testing there ineffective. This is why a methodical approach in Chaos Engineering isn’t just a matter of pride. It’s required to limit the blast radius of any systemic weaknesses found.
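As a minimal sketch of what "methodical" and "limited blast radius" might look like in code, the Python below picks a bounded subset of instances, applies a fault to only those, and flags the run for abort if an error-rate threshold is crossed. Everything here is hypothetical: the names `run_experiment`, `pick_targets`, and `measure_error_rate` are mine, and the measurement callable is a stand-in for whatever fault injection and monitoring a real setup would use. It is not the API of any actual chaos tool.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    targeted: List[str]
    error_rate: float
    aborted: bool


def pick_targets(instances: List[str], blast_radius: float,
                 rng: random.Random) -> List[str]:
    """Choose a bounded subset of instances to receive the fault."""
    count = max(1, int(len(instances) * blast_radius))
    return rng.sample(instances, count)


def run_experiment(instances: List[str], blast_radius: float,
                   measure_error_rate: Callable[[List[str]], float],
                   max_error_rate: float, seed: int = 0) -> ExperimentResult:
    rng = random.Random(seed)  # fixed seed keeps target selection repeatable
    targets = pick_targets(instances, blast_radius, rng)
    # Hypothetical hook: inject the fault into `targets` and observe the system.
    error_rate = measure_error_rate(targets)
    # Abort condition: stop if the observed impact exceeds the agreed limit.
    aborted = error_rate > max_error_rate
    return ExperimentResult(targets, error_rate, aborted)


# Stand-in metric: pretend each faulted instance adds 1% to the error rate.
instances = [f"svc-{i}" for i in range(20)]
result = run_experiment(instances, 0.1, lambda t: 0.01 * len(t),
                        max_error_rate=0.05)
```

The point of the structure is that the scope (`blast_radius`) and the stopping rule (`max_error_rate`) are decided before anything is broken, which is what separates an experiment from simply breaking things.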
A Mindset
One of the most interesting aspects of Chaos Engineering isn’t a new set of tools that enable it. It’s the change in mindset that it requires to be successful. Much like adopting a DevOps culture, Chaos Engineering requires buy-in to a holistic vision for how to approach stability within distributed systems. Just like CI/CD isn’t just about getting commits in faster, Chaos Engineering isn’t just about breaking as many microservices as you can. It represents a change in workflow, culture, and viewpoint for how to approach overall system stability.
As distributed systems continue to become the norm across organizations, I think Chaos Engineering will be de rigueur in a few years.
Like everyone’s favorite Chaotician said: “Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.” If Chaos Engineering stopped at breaking things, it might still have marginal value, but it would never see adoption on a large scale. By doing the hard work of asking why you should break things, it sets up a framework for how to go about it.