Chaos Engineering, as both a concept and a practice, has grown in popularity over the last few years. It was first popularized by Netflix in 2011 and continues to evolve as adoption grows. What was once an operating model only possible for hyperscalers has come within reach of most organizations with the introduction of enterprise-ready offerings such as Gremlin.
The Emergence of Chaos Engineering
In 2008, Netflix began transitioning from hosting its own service to using Amazon Web Services (AWS) to host its streaming platform. Coinciding with this initiative was a rearchitecting of the streaming platform that broke it into multiple separate but interdependent services and containers, which together made up the application delivered to end users.
To ensure the stability and availability of this new operating model, Netflix introduced “Chaos Monkey” and the Simian Army suite of tools in 2011. Their purpose was to intentionally disrupt elements of the platform’s infrastructure to verify that the network of computers (or “instances” in the case of AWS) responded appropriately and maintained uninterrupted service to Netflix’s customers.
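The idea of deliberately removing an instance and checking that service continues can be sketched in a few lines. This is a toy, in-memory illustration only, not Netflix's implementation: the `Fleet` class, the instance names, and the `min_healthy` threshold are all hypothetical, and nothing here talks to AWS.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """A toy in-memory stand-in for a group of cloud instances (hypothetical)."""
    instances: set = field(default_factory=set)
    min_healthy: int = 2  # assumed minimum needed to keep serving traffic

    def terminate_random(self, rng: random.Random) -> str:
        """Remove one instance chosen at random and return its name."""
        victim = rng.choice(sorted(self.instances))
        self.instances.remove(victim)
        return victim

    def is_serving(self) -> bool:
        """The service counts as up while enough instances remain."""
        return len(self.instances) >= self.min_healthy


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the run is repeatable
    fleet = Fleet(instances={"i-a", "i-b", "i-c"})
    victim = fleet.terminate_random(rng)
    print(f"terminated {victim}; still serving: {fleet.is_serving()}")
```

In a real environment the interesting part is the check after the termination: whether the remaining instances absorb the load without customers noticing.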
This may seem like a radical and reckless practice to those unfamiliar with the concept, and in many cases this is true. Without careful planning and a dedicated team of developers and engineers, most enterprise IT departments would see their own internal chaos engineering initiatives fail. It’s also likely such an initiative would cost the parent organization revenue and damage its reputation in the process should the testing result in prolonged outages.
Additionally, simply pulling the plug on a VM or instance will only expose problems with a subset of the infrastructure responsible for providing your application to users. What if your network or storage is experiencing latency? What if a DNS misconfiguration leaves users unable to resolve, and therefore reach, your application to begin with? Netflix recognized this challenge and developed the entire Simian Army suite of tools on the heels of Chaos Monkey. But creating and maintaining such a large number of utilities is beyond the reach of all but the largest enterprises.
Enter Gremlin
Founded by Kolton Andrus and Matthew Fornaciari, Gremlin was born out of the experiences of engineers from the likes of Netflix and Amazon who were responsible for ensuring the reliability and availability of their employers’ services. Andrus spent significant time working in a reactive operations environment and was searching for a proactive model of testing to provide four nines (99.99%) of availability.
While working at Netflix and Amazon, Andrus felt there was a real need in the market for chaos engineering utilities and that everyone chasing the microservices application architecture would be a candidate for the Simian Army use case. The pain of being paged in the middle of the night to troubleshoot an outage is all too familiar to Andrus. So he and Fornaciari sought to make the same testing capabilities used by Internet giants available to the enterprise, and thus formed Gremlin.
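To put an availability target like four nines in perspective, it helps to convert it into a yearly downtime budget. The snippet below is plain arithmetic and assumes a 365-day year; the function name is illustrative.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

At 99.99%, the budget works out to roughly 52.56 minutes of downtime for the whole year, which leaves little room for a slow, reactive response.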
Chaos Engineering for the Masses
Gremlin’s aim is to provide a framework that allows the enterprise to “safely, securely, and easily simulate real outages with an ever-growing library of attacks.” The first step of proactive operations is to introduce infrastructure failures in the form of host-level outages. Gremlin provides the capability not only to shut down VMs but also to simulate congestion in resources such as CPU and RAM, as well as degraded performance of key services such as DNS.
The point is to let customers run small-scale experiments against a complex system, not only to understand how the system reacts and refine it accordingly, but also to identify issues within the organization’s people and processes. From there, a customer can remediate any technical issues found and address any procedural issues that may have prevented a timely recovery from an outage. Gremlin initially referred to this model as “failure as a service” and still has references to this phrase in its docs. It was felt, though, that this would be an off-putting phrase for decision makers at some of Gremlin’s target customers, so the company prefers to describe the product as providing “resiliency as a service.”
Whether a test reveals a bug or passes without issue, the next step after remediation (if necessary) is to scale your tests up and repeat. Starting small and scaling up instills discipline in development and operations teams and cultivates the ability to respond to outages in a timely and thorough manner. Should testing go awry, Gremlin’s UI includes a “Halt All” button, which allows teams to recover, reassess, and test at a later time.
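The start-small, scale-up, halt-if-needed loop described above can be sketched as follows. This is a hypothetical illustration and does not use Gremlin's API: `run_escalating_experiment`, the `error_rate_at` callback, and the threshold are assumptions made for the example, and the measurement is faked rather than taken from a live system.

```python
from typing import Callable, Iterable, List


def run_escalating_experiment(
    blast_radii: Iterable[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> List[int]:
    """Run stages from smallest to largest and halt at the first breach.

    Returns the stage sizes that completed within the error threshold.
    """
    completed: List[int] = []
    for hosts in sorted(blast_radii):
        observed = error_rate_at(hosts)  # hypothetical measurement hook
        if observed > max_error_rate:
            # Stand-in for a "halt all" action: stop before going any bigger.
            print(f"halting at {hosts} hosts: error rate {observed:.1%}")
            break
        completed.append(hosts)
    return completed


if __name__ == "__main__":
    # Fake measurement: pretend the error rate grows with the number of hosts hit.
    stages = run_escalating_experiment(
        blast_radii=[1, 5, 25],
        error_rate_at=lambda hosts: 0.001 * hosts,
        max_error_rate=0.02,
    )
    print(f"stages completed within threshold: {stages}")
```

With these made-up numbers, the one-host and five-host stages pass and the 25-host stage trips the threshold, so the team would remediate before attempting the larger test again.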
Ken’s Conclusion
While I was being briefed by Kolton, he made a point that resonated with me and truly drove home the value that Gremlin delivers. Imagine being able to introduce faults in an infrastructure as a training exercise. You could tell your new employees to expect an outage within a specified window and ask them to remediate it as a test of your existing systems, training, documentation, and procedures.
This immediately conjured memories of a job where I was added to the on-call rotation 3 weeks into my employment. Despite the team’s best efforts to train me as effectively as possible and build a reliable system for me to watch over, Murphy’s law struck. During my first week on call, I was awoken by outages for 4 straight evenings, which left me scrambling to fix things and wondering what I had gotten myself into.
It’s worth noting that subsequent on-call weeks were not the same. I could go an entire week receiving few to no alerts outside business hours. However, this incident came to mind when Kolton described using Gremlin as a training tool. I cannot help but think I would have been better prepared for the real thing if I had experienced the kind of simulations that Kolton described during my briefing.
The chaos engineering space is in the early phases of enterprise adoption and may seem too new and frightening, much as virtualization and cloud did before it. Gremlin is in a prime position to lead this space early in the emergence of the practice and to guide forward-thinking enterprises.