You’ve probably been bitten by bad code at some point or another in your life. Maybe it was a dodgy update to your mobile phone that erased your calendar. Or perhaps it was a patch for a game that you play that ruined it for you. Or, if you work in IT, it was any one of a number of patches or feature releases that you installed to fix a problem only to find out you now have three more issues to deal with.
Why does this keep happening? How is it that the culture of software creation today so often seems to cause more problems than it solves? The avenues for these issues are legion. Continuous development introduces bugs on the assumption that they can be fixed later, once someone has time to go bug hunting. Maybe the code has to ship by a certain date, either decided by you or by external forces that don’t respect the creative process. Perhaps there is something else at work that causes these issues to crop up. If the code you create causes issues, you can be sure your users and customers are going to start asking the same questions.
Concentrating on Completeness
Ken Duda is no stranger to code. As the CTO for Arista as well as their Senior Vice President for Software Engineering, he has spent a lot of time writing code as well as overseeing those who do. When your company operates a platform as large as Arista EOS, you have to wonder how easily bugs can develop in the code and how problems can be overlooked before they are shipped to production. After all, it happens to everyone, right?
Ken gave a great talk earlier this year at Networking Field Day 22 about this very topic. It’s a subject he has talked about several times, and he has a lot to say about it. Here’s a video of his latest discussion around the approach Arista takes to software development:
I can tell that Ken is passionate about what he does from the way he describes the things that have happened in the past. He’s the kind of engineer that remembers what happens when you let your code get out of hand. The description of what happens when someone forks a codebase in order to hit a deadline only for bugs to keep cropping back up? That has happened more times than anyone can count. In the exact way he describes. It’s a result of the mentality that you need to do code development in a bubble and rush the results before ensuring they’re ready to ship.
Acid Tests
One of the common refrains I hear when code is released with a litany of bugs is, “Did no one test this?” It seems to be the most-aired complaint. As if having a legion of people sitting around poking commands into the CLI or testing random pieces of the code is going to identify bugs that develop under a specific set of circumstances. Human testing, in and of itself, is only exposing the code to the randomness that is the human mind. We may be able to break things in new and different ways but we’re still throwing darts in the dark.
Arista, on the other hand, has a much more elegant solution. They have automated testing for every piece of code that gets checked in. It’s a brilliant feature when you think about it. Arista already has a list of all the bugs in EOS that they know about or have ever known about. Once those bugs are identified, they are entered into a database that is referenced for testing. Now, when someone writes code to fix a particular bug, they can mark that issue as done. But when they check in the code for that patch, the rest of the database is searched to ensure they didn’t reintroduce a previously known bug. Maybe it was an old race condition that needed a strange jump. Or perhaps someone streamlined something and introduced a privilege escalation issue. Whatever the case, the automated testing that Arista has in place will catch it and inform the developer of the problem before it hits production. It’s almost like it’s looking over your shoulder in the best possible way.
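To make the idea concrete, here is a minimal sketch of a check-in gate driven by a database of known bugs. It is not Arista’s tooling. The names `KnownBug`, `regressions`, `parse_vlan`, and the bug IDs are all hypothetical, invented purely for illustration. Each known bug carries a check that passes only when the bug is absent, and every check runs against whatever is being checked in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for "the code being checked in": named functions.
Build = Dict[str, Callable]


@dataclass(frozen=True)
class KnownBug:
    bug_id: str
    description: str
    check: Callable[[Build], bool]  # True means the bug is NOT present


def rejects(fn: Callable, value: str) -> bool:
    """Return True if fn raises ValueError for the given input."""
    try:
        fn(value)
    except ValueError:
        return True
    return False


# Illustrative "bug database": every bug ever seen keeps its check forever.
KNOWN_BUGS = [
    KnownBug("BUG-101", "VLAN 0 was accepted",
             lambda b: rejects(b["parse_vlan"], "0")),
    KnownBug("BUG-102", "padded input was mishandled",
             lambda b: b["parse_vlan"](" 10 ") == 10),
]


def regressions(build: Build, known_bugs: List[KnownBug]) -> List[str]:
    """Run every known-bug check and return the IDs that have reappeared."""
    failed = []
    for bug in known_bugs:
        try:
            ok = bug.check(build)
        except Exception:
            ok = False  # a crashing check counts as a regression
        if not ok:
            failed.append(bug.bug_id)
    return failed


def parse_vlan_good(text: str) -> int:
    vlan = int(text.strip())
    if not 1 <= vlan <= 4094:
        raise ValueError("VLAN out of range")
    return vlan


def parse_vlan_streamlined(text: str) -> int:
    # Someone "streamlined" the parser and dropped the range check.
    return int(text)


print(regressions({"parse_vlan": parse_vlan_good}, KNOWN_BUGS))
print(regressions({"parse_vlan": parse_vlan_streamlined}, KNOWN_BUGS))
```

The streamlined parser brings BUG-101 back even though nobody was working on that bug, which is exactly the kind of thing a human poking at the CLI is unlikely to stumble over.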
The other development feature that I like is how Arista has true maintenance trains of code. In a maintenance release, you’re looking for stability for a system. You’re not looking to run the bleeding edge or introduce new features that may or may not work properly. I often get flak from coworkers because I refer to feature release trains as “late beta.” But it really is true. Most of the time, I find more memory leaks and bugs there than anywhere else. But the maintenance releases should be free of those things, right? Unless perhaps the developers slipped in a feature that they think everyone should have. In which case the feature can cause bugs just from existing.
In Arista’s case, they use a model that I’ve frequently lauded. They have a maintenance code train that only fixes bugs. No new features added. It’s rock solid. If you need the features it offers, you can be on the latest release and be sure that as many of the bugs as possible have been fixed. If you need a new feature for some new aspect of what you’re doing, you won’t find it in there. Instead, you need to move to a different train with newly released features and code. It comes with a big warning that things are likely to break. It’s a good way to tell people that they’re essentially testing things. It sounds a lot like the “late beta” I described above, but the key difference to me is that Arista is warning you about what you’re getting into and containing that risk to a code train that is separate from the mainline release. If you don’t chase features, you end up with stable code.
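The two-train rule is simple enough to write down as a policy check. This is only a sketch of the idea, not how Arista enforces it. The `Change` type, the train names, and the `allowed_on` function are hypothetical. The assumption is that every change is labeled as a bug fix or a feature, and that a bug fix must point at a tracked bug.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    summary: str
    kind: str                      # "bugfix" or "feature" (assumed labels)
    bug_id: Optional[str] = None   # tracked bug this change fixes, if any


def allowed_on(train: str, change: Change) -> bool:
    """Decide whether a change may land on the given code train."""
    if train == "maintenance":
        # Maintenance train: bug fixes only, each tied to a known bug.
        return change.kind == "bugfix" and change.bug_id is not None
    if train == "feature":
        # Feature train: new features and fixes are both welcome.
        return change.kind in ("bugfix", "feature")
    raise ValueError(f"unknown train: {train}")


fix = Change("Restore VLAN range check", "bugfix", bug_id="BUG-101")
shiny = Change("Add new telemetry exporter", "feature")

print(allowed_on("maintenance", fix))    # True
print(allowed_on("maintenance", shiny))  # False
print(allowed_on("feature", shiny))      # True
```

The point of making the rule mechanical is that nobody can slip a favorite feature into the stable train just because they think everyone should have it.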
Bringing It All Together
Software isn’t perfect. Even if it’s written by a machine. It’s a complex interaction not unlike a cooking recipe. Every attempt produces a slightly different result. You can do your best to ensure that the results are predictable in both cases by being specific about what you need and removing the possibility of errors along the way. Arista has done a great job of expanding their development cycle to include things like automated testing and bug identification. They also practice proper code hygiene to ensure bugs can’t cross-contaminate from the feature release train into the maintenance train. That’s a lesson that any code cook would do well to learn.