Can you remember the last time you really tested your network for failures? I’m not talking about running through your disaster recovery plan and checking the boxes to acknowledge that you read it this year. I’m talking about shutting off switches or routers and seeing what happens.
Odds are good that you haven’t tested the network that way in a long time. It’s easy to underestimate just how important it is to do a real network resilience test every once in a while. Yes, that means you have to shut the power off to make sure the generators come online and the batteries in your UPS systems are working.
Terry Slattery is a veteran of many network outages, and he’s forgotten more about fixing broken networks than many of us are likely to encounter in our lifetimes. So when he tells you that you need to do a full-scale shutdown test of your network resilience plans, he’s not kidding. Terry goes into why it’s necessary to do it for real now and then instead of simulating it. Here’s a great sample of his sage-like wisdom from his post:
There will inevitably come a time when a failure test doesn’t go as planned, typically because some unforeseen part of the infrastructure also fails, or a dependency isn’t well understood. For example, you may think that your DNS server infrastructure is fully redundant, but for some reason it isn’t. Everything works as long as the primary data center is operational. But when you disconnect it to simulate a failure, the secondary DNS server is found lacking. Maybe it experienced an undetected failure or perhaps it isn’t able to handle the full production load. There may also be problems with applications that are suddenly split-brain, where both are functioning, but not communicating with each other.
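To make the DNS example concrete, here is a minimal Python sketch of a paper check you could run before a real shutdown test: for each site, it asks whether the healthy DNS servers left standing could carry peak load if that site disappeared. The server names, site labels, and query-rate numbers are all made up for illustration, and the `DnsServer` type and `sites_that_cannot_fail` function are hypothetical helpers, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsServer:
    """Hypothetical inventory record for one DNS server."""
    name: str
    site: str
    capacity_qps: int      # assumed sustainable queries per second
    healthy: bool = True   # False models an undetected failure


def sites_that_cannot_fail(servers, peak_qps):
    """Return the sites whose loss leaves too little healthy DNS capacity."""
    at_risk = []
    for site in sorted({s.site for s in servers}):
        # Capacity that survives when every server at `site` is gone.
        surviving = sum(
            s.capacity_qps for s in servers if s.site != site and s.healthy
        )
        if surviving < peak_qps:
            at_risk.append(site)
    return at_risk


# Illustrative numbers only: the secondary is smaller than the primary.
servers = [
    DnsServer("dns-a", "primary", 20000),
    DnsServer("dns-b", "secondary", 8000),
]

print(sites_that_cannot_fail(servers, peak_qps=12000))
```

With these invented numbers the check flags the primary site, because the secondary alone cannot carry the assumed peak. A check like this only catches the problems you already know how to model, which is exactly why it complements a live failure test rather than replacing one.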
Make sure you read more of Terry’s great blog here: Network Stability Through Resilience Engineering