What do we want? 100% uptime! When do we want it? Well, any time soon would be nice. The problem is that networks are not static; something is always changing, and as I discussed in my previous post (“Verify, or die trying: Observations on change management”), change management can be tricky at the best of times. Cisco’s Network Assurance Engine (NAE) aims to ensure that even in a controller-driven network, the infrastructure is doing what you intended it to do.
“Even in a controller-driven network?” Well, yes. Controllers are supposed to be the answer to all our problems; define intent centrally and watch policies cascade to all the network devices without any further effort. It is reasonable to assume that any network controller worth its salt will make the changes it was asked to, so it sounds rather redundant to have a system which checks that the network is configured properly. If that were the sole purpose of NAE, one might question why such functionality was not in the controller already. Controllers, however, are still given instructions by humans, and humans make mistakes all too easily. Controllers make our lives much easier, but it’s still possible to miss configuration steps or specify inconsistent requirements, so that the network does not deliver the intended functionality.
Imagine a situation where routes are being leaked between two VRFs, resulting in partially overlapping subnets. The result of this may be seemingly intermittent connectivity to certain hosts. That’s something that NAE can easily catch. There’s nothing wrong with what was configured by the controller, but NAE’s monitoring of actual system state every 15 minutes (configurable) means that it can see discrepancies and identify that there are issues with connectivity.
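To make the overlapping-subnet problem concrete, here is a minimal sketch in Python of the kind of check involved. It is not how NAE works internally; the `find_overlaps` function, the VRF names, and the prefixes are all made up for illustration, and it only uses the standard library's `ipaddress` module.

```python
import ipaddress
from itertools import combinations


def find_overlaps(routes):
    """Return pairs of prefixes from different VRFs that overlap.

    `routes` is a list of (vrf_name, prefix) tuples. This is a hypothetical
    data shape chosen for the example, not an NAE export format.
    """
    parsed = [(vrf, ipaddress.ip_network(prefix)) for vrf, prefix in routes]
    conflicts = []
    for (vrf_a, net_a), (vrf_b, net_b) in combinations(parsed, 2):
        # Only prefixes owned by different VRFs are a leak concern here.
        if vrf_a != vrf_b and net_a.overlaps(net_b):
            conflicts.append((vrf_a, str(net_a), vrf_b, str(net_b)))
    return conflicts


# Illustrative data: the "dev" /20 sits inside the "prod" /16.
routes = [
    ("prod", "10.1.0.0/16"),
    ("dev", "10.1.128.0/20"),
    ("dev", "192.168.5.0/24"),
]

for conflict in find_overlaps(routes):
    print("overlap:", conflict)
```

A partial overlap like this is exactly what produces "it works for some hosts but not others" symptoms, because only the addresses inside the more specific prefix are affected.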
What if a new policy is defined allowing network A to communicate with network B using SSH, but earlier in the policy definitions, SSH to network B was blocked from all sources? Cisco’s NAE can detect that the policy evaluation would never reach the later permit policy; not based on hit count (although that is also examined), but rather using a mathematical analysis of the policies.
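The SSH example can be sketched in a few lines of Python. This is a deliberately simplified model that I am assuming for illustration: rules are evaluated first-match, each rule has a single source, destination, and port, and a rule counts as shadowed only when one earlier rule covers it completely. The `Rule` class and the `covers` and `find_shadowed` functions are hypothetical names, and NAE's actual analysis is certainly more sophisticated than this.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    action: str           # "permit" or "deny"
    src: str              # CIDR; "0.0.0.0/0" means any source
    dst: str              # CIDR
    port: Optional[int]   # None means any port


def covers(earlier: Rule, later: Rule) -> bool:
    """True if every packet matching `later` also matches `earlier`."""
    src_ok = ipaddress.ip_network(later.src).subnet_of(ipaddress.ip_network(earlier.src))
    dst_ok = ipaddress.ip_network(later.dst).subnet_of(ipaddress.ip_network(earlier.dst))
    port_ok = earlier.port is None or earlier.port == later.port
    return src_ok and dst_ok and port_ok


def find_shadowed(rules):
    """Return (shadowed_rule, shadowing_rule) name pairs under first-match evaluation."""
    shadowed = []
    for i, later in enumerate(rules):
        for earlier in rules[:i]:
            if covers(earlier, later):
                shadowed.append((later.name, earlier.name))
                break  # the first covering rule is the one that wins
    return shadowed


# Illustrative policy: SSH to network B is denied before the new permit is reached.
rules = [
    Rule("deny-ssh-to-B", "deny", "0.0.0.0/0", "10.2.0.0/16", 22),
    Rule("permit-A-to-B-ssh", "permit", "10.1.0.0/16", "10.2.0.0/16", 22),
    Rule("permit-A-to-B-https", "permit", "10.1.0.0/16", "10.2.0.0/16", 443),
]

for later_name, earlier_name in find_shadowed(rules):
    print(f"{later_name} can never match: shadowed by {earlier_name}")
```

Note that nothing here depends on traffic or hit counts. The conclusion comes purely from comparing the rule definitions, which is the point of analyzing policy rather than just watching counters.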
Why do we have change management procedures? In my experience, the aim is to:
- Minimize exposure to outages by scheduling changes at times with the least potential impact
- Reduce outages by vetting changes
- Give visibility of the changes across multiple teams
- Coordinate changes to avoid simultaneously conflicting changes
- Provide an audit trail of activities
- Clean up stale or unused policies to prevent drift, so that they remain easy to understand, especially during those dark 2am troubleshooting calls
Some companies schedule all changes during maintenance windows – usually overnight, or otherwise outside business hours. This is a rather curious decision, since limited maintenance windows lead to many changes being jammed into the allocated windows – conflicting with the fourth goal above, avoiding simultaneous conflicting changes. Engineers working during overnight windows are often sleepy and therefore prone to making mistakes. If something goes wrong, the change is being executed when there is the least support available, both internally and externally – possibly increasing the time to repair in the event of an incident.
I understand the logic of non-business-hour maintenance windows. The intent usually is that if the change goes wrong it cannot interfere with critical business operations. To me this also says that the business does not believe that changes can be performed without unacceptable risk exposure. This is, in an abstract sense, true: no change is risk free. But when assessing the likely risk exposure resulting from a change incident, one typically evaluates the product of the impact of something happening (often measured in financial rather than technical terms) and the likelihood of it actually happening.
For example, most people would consider configuring an access port VLAN as a low risk activity, but it’s possible, in theory, that a bug in the switch software could mean that applying the configuration would reboot the switch, which would have a high impact. Experience tells us that the likelihood of hitting such a bug is extremely small, so many companies will allow an access port VLAN change to be executed during business hours. A basic back-of-the-envelope calculation might look something like this:
high impact × extremely low likelihood = low risk exposure
Each business must determine its own risk appetite, i.e. the level of risk exposure the business considers acceptable, and what low risk really means. The more changes that are assessed as being low risk (by minimizing either the impact or the likelihood of an event), the more changes that can be executed during regular hours, and the more agile the business becomes. This is where network modeling offers a big win.
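As a rough sketch of that calculation, assuming impact is expressed as a cost and likelihood as a probability between 0 and 1, the comparison against a risk appetite might look like this in Python. Every number below is invented purely for illustration, and the function names are my own.

```python
def risk_exposure(impact_cost: float, likelihood: float) -> float:
    """Risk exposure as the product of impact and likelihood."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability between 0 and 1")
    return impact_cost * likelihood


def allowed_in_business_hours(impact_cost: float, likelihood: float,
                              risk_appetite: float) -> bool:
    """True if the exposure is within what the business will accept."""
    return risk_exposure(impact_cost, likelihood) <= risk_appetite


# Hypothetical figures: a switch reboot costs a lot, but is very unlikely.
RISK_APPETITE = 500.0  # assumed threshold, in the same units as impact

vlan_change = allowed_in_business_hours(200_000, 0.0001, RISK_APPETITE)
core_upgrade = allowed_in_business_hours(200_000, 0.05, RISK_APPETITE)

print("access port VLAN change during business hours:", vlan_change)
print("core software upgrade during business hours:", core_upgrade)
```

Both changes have the same potential impact; only the likelihood differs. Anything that credibly lowers either factor moves more changes under the threshold.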
Let’s take each of the factors in that risk equation.
- Reducing the likelihood of outage: much like in software development, we need to create a process with multiple verification stages that systematically reduce the likelihood of errors making their way to production. With Cisco NAE running in a pre-production lab environment, changes can be tested and NAE can identify the majority of policy misconfigurations. In the near future it may also be possible to model the changes in NAE itself; a faster iteration model, but one that may miss some of the dynamic state information available in a lab environment. Combined, this multi-step process could drastically reduce the likelihood of a misconfiguration making its way into the actual production environment.
- Reducing the impact of erroneous changes: With thousands of failure scenarios built in right out of the box, NAE provides an extensive verification suite. By running this suite against your production fabric in your maintenance window, you can quickly identify adverse impacts of your changes and correct them, meaningfully reducing the impact of errors that got through your change approval process. This beats, by orders of magnitude, simply running spot checks for connectivity and hoping the application owners won’t call tomorrow because something isn’t working.
Predicting the Future
Where verification and mathematical modeling win over pure monitoring of current state is in the ability to analyze policies and contracts in a way that a human never could, significantly reducing the overall risk of making changes by detecting issues quickly. While NAE can’t yet predict how a change would impact the network, it’s able to detect subtle policy issues which might otherwise go undiscovered for a long time. For example:
|Planned Change|Successful Outcome?|
|---|---|
|New policy: Deny all ICMP Echo between two tenants|NAE generates a warning that the new policy shadows later policies which explicitly permit ICMP between certain EPGs|
|New policy: Allow Finance App servers to talk to HR Database|NAE highlights that the return route to Finance Apps is not visible from the HR Database|
|Create new subnet / VLAN|NAE identifies that the subnet already exists in routing tables, leaked from elsewhere in the network, and is now in conflict|
Shadowed policies in particular are very common, because the quantity and complexity of policies typically configured make it difficult to see whether a new policy is needed. As a result, controllers are almost always filled with old policies that everybody is scared to delete. With this kind of analysis available, deleting a redundant policy, for example, could be executed in the certain knowledge that another, retained policy will provide the needed connectivity. Cisco NAE proactively analyzes both the intent and the network state and ensures that the intended connectivity exists. This helps avoid potential network outages, hopefully leading to improved uptime and better reported SLAs. Issues found result in SmartEvents, which are diagnoses of what’s wrong, what is affected, and how to resolve the issue. Additional integrations with platforms like Splunk and ServiceNow can provide direct alerting as well.
At one company I worked for, I instituted a list of “Golden Rules”. The document comprised a list of maybe 25 problems we had experienced before and did not wish to experience again; usually because a particular action had caused an outage. I’d say it was a list of stupid things not to do. But some of the things on the list were not intuitively bad in any way, and new engineers on our team would typically fall into those traps – unless we guided them. To me, preemptively flagging bad configurations seemed like an obvious thing to do, and was of benefit to the uptime of the network.
Similarly, though dwarfing my own efforts, Cisco NAE has a built-in library of over 5,000 common network issues, all of which get evaluated against the current network state and controller data. It’s kind of embarrassing to have a connectivity issue, open a case with TAC and find out that the problem has occurred because you forgot to configure something. This is precisely the kind of thing NAE will catch and spare you both the call and the downtime.
A Change for The Better
Change management is the bane of most network engineers, and anything that helps increase confidence in the change process and minimize downtime is extremely valuable. Lower outage rates on changes could also lead to a more flexible change policy, reducing out-of-hours requirements – which is a benefit all round. Cisco NAE is a key component in the change management process, and as additional devices and technologies are supported, the value of NAE will continue to grow.