All networks, regardless of size or type, have one thing in common: the need to change. However, change is inherently risky. Introducing a new configuration element to a stable network could produce new instabilities. Anticipating the potential side effects of any given change, especially in a complex network, can be extremely challenging.
Programmers face similar challenges with large, complex codebases developed by multiple people over a period of time, and the need to keep patching, refining and enhancing the code without breaking its existing functionality. Over time, programmers have created toolchains to enable their development workflow, using elements such as functional testing, unit testing, and build testing; network engineers have not.
Programmers and network engineers both need to follow consistent processes to minimize the impact of changes, and this post examines the need for good verification as a pillar of change control.
Verify Before the Change
Ideally, any potential problems associated with a network change would be identified before the change takes place. This could take a number of forms, all of which can help reduce the likelihood of a problem occurring.
Syntax Verification
If configurations are issued using a CLI, verifying that the planned change is syntactically correct is the minimum level of validation that should be performed. This may be achieved by using a lab or virtual device, for example. For a programmer, syntax validation is often accomplished within their integrated development environment (IDE) or identified when compiling or executing the code.
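As an illustration, here is a minimal sketch in Python, using the Netmiko library, of pushing a candidate change to a lab or virtual device and scanning the echoed output for rejection markers. The device address, credentials and the candidate commands are illustrative placeholders, not values from any real environment.

from netmiko import ConnectHandler  # assumes the Netmiko library is installed

# Candidate change and lab device details are illustrative placeholders only.
candidate_change = [
    "ip prefix-list PFX-EXAMPLE seq 10 permit 203.0.113.0/24",
]

lab_device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",      # lab/virtual device, never production
    "username": "labuser",
    "password": "labpass",
}

conn = ConnectHandler(**lab_device)
output = conn.send_config_set(candidate_change)
conn.disconnect()

# IOS echoes a line beginning with '%' when it rejects a command.
errors = [line for line in output.splitlines() if line.lstrip().startswith("%")]
if errors:
    print("Syntax check failed:")
    print("\n".join(errors))
else:
    print("Candidate change accepted by the lab device.")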
Lab Testing
In an ideal world, every change would be tested in a lab environment prior to deployment in production. To identify potential issues most accurately, the lab environment must exactly mirror the production network. That latter requirement is where most lab testing falls down, because simulating the production environment simply isn't practical; partly in terms of equipment, but mainly because it's difficult to mirror the dynamic aspects of the production network environment. However, lab testing is still useful to ensure that, to the greatest extent possible, a proposed change will not have a negative impact.
Programmers, similarly, may use target hardware, virtual machines or containers to validate that the software compiles and executes in an environment known to match production as closely as possible.
Peer Review
Peer review of changes is less valuable than it might appear. Unless the peer reviewer duplicates the research and work undertaken by the original author to determine a change (which is unlikely), peer review is little more than the application of the reviewer's experience and the level of trust they have in the author. One way to improve the value of peer review is to include details of the background research as part of the change document so that the reviewer can reference it as needed. Where peer review often triumphs is in identifying procedural issues, especially problems that have occurred before, or issues specific to an environment.
Programmers frequently use a formal process to have code changes reviewed and approved before being accepted into the code repository for further testing.
Controller-Driven Changes
Controllers and orchestrators are typically effective at obscuring change implementation details from the user, which is precisely why they're used. Viewed through the lens of change management, though, this creates a problem: we know what has been requested, but not how it will be implemented, because there's a degree of opacity about the fine details. It is necessary to trust the controller to do the right thing, yet we have all learned over time that computers are fundamentally dumb and will often do whatever is asked of them, no matter how stupid. It is also worth noting that trying to peer-review a description of which buttons will be clicked in a UI and what the user will enter is tremendously ineffective.
Analyzing the risk of a controller-driven change prior to execution therefore tends to be a self-limiting exercise; the only thing to do is to execute the change and then verify the state of the network after the change has been completed.
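By way of illustration, a sketch of that "execute, then verify" approach might capture a fault or health snapshot from the controller before the change, apply the change, and compare snapshots afterwards. The REST endpoint, token and response structure below are hypothetical placeholders, not any specific controller's real API.

import requests  # assumes the requests library is installed

CONTROLLER = "https://controller.example.net"      # hypothetical controller address
HEADERS = {"Authorization": "Bearer LAB-TOKEN"}    # placeholder authentication

def fault_snapshot() -> dict:
    """Return a mapping of fault id -> severity from the (hypothetical) controller API."""
    resp = requests.get(f"{CONTROLLER}/api/faults", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return {fault["id"]: fault["severity"] for fault in resp.json()["faults"]}

before = fault_snapshot()
# ... the controller-driven change is executed here, via UI or API ...
after = fault_snapshot()

new_faults = sorted(set(after) - set(before))
if new_faults:
    print(f"{len(new_faults)} new fault(s) raised since the change: {new_faults}")
else:
    print("No new faults raised after the change.")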
Verify After the Change
Checking results after the change seems obvious, yet it is an area that is lacking in many environments. It's all very well spending hours validating a change prior to execution, but without verifying afterwards that the change was successful, it's likely that unforeseen impacts will not be caught in a timely fashion.
Automated Changes vs Human Operators
Using automation to deploy changes is potentially one way to avoid errors, but as has been so correctly said, where a human can make a mistake, automation allows us to make the same mistake hundreds of times over, and many times faster.
Both human and automated changes must be coupled with a test plan to evaluate the change's impact. Post-change testing should also check whether the change had an impact anywhere else in the network, which can be a large task if done thoroughly. For example, changing a route filter to block a route can easily be validated by checking whether the route is still being installed in the forwarding table after the change; but if that route was being used as the source of a conditional route or an aggregate route somewhere else in the network, how quickly would that be identified and the finger pointed back to the route filter as the root cause?
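A post-change check along those lines might look like the sketch below, which asks both the router that was changed and a downstream router whether the filtered prefix is still installed. The prefix, device names, addresses and credentials are illustrative, and Netmiko is assumed to be available.

from netmiko import ConnectHandler  # assumes the Netmiko library is installed

PREFIX = "203.0.113.0"   # the route the new filter is intended to block (illustrative)
DEVICES = {
    "edge-router": "192.0.2.1",   # where the route filter was changed
    "core-router": "192.0.2.2",   # downstream device that depended on the route
}

def route_present(host: str, prefix: str) -> bool:
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="netops", password="secret")
    output = conn.send_command(f"show ip route {prefix}")
    conn.disconnect()
    # IOS prints "Routing entry for ..." when the route exists,
    # or "% Network not in table" (or similar) when it does not.
    return "Routing entry for" in output

for name, host in DEVICES.items():
    state = "still present" if route_present(host, PREFIX) else "absent"
    print(f"{name}: {PREFIX} is {state}")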
Syntactically Correct versus Semantically Correct
It is possible, indeed common, to have a syntactically perfect change that executes perfectly yet does not achieve the desired intent. The route filter change is a perfect example of this, and unfortunately many pre-change verifications are prone to confirm the accuracy of the syntax but are not able to confirm the real effect of a change on the network. Perhaps the network engineer who submitted the route filter change was aware of the downstream impact and it was a desired outcome; if so, though, the post-change checks should have explicitly verified the downstream route tables to confirm the desired impact.
What Is the Intent of the Change?
Intent is one area where having a centralized system built on statements of intent, rather than specific configuration stanzas and security policies, makes a huge difference. If a controller has a routing policy which allows subnet A to communicate with subnet B, but a route filter change means that subnet A no longer has a valid route to subnet B, it is possible not only to identify the post-change network state (a route was lost), but also to evaluate it against the business intent and identify that the change broke connectivity which was supposed to exist.
This kind of deep post-change verification goes far beyond what most people can mentally process, and is an ideal candidate for some automated analysis.
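As a rough illustration of what such automated analysis might check, the sketch below evaluates a list of intent statements ("subnet A must reach subnet B") against a snapshot of installed routes. The gather_routes() helper and all of the prefixes are hypothetical stand-ins for data that a controller or assurance tool would provide.

import ipaddress

# Business intent: (source subnet, destination subnet) pairs that must communicate.
INTENTS = [
    ("10.1.0.0/24", "10.2.0.0/24"),
]

def gather_routes(subnet: str) -> list[str]:
    """Placeholder: prefixes installed on the gateway serving `subnet` (post-change snapshot)."""
    # Models the state after a route filter change dropped the route to 10.2.0.0/24.
    return ["10.1.0.0/24", "10.3.0.0/24"]

def intent_satisfied(src: str, dst: str) -> bool:
    dst_net = ipaddress.ip_network(dst)
    # Reachability, for this sketch, simply means some installed route covers the destination.
    return any(dst_net.subnet_of(ipaddress.ip_network(p)) for p in gather_routes(src))

for src, dst in INTENTS:
    verdict = "satisfied" if intent_satisfied(src, dst) else "BROKEN by the change"
    print(f"Intent {src} -> {dst}: {verdict}")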
Change Rollback
Once it has been determined that a change did not accomplish the intended goal or broke something else, it will typically be rolled back. How good is the post-rollback test plan in a typical change script? Is it possible to be sure whether the network has been returned to its previous state? By way of an example, here is a real change which was issued on a Cisco router running IOS:
router bgp 65500
 redistribute ospf subnets route-map OSPF_TO_BGP
It was decided that the change didn’t do what was desired, so the change was rolled back:
router bgp 65500
 no redistribute ospf subnets route-map OSPF_TO_BGP
Unfortunately, the post-rollback testing was lacking, so it was not noticed until months later that the rollback did not actually restore the original configuration, and instead left it looking like this:
router bgp 65500
 redistribute ospf subnets
Had there been a better comparison of the pre-change, post-change and post-rollback states, this significant error could have been identified and remediated quickly. In fact, the ability to examine the network state over time would be useful in general, both to identify when changes happened in the past and to track down their causes. It's probably not reasonable to expect users to trigger this kind of pre/post-change verification themselves, so a system which can continually check the network status would be a great help.
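A simple comparison of configuration snapshots is one way to catch this kind of residue. The sketch below, assuming snapshots were saved to files before the change and after the rollback, diffs the two and flags anything left behind; the file names are illustrative.

import difflib
from pathlib import Path

# Snapshots captured (e.g. via 'show running-config') before the change and after the rollback.
pre_change    = Path("pre_change.cfg").read_text().splitlines()
post_rollback = Path("post_rollback.cfg").read_text().splitlines()

diff = list(difflib.unified_diff(pre_change, post_rollback,
                                 fromfile="pre-change", tofile="post-rollback",
                                 lineterm=""))
if diff:
    print("Rollback did NOT restore the previous configuration:")
    print("\n".join(diff))
else:
    print("Post-rollback configuration matches the pre-change snapshot.")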
Cisco Network Assurance Engine
Change verification is one of the things Cisco looked at when developing its Network Assurance Engine (NAE) product. NAE reads the controller policies (in the first release, Cisco’s APIC) and connects to all the network devices to read configuration and dynamic state data. With the intent and the current network state in hand, NAE can build a mathematical model of the network against which it can evaluate the controller policies, i.e. the statements of intent, and determine whether there are issues in the network.
In addition to the modeling component, Cisco has used its experience with support and troubleshooting to identify thousands of failure scenarios which can be run against the model every time it updates (every 5-15 minutes), meaning that even well-hidden issues, such as overlapping IP routes or missing elements in the controller's policy configuration, can be identified.
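As a trivial illustration of one such class of check (not how NAE itself implements it), overlapping prefixes in a set of routes or policy entries can be detected with a pairwise comparison; the prefix list below is made up.

import ipaddress
from itertools import combinations

prefixes = ["10.0.0.0/8", "10.1.0.0/16", "192.168.10.0/24"]   # illustrative

for a, b in combinations(prefixes, 2):
    if ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b)):
        print(f"Overlap detected: {a} and {b}")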
Finally, each time the network and controller state are updated, the resulting network state is recorded, so NAE can act almost like a network DVR, able to display the state from any point in the past.
I will dive into more detail on Cisco's Network Assurance Engine in a separate post, but there's an important point here: effective change control requires effective change verification, verification that extends beyond the device being changed and looks at changes both syntactically and semantically. The potential outcomes of a change can be so subtle in some cases that the only way they'll be identified is by using a tool which can see the changes and compare them against the stated intent to determine when promises are being broken.
Network Assurance Engine seems to be a great first step in that direction, providing a toolset that allows verification at multiple stages of the change process outlined above, massively reducing the risk of changes. While the first release is focused on Cisco ACI-based networks, NAE already has ecosystem partners like F5 and Citrix NetScaler, whose devices have implicit connectivity requirements which must also be fulfilled by the underlying network.
Wouldn’t it be nice to know within 15 minutes that a change did what it was supposed to and didn’t break something in the process? I think so, and in my next post I’ll look more at how NAE itself works.