It has always been a challenge to keep up to speed with changes in the networking industry, but over the last few years it feels like the pace of change has been accelerating at an alarming rate. As network architects, engineers and operators, we are being asked to design and run networks using technologies which weren’t even in the market until fairly recently.
Cisco’s Application Centric Infrastructure (ACI) is a good example of the dramatic change that has occurred. ACI is moving network configuration from the classic, imperative, distributed command line model that we all know so well, to a new declarative model. This paradigm shift is also known as “Intent-Based Networking”. These models are drastically different from one another, yet the vast majority of network engineers working on a Cisco ACI network have spent much of their time working in a legacy, pre-ACI environment. It’s important to consider how to evolve legacy networking skills so that network teams can confidently deliver – and troubleshoot – services using the powerful ACI model. As with all new technologies, expertise comes with time and practice. To further complicate matters, connectivity intent (i.e. security) is part of the ACI model, and since EPG contracts can be applied with different scopes, the final connectivity configuration within ACI can rapidly scale beyond the complexity that a normal person can manage in their head.
We should consider Cisco’s Network Assurance Engine (NAE) as a “safety net” or “verification suite” for organizations running Cisco’s ACI. NAE runs in the background, constantly analyzing the network and asking the question “Did you mean to do that?” It’s like having the network and security configurations reviewed by an expert engineer, consistently, exhaustively and tirelessly – every few minutes.
Did You Mean To Do That?
In some cases, the answer to the question “Did you mean to do that” is going to be “yes”, and that’s OK. There are times when the network has to be configured in a particular way. This is especially true when undertaking the migration of services from a legacy network to an ACI fabric, where the new and old networks are likely connected to one another.
Sometimes, however, the answer is going to be “no”. It’s nice to know when that’s the case, so you can proactively fix a problem before a service owner complains. More to the point, if a service owner does complain, how well can the network team troubleshoot the problem on a modern fabric compared to the skills they had developed on legacy networking hardware?
With the following situations, how would it be possible to proactively detect these issues and track down the source of the problem? Let’s take a look at a few examples.
Server Moved to Wrong Bridge Domain
If a server is configured with an incorrect bridge domain, it won’t be able to communicate with the network. It’s an easy mistake to make; it’s syntactically correct from the network perspective, so the controller will deploy the configuration as requested; but it isn’t what was intended. Through its interaction with both the controller and the network hardware, NAE is able to detect that a switch is receiving frames from an IP which is not applicable for the configured bridge-domain, and thus it can trigger an alert.
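To make the idea concrete, here is a minimal sketch of that kind of check. It is not how NAE works internally, and the data layout (a dict of bridge domain names to subnets, and a list of endpoint records) is an assumption invented purely for illustration. It simply flags any endpoint whose IP does not fall inside a subnet of the bridge domain it was learned in.

```python
import ipaddress

def find_misplaced_endpoints(bridge_domains, endpoints):
    """Return endpoints whose IP is outside every subnet of their bridge domain.

    bridge_domains: hypothetical dict of bridge domain name -> list of CIDR strings.
    endpoints: hypothetical list of dicts with "name", "ip" and "bd" keys.
    """
    misplaced = []
    for ep in endpoints:
        subnets = [ipaddress.ip_network(s) for s in bridge_domains[ep["bd"]]]
        ip = ipaddress.ip_address(ep["ip"])
        if not any(ip in net for net in subnets):
            misplaced.append(ep)
    return misplaced

# Illustrative data only; names and addresses are made up.
bridge_domains = {"bd-web": ["10.1.1.0/24"], "bd-db": ["10.1.2.0/24"]}
endpoints = [
    {"name": "web01", "ip": "10.1.1.10", "bd": "bd-web"},
    {"name": "db01", "ip": "10.1.2.20", "bd": "bd-web"},  # attached to the wrong BD
]

for ep in find_misplaced_endpoints(bridge_domains, endpoints):
    print(f"{ep['name']} ({ep['ip']}) does not belong in {ep['bd']}")
```

The point of the sketch is that the configuration is valid on its own terms; the mismatch only shows up when the configured intent is compared against what the switches actually see.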
Duplicate IP
Imagine that a VM has been cloned, but the IP address was not changed. Suddenly, there’s an IP address conflict on the network; two devices have the same IP. Which one is supposed to have that IP (at least, which one had it first), and how quickly would the issue be discovered in the first place? Because Cisco NAE takes snapshots of the network state every time it runs, it’s possible to see when and where the new, conflicting IP appeared, and immediately identify the endpoint which needs re-addressing.
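The value of snapshots here can be sketched in a few lines. The following is an illustration only, assuming each snapshot has been reduced to a simple MAC-to-IP mapping (a format made up for this example, not NAE's data model). Comparing the current snapshot with the previous one separates the endpoint that already held the address from the newcomer.

```python
def find_new_duplicates(previous, current):
    """Find IPs claimed by more than one MAC in the current snapshot.

    previous, current: hypothetical dicts of MAC address -> IP address.
    Returns a dict of ip -> {"existing": [...], "new": [...]}, where
    "existing" lists MACs that already held the IP in the previous snapshot.
    """
    macs_by_ip = {}
    for mac, ip in current.items():
        macs_by_ip.setdefault(ip, []).append(mac)

    conflicts = {}
    for ip, macs in macs_by_ip.items():
        if len(macs) > 1:
            conflicts[ip] = {
                "existing": sorted(m for m in macs if previous.get(m) == ip),
                "new": sorted(m for m in macs if previous.get(m) != ip),
            }
    return conflicts

# Illustrative snapshots; the cloned VM appears as cc:cc in the second one.
previous = {"aa:aa": "10.1.1.10", "bb:bb": "10.1.1.11"}
current = {"aa:aa": "10.1.1.10", "bb:bb": "10.1.1.11", "cc:cc": "10.1.1.10"}

print(find_new_duplicates(previous, current))
```

With only a single point-in-time view, both endpoints look equally guilty; it is the earlier snapshot that identifies which one needs re-addressing.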
Something’s Broken But “Nobody Changed Anything”
How often does a problem arise when apparently nobody made any changes? NAE’s snapshots provide a great historical record of how the network has changed over time, and show whether a change was made which broke the connectivity requested in contracts. Such a change would also generate warnings asking if you meant to do that.
Shadowed Policies
As a real world example, a large European IT provider with a fast-growing cloud business had a mixed environment of switches with large and small TCAMs. Yet despite the upgrades, they were still maxing out the TCAM every quarter. Cisco installed NAE to inspect the network. NAE’s policy visualization tools and TCAM usage statistics clearly showed there were huge numbers of contracts which were generating ‘shadowed’ security policies – namely policy entries that would never be hit because an earlier policy would already have matched the traffic flow. With that information in hand, the provider was able to review and restructure the security contracts to fix the problem.
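For anyone unfamiliar with shadowing, a simplified sketch may help. It assumes a first-match rule list with a made-up Rule structure (source, destination and a port range), which is far simpler than real ACI contracts and is not NAE's algorithm. It also only catches a rule that is fully covered by one single earlier rule, not by a combination of several.

```python
from dataclasses import dataclass

ANY = "any"

@dataclass(frozen=True)
class Rule:
    # Hypothetical, simplified rule; real contract entries carry many more fields.
    name: str
    src: str
    dst: str
    port_lo: int
    port_hi: int

def covers(earlier, later):
    """True if every flow matched by `later` is already matched by `earlier`."""
    return (
        earlier.src in (ANY, later.src)
        and earlier.dst in (ANY, later.dst)
        and earlier.port_lo <= later.port_lo
        and later.port_hi <= earlier.port_hi
    )

def find_shadowed(rules):
    """Return (shadowed_rule, shadowing_rule) name pairs for a first-match list."""
    shadowed = []
    for i, later in enumerate(rules):
        for earlier in rules[:i]:
            if covers(earlier, later):
                shadowed.append((later.name, earlier.name))
                break
    return shadowed

rules = [
    Rule("allow-web-any", "web", ANY, 1, 65535),
    Rule("allow-web-db-sql", "web", "db", 3306, 3306),  # never hit
    Rule("allow-app-db-sql", "app", "db", 3306, 3306),
]

print(find_shadowed(rules))
```

Each shadowed entry still consumes hardware resources when programmed, which is why clearing them out can relieve TCAM pressure without changing what traffic is allowed.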
Human errors have been a reality of all networks. With software-defined / intent-based networks they have moved to a higher level of abstraction, different from the traditional networking most of us have learned. It’s not surprising that mistakes are going to happen, and with potentially wider reaching impact given the automation power. Even without the network in the picture, server owners can always throw a spanner in the works with some bad network configuration, or new (but bad) dynamic state may get learned from outside the fabric boundary, say from the branch network. What NAE brings to the table is the ability to detect these issues – including some network problems introduced by servers – quickly.
Roll It Out From The Start
The more I have learned about NAE, the more I’ve felt that it should be installed at the same time as the ACI network and run from Day 0. Installation of NAE on the required three VMs is straightforward and requires very little time (Cisco claims less than 30 minutes, plus maybe 10 minutes for clustering). Having a safety net made of the collective knowledge of thousands of problems being checked regularly on the network seems like an obvious thing to add, especially if this is the first Cisco ACI installation in an organization. It can boost the network administrators’ confidence in the new fabric architecture and let them learn the paradigm much faster. Ironically, the first thing they’ll be asked to do is to migrate all the existing networks on to the ACI fabric – which is always a higher-risk task than building a nice clean green-field network – yet they have to do it with minimal experience actually running a fabric! Who wouldn’t want the warm fuzzies of an advanced verification engine that identifies potential issues and says, “Did you mean to do that?”
Further Enhancements
Currently, NAE runs constantly and delivers periodic network analyses so that problems can be identified within a short period of time. It’s fairly easy to toggle between each snapshot (or “epoch” as they are known within NAE) and see the number of alerts changing, and drill down to see what’s new. However, a handy feature coming out shortly is the ability to look at the delta (i.e. the difference) between two epochs, so it’s clear what new problems arose between those two points in time, and therefore more easily identify newly-introduced problems.
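Conceptually, an epoch delta is just a set comparison. The sketch below is an assumption-laden illustration, with each epoch reduced to a list of (alert code, affected object) tuples invented for this example; it says nothing about how NAE will actually implement or present the feature.

```python
def epoch_delta(earlier, later):
    """Compare the alerts of two epochs.

    earlier, later: hypothetical iterables of (alert_code, affected_object) tuples.
    Returns alerts that are new, resolved, or unchanged between the two epochs.
    """
    earlier_set, later_set = set(earlier), set(later)
    return {
        "new": sorted(later_set - earlier_set),
        "resolved": sorted(earlier_set - later_set),
        "unchanged": sorted(earlier_set & later_set),
    }

# Illustrative alerts only; the codes and object names are made up.
epoch_1 = [("DUPLICATE_IP", "10.1.1.10"), ("SHADOWED_POLICY", "allow-web-db-sql")]
epoch_2 = [("SHADOWED_POLICY", "allow-web-db-sql"), ("WRONG_BD", "db01")]

delta = epoch_delta(epoch_1, epoch_2)
print("New:", delta["new"])
print("Resolved:", delta["resolved"])
```

Seeing only the "new" bucket is what makes it practical to tie a fresh problem to the change window in which it appeared.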
Another potential feature – and this one really excites me – is a “sandbox” which would allow changes to be evaluated before they are applied to the main network, so that NAE can predict what problems would occur were the change to be applied in production. This is not a trivial task, but in terms of increasing network agility while maintaining (and even improving) uptime, this feature is huge, in my opinion; it’s like having a lab without having to buy a lab. Even better, the sandboxed environment would exactly match the production network, so change modeling would have a high chance of accurate results.
Conclusions
Cisco’s Network Assurance Engine is an obvious and essential companion to Cisco ACI, providing insights and historical data that just aren’t possible to obtain any other way. As new features are rolled out, NAE may dramatically alter how we, and businesses, view the risky world of change management.