Cisco Network Assurance Engine: From Download to Value in 60 Minutes (or less)

This is post 6 of 7 in the series “Cisco Network Assurance Engine”

A Tech Talk series brought to you by Cisco, looking at their Network Assurance Engine.

Verify, Or Die Trying: Observations on Change Management
Assure Network Security Policy and Compliance in the Data Center with Cisco Network Assurance Engine
Change Doesn’t Have To Be a Four Letter Word
Configuration and Hardware Assurance in the Datacenter with Cisco Network Assurance Engine
Hands On with Cisco Network Assurance Engine
Cisco Network Assurance Engine: From Download to Value in 60 Minutes (or less)
Networking Has Changed, Have You?

It is well known that product announcements focus heavily on shiny new features and operational or business benefits. Sometimes a smart marketing move is to create cool PowerPoint slides with references to nerd culture; after all, we all love superheroes and sci-fi movies.

Experienced Network engineers wear sunglasses during presentations to avoid being dazzled by all the shininess and to focus on the facts.

The first questions they ask are: How long does it take to extract value from the product? How long does it take from unboxing to getting all the pieces working and enjoying the results? Will the opex required to manage the new tool kill my productivity?

Let’s wear sunglasses too and start our journey with Cisco Network Assurance Engine (NAE), from zero to actual value.

Plan

A piece of good advice I try to follow is “plan your work, work your plan”.

The plan here is to have NAE up and running to start collecting data from the Fabric as soon as possible.

Under the hood, NAE is deployed as a cluster of three virtual machines. A single GUI manages all three VMs.

There are two models of the Cisco NAE available at this point: Small and Medium. I suspect we can expect more models in future to satisfy different needs. The data sheet provides information regarding the system requirements for the various appliance models.

The appliance size must accommodate the number of switches in the fabric (only leaf nodes count, actually). The storage determines the retention time: the longer you want to retain your data for historical purposes, the larger the on-board disks must be. Customers can start with less storage and add capacity as needed to increase retention time.

The VMs will need reachability of the spine and leaf nodes of the fabric through the out of band management network, in order to collect information required to build the mathematical models that NAE uses for its operation.

Supported hypervisor is the usual VMware vSphere with versions ranging from 5.5 to 6.5. I expect 6.7 works too.

Network requirements are quite straightforward: NAE requires ports 443 and 22 to be open for HTTPS and SSH communication between the Cisco NAE and the APIC (Application Policy Infrastructure Controller), and between NAE and the spine and leaf switches in the fabric. The NAE VMs need unrestricted communication among themselves, so to keep it simple and save time, just put them in the same VLAN.
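Before deploying, it can save time to confirm that both ports are reachable from the network segment where the NAE VMs will live. The following is a minimal sketch using only the Python standard library; it is not a Cisco tool, and the function names are my own for illustration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_targets(targets, ports=(443, 22), probe=port_open):
    """Map every (host, port) pair to a reachability boolean."""
    return {(host, port): probe(host, port) for host in targets for port in ports}
```

Calling `check_targets` with the out-of-band management addresses of your APIC, spines and leaves returns a dictionary you can scan for `False` entries. A successful TCP handshake only shows that the port is open, not that the credentials NAE will use are valid.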

Deploy

Once the correct appliance model is chosen and downloaded, it is time to deploy the VMs and perform the initial configuration. The OVF file is already packaged and ready to deploy; the procedure on vSphere or vCenter is the same as for any other deployment. The deployment wizard will ask for an IP address to assign to the VM.

Once the three VMs are ready and powered up, we can point our browser to the IP address configured during the deployment and we’re ready to start the initial setup. Use any of the three IPs; they all provide access to the GUI:

First step is to create an administrator user:

For the cluster configuration we need to add the other VMs:

Set DNS server:

Set the NTP server:

Set the SMTP server; this is used to reset the administrator account password and to receive alerts:

If all the information is correct click Submit and proceed:

After entering all the information, the configuration process is ready to start. This takes less than 10 minutes:

When the configuration process finishes, we’re ready to log in to NAE using our admin user:

Verify Deployment

“Trust but verify” is the golden rule any Network Engineer should always keep in mind. Under the appliance status menu we can check the status of the three appliances of the cluster.

Green is good, so we’re ready to start!

Offline Analysis

NAE can run in Online mode or Offline mode. Online mode means it connects to a Fabric through an Out of Band link and continuously collects information to analyze.

The offline analysis script uses the same set of API calls and CLI commands as online mode, but it runs once instead of continuously.

Cisco provides the Python scripts to collect data from a Fabric for offline analysis. We can download the script from the NAE GUI:

For this particular case I’ll use some Offline collection files provided by Cisco. Notice the files only contain raw data coming from an actual Fabric. All the analysis is performed by the VMs of the cluster.

The first step for offline analysis is to upload the offline collection files:

And then create a new Offline Analysis:

The appliance will process all the collected data and do its magic to extract all the meaningful information. This process takes only a couple of minutes, but it clearly depends on the size of the fabric being analyzed.

Once the offline analysis is complete we can navigate through the dashboard and see what NAE found.

Change Management Use Case

Let’s start with a simple use case: a change request.

Our fellow DB admins request access between two database servers, one in production and one in the testing environment. This is a common way to get real data from production to test a new software release.

This change usually requires working on Firewall rules but with a Fabric the security rules can be placed on the contracts between EPGs.

To satisfy the request, the network administrator needs to create or modify the inter-tenant contract, adding the correct permit rules.

The first step, then, is to find out whether a contract already exists between the two EPGs. On the dashboard, go to the Tenant Security menu:

We can see in the security health section that there are no contracts between the “prod” and “non-prod” Tenants. For reference, the radial diagram represents a Tenant. The individual dots are End Point Groups (EPGs) and the green lines are the contracts between them. The lack of a line between “prod” and “non-prod” means no contract connects EPGs in these two Tenants.

After the Network administrator creates the contract we can go back again to the same report. The line between petstore-db-tier-epg in tenant prod and petstore-db-tier-epg in tenant non-prod is a visual representation of the new contract.

A mouse-over on the new contract shows more details:

We can now see the security policy between the EPGs: it permits ICMP and TCP port 1521. The two DBs should now be able to connect to each other.

Trust but verify, remember? Let’s check on the dashboard if everything is fine and… it’s not! There’s an issue with the Security policies applied to tenant non-prod.

With one more click, we can access the Smart Events and read the Event Description:

Zooming in on the specific event, we get more details of the issue and suggested next steps. These suggestions come from a knowledge base built from many sources, including Cisco TAC engineers. It definitely saves us hours of googling or of time on the phone with tech support.

In our case, something is wrong with the scope of the contract between the EPGs. In the screenshots I’ve highlighted the problem description, impact and suggested remediation:

When the network administrator, as suggested by NAE, changes the contract scope from “Tenant” to “Global” we can see the model is updated, the analysis runs again and the problem is solved. Nice job!
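To make the scope issue concrete, here is a toy model of the rule at play: a contract scoped to “Tenant” only takes effect between EPGs in the same tenant, while a “Global” one can span tenants. This is a deliberately simplified sketch and not how NAE models the fabric; the class and function names are hypothetical, and the other ACI scopes are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epg:
    tenant: str
    name: str


def contract_applies(scope, consumer, provider):
    """Simplified rule: a tenant-scoped contract only works inside one tenant."""
    if scope == "global":
        return True
    if scope == "tenant":
        return consumer.tenant == provider.tenant
    raise ValueError(f"scope not modeled in this sketch: {scope}")


prod_db = Epg("prod", "petstore-db-tier-epg")
nonprod_db = Epg("non-prod", "petstore-db-tier-epg")

print(contract_applies("tenant", prod_db, nonprod_db))  # False
print(contract_applies("global", prod_db, nonprod_db))  # True
```

The contract existed and looked correct in the diagram, yet with the “Tenant” scope it could never take effect between the two tenants. That gap between what is configured and what actually works is what the Smart Event flagged.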

We check the dashboard again: the major events are fixed now, but NAE shows some minor events. Let’s investigate!

Click on the event and here’s the issue: some policies are redundant, resulting in a possible waste of TCAM resources on leaf switches:

When we look at the details of the contract, we notice two filters are allowing ICMP traffic, which doesn’t make sense:

Fixing the issue is just a matter of removing the redundant ICMP filter and we’re good to go, no more warnings on the dashboard now:

Well done! Our fabric is perfectly configured, no warning on the dashboard, we can close the Change ticket now and have a well-deserved cup of coffee, offered by our DB admin of course!
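The redundancy check above can be illustrated with a few lines of Python. This is only a sketch of the idea (group filters by the traffic they match and report any group with more than one member), not NAE’s actual algorithm. The filter names are hypothetical; the protocols and port mirror the contract in this walkthrough.

```python
from collections import defaultdict


def redundant_filters(filters):
    """Group filter names by matched traffic; keep groups with duplicates."""
    by_match = defaultdict(list)
    for name, protocol, port in filters:
        by_match[(protocol, port)].append(name)
    return {match: names for match, names in by_match.items() if len(names) > 1}


# Hypothetical filter names attached to the same contract.
filters = [
    ("icmp-a", "icmp", None),
    ("oracle", "tcp", 1521),
    ("icmp-b", "icmp", None),
]

print(redundant_filters(filters))  # {('icmp', None): ['icmp-a', 'icmp-b']}
```

A real fabric also has overlapping port ranges and rules shadowed by broader ones, which an exact-match comparison like this one would miss.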

Final Notes

We started with what appeared to be a simple change request, which then became a process with multiple steps. NAE helped throughout the operation to validate the final configuration, including some security policy optimization to save precious hardware resources.

We got all this in less than an hour, including the initial deployment of the virtual machines and the setup process!

These are just a few examples of the thousands of elements analyzed by NAE to ensure the Fabric works as expected. Change validation is often a time-consuming process with potential impact on production. Conflicting, incorrect and overlapping policies are complex to find and fix. NAE provides a powerful mechanism to continuously audit policies for correctness and utilization.

A Quick Digression: AI, ML, MM

Let me leave the comfort zone for a 10,000-foot view of Artificial Intelligence, Machine Learning and Mathematical Models. These terms are often used and abused, and they can create some confusion, but as Network Engineers we need at least a general idea of them to understand how different products work.

For many authoritative sources, artificial intelligence (AI) is still a subject relegated to science fiction books more than to reality. There are a few examples of Artificial Narrow Intelligence, but a General AI is not here (yet).

Machine learning (ML) is a reality today; it uses statistical techniques to give computers the ability to learn without being explicitly programmed.

A Mathematical Model (MM) is a description of a system using mathematical concepts and language. A model allows us to study the effects of different components and to make predictions about behavior.

The main difference between ML and MM is that the former is useful when a system cannot be modeled and some margin of error is acceptable, while the latter is deterministic, not probabilistic, so it produces more accurate predictions. ML also requires time for training: results get better over time, but in the initial phases the accuracy may not be good.
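A tiny example may help show what “deterministic” means here. In the sketch below, the policy is the model: every question about reachability has an exact yes/no answer, and we can enumerate every EPG pair instead of sampling traffic and guessing. The EPG names and permit entries are hypothetical, and this illustrates the concept only, not NAE’s internals.

```python
from itertools import permutations

# Hypothetical policy: each entry permits one flow
# (consumer EPG, provider EPG, protocol, port).
PERMITS = {
    ("prod-db", "nonprod-db", "tcp", 1521),
    ("prod-db", "nonprod-db", "icmp", None),
}
EPGS = ["prod-db", "nonprod-db", "prod-web"]


def is_permitted(src, dst, protocol, port=None):
    """Exact answer derived from the policy; no training, no probability."""
    return (src, dst, protocol, port) in PERMITS


def unconnected_pairs():
    """List every ordered EPG pair that has no permitted flow at all."""
    connected = {(src, dst) for src, dst, _, _ in PERMITS}
    return [pair for pair in permutations(EPGS, 2) if pair not in connected]
```

The same question always gets the same answer, and the answer is available the moment the policy is known, with no training period.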

So why this long explanation? I just wanted to clarify why a product like NAE that builds a mathematical model and makes predictions on the model itself is the best tool today to manage a complex system like a datacenter network.

Other uses like SIEM and behavior analysis for security may take advantage of ML to identify malware and data exfiltration. ML and modeling are different tools for different purposes.

TL;DR version:

Artificial Intelligence: good for SF movies and books, not a thing today

ML: you really don’t want your network fabric to rely on that

Mathematical Model: that’s the way to go for accuracy and assurance
