As discussed in the previous post, Cisco Network Assurance Engine (NAE) can verify the network security and compliance of a datacenter, providing full visibility of the current state of the network. In this post, I’ll discuss how NAE can assure that the datacenter fabric is working as expected in terms of configuration correctness, endpoints and hardware resources.
Configuration Correctness
Being compliant enough to pass security audits is only part of what is necessary to manage security in a datacenter. Network security policy sprawl adds risk to the network. NAE can help: its built-in checks identify incompletely specified policies, unused policies, and duplicate intent (in the form of overlapping or conflicting contracts between EPGs), and list them in the dashboard as Smart Events.
Each Smart Event includes a list of objects involved, the root cause in terms of failing condition, and the suggested next steps to troubleshoot the issue. The next steps are based on Cisco TAC experience and best practices.
Cleaning up and removing redundant, shadowed, or contradictory policies are tasks no Security Engineer likes to perform. They are risky, and quite often there is not enough visibility into the effect they may have on the network. These policies may not have a direct impact on security, but they waste finite hardware resources, lead to complex troubleshooting scenarios, and most importantly make changes extremely risky.
NAE understands policies and, via Smart Events, suggests the actions necessary to bring consistency to policy specification and optimize resource utilization.
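To make the idea of duplicate and conflicting intent concrete, here is a minimal sketch in Python. It is not how NAE works internally and it does not use any NAE or APIC API. The `Rule` fields, the rule names and the three finding labels are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified view of one contract rule between two EPGs."""
    name: str
    src_epg: str
    dst_epg: str
    protocol: str
    port_lo: int
    port_hi: int
    action: str  # "permit" or "deny"


def overlaps(a: Rule, b: Rule) -> bool:
    """True when both rules match the same EPG pair, protocol and at least one port."""
    return (
        a.src_epg == b.src_epg
        and a.dst_epg == b.dst_epg
        and a.protocol == b.protocol
        and a.port_lo <= b.port_hi
        and b.port_lo <= a.port_hi
    )


def classify(rules):
    """Compare every pair of rules and label the overlapping ones."""
    findings = []
    for a, b in combinations(rules, 2):
        if not overlaps(a, b):
            continue
        if a.action != b.action:
            kind = "conflict"      # same traffic, contradictory actions
        elif (a.port_lo, a.port_hi) == (b.port_lo, b.port_hi):
            kind = "duplicate"     # identical match and identical action
        else:
            kind = "overlap"       # partially redundant
        findings.append((kind, a.name, b.name))
    return findings


rules = [
    Rule("web-http", "web", "app", "tcp", 80, 80, "permit"),
    Rule("web-http-copy", "web", "app", "tcp", 80, 80, "permit"),
    Rule("web-range", "web", "app", "tcp", 1, 1024, "deny"),
    Rule("app-db", "app", "db", "tcp", 3306, 3306, "permit"),
]
findings = classify(rules)
for kind, first, second in findings:
    print(f"{kind}: {first} <-> {second}")
```

A pairwise comparison like this is quadratic in the number of rules, so it only illustrates the categories of findings, not a way to analyze a fabric with thousands of policies.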
Tenant Endpoints
A fabric exists to provide services to Endpoints. Endpoints may simply get connectivity or interact with the fabric itself, like a router advertising prefixes to a border leaf.
One of the most powerful features of NAE is the ability to track Endpoints attached to the fabric and continuously inspect their behavior, triggering a Smart Event if any potential problem arises.
A common example of an Endpoint problem is wrong addressing. NAE can notice that an Endpoint is assigned to a wrong or missing subnet and generate a Smart Event with all the details necessary to fix the issue.
An Endpoint may also fail to get an IP address from the DHCP server. There’s a Smart Event for that! This check is performed continuously for every Endpoint connected to the fabric. No one is left behind with this exhaustive analysis.
If for some reason Endpoint information is not consistent across the fabric leaves and spines, NAE can detect this condition and provide all the details necessary to investigate.
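The addressing check can be pictured with a few lines of Python using the standard `ipaddress` module. This is only a sketch of the idea, not NAE's implementation. The function name and the sample addresses are hypothetical, and the subnets are assumed to be written as gateway address plus prefix length.

```python
import ipaddress


def endpoint_in_configured_subnet(endpoint_ip, configured_subnets):
    """Return True if the endpoint address falls inside any configured subnet.

    Subnets are given as "gateway/prefix" strings (an assumption), so
    strict=False is used to accept host bits in the address part.
    """
    ip = ipaddress.ip_address(endpoint_ip)
    networks = [ipaddress.ip_network(s, strict=False) for s in configured_subnets]
    return any(ip in net for net in networks)


# Hypothetical subnets configured for the segment the endpoints attach to.
subnets = ["10.1.1.1/24", "10.1.2.1/24"]

good = endpoint_in_configured_subnet("10.1.2.57", subnets)
bad = endpoint_in_configured_subnet("10.9.9.9", subnets)
print(good, bad)
```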
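As a rough illustration of what "consistent across the fabric" means, the sketch below compares what each switch believes about the same Endpoint. The data layout (a per-switch table keyed by MAC address, holding an IP and an owning leaf) is an assumption made for this example and is not the format NAE or the switches actually use.

```python
from collections import defaultdict


def find_inconsistent_endpoints(views):
    """Return endpoints whose record differs between switches.

    views maps a switch name to {mac: (ip, owning_leaf)}.
    """
    seen = defaultdict(dict)  # mac -> {switch: record}
    for switch, table in views.items():
        for mac, record in table.items():
            seen[mac][switch] = record

    issues = {}
    for mac, per_switch in seen.items():
        # More than one distinct record means the switches disagree.
        if len(set(per_switch.values())) > 1:
            issues[mac] = per_switch
    return issues


# Hypothetical snapshot: the spine still points at the old leaf for one MAC.
views = {
    "leaf-101": {"00:00:00:aa:aa:aa": ("10.1.1.10", "leaf-101")},
    "leaf-102": {"00:00:00:bb:bb:bb": ("10.1.1.20", "leaf-102")},
    "spine-201": {
        "00:00:00:aa:aa:aa": ("10.1.1.10", "leaf-102"),
        "00:00:00:bb:bb:bb": ("10.1.1.20", "leaf-102"),
    },
}
issues = find_inconsistent_endpoints(views)
for mac, per_switch in issues.items():
    print(mac, per_switch)
```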
These are just a few examples of Endpoint issues NAE can detect but there are plenty more. What if an external router advertises an overlapping prefix? What if a firewall misconfiguration impacts the functionality of the fabric?
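The overlapping prefix question, for instance, reduces to a comparison that is easy to sketch with Python's standard `ipaddress` module. The function and the sample prefixes below are hypothetical and only show the kind of condition being checked, not how NAE evaluates it.

```python
import ipaddress


def overlapping_prefixes(fabric_subnets, external_prefixes):
    """Return (external, fabric) pairs whose address ranges overlap."""
    hits = []
    for ext in external_prefixes:
        ext_net = ipaddress.ip_network(ext)
        for sub in fabric_subnets:
            # Fabric subnets are assumed to be written as gateway/prefix.
            sub_net = ipaddress.ip_network(sub, strict=False)
            if ext_net.version == sub_net.version and ext_net.overlaps(sub_net):
                hits.append((ext, sub))
    return hits


fabric = ["10.1.1.1/24", "10.2.0.1/16"]
external = ["10.1.0.0/16", "192.168.0.0/24"]
hits = overlapping_prefixes(fabric, external)
print(hits)
```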
Mean Time to Innocence!
NAE can perform continuous verification and diagnosis on the entire dynamic state (Endpoints connected to the fabric, fabric-wide forwarding state) and identify any actual or potential issues. It provides full visibility of the current state and suggests remediation actions. It is easy to imagine the impact on the time needed to detect and fix issues, and on the ability to assure that all the Endpoints are working as expected. No more fix/break: NAE provides assurance that infrastructure and Endpoints are doing what they are intended to do at any time. Now, when the application fails, for the first time, the network guy in the room can actually prove that the network is innocent! We don’t need to be the fall guys any more.
Hardware still matters
We all know that TCAM and the tables used to store the state of a switch are finite resources that must be taken into account. The network is software-defined but still runs on hardware, hidden by layers of abstraction. NAE provides granular visibility into the usage of these hardware resources so that operators can ensure the changes they make will deploy successfully.
With complete visibility of the TCAM utilization of the switches, NAE ensures that no change, however small and insignificant it may seem, causes the network to exhaust resources. More than that, for every leaf NAE can correlate TCAM utilization and security policies with granularity down to each single EPG and Tenant.
With this information the network administrator can take actions such as moving a tenant to another leaf with spare resources, or plan hardware upgrades based on the actual use of TCAM.
Knowing that enough resources are available is not enough to validate the hardware layer. There are thousands of policies and hundreds of leaves in today's networks. How can we verify that policies are being correctly applied on every switch?
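The kind of reasoning this enables can be sketched as simple arithmetic. The example below is illustrative only: the capacity figure, the tenant names, the per-tenant entry counts and the 80% threshold are all made-up assumptions, not values taken from NAE or from any specific switch model.

```python
def tcam_report(capacity, usage_by_tenant, threshold=0.8):
    """Summarize policy TCAM usage on one leaf, broken down by tenant."""
    used = sum(usage_by_tenant.values())
    utilization = used / capacity
    return {
        "used": used,
        "utilization": utilization,
        "over_threshold": utilization >= threshold,
        # Tenant consuming the most entries, a candidate to move elsewhere.
        "top_tenant": max(usage_by_tenant, key=usage_by_tenant.get),
    }


def change_fits(capacity, usage_by_tenant, extra_entries):
    """True if a planned change needing extra_entries still fits on the leaf."""
    return sum(usage_by_tenant.values()) + extra_entries <= capacity


# Hypothetical numbers for a single leaf.
capacity = 4096
usage = {"tenant-a": 2000, "tenant-b": 1500}

report = tcam_report(capacity, usage)
print(report)
print(change_fits(capacity, usage, 500), change_fits(capacity, usage, 700))
```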
Following the “trust but verify” paradigm, Cisco NAE can ensure security policy enforcement is correct at the leaf configuration level and leaf hardware level for the whole fabric.
If any of the enforcements are missing, a Smart Event is created in the dashboard suggesting actions to correct the issue.
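Conceptually, "trust but verify" here means comparing what should be enforced on each leaf with what is actually found there. A minimal sketch of that comparison follows. The rule tuples and leaf names are hypothetical, and collecting the real intended and deployed state from a fabric is outside the scope of this example.

```python
def enforcement_gaps(intended_by_leaf, deployed_by_leaf):
    """Per leaf, list rules that are intended but absent, and the reverse."""
    gaps = {}
    for leaf, intended in intended_by_leaf.items():
        deployed = deployed_by_leaf.get(leaf, set())
        missing = intended - deployed      # should be enforced, but is not
        unexpected = deployed - intended   # enforced, but nobody asked for it
        if missing or unexpected:
            gaps[leaf] = {
                "missing": sorted(missing),
                "unexpected": sorted(unexpected),
            }
    return gaps


# Hypothetical rules written as (src_epg, dst_epg, protocol, port).
intended = {
    "leaf-101": {("web", "app", "tcp", 80), ("app", "db", "tcp", 3306)},
    "leaf-102": {("web", "app", "tcp", 80)},
}
deployed = {
    "leaf-101": {("web", "app", "tcp", 80)},
    "leaf-102": {("web", "app", "tcp", 80)},
}
gaps = enforcement_gaps(intended, deployed)
print(gaps)
```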
Final Thoughts
The NAE assurance platform continuously tracks dynamic state changes, looking for any issues in configurations, Endpoints, forwarding state, policies and hardware. This is truly what Assurance is.
Cisco Network Assurance Engine is a valid ally of the network administrator, not only to manage a running network, but also during deployment to verify that hardware and policies are configured according to the defined design.
The tools available so far have been old school monitoring and experience based on mistakes. NAE is part of the next generation of tools, able to increase network reliability and scale to network sizes and complexity that can no longer be managed by relying merely on human knowledge and manual correlation.
Correctness by design and continuous verification are the least network engineers should expect to manage today’s complex networks. Cisco NAE provides all the tools based on formal mathematical models to assure configuration correctness, expected state and network security policy compliance of the Datacenter today.
Nice. Always wondered if there was a way to see some examples of the 5000+ checks that can be done before running into those errors :). Candid/CNAE is an interesting product and I would love to learn more before any implementation. Thanks for the detailed article, the nerd in me wants more :). I have to present some NAE to a customer and to a sales person. I promise to steal/borrow some of your examples. Wish there was a simulator for NAE available (or dcloud, but I haven’t seen it yet). Thank you and good job.