Network operators have access to a vast, diverse set of tools that assist them in day-to-day operations and, in most cases, notify them when something goes wrong. These notifications can arrive as SNMP traps, email alerts, SMS messages, automated phone calls, or even social media messages. For those managing a large campus network, data center, or service provider network, the sheer volume of infrastructure notifications can be overwhelming. Managing this information, and identifying what is relevant and actionable versus what is simply background noise, can be daunting. What often happens is that a set of inbox rules dumps these alerts into various sub-folders, and a certain apathy toward the constant barrage of notifications sets in. This can, and often does, result in critical issues being missed, potentially leading to outages.
Current alerting mechanisms often require configuring thresholds to trigger the appropriate alert. These thresholds are typically static, with predefined values an administrator can adjust to match what they deem abnormal or outside of ideal parameters. Static thresholds often lead to false positives, further diluting the alert pool and making it difficult to discern actual problems from noise.
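To make the false-positive problem concrete, here is a minimal sketch contrasting a fixed threshold with a baseline-relative one. The metric, sample values, and threshold are all invented for illustration; this is not Cisco's algorithm, just the general idea of alerting relative to what is normal for a given network.

```python
# Hypothetical sketch: a fixed threshold vs. a baseline-relative rule on a
# metric with a recurring pattern. All numbers are illustrative only.
import statistics

# Interface utilization (%) sampled hourly; nightly backups push it high.
samples = [20, 22, 25, 24, 23, 21, 85, 88, 86, 24, 22, 20]

STATIC_THRESHOLD = 80  # a typical "alert above 80%" rule

static_alerts = [s for s in samples if s > STATIC_THRESHOLD]

# Baseline-relative rule: alert only when a sample deviates far from what
# is normal for this network (here, mean + 3 standard deviations).
mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
adaptive_alerts = [s for s in samples if s > mean + 3 * stdev]

print(len(static_alerts))    # the static rule fires on every backup window
print(len(adaptive_alerts))  # the baseline rule treats the pattern as normal
```

The static rule pages someone three times a night for routine backup traffic, while the baseline-relative rule stays quiet because those spikes are part of this network's normal behavior.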
It’s difficult to know what is normal for a given network unless some sort of baseline is measured. Even with a good baseline of the network, there is constant entropy in modern networks such that the baseline itself will shift over time. This shift could happen from one year to the next, or even from one day to the next. Arguably what is “normal” for a particular network could change by the minute.
At Cisco Live 2019 in San Diego, JP Vasseur, PhD and Cisco Fellow, introduced Cisco’s machine learning platform. Delivered through DNA Center Assurance, it provides insight into how a network is expected to operate as a baseline, and surfaces relevant alerts based on real network data, leveraging a massive data lake and Cisco’s proven expertise in networking.
At Tech Field Day Extra at Cisco Live 2019, JP and Dave Zacks, Distinguished Technical Marketing Engineer, had an opportunity to further discuss the platform and what it would mean for Cisco customers.
Volume is Key, Diversity is Crucial, Quality is a Must
JP discussed the massive data lake used by Cisco AI Network Analytics and noted that while it is critical to have a large pool of data to draw from, the diversity and quality of that data are equally important in order to prevent undesired or random outcomes when applying machine learning algorithms to it. This data lake is fed from actual customer networks and includes data from enterprise, data center, wireless, and SD-WAN networks, contributing to its diversity.
Anonymity is very important here, and JP stressed this, noting that their anonymization engine ensures the quality of the data remains intact while removing any personally identifiable information such as usernames, MAC addresses, hostnames, and network addresses. All of this data is also encrypted, with the private keys maintained by the customer in DNA-C.
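One common way to strip identifiers while keeping the data useful is keyed pseudonymization: each identifier is replaced with a stable token derived from a secret key, so records remain correlatable without exposing real values. The sketch below illustrates that general technique with invented names; it is an assumption about how such an engine could work, not Cisco's actual implementation.

```python
# Hypothetical sketch of keyed pseudonymization: identifiers become stable
# HMAC-derived tokens, so records stay joinable while the cloud side never
# sees real values. The key and all names here are illustrative only.
import hmac
import hashlib

CUSTOMER_KEY = b"kept-on-premises-with-the-customer"  # never leaves the site

def pseudonymize(value: str) -> str:
    """Deterministic, irreversible token for one identifier."""
    return hmac.new(CUSTOMER_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"hostname": "core-sw-01", "mac": "00:1A:2B:3C:4D:5E", "rssi": -67}
anonymized = {
    "hostname": pseudonymize(record["hostname"]),
    "mac": pseudonymize(record["mac"]),
    "rssi": record["rssi"],  # the telemetry itself is untouched
}

# The same input always yields the same token, so correlation survives.
assert anonymized["mac"] == pseudonymize("00:1A:2B:3C:4D:5E")
```

Because the key stays with the customer, anyone without it cannot reverse a token back to a MAC address or hostname, yet the analytics side can still group measurements belonging to the same device.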
Cisco has been working with and developing their expertise in AI and ML for many years, and this, combined with their 35 years of networking expertise, allows them to target the specific use-cases desired for this platform, while filtering out extraneous “noise” within the massive data set.
The data is fed through this machine learning stack, which is trained via a diverse set of ML algorithms. No single algorithm will provide the kind of actionable insight needed for these networks and traffic patterns, and diversity in the models used is as important as diversity in the data set.
From the customer perspective, network data is gathered in DNA-C, aggregated via the Network Data Platform (NDP), anonymized, encrypted, and then sent to the Cisco AI Network Analytics AWS instance, where it joins the existing customer data in the larger collective data lake. The data is modeled and trained, a predictive baseline is generated, and the results are fed back to the customer’s DNA-C, where they are decrypted, de-anonymized, and presented to the customer with the relevant identifiable information restored. This baseline becomes more accurate and better defined as more data passes through the platform.
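The round trip described above hinges on one detail: the customer side must keep the mapping needed to re-identify results coming back from the cloud. The following sketch shows that pattern in miniature; the function names, token scheme, and data fields are all illustrative assumptions, not Cisco's API.

```python
# Hypothetical sketch of the anonymize/de-anonymize round trip: a local
# token->name map stays with the customer so cloud results can be
# re-identified. All names and fields here are invented for illustration.
import hashlib

token_map: dict[str, str] = {}  # stays on the customer side, never uploaded

def anonymize(name: str) -> str:
    token = hashlib.sha256(name.encode()).hexdigest()[:12]
    token_map[token] = name  # remember the mapping for the return trip
    return token

def de_anonymize(token: str) -> str:
    return token_map[token]

# Outbound: only tokens and raw metrics leave for the analytics cloud.
outbound = {"ap": anonymize("AP-Bldg2-Floor3"), "onboarding_ms": 4200}

# Inbound: the cloud flags an anomaly against the learned baseline,
# referencing the token; the customer side restores the real name.
inbound = {"ap": outbound["ap"], "issue": "onboarding time above baseline"}
print(de_anonymize(inbound["ap"]))  # -> AP-Bldg2-Floor3
```

The cloud side only ever handles tokens, while the dashboard the operator sees shows real access point names, which matches the decrypt-and-de-anonymize step described above.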
Cognitive Issue Detection and Analysis
Currently, the Cisco AI Network Analytics platform focuses first on wireless networks, which by their nature can be very challenging to troubleshoot. Eventually, other technologies will be incorporated, including routing, switching, and SD-WAN. Wireless networks are complex, and that complexity, combined with the fact that the physical layer is the air, makes it difficult to identify issues or their root causes. Problems reported by users are often difficult to replicate, and there can be a real lack of quantifiable data showing exactly what the user experience is like. There may be little or no noise, no interference, and a reasonable number of clients on an access point, yet complaints of poor application performance or throughput are still heard. How does this look from within DNA-C Assurance and the Cisco AI Network Analytics platform?
Once a baseline is established and what is “normal” for the network is defined, the platform can begin to find anomalies and identify their root cause. In wireless, these anomalies can be problems with client onboarding, DHCP, AAA, or throughput, for example.
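A simple way to picture anomaly detection against per-metric baselines is below. The baseline numbers, metric names, and deviation rule are invented for illustration; the platform's actual models are far more sophisticated, but the idea of flagging the one metric that strays from its learned normal is the same.

```python
# Hypothetical sketch: per-metric wireless baselines (learned elsewhere)
# used to flag which stage of client onboarding is anomalous.
# All baseline values and metric names are invented for illustration.
baselines = {
    "onboarding_ms": {"mean": 1800.0, "stdev": 400.0},
    "dhcp_ms":       {"mean": 120.0,  "stdev": 30.0},
    "aaa_ms":        {"mean": 250.0,  "stdev": 60.0},
}

def find_anomalies(sample: dict[str, float], k: float = 3.0) -> list[str]:
    """Return the metrics deviating more than k stdevs from their baseline."""
    flagged = []
    for metric, value in sample.items():
        b = baselines[metric]
        if abs(value - b["mean"]) > k * b["stdev"]:
            flagged.append(metric)
    return flagged

# DHCP is slow while everything else is within its normal band, so the
# likely root cause surfaces on its own.
sample = {"onboarding_ms": 2600.0, "dhcp_ms": 900.0, "aaa_ms": 310.0}
print(find_anomalies(sample))  # -> ['dhcp_ms']
```

Total onboarding time is elevated here, but only the DHCP stage is far outside its own baseline, which is the kind of narrowing-down that points an operator at a root cause rather than a symptom.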
From within DNA-C Assurance, the Issues Dashboard now groups issues into a new category – “AI Driven.” These are prioritized events identified by the platform based on the analyzed data and what has been determined to be normal for the network. One key element here is taking a potentially large volume of anomalies and filtering them down to a small number of relevant, actionable items for the network operator or administrator.
Drilling down into these issues, DNA-C now presents detailed information about the problem, along with visualizations of the anomaly as it relates to the expected baseline. From here, a Root Cause Analysis is also available, along with Suggested Actions for remediation.
Typically, the process for identifying, investigating, troubleshooting, and correcting these kinds of issues would require several hours of work for an administrator. The Cisco AI Network Analytics platform boils this down to a single dashboard, with clean, clear explanations around what has been determined to be causing the issue, and what needs to be done to resolve it.
Closing the Loop
As development of the Cisco AI Network Analytics platform continues, there may be a future in which these problems can be solved by the platform itself. Keep in mind the earlier point: as more Cisco customers adopt DNA-C, more data flows through the platform, the data gets better, and issue analysis and root cause identification become more reliable.
In many cases the end goal of AI- or ML-driven solutions is to drive automation toward what are often called “self-healing networks” – closed-loop systems that don’t require a network administrator to implement the fix. Once an issue is identified, the system is able to make the necessary changes to resolve it.
The exponential growth and development of AI and ML over the past several years has provided some exciting opportunities to improve any number of technologies, networks included. The Cisco AI Network Analytics platform is one example that is ready and able to be put into practical use for Cisco customers.