The idea behind getting telemetry and metrics for your network is driving the way we build things today. If you have the data that can tell you where the bottlenecks are and can find ways to send traffic around them when they occur, you can create happier users and reduce the strain on your infrastructure. It’s a hard problem to solve in the enterprise for sure.
At scale, the telemetry problem becomes massive. When your network is operating at the hyperscale level, every packet flow is enormous. Every change could result in downtime, which causes gigabytes or even terabytes of traffic to be lost. Tens of milliseconds of extra latency can cause significant revenue loss, either due to trading platform delays or lost sales on major holidays. As hard as these problems are to solve in the enterprise, hyperscale networking is an entirely different animal to tame.
Programming Your Way To Assurance
One of the most significant issues with tracking telemetry at the hyperscale level is that no good tools exist to do it the way you need. Because the available tools are built with enterprises and service providers in mind, they don’t scale to the heights associated with large cloud providers or huge services like Facebook or Google. More often than not, the smallest flows in a hyperscale environment are way beyond the maximum capabilities of the most robust tools out there.
These flaws lead the hyperscalers to develop their own platforms to handle their unique needs. This task isn’t impossible because brilliant people work for these companies and are more than capable of writing great software. The roadblock they hit next comes from the hardware itself. While these programmers are great at writing their own software, they still have to interact with the hardware to get the data they’re looking for. The hardware they’re using is designed to be queried by the tools built for the enterprise. A switch or router that isn’t expecting to send details on a massive number of flows could easily fall over because it just doesn’t know how to handle that many statistics.
You need networking gear that can be programmed to give you the statistics you need in a way that isn’t going to cause issues. Thankfully, the team that founded Barefoot Networks took care of this with their programmable Tofino chipset and P4 programming language. They were so good that they were snapped up by Intel and integrated into several product lines. During the recent Networking Field Day 23 event, they were able to show off some of what they’ve been working on with telemetry:
Deep Insight is a platform that I looked at before the acquisition. It’s solid and can provide the kind of telemetry that helps your systems make good decisions about things like dynamic routing. More importantly, Deep Insight offers the same kind of integration potential as the rest of the Barefoot technologies.
P4 gives you the flexibility to do amazing things. Using Deep Insight and In-Band Network Telemetry (INT), you can pull packet flows and aggregate them quickly. You can customize those data captures to get only the things you’re looking for. If you don’t need the entire contents of a NetFlow/IPFIX export, you don’t need to grab it. Considering the stories I’ve heard from Facebook engineers about systems crashing from the overwhelming amount of data coming in from NetFlow in their early days, I’m sure they would love something like this running in their core networks.
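To make the "grab only what you need" idea concrete, here is a minimal Python sketch of selective per-hop telemetry. It is not P4 and not the Deep Insight API. The `Field` flags, the `HopState` record, and the `collect` function are all hypothetical names, and the bit positions are illustrative rather than the layout defined in the INT specification. The point is only that a collector asks for a subset of metadata with a bitmap, and each hop reports those fields and nothing else.

```python
from dataclasses import dataclass
from enum import IntFlag


class Field(IntFlag):
    """Hypothetical metadata selectors; NOT the INT spec's bit layout."""
    SWITCH_ID = 1
    HOP_LATENCY = 2
    QUEUE_DEPTH = 4
    INGRESS_TS = 8


@dataclass(frozen=True)
class HopState:
    """Everything a (made-up) switch could report about one packet."""
    switch_id: int
    hop_latency_ns: int
    queue_depth: int
    ingress_ts_ns: int


def collect(hop: HopState, wanted: Field) -> dict:
    """Return only the metadata fields selected by the `wanted` bitmap."""
    available = {
        Field.SWITCH_ID: ("switch_id", hop.switch_id),
        Field.HOP_LATENCY: ("hop_latency_ns", hop.hop_latency_ns),
        Field.QUEUE_DEPTH: ("queue_depth", hop.queue_depth),
        Field.INGRESS_TS: ("ingress_ts_ns", hop.ingress_ts_ns),
    }
    return {
        name: value
        for flag, (name, value) in available.items()
        if flag in wanted
    }


# Two hops on a path; the second one is congested (values are invented).
path = [
    HopState(switch_id=1, hop_latency_ns=850, queue_depth=12, ingress_ts_ns=1_000),
    HopState(switch_id=2, hop_latency_ns=40_000, queue_depth=310, ingress_ts_ns=2_000),
]

# Ask only for switch ID and hop latency; queue depth and timestamps are skipped.
wanted = Field.SWITCH_ID | Field.HOP_LATENCY
report = [collect(hop, wanted) for hop in path]

# Aggregate: which hop added the most latency?
slowest = max(report, key=lambda r: r["hop_latency_ns"])
```

With only two of the four fields requested, each hop record is half the size it would otherwise be. At hyperscale flow counts, that kind of trimming is the difference between a collector that keeps up and one that falls over.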
Tofino offers the capability to extend your network beyond just being a packet mover. Because of the chipset’s extensibility combined with P4, you can go beyond slinging packets back and forth. As described in the video, SoftBank was able to do rapid traffic classification and load balancing across hosts without adding a significant amount of hardware to the equation. Whereas they may have needed to purchase expensive new equipment in the past or find ways to modify existing hardware to get something approximate, P4 and Tofino combine to give the customer the ability to extend the network in any direction they want. Getting reports and data about those extensions is a snap as well, thanks to INT and Deep Insight.
Bringing It All Together
For the hyperscale network operators, customization is a way of life. Parts and pieces that might look perfectly serviceable under an enterprise workload can fall apart when subjected to the amount of use they would see in that rarefied air, as Audi found out about car horns years ago. If you want to be a trusted provider to the hyperscale market, you’re going to need to make something that can be taken apart and reassembled as needed by the customer. There is no ‘one size fits all’ here because the size of the customer you’re dealing with is beyond any imagination. Intel has always built big. With the Barefoot team bringing Tofino, P4, Deep Insight, and more under the umbrella, they are swinging for the hyperscale fences and should have no trouble hitting a home run.