
Evolving Connectivity for AI Applications with Ultra Ethernet with J Metz | Utilizing Tech 06×07

Ultra Ethernet promises to tune Ethernet for the needs of specialized workloads, including HPC and AI, from the lowest hardware layers to the software stack. This episode of Utilizing Tech features Dr. J Metz, Steering Committee Chair of the Ultra Ethernet Consortium, discussing this new technology with Frederic Van Haren and Stephen Foskett. The process of tuning Ethernet begins with a study of the profiles and workloads to be served, identifying the characteristics needed to support them. The group focuses on scale-out networks for large-scale applications like AI and HPC. Considerations include security, latency, ordering, and scalability. The goal is not to replace PCIe, CXL, or fabrics like NVLink but to extend Ethernet to address connectivity and performance needs in an open, standardized way. But Ultra Ethernet is more than hardware; the group is also building software features including a Libfabric interface, and is working with OCP, DMTF, SNIA, and other industry groups.

Ultra Ethernet – Tuned for AI and HPC Workloads

AI deployments are proliferating rapidly across sectors. In the not-too-distant future, doctors may perform surgeries from across continents via AI-assisted robots, and AI-powered farming may help offset the damaging effects of unpredictable weather patterns.

But before that happens, the world needs a network to bring it all together. Recently, the industry has been abuzz with talk of Ultra Ethernet, a technology many believe will be transformative in the era of AI.

In this episode of the Utilizing AI podcast, brought to you by Tech Field Day, now part of The Futurum Group, hosts Stephen Foskett and Frederic Van Haren talk with a long-time delegate of the body behind the technology, J Michel Metz. The discussion lifts the lid on the technology and explains how people and communities might use Ultra Ethernet.

As many would know, Metz has many feathers in his cap. Besides being the Chair of the Ultra Ethernet Consortium (UEC), he is also the Chair of the Storage Networking Industry Association (SNIA) Board of Directors, and a Technical Director at AMD, one of the steering members of UEC. For most of his career, Metz has been involved with high-performance networking and storage, and coming to Ultra Ethernet was a natural progression from that.

A High-Speed Network for HPC and AI

Ethernet is arguably one of the most enduring networking technologies known. Its ubiquity in a landscape fraught with competition is a testament to its continued evolution. Ultra Ethernet takes traditional Ethernet to the next level, making it ready for a new generation of HPC and AI workloads that demand ultra-low latency and high speeds.

Metz describes Ultra Ethernet as “an ambitious project that attempts to tune the different stacks of Ethernet so that you have workload-specific performance behaviors.”

The tuning happens across all the layers, from the lowest hardware level to the software stack. “The physical layer, the link layer, the transport layer, software APIs, everything is tuned to specific profiles that are designed to address the different networking considerations that both AI and HPC have. It’s this alignment of the layers that we’re specifically working on,” he explains.

Metz breaks down how the profiles are being mapped to specific network types. Broadly, networks are classified into three types: the traditional, general-purpose LAN and WAN, which are your typical Internet-based networks; scale-out networks; and massive-scale accelerator-based networks.

Characteristically, the latter two are different from the first, especially because they have unprecedented latency and bandwidth requirements. Ultra Ethernet works on aligning the requirements vertically as well as horizontally across the network.

“What you would normally consider to be an enterprise-level network is not what we’re looking at. We are not changing the general-purpose nature of Ethernet as is typically applied in systems today.” Instead, UEC has its eyes set on the middle category.

Within this network, there are different profiles of HPC and AI workloads which come with their own unique sets of networking demands. “For instance, AI will have different security requirements than HPC, different latency requirements, bandwidth requirements, ordering requirements and so on.”

An underperforming network is the Achilles heel of these workloads.

Why Popular Interconnects Won’t Do

Inside the server, disparate components are linked point-to-point with interconnects. There is already a growing variety of these interconnects in the market: PCIe, a principal one; InfiniBand, an industry gold standard for two decades; CXL, based on PCIe and known for composability features like memory pooling; NVLink, a bi-directional, direct GPU interconnect; and AMD Infinity Fabric, an interconnect layer for AMD products.

But when it comes to remote networks that scale outside a single server or chassis, these products come up short. “They don’t tend to go over a remote network. They’re inside of the core technology,” points out Metz. “Now there are PCIe and CXL switches in development right now that are going to be out in the market, but they are effectively what we would consider to be small by Ethernet-standard scaling,” he reminds.

AI and HPC Workloads Require High-Speed Communication

As enterprises take on new AI deployments, the need to pare down network latency is growing. In AI training, tail latency, the latency of the slowest communications between these components, determines the level of GPU utilization. The lower the tail latency, the more efficiently the compute resources work.
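Why the tail matters, rather than the average, can be sketched with a toy model (illustrative numbers only, not figures from the episode): in synchronous training, every step must wait for the slowest communication to complete before compute can resume, so a single straggler link drags down utilization for the whole cluster.

```python
def step_utilization(compute_us: float, comm_samples_us: list[float]) -> float:
    """Fraction of a training step spent computing, when the step must wait
    for the slowest (tail) communication among all links before proceeding."""
    step_us = compute_us + max(comm_samples_us)
    return compute_us / step_us

# Mostly similar latencies, but one straggler link dominates the step time.
low_tail = step_utilization(100.0, [10.0, 12.0, 11.0, 14.0])   # ~0.88
high_tail = step_utilization(100.0, [10.0, 12.0, 11.0, 95.0])  # ~0.51
assert low_tail > high_tail
```

The model is deliberately crude, but it captures the point Metz makes: cutting the worst-case (tail) latency, not the average, is what raises GPU utilization.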

Interconnects like PCIe and CXL cannot meet the goal of bringing latency under 200 nanoseconds, the ideal for these applications and workloads. “You wouldn’t necessarily want to extend PCIe. It’s a very impractical bus level technology for these kinds of things.”

Low utilization of links is a problem that needs addressing if the goal is to reduce tail latency. Ultra Ethernet seeks to build on a technique called packet spraying, in which a flow uses every path in the network simultaneously to reach its destination. Packet spraying is believed to be a more balanced and efficient approach than traditional multi-pathing, which can map too many flows to one path, because it makes use of all available network paths.
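The contrast can be sketched in a few lines. This is a simplified illustration, not UEC's actual forwarding logic: conventional multi-pathing (ECMP-style) hashes a flow once and pins all of its packets to a single path, while packet spraying lets each packet of the same flow take the next available path.

```python
import hashlib

PATHS = ["path-0", "path-1", "path-2", "path-3"]  # parallel links to the destination

def ecmp_path(flow_id: str) -> str:
    """Traditional multi-pathing: hash the flow once; every packet of the
    flow is pinned to that single path, however large the flow is."""
    h = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return PATHS[h % len(PATHS)]

def spray_path(packet_seq: int) -> str:
    """Packet spraying: each packet independently takes the next path,
    spreading one flow's load across every available link."""
    return PATHS[packet_seq % len(PATHS)]

# A single large flow under ECMP occupies exactly one path...
assert len({ecmp_path("flow-A") for _ in range(8)}) == 1
# ...while spraying distributes its packets across all of them.
assert {spray_path(seq) for seq in range(8)} == set(PATHS)
```

The cost of spraying, of course, is that packets of one flow can now arrive out of order, which is exactly the problem the UEC transport work addresses.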

“What we want to do is try and expand upon the ability to do packet spraying across every available link. The issue with Remote Direct Memory Access (RDMA) is that it has load balancing approaches that wind up being a problem at very large scales, and there’s a maximum. You would start hitting the limit, the further out you go.”

To get around that, UEC has an alternative plan. “We want to create the transport layer to be able to handle the semantic reordering in the transport layer itself, which means that you could actually wind up with an open Ethernet-based approach for that packet spraying because you don’t have to worry about the ability to do the reassembly of the packets in order at the other end of the line.”
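The receiving side of that idea can be illustrated with a toy reassembler. This is a generic sequence-number sketch, not UEC's actual transport design: packets sprayed across many paths arrive out of order, and the transport buffers early arrivals and delivers everything to the application in order.

```python
class ReorderingReceiver:
    """Toy transport endpoint: accepts packets in any arrival order and
    delivers payloads to the application strictly by sequence number."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.buffer = {}      # seq -> payload, for packets that arrived early
        self.delivered = []   # in-order payloads handed to the application

    def on_packet(self, seq: int, payload: str) -> None:
        self.buffer[seq] = payload
        # Deliver every contiguous packet starting at next_seq.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

# Packets of one flow arrive out of order after being sprayed across paths.
rx = ReorderingReceiver()
for seq, payload in [(2, "c"), (0, "a"), (3, "d"), (1, "b")]:
    rx.on_packet(seq, payload)
assert rx.delivered == ["a", "b", "c", "d"]
```

Putting this responsibility in the transport layer itself, rather than requiring the network to preserve order, is what makes spraying across every link viable at scale.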

An Option, Not a Competition

Despite its enhanced capabilities, Ultra Ethernet is not a direct competitor to horizontal solutions like InfiniBand.

“InfiniBand definitely has a place for those companies and customers that want to have that ecosystem as part of theirs. This is just an alternative, and I fervently believe that it is a huge plus to the end consumer. We’re taking an open approach with a large group of contributors to help companies and customers find their future AI and HPC needs,” Metz says.

There is an entire cottage industry for Ethernet that can benefit from upgrading to Ultra Ethernet. Metz believes that the uptake may be slow at first, but as more people join the community, the technology will get the support it requires to take off.

The ultimate goal of UEC is to be able to scale to an extraordinary number of endpoints within one training cluster. “A typical large-scale network probably has about 10,000, maybe 40,000 nodes, at most. We’re looking at going up to a million. We’re focusing on very large scales, not quite an order of magnitude difference, but pretty close when we start talking about the broad scope of things and future-proofing that at scale,” Metz says.

One of the advantages that Ethernet has is broad interoperability and compatibility. Does Ultra Ethernet offer the same flexibility to the customers? Metz assures that Ultra Ethernet can be deployed on existing switches.

“You don’t necessarily need to have super-fancy types of Ultra Ethernet networking equipment to do this end to end, and it’s designed specifically to be able to ease yourself into Ultra Ethernet approaches.”

The Ultra Ethernet Consortium was founded and is run by ten steering members – AMD, Intel, Meta, Microsoft, Oracle, Broadcom, Cisco, HPE, Arista and Eviden – but a growing community of tech companies is throwing its weight behind it to fuel innovation.

Be sure to check out UEC’s website, and their whitepaper on Ultra Ethernet to learn more. Also keep an eye out for the announcement of the next season of Utilizing AI podcast here at Gestalt IT.

Podcast Information

Gestalt IT and Tech Field Day are now part of The Futurum Group.

About the author

Sulagna Saha

Sulagna Saha is a writer at Gestalt IT where she covers all the latest in enterprise IT. She has written widely on a range of topics, including the hottest technologies in Cloud, AI, and Security.

A writer by day and reader by night, Sulagna can be found busy with a book or browsing through a bookstore in her free time. She also likes cooking fancy things on leisurely weekends. Traveling and movies are other things high on her list of passions. Sulagna works out of the Gestalt IT office in Hudson, Ohio.
