
GPUs, HPC, and PCI Switching

In 2009, Volodymyr V. Kindratenko, et al., presented a paper on using Graphics Processing Unit (GPU) clusters for High Performance Computing (HPC). In this paper, they note the rise of GPU-based high performance compute clusters and consider several of the problems involved, including processing, speed, and power efficiency across GPUs connected to the same cluster. Many of these challenges, it seems, were traceable back to a pair of problems. The first is the design of the software itself; building software to run on multiple cores is, to this day, a lot more challenging than it might initially appear. The second is what any network engineer with long experience might expect: the network. The problem the CAP theorem exposes, that there are efficiency and consistency costs to moving data through space, will always rear its head when processing large-scale problems across multiple processors connected through some sort of network, even GPUs in an HPC cluster. A more recent paper led by Youngsok Kim considers a series of proprietary connectivity options, designed by GPU vendors, to reduce this inefficiency when clustering GPUs.

These proprietary systems are purpose-built for graphics applications, which means they may, or may not, work in other HPC situations, such as the neural networks used for deep learning applications. The proprietary interfaces also limit the user to GPUs from a single vendor, and likely even to a single generation of chips from that vendor. Of course, these vendor limitations are exactly what general purpose networks are designed to resolve. The network world, right now, revolves around Ethernet for data and Fibre Channel for storage. Ethernet is simple and fast, but to go from the GPU to the physical Ethernet network, a new chip, and a new control plane, must be inserted along the way: the network interface. This is clearly not ideal when dealing with high speed data, as each chip in the middle requires some form of processing, including looking up the destination, imposing headers, switching based on those headers, and then stripping the headers back off. Data must be carried off the internal PCI bus connecting the GPU to the network interface, through the physical interface, serialized onto the Ethernet wire, copied back off the physical wire, and then copied onto another PCI bus.

When a Bus Isn’t a Bus

The obvious question here is: why not just use the PCI bus as a connection network? Traditional PCI (and its parallel variants, such as PCI-X) was designed as a true bus, which means there were parallel wires to carry each bit. PCIe, or PCI Express, is not a bus; rather, it is a network. The figure below illustrates.

In order to serialize the data onto a single set of wires, the data must be marshaled, which involves framing the data and placing some sort of destination address on each frame. Unlike Ethernet, however, PCIe does not use Media Access Control (MAC) addresses to determine the destination of the framed data. Instead, it maps each device into a memory space. For instance, if you have four devices and 256 bytes of memory (a radical simplification of the real world, just for illustration), you can map the first device into the first 64 bytes of the memory space, the second into the second 64 bytes, and so on. When some other device on the system wants to read from or write to a specific device, it can simply pull from or push to that device's memory space.
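
To make that concrete, here is a small sketch in C of the memory-mapped addressing idea. It is purely illustrative, not real PCIe enumeration or configuration code: the device names, the 256-byte address space, and the 64-byte windows are assumptions carried over from the simplified example above, and a "write" is routed to a device based on nothing more than which window the address falls into.

    /* Illustrative only: four hypothetical devices each own a 64-byte
     * window of a 256-byte address space, and a write is delivered to
     * whichever device's window contains the target address. */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_DEVICES 4

    struct device_window {
        const char *name;
        uint32_t    base;   /* first address owned by this device */
        uint32_t    limit;  /* last address owned by this device  */
    };

    static const struct device_window windows[NUM_DEVICES] = {
        { "gpu0",    0,  63 },
        { "gpu1",   64, 127 },
        { "nic0",  128, 191 },
        { "nvme0", 192, 255 },
    };

    /* "Route" a write by finding which device owns the target address. */
    static void mmio_write(uint32_t addr, uint8_t value)
    {
        for (int i = 0; i < NUM_DEVICES; i++) {
            if (addr >= windows[i].base && addr <= windows[i].limit) {
                printf("write 0x%02x to %s at offset %u\n",
                       value, windows[i].name, addr - windows[i].base);
                return;
            }
        }
        printf("write to unmapped address %u dropped\n", addr);
    }

    int main(void)
    {
        mmio_write(70, 0xab);   /* falls in gpu1's window  */
        mmio_write(200, 0x01);  /* falls in nvme0's window */
        return 0;
    }

The point of the sketch is simply that the "address" is nothing more than a range of memory; there is no separate addressing layer the way there is with Ethernet.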

Addressing Memory Ranges in a PCIe Network

The neat property of PCIe is that every device built to run inside any sort of compute platform has a PCIe interface built into it, from memory to network interfaces to… GPUs. Given this, each device already knows how to translate its data to and from the PCIe network, including any required framing, serialization, and deserialization. The problem you will face when using PCIe as a network, however, is the rather interesting form of addressing just described: there are no addresses, just memory ranges. This suggests a possible solution, similar to the one used in all-optical systems, where the data carried on one wavelength of light can be switched onto another wavelength by optical devices that simply shift its color.

To switch PCIe frames, then, what you need is a system that can virtualize the available memory space and set up channels between devices by switching the destination memory address on the fly. This would allow the PCIe switch to connect a wide range of devices to one another, where each device has every other device mapped into what appears to be a local memory space. One GPU, for instance, could address a particular piece of information to another GPU by placing it on the PCIe network, using the destination GPU's memory block as a sort of destination address. The PCIe switch could remap the memory from the source GPU's address space to the destination GPU's address space, allowing the two devices to communicate at very high speed across a close-to-native interface. In the GPU use case, this functionality is referred to as peer-to-peer communication.
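
Here is a rough sketch, again in C and again purely illustrative (this is not Liqid's implementation, and the window addresses and sizes are invented for the example), of what that remapping step might look like: the switch holds a translation window describing where a peer GPU appears in the source GPU's memory map, and shifts each address into the peer's own local address space before delivering the frame.

    /* Illustrative only: a single translation window maps addresses in
     * the source GPU's view of its peer onto the peer's local address
     * space, which is conceptually what the switch does on the fly. */
    #include <stdio.h>
    #include <stdint.h>

    struct remap_window {
        uint64_t src_base;  /* where the peer appears in the source's map */
        uint64_t dst_base;  /* the peer's real local base address         */
        uint64_t size;      /* window length in bytes                     */
    };

    /* Hypothetical window: gpu1's memory shows up at 0x40000000 in
     * gpu0's address map, but lives at address 0x0 locally on gpu1. */
    static const struct remap_window p2p_window = {
        .src_base = 0x40000000ULL,
        .dst_base = 0x0ULL,
        .size     = 0x10000000ULL,
    };

    /* Translate a source-side address into the peer's local address;
     * returns 0 on success, -1 if the address is outside the window. */
    static int remap(uint64_t src_addr, uint64_t *dst_addr)
    {
        if (src_addr <  p2p_window.src_base ||
            src_addr >= p2p_window.src_base + p2p_window.size)
            return -1;
        *dst_addr = p2p_window.dst_base + (src_addr - p2p_window.src_base);
        return 0;
    }

    int main(void)
    {
        uint64_t local;
        if (remap(0x40000040ULL, &local) == 0)
            printf("source address 0x40000040 maps to gpu1 local 0x%llx\n",
                   (unsigned long long)local);
        return 0;
    }

A real switch maintains many such windows, one per pair of communicating devices, and rewrites addresses in hardware rather than in software, but the translation itself is this simple in concept.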

The Liqid Switch

Liqid makes just such a device: a PCIe switch. This kind of switch can be used to compose a system out of a number of different components, such as shared memory space, GPUs, display outputs, network interfaces, and large-scale storage (such as solid state or spinning drives). Such a system would allow multiple GPUs to communicate with one another, register to register, or through an apparently shared memory space (remember the memory spaces are virtualized and remapped in the switching process, rather than being directly correlated as they might be in more standard applications). This enables the construction of multi-GPU systems, limited only by the scale and speed of the PCIe switch connecting everything together.

This kind of system may not be as fast as the proprietary GPU interconnects being designed and shipped by GPU vendors today, but it is much more flexible, in that a single multicore task can be assigned the "correct amount" of processing power, along with enough memory to carry out the required processing and access to a set of input and output devices. All of this could be done on the fly by configuring the PCIe switch ports into what would look (to the average network engineer) like a set of virtual networks, or VLANs. This kind of system could provide a huge breakthrough for even mid-sized companies that do a lot of data analytics, giving them a cost effective, flexible way to take on analytics jobs such as searching for patterns in customer data or doing speech recognition chores.
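
As one last hypothetical sketch in C, consider how that "VLAN-like" grouping might be modeled: each switch port carries a device and a group ID, and two devices can only reach one another when their group IDs match. The device names and group numbers here are invented for the example and do not reflect any particular product's configuration model.

    /* Illustrative only: ports are tagged with a group ID, much like a
     * VLAN ID, and reachability is decided by comparing group IDs. */
    #include <stdio.h>
    #include <string.h>

    struct port {
        const char *device;  /* device attached to this switch port      */
        int         group;   /* composed-system ID, analogous to a VLAN  */
    };

    static const struct port fabric[] = {
        { "gpu0", 10 }, { "gpu1", 10 }, { "nvme0", 10 },  /* composed system A */
        { "gpu2", 20 }, { "nic0", 20 },                   /* composed system B */
    };

    static int find_group(const char *device)
    {
        for (size_t i = 0; i < sizeof(fabric) / sizeof(fabric[0]); i++)
            if (strcmp(fabric[i].device, device) == 0)
                return fabric[i].group;
        return -1;
    }

    static int same_system(const char *a, const char *b)
    {
        int ga = find_group(a), gb = find_group(b);
        return ga != -1 && ga == gb;
    }

    int main(void)
    {
        printf("gpu0 -> gpu1: %s\n", same_system("gpu0", "gpu1") ? "reachable" : "isolated");
        printf("gpu0 -> gpu2: %s\n", same_system("gpu0", "gpu2") ? "reachable" : "isolated");
        return 0;
    }

Recomposing a system is then just a matter of retagging ports, which is exactly the kind of operational model network engineers already understand.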

This is an interesting field of engineering, and something every engineer should be keeping an eye on, as the network in the rack might, in the future, be a PCIe fabric rather than Ethernet.

About the author

Russ White

Russ White has more than twenty years’ experience in designing, deploying, breaking, and troubleshooting large scale networks. Across that time, he has co-authored more than forty software patents, spoken at venues throughout the world, participated in the development of several internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ is currently a member of the Architecture Team at LinkedIn, where he works on next generation data center designs, complexity, security, and privacy. He is also currently on the Routing Area Directorate at the IETF, and co-chairs the IETF I2RS and BABEL working groups.
