As the tides of information technology ebb and flow, these are the days of our lives. For years, massive storage networks and data networks were built in parallel, converging only at the specific points where storage resources were actually consumed.
Complex storage networks have long been built to serve HPC (high-performance computing) systems and other specialty datacenter applications. However, with its relatively low cost-to-performance ratio and ready availability, NVMe, extended across the network as NVMe-oF (NVMe over Fabrics), has emerged as the next logical evolution.
What Is NVMe-oF?
NVMe-oF delivers latency comparable to specialty networking standards such as InfiniBand while expanding the availability and deployment model to a more standard and familiar base. Transports such as plain TCP over standard Ethernet, iWARP, and RoCE can be leveraged with a standard interface card such as the versatile Intel Ethernet 800 Series, described in this Tech Field Day presentation.
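To make that concrete, here is a minimal sketch of what "same host, different transport" looks like in practice on a Linux initiator, assuming nvme-cli is installed and the appropriate nvme-tcp or nvme-rdma kernel modules are loaded. The target address and port are placeholders; only the transport flag changes between an RDMA fabric (iWARP/RoCE) and plain TCP.

```python
# Minimal sketch: the same nvme-cli front end discovers NVMe-oF targets
# over either an RDMA transport (iWARP/RoCE) or plain TCP.
# The address and port below are placeholders, not real endpoints.
import subprocess

def discover(transport: str, traddr: str, trsvcid: str = "4420") -> str:
    """Run an NVMe-oF discovery against a target over the given transport."""
    result = subprocess.run(
        ["nvme", "discover", "-t", transport, "-a", traddr, "-s", trsvcid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Same host-side workflow whether the fabric is RDMA- or TCP-based.
print(discover("rdma", "192.168.0.10"))   # iWARP/RoCE path
print(discover("tcp", "192.168.0.10"))    # plain Ethernet/TCP path
```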
What Does This Mean Operationally?
NVMe-oF is arguably an extremely efficient block storage protocol for datacenter fabrics. Because NVMe-oF is a standard and uses the same base architecture and host software stack as NVMe over PCIe, the same host adapters can be deployed across large swaths of datacenter hardware, a win from both an operational and a capital standpoint. While this may seem minor, the operational overhead it relieves is substantial from a sparing and support perspective. When every system, whether hypervisor, storage, head node, or compute node, leverages standard hardware, the result is a significantly more efficient operational model.
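The "same host software stack" point is easy to see on a Linux host: local PCIe drives and fabric-attached namespaces all surface through the same NVMe subsystem, as ordinary /dev/nvmeXnY block devices. A minimal sketch, assuming a Linux host with sysfs available:

```python
# Minimal sketch (Linux): list NVMe controllers and their transport.
# Local PCIe drives and fabric-attached controllers appear side by side
# under /sys/class/nvme because they share the same host stack.
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    transport = (ctrl / "transport").read_text().strip()  # pcie, tcp, rdma, fc
    model = (ctrl / "model").read_text().strip()
    print(f"{ctrl.name}: transport={transport} model={model}")
```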
Operational overhead shrinks further with transports such as NVMe over TCP, which works with both the standard network configuration model and standard Ethernet switching infrastructure. This also lends itself very well to a controlled rollout at scale: it does not require a forklift upgrade to most existing architectures, with the possible exception of adding new 800 Series interface cards, which is likely required anyway when expanding or introducing a new storage platform. A sketch of what attaching an NVMe/TCP namespace looks like follows below.
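As a rough illustration of how little ceremony is involved, the following sketch attaches a remote NVMe/TCP namespace from a Linux host over ordinary Ethernet, again assuming nvme-cli is installed. The NQN, address, and port (4420 is the conventional NVMe/TCP port) are placeholders.

```python
# Minimal sketch: connect to an NVMe/TCP target with nvme-cli over
# ordinary Ethernet switching -- no special fabric required.
# The NQN and address below are placeholders.
import subprocess

def connect_tcp(nqn: str, traddr: str, trsvcid: str = "4420") -> None:
    """Attach a remote NVMe/TCP namespace; it appears as a local /dev/nvmeXnY."""
    subprocess.run(
        ["nvme", "connect", "-t", "tcp", "-n", nqn, "-a", traddr, "-s", trsvcid],
        check=True,
    )

connect_tcp("nqn.2023-01.example.com:storage-array-01", "192.168.0.10")
```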
What Does It Mean for Performance?
A fair bit, as it turns out. TCP, while the standard for reliable transport, can be a bit of a bear in some instances (see the creation and heavy use of QUIC by Google and others to work around TCP performance issues, among other things). A wealth of efficiencies can be gained through ADQ (Application Device Queues), natively supported in the 800 Series. ADQ steers an application's traffic into dedicated hardware queues and provides flexible congestion control, which is very useful in scaled-out, TCP-based applications and can reduce latency and jitter across large fabrics.
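For a sense of the mechanism, here is a rough sketch of the generic Linux tc side of that idea: carving out a hardware traffic class and steering NVMe/TCP traffic (port 4420) into it. The interface name and queue layout are assumptions, and a full ADQ deployment on the E810/ice driver involves additional Intel-documented, driver-specific settings; this is only the standard traffic-classification half, shown for illustration.

```python
# Rough sketch only: classify inbound NVMe/TCP traffic into its own
# hardware traffic class with standard Linux tc tooling (requires root).
# Interface name, queue counts, and priority map are assumptions; full
# ADQ setup on the ice driver has further Intel-documented options.
import subprocess

IFACE = "eth0"  # placeholder interface name

def run(cmd: str) -> None:
    subprocess.run(cmd.split(), check=True)

# Split the NIC queues into two traffic classes offloaded to hardware.
run(f"tc qdisc add dev {IFACE} root mqprio num_tc 2 "
    f"map 0 0 0 0 0 0 0 1 queues 8@0 8@8 hw 1 mode channel")

# Match NVMe/TCP traffic (port 4420) in hardware and send it to TC 1.
run(f"tc qdisc add dev {IFACE} clsact")
run(f"tc filter add dev {IFACE} protocol ip ingress "
    f"flower ip_proto tcp dst_port 4420 skip_sw hw_tc 1")
```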
As more and more environments migrate toward hybrid cloud or on-prem cloud solutions for some or all of their workload, hosting, and general compute needs, simplification is paramount for both long- and short-term supportability.
In the do-more-with-less IT world, flexible, scalable solutions with, most notably, lower overhead to procure, implement, and support are not just a nice-to-have. They are becoming a requirement.
As more and more interconnects migrate away from specialty transports and onto what has functionally become the de facto standard of Ethernet, the potential advantages are pretty straightforward. This is true for interconnects both outside the datacenter, such as DCI (datacenter interconnect) and FTTx (fiber to the curb/home/premises), and inside it, such as NVMe-oF. Engineers no longer need to learn as many specialty protocols; they can converge onto Ethernet and reduce the complexity to what is on the network rather than what the network is made of.