
Pure Accelerates with Direct Memory Module

While the new FlashArray//C is a return to Pure Storage's low-cost origins, the Direct Memory Module (DMM) continues Pure's recent pursuit of the lowest latency and highest throughput. The DMM is a new tier of storage for the FlashArray//X; it is larger and slightly slower than DRAM but smaller and faster than the TLC NAND flash in the Direct Flash Modules (DFMs) that provide data persistence. A FlashArray//X with Direct Memory Modules installed can have read latency as low as 150us, compared to 250us without the DMM. These Direct Memory Modules will help some Pure Storage customers achieve better critical application performance, although they are not right for every FlashArray.

Many Optanes

Each DMM contains a 750GB Optane drive in U.2 packaging, a 2.5” drive with NVMe connectivity. U.2 is a specification for PCIe devices that do not plug into motherboard expansion slots; instead, a SAS/SATA-style connector carries the PCIe lanes to the drive. Each Optane drive is bolted to the same type of carrier as the Direct Flash Module (DFM) used for the capacity tier of FlashArray//X, so it simply plugs into the standard drive slots. DMMs come in sets of either four or eight modules, which must be placed in the left-most slots of either a FlashArray//X70R or a FlashArray//X90R, not in a DirectFlash shelf.

I suspect that the lower-specification FlashArray models simply don't have the CPU and memory to get sufficient benefit from the DMM. Another useful characteristic of Optane is its high endurance, up to 60 Drive Writes Per Day (DWPD), compared with 1-5 DWPD for the TLC flash in the capacity tier. This endurance allows data to be migrated into the cache, aged out, and replaced with new data far more frequently than the deduplicated and compressed blocks change on the TLC flash. Each data block is decompressed as it is migrated into the DMM read cache, although it remains deduplicated. The decompressed data does use more of the DMM capacity, but it also removes a small amount of latency and CPU load for subsequent reads of the same block.
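To put that endurance number in context, here is a rough back-of-envelope calculation, a sketch of my own rather than Pure's sizing method, showing how much daily cache churn the Optane modules could sustain. The 750GB capacity and the DWPD figures come from the numbers above.

    # Back-of-envelope endurance estimate for the DMM cache tier.
    # Figures from the article: 750GB Optane modules rated at up to
    # 60 Drive Writes Per Day (DWPD), versus 1-5 DWPD for TLC flash.

    MODULE_CAPACITY_GB = 750
    OPTANE_DWPD = 60
    TLC_DWPD = 5  # upper end of the 1-5 DWPD range quoted for TLC

    for modules in (4, 8):
        cache_gb = modules * MODULE_CAPACITY_GB
        optane_churn_tb = cache_gb * OPTANE_DWPD / 1000
        tlc_churn_tb = cache_gb * TLC_DWPD / 1000
        print(f"{modules} modules: {cache_gb / 1000:.0f}TB cache, "
              f"~{optane_churn_tb:.0f}TB/day rewrite budget at Optane endurance "
              f"vs ~{tlc_churn_tb:.0f}TB/day at TLC endurance")

At four modules that works out to roughly 180TB of cache rewrites per day within endurance, which is why the cache contents can be turned over so aggressively.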

Massive Read Cache

Without the DMM, data on a FlashArray is read directly from the TLC NAND and returned to clients. The RAM in the controllers could be used as a cache; however, RAM is expensive and better used as a cache of metadata, so data reads usually come from the NAND. The DMM provides a read cache of either 3TB or 6TB (four or eight 750GB Optane modules), with a media read latency of less than 30us. Standard TLC drives have a media read latency of 150-200us, so the DMM can shave around 150us off the array response time for data that is in the cache. Writes don't go near this cache; they have always been sent to NVRAM for even lower write latency.
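As a rough illustration of what that latency gap means in practice, the sketch below estimates effective media read latency as a function of cache hit rate, using the figures quoted above; the hit rates themselves are hypothetical.

    # Hypothetical effective read latency for a given cache hit rate,
    # using the latencies quoted in the article: ~30us media reads from
    # the DMM cache and 150-200us media reads from TLC NAND.

    DMM_READ_US = 30
    TLC_READ_US = 175  # midpoint of the 150-200us range

    def effective_read_latency_us(hit_rate: float) -> float:
        """Weighted average of cache-hit and cache-miss media latency."""
        return hit_rate * DMM_READ_US + (1 - hit_rate) * TLC_READ_US

    for hit_rate in (0.0, 0.5, 0.9, 1.0):
        print(f"hit rate {hit_rate:.0%}: "
              f"~{effective_read_latency_us(hit_rate):.0f}us media latency")

The benefit scales with the hit rate, so the workloads that win are those whose frequently read data actually fits in, and stays in, the cache.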

Deployment Challenge

The DMMs must be placed in the first four or eight slots in the FlashArray; this is also where the first storage pack of Direct Flash Modules (DFMs) is installed in a FlashArray without the DMM. The result is that existing FlashArrays will have DFMs in these slots, which makes adding DMMs to an existing array difficult. I expect most DMM deployments will be new arrays, probably arrays that unlock new applications with their much lower read latency. If I had to guess why the DMM must go in these first module slots, I would say that the remaining slots use PCIe switches, so they share bandwidth. The Intel 750GB Optane in its U.2 form is capable of a little over 2GBps read or write throughput, more than the throughput of two PCIe 3 lanes.
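To sanity-check that guess, here is a quick bandwidth comparison, my own arithmetic rather than anything from Pure's documentation, between the drive's throughput and a pair of PCIe 3.0 lanes.

    # Rough bandwidth check: does a 750GB U.2 Optane drive need more than
    # two PCIe 3.0 lanes? PCIe 3.0 runs at 8GT/s per lane with 128b/130b
    # encoding, roughly 985MB/s of usable bandwidth per lane.

    PCIE3_LANE_MBPS = 8_000 * 128 / 130 / 8   # ~985 MB/s per lane
    OPTANE_THROUGHPUT_MBPS = 2_200            # "a little over 2GBps"

    two_lanes_mbps = 2 * PCIE3_LANE_MBPS
    print(f"Two PCIe 3.0 lanes: ~{two_lanes_mbps:.0f}MB/s")
    print(f"Optane U.2 drive:   ~{OPTANE_THROUGHPUT_MBPS}MB/s")
    print("Drive exceeds two lanes:", OPTANE_THROUGHPUT_MBPS > two_lanes_mbps)

Two shared lanes would be a bottleneck for the Optane drive, which fits the idea that the first slots get the full, unswitched PCIe connection.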

Fast Frequently Read Data

A read cache like the DMM helps workloads that frequently read a data set smaller than the cache. For example, a simulation application that needs to make 1,000 passes over a 2TB data set with slightly different parameters each time is ideal for a 3TB cache, cutting read latency from around 250us to 150us. The 1,000 simulations might be run in parallel on 50 physical servers; using the DMM for the shared data is likely to be more cost-effective than putting 2TB more RAM in each of those servers. On the other hand, if that data set is 10TB in size, then the cache will be insufficient. Equally, a transaction processing database that is 2TB in size will be dominated by writes, so the read cache will have little effect.
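A simple way to reason about whether a workload will benefit is to check that its read working set fits in the cache and that reads dominate. The sketch below encodes that rule of thumb; the read-fraction threshold and the workload figures are hypothetical, chosen only to mirror the examples above.

    # Hypothetical rule of thumb for judging whether the DMM read cache
    # helps a workload: the frequently read data must fit in the cache
    # and the workload must be read-dominated.

    DMM_CACHE_TB = 3  # four-module configuration

    def dmm_likely_to_help(working_set_tb: float, read_fraction: float) -> bool:
        """True if the read working set fits the cache and reads dominate."""
        return working_set_tb <= DMM_CACHE_TB and read_fraction >= 0.7

    workloads = {
        "simulation, 2TB data set, mostly reads": (2, 0.95),
        "simulation, 10TB data set, mostly reads": (10, 0.95),
        "2TB OLTP database, write-heavy": (2, 0.3),
    }
    for name, (size_tb, read_fraction) in workloads.items():
        verdict = "likely to benefit" if dmm_likely_to_help(size_tb, read_fraction) else "unlikely to benefit"
        print(f"{name}: {verdict}")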

About the author

Alastair Cooke

I am a self-employed consultant and writer based in New Zealand. Much of my work is communicating how IT infrastructure works and what it means for IT professionals. Outside of work I'm datacenter and VMware focussed, having created AutoLab, a free tool that simplifies the creation of a vSphere test or training lab.
