There are plenty of reasons to consider QLC flash storage for our next storage system. Replacing aging infrastructure provides an opportune time to revisit earlier decisions and update our thinking based on what has changed in the past 3-5 years.
We have more information now, both on how flash performs in the real world and on what actually happened compared to our predictions. Assumptions made years ago need to be re-examined lest we mindlessly carry outdated ideas into the future, compromising our ability to provide high-quality infrastructure to our customers.
When flash first started to be used, we discovered issues with the durability of MLC and TLC flash compared to SLC. Most of those issues have now been addressed by vendors using updated techniques such as smarter placement and wear-levelling algorithms. We also have a wealth of real-world knowledge about actual drive endurance compared to theoretical predictions.
Several large-scale studies show that, on average, enterprise flash drives still have around 85% endurance remaining when they get replaced. We’re systematically over-spending on endurance we don’t end up using!
There are also lingering concerns about QLC performance. Yet for read-intensive workloads we now know that QLC flash performs just as well as TLC flash, and vastly better than the alternative of HDD-based systems.
Clearly our outdated thinking on QLC flash needs a refresh.
Where QLC makes sense
Generally when considering QLC, we’re comparing it to TLC flash, or HDD-based systems, or perhaps hybrid flash+HDD systems.
The range of workloads that QLC is suitable for continues to grow, while the role of HDD-based systems shrinks to a niche of nearline or offline secondary storage. Expectations of low-latency, high-throughput performance continue to rise, and HDDs simply can’t keep up.
Flash is the clear choice here.
We thus arrive at deciding between TLC and QLC flash for online capacity storage.
How to choose between TLC and QLC flash
QLC makes the most sense for read-intensive workloads and for large streaming (sequential) writes that need low latency and high throughput across a large dataset. CDNs, media systems, and AI/ML processing are just some of the workloads that are well suited to QLC flash.
TLC flash provides greater resilience for workloads dominated by small, random writes. But as we’ve found, a kind of endurance "range anxiety" (the fear of wearing drives out before their time) is causing some designers to overspend on TLC flash for read-heavy workloads. This has the added downside of needlessly limiting capacity on the systems that need capacity the most.
It’s far better to take the 20-30% savings of QLC flash over TLC flash and spend it on additional capacity, extending the lifetime of a read-optimized capacity system while reducing operational complexity and maintenance issues. In fact, given what we know about the actual endurance of QLC flash in the field, we can have confidence to push these QLC systems harder than we might have done in the past. In a time of doing more with less, this is compelling.
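To make the trade-off concrete, here is a minimal sketch of the arithmetic. It assumes the 20-30% saving applies per terabyte and that the whole saving is spent on more capacity. The budget and the TLC price are hypothetical placeholders, not figures from any vendor.

```python
def capacity_per_budget(budget: float, price_per_tb: float) -> float:
    """Terabytes purchasable for a given budget at a flat price per TB."""
    return budget / price_per_tb


def extra_capacity_from_discount(discount: float) -> float:
    """Fractional capacity gain when a per-TB discount is spent on capacity."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount) - 1


if __name__ == "__main__":
    budget = 100_000.0        # hypothetical budget, any currency
    tlc_price_per_tb = 100.0  # hypothetical TLC price per TB

    for discount in (0.20, 0.30):
        # Assumption: QLC is cheaper than TLC by `discount` per terabyte.
        qlc_price_per_tb = tlc_price_per_tb * (1 - discount)
        tlc_tb = capacity_per_budget(budget, tlc_price_per_tb)
        qlc_tb = capacity_per_budget(budget, qlc_price_per_tb)
        gain = extra_capacity_from_discount(discount)
        print(f"{discount:.0%} cheaper per TB: "
              f"{tlc_tb:.0f} TB TLC vs {qlc_tb:.0f} TB QLC (+{gain:.0%})")
```

Under these assumptions, a 20% per-terabyte saving buys 25% more capacity for the same spend, and a 30% saving buys roughly 43% more, because the gain is 1 / (1 - discount) - 1 rather than the discount itself.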
Power, cooling, and rack space consumption are all lower with QLC-based flash compared to equivalent TLC-based systems, and substantially lower than HDD systems. For a large estate, the savings add up quickly, and a quick look at the world outside should remind us that these cost pressures are only going to increase.
Modern QLC devices can also report real-time health data, including their likely endurance lifetime based on actual usage. We don’t need to guess; we can know the overall endurance of the system. Not only can we plan better, we can also automate workload placement based on this information by integrating it into modern workload orchestration systems.
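As an illustration of how this health data might feed planning and placement, the sketch below projects remaining drive life from two values that NVMe drives expose in their SMART / Health Information log: "Percentage Used" and "Power On Hours". The straight-line extrapolation, the `DriveHealth` type, and the placement helper are all assumptions made for this example. A real orchestration integration would look different and would likely use a less naive wear model.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class DriveHealth:
    """Hypothetical container for values read from a drive's health log."""
    name: str
    percentage_used: float  # vendor's estimate of rated endurance consumed
    power_on_hours: float


def projected_years_remaining(drive: DriveHealth) -> float:
    """Linearly extrapolate remaining life from wear observed so far.

    Assumption: future wear continues at the average rate seen to date.
    """
    if drive.percentage_used <= 0 or drive.power_on_hours <= 0:
        return float("inf")  # no measurable wear yet
    wear_per_hour = drive.percentage_used / drive.power_on_hours
    remaining_pct = max(0.0, 100.0 - drive.percentage_used)
    return remaining_pct / wear_per_hour / HOURS_PER_YEAR


def pick_for_write_heavy(drives: list[DriveHealth]) -> DriveHealth:
    """Naive placement rule: send write-heavy work to the drive with the
    most projected life left."""
    return max(drives, key=projected_years_remaining)


if __name__ == "__main__":
    fleet = [
        DriveHealth("qlc-01", percentage_used=10.0, power_on_hours=8760.0),
        DriveHealth("qlc-02", percentage_used=60.0, power_on_hours=8760.0),
    ]
    for d in fleet:
        print(f"{d.name}: ~{projected_years_remaining(d):.1f} years remaining")
    print("place write-heavy workload on:", pick_for_write_heavy(fleet).name)
```

In this example, a drive that has consumed 10% of its rated endurance in one year projects to about nine more years at the same rate, so the placement rule prefers it over the more worn drive.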
From a product quality and reliability standpoint, there is no difference between a leading data-center-class QLC drive and a TLC drive. Both have similar projected annual failure rates (AFR), and both are developed and validated against the same industry specifications, JESD218 and JESD219. This means they have the same operating temperature range and provide the same three months of data retention at end of life.
Clearly, we should not only be considering QLC flash, but actively investigating it first and falling back to TLC-based systems when needed. Wasting money out of misplaced fear actively harms our ability to react to future capacity demands that we know will arrive.
Such wasteful decisions are also likely to come under greater scrutiny in the changed economic environment. Why not preempt this scrutiny and present options that both save money and improve the performance and longevity of our storage? Surely we have the ability to revisit old assumptions and make new decisions based on the best evidence available?