A Solidigm Sponsored Tech Note

Modern Flash Endurance with Solidigm

Modern flash endurance is dramatically better than early flash storage, but outdated assumptions about endurance persist. Range anxiety is causing many storage designers to invest in endurance they don’t need, instead of investing in the performance and capacity their customers actually need.

It turns out that most of the time, buyers aren't in any danger of wearing out their flash. This should prompt a reconsideration of how endurance is factored into flash system design.

Measuring Lifetime Writes

There are two main approaches vendors use to describe drive endurance on their spec sheets: drive writes per day (DWPD) or total bytes written (usually as terabytes written (TBW)). They’re subtly different.

DWPD doesn't really match the way users think about writes when using a system. They tend to think in terms of throughput, which is about total bytes. At a given throughput rate, how long will the system last? That question is easier to answer when endurance is expressed as total bytes written.

A 7.68 TB flash drive with a rating of 1 DWPD would be able to handle 14,016 TB of writes before its industry standard 5-year warranty expires. That means for 20,000 TB of endurance, the drive needs to be bigger, right?

Not necessarily. Some 7.68 TB drives are rated differently for random vs. sequential writes. This is because sequential writes are more predictable, and the drive firmware can better manage the way writes are performed, extending the life of the drive. Sequential-write optimized 7.68 TB drives can support over 20,000 TB of writes in 5 years.

Costly mistakes can be sidestepped by translating vendor drive specs into a common language, such as total lifetime bytes. And if the workload is matched to a drive optimized for that kind of workload, better investment choices can be made.
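The arithmetic behind this common language is straightforward. As a minimal sketch (the function names are illustrative, not a vendor API; the 7.68 TB, 1 DWPD, 5-year figures are the example from above):

```python
# Convert a drive's DWPD rating into total terabytes written (TBW) over
# its warranty period, then estimate lifetime at a given write rate.

def dwpd_to_tbw(capacity_tb: float, dwpd: float, warranty_years: float = 5.0) -> float:
    """Total terabytes the rating allows over the warranty period."""
    return capacity_tb * dwpd * 365 * warranty_years

def years_at_throughput(tbw: float, tb_written_per_day: float) -> float:
    """How long the endurance budget lasts at a steady write rate."""
    return tbw / (tb_written_per_day * 365)

tbw = dwpd_to_tbw(7.68, 1.0)  # the article's example: 14,016 TB
print(f"Rated endurance: {tbw:,.0f} TB")

# A hypothetical workload writing 4 TB/day would take ~9.6 years to
# exhaust that budget, well past the 5-year warranty.
print(f"Lifetime at 4 TB/day: {years_at_throughput(tbw, 4.0):.1f} years")
```

Expressed this way, two drives with very different DWPD ratings and capacities can be compared directly against the workload's actual daily write volume.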

In fact, users probably don’t even need the endurance they think they need.

Actual Endurance

A large study of drive endurance found that most drives had used less than 15% of their predicted lifetime endurance. Drives simply don't get used enough to wear out in most enterprise installations.

It turns out that drive firmware appears to be the most significant factor affecting drive reliability. Firmware bugs are more likely to trigger the problems that cause drives to be replaced than the overuse that would wear them out early.

A vendor's familiarity with workloads and drive behavior, which feeds back into firmware improvements, is therefore something to keep in mind when selecting a flash drive vendor.

Given the discussion of drive endurance specs above, focusing merely on DWPD or total bytes written without an understanding of workloads would be a mistake.


Figure: Annual replacement rates for different drive families, grouped by firmware version. (Source: Stathis Maneas et al., "A Study of SSD Reliability in Large Scale Enterprise Storage Deployments," 2020, 137–49, https://www.usenix.org/conference/fast20/presentation/maneas)

A Word on PE Cycles

The rule of thumb used to be that each step up in cell-level technology (single-, to multi-, to triple-level) dropped the number of PE cycles by about a factor of 10. Thus, with SLC cells starting at around 100,000 lifetime PE cycles, TLC cells would end up with around 1,000 PE cycles.

This log-linear approach is easy to remember, but it’s not quite accurate. Enterprise MLC cells were able to get 3 times the PE cycles of regular MLC (30,000 instead of 10,000), and TLC cells also managed 3x the log-linear predicted cycles (3,000 instead of 1,000).

When the discussion turns to QLC 3D NAND, things are very different. Instead of the rule-of-thumb predicted 100 PE cycles, one can get at least 1,000. In fact, modern QLC 3D NAND drives, with their vastly more sophisticated firmware, regularly achieve 2,000 or even 3,000 PE cycles, which brings them into TLC-level endurance territory.
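The gap between the log-linear rule of thumb and reality can be sketched with the figures cited above (these are the article's illustrative numbers, not a product spec):

```python
# Compare the old log-linear rule of thumb (10x fewer PE cycles for each
# extra bit per cell) against the actual figures mentioned in the text.

SLC_BASELINE = 100_000  # lifetime PE cycles assumed for SLC

def rule_of_thumb_pe_cycles(bits_per_cell: int) -> int:
    """Predicted PE cycles: divide by 10 for each bit beyond SLC."""
    return SLC_BASELINE // 10 ** (bits_per_cell - 1)

# Actual figures from the article (illustrative, not a vendor datasheet)
actual = {
    "MLC (2 bits)": 10_000,
    "eMLC (2 bits)": 30_000,
    "TLC (3 bits)": 3_000,
    "QLC 3D NAND (4 bits)": 1_000,  # with modern drives reaching 2,000-3,000
}

for name, bits in [("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    print(f"{name}: rule of thumb predicts {rule_of_thumb_pe_cycles(bits):,} PE cycles")
```

The rule of thumb predicts 100 PE cycles for QLC; the at-least-1,000 figure above shows why decisions based on the old heuristic undervalue modern QLC by an order of magnitude.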

Naïvely assuming that there’s a substantial difference in PE cycles between modern QLC flash and TLC flash is going to lead to wrong decisions. It’d be foolish to rest on assumptions and not look at what modern drives are actually capable of.

Conclusion

Modern enterprise flash lasts much longer in real-life situations than users tend to assume. Drive reliability has very little to do with theoretical endurance under most enterprise conditions.

It's time to get over range anxiety and start looking at the wealth of real data that modern storage systems offer. If the industry can let go of outdated fears, buyers may well be able to invest in storage systems that provide superior results for them.

About the author

Justin Warren

Justin is Chief Analyst and managing director of PivotNine, a firm that advises vendors on positioning and marketing, and customers on how to evaluate and use technology to solve business problems. He is a regular contributor at Forbes.com, CRN Australia, and iTNews.
