Many people involved in artificial intelligence don’t spend much time considering storage infrastructure. That’s the topic of this episode of Utilizing AI, which features Ace Stryker of Solidigm discussing the role of flash storage in AI infrastructure with Frederic Van Haren and Stephen Foskett. Considering the cost of GPUs and the rest of the AI stack, idle time is the enemy. That’s why it’s critical to have a low-latency storage layer to support tasks like training.
Storage Is Critical to the Future of AI – A Conversation with Solidigm
Artificial intelligence is making enterprises thirsty for data storage. Training AI involves shoveling vast volumes of data into models, and that is creating new pressure on organizations to scale their existing storage solutions.
In this episode of the Utilizing AI Podcast brought to you by Tech Field Day, part of The Futurum Group, hosts Stephen Foskett and Frederic Van Haren talk with Ace Stryker, Director of Product Marketing at Solidigm, about the increasing significance of storage in AI and the full picture of cost and performance.
A More Advanced and Intensive Breed of Storage Is in Order
Too often, the focus falls squarely on compute, as the priority is to meet the immediate and most pressing need. “[Storage] is certainly not the first thing that folks think of when they’re designing an AI server. Our research suggests that 60 to 90% of the spend goes towards compute. That is the 800-pound gorilla and the primary concern for someone designing a system for this kind of work,” notes Stryker.
From a storage standpoint, spending money on solutions that provide performance and capacity beyond the immediate AI use case pays off in the long run. A comprehensive storage deployment strategy not only makes the infrastructure AI-ready, but future-proofs it for any data-intensive innovation.
At the edge, performance and capacity needs are on the rise. Growing regulatory demands are also pressing companies to maintain longer data trails, both to improve model accuracy and to retain inference data for compliance reasons.
“There are lots of places where storage plays an important role, particularly as it relates to the powerful and expensive compute components, making sure those are maximally utilized and you’re getting the most out of the money you spent on the important components in an AI server,” he emphasizes.
For GPUs to operate at full potential, data needs to be stored somewhere that allows fast, low-latency access. Data prepping is a read-heavy process. Raw data is extracted from storage systems and put through an ETL (extract, transform, load) pipeline, where it is cleaned up, deduplicated, and tokenized before being exposed to the models.
“That’s done in a deliberately random way to prevent over-biasing, or creating any kind of issues with the usefulness of the model itself. That’s a lot of random read activity,” he says.
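The prep flow described above (clean, deduplicate, tokenize, then shuffle deliberately at random) can be sketched in a few lines. This is a toy illustration, not Solidigm's or any real pipeline; the whitespace "tokenizer" stands in for a real one:

```python
import random

def prepare_training_data(raw_records, seed=42):
    """Toy ETL-style prep pass: clean, dedupe, tokenize, shuffle."""
    # Transform: basic cleanup (trim whitespace, lowercase, drop empties)
    cleaned = [r.strip().lower() for r in raw_records if r.strip()]

    # Deduplicate, preserving first occurrence
    seen, deduped = set(), []
    for rec in cleaned:
        if rec not in seen:
            seen.add(rec)
            deduped.append(rec)

    # Tokenize: naive whitespace split stands in for a real tokenizer
    tokenized = [rec.split() for rec in deduped]

    # Shuffle deliberately at random -- this step is what turns training
    # reads into the random-read pattern the storage layer must absorb
    random.Random(seed).shuffle(tokenized)
    return tokenized

samples = prepare_training_data(["The cat", "the cat", "A dog  ", ""])
```

The shuffle at the end is the key storage detail: once sample order is randomized, reads against the dataset no longer arrive sequentially, which favors flash over spinning disks.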
Businesses are stretching their existing storage solutions to absorb prodigious amounts of data. But hard drives, though known for deep capacity, are failing to meet the needs of AI workloads. Bottomless capacity is not the only ask, says Stryker.
Within the training process, checkpoints or recovery points need to be written periodically. Checkpointing saves the intermediate results of training from time to time, so that a full recovery is possible in the event of failure or shutdown. The interval can be as short as 30 minutes or as long as a few hours.
Regardless, checkpointing involves heavy writing, says Stryker, and can go up to several terabytes at a time. “To minimize the failures, you need to be doing a lot of checkpointing, and the storage system is key to that overall training experience.”
Slow storage is a colossal stumbling block in that process. “The slower your storage, the longer it is going to take, and while your checkpoints are being written, those expensive GPUs and all that great hardware is sitting idle and waiting for that to complete.”
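The stall Stryker describes can be made concrete with a toy loop that checkpoints every N steps and times how long the "compute" sits waiting on the write. This is a minimal sketch (real frameworks checkpoint multi-terabyte state, often on wall-clock intervals; the step count and state here are placeholders):

```python
import os
import pickle
import tempfile
import time

def train_with_checkpoints(steps, ckpt_every, state):
    """Toy training loop that checkpoints every `ckpt_every` steps.

    Returns (seconds stalled writing checkpoints, checkpoint path).
    """
    path = os.path.join(tempfile.gettempdir(), "toy_ckpt.pkl")
    stall = 0.0
    for step in range(1, steps + 1):
        state["step"] = step                 # stand-in for a real training step
        if step % ckpt_every == 0:
            t0 = time.monotonic()
            with open(path, "wb") as f:      # while this write runs, the
                pickle.dump(state, f)        # GPUs would be sitting idle --
            stall += time.monotonic() - t0   # slower storage = longer stall
    return stall, path

stall, path = train_with_checkpoints(100, 25, {"weights": [0.0] * 1000})
```

Scaling the same logic up, a multi-terabyte checkpoint over slow storage turns that `stall` counter into minutes of idle time on the most expensive components in the rack.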
A New Class of AI Use Cases at the Edge
The rise of the edge underscores the importance of having high-speed access to data. “As more and more deployments are going towards the edge, it’s critical to have low latency where you can read in your inputs, whether it’s someone typing into a chatbot, a security camera somewhere, or whatever that source might be, to run that through the model,” says Stryker.
More companies are tapping into edge computing to process data physically close to the source. These use cases demand storage infrastructures that can support the split-second requirements of those workloads.
“As more and more intense work needs to be done at the edge, more real-time inferencing, lightweight training and reinforcement learning happening, and you’re not feeding your inputs all the way back to the core datacenter to initiate another round of training, you might just be refining locally and making smaller improvements.”
The edge storage demands are as variegated as the use cases themselves. “When you layer on things like concurrency where multiple data scientists work with the dataset at the same time, or multi-tenancy, where a single system resource is used on different projects entirely at the same time, it generates these really complicated mixed read/write IO profiles.”
Serious Performance at Best Value
Solidigm’s line of low-latency storage solutions is built to support these mixed edge applications. “Rather than the requirements of the architecture being dictated by the limitations of the hardware as more hardware comes online and becomes more capable, those requirements are going to be dictated by what’s ideal for the use case,” Stryker predicts.
With one product in particular, the D5-P5336, Solidigm has carved out a name in the crowded SSD market. The P5336 is an incredibly fast SSD, and one that has ramped up the maximum capacity from 28TB to 61TB on a single device. This is a big TCO optimizer, says Stryker, as it not only lets users buy fewer drives, but also yields servers with lower cooling and power needs.
Solidigm SSDs come in purpose-built form factors designed to meet the unique requirements of edge deployments. The Ruler, for example, is a long, skinny 12-inch form factor specifically fashioned to optimize density, allowing more drives to be packed into less space.
“You can fit far more devices in a datacenter rack using one of these new form factors than you ever could with the legacy form factors that we inherited from the hard drive world.”
Coming back to the premise, is fast storage life-and-death for AI? “Slow storage will certainly be an impediment always. There is a threshold under which you simply would not want to go for AI work,” he replies.
Wrapping Up
Fast storage opens virtually limitless possibilities, delivering high performance and keeping GPUs fully utilized. And knowing that enterprise storage needs are sure to overshoot current estimates in the near future, investing in a storage solution that offers AI-grade performance and speed may well be a sound decision and an optimal spend.
Head over to Solidigm’s dedicated AI landing page to learn about Solidigm’s SSD architecture, customer stories, and other conversations on storage. Be sure to check out Solidigm’s presentations from the AI Field Day event for a deep dive into the technology.
Podcast Information
- Stephen Foskett is the Publisher of Gestalt IT and Organizer of Tech Field Day, now part of The Futurum Group. Find Stephen’s writing at Gestalt IT and connect with him on LinkedIn or on X/Twitter.
- Frederic Van Haren is the CTO and Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on LinkedIn or on X/Twitter and check out the HighFens website.
- Ace Stryker is the Director of Product Marketing at Solidigm. You can connect with Ace on LinkedIn and learn more about Solidigm and their AI efforts on their dedicated AI landing page or watch their AI Field Day presentations from the recent event.
Thank you for listening to Utilizing AI, part of the Utilizing Tech podcast series. If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving us a rating and a nice review on Apple Podcasts or Spotify. This podcast was brought to you by Tech Field Day, now part of The Futurum Group. For show notes and more episodes, head to our dedicated Utilizing Tech Website or find us on X/Twitter and Mastodon at Utilizing Tech.
Gestalt IT and Tech Field Day are now part of The Futurum Group.