
Building an AI Training Data Pipeline with VAST Data | Utilizing Tech 07×02

Model training seriously stresses data infrastructure, but preparing that data for use is an even more difficult challenge. This episode of Utilizing Tech features Subramanian Kartik of VAST Data discussing the broad data pipeline with Jeniece Wnorowski of Solidigm and Stephen Foskett. The first step in building an AI model is collecting, organizing, tagging, and transforming data. Yet this data is spread around the organization in databases, data lakes, and unstructured repositories. The challenge of building a data pipeline is familiar to most businesses, since a similar process is required for analytics, business intelligence, observability, and simulation, but generative AI applications have an insatiable appetite for data. These applications also demand extreme levels of storage performance, and only flash SSDs can meet this demand. A side benefit is improved power consumption and cooling versus hard disk drives, especially as massive SSDs come to market. Ultimately the success of generative AI will drive greater collection and processing of data on the inferencing side, perhaps at the edge, and this will drive AI data infrastructure further.

Apple Podcasts | Spotify | Overcast | More Audio Links | UtilizingTech.com

Storage, Not an Afterthought in AI – A Conversation with VAST Data

On the floors of the world’s biggest artificial intelligence labs, rows of supercomputers outfitted with powerful workhorse accelerators stand ready to whir into action and start churning data. Except, companies have a problem. It’s the data.

After exhausting the reservoirs of data on the Internet, they are now turning to internal data to train the next versions of their AI systems. But this is not so much a supply problem as it is a handling problem.

The data is culled and corralled from different sources and silos, then prepped and readied before it is exposed to the models. Methodically managing this growing diversity of data assets across the digital gulf has become the Achilles’ heel of many organizations.

It takes deep knowledge of data science. The process involves endless rounds of processing information, assessing data quality, and making decisions just to pick and choose which data to use.

Underneath it all, a robust, muscular storage system needs to be in place to bear the weight of this proliferating corpus.

In this episode of Utilizing Tech Season 7, Subramanian Kartik, Global Vice President of Systems Engineering at VAST Data, joins co-hosts Stephen Foskett and Solidigm’s Datacenter Product Marketing Manager Jeniece Wnorowski to talk about this. The discussion shines the spotlight on data, the new celebrity in the AI arena. This episode is brought to you by Solidigm.

The Place of Storage in AI Is Not on the Sidelines

GPUs and DPUs often hog all the limelight in conversations about artificial intelligence, and rightly so. “GPUs are important, no question. AI companies need data, and they have a lot of performance requirements due to checkpointing and other things,” says Kartik.

But an underappreciated and often overlooked fact is that a supporting storage infrastructure, equally high-performing and efficient, is vital for GPUs to function at maximum capacity.

“There’s actually a whole bunch of heavy lifting that happens before gigabytes of tokens are created, distilled from petabytes and petabytes of data, all on the internet, to build these foundation models which we all know and love,” Kartik reminds listeners.

The AI workflow is a highly dynamic one in which every stage is different. Each has its own characteristics and peculiarities, not to mention widely different moods and needs. The only thing they all have in common is data.

“Data fits as part of the pipeline that spans all the way from raw data to inference, and each of the stages is data-intensive as we go along, not just a little bit under the GPUs.”

With transfer learning picking up, companies are reaching into their own archives for business-specific data to fine-tune private LLMs. Harvesting this data is more trouble than anticipated.

There are decades of data, some of it archived since before the current organization even existed, that must be sifted through and analyzed end to end to correctly determine which datasets will be valuable for AI training.

“This is a data wrangling problem which is enormous, and customers have tens, if not hundreds of petabytes of data. We’re now trying to figure out how to get a grip on this, and actually make these fit in the old AI pipeline,” Kartik says.

It does not stop there. With data accumulated over many years, there is no one place to find everything. Assets are scattered across databases, data lakes, data warehouses, the cloud, on-premises systems, and so on. Tracking down each of these silos and accessing the data within, while working around the limitations of the dated technology it sits on, is a major obstacle in and of itself.

“There is no ontological model which typically exists in large enterprises, and people are scrambling to build one. You will hear a lot of talk about things like knowledge graphs, data fabrics and data meshes. These are all efforts to get a grip on where the data is, what it is, and what use it is to them,” Kartik explains.

If a company somehow manages to locate all the datasets that the model can learn from, the next big challenge is to clean and refine that data and make it ready for consumption by the LLM. This requires a heavy-duty storage solution that can not only accommodate that colossal volume of information, but also make it available to the GPUs with speed and consistency.
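To make that cleaning and refining step concrete, here is a minimal sketch of what an early data preparation pass might look like, assuming text documents have already been pulled out of their silos into a staging directory. The directory and file names are hypothetical, and the sketch only illustrates the sort of deduplication and filtering that happens long before a GPU ever sees a token.

```python
import hashlib
import json
import re
from pathlib import Path

RAW_DIR = Path("staged_docs")            # hypothetical staging area filled from the silos
OUT_FILE = Path("clean_corpus.jsonl")    # hypothetical hand-off to the training pipeline

def clean(text: str) -> str:
    """Strip control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

seen = set()
with OUT_FILE.open("w", encoding="utf-8") as out:
    for path in RAW_DIR.glob("*.txt"):
        doc = clean(path.read_text(encoding="utf-8", errors="ignore"))
        if len(doc) < 200:                       # drop fragments too short to be useful
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:                       # exact-duplicate removal
            continue
        seen.add(digest)
        out.write(json.dumps({"source": path.name, "text": doc}) + "\n")
```

Real pipelines go much further, with near-duplicate detection, language and quality filtering, and PII scrubbing, but even this toy version shows why the preparation stage churns through far more bytes than it keeps.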

Ironically, no money is made in training. “Money does not get made in training – it’s a money sink,” he exclaims. “It’s all in the inference. You’ve got to get it to the hands of the end users and that’s what needs to happen.”

A Case for SSDs

Many companies have undergone colossal infrastructure overhauls to get AI-ready. “Part of this transformation which we think is going to be the biggest infrastructure transformation in the history over the next five years or so, and we anticipate people will spend about $2.8 trillion on this, is to start aggregating the data and understanding the meaning,” Kartik predicts. “The next thing is to prepare it and decide how they’re going to vector it towards a variety of models which are then going to be able to transform how the business operates.”

In the AI and HPC spaces, businesses are rapidly upgrading servers with solid-state technologies to get their systems ready for AI workloads.

“The I/O patterns for AI tend to be more random read dominated, and NAND can do a much better job delivering this,” Kartik comments.
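To picture why training reads skew random, consider a loader that shuffles its sample order every epoch, so each read lands at an arbitrary offset in the dataset file. The following is a minimal standard-library sketch; the file name and record size are assumptions for illustration, not anything specific to VAST Data or Solidigm.

```python
import os
import random

DATASET = "tokens.bin"        # hypothetical pre-tokenized corpus packed as fixed-size records
SAMPLE_BYTES = 4096 * 2       # e.g. 4,096 tokens at 2 bytes each, purely illustrative

num_samples = os.path.getsize(DATASET) // SAMPLE_BYTES
order = list(range(num_samples))
random.shuffle(order)         # per-epoch shuffling is what randomizes the access pattern

with open(DATASET, "rb") as f:
    for idx in order:
        f.seek(idx * SAMPLE_BYTES)     # every access jumps to an unpredictable offset
        sample = f.read(SAMPLE_BYTES)  # small scattered reads favor flash over spinning disk
        # ...hand `sample` to the training step...
```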

Hard disks are falling out of favor, and traditional hard drive-based storage systems are slowly giving way to all-flash namespaces. “The drives need to handle an unusual mix of workloads – traditional HPC simulation, MPI jobs – high-throughput, large-block sequential read/write workloads, contrasting with the heavy random-I/O-intensive workloads. The perfect platform for this combination is a completely solid-state solution. That transformation is well underway,” he says.

The rise of edge computing and the rapid infusion of technology in these environments are hastening this transformation. “From a capacity perspective, floor space or power, it is going to be significantly lower in these kinds of environments than what we had in disk-based environments,” Kartik points out. “I think just that delta alone is going to eliminate hard drives. They’re just not power-efficient enough, not space-efficient enough and not performant enough. They’re going to get squeezed out on the low end with tape, and on the high end with all-flash-based systems.”

Economics is the biggest obstacle. SSDs are significantly pricier and, at scale, well outside the budget of many smaller organizations.

Solidigm is changing that reality with its portfolio of high-density, high-endurance solid-state drives. The drives are a balanced combination of performance and capacity, complemented with thin form factors and efficient thermal management.

“Training is a GPU-bound process. That’s not where you get hammered for I/O. Where you do is while doing checkpointing. You absolutely need to have systems that are very high-performance,” emphasizes Kartik.
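Checkpointing hammers I/O because the entire model and optimizer state is serialized to storage at regular intervals, and the GPUs largely sit idle until the write finishes. The PyTorch-style sketch below is a toy illustration of that pattern, not any vendor-specific mechanism; the model here is tiny, while production checkpoints can run to hundreds of gigabytes.

```python
import time
import torch
import torch.nn as nn

# Toy stand-in for a large model; real foundation models have billions of parameters.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))
optimizer = torch.optim.AdamW(model.parameters())

def save_checkpoint(step: int, path: str = "ckpt.pt") -> None:
    """Serialize model and optimizer state in one large burst of writes."""
    start = time.time()
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
    print(f"checkpoint at step {step} took {time.time() - start:.2f}s")

# Training stalls for roughly this long every time a checkpoint is taken,
# which is why checkpoint storage needs to absorb writes quickly.
save_checkpoint(step=1000)
```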

Naysayers argue that the low data volumes typical of small training jobs can leave arrays of expensive drives sitting idle in datacenters. The argument does not hold for AI workloads because of their performance-intensive nature and unpredictable capacity requirements. There is no way to predict when more capacity will be needed, so having adequate storage at the ready always makes sense in these scenarios.

But this does not necessarily translate to big investments. SSDs like Solidigm’s are built with AI’s most pressing issues – hardware costs and power consumption – top of mind. The drives, being high-density, offer more bang for the buck. They come in slim form factors, meaning more drives can be packed into less space, reducing physical footprint and energy consumption.

But what makes them a truly great fit for AI is their ability to maximize GPU utilization. Solidigm SSDs are designed to handle the varying I/O characteristics of the AI phases and provide high read/write performance throughout. “You no longer have to worry if the right data is at the right place at the right time,” Kartik says.

VAST Data Platform for the AI Pipeline

With AI emerging as the new frontier, a new class of cloud service providers (CSPs) has appeared that, unlike the biggest CSPs we know, is built from the ground up to support AI use cases. These companies have the most coveted hardware, purpose-built for AI’s scale.

CoreWeave, a specialized CSP involved in massive-scale AI training, is one of the names in that space. CoreWeave’s secret configuration is a combination of Solidigm’s QLC SSDs and the VAST Data solution. With these, it has designed a platform that offers the perfect balance of scale, speed, performance, and efficiency in the AI data pipeline.

VAST Data plays a crucial role in untangling the AI data knot. The VAST Data platform exposes data through modalities beyond the usual file and object protocols, which makes the lengthy process of AI data management less costly and cumbersome.

“We also expose ourselves through tables,” explains Kartik. “The native tabular structures within us are crucial for data crunching – taking the raw data which is currently sitting in large data lakes on Hadoop, Iceberg, MinIO, or object stores. We want to be able to corral that and give it the transformation platform to convert it into what can actually be put into a model.”
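As a generic illustration of that kind of tabular corralling (this is not the VAST platform’s own API, just a sketch using PyArrow against a Parquet export), the snippet below projects and filters a data-lake table and hands the surviving text to the next preparation step. The dataset path, column names, and filter are all hypothetical.

```python
import pyarrow.dataset as ds

# Hypothetical Parquet export from a data lake; path and schema are illustrative.
lake = ds.dataset("datalake/support_tickets", format="parquet")

# Project and filter at the dataset layer so only useful rows reach the pipeline.
table = lake.to_table(
    columns=["ticket_id", "created_at", "body"],
    filter=ds.field("body_length") > 100,
)

# Hand the selected text to the next preparation stage as newline-delimited records.
with open("tickets_for_training.txt", "w", encoding="utf-8") as out:
    for body in table.column("body").to_pylist():
        out.write(body.replace("\n", " ") + "\n")
```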

The platform caters to all the stages of the AI pipeline – data capture, data preparation, training and inferencing. “We do this with an exceptional degree of security, governance and control.”

Consolidating the entire pipeline into a single platform eliminates the need to copy and move data around, significantly reducing data footprint. This unification, Kartik says, is what makes VAST Data a highly suitable solution for any AI work.

You can now listen to Utilizing Tech in your favorite podcast application. Be sure to give this episode a listen, and keep your eyes peeled for upcoming episodes. To read more on this, check out VAST Data’s white papers on their website. To learn more about Solidigm’s high-performance, high-value SSD solutions for AI, head over to their website.

Podcast Information:

Stephen Foskett is the Organizer of the Tech Field Day Event Series and President of the Tech Field Day Business Unit, now part of The Futurum Group. Connect with Stephen on LinkedIn or on X/Twitter and read more on the Gestalt IT website.

Jeniece Wnorowski is the Datacenter Product Marketing Manager at Solidigm. You can connect with Jeniece on LinkedIn and learn more about Solidigm and their AI efforts on their dedicated AI landing page or watch their AI Field Day presentations from the recent event.

Subramanian Kartik, Ph.D., is the Global Systems Engineering Lead at VAST Data. You can connect with Subramanian on LinkedIn and learn more about VAST Data on their website or watch the videos from their recent Tech Field Day Showcase.

VAST Data Tech Field Day Showcase:


Thank you for listening to Utilizing Tech, with Season 7 focusing on AI Data Infrastructure. If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving us a rating and a nice review on Apple Podcasts or Spotify. This podcast was brought to you by Solidigm and by Tech Field Day, now part of The Futurum Group. For show notes and more episodes, head to our dedicated Utilizing Tech website or find us on X/Twitter and Mastodon at Utilizing Tech.

About the author

Sulagna Saha

Sulagna Saha is a writer at Gestalt IT where she covers all the latest in enterprise IT. She has written widely on a broad range of topics. On gestaltit.com she writes about the hottest technologies in cloud, AI, security, and more.

A writer by day and reader by night, Sulagna can be found busy with a book or browsing through a bookstore in her free time. She also likes cooking fancy things on leisurely weekends. Traveling and movies are other things high on her list of passions. Sulagna works out of the Gestalt IT office in Hudson, Ohio.
