The COVID-19 pandemic spurred digital innovation across industries, bringing AI front and center. Over the following years, the number of companies tapping into AI skyrocketed, ramping up deployment of AI workloads. But companies misjudged the challenges of deploying AI without preparation. Those challenges proved much bigger than anticipated.
As stories of failure emerged, they provided an important lesson. The precondition for building or deploying an AI solution, an AI-ready computing infrastructure, is not to be taken lightly. This base infrastructure must combine all the right pieces of AI-optimized storage and OS to support the heavy lifting of AI training and building. Without it, the endeavor is doomed to fail. But building out such a powerful framework is not a weekend project either.
At the last AI Field Day event, NVIDIA presented an AI-optimized development hub for the most power-hungry AI projects. The platform, which NVIDIA designed in collaboration with NetApp, comprises the NVIDIA DGX Foundry infrastructure and NVIDIA Base Command, an operating system that fuels all NVIDIA DGX systems. Combined, they present the ultimate enterprise-ready AI stack.
Barriers to AI Adoption
AI systems produce immense transformative business value across verticals, which makes them integral to all operations, solutions and strategies. But deploying and integrating AI into an existing infrastructure is a challenge on a whole different level.
Without a powerful AI infrastructure serving as the bedrock, building and training AI solutions becomes a comedy of errors. It demands a depth of knowledge and expertise that is hard for any existing team to supply.
When getting hands-on with it, data scientists often describe the process as messy and prone to mistakes. Too many moving parts split the focus, and that takes a toll on the teams’ overall productivity. For example, a simple access control problem or a hiccup in data management can put a time-sensitive project on hold for days.
Many questions pop up along the way, and answering them takes a depth of knowledge and familiarity with the entire technology stack that no one person or team has. It is incredibly difficult for engineers to match that technical depth. That’s why it pays to have a ready-to-use platform that can help scale out workloads with little effort and friction.
AI Expertise from Deep within NVIDIA
Adam Tetelman, Sr. Product Architect at NVIDIA, opened the session by stating that “NVIDIA is an AI company.” Although better known as a manufacturer of world-class GPUs, NVIDIA has for some years been working on AI, building technologies internally.
“We build a lot of AI within NVIDIA. We have our consumer products and enterprise products, and we have natural language processing models, super-resolution, denoising, style transfer models built in-house for our products, or with some of our partners,” said Tetelman.
Tetelman talked about NVIDIA’s internal supercomputers that employ AI at scale. The prototype, DGX-1, was launched about 8 years back, in 2016, and DGX SATURNV followed within months. SATURNV is the origin of DGX Foundry and Base Command. It boasts a whopping 22,500 GPUs and 2,000,000 AI training jobs, and is used by a sprawling team of AI developers and researchers who have deployed “all types of AI workloads within the same environment.”
“Over time we’ve played around with the software stack internally, and figured out how to scale up the hardware or the cluster, but most importantly, we learned how to interface the software into that. We ended up centralizing everything. In the process, we’ve built our AI center of excellence within the company,” said Tetelman.
As the technologies reached maturity, the next step was to get them out in the market. So Base Command was born to extend the tools and capabilities that were so far being used in-house at NVIDIA to anyone outside the company.
Getting AI Projects to Move Quickly Start to Finish
NVIDIA Base Command is a comprehensive AI workflow management platform created to solve the challenges of building an AI platform. Designed to eliminate workflow bottlenecks, it covers all the enterprise and developer requirements that formerly created blind spots in the process.
Base Command is cloud-hosted software whose key function is to help manage large-scale AI development workflows in the cloud and on-premises. The platform serves as a “development hub” where data scientists can quickly develop AI projects from inception to completion.
Base Command offers the coveted single-pane-of-glass view of all moving parts in AI development. The control plane presents a complete picture of everything in the environment. Through it, users can manage and share resources, and monitor all components through a unified dashboard. Engineers can view and monitor clusters, manage datasets, right-size resources and execute AI workloads, all from the GUI.
Base Command is common to all DGX systems and enables users to get the best out of their workstations.
With robust out-of-the-box compatibility with ML libraries, integration with third-party MLOps tools and support for multi-node training, Base Command enables a massive ecosystem and “meets all the teams where they are.”
Its most compelling feature is its breadth of advanced capabilities. Tetelman gave the example of application profiling, which is enabled by telemetry and insight solutions built into the platform.
The Base Command Platform is a subscription-based service that NVIDIA offers jointly with NetApp. While Base Command is the software part of the platform, DGX Foundry is NVIDIA’s hardware infrastructure. Currently, the infrastructure comes in two flavors: NVIDIA DGX Foundry and DGX SuperPOD.
DGX Foundry is a hosted service that uses storage from NetApp. It is fully managed and available by subscription. Alternatively, users can buy DGX SuperPOD, an on-premises infrastructure that can be deployed with the Base Command software.
“You can get trial access to Base Command through a program called NVIDIA LaunchPad. The LaunchPad website has a trial link where you can sign up and get access to the software and see how it all works,” said Tetelman.
With AI being the new nerve center of the modern digital infrastructure, companies need a ready-to-deploy AI platform that tackles the distracting chores and lets scientists home in on the development part. NVIDIA DGX Foundry provides that ramp to launch AI. Besides giving enterprises a future-proof foundation for AI projects of all types, it presents a fully managed infrastructure built with world-class technologies, poised to give data scientists a development experience free of distractions and the usual challenges.