HPC and AI Storage by the Numbers

In the first story of our “Future of AI and HPC Storage” series, we looked at the overall market for HPC and AI storage, some of the big challenges organizations are facing, and the trends shaping the market. In this second installment, we’re going to look at the storage market as it currently sits and get the lay of the land for vendor preferences in HPC and AI.

Demand for storage is growing strongly at the moment, thanks in large part to the big data and AI booms. In the HPC sector, we can get good numbers thanks to Hyperion Research, which conducts periodic global HPC site surveys to get a better idea of the compute, storage, and networking investments that some of the biggest HPC sites around the world are making.

Hyperion’s 2023 global site survey found total spending on “add-on” storage amounted to $6.3 billion, or about 21% of total on-prem spending for HPC. Storage was the fastest-growing element, and Hyperion projected that its share of the overall HPC pie would grow to more than 22.4% by 2028.

The overall global data storage market (not just AI and HPC, but also enterprise IT) drove $218 billion in spending last year and was projected to grow to $255 billion this year, according to Fortune Business Insights. By 2032, it’s projected to grow to $774 billion, representing a compound annual growth rate of 17.2%. These numbers suggest that storage is growing faster in the HPC and AI segment than the overall enterprise IT market.
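For readers who want to check those projections, the compound-growth arithmetic is easy to reproduce. The sketch below is illustrative only; it assumes the $255 billion figure is this year’s base and that 2032 is seven compounding years out.

```python
# Illustrative sanity check on the Fortune Business Insights projection.
# Assumption (not from the report itself): $255B is the base-year value
# and 2032 is 7 compounding years later.

def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate, as a fraction."""
    return (end / start) ** (1 / years) - 1

def project(start: float, rate: float, years: int) -> float:
    """Value after compounding `rate` for `years` years."""
    return start * (1 + rate) ** years

growth = cagr(255, 774, 7)       # close to the reported 17.2% CAGR
future = project(255, 0.172, 7)  # close to the projected $774B for 2032

print(f"Implied CAGR: {growth:.1%}")
print(f"Projected 2032 market: ~${future:.0f}B")
```

Running the same arithmetic on other market projections in this story (for example, the SSD figures below) is a quick way to see whether the endpoints and the quoted growth rate actually agree.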

From a hardware perspective, much of the spending is on solid-state drives (SSDs) using NVMe technology, which offers much greater throughput and lower latency than spinning disk. The global market for SSDs was estimated at $19.1 billion in 2023, according to Grand View Research, and is projected to reach about $331.4 billion by 2034, representing a CAGR of 17.6%. There are dozens of SSD makers around the world, but fewer companies that make the NAND flash media at the heart of SSDs.

Source: Fortune Business Insights

We’re seeing deliveries of large-capacity SSDs that can store 122 TB–and some that store 128 TB–which is impacting the market. However, the cost of these massive SSDs is in the five-digit range. Spinning disk is still about four to five times cheaper on a per-GB basis than SSDs, which is helping to maintain the market for HDDs, particularly for nearline storage for AI training workloads. In fact, this AI-driven demand has put pressure on HDD manufacturers and resulted in the lead time to order high-capacity nearline HDDs growing to over 52 weeks, according to a recent report by TrendForce.

Where HPC traditionally had two or maybe three storage tiers, many AI workloads today have four or five. AI and HPC projects may use SSDs as a high-performance, or “scratch,” layer, and spinning disks as a high-capacity or nearline tier for other stages in the AI pipeline, such as model checkpointing and inference logging. Slower disk and tape are used for long-term or archival storage.

We’re currently in the midst of a spike in disk and memory prices, following the crash in prices of late 2022 and early 2023. The dynamic is a product of initially low demand in the consumer sector, followed by a rush of orders that, in large part, has been spurred by the AI buildout. Industry reports indicate we could see elevated NAND and DRAM prices for two years.

In a note to customers last month, Western Digital said it was increasing prices for every hard disk in its catalog thanks to “unprecedented demand for every capacity in our portfolio.” Flash maker SanDisk (recently spun off from Western Digital) also announced it was increasing the price of its NAND flash products by 10%. Micron announced a price freeze for its DRAM and NAND products and reportedly notified distributors to expect a 20% to 30% increase for DRAM products.

Source: Hyperion Research

The continued demand for nearline HDDs is leading some to speculate that cheaper QLC SSDs could someday fill the role now played by HDDs. That doesn’t seem to be happening at the moment, but it’s something to keep in mind.

According to Hyperion’s 2023 global site survey, 75% of HPC and AI sites get their on-premises storage from system vendors, as opposed to independent storage vendors. Dell Technologies was the most used on-premises storage provider for HPC and AI sites, with a 22.3% share, followed by IBM at 19.1%, Lenovo at 8.5%, Fujitsu at 5.3%, and HPE Cray at 5.3%. Among the independent storage vendors, NetApp led the way with an 8.5% share, followed by DDN at 7.4%. The “other” category ate up 23.4% of the overall share, with vendors like VDURA, ATOS, Huawei, and Inspur getting votes.

While storage accounted for about 21% of spending for on-prem HPC and AI installations, customers who run their HPC and AI workloads in the cloud spend upwards of 33% of their budget on storage, according to Hyperion’s 2023 survey. All told, only about $7.5 billion was spent on AI and HPC in the cloud, with compute instances gobbling up nearly 47% of that share. The on-prem segment of the HPC and AI market was about 5x bigger than cloud.

Interestingly, spending on ephemeral storage (the fast “scratch” layer mentioned above) accounted for 13.8% of total cloud spending for HPC and AI, according to Hyperion, only slightly less than the 16.2% spent on persistent storage. However, spending on persistent storage had been 2x greater than ephemeral storage in past studies, which shows how quickly that type of storage is growing–and gives a hint at the type of AI workloads that organizations are running.
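Putting those percentages together with the roughly $7.5 billion cloud total gives a sense of the absolute dollars involved. This is a back-of-the-envelope sketch, assuming the ephemeral, persistent, and compute shares all apply to the same $7.5 billion base:

```python
# Rough dollar figures implied by Hyperion's cloud-spending percentages.
# Assumption: the 13.8%, 16.2%, and ~47% shares all apply to the
# ~$7.5B total HPC/AI cloud spend cited above.

cloud_total_b = 7.5                      # total HPC/AI cloud spend, $B
ephemeral = 0.138 * cloud_total_b        # fast "scratch" storage
persistent = 0.162 * cloud_total_b       # persistent storage
compute = 0.47 * cloud_total_b           # compute instances

print(f"Ephemeral storage:  ~${ephemeral:.2f}B")
print(f"Persistent storage: ~${persistent:.2f}B")
print(f"Compute instances:  ~${compute:.2f}B")
```

On these assumptions, ephemeral and persistent storage each land at a bit over a billion dollars a year, within a couple hundred million of each other, which makes the closing of the old 2x gap easy to see.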

Source: Hyperion Research, 2022

Lastly, we take a look at the software side of the storage equation. There are two main camps when it comes to file systems: network attached storage (NAS) shops and parallel file system users.

In the NAS and scale-out NAS realm, NFS dominated, with a 66% share, followed by OneFS at 14% and QumuloFS at 10%, according to Hyperion’s 2022 survey. In the parallel file system department, Lustre had the lion’s share with 41%, followed by IBM’s Storage Scale (formerly Spectrum Scale and GPFS) at 17%, in-house developed pFS at 14%, WEKA at 6%, and other at 14%. Interestingly, 61% of HPC-AI shops said they don’t use the same file system in the cloud as they do on-prem.

A lot has changed in three years, however, which is why it will be interesting to see Hyperion’s updated figures. ChatGPT isn’t quite three years old, so the generative AI revolution isn’t reflected in those numbers.

What we know anecdotally is that there continues to be a divide between the NAS vendors and the parallel file system vendors, with some new faces and new approaches (e.g., selling software-only products as opposed to appliances). In addition to the parallel file system vs. NFS and parallel NFS (pNFS) question, the capability to support object storage is also driving storage purchasing decisions, thanks to the different capabilities that object brings (which we’ll discuss at length in a future installment of this series).

For instance, Hammerspace is in the conversation with a software-only storage product based on pNFS that also speaks S3 and offers federated data virtualization capabilities. VAST Data is also taking a software-only approach with a full data-stack offering that supports multiple storage protocols (NFS, SMB, S3, and even the Kafka API). Both are competing fiercely with WEKA, which has developed its own scale-out file system, dubbed WekaFS, that supports NFS, SMB, and S3. Meanwhile, MinIO has entered the hyperscale game with its open source, S3-compatible object store and its enterprise AIStor offering.

These specialized software-defined storage vendors have teamed up with system providers like HPE, Dell, and Supermicro to create massive storage systems that can scale into the exabyte region and beyond. These vendors–along with appliance vendors like Pure Storage, NetApp, and DDN–are all bidding to provide storage for the multi-gigawatt AI data centers that AI giants like Meta, Google, OpenAI, and others are currently building.

We’ll continue this special series next week. Stay tuned.
