In AI hardware circles almost everyone is talking about inference.
Nvidia CFO Colette Kress said on the company’s Wednesday earnings call that inference made up roughly 40% of Nvidia’s $26.3 billion in second-quarter data center revenue. AWS CEO Matt Garman recently told the No Priors podcast that inference likely accounts for half of the work done on AI computing servers today. And that share is likely to grow, drawing in competitors eager to chip away at Nvidia’s dominance.
It follows, then, that many of the companies looking to take market share from Nvidia are starting with inference.
Groq, founded by a team of Google alums, focuses on inference hardware and raised $640 million at a $2.8 billion valuation in August.
In December 2023, Positron AI came out of stealth with an inference chip it claims can perform the same calculations as Nvidia’s H100 at one-fifth the cost. Amazon is developing both training and inference chips, aptly named Trainium and Inferentia, respectively.
“I think the more diversity there is the better off we are,” Garman said on the same podcast.
And Cerebras, the California company famous for its oversized AI training chips, announced last week that it had developed an equally large inference chip that is the fastest on the market, according to CEO Andrew Feldman.
Not all inference chips are built equally
Chips designed for artificial intelligence workloads must be optimized for training or inference.
Training is the first phase of developing an AI tool: feeding labeled and annotated data into a model so that it learns to produce accurate and helpful outputs. Inference is the act of producing those outputs once the model is trained.
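For readers who think in code, here is a minimal sketch of the two phases, using PyTorch purely as an illustration; the tiny model, random data, and hyperparameters are all made-up stand-ins, not anyone’s actual workload.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 2)                     # stand-in for a real, far larger model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Training: forward pass, loss, backward pass, weight update. Compute-heavy.
    features = torch.randn(8, 16)                # made-up "labeled" batch
    labels = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

    # Inference: a single forward pass with gradients disabled. Latency is what matters.
    with torch.no_grad():
        prediction = model(torch.randn(1, 16)).argmax(dim=1)

Training runs that update loop over and over across enormous datasets; inference is just the last two lines, repeated for every user request.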
Training chips tend to optimize for sheer computing power. Inference chips require less computational muscle; in fact, some inference can be done on traditional CPUs. Chipmakers targeting inference care more about latency, because the difference between an addictive AI tool and an annoying one often comes down to speed. That’s what Feldman is banking on.
Cerebras’s chip has 7,000 times the memory bandwidth of Nvidia’s H100, according to the company. That’s what enables what Feldman calls “blistering speed.”
The company, which has begun the process of launching an IPO, is also rolling out inference as a service with multiple tiers, including a free tier.
“Inference is a memory bandwidth problem,” Feldman told Business Insider.
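A rough back-of-the-envelope calculation shows why; the figures below are illustrative assumptions, not measured specs from Cerebras or Nvidia. To generate each token of output, an inference chip has to stream essentially all of a model’s weights out of memory, so single-user generation speed is capped at roughly memory bandwidth divided by model size.

    # Illustrative arithmetic only; the numbers are assumptions, not vendor benchmarks.
    model_bytes = 70e9 * 2                 # a 70-billion-parameter model in 16-bit weights
    bandwidth_bytes_per_sec = 3.35e12      # roughly an H100's ~3.35 TB/s of memory bandwidth

    # Each generated token requires reading (approximately) every weight once,
    # so memory bandwidth, not raw compute, caps single-stream tokens per second.
    tokens_per_sec = bandwidth_bytes_per_sec / model_bytes
    print(round(tokens_per_sec))           # about 24 tokens per second

That ceiling is the one a higher-bandwidth chip raises, which is why Feldman points to memory bandwidth, rather than raw compute, as the number that matters for inference.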
To make money in AI, scale inference workloads
Choosing to optimize a chip design for training or inference isn’t just a technical decision; it’s also a market decision. Most companies making AI tools will need both at some point, but the bulk of their need will likely be in one area or the other, depending on where the company is in its building cycle.
Massive training workloads could be considered the R&D phase of AI. When a company shifts to mostly inference, that means whatever product it has built is working for end customers, at least in theory.
Inference is expected to represent the vast majority of computing tasks as more AI projects and startups mature. In fact, according to AWS’s Garman, that’s what needs to happen to realize the as-yet-unrealized return on the hundreds of billions of dollars invested in AI infrastructure.
“Inference workloads have to dominate, otherwise all this investment in these big models isn’t really going to pay off,” Garman told No Priors.
However, the simple binary of training versus inference may not hold for chip designers forever.
“Some of the clusters that are in our data centers, the customers use them for both,” said Raul Martynek, CEO of data center landlord DataBank.
Nvidia’s pending acquisition of Run:ai may support Martynek’s prediction that the wall between inference and training could soon come down.
In April, Nvidia agreed to acquire the Israeli firm Run:ai, but the deal has not yet closed and is drawing scrutiny from the Department of Justice, according to Politico. Run:ai’s technology makes GPUs run more efficiently, allowing more work to be done on fewer chips.
“I think for most businesses, they’re gonna merge. You’re gonna have a cluster that trains and does inference,” Martynek said.
Nvidia declined to comment on this report.