Ever since OpenAI’s ChatGPT sparked the generative AI boom in 2022, it’s been clear that having the right data, and enough of it, is essential to creating an AI model that is accurate, reliable, and efficient.
The problem? The best data, particularly specialised “expert” data in specific domains like health and finance, is in short supply.
AI companies have strip-mined the Internet for fresh information, but AI models are constantly hungry – and must be fed.
San Francisco-based startup Gretel AI has long believed that the most satisfying solution is to create fake food that is just as tasty as the real thing. It helps clients such as EY, Google, and the US Department of Justice generate synthetic data – that is, artificially generated data that mimics the characteristics of real-world data.
And it’s getting easier to make: Today, for example, Gretel announced the wide availability of a generative-AI-powered system that lets users create synthetic tabular datasets – text and numbers arranged in columns and rows, like an Excel spreadsheet – with just a natural language prompt like those used with ChatGPT.
Let’s say a bank wants to create a synthetic dataset that is similar to its own customer data but does not include actual individual names or information. Using Gretel’s Navigator product, the bank can prompt the system to create millions of fictional names, IDs, dates, dollar amounts, and account balances, based either on Gretel’s own datasets or on the bank’s proprietary data.
The resulting computer-generated data doesn’t infringe on customer privacy, Gretel claims, since it includes no real-world customer information, yet it can still supply enough material to train a powerful, accurate model.
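To make the idea concrete, here is a minimal, hypothetical sketch of the sort of synthetic customer table described above. It uses the open-source Faker library and plain Python rather than Gretel’s Navigator, whose API is not detailed here, so the column names and generation logic are illustrative assumptions, not Gretel’s actual method (Navigator builds such tables from a natural-language prompt and a generative model).

```python
# Illustrative only: a rule-based stand-in for the kind of synthetic
# customer table described above. This is NOT Gretel's Navigator API;
# Navigator generates such tables from a natural-language prompt.
import random

import pandas as pd           # assumed available
from faker import Faker       # open-source fake-data library

fake = Faker()

def synthetic_customers(n_rows: int) -> pd.DataFrame:
    """Generate n_rows of entirely fictional bank-customer records."""
    rows = []
    for _ in range(n_rows):
        rows.append({
            "customer_id": fake.uuid4(),                       # fictional ID
            "name": fake.name(),                               # fictional name
            "signup_date": fake.date_between("-5y", "today"),  # plausible date
            "account_balance": round(random.uniform(0, 250_000), 2),
            "last_txn_amount": round(random.uniform(1, 5_000), 2),
        })
    return pd.DataFrame(rows)

# A small sample; Gretel's pitch is doing this at the scale of millions of
# rows, with statistics that mirror a real (private) source dataset.
print(synthetic_customers(5))
```

The point of the sketch is simply that no real customer appears anywhere in the output; Gretel’s claim is that its generative approach goes further, matching the statistical shape of the real data closely enough to train accurate models.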
As data scarcity forces companies to seek other sources to build general models or fine-tune ones for specific tasks, synthetic data is having a moment in 2024, Gretel cofounder and CEO Ali Golshan told Fortune.
Golshan, who had previously cofounded two security-focused startups, pointed out that the company got its start in 2020 as a way to generate privacy-minded data (the name Gretel comes from the classic story of Hansel and Gretel, who left a trail of breadcrumbs to find their way home). The company “wanted to make sure people don’t leave digital breadcrumbs behind” while offering developers a way to access useful data, particularly in highly regulated industries.
“We never really thought about the context of running out of data – that was a ChatGPT moment,” he said. But now data scarcity – as well as data privacy and security – is why companies are turning to synthetic data as an option to train AI models.
Golshan emphasises that generating synthetic data is not about spewing out high volumes of low-quality, useless data (think Reddit posts). “People think synthetic data is sort of interchangeable with fake data or junk data, that they just need more of it,” he said. “That is where you end up with these sorts of toxic dovetails and spirals of hallucinations – the quality part has to be there.”
What will drive business over the next two decades, he added, is taking large AI investments built on the back of “messy, public, privacy-riddled data” and “plugging them into our sensitive, owned, domain-specific data – that is unique and can drive models forward.”
He also pushed back on the idea that synthetic data is not “as good” as real data, as well as on the potential dangers of AI training itself on its own hallucinations or misinformation. Since the company mostly serves businesses, organisations, and governments, Gretel’s work typically starts with a seed of data a company already has – whether it is patient data, fraud data, or transaction data. “That acts as the boundaries and the gates for how we build the rest of the data,” he said.
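A drastically simplified way to picture “seed data as boundaries and gates”: learn the observed ranges and categories from a small real seed table, then sample new rows only within those bounds. The sketch below is an assumption-laden illustration in plain Python and pandas, not Gretel’s actual technique, which relies on generative models rather than simple range sampling.

```python
# Simplified stand-in for "seed data as boundaries": learn per-column
# ranges/categories from a small real seed table, then sample new rows
# only within those observed bounds. Gretel's real approach uses
# generative models; this only conveys the general idea.
import random

import pandas as pd

def sample_within_seed_bounds(seed: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    rows = []
    for _ in range(n_rows):
        row = {}
        for col in seed.columns:
            if pd.api.types.is_numeric_dtype(seed[col]):
                # Numeric columns: stay inside the seed's observed min/max.
                row[col] = random.uniform(seed[col].min(), seed[col].max())
            else:
                # Categorical columns: reuse only categories seen in the seed.
                row[col] = random.choice(seed[col].unique().tolist())
        rows.append(row)
    return pd.DataFrame(rows)

# Example: a tiny seed of transaction records constrains the synthetic output.
seed = pd.DataFrame({
    "merchant_type": ["grocery", "fuel", "online"],
    "amount": [42.10, 60.00, 19.99],
})
print(sample_within_seed_bounds(seed, 4))
```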
Gretel’s latest product lets companies generate data even on topics about which they lack information. Its technology focuses on highly specific data meant to improve individual tasks within a client’s internal systems, rather than producing data based on millions of pages scraped from the internet, which could prove problematic.
Gretel is not alone in attempting to corner the market on generating synthetic data to train AI models. Startups like SynthLabs, Synthetaic, and Clearbox AI are all racing to provide companies with all the data they need – computer-generated, that is.
That has led Golshan and his cofounders to consider the future. He says companies will soon be able to make money by allowing others to buy synthetic data generated from an organisation’s unique datasets. Organisations that have lots of data but aren’t building AI models, for instance, could sell others access to synthetic data derived from their proprietary datasets.
To that end, Golshan said, Gretel’s next big move is to build a synthetic data and model exchange. “We are going to enable companies and customers to train models on their data, get mathematical guarantees that data is safe, and somebody can come and ‘subscribe’ to that model, generate data, and pay as you go,” he explained.
This, he added, will take Gretel to the next level to “become the safe interface for private data, where you remove this exploitative approach to mining and harvesting data.” It would also mean that companies like Anthropic and OpenAI, which have built huge AI models on massive amounts of data, would not have to strike licensing deals with every individual company they want data from, he said.
As for funding, Gretel has raised a total of US$68mil (RM320.28mil) with its Series B back in 2021. Golshan said the startup has a lot of money left, with “about two years of runway ahead of us”. But in this “moment” for synthetic data, he says he sees an opportunity to build the next Databricks or Snowflake – two of the biggest data cloud platforms – or even OpenAI.
“We are leaning into it pretty aggressively because we’re having a ton of pull,” he said. “We envision building the next safe, high-quality data business, which, if you think about the needs, is a pretty significant opportunity.” – Fortune.com/The New York Times