Synthetic data can have significant negative impacts on generative AI

Given the rapid advance of generative artificial intelligence (AI) models such as OpenAI’s GPT-4o and Stability AI’s Stable Diffusion, it’s worth acknowledging the challenges of training these models. Their appetite for massive amounts of data has already raised concerns that the supply of suitable real-world training material could run short.

In light of this looming data scarcity, using synthetic data to train future iterations of AI models may look like an appealing option for major tech companies. The approach offers several advantages: it is cost-effective, its supply is virtually unlimited compared with real-world data, and it poses fewer privacy risks, especially in sensitive domains such as medical data. In some cases, it may even enhance AI performance.

Nevertheless, recent research by the Digital Signal Processing group at Rice University has shed light on the potential negative consequences of overreliance on synthetic data for training generative AI models in the long term.

“The problems arise when this synthetic data training is, inevitably, repeated, forming a kind of a feedback loop, what we call an autophagous or ‘self-consuming’ loop,” said Richard Baraniuk, Rice’s C. Sidney Burrus Professor of Electrical and Computer Engineering.

“Our group has worked extensively on such feedback loops, and the bad news is that even after a few generations of such training, the new models can become irreparably corrupted. This has been termed ‘model collapse’ by some, most recently by colleagues in the field in the context of large language models (LLMs). We, however, find the term ‘Model Autophagy Disorder’ (MAD) more apt, by analogy to mad cow disease.”

Mad cow disease, also known as bovine spongiform encephalopathy (BSE), is a fatal neurodegenerative illness that affects cows and has a human equivalent caused by consuming infected meat.

A significant outbreak in the 1980s and 1990s revealed that the disease had spread through the practice of feeding cows the processed remains of their slaughtered peers, hence the term “autophagy,” from the Greek auto, meaning “self,” and phagy, meaning “to eat.”

The study, “Self-Consuming Generative Models Go MAD,” is the first peer-reviewed work on AI autophagy and focuses on generative image models such as DALL·E 3, Midjourney, and Stable Diffusion.

“We chose to work on visual AI models to better highlight the drawbacks of autophagous training, but the same mad cow corruption issues occur with LLMs, as other groups have pointed out,” Baraniuk said.

The process of training generative AI models often involves using datasets obtained from the internet, which can result in the emergence of self-consuming loops with each new model generation. Baraniuk and his team investigated three variations of these loops in order to understand potential scenarios (a simplified code sketch of all three follows the list):

  • Fully synthetic loop: Each new generation of a generative model was trained using entirely synthetic data derived from the output of prior generations.
  • Synthetic augmentation loop: The training dataset for each generation consisted of a combination of synthetic data from previous generations and a fixed set of real training data.
  • Fresh data loop: Every model generation underwent training using a mix of synthetic data from previous generations and a new set of real training data.
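
To make the three variants concrete, here is a minimal, self-contained sketch, not the study’s code: the “generative model” is simply a Gaussian fitted to its training data, and the standard deviation of the final samples stands in, very crudely, for the diversity the researchers measure on real image models. The sample counts, generation counts, and the Gaussian stand-in are assumptions made purely for illustration.

```python
# Toy illustration of the three self-consuming loop variants described above.
# This is NOT the study's code: the "model" is just a Gaussian fit to its
# training data, and the final standard deviation is a crude diversity proxy.
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES, N_GENERATIONS = 100, 200


def fresh_real_data():
    """A brand-new batch of real data, assumed here to follow N(0, 1)."""
    return rng.normal(0.0, 1.0, N_SAMPLES)


def fit_and_sample(train):
    """'Train' a model (fit a Gaussian) and sample synthetic data from it."""
    mu, sigma = train.mean(), train.std()
    return rng.normal(mu, sigma, N_SAMPLES)


def run_loop(kind):
    real_data = fresh_real_data()                 # fixed real training set
    synthetic = fit_and_sample(real_data)         # generation 1 trains on real data
    for _ in range(N_GENERATIONS - 1):
        if kind == "fully synthetic":             # only prior generations' output
            train = synthetic
        elif kind == "synthetic augmentation":    # prior output + the same fixed real set
            train = np.concatenate([synthetic, real_data])
        else:                                     # "fresh data": prior output + new real data
            train = np.concatenate([synthetic, fresh_real_data()])
        synthetic = fit_and_sample(train)
    return synthetic.std()                        # the real data has std 1.0


for kind in ("fully synthetic", "synthetic augmentation", "fresh data"):
    print(f"{kind:>22}: final std = {run_loop(kind):.3f}")
```

In this toy setting, the fully synthetic loop’s spread typically drifts well away from that of the real data over many generations, while the loops that keep injecting real data stay much closer to it, mirroring in miniature the qualitative pattern the researchers report for image models.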

As the loops iterated, it became evident that without a continuous supply of fresh real data, the models began producing distorted outputs of compromised quality and diversity. The key to robust AI, in other words, lies in the availability of fresh and diverse data.

When we examine successive generations of AI-generated image datasets side by side, we are confronted with a disconcerting glimpse into potential AI outcomes. Human face datasets are marred by gridlike scars, termed “generative artifacts” by the authors, or begin to resemble the same individual more and more. Meanwhile, datasets containing numbers transform into unintelligible scribbles.

“Our theoretical and empirical analyses have enabled us to extrapolate what might happen as generative models become ubiquitous and train future models in self-consuming loops,” Baraniuk said. “Some ramifications are clear: without enough fresh real data, future generative models are doomed to MADness.”

To make these simulations more realistic, the researchers incorporated a sampling bias parameter to account for “cherry-picking,” the tendency of users to favor data quality over diversity. In practice, this means trading variety in the kinds of images and text in a dataset for outputs that look or sound appealing.

Cherry-picking helps preserve data quality over numerous model generations, but it comes at the cost of an even steeper decline in diversity.
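
As a rough illustration of that trade-off, continuing the toy Gaussian setup from the earlier sketch, the snippet below models cherry-picking as keeping only the synthetic samples closest to the model’s mean. Treating “closeness to the mean” as perceived quality is an assumption of this sketch, not the mechanism used in the study.

```python
# Toy illustration of the cherry-picking trade-off, NOT the study's implementation.
# Cherry-picking is modeled as keeping only the samples nearest the fitted mean,
# i.e. truncating the tails; the surviving spread is the diversity proxy.
import numpy as np

rng = np.random.default_rng(1)
N_SAMPLES, N_GENERATIONS = 1000, 20


def next_generation(train, keep_fraction):
    """Fit a Gaussian 'model', sample from it, then cherry-pick the 'best' samples."""
    mu, sigma = train.mean(), train.std()
    samples = rng.normal(mu, sigma, N_SAMPLES)
    n_keep = int(keep_fraction * N_SAMPLES)
    order = np.argsort(np.abs(samples - mu))      # "quality" = closeness to the mean
    return samples[order[:n_keep]]


for keep_fraction in (1.0, 0.9, 0.5):             # 1.0 means no cherry-picking at all
    data = rng.normal(0.0, 1.0, N_SAMPLES)        # real seed data with std 1.0
    for _ in range(N_GENERATIONS):
        data = next_generation(data, keep_fraction)
    print(f"keep {keep_fraction:.0%} each generation -> final std {data.std():.3g}")
```

Even a modest amount of cherry-picking in this fully synthetic toy loop collapses the spread of the data within a couple of dozen generations, echoing the finding that favoring quality accelerates the loss of diversity.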

“One doomsday scenario is that if left uncontrolled for many generations, MAD could poison the data quality and diversity of the entire internet,” Baraniuk said. “Short of this, it seems inevitable that as-to-now-unseen unintended consequences will arise from AI autophagy even in the near term.”

The research was supported by the National Science Foundation, the Office of Naval Research, the Air Force Office of Scientific Research, and the Department of Energy.

Journal reference:

  1. Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk. Self-Consuming Generative Models Go MAD. International Conference on Learning Representations (ICLR), 2024.
