Bulls**t Automation: Generative AI services are tickling the imagination of every single CEO in the tech industry right now. They are expected to replace millions of workers and automate almost everything, but MIT researchers are warning that AI models don’t really comprehend the “rules” of complex systems.
A large language model (LLM) can supposedly mimic human intelligence and produce highly convincing results from a user's textual prompt. In reality, the model is simply predicting, sometimes with uncanny precision, which words are most likely to follow the previous ones in a given context. When LLMs face unpredictable, real-world conditions, their output can quickly become unreliable.
MIT researchers set out to develop new metrics that can verify whether generative AI systems actually understand the world, such as checking their ability to provide turn-by-turn directions in New York City. Modern LLMs seem to learn world models "implicitly," the researchers said in a recent study, but a formalized way to assess this apparently remarkable showcase of "intelligence" is still needed.
The team focused on transformers, the type of generative AI model behind popular services like GPT-4. Transformers are trained on massive amounts of language data, which makes them highly skilled at text prediction. The researchers then evaluated the models' predictions using a class of problems that can be framed as a deterministic finite automaton (DFA).
A DFA can model many different kinds of problems, including logical reasoning, geographic navigation, chemistry, and game-playing. The MIT scientists chose two of them – driving on the streets of New York and playing Othello – to test AI's ability to properly understand the underlying rules. "We needed test beds where we know what the world model is. Now, we can rigorously think about what it means to recover that world model," Harvard postdoctoral researcher Keyon Vafa said.
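To get a sense of what a DFA-style "world model" looks like, here is a minimal, hypothetical Python sketch (not the researchers' actual evaluation code): intersections are states, turns are inputs, and the transition table encodes the street rules, so a proposed route is valid only if every step is a legal transition.

```python
# Minimal sketch of a deterministic finite automaton (DFA) as a "world model".
# Hypothetical illustration only: states are intersections, inputs are turns,
# and the transition table encodes which moves the "map" actually allows.

# Transition table: (current_state, action) -> next_state
TRANSITIONS = {
    ("A", "left"): "B",
    ("A", "straight"): "C",
    ("B", "right"): "C",
    ("C", "straight"): "D",
}

def run_dfa(start, actions):
    """Follow a sequence of actions; return the final state, or None if any move is invalid."""
    state = start
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            return None  # the route proposes a turn or street that doesn't exist
        state = TRANSITIONS[key]
    return state

# A model's directions count as correct only if every step is a legal transition
# and the route ends at the requested destination.
print(run_dfa("A", ["left", "right", "straight"]))  # -> "D" (valid route)
print(run_dfa("A", ["right"]))                      # -> None (invalid move)
```

Because the DFA's rules are fully known, checking whether a model has "recovered" the world model reduces to checking its outputs against these transitions.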
The tested transformers were generally able to generate accurate directions and valid Othello moves, but they performed poorly once the researchers added detours to the New York map. In that scenario, every model failed to "read" the detours, proposing random flyovers that didn't actually exist or streets with "impossible" orientations.
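For intuition on why a small detour is so damaging, here is another hypothetical sketch (not the study's methodology): if a model has effectively memorized routes rather than the underlying map, closing even a tiny fraction of streets invalidates every route that relied on them.

```python
import random

# Hypothetical illustration of the "detour" idea: close a small fraction of
# streets (transitions) and re-check whether a previously valid route still works.

# A simple chain of 100 blocks: node0 -> node1 -> ... -> node100
streets = {(f"node{i}", "straight"): f"node{i+1}" for i in range(100)}
route = ["straight"] * 100  # a route that was valid on the original map

def route_is_valid(transitions, start, actions):
    """Return True only if every step of the route is a legal transition."""
    state = start
    for action in actions:
        state = transitions.get((state, action))
        if state is None:
            return False
    return True

random.seed(0)
closed = random.sample(list(streets), k=1)            # close ~1% of the streets
detoured = {k: v for k, v in streets.items() if k not in closed}

print(route_is_valid(streets, "node0", route))   # True: works on the original map
print(route_is_valid(detoured, "node0", route))  # False: one closure breaks the memorized route
```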
Generative AI performance deteriorated quickly after adding even a single detour, Vafa said. After closing just 1 percent of the possible streets on the map, model accuracy dropped from nearly 100 percent to just 67 percent. The results show that transformer-based LLMs can be accurate at certain tasks without understanding or capturing an accurate world model. Or, as computer scientist Alan Blackwell famously put it, we are just automating bulls**t over and over again.