AI basics | What it means for AI models to ‘reason’, with OpenAI’s ‘smartest’ new o3 and o4-mini models launched

On April 16, OpenAI released two new Artificial Intelligence (AI) reasoning models named OpenAI o3 and o4-mini, which the company said were the latest “in a series of models trained to think for longer before responding”. The company called them the “smartest models” it has released, “representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers”.

In training both models, the company said it used “reinforcement learning”, a technique previously used by other AI companies, including the Chinese startup DeepSeek. OpenAI has also claimed that, compared to the earlier iterations, its new models should “feel more natural and conversational, especially as they reference memory and past conversations to make responses more personalized and relevant”.

What exactly is the process underlying these improvements? And how is it different from what we have experienced with AI chatbots and programs so far? We explain.


First, why is reasoning important in the world of AI?

When Large Language Models (LLMs) such as ChatGPT and Google Gemini were first released, the allure lay in their quick and fairly coherent responses, even if these were occasionally faulty.

Essentially, these tools recognise patterns in large amounts of data and generate responses to user prompts through a series of predictions and calculations. At a basic level, they predict the next likely word in a sequence of words.


“When a chatbot begins to respond to you… It performs an absurdly large number of calculations to determine what the first word in the response should be. After it has output — say, a hundred words — it decides what word would make the most sense given your prompt together with the first hundred words that it has generated so far,” Princeton University researchers Arvind Narayanan and Sayash Kapoor write in their book AI Snake Oil: What Artificial Intelligence can do, cannot do and how to tell the difference.
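At the level of code, the idea can be sketched in a few lines. The snippet below is a toy illustration of next-word prediction with made-up probabilities, nothing like a real model's internals: given the text so far, the "model" assigns each candidate next word a probability and samples from that distribution.

```python
import random

# Toy next-word prediction (made-up probabilities, not a real model).
# Given a context, each candidate next word gets a probability, and the
# next word is sampled from that distribution.
toy_model = {
    "the cat sat on the": {"mat": 0.6, "sofa": 0.3, "moon": 0.1},
}

def predict_next_word(context: str) -> str:
    """Sample the next word from the distribution for this context."""
    distribution = toy_model[context]
    words = list(distribution.keys())
    weights = list(distribution.values())
    return random.choices(words, weights=weights)[0]

print(predict_next_word("the cat sat on the"))  # usually prints "mat"
```

A real LLM repeats this step over and over, each time feeding the prompt plus everything generated so far back in as the new context.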

Where did these LLMs get the data to do these calculations and predictions? Mainly from the internet — everything from Wikipedia articles to books. The understanding among AI companies was that one way to improve LLMs was by pumping a lot more data into them. More data could mean a better understanding of patterns, translating into more refined responses.


However, by 2024, AI companies had all but exhausted the text publicly available on the internet.

Questions then arose about the next possible step to improve LLMs. In September 2024, OpenAI released its o1 model, the first of its reasoning models, which “thinks before it answers” and “can produce a long internal chain of thought before responding to the user”. This model was trained through reinforcement learning.

What is reinforcement learning?

In Reinforcement Learning: An Introduction, computer scientists Andrew Barto and Richard S Sutton, credited with pioneering the algorithms behind reinforcement learning, write: “Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.”

They explain that every action, something as simple as making breakfast, involves assessing and interacting with one’s surroundings to produce the desired effect. Sutton and Barto developed computational algorithms in reinforcement learning in the 1980s, based on the concept of “reward”.


“The field of artificial intelligence (AI) is generally concerned with constructing agents—that is, entities that perceive and act. More intelligent agents are those that choose better courses of action. Therefore, the notion that some courses of action are better than others is central to AI. Reward—a term borrowed from psychology and neuroscience—denotes a signal provided to an agent related to the quality of its behavior. Reinforcement learning (RL) is the process of learning to behave more successfully given this signal,” notes the citation for Sutton and Barto’s Turing Award, often described as the Nobel Prize of computing, which the duo won in 2024.

“It is a little like training a dog,” Jerry Tworek, an OpenAI researcher, told The New York Times about the approach. “If the system does well, you give it a cookie. If it doesn’t do well, you say, ‘Bad dog.’”
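That analogy translates almost directly into code. Below is a minimal, hypothetical sketch of a reward-driven learning loop, not OpenAI's actual training pipeline: an agent tries actions, receives +1 (a “cookie”) or -1 (“bad dog”), and gradually shifts towards whatever earns rewards.

```python
import random

# Minimal reward-driven learning loop (a hypothetical sketch, not how
# OpenAI trains its models). The agent tries actions, receives a reward
# signal, and updates its estimate of how good each action is.
actions = ["sit", "bark", "roll_over"]
value = {a: 0.0 for a in actions}  # learned estimate of each action's worth
learning_rate = 0.1

def reward(action: str) -> float:
    """+1 is the 'cookie'; -1 is 'bad dog'."""
    return 1.0 if action == "sit" else -1.0

for _ in range(1000):
    if random.random() < 0.1:          # occasionally explore at random
        action = random.choice(actions)
    else:                              # otherwise exploit the best guess
        action = max(value, key=value.get)
    # Nudge the chosen action's value towards the reward it just earned.
    value[action] += learning_rate * (reward(action) - value[action])

print(max(value, key=value.get))  # after training: "sit"
```

Scaled up enormously, the same principle applies to language models: responses that score well against the reward signal make the model more likely to produce similar responses in future.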

So, how are reasoning models different?

Reasoning models take a more deliberate, multi-step route to answering user queries. “With previous models like ChatGPT, you ask them a question and they immediately start responding… This model can take its time. It can think through the problem — in English — and try to break it down and look for angles in an effort to provide the best answer,” OpenAI’s chief scientist Jakub Pachocki had earlier told The NYT.

Through “reasoning”, the models weigh different approaches and solutions to a prompt, recognising patterns along the way to arrive at an answer. OpenAI claims the o3 model is “ideal for complex queries”, for which “answers may not be immediately obvious”.
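In schematic terms, the difference between a classic chatbot and a reasoning model looks something like the sketch below. The function names here are hypothetical stand-ins, not OpenAI's API; the point is only the extra intermediate step.

```python
def generate(text: str) -> str:
    """Stand-in for a call to a language model; a real system would
    invoke an actual model here."""
    return f"<model output for: {text[:40]}...>"

def answer_directly(prompt: str) -> str:
    # Classic chatbot behaviour: start producing the answer immediately.
    return generate(prompt)

def answer_with_reasoning(prompt: str) -> str:
    # A reasoning model first produces an internal chain of thought,
    # breaking the problem down and weighing different approaches...
    chain_of_thought = generate(prompt + "\nLet's think step by step.")
    # ...and only then writes the final answer, conditioned on that thinking.
    return generate(prompt + "\n" + chain_of_thought + "\nFinal answer:")
```

The extra “thinking” tokens cost time and computation, which is why such models are pitched at harder problems rather than quick lookups.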


The jury is still out on whether this means these AI systems “reason” or “think” like humans, and whether this is the path to AI’s next frontier. But for now, it appears to be the approach AI research companies are taking in their quest for constant improvement.


