GPT-5 ≈ o3.1! OpenAI Details Thinking Mechanism: RL + Pre-training is the Path to AGI

Author

David Kowalski · Developer Tools & Agents Editor

Coding agents and IDE workflows tested the way working teams use them.

If you are a developer waiting for the next leap in reasoning capabilities, you might be wondering if the gap between standard chat models and specialized “thinking” models has finally closed. That’s the core tension Jerry Tworek addresses in his first podcast interview as OpenAI’s Vice President of Research. In his view, GPT-5 is more of an iteration on o3 than a departure from it.

Tworek, one of the lead architects behind the o1 model, frames GPT-5 as effectively being “o3.1.” He argues that compared to GPT-4, this new iteration focuses on scaling what already works: greater capabilities, longer reasoning times, and the ability to autonomously interact with multiple systems. What OpenAI aims to do next is create another “o3 miracle”—building a model with these expanded powers rather than reinventing the wheel entirely.

I think treating GPT-5 as o3.1 suggests incremental scaling rather than an architectural revolution. As a builder, I would expect longer latency on deeper reasoning tasks in this release.

During the hour-long interview, Tworek spoke extensively about his thoughts on the GPT series of models. He discussed the evolution from o1 to GPT-5, detailing OpenAI’s model inference processes, internal company structure, and the significance of reinforcement learning (RL) to the company. He also shared personal anecdotes about joining OpenAI and his views on the path toward Artificial General Intelligence (AGI).

If you showed today’s ChatGPT to someone from 10 years ago, they might call it AGI.

He also specifically acknowledged the contribution of the GRPO algorithm proposed by DeepSeek, noting that it has advanced RL research in the United States. The admission highlights how open-source innovations from outside are influencing development strategies inside major closed-source labs.

Interestingly, when he mentioned that he is also a heavy “enthusiast” of ChatGPT, spending $200 monthly on subscriptions, netizens pointed out an amusing detail:

Who would have thought? Even OpenAI employees have to pay for ChatGPT. (doge emoji)

The interview is dense with information and highly recommended. Tworek himself posted on social media:

If you want a deep dive into RL, this podcast is not to be missed.

How GPT-5 Thinks

I followed the release details on how OpenAI is handling model reasoning, a topic that has piqued everyone’s curiosity. Host Matt Turck began by posing the core question:

What exactly are they thinking about when we chat with ChatGPT?

In simple terms, this boils down to understanding what model reasoning is.

Jerry Tworek immediately hit the nail on the head: the process of model reasoning can be likened to human thought. At its core, both involve seeking answers to unknown questions, which may entail performing calculations, retrieving information, or engaging in self-learning.

This reasoning process is specifically manifested in Chain of Thought (CoT). Since OpenAI released the o1 model, this concept has become widely known.

It involves expressing the model’s thinking process in natural human language. The entire workflow entails training a language model on vast amounts of human knowledge to learn how to think like humans, and then “translating” that reasoning back into human-readable text via Chain of Thought.

In the early days, triggering Chain of Thought required prompts such as “Let’s solve this step by step.” Asked directly, the model might fail to reason its way to the answer; instructed to proceed step by step, it generates a chain of intermediate thoughts that ultimately leads to a result.
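
The prompt-based trigger described above can be sketched in a few lines. This is only an illustration of the prompting pattern: `build_prompt` and the `Q:`/`A:` layout are hypothetical choices made for this example, and no model is called.

```python
def build_prompt(question: str, step_by_step: bool = False) -> str:
    """Return a plain prompt, optionally ending with a chain-of-thought trigger."""
    prompt = f"Q: {question}\nA:"
    if step_by_step:
        # The trigger phrase quoted in the article; exact wording varied in practice.
        prompt += " Let's solve this step by step."
    return prompt


QUESTION = "A train travels 120 km in 1.5 hours. What is its average speed?"

direct = build_prompt(QUESTION)                   # asks for the answer outright
cot = build_prompt(QUESTION, step_by_step=True)   # nudges the model to reason first
print(cot)
```

The only difference between the two prompts is the trailing phrase, which is why early chain-of-thought results were so striking: the model was unchanged, only the request was.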

In general, the longer a model spends reasoning, the better the results tend to be.

However, OpenAI discovered through actual user feedback that most users dislike waiting for extended periods. This constraint has influenced their decision-making regarding model development strategies.

Currently, OpenAI makes both high-reasoning and low-reasoning models available to users simultaneously, returning the choice of reasoning duration to the user while internally experimenting with heuristic methods to find an optimal balance.

The origin of OpenAI’s reasoning models traces back to o1.

This was OpenAI’s first officially released reasoning model.

However, Jerry, as the lead on o1, candidly acknowledged that o1 excelled primarily at solving puzzles. Therefore, rather than being a truly useful product, it served more as a technical demonstration.

The landscape changed with the arrival of o3, which represented a structural shift in AI development.

o3 is genuinely useful; it skillfully utilizes tools and contextual information from various sources, demonstrating persistence in digging for answers during its reasoning process.

Jerry himself only began to fully trust reasoning models starting with o3.

In a sense, GPT-5 can be viewed as an iteration of o3—essentially o3.1—sharing the same lineage in its thinking process.

Looking ahead, OpenAI will continue to pursue the next major leap: developing reasoning models that are more capable, think more effectively, and operate with greater autonomy.

Joining OpenAI Was a Natural Progression

For developers watching the reasoning arms race, the step from o3 to GPT-5 feels less like a marketing sprint and more like an architectural inevitability. Jerry Tworek, a key figure driving OpenAI’s reasoning models, describes his path to this moment not as a sudden stroke of genius, but as a crystallization of intent.

He likens the process to the formation of a crystal: the innate desire to pursue scientific research became increasingly clear throughout his education and career, until the moment OpenAI emerged, signaling that the time was right.

This journey began in his childhood. Growing up in Poland, Jerry displayed talents surpassing those of his peers, particularly in mathematics and science. As he puts it:

They were things that naturally suited me.

At 18, aiming to become a mathematician, he enrolled at the University of Warsaw to study math, driven by a thirst for truth. However, due to his “rebellious” nature and weariness with the rigidity and strictness of academia, he abandoned this ideal.

To support his family, he decided to become a trader, leveraging his mathematical skills for a living. He interned in JPMorgan Chase’s equity derivatives trading department before leaving to co-found a hedge fund.

A few years later, growing weary of trading work and facing a career bottleneck, he sought a new direction.

The status quo was broken by the emergence of DeepMind’s DQN agent. Jerry became deeply fascinated by reinforcement learning; previously, he had believed that classifiers lacked true intelligence, but DQN demonstrated the ability to learn complex behaviors.

Consequently, he joined OpenAI in 2019. Initially, he worked on robotics projects, focusing on dexterous manipulation. This was OpenAI’s famous “Solving Rubik’s Cube with a Robot Hand” project, a showcase of reinforcement learning through interaction with simulated environments.

Subsequently, as is well known, he led the o1 project and drove advancements in OpenAI’s model capabilities. Currently, his primary role involves collaborating with other researchers to brainstorm and refine research plans.

According to Jerry, the internal structure at OpenAI is quite unique, combining top-down direction with bottom-up freedom.

Specifically, the company focuses on three or four core projects, concentrating its efforts and resources. Within these projects, researchers enjoy relative bottom-up autonomy.

The entire research department comprises approximately 600 people, yet everyone is aware of all project details. OpenAI believes that the risk of information silos keeping researchers from making optimal decisions far outweighs the risk of intellectual property leakage.

OpenAI’s ability to rapidly release products—moving from o1 to GPT-5 in just one year—is ultimately attributed to its robust operational structure, significant momentum, and the high output efficiency of top-tier talent. Everyone believes in the significance of their work:

AI will only be built and deployed once in history.

Additionally, employees extensively use internal tools. Jerry himself is a heavy user of ChatGPT, paying for it monthly. Tools like Codex are also widely used for internal code development.

Why Reinforcement Learning Matters Now

As someone who spends their days wrestling with agent loops and evaluation harnesses, I’ve watched the industry pivot from pure pre-training to hybrid systems. Jerry’s perspective on OpenAI’s trajectory highlights a critical shift: reinforcement learning (RL) isn’t just an add-on; it’s the mechanism that turns raw language models into reliable tools.

For Jerry himself, reinforcement learning (RL) was the key that led him into OpenAI. Looking at the company as a whole, RL has also been pivotal in several major turning points.

Today’s language models can be described as a combination of pre-training and reinforcement learning: first pre-train the model, then apply reinforcement learning on top of it. Both components are indispensable. This hybrid approach has been the core of OpenAI’s research strategy since 2019.

However, to better understand RL’s role at OpenAI, one must first clarify what RL actually is.

Jerry likens RL to training a dog: when the dog behaves well, it receives a “reward” (such as a treat or a smile); when it misbehaves, it faces a “punishment” (like having its attention diverted or being shown displeasure).

RL functions similarly within models: positive rewards are given for correct behavior, while negative rewards follow incorrect actions. The key elements here are the policy and the environment:

  • Policy: Refers to the model’s behavior—a mathematical function that maps observations to actions.
  • Environment: Everything the model perceives. It must be interactive, evolving in response to the model’s actions. For example, when learning to play the guitar, the feedback comes from the sounds produced by plucking the strings. RL is the sole mechanism for teaching models how to respond to environmental changes.
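
To make the two terms concrete, here is a minimal sketch built on the guitar analogy. Everything in it is an assumption chosen for illustration: `TuningEnv` is a made-up toy environment in which a string is a few steps out of tune, and the learning rule is plain tabular Q-learning, a textbook method and not anything the interview attributes to OpenAI.

```python
import random

ACTIONS = (-1, +1)  # loosen or tighten the string by one step


class TuningEnv:
    """Toy environment: a guitar string whose pitch is off by an integer number of steps."""

    def __init__(self, start_offset: int = 3, limit: int = 5):
        self.start_offset = start_offset
        self.limit = limit
        self.offset = start_offset

    def reset(self) -> int:
        self.offset = self.start_offset
        return self.offset

    def step(self, action: int) -> tuple[int, float, bool]:
        """Apply an action; return (observation, reward, done)."""
        before = abs(self.offset)
        self.offset = max(-self.limit, min(self.limit, self.offset + action))
        # Positive reward for getting closer to in tune, negative otherwise.
        reward = 1.0 if abs(self.offset) < before else -1.0
        return self.offset, reward, self.offset == 0


def train(env: TuningEnv, episodes: int = 200, alpha: float = 0.5,
          gamma: float = 0.9, epsilon: float = 0.2, seed: int = 0) -> dict:
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(-env.limit, env.limit + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = env.reset()
        for _ in range(50):  # cap episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def policy(q: dict, observation: int) -> int:
    """The policy: a function that maps an observation to an action."""
    return max(ACTIONS, key=lambda a: q[(observation, a)])


env = TuningEnv()
q_table = train(env)

# Greedy rollout with the learned policy.
obs, done, actions = env.reset(), False, []
while not done and len(actions) < 20:
    action = policy(q_table, obs)
    obs, _, done = env.step(action)
    actions.append(action)
print(actions)
```

The environment changes only in response to what the policy does, and the policy improves only through the rewards the environment returns. That feedback loop is the part pre-training alone does not provide.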

DeepMind’s DQN later elevated RL to a new stage—Deep RL—by combining neural networks with reinforcement learning, giving rise to truly meaningful agents.

Jerry also shared a story about GPT-4 shortly after its initial training. At that time, internal teams were dissatisfied with its performance because GPT-4 lacked coherence in longer responses.

This issue was eventually resolved through RLHF (Reinforcement Learning from Human Feedback), which involves humans providing feedback on the model’s outputs to serve as rewards.
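
The interview does not spell out how human feedback becomes a reward. One widely used approach, assumed here purely for illustration, is to train a reward model on pairwise comparisons: a human picks the better of two responses, and the reward model is penalized when it scores the rejected one higher (the Bradley-Terry formulation). The function name and the scalar scores below are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response scores higher.

    Under the Bradley-Terry model, P(chosen beats rejected) = sigmoid(margin),
    so the loss is -log(sigmoid(margin)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); split by sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model agrees with the human: small loss.
agree = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
# The reward model disagrees with the human: large loss.
disagree = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agree < disagree)
```

Once trained, such a reward model can score new outputs, which gives the RL step a reward signal without a human rating every sample.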

It was precisely this integration of RLHF into GPT-4 that gave the world the “ChatGPT moment.”

Jerry attributed OpenAI’s recent, unexpectedly strong performance in programming competitions to researchers’ long-standing use of programming puzzles as testbeds for their RL ideas.

What began as an incidental effort bore fruit: during their research into RL, they also secured milestone achievements for OpenAI.

Thus, as long as current results can be evaluated and feedback signals calculated, RL can be applied to any domain—even when answers are not simply right or wrong.
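
Programming puzzles show why a computable feedback signal is enough. The sketch below assumes a hypothetical puzzle (square an integer) and a hypothetical scoring rule (fraction of test cases passed), so the reward is graded, not simply right or wrong. It is not a description of OpenAI's actual evaluation setup.

```python
from typing import Callable, Sequence


def graded_reward(candidate: Callable[[int], int],
                  cases: Sequence[tuple[int, int]]) -> float:
    """Fraction of (input, expected) cases the candidate gets right, in [0.0, 1.0]."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for arg, expected in cases:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


# Hypothetical puzzle: return the square of n.
CASES = [(0, 0), (1, 1), (2, 4), (-3, 9)]


def correct(n: int) -> int:
    return n * n


def partial(n: int) -> int:
    return n * abs(n)  # wrong sign for negative inputs


def broken(n: int) -> int:
    return n // 0  # always raises ZeroDivisionError


for solution in (correct, partial, broken):
    print(solution.__name__, graded_reward(solution, CASES))
```

A partially correct solution earns partial credit, so the learner gets a usable signal long before it produces a fully correct program.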

However, scaling up RL remains challenging because it is prone to numerous potential errors in practice. Compared to pre-training, RL involves more bottlenecks and failure modes.

It is an extremely delicate process. By analogy, RL is to pre-training what manufacturing semiconductors is to producing steel: far more complex.

Additionally, Jerry expressed approval of GRPO (Group Relative Policy Optimization), a new reinforcement learning algorithm proposed by the DeepSeek team:

The open-sourcing of GRPO allows many U.S. labs lacking advanced RL research projects to launch and train reasoning models more quickly.
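
The core idea behind GRPO's name can be sketched briefly: sample several responses to the same prompt, score each one, and judge every response relative to its own group instead of against a separately learned value function. The snippet below shows only that normalization step. The function name, the use of the population standard deviation, and the small `eps` term are assumptions for this example, and the full algorithm's clipped policy update and KL penalty are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; two were judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If all answers score the same, no answer is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Responses that beat their group average get a positive advantage and are reinforced; those below it are discouraged. Dropping the separate value model is part of what makes the method cheaper to run, which fits Jerry's point about labs being able to start training reasoning models more quickly.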

Reinforcement Learning + Pre-training is the Correct Path to AGI

Jerry Tworek’s closing thoughts on AI’s future center on a simple workflow reality: we need tools that can handle long-horizon tasks, not just instant answers. He argues that Agents are the vehicle for this shift, allowing models to solve human problems through automation rather than mere information retrieval.

Currently, most models respond in minutes. But internal tests reveal they can engage in independent thinking for 30 minutes, an hour, or longer. The challenge isn’t capability—it’s product design. We need interfaces that properly deploy these extended reasoning processes. Agents driven by foundational reasoning enable this depth, tackling complex workflows like programming, travel booking, and design. The agentification of AI is an inevitable trend.

Model alignment remains a public concern, specifically regarding how we guide behavior to match human values. Jerry frames this not as a philosophical debate, but as an engineering problem: alignment issues are essentially reinforcement learning (RL) problems. For models to make correct choices, they must deeply understand their actions and consequences. This is an endless process because alignment evolves alongside human civilization itself.

If we aim for AGI, current pre-training and RL are undoubtedly essential, though future iterations will integrate additional elements. Jerry explicitly rejects the industry view that “pure RL is the only path to AGI.” He insists on a symbiotic relationship:

RL requires pre-training to succeed, and pre-training also requires RL to succeed; neither can work without the other.

While he hesitates to predict exactly when models will achieve self-improvement without significant external input or human intervention, he remains confident that OpenAI is currently on the right track. Future progress will come from adding new complex components rather than overturning existing architectures.

Personally, I think long-horizon agents are necessary for real dev workflows, not just chat. RL and pre-training must coexist; one cannot replace the other in production.
