The emergence of “Absolute Zero” from Tsinghua University and the Institute for Computing Technology (ICT) signals a potential shift in how Asian research hubs are approaching model efficiency, challenging the global reliance on massive external datasets.
Can pre-trained large language models learn reasoning through self-play without introducing external data? This question drives recent work by researchers from Tsinghua University, the Institute for Artificial Intelligence (THUAI), and Pennsylvania State University, who have proposed a training method called “Absolute Zero.”
I followed the release of this methodology, which enables large models to acquire reasoning capabilities by generating and solving tasks based on specific reasoning objectives. This approach suggests that internal simulation might eventually rival external supervision in certain domains.

In tests, models trained with “Absolute Zero” outperformed those trained using expert-labeled samples. Notably, the “Absolute Zero” method requires training only in a code environment yet yields significant improvements in mathematical reasoning. This efficiency could lower barriers for institutions lacking access to curated human feedback data.

The research has sparked discussion on Reddit, with users reposting the findings and exclaiming: “Has self-evolving AI been unlocked?” While online enthusiasm is high, I note that scaling this technique beyond code and math remains unproven.

Self-Learning Through Problem Generation and Solving
The ripple effect of this research extends beyond Beijing’s academic circles. When Chinese institutions pioneer data-free training loops, they challenge the global reliance on massive curated datasets. I think this shift could reduce dependency on Western-controlled data pipelines in the long term.
I followed the release from Tsinghua University and the Institute of Computing Technology (ICT), which details a method called “Absolute Zero.” It employs a self-play learning paradigm where a unified language model assumes two roles: Proposer and Solver. The Proposer generates new reasoning tasks, while the Solver tackles them. Through this alternation, the model autonomously constructs a distribution of learning tasks and enhances its reasoning capabilities without external data.

“Absolute Zero” represents all reasoning tasks uniformly as triplets of $(p, i, o)$: program, input, and output. The program is executable code, the input is data fed into it, and the output is the result. This formalization transforms abstract reasoning into concrete programming problems, allowing language models to generate and manipulate code for task generation and solving.
Based on which elements of $p$, $i$, and $o$ are known, the method categorizes tasks into three types:
- Abduction Tasks: Given $p$ and $o$, find possible values for $i$. These assess reverse-engineering skills and code semantics understanding.
- Deduction Tasks: Given $p$ and $i$, determine $o$. These evaluate the ability to execute and comprehend code logic.
- Induction Tasks: Given a set of $(i, o)$ examples, find a unified program $p$. These test pattern summarization and code generation.

Before self-play training begins, “Absolute Zero” requires an initial seed set of tasks. This is generated by having the foundational language model produce valid code triplets $(p, i, o)$. If the base model is strong enough, this step may be omitted. When the seed set is empty, the system uses a predefined “zero triplet,” essentially a simple identity function.

In each iteration, the Proposer generates a new reasoning task based on the current existing set and specified type. It samples historical examples as references and leverages generative capabilities to produce a new $(p, i, o)$ triplet:
- For abduction tasks, $p$ and $o$ are generated, but not $i$.
- For deduction tasks, $p$ and $i$ are generated, but not $o$.
- For induction tasks, a set of input-output pairs $(i, o)$ is generated, but not $p$.
For induction tasks specifically, the Proposer samples a program $p$ from historical abduction and deduction tasks, then generates $N$ matching input-output pairs $(i, o)$ along with a natural language description. This provides richer contextual information, helping the Solver better understand and resolve them.
The Proposer attempts to control difficulty and novelty to ensure tasks are meaningful for the current Solver. “Absolute Zero” introduces “learnability” to estimate learning value by having the Solver attempt the task and recording its success probability. Tasks that are too easy or difficult have low learnability; the goal is moderate learnability.

Newly generated tasks are sent to an independent code executor for verification. The executor runs the program and checks for syntax correctness in a Python interpreter.
Tsinghua & Institute of Computing Technology Unveil ‘Absolute Zero’ Training Method: Self-Play Enables Reasoning in Large Models Without External Data
From an APAC angle, this self-play framework challenges the industry’s reliance on massive external datasets for reasoning capabilities.
Globally, the method’s determinism offers a potential path toward more transparent and auditable AI systems.
I think by removing external data, this approach could reshape how Asian labs compete in foundational model efficiency.
The verification process acts as a strict gatekeeper for the “Absolute Zero” system. I followed the release notes to understand how the executor filters tasks before they enter the training loop. The program must pass three specific checks: it cannot use unsafe operations or libraries, such as file I/O or system calls; it must be deterministic, producing identical outputs for identical inputs with no randomness; and it must satisfy basic logical constraints. By passing these three checks, the executor filters out most invalid or harmful tasks.
For tasks that pass verification, the executor also calculates a “learnability reward” to provide feedback on the Proposer’s performance. Finally, all verified tasks are stored in a task buffer pool for subsequent training use.
After filtering reasoning tasks, “Absolute Zero” switches to the Solver role to begin solving them. The specific approach varies depending on the task type:
-
For abduction tasks, the Solver infers possible values for $i$ given $p$ and $o$. This process resembles “reverse-executing” the program.
-
For deduction tasks, the Solver deduces $o$ from $p$ and $i$. The Solver must simulate the program’s execution to derive the final output.
-
For induction tasks, the Solver infers a possible program $p$ from input-output pairs $(i, o)$. The Solver needs to summarize general patterns from limited samples.
During task solving, the Solver can leverage existing knowledge in the language model (such as common algorithmic patterns and programming conventions) to assist in resolution.
The solutions generated by the Solver are verified again by the code executor. The executor checks whether the input, output, or program provided by the Solver truly satisfies the task requirements.
If satisfied, the task is considered successfully solved by the Solver, and a corresponding reward is granted; otherwise, it is deemed a failure, with no reward or a penalty applied.
This reward signal serves as feedback for the Solver’s behavior, helping it learn how to better solve various types of reasoning tasks.
Simultaneously, the Solver’s solutions are recorded as references for future generation and solving of similar tasks.

At the end of each iteration, “Absolute Zero” uses the feedback signals collected by both the Proposer and Solver to jointly optimize and update the entire model. This ensures that tasks generated by the Proposer are more conducive to learning, while the Solver’s ability to solve tasks becomes increasingly robust.
After multiple iterations, “Absolute Zero” eventually converges to a strong equilibrium point where the tasks generated by the Proposer perfectly match the Solver’s capabilities, and the Solver can acquire sufficient knowledge from these tasks.
Coding and Math Gains Without External Data
The results from Tsinghua University and the Institute of Computing Technology show that “Absolute Zero” can boost reasoning capabilities without relying on external datasets. I followed the release notes closely, noting how self-play mechanisms are reshaping training efficiency in the region.
For coding tasks, researchers evaluated performance across three specific datasets: HumanEval+, MBPP+, and LCB. The data shows a clear uplift for Qwen-2.5-7B-Coder when compared to versions not trained with this method. Specifically, the pass rate on HumanEval+ rose from 80.5% to 83.5%. On MBPP+, it moved from 69.3% to 69.6%, and on LCB, it jumped significantly from 19.9% to 31.7%.
From an APAC angle, self-play reduces dependency on curated data, a shift that could lower costs for global AI labs.
Mathematical reasoning showed even more dramatic gains across six representative datasets: AME’24, AME’25, AMC’23, MATH500, Minerva, and Olympiad. “Absolute Zero” achieved an average accuracy of 39.1% across these benchmarks. This represents a 15.2 percentage point improvement over the baseline without the method.
The gains were particularly sharp on specific challenges. On the MATH500 dataset, accuracy hit 72.6%, surpassing the baseline by 22.6 percentage points. Similarly, on AMC’23, it achieved 57.5% accuracy, exceeding the baseline by 17.5 percentage points.

Beyond the primary Qwen model, I noted that researchers tested “Absolute Zero” on several other pre-trained language models to verify scalability. The results suggest a strong correlation between model size and improvement potential.
- Qwen-2.5-3B-Coder: Coding pass rates increased from 51.2% to 54.9%, while math accuracy rose from 18.8% to 26.5%.
- Qwen-2.5-14B-Coder: Coding pass rates increased from 60.0% to 63.6%, and math accuracy surged from 20.2% to 43.0%.
- Llama-3.1-8B: Coding pass rates increased from 28.5% to 31.6%, with math accuracy rising from 3.4% to 6.8%.
Globally, larger models benefit disproportionately, reinforcing the trend toward scaling compute for marginal reasoning gains.
Testing across different model sizes and types revealed that performance improvements are positively correlated with model scale. Models with more parameters exhibit greater post-training gains using this method. For example, in math tasks, the 3-billion-parameter Qwen-2.5-3B-Coder improved by 7.7 percentage points, while the 14-billion-parameter Qwen-2.5-14B-Coder improved by 22.8 percentage points.
This indicates that “Absolute Zero” effectively leverages the inherent capabilities of large models to achieve higher gains in reasoning performance without external supervision.

The full technical details are available in the paper linked below.
Paper Link: https://arxiv.org/abs/2505.03335
Comments
Sign in to join the discussion and leave a comment.
Sign in with Google