Winning an IMO Gold Medal with Prompt Engineering: Tsinghua Alumni Uncover New Findings, Proving Academia Can Rival Tech Giants Without Heavy Spending

Models & Benchmarks · Published: Aug 02, 2025 · Marcus Reeves · ~12 min read

Author

Marcus Reeves · Senior AI Industry Correspondent

Frontier models, chips, and how capital markets price AI infrastructure.

Prompt Engineering Beats Compute: Tsinghua Alumni Crack IMO with Gemini 2.5 Pro

This quarter, the market narrative shifts from raw silicon spend to algorithmic leverage. Two Tsinghua alumni have demonstrated that foundational models can hit International Mathematical Olympiad (IMO) gold standards without heavy infrastructure investment. The implication is stark: if reasoning capabilities are unlocked by prompt architecture rather than parameter scale, the valuation of pure compute-heavy vendors faces immediate pressure.

Lin Yang and Yichen Huang have joined forces to deploy a self-iterative verification process on Google’s Gemini 2.5 Pro. Their method relies on optimized prompts rather than model retraining. This approach successfully solved this year’s IMO problems, challenging the assumption that only massive spending yields elite reasoning performance.

Honestly, prompt engineering is becoming a competitive moat as valuable as hardware acquisition for enterprise AI buyers.

They recently updated their code to show that general-purpose prompts can significantly enhance model reasoning. The core finding suggests we have been misled about Large Language Models (LLMs). Foundational models may already possess elite-level ability in complex mathematical reasoning, but they need specific prompting and verification scaffolding to access it.

Using these models directly yields poor results. MathArena tested this year’s IMO questions on Gemini 2.5 Pro without the new verification layer. The model scored only 13 points, falling well below the IMO bronze medal threshold of 19/42. This gap highlights that raw inference power is insufficient for high-stakes logical tasks.
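To make the gap concrete, here is a minimal sketch of the scoring arithmetic. The 19-point bronze cutoff and the 42-point maximum come from the text above, and 7 points per problem follows from six problems sharing 42 points. The silver and gold cutoffs (28 and 35) are my assumption based on the published IMO 2025 cutoffs, and the name `medal_for` is purely illustrative.

```python
# IMO scoring arithmetic. Only the bronze cutoff (19) and the 42-point
# maximum appear in the article; the silver and gold cutoffs below are
# assumptions taken from the published IMO 2025 cutoffs.
POINTS_PER_PROBLEM = 42 // 6  # six problems share 42 points
CUTOFFS = [("gold", 35), ("silver", 28), ("bronze", 19)]


def medal_for(score: int) -> str:
    """Return the medal tier for a score out of 42, or 'none'."""
    for medal, cutoff in CUTOFFS:
        if score >= cutoff:
            return medal
    return "none"


baseline = 13                          # MathArena result without verification
five_solved = 5 * POINTS_PER_PROBLEM   # five complete solutions

print(baseline, medal_for(baseline))
print(five_solved, medal_for(five_solved))
```

Under these assumed cutoffs, 13 points earns no medal, while five fully solved problems (35 points) would sit at the gold line.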

I think off-the-shelf reasoning scores are misleading; buyers must evaluate verified output pipelines, not just base model benchmarks.

With iterative verification and prompt engineering combined, the gains are more than additive. The methodology has garnered recognition from Fields Medalist Terence Tao:

“I agree that rigorous verification is key to achieving outstanding performance in complex mathematical tasks.”

This validation underscores a critical shift: accuracy in high-difficulty domains depends on verification loops, not just model size.

The way I see it, rigorous verification layers are now the primary differentiator for reliable AI in academic and enterprise settings.

General Prompts + Iterative Verification

Why are AI models suddenly tackling International Mathematical Olympiad (IMO) problems? The answer is simple: traditional benchmarks like GSM8K and MATH are too easy. They test primary and secondary school logic. The IMO tests abstract thinking and multi-step reasoning. It is the “touchstone” for LLM capabilities.

Previous attempts failed. Models misunderstood requirements or showed bias. This year changed that. Both Google and OpenAI solved five problems. Google used a new Deep Think mode. OpenAI claimed breakthroughs in general reinforcement learning and compute scaling.

I read the paper by the two Tsinghua alumni. They achieved similar results without heavy spending, relying on prompt design alone.

The secret is a six-step self-verification process. I followed the release notes closely. Here is how it works:

1. Initial Solution Generation: The model generates a preliminary solution. The prompt requires clear logical reasoning and an explicit explanation at each step.
2. Self-Improvement: The model reviews its initial answer. This compensates for shortcomings caused by the limited thinking budget of the first pass.
3. Verify Solution and Generate Error Report: A verifier checks the solution against the problem. It generates a report listing key issues such as logical fallacies, factual mistakes, and incomplete arguments.
4. Review Error Report (Optional): The error report is reviewed to remove false positives. This makes the report more reliable.
5. Correct or Improve Solution Based on Error Report: The model improves its answer based on the error report, then returns it to the verification step.
6. Accept or Reject Solution: If the solution passes verification five consecutive times, it is accepted. If major issues persist after ten iterations, it is rejected.
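The control flow of these six steps can be sketched as a short loop. This is my own minimal reading, not the authors' code: the function name, the five callables, and the convention that an empty report means "no issues" are all hypothetical, and I assume "ten iterations" counts failed verification rounds.

```python
from typing import Callable, Optional

ACCEPT_STREAK = 5    # consecutive verifier passes required (from the text)
MAX_ITERATIONS = 10  # failed rounds before rejection (my reading of the text)


def solve_with_verification(
    problem: str,
    generate: Callable[[str], str],
    improve: Callable[[str, str], str],
    verify: Callable[[str, str], str],
    review: Callable[[str, str], str],
    correct: Callable[[str, str, str], str],
) -> Optional[str]:
    """Return an accepted solution, or None if the solution is rejected."""
    solution = generate(problem)             # step 1
    solution = improve(problem, solution)    # step 2
    passes = 0
    failures = 0
    while passes < ACCEPT_STREAK:
        report = verify(problem, solution)   # step 3: "" means no issues
        if report:
            report = review(solution, report)  # step 4: drop false positives
        if not report:
            passes += 1
            continue
        passes = 0                           # a failure resets the streak
        failures += 1
        if failures >= MAX_ITERATIONS:
            return None                      # step 6: reject
        solution = correct(problem, solution, report)  # step 5
    return solution                          # step 6: accept


# Toy stand-ins: each correction round removes one "flaw" from the draft.
result = solve_with_verification(
    "toy problem",
    generate=lambda p: "draft with flaw flaw",
    improve=lambda p, s: s,
    verify=lambda p, s: "flaw found" if "flaw" in s else "",
    review=lambda s, r: r,
    correct=lambda p, s, r: s.replace(" flaw", "", 1),
)
print(result)
```

In a real pipeline each callable would wrap a model call with its own role-specific prompt; the stubs here only exercise the accept and reject logic.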

This entire process runs on a solver and a verifier. Both are powered by Gemini 2.5 Pro. They use differentiated prompts for distinct roles.

The solver generates and improves answers. Its prompt design prioritizes rigor. This ensures results can be strictly verified.

Gemini 2.5 Pro has a maximum thinking token limit of 32,768. It cannot solve complex IMO problems in one go. Step 2 (self-improvement) injects an additional 32,768 tokens. The model reviews and optimizes its initial solution. This improves overall quality.

The verifier simulates an IMO grading expert. Its reports drive the iterative improvement, and its verdicts decide whether an improved solution is accepted.

It checks each solution step-by-step. It identifies issues and categorizes them into critical errors and argument gaps. Critical errors are obvious mistakes or clear logical fallacies. They severely break the logic chain of the proof, leading to incorrect answers.

Argument gaps include major gaps and minor gaps. Major gaps may cause the entire proof to fail. Minor gaps might still lead to a correct conclusion but leave the argument incomplete.
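One way to picture this taxonomy is as a small data structure. The sketch below is illustrative only: the class and field names are mine, and treating critical errors and major gaps as blocking while letting minor gaps through is my assumption about how a pass verdict could be derived.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL_ERROR = "critical error"  # breaks the proof's logic chain
    MAJOR_GAP = "major gap"            # may cause the whole proof to fail
    MINOR_GAP = "minor gap"            # conclusion may hold, argument incomplete


@dataclass
class Issue:
    step: str          # where in the proof the issue was found
    severity: Severity
    note: str          # verifier's explanation


# Assumption: only these two severities block acceptance.
BLOCKING = {Severity.CRITICAL_ERROR, Severity.MAJOR_GAP}


def passes_verification(issues: list[Issue]) -> bool:
    """A solution passes when no blocking issue is present."""
    return not any(issue.severity in BLOCKING for issue in issues)


report = [Issue("Lemma 2", Severity.MINOR_GAP, "bound stated without proof")]
print(passes_verification(report))
report.append(Issue("Step 5", Severity.CRITICAL_ERROR, "invalid inequality"))
print(passes_verification(report))
```

A report holding only a minor gap passes in this sketch, and adding a critical error flips the verdict.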

Upon identifying issues, the verifier outputs an error report. This provides useful information for the model to improve its solution. In step 4, any misjudgments by the verifier are corrected. The model then attempts to refine its answer based on the report.

Since the verifier can also make mistakes, sufficient repeated iterations are required. This mitigates the impact of false judgments. Ultimately, if the answer passes verification, it is accepted. If critical errors or major argument gaps persist, it is rejected.
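A back-of-the-envelope calculation shows why repetition helps. The sketch assumes each verification run is independent and wrongly passes a flawed proof at a fixed rate. Both are simplifications of mine, since runs of the same model on the same proof are likely correlated, and the rates below are hypothetical, not measured.

```python
def false_accept_probability(false_pass_rate: float, streak: int = 5) -> float:
    """Chance a flawed proof passes `streak` independent checks in a row."""
    if not 0.0 <= false_pass_rate <= 1.0:
        raise ValueError("false_pass_rate must be between 0 and 1")
    return false_pass_rate ** streak


# Hypothetical per-check false-pass rates, for illustration only.
for rate in (0.5, 0.2, 0.1):
    print(f"{rate:.1f} -> {false_accept_probability(rate):.5f}")
```

Under the independence assumption, even a verifier that wrongly passes one flawed proof in five would let the same flaw through five consecutive checks only about 0.03% of the time.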

The research team selected the newly released IMO 2025 questions. Because these were published only recently, using them effectively avoids training data contamination. This ensures the authenticity of the evaluation.

Honestly, prompt engineering is no longer a cheap hack; it’s becoming a compute-intensive discipline itself. I think the Tsinghua alumni’s method proves you don’t need OpenAI’s budget to hit top-tier reasoning benchmarks.

I followed the release from Tsinghua alumni detailing how they coaxed an IMO Gold Medal performance out of Gemini 2.5 Pro. The methodology is less about magic and more about rigid constraint enforcement. They didn’t just ask for answers; they engineered a verification loop that forces the model to audit its own logic before committing to an output.

Parameter Constraints and Prompt Architecture

The team selected a temperature of 0.1 to suppress randomness, which is critical when precision matters more than creativity. They set the thinking budget to Gemini 2.5 Pro’s maximum of 32,768 tokens, giving each call as much room as possible for a complex proof. Crucially, they gave the model no access to external code or other AI systems, so the results measure pure reasoning capability.
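The settings described here fit in a few lines of configuration. The sketch below deliberately uses a plain dataclass rather than any real SDK: the class name, the field names, and the model identifier string are hypothetical, and only the values are taken from the text.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the run settings; not a real SDK type."""
    model: str = "gemini-2.5-pro"   # illustrative identifier string
    temperature: float = 0.1        # from the text: suppress randomness
    thinking_budget: int = 32_768   # from the text: maximum thinking tokens
    tools_enabled: bool = False     # from the text: no external code or other AI


config = RunConfig()
print(asdict(config))
```

Freezing the dataclass is a small design choice: the same settings apply to every solver and verifier call, so nothing should mutate them mid-run.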

The initial generation prompt demanded rigorous reasoning. If the model could not find a complete solution, it was strictly forbidden from fabricating one. All mathematical content had to be written in TeX, forcing the model into a structured output mode rather than conversational filler. The required structure moved from a summary to the detailed proof, and the model had to check its draft against every instruction before producing the final output.
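As a rough illustration of how such requirements might be assembled and checked, here is a small sketch. The requirement strings are my paraphrase of the description above, not the authors' prompt text, and the section labels "Summary" and "Detailed Solution" are hypothetical placeholders.

```python
import re

# Paraphrase of the requirements described above; NOT the authors' prompt.
SOLVER_REQUIREMENTS = (
    "Reason rigorously and justify every step.",
    "If you cannot find a complete solution, do not fabricate one.",
    "Write all mathematics in TeX.",
    "Give a summary first, then the detailed proof.",
)


def build_solver_prompt(problem: str) -> str:
    """Join the requirement list and the problem into one prompt string."""
    rules = "\n".join(f"- {rule}" for rule in SOLVER_REQUIREMENTS)
    return f"Requirements:\n{rules}\n\nProblem:\n{problem}"


def follows_structure(output: str) -> bool:
    """Check that a summary precedes the detailed proof and TeX math appears."""
    summary = output.find("Summary")
    detail = output.find("Detailed Solution")
    has_tex = re.search(r"\$[^$]+\$", output) is not None
    return 0 <= summary < detail and has_tex


sample = "Summary\nWe show $n^2 \\ge 0$.\n\nDetailed Solution\nLet $n$ be real."
print(follows_structure(sample))
```

A structural check like this only catches formatting violations; judging whether the reasoning is sound is the verifier's job.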

The way I see it, low temperature and strict formatting reduce hallucination risk but don’t fix fundamental reasoning gaps.

Verification Loops and Error Detection

The verification prompt served a single purpose: identify errors without attempting correction. This separation of generation and critique is vital for reliability. The model produced a detailed log classifying issues, followed by a final verdict summary. This iterative process allowed the system to catch logical flaws that a single-pass generation would miss.

The results showed complete, mathematically rigorous solutions for five of the six IMO problems. For Problems 1 and 2, the team compared runs with and without a hint in the prompt: the Problem 1 hint suggested mathematical induction, and the Problem 2 hint suggested analytic geometry. The data suggests that detailed prompts reduce the computational search space and improve efficiency, but they do not grant new capabilities to the model itself.

Honestly, prompting optimizes existing capacity; it does not expand the model’s inherent mathematical intelligence.

Limitations and Future Multi-Agent Strategies

The sixth problem remained unsolved due to a core error in one of the proofs, which invalidated subsequent steps. This highlights the fragility of current LLM reasoning when faced with novel or highly complex logical chains. The researchers conclude that structured iterative processes are key to transforming potential into rigorous proof, overcoming limitations like finite reasoning budgets and initial answer errors inherent in single-generation attempts.

Looking ahead, the team anticipates that mixing in other models such as Grok 4 or OpenAI’s o-series, alongside multi-agent systems like Grok 4 Heavy, could yield stronger mathematical capabilities. This points toward a future where ensemble methods replace monolithic model reliance for high-stakes tasks.

I think single-model reasoning hits a hard ceiling; multi-agent ensembles are the only path to true reliability.

The Academic Moat Is Real

I read the bios of Yichen Huang and Lin Yang, and the signal is loud: elite academia still holds a structural advantage over pure tech R&D. They didn’t just write code; they built deep theoretical foundations that generalist engineers can’t replicate overnight. This isn’t about prompt tricks; it’s about physics-grade rigor applied to silicon efficiency.

The way I see it, academic pedigree creates a durable moat that vendor hype cannot easily breach. Honestly, deep theory outperforms brute-force scaling in long-term AI competitiveness.

The two authors of this study—Yichen Huang and Lin Yang—were undergraduate classmates in the Basic Science Experimental Class for Mathematics and Physics at Tsinghua University. After graduation, they both pursued advanced degrees overseas.

Yichen Huang earned his Ph.D. in Physics from the University of California, Berkeley. He previously worked as an AI researcher at Microsoft and later served as a postdoctoral fellow at the California Institute of Technology (Caltech), working under Professor Xie Chen, a leading figure in condensed matter physics.

Professor Xie Chen also completed her undergraduate studies at Tsinghua University and received her Ph.D. in Theoretical Physics from MIT in 2012. She is currently the Eddleman Professor of Theoretical Physics at Caltech.

Her research focuses on new phases and phase transitions in quantum condensed matter systems, including topological order in strongly correlated systems, many-body system dynamics, tensor network representations, and applications in quantum information.

She won a Sloan Fellowship in 2017 and received the New Horizons in Physics Prize in 2020 for her contributions to the understanding of topological states of matter and the relationships between them. This award is part of the Breakthrough Prizes, often referred to as the “Oscars” of contemporary science.

Subsequently, Yichen Huang continued postdoctoral research at the Center for Theoretical Physics at MIT and the Department of Physics at Harvard University, focusing on quantum physics, including quantum information, condensed matter theory, and machine learning.

The other author, Lin Yang, is currently an Associate Professor at the University of California, Los Angeles (UCLA), holding appointments in both the Department of Electrical and Computer Engineering and the Department of Computer Science.

Previously, he earned dual Ph.D.s in Computer Science and Physics & Astronomy from Johns Hopkins University. He also conducted postdoctoral research at Princeton University under Professor Mengdi Wang.

Professor Mengdi Wang entered Tsinghua University at age 14 and graduated with a Ph.D. from MIT at age 23. Her advisor was Dimitri P. Bertsekas, an academician of the U.S. National Academy of Engineering. At just 29, she became a tenured professor at Princeton University.

Her research primarily involves generative AI, reinforcement learning, and large language models. In 2024, she received the Donald P. Eckman Award, the highest honor in the field of control theory (awarded to only one recipient annually).

Professor Yang’s research focuses on reinforcement learning theory and applications, machine learning and optimization theory, big data processing, and algorithm design. He has published numerous papers at top-tier machine learning conferences such as ICML and NeurIPS, and has received awards including the Amazon Award for Faculty in Machine Learning and the Simons Scholar Award.

The Cost of Brilliance Is Low

I followed the release details, and what stood out to me is the capital efficiency. This team achieved top-tier results without the billion-dollar data centers typical of Silicon Valley giants. Their background suggests that intellectual capital, not just compute spend, drives the next wave of AI breakthroughs.

I think you don’t need massive capex to beat big tech if you have elite theoretical talent. The way I see it, academia offers a higher ROI on research spending than most corporate labs.

Academia’s Low-Cost Path to IMO Gold

The Resource Reality Check

I followed the release of this study by Tsinghua alumni, and what stood out immediately was the cost structure. Professor Lin Yang prioritized Gemini 2.5 Pro not for its raw power alone, but because it offered a wider range of adjustable parameters at the start of the experiment. This is a pragmatic choice for labs with constrained budgets.

The token economics are revealing. For the first five problems, the initial solve took roughly 60,000 tokens. Each subsequent verification step consumed about 15,000 tokens if it passed, or around 30,000 tokens if the solution needed modification. Because of randomness, results vary significantly: total token use per problem ranges from 300,000 to 5 million. On an unlucky run, a single problem took eight independent experiments to solve. Computation time depends on Google’s server availability; under optimal conditions, one problem can be solved in as little as ten minutes.
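These figures lend themselves to a rough estimate. The sketch below uses a simple additive model of my own: one initial solve plus a fixed cost per verification step. It ignores the self-improvement pass and any restarts, so treat it as a lower-bound illustration rather than the team's accounting. The function name and the example round counts are hypothetical.

```python
INITIAL_TOKENS = 60_000   # initial solve (from the text)
PASS_TOKENS = 15_000      # a verification step that succeeds (from the text)
FIX_TOKENS = 30_000       # a verification step that triggers a modification


def estimate_tokens(passes: int, fixes: int) -> int:
    """Rough token total for one run under a simple additive model."""
    if passes < 0 or fixes < 0:
        raise ValueError("counts must be non-negative")
    return INITIAL_TOKENS + passes * PASS_TOKENS + fixes * FIX_TOKENS


print(estimate_tokens(passes=5, fixes=0))  # 135000
print(estimate_tokens(passes=5, fixes=9))  # 405000
```

A flawless run with five straight passes comes to 135,000 tokens in this model. The reported per-problem totals are well above that, which would be consistent with many correction rounds or repeated independent experiments.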

Honestly, token efficiency matters more than model size for academic labs with limited budgets.

Prompting vs. Raw Reasoning

The difference in performance with and without a hint in the prompt is stark. With a hint, the model typically solves the problem within a single independent experiment. Without one, the model’s reasoning tends to diverge. The eight independent experiments mentioned above occurred in the absence of such hints. This suggests that current base models are not yet self-correcting enough for high-stakes math without external guidance structures.

I think prompt engineering is currently the only reliable way to stabilize LLM reasoning on complex tasks.

The Verifier Bottleneck

Problem 6 remained unsolved, and Professor Yang attributed this primarily to the verifier. When the solver outputs a false positive answer, the verifier fails to distinguish certain nuances effectively. The team has already verified the solutions manually and checked every detail of the proofs. Official scoring is still lacking, however, so Professor Yang expressed willingness to take part in the IMO’s official grading process if the organizing committee is interested, which would further validate the solutions.

The way I see it, verification logic is the weak link in current autonomous math agents.

Future Directions and Advice

The team plans to enhance the base model’s capabilities by pre-training and fine-tuning it with more training data. Professor Yang shared insights gained from this research: “Sometimes, the potential of base models needs to be unlocked through alternative methods. If future model training hits a bottleneck, Agent-based approaches could be the key breakthrough.” He hopes AI will play an increasingly important role in mathematical research in the future, particularly regarding long-standing unsolved problems.

On coexisting with AI, he replied: “Students are younger than me and may use AI more naturally, so I cannot offer much advice. However, personally, I hope to improve my own knowledge base while using AI. In short: Use it, and learn from it.”

Honestly, agent-based architectures will likely outperform raw model scaling in specialized domains like mathematics.

Paper link: https://www.alphaxiv.org/abs/2507.15855v2

References

I reviewed the codebase to verify the claims of academic parity with industry giants.

GitHub, lyang36/IMO25: An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google’s Gemini, OpenAI, and XAI APIs.