AI Models Achieve IMO Gold, Sparking Debate on Reinforcement Learning and Prompt Engineering

Artificial intelligence models from both OpenAI and Google DeepMind have reportedly achieved gold medal-level performance at the 2025 International Mathematical Olympiad (IMO), a significant milestone in AI's ability to tackle complex mathematical reasoning. The breakthrough, announced in July 2025, demonstrates advanced problem-solving in a competition known for exceptionally difficult problems that demand deep insight and creativity. Each model solved five of the six problems under the same exam conditions as human contestants, without external tools or internet access.

The achievement has ignited discussion within the AI community about the methodologies involved. In a recent social media post, user "jxmo" posited that it is "totally plausible that a model could get IMO gold without any reinforcement learning, given a perfectly-crafted prompt." The post further highlighted a perceived gap: "we just don't know, and lack tools to efficiently search through prompt space."
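
The difficulty jxmo points to can be made concrete with a toy baseline. The sketch below performs a naive random search over a tiny combinatorial prompt space; everything in it is an illustrative assumption, not anyone's actual system: the persona, strategy, and format fragments are invented, and `score_prompt` is a stand-in for the expensive step of querying a model and grading its answers.

```python
import itertools
import random

# Hypothetical prompt fragments. Real prompt space is unbounded free-form
# text, which is exactly why efficient search is an open problem.
PERSONAS = ["You are an IMO gold medalist.", "You are a careful mathematician."]
STRATEGIES = ["Try small cases first.", "Look for an invariant.", "Argue by contradiction."]
FORMATS = ["Write a complete, rigorous proof.", "Justify every step."]

def score_prompt(prompt: str, problems: list[str]) -> float:
    """Stand-in for an expensive evaluation: in practice this would query a
    model with `prompt` on each problem and grade the answers. Here it just
    returns a random number so the sketch runs without any model access."""
    return random.random()

def random_search(problems: list[str], budget: int = 10) -> tuple[str, float]:
    """Naive random search over the combinatorial prompt space."""
    space = list(itertools.product(PERSONAS, STRATEGIES, FORMATS))
    best_prompt, best_score = "", float("-inf")
    for persona, strategy, fmt in random.sample(space, min(budget, len(space))):
        prompt = " ".join([persona, strategy, fmt])
        score = score_prompt(prompt, problems)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

if __name__ == "__main__":
    prompt, score = random_search(["Problem 1: ..."], budget=10)
    print(f"best score {score:.3f} with prompt: {prompt!r}")
```

Even this toy space grows multiplicatively with each added fragment list, and each evaluation requires full model runs on hard problems, which illustrates why brute-force enumeration does not scale and why "perfectly-crafted" prompts are hard to find or rule out.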

OpenAI's experimental large language model, which achieved the feat, was described by researcher Alexander Wei as a general-purpose reasoning system. OpenAI claims the model required "very little IMO-specific work," suggesting broad natural-language training combined with advanced reinforcement learning rather than narrow domain optimization. The approach aims for generalizable reasoning rather than task-specific fine-tuning.

Google DeepMind's 'Gemini Deep Think' also achieved gold-medal performance, and the company described its methodology more explicitly. DeepMind stated that it trained Gemini with novel reinforcement learning techniques for multi-step reasoning and problem-solving. In addition, the team "provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions," indicating a role for prompt engineering.
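
As a rough illustration of what adding "general hints and tips" to a model's instructions can look like in practice, consider the minimal sketch below. The hint text and the `build_instructions` helper are assumptions made for illustration; DeepMind has not published its actual system prompt.

```python
# Illustrative general problem-solving hints, prepended to the model's
# instructions. This text is invented for the example, not DeepMind's prompt.
IMO_HINTS = """General tips for olympiad problems:
- Restate the problem in your own words before starting.
- Experiment with small cases to form a conjecture before proving it.
- Consider standard tools: induction, invariants, extremal arguments, contradiction.
- Check that the final write-up covers all cases and justifies every claim."""

def build_instructions(problem_statement: str) -> str:
    """Assemble the full instruction string sent to the model: hints first,
    then the problem, then the required output format."""
    return f"{IMO_HINTS}\n\nProblem:\n{problem_statement}\n\nProvide a complete proof."

if __name__ == "__main__":
    print(build_instructions("Let n be a positive integer such that ..."))
```

The point of the sketch is that this kind of guidance lives entirely in the prompt, separate from whatever the model learned during reinforcement learning, which is what makes the two contributions hard to disentangle from the outside.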

The differing accounts underscore the ongoing debate over the most effective paths to advanced AI capability. While both companies leveraged reinforcement learning to some degree, OpenAI's emphasis on a general-purpose model contrasts with DeepMind's explicit mention of curated data and prompt-based guidance. The contrast highlights how difficult it is to disentangle the contributions of different training paradigms, and how few tools exist for efficiently exploring the vast space of possible prompts.