OpenAI's o3 Model Achieves 26.6% Accuracy on 'Humanity's Last Exam' via Advanced End-to-End Reinforcement Learning for Tool Use

OpenAI Research Scientist Xikun Zhang recently provided insight into the evolving training paradigms for advanced artificial intelligence models, emphasizing the continued development of end-to-end reinforcement learning for tool use. In a social media post on July 18, 2025, Zhang stated, "It is not a direct finetune of o3. But we do continue the paradigm of teaching tool use through end-to-end RL as done by o3. GPT-5 will be different. It is coming!" His remarks underscore OpenAI's strategic focus on building more autonomous and capable AI agents.

The o3 model, part of OpenAI's "o-series" of reasoning models, became generally available on April 16, 2025, and represents a significant leap in AI capabilities. These models are engineered to "think longer before responding," and they excel at complex tasks that demand deep analytical thinking and problem-solving. They integrate various tools, including web search, Python execution, and image analysis, directly into their reasoning process.

A core innovation behind the o-series, including o3, is training through end-to-end reinforcement learning (RL) on challenging browsing and reasoning tasks. This method enables the models to determine autonomously when and how to use tools, which makes them more adaptable in open-ended scenarios. OpenAI's "Deep Research" product, launched in February 2025, is a prime example of this paradigm: it is powered by a fine-tuned version of o3 designed for comprehensive online investigation.

This training approach has produced strong benchmark results. The model powering Deep Research achieved 26.6% accuracy on "Humanity's Last Exam," a rigorous evaluation that tests AI on a broad range of expert-level questions. The o3 model itself scored 69.1% on the SWE-bench Verified software engineering benchmark and 91.6% on the AIME 2024 mathematics competition, showing proficiency across diverse technical domains.

Zhang's post also hints at the future, noting that "GPT-5 will be different." He did not say how. The paradigm of end-to-end RL for tool use established with o3 will continue to shape development, but the upcoming GPT-5 model is expected to depart from it in some respects. This may point to a convergence of the specialized reasoning capabilities found in the o-series with the natural conversational abilities of the GPT series, moving toward more sophisticated and integrated agentic systems.