Open-Source AI Agent X-Master Achieves 32.1% on Humanity's Last Exam, Outperforming Closed Rivals


A new open-source AI agent, X-Master, has demonstrated a significant leap in performance on the challenging Humanity's Last Exam (HLE) benchmark, achieving a score of 32.1%. This result, detailed in a paper titled "SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can They Lead on Humanity's Last Exam?", marks the first time a system has surpassed the 30% threshold on HLE, outperforming leading closed-source models by approximately five points. The development offers a replicable method for enhancing language models without requiring extensive retraining.

X-Master operates as a wrapper designed to augment open-source language models, such as DeepSeek-R1, enabling them to interact with external data and execute code. The core mechanism involves the model generating Python code snippets, which are then executed in a sandbox environment. The results of these executions, whether from mathematical computations, web scraping, or ad-hoc tools, are fed back into the model's context, effectively acting as extended memory.
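The generate-execute-feedback loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `run_in_sandbox` helper, the `<code>`/`<result>` context tags, and the stub model are all assumptions made for the example, and a real deployment would isolate execution far more strictly than `exec()` does.

```python
import io
import contextlib


def run_in_sandbox(code: str) -> str:
    """Execute a generated Python snippet and capture its stdout.
    (Illustrative only -- a production sandbox would run this in an
    isolated process or container, not via exec().)"""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue().strip()


def agent_step(model, context: str) -> str:
    """One iteration of the wrapper loop: the model emits a code snippet,
    the sandbox executes it, and the result is appended to the context,
    acting as extended memory for the next generation turn."""
    snippet = model(context)          # model proposes a code snippet
    result = run_in_sandbox(snippet)  # execute it outside the model
    return context + f"\n<code>{snippet}</code>\n<result>{result}</result>"


# Stub standing in for a model like DeepSeek-R1: it always proposes
# an arithmetic check for the question in context.
stub_model = lambda ctx: "print(17 * 19)"
new_context = agent_step(stub_model, "Q: What is 17 * 19?")
```

Because the execution result lands back in the context, the next model call can reason over concrete computed values rather than re-deriving them in-token.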

To bolster accuracy and overcome single-shot errors, the researchers implemented a "scattered-and-stacked" agentic workflow, referred to as X-Masters. This process involves launching multiple "Solver" copies to generate diverse solutions. These solutions are then refined by "Critic" agents, followed by "Rewriter" agents that polish the batch, and finally, a "Selector" agent chooses the optimal answer. This iterative cleanup process, echoing reinforcement learning rollouts, significantly enhances the model's reasoning capabilities without altering its foundational weights.
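The four-stage workflow above can be expressed as a short pipeline. The function signatures, the toy agent stand-ins, and the majority-vote selector below are assumptions for illustration; in the actual system each role would be a separate LLM call, and the Selector would itself reason over the candidates.

```python
from collections import Counter


def x_masters(question, solve, critique, rewrite, select, n_solvers=4):
    """Scattered-and-stacked workflow: fan out multiple Solvers for
    diverse drafts, then stack Critic -> Rewriter -> Selector passes
    over the candidate pool."""
    drafts = [solve(question, seed=i) for i in range(n_solvers)]  # scatter
    reviewed = [critique(question, d) for d in drafts]            # Critics refine
    polished = [rewrite(question, d) for d in reviewed]           # Rewriters polish
    return select(question, polished)                             # Selector decides


# Toy stand-ins for the four agent roles (each would be an LLM call):
solve = lambda q, seed: f"answer-{seed % 2}"             # diverse candidate answers
critique = lambda q, draft: draft                        # pass-through critic
rewrite = lambda q, draft: draft.upper()                 # cosmetic polish
select = lambda q, ds: Counter(ds).most_common(1)[0][0]  # majority vote

best = x_masters("example question", solve, critique, rewrite, select)
```

Note that only the orchestration changes between stages; the underlying model weights are never touched, which is what makes the recipe cheap to replicate.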

The impact of X-Master is evident in DeepSeek-R1's performance on Humanity's Last Exam, which saw its score increase from 17.7% to 32.1%. Humanity's Last Exam, developed by the Center for AI Safety and Scale AI, is a rigorous, multi-modal benchmark comprising 2,500 questions across various subjects designed to test AI systems at the frontier of human knowledge. Its creation addressed the "benchmark saturation" issue, where previous AI models achieved near-perfect scores on older tests, making it difficult to measure true progress.

AI commentator Rohan Paul highlighted the accessibility of this advancement, noting that "Anyone can replicate the recipe without touching model weights." This open-source approach aims to democratize access to cutting-edge AI capabilities, fostering broader participation and innovation in the field beyond resource-rich industry labs. The result suggests a promising path for open-source models to compete with, and even surpass, their proprietary counterparts on complex scientific reasoning tasks.