Grok-4 Achieves 44% on "Humanity's Last Exam," Sparking AI Capability Debate


A recent social media post by Chris Paxton highlights a significant development in artificial intelligence: a large language model (LLM) achieved "44% on humanity's last exam," a score he described as "impressive." This refers to "Humanity's Last Exam" (HLE), a rigorous benchmark designed to push the boundaries of AI evaluation.

Humanity's Last Exam (HLE) is a challenging, multi-modal benchmark developed jointly by the Center for AI Safety and Scale AI. It was created in response to the "saturation" of existing LLM benchmarks, where state-of-the-art models were consistently achieving over 90% accuracy, limiting those benchmarks' utility in measuring true progress. HLE comprises roughly 2,500 to 3,000 questions across a broad spectrum of advanced academic subjects, including mathematics, physics, computer science, and the humanities. Questions are meticulously crafted by nearly 1,000 subject-matter experts worldwide and are designed to be non-searchable and to require deep reasoning rather than simple retrieval.
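Scoring on a closed-ended benchmark like HLE ultimately reduces to a fraction of correctly answered questions. The sketch below illustrates only that basic accuracy computation with hypothetical data; HLE's actual grading pipeline is more involved (it includes model-judged grading of free-form answers), and all names here are illustrative.

```python
# Minimal sketch of benchmark accuracy scoring (hypothetical data; the
# real HLE pipeline also uses judged grading for free-form responses).

def score_exact_match(predictions, answers):
    """Return the fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must be the same length")
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical example: 2 of 4 responses match the references.
preds = ["Paris", "7", "mitochondria", "blue"]
refs  = ["Paris", "7", "ribosome", "red"]
print(f"Accuracy: {score_exact_match(preds, refs):.0%}")  # Accuracy: 50%
```

By this measure, a 44% score on a ~2,500-question benchmark corresponds to roughly 1,100 correct answers.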

Initial evaluations of leading LLMs on HLE showed consistently low accuracy, generally below 10%, underscoring a significant gap between current AI capabilities and expert human performance. The 44% score mentioned in the post is attributed to xAI's Grok-4 model (specifically the "Grok 4 Heavy" configuration with additional compute, according to community discussions) and represents a substantial leap in performance. This achievement, if confirmed, would place Grok-4 significantly ahead of other frontier models on this demanding benchmark.

However, the interpretation of such a high score on HLE has ignited debate within the AI community. While proponents view it as a strong indicator of advanced knowledge application and reasoning, some researchers argue that HLE primarily tests specialized knowledge and the ability to apply it, rather than demonstrating Artificial General Intelligence (AGI) or novel problem-solving capabilities. As one Reddit user, xirzon, commented, "It seems to be a benchmark that would benefit well from scaling up training compute & reasoning tokens... But it doesn't really tell us much about the model's general intelligence in open-ended problem-solving." Critics suggest that the benchmark, while difficult for humans, may still be susceptible to data contamination if its questions or related discussions enter training datasets.

Despite these discussions, the HLE serves as a crucial tool for assessing the evolving capabilities of LLMs. Its creators, including Dan Hendrycks, Executive Director of the Center for AI Safety, aim for it to be "the final closed-ended academic benchmark of its kind," providing a clear measure of AI progress and informing future research and policy decisions. The rapid advancements seen on HLE, such as Grok-4's reported 44%, highlight the accelerating pace of AI development and the ongoing need for increasingly sophisticated evaluation methods.