London – AI firm Caesar Data has reported a score of 55.87% on the demanding Humanity’s Last Exam (HLE) Text-Only Benchmark, the highest published to date. The result, highlighted by Rohan Paul on social media, marks a significant leap in AI reasoning capability, achieved with notable efficiency.
Humanity’s Last Exam is a rigorous AI benchmark co-developed by the Center for AI Safety and Scale AI. It was created to address "benchmark saturation," the point at which leading large language models (LLMs) were consistently scoring above 90% on older tests such as MMLU. HLE comprises roughly 2,500 to 3,000 challenging questions, depending on the release, across more than a hundred academic disciplines. The questions are designed to test deep reasoning and to resist simple internet lookups.
Previously, even the most advanced LLMs struggled with HLE, with initial scores often falling below 10%. By early 2025, some frontier models had reached the upper 20% range on the text-only subset. Caesar Data's reported 55.87% far exceeds these figures, setting a new record for text-based AI reasoning on this notoriously difficult exam.
According to Rohan Paul's tweet, the result was achieved using "100 compute units i.e. approximately 1min of Caesar reasoning time." This detail suggests that Caesar Data's system pairs its accuracy with efficient use of compute.
Caesar Data, a UK-based company incorporated in January 2025, specializes in data infrastructure and analytics, with a stated mission to use AI to answer humanity's most challenging questions. The record score positions the relatively new firm as a significant player in the competitive AI research landscape. The result also points to accelerating progress in AI's ability to handle complex, expert-level academic problems.