Kimi K2 Thinking Achieves State-of-the-Art 44.9% on Humanity's Last Exam

The Kimi K2 Thinking model passed the "Lem Test" on its first attempt, according to AI expert Ethan Mollick, placing it among a small group of models to have done so. In a recent social media post, Mollick observed, "Kimi K2 Thinking passes the Lem Test the first time, very few models have done so." He further characterized the model as "very weird & interesting... hard to benchmark," while acknowledging its often high-quality writing.

Humanity's Last Exam, developed by the Center for AI Safety and Scale AI, is a highly challenging benchmark designed to push the limits of AI knowledge and reasoning. Comprising 2,500 questions across diverse subjects like mathematics, physics, and humanities, HLE was created to address "benchmark saturation," where leading AI models were achieving near-perfect scores on older tests. Kimi K2 Thinking achieved a state-of-the-art score of 44.9% on HLE, outperforming several established proprietary models.

Developed by Moonshot AI, Kimi K2 Thinking is an open-source Mixture-of-Experts (MoE) model featuring 1 trillion total parameters, of which 32 billion are activated. Its advanced capabilities stem from "deep thinking & tool orchestration," enabling it to interleave chain-of-thought reasoning with function calls. This allows for autonomous research, coding, and writing workflows that can span hundreds of steps without losing coherence, a key reason the model is both distinctive and hard to benchmark.
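To make the idea of interleaving reasoning with function calls concrete, here is a minimal Python sketch of such a loop. It is not Moonshot AI's implementation or API: `run_agent`, `toy_model`, the `increment` tool, and the tuple-based step format are all hypothetical stand-ins, and the "model" is a hard-coded function rather than a real language model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step format: ("tool", tool_name, argument) or ("final", answer, "")
Step = Tuple[str, str, str]


def run_agent(model: Callable[[List[str]], Step],
              tools: Dict[str, Callable[[str], str]],
              task: str,
              max_steps: int = 300) -> Tuple[str, int]:
    """Alternate model steps and tool calls until the model gives a final answer."""
    transcript = [f"task: {task}"]
    calls = 0
    for _ in range(max_steps):
        kind, first, second = model(transcript)
        if kind == "final":
            return first, calls
        # The model requested a tool: run it and feed the result back in.
        result = tools[first](second)
        calls += 1
        transcript.append(f"call {first}({second}) -> {result}")
    raise RuntimeError("step budget exhausted without a final answer")


def toy_model(transcript: List[str]) -> Step:
    """Stand-in for a real model: increments a counter three times, then answers."""
    observations = [line for line in transcript if line.startswith("call")]
    last = observations[-1].split("-> ")[1] if observations else "0"
    if len(observations) < 3:
        return ("tool", "increment", last)
    return ("final", last, "")


tools = {"increment": lambda x: str(int(x) + 1)}
answer, calls = run_agent(toy_model, tools, "count to three")
print(answer, calls)
```

The `max_steps` budget is the design point worth noting: a loop intended to run for hundreds of tool calls needs an explicit cap so that a model that never produces a final answer cannot run indefinitely.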

The model's strong performance, particularly its agentic reasoning and ability to execute 200–300 sequential tool calls, positions it competitively against proprietary systems like OpenAI's GPT-5 and Anthropic's Claude Sonnet 4.5. Kimi K2 Thinking's open-source release under a Modified MIT License further democratizes access to frontier-level AI. Its efficiency, including native INT4 quantization for faster inference, makes it a compelling alternative for developers and enterprises seeking high-performance AI solutions.
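A rough back-of-the-envelope calculation shows why the parameter counts and INT4 quantization matter for deployment. The Python sketch below uses only the figures quoted in the article and makes simplifying assumptions: every weight is stored at the stated bit width, and quantization scales, activations, and the KV cache are ignored, so real memory use will differ.

```python
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters (from the article)
ACTIVE_PARAMS = 32_000_000_000     # 32 billion activated parameters (from the article)


def weight_bytes(params: int, bits_per_weight: int) -> float:
    """Bytes needed to store `params` weights at the given bit width."""
    return params * bits_per_weight / 8


# Share of the model's parameters that are activated.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Simplifying assumption: all weights use the same width, with no overhead.
int4_gb = weight_bytes(TOTAL_PARAMS, 4) / 1e9
bf16_gb = weight_bytes(TOTAL_PARAMS, 16) / 1e9

print(f"active fraction: {active_fraction:.1%}")   # 3.2%
print(f"INT4 weights:   {int4_gb:.0f} GB")         # 500 GB
print(f"16-bit weights: {bf16_gb:.0f} GB")         # 2000 GB
```

Under these assumptions, 4-bit weights take a quarter of the storage of 16-bit weights, and only about 3.2% of the parameters are activated.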