
A recent social media post by "Lisan al Gaib" highlights a notable development in the artificial intelligence landscape: the open-source Kimi-K2 model has reportedly matched Anthropic's Claude 3.7 Sonnet on a METR evaluation. The result underscores the rapid progress of open-source AI, even as the poster argues that a gap persists in "agentic, long-context reasoning tasks" relative to frontier labs.
The post stated: "Kimi-K2 Thinking gets the same score on METR as Claude 3.7 Sonnet," adding, "as I was saying, open-source is 9 months behind frontier labs on agentic, long-context reasoning tasks." The comparison paints a nuanced picture of AI development: open-source models are making substantial strides while still trailing in certain advanced capabilities.
Anthropic's Claude 3.7 Sonnet, released on February 24, 2025, was a pioneering hybrid reasoning model known for its "extended thinking" mode, designed for complex, multi-step reasoning. Anthropic announced its deprecation on October 28, 2025, with newer models such as Claude Sonnet 4.5 taking its place. On METR's Data Deduplication evaluation, part of a benchmark suite for assessing autonomy risk, Claude 3.7 Sonnet achieved a median F1 score of 70.2%, with only 4 of 30 trials scoring above the set threshold.
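To make the two reported statistics concrete, the sketch below shows how a median F1 and a pass count over a threshold would be computed from per-trial scores. The individual trial scores and the 0.80 threshold here are hypothetical placeholders; METR's actual per-trial data is not reproduced in this article.

```python
import statistics

# Hypothetical per-trial F1 scores for a 30-trial run (illustrative values
# only; the real METR trial-level results are not public in this form).
trial_f1 = [0.702] * 15 + [0.65] * 11 + [0.85] * 4

THRESHOLD = 0.80  # assumed pass threshold, for illustration

median_f1 = statistics.median(trial_f1)
passes = sum(1 for f1 in trial_f1 if f1 > THRESHOLD)

print(f"median F1: {median_f1:.1%}")                    # median F1: 70.2%
print(f"trials above threshold: {passes}/{len(trial_f1)}")  # 4/30
```

With these placeholder values the script reproduces the shape of the reported result: a 70.2% median with 4 of 30 trials clearing the threshold.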
Public information about Kimi-K2 is sparser, but its reported parity with Claude 3.7 Sonnet on METR marks a notable leap for an open-source model. The Data Deduplication evaluation tests a model's ability to deduplicate data efficiently, a task that demands sophisticated reasoning and long-context understanding. While details of Kimi-K2's architecture and training remain less transparent, its performance suggests the gap is closing on specific technical evaluations.
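For readers unfamiliar with how a deduplication task is scored with F1, here is a minimal toy illustration: predicted duplicate pairs are compared against gold-standard pairs, and precision and recall are combined into F1. This is not METR's actual task harness; the record identifiers and pairs are invented for the example.

```python
# Toy F1 computation for a deduplication task: compare predicted duplicate
# pairs against gold-standard pairs (all pairs here are hypothetical).
gold = {("rec1", "rec2"), ("rec3", "rec4"), ("rec5", "rec6")}
predicted = {("rec1", "rec2"), ("rec3", "rec4"), ("rec7", "rec8")}

tp = len(gold & predicted)           # correctly identified duplicate pairs
precision = tp / len(predicted)      # 2/3: how many predictions were right
recall = tp / len(gold)              # 2/3: how many true duplicates were found
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.3f}")  # F1 = 0.667
```

The harmonic mean penalizes a model that finds many duplicates but also flags many false ones, which is why F1 is a common summary metric for this kind of task.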
Despite this benchmark result, the post's author maintains that open-source models lag roughly nine months behind frontier labs on "agentic, long-context reasoning tasks." On this view, parity on individual benchmarks can be reached even while proprietary models from leading labs retain an edge in autonomous, multi-step problem-solving over extended contexts. The "exponential" improvement in open-source AI nonetheless points to a dynamic and competitive future for the field.