AI Models Achieve Near 30% on Challenging ARC-AGI-2 Benchmark, Signaling Rapid Progress

Recent AI models have achieved unprecedented performance on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark, particularly its more challenging second iteration, ARC-AGI-2. This progress has surprised researchers, with one long-time ARC expert describing the advancements as "scary" and "unprecedented."

The ARC-AGI benchmark, introduced by François Chollet in 2019, is designed to measure fluid intelligence and abstract reasoning, critical components for Artificial General Intelligence (AGI). Unlike traditional benchmarks that often reward memorization or narrow task-specific skills, ARC-AGI presents visual grid-based puzzles that require models to infer an underlying rule from a handful of examples and apply it to a novel input. Humans find these tasks relatively easy, but AI systems have historically struggled with them.
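
For readers unfamiliar with the task format, here is a minimal sketch of an ARC-style task in the benchmark's public JSON representation (tasks are JSON files with "train" demonstration pairs and "test" pairs, and grids are 2D lists of integers 0-9 encoding colors). The puzzle rule shown is an invented toy example for illustration, not an actual benchmark task:

```python
import json

# A minimal ARC-style task in the benchmark's public JSON format:
# a few "train" demonstration pairs plus one or more "test" pairs,
# with grids as 2D lists of integers 0-9 (color codes).
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 2], [2, 2]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[3, 3], [3, 3]]},
    ],
}

# The hidden rule (fill the grid with its single non-zero color) must
# be inferred from the two demonstrations alone, then applied to the
# test input to produce the expected output grid.
print(json.dumps(task["test"][0]["input"]))
```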

Earlier this year, leading models such as OpenAI's experimental o3 achieved scores of roughly 75-87% on the original ARC-AGI-1 benchmark, a significant leap from previous attempts. However, many other models still "barely get 40% on ARC AGI 1," as noted by Chris, an expert who has worked with ARC for years. This indicated a substantial gap in general reasoning capabilities across different AI systems.

ARC-AGI-2, launched in March 2025, significantly raises the bar by emphasizing both efficiency and capability, making it considerably harder than its predecessor. While general-purpose large language models (LLMs) score dramatically lower on it than on ARC-AGI-1, specialized systems are pushing boundaries. Jeremy Berman recently achieved a new state-of-the-art score of 29.4% on ARC-AGI-2, demonstrating a notable advance in abstract reasoning capabilities.
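
To put a number like 29.4% in context, ARC scoring is all-or-nothing per task: a prediction counts only if the output grid matches the expected grid exactly, with a small number of attempts allowed per test input (two, under the standard competition rules). A hedged sketch of this scoring logic, using hypothetical helper names rather than any official evaluation code:

```python
def task_solved(predicted_grids, expected_grid):
    """A task's test input counts as solved only if some attempted
    grid matches the expected output cell-for-cell; partial matches
    earn no credit."""
    return any(pred == expected_grid for pred in predicted_grids)

def benchmark_score(task_results):
    """Fraction of tasks solved -- the percentage reported on the
    ARC-AGI leaderboards."""
    return sum(task_results) / len(task_results)

# Toy illustration: solving 2 of 3 tasks yields a score of ~66.7%.
print(benchmark_score([True, False, True]))
```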

This rapid improvement has profound implications for the pursuit of AGI. "It starts getting scary when the models start doing really well on benchmarks you thought would take years," Chris stated, adding, "Like what do you mean models are getting 30-40% on ARC AGI 2, when they could barely get 40% on arc AGI 1, beginning of this year. I’ve been working with and on ARC for years. This is unprecedented progress even for training on a test set." Despite these gains, human test panels have solved 100% of ARC-AGI-2 tasks, and the computational cost of AI solutions remains far higher than that of human problem-solving, highlighting ongoing challenges in efficiency and true human-like generalization.