A recent social media post by user Haider. has ignited discussion around the evolving landscape of Artificial General Intelligence (AGI) benchmarks, particularly the Abstraction and Reasoning Corpus (ARC-AGI). Haider. articulated a common sentiment within the AI community, stating, "by the time we reach 50% on ARC-AGI 3, there will be an ARC-AGI 4 that resets the progress back to 1% this is the cycle with these so-called benchmarks." This perspective highlights the continuous challenge of measuring true AI progress against ever-advancing evaluation standards.
The ARC-AGI benchmark, introduced in 2019 by François Chollet, creator of Keras, is designed to assess fluid intelligence and generalization in AI systems. Unlike traditional benchmarks that test narrow skills or memorization, ARC-AGI focuses on abstract reasoning through visual puzzles, adhering to the principle of being "easy for humans, hard for AI." It aims to measure an AI's ability to apply concepts to novel situations with minimal examples, a key indicator of general intelligence.
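ARC-AGI tasks are distributed as small JSON files: a handful of "train" input/output grid pairs that demonstrate a hidden transformation, plus "test" inputs the solver must complete. The sketch below illustrates the format with a toy task; the grids and the hand-coded `solve` rule are invented for illustration only, since a real ARC solver has to infer the rule from the demonstration pairs rather than having it written in advance.

```python
# Toy illustration of the ARC-AGI task format: each task is a JSON-like object
# with "train" demonstration pairs and "test" inputs to complete.
# Grids are lists of lists of integers 0-9, where each integer is a color.
# This task is invented for illustration; real ARC tasks encode far subtler
# transformations that must be inferred from just a few examples.

toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output: [[0, 3], [3, 0]]
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: mirror each row left-to-right.

    A real solver cannot hard-code the rule; it must infer the transformation
    from the train pairs alone and then apply it to the unseen test input.
    """
    return [list(reversed(row)) for row in grid]

# Check the inferred rule against the demonstration pairs before applying it.
for pair in toy_task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(toy_task["test"][0]["input"]))  # [[0, 3], [3, 0]]
```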
The challenge intensified with the March 2025 launch of ARC-AGI-2, which significantly raised the bar for AI systems while remaining solvable by humans. Initial results revealed a dramatic performance drop: "pure LLMs" scored 0% on ARC-AGI-2, and other public AI reasoning systems achieved only single-digit percentages. This contrasts sharply with the 75.7% score achieved by OpenAI's experimental o3 model on the previous ARC-AGI-1.
Haider.'s observation that "every time real progress is made, a more difficult version emerges, moving the goalposts further away" resonates with the current state of AI evaluation. While OpenAI's o3 demonstrated a breakthrough on ARC-AGI-1, the subsequent ARC-AGI-2 effectively reset the performance curve, underscoring the dynamic nature of AGI research. This constant recalibration ensures benchmarks remain challenging and relevant as AI capabilities advance.
Looking ahead, ARC-AGI-3 is already in development, set for a 2026 launch, and will focus on skill acquisition efficiency in novel, unseen environments. These evolving benchmarks, including the introduction of efficiency metrics like cost-per-task, continue to push AI researchers to develop more robust and generalized intelligence. The ongoing cycle of progress and increased difficulty reflects the ambitious pursuit of truly human-like reasoning in artificial systems.
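The cost-per-task idea can be made concrete with simple arithmetic: a benchmark entry reports not just the fraction of tasks solved but also what each attempt cost to run, so two systems with equal accuracy can still be distinguished by efficiency. Below is a minimal sketch with invented numbers purely for illustration; the function name and figures are assumptions, not the ARC Prize leaderboard's actual reporting code.

```python
# Minimal sketch of an accuracy-plus-efficiency summary, using invented
# numbers purely for illustration. The idea: report the fraction of tasks
# solved alongside the average compute cost spent per task attempt.

def summarize(results, costs_usd):
    """results: list of booleans (task solved or not); costs_usd: dollars spent per task."""
    accuracy = sum(results) / len(results)
    cost_per_task = sum(costs_usd) / len(costs_usd)
    return accuracy, cost_per_task

# Hypothetical run over five tasks.
solved = [True, False, True, True, False]
spent = [1.20, 0.95, 2.10, 1.40, 0.80]

acc, cpt = summarize(solved, spent)
print(f"accuracy: {acc:.0%}, cost per task: ${cpt:.2f}")
# prints "accuracy: 60%, cost per task: $1.29"
```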