Artificial intelligence systems are demonstrating an "unintuitive" dichotomy in their capabilities, as highlighted by Mike Knoop, co-founder of the ARC Prize Foundation. Knoop observed that AI can "simultaneously score 50%+ on Humanity's Last Exam (relatively hard for humans) while only scoring 16% on ARC-AGI-2 (relatively easy for humans)," according to his recent tweet. This stark contrast underscores the complex and often unpredictable nature of current AI development.

"Humanity's Last Exam" (HLE) is a rigorous benchmark comprising 2,500 questions across diverse academic subjects, including mathematics, physics, and the humanities. Developed by the Center for AI Safety and Scale AI, HLE was specifically designed to challenge advanced AI models, with initial AI performance ranging from a mere 3% to 14%. However, researchers anticipate that leading large language models (LLMs) could achieve over 50% accuracy on HLE by the end of 2025, signaling rapid progress on complex, knowledge-intensive tasks.

Conversely, the Abstraction and Reasoning Corpus for Artificial General Intelligence 2 (ARC-AGI-2) evaluates fluid intelligence and abstract reasoning. Co-authored by Mike Knoop, this benchmark features tasks that are consistently solvable by humans (achieving 100% success in testing) but prove extremely difficult for AI. Current pure LLMs typically score 0% on ARC-AGI-2, and even specialized AI reasoning systems struggle to surpass 4% accuracy. The 16% figure cited by Knoop suggests a higher, possibly projected, performance for specific models, but in general, AI models find these tasks exceptionally challenging.

The core difficulty for AI in ARC-AGI-2 lies in its demand for symbolic interpretation, compositional reasoning, and contextual rule application. AI systems often fail to grasp the semantic meaning behind visual patterns, struggle to apply multiple interacting rules simultaneously, and cannot adapt rules based on subtle contextual cues.
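To make the compositional-reasoning demand concrete, here is a minimal sketch of an ARC-style task in Python. This is a hypothetical toy example, not drawn from the actual ARC-AGI-2 benchmark: grids are small arrays of color indices, and a solver sees only a few demonstration pairs from which it must infer a hidden rule that composes two simpler operations.

```python
# Toy ARC-style task (hypothetical illustration, not an actual ARC-AGI-2 task).
# Grids are lists of lists of color indices.

def transpose(grid):
    """One primitive operation: reflect the grid across its main diagonal."""
    return [list(row) for row in zip(*grid)]

def recolor(grid, mapping):
    """Another primitive: remap colors via a lookup table."""
    return [[mapping.get(c, c) for c in row] for row in grid]

# Two demonstration pairs. The hidden rule composes both primitives:
# transpose the grid, then swap colors 1 and 2.
demos = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),
    ([[1, 2], [0, 0]], [[2, 0], [1, 0]]),
]

def hidden_rule(grid):
    return recolor(transpose(grid), {1: 2, 2: 1})

# The rule reproduces every demonstration...
assert all(hidden_rule(x) == y for x, y in demos)

# ...but the solver's real job is inferring it from the demos alone and
# applying it to a fresh input -- the few-shot generalization step that
# pure LLMs find hard.
print(hidden_rule([[1, 1], [0, 2]]))  # -> [[2, 0], [2, 1]]
```

The point of the sketch is that each primitive is trivial in isolation; the difficulty Knoop describes is inferring which primitives are in play and how they interact, from only a handful of examples.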
This contrasts with HLE, where success hinges on deep knowledge retrieval and complex problem-solving within established academic domains. This performance gap suggests that while AI is rapidly advancing in areas requiring vast data processing and pattern recognition, it still lacks human-like intuitive reasoning and generalization from minimal examples. The paradox illustrates a critical frontier in AI research: bridging the divide between knowledge-based expertise and the flexible, adaptive intelligence that humans possess.