OpenAI Researcher Calls for 'AGI-Complete' Evaluations to Unlock AI Progress


Will Depue, a prominent researcher at OpenAI, recently took to social media to articulate a critical perspective on the current state of artificial intelligence development, asserting that progress is significantly "eval-limited." In a widely noted tweet, Depue emphasized that the ability of researchers to iterate on AI capabilities is often constrained by the existing methods of measuring those capabilities. His statement underscores a growing sentiment within the AI community regarding the need for more sophisticated evaluation metrics.

Depue stated: "often researcher’s ability to iterate on a capability is limited by our ability to measure that capability. i do believe progress is more eval-limited than people think." This highlights a fundamental challenge: without robust and comprehensive evaluation tools, accurately assessing and subsequently improving advanced AI systems becomes increasingly difficult. This limitation can inadvertently slow the pace of innovation.

He further illustrated this point by posing a thought-provoking question about the relationship between specific benchmarks and AI advancements: "sometimes evals feel causal. did SWE-Bench follow agentic coding, or did agentic coding follow SWE-bench?" SWE-Bench is a benchmark designed to evaluate AI coding agents on real-world GitHub issues, requiring complex problem-solving within large codebases. Agentic coding refers to AI systems that can autonomously plan and execute multi-step tasks, often involving interaction with their environment. Depue's query suggests a potential feedback loop where the existence of a benchmark might shape the development trajectory of AI capabilities, rather than merely measuring pre-existing ones.
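To make the distinction concrete, a SWE-Bench-style evaluation essentially asks an agent to turn an issue description into a patch that makes a repository's held-out tests pass. The Python sketch below is purely illustrative and is not the official SWE-Bench harness; the `RepoTask` structure and the `generate_patch` callable are hypothetical stand-ins meant only to show the shape of such an eval.

```python
# Illustrative sketch of a SWE-Bench-style evaluation loop. NOT the official
# SWE-Bench harness; the task format and agent callable are hypothetical.
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class RepoTask:
    repo_path: str           # checkout of the repository at the buggy commit
    issue_text: str          # natural-language GitHub issue describing the problem
    test_command: list[str]  # held-out tests that fail before the fix, pass after

def evaluate(tasks: list[RepoTask],
             generate_patch: Callable[[RepoTask], str]) -> float:
    """Fraction of issues the agent resolves (patch applies and tests pass)."""
    resolved = 0
    for task in tasks:
        patch = generate_patch(task)  # the coding agent proposes a unified diff
        # Try to apply the proposed diff to the repository checkout.
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=task.repo_path)
        if applied.returncode != 0:
            continue  # an unapplicable patch counts as a failure
        # Run the issue's tests; the task counts as resolved only if they pass.
        tests = subprocess.run(task.test_command, cwd=task.repo_path)
        if tests.returncode == 0:
            resolved += 1
    return resolved / len(tasks) if tasks else 0.0
```

Depue's point is that once a loop like this exists, it gives researchers a tight signal to iterate against, which is exactly why a benchmark can end up steering a capability as much as measuring it.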

The OpenAI researcher also pointed to the ambition of developing AI for "really long horizon tasks (weeks, months)" and the concept of continual learning as prerequisites for Artificial General Intelligence (AGI). However, he critically noted, "yet where are the evals to prove our models’ deficiencies here?" AGI-complete evaluations are theoretical benchmarks designed to test whether an AI system possesses human-level general intelligence across a wide range of cognitive tasks, often extending beyond narrow, predefined problems. Such evaluations are considered crucial for truly understanding and validating progress towards AGI.
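No established harness exists for such long-horizon tasks yet, but one can imagine the rough shape of one: a weeks-long project decomposed into independently verifiable milestones, with the agent credited for each milestone it completes. The sketch below is a speculative illustration under that assumption; the `Milestone` structure and its checks are hypothetical.

```python
# Speculative sketch of how a long-horizon eval might credit partial progress:
# a weeks-long task is broken into verifiable milestones, and the agent's score
# is the fraction it has completed. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Milestone:
    description: str            # e.g. "CI pipeline green on the migrated service"
    check: Callable[[], bool]   # automated verifier run against the agent's work

def long_horizon_score(milestones: list[Milestone]) -> float:
    """Fraction of milestones whose automated checks currently pass."""
    if not milestones:
        return 0.0
    return sum(1 for m in milestones if m.check()) / len(milestones)
```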

Depue concluded his remarks with a direct call to action for the research community, expressing a strong desire for more work on "AGI-complete evals, evals that truly track economic value & impact, month-long tasks, etc." Coming from a researcher at a leading AI organization like OpenAI, his comments underscore the strategic importance of shifting focus towards more holistic, real-world-aligned evaluation methodologies. This shift is seen as essential for overcoming current development bottlenecks and paving the way for more capable and impactful AI systems.