AI Benchmarking Falls Short on Agentic Work, Warns Ethan Mollick

Philadelphia, PA – Wharton Professor Ethan Mollick has raised concerns about the current state of artificial intelligence evaluation, asserting that existing benchmarks inadequately measure the economic value of AI systems. In a recent social media post, Mollick, a leading voice in AI research, highlighted the growing importance of "agentic work" over simple task completion.

"We have too many benchmarks on model ability, and too few on agentic work. Increasingly, what matters economically is not the ability of AIs to get a question right through an API call, but rather its ability to combine tools & ability to solve a problem. That is under-measured," Mollick wrote in the post. The remark calls for a shift in how AI's real-world utility is assessed: away from single-answer accuracy and toward multi-step problem solving with tools.

Mollick's work frequently emphasizes that AI capabilities have evolved beyond basic question-answering to systems that can plan, execute multi-step tasks, and use various tools autonomously. He argues that traditional benchmarks, often focused on narrow academic or coding challenges, fail to capture this complex problem-solving ability, which is crucial for economic impact. He points to OpenAI's GDPval paper as a more effective benchmarking approach, one that evaluates AI performance on realistic, multi-hour tasks designed by industry experts.

The concept of agentic AI, in which models act independently to achieve goals, is gaining traction across the industry. Recent models such as OpenAI's GPT-5 and Anthropic's Claude show improved tool use and reasoning, allowing them to tackle more intricate problems. However, Mollick stresses that organizations must develop their own tailored benchmarks to accurately assess AI's effectiveness for specific use cases, rather than relying solely on generic metrics.

Taken together, these points suggest that the measure of an AI system's usefulness lies not just in its ability to recall information or perform isolated functions, but in its capacity to plan and integrate tools to solve complex, real-world problems. As AI agents become more sophisticated, evaluating their "agentic" capabilities will become increasingly important for both developers and businesses seeking to put them to work.