New Study Reveals 84% of AI Benchmarks Lack Statistical Rigor, Undermining Progress


A groundbreaking study led by the Oxford Internet Institute at the University of Oxford, in collaboration with researchers from Stanford and other leading global institutions, has exposed significant scientific flaws in the evaluation of large language models (LLMs). The research, titled "Measuring What Matters: Construct Validity in Large Language Model Benchmarks," highlights a pervasive lack of "construct validity," meaning that benchmarks often fail to measure the skills they claim to measure. This critical assessment, which will be presented at NeurIPS 2025, was highlighted by Rohan Paul in a recent tweet.

The study, which systematically reviewed 445 AI benchmarks from top conferences, found that "only 16% of reviewed benchmarks used statistical tests in their comparisons," as noted in an overview of the findings. This widespread lack of statistical rigor means that many reported advances or comparisons between models could be due to chance rather than genuine improvement, casting doubt on the reliability of current AI progress metrics.
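
To illustrate what such a statistical test might look like in practice, the minimal sketch below (not taken from the study; all numbers are synthetic placeholders) uses a paired bootstrap over benchmark items to check whether an apparent accuracy gap between two models is distinguishable from chance.

```python
# Hedged sketch: paired bootstrap test of whether model A's benchmark accuracy
# genuinely exceeds model B's, or whether the gap could plausibly be noise.
# Per-item correctness arrays below are synthetic stand-ins for real results.
import numpy as np

rng = np.random.default_rng(0)

# 1 = item answered correctly, 0 = incorrect (hypothetical results on 500 items)
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.70, size=500)

observed_gap = model_a.mean() - model_b.mean()

# Resample benchmark items with replacement to estimate uncertainty in the gap.
n_items = len(model_a)
boot_gaps = []
for _ in range(10_000):
    idx = rng.integers(0, n_items, size=n_items)
    boot_gaps.append(model_a[idx].mean() - model_b[idx].mean())
boot_gaps = np.array(boot_gaps)

ci_low, ci_high = np.percentile(boot_gaps, [2.5, 97.5])
print(f"accuracy gap = {observed_gap:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval straddles zero, the apparent improvement may be due to chance.
```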

Researchers identified several core problems, including vague concept definitions and contaminated datasets. "Many benchmarks miss a clear definition, reuse tasks that are not representative, rely on fragile scoring, skip uncertainty, and make claims the test setup cannot support," stated Rohan Paul, summarizing the study's findings. Approximately half of the benchmarks examined tested abstract concepts like "reasoning" or "harmlessness" without providing clear, agreed-upon definitions.

Furthermore, 38% of benchmarks were found to reuse existing datasets or exams, increasing the risk of data contamination, in which models memorize answers rather than demonstrate genuine understanding. Andrew Bean, lead author and doctoral researcher at the Oxford Internet Institute, emphasized the gravity of the situation: “Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
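
As an illustration of the kind of contamination concern this raises (a hypothetical sketch, not a method described in the paper), one simple heuristic is to flag benchmark items whose word n-grams appear verbatim in a model's training corpus.

```python
# Hypothetical sketch of a simple contamination heuristic: flag benchmark items
# whose word 5-grams also appear verbatim in a training corpus. The corpus and
# item below are toy examples, not real data.
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(item: str, corpus_ngrams: set, n: int = 5) -> float:
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

training_corpus = "the quick brown fox jumps over the lazy dog near the river bank"
benchmark_item = "the quick brown fox jumps over the lazy dog"

corpus_set = ngrams(training_corpus, 5)
ratio = overlap_ratio(benchmark_item, corpus_set)
print(f"overlap ratio = {ratio:.2f}")  # high overlap suggests possible contamination
```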

To address these critical shortcomings, the study proposes an 8-step checklist for developing more robust and scientifically sound benchmarks. This checklist guides researchers to define skills precisely, design tasks that isolate those skills, choose metrics that truly capture success, and report uncertainty. The authors advocate for a "construct-valid benchmark process" to ensure that evaluations provide reliable evidence for decision-making in AI development and regulation.
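
As an example of the "report uncertainty" step (a hedged sketch; the checklist itself does not prescribe a particular method), a benchmark score can be reported with a confidence interval rather than a bare point estimate.

```python
# Hedged sketch: reporting a benchmark accuracy with a 95% Wilson score
# confidence interval instead of a single number. The counts are hypothetical.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score confidence interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

correct, total = 412, 500  # hypothetical benchmark results
low, high = wilson_interval(correct, total)
print(f"accuracy = {correct / total:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```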

The findings carry significant implications for the AI industry, where benchmarks often guide research priorities, shape competition, and inform regulatory frameworks like the EU AI Act. The researchers have also made a "Construct Validity Checklist" available, a practical tool for developers and regulators to assess benchmark quality before relying on their results, aiming to foster more dependable and meaningful AI evaluations.