New research co-authored by Adam Tauman Kalai of OpenAI argues that the persistent problem of hallucinations in large language models (LLMs) can be explained through a connection between supervised and self-supervised learning, and is made worse by current evaluation methods. The findings, detailed in a paper published on September 4, 2025, suggest that the way LLMs are trained and evaluated inherently rewards "guessing" over acknowledging uncertainty.
According to a tweet by Adam Tauman Kalai, the research "explains why LLMs hallucinate, through a connection between supervised and self-supervised learning." The tweet further notes that the study also "describe[s] a key obstacle that can be removed to reduce them." That obstacle is the prevailing use of binary, right-or-wrong evaluation metrics, which give no credit for expressing uncertainty and so push models to produce plausible but incorrect statements rather than abstain.
The paper, "Why Language Models Hallucinate," posits that hallucinations are not mysterious but originate as statistical errors. It argues that even with error-free training data, the objectives optimized during pretraining lead to models that generate errors. Post-training further entrenches these issues, as benchmarks often reward overconfident responses, even if incorrect, over abstaining or signaling uncertainty.
The implications of LLM hallucinations are significant across various real-world applications, impacting user trust and potentially leading to misinformation. Instances of models fabricating data, misinterpreting facts, or producing nonsensical content have been observed in critical sectors like healthcare, finance, and legal services. Users frequently report "Factual Incorrectness" and "Nonsensical/Irrelevant Output" as primary concerns, leading to reduced satisfaction and lower application ratings.
The research suggests that modifying existing evaluation benchmarks to explicitly incorporate confidence targets and reward appropriate expressions of uncertainty could be a crucial step. This socio-technical mitigation would encourage models to behave more reliably by no longer rewarding confident guessing over an honest "I don't know." Addressing this fundamental misalignment in evaluation could pave the way for more trustworthy and accurate AI systems.
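A minimal sketch of how such a confidence target could be scored is shown below. The penalty of t / (1 - t) points for a wrong answer and the 75% threshold are assumptions chosen here for illustration, in the spirit of the paper's proposal rather than as its exact specification:

```python
# Hypothetical confidence-target scoring rule (illustrative assumptions):
# a correct answer earns 1 point, a wrong answer costs t / (1 - t) points,
# and "I don't know" earns 0. The threshold t = 0.75 is an arbitrary choice.
def expected_score(p_correct: float, t: float, abstain: bool) -> float:
    if abstain:
        return 0.0
    wrong_penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

t = 0.75  # "answer only if you are more than 75% confident"
for p in (0.50, 0.75, 0.90):
    ans = expected_score(p, t, abstain=False)
    print(f"confidence {p:.2f}: answer -> {ans:+.2f}, abstain -> +0.00")
# The expected score of answering crosses zero exactly at p = t, so a
# score-maximizing model abstains whenever its confidence falls below the
# stated target instead of guessing.
```

Under this kind of rule, abstention becomes the rational choice whenever the model's confidence is below the stated target, which is precisely the behavior the binary metrics discussed above fail to reward.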