AI Community Grapples with Up to 38% Discrepancy in Log Probabilities Between vLLM and Hugging Face Models

A significant technical concern has emerged within the artificial intelligence community, highlighted by a recent tweet from user "finbarr" on July 4, 2025. The tweet, which described a "horrifying bug," pointed out that vLLM and Hugging Face Transformers, two widely used frameworks for large language model (LLM) inference, are producing "significantly different logprobs," with reported relative errors reaching up to 38%. This discrepancy raises questions about the consistency and reliability of LLM outputs across different platforms.

Log probabilities (logprobs) are fundamental to how LLMs function, representing the logarithm of the likelihood of each token appearing in a generated sequence. They are crucial for assessing a model's confidence in its predictions, calculating perplexity (a measure of how well a model predicts a sample), and scoring the overall probability of a generated text. Higher logprobs, closer to zero, indicate greater confidence in a token's selection.
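
To make the definition concrete, the sketch below shows one common way to extract per-token log probabilities and derive perplexity with Hugging Face Transformers. The model name ("gpt2") and input sentence are placeholders chosen only for illustration; this is a minimal outline, not a definitive evaluation script.

```python
# Minimal sketch: per-token logprobs and perplexity with Hugging Face Transformers.
# "gpt2" is a placeholder; any causal LM would work the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # [1, seq_len, vocab_size]

# Log-softmax turns raw logits into log probabilities over the vocabulary.
log_probs = torch.log_softmax(logits, dim=-1)

# Each position is scored against the *next* token in the sequence.
target_ids = inputs["input_ids"][:, 1:]
token_logprobs = log_probs[:, :-1, :].gather(
    dim=-1, index=target_ids.unsqueeze(-1)
).squeeze(-1)                                # [1, seq_len - 1]

# Perplexity is the exponential of the negative mean token logprob.
perplexity = torch.exp(-token_logprobs.mean())
print(f"Mean logprob: {token_logprobs.mean():.4f}, perplexity: {perplexity:.2f}")
```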

Discussions on the vLLM forum confirm these numerical differences, with users reporting absolute discrepancies in log probabilities ranging from approximately 0.0017 to 0.0065 in specific test cases. Experts attribute these variations to inherent numerical differences between the implementations, notably differing CUDA kernels and the limits of floating-point arithmetic, which are especially pronounced at float16 precision. While vLLM's optimized kernels enhance inference speed, they can introduce subtle numerical deviations compared to Hugging Face's standard implementations.
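
A comparison of the kind discussed on the forum can be sketched roughly as follows: score the same prompt with both stacks and report the maximum relative error. The model name is a placeholder, and the vLLM fields used here (SamplingParams(prompt_logprobs=...), RequestOutput.prompt_logprobs) can vary slightly between vLLM versions, so this should be read as an outline under those assumptions rather than a definitive recipe.

```python
# Sketch: compare prompt-token logprobs from Hugging Face and vLLM for one prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "gpt2"  # placeholder model
prompt = "The capital of France is"

# --- Hugging Face reference logprobs (float16 to match vLLM's usual default) ---
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = hf_model(ids).logits.float()
log_probs = torch.log_softmax(logits, dim=-1)
hf_logprobs = log_probs[0, :-1].gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

# --- vLLM logprobs for the same prompt tokens ---
llm = LLM(model=model_name, dtype="float16")
params = SamplingParams(max_tokens=1, prompt_logprobs=0)
out = llm.generate([prompt], params)[0]
vllm_logprobs = torch.tensor([
    d[tid].logprob                       # logprob assigned to the actual prompt token
    for tid, d in zip(ids[0, 1:].tolist(), out.prompt_logprobs[1:])
])

# Relative error of the kind reported in the discussion.
rel_err = ((hf_logprobs - vllm_logprobs).abs() / hf_logprobs.abs()).max()
print(f"Max relative error: {rel_err:.2%}")
```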

The implications of such discrepancies extend beyond mere numerical variance. In applications where precise confidence scoring is critical, such as AI-generated text detection, classification tasks, retrieval-augmented generation (RAG) systems, or sensitive medical LLMs, inconsistent log probabilities can lead to unreliable confidence assessments, inaccurate model comparisons, and skewed semantic plausibility judgments. For instance, a model might appear less confident in its output than it actually is, or vice versa, affecting downstream decision-making.

While some level of numerical inconsistency is "somewhat expected" due to the different underlying implementations and optimizations, the magnitude of the reported differences underscores a challenge for developers relying on these tools for consistent model behavior. Some users have found that employing float32 precision can significantly reduce these discrepancies, though often at the cost of slower inference. The AI community continues to develop and refine consistency testing methods, including sentence similarity and perplexity-based evaluations, to ensure that performance optimizations do not compromise the integrity and predictability of LLM outputs.
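
As a rough illustration of the float32 mitigation mentioned above, both libraries expose a dtype option at load time. The snippet below assumes a placeholder model and shows only the loading step; whether the extra precision is worth the slower, more memory-hungry inference depends on the application.

```python
# Sketch: loading both stacks in float32 to reduce logprob discrepancies.
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model_name = "gpt2"  # placeholder

# Hugging Face: request float32 weights explicitly instead of float16/bfloat16.
hf_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# vLLM: the dtype argument accepts "float32" as well as "float16" / "bfloat16".
llm = LLM(model=model_name, dtype="float32")
```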