Rohan Paul has highlighted a significant advance in artificial intelligence, describing a new method named DeepConf that reportedly achieved 99.9% accuracy on the AIME 2025 benchmark. In a recent tweet, Paul detailed how the approach, applied to the open-source model GPT-OSS-120B, cuts token usage by up to 84.7% compared to standard parallel thinking. The development signals a potential leap in AI reasoning efficiency with no additional training or fine-tuning required.
DeepConf operates at test time, leveraging a model's own token-level confidence to retain only its most robust reasoning pathways. The technique contrasts with prevalent systems that rely on self-consistency with majority voting, which often face diminishing returns and high token consumption. According to Paul, DeepConf filters out weak reasoning traces, enhancing accuracy while significantly reducing computational overhead.
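The contrast with plain majority voting can be sketched as follows. This is a minimal illustration, not the released implementation: the trace representation, the `keep_fraction` default, and the helper name `filtered_majority_vote` are all assumptions made for clarity, and how each trace's confidence score is computed is sketched separately below.

```python
from collections import Counter

def filtered_majority_vote(traces, keep_fraction=0.5):
    """Confidence-filtered voting (a sketch): keep only the
    highest-confidence reasoning traces, then majority-vote over
    their final answers.

    traces: list of (answer, confidence_score) pairs, where a
    higher score indicates a more robust trace. keep_fraction is
    illustrative, not a value from the source.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_fraction))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Plain self-consistency would vote over all four traces; the
# filtered variant discards the two weak ones first.
result = filtered_majority_vote(
    [("42", 0.9), ("42", 0.8), ("7", 0.1), ("7", 0.05)]
)
```

In plain self-consistency every sampled trace gets an equal vote, so many weak traces can outvote a few strong ones; filtering by confidence first removes that failure mode while also letting the weak traces be cut short, which is where the token savings come from.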
The core of DeepConf lies in its use of confidence signals derived directly from the model's output probabilities. These include token confidence, the negative mean log probability of the top-k candidate tokens at each decoding step; group confidence, token confidence averaged over a sliding window to detect local dips; tail confidence, the average over the final chunk of tokens; and lowest group confidence, the single weakest window, which serves as a gate for terminating a trace early.
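Those four signals can be sketched directly from per-step log probabilities. This is a minimal reading of the definitions above, assuming access to the model's top-k candidate log probabilities at each step; the window and tail sizes are illustrative parameters, not values from the source.

```python
def token_confidence(topk_logprobs):
    """Token confidence: negative mean log probability of the
    top-k candidate tokens at one decoding step."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def group_confidence(step_confidences, window=5):
    """Group confidence: token confidence averaged over a sliding
    window, used to detect local dips in reasoning quality."""
    return [
        sum(step_confidences[i:i + window]) / window
        for i in range(len(step_confidences) - window + 1)
    ]

def tail_confidence(step_confidences, tail=4):
    """Tail confidence: average over the final chunk of tokens."""
    chunk = step_confidences[-tail:]
    return sum(chunk) / len(chunk)

def lowest_group_confidence(step_confidences, window=5):
    """The single weakest sliding window; a low value can gate
    early termination of the trace."""
    return min(group_confidence(step_confidences, window))
```

Because all four scores are computed from log probabilities the model already emits during decoding, none of them requires extra forward passes, retraining, or new model outputs.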
"DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters," Paul stated in the tweet.
This methodology aims to address the limitations of traditional parallel thinking, where accuracy gains from increased sampling often flatten, leading to linear compute scaling with minimal benefit. By scoring and filtering reasoning locally, DeepConf offers a practical solution for developers seeking to optimize large language models. The claimed 99.9% accuracy on AIME 2025, a challenging mathematics competition, underscores the potential impact of DeepConf on complex problem-solving tasks.
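The "scoring and filtering locally" idea implies an online gate: a trace can be halted mid-generation as soon as its recent confidence dips, rather than being sampled to completion and discarded afterward. A minimal sketch, assuming per-step confidence values are available during decoding; the threshold and window size here are hypothetical knobs for illustration, not values from the source.

```python
def should_stop(step_confidences, window=5, threshold=1.0):
    """Online early-termination gate (a sketch): halt a trace
    mid-generation when the average confidence of the most recent
    sliding window falls below a threshold. In practice the
    threshold would be calibrated, e.g. from warm-up traces."""
    if len(step_confidences) < window:
        return False  # not enough steps to judge yet
    recent = step_confidences[-window:]
    return sum(recent) / window < threshold
```

Calling this check after each decoded step is what turns per-trace filtering into a compute saving: a weak trace stops consuming tokens at the first sustained dip instead of running to its end.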