NaturalThoughts Dataset Boosts LLM Reasoning, Lifts GPQA-Diamond Score to 48.6%


A new research initiative dubbed "NaturalThoughts," spearheaded by a team at Meta's Fundamental AI Research (FAIR) lab, has introduced a dataset and training methodology that significantly enhance the reasoning capabilities of large language models (LLMs). The project, detailed in a paper published on July 2, 2025, challenges conventional approaches to LLM training by demonstrating that larger, more varied datasets, coupled with strategic distillation, lead to superior performance on complex reasoning tasks.

The core of the NaturalThoughts approach is a curated pool of up to 500,000 high-quality reasoning demonstrations. The team starts from 2.8 million real-world questions and has a powerful "teacher" model generate detailed, step-by-step thought processes, known as "chains of thought" or "reasoning traces." Each trace is then tagged for strategy type, length, and verbosity, and the filtering favors cases where teacher models disagree with one another or employ multiple strategies.
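Though the paper's actual pipeline is not reproduced here, the curation step can be pictured as an annotate-then-filter loop over teacher traces. The sketch below is a hypothetical illustration: the `ReasoningTrace` fields and the keyword-counting heuristics are assumptions standing in for the richer, model-based tagging the team describes.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """One teacher demonstration: a question plus its step-by-step trace."""
    question: str
    trace: str
    answer: str
    tags: dict = field(default_factory=dict)  # strategy, length, verbosity tags

def annotate(example: ReasoningTrace) -> ReasoningTrace:
    """Attach the metadata the pipeline filters on.

    These are illustrative heuristics only; the paper tags strategy type
    with far richer, LLM-based annotation.
    """
    steps = [line for line in example.trace.split("\n") if line.strip()]
    example.tags["num_steps"] = len(steps)                   # trace length
    example.tags["verbosity"] = len(example.trace.split())   # word count
    # Crude stand-in for strategy tagging: count backtracking markers.
    example.tags["backtracks"] = example.trace.lower().count("wait,")
    return example
```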

A key innovation lies in mixing short "just answer" traces (System-1 distillation) with longer "think aloud" traces (System-2 distillation). This hybrid training strategy lets student models retain efficiency while significantly improving their reasoning. As Rohan Paul put it, "mixing short 'just answer' traces with long 'think aloud' traces lets the model stay fast yet smart." This directly addresses the tendency of LLMs trained on small, hand-picked sets to "freeze up when facing new STEM puzzles."
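In data terms, the hybrid strategy amounts to emitting two target formats from the same teacher demonstration: one keeping the full chain of thought, one keeping only the final answer. A minimal sketch follows; the `<think>` delimiter and the `p_system2` mixing ratio are assumptions, not details confirmed by the paper.

```python
import random

def to_training_example(question: str, trace: str, answer: str,
                        p_system2: float = 0.5) -> dict:
    """Emit either a System-2 target (full 'think aloud' trace) or a
    System-1 target (answer only) from one teacher demonstration.

    p_system2 is an assumed mixing knob, not the paper's actual ratio.
    """
    if random.random() < p_system2:
        completion = f"<think>\n{trace}\n</think>\n{answer}"  # System-2
    else:
        completion = answer                                   # System-1
    return {"prompt": question, "completion": completion}
```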

Empirical tests on models like Llama-3.1-8B and Qwen-2.5-7B underscore the efficacy of NaturalThoughts. The research found that "scale matters": a randomly selected set of 100,000 examples outperformed prior "Less is More" sets. More notably, training with the full 500,000 NaturalThoughts examples lifted Llama-3.1-8B's accuracy on the challenging GPQA-Diamond benchmark to 48.6%, a significant leap that "eclips[es] earlier distilled models." Hard examples, where teacher models disagreed or reasoned at length, proved particularly effective at driving performance gains.
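That finding implies a simple, if expensive, selection rule: sample a question several times from the teacher and keep it when the sampled answers disagree. The snippet below sketches this idea under assumptions; the `teacher` object and its `answer()` method are hypothetical placeholders for any sampling interface to the teacher model.

```python
from collections import Counter

def is_hard(question: str, teacher, n_samples: int = 4) -> bool:
    """Flag a question as 'hard' when independently sampled teacher
    answers disagree -- a cheap proxy for genuine difficulty.

    `teacher` is assumed to expose answer(question) -> str with
    sampling enabled, so repeated calls can differ.
    """
    answers = [teacher.answer(question) for _ in range(n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count < n_samples  # any disagreement counts as hard
```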

The findings suggest a paradigm shift in distilling reasoning skills into smaller LLMs, moving away from the "less is more" hypothesis toward high-quality, diverse data at scale. This advancement holds promise for more capable and adaptable AI systems that can dynamically balance speed and depth of reasoning based on task demands.