NVIDIA Research Achieves 54.8% Logic Puzzle Score Increase in 1.5B LLM with Prolonged RL

NVIDIA researchers have unveiled a novel approach using prolonged reinforcement learning (ProRL) on a compact 1.5 billion-parameter language model, significantly enhancing its reasoning capabilities. The technique, detailed in a recent paper, led to a remarkable 54.8% increase in scores on logic puzzles, alongside substantial improvements in math, coding, and STEM tasks. This development addresses a persistent challenge: smaller models often plateau during RL training because exploration collapses and the policy locks into a narrow set of behaviors.

The core of the methodology is a set of "stabilizing tricks" applied within an extended reinforcement learning framework. As Rohan Paul summarized in his tweet, these include "decoupled clipping lets unlikely tokens climb, dynamic sampling skips boring prompts, and a small 0.0001 KL term stops runaway drift." Additionally, the team resets the reference policy and optimizer every 400 steps, giving the model "fresh room to wander without forgetting good habits."
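To make these tricks concrete, here is a minimal PyTorch-style sketch of a clipped policy-gradient loss with decoupled clipping bounds and a small KL penalty toward a reference policy. The function name, tensor shapes, and the eps_low/eps_high values are illustrative assumptions, not the paper's exact implementation; only the 0.0001 KL coefficient comes from the description above.

```python
import torch

def prorl_style_loss(logp_new, logp_old, logp_ref, advantages,
                     eps_low=0.2, eps_high=0.4, kl_coef=1e-4):
    """PPO-style surrogate with decoupled clipping plus a small KL penalty.

    logp_new / logp_old / logp_ref: per-token log-probabilities under the
    current, rollout, and reference policies; advantages: per-token advantages.
    eps_low and eps_high are illustrative; kl_coef=1e-4 is the "small 0.0001
    KL term" mentioned above.
    """
    ratio = torch.exp(logp_new - logp_old)  # per-token importance ratio

    # Decoupled clipping: a wider upper bound (eps_high > eps_low) lets
    # low-probability tokens gain probability mass before the clip engages.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    policy_loss = -torch.mean(torch.minimum(unclipped, clipped))

    # Small KL penalty toward the reference policy to stop runaway drift
    # (k3 estimator: exp(x) - x - 1 with x = logp_ref - logp_new).
    x = logp_ref - logp_new
    kl = torch.mean(torch.exp(x) - x - 1.0)

    return policy_loss + kl_coef * kl
```

The periodic hard reset described above would then amount to copying the current policy weights into the reference model and re-initializing the optimizer state every 400 steps, so the KL anchor moves with the policy instead of pinning it to its starting point.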

This prolonged training, which runs for more than 2,000 steps, draws on five reward-checked domains, including roughly 40,000 math items and 24,000 coding puzzles, all graded automatically in a sandbox. The Group Relative Policy Optimization (GRPO) algorithm was applied with loosened constraints to prevent "entropy collapse," ensuring the model keeps exploring rather than locking into narrow behaviors. Training used an 8K context window, and the model was later shown to remain effective at 16K.
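The sketch below shows how GRPO's group-relative advantages and the dynamic-sampling filter fit together, assuming verifiable 0/1 rewards from an auto-grading sandbox. Function names, tensor shapes, and the toy numbers are hypothetical illustrations of the general technique, not the authors' code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO advantage: score each rollout against the other rollouts for the
    same prompt rather than against a learned value function.

    rewards: shape [num_prompts, rollouts_per_prompt], e.g. 1.0 when the
    auto-grading sandbox verifies an answer and 0.0 otherwise.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def dynamic_sampling_mask(rewards: torch.Tensor) -> torch.Tensor:
    """Dynamic sampling: drop prompts whose rollouts all receive the same
    reward (all correct or all wrong), since their group-relative advantages
    are zero and they contribute no learning signal ("boring prompts")."""
    return rewards.std(dim=1) > 0

# Example: 3 prompts with 4 rollouts each; the last prompt is "boring".
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 1., 0.],
                        [1., 1., 1., 1.]])
keep = dynamic_sampling_mask(rewards)               # tensor([True, True, False])
advantages = group_relative_advantages(rewards[keep])
```

Filtering out uniform-reward prompt groups before computing advantages is what keeps each update batch full of informative examples as the model improves over thousands of steps.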

The results are compelling, with the Nemotron-Research-Reasoning-Qwen-1.5B model demonstrating significant gains. Beyond the 54.8% improvement on logic puzzles, the model achieved 14.7% higher scores on math, 13.9% on coding, and 25.1% on challenging STEM questions. These results indicate that the small 1.5B model can match or even surpass much larger models as well as narrowly tuned domain specialists.

This research, published in the paper "Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training" by Shizhe Diao, Mingjie Liu, and others, suggests that reinforcement learning can genuinely expand the reasoning capacity of LLMs beyond their base capabilities. The findings weigh in on an ongoing debate over whether RL merely re-optimizes solutions the base model already contains or fosters genuinely new reasoning pathways, and they highlight the potential for small, efficient models to tackle complex problems.