Researchers from Alibaba Cloud Computing have introduced a novel approach named Variance-based Curriculum Reinforcement Learning (VCRL), designed to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). The framework, detailed in a research paper published on arXiv on September 24, 2025, addresses limitations in existing reinforcement learning methods by dynamically adjusting the difficulty of training samples. AK, who curates AI research papers on social media, highlighted the work in a post reading, "VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models."
VCRL operates on the principle of dynamically controlling the difficulty of training samples based on the variance of group rewards. The approach is designed to mirror human learning, which progresses from easier to more challenging tasks, thereby improving the LLM's learning efficiency. The authors noted that existing rollout-based reinforcement learning techniques often "fail to explicitly consider LLMs' learning ability for samples of different difficulty levels."
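For intuition, with binary pass/fail rewards the variance of a group of rollouts for a single problem reduces to p(1 - p), where p is the empirical success rate, so it is maximized when the model succeeds about half the time. The following Python sketch, an illustration rather than the authors' code, makes this concrete:

```python
# Illustrative sketch, not the paper's implementation: for binary (0/1)
# rollout rewards, the group reward variance equals p * (1 - p), where p is
# the empirical success rate, so it peaks at a ~50% success rate.
def group_reward_variance(rewards: list[float]) -> float:
    """Population variance of one problem's group of rollout rewards."""
    g = len(rewards)
    mean = sum(rewards) / g
    return sum((r - mean) ** 2 for r in rewards) / g

print(group_reward_variance([1, 1, 1, 1, 1, 1, 1, 1]))  # 0.0  -> too easy
print(group_reward_variance([1, 1, 1, 1, 0, 0, 0, 0]))  # 0.25 -> moderate, most informative
print(group_reward_variance([0, 0, 0, 0, 0, 0, 0, 0]))  # 0.0  -> too hard
```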
The framework integrates two core components: Variance-based Dynamic Sampling and Replay Learning with a memory bank. Variance-based Dynamic Sampling identifies samples of moderate difficulty—those where the model's success rate is approximately 50% and reward variance is highest—as these are most valuable for learning. Replay Learning then utilizes a memory bank to maintain and prioritize these high-value samples, further boosting training efficiency and stability.
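Put together, the two components might look something like the sketch below. The names here (ReplayBank, select_for_training, the variance threshold) are hypothetical, chosen to illustrate the described mechanism under stated assumptions rather than to reproduce the paper's implementation:

```python
import heapq

# Hypothetical sketch of the two described components, not the paper's code:
# keep samples whose group reward variance is high (moderate difficulty), and
# maintain a bounded memory bank that replays the highest-variance samples.
class ReplayBank:
    """A bounded memory bank retaining the highest-variance samples seen."""

    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self._heap = []    # min-heap of (variance, counter, sample)
        self._counter = 0  # tie-breaker so samples are never compared directly

    def add(self, sample, variance: float) -> None:
        self._counter += 1
        item = (variance, self._counter, sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif variance > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict lowest-variance entry

    def replay(self, k: int):
        """Return the k highest-variance stored samples for an extra pass."""
        return [s for _, _, s in heapq.nlargest(k, self._heap)]


def select_for_training(batch, variance_fn, threshold: float, bank: ReplayBank):
    """Variance-based dynamic sampling: keep moderately difficult samples."""
    kept = []
    for sample in batch:
        v = variance_fn(sample)  # group reward variance over G rollouts
        bank.add(sample, v)
        if v >= threshold:       # high variance ~ success rate near 50%
            kept.append(sample)
    return kept
```

In a training loop of this shape, the kept samples plus a few entries from bank.replay(k) would form the next update batch, which is one plausible way to realize the prioritized replay the paper describes.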
Extensive experiments were conducted on five mathematical benchmarks using Qwen3-4B-Base and Qwen3-8B-Base models. VCRL consistently outperformed current LLM RL baselines, including GRPO, DAPO, and GSPO. On the Qwen3-8B-Base model, VCRL achieved an average score of 57.76, an improvement of 4.67 points over the strongest baseline, GSPO (53.09), and 24.8 points over the base model.
This consistent superiority, particularly on challenging datasets like AIME-2024 and AIME-2025, underscores VCRL's effectiveness in developing complex, multi-step reasoning abilities in LLMs. The research highlights VCRL as a robust strategy for enhancing the mathematical reasoning prowess of large language models, promising more efficient and stable training processes.