Hangzhou, China – A new reinforcement learning method, Pre-estimated Value-based Policy Optimization (PVPO), has demonstrated significant improvements in the efficiency and stability of training AI reasoning agents. Developed by researchers at Alibaba Cloud Computing, PVPO enables training runs to finish 1.7 to 2.5 times faster without compromising accuracy, according to a recent tweet by Rohan Paul and the associated arXiv paper. This advancement addresses critical challenges in the training of large language models (LLMs) for complex tasks.
PVPO introduces a novel approach by replacing the unstable group baselines used in traditional critic-free methods, such as Group Relative Policy Optimization (GRPO), with a fixed reference anchor. As explained in the paper, "Critic-free group methods compare rollouts against their own group average, so the baseline shifts with every sample and needs many rollouts." In contrast, PVPO sets a stable reference by pre-rolling a separate reference model and averaging its rewards for each question, decoupling the advantage calculation from the current policy's own group of rollouts.
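The contrast can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; the function names, reward encoding, and the single-number reference value are assumptions made for illustration, but they capture the shift from a per-group baseline to a fixed, pre-estimated one.

```python
import numpy as np

def grpo_advantages(rollout_rewards: np.ndarray) -> np.ndarray:
    """GRPO-style: the baseline is the mean reward of the current group,
    so it shifts with every new batch of rollouts."""
    return rollout_rewards - rollout_rewards.mean()

def pvpo_advantages(rollout_rewards: np.ndarray,
                    reference_value: float) -> np.ndarray:
    """PVPO-style (as described in the paper): the baseline is a value
    pre-estimated once per question by averaging the rewards of a frozen
    reference model's rollouts, so it does not move during training."""
    return rollout_rewards - reference_value

# Pre-rollout phase (done once, before policy training):
# e.g. the reference model solved 2 of 5 rollouts for this question.
reference_value = 0.4

# Rewards of the current policy's rollouts on the same question.
policy_rewards = np.array([1.0, 0.0, 1.0])

print(grpo_advantages(policy_rewards))                   # baseline depends on this group
print(pvpo_advantages(policy_rewards, reference_value))  # baseline fixed in advance
```

Because the anchor is computed once, the policy needs fewer rollouts per question to obtain a low-variance advantage estimate.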
The method's core innovation lies in its "Static V Estimate" and "Group Sampling" strategies. The static reference anchor significantly lowers variance and reduces the number of rollouts required during training. Additionally, PVPO filters training data by dropping "always easy cases," retaining mixed ones, and seeding "impossible ones" with cached successful trajectories from a larger model. This provides a concrete path for learning when rewards are sparse, directing gradients more effectively.
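A rough sketch of this filtering logic, under assumptions, is shown below. The thresholds, dataclass fields, and the idea of gauging difficulty via the reference model's success rate are illustrative choices, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    ref_accuracy: float                    # fraction of reference-model rollouts that succeeded
    cached_trajectory: str | None = None   # successful trace from a larger model, if available

def filter_and_seed(dataset: list[Question]) -> list[Question]:
    """Hypothetical sketch of PVPO-style data filtering: drop questions the
    reference model always solves, keep mixed-difficulty ones as-is, and keep
    'impossible' ones only when a cached successful trajectory from a stronger
    model can seed learning under sparse rewards."""
    kept = []
    for q in dataset:
        if q.ref_accuracy >= 1.0:
            continue                       # always easy: no learning signal, drop
        if q.ref_accuracy <= 0.0:
            if q.cached_trajectory is not None:
                kept.append(q)             # seed the sparse-reward case with an expert trace
            continue
        kept.append(q)                     # mixed difficulty: most informative gradients
    return kept
```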
Experimental results detailed in the paper, authored by Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, and Hao Wang from Alibaba Cloud Computing, show PVPO's superior performance. On multi-step retrieval tasks, a 7B policy trained with PVPO not only beats GRPO but also rivals stronger general models on in-domain datasets. The method also demonstrated robust generalization across mathematical reasoning benchmarks.
The efficiency gains are substantial; PVPO achieves the accuracy level of GRPO in approximately half the training steps. The data filtering process alone, which removes 40-60% of the dataset, contributes to the 1.7x-2.5x speedup in training. This makes PVPO a promising development for the practical application and scaling of agentic AI systems, offering a more stable and cost-effective training paradigm.