DeepSeek-R1 Achieves 77.9% on AIME 2024 with Pure Reinforcement Learning, Published in Nature

DeepSeek-AI has published its research on DeepSeek-R1 in the scientific journal Nature, showcasing a novel approach to developing reasoning capabilities in large language models (LLMs) through pure reinforcement learning (RL). The publication comes amid industry scrutiny of how AI models are independently developed, and the peer-reviewed paper serves to validate DeepSeek's training and distillation techniques. The research introduces DeepSeek-R1-Zero, a model trained solely with answer-only rewards, enabling it to self-evolve advanced reasoning strategies without reliance on human-annotated reasoning traces.

The core innovation of DeepSeek-R1 lies in its ability to foster sophisticated reasoning skills by rewarding only the correctness of final answers, rather than requiring step-by-step human demonstrations. This method allows the model to discover non-human-like reasoning pathways, including self-checking, verification, and dynamic strategy adjustments. During training, DeepSeek-R1-Zero exhibited a "self-evolution" process, increasing its "thinking time" and developing reflective behaviors such as a notable spike in the use of the word "wait" during problem-solving.

DeepSeek-R1, an enhanced version of R1-Zero, utilizes a multi-stage training framework that integrates rejection sampling, RL, and supervised fine-tuning to improve readability and general language capabilities while retaining strong reasoning. The model employs Group Relative Policy Optimization (GRPO), a variant of PPO, to efficiently train without a separate value model. Notably, DeepSeek has also successfully distilled these advanced reasoning patterns into smaller models, making powerful AI more accessible and energy-efficient.
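The key trick in GRPO is that the baseline comes from the group itself: several completions are sampled per prompt, and each reward is normalized against the group's mean and standard deviation, so no separate learned value model is needed. A minimal sketch of that advantage computation (the `eps` smoothing term is an assumption for numerical safety, not a detail taken from the paper):

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages for G completions of the same prompt:
    baseline each reward against the group mean and scale by the group
    standard deviation, replacing the critic used in standard PPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled completions, two correct (reward 1) and two wrong.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions in the group get positive advantages and incorrect ones negative, so the policy is pushed toward whatever reasoning style produced the right answers within each group.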

On challenging benchmarks, DeepSeek-R1 demonstrates impressive performance, achieving a 77.9% pass@1 score on AIME 2024, which further rises to 86.7% with self-consistency decoding. This performance surpasses the average human competitor on the AIME. Despite this significant scientific achievement and its publication in a top-tier journal, Kevin Xu, a prominent observer, noted the apparent lack of attention from Western tech media, stating, "And of course, the paper got also no attention in the western tech media."
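The jump from 77.9% (pass@1) to 86.7% comes from self-consistency decoding: sampling several independent reasoning chains and taking a majority vote over their final answers. A minimal sketch (the sampling function is a placeholder; only the voting logic is shown):

```python
from collections import Counter

def self_consistency_vote(final_answers: list[str]) -> str:
    """Majority vote over final answers extracted from independently
    sampled reasoning chains; the most common answer wins."""
    return Counter(final_answers).most_common(1)[0][0]

# Example: five sampled chains for one AIME problem; three agree on "204".
consensus = self_consistency_vote(["204", "197", "204", "204", "512"])
```

Even if any single chain is right only ~78% of the time, agreement across chains filters out uncorrelated mistakes, which is why the voted score is higher than pass@1.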

The publication in Nature not only highlights DeepSeek-R1's technical prowess but also sets a procedural precedent for transparency in the AI industry. Reviewers emphasized the importance of such peer-reviewed validation in assessing AI systems, moving beyond self-promotional claims to support findings with solid evidence and reproducible processes. This move is seen as a crucial step towards establishing AI as a scientific discipline, fostering public trust, and curbing industry speculation.