A new AI model, StepFun-Prover, has demonstrated a significant leap in automated theorem proving, achieving a 70% success rate on miniF2F theorems on the first attempt. The advance, highlighted in a recent post by Rohan Paul, leverages interaction with the Lean 4 theorem prover and a feedback-driven learning approach. The research marks a shift away from traditional provers, which often rely on numerous guesses without incorporating real-time feedback from the checker.
The StepFun-Prover initiative, detailed in the arXiv paper "StepFun-Prover Preview: Let's Think and Verify Step by Step," begins by curating open Lean problems into a clean cold-start dataset designed to teach the model basic syntax and the tool calls needed for formal reasoning. The project then fine-tunes 7B and 32B DeepSeek backbone models on this data, subsequently refining their writing style by correcting errors in responses from a prior model, Kimina-Prover.
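To make the data-preparation step concrete, here is a minimal sketch of what one cold-start record could look like: a formal statement paired with a trace that interleaves reasoning, a Lean snippet, and the checker's reply. The field names and trace structure are assumptions for illustration, not taken from the paper.

```python
# Hypothetical cold-start record; field names and structure are illustrative,
# not drawn from the StepFun-Prover paper.
record = {
    "statement": "theorem add_comm_demo (a b : Nat) : a + b = b + a",
    "trace": [
        {"role": "thought",
         "text": "Addition on Nat is commutative; Nat.add_comm closes the goal."},
        {"role": "lean",
         "code": "theorem add_comm_demo (a b : Nat) : a + b = b + a := Nat.add_comm a b"},
        {"role": "sandbox", "output": "success: no errors reported"},
    ],
}
```

Records along these lines would teach the model both the Lean syntax it must emit and the turn structure of tool calls and sandbox replies it will encounter during reinforcement learning.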
A crucial aspect of StepFun-Prover's methodology is reinforcement learning in which the model interleaves natural-language reasoning, Lean code snippets, and sandbox replies within extended contexts. Each attempt receives a binary reward (1 when the checker accepts the proof, 0 otherwise), which keeps gradient steps stable and honest during training. This iterative feedback loop allows the model to learn and adapt efficiently, a departure from methods that do not "listen to the checker," as Paul notes.
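The reward itself is simple to sketch. The following is a minimal, hypothetical stand-in for the sandbox check, not the paper's actual infrastructure: it writes a candidate proof to a file, compiles it with the Lean 4 toolchain via `lake env lean`, and returns 1 only when the checker reports no errors. The project path and timeout are placeholders.

```python
import os
import subprocess
import tempfile

def binary_reward(proof_attempt: str, project_root: str, timeout_s: int = 60) -> int:
    """Return 1 if the Lean 4 checker accepts the proof, 0 otherwise.

    A minimal stand-in for the paper's sandbox: compile the attempt inside
    a Lean project and treat a zero exit code as checker agreement.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_attempt)
        path = f.name
    try:
        result = subprocess.run(
            ["lake", "env", "lean", path],  # compile with the project's toolchain
            cwd=project_root,               # placeholder: any Lean 4 project root
            capture_output=True,
            timeout=timeout_s,
        )
        return 1 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0                            # a timeout scores 0, like any failure
    finally:
        os.unlink(path)
```

Because the reward is exactly the checker's verdict, plausible-looking but invalid proofs earn nothing, which is what keeps the gradient signal honest.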
Following two cycles of reinforcement learning and supervised fine-tuning, the 32B version of StepFun-Prover reportedly outperforms 72B competitors by over 5%, using only one sample per proof compared with 32 samples for other models. The study also found that extending the generation window from 4K to 20K tokens boosted the pass@1 rate from 58% to 70%, underscoring the value of extensive interaction with the Lean 4 checker. The model can even shorten proofs by interpreting the checker's warnings and refactoring its tactics on the fly.
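As a toy illustration of that refactoring behavior, consider the pair below; the statement and tactics are invented for this article rather than drawn from the paper. A step-by-step rewrite proof collapses into a single tactic call once feedback reveals that the explicit steps are redundant.

```lean
-- Verbose first attempt: two explicit rewrite steps.
theorem demo (a b : Nat) : a + b + 0 = b + a := by
  rw [Nat.add_zero]
  rw [Nat.add_comm]

-- Refactored attempt: one simp call, seeded with the same lemma,
-- discharges both steps at once.
theorem demo' (a b : Nat) : a + b + 0 = b + a := by
  simp [Nat.add_comm]
```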
The team plans to release open weights for both the 7B and 32B models, aiming to enable broader testing of formal reasoning with minimal GPU requirements. This move is expected to democratize access to advanced AI theorem-proving capabilities and foster further research and development in the field. The miniF2F benchmark, a collection of formalized mathematics problems drawn largely from competition math, serves as a standard for evaluating such systems, and Lean 4 is an open-source interactive theorem prover widely used for the formal verification of mathematical proofs.
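For readers unfamiliar with the format, a benchmark task hands the system a formal statement with the proof left open, and the system must fill the hole with tactics the checker accepts. The toy statement below is far simpler than real miniF2F problems, which typically rely on the Mathlib library.

```lean
-- What the prover receives: a statement whose proof is left as a hole.
theorem toy (a b : Nat) (h : a = b) : a + 1 = b + 1 := by
  sorry

-- What a successful attempt looks like: the hole replaced by tactics
-- that the Lean 4 checker accepts.
theorem toy' (a b : Nat) (h : a = b) : a + 1 = b + 1 := by
  rw [h]
```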