AdEMAMix and MARS Emerge as Leading LLM Optimizers for Large-Scale Pretraining

A new comprehensive study, "Benchmarking Optimizers for LLM Pretraining," published on arXiv (2509.01440), identifies AdEMAMix and MARS as consistently superior optimizers for large language model (LLM) training, particularly with larger models and batch sizes. The research, conducted by Andrei Semenov, Matteo Pagliardini, and Martin Jaggi from EPFL, provides crucial guidance for practitioners navigating the complex landscape of LLM optimization.

The paper directly addresses the challenge of comparing optimization techniques, whose reported results often rest on inconsistent experimental protocols. By systematically varying model size, batch size, and training duration while keeping all other aspects of the setup identical, the researchers established a directly comparable ranking of optimizers. According to a tweet by Rohan Paul, "AdEMAMix and MARS consistently lead on larger models and batches."
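In practice, such a controlled comparison amounts to sweeping each optimizer over the same grid of model sizes, batch sizes, and token budgets while holding everything else fixed. The snippet below is a hypothetical sketch of that kind of sweep definition; the optimizer names come from the article, but the sizes, batch sizes, and token budgets are illustrative placeholders, not the paper's settings.

```python
from itertools import product

# Hypothetical sweep grid; values are illustrative, not the paper's settings.
optimizers = ["AdamW", "AdEMAMix", "MARS", "Lion", "Signum", "Sophia", "D-Muon"]
model_sizes = ["124M", "350M", "720M"]
batch_sizes = [256, 1024]          # sequences per optimizer step
token_budgets = [2e9, 8e9, 32e9]   # total training tokens

# Every optimizer sees the identical (model, batch, budget) configurations,
# so final validation-loss rankings are directly comparable.
runs = [
    {"optimizer": opt, "model": m, "batch_size": b, "tokens": int(t)}
    for opt, m, b, t in product(optimizers, model_sizes, batch_sizes, token_budgets)
]
```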

The study found that while traditional optimizers like AdamW have dominated for years, newer methods offer significant advantages. Not all of them are robust, however: "Sophia often blows up on longer runs," as noted by Paul, pointing to stability issues over extended training. Sign-based optimizers such as Lion and Signum, by contrast, demonstrated strong performance particularly at larger batch sizes. The research also highlighted D-Muon for its predictable and stable behavior across configurations.
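For context, sign-based methods like Signum update parameters using only the sign of a momentum-smoothed gradient, so the per-coordinate step magnitude does not depend on the gradient's scale; this is one commonly cited intuition for why they pair well with large batches, where the sign is estimated more reliably. The following is a minimal illustrative sketch of a Signum-style step, not the paper's or any library's implementation; the hyperparameter values and the decoupled weight decay are assumptions.

```python
import torch

def signum_step(param, grad, momentum_buf, lr=1e-4, beta=0.9, weight_decay=0.1):
    """One Signum-style update: momentum-smooth the gradient, then step by its sign.

    Illustrative sketch only; hyperparameters and decoupled weight decay are assumptions.
    """
    momentum_buf.mul_(beta).add_(grad, alpha=1 - beta)   # EMA of recent gradients
    param.mul_(1 - lr * weight_decay)                    # decoupled weight decay
    param.add_(momentum_buf.sign(), alpha=-lr)           # +/- lr per coordinate
    return param, momentum_buf
```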

Key insights from the benchmarking include the effectiveness of cosine learning-rate schedules, which generally yield better results, though linear schedules can suit sign-based methods and MARS. Hyperparameter tuning also proved important: a weight decay near 0.1 emerged as a robust default for long training runs, while higher decay can be preferable for shorter ones. The paper concludes with practical recommendations: "For practical choices, pick AdEMAMix or MARS for large jobs, keep D-Muon when you want predictable behavior, and always tune warmup and weight decay." This research provides a valuable, evidence-based roadmap for optimizing LLM pretraining.
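As a concrete illustration of those recommendations, the sketch below wires up decoupled weight decay of 0.1 together with a linear-warmup-plus-cosine learning-rate schedule. AdamW is used only as a stand-in, since the benchmarked optimizers ship their own implementations; the peak learning rate, betas, warmup length, and step count are placeholder assumptions, not the paper's tuned settings.

```python
import math
import torch

def make_optimizer_and_schedule(model, peak_lr=3e-4, weight_decay=0.1,
                                warmup_steps=1000, total_steps=100_000):
    """AdamW (stand-in) with decoupled weight decay and warmup + cosine LR decay.

    All numeric values are illustrative placeholders, not recommended settings.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup from 0 to peak_lr
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Swapping in a different optimizer changes only the first constructor call; the warmup and decay logic, which the paper stresses should always be tuned, stays the same.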