LLaMA 4 Unveiled with 10 Million Token Context Window, Faces Scrutiny Over Benchmark Claims


Meta AI officially released LLaMA 4 in April 2025, introducing multimodal capabilities, a Mixture-of-Experts (MoE) architecture, and an industry-leading 10 million token context window in its Scout model. The launch, however, met with mixed reactions. One widely shared tweet from Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) stated: "you think you understand how much of a disaster LLaMA 4 was, but you still don't. It wasn't just a failure to keep pace with Chinese labs; they regressed a lot specifically in things they wanted to deliver on." That sentiment contrasts sharply with Meta's claims of significant advancement.

The LLaMA 4 collection includes three primary models: Scout, Maverick, and the forthcoming Behemoth. LLaMA 4 Scout, with 17 billion active parameters, is designed for efficiency, can run on a single NVIDIA H100 GPU, and carries the 10 million token context window for long-document analysis. LLaMA 4 Maverick, also with 17 billion active parameters but drawing on 128 experts, targets general assistant and chat use cases, with strong performance in reasoning and coding. The largest model, LLaMA 4 Behemoth, estimated at roughly 2 trillion total parameters, remains in training and is intended as a "teacher model" for its smaller counterparts.
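To see why "17 billion active parameters" matters, a quick back-of-envelope calculation compares active parameters to the totals Meta reported at launch (roughly 109B total for Scout and 400B for Maverick; these totals come from Meta's announcement, not the text above, so treat them as reported figures):

```python
# Back-of-envelope: fraction of parameters active per token in an MoE model.
# Active counts are from the article; totals are Meta's reported figures.
specs = {
    "Scout":    {"active_b": 17, "total_b": 109, "experts": 16},
    "Maverick": {"active_b": 17, "total_b": 400, "experts": 128},
}

for name, s in specs.items():
    frac = s["active_b"] / s["total_b"]
    print(f"{name}: {s['active_b']}B of {s['total_b']}B parameters "
          f"({frac:.1%}) active per token, {s['experts']} experts")
```

The point of the arithmetic: per-token compute scales with the active count, so Maverick can hold 400B parameters' worth of capacity while paying inference cost closer to a dense ~17B model.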

A key innovation in LLaMA 4 is its adoption of the Mixture-of-Experts (MoE) architecture, which enhances efficiency by activating only a fraction of the total parameters for each task, leading to lower serving costs and latency. The models are also natively multimodal, integrating text and visual tokens through "early fusion" during pre-training. This allows for a deeper, more integrated understanding of both images and text, a significant step beyond previous "bolted-on" multimodal approaches.
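The routing idea behind MoE can be illustrated with a minimal sketch. This is not LLaMA 4's actual implementation; it is a generic top-k gated layer (a small gating network scores the experts, and only the top-k selected experts run for a given token), with toy linear "experts" standing in for the real feed-forward blocks:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=1):
    """Minimal top-k MoE layer sketch: score experts, run only the top k."""
    logits = x @ gate_w                        # gating scores, one per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                   # softmax over the selected experts
    # Only the selected experts execute; the rest of the parameters stay idle,
    # which is where the serving-cost and latency savings come from.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, num_experts = 8, 4
gate_w = rng.normal(size=(d, num_experts))
# Each "expert" here is just a linear map; real experts are feed-forward blocks.
expert_ws = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, w=w: x @ w for w in expert_ws]

x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=1)
```

With `k=1`, each token touches a single expert plus the shared gating weights, so compute per token is a small fraction of the total parameter count regardless of how many experts exist.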

Meta has presented benchmark results indicating that LLaMA 4 Maverick outperforms models like OpenAI's GPT-4o and Google's Gemini 2.0 on various coding, reasoning, multilingual, and image benchmarks. The company also notes that Maverick achieves comparable results to DeepSeek v3 on reasoning and coding. Despite these positive claims, the release sparked controversy, with some critics questioning Meta's benchmarking practices, alleging the use of an "experimental chat version" optimized for testing that differed from the public release.

The timing of the LLaMA 4 launch on a Saturday also fueled speculation, with some analysts suggesting it was a strategic move to pre-empt announcements from competitors, including Chinese AI labs like Alibaba's Qwen team and DeepSeek. Reports also indicated that LLaMA 4 faced internal development delays and struggled with math and logic tasks as well as voice features, which may underlie the perception of "regression" voiced in the tweet. Despite the criticism, Meta remains committed to its open-source model strategy, aiming to foster innovation and widespread adoption within the developer community.