
NVIDIA Research has unveiled Nemotron-Flash, a new family of hybrid Small Language Models (SLMs) designed to significantly improve the accuracy and efficiency of AI agents. The work focuses on optimizing real-device latency and throughput, which are critical for applications with demanding performance requirements, and introduces an evolutionary search framework to identify optimal architectural configurations.
The core innovation addresses the observation that parameter efficiency in SLMs does not always translate into proportional real-device speed-ups. The researchers identified depth-width ratios and operator choices as critical architectural factors influencing latency and throughput, finding that while deep-thin models can be parameter-efficient, they do not always offer the best accuracy-latency trade-off.
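To make the depth-width point concrete, the back-of-the-envelope sketch below (not taken from the paper; the layer counts, hidden sizes, and per-block parameter formula are illustrative assumptions) shows how two transformer-style configurations with similar parameter budgets can differ sharply in depth, and therefore in how much sequential work each generated token requires:

```python
# Rough illustration (not from the paper): two configurations with similar
# parameter budgets can differ greatly in depth, and depth largely determines
# the number of sequential kernel launches per token at small batch sizes.

def approx_block_params(d_model: int) -> int:
    """Very rough per-block parameter count for a standard transformer block
    (Q/K/V/O projections plus a 4x-expansion MLP), ignoring embeddings and norms."""
    return 4 * d_model * d_model + 8 * d_model * d_model  # ~12 * d_model^2


deep_thin = {"layers": 48, "d_model": 1024}     # assumed "deep-thin" config
shallow_wide = {"layers": 12, "d_model": 2048}  # assumed "shallow-wide" config

for cfg in (deep_thin, shallow_wide):
    total = cfg["layers"] * approx_block_params(cfg["d_model"])
    print(cfg, f"~{total / 1e6:.0f}M parameters")

# Both land near ~600M parameters, yet the deep-thin variant runs 4x as many
# sequential blocks per token, which tends to hurt single-stream latency.
```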
A key component of Nemotron-Flash is an evolutionary search framework that automatically discovers latency-optimal combinations of efficient attention alternatives within hybrid SLMs. This automated approach replaces manual, heuristic-driven architecture decisions and integrates promising operators such as DeltaNet, Mamba2, and attention. The framework exploits the observation that performance rankings among candidate architectures stabilize early in training, allowing superior architectures to be identified without full training runs.
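The paper's exact search procedure is not reproduced here, but a minimal sketch of the general idea, an evolutionary search over per-layer operator choices scored by a short-training accuracy proxy and a measured-latency proxy, might look like the following. The population size, mutation rate, and both scoring functions are placeholders, not the authors' implementation:

```python
import random

# Candidate operators named in the paper; everything else below is assumed.
OPERATORS = ["deltanet", "mamba2", "attention"]
NUM_LAYERS = 24          # assumed model depth for illustration
POPULATION_SIZE = 16
GENERATIONS = 10
MUTATION_RATE = 0.1


def random_architecture():
    """A candidate hybrid architecture: one operator choice per layer."""
    return [random.choice(OPERATORS) for _ in range(NUM_LAYERS)]


def proxy_accuracy(arch):
    """Placeholder for a short training run. The early stabilization of
    performance rankings is what justifies ranking candidates with a
    truncated training budget instead of full runs."""
    # Purely illustrative scoring; a real implementation would briefly train
    # the candidate and evaluate it on held-out data.
    return sum(1.0 if op != "attention" else 0.8 for op in arch) + random.random()


def measured_latency(arch):
    """Placeholder for on-device latency measurement of the candidate."""
    cost = {"deltanet": 1.0, "mamba2": 1.1, "attention": 2.0}  # made-up costs
    return sum(cost[op] for op in arch)


def fitness(arch):
    """Trade the accuracy proxy against measured latency (higher is better)."""
    return proxy_accuracy(arch) / measured_latency(arch)


def mutate(arch):
    return [random.choice(OPERATORS) if random.random() < MUTATION_RATE else op
            for op in arch]


def crossover(a, b):
    cut = random.randrange(1, NUM_LAYERS)
    return a[:cut] + b[cut:]


def evolutionary_search():
    population = [random_architecture() for _ in range(POPULATION_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        parents = population[: POPULATION_SIZE // 2]  # keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POPULATION_SIZE - len(parents))]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    print("Best hybrid layout found:", evolutionary_search())
```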
Further enhancements include a novel weight-normalization technique that enables more effective weight updates and improves final convergence, along with meta tokens used to initialize the cache. These training innovations contribute to the overall performance gains. The Nemotron-Flash family, which includes 1B and 3B parameter models, was trained on 4.5 trillion tokens using NVIDIA H100 GPUs.
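The paper's specific weight-normalization scheme is not detailed in this summary; as a rough point of reference, classic weight normalization (reparameterizing a weight matrix into a direction and a learned scale) can be sketched as below. This module illustrates that general idea only and is not Nemotron-Flash's method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightNormalizedLinear(nn.Module):
    """Generic weight normalization (w = g * v / ||v||), shown purely as an
    illustration of decoupling a weight's direction from its learned scale;
    the paper's actual normalization technique may differ."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.g = nn.Parameter(torch.ones(out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Normalize each output row of v to unit norm, then rescale by g,
        # so gradient updates to v change direction without changing scale.
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return F.linear(x, w, self.bias)
```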
"we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy–efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3×/1.9× lower latency, and 18.7×/45.6× higher throughput compared to Qwen3-1.7B/0.6B, respectively," stated the research paper, a summary of which was shared by Tanishq Mathew Abraham, Ph.D. on social media.
Nemotron-Flash-1B, for instance, demonstrated over 5.5% higher average accuracy, 1.9× lower latency, and 45.6× higher throughput than Qwen3-0.6B, while Nemotron-Flash-3B achieved over 5.5% higher average accuracy, 1.3× lower latency, and 18.7× higher throughput compared to Qwen3-1.7B. These gains position Nemotron-Flash as a significant advance for deploying efficient, accurate small language models in diverse, latency-sensitive applications.