InclusionAI has officially open-sourced Ling-flash-2.0, a new Mixture-of-Experts (MoE) large language model with 100 billion total parameters, of which only 6.1 billion are activated per token. The announcement highlights the model's efficiency and performance, positioning it as a strong contender in the rapidly evolving open-source AI landscape.
The model pairs high throughput with strong reasoning. Its developer, Ant Ling, highlights that it runs more than "3x faster than 36B dense (200+ tok/s on H20)," "beats ~40B dense LLM on complex reasoning," and delivers "powerful coding and frontend development." The speed follows directly from the small active parameter count, and despite that sparsity, Ling-flash-2.0 is reported to achieve state-of-the-art performance relative to dense models under 40 billion parameters.
Built on the Ling 2.0 architecture, the model uses a 1/32 activation-ratio MoE design refined through choices such as expert granularity and a custom routing strategy. InclusionAI claims this yields a 7x efficiency improvement over equivalent dense architectures. The model was trained on more than 20 trillion tokens of high-quality data and further refined with supervised fine-tuning and multi-stage reinforcement learning.
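To make the 1/32 activation ratio concrete, the sketch below shows a generic sparse MoE feed-forward layer in PyTorch, where each token is routed to 2 of 64 experts (2/64 = 1/32). The class name, layer sizes, and the simple softmax top-k router are illustrative assumptions, not InclusionAI's actual architecture or routing strategy.

```python
# Illustrative sparse MoE feed-forward layer with a 1/32 activation ratio.
# All names and dimensions are hypothetical; this is not Ling-flash-2.0's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, num_experts=64, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k  # 2 of 64 experts -> 1/32 of expert parameters active per token
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, num_experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # pick top-k experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize selected weights

        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue  # this expert received no tokens in the batch
            expert_out = self.experts[e](x[rows])
            out[rows] += topk_w[rows, slots].unsqueeze(-1) * expert_out
        return out


if __name__ == "__main__":
    layer = SparseMoELayer()
    tokens = torch.randn(8, 1024)
    print(layer(tokens).shape)  # torch.Size([8, 1024])
```

Because only the selected experts run for each token, per-token compute scales with the active parameters rather than the full expert pool, which is the property the efficiency claim rests on.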
The open-sourcing of Ling-flash-2.0 on Hugging Face has generated considerable interest within the AI community. Developers are particularly drawn to its potential for efficient deployment: sparse activation lowers per-token compute, which can speed up both inference and training, even though the full 100 billion parameters must still be held in memory. Its native 32K context length, extendable to 128K with YaRN, also offers flexibility for applications ranging from complex reasoning to advanced coding tasks.
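As a rough starting point, the snippet below sketches how the released checkpoint might be loaded and queried with the Hugging Face transformers library. The repository ID, dtype, and generation settings are assumptions for illustration; the official model card on Hugging Face remains the authoritative reference.

```python
# Minimal sketch of loading and prompting the model with transformers.
# Repo ID and settings are assumptions; consult the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-flash-2.0"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # sparse activation cuts compute, not weight storage
    device_map="auto",           # requires accelerate; shards weights across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain Mixture-of-Experts routing in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```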