
Harbin, China – Researchers have unveiled Uni-MoE-2.0-Omni, a new omnimodal large model that significantly outperforms its competitor Qwen2.5-Omni in key metrics, including a 7% lead in omnimodal understanding, despite being trained on substantially fewer tokens. The model, developed by a team including Yunxin Li, is now open-source, marking a notable advancement in efficient, comprehensive AI capabilities.
The Uni-MoE-2.0-Omni model was built from scratch on the Qwen2.5-7B dense architecture and trained on approximately 75 billion tokens of open-source multimodal data. This contrasts sharply with Qwen2.5-Omni, which reportedly used about 1.2 trillion tokens. The efficiency gain points to advances in architectural design and training strategy that deliver stronger performance with a fraction of the computational resources.
Key to Uni-MoE-2.0-Omni's innovation is its novel Omnimodality 3D RoPE (Rotary Position Embedding) and Dynamic Capacity Mixture-of-Experts (MoE) architecture. This design aligns speech, text, images, and video within a shared spatiotemporal positional framework, enabling better processing of diverse inputs, while the MoE framework adapts the amount of computation allocated to each input based on task complexity, improving efficiency.
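To make the dynamic-capacity idea concrete, the sketch below shows one common way such routing can work: a router scores each token and activates a variable number of experts depending on routing confidence. The class name, thresholding rule, and hyperparameters are illustrative assumptions, not the released Uni-MoE-2.0-Omni implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCapacityMoE(nn.Module):
    """Toy dynamic-capacity MoE layer: the router scores every token and the
    number of experts activated per token varies with routing confidence, so
    harder tokens receive more compute. All names and hyperparameters are
    illustrative, not taken from the released model."""

    def __init__(self, d_model=512, n_experts=8, max_k=4, rel_threshold=0.3):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.max_k = max_k
        self.rel_threshold = rel_threshold  # keep experts scoring >= 30% of the top score

    def forward(self, x):                        # x: (batch, seq, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        topk_p, topk_i = probs.topk(self.max_k, dim=-1)
        # Keep a variable number of experts per token: those whose routing
        # probability is close enough to the best expert's probability.
        keep = topk_p >= self.rel_threshold * topk_p[..., :1]
        keep[..., 0] = True                      # always keep the top expert
        weights = torch.where(keep, topk_p, torch.zeros_like(topk_p))
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.max_k):
            for e, expert in enumerate(self.experts):
                mask = (topk_i[..., slot] == e) & keep[..., slot]
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In this toy setup, the effective compute per token ranges from one to max_k expert evaluations, which is the kind of adaptive allocation the article describes.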
The model also features a deeply fused multimodal encoder-decoder design, supporting any combination of input and output modalities for true omnimodal interaction and generation. Its progressive training strategy, which moves through cross-modal alignment, expert warm-up, MoE fine-tuning, and generative training, ensures stable convergence with less data, particularly during the reinforcement learning phase.
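The staged recipe can be pictured as a schedule that unfreezes different parameter groups at each phase. The sketch below is a hypothetical outline under that assumption; the stage names follow the article, but which modules are trained in each stage and the loop details are placeholders, not the authors' released recipe.

```python
import torch

# Hypothetical progressive-training schedule. Stage names follow the article;
# the parameter groups unfrozen in each stage are illustrative assumptions.
STAGES = [
    {"name": "cross_modal_alignment", "trainable": ("connector",)},
    {"name": "expert_warmup",         "trainable": ("connector", "experts")},
    {"name": "moe_finetuning",        "trainable": ("experts", "router", "llm")},
    {"name": "generative_training",   "trainable": ("decoder", "llm")},
]

def run_stage(model, stage, dataloader, lr=1e-4, epochs=1):
    # Freeze everything, then unfreeze only this stage's parameter groups
    # (matched here by a simple name-prefix convention).
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(g) for g in stage["trainable"])
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss      # assumes the model returns a loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# for stage in STAGES:
#     run_stage(model, stage, dataloader)
```

Running the stages in order mirrors the progression the researchers describe, with later phases building on representations aligned in earlier ones.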
Uni-MoE-2.0-Omni demonstrates broad capabilities, including speech generation and interaction, image generation and editing, image/video understanding, and audio-visual reasoning across more than 10 multimodal tasks. According to the researchers, it "outperformed Qwen2.5-Omni (1.2T tokens) in 50+ out of 76 comparable tasks." Specific gains include roughly 5% in video understanding, 7% in omnimodal understanding, and 4.3% in speech QA.
The open-source release of Uni-MoE-2.0-Omni, including its model, code, and homepage, is expected to accelerate research and development in omnimodal AI. This initiative provides the broader AI community with access to a highly efficient and powerful tool for advancing language-centric multimodal understanding, reasoning, and generation.