Huawei's CloudMatrix 384 System Achieves 920,000 Tokens/Second with DeepSeek Optimization

Huawei continues to advance its AI infrastructure capabilities, with recent optimizations to its CloudMatrix 384 system demonstrating a significant performance boost for DeepSeek's Mixture-of-Experts (MoE) models. A single deployment unit, representing 37.5% of the full system, has achieved 345,000 tokens per second (345 KT/s), with projections indicating the complete CloudMatrix 384 can deliver 920,000 tokens per second. This development underscores Huawei's strategic efforts to enhance its AI computing power, particularly for large language models.

The CloudMatrix 384 is a rack-scale AI architecture built around 384 Huawei Ascend 910C chips, designed to compete directly with Nvidia's high-end systems like the GB200 NVL72. Each deployment unit within the CloudMatrix 384 comprises 144 individual chips, or 288 dies, as each 910C packages two compute dies. For DeepSeek's MoE architecture, the system is configured to place two "experts" (one routed and one redundant) per die, optimizing the processing of complex AI workloads.
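As a sanity check, the per-unit figures reported here are internally consistent. A minimal sketch (the dies-per-chip and experts-per-die counts come from the reporting above, not from Huawei documentation):

```python
# Expert-placement arithmetic for one CloudMatrix 384 deployment unit,
# using the figures cited in the reporting.
CHIPS_PER_UNIT = 144   # Ascend 910C chips in one deployment unit
DIES_PER_CHIP = 2      # each 910C is reported to package two compute dies
EXPERTS_PER_DIE = 2    # one routed expert + one redundant expert per die

dies_per_unit = CHIPS_PER_UNIT * DIES_PER_CHIP
expert_slots_per_unit = dies_per_unit * EXPERTS_PER_DIE

print(dies_per_unit)         # 288, matching the reported die count
print(expert_slots_per_unit) # 576 expert slots across the unit
```

The 288-die figure matches the reported configuration exactly; the 576 expert slots is simply the implied total under the two-experts-per-die placement.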

This performance is critical for inference tasks on large language models, where the speed of token generation directly impacts user experience and application efficiency. While research papers report per-NPU decoding rates of roughly 1,943 tokens per second, the system-level metrics of 345 KT/s for the unit and 920 KT/s for the full system reflect the aggregate throughput achieved under optimized conditions. According to a social media post by Teortaxes, a DeepSeek enthusiast, "a deployment unit is 37.5% slice of the system, 144 individual chips, 288 dies; they plop 2 experts (routed and 'redundant') per die; get 345KT/s for the entire thing. So 384 => 920KT/s."
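The projection in the quote is straightforward proportionality; a quick check, assuming the deployment unit's measured throughput scales linearly to the full 384-chip system:

```python
# Scale the measured deployment-unit throughput to the full system,
# based on the reported 37.5% slice (144 of 384 chips).
UNIT_FRACTION = 144 / 384        # = 0.375, the "37.5% slice"
unit_throughput_kts = 345        # measured unit throughput, in KT/s

full_system_kts = unit_throughput_kts / UNIT_FRACTION
per_chip_tokens_per_s = full_system_kts * 1000 / 384

print(full_system_kts)               # 920.0 KT/s, the projected figure
print(round(per_chip_tokens_per_s))  # ~2396 tokens/s implied per chip
```

The implied ~2,396 tokens per second per chip sits above the ~1,943 tokens-per-second per-NPU figure from the research papers, consistent with the article's note that the system-level numbers reflect optimized conditions.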

Huawei's approach, often described as "brute force" due to its higher chip count and power consumption compared to some Western counterparts, allows it to achieve competitive performance despite international restrictions on advanced chip manufacturing. The CloudMatrix 384 leverages advanced optical interconnects to ensure high-bandwidth, low-latency communication across its vast array of chips, a crucial factor for scaling MoE models. This strategic focus on domestic AI infrastructure is vital for China's technological independence in the rapidly evolving artificial intelligence landscape.