The ByteDance Seed team has unveiled UltraMemV2, a new memory-layer architecture for large language models (LLMs) that reportedly achieves performance on par with state-of-the-art 8-expert Mixture of Experts (MoE) models. As stated in a recent tweet by AI researcher Teortaxes, UltraMemV2 demonstrates "performance parity with 8-expert [ie finegrained] MoEs under same computation and parameters but significantly lower memory access." This advancement addresses a critical bottleneck in LLM inference and could pave the way for more efficient and scalable AI deployments.
MoE models have gained prominence for their efficiency in activating only subsets of parameters, allowing for larger total parameter counts without proportional increases in computational cost. However, their inference process is often hampered by high memory access requirements stemming from expert routing: even though only a few experts run per token, their weights must still be fetched from memory. Previous memory-layer architectures, such as the initial UltraMem, had only managed to match the performance of less efficient 2-expert MoE configurations, leaving a significant performance gap.
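The routing mechanism described above can be illustrated with a minimal sketch. This is a generic top-k MoE gate for intuition only; the dimensions, gate, and expert functions are assumptions, not ByteDance's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, gate_w, k=2):
    """Route one token vector x to its top-k experts by gate score."""
    scores = x @ gate_w                       # one logit per expert
    top = np.argsort(scores)[-k:]             # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                  # softmax over selected experts
    # Only k expert FFNs execute -- the source of MoE's compute savings --
    # but each selected expert's weights still have to be fetched,
    # which is the memory-access overhead the article refers to.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d, n_experts = 16, 8
# Toy experts: each is a small nonlinear map (illustrative stand-in for an FFN).
experts = [(lambda W: (lambda x: np.tanh(x @ W)))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
y = moe_layer(rng.standard_normal(d), experts, gate_w, k=2)
```

With k=2 of 8 experts active, compute per token is fixed while total parameters grow with the expert count, which is why MoE scales total capacity cheaply.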
UltraMemV2 aims to bridge this gap through several key architectural innovations, as detailed in its accompanying arXiv paper (arXiv:2508.18756). These include integrating memory layers directly into every transformer block, simplifying value expansion with single linear projections, and adopting an FFN-based value-processing approach inspired by PEER. The ByteDance team also implemented principled parameter initialization to prevent training divergence and rebalanced the computation ratio between memory lookups and feed-forward networks.
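To make the memory-layer idea concrete, here is a hedged sketch of a sparse key-value lookup of the general kind such architectures use. Every detail (table size, top-m selection, single value table) is an illustrative assumption, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

def memory_layer(x, keys, values, m=4):
    """Score x against a large key table, then read only the top-m values."""
    scores = keys @ x                          # similarity to every key slot
    top = np.argsort(scores)[-m:]              # sparse selection of m slots
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # normalize the selected scores
    # Only m rows of the value table are fetched, so memory access scales
    # with m rather than with the table size -- the efficiency property
    # memory layers trade on.
    return weights @ values[top]

d, n_slots = 16, 1024
keys = rng.standard_normal((n_slots, d))
values = rng.standard_normal((n_slots, d))     # values read via a single table
y = memory_layer(rng.standard_normal(d), keys, values, m=4)
```

The contrast with MoE is that the sparse unit here is a tiny value vector rather than a whole expert network, so far fewer weight bytes move per token.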
The new architecture demonstrates superior performance on memory-intensive tasks, showing improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. The tweet further noted that UltraMemV2 models "scale to 120B total/2.5B active" parameters, underscoring its potential for practical applications at significant scale.
A key finding from the research is that "activation density has greater impact on performance than total sparse parameter count," as stated in the associated paper. This insight suggests that optimizing how actively parameters are used matters more than simply increasing the overall number of sparse parameters. The ByteDance Seed team's work presents UltraMemV2 as a compelling and viable alternative for efficient sparse computation in the evolving landscape of large language models.