
DeepSeek AI has made significant strides in optimizing large language models (LLMs) by reducing the computational complexity of their attention mechanism from quadratic to near-linear. This breakthrough, integrated into their DeepSeek-V3.2 models, promises substantial improvements in efficiency and cost-effectiveness for processing long contexts. AI researcher Susan Zhang highlighted these advancements, noting, "Deepseek reduced attention complexity from quadratic to ~linear through warm-starting (w/ separate init + opt dynamics) and adapting the change over ~1T tokens."
At the core of this innovation is DeepSeek Sparse Attention (DSA). Unlike traditional full attention, which scores every pair of tokens, DSA employs a "lightning indexer" to rank past tokens and attend only to a fixed number of the most relevant ones for each query. This selective token retrieval dramatically lowers the computational burden, allowing the model's attention work to scale almost linearly with sequence length, a critical factor for handling extensive text.
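As a rough illustration of the idea, the sketch below implements top-k token selection with a small scoring projection standing in for the indexer. It is a simplified, single-head approximation under assumed names and shapes (`index_q`, `index_k`, `top_k`), not DeepSeek's actual kernel.

```python
# Minimal single-head sketch of indexer-guided sparse attention.
# All names and shapes are illustrative assumptions, not DeepSeek's implementation.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, index_q, index_k, top_k=64):
    """q, k, v: (T, d) per-token projections; index_q, index_k: (T, d_idx)
    cheap projections used only by the indexer; top_k: tokens kept per query."""
    T, d = q.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))

    # 1) A lightweight indexer scores every past token for each query position.
    idx_scores = (index_q @ index_k.T).masked_fill(~causal, float("-inf"))

    # 2) Keep only the top-k highest-scoring past tokens per query.
    k_eff = min(top_k, T)
    topk_idx = idx_scores.topk(k_eff, dim=-1).indices          # (T, k_eff)
    valid = torch.gather(causal, 1, topk_idx)                  # guard early rows

    # 3) Exact attention runs only over the selected tokens, so the cost
    #    grows with T * top_k rather than T * T.
    k_sel, v_sel = k[topk_idx], v[topk_idx]                    # (T, k_eff, d)
    att = torch.einsum("td,tkd->tk", q, k_sel) / d ** 0.5
    att = att.masked_fill(~valid, float("-inf"))
    att = F.softmax(att, dim=-1)
    return torch.einsum("tk,tkd->td", att, v_sel)              # (T, d)
```

Because the per-query work is capped at `top_k` selected tokens, doubling the sequence length roughly doubles the attention cost instead of quadrupling it.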
The transition to this efficient sparse attention was achieved through a two-stage warm-starting process. First, a "dense warm-up" phase trained the lightning indexer to mimic the behavior of the existing dense attention mechanism. Then, a "sparse training stage" adapted the model to the new sparse attention over a further 943.7 billion tokens, close to the ~1 trillion figure Zhang cited, preserving output quality while realizing the reduced complexity.
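One plausible way to realize the dense warm-up, sketched below, is to freeze the main model and train only the indexer to match the frozen dense attention distribution. The KL-style alignment objective and all names here are illustrative assumptions, not a reproduction of DeepSeek's training recipe.

```python
# Sketch of a dense warm-up objective: align the trainable indexer's score
# distribution with the frozen model's dense attention distribution.
# The KL-style loss and all names are illustrative assumptions.
import torch
import torch.nn.functional as F

def indexer_warmup_loss(dense_attn_probs, indexer_scores, causal_mask):
    """dense_attn_probs: (T, T) softmaxed dense attention from the frozen model.
    indexer_scores: (T, T) raw scores from the trainable lightning indexer.
    causal_mask: (T, T) bool, True where attention is allowed."""
    # A large negative fill (rather than -inf) keeps the loss numerically stable.
    indexer_scores = indexer_scores.masked_fill(~causal_mask, -1e9)
    indexer_log_probs = F.log_softmax(indexer_scores, dim=-1)
    # KL(dense || indexer): the indexer learns to look where dense attention looks.
    return F.kl_div(indexer_log_probs, dense_attn_probs.detach(),
                    reduction="batchmean")
```

Once the indexer reliably reproduces the dense attention pattern, switching to top-k selection discards mostly low-weight tokens, which is what allows the subsequent sparse training stage to proceed without a quality collapse.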
DeepSeek has also implemented distinct optimization strategies for the prefill and decode phases of inference, using what appear to be separate attention modes. The prefill stage, which processes the initial input prompt, and the decode stage, which generates subsequent tokens one at a time, have different computational characteristics: prefill is typically compute-bound, while decode is dominated by repeated reads of the key-value cache. Differentiating the architecture for these phases allows tailored optimizations, and Susan Zhang noted this may be the "first public account of arch difference between the two."
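The schematic below, a simplified dense-attention sketch rather than DeepSeek's implementation, shows why the two phases differ: prefill scores the whole prompt in one batched pass, while each decode step handles a single new query against the growing key-value cache.

```python
# Illustrative contrast between the two inference phases (not DeepSeek's code).
import torch

def prefill(q_prompt, k_prompt, v_prompt):
    """q/k/v: (T, d). Compute-bound: one big causal attention pass over the prompt."""
    T, d = q_prompt.shape
    scores = (q_prompt @ k_prompt.T) / d ** 0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_prompt          # (T, d)

def decode_step(q_new, k_cache, v_cache):
    """q_new: (1, d); caches: (T_past, d). Memory-bound: the whole cache is
    reread for one query per step, which is where sparse selection saves most."""
    d = q_new.shape[-1]
    scores = (q_new @ k_cache.T) / d ** 0.5                  # (1, T_past)
    return torch.softmax(scores, dim=-1) @ v_cache           # (1, d)
```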
These architectural and training innovations have profound implications for the AI industry, setting new benchmarks for economic efficiency. DeepSeek's DeepSeek-V3.2 models, leveraging DSA, have demonstrated significant cost reductions, with reports indicating up to a 70% decrease in inference costs for long-context operations. This positions DeepSeek as a formidable competitor in the rapidly evolving landscape of high-performance, cost-efficient LLMs.