FlashAttention v4 Code Leaked, Revealing Blackwell GPU Optimization


A recent social media post by user Lisan al Gaib, stating "FlashAttention v4 is coming to Blackwell GPUs," has ignited discussion within the artificial intelligence community, particularly following the apparent leak of its source code. This development suggests a new iteration of the critical FlashAttention algorithm, specifically optimized for NVIDIA's latest Blackwell GPU architecture, could be nearing release. The leaked code reportedly indicates compatibility with Blackwell's SM100+ architecture and leverages its fifth-generation Tensor Cores, alongside the CuTe DSL (CUTLASS) for enhanced performance.

FlashAttention, developed by Tri Dao, is a pivotal optimization technique that significantly accelerates the attention mechanism in Transformer models, which are fundamental to large language models (LLMs). Previous versions, such as FlashAttention-2, offered substantial speedups and memory savings across NVIDIA's Ampere, Ada, and Hopper GPUs. FlashAttention-3, released as a beta, further optimized performance specifically for Hopper GPUs like the H100, achieving up to 2.0x faster processing than its predecessor.
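The core idea behind FlashAttention is to compute attention block by block with an "online softmax," so the full N×N score matrix never has to be materialized in slow GPU memory. A minimal NumPy sketch of that recurrence (an illustrative reimplementation for clarity, not the actual fused CUDA kernel; function names and the block size are ours) looks like this:

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: softmax(QK^T / sqrt(d)) V, materializing all scores."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    s = q @ k.T * scale
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=16):
    """FlashAttention-style pass: walk over K/V in blocks, keeping a running
    row max and softmax denominator so only one block of scores exists at a time."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    n, d = q.shape
    out = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running row-wise max of the scores
    l = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                              # scores for this block only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)                             # unnormalized block probabilities
        correction = np.exp(m - m_new)                    # rescale earlier accumulators
        l = l * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        m = m_new
    return out / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.max(np.abs(tiled_attention(q, k, v) - naive_attention(q, k, v))))
```

The tiled version produces the same output as the naive one (up to floating-point error) while touching only one block of keys and values at a time; the real kernels fuse this loop with Tensor Core matrix multiplies in on-chip SRAM, which is where the speedups come from.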

The emergence of FlashAttention v4, as suggested by the leak, points to a dedicated effort to harness the full capabilities of NVIDIA's Blackwell architecture. Blackwell GPUs, including the B200, are designed to be the engine behind next-generation AI factories, offering unparalleled performance, efficiency, and scale for complex AI workloads. The integration of a highly optimized attention mechanism like FlashAttention v4 is crucial for maximizing the throughput and reducing the latency of LLM training and inference on these new processors.

Industry experts anticipate that a Blackwell-native FlashAttention v4 would further solidify NVIDIA's position in the AI hardware market by enabling even more efficient and faster processing of large-scale AI models. While FlashAttention-3 was largely exclusive to Hopper, the leaked information implies FlashAttention v4 is tailored for Blackwell, potentially leaving older architectures to rely on prior versions. This continuous evolution of software optimizations, tightly coupled with hardware advancements, remains key to pushing the boundaries of AI capabilities.