Yam Peleg Proposes Novel O(N³) Attention for Transformers, Promising Enhanced Power with Standard Inference Cost

AI researcher Yam Peleg has stirred discussion in the machine learning community with a bold proposal for a new attention mechanism in Transformer models. Peleg suggests an architecture with O(N³) computational complexity during training, which he claims would yield "insanely more powerful" models while keeping inference at today's O(N²) cost. The proposition runs counter to most current work on Transformer attention, which aims to reduce quadratic scaling, not exceed it.

In a recent social media post, Peleg outlined his concept:

```
Instead of causal attention, do:

N × N × N = O(N³)
▲   ▲   ▲
│   │   └ fully connected attention
│   │     for every token up to current token

less efficient training but insanely more powerful
with same inference cost as today O(N²)
```

This statement highlights a significant trade-off, suggesting that a substantial increase in training computation could unlock unprecedented model capabilities without burdening real-time applications.
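To put the magnitude of that trade-off in rough numbers: moving from N² to N³ attention-score terms multiplies the work by a factor of N, so at an illustrative context length of 8,192 tokens (a figure chosen here for the example, not one given by Peleg) training-time attention would involve roughly 8,000 times as many score evaluations.

```python
# Back-of-the-envelope ratio of cubic to quadratic attention-score counts.
# The context length below is an illustrative choice, not a figure from Peleg.
N = 8192
print(f"N^2 scores: {N**2:,}")    # 67,108,864
print(f"N^3 scores: {N**3:,}")    # 549,755,813,888
print(f"ratio: {N**3 // N**2}x")  # the cubic variant costs a factor of N more
```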

Current Transformer models predominantly use self-attention, which exhibits O(N²) complexity in both training and inference, where N is the sequence length. Causal attention, the variant used in generative models like GPT, restricts each token to attending only to itself and earlier tokens, which is what makes autoregressive generation possible. While effective, these mechanisms struggle with very long sequences because of their quadratic scaling.
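For reference, here is a minimal NumPy sketch of the standard mechanism described above; the single N × N score matrix is the source of the quadratic cost (projection matrices, multiple heads, and batching are omitted for brevity):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Standard scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (N, d). Building the N x N score matrix is
    what gives the O(N^2) cost in sequence length N.
    """
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                     # (N, N) pairwise scores
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -1e9                               # token i attends only to j <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d) context vectors
```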

Peleg's mention of "fully connected attention for every token up to current token" alongside O(N³) complexity suggests a departure from standard self-attention. He does not spell out the architecture that produces the cubic scaling, but the implication is a more exhaustive interaction between tokens during training: higher-order relationships in which each query scores combinations of earlier positions rather than individual ones, or richer aggregation functions than today's pairwise dot product. Scoring every ordered pair of preceding positions for each of the N queries, for example, would yield on the order of N³ terms.
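Peleg has not published an implementation, so any concrete realization is speculative. Purely as an illustration of how an N × N × N attention tensor could arise, the sketch below shows a hypothetical third-order variant in which each query jointly scores pairs of earlier positions; the scoring function, masking, and value aggregation are assumptions made for this sketch, not Peleg's design:

```python
import numpy as np

def triplet_attention(Q, K1, K2, V):
    """Hypothetical third-order attention: each query scores every ordered
    pair of positions, producing an N x N x N weight tensor and O(N^3) cost.
    This is an illustrative guess, not Peleg's published design.
    """
    N, d = Q.shape
    # score[i, j, k]: query i interacts jointly with keys at positions j and k
    scores = np.einsum('id,jd,kd->ijk', Q, K1, K2) / np.sqrt(d)
    # causal mask: query i may only look at positions j <= i and k <= i
    idx = np.arange(N)
    mask = (idx[None, :, None] > idx[:, None, None]) | \
           (idx[None, None, :] > idx[:, None, None])
    scores[mask] = -1e9
    # softmax jointly over the N*N candidate position pairs for each query
    flat = scores.reshape(N, -1)
    flat = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights = (flat / flat.sum(axis=-1, keepdims=True)).reshape(N, N, N)
    # aggregate values by marginalizing over the second position (one simple choice)
    return np.einsum('ijk,jd->id', weights, V)
```

Even in this toy form, the N × N × N weight tensor makes both compute and memory cubic in sequence length during training, which is consistent with Peleg's own framing of the idea as "less efficient training."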

The core of Peleg's argument rests on the idea that this increased training cost is a worthwhile investment for a significant boost in model power, especially if the inference cost remains manageable at O(N²). Such a development could lead to models capable of understanding and generating highly complex information, potentially pushing the boundaries of what large language models can achieve in various domains. The AI community will be watching to see if this theoretical advancement can be practically realized and validated.