CALM Achieves Up to 44% Reduction in LLM Training FLOPs by Shifting to Continuous Vector Generation

A new research paper introduces Continuous Autoregressive Language Models (CALM), a framework designed to significantly improve the computational efficiency of large language models (LLMs). Instead of generating text token by token, CALM generates one continuous vector per step, yielding substantial compute reductions while maintaining quality. The approach promises "cheaper, faster LLMs by compressing several tokens into 1 step without losing quality," as highlighted by Rohan Paul on social media.

The core of CALM lies in a high-fidelity autoencoder that can compress a chunk of K tokens into a single continuous vector, reconstructing the original tokens with over 99.9% accuracy. This method effectively reduces the number of generative steps by a factor of K. To ensure robustness and prevent "brittle vectors," the autoencoder incorporates light regularization and dropout, creating a smooth and error-tolerant latent space, according to the paper.
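As a rough, hypothetical sketch of this idea (not the paper's actual architecture), the PyTorch code below compresses a chunk of K token embeddings into a single latent vector and decodes it back into K per-token predictions. The layer sizes, dropout rate, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Illustrative sketch: compress K tokens into one continuous vector
    and reconstruct them. Sizes and regularization choices are assumptions,
    not the paper's exact design."""

    def __init__(self, vocab_size: int, k: int = 4, d_model: int = 256, d_latent: int = 128):
        super().__init__()
        self.k = k
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_model)
        # Encoder: concatenate K token embeddings, project to one latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(k * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_latent),
        )
        # Light regularization so small perturbations of the latent still decode correctly.
        self.dropout = nn.Dropout(0.1)
        # Decoder: expand the latent back into K sets of vocabulary logits.
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_model), nn.GELU(),
            nn.Linear(d_model, k * vocab_size),
        )

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, K) -> latent: (batch, d_latent)
        emb = self.embed(token_ids).flatten(start_dim=1)
        return self.dropout(self.encoder(emb))

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, d_latent) -> logits: (batch, K, vocab_size)
        logits = self.decoder(latent)
        return logits.view(-1, self.k, self.vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(token_ids))
```

Trained with a reconstruction objective until accuracy saturates, such a module supplies the continuous targets that the autoregressive model then predicts, one vector for every K tokens.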

CALM also introduces a comprehensive likelihood-free framework for training, evaluation, and controllable sampling in the continuous domain. A lightweight energy-based head predicts the next vector in a single generative step, trained with a distance-based scoring rule rather than a likelihood objective. For evaluation, the paper proposes the BrierLM metric, which is computed from model samples alone yet closely tracks cross-entropy, without requiring explicit probabilities.
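To illustrate how a Brier-style score can be estimated from samples alone, the sketch below relies on the fact that, for two i.i.d. model samples, the probability that they collide equals the sum of squared probabilities. This is a simplified per-prediction illustration with assumed interfaces (`model_sampler` is hypothetical), not the paper's exact BrierLM definition.

```python
def brier_estimate(sample_a, sample_b, reference):
    """Unbiased single-observation estimate of the Brier score
    sum_x p(x)^2 - 2*p(reference) + 1, using only two i.i.d. model
    samples and the ground-truth token. Lower is better."""
    collision = 1.0 if sample_a == sample_b else 0.0   # estimates sum_x p(x)^2
    hit = 1.0 if sample_a == reference else 0.0        # estimates p(reference)
    return collision - 2.0 * hit + 1.0

def average_brier(model_sampler, references, num_pairs=1):
    """Average the estimator over a corpus of reference tokens.
    `model_sampler(i)` is a hypothetical callable that draws one sample
    from the model at position i."""
    total, count = 0.0, 0
    for i, ref in enumerate(references):
        for _ in range(num_pairs):
            a, b = model_sampler(i), model_sampler(i)
            total += brier_estimate(a, b, ref)
            count += 1
    return total / count
```

Because the estimator only asks the model for samples, it applies equally to implicit generative heads that never expose a probability distribution.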

For decoding control, the research presents a likelihood-free temperature method based on accepting repeated, agreeing draws, together with a batch approximation that makes lower-temperature sampling efficient. Separately, to stay anchored in text, CALM conditions each step on the last decoded tokens, passed back through a small compression module, rather than on the raw predicted vectors. The paper states that "overall, CALM shifts the performance compute frontier and adds a new knob, increase information per step not just parameters."
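One simple way to realize likelihood-free temperature control, sketched below under the assumption of integer inverse temperatures and discrete (decoded-token) outputs, is rejection sampling: draw n independent samples and accept only when they all agree, which is equivalent to sampling in proportion to p(x)^n. The sampler interface, retry limit, and fallback behavior are illustrative assumptions; the paper's batch approximation is what makes lower temperatures practical.

```python
import random

def sample_low_temperature(sampler, n: int, max_tries: int = 10_000):
    """Likelihood-free sampling at temperature 1/n (n a positive integer):
    accept a value only when n i.i.d. draws from `sampler` all agree,
    which targets the sharpened distribution proportional to p(x)^n.
    A simplified rejection sketch, not the paper's full procedure."""
    for _ in range(max_tries):
        draws = [sampler() for _ in range(n)]
        if all(d == draws[0] for d in draws):
            return draws[0]
    # Fall back to a single unsharpened draw if no agreement is found in time.
    return sampler()

# Toy usage with a hypothetical categorical sampler.
if __name__ == "__main__":
    probs = {"a": 0.6, "b": 0.3, "c": 0.1}
    def toy_sampler():
        return random.choices(list(probs), weights=list(probs.values()))[0]
    print(sample_low_temperature(toy_sampler, n=3))
```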

Experimental results demonstrate that a CALM-M model with 371 million parameters achieves comparable BrierLM scores to a 281 million parameter Transformer-S baseline, while requiring 44% fewer training FLOPs and 34% fewer inference FLOPs. This indicates a superior performance-compute trade-off, establishing next-vector prediction as a scalable pathway toward ultra-efficient language models. The full details are available in the paper "Continuous Autoregressive Language Models" on arXiv.