Cambridge, MA – Researchers Alex L. Zhang and Omar Khattab from MIT have introduced Recursive Language Models (RLMs), a novel inference strategy designed to overcome the limitations of large language models (LLMs) when processing extensive contexts. This new approach has demonstrated significant performance gains, notably outperforming GPT-5 by over 33% on a challenging long-context benchmark known as OOLONG.
The core innovation of RLMs lies in their ability to "agentically call an LM that works over an environment, storing prompt and context (that evolve over time)," as described by Axel Darmouni in a recent tweet. This allows a "root LM" to interact with a dynamic Python REPL environment, enabling it to decompose complex problems and recursively query sub-LMs for intermediate computations. This method effectively mitigates "context rot," a phenomenon where an LM's performance degrades with increasing input length.
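The paper's actual implementation isn't reproduced here, so the sketch below is purely illustrative: `sub_lm` is a hypothetical stand-in for a recursive model call, and the hard-coded strategy (a map/reduce over line-based chunks) is just one of the behaviors a root LM might improvise in the REPL. A minimal sketch of the loop, under those assumptions:

```python
def sub_lm(prompt: str, text: str) -> str:
    """Stub standing in for a recursive LM call (hypothetical interface).

    For demonstration it just returns the lines containing the last word
    of the prompt; a real sub-LM would answer the prompt over `text`.
    """
    keyword = prompt.split()[-1]
    return "\n".join(line for line in text.splitlines() if keyword in line)

def rlm_answer(context: str, question: str, chunk_lines: int = 10) -> str:
    # The context lives as a variable in the REPL; the root LM never feeds
    # all of it into a single model call.
    lines = context.splitlines()
    chunks = ["\n".join(lines[i:i + chunk_lines])
              for i in range(0, len(lines), chunk_lines)]
    # Map: query a sub-LM on each chunk for intermediate results.
    partials = [sub_lm(question, chunk) for chunk in chunks]
    # Reduce: aggregate the non-empty intermediate answers.
    return "\n".join(p for p in partials if p)
```

Because each sub-LM call sees only a small chunk, no single call's input grows with the total context length, which is what lets the approach sidestep context rot.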
RLMs address the critical challenge of handling unbounded input context lengths and mitigating performance degradation in long conversations or documents. By allowing the root LM to interact with a context stored as a variable, it can employ strategies such as peeking, grepping, partitioning, and summarization to efficiently process vast amounts of information. Omar Khattab highlighted this aspect: "While the results are incredible, my favorite part of Alex's post is studying what the recursive LM actually decides to do. How does it take a variable with 10M tokens and figure it out? You see strategies like peeking, map/reduce, summarization, and output accumulation."
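Concretely, these strategies reduce to ordinary string operations over a context variable. The helpers below are a hypothetical sketch (the names `peek`, `grep`, and `partition` are ours, not from the paper) of how a root LM in a REPL might inspect a huge context without ever reading all of it:

```python
import re

def peek(ctx: str, n: int = 200) -> str:
    """Look at only the first n characters to learn the document's shape."""
    return ctx[:n]

def grep(ctx: str, pattern: str) -> list[str]:
    """Pull out just the lines matching a regex instead of reading everything."""
    return [line for line in ctx.splitlines() if re.search(pattern, line)]

def partition(ctx: str, k: int) -> list[str]:
    """Split the context into roughly k equal pieces, e.g. for sub-LM calls."""
    step = max(1, len(ctx) // k)
    return [ctx[i:i + step] for i in range(0, len(ctx), step)]

# Summarization would itself be a recursive sub-LM call over each partition,
# so it is omitted from this string-only sketch.
```

Each helper touches the full context only as a Python object, so the tokens the root LM actually consumes stay small regardless of how large the variable grows.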
Beyond the OOLONG benchmark, where RLM(GPT-5-mini) more than doubled the number of correct answers relative to GPT-5, the researchers also tested RLMs on the BrowseComp-Plus benchmark. There, RLMs handled contexts exceeding 10 million tokens without significant performance degradation, a considerable improvement over traditional LLM approaches. The researchers suggest that RLMs represent a significant step forward in general-purpose inference-time scaling, building upon prior work in reasoning and agent models.