Manifest AI Unveils 'Power Retention' Architecture, Promising 100x Faster AI Inference and Cost Reduction


New York, NY – Manifest AI, an independent research lab founded by former OpenAI and Google Brain researchers Jacob Buckman and Carles Gelada, has introduced a new AI architecture called "power retention." The architecture aims to solve the critical "context length" problem in large language models, promising significantly faster and more cost-efficient AI training and inference. The company, established in 2023, has raised $5.5 million in venture capital funding.

The core issue Manifest AI addresses is a limitation of transformer models, which struggle to process and retain vast amounts of information over long contexts. As Carles Gelada explained, “They say, ‘We're going to throw out everything in my memory that's very old... But it should not be the human deciding what goes in it and what goes out of it.’” Existing transformer architectures, which underpin models like ChatGPT, Copilot, and Gemini, become slower and more expensive as context length increases because they maintain a key-value (KV) cache that grows with every token processed.
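To see why that matters, consider a rough estimate of the KV cache a transformer must keep around at inference time. The sketch below uses illustrative model dimensions, not Manifest AI's or any specific model's, but the trend is the point: the cache grows linearly with context length, and so does the cost of serving each request.

```python
# Back-of-the-envelope estimate of transformer KV-cache growth.
# All dimensions below are illustrative placeholders, not any specific model.

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Memory for keys and values across all layers (fp16, one sequence)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

for ctx in (4_096, 64_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens -> ~{gib:,.1f} GiB of KV cache per sequence")
```

In this hypothetical configuration, a million-token context needs well over 100 GiB of cache for a single sequence, more than the memory of most individual GPUs, which is why long contexts translate directly into higher serving costs.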

Manifest AI's "power retention" architecture offers a fundamental shift by letting the model itself decide what information to retain, rather than discarding old tokens once a fixed context window fills up. The approach keeps memory at a fixed size regardless of context length, which simplifies inference infrastructure and dramatically reduces costs. Jacob Buckman highlighted the potential impact, stating, "Power retention is going to be a complete game changer." The company claims speedups of over 10x during training and more than 100x during inference at context lengths of 64k tokens, with gains increasing at even longer contexts.
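The announcement does not spell out the architecture's equations, but the general shape of a fixed-state sequence model can be sketched. The snippet below is an illustrative recurrence of the broad family such models belong to, not the company's actual "power retention" formulation: each token is folded into a constant-size state matrix, so per-token compute and memory stay flat no matter how long the context grows.

```python
# Minimal sketch of a fixed-state sequence model (illustrative only; this is
# not Manifest AI's published "power retention" math). Instead of caching every
# past key and value, each token is folded into a constant-size state matrix.
import numpy as np

d = 64                    # illustrative head dimension
state = np.zeros((d, d))  # fixed-size memory, independent of context length

def step(state, key, value, query, decay=0.99):
    """Process one token: update the state, then read the output from it."""
    state = decay * state + np.outer(key, value)  # O(d^2) work, constant per token
    output = query @ state                        # read-out, also O(d^2)
    return state, output

rng = np.random.default_rng(0)
for _ in range(10_000):   # memory use does not grow with sequence length
    k, v, q = rng.standard_normal((3, d))
    state, out = step(state, k, v, q)
```

Because the state never grows, the work per new token is the same at the millionth token as at the first, which is the source of the claimed inference savings at long context lengths.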

To facilitate adoption, Manifest AI has open-sourced its "power retention" technology, including PowerCoder-3B, a coding model that can process and generate longer pieces of code. The company also released Vidrial, a framework for writing efficient CUDA kernels, which it reports makes FlashAttention2 up to 20% faster. This open-source strategy aims to foster community trust and accelerate the integration of the architecture across the AI landscape, potentially converting existing transformer models through a process the company terms "metamorphosis."