Andrew Trask, Executive Director of OpenMined and a Senior Research Scientist at Google DeepMind, has publicly challenged the prevailing notion of "peak data" for artificial intelligence, proposing a new framework called Attribution-Based Control (ABC) to unlock a vast, currently inaccessible reservoir of private data. Trask argues that the bottleneck for advanced AI models is not data scarcity but a lack of secure access to data that already exists; unlocking that access, he estimates, could expand training datasets roughly a billion-fold.
Trask directly countered the perspective of Ilya Sutskever, former OpenAI chief scientist, who recently stated that "LLM scaling has plateaued" and described the internet as a "finite resource," akin to "the fossil fuel of AI." In a social media post, Trask asserted, "IMO — Ilya is wrong." He pointed out that while frontier LLMs are trained on approximately 200 terabytes of text, an estimated 200 zettabytes of data exist globally, "about 1 billion times more data," an amount that doubles roughly every two years.
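The billion-fold figure is simple unit arithmetic on Trask's own numbers: one zettabyte is $10^{9}$ terabytes, so

$$\frac{200\ \text{ZB}}{200\ \text{TB}} \;=\; \frac{200 \times 10^{9}\ \text{TB}}{200\ \text{TB}} \;=\; 10^{9} \;\approx\; 1\ \text{billion}.$$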
The core of Trask's argument is that "The problem is not data scarcity, it's data access." This immense volume of data, including electronic health records, financial transactions, and proprietary research, remains private and cannot be scraped for AI training. He contends that the data market currently fails because owners lose both control and value the moment they share, and that this failure keeps the data from being used at all.
To address this, Trask advocates for Attribution-Based Control (ABC), a system designed to incentivize data owners to share by letting them retain control over how their information is used and earn continuous revenue from it. Under this framework, owners can dictate which specific AI predictions their data supports and receive compensation for each use. The technical underpinnings of ABC involve model partitioning techniques such as Mixture of Experts (MoE) and Retrieval-Augmented Generation (RAG), alongside privacy-preserving infrastructure such as GPU enclaves and homomorphic encryption.
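Trask's post does not include reference code, but the attribution flow can be sketched in a few lines. The Python toy below is only an illustration under assumed names (`OwnedDocument`, `AttributionLedger`, and `answer_with_attribution` are hypothetical, not part of any OpenMined API): each private record carries its owner's usage policy and a per-use price, a RAG-style retrieval step consults only records whose policy permits the query, and every prediction credits the owners whose data informed it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OwnedDocument:
    """A private record whose owner keeps control over its use."""
    owner_id: str
    text: str
    allowed: Callable[[str], bool]  # owner-defined policy: which queries may use this record
    price_per_use: float            # compensation owed each time the record informs a prediction


@dataclass
class AttributionLedger:
    """Tracks how much each data owner is owed across predictions."""
    balances: Dict[str, float] = field(default_factory=dict)

    def credit(self, owner_id: str, amount: float) -> None:
        self.balances[owner_id] = self.balances.get(owner_id, 0.0) + amount


def retrieve(query: str, corpus: List[OwnedDocument], k: int = 2) -> List[OwnedDocument]:
    """Toy retrieval: rank permitted documents by word overlap with the query."""
    q_words = set(query.lower().split())
    permitted = [d for d in corpus if d.allowed(query)]
    scored = [(len(q_words & set(d.text.lower().split())), d) for d in permitted]
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def answer_with_attribution(query: str, corpus: List[OwnedDocument],
                            ledger: AttributionLedger) -> str:
    """RAG-style prediction: every owner whose document supports the answer is credited."""
    docs = retrieve(query, corpus)
    for d in docs:
        ledger.credit(d.owner_id, d.price_per_use)
    context = " | ".join(d.text for d in docs)
    # Stand-in for the actual model call; only the attribution flow matters here.
    return f"Answer grounded in: {context}"


if __name__ == "__main__":
    corpus = [
        OwnedDocument("hospital_a", "statin therapy lowered LDL in the 2023 cohort",
                      allowed=lambda q: "clinical" in q, price_per_use=0.05),
        OwnedDocument("bank_b", "card fraud spikes correlate with holiday travel",
                      allowed=lambda q: True, price_per_use=0.02),
    ]
    ledger = AttributionLedger()
    print(answer_with_attribution("clinical question about statin therapy", corpus, ledger))
    print(ledger.balances)  # only owners whose data was used are credited, e.g. {'hospital_a': 0.05}
```

In the full proposal, retrieval and the model call would run inside privacy-preserving infrastructure such as GPU enclaves or under homomorphic encryption, so raw records never leave the owner's custody; the ledger is the piece that turns one-off data sharing into the continuous revenue Trask describes.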
Trask's article, "Unlocking a Million Times More Data For AI," details how these technologies can create a sustainable data economy for AI. He suggests an ARPANET-style government program to integrate these components and establish ABC as an industry standard. Such an initiative, he believes, would shift AI from its current centralized model to a "network-sourced" paradigm, fostering innovation by tapping into the vast, high-quality private data currently locked away.