Hangzhou, China – DeepSeek AI has unveiled DeepSeek-OCR, an innovative open-source model that processes scanned pages and visual artifacts directly, marking a significant departure from traditional AI training methodologies. This approach contrasts sharply with leading Western AI labs like OpenAI, Anthropic, and Google DeepMind, which primarily rely on pre-digitized text. The new system, capable of generating over 200,000 pages of training data daily on a single A100-40G GPU, promises to enhance AI's understanding of human knowledge by "perceiving knowledge" rather than merely "parsing words."
DeepSeek-OCR's core innovation lies in its direct pixel-to-representation training, allowing it to ingest visual art, manuscripts, and microfilm archives. This enables access to non-digitized cultural memory and provides cleaner, curated data, bypassing the "Internet sludge" problem associated with web-scraped datasets. According to a recent social media post by Ajay Patel, "Western models read transcriptions of knowledge. DeepSeek is trying to read the artifacts themselves." This method allows the model to fuse layout, color, structure, and text into a multimodal reasoning foundation.
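To make "direct pixel-to-representation training" concrete, here is a minimal, hypothetical sketch in PyTorch. The class name, dimensions, and pooling choice are illustrative assumptions, not DeepSeek's actual architecture; the point is only that a scanned page enters as raw pixels and exits as a short sequence of learned tokens, with no intermediate text-extraction step that would discard layout, color, or structure.

```python
import torch
import torch.nn as nn

class PixelToTokens(nn.Module):
    """Illustrative pixel-to-token encoder (not DeepSeek's DeepEncoder)."""
    def __init__(self, patch=16, dim=256, n_vision_tokens=64):
        super().__init__()
        # ViT-style patch embedding: one conv turns 16x16 pixel patches
        # into feature vectors, straight from the raw image.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Pool the patch sequence down to a small, fixed token budget.
        self.pool = nn.AdaptiveAvgPool1d(n_vision_tokens)

    def forward(self, page):            # page: (B, 3, H, W) raw pixels
        x = self.patchify(page)         # (B, dim, H/16, W/16)
        x = x.flatten(2)                # (B, dim, n_patches)
        x = self.pool(x)                # (B, dim, n_vision_tokens)
        return x.transpose(1, 2)        # (B, n_vision_tokens, dim)

encoder = PixelToTokens()
page = torch.randn(1, 3, 1024, 768)     # one scanned page, pixels only
vision_tokens = encoder(page)
print(vision_tokens.shape)              # torch.Size([1, 64, 256])
```

In a full system, a language decoder would condition on those vision tokens to emit text, so whatever visual signal the encoder preserves remains available for reasoning.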
In contrast, models such as OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini primarily build their foundation models on digital text corpora. Their training datasets typically include Common Crawl, Wikipedia, digitized books from public domain sources like Project Gutenberg, and licensed text and code from platforms like GitHub. These sources are "already digitized and converted to text," as Patel noted, meaning these models work with the textual representation rather than the raw visual information of documents.
DeepSeek-OCR achieves remarkable efficiency through its "optical compression" technique, representing a page's text with up to ten times fewer tokens while maintaining 97% accuracy. The system pairs a 380 million-parameter DeepEncoder, which converts page images into a compact sequence of vision tokens, with a text decoder built on DeepSeek3B-MoE that activates roughly 570 million parameters per token. Because the decoder attends over far fewer tokens, the model can process long documents with significantly less computing power, addressing a critical challenge for large language models dealing with long-form content.
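A back-of-the-envelope calculation shows why this matters. The per-page token count below is an assumption for illustration; only the up-to-10x ratio and the 97% figure come from the reported results:

```python
# Illustrative arithmetic for "optical compression".
pages = 100
text_tokens_per_page = 1000          # assumed tokenizer output per page
compression = 10                     # reported up-to-10x ratio

text_context = pages * text_tokens_per_page       # 100,000 text tokens
vision_context = text_context // compression      #  10,000 vision tokens
print(f"text: {text_context:,} tokens -> vision: {vision_context:,} tokens")
# At this ratio the reported decoding accuracy is 97%, i.e. the original
# text is recovered nearly losslessly from the compressed representation.
```

Since transformer attention cost grows quadratically with sequence length, shrinking a 100,000-token document to 10,000 vision tokens yields savings well beyond the 10x reduction in the tokens themselves.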
The philosophical implication of DeepSeek's approach is profound. While ChatGPT-type LLMs function as "linguistic simulators" modeling how humans use language, DeepSeek's OCR-style pipeline aims for "cognitive reconstruction," modeling how humans encode and organize meaning across various media. If this approach scales, future AI models may not require traditional tokenizers or word-based context windows, instead training on "scanned reality" to merge textual and visual intelligence into a unified model of human knowledge.