DeepSeek-OCR's Foundational Research Emphasizes Contexts Optical Compression

A recent social media post by Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) highlighted what they termed "The direct precursor paper to DeepSeek-OCR," drawing attention to the underlying research behind the advanced Optical Character Recognition (OCR) model. The tweet, which included a link to the paper, underscores the innovative approach taken by DeepSeek AI in developing its OCR technology.

DeepSeek-OCR, an open-source model developed by DeepSeek AI, is designed to investigate the role of vision encoders from an LLM-centric viewpoint, specifically focusing on "Contexts Optical Compression." The technique compresses text-heavy documents by representing them as images, so that large language models (LLMs) can fit significantly longer contexts into the same token budget. The model converts documents into structured text formats such as Markdown, preserving tables and headings, rather than emitting only raw character streams.

The core idea behind DeepSeek-OCR is to render text as an image and encode it into a small number of vision tokens, which can take fewer tokens, and therefore less compute, than feeding the same text to the model directly. According to DeepSeek's technical paper, the system can compress text by up to a factor of ten while still decoding it with about 97 percent precision. This approach is seen as a potential breakthrough for extending the effective context window of AI models, particularly for tasks involving long documents such as research papers.
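
To make the arithmetic concrete, the compression ratio is simply the number of text tokens a page would need divided by the number of vision tokens used to represent it. The short Python sketch below illustrates that calculation only; the function names and the token counts are hypothetical and are not taken from DeepSeek's code or paper.

```python
import math


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if text_tokens < 0 or vision_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and vision_tokens must be > 0")
    return text_tokens / vision_tokens


def vision_tokens_needed(text_tokens: int, ratio: float) -> int:
    """Return the vision tokens required to hold text_tokens at a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be > 0")
    return math.ceil(text_tokens / ratio)


# Hypothetical page: 1,000 text tokens encoded as 100 vision tokens.
print(compression_ratio(1000, 100))      # 10.0

# Hypothetical document: 50,000 text tokens at a tenfold ratio.
print(vision_tokens_needed(50_000, 10))  # 5000
```

At a tenfold ratio, a document that would otherwise occupy 50,000 tokens of context would occupy about 5,000, which is why the result matters for long inputs.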

DeepSeek-OCR leverages DeepSeek-VL2, a series of Mixture-of-Experts (MoE) vision-language models. DeepSeek-VL2 uses a dynamic tiling vision encoding strategy for high-resolution images and has been trained on diverse datasets, including those focused on OCR, charts, and knowledge-based content. This foundation helps DeepSeek-OCR extract text accurately and quickly from a range of inputs, including scanned documents, images, and PDFs.
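
As a rough illustration of what dynamic tiling means, the Python sketch below picks a grid of fixed-size tiles whose overall shape best matches an image's aspect ratio, so that little of the tiled canvas is wasted on padding. It is a simplified sketch, not DeepSeek's implementation: the function name, the 384-pixel tile size, the nine-tile cap, and the tie-breaking rule are all assumptions made for the example.

```python
from fractions import Fraction


def choose_tile_grid(width: int, height: int, tile: int = 384, max_tiles: int = 9):
    """Pick a (cols, rows) tile grid that minimizes padding for an image.

    The image is scaled to fit inside the grid's canvas while keeping its
    aspect ratio; whatever canvas area is left over counts as padding.
    Ties are broken in favor of the grid with fewer tiles.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            canvas_w, canvas_h = cols * tile, rows * tile
            # Exact rational arithmetic avoids floating-point ties.
            scale = min(Fraction(canvas_w, width), Fraction(canvas_h, height))
            used_area = width * height * scale * scale
            padding = canvas_w * canvas_h - used_area
            key = (padding, cols * rows)
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


# A wide 4:1 scan fits a 4x1 grid with no padding.
print(choose_tile_grid(1536, 384))   # (4, 1)

# A square image needs only a single tile under this tie-breaking rule.
print(choose_tile_grid(1000, 1000))  # (1, 1)
```

The point of adapting the grid to the page is that a tall receipt and a wide spreadsheet each get tiles laid out to match their shape instead of being squashed into one fixed square.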

The release of DeepSeek-OCR, available on platforms like Hugging Face and GitHub, continues DeepSeek AI's contributions to the open-source AI community. Its permissive licensing, Markdown-first exports, and GPU-friendly presets lower the barrier to structured data extraction, making advanced document AI more accessible to developers and researchers.