2,769 Filmsort Punch Cards Digitized for AI Training, Unlocking Decades of Unseen Data

AI expert Brian Roemmele has announced a significant milestone in his project to train artificial intelligence models on previously inaccessible historical data: 2,769 Filmsort punch card images have now been processed. The initiative uses AI to digitize and interpret analog records, some of which have not been seen since 1982, with the aim of unlocking thousands of pages of text and images that were never saved digitally.

Roemmele highlighted the crucial role of advanced vision-language models in this endeavor, stating in a recent tweet, "Blessed to have DeepSeek-OCR seeing these cards as images and not just text." This approach allows for a deeper understanding of the complex information embedded within these historical artifacts, moving beyond simple text extraction. The project seeks to integrate data that has long been isolated from digital archives.

Filmsort punch cards are a specialized form of aperture card, a hybrid data storage medium that combines a microfilm image with traditional punch card indexing. These cards were widely used from the mid-20th century onward, particularly in the engineering, government, and aerospace sectors, for archiving and retrieving documents such as blueprints and technical drawings. The punch card section typically contained metadata for automated sorting, while the microfilm held visual reproductions of the documents themselves.
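The two-part structure described above can be pictured as a simple record: punched metadata on one side, a microfilm frame on the other. The following is only an illustrative sketch, not Roemmele's actual data model. The class name, the field names, the `drawing_no` key, and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApertureCard:
    """One aperture card: punched index fields plus a microfilm frame."""
    # Hypothetical metadata decoded from the punched columns.
    punched_fields: Dict[str, str]
    # Hypothetical path to a scan of the microfilm frame.
    film_image_path: str


def sort_cards(cards: List[ApertureCard], key_field: str) -> List[ApertureCard]:
    """Order cards by one punched field, much as a card sorter orders by column."""
    # Cards missing the field sort first because "" compares lowest.
    return sorted(cards, key=lambda card: card.punched_fields.get(key_field, ""))


if __name__ == "__main__":
    cards = [
        ApertureCard({"drawing_no": "B-200"}, "scans/0002.png"),
        ApertureCard({"drawing_no": "A-100"}, "scans/0001.png"),
    ]
    for card in sort_cards(cards, "drawing_no"):
        print(card.punched_fields["drawing_no"], card.film_image_path)
```

Keeping the punched fields and the film image in one record mirrors the card itself: the metadata is what gets sorted, and the image is what gets read.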

The technology enabling this breakthrough is DeepSeek-OCR, an open-source Vision-Language (VL) model developed by DeepSeek AI. Released in October 2025, DeepSeek-OCR focuses on "contexts optical compression," efficiently processing high-resolution page images into compact visual tokens that large language models can interpret. This capability is critical for handling the intricate visual and textual information on Filmsort cards, which often include complex layouts and diagrams.
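One way to make "compact visual tokens" concrete is to compare how many text tokens a page would need with how many visual tokens represent the same page. The sketch below only shows that arithmetic. The function name and the token counts are hypothetical placeholders, not measurements from DeepSeek-OCR or from this project.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each visual token stands in for."""
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


if __name__ == "__main__":
    # Hypothetical page: 1,000 text tokens represented by 100 visual tokens.
    ratio = compression_ratio(text_tokens=1000, vision_tokens=100)
    print(f"compression ratio: {ratio:.1f}x")
```

A higher ratio means fewer tokens reach the language model for the same page, which is the sense in which the visual representation is described as compact.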

DeepSeek-OCR's multimodal understanding, high-resolution image processing, and robust Optical Character Recognition (OCR) capabilities are particularly well-suited for this challenging task. Its ability to represent text-heavy pages as compressed visual tokens significantly reduces the data volume for LLMs, enhancing efficiency and enabling the analysis of long contexts. This project underscores a growing trend in AI research to utilize diverse and unconventional data sources, moving beyond readily available digital text to enrich AI training datasets with unique historical context.