Hugging Face has officially launched FineVision, a new open-source multimodal dataset designed to significantly advance the training of Vision-Language Models (VLMs). Announced on September 6, 2025, the dataset is available on the Hugging Face Hub, and the release aims to democratize access to high-quality VLM training data. This initiative addresses the industry's reliance on proprietary datasets, fostering greater transparency and reproducibility in AI research.
FineVision stands out as one of the largest publicly available VLM datasets, encompassing 17.3 million images and 24.3 million samples. Aggregating data from over 200 sources, the dataset also includes 88.9 million question-answer turns and nearly 10 billion answer tokens. This massive 5 TB curated resource spans nine distinct categories, including General VQA, OCR QA, and GUI navigation.
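To put these headline figures in proportion, the short Python sketch below derives a few per-unit averages from them. It is a back-of-the-envelope calculation, not an official statistic: it assumes "nearly 10 billion" answer tokens can be rounded up to 10 billion, reads 5 TB as decimal terabytes, and the function name `per_unit_stats` is purely illustrative.

```python
# Headline figures as quoted in the article (all rounded).
IMAGES = 17.3e6          # 17.3 million images
SAMPLES = 24.3e6         # 24.3 million samples
QA_TURNS = 88.9e6        # 88.9 million question-answer turns
ANSWER_TOKENS = 10e9     # assumption: "nearly 10 billion" rounded up to 10 billion
SIZE_BYTES = 5e12        # assumption: 5 TB read as decimal terabytes


def per_unit_stats():
    """Return rough per-unit averages derived from the quoted totals."""
    return {
        "samples_per_image": SAMPLES / IMAGES,
        "turns_per_sample": QA_TURNS / SAMPLES,
        "answer_tokens_per_turn": ANSWER_TOKENS / QA_TURNS,
        # Upper bound per image: the total size also covers text and metadata.
        "kilobytes_per_image": SIZE_BYTES / IMAGES / 1e3,
    }


if __name__ == "__main__":
    for name, value in per_unit_stats().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, the totals work out to roughly 1.4 samples per image, 3.7 question-answer turns per sample, and a little over 110 answer tokens per turn.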
The dataset promises substantial performance improvements for VLMs: models trained on FineVision show gains of up to 46.3% over those trained on LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian across 11 widely used benchmarks. This leap in performance is attributed to FineVision's meticulous curation and vast scale. The open-source nature of the dataset is expected to accelerate innovation across the AI community.
FineVision introduces data for emerging VLM skill domains such as GUI navigation, pointing, and counting, expanding the capabilities of next-generation models. The dataset boasts exceptionally low data leakage, with only a 1% overlap with benchmark test sets, ensuring reliable evaluation performance. This rigorous quality control and comprehensive coverage set a new standard for open multimodal datasets.
Clement Delangue, CEO of Hugging Face, underscored the company's commitment to open science in a social media post, stating, "We’re doing the work that nobody else wants to do! Welcome to FineVision, the best free open dataset to train vision language models. Let’s go open-source!" This release reinforces Hugging Face's mission to make advanced AI accessible, providing researchers and developers with a robust foundation for building state-of-the-art VLMs. The availability of FineVision on the Hugging Face Hub facilitates immediate integration into ongoing research and development efforts.