OpenBMB's MiniCPM-V4.5 Surpasses GPT-4o and Gemini 2.0 Pro in Vision-Language Benchmarks

OpenBMB, a prominent open-source community, has announced the release of MiniCPM-V4.5, an 8-billion-parameter multimodal large language model (MLLM) that reportedly outperforms leading proprietary models such as GPT-4o-latest and Gemini 2.0 Pro on vision-language tasks. The announcement, made via a tweet from OpenBMB, highlighted the model's efficiency and advanced capabilities, including its handling of high-resolution images and high-frame-rate videos.

"Anycoder is impressive! Just two lines of prompt and we have a working image and text interaction MiniCPM-V4.5 demo! Huge shoutout to @AK for the amazing work. More details about MiniCPM-V4.5 👉 https://t.co/JW7zPXOCqh MiniCPM-V 4.5 is an 8B model that beats GPT-4o and Gemini 2.0 Pro in vision-language tasks, and handles high-res images and high-FPS videos with unmatched efficiency."

The tweet emphasized ease of use, citing a "two lines of prompt" demonstration with "Anycoder" for image and text interaction. MiniCPM-V4.5, built on Qwen3-8B and SigLIP2-400M, achieved an average score of 77.0 on OpenCompass, a comprehensive evaluation spanning eight popular benchmarks. According to OpenBMB, this result makes it the most capable MLLM under 30 billion parameters in the open-source community.

Key technical advancements in MiniCPM-V4.5 include a unified 3D-Resampler, which achieves a 96x compression rate for video tokens, allowing the model to process more video frames without increasing LLM inference costs. This innovation enables state-of-the-art high-frame-rate (up to 10 FPS) and long-video understanding. Additionally, the model features controllable hybrid "fast/deep thinking" modes, balancing efficiency and performance across user scenarios.
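The 96x figure can be illustrated with back-of-envelope token arithmetic. The numbers below are assumptions chosen for the sketch (typical ViT patch counts and a plausible group size), not OpenBMB's published internals; the point is only how joint resampling of several frames shrinks the token stream the LLM must attend over.

```python
# Illustrative token arithmetic for 3D-resampler-style video compression.
# All three constants are assumptions for this sketch, not confirmed
# MiniCPM-V4.5 internals.
patch_tokens = 1024      # assumed raw patch tokens a vision encoder emits per frame
frames_per_group = 6     # assumed number of frames resampled jointly
resampled_tokens = 64    # assumed tokens per group after 3D resampling

raw = patch_tokens * frames_per_group   # tokens without compression: 6144
compressed = resampled_tokens           # tokens the LLM actually sees: 64
ratio = raw / compressed                # 96.0

print(f"{ratio:.0f}x compression")      # → 96x compression
```

Because LLM inference cost grows with sequence length, holding the per-group token count fixed while adding frames is what lets frame rate rise without a matching rise in compute.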

OpenBMB, a non-profit open-source community dedicated to building large-scale AI models, has consistently aimed to democratize access to advanced AI technologies. Their MiniCPM series focuses on creating efficient, deployable models suitable for edge devices such as smartphones and iPads. MiniCPM-V4.5 extends these capabilities with strong Optical Character Recognition (OCR) and document parsing, processing images up to 1.8 million pixels with high accuracy, surpassing even GPT-4o-latest and Gemini 2.5 on OCRBench.
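A pixel budget like the ~1.8-million-pixel figure above is typically respected by proportionally downscaling oversized inputs before encoding. The sketch below shows that generic client-side step under stated assumptions: `fit_resolution` and `MAX_PIXELS` are hypothetical names for illustration, not part of OpenBMB's API, and the model's actual preprocessing may differ.

```python
import math

# Illustrative budget taken from the ~1.8M-pixel figure; not an official constant.
MAX_PIXELS = 1_800_000

def fit_resolution(w: int, h: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Scale (w, h) down proportionally so that w * h <= max_pixels.

    Images already within budget are returned unchanged; larger ones are
    shrunk by a single uniform factor, preserving aspect ratio.
    """
    if w * h <= max_pixels:
        return w, h
    scale = math.sqrt(max_pixels / (w * h))
    return int(w * scale), int(h * scale)

# A 12-megapixel photo gets scaled down; a small image passes through untouched.
print(fit_resolution(4000, 3000))
print(fit_resolution(800, 600))
```

Keeping the aspect ratio intact matters for OCR and document parsing, since anisotropic resizing distorts glyph shapes and line spacing.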

The model's design also emphasizes trustworthiness: evaluations show it outperforming GPT-4o-latest on MMHal-Bench, a benchmark measuring hallucination. It supports over 30 languages and offers flexible deployment options, including efficient CPU inference via llama.cpp and Ollama as well as optimized local iOS applications. This release marks a notable step toward making powerful multimodal AI accessible and efficient across a broader range of applications and devices.