ZhipuAI has officially unveiled GLM-4.5V, a new open-source visual reasoning model that achieves state-of-the-art performance among open-source models in its size class, leading across 42 public vision-language benchmarks and marking a significant advance in multimodal AI capabilities. The release, announced by Z.ai on social media, makes the model accessible to the broader AI community.
GLM-4.5V is built on ZhipuAI's flagship text foundation model, GLM-4.5-Air. It adopts a 106-billion-parameter Mixture-of-Experts (MoE) architecture in which only 12 billion parameters are active during inference. The model inherits and refines proven techniques from its predecessor, GLM-4.1V-Thinking, with a focus on reasoning capabilities that go beyond basic multimodal perception.
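To make the distinction between total and active parameters concrete, the toy sketch below shows top-k expert routing, the mechanism that lets an MoE model hold 106B parameters while activating only about 12B per token. The expert count, hidden size, and routing scheme here are placeholders for illustration and do not reflect GLM-4.5V's actual configuration.

```python
# Illustrative Mixture-of-Experts routing sketch (NOT GLM-4.5V's real config).
# Only the top-k experts chosen by the router run for each token, which is why
# a model's active parameter count can be far smaller than its total size.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # placeholder expert count
TOP_K = 2       # experts activated per token
D_MODEL = 16    # placeholder hidden size

router_w = rng.normal(size=(D_MODEL, N_EXPERTS))          # routing weights
expert_w = rng.normal(size=(N_EXPERTS, D_MODEL, D_MODEL))  # one matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model). Each token passes through its top-k experts only."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax over experts
    chosen = np.argsort(-probs, axis=-1)[:, :TOP_K]         # top-k experts per token

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            out[t] += probs[t, e] * (x[t] @ expert_w[e])    # weighted expert outputs
    return out

tokens = rng.normal(size=(4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 16); only 2 of 8 experts ran per token
```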
GLM-4.5V is designed for real-world usability, offering a comprehensive suite of visual reasoning functionalities. Its capabilities include advanced image reasoning for scene understanding and multi-image analysis, video understanding for long-video segmentation and event recognition, and GUI agent tasks such as screen reading, icon recognition, and desktop operation assistance. It also handles complex chart and long-document parsing, as well as precise visual element localization (grounding).
A notable feature of GLM-4.5V is its "Thinking Mode" switch, which lets users trade off rapid responses against deeper, more deliberate reasoning depending on the task. This flexibility broadens its applicability across scenarios ranging from web page coding and intelligent agents to subject problem-solving. The open-source nature of GLM-4.5V is expected to accelerate innovation and development across the vision-language model landscape.
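As a rough illustration of how such a switch is typically exposed, the sketch below sends a multimodal request through an OpenAI-compatible chat client with a thinking toggle in the request body. The base URL, model identifier, and the exact `thinking` parameter name are assumptions based on Z.ai's published GLM API conventions; consult the official documentation for the authoritative request format.

```python
# Hedged sketch: toggling a "thinking" switch via an OpenAI-compatible chat API.
# Endpoint, model name, and the `thinking` field are assumed, not confirmed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4/",  # assumed Z.ai endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize the trend shown in this chart."},
        ],
    }],
    # Assumed switch: "enabled" for deep reasoning, "disabled" for fast replies.
    extra_body={"thinking": {"type": "enabled"}},
)
print(response.choices[0].message.content)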
The model is now available on Hugging Face and GitHub, with API access provided through Z.ai's platform. An online demo and a desktop assistant for macOS are also offered, facilitating easy access and experimentation for developers and researchers. This open release positions GLM-4.5V as a strong contender in the competitive field of open-source vision-language models, offering a robust foundation for future AI applications.
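For readers who want to try the open weights locally, the following is a minimal sketch of loading the model with Hugging Face transformers. The repository id and the use of the generic AutoModelForImageTextToText mapping are assumptions; the model card on Hugging Face documents the exact supported snippet and hardware requirements.

```python
# Hedged sketch: loading the open weights with transformers (repo id assumed).
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "zai-org/GLM-4.5V"  # assumed Hugging Face repo id; verify on the model card

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/screenshot.png"},
        {"type": "text", "text": "Locate the 'Submit' button and describe its position."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```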