Alibaba's Ovis-U1 Achieves 69.6 on OpenCompass Multimodal Benchmark with Unified AI Approach

Alibaba Group's Ovis Team has unveiled Ovis-U1, a 3-billion-parameter unified artificial intelligence model that integrates multimodal understanding, text-to-image generation, and image editing. The technical report for Ovis-U1, shared publicly in a tweet by AK, highlights a unified training approach that starts from a language model and trains the integrated tasks together so that each improves the others.

Ovis-U1 scores 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models in its parameter class such as Ristretto-3B and SAIL-VL-1.5-2B. The result supports the team's unified training strategy, which aims for a single model that is fluent across tasks rather than a bundle of separate tools. For text-to-image generation, the model scored 83.72 on DPG-Bench and 0.89 on GenEval; for image editing, it achieved 4.00 on ImgEdit-Bench and 6.42 on GEdit-Bench-EN.

The model's architecture features a diffusion-based visual decoder and a bidirectional token refiner, both designed to improve the interaction between text and visual information. This allows Ovis-U1 to handle a range of complex tasks: comprehending intricate scenes, generating high-quality images from text prompts, and performing precise edits on existing images. Its compact 3-billion-parameter size makes it potentially more efficient and easier to deploy than larger models.

Alibaba has been aggressively expanding its AI portfolio, and Ovis-U1 joins a lineup that also includes Alibaba Cloud's Qwen series of models. The company plans to spend over $52 billion on AI infrastructure over the next three years, reinforcing its ambition to lead in the global AI race. This strategic focus includes open-sourcing over 200 generative AI models, with the aim of democratizing access and accelerating AI development within the community.

The release of Ovis-U1 aligns with a broader industry trend towards multimodal AI, which processes and understands information from various sources like text, images, and audio simultaneously. Experts believe such unified models are crucial for enhancing human-AI interaction, improving data analysis, automating content creation, and advancing robotics. By open-sourcing Ovis-U1 under the Apache 2.0 license, Alibaba aims to foster collaborative research and accelerate the development of general-purpose AI systems.