Meta Superintelligence Labs has unveiled MetaEmbed, a novel framework designed to significantly enhance multimodal retrieval by optimizing the balance between embedding expressiveness and computational efficiency. The approach, detailed in an arXiv paper by Zilin Xiao and colleagues, introduces a flexible late-interaction mechanism that scales multimodal retrieval effectively at test time. This development addresses long-standing challenges in handling complex, fine-grained information in large-scale embedding models.
Current universal multimodal embedding models face a trade-off: condensing all information into a single vector limits expressiveness, while producing an excessive number of vectors makes retrieval prohibitively expensive. MetaEmbed tackles this by appending a fixed number of learnable "Meta Tokens" to input sequences during training. At test time, these tokens' contextualized representations serve as compact yet expressive multi-vector embeddings, enabling a more nuanced interaction between queries and candidates.
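To make the multi-vector idea concrete, here is a minimal sketch of late-interaction scoring in the ColBERT style (each query vector is matched against its best candidate vector, and the per-vector maxima are summed). The function name, dimensions, and random data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim late interaction: each query-side vector picks its best
    match among the candidate's vectors; the maxima are summed."""
    # Cosine similarity via L2-normalized dot products.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_vecs, num_doc_vecs)
    return float(sim.max(axis=1).sum())  # MaxSim per query vector, summed

# Toy example: 4 query-side and 8 candidate-side embedding vectors.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 64))
doc = rng.normal(size=(8, 64))
score = late_interaction_score(query, doc)
```

Compared with a single dot product between two pooled vectors, this scoring lets different vectors capture different fine-grained aspects of the input, at the cost of storing and comparing more vectors per item.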
The framework employs Matryoshka Multi-Vector Retrieval (MMR) training, which organizes information by granularity across the multiple vectors. This allows users to dynamically trade retrieval quality against efficiency by selecting how many Meta Tokens are used for indexing and interaction. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art performance. Notably, the 32B-parameter variant of MetaEmbed demonstrated a 6.6% improvement in overall MMEB accuracy compared to the best single-vector retrieval methods.
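The Matryoshka-style selection can be sketched as simply keeping a prefix of the Meta Token vectors: because training organizes the coarsest information into the earliest tokens, a prefix remains a valid (if less expressive) embedding. The token count, dimension, and data below are hypothetical:

```python
import numpy as np

def select_budget(meta_token_vecs, budget):
    """Keep only the first `budget` Meta Token vectors.
    Under Matryoshka-style training, earlier vectors carry the
    coarsest-grained information, so a prefix still works as a
    smaller, cheaper embedding of the same item."""
    return meta_token_vecs[:budget]

# Hypothetical item embedded with 16 Meta Tokens of dimension 64.
full = np.random.default_rng(1).normal(size=(16, 64))
small_index = select_budget(full, 4)    # cheap: 4 vectors per item
large_index = select_budget(full, 16)   # expressive: all 16 vectors
```

The same mechanism applies at query time, so one trained model can serve deployments with very different memory and latency budgets.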
According to the paper, MetaEmbed scales robustly across Vision-Language Model (VLM) architectures and model sizes, including models with 32 billion parameters. While scoring costs grow with larger retrieval budgets, the scoring stage does not become compute-bound except in extreme cases, and query encoding remains the dominant factor in latency. This flexibility allows MetaEmbed to be deployed efficiently by balancing index memory consumption against retrieval latency.
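As a back-of-the-envelope illustration of that balance, index memory grows linearly with the retrieval budget (the number of Meta Token vectors stored per indexed item). The corpus size, embedding dimension, and fp16 storage below are hypothetical, not figures from the paper:

```python
def index_memory_gb(num_items, budget, dim, bytes_per_value=2):
    """Rough multi-vector index size: one dim-dimensional vector per
    Meta Token kept, per indexed item (fp16 storage by default)."""
    return num_items * budget * dim * bytes_per_value / 1e9

# Hypothetical corpus of 10M items with 128-dimensional vectors:
single = index_memory_gb(10_000_000, 1, 128)   # → 2.56 GB (single-vector)
multi = index_memory_gb(10_000_000, 16, 128)   # → 40.96 GB (16 vectors)
```

This is why the ability to shrink the budget per deployment matters: keeping all Meta Tokens multiplies index memory by the token count, while a small prefix approaches single-vector cost.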
The introduction of MetaEmbed marks a significant step forward in developing more accurate, efficient, and deployable multimodal retrieval systems. By offering granular control over embedding interactions and performance, MetaEmbed paves the way for advanced AI applications that require sophisticated understanding and retrieval across diverse data types.