Visual Token Pruning Boosts Multimodal Model Speed by Up to 1.68x

New research shows that pruning image tokens in large multimodal models (LMMs) can significantly reduce latency, achieving up to a 1.68x end-to-end speedup while largely maintaining accuracy across a range of tasks. This addresses a critical efficiency challenge in LMMs, whose inference is often slowed by the sheer volume of visual tokens they must process. The findings are detailed in a new paper titled "Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models," which introduces a unified evaluation framework.

Large multimodal models process images by splitting them into patches that are encoded as hundreds or even thousands of visual tokens, many of which carry redundant information. This redundancy drives up compute, memory usage, and inference latency, hindering real-time and large-scale deployment. Visual token pruning techniques aim to mitigate this by identifying and removing the less informative tokens, streamlining the processing pipeline.
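
To make the idea concrete, here is a minimal PyTorch sketch of one common pruning recipe, not any particular method from the benchmark: assign each visual token an importance score (for example, the attention it receives from a query or [CLS] token) and keep only the top-scoring fraction. The function name, tensor shapes, and scoring inputs are illustrative assumptions.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        scores: torch.Tensor,
                        keep_ratio: float) -> torch.Tensor:
    """Keep the highest-scoring fraction of visual tokens.

    tokens: (batch, n_tokens, dim) visual token embeddings
    scores: (batch, n_tokens) per-token importance, e.g. attention
            received from a query/[CLS] token (assumed, for illustration)
    keep_ratio: fraction of tokens to retain, 0 < keep_ratio <= 1
    """
    n_keep = max(1, int(tokens.shape[1] * keep_ratio))
    # Pick the top-scoring token indices, then sort them so the
    # surviving tokens keep their original spatial order.
    top_idx = scores.topk(n_keep, dim=1).indices.sort(dim=1).values
    idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[2])
    return tokens.gather(1, idx)

# Example: prune 576 visual tokens down to 64 (an 8/9 ≈ 88.9% reduction).
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)  # stand-in for real attention-based scores
pruned = prune_visual_tokens(tokens, scores, keep_ratio=64 / 576)
print(pruned.shape)          # torch.Size([2, 64, 1024])
```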

To systematically evaluate these methods, researchers developed UniPruneBench, a unified and extensible benchmark covering multiple models, datasets, and compression algorithms. A surprising finding is that "a simple random drop baseline is often as strong as fancy methods," suggesting that current importance-based scoring strategies still have substantial room for improvement. The study also found that "there is no one best method across tasks, models, and pruning levels," and that "pruning sensitivity varies significantly across tasks, with OCR being most vulnerable."
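
For reference, the random-drop baseline is trivial to implement: keep a uniformly sampled subset of tokens with no scoring at all. A minimal sketch under the same assumed shapes as above:

```python
import torch

def random_drop(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Uniformly keep a random fraction of visual tokens, with no scoring."""
    n_tokens = tokens.shape[1]
    n_keep = max(1, int(n_tokens * keep_ratio))
    # Sample n_keep distinct positions; sort so survivors keep their order.
    idx = torch.randperm(n_tokens, device=tokens.device)[:n_keep].sort().values
    return tokens[:, idx, :]
```

That such a scoring-free baseline is often competitive suggests much of the benefit lies in how many tokens are kept rather than in which ones.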

The study demonstrated substantial efficiency gains, noting that "at extreme 88.9% pruning on a large model, total speed improves by about 1.62x to 1.68x." The acceleration stems primarily from a shorter prefill phase before text generation begins, with the overhead of scoring and dropping tokens being "tiny, under 0.5s." Furthermore, "bigger models handle pruning better at the same token budget," suggesting that larger architectures are more robust to token reduction.
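
To put the 88.9% figure in perspective, it corresponds to keeping roughly one visual token in nine (for example, 576 tokens reduced to 64). And because pruning mainly shortens prefill, the end-to-end gain is bounded by prefill's share of total latency, which is why the overall speedup is a more modest ~1.6-1.7x. The back-of-envelope numbers below are illustrative assumptions, not measurements from the paper:

```python
# 88.9% pruning keeps 1/9 of the visual tokens, e.g. 576 -> 64.
n_visual = 576
n_kept = round(n_visual * (1 - 0.889))   # -> 64 tokens retained

# Amdahl-style bound: if prefill is a fraction p of total latency and
# pruning speeds prefill up by factor s, the total speedup is
#   1 / ((1 - p) + p / s).
p, s = 0.5, 6.0                          # assumed, for illustration only
total_speedup = 1 / ((1 - p) + p / s)
print(n_kept, round(total_speedup, 2))   # 64 1.71
```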

These insights are crucial for advancing the efficiency of LMMs, enabling faster and more scalable applications without compromising essential capabilities. While instruction-following tasks remained stable under pruning, tasks that require reading text in images (OCR) proved more susceptible to accuracy degradation. The benchmark is expected to serve as a vital tool for future research into more effective and universally applicable visual token compression methods.