Google's latest video generation model, Veo 3, has demonstrated significant advances in zero-shot video reasoning, handling a wide array of tasks it was not explicitly trained to perform. The development, highlighted by AI researcher Aran Komatsuzaki, positions Veo 3 as a potential step toward generalist vision foundation models. It also introduces "Chain-of-Frames" reasoning, a visual counterpart to the "Chain-of-Thought" prompting used in large language models.
Veo 3 exhibits emergent zero-shot abilities across diverse visual tasks, including object segmentation, edge detection, image editing, understanding physical properties, and simulating tool use. It can also tackle early forms of visual reasoning such as maze and symmetry solving. According to Komatsuzaki's social media post, the model shows "broad zero-shot skills (perception → physics → manipulation → reasoning)."
"Chain-of-Frames" reasoning lets Veo 3 work through a visual problem step by step across successive frames, mirroring the step-by-step reasoning that chain-of-thought prompting elicits in language models. This contributes to the model's ability to handle complex visual scenarios and perform tasks it was not explicitly trained for. The arXiv paper detailing Veo 3's capabilities lists authors from Google and DeepMind, confirming that the work originates within Google.
Komatsuzaki noted a "Big jump Veo2 → Veo3," with improvements in areas like "edits, memory, symmetry, mazes, analogies." He also flagged limitations: "Caveat: still lags SOTA on depth/physics; cost high." Despite these shortcomings, Veo 3's emergent capabilities suggest that video models may be evolving into unified, generalist vision foundation models, playing a role akin to that of large language models in natural language processing.