ViPE Spatial AI Tool Processes Videos at 3-5 FPS, Released by NVIDIA and Academic Partners

ViPE (Video Pose Engine), a new spatial artificial intelligence tool, has been released. It is designed to accurately recover camera motion, camera intrinsics, and dense metric depth from a wide range of video types. Developed by a collaborative team including NVIDIA Research Scientist Jiahui Huang and researchers from the University of Toronto, Vector Institute, and UT Austin, ViPE aims to advance 3D geometric perception in spatial AI systems.

The engine processes unconstrained raw videos, estimating camera intrinsics, camera motion, and dense, near-metric depth maps. According to a recent announcement by Jiahui Huang, ViPE runs at "3–5 FPS" on a single GPU. It handles diverse and challenging content, including cinematic shots, dashcam footage, dynamic selfie videos, and even 360° panoramas, and supports multiple camera models such as pinhole and wide-angle lenses.
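To make the "camera intrinsics" estimate concrete: under the pinhole model mentioned above, intrinsics are typically summarized by a 3×3 matrix that maps 3D points in camera coordinates to pixels. The sketch below is a generic illustration with made-up values, not ViPE's API.

```python
import numpy as np

# Illustrative pinhole intrinsics (values are invented for the example):
# fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project(K, point_cam):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    uvw = K @ point_cam          # homogeneous pixel coordinates
    return uvw[:2] / uvw[2]     # perspective division

# A point 2 m in front of the camera, slightly right of and above center.
pixel = project(K, np.array([0.2, -0.1, 2.0]))
# pixel == [370.0, 215.0]
```

Recovering K (along with pose and depth) from raw, uncalibrated footage is exactly what a tool like ViPE automates.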

ViPE's robustness is highlighted by its ability to outperform existing uncalibrated pose-estimation baselines, with reported improvements of 18% on the TUM sequences and 50% on the KITTI sequences. Its core strength lies in bridging classical bundle adjustment frameworks and modern learning-based approaches, offering both scalability and precision.
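For readers unfamiliar with bundle adjustment: it jointly refines camera poses and 3D points by minimizing reprojection error, i.e. the pixel distance between where a point is observed and where the current pose and intrinsics predict it. A minimal sketch of that residual (toy values, not ViPE's actual solver) might look like:

```python
import numpy as np

def reprojection_residual(K, R, t, X, uv_observed):
    """The quantity bundle adjustment minimizes: observed pixel minus
    the projection of 3D point X under pose (R, t) and intrinsics K."""
    X_cam = R @ X + t            # world frame -> camera frame
    uvw = K @ X_cam              # camera frame -> homogeneous pixels
    uv_pred = uvw[:2] / uvw[2]  # perspective division
    return uv_observed - uv_pred

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # identity rotation (toy pose)
t = np.zeros(3)                  # zero translation
X = np.array([0.0, 0.0, 4.0])    # a point 4 m ahead on the optical axis
r = reprojection_residual(K, R, t, X, np.array([320.0, 240.0]))
# A perfect observation yields a zero residual: r == [0.0, 0.0]
```

A classical pipeline feeds many such residuals into a nonlinear least-squares solver; learning-based methods instead predict depth or pose directly. ViPE's contribution, per the announcement, is combining the two.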

A significant application of ViPE is annotating large-scale video datasets. The tool has been used to annotate approximately 96 million frames, drawn from roughly 100,000 real-world internet videos, 1 million high-quality AI-generated videos, and 2,000 panoramic videos, all with accurate camera poses and dense depth maps. This annotated dataset is being open-sourced to accelerate further development in spatial AI.

The release of ViPE as an open-source tool, with its code available on GitHub, underscores a commitment to fostering innovation in the field. This development is expected to have a substantial impact on various downstream applications, including robotics, augmented and virtual reality, autonomous systems, and advanced video generation.