LeRobot, an initiative by Hugging Face, has unveiled X-VLA, a new soft-prompted Vision-Language-Action (VLA) model designed to scale across diverse robotic embodiments. Announced on social media, the model aims to unify different robots, cameras, action spaces, and environments under a single transformer backbone, addressing a long-standing challenge in generalist robotics. The release marks a step toward more adaptable and versatile robotic systems.
The X-VLA model is engineered for generalism, supporting robot platforms including Franka, WidowX, and Agibot in both simulated and real-world settings. A key innovation is its use of "soft-prompt domain IDs": tiny learnable embeddings that let the model adapt efficiently to new hardware configurations. With this mechanism in place, the unified transformer core generates actions via flow matching, providing smooth, continuous control at 50 Hz.
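To make the two ideas above concrete, here is a minimal PyTorch sketch, not the actual X-VLA code: all class, parameter, and dimension names are illustrative assumptions. It shows a per-embodiment learnable embedding prepended as soft-prompt tokens to a shared transformer, and a simple Euler integration of a learned flow-matching velocity field to produce a continuous action.

```python
# Illustrative sketch only -- NOT the real X-VLA implementation.
# Names, dimensions, and the tiny backbone are assumptions for the demo.
import torch
import torch.nn as nn


class SoftPromptedPolicy(nn.Module):
    def __init__(self, n_domains=7, d_model=64, n_prompt_tokens=4, action_dim=7):
        super().__init__()
        # One tiny learnable embedding per embodiment ("soft-prompt domain ID").
        self.domain_prompts = nn.Embedding(n_domains, n_prompt_tokens * d_model)
        self.n_prompt_tokens = n_prompt_tokens
        self.d_model = d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Flow-matching head: predicts the velocity d(action)/dt.
        self.velocity_head = nn.Linear(d_model, action_dim)
        self.action_in = nn.Linear(action_dim + 1, d_model)  # noisy action + time

    def velocity(self, obs_tokens, noisy_action, t, domain_id):
        b = obs_tokens.shape[0]
        # Expand the domain ID into a few prompt tokens and prepend them.
        prompt = self.domain_prompts(domain_id).view(b, self.n_prompt_tokens, self.d_model)
        act_tok = self.action_in(torch.cat([noisy_action, t], dim=-1)).unsqueeze(1)
        h = self.backbone(torch.cat([prompt, obs_tokens, act_tok], dim=1))
        return self.velocity_head(h[:, -1])  # read out at the action token

    @torch.no_grad()
    def sample_action(self, obs_tokens, domain_id, steps=10):
        # Integrate the learned velocity field from noise (t=0) to an action (t=1).
        b = obs_tokens.shape[0]
        a = torch.randn(b, self.velocity_head.out_features)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((b, 1), i * dt)
            a = a + dt * self.velocity(obs_tokens, a, t, domain_id)
        return a


policy = SoftPromptedPolicy()
obs = torch.randn(2, 8, 64)     # batch of 2, 8 observation tokens each
dom = torch.tensor([0, 3])      # two different embodiments in one batch
action = policy.sample_action(obs, dom)
print(action.shape)             # one continuous action vector per sample
```

The point of the sketch is that the same frozen computation graph serves every robot; only the few prompt tokens selected by `domain_id` differ between embodiments.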
X-VLA was pretrained on an extensive mixed-embodiment dataset spanning more than seven robotic platforms and a wide array of tasks. LeRobot is providing six pretrained checkpoints out of the box, allowing users to fine-tune the model on their own datasets. This approach aims to democratize access to advanced VLA capabilities, aligning with LeRobot's mission to lower the barrier to entry for robotics research and development.
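Because the per-embodiment adapter is only a small embedding, one plausible fine-tuning recipe is to freeze the shared backbone and train just a fresh domain embedding for a new robot. The sketch below illustrates that idea with a hypothetical setup; it is not LeRobot's actual fine-tuning API, and the model, loss, and data are stand-ins.

```python
# Illustrative sketch only -- a hypothetical recipe, not LeRobot's real API.
import torch
import torch.nn as nn

# Stand-in for a pretrained shared transformer core.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
# Domain table with one extra slot (index 7) reserved for the new embodiment.
domain_prompts = nn.Embedding(8, 64)

for p in backbone.parameters():   # freeze the shared core
    p.requires_grad_(False)

# Only the tiny domain embedding is optimized.
optim = torch.optim.AdamW(domain_prompts.parameters(), lr=1e-3)

new_domain = torch.tensor([7])
obs = torch.randn(2, 8, 64)       # dummy observation tokens
target = torch.randn(2, 9, 64)    # dummy regression target for the demo

prompt = domain_prompts(new_domain).unsqueeze(1).expand(2, 1, 64)
out = backbone(torch.cat([prompt, obs], dim=1))
loss = nn.functional.mse_loss(out, target)
loss.backward()
optim.step()

trainable = sum(p.numel() for p in domain_prompts.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(trainable, frozen)          # the adapter is tiny relative to the backbone
```

Under this assumption, adapting to a new robot touches only a few hundred parameters, which is what makes the "fine-tune a shared checkpoint" workflow cheap.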
Hugging Face's LeRobot project is dedicated to advancing and democratizing artificial intelligence through open source and open science, offering models, datasets, and tools for real-world robotics in PyTorch. Like other open-source VLA models such as SmolVLA, X-VLA tackles the industry-wide problems of data fragmentation and the high computational demands of large, proprietary VLA models. Its soft-prompted architecture offers a scalable way to integrate heterogeneous robotic data, paving the way for adaptable, cross-embodiment robot foundation models.