Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Helix is a groundbreaking Vision-Language-Action (VLA) model developed by Figure AI, a robotics startup focused on building commercially viable autonomous humanoid robots. Announced in February 2025, Helix represents a major advancement in embodied AI by integrating visual perception, natural language understanding, and precise motor control into a single, generalist neural network. It enables humanoid robots (like Figure's 02 and later 03 models) to perform complex, dexterous tasks in unstructured real-world environments—such as homes or warehouses—using simple natural language commands, without task-specific programming or extensive fine-tuning.
Key Innovations and Firsts
Helix stands out for several pioneering features:
- Full Upper-Body Control: It's the first VLA model to output high-rate (up to 200 Hz) continuous control over a humanoid's entire upper body, including 35+ degrees of freedom (DoF) for arms, wrists, torso, head, and individual fingers. This allows precise, human-like manipulation of objects.
- Zero-Shot Generalization: Robots can handle thousands of novel objects they've never seen before, grasping and manipulating them based on language prompts (e.g., "Pick up the desert item" correctly identifies and grabs a toy cactus).
- Multi-Robot Collaboration: A single Helix instance can control multiple robots simultaneously for coordinated tasks, like sorting groceries without predefined roles.
- Self-Termination and Long-Horizon Tasks: Includes a synthetic "task completion" output, allowing the model to decide when a job is done and sequence behaviors autonomously.
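The combination of a high-DoF continuous action space with a synthetic "task completion" output can be illustrated with a minimal sketch. The dimensions, the linear head, and the random weights below are placeholders for illustration, not Figure's actual policy head; the only details taken from the description above are the 35+ DoF action space and the extra termination signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: 35 continuous DoF targets
# (arms, wrists, torso, head, fingers) plus one "task complete" signal.
NUM_DOF = 35
FEATURE_DIM = 128

# Random placeholder weights standing in for a learned policy head.
w_action = rng.standard_normal((FEATURE_DIM, NUM_DOF)) * 0.01
w_done = rng.standard_normal(FEATURE_DIM) * 0.01

def policy_head(features: np.ndarray) -> tuple[np.ndarray, float]:
    """Map policy features to bounded joint targets and a
    task-completion probability (sigmoid over a scalar logit)."""
    actions = np.tanh(features @ w_action)              # in [-1, 1]
    p_done = 1.0 / (1.0 + np.exp(-(features @ w_done))) # in (0, 1)
    return actions, float(p_done)

features = rng.standard_normal(FEATURE_DIM)
actions, p_done = policy_head(features)
print(actions.shape)  # (35,)
```

At deployment, a controller would execute the 35 joint targets each tick and treat `p_done` crossing a threshold as the cue to end the current behavior and move to the next one in a long-horizon sequence.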
Demos showcase Helix-powered Figure robots unloading groceries, loading dishwashers, folding laundry, sorting packages in logistics settings, and even learning navigation from human video data. These run onboard low-power GPUs, making them practical for commercial deployment (e.g., at BMW factories).
Architecture: Dual-System "System 1 + System 2" Design
Helix uses a decoupled, hierarchical setup inspired by human cognition:
- System 2 (S2): A larger vision-language model (a ~7B-parameter open-source, open-weight VLM) runs at a slower rate (7-9 Hz). It processes images, language instructions, and scene understanding to generate high-level plans and latent conditioning vectors.
- System 1 (S1): A smaller, faster visuomotor policy (~80M parameters) handles real-time execution at 200 Hz, using S2's latents plus direct sensor inputs for reactive, low-level actions.
This asynchronous design avoids the action-tokenization bottlenecks of prior VLAs, scales to high-DoF control, and minimizes the train-inference gap. Later enhancements include stereo multiscale vision, learned proprioception, and online calibration for seamless cross-robot transfer.
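The decoupled, asynchronous S2/S1 loop can be sketched as two threads running at different rates: a slow planner publishing the latest latent, and a fast controller reading whatever latent is freshest on each tick. The rates, names, and shared-dictionary latent below are assumptions for illustration, not Figure's implementation; real models are replaced by counters and sleeps.

```python
import threading
import time

# Shared conditioning latent, protected by a lock. An integer counter
# stands in for S2's latent vector.
latent_lock = threading.Lock()
shared_latent = {"plan": 0}
stop = threading.Event()

def s2_planner(rate_hz: float = 8.0) -> None:
    """Slow loop: stand-in for a large VLM publishing a fresh latent."""
    step = 0
    while not stop.is_set():
        step += 1
        with latent_lock:
            shared_latent["plan"] = step  # newest high-level plan
        time.sleep(1.0 / rate_hz)

def s1_controller(rate_hz: float = 200.0, ticks: int = 100) -> list[int]:
    """Fast loop: on each control tick, act on the latest latent
    without waiting for the planner to finish a new one."""
    used = []
    for _ in range(ticks):
        with latent_lock:
            used.append(shared_latent["plan"])
        time.sleep(1.0 / rate_hz)
    return used

t = threading.Thread(target=s2_planner, daemon=True)
t.start()
plans_used = s1_controller()
stop.set()
t.join(timeout=1.0)
print(len(plans_used))  # 100 fast ticks, spanning only a few slow plans
```

Because the fast loop never blocks on the slow one, many consecutive control ticks reuse the same plan, which is the property that lets a small reactive policy run at 200 Hz while a much larger model deliberates in the background.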
Training and Data
- Trained on ~500 hours of high-quality teleoperated robot data (a fraction of datasets used by competitors) paired with auto-generated text descriptions.
- Recent updates enable learning directly from human video (no robot demonstrations needed), accelerated by partnerships such as one with Brookfield, whose 100,000+ residential units provide diverse real-world data.
- Project "Go-Big" aims for the world's largest humanoid pretraining dataset.
Impact and Deployments
Helix powers Figure's push toward mass production (e.g., the BotQ facility targeting 100,000+ robots in the coming years) and has influenced models like NVIDIA's GR00T N1. It is deployed in pilots for household chores, logistics (e.g., package triaging faster than humans), and manufacturing. As of late 2025, Figure 03 integrates Helix fully, with upgrades for home use (softer design, wireless charging) and for mass manufacturing.
Helix addresses core robotics challenges like scalability, dexterity in unstructured settings, and generalization, positioning humanoids as versatile helpers in homes and industry. Figure's $1B+ funding and $39B+ valuation underscore its potential. For the full technical report, visit Figure's site.