arXiv 2507.15597

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

By Hao Luo, Yicheng Feng, et al.

Published 2025-07-21

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we…
