arXiv 2507.12440
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
By Ruihan Yang, Qinxi Yu, et al.
Published 2025-07-16
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness…