arXiv 2507.12440

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

By Ruihan Yang, Qinxi Yu, et al.

Published 2025-07-16

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness…
