arXiv 2507.12440

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

By Ruihan Yang, Qinxi Yu, et al.

Published 2025-07-16

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness…