arXiv 2509.21986
Developing Vision-Language-Action Model from Egocentric Videos
By Tomoya Yoshida, Shuhei Kurita, et al.
Published 2025-09-26
Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such…