arXiv 2509.21986

Developing Vision-Language-Action Model from Egocentric Videos

By Tomoya Yoshida, Shuhei Kurita, et al.

Published 2025-09-26

Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such…
