arXiv 2509.21986
Developing Vision-Language-Action Model from Egocentric Videos
By Tomoya Yoshida, Shuhei Kurita, et al.
Published 2025-09-26
Wiki summary
Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such…