arXiv 2509.21986
Developing Vision-Language-Action Model from Egocentric Videos
By Tomoya Yoshida, Shuhei Kurita, et al.
Published 2025-09-26
Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Compared with the costly, expert-driven manual teleoperation commonly used to train Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos to train robot policies typically rely on auxiliary annotations, such…