arXiv 2509.21986

Developing Vision-Language-Action Model from Egocentric Videos

By Tomoya Yoshida, Shuhei Kurita, et al.

Published 2025-09-26

Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such…
