arXiv 2509.21986

Developing Vision-Language-Action Model from Egocentric Videos

By Tomoya Yoshida, Shuhei Kurita, et al.

Published 2025-09-26

Wiki summary

Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such…
