arXiv 2506.09985

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

By Mido Assran, Adrien Bardes, et al.

Published 2025-06-11

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architec…
