arXiv 2512.10942

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

By Delong Chen, Mustafa Shukor, et al.

Published 2025-12-11

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled compar…
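As a rough illustration of the contrast drawn above, the sketch below compares a token-level cross-entropy loss with an embedding-space loss on a toy example. Everything in it is an assumption made for illustration: the excerpt does not state the loss VL-JEPA uses, so cosine distance stands in as a generic embedding objective, and the vectors named `answer_emb`, `paraphrase_emb`, and `predicted_emb` are hypothetical stand-ins for text-encoder and predictor outputs, not outputs of the actual model.

```python
import numpy as np


def token_cross_entropy(logits, target_ids):
    """Mean cross-entropy of target token ids under logits of shape (T, V)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())


def cosine_distance(pred, target):
    """1 - cosine similarity between two 1-D vectors (assumed stand-in loss)."""
    pred = pred / np.linalg.norm(pred)
    target = target / np.linalg.norm(target)
    return 1.0 - float(pred @ target)


rng = np.random.default_rng(0)

# Token space: the decoder is confident about one phrasing (ids_a).
# A paraphrase with different token ids (ids_b) is penalized heavily,
# even if it means the same thing.
vocab_size = 5
ids_a = np.array([0, 1, 2])
ids_b = np.array([3, 4, 3])  # hypothetical paraphrase, different tokens
logits = 5.0 * np.eye(vocab_size)[ids_a]
ce_same = token_cross_entropy(logits, ids_a)
ce_paraphrase = token_cross_entropy(logits, ids_b)

# Embedding space: hypothetical text-encoder outputs. The paraphrase is
# assumed to land close to the original answer; an unrelated text does not.
dim = 8
answer_emb = rng.normal(size=dim)
paraphrase_emb = answer_emb + 0.01 * rng.normal(size=dim)
unrelated_emb = rng.normal(size=dim)

# Hypothetical predictor output, close to the answer embedding.
predicted_emb = answer_emb + 0.01 * rng.normal(size=dim)

loss_paraphrase = cosine_distance(predicted_emb, paraphrase_emb)
loss_unrelated = cosine_distance(predicted_emb, unrelated_emb)

print(f"token CE, same phrasing: {ce_same:.4f}")
print(f"token CE, paraphrase:    {ce_paraphrase:.4f}")
print(f"embedding loss, paraphrase target: {loss_paraphrase:.6f}")
print(f"embedding loss, unrelated target:  {loss_unrelated:.6f}")
```

Under these assumptions, the token-level loss treats the paraphrase as a wrong answer, while the embedding-space loss stays small for any target whose embedding lies near the prediction. That is one way to read the claim that learning in representation space abstracts away surface-level linguistic variability.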
