arXiv 2512.10942
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
By Delong Chen, Mustafa Shukor, et al.
Published 2025-12-11
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled compar…
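To make the contrast between token generation and embedding prediction concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the cosine-distance loss, the three-dimensional embeddings, the five-token vocabulary, and the function names `embedding_loss` and `token_loss` are all assumptions chosen for illustration. It compares how each kind of objective scores two paraphrases of the same answer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_loss(predicted, target):
    """Cosine distance between a predicted and a target text embedding.

    Assumed loss for illustration; the source does not specify this form.
    """
    return 1.0 - cosine_similarity(predicted, target)


def token_loss(step_probs, target_ids):
    """Mean negative log-likelihood of the target tokens, one distribution per step."""
    nll = -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
    return nll / len(target_ids)


# Hypothetical embeddings of two paraphrases with the same meaning.
target_a = [0.90, 0.10, 0.40]
target_b = [0.88, 0.12, 0.41]
predicted = [0.90, 0.10, 0.40]  # the model's predicted embedding

# Hypothetical per-step token distributions over a 5-token vocabulary.
# The model is confident in the exact wording of paraphrase A.
step_probs = [
    [0.900, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.900, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.900, 0.025, 0.025],
]
tokens_a = [0, 1, 2]  # wording of paraphrase A
tokens_b = [0, 3, 4]  # different wording, same meaning

emb_loss_a = embedding_loss(predicted, target_a)
emb_loss_b = embedding_loss(predicted, target_b)
tok_loss_a = token_loss(step_probs, tokens_a)
tok_loss_b = token_loss(step_probs, tokens_b)

print("embedding loss:", emb_loss_a, emb_loss_b)
print("token loss:    ", tok_loss_a, tok_loss_b)
```

In this toy setup the embedding loss stays small for both paraphrases, while the token loss is much larger for the wording the model did not predict. That is one way to read the claim that learning in representation space abstracts away surface-level linguistic variability. The effect depends entirely on the text encoder placing paraphrases close together, which the hand-picked vectors here simply assume.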