arXiv 2502.14786

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

By Michael Tschannen, Alexey Gritsenko, et al.

Published 2025-02-20

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. W…
