arXiv 2502.14786
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
By Michael Tschannen, Alexey Gritsenko, et al.
Published 2025-02-20
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. W…