arXiv 2504.13181
Perception Encoder: The best visual embeddings are not at the output of the network
By Daniel Bolya, Po-Yao Huang, et al.
Published 2025-04-17
We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining wit…