arXiv:2504.13181

Perception Encoder: The best visual embeddings are not at the output of the network

By Daniel Bolya, Po-Yao Huang, et al.

Published 2025-04-17

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining wit…