arXiv 2512.07829

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

By Yuan Gao, Chen Chen, et al.

Published 2025-12-08

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatche…
