arXiv 2602.00462

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

By Benno Krojer, Shravan Nayak, et al.

Published 2026-01-31

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at e…
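
The mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the two-layer shape, the GELU activation, the toy dimensions, and every name below (make_linear, project_visual_tokens, VISION_DIM, LLM_DIM) are assumptions chosen for illustration. It shows only the general idea of projecting vision-encoder outputs into an LLM's embedding dimension and placing them alongside text embeddings.

```python
import math
import random


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(d_in, d_out, rng):
    # Hypothetical helper: randomly initialised weights (d_out x d_in) and a zero bias.
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(x, layer):
    # Apply y = W x + b to a single vector.
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_visual_tokens(visual_tokens, fc1, fc2):
    # Shallow MLP connector: linear -> GELU -> linear, applied to each token independently.
    projected = []
    for token in visual_tokens:
        hidden = [gelu(h) for h in linear(token, fc1)]
        projected.append(linear(hidden, fc2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4  # toy sizes, assumed for illustration

fc1 = make_linear(VISION_DIM, LLM_DIM, rng)
fc2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for vision-encoder outputs (one vector per image patch).
visual_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
projected = project_visual_tokens(visual_tokens, fc1, fc2)

# Stand-ins for text token embeddings; the LLM would consume both in one sequence.
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]
llm_input = projected + text_embeddings
```

After projection, each visual token has the same dimensionality as a text embedding, which is what lets the two be concatenated into a single input sequence.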
