arXiv 2602.00462
LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
By Benno Krojer, Shravan Nayak, et al.
Published 2026-01-31
Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at e…
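
To make the "shallow MLP" mapping concrete, here is a minimal sketch of projecting vision-encoder tokens into an LLM's embedding dimension. It is an illustration under stated assumptions, not the paper's implementation: the two-layer shape, the GELU activation, the random weights, and all dimensions and names (`VISION_DIM`, `HIDDEN_DIM`, `LLM_DIM`, `project_visual_tokens`) are hypothetical choices made for the example.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to a single token vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gelu(v):
    """Tanh approximation of GELU (the activation is an assumption here)."""
    inner = math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)
    return 0.5 * v * (1.0 + math.tanh(inner))


def project_visual_tokens(visual_tokens, layer1, layer2):
    """Map each vision-encoder token into the LLM embedding dimension."""
    projected = []
    for token in visual_tokens:
        hidden = [gelu(h) for h in linear(token, layer1)]
        projected.append(linear(hidden, layer2))
    return projected


# Hypothetical toy sizes; real encoders and LLMs are far larger.
VISION_DIM, HIDDEN_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 12, 4

rng = random.Random(0)
layer1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Stand-in for vision-encoder outputs: one vector per image patch.
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                 for _ in range(NUM_PATCHES)]

projected = project_visual_tokens(visual_tokens, layer1, layer2)
print(len(projected), len(projected[0]))
```

Each projected vector has the same width as an LLM text embedding, so in a VLM of this kind the projected tokens can be placed in the input sequence alongside text token embeddings. In practice the projector weights are trained rather than random.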