arXiv 2602.00462

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

By Benno Krojer, Shravan Nayak, et al.

Published 2026-01-31

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at e…
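
As a rough illustration of the mapping described above, the following is a minimal sketch of a shallow two-layer MLP that projects vision-encoder tokens into an LLM's embedding dimension. It is not the paper's implementation: the toy dimensions, the GELU activation, the random initialization, and all function names (`make_linear`, `project_visual_tokens`, and so on) are assumptions made for the example.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to one vector: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gelu(x):
    """Exact GELU, applied elementwise (the activation is an assumed choice)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def project_visual_tokens(tokens, layer1, layer2):
    """Map each vision-encoder token into the LLM embedding dimension."""
    return [linear(gelu(linear(t, layer1)), layer2) for t in tokens]


# Toy sizes, chosen only for illustration; real models use far larger ones.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
NUM_TOKENS = 4

rng = random.Random(0)
layer1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Stand-in for the output of a vision encoder: one vector per image patch.
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                 for _ in range(NUM_TOKENS)]

projected = project_visual_tokens(visual_tokens, layer1, layer2)
print(len(projected), len(projected[0]))  # 4 12
```

After projection, each visual token has the same dimensionality as the LLM's text embeddings, so in a setup like this the projected tokens could be placed alongside text token embeddings in the LLM's input sequence.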
