arXiv 2602.00462

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

By Benno Krojer, Shravan Nayak, et al.

Published 2026-01-31

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at e…
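The mapping described in the first two sentences can be pictured with a minimal, standard-library Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class name `ShallowMLPProjector`, the helper `build_llm_input`, the two-layer GELU design, the dimensions, and the random untrained weights are all hypothetical choices made here for clarity.

```python
import math
import random


def gelu(x):
    """Tanh approximation of the GELU activation for a single float."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    """Compute vec @ weight + bias, where weight has shape (in_dim, out_dim)."""
    return [
        sum(v * row[j] for v, row in zip(vec, weight)) + bias[j]
        for j in range(len(bias))
    ]


def init_weight(rng, in_dim, out_dim, scale=0.02):
    """Small random weights; a real projector would be trained, not sampled."""
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]


class ShallowMLPProjector:
    """Hypothetical two-layer MLP mapping vision-encoder tokens to LLM embedding size."""

    def __init__(self, vision_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = init_weight(rng, vision_dim, hidden_dim)
        self.b1 = [0.0] * hidden_dim
        self.w2 = init_weight(rng, hidden_dim, llm_dim)
        self.b2 = [0.0] * llm_dim

    def __call__(self, visual_tokens):
        projected = []
        for token in visual_tokens:
            hidden = [gelu(h) for h in linear(token, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


def build_llm_input(projected_visual, text_embeddings):
    """Place projected visual tokens ahead of text embeddings in one sequence."""
    return projected_visual + text_embeddings


# Toy sizes (assumed): 4 visual tokens of width 8, 3 text embeddings of width 16.
rng = random.Random(1)
visual_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(3)]

projector = ShallowMLPProjector(vision_dim=8, hidden_dim=12, llm_dim=16)
sequence = build_llm_input(projector(visual_tokens), text_embeddings)
print(len(sequence), len(sequence[0]))  # prints "7 16"
```

After projection, every visual token has the same width as a text embedding, so the LLM can consume both in a single sequence. That shared space is also what makes it meaningful to ask what the visual token representations encode.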