arXiv 2504.01002

Token embeddings violate the manifold hypothesis

By Michael Robinson, Sourya Dey, et al.

Published 2025-04-01

Fully understanding the behavior of a large language model (LLM) requires understanding its input token space. If this space differs from our assumptions, our comprehension of the LLM, and the conclusions we draw about it, will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relativel…
