arXiv 2504.01002
Token embeddings violate the manifold hypothesis
By Michael Robinson, Sourya Dey, et al.
Published 2025-04-01
A full understanding of the behavior of a large language model (LLM) requires an understanding of its input token space. If this space differs from our assumptions, both our understanding of the LLM and the conclusions we draw about it will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relativel…