arXiv 2209.10652

Toy Models of Superposition

By Nelson Elhage, Tristan Hume, et al.

Published 2022-09-21

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the g…
