arXiv 2012.14913

Transformer Feed-Forward Layers Are Key-Value Memories

By Mor Geva, Roei Schuster, et al.

Published 2020-12-29

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned pattern…
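
The key-value reading described in the abstract can be illustrated with a small sketch. It assumes the common two-matrix feed-forward form FF(x) = f(x · K^T) · V with f = ReLU and bias terms omitted, and it assumes a value vector is turned into a vocabulary distribution by multiplying with an output embedding matrix and applying softmax. The function names (`feed_forward`, `value_distribution`) and all toy numbers are hypothetical and chosen only for illustration; this is not the authors' code.

```python
import math


def matvec(rows, x):
    """Dot product of every row with the vector x."""
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in rows]


def relu(z):
    return [max(0.0, t) for t in z]


def softmax(z):
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(t - top) for t in z]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward(x, keys, values):
    """Assumed form: FF(x) = relu(x . K^T) . V, biases omitted.

    keys   : one row per memory cell (first weight matrix)
    values : one row per memory cell (second weight matrix)
    """
    coeffs = relu(matvec(keys, x))  # how strongly each key matches x
    out = [0.0] * len(values[0])
    for m, v in zip(coeffs, values):
        for j, v_j in enumerate(v):
            out[j] += m * v_j  # weighted sum of value vectors
    return coeffs, out


def value_distribution(value, embedding):
    """Assumed projection: softmax(E . v) over a toy vocabulary."""
    return softmax(matvec(embedding, value))


# Toy setup (made-up numbers): hidden size 2, three memory cells, three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

x = [0.5, 2.0]
coeffs, out = feed_forward(x, keys, values)
probs = value_distribution(values[1], embedding)

print(coeffs)  # the third key does not match x, so its coefficient is 0.0
print(out)
print(probs)
```

In this toy input the second key matches most strongly, so the output is dominated by the second value vector; the third cell contributes nothing because ReLU zeroes its negative match score.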
