arXiv 2406.06484

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

By Songlin Yang, Bailin Wang, et al.

Published 2024-06-10

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as viable linear-time alternatives to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the…
