arXiv 2406.06484
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
By Songlin Yang, Bailin Wang, et al.
Published 2024-06-10
Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the…
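To make the "additive update" concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes unnormalized causal linear attention with an identity feature map, so the recurrent state is a matrix S that accumulates outer products of values and keys. The function names are hypothetical. The third function uses the commonly cited form of the delta rule named in the title, S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T; that form and the per-step writing strength `beta` are assumptions here, since the excerpt above is cut off before stating the update.

```python
import numpy as np


def linear_attention_recurrent(q, k, v):
    """Causal linear attention computed one step at a time (linear in T).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    Assumes no normalization and an identity feature map.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))            # matrix-valued recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(v[t], k[t])    # additive update: S_t = S_{t-1} + v_t k_t^T
        out[t] = S @ q[t]               # read-out: o_t = S_t q_t
    return out


def linear_attention_parallel(q, k, v):
    """Same output via a causally masked (T, T) score matrix, without softmax."""
    scores = np.tril(q @ k.T)           # keep only positions i <= t
    return scores @ v


def delta_rule_recurrent(q, k, v, beta):
    """Delta-rule variant (assumed standard form, not taken from the excerpt).

    beta: array of shape (T,), the writing strength at each step.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        pred = S @ k[t]                                 # value currently stored under k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])   # correct the stored value toward v_t
        out[t] = S @ q[t]
    return out
```

Under these assumptions, the recurrent and parallel forms of linear attention return the same outputs. The difference between the two update rules shows up when the same unit-norm key is written twice: the additive update returns the sum of both values on read-out, while the delta-rule sketch with `beta = 1` returns only the most recent value. That overwrite behavior is one way to see why such an update could help on retrieval-style tasks.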