arXiv 2406.06484
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
By Songlin Yang, Bailin Wang, et al.
Published 2024-06-10
Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as viable linear-time alternatives to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the…