arXiv 2205.14135

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

By Tri Dao, Daniel Y. Fu, et al.

Published 2022-05-27

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for re…
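To make the quadratic cost mentioned in the abstract concrete, here is a minimal pure-Python sketch of standard (non-IO-aware) softmax attention. It is not the paper's algorithm; the function name `naive_attention` and the list-of-rows representation are assumptions chosen for illustration. The point is only that the intermediate score matrix has one entry per query-key pair, so its size grows with the square of the sequence length.

```python
import math


def naive_attention(q, k, v):
    """Standard softmax attention over lists of row vectors.

    Hypothetical illustration only: q, k, v are lists of n rows,
    where q and k rows have length d. Returns the output rows and
    the number of entries in the materialized score matrix.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)

    # The full n x n score matrix is built explicitly here; this is
    # the source of the quadratic memory cost.
    scores = [
        [scale * sum(qi[t] * kj[t] for t in range(d)) for kj in k]
        for qi in q
    ]

    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value rows.
        out.append(
            [sum(w * vj[t] for w, vj in zip(weights, v)) for t in range(len(v[0]))]
        )

    return out, len(scores) * len(scores[0])
```

Doubling the number of rows in `q` and `k` quadruples the second return value, which is the scaling behaviour the abstract describes for self-attention on long sequences.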
