arXiv 2511.05313
Attention and Compression is all you need for Controllably Efficient Language Models
By Jatin Prakash, Aahlad Puli, et al.
Published 2025-11-07
The quadratic cost of attention in transformers has motivated the development of efficient approaches, namely sparse and sliding-window attention, convolutions, and linear attention. Although these approaches yield impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, fixing this quality-compute tradeoff a priori means being suboptimal fro…