arXiv 2511.05313

Attention and Compression is all you need for Controllably Efficient Language Models

By Jatin Prakash, Aahlad Puli, et al.

Published 2025-11-07

The quadratic cost of attention in transformers has motivated the development of efficient alternatives: sparse and sliding-window attention, convolutions, and linear attention. Although these approaches yield impressive reductions in compute and memory, they often trade away quality, specifically in-context recall performance. Moreover, fixing this quality-compute tradeoff a priori means being suboptimal fro…
