arXiv 2309.06180
Efficient Memory Management for Large Language Model Serving with PagedAttention
By Woosuk Kwon, Zhuohan Li, et al.
Published 2023-09-12
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we p…