arXiv 2309.06180

Efficient Memory Management for Large Language Model Serving with PagedAttention

By Woosuk Kwon, Zhuohan Li, et al.

Published 2023-09-12

High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we p…
