arXiv 2309.06180

Efficient Memory Management for Large Language Model Serving with PagedAttention

By Woosuk Kwon, Zhuohan Li, et al.

Published 2023-09-12

Wiki summary

High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we p…
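
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch of KV cache size. It relies on the standard accounting of one key vector and one value vector per layer per token. The model configuration (40 layers, hidden size 5120, 2-byte values) is an illustrative 13B-scale assumption, not a figure from this summary. The "reserve space for the maximum length up front" policy is likewise an assumed example of how memory can end up wasted, not a description of any particular system.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector in every layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def reserved_vs_used(max_len: int, actual_len: int, per_token: int) -> tuple[int, int]:
    """Bytes reserved when room for max_len tokens is preallocated, and bytes actually used."""
    return max_len * per_token, actual_len * per_token


# Hypothetical 13B-scale configuration (assumed values, for illustration only).
PER_TOKEN = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)

# Assumed scenario: a slot sized for 2048 tokens holds a request that stops at 200.
RESERVED, USED = reserved_vs_used(max_len=2048, actual_len=200, per_token=PER_TOKEN)
WASTED_FRACTION = 1 - USED / RESERVED

print(f"{PER_TOKEN / 1024:.0f} KiB per token")
print(f"{RESERVED / 2**30:.2f} GiB reserved, {USED / 2**30:.2f} GiB used")
print(f"{WASTED_FRACTION:.1%} of the reservation is unused")
```

Under these assumptions a single request can tie up more than a gibibyte of accelerator memory while using only a small part of it, which is why wasted KV cache space translates directly into a smaller batch size.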
