arXiv 2403.02310

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

By Amey Agrawal, Nitin Kedia, et al.

Published 2024-03-04

Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute because the input prompt is processed in parallel. In contrast, decode iterations have low latency but also low compute utilizatio…
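The two phases can be illustrated with a toy schedule for a single request. This is only a sketch for intuition, not the paper's scheduler: the names `Iteration` and `serve_request` are hypothetical, and it assumes the whole prompt is handled in one prefill iteration and that each later iteration feeds exactly one token through the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Iteration:
    phase: str             # "prefill" or "decode"
    tokens_processed: int  # tokens fed through the model in this iteration


def serve_request(prompt_len: int, output_len: int) -> List[Iteration]:
    """Return the per-iteration schedule for one request (hypothetical helper)."""
    if prompt_len < 1 or output_len < 1:
        raise ValueError("prompt_len and output_len must be at least 1")
    # Prefill: the entire prompt is processed in parallel in a single
    # iteration, which also produces the first output token.
    schedule = [Iteration("prefill", prompt_len)]
    # Decode: every remaining output token needs its own iteration,
    # and each of those iterations processes just one token.
    schedule += [Iteration("decode", 1) for _ in range(output_len - 1)]
    return schedule


if __name__ == "__main__":
    for step in serve_request(prompt_len=512, output_len=4):
        print(step.phase, step.tokens_processed)
```

In this toy model a request with a 512-token prompt and 4 output tokens yields one prefill iteration covering 512 tokens followed by three decode iterations covering one token each, which mirrors the contrast the abstract draws between the work done per iteration in the two phases.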
