arXiv 2509.01092

REFRAG: Rethinking RAG based Decoding

By Xiaoqiang Lin, Aritra Ghosh, et al.

Published 2025-09-01

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…
