arXiv 2509.01092
REFRAG: Rethinking RAG based Decoding
By Xiaoqiang Lin, Aritra Ghosh, et al.
Published 2025-09-01
Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…