arXiv 2509.01092
REFRAG: Rethinking RAG based Decoding
By Xiaoqiang Lin, Aritra Ghosh, et al.
Published 2025-09-01
Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…