arXiv 2509.01092

REFRAG: Rethinking RAG based Decoding

By Xiaoqiang Lin, Aritra Ghosh, et al.

Published 2025-09-01

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…
