arXiv 2509.01092
REFRAG: Rethinking RAG based Decoding
By Xiaoqiang Lin, Aritra Ghosh, et al.
Published 2025-09-01
Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…
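The abstract's point about key-value cache memory can be made concrete with the usual back-of-the-envelope estimate. This is a minimal sketch, assuming a decoder-only transformer that caches one key vector and one value vector per layer, per KV head, per token, stored in fp16 (2 bytes per element). The function name `kv_cache_bytes` and the model shape used in the loop (32 layers, 8 KV heads, head dimension 128) are hypothetical illustrations, not values taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes for a decoder-only transformer.

    Assumption: every layer stores one key and one value vector (the factor 2)
    for each KV head, each cached token, and each sequence in the batch.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    for tokens in (1_024, 4_096, 16_384):
        size = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                              head_dim=128, seq_len=tokens)
        print(f"{tokens:>6} tokens -> {size / 2**20:.0f} MiB")
```

Under these assumed numbers, a single 4,096-token sequence already occupies 512 MiB of cache. The size grows linearly with both context length and batch size, so long retrieved contexts reduce how many requests fit in memory at once, which is the throughput pressure the abstract describes.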