arXiv 2509.01092

REFRAG: Rethinking RAG based Decoding

By Xiaoqiang Lin, Aritra Ghosh, et al.

Published 2025-09-01

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off betw…
