arXiv 2512.01278
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
By Yilong Zhao, Jiaming Tang, et al.
Published 2025-12-01
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts inference from being compute-bound to being memory-bound. To generate each token, the model applies full attention over all previously generated tokens, which requires reading an ever-growing KV-Cache from memory. Consequently, longe…
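The memory-bound behavior described above can be illustrated with a back-of-the-envelope estimate of how many KV-Cache bytes full attention reads for each decoded token. This is only a sketch: the function name `kv_cache_bytes` and the model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache that full attention reads to decode one token.

    All default dimensions are illustrative assumptions, not values
    reported in the paper.
    """
    # Factor 2: one key vector and one value vector per head, per layer,
    # for every token already in the context.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


if __name__ == "__main__":
    for n in (1_024, 8_192, 32_768):
        mib = kv_cache_bytes(n) / 2**20
        print(f"{n:>6} tokens in context: {mib:.0f} MiB read per decode step")
```

Under these assumptions the bytes read per step grow linearly with the number of tokens generated so far, while the arithmetic per step stays comparatively small, which is why long CoT outputs push decoding toward being limited by memory bandwidth.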