arXiv 2512.01278
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
By Yilong Zhao, Jiaming Tang, et al.
Published 2025-12-01
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts inference from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longe…
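
To make the memory-bound claim concrete, the following is a rough back-of-the-envelope sketch of how much KV-Cache a full-attention decode step has to read as the generated sequence grows. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the function names are illustrative assumptions, not values or code taken from the paper.

```python
# Back-of-the-envelope KV-Cache traffic for full-attention decoding.
# All model dimensions below are hypothetical, chosen only for illustration.

def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes of keys plus values stored for `num_tokens` cached tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


def total_bytes_read(num_generated, prompt_len=0, **shape):
    """Total KV bytes read while generating `num_generated` tokens.

    With full attention, the step that produces token t reads the cache
    of all prompt_len + t tokens before it, so the total grows
    quadratically with the generation length.
    """
    return sum(kv_cache_bytes(prompt_len + t, **shape)
               for t in range(num_generated))


if __name__ == "__main__":
    for n in (1_024, 8_192, 32_768):
        per_step_mib = kv_cache_bytes(n) / 2**20
        print(f"{n:>6} cached tokens: {per_step_mib:,.0f} MiB read per decode step")
```

Under these assumed dimensions each cached token costs 128 KiB, so a single decode step over 32,768 cached tokens reads about 4 GiB of KV-Cache, while the arithmetic for that one new token stays small. That widening gap between bytes moved and work done is what the abstract means by memory-bound.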