arXiv 2512.01278

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

By Yilong Zhao, Jiaming Tang, et al.

Published 2025-12-01

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute to memory bandwidth. To generate each token, the model applies full attention over all previously generated tokens, which requires reading an ever-larger KV-Cache from memory. Consequently, longe…
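As a rough illustration of why the KV-Cache dominates memory traffic during long generation, the sketch below estimates how many bytes of keys and values full attention must read to produce a single token for one sequence. The function name kv_cache_bytes and every default (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Bytes of cached keys and values read to generate one token.

    All defaults are hypothetical model dimensions, not values from the paper.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    # Full attention touches the cache entries of every previous token.
    return per_token * seq_len


if __name__ == "__main__":
    for n in (1024, 8192, 32768):
        mib = kv_cache_bytes(n) / 2**20
        print(f"{n:>6} tokens: {mib:.0f} MiB read per decode step")
```

Under these assumed dimensions the cost grows linearly with the number of generated tokens, so a chain of thought that is twice as long requires twice the cache traffic for each new token.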
