arXiv 2602.16052
MoE-Spec: Expert Budgeting for Efficient Speculative Decoding
By Bradley McDanel, Steven Li, et al.
Published 2026-02-17
Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce specul…
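The bottleneck described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's method: it assumes each drafted token independently routes to `top_k` experts chosen uniformly at random, whereas real routers are learned and correlated across tokens. The function names and the values 64 and 8 are hypothetical, chosen only for illustration.

```python
import random


def unique_experts_activated(num_tokens, num_experts=64, top_k=8, seed=0):
    """Count distinct experts hit in one MoE layer by a batch of draft tokens.

    Assumption: each token picks top_k experts uniformly at random,
    independently of the other tokens (a stand-in for a learned router).
    """
    rng = random.Random(seed)
    activated = set()
    for _ in range(num_tokens):
        # One token's routing decision: top_k distinct expert indices.
        activated.update(rng.sample(range(num_experts), top_k))
    return len(activated)


def expected_unique_experts(num_tokens, num_experts=64, top_k=8):
    """Closed-form expectation under the same uniform-routing assumption.

    A given expert is missed by one token with probability 1 - top_k/num_experts,
    so it is missed by all tokens with that probability raised to num_tokens.
    """
    miss_all = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - miss_all)


if __name__ == "__main__":
    # A single token is the autoregressive case; larger values mimic draft trees.
    for tree_size in (1, 4, 16, 64):
        simulated = unique_experts_activated(tree_size)
        expected = expected_unique_experts(tree_size)
        print(f"tokens={tree_size:3d}  simulated={simulated:2d}  expected={expected:.1f}")
```

Under these assumptions, a single token touches exactly `top_k` experts, while the number of unique experts grows toward `num_experts` as the draft tree widens, so more expert weights must be read from memory for each verification step.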