arXiv 2602.16052

MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

By Bradley McDanel, Steven Li, et al.

Published 2026-02-17

Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce specul…
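
As a rough illustration of why wider draft trees touch more experts, the sketch below estimates how many distinct experts one MoE layer activates when several tokens are verified together. It is not the paper's method. The function name `expected_unique_experts` and the uniform, independent routing model are assumptions made for illustration only; real routers are skewed and correlated across tokens.

```python
def expected_unique_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts activated in one MoE layer.

    Assumption (illustrative only): each token independently selects
    top_k distinct experts uniformly at random from num_experts.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in 1..num_experts")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Probability that one particular expert is NOT chosen by a single token.
    miss = 1.0 - top_k / num_experts
    # An expert is activated if at least one of the tokens selects it.
    return num_experts * (1.0 - miss ** num_tokens)


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, top-8 routing.
    for tokens in (1, 4, 16, 64):
        print(tokens, round(expected_unique_experts(64, 8, tokens), 1))
```

Under this toy model, a single autoregressive token activates exactly `top_k` experts, while the count for a batch of drafted tokens climbs toward `num_experts`, so more expert weights must be read per verification step.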
