arXiv 2602.16052

MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

By Bradley McDanel, Steven Li, et al.

Published 2026-02-17

Wiki summary

Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce specul…
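
As a rough illustration of why verifying many drafted tokens at once touches more experts than decoding a single token, the sketch below simulates top-k routing and counts the distinct experts activated in one verification pass. It is not the paper's method: the uniform random router, the function names, and the sizes (64 experts, top-4 routing) are all assumptions chosen for illustration, and real routers are learned and typically skewed.

```python
import random


def unique_experts_activated(num_tokens, num_experts=64, top_k=4, seed=0):
    """Count distinct experts touched when `num_tokens` tokens share one pass.

    Assumption: each token picks `top_k` distinct experts uniformly at random.
    This stands in for a learned router purely for illustration.
    """
    rng = random.Random(seed)
    activated = set()
    for _ in range(num_tokens):
        # One token's routing decision: top_k distinct expert indices.
        activated.update(rng.sample(range(num_experts), top_k))
    return len(activated)


def expected_unique_experts(num_tokens, num_experts=64, top_k=4):
    """Expected distinct experts under the same uniform, independent assumption."""
    miss_prob = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - miss_prob)


if __name__ == "__main__":
    # 1 token corresponds to autoregressive decoding; larger counts
    # correspond to larger (hypothetical) draft trees.
    for tokens in (1, 8, 32, 64):
        simulated = unique_experts_activated(tokens)
        expected = expected_unique_experts(tokens)
        print(f"{tokens:3d} tokens: {simulated:2d} simulated, {expected:5.1f} expected")
```

Under these assumptions a single token activates exactly `top_k` experts, while the count for a draft tree grows toward the full expert pool as the tree gets larger, so more expert weights must be loaded per verification step.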
