arXiv 2503.04398
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
By Yan Li, Zhenyu Zhang, et al.
Published 2025-03-06
Prevailing LLM serving engines employ expert parallelism (EP) to run inference for massive MoE models across multiple devices. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, because EP relies on expensive all-to-all collectives to route tokens to remote experts whenever a token and its selected expert are not co-located on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-pla…
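To make the communication bottleneck described in the abstract concrete, here is a minimal sketch that counts how many token-to-expert dispatches cross a device boundary under a given expert placement. It is an illustration only, not the paper's method: the function name `count_remote_dispatches`, the toy routing decisions, and the two-device layout are all assumptions.

```python
from collections import Counter


def count_remote_dispatches(token_devices, token_experts, expert_to_device):
    """Count token-to-expert dispatches that stay local vs. cross devices.

    token_devices[i]  : device holding token i (hypothetical layout)
    token_experts[i]  : expert ids the router selected for token i
    expert_to_device  : mapping from expert id to the device hosting it
    """
    local, remote = 0, 0
    traffic = Counter()  # (src_device, dst_device) -> number of dispatches
    for src, experts in zip(token_devices, token_experts):
        for expert in experts:
            dst = expert_to_device[expert]
            if dst == src:
                local += 1  # expert is co-located with the token
            else:
                remote += 1  # would travel through the all-to-all collective
                traffic[(src, dst)] += 1
    return local, remote, traffic


# Toy example (assumed values): 4 tokens, top-2 routing, 4 experts, 2 devices.
token_devices = [0, 0, 1, 1]
token_experts = [[0, 2], [1, 3], [2, 3], [0, 1]]
expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}

local, remote, traffic = count_remote_dispatches(
    token_devices, token_experts, expert_to_device
)
print(local, remote, dict(traffic))
```

In this toy layout, half of the eight dispatches land on a different device from their token. Changing either the expert placement or the assignment of tokens to devices changes that fraction, and the remote count is the quantity that all-to-all traffic scales with.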