arXiv 2503.04398

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

By Yan Li, Zhenyu Zhang, et al.

Published 2025-03-06

Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, because EP relies on expensive all-to-all collectives to route tokens to remote experts whenever a token and its expert are not colocated on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-pla…
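
To make the communication cost described above concrete, here is a minimal sketch that counts how many token-to-expert dispatches cross a device boundary under a given expert placement. It is not the paper's method. The function name `remote_token_count`, the top-1 routing assumption, and the toy placements are all hypothetical and chosen only for illustration.

```python
def remote_token_count(token_device, token_expert, expert_device):
    """Count tokens whose selected expert lives on a different device.

    token_device[i]  : device holding token i (hypothetical toy input)
    token_expert[i]  : expert chosen for token i (top-1 routing assumed)
    expert_device[e] : device hosting expert e
    """
    remote = 0
    for dev, exp in zip(token_device, token_expert):
        if expert_device[exp] != dev:
            remote += 1  # this token would travel in the all-to-all
    return remote


# Toy setup: 2 devices, 4 experts, 6 tokens (all values are made up).
token_device = [0, 0, 0, 1, 1, 1]
token_expert = [0, 2, 3, 1, 2, 3]

# Two hypothetical expert placements over the same routing decisions.
placement_a = {0: 0, 1: 0, 2: 1, 3: 1}
placement_b = {0: 0, 1: 1, 2: 0, 3: 1}

remote_a = remote_token_count(token_device, token_expert, placement_a)
remote_b = remote_token_count(token_device, token_expert, placement_b)
print(remote_a, remote_b)
```

In this toy case the same routing decisions produce a different number of cross-device dispatches depending only on where the experts are placed.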
