Date: February 27, 2026
Title: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
Speaker: Yunxi Shen, Cornell University
Abstract: Scaling models to larger sizes to improve performance has become a major trend in deep learning, and the sparsely activated Mixture-of-Experts (MoE) architecture is a promising way to scale models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers.
All-to-All communication originates from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed them. We propose a novel data-centric paradigm: keeping data in place and moving experts between GPUs. Since the experts can be smaller than the data they process, the data-centric paradigm can reduce the communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which can overlap computation and communication, and implements a hierarchical communication mechanism that further reduces cross-node traffic by sharing fetched experts within the same machine. Second, when scheduling "fetch expert" requests, Janus uses a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, so that downstream computation can start immediately once the previous step completes.
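The core insight can be illustrated with a back-of-envelope estimate: in the expert-centric paradigm the per-layer traffic scales with the volume of token activations exchanged, while in the data-centric paradigm it scales with the size of the expert weights fetched. The sketch below is a simplified model with hypothetical sizes chosen for illustration; it is not taken from the talk, and it ignores backward-pass traffic and overlap effects.

```python
# Simplified per-layer, per-GPU traffic estimate for one forward pass.
# All numbers below are hypothetical, for illustration only.

def expert_centric_traffic(tokens_per_gpu, hidden_dim, bytes_per_elem=2):
    # Expert-centric: activations are dispatched to remote experts and the
    # results are combined back, so token data crosses the network twice.
    return 2 * tokens_per_gpu * hidden_dim * bytes_per_elem

def data_centric_traffic(experts_fetched, params_per_expert, bytes_per_elem=2):
    # Data-centric: tokens stay put; only the needed expert weights move.
    return experts_fetched * params_per_expert * bytes_per_elem

# Hypothetical example: 16384 tokens/GPU, hidden dim 8192 (fp16),
# versus fetching 4 remote experts of 16M parameters each.
ec = expert_centric_traffic(16384, 8192)
dc = data_centric_traffic(4, 16 * 2**20)
print(f"expert-centric: {ec} bytes, data-centric: {dc} bytes, "
      f"ratio: {ec / dc:.1f}x")
```

Under these assumed sizes the data-centric scheme moves 4x less data; the actual ratio depends on batch size, expert size, and how many GPUs share each fetched expert, which is exactly what Janus's hierarchical sharing exploits.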
Evaluated on a 32-A100 cluster, Janus reduces traffic by up to 16× and achieves up to a 2.06× speedup compared with current MoE training systems.
Bio: Yunxi Shen is a second-year computer science Ph.D. student at Cornell Bowers, advised by Prof. Hakim Weatherspoon. His research focuses on data center networking and ML systems.