MSCCL++: Rethinking GPU Communication Abstractions for AI Inference
- Changho Hwang,
- Peng Cheng,
- Roshan Dathathri,
- Abhinav Jangda,
- Saeed Maleki,
- Madan Musuvathi,
- Olli Saarikivi,
- Aashaka Shah,
- Ziyue Yang,
- Binyang Li,
- Caio Rocha,
- Qinghua Zhou,
- Mahdieh Ghazimirsaeed,
- Sreevatsa Anantharamu,
- Jithin Jose
AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose communication libraries lag in supporting new hardware features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable.
This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise.
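To make the first layer concrete, below is a minimal sketch of the one-sided "put / signal / wait" style of primitive interface described above. The `Channel` type and its member functions are hypothetical placeholders with no-op bodies, not the actual MSCCL++ API; the sketch only illustrates the pattern a kernel author would follow, with synchronization and memory-consistency details hidden behind the channel abstraction.

```cpp
#include <cstddef>

// Hypothetical per-peer channel handle; illustrative only, not the MSCCL++ API.
struct Channel {
  // One-sided copy of `bytes` from the local buffer into the peer's buffer.
  __device__ void put(size_t dstOff, size_t srcOff, size_t bytes) { /* placeholder */ }
  // Make the written data visible to the peer and notify it.
  __device__ void signal() { /* placeholder */ }
  // Block until the peer's matching signal arrives.
  __device__ void wait() { /* placeholder */ }
};

// Each rank pushes its local chunk to the peer, then waits for the peer's
// chunk in return, without managing fences or flags itself.
__global__ void exchangeChunk(Channel chan, size_t chunkBytes) {
  if (threadIdx.x == 0) {
    chan.put(/*dstOff=*/0, /*srcOff=*/0, chunkBytes);
    chan.signal();
    chan.wait();
  }
}
```

In this style, a workload-specific algorithm (expressed directly or via the DSL) composes such primitives into a schedule, while the library of prebuilt collectives offers the same capability behind the standard collective API.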
Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of 1.7× (up to 5.4×) for collective communication and 1.2× (up to 1.38×) for AI inference workloads. MSCCL++ is used in production by multiple AI services on Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggest that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.