MSCCL++: Rethinking GPU Communication Abstractions for AI Inference
- Changho Hwang,
- Peng Cheng,
- Roshan Dathathri,
- Abhinav Jangda,
- Saeed Maleki,
- Madan Musuvathi,
- Olli Saarikivi,
- Aashaka Shah,
- Ziyue Yang,
- Binyang Li,
- Caio Rocha,
- Qinghua Zhou,
- Mahdieh Ghazimirsaeed,
- Sreevatsa Anantharamu,
- Jithin Jose
AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose communication libraries lag in supporting new hardware features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable.
This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise.
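To make the first layer concrete, below is a minimal sketch of the one-sided "put / signal / wait" style of primitive interface described above. The `Channel` type and its member functions are hypothetical placeholders with no-op bodies, not the actual MSCCL++ API; the sketch only illustrates the pattern a kernel author would follow, with synchronization and memory-consistency details hidden behind the channel abstraction.

```cpp
#include <cstddef>

// Hypothetical per-peer channel handle; illustrative only, not the MSCCL++ API.
struct Channel {
  // One-sided copy of `bytes` from the local buffer into the peer's buffer.
  __device__ void put(size_t dstOff, size_t srcOff, size_t bytes) { /* placeholder */ }
  // Make the written data visible to the peer and notify it.
  __device__ void signal() { /* placeholder */ }
  // Block until the peer's matching signal arrives.
  __device__ void wait() { /* placeholder */ }
};

// Each rank pushes its local chunk to the peer, then waits for the peer's
// chunk in return, without managing fences or flags itself.
__global__ void exchangeChunk(Channel chan, size_t chunkBytes) {
  if (threadIdx.x == 0) {
    chan.put(/*dstOff=*/0, /*srcOff=*/0, chunkBytes);
    chan.signal();
    chan.wait();
  }
}
```

In this style, a workload-specific algorithm (expressed directly or via the DSL) composes such primitives into a schedule, while the library of prebuilt collectives offers the same capability behind the standard collective API.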
Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of 1.7× (up to 5.4×) for collective communication and 1.2× (up to 1.38×) for AI inference workloads. MSCCL++ is used in production by multiple AI services on Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggest that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.