MSCCL++: Rethinking GPU Communication Abstractions for AI Inference

ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) |

AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable.

This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise.

Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of 1.7× (up to 5.4×) for collective communication and 1.2× (up to 1.38×) for AI inference workloads. MSCCL++ is in production of multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp (opens in new tab). Our two years of experience with MSCCL++ suggests that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.