
Saeed Maleki

Principal Research SDE

Connect on LinkedIn


I am a Principal Research SDE working on the Parasail project in the RiSE research group. My main focus recently has been optimizing AI infrastructure. For the past two years, I have led the MSCCL project, which optimizes collective communication kernels for first-party (1P) and third-party (3P) inference and training workloads on Azure. Today, MSCCL runs communication operations for several high-profile AI workloads across Microsoft. You can read about the MSCCL vision in our public repository.

Before MSCCL, I worked on AdaSum, a technique for combining gradients that scales batch parallelization beyond what is possible with plain gradient averaging. AdaSum is supported in Horovod by simply passing a flag, and it is used by multiple third-party users of Azure AI infrastructure.
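A minimal sketch of the adaptive summation rule (my reading of the pairwise combiner from the MLSys'21 paper; consult the paper for the authoritative form): each gradient's contribution is scaled down by its projection onto the other, so orthogonal gradients are summed in full while identical ones are not double-counted.

```python
import numpy as np

def adasum(g1, g2):
    """Pairwise adaptive summation of two gradient vectors (sketch).
    Assumes both gradients are nonzero."""
    dot = np.dot(g1, g2)
    # Scale each gradient by how much of it is NOT already covered
    # by the other: orthogonal -> plain sum, identical -> average-like.
    return (1 - dot / (2 * np.dot(g1, g1))) * g1 + \
           (1 - dot / (2 * np.dot(g2, g2))) * g2
```

For orthogonal g1 and g2 this reduces to g1 + g2, and for g1 == g2 it returns the gradient itself, which is what lets the rule scale beyond plain averaging without diverging.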

I also pioneered a rank-1 parallelization technique for dynamic programming algorithms. The technique enables parallelization across dependent loop iterations while preserving the semantics of the computation. This work was recognized as a CACM Research Highlight.
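The core idea of parallelizing across dependent iterations can be illustrated with a toy example (this is a simplified analogue, not the paper's rank-1 construction): when each iteration applies a linear update x_i = a_i * x_{i-1} + b_i, the updates can be represented as maps and combined with an associative composition, so the chain can be reduced in any grouping, and therefore in parallel, while producing exactly the sequential result.

```python
from functools import reduce

def compose(f, g):
    """Compose two affine maps x -> a*x + b, represented as (a, b).
    Returns the map 'apply f, then g'."""
    (a1, b1), (a2, b2) = f, g
    return (a2 * a1, a2 * b1 + b2)

def run_parallelizable(coeffs, x0):
    """Collapse dependent updates x_i = a_i * x_{i-1} + b_i into one map.
    Because 'compose' is associative, this reduction can be split across
    workers in any grouping without changing the final answer."""
    a, b = reduce(compose, coeffs, (1.0, 0.0))  # (1, 0) is the identity map
    return a * x0 + b
```

The semantics-preserving property is exactly this: the parallel grouping computes the same value as the straight sequential loop, bit-for-bit up to floating-point reassociation effects.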

These projects would not have been successful without the many great research interns I have worked with over the past few years. Their work was published in top conferences and is listed below. If you are looking for an internship at Microsoft Research in optimizing AI infrastructure, please send me an email: saemal@microsoft.com.

Below is a list of past interns and their publications from the past five years:

Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL — NSDI’23
Intern: Aashaka Shah

GC3: An Optimizing Compiler for GPU Collective Communication — ASPLOS’23

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads — ASPLOS’22
Intern: Abhinav Jangda

Synthesizing Optimal Collective Algorithms — PPoPP’21 (Best Paper Award)
Interns: Zixian Cai and Zhengyang Liu

Distributed Training of Embeddings using Graph Analytics — IPDPS’21
Interns: Gurbinder Gill and Roshan Dathathri

CHET: An Optimizing Compiler for Fully-Homomorphic Neural-Network Inferencing — PLDI’19
Intern: Roshan Dathathri

Scaling Distributed Training with Adaptive Summation — MLSys’21

Semantics-Preserving Parallelization of Stochastic Gradient Descent — IPDPS’18