Dion: The distributed orthonormal update revolution is here
- Maya Murad, Microsoft; Kwangjun Ahn, Microsoft
- Microsoft Research Forum | Season 2, Episode 1
Kwangjun Ahn, Senior Researcher at Microsoft Research AI Frontiers, introduces Dion, a next-generation optimizer in the style of Muon that orthonormalizes only the top-r subspace via amortized power iteration. Dion retains Muon’s fast convergence while significantly reducing compute and communication, scaling efficiently with FSDP/TP for very large models.
Explore more
Dion optimizer algorithm
GitHub
Dion: Distributed Orthonormalized Updates
September 2025
Transcript
Dion: The distributed orthonormal update revolution is here
[MUSIC]
MAYA MURAD: Even the best models are limited by how we train them at scale. Training today’s state-of-the-art systems can take millions of GPU hours, and once workloads are spread across many machines, a major bottleneck emerges: communication.
At every step, GPUs must exchange and combine their updates, and that synchronization can be both slow and costly. That’s where Dion comes in.
Let’s hear from my fellow AI Frontiers colleague Kwangjun, joining us from the New England lab to share how this next-gen optimizer makes training faster and more scalable. Let’s dive in.
[MUSIC]
KWANGJUN AHN: Hi, everyone. My name is Kwangjun Ahn. I’m a senior researcher in Microsoft Research AI Frontiers. Today, we’ll talk to you about our new optimizer, Dion.
As we all know, training state-of-the-art AI models requires millions of GPU hours, so it is really important that we come up with a good optimizer for training AI models. For the longest time, we have been using Adam, or its weight-decay variant AdamW, for training our favorite AI models. But here the question is: is Adam the end of history?
Recently, there has been an orthonormal-updates revolution; for example, Muon, which has been shown to converge faster by roughly a factor of 2 while also giving stable training and being big-batch friendly. And a lot of excitement is building around orthonormal updates, especially after the release of two production-level models, Kimi K2 and GLM-4.5. So let’s pause for a bit and talk about what orthonormal updates are.
In order to explain orthonormal updates, let’s first review standard SGD. Standard SGD takes the current parameter, say \(X_{t-1}\), and updates along the negative direction of the gradient matrix \(G_t\), with a learning rate \(\eta\), to obtain the next set of weights: \(X_t = X_{t-1} - \eta G_t\). In contrast, orthonormal updates do the following: we take the matrix \(G_t\), singular-value decompose it into \(U \Sigma V^T\), and take only \(U V^T\) as the update direction for that optimizer step. At this point, people might wonder: isn’t singular value decomposition too expensive to run at every optimizer step? In fact, the popular Muon algorithm implements this orthonormalization via iterative matrix multiplications called the Newton-Schulz iteration. The theoretical justification behind orthonormal updates is as follows. If you think of a matrix parameter \(M\) as a linear map that transforms an input activation \(a\) into an output activation \(b = Ma\), then making the optimizer’s update orthonormal ensures that whatever input activation comes in, it gets transformed by an equal amount. This underlying principle surprisingly leads to very effective practice: faster convergence, stable training, and so on.
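To make the contrast concrete, here is a minimal NumPy sketch of the update rules just described: plain SGD, the exact orthonormalized update via an SVD, and a Newton-Schulz-style approximation that uses only matrix multiplies. The cubic polynomial and step count here are illustrative assumptions; Muon’s actual implementation uses a tuned higher-order polynomial.

```python
import numpy as np

def orthonormalize_svd(G):
    """Exact orthonormalized direction: G = U @ diag(S) @ Vt  ->  U @ Vt."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def orthonormalize_newton_schulz(G, steps=10):
    """Approximate U @ Vt using only matrix multiplies (textbook cubic
    Newton-Schulz iteration; Muon uses a tuned higher-order polynomial)."""
    X = G / (np.linalg.norm(G) + 1e-8)   # scale so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes singular values toward 1
    return X

# One optimizer step on a weight matrix (shapes and learning rate are illustrative).
rng = np.random.default_rng(0)
X_prev = rng.normal(size=(64, 32))
G_t = rng.normal(size=(64, 32))
eta = 0.02

X_sgd     = X_prev - eta * G_t                                  # standard SGD step
X_orth    = X_prev - eta * orthonormalize_svd(G_t)              # orthonormal update
X_orth_ns = X_prev - eta * orthonormalize_newton_schulz(G_t)    # Muon-style approximation
```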
So then, is Muon the end of history? It turns out Muon hits a wall when trying to scale up to models bigger than 100 billion parameters, especially dense models. The underlying procedure of a Muon step that I described, the matrix-multiplication iteration called the Newton-Schulz iteration, needs dense matrix multiplications on full matrices, which clashes with the sharded weights common in distributed training; materializing those full matrices requires heavy cross-shard communication collectives or redundant computation. This is where Dion comes in.
Dion is basically a more distributed-training-friendly orthonormal update. The core question we would like to answer in this project is: can we design orthonormal updates without full-matrix materialization, so that they are more friendly to sharded weights and hence scale better?
Dion answers precisely this question by designing orthonormal updates that are more distributed-training friendly. First of all, being an orthonormal update, Dion preserves all of Muon’s benefits that I mentioned; for example, faster convergence and stable training. But we implement these orthonormal updates by rethinking the linear algebra. Instead of employing the Newton-Schulz iteration, which requires dense full matrix-matrix multiplies, we use an amortized power iteration, which is more compatible with training on distributed, sharded weights. Crucially, while doing that, we introduce a new scalability lever in the form of a low-rank fraction, and we’ll talk about this in a bit.
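As a rough illustration of what an amortized power-iteration step can look like, here is a NumPy sketch under simplifying assumptions; the exact Dion update rule, its scaling, and its interaction with momentum and error feedback are spelled out in the paper and the GitHub repo. The key idea shown here is that the right factor Q is carried across optimizer steps, so a single cheap iteration per step is enough to keep tracking the top-r subspace.

```python
import numpy as np

def amortized_power_iteration_step(M, Q):
    """One power-iteration step against a buffer M, warm-started with the right
    factor Q from the previous optimizer step (illustrative sketch only).

    M : (m, n) gradient/momentum buffer
    Q : (n, r) right factor carried over from the previous step
    """
    P = M @ Q                         # (m, r): project onto the tracked subspace
    P, _ = np.linalg.qr(P)            # column-orthonormalize the left factor
    R = M.T @ P                       # (n, r): refreshed right factor
    Q_next, _ = np.linalg.qr(R)       # carried forward to the next step
    update = P @ Q_next.T             # rank-r orthonormalized update direction
    return update, Q_next

# Toy usage with a 1/4 rank fraction (shapes are illustrative).
rng = np.random.default_rng(0)
m, n, r = 128, 64, 16
M = rng.normal(size=(m, n))
Q = np.linalg.qr(rng.normal(size=(n, r)))[0]
update, Q = amortized_power_iteration_step(M, Q)
```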
First, let me show the nice features of Dion. As I said, it enhances the scalability of orthonormal updates. With the low-rank fraction, which we’ll talk about shortly, the Dion update rule scales much better, with less communication, as models grow.
So here we’re showing microbenchmark results. On the left-hand side is a square-matrix benchmark where we measure time per optimizer step across different optimizers: Muon, Dion with a 1/4 rank fraction, and Dion with a 1/16 rank fraction. As you can see, as the matrix size becomes larger, Dion with a low rank fraction becomes a lot more favorable in terms of time per step.
On the right-hand side, we show the specific case of the Llama 3 405-billion-parameter dense-model training configuration and benchmark each optimizer’s time per step. As you can see, Dion becomes a lot more tractable than Muon in this setting. And again, as I said, Dion is more compatible with sharded-weight training. In particular, in our paper and in our open-source code base, you can find efficient implementations of Dion for both one-way and two-way sharding.
Another interesting thing about Dion is that, because we’re using amortized power iteration, it allows a lot more algorithmic flexibility. For example, in our paper, there’s a Lazy-Dion variant, which leads to further speedups.
So let me talk about low-rankness, or the low-rank fraction, the new scalability lever we introduce in the Dion update rule. The low-rank fraction controls the rank of each update, making the update cheaper to compute and leaving less to communicate across devices, which leads to big speedups. Crucially, we make this work by introducing an error-feedback mechanism, which maintains update quality at lower ranks.
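The sketch below shows one way error feedback can be wired around a rank-r update; it is an illustration under assumed details (the residual decay factor mu is made up here), not the exact Dion rule. The point is that whatever the rank-r factors fail to capture stays in the buffer and gets another chance on later steps, so a small rank fraction does not mean the missed signal is permanently dropped.

```python
import numpy as np

def low_rank_step_with_error_feedback(M, Q, mu=0.95):
    """Rank-r update with error feedback (illustrative sketch, not the exact
    Dion rule; mu is an assumed residual decay factor)."""
    P = np.linalg.qr(M @ Q)[0]        # (m, r) left factor, column-orthonormal
    R = M.T @ P                       # (n, r) right factor
    residual = M - P @ R.T            # part of M the rank-r approximation missed
    M_next = mu * residual            # error feedback: keep the residual around
    Q_next = np.linalg.qr(R)[0]       # warm start for the next step
    update = P @ Q_next.T             # what actually gets applied to the weights
    return update, M_next, Q_next
```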
Another interesting thing about Dion is that, across scale, it seems we can get away with low-rankness. Here we’re showing a result from our empirical scaling study: as we increase the model size, the gap between higher-rank Dion and lower-rank Dion becomes narrower and narrower, showing that at larger scale you can get away with a smaller rank fraction. We also add a PowerSGD-style compressed data-parallel sync mechanism, which allows for even less communication and can be very beneficial in hybrid sharding strategies.
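For a sense of how a PowerSGD-style compressed sync saves communication, here is a single-process sketch that simulates data-parallel workers as a plain Python list; in real training, the two averages would be all-reduce collectives, and only the (m, r) and (n, r) factors would cross the wire instead of the full (m, n) gradient. This is a generic PowerSGD-style illustration, not Dion’s exact sync path.

```python
import numpy as np

def powersgd_style_sync(local_grads, Q):
    """Compress-then-sync, PowerSGD style (single-process simulation).
    Per matrix, communication drops from m*n to (m + n)*r numbers."""
    P_locals = [G @ Q for G in local_grads]       # cheap local projections, (m, r)
    P = np.mean(P_locals, axis=0)                 # <-- all-reduce of P in practice
    P = np.linalg.qr(P)[0]                        # orthonormalize after the reduce
    R_locals = [G.T @ P for G in local_grads]     # (n, r) per worker
    R = np.mean(R_locals, axis=0)                 # <-- all-reduce of R in practice
    return P @ R.T                                # shared low-rank synced gradient

# Toy usage with 4 simulated workers and a 1/4 rank fraction.
rng = np.random.default_rng(0)
m, n, r = 128, 64, 16
grads = [rng.normal(size=(m, n)) for _ in range(4)]
Q = np.linalg.qr(rng.normal(size=(n, r)))[0]
synced = powersgd_style_sync(grads, Q)
```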
So let me summarize what Dion is about.
Dion, again, is an orthonormalized update for large models, and it is more scalable for larger models. In particular, as you saw in the benchmark results, Dion is feasible even in the Llama 3 405B dense-model setting. Crucially, we introduce low-rankness, or the low-rank fraction, a new scalability axis, a scalability lever, that reduces compute and communication in the optimizer step. Another nice thing about Dion is that it’s fully compatible with weight-sharding strategies. As I highlighted, it allows great algorithmic flexibility; for example, in the paper, you can find the Lazy-Dion variant and an effective-rank variant. We have fully open sourced both the one-way-sharding and two-way-sharding implementations of Dion at the following GitHub link.
As a final remark, optimizers are used everywhere to train AI models, so it is important that we adopt this orthonormal-updates revolution to save FLOPs and train models efficiently.
I’d like to also thank our collaborators, both from AI Frontiers and our interns. And I’d like to thank Microsoft Research for letting us open source this project.
Thank you for listening.
Maya Murad
Senior Technical PM, AI Frontiers
Kwangjun Ahn
Senior Researcher