Dion: Distributed Orthonormalized Updates

arXiv

Recent work has shown that orthonormal matrix updates speed up neural network optimization, improve training stability, and offer better hyperparameter transfer across model sizes. Applying these updates efficiently when model weights and optimizer states are sharded across a large-scale distributed LLM training system remains a major challenge. We introduce Dion (DIstributed OrthoNormalization), a scalable and communication-efficient orthonormalizing optimizer. Dion leverages low-rank approximation and decoupled momentum buffers, eliminating the need for full gradient synchronization while producing numerically equivalent results. It is compatible with simultaneous DDP, FSDP, and TP parallelism, and it computes an orthonormalized update without unsharding a full parameter matrix on any single device. We evaluate Dion on language models from 120M to 3B parameters and find that its benefits improve with increasing model size and batch size.

Related Tools

Dion: Distributed Orthonormal Updates

Dion is a scalable optimizer that accelerates neural network training by applying orthonormal weight updates using amortized power iteration, which works efficiently on sharded matrices. It reduces communication overhead through low-rank compression and error feedback, offering faster convergence compared to traditional methods like Adam and Muon.

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn, Senior Researcher at Microsoft Research AI Frontiers, introduces Dion, a next-generation optimizer in the style of Muon that orthonormalizes only the top-r subspace via amortized power iteration. Dion retains Muon’s fast convergence while significantly reducing compute and communication, scaling efficiently with FSDP/TP for very large models.

Explore more

Dion optimizer algorithm (opens in new tab)
GitHub

Dion: Distributed Orthonormalized Updates
September 2025

Transcript

Dion: The distributed orthonormal update revolution is here

[MUSIC]

MAYA MURAD: Even the best models are limited by how we train them at scale. Training today’s state-of-the-art systems can take millions of GPU hours, and once workloads are spread across many machines, a major bottleneck emerges: communication.

At every step, GPUs must exchange and combine their updates, and that synchronization can be both slow and costly. That’s where Dion comes in.

Let’s hear from my fellow AI Frontiers colleague Kwangjun, joining us from the New England lab to share how this next-gen optimizer makes training faster and more scalable. Let’s dive in.

[MUSIC]

KWANGJUN AHN: Hi, everyone. My name is Kwangjun Ahn. I’m a senior researcher in Microsoft Research AI Frontiers. Today, we’ll talk to you about our new optimizer, Dion.  

As we all know, training state-of-the-art AI models requires millions of GPU hours, so it is really important that we come up with a good optimizer for training AI models. For the longest time, we have been using Adam, or its weight-decay variant AdamW, for training our favorite AI models. But here the question is: is Adam the end of history?

Recently, there has been an orthonormal-updates revolution; for example, Muon, which has been shown to lead to faster convergence by up to a factor of 2, as well as more stable training and big-batch friendliness. A lot of excitement has built up around orthonormal updates, especially after the release of two production-level models, Kimi-K2 and GLM-4.5. So let's pause for a bit and talk about what orthonormal updates are.

In order to explain orthonormal updates, let's first review standard SGD. Standard SGD takes the current parameter, say \(X_{t-1}\), and updates along the negative direction of the gradient matrix \(G_{t}\), with a learning rate \(\eta\), to obtain the next set of weights, \(X_{t} = X_{t-1} - \eta G_{t}\). In contrast, orthonormal updates do the following: we take the matrix \(G_{t}\), compute its singular value decomposition \(U \Sigma V^T\), and use only \(U V^T\) as the update direction for that optimizer step. At this point, people might wonder: isn't a singular value decomposition too expensive to run at every optimizer step? In fact, the popular Muon algorithm implements this orthonormalization via iterative matrix multiplications, the Newton-Schulz iteration. The theoretical justification behind orthonormal updates is as follows. If you think of a matrix parameter \(M\) as a linear map that transforms an input activation \(a\) into an output activation \(b\), then making the optimizer's update orthonormal ensures that whatever input activation comes in, it is transformed by an equal amount. This underlying principle surprisingly leads to very effective practice: faster convergence, stable training, and so on.
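To make the two constructions above concrete, here is a minimal NumPy sketch (our own illustrative code, not the Dion or Muon implementation) of an orthonormalized step via SVD, plus the textbook cubic Newton-Schulz iteration; Muon itself uses a tuned quintic variant rather than this basic one.

```python
import numpy as np

def orthonormal_update(X, G, lr):
    """One orthonormalized step: replace the gradient G by U V^T from its SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)  # G = U diag(S) V^T
    return X - lr * (U @ Vt)  # every singular value of the update is 1

def newton_schulz(G, steps=20):
    """Approximate U V^T without an SVD, via the textbook cubic Newton-Schulz
    iteration (Muon uses a tuned quintic; this is the simplest variant)."""
    X = G / np.linalg.norm(G)  # Frobenius norm => singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # drives all singular values to 1
    return X

# Example: the update direction is orthonormal regardless of G's conditioning.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
X = np.zeros((4, 3))
X_new = orthonormal_update(X, G, lr=1.0)
# columns of the applied update are orthonormal: (U V^T)^T (U V^T) = I
print(np.allclose((X - X_new).T @ (X - X_new), np.eye(3)))  # True
```

Both routines produce the same direction; the iterative version trades the SVD for a handful of dense matrix multiplies, which is exactly the cost that becomes problematic once weights are sharded.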

So then, is Muon the end of history? It turns out Muon hits a wall when trying to scale up to models bigger than 100 billion parameters, especially dense models. The underlying procedure of a Muon step as I described, the Newton-Schulz matrix-multiplication iteration, needs dense matrix multiplication on full matrices, which clashes with the sharded weights common in distributed training; reconstructing full matrices requires heavy cross-shard communication collectives or redundant computation. This is where Dion comes in.

Dion is basically a more distributed-training-friendly orthonormal update. The core question we would like to answer in this project is: can we design orthonormal updates without full-matrix materialization, making them friendlier to sharded weights and hence more scalable?

Dion precisely answers this question by designing orthonormal updates that are more distributed-training friendly. First of all, being an orthonormal update, Dion preserves all of Muon's benefits that I mentioned; for example, faster convergence and stable training. But we implement these orthonormal updates by rethinking the linear algebra. Instead of employing the Newton-Schulz iteration, which requires dense full matrix-matrix multiplies, we use amortized power iteration, which is more compatible with sharded-weight distributed training. Crucially, while doing that, we introduce a new scalability lever in the form of a low-rank fraction, and we'll talk about this in a bit.
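As a rough illustration of the amortized power iteration idea, here is a NumPy sketch; the names, shapes, and warm-starting scheme are our assumptions for illustration, not the paper's exact algorithm. Each optimizer step runs a single power-iteration refinement on a buffer, reusing the right factor from the previous step, so the cost per step is a few tall-skinny matrix multiplies rather than a dense full-matrix iteration.

```python
import numpy as np

def power_iter_step(M, Q):
    """One amortized power-iteration step on buffer M, warm-started from the
    previous right factor Q (illustrative notation, not the paper's)."""
    P = M @ Q               # (m, r): project M onto the current subspace
    P, _ = np.linalg.qr(P)  # orthonormalize the left factor
    R = M.T @ P             # (n, r): refreshed right factor
    return P, R

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 6))   # stand-in for a momentum buffer
Q = rng.standard_normal((6, 2))   # rank fraction 2/6 in this toy example
for _ in range(30):               # warm-starting amortizes this over steps
    P, R = power_iter_step(M, Q)
    Q, _ = np.linalg.qr(R)        # re-orthonormalize for the next step
# the left factor has orthonormal columns; P @ R.T approximates M at rank 2
print(np.allclose(P.T @ P, np.eye(2)))  # True
```

Note that every operation here is a matrix multiply or a skinny QR on an (m, r) factor, which is what makes the step easy to express over row- or column-sharded matrices.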

First, let me show some nice features of Dion. As I said, it enhances the scalability of orthonormal updates. With the low-rank fraction, which we'll talk about shortly, Dion's update rule scales much better, with less communication, across model sizes.

So here we're showing microbenchmark results. On the left-hand side is a square-matrix benchmark where we measure the time per optimizer step across different optimizers: Muon, Dion with a 1/4 rank fraction, and Dion with a 1/16 rank fraction. As you can see, as the matrix size becomes larger, Dion with a low rank fraction becomes a lot more favorable in terms of time per step.

On the right-hand side, we're showing the specific case of the Llama 3 405B dense model training configuration and benchmarking each optimizer's time per step. As you can see, Dion becomes a lot more tractable than Muon in this setting. And again, as I said, Dion is more compatible with sharded-weight training. In particular, in our paper and also in our open-source code base, you can find efficient implementations of Dion for both one-way and two-way sharding.

Another interesting thing about Dion is that because we're using amortized power iteration, it allows a lot more algorithmic flexibility. For example, in our paper there is a Lazy-Dion variant, which leads to further speedups.

So let me talk about low-rankness, or the low-rank fraction, a new scalability lever that we introduce in the Dion update rule. The low-rank fraction controls the rank of each update, making the update cheaper to compute and leaving less to communicate across devices, which leads to big speedups. Crucially, we make this work by introducing an error-feedback mechanism, which maintains update quality at lower ranks.
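The low-rank-plus-error-feedback idea can be sketched as follows; this is an illustrative NumPy toy under our own naming, not Dion's actual buffer handling. Only a rank-r piece of the buffer is applied as the update, and the residual is retained so that signal discarded by compression is not lost but re-enters on later steps.

```python
import numpy as np

def compressed_step(M, Q):
    """Rank-r compression with error feedback (a sketch): apply only a
    low-rank part of M and keep the residual so no signal is dropped."""
    P = M @ Q
    P, _ = np.linalg.qr(P)
    R = M.T @ P
    approx = P @ R.T   # the rank-r part that is actually applied
    resid = M - approx # unapplied mass, fed back into the next step's buffer
    return approx, resid

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 6))
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
approx, resid = compressed_step(M, Q)
# error feedback loses nothing: compressed part + residual reconstruct M
print(np.allclose(approx + resid, M))  # True
```

The point of the mechanism is the `resid` term: at low rank the per-step compression is aggressive, but because the residual accumulates, the update quality over many steps degrades far less than the rank fraction alone would suggest.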

Another interesting thing about Dion is that, across scale, it seems we can get away with low-rankness. Here we're showing a result from our empirical scaling study: as we increase the model size, the gap between higher-rank Dion and low-rank Dion becomes narrower and narrower, showing that at larger scale you can get away with a smaller rank fraction. We also add a PowerSGD-style compressed data-parallel sync mechanism, which allows for even less communication and can be very beneficial in hybrid sharding strategies.

So let me summarize what Dion is about. 

Dion, again, is an orthonormalized update rule for large models, and it becomes more scalable as models grow. In particular, as you saw in the benchmark results, Dion is feasible even at the Llama 3 405B dense model scale. Crucially, we introduce low-rankness, or the low-rank fraction, a new scalability lever that reduces both compute and communication in the optimizer step. Another nice thing about Dion is that it's fully compatible with weight-sharding strategies. As I highlighted, it allows great algorithmic flexibility; for example, in the paper you can find the Lazy-Dion variant and an effective-rank variant. We have fully open sourced both our one-way and two-way sharding implementations of Dion at the following GitHub link.

As a final remark, optimizers are used everywhere to train AI models, so it is important that we adopt this orthonormal-updates revolution to save FLOPs and train models efficiently.

I’d like to also thank our collaborators, both from AI Frontiers and our interns. And I’d like to thank Microsoft Research for letting us open source this project. 

Thank you for listening. 

Dion2: A new simple method to shrink matrix in Muon

Dion2 reduces the cost of Muon's orthonormalization step by orthonormalizing only a small, selected submatrix at each iteration. This lightweight approach preserves Muon's strong performance while significantly improving the optimizer's scalability.

Explore more

Transcript

Dion2: A new simple method to shrink matrix in Muon

Working out of the New York City lab on the AI Frontiers team, Kwangjun is here to introduce Dion2: a simple yet powerful method that makes advanced optimizers more scalable by shrinking the expensive computations they rely on. Dion2 preserves performance while dramatically reducing cost, opening the door to faster and more flexible training at scale. It's a great example of how elegant, focused research can have a massive impact on real-world AI systems. Let's hear more.

Hey, I'm Kwangjun. I'm here to talk about a new, simple method to shrink the matrix in Muon, called Dion2. Our work is in line with the orthonormal optimizer revolution. AI models consume a lot of compute, and we want to do this better. The current de facto standard for training AI models is AdamW, and it is extremely popular.

And here the question is: can we do better than this default algorithm, AdamW? Recently there is a new contender to AdamW called Muon, which is an orthonormal optimizer. It is an optimizer for matrix parameters, which make up nearly all parameters in modern neural networks.

The key idea here is to orthonormalize each update matrix. It has a clean theoretical motivation and strong empirical performance. In particular, it has been adopted in frontier models such as Kimi and GLM.

So should we all switch to Muon? Not so fast, because it turns out it's a bit tricky to scale Muon to larger models. This is mostly because Muon relies on matrix-level computations, which conflict with distributed training, where weights are sharded, and because Muon's core computation, orthonormalization, has super-linear complexity.

This is in contrast with AdamW, which relies only on element-wise operations; hence it is scalable and compatible with every distributed training framework. So here's our idea: can we add a scalability knob to Muon to make it more scalable? What I mean by that is: let's add a parameter controlling how much of the update matrix we orthonormalize.

Let's call it beta, which is between 0 and 1. When beta is chosen to be 1.0, it recovers the original Muon, orthonormalizing the full matrix; when beta is strictly less than 1.0, we do partial orthonormalization, leading to cheaper compute and less communication, while hopefully retaining all the benefits of Muon.

So the scalability knob, again, is chosen between 0 and 1, and the goal is that when beta is strictly less than one, we still preserve Muon's strong update quality, even while orthonormalizing only a fraction of the matrix. That's where Dion2 comes in: a very simple method to shrink the matrix size in Muon.

And here is an outline of our results. On the left-hand side, we're showing an optimizer-step benchmark between Muon and Dion2 with various fractions, and you can see that as the fraction beta becomes smaller, Dion2 achieves a speedup over Muon. On the right-hand side, we're showing training runs for Muon and Dion2, and even with 25% orthonormalization, Dion2 achieves update quality very competitive with Muon.

The remarkable thing about Dion2, which makes it practical, is its simplicity. It first picks the top beta fraction of rows, or neurons, based on their norms; we then orthonormalize only the chosen rows and decay the selected rows so that diverse neurons get selected across different training steps. This is in contrast with the original Dion, which relied on a somewhat complicated low-rank approximation for efficiency; its reliance on linear-algebraic routines comes with overhead at small scale, which makes it less practical.
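The selection mechanism just described can be sketched roughly like this. The naming and several details are our assumptions for illustration (for example, how the norm scores accumulate across steps and the fact that unselected rows pass through unchanged), not the exact Dion2 recipe:

```python
import numpy as np

def dion2_step(G, beta, scores, decay=0.5):
    """Sketch of row selection: pick the top beta fraction of rows by
    accumulated norm score, orthonormalize only that submatrix, and decay
    the winners' scores so other rows get selected on later steps."""
    k = max(1, int(beta * G.shape[0]))
    scores = scores + np.linalg.norm(G, axis=1)  # accumulate row norms
    idx = np.argsort(scores)[-k:]                # indices of the top-k rows
    U, _, Vt = np.linalg.svd(G[idx], full_matrices=False)
    update = G.copy()                            # unselected rows: assumption
    update[idx] = U @ Vt                         # orthonormalize submatrix only
    scores[idx] *= decay                         # encourage diversity next step
    return update, scores, idx

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
update, scores, idx = dion2_step(G, beta=0.5, scores=np.zeros(6))
# the selected 3x4 submatrix has orthonormal rows
print(np.allclose(update[idx] @ update[idx].T, np.eye(3)))  # True
```

With beta = 0.5 only half the rows go through the expensive orthonormalization each step, and the score decay rotates which rows are chosen, which is what keeps the partial update competitive over many steps.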

Also, in our experience, Dion2 achieves better update quality than the original Dion, making it the better and more practical optimizer. Please try out Dion2 in your code base: you can use the same configuration as Muon, as long as Muon is already integrated in your code base. In our experience, beta equal to 0.5 or 0.25 leads to an algorithm very competitive with full Muon.

This was our experience while integrating Dion2 and Muon into the vibe code base, and you can see that Dion2 with beta 0.5 or 0.25 achieves very competitive update quality. In other words, once Muon is in your code base, Dion2 is plug and play. Here are some references: you can check out the Dion2 implementation in our Microsoft Dion code base, and you can find more details about Dion2 in our paper.

Thank you for listening.