Dion2: A new, simple method to shrink matrices in Muon

Dion2 reduces the cost of Muon’s orthonormalization step by orthonormalizing only a small, selected submatrix at each iteration. This lightweight approach preserves Muon’s strong performance while significantly improving the optimizer’s scalability at large scale.


Transcript


Working out of the New York City lab on the AI Frontiers team, Kwangjun is here to introduce Dion2: a simple yet powerful method that makes advanced optimizers more scalable by shrinking the expensive computations they rely on. Dion2 preserves performance while dramatically reducing cost, opening the door to faster and more flexible training at scale. It’s a great example of how elegant, focused research can have a massive impact on real-world AI systems. Let’s hear more.

Hey, I’m Kwangjun. I’m here to talk about Dion2, a new, simple method to shrink matrices in Muon. Our work is in line with the recent wave of optimizer research for large-scale training. AI models consume a lot of compute, and we want to do this better. The current de facto standard for training a model is an optimizer called AdamW, and it is extremely popular.

And here the question is: can we do better than this default algorithm, AdamW? Recently, a new contender to AdamW has appeared, called Muon, an orthonormalizing optimizer. It is an optimizer for matrix parameters, which make up nearly all parameters in modern neural networks.

The key idea here is to orthonormalize each update matrix. Muon has a clean theoretical motivation and strong empirical performance; in particular, it has been adopted in frontier models such as Kimi and GLM.
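To make the orthonormalization step concrete, here is a minimal NumPy sketch of the Newton-Schulz iteration that public Muon implementations commonly use to approximately orthonormalize an update matrix. The quintic coefficients and step count below follow widely circulated Muon code, but exact values vary across implementations; treat this as an illustration, not the authors’ reference implementation.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthonormalize G (push its singular values toward 1).

    Uses the quintic Newton-Schulz iteration popularized by Muon
    implementations; coefficients are an assumption, not from this talk.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Because the iteration is a polynomial in `X @ X.T`, it acts independently on each singular value, which is why a handful of cheap matrix multiplies suffice instead of an exact (and much slower) SVD.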

So should we all switch to Muon? Not so fast, because it turns out it’s a bit tricky to scale Muon to larger models. That’s mostly because Muon relies on matrix-level computations, which conflict with distributed training, where weights are sharded; and the core Muon computation, orthonormalization, has superlinear complexity.

This is in contrast with AdamW, which relies only on element-wise operations; hence it is scalable and compatible with all the distributed training frameworks. So here’s our idea: can we add a scalability knob to Muon to make it more scalable? What I mean by that is: let’s add a parameter controlling how much of the update matrix we’re orthonormalizing.

Let’s call it beta, which is between 0 and 1. When beta is 1.0, it recovers the original Muon, orthonormalizing the full matrix. When beta is chosen strictly less than 1.0, we’re doing partial orthonormalization, leading to cheaper compute and less communication, while hopefully retaining all the benefits of Muon.
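The superlinear-cost point above is why shrinking the matrix pays off more than proportionally. As a back-of-the-envelope sketch (assuming a Newton-Schulz-style iteration, square matrices, and counting only the dominant matrix multiplies; this is my rough model, not a benchmark from the talk), orthonormalizing a beta fraction of rows saves more than a 1/beta factor of FLOPs:

```python
def ns_flops(m, n, steps=5):
    # Rough dominant-term FLOP model for a Newton-Schulz-style step on
    # an m x n matrix (m <= n): A = X X^T costs ~2 m^2 n, A @ A costs
    # ~m^3, and the final multiply by X costs ~2 m^2 n.
    return steps * (2 * m * m * n + m ** 3 + 2 * m * m * n)

m = n = 4096
full = ns_flops(m, n)                  # full Muon (beta = 1.0)
partial = ns_flops(int(0.25 * m), n)   # beta = 0.25: only 1/4 of rows
ratio = full / partial                 # about 18.8x fewer FLOPs
print(ratio)
```

Because the m³ and m²n terms shrink superlinearly in m, a beta of 0.25 cuts the modeled cost by roughly 19x, not just 4x.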

So the scalability knob, again, is chosen between 0 and 1, and the goal is that when the scalability factor beta is chosen strictly less than one, the update still preserves Muon’s strong update quality, even though we orthonormalize only a fraction of the matrix. That’s where Dion2 comes in: a very simple method to shrink the matrix size in Muon.

And here is the outline of our results. On the left-hand side, we show an optimizer-step benchmark comparing Muon and Dion2 at various fractions; you can see that as the fraction beta becomes smaller, Dion2 achieves a larger speedup over Muon. On the right-hand side, we show training runs for Muon and Dion2; even with 25% orthonormalization, Dion2 achieves update quality very competitive with Muon.

The remarkable thing about Dion2, which makes it practical, is its simplicity. It first picks the top beta fraction of rows, or neurons, based on their norms; then it orthonormalizes only the chosen rows and decays the selected rows, so that diverse neurons get selected across different training steps. This is in contrast with Dion1, which relied on a somewhat complicated low-rank approximation for efficiency; its reliance on linear-algebra routines comes with overhead at small scale, which makes it less practical.
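The three steps just described (select top-beta rows by norm, orthonormalize only those rows, decay the selected rows) can be sketched as follows. This is my own minimal NumPy reconstruction from the description above, not the official implementation: the function name, the `decay` hyperparameter, and the use of an exact SVD-based polar factor (where Muon would use a cheap Newton-Schulz approximation) are all assumptions.

```python
import numpy as np

def dion2_style_update(M, beta=0.25, decay=0.9):
    """One hypothetical Dion2-style step on a momentum/update matrix M.

    Sketch only: selects the top-beta fraction of rows by norm,
    orthonormalizes just that submatrix, and decays the selected rows
    of M in place so other neurons get picked on later steps.
    """
    k = max(1, int(beta * M.shape[0]))
    norms = np.linalg.norm(M, axis=1)
    idx = np.argsort(norms)[-k:]          # top-k rows (neurons) by norm
    sub = M[idx]                          # k x n submatrix

    # Orthonormalize only the chosen rows via the SVD polar factor
    # (Muon-style code would approximate this with Newton-Schulz).
    U, _, Vt = np.linalg.svd(sub, full_matrices=False)
    ortho = U @ Vt

    update = np.zeros_like(M)
    update[idx] = ortho                   # update touches only chosen rows
    M[idx] *= decay                       # decay so selection rotates
    return update
```

The decay step is what makes the row selection cycle: once a neuron’s rows have been used, their norms shrink, so previously unselected neurons rise to the top on subsequent steps.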

Also, in our experience, Dion2 achieves better update quality than Dion1, making it a better and more practical optimizer. Please try out Dion2 in your codebase. You can use the same configuration as Muon, as long as Muon is already integrated in your codebase. In our experience, beta equal to 0.5 or 0.25 leads to an algorithm very competitive with full Muon.

This matches our experience integrating Dion2 and Muon into our codebase: Dion2 with beta of 0.5 or 0.25 achieves very competitive update quality. In other words, once Muon is in your codebase, Dion2 is plug and play. As for references, you can check out the Dion2 implementation in our Microsoft Dion codebase, and you can find more details about Dion2 in our paper.

Thank you for listening.