ARO: A new lens on matrix optimization for LLMs
- Anson Ho, Microsoft; Wenbo Gong, Microsoft; Chao Ma, Microsoft
- Microsoft Research Forum | Season 2, Episode 3
We present Adaptively Rotated Optimization (ARO), a matrix optimizer that speeds up LLM training by applying updates in a rotated, geometry-aware coordinate system. Guided by new insights into global structures of LLM loss landscapes, ARO treats rotation as a unifying principle for sample efficiency and proposes a new update policy that is applicable to all model weight matrices. In large-scale controlled experiments, ARO consistently outperforms AdamW and orthogonalization-based methods, maintaining its gains as models and training budgets scale.
Transcript
ARO: A new lens on matrix optimization for LLMs
Training large language models efficiently is one of the biggest challenges in AI right now.
From Microsoft Research Cambridge, Chao and Wenbo introduce Adaptively Rotated Optimization, or ARO, a new approach to optimization that uses geometry-aware updates to significantly improve training efficiency at scale. This is fresh work coming straight out of the lab, with timely results and big implications for both research and production systems.
Handing it over to you, Chao and Wenbo.
Hello everyone. My name is Chao Ma and I’m a senior researcher at Microsoft Research Cambridge. We have been working on innovations in AI efficiency, and together with my colleague Wenbo, I will talk about ARO, a new matrix optimization framework for large models. This has been a collaboration with the amazing team at MSR Cambridge.
To begin with, training AI is expensive, and the optimizer crucially determines how effectively we turn compute into intelligence. The industry has been dominated by AdamW for a decade now. Matrix-based optimizers like Muon are emerging, using gradient orthogonalization methods to improve sample efficiency while supporting production-scale training.
From our perspective, we view orthogonalization as one example of this new optimization research landscape, and the goal of our work is to develop new paradigms in optimization that push the efficiency frontier even further. To explain our approach, let me start with the basics. Gradient descent is the simplest optimizer: it computes the gradient g and uses it to update the weights w.
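The basic update just described can be sketched in a few lines. This is a minimal illustration on a toy quadratic, not ARO itself:

```python
import numpy as np

def gradient_descent_step(w, g, lr=0.1):
    """Plain gradient descent: follow the raw gradient g."""
    return w - lr * g

# Minimize f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = gradient_descent_step(w, w)  # g = w for this quadratic
# After 50 steps, w has shrunk toward the minimum at the origin.
```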
Adaptive optimizers like Adam also follow the gradient, but they adaptively rescale it before applying the weight update. The adaptive rescaling is implemented by some nonlinear mapping F, and more recent matrix optimizers like orthogonalization, Shampoo, and SOAP are more complicated.
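The adaptive rescaling Adam performs can be sketched as follows. This is a minimal single-tensor version with standard default hyperparameters assumed, not the production implementation:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: rescale the gradient by running first/second
    moment estimates before applying the weight update."""
    m = b1 * m + (1 - b1) * g            # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * g**2         # second moment (running mean of squares)
    m_hat = m / (1 - b1**t)              # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage on the toy quadratic f(w) = ||w||^2 / 2, where g = w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, w, m, v, t, lr=0.05)
```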
However, despite very different motivations and derivations, they are fundamentally similar to Adam. After some simplifications, we can show that they run Adam-like optimization, but in a rotated coordinate system. Concretely, this happens by first rotating the gradient by R transpose, then applying the nonlinear map F in the rotated basis, and finally rotating back.
In the matrix optimization literature, F is typically a variant of Adam, and R is chosen as the singular vectors of G.
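The rotate–map–rotate-back recipe can be sketched as below. The map f here is a hypothetical stateless stand-in for Adam's per-coordinate rescaling (real optimizers keep moment state), and R is taken from the SVD of the gradient, as the talk describes:

```python
import numpy as np

def rotated_update(G, R, f):
    """Apply a nonlinear map f in a rotated coordinate system:
    rotate the gradient by R^T, apply f, then rotate back with R."""
    return R @ f(R.T @ G)

# Stateless stand-in for Adam's per-coordinate rescaling (assumption for brevity).
f = lambda X: X / (np.abs(X) + 1e-8)

G = np.random.default_rng(0).standard_normal((6, 4))
U, _, _ = np.linalg.svd(G, full_matrices=False)  # R chosen as singular vectors of G
update = rotated_update(G, U, f)                 # same shape as G, applied as w -= lr * update
```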
This reinterpretation is not entirely new, but based on the insights from our research, we believe rotation should not just be a reinterpretation, but a core design principle for optimizers. This perspective opens the door to many new ideas in optimization. In particular, we propose the use of more powerful nonlinear mappings beyond Adam’s rescaling and, more significantly, better rotations that are informed by the choice of F and deliver improved sample efficiency.
And that gives our framework: adaptively rotated optimization, or ARO. Before going into the details of ARO’s performance, let me first comment on why we believe rotations are so important. We hypothesize that it is rooted in symmetries of the loss landscape. Current language model architectures, for example transformers, are rich in rotational symmetries in their weights: when applied correctly, those rotations leave the model’s predictions unchanged.
ARO exploits this property by moving along those symmetric orbits without changing the model’s output, in search of updates that are more effective at navigating the loss landscape. This symmetry hypothesis also enables additional features of ARO, for example leveraging cross-layer couplings for better performance.
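The rotational symmetry argument can be checked numerically for two consecutive linear layers (a simplifying assumption, with no nonlinearity between them): rotating the shared hidden basis changes both weight matrices but leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # first layer:  8 -> 16
W2 = rng.standard_normal((4, 16))   # second layer: 16 -> 4
x = rng.standard_normal(8)

# Random rotation of the 16-dimensional hidden space (orthogonal matrix via QR).
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))

y_original = W2 @ (W1 @ x)
y_rotated = (W2 @ Q.T) @ ((Q @ W1) @ x)  # rotate W1 by Q, compensate in W2

# The weights moved along a symmetry orbit; predictions are unchanged.
assert np.allclose(y_original, y_rotated)
```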
In summary, ARO is a general optimization framework that not only explains many existing matrix optimizers, but also enables the discovery of new update rules that work very well in practice. Now I will hand over to Wenbo to work through our key results.
Thanks, Chao, for introducing ARO. Hi, I’m Wenbo. Since we have covered the key features of ARO, next I will briefly talk about how it performs in practice. To ensure a fair comparison, we have carefully built an optimizer benchmarking protocol that aims to mitigate many potential forms of evaluation bias.
First, we collaborated with Microsoft Research Asia and evaluated ARO on their newly developed Sigma model. We measured optimizer performance using the relative speedup in steps over AdamW, shown on the y-axis; the x-axis indicates the data scale, where each unit represents the compute-optimal dataset size.
For example, this point indicates a 1.36× speedup at the 8× overtraining factor: ARO only uses about 73% of the data, or GPU time, to reach the same loss as AdamW at that training stage. Therefore, the higher the speedup, the more efficient the optimizer. From the plot we can see the results form two clusters. The lower cluster represents orthogonalization-based methods, including Muon. The upper cluster shows two ARO variants. The results show a clear performance gain for ARO, which persists across data scales. Importantly, ARO can optionally be used as a full-model optimizer, where all matrix parameters in the model are updated under a single rule, while some orthogonalization methods, although we do not show it here, can be unstable in that setting.
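The conversion from step speedup to data savings is simple arithmetic: a speedup of s means reaching the same loss with roughly 1/s of the data or GPU time.

```python
speedup = 1.36
data_fraction = 1 / speedup    # fraction of AdamW's data needed to match its loss
print(f"{data_fraction:.1%}")  # roughly 73-74% of the data / GPU time
```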
Overall, this suggests ARO converges faster and delivers better sample efficiency under our rigorous benchmarking protocol. Next, we stress-tested ARO across model scales to check whether its efficiency gain holds as we scale up. We evaluated ARO on models from 0.3 billion up to 8 billion parameters, covering different architectures and training regimes.
This plot shows the speedup of ARO versus Muon and AdamW. From this we observe that ARO maintains consistent advantages, around 1.3× and 1.1× over AdamW and Muon, respectively. As model size increases, we do not see the advantage shrinking at larger scales.
All of the experiments use our efficient distributed ARO implementation. In our 8-billion-parameter setup, the per-step runtime is on par with AdamW, meaning that the speedup translates directly into wall-clock time savings. Overall, these results suggest that the ARO speedup is scale-robust. Based on this trend, it is likely to maintain its advantages over AdamW and Muon as we move to even larger models. To summarize, ARO is a general framework for matrix optimizers. It not only unifies many existing ones; it also provides design opportunities for novel, more efficient optimizers.
Additionally, ARO can be easily extended to incorporate new developments from the matrix optimizer community for further improvements. Our empirical verification shows consistent advantages over both AdamW and Muon across different model and data scales. Overall, we believe ARO is a competitive optimizer for training large models at scale.
We want to thank all the collaborators at Microsoft Research for their valuable contributions, discussions, and feedback. If you are interested, please find the paper link below.