{"id":1146334,"date":"2025-08-12T13:09:21","date_gmt":"2025-08-12T20:09:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1146334"},"modified":"2025-09-09T10:06:33","modified_gmt":"2025-09-09T17:06:33","slug":"dion-the-distributed-orthonormal-update-revolution-is-here","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/dion-the-distributed-orthonormal-update-revolution-is-here\/","title":{"rendered":"Dion: the distributed orthonormal update revolution is here"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1441\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-scaled.jpg\" alt=\"Three white icons on a gradient background transitioning from blue to green. From left to right: a network of interconnected nodes, a speedometer with the needle pointing right, and a flowchart with squares and a diamond shape.\" class=\"wp-image-1147793\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-scaled.jpg 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1066x600.jpg 1066w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/figure>\n\n\n\n<p>Training AI models requires choosing an optimizer, and for nearly a decade, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1412.6980\" target=\"_blank\" rel=\"noopener noreferrer\">Adam(<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&#8211;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1711.05101\" target=\"_blank\" rel=\"noopener noreferrer\">W)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. 
And yet, last December, a new optimizer called <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/kellerjordan.github.io\/posts\/muon\/\" target=\"_blank\" rel=\"noopener noreferrer\">Muon<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> showed serious promise by powering a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/KellerJordan\/modded-nanogpt\/tree\/master\" target=\"_blank\" rel=\"noopener noreferrer\">nanoGPT speedrun<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This proved out, with multiple AI labs (e.g., <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2502.16982\" target=\"_blank\" rel=\"noopener noreferrer\">Kimi-AI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2505.02222\" target=\"_blank\" rel=\"noopener noreferrer\">Essential-AI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) reporting 2x scale improvements and the release of the 1T parameter <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/moonshotai.github.io\/Kimi-K2\/\" target=\"_blank\" rel=\"noopener noreferrer\">Kimi K2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> model.&nbsp;Restated: you can train a model to similar performance with half as many GPUs.<\/p>\n\n\n\n<p>There\u2019s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which entails heavy communication in large models at the scale where <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2304.11277\" target=\"_blank\" rel=\"noopener noreferrer\">FSDP<span class=\"sr-only\"> (opens 
in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1909.08053\" target=\"_blank\" rel=\"noopener noreferrer\">TP<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> parallelization becomes desirable.&nbsp;Going back to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/jeremybernste.in\/writing\/deriving-muon\" target=\"_blank\" rel=\"noopener noreferrer\">inspiration for Muon,<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> the key idea is an orthonormal update, which sparked the search for more scalable linear-algebra alternatives that realize the same goal. That\u2019s exactly what <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/dion-distributed-orthonormalized-updates\/\" target=\"_blank\" rel=\"noreferrer noopener\">Dion<\/a> is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale. 
&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-s-an-orthonormal-update\">What\u2019s an orthonormal update?<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1422\" height=\"515\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion.png\" alt=\"Illustration of matrix parameters\" class=\"wp-image-1146808\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion.png 1422w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion-300x109.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion-1024x371.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion-768x278.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure1_Dion-240x87.png 240w\" sizes=\"auto, (max-width: 1422px) 100vw, 1422px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Illustration of matrix parameters<\/figcaption><\/figure>\n\n\n\n<p>At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. 
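As a minimal NumPy illustration (a sketch of the general idea only, not Dion's or Muon's actual implementation), replacing an update matrix with its nearest orthonormal matrix sets every singular value to 1, so the update changes the outputs by the same amount for any unit input direction:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 4))  # stand-in for a gradient/momentum matrix

# Orthonormalize: replace M = U @ diag(S) @ Vt with U @ Vt,
# which sets every singular value of the update to 1.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
update = U @ Vt

# The update now stretches every unit input direction equally:
x1, x2 = np.linalg.qr(rng.standard_normal((4, 2)))[0].T  # two random unit inputs
print(np.linalg.norm(update @ x1), np.linalg.norm(update @ x2))  # both ~1.0
```

With the raw matrix `M`, those two norms would differ according to `M`'s singular values; after orthonormalization they coincide, which is what lets the learning rate be set aggressively rather than for the worst-case input direction.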
This is achieved by enforcing <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Orthonormality\" target=\"_blank\" rel=\"noopener noreferrer\">orthonormality<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on the update matrix, thereby equalizing its effect across all input directions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-dion\">What is Dion?<\/h2>\n\n\n\n<p>While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.essential.ai\/blog\/infra\" target=\"_blank\" rel=\"noopener noreferrer\">Essential AI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, applying Muon to large architectures like LLaMA-3 becomes <em>compute-bound<\/em>\u2014and potentially <em>communication-bound<\/em>\u2014due to the cost of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.modula.systems\/algorithms\/newton-schulz\/\" target=\"_blank\" rel=\"noopener noreferrer\">Newton\u2013Schulz orthonormalization steps<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1685\" height=\"857\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion.png\" alt=\"Pseudocode of the centralized version of Dion\" class=\"wp-image-1146810\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion.png 1685w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion-300x153.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion-1024x521.png 1024w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion-768x391.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion-1536x781.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure2_Dion-240x122.png 240w\" sizes=\"auto, (max-width: 1685px) 100vw, 1685px\" \/><figcaption class=\"wp-element-caption\">Figure 2. Pseudocode of the centralized version of Dion<\/figcaption><\/figure>\n\n\n\n<p>This is where <strong>Dion<\/strong> enters. At a high level, Dion introduces a new axis for scalability: the <strong>rank<\/strong>. Specifically, for a given rank r, Dion orthonormalizes only the top r singular directions, reducing communication and compute overhead while preserving performance.&nbsp;Empirically, we observe that the necessary rank for good performance grows much more slowly than the number of parameters in larger models.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tool<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/dion\/\" data-bi-cN=\"Dion optimizer\" target=\"_blank\" rel=\"noopener noreferrer\" data-external-link=\"true\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Dion optimizer<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-open-in-new-tab\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>Dion implements orthonormalization using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1905.13727\" target=\"_blank\" rel=\"noopener 
<em>amortized power">
noreferrer\"><em>amortized power iteration<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em>.&nbsp;<\/em>Power iteration typically extracts the leading singular direction by repeated matrix multiplication.&nbsp;By amortizing this process over optimization steps\u2014applied to the slowly-evolving momentum matrix\u2014we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one.&nbsp;This amortized power iteration is fully compatible with standard distributed training techniques such as <strong>FSDP<\/strong> and <strong>tensor parallelism<\/strong>. In other words, we can orthogonalize a matrix <em>without ever seeing a full row or column of it<\/em>.&nbsp;Here, we show a simple centralized version, but the technique works for more complex forms of parallelization, as presented in the paper.<\/p>\n\n\n\n<p>Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. 
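<em>amortized power">
The pattern can be sketched in NumPy as follows. This is an illustrative, centralized simplification with hypothetical names and constants, not the exact algorithm of Figure 2 or the released code: two matrix multiplications against a persistent right basis `Q`, QR decompositions for orthonormality, and the unexplained residual folded back into momentum.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 16, 12, 4                               # matrix shape and low rank
Q = np.linalg.qr(rng.standard_normal((n, r)))[0]  # warm-started right basis
M = np.zeros((m, n))                              # momentum buffer
mu = 0.95                                         # momentum decay (illustrative value)

def dion_like_step(G, M, Q):
    """One low-rank orthonormalized step in the spirit of Dion (simplified)."""
    B = M + G                    # fold the new gradient into momentum
    P = np.linalg.qr(B @ Q)[0]   # matmul 1 + QR: approximate top-r left basis
    R = B.T @ P                  # matmul 2: corresponding right factor
    M_new = mu * (B - P @ R.T)   # error feedback: keep the unexplained residual
    Q_new = np.linalg.qr(R)[0]   # orthonormal right basis, reused next step
    update = P @ Q_new.T         # approximately orthonormal, rank <= r
    return update, M_new, Q_new

G = rng.standard_normal((m, n))
update, M, Q = dion_like_step(G, M, Q)
S = np.linalg.svd(update, compute_uv=False)
print(S[:r])  # the r nonzero singular values are all ~1
```

The `M_new` line is the error feedback step: whatever part of `B` the rank-r factors fail to capture stays in the momentum buffer rather than being discarded.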
This keeps the residual of low rank approximation in the momentum matrix so that any systematic gradient structure not initially captured accumulates to eventually be applied in a future update.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-does-it-work\">How does it work?<\/h2>\n\n\n\n<p>Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to <em>decrease<\/em> overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion\u2019s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"699\" height=\"414\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure-3_Dion.png\" alt=\"Wall-clock time speedup of Dion for 3B model training\" class=\"wp-image-1146815\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure-3_Dion.png 699w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure-3_Dion-300x178.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure-3_Dion-240x142.png 240w\" sizes=\"auto, (max-width: 699px) 100vw, 699px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Wall-clock time speedup of Dion for 3B model training<\/figcaption><\/figure>\n\n\n\n<p>Why would adding a constraint <em>improve<\/em> the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. 
Over increasing model scale and training steps, this small advantage accumulates\u2014leading to a measurable improvement in performance.<\/p>\n\n\n\n<p>This edge further grows with batch size\u2014with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"637\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure4_Dion.png\" alt=\"Scaling of Dion across different batch sizes\" class=\"wp-image-1146818\" style=\"width:596px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure4_Dion.png 786w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure4_Dion-300x243.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure4_Dion-768x622.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure4_Dion-222x180.png 222w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><figcaption class=\"wp-element-caption\">Figure 4. 
Scaling of Dion across different batch sizes<\/figcaption><\/figure>\n\n\n\n<p>Here you can see how the number of steps needed to reach a fixed pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and \u00bc-rank Dion (in orange) and Muon (in blue).<\/p>\n\n\n\n<p>In our experiments, these benefits extend to various post-training regimes as well.<\/p>\n\n\n\n<p>We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1893\" height=\"511\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion.png\" alt=\"Low-rank Dion across different model sizes\" class=\"wp-image-1146821\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion.png 1893w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion-300x81.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion-1024x276.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion-768x207.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion-1536x415.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Figure5_Dion-240x65.png 240w\" sizes=\"auto, (max-width: 1893px) 100vw, 1893px\" \/><figcaption class=\"wp-element-caption\">Figure 5. 
Low-rank Dion across different model sizes<\/figcaption><\/figure>\n\n\n\n<p>Projecting this trend out to the scale of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2407.21783\" target=\"_blank\" rel=\"noopener noreferrer\">LLaMA-3<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> 405B parameter model suggests that Dion remains fully effective even with <strong>rank fractions as low as 1\/16 or 1\/64<\/strong> for large dense models.<\/p>\n\n\n\n<p>Using hardware timings of the individual update steps suggests a story that looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1075\" height=\"645\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6.png\" alt=\"Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1\/16 or lower offers an order-of-magnitude speedup over Muon.\" class=\"wp-image-1147684\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6.png 1075w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6-300x180.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6-1024x614.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6-768x461.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion_FIG6-240x144.png 240w\" sizes=\"auto, (max-width: 1075px) 100vw, 1075px\" \/><figcaption class=\"wp-element-caption\">Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. 
Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1\/16 or lower offers an order-of-magnitude speedup over Muon.<\/figcaption><\/figure>\n\n\n\n<p>We\u2019ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of <strong>Dion<\/strong>, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of <strong>Muon.<\/strong><\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/dion\/\">Dion optimizer<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dion is a new AI model optimization method that boosts scalability and performance over existing leading methods by orthonormalizing only a top rank subset of singular vectors, enabling more efficient training of large models such as LLaMA-3 with reduced overhead.<\/p>\n","protected":false},"author":43868,"featured_media":1147793,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Kwangjun Ahn","user_id":"43950"},{"type":"user_nicename","value":"John 
Langford","user_id":"32204"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1146334","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[992148],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"John Langford","user_id":32204,"display_name":"John Langford","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jcl\/\" aria-label=\"Visit the profile page for John Langford\">John Langford<\/a>","is_active":false,"last_first":"Langford, John","people_section":0,"alias":"jcl"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-960x540.jpg\" class=\"img-object-cover\" alt=\"Three white icons on a gradient background transitioning from blue to green. 
From left to right: a network of interconnected nodes, a speedometer with the needle pointing right, and a flowchart with squares and a diamond shape.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/Dion-BlogHeroFeature-1400x788_New-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Kwangjun Ahn and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jcl\/\" title=\"Go to researcher profile for John Langford\" aria-label=\"Go to researcher 
profile for John Langford\" data-bi-type=\"byline author\" data-bi-cN=\"John Langford\">John Langford<\/a>","formattedDate":"August 12, 2025","formattedExcerpt":"Dion is a new AI model optimization method that boosts scalability and performance over existing leading methods by orthonormalizing only a top rank subset of singular vectors, enabling more efficient training of large models such as LLaMA-3 with reduced overhead.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1146334","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1146334"}],"version-history":[{"count":27,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1146334\/revisions"}],"predecessor-version":[{"id":1147795,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1146334\/revisions\/1147795"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1147793"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1146334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1146334"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1146334"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/rese
arch-area?post=1146334"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1146334"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1146334"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1146334"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1146334"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1146334"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1146334"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1146334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}