{"id":1148589,"date":"2025-09-09T07:00:00","date_gmt":"2025-09-09T14:00:00","guid":{"rendered":""},"modified":"2025-09-19T08:38:07","modified_gmt":"2025-09-19T15:38:07","slug":"breaking-the-networking-wall-in-ai-infrastructure","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/breaking-the-networking-wall-in-ai-infrastructure\/","title":{"rendered":"Breaking the\u00a0networking\u00a0wall\u00a0in\u00a0AI infrastructure\u00a0"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1.jpg\" alt=\"Two white line icons on a gradient background transitioning from blue to pink. From left to right: icon representing a set of gears; an icon representing three connected nodes each containing a user icon.\" class=\"wp-image-1148762\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU&nbsp;utilization&nbsp;and overall efficiency,&nbsp;ultimately preventing&nbsp;infrastructure from reaching its full potential&nbsp;despite enormous investments.&nbsp;At the&nbsp;core&nbsp;of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.<\/p>\n\n\n\n<p>Datacenters typically deploy two types of physical cables&nbsp;for&nbsp;communication between&nbsp;GPUs.&nbsp;Traditional copper links&nbsp;are power-efficient and&nbsp;reliable,&nbsp;but&nbsp;limited to&nbsp;very short&nbsp;distances&nbsp;(< 2 meters)&nbsp;that&nbsp;restrict their use&nbsp;to within a single&nbsp;GPU&nbsp;rack. Optical&nbsp;fiber&nbsp;links&nbsp;can&nbsp;reach&nbsp;tens of meters,&nbsp;but&nbsp;they&nbsp;consume far more&nbsp;power&nbsp;and fail up to 100 times&nbsp;as often as&nbsp;copper. A&nbsp;team working across&nbsp;Microsoft&nbsp;aims&nbsp;to&nbsp;resolve&nbsp;this trade-off&nbsp;by&nbsp;developing&nbsp;MOSAIC,&nbsp;a novel optical link technology&nbsp;that&nbsp;can provide&nbsp;low power and cost, high reliability, and long reach (up to 50 meters)&nbsp;<em>simultaneously<\/em>.&nbsp;This approach leverages a hardware-system co-design and adopts&nbsp;a wide-and-slow design with hundreds of parallel low-speed channels using&nbsp;microLEDs.&nbsp;<\/p>\n\n\n\n<p>The fundamental trade-off&nbsp;among&nbsp;power, reliability, and reach&nbsp;stems from&nbsp;the&nbsp;<em>narrow-and-fast<\/em>&nbsp;architecture&nbsp;deployed&nbsp;in&nbsp;today&#8217;s copper and optical links,&nbsp;comprising&nbsp;a few channels&nbsp;operating&nbsp;at&nbsp;very high&nbsp;data rates. For example,&nbsp;an&nbsp;800 Gbps link&nbsp;consists of eight 100 Gbps channels.&nbsp;With&nbsp;copper links, higher channel speeds lead to greater signal integrity challenges, which limits their reach.&nbsp;With optical&nbsp;links,&nbsp;high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and&nbsp;complex electronics&nbsp;to compensate for transmission impairments. These challenges&nbsp;grow&nbsp;as speeds increase&nbsp;with&nbsp;every&nbsp;generation&nbsp;of networks.&nbsp;Transmitting at high speeds also pushes the limits of optical components, reducing&nbsp;systems&nbsp;margins&nbsp;and increasing failure rates.&nbsp;<\/p>\n\n\n\n<p>These limitations force systems designers to make unpleasant&nbsp;choices,&nbsp;limiting the scalability of AI infrastructure.&nbsp;For example,&nbsp;scale-up networks connecting AI accelerators at&nbsp;multi-Tbps&nbsp;bandwidth&nbsp;typically&nbsp;must&nbsp;rely on&nbsp;copper links&nbsp;to meet&nbsp;the&nbsp;power budget,&nbsp;requiring&nbsp;ultra-dense racks that&nbsp;consume&nbsp;hundreds of kilowatts&nbsp;<em>per rack<\/em>. This creates significant challenges in cooling&nbsp;and&nbsp;mechanical design,&nbsp;which constrain&nbsp;the practical scale of these networks and end-to-end performance. This imbalance&nbsp;ultimately&nbsp;erects&nbsp;a&nbsp;<em>networking wall<\/em>&nbsp;akin&nbsp;to the&nbsp;<em>memory wall<\/em>, in&nbsp;which CPU speeds have outstripped memory speeds, creating performance bottlenecks.<\/p>\n\n\n\n<p class=\"has-text-align-left\">A technology offering copper-like power efficiency and reliability over long distances can overcome this networking\u00a0wall,\u00a0enabling\u00a0multi-rack\u00a0scale-up domains and unlocking\u00a0new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry.\u00a0In\u00a0our recent\u00a0paper,\u00a0<em>\u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mosaic-breaking-the-optics-versus-copper-trade-off-with-a-wide-and-slow-architecture-and-microleds\/\">MOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs<\/a>\u201d<\/em>, which received the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/conferences.sigcomm.org\/sigcomm\/2025\/program\/papers-info\/\" target=\"_blank\" rel=\"noopener noreferrer\">Best Paper award at ACM SIGCOMM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we present\u00a0one such promising\u00a0approach\u00a0that is\u00a0the result of a multi-year collaboration between Microsoft Research,\u00a0Azure, and M365.\u00a0This\u00a0work is\u00a0centered around\u00a0an optical\u00a0wide-and-slow architecture, shifting from a small number of high-speed serial channels towards\u00a0hundreds of parallel low-speed channels.\u00a0This\u00a0would be impractical\u00a0to realize with today\u2019s copper and optical technologies because of\u00a0i)\u00a0electromagnetic interference challenges in high-density copper cables and ii) the\u00a0high cost\u00a0and power consumption of lasers\u00a0in optical links,\u00a0as well as the increase in packaging complexity.\u00a0MOSAIC overcomes these issues by\u00a0leveraging\u00a0directly modulated\u00a0microLEDs, a technology originally developed for\u00a0screen\u00a0displays.\u00a0<\/p>\n\n\n\n<p>MicroLEDs&nbsp;are significantly smaller than traditional LEDs (ranging from a few to tens of&nbsp;microns) and, due to their&nbsp;small size,&nbsp;they&nbsp;can be modulated at several Gbps.&nbsp;They&nbsp;are manufactured in large arrays,&nbsp;with over half a million&nbsp;in a small physical footprint for high-resolution displays&nbsp;like&nbsp;head-mounted devices or smartwatches. For example, assuming 2 Gbps per&nbsp;microLED&nbsp;channel, an 800 Gbps MOSAIC link can be realized by using a 20\u00d720&nbsp;microLED&nbsp;array, which can fit in less than 1 mm\u00d71 mm&nbsp;silicon&nbsp;die.&nbsp;<\/p>\n\n\n\n<p>MOSAIC\u2019s&nbsp;wide-and-slow&nbsp;design&nbsp;provides four core benefits.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Operating&nbsp;at low speed improves power efficiency&nbsp;by&nbsp;eliminating&nbsp;the need for&nbsp;complex&nbsp;electronics&nbsp;and&nbsp;reducing optical power requirements.<\/li>\n\n\n\n<li>By&nbsp;leveraging&nbsp;optical transmission (via&nbsp;microLEDs),&nbsp;MOSAIC&nbsp;sidesteps&nbsp;copper\u2019s reach issues, supporting distances up to 50 meters,&nbsp;or&nbsp;> 10x&nbsp;further&nbsp;than copper.<\/li>\n\n\n\n<li>MicroLEDs\u2019&nbsp;simpler structure&nbsp;and temperature insensitivity&nbsp;make them more reliable than lasers. The parallel nature of&nbsp;wide-and-slow&nbsp;also&nbsp;makes it easy to add redundant channels, further increasing reliability, up to two orders of magnitude higher than optical links.&nbsp;<\/li>\n\n\n\n<li>The&nbsp;approach is also scalable, as higher aggregate speeds (e.g.,&nbsp;1.6&nbsp;Tbps&nbsp;or 3.2&nbsp;Tbps) can be achieved by increasing the number of&nbsp;channels and\/or raising per-channel speed&nbsp;(e.g., to 4-8 Gbps).&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>Further,&nbsp;MOSAIC is fully compatible with today\u2019s pluggable transceivers\u2019 form&nbsp;factor&nbsp;and it provides a drop-in replacement for today\u2019s copper and optical cables, without requiring any changes to existing server and network infrastructure.&nbsp;MOSAIC is protocol-agnostic, as it simply relays bits from one endpoint to another without&nbsp;terminating&nbsp;or inspecting the connection&nbsp;and, hence,&nbsp;it\u2019s&nbsp;fully compatible with today\u2019s protocols (e.g.,&nbsp;Ethernet, PCIe, CXL).&nbsp;We are currently working with our suppliers to&nbsp;productize&nbsp;this technology and&nbsp;scale&nbsp;to mass production.&nbsp;<\/p>\n\n\n\n<p>While&nbsp;conceptually simple, realizing this architecture posed a few key challenges&nbsp;across the stack, which&nbsp;required&nbsp;a multi-disciplinary team with&nbsp;expertise&nbsp;spanning across integrated photonics, lens design, optical transmission, and&nbsp;analog&nbsp;and digital design.&nbsp;For example, using individual&nbsp;fibers&nbsp;per channel would be prohibitively complex and costly due to the&nbsp;large number&nbsp;of channels. We addressed this by employing imaging&nbsp;fibers,&nbsp;which are typically used for medical applications (e.g., endoscopy).&nbsp;They&nbsp;can support thousands of cores&nbsp;per&nbsp;fiber, enabling multiplexing&nbsp;of&nbsp;many channels within a single&nbsp;fiber.&nbsp;Also,&nbsp;microLEDs&nbsp;are a less pure light source&nbsp;than lasers,&nbsp;with&nbsp;a larger beam shape (which complicates&nbsp;fiber&nbsp;coupling) and&nbsp;a broader spectrum (which&nbsp;degrades&nbsp;fiber&nbsp;transmission due to chromatic dispersion).&nbsp;We tackled these issues through&nbsp;a novel&nbsp;microLED and&nbsp;optical lens design,&nbsp;and&nbsp;a power-efficient&nbsp;analog-only electronic back&nbsp;end, which does not require any expensive digital signal processing.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Based on our current estimates, this approach can save&nbsp;up to 68% of power, i.e., more&nbsp;than 10W per cable while reducing failure rates by up to 100x. With global annual shipments of optical cables&nbsp;reaching into&nbsp;the tens of millions, this translates to over 100MW of power savings per year,&nbsp;enough to power more than 300,000 homes. While these immediate gains are already significant, the unique combination of low power consumption, reduced cost, high reliability, and long reach opens up exciting new opportunities&nbsp;to rethink&nbsp;AI&nbsp;infrastructure from network and cluster architectures to compute and memory designs.<\/p>\n\n\n\n<p>For example,&nbsp;by&nbsp;supporting&nbsp;low-power,&nbsp;high-bandwidth connectivity at long reach,&nbsp;MOSAIC&nbsp;removes the need for ultra-dense racks and&nbsp;enables&nbsp;novel network topologies, which would be impractical today. The resulting redesign could&nbsp;reduce&nbsp;resource fragmentation and&nbsp;simplify&nbsp;collective optimization.&nbsp;Similarly,&nbsp;on the&nbsp;compute&nbsp;front,&nbsp;the ability&nbsp;to&nbsp;connect&nbsp;silicon&nbsp;dies at low power over long distances&nbsp;could&nbsp;enable&nbsp;resource&nbsp;disaggregation, shifting from today\u2019s&nbsp;large,&nbsp;multi-die packages to&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/good-things-come-in-small-packages-should-we-adopt-lite-gpus-in-ai-infrastructure\/\">smaller, more cost-effective, ones<\/a>.&nbsp;Bypassing packaging area constraints would also make it possible to drastically increase&nbsp;GPU&nbsp;memory&nbsp;capacity and bandwidth,&nbsp;while&nbsp;facilitating&nbsp;adoption of&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/storage-class-memory-is-dead-all-hail-managed-retention-memory-rethinking-memory-for-the-ai-era\/\">novel memory technologies<\/a>.&nbsp;<\/p>\n\n\n\n<p>Historically, step changes in network technology have unlocked entirely new classes of applications and workloads. While our SIGCOMM paper provides&nbsp;possible future&nbsp;directions, we hope this work sparks broader discussion and collaboration across the research and industry communities.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Datacenter memory and network limits are restraining AI system performance. MOSAIC uses microLEDs and a wide-and-slow optical architecture to deliver faster, longer, more reliable, and energy efficient connections that could transform AI cluster designs.<\/p>\n","protected":false},"author":43868,"featured_media":1148762,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Paolo Costa","user_id":"33218"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1148589","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-systems-and-networking","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Paolo Costa","user_id":33218,"display_name":"Paolo Costa","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pcosta\/\" aria-label=\"Visit the profile page for Paolo Costa\">Paolo Costa<\/a>","is_active":false,"last_first":"Costa, Paolo","people_section":0,"alias":"pcosta"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Two white line icons on a gradient background transitioning from blue to pink. From left to right: icon representing a set of gears; an icon representing three connected nodes each containing a user icon\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/08\/MicroLED-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pcosta\/\" title=\"Go to researcher profile for Paolo Costa\" aria-label=\"Go to researcher profile for Paolo Costa\" data-bi-type=\"byline author\" data-bi-cN=\"Paolo Costa\">Paolo Costa<\/a>","formattedDate":"September 9, 2025","formattedExcerpt":"Datacenter memory and network limits are restraining AI system performance. MOSAIC uses microLEDs and a wide-and-slow optical architecture to deliver faster, longer, more reliable, and energy efficient connections that could transform AI cluster designs.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1148589","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1148589"}],"version-history":[{"count":32,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1148589\/revisions"}],"predecessor-version":[{"id":1149460,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1148589\/revisions\/1149460"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1148762"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1148589"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1148589"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1148589"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1148589"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1148589"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1148589"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1148589"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1148589"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1148589"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1148589"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1148589"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}