Microsoft Research and Microsoft Azure improve the efficiency and capacity of cloud-scale optical networks


Today at HotNets 2017, the Sixteenth ACM Workshop on Hot Topics in Networks, we presented important research results from Microsoft Research and Microsoft Azure that show how we can increase the capacity of optical networks without purchasing new fiber.

Data, data, data – it’s all about the data
Our journey to understand and improve optical networks began several years ago. To innovate, we first needed to measure the operational efficiency of our optical networks. This proved challenging. Microsoft is one of the largest cloud providers on the planet, with a huge number of optical links in our wide-area networks and data centers, yet we could find neither a tool nor a reference in the scientific literature for measuring the operational efficiency of massive-scale production networks. So we built what is perhaps one of the world’s largest network monitoring tools to see what was happening under the covers. Sure enough, we found interesting things. But I am getting ahead of myself; let me step back and cover some basics first.

Optical networking in Microsoft
Microsoft spends hundreds of millions of dollars building datacenter networks across the world and then interconnects them using a wide-area backbone network. Reducing cost and improving the efficiency and availability of these networks is central to our ability to provide cost-effective cloud services.

Our backbone network spans the globe. Within the United States alone, we have well over tens of thousands of kilometers of lit optical fiber interconnecting more than 35 cities with well over hundreds of terabits per second of bandwidth capacity. As Yousef Khalidi, Corporate Vice President of Microsoft Azure, noted in a March 23, 2017, blog post about innovating and developing unique solutions in optical networking, optical gear is expensive. For every 100 gigabits per second, Microsoft spends roughly tens of thousands of dollars on the optical equipment alone. To support the continuously exploding demand of the cloud, efficient use of deployed fiber, especially in wide-area networks, is imperative.

A few years ago, we invented a new software control system called SWAN, short for software-driven wide-area network, which we described in a paper presented at the 2013 conference of the Association for Computing Machinery’s Special Interest Group on Data Communication, known as SIGCOMM. With SWAN, we effectively doubled the efficiency of our backbone networks. That effort, however, did not touch the network’s physical layer. So, we began considering other ideas to enhance the efficiency while reducing the cost of our global networks, and started to investigate how we could optimize the operation of the physical layer of Microsoft networks.

Squeezing additional juice from existing fiber in our backbone network
While optical characteristics are key to the network’s traffic-carrying capacity, we found that the research and practitioner communities are only beginning to explore the operational characteristics of deployed fiber. Engineers had performed simulations but not made measurements. To better understand the operational efficiency of our networks, we began collecting data from all transceivers and amplifiers in our optical backbone, and have been doing so since 2015. We augmented static, segment-based simulations with wavelength-based measurements by continuously monitoring thousands of wavelengths in our backbone network every 15 minutes. We developed new analysis techniques to quantify the end-to-end quality of these wavelengths and their ability to carry more data, and produced some important results.
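To make the measurement idea concrete, here is a minimal sketch of how per-wavelength telemetry samples might be reduced to an end-to-end quality figure. The data shapes, wavelength IDs, and segment names are illustrative assumptions, not Microsoft's actual telemetry format.

```python
from collections import defaultdict

# Hypothetical 15-minute telemetry samples: (wavelength_id, segment, snr_db).
samples = [
    ("wl-101", "sea-chi", 24.1), ("wl-101", "chi-nyc", 22.8),
    ("wl-102", "sea-chi", 19.5), ("wl-102", "chi-nyc", 18.9),
]

# An end-to-end wavelength is only as good as its worst segment,
# so track the minimum SNR observed along each wavelength's path.
worst_snr = defaultdict(lambda: float("inf"))
for wavelength, segment, snr in samples:
    worst_snr[wavelength] = min(worst_snr[wavelength], snr)

print(worst_snr["wl-101"])  # 22.8
```

Keeping only the path-wide minimum is one simple way to summarize quality; a production system would likely track distributions over time as well.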

This figure shows the cumulative distribution function (CDF) of the capacities of links if they were to be utilized according to their SNR.

Our analysis of the signal-to-noise ratio, or SNR, of optical links over a period of two years shows that there is tremendous opportunity to improve the efficiency of existing fiber deployments. For example, the capacity of 99 percent of the 100 Gbps segments in our network could be augmented to 150 Gbps simply by changing the modulation format at the two ends, keeping the fiber and intermediate amplifiers unchanged. Interestingly, 34 percent of the segments could be driven at double their capacity, i.e., 200 Gbps.
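The analysis behind such a CDF can be sketched in a few lines: map each segment's SNR to the highest modulation rate it can support, then compute what fraction of segments clears each rate. The SNR thresholds and sample readings below are illustrative assumptions; real values depend on the transceiver hardware.

```python
# Illustrative SNR thresholds (dB) for each data rate, highest first.
# These numbers are assumptions for the sketch, not real hardware limits.
THRESHOLDS = [(200, 22.0), (150, 18.0), (100, 14.0)]  # (Gbps, min SNR)

def achievable_capacity(snr_db):
    """Return the highest data rate whose SNR requirement the link satisfies."""
    for gbps, min_snr in THRESHOLDS:
        if snr_db >= min_snr:
            return gbps
    return 0  # below the lowest threshold: link unusable at any rate

snrs = [23.5, 19.2, 25.0, 16.1, 21.9]  # hypothetical per-segment SNR readings
caps = [achievable_capacity(s) for s in snrs]

# Fraction of segments that could run at 150 Gbps or more.
frac_150_plus = sum(c >= 150 for c in caps) / len(caps)
```

Evaluating this fraction at each rate over all measured segments yields exactly the kind of CDF shown in the figure above.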

Microsoft Research together with Microsoft Azure presented our results at the 2016 Optical Networking and Communication Conference and Exhibition. Since then, we have packaged our measurement and data-driven analysis techniques into a real-time performance and failure monitoring engine for the optical layer.

Helping the industry – making the data available to data scientists

Over the last six years, I have seen roughly 50 papers appear in top networking conferences. They study optical WAN topology, failures, performance, routing, and traffic engineering, but they mostly focus on the IP layer. Underneath the IP layer is the physical layer, which includes thousands of miles of optical fiber. Unfortunately, before our work, the optical layer had gone largely unexamined. After building our massive-scale telemetry system to measure optical signals, in the spirit of collaboration and improving the state of the art, we released our dataset to our colleagues in academia and industry. This dataset includes 14 months of data from 4,000 optical channels carrying live traffic. We believe this dataset is the first public release from a large-scale optical backbone, and we at Microsoft hope that it will provide researchers a unique opportunity to study the temporal behavior of optical links, their quality of signal, transmit power, dispersion, correlation among channels, correlation among optical links, and more. We presented a few results from our analysis of this data at the 2016 Internet Measurement Conference and were honored with the “best open dataset award” from the community. Specifically, we showed that optical signals are strong predictors of link failures in the WAN, and we described how Microsoft’s state-of-the-art software-defined WANs were incorporating optical-layer measurements and derived characteristics into our management substrate. For more information about our dataset, check out our project, Wide-Area Optical Backbone Performance.

Adapt when you can – it is good for the network
Building on our earlier work, we concluded that instead of operating our optical networks using fixed link modulation, the smart thing to do is to adapt the modulation, and therefore the capacity, of the optical links based on the receiver’s SNR values. This type of adaptation is a well-understood and widely used technique in modern wireless networking, including the hugely popular Wi-Fi networks.

Based on these results, Microsoft Azure decided to purchase bandwidth-variable transceivers that can vary modulation to achieve data rates of 100, 150, or 200 Gbps depending on the SNR of the fiber path. In the figure above, the three colored lines correspond to these different data rates; notice how 100 Gbps links can easily be operated at 150 Gbps. The measured data convinced us that we should purchase bandwidth-variable transceivers instead of the industry-standard fixed, higher-order modulators and operate our networks at higher speeds.

Today, when the SNR of an optical signal drops below its pre-determined modulation threshold, the link is declared down and is considered a failure at the IP layer. We analyzed several months of failure tickets and noticed that some of the link failures were caused by SNR degradation rather than complete loss of light. This finding gave us an opportunity to replace link failures with link flaps, wherein the capacity is adjusted according to the new SNR. These results are included in our HotNets 2017 paper, in which we also analyze the capability of state-of-the-art bandwidth-variable transceivers and quantify the hardware switching latency of reconfiguring link capacities. Our results show the possibility of reducing this latency from 68 seconds to 38 milliseconds. We are optimistic that our efforts in this direction can make dynamic-capacity links feasible in wide-area networks.
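The "flap instead of fail" idea can be sketched as a small decision rule: when SNR degrades, step the link down to the highest rate the new SNR still supports, and declare a failure only on complete loss of light. The rate/SNR pairs below are illustrative assumptions, not real hardware limits.

```python
# Illustrative (Gbps, min SNR in dB) pairs, highest rate first; assumptions only.
RATES = [(200, 22.0), (150, 18.0), (100, 14.0)]

def react_to_snr(current_gbps, snr_db):
    """Decide how a link should respond to a new SNR reading."""
    for gbps, min_snr in RATES:
        if snr_db >= min_snr:
            if gbps != current_gbps:
                return ("reconfigure", gbps)  # link flap: adjust capacity
            return ("keep", gbps)             # SNR still supports current rate
    return ("fail", 0)  # below every threshold: treat as a genuine failure

# A dip from above 22 dB to 19 dB no longer takes the link down;
# it becomes a capacity adjustment from 200 to 150 Gbps.
print(react_to_snr(200, 19.0))  # ('reconfigure', 150)
```

With fast transceiver reconfiguration, such a flap costs tens of milliseconds of disruption rather than a full IP-layer failure and reroute.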

Optical networks are literally the backbone of the Azure Cloud and our quest to extract the maximum out of these networks continues.