Our research is focused on the following four broad themes. Each theme has numerous projects, and some projects span multiple themes. A partial list of current and past projects is available.
As we move from software on disk (e.g., Office) to software-as-a-service delivered over the network (e.g., Office 365), it is imperative that network downtime not diminish service availability. Network verification seeks to guarantee correct operation of our data center and core networks by leveraging work in formal methods for programs. Despite being built from cables and routers, a network can be viewed abstractly as a “program” that takes packets from the input edges of the network and outputs packets at the output edges.
This leads to a broad research agenda: building tools that are the equivalents of testers, static checkers, and compilers for Microsoft networks. New research is required because differences in the networking domain require rethinking classical verification techniques (e.g., model checking, symbolic testing) to produce new concepts. At MSR, we have built four tools: SecGuru (used operationally within Azure), NetSonar (aspects of which are in Autopilot), Batfish (which can predict the effect of routing configuration changes), and Network Optimized Datalog (which can check reachability across firewalls and load balancers). This is joint work between the MNR and RiSE groups, various network product teams, and external researchers at Stanford and UCLA.
In ongoing work, we are 1) improving the scalability of reachability checks by leveraging symmetries in data center topologies; 2) improving the speed of configuration change analysis by decomposing and modularizing the analysis logic into smaller chunks; and 3) proving correctness under all topologies and route announcements through symbolic execution of configurations.
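As a toy illustration of treating the network as a program over packet headers, the sketch below composes per-device forwarding functions and asks whether a packet reaches the far edge. All device behavior, addresses, and names here are hypothetical; real tools such as Network Optimized Datalog operate on symbolic packet sets rather than single concrete packets.

```python
# Toy reachability check: the network as a composition of per-device
# functions from packet headers to (possibly dropped) packets.
# Everything below is illustrative, not an actual tool's encoding.

def firewall(pkt):
    # Hypothetical rule: drop traffic destined to a management subnet.
    if pkt["dst"].startswith("10.0.0."):
        return None
    return pkt

def router(pkt):
    # Forward everything else, recording the hop count.
    return dict(pkt, hop=pkt.get("hop", 0) + 1)

def reachable(pkt, path):
    """Run a packet through an ordered list of devices; True if it survives."""
    for device in path:
        pkt = device(pkt)
        if pkt is None:
            return False
    return True

print(reachable({"dst": "10.0.0.5"}, [firewall, router]))    # False: blocked
print(reachable({"dst": "192.168.1.9"}, [firewall, router]))  # True
```

A verification tool generalizes this from one concrete packet to the set of all packets, so it can prove that no packet at all crosses a boundary it should not.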
Public communications networks, such as those delivering mobile and wired broadband Internet access to homes and businesses, are typically subject to a high degree of regulatory oversight. Consequently, the policies that national governments enact have a big impact on how our customers experience our products, influencing how fast and available their connections are and how much they cost.
MNR technology policy efforts give researchers a clear understanding of policy perspectives, opportunities and constraints, and in coordination with LCA, bring world-leading technical knowledge into the policy-making arena. Examples of focus areas for MNR technology policy include rules for use of wireless radio frequencies (spectrum), and rules protecting networked services (such as O365, Bing, Skype and other Microsoft products) from being impaired by network operators (network neutrality).
- Spectrum: Every Windows device now relies on some form of wireless connectivity, and many – such as the HoloLens, Band, and most Surfaces – assume the availability of connections based on Wi-Fi or Bluetooth protocols. These protocols are built to work on “unlicensed” spectrum bands – frequencies that can be freely used without requiring permission or payment, as long as the device shares the spectrum appropriately with other users. MNR work develops novel sharing techniques and future applications of both unlicensed and licensed spectrum bands, supporting LCA advocacy for allocation of new bands to such purposes. It also informs regulatory efforts to ensure appropriate sharing of unlicensed bands.
- Network neutrality: Every Microsoft service ultimately relies on a network operator to connect to the customer. In most countries, the communications regulatory framework remains rooted in a legacy industry model of services provided directly by network operators. The evolving model of network-independent services, like ours, delivered “over the top” of operator networks, is motivating a transition to new rules to address questions such as whether networks may impair or charge over-the-top service providers, whether and how such “over-the-top” services should be regulated, and the extent to which network operators should be deregulated. MNR work focuses on the impact that evolving network architecture has on the need for new rules as well as the creation of partnership opportunities with network operators.
The design and operation of today’s networks decouple the physical layer from higher layers—to higher layers, everything is an Ethernet port, irrespective of the physical media underneath. This decoupling makes it hard to diagnose failures, manage risk (e.g., multiple IP-level links may traverse the same physical media), debug degraded packet delivery (e.g., corruption), or modulate transmission speed based on physical-layer characteristics. The decoupling may have made sense in a world with diverse physical layers, but with the convergence of the physical layer to optics, we believe it is time to revisit it.
We are pursuing two threads of research. First, we are developing techniques to characterize the physical layer and correlate its performance with that of packet delivery. To enable this analysis, we are mining optical data from Microsoft’s wide area network (WAN) and data center networks. Our analysis is uncovering key insights, such as: fiber cuts are not the leading cause of WAN faults (equipment failures are); the level of optical power overprovisioning in WANs is such that data transmission speeds can be safely increased by over a third; and low receive power is a common cause of packet corruption inside data centers.
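As a toy illustration of the correlation analysis described above, the sketch below flags links whose optical receive power falls below a cutoff, the basic pattern behind identifying corruption-prone links. The threshold, link names, and counters are invented for illustration; they are not Microsoft telemetry.

```python
# Illustrative check: links with low optical receive power are candidate
# explanations for packet corruption. Threshold and data are made up.

LOW_POWER_DBM = -10.0  # hypothetical receive-power cutoff

links = [
    {"id": "dc1-t0-12",   "rx_power_dbm": -3.2,  "corrupt_pkts": 4},
    {"id": "dc1-t1-07",   "rx_power_dbm": -11.8, "corrupt_pkts": 950},
    {"id": "wan-sea-chi", "rx_power_dbm": -6.5,  "corrupt_pkts": 12},
]

# Flag links below the power cutoff for physical-layer inspection.
suspect = [l["id"] for l in links if l["rx_power_dbm"] < LOW_POWER_DBM]
print(suspect)  # ['dc1-t1-07']
```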
Second, we are exploring radical cross-layer network designs. For the data center, we are focusing on free-space optics and the use of DMDs (digital micromirror devices, which are pervasive in projectors today) as the basis for a completely flat interconnect with high fanout and fast (10 microsecond) switching. For WANs, we are focusing on cross-layer traffic engineering, that is, a system that jointly and dynamically decides the routing of wavelengths and packets. Since commodity hardware does not enable us to prototype such ideas, we are also developing an FPGA-based platform for programmable optical transceivers.
RDMA in Large Scale Data Center Networks
Modern datacenter applications demand high throughput (over 40 Gbps) and ultra-low latency (less than 10 microseconds) from the network. At the same time, the brutal economics of the cloud services market dictates that CPU overhead be minimized. Standard TCP/IP stacks cannot meet these requirements: for example, the single-hop latency of a production TCP stack can exceed 15 microseconds, and saturating a 40 Gbps link can consume 15-20% of CPU cycles. Remote Direct Memory Access (RDMA) can provide low latency and high throughput by bypassing the host networking stack for data transfer operations. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to provide a lossless (i.e., no congestion drops) network.
However, PFC can lead to poor application performance due to problems like head-of-line blocking and unfairness. To alleviate these problems, we have designed DCQCN, an end-to-end congestion control scheme for RoCEv2. To optimize DCQCN performance, we built a fluid model and provide guidelines for tuning switch buffer thresholds and other protocol parameters. Our experiments show that DCQCN dramatically improves the throughput and fairness of RoCEv2 RDMA traffic. DCQCN is implemented in Mellanox NICs and is being deployed in Microsoft’s datacenters.
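A simplified sketch of a DCQCN-style sender rate update: when a switch marks packets with ECN, the receiver sends a Congestion Notification Packet (CNP) back to the sender, which cuts its rate multiplicatively based on an EWMA estimate of congestion; absent feedback, the estimate decays and the rate recovers toward a remembered target. The gain and rates below are illustrative, not the deployed parameters.

```python
# Simplified DCQCN-style sender state machine (illustrative parameters).
# rc: current rate, rt: target rate remembered at the last cut,
# alpha: EWMA estimate of the extent of congestion.

G = 1 / 256  # hypothetical EWMA gain

def on_cnp(rc, rt, alpha):
    """CNP received: update the congestion estimate and cut the rate."""
    alpha = (1 - G) * alpha + G   # congestion seen: push estimate up
    rt = rc                       # remember where to recover to
    rc = rc * (1 - alpha / 2)     # multiplicative decrease
    return rc, rt, alpha

def on_timer_no_cnp(rc, rt, alpha):
    """No feedback for a period: decay the estimate, recover the rate."""
    alpha = (1 - G) * alpha
    rc = (rc + rt) / 2            # step back toward the target rate
    return rc, rt, alpha

rc, rt, alpha = 40.0, 40.0, 1.0   # Gbps; fully congested estimate
rc, rt, alpha = on_cnp(rc, rt, alpha)
print(rc, rt)  # 20.0 40.0 -- rate halved, target remembered
```

The key design choice is that the rate cut depends on alpha, so persistent congestion produces deep cuts while an isolated mark barely perturbs the flow.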
TIMELY, a protocol put forth by Google, is a parallel effort to DCQCN. It aims to solve the same problem but uses delay as the congestion signal (like TCP Vegas).
Another way to think about DCQCN and TIMELY is that these congestion control algorithms represent a new design point in the age-old tussle between fast response and stability. They rely on PFC for a fast response (i.e., avoiding packet drops) to short-term congestion, while relying on conventional (ECN- or delay-based) congestion control for long-term stability.
Datacenter Networking & Performance Optimization of Cloud Services
We are pursuing a multi-year cross-lab research program that focuses on producing the next generation of data center networking and services. We are experimenting with radical new designs in network architecture, programming abstractions, and performance management tools. We care about inexpensive, future-proof networking inside the data centers, between globally distributed data centers, and to the data centers. Our research includes several projects that cut across various systems and networking research areas and are being pursued in collaboration with Microsoft’s Global Foundation Services Team, Windows Azure Team, Bing Team, and the Management Solutions Division.
Mobile Computing & Software Services
We are pursuing a variety of mobility-related projects: studying how the cloud can enhance the user experience on mobile devices (HAWAII); understanding how people use smartphones and the performance characteristics of 3G networks (3GTest & Diversity Studies); building systems to enhance smartphone performance, functionality, and battery lifetime using code offload (MAUI); building infrastructure to enable mobile social applications (Virtual Compass); and enhancing mobile device sensors by making their sensor readings trustworthy. In the software services arena, we are pursuing a variety of systems to simplify building scalable and geo-distributed services (PRS/Centrifuge, Volley, and Stout). Another area of emphasis is home networks, where we are pursuing network diagnosis services for the home (NetMedic & NetClinic), as well as new services and abstractions for easily building networked applications for the home (HomeOS).
Continuous Hands-free Mobile Interaction
When combined with high-resolution touch-enabled displays, web access has proven a killer application. Playing games, reading the news, or watching YouTube can capture the attention of users for extended periods many times a day. However, if the user has limited attention or the relevant tasks are short and happen many times an hour, e.g., making an appointment, adding a song to a playlist, or checking on a bus, pulling the phone out of your pocket for an immersive experience is cumbersome. The newly launched Continuous Mobile Interaction project is developing usages, devices, and systems to make such lightweight but frequent actions easy to do. Current efforts are along several directions: (a) an always-available speech accessory; (b) developer support for natural-language interaction; (c) a platform for multi-modal interaction in moving vehicles; and (d) an always-on visual cognition engine.
Cognitive Wireless Networking
The next generation of wireless networks will include software defined radios, cognitive radios, and multi-radio systems that co-exist harmoniously while operating over a very wide range of frequencies. We are revisiting “classical” wireless networking problems and designing new solutions that incorporate and build upon recent advances in software and hardware technologies. Of interest lately have been our research solutions to problems in white space networking (the KNOWS project). We are working with policy makers, business units, and academia to address the societal need for inexpensive broadband connectivity everywhere.
Enterprise Network Management & Services
We are pursuing several different projects in this area. In particular, NetHealth is a network management research program in which end-hosts cooperatively detect, diagnose, and recover from network faults. Unlike existing products, we take an end-host-centric approach to gathering, aggregating, and analyzing data at all layers of the networking stack to determine the root cause of problems. NetHealth includes several ongoing projects in the wireless and wired space that are being pursued in collaboration with Microsoft’s Management Solutions Division and Microsoft’s Unified Communications Group.