We strive to find a balance between long-term research and product impact. You may know about our research from the papers we publish and the talks we give, but we also have a broad impact on current Microsoft products, such as Windows, Xbox, Office 365, and Bing, as well as earlier products like Windows Phone and Skype.
- Microsoft’s Wide Area Software Defined Network (implements a centralized traffic engineering system that has led to an improvement of the inter-DC WAN bandwidth utilization from 40% to 90%+, thus saving us millions of dollars annually)
- XBOX One Wireless Controller Protocol (a high-throughput, low-latency, energy-efficient Microsoft proprietary protocol between the XBOX One console and controllers. The controller has won accolades from the mainstream press as the best in the gaming market)
- Windows Azure Full-Bisection Bandwidth Datacenter Network (hailed as one of the most significant recent advances in computer science, our design led to an 80x improvement in dollars/Mbit/sec over previous designs. It is now the architecture of choice for all of Microsoft’s Datacenters. It enabled technologies like the highly-scalable Windows Azure Flat Network Storage)
- Windows Azure Software Load Balancer (reduced cost by a factor of 15 [$60K versus $1M] by removing dependence on expensive hardware load balancers and improved cloud manageability. This fully configurable load balancer is used by both Azure and Bing)
- Windows Firmware TPM (enabled Microsoft to offer the widely used BitLocker and DirectAccess features and a new security feature, Virtual Smart Cards, in the Windows 8 RT and Windows 8 Phone)
- XBOX Live Service Graphs (reduced the time needed for performance diagnosis in large-scale enterprise & Data Center networks from days to minutes, helping meet customer SLAs. XBox Live is the first Microsoft cloud service to use this network performance diagnosis technology)
- Windows Network Virtualization (enabled Windows to provide seamless connectivity between Microsoft’s Data Centers and customers’ on-premises networks. Our design heavily influenced the Hyper-V network virtualization feature that ships in Windows Server 2012)
- Windows Virtual Wi-Fi (enabled Windows features like range extension, concurrent corporate and guest access, and Internet gateway using a single Wi-Fi card. Before becoming a product, our prototype was downloaded several hundred thousand times, making it one of the three most popular MSR software downloads)
- TCP for Data Center Networking (improved performance of Data Center networks without incurring the cost of expensive hardware switches. It is implemented in our core networking stack and deployed in our Data Center properties)
Details about these and our other product contributions are provided below.
Microsoft’s Wide Area Network – Architecture & Management Software (2013-14)
Increases the inter-DC WAN utilization from 40% to 90%+
- Previously inter-DC WANs running MPLS achieved only 40% average utilization. We built a logically centralized traffic engineering system that attains 90%+ average utilization while meeting different service policies. This saves tens of millions of dollars in WAN expense each year. Our paper describing an early (pre-production) version of this system was published in SIGCOMM 2013. Our design and technologies have been adopted and put into production by Windows Azure Networking and Microsoft’s Global Network Services.
AutoPilot’s Network-state Management Service (2014)
Dramatically simplifies network management app development and operations while maintaining network-wide SLAs
- DC network management applications are complex and sophisticated, usually requiring years to design, develop and deploy. Running multiple such applications is challenging, as they may conflict with one another and their collective actions can impair network operation. The Autopilot Statesman service that we developed simplifies application development by shielding apps from low-level interactions with devices. By offering a novel network state model, Statesman enables apps to operate independently while maintaining network-wide safety. Our technology has been deployed in all Autopilot-managed datacenters.
Windows Azure ZKaaS (ZooKeeper as a service) (2014)
A multi-tenant coordination cloud service that uses open-source ZooKeeper
- We worked closely with the AutoPilot team to build a multi-tenant layer on top of ZooKeeper that can be deployed and monitored by their software. The coordination service underneath runs multiple ensembles which can execute arbitrary requests from authenticated tenants. Although the ensembles are shared, to the user it seems as if they are running a dedicated ensemble. In future releases tenants will receive concrete compute resources tokens with full performance isolation.
Windows Azure Autopilot NetInsight (2013-14)
End-to-end measurement and analysis tools that run automatically and vastly improve the accuracy of WAN fault localization
- Localizing performance faults on the WAN is difficult due to the plethora of routers and paths between DCs. We developed a system that accurately localizes faults to a specific router interface among thousands of candidate routers and paths. Compared to the state-of-the-art SNMP-based system, our system, called NetInsight, reduces the number of false positives by two orders of magnitude.
Network Virtualization for Hybrid Clouds (2010-12)
Enabled Windows to provide seamless connectivity between Microsoft’s Data Centers and customers’ on-premises networks
- Cloud computing provides as much compute and storage as needed, where needed. However, large enterprises require hybrid clouds: private Data Centers that only send overflow computing to the cloud. The challenge is to allow competing enterprises to use the same public Internet addresses on a shared Azure network without interference, and to also have seamless connectivity back to the enterprise network. We designed new mechanisms with separate logical and physical Internet addresses in V-Net, allowing it to efficiently virtualize Azure’s network. V-Net allows fast delivery of Internet packets, rapid migration of Virtual Machines with low overhead, and performance isolation. V-Net’s design has heavily influenced the Hyper-V network virtualization feature that ships in Windows Server 2012.
Visual Studio Energy Modeler & Profiler (2012)
- Poorly written apps are one of the primary reasons for high energy drain on mobile devices. One reason for energy-inefficient apps is that app developers do not have sufficient tools to determine the energy impact of their apps. As part of a Wattson research project, we designed a Visual Studio plug-in that gives application developers visibility into their application’s energy consumption. Our paper, Empowering Developers to Estimate App Energy Consumption, published in ACM MobiSys 2012, describes the details of the system. This work formed the basis for the Energy Profiler that is part of the Visual Studio SDK for Windows Phone 8.
GreenUp (2011-12)
Delivers significant power and monetary savings for enterprise customers by enabling seamless remote access to sleeping desktop machines
- Enterprises can save significant amounts of power by letting idle desktop machines go to sleep (S3). This behavior has been the default setting in Windows desktop for many years. However, users and system administrators often override this because they may need to access the machines remotely. Current Wake-on-LAN techniques are cumbersome, and do not always work on complex networks. We designed and built a wakeup service (“GreenUp”) that works transparently – any time the user tries to remotely access a sleeping machine, it seamlessly wakes it up. This encourages users to save power. GreenUp scales up to large, complex corporate networks by using a novel distributed leader election.
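For context, the conventional Wake-on-LAN mechanism that the paragraph above contrasts with relies on a "magic packet": six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below shows only that standard packet, not GreenUp itself; the function names, the default broadcast address, and the port are illustrative assumptions. Because the packet is a limited broadcast, it normally does not cross routers, which is one reason plain Wake-on-LAN is awkward on complex networks.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a standard Wake-on-LAN magic packet for a MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    # Six 0xFF bytes followed by the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str,
                      broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling `send_magic_packet("00:11:22:33:44:55")` (a placeholder MAC address) would broadcast the packet on the local subnet only.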
Fully Configurable Windows Azure Software Load Balancer (2011)
Reduced costs by a factor of 15 by removing dependence on hardware load balancers, and improved cloud manageability as well
- Hardware load balancers are traditionally used as the front tier of clouds to distribute incoming traffic to servers. We designed a scale-out software load balancer that can dynamically scale from 1 Gbps to 100 Gbps. To the best of our knowledge this was the first software load balancing solution in industry. Our solution reduces costs by a factor of 15 ($60K versus $1M) compared to hardware load balancers, and allows more flexibility and easier management as well. To compete with the speed of hardware, our design made clever use of existing routing protocols and the Windows networking stack. Our design is now the load balancer of choice for Azure and Bing.
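The text does not describe the load balancer's algorithm, so the following is only a generic sketch of one idea that makes scale-out load balancing possible: if every load balancer instance maps a connection's 5-tuple to a backend with the same deterministic hash, any instance can handle any packet of a flow without shared state. The function name and the flow layout are assumptions made for illustration.

```python
import hashlib


def pick_server(flow, servers):
    """Map a flow 5-tuple to one backend server deterministically.

    flow: (src_ip, src_port, dst_ip, dst_port, protocol), an assumed layout.
    servers: non-empty list of backend identifiers.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer bucket selector.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

A plain modulo hash like this remaps many flows when the server list changes; a production design would need a scheme that preserves existing connections.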
Full-Bisection Bandwidth Datacenter Networks (2009-10)
Servers in a datacenter are no longer limited by the network that connects them
- For cloud services the three key elements to success are: cost of infrastructure, availability, and response time. Conventional datacenter design advocated by established networking vendors does poorly on all three measures. We designed and validated a new datacenter network architecture that excels in all three metrics; in particular, it provides an 80X improvement in dollars/Mbit/sec over existing designs while providing uniform high capacity between servers, performance isolation between services, and dynamic resource allocation across large server pools. Our design is now the network architecture of choice for all of Microsoft’s datacenters, including those managed by XBox, Bing and Azure. Our initial SIGCOMM 2009 paper (VL2: A Scalable and Flexible Data Center Network) was republished by the Communications of the Association for Computing Machinery (CACM) in 2011 as one of the most important research results in Computer Science in recent years; it was cited as “a great example of rethinking networking from scratch, while coming full circle to work with today’s equipment.”
TCP Analyzer (2010)
Enabled Microsoft Network Monitor to provide deeper insights into the workings of the Internet’s Transmission Control Protocol
- We designed and built a plugin (called an “expert”) for Microsoft Network Monitor (NetMon) that helps analyze TCP traces. It uses several sophisticated heuristics to answer the key question “what limited the throughput of this TCP connection?” Apart from answering this question, the plugin also allows the user to visualize the connection in a number of different ways. Our plugin has been downloaded thousands of times, and is one of the most popular NetMon “experts”.
Data Sense Bandwidth Attribution Technology (2012)
- We built a technology that tracks cellular and Wi-Fi data consumption for individual apps and OS components, and displays it in an intuitive UI. A challenge we had to overcome was to accurately attribute data consumption across the numerous APIs and OS services that mobile apps use, and to do so in a lightweight manner. See the original technology demo video (Aug. 2011).
Mobile Input Services & Technology (2012)
- Typing intelligence: We enabled Windows Phone to scale its typing intelligence solutions (hit-target resizing, spell correction, candidates-on-demand, etc.) to more than 50 languages, including new languages such as Latin Hindi.
- WordFlow Keyboard User Adaptation: We helped with a feature that allows the Windows Phone keyboard to adapt to the user’s language and offer the user’s own words as completions and next-word predictions.
- Keyboard Input Architecture: We helped revise the input architecture and created a new edit buffer to facilitate new features such as user adaptation, multilingual editing within the same message, and seamless multi-modal integration.
Firmware TPM Emulator (2012)
- We delivered the TPM driver and firmware TPM simulator. The development team used our simulator to develop & test important security features even before the vendors provided them with the actual devices. A fuller description of our contribution is provided under “Windows”.
AppInsight to the Application Compatibility Team (2012)
- We delivered a customized version of our application analytics tool for performance testing and failure analysis of the top WP marketplace applications on various hardware and software SKUs. The development team runs this tool routinely on third-party apps and estimates that it has reduced the time spent on app failure analysis by a factor of 2 to 4. The first paper (AppInsight: Mobile App Performance) that describes our system appeared in OSDI 2012.
Wireless Optimizations in Windows 8 (2011-12)
Increases battery lifetime in Windows 8 Tablets and Surface computers
- Unlike laptops, the new class of mobile devices, such as tablets and Surface computers, needs to stay connected even when the screen is turned off. Keeping the Wi-Fi always on consumes significant energy. We designed a set of techniques that allows the Wi-Fi device to keep its connection even when the screen is turned off and the processor (and SoC) is in a low power state. We accomplished this by reducing the Wi-Fi power consumption to a few mW in standby state. Our techniques shipped in Windows 8.
Network Quality of Service for Virtual Machines (2011-12)
Enabled Windows 8 to provide predictable networking to high-value cloud services
- We helped design and evaluate a mechanism to adaptively control the network usage of a Virtual Machine (VM), analogous to the controls that already existed for CPU and memory. Our design includes a feedback loop that ensures VMs receive network bandwidth that is proportional to their share and that spare bandwidth is allocated among VMs that need it. Our VM Rate Shaper shipped in Windows 8.
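The allocation goal described above (bandwidth proportional to each VM's share, with spare bandwidth going to VMs that still need it) corresponds to what is commonly called weighted max-min fairness. The sketch below computes such a target allocation in one shot; it is an illustration under assumed inputs, not the shipped rate shaper, which the text describes as a feedback loop. The function name and the dictionary layout are hypothetical.

```python
def allocate_bandwidth(capacity, weights, demands):
    """Weighted max-min fair split of link capacity among VMs.

    capacity: total link bandwidth (any consistent unit).
    weights:  {vm: share weight}, all positive.
    demands:  {vm: bandwidth the VM currently wants}.
    """
    allocation = {vm: 0.0 for vm in weights}
    active = {vm for vm in weights if demands[vm] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[vm] for vm in active)
        # VMs whose demand fits within their proportional share are capped
        # at their demand; what they leave unused is redistributed.
        satisfied = {
            vm for vm in active
            if remaining * weights[vm] / total_weight >= demands[vm]
        }
        if not satisfied:
            # Everyone wants more than their share: split proportionally.
            for vm in active:
                allocation[vm] = remaining * weights[vm] / total_weight
            break
        for vm in satisfied:
            allocation[vm] = float(demands[vm])
            remaining -= demands[vm]
        active -= satisfied
    return allocation
```

For example, with a capacity of 100, weights `{"a": 1, "b": 1, "c": 2}` and demands `{"a": 10, "b": 50, "c": 100}`, VM `a` is capped at its demand of 10 and the remaining 90 is split 1:2 between `b` and `c`.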
Support for Security Features in Windows ARM (2011-12)
Enabled widely used security features (BitLocker, DirectAccess, Virtual Smart Cards) on Windows RT and Windows Phone
- Many enterprises mandate the use of crucial security features such as BitLocker and DirectAccess on their employees’ mobile devices. These Windows features require Trusted Platform Module (TPM) chips that are normally part of the hardware of modern computers. Unfortunately, Windows 8 and Windows Phone have low-power versions that run on ARM chips, which do not have standard TPM hardware; instead, ARM offers an alternate hardware security platform called TrustZone. We worked with the Windows team and helped deliver a reference implementation of a firmware emulation of the missing TPM chip, which we called fTPM, that leverages ARM’s hardware. Microsoft delivered fTPM to our SoC partners, who incorporated it in their latest firmware releases. This then enabled Microsoft to offer BitLocker, DirectAccess, and a new Windows 8 security feature, Virtual Smart Cards, in the Windows 8 RT and Windows 8 Phone releases.
Antenna Placement on Windows Tablet (2011-12)
Enabled best-in-class Wi-Fi network connectivity & performance
- We helped design the antenna placement on tablet devices. Since users hold tablets differently than laptops, existing antenna placement techniques (on the laptop’s screen) are not optimal for tablets. The placement of a user’s hand around the antenna might reduce the signal, and so can the orientation in which the tablet is held. We studied these phenomena in detail – in the wild and in antenna chambers – and made recommendations to the Windows 8 team, which were incorporated in the final design of Windows 8 tablets.
Datacenter TCP (2010-12)
Improved network performance in Data Centers with inexpensive switches
- We designed a new variant of TCP, called Datacenter TCP (DCTCP), to address congestion control issues in datacenter networks. DCTCP leverages Explicit Congestion Notification (ECN) and a simple multi-bit feedback mechanism at the host to reduce application latencies by overcoming network impairments such as queue buildup, buffer pressure, and incast. DCTCP was designed in close collaboration with the Windows Networking Team and it ships in the Windows 8 networking stack. The initial paper (Datacenter TCP) was published in SIGCOMM 2010.
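The multi-bit feedback idea can be sketched with the two update rules from the published DCTCP design: the sender keeps a running estimate, alpha, of the fraction of ECN-marked packets, and shrinks its congestion window in proportion to alpha instead of always halving it. The code below is a minimal illustration of those rules, not the Windows implementation; the function names are hypothetical and the gain `g = 1/16` is an illustrative default.

```python
def update_alpha(alpha, marked_fraction, g=1.0 / 16):
    """Update the running estimate of the fraction of ECN-marked packets.

    marked_fraction: share of packets marked in the last window (0.0 to 1.0).
    g: weight given to the newest sample (illustrative default).
    """
    return (1 - g) * alpha + g * marked_fraction


def reduce_window(cwnd, alpha):
    """Shrink the congestion window in proportion to estimated congestion."""
    return cwnd * (1 - alpha / 2)
```

With `alpha` near 0 (few marks) the window barely shrinks, while `alpha = 1` (every packet marked) halves the window as standard TCP would.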
Virtual Wi-Fi (2009)
Enabled Windows to connect to multiple WLANs simultaneously and offer range extension, concurrent corporate and guest connection, and Internet gateway features
- We designed a technique to virtualize wireless LAN (WLAN) cards. With it, users can concurrently connect to multiple Wi-Fi networks using a single WLAN card, thus enabling several novel scenarios. The original paper (MultiNet: Connecting to Multiple IEEE 802.11 Networks Using a Single Wireless Card) was published in INFOCOM 2004. Our mini-port driver was downloaded by over a hundred thousand developers and was one of Microsoft Research’s most popular software downloads. Virtual Wi-Fi first shipped in Windows 7.
Network Bandwidth Estimation (2004-05)
Enabled Windows to offer a better media streaming experience over Wi-Fi
- We developed a technique (“Probe-Gap”) to estimate the capacity and the available bandwidth of network paths based on end-point measurements. The problem was particularly difficult for cable modems and Wi-Fi networks because they do not have point-to-point links. For example, they employ mechanisms such as token-bucket rate regulation, non-FIFO scheduling, and multiple rates. The initial paper (Bandwidth Estimation in Broadband Access Networks) describing the problem was published in IMC 2004. Probe-Gap first shipped in Windows XP.
NDIS WLAN extensions in Windows 2000, Windows XP, Vista & Windows 7
Elevated Wireless LAN connectivity to a premier consumer networking technology in Windows
- We designed the (first set of) NDIS WLAN OIDs for Windows 2000 and beyond. Prior to our contribution, Windows exposed a wireless LAN network adapter as an Ethernet network adapter. We enhanced the programming interface exposed by the Network Device Interface Specification (NDIS) and WinSock, which then enabled novel wireless-aware and mobility-aware applications.
Network Failure Recovery in Data Centers (2012)
NetPilot reduces the recovery time for common Data Center network failures from a few hours to tens of minutes
- Handling network failures is one of the most challenging tasks for Data Center operators. Unlike the conventional failure diagnosis and repair process, which requires significant human intervention, our NetPilot technology mitigates failures by deactivating or restarting the suspect network devices without needing to know the exact root causes. By enabling automatic failure mitigation, NetPilot dramatically reduces the recovery time for common network failures. Our initial paper (NetPilot: Automating Datacenter Network Failure Mitigation) describing this system was published in SIGCOMM 2012, and NetPilot shipped as part of the Bing Metallica Release in June 2012.
Improving Page Load Time of Bing Searches (2012)
Faster load times lead to a better user experience
- We performed a comprehensive analysis of the page load time in Bing to help uncover and explain strange effects such as Page Load Time (PLT) increase during off-peak hours and the impact of browser population and query type. These insights were used to develop a more precise and detailed alerting tool for PLT degradation. We documented some of our learnings in a SIGCOMM 2013 paper (A provider-side view of Web Search Response Time).
Onset-of-congestion Signaling (2012)
Our congestion prediction technology enables mitigation strategies that lead to better application performance
- In distributed file systems, when one storage node is congested, both read and write traffic can be steered to other replicas and other nodes with empty space. If the onset of such congestion is detected quickly, one can avoid needless queuing lags and improve overall store throughput. We helped design a predictor that uses current load and historical performance to predict the congestion status of storage nodes in Cosmos early. As a side benefit, this also serves as a measure of the application-perceived capacity of the distributed storage layer and a monitor of current usage and hotspots. Our technology has been shipping in Cosmos clusters in Bing since December 2011.
ReOptimizer for Data Parallel Computing (2011)
This technology significantly reduced the response times of large jobs in our Data Centers
- Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generate these plans. Yet, these properties are difficult to estimate due to the highly distributed nature of these frameworks. We built the first reoptimizer for data-parallel jobs. It collects certain code and data properties by piggybacking on job execution and adapts execution plans by feeding these properties to a query optimizer. Our technology shipped in Bing’s Cosmos clusters in December 2011, and it has significantly improved the response times of production jobs.
Mitigating Outliers in Data Parallel Jobs (2009)
- Laggard tasks significantly prolong the completion time of data-parallel jobs. The causes of such outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We built a system that monitors tasks and culls outliers by restarting tasks, placing tasks in a network-aware manner, and protecting the outputs of valuable tasks. The result was a significant improvement in job completion time. Our technology has been in production use across all of the Cosmos clusters in Bing since May 2010.
- To answer some of the basic questions about real workloads we built NetTrace, a network tracing service for large data center clusters. This service collects low-level (socket-level) networking logs and uploads them to Cosmos. Processing the data yields a much better understanding of the traffic patterns of operational workloads and also helps diagnose whether the network or the application is to blame for poor performance. NetTrace has shipped as an Autopilot service in Bing since 2009. We also shipped an analysis suite, and Bing continues to invest in NetTrace: in their June 2012 release, they expanded the types of data captured and lowered the resource consumption of the logger, and they are in the process of rolling it out as an always-on service.
DNS Query Time Optimization (2008)
- We conducted a series of experiments to measure the DNS query resolution time for Bing. Based on these measurements we came up with a set of improvements to our DNS query chain. Working closely with Bing, we deployed these improvements and in the process reduced the median DNS query time by more than half. More importantly, the 95th percentile was cut in half.
Scalable and Consistent Caching (2008)
Enabled MSN web properties to better handle spikes in load (flash crowds)
- The MSN Publishing Platform serves billions of web pages a month. As it grew, scalability bottlenecks started to show up in its previous architecture. The Scalable and Consistent Caching (SCC) technology allowed the team to solve these bottlenecks while maintaining the strict consistency semantics that content publishers expect, such as adding breaking news to a web page and all viewers seeing the updated content.
Partitioning and Recovery Service (2008)
Enabled Live Mesh (now SkyDrive) backend cloud services to scale resiliently
- The Live Mesh data center services need to partition user data across a large number of servers. We designed and built the Partitioning and Recovery Service (PRS), which became their mechanism for doing this. The PRS made the development of the server code easier by providing a number of novel properties, such as strong consistency for soft state and guaranteed notifications to trigger state republishing. Microsoft’s Live Mesh product won CNET’s best technology innovation/achievement award.
Managing Shared Credential Vulnerabilities (2008)
The technology behind the Microsoft Forefront risk analysis and mitigation planning feature
- We developed a technique that evaluates the risk to an organization based on patterns of user privilege and access. Attackers use accounts to compromise machines and use machines to compromise accounts. In the absence of explicit management to mitigate this risk, growth in jumping from one machine to another via a compromised account is exponential. In a test, corroborated by our graph-based analysis, we found that over 70% of the machines investigated yielded at least one account that granted control over 100 other machines on the next hop. Our system performs static analysis and generates pre- and post-incident reports for planning risk mitigation strategies. We shipped our technology in the Access and Security Division’s (ASD) Forefront Product Suite.
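The machine-to-account-to-machine hopping described above can be modeled as a breadth-first search over a graph that alternates between machines and accounts. The sketch below is an illustrative model only, not the shipped analysis; the function name and the two input mappings (`accounts_on`, `admin_of`) are assumed data structures.

```python
from collections import deque


def machines_reachable(start, accounts_on, admin_of, max_hops):
    """Return machines an attacker can control within max_hops of `start`.

    accounts_on: {machine: accounts whose credentials are exposed on it}
    admin_of:    {account: machines that account can control}
    One hop = compromise an account on a machine, then a machine it controls.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        machine, hops = frontier.popleft()
        if hops == max_hops:
            continue  # Do not expand beyond the hop limit.
        for account in accounts_on.get(machine, ()):
            for target in admin_of.get(account, ()):
                if target not in seen:
                    seen.add(target)
                    frontier.append((target, hops + 1))
    return seen - {start}
```

Running this from every machine with `max_hops=1` would give a per-machine count comparable in spirit to the "next hop" figure quoted above.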
XBOX One Wireless Controller Protocol (2013-14)
The wireless protocol between the XBOX One controllers and the console
- We designed the wireless protocol that XBOX accessories use to connect with the XBOX One console. The protocol enables high-throughput and low-latency wireless communication, which is required for gaming and media traffic. Using this protocol, the XBOX One can support more accessories, with higher throughput and lower latency, than any known gaming or home entertainment system currently on the market.
Service Graphs for Large-Scale Network Diagnostics (2012)
Helps meet customer service level agreements (SLAs) by quickly identifying faltering components, reducing down time from days to minutes
- One of the challenges for cloud vendors is to avoid network and system failures that can potentially cripple business users. It is an unfortunate fact that sometimes performance failures do occur and users complain about poor response times. Localizing the source of performance problems in large enterprise and cloud networks is a hard problem because each request depends in complex ways on numerous hardware and software components. We devised a new method to identify these dependencies and record them in an Inference Graph, which we then used to quickly and accurately localize the failure to a small set of possible culprits. Our SIGCOMM 2007 and OSDI 2008 papers show how to narrow down the set of potential dependencies by 100x – 10000x. Our inference graph technology was adopted by the Microsoft Services Engineering Team and shipped as an Azure service called “Service Call Graph 1.0”. Network performance faults that previously took hours or days to diagnose can now be diagnosed in a matter of minutes.