Manya Ghobadi

Researcher

About

I am a researcher in the Mobility and Networking Research group at Microsoft Research. My research interests are in the general area of computer networking and systems, including data center networking, optical networks, transport protocols, and hardware-software co-design. I am interested in designing new networking paradigms as well as building systems and experimenting with them. I started my career as a software engineer on Google’s data center team (aka the Platforms group) before joining MSR.


News:


Recent Projects:

Programmable FPGA-based NICs [HotNets’17-1]

Recently, there has been a surge in the adoption of FPGA-based NICs, which offer a programmable environment for hardware acceleration of network functions. This presents a unique opportunity to enable programmable congestion control, as new algorithms are introduced both by humans and by machine learning techniques. To realize this vision, we proposed implementing the entire congestion control algorithm in programmable NICs. We identified the absence of hardware-aware programming abstractions as the most immediate challenge and addressed it with a high-level domain-specific language. Our language lies at a sweet spot between the ability to express a broad set of congestion control algorithms and efficient hardware implementation. It offers a set of hardware-aware congestion control abstractions that enable operators to specify their algorithms without having to worry about low-level hardware primitives. Our code is available on GitHub.
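As a rough illustration of the kind of abstraction the language targets, the Python sketch below expresses a toy AIMD algorithm purely as event handlers. It is not our DSL (which compiles to the NIC hardware); all names in it are hypothetical and chosen only to convey the idea of specifying behavior above the level of hardware primitives.

# Illustrative sketch only: a toy, event-driven congestion control
# specification in Python. The actual DSL in the HotNets'17 paper is
# different; this only conveys the idea of expressing an algorithm as
# high-level event handlers rather than low-level hardware primitives.

class AIMD:
    """Additive-increase / multiplicative-decrease expressed as event handlers."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.cwnd = 10 * mss  # congestion window in bytes

    def on_ack(self, acked_bytes):
        # Additive increase: roughly one MSS per round trip.
        self.cwnd += self.mss * acked_bytes / self.cwnd

    def on_loss(self):
        # Multiplicative decrease.
        self.cwnd = max(self.mss, self.cwnd / 2)


if __name__ == "__main__":
    cc = AIMD()
    for _ in range(100):
        cc.on_ack(1460)
    cc.on_loss()
    print(f"cwnd after 100 ACKs and one loss: {cc.cwnd:.0f} bytes")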

 

Dynamic Link Capacities in the Wide-Area Network [HotNets’17-2] [OFC’16] [JOCN’16]

Optical communication is the workhorse of modern systems. Today, nearly all wide-area and data center communications are carried over fiber-optic equipment, making optics a billion-dollar industry. We analyzed the signal-to-noise ratio (SNR) of over 2,000 optical wavelengths in Microsoft’s backbone over 2.5 years. We showed that the capacity of 80% of the links could be augmented by 75% or more, leading to an overall capacity gain of 145 Tbps without touching the fiber or amplifiers. Inspired by wireless networks, we also showed that link failures are not always binary events. In fact, some failures are degradations of the SNR rather than a complete loss of light, and instead of outages they can be treated as brief link flaps in which the capacity is adjusted to match the new SNR. Based on these results, and because of the significant cost savings offered by this work, Microsoft decided to stop purchasing transceivers at a fixed 100 Gbps capacity. The company is now installing bandwidth-variable transceivers that can vary the modulation between 100, 150, and 200 Gbps depending on the SNR of the fiber path.
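The Python sketch below illustrates the basic idea of a controller picking the highest capacity the measured SNR can support. The thresholds are hypothetical placeholders; the real requirements depend on the transceivers and the fiber plant, and this is not the logic used in Microsoft’s backbone.

# Illustrative sketch only: mapping a wavelength's measured SNR to a
# modulation/capacity setting on a bandwidth-variable transceiver.
# The SNR thresholds below are hypothetical placeholders.

CAPACITY_LADDER = [
    (18.0, 200),  # SNR (dB) assumed for 200 Gbps (hypothetical)
    (14.0, 150),  # SNR (dB) assumed for 150 Gbps (hypothetical)
    (10.0, 100),  # SNR (dB) assumed for 100 Gbps (hypothetical)
]

def select_capacity(snr_db, margin_db=1.0):
    """Return the highest capacity whose SNR requirement (plus margin) is met."""
    for required_snr, gbps in CAPACITY_LADDER:
        if snr_db >= required_snr + margin_db:
            return gbps
    return 0  # below the lowest threshold: treat as an outage

if __name__ == "__main__":
    for snr in (20.5, 15.2, 11.0, 8.3):
        print(f"SNR {snr:4.1f} dB -> {select_capacity(snr)} Gbps")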

 

Optical Links in Data Centers [SIGCOMM’17] [NSDI’17]

While there are many proposals to understand the efficiency of data center networks, little attention has been paid to the role played by the physical links that carry packets. We conducted a large-scale study of millions of operational optical links across all of Microsoft’s data centers. Our analysis was the first in the community to show that data center links are massively over-engineered: 99.9% of the links have an incoming optical signal quality that is higher than the IEEE standard threshold, and the median is 6 times higher. Motivated by this observation, we proposed using transceivers in practice at distances beyond their IEEE-specified reach. Our analysis has opened the door to relaxed specifications in transceiver design by showing that commodity transceivers can be used at distances up to four times greater than IEEE specifies. We further correlated this data with hundreds of repair-ticket logs from data center field operators and found that a significant source of packet loss can be traced to packet corruption caused by dirty connectors, damaged fibers, or malfunctioning transceivers. To alleviate this issue, we designed a recommendation engine that suggests link repairs by learning the common symptoms of the different root causes of corruption. Our recommendation engine is deployed across all Azure data centers worldwide with 85% recommendation accuracy.
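As a simplified illustration of the idea (the deployed engine learns symptom-to-root-cause associations from ticket logs; the rules and thresholds below are made up), a rule-based recommender might look like this Python sketch:

# Illustrative sketch only: a toy rule-based mapping from observed link
# symptoms to a repair action. The rules and thresholds here are
# hypothetical placeholders, not the deployed engine's learned model.

def recommend_repair(rx_power_dbm, corruption_rate, neighbor_links_corrupting):
    """Return a suggested repair action for a corrupting optical link."""
    if neighbor_links_corrupting:
        # Several links corrupting together often points to a shared cause
        # such as a dirty or damaged connector on a common patch panel.
        return "clean or reseat the shared connector / patch panel"
    if rx_power_dbm < -14.0:
        # Low received power with corruption suggests a damaged fiber.
        return "inspect and replace the fiber segment"
    if corruption_rate > 1e-6:
        # Adequate power but persistent corruption points at the transceiver.
        return "swap the transceiver"
    return "no action: corruption below actionable threshold"

if __name__ == "__main__":
    print(recommend_repair(rx_power_dbm=-16.2, corruption_rate=5e-7,
                           neighbor_links_corrupting=False))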

 

Programmable Data Center Interconnect [SIGCOMM’16] (Presentation and demo are available online)

In this work, we make a radical departure from present norms in building data center networks by removing all cables above Top-of-Rack (ToR) switches. Our design, ProjecToR, uses free-space optics to provide one-hop connectivity between ToR switches in the data center by disaggregating transmit and receive elements. A ProjecToR interconnect has a fan-out of 18,000 ports (60× higher than current optical switches) and can switch between ports in 12 µs (2,500× faster than current optical switches). Its high fan-out and agility are enabled by digital micromirror devices (DMDs), commodity Texas Instruments products that are ubiquitous in digital projection technology, together with disco-ball-shaped mirror assemblies. A remarkable advantage of our optical setup is that it provides a “sea” of transmitters and receivers that can be linked in a multitude of ways, creating a scheduling and traffic-routing problem akin to traditional switch scheduling. We proposed an asynchronous, decentralized scheduling algorithm that is provably within a constant factor of an optimal oracle able to predict traffic demands (proofs available here).
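For intuition only, the toy Python sketch below greedily matches free transmitters to receivers by queued demand. The actual ProjecToR scheduler is asynchronous and decentralized with a provable competitive ratio; this centralized greedy pass is an assumption-laden simplification of the matching problem, not the paper’s algorithm.

# Illustrative sketch only: a toy centralized, greedy matching of free
# transmitters to receivers by pending demand, one link per transmitter
# and per receiver.

def greedy_schedule(demand):
    """demand[(tx, rx)] = queued bytes; return a list of (tx, rx) links to configure."""
    chosen, used_tx, used_rx = [], set(), set()
    # Serve the largest queues first.
    for (tx, rx), queued in sorted(demand.items(), key=lambda kv: -kv[1]):
        if queued > 0 and tx not in used_tx and rx not in used_rx:
            chosen.append((tx, rx))
            used_tx.add(tx)
            used_rx.add(rx)
    return chosen

if __name__ == "__main__":
    demand = {("tx1", "rx2"): 900, ("tx1", "rx3"): 400,
              ("tx2", "rx2"): 700, ("tx2", "rx1"): 100}
    print(greedy_schedule(demand))  # [('tx1', 'rx2'), ('tx2', 'rx1')]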

 

Risk-aware Routing [IMC’16] (Best dataset award)

We analyze optical-layer outages in a large backbone, using more than a year of data from thousands of optical channels carrying live IP-layer traffic. Our analysis uncovers several findings that can help improve network management and routing. For instance, we find that optical links have a wide range of availabilities, which calls into question the common assumption in fault-tolerant routing designs that all links have equal failure probabilities. We also find that by monitoring changes in optical signal quality (not visible at the IP layer), we can better predict future outages probabilistically. Our results suggest that backbone traffic engineering should take current and past optical-layer performance into account and that route computation should be based on the outage-risk profile of the underlying optical links. The dataset is publicly available and is unique across the optics and systems communities, which was recognized by the best dataset award at the ACM Internet Measurement Conference in 2016.
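One common way to fold such availability measurements into route computation (an illustration of the direction, not the paper’s exact traffic-engineering method) is to weight each link by -log(availability), so that a shortest path maximizes the product of link availabilities. A minimal Python sketch, with hypothetical availability numbers:

# Illustrative sketch only: shortest path under -log(availability) weights,
# which maximizes end-to-end path availability.

import heapq
import math

def most_available_path(graph, src, dst):
    """graph[u] = [(v, availability), ...]; returns (path_availability, path)."""
    pq = [(0.0, src, [src])]          # (sum of -log(availability), node, path)
    best = {src: 0.0}
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return math.exp(-cost), path
        for nbr, avail in graph.get(node, []):
            new_cost = cost - math.log(avail)
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(pq, (new_cost, nbr, path + [nbr]))
    return 0.0, []

if __name__ == "__main__":
    # Availabilities are hypothetical; in practice they would be derived
    # from measured optical-layer outage logs.
    graph = {"A": [("B", 0.999), ("C", 0.95)],
             "B": [("D", 0.999)],
             "C": [("D", 0.999)]}
    print(most_available_path(graph, "A", "D"))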

 


Past Projects:


Publications

Other

RAIL is a proposal to stretch transceivers’ reach beyond the IEEE standard. To explore the parameter space at fine granularity and to eliminate hardware quality differences between manufacturers, we use VPI, a standard optical simulator for data transmission systems. The simulation models are provided above for the community to use to simulate optical links in data center networks.

Gnome-Screen integrates GNOME Terminal with GNU Screen, a console-based screen manager that multiplexes multiple interactive shells onto a single terminal. Featured in the GNOME annual report 2006 (pp. 18-20).

 

Talks

  • Center for Networking Systems lecture series (UCSD), Feb. 2017

Peeking into light: enabling physical layer innovations

Also at: UCSB (Feb 2017), UMass Amherst (March 2017), Harvard (March 2017)

  • Center for Integrated Access Networks (CIAN), Oct. 2016

A Look at the Optical Layer of Cloud Networks

Students

I’ve worked with the following students. Drop me a note if you are interested in working with me at MIT.

  • Maria Apostolaki (ETH Zurich), Microsoft Research
    Project: Cloud traffic characterization in Azure data centers.
  • Mina Tahmasbi Arashloo (Princeton University), Microsoft Research
    Project: Programmable NICs in Azure data centers [HotNets’17-1].
  • Rachee Singh (UMass Amherst), Microsoft Research
    Project: Programmable bandwidth in wide-area backbones [HotNets’17-2, SIGCOMM’18].
  • Danyang Zhuo (University of Washington), Microsoft Research
    Projects: CorrOpt – Analysis of packet corruption in data center networks [SIGCOMM’17]
    RAIL – Inexpensive optics in the data center [NSDI’17]
  • Denis Pankratov (University of Chicago) and Radhika Mittal (UC Berkeley), Google
    Project: TIMELY – RDMA Congestion control [SIGCOMM’15]
  • Nanxi Kang (Ph.D. student at Princeton University), Google
    Project: Niagara – Load-balancing in software-defined networks [CoNEXT’15]

Professional Activities