Srikanth Kandula

Principal Researcher


I am a principal researcher at Microsoft Research. My interests lie broadly in building and analyzing networked systems. Of late, I have worked on big-data platforms and datacenter networks. I completed my PhD in Computer Science at MIT in 2008.

Current project: Lazy approximations

Past projects: Cluster scheduling, Seawall, Flyways, CloudCmp, Netmedic, Broom, EXpose, VL2, FatVAP, Sherlock, TeXCP, Flare, Kill-Bots


Cluster Resource Management

Established: June 1, 2016

We are focused on building a scale-out, predictable, resource management substrate for big-data workloads.  To this end, we started with providing predictable allocation SLOs for jobs that have completion time requirements, and then focused on improving cluster efficiency. Using Apache Hadoop YARN as the base, we have built a scale-out fabric by composing the following projects: 1. Preemption (YARN-45): We added work-conserving preemption to YARN to improve cluster utilization. 2. Rayon (YARN-1051): We added a…

Cluster scheduling

Established: May 1, 2016

We consider various scheduling problems that arise in large clusters.

Quickr: Cost-Effective Data Analytics at Scale

Established: March 8, 2016

We are inundated with data. Resources to analyze the data are finite and expensive. Approximate answers allow us to explore much larger amounts of data than otherwise possible given available resources. Reducing the cost, if doable for a large fraction of the complex queries that run on this data, is of strategic importance because the savings can be re-invested into more sophisticated algorithms or be used as a key differentiator for analytics-as-a-service offerings. Unfortunately, state-of-art…

Software-Driven Wide Area Networks

Established: February 5, 2014

This project re-imagines and re-engineers wide area networks to more than double their efficiency and allow flexible sharing of resources.


Established: August 22, 2010

Seawall provides new ways to share the network in datacenters.

SideCar

While investigating how we could get explicit feedback from the network middle to the ends for Seawall, we struck upon a way to provide programmability on a modest fraction of all packets flowing through the switches. Packets are redirected, via sampling or marking, to commodity servers that are directly attached to the switches and function as dedicated programmable processors for these packets. We call this the SideCar architecture. Using SideCar,…

NetMedic: Detailed and Understandable Network Diagnosis

Established: August 19, 2010

NetMedic helps operators perform detailed diagnosis in computer networks. It diagnoses not only generic faults (e.g., performance-related) but also application-specific faults (e.g., error codes). It identifies culprits at a fine granularity, such as a process or firewall configuration. Our work focuses on both the algorithmic aspects of detailed diagnosis and the important task of explaining diagnostic reasoning to the operator.

Talks

Detailed and understandable network diagnosis: University of Wisconsin, Nov 2009; Georgia…


Established: January 6, 2007

Overview

Networks are being deployed extensively in large corporations, small offices, and homes. However, a significant number of "pain points" remain for end-users and network administrators. To resolve complaints quickly and efficiently, network administrators need tools that can assist them in detecting, isolating, diagnosing, and correcting faults. Furthermore, such tools should also detect network security breaches, possibly caused by innocent employees. The NetHealth project is about detecting, inferring, diagnosing, and recovering from user-perceived performance…














Discovering Dependencies for Network Management
Victor Bahl, Paul Barham, Richard Black, Ranveer Chandra, Moises Goldszmidt, Rebecca Isaacs, Srikanth Kandula, Lun Li, John MacCormick, Dave Maltz, Richard Mortier, Mike Wawrzoniak, Ming Zhang, in Workshop on Hot Topics in Networks (HotNets-V), Association for Computing Machinery, Inc., November 1, 2006






Broom Tool Kit to Unbias Network Measurements

November 2009



I’ve worked with some amazing interns at MSR.

Sameer Agarwal (Berkeley), Ganesh Ananthanarayanan (Berkeley), Spandana Babbula (IIT Madras), Ivan Bliznets (Steklov Inst.), Mosharaf Chowdhury (Berkeley), Hossein Falaki (UCLA), Jonas Fietz (EPFL), Robert Grandl (EPFL), Dan Halperin (UW), Chi-Yao Hong (UIUC), Anand Iyer (Berkeley), Virajith Jalaparti (UIUC), Xin Jin (Princeton), Gautam Kumar (Berkeley), Ang Li (Duke), Hyeontaek Lim (CMU), Hongqiang Liu (Yale), Zhicheng Liu (GaTech), Yao Lu (UW), Matthaios Olma (EPFL), Ashish Patro (Wisconsin-Madison), Jonathan Perry (MIT), Qifan Pu (Berkeley), Anil Shanbhag (MIT / IIT Bombay), Alan Shieh (Cornell), Aleksandar Vitorovic (EPFL).



  • Quickr’s samplers and QO pushdown rules for samplers ship with ADLA.
  • SWAN’s traffic engineering and approximate fairness logic manages traffic on Microsoft’s inter-datacenter WAN.
  • RoPE’s reoptimization logic has shipped for SCOPE jobs on Cosmos servers since December 2011.
  • Mantri’s outlier mitigation logic has shipped in all Cosmos servers since May 2010. Cosmos is Microsoft’s internal big data service with over 10K machines.
  • Flare: Splitting flowlets over multiple paths. Per Conga, implemented and shipped by Cisco Insieme. Also ships with Windows Server 2012 R2.
  • wcAsync: An asynchronous web traffic generator
  • ospfOpt: Finding optimal weights for OSPF traffic engineering
  • Broom: Unbiasing Internet path measurements