Navendu Jain

Senior Researcher

About

Navendu Jain is a Senior Researcher and Architect at Microsoft Research. His current focus is on (1) automatically analyzing and learning from big data, and (2) designing and implementing data center network architectures and geo-distributed cloud services. Previously, he worked on designing and building distributed networked systems to improve their scalability, reliability and security. He has received several awards at Microsoft and has led teams to wins in machine learning competitions.

Dr. Jain is currently leading the SysSieve project, whose goal is to automatically derive actionable insights from unstructured data. Such data sources are prevalent and include customer feedback, customer portal tickets, incident reports, product reviews, software bug reports and build errors, knowledge base (KB) articles and security reports. By building upon techniques in statistical NLP, Deep Learning, Information Retrieval and Distributed Computing, SysSieve enables data-driven decision making for Microsoft business groups. It is used in daily production across Windows, Skype, Azure, Bing, Office 365, MSIT, and Customer Service and Support.

Dr. Jain and a colleague led the design, implementation and deployment of the NetWiser service, which enables automated real-time analysis of network failures across all of Microsoft’s datacenters. NetWiser was awarded the Microsoft Trustworthy Computing Reliability Award for 2013.

Dr. Jain and his MSR colleagues partnered with the Azure and Bing teams to develop a scalable, agile and cost-effective next-generation data center network using commodity switches. This architecture is now the basis of data center networks in Azure and Bing, and of cloud infrastructures across the industry. The work appeared in ACM SIGCOMM 2009 and has been recognized by ACM as one of “the most important research results published in CS in recent years”. It also appeared as an invited paper in the Research Highlights section of the newly re-formatted Communications of the ACM (CACM). This work has been blogged here.

Dr. Jain’s first project at Microsoft focused on building a low-power, high-performance and low-cost data center cluster based on Intel Atom boards. This work was presented to Bill Gates and covered by The New York Times.

Other

Bio

I’m a Senior Researcher with Microsoft Research, Redmond. I received my Ph.D. in Computer Sciences from the University of Texas at Austin, working with Prof. Mike Dahlin. I received my B.Tech and M.Tech degrees in CSE from IIT Delhi. After IIT, I spent a fun summer visiting IBM Zurich Research Laboratory. My research interests are broadly in cloud computing, data management, machine learning and distributed networked systems. My work has received the Microsoft Trustworthy Computing Award, the Open Source Software Award, the Important to Microsoft project award, the Microsoft Best Learning project award, first rank in the U360 Machine Learning Hackathon, second rank in the Display Ads Machine Learning Competition, the IBM PhD Fellowship and the Microsoft Graduate Fellowship.

Current Projects

SysSieve: Automatically understanding the semantics of human-written free-form text (aka unstructured data) such as software bug reports and build errors in Windows/Windows Phone, knowledge base (KB) articles, Customer Service and Support (CSS) tickets, cloud post-mortems/incident reports, network tickets, and security reports.

  • ConfSeer: Automated detection of software misconfigurations by accurately matching configuration snapshots against Knowledge Base (KB) articles that describe the problems and their solutions in free-form text. [VLDB 2015].
  • NetSieve: Automated problem inference from network trouble tickets to uncover the ‘big picture’ of network problems and to develop best practices for their fast and accurate resolution. [NSDI 2013].

Cloud Network Security: Building scalable, accurate, and real-time network attack detection and mitigation services to protect cloud infrastructure and hosted tenants.

  • Measurement study of cloud attacks: Characterizing attacks on the cloud in terms of their scale, diversity, frequency, origin, targeted services and attack vectors. [IMC 2015].

Prior Projects

NetWiser: Building scalable, cost-efficient, agile, and reliable network architecture for next-generation data centers.

  • Service impact of intra-dc and inter-dc network failures: A field study on understanding how failures at the intra-dc level (Top-of-Rack switches, Aggregation switches and Access Routers) and at the inter-dc level (long-haul WAN links) impact availability of online services, and deriving best practices to improve service availability. [SoCC 2013, SIGMETRICS 2013 (Extended Abstract)].
  • Middlebox reliability analysis: Characterizing the reliability of middleboxes in datacenters such as load balancers, firewalls, intrusion detection and prevention systems, and VPNs, and analyzing their implications to improve middlebox reliability. [IMC 2013].
  • Network failure characterization: Understanding network failures in data centers by analyzing failure incidents and correlating them with network traffic, estimating impact of failures, and deriving implications for designing future network architectures. [SIGCOMM 2011].
  • VL2: A scalable and flexible data center network architecture for hundreds of thousands of servers, built from commodity switches. It enables high bisection bandwidth between all communicating server pairs and agility in mapping any service to any server, and it degrades gracefully under failures [SIGCOMM 2009, CACM 2011].

Marlowe: Automated and adaptive resource management in data centers.

  • Cloud Auto-scaling: Automated scale-out/in of batch workloads on the cloud to minimize the execution cost and the job completion time. [SPAA 2013].
  • URSA: Scalable load balancing and power management for large-scale cluster storage systems that aims to alleviate hot-spots while minimizing reconfiguration costs [Middleware 2011, TOS 2012].
  • WAVE: Topology-Aware VM Migration in Bandwidth Oversubscribed Datacenter Networks [ICALP 2012].
  • ACES: An adaptive power controller that manages the cost, performance, and reliability tradeoffs for energy-aware server provisioning [INFOCOM 2011].
  • Volley: Automated data placement for cloud services across geographically distributed data centers [NSDI 2010].
  • CloudSeer: Integrating Monitoring and Policy Enforcement for Cloud-Hosted Applications.

Cloud Chakra (C2): Developing new pricing and application management frameworks for cloud services across geo-distributed data centers.

  • Batch job pricing and scheduling: A new pricing model and a truthful-in-expectation mechanism that performs efficient resource allocation for executing batch applications on cloud computing systems [TOPC 2014, ICAC 2014, TOCS 2012, SAGT 2011, SPAA 2012].
  • EOA: Online job migration algorithms for reducing the electricity bill of running cloud services across multiple data centers [Networking 2011].

Awards and Honors

  • 2nd rank in the Machine Learning Competition on recommending movies (maximizing NDCG) for Bing users, along the lines of the Netflix competition (2016)
  • 1st rank in the Microsoft Machine Learning Competition on predicting metrics in the financial sector (2016)
  • 2nd rank in the Display Ads Machine Learning Competition across 68 teams (2015)
  • 3rd rank in the Office 365 Ticket Routing Machine Learning Competition across 49 teams (2015)
  • The ‘Important to Microsoft’ project award (2014)
  • 1st rank in the U360 Machine Learning Competition across 50+ teams (2014)
  • Best Paper Award Runner-up at the ACM Internet Measurement Conference (2013)
  • Microsoft Trustworthy Computing Reliability Award (2013)
  • Invited paper to CACM (Communications of the ACM) titled “VL2: A Scalable and Flexible Data Center Network” (2011)
  • University Co-op Open Source Software Award (2008)
  • IBM Ph.D. Fellowship (2007-2008)
  • Microsoft Graduate Merit Scholarship (2001-2002)
  • Institute Merit Award and Scholarship, IIT Delhi (2000-2002)
  • Outstanding Project Award of the Year, IBM India Research Lab (2000)
  • Top 0.1% in the IIT-JEE examination among 100,000+ students (1997)
  • Top 0.1% All-India AISSE Merit Award, Government of India (1995)

Technology Transfers

Technology Transfers to Microsoft Business Groups

We work closely with several business groups in Microsoft, and we’ve been fortunate that some of our research has been incorporated in the following engineering innovation efforts (which have been publicly disclosed):

  • SysSieve: SysSieve is an automated inference system that aims to derive actionable insights from free-form text. Such unstructured data sources are prevalent and include customer feedback, customer portal tickets, incident reports, product reviews, software bug reports and build errors, knowledge base (KB) articles and security reports. By systematically analyzing these important (yet incredibly noisy) data sources, SysSieve enables data-driven decision making for Microsoft business groups. SysSieve combines statistical natural language processing (NLP), knowledge representation, ontology modeling, and machine learning to achieve these goals. SysSieve was integral in identifying two big issues in Skype, which the Skype team then fixed, improving the customer experience.
  • Publications on SysSieve:
    • ConfSeer: Automated detection of software misconfigurations by accurately matching configuration snapshots against Knowledge Base (KB) articles that describe the problems and their solutions in free-form text. [VLDB 2015].
    • NetSieve: Automated problem inference from network trouble tickets to uncover the ‘big picture’ of network problems and to develop best practices for their fast and accurate resolution. [NSDI 2013].
  • NetWiser: NetWiser is a first-of-its-kind scalable service that enables automated real-time correlation and analysis of network failures across multiple datacenters. Specifically, the business groups use the NetWiser dashboard to answer three key questions: (1) Is a network problem causing a service outage? Did redundancy work? (2) Can we localize the fault and get details about the problem to perform fast troubleshooting? and (3) How can alarms be correlated to identify high severity outages? NetWiser has been featured here.
  • ConfSeer configuration diagnosis service: The configuration diagnosis service previously relied on human-defined rules to detect configuration errors in software (e.g., Exchange, Lync, SharePoint, SQL Server) deployed on customer machines. This human-driven process was expensive and time-consuming, and it produced only a limited number of expert rules. In collaboration with the Windows System Center and Advisor team, we developed a scalable learning engine running in production (service URL) that automatically analyzes the technical solutions in Knowledge Base (KB) articles and detects misconfigurations with high accuracy and in near real-time.
  • Reliability Analysis Framework: Our reliability analysis framework has been used to analyze network telemetry in order to (a) make key decisions on network capacity upgrades, (b) build network domains for a major online service to deliver >99.99% availability, (c) compare reliability across device platforms and vendors, and (d) perform root-cause analysis of high-impact network failures.
  • Flat, commodity-switch based datacenter networks: The VL2 project laid the foundation for the flat, agile, commodity-switch based datacenter networks that have been deployed in Windows Azure and Bing. The key contributions of VL2 were: (a) building an overlay network on top of the physical topology by separating infrastructure addresses from application addresses, (b) applying traffic-oblivious routing to improve link utilization while avoiding out-of-order delivery and congestion, (c) designing and building a scalable directory service on top of Paxos that provides address resolution and access control to enable dynamic scale-out/in for applications, and (d) analyzing datacenter network failures and deriving their implications for building a fault-tolerant datacenter network. The paper appeared in SIGCOMM 2009 and has been recognized by ACM as one of “the most important research results published in CS in recent years”. It also appeared as an invited paper in the Research Highlights section of the Communications of the ACM (CACM). This work has been featured here.

Professional Service

Judge

  • 2016: ACM Student Research Competition Grand Finals
  • 2015: ACM Student Research Competition Grand Finals
  • 2014: ACM Student Research Competition Grand Finals
  • 2013: ACM Student Research Competition Grand Finals
  • 2012: ACM Student Research Competition Grand Finals
  • 2011: SIGCOMM Posters and Demo session
  • 2009: Open Source Software Award

Program Committee

  • 2016: STACS 2016 (Invited Reviewer)
  • 2014: SODA 2014 (Invited Reviewer), NSF Panel Reviewer, NSERC Reviewer, WPBA 2014, SPAA 2014 (Invited Reviewer)
  • 2013: IWQoS 2013, NSERC reviewer
  • 2012: CCSW 2012, SMTPS 2012
  • 2011: ICDE 2011, SMTPS 2011, DISC 2011 (External reviewer)
  • 2010: Eurosys 2010 (Shadow PC), IWSC 2010, DEBS 2010, SMTPS 2010
  • 2009: DEBS 2009, SMTPS 2009
  • (Please submit your best papers!)

Journal Reviewer

  • IEEE/ACM Transactions on Networking, ACM Transactions on Computer Systems, Journal of Supercomputing, IEEE Transactions on the Cloud, IEEE Transactions on Systems, IEEE Transactions on Dependable and Secure Computing, IEEE Transactions on Knowledge and Data Engineering, Journal of Computer Networks and ISDN Systems, IEEE Transactions on the Web, IEEE Transactions on Parallel and Distributed Systems, KSII Transactions on Internet and Information Systems.