Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing

SIGCOMM

Organized by ACM

Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today’s data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host server, a remote storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn this complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each team’s Scout acts as its gatekeeper: it routes relevant incidents to the team and routes away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone, currently deployed in production, reduces the time-to-mitigation of 65% of mis-routed incidents in our dataset.