HALO: Hierarchy-aware Fault Localization for Cloud Systems

SIGKDD 2021

A typical cloud system generates a large amount of telemetry data, collected by pervasive software monitors that continuously track the health status of the system. The telemetry data is essentially multi-dimensional data that contains attributes and the failure/success status of the system being monitored. By identifying the attribute value combinations where the failures are mostly concentrated (which we call fault-indicating combinations), we can narrow the cause of system failures to a smaller scope, thus facilitating fault diagnosis. However, due to the combinatorial explosion problem and the latent hierarchical structure in cloud telemetry data, it is still intractable to localize the fault to a proper granularity in an efficient way. In this paper, we propose HALO, a hierarchy-aware fault localization approach for locating the fault-indicating combinations from telemetry data. Our approach automatically learns the hierarchical relationship among attributes and leverages the hierarchy structure for precise and efficient fault localization. We have evaluated HALO on both industrial and synthetic datasets, and the results confirm that HALO outperforms the existing methods. Furthermore, we have successfully deployed HALO to different services in Microsoft Azure and Microsoft 365 and have witnessed its impact in real-world practice.
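
To make the notion of a fault-indicating combination concrete, the following is a minimal Python sketch and not HALO's algorithm. It brute-forces every attribute value combination up to a small size, scores each by failure rate times failure coverage, and checks whether one attribute is nested under another. The attribute names (`region`, `datacenter`, `os`), the toy records, the scoring formula, and the helper names `rank_combinations` and `is_parent_of` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter
from itertools import combinations


def rank_combinations(records, attributes, max_size=2):
    """Score attribute-value combinations by how concentrated failures are.

    Illustrative scoring (an assumption, not taken from the paper):
    score = failure_rate * coverage, with ties broken in favor of
    combinations that use fewer attributes.
    """
    total_failures = sum(1 for r in records if r["failed"])
    totals, failures = Counter(), Counter()
    for r in records:
        for k in range(1, max_size + 1):
            for attrs in combinations(attributes, k):
                key = tuple((a, r[a]) for a in attrs)
                totals[key] += 1
                if r["failed"]:
                    failures[key] += 1
    scored = []
    for key, n_fail in failures.items():
        failure_rate = n_fail / totals[key]   # share of matching records that failed
        coverage = n_fail / total_failures    # share of all failures this key explains
        scored.append((key, failure_rate, coverage))
    scored.sort(key=lambda s: (-s[1] * s[2], len(s[0])))
    return scored


def is_parent_of(records, parent, child):
    """Return True if every child value maps to exactly one parent value."""
    seen = {}
    for r in records:
        if seen.setdefault(r[child], r[parent]) != r[parent]:
            return False
    return True


def make(region, datacenter, os_name, failed, count):
    """Build `count` identical toy telemetry records (hypothetical schema)."""
    row = {"region": region, "datacenter": datacenter, "os": os_name, "failed": failed}
    return [dict(row) for _ in range(count)]


# Toy telemetry: every failure occurs in datacenter dc2, regardless of OS.
records = (
    make("east", "dc1", "A", False, 10) + make("east", "dc1", "B", False, 10)
    + make("west", "dc2", "A", True, 5) + make("west", "dc2", "B", True, 5)
    + make("west", "dc3", "A", False, 10) + make("west", "dc3", "B", False, 10)
)
attributes = ["region", "datacenter", "os"]
ranked = rank_combinations(records, attributes)

for key, failure_rate, coverage in ranked[:3]:
    print(key, round(failure_rate, 2), round(coverage, 2))
print("region is parent of datacenter:", is_parent_of(records, "region", "datacenter"))
```

In this toy data the single-attribute combination `datacenter=dc2` ranks first, while `region=west` is too coarse (it also matches the healthy `dc3`) and `datacenter=dc2, os=A` is too fine (it explains only half of the failures). The enumeration grows combinatorially with the number of attributes, which is the explosion the abstract refers to. Knowing that `datacenter` is nested under `region` is the kind of hierarchical relationship that could be used to skip redundant combinations such as `region=west, datacenter=dc2`.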