Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, or intrusive modifications to the service.
We propose a proactive approach to identifying machines with latent faults – deviations in behavior that are not yet failures but that indicate current underperformance or foreshadow later failure. We present a novel framework for statistical latent fault detection and demonstrate three outlier detection methods within this framework. The derived tests are domain-independent and unsupervised; they require neither background information nor tuning, and they scale to very large datacenters and clouds. The framework uses only ordinary machine counters – the standard type of data that such services commonly collect. We prove strong guarantees on the false positive rate of our tests.
Our experiments on several large production systems confirm the hypothesis that latent faults are abundant. Moreover, our methods detected failures many days in advance, with a negligible number of false detections.
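
As an illustration of the general idea of flagging machines whose counters deviate from those of their peers, the following is a minimal sketch and not one of the three tests developed in this work. It assumes a single counter sampled on every machine at common time points, scores each machine by its robust deviation from the population median, and flags machines whose average score exceeds a threshold. The function names, the use of the median absolute deviation, and the threshold value are assumptions made for this example, and the sketch carries no guarantee on the false positive rate.

```python
from statistics import median


def outlier_scores(counters):
    """Score each machine by how far it sits from its peers.

    counters[t][m] is the value of one counter on machine m at time t
    (a hypothetical layout chosen for this sketch).
    Returns one score per machine: the mean, over time, of the absolute
    deviation from the population median, scaled by the median absolute
    deviation (MAD) across machines.
    """
    n_machines = len(counters[0])
    totals = [0.0] * n_machines
    for snapshot in counters:
        med = median(snapshot)
        # MAD is a robust spread estimate; fall back to 1.0 when all
        # machines agree, to avoid dividing by zero.
        mad = median(abs(v - med) for v in snapshot)
        scale = mad if mad > 0 else 1.0
        for m, v in enumerate(snapshot):
            totals[m] += abs(v - med) / scale
    return [total / len(counters) for total in totals]


def flag_suspects(counters, threshold=5.0):
    """Return indices of machines whose mean score exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a tuned or
    derived quantity.
    """
    scores = outlier_scores(counters)
    return [m for m, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    # Five machines, four time points; machine 3 consistently deviates.
    samples = [
        [10, 11, 9, 30, 10],
        [12, 11, 12, 35, 13],
        [9, 10, 10, 28, 11],
        [11, 10, 12, 33, 11],
    ]
    print(flag_suspects(samples))
```

The sketch relies only on comparing machines that perform the same role at the same time, so it needs no labeled training data or knowledge of what the counter measures; a fixed threshold, however, is a simplification that a statistically grounded test would replace.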