This talk will review our ongoing work on unsupervised latent fault detection in large-scale data centers, such as those used for cloud services, supercomputers, and compute clusters.
Modern data centers comprise hundreds or thousands of machines (or more!). With so many machines, failures are commonplace, so failure detection is crucial: undetected failures may lead to data loss and outages. Traditional fault detection techniques are often supervised, relying on domain knowledge and precious (often unavailable) training data, and are inflexible. More recent approaches focus on early detection and handling of performance problems, or latent faults. These faults “fly under the radar” of existing detection systems because they are not acute enough or were not anticipated by maintenance engineers.
We will first discuss unsupervised latent fault detection in scale-out, load-balanced cloud services. We present a novel framework for statistical latent fault detection that uses only ordinary machine counters collected as standard practice, and demonstrate three detection methods within this framework. The derived tests are adaptive, domain-independent, and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests. Our evaluation on a large, real-world production service shows that at least 20% of machine or software failures were preceded by latent faults. We further show that our latent fault detector can anticipate failures up to 14 days ahead, with high precision and a very low false positive rate.
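To make the peer-comparison idea concrete, here is a minimal sketch, assuming that machines in a load-balanced service should report statistically similar counters at each point in time. It is not one of the three detection methods from the talk and carries none of their guarantees: the function names, the robust z-score, and the threshold multiplier `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def latent_fault_scores(counters):
    """Score each machine by how consistently it deviates from its peers.

    counters: array of shape (machines, time_points, counters).
    Returns one non-negative score per machine (hypothetical scoring rule).
    """
    x = np.asarray(counters, dtype=float)
    # Peer baseline: median over machines, per time point and counter.
    median = np.median(x, axis=0, keepdims=True)
    # Robust spread over machines; guard against zero spread.
    mad = np.median(np.abs(x - median), axis=0, keepdims=True)
    mad = np.where(mad == 0, 1.0, mad)
    deviation = (x - median) / mad
    # Averaging over time cancels random noise but keeps a persistent bias.
    mean_deviation = deviation.mean(axis=1)  # shape: (machines, counters)
    return np.linalg.norm(mean_deviation, axis=1)


def flag_outliers(scores, k=5.0):
    """Return indices of machines whose score is far above the typical score.

    k is an illustrative multiplier, not a value taken from the talk.
    """
    scores = np.asarray(scores, dtype=float)
    center = np.median(scores)
    spread = np.median(np.abs(scores - center))
    if spread == 0:
        spread = 1.0
    return np.flatnonzero(scores > center + k * spread)
```

Calling `flag_outliers(latent_fault_scores(data))` on an array of counter readings returns the indices of suspicious machines. The sketch needs no labels and no per-counter domain knowledge, which is the property the abstract emphasizes, but its threshold is heuristic rather than derived from a bound on the false positive rate.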
The second part of the talk will briefly present a communication-efficient variant designed for online outlier detection in distributed data streams. Our offline framework has large bandwidth and processing requirements. Using stream processing techniques that trade accuracy for communication and computation, we present an adapted latent fault detector that reduces bandwidth costs by an order of magnitude with less than 1% error compared to the original algorithm.
Finally, we’ll discuss current work that addresses latent fault detection for unbalanced workloads, such as map-reduce jobs and compute clusters.
One new scheme, based on Principal Components Analysis, retains the advantages of our previous methods: it is unsupervised, robust to changes, and statistically sound. Preliminary evaluation on supercomputer logs shows that the new method is able to correctly predict some failures, while our previous methods completely fail in this setting. We also present a preliminary evaluation showing good performance on virtual machines running Hadoop and CassandraDB. If time allows, we’ll also touch on another scheme for opaque VMs, based on a sparse decomposition approach.
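As a rough illustration of how Principal Components Analysis can expose anomalies without labels, the sketch below scores each sample by its squared residual outside the top principal components. This is a generic PCA residual technique under stated assumptions, not necessarily the scheme presented in the talk: the function name `pca_residual_scores` and the parameter `n_components` are hypothetical, and the number of components must be chosen by the user.

```python
import numpy as np


def pca_residual_scores(samples, n_components):
    """Squared distance of each sample from the top principal subspace.

    samples: array of shape (n_samples, n_features), e.g. counter vectors.
    n_components: number of principal components treated as normal behavior
                  (an illustrative parameter, chosen by the caller).
    """
    x = np.asarray(samples, dtype=float)
    # Standardize features so that no single counter dominates the variance.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    z = (x - mean) / std
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    basis = vt[:n_components]
    # Project onto the principal subspace and measure what is left over.
    residual = z - z @ basis.T @ basis
    return np.sum(residual ** 2, axis=1)
```

Samples that follow the dominant correlations among counters get a residual near zero, while a sample that breaks those correlations gets a large one. Unlike a direct peer comparison, this does not require every machine to carry the same load, only that the relationships among counters stay stable.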