Abstract

Distributed computing systems can suffer from occasional catastrophic violations
of performance goals; due to the complexity of these systems, manual diagnosis of the
cause of such a crisis is prohibitively costly. Recognizing the recurrence of a problem automatically
can lead to cause diagnosis and/or informed intervention. We frame this as an online
clustering problem, where the labels (causes) of some of the previous crises may be
known. We give an effective solution using model-based clustering with a Dirichlet
process mixture; the evolution of each crisis is modeled as a multivariate time series.
We perform fully Bayesian inference on clusters, giving a method for efficient online
computation. Such inferences allow for online expected-cost-minimizing decision
making in the distributed computing context. We apply our methods to Microsoft’s
Exchange Hosted Services.