Combining Statistical Monitoring and Predictable Recovery for Self-Management

  • Armando Fox
  • David Patterson
  • Michael Jordan
  • Randy Katz

2004 Workshop on Self-Managed Systems (WOSS'04) in conjunction with ACM SIGSOFT FSE-12 |

Earlier version presented at 2nd Bertinoro Workshop on Future Directions in Distributed Computing (FuDiCo II): Survivability: Obstacles and Solutions, June 2004

Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical networkbased applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system—what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable “new science” in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory techniques (SLT) to problems of system dependability.