Automatic Failure Diagnosis in Large-Scale Systems
- Alice Zheng | University of California, Berkeley
As modern computer systems have grown in both size and complexity, so has the need for automatic analysis and computer-aided administration of these systems. With recent gains in computing power and algorithmic efficiency, statistical machine learning methods have become increasingly practical for dealing with the deluge of data these systems generate. In this talk, I present statistical diagnostic platforms for several large-scale systems, focusing on the problem of selecting fault-related components from a long list of potential candidates. Examples include a distributed software monitoring system for automatic debugging and a probing system for detecting failures on clusters of networked computers.
Speaker Details
Alice Zheng received her Ph.D. from UC Berkeley in 2005 and is currently a postdoctoral fellow at Carnegie Mellon University. Her interests lie in applied machine learning, particularly its application to computer systems, software, and networks. Her current projects include statistical software debugging, performance diagnosis of distributed file systems, efficient Internet traffic measurement, and modeling social networks.
- Alice Zheng
- Jeff Running