Capturing and Enhancing In Situ System Observability for Failure Detection

Operating Systems Design and Implementation (OSDI) |

Published by USENIX

Real-world distributed systems suffer unavailability due to various types of failure. But, despite enormous effort, many failures, especially gray failures, still escape detection. In this paper, we argue that the missing piece in failure detection is detecting what the requesters of a failing component see. This insight leads us to the design and implementation of Panorama, a system designed to enhance system observability by taking advantage of the interactions between a system’s components. By providing a systematic channel and analysis tool, Panorama turns a component into a logical observer so that it not only handles errors, but also reports them. Furthermore, Panorama incorporates techniques for making such observations even when indirection exists between components. Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7 s, whereas existing approaches detect only one of them in under 300 s.