Minimizing Faulty Executions of Distributed Systems

March 23, 2016
Colin Scott | UC Berkeley

When a bug is found in a long-running distributed system, developers typically start by identifying (i) which events in the execution caused their system to arrive at the unsafe state, and (ii) which events are irrelevant. This process of troubleshooting can be highly time-consuming, as developers might spend hours poring over multigigabyte traces containing thousands of events.

I will present a tool that reduces effort spent on troubleshooting distributed systems, by automatically eliminating events from a given buggy execution that are not causally related to the bug at hand. The developer is left with a `minimal causal sequence’ of triggering events, each of which is necessary for triggering the bug. We claim that the greatly reduced size of the trace makes it easier for the developer to figure out which code path contains the underlying bug, allowing them to focus their effort on the task of fixing the problematic code itself. In the talk I formally define the execution minimization problem, the algorithms we have developed to solve it, and the system we have built to apply our techniques in practice. The examples I present are drawn from real bugs we have found and minimized in Raft and Spark. Our full paper draft is available here (opens in new tab).