Lazy Diagnosis of In-Production Concurrency Bugs

Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP) |

Published by ACM

Diagnosing concurrency bugs—the process of understanding the root causes of failures—is hard. Developers depend on reproducing concurrency bugs to diagnose them. Traditionally, systems that attempt to reproduce concurrency bugs record fine-grained thread schedules of events (e.g., shared memory accesses) that lead to failures. Recording schedules incurs high runtime performance overhead and scales poorly, making existing techniques unsuitable in production.

In this paper, we formulate the coarse interleaving hypothesis, which states that events leading to concurrency bugs are coarsely interleaved. Therefore, a fine-grained and expensive recording is unnecessary for diagnosing concurrency bugs. We test the coarse interleaving hypothesis by studying 54 bugs in 13 systems and find that it holds in all cases. In particular, the time elapsed between events leading to concurrency bugs is on average 5 orders of magnitude greater than what is used today in fine-grained recording.

Using the coarse interleaving hypothesis, we develop Lazy Diagnosis, a hybrid dynamic-static interprocedural pointer and type analysis to diagnose the root causes of concurrency bugs. Our Lazy Diagnosis prototype, SNORLAX, relies on commodity hardware to track thread interleavings at a coarse granularity. SNORLAX does not require any source code changes and can diagnose complex concurrency bugs in real large-scale systems (MySQL, httpd, memcached, etc.) with full accuracy and an average runtime performance overhead of below 1%. Broadly, we believe that our findings can be used to build more efficient in-production bug detection and record/replay techniques.