Efficient and Scalable Thread-Safety Violation Detection: Finding thousands of concurrency bugs during testing

ACM Symposium on Operating Systems Principles (SOSP)

Organized by ACM

Best paper award, ACM SOSP 2019

Concurrency bugs are hard to find, reproduce, and debug. They often escape rigorous in-house testing but cause large-scale outages in production. Unfortunately, existing concurrency bug detection techniques cannot be part of industry’s integrated build and test environment because of several open challenges: how to handle code developed by thousands of engineering teams that uses a wide variety of synchronization mechanisms, how to report few or no false positives, and how to avoid excessive consumption of testing resources.

This paper presents TSVD, a thread-safety violation detector that addresses these challenges through a new design point in the domain of active testing. Unlike previous techniques that inject delays randomly or employ expensive synchronization analysis, TSVD uses lightweight monitoring of the calling behaviors of thread-unsafe methods, not of any synchronization operations, to dynamically identify bug suspects. It then injects corresponding delays to drive the program towards thread-unsafe behaviors, actively learns from its ability or inability to do so, and persists its learning from one test run to the next. TSVD is deployed and regularly used at Microsoft, and it has already found over 1000 thread-safety violations from thousands of projects. It detects more bugs than state-of-the-art techniques, mostly with just one test run.
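
The monitor, delay, learn, and persist loop described in the abstract can be illustrated with a toy sketch. This is not TSVD's implementation. The class name DelayInjectingDetector, the on_call hook, the near-miss time window used to pick suspects, the max_misses cutoff used for learning, and the JSON file format are all assumptions made for illustration. The sketch also treats every monitored call as conflicting, without distinguishing reads from writes.

```python
import json
import threading
import time


class DelayInjectingDetector:
    """Toy detector: every name and threshold here is hypothetical."""

    def __init__(self, near_miss_window=0.05, delay=0.1, max_misses=3,
                 sleep=time.sleep, clock=time.monotonic):
        self.near_miss_window = near_miss_window  # assumed suspect heuristic (s)
        self.delay = delay                        # injected delay length (s)
        self.max_misses = max_misses              # assumed learning cutoff
        self.sleep, self.clock = sleep, clock     # injectable for testing
        self.lock = threading.Lock()
        self.last_call = {}    # object id -> (thread id, call site, timestamp)
        self.in_delay = {}     # object id -> record of the thread parked on it
        self.misses = {}       # call site -> number of unproductive delays
        self.suspects = set()  # call sites considered worth delaying
        self.violations = []   # (delayed site, conflicting site) pairs

    def on_call(self, obj, site):
        """Invoke just before calling a thread-unsafe method on obj."""
        me, key = threading.get_ident(), id(obj)
        with self.lock:
            now = self.clock()
            # Another thread is parked at an unsafe call on this object, so
            # the two calls can overlap: report the pair.
            parked = self.in_delay.get(key)
            if parked is not None and parked["tid"] != me:
                parked["hit"] = True
                self.violations.append((parked["site"], site))
            # Near miss: two threads touched the object close together in
            # time, so both call sites become suspects.
            prev = self.last_call.get(key)
            if (prev is not None and prev[0] != me
                    and now - prev[2] <= self.near_miss_window):
                self.suspects.update((prev[1], site))
            self.last_call[key] = (me, site, now)
            inject = site in self.suspects and key not in self.in_delay
            if inject:
                self.in_delay[key] = {"tid": me, "site": site, "hit": False}
        if not inject:
            return
        self.sleep(self.delay)  # hold this thread at the suspect call site
        with self.lock:
            record = self.in_delay.pop(key)
            if not record["hit"]:
                # Learn from failure: stop delaying sites that never pay off.
                self.misses[site] = self.misses.get(site, 0) + 1
                if self.misses[site] >= self.max_misses:
                    self.suspects.discard(site)

    def save(self, path):
        """Persist suspects so a later test run can start from them."""
        with open(path, "w") as f:
            json.dump(sorted(self.suspects), f)

    def load(self, path):
        """Seed suspects from a previous test run."""
        with open(path) as f:
            self.suspects.update(json.load(f))
```

In use, a test harness would call on_call(obj, site) before each monitored method, for example before an unsynchronized dictionary update, and would call load before the run and save after it. Note that the sketch never inspects locks or other synchronization operations; it only observes when and where the monitored calls happen.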