Debugging in the (Very) Large: Ten Years of Implementation and Experience

  • Kirk Glerum,
  • Kinshuman Kinshumann,
  • Steve Greenberg,
  • Gabriel Aul,
  • Vince Orgovan,
  • Greg Nichols,
  • David Grant,
  • Gretchen Loihle

Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09)

Published by Association for Computing Machinery, Inc.

Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in ten years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that could not be found at smaller scale. WER has been designed for large scale: one pair of database servers can record all the errors that occur on all Windows computers worldwide.
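The abstract describes bucketing and progressive data collection only at a high level. The Python sketch below illustrates how the two ideas might fit together, under stated assumptions. Every client report carries only a small bucket label. The server counts hits per bucket to rank buckets for developer attention, and it asks for a detailed dump only until it holds a fixed number of dumps for that bucket. The bucket key (program, version, faulting module, offset), the `BucketServer` class, and the `dumps_wanted` quota are hypothetical stand-ins, not WER's actual bucketing heuristics or protocol.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    # Hypothetical minimal report fields; WER's real heuristics are richer.
    program: str
    version: str
    module: str
    offset: int


def bucket_key(report: ErrorReport) -> tuple:
    """Map a report to a bucket label (hypothetical key choice)."""
    return (report.program, report.version, report.module, report.offset)


class BucketServer:
    """Hypothetical server: counts hits per bucket, requests few dumps."""

    def __init__(self, dumps_wanted: int = 2):
        self.counts = Counter()   # total reports seen per bucket
        self.dumps = Counter()    # detailed dumps requested per bucket
        self.dumps_wanted = dumps_wanted

    def report(self, r: ErrorReport) -> bool:
        """Record one report; return True if the client should upload more data."""
        key = bucket_key(r)
        self.counts[key] += 1
        if self.dumps[key] < self.dumps_wanted:
            self.dumps[key] += 1
            return True
        return False  # enough detail already: the label alone is recorded

    def ranked(self):
        """Buckets ordered by report count, largest first, to prioritize fixes."""
        return self.counts.most_common()


# Usage: three reports in one bucket, one report in another.
reports = [ErrorReport("app.exe", "1.0", "foo.dll", 0x1A2B)] * 3 + [
    ErrorReport("app.exe", "1.0", "bar.dll", 0x10)
]
server = BucketServer(dumps_wanted=2)
requests = [server.report(r) for r in reports]
print(requests)            # [True, True, False, True]
print(server.ranked()[0])  # (('app.exe', '1.0', 'foo.dll', 6699), 3)
```

The quota keeps the per-report cost low for common crashes: once a bucket has enough detailed samples, further hits only increment a counter. That counter is also what turns error statistics into a prioritization signal.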