Common Causes and Mitigations of Service Quality Issues in Big Data Computing

MSR-TR-2014-34

Published by Microsoft

Big data computing platforms have evolved into multi-tenant services. Service quality matters because system failures or performance slowdowns can adversely affect business and user experience. Few studies in the literature examine service quality issues in production big data computing platforms. In this paper, we present an empirical study of service quality issues in Cosmos, a company-wide multi-tenant big data computing platform at Microsoft that serves thousands of customers from hundreds of teams. Cosmos has a well-defined escalation process (i.e., incident management process), which helps customers report and mitigate service quality issues 24/7.

This paper explores the common causes and mitigations of service quality issues in big data computing. We conduct an empirical study of 200+ randomly sampled live-site service quality issues in Cosmos. Our major findings include: (1) 21.0% of escalations are caused by hardware faults; (2) 36.2% are caused by system-side defects, including 21.0% code defects, 9.5% design limitations, and 5.7% operation faults; (3) 37.1% are due to customer-side faults, including 11.4% misuses, 11.0% convention violations, 10.0% operation faults, and 4.8% code defects. We also study the general diagnosis process and the commonly adopted mitigation solutions; the findings suggest that it is possible to design an end-to-end tool that automates diagnosis by analyzing the collected telemetry data.