Abstract

Big Data computing platform has evolved to be a multi-tenant service. The service quality matters because system failure or performance slowdown could adversely affect business and user experience. There is few study in literature on service quality issues of production Big Data computing platform. In this paper, we present an empirical study on the service quality issues of Microsoft ProductA, which is a company-wide multi-tenant Big Data computing platform, serving thousands of customers from hundreds of teams. ProductA has a well-defined incident management process, which helps customers report and mitigate service quality issues on 24/7 basis. This paper explores the common symptom, causes and mitigation of service quality issues in Big Data computing. We conduct an empirical study on 210 real service quality issues in ProductA. Our major findings include (1) 21.0% of escalations are caused by hardware faults; (2) 36.2% are caused by system side defects; (3) 37.2% are due to customer side faults. We also studied the general diagnosis process and the commonly adopted mitigation solutions. Our findings can help improve current development and maintenance practice of Big Data computing platform, and motivate tool support.