Experience from an operational map-reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include (i) machine characteristics – both hardware reliability (e.g., disk failures) as well as run-time contention for processor, memory and other resources, (ii) network characteristics with varying bandwidths and congestion along paths, and (iii) imbalance in workload among tasks.

We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri’s strategies include smart restart of outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Mantri’s principled strategy of dealing with outliers is a significant advancement over prior work that concentrate only on duplicating tasks. Using real-time progress reports, Mantri detects outliers early in their lifetime, and takes appropriate action based on their causes. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Deployment in Bing’s production cluster and extensive trace-driven simulation indicate that Mantri is 3.1x more effective than the existing state-of-the-art in improving job completion times.