Abstract

Replication has become the de facto standard for storing data within and across data centers that process data-intensive workloads. Erasure coding (a form of software RAID), although heavily researched and theoretically more space efficient than replication, has complex tradeoffs that are not well understood by practitioners. Today's data centers run diverse foreground and background data-intensive workloads, and getting these tradeoffs right is becoming increasingly important. Using a series of realistic data center deployment scenarios and workload characteristics, together with a prototype Hadoop library that implements erasure coding, we revisit traditional metrics (performance and dollar cost), present new tradeoffs (power proportionality and complexity), and make recommendations on directions worth researching.
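To make the space-efficiency claim concrete, the sketch below compares the raw-storage overhead of n-way replication against a (k, m) erasure code (k data blocks plus m parity blocks). The specific parameters shown are common illustrative choices, not figures taken from this paper.

```python
def replication_overhead(n: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(n)

def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte under a (k, m) erasure code."""
    return (k + m) / k

# 3-way replication tolerates 2 block failures at 3.0x storage overhead.
print(replication_overhead(3))   # 3.0
# A (10, 4) Reed-Solomon-style code tolerates 4 failures at only 1.4x.
print(erasure_overhead(10, 4))   # 1.4
```

The comparison shows why erasure coding is attractive on space alone; the abstract's point is that the remaining dimensions (performance, cost, power proportionality, complexity) are where the tradeoffs get subtle.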