Abstract

Enterprises such as Yahoo!, LinkedIn, and Facebook operate their own compute and storage infrastructure, which is effectively a “private cloud”. The private cloud consists of multiple clusters, each of which is managed independently. With HDFS, whenever data is stored in a cluster, it is replicated within that cluster for availability. Unfortunately, for datasets shared across the enterprise, this leads to over-replication within the private cloud. An analysis of Yahoo!’s HDFS usage suggests that the disk space consumed by replicas of shared datasets is substantial, amounting to petabytes of storage. New datasets are typically popular and are requested by many processing jobs in different clusters. This demand is satisfied by copying the dataset to each of those clusters. As datasets age, however, they are used less and become cold. We then have the opposite problem of data being over-replicated across clusters: each cluster holds enough replicas to recover from data loss locally, so the total number of replicas across the private cloud is high.

We address both problems, initial replication and cross-cluster recovery, in a private cloud setting with a single technique: on-demand replication, which we refer to as Hot Replication-On-Demand (HotROD). By making files visible across HDFS clusters, we let a cluster pull in remote replicas as needed, both for initial replication and for later recovery. We implemented HotROD as an extension to a standard HDFS installation.