Protector: A Probabilistic Failure Detector for Cost-effective Peer-to-peer Storage

  • Zhi Yang
  • Jing Tian
  • Ben Y. Zhao
  • Yafei Dai

IEEE Transactions on Parallel and Distributed Systems, Vol. 22

Publication

Maintaining a given level of data redundancy is a fundamental requirement of peer-to-peer (P2P) storage systems: to ensure the desired data availability, additional replicas must be created when peers fail. Since the majority of failures in P2P networks are transient (i.e., peers return with data intact), an intelligent system can significantly reduce replication costs by not replicating data following transient failures. Reliably distinguishing permanent from transient failures, however, is challenging, because peers are unresponsive to probes in both cases. In this paper, we propose Protector, an algorithm that enables efficient replication policies by estimating the number of “remaining replicas” for each object, including those temporarily unavailable due to transient failures. Protector dramatically improves detection accuracy by exploiting two opportunities. First, it leverages failure patterns to predict the likelihood that a peer (and the data it hosts) has permanently failed given its current downtime. Second, it detects the replication level across groups of replicas (or fragments), thereby balancing false positives for some peers against false negatives for others. Extensive simulations based on both synthetic and real traces show that Protector closely approximates the performance of a perfect “oracle” failure detector, and significantly outperforms timeout-based detectors across a wide range of parameters. Finally, we design, implement, and deploy an efficient P2P storage system called AmazingStore by combining Protector with structured P2P overlays. Our experience shows that Protector enables efficient long-term data maintenance in P2P storage systems.
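
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes, purely for illustration, that transient downtimes are exponentially distributed and that the prior fraction of permanent failures is known. The function names, the parameters `prior_permanent` and `mean_transient_hours`, and their default values are all hypothetical. The first function applies Bayes' rule to a single peer's current downtime; the second sums the survival probabilities across a replica group to estimate how many replicas remain.

```python
import math


def prob_permanent(downtime_hours, prior_permanent=0.1, mean_transient_hours=12.0):
    """Posterior probability that a peer is permanently gone, given its downtime.

    Assumptions (illustrative only): transient downtimes are exponential with
    the given mean, and a permanently failed peer never returns.
    """
    # Probability that a transient failure would still be ongoing after this long.
    still_down_if_transient = math.exp(-downtime_hours / mean_transient_hours)
    # Bayes' rule; P(still down | permanent) is 1.
    evidence = prior_permanent + (1.0 - prior_permanent) * still_down_if_transient
    return prior_permanent / evidence


def expected_remaining_replicas(online_count, offline_downtimes, **model):
    """Estimate replicas still in existence: online ones plus likely returners."""
    likely_to_return = sum(1.0 - prob_permanent(d, **model) for d in offline_downtimes)
    return online_count + likely_to_return


def needs_repair(online_count, offline_downtimes, target, **model):
    """Trigger replication only when the group-level estimate falls below target."""
    return expected_remaining_replicas(online_count, offline_downtimes, **model) < target


if __name__ == "__main__":
    # Two replicas online, two peers offline for 1 hour and 72 hours respectively.
    print(expected_remaining_replicas(2, [1.0, 72.0]))
    print(needs_repair(2, [1.0, 72.0], target=3))
```

Deciding at the group level rather than per peer means a misjudged peer does not by itself trigger replication: only the aggregate estimate matters, which is how errors on individual peers can offset one another.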