Abstract

The Herald project at Microsoft Research has built working implementations of several scalable peer-to-peer algorithms as part of our work on a scalable, fault-tolerant event notification system. Our goal has been to construct and validate implementations that will work on real networks at scale – not just to simulate such systems and reason about what might be buildable – but to actually build and run them. This paper reports on our experiences building a working peer-to-peer system, relating lessons learned that will potentially be useful to others contemplating similar endeavors. Our experiences and recommendations fall broadly into two categories: (1) We strongly recommend that the system be developed such that it can be run in both simulated and real network environments. (2) We observed that message transport issues such as protocols used, overlay network characteristics, multi-hop routing, layering, and race conditions introduce significant non-local complexities and interactions of which applications must be aware to function correctly. We built real peer-to-peer systems that scale to hundreds of thousands of nodes in a network simulator while also running on real networks, including those with WAN characteristics. We believe the lessons learned along the way will be useful and interesting to others trying to build real working scalable systems.