Load Management in a Large-Scale Decentralized File System

MSR-TR-2004-60

This paper discusses our general approach to load management in distributed systems and its application to one such system, Farsite. Farsite is a peer-to-peer distributed file system that uses its constituent machines to maintain consistency of file system metadata and replicated file content. We argue that control theory is inappropriate for load management in this and similar systems, and we give alternative techniques for preventing overload of limited resources such as CPU and disk. We describe our method of workflow graphs, which allows a system designer to describe the potential sources of overload and ensure that all are managed properly, and we apply this method to Farsite. We also describe novel techniques for load management, including clown-car compression and a scheme for achieving approximately even file replication without central coordination or global knowledge.