In designing and building distributed systems, it is common engineering practice to separate steady-state (“normal”) operation from abnormal events such as recovery from failure. This way, the normal case can be optimized extensively while recovery can be amortized. However, integrating the recovery procedure with the steady-state protocol is often far from obvious and can present subtle difficulties. This issue comes to the forefront in modern data centers, where applications are often implemented as elastic sets of replicas that must reconfigure while continuing to provide service, and where it may be necessary to install new versions of active services as bugs are fixed or new functionality is introduced.
Our paper explores this topic in the context of a dynamic reconfiguration model of our own design that unifies two widely used prior approaches to the problem: virtual synchrony, a model and associated protocols for reliable group communication, and state machine replication (in particular, Paxos), a model and protocol for replicating deterministic functionality specified as an event-driven state machine.