Abstract

In 1998, Jim Reuter of DEC’s storage group asked me for a leader-election algorithm for a network of processors and disks that they were designing. The new wrinkle to the problem was that they wanted a system with only two processors to continue to operate if either processor failed. We could assume that the system had at least three disks, so the idea was to find an algorithm that achieved fault tolerance by replicating disks rather than processors. I convinced them that they didn’t want a leader-election protocol, but rather a distributed state-machine implementation (see [27]). At the time, Eli Gafni was on sabbatical from UCLA and was consulting at SRC. Together, we came up with the algorithm described in this paper, which is a disk-based version of the Paxos algorithm of [122].

Gafni devised the initial version of the algorithm, which didn’t look much like Paxos. As we worked out the details, it evolved into its current form. Gafni wanted a paper on the algorithm to follow the path with which the algorithm had been developed, starting from his basic idea and deriving the final version by a series of transformations. We wrote the first version of the paper in this way. However, when trying to make it rigorous, I found that the transformation steps weren’t as simple as they had appeared. I found the resulting paper unsatisfactory, but we submitted it anyway to PODC’99, where it was rejected. Gafni was then willing to let me do it my way, and I turned the paper into its current form.

A couple of years after the paper was published, Mauro J. Jaskelioff encoded the proof in Isabelle/HOL and mechanically checked it. He found about a dozen small errors. Since I have been proposing Disk Paxos as a test example for mechanical verification of concurrent algorithms, I have decided not to update the paper to correct the errors he found. Anyone who writes a rigorous mechanically-checked proof will find them.