A critical problem in distributed systems is distributed consensus—a set of servers aiming to agree on a single value or a continuing sequence of values, called single-value consensus or multi-value consensus, respectively. It is essential in any distributed service that must maintain and replicate a state to tolerate failures caused by machine crashes, network outages, etc. To this end, we study Paxos—a well-known algorithm for solving distributed consensus used in many distributed services, including Microsoft’s Autopilot and IronRSL.
In this talk, we detail formal specifications of the exact phases of multi-value Paxos in TLA+, Lamport’s Temporal Logic of Actions, and complete proofs of its safety that are mechanically checked in TLAPS, the TLA Proof System. We also discuss general strategies for proving properties involving sets and tuples that helped the proof check succeed insignificantly reduced time. The specification and proof are small (55 lines and 787 lines, respectively) and took about 100 seconds to check, contrasting others that are an order of magnitude larger, more time-consuming, or worse.
Next, we discuss using message history variables, that is, variables holding all sent messages and all received messages, for verification of distributed algorithms, with variants of Paxos as precise case studies. We show that using history variables, instead of using and maintaining other state variables, yields specifications that are more declarative and easier to understand. It also allows easier proofs to be developed by needing fewer invariants and facilitating proof derivations. Furthermore, the proofs are mechanically checked more efficiently. Our specifications, proofs, and proof checking times were reduced by a quarter or more for single-value Paxos and by about half or more for multi-value Paxos.