Stable Paxos


Paxos is a family of widely adopted distributed, crash-failure tolerant consensus protocols. While it can be proven correct, it is by not guaranteed to operate at a steady rate. While it is true that no fault-tolerant deterministic consensus protocol can be sure to arrive at any decision in finite time, certain real-world concessions are often made to at least bound delays. Under certain conditions, however, an otherwise reliable and speedy implementation of Paxos can delay dramatically. For example, if a small group of “acceptor” processes acts slower than the majority, they can find themselves with a backlog of old inputs to process should they ever be needed for a quorum. If the delay is in the line of communication, a such backlog is avoidable by replacing traditional FIFO channels (such as TCP) with channels more likely to deliver recent requests, even while older ones are in transit. If the acceptors themselves are queueing up information to process, a leader can increase the speed and stability of the protocol by sending proposals first to a quorum of acceptors which have been fastest recently, and to the remainder only if necessary. We provide a theoretical basis for quantifying these problems, as well as a test implementation demonstrating such delays as well as these solutions.

Technical Report