The best reference on Paxos is a paper by Butler Lampson titled "How to build a highly available system using consensus".
Paxos' creator Leslie Lamport has written two papers, "The Part-time Parliament" which is the main detailed paper, and "paxos made simple" which presents the same thing simplified. However, Lampson's paper provides a very good context for Paxos: Why is consensus needed when designing an HA service ? What makes Paxosa really good consensus algorithm for asynchronous unreliable systems ? How to use Paxos in a _real_ distributed replicated (HA) service. Google's chubby lock service is a real life application of Paxos for solving a real problem and gives interesting perspectives on where consensus fits in an HA service implementation.
For instance, Lampson shows how an HA service need not use consensus for each and every consensus decision. Instead, it can use consensus once to elect a lease-based primary or grant a lock which can then be used. Chubby also uses Paxos to provide a lock service rather than let clients use Paxos for every consensus decision. This helps performance since the lock owner or primary can pretty much do what it wants without having to consult other hosts. Another real world benefit is when modifying existing non-HA code to be HA. Its easier to modify existing code to convert it to a distributed system using locks, than make it use consensus for every majority decision.
Lampson's paper and the paper on chubby lock service claim thatPaxos consensus algorithm and the consensus algorithm used in Viewstamped replication which was published around the same time are equivalent. Viewstamped replication was used by Ori and Liskov to build a replication file system where the primary election happens through consensus. "Viewstamping" is used to make sure that even after a new primary is elected (and view changed), events pending with the older primary are well known to predate those seen by the new primary. The concept is similar to "round number" in Paxos where each view is like a round number and viewstamps are monotonically increasing like Paxos round numbers.
There was an MIT project by Barbara Liskov in 1990 called Argus which tried to create a programming model for writing applications which would then be automatically made highly available (HA) using viewstamped replication based system. Argus staff (Sanjay Ghemawat) joined Google, so mapreduce is probably inspired to some degree by Argus. Mapreduce does not have Argus style fault tolerance, it has basic fault tolerance implemented by the master.
Safety property of Consensus:
All non-failed processes agree on a value v that is proposed by at least one process (v must be proposed by some process, otherwise the system can e.g., trivially agree on value 0 all the time).
The basic idea behind Paxos:
Consensus is easy if there are no faults. For example, two phase commit: one process decides the outcome and asks all processes to prepare. If all processes agree, it issues a commit and job is done. If leader fails, the outcome is not clear. Another example: majority-based consensus: there is no leader. All processes choose a value and broadcast. If a majority agrees on the same value, there is consensus. But if there is no majority, or if some processes crash so that nobody knows if consensus was reached, this algorithm fails.
Paxox's basic idea comes from this non-fault-tolerant majority consensus algorithm. To make it fault-tolerant, Paxos repeats majority consensus until there is consensus. However, the challenge comes from the asynchronous nature of the problem due to which its difficult to know when the current voting round has failed to reach consensus and hence a new round needs to be started. In practice, some process will start a new round after waiting for a certain time if consensus has not reached till then (Paxos is protected from multiple processes starting a round since all round numbers are ordered, so one of them will take precedence over the other thus effectively killing that other round). But the decision about the perceived failure of the previous round can be false as its made by some process based on messages it receives from other processes (after consensus is reached, typically the result is broadcasted for everyone to know though this is not needed for correctness). So if some messages are lost, its decision can be incorrect. Thus there could be a processe A which has witnessed that there has been a consensus on some value but process B doesn't know it. So B start s another round possibly using a different value. So if B's new request to A is lost, A will wrongly stick to its older consensus value while the new consensus could be on a different value. So for Paxos to work, each new round should use a value for which there has possibly been a consensus in a previous round. The tricky part of Paxos is to ensure this property.
Paxos ensures this by having the leader process of a round ask other processes for their past values and decide for which value there could possibly have been a consensus. Two key details that make the algorithm work are:
- Each process can only be participating in one round at a time.
- Round numbers are monotonically increasing so knowing which round is "previous" is trivial.
References:
- Butler W. Lampson, How to Build a Highly Available System Using Consensus. 10th International Workshop on Distributed Algorithms (WDAG 1996)
- Leslie Lamport, The Part-time Parliament, ACM Transactions on Computer Systems, May 1998
- Leslie Lamport, Paxos made simple,http://research.microsoft.com/users/lamport/pubs/paxos-simple.pdf
Future Reading:
- Jean-Philippe Martin, Lorenzo Alvisi, Fast Byzantine Paxos,http://www.cs.utexas.edu/users/lorenzo/papers/fab.pdf