![]() | ||
Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
Contents
- History
- Assumptions
- Processors
- Network
- Number of processors
- Roles
- Quorums
- Proposal Number Agreed Value
- Safety and liveness properties
- Typical deployment
- Basic Paxos
- Phase 1a Prepare
- Phase 1b Promise
- Phase 2a Accept Request
- Phase 2b Accepted
- Message flow Basic Paxos
- Error cases in Basic Paxos
- Message flow Basic Paxos failure of Acceptor
- Message flow Basic Paxos failure of redundant Learner
- Message flow Basic Paxos failure of Proposer
- Message flow Basic Paxos dueling Proposers
- Multi Paxos
- Message flow Multi Paxos start
- Message flow Multi Paxos steady state
- Typical Multi Paxos Collapsed Roles deployment
- Message flow Multi Paxos Collapsed Roles start
- Message flow Multi Paxos Collapsed Roles steady state
- Optimizations
- Cheap Paxos
- Message flow Cheap Multi Paxos
- Fast Paxos
- Message flow Fast Paxos non conflicting
- Message flow Fast Paxos conflicting proposals
- Message flow Fast Paxos collapsed roles
- Generalized Paxos
- Example
- Message flow Generalized Paxos example
- Performance
- Byzantine Paxos
- Message flow Byzantine Multi Paxos steady state
- Message flow Fast Byzantine Multi Paxos steady state
- Message flow Fast Byzantine Multi Paxos failure
- Production use of Paxos
- References
Consensus protocols are the basis for the state machine replication approach to distributed computing, as suggested by Leslie Lamport and surveyed by Fred B. Schneider. State machine replication is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.
The Paxos protocol was first published in 1989 and named after a fictional legislative consensus system used on the Paxos island in Greece. It was later published as a journal article in 1998.
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no deterministic fault-tolerant consensus protocol can guarantee progress in an asynchronous network (a result proven in a paper by Fischer, Lynch and Paterson), Paxos guarantees safety (consistency), and the conditions that could prevent it from making progress are difficult to provoke.
Paxos is usually used where durability is required (for example, to replicate a file or a database), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive. There is also a mechanism to drop a permanently failed replica or to add a new replica.
History
The topic predates the protocol. In 1988, Lynch, Dwork and Stockmeyer had demonstrated the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in viewstamped replication, first published by Oki and Liskov in 1988, in the context of distributed transactions. Not withstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.
Reconfigurable state machines have strong ties to prior work on reliable group multicast protocols that support dynamic group membership, for example Birman's work in 1985 and 1987 on the virtually synchronous gbcast protocol. However, it should be noted that gbcast is unusual in supporting durability and addressing partitioning failures. Most reliable multicast protocols lack these properties, which are required for implementations of the state machine replication model. This point is elaborated in a paper by Lamport, Malkhi and Zhou.
Assumptions
In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article.
Processors
Network
Number of processors
In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors. However, using reconfiguration, a protocol may be employed which survives any number of total failures as long as no more than F fail simultaneously.
Roles
Paxos describes the actions of the processes by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.
Quorums
Quorums express the safety properties of Paxos by ensuring at least some surviving processor retains knowledge of the results.
Quorums are defined as subsets of the set of Acceptors such that any two subsets (that is, any two Quorums) share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors {A,B,C,D}, a majority Quorum would be any three Acceptors: {A,B,C}, {A,C,D}, {A,B,D}, {B,C,D}. More generally, arbitrary positive weights can be assigned to Acceptors and a Quorum defined as any subset of Acceptors with the summary weight greater than half of the total weight of all Acceptors.
Proposal Number & Agreed Value
Each attempt to define an agreed value v is performed with proposals which may or may not be accepted by Acceptors. Each proposal is uniquely numbered for a given Proposer. The value corresponding to a numbered proposal can be computed as part of running the Paxos protocol, but need not be.
Safety and liveness properties
In order to guarantee safety, Paxos defines three safety properties and ensures they are always held, regardless of the pattern of failures:
Typical deployment
In most deployments of Paxos, each participating process acts in three roles; Proposer, Acceptor and Learner. This reduces the message complexity significantly, without sacrificing correctness:
By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties.
A typical implementation's message flow is covered in the section Multi-Paxos.
Basic Paxos
This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has two phases. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors:
Phase 1a: Prepare
A Proposer (the leader) creates a proposal identified with a number N. This number must be greater than any previous proposal number used by this Proposer. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors. The Proposer decides who is in the Quorum.Phase 1b: Promise
If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N. If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer.Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (Nack) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.Phase 2a: Accept Request
If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value to its proposal. If any Acceptors had previously accepted any proposal, then they'll have sent their values to the Proposer, who now must set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal.The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.Phase 2b: Accepted
If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to any prepare proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Else, it can ignore the Accept Request.Note that an Acceptor can accept multiple proposals. These proposals may even have different values in the presence of certain failures. However, the Paxos protocol will guarantee that the Acceptors will ultimately agree on a single value.Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number.Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.Here is a graphic representation of the Basic Paxos protocol. Note that the values returned in the Promise message are null the first time a proposal is made, since no Acceptor has accepted a value before in this round.A Learner learns a value only if it receives a Quorum of Accepted messages with the same value.Message flow: Basic Paxos
(first round is successful)
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | |<---------X--X--X | | Promise(1,{Va,Vb,Vc}) | X--------->|->|->| | | Accept!(1,Vn) | |<---------X--X--X------>|->| Accepted(1,Vn) |<---------------------------------X--X Response | | | | | | |Vn = highest of (Va,Vb,Vc)
Error cases in Basic Paxos
The simplest error cases are the failure of a redundant Learner, or failure of an Acceptor when a Quorum of Acceptors remains live. In these cases, the protocol requires no recovery. No additional rounds or messages are required, as shown below:
Message flow: Basic Paxos, failure of Acceptor
(Quorum size = 2 Acceptors)
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | | | | ! | | !! FAIL !! | |<---------X--X | | Promise(1,{null,null}) | X--------->|->| | | Accept!(1,V) | |<---------X--X--------->|->| Accepted(1,V) |<---------------------------------X--X Response | | | | | |Message flow: Basic Paxos, failure of redundant Learner
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Prepare(1) | |<---------X--X--X | | Promise(1,{null,null,null}) | X--------->|->|->| | | Accept!(1,V) | |<---------X--X--X------>|->| Accepted(1,V) | | | | | | ! !! FAIL !! |<---------------------------------X Response | | | | | |The next failure case is when a Proposer fails after proposing a value, but before agreement is reached. Ignoring Leader election, an example message flow is as follows:
Message flow: Basic Paxos, failure of Proposer
(re-election not shown, one instance, two rounds)
Client Proposer Acceptor Learner | | | | | | | X----->| | | | | | Request | X------------>|->|->| | | Prepare(1) | |<------------X--X--X | | Promise(1,{null, null, null}) | | | | | | | | | | | | | | !! Leader fails during broadcast !! | X------------>| | | | | Accept!(1,Va) | ! | | | | | | | | | | | | !! NEW LEADER !! | X--------->|->|->| | | Prepare(2) | |<---------X--X--X | | Promise(2,{null, null, null}) | X--------->|->|->| | | Accept!(2,V) | |<---------X--X--X------>|->| Accepted(2,V) |<---------------------------------X--X Response | | | | | | |The most complex case is when multiple Proposers believe themselves to be Leaders. For instance the current leader may fail and later recover, but the other Proposers have already re-elected a new leader. The recovered leader has not learned this yet and attempts to begin a round in conflict with the current leader.
Message flow: Basic Paxos, dueling Proposers
(one instance, four unsuccessful rounds)
Client Proposer Acceptor Learner | | | | | | | X----->| | | | | | Request | X------------>|->|->| | | Prepare(1) | |<------------X--X--X | | Promise(1,{null,null,null}) | ! | | | | | !! LEADER FAILS | | | | | | | !! NEW LEADER (knows last number was 1) | X--------->|->|->| | | Prepare(2) | |<---------X--X--X | | Promise(2,{null,null,null}) | | | | | | | | !! OLD LEADER recovers | | | | | | | | !! OLD LEADER tries 2, denied | X------------>|->|->| | | Prepare(2) | |<------------X--X--X | | Nack(2) | | | | | | | | !! OLD LEADER tries 3 | X------------>|->|->| | | Prepare(3) | |<------------X--X--X | | Promise(3,{null,null,null}) | | | | | | | | !! NEW LEADER proposes, denied | | X--------->|->|->| | | Accept!(2,Va) | | |<---------X--X--X | | Nack(3) | | | | | | | | !! NEW LEADER tries 4 | | X--------->|->|->| | | Prepare(4) | | |<---------X--X--X | | Promise(4,{null,null,null}) | | | | | | | | !! OLD LEADER proposes, denied | X------------>|->|->| | | Accept!(3,Vb) | |<------------X--X--X | | Nack(4) | | | | | | | | ... and so on ...Note that Basic Paxos is a stripped down model of the core of every version of Paxos used to prove the correctness of family of Paxos algorithms, as shown in the paper Paxos Made Simple.
Multi-Paxos
If each command is the result of a single instance of the Basic Paxos protocol a significant amount of overhead would result. The paper Paxos Made Simple defines Paxos to be what is commonly called "Multi-Paxos" which in steady state uses a distinguished leader to coordinate an infinite stream of commands. A typical deployment of Paxos uses a continuous stream of agreed values acting as commands to update a distributed state machine.
If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.
To achieve this, the instance number I is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
Message flow: Multi-Paxos, start
(first instance with new leader)
Client Proposer Acceptor Learner | | | | | | | --- First Request --- X-------->| | | | | | Request | X--------->|->|->| | | Prepare(N) | |<---------X--X--X | | Promise(N,I,{Va,Vb,Vc}) | X--------->|->|->| | | Accept!(N,I,Vm) | |<---------X--X--X------>|->| Accepted(N,I,Vm) |<---------------------------------X--X Response | | | | | | |Vm = highest of (Va, Vb, Vc)
Message flow: Multi-Paxos, steady-state
(subsequent instances with same leader)
Client Proposer Acceptor Learner | | | | | | | --- Following Requests --- X-------->| | | | | | Request | X--------->|->|->| | | Accept!(N,I+1,W) | |<---------X--X--X------>|->| Accepted(N,I+1,W) |<---------------------------------X--X Response | | | | | | |Typical Multi-Paxos Collapsed Roles deployment
The most common deployment of the Paxos family is Multi-Paxos, specialized for participating processors to each be Proposers, Acceptors and Learners. The message flow with roles collapsed may be optimized as depicted here:
Message flow: Multi-Paxos Collapsed Roles, start
(first instance with new leader)
Client Servers | | | | --- First Request --- X-------->| | | Request | X->|->| Prepare(N) | |<-X--X Promise(N,I,{Va,Vb}) | X->|->| Accept!(N,I,Vn) | |<-X--X Accepted(N,I) |<--------X | | Response | | | |Message flow: Multi-Paxos Collapsed Roles, steady state
(subsequent instances with same leader)
Client Servers X-------->| | | Request | X->|->| Accept!(N,I+1,W) | |<-X--X Accepted(N,I+1) |<--------X | | Response | | | |Optimizations
A number of optimizations reduce message complexity and size. These optimizations are summarized below:
Cheap Paxos
Cheap Paxos extends Basic Paxos to tolerate F failures with F+1 main processors and F auxiliary processors by dynamically reconfiguring after each failure.
This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.
"With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks."Message flow: Cheap Multi-Paxos
3 main Acceptors, 1 Auxiliary Acceptor, Quorum size = 3, showing failure of one main processor and subsequent reconfiguration
{ Acceptors }Proposer Main Aux Learner| | | | | | -- Phase 2 --X----------->|->|->| | | Accept!(N,I,V)| | | ! | | --- FAIL! ---|<-----------X--X--------------->| Accepted(N,I,V)| | | | | -- Failure detected (only 2 accepted) --X----------->|->|------->| | Accept!(N,I,V) (re-transmit, include Aux)|<-----------X--X--------X------>| Accepted(N,I,V)| | | | | -- Reconfigure : Quorum = 2 --X----------->|->| | | Accept!(N,I+1,W) (Aux not participating)|<-----------X--X--------------->| Accepted(N,I+1,W)| | | | |Fast Paxos
Fast Paxos generalizes Basic Paxos to reduce end-to-end message delays. In Basic Paxos, the message delay from client request to learning is 3 message delays. Fast Paxos allows 2 message delays, but requires the Client to send its request to multiple destinations.
Intuitively, if the leader has no value to propose, then a client could send an Accept! message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending Accepted messages to the leader and every Learner achieving two message delays from Client to Learner.
If the leader detects a collision, it resolves the collision by sending Accept! messages for a new round which are Accepted as usual. This coordinated recovery technique requires four message delays from Client to Learner.
The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).
Message flow: Fast Paxos, non-conflicting
Client Leader Acceptor Learner | | | | | | | | | X--------->|->|->|->| | | Any(N,I,Recovery) | | | | | | | | X------------------->|->|->|->| | | Accept!(N,I,W) | |<---------X--X--X--X------>|->| Accepted(N,I,W) |<------------------------------------X--X Response(W) | | | | | | | |Message flow: Fast Paxos, conflicting proposals
Conflicting proposals with uncoordinated recovery. Note: the protocol does not specify how to handle the dropped client request.
Client Leader Acceptor Learner | | | | | | | | | | | X------->|->|->|->| | | Any(N,I,Recovery) | | | | | | | | | | | | | | | | | | !! Concurrent conflicting proposals | | | | | | | | | !! received in different order | | | | | | | | | !! by the Acceptors | X--------------?|-?|-?|-?| | | Accept!(N,I,V) X-----------------?|-?|-?|-?| | | Accept!(N,I,W) | | | | | | | | | | | | | | | | | | !! Acceptors disagree on value | | |<-------X--X->|->|----->|->| Accepted(N,I,V) | | |<-------|<-|<-X--X----->|->| Accepted(N,I,W) | | | | | | | | | | | | | | | | | | !! Detect collision & recover | | |<-------X--X--X--X----->|->| Accepted(N+1,I,W) |<---------------------------------X--X Response(W) | | | | | | | | |Message flow: Fast Paxos, collapsed roles
(merged Acceptor/Learner roles)
Client Servers | | | | | | | | X->|->|->| Any(N,I,Recovery) | | | | | | | | | | | | !! Concurrent conflicting proposals | | | | | | !! received in different order | | | | | | !! by the Servers | X--------?|-?|-?|-?| Accept!(N,I,V) X-----------?|-?|-?|-?| Accept!(N,I,W) | | | | | | | | | | | | !! Servers disagree on value | | X--X->|->| Accepted(N,I,V) | | |<-|<-X--X Accepted(N,I,W) | | | | | | | | | | | | !! Detect collision & recover |<-----------X--X--X--X Response(W) | | | | | |Generalized Paxos
Generalized consensus explores the relationship between the operations of the replicated state machine and the consensus protocol that implements it. The main discovery involves optimizations of Paxos when conflicting proposals could be applied in any order. i.e., when the proposed operations are commutative operations for the state machine. In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operations.
This concept is further generalized into ever-growing sequences of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sequences ensuring that all proposed operations of one sequence are stabilized before allowing any operation non-commuting with them to become stable.
Example
In order to illustrate Generalized Paxos, the example below shows a message flow between two concurrently executing clients and a replicated state machine implementing read/write operations over two distinct registers A and B.
Note that X in this table indicates operations which are non-commutative.
A possible sequence of operations :
<1:Read(A), 2:Read(B), 3:Write(B), 4:Read(B), 5:Read(A), 6:Write(A)>Since 5:Read(A)
commutes with both 3:Write(B)
and 4:Read(B)
, one possible permutation equivalent to the previous order is the following:
In practice, a commute occurs only when operations are proposed concurrently.
Message flow: Generalized Paxos (example)
Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see for a full discussion.
{ Acceptors }Client Leader Acceptor Learner | | | | | | | | !! New Leader Begins Round | | X----->|->|->| | | Prepare(N) | | |<-----X--X--X | | Promise(N,null) | | X----->|->|->| | | Phase2Start(N,null) | | | | | | | | | | | | | | | | !! Concurrent commuting proposals | X--------?|-----?|-?|-?| | | Propose(ReadA) X-----------?|-----?|-?|-?| | | Propose(ReadB) | | |<-----X-->-->-------->|->| Accepted(N,<ReadA,ReadB>) | | |<-----<--X--X-------->|->| Accepted(N,<ReadB,ReadA>) | | | | | | | | | | | | | | | | !! No Conflict, both stable | | | | | | | | V = <ReadA, ReadB> | | | | | | | | | | | | | | | | !! Concurrent conflicting proposals X-----------?|-----?|-?|-?| | | Propose(WriteB) | X--------?|-----?|-?|-?| | | Propose(ReadB) | | | | | | | | | | |<-----X------ | | Accepted(N,V.<WriteB,ReadB>) | | |<--------X--- | | Accepted(N,V.<ReadB,WriteB>) | | | | | | | | | | | | | | | | !! Conflict detected at the leader. | | | | | | | | | | X----->|->|->| | | Prepare(N+1) | | |<-----X------ | | Promise(N+1, N, V.<WriteB,ReadB>) | | |<--------X--- | | Promise(N+1, N, V.<ReadB, WriteB>) | | |<-----------X | | Promise(N+1, N, V) | | | | | | | | | | | | | | | | | | X----->|->|->| | | Phase2Start(N+1,V.<WriteB,ReadB>) | | |<-----X--X--X-------->|->| Accepted(N+1,V.<WriteB,ReadB>) | | | | | | | | | | | | | | | | !! New stable sequence | | | | | | | | U = <ReadA, ReadB>, WriteB, ReadB> | | | | | | | | | | | | | | | | !! More conflicting proposals X-----------?|-----?|-?|-?| | | Propose(WriteA) | X--------?|-----?|-?|-?| | | Propose(ReadA) | | | | | | | | | | | | | | | | !! This time spontaneously ordered by the network | | [<-----X--X----------->|->| Accepted(N+1,U.<WriteA, ReadA>) | | | | | | | |Performance
The above message flow shows us that Generalized Paxos can leverage operation semantics to avoid collisions when the spontaneous ordering of the network fails. This allows the protocol to be in practice quicker than Fast Paxos. However, when a collision occurs, Generalized Paxos needs two additional round trips to recover. This situation is illustrated with operations WriteB and ReadB in the above schema.
In the general case, such round trips are unavoidable and comes from the fact that multiple commands might be accepted during a round. This makes the protocol more expensive than Paxos when conflicts are frequent. Hopefully two possible refinements of Generalized Paxos are possible to improve recovery time.
Byzantine Paxos
Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called Byzantine failures, after the solution popularized by Lamport.
Byzantine Paxos adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:
Message flow: Byzantine Multi-Paxos, steady state
Client Proposer Acceptor Learner | | | | | | | X-------->| | | | | | Request | X--------->|->|->| | | Accept!(N,I,V) | | X<>X<>X | | Verify(N,I,V) - BROADCAST | |<---------X--X--X------>|->| Accepted(N,V) |<---------------------------------X--X Response(V) | | | | | | |Fast Byzantine Paxos removes this extra delay, since the client sends commands directly to the Acceptors.
Note the Accepted message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends Accepted messages only to Learners):
Message flow: Fast Byzantine Multi-Paxos, steady state
Client Acceptor Learner | | | | | | X----->|->|->| | | Accept!(N,I,V) | X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST |<-------------------X--X Response(V) | | | | | |The failure scenario is the same for both protocols; Each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors will re-broadcast the agreed value: