
Coordination and Agreement

Master 2007
Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Main Assumptions
• Each pair of processes is connected by reliable channels
• Processes are independent of each other
• The network does not disconnect
• Processes fail only by crashing
• Each process has a local failure detector

Distributed Mutual Exclusion (1)

[Figure: n processes (Process 1 … Process n) concurrently accessing a
 shared resource]

• Mutual exclusion is very important:
  • it prevents interference
  • it ensures consistency when accessing the resources

Distributed Mutual Exclusion (2)
• Mutual exclusion is useful when the server managing the resources does
  not use locks
• Critical section:

    Enter()   // enter the critical section – blocking
      • access shared resources in the critical section
    Exit()    // leave the critical section

Distributed Mutual Exclusion (3)
• Distributed mutual exclusion: no shared variables, only message passing
• Properties:
  • Safety: at most one process may execute in the critical section at a
    time
  • Liveness: requests to enter and exit the critical section eventually
    succeed (no deadlock and no starvation)
  • Ordering: if one request to enter the CS happened-before another,
    then entry to the CS is granted in that order

Mutual Exclusion Algorithms
• Basic hypotheses:
  • System: asynchronous
  • Processes: do not fail
  • Message transmission: reliable

• Central Server Algorithm
• Ring-Based Algorithm
• Mutual Exclusion using Multicast and Logical Clocks
• Maekawa's Voting Algorithm
• Mutual Exclusion Algorithms Comparison

Central Server Algorithm

[Figure: a server manages the token and a queue of pending requests (here
 containing P4 and P2). 1) A process requests the token and waits;
 2) the process holding the token releases it back to the server;
 3) the server grants the token to the process at the head of the queue.]

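A minimal sketch of the server side in Python (an illustration, not the
slides' code; class and method names are assumptions):

    from collections import deque

    class TokenServer:
        def __init__(self):
            self.queue = deque()    # queue of pending requests (process ids)
            self.holder = None      # process currently holding the token

        def request_token(self, pid):
            # 1) a process asks for the token: grant it if free, else queue
            if self.holder is None:
                self.holder = pid   # 3) grant the token immediately
                return True
            self.queue.append(pid)  # token busy: the caller must wait
            return False

        def release_token(self, pid):
            # 2) the holder returns the token: pass it to the next waiter
            assert self.holder == pid
            self.holder = self.queue.popleft() if self.queue else None
            return self.holder      # new holder, or None if nobody waits

    server = TokenServer()
    print(server.request_token("P3"))   # True: P3 holds the token
    print(server.request_token("P1"))   # False: P1 is queued, waiting
    print(server.release_token("P3"))   # 'P1': token granted to P1
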
Ring-Based Algorithm (1)

[Figure: a group of unordered processes (P1, P2, …, Pn) on a network such
 as Ethernet, arranged into a logical ring]

Ring-Based Algorithm (2)

[Figure: the token navigates around the ring P1 → P2 → … → Pn → P1; a
 process may execute Enter(), its critical section, and Exit() only while
 it holds the token]

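A toy single-process simulation of the idea (illustrative; the function
name and the round-based loop are assumptions, not the slides' code):

    def ring_mutex(ring, wants_cs, laps=2):
        # Pass the token around the ring; only the holder may enter the
        # critical section, then it forwards the token to its neighbor.
        token_at = 0
        for _ in range(laps * len(ring)):
            pid = ring[token_at]
            if pid in wants_cs:
                print(pid, "enters and exits the critical section")
                wants_cs.discard(pid)
            token_at = (token_at + 1) % len(ring)   # forward the token

    ring_mutex(["P1", "P2", "P3", "P4"], {"P3", "P1"})
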
Mutual Exclusion using Multicast and Logical Clocks (1)

[Figure: P1 and P2 request entry to the critical section simultaneously,
 with Lamport timestamps 19 and 23. P3 is not requesting and replies to
 both. Since 19 < 23, P2 replies to P1 while P1 queues P2's request; P1
 enters the critical section first and replies to P2 when it exits.]

Mutual Exclusion using Multicast and Logical Clocks (2)
• Main steps of the algorithm:

    Initialization
      state := RELEASED;

    Process pi requests entry to the critical section
      state := WANTED;
      T := request's timestamp;
      Multicast request <T, pi> to all processes;
      Wait until (number of replies received = N – 1);
      state := HELD;

Mutual Exclusion using Multicast and Logical Clocks (3)
• Main steps of the algorithm (cont'd):

    On receipt of a request <Ti, pi> at pj (i ≠ j)
      If (state = HELD) OR (state = WANTED AND (T, pj) < (Ti, pi))
      Then queue the request from pi without replying;
      Else reply immediately to pi;

    To quit the critical section
      state := RELEASED;
      Reply to any queued requests;

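A compact sketch of the whole algorithm in Python, simulated in a single
program with synchronous message delivery (an illustration under the
no-failure hypothesis; class and method names are assumptions):

    RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

    class Process:
        def __init__(self, pid):
            self.pid, self.peers = pid, []
            self.state, self.clock = RELEASED, 0
            self.queue, self.replies = [], 0

        def request_cs(self):
            self.state = WANTED
            self.clock += 1
            self.stamp = (self.clock, self.pid)   # Lamport timestamp <T, pi>
            self.replies = 0
            for p in self.peers:
                p.on_request(self.stamp, self)

        def on_request(self, stamp, sender):
            # Queue if we hold the CS, or we want it and our request is older.
            if self.state == HELD or (self.state == WANTED
                                      and self.stamp < stamp):
                self.queue.append(sender)
            else:
                sender.on_reply()

        def on_reply(self):
            self.replies += 1
            if self.replies == len(self.peers):   # all N-1 replies received
                self.state = HELD

        def release_cs(self):
            self.state = RELEASED
            for p in self.queue:                  # reply to queued requests
                p.on_reply()
            self.queue = []

    a, b = Process(1), Process(2)
    a.peers, b.peers = [b], [a]
    a.request_cs()             # b is RELEASED and replies: a enters the CS
    b.request_cs()             # a holds the CS, so b's request is queued
    a.release_cs()             # a replies to b, which then enters the CS
    print(a.state, b.state)    # RELEASED HELD
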
Maekawa's Voting Algorithm (1)
• Candidate process: must collect sufficient votes to enter the critical
  section
• Each process pi maintains a voting set Vi (i = 1, …, N), where
  Vi ⊆ {p1, …, pN}
• The sets Vi are chosen such that, for all i, j:
  • pi ∈ Vi
  • Vi ∩ Vj ≠ ∅ (at least one common member in any two voting sets)
  • |Vi| = K (fairness: all voting sets have the same size)
  • each process pj is contained in M of the voting sets Vi

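A common construction that satisfies these properties (an assumption for
illustration; the slides do not fix one) places the N processes on a
√N × √N grid and takes Vi to be the union of pi's row and column:

    import math

    def grid_voting_sets(n):
        side = math.isqrt(n)
        assert side * side == n, "this sketch assumes N is a perfect square"
        sets = []
        for i in range(n):
            r, c = divmod(i, side)
            row = {r * side + k for k in range(side)}   # pi's grid row
            col = {k * side + c for k in range(side)}   # pi's grid column
            sets.append(row | col)                      # Vi = row ∪ column
        return sets

    V = grid_voting_sets(9)
    print(len(V[0]))     # 5: K = 2*sqrt(N) - 1 for every voting set
    print(V[0] & V[8])   # {2, 6}: any two voting sets intersect
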
Maekawa's Voting Algorithm (2)
• Main steps of the algorithm:

    Initialization
      state := RELEASED;
      voted := FALSE;

    For pi to enter the critical section
      state := WANTED;
      Multicast request to all processes in Vi – {pi};
      Wait until (number of replies received = K – 1);
      state := HELD;      // pi enters only after collecting K-1 votes

Maekawa's Voting Algorithm (3)
• Main steps of the algorithm (cont'd):

    On receipt of a request from pi at pj (i ≠ j)
      If (state = HELD OR voted = TRUE)
      Then queue the request from pi without replying;
      Else reply immediately to pi;
           voted := TRUE;

    For pi to exit the critical section
      state := RELEASED;
      Multicast release to all processes in Vi – {pi};

16
Maekawa's Voting Algorithm (4)
• Main steps of the algorithm (cont'd):

    On receipt of a release from pi at pj (i ≠ j)
      If (queue of requests is non-empty)
      Then remove the head of the queue, say pk;
           send a reply to pk;
           voted := TRUE;
      Else voted := FALSE;

17
Mutual Exclusion Algorithms Comparison

  Algorithm        Messages per     Messages          Problems
                   Enter()/Exit()   before Enter()
  Centralized      3                2                 Crash of the server
  Virtual ring     1 to ∞           0 to N-1          Crash of a process;
                                                      token lost; ordering
                                                      not satisfied
  Logical clocks   2(N-1)           2(N-1)            Crash of a process
  Maekawa's alg.   3√N              2√N               Crash of a voting
                                                      process

Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Election Algorithms (1)
• Objective: elect one process pi from a group of processes p1 … pN, even
  if multiple elections have been started simultaneously
• Utility: elect a primary manager, a master process, a coordinator, or a
  central server
• Each process pi maintains the identity of the elected process in the
  variable Electedi (NIL if it is not yet defined)
• Properties to satisfy: for every pi,
  • Safety: Electedi = NIL, or Electedi = P, where P is the non-crashed
    process with the largest identifier
  • Liveness: pi participates and eventually sets Electedi ≠ NIL, or
    crashes

Election Algorithms (2)
• Ring-Based Election Algorithm
• Bully Algorithm
• Election Algorithms Comparison

Ring-Based Election Algorithm (1)

[Figure: processes with identifiers 5, 16, 9, 25 arranged in a ring.
 Process 5 starts the election; the election message carries the largest
 identifier seen so far (5, 16, 25, …) around the ring, until 25
 recognizes its own identifier and becomes the coordinator.]

Ring-Based Election Algorithm (2)

    Initialization
      Participanti := FALSE;
      Electedi := NIL;

    pi starts an election
      Participanti := TRUE;
      Send the message <election, pi> to its neighbor;

    Receipt of a message <elected, pj> at pi
      Participanti := FALSE;
      Electedi := pj;
      If pi ≠ pj
      Then send the message <elected, pj> to its neighbor;

Ring-Based Election Algorithm (3)

    Receipt of an election message <election, pi> at pj
      If pi > pj
      Then send the message <election, pi> to its neighbor;
           Participantj := TRUE;
      Else If pi < pj AND Participantj = FALSE
           Then send the message <election, pj> to its neighbor;
                Participantj := TRUE;
      Else If pi = pj      // pj's own identifier came back: pj is elected
           Then Electedj := pj;
                Participantj := FALSE;
                Send the message <elected, pj> to its neighbor;

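A compact simulation of the election message flow (illustrative; the
function name and the list representation are assumptions):

    def ring_election(ids, starter):
        # ids[k] is the identifier of the process at ring position k.
        n = len(ids)
        msg, pos = ids[starter], (starter + 1) % n
        while msg != ids[pos]:
            # Forward the larger of the carried id and the local id.
            msg = max(msg, ids[pos])
            pos = (pos + 1) % n
        return msg          # a process saw its own id come back: elected

    print(ring_election([5, 16, 9, 25], starter=0))   # 25
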
Bully Algorithm (1)
• Characteristic: allows processes to crash during an election
• Hypotheses:
  • Reliable transmission
  • Synchronous system: a reply can be expected within

      T = 2 × DelayTransmission + DelayProcessing

Bully Algorithm (2)
• Hypotheses (cont'd):
  • Each process knows which processes have higher identifiers, and it
    can communicate with all such processes
• Three types of messages:
  • Election: starts an election
  • OK: sent in response to an election message
  • Coordinator: announces the new coordinator
• An election is started by a process when it notices, through timeouts,
  that the coordinator has failed

Bully Algorithm (3)

[Figure: processes 1–8; the coordinator (8) has failed. Process 5 detects
 the failure first and sends Election messages to 6, 7 and 8; 6 and 7
 answer OK. The election propagates to the higher processes, and 7, the
 highest non-crashed process, becomes the new coordinator.]

Bully Algorithm (4)

    Initialization
      Electedi := NIL;

    pi starts the election
      Send the message (Election, pi) to every pj with pj > pi;
      Wait for messages (OK, pj);
      If no message (OK, pj) arrives during T
      Then Electedi := pi;
           Send the message (Coordinator, pi) to every pj with pj < pi;
      Else wait until receipt of a (Coordinator) message
           (if it does not arrive during another timeout T', begin
           another election);

Bully Algorithm (5)

    Receipt of the message (Coordinator, pj) at pi
      Electedi := pj;

    Receipt of the message (Election, pj) at pi
      Send the message (OK, pi) to pj;
      Start an election unless pi has begun one already;

• When a process is started to replace a crashed process, it begins an
  election

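A toy simulation of the rule "the highest non-crashed process wins"
(illustrative; the alive map stands in for timeout-based failure
detection, and the names are assumptions):

    def bully_election(alive, starter):
        # alive: dict id -> bool; starter: id of the detecting process.
        higher = [p for p in alive if p > starter and alive[p]]
        if not higher:
            return starter      # no OK within T: starter is coordinator
        # Some higher process answers OK and takes over the election.
        return bully_election(alive, max(higher))

    alive = {i: True for i in range(1, 8)}
    alive[8] = False            # the old coordinator has crashed
    print(bully_election(alive, starter=5))   # 7
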
Election Algorithms Comparison

  Election algorithm   Number of messages   Problems
  Virtual ring         2N to 3N-1           Does not tolerate faults
  Bully                N-2 to O(N²)         System must be synchronous

Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Group Communication (1)
• Objective: each of a group of processes must receive copies of the
  messages sent to the group
• Group communication requires:
  • Coordination
  • Agreement: on the set of messages that is received and on the
    delivery ordering
• We study multicast communication among processes whose membership is
  known (static groups)

Group Communication (2)
• System: contains a collection of processes, which can communicate
  reliably over one-to-one channels
• Processes: members of groups, may fail only by crashing
• Groups:

[Figure: a closed group, where only members multicast to the group, next
 to an open group, where non-members may also send to it]

Group Communication (3)
• Primitives:
  • multicast(g, m): sends the message m to all members of group g
  • deliver(m): delivers the message m to the calling process
  • sender(m): unique identifier of the process that sent the message m
  • group(m): unique identifier of the group to which the message m was
    sent

Group Communication (4)
• Basic Multicast
• Reliable Multicast
• Ordered Multicast

Basic Multicast
• Objective: guarantee that a correct process will eventually deliver the
  message, as long as the multicaster does not crash
• Primitives: B_multicast, B_deliver
• Implementation: use reliable one-to-one communication (threads may
  perform the send operations simultaneously)

    To B_multicast(g, m)
      For each process p ∈ g, send(p, m);

    On receive(m) at p
      B_deliver(m) to p

• Unreliable: acknowledgments may be dropped

Reliable Multicast (1)
• Properties to satisfy:
  • Integrity: a correct process p delivers the message m at most once
  • Validity: if a correct process multicasts a message m, then it will
    eventually deliver m
  • Agreement: if a correct process delivers the message m, then all
    other correct processes in group(m) will eventually deliver m
• Primitives: R_multicast, R_deliver

Reliable Multicast (2)
• Implementation using B-multicast (correct but inefficient: each message
  is sent |g| times to each process):

    Initialization
      msgReceived := {};

    R-multicast(g, m) by p
      B-multicast(g, m);    // p ∈ g

    B-deliver(m) by q, with g = group(m), p = sender(m)
      If (m ∉ msgReceived)
      Then msgReceived := msgReceived ∪ {m};
           If (q ≠ p) Then B-multicast(g, m);
           R-deliver(m);

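The same algorithm transcribed into Python, simulated with in-memory
process objects and synchronous delivery (illustrative; all names are
assumptions):

    class RProcess:
        def __init__(self):
            self.received, self.delivered = set(), []

        def r_multicast(self, group, msg):
            self.b_multicast(group, msg, origin=self)

        def b_multicast(self, group, msg, origin):
            for q in group:
                q.b_deliver(group, msg, origin)

        def b_deliver(self, group, msg, origin):
            if msg not in self.received:
                self.received.add(msg)
                if self is not origin:                  # q != p: re-multicast
                    self.b_multicast(group, msg, origin)
                self.delivered.append(msg)              # R-deliver(m)

    group = [RProcess() for _ in range(3)]
    group[0].r_multicast(group, "m1")
    print([p.delivered for p in group])   # each delivers m1 exactly once
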
Ordered Multicast
• Ordering categories:
  • FIFO Ordering
  • Total Ordering
  • Causal Ordering
  • Hybrid Ordering: Total-Causal, Total-FIFO

FIFO Ordering (1)
• If a correct process issues multicast(g, m1) and then multicast(g, m2),
  then every correct process that delivers m2 will deliver m1 before m2

[Figure: three processes; m1 and m2 from the same sender are delivered in
 sending order everywhere, while m3 from another sender is unconstrained]

FIFO Ordering (2)
• Primitives: FO_multicast, FO_deliver
• Implementation: use of sequence numbers
• Variables maintained by each process p:
  • S_g^p: number of messages sent by p to group g
  • R_g^q: sequence number of the latest message p has delivered from
    process q that was sent to the group
• FIFO ordering is reached only under the assumption that groups are
  non-overlapping

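A sketch of the receiving side in Python, with a hold-back buffer keyed
by (sender, sequence number) (illustrative; names are assumptions):

    class FifoReceiver:
        def __init__(self):
            self.next_seq = {}    # R_g^q: next expected number per sender
            self.holdback = {}    # (sender, seq) -> buffered message
            self.delivered = []

        def receive(self, sender, seq, msg):
            self.holdback[(sender, seq)] = msg
            want = self.next_seq.setdefault(sender, 1)
            # Deliver consecutive messages from this sender, if possible.
            while (sender, want) in self.holdback:
                self.delivered.append(self.holdback.pop((sender, want)))
                want += 1
            self.next_seq[sender] = want

    r = FifoReceiver()
    r.receive("p", 2, "m2")   # held back: m1 has not been delivered yet
    r.receive("p", 1, "m1")   # now m1 and then m2 are delivered
    print(r.delivered)        # ['m1', 'm2']
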
Total Ordering (1)
• If a correct process delivers message m2 before it delivers m1, then
  any correct process that delivers m1 will deliver m2 before m1

[Figure: three processes delivering m1 and m2 in the same relative order]

• Primitives: TO_multicast, TO_deliver

Total Ordering (2)
• Implementation: assign totally ordered identifiers to multicast
  messages
• Each process makes the same ordering decision based upon these
  identifiers
• Methods for assigning identifiers to messages:
  • Sequencer process
  • Processes collectively agree on the assignment of sequence numbers
    to messages in a distributed fashion

Total Ordering (3)
• Sequencer process: maintains a group-specific sequence number Sg

    Initialization
      Sg := 0;

    B-deliver(<m, ident>) with g = group(m)
      B-multicast(g, <"order", ident, Sg>);
      Sg := Sg + 1;

• Algorithm for group member p ∈ g:

    Initialization
      Rg := 0;

Total Ordering (4)

    TO-multicast(g, m) by p              // ident: unique identifier of m
      B-multicast(g ∪ {sequencer(g)}, <m, ident>);

    B-deliver(<m, ident>) by p, with g = group(m)
      Place <m, ident> in the hold-back queue;

    B-deliver(m_order = <"order", ident, S>) by p, with g = group(m_order)
      Wait until (<m, ident> in hold-back queue AND S = Rg);
      TO-deliver(m);
      Rg := S + 1;

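A sketch of both roles in Python; delivery waits until the message and
its matching "order" announcement are both present (illustrative; names
are assumptions):

    class Sequencer:
        def __init__(self):
            self.s = 0
        def order(self, ident):             # emits <"order", ident, S_g>
            s, self.s = self.s, self.s + 1
            return ident, s

    class TOReceiver:
        def __init__(self):
            self.r = 0                      # R_g: next number to deliver
            self.holdback, self.orders, self.delivered = {}, {}, []
        def on_message(self, ident, msg):
            self.holdback[ident] = msg
            self._try_deliver()
        def on_order(self, ident, seq):
            self.orders[seq] = ident
            self._try_deliver()
        def _try_deliver(self):
            while (self.r in self.orders
                   and self.orders[self.r] in self.holdback):
                ident = self.orders.pop(self.r)
                self.delivered.append(self.holdback.pop(ident))
                self.r += 1                 # R_g := S + 1

    seq, rec = Sequencer(), TOReceiver()
    rec.on_message("idB", "mB")             # arrives before its order
    rec.on_order(*seq.order("idA"))         # sequencer ordered mA first
    rec.on_message("idA", "mA")
    rec.on_order(*seq.order("idB"))
    print(rec.delivered)                    # ['mA', 'mB']
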
Total Ordering (5)
• Processes collectively agree on the assignment of sequence numbers to
  messages in a distributed fashion
• Variables maintained by each process q:
  • P_g^q: largest sequence number proposed by q to group g
  • A_g^q: largest agreed sequence number q has observed so far for
    group g

Total Ordering (6)

[Figure: the distributed agreement in three phases, illustrated with
 processes p1 … p5.
 1) Message transmission: the sender multicasts <m, ident> to the group.
 2) Proposal: each receiver q replies with <ident, P_g^q>, where
    P_g^q := MAX(A_g^q, P_g^q) + 1.
 3) Agreement: the sender picks the agreed number SN := MAX of all the
    proposals, multicasts <ident, SN>, and each receiver sets
    A_g^q := SN.]

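The same three phases in a compact Python simulation (illustrative; all
names are assumptions):

    class Member:
        def __init__(self):
            self.proposed = 0   # P_g: largest number proposed to group g
            self.agreed = 0     # A_g: largest agreed number observed

        def propose(self, ident):                   # phase 2
            self.proposed = max(self.agreed, self.proposed) + 1
            return self.proposed

        def agree(self, ident, sn):                 # phase 3
            self.agreed = max(self.agreed, sn)

    def to_multicast(ident, members):
        proposals = [m.propose(ident) for m in members]
        sn = max(proposals)     # the sender picks the agreed number
        for m in members:
            m.agree(ident, sn)
        return sn

    group = [Member() for _ in range(3)]
    print(to_multicast("m1", group))   # 1
    print(to_multicast("m2", group))   # 2: identifiers are totally ordered
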
Causal Ordering (1)
• If multicast(g, m1) → multicast(g, m3), then any correct process that
  delivers m3 will deliver m1 before m3

[Figure: three processes; m1 causally precedes m3, so m1 is delivered
 before m3 everywhere, while m2 is concurrent and unconstrained]

Causal Ordering (2)
• Primitives: CO_multicast, CO_deliver
• Each process pi of group g maintains a timestamp vector V_i^g:
  V_i^g[j] = number of multicast messages received from pj that
  happened-before the next message to be sent
• Algorithm for group member pi:

    Initialization
      V_i^g[j] := 0 (j = 1, …, N);

Causal Ordering (3)

    CO-multicast(g, m)
      V_i^g[i] := V_i^g[i] + 1;
      B-multicast(g, <m, V_i^g>);

    B-deliver(<m, V_j^g>) from pj, with g = group(m)
      Place <m, V_j^g> in a hold-back queue;
      Wait until (V_j^g[j] = V_i^g[j] + 1)
             AND (V_j^g[k] ≤ V_i^g[k] for all k ≠ j);
      CO-deliver(m);
      V_i^g[j] := V_i^g[j] + 1;

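A runnable sketch of this vector-clock rule in Python (illustrative; the
class name and the single-program simulation are assumptions):

    class COProcess:
        def __init__(self, i, n):
            self.i, self.v = i, [0] * n     # V_i^g
            self.holdback, self.delivered = [], []

        def co_multicast(self, msg):
            self.v[self.i] += 1
            return (msg, self.i, list(self.v))   # <m, V_i^g>

        def b_deliver(self, packet):
            self.holdback.append(packet)
            self._flush()

        def _flush(self):
            progress = True
            while progress:
                progress = False
                for packet in list(self.holdback):
                    msg, j, vj = packet
                    ok = (vj[j] == self.v[j] + 1 and
                          all(vj[k] <= self.v[k]
                              for k in range(len(vj)) if k != j))
                    if ok:                       # delivery condition met
                        self.holdback.remove(packet)
                        self.delivered.append(msg)
                        self.v[j] += 1
                        progress = True

    p0, p1, p2 = (COProcess(i, 3) for i in range(3))
    m1 = p0.co_multicast("m1")
    p1.b_deliver(m1)
    m2 = p1.co_multicast("m2")   # causally after m1
    p2.b_deliver(m2)             # held back at p2: m1 is missing
    p2.b_deliver(m1)             # m1 and then m2 are CO-delivered
    print(p2.delivered)          # ['m1', 'm2']
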
Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Consensus Introduction
• Reaching agreement in a distributed manner:
  • Mutual exclusion: who may enter the critical region
  • Totally ordered multicast: the order of message delivery
  • Byzantine generals: attack or retreat?
• Consensus problem: agree on a value after one or more of the processes
  has proposed what the value should be

Consensus (1)
• Objective: processes must agree on a value after one or more of the
  processes has proposed what that value should be
• Hypotheses: reliable communication, but processes may fail
• Consensus problem:
  • Every process pi begins in the undecided state
  • It proposes a value Vi ∈ D (i = 1, …, N)
  • Processes communicate with one another, exchanging values
  • Each process then sets the value of a decision variable di and
    enters the decided state, in which it may no longer change di
    (i = 1, …, N)

Consensus (2)

[Figure: P1 and P2 propose V1 = V2 = proceed, P3 proposes V3 = abort and
 then crashes; the consensus algorithm leads the correct processes to
 decide d1 = d2 = proceed]

Consensus (3)
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision value of all correct processes is the same:
    if pi and pj are correct then di = dj (i, j = 1, …, N)
  • Integrity: if the correct processes all proposed the same value,
    then any correct process in the decided state has chosen that value

Consensus (4)
• Consensus in a synchronous system:
  • Uses basic multicast
  • At most f processes may crash
  • f+1 rounds are necessary; Values_i^r denotes the set of proposed
    values known to process pi at the beginning of round r
  • The duration of one round is bounded by a timeout
    (a sketch follows below)

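A round-by-round simulation in Python (a simplification: full-information
exchange, with crashes only between rounds; all names are assumptions):

    def synchronous_consensus(proposals, f, crashed_after=None):
        # proposals: dict pid -> proposed value; f: max number of crashes.
        crashed_after = crashed_after or {}
        known = {p: {v} for p, v in proposals.items()}
        for r in range(1, f + 2):                 # rounds 1 .. f+1
            snapshot = {p: set(v) for p, v in known.items()}
            for p in snapshot:
                if crashed_after.get(p, f + 2) < r:
                    continue                      # p crashed: sends nothing
                for q in known:                   # B-multicast of values
                    known[q] |= snapshot[p]
        # Every correct process applies the same rule, e.g. the minimum.
        return {p: min(vals) for p, vals in known.items()
                if crashed_after.get(p, f + 2) >= f + 2}

    print(synchronous_consensus({1: "proceed", 2: "proceed", 3: "abort"},
                                f=1, crashed_after={3: 1}))
    # {1: 'abort', 2: 'abort'}: all correct processes decide the same value
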
Consensus (5)
• Interactive consistency problem: variant of the consensus problem
• Objective: correct processes must agree on a vector of values, one for
  each process
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision vector of all correct processes is the same
  • Integrity: if pi is correct, then all correct processes decide on Vi
    as the i-th component of their vector

Consensus (6)
• Byzantine generals problem: variant of the consensus problem
• Objective: a distinguished process (the commander) supplies a value
  that the others must agree upon
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision value of all correct processes is the same:
    if pi and pj are correct then di = dj (i, j = 1, …, N)
  • Integrity: if the commander is correct, then all correct processes
    decide on the value that the commander proposed

Consensus (7)
• Byzantine agreement in a synchronous system:
  • Example: a system of three processes that must agree on a binary
    value (0 or 1)

[Figure: two indistinguishable scenarios. Scenario 1: the commander
 correctly sends 1 to both nodes, but faulty node j relays 0 to node i.
 Scenario 2: a faulty commander sends 1 to node i and 0 to node j.
 Node i receives the same messages in both cases, so agreement is
 impossible with one fault among three processes.]

• The number of faulty processes must be bounded

Consensus (8)
• For m faulty processes, n ≥ 3m + 1, where n denotes the total number
  of processes
• Interactive Consistency Algorithm ICA(m), where m denotes the maximal
  number of processes that may fail simultaneously:
  • Sender: all nodes must agree upon its value
  • Receivers: all other processes
  • If a process does not send a message, the receiving process uses a
    default value
• The ICA algorithm requires m+1 rounds to achieve consensus

Consensus (9)
• Interactive Consistency Algorithm:

  Algorithm ICA(0)
  1. The sender sends its value to all the other n-1 processes
  2. Each process uses the value received from the sender, or the
     default value if no message is received
  End

  Algorithm ICA(m), m > 0
  1. The sender sends its value to all the other n-1 processes
  2. Let Vi be the value received by process i from the sender, or the
     default value if no message is received. Process i acts as a sender
     in ICA(m-1): it sends the value Vi to the n-2 other processes
  3. For each i, let Vj be the value received by process i from process
     j (j ≠ i) in step 2. Process i uses the value Choice(V1, …, Vn-1)
  End

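A recursive sketch in Python, with majority voting as Choice and a crude
fault model in which faulty processes simply flip the bit they relay
(the names and the fault model are assumptions for illustration; a real
Byzantine process may behave arbitrarily):

    from collections import Counter

    def choice(values):
        return Counter(values).most_common(1)[0][0]   # majority vote

    def ica(m, commander, receivers, value, faulty):
        # Returns dict: receiver -> value it uses for this commander.
        sent = {p: (1 - value if commander in faulty else value)
                for p in receivers}
        if m == 0:
            return sent
        decided = {}
        for p in receivers:
            # Each other receiver q relays its value as a sender in ICA(m-1).
            relayed = [ica(m - 1, q, [r for r in receivers if r != q],
                           sent[q], faulty)[p]
                       for q in receivers if q != p]
            decided[p] = choice(relayed + [sent[p]])
        return decided

    # n = 4 processes, m = 1 fault: one faulty lieutenant ("L").
    print(ica(1, "C", ["A", "B", "L"], 1, faulty={"L"}))
    # {'A': 1, 'B': 1, 'L': 1}: the correct lieutenants agree on 1
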
References
• Dr. Mourad Elhadef's presentation
• Coulouris G. et al., Distributed Systems: Concepts and Design,
  Pearson, 2001
• Other presentations
• Wikipedia: www.wikipedia.com