
Coordination and Agreement

Master 2007
Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Main Assumptions
• Each pair of processes is connected by reliable channels
• Processes are independent of each other
• The network does not disconnect
• Processes fail only by crashing
• Each process has a local failure detector

Distributed Mutual Exclusion (1)

[Figure: n processes (Process 1 … Process n) concurrently accessing a
 shared resource]

• Mutual exclusion is very important:
  • it prevents interference
  • it ensures consistency when accessing the resources

Distributed Mutual Exclusion (2)
• Mutual exclusion is useful when the server managing the resources does
  not use locks
• Critical section:

    Enter()   // enter the critical section – blocking
      • access shared resources in the critical section
    Exit()    // leave the critical section

Distributed Mutual Exclusion (3)
• Distributed mutual exclusion: no shared variables, only message passing
• Properties:
  • Safety: at most one process may execute in the critical section at a
    time
  • Liveness: requests to enter and exit the critical section eventually
    succeed (no deadlock and no starvation)
  • Ordering: if one request to enter the CS happened-before another,
    then entry to the CS is granted in that order

Mutual Exclusion Algorithms
• Basic hypotheses:
  • System: asynchronous
  • Processes: do not fail
  • Message transmission: reliable

• Central Server Algorithm
• Ring-Based Algorithm
• Mutual Exclusion using Multicast and Logical Clocks
• Maekawa's Voting Algorithm
• Mutual Exclusion Algorithms Comparison

Central Server Algorithm

[Figure: a server manages the token and a queue of pending requests (here
 containing P4 and P2). 1) A process requests the token and waits;
 2) the process holding the token releases it back to the server;
 3) the server grants the token to the process at the head of the queue.]

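A minimal sketch of the server side in Python (an illustration, not the
slides' code; class and method names are assumptions):

    from collections import deque

    class TokenServer:
        def __init__(self):
            self.queue = deque()    # queue of pending requests (process ids)
            self.holder = None      # process currently holding the token

        def request_token(self, pid):
            # 1) a process asks for the token: grant it if free, else queue
            if self.holder is None:
                self.holder = pid   # 3) grant the token immediately
                return True
            self.queue.append(pid)  # token busy: the caller must wait
            return False

        def release_token(self, pid):
            # 2) the holder returns the token: pass it to the next waiter
            assert self.holder == pid
            self.holder = self.queue.popleft() if self.queue else None
            return self.holder      # new holder, or None if nobody waits

    server = TokenServer()
    print(server.request_token("P3"))   # True: P3 holds the token
    print(server.request_token("P1"))   # False: P1 is queued, waiting
    print(server.release_token("P3"))   # 'P1': token granted to P1
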
Ring-Based Algorithm (1)

[Figure: a group of unordered processes (P1, P2, …, Pn) on a network such
 as Ethernet, arranged into a logical ring]

Ring-Based Algorithm (2)

[Figure: the token navigates around the ring P1 → P2 → … → Pn → P1; a
 process may execute Enter(), its critical section, and Exit() only while
 it holds the token]

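A toy single-process simulation of the idea (illustrative; the function
name and the round-based loop are assumptions, not the slides' code):

    def ring_mutex(ring, wants_cs, laps=2):
        # Pass the token around the ring; only the holder may enter the
        # critical section, then it forwards the token to its neighbor.
        token_at = 0
        for _ in range(laps * len(ring)):
            pid = ring[token_at]
            if pid in wants_cs:
                print(pid, "enters and exits the critical section")
                wants_cs.discard(pid)
            token_at = (token_at + 1) % len(ring)   # forward the token

    ring_mutex(["P1", "P2", "P3", "P4"], {"P3", "P1"})
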
Mutual Exclusion using Multicast and Logical Clocks (1)

[Figure: P1 and P2 request entry to the critical section simultaneously,
 with Lamport timestamps 19 and 23. P3 is not requesting and replies to
 both. Since 19 < 23, P2 replies to P1 while P1 queues P2's request; P1
 enters the critical section first and replies to P2 when it exits.]

Mutual Exclusion using Multicast and Logical Clocks (2)
• Main steps of the algorithm:

    Initialization
      state := RELEASED;

    Process pi requests entry to the critical section
      state := WANTED;
      T := request's timestamp;
      Multicast request <T, pi> to all processes;
      Wait until (number of replies received = N – 1);
      state := HELD;

Mutual Exclusion using Multicast and Logical Clocks (3)
• Main steps of the algorithm (cont'd):

    On receipt of a request <Ti, pi> at pj (i ≠ j)
      If (state = HELD) OR (state = WANTED AND (T, pj) < (Ti, pi))
      Then queue the request from pi without replying;
      Else reply immediately to pi;

    To quit the critical section
      state := RELEASED;
      Reply to any queued requests;

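A compact sketch of the whole algorithm in Python, simulated in a single
program with synchronous message delivery (an illustration under the
no-failure hypothesis; class and method names are assumptions):

    RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

    class Process:
        def __init__(self, pid):
            self.pid, self.peers = pid, []
            self.state, self.clock = RELEASED, 0
            self.queue, self.replies = [], 0

        def request_cs(self):
            self.state = WANTED
            self.clock += 1
            self.stamp = (self.clock, self.pid)   # Lamport timestamp <T, pi>
            self.replies = 0
            for p in self.peers:
                p.on_request(self.stamp, self)

        def on_request(self, stamp, sender):
            # Queue if we hold the CS, or we want it and our request is older.
            if self.state == HELD or (self.state == WANTED
                                      and self.stamp < stamp):
                self.queue.append(sender)
            else:
                sender.on_reply()

        def on_reply(self):
            self.replies += 1
            if self.replies == len(self.peers):   # all N-1 replies received
                self.state = HELD

        def release_cs(self):
            self.state = RELEASED
            for p in self.queue:                  # reply to queued requests
                p.on_reply()
            self.queue = []

    a, b = Process(1), Process(2)
    a.peers, b.peers = [b], [a]
    a.request_cs()             # b is RELEASED and replies: a enters the CS
    b.request_cs()             # a holds the CS, so b's request is queued
    a.release_cs()             # a replies to b, which then enters the CS
    print(a.state, b.state)    # RELEASED HELD
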
Maekawa's Voting Algorithm (1)
• Candidate process: must collect sufficient votes to enter the critical
  section
• Each process pi maintains a voting set Vi (i = 1, …, N), where
  Vi ⊆ {p1, …, pN}
• The sets Vi are chosen such that, for all i, j:
  • pi ∈ Vi
  • Vi ∩ Vj ≠ ∅ (at least one common member in any two voting sets)
  • |Vi| = K (fairness: all voting sets have the same size)
  • each process pj is contained in M of the voting sets Vi

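A common construction that satisfies these properties (an assumption for
illustration; the slides do not fix one) places the N processes on a
√N × √N grid and takes Vi to be the union of pi's row and column:

    import math

    def grid_voting_sets(n):
        side = math.isqrt(n)
        assert side * side == n, "this sketch assumes N is a perfect square"
        sets = []
        for i in range(n):
            r, c = divmod(i, side)
            row = {r * side + k for k in range(side)}   # pi's grid row
            col = {k * side + c for k in range(side)}   # pi's grid column
            sets.append(row | col)                      # Vi = row ∪ column
        return sets

    V = grid_voting_sets(9)
    print(len(V[0]))     # 5: K = 2*sqrt(N) - 1 for every voting set
    print(V[0] & V[8])   # {2, 6}: any two voting sets intersect
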
Maekawa's Voting Algorithm (2)
• Main steps of the algorithm:

    Initialization
      state := RELEASED;
      voted := FALSE;

    For pi to enter the critical section
      state := WANTED;
      Multicast request to all processes in Vi – {pi};
      Wait until (number of replies received = K – 1);
      state := HELD;      // pi enters only after collecting K-1 votes

Maekawa's Voting Algorithm (3)
• Main steps of the algorithm (cont'd):

    On receipt of a request from pi at pj (i ≠ j)
      If (state = HELD OR voted = TRUE)
      Then queue the request from pi without replying;
      Else reply immediately to pi;
           voted := TRUE;

    For pi to exit the critical section
      state := RELEASED;
      Multicast release to all processes in Vi – {pi};

16
Maekawa's Voting Algorithm (4)
• Main steps of the algorithm (cont'd):

    On receipt of a release from pi at pj (i ≠ j)
      If (queue of requests is non-empty)
      Then remove the head of the queue, say pk;
           send a reply to pk;
           voted := TRUE;
      Else voted := FALSE;

17
Mutual Exclusion Algorithms Comparison

  Algorithm        Messages per     Messages          Problems
                   Enter()/Exit()   before Enter()
  Centralized      3                2                 Crash of the server
  Virtual ring     1 to ∞           0 to N-1          Crash of a process;
                                                      token lost; ordering
                                                      not satisfied
  Logical clocks   2(N-1)           2(N-1)            Crash of a process
  Maekawa's alg.   3√N              2√N               Crash of a voting
                                                      process

Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Election Algorithms (1)
• Objective: elect one process pi from a group of processes p1 … pN, even
  if multiple elections have been started simultaneously
• Utility: elect a primary manager, a master process, a coordinator, or a
  central server
• Each process pi maintains the identity of the elected process in the
  variable Electedi (NIL if it is not yet defined)
• Properties to satisfy: for every pi,
  • Safety: Electedi = NIL, or Electedi = P, where P is the non-crashed
    process with the largest identifier
  • Liveness: pi participates and eventually sets Electedi ≠ NIL, or
    crashes

Election Algorithms (2)
• Ring-Based Election Algorithm
• Bully Algorithm
• Election Algorithms Comparison

Ring-Based Election Algorithm (1)

[Figure: processes with identifiers 5, 16, 9, 25 arranged in a ring.
 Process 5 starts the election; the election message carries the largest
 identifier seen so far (5, 16, 25, …) around the ring, until 25
 recognizes its own identifier and becomes the coordinator.]

Ring-Based Election Algorithm (2)

    Initialization
      Participanti := FALSE;
      Electedi := NIL;

    pi starts an election
      Participanti := TRUE;
      Send the message <election, pi> to its neighbor;

    Receipt of a message <elected, pj> at pi
      Participanti := FALSE;
      Electedi := pj;
      If pi ≠ pj
      Then send the message <elected, pj> to its neighbor;

Ring-Based Election Algorithm (3)

    Receipt of an election message <election, pi> at pj
      If pi > pj
      Then send the message <election, pi> to its neighbor;
           Participantj := TRUE;
      Else If pi < pj AND Participantj = FALSE
           Then send the message <election, pj> to its neighbor;
                Participantj := TRUE;
      Else If pi = pj      // pj's own identifier came back: pj is elected
           Then Electedj := pj;
                Participantj := FALSE;
                Send the message <elected, pj> to its neighbor;

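A compact simulation of the election message flow (illustrative; the
function name and the list representation are assumptions):

    def ring_election(ids, starter):
        # ids[k] is the identifier of the process at ring position k.
        n = len(ids)
        msg, pos = ids[starter], (starter + 1) % n
        while msg != ids[pos]:
            # Forward the larger of the carried id and the local id.
            msg = max(msg, ids[pos])
            pos = (pos + 1) % n
        return msg          # a process saw its own id come back: elected

    print(ring_election([5, 16, 9, 25], starter=0))   # 25
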
Bully Algorithm (1)
• Characteristic: allows processes to crash during an election
• Hypotheses:
  • Reliable transmission
  • Synchronous system: a reply can be expected within

      T = 2 × DelayTransmission + DelayProcessing

Bully Algorithm (2)
• Hypotheses (cont'd):
  • Each process knows which processes have higher identifiers, and it
    can communicate with all such processes
• Three types of messages:
  • Election: starts an election
  • OK: sent in response to an election message
  • Coordinator: announces the new coordinator
• An election is started by a process when it notices, through timeouts,
  that the coordinator has failed

Bully Algorithm (3)

[Figure: processes 1–8; the coordinator (8) has failed. Process 5 detects
 the failure first and sends Election messages to 6, 7 and 8; 6 and 7
 answer OK. The election propagates to the higher processes, and 7, the
 highest non-crashed process, becomes the new coordinator.]

Bully Algorithm (4)

    Initialization
      Electedi := NIL;

    pi starts the election
      Send the message (Election, pi) to every pj with pj > pi;
      Wait for messages (OK, pj);
      If no message (OK, pj) arrives during T
      Then Electedi := pi;
           Send the message (Coordinator, pi) to every pj with pj < pi;
      Else wait until receipt of a (Coordinator) message
           (if it does not arrive during another timeout T', begin
           another election);

Bully Algorithm (5)

    Receipt of the message (Coordinator, pj) at pi
      Electedi := pj;

    Receipt of the message (Election, pj) at pi
      Send the message (OK, pi) to pj;
      Start an election unless pi has begun one already;

• When a process is started to replace a crashed process, it begins an
  election

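A toy simulation of the rule "the highest non-crashed process wins"
(illustrative; the alive map stands in for timeout-based failure
detection, and the names are assumptions):

    def bully_election(alive, starter):
        # alive: dict id -> bool; starter: id of the detecting process.
        higher = [p for p in alive if p > starter and alive[p]]
        if not higher:
            return starter      # no OK within T: starter is coordinator
        # Some higher process answers OK and takes over the election.
        return bully_election(alive, max(higher))

    alive = {i: True for i in range(1, 8)}
    alive[8] = False            # the old coordinator has crashed
    print(bully_election(alive, starter=5))   # 7
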
Election Algorithms Comparison

  Election algorithm   Number of messages   Problems
  Virtual ring         2N to 3N-1           Does not tolerate faults
  Bully                N-2 to O(N²)         System must be synchronous

Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Group Communication (1)
• Objective: each of a group of processes must receive copies of the
  messages sent to the group
• Group communication requires:
  • Coordination
  • Agreement: on the set of messages that is received and on the
    delivery ordering
• We study multicast communication among processes whose membership is
  known (static groups)

Group Communication (2)
• System: contains a collection of processes, which can communicate
  reliably over one-to-one channels
• Processes: members of groups, may fail only by crashing
• Groups:

[Figure: a closed group, where only members multicast to the group, next
 to an open group, where non-members may also send to it]

Group Communication (3)
• Primitives:
  • multicast(g, m): sends the message m to all members of group g
  • deliver(m): delivers the message m to the calling process
  • sender(m): unique identifier of the process that sent the message m
  • group(m): unique identifier of the group to which the message m was
    sent

Group Communication (4)
• Basic Multicast
• Reliable Multicast
• Ordered Multicast

Basic Multicast
• Objective: guarantee that a correct process will eventually deliver the
  message, as long as the multicaster does not crash
• Primitives: B_multicast, B_deliver
• Implementation: use reliable one-to-one communication (threads may
  perform the send operations simultaneously)

    To B_multicast(g, m)
      For each process p ∈ g, send(p, m);

    On receive(m) at p
      B_deliver(m) to p

• Unreliable: acknowledgments may be dropped

Reliable Multicast (1)
• Properties to satisfy:
  • Integrity: a correct process p delivers the message m at most once
  • Validity: if a correct process multicasts a message m, then it will
    eventually deliver m
  • Agreement: if a correct process delivers the message m, then all
    other correct processes in group(m) will eventually deliver m
• Primitives: R_multicast, R_deliver

Reliable Multicast (2)
• Implementation using B-multicast (correct but inefficient: each message
  is sent |g| times to each process):

    Initialization
      msgReceived := {};

    R-multicast(g, m) by p
      B-multicast(g, m);    // p ∈ g

    B-deliver(m) by q, with g = group(m), p = sender(m)
      If (m ∉ msgReceived)
      Then msgReceived := msgReceived ∪ {m};
           If (q ≠ p) Then B-multicast(g, m);
           R-deliver(m);

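The same algorithm transcribed into Python, simulated with in-memory
process objects and synchronous delivery (illustrative; all names are
assumptions):

    class RProcess:
        def __init__(self):
            self.received, self.delivered = set(), []

        def r_multicast(self, group, msg):
            self.b_multicast(group, msg, origin=self)

        def b_multicast(self, group, msg, origin):
            for q in group:
                q.b_deliver(group, msg, origin)

        def b_deliver(self, group, msg, origin):
            if msg not in self.received:
                self.received.add(msg)
                if self is not origin:                  # q != p: re-multicast
                    self.b_multicast(group, msg, origin)
                self.delivered.append(msg)              # R-deliver(m)

    group = [RProcess() for _ in range(3)]
    group[0].r_multicast(group, "m1")
    print([p.delivered for p in group])   # each delivers m1 exactly once
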
Ordered Multicast
• Ordering categories:
  • FIFO Ordering
  • Total Ordering
  • Causal Ordering
  • Hybrid Ordering: Total-Causal, Total-FIFO

FIFO Ordering (1)
• If a correct process issues multicast(g, m1) and then multicast(g, m2),
  then every correct process that delivers m2 will deliver m1 before m2

[Figure: three processes; m1 and m2 from the same sender are delivered in
 sending order everywhere, while m3 from another sender is unconstrained]

FIFO Ordering (2)
• Primitives: FO_multicast, FO_deliver
• Implementation: use of sequence numbers
• Variables maintained by each process p:
  • S_g^p: number of messages sent by p to group g
  • R_g^q: sequence number of the latest message p has delivered from
    process q that was sent to the group
• FIFO ordering is reached only under the assumption that groups are
  non-overlapping

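A sketch of the receiving side in Python, with a hold-back buffer keyed
by (sender, sequence number) (illustrative; names are assumptions):

    class FifoReceiver:
        def __init__(self):
            self.next_seq = {}    # R_g^q: next expected number per sender
            self.holdback = {}    # (sender, seq) -> buffered message
            self.delivered = []

        def receive(self, sender, seq, msg):
            self.holdback[(sender, seq)] = msg
            want = self.next_seq.setdefault(sender, 1)
            # Deliver consecutive messages from this sender, if possible.
            while (sender, want) in self.holdback:
                self.delivered.append(self.holdback.pop((sender, want)))
                want += 1
            self.next_seq[sender] = want

    r = FifoReceiver()
    r.receive("p", 2, "m2")   # held back: m1 has not been delivered yet
    r.receive("p", 1, "m1")   # now m1 and then m2 are delivered
    print(r.delivered)        # ['m1', 'm2']
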
Total Ordering (1)
• If a correct process delivers message m2 before it delivers m1, then
  any correct process that delivers m1 will deliver m2 before m1

[Figure: three processes delivering m1 and m2 in the same relative order]

• Primitives: TO_multicast, TO_deliver

Total Ordering (2)
• Implementation: assign totally ordered identifiers to multicast
  messages
• Each process makes the same ordering decision based upon these
  identifiers
• Methods for assigning identifiers to messages:
  • Sequencer process
  • Processes collectively agree on the assignment of sequence numbers
    to messages in a distributed fashion

Total Ordering (3)
• Sequencer process: maintains a group-specific sequence number Sg

    Initialization
      Sg := 0;

    B-deliver(<m, ident>) with g = group(m)
      B-multicast(g, <"order", ident, Sg>);
      Sg := Sg + 1;

• Algorithm for group member p ∈ g:

    Initialization
      Rg := 0;

Total Ordering (4)

    TO-multicast(g, m) by p              // ident: unique identifier of m
      B-multicast(g ∪ {sequencer(g)}, <m, ident>);

    B-deliver(<m, ident>) by p, with g = group(m)
      Place <m, ident> in the hold-back queue;

    B-deliver(m_order = <"order", ident, S>) by p, with g = group(m_order)
      Wait until (<m, ident> in hold-back queue AND S = Rg);
      TO-deliver(m);
      Rg := S + 1;

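A sketch of both roles in Python; delivery waits until the message and
its matching "order" announcement are both present (illustrative; names
are assumptions):

    class Sequencer:
        def __init__(self):
            self.s = 0
        def order(self, ident):             # emits <"order", ident, S_g>
            s, self.s = self.s, self.s + 1
            return ident, s

    class TOReceiver:
        def __init__(self):
            self.r = 0                      # R_g: next number to deliver
            self.holdback, self.orders, self.delivered = {}, {}, []
        def on_message(self, ident, msg):
            self.holdback[ident] = msg
            self._try_deliver()
        def on_order(self, ident, seq):
            self.orders[seq] = ident
            self._try_deliver()
        def _try_deliver(self):
            while (self.r in self.orders
                   and self.orders[self.r] in self.holdback):
                ident = self.orders.pop(self.r)
                self.delivered.append(self.holdback.pop(ident))
                self.r += 1                 # R_g := S + 1

    seq, rec = Sequencer(), TOReceiver()
    rec.on_message("idB", "mB")             # arrives before its order
    rec.on_order(*seq.order("idA"))         # sequencer ordered mA first
    rec.on_message("idA", "mA")
    rec.on_order(*seq.order("idB"))
    print(rec.delivered)                    # ['mA', 'mB']
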
Total Ordering (5)
• Processes collectively agree on the assignment of sequence numbers to
  messages in a distributed fashion
• Variables maintained by each process q:
  • P_g^q: largest sequence number proposed by q to group g
  • A_g^q: largest agreed sequence number q has observed so far for
    group g

Total Ordering (6)

[Figure: the distributed agreement in three phases, illustrated with
 processes p1 … p5.
 1) Message transmission: the sender multicasts <m, ident> to the group.
 2) Proposal: each receiver q replies with <ident, P_g^q>, where
    P_g^q := MAX(A_g^q, P_g^q) + 1.
 3) Agreement: the sender picks the agreed number SN := MAX of all the
    proposals, multicasts <ident, SN>, and each receiver sets
    A_g^q := SN.]

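The same three phases in a compact Python simulation (illustrative; all
names are assumptions):

    class Member:
        def __init__(self):
            self.proposed = 0   # P_g: largest number proposed to group g
            self.agreed = 0     # A_g: largest agreed number observed

        def propose(self, ident):                   # phase 2
            self.proposed = max(self.agreed, self.proposed) + 1
            return self.proposed

        def agree(self, ident, sn):                 # phase 3
            self.agreed = max(self.agreed, sn)

    def to_multicast(ident, members):
        proposals = [m.propose(ident) for m in members]
        sn = max(proposals)     # the sender picks the agreed number
        for m in members:
            m.agree(ident, sn)
        return sn

    group = [Member() for _ in range(3)]
    print(to_multicast("m1", group))   # 1
    print(to_multicast("m2", group))   # 2: identifiers are totally ordered
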
Causal Ordering (1)
• If multicast(g, m1) → multicast(g, m3), then any correct process that
  delivers m3 will deliver m1 before m3

[Figure: three processes; m1 causally precedes m3, so m1 is delivered
 before m3 everywhere, while m2 is concurrent and unconstrained]

Causal Ordering (2)
• Primitives: CO_multicast, CO_deliver
• Each process pi of group g maintains a timestamp vector V_i^g:
  V_i^g[j] = number of multicast messages received from pj that
  happened-before the next message to be sent
• Algorithm for group member pi:

    Initialization
      V_i^g[j] := 0 (j = 1, …, N);

Causal Ordering (3)

    CO-multicast(g, m)
      V_i^g[i] := V_i^g[i] + 1;
      B-multicast(g, <m, V_i^g>);

    B-deliver(<m, V_j^g>) from pj, with g = group(m)
      Place <m, V_j^g> in a hold-back queue;
      Wait until (V_j^g[j] = V_i^g[j] + 1)
             AND (V_j^g[k] ≤ V_i^g[k] for all k ≠ j);
      CO-deliver(m);
      V_i^g[j] := V_i^g[j] + 1;

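A runnable sketch of this vector-clock rule in Python (illustrative; the
class name and the single-program simulation are assumptions):

    class COProcess:
        def __init__(self, i, n):
            self.i, self.v = i, [0] * n     # V_i^g
            self.holdback, self.delivered = [], []

        def co_multicast(self, msg):
            self.v[self.i] += 1
            return (msg, self.i, list(self.v))   # <m, V_i^g>

        def b_deliver(self, packet):
            self.holdback.append(packet)
            self._flush()

        def _flush(self):
            progress = True
            while progress:
                progress = False
                for packet in list(self.holdback):
                    msg, j, vj = packet
                    ok = (vj[j] == self.v[j] + 1 and
                          all(vj[k] <= self.v[k]
                              for k in range(len(vj)) if k != j))
                    if ok:                       # delivery condition met
                        self.holdback.remove(packet)
                        self.delivered.append(msg)
                        self.v[j] += 1
                        progress = True

    p0, p1, p2 = (COProcess(i, 3) for i in range(3))
    m1 = p0.co_multicast("m1")
    p1.b_deliver(m1)
    m2 = p1.co_multicast("m2")   # causally after m1
    p2.b_deliver(m2)             # held back at p2: m1 is missing
    p2.b_deliver(m1)             # m1 and then m2 are CO-delivered
    print(p2.delivered)          # ['m1', 'm2']
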
Outline
• Introduction
• Distributed Mutual Exclusion
• Election Algorithms
• Group Communication
• Consensus and Related Problems

Consensus Introduction
• Reaching agreement in a distributed manner:
  • Mutual exclusion: who may enter the critical region
  • Totally ordered multicast: the order of message delivery
  • Byzantine generals: attack or retreat?
• Consensus problem: agree on a value after one or more of the processes
  has proposed what the value should be

Consensus (1)
• Objective: processes must agree on a value after one or more of the
  processes has proposed what that value should be
• Hypotheses: reliable communication, but processes may fail
• Consensus problem:
  • Every process pi begins in the undecided state
  • It proposes a value Vi ∈ D (i = 1, …, N)
  • Processes communicate with one another, exchanging values
  • Each process then sets the value of a decision variable di and
    enters the decided state, in which it may no longer change di
    (i = 1, …, N)

Consensus (2)

[Figure: P1 and P2 propose V1 = V2 = proceed, P3 proposes V3 = abort and
 then crashes; the consensus algorithm leads the correct processes to
 decide d1 = d2 = proceed]

Consensus (3)
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision value of all correct processes is the same:
    if pi and pj are correct then di = dj (i, j = 1, …, N)
  • Integrity: if the correct processes all proposed the same value,
    then any correct process in the decided state has chosen that value

Consensus (4)
• Consensus in a synchronous system:
  • Uses basic multicast
  • At most f processes may crash
  • f+1 rounds are necessary; Values_i^r denotes the set of proposed
    values known to process pi at the beginning of round r
  • The duration of one round is bounded by a timeout
    (a sketch follows below)

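A round-by-round simulation in Python (a simplification: full-information
exchange, with crashes only between rounds; all names are assumptions):

    def synchronous_consensus(proposals, f, crashed_after=None):
        # proposals: dict pid -> proposed value; f: max number of crashes.
        crashed_after = crashed_after or {}
        known = {p: {v} for p, v in proposals.items()}
        for r in range(1, f + 2):                 # rounds 1 .. f+1
            snapshot = {p: set(v) for p, v in known.items()}
            for p in snapshot:
                if crashed_after.get(p, f + 2) < r:
                    continue                      # p crashed: sends nothing
                for q in known:                   # B-multicast of values
                    known[q] |= snapshot[p]
        # Every correct process applies the same rule, e.g. the minimum.
        return {p: min(vals) for p, vals in known.items()
                if crashed_after.get(p, f + 2) >= f + 2}

    print(synchronous_consensus({1: "proceed", 2: "proceed", 3: "abort"},
                                f=1, crashed_after={3: 1}))
    # {1: 'abort', 2: 'abort'}: all correct processes decide the same value
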
Consensus (5)
• Interactive consistency problem: variant of the consensus problem
• Objective: correct processes must agree on a vector of values, one for
  each process
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision vector of all correct processes is the same
  • Integrity: if pi is correct, then all correct processes decide on Vi
    as the i-th component of their vector

Consensus (6)
• Byzantine generals problem: variant of the consensus problem
• Objective: a distinguished process (the commander) supplies a value
  that the others must agree upon
• Properties to satisfy:
  • Termination: eventually each correct process sets its decision
    variable
  • Agreement: the decision value of all correct processes is the same:
    if pi and pj are correct then di = dj (i, j = 1, …, N)
  • Integrity: if the commander is correct, then all correct processes
    decide on the value that the commander proposed

Consensus (7)
• Byzantine agreement in a synchronous system:
  • Example: a system of three processes that must agree on a binary
    value (0 or 1)

[Figure: two indistinguishable scenarios. Scenario 1: the commander
 correctly sends 1 to both nodes, but faulty node j relays 0 to node i.
 Scenario 2: a faulty commander sends 1 to node i and 0 to node j.
 Node i receives the same messages in both cases, so agreement is
 impossible with one fault among three processes.]

• The number of faulty processes must be bounded

Consensus (8)
• For m faulty processes, n ≥ 3m + 1, where n denotes the total number
  of processes
• Interactive Consistency Algorithm ICA(m), where m denotes the maximal
  number of processes that may fail simultaneously:
  • Sender: all nodes must agree upon its value
  • Receivers: all other processes
  • If a process does not send a message, the receiving process uses a
    default value
• The ICA algorithm requires m+1 rounds to achieve consensus

Consensus (9)
• Interactive Consistency Algorithm:

  Algorithm ICA(0)
  1. The sender sends its value to all the other n-1 processes
  2. Each process uses the value received from the sender, or the
     default value if no message is received
  End

  Algorithm ICA(m), m > 0
  1. The sender sends its value to all the other n-1 processes
  2. Let Vi be the value received by process i from the sender, or the
     default value if no message is received. Process i acts as a sender
     in ICA(m-1): it sends the value Vi to the n-2 other processes
  3. For each i, let Vj be the value received by process i from process
     j (j ≠ i) in step 2. Process i uses the value Choice(V1, …, Vn-1)
  End

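A recursive sketch in Python, with majority voting as Choice and a crude
fault model in which faulty processes simply flip the bit they relay
(the names and the fault model are assumptions for illustration; a real
Byzantine process may behave arbitrarily):

    from collections import Counter

    def choice(values):
        return Counter(values).most_common(1)[0][0]   # majority vote

    def ica(m, commander, receivers, value, faulty):
        # Returns dict: receiver -> value it uses for this commander.
        sent = {p: (1 - value if commander in faulty else value)
                for p in receivers}
        if m == 0:
            return sent
        decided = {}
        for p in receivers:
            # Each other receiver q relays its value as a sender in ICA(m-1).
            relayed = [ica(m - 1, q, [r for r in receivers if r != q],
                           sent[q], faulty)[p]
                       for q in receivers if q != p]
            decided[p] = choice(relayed + [sent[p]])
        return decided

    # n = 4 processes, m = 1 fault: one faulty lieutenant ("L").
    print(ica(1, "C", ["A", "B", "L"], 1, faulty={"L"}))
    # {'A': 1, 'B': 1, 'L': 1}: the correct lieutenants agree on 1
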
References
• Dr. Mourad Elhadef's presentation
• Coulouris G. et al., Distributed Systems: Concepts and Design,
  Pearson, 2001
• Other presentations
• Wikipedia: www.wikipedia.com