
GLOBAL STATES AND CHECKPOINTS
CS 271

Distributed Checkpoints and Rollback Recovery
- Fault tolerance is achieved by periodically saving the processes' states to stable storage during failure-free execution.
- Upon a failure, a failed process rolls back to one of its saved states, thereby reducing the amount of lost computation.
- Each of the saved states is called a checkpoint.
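The save-and-roll-back cycle described above can be illustrated with a minimal single-process sketch. The function names `save_checkpoint` and `load_checkpoint`, the file name `p0.ckpt`, and the checkpoint period of three steps are assumptions made for illustration; a local file written atomically stands in for stable storage.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Persist `state` so that a crash never leaves a half-written checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # force the bytes to disk before renaming
    os.replace(tmp, path)         # atomic rename over the previous checkpoint

def load_checkpoint(path):
    """Roll back: return the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)

workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "p0.ckpt")   # hypothetical checkpoint file

state = {"step": 0, "total": 0}
for step in range(1, 8):
    state["step"] = step
    state["total"] += step
    if step % 3 == 0:                     # assumed period: every 3 steps
        save_checkpoint(state, ckpt)

# Simulated failure after step 7: the in-memory state is lost.
recovered = load_checkpoint(ckpt)
print(recovered)                          # state as of step 6
```

Only the work done after the last checkpoint (step 7 here) has to be repeated.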

Checkpoint-Based Recovery
- Uncoordinated checkpointing: Each process takes its checkpoints independently.
- Coordinated checkpointing: Processes coordinate their checkpoints in order to save a system-wide consistent state.
- Communication-induced checkpointing: Each process is forced to take checkpoints based on information piggybacked on the application messages it receives from other processes.

Domino effect: example


[Figure: processes P0, P1, and P2 exchanging messages m0 to m7, with the recovery line marked]
Domino effect: cascaded rollback that causes the system to roll back too far in the computation (even to the beginning), in spite of all the checkpoints.
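To make the cascade concrete, here is a small sketch that computes a recovery line by repeatedly undoing orphan messages (messages whose receipt is kept while their send is rolled back). The two-process history below is a hypothetical staircase pattern, not the one in the figure, and the function name `recovery_line` and the interval numbering are assumptions made for illustration.

```python
def recovery_line(num_checkpoints, messages, failed):
    """Return, per process, the checkpoint index it must restart from.

    num_checkpoints[p]: checkpoints taken by p, numbered 0..n-1 (0 = initial state).
    Interval k of p holds the events after checkpoint k and before checkpoint k+1.
    messages: (sender, send_interval, receiver, recv_interval) tuples.
    A result equal to num_checkpoints[p] means p keeps its current state.
    """
    line = dict(num_checkpoints)                 # nobody rolls back yet
    line[failed] = num_checkpoints[failed] - 1   # failed process: latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, s, receiver, r in messages:
            receipt_kept = r < line[receiver]
            send_undone = s >= line[sender]
            if receipt_kept and send_undone:     # orphan message
                line[receiver] = r               # roll receiver back before the receipt
                changed = True
    return line

# Hypothetical history: each process took checkpoints 0, 1, 2 independently.
messages = [
    ("P1", 0, "P0", 0),
    ("P0", 1, "P1", 0),
    ("P1", 1, "P0", 1),
    ("P0", 2, "P1", 1),
]
line = recovery_line({"P0": 3, "P1": 3}, messages, failed="P0")
print(line)

# With no messages crossing the checkpoints, only the failed process rolls back.
isolated = recovery_line({"P0": 3, "P1": 3}, [], failed="P0")
```

In the staircase history every rollback creates a new orphan message on the other process, so both processes end up at checkpoint 0, the initial state.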

Global State
Chandy and Lamport, TOCS 1985
- Global state of a distributed system:
  - Local state of each process
  - Messages sent but not yet received
- Many applications need the state of the system:
  - Failure recovery, distributed deadlock detection
  - Detecting stable properties
- Problem: how can you figure out the state of a distributed system?
  - Each process is independent
  - The network does not have any processing power
- Distributed snapshot: a consistent global state

Global State

a) A consistent cut
b) An inconsistent cut
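The difference between the two cuts can be stated as a simple test: a cut is consistent if every message whose receipt lies inside the cut was also sent inside the cut. The sketch below checks that condition; the event numbering, the function name `is_consistent`, and the sample messages are assumptions for illustration, not taken from the figure.

```python
def is_consistent(cut, messages):
    """cut[p] = number of events of process p that lie inside the cut.

    messages: (sender, send_index, receiver, recv_index) tuples, where the
    indexes are 0-based positions in each process's local event sequence.
    """
    for sender, s, receiver, r in messages:
        received_in_cut = r < cut[receiver]
        sent_in_cut = s < cut[sender]
        if received_in_cut and not sent_in_cut:
            return False      # the cut records an effect without its cause
    return True

# Hypothetical execution with two messages.
messages = [("P1", 0, "P2", 1), ("P2", 2, "P1", 3)]

# The first message is sent inside the cut but not yet received: it is in
# transit, which is allowed.
cut_a = is_consistent({"P1": 1, "P2": 1}, messages)

# The second message is received inside the cut but sent outside it.
cut_b = is_consistent({"P1": 4, "P2": 2}, messages)
print(cut_a, cut_b)
```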

Distributed Snapshot Algorithm
- Assume processes communicate over unidirectional, FIFO, point-to-point channels (e.g., TCP connections).
- Any process can initiate the algorithm:
  - Checkpoint local state
  - Send MARKER on every outgoing channel
- On receiving the first marker (on any channel), a process:
  - Checkpoints its local state
  - Sends markers on all outgoing channels
  - Starts saving the messages that arrive on all other incoming channels (the channel that delivered the first marker is recorded as empty)
- On receiving a subsequent marker on a channel:
  - Stop saving messages for that channel
  - The saved messages are the state of that channel
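The marker rules above can be sketched as a small single-threaded simulation. This is an illustrative sketch, not a networked implementation: the `Process` class, the `connect` helper, the `MARKER` sentinel, and the use of integer balances as application state are all assumptions. Each channel is a `collections.deque`, which gives the FIFO behavior the algorithm assumes.

```python
from collections import deque

MARKER = "MARKER"   # hypothetical sentinel for the marker message

class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state            # application state (an integer balance)
        self.incoming = {}            # sender name -> FIFO channel
        self.outgoing = {}            # receiver name -> FIFO channel
        self.recorded_state = None    # checkpointed local state
        self.channel_state = {}       # sender name -> messages saved for that channel
        self.recording = set()        # incoming channels still being saved

    def initiate(self):
        """Checkpoint local state and send a marker on every outgoing channel."""
        self.recorded_state = self.state
        self.channel_state = {src: [] for src in self.incoming}
        self.recording = set(self.incoming)
        for chan in self.outgoing.values():
            chan.append(MARKER)

    def send(self, dst, amount):
        self.state -= amount
        self.outgoing[dst].append(amount)

    def receive(self, src):
        msg = self.incoming[src].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.initiate()               # first marker: same steps as an initiator
            self.recording.discard(src)       # stop saving messages for this channel
        else:
            self.state += msg
            if src in self.recording:
                self.channel_state[src].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.recording

def connect(a, b):
    """Create a unidirectional FIFO channel from a to b."""
    chan = deque()
    a.outgoing[b.name] = chan
    b.incoming[a.name] = chan

p, q = Process("p", 100), Process("q", 100)
connect(p, q)
connect(q, p)

p.send("q", 10)     # in flight when the snapshot starts
p.initiate()        # p records 90 and sends a marker to q
q.send("p", 20)     # q has not seen the marker yet
q.receive("p")      # q gets the 10
q.receive("p")      # q gets the marker: records 90, channel p->q is empty
p.receive("q")      # p gets the 20 while recording channel q->p
p.receive("q")      # p gets q's marker and stops recording

total = (p.recorded_state + q.recorded_state
         + sum(p.channel_state["q"]) + sum(q.channel_state["p"]))
print(p.recorded_state, q.recorded_state, p.channel_state, q.channel_state, total)
```

The recorded local states sum to 180, and the missing 20 shows up in the recorded state of channel q->p, which is why the channel states must be part of the snapshot.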

Distributed Snapshot
- A process finishes when:
  - It has received a marker on each incoming channel and processed them all
  - State: local state plus state of all channels
  - It sends this state to the initiator
- Any process can initiate a snapshot:
  - Multiple snapshots may be in progress
  - Each is distinguished by tagging the marker with the initiator ID (and sequence number)
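One way to keep concurrent snapshots apart is to key all recording state by the (initiator ID, sequence number) pair carried in the marker. The sketch below shows only that bookkeeping for a single process; the names `Marker`, `SnapshotRecord`, and `SnapshotTable` are hypothetical, and forwarding markers on outgoing channels is left out to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Marker:
    initiator: str   # ID of the process that started the snapshot
    seq: int         # that initiator's sequence number

@dataclass
class SnapshotRecord:
    local_state: object
    channel_msgs: dict = field(default_factory=dict)   # channel -> saved messages
    open_channels: set = field(default_factory=set)    # channels still being saved

class SnapshotTable:
    """Per-process bookkeeping, one record per snapshot in progress."""

    def __init__(self, incoming):
        self.incoming = list(incoming)
        self.records = {}   # (initiator, seq) -> SnapshotRecord

    def on_marker(self, marker, src, local_state):
        key = (marker.initiator, marker.seq)
        if key not in self.records:   # first marker of this particular snapshot
            self.records[key] = SnapshotRecord(
                local_state,
                {c: [] for c in self.incoming},
                set(self.incoming),
            )
        rec = self.records[key]
        rec.open_channels.discard(src)
        return not rec.open_channels  # True once this snapshot is complete here

    def on_message(self, msg, src):
        # An application message may belong to several snapshots at once.
        for rec in self.records.values():
            if src in rec.open_channels:
                rec.channel_msgs[src].append(msg)

table = SnapshotTable(incoming=["p", "r"])
table.on_marker(Marker("p", 1), src="p", local_state="s1")
table.on_message("x", src="r")
table.on_marker(Marker("r", 1), src="r", local_state="s2")
table.on_message("y", src="p")
finished = table.on_marker(Marker("p", 1), src="r", local_state="s3")
print(finished, table.records[("p", 1)].channel_msgs, table.records[("r", 1)].channel_msgs)
```

Message "x" is saved only for the snapshot started by p, and "y" only for the one started by r, because each snapshot closes its channels independently.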

Snapshot Algorithm Example

a) Organization of a process and channels for a distributed snapshot

Snapshot Algorithm Example

b) Process Q receives a marker for the first time and records its local state
c) Q records all incoming messages
d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel


Execution Example

[Figure: space-time diagram of processes p and q, with local states Sp0 to Sp3 and Sq0 to Sq3 and messages m1, m2, m3]

Execution Example
q records state as Sq1, sends marker to p

Execution Example
p records state as Sp2, channel state as empty

Execution Example
q records channel state as m3

Execution Example
Recorded Global State = ((Sp2, Sq1), (∅, m3))

Take-Home Message (Snapshot and Global States)
- General solution for global state detection.
- Causality-based detection of stable properties.
- Simple, efficient protocol that uses markers and FIFO properties.
- Don't forget channel states.
- Foundation for distributed checkpointing and rollback recovery.