CHECKPOINTS
CS 271
[Figure: timelines of processes P0, P1, and P2 exchanging messages m0 through m7]
Domino Effect: a cascaded rollback that forces the
system to roll back too far in the computation (even to
the beginning), in spite of all the checkpoints
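The cascade can be sketched in a few lines. This is a minimal illustration, not the figure from the slide: the checkpoint times, the message timeline, and the function name recovery_line are all assumptions made for the example. A message is treated as an orphan when its receipt survives a rollback but its send has been undone, which forces the receiver to roll back further.

```python
def recovery_line(checkpoints, messages, failed):
    """Compute how far each process must roll back after `failed` crashes.

    checkpoints: process -> sorted checkpoint times (0 is the initial state).
    messages:    list of (sender, send_time, receiver, recv_time).
    Returns process -> rollback time (inf means no rollback was needed).
    """
    line = {p: float("inf") for p in checkpoints}
    line[failed] = checkpoints[failed][-1]  # restart from the latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, s_t, receiver, r_t in messages:
            # Orphan message: the receive is kept but the send was undone.
            if s_t > line[sender] and r_t <= line[receiver]:
                # Receiver must return to its latest checkpoint before the receive.
                line[receiver] = max(c for c in checkpoints[receiver] if c < r_t)
                changed = True
    return line


# Hypothetical timeline: every checkpoint interval is crossed by a message.
checkpoints = {"P0": [0, 4, 10], "P1": [0, 7, 13]}
messages = [
    ("P1", 14, "P0", 15),
    ("P0", 11, "P1", 12),
    ("P1", 8, "P0", 9),
    ("P0", 5, "P1", 6),
    ("P1", 1, "P0", 2),
]
line = recovery_line(checkpoints, messages, failed="P1")
print(line)  # {'P0': 0, 'P1': 0}
```

In this assumed timeline each rollback orphans one more message, so both processes end up at their initial state even though each took two later checkpoints.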
Global State
Chandy and Lamport, TOCS 1985
Global state of a distributed system
Local state of each process
Messages sent but not received
Global State
a) A consistent cut
b) An inconsistent cut
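One way to make the distinction concrete: a cut is consistent if every message whose receive event lies inside the cut also has its send event inside the cut. The sketch below checks that condition. The Message class, the event-index encoding of a cut, and the two sample messages are assumptions for illustration, not taken from the slide's figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # index of the send event on the sender's timeline
    receiver: str
    recv_time: int   # index of the receive event on the receiver's timeline


def is_consistent(cut, messages):
    """cut[p] is the index of the last event of process p included in the cut."""
    for m in messages:
        received_in_cut = m.recv_time <= cut[m.receiver]
        sent_in_cut = m.send_time <= cut[m.sender]
        if received_in_cut and not sent_in_cut:
            return False  # the cut records a receive with no matching send
    return True


msgs = [Message("p", 1, "q", 2), Message("q", 3, "p", 4)]
consistent_cut = is_consistent({"p": 2, "q": 2}, msgs)
in_transit_cut = is_consistent({"p": 1, "q": 1}, msgs)
inconsistent_cut = is_consistent({"p": 4, "q": 2}, msgs)
print(consistent_cut, in_transit_cut, inconsistent_cut)  # True True False
```

The middle case shows that a message sent inside the cut but not yet received is allowed: it is simply part of the channel state.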
Distributed Snapshot Algorithm
Assume each process communicates with
another process using unidirectional FIFO point-to-point channels (e.g., TCP connections)
Any process can initiate the algorithm
Checkpoint local state
Send MARKER on every outgoing channel
Distributed Snapshot
A process finishes when it has received a marker on each incoming
channel and processed them all
Recorded state: local state plus the state of all
incoming channels
Send state to initiator
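The steps above can be sketched as a small single-threaded simulation. The receive rule used here follows the standard Chandy and Lamport formulation, which the slides only imply: on its first marker a process checkpoints and records that channel as empty, and on every other incoming channel it records the messages that arrive before that channel's marker. The Process class, the token-transfer application, and the delivery order are hypothetical; the sketch also gathers the recorded states directly instead of sending them to the initiator.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance        # application state (hypothetical token count)
        self.peers = peers            # names of the other processes
        self.recorded_state = None    # local checkpoint, None until recorded
        self.channel_state = {}       # incoming channel -> messages recorded in transit
        self.recording = set()        # incoming channels still waiting for a marker

    def checkpoint(self, net):
        """Record local state, then send MARKER on every outgoing channel."""
        self.recorded_state = self.balance
        self.recording = set(self.peers)
        self.channel_state = {src: [] for src in self.peers}
        for dst in self.peers:
            net[(self.name, dst)].append(MARKER)

    def send(self, net, dst, amount):
        self.balance -= amount
        net[(self.name, dst)].append(amount)

    def receive(self, net, src):
        msg = net[(src, self.name)].popleft()   # FIFO delivery
        if msg == MARKER:
            if self.recorded_state is None:
                self.checkpoint(net)            # first marker seen
            self.recording.discard(src)         # this channel's state is final
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)  # message was in transit

    def done(self):
        return self.recorded_state is not None and not self.recording


names = ["p", "q", "r"]
net = {(a, b): deque() for a in names for b in names if a != b}
procs = {n: Process(n, 100, [m for m in names if m != n]) for n in names}

procs["p"].send(net, "q", 10)   # sent before p's checkpoint
procs["p"].checkpoint(net)      # p initiates the snapshot
procs["r"].send(net, "p", 5)    # r has not yet seen a marker

# Deliver messages in a fixed order until every channel is empty.
while any(net.values()):
    for (src, dst), chan in net.items():
        if chan:
            procs[dst].receive(net, src)

total = sum(
    pr.recorded_state + sum(sum(v) for v in pr.channel_state.values())
    for pr in procs.values()
)
print(total)  # 300
```

The recorded total matches the 300 tokens in the system: the 5 tokens that r sent to p are captured as the state of the channel from r to p rather than in either process's checkpoint.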
Snapshot Algorithm
Example
Execution Example
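[Figure: timelines of processes p and q; p moves through states Sp0, Sp1, Sp2, Sp3 and q through Sq0, Sq1, Sq2, Sq3 while messages m1, m2, m3 are exchanged]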
q records its state as Sq1 and sends a marker to p
p receives the marker, records its state as Sp2, and records the state of its incoming channel as empty
q records the state of its incoming channel as m3
Recorded global state = ((Sp2, Sq1), (∅, m3))