GRID ENVIRONMENT
Neeraj Upadhyay Mtech 2nd year
1. Introduction and Motivation
Fault Tolerance
Problem Statement
Dynamic
a) Resource conditions vary dynamically, for example through fault occurrences.
b) Faults are temporally correlated [10]: they are more likely to occur in some
time frames than in others. For example, faults are more likely on weekdays,
when the workload is high, than on weekends. Faults are also more likely
during the daytime.
c) Faults are spatially correlated [10].
Why GA?
Why ACO?
Performance of ACO compared to other metaheuristics [37]
Remark
3. Research gaps
Lognormal (MTTR)
4. Work done
Proposed solution
Incorporating fault tolerance in GA-based
scheduling in Grid environment
Genetic Algorithm
Initial Population
Fitness Evaluation
Representation of chromosome
[Figure: a chromosome drawn as a row of job and resource labels (jobs J1, J2, J3, J4, ..., Jn; resources R1 to R4).]
Crossover Operator
[Figure: Chromosome 1 and Chromosome 2, each over jobs J1 to J5 and resources R1 to R5, are cut at a crossover point and recombined into an offspring.]
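The crossover step can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this work: it assumes a chromosome is a list in which position j holds the resource assigned to job j, and it assumes single-point crossover. The function name `single_point_crossover` and the parent values are hypothetical.

```python
import random

def single_point_crossover(parent1, parent2, point=None, rng=random):
    """Return two offspring by swapping the tails of two parents.

    Assumed encoding: index j of a chromosome holds the resource
    assigned to job j.
    """
    if len(parent1) != len(parent2):
        raise ValueError("parents must schedule the same number of jobs")
    if len(parent1) < 2:
        raise ValueError("need at least two jobs to cross over")
    if point is None:
        # Cut strictly inside the chromosome so both parents contribute.
        point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Hypothetical parents: jobs J1..J5 mapped to resources R1..R5.
p1 = ["R3", "R4", "R1", "R1", "R5"]
p2 = ["R2", "R3", "R4", "R1", "R5"]
c1, c2 = single_point_crossover(p1, p2, point=2)
```

Because every position still holds exactly one resource, offspring produced this way remain valid schedules without a repair step.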
Mutation Operator
[Figure: a chromosome over jobs J1 to J5 and resources R1 to R5, shown before and after mutation.]
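A mutation step might look like the sketch below. The exact mutation rule is not recoverable from the slide, so this assumes the simplest variant: each job is reassigned to a different, randomly chosen resource with a small probability. The function name `mutate`, the `rate` parameter and the sample chromosome are hypothetical.

```python
import random

def mutate(chromosome, resources, rate=0.1, rng=random):
    """Return a copy in which each job is reassigned with probability `rate`.

    Assumed encoding: index j holds the resource assigned to job j.
    """
    mutated = list(chromosome)
    for job in range(len(mutated)):
        if rng.random() < rate:
            # Pick among the other resources so a mutation always changes the gene.
            choices = [r for r in resources if r != mutated[job]]
            if choices:
                mutated[job] = rng.choice(choices)
    return mutated

resources = ["R1", "R2", "R3", "R4", "R5"]
original = ["R1", "R4", "R2", "R5", "R1"]
result = mutate(original, resources, rate=1.0, rng=random.Random(0))
```

With `rate=1.0` every job moves, which is only useful for demonstration; a GA would normally use a small rate so that most of a good schedule survives.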
Fitness function
Flowtime =
is mean failure
[33] , [36]
If
Where C1 is current system time, LFn is last failure time of node n
and k is an integer 2.
Checkpointing with
downtimes
(9)
(10)
Parameters
R1
R2
R3
:
Rn
No. of Executions
max((FOHT[i][1] FOHT[i][0]),
) ,))
Until terminate_condition
ACO Phases
Other Techniques
Temporal Correlation
Grid Working
[Figure: interaction of the Grid User, Broker (GA or ACO), Schedule Advisor, Gridlet Dispatcher, Fault Manager (FOHT), Grid Information Service, Checkpoint Server and Grid Resources (Resource1, Resource2, Resource3; MIPS). The step labels recoverable from the figure are listed below.]
1. Deadline, budget, gridlets
2. Available resources information
3. Current load status
4. Available resource list
5. Optimized resource list
7. Fault value
8. Allocated resource
9. Submit Gridlet
10. Submit Checkpoints
12. Increment fault value
13. Get checkpoint
14. Reschedule from last checkpoint
15. Gridlet completion
16. Decrement fault value
17. Submit result
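The fault value bookkeeping in the figure (increment on a failure, decrement on gridlet completion) could be sketched as below. This is an illustration under stated assumptions, not the Fault Manager of this work: the class name `FaultManager`, its method names, the step size of 1 and the floor at zero are all hypothetical.

```python
class FaultManager:
    """Keeps one fault value per resource (hypothetical sketch)."""

    def __init__(self, resources):
        self.fault_value = {name: 0 for name in resources}

    def record_failure(self, resource):
        # A gridlet failed on this resource: increment its fault value.
        self.fault_value[resource] += 1

    def record_completion(self, resource):
        # A gridlet completed: decrement, assuming the value never goes below 0.
        self.fault_value[resource] = max(0, self.fault_value[resource] - 1)

    def least_faulty(self):
        # Resource with the lowest fault value (ties broken by name).
        return min(sorted(self.fault_value), key=self.fault_value.get)

fm = FaultManager(["Resource1", "Resource2", "Resource3"])
fm.record_failure("Resource1")
fm.record_failure("Resource1")
fm.record_failure("Resource2")
fm.record_completion("Resource2")
fm.record_completion("Resource3")
```

A scheduler or checkpointing policy can then read `fault_value` to treat resources with a history of failures more cautiously.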
Performance Comparison
Performance Metrics
a) Makespan: the maximum completion time over all resources, i.e. the time at which
all jobs have finished execution. The completion time of a resource is the point in time
at which all jobs allocated to that resource have completed execution.
b) Flowtime: the sum of the completion times of all resources.
d) Work lost due to failures: the unsaved work that is lost when jobs fail.
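The makespan and flowtime definitions above can be illustrated with a short sketch. It assumes jobs on one resource run back to back starting at time 0 and ignores failures and checkpoint overhead; the function names and the job run times are hypothetical.

```python
def completion_times(schedule, run_time):
    """Map each resource to the time at which its last job finishes.

    schedule: dict job -> resource; run_time: dict job -> seconds.
    Assumes jobs on a resource execute sequentially from time 0.
    """
    finish = {}
    for job, resource in schedule.items():
        finish[resource] = finish.get(resource, 0) + run_time[job]
    return finish

def makespan(finish):
    # Maximum completion time over all resources.
    return max(finish.values())

def flowtime(finish):
    # Sum of the completion times of all resources.
    return sum(finish.values())

# Hypothetical schedule of four jobs on two resources.
schedule = {"J1": "R1", "J2": "R1", "J3": "R2", "J4": "R2"}
run_time = {"J1": 2, "J2": 3, "J3": 1, "J4": 3}
finish = completion_times(schedule, run_time)
```

Here R1 finishes at 5 seconds and R2 at 4 seconds, so the makespan is 5 seconds and the flowtime is 9 seconds.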
Simulation Parameters
Parameter: Value
1. Number of Resources (Clusters): 5
2. Number of Processors per Cluster: 64
3. Number of jobs: 200
4. Computation time per job: 48 hours
5. Checkpoint Overhead: 720 seconds [19]
6. Size (Number of processors) of job: 64
7. Checkpoint Interval: 1000 to 10000 seconds [19]
8. MTBF of Resources: 5 hours to 18 hours
9. Failure Distribution: Weibull (shape parameter 0.7, 1, 1.5) [19]
2
.9
Number of jobs (200)
Number of resources (5)
Weibull Distribution
[Figure: Weibull probability density function for various shape parameters.]
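The Weibull density and a way of drawing failure times from it can be sketched as follows. This is an illustration only: it assumes that the MTBF is used as the mean of the distribution, so the scale is derived as MTBF / Gamma(1 + 1/shape). The function names are hypothetical; `random.weibullvariate(alpha, beta)` is the Python standard library call, with `alpha` the scale and `beta` the shape.

```python
import math
import random

def weibull_pdf(t, shape, scale):
    """Weibull probability density at time t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def scale_for_mtbf(mtbf, shape):
    # Mean of a Weibull distribution is scale * Gamma(1 + 1/shape).
    return mtbf / math.gamma(1.0 + 1.0 / shape)

def sample_time_to_failure(mtbf, shape, rng=random):
    # Draw one time between failures (same unit as mtbf).
    return rng.weibullvariate(scale_for_mtbf(mtbf, shape), shape)

mtbf_seconds = 5 * 3600  # lower end of the MTBF range in the parameter table
rng = random.Random(42)
samples = {k: sample_time_to_failure(mtbf_seconds, k, rng) for k in (0.7, 1.0, 1.5)}
```

A shape parameter below 1 gives a decreasing hazard rate, 1 reduces to the exponential distribution, and a value above 1 gives an increasing hazard rate.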
[Figures: work lost due to failures (seconds), makespan (seconds), flowtime (seconds), number of checkpoints, utilization and average bounded slowdown (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -2.6% for makespan, -2.2% for flowtime, -8% for average bounded slowdown, +2% for utilization, +5% for work lost due to failures, and -9.1% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization, average bounded slowdown and number of checkpoints for GA based adaptive checkpointing using MTBF versus GA based periodic checkpointing.]
[Figures: makespan (seconds), flowtime (seconds), checkpoints taken, utilization and average bounded slowdown (seconds) against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip (labelled Random_Backoff_Skip in some panels).]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -3.13% for makespan, -13.43% for flowtime, -13.41% for average bounded slowdown, +2.65% for utilization, -22.51% for work lost due to failures, and -7.21% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization, average bounded slowdown and number of checkpoints for GA based adaptive checkpointing using fault indexes versus GA based periodic checkpointing.]
[Figures: makespan (seconds), work lost, flowtime, number of checkpoints, average bounded slowdown, utilization and turnaround time per iteration for adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow traces. Some adaptive series are plotted scaled down by 10, 100 or 1000.]
[Figures: radar charts of makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints for adaptive versus periodic checkpointing.]
[Figures: makespan, work lost, number of checkpoints taken, flowtime, average bounded slowdown, utilization and turnaround time per iteration for adaptive and periodic checkpointing on the overnet, skype, ucb, Notre and Glow traces. Some adaptive series are plotted scaled down by 10, 100 or 1000.]
Trace 2
Values for adaptive checkpointing
relative to periodic checkpointing for
trace 2 are -2.5% for makespan, -2%
for flowtime, -15.7% for average
bounded slowdown, +2.4% for
utilization, +14% for work lost due to
failures, -17.5% for number of
checkpoints taken and -1.86% for
average turnaround time.
[Figures: radar charts of makespan, flowtime, utilization, turnaround time, average bounded slowdown, work lost and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 4
[Figures: radar charts of makespan, flowtime, utilization, turnaround time, average bounded slowdown, work lost and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 5
Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -5.4% for makespan, -5.49% for flowtime, -33.14% for average bounded slowdown, +5.7% for utilization, +86.7% for work lost due to failures, -33.14% for number of checkpoints taken and -5.5% for average turnaround time.
[Figure: radar chart of makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints for adaptive versus periodic checkpointing.]
Performance Evaluation:
MTBF and Last Failure Based Adaptive Checkpointing for Temporally
Correlated Failures
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds), utilization, completion time and number of faults per iteration for adaptive and periodic checkpointing with w = 2 and w = 4. The number of checkpoints, completion time and fault occurrences are shown separately for resource 1 and resource 2.]
[Figures: makespan (seconds), work lost (seconds), flowtime, number of checkpoints and average bounded slowdown for Periodic Skip versus Adaptive Periodic Skip; the horizontal axis runs from 150 to 400.]
[Figures: makespan (seconds), work lost, flowtime (seconds), checkpoints taken, utilization, average bounded slowdown (seconds), number of failures and completion time per iteration for Adaptive periodic skip versus Periodic skip. Failures and completion time are shown for resource 1 and resource 2.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -6.42% for makespan, -2.25% for flowtime, -10.1% for average bounded slowdown, +5.36% for utilization, -11.425% for work lost due to failures, and -0.79% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization, work lost and number of checkpoints for Adaptive periodic skip versus Periodic skip.]
[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints, utilization and average bounded slowdown (seconds) against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing (with scheduling assisted fault tolerance) and Periodic Checkpointing.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -8.7% for makespan, -4.22% for flowtime, -7.8% for average bounded slowdown, +6.569% for utilization, -3.24% for work lost due to failures, and -10% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization and number of checkpoints for Ant Colony based adaptive checkpointing using MTBF versus Ant Colony based periodic checkpointing.]
[Figures: makespan (seconds), work lost, flowtime, number of checkpoints, utilization and average bounded slowdown (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Random_Backoff_Skip.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -6.95% for makespan, -1.55% for flowtime, -7.7% for average bounded slowdown, +7.17% for utilization, -28.2% for work lost due to failures, and +30% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization, average bounded slowdown and number of checkpoints for Ant Colony based adaptive checkpointing using fault indexes versus Ant Colony based periodic checkpointing.]
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization per iteration for GA based adaptive checkpointing versus GA based periodic checkpointing.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -3.5% for makespan, -3.1% for flowtime, -17.84% for average bounded slowdown, +1.395% for utilization, -24.84% for work lost due to failures, and -5.76% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization and number of checkpoints.]
[Figures: makespan (seconds), work lost, flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization per iteration for Adaptive Ant Colony Algorithm versus Ant Colony Algorithm.]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -2.3% for makespan, -3.8% for flowtime, -20.5% for average bounded slowdown, +0.97% for utilization, -25.3% for work lost due to failures, and -7.52% for number of checkpoints taken.
[Figure: radar chart of makespan, flowtime, utilization and number of checkpoints for Ant Colony based adaptive checkpointing using fault indexes.]
Trace 2
Values of adaptive checkpointing
technique relative to periodic
checkpointing are 0% for makespan,
0% for flowtime, -15% for average
bounded slowdown, 31% for work lost
due to failures, -40.5% for number of
checkpoints taken and -1.3% for
average turnaround time.
[Figures: radar charts of makespan, flowtime, average bounded slowdown, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 3
Trace 4
[Figures: radar charts of makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 5
Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -13.6% for average bounded slowdown, -7.2% for work lost due to failures, -26.8% for number of checkpoints taken and -1.85% for average turnaround time.
[Figure: radar chart of makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 2
[Figures: radar charts of makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 4
[Figures: radar charts of makespan, flowtime, average bounded slowdown, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Trace 5
Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, +9% for flowtime, -31.5% for average bounded slowdown, -38% for work lost due to failures, -43% for number of checkpoints taken and -6% for average turnaround time.
[Figure: radar chart of makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive versus periodic checkpointing.]
Conclusions
Future Work
The workload and failure traces used in this work are small portions of the
available traces. Future work will focus on using the complete traces for evaluation.
Experiments have been performed for workload and failure traces separately.
Future work will use both a workload trace and a failure trace in the same experiment.
Downtime (MTTR) of resources is ignored in this work, and a resource is assumed
to recover immediately from failure. This assumption will be removed in future work.
The checkpointing technique used restarts a job on the same resource after a
failure. An alternative is checkpointing with migration, where the job is
restarted on a different resource. This technique, along with its issues
such as spare node allocation, remains to be investigated.
The heuristics developed for fault tolerance are not restricted to metaheuristics;
they can be incorporated into any scheduling algorithm. Future work will
look into that.
This work considers only transient faults on resources. Other fault classes are
not considered.
Finally, future work will focus on working in an actual Grid setup rather than
a simulated one.
7. Publications
Heuristic (Wikipedia)
Heuristic (or heuristics; from Greek, "find" or "discover")
refers to experience-based techniques
for problem solving, learning, and
discovery. Where an exhaustive search is
impractical, heuristic methods are used
to speed up the process of finding a
satisfactory solution.
Meta-heuristic (Wikipedia)
In computer science, metaheuristic designates a
computational method that optimizes a
problem by iteratively trying to improve
a candidate solution with regard to a
given measure of quality.
ACO vs GA
[Figure: Chromosome 1 over jobs J1 to J4 and resources R1 and R2, annotated with J1: 2 sec, J2: 3 sec, C1 = 5 sec and C2 = 4 sec.]
[Figure: Chromosome 2 with J1 on R1, J2 on R2, J3 on R1 and J4 on R2.]
References
[1] Townend, P. and Xu, J. 2003. Fault tolerance within a grid environment. As component of eDemand project at the
University of Durham, United Kingdom.
[2] Foster I., Kesselman C., The Grid: Blueprint for a New Computing Infrastructure, The Elsevier Series in
Grid Computing.
[3] Foster, I. 2001. The anatomy of the grid: enabling scalable virtual organizations. In Proceedings of the First
IEEE/ACM International Symposium on Cluster Computing and the Grid 2001 (Brisbane, Australia May 15-18,
2001). CCGRID '01. IEEE Computer Society, Washington, DC, USA, 6-7.
[4] Foster I. What is the Grid? A three point checklist, Argonne National Laboratory, fp.mcs.anl.gov/~foster/
Articles/WhatIsTheGrid.pdf, 2002.
[5] Avizienis, A.; Laprie, J.-C.; Randell, B.; Landwehr, C.; , "Basic concepts and taxonomy of dependable and secure
computing,"IEEE Transactions on Dependable and Secure Computing, vol.1, no.1, pp. 11- 33, Jan.-March 2004.
[6] Huda, M.T.; Schmidt, H.W.; Peake, I.D., "An agent oriented proactive fault-tolerant framework for grid
computing,"First International Conference on e-Science and Grid Computing, 2005, pp. 8-15, July 2005.
[7] Hofer, J.; Fahringer, T., "A Multi-Perspective Taxonomy for Systematic Classification of Grid Faults," 16th
Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008. PDP 2008, pp. 126-130, 13-15 Feb. 2008.
[8] Jafar, S.; Krings, A.; Gautier, T.; , "Flexible Rollback Recovery in Dynamic Heterogeneous Grid
Computing,"Dependable and Secure Computing, IEEE Transactions on, vol.6, no.1, pp.32-44, Jan.-March 2009
[9] Avizienis, A., "The N-Version Approach to Fault-Tolerant Software,"IEEE Transactions on Software Engineering,
vol. SE-11, no.12, pp. 1491- 1501, Dec. 1985.
[10] Schroeder B. and G. A. Gibson (2006). A large-scale study of failures in high-performance
Computing systems, International Conference on Dependable Systems and Networks, DSN 2006.
[11] Hayashibara N, Cherif A, Katayama T. Failure detectors for large-scale distributed systems,
Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems. IEEE Computer Society Press:
Los Alamitos, CA, 2002, 404-409 October 2002.
[12] Elnozahy E, Johnson D, Wang Y. A survey of rollback recovery protocols in message-passing systems,
ACM Computing Surveys 2002, 34(3), 375-408.
[13] Alvisi L. and K. Marzullo (1998). Message logging: pessimistic, optimistic, causal, and optimal, IEEE
Transactions on Software Engineering, 24(2), 149-159.
[14] H-C Nam, J. Kim, SJ. Hong and S. Lee. Probabilistic checkpointing, In Proceedings of the Twenty
Seventh International Symposium on Fault-Tolerant Computing (FTCS-27), pp. 48-57, June 1997.
[15] Gabriel Rodríguez, Xoán C. Pardo, María J. Martín, Patricia González, Performance evaluation of an
application-level checkpointing solution on grids, Future Generation Computer Systems, Volume 26, Issue
7, July 2010, Pages 1012-1023, ISSN 0167-739X, 10.1016/j.future.2010.04.016.
[16] Oliner, A.J.; Sahoo, R.K.; Moreira, J.E.; Gupta, M.; , "Performance implications of periodic checkpointing
on large-scale cluster systems,"Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th
IEEE International, vol., no., pp. 8 pp., 4-8 April 2005
[17] Plank, J.S.; Elwasif, W.R.; , "Experimental assessment of workstation failures and their impact on
checkpointing systems,"Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual
International Symposium on, vol., no., pp.48-57, 23-25 Jun 1998.
[18] Adam J. Oliner, Larry Rudolph, and Ramendra K. Sahoo. 2006. Cooperative checkpointing: a robust
approach to large-scale systems reliability. In Proceedings of the 20th annual international conference on
Supercomputing (ICS '06). ACM, New York, NY, USA, 14-23.
[19] Oliner, A.; Sahoo, R.; , "Evaluating cooperative checkpointing for supercomputing systems,"Parallel and
Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, vol., no., pp.8 pp., 25-29 April 2006.
[20] Oliner, A.; Rudolph, L.; Sahoo, R.; , "Cooperative checkpointing theory,"Parallel and Distributed Processing
Symposium, 2006. IPDPS 2006. 20th International, vol., no., pp.10 pp., 25-29 April 2006.
[21] Zhiling Lan; Yawei Li; , "Adaptive Fault Management of Parallel Applications for High-Performance
Computing,"Computers, IEEE Transactions on, vol.57, no.12, pp.1647-1660, Dec. 2008.
[22] Chtepen M., F. H. A. Claeys, et al. (2009). Adaptive Task Checkpointing and Replication: Toward Efficient
Fault-Tolerant Grids, IEEE Transactions on Parallel and Distributed Systems, vol.20, no.2, pp.180-190, Feb.
2009.
[23] Nazir B., K. Qureshi, et al. (2009). "Adaptive checkpointing strategy to tolerate faults in economy based
grid," The Journal of Supercomputing, 50(1), 1-18, 2009.
[24] Antonios Litke, Konstantinos Tserpes, Konstantinos Dolkas, and Theodora Varvarigou. 2005. A task
replication and fair resource management scheme for fault tolerant grids. In Proceedings of the 2005 European
conference on Advances in Grid Computing (EGC'05), Peter A. Sloot, Alfons G. Hoekstra, Thierry Priol,
Alexander Reinefeld, and Marian Bubak (Eds.). Springer-Verlag, Berlin, Heidelberg, 1022-1031.
[25] Qin Z., B. Veeravalli, et al. (2009). On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs, IEEE Transactions on Computers, vol.
58, no.3, pp.380-393, March 2009.
[26] Hwang S., and Kesselman C., A flexible framework for fault tolerance in the grid, Journal of Grid
Computing, vol. 1, no. 3, pp. 251-272, 2003.
[27] Lopes R. F. and F. J. da Silva e Silva (2006). Fault tolerance in a mobile agent based computational grid,
Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops, 2006, vol. 2, 16-19 May
2006.
[28] Kandaswamy G., A. Mandal, et al. (2008). Fault Tolerance and Recovery of Scientific Workflows on
Computational Grids, 8th IEEE International Symposium on Cluster Computing and the Grid, 2008. CCGRID08,
pp.777-782, 19-22 May 2008.
[29] Yang Z., A. Mandal, et al. (2009). Combined Fault Tolerance and Scheduling Techniques for Workflow
Applications on Computational Grids, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
CCGRID09, pp.244-251, 18-21 May 2009.
[30] SungJin C., B. MaengSoon, et al. (2004). Volunteer availability based fault tolerant scheduling mechanism in
desktop grid computing environment, Proceedings of the third IEEE International Symposium on Network
Computing and Applications, 2004, (NCA 2004), pp. 366-371, 30 Aug.-1 Sept. 2004.
[31] Hou E. S. H., N. Ansari, et al. (1994). A genetic algorithm for multiprocessor scheduling, IEEE Transactions on
Parallel and Distributed Systems, vol.5, no.2, pp.113-120, Feb 1994.
[32] Song, S., Hwang, K., and Kwok, K. 2006. Risk-resilient heuristics and genetic algorithms for security-assured
grid job scheduling. IEEE Transactions on Computers 55, 6 (June 2006), 703-719.
[33] Khanli, L. M., Far, M. E., and Rahmani, A. M. 2010. RFOH: A New Fault Tolerant Job Scheduler in Grid
Computing. In Proceedings of the Second International Conference on Computer Engineering and
Applications (Bali Island, Indonesia, March 19-21, 2010). ICCEA '10. IEEE Computer Society, Washington, DC,
USA, 422-425.
[34] Priya, S.B., Prakash, M., Dhawan, K.K. 2007. Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using
Checkpoint. In Proceedings of the Sixth International Conference on Grid and Cooperative Computing (Los
Alamitos, CA, Aug. 16-18, 2007). GCC '07. 676-680.
[35] Abdulal, W., and Ramachandram, S. 2011. Reliability-Aware Genetic Scheduling Algorithm in Grid Environment.
In Proceedings of the International Conference on Communication Systems and Network Technologies (Katra,
Jammu, India, June 03-05). 673-677.
[36] Wu, C., Lai K., and Sun R. 2008. GA-Based Job Scheduling Strategies for Fault Tolerant Grid Systems. In
Proceedings of the Asia-Pacific Conference on Services Computing (Dec. 09-12, 2008). IEEE, 27-32.
[37] Dorigo, M.; Gambardella, L.M., "Ant colony system: a cooperative learning approach to the traveling
salesman problem," Evolutionary Computation, IEEE Transactions on, vol. 1, no. 1, pp. 53-66, Apr 1997.
[38] Zhihong Xu; Xiangdan Hou; Jizhou Sun, "Ant algorithm-based task scheduling in grid
computing," Electrical and Computer Engineering, 2003. IEEE CCECE 2003. Canadian Conference on,
vol. 2, pp. 1107-1110, 4-7 May 2003.
[39] Hui Yan; Xue-Qin Shen; Xing Li; Ming-Hui Wu, "An improved ant algorithm for job scheduling in grid
computing," Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on,
vol. 5, pp. 2957-2961, 18-21 Aug. 2005.
[40] Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, and Ramendra K. Sahoo. 2004.
Performance implications of failures in large-scale cluster scheduling. In Proceedings of the 10th
international conference on Job Scheduling Strategies for Parallel Processing (JSSPP'04), Dror G.
Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer-Verlag, Berlin, Heidelberg, 233-252.
[41] Dalibor Klusáček and Hana Rudová. 2010. Alea 2: job scheduling simulator. In Proceedings of the 3rd
International ICST Conference on Simulation Tools and Techniques (SIMUTools '10). ICST (Institute for
Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium.
[42] S. Lorpunmanae, Mohd Sap, A.H. Abdullah and C. C. Inwai, An Ant Colony Optimization for Dynamic
Job Scheduling in Grid Environment, International Journal of Computer and Information Science and
Engineering, 2007.
[43] Jing Hu, Mingchu Li, Weifeng Sun, Yuanfang Chen, "An Ant Colony Optimization for Grid Task
Scheduling with Multiple QoS Dimensions," 2009 Eighth International Conference on Grid and Cooperative
Computing, pp. 415-419, 2009.
[44] Ruay-Shiung Chang, Jih-Sheng Chang, Po-Sheng Lin, An ant algorithm for balanced job scheduling in
grids, Future Generation Computer Systems, Volume 25, Issue 1, January 2009, Pages 20-27, ISSN 0167-739X, 10.1016/j.future.2008.06.004.
[45] Wei-Neng Chen; Jun Zhang, "An Ant Colony Optimization Approach to a Grid Workflow Scheduling
Problem With Various QoS Requirements," Systems, Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, vol. 39, no. 1, pp. 29-43, Jan. 2009.
[46] Gosia Wrzesinska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal (2005). Fault-Tolerance,
Malleability and Migration for Divide-and-Conquer Applications on the Grid, Proceedings of the 19th IEEE
International Symposium on Parallel and Distributed Processing, 2005, pp. 13a, 04-08 April 2005.
[47] Buyya R, Murshed M (2002) GridSim: a toolkit for the modeling and simulation of distributed resource
management and scheduling for grid computing. Concurr Comput Pract Exp (CCPE) 14(13-15):1175-1220.
[48] Caminero, A.; Sulistio, A.; Caminero, B.; Carrion, C.; Buyya, R., "Extending GridSim with an
architecture for failure detection," International Conference on Parallel and Distributed Systems, 2007,
vol. 2, pp. 1-8, 5-7 Dec. 2007. doi: 10.1109/ICPADS.2007.4447756.
[49] Failure Trace Archive [Online]. http://fta.inria.fr/apache2-default/pmwiki/index.php.
[50] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. 2010. The Failure Trace Archive:
Enabling Comparative Analysis of Failures in Diverse Distributed Systems. In Proceedings of the 2010
10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10). IEEE
Computer Society, Washington, DC, USA, 398-407.
[51] Parallel Workload Archive [Online]. http://www.cs.huji.ac.il/labs/parallel/workload/.
[52] Grid Workload Archive [Online]. http://gwa.ewi.tudelft.nl/pmwiki/.
[53] MATLAB R2010a [Online]. http://www.mathworks.in/help/techdoc/rn/br_03sl.html