Fault Tolerance in Grid

FAULT TOLERANCE IN GRID ENVIRONMENT
Neeraj Upadhyay, M.Tech 2nd year
1. Introduction and Motivation
Grid computing is defined as [3] coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.
Differences from traditional distributed systems

Scale
Heterogeneity
Dynamicity
The above attributes affect reliability:
1. Scale - reliability of a system = product of reliabilities of its components.
2. Heterogeneity - interaction faults.
3. Dynamicity - delay and loss of jobs.
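The product rule above explains why scale hurts reliability. The following is a minimal sketch, assuming every component must work for the system to work; the per-node reliability of 0.999 is a hypothetical value chosen only for illustration.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Reliability of a system in which every component must work."""
    return prod(component_reliabilities)

# Hypothetical example: each node is 99.9% reliable on its own.
for n in (10, 100, 1000):
    print(n, "nodes ->", round(series_reliability([0.999] * n), 3))
```

Even with very reliable individual nodes, the product shrinks quickly as the number of nodes grows.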
Fault Tolerance
Fault tolerance means preserving the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service.
Problem Statement
Design and performance study of adaptive checkpointing based fault tolerance techniques in the Grid environment.
Extending meta-heuristic algorithms such as Genetic Algorithm and Ant Colony Optimization with support for a fault tolerance technique: adaptive checkpointing.
Inventing adaptive approaches for fault tolerance in computational grids by suitably modifying traditional approaches, such as checkpointing, to take into account various characteristics of the Grid environment.
Why adaptive checkpointing for Grid?
Grid by its definition [3] is a highly dynamic distributed environment.
Dynamic
a) Dynamically varying resource conditions such as fault occurrences.
b) Faults are more likely to occur during one time frame than during others, i.e. faults are temporally correlated [10]. For example, faults are more likely during weekdays, when the workload is high, than at weekends. Faults are also more likely to occur during the daytime.
c) Faults are spatially correlated [10].
The performance of a checkpointing technique depends on the size of the checkpointing interval.
If the interval is very long, a large amount of work is lost due to failures.
If it is very short, the cumulative overhead of the checkpoint operations will be very high.
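The trade-off between lost work and checkpoint overhead can be illustrated with a well-known first-order model (Young's approximation). This model is not taken from this work; it is an assumption used only to show why neither a very long nor a very short interval is good. The values 720 seconds and 5 hours are the checkpoint overhead and the lower MTBF bound listed under Simulation Parameters.

```python
import math

def waste_fraction(interval, overhead, mtbf):
    """First-order fraction of time wasted: checkpoint cost plus expected rework."""
    return overhead / interval + interval / (2.0 * mtbf)

def young_interval(overhead, mtbf):
    """Interval that minimises waste_fraction (Young's approximation)."""
    return math.sqrt(2.0 * overhead * mtbf)

overhead = 720.0       # seconds per checkpoint
mtbf = 5 * 3600.0      # seconds between failures on average
for interval in (1000.0, young_interval(overhead, mtbf), 10000.0):
    waste = waste_fraction(interval, overhead, mtbf)
    print(f"interval {interval:8.0f} s -> wasted fraction {waste:.3f}")
```

An adaptive scheme effectively re-evaluates this balance as the observed failure behaviour of a resource changes.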
Why GA?
Simple structure for mapping to the scheduling problem.
The GA used is based on the Global Optimization Toolbox of MATLAB [53]. It tends to converge to a good solution quickly.
Why ACO?
Performance of ACO compared to other metaheuristics [37]
Remark
The heuristics for adaptive checkpointing developed in this work are not restricted to the metaheuristic employed. They can be used with any scheduling technique. Our main aim is to show how maintaining information about the failure conditions of resources can be used in adapting the checkpointing interval to improve performance.
2. Fault tolerance in grid environment
Fault tolerance techniques

Checkpointing
Replication
Workflow-level fault tolerance techniques
Mobile agent based fault tolerance
Fault-tolerant scheduling
Application model specific fault tolerance techniques
3. Research gaps
Scheduling support for the adaptive checkpointing approach using GA.
Tackling the problem of the autonomous nature of Grid resources.
Consideration of Mean Time To Repair (MTTR) in meta-heuristic based scheduling.
Support for fault tolerance in Ant Colony Optimization based scheduling.
Spatially and temporally correlated faults.
Weibull and lognormal distributions for MTTF and MTTR respectively.
[Figure: Weibull (MTBF) and lognormal (MTTR) distributions [10]]
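Failure and repair times with these distributions can be drawn with the Python standard library. The following is a minimal sketch; the helper names and the lognormal parameters passed in the usage example are hypothetical and not taken from this work.

```python
import math
import random

def weibull_scale_for_mean(mean, shape):
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)

def sample_failure_repair(rng, mtbf, shape, repair_mu, repair_sigma, n):
    """Draw n (time-to-failure, time-to-repair) pairs.

    Time to failure is Weibull with mean `mtbf`; time to repair is lognormal
    with parameters `repair_mu` and `repair_sigma` of the underlying normal.
    """
    scale = weibull_scale_for_mean(mtbf, shape)
    return [(rng.weibullvariate(scale, shape),
             rng.lognormvariate(repair_mu, repair_sigma))
            for _ in range(n)]

# Usage with hypothetical repair parameters: MTBF of 5 hours, shape 0.7.
rng = random.Random(42)
pairs = sample_failure_repair(rng, 5 * 3600.0, 0.7, 6.0, 1.0, 5)
```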
4. Work done
Proposed solution
Incorporating fault tolerance in GA-based
scheduling in Grid environment
Genetic Algorithm
Initial Population
Fitness Evaluation
Representation of chromosome
[Figure: a chromosome as a sequence of job-to-resource assignments (J1, R1), (J2, R2), (J3, R3), (J4, R4), ..., Jn]
Crossover Operator
[Figure: Chromosome 1 and Chromosome 2, each assigning jobs J1 to J5 to resources R1 to R5, are combined at a crossover point to produce an offspring]
Mutation Operator
[Figure: mutation of a chromosome that assigns jobs J1 to J5 to resources, shown before and after]
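The job-to-resource chromosome and the two operators can be sketched as follows. This is an illustrative sketch only: single-point crossover and per-gene random reassignment are assumptions, and the function names and the mutation rate are hypothetical.

```python
import random

def random_chromosome(rng, n_jobs, n_resources):
    """Gene i holds the index of the resource assigned to job i."""
    return [rng.randrange(n_resources) for _ in range(n_jobs)]

def single_point_crossover(rng, parent_a, parent_b):
    """Copy parent_a up to a random crossover point, then parent_b."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(rng, chromosome, n_resources, rate=0.1):
    """Reassign each job to a random resource with probability `rate`."""
    return [rng.randrange(n_resources) if rng.random() < rate else gene
            for gene in chromosome]

rng = random.Random(0)
parent_a = random_chromosome(rng, 5, 5)
parent_b = random_chromosome(rng, 5, 5)
child = mutate(rng, single_point_crossover(rng, parent_a, parent_b), 5)
```

Because every gene is always a valid resource index, both operators preserve feasibility of the schedule without a repair step.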
Fitness function
$\text{Flowtime} = \sum_{r=1}^{m} C_r$
$C_r$ is the completion time of the jobs allocated to resource r.
m is the total number of resources in the Grid.
Adaptive Checkpointing based fitness functions
Mean Failure Time Based Checkpointing
The quantities used are:
the task size in Million Instructions (MI)
the resource speed in Millions of Instructions per Second (MIPS)
the execution time of job i in node n
the mean failure time of node n
Some existing approaches [33], [36]
Last Failure Time Based Checkpointing and Checkpointing without Migration
If
Where C1 is current system time, LFn is last failure time of node n
and k is an integer 2.
Resource Provider Autonomy Based Scheduling
[7] presents volunteer autonomy failures.
Time of resource (a volunteer) registration, maintained in the Grid Information Service.
Mean time for which a resource remains in the grid (stay time).
Checkpointing with downtimes
(9)
Assumption: No work lost due to failures
Work lost taken into account
(10)
Fault index and Fault ratio based Adaptive Checkpointing based fitness function
1. Fault ratio based adaptive checkpointing
Parameters
One parameter limits the increase in the checkpointing interval and another limits the decrease. They are taken as 1 and 0.5 in the experiments to show the applicability of the approach, so the checkpoint interval can vary over (0.5 * check_interval, 2 * check_interval).
Fault Occurrence History Table (FOHT)
Resource | No. of faults | No. of Executions
R1 | |
R2 | |
R3 | |
: | |
Rn | |
Fault Index based adaptive Checkpointing
One parameter limits the increase in the checkpointing interval and another limits the decrease.
max((FOHT[i][1] FOHT[i][0]),
) ,))
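The idea of keeping a fault occurrence history and using it to stretch or shrink the checkpoint interval can be sketched as follows. This is not the formula used in this work: the scaling rule `reference_ratio / ratio`, the class and function names, and the column order (faults first, executions second) are assumptions. Only the bounds of 0.5 and 2 times the base interval come from the text.

```python
class FaultHistory:
    """Per-resource history table holding [faults, executions] (assumed order)."""

    def __init__(self, n_resources):
        self.table = [[0, 0] for _ in range(n_resources)]

    def record(self, resource, failed):
        """Record one execution on `resource` and whether it failed."""
        self.table[resource][1] += 1
        if failed:
            self.table[resource][0] += 1

    def fault_ratio(self, resource):
        faults, executions = self.table[resource]
        return faults / executions if executions else 0.0

def adapted_interval(base_interval, ratio, reference_ratio, low=0.5, high=2.0):
    """Hypothetical rule: scale by reference_ratio / ratio, clamped to [low, high]."""
    if ratio <= 0.0:
        return high * base_interval  # no recorded faults: longest allowed interval
    factor = reference_ratio / ratio
    return base_interval * min(high, max(low, factor))
```

A resource that fails more often than the reference gets a shorter interval, and a resource that fails less often gets a longer one, within the stated bounds.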
Ant Colony Optimization
Initialize all parameters
Loop /* outer loop: one iteration of ACO */
    Each ant chooses a random sequence of tasks
    Loop /* inner loop: one construction step */
        Each ant incrementally builds a solution by applying the state transition rule and a local pheromone updating rule
    Until all ants have completed building a solution
    Apply global pheromone updating rule
Until terminate_condition
ACO Phases
Pseudorandom state transition rule:
R =
Local pheromone update rule:
pheromone(r,j) = (1 - \rho) * pheromone(r,j) + \rho * initial_pheromone
\rho is set to 0.1, \beta = 1.2, q0 = 0.9 [45]
ACO Phases Continued
Global pheromone update rule:
pheromone(r,j) = (1 - \rho) * pheromone(r,j) + \rho * score
score = 1 + minimum_makespan / makespan
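The three rules can be sketched in code as follows. The state transition shown is the standard pseudorandom proportional rule of Ant Colony System, which is assumed here because the formula is not legible in the slide. The parameter names `rho` and `beta`, the `heuristic` matrix and the function names are likewise assumptions; the values 0.1, 1.2 and 0.9 and the score formula come from the text.

```python
import random

def choose_resource(rng, pheromone, heuristic, job, beta=1.2, q0=0.9):
    """With probability q0 pick the best-scoring resource, else sample by score."""
    scores = [pheromone[r][job] * heuristic[r][job] ** beta
              for r in range(len(pheromone))]
    if rng.random() < q0:
        return max(range(len(scores)), key=scores.__getitem__)
    return rng.choices(range(len(scores)), weights=scores)[0]

def local_update(pheromone, r, job, initial_pheromone, rho=0.1):
    """Pull the trail on (r, job) back toward its initial value."""
    pheromone[r][job] = (1 - rho) * pheromone[r][job] + rho * initial_pheromone

def global_update(pheromone, r, job, makespan, minimum_makespan, rho=0.1):
    """Reinforce (r, job) according to the quality of the schedule found."""
    score = 1 + minimum_makespan / makespan
    pheromone[r][job] = (1 - rho) * pheromone[r][job] + rho * score
```

The local rule keeps ants from all following the same assignment within one iteration, while the global rule rewards assignments that belong to schedules with a makespan close to the best seen.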
Fault Index based periodic Skip
FI(i): Fault Index of resource i
FI1, FI2, FI3, ..., FIN: fault index values such that FI1 < FI2 < FI3 < ... < FIN
D1, D2, D3, ..., DN: skip parameters that determine the intensity of skip, such that D1 < D2 < D3 < ... < DN
If (FI(i) > FIN) then
    Perform all checkpoints
If (FIN > FI(i) > FIN-1)
    Use D1 as skip parameter
...
If (FI1 > FI(i))
    Use DN as skip parameter
Exit
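The band lookup above can be sketched with a binary search. The function name is hypothetical, and the handling of a fault index exactly equal to a threshold is an assumption, since the pseudocode uses strict inequalities only.

```python
from bisect import bisect_left

def skip_parameter(fault_index, thresholds, skips):
    """Return 0 (perform all checkpoints) or the skip parameter for this band.

    thresholds: FI1 < FI2 < ... < FIN
    skips:      D1  < D2  < ... < DN
    """
    n = len(thresholds)
    below = bisect_left(thresholds, fault_index)  # thresholds strictly below FI(i)
    if below == n:
        return 0                   # FI(i) > FIN: no checkpoint is skipped
    return skips[n - below - 1]    # one band lower means one step more skipping

# Usage with hypothetical thresholds and skip parameters.
print(skip_parameter(0.5, [0.2, 0.4, 0.6], [1, 2, 3]))
```

Resources with a high fault index keep all their checkpoints, while resources with a low fault index skip the most.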
Other Techniques
Fault index based exponential backoff skip
Fault ratio based periodic and exponential skip
Temporal Correlation
i) If a node has not failed for a long time, then there is a lower probability that it will fail in the near future.
ii) If a node has failed recently, then there is a high probability that it will fail in the near future.
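MTBF and Last Failure Based Adaptive Checkpointing for Temporally Correlated Failures
If the time since the last failure, (C1 - Lf), exceeds a lower multiple of the MTBF:
AI = AI + I for each checkpoint request while (C1 - Lf) lies in the interval between a lower and an upper multiple of the MTBF, where the upper multiplier is greater than the lower one and both are > 1. (17)
If (C1 - Lf) is less than an upper multiple of the MTBF:
AI = AI - I for each checkpoint request while (C1 - Lf) lies in the interval between a lower and an upper multiple of the MTBF, where the upper multiplier is greater than the lower one and both are < 1.
I is the initial periodic checkpoint interval and AI is the adapted checkpoint interval. C1 is the current time and Lf is the last failure time of the resource.
To show the applicability of the approach, the multipliers are taken as 2 and 2.5 for the increase case and 0.25 and 0.5 for the decrease case in our experiments. A higher value of the increase multiplier leaves less opportunity to apply the technique, and a lower value decreases performance because the time since the last failure is then very small. Similar reasoning applies to the decrease multipliers.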
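The adaptation rule can be sketched as a function that is called on every checkpoint request. The multiplier values 2, 2.5, 0.25 and 0.5 come from the text; the function and parameter names, and the lower bound `minimum` that keeps the interval positive, are assumptions.

```python
def adapt_interval(ai, i, c1, lf, mtbf,
                   grow=(2.0, 2.5), shrink=(0.25, 0.5), minimum=None):
    """Return the adapted interval AI after one checkpoint request.

    ai: current adapted interval, i: initial periodic interval,
    c1: current time, lf: last failure time of the resource.
    """
    minimum = i if minimum is None else minimum  # hypothetical floor
    elapsed = c1 - lf
    if grow[0] * mtbf < elapsed < grow[1] * mtbf:
        return ai + i                  # long time without failure: checkpoint less often
    if shrink[0] * mtbf < elapsed < shrink[1] * mtbf:
        return max(minimum, ai - i)    # recent failure: checkpoint more often
    return ai

# Usage: 2200 s since the last failure on a resource with an MTBF of 1000 s.
print(adapt_interval(ai=1000, i=1000, c1=2200, lf=0, mtbf=1000))
```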
Grid Working + Adaptive Checkpointing
[Figure: Grid working with adaptive checkpointing. Components: Grid User, Broker, Schedule Advisor (GA or ACO), Gridlet Dispatcher, Gridlet Receptor, Grid Information Service, Fault Manager (FOHT), Checkpoint Server, Resource1, Resource2, Resource3. Numbered interactions:]
1. Deadline, budget, gridlets
2. Available resources information
3. Current load status
4. Available resource list
5. Optimized resource list
6. Get resource fault info
7. Fault value
8. Allocated resource
9. Submit Gridlet
10. Submit checkpoints
11. Gridlet failure
12. Increment fault value
13. Get checkpoint
14. Reschedule from last checkpoint
15. Gridlet completion
16. Decrement fault value
17. Submit result
Performance Comparison
We propose adaptive checkpointing techniques.
They can be compared with commonly used existing checkpointing techniques:
Periodic Checkpointing
Skipping checkpointing techniques
Periodic Skip
Exponential backoff skip
Performance Metrics
a) Makespan: the maximum completion time over all resources, i.e. the time when all jobs finish execution. The completion time for a resource is the point in time when all jobs allocated to that resource complete execution.
b) Flowtime: the sum of the completion times of all the resources.
c) Average bounded slowdown: the average slowdown of a job, computed as the difference between the time taken to execute a job and its CPU time, averaged over all jobs. Sizes of jobs are taken to be comparable to each other.
d) Work lost due to failures: the unsaved work that is lost when jobs fail.
e) Utilization: the fraction of resource time that is used in executing jobs, i.e. in doing useful work. Useful work does not include the time spent on work that is lost due to failures.
f) Number of checkpoints: the total number of checkpoints performed during the entire simulation run for a batch of jobs.
g) Average turnaround time: the average of the completion times of jobs, where the completion time of a job is its finish time minus its submission time.
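Three of these metrics follow directly from per-resource completion times and per-job timestamps. A minimal sketch, with hypothetical function names and input shapes:

```python
def makespan(resource_completion_times):
    """Maximum completion time over all resources."""
    return max(resource_completion_times)

def flowtime(resource_completion_times):
    """Sum of the completion times of all resources."""
    return sum(resource_completion_times)

def average_turnaround(jobs):
    """jobs: iterable of (submission_time, finish_time) pairs."""
    jobs = list(jobs)
    return sum(finish - submit for submit, finish in jobs) / len(jobs)

# Usage with made-up numbers: three resources and two jobs.
completion = [10.0, 30.0, 20.0]
print(makespan(completion), flowtime(completion))
print(average_turnaround([(0.0, 10.0), (5.0, 25.0)]))
```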
Simulation Parameters
Parameter: Value
1. Number of Resources (Clusters): 5
2. Number of Processors per Cluster: 64
3. Number of jobs: 200
4. Computation time per job: 48 hours
5. Checkpoint Overhead: 720 seconds [19]
6. Size (Number of processors) of job: 64
7. Checkpoint Interval: 1000 to 10000 seconds [19]
8. MTBF of Resources: 5 hours to 18 hours
9. Failure Distribution: Weibull (shape parameter 0.7, 1, 1.5) [19]
10. Elite Count: 2
11. Crossover fraction: 0.9
12. Initial Population (GA): Number of jobs (200)
13. Number of Ants: Number of resources (5)
Weibull Distribution
[Figure: probability density function for various shape parameters]
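The effect of the shape parameter can be checked numerically. The sketch below uses the standard Weibull density and hazard rate with a scale of 1; the function names are hypothetical. A shape below 1 gives a decreasing hazard rate, which matches the observation that a node that has survived for a long time is less likely to fail soon.

```python
import math

def weibull_pdf(t, shape, scale=1.0):
    """Weibull probability density for t > 0 (returns 0.0 for t <= 0)."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for shape in (0.7, 1.0, 1.5):
    row = [round(weibull_pdf(t, shape), 3) for t in (0.5, 1.0, 2.0)]
    print("shape", shape, "pdf at t = 0.5, 1, 2:", row)
```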
GA-based Adaptive Fault Tolerance Using MTBF of Resources
[Figures: makespan (seconds), work lost due to failures (seconds), number of checkpoints taken, flowtime (seconds), average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -2.6% for makespan, -2.2% for flowtime, -8% for average bounded slowdown, +2% for utilization, +5% for work lost due to failures and -9.1% for number of checkpoints taken.
[Figure: chart comparing GA based adaptive checkpointing using MTBF with GA based periodic checkpointing on makespan, number of checkpoints, flowtime, work lost due to failures, utilization and average bounded slowdown]
GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources
[Figures: makespan (seconds), work lost (seconds), number of checkpoints taken, flowtime (seconds), average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for the adaptive and periodic checkpointing variants, Periodic_Skip and Exponential_Backoff_Skip (labelled Random_Backoff_Skip in one legend)]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -3.13% for makespan, -13.43% for flowtime, -13.41% for average bounded slowdown, +2.65% for utilization, -22.51% for work lost due to failures and -7.21% for number of checkpoints taken.
[Figure: chart comparing GA based adaptive checkpointing using fault indexes with GA based periodic checkpointing on makespan, flowtime and utilization]
Performance Comparison for GA based Adaptive Checkpointing and Periodic Checkpointing for Failure traces
[Figures: makespan (seconds), work lost, flowtime, average bounded slowdown and utilization per iteration, for adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow failure traces. Some adaptive series are scaled in the legends, e.g. (/10), (/100), (/1000).]
Overall comparison (Trace 1)
Values for adaptive checkpointing relative to periodic checkpointing for trace 1 are -1% for makespan, -5% for flowtime, -26.7% for average bounded slowdown, +0.96% for utilization, +58.8% for work lost due to failures, -37.5% for number of checkpoints taken and -5% for average turnaround time.
[Figure: turnaround time per iteration for adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow traces]
[Figure: chart comparing adaptive and periodic checkpointing on makespan, utilization, flowtime, turnaround time and work lost]
Values for adaptive checkpointing relative to periodic checkpointing for trace 2 are -2.6% for makespan, -2.7% for flowtime, -29.8% for average bounded slowdown, +2.4% for utilization, +36% for work lost due to failures, -32% for number of checkpoints taken and -3.88% for average turnaround time.
Values for adaptive checkpointing relative to periodic checkpointing for trace 3 are -5% for makespan, -7.88% for flowtime, -42.7% for average bounded slowdown, +5.47% for utilization, +55.43% for work lost due to failures, -47% for number of checkpoints taken and -8.35% for average turnaround time.
... for flowtime, -22% for average bounded slowdown, +4.69% for utilization, +1.4% for work lost due to failures, -23.5% for number of checkpoints taken and -3.2% for average turnaround time.
... failures, -32% for number of checkpoints taken and -6.7% for average turnaround time.
[Figures: one chart per trace comparing adaptive and periodic checkpointing on makespan, utilization, flowtime, turnaround time and work lost]
Performance Comparison for ACO based Adaptive Checkpointing
[Figures: makespan, work lost, number of checkpoints taken, flowtime, average bounded slowdown, utilization and turnaround time per iteration, for adaptive and periodic checkpointing on the overnet, skype, ucb, Notre and Glow failure traces. Some adaptive series are scaled in the legends, e.g. (/10), (/100), (/1000).]
Trace 1
... bounded slowdown, +5% for utilization, +182% for work lost due to failures, -23% for number of ...
Trace 2
... trace 2 are -2.5% for makespan, -2% ... failures, -17.5% for number of ...
Trace 3
... relative to periodic checkpointing for trace 3 are -2.3% for makespan, -4% for flowtime, -25% for average ... utilization, +35.5% for work lost due to failures, -26.7% for number of checkpoints taken and -4% for ...
Trace 4
... trace 4 are -2.5% for makespan, -2% ... utilization, +14.13% for work lost due to failures, -17.5% for number of ...
Trace 5
Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -5.4% for makespan, -5.49% for flowtime, -33.14% for average bounded slowdown, +5.7% for utilization, +86.7% for work lost due to failures, -33.14% for number of checkpoints taken and -5.5% for average turnaround time.
[Figures: one chart per trace comparing adaptive and periodic checkpointing on makespan, utilization, flowtime, turnaround time and work lost]
Performance Evaluation:
MTBF and Last Failure Based Adaptive Checkpointing for Temporally Correlated Failures
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), average bounded slowdown (seconds), utilization, completion time (seconds) and number of faults per iteration, for adaptive and periodic checkpointing with w = 2 and w = 4. Some panels are broken down by resource 1 and resource 2.]
Performance Evaluation: Fault Index based periodic Skip (Part 1)
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds) and average bounded slowdown (seconds) for Periodic Skip and Adaptive Periodic Skip, with x-axis values from 150 to 400]
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization per iteration for Adaptive periodic skip and Periodic skip, and number of faults and completion time (seconds) per iteration for resource 1 and resource 2]
Overall comparison
Values for adaptive checkpointing relative to periodic checkpointing are -6.42% for makespan, -2.25% for flowtime, -10.1% for average bounded slowdown, +5.36% for utilization, -11.425% for work lost due to failures and -0.79% for number of checkpoints taken.
[Figure: chart comparing Adaptive periodic skip with Periodic skip on makespan, flowtime, work lost and utilization]
Ant Colony Based Adaptive Checkpointing Using MTBF of Resources
[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing (with scheduling assisted fault tolerance) and Periodic Checkpointing]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -8.7% for makespan, -4.22% for flowtime, -7.8% for average bounded slowdown, +6.569% for utilization, -3.24% for work lost due to failures and -10% for number of checkpoints taken.
[Figure: chart comparing Ant Colony based adaptive checkpointing using MTBF with the Ant Colony based baseline on makespan, flowtime and utilization]

Ant Colony Based Adaptive Checkpointing Using Fault Ratios of Resources
[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Random_Backoff_Skip]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -6.95% for makespan, -1.55% for flowtime, -7.7% for average bounded slowdown, +7.17% for utilization, -28.2% for work lost due to failures and +30% for number of checkpoints taken.
[Figure: chart comparing Ant Colony based adaptive checkpointing using fault indexes with Ant Colony based periodic checkpointing on makespan, flowtime and utilization]
GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization per iteration for GA based adaptive checkpointing and GA based periodic checkpointing]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -3.5% for makespan, -3.1% for flowtime, -17.84% for average bounded slowdown, +1.395% for utilization, -24.84% for work lost due to failures and -5.76% for number of checkpoints taken.
[Figure: chart comparing GA based adaptive checkpointing using fault indexes with GA based periodic checkpointing on makespan, flowtime and utilization]
ACO-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures
[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), average bounded slowdown (seconds) and utilization per iteration for the Adaptive Ant Colony Algorithm and the Ant Colony Algorithm]
Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -2.3% for makespan, -3.8% for flowtime, -20.5% for average bounded slowdown, +0.97% for utilization, -25.3% for work lost due to failures and -7.52% for number of checkpoints taken.
[Figure: chart comparing Ant Colony based adaptive checkpointing using fault indexes with Ant Colony based periodic checkpointing on makespan, flowtime and utilization]
GA performance comparison for workload traces

Trace 1
Values of adaptive checkpointing
technique relative to periodic
checkpointing are 0% for
makespan, 0% for flowtime, -20.5%
for average bounded slowdown,
-11.7% for work lost due to failures,
-21% for number of checkpoints
taken and -4.3% for average
turnaround time
Trace 2
checkpointing are 0% for makespan,
0% for flowtime, -15% for average
bounded slowdown, 31% for work lost
due to failures, -40.5% for number of
Makespan
Makespan
100
200
Flowtime
50
Flowtime
100
Adaptive
checkpointing
Periodic
Checkpointing
Work Lost
Adaptive
checkpointing
Periodic
Checkpointing
Turnaround time
Work Lost
Turnaround time
GA performance comparison for workload traces
Trace 3
Trace 4

0% for flowtime, -19.76% for average
bounded slowdown, -13% for work lost
due to failures, -29% for number of

checkpointing are -2.2% for makespan,
-2.1% for flowtime, -20.3% for average
bounded slowdown, -4.5% for work lost
checkpoints taken and -4% for average
turnaround time.
Makespan
Makespan
100
100

Flowtime
50
Flowtime
50
Adaptive
checkpointing
0
Periodic
Checkpointing
Work Lost
Work Lost
Periodic
Checkpointing
Turnaround time
Turnaround time
Adaptive
checkpointing
GA performance comparison for workload traces
Trace 5
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -13.6% for average bounded slowdown, -7.2% for work lost due to failures, -26.8% for number of checkpoints taken, and -1.85% for average turnaround time.
[Figure, Trace 5: makespan, flowtime, work lost, and turnaround time for adaptive checkpointing versus periodic checkpointing]
ACO performance comparison for workload traces
Trace 1
... 0% for flowtime, -39.6% for average bounded slowdown, -87% for work lost due to failures, -38% for number of checkpoints taken and -9% for ...
Trace 2
... -3% for flowtime, -22% for average bounded slowdown, +34% for work lost due to failures, -51% for number of ... turnaround time.
[Figures, Trace 1 and Trace 2: makespan, flowtime, work lost, and turnaround time for adaptive checkpointing versus periodic checkpointing]

ACO performance comparison for workload traces
Trace 3
... bounded slowdown, -69% for work lost ... turnaround time.
Trace 4
... checkpointing are -2% for makespan, ... bounded slowdown, -9.56% for work lost due to failures, -24.5% for number of checkpoints taken and -3.5% for ...
[Figures, Trace 3 and Trace 4: makespan, flowtime, work lost, and turnaround time for adaptive checkpointing versus periodic checkpointing]

ACO performance comparison for workload traces
Trace 5
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, +9% for flowtime, -31.5% for average bounded slowdown, -38% for work lost due to failures, -43% for number of checkpoints taken, and -6% for average turnaround time.
[Figure, Trace 5: makespan, flowtime, work lost, and turnaround time for adaptive checkpointing versus periodic checkpointing]
Conclusions
Design of adaptive checkpointing based fault-tolerant heuristics and their
incorporation in the Genetic Algorithm (GA). These heuristics are based on
information related to reliability of resources such as MTBF, fault index and fault
ratios. All adaptive checkpointing heuristics have been compared with GA-based
periodic checkpointing for a wide range of scenarios.
Incorporation of the designed heuristics into Ant Colony Optimization based scheduling in
the Grid.
Design of fault index based periodic skip technique and its performance
comparison with periodic skip.
Design of adaptive checkpointing based on information about MTBF and last
failure time of resources.
Design of experimental scenarios for testing performance of various techniques
for temporally and spatially correlated failures.
Performance comparison of ACO-based and GA-based fault tolerance techniques
using real failure traces available from Failure Trace Archive.
Performance comparison of ACO-based and GA-based fault tolerance techniques
for real workload traces available from various parallel workload archives.
Future Work
The workload and failure traces used in this work are small portions of the
available traces. Future work will focus on using complete traces for evaluation.
Experiments have been performed for workload and failure traces separately.
Future work will use both a workload trace and a failure trace in the same experiment.
Downtime (MTTR) of resources is ignored in this work, and a resource is assumed
to recover immediately from failure. This assumption will be removed in future.
The checkpointing technique used restarts a job on the same resource after a
failure. Another technique is checkpointing with migration, where the job is
restarted on a different resource. This technique, along with its various issues
such as spare node allocation, remains to be studied.
The heuristics developed for fault tolerance are not restricted to metaheuristics;
they can be incorporated in any scheduling algorithm. Future work will
look into that.
This work considers only transient faults on resources. Other fault classes are
not considered.
Finally, future work will focus on working in an actual Grid setup rather than a
simulated one.
7. Publications
Upadhyay, N., and Misra, M. 2011.
Incorporating fault tolerance in GA-based
Scheduling in Grid environment. In
Proceedings of World Congress on
Information and Communication
Technologies (Mumbai, India, Dec 11-14).
WICT 2011. IEEE. 776-781.
Heuristic (Wikipedia)
Heuristic (or heuristics;
Greek: "Εὑρίσκω", "find" or "discover")
refers to experience-based techniques
for problem solving, learning, and
discovery. Where an exhaustive search is
impractical, heuristic methods are used
to speed up the process of finding a
satisfactory solution.
Heuristic (Wikipedia)
In computer science, a heuristic is a technique designed
to solve a problem that ignores whether the solution can
be proven to be correct, but which usually produces a
good solution or solves a simpler problem that contains or
intersects with the solution of the more complex problem.
Heuristics are intended to gain computational
performance or conceptual simplicity, potentially at the
cost of accuracy or precision.
In their Turing Award acceptance speech, Herbert Simon
and Allen Newell discuss the Heuristic Search Hypothesis:
a physical symbol system will repeatedly generate and
modify known symbol structures until the created
structure matches the solution structure.
Meta-heuristic (Wikipedia)
In computer science, metaheuristic designates a
computational method that optimizes a
problem by iteratively trying to improve
a candidate solution with regard to a
given measure of quality.
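The definition above can be illustrated with a very small iterative-improvement loop. This is only a sketch of the general idea, not the GA or ACO used in this work: the candidate solution is a job-to-resource assignment, the measure of quality is the makespan, and the function names, run times, and iteration count are all assumptions made for the example.

```python
import random


def makespan(assignment, job_times, n_resources):
    """Largest total run time over all resources for a job -> resource assignment."""
    loads = [0.0] * n_resources
    for job, resource in enumerate(assignment):
        loads[resource] += job_times[job]
    return max(loads)


def iterative_improvement(job_times, n_resources, iterations=200, seed=0):
    """Move one random job to a random resource; keep the move if it is not worse."""
    rng = random.Random(seed)
    # Deliberately poor initial candidate: every job on resource 0.
    current = [0] * len(job_times)
    current_cost = makespan(current, job_times, n_resources)
    for _ in range(iterations):
        candidate = list(current)
        candidate[rng.randrange(len(job_times))] = rng.randrange(n_resources)
        candidate_cost = makespan(candidate, job_times, n_resources)
        if candidate_cost <= current_cost:  # accept only non-worsening moves
            current, current_cost = candidate, candidate_cost
    return current, current_cost


job_times = [2.0, 3.0, 1.5, 2.5]  # illustrative run times in seconds
initial_cost = makespan([0] * len(job_times), job_times, 2)
best, best_cost = iterative_improvement(job_times, n_resources=2)
```

Because a move is accepted only when it does not increase the makespan, the returned cost can never exceed the initial one. Metaheuristics such as GA and ACO add population or colony memory on top of this basic improve-and-compare loop.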
ACO vs GA
Compared to GAs (Genetic Algorithms), ACO:
retains memory of the entire colony instead of the previous generation only
is less affected by poor initial solutions (due to the combination of random path selection and colony memory)
Makespan and flowtime
Chromosome 1
Job:       J1  J2  J3  J4
Resource:  R1  R1  R2  R2
R1: J1 (2 sec), J2 (3 sec), so C1 = 5 sec
R2: J3 (1.5 sec), J4 (2.5 sec), so C2 = 4 sec
Makespan = max(C1, C2) = max(5, 4) = 5 seconds
Flowtime = C1 + C2 = 5 + 4 = 9 seconds
Chromosome 2
Job:       J1  J2  J3  J4
Resource:  R1  R2  R1  R2
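The makespan and flowtime arithmetic for the two chromosomes can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the Chromosome 1 mapping (J1, J2 on R1 and J3, J4 on R2) is inferred from C1 = 5 sec and C2 = 4 sec, jobs on one resource are assumed to run one after another, flowtime follows the slide's definition as the sum of the per-resource completion times, and the function names are hypothetical.

```python
def completion_times(chromosome, job_times):
    """Sum the run times of the jobs mapped to each resource."""
    totals = {}
    for job, resource in chromosome.items():
        totals[resource] = totals.get(resource, 0.0) + job_times[job]
    return totals


def makespan(chromosome, job_times):
    """Completion time of the resource that finishes last."""
    return max(completion_times(chromosome, job_times).values())


def flowtime(chromosome, job_times):
    """As on the slide: the sum of the per-resource completion times."""
    return sum(completion_times(chromosome, job_times).values())


job_times = {"J1": 2.0, "J2": 3.0, "J3": 1.5, "J4": 2.5}  # seconds
chromosome_1 = {"J1": "R1", "J2": "R1", "J3": "R2", "J4": "R2"}
chromosome_2 = {"J1": "R1", "J2": "R2", "J3": "R1", "J4": "R2"}

print(makespan(chromosome_1, job_times), flowtime(chromosome_1, job_times))  # 5.0 9.0
print(makespan(chromosome_2, job_times), flowtime(chromosome_2, job_times))  # 5.5 9.0
```

With these run times Chromosome 2 gives C1 = 3.5 sec and C2 = 5.5 sec, so it has the same flowtime as Chromosome 1 but a larger makespan.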
References
[1] Townend, P. and Xu, J. 2003. Fault tolerance within a grid environment. As component of eDemand project at the
University of Durham, United Kingdom.
[2] Foster I., Kesselman C., The Grid: Blueprint for a New Computing Infrastructure, The Elsevier Series in
Grid Computing.
[3] Foster, I. 2001. The anatomy of the grid: enabling scalable virtual organizations. In Proceedings of the First
IEEE/ACM International Symposium on Cluster Computing and the Grid 2001 (Brisbane, Australia, May 15-18,
2001). CCGRID '01. IEEE Computer Society, Washington, DC, USA, 6-7.
[4] Foster I. What is the Grid? A three point checklist, Argonne National Laboratory,
fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf, 2002.
[5] Avizienis, A.; Laprie, J.-C.; Randell, B.; Landwehr, C., "Basic concepts and taxonomy of dependable and secure
computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan.-March 2004.
[6] Huda, M.T.; Schmidt, H.W.; Peake, I.D., "An agent oriented proactive fault-tolerant framework for grid
computing," First International Conference on e-Science and Grid Computing, 2005, pp. 8-15, July 2005.
[7] Hofer, J.; Fahringer, T., "A Multi-Perspective Taxonomy for Systematic Classification of Grid Faults," 16th
Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008. PDP 2008, pp. 126-130, 13-15 Feb. 2008.
[8] Jafar, S.; Krings, A.; Gautier, T., "Flexible Rollback Recovery in Dynamic Heterogeneous Grid
Computing," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 1, pp. 32-44, Jan.-March 2009.
[9] Avizienis, A., "The N-Version Approach to Fault-Tolerant Software," IEEE Transactions on Software Engineering,
vol. SE-11, no. 12, pp. 1491-1501, Dec. 1985.
[10] Schroeder B. and G. A. Gibson (2006). A large-scale study of failures in high-performance
computing systems, International Conference on Dependable Systems and Networks, DSN 2006.
[11] Hayashibara N, Cherif A, Katayama T. Failure detectors for large-scale distributed systems,
Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems. IEEE Computer Society Press:
Los Alamitos, CA, 2002, 404-409, October 2002.
[12] Elnozahy E, Johnson D, Wang Y. A survey of rollback recovery protocols in message-passing systems,
ACM Computing Surveys 2002, 34(3), 375-408.
[13] Alvisi L. and K. Marzullo (1998). Message logging: pessimistic, optimistic, causal, and optimal, IEEE
Transactions on Software Engineering, 24(2), 149-159.
[14] H-C Nam, J. Kim, SJ. Hong and S. Lee. Probabilistic checkpointing, In Proceedings of the Twenty-Seventh
International Symposium on Fault-Tolerant Computing (FTCS-27), pp. 48-57, June 1997.
[15] Gabriel Rodríguez, Xoán C. Pardo, María J. Martín, Patricia González, Performance evaluation of an
application-level checkpointing solution on grids, Future Generation Computer Systems, Volume 26, Issue
7, July 2010, Pages 1012-1023, ISSN 0167-739X, 10.1016/j.future.2010.04.016.
[16] Oliner, A.J.; Sahoo, R.K.; Moreira, J.E.; Gupta, M., "Performance implications of periodic checkpointing
on large-scale cluster systems," Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th
IEEE International, 8 pp., 4-8 April 2005.
[17] Plank, J.S.; Elwasif, W.R., "Experimental assessment of workstation failures and their impact on
checkpointing systems," Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual
International Symposium on, pp. 48-57, 23-25 Jun 1998.
[18] Adam J. Oliner, Larry Rudolph, and Ramendra K. Sahoo. 2006. Cooperative checkpointing: a robust
approach to large-scale systems reliability. In Proceedings of the 20th annual international conference on
Supercomputing (ICS '06). ACM, New York, NY, USA, 14-23.
[19] Oliner, A.; Sahoo, R., "Evaluating cooperative checkpointing for supercomputing systems," Parallel and
Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 8 pp., 25-29 April 2006.
[20] Oliner, A.; Rudolph, L.; Sahoo, R., "Cooperative checkpointing theory," Parallel and Distributed Processing
Symposium, 2006. IPDPS 2006. 20th International, 10 pp., 25-29 April 2006.
[21] Zhiling Lan; Yawei Li, "Adaptive Fault Management of Parallel Applications for High-Performance
Computing," IEEE Transactions on Computers, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
[22] Chtepen M., F. H. A. Claeys, et al. (2009). Adaptive Task Checkpointing and Replication: Toward Efficient
Fault-Tolerant Grids, IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 2, pp. 180-190, Feb.
2009.
[23] Nazir B., K. Qureshi, et al. (2009). "Adaptive checkpointing strategy to tolerate faults in economy based
grid," The Journal of Supercomputing, 50(1), 1-18, 2009.
[24] Antonios Litke, Konstantinos Tserpes, Konstantinos Dolkas, and Theodora Varvarigou. 2005. A task
replication and fair resource management scheme for fault tolerant grids. In Proceedings of the 2005 European
conference on Advances in Grid Computing (EGC'05), Peter A. Sloot, Alfons G. Hoekstra, Thierry Priol,
Alexander Reinefeld, and Marian Bubak (Eds.). Springer-Verlag, Berlin, Heidelberg, 1022-1031.
[25] Qin Z., B. Veeravalli, et al. (2009). On the Design of Fault-Tolerant Scheduling Strategies Using
Primary-Backup Approach for Computational Grids with Low Replication Costs, IEEE Transactions on Computers, vol.
58, no. 3, pp. 380-393, March 2009.
[26] Hwang S., and Kesselman C., A flexible framework for fault tolerance in the grid, Journal of Grid
Computing, vol. 1, no. 3, pp. 251-272, 2003.
[27] Lopes R. F. and F. J. da Silva e Silva (2006). Fault tolerance in a mobile agent based computational grid,
Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops, 2006, vol. 2, 16-19 May
2006.
[28] Kandaswamy G., A. Mandal, et al. (2008). Fault Tolerance and Recovery of Scientific Workflows on
Computational Grids, 8th IEEE International Symposium on Cluster Computing and the Grid, 2008. CCGRID08,
pp. 777-782, 19-22 May 2008.
[29] Yang Z., A. Mandal, et al. (2009). Combined Fault Tolerance and Scheduling Techniques for Workflow
Applications on Computational Grids, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
CCGRID09, pp. 244-251, 18-21 May 2009.
[30] SungJin C., B. MaengSoon, et al. (2004). Volunteer availability based fault tolerant scheduling mechanism in
desktop grid computing environment, Proceedings of the third IEEE International Symposium on Network
Computing and Applications, 2004, (NCA 2004), pp. 366-371, 30 Aug.-1 Sept. 2004.
[31] Hou E. S. H., N. Ansari, et al. (1994). A genetic algorithm for multiprocessor scheduling, IEEE Transactions on
Parallel and Distributed Systems, vol. 5, no. 2, pp. 113-120, Feb 1994.
[32] Song, S., Hwang, K., and Kwok, K. 2006. Risk-resilient heuristics and genetic algorithms for security-assured
grid job scheduling. IEEE Transactions on Computers 55, 6 (June 2006), 703-719.
[33] Khanli, L. M., Far, M. E., and Rahmani, A. M. 2010. RFOH: A New Fault Tolerant Job Scheduler in Grid
Computing. In Proceedings of the Second International Conference on Computer Engineering and
Applications (Bali Island, Indonesia, March 19-21, 2010). ICCEA '10. IEEE Computer Society, Washington, DC,
USA, 422-425.
[34] Priya, S.B., Prakash, M., Dhawan, K.K. 2007. Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using
Checkpoint. In Proceedings of the Sixth International Conference on Grid and Cooperative Computing (Los
Alamitos, CA, Aug. 16-18, 2007). GCC '07. 676-680.
[35] Abdulal, W., and Ramachandram, S. 2011. Reliability-Aware Genetic Scheduling Algorithm in Grid Environment.
In Proceedings of the International Conference on Communication Systems and Network Technologies (Katra,
Jammu, India, June 03-05). 673-677.
[36] Wu, C., Lai K., and Sun R. 2008. GA-Based Job Scheduling Strategies for Fault Tolerant Grid Systems. In
Proceedings of the Asia-Pacific Conference on Services Computing (Dec. 09-12, 2008). IEEE, 27-32.
[37] Dorigo, M.; Gambardella, L.M., "Ant colony system: a cooperative learning approach to the traveling
salesman problem," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 53-66, Apr 1997.
[38] Zhihong Xu; Xiangdan Hou; Jizhou Sun, "Ant algorithm-based task scheduling in grid
computing," Electrical and Computer Engineering, 2003. IEEE CCECE 2003. Canadian Conference on,
vol. 2, pp. 1107-1110, 4-7 May 2003.
[39] Hui Yan; Xue-Qin Shen; Xing Li; Ming-Hui Wu, "An improved ant algorithm for job scheduling in grid
computing," Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on,
vol. 5, pp. 2957-2961, 18-21 Aug. 2005.
[40] Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, and Ramendra K. Sahoo. 2004.
Performance implications of failures in large-scale cluster scheduling. In Proceedings of the 10th
international conference on Job Scheduling Strategies for Parallel Processing (JSSPP'04), Dror G.
Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer-Verlag, Berlin, Heidelberg, 233-252.
[41] Dalibor Klusáček and Hana Rudová. 2010. Alea 2: job scheduling simulator. In Proceedings of the 3rd
International ICST Conference on Simulation Tools and Techniques (SIMUTools '10). ICST (Institute for
Computer Sciences, Social-Informatics and Telecommunications Engineering), ICST, Brussels,
Belgium.
[42] S. Lorpunmanae, Mohd Sap, A.H. Abdullah and C. C. Inwai, An Ant Colony Optimization for Dynamic
Job Scheduling in Grid Environment, International Journal of Computer and Information Science and
Engineering, 2007.
[43] Jing Hu, Mingchu Li, Weifeng Sun, Yuanfang Chen, "An Ant Colony Optimization for Grid Task
Scheduling with Multiple QoS Dimensions," 2009 Eighth International Conference on Grid and
Cooperative Computing, pp. 415-419, 2009.
[44] Ruay-Shiung Chang, Jih-Sheng Chang, Po-Sheng Lin, An ant algorithm for balanced job scheduling in
grids, Future Generation Computer Systems, Volume 25, Issue 1, January 2009, Pages 20-27, ISSN 0167-739X, 10.1016/j.future.2008.06.004.
[45] Wei-Neng Chen; Jun Zhang, "An Ant Colony Optimization Approach to a Grid Workflow Scheduling
Problem With Various QoS Requirements," IEEE Transactions on Systems, Man, and Cybernetics, Part C:
Applications and Reviews, vol. 39, no. 1, pp. 29-43, Jan. 2009.
[46] Gosia Wrzesinska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal (2005). Fault-Tolerance,
Malleability and Migration for Divide-and-Conquer Applications on the Grid, Proceedings of the 19th IEEE
International Symposium on Parallel and Distributed Processing, 2005, pp. 13a, 04-08 April 2005.
[47] Buyya R, Murshed M (2002) GridSim: a toolkit for the modeling and simulation of distributed resource
management and scheduling for grid computing. Concurr Comput Pract Exp (CCPE) 14(13), 1175-1220.
[48] Caminero, A.; Sulistio, A.; Caminero, B.; Carrion, C.; Buyya, R., "Extending GridSim with an
architecture for failure detection," International Conference on Parallel and Distributed Systems, 2007,
vol. 2, pp. 1-8, 5-7 Dec. 2007, doi: 10.1109/ICPADS.2007.4447756.
[49] Failure Trace Archive [Online]. http://fta.inria.fr/apache2-default/pmwiki/index.php.
[50] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. 2010. The Failure Trace Archive:
Enabling Comparative Analysis of Failures in Diverse Distributed Systems. In Proceedings of the 2010
10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10). IEEE
Computer Society, Washington, DC, USA, 398-407.
[51] Parallel Workload Archive [Online]. http://www.cs.huji.ac.il/labs/parallel/workload/.
[52] Grid Workload Archive [Online]. http://gwa.ewi.tudelft.nl/pmwiki/.
[53] MATLAB R2010a [Online]. http://www.mathworks.in/help/techdoc/rn/br_03sl.html
