
FAULT TOLERANCE IN GRID ENVIRONMENT

Neeraj Upadhyay, M.Tech 2nd year

1. Introduction and Motivation

Grid computing is defined as [3] coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.

Differences from traditional distributed systems


Scale
Heterogeneity
Dynamicity

The above attributes affect reliability:

1. Scale - the reliability of a system is the product of the reliabilities of its components.
2. Heterogeneity - interaction faults.
3. Dynamicity - delay and loss of jobs.
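The effect of scale can be illustrated with a minimal sketch, assuming a series model in which every component must work for the system to work. The component count and the 99.9% per-component reliability are hypothetical values chosen only for illustration.

```python
from math import prod

def series_reliability(component_reliabilities):
    """Reliability of a system in which all components must work (series model)."""
    return prod(component_reliabilities)

# Hypothetical example: identical components, each 99.9% reliable.
# The system reliability drops quickly as the number of components grows.
for n in (10, 100, 1000):
    print(n, round(series_reliability([0.999] * n), 4))
```

Even highly reliable components yield an unreliable system once enough of them are combined, which is why scale matters in a Grid.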

Fault Tolerance

Preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service.

Problem Statement

Design and performance study of adaptive checkpointing based fault tolerance techniques in a Grid environment.

Extending meta-heuristic algorithms such as Genetic Algorithm and Ant Colony Optimization with support for a fault tolerance technique: adaptive checkpointing.

Inventing adaptive approaches for fault tolerance in computational grids by suitably modifying traditional approaches, such as checkpointing, to take into account various characteristics of the Grid environment.

Why adaptive checkpointing for Grid?

A Grid, by its definition [3], is a highly dynamic distributed environment.

Dynamic
a) Resource conditions, such as fault occurrences, vary dynamically.
b) Faults are more likely to occur during some time frames than others, i.e. faults are temporally correlated [10]. For example, faults are more likely during weekdays, when the workload is high, than during weekends. Faults are also more likely to occur during the daytime.
c) Faults are spatially correlated [10].

The performance of a checkpointing technique depends on the size of the checkpointing interval.

If the interval is very large, a large amount of work is lost due to failures.
If it is very small, the total overhead of the checkpoint operations will be very high.
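The trade-off can be sketched with a deliberately simple cost model. This is an assumption for illustration, not the model used in this work: each failure is assumed to lose half an interval of work on average, and the number of failures (5) is hypothetical. The job length (48 hours) and the checkpoint overhead (720 seconds) are the values used later in the simulation parameters.

```python
def expected_waste(job_length, interval, checkpoint_overhead, failures):
    """Wasted time (seconds) = checkpoint overhead + work lost to failures.

    Assumption: a failure strikes, on average, halfway through an interval.
    """
    checkpoints = job_length / interval
    overhead = checkpoints * checkpoint_overhead
    lost_work = failures * interval / 2.0
    return overhead + lost_work

job_length = 48 * 3600  # 48 hours, in seconds
for interval in (1000, 5000, 10000, 50000):
    print(interval, expected_waste(job_length, interval, 720, failures=5))
```

Under these assumptions the waste is high at both extremes and lowest in between, which is the motivation for adapting the interval to the failure conditions of each resource.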

Why GA?

Simple structure for mapping to the scheduling problem.
The GA used is based on the Global Optimization Toolbox of MATLAB [53]. It tends to converge to a good solution quickly.

Why ACO?

Performance of ACO compared to other metaheuristics [37]

Remark

The heuristics for adaptive checkpointing developed in this work are not restricted to the metaheuristic employed. They can be used with any scheduling technique. Our main aim is to show how maintaining information about the failure conditions of resources can be used to adapt the checkpointing interval and improve performance.

2. Fault tolerance in grid environment

Fault tolerance techniques

Checkpointing
Replication
Workflow-level fault tolerance techniques
Mobile agent based fault tolerance
Fault tolerant scheduling
Application model specific fault tolerance techniques

3. Research gaps

Scheduling support for the adaptive checkpointing approach using GA.
Tackling the problem of the autonomous nature of Grid resources.
Consideration of Mean Time To Repair (MTTR) in meta-heuristic based scheduling.
Support for fault tolerance in Ant Colony Optimization based scheduling.
Spatially and temporally correlated faults.
Weibull and Lognormal distributions for MTTF and MTTR respectively.

Weibull and lognormal [10]

Weibull (MTBF)
Lognormal (MTTR)

4. Work done

Proposed solution
Incorporating fault tolerance in GA-based scheduling in Grid environment

Genetic Algorithm
Initial Population
Fitness Evaluation

Representation of chromosome
[Figure: a chromosome as a sequence of job-resource pairs: (J1, R1), (J2, R2), (J3, R3), (J4, R4), ..., Jn]

Crossover Operator
[Figure: Chromosome 1 and Chromosome 2, each assigning jobs J1-J5 to resources R1-R5, are cut at a crossover point and recombined into an offspring]

Mutation Operator
[Figure: a chromosome assigning jobs J1-J5 to resources, shown before and after mutation]
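The chromosome representation and the two operators can be sketched as follows. This is a minimal illustration, not the MATLAB toolbox implementation: a chromosome is assumed to be a list whose index is the job and whose value is the assigned resource, crossover is single-point, and mutation reassigns one randomly chosen job. All function names are hypothetical.

```python
import random

def random_chromosome(num_jobs, num_resources, rng):
    """chromosome[j] is the index of the resource assigned to job j."""
    return [rng.randrange(num_resources) for _ in range(num_jobs)]

def single_point_crossover(parent_a, parent_b, point):
    """Offspring takes genes before `point` from parent_a, the rest from parent_b."""
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, num_resources, rng):
    """Return a copy in which one randomly chosen job gets a random resource."""
    child = list(chromosome)
    job = rng.randrange(len(child))
    child[job] = rng.randrange(num_resources)
    return child

rng = random.Random(0)
a = random_chromosome(5, 5, rng)
b = random_chromosome(5, 5, rng)
child = single_point_crossover(a, b, point=2)
mutant = mutate(child, 5, rng)
print(a, b, child, mutant)
```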

Fitness function

Flowtime = \sum_{r=1}^{m} C_r

C_r is the completion time of the jobs allocated to resource r.
m is the total number of resources in the Grid.

Adaptive Checkpointing based fitness functions

Mean Failure Time Based Checkpointing

Quantities used:
the task size in Million Instructions (MI)
the resource speed in Millions of Instructions per Second (MIPS)
the execution time of job i on node n
the mean failure time of node n

Some existing approaches

[33] , [36]

Last Failure Time Based Checkpointing and Checkpointing without Migration

If
Where C1 is current system time, LFn is last failure time of node n
and k is an integer 2.

Resource Provider Autonomy Based Scheduling

[7] presents volunteer autonomy failures.
The time of registration of a resource (a volunteer) is maintained in the Grid Information Service.
The mean time for which a resource remains in the grid (stay time).

Checkpointing with downtimes (9)

Assumption: no work is lost due to failures.

Work lost taken into account (10)

Fault index and Fault ratio based Adaptive Checkpointing based fitness function

1. Fault ratio based adaptive checkpointing

Parameters

One parameter limits the increase in the checkpointing interval and a second limits the decrease. For the experiments they are taken as 1 and 0.5 respectively, to show the applicability of the approach, so the checkpoint interval can vary between 0.5 * check_interval and 2 * check_interval.
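The bounds on the adapted interval can be sketched as a clamp. This is only a sketch under stated assumptions: the names `max_increase` and `max_decrease` are hypothetical stand-ins for the two limiting parameters, the bounds are assumed to be (1 + max_increase) and (1 - max_decrease) times the base interval, and `scale` is a hypothetical factor that would be derived from the fault ratio of the resource.

```python
def adapt_interval(base_interval, scale, max_increase=1.0, max_decrease=0.5):
    """Scale the base interval, clamped to the allowed range.

    With max_increase=1 and max_decrease=0.5 the result stays within
    [0.5 * base_interval, 2 * base_interval].
    """
    lower = (1.0 - max_decrease) * base_interval
    upper = (1.0 + max_increase) * base_interval
    return min(max(scale * base_interval, lower), upper)

print(adapt_interval(1000, 5.0))  # clamped to the upper bound
print(adapt_interval(1000, 0.1))  # clamped to the lower bound
print(adapt_interval(1000, 1.2))  # within bounds
```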


Fault Occurrence History Table (FOHT)

Resource    No. of faults    No. of Executions
R1
R2
R3
:
Rn

Fault Index based adaptive Checkpointing

One parameter limits the increase in the checkpointing interval and a second limits the decrease.

max((FOHT[i][1] FOHT[i][0]),
) ,))

Ant Colony Optimization

Initialize all parameters
Loop /* outer loop: one iteration of ACO */
    Each ant chooses a random sequence of tasks
    Loop /* inner loop: one step */
        Each ant incrementally builds a solution by applying the state transition rule and a local pheromone updating rule
    Until all ants have completed building a solution
    Apply global pheromone updating rule
Until terminate_condition

ACO Phases

Pseudorandom state transition rule:
R =

Local pheromone update rule:
pheromone(r,j) = (1 - ρ) · pheromone(r,j) + ρ · initial_pheromone
The evaporation rate ρ is set to 0.1, a second parameter to 1.2, and q0 = 0.9 [45].

ACO Phases Continued

Global pheromone update rule:
pheromone(r,j) = (1 - ρ) · pheromone(r,j) + ρ · score
score = 1 + minimum_makespan / makespan
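The three rules can be sketched as below. Several things here are assumptions: the state transition rule is written in the usual Ant Colony System pseudorandom proportional form, the `heuristic` matrix (for example, an inverse execution time) and the exponent name `beta` are hypothetical, and `rho` stands for the evaporation rate in the update rules.

```python
import random

def choose_resource(pheromone, heuristic, job, q0, beta, rng):
    """Pseudorandom proportional rule: exploit with probability q0, else sample."""
    resources = range(len(pheromone))
    scores = [pheromone[r][job] * heuristic[r][job] ** beta for r in resources]
    if rng.random() < q0:
        # Exploitation: pick the resource with the best score.
        return max(resources, key=scores.__getitem__)
    # Exploration: sample a resource in proportion to its score.
    return rng.choices(list(resources), weights=scores)[0]

def local_update(pheromone, r, job, rho, initial_pheromone):
    """Local rule, applied while an ant builds its solution."""
    pheromone[r][job] = (1 - rho) * pheromone[r][job] + rho * initial_pheromone

def global_update(pheromone, r, job, rho, makespan, minimum_makespan):
    """Global rule, applied after all ants have built a solution."""
    score = 1 + minimum_makespan / makespan
    pheromone[r][job] = (1 - rho) * pheromone[r][job] + rho * score
```

A schedule whose makespan equals the best one found so far receives the largest score (2), so its job-resource pairs are reinforced most strongly.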

Fault Index based periodic Skip

FI(i): fault index of resource i
FI1, FI2, FI3, ..., FIN: fault index values such that FI1 < FI2 < FI3 < ... < FIN
D1, D2, D3, ..., DN: skip parameters that determine the intensity of skip, such that D1 < D2 < D3 < ... < DN
If (FI(i) > FIN) then
    Perform all checkpoints
If (FIN > FI(i) > FIN-1) then
    Use D1 as skip parameter
...
If (FI1 > FI(i)) then
    Use DN as skip parameter
Exit
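The selection of the skip parameter can be sketched with a sorted threshold list. The function name and the example values are hypothetical, and the handling of a fault index exactly equal to a threshold (treated here as belonging to the lower band) is an assumption, since the rule above uses strict inequalities only.

```python
from bisect import bisect_left

def select_skip(fault_index, thresholds, skips):
    """Map a fault index to a skip parameter.

    thresholds: [FI1, ..., FIN] in ascending order.
    skips:      [D1, ..., DN] in ascending order.
    Returns None when all checkpoints must be performed (fault index > FIN).
    """
    n = len(thresholds)
    band = bisect_left(thresholds, fault_index)  # number of thresholds below FI(i)
    if band == n:
        return None            # most faulty resources: perform all checkpoints
    return skips[n - band - 1]  # lower fault index -> larger skip parameter

thresholds = [0.1, 0.2, 0.3]   # hypothetical FI1 < FI2 < FI3
skips = [1, 2, 3]              # hypothetical D1 < D2 < D3
for fi in (0.5, 0.25, 0.15, 0.05):
    print(fi, select_skip(fi, thresholds, skips))
```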

Other Techniques

Fault index based exponential backoff skip
Fault ratio based periodic and exponential skip

Temporal Correlation

i) If a node has not failed for a long time, it is less likely to fail in the near future.
ii) If a node has failed recently, it is highly likely to fail again in the near future.

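MTBF and Last Failure Based Adaptive Checkpointing for Temporally Correlated Failures

If the time since the last failure, (C1 - Lf), exceeds a lower multiple of the MTBF:
AI = AI + I for each checkpoint request while (C1 - Lf) lies between the lower and the upper multiple of the MTBF, where the upper coefficient is greater than the lower one and both are greater than 1. (17)

If (C1 - Lf) is below an upper multiple of the MTBF:
AI = AI - I for each checkpoint request while (C1 - Lf) lies between a lower and the upper multiple of the MTBF, where the upper coefficient is greater than the lower one and both are less than 1.

I is the initial periodic checkpoint interval and AI is the adapted checkpoint interval. C1 is the current time and Lf is the last failure time of the resource.

To show the applicability of the approach, the four coefficients are taken as 2, 2.5, 0.25 and 0.5 in our experiments. A higher value of the first coefficient leaves less opportunity to apply the technique, and a lower value decreases performance because the time since the last failure is very small. Similar reasoning applies to the other coefficients.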
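The rule can be sketched per checkpoint request as follows. The tuple names `grow` and `shrink` are hypothetical labels for the two coefficient pairs (2, 2.5) and (0.25, 0.5), and the `min_interval` floor that keeps the interval positive is an assumption that is not part of the rule as stated.

```python
def adapt_interval_temporal(ai, i, c1, lf, mtbf,
                            grow=(2.0, 2.5), shrink=(0.25, 0.5),
                            min_interval=1.0):
    """Return the adapted interval AI for one checkpoint request.

    ai: current adapted interval, i: initial periodic interval,
    c1: current time, lf: last failure time of the resource.
    """
    elapsed = c1 - lf
    if grow[0] * mtbf < elapsed < grow[1] * mtbf:
        # No failure for a long time: checkpoint less often.
        return ai + i
    if shrink[0] * mtbf < elapsed < shrink[1] * mtbf:
        # Recent failure: checkpoint more often (floor is an assumption).
        return max(min_interval, ai - i)
    return ai

print(adapt_interval_temporal(3000, 1000, c1=2200, lf=0, mtbf=1000))
print(adapt_interval_temporal(3000, 1000, c1=300, lf=0, mtbf=1000))
print(adapt_interval_temporal(3000, 1000, c1=1000, lf=0, mtbf=1000))
```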

Grid Working + Adaptive Checkpointing

[Figure: interaction between the Grid User, Resource Broker, Schedule Advisor (GA or ACO), Gridlet Dispatcher, Gridlet Receptor, Fault Manager (FOHT), Grid Information Service, Checkpoint Server and the Grid Resources (Resource1, Resource2, Resource3). The numbered steps in the figure are:]

1. Deadline, budget, gridlets
2. Available resources information
3. Current load status
4. Available resource list
5. Optimized resource list
6. Get resource fault info
7. Fault value
8. Allocated resource
9. Submit Gridlet
10. Submit checkpoints
11. Gridlet failure
12. Increment fault value
13. Get checkpoint
14. Reschedule from last checkpoint
15. Gridlet completion
16. Decrement fault value
17. Submit result

Grid Working
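The fault bookkeeping in steps 6, 7, 12 and 16 can be sketched as a small class. The class and method names are hypothetical, and keeping the fault value from going below zero is an assumption.

```python
class FaultManager:
    """Keeps a fault value per resource, updated on failure and completion."""

    def __init__(self, resources):
        self.fault_value = {r: 0 for r in resources}

    def on_gridlet_failure(self, resource):
        # Step 12: increment the fault value of the failed resource.
        self.fault_value[resource] += 1

    def on_gridlet_completion(self, resource):
        # Step 16: decrement the fault value (floor at zero is an assumption).
        self.fault_value[resource] = max(0, self.fault_value[resource] - 1)

    def get_fault_value(self, resource):
        # Steps 6-7: the scheduler asks for the fault information of a resource.
        return self.fault_value[resource]

fm = FaultManager(["Resource1", "Resource2", "Resource3"])
fm.on_gridlet_failure("Resource1")
fm.on_gridlet_failure("Resource1")
fm.on_gridlet_completion("Resource1")
print(fm.get_fault_value("Resource1"))
```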

Performance Comparison

We propose adaptive checkpointing techniques.
They can be compared with commonly used existing checkpointing techniques:
Periodic Checkpointing
Skipping checkpointing techniques
    Periodic Skip
    Exponential backoff skip

Performance Metrics

a) Makespan: the maximum completion time over all resources, i.e. the time at which all jobs finish execution. The completion time of a resource is the point in time at which all jobs allocated to that resource complete execution.

b) Flowtime: the sum of the completion times of all the resources.

c) Average bounded slowdown: the average slowdown of a job, i.e. the difference between the time taken to execute a job and its CPU time, averaged over all jobs. The sizes of jobs are taken to be comparable to each other.

d) Work lost due to failures: the unsaved work that is lost when jobs fail.

e) Utilization: the fraction of the resources' time that is used in executing jobs, i.e. in doing useful work. This does not include the time spent on work that is lost due to failures.

f) Number of checkpoints: the total number of checkpoints performed during the entire simulation run for a batch of jobs.

g) Average turnaround time: the average of the completion times of jobs. The completion time of a job is its finish time minus its submission time.
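Three of these metrics follow directly from their definitions and can be sketched as below. The function names and the sample values are hypothetical.

```python
def makespan(resource_completion_times):
    """Maximum completion time over all resources."""
    return max(resource_completion_times)

def flowtime(resource_completion_times):
    """Sum of the completion times of all resources."""
    return sum(resource_completion_times)

def average_turnaround(jobs):
    """jobs: iterable of (submission_time, finish_time) pairs, in seconds."""
    jobs = list(jobs)
    return sum(finish - submit for submit, finish in jobs) / len(jobs)

completion = [10.0, 30.0, 20.0]  # hypothetical per-resource completion times
print(makespan(completion), flowtime(completion))
print(average_turnaround([(0.0, 10.0), (5.0, 25.0)]))
```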

Simulation Parameters

Parameter: Value
1. Number of Resources (Clusters): 5
2. Number of Processors per Cluster: 64
3. Number of jobs: 200
4. Computation time per job: 48 hours
5. Checkpoint Overhead: 720 seconds [19]
6. Size (Number of processors) of job: 64
7. Checkpoint Interval: 1000 to 10000 seconds [19]
8. MTBF of Resources: 5 hours to 18 hours
9. Failure Distribution: Weibull (shape parameter 0.7, 1, 1.5) [19]
10. Elite Count: 2
11. Crossover fraction: 0.9
12. Initial Population (GA): Number of jobs (200)
13. Number of Ants: Number of resources (5)

Weibull Distribution

[Figure: probability density function for various shape parameters]
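The density and the sampling of failure times can be sketched with the standard library. This is an illustration, not the simulator code: the function names are hypothetical, and choosing the scale so that the mean equals the MTBF is an assumption about how the distribution is parameterized.

```python
import math
import random

def weibull_pdf(t, shape, scale):
    """Weibull probability density, evaluated for t > 0 only (simplification)."""
    if t <= 0:
        return 0.0
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))

def scale_for_mean(mean, shape):
    """Scale parameter that gives the requested mean (here, the MTBF)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

def sample_time_to_failure(mtbf, shape, rng):
    """Draw one time to failure with the given mean and shape."""
    return rng.weibullvariate(scale_for_mean(mtbf, shape), shape)

rng = random.Random(0)
for shape in (0.7, 1.0, 1.5):  # shape parameters from the simulation setup
    print(shape, sample_time_to_failure(18.0, shape, rng))  # MTBF of 18 hours
```

A shape below 1 gives a decreasing failure rate, 1 gives the exponential distribution, and a shape above 1 gives an increasing failure rate.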

GA-based Adaptive Fault Tolerance Using MTBF of Resources

[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip]

GA-based Adaptive Fault Tolerance Using MTBF of Resources

Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -2.6% for makespan, -2.2% for flowtime, -8% for average bounded slowdown, +2% for utilization, +5% for work lost due to failures and -9.1% for number of checkpoints taken.

[Figure: overall comparison of GA based adaptive checkpointing using MTBF and GA based periodic checkpointing on makespan, flowtime, work lost due to failures, utilization, average bounded slowdown and number of checkpoints]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources

[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip (labelled Random_Backoff_Skip in the flowtime and checkpoint-count plots)]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources

Overall Comparison
Values for adaptive checkpointing relative to periodic checkpointing are -3.13% for makespan, -13.43% for flowtime, -13.41% for average bounded slowdown, +2.65% for utilization, -22.51% for work lost due to failures and -7.21% for number of checkpoints taken.

[Figure: overall comparison of GA based adaptive checkpointing using fault indexes and GA based periodic checkpointing on makespan, flowtime, work lost due to failures, utilization, average bounded slowdown and number of checkpoints]

Performance Comparison for GA based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

[Figures: makespan (seconds), work lost, flowtime, number of checkpoints, average bounded slowdown and utilization, each plotted against iteration for adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow failure traces. Some adaptive series are plotted scaled down by a factor of 10, 100 or 1000.]

Performance Comparison for GA based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

[Figure: turnaround time plotted against iteration for adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow failure traces]

Overall comparison (Trace 1)
Values for adaptive checkpointing relative to periodic checkpointing for trace 1 are -1% for makespan, -5% for flowtime, -26.7% for average bounded slowdown, +0.96% for utilization, +58.8% for work lost due to failures, -37.5% for number of checkpoints taken and -5% for average turnaround time.

[Figure: overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Comparison for GA based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

Overall comparison (Trace 2)
Values for adaptive checkpointing relative to periodic checkpointing for trace 2 are -2.6% for makespan, -2.7% for flowtime, -29.8% for average bounded slowdown, +2.4% for utilization, +36% for work lost due to failures, -32% for number of checkpoints taken and -3.88% for average turnaround time.

Overall comparison (Trace 3)
Values for adaptive checkpointing relative to periodic checkpointing for trace 3 are -5% for makespan, -7.88% for flowtime, -42.7% for average bounded slowdown, +5.47% for utilization, +55.43% for work lost due to failures, -47% for number of checkpoints taken and -8.35% for average turnaround time.

[Figures: for each trace, overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Comparison for GA based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

Overall comparison (Trace 4)
Values for adaptive checkpointing relative to periodic checkpointing for trace 4 are -4.4% for makespan, -2.7% for flowtime, -22% for average bounded slowdown, +4.69% for utilization, +1.4% for work lost due to failures, -23.5% for number of checkpoints taken and -3.2% for average turnaround time.

Overall comparison (Trace 5)
Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -3.8% for makespan, -5.7% for flowtime, -33.43% for average bounded slowdown, +3.92% for utilization, +91% for work lost due to failures, -32% for number of checkpoints taken and -6.7% for average turnaround time.

[Figures: for each trace, overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Comparison for ACO based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

[Figures: makespan, work lost due to failures, flowtime, number of checkpoints taken, average bounded slowdown, utilization and turnaround time, each plotted against iteration for adaptive and periodic checkpointing on the overnet, skype, ucb, Notre and Glow failure traces. Some adaptive series are plotted scaled down by a factor of 10, 100 or 1000.]

Performance Comparison for ACO based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

Trace 1
Values for adaptive checkpointing relative to periodic checkpointing for trace 1 are -4.8% for makespan, -3.7% for flowtime, -22.3% for average bounded slowdown, +5% for utilization, +182% for work lost due to failures, -23% for number of checkpoints taken and -4.1% for average turnaround time.

Trace 2
Values for adaptive checkpointing relative to periodic checkpointing for trace 2 are -2.5% for makespan, -2% for flowtime, -15.7% for average bounded slowdown, +2.4% for utilization, +14% for work lost due to failures, -17.5% for number of checkpoints taken and -1.86% for average turnaround time.

[Figures: for each trace, overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Comparison for ACO based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

Trace 3
Values for adaptive checkpointing relative to periodic checkpointing for trace 3 are -2.3% for makespan, -4% for flowtime, -25% for average bounded slowdown, +2.34% for utilization, +35.5% for work lost due to failures, -26.7% for number of checkpoints taken and -4% for average turnaround time.

Trace 4
Values for adaptive checkpointing relative to periodic checkpointing for trace 4 are -2.5% for makespan, -2% for flowtime, -15.7% for average bounded slowdown, +2.4% for utilization, +14.13% for work lost due to failures, -17.5% for number of checkpoints taken and -1.8% for average turnaround time.

[Figures: for each trace, overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Comparison for ACO based Adaptive Checkpointing and Periodic Checkpointing for Failure traces

Trace 5
Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -5.4% for makespan, -5.49% for flowtime, -33.14% for average bounded slowdown, +5.7% for utilization, +86.7% for work lost due to failures, -33.14% for number of checkpoints taken and -5.5% for average turnaround time.

[Figure: overall comparison of adaptive and periodic checkpointing on makespan, flowtime, turnaround time, work lost, number of checkpoints, average bounded slowdown and utilization]

Performance Evaluation: MTBF and Last Failure Based Adaptive Checkpointing for Temporally Correlated Failures

[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), average bounded slowdown (seconds) and utilization, each plotted against iteration for adaptive and periodic checkpointing with w = 2 and w = 4; number of checkpoints, completion time (seconds) and number of faults, each plotted against iteration for adaptive and periodic checkpointing with w = 2 and w = 4 on resource 1 and resource 2]

Performance Evaluation: Fault Index based periodic Skip (Part 1)

[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints and average bounded slowdown (seconds), each plotted against checkpoint interval (150 to 400 seconds) for Periodic Skip and Adaptive Periodic Skip]

Performance Evaluation: Fault Index based periodic Skip (Part 2)

[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization, each plotted against iteration for Adaptive periodic skip and Periodic skip; number of faults and completion time (seconds), each plotted against iteration for Adaptive periodic skip and Periodic skip on resource 1 and resource 2]

Performance Evaluation: Fault Index based periodic Skip (Part 2)

Overall comparison
Values for adaptive checkpointing relative to periodic checkpointing are -6.42% for makespan, -2.25% for flowtime, -10.1% for average bounded slowdown, +5.36% for utilization, -11.425% for work lost due to failures and -0.79% for number of checkpoints taken.

[Figure: overall comparison of Adaptive periodic skip and Periodic skip on makespan, flowtime, work lost, number of checkpoints, average bounded slowdown and utilization]

Ant Colony Based Adaptive Checkpointing Using


MTBF of Resources

Makespan

Work lost

1.80E+07
1.60E+07
1.40E+07
1.20E+07
1.00E+07
8.00E+06
Makespan (seconds)
6.00E+06
4.00E+06

Adaptive_Checkpointi
g
Periodic_Checkpointin
g (With s cheduling
as s is ted fault
tolerance)
Periodic
Checkpointing

2.00E+06
0.00E+00
Wo rk lost due to failures (se co nds)

Adaptive_C heckpointi
g
Periodic_Checkpointin
g (With s ch eduling
as s is ted fault
to lerance)

C heckpoint Interval (seconds)

Periodic
Checkpointing

C heckpo int Interva l (seconds)

Ant Colony Based Adaptive Checkpointing Using MTBF of Resources

[Figure: two panels plotted against checkpoint interval (seconds), flowtime (seconds) and number of checkpoints, comparing adaptive checkpointing, periodic checkpointing with scheduling-assisted fault tolerance, and periodic checkpointing.]

Ant Colony Based Adaptive Checkpointing Using MTBF of Resources

[Figure: two panels plotted against checkpoint interval (seconds), average bounded slowdown (seconds) and utilization, comparing adaptive checkpointing, periodic checkpointing with scheduling-assisted fault tolerance, and periodic checkpointing.]

Ant Colony Based Adaptive Checkpointing Using MTBF of Resources

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -8.7% for makespan, -4.22% for flowtime, -7.8% for average bounded slowdown, +6.569% for utilization, -3.24% for work lost due to failures, and -10% for number of checkpoints taken.

[Figure: radar chart comparing Ant Colony based adaptive checkpointing using MTBF with Ant Colony based periodic checkpointing on makespan, flowtime, work lost due to failures, average bounded slowdown, utilization and number of checkpoints.]

Ant Colony Based Adaptive Checkpointing Using Fault Ratios of Resources

[Figure: two panels plotted against checkpoint interval (seconds), makespan (seconds) and work lost due to failures (seconds), comparing adaptive checkpointing, periodic checkpointing, periodic skip and random backoff skip.]

Ant Colony Based Adaptive Checkpointing Using Fault Ratios of Resources

[Figure: two panels plotted against checkpoint interval (seconds), flowtime (seconds) and number of checkpoints, comparing adaptive checkpointing, periodic checkpointing, periodic skip and random backoff skip.]

Ant Colony Based Adaptive Checkpointing Using Fault Ratios of Resources

[Figure: two panels plotted against checkpoint interval (seconds), average bounded slowdown (seconds) and utilization, comparing adaptive checkpointing, periodic checkpointing, periodic skip and random backoff skip.]

Ant Colony Based Adaptive Checkpointing Using Fault Ratios of Resources

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -6.95% for makespan, -1.55% for flowtime, -7.7% for average bounded slowdown, +7.17% for utilization, -28.2% for work lost due to failures, and +30% for number of checkpoints taken.

[Figure: radar chart comparing Ant Colony based adaptive checkpointing using fault indexes with Ant Colony based periodic checkpointing on makespan, flowtime, work lost due to failures, average bounded slowdown, utilization and number of checkpoints.]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, makespan (seconds) and work lost due to failures (seconds), comparing GA based adaptive checkpointing with GA based periodic checkpointing.]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, flowtime (seconds) and number of checkpoints, comparing GA based adaptive checkpointing with GA based periodic checkpointing.]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, average bounded slowdown (seconds) and utilization, comparing GA based adaptive checkpointing with GA based periodic checkpointing.]

GA-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -3.5% for makespan, -3.1% for flowtime, -17.84% for average bounded slowdown, +1.395% for utilization, -24.84% for work lost due to failures, and -5.76% for number of checkpoints taken.

[Figure: radar chart comparing GA based adaptive checkpointing using fault indexes with GA based periodic checkpointing on makespan, flowtime, work lost due to failures, average bounded slowdown, utilization and number of checkpoints.]

ACO-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, makespan (seconds) and work lost (seconds), comparing the Adaptive Ant Colony Algorithm with the Ant Colony Algorithm.]

ACO-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, flowtime (seconds) and number of checkpoints, comparing the Adaptive Ant Colony Algorithm with the Ant Colony Algorithm.]

ACO-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

[Figure: two panels plotted against iteration, average bounded slowdown (seconds) and utilization, comparing the Adaptive Ant Colony Algorithm with the Ant Colony Algorithm.]

ACO-based Adaptive Fault Tolerance Using Fault Ratios of Resources for Spatially and Temporally Correlated Failures

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -2.3% for makespan, -3.8% for flowtime, -20.5% for average bounded slowdown, +0.97% for utilization, -25.3% for work lost due to failures, and -7.52% for number of checkpoints taken.

[Figure: radar chart comparing Ant Colony based adaptive checkpointing using fault indexes with Ant Colony based periodic checkpointing on makespan, flowtime, work lost due to failures, average bounded slowdown, utilization and number of checkpoints.]

GA performance comparison for workload traces

Trace 1
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -20.5% for average bounded slowdown, -11.7% for work lost due to failures, -21% for number of checkpoints taken and -4.3% for average turnaround time.

Trace 2
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -15% for average bounded slowdown, 31% for work lost due to failures, -40.5% for number of checkpoints taken and -1.3% for average turnaround time.

[Figure: one radar chart per trace comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

GA performance comparison for workload traces

Trace 3
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -19.76% for average bounded slowdown, -13% for work lost due to failures, -29% for number of checkpoints taken and -4.85% for average turnaround time.

Trace 4
Values of the adaptive checkpointing technique relative to periodic checkpointing are -2.2% for makespan, -2.1% for flowtime, -20.3% for average bounded slowdown, -4.5% for work lost due to failures, -20.5% for number of checkpoints taken and -4% for average turnaround time.

[Figure: one radar chart per trace comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

GA performance comparison for workload traces

Trace 5
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -13.6% for average bounded slowdown, -7.2% for work lost due to failures, -26.8% for number of checkpoints taken and -1.85% for average turnaround time.

[Figure: radar chart comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

ACO performance comparison for workload traces

Trace 1
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -39.6% for average bounded slowdown, -87% for work lost due to failures, -38% for number of checkpoints taken and -9% for average turnaround time.

Trace 2
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, -3% for flowtime, -22% for average bounded slowdown, +34% for work lost due to failures, -51% for number of checkpoints taken and -2% for average turnaround time.

[Figure: one radar chart per trace comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

ACO performance comparison for workload traces

Trace 3
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, -1.6% for flowtime, -43.6% for average bounded slowdown, -69% for work lost due to failures, -39.1% for number of checkpoints taken and -13% for average turnaround time.

Trace 4
Values of the adaptive checkpointing technique relative to periodic checkpointing are -2% for makespan, -2.3% for flowtime, -20.3% for average bounded slowdown, -9.56% for work lost due to failures, -24.5% for number of checkpoints taken and -3.5% for average turnaround time.

[Figure: one radar chart per trace comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

ACO performance comparison for workload traces

Trace 5
Values of the adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, +9% for flowtime, -31.5% for average bounded slowdown, -38% for work lost due to failures, -43% for number of checkpoints taken and -6% for average turnaround time.

[Figure: radar chart comparing adaptive checkpointing with periodic checkpointing on makespan, flowtime, work lost, number of checkpoints, turnaround time and average bounded slowdown.]

Conclusions

Design of adaptive checkpointing based fault-tolerant heuristics and their incorporation into a Genetic Algorithm (GA). These heuristics use information about the reliability of resources, such as MTBF, fault index and fault ratios. All adaptive checkpointing heuristics have been compared with GA-based periodic checkpointing over a wide range of scenarios.

Incorporation of the designed heuristics into Ant Colony Optimization (ACO) based scheduling in the Grid.

Design of a fault index based periodic skip technique and comparison of its performance with periodic skip.

Design of adaptive checkpointing based on the MTBF and last failure time of resources.

Design of experimental scenarios for testing the performance of the techniques under temporally and spatially correlated failures.

Performance comparison of ACO-based and GA-based fault tolerance techniques using real failure traces from the Failure Trace Archive.

Performance comparison of ACO-based and GA-based fault tolerance techniques on real workload traces from various parallel workload archives.
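This section does not spell out how the MTBF-based heuristic chooses a checkpoint interval. As a stand-in, the sketch below uses Young's well-known first-order approximation, interval = sqrt(2 x checkpoint cost x MTBF), to show how a per-resource MTBF can drive the interval. It is not the heuristic evaluated in the presentation, and the function name and sample values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    checkpoint_cost: time to write one checkpoint, in seconds.
    mtbf: mean time between failures of the resource, in seconds.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical resources: a reliable one and a failure-prone one.
for name, mtbf in [("reliable", 10_000.0), ("failure-prone", 2_500.0)]:
    print(name, young_interval(checkpoint_cost=50.0, mtbf=mtbf))
```

A resource with a lower MTBF gets a shorter interval, so less work is lost per failure at the price of more checkpoints, which is the trade-off the adaptive techniques aim to balance.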

Future Work

The workload and failure traces used in this work are small portions of the available traces. Future work will use the complete traces for evaluation.

Experiments have been performed for workload traces and failure traces separately. Future work will use both a workload trace and a failure trace in the same experiment.

Downtime (MTTR) of resources is ignored in this work, and a resource is assumed to recover immediately from a failure. This assumption will be removed in future work.

The checkpointing technique used here restarts a job on the same resource after a failure. An alternative is checkpointing with migration, where the job is restarted on a different resource. This technique, along with issues such as spare node allocation, remains to be studied.

The heuristics developed for fault tolerance are not restricted to metaheuristics; they can be incorporated into any scheduling algorithm. Future work will look into that.

This work considers only transient faults on resources. Other fault classes are not considered.

Finally, future work will focus on an actual Grid setup rather than a simulated one.

7. Publications

Upadhyay, N., and Misra, M. 2011. Incorporating fault tolerance in GA-based Scheduling in Grid environment. In Proceedings of World Congress on Information and Communication Technologies (Mumbai, India, Dec. 11-14). WICT 2011. IEEE. 776-781.

Heuristic (Wikipedia)

Heuristic (or heuristics; Greek: "Εὑρίσκω", "find" or "discover") refers to experience-based techniques for problem solving, learning, and discovery. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution.

Heuristic (Wikipedia)

In computer science, a heuristic is a technique designed to solve a problem that ignores whether the solution can be proven to be correct, but which usually produces a good solution or solves a simpler problem that contains or intersects with the solution of the more complex problem.

Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.

In their Turing Award acceptance speech, Herbert Simon and Allen Newell discuss the Heuristic Search Hypothesis: a physical symbol system will repeatedly generate and modify known symbol structures until the created structure matches the solution structure.

Meta-heuristic (Wikipedia)

In computer science, metaheuristic designates a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality.
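The definition above can be illustrated with the simplest member of this family, a randomized local search. This is only a minimal sketch, assuming makespan as the measure of quality and a job-to-resource list as the candidate solution; GA and ACO build on the same loop with a population or pheromone trails. All names and values are hypothetical.

```python
import random


def makespan(assignment, durations, n_resources):
    """Largest total run time on any resource for a job-to-resource assignment."""
    loads = [0.0] * n_resources
    for job, resource in enumerate(assignment):
        loads[resource] += durations[job]
    return max(loads)


def iterative_improvement(durations, n_resources, iterations=200, seed=0):
    """Repeatedly perturb one candidate solution and keep it when it is no worse."""
    rng = random.Random(seed)
    current = [rng.randrange(n_resources) for _ in durations]
    best = makespan(current, durations, n_resources)
    for _ in range(iterations):
        candidate = list(current)
        # Move one randomly chosen job to a randomly chosen resource.
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_resources)
        quality = makespan(candidate, durations, n_resources)
        if quality <= best:  # a lower makespan means better quality
            current, best = candidate, quality
    return current, best


assignment, best = iterative_improvement([2.0, 3.0, 1.5, 2.5], n_resources=2)
print(assignment, best)
```

No proof of optimality is attempted; the loop simply stops after a fixed number of iterations and returns the best candidate it has seen.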

ACO vs GA

Compared to GAs (Genetic Algorithms), ACO:

retains memory of the entire colony instead of the previous generation only

is less affected by poor initial solutions (due to the combination of random path selection and colony memory)
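The colony memory mentioned above is usually held in a pheromone table that every ant updates. The sketch below follows the classic Ant System style rule (evaporate, then deposit an amount inversely proportional to solution cost); whether the presentation uses exactly this rule is an assumption, and the names and parameter values are hypothetical.

```python
def update_pheromone(pheromone, ant_solutions, evaporation=0.1):
    """Return a new pheromone table after one iteration of the colony.

    pheromone: pheromone[job][resource] trail strengths.
    ant_solutions: list of (assignment, cost) pairs, one per ant, where
                   assignment[job] is the resource chosen for that job.
    """
    # Evaporation: older information fades gradually instead of being discarded.
    updated = [[(1.0 - evaporation) * value for value in row] for row in pheromone]
    # Deposit: every ant reinforces its own choices, weighted by solution quality.
    for assignment, cost in ant_solutions:
        for job, resource in enumerate(assignment):
            updated[job][resource] += 1.0 / cost
    return updated


trails = [[1.0, 1.0], [1.0, 1.0]]           # two jobs, two resources
ants = [([0, 1], 5.0)]                      # one ant: J1 on R1, J2 on R2, cost 5
print(update_pheromone(trails, ants))
```

Because trails accumulate over all iterations and all ants, a single poor early solution is diluted rather than inherited, in contrast to a GA whose next generation depends only on the current population.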

Makespan and flowtime

Chromosome 1

Job:       J1  J2  J3  J4
Resource:  R1  R1  R2  R2

R1: J1 (2 sec), J2 (3 sec)       C1 = 5 sec
R2: J3 (1.5 sec), J4 (2.5 sec)   C2 = 4 sec

Makespan = max(C1, C2) = max(5, 4) = 5 seconds
Flowtime = C1 + C2 = 5 + 4 = 9 seconds

Chromosome 2

Job:       J1  J2  J3  J4
Resource:  R1  R2  R1  R2
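The arithmetic of this example can be checked with a short script. It is a minimal sketch that follows the slide's definitions (makespan is the largest per-resource completion time, flowtime is the sum of the per-resource completion times) and assumes a job takes the same time on either resource, which the slide does not state for Chromosome 2. The function names are hypothetical.

```python
from collections import defaultdict

# Job run times in seconds, as given in the example.
DURATIONS = {"J1": 2.0, "J2": 3.0, "J3": 1.5, "J4": 2.5}


def completion_times(chromosome, durations):
    """Total busy time of each resource for a job-to-resource mapping."""
    totals = defaultdict(float)
    for job, resource in chromosome.items():
        totals[resource] += durations[job]
    return dict(totals)


def makespan(chromosome, durations):
    """Completion time of the last resource to finish."""
    return max(completion_times(chromosome, durations).values())


def flowtime(chromosome, durations):
    """Sum of the per-resource completion times (the slide's definition)."""
    return sum(completion_times(chromosome, durations).values())


chromosome_1 = {"J1": "R1", "J2": "R1", "J3": "R2", "J4": "R2"}
chromosome_2 = {"J1": "R1", "J2": "R2", "J3": "R1", "J4": "R2"}

for name, chromosome in [("chromosome 1", chromosome_1), ("chromosome 2", chromosome_2)]:
    print(name, makespan(chromosome, DURATIONS), flowtime(chromosome, DURATIONS))
# chromosome 1 -> makespan 5.0, flowtime 9.0
# chromosome 2 -> makespan 5.5, flowtime 9.0
```

Under these assumptions the two chromosomes have the same flowtime, so only the makespan distinguishes them.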

References
[1] Townend, P. and Xu, J. 2003. Fault tolerance within a grid environment. As component of eDemand project at the University of Durham, United Kingdom.
[2] Foster I., Kesselman C., The Grid: Blueprint for a New Computing Infrastructure, The Elsevier Series in Grid Computing.
[3] Foster, I. 2001. The anatomy of the grid: enabling scalable virtual organizations. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid 2001 (Brisbane, Australia, May 15-18, 2001). CCGRID '01. IEEE Computer Society, Washington, DC, USA, 6-7.
[4] Foster I. What is the Grid? A three point checklist, Argonne National Laboratory, fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf, 2002.
[5] Avizienis, A.; Laprie, J.-C.; Randell, B.; Landwehr, C., "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan.-March 2004.
[6] Huda, M.T.; Schmidt, H.W.; Peake, I.D., "An agent oriented proactive fault-tolerant framework for grid computing," First International Conference on e-Science and Grid Computing, 2005, pp. 8-15, July 2005.
[7] Hofer, J.; Fahringer, T., "A Multi-Perspective Taxonomy for Systematic Classification of Grid Faults," 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008. PDP 2008, pp. 126-130, 13-15 Feb. 2008.
[8] Jafar, S.; Krings, A.; Gautier, T., "Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing," Dependable and Secure Computing, IEEE Transactions on, vol. 6, no. 1, pp. 32-44, Jan.-March 2009.
[9] Avizienis, A., "The N-Version Approach to Fault-Tolerant Software," IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491-1501, Dec. 1985.

[10] Schroeder B. and G. A. Gibson (2006). A large-scale study of failures in high-performance computing systems, International Conference on Dependable Systems and Networks, DSN 2006.
[11] Hayashibara N, Cherif A, Katayama T. Failure detectors for large-scale distributed systems, Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems. IEEE Computer Society Press: Los Alamitos, CA, 2002, 404-409, October 2002.
[12] Elnozahy E, Johnson D, Wang Y. A survey of rollback recovery protocols in message-passing systems, ACM Computing Surveys 2002, 34(3), 375-408.
[13] Alvisi L. and K. Marzullo (1998). Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, 24(2), 149-159.
[14] H-C Nam, J. Kim, SJ. Hong and S. Lee. Probabilistic checkpointing, In Proceedings of the Twenty Seventh International Symposium on Fault-Tolerant Computing (FTCS-27), pp. 48-57, June 1997.
[15] Gabriel Rodríguez, Xoán C. Pardo, María J. Martín, Patricia González, Performance evaluation of an application-level checkpointing solution on grids, Future Generation Computer Systems, Volume 26, Issue 7, July 2010, Pages 1012-1023, ISSN 0167-739X, 10.1016/j.future.2010.04.016.
[16] Oliner, A.J.; Sahoo, R.K.; Moreira, J.E.; Gupta, M., "Performance implications of periodic checkpointing on large-scale cluster systems," Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., 4-8 April 2005.
[17] Plank, J.S.; Elwasif, W.R., "Experimental assessment of workstation failures and their impact on checkpointing systems," Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on, pp. 48-57, 23-25 Jun 1998.
[18] Adam J. Oliner, Larry Rudolph, and Ramendra K. Sahoo. 2006. Cooperative checkpointing: a robust approach to large-scale systems reliability. In Proceedings of the 20th annual international conference on Supercomputing (ICS '06). ACM, New York, NY, USA, 14-23.

[19] Oliner, A.; Sahoo, R., "Evaluating cooperative checkpointing for supercomputing systems," Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 8 pp., 25-29 April 2006.
[20] Oliner, A.; Rudolph, L.; Sahoo, R., "Cooperative checkpointing theory," Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, 10 pp., 25-29 April 2006.
[21] Zhiling Lan; Yawei Li, "Adaptive Fault Management of Parallel Applications for High-Performance Computing," Computers, IEEE Transactions on, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
[22] Chtepen M., F. H. A. Claeys, et al. (2009). Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids, IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 2, pp. 180-190, Feb. 2009.
[23] Nazir B., K. Qureshi, et al. (2009). "Adaptive checkpointing strategy to tolerate faults in economy based grid," The Journal of Supercomputing, 50(1), 1-18, 2009.
[24] Antonios Litke, Konstantinos Tserpes, Konstantinos Dolkas, and Theodora Varvarigou. 2005. A task replication and fair resource management scheme for fault tolerant grids. In Proceedings of the 2005 European conference on Advances in Grid Computing (EGC'05), Peter A. Sloot, Alfons G. Hoekstra, Thierry Priol, Alexander Reinefeld, and Marian Bubak (Eds.). Springer-Verlag, Berlin, Heidelberg, 1022-1031.
[25] Qin Z., B. Veeravalli, et al. (2009). On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs, IEEE Transactions on Computers, vol. 58, no. 3, pp. 380-393, March 2009.
[26] Hwang S., and Kesselman C., A flexible framework for fault tolerance in the grid, Journal of Grid Computing, vol. 1, no. 3, pp. 251-272, 2003.
[27] Lopes R. F. and F. J. da Silva e Silva (2006). Fault tolerance in a mobile agent based computational grid, Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops, 2006, vol. 2, 16-19 May 2006.

[28] Kandaswamy G., A. Mandal, et al. (2008). Fault Tolerance and Recovery of Scientific Workflows on Computational Grids, 8th IEEE International Symposium on Cluster Computing and the Grid, 2008. CCGRID '08, pp. 777-782, 19-22 May 2008.
[29] Yang Z., A. Mandal, et al. (2009). Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. CCGRID '09, pp. 244-251, 18-21 May 2009.
[30] SungJin C., B. MaengSoon, et al. (2004). Volunteer availability based fault tolerant scheduling mechanism in desktop grid computing environment, Proceedings of the third IEEE International Symposium on Network Computing and Applications, 2004 (NCA 2004), pp. 366-371, 30 Aug.-1 Sept. 2004.
[31] Hou E. S. H., N. Ansari, et al. (1994). A genetic algorithm for multiprocessor scheduling, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 2, pp. 113-120, Feb 1994.
[32] Song, S., Hwang, K., and Kwok, K. 2006. Risk-resilient heuristics and genetic algorithms for security-assured grid job scheduling. IEEE Transactions on Computers 55, 6 (June 2006), 703-719.
[33] Khanli, L. M., Far, M. E., and Rahmani, A. M. 2010. RFOH: A New Fault Tolerant Job Scheduler in Grid Computing. In Proceedings of the Second International Conference on Computer Engineering and Applications (Bali Island, Indonesia, March 19-21, 2010). ICCEA '10. IEEE Computer Society, Washington, DC, USA, 422-425.
[34] Priya, S.B., Prakash, M., Dhawan, K.K. 2007. Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using Checkpoint. In Proceedings of the Sixth International Conference on Grid and Cooperative Computing (Los Alamitos, CA, Aug. 16-18, 2007). GCC '07. 676-680.
[35] Abdulal, W., and Ramachandram, S. 2011. Reliability-Aware Genetic Scheduling Algorithm in Grid Environment. In Proceedings of the International Conference on Communication Systems and Network Technologies (Katra, Jammu, India, June 03-05). 673-677.
[36] Wu, C., Lai K., and Sun R. 2008. GA-Based Job Scheduling Strategies for Fault Tolerant Grid Systems. In Proceedings of the Asia-Pacific Conference on Services Computing (Dec. 09-12, 2008). IEEE, 27-32.
[37] Dorigo, M.; Gambardella, L.M., "Ant colony system: a cooperative learning approach to the traveling salesman problem," Evolutionary Computation, IEEE Transactions on, vol. 1, no. 1, pp. 53-66, Apr 1997.
[38] Zhihong Xu; Xiangdan Hou; Jizhou Sun, "Ant algorithm-based task scheduling in grid computing," Electrical and Computer Engineering, 2003. IEEE CCECE 2003. Canadian Conference on, vol. 2, pp. 1107-1110, 4-7 May 2003.
[39] Hui Yan; Xue-Qin Shen; Xing Li; Ming-Hui Wu, "An improved ant algorithm for job scheduling in grid computing," Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on, vol. 5, pp. 2957-2961, 18-21 Aug. 2005.
[40] Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, and Ramendra K. Sahoo. 2004. Performance implications of failures in large-scale cluster scheduling. In Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing (JSSPP'04), Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer-Verlag, Berlin, Heidelberg, 233-252.
[41] Dalibor Klusáček and Hana Rudová. 2010. Alea 2: job scheduling simulator. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques (SIMUTools '10). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium.
[42] S. Lorpunmanae, Mohd Sap, A. H. Abdullah and C. C. Inwai, An Ant Colony Optimization for Dynamic Job Scheduling in Grid Environment, International Journal of Computer and Information Science and Engineering, 2007.
[43] Jing Hu, Mingchu Li, Weifeng Sun, Yuanfang Chen, "An Ant Colony Optimization for Grid Task Scheduling with Multiple QoS Dimensions," Grid and Cloud Computing, International Conference on, pp. 415-419, 2009 Eighth International Conference on Grid and Cooperative Computing, 2009.

[44] Ruay-Shiung Chang, Jih-Sheng Chang, Po-Sheng Lin, An ant algorithm for balanced job scheduling in grids, Future Generation Computer Systems, Volume 25, Issue 1, January 2009, Pages 20-27, ISSN 0167-739X, 10.1016/j.future.2008.06.004.
[45] Wei-Neng Chen; Jun Zhang, "An Ant Colony Optimization Approach to a Grid Workflow Scheduling Problem With Various QoS Requirements," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 39, no. 1, pp. 29-43, Jan. 2009.
[46] Gosia Wrzesinska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal (2005). Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid, Proceedings of the 19th IEEE International Symposium on Parallel and Distributed Processing, 2005, pp. 13a, 04-08 April 2005.
[47] Buyya R, Murshed M (2002) GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurr Comput Pract Exp (CCPE) 14(13-15):1175-1220.
[48] Caminero, A.; Sulistio, A.; Caminero, B.; Carrion, C.; Buyya, R., "Extending GridSim with an architecture for failure detection," International Conference on Parallel and Distributed Systems, 2007, vol. 2, pp. 1-8, 5-7 Dec. 2007, doi: 10.1109/ICPADS.2007.4447756.
[49] Failure Trace Archive [Online]. http://fta.inria.fr/apache2-default/pmwiki/index.php.
[50] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. 2010. The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID '10). IEEE Computer Society, Washington, DC, USA, 398-407.
[51] Parallel Workload Archive [Online]. http://www.cs.huji.ac.il/labs/parallel/workload/.
[52] Grid Workload Archive [Online]. http://gwa.ewi.tudelft.nl/pmwiki/.
[53] MATLAB R2010a [Online]. http://www.mathworks.in/help/techdoc/rn/br_03sl.html