Presentation report on various fault tolerance approaches for Grid and Cloud environment.

© All Rights Reserved


GRID ENVIRONMENT

Neeraj Upadhyay Mtech 2nd year

1. Introduction and Motivation

Grid computing: coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.

Scale

Heterogeneity

Dynamicity

Heterogeneity - interaction faults.

Dynamicity - delay and loss of jobs.

Fault Tolerance

Fault tolerance preserves the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service.

Problem Statement

checkpointing based fault tolerance techniques in Grid environment

Algorithm and Ant Colony Optimization with support for fault tolerance technique: adaptive checkpointing.

computational grids by suitably modifying traditional approaches, such as checkpointing, to take into account various characteristics of the Grid environment.

environment

Dynamic

a) Dynamically varying resource conditions, such as fault occurrences.

b) Faults are more likely to occur during some time frames than others, i.e. faults are temporally correlated [10]. For example, faults are more likely on weekdays, when the workload is high, than on weekends. Faults are also more likely to occur during the daytime.

c) Faults are spatially correlated [10].

checkpointing interval.

If it is very low, the cumulative overhead of checkpoint operations will be very high.

Why GA?

scheduling problem.

The GA used is based on the Global Optimization Toolbox of MATLAB [53]. It tends to converge to a good solution quickly.

Why ACO?

Performance of ACO compared to other metaheuristics [37]

Remark

developed in this work is not restricted to any metaheuristic employed. It can be suitably used with any scheduling technique. Our main aim is to show how maintaining information about failure conditions of resources can be used in adapting the checkpointing interval to improve the performance.

Checkpointing

Replication

workflow level fault tolerance techniques

mobile agent based fault tolerance

fault tolerant scheduling

application model specific fault tolerance techniques.

3. Research gaps

based scheduling.

scheduling.

MTTR.

Weibull (MTBF)

Lognormal (MTTR)

4. Work done

Proposed solution

Incorporating fault tolerance in GA-based scheduling in Grid environment

Genetic Algorithm

Initial Population

Fitness Evaluation

Representation of chromosome

[Figure: chromosome as a sequence of job-resource pairs: (J1, R1), (J2, R2), (J3, R3), (J4, R4), ..., Jn]

Crossover Operator

[Figure: Chromosome 1 and Chromosome 2, each mapping jobs J1 to J5 to resources R1 to R5, are combined at a crossover point to produce an offspring]

Mutation Operator

[Figure: mutation of a chromosome that maps jobs J1 to J5 to resources R1 to R5]
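The chromosome encoding, crossover and mutation described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the MATLAB toolbox code used in the work: a chromosome is taken to be a list whose position i holds the resource assigned to job i, crossover is assumed to be single-point, mutation is assumed to move one job to a different resource, and all function names are hypothetical.

```python
import random


def random_chromosome(num_jobs, num_resources, rng):
    """Assign each job (list position) a random resource index."""
    return [rng.randrange(num_resources) for _ in range(num_jobs)]


def single_point_crossover(parent1, parent2, point):
    """Offspring takes parent1's genes before `point` and parent2's from `point` on."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must schedule the same jobs")
    return parent1[:point] + parent2[point:]


def mutate(chromosome, num_resources, rng):
    """Return a copy in which one randomly chosen job moves to another resource."""
    child = list(chromosome)
    job = rng.randrange(len(child))
    choices = [r for r in range(num_resources) if r != child[job]]
    if choices:  # nothing to change when only one resource exists
        child[job] = rng.choice(choices)
    return child


rng = random.Random(0)          # fixed seed so the run is repeatable
p1 = [2, 3, 0, 4, 1]            # job i -> resource p1[i] (illustrative values)
p2 = [1, 2, 3, 0, 4]
child = single_point_crossover(p1, p2, point=2)
mutant = mutate(child, num_resources=5, rng=rng)
```

Encoding the schedule as a flat list keeps both operators simple: any list of valid resource indices is a valid schedule, so no repair step is needed after crossover or mutation.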

Fitness function

Flowtime = $\sum_{r=1}^{m} C_r$

where $C_r$ is the completion time of the jobs allocated to resource r and m is the total number of resources in the Grid.

functions

Second

is execution time of job i in node n,

time of node n

is mean failure

[33] , [36]

If

Where C1 is current system time, LFn is last failure time of node n

and k is an integer 2.

Scheduling

time of resource (a volunteer) registration maintained in Grid Information Service

mean time for which a resource remains in the grid (stay time)

Checkpointing with

downtimes

(9)

account

(10)

based fitness function

1.

Parameters

interval and limits the decrease in checkpointing interval.

is taken as 1 and as 0.5 for experiments to show the applicability of the approach, so the checkpoint interval can vary within (0.5 * check_interval, 2 * check_interval).

Table (one row per resource R1, R2, R3, ..., Rn) with columns: No. of faults, No. of executions

Checkpointing

interval and limits the decrease in

checkpointing interval.

max((FOHT[i][1] FOHT[i][0]),

) ,))

Loop /* outer loop represents an iteration */
    Loop /* inner loop represents a step */
        Each ant incrementally builds a solution by applying the state transition rule and a local pheromone updating rule
    Until all ants have completed building a solution
    Apply global pheromone updating rule
Until terminate_condition

ACO Phases

R=

Local update: Pheromone(r,j) = (1 - ρ) · Pheromone(r,j) + ρ · initial_pheromone

ρ is set to 0.1, β = 1.2, q0 = 0.9 [45]

Global update: Pheromone(r,j) = (1 - ρ) · Pheromone(r,j) + ρ · score

score = 1 + minimum_makespan / makespan
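The two pheromone updating rules can be sketched in Python as follows. This is a hedged illustration: the symbol of the decay parameter is missing from the extracted slide text, so the name `rho` is an assumption (its value 0.1 follows the parameter line above), the initial pheromone value `tau0` is hypothetical, and applying the global rule to the (resource, job) pairs of one chosen solution is also an assumption.

```python
def local_update(pheromone, r, j, rho, initial_pheromone):
    """Local rule: pull entry (r, j) back toward the initial pheromone level."""
    pheromone[r][j] = (1 - rho) * pheromone[r][j] + rho * initial_pheromone


def global_update(pheromone, assignment, rho, makespan, minimum_makespan):
    """Global rule: reinforce every (resource, job) pair of the given solution."""
    score = 1 + minimum_makespan / makespan
    for r, j in assignment:
        pheromone[r][j] = (1 - rho) * pheromone[r][j] + rho * score


rho = 0.1    # decay parameter (value from the slide, name assumed)
tau0 = 0.01  # hypothetical initial pheromone level
pheromone = [[tau0 for _ in range(3)] for _ in range(2)]  # 2 resources x 3 jobs

local_update(pheromone, 0, 1, rho, tau0)
global_update(pheromone, [(0, 1), (1, 2)], rho,
              makespan=120.0, minimum_makespan=100.0)
```

Because the score grows as a solution's makespan approaches the minimum makespan seen so far, better schedules deposit more pheromone on their resource-job pairs.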

Skip

FI(i): fault index of resource i

FI1, FI2, FI3, ..., FIN: fault index values such that FI1 < FI2 < FI3 < ... < FIN

D1, D2, D3, ..., DN: skip parameters that determine the intensity of skipping, such that D1 < D2 < D3 < ... < DN

If (FI(i) > FIN) then
    Perform all checkpoints

If (FIN > FI(i) > FIN-1) then
    Use D1 as skip parameter

...

If (FI1 > FI(i)) then
    Use DN as skip parameter

Exit
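The skip rule above amounts to a threshold lookup, sketched here in Python. The function name, the example threshold and skip values, the use of 0 to mean "perform all checkpoints", and the handling of a fault index exactly equal to a threshold are all assumptions; the slide only gives the strict inequalities.

```python
from bisect import bisect_left


def skip_parameter(fault_index, thresholds, skips):
    """Map a fault index to a skip parameter.

    thresholds: ascending fault index values FI1 < FI2 < ... < FIN
    skips:      ascending skip parameters   D1 < D2 < ... < DN
    Returns 0 (assumed to mean "perform all checkpoints") above FIN.
    """
    if len(thresholds) != len(skips):
        raise ValueError("need one skip parameter per threshold")
    if fault_index > thresholds[-1]:
        return 0
    # Number of thresholds strictly below the fault index.
    below = bisect_left(thresholds, fault_index)
    # The most fault-prone bucket gets D1, the least fault-prone gets DN.
    return skips[len(skips) - 1 - below]


thresholds = [0.1, 0.2, 0.3]  # hypothetical FI1..FI3
skips = [1, 2, 3]             # hypothetical D1..D3
```

For example, with these values a resource with fault index 0.25 is assigned D1, while a rarely failing resource with fault index 0.05 is assigned D3 and therefore skips the most checkpoints.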

Other Techniques

skip

Fault ratio based periodic and exponential skip

Temporal Correlation

i) If a node has not failed recently, then there is a lower probability that it will fail in the near future.

ii) If a node has failed recently, then there is a high probability that it will fail in the near future.

Checkpointing for Temporally Correlated

Failures

AI = AI + I for each checkpoint request in interval ( * MTBF, *

MTBF) where > and

>1

(17)

AI = AI - I for each checkpoint request in interval ( * MTBF, *

MTBF) where > and

<1

I is the initial periodic checkpoint interval and AI is the adapted checkpoint interval. C1 is the current time and Lf is the last failure time of the resource.

as .25, as .5 in our experiments. Higher value of leads to less

opportunity for application of the technique and lower value

decreases performance due to time since last failure being very

small. Similar is the reasoning for .

7/8/16

[Figure: Grid architecture. Components shown: Grid User, Broker, Schedule Advisor (GA or ACO), Gridlet Dispatcher, Gridlet Receptor, Grid Information Service, Fault Manager (FOHT), Checkpoint Server, Grid Resources (Resource1, Resource2, Resource3), Resource MIPS info]

Numbered steps labelled in the figure:

1. Deadline, budget, gridlets
2. Available resources information
3. Current load status
4. Available resource list
5. Optimized resource list
7. Fault value
8. Allocated resource
9. Submit Gridlet
10. Submit checkpoints
12. Increment fault value
13. Get checkpoint
14. Reschedule from last checkpoint
15. Gridlet completion
16. Decrement fault value
17. Submit result

Grid Working

Performance Comparison

techniques.

Comparison can be done with commonly used existing checkpointing techniques

Periodic Checkpointing

Skipping checkpointing techniques

Periodic Skip

Exponential backoff skip

Performance Metrics

a) Makespan: the maximum completion time over all resources, i.e. the time when all jobs finish execution. The completion time of a resource is the point in time when all jobs allocated to that resource complete execution.

b) Flowtime: the sum of the completion times of all the resources.

c) Average bounded slowdown: the ratio between the time taken to execute a job and its CPU time, averaged over all jobs. Sizes of jobs are taken to be comparable to each other.

d) Work lost due to failures: the unsaved work that is lost due to failure of jobs.

e) Utilization: the share of time spent executing jobs, i.e. doing useful work. This does not include the time spent carrying out work that is lost due to failures.

f) Number of checkpoints taken: counted over the entire simulation run for a batch of jobs.

g) Turnaround time: the turnaround time of a job is its finish time minus its submission time.
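A few of these metrics are simple enough to state as code. The Python sketch below assumes per-resource completion times and per-job (submit, finish, CPU time) tuples as inputs; these input shapes and the function names are hypothetical. The slowdown is computed as a plain ratio, because the slides do not give the lower bound that a "bounded" variant would normally apply.

```python
def makespan(completion_times):
    """Latest completion time over all resources."""
    return max(completion_times)


def flowtime(completion_times):
    """Sum of the completion times of all resources."""
    return sum(completion_times)


def turnaround_time(submit, finish):
    """Finish time of a job minus its submission time."""
    return finish - submit


def average_slowdown(jobs):
    """Mean ratio of time taken to CPU time over (submit, finish, cpu_time) jobs."""
    ratios = [turnaround_time(s, f) / cpu for s, f, cpu in jobs]
    return sum(ratios) / len(ratios)


# Illustrative values in seconds (not taken from the experiments).
completion = [100.0, 250.0, 175.0]
jobs = [(0.0, 100.0, 50.0), (10.0, 250.0, 120.0)]
```

With these inputs the makespan is the completion time of the slowest resource (250 seconds), while the flowtime also reflects how the other resources performed.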

Simulation Parameters

Parameter: Value

1. Number of Resources (Clusters): 5
2. Number of Processors per Cluster: 64
3. Number of jobs: 200
4. Computation time per job: 48 hours
5. Checkpoint Overhead: 720 seconds [19]
6. Size (Number of processors) of job: 64
7. Checkpoint Interval: 1000 to 10000 seconds [19]
8. MTBF of Resources: 5 hours to 18 hours
9. Failure Distribution: Weibull (shape parameter 0.7, 1, 1.5) [19]
10. (parameter name missing in the source): 2
11. Crossover fraction: 0.9
12. Initial Population (GA): Number of jobs (200)
13. Number of Ants: Number of resources (5)

Weibull Distribution

[Figure: Weibull probability density function for various shape parameters]
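To relate the Weibull shape parameters to an MTBF, recall that a Weibull distribution with scale λ and shape k has mean λ·Γ(1 + 1/k). The Python sketch below uses that identity to draw times between failures whose mean equals a chosen MTBF. It is an illustration only, not the simulator used in the work: the function names and the sample count are hypothetical, while the shapes 0.7, 1 and 1.5 and the 5 hour MTBF come from the simulation parameters.

```python
import math
import random


def weibull_scale_for_mean(mean, shape):
    """Scale such that the Weibull mean, scale * Gamma(1 + 1/shape), equals `mean`."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def sample_times_between_failures(mtbf_seconds, shape, count, rng):
    """Draw `count` inter-failure times with the requested mean and shape."""
    scale = weibull_scale_for_mean(mtbf_seconds, shape)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
    return [rng.weibullvariate(scale, shape) for _ in range(count)]


rng = random.Random(42)   # fixed seed for repeatability
mtbf = 5 * 3600           # 5 hours, in seconds
samples = {k: sample_times_between_failures(mtbf, k, 20000, rng)
           for k in (0.7, 1.0, 1.5)}
means = {k: sum(v) / len(v) for k, v in samples.items()}
```

A shape below 1 gives a decreasing hazard rate, so a resource that has just failed is more likely to fail again soon; shape 1 reduces to the exponential distribution; a shape above 1 gives an increasing hazard rate.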

Using MTBF of Resources

[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip]

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -2.6% for makespan, -2.2% for flowtime, -8% for average bounded slowdown, +2% for utilization, +5% for work lost due to failures and -9.1% for number of checkpoints taken.

[Figure: overall comparison of GA based adaptive checkpointing using MTBF and GA based periodic checkpointing across makespan, flowtime, number of checkpoints, utilization and average bounded slowdown]

Ratios of Resources

[Figures: makespan (seconds), work lost, flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Exponential_Backoff_Skip (labelled Random_Backoff_Skip in two of the panels)]

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -3.13% for makespan, -13.43% for flowtime, -13.41% for average bounded slowdown, +2.65% for utilization, -22.51% for work lost due to failures and -7.21% for number of checkpoints taken.

[Figure: overall comparison of GA based adaptive checkpointing using fault indexes and GA based periodic checkpointing across makespan, flowtime, number of checkpoints, utilization and average bounded slowdown]

and Periodic Checkpointing for Failure traces

[Figures: makespan (seconds), work lost, flowtime, number of checkpoints, average bounded slowdown and utilization per iteration, comparing adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow failure traces; some adaptive series carry scale labels such as (/10), (/100) or (/1000)]

Values for adaptive checkpointing relative to periodic checkpointing for trace 1 are -1% for makespan, -5% for flowtime, -26.7% for average bounded slowdown, +0.96% for utilization, +58.8% for work lost due to failures, -37.5% for number of checkpoints taken and -5% for average turnaround time.

[Figure: turnaround time per iteration, comparing adaptive and periodic checkpointing on the overnet2, skype, ucb, Notre and Glow failure traces]

[Figure: overall comparison of adaptive and periodic checkpointing for trace 1 across makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints]

Overall comparison (Trace 2)

Values for adaptive checkpointing relative to periodic checkpointing for trace 2 are -2.6% for makespan, -2.7% for flowtime, -29.8% for average bounded slowdown, +2.4% for utilization, +36% for work lost due to failures, -32% for number of checkpoints taken and -3.88% for average turnaround time.

Values for adaptive checkpointing relative to periodic checkpointing for trace 3 are -5% for makespan, -7.88% for flowtime, -42.7% for average bounded slowdown, +5.47% for utilization, +55.43% for work lost due to failures, -47% for number of checkpoints taken and -8.35% for average turnaround time.

[Figures: overall comparison of adaptive and periodic checkpointing for traces 2 and 3 across makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints]

Overall comparison (Trace 4)

Values for adaptive checkpointing relative to periodic checkpointing for trace 4 are -4.4% for makespan, -2.7% for flowtime, -22% for average bounded slowdown, +4.69% for utilization, +1.4% for work lost due to failures, -23.5% for number of checkpoints taken and -3.2% for average turnaround time.

Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -3.8% for makespan, -5.7% for flowtime, -33.43% for average bounded slowdown, +3.92% for utilization, +91% for work lost due to failures, -32% for number of checkpoints taken and -6.7% for average turnaround time.

[Figures: overall comparison of adaptive and periodic checkpointing for traces 4 and 5 across makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints]

and Periodic Checkpointing for Failure traces

[Figures: makespan, work lost, flowtime, number of checkpoints taken, average bounded slowdown, utilization and turnaround time per iteration, comparing adaptive and periodic checkpointing on the overnet, skype, ucb, Notre and Glow failure traces; some adaptive series carry scale labels such as (/10), (/100) or (/1000)]

Trace 1

Values for adaptive checkpointing relative to periodic checkpointing for trace 1 are -4.8% for makespan, -3.7% for flowtime, -22.3% for average bounded slowdown, +5% for utilization, +182% for work lost due to failures, -23% for number of checkpoints taken and -4.1% for average turnaround time.

Trace 2

Values for adaptive checkpointing relative to periodic checkpointing for trace 2 are -2.5% for makespan, -2% for flowtime, -15.7% for average bounded slowdown, +2.4% for utilization, +14% for work lost due to failures, -17.5% for number of checkpoints taken and -1.86% for average turnaround time.

[Figures: overall comparison of adaptive and periodic checkpointing for traces 1 and 2 across makespan, flowtime, utilization, average bounded slowdown, turnaround time, work lost and number of checkpoints]

Trace 3

Values for adaptive checkpointing relative to periodic checkpointing for trace 3 are -2.3% for makespan, -4% for flowtime, -25% for average bounded slowdown, +2.34% for utilization, +35.5% for work lost due to failures, -26.7% for number of checkpoints taken and -4% for average turnaround time.

Trace 4

Values for adaptive checkpointing relative to periodic checkpointing for trace 4 are -2.5% for makespan, -2% for flowtime, -15.7% for average bounded slowdown, +2.4% for utilization, +14.13% for work lost due to failures, -17.5% for number of checkpoints taken and -1.8% for average turnaround time.

[Figures: overall comparison of adaptive and periodic checkpointing for traces 3 and 4 across makespan, flowtime, utilization, average bounded slowdown, turnaround time, work lost and number of checkpoints]

Trace 5

Values for adaptive checkpointing relative to periodic checkpointing for trace 5 are -5.4% for makespan, -5.49% for flowtime, -33.14% for average bounded slowdown, +5.7% for utilization, +86.7% for work lost due to failures, -33.14% for number of checkpoints taken and -5.5% for average turnaround time.

[Figure: overall comparison of adaptive and periodic checkpointing for trace 5 across makespan, flowtime, utilization, turnaround time, work lost and number of checkpoints]

Performance Evaluation: MTBF and Last Failure Based Adaptive Checkpointing for Temporally Correlated Failures

[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization per iteration, for adaptive and periodic checkpointing with w = 2 and w = 4; number of checkpoints, completion time and fault occurrences (number of faults) are also shown per resource (resource 1 and resource 2)]

periodic Skip (Part 1)

[Figures: makespan (seconds), work lost (seconds), flowtime, number of checkpoints and average bounded slowdown for Periodic Skip and Adaptive Periodic Skip, plotted over horizontal-axis values from 150 to 400]

periodic Skip (Part 2)

[Figures: makespan (seconds), work lost, flowtime (seconds), number of checkpoints taken, average bounded slowdown (seconds) and utilization per iteration for Adaptive periodic skip and Periodic skip; number of faults and completion time are also shown per resource (Resource 1 and Resource 2)]

Overall comparison

Values for adaptive checkpointing relative to periodic checkpointing are -6.42% for makespan, -2.25% for flowtime, -10.1% for average bounded slowdown, +5.36% for utilization, -11.425% for work lost due to failures and -0.79% for number of checkpoints taken.

[Figure: overall comparison of Adaptive periodic skip and Periodic skip across makespan, flowtime, number of checkpoints, work lost and utilization]

MTBF of Resources

[Figures: makespan (seconds), work lost due to failures (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization, each plotted against checkpoint interval (seconds) for Adaptive_Checkpointing, Periodic_Checkpointing (with scheduling assisted fault tolerance) and Periodic Checkpointing]

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -8.7% for makespan, -4.22% for flowtime, -7.8% for average bounded slowdown, +6.569% for utilization, -3.24% for work lost due to failures and -10% for number of checkpoints taken.

[Figure: overall comparison of Ant Colony based adaptive checkpointing using MTBF and Ant Colony based periodic checkpointing across makespan, flowtime, number of checkpoints and utilization]

Fault Ratios of Resources

[Figures: makespan (seconds), work lost, flowtime, number of checkpoints, average bounded slowdown (seconds) and utilization for Adaptive_Checkpointing, Periodic_Checkpointing, Periodic_Skip and Random_Backoff_Skip]

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -6.95% for makespan, -1.55% for flowtime, -7.7% for average bounded slowdown, +7.17% for utilization, -28.2% for work lost due to failures and +30% for number of checkpoints taken.

[Figure: overall comparison of Ant Colony based adaptive checkpointing using fault indexes and Ant Colony based periodic checkpointing across makespan, flowtime, number of checkpoints, utilization and average bounded slowdown]

Resources for spatially and temporally Correlated Failures

[Figures: makespan (seconds), work lost (seconds), flowtime (seconds), number of checkpoints, average bounded slowdown (seconds) and utilization per iteration for GA based adaptive checkpointing and GA based periodic checkpointing]

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -3.5% for makespan, -3.1% for flowtime, -17.84% for average bounded slowdown, +1.395% for utilization, -24.84% for work lost due to failures, and -5.76% for number of checkpoints taken.

[Figure: makespan, flowtime, utilization and number of checkpoints for adaptive checkpointing using fault indexes versus GA based periodic checkpointing.]

Resources for spatially and temporally Correlated Failures

[Figure: makespan (seconds) and work lost versus iteration (legend: Ant Colony Algorithm).]

Resources for spatially and temporally Correlated Failures

[Figure: flowtime (seconds) and number of checkpoints versus iteration for Adaptive Ant Colony Algorithm and Ant Colony Algorithm.]

Resources for spatially and temporally Correlated Failures

[Figure: average bounded slowdown (seconds) and utilization versus iteration for Adaptive Ant Colony Algorithm and Ant Colony Algorithm.]

Resources for spatially and temporally Correlated Failures

Overall Comparison

Values for adaptive checkpointing relative to periodic checkpointing are -2.3% for makespan, -3.8% for flowtime, -20.5% for average bounded slowdown, +0.97% for utilization, -25.3% for work lost due to failures, and -7.52% for number of checkpoints taken.

[Figure: makespan, flowtime, utilization and number of checkpoints (legend: Ant Colony based adaptive checkpointing using fault indexes).]

Trace 1

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -20.5% for average bounded slowdown, -11.7% for work lost due to failures, -21% for number of checkpoints taken and -4.3% for average turnaround time.

Trace 2

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -15% for average bounded slowdown, 31% for work lost due to failures, -40.5% for number of checkpoints taken and -1.3% for average turnaround time.

[Figures for Trace 1 and Trace 2: makespan, flowtime, average bounded slowdown, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

Trace 3

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -19.76% for average bounded slowdown, -13% for work lost due to failures, -29% for number of checkpoints taken and -4.85% for average turnaround time.

Trace 4

Values of adaptive checkpointing technique relative to periodic checkpointing are -2.2% for makespan, -2.1% for flowtime, -20.3% for average bounded slowdown, -4.5% for work lost due to failures, -20.5% for number of checkpoints taken and -4% for average turnaround time.

[Figures for Trace 3 and Trace 4: makespan, flowtime, average bounded slowdown, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

traces

Trace 5

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -13.6% for average bounded slowdown, -7.2% for work lost due to failures, -26.8% for number of checkpoints taken and -1.85% for average turnaround time.

[Figure for Trace 5: makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

traces

Trace 1

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, 0% for flowtime, -39.6% for average bounded slowdown, -87% for work lost due to failures, -38% for number of checkpoints taken and -9% for average turnaround time.

Trace 2

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, -3% for flowtime, -22% for average bounded slowdown, +34% for work lost due to failures, -51% for number of checkpoints taken and -2% for average turnaround time.

[Figures for Trace 1 and Trace 2: makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

traces

Trace 3

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, -1.6% for flowtime, -43.6% for average bounded slowdown, -69% for work lost due to failures, -39.1% for number of checkpoints taken and -13% for average turnaround time.

Trace 4

Values of adaptive checkpointing technique relative to periodic checkpointing are -2% for makespan, -2.3% for flowtime, -20.3% for average bounded slowdown, -9.56% for work lost due to failures, -24.5% for number of checkpoints taken and -3.5% for average turnaround time.

[Figures for Trace 3 and Trace 4: makespan, flowtime, average bounded slowdown, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

traces

Trace 5

Values of adaptive checkpointing technique relative to periodic checkpointing are 0% for makespan, +9% for flowtime, -31.5% for average bounded slowdown, -38% for work lost due to failures, -43% for number of checkpoints taken and -6% for average turnaround time.

[Figure for Trace 5: makespan, flowtime, work lost, turnaround time and number of checkpoints for adaptive checkpointing versus periodic checkpointing.]

Conclusions

incorporation in Genetic Algorithm (GA). These heuristics are based on information related to the reliability of resources, such as MTBF, fault index and fault ratios. All adaptive checkpointing heuristics have been compared with GA-based periodic checkpointing for a wide range of scenarios.

Incorporation of the designed heuristics in Ant Colony Optimization based scheduling in Grid.

Design of fault index based periodic skip technique and its performance comparison with periodic skip.

Design of adaptive checkpointing based on information about MTBF and last failure time of resources.

Design of experimental scenarios for testing the performance of various techniques for temporally and spatially correlated failures.

Performance comparison of ACO-based and GA-based fault tolerance techniques using real failure traces available from the Failure Trace Archive.

Performance comparison of ACO-based and GA-based fault tolerance techniques for real workload traces available from various parallel workload archives.
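The conclusions mention adaptive checkpointing driven by the MTBF and the last failure time of a resource, but this part of the slides does not spell out the decision rule. The sketch below is therefore only one plausible reading, not the author's heuristic: a periodic checkpoint request is honoured once the time since the last failure reaches a chosen fraction of the MTBF, and skipped otherwise. The names `should_checkpoint` and `checkpoints_taken` and the `threshold` parameter are hypothetical.

```python
from typing import Iterable, List


def should_checkpoint(now: float, last_failure_time: float, mtbf: float,
                      threshold: float = 0.5) -> bool:
    """Assumed rule: checkpoint only when the resource has run for at least
    threshold * MTBF since its last failure, so a new failure is more likely."""
    if mtbf <= 0:
        raise ValueError("mtbf must be positive")
    elapsed = now - last_failure_time
    return elapsed >= threshold * mtbf


def checkpoints_taken(request_times: Iterable[float], last_failure_time: float,
                      mtbf: float, threshold: float = 0.5) -> List[float]:
    """Filter periodic checkpoint requests through the assumed rule."""
    return [t for t in request_times
            if should_checkpoint(t, last_failure_time, mtbf, threshold)]


# Hypothetical numbers: requests every 100 s, last failure at t = 0, MTBF = 800 s.
requests = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
print(checkpoints_taken(requests, last_failure_time=0.0, mtbf=800.0))
```

Skipping early requests is what would reduce the number of checkpoints relative to periodic checkpointing. The price is more work at risk if a failure arrives sooner than the MTBF suggests.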

Future Work

The workload and failure traces used in this work are small portions of the available traces. Future work will focus on using complete traces for evaluation.

Experiments have been performed for workload and failure traces separately. Future work will use both a workload trace and a failure trace in the same experiment.

Downtime (MTTR) of resources is ignored in this work, and a resource is assumed to recover immediately from failure. This assumption will be removed in future work.

The checkpointing technique used considers restart after failure on the same resource. An alternative is checkpointing with migration, where the job is restarted on a different resource. This technique, along with its issues such as spare node allocation, remains to be investigated.

The heuristics developed for fault tolerance are not restricted to metaheuristics; they can be incorporated in any scheduling algorithm. Future work will look into that.

This work considers only transient faults on resources. Other fault classes are not considered.

Finally, future work will focus on working in an actual Grid setup rather than a simulated one.
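Two of the assumptions listed above, restart on the same resource and immediate recovery (MTTR ignored), can be made concrete with a small simulation. This is a minimal sketch and not the simulator used in the work. The function `run_with_checkpoints`, its parameters and the sample numbers are hypothetical, and `mttr=0.0` reproduces the immediate-recovery assumption.

```python
from typing import List, Tuple


def run_with_checkpoints(work: float, interval: float, overhead: float,
                         failures: List[float],
                         mttr: float = 0.0) -> Tuple[float, float, int]:
    """Simulate one job with periodic checkpointing and restart on the same resource.

    work:     seconds of computation the job needs
    interval: seconds of computation between checkpoints
    overhead: seconds needed to write one checkpoint
    failures: sorted wall-clock failure times of the resource
    mttr:     downtime after each failure (0 means immediate recovery)
    Returns (finish_time, work_lost, checkpoints_taken).
    """
    clock = 0.0   # wall-clock time
    saved = 0.0   # computation protected by the last checkpoint
    lost = 0.0    # computation redone because of failures
    taken = 0
    pending = list(failures)
    while saved < work:
        segment = min(interval, work - saved)
        last_segment = saved + segment >= work
        # No checkpoint is written after the final segment.
        end = clock + segment + (0.0 if last_segment else overhead)
        # Ignore failures that fall before the current time (e.g. during downtime).
        while pending and pending[0] < clock:
            pending.pop(0)
        if pending and pending[0] < end:
            fail = pending.pop(0)
            lost += min(fail - clock, segment)  # unsaved computation is redone
            clock = fail + mttr                 # roll back to the last checkpoint
            continue
        clock = end
        saved += segment
        if not last_segment:
            taken += 1
    return clock, lost, taken


# Hypothetical job: 100 s of work, checkpoint every 30 s, 5 s overhead, one failure at t = 50.
print(run_with_checkpoints(100, 30, 5, [50.0]))             # immediate recovery
print(run_with_checkpoints(100, 30, 5, [50.0], mttr=20.0))  # with downtime
```

Comparing the two calls shows how a non-zero MTTR lengthens completion time without changing the work lost. Checkpointing with migration would instead continue the job on a different resource after the failure.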

7. Publications

Incorporating fault tolerance in GA-based scheduling in Grid environment. In Proceedings of the World Congress on Information and Communication Technologies (Mumbai, India, Dec. 11-14). WICT 2011. IEEE, 776-781.

Heuristic (Wikipedia)

Heuristic (/hjʊˈrɪstɪk/; or heuristics; Greek: "Εὑρίσκω", "find" or "discover") refers to experience-based techniques for problem solving, learning, and discovery. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution.

Heuristic (Wikipedia)

to solve a problem that ignores whether the solution can be proven to be correct, but which usually produces a good solution or solves a simpler problem that contains or intersects with the solution of the more complex problem.

Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.

In their Turing Award acceptance speech, Herbert Simon and Allen Newell discuss the Heuristic Search Hypothesis: a physical symbol system will repeatedly generate and modify known symbol structures until the created structure matches the solution structure.

Meta-heuristic (Wikipedia)

In computer science, metaheuristic designates a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality.

ACO vs GA

Retains memory of the entire colony instead of the previous generation only.

Less affected by poor initial solutions (due to the combination of random path selection and colony memory).

Chromosome 1

Job:       J1  J2  J3  J4
Resource:  R1  R1  R2  R2

J1: 2 sec
J2: 3 sec
C1 = 5 sec
C2 = 4 sec

Flowtime = C1 + C2 = 5 + 4 = 9 seconds

Chromosome 2

Job:       J1  J2  J3  J4
Resource:  R1  R2  R1  R2
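The chromosome slides encode a schedule as a job-to-resource mapping and add the per-resource completion times (C1 + C2 = 5 + 4 = 9 seconds) to obtain flowtime. A minimal sketch of that arithmetic is given below. It assumes that Chromosome 1 places J1 and J2 on R1 and J3 and J4 on R2, and it invents 2-second durations for J3 and J4 so that C2 = 4, since the slide only gives J1 and J2. The helper name `completion_times` is hypothetical.

```python
from collections import defaultdict
from typing import Dict


def completion_times(chromosome: Dict[str, str],
                     durations: Dict[str, float]) -> Dict[str, float]:
    """Sum the durations of the jobs assigned to each resource."""
    totals: Dict[str, float] = defaultdict(float)
    for job, resource in chromosome.items():
        totals[resource] += durations[job]
    return dict(totals)


# J1 and J2 come from the slide; J3 and J4 are assumed values chosen so that C2 = 4.
durations = {"J1": 2.0, "J2": 3.0, "J3": 2.0, "J4": 2.0}

# Assumed reading of the two chromosomes (job -> resource).
chromosome_1 = {"J1": "R1", "J2": "R1", "J3": "R2", "J4": "R2"}
chromosome_2 = {"J1": "R1", "J2": "R2", "J3": "R1", "J4": "R2"}

for name, chromosome in (("Chromosome 1", chromosome_1),
                         ("Chromosome 2", chromosome_2)):
    c = completion_times(chromosome, durations)
    flowtime = sum(c.values())   # slide's definition: C1 + C2
    makespan = max(c.values())   # latest resource completion time
    print(name, c, "flowtime =", flowtime, "makespan =", makespan)
```

Here flowtime follows the slide's own formula (the sum of resource completion times), and makespan is taken as the largest of those completion times.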

