
LECTURE-1: Overview

Prof. Anand Mohan


Electronics Engineering, IT-BHU
Fault Tolerant System ⇒ Correctly Performs Specified Task in Presence of Hardware Failures / Software Errors

Fault Tolerance ⇒ System's Attribute to Achieve Fault Tolerant Operation

Easily Testable System ⇒ Simple & Straightforward Verification of Correct Operation
Approaches for Design of Dependable Systems
• Fault avoidance: Preventing fault occurrence in the operational system; limits introduction of faults during system construction
• Fault prevention: Attempts to eliminate the possibility of faults creeping in
• Fault tolerance: Masking the effect of faults (system meets specifications in the presence of faults)
• Fault removal: Reduces the presence / number / seriousness of faults
• Fault forecasting: Estimates the present number / future incidence & consequences of faults
• Although fault forecasting can be used for fault avoidance, fault prevention, fault removal & fault tolerance, it is difficult to ensure that a system shall not develop faults

• Hardware / Software / Networks can't be totally free from failures

• Therefore fault tolerant design becomes an important system attribute for dependability
System Design Process (figure)

Three Barriers Created due to Design Techniques (figure)
Fault Tolerant Systems
"To find fault is easy; to do better may be difficult" -- Plutarch

A "fault-tolerant system" is one that continues to perform at a desired level of service in spite of failures in some components
• Fault Tolerance ⇒ System Property to Recover from Partial Failure

• Performance is affected by a Partial Failure, but it should not stop System Operation

• Fault tolerance measures attempt to minimize the impact of partial failure on the operation & performance of the system

• Essence of a Fault Tolerant System is Dependability

• Fault Tolerance ⇒ QoS Measure to be achieved with minimal user involvement
Attributes of Dependable Systems:
 Availability
 Reliability
 Safety
 Maintainability
 Testability

QoS ⇒ Availability + Reliability + Safety + Maintainability + Testability + Performability
• Availability : System is ready to be used immediately
• Reliability : System can run consistently without complete failure
• Safety : If the system temporarily fails to operate correctly, nothing catastrophic happens (e.g., stopping a nuclear reaction on leakage detection)

Safety ⇒ Probability of the system's correct performance, or else discontinuing function with overall safety ⇒ Measure of Fail-Safe Capability
• Maintainability : Ease in repairing a failed system

• Testability : Measure of ease in testing certain attributes of a HW/SW system
Goal of Fault Tolerant Design
Fulfilling Performability P(L, t) & Dependability requirements of a system

• Performability P(L, t) ⇒ Probability of System Performance at or above some level 'L' at time 't' (e.g., reduced throughput / reduced available memory)

• Graceful Degradation ⇒ System's ability to automatically degrade its performance level to compensate for HW failures / Software errors; closely related to Performability
Failure Modes:
• Fail-Action (Fail-Safe) – Protective system initiates protective action on fault occurrence

• Fail-No-Action (Fail-to-Danger) – No protective action can be taken upon occurrence of a fault
Fault Response:
• Covert Faults – Hidden or non-self-revealing faults
  • Difficult to detect (no fault response)
  • Could result in a fail-to-danger situation
• Overt Faults – Obvious or self-revealing faults
  • Overt faults may result in unnecessary shutdown if the system does not possess fault tolerance
  • Fail-safe designs ensure a safe state of system operation
Failure Models
• Crash failures: Device / Component simply halts, but behaves correctly before halting
• Omission failures: Device / Component fails to respond (send / receive a message)
• Timing failures: The output of a component is correct, but lies outside a specified real-time interval (e.g. performance failure: too slow)
• Response failures: The output of a component is incorrect (& the error cannot be attributed to another component)

Problem with Crash Failures ⇒ Clients cannot distinguish between a crashed component & one that is just a bit slow

• Value failure: A wrong value is produced

• State transition failure: Execution of a service brings the component into a wrong state

• Arbitrary (Byzantine) failures: A component may produce arbitrary output and exhibit arbitrary timing failures

• Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
• Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong
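To make the fail-stop idea concrete, here is a minimal timeout-based detector sketch in Python; the class and method names are hypothetical, not from any standard library:

import time

class HeartbeatDetector:
    """Minimal timeout-based failure detector (sketch).
    The monitored component calls heartbeat() periodically; the monitor
    suspects a crash once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Called by (or on behalf of) the monitored component.
        self.last_beat = time.monotonic()

    def suspected_crashed(self):
        # A timeout cannot distinguish a crashed component from one that
        # is just a bit slow -- exactly the crash-failure problem above.
        return time.monotonic() - self.last_beat > self.timeout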

Fault Tolerance – Standards
• Safety Instrumented System (SIS): Instrumentation or controls that are responsible for bringing a process to a safe state in the event of a failure
• Safety Integrity Level (SIL): Statistical representation of the availability of a SIS at the time of a process demand
Fault Tolerant Strategies
Fault tolerance is achieved through Redundancy in hardware, software, information, and/or time (computations) ⇒ Static / Dynamic / Hybrid Redundancy

Redundancy is exploited either by Fault Masking, i.e. preventing the introduction of errors into the system due to faults, or by Reconfiguration, the process of eliminating the faulty component and restoring the system to some operational state
Reconfiguration Steps
• Fault detection ⇒ Recognizing fault occurrence; often required before any recovery procedure can be initiated
• Fault location ⇒ Locating the faulty component / module; required for initiating the appropriate recovery process
• Fault containment ⇒ Isolating the fault and preventing propagation of its effects throughout the system
• Fault recovery ⇒ Process of remaining operational / regaining operational status via reconfiguration
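As an illustration of these four steps, a minimal Python sketch over a pool of redundant modules; `reconfigure` and `output_ok` are hypothetical names, and the acceptance check stands in for whatever diagnosis a real system performs:

def reconfigure(active, spares, output_ok):
    """One reconfiguration pass over a pool of redundant modules (sketch)."""
    for module in list(active):
        if output_ok(module):              # fault detection & location:
            continue                       # module's output judged acceptable
        active.remove(module)              # fault containment: isolate the module
        if spares:
            active.append(spares.pop())    # fault recovery: switch in a spare
    return active                          # possibly a degraded configuration

For example, reconfigure(["m1", "m2"], ["spare"], lambda m: m != "m2") would isolate m2 and switch in the spare.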
Fault Tolerance Necessity
• Long-Life Systems ⇒ Satellites, Unmanned / Manned Space Probes, Space Shuttles (Challenger & Columbia catastrophes: could FT design help??)
• Critical Control / Computation Environments ⇒ Aircraft Controllers, Life Support / Diagnostic Medical Systems
• High-Availability ⇒ Telephone Switching Systems, Transaction Processing & Banking (Stock Market Systems / ATM)
• Hazardous Production ⇒ Chemical industry (methyl isocyanate: the Bhopal gas tragedy; nitric acid!)
• Maintenance Postponement ⇒ Expensive Maintenance (e.g. Space Applications)
Prevention from Fault Consequences
• Training operations / maintenance personnel on protective system operation
• Simulated emergency training, both initial and refresher
• Review of protective system adequacy when modules / units are changed, considering performance history
• Design verification through both qualitative and quantitative review exercises
Lecture 2
Reliability & Related Issues

Reliability ⇒ Probability of Failure-Free Performance Under Stated Conditions & for a Specified Time Period

Reliability is often quoted as 0.99…9 with 'i' nines, where 'i' is the no. of 9s
Factors Affecting Reliability:
• Design ⇒ Reliability Included as a Design Parameter ⇒ Top-level Reliability Requirements Allocated to Subsystems
• Environment ⇒ (Temperature, Location & Velocity) ⇒ Difference in Manufacturing & Operational Environmental Conditions
• Components ⇒ Using High Quality Components ⇒ Increased System Cost
RELIABILITY ENHANCEMENT TECHNIQUES:
• High Quality Components : e.g. Low Tolerance Active / Passive Components
• Quality Control Procedures : Adhering to High Quality Assembly Standards
• Worst Case Design : Design Considering Worst Case Parameters

Above Procedures ⇒ Effective But Costly ⇒ Cost Effective Approach ⇒ Redundancy

REDUNDANCY ⇒ Used for "Masking Effects of Faults"
• Does Not Require High Quality Components
• Uses Standard Components ⇒ Redundant / Reconfigurable Self-Repairing Systems

REDUNDANT & RECONFIGURABLE ARCHITECTURE
• Redundancy is an Important Design Technique
• An Automobile Brake Light Might Use 2 Light Bulbs; if One Fails, the Other is Available
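For instance, assuming each bulb survives a given period with probability 0.9 (a made-up figure), the brake light fails only if both bulbs fail: R = 1 − (1 − 0.9)² = 0.99. The parallel-system formula at the end of this lecture generalizes this.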
Reliability : R(t) = S(t) / N (fraction of Surviving Components)
Unreliability : Q(t) = F(t) / N (fraction of Failed Components)

Reliability + Unreliability = One : R(t) + Q(t) = 1

S(t) ⇒ No. of Surviving components
F(t) ⇒ No. of Failed components
N ⇒ No. of Identical components
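These definitions translate directly into code; a minimal Python sketch with made-up counts:

N   = 1000            # identical components placed on test
S_t = 920             # components still surviving at time t
F_t = N - S_t         # components failed by time t

R_t = S_t / N         # Reliability   R(t) = S(t)/N  -> 0.92
Q_t = F_t / N         # Unreliability Q(t) = F(t)/N  -> 0.08
assert abs((R_t + Q_t) - 1.0) < 1e-12   # R(t) + Q(t) = 1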
FAILURE / HAZARD RATE ⇒ Number of Failures Per Unit Time Compared to the Number of Surviving Components

Failure / Hazard Rate : Z(t) = [dF(t)/dt] / S(t) = λ
Estimating Useful Life Period

BATH-TUB CURVE ⇒ Failure Rates of Components Under Normal Conditions

(figure: Failure Rate vs. Time)
• Early life / burn-in period ⇒ Component(s) & interconnection defect(s)
• Useful Life Period ⇒ Constant Failure Rate (λ) (Random Failures)
• Wear-out period / End of Life Period ⇒ Ageing effect
EXPONENTIAL FAILURE LAW ⇒ Relates Reliability & Failure Rate

R(t) = exp(−λt)

For small λt : R(t) ≈ 1 − λt

λ is expressed in Percentage Failures Per 1000 Hours or Failures Per Hour

SYSTEM FAILURE RATE ⇒ System Failures Result from Component Failures

If there are 'K' Types of Components, each with Failure Rate λk and Nk components of each type:

λov = Σ Nk λk ; k = 1 to K
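Combining the overall failure rate with the exponential failure law, a short Python sketch; the parts list and rates are made-up for illustration:

import math

# Hypothetical parts list: (Nk, λk) = (count, failure rate in failures/hour)
parts = [(10, 2e-6), (4, 5e-6), (1, 1e-5)]

lam_ov = sum(n * lam for n, lam in parts)    # λov = Σ Nk·λk  ->  5e-5 per hour

t = 1000.0                                   # operating time in hours
R = math.exp(-lam_ov * t)                    # R(t) = exp(-λt)        -> ~0.9512
R_approx = 1 - lam_ov * t                    # small-λt approximation -> 0.95
print(f"lambda_ov = {lam_ov:.1e}/h, R({t:.0f} h) = {R:.4f} (approx. {R_approx:.4f})")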


RELIABILITY ⇒ Depends on Operating Conditions & Time of Operation
Reliability alone is Not Suitable for Realistic (Repairable) Conditions
MEAN TIME BETWEEN FAILURE ⇒ Area Underneath the Reliability Curve, or the Average Time a System will Run Between Failures

MTBF = ∫₀^∞ R(t) dt = 1/λ (hours)

For small λt : MTBF ≈ t / [1 − R(t)]
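The identity MTBF = ∫₀^∞ R(t) dt = 1/λ can be sanity-checked numerically; a quick Python sketch with an assumed λ:

import math

lam = 1e-4            # assumed failure rate, failures/hour
dt = 1.0              # integration step, hours
# Rectangle-rule integral of R(t) = exp(-lam*t); 200,000 h is effectively infinity here.
area = sum(math.exp(-lam * k * dt) * dt for k in range(200_000))
print(area, 1 / lam)  # both come out at about 10,000 hours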
RELIABILITY CURVE (figure: Reliability R(t), from 1.0 down, vs. Time t marked at 1 MTBF, 2 MTBF, 3 MTBF)
• Reliability Decreases with Increasing Time

MAINTAINABILITY : M(t) ⇒ Probability of Isolating & Repairing a Fault within Time t

M(t) = 1 − exp(−µt) = 1 − exp(−t / MTTR)

µ ⇒ Repair Rate ; MTTR ⇒ Mean Time to Repair (MTTR = 1/µ)
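A minimal Python sketch of this repair model, with an assumed MTTR:

import math

def maintainability(t, mttr):
    """M(t) = 1 - exp(-t/MTTR): probability that a fault is isolated and
    repaired within time t, for an exponential repair rate mu = 1/MTTR."""
    return 1.0 - math.exp(-t / mttr)

MTTR = 4.0                           # mean time to repair, hours (assumed)
print(maintainability(2.0, MTTR))    # ~0.39: repaired within 2 h
print(maintainability(8.0, MTTR))    # ~0.86: repaired within 8 h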


SYSTEM REPAIR TIME (for Maintainability)
= Passive Repair Time (Traveling Time to Site) + Active Repair Time (Actual Repair Time)

Repair Time = Time Between Occurrence & Awareness of Failure + Time to Detect Fault & Isolate Faulty Component + Time to Replace Faulty Component + Time to Verify Correct System Operation

Repair Time Reduction ⇒ Replaceable Components with Self-Test Feature
AVAILABILITY : A(t) ⇒ Function of Time ⇒ Defined as the Probability of Correct System Operation & its Availability to Perform its Functions at Time Instant 't' (Time Instant Instead of Time Interval as in Reliability)

TYPES of AVAILABILITY

Inherent Availability ⇒ Depends on Inherent Design ⇒ Theoretical (Maximum) Value

Operational Availability ⇒ Depends on Inherent Design, Availability of Spare Parts & Maintenance Policy
• Highly Available Systems ⇒ May Have Frequent Inoperability Periods of Extremely Short Duration
• Availability ⇒ Depends Upon Frequency of Inoperability & Quickness of Repairing
• Availability ⇒ Important Design Goal when Providing Services as Often as Possible is the Primary System Purpose
• Highly Available Systems Possess ⇒ High Probability of Performing Correctly at the Desired Instant
• High Availability Applications ⇒ Time-Shared Computing / Transaction Processing Systems
Availability = System up time / (System up time + System down time)
= System up time / [System up time + (No. of Failures × MTTR)]
= System up time / [System up time + (System up time × λ × MTTR)]
= 1 / (1 + λ × MTTR)
= MTBF / (MTBF + MTTR)   (since λ = 1/MTBF)
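The derivation reduces to one line of arithmetic; a Python sketch with made-up MTBF / MTTR figures:

MTBF = 2000.0    # mean time between failures, hours (assumed)
MTTR = 4.0       # mean time to repair, hours (assumed)

lam = 1.0 / MTBF
a_rate = 1.0 / (1.0 + lam * MTTR)     # A = 1 / (1 + lambda * MTTR)
a_mtbf = MTBF / (MTBF + MTTR)         # A = MTBF / (MTBF + MTTR)
assert abs(a_rate - a_mtbf) < 1e-12   # the two forms agree
print(f"A = {a_mtbf:.5f} ({a_mtbf * 100:.2f}% uptime)")   # ~99.80%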

If MTTR Decreases ⇒ Availability Increases ⇒ Economical System

Fault Tolerance Improves a System's Availability ⇒ Spare / Standby Components

INHERENT AVAILABILITY (Theoretical Maximum) : Ai = MTBF / (MTBF + MTTR)

If MTBF Increases [R(t) Increases] relative to MTTR ⇒ Availability Increases
For a Low MTBF we require Better Maintainability (i.e., short MTTR) to obtain the Same Availability
⇒ Tradeoff between Reliability & Maintainability to obtain the Same Availability
OPERATIONAL AVAILABILITY : Ao = MTBM / (MTBM + MDT)
MTBM ⇒ Mean Time Between Maintenance
MDT ⇒ Mean Down Time
Figure 1 : Operational Availability Relationships (plot of Mean Time Between Maintenance vs. Mean Down Time for fixed availability levels A = 85% to 99.9%)
• For a Fixed % Availability, Reliability Improves as the Time Between Failure / Maintenance Actions Increases
• Maintainability Improves as the Time to Repair Decreases
SAFETY : S(t) ⇒ Probability of Correct System Performance, or else Discontinuing Function with Overall Safety (of other systems or people)

Measure of the "Fail-Safe" Capability of a System

PERFORMABILITY : P(L, t) ⇒ Probability that the System's Performance will be At, or Above, Some Level 'L' at the Instant of Time 't'

Lowered Performance (e.g. in a Multiprocessor) ⇒ Reduced Throughput or Reduced Available Memory
DIFFERENCE BETWEEN RELIABILITY & PERFORMABILITY:
Reliability ⇒ All of the Functions are Performed Correctly
Performability ⇒ A Subset of the Functions is Performed Correctly

GRACEFUL DEGRADATION ⇒ System's Ability to Automatically Decrease its Level of Performance to Compensate for Hardware Failures & Software Errors

Fault Tolerance can Provide Graceful Degradation & Improve Performability by Eliminating Failed Hardware and Software from the System
TEST ⇒ Determining the Existence & Quality of Certain Attributes within a System, e.g. Designing a Test to Verify a Processor's Throughput

TESTABILITY ⇒ Ability to Test for Certain Attributes within a System
• Can be Improved by Including Testing as an Integral Part of the Design
• Measure of Testability is the Ease of Testing
• Testability ⇒ Related to Maintainability ⇒ Minimizing the Time to Identify & Locate Problems
Purpose of Testing:
• Does the System Work?
• Does the System Possess Complete Capability?

DEPENDABILITY ⇒ QoS Provided by a Particular System

Reliability, Availability, Safety, Maintainability, Performability & Testability are Measures to Quantify System Dependability
SERIES SYSTEM ⇒ Each subsystem must function

1 → 2 → … → N

Overall Reliability : Rov = Π Ri ; i = 1 to N

For identical subsystems (Ri = R):
Rov = R^N (decreases with N) ; MTBF = 1 / (Nλ) (decreases by factor N)

⇒ Need highly reliable individual subsystems

PARALLEL SYSTEM ⇒ One functioning subsystem is sufficient

Overall Reliability : Rov = 1 − Π (1 − Ri) ; i = 1 to N

For identical subsystems (Ri = R):
Rov = 1 − (1 − R)^N ; MTBF = (1/λ) Σ (1/j) ; j = 1 to N
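Both formulas in Python, for N identical subsystems; R and N here are made-up:

def series_reliability(r, n):
    """All N subsystems must work: Rov = R**N."""
    return r ** n

def parallel_reliability(r, n):
    """Any one of N subsystems suffices: Rov = 1 - (1 - R)**N."""
    return 1.0 - (1.0 - r) ** n

r, n = 0.95, 4
print(series_reliability(r, n))      # ~0.8145: series composition hurts reliability
print(parallel_reliability(r, n))    # ~0.9999938: redundancy improves it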
DIFFERENT INTERCONNECTIONS

Parallel-to-Series : component-level redundancy (A in parallel with C, in series with B in parallel with D)
• Used when the primary failure mode is an open circuit

Series-to-Parallel : system-level redundancy (chain A-B in parallel with chain C-D)
• Used when the primary failure mode is a short circuit

Always Rps > Rsp
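Reading the two networks as above, the inequality can be checked for identical component reliability R; a Python sketch:

def r_parallel_series(r):
    """Parallel-to-series: two parallel pairs in series, [1 - (1-R)^2]^2."""
    pair = 1.0 - (1.0 - r) ** 2
    return pair ** 2

def r_series_parallel(r):
    """Series-to-parallel: two series chains in parallel, 1 - (1 - R^2)^2."""
    chain = r ** 2
    return 1.0 - (1.0 - chain) ** 2

for r in (0.5, 0.9, 0.99):
    assert r_parallel_series(r) > r_series_parallel(r)   # Rps > Rsp for 0 < R < 1
    print(r, r_parallel_series(r), r_series_parallel(r))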
REFERENCES:
• Parag K. Lala, Fault Tolerant Digital System Design
• Barry W. Johnson, Design and Analysis of Fault Tolerant Digital Systems
• http://www.barringer1.com/jul01prb.htm
• http://www.relex.com/resources/maintpred.asp
• http://en.wikipedia.org/wiki/Mean_time_between_failure
