
LECTURE-1: Overview

Prof. Anand Mohan


Electronics Engineering, IT-BHU
Fault Tolerant System ⇒ Correctly Performs Specified Task in Presence of Hardware Failures / Software Errors

Fault Tolerance ⇒ System's Attribute to Achieve Fault Tolerant Operation

Easily Testable System ⇒ Simple & Straightforward Verification of Correct Operation
Approaches for Design of Dependable Systems
• Fault avoidance: Preventing fault occurrence in the operational system; limits introduction of faults during system construction
• Fault prevention: Attempts to eliminate the possibility of faults creeping in
• Fault tolerance: Masking the effect of faults (system meets specifications in the presence of faults)
• Fault removal: Reduces the presence / number / seriousness of faults
• Fault forecasting: Estimates the present number / future incidence & consequences of faults
• Although fault forecasting can be used for fault avoidance, fault prevention, fault removal & fault tolerance, it is difficult to ensure that a system shall not develop faults

• Hardware / Software / Networks can't be totally free from failures

• Therefore fault tolerant design becomes an important system attribute for dependability
System Design Process (figure)

Three Barriers Created due to Design Techniques (figure)
Fault Tolerant Systems
"To find fault is easy; to do better may be difficult" -- Plutarch

A "fault-tolerant system" is one that continues to perform at a desired level of service in spite of failures in some components
• Fault Tolerance ⇒ System Property to Recover from Partial Failure

• Performance is affected by a Partial Failure, but it should not stop System Operation

• Fault tolerance measures attempt to minimize the impact of partial failure on the operation & performance of the system

• Essence of a Fault Tolerant System is Dependability

• Fault Tolerance ⇒ QoS Measure to be achieved with minimal user involvement
Attributes of Dependable Systems:
 Availability
 Reliability
 Safety
 Maintainability
 Testability

QoS ⇒ Availability + Reliability + Safety + Maintainability + Testability + Performability
• Availability : System is ready to be used immediately
• Reliability : System can run consistently without complete failure
• Safety : If the system temporarily fails to operate correctly, nothing catastrophic happens (e.g., stopping a nuclear reaction on leakage detection)

Safety ⇒ Probability of the system's correct performance, or else discontinuing function with overall safety ⇒ Measure of Fail-Safe Capability
• Maintainability : Ease in repairing a failed system

• Testability : Measure of ease in testing certain attributes of a HW/SW system
Goal of Fault Tolerant Design
Fulfilling Performability P(L, t) & Dependability requirements of a system

• Performability P(L, t) ⇒ Probability of System Performance at or above some level 'L' at time 't' (e.g., reduced throughput / reduced available memory)

• Graceful Degradation ⇒ System's ability to automatically degrade its performance level to compensate for HW failures / Software errors; closely related to Performability
Failure Modes:
• Fail-Action (Fail-Safe) – Protective system initiates protective action on fault occurrence

• Fail-No-Action (Fail-to-Danger) – No protective action can be taken upon occurrence of a fault
Fault Response:
• Covert Faults – Hidden or non-self-revealing faults
  • Difficult to detect (no fault response)
  • Could result in a fail-to-danger situation
• Overt Faults – Obvious or self-revealing faults
  • Overt faults may result in unnecessary shutdown if the system does not possess fault tolerance
  • Fail-safe designs ensure a safe state of system operation
Failure Models
• Crash failures: Device / Component simply halts, but behaves correctly before halting
• Omission failures: Device / Component fails to respond (send / receive a message)
• Timing failures: The output of a component is correct, but lies outside a specified real-time interval (e.g. performance failure: too slow)
• Response failures: The output of a component is incorrect (& the error cannot be attributed to another component)

Problem with Crash Failures ⇒ Clients cannot distinguish between a crashed component & one that is just a bit slow

• Value failure: A wrong value is produced

• State transition failure: Execution of a service brings the component into a wrong state

• Arbitrary (Byzantine) failures: A component may produce arbitrary output and exhibit arbitrary timing failures

• Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
• Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong
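To make the fail-stop idea concrete, here is a minimal timeout-based detector sketch in Python; the class and method names are hypothetical, not from any standard library:

import time

class HeartbeatDetector:
    """Minimal timeout-based failure detector (sketch).
    The monitored component calls heartbeat() periodically; the monitor
    suspects a crash once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Called by (or on behalf of) the monitored component.
        self.last_beat = time.monotonic()

    def suspected_crashed(self):
        # A timeout cannot distinguish a crashed component from one that
        # is just a bit slow -- exactly the crash-failure problem above.
        return time.monotonic() - self.last_beat > self.timeout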

Fault Tolerance – Standards
• Safety Instrumented System (SIS): Instrumentation or controls that are responsible for bringing a process to a safe state in the event of a failure
• Safety Integrity Level (SIL): Statistical representation of the availability of a SIS at the time of a process demand
Fault Tolerant Strategies
Fault tolerance is achieved through Redundancy in hardware, software, information, and/or time (computations) ⇒ Static / Dynamic / Hybrid Redundancy

Redundancy is exploited either by Fault Masking, i.e. preventing the introduction of errors into the system due to faults, or by Reconfiguration, the process of eliminating the faulty component and restoring the system to some operational state
Reconfiguration Steps
• Fault detection ⇒ Recognizing fault occurrence; often required before any recovery procedure can be initiated
• Fault location ⇒ Locating the faulty component / module; required for initiating the appropriate recovery process
• Fault containment ⇒ Isolating the fault and preventing propagation of its effects throughout the system
• Fault recovery ⇒ Process of remaining operational / regaining operational status via reconfiguration
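As an illustration of these four steps, a minimal Python sketch over a pool of redundant modules; `reconfigure` and `output_ok` are hypothetical names, and the acceptance check stands in for whatever diagnosis a real system performs:

def reconfigure(active, spares, output_ok):
    """One reconfiguration pass over a pool of redundant modules (sketch)."""
    for module in list(active):
        if output_ok(module):              # fault detection & location:
            continue                       # module's output judged acceptable
        active.remove(module)              # fault containment: isolate the module
        if spares:
            active.append(spares.pop())    # fault recovery: switch in a spare
    return active                          # possibly a degraded configuration

For example, reconfigure(["m1", "m2"], ["spare"], lambda m: m != "m2") would isolate m2 and switch in the spare.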
Fault Tolerance Necessity
• Long-Life Systems ⇒ Satellites, Unmanned / Manned Space Probes, Space Shuttles (Challenger & Columbia catastrophes: could FT design help??)
• Critical Control / Computation Environments ⇒ Aircraft Controllers, Life Support / Diagnostic Medical Systems
• High-Availability ⇒ Telephone Switching Systems, Transaction Processing & Banking (Stock Market Systems / ATM)
• Hazardous Production ⇒ Chemical industry (methyl isocyanate: the Bhopal gas tragedy; nitric acid!)
• Maintenance Postponement ⇒ Expensive Maintenance (e.g. Space Applications)
Prevention from Fault Consequences
• Training operations / maintenance personnel on protective system operation
• Simulated emergency training, both initial and refresher
• Review of protective system adequacy when modules / units are changed, considering performance history
• Design verification through both qualitative and quantitative review exercises
Lecture 2
Reliability & Related Issues

Reliability ⇒ Probability of Failure-Free Performance Under Stated Conditions & for a Specified Time Period

Reliability is often quoted as 0.99…9 with 'i' nines, where 'i' is the no. of 9s
Factors Affecting Reliability:
• Design ⇒ Reliability Included as a Design Parameter ⇒ Top-level Reliability Requirements Allocated to Subsystems
• Environment ⇒ (Temperature, Location & Velocity) ⇒ Difference in Manufacturing & Operational Environmental Conditions
• Components ⇒ Using High Quality Components ⇒ Increased System Cost
RELIABILITY ENHANCEMENT TECHNIQUES:
• High Quality Components : e.g. Low Tolerance Active / Passive Components
• Quality Control Procedures : Adhering to High Quality Assembly Standards
• Worst Case Design : Design Considering Worst Case Parameters

Above Procedures ⇒ Effective But Costly ⇒ Cost Effective Approach ⇒ Redundancy

REDUNDANCY ⇒ Used for "Masking Effects of Faults"
• Does Not Require High Quality Components
• Uses Standard Components ⇒ Redundant / Reconfigurable Self-Repairing Systems

REDUNDANT & RECONFIGURABLE ARCHITECTURE
• Redundancy is an Important Design Technique
• An Automobile Brake Light Might Use 2 Light Bulbs; if One Fails, the Other is Available
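For instance, assuming each bulb survives a given period with probability 0.9 (a made-up figure), the brake light fails only if both bulbs fail: R = 1 − (1 − 0.9)² = 0.99. The parallel-system formula at the end of this lecture generalizes this.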
Reliability : R(t) = S(t) / N (fraction of Surviving Components)
Unreliability : Q(t) = F(t) / N (fraction of Failed Components)

Reliability + Unreliability = One : R(t) + Q(t) = 1

S(t) ⇒ No. of Surviving components
F(t) ⇒ No. of Failed components
N ⇒ No. of Identical components
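These definitions translate directly into code; a minimal Python sketch with made-up counts:

N   = 1000            # identical components placed on test
S_t = 920             # components still surviving at time t
F_t = N - S_t         # components failed by time t

R_t = S_t / N         # Reliability   R(t) = S(t)/N  -> 0.92
Q_t = F_t / N         # Unreliability Q(t) = F(t)/N  -> 0.08
assert abs((R_t + Q_t) - 1.0) < 1e-12   # R(t) + Q(t) = 1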
FAILURE / HAZARD RATE ⇒ Number of Failures Per Unit Time Compared to the Number of Surviving Components

Failure / Hazard Rate : Z(t) = [dF(t)/dt] / S(t) = λ
Estimating Useful Life Period

BATH-TUB CURVE ⇒ Failure Rates of Components Under Normal Conditions

(figure: Failure Rate vs. Time)
• Early life / burn-in period ⇒ Component(s) & interconnection defect(s)
• Useful Life Period ⇒ Constant Failure Rate (λ) (Random Failures)
• Wear-out period / End of Life Period ⇒ Ageing effect
EXPONENTIAL FAILURE LAW ⇒ Relates Reliability & Failure Rate

R(t) = exp(−λt)

For small λt : R(t) ≈ 1 − λt

λ is expressed in Percentage Failures Per 1000 Hours or Failures Per Hour

SYSTEM FAILURE RATE ⇒ System Failures Result from Component Failures

If there are 'K' Types of Components, each with Failure Rate λk and Nk components of each type:

λov = Σ Nk λk ; k = 1 to K
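Combining the overall failure rate with the exponential failure law, a short Python sketch; the parts list and rates are made-up for illustration:

import math

# Hypothetical parts list: (Nk, λk) = (count, failure rate in failures/hour)
parts = [(10, 2e-6), (4, 5e-6), (1, 1e-5)]

lam_ov = sum(n * lam for n, lam in parts)    # λov = Σ Nk·λk  ->  5e-5 per hour

t = 1000.0                                   # operating time in hours
R = math.exp(-lam_ov * t)                    # R(t) = exp(-λt)        -> ~0.9512
R_approx = 1 - lam_ov * t                    # small-λt approximation -> 0.95
print(f"lambda_ov = {lam_ov:.1e}/h, R({t:.0f} h) = {R:.4f} (approx. {R_approx:.4f})")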


RELIABILITY ⇒ Depends on Operating Conditions & Time of Operation
Reliability alone is Not Suitable for Realistic (Repairable) Conditions
MEAN TIME BETWEEN FAILURE ⇒ Area Underneath the Reliability Curve, or the Average Time a System will Run Between Failures

MTBF = ∫₀^∞ R(t) dt = 1/λ (hours)

For small λt : MTBF ≈ t / [1 − R(t)]
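The identity MTBF = ∫₀^∞ R(t) dt = 1/λ can be sanity-checked numerically; a quick Python sketch with an assumed λ:

import math

lam = 1e-4            # assumed failure rate, failures/hour
dt = 1.0              # integration step, hours
# Rectangle-rule integral of R(t) = exp(-lam*t); 200,000 h is effectively infinity here.
area = sum(math.exp(-lam * k * dt) * dt for k in range(200_000))
print(area, 1 / lam)  # both come out at about 10,000 hours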
RELIABILITY CURVE (figure: Reliability R(t), from 1.0 down, vs. Time t marked at 1 MTBF, 2 MTBF, 3 MTBF)
• Reliability Decreases with Increasing Time

MAINTAINABILITY : M(t) ⇒ Probability of Isolating & Repairing a Fault within Time t

M(t) = 1 − exp(−µt) = 1 − exp(−t / MTTR)

µ ⇒ Repair Rate ; MTTR ⇒ Mean Time to Repair (MTTR = 1/µ)
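A minimal Python sketch of this repair model, with an assumed MTTR:

import math

def maintainability(t, mttr):
    """M(t) = 1 - exp(-t/MTTR): probability that a fault is isolated and
    repaired within time t, for an exponential repair rate mu = 1/MTTR."""
    return 1.0 - math.exp(-t / mttr)

MTTR = 4.0                           # mean time to repair, hours (assumed)
print(maintainability(2.0, MTTR))    # ~0.39: repaired within 2 h
print(maintainability(8.0, MTTR))    # ~0.86: repaired within 8 h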


SYSTEM REPAIR TIME (for Maintainability)
= Passive Repair Time (Traveling Time to Site) + Active Repair Time (Actual Repair Time)

Repair Time = Time Between Occurrence & Awareness of Failure + Time to Detect Fault & Isolate Faulty Component + Time to Replace Faulty Component + Time to Verify Correct System Operation

Repair Time Reduction ⇒ Replaceable Components with Self-Test Feature
AVAILABILITY : A(t) ⇒ Function of Time ⇒ Defined as the Probability of Correct System Operation & its Availability to Perform its Functions at Time Instant 't' (Time Instant Instead of Time Interval as in Reliability)

TYPES of AVAILABILITY

Inherent Availability ⇒ Depends on Inherent Design ⇒ Theoretical (Maximum) Value

Operational Availability ⇒ Depends on Inherent Design, Availability of Spare Parts & Maintenance Policy
• Highly Available Systems ⇒ May Have Frequent Inoperability Periods of Extremely Short Duration
• Availability ⇒ Depends Upon Frequency of Inoperability & Quickness of Repairing
• Availability ⇒ Important Design Goal when Providing Services as Often as Possible is the Primary System Purpose
• Highly Available Systems Possess ⇒ High Probability of Performing Correctly at the Desired Instant
• High Availability Applications ⇒ Time-Shared Computing / Transaction Processing Systems
Availability = System up time / (System up time + System down time)
= System up time / [System up time + (No. of Failures × MTTR)]
= System up time / [System up time + (System up time × λ × MTTR)]
= 1 / (1 + λ × MTTR)
= MTBF / (MTBF + MTTR)   (since λ = 1/MTBF)
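The derivation reduces to one line of arithmetic; a Python sketch with made-up MTBF / MTTR figures:

MTBF = 2000.0    # mean time between failures, hours (assumed)
MTTR = 4.0       # mean time to repair, hours (assumed)

lam = 1.0 / MTBF
a_rate = 1.0 / (1.0 + lam * MTTR)     # A = 1 / (1 + lambda * MTTR)
a_mtbf = MTBF / (MTBF + MTTR)         # A = MTBF / (MTBF + MTTR)
assert abs(a_rate - a_mtbf) < 1e-12   # the two forms agree
print(f"A = {a_mtbf:.5f} ({a_mtbf * 100:.2f}% uptime)")   # ~99.80%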

If MTTR Decreases ⇒ Availability Increases ⇒ Economical System

Fault Tolerance Improves a System's Availability ⇒ Spare / Standby Components

INHERENT AVAILABILITY (Theoretical Maximum) : Ai = MTBF / (MTBF + MTTR)

If MTBF Increases [R(t) Increases] relative to MTTR ⇒ Availability Increases
For a Low MTBF we require Better Maintainability (i.e., short MTTR) to obtain the Same Availability
⇒ Tradeoff between Reliability & Maintainability to obtain the Same Availability
OPERATIONAL AVAILABILITY : Ao = MTBM / (MTBM + MDT)
MTBM ⇒ Mean Time Between Maintenance
MDT ⇒ Mean Down Time
Figure 1 : Operational Availability Relationships (plot of Mean Time Between Maintenance vs. Mean Down Time for fixed availability levels A = 85% to 99.9%)
• For a Fixed % Availability, Reliability Improves as the Time Between Failure / Maintenance Actions Increases
• Maintainability Improves as the Time to Repair Decreases
SAFETY : S(t) ⇒ Probability of Correct System Performance, or else Discontinuing Function with Overall Safety (of other systems or people)

Measure of the "Fail-Safe" Capability of a System

PERFORMABILITY : P(L, t) ⇒ Probability that the System's Performance will be At, or Above, Some Level 'L' at the Instant of Time 't'

Lowered Performance (e.g. in a Multiprocessor) ⇒ Reduced Throughput or Reduced Available Memory
DIFFERENCE BETWEEN RELIABILITY & PERFORMABILITY:
Reliability ⇒ All of the Functions are Performed Correctly
Performability ⇒ A Subset of the Functions is Performed Correctly

GRACEFUL DEGRADATION ⇒ System's Ability to Automatically Decrease its Level of Performance to Compensate for Hardware Failures & Software Errors

Fault Tolerance can Provide Graceful Degradation & Improve Performability by Eliminating Failed Hardware and Software from the System
TEST ⇒ Determining the Existence & Quality of Certain Attributes within a System, e.g. Designing a Test to Verify a Processor's Throughput

TESTABILITY ⇒ Ability to Test for Certain Attributes within a System
• Can be Improved by Including Testing as an Integral Part of the Design
• Measure of Testability is the Ease of Testing
• Testability ⇒ Related to Maintainability ⇒ Minimizing the Time to Identify & Locate Problems
Purpose of Testing:
• Does the System Work?
• Does the System Possess Complete Capability?

DEPENDABILITY ⇒ QoS Provided by a Particular System

Reliability, Availability, Safety, Maintainability, Performability & Testability are Measures to Quantify System Dependability
SERIES SYSTEM ⇒ Each subsystem must function

1 → 2 → … → N

Overall Reliability : Rov = Π Ri ; i = 1 to N

For identical subsystems (Ri = R):
Rov = R^N (decreases with N) ; MTBF = 1 / (Nλ) (decreases by factor N)

⇒ Need highly reliable individual subsystems

PARALLEL SYSTEM ⇒ One functioning subsystem is sufficient

Overall Reliability : Rov = 1 − Π (1 − Ri) ; i = 1 to N

For identical subsystems (Ri = R):
Rov = 1 − (1 − R)^N ; MTBF = (1/λ) Σ (1/j) ; j = 1 to N
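Both formulas in Python, for N identical subsystems; R and N here are made-up:

def series_reliability(r, n):
    """All N subsystems must work: Rov = R**N."""
    return r ** n

def parallel_reliability(r, n):
    """Any one of N subsystems suffices: Rov = 1 - (1 - R)**N."""
    return 1.0 - (1.0 - r) ** n

r, n = 0.95, 4
print(series_reliability(r, n))      # ~0.8145: series composition hurts reliability
print(parallel_reliability(r, n))    # ~0.9999938: redundancy improves it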
DIFFERENT INTERCONNECTIONS

Parallel-to-Series : component-level redundancy (A in parallel with C, in series with B in parallel with D)
• Used when the primary failure mode is an open circuit

Series-to-Parallel : system-level redundancy (chain A-B in parallel with chain C-D)
• Used when the primary failure mode is a short circuit

Always Rps > Rsp
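Reading the two networks as above, the inequality can be checked for identical component reliability R; a Python sketch:

def r_parallel_series(r):
    """Parallel-to-series: two parallel pairs in series, [1 - (1-R)^2]^2."""
    pair = 1.0 - (1.0 - r) ** 2
    return pair ** 2

def r_series_parallel(r):
    """Series-to-parallel: two series chains in parallel, 1 - (1 - R^2)^2."""
    chain = r ** 2
    return 1.0 - (1.0 - chain) ** 2

for r in (0.5, 0.9, 0.99):
    assert r_parallel_series(r) > r_series_parallel(r)   # Rps > Rsp for 0 < R < 1
    print(r, r_parallel_series(r), r_series_parallel(r))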
REFERENCES:
• Parag K. Lala, Fault Tolerant Digital System Design
• Barry W. Johnson, Design and Analysis of Fault Tolerant Digital Systems
• http://www.barringer1.com/jul01prb.htm
• http://www.relex.com/resources/maintpred.asp
• http://en.wikipedia.org/wiki/Mean_time_between_failure
