Failures and Fault Tolerance

Classification of failures
Security
Fundamentals of Fault tolerance
It is simply not possible to devise absolutely
foolproof, 100% reliable software.
The best we can do is to reduce the
probability of failure to an "acceptable" level.
Fault tolerance is the ability of a system to
perform its function correctly even in the
presence of internal faults. The purpose of
fault tolerance is to increase the dependability
of a system.

A failure occurs when an actual running system
deviates from its specified behavior. The cause
of a failure is called an error.
An error represents an invalid system state, one
that is not allowed by the system behavior
specification. The error itself is the result of a
defect in the system, called a fault; the fault is
the root cause of the failure.
A fault may not necessarily result in an error, but
the same fault may result in multiple errors.
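
The fault, error, failure chain can be illustrated with a small sketch. The function mean() and its hard-coded divisor are hypothetical, invented here only to show a fault that stays dormant for some inputs and produces an error for others.

```python
def mean(values):
    # Fault (hypothetical defect): the divisor is hard-coded to 3
    # instead of len(values).
    return sum(values) / 3


# The fault is present but dormant: with exactly three values the
# result is correct, so no error arises.
ok = mean([1, 2, 3])        # 2.0

# The same fault is now activated: the computed value is wrong, which
# is an error (an invalid state). The correct mean would be 2.5.
bad = mean([1, 2, 3, 4])    # about 3.33

# The error becomes a failure only once it reaches the system's
# externally visible behavior, for example when it is reported to a user.
```

The same defect would yield a different wrong value for every list that does not have three elements, which is one way a single fault can result in multiple errors.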

Fault Classification
Based on duration, faults can be classified as transient or
permanent.
A different way to classify faults is by their underlying
cause.
Design faults are the result of mistakes made in the design of the system.
Operational faults, on the other hand, are faults that occur during
the lifetime of the system and are invariably due to physical
causes.
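
For the classification by duration, one common heuristic (an assumption here, not something the slides prescribe) is to retry the operation: a transient fault tends to disappear on a later attempt, while a permanent fault persists. The helper classify_by_retry and its labels are hypothetical names used only for this sketch.

```python
def classify_by_retry(operation, attempts=3):
    """Heuristic: label a fault by whether a retry makes it go away.

    Returns "none" if the first attempt succeeds, "transient" if a
    later attempt succeeds, and "permanent" if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            operation()
        except Exception:
            continue  # try again; the fault may have cleared
        return "none" if attempt == 0 else "transient"
    return "permanent"


# Usage: an operation that fails once and then works.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("momentary glitch")

def broken():
    raise IOError("component is dead")

flaky_label = classify_by_retry(flaky)
broken_label = classify_by_retry(broken)
```

A bounded number of attempts cannot prove that a fault is permanent; it only makes that the working assumption.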

General Fault Tolerant Procedure
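The general procedure is a series of distinct activities
that are typically (although not necessarily) performed in
sequence: error detection, damage confinement, error
recovery, and fault treatment.
Error detection is the process of identifying that
the system is in an invalid state. Damage confinement
then limits how far the effects of the error can spread.
Fault treatment comes last; in other words, we first treat
the symptoms and then go after the underlying cause.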
The most common techniques for error detection
are: replication checks, timing checks, run-time
constraint checking, and diagnostic checks.
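
The four detection techniques listed above can be sketched roughly as follows. All function names, the temperature range, and the known-answer test are assumptions made for illustration; real systems implement these checks in many different ways.

```python
import time
from collections import Counter


def replication_check(results):
    """Compare the outputs of replicated computations.

    Returns the value produced by a strict majority of replicas and
    raises if no such majority exists.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("replicas disagree: no majority")
    return value


def timing_check(func, deadline_s):
    """Run func and signal an error if it took longer than deadline_s.

    Note: this detects a missed deadline after the fact; it does not
    interrupt func while it is running.
    """
    start = time.monotonic()
    result = func()
    if time.monotonic() - start > deadline_s:
        raise TimeoutError("deadline missed")
    return result


def constraint_check(temperature_c):
    """Run-time constraint: reject values outside an assumed valid range."""
    if not -50.0 <= temperature_c <= 150.0:
        raise ValueError(f"implausible temperature: {temperature_c}")
    return temperature_c


def diagnostic_check(adder):
    """Feed a known input to a component and compare with the known answer."""
    return adder(2, 3) == 5
```

For example, replication_check([7, 7, 9]) returns 7 because two of the three replicas agree, while replication_check([1, 2, 3]) raises because no value has a majority.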



Error Recovery
The system needs to be restored to a valid
state. Two general approaches exist.
In backward error recovery, the system is
restored to a previous known valid state. This
often requires checkpointing the system state
and, once an error is detected, rolling back the
system state to the last checkpointed state.
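
A minimal sketch of checkpoint and rollback, assuming a toy account whose only validity rule is a non-negative balance. The class CheckpointedAccount and its methods are hypothetical and exist only to illustrate the idea.

```python
import copy


class CheckpointedAccount:
    def __init__(self, balance):
        self.state = {"balance": balance}
        # The initial state is taken as the first known valid state.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        """Save a copy of the current (assumed valid) state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the last checkpointed state."""
        self.state = copy.deepcopy(self._checkpoint)

    def is_valid(self):
        """Error detection: the balance must never be negative."""
        return self.state["balance"] >= 0

    def withdraw(self, amount):
        self.checkpoint()                  # checkpoint before changing state
        self.state["balance"] -= amount
        if not self.is_valid():            # error detected
            self.rollback()                # backward error recovery
            return False
        return True


account = CheckpointedAccount(100)
first = account.withdraw(30)     # succeeds, balance becomes 70
second = account.withdraw(500)   # detected as invalid and rolled back
```

After the second call the balance is still 70: the invalid intermediate state was discarded in favor of the last checkpoint.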
In some cases, forward error recovery is more appropriate. This
involves driving the system from the erroneous
state to a new valid state.
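
By contrast with rollback, forward recovery constructs a new valid state instead of returning to an old one. The sketch below assumes a hypothetical controller whose speed must lie between 0 and MAX_SPEED; the names, the limits, and the "degraded" mode are all illustrative assumptions.

```python
MAX_SPEED = 100.0   # assumed upper limit of the valid range
SAFE_SPEED = 0.0    # assumed safe value to fall back to


def recover_forward(state):
    """Return a valid state derived from a possibly erroneous one.

    A valid state is returned unchanged. An erroneous state is not
    rolled back; instead the system is driven to a new safe state.
    """
    if 0.0 <= state["speed"] <= MAX_SPEED:
        return state
    return {"speed": SAFE_SPEED, "mode": "degraded"}


healthy = recover_forward({"speed": 40.0, "mode": "normal"})
recovered = recover_forward({"speed": 9999.0, "mode": "normal"})
```

Here recovered is {"speed": 0.0, "mode": "degraded"}, a state the system was never in before, which is what distinguishes this approach from restoring a checkpoint.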

Fault Treatment
Repair procedure: the failed component is
replaced by a standby component.
Standby components may be cold, warm, or hot.
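
As commonly used (the slides do not define the terms), a cold standby must first be started and loaded, a warm standby is running but only partly up to date, and a hot standby is kept fully synchronized and can take over at once. The sketch below assumes the hot case; Replica and HotStandbyPair are hypothetical classes written only for this illustration.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class HotStandbyPair:
    def __init__(self):
        self.primary = Replica("primary")
        self.standby = Replica("standby")

    def _fail_over(self):
        # Fault treatment: the failed primary is replaced by the standby,
        # and a fresh standby is brought up to date from the new primary.
        self.primary = self.standby
        self.standby = Replica("replacement")
        self.standby.data = dict(self.primary.data)

    def write(self, key, value):
        try:
            self.primary.write(key, value)
        except ConnectionError:
            self._fail_over()
            self.primary.write(key, value)
        # Mirror every write so the standby stays "hot".
        self.standby.write(key, value)


pair = HotStandbyPair()
pair.write("a", 1)
pair.primary.alive = False   # simulate a permanent fault in the primary
pair.write("b", 2)           # triggers failover; no data is lost
```

Because every write is mirrored, the standby already holds "a" when it takes over, so the second write completes without any reload step.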
