
Fault Tolerance Techniques

The failure rate of a real-time system must be far lower than that of its individual components: an RTS must continue operating despite the failure of a limited subset of its hardware or software. It must also be gracefully degradable: as the size of the faulty set increases, the system must not suddenly collapse, but continue executing part of its workload.

[Figure: performance vs. extent of failure, degrading gracefully until the catastrophic failure region is reached]

Initially, performance is kept from degrading despite a limited number of failures by switching in spares and using up slack computational capacity. As the system runs out of slack capacity, the operating system must begin shedding computational load, starting with the less critical tasks, until the computer can no longer meet even the critical computational requirements; at that point the system is said to have encountered a failure, which may have catastrophic consequences for the application. A sketch of this load-shedding policy follows the list below.

Methods to ensure system robustness:
- Reducing design and manufacturing faults
  - Careful specification and design, followed by extensive design reviews
  - Usage of high-grade components to reduce manufacturing faults
  - Design for testability and extensive testing
- Making the system robust against interference from the operating environment
  - Shielding, usage of radiation-hardened components
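A minimal Python sketch of the load-shedding policy described above, assuming a hypothetical Task record with a load and a criticality field (all names here are illustrative, not from the text):

```python
from dataclasses import dataclass

CRITICAL = 10  # criticality level at or above which a task must never be shed

@dataclass
class Task:
    name: str
    load: float        # fraction of the surviving computational capacity
    criticality: int   # higher value = more critical

def shed_load(tasks, capacity):
    """Keep the most critical tasks that fit into the remaining capacity.

    Tasks are shed least-critical-first; if even the critical tasks no
    longer fit, the system has failed, possibly catastrophically.
    """
    kept = sorted(tasks, key=lambda t: t.criticality, reverse=True)
    while sum(t.load for t in kept) > capacity:
        if kept and kept[-1].criticality < CRITICAL:
            kept.pop()   # shed the least critical remaining task
        else:
            raise RuntimeError("failure: critical workload exceeds capacity")
    return kept

# As failures eat into capacity, logging is shed first, then telemetry:
tasks = [Task("flight-control", 0.5, CRITICAL),
         Task("telemetry", 0.3, 5),
         Task("logging", 0.2, 1)]
print([t.name for t in shed_load(tasks, capacity=0.7)])  # ['flight-control']
```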

Even with robust design and testing, errors may still occur after release or while in use, and the system must be able to tolerate them. For an RTS, deadlines define two types of response to failure:
i. Short-term response: quickly correcting for a failure so that immediate deadlines can still be met.


ii. Long-term response: locating the failure, determining the best response to it, and initiating a recovery and reconfiguration procedure. A sketch combining the two responses follows.
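One hedged way the two responses might be combined, assuming hypothetical compute/fallback/diagnose hooks: the short-term response returns a safe pre-computed value at once so the deadline is met, while the long-term response runs off the critical path:

```python
import threading

def respond_to_failure(compute, fallback, diagnose_and_reconfigure):
    """Short-term: meet the immediate deadline with a fallback value.
    Long-term: locate the failure and reconfigure in the background."""
    try:
        return compute()
    except Exception as err:
        # Long-term response: locate the failure, choose the best
        # response, and initiate recovery/reconfiguration asynchronously.
        threading.Thread(target=diagnose_and_reconfigure,
                         args=(err,), daemon=True).start()
        # Short-term response: a safe default meets the deadline now.
        return fallback
```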

Definitions
Hardware Fault: a physical defect that can cause a component to malfunction, e.g. a broken wire or a stuck-at fault.
Software Fault: a bug that causes a program to fail for a given set of inputs.
Error: the manifestation of a fault; a hardware or software fault resulting in undesired system behavior.
Fault Latency: the duration between the onset of a fault and its manifestation as an error. Fault latency can affect system reliability, since it masks the occurrence of the fault until it is manifested as an error.
[Figure: fault latency example, a fault occurs in a gate that is not yet used; the error appears only later, when the actual output diverges from the correct output]

Error Latency: the duration between when an error is produced and when it is either recognized as an error or causes the failure of the system.
Error Recovery: the process by which the system attempts to recover from the effects of an error.
i. Forward Error Recovery: the error is masked without any computations having to be redone.
ii. Backward Error Recovery: the system is rolled back to a moment in time before the error is believed to have occurred, the system state is restored to what it was at that instant, and the computation is carried out again. This uses time redundancy, since additional time is consumed to mask the effects of the failure. A checkpoint-and-rollback sketch follows.
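A minimal checkpoint-and-rollback sketch of backward error recovery; the step/is_valid hooks are illustrative. Note the time redundancy: the step is re-executed from the checkpoint after every rollback.

```python
import copy

def run_with_rollback(state, step, is_valid, max_retries=3):
    """Backward error recovery: checkpoint the state before each attempt,
    re-execute the step from the checkpoint if an error is detected."""
    for _ in range(max_retries):
        checkpoint = copy.deepcopy(state)   # save the pre-step state
        step(state)                         # perform the computation
        if is_valid(state):                 # error detection
            return state
        state = checkpoint                  # roll back: discard bad results
    raise RuntimeError("error persisted across all retries")
```

Forward error recovery would instead mask the error without redoing work, for example by voting among redundant results.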

Causes of Failure
i. Errors in the specification or design


a. The cause of many hardware failures and all software failures.
b. While mapping the real-world requirements to the specification space, the requirements must be thoroughly understood, as must the application and the environment in which the application is going to operate.
c. Because the specification is the only link between the real-world application and the design process, and every other step proceeds from it, the specification must be:
   i. unambiguous: it must not admit more than one interpretation
   ii. complete in its definition of the whole of the system
   iii. yet still allow for initiative from the designer
d. Avoiding mistakes:
   i. checks
   ii. third-party review by people not connected with writing the specifications
   iii. each line being analyzed and defended
e. Designs from formal specifications:
   i. formal checking of the design against the specification, though at present formal methods are primitive for large-scale systems

ii. Defects in components
a. Defective hardware components, the defects being caused by the manufacturing process, wear and tear of use, etc.

iii. Operating Environment
a. Stresses applied on the device, depending on the application
b. Poor ventilation or excessively high ambient temperature, leading to melting of components and other damage
c. Vibration stresses
d. Aerospace application environments:
   i. differences in gravitational force
   ii. electromagnetic or elementary-particle radiation, leading to spurious changes in the states of flip-flops

Estimating the reliability of a component: failure rate

    λ = π_L · π_Q · (C1 · π_T · π_V + C2 · π_E)

π_L: reflects the maturity of the fabrication process (1 if mature technology, 10 otherwise)
π_Q: reflects the testing process used to discard devices that have manufacturing defects (ranges between 0.25 and 20)
C1, C2: complexity factors, expressed as a function of the number of transistors in the device and the number of pins
π_T: effects of temperature; a function of the type of device (ranges between 0.1 and 1000)
π_V: voltage stress for CMOS devices (1 if the device is not CMOS; ranges from 1 to 10 for CMOS)
π_E: other stresses in the operating environment (ranges between 0.38 and 220)

Analysis of the parameters of the operating environment: for CMOS, the voltage and temperature stresses are multiplicative, and their product ranges across four orders of magnitude. A component used in a benign environment can thus be roughly 10,000 times more reliable than the same component used in a harsh environment, as the sketch below illustrates.
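A small Python sketch of the failure-rate formula; the C1/C2 values are placeholders, and the two parameter sets simply take the benign and harsh extremes quoted above:

```python
def failure_rate(pi_L, pi_Q, c1, c2, pi_T, pi_V, pi_E):
    """lambda = pi_L * pi_Q * (C1*pi_T*pi_V + C2*pi_E)"""
    return pi_L * pi_Q * (c1 * pi_T * pi_V + c2 * pi_E)

# Same hypothetical CMOS device at the benign and harsh extremes of the
# temperature, voltage, and environment factors (C1 = C2 = 0.01 assumed):
benign = failure_rate(1, 0.25, 0.01, 0.01, pi_T=0.1,  pi_V=1,  pi_E=0.38)
harsh  = failure_rate(1, 0.25, 0.01, 0.01, pi_T=1000, pi_V=10, pi_E=220)
print(f"harsh/benign = {harsh / benign:.0f}")  # ~ four orders of magnitude
```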

Fault Types
Faults are classified according to their temporal behavior and their output behavior. An active fault is physically capable of producing errors, whereas a benign fault is not.

Temporal Behavior Classification

Three fault types:
i. Permanent: does not die away with time, but remains until it is repaired or the affected unit is replaced
ii. Intermittent: cycles between the fault-active and fault-benign states
iii. Transient: dies away after some time

[Figure: fault state diagram with states No Fault, Fault Active, and Fault Benign; transition rates a(t): No Fault → Fault Active, b(t): Fault Active → Fault Benign, c(t): Fault Active → No Fault, d(t): Fault Benign → Fault Active]

Fault Type      Condition
Permanent       a(t) ≠ 0, b(t) = c(t) = d(t) = 0
Transient       a(t) ≠ 0, b(t) = 0, c(t) ≠ 0, d(t) = 0
Intermittent    a(t) ≠ 0, b(t) ≠ 0, c(t) = 0, d(t) ≠ 0

a(t), b(t), c(t), d(t): rates at which the fault switches state
t: age of the fault
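A sketch that simulates the state diagram above, assuming constant rates (a memoryless approximation of the a(t)..d(t) rate functions); plugging in the table's conditions reproduces the three fault behaviors:

```python
import random

def simulate_fault(a, b, c, d, horizon=50.0):
    """Walk the fault-state model with constant transition rates:
    a: No Fault -> Fault Active      b: Fault Active -> Fault Benign
    c: Fault Active -> No Fault      d: Fault Benign -> Fault Active
    """
    out = {"no fault": {"fault active": a},
           "fault active": {"fault benign": b, "no fault": c},
           "fault benign": {"fault active": d}}
    t, state, history = 0.0, "no fault", []
    while t < horizon:
        rates = {s: r for s, r in out[state].items() if r > 0}
        if not rates:          # no outgoing transition: permanent fault
            break
        t += random.expovariate(sum(rates.values()))  # time to next switch
        state = random.choices(list(rates), weights=list(rates.values()))[0]
        history.append((round(t, 2), state))
    return history

# Intermittent fault: a != 0, b != 0, c = 0, d != 0 (cycles active/benign)
print(simulate_fault(a=0.2, b=1.0, c=0.0, d=1.0))
```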

The vast majority of faults are transient, and only a minority are permanent. Transient failures are hard to catch, since by the time the system recognizes that such a failure has occurred, it may already have disappeared. As devices become faster, their vulnerability to environmental effects that lead to transient failures will increase: as feature sizes shrink, switching times and switching charges become smaller, and a charge passing through the device can cause spurious state changes through electromagnetic induction. This is critical for an RTS operating in a hazardous environment, such as space, where the system is more exposed to radiation.

Output Behavior Classification

Based on the nature of the error that the fault generates:
i. Non-malicious: the fault is interpreted the same way by all of the receiving units
ii. Malicious: the same fault can be interpreted inconsistently by different receivers

A stuck-at-zero is interpreted the same way by all of the line's loads. But if the voltage falls outside the pre-defined ranges for logic 0 and logic 1, it can be interpreted differently by different loads; here, consistency breaks down. Failures corresponding to inconsistent outputs are much harder to neutralize than the non-malicious type: the inconsistent outputs resemble those of a malicious intelligence that has infiltrated the system and puts forth errors to disrupt it. Such failures are known as malicious failures or Byzantine failures, and they are assumed to behave arbitrarily.

Fail-Stop Units: a unit is fail-stop if it responds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect output. Fail-stop units typically consist of multiple processors running the same tasks and comparing results; faults are detected by comparing outputs, and if the outputs differ, the whole unit turns itself off. A minimal sketch appears below.

Fail-Safe Units: a unit is fail-safe if its failure mode is biased so that the application process does not suffer catastrophe upon failure.
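A toy fail-stop comparison in Python (the replicated processors are simulated by calling the same task repeatedly; names are illustrative): on any disagreement the unit halts instead of emitting a possibly wrong value.

```python
def fail_stop_unit(task, replicas=2):
    """Run the same task on 'replicas' (simulated) processors and
    compare the outputs; stop the whole unit on any disagreement."""
    results = [task() for _ in range(replicas)]   # one run per processor
    if any(r != results[0] for r in results[1:]):
        raise SystemExit("fail-stop: replica outputs disagree; unit halted")
    return results[0]   # agreed output is safe to release
```

A fail-safe unit, by contrast, biases its failure output toward the safe side; a classic illustration is a traffic-light controller whose failure mode is to show red in all directions.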

Independence and Correlation

Component failures may be independent or correlated.
Independent: a failure that is not caused by, and does not cause, another failure.
Correlated: failures that are related in some way; they may arise from a common cause, or one of them may cause the others. Correlated failures are difficult to deal with, and means must be found to avoid them. Correlation can be due to physical or electrical coupling, or to the same environmental effect acting on multiple units.

Solutions:
- Shielding, which reduces transient upsets due to the environment
- Ensuring that the processors draw on disparate sources of power
- Physically separating the hardware

Ref: Real-Time Systems, C. M. Krishna and Kang G. Shin, Sections 7.1, 7.2 and 7.3
