The failure rate of a real-time system (RTS) must be lower than that of its components: an RTS must continue operating despite the failure of a limited subset of its hardware or software. It must also be gracefully degradable: as the size of the faulty set increases, the system must not suddenly collapse, but must continue executing part of its workload.
Initially, performance is kept from degrading despite a limited number of failures by switching in spares and using up slack computational capacity. As the system runs out of slack capacity, the operating system must begin shedding computational load, starting with the less critical tasks. When the computer can no longer meet even the critical computational requirements, the system is said to have encountered a failure, which may have catastrophic consequences for the application.
Methods to ensure system robustness:
i. Reducing design and manufacturing faults
   a. Careful specification and design, followed by extensive design reviews
   b. Use of high-grade components to reduce manufacturing faults
   c. Design for testability and extensive testing
ii. Making the system robust against interference from the operating environment
   a. Shielding and use of radiation-hardened components
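The load-shedding behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the `Task` fields, the numeric criticality scale, and the function names are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    load: float       # computational capacity the task needs (hypothetical unit)
    criticality: int  # higher value = more critical (hypothetical scale)


def shed_load(tasks, capacity):
    """Drop the least critical tasks until the remaining load fits the capacity.

    Returns a pair (kept, shed); both lists are ordered from most to least
    critical and from first-shed to last-shed respectively.
    """
    kept = sorted(tasks, key=lambda t: t.criticality, reverse=True)
    shed = []
    while kept and sum(t.load for t in kept) > capacity:
        shed.append(kept.pop())  # the last element is the least critical one
    return kept, shed


def system_failed(tasks, capacity, critical_level):
    """True if even a task at or above critical_level had to be shed."""
    _, shed = shed_load(tasks, capacity)
    return any(t.criticality >= critical_level for t in shed)
```

With plenty of capacity nothing is shed. As processors fail and `capacity` shrinks, low-criticality tasks disappear first, and `system_failed` becomes true only once a critical task no longer fits.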
The system must be able to tolerate errors that occur during release or while in use in spite of robust design and testing. For an RTS, deadlines define two types of response to failure:
i. Short-term response: quickly correcting for a failure so that immediate deadlines can be met.
ii. Long-term response: locating the failure, determining the best response to it, and initiating a recovery and reconfiguration procedure.
Definitions
Hardware Fault: a physical defect that can cause a component to malfunction, e.g. a broken wire or a stuck-at fault.
Software Fault: a bug that causes a program to fail for a given set of inputs.
Error: the manifestation of a fault; hardware or software faults resulting in undesired system behavior.
Fault Latency: the duration between the onset of a fault and its manifestation as an error. Latency might affect system reliability, since it masks the occurrence of the fault until it is manifested as an error.
[Figure: gate-level circuit illustrating fault latency; labels: Gate Not Used, Input2, Actual Output, Correct Output]
Error Latency: the duration between when an error is produced and when it is either recognized as an error or causes the failure of the system.
Error Recovery: the process by which the system attempts to recover from the effects of an error.
i. Forward Error Recovery: the error is masked without any computations having to be redone.
ii. Backward Error Recovery: the system is rolled back to a moment in time before the error is believed to have occurred, the system state at that instant is restored, and the computation is carried out again. This uses time redundancy, since it consumes additional time to mask the effects of the failure.
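Backward error recovery can be illustrated with a minimal checkpoint-and-rollback sketch. The class name, the `is_valid` acceptance check, and the retry limit are assumptions made for the example; the source does not prescribe how errors are detected or how many retries are allowed.

```python
import copy


class CheckpointedComputation:
    """Hypothetical wrapper that restores the last good state when a step fails."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self.rollbacks = 0  # number of times the state was rolled back

    def checkpoint(self):
        # Save a copy of the state that is believed to be error-free.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state saved before the error is believed to have occurred.
        self.state = copy.deepcopy(self._checkpoint)
        self.rollbacks += 1

    def run_step(self, step, is_valid, max_retries=3):
        """Run step(state); on a detected error, roll back and redo it.

        Returns True if some attempt passed is_valid, False otherwise.
        Each retry costs extra time, which is the time redundancy.
        """
        for _ in range(max_retries + 1):
            step(self.state)
            if is_valid(self.state):
                self.checkpoint()
                return True
            self.rollback()
        return False
```

A transient error is masked by the retry, at the cost of running the step again. If the error persists, `run_step` gives up and leaves the state at the last checkpoint.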
Causes of Failure
i. Errors in the specification or design
   a. These cause many hardware failures and all software failures.
   b. When mapping real-world requirements to the specification space, the requirements must be thoroughly understood, as must the application and the environment in which it is going to operate.
   c. Because the specification is the only link between the real-world application and the design process, and every other step proceeds from it, the specification must be:
      i. unambiguous: it must not admit more than one interpretation
      ii. complete: it must define the whole of the system
      iii. open enough to still allow for initiative from the designer
   d. Avoiding mistakes
      i. Checks
      ii. Third-party review by people not connected with writing the specification
      iii. Analyzing and defending each line
   e. Designs from formal specifications
      i. Formal checking of the design against the specification, although formal methods are at present primitive for large-scale systems
ii. Defects in components
   a. Defective hardware components; the defects are caused by the manufacturing process, wear and tear from use, etc.
iii. Operating environment
   a. Stresses applied to the device, depending on the application
   b. Poor ventilation or excessively high ambient temperature, leading to melting of components and other damage
   c. Vibration stresses
   d. Aerospace application environments
      i. Gravitational force differences
      ii. Electromagnetic or elementary-particle radiation, leading to spurious changes in the states of flip-flops
The failure rate of a device is modeled as:

$$\lambda = \pi_L \, \pi_Q \, (C_1 \, \pi_T \, \pi_V + C_2 \, \pi_E)$$

where
$\pi_L$: fabrication process factor; 1 if the technology is mature, 10 otherwise
$\pi_Q$: factor for the testing process used to discard devices that have manufacturing defects (ranges between 0.25 and 20)
$C_1$, $C_2$: complexity factors, expressed as a function of the number of transistors in the device and the number of pins
$\pi_T$: effects of temperature; a function of the type of device (ranges between 0.1 and 1000)
$\pi_V$: voltage stress for CMOS devices (1 if the device is not CMOS; ranges from 1 to 10 for CMOS)
$\pi_E$: other stresses in the operating environment (ranges between 0.38 and 220)
Ref: Real-Time Systems, C. M. Krishna and Kang G. Shin: Sections 7.1, 7.2 and 7.3
Analysis of the operating-environment parameters: for CMOS devices, the voltage and temperature stresses are multiplicative, and their product ranges across four orders of magnitude. A component used in a benign environment can therefore be on the order of 10,000 times more reliable than the same component used in a harsh environment.
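To make the interaction of the factors concrete, here is a small sketch. It assumes that the fabrication and testing factors scale the sum of two terms: C1 weighted by the temperature and voltage stresses, and C2 weighted by the environmental stress. The function name, parameter names, and sample values are hypothetical; only the ranges checked in the code are taken from the definitions above.

```python
# Ranges quoted in the factor definitions; everything else here is illustrative.
RANGES = {
    "quality": (0.25, 20.0),
    "temperature": (0.1, 1000.0),
    "voltage": (1.0, 10.0),
    "environment": (0.38, 220.0),
}


def failure_rate(fabrication, quality, c1, c2, temperature, voltage, environment):
    """Combine the factors as fabrication * quality * (c1*T*V + c2*E)."""
    if fabrication not in (1, 10):
        raise ValueError("fabrication factor must be 1 (mature) or 10 (otherwise)")
    factors = {
        "quality": quality,
        "temperature": temperature,
        "voltage": voltage,
        "environment": environment,
    }
    for name, value in factors.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} factor {value} outside [{low}, {high}]")
    return fabrication * quality * (c1 * temperature * voltage + c2 * environment)


# Hypothetical device: same part, mature versus immature fabrication process.
mature = failure_rate(1, 2.0, 0.5, 0.25, 4.0, 2.0, 8.0)
immature = failure_rate(10, 2.0, 0.5, 0.25, 4.0, 2.0, 8.0)
```

Because temperature and voltage multiply each other inside the first term, raising both at once has a much larger effect than raising either alone, which is the point made about CMOS devices.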
Fault Types
Faults are classified according to their temporal behavior and their output behavior. An active fault is physically capable of producing errors, whereas a benign fault is not.
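[Figure: state-transition diagram of a fault with states No Fault, Fault Active, and Fault Benign. Transition rates: a(t) from No Fault to Fault Active, b(t) from Fault Active to Fault Benign, c(t) from Fault Active to No Fault, d(t) from Fault Benign to Fault Active]

Fault type      Condition
Permanent       a(t) ≠ 0, b(t) = c(t) = d(t) = 0
Transient       a(t) ≠ 0, b(t) = 0, c(t) ≠ 0, d(t) = 0
Intermittent    a(t) ≠ 0, b(t) ≠ 0, c(t) = 0, d(t) ≠ 0

a(t), b(t), c(t), d(t): rates at which the fault switches state
t: age of the fault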
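The three conditions can be read as a classification rule on the transition rates. The sketch below makes two assumptions that are not stated in the surviving text: the rows are labeled with the standard temporal classes (permanent, transient, intermittent), and the rates are sampled at a single fault age t. The function name is hypothetical.

```python
def classify_fault(a, b, c, d):
    """Classify a fault from its transition rates at a given fault age.

    a: rate of entering the active state from the fault-free state
    b, c, d: the remaining transition rates, as used in the conditions above
    """
    if a == 0:
        return "no fault occurs"
    if b == 0 and c == 0 and d == 0:
        return "permanent"      # once active, the fault never leaves that state
    if b == 0 and c != 0 and d == 0:
        return "transient"      # the fault can disappear again
    if b != 0 and c == 0 and d != 0:
        return "intermittent"   # the fault alternates between active and benign
    return "unclassified"       # combination not covered by the three conditions
```

For example, `classify_fault(0.01, 0, 0.5, 0)` falls under the second condition, a fault that appears and then dies away.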
The vast majority of faults are transient, and only a minority are permanent. Transient failures are hard to catch, since by the time the system recognizes that such a failure has occurred, it might have disappeared. As devices become faster, their vulnerability to environmental effects that lead to transient failures increases: as device sizes shrink, switching times get smaller, and charges passing through a device can cause spurious state changes through electromagnetic induction. This is critical for an RTS operating in a hazardous environment such as space, where the system is more exposed to radiation.
A stuck-at-zero fault is interpreted the same way by all of the line's loads. But if the voltage on the line falls outside the predefined ranges for logic 0, (0l, 0h), and logic 1, (1l, 1h), different loads can interpret it differently, and consistency breaks down. Failures that produce inconsistent outputs are much harder to neutralize than the nonmalicious type. Inconsistent outputs resemble the outputs of a malicious intelligence that has affected the system and puts forth errors to disrupt it. Such failures are known as malicious failures or Byzantine failures, and they are assumed to behave arbitrarily.
Fail-Stop Unit: a unit that responds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect output. Fail-stop units typically consist of multiple processors running the same tasks and comparing results. Faults are detected by comparing outputs: if the outputs differ, the whole unit turns itself off.
Fail-Safe Unit: a unit whose failure mode is biased so that the application process does not suffer catastrophe upon failure.
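The compare-and-stop behavior of a fail-stop unit can be sketched as below. This is a simplified software model, not a description of any particular hardware: the replicas are plain Python callables, and the class and exception names are hypothetical.

```python
class UnitStopped(Exception):
    """Raised instead of returning a possibly incorrect output."""


class FailStopUnit:
    """Runs the same task on several replicas and stops on any disagreement."""

    def __init__(self, replicas):
        if len(replicas) < 2:
            raise ValueError("need at least two replicas to compare outputs")
        self.replicas = list(replicas)
        self.stopped = False

    def execute(self, *args):
        if self.stopped:
            raise UnitStopped("unit has already turned itself off")
        results = [replica(*args) for replica in self.replicas]
        # A fault is detected when any replica disagrees with the first one.
        if any(result != results[0] for result in results[1:]):
            self.stopped = True
            raise UnitStopped("replica outputs differ")
        return results[0]
```

Once the outputs disagree, the unit stays off and never emits a value, so downstream components see silence rather than a wrong answer. Note that this sketch cannot detect a fault that makes all replicas produce the same wrong output.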