Sie sind auf Seite 1von 5

Programming Embedded Systems Lecture outline

● Notion + definition of reliability


Lecture 9 ● Notion of fault tolerance
Reliability and fault tolerance ● Typical classes of faults in
Monday Feb 14, 2011 embedded systems
● Techniques to improve fault tolerance
Philipp Rümmer
Uppsala University
Philipp.Ruemmer@it.uu.se

1/37 2/37

Correctness vs. reliability Reliability

● Last lecture: ● Probability R(t) that a device/system


Correctness of software fulfils its intended function for a
→ Absolute notion, assessed using V&V period of time t, given precisely
● In practice: stated conditions
Every system will fail (sooner or later) ● If T is time (random variable) when first
● Totally correct of software is usually too failure occurs, then:
expensive
● Hardware faults (unavoidable)
● More relative notion: reliability Assumes knowledge
about probability
distribution
3/37 4/37

Reliability block model: Reliability block model:


parallelism (“or”) series (“and”)

1 2

● Compare with: fault tree analysis


5/37 6/37

Mean time between failures


Lusser's equation
(MTBF)
● E.g., “power supply A has MTBF of ● Relationship between reliability and
40000 hours” MTBF:
● Determined by testing N devices for T
hours each, counting the number R of
failures (special case of Weibull distribution)
● Interesting consequence:
Devices will survive their MTBF only
with likelihood
● Inverse: failure rate

7/37 8/37
Bathtubs Improving reliability of a system
● Define fault model describing what could
● Failure rate of hardware is considered go wrong
to follow a “bathtub” distribution
● E.g., bit-flips in memory, spurious
interrupts, noisy sensor data
“Infant
Failure
mortality” Wear-out ● Assess fault tolerance of system
rate
● Can a fault lead to a failure? How critical?
● Testing, fault injection
Normal life
● Measures to improve fault tolerance
(“useful life”)
● E.g., error correction codes, redundancy
Time
9/37 10/37

Faults categories Fault persistence


● Transient faults
● Software faults ● Isolated event during system execution
● Environment faults ● E.g., overheating, glitch in a power line
● E.g., wrong sensor data ● Intermittent faults
● Internal hardware faults ● Malfunction that occurs repeatedly
● E.g., bit flips, defective components (periodic or aperiodic)
● E.g., damaged sensor Could each be
caused by
software bug!
● Permanent faults
● Permanently defective component
11/37 ● E.g., stuck signal or memory cell 12/37

Fault model: memory faults


● Can affect internal RAM, ROM,
registers, external RAM, etc.
● Soft error: memory contents are
Typical faults modified (→ transient)
in embedded systems ● Hard error: memory cell is damaged,
e.g., stuck at zero
(→ permanent or intermittent)
● Many possible causes: electrostatic
discharge, power surging, vibration,
radiation (single-event upsets)
13/37 14/37

Testing memory faults Memory fault injection


● Hwifi: hardware-implemented fault
● Simple approach:
injection
Let system run for a long time, observe
failures ● Can be manual or automatic
● E.g., through heavy-ion radiation
● Problem: faults might only occur under
special circumstances ● Swifi: software-implemented f.i.
● Acceleration through fault injection
● Simulate faults by code instrumentation
● More systematic, more possibilities for
optimisation
● Tools available to automate Swifi
15/37
● Related to mutation testing 16/37
Swifi example Swifi example (2)
for (;;) {
    if (GPIO_ReadInputDataBit(GPIOC, SwitchPin)) {
      ++count;
● Instrumented code is tested
    } else if (count != ­1) {
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET); ● If no failures occur, software is tolerant
      GPIO_WriteBit(GPIOC, ON2Pin, Bit_RESET);
      count = ­1; w.r.t. the injected fault
    }
Instrumentation ● Alternative method:
    if (CONDITION) simulating a bit flip
      count ^= 1 << BIT; (e.g., triggered randomly) Compare outputs of the original and
    if (count == 10)           // 0.2 seconds the instrumented software during tests
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_SET);
    else if (count == 100) {   // 0.2 + 1.8 seconds
● If no difference in outputs can be
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET);
      GPIO_WriteBit(GPIOC, ON2Pin, Bit_SET);
observed, fault is tolerated
    }

17/37 18/37

Protection from memory faults CPU/MCU defects


● Error-correcting memory
● Components of controller or system
can be defective
● RAM: error-correcting codes (ECC),
→ Usually permanent fault
traditionally Hamming codes
● ROM: checksums (e.g., CRC)
● Might be a byzantine fault
● Can be implemented in software or ● Component continues operating in a
hardware (e.g., some CORTEX M3 faulty manner
have error-corrected flash memory) ● “Babbling idiot”
● Memory “scrubbing:” fix errors in ● Difficult to predict or inject
memory in the background
● Check-point schemes, runtime
monitors, self-stabilising algorithms, ...19/37 20/37

Detection of defects Spurious interrupts


● Built-in self-test (BIST);
aka built-in test software (BITS) ● Defective components might
erroneously trigger interrupts
● Permanently run software tests in the
background (when system is idle) ● Can destroy real-time properties if load
becomes too high
● Built-in watchdogs ● Effects can be mitigated by redundancy
● When fault is detected, component can ● Additional checks whether interrupts
be switched off or replaced, fault can are genuine (e.g., confirmation flag
be reported set by component via DMA)
● Interrupts can be disabled if spurious
interrupts are detected
21/37 22/37

Precision of sensors Networking faults


● Most sensors exhibit some amount of ● Various possible problems
noise ● Transmission errors, dropped messages
● E.g., GPS usually off a few meters ● Defective (byzantine) component sends
● Position sensors (like in the elevator spurious messages
lab) might count wrongly ● Crucial: error-correcting protocols
● Improve precision by combining
different sources of information
● Can be tested for:
● Simulation of faulty channels
● Multiple sensors
● Fuzzing techniques
● Expected sensor values, domain knowl.
● Combined using Kalman filters 23/37 24/37
N-version programming

● Implement N completely independent


control systems
General approaches ● Independent hardware
to fault tolerance ● Independently developed software
● Independent programmers
● But same specification
● Main idea: different systems will
contain different kinds of bugs
→ Unlikely that all fail at the same time
25/37 26/37

N-version programming (2) N-version programming (3)

● Main idea: ... different kinds of bugs ● Different topologies possible


● Not so clear whether this is actually ● Voting scheme:
true: Majority wins
Studies show that even independent ● Master/slave
teams tend to make the same mistakes If master system fails, slave takes
over

● Nevertheless: this is one of the


standard techniques

27/37 28/37

Checkpoints Recovery blocks

● Record system state at particular ● Extended version of checkpoints:


points during execution Split computation into recovery blocks,
● E.g., write to file, send to other system which can be rolled back if a fault is
over network detected
● Can be used for diagnostic purposes, or
● Standard version:
after a system failure ensure <acceptance test>
by <primary alternate>
else by <alternate 2>
… …
else by <alternate n>
else error
29/37 30/37

Basic recovery schema Properties of recovery blocks

Enter recovery block:


● Can mitigate various faults
establish checkpoint of Execute one
relevant system state alternate ● Software faults in one of the alternates
Acceptance
condition holds
● Transient hardware faults
Acceptance
condition
● Transient erroneous sensor data
violated ● Timing problems:
first try expensive computation, abort
Restore checkpoint Discard checkpoint if it takes too long and use a faster
(but sub-optimal) variant

31/37 32/37
Properties of recovery blocks (2) Watchdogs
● Of course, introduces new problems
● One of the most important techniques
● Side-effects of alternates can be hard
to undo ● Watchdog is a component that
● Some computations might be monitors the system
impossible to repeat ● If system does not react any more,
● Checkpointing can be expensive watchdog restarts it
● Maybe no time for repetition ● Should be as independent as possible
from system
● But: most micro-controllers have built-
● Concept is related to in watchdogs
transactional memory (often with independent clock)
33/37 34/37

Watchdogs (2) Window watchdog example

● Typically: watchdog works like a timer


● 8-bit timer, counting downward
● Continuously counts up to some limit
● Timer has to be refreshed within a
● If limit is reached, system is restarted
certain “window”
● System has to reset the timer regularly ● Too early or too
to prevent restart late refresh
● STM32 CORTEX M3 has two built-in → System is
watchdogs restarted
● “Independent watchdog” (own clock)
● Refreshing can be
● “Window watchdog”
done using
35/37
an interrupt 36/37

Watchdogs in general

● Watchdogs have to discriminate:


● Normal system execution, from
● System that hangs, or that is running
rogue
● Difficult in general:
Byzantine system can do anything
● Interesting read:
http://www.ganssle.com/watchdogs.htm

37/37