
Industrial Automation Automation Industrielle Industrielle Automation

9.2

Dependability - Evaluation (Estimation de la fiabilité, Verlässlichkeitsabschätzung)
Prof. Dr. H. Kirrmann, ABB Research Center, Baden, Switzerland

2010 May, HK

Dependability Evaluation

This part of the course applies to any system that may fail.

Dependability evaluation (fiabilité prévisionnelle, Verlässlichkeitsabschätzung) determines:
- the expected reliability,
- the requirements on component reliability,
- the repair and maintenance intervals, and
- the amount of necessary redundancy.

Dependability analysis is the basis on which risks are taken and contracts are established.
Dependability evaluation must be part of the design process; it is of little use once a system has been put into service.

Industrial Automation

Dependability Evaluation 9.2 - 2

9.2.1 Reliability definitions

9.2.1 Reliability definitions 9.2.2 Reliability of series and parallel systems 9.2.3 Considering repair 9.2.4 Markov models 9.2.5 Availability evaluation 9.2.6 Examples


Reliability

Reliability = probability that a mission is executed successfully (what counts as success is a question of satisfaction).

Reliability depends on:
- duration (French proverb "tant va la cruche à l'eau...", German "der Krug geht zum Brunnen, bis er bricht": the pitcher goes to the well until it breaks)
- environment: temperature, vibrations, radiation, etc.

$R(0) = 1.0$; $\lim_{t \to \infty} R(t) = 0$

[Figure: R(t) versus time, starting at 1.0, for a laboratory and a vehicle environment at different temperatures]

Such graphics are obtained by observing a large number of systems, or calculated for a system knowing the expected behaviour of the elements.

Reliability and failure rate - Experimental view


Experiment: observe a large quantity of light bulbs.

[Figure: percentage of remaining good bulbs, R(t), versus time, starting at 100%, with the phases infancy, mature and aging and an interval from t to t + Δt marked]

Reliability R(t): number of good bulbs remaining at time t, divided by the initial number of bulbs.
Failure rate λ(t): number of bulbs that failed in the interval [t, t + Δt], divided by the number of bulbs remaining at time t and by the interval length Δt.

Reliability R(t) definition

[State diagram: good -> bad, transition = failure]

Reliability R(t): probability that a system does not enter a terminal state until time t, given that it was in a good state at time t = 0.

$R(0) = 1$; $\lim_{t \to \infty} R(t) = 0$

Failure rate λ(t) = probability that a (still good) element fails during the next time unit dt.

definition: $\lambda(t) = -\dfrac{dR(t)/dt}{R(t)}$, hence $R(t) = e^{-\int_0^t \lambda(x)\,dx}$

MTTF = mean time to fail = area under R(t)

definition: $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$

Assumption of constant failure rate

[Figure: bathtub curve of λ(t) with the phases childhood (burn-in), mature and aging]

Reliability = probability of not having failed until time t, expressed:
- by the discrete expression: $R(t + \Delta t) = R(t) - R(t)\,\lambda(t)\,\Delta t$
- by the continuous expression, simplified when λ = constant: $R(t) = e^{-\lambda t}$

The assumption λ = constant is justified by experience and simplifies computations significantly.

[Figure: $R(t) = e^{-0.001 t}$ (λ = 0.001/h) compared with R(t) for the bathtub failure rate, R axis from 0 to 1, with the MTTF marked on the time axis]

MTTF = mean time to fail = area under R(t):

$\mathrm{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \dfrac{1}{\lambda}$
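The relation between the discrete expression, the continuous expression and the MTTF can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names, the step sizes and the integration horizon of 40/λ are assumptions chosen for illustration; only λ = 0.001/h comes from the slide.

```python
import math


def reliability_continuous(lam: float, t: float) -> float:
    """Continuous expression for a constant failure rate: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def reliability_discrete(lam: float, t_end: float, dt: float) -> float:
    """Iterate the discrete expression R(t + dt) = R(t) - R(t) * lam * dt."""
    r = 1.0
    for _ in range(round(t_end / dt)):
        r -= r * lam * dt
    return r


def mttf_numeric(lam: float, dt: float, t_end: float) -> float:
    """Area under R(t) from 0 to t_end, using the trapezoidal rule."""
    steps = round(t_end / dt)
    area = 0.0
    for k in range(steps):
        left = reliability_continuous(lam, k * dt)
        right = reliability_continuous(lam, (k + 1) * dt)
        area += 0.5 * (left + right) * dt
    return area


lam = 0.001  # failure rate in 1/h, the value used on the slide
r_exact = reliability_continuous(lam, 1000.0)
r_stepped = reliability_discrete(lam, 1000.0, dt=0.1)
mttf = mttf_numeric(lam, dt=1.0, t_end=40.0 / lam)  # beyond 40/lam the tail is negligible

print(f"R(1000 h): continuous {r_exact:.5f}, discrete {r_stepped:.5f}")
print(f"MTTF: numeric {mttf:.2f} h, closed form 1/lam = {1.0 / lam:.2f} h")
```

With a small enough time step the discrete iteration approaches the exponential, and the area under R(t) approaches 1/λ.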

Examples of failure rates

To avoid the negative exponents, values are often given in FIT (Failures in Time): 1 FIT = 10^-9 /h = 1 failure per 114'000 years

Element        Rating              Failure rate
resistor       0.25 W              0.1 fit
capacitor      (dry) 100 nF        0.5 fit
capacitor      (elect.) 100 µF     10 fit
processor      486                 500 fit
RAM            4 MB                1 fit
Flash          4 MB                12 fit
FPGA           5000 gates          80 fit
PLC            compact             6500 fit
digital I/O    32 points           2000 fit
analog I/O     8 points            1000 fit
battery        per element         400 fit
VLSI           per package         100 fit
soldering      per point           0.01 fit

These figures can be obtained from catalogues such as MIL Standard 217F or from the manufacturer's data sheets.
Warning: design failures outweigh hardware failures for small series.

MIL HDBK 217 (1)

MIL Handbook 217B lists failure rates of common elements. Failure rates depend strongly on the environment (temperature, vibration, humidity) and especially on the location:
- Ground: benign, fixed, mobile
- Naval: sheltered, unsheltered
- Airborne: inhabited, uninhabited, cargo, fighter
- Airborne: rotary, helicopter
- Space: flight

Applying MIL HDBK 217 usually yields pessimistic estimates of the overall system reliability (the computed reliability is lower than the actual reliability). To obtain more realistic estimates, it is necessary to collect failure data from the actual application instead of using the generic values from MIL HDBK 217.

Failure rate catalogue MIL HDBK 217 (2)

Stress is expressed by lambda factors. Basic models:
- discrete components (e.g. resistor, transistor): $\lambda = \lambda_b \, \pi_E \, \pi_Q \, \pi_A$
- integrated components (ICs, e.g. microprocessors): $\lambda = \pi_Q \, \pi_L \, (C_1 \, \pi_T \, \pi_V + C_2 \, \pi_E)$

The MIL handbook gives curves and rules for the different element types to compute the factors:
- $\lambda_b$: based on ambient temperature $\Theta_A$ and electrical stress S
- $\pi_E$: based on environmental conditions
- $\pi_Q$: based on production quality and burn-in period
- $\pi_A$: based on component characteristics and usage in the application
- $C_1$: based on the complexity
- $C_2$: based on the number of pins and the type of packaging
- $\pi_T$: based on chip temperature $\Theta_J$ and technology
- $\pi_V$: based on voltage stress

Example: $\lambda_b$ usually grows exponentially with temperature $\Theta_A$ (Arrhenius law).

What can go wrong

poor soldering (manufacturing) broken wire (vibrations)

broken isolation (assembly)



chip cracking (thermal stress)

tin whiskers (lead-free soldering)


Failures that affect logic circuits

- Thermal stress (different dilatation coefficients, contact creeping)
- Electrical stress (electromagnetic fields)
- Radiation stress (high-energy particles, cosmic rays in the high atmosphere)

Errors that are transient in nature (called soft errors) can be latched in memory and become firm errors. Solid errors do not disappear at restart.

E.g. an FPGA with 3 M gates, exposed to 9.3 × 10^8 neutrons/cm², exhibited 320 FIT at sea level and 150000 FIT at 20 km altitude (see: http:\\www.actel.com/products/rescenter/ser/index.html).

Things are getting worse with smaller integrated circuit geometries!

Exercise: Wearout Failures

The development of λ(t) towards the end of the lifetime of a component is usually described by a Weibull distribution: $\lambda(t) = \lambda^{\beta} \, \beta \, t^{\beta - 1}$ with β > 0.

a) Draw the functions λ(t) for the parameters β = 1, 2, 3 in a common coordinate system.
b) Compute the reliability function R(t) from λ(t).
c) Draw the reliability functions for the parameters β = 1, 2, 3 in a common coordinate system.
d) Compare the wearout behavior with the behavior assuming a constant failure rate λ(t) = λ.

Cold, Warm and Hot redundancy


- Hot redundancy: the reserve element is fully operational and under stress; it has the same failure rate as the operating element.
- Warm redundancy: the reserve element can take over in a short time; it is not operational and has a smaller failure rate.
- Cold redundancy (cold standby): the reserve is switched off and has zero failure rate.

[Figure: R(t) of the redundant element and R(t) of the reserve element versus time, with the failure of the primary element and the switchover marked]

9.2.2 Reliability of series and parallel systems (combinatorial)

9.2.1 Reliability definitions 9.2.2 Reliability of series and parallel systems 9.2.3 Considering repair 9.2.4 Markov models 9.2.5 Availability evaluation 9.2.6 Examples


Reliability of a system of unreliable elements

The reliability of a system consisting of n elements, each of which is necessary for the function of the system, and whose elements fail independently, is:

$R_{total} = R_1 \cdot R_2 \cdots R_n = \prod_{i=1}^{n} R_i$

Assuming constant failure rates, the failure rate of the system is simply the sum of the failure rates of the individual components:

$R_{NooN} = e^{-\sum_i \lambda_i t}$

This is the basis for the calculation of the failure rate of systems (MIL-STD-217F).

Example: series system, combinatorial solution


[Figure: controller, inverter / power supply, and motor with encoder in series]

λ_control = 0.00005 h^-1
λ_supply = 0.001 h^-1
λ_motor = 0.0001 h^-1

$R_{tot} = R_{supply} \cdot R_{motor} \cdot R_{control} = e^{-\lambda_{supply} t} \cdot e^{-\lambda_{motor} t} \cdot e^{-\lambda_{control} t} = e^{-(\lambda_{supply} + \lambda_{motor} + \lambda_{control})\, t}$

$\lambda_{total} = \lambda_{supply} + \lambda_{motor} + \lambda_{control} = 0.00115\ \mathrm{h}^{-1}$

Warning: this calculation no longer applies to redundant systems!
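The series calculation is short enough to check in a few lines. This is a minimal sketch using the three failure rates of the example; the dictionary, the function names and the 100-hour mission time are illustrative assumptions.

```python
import math

# Failure rates from the example, in 1/h
rates = {"control": 0.00005, "supply": 0.001, "motor": 0.0001}


def series_failure_rate(rates_per_hour) -> float:
    """For a series system with constant failure rates, the rates simply add up."""
    return sum(rates_per_hour)


def series_reliability(rates_per_hour, t_hours: float) -> float:
    """R(t) = exp(-(sum of rates) * t) for independent elements in series."""
    return math.exp(-series_failure_rate(rates_per_hour) * t_hours)


lam_total = series_failure_rate(rates.values())
mttf_hours = 1.0 / lam_total
r_100h = series_reliability(rates.values(), 100.0)  # 100 h is an arbitrary mission time

# The same value, computed as a product of the element reliabilities
r_product = math.prod(math.exp(-lam * 100.0) for lam in rates.values())

print(f"lambda_total = {lam_total:.5f} 1/h, MTTF = {mttf_hours:.0f} h, R(100 h) = {r_100h:.4f}")
```

The power supply dominates the sum, so improving the controller alone would hardly change the result.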

Exercise: Reliability estimation

An electronic circuit consists of the following elements:

Element                 MTTF             Pins
1 processor             600 years        48 pins
30 resistors            100000 years     2 pins
6 plastic capacitors    50000 years      2 pins
1 FPGA                  300 years        24 pins
2 tantal capacitors     10000 years      2 pins
1 quartz                20000 years      2 pins
1 connector             5000 years       16 pins

The MTTF of one solder point (pin) is 200000 years.

What is the expected Mean Time To Fail of this system?
Repair of this circuit takes 10 hours, replacing it by a spare takes 1 hour. What is the availability in both cases?
The machine where it is used costs 100 per hour, produces 24 hours a day, and has an installation lifetime of 30 years. What should the price of the spare be?

Exercise: MTTF calculation

An embedded controller consists of:
- one microprocessor 486
- 2 x 4 MB RAM
- 1 x Flash EPROM
- 50 dry capacitors
- 5 electrolytic capacitors
- 200 resistors
- 1000 soldering points
- 1 battery for the real-time clock

What is the MTTF of the controller and what is its weakest point? (use the numbers of a previous slide)
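One possible way to set up this kind of calculation is to sum the FIT values per part type. The sketch below takes the FIT values from the failure-rate table of the earlier slide; mapping the Flash EPROM to the "Flash 4 MB" entry and using 8766 hours per year are assumptions made here, and the variable names are illustrative.

```python
FIT = 1e-9              # 1 FIT = 1e-9 failures per hour
HOURS_PER_YEAR = 8766   # assumption: 365.25 days per year

# part type -> (quantity, FIT per item), values from the failure-rate table
parts = {
    "processor 486": (1, 500.0),
    "RAM 4 MB": (2, 1.0),
    "Flash 4 MB": (1, 12.0),          # assumed to match the Flash EPROM
    "dry capacitor": (50, 0.5),
    "electrolytic capacitor": (5, 10.0),
    "resistor": (200, 0.1),
    "soldering point": (1000, 0.01),
    "battery": (1, 400.0),
}

# Series system: every part is needed, so the failure rates add up
contribution = {name: qty * fit for name, (qty, fit) in parts.items()}
total_fit = sum(contribution.values())
mttf_hours = 1.0 / (total_fit * FIT)
mttf_years = mttf_hours / HOURS_PER_YEAR
weakest = max(contribution, key=contribution.get)

print(f"total = {total_fit:.0f} FIT, MTTF = {mttf_years:.0f} years, weakest: {weakest}")
```

Under these assumptions the processor and the battery together account for most of the failure rate, while the thousand solder points contribute very little.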

Redundant, parallel system 1-out-of-2 with no repair - combinatorial solution


Simple redundant system: the system is good if either element (or both) is good.

[Figure: two elements R1 and R2 in parallel; probability square with the cases R1 good / R2 good, R1 good / R2 down, R1 down / R2 good (all ok) and both down]

$R_{1oo2} = R_1 R_2 + R_1 (1 - R_2) + (1 - R_1) R_2$

$R_{1oo2} = 1 - (1 - R_1)(1 - R_2)$

with $R_1 = R_2 = R$: $R_{1oo2} = 2R - R^2$

with $R = e^{-\lambda t}$: $R_{1oo2} = 2e^{-\lambda t} - e^{-2\lambda t}$

R(t) for 1oo2 redundancy

[Figure: R(t) from 0 to 1 for 1oo2 and 1oo1 with λ = 1; time axis t in units of MTTF]

Combinatorial: R1oo2, no repair

Example R1oo2: airplane with two motors
MTTF of one motor = 1000 hours (this value is rather pessimistic)
Flight duration t = 2 hours

- what is the probability that any motor fails?
- what is the probability that both motors did not fail until time t (landing)?

apply:
- $R_{1oo1} = e^{-\lambda t}$: a single motor does not fail: 0.998 (0.2 % chance that it fails)
- $R_{2oo2} = e^{-2\lambda t}$: no motor failure: 0.996 (0.4 % chance that at least one fails)
- $R_{1oo2} = 2e^{-\lambda t} - e^{-2\lambda t}$: both motors fail: 0.0004 % chance

assuming there is no common mode of failure (bad fuel or oil, hail, birds, ...)
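The three figures of the airplane example follow directly from the formulas. A minimal sketch with the values of the example (MTTF of one motor 1000 h, flight of 2 h); the variable names are illustrative.

```python
import math

mttf_motor = 1000.0      # hours, from the example
flight = 2.0             # hours, from the example
lam = 1.0 / mttf_motor   # constant failure rate of one motor

r_one = math.exp(-lam * flight)          # R_1oo1: a given motor survives the flight
r_both = math.exp(-2.0 * lam * flight)   # R_2oo2: no motor fails
r_any = 2.0 * r_one - r_one ** 2         # R_1oo2: at least one motor survives
p_both_fail = 1.0 - r_any                # both motors fail

print(f"single motor survives: {r_one:.4f}")
print(f"no motor failure:      {r_both:.4f}")
print(f"both motors fail:      {100.0 * p_both_fail:.5f} %")
```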

MIF, ARL, reliability of redundant structures

ARL: Acceptable Reliability Level

[Figure: R(t) of the simplex system (R1) and of the system with redundancy (R2) versus time; the ARL line is reached at mission time MT1 (simplex) and MT2 (with redundancy)]

Mission Time Improvement Factor (for a given ARL): MIF = MT2 / MT1

Reliability Improvement Factor (at a given mission time): $RIF = (1 - R_{without}) / (1 - R_{with})$ = quotient of the unreliabilities, so that RIF > 1 means that redundancy helps.

R1oo2 Reliability Improvement Factor

Reliability improvement factor: $RIF = (1 - R_{without}) / (1 - R_{with})$

RIF for a 10 hours mission with λ = 0.001 h^-1: R1oo1 = 0.990; R1oo2 = 0.999901; RIF = 100

[Figure: R(t) from 0 to 1 for 1oo2 and 1oo1]

but:

$MTTF_{1oo2} = \int_0^\infty (2e^{-\lambda t} - e^{-2\lambda t})\,dt = \dfrac{3}{2\lambda}$

No spectacular increase in MTTF! 1oo2 without repair is only suited when the mission time is much shorter than 1/λ.
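The contrast between a large RIF and a modest MTTF gain can be reproduced numerically. This sketch takes RIF as the unreliability without redundancy divided by the unreliability with redundancy, which is the reading that yields the value of about 100 quoted for the 10-hour mission; function names are illustrative.

```python
import math


def r_1oo1(lam: float, t: float) -> float:
    return math.exp(-lam * t)


def r_1oo2(lam: float, t: float) -> float:
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def rif(r_with: float, r_without: float) -> float:
    """Quotient of unreliabilities; values above 1 mean redundancy helps."""
    return (1.0 - r_without) / (1.0 - r_with)


lam = 0.001   # 1/h
t = 10.0      # mission time in hours

rif_10h = rif(r_1oo2(lam, t), r_1oo1(lam, t))

# Closed-form integrals of R(t) from 0 to infinity
mttf_simplex = 1.0 / lam
mttf_1oo2 = 2.0 / lam - 1.0 / (2.0 * lam)

print(f"RIF at 10 h: {rif_10h:.1f}")
print(f"MTTF gain:   {mttf_1oo2 / mttf_simplex:.2f}")
```

The unreliability drops by two orders of magnitude for the short mission, yet the MTTF only grows by a factor of 1.5.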

Combinatorial: 2 out of three system

E.g. three computers, majority voting

[Figure: elements R1, R2, R3 feeding a 2/3 voter; the system works when all three are good or when exactly one is bad]

$R_{2oo3} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$

with identical elements $R_1 = R_2 = R_3 = R$: $R_{2oo3} = 3R^2 - 2R^3$

with $R = e^{-\lambda t}$: $R_{2oo3} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

2 out of 3 without repair - combinatorial solution

[Figure: three elements R1, R2, R3 feeding a 2/3 voter]

$R_{2oo3} = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

$MTTF_{2oo3} = \int_0^\infty (3e^{-2\lambda t} - 2e^{-3\lambda t})\,dt = \dfrac{5}{6\lambda}$

RIF < 1 when t > 0.7 MTTF!

2oo3 without repair is not interesting for long missions.

[Figure: R(t) from 0 to 1 for 1oo1, 1oo2 and 2oo3]
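The point at which 2oo3 becomes worse than a single unit can be located numerically. This is a sketch under the assumption of identical units with constant failure rate; the bisection bounds and the normalisation λ = 1 are choices made here for illustration.

```python
import math


def r_1oo1(lam: float, t: float) -> float:
    return math.exp(-lam * t)


def r_2oo3(lam: float, t: float) -> float:
    r = math.exp(-lam * t)
    return 3.0 * r ** 2 - 2.0 * r ** 3


def crossover_time(lam: float) -> float:
    """Bisection on R_2oo3(t) - R_1oo1(t), which is positive early and negative late."""
    lo, hi = 0.1 / lam, 2.0 / lam
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if r_2oo3(lam, mid) > r_1oo1(lam, mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam = 1.0                                              # time measured in units of the unit MTTF
t_cross = crossover_time(lam)
mttf_2oo3 = 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)      # closed-form integral of R_2oo3

print(f"crossover at t = {t_cross:.3f} MTTF, MTTF_2oo3 = {mttf_2oo3:.3f} MTTF")
```

The crossover sits at ln 2, about 0.69 of the unit MTTF, which is the rounded 0.7 MTTF stated on the slide.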

General case: k out of N Redundancy (1)

K-out-of-N computer (KooN):
- N units perform the function in parallel
- K fault-free units are necessary to achieve a correct result
- N - K units are reserve units, but can also participate in the function

E.g.:
- aircraft with 8 engines: 6 are needed to accomplish the mission.
- voting in computers: if the output is obtained by voting among all N units, N ≤ 2K - 1 (worst-case assumption: all faulty units fail in the same way)

What is better ?

4 motors, three of which are sufficient to accomplish the mission (fly 21 days, MTTF = 10'000 h per motor)

12 motors, 8 of which are sufficient to accomplish the mission (fly 21 days, MTTF = 5'000 h per motor)

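General case: k out of N redundancy (2)

Example with N = 4 (elements R1, R2, R3, R4): the cases are no fail, one of N fails, two of N fail, ..., K of N fail, ..., all fail. Their probabilities sum to 1:

$R^N + \binom{N}{1}(1-R)R^{N-1} + \binom{N}{2}(1-R)^2 R^{N-2} + \dots + \binom{N}{K}(1-R)^K R^{N-K} + \dots + (1-R)^N = 1$

The partial sums give the reliability of N of N, of (N-1) of N, of (N-2) of N, and so on. In general, with up to N - K failed units tolerated:

$R_{KooN} = \sum_{i=0}^{N-K} \binom{N}{i} (1-R)^i R^{N-i}$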
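The general formula lends itself to a small helper, which can also be applied to the "What is better?" question (3 of 4 motors with an MTTF of 10'000 h versus 8 of 12 motors with an MTTF of 5'000 h, for a 21-day flight). This is a sketch assuming identical, independent units with constant failure rate and no common-mode failures; the function name is illustrative.

```python
from math import comb, exp


def r_koon(k: int, n: int, r: float) -> float:
    """Reliability of a k-out-of-n system: at most n - k of the n units may fail."""
    return sum(comb(n, i) * (1.0 - r) ** i * r ** (n - i) for i in range(n - k + 1))


mission_hours = 21 * 24  # 21 days

r_unit_a = exp(-mission_hours / 10_000.0)  # one motor with MTTF = 10'000 h
r_unit_b = exp(-mission_hours / 5_000.0)   # one motor with MTTF = 5'000 h

r_3oo4 = r_koon(3, 4, r_unit_a)
r_8oo12 = r_koon(8, 12, r_unit_b)

print(f"3oo4:  {r_3oo4:.4f}")
print(f"8oo12: {r_8oo12:.4f}")
```

Under these assumptions the 8oo12 arrangement comes out ahead despite the less reliable motors, because it tolerates four failures instead of one.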

Comparison chart

[Figure: R(t) from 0 to 1 versus t for 1oo1, 1oo2, 2oo3, 1oo4, 2oo4, 3oo4 and 8oo12]

What does cross redundancy bring?

[Figure: reliability chains made of controller and network, first as two separate chains, then cross-coupled, then cross-coupled with a block labelled UL between the controllers]

- separate chains: a double fault brings the system down
- cross-coupling: better in principle, since some double faults can be survived
- but cross-coupling needs a switchover logic, so availability sinks again

Summary

Assumes: all units have identical failure rates and the comparison/voting hardware does not fail.

- 1oo1 (non redundant): $R_{1oo1} = R$
- 1oo2 (duplication and error detection): $R_{1oo2} = 2R - R^2$
- 2oo3 (triplication and voting): $R_{2oo3} = 3R^2 - 2R^3$
- kooN (k out of N must work): $R_{KooN} = \sum_{i=K}^{N} \binom{N}{i} R^i (1-R)^{N-i}$

Exercise: 2oo3 considering voter unreliability

Compute the MTTF of the following 2-out-of-3 system with the component failure rates:
- redundant units: λ1 = 0.1 h^-1
- voter unit: λ2 = 0.001 h^-1

[Figure: input feeding three redundant units R1 in parallel, followed by a 2/3 voter R2 that drives the output]

Complex systems

[Figure: reliability block diagram mixing series elements and redundant (parallel) groups, with elements R1 to R9]

Reliability is dominated by the non-redundant parts; as a first approximation, forget the redundant parts.

Exercise: Reliability of Fault-Tolerant Structures

Assume that all units in the sequel have a constant failure rate λ. Compute the reliability functions (and MTTF) for the following structures:
a) non-redundant
b) 1/2 system
c) 2/3 system
assuming perfect (p = 0) voters, error detection, reconfiguration circuits etc.
d) Draw all functions in a common coordinate system.
e) For a railway signalling system, which structure is preferable?
f) Is the answer different for a space application with a given mission time? Why?

9.2.3 Considering repair

9.2.1 Reliability definitions 9.2.2 Reliability of series and parallel systems 9.2.3 Considering repair 9.2.4 Markov Processes 9.2.5 Availability evaluation 9.2.6 Examples


Repair

Fault-tolerance does not improve reliability under all circumstances. It is a solution for short mission durations.

Solution: repair (preventive maintenance, off-line repair, on-line repair)

Example:
- short mission time, high MTTF: pilot, co-pilot
- long mission time, low MTTF: how to reach the stars? (hibernation, reproduction in space)

Problem:
- exchange of faulty parts during operation (safety!)
- reintegration of new parts, teaching and synchronization

Preventive maintenance
[Figure: R(t) versus time, starting at 1, with the interval MTBPM marked; MTBPM = Mean Time Between Preventive Maintenance]

- Preventive maintenance reduces the probability of failure, but does not prevent it.
- In systems with wear, preventive maintenance prevents aging (e.g. replace oil, filters).
- Preventive maintenance is a regenerative process (maintained parts are as good as new).

Considering Repair

Beyond combinatorial reliability, more suitable tools are required. The basic tool is the Markov chain (or Markov process).

9.2.4 Markov models

9.2.1 Reliability definitions 9.2.2 Reliability of series and parallel systems 9.2.3 Considering repair 9.2.4 Markov models 9.2.5 Availability evaluation 9.2.6 Examples


Markov

Describe the system through states, with transitions depending on fault-relevant events.

States must be
- mutually exclusive
- collectively exhaustive

Let $p_i(t)$ = probability of being in state $S_i$ at time t; then $\sum_{\text{all states}} p_i(t) = 1$

The probability of leaving a state depends only on the current state (it is independent of how much time was spent in the state or of how the state was reached).

Example: protection failure
[Markov diagram: normal (OK) -> protection not working (PD) on protection failure, PD -> OK on repair; lightning strikes (rate s) in state OK are not dangerous; a lightning strike (rate s) in state PD leads to danger (DG)]

What is the probability that the protection is down when lightning strikes?

Continuous Markov Chains

[Figure: State 1 (P1) and State 2 (P2), with rate λ from state 1 to state 2 and rate μ from state 2 to state 1]

Time is considered continuous. Instead of transition probabilities, the temporal behavior is given by transition rates (i.e. transition probabilities per infinitesimal time step). A system remains in the same state unless it goes to a different state.

The relationships between state probabilities are modeled by differential equations, e.g.

$dP_1/dt = \mu P_2 - \lambda P_1$, $dP_2/dt = \lambda P_1 - \mu P_2$

for any state (inflow - outflow), with $\lambda_{ki}$ the transition rate from state k to state i:

$\dfrac{dp_i(t)}{dt} = \sum_{k \ne i} \lambda_{ki} \, p_k(t) - \sum_{j \ne i} \lambda_{ij} \, p_i(t)$

Markov - hydraulic analogy

[Figure: hydraulic analogy. Each state is a tank whose level is the probability $p_i(t)$ of being in that state; tanks P1, P3 and P4 feed tank P2 (state S2) at rates λ12, λ32 and λ42; the figure also shows a pump and the inflow from other states]

Output flow = probability of being in a state P × output rate λ of the state

Simplification: output rate $\lambda_j$ = constant (not a critical simplification)

Reliability expressed as state transition

One element:
[Markov diagram: good (P0) -> fail (P1) at rate λ(t)]

$dp_0/dt = -\lambda p_0$
$dp_1/dt = +\lambda p_0$

$R(t) = p_0(t) = e^{-\lambda t}$, with $R(t = 0) = 1$

Arbitrary transitions:
[Markov diagram: non-terminal states (all ok, up1, up2: good) and terminal states (fail1, fail2: down)]

$R(t) = 1 - (p_{fail1} + p_{fail2})$

Reliability and Availability expressed in Markov

Reliability:
[Markov diagram: good -> bad at failure rate λ(t); state versus time: good for the duration MTTF, then bad]
definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions over a given time period"

Availability:
[Markov diagram: up -> down at failure rate λ, down -> up at repair rate μ; state versus time: up, then down for the duration MDT (repair), then up again]
definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions at a given time"

Reliable systems have absorbing states; they may include repair, but eventually they will fail.

Redundancy calculation with Markov: 1 out of 2 (no repair)

Markov:
[Markov diagram: good (P0) -> P1 at rate 2λ, P1 -> fail (P2) at rate λ; λ = constant]

What is the probability that the system is in state S0 or S1 until time t?

Linear differential equations:
$dp_0/dt = -2\lambda p_0$
$dp_1/dt = +2\lambda p_0 - \lambda p_1$
$dp_2/dt = +\lambda p_1$

initial conditions: $p_0(0) = 1$ (initially good), $p_1(0) = 0$, $p_2(0) = 0$

Solution:
$p_0(t) = e^{-2\lambda t}$
$p_1(t) = 2e^{-\lambda t} - 2e^{-2\lambda t}$

$R(t) = p_0(t) + p_1(t) = 2e^{-\lambda t} - e^{-2\lambda t}$ (same result as combinatorial - QED)
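The agreement between the Markov model and the combinatorial result can also be checked by integrating the three differential equations numerically. The sketch below uses a hand-written fourth-order Runge-Kutta step; the integrator, the step count and the values λ = 1 and t = 1 are assumptions made for illustration.

```python
import math


def derivatives(p, lam):
    """Right-hand side of the 1oo2 (no repair) state equations."""
    p0, p1, p2 = p
    return (-2.0 * lam * p0, 2.0 * lam * p0 - lam * p1, lam * p1)


def rk4_step(p, lam, dt):
    """One classical Runge-Kutta step for the state probability vector."""
    k1 = derivatives(p, lam)
    k2 = derivatives(tuple(x + 0.5 * dt * k for x, k in zip(p, k1)), lam)
    k3 = derivatives(tuple(x + 0.5 * dt * k for x, k in zip(p, k2)), lam)
    k4 = derivatives(tuple(x + dt * k for x, k in zip(p, k3)), lam)
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(p, k1, k2, k3, k4)
    )


def integrate(lam, t_end, steps=1000):
    p = (1.0, 0.0, 0.0)  # initially good: p0(0) = 1
    dt = t_end / steps
    for _ in range(steps):
        p = rk4_step(p, lam, dt)
    return p


lam, t = 1.0, 1.0
p0, p1, p2 = integrate(lam, t)
r_markov = p0 + p1
r_combinatorial = 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)

print(f"Markov: {r_markov:.6f}, combinatorial: {r_combinatorial:.6f}")
```

The same integrator can be reused for models with repair by changing only the `derivatives` function.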

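Reliable 1-out-of-2 with on-line repair (1oo2)

[Markov diagram: good (P0) -> P1 (S1: on-line unit failed) and P0 -> P2 (S2: back-up unit failed), with failure rates λn, λb and repair rates μn, μb back to P0; P1 -> P3 (fail) when the back-up also fails, P2 -> P3 when the on-line unit fails]

with λn = λb = λ and μn = μb = μ:

$dp_0/dt = -2\lambda p_0 + \mu p_1 + \mu p_2$
$dp_1/dt = +\lambda p_0 - (\lambda + \mu) p_1$
$dp_2/dt = +\lambda p_0 - (\lambda + \mu) p_2$
$dp_3/dt = +\lambda p_1 + \lambda p_2$

is equivalent to:

[Markov diagram: P0 -> P12 at rate 2λ, P12 -> P0 at rate μ, P12 -> P3 (fail) at rate λ]

$dp_0/dt = -2\lambda p_0 + \mu p_{1+2}$
$dp_{1+2}/dt = +2\lambda p_0 - (\lambda + \mu) p_{1+2}$
$dp_3/dt = +\lambda (p_1 + p_2)$

It is easier to model with a repair team for each failed unit (no serialization of repair).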

Reliable 1-out-of-2 with on-line repair (1oo2)

What is the probability that a system fails while one failed element awaits repair?

Markov:
[Markov diagram: P0 -> P1 at failure rate 2λ, P1 -> P0 at repair rate μ, P1 -> P2 at failure rate λ; P2 is the absorbing state]

Linear differential equations:
$dp_0/dt = -2\lambda p_0 + \mu p_1$
$dp_1/dt = +2\lambda p_0 - (\lambda + \mu) p_1$
$dp_2/dt = +\lambda p_1$

initial conditions: $p_0(0) = 1$ (initially good), $p_1(0) = 0$, $p_2(0) = 0$

Ultimately, the absorbing states will be filled and the non-absorbing states will be empty.

Results: reliability R(t) of 1oo2 with repair rate μ

$R(t) = P_0 + P_1 = \dfrac{(3\lambda + \mu) + W}{2W}\, e^{-\frac{(3\lambda + \mu - W)}{2}\, t} - \dfrac{(3\lambda + \mu) - W}{2W}\, e^{-\frac{(3\lambda + \mu + W)}{2}\, t}$

with: $W = \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}$

[Figure: R(t) from 0 to 1 versus time in hours for λ = 0.01 and μ = 10 h^-1, 1.0 h^-1 and 0.1 h^-1, compared with 1oo2 without repair; repair does not interrupt the mission; short mission times are not considered]

R(t) is accurate, but not very helpful: MTTF is a better index for long mission times.
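The closed-form expression can be evaluated directly. The sketch below writes the two exponents as the eigenvalues of the 1oo2-with-repair state equations, -(3λ + μ ∓ W)/2 with W = sqrt(λ² + 6λμ + μ²), so that R(0) = 1 and the no-repair case μ = 0 reduces to 2e^(-λt) - e^(-2λt); the function name and the evaluation time of 100 h are illustrative assumptions.

```python
import math


def r_1oo2_repair(lam: float, mu: float, t: float) -> float:
    """R(t) = p0(t) + p1(t) for 1oo2 with on-line repair at rate mu."""
    a = 3.0 * lam + mu
    w = math.sqrt(lam ** 2 + 6.0 * lam * mu + mu ** 2)
    slow = (a + w) / (2.0 * w) * math.exp(-(a - w) / 2.0 * t)
    fast = (a - w) / (2.0 * w) * math.exp(-(a + w) / 2.0 * t)
    return slow - fast


lam = 0.01  # 1/h, as in the plotted example
for mu in (10.0, 1.0, 0.1, 0.0):
    print(f"mu = {mu:5.1f} 1/h -> R(100 h) = {r_1oo2_repair(lam, mu, 100.0):.4f}")
```

Faster repair flattens the curve: the slow exponent is roughly 2λ²/μ when μ is much larger than λ.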

Mean Time To Fail (MTTF)

[Markov diagram with states P0 to P4, partitioned into non-absorbing states i and absorbing states j]

$MTTF = \sum_{\text{non-absorbing states } i} \int_0^\infty p_i(t)\,dt$

[Figure: R(t) from 1.0 to 0 over time 0 to 14; the MTTF is the area under the curve]

MTTF calculation in Laplace (example 1oo2)

Laplace transform, with initial condition $p_0(t = 0) = 1$ (initially good):

$sP_0(s) - p_0(t{=}0) = -2\lambda P_0(s) + \mu P_1(s)$
$sP_1(s) - 0 = +2\lambda P_0(s) - (\lambda + \mu) P_1(s)$
$sP_2(s) - 0 = +\lambda P_1(s)$

Apply the boundary (final value) theorem:

$\lim_{t \to \infty} \int_0^t p(\tau)\,d\tau = \lim_{s \to 0} P(s)$

Only include the non-absorbing states (number of equations = number of non-absorbing states):

$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda P_0 - (\lambda + \mu) P_1$

Solution of the linear equation system:

$MTTF = P_0 + P_1 = \dfrac{\mu + \lambda}{2\lambda^2} + \dfrac{1}{\lambda} = \dfrac{\mu/\lambda + 3}{2\lambda}$
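The last step is an ordinary linear system, so it can be solved mechanically. The sketch below uses a small hand-written Gaussian elimination to stay self-contained; the solver, the function names and the values λ = 0.01 h^-1 and μ = 1 h^-1 are assumptions for illustration, and the result is compared with the closed form μ/(2λ²) + 3/(2λ).

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def mttf_1oo2_repair(lam: float, mu: float) -> float:
    """Sum of the integrals of the non-absorbing states P0 and P1."""
    matrix = [
        [-2.0 * lam, mu],          # -1 = -2 lam P0 + mu P1
        [2.0 * lam, -(lam + mu)],  #  0 = +2 lam P0 - (lam + mu) P1
    ]
    p0, p1 = solve_linear(matrix, [-1.0, 0.0])
    return p0 + p1


lam, mu = 0.01, 1.0
mttf = mttf_1oo2_repair(lam, mu)
closed_form = mu / (2.0 * lam ** 2) + 3.0 / (2.0 * lam)

print(f"MTTF = {mttf:.1f} h (closed form {closed_form:.1f} h)")
```

For larger models only the matrix changes: one row per non-absorbing state, with -1 on the right-hand side for the initial state.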

General equation for calculating MTTF


1) Set up the differential equations
2) Identify the terminal (absorbing) states
3) Set up the Laplace transform for the non-absorbing states:
   $(-1, 0, 0, \dots)^T = M \, P_{na}$
   The degree of the equation is equal to the number of non-absorbing states.
4) Solve the linear equation system
5) The MTTF of the system is equal to the sum of the non-absorbing state integrals.
6) To compute the probability of not entering a certain state, assign a dummy (very low) repair rate to all other absorbing states and recalculate the matrix

Example 1oo2 control computer in standby

[Figure: input, an on-line unit and an idle stand-by unit, each with error detection (ED), and output; the error detection (also of idle parts) has coverage = c; the repair rate is the same for both units]

Correct diagram for 1oo2

Consider that the failure rate of a device in a 1oo2 system is divided into two failure rates:

1) a benign failure, immediately discovered, with probability c
- if the device is on-line, switchover to the stand-by device is successful and repair is called
- if the device is on stand-by, repair is called

2) a malicious failure, which is not discovered, with probability (1-c)
- if the device is on-line, switchover to the stand-by device fails and the system fails
- if the device is on stand-by, switchover will be unsuccessful should the on-line device fail

[Markov diagram with states P0 to P3 and the failure rates λw and λs of the two units: P0 -> P1 at rate (λw + λs) c, P1 -> P0 at repair rate μ, undetected failures at rates λw (1-c) and λs (1-c); P3 is the absorbing state]

States:
- 1: on-line fails, fault detected (successful switchover and repair), or standby fails, fault detected, successful repair
- 2: standby fails, fault not detected
- 3: both fail, system down

With λw = λs = λ:

$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda c P_0 - (\lambda + \mu) P_1$
$0 = +\lambda (1 - c) P_0 - \lambda P_2$

$MTTF = \dfrac{(2 + c) + (2 - c)\,\mu/\lambda}{2\,(\lambda + (1 - c)\,\mu)}$

Approximation found in the literature

This simplified diagram considers that an undetected failure of the spare immediately causes a system failure (simplified with λw = λs = λ).

[Markov diagram: P0 -> P1 at rate 2λc, P1 -> P0 at repair rate μ, P0 -> P3 at rate 2λ(1-c), P1 -> P3 at rate λ; P3 (merged with P2) is the absorbing state]

Applying Markov to the non-absorbing states:

$-1 = -2\lambda P_0 + \mu P_1$
$0 = +2\lambda c P_0 - (\lambda + \mu) P_1$

(the absorbing state is fed by $2\lambda (1 - c) P_0 + \lambda P_1$)

$MTTF = \dfrac{(1 + 2c) + \mu/\lambda}{2\,(\lambda + (1 - c)\,\mu)}$

The results are nearly the same as with the previous four-state model, showing that state 2 has a very short duration.

Influence of coverage (2)

[Figure: MTTF in years (0 to 600000) versus coverage c]

Example: λ = 10^-5 h^-1 (MTTF = 11.4 years), μ = 1 h^-1
MTTF with perfect coverage = 570468 years

When coverage falls below 60%, the redundant (1oo2) system performs no better than a simplex one!
Therefore, coverage is a critical success factor for redundant systems!
In particular, redundancy is useless if the failure of the spare remains undetected (lurking error).

Limits (from the simplified three-state formula):
$\lim_{c \to 1} MTTF = \dfrac{\mu}{2\lambda^2} + \dfrac{3}{2\lambda}$
$\lim_{\lambda/\mu \to 0} MTTF = \dfrac{1}{2\lambda (1 - c)}$ for c < 1

Application: 1oo2 for drive-by-wire

- Coverage is assumed to be the probability that the self-check detects an error in the controller.
- When the self-check detects an error, it passivates the controller (the output is disconnected) and the other controller takes control.
- One assumes that an accident occurs if both controllers act differently, i.e. if a computer does not fail to silent behaviour.
- The self-check is not instantaneous, and there is a probability that the self-check logic is not operational and fails in underfunction (overfunction is an availability issue).

[Figure: two channels, each with a control unit and a self-check unit, labelled a1 and a2]

Results 1oo2c, applied to drive-by-wire

λ = failure rate of one chain (sensor to brake) = 10^-5 h^-1 (MTTF ≈ 11.4 years)
c = coverage: variable (expressed as uncoverage: 3 nines = 99.9 % detected)
μ = repair rate = parameter:
- 1 second: reboot and restart
- 6 minutes: go to the side and stop
- 30 minutes: go to the next garage

[Figure: log(MTTF) from 0 to 16 versus uncoverage in number of nines (1 = poor to 10 = excellent), with one curve each for repair times of 1 second, 6 minutes and 30 minutes; marked are 0.1% undetected and the level of 1 million years, i.e. once per year on a million vehicles]

Conclusion: the repair interval does not matter when coverage is poor.

Protection system (general)

In protection systems, the dangerous situation occurs when the plant is threatened (e.g. short circuit) and the protection device is unable to respond. The threat is a stochastic event, therefore it can be treated as a failure event.

[Markov diagram: normal (OK) -> protection down (PD) on protection failure, PD -> OK after detection and repair; a threat to the plant (rate s) in state OK is not dangerous; a threat to the plant (rate s) in state PD leads to danger (DG)]

The repair rate includes the detection time t! This directly impacts the maintenance rate.
What is an acceptable repair interval?

Note: another way to express the reliability of a protection system will be shown under availability.

Protection system: how to compute test intervals


Rates:
- λ1 = overfunction of protection
- λ2 = lurking overfunction
- λ3 = lurking underfunction
- s = plant suffers attack (plant threat)
- t = test rate (e.g. 1/6 months)
- μ = repair rate (e.g. 1/8 hours)

[Markov diagram with states P0 to P6, P0 = normal: protection failed by immediate overfunction (plant down, single fault, then repaired); lurking overfunction (unwanted trip at the next attack; plant down, double fault); lurking underfunction, which on a plant threat becomes protection failed by underfunction (fail-to-trip, danger); errors are detected at the test rate and then repaired; a second threat in quick succession (s2) is unlikely; the plant-down states are the unavailable states]

Since back-up protection systems exist, utilities are more concerned with the non-productive states.

9.2.5 Availability evaluation

9.2.1 Reliability definitions 9.2.2 Reliability of series and parallel systems 9.2.3 Considering repair 9.2.4 Markov models 9.2.5 Availability evaluation 9.2.6 Examples


Availability

[Figure: timeline alternating between up and down periods]

Availability expresses how often a piece of repairable equipment is functioning; it depends on the failure rate λ and the repair rate μ.

Punctual availability = probability that the system is working at time t (not relevant for most processes).
Stationary availability = duty cycle (impacts financial results).

$A = \text{availability} = \lim_{t \to \infty} \dfrac{\sum \text{up times}}{\sum (\text{up times} + \text{down times})} = \dfrac{MTTF}{MTTF + MTTR}$

Unavailability is the complement of availability (U = 1.0 - A) and is often the more convenient expression
(e.g. 5 minutes downtime per year corresponds to an availability of 99.999 %).

Assumption behind the model: renewable system

$R(t) \le A(t)$ due to repair or preventive maintenance (exchange of parts that did not yet fail)

[Figure: A(t) between 0 and 1 versus time; after repair, the system is as good as new]

Stationary availability over the lifetime: $A = \dfrac{MTTF}{MTTF + MTTR}$

Examples of availability requirements

Application              Requirement                    Allowed downtime
substation automation    availability > 99.95 %         4 hours per year
telecom power supply     unavailability 5 * 10^-7       15 seconds per year

Availability expressed in Markov states

[Markov diagram with states P0 to P4, partitioned into up states i and down states j (non-absorbing)]

$\text{Availability} = \sum_{\text{up states } i} p_i(t = \infty)$

$\text{Unavailability} = \sum_{\text{down states } j} p_j(t = \infty)$

Availability of repairable system

Markov states:
[Markov diagram: P0 -> P1 at failure rate λ, P1 -> P0 at repair rate μ; P1 is a down state (but not absorbing)]

$dp_0/dt = -\lambda p_0 + \mu p_1$
$dp_1/dt = +\lambda p_0 - \mu p_1$

stationary state: $\lim_{t \to \infty} dp_0/dt = dp_1/dt = 0$

due to linear dependency, add the condition: $p_0 + p_1 = 1$

$A = \dfrac{1}{1 + \lambda/\mu}$, unavailability $U = (1 - A) = \dfrac{1}{1 + \mu/\lambda}$

e.g.
MTBF = 100 years -> λ = 1 / (100 * 8765) h^-1
MTTR = 72 h -> μ = 1/72 h^-1
-> A = 99.991 % -> U = 43 min / year

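Example: Availability of 1oo2 (1 out-of-2)

Markov states:
[Markov diagram: P0 -> P1 at rate 2λ, P1 -> P2 at rate λ, P1 -> P0 at rate μ, P2 -> P1 at rate 2μ; P2 is a down state (but not absorbing)]

assumption: devices can be repaired independently (little impact when λ is much smaller than μ)

$dp_0/dt = -2\lambda p_0 + \mu p_1$
$dp_1/dt = +2\lambda p_0 - (\lambda + \mu) p_1 + 2\mu p_2$
$dp_2/dt = +\lambda p_1 - 2\mu p_2$

stationary state: $\lim_{t \to \infty} dp_0/dt = dp_1/dt = dp_2/dt = 0$

due to linear dependency, add the condition: $p_0 + p_1 + p_2 = 1$

Solving these equations gives

$A = \dfrac{1}{1 + \dfrac{\lambda^2}{\mu^2 + 2\lambda\mu}}$, unavailability $U = (1 - A) = \dfrac{(\lambda/\mu)^2}{(1 + \lambda/\mu)^2} \approx (\lambda/\mu)^2$ when U is much smaller than 1

e.g.
MTBF = 100 years -> λ = 1 / (100 * 8765) h^-1
MTTR = 72 h -> μ = 1/72 h^-1
-> A = 99.9999993 % -> U = 0.2 s / year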

Availability calculation
1) Set up differential equations for all states 2) Identify up and down states (no absorbing states allowed !) 3) Remove one state equation save one (arbitrary, for numerical reasons take unlikely state) 4) Add as first equation the precondition: 1 = p (all states)

1 0 0 ..

= M Pall

5) The degree of the equation is equal to the number of states 6) Solve the linear equation system, yielding the % of time each state is visited 7) The unavailability is equal to the sum of the down states

We do not use Laplace for calculating the availability !


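1oo2 including coverage

Markov states: P0, P1 and P2 (down state, but not absorbing). Transitions: P0 -> P1 with rate $2\lambda c$, P0 -> P2 with rate $2\lambda(1-c)$, P1 -> P2 with rate $\lambda$, P1 -> P0 with rate $\mu$, P2 -> P1 with rate $2\mu$.

assumption: devices can be repaired independently (little impact when $\lambda \ll \mu$)

$\frac{dp_0}{dt} = -2\lambda p_0 + \mu p_1$
$\frac{dp_1}{dt} = +2\lambda c\, p_0 - (\lambda + \mu) p_1 + 2\mu p_2$
$\frac{dp_2}{dt} = +2\lambda(1-c)\, p_0 + \lambda p_1 - 2\mu p_2$

stationary state: $\lim_{t \to \infty} \frac{dp_0}{dt} = \frac{dp_1}{dt} = \frac{dp_2}{dt} = 0$

due to linear dependency add condition: $p_0 + p_1 + p_2 = 1$

The stationary equations give $p_1 = 2(\lambda/\mu)\,p_0$ and $p_2 = \left[(1-c)(\lambda/\mu) + (\lambda/\mu)^2\right] p_0$, hence

$A = p_0 + p_1$

unavailability $U = (1 - A) = p_2 = \frac{(1-c)(\lambda/\mu) + (\lambda/\mu)^2}{1 + (3-c)(\lambda/\mu) + (\lambda/\mu)^2}$

$\lim_{\mu/\lambda \gg 1} U = (1-c)\,\lambda/\mu + (\lambda/\mu)^2$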
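To see how strongly imperfect coverage degrades a 1oo2 system, the stationary equations of this slide can be evaluated for several values of c. The sketch below assumes the three differential equations given for the coverage model and reuses the numerical values of the earlier examples (MTBF = 100 years, MTTR = 72 h); the coverage values in the loop are arbitrary illustrations.

```python
HOURS_PER_YEAR = 8765  # value used on the slides


def one_oo_two_with_coverage(lam, mu, c):
    """Return (p0, p1, p2) of the 1oo2 model with coverage factor c."""
    w0 = 1.0
    w1 = 2.0 * lam / mu * w0                                    # from dp0/dt = 0
    w2 = (2.0 * lam * (1.0 - c) * w0 + lam * w1) / (2.0 * mu)   # from dp2/dt = 0
    total = w0 + w1 + w2                                        # p0 + p1 + p2 = 1
    return w0 / total, w1 / total, w2 / total


lam = 1.0 / (100 * HOURS_PER_YEAR)
mu = 1.0 / 72

results = {}
for c in (1.0, 0.99, 0.9):
    p0, p1, p2 = one_oo_two_with_coverage(lam, mu, c)
    results[c] = (p0, p1, p2)
    seconds = p2 * HOURS_PER_YEAR * 3600
    print(f"c = {c:4.2f}: U = {p2:.3e}, downtime = {seconds:7.1f} s / year")
```

With c = 1 the model reduces to the 1oo2 case without coverage. For smaller c the term $(1-c)\lambda/\mu$ dominates the unavailability, because $\lambda/\mu$ is much larger than $(\lambda/\mu)^2$.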


Exercise

A repairable system has a constant failure rate $\lambda = 10^{-4}$ / h. Its mean time to repair (MTTR) is one hour.

a) Compute the mean time to failure (MTTF).
b) Compute the MTBF and compare with the MTTF.
c) Compute the stationary availability.

Assume that the unavailability has to be halved. How can this be achieved
d) by only changing the repair time?
e) by only changing the failure rate?

f) Make a drawing that shows how a varying repair time influences availability.
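One possible way to check the answers numerically is sketched below. It assumes the common convention MTBF = MTTF + MTTR and the stationary availability A = MTTF / (MTTF + MTTR); if the course defines MTBF differently, part b) changes accordingly. A printed table stands in for the drawing of part f).

```python
lam = 1e-4   # failure rate in 1/h
mttr = 1.0   # mean time to repair in h

# a) to c), assuming MTBF = MTTF + MTTR
mttf = 1.0 / lam
mtbf = mttf + mttr
availability = mttf / mtbf
unavailability = 1.0 - availability

# d) and e): solve U = MTTR / (MTTF + MTTR) for one quantity at a time
target = unavailability / 2.0
mttr_needed = target * mttf / (1.0 - target)   # d) repair time only
mttf_needed = mttr * (1.0 - target) / target   # e) failure rate only
lam_needed = 1.0 / mttf_needed

print(f"MTTF = {mttf:.0f} h, MTBF = {mtbf:.0f} h, A = {availability:.6f}")
print(f"d) MTTR = {mttr_needed:.4f} h   e) lambda = {lam_needed:.4e} / h")

# f) availability as a function of the repair time
for hours in (0.1, 0.5, 1, 2, 5, 10, 100):
    print(f"MTTR = {hours:6.1f} h -> A = {mttf / (mttf + hours):.6f}")
```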


9.2.6 Examples


Exercise: Markov diagram

[Figure: Markov diagram with five states, numbered 0 to 4]

Is this a reliable or an available system?

Set up the differential equations for this Markov model.

Compute the probability of not reaching state 4 (set up equations).


Case study: Swiss Locomotive 460 control system availability

[Figure: control units and I/O system on the MVB; each unit exists as a normal member N and a reserve member R]

Assumption: each unit has a back-up unit which is switched on when the on-line unit fails.
The error detection coverage c of each unit is imperfect.
The switchover is not always bumpless: when the back-up unit has not been correctly updated, the main switch trips and the locomotive is stuck on the track.

What is the probability of the locomotive being stuck on the track?


Markov model: SBB Locomotive 460 availability


[Figure: Markov model. State and event labels: all OK (P0), member N fails, member N failure detected, member R fails, member R failure detected, member R fails undetected, bumpless takeover, takeover unsuccessful, member R on-line, train stop and reboot, stuck on track. The transitions carry the parameters listed below.]

Parameters:
$\lambda = 10^{-4}$ h^-1: failure rate of member N or member R (MTTF is 10000 hours, about 1.1 years)
$\mu = 0.1$ h^-1: repair rate of member N or member R (repair takes 10 hours, including travel to the works)
c = 0.9: probability of detected failure, coverage factor (9 out of 10 errors are detected)
b = 0.9: probability of bumpless recovery, train continues (9 out of 10 take-overs are successful)
s = 0.01: probability of unsuccessful recovery, train stuck (1 failure in 100 cannot be recovered)
r = 10 h^-1: reboot rate (mean time to reboot and restart the train is 6 minutes)
p = 1/8765 h^-1: rate of periodic maintenance checks (mean time to periodic maintenance is one year)


SBB Locomotive 460 results


How the down-time is shared:
OK after reboot: 61%
Stuck: 2nd failure before maintenance: 32%
unsuccessful recovery: 7%
Stuck: 2nd failure before repair: 0.0009%
Stuck: after reboot: 0.00045%

Under these conditions:
unavailability will be 0.5 hours a year.
stuck on track is once every 20 years.
recovery will be successful 97% of the time.

recommendation: increase coverage by using members N and R alternately (at least at every start-up)

Example protection device

[Figure: protection device connected to a current sensor and a circuit breaker]


Probability to Fail on Demand for safety (protection) system

IEC 61508 characterizes a protection device by its Probability to Fail on Demand (PFD):

PFD = (1 - availability of the non-faulty system) (State 0)


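[Figure: Markov model with states P0 (good), P1 (underfunction), P3 (overfunction, plant down) and P4 (plant damaged). A protection failure leads to underfunction with probability u and to overfunction with probability (1-u).]

u = probability of underfunction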


Protection system with error detection (self-test) 1oo1


[Figure: Markov model with states P0 (normal), P1, P2, P3 (overfunction) and P4 (danger). From P0: rate $\lambda(1-u)$ to P3, rate $\lambda u c$ to P1, rate $\lambda u (1-c)$ to P2. P1 returns to P0 with rate R, P2 with rate T.]

$\lambda$: protection failure rate
u: probability of underfunction [IEC 61508: 50%]
c: coverage, probability of failure detection by self-check

P1: protection failed in underfunction, failure detected by self-check (instantaneous), repaired with rate R = 1/MRT
P2: protection failed in underfunction, failure detected by periodic check with rate T = 2/TestPeriod
P3: protection failed in overfunction, plant down
P4: system threatened, protection inactive, danger

$PFD = 1 - P_0 = 1 - \frac{1}{1 + \frac{\lambda u (1-c)}{T} + \frac{\lambda u c}{R}} \approx \lambda u \left( \frac{1-c}{T} + \frac{c}{R} \right)$

with:
$\lambda = 10^{-7}$ h^-1
MTTR = 8 hours -> R = 0.125 h^-1
Test Period = 3 months -> T = 2/4380 h^-1
coverage = 90%

PFD = 1.1 * 10^-5
for S1 and S2 to have same probability: c = 99.8% !
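The PFD value quoted above can be reproduced with the short sketch below. It takes the rates exactly as printed (R = 0.125 per hour, T = 2/4380 per hour) and assumes u = 0.5, the IEC 61508 default mentioned on the slide; the function name is hypothetical.

```python
def pfd_1oo1(lam, u, c, repair_rate, test_rate):
    """Return (exact, approximate) PFD of the 1oo1 protection model with self-test."""
    # Probability mass of P1 and P2 relative to P0.
    x = lam * u * (1.0 - c) / test_rate + lam * u * c / repair_rate
    exact = 1.0 - 1.0 / (1.0 + x)   # PFD = 1 - P0
    approx = x                       # first-order approximation for x << 1
    return exact, approx


exact, approx = pfd_1oo1(
    lam=1e-7,               # protection failure rate in 1/h
    u=0.5,                  # assumed: IEC 61508 default share of underfunctions
    c=0.9,                  # self-check coverage
    repair_rate=0.125,      # R = 1 / (8 h)
    test_rate=2.0 / 4380.0  # T as printed on the slide, in 1/h
)

print(f"PFD exact  = {exact:.3e}")
print(f"PFD approx = {approx:.3e}")
```

With these numbers the undetected underfunctions, which wait for the periodic test, contribute far more to the PFD than the failures caught by the self-check.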

Example: Protection System

[Figure: the inputs feed tripping algorithm 1 and tripping algorithm 2; the two outputs are combined by an AND gate (&) into the trip signal]

overfunctions reduced: $P_{over} = P_o^2$
underfunctions increased: $P_{under} = 2 P_u - P_u^2$

[Figure: same arrangement with an additional comparison of the two tripping algorithms, which triggers repair]

dynamic modeling necessary
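The static part of this example is easy to check in code. The sketch below assumes that both tripping algorithms fail independently and with the same probabilities; the numerical values are purely illustrative and not taken from the slides.

```python
def and_of_two(p_over, p_under):
    """Failure probabilities when the trip signal is the AND of two algorithms."""
    over = p_over ** 2                     # both must overfunction to trip wrongly
    under = 2.0 * p_under - p_under ** 2   # one underfunction blocks the trip
    return over, under


p_o = 0.01  # hypothetical overfunction probability of one algorithm
p_u = 0.01  # hypothetical underfunction probability of one algorithm

over, under = and_of_two(p_o, p_u)
print(f"overfunction:  {p_o} -> {over:.6f}")
print(f"underfunction: {p_u} -> {under:.6f}")
```

The underfunction formula is the probability that at least one of the two algorithms fails to trip, which equals 1 - (1 - Pu)^2.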


Markov Model for a protection system

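[Figure: Markov model with the states OK; detectable error (1 chain, repair); latent overfunction (1 chain, not detectable); latent underfunction (2 chains, not detectable); latent underfunction (not detectable); overfunction; underfunction. Transition labels include $(\lambda_1+\lambda_2)c+\lambda_3$, $(\lambda_1+\lambda_2+\lambda_3)c$, $(\lambda_1+\lambda_2)(1-c)$, $\lambda_1(1-c)$, $\lambda_2(1-c)$, $\lambda_3(1-c)$, $s_1+\lambda_1(1-c)$ and $s_2$.]

$\lambda_1 = 0.01$, $\lambda_2 = \lambda_3 = 0.025$, $s_1 = 5$, $s_2 = 1$, $\mu = 365$, c = 0.9 (rates in 1/Y)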

Analysis Results

[Figure: mean time to underfunction [Y] (axis ticks 200, 300, 400) plotted against mean time to overfunction [Y] (axis ticks 50, 500, 5000) for four strategies: weekly test; 2-yearly test; permanent comparison (SW), assumption: SW error-free; permanent comparison (red. HW)]


Example: CIGRE model of protection device with self-check


[Figure: CIGRE Markov model with states S1 to S11, grouped into PLANT DOWN SINGLE FAULT, PLANT DOWN DOUBLE FAULT and DANGER, with branches for self-check overfunction and self-check underfunction]

P8, P9: error detection failed
P10, P11: failure detectable by self-check
P4, P3: failure detectable by inspection


Summary: difference reliability - availability

[Figure: example Markov diagrams for both cases, with states labelled all ok, good, up, down and fail]

Reliability
look for: Mean Time To Fail (integral over time of all non-absorbing states)
set up linear equation with s = 0, initial conditions S(T = 0) = 1.0
solve linear equation

Availability
look for: stationary availability $A(t = \infty)$ (duty cycle in UP states)
set up differential equation (no absorbing states!)
initial condition is irrelevant
solve stationary case with $\sum p = 1$
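The two procedures can be contrasted on one small model. The sketch below is an illustration under stated assumptions: it uses a 1oo2 system with $\lambda$ = 1/(100 * 8765) per hour and $\mu$ = 1/72 per hour, treats P2 as absorbing with a single repair transition P1 -> P0 for the reliability question, and treats P2 as repairable (rate $2\mu$) for the availability question. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8765


def mttf_1oo2(lam, mu):
    """MTTF of 1oo2 with repair, P2 absorbing.

    Laplace-transformed equations at s = 0 with p0(0) = 1, where Tk is the
    integral over time of pk(t):
        -1 = -2*lam*T0 + mu*T1
         0 = +2*lam*T0 - (lam + mu)*T1
    """
    a11, a12, b1 = -2.0 * lam, mu, -1.0
    a21, a22, b2 = 2.0 * lam, -(lam + mu), 0.0
    det = a11 * a22 - a12 * a21          # Cramer's rule for the 2x2 system
    t0 = (b1 * a22 - a12 * b2) / det
    t1 = (a11 * b2 - a21 * b1) / det
    return t0 + t1                        # time spent in non-absorbing states


def unavailability_1oo2(lam, mu):
    """Stationary unavailability of 1oo2 when P2 is repaired with rate 2*mu."""
    return (lam / (lam + mu)) ** 2


lam = 1.0 / (100 * HOURS_PER_YEAR)
mu = 1.0 / 72

mttf = mttf_1oo2(lam, mu)
U = unavailability_1oo2(lam, mu)
print(f"reliability view:  MTTF = {mttf / HOURS_PER_YEAR:.0f} years")
print(f"availability view: U = {U:.3e}")
```

The first result is a time until the first system failure, the second a long-run fraction of time spent down. They answer different questions and are not interchangeable.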


Exercise: set up the Markov model for this system

A brake can fail open or fail closed.
A car is unable to brake if both brakes fail open.
A car is unable to cruise if any of the brakes fails closed.
A fail-open brake is detected at the next service (rate $\mu$).
There is a hydraulic and an electric brake.

electric brake: $\lambda_e = 10^{-5}$ h^-1, ce = 0.9 (99% fail close)
hydraulic brake: $\lambda_h = 10^{-6}$ h^-1, ch = 0.99 (99% fail closed, 0.01 fail open)

$\mu$: service every month
