
9. Reliability theory

Material based on original slides by Tuomas Tirronen

ELEC-C7210 Modeling and analysis of communication networks



Contents

• Introduction
• Structural system models
• Reliability of structures of independent repairable components
• Reliable network topology design


History

• As a technological concept, reliability emerged after WW1; practical
methods were developed during and after WW2
– For example, Lusser’s law, i.e., the product probability law for a series of
components, was formulated by Robert Lusser during the V1 flying bomb tests
– Arose from the need to improve and control the quality of industrial
products with many parts
• 50s, 60s
– ballistic missiles, space programs
– first journal, IEEE Transactions on Reliability 1963
• 70s
– safety of nuclear power plants
• 80s, 90s
– oil and gas industries, computer programs to evaluate reliability, software
reliability, ...
• 00s: new kinds of operation concepts (remote control/maintenance of
systems) require reliability analysis, network reliability, etc.

Approaches to reliability

• Hardware reliability
– Physical approach
• Strength S of an item is a random variable
• Load L the item is exposed to is another random variable
• Reliability R = Pr(S > L)
• Structural reliability analysis
– Actuarial approach ← our approach
• Time to failure T is studied using its distribution F(t)
• All information of individual strengths, loads, etc is conveyed in F(t)
• System reliability analysis
• Software reliability
• Human reliability
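
As a small illustration of the physical approach above, the sketch below evaluates R = Pr(S > L) under an added assumption that is not made in the slides: strength S and load L are independent and normally distributed, so S − L is also normal. The function name and the numbers are hypothetical.

```python
import math

def stress_strength_reliability(mean_s, sd_s, mean_l, sd_l):
    """R = Pr(S > L) for independent normal strength S and load L (assumed model)."""
    # S - L is normal with mean (mean_s - mean_l) and variance sd_s^2 + sd_l^2
    z = (mean_s - mean_l) / math.sqrt(sd_s ** 2 + sd_l ** 2)
    # Standard normal CDF expressed with the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical numbers: strength ~ N(100, 10^2), load ~ N(70, 10^2)
print(round(stress_strength_reliability(100, 10, 70, 10), 4))
```

In the actuarial approach used in the rest of the lecture, this kind of strength/load information is not modelled explicitly but is absorbed into the time-to-failure distribution F(t).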


Basic concepts (1)

• Reliability: The ability of an item to perform a required function, under
given environmental and operational conditions and for a stated period
of time (ISO 8402)
– the item can be a single component or a larger entity (system)
– required function may refer to a single function or many
• Quality: The totality of features and characteristics of a product or
service that bear on its ability to satisfy stated or implied needs (ISO
8402)
– i.e., ”conformance to specifications”
– reliability can be seen as an extension of quality into the time domain
• Availability: The ability of an item to perform its required function at a
stated instant of time or over a stated period of time (BS4778)
– i.e., can the item be used at some time instant or what is the time fraction
the item is usable (= average availability)


Basic concepts (2)

• Maintainability: The ability of an item, under stated conditions, to be
retained in, or restored to, a state in which it can perform its required
functions, when maintenance is performed under stated conditions and
using prescribed procedures and resources (BS4778)
– if an item can be repaired, then maintainability determines the availability of
the item
• Dependability: Collective term to describe availability performance
and influencing factors: Reliability performance, maintainability
performance and maintenance support performance (IEC60300)
– umbrella term often used when covering reliability issues
• Safety: Freedom from those conditions that can cause death, injury,
occupational illness, or damage to or loss of equipment or property
(MIL-STD-882D)
• Security: Dependability with respect to prevention of deliberate hostile
actions

Basic concepts (3)

• Fault
– a defect or mistake that leads to an error, i.e., the cause of an error
• Error
– a system state which can lead to a failure
• Failure
– The termination of the ability of an item to perform a required function (BS 4778)
– An unacceptable deviation from the design tolerance or in the anticipated
delivered service, an incorrect output, the incapacity to perform the desired
function (NASA 2002)


Basic concepts (4)

Cause → Fault → Error → Failure

Fault prevention
• aim is to design a system without faults
• physical shielding of components, careful manufacturing etc.

Fault tolerance
• aim is to be able to provide the service even in the presence of faults
• main tool: redundancy!
– hardware
– software
– information
– time


Repairable and nonrepairable items

• Can study two types of items
– Nonrepairable items
• The item can be a single item or a larger system
• We are only interested in the time until the first failure; whatever happens
after this is of no interest to us
• Interesting measures include: mean time to failure, reliability
(function) and failure rate
– Repairable items (our focus)
• Single item or larger system
• Interesting measures include: availability, mean time between
failures, mean down time, number of failures in some time interval
• In some sources the term dependability is used instead of availability
to mean the same thing


Systems of items

• We also study systems of many items or components. There are two
possibilities for modeling systems:
– Systems of independent components (our focus)
• Easy analysis: independence of components → joint probabilities are
products of the component probabilities
• Most examples during this course assume independence
– Systems with dependent components
• Exact analysis is harder, even impossible, because of the
dependencies
• Analysis of the system as a stochastic process


Tools and models

• As a function of time, we have state models where systems are
modelled as stochastic processes (cf. queueing models in earlier
lectures)
– especially repairable items/systems
– the failure process, repair times, etc.
• Structure of a system and its subsystems → structural models
– reliability block diagrams, structure function
• Tools:
– Basic probability theory
– Stochastic processes
• Markov chains/processes
• (Renewal processes)
– Statistical methods
• Main limitations of ”probabilistic reliability analysis”: human errors,
human factor

Applications

• Risk analysis
– Identification of accidental events
– Causal analysis
– Consequence analysis
• Environmental protection
• Quality
• Optimization and maintenance
• Engineering design
• Verification of quality
• Research and development
• ...


Reliability in communications and networking (1)

• From the user's point of view, an interesting quality of service concept is the
network availability
– A(t) = Pr(user can access the agreed network services at time t)
– Average availability tells us the fraction of time the system is available
• A way to understand the availability of networks is to study the downtime
of a network (or outage of some specific service) per year

# nines    Avg. availability    Downtime / year
2-nines    0.99                 87.6 hours
3-nines    0.999                8 hours 46 mins
4-nines    0.9999               52 mins 34 secs
5-nines    0.99999              5 mins 15 secs
6-nines    0.999999             31.5 secs
7-nines    0.9999999            3.15 secs
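
The downtime figures in the table above follow directly from (1 − availability) × length of a year. A minimal sketch, assuming a 365-day year (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; leap years are ignored (assumption)

def downtime_per_year_hours(availability):
    """Expected downtime per year, in hours, for a given average availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

for nines in range(2, 8):
    a = 1.0 - 10.0 ** (-nines)
    seconds = downtime_per_year_hours(a) * 3600
    print(f"{nines}-nines  A = {a:.7f}  downtime = {seconds:12.2f} s/year")
```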

Reliability in communications and networking (2)

• In addition to availability, network operators, service providers and
equipment manufacturers are also interested in
– reliability of components (mean times to failure, number of failures in
some time interval etc.)
– maintainability
– security of networks
• Reliability is an important factor when planning new services, networks
or equipment
• Note that dependability, reliability and availability may have different
definitions in different sources. Always check how each concept is
defined in the source at hand.


Aim of the lecture

• We focus on
– Repairable systems
– Systems of independent components
– Exponentially distributed times to failure and down times
– Thus, we get simple models using Markovian analysis
– Apply the models to topology design of communication networks where
availability is defined as connectivity of the network


Literature

• Reliability theory / Dependability
– System Reliability Theory: Models, Statistical Methods and Applications,
2nd edition, Marvin Rausand and Arnljot Høyland, Wiley, 2004
– Mesh-Based Survivable Networks: Options and Strategies for Optical,
MPLS, SONET and ATM Networking, Wayne D. Grover, Prentice Hall,
2004
– Lecture notes (in Finnish): Luotettavuus, käytettävyys, huollettavuus
(luotettavuusteoria.pdf), Keijo Ruohonen, TTKK, 2002
– TKK courses AS-116.3180, Mat-2.3118


Contents

• Introduction
• Structural system models
• Reliability of structures of independent repairable components
• Reliable network topology design


Reliability block diagrams

• Reliability block diagrams (RBD) are used to describe the function of
a system of components
– an RBD shows the logical connections between components
• A system works if there is a path of functioning components from the
start point (a) to the end point (b)
• RBDs give a deterministic model for the structure of a system
– the whole system works properly if and only if some set of the components
function
• It is important to determine which specific function of the system is
modelled: the logical structure may be different for different functions


Series and parallel structures

• When a system functions if and only if all of the components function,
the logical structure is a series structure

a -[1]-[2]-[3]-[4]- b

• When a system functions if at least one of the n components
functions, the logical structure is a parallel structure

[Figure: components 1, 2, 3 and 4 connected in parallel between a and b]

• Series and parallel structures can be further combined to model more
complex structures

Structure function (1)

• The state vector of a structure is $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, where each state
variable $x_i$ is either 1 when component i is functioning or 0 when
component i is in a failed state
• The structure function of the system is
$\phi(\mathbf{x}) = \begin{cases} 1 & \text{if the system is functioning} \\ 0 & \text{if the system is in a failed state} \end{cases}$
• For a series structure, the structure function is
$\phi(\mathbf{x}) = x_1 \cdot x_2 \cdots x_n = \prod_{i=1}^{n} x_i$
– system works if and only if $x_i = 1$ for all i


Structure function (2)

• For a parallel structure the structure function is
$\phi(\mathbf{x}) = 1 - (1 - x_1)(1 - x_2) \cdots (1 - x_n) = 1 - \prod_{i=1}^{n} (1 - x_i) = \coprod_{i=1}^{n} x_i$
– If any $x_i = 1$, then the system functions
– The last operator $\coprod$ (upside-down product) is read ”ip”
• Example:
For a structure with 2 components in parallel we have
$\phi(x_1, x_2) = \coprod_{i=1}^{2} x_i = 1 - (1 - x_1)(1 - x_2) = x_1 + x_2 - x_1 x_2$
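
The series and parallel structure functions can be sketched in a few lines of Python. The function names are hypothetical; the loop at the end checks the two-component parallel identity from the example on all four binary states.

```python
from itertools import product

def phi_series(x):
    """Series structure: 1 only if every component state is 1."""
    result = 1
    for xi in x:
        result *= xi
    return result

def phi_parallel(x):
    """Parallel structure ("ip"): 1 if at least one component state is 1."""
    all_failed = 1
    for xi in x:
        all_failed *= (1 - xi)
    return 1 - all_failed

# Check 1 - (1 - x1)(1 - x2) = x1 + x2 - x1*x2 on every binary state
for x1, x2 in product((0, 1), repeat=2):
    assert phi_parallel((x1, x2)) == x1 + x2 - x1 * x2
```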


Path set and cut set methods

• For small systems the structure function $\phi(\mathbf{x})$ can be written down by
visually inspecting the system as a combination of series and parallel
structures
• However, for large systems this is not feasible!
• Therefore, we need systematic computational methods for
generating the structure function $\phi(\mathbf{x})$
– Path set and cut set methods allow this


Path/cut sets (1)

• Definition: A path set P is a set of components which by functioning
ensure that the system is functioning. A path set is minimal if it cannot
be reduced without losing its status as a path set.
• Definition: A cut set K is a set of components which by failing cause
the system to fail. A cut set is minimal if it cannot be reduced without
losing its status as a cut set.
• Example:

[Figure: component 1 in series with the parallel pair of components 2 and 3]

Path sets: {1, 2}, {1, 3}, {1, 2, 3}
Cut sets: {1}, {2, 3}, {1, 2}, {1, 3}, {1, 2, 3}
Minimal path sets: P1 = {1, 2}, P2 = {1, 3}
Minimal cut sets: K1 = {1}, K2 = {2, 3}
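
For a system this small, the path and cut sets of the example can be found by brute force over all component subsets. The sketch below assumes the example structure (component 1 in series with the parallel pair 2, 3) and uses hypothetical helper names; it is not an efficient method for large systems.

```python
from itertools import combinations

COMPONENTS = (1, 2, 3)

def phi(working):
    """Example system: component 1 in series with the parallel pair (2, 3)."""
    return 1 in working and (2 in working or 3 in working)

def all_subsets(items):
    """All non-empty subsets of the component set."""
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def minimal(sets):
    """Keep only the sets that have no proper subset in the collection."""
    return [s for s in sets if not any(t < s for t in sets)]

# For a monotone system it suffices to test the extreme states:
# path set -> system works when only these components work
path_sets = [s for s in all_subsets(COMPONENTS) if phi(s)]
# cut set -> system fails when only these components have failed
cut_sets = [s for s in all_subsets(COMPONENTS) if not phi(set(COMPONENTS) - s)]

print("minimal path sets:", [sorted(s) for s in minimal(path_sets)])
print("minimal cut sets: ", [sorted(s) for s in minimal(cut_sets)])
```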

Path set method

• Let $\rho_j(\mathbf{x})$ denote the structure function of the jth minimal path set $P_j$
$\rho_j(\mathbf{x}) = \prod_{i \in P_j} x_i$
• The whole structure functions if and only if at least one minimal path
set is functioning,
$\phi(\mathbf{x}) = 1 - \prod_j (1 - \rho_j(\mathbf{x})) = \coprod_j \rho_j(\mathbf{x}) = \coprod_j \prod_{i \in P_j} x_i$
• Path set method:
1. Determine the path sets of the structure
2. Determine minimal path sets $P_j$
3. Calculate the structure functions of the minimal path sets as series structures
4. Take ”ip” over all functions you get in step 3.
5. Simplify as needed (TIP: any power of a binary variable equals the variable
itself, $x_i^j = x_i$)
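
Steps 3 to 5 of the path set method can be sketched as follows. The minimal path sets {1, 2} and {1, 3} are taken from the earlier example; the function name is hypothetical. The loop checks the result against the simplified polynomial obtained by hand with $x_1^2 = x_1$.

```python
from itertools import product

def phi_from_path_sets(x, min_path_sets):
    """x maps component -> 0/1; phi = 1 - prod_j (1 - prod_{i in P_j} x_i)."""
    none_works = 1
    for path in min_path_sets:
        rho = 1                      # series structure of one minimal path set
        for i in path:
            rho *= x[i]
        none_works *= (1 - rho)      # "ip" over the minimal path sets
    return 1 - none_works

PATHS = [{1, 2}, {1, 3}]
for x1, x2, x3 in product((0, 1), repeat=3):
    x = {1: x1, 2: x2, 3: x3}
    # 1 - (1 - x1 x2)(1 - x1 x3) simplifies to x1 x2 + x1 x3 - x1 x2 x3
    assert phi_from_path_sets(x, PATHS) == x1 * x2 + x1 * x3 - x1 * x2 * x3
```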

Cut set method

• Let $\kappa_j(\mathbf{x})$ denote the structure function of the jth minimal cut set $K_j$
$\kappa_j(\mathbf{x}) = \coprod_{i \in K_j} x_i = 1 - \prod_{i \in K_j} (1 - x_i)$
• Now the structure fails if and only if at least one of the structures
corresponding to the minimal cut sets fails
$\phi(\mathbf{x}) = \prod_j \kappa_j(\mathbf{x}) = \prod_j \coprod_{i \in K_j} x_i$
• Cut set method:
1. Determine the cut sets of the structure
2. Determine minimal cut sets $K_j$
3. Calculate the structure functions of the minimal cut sets as parallel structures
4. Multiply all functions you get in step 3.
5. Simplify as needed
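
The cut set method can be sketched in the same way. The minimal cut sets {1} and {2, 3} are those of the earlier example; the function name is hypothetical.

```python
from itertools import product

def phi_from_cut_sets(x, min_cut_sets):
    """x maps component -> 0/1; phi = prod_j (1 - prod_{i in K_j} (1 - x_i))."""
    phi = 1
    for cut in min_cut_sets:
        all_failed = 1
        for i in cut:
            all_failed *= (1 - x[i])
        phi *= (1 - all_failed)      # parallel structure of one minimal cut set
    return phi

CUTS = [{1}, {2, 3}]
for x1, x2, x3 in product((0, 1), repeat=3):
    x = {1: x1, 2: x2, 3: x3}
    # x1 * (1 - (1 - x2)(1 - x3)) = x1 x2 + x1 x3 - x1 x2 x3
    assert phi_from_cut_sets(x, CUTS) == x1 * x2 + x1 * x3 - x1 * x2 * x3
```

Both methods lead to the same polynomial after simplification, as they must, since they describe the same structure.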


Demo/Exercise

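• Determine the structure function of the system of independent components below
a) directly (by using results for series/parallel structures and combining)
b) using path set method
c) using cut set method

[Figure: reliability block diagram of components 1, 2 and 3, laid out as in the path/cut set example]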


Contents

• Introduction
• Structural system models
• Reliability of structures of independent repairable components
• Reliable network topology design


Repairable components/systems

• Now we study systems where components can be repaired or replaced
upon failures (or even before), i.e., repairable components
• We are interested for example in
– system reliability
– component/system availability
– mean number of failures during a time interval
– mean time between failures, MTBF
– mean downtime (or repair time) of systems, MDT (MTTR)
• For this purpose we can model the systems/failure processes as
stochastic processes
– thus, we have already studied the theoretical background in the beginning
of this course


Reliability of maintained systems (1)

• The system is called maintainable when its components are
repaired/restored to working condition using some kind of maintenance
– Can be preventive, corrective, …
• Let X(t) denote the stochastic process of the system with X(t) = 1 if the
system is operational and 0 otherwise
• The main measure is availability, A(t)
– the unavailability $U(t) = \bar{A}(t) = 1 - A(t)$ is also studied

Availability: $A(t) = P\{X(t) = 1\}$

Average availability: $A_{av}(t) = \frac{1}{t} \int_0^t A(u)\,du$

Long-run average availability: $A_{av} = \lim_{t \to \infty} A_{av}(t) = \lim_{t \to \infty} \frac{1}{t} \int_0^t A(u)\,du$

Limiting availability (when it exists): $A = \lim_{t \to \infty} A(t)$

Availability of single component as on-off process (1)

• We can model a single component as an on-off type process X(t) with
$X(t) = \begin{cases} 1 & \text{if the component is operational} \\ 0 & \text{otherwise} \end{cases}$
• Measures related to maintainable systems are
– Mean time between failures, MTBF
– Mean downtime, MDT
– Mean time to failure, MTTF

[Figure: sample path of X(t) against time t, alternating between 1 and 0, annotated with MTTF, MDT and MTBF]

Reliability of single component as on-off process (2)

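• Markov model
– Times to failure are independent and exponentially distributed with mean MTTF = $1/\lambda$
– Down times are independent and exponentially distributed with mean MDT = $1/\mu$

[Figure: two-state Markov chain; state 1 (up) moves to state 0 (down) at rate $\lambda$ and back at rate $\mu$; up times ~Exp($\lambda$), down times ~Exp($\mu$)]

• Steady state distribution simply:
$U_{av} = \pi_0 = \frac{\lambda}{\lambda + \mu} = \frac{\text{MDT}}{\text{MTTF} + \text{MDT}}$
$A_{av} = \pi_1 = \frac{\mu}{\lambda + \mu} = \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}}$
– The steady-state distribution holds even when the up and down times have general
distributions (but are still independent): the insensitivity property
– The process is then no longer Markovian but a so-called renewal process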

Examples (1)

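• Example 1:
A machine has MTTF = 1000 hours and MDT = 5 hours
The average availability is
$A_{av} = \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}} = \frac{1000}{1000 + 5} \approx 0.995$
• Example 2:
Item has independent uptimes with constant failure rate $\lambda$. Downtimes are IID
with mean MDT. Usually we have $\text{MDT} \ll \text{MTTF}$; the average unavailability is
then approximately
$\bar{A}_{av} = 1 - A_{av} = 1 - \frac{\text{MTTF}}{\text{MTTF} + \text{MDT}} = \frac{\text{MDT}}{\text{MTTF} + \text{MDT}} = \frac{\lambda \cdot \text{MDT}}{1 + \lambda \cdot \text{MDT}} \approx \lambda \cdot \text{MDT}$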
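
Both examples can be checked numerically. The sketch below reuses the numbers of Example 1 for Example 2 as well (an assumption made only for illustration) to show how close the approximation λ·MDT is to the exact unavailability.

```python
def average_availability(mttf, mdt):
    """A_av = MTTF / (MTTF + MDT)."""
    return mttf / (mttf + mdt)

mttf, mdt = 1000.0, 5.0          # hours, from Example 1
failure_rate = 1.0 / mttf        # lambda = 1 / MTTF

exact_unavailability = 1.0 - average_availability(mttf, mdt)
approx_unavailability = failure_rate * mdt   # valid when MDT is much smaller than MTTF

print(f"A_av             = {average_availability(mttf, mdt):.6f}")
print(f"exact  1 - A_av  = {exact_unavailability:.6f}")
print(f"approx lambda*MDT = {approx_unavailability:.6f}")
```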

Systems of independent components (1)

• Consider a system consisting of n independent components
– The state vector of a system is $\mathbf{X}(t) = (X_1(t), X_2(t), \ldots, X_n(t))$
• Times to failure and down times of component i are independent and exp.
distributed with mean $1/\lambda_i$ and $1/\mu_i$, respectively
– Let $p_i = P\{X_i = 1\} = \mu_i / (\lambda_i + \mu_i)$
– That is, $p_i$ is the availability of component i
• Then the steady state distribution of state $\mathbf{x} = (x_1, \ldots, x_n)$ is simply the
product of Bernoulli distributions of each component i,
$\pi(\mathbf{x}) = \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}$
– Again the distribution holds even under general distributions for the times to
failure and down times (insensitivity)

Systems of independent components (2)

• In general, the average availability of the system is defined as
$A_{av} = P\{\phi(\mathbf{X}) = 1\}$
• The state space Ω can be partitioned into two sets
1. Up states $\Omega_U$
• where the system is working. Note that some components may be in
failed state, but the system still provides the intended service.
2. Down states $\Omega_D$
• where the system does not perform the required function
• The (average) availability of the system is given by
$A_{av} = \sum_{\mathbf{x} \in \Omega_U} \pi(\mathbf{x}) = \sum_{\mathbf{x} \in \Omega} 1\{\phi(\mathbf{x}) = 1\}\, \pi(\mathbf{x})$
– similarly, unavailability is the sum of probabilities of down states
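
The sum over up states can be evaluated directly by enumerating the whole state space, which is feasible only for small n. A minimal sketch, assuming independent components with a product-form steady-state distribution; the function names and the availabilities 0.9 and 0.8 are hypothetical.

```python
from itertools import product

def state_probability(x, p):
    """Product of Bernoulli terms: prod_i p_i^x_i * (1 - p_i)^(1 - x_i)."""
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi == 1 else (1.0 - pi)
    return prob

def availability(phi, p):
    """Sum the steady-state probabilities of all up states (phi(x) = 1)."""
    return sum(state_probability(x, p)
               for x in product((0, 1), repeat=len(p))
               if phi(x) == 1)

def phi_parallel_pair(x):
    """Two components in parallel: up if at least one component is up."""
    return 1 if (x[0] == 1 or x[1] == 1) else 0

p = [0.9, 0.8]   # hypothetical component availabilities
print(availability(phi_parallel_pair, p))
```

For the parallel pair the up states are (1,1), (1,0) and (0,1), so the sum equals 1 − (1 − 0.9)(1 − 0.8).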

Systems of independent components (3)

• As $\phi(\mathbf{X})$ is a binary-valued function,
$A_{av} = P\{\phi(\mathbf{X}) = 1\} = E[\phi(\mathbf{X})]$
• For series structure: (independence of $X_i$’s !!)
$A_{av} = E[\phi(\mathbf{X})] = E\left[\prod_{i=1}^{n} X_i\right] = \prod_{i=1}^{n} E[X_i] = \prod_{i=1}^{n} p_i$
• And similarly for parallel structure:
$A_{av} = E[\phi(\mathbf{X})] = E\left[\coprod_{i=1}^{n} X_i\right] = E\left[1 - \prod_{i=1}^{n} (1 - X_i)\right] = 1 - \prod_{i=1}^{n} (1 - E[X_i]) = 1 - \prod_{i=1}^{n} (1 - p_i) = \coprod_{i=1}^{n} p_i$

Systems of independent components (4)

• However, in general
$A_{av} = E[\phi(\mathbf{X})] \neq \phi(E[\mathbf{X}])$
– Thus, to calculate availability one cannot just write down the structure
function $\phi(\mathbf{x})$ and replace the $x_i$’s by the corresponding $p_i$’s!
• Instead, the function must first be simplified
– Note that $\phi(\mathbf{x})$ is a polynomial function
– All higher powers of the $x_i$’s are equal to $x_i$, i.e., $x_i^2 \to x_i$ etc.
– To the simplified structure function one can then apply the expectation
operator
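
The pitfall can be seen numerically. The sketch below uses the structure with minimal path sets {1, 2} and {1, 3} from the earlier example and hypothetical availabilities; substituting the $p_i$ into the unsimplified path-set form gives a wrong value, while the simplified polynomial matches the exact expectation.

```python
from itertools import product

def phi_unsimplified(x1, x2, x3):
    # Path-set form before simplification: contains x1^2 when expanded
    return 1 - (1 - x1 * x2) * (1 - x1 * x3)

def phi_simplified(x1, x2, x3):
    # After applying x1^2 = x1
    return x1 * x2 + x1 * x3 - x1 * x2 * x3

def expected_phi(p):
    """E[phi(X)] by enumerating all 2^3 states of independent components."""
    total = 0.0
    for x in product((0, 1), repeat=3):
        prob = 1.0
        for xi, pi in zip(x, p):
            prob *= pi if xi else 1.0 - pi
        total += phi_simplified(*x) * prob
    return total

p = (0.9, 0.8, 0.7)                  # hypothetical component availabilities
exact = expected_phi(p)
naive = phi_unsimplified(*p)         # wrong: p1 is effectively squared
correct = phi_simplified(*p)         # right: multilinear in the p_i

print("exact  :", round(exact, 4))
print("naive  :", round(naive, 4))
print("correct:", round(correct, 4))
```

On binary inputs the two forms agree; they differ only once the real-valued $p_i$ are substituted.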


Demo/exercise

• Calculate the availability of the system below using the data given in
the table

[Figure: reliability block diagram of components 1, 2 and 3, laid out as in the earlier demo/exercise]

i    MTTF_i (hours)    MDT_i (hours)
1    750               8
2    300               15
3    500               10

• Hint: Use the structure function derived earlier, use availabilities of
components
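
One possible numeric check for the exercise above, assuming (as in the earlier path/cut set example) that component 1 is in series with the parallel pair of components 2 and 3, and that each component availability is MTTF_i / (MTTF_i + MDT_i). The variable names are hypothetical.

```python
def component_availability(mttf, mdt):
    """p_i = MTTF_i / (MTTF_i + MDT_i)."""
    return mttf / (mttf + mdt)

# (MTTF, MDT) in hours, from the table
data = {1: (750, 8), 2: (300, 15), 3: (500, 10)}
p = {i: component_availability(mttf, mdt) for i, (mttf, mdt) in data.items()}

# Assumed structure: 1 in series with (2 parallel 3)
# Simplified structure function: x1 x2 + x1 x3 - x1 x2 x3
system_availability = p[1] * p[2] + p[1] * p[3] - p[1] * p[2] * p[3]

for i in sorted(p):
    print(f"p_{i} = {p[i]:.6f}")
print(f"A_av = {system_availability:.6f}")
```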

Models with state-dependent rates

• Earlier we assumed components are completely independent from
each other
• Markov models can have state-dependent rates
– The dynamics (or transition rates) may depend on the state to reflect some
physical causes resulting from the given state
– For example, if there is only one repairman, the repair rates are affected
when there are many faults
– But still we assume that times to failure and down times obey exponential
distributions
• One can construct the associated Markov process and solve the steady
state via global balance equations


Example

• Consider a parallel structure of two components. Uptimes are exp.
distributed with rates $\lambda_1$ and $\lambda_2$. Repair rates are, correspondingly, $\mu_1$
and $\mu_2$. Also, there is only one repair person, who splits the repair effort
equally between components 1 and 2 when both are down.

System state    State of component 1    State of component 2
0               1                       1
1               0                       1
2               1                       0
3               0                       0

Transition rates:
0 → 1: $\lambda_1$, 1 → 0: $\mu_1$, 0 → 2: $\lambda_2$, 2 → 0: $\mu_2$,
1 → 3: $\lambda_2$, 3 → 1: $\mu_2/2$, 2 → 3: $\lambda_1$, 3 → 2: $\mu_1/2$

• Now solve the equilibrium probabilities $\pi_i$. Average availability is the
probability that at least one component works:
$A_{av} = \pi_0 + \pi_1 + \pi_2$
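
The global balance equations of the four-state example can be solved numerically. The sketch below is one way to do it in plain Python (Gaussian elimination on the balance equations with one equation replaced by the normalisation condition); the function names and the rate values are hypothetical, and the state numbering follows the example's state table.

```python
def steady_state(rates, n):
    """Solve pi Q = 0 with sum(pi) = 1 for a CTMC given as {(i, j): rate}."""
    q = [[0.0] * n for _ in range(n)]          # generator matrix Q
    for (i, j), r in rates.items():
        q[i][j] += r
        q[i][i] -= r
    # Balance equations Q^T pi = 0; the last one is replaced by normalisation
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    for col in range(n):                        # elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    pi = [0.0] * n
    for row in range(n - 1, -1, -1):            # back substitution
        s = sum(a[row][k] * pi[k] for k in range(row + 1, n))
        pi[row] = (b[row] - s) / a[row][row]
    return pi

def two_component_rates(l1, l2, m1, m2):
    """States: 0 = both up, 1 = comp. 1 down, 2 = comp. 2 down, 3 = both down."""
    return {
        (0, 1): l1, (1, 0): m1,
        (0, 2): l2, (2, 0): m2,
        (1, 3): l2, (3, 1): m2 / 2,   # shared repair person: half rate
        (2, 3): l1, (3, 2): m1 / 2,
    }

# Hypothetical rates: lambda_1 = lambda_2 = 1, mu_1 = mu_2 = 2
pi = steady_state(two_component_rates(1.0, 1.0, 2.0, 2.0), 4)
availability = pi[0] + pi[1] + pi[2]
print("pi =", [round(v, 4) for v in pi], " A_av =", round(availability, 4))
```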

Contents

• Introduction
• Structural system models
• Reliability of structures of independent repairable components
• Reliable network topology design


Topology design problem

• Topology design is the starting point in network design
• Think of the network as a graph with nodes connected by links
– Typically network topology is heavily influenced by the set of physical
locations that need connectivity, so nodes are often given
– Also, many of the primary links between nodes are defined by the node
locations
– In practice, the design space allows adding only a few additional links and
nodes
• The question is:
– Given a network topology (nodes + links), what is a reliable network?
– By considering the network as a graph, reliability/availability can be
formalized by the notion of graph connectivity


Graphs and k-connectivity

• Consider the network as a graph G(N,J) consisting of a set of nodes N
and a set of links J
• Definition: A graph is said to be connected if there exists a path
between every pair of nodes in the graph.
• Definition: Graph G is k-edge-connected if it remains connected after
removal of any k-1 edges.
– Remember: edge = link
• Definition: Graph G is k-vertex-connected if it remains connected
after removal of any k-1 vertices.
– Remember: vertex = node
– Removal of node means that all links connected to the node are removed
from the graph
• Efficient algorithms exist to check k-connectivity of the graph
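
The definition of k-edge-connectivity can be checked directly by brute force, as sketched below. This is only an illustration of the definition for small graphs, not one of the efficient algorithms mentioned above; the function names and the example graphs are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Depth-first search from an arbitrary node; True if all nodes are reached."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def is_k_edge_connected(nodes, edges, k):
    """Graph must stay connected after removal of any k-1 edges."""
    edges = list(edges)
    removals = min(k - 1, len(edges))
    for removed in combinations(range(len(edges)), removals):
        kept = [e for i, e in enumerate(edges) if i not in removed]
        if not is_connected(nodes, kept):
            return False
    return True

ring = [(1, 2), (2, 3), (3, 4), (4, 1)]   # 4-node ring
path = [(1, 2), (2, 3)]                   # 3-node path
print(is_k_edge_connected([1, 2, 3, 4], ring, 2))   # ring survives any single link failure
print(is_k_edge_connected([1, 2, 3], path, 2))      # path does not
```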


Examples

• 1-edge-connected

• 2-edge-connected


Topology design method (1)

• Topology design objective:
– For redundancy, all nodes in the network need to be at least
2-(edge)-connected with probability 0.99999 (i.e., “5 nines”)
– That is, the network must be resilient to single link failures
• Consider a given network topology represented by graph G(N,J)
• Assume that link $j \in J$ is operational with probability $p_j$, but the nodes
are perfectly reliable
– State $\mathbf{x} = (x_1, \ldots, x_J)$
– State space $\Omega = \{0, 1\}^J$


Topology design method (2)

• The structure function is then
$\phi(\mathbf{x}) = \begin{cases} 1, & \text{if the network in state } \mathbf{x} \text{ is 2-connected} \\ 0, & \text{otherwise} \end{cases}$
• And the availability is defined as
$A_{av} = P\{\mathbf{X} \text{ is 2-connected}\} = E[\phi(\mathbf{X})]$
– Note that the size of the state space is $2^J$ (grows exponentially!)
• If availability is too low, new links need to be added
– Need to define heuristics for identifying the most useful locations
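
For a small network the availability can be computed by enumerating all $2^J$ link states, as sketched below. Nodes are assumed perfectly reliable and links independent; the function names, the example topologies and the link probability 0.99 are hypothetical. The exponential state space makes this approach impractical for larger networks.

```python
from itertools import product

def is_connected(nodes, edges):
    """Depth-first search; True if every node is reachable from the first one."""
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def is_2_edge_connected(nodes, edges):
    """Connected, and still connected after removal of any single link."""
    if not is_connected(nodes, edges):
        return False
    return all(is_connected(nodes, edges[:i] + edges[i + 1:])
               for i in range(len(edges)))

def network_availability(nodes, links, p):
    """Sum P(state) over all link states whose up links form a 2-edge-connected graph."""
    total = 0.0
    for state in product((0, 1), repeat=len(links)):
        prob = 1.0
        for x, pj in zip(state, p):
            prob *= pj if x else (1.0 - pj)
        up_links = [link for link, x in zip(links, state) if x]
        if is_2_edge_connected(nodes, up_links):
            total += prob
    return total

nodes = [1, 2, 3, 4]
ring = [(1, 2), (2, 3), (3, 4), (4, 1)]
full_mesh = ring + [(1, 3), (2, 4)]
print("ring     :", network_availability(nodes, ring, [0.99] * 4))
print("full mesh:", network_availability(nodes, full_mesh, [0.99] * 6))
```

In the ring every link is needed for 2-edge-connectivity, so its availability is the product of the four link probabilities; the two extra links of the full mesh add further up states.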

Topology design method (3)

• Taking into account node failures
– Node $n \in N$ is operational with probability $q_n$
• We still require that all nodes must stay 2-connected with 5-nines
– Thus, all nodes must then be operational and
$A_{av} = P\{\mathbf{X} \text{ is 2-connected} \mid \text{all nodes on}\} \cdot P\{\text{all nodes on}\}$
$= (q_1 \cdots q_N) \cdot P\{\mathbf{X} \text{ is 2-connected} \mid \text{all nodes on}\}$
– The conditional probability of 2-(edge)-connectedness is evaluated as
before, assuming that nodes do not fail
• Note! This is just one version of the topology design objective and new
ones can be easily defined.

THE END

• What you should understand/remember:
– what kind of things reliability theory studies
– basic measures, MTTF, MDT
– how to calculate structure function of simple systems and how to use that
to calculate the availability/reliability of a system
– how to make Markov models of simple maintained systems and calculate
the availability
– how can graph connectivity be used as a measure of reliability in data
network topology design
