

Introduction to Reliability Theory and Practice









John B. Bowles
Computer Science and Engineering
University of South Carolina
Columbia, SC 29208 USA
e-mail: bowles@engr.sc.edu
Summary & Purpose
This tutorial serves as an introduction to the main concepts and techniques in reliability and provides an
overview of the field. It tells what reliability is, describes how reliability is measured, modeled, analyzed,
and managed, and illustrates system models and design techniques aimed at improving system reliability.
The practicalities and misconceptions of some common approaches to achieving reliability, such as the use
of redundancy, are also discussed.



John Bowles is an Associate Professor in the Computer Science and Engineering Department at the
University of South Carolina where he teaches and does research in reliable system design. Previously he
was employed by NCR Corporation and Bell Laboratories where he worked on several large system design
projects. He has a BS in Engineering Science from the University of Virginia, an MS in Applied
Mathematics from the University of Michigan, and a Ph.D. in Computer Science from Rutgers University.
In 1987 he was a visiting scholar at the US Air Force Institute of Technology, Wright-Patterson AFB. Dr.
Bowles is a Senior Member of IEEE, a Member of SAE and ASQ, and an ASQC Certified Reliability
Engineer.


Contents
1. Introduction
1.1. What is Reliability?
1.2. Questions Addressed by Reliability
1.3. Why is Reliability Important?
1.4. Scope of Reliability
2. Measures and Models of Reliability
2.1. Exponential Failure Distribution
2.2. Weibull Failure Distribution
3. Component-Level Reliability
3.1. Improving Early-Life Reliability
3.2. Improving Component Reliability
3.3. Software Reliability
4. System-Level Reliability
4.1. System Structure
4.2. Effects of the Failure Distribution on System Reliability
4.3. Reliability Specification and Allocation
5. System-Level Analysis Techniques
5.1. Failure Modes and Effects Analysis
5.2. Fault Trees
5.3. Markov Models
6. Perspectives and Future Challenges
1. INTRODUCTION
1.1. What is Reliability?
In everyday parlance something is said to be reliable if it
works when it is needed. Thus, a car is considered reliable if
we are confident that it will start and takes us where we want
to go. Conversely, it is unreliable if it frequently will not start
or stops unexpectedly before we get to our destination.
A narrow, working definition of reliability is that
reliability is the probability that a device will operate
successfully for a specified period of time, under specified
conditions when used in the manner and for the purpose for
which it was intended. This definition allows us to define
the characteristics of reliability in quantitative and measurable
terms. It has several important words. First, reliability is a
probability. This recognizes that we cannot say absolutely
that something will work. Rather, we specify a probability
that something will work. If we have many similar items this
is easily interpreted as the relative number of items that work.
Interpretation becomes more problematic if we have only a
single item or one that is used only once. Second, we must
specify what constitutes successful operation. For devices
that are either working or broken this is easy; for items that
degrade with use, some minimum level of performance must
be required. Third, the period of time must be specified. For
military items this is often the time needed to accomplish
some mission; for commercial items, it may be the
warranty period; for other items, it may be the working
life or some other arbitrary interval. For non-repairable
items the question of interest is "will the item last for the
period of interest?"; for repairable items it is "for what part of
the interval will the item be working?" Finally, the item must
be used for the purpose for which it was designed; if it is used
otherwise it can hardly be expected to operate successfully.
A broad definition of reliability, which defines the field, is
that "Reliability is the science of predicting, analyzing,
preventing, and mitigating failures over time" [1]. This
defines reliability as an area of expertise with well-defined
sub-disciplines, all of which are associated with failures in
some way. As a science, reliability is based on well-defined
principles and poses testable hypotheses whose veracity can
then be determined. It also associates reliability with many
other fields of endeavor. Prediction implies the use of
models, primarily mathematical, to foretell when or how
failures will occur. Analysis seeks to explain not only why a
failure occurred but also to quantify how many and how
often failures occur. Thus, reliability is closely associated
with the physical sciences of physics, chemistry, mechanics,
and electronics; statistics and data collection; and since
people are a part of most systems, human factors and even
psychiatry as well. Prevention tries to stop a failure from
occurring or stop a minor failure from escalating into a major
catastrophe; thus it associates reliability with the design arts.
Finally, mitigation seeks to find ways to offset the effects of a
failure when one occurs.
1.2. Questions Addressed by Reliability
Reliability seeks to answer many questions about an item,
all of which ultimately affect how the product is designed.
The traditional and fundamental question is "How long will
the item last?" This question is important because equipment
that lasts a long time is usually preferred over equipment that
only lasts a short while. For example, most people would
prefer a television set that lasts 10 years to one that lasts only
2 years. But the answers to many other questions depend on,
or influence, the answer to this question:
- What is the system availability? The longer a system
lasts between failures and the faster its repair time, the
higher its availability will be.
- How can failures be prevented? Items can be made to
last longer by eliminating potential failures. Failures can
be prevented by making changes in the design of a
device, changes in the materials of which it is made,
changes in the way the product is maintained, and
changes in the way the product is used. The more
failures that can be prevented, the longer the item will
last.
- What is the life-cycle cost of a piece of equipment? Life-
cycle costs include the initial cost of a piece of
equipment as well as the costs of repairing it when it
fails. They also include the costs of keeping an inventory
of spare parts, transporting them to where they are
needed, the lost opportunity costs when the item is out of
service, and the cost of disposing of the equipment at the
end of its life. Minimizing these costs determines an
optimum lifetime for which an item should be designed
to last and reliability is an integral part of the calculation.
- What are the biggest risks? Risks are most often
associated with what happens when something fails.
Thus, the biggest risks are those which have the worst
consequences when a failure occurs and which are most
likely to occur.
1.3. Why is Reliability Important?
Reliability has a significant impact on almost every aspect
of ownership of a product:
- Cost of ownership. Reliability affects both the initial cost
of an item and the cost of repairing the item. Sometimes
changes to increase reliability increase the initial cost or
the maintenance costs of equipment. The use of more
expensive materials such as gold and using soldered
rather than socketed electronic components are examples
of this. But surprisingly often there is a synergistic effect
and the same changes that produce greater reliability also
result in lower costs. The progression of electronic
components from vacuum tubes to discrete transistors to
integrated circuits to very large integrated circuits is an
example of this synergy.
- Ability to meet service objectives. An item that has failed
is not serving the purpose for which it was purchased.
Thus, the item must be working if service objectives that
depend on the item are to be met. The higher the
reliability of a piece of equipment, the fewer failures it
will have; fewer failures mean that fewer resources will
have to be diverted to respond to the failures; hence,
more resources can be focused on the primary task.
- Customer satisfaction. Customers purchase things
because they expect them to be used. If an item fails, it
cannot be used for the purpose intended by the person
who bought it. This leads to dissatisfaction with the
product. If it happens too often the customer may
abandon not only that product but also others from the
same manufacturer. Thus higher reliability can bring
about greater customer satisfaction with a product and
lower reliability brings about customer dissatisfaction.
- Ability to market products or services. Since greater
reliability has both a cost benefit and brings about greater
customer satisfaction, it can be an important marketing
tool. Greater reliability enabled Japanese automobile
manufacturers to take a larger market share from their
American counterparts in the 1980s.
- Safety. Reliability is closely associated with some
aspects of safety. When an item fails, it is no longer
doing the function it was intended to do; that can have
very serious consequences if the item was performing a
safety critical function. For example, if the motor in an
airplane or an automobile fails, the vehicle will
immediately lose power and speed. How the item fails
can also be important, for example certain failure modes
might cause an item to explode or to damage other
components in the system.
1.4. Scope of Reliability
Reliability affects every device from the smallest
transistor on an integrated circuit to communication networks
that encompass the entire world. Since designers want the
things they build to work, they would like the things they
design to be reliable. They would like a single failure of a
single device not to prevent an entire large system from
working. Thus, reliability affects the choice of materials
used in a device, the design of the operational units in a
device, the way individual devices are built up into larger
pieces of equipment, and the way equipment is combined into
systems.
Reliability is also closely related to, and associated with,
several other terms with which it is sometimes confused.
- Quality is most often used with regard to manufacturing
processes. It refers to how well the manufactured item
conforms to its specifications. It has been said that
reliability must be designed into a product. Poor quality,
meaning that the product does not conform to
specifications, can lead to poor reliability; but quality, no
matter how good, cannot make up for a lack of reliability
in the product design.
- Availability, like reliability, is a probability; it is the
probability that a device is operational at some instant in
time. Availability is often interpreted as the fraction of
some time period that an item is operational. Availability
is sometimes used as a measure of reliability for
repairable devices.
- Maintainability is also a probability. It is the probability
that a device that is not working can be restored to
working condition in a specified amount of time.
Reliability and Maintainability determine availability.

2. MEASURES AND MODELS OF RELIABILITY
The fundamental unit of measure in reliability is the time
that an item is operational until it fails. Since this can never
be known with certainty in advance it is expressed as a
probability, usually designated by its distribution function as
F(t). F(t) tells the proportion of items, out of those that were
initially working at time 0, that have failed by time t. If
N_F(t) is the number of items that have failed by time t, and N
is the initial number of items, then a reasonable estimate of
F(t), denoted F̂(t), is

    F̂(t) = N_F(t) / N.
Similarly, the reliability, designated by R(t), is the
proportion of items that were initially working at time 0 that
are still working at time t.
R(t) = 1 − F(t).
If N_W(t) is the number of items that are still working at time t
(so that N_F(t) + N_W(t) = N), then R(t) can be estimated by:

    R̂(t) = N_W(t) / N.
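These estimates are easy to compute from a set of observed failure times. A minimal sketch (the failure times and the function name are made-up illustrations, not data from this tutorial):

```python
def estimate_F_R(failure_times, t):
    """Estimate F(t) and R(t) as the fractions of N items failed / still working at time t."""
    N = len(failure_times)
    N_F = sum(1 for ft in failure_times if ft <= t)  # items failed by time t
    N_W = N - N_F                                    # items still working at time t
    return N_F / N, N_W / N

# Hypothetical failure times (hours) for N = 8 items
times = [120, 340, 560, 810, 1020, 1430, 1900, 2500]
F_hat, R_hat = estimate_F_R(times, 1000)
print(F_hat, R_hat)  # 4 of 8 items failed by t = 1000, so F = 0.5, R = 0.5
```

By construction F̂(t) + R̂(t) = 1 at every t, mirroring R(t) = 1 − F(t).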
The failure density function, defined as the derivative of
the failure distribution, is often more useful than the failure
distribution for purposes of mathematical calculations:
f(t) = dF(t)/dt.
Conversely,

    F(t) = ∫_0^t f(x) dx.
Since F(t) is a probability and F(t) → 1 as t → ∞, the area
under a curve of the density function must be 1. Thus, f(t)Δt
defines a rectangle of height f(t) and width Δt, which is
approximately the fraction of the area under the f(t) curve
between t and t+Δt. As Δt → 0 this approximation becomes
exact and f(t)dt is the probability that a device fails at time t
(more precisely, it is the probability that the device fails
between time t and t+dt).
The hazard function, designated as h(t), is often called the
instantaneous failure rate for a device. It is defined as
h(t) = f(t)/R(t).
h(t)Δt is the probability that a device fails in the next
increment of time, Δt, given that it was working at time t. In
general:

    R(t) = exp(−∫_0^t h(x) dx).
The system availability, A(t), is the probability that the
system is working at time t. For non-repairable systems, A(t)
= R(t).
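The relation R(t) = exp(−∫_0^t h(x) dx) holds for any hazard function and can be checked numerically. A sketch with an assumed linearly increasing hazard h(x) = a·x (so the integral is a·t²/2); the constant a is an arbitrary illustrative value:

```python
import math

# Numerically verify R(t) = exp(-integral of h from 0 to t) for h(x) = a*x.
a, t, n = 2e-6, 1000.0, 100000
dx = t / n
# Midpoint-rule Riemann sum for the integral of h(x) = a*x over [0, t]
integral = sum(a * (i + 0.5) * dx * dx for i in range(n))
R_numeric = math.exp(-integral)
R_closed = math.exp(-a * t**2 / 2)   # closed form of the same integral
print(R_numeric, R_closed)           # the two values agree
```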
Since the distribution function is usually too complex to
be easily understood or compared, several figures of merit
are often used in assessing reliability. These are:
- Mean Time To Failure (MTTF)
- Mean Time Between Failures (MTBF)
- Mean Time To First Failure (MTFF)
- Mean Time To Repair (MTTR) (for repairable
systems)
- Average Availability (A)
- Reliability at time t_i: R(t_i)
- Availability at time t_i: A(t_i)
The failure distribution function, F(t), describes how items
in a population fail over time. It is usually represented as a
standard mathematical function and for any given set of
failure data, statistical analyses are done to determine the
function parameters that best describe the data. The
exponential and Weibull functions (discussed here) are
among the most commonly used distribution functions but
many others are used for particular types of components.
2.1. Exponential Failure Distribution Function
The exponential failure distribution (also known as the
constant failure rate distribution) is the simplest failure
distribution function. It has only a single parameter, λ, which
is called the failure rate. λ is constant with respect to time but
it may be a function of other factors such as the type of
device, its load, its complexity, its manufactured quality, how
familiar the system designers are with using it, the operating
environment, the temperature, etc. Letting t be the time
period of interest, the exponential distribution is described by:
- Distribution function: F(t) = 1 − exp(−λt)
- Density function: f(t) = λ exp(−λt)
- Reliability function: R(t) = exp(−λt) (1)
- Hazard function: h(t) = λ
- MTTF = 1/λ
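A quick sketch of these formulas with an assumed failure rate (the value of λ is illustrative, not taken from the tutorial):

```python
import math

lam = 1e-4                  # assumed failure rate: 1 failure per 10,000 hours
MTTF = 1 / lam              # mean time to failure = 10,000 hours
t = 1000.0                  # time period of interest

R = math.exp(-lam * t)      # reliability at t, eqn. (1)
F = 1 - R                   # probability of failure by t
print(MTTF, round(R, 4))    # 10000.0 0.9048
```

Note that the hazard is simply λ at every t, which is exactly the "memorylessness" discussed below.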
The exponential distribution has many advantages for
reliability analysis:
1) The assumption of a constant failure rate leads to highly
tractable and easily constructed system models. For non-
redundant systems the system failure rate is simply the
sum of the individual component failure rates,

    λ_sys = Σ_{i=1}^{n} λ_i.

For non-repairable systems with n-level redundancy, the
system MTTF can be calculated via simple formulas such
as:

    MTTF_sys = (1/λ) Σ_{i=1}^{n} (1/i).
2) Markov models are often the only feasible way of
modeling the states (both operational and failed) and the
transitions between states of a complex system. Such
models implicitly assume constant transition rates
between states, although in some cases non-constant rates
can be modeled by combinations of states.
3) Some databases store item failure rates making them
readily available and the corresponding item failure
probabilities are easily constructed for any time period by
applying the appropriate time interval to eqn. (1). The
most widely used reliability prediction procedures, including
the Bellcore Reliability Prediction Procedure** and
Mil-Std-217*, generally assume a failure rate that is constant
in time but a function of other properties of the component
being modeled and the environment in which it is placed [2].

** Bellcore's name has been changed to Telcordia
Technologies.
4) Textbooks on reliability often use the exponential failure
distribution to give easily solved models that illustrate
the concepts being explained. Thus, if they use the
exponential distribution, novice practitioners have readily
available models to follow; even experienced
practitioners often assume a constant failure rate in their
analyses, knowing that it does not apply, but assuming
(or hoping) that the results will be "robust" or "close
enough."
5) Failures are relatively rare events in the life of a system.
Thus, collecting failure data is a time consuming and
expensive task and often only small sample sizes are
available. Furthermore, since recording the time of
failure (if it is known) is usually of secondary importance
when a system fails, failure data is often incomplete and
frequently uncertain as to the exact time of failure, the
cause, or even what item failed. Under these
circumstances the statistical methods available for
computing failure distributions often do not support
finding more than a one-parameter distribution.
6) From a theoretical perspective, Drenick's theorem states
that under suitable conditions the reliability of any
system that is maintained by replacing failed components
approaches the exponential as a limiting distribution [22].
This is often given as theoretical justification for
assuming an exponential failure distribution; however,
simulation studies have shown that it takes many
generations for a system to reach equilibrium, so that as a
practical matter, most systems do not last long enough to
reach this steady state [3, p. 88]. Bowles and Dobbins
have also shown that in certain cases the failures in a
repairable, redundant system can be closely
approximated by an exponential distribution [4].
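The constant-failure-rate formulas from point 1 above can be sketched as follows (the component rates and function names are illustrative assumptions):

```python
# Series (non-redundant) system: failure rates add.
def series_failure_rate(rates):
    return sum(rates)

# n-level redundant (parallel) system of identical non-repairable
# exponential components: MTTF_sys = (1/lam) * (1 + 1/2 + ... + 1/n).
def parallel_mttf(lam, n):
    return (1 / lam) * sum(1 / i for i in range(1, n + 1))

lam_sys = series_failure_rate([1e-5, 2e-5, 5e-6])  # 3.5e-5 failures/hour
mttf_duplex = parallel_mttf(1e-4, 2)               # 15,000 hours
print(lam_sys, mttf_duplex)
```

Note the diminishing return of redundancy: doubling a component raises MTTF by only 50%, and each further copy adds less.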
On the other hand, some well-known characteristics of the
exponential distribution make little sense from a reliability
point of view:
1) It implies that failures are due to chance events,
uniformly distributed over any interval of time. Thus, it
ignores the existence of many well-known failure
mechanisms whose effects are cumulative. For example,
many components experience failures from mechanisms
such as: wear (e.g., bearings and other mechanical
components), fatigue (e.g., structural components),
chemical reactions (e.g., corrosion), burning (e.g.,
filaments in electronic tubes), and submicroscopic effects
(e.g., ion migration in integrated circuits).
2) The assumption that the failure rate does not change over
time removes age from the reliability model; thus the
probability that an item fails is the same regardless of
whether the item is brand new or has been operating for
many years. This so-called "memorylessness"
characteristic of the exponential distribution means that
the model cannot capture degradation over time, such as
occurs in batteries, and a used system is "as good as new"
from a reliability point of view.

* Mil-Hdbk-217 was last updated as revision F in January
1990 and is no longer supported by the US Department of
Defense. However, we will use some of the device models
here for illustrative purposes.
In summary, the exponential model captures the important
concepts in the definition of reliability, is easy to understand,
and leads to highly tractable system models. Unfortunately, it is
also easily misapplied and can provide very misleading
results. We will return to this point in section 4.1. The
characteristics of the exponential distribution and its
suitability for modeling device failures in reliability work
have been a source of endless debate within the reliability
community.
2.2. Weibull Failure Distribution Function
The Weibull distribution function is a two-parameter
distribution, as shown in Figure 1. α is often called the
location parameter and β is the shape parameter. Since it is
parameterized, the Weibull is able to capture the failure
characteristics of many different types of systems. It is
described by:
- Distribution function: F(t) = 1 − exp(−αt^β)
- Density function: f(t) = αβt^(β−1) exp(−αt^β)
- Reliability function: R(t) = exp(−αt^β)
- Hazard function: h(t) = αβt^(β−1)
- MTTF = α^(−1/β) Γ(1 + 1/β)

Figure 1. Weibull failure density function f(t) vs. time t for
β = 0.6, β = 1 (exponential), β = 2, and β = 5. MTTF = 1000
for all four density functions.

Figure 1 shows the density function for the Weibull
distribution for several values of β. Observe that as β
increases the density function becomes more bell shaped
and clustered about its mean. A Weibull distribution with β =
3.5 approximates the normal distribution.
The Weibull hazard function, h(t), is illustrated in Figure
2 for different values of β. Observe that it is decreasing for β
< 1: as items age, they are less likely to fail. This matches
our intuition for many items such as software and metals that
work-harden; when such an item is first put into service, we
expect it to fail, but after it has been operating a while, our
confidence that it will not fail in the next increment of time
increases. For β = 1 the Weibull distribution reduces to the
exponential and the failure rate is constant. For β > 1, the
Weibull distribution has an increasing failure rate: as items
age, they are more likely to fail. This represents the
reliability characteristics of such items as mechanical devices
that wear out as they age.
Figure 2. Weibull hazard function h(t) (failures/hr) vs. time t
for β < 1, β = 1, and β > 1.
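These behaviors are easy to confirm directly from the formulas above; a sketch in the same parameterization, with assumed parameter values:

```python
import math

# Weibull functions in the parameterization used above: F(t) = 1 - exp(-alpha*t**beta).
def weibull_R(t, alpha, beta):
    return math.exp(-alpha * t**beta)

def weibull_h(t, alpha, beta):
    return alpha * beta * t**(beta - 1)

alpha, beta = 1e-6, 2.0   # assumed values; beta > 1 means an increasing failure rate
print(weibull_h(100, alpha, beta) < weibull_h(1000, alpha, beta))   # True: hazard grows with age

# For beta = 1 the Weibull reduces to the exponential with lam = alpha:
print(abs(weibull_R(500, 1e-3, 1.0) - math.exp(-1e-3 * 500)) < 1e-12)  # True
```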

The ready availability of software to calculate the
parameters of the Weibull distribution that best match a set of
failure data has led to a subdiscipline of reliability known as
Weibull analysis [23]. A careful Weibull analysis can give
many hints as to the causes of failure for a device. For
example, if failures are occurring during the decreasing-
failure-rate period (i.e., β < 1), they are most likely due to
components that were either defective or inadequately
burned-in; if they occur during the wearout period (i.e., β >
1), they are most likely due to items wearing out; if the data
can be partitioned into sets having different failure
characteristics, the components may have come from different
manufacturing lots or from different manufacturers; etc.
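A minimal version of such a fit can be sketched by linearizing the distribution function: ln(−ln(1 − F(t))) = β ln t + ln α, then fitting a least-squares line. The median-rank approximation of F and the failure times below are common illustrative assumptions, not data or methods prescribed by this tutorial:

```python
import math

def fit_weibull(times):
    """Estimate (alpha, beta) from failure times via a least-squares line on the linearized CDF."""
    ts = sorted(times)
    n = len(ts)
    xs, ys = [], []
    for i, t in enumerate(ts, start=1):
        F = (i - 0.3) / (n + 0.4)                 # median-rank estimate of F(t)
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1 - F)))
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
    alpha = math.exp(my - beta * mx)
    return alpha, beta

alpha_hat, beta_hat = fit_weibull([95, 160, 210, 270, 340, 430])
print(beta_hat > 1)   # tightly clustered lifetimes suggest wearout (increasing failure rate)
```

In practice dedicated Weibull-analysis software also handles censored data and confidence bounds, which this sketch ignores.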
The reliability characteristics of many systems follow
what is often described as a "bathtub"-shaped curve like that
shown in Figure 3. This is a composite of decreasing,
constant, and increasing failure rates at different times.

Figure 3. Bathtub curve for the reliability of most systems:
h(t) (failures/hr) vs. time t, with infant mortality, steady
state, and wearout regions.

When the device is first put into service (i.e., the infant
mortality period) the failure rate decreases with time as
defective components fail and are removed from the
population. During the steady state period system failures
occur due to accidents and random events rather than inherent
system weaknesses and the failure rate is approximately
constant. During the wearout period the failure rate
increases as accumulated wear and stresses lead to a loss of
resiliency.
It is important to realize that the bathtub shaped curve,
and in fact, any failure distribution, describes a population of
devices rather than a single item. For example, if we consider
the failure distribution for a population of automobile tires,
during the burn-in period, tires with defects such as being
out-of-round, with bubbles in the rubber, or which have poor
bonding between the cords and the rubber will fail quickly;
once these (defective) tires are removed from the population,
those that remain fail largely due to accidental road hazards or
punctures; as time goes by the tires accumulate wear, the
rubber gets thinner, and the flexing during use slowly
weakens the bonds with the cords leading to more failures.
The bathtub shaped curve for human life (Figure 4),
consisting of a decreasing infant mortality period, a
constant adult life period, and an increasing old age
period, also provides some perspective to the model. During
adult life the MTTF is approximately 800 years. This does
not imply that people live to be 800 years old. It means that
the death rate is 0.00125 deaths/yr. or about 5 deaths/yr. in a
population of 4000 adults. The same interpretation should be
kept in mind when considering electronic components whose
MTTFs are typically several million hours.
Figure 4. Bathtub-shaped curve of human reliability: h(t)
(deaths/yr) vs. age, with infant mortality, adult life (MTTF =
800 yr, h(t) ≈ 0.00125 deaths/yr), and old age.

3. COMPONENT-LEVEL RELIABILITY
Traditionally, reliability engineering has focused on the
failure characteristics of the individual components in a
system. The main idea is that the overall device reliability is
determined by the reliability of its individual components, so
it improves as component reliability is improved. Efforts to
improve the reliability of individual components have
concentrated on reducing the initial failure rate of a device
and shortening the infant-mortality or burn-in period, and on
reducing the failure rate during the constant-failure period
and extending the constant-failure period as long as possible.
3.1. Improving Early-Life Reliability
The objective during the early life of a product is to make
the initial failure rate as small as possible and to shorten the
burn-in period as much as possible. Techniques for doing this
focus on ensuring that high-quality parts are used, that the
parts are properly burned-in, and that component tolerances
are handled properly.
3.1.1. Part Quality
Some simple calculations help to show that a high level of
part quality is required. Consider a system with 1000 parts
and suppose the part defect rate is 0.005. Then the
probability that a system will contain all good parts is
(0.995)^1000 = 0.0066; the probability that the system will need
to be repaired to fix a defective component is 0.9934.
Furthermore, the expected number of defective components
per system is 1000 × 0.005 = 5 and hence, on average, 5
repairs will need to be made due to defective components.
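The arithmetic above can be checked directly:

```python
# Probability that a 1000-part system contains all good parts at a 0.005
# part defect rate, and the expected number of defects per system.
n_parts, defect_rate = 1000, 0.005
p_all_good = (1 - defect_rate) ** n_parts   # (0.995)^1000, roughly 0.0066
expected_defects = n_parts * defect_rate    # 5 defective parts per system, on average
print(p_all_good, expected_defects)
```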
Incoming part inspections are one way to reduce the
number of defective parts received. If we receive a shipment
of 1000 parts with a defect rate of 5/1000 parts we would
expect to have 5 bad components in the shipment. Suppose
we test 100 components, a 10% sample, from this shipment.
Then the probability that we would have 0 bad parts in the
sample is found from the hypergeometric distribution:
    P(0 defective) = C(5,0) C(995,100) / C(1000,100) = 0.59.

Clearly, we must cut down on the number of defective
Clearly, we must cut down on the number of defective
parts to reduce the probability that the product will be
defective, but sampling incoming parts to determine
acceptance or rejection of a shipment is not an effective way
of doing this. For even moderate defect rates the probability
of detecting a bad part in a small sample is not high and
increasing the sample size increases the cost of testing
components, especially if the tests are destructive.
Furthermore, as the quality of incoming parts improves,
sampling becomes increasingly less effective as a way of
finding unacceptable shipments.
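The hypergeometric calculation above can be reproduced directly (the function name is ours):

```python
from math import comb

# Probability that a sample of n parts drawn from a shipment of N parts
# containing D defectives has zero defectives (hypergeometric distribution).
def p_zero_defects(N, D, n):
    return comb(D, 0) * comb(N - D, n) / comb(N, n)

p = p_zero_defects(1000, 5, 100)
print(round(p, 2))  # 0.59: a 10% sample misses all 5 defects 59% of the time
```

Increasing the sample to 500 parts still passes a 5-defective shipment about 3% of the time, which illustrates why acceptance sampling scales poorly as quality improves.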
Measuring the variance in a shipment of incoming parts is
a more effective way than sampling of determining how many
bad parts are likely to be in the shipment. The parts produced
in any manufacturing operation will vary slightly due to many
factorsfor example, wear on the machinery used in the
manufacturing operation, variations in the materials used,
differences in the machine operators, to name just a few. The
combined result of all these variations is that the part
parameter can be closely modeled as having a normal
(Gaussian) distribution. This distribution is characterized by
its mean and standard deviation (σ) as shown in Figure 5.
68.3% of parts will lie within 1σ of the mean, 95.4% will lie
within 2σ of the mean, and 99.7% will lie within 3σ of the
mean. 3σ is usually taken as a measure of the variation in the
manufacturing process. By measuring the variance of the
incoming parts and knowing the allowable tolerance limits
one can estimate the number of bad (out-of-tolerance) parts in
a shipment. This procedure has been formalized in the use of
the process capability index.
Figure 5. Normal distribution, showing the part tolerance
limits and the ±3σ limits of the manufacturing process.

The product process capability indices, C_p and C_pk,
compare the part tolerance specification to the 3σ
manufacturing variation [5]. The C_pk index is defined as the
ratio of the difference between the process mean parameter
value (x̄) and the nearest tolerance limit to the 3σ process
variation:

    C_pk = min(|T_U − x̄|, |T_L − x̄|) / (3σ)

where T_U is the upper tolerance limit and T_L is the lower
tolerance limit. Ideally the process mean parameter value will
equal the nominal part specification, but often this is not the
case. A C_pk of 1 implies that the specification tolerance limits
are ±3σ of the production process (assuming the mean and
nominal parameter values are equal and the specification
limits are symmetrical).
Table 1 gives the probability that the production process
produces an item that is out of specification for various values
of C_pk. Since C_pk considers the nearest tolerance limit to the
mean, the Probability Out-of-Tolerance in Table 1 is a worst-
case estimate.

Table 1. C_pk index probability values

  C_pk   Tolerance   Probability Out   Defect Rate*
         Limit       of Tolerance
  0.33   1σ          0.31731           317,300 PPM
  0.67   2σ          0.0455026         45,502 PPM
  1.00   3σ          0.0027000         2,700 PPM
  1.33   4σ          0.0000633         63.5 PPM
  1.67   5σ          0.000000573       0.573 PPM
  2.00   6σ          1.973 × 10^-9     1.97 PPB

  * PPM = Parts Per Million; PPB = Parts Per Billion
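Assuming a centered normal process, the Table 1 probabilities follow from the two-sided normal tail probability at 3·C_pk standard deviations; a sketch (the helper names are ours):

```python
import math

def cpk(mean, sigma, t_lower, t_upper):
    """Process capability index: distance to the nearest tolerance limit over 3 sigma."""
    return min(t_upper - mean, mean - t_lower) / (3 * sigma)

def out_of_tolerance(cpk_value):
    """Worst-case out-of-spec probability for a centered normal process:
    2*(1 - Phi(3*Cpk)) = erfc(3*Cpk / sqrt(2))."""
    return math.erfc(3 * cpk_value / math.sqrt(2))

print(cpk(10.0, 0.5, 8.5, 11.5))   # limits at +/-3 sigma from a centered mean -> 1.0
print(out_of_tolerance(1.0))       # ~0.0027, matching the 3-sigma row of Table 1
```

Note that the table's C_pk entries are rounded (0.33 ≈ 1/3 for the exact 1σ limit), so recomputed probabilities for those rows differ slightly from the printed values.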

Usually x̄ is not known until after the production process
is set up and some units have been produced. Thus, in the
design process the C_p index, defined as:

    C_p = (T_U − T_L) / (6σ) = ΔT / (3σ),  where ΔT = (T_U − T_L)/2,

may provide an estimate of the out-of-tolerance probability.
The C_p and C_pk indices are related by the factor K,

    C_pk = C_p (1 − K)

where

    K = |x̄ − T_n| / ((T_U − T_L)/2)

and T_n is the nominal parameter value [5].
A C_pk (or C_p) of 1.33 or higher is generally considered
"good", 1.0 to 1.32 "marginal", and 1.0 or less "bad".*

* As process technology improves, these definitions of
"good", "marginal", and "bad" will likely change and higher
C_p or C_pk indices will be expected.

The capability index is often thought of as a measure of
component quality rather than reliability. However, it is
noteworthy that the design group and the manufacturing
group must work together to simultaneously maximize the
allowable tolerances in the design and to minimize the
variation in the manufacturing processes.
3.1.2. Taguchi Loss Function Tolerancing
The Taguchi loss function provides another way of
assessing the effect of variability in a design and, more
importantly from a managerial perspective, it relates the
variability to cost. Cost is usually far more important to
management than an abstract quality such as reliability. The
Taguchi loss function is defined as:

    L(y) = k(y − y_0)^2    (2)

where y is the characteristic of interest (e.g., a dimension,
performance measure, lifetime, etc.); y_0 is the target value of
the parameter; and k is a constant that depends on the
organization's cost structure [6].
L(y) associates a cost with the amount the parameter is off
target. Unlike other measures of variability, which treat all
items within specifications as equally good, the loss function
imposes a cost whenever the parameter differs from its target
value, even if it is still within specifications. Furthermore, the
cost increases rapidly as the parameter moves away from its
target value. This is illustrated in Figure 6 along with an off-
target density function for the parameter.
[Figure: the quadratic loss L(y) = k(y − y_0)² plotted together with an off-target parameter density f(y).]
Figure 6. Taguchi Loss function

The Taguchi loss function assumes that any parameter that
is off its target value imposes a cost on society. For example,
a pair of shoes that is noticeably too small might be purchased
but worn only once if it causes blisters, thus wasting its
initial cost. A pair that is only slightly too small might be worn
only occasionally (it is uncomfortable), so the buyer
does not realize its full value.
Using the cost function, an expected loss can be
developed for a manufacturing process that depends on both
the process variability and how much (if any) it has shifted
from its nominal value (y_0):

E[L] = k[(μ − y_0)² + σ²]   (3)

In eqn. (3), μ is the process mean and σ² is its variance.
The loss function provides a tool for evaluating the effect
of variation on possible design changes. For example,
consider the transistor circuit in Figure 7a with the gain
characteristic in Figure 7b. The loss function is calibrated by
considering the loss at an extreme point. Suppose that the
nominal line voltage is 115 V and the circuit, valued at $100,
is destroyed if the voltage rises to 140 V. Then eqn. (2)
reduces to $100 = k(140 − 115)², which yields k = $0.16/V².
If the transistor has a nominal gain of 23 and a standard
deviation of 2.5 V, we find an expected loss of $1.00 that
must be included in the warranty cost. If a transistor with less
variation would be too expensive, we can try moving the
nominal operating point to the flat part of the gain curve by
using a transistor with a nominal gain of 35 and a standard
deviation of 0.66 V. This produces an off-target output
voltage of 122 V. From eqn. (3) the expected loss is $7.91, of
which $7.84 is due to being off-target and $0.07 is due to
variation. Replacing the load resistor with one of 60 kΩ
brings the design back on-target by reducing the gain and
gives an expected loss of $0.07.
[Figure: (a) transistor circuit with a 150 V supply, load resistor R = 40 kΩ, input V_in, and output V_out; (b) V_out (105–120 V) versus transistor gain (10–50).]
(a) (b)
Figure 7. Transistor circuit (a) and gain (b).
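The dollar figures in this example follow directly from eqns. (2) and (3) and can be verified with a few lines of Python (a sketch of the arithmetic above, nothing more):

```python
def taguchi_expected_loss(k, target, mean, sigma):
    # E[L] = k[(mu - y0)^2 + sigma^2], eqn. (3)
    return k * ((mean - target) ** 2 + sigma ** 2)

# Calibrate k at the extreme point: a $100 loss at 140 V, target 115 V
k = 100.0 / (140.0 - 115.0) ** 2   # = $0.16 per V^2

print(taguchi_expected_loss(k, 115.0, 115.0, 2.5))   # $1.00  (on target, sigma = 2.5 V)
print(taguchi_expected_loss(k, 115.0, 122.0, 0.66))  # $7.91  (off target, sigma = 0.66 V)
print(taguchi_expected_loss(k, 115.0, 115.0, 0.66))  # $0.07  (back on target)
```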

3.2. Improving Component Reliability
The objective of fault intolerance (also called fault
avoidance) techniques is to reduce the overall failure rate
over the life of the system, essentially lowering the bathtub
curve in Figure 3. In terms of the exponential reliability
model, eqn. (1) (which is applicable during the constant
failure rate period), they seek to reduce the failure rate λ. A
common approach in many reliability models (e.g., Mil-
Hdbk-217 [7], Bellcore Reliability Prediction Procedure [8])
is to express the failure rate as the product of a base failure
rate λ_b and several pi factors determined by the part
quality (Π_Q), the environment (Π_E), and other factors. Thus

λ_p = λ_b Π_Q Π_E ...
Reducing the overall failure rate involves:
- improving component quality
- derating components
- improving the environment
- reducing complexity
- miscellaneous techniques

3.2.1. Improving Component Quality
Most manufacturers produce several quality grades of
parts. These grades are determined partly by the number and
types of monitors, controls, and inspections put on the
manufacturing process. Mil-Hdbk-217 provides quality
rating scales based on the Joint Army Navy (JAN)
specification used by the military.
Part quality can also be improved by doing burn-in tests
and acceptance testing on incoming parts (but see the
discussion on part quality in Section 3.1.1). Working with
suppliers to reduce variability (discussed earlier) and
performing quality audits of supplier manufacturing are also
effective in improving component quality. Although it is a
management technique rather than a design technique,
the practice of qualifying suppliers who deliver high quality
parts and giving them a preferred status is an effective way
of improving component quality.
3.2.2. Derating Components
Components are derated by using them at less than their
rated level of stress. Usually this is done by replacing a
component with one of greater strength. Different
components are derated in different ways as shown in
Table 2.

Table 2. Component derating methods
Type Component Load parameter
Resistor Operating power
Capacitor Applied voltage
Diode Applied current
Transistor Power dissipated
Semiconductor Power dissipated

Some rough but useful guidelines for derating provide that
it should not be conservative to the point of increasing costs
excessively. (Usually higher rated parts are more expensive
than lower rated parts.) Nor should it be done to the point
where the device complexity must be increased to gain back
the performance lost by derating, thus offsetting the benefits
of derating. Typical derating factors range from 0.5 to 0.8.
Light bulbs provide an excellent example of the
effectiveness of derating at increasing component lifetimes
and the tradeoffs that often must be made with performance.
Standard 60 W light bulbs are rated by the manufacturer in
terms of their life expectancy and the amount of light
produced. Table 3 shows this relationship for several
manufacturers' light bulbs.

Table 3. Light output (Lumens) and Life
(Hours) of 60 W light bulbs.
Lumens Life (Hours)
870 1000
680 1500
620 4000
563 5400

Over-rating is the opposite of derating. This is sometimes
done to gain performance, but almost always at the expense
of reliability. This effect was clearly illustrated in a senior
design project where students found that they could get better
performance from a small 5 V motor by running it at 10 to 15
V. Although the motor appeared to work fine at first, the
class was plagued by a rash of motor failures.
3.2.3. Reducing Complexity
One measure of complexity is the number of separate
devices needed to implement a given function or piece of
equipment. By this measure a circuit board that utilizes ten
integrated circuits to implement some function is more
complex than one that uses two. Similarly, a bumper for a car
that is built up from several separate pieces is more complex
than one that is a single unit. Complexity is reduced by
achieving a higher level of component integration and
reducing the number of components needed to implement a
function. The failure rate for non-redundant systems is the
sum of the individual component failure rates. Thus, if there
are fewer components, all else being the same, there will
generally be a lower failure rate. But all else is usually not
the same. One Very Large Scale Integrated (VLSI) circuit
chip might replace several Small Scale Integrated (SSI)
circuit chips on a board, but its failure rate will likely be
higher than that of an SSI chip. Similarly, the single unit
bumper might be more difficult to manufacture within
tolerances than each piece of the multi-piece bumper.
However, in many cases, device reliability and the level of
integration have a power law relationship so that the
reduction in the number of components more than
compensates for any accompanying increase in the
component failure rate caused by increasing its level of
integration. Consider, for example, a 1 M bit DRAM. The
device model from Mil-Hdbk-217 is

λ = [C_1 Π_T + C_2 Π_E + λ_CYC] Π_Q Π_L

The important factors are C_1, the circuit complexity
failure rate, and C_2, the package complexity failure rate. The
C_1 complexity factor of a DRAM is measured by the number
of bits B; the values in Table 4 correspond to

C_1 = 0.00125 (B/16,384)^(1/2)

The C_2 complexity factor depends on the type of package;
for a DIP,

C_2 = 9.0×10⁻⁵ N^1.51

where N is the number of pins in the package. These values
result in the C_1 and C_2 factors shown in Table 4 for the
DRAMs.

Table 4. DRAM circuit (C_1) and packaging (C_2)
complexity factors
Device                C_1      C_2
64 K DRAM, 18 pins    0.0025   0.0071
256 K DRAM, 20 pins   0.0050   0.0083
1 M DRAM, 22 pins     0.010    0.0096

A 1 M bit memory built from these chips would result in
the values of C_1 and C_2 in Table 5, which shows the
advantage of increasing the level of component integration.

Table 5. Circuit (C_1) and packaging (C_2) complexity
factors for 1 M bit memory
Device                Number of chips   Total C_1   Total C_2
64 K DRAM, 18 pins    16                0.04        0.114
256 K DRAM, 20 pins   4                 0.02        0.033
1 M DRAM, 22 pins     1                 0.01        0.01
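The table entries can be reproduced with a short Python sketch. Note that the C_1 expression here is a fit consistent with the tabulated values (Mil-Hdbk-217 itself tabulates C_1 in steps by bit count), and the function names are ours:

```python
def c1_dram(bits):
    # Circuit complexity factor: a fit consistent with the Table 4 values
    # (doubles for each 4x increase in bits, anchored at 16 Kbit)
    return 0.00125 * (bits / 16384.0) ** 0.5

def c2_dip(pins):
    # Package complexity factor for a DIP: C2 = 9.0e-5 * N^1.51
    return 9.0e-5 * pins ** 1.51

# Per-chip factors and totals for a 1 M bit memory, as in Tables 4 and 5
for name, bits, pins, chips in [("64 K", 64 * 1024, 18, 16),
                                ("256 K", 256 * 1024, 20, 4),
                                ("1 M", 1024 * 1024, 22, 1)]:
    c1, c2 = c1_dram(bits), c2_dip(pins)
    print(f"{name}: C1={c1:.4f} C2={c2:.4f} "
          f"total C1={chips * c1:.2f} total C2={chips * c2:.3f}")
```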

The steady decrease of the device size in integrated
circuits has led to an increasing level of component
integration in electronic devices. The result is: fewer
components, fewer solder joints, less required board space
allowing smaller boards, generally less required power
resulting in cooler operating temperatures, better
performance, and improved reliability. The synergy produced
by these advantages has driven electronics toward ever higher
levels of component integration.
Another goal of reducing complexity is to simplify device
assembly. To do this the designer should: Use fewer parts;
use fewer distinct parts or part numbers; key parts so that they
can be assembled only one way; and use fewer vendors.
Pugh's complexity factor provides a useful measure of
complexity [9]:

C = (P · T · I)^(1/3) / F

where:
P = number of parts
T = number of types of parts
I = number of interfaces
F = number of functions
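As a quick sketch of the metric (the part counts below are hypothetical, chosen only to illustrate the calculation):

```python
def pugh_complexity(parts, types, interfaces, functions):
    # C = (P * T * I)^(1/3) / F: geometric-mean-like measure of
    # structural complexity per function delivered (lower is simpler)
    return (parts * types * interfaces) ** (1.0 / 3.0) / functions

# Hypothetical assembly: 20 parts, 8 part types, 15 interfaces, 3 functions
print(pugh_complexity(20, 8, 15, 3))
```

Reducing the part count or the number of distinct part types lowers C directly, which is why the assembly guidelines above all push in the same direction.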
3.2.4 Improving the Environment
Environmental effects on the components in a system can
be reduced by isolating the components to protect them from
the environment or by choosing components that are tolerant
of the environment. For example, germanium transistors are
known to be more tolerant of radiation than silicon transistors
and should be used if a circuit must operate in that type of
environment. Rubber supports are often used to shield parts
from shocks and vibration in the environment.
In many cases environmental effects combine to either
intensify or weaken their effects. For example, the
mechanical deterioration due to sand and dust in the
environment is weakened by high temperatures but
accelerated when combined with vibration. Mil-Hdbk-338
describes many environmental effects. [9]
One particularly important environmental effect is heat.
Several models are used to show the effect of heat on
component reliability. The Arrhenius model is one of the best
known of these.
Arrhenius Model. In the late 1800s Svante Arrhenius, a
Swedish chemist, found that a plot of the log of the reaction
rate versus the inverse of the temperature was a straight line
for the inversion of sucrose. In modern terminology this is
expressed as,

k = k_0 exp(−E_a / (K_B T))

where,
T is the absolute temperature (degrees Kelvin)
E_a is the activation energy (eV)
K_B is Boltzmann's constant (8.617×10⁻⁵ eV/K)
k_0 is a constant.
The Arrhenius equation is often used to describe the effect
of steady state temperature on many of the physical and
chemical processes such as ion drift and impurity diffusion
that lead to component failures. Assuming that failures occur
when the concentration of the reactant corresponding to a
particular failure mechanism reaches some critical value, the
change in time to failure from t_1 to t_2 due to a change in
temperature from T_1 to T_2 is expressed as,

t_1/t_2 = exp[(E_a/K_B)(1/T_1 − 1/T_2)]

We observe that if T_2 > T_1 then t_1 > t_2, implying that the
device fails sooner at the higher temperature. This
relationship between temperature and reliability is captured in
many reliability prediction procedures. For example, Mil-
Hdbk-217 provides the following temperature factor for
composition resistors:
|
|
.
|

\
|
|
|
.
|

\
|

= H

298
1 1
10 617 . 8
exp
1
3
T
E
a
T

Here, the activation energy for composition resistors is
determined experimentally to be 0.2 eV and 298 K (25 C) is
the base temperature for measurement. Figure 8 shows the
effect of temperature on H
T
.
[Figure: Π_T rising from 1 at 25 °C to about 10 at 150 °C.]
Figure 8. Temperature acceleration factor.
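The temperature factor and the Arrhenius time-to-failure ratio can be sketched as follows (a minimal illustration using the 0.2 eV activation energy quoted above; the function names are ours):

```python
from math import exp

K_B = 8.617e-5  # Boltzmann's constant, eV/K

def pi_t(temp_c, e_a=0.2, base_c=25.0):
    # Mil-Hdbk-217-style temperature factor:
    # Pi_T = exp[(-E_a / K_B)(1/T - 1/T_base)], temperatures in Kelvin
    t = temp_c + 273.0
    t_base = base_c + 273.0
    return exp((-e_a / K_B) * (1.0 / t - 1.0 / t_base))

def life_ratio(t1_c, t2_c, e_a):
    # t1/t2 = exp[(E_a / K_B)(1/T1 - 1/T2)]: how much longer a part
    # lasts at temperature T1 than at the higher temperature T2
    return exp((e_a / K_B) * (1.0 / (t1_c + 273.0) - 1.0 / (t2_c + 273.0)))

print(pi_t(25.0))    # 1.0 at the 25 C base temperature
print(pi_t(150.0))   # roughly 10, as in Figure 8
print(life_ratio(25.0, 150.0, 0.2))
```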

Accelerated Stress Testing. In accelerated stress testing
components are subjected to a much harsher environment
than their normal operating environment in order to induce
the equipment to fail. Then by examining the types of
failures that occur the design can be changed to make the
equipment more robust. The stressed environment usually
involves operating the equipment at
- higher and lower temperatures,
- with high levels of vibration and sometimes shock
- at higher and lower voltages
- cycling between high and low temperatures
To the extent that the stressed environment does not induce
new failure mechanisms this type of testing can reveal
weaknesses in the design, which can then be improved upon
[12].
3.2.5. Miscellaneous Techniques
The designer must be aware of the factors that affect the
failure rate of his or her particular design. For example, NPN
transistors have a lower inherent failure rate than PNP
transistors due to their having a lower junction temperature;
so the logic in digital circuits should be designed to use NPN
transistors.
General Electric found that the expected life of an
incandescent lamp is reduced by 25% if a single spot on the
filament is just 1% less than specifications and that the
mandrel on which the filament is coiled must be accurate to
0.0001 in. or the life of the lamp is reduced by 20%.
Knowing this, manufacturers set very tight tolerance limits
and manufacturing controls to have as little variation in the
filament diameter as possible.
3.3. Software Reliability
Software is a major component of most systems with any
degree of complexity and its importance is increasing
particularly with respect to embedded control systems.
Software cannot be ignored in any assessment of reliability.
Software differs from hardware in several respects:
- It does not wear out or degrade with use or over time.
- It can fail only when it is being used. Hence
reliability must be measured with respect to execution
time.
- Errors are logical rather than physical; they occur
because of design faults, not because some physical
parameter has changed.
- It is generally not tested to the same degree as
hardware.
- It can be much more complicated than anything done
in hardware.
Many models have been developed for assessing software
reliability. One relatively simple model, which we will
discuss, is the Basic Execution Time Model [13]. This model
assumes that a program contains a finite number of faults and
that the failure rate decreases as faults are found and fixed.
Thus, the software failure rate as a function of the number of
faults found (μ) is:

λ(μ) = λ_0(1 − μ/V_0)   (4)

where:
λ_0 is the initial failure rate
μ is the number of faults found
V_0 is the initial number of faults
This is shown in Figure 9. Eqn. (4) leads to a software
reliability model in terms of the program execution time t,

λ(t) = λ_0 exp(−λ_0 t / V_0)
The model can be used to determine how much more
execution time will be required for testing and how many
more faults must be found and fixed to achieve a given level
[Figure: λ(μ) = λ_0(1 − μ/V_0), decreasing linearly from λ_0 at μ = 0 to zero at μ = V_0 faults found.]
Figure 9. Basic Execution time software reliability model
Table 6. Software defect ratio per 1000 lines of code

Development Process                       Total errors    Faults remaining
                                          in development  at delivery
Traditional Development
 (bottom-up design; unstructured code;
 fault removal through testing)               50-60            15-18
Modern Practice
 (top-down design; structured code;
 design/code reviews & inspections;
 incremental releases)                        20-40            2-4
Advanced Software Engineering Practices
 (verification practice; reliability
 measurement & tracking)                      0-20             0-1


of reliability. By considering the limiting resources available
(programmers to fix faults when they are found, test
personnel, machines to use for running tests, etc.), the
execution time estimates can be mapped into an elapsed-time
model.
The initial failure rate can be found using rules of thumb
for the development methodology. Table 6 shows some
rough guidelines. Alternatively, it can be determined by
measuring the number of failures during the first few hours of
testing.
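The planning calculations implied by the Basic Execution Time Model follow directly from eqn. (4) and the execution-time failure rate above. A sketch (the λ_0 and V_0 values are illustrative, not from the tutorial):

```python
from math import exp, log

def failure_rate_after_time(lam0, v0, t):
    # lambda(t) = lambda_0 exp(-lambda_0 t / V_0)
    return lam0 * exp(-lam0 * t / v0)

def added_test_time(lam0, v0, lam_present, lam_target):
    # Extra execution time to drive the failure rate from lam_present
    # down to lam_target: dt = (V_0/lambda_0) ln(lam_present/lam_target)
    return (v0 / lam0) * log(lam_present / lam_target)

def faults_to_remove(lam0, v0, lam_present, lam_target):
    # From eqn. (4), mu = V_0(1 - lambda/lambda_0), so
    # d_mu = (V_0/lambda_0)(lam_present - lam_target)
    return (v0 / lam0) * (lam_present - lam_target)

# Illustrative numbers: 100 initial faults, 10 failures per CPU-hour initially
lam0, v0 = 10.0, 100.0
print(added_test_time(lam0, v0, 10.0, 1.0))   # CPU-hours to cut the rate 10x
print(faults_to_remove(lam0, v0, 10.0, 1.0))  # faults to find and fix: 90
```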

4 SYSTEM-LEVEL RELIABILITY
4.1. System Structure
Most devices are composed of more than one component.
In such cases, the reliability of the device depends upon the
reliability of the underlying components and how they are
configured to form the higher-level system. From a reliability
point of view we are mainly concerned with how failures of
the system's components affect the operation of the system.
This effect is captured in a reliability logic diagram which
shows the reliability structure of the system.
Series Systems. A series system is one in which the device
fails if any of its components fail. It is represented by a
series diagram like that in Figure 10. Light switches
connected in series are a good analogy of a series system. If a
component is working, the corresponding switch is closed; if
the component has failed the corresponding switch is open.
For the light to be on, all of the switches must be closed,
i.e., all of the components must be working. Most simple
devices are of this type.
Since a series system fails when the first component fails,
the time to failure for the system is the minimum of all the
component times to failure. Thus, for a series system with n
components, the probability of surviving beyond time t is,

Pr(T_SYS > t) = Pr(T_1 > t AND T_2 > t AND ... AND T_n > t)

where T_SYS is the system time to failure and T_i, i = 1, ..., n, are
the component times to failure. When the component states
(working, not working) are mutually independent, this
reduces to the product of the reliabilities of its constituent
parts:

R_SYS = R_1 R_2 ... R_n

where R_SYS is the system reliability and R_i is the reliability of
component i.
Redundant (Parallel) Systems. When the system has
redundant components, the redundancy can be represented by
"parallel" blocks in the diagram as illustrated in Figure 11.
Again, using the light switch analogy, light switches in
parallel are a good analogy for a redundant systemthe light
is on if either switch is closed. The reliability of a system
with redundant components depends on the type of
redundancy used. In the simplest case, the system will
continue to operate as long as any one of the components
operates. In that case the system time to failure is the
maximum of the time to failure of its components, and we
have,
Pr(T_SYS > t) = Pr(T_1 > t OR T_2 > t OR ... OR T_n > t)

The system reliability is then given by the expression:

R_SYS = 1 − [(1 − R_1)(1 − R_2) ... (1 − R_n)]

where again, R_i is the reliability of the i-th redundant
component and independence is assumed.
Redundancy has two main uses in design:
- to eliminate single points of failure.
- to increase reliability or availability
Eliminating single points of failure is very important in
military systems and becoming more important in commercial
systems since such points are obvious targets for saboteurs.
Redundancy is also a requirement for high availability
systems in which repairs must be made without shutting down
the system and it is an important consideration for safety
critical systems.
Redundancy is often viewed as an easy way to increase
system reliability. A simple calculation shows why. Suppose
each component in a redundant system configuration has a
(fairly poor) reliability of 0.9 for some time period. A two-
component system would give a reliability of 0.99; three
components would give a reliability of 0.999; four
components would give a reliability of 0.9999, and so on. It
is all very simple, very appealing, and very, very misleading.
When used to increase reliabilitythat is extend the
lifetime of the system with high probabilityredundancy
may be particularly ineffective and it depends critically on the
component failure distribution function. We will discuss this
further in Section 4.2.
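The "very simple, very appealing" calculation above can be sketched in a few lines (and, as the text warns, this arithmetic holds only for the probability of survival over one fixed period with independent components; it says nothing about extending lifetimes when failure rates are not constant):

```python
def series_reliability(rs):
    # R_sys = R1 * R2 * ... * Rn
    r = 1.0
    for ri in rs:
        r *= ri
    return r

def parallel_reliability(rs):
    # R_sys = 1 - (1 - R1)(1 - R2)...(1 - Rn)
    q = 1.0
    for ri in rs:
        q *= (1.0 - ri)
    return 1.0 - q

# Redundant components, each with reliability 0.9 for some period
for n in range(1, 5):
    print(n, parallel_reliability([0.9] * n))  # 0.9, 0.99, 0.999, 0.9999
```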
Other types of redundancy besides parallel redundancy
that are often used include M-out-of-N:G (read "M out of N,
good") redundancy, in which the system continues to operate
as long as M of its original N components are operational, and
standby redundancy in which a spare module can be switched
in to replace a failed module. These types of redundancy are
most often used in degradable systems and in systems with
backup subsystems. One problem with all types of
redundancy is that as the level of redundancy increases, the
reliability of whatever mechanism is used to determine which
modules are good or to switch in good modules (and switch
out bad modules) can quickly dominate the overall system
reliability.
Combination Systems. Most reliability logic diagrams
consist of combinations of series and parallel components.
Figure 12 shows a reliability logic diagram for a high
availability computer system with redundant processor, disk
array controllers, and mirrored disks. Observe that in this
system each pair of redundant disk array controllers is in
series with a processor; this subsystem, in turn, is redundant,
and the entire processor subsystem is in series with the disk
array subsystem which is itself quadruple redundant. The
analysis of such systems consists of analyzing each series or
parallel subsystem iteratively until the whole system is
analyzed. This is straightforward but it can become rather
tedious.
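The iterative series/parallel reduction described above can be sketched for the Figure 12 system. The component reliabilities below are hypothetical, chosen only to illustrate the reduction:

```python
def series(*rs):
    # Series reduction: product of the block reliabilities
    r = 1.0
    for x in rs:
        r *= x
    return r

def parallel(*rs):
    # Parallel reduction: 1 minus the product of the unreliabilities
    q = 1.0
    for x in rs:
        q *= (1.0 - x)
    return 1.0 - q

# Hypothetical component reliabilities for some mission time
r_proc, r_ctrl, r_disk = 0.95, 0.99, 0.90

# Each branch: a processor in series with a pair of parallel controllers
branch = series(r_proc, parallel(r_ctrl, r_ctrl))
# The two branches are redundant; the result is in series with
# the quadruple-redundant disk arrays
r_sys = series(parallel(branch, branch),
               parallel(r_disk, r_disk, r_disk, r_disk))
print(r_sys)
```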

[Blocks 1, 2, ..., n connected in a chain.]
Figure 10. "Series" reliability logic diagram.

[Blocks 1, 2, ..., n stacked in parallel.]
Figure 11. "Parallel" reliability logic diagram.

[Two parallel branches, each a Processor in series with two parallel Disk Array Controllers; this subsystem is in series with four parallel Disk Arrays.]
Figure 12. Reliability logic diagram for a high-availability
computer system.


Figure 13. Diodes electrically in series.

The type of reliability logic diagram used to model the
reliability of a system depends strongly on the type of failure
mode being considered. For example, a system consisting of
two diodes that are electrically in series (Figure 13), is
modeled as a series system (both must work) if the failure
mode "open" (current is blocked) is being analyzed. They are
modeled as a parallel system (either must work) if the failure
mode "short" (no rectification) is being considered. The
analysis of a system having redundant components can get
very complex when they can have different failure modes.
4.2. Effects of the Failure Distribution on System Reliability
The reliability of a system depends critically on the
failure distribution function. This is illustrated in Tables 7
and 8 for redundant and series systems with three different
distribution functions. All three distribution functions have
the same MTTF for a single component.
Table 7 shows the effect of increasing the level of
redundancy on the system MTTF, assuming no repairs are
made following a failure. In all three cases the MTTF
increases with the level of redundancy. For the 4-unit parallel
system the exponential distribution has approximately double
the MTTF of the simplex system. For the first Weibull
distribution (β = 0.6), which has a decreasing hazard function,
the gain is even greater at all levels of redundancy. On the
other hand, for the second Weibull distribution (β = 5), which
has an increasing hazard function, the increase is not nearly as
much and diminishing returns quickly set in as the
redundancy is increased.

Table 7. MTTF for redundant system (without repair)
with different failure distributions.

Parallel System     Weibull           Exponential   Weibull
                    α = 0.020251,     λ = 0.001     α = 6.5×10⁻¹⁶,
                    β = 0.6                         β = 5
1-unit (Simplex)    1000              1000          1000
2-unit              1685              1500          1130
3-unit              2215              1833          1192
4-unit              2652              2083          1231

The opposite result occurs for the series system, as shown
in Table 8, which gives the system MTTF as more
components are added to a series system for the same three
distribution functions. For the exponential distribution, the 4-
unit series system has an MTTF one quarter that of the
simplex system. For the first Weibull distribution (β = 0.6)
the MTTF decreases even faster; for the second Weibull
distribution (β = 5) the decrease is much less.

Table 8. MTTF for series system with different failure
distributions.

Series System       Weibull           Exponential   Weibull
                    α = 0.020251,     λ = 0.001     α = 6.5×10⁻¹⁶,
                    β = 0.6                         β = 5
1-unit (simplex)    1000              1000          1000
2-unit              315               500           871
3-unit              160               333           803
4-unit              99                250           758
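For components with identical single-unit MTTF, the table entries can be reproduced in closed form. The inclusion-exclusion formula below is our derivation for Weibull survivor functions R(t) = exp(−αt^β), not a formula from the tutorial; β = 1 gives the exponential case:

```python
from math import comb

def mttf_series(mttf1, beta, n):
    # n i.i.d. Weibull components in series: MTTF_n = MTTF_1 * n^(-1/beta)
    return mttf1 * n ** (-1.0 / beta)

def mttf_parallel(mttf1, beta, n):
    # n i.i.d. Weibull components in parallel (no repair), by
    # inclusion-exclusion on the system survivor function:
    # MTTF_n = MTTF_1 * sum_k C(n,k) (-1)^(k+1) k^(-1/beta)
    return mttf1 * sum(comb(n, k) * (-1) ** (k + 1) * k ** (-1.0 / beta)
                       for k in range(1, n + 1))

for beta in (0.6, 1.0, 5.0):
    print(beta,
          [round(mttf_parallel(1000.0, beta, n)) for n in (1, 2, 3, 4)],
          [round(mttf_series(1000.0, beta, n)) for n in (1, 2, 3, 4)])
```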

The results in Tables 7 and 8 are easily explained with
reference to Figure 1, which shows the density functions for
the three distributions.
First, comparing the exponential and the first Weibull (β =
0.6) distributions, observe that the exponential distribution
has a very long tail: 63.2% of the component population fail
before 1 MTTF, but those that survive may survive quite a
long time; 13.5% survive longer than 2 MTTF and 5% longer
than 3 MTTF. Thus for the parallel system, which works as
long as any unit is working, increasing the level of
redundancy greatly increases the likelihood that at least one
unit has not failed and the MTTF is greatly increased. This
effect is even more pronounced in the Weibull distribution in
which the failure rate decreases with time: only 27.8%
survive more than 1 MTTF, but 14.4% survive for 2 MTTF
and 8.5% survive for 3 MTTF. For the series system, the
large proportion of components in both distributions that fail
well before 1 MTTF increases the likelihood that at least one
component will fail resulting in a greatly reduced MTTF as
the number of components increases. 39.4% of components
from the exponential distribution fail before 0.5 MTTF; 57%
of components from the Weibull distribution.
Comparing the exponential and second Weibull (β = 5)
distributions, we observe that the Weibull distribution is very
clustered about its MTTF; very few components fail before
1/2 MTTF or last longer than 1.5 MTTF. Thus, in the parallel
system, adding more components easily extends the lifetime
of the system for exponentially distributed components but it
is very unlikely to extend the system lifetime much beyond
1.5 MTTF for the Weibull. Similarly, in a series system one
of the exponentially distributed components is quite likely to
fail well before 1 MTTF whereas this is much less likely for
the Weibull distribution.
As already discussed, the exponential distribution is
particularly tractable and makes reliability calculations quite
easy. However, as we have seen in Tables 7 and 8 the effect
of assuming a constant failure rate when the actual failure
probability is either increasing or decreasing can lead to very
erroneous results. This is especially true for the clustered
distribution (Weibull β = 5); redundancy does not appreciably
increase the system lifetime as might be anticipated and a
series combination of components does not shorten it as much
as might be anticipated.
Modern, high quality manufacturing strives to reduce the
variance in the products produced. Thus, important product
parameters tend to be tightly clustered about their mean
values with only small variation. A consequence of this is
that products whose predominant failure mode is due to
wearout tend to have lifetime distributions that are also tightly
clustered. Examples of such components include
incandescent lamps, electronic tubes, and many mechanical
components. As an example, a study by General Electric
found that the expected life of an incandescent lamp is
reduced by 25% if a single spot on the filament is just 1% less
than specifications. Knowing this the manufacturer set very
tight tolerance limits and manufacturing controls to have as
little variation in the filament diameter as possible. The
predictable result is that the lamps' lifetimes are tightly
clustered.
From another perspective, a manufacturer who produces
lamps with an advertised mean lifetime of 1000 hours (typical
of standard 60W light bulbs) would receive many complaints
if many of those bulbs lasted only 500 hours. A constant
failure rate predicts that 39% of such lamps would fail within
their first 500 hours of operation and 22% would fail in their
first 250 hours of operation. A manufacturer striving for a
reputation for high quality products could not tolerate this.
Other components without a predominant wearout failure
mode might be better described by a constant or decreasing
failure rate. Many electronic parts are of this type. For
products with a decreasing failure rate, allowing a significant
burn-in period before shipment can be a useful way of
improving reliability. Products that do not fail then are less
likely to fail later. For example, the reliability of disk drives
during the first few months of operation is typically less than
it is later [14].
The implications are clear. Ignoring the distribution and
assuming a constant failure rate equal to the inverse of the
system or component MTTF can give extremely misleading
results when modeling system reliability. The results of such
models are not conservative in the sense that they under
estimate reliability, nor are they close approximations, nor
do they provide bounds on the system reliability. They are
simply wrong! Reliability engineers must determine and
consider the actual failure distributions of the components in
their designs. Models based on generic failure rates, when in
fact the actual failure rate is either increasing or decreasing,
are of little value and might be harmful to a design.
4.3. Reliability Specification and Allocation
Reliability is a non-deterministic performance
requirement in much the same way as appearance. This
makes it difficult to specify and even more difficult to
measure whether it has been achieved. Most authorities warn
against asserting vague reliability requirements such as as
reliable as possible, or will have high reliability, or more
reliable than product X, or even the reliability will be
99%. They suggest focusing instead on describing the
environment in which the product will function, defining
failures in terms of its functions, and identifying particularly
critical failure modes that must have a very low probability of
occurrence. Even so, a quantitative reliability requirement
can be useful and sometimes gives perspective on what will
be required of the system.
As noted in Section 1.3 reliability affects such factors as
the product warranty, the cost of ownership, customer
satisfaction, and the competitive positioning of the product in
the market. In the computer industry, sophisticated customers
are beginning to demand that manufacturers provide a
reliability or availability guarantee with their products. An
economic analysis of these requirements can often be used to
quantify the required reliability for a product.
Some facts from the medical profession help to put this
in perspective. If the medical profession had 99.9%
reliability:
- Surgeons would do 26,000 bad operations per year;
- Pharmacists would fill 20,000 drug prescriptions
wrong
- Doctors and nurses would drop 15,000 new born
babies
This of course results from the huge number of operations,
prescriptions filled, and babies born each year.
Reliability allocation is the process of distributing the
required system reliability, R_sys(t), among the various
modules making up the system. In general reliability is
allocated assuming a series system:

R_sys(t) = ∏_{i=1}^{n} R_i(t)

where R_i(t) is the i-th module reliability and n is the number of
system modules.
The simplest allocation technique is equal reliability
allocation, in which case,

R_i(t) = [R_sys(t)]^{1/n}

With 5 modules a system reliability R_sys(t) = 0.99
generates a module reliability requirement of R_i(t) = 0.9980;
10 modules generate a module reliability requirement of
0.9990; 20 modules generate a requirement of 0.9995. Equal
reliability gives the lowest reliability requirement for all
modules, requires no knowledge of function, and usually
provides a good ballpark estimate of the module reliability
needed to achieve the system reliability. It is worthwhile to
observe how quickly the reliability requirement for each
module increases as the number of modules increases.
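The equal-allocation numbers above are a one-line calculation:

```python
def equal_allocation(r_sys, n):
    # R_i = R_sys^(1/n): every module gets the same reliability requirement
    return r_sys ** (1.0 / n)

# Module requirements for a system requirement of 0.99
for n in (5, 10, 20):
    print(n, round(equal_allocation(0.99, n), 4))  # 0.998, 0.999, 0.9995
```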
Another common allocation technique is complexity
weighting where complexity is measured by the number of
module components. In this technique the failure rate for
each module is set proportionally to the number of module
components:
λ_i = (n_i / N) λ_sys

where:
λ_i = module failure rate
λ_sys = system failure rate
n_i = estimated number of components in module i
N = Σ n_i (summed over all modules)
The allocated module reliability is

R_i(t) = exp(-λ_i t) = [R_sys(t)]^{n_i/N}     (5)
The count of active components such as transistors,
diodes, and integrated circuits, is often adjusted by a
weighting factor (typically 1.3 to 3.0) since active
components tend to fail more often than passive components.
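As a sketch, the complexity-weighted allocation of eqn. (5) might look like the following in Python; the module names, component counts, and the 1.5 active-component weighting factor are invented for illustration:

```python
def complexity_allocation(r_sys, counts, weight=1.0):
    """Allocate r_sys over series modules in proportion to weighted
    component counts: lambda_i = (n_i / N) * lambda_sys, which gives
    R_i = r_sys ** (n_i / N).  `counts` maps a module name to its
    (active, passive) component counts; active components get the
    extra weighting factor since they tend to fail more often."""
    n = {m: weight * active + passive for m, (active, passive) in counts.items()}
    N = sum(n.values())
    return {m: r_sys ** (n[m] / N) for m in n}

# Hypothetical example: three modules, active counts weighted by 1.5.
alloc = complexity_allocation(
    0.99, {"cpu": (120, 40), "io": (60, 80), "psu": (20, 60)}, weight=1.5)
```

By construction the product of the allocated module reliabilities recovers the system requirement, and the most complex module receives the lowest allocation.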
Another technique, especially useful in the early concept phase of the design, is the relative weight allocation method. In this technique each subsystem is ranked relatively (on a scale of 1 to 10) according to such factors as its complexity, how far it pushes the state of the art, its working environment, the time needed to achieve a given level of reliability, the importance of the module, and any other factor deemed important. The factors are then summed to provide a value n_i, i = 1, ..., n, for each module and a reliability is calculated as in eqn. (5). Table 9 illustrates the method.
Inexperienced designers tend to treat reliability allocation
as a numbers game. Instead, it should be used as a tool for
identifying what problems are likely to occur in the design.
This is especially true of the relative weight technique
which can be used early in the design process as the concept
is being developed. As is evident from Table 9, the auto pilot
will be more complex, require more development time,
operate in a harsher environment, and push the technology
more (more state-of-the-art) than the communications
subsystem. Hence, it is likely to have a lower reliability and
is assigned a lower reliability requirement than the
communications subsystem. The important point in the
analysis is not the lower reliability requirement for the auto
pilot per se, but rather the factors that lead up to it and that
must be considered in its design. The analysis suggests that
adequate time must be allowed to develop the autopilot, that
new and possibly untried technology may be needed in its
construction, and that interactions between the autopilot and
the environment must be carefully assessed. Insights that
identify areas of concern are the important outcomes of the
allocation process rather than the specific numbers produced.
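The relative weight calculation is easy to automate. The following Python sketch reproduces the Table 9 allocation from its factor scores:

```python
import math

# Factor scores from Table 9: (complexity, state of the art,
# time to develop, environment) for each rocket subsystem.
scores = {
    "Fuel":            (6, 5, 10, 5),
    "Auxiliary Power": (5, 4, 8, 5),
    "Communications":  (6, 1, 5, 2),
    "Auto Pilot":      (8, 6, 9, 7),
    "Navigation":      (7, 6, 8, 6),
    "Ecology":         (8, 7, 8, 2),
}

n = {m: sum(w) for m, w in scores.items()}   # module sums n_i
N = sum(n.values())                          # 144
lam_sys_t = -math.log(0.99)                  # lambda_sys * t, about 0.01005
r = {m: math.exp(-(n[m] / N) * lam_sys_t) for m in n}
```

The auto pilot, with the largest module sum (30), receives the lowest allocated reliability (about 0.9979), and the product of all six allocations recovers the 0.99 system requirement.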
5. SYSTEM-LEVEL ANALYSIS TECHNIQUES
Several system analysis techniques allow the analyst to
determine how low-level failures affect the reliability of the
system. The most frequently used of these analysis
techniques are Failure Modes and Effects Analysis (FMEA),
fault tree analysis, and the use of Markov models. FMEA
examines the effect that each failure mode of an item has on
the system; fault tree analysis determines what low-level
failures can cause a given system-level failure; and Markov
models describe the changes in the state of the system
following a device failure.
5.1 Failure Modes and Effects Analysis
Failure modes and effects analysis is one of the most
effective tools available for building a reliable system. It
requires that the designer examine each item in the system,
consider all the ways that the item can fail and either 1)
Table 9. Relative weight reliability allocation for a rocket with an overall reliability of 0.99.

Major Subsystem  | Complexity | State of the Art | Time to Develop | Environment | Module Sum (n_i) | Ratio (n_i/N) | λ_i t     | R_i
Fuel             | 6          | 5                | 10              | 5           | 26               | 0.181         | 0.001815  | 0.998187
Auxiliary Power  | 5          | 4                | 8               | 5           | 22               | 0.153         | 0.001535  | 0.998466
Communications   | 6          | 1                | 5               | 2           | 14               | 0.0972        | 0.0009771 | 0.9990234
Auto Pilot       | 8          | 6                | 9               | 7           | 30               | 0.208         | 0.002094  | 0.997908
Navigation       | 7          | 6                | 8               | 6           | 27               | 0.188         | 0.001884  | 0.998117
Ecology          | 8          | 7                | 8               | 2           | 25               | 0.174         | 0.001745  | 0.998257
Total            |            |                  |                 |             | 144              |               | 0.01005   | 0.99
accept the consequences; 2) find some way to mitigate their
effects on the system; or 3) eliminate the failure altogether. It
provides a basis for recognizing specific component failure
modes identified in component and system prototype tests,
and failure modes developed from historical "lessons learned"
in design requirements. It aids in identifying unacceptable
failure effects that prevent achieving design requirements. It
is used to assess the safety of system components, and to
identify design modifications and corrective actions needed to
mitigate the effects of a failure on the system. It is used in
planning system maintenance activities, subsystem design,
and as a framework for system failure detection and isolation.
In the military, aerospace, nuclear, and other industries where
safety issues are of prime importance, FMECA has become
an essential part of the system safety analysis [15, 16].
Table 10 shows a partial FMEA analysis of the stop valve
for the gas hot water heater in Figure 10. By focusing on how
the system behaves when a component fails, FMEA gives the
system designers deeper insights into the role that each
component plays in the system operation. However,
designers, whose focus is necessarily on creating and implementing the functions that the system must provide, generally find it to be a tedious and discouraging analysis to perform.


Figure 10. Schematic for a domestic hot water heater.
One of the most difficult tasks in performing a FMEA is
to identify component failure modes. Most analysts can
readily identify failure modes such as "open" and "short" for
electrical components but they have difficulty identifying
other types of failure modes that can be broadly characterized as "partial operation". Libraries of potential component failure modes can be very helpful in this regard.
FMEA originated in the aerospace industry in the 1960s
as demands for greater reliability drove studies of component
failures to be broadened to include the effects of the failures
on the systems of which they were a part. Thus, FMEAs were
traditionally done near the end of the design process and the
analysis focused on the physical, piecepart components as in
Table 10. Thus, it could have little impact on the design
except when it revealed a major safety-related failure effect.
More recently, FMEA has been extended to become a more effective tool in the design process.
A functional FMEA focuses on functional failures. These
types of failures can be identified early in the design process
when only a functional description of the system is available.
Functional failures are usually a failure to perform some task
or doing the task incorrectly. Resolution of these types of
failures is accomplished by changing the system
requirements.
An interface FMEA focuses on failures of the interfaces
between the major functional modules of a system. This type
of analysis can be done before the internal design of the
modules has even begun and it can reveal deficiencies in the
module interconnects.
FMEA has also been extended to software where it is
particularly effective on systems, such as microprocessor-based control systems, that have little internal hardware
checking. Finally, FMEA has been applied to the
manufacturing process by which a device is produced.
Computerization has also made the FMEA task more
efficient. System simulations show the effects of failures
more readily than analysis in all but the simplest systems,
libraries of part failure modes ensure that all failure modes are
considered, and groups of failure modes having similar
consequences can sometimes be considered together rather
than repeating the analysis for each one individually [16].
Table 10. Failure mode analysis of hot water heater stop valve.

Component  | Failure Mode                                      | Local Effect                               | System Effect
stop valve | 1) Fails closed                                   | Burner off                                 | No hot water
           | 2) Fails open                                     | Burner won't shut off                      | Overheats, release valve releases pressure, may get scalded
           | 3) Does not open fully                            | Burner not fully on                        | Water heats slowly or doesn't reach desired temperature
           | 4) Does not respond to controller -- stays open   | (same as 2)                                |
           | 5) Does not respond to controller -- stays closed | (same as 1)                                |
           | 6) Leaks through valve                            | Burner won't shut off, burns at low level  | Water overheats (possibly)
           | 7) Leaks around valve                             | Gas leaks into room                        | Possible fire or gas asphyxiation
5.2. Fault Tree Analysis
A fault tree is developed by considering a single,
important system-level failure and then identifying lower-
level failures that can cause that failure either directly or
indirectly or in combination with other failures. By
developing the lower-level failure mechanisms necessary to cause the top-level effect, a total overview of the system is achieved. Once completed, the fault tree allows the system
designer to easily evaluate the effect of low-level changes on
the system safety and reliability.
Beginning with the top-level failure, the fault tree for a
failure mode at a given level is built up from combinations of
subsystem failures at levels lower than that at which the
failure is postulated. If the failure mode can result from any
of several lower level events, it is represented logically as the
OR of those events; if it can result only if all of several lower
level events occur, it is the AND of those events. This build-
up of failures gives a very visual representation of how
failures will propagate in the system. Table 11 shows the
symbols used to represent the logic gates in a fault tree.
Figure 11 shows an example of a partial fault tree for the failure "Box free falls" for a passenger elevator. Observe that the fault tree can include operational causes of failure such as "control unit disengages brake" as well as component failures such as "broken cable", and subsystem failures such as "motor failure".
A fault tree is analyzed by determining the probability of failure for each terminal node and combining these through the AND and OR gates to determine the probability of the top event. The fault tree can also be analyzed to find cut sets: combinations of components whose failures are sufficient to cause the top event.
Recent advances in fault tree technology have enabled the
fault tree to better analyze the types of failures encountered in
computer systems. These include sequence failures and
maintenance operations [17, 18].
Key:
1. Box free falls
2. Cable slips off pulley
3. Holding brake failure
4. Broken cable
5. No holding brake
6. Motor turns free
7. Worn friction material
8. Stuck brake solenoid
9. Control unit disengages brake
10. No power to motor
11. Motor failure

Figure 11. Fault tree representation of the "Elevator box free falls" failure [19].
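For statistically independent basic events, the AND and OR gate combinations can be sketched as below. The tree structure and the event probabilities are a hypothetical reading of Figure 11, not values given in this tutorial:

```python
def p_or(*events):
    """Probability that at least one independent input event occurs."""
    p_none = 1.0
    for q in events:
        p_none *= (1.0 - q)
    return 1.0 - p_none

def p_and(*events):
    """Probability that all independent input events occur."""
    p = 1.0
    for q in events:
        p *= q
    return p

# Hypothetical basic-event probabilities, keyed to the Figure 11 events.
cable_slips, broken_cable = 1e-5, 2e-6
worn_friction, stuck_solenoid, control_disengages = 1e-4, 5e-5, 1e-5
no_power, motor_failure = 1e-3, 1e-4

# One plausible reading of the tree: the box free falls if the cable
# slips, the cable breaks, or the holding brake fails while the motor
# turns free.
no_holding_brake = p_or(worn_friction, stuck_solenoid, control_disengages)
motor_turns_free = p_or(no_power, motor_failure)
holding_brake_failure = p_and(no_holding_brake, motor_turns_free)
box_free_falls = p_or(cable_slips, broken_cable, holding_brake_failure)
```

The AND gate drives the brake-related contribution far below its individual event probabilities, illustrating why redundancy appears as AND gates in a fault tree.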

Table 11. Symbols used in the construction of a fault tree.

Name              | Description                                                                                                                 | Reliability Model                                                                                       | Number of Inputs
Basic Event       | Event that is not further decomposed and for which reliability information is available                                     | Component failure mode, or a failure mode cause                                                         | 0
Conditional Event | Event that is a condition of occurrence of another event when both must occur for the output to occur                       | Occurrence of an event that must occur for another event to occur                                       | 0
Undeveloped Event | A part of the system that has not yet been developed or defined                                                             | A contributor to the probability of failure, but the structure of that system part has not been defined | 0
Transfer Gate     | A gate indicating that the corresponding part of the system fault tree is developed on another page or part of the diagram  | A partial reliability block diagram is shown in another location of the overall system block diagram    | 0
OR Gate           | The output event occurs if any of its input events occur                                                                    | Failure occurs if any of the parts of the system fail (series system)                                   | >= 2
Majority OR Gate  | The output event occurs if m of the input events occur                                                                      | k-out-of-n module redundancy                                                                            | >= 3
Exclusive OR Gate | The output event occurs if one but not both input events takes place                                                        | Failure occurs only if one, but not both, of the two possible failures occur                            | 2
AND Gate          | The output event occurs only if all of the input events occur                                                               | Failure occurs if all of the parts of the system fail (redundant system)                                | >= 2
NOT Gate          | The output event occurs only if the input event does not occur                                                              | Exclusive event or preventative measure does not take place                                             | 1
5.3. Markov Models
Markov models describe a system in terms of a set of
states and transitions between those states [2, 20]. In
reliability models the states usually represent the various
working and failed conditions of the system. Transitions
between states occur as various components fail and as
repairs are made. A Markov model is memoryless in the
sense that transitions between states depend only on the state
that the system is in and not its previous history; in particular,
the probability of a transition from one state to another does
not depend on how the system came to be in its present state
nor on how long it has been in that state. This memoryless
property of a Markov model implies that components in the
model must have a constant failure rate and a constant repair
rate. Figure 12 shows an example of a Markov model for a repairable system. It has two states: S_1 is the working state and S_0 is the failed state. When in state S_1, failures occur at rate λ and take the system to the failed state, S_0. When in state S_0, the system is repaired at rate μ, and a repair restores the system to the working condition. According to this model the system follows a pattern of working, failed, working, failed, etc., with the working state ending in a failure and the failed state ending in a repair.
Markov models are particularly useful for analyzing
repairable systems and calculating reliability measures such
as system availability. For the model in Figure 12 the system availability at any time t, A(t), is the probability it is in state S_1 (working). If it is initially in state S_1 (working) this is

A(t) = μ/(λ + μ) + [λ/(λ + μ)] exp(-(λ + μ)t)

The steady-state availability, A, is

A = μ/(λ + μ)
Figure 12. Markov model of a system with repair.
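Both expressions for the two-state model can be checked numerically with a small sketch; the rate values used here are only illustrative:

```python
import math

def availability(t, lam, mu):
    """A(t) for the two-state model of Figure 12, starting in the
    working state S1: A(t) = mu/(lam+mu) + (lam/(lam+mu))*exp(-(lam+mu)*t)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def steady_state_availability(lam, mu):
    """Long-run fraction of time the system is working."""
    return mu / (lam + mu)
```

With, say, λ = 0.001 failures per hour and μ = 0.1 repairs per hour, A(0) = 1 and A(t) decays toward the steady-state value μ/(λ + μ) ≈ 0.9901.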

A slightly more complex model is shown in Figure 13. It has 3 states for modeling the reliability and availability of a 2-unit redundant system with repair. In this model we assume that the units are identical and have a constant failure rate, λ. If only one unit has failed the system is still operational; if both units fail, the system fails. State S_2 is the state with all units working. State S_1 is the state with one unit failed. Note that this state does not distinguish which unit has failed since the system behavior is the same in both cases. State S_0 is the system failed state; it is entered when both units have failed. The failure rate for the transition from state S_2 to state S_1 is 2λ since both units are working and either can fail; from state S_1 to S_0 the failure rate is λ since there is only one unit left that can fail. The repair rate when one unit has failed, i.e., for transitions from state S_1 back to S_2, is μ_1. When the system has failed (i.e., it is in state S_0) the model assumes that it is repaired at rate μ_2 and that the repair fully restores the system to state S_2.
From this model, it is straightforward, but rather tedious, to find an expression for the system reliability. States S_2 and S_1 are working states; hence the system availability is the probability that the system is in one of those states. The system reliability is the probability that the system has not entered state S_0 by time t.
Figure 13. 3-state Markov model of a 2-unit parallel system with repair.

The number of states and transitions in a Markov model
can be very large for a complex system since every
component failure (or failure mode!) can potentially result in
a different system state. Many software programs are
available for analyzing Markov models.
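As an illustration of how such models are solved, the steady-state probabilities of the 3-state model in Figure 13 follow directly from its balance equations; the rate values below are made up:

```python
def steady_state(lam, mu1, mu2):
    """Steady-state probabilities (P2, P1, P0) for the 3-state model of
    Figure 13.  Balance equations: flow into S1 equals flow out,
    2*lam*P2 = (lam + mu1)*P1, and flow into S0 equals flow out,
    lam*P1 = mu2*P0.  Probabilities are then normalized to sum to 1."""
    p2 = 1.0
    p1 = 2.0 * lam / (lam + mu1) * p2
    p0 = lam * p1 / mu2
    total = p0 + p1 + p2
    return p2 / total, p1 / total, p0 / total

# Hypothetical rates: failures at 0.001/h, repairs at 0.1/h.
p2, p1, p0 = steady_state(0.001, 0.1, 0.1)
steady_availability = p2 + p1   # working states S2 and S1
```

With these rates the steady-state availability exceeds 0.9998, illustrating how redundancy plus repair multiplies the benefit of either alone.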

6. PERSPECTIVES AND FUTURE
CHALLENGES
Reliability has always been an important attribute of any
device for the simple reason that if the product couldn't be
counted upon to function as intended it was of little value.
Reliability began to develop as a discipline with the advent of
the industrial age when formal consideration of reliability
problems began to be undertaken [21]. Design of equipment
for a certain anticipated life also dates back to this time
period. At first efforts to characterize equipment reliability
were applied mainly to mechanical equipment. A classic
example is the studies of the life characteristics of ball and
roller bearings in the early days of railroad transportation.
Electrification brought with it the need to make the
electric power supply much more reliable. This reliability
was achieved through the parallel operation of generators,
transformers, and transmission lines and the interlinking of
high-voltage supply lines into nation-wide power grids. Thus,
the use of redundancy and parallel operation, as well as better
equipment, solved what had once been a very real problem.
Today, the overall reliability of our power supply and
communication networks is truly astonishing.
Aircraft introduced a new set of constraints which made
the reliability problems of airborne equipment more difficult
to solve than those of stationary and land-based equipment.
They also added safety as an important concern. These
problems were solved largely through the intuition and
ingenuity of aircraft designers.
The age of electronics, the age of high-speed jet aircraft,
the age of missiles and spacecraft brought reliability into a
new era. Previously, the reliability problem had been largely
solved by the use of high safety factors, extensive use of
redundancy, and learning from the failures and breakdowns of earlier designs when designing similar equipment.
Safety factors and redundancy added tremendously to the size
and weight of a piece of equipment at a time when aircraft
and missile development were demanding just the opposite so
that thousands of components could be squeezed into small
volumes. At the same time, the rapid progress of technology
nullified the efforts of those who hoped to learn from and correct the mistakes of previous designs in the next one: technological changes meant that the next design had to be
radically different from its predecessor. Very little use could
be made of the experience gained from previous mistakes and
neither time nor money was available for redesign as both had
to be directed to the next project. The earlier intuitive
approach to solving problems and the practice of redesigning
earlier designs, which had previously been used so
successfully, gave way to an entirely new approach to reliability: one that was statistically defined, calculated, and designed.
The computer age has brought about yet another change in
both the way reliability is practiced and in the problems that
must be addressed. Computer based tools now do many of
the mathematical calculations needed to analyze failure data;
and computer simulations permit the timely evaluation of
many alternative designs. Computers have also become an
integral part of almost every type of equipment with any
degree of complexity. Software controls the computer, and
even a small program can be incredibly complex.
Increasingly, simple testing is being seen as an inadequate
approach for providing reliable software. Instead, attention
has been focused on reducing the number of errors in the
development process and reliability tools such as failure
modes and effects analysis, fault tree analysis, and Markov
models have been extended so that they can be applied to
software systems.
In addition, human errors in operating and maintaining
equipment are becoming the predominant cause of failure.
Thus human reliability is emerging as an important area of
study and equipment must be designed so that it not only
functions correctly, but that it is easy to use and maintain
correctly (and difficult to use or maintain incorrectly).
The reliance of equipment on computers and the
connection of those computers into extensive communication
networks have also introduced a new type of challenge for
reliability engineers. The challenge is man-made but very
real nevertheless. Software viruses that can be carried on
programs from one computer to another, software worms that
can move along the connections of a communications
network, logic bombs that can be loaded into the software of a
computer, and Trojan horses that can hide malicious
programs in attractive packages all can cause catastrophic
failures of the computers that manage entire enterprises. So-called "denial of service" attacks and unauthorized access to protected data housed in a computer can also prevent equipment from operating successfully.
Finally, simply defining reliability in the 21st century is a
challenge. How should reliability be defined for an Internet
service? What is a failure for a packet network?
With these new threats and challenges there are many
questions for reliability engineers to ponder and provide
useful answers to.
ACKNOWLEDGEMENT
I would like to thank John Healy from whose Basic
Reliability tutorial [1], given for many years at this
symposium, I borrowed much of the introductory material.
Also, the perspective on the early days of reliability
engineering is based on a similar discussion by Igor
Bazovsky, who wrote one of the first reliability engineering
textbooks [21].

REFERENCES
1. J. D. Healy, "Basic Reliability", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.
2. J. B. Bowles, "Survey of Reliability Prediction Procedures for Microelectronic Devices", IEEE Transactions on Reliability, March 1993, pp. 2-12.
3. F. R. Nash, Estimating Device Reliability: Assessment of Credibility, Kluwer, Boston, 1993.
4. J. B. Bowles, "Simple, approximate, system reliability and availability analysis techniques", Reliability Review, Vol. 20, September 2000, pp. 5-11, 26-27.
5. P. D. T. O'Connor, Practical Reliability Engineering, 3rd ed. revised, Wiley, New York, 1995.
6. R. Roy, A Primer on the Taguchi Method, Van Nostrand Reinhold, NY, 1990.
7. Reliability Prediction Procedure for Electronic Equipment, Mil-Hdbk-217F, December 1991.
8. Reliability Prediction Procedure for Electronic Equipment, TR-NWT-000332, Issue 4, Bellcore, September 1992.
9. S. Pugh, "Quality assurance and design: the problem of cost versus quality", Quality Assurance, Vol. 4, March 1978, pp. 3-6.
10. Electronic Reliability Design Handbook, Mil-Hdbk-338, October 1988, App. B, "Environmental Considerations in Design".
11. D. J. Klinger, Y. Nakada, M. A. Menendez, AT&T Reliability Manual, Van Nostrand Reinhold, 1990.
12. H. A. Chan and T. P. Parker, "Product Reliability Through Stress Testing", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.
13. J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability, Prediction and Measurement, McGraw-Hill, New York, 1987.
14. J. G. Elerath, "Specifying reliability in the disk drive industry: no more MTBFs", Proc. Ann. Reliability and Maintainability Symp., 2000, pp. 194-199.
15. J. B. Bowles, "Failure modes, effects, and criticality analysis", Ann. Reliability and Maintainability Symp. Tutorial Notes, 1999.
16. Society of Automotive Engineers, "Recommended failure modes and effects analysis procedures for non-automobile applications", SAE ARP5580, May 2000 (Draft).
17. L. L. Pullum and J. B. Dugan, "Fault-tree models for the analysis of complex computer-based systems", Proc. Ann. Reliability and Maintainability Symp., 1996, pp. 200-207.
18. J. B. Dugan, "Fault-tree analysis of computer-based systems", Ann. Reliability and Maintainability Symp. Tutorial Notes, 1999.
19. Reliability Analysis Center, Fault Tree Analysis Application Guide, 1990.
20. M. L. Shooman, Probabilistic Reliability: An Engineering Approach, 2nd Ed., Krieger, Malabar, 1990.
21. I. Bazovsky, Reliability Theory and Practice, Prentice-Hall, Englewood Cliffs, 1961.
22. R. F. Drenick, "The failure law of complex equipment", Journal of the Society of Industrial and Applied Mathematics, Vol. 8, December 1960, pp. 680-690.
23. R. H. Salzman, "Understanding Weibull Analysis", Ann. Reliability and Maintainability Symp. Tutorial Notes, 2000.