Renyan Jiang

Introduction to Quality and Reliability Engineering
Springer Series in Reliability Engineering
Series editor
Hoang Pham, Piscataway, USA
More information about this series at http://www.springer.com/series/6917
Renyan Jiang
School of Automotive and Mechanical Engineering
Changsha University of Science and Technology
Changsha, China
Preface
[Figure: added value created through research and development, manufacturing, and brand and services]
The first part consists of six chapters and covers background materials such as product life cycle, basic concepts of quality and reliability, common distribution models in quality and reliability, basic statistical methods for data analysis and modeling, and models and methods for modeling failure point processes.
The second part consists of five chapters and deals with major quality and reliability problems in the product design and development phase. The covered topics include design for X, design for quality, design for reliability, and reliability tests and data analysis.
The third part consists of four chapters and deals with quality and reliability
problems in product manufacturing phase. The covered topics include product
quality variations, quality control at input, statistical process control, and quality
control at output.
The fourth part consists of two chapters and deals with product warranty and
maintenance.
The extra materials consist of three appendices and deal with some important
theories and tools, including multi-criteria decision making analysis techniques,
principal component analysis, and Microsoft Excel, with which a number of real-
world examples in this book can be computed and solved. Exercises for each
chapter are also included in extra materials.
This book is the main outcome of the bilingual teaching program for the course "Quality and Reliability Engineering" supported by the Ministry of Education of China (No. 109, 2010).
The publication of this book was financially supported by the National Natural Science Foundation of China (No. 71071026 and No. 71371035), the Science Publication Foundation of the Chinese Academy of Sciences (No. 025, 2012), and the Academic Work Publication Foundation of Changsha University of Science and Technology, China.
The author would like to thank Prof. D.N. Prabhakar Murthy for his invaluable
support and constructive comments on the earlier outlines and manuscripts of this
book, and thank Profs. Dong Ho Park and Toshio Nakagawa for their comments
and suggestions on the manuscripts of this book.
Contents
1 Overview
1.1 Introduction
1.2 Product and Product Life Cycle
1.2.1 Product
1.2.2 Product Life Cycle
1.2.3 Technology Life Cycle of a Product
1.3 Notions of Reliability and Quality
1.3.1 Product Reliability
1.3.2 Product Quality
1.3.3 Link Between Quality and Reliability
1.4 Objective, Scope, and Focus of this Book
1.5 Outline of the Book
References
3 Fundamentals of Reliability
3.1 Introduction
3.2 Concepts of Reliability and Failure
3.2.1 Reliability
3.2.2 Failure
3.2.3 Failure Mode and Cause
3.2.4 Failure Mechanism
3.2.5 Failure Severity and Consequences
3.2.6 Modeling Failures
3.3 Reliability Basic Functions
3.3.1 Probability Density Function
3.3.2 Cumulative Distribution and Reliability Functions
3.3.3 Conditional Distribution and Residual Life
3.3.4 Failure Rate and Cumulative Hazard Functions
3.3.5 Relations Between Reliability Basic Functions
3.4 Component Bathtub Curve and Hockey-Stick Line
3.5 Life Characteristics
3.5.1 Measures of Lifetime
3.5.2 Dispersion of Lifetime
3.5.3 Skewness and Kurtosis of Life Distribution
3.6 Reliability of Repairable Systems
3.6.1 Failure-Repair Process
3.6.2 Reliability Measures
3.6.3 Failure Point Process
3.7 Evolution of Reliability Over Product Life Cycle
3.7.1 Design Reliability
3.7.2 Inherent Reliability
3.7.3 Reliability at Sale
3.7.4 Field Reliability
3.7.5 Values of Weibull Shape Parameter Associated with Different Reliability Notions
References
4 Distribution Models
4.1 Introduction
4.2 Discrete Distributions
4.2.1 Basic Functions of a Discrete Distribution
4.2.2 Single-Parameter Models
4.2.3 Two-Parameter Models
4.2.4 Hypergeometric Distribution
4.3 Simple Continuous Distributions
4.3.1 Weibull Distribution
4.3.2 Gamma Distribution
4.3.3 Lognormal Distribution
4.4 Complex Distribution Models Involving Multiple Simple Distributions
4.4.1 Mixture Model
4.4.2 Competing Risk Model
4.4.3 Multiplicative Model
4.4.4 Sectional Models
4.5 Delay Time Model
References
1.1 Introduction
1.2.2 Product Life Cycle
The concept of product life cycle (PLC) is different for manufacturers and consumers [4]. From the perspective of the manufacturer, the PLC refers to the phases of a product's life, from its conception, through design and manufacture, to post-sale service and disposal.
On the other hand, from the perspective of the consumer, the PLC is the time from the purchase of a product to its discarding, when it reaches the end of its useful life or is replaced earlier due to either technological obsolescence or the product being no longer of any use. As such, its life cycle involves only the following three phases: acquisition, operation and maintenance, and retirement, which leads to replacement by a new one.
From the perspective of marketing, the PLC involves four phases (see Fig. 1.1):
Introduction phase with low sales
Growth phase with rapid increase in sales
Maturity phase with large and nearly constant sales, and
Decline phase with decreasing sales and eventually withdrawing from the
market.
It is desirable to keep the maturity period going as long as possible. However, PLCs get shorter and shorter due to rapid technological change, global markets and multiple-vendor environments, fierce competition and partnership (or alliance) environments, and ever-increasing customer expectations.
[Fig. 1.1 The four marketing phases of the product life cycle: sales versus time]
1.3 Notions of Reliability and Quality
How to measure and test reliability in different stages; this refers to reliability assessment and reliability testing.
How to keep systems reliable; this refers to maintenance, fault diagnosis, and prognosis.
Garvin [1] proposes the following five criteria for defining the notion of quality:
(1) Judgmental criteria. Here, quality is associated with something universally recognizable as a mark of high standard, achievement, or degree of excellence, and hence is called the transcendent definition.
(2) Product-based criteria. Here, quality is defined in terms of some measurable variable such as the acceleration of a car, the efficiency of an engine, or the like.
(3) User-based criteria. Here, quality is defined through fitness for intended use. For example, the user-based quality for a car may be smoothness of the ride, ease of steering, etc.
(4) Value-based criteria. Here, quality is linked to the price of the product and its usefulness or satisfaction.
(5) Manufacturing-based criteria. Here, quality is defined in terms of manufactured items conforming to the design specification. Items that do not conform either need some rectification action to make them conform or need to be scrapped.
Product quality involves many dimensions. Garvin [1] suggests the following eight quality dimensions:
(1) Performance. This characterizes the primary operating characteristics or specific functions of the product. For a car, it can include acceleration, braking distance, engine efficiency, pollution emissions generated, and so on.
(2) Features. These are the special or additional features of a product. For
example, for a car, the features include air conditioner, cruise control, or the
like.
(3) Aesthetics. This deals with issues such as appearance, feel, sound, and so on.
For a car, the body design and interior layout reflect the quality in this sense.
(4) Reliability. This is a measure of the product performing satisfactorily over a specified time under stated conditions of use. Simply speaking, it reflects how often the product fails.
(5) Durability. This is an indicator of the time interval after which the product has deteriorated sufficiently that it is unacceptable for use. For a car, it may correspond to corrosion affecting the frame and body to such a level that the car is no longer safe to drive.
(6) Serviceability. This deals with all maintenance related issues, including fre-
quency and cost of maintenance, ease of repair, availability of spares, and so on.
(7) Conformance. This indicates the degree to which the physical and perfor-
mance characteristics meet some pre-established standards (i.e., design
requirements).
(8) Perceived quality. This refers to the perceptions of buyers or potential buyers. This impression is shaped by several factors such as advertising, the reputation of the company or product, consumer reports, etc.
A customer-driven concept of quality defines product quality as the collection of features and characteristics of a product that contribute to its ability to meet or exceed customers' expectations or given requirements. Here, quality characteristics are the parameters that describe the product quality. Excessive variability in critical quality characteristics results in more nonconforming products or waste, and hence the reduction of variability in products and processes results in quality improvement. In this sense, quality is sometimes defined as inversely proportional to variability [2].
Product quality deals with quality of design and quality of conformance. Quality of design means that products can be produced at different levels of quality, and these differences are achieved through design. Quality of conformance means that the product conforms to the specifications required by the design, and this is influenced by many factors such as the manufacturing processes and the quality-assurance system.
Quality engineering is the set of operational, managerial and engineering
activities used to ensure that the quality characteristics of a product are at the
required levels.
Due to variability in the manufacturing process, some items produced may not meet the design specification; such items are called nonconforming. The performance of nonconforming items is usually inferior to that of conforming items. As a result, nonconforming items are less reliable than conforming items in terms of reliability measures such as mean time to failure.
In a broad sense, reliability is one of the quality dimensions and is usually termed time-oriented quality or quality over time (e.g., see Ref. [6]). However, quality is different from reliability in a narrow sense. This can be explained by looking at quality and reliability defects [5]. Quality defects usually deal with deficient products (or components) or incorrectly assembled sets, which can be identified by inspection against component drawings or assembly specifications. In this sense, quality is expressed in percentages. On the other hand, reliability defects generally deal with failures of a product in the future, after it has been working well.
1.4 Objective, Scope, and Focus of this Book
Traditionally, quality and reliability belong to two closely related disciplinary fields. A main objective of this book is to provide a comprehensive presentation of these two fields in a systematic way. We discuss typical quality and reliability problems over the PLC, and present the models, tools, and techniques needed for modeling and analyzing these problems. The focus is on concepts, models, and techniques of quality and reliability in the context of design, manufacturing, and operation of products.
The book serves as an introductory textbook for senior undergraduate or graduate students in engineering and management fields such as mechanical engineering, manufacturing engineering, industrial engineering, and engineering management. It can also be used as a reference book for product engineers and researchers in the quality and reliability fields.
1.5 Outline of the Book

The book comprises four parts and three appendices. Part I (comprising six chapters) deals with the background materials with focus on relevant concepts, statistical modeling, and data analysis. Part II (comprising five chapters) deals with product quality and reliability problems in the pre-manufacturing phase; Part III (comprising four chapters) and Part IV (comprising two chapters) deal with product quality and reliability problems in the manufacturing phase and post-manufacturing phase, respectively. A brief description of each chapter or appendix is as follows.
This chapter provides an overview of the book. It deals with basic notions of quality and reliability, their importance in the context of product manufacturing and operation engineering, and the scope and focus of the book. Chapter 2 discusses typical quality and reliability problems in each phase of the PLC. Chapter 3 presents the fundamentals of reliability, including basic concepts, reliability basic functions, and various life characteristics and measures. Chapter 4 presents common distribution models widely used in the quality and reliability fields, and Chap. 5 discusses statistical methods for lifetime data analysis with focus on parameter estimation and model selection for lifetime distribution models. Chapter 6 presents models and methods for modeling failure processes, including counting process models, variable-parameter distribution models, and hypothesis tests for trend and randomness.
The above six chapters constitute Part I of the book, which provides the background materials. The following five chapters constitute Part II and focus on major quality and reliability problems in the design and development phase of a product.
References
1. Garvin DA (1988) Managing quality: the strategic and competitive edge. The Free Press, New York
2. Montgomery DC (2007) Introduction to statistical quality control, 4th edn. Wiley, New York
3. Murthy DNP (2010) New research in reliability, warranty and maintenance. In: Proceedings of the 4th Asia-Pacific international symposium on advanced reliability and maintenance modeling, pp 504–515
4. Murthy DNP, Rausand M, Østerås T (2009) Product reliability: specification and performance. Springer, London
5. Ryu D, Chang S (2005) Novel concepts for reliability technology. Microelectron Reliab 45(3–4):611–622
6. Yang K, Kapur KC (1997) Customer driven reliability: integration of QFD and robust design. In: Proceedings of 1997 annual reliability and maintainability symposium, pp 339–345
7. Zio E (2009) Reliability engineering: old problems and new challenges. Reliab Eng Syst Saf 94(2):125–141
Chapter 2
Engineering Activities in Product Life Cycle
2.1 Introduction
In this chapter, we discuss the main engineering activities in each phase of the product life cycle (PLC) with focus on quality and reliability activities. The purpose is to set the background for the following chapters of this book.

From the manufacturer's perspective, the life cycle of a product can be roughly divided into three main phases: pre-manufacturing phase, manufacturing phase, and post-manufacturing phase. We discuss the main engineering activities in these phases in Sects. 2.2–2.4, respectively. An approach to solving quality and reliability problems is presented in Sect. 2.5.
2.2 Engineering Activities in Pre-manufacturing Phase

The pre-manufacturing phase starts with identifying a need for a product, proceeds through a sequence of design and development activities, and ends with a prototype of the product. This phase can be further divided into two stages: the front-end (or feasibility) stage and the design and development stage. The main activities in these two stages are outlined below [8].
The front-end stage mainly deals with product definition. Specifically, this stage defines the requirements of the product, its major technical parameters, and its main functional aspects, and carries out the initial concept design. The main activities include generation and screening of ideas, product definition, project planning, and project definition review.
Once the need for a product is identified, a number of ideas are generated and screened to decide which of them to pursue further. The screening deals with answering questions such as whether the idea is consistent with the strategic focus of the company, whether the market size, growth, and opportunities are attractive, whether the product can be developed and produced, and whether there are issues that may make the project fail.
Product definition states what characteristics the product should have in order to meet the business objectives and customer needs. It first translates feasible ideas into technically feasible and economically competitive product concepts, and then produces the product concept through concept generation and selection. Two commonly used techniques to decide the best design candidate are design-to-cost and life-cycle-cost analyses. Design-to-cost aims to minimize the unit manufacturing cost, whose cost elements include the costs of design and development, testing, and manufacturing. Life-cycle-cost analysis considers the total cost of acquisition, operation, maintenance, and discarding, and is used for expensive products.
Project planning deals with planning the remainder of the new product development project in detail, including time and resource allocation, scheduling of tasks, and so on. A final review and evaluation of the product definition and project plan is conducted to decide whether to commit potentially extensive resources to a full-scale development project.
The design and development stage starts with the detail design of the product's form, then progresses to prototype testing and design refinement through a test-analysis-and-fix (TAF) iterative process, and eventually ends with full product launch.
The initial efforts of the design stage aim to arrive at an optimal product architecture. The product architecture is the arrangement of the functional elements of a product into several physical building blocks (e.g., modules), including the mapping from functional elements to physical components and the specification of interfaces among interacting physical components. Establishing the product architecture requires conducting functional decomposition and defining the functional relationships between assemblies and components.
Once the product architecture is established, the design process enters the detail design stage. In this stage, the forms, dimensions, tolerances, materials, and surface properties of all individual components and parts are specified, and all the drawings and other production documents (including the transport and operating instructions) are produced.
The detail design involves a detailed analysis of the initial design. Based on this analysis, the design is improved, and the process is repeated until the analysis indicates that the performance requirements are met.

The detailed analysis involves simultaneously considering various product characteristics such as reliability, maintainability, availability, safety, supportability, manufacturability, quality, life cycle cost, and so on. Design for these characteristics is further discussed in Chap. 7.
Design for quality is an integrated design technique for ensuring product quality. It starts with an attempt to understand the customers' needs. Then, the House of Quality is used to transform the customer needs into the technical requirements or engineering specifications of the product in the concept design stage, and quality function deployment (QFD) is used to determine more specific requirements in the detail design stage. The Taguchi method can be used to determine important design parameters. These techniques are discussed in detail in Chap. 8.
Design for reliability involves a number of reliability related issues, including
reliability allocation and prediction. Reliability allocation is the process to deter-
mine the reliability goals of subsystems and components based on the system
reliability goal, which includes the system-level reliability, maintainability, and
availability requirements (e.g., mean time between failures, mean time to repair,
mean down time, and so on). Reliability prediction is a process used for estimating
the reliability of a design prior to manufacturing and testing of produced items.
These are discussed in detail in Chap. 9.
The development stage deals with component and product prototype testing. The purpose is to refine the design. Using the TAF cycle, the initial design is revised and improved to meet design requirements and specifications. The reliability activities involved in the development stage fall into the following three categories:
Reliability assessment,
Development tests, and
Reliability improvement.
Reliability assessment is basically concerned with evaluation of the current
reliability during the development process. It can be at any level from system down
to component. Reliability assessment requires test data from carefully designed
experiments and statistical analysis to estimate the reliability.
Development tests are carried out during the development stage to assess and improve product reliability. Some of the tests carried out during the product development stage are as follows:
14 2 Engineering Activities in Product Life Cycle
Testing to failure. This can be carried out at any level, and each failure is analyzed and fixed.
Environmental and design limit testing. These tests are carried out at the extreme conditions of the product's operating environment (including worst-case operating conditions). All failures resulting from the tests are analyzed through root-cause analysis and fixed through design changes.
Accelerated life testing. This involves putting items on test under conditions that
are far more severe than those normally encountered. It is used to reduce the
time required for testing.
Testing involves additional costs that depend on the type of tests, the number of items tested, and the test duration. On the other hand, more testing effort results in better estimates of reliability, and this in turn leads to better decision making. As a result, the optimal testing effort must be based on a tradeoff between the testing costs and the benefits derived through more accurate assessment of reliability. These issues are further discussed in Chap. 10.
Reliability improvement can be achieved through stress-strength analysis,
redundancy design, reliability growth through a development program, and pre-
ventive maintenance (PM) regime design.
Stress-strength analysis assumes that both the strength of a component and the
stress applied to the component are random variables that are characterized by two
distribution functions, from which the probability of failure can be derived.
Different designs can have different distributions of stress and strength and hence
different reliabilities. As such, the stress-strength analysis can be used for the
purpose of reliability improvement.
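As a simple illustration of this idea, suppose the stress and the strength are independent and normally distributed; the failure probability then has a closed form. The sketch below is a minimal one using only the Python standard library, and the means and standard deviations are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical normal stress and strength (arbitrary units).
mu_stress, sd_stress = 300.0, 30.0
mu_strength, sd_strength = 400.0, 40.0

# Failure occurs when stress exceeds strength; for independent normals,
# stress - strength is normal, so Pr(failure) = Phi(standardized margin).
z = (mu_stress - mu_strength) / math.hypot(sd_stress, sd_strength)
print(f"failure probability = {std_normal_cdf(z):.4f}")   # ~0.0228 here
```

A design with a higher mean strength or lower scatter in either distribution shifts z downward and reduces the failure probability, which is exactly how stress-strength analysis guides reliability improvement.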
Redundancy design involves using a multi-component module to replace a single component. The reliability and cost of the module increase with the number of components and depend on the type of redundancy. Three different types of redundancy are hot standby, cold standby, and warm standby. In hot standby, several identical components are connected in parallel and work simultaneously. The module fails when all the components fail. As a result, the module lifetime is the largest of the component lifetimes. In cold standby, only one component is in use at any given time. When it fails, it is replaced by a working component (if available) through a switching mechanism. If the switch is perfect and the components do not degrade when not in use, the module lifetime is the sum of the lifetimes of all the components of the module. In warm standby, one component works in a fully loaded state and the other components work in a partially loaded state. A component in the partially loaded state has a longer expected life than a component in the fully loaded state. As a result, the warm standby module has a longer expected life than the hot standby module when the other conditions are the same.
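To make the comparison concrete, here is a small Monte Carlo sketch contrasting hot and cold standby for a module of identical components with exponential lifetimes. The failure rate and module size are hypothetical, and perfect switching with no degradation in storage is assumed for the cold standby case.

```python
import random
import statistics

random.seed(1)
lam, k, runs = 0.01, 3, 20_000   # component failure rate, module size, trials

hot, cold = [], []
for _ in range(runs):
    lives = [random.expovariate(lam) for _ in range(k)]
    hot.append(max(lives))    # hot standby: module fails when all components fail
    cold.append(sum(lives))   # cold standby: components are used one after another

print(f"hot standby mean life:  {statistics.mean(hot):7.1f}")   # ~ (1 + 1/2 + 1/3)/lam
print(f"cold standby mean life: {statistics.mean(cold):7.1f}")  # ~ k/lam
```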
Reliability growth involves research and development effort where the product is
subjected to an iterative TAF process. During this process, an item is tested for a
certain period of time or until a failure occurs. Based on a root-cause failure
analysis for the failures observed during the test, design and/or engineering modifications are made to improve the reliability. The process is repeated until the reliability reaches a certain required level. The reliability growth process also deals
with the reliability prediction of the system based on the test data and design
changes using various reliability growth analysis techniques, which are discussed in
detail in Chap. 11.
Finally, the field failure probability can be considerably reduced by implementing a well-designed PM regime, which specifies various PM activities in a systematic and comprehensive way. The maintenance related concepts and issues are discussed in Chaps. 16 and 17.
2.3 Engineering Activities in Production Phase

Three main activities involved in the production phase are production system design, production system operation, and quality control for operations. In this section, we first introduce various types of production systems and then discuss these three main activities.
Based on production volume and production variety (i.e., number of different types
of products produced), the production system varies from factory to factory and
from product to product. Three common types of production systems are job shop
production, batch production, and mass production. We briefly discuss them below.
The job shop production system is characterized by low production volume and high production variety. Production equipment is mostly general purpose and flexible, to meet specific customer orders, and highly skilled labor is needed to operate such equipment.
Flexible manufacturing systems (FMS) have been widely used in job shop production. The main components of an FMS are computer numerical controlled (CNC) machine tools, robots, an automated material handling system, an automated storage and retrieval system, and computers or workstations. An FMS can be quickly configured to produce a variety of products with changeable volume and mix on the same system. However, it is complex as it is made up of various different techniques, expensive as it requires a substantial investment of both time and resources, and of low throughput. For more details about the FMS, see Ref. [1].
[Fig. 2.1 Batch production: set-up, processing, and wait periods of successive batches over time]
Batch production is suited for medium-volume lots with moderate product variety. In batch production, the production order is repeated at regular intervals, as shown in Fig. 2.1. Generally, production equipment is general purpose but suitable for high production volume, and specially designed jigs and fixtures are usually used to reduce the setup time and increase the production rate. The required skill level of labor is reasonably high, though it may be lower than in job shop production.
Mass production is suited for large production volume and low production variety
with low cost per produced unit. The mass production process is characterized by
Mechanization to achieve high volume;
Elaborate organization of materials flow through various stages of
manufacturing;
Careful supervision of quality standards; and
Minute division of labor.
Mass production usually takes the form of continuous production or line production. Line production is a machining system designed for producing a specific part type at high volume and low cost. Such production lines have been widely used in the automotive industry.
Two key elements in the production phase are obtaining raw materials and converting raw materials into products (including the manufacture and assembly of components as well as the assembly of assemblies). Raw materials and some components and assemblies of a product are usually obtained from external suppliers, which form a complex network termed the supply chain. Supply chain design deals with a variety of decisions, including supplier selection, transportation modes, inventory management policies, and so on. The various options form many combinations, and each combination has different cost and performance. Given various choices along the supply chain, supply chain design aims to select the options so as to minimize the total supply chain cost.
One key problem with supply chain design is to appropriately select suppliers.
Supplier selection is a multi-criteria decision making (MCDM) problem, which
involves many criteria such as quality, price, production time and direct cost added,
transportation, warehouse, and so on. Many methods have been developed for
solving the MCDM problems, and the main methods are presented in Online
Appendix A.
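As a hedged illustration of one common MCDM technique, simple additive weighting, the sketch below scores hypothetical suppliers on criteria normalized to [0, 1]; the criteria, weights, and scores are invented for the example, and the methods actually used in this book are those presented in Online Appendix A.

```python
# Simple additive weighting for supplier selection (illustrative data only).
# Criterion values are assumed already normalized to [0, 1], larger is better
# (cost-type criteria such as price are converted beforehand).
weights = {"quality": 0.4, "price": 0.3, "delivery": 0.2, "service": 0.1}
suppliers = {
    "A": {"quality": 0.9, "price": 0.6, "delivery": 0.7, "service": 0.8},
    "B": {"quality": 0.7, "price": 0.9, "delivery": 0.8, "service": 0.6},
    "C": {"quality": 0.8, "price": 0.7, "delivery": 0.9, "service": 0.7},
}

scores = {s: sum(weights[c] * vals[c] for c in weights) for s, vals in suppliers.items()}
best = max(scores, key=scores.get)
print(scores, "-> select supplier", best)
```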
Once suppliers are selected, they will be managed through the activities of
several production functions (groups or departments), which include quality,
manufacturing, logistics, test, and so on. For details on supply chain design, see
Refs. [2, 6, 7].
There are many design parameters for a manufacturing system, such as number of
flow paths, number of stations, buffer size, overall process capability, and so on.
These depend on production planning, and further depend on process planning,
tolerance analysis, and process capability indicators.
Process planning determines the steps by which a product is manufactured. A
key element is setup planning, which arranges manufacturing features in a sequence
of setups that ensures quality and productivity.
In product design, the tolerance analysis deals with tolerance design and allocation for each component of the product. In production system design, the tolerance analysis deals with the design and allocation of manufacturing tolerances, which guide the selection of manufacturing processes.
18 2 Engineering Activities in Product Life Cycle
Different from the process capability indices that measure a specific process's capability, process capability indicators attempt to predict a proposed production system's performance. By identifying key drivers of quality in the production system, these indicators can serve as guidelines for designing production systems for quality.
An important step in production system design is system layout. The system layout
impacts manufacturing flexibility, production complexity, and robustness.
Manufacturing flexibility is the capability of building several different products
in one system with no interruption in production due to product differences.
Manufacturing flexibility allows mass customization and high manufacturing uti-
lization. There exists a certain complex relation between flexibility and quality, and
use of robots can improve both flexibility and quality.
Production systems become more and more complex due to the demand for more
product functionality and variety. The manufacturing complexity is characterized
by the number of parts and products, the types of processes, and the schedule
stability. Generally, complexity negatively impacts manufacturing performance
measures, including quality.
Robustness deals with the capability against process drift and fluctuations in
operations. The process drift will lead to producing defective parts. Different
equipment and inspection allocation can have different machine processing time
and defective part arrival rate, and have different yields and drift rates. Sensitivity
analyses can be conducted to examine their interrelations for different design
candidates. The fluctuations in operations result from uncertain or inaccurate system
parameters and can damage product quality. Robust production system design aims
to minimize this damage.
Product production includes three elements: inputs (i.e., materials and the labor of operators), processes, and outputs (i.e., finished products). Techniques to control product quality have evolved over time and can be divided into the following four approaches:

Creating standards for producing acceptable products. This focuses on quality testing at the output end of the manufacturing process.
Statistical quality control, including acceptance sampling with focus on the input end of the manufacturing process, as well as statistical process control with focus on the manufacturing process itself.
Total production systems for achieving quality at minimum cost. This focuses on the whole production system, from raw materials to finished product, through research and development.
Meeting the concerns and preferences of consumers. This focuses on consumers' needs and involves the whole PLC.
As seen, the approach to product quality has evolved from focusing on quality testing and control to focusing on quality assurance and improvement. In other words, the focus has gradually moved from the downstream of the PLC toward the upstream. This is because fixing a product quality problem upstream is much more cost-effective than fixing it downstream.
Quality control techniques can be divided into two categories: quality control for product quality design and improvement, and quality control for production systems. The techniques in the second category include quality testing and statistical quality control, and the techniques in the first category include several basic design approaches. These are further discussed below.
Basic design approaches for design and improvement of product and process
include QFD, design of experiments (DOE), and failure mode and effects analysis
(FMEA). We briefly discuss these issues here and further details are presented in
Chaps. 7 and 8.
QFD has been widely applied to both product design and production planning. It first translates customer requirements into product attributes for the purpose of product design, and then further translates the product attributes into production process requirements to provide guidelines for the design of the production process and the design of the quality control process.
The DOE approach was developed by Taguchi [9] for the parametric design of products. The basic idea is to optimally select the combination of controllable (or design) parameters so that the output performance is insensitive to uncontrollable factor variation (or noise). The optimization is based on the data from a set of designed experiments.
20 2 Engineering Activities in Product Life Cycle
Quality control in process deals with quality inspection planning and statistical process control. We first look at inspection planning, which deals with quality inspection in production systems. The principal issues in inspection planning include quality failures, quality inspection, the actions that may be taken in response to inspection, and measures of system performance.
The design variables of a quality inspection system include the number and locations of inspection stations, inspection plans (e.g., full inspection or sampling), and corrective actions (e.g., rework, repair, or scrapping). The number and location of inspection stations depend on both the production system and the quality control system; the main influencing factors include the system layout and the type of production system; and the design constraints can be inspection time, average outgoing quality limit, or budget limit.
When some controllable factors significantly deviate from their nominal values, the state of the production process changes from in control to out of control. If the change is immediately detected, then the state can be brought back to in control at once, so as to avoid the situation where many nonconforming items are produced. The process control methods depend on the type of manufacturing system.
In the case of batch production, a process control technique is to optimize the
batch size. The expected fraction of nonconforming items increases and the setup
cost per item decreases as the batch size increases. As a result, an optimal batch size
exists to make the unit manufacturing cost minimal.
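The trade-off can be illustrated with a minimal numeric sketch. The cost model below is hypothetical (a setup cost spread over the batch plus a penalty for nonconforming items whose expected fraction grows linearly with batch size); it is intended only to show that the unit cost has an interior minimum.

```python
# Unit manufacturing cost versus batch size Q (hypothetical cost model):
# setup cost is spread over the batch, while the expected nonconforming
# fraction grows linearly with Q.
setup_cost = 500.0                 # cost per setup
defect_cost = 20.0                 # cost per nonconforming item
base_frac, slope = 0.01, 0.0002    # nonconforming fraction = base + slope * Q

def unit_cost(q):
    return setup_cost / q + defect_cost * (base_frac + slope * q)

best_q = min(range(50, 2001), key=unit_cost)
print(f"optimal batch size ~ {best_q}, unit cost = {unit_cost(best_q):.3f}")
```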
In the case of continuous production, a main process control technique is the use of control charts to monitor product quality and detect process changes. It involves taking small samples of the output periodically and plotting the sample statistics (e.g., the mean, the spread, the number or fraction of defective items) on a chart. A significant deviation in the statistics is more likely to be the result of a change in the state of the process. When this occurs, the process is stopped and the controllable factors that have deviated are restored to their nominal values before the process is put back into operation.
The cost of quality and the accuracy of state prediction depend on the inspection policy, the nature and duration of the testing involved, as well as the control limits. The design parameters of the inspection policy include the sampling frequency and sample size. The inspection policy impacts not only quality but also productivity. This is because normal production may be interrupted when a control chart generates an out-of-control signal, which can be either an indication of a real quality problem or a false alarm. Generally, reducing the number of controls leads to better productivity. Further discussion on control charts is presented in Chap. 14.
Quality control at output deals with the quality inspection and testing of produced items to detect nonconforming (nonfunctional or inferior) items and weed them out before the items are released for sale. For nonfunctional items, testing takes very little time; but for inferior items, the testing can take a significant length of time. In either case, testing involves additional costs, and the cost of testing per unit is an increasing function of the test period. As such, the testing design needs to achieve a tradeoff between detection accuracy and test effort (i.e., time and cost).
For electronic products, the manufactured items may contain defects, and the defects can be patent or latent. Environmental stress screening (ESS) can be effective in forcing the latent defects to fail, and burn-in can be used to detect the items with patent defects. Burn-in involves testing the item for a period of length s. Those items that fail during testing are scrapped or repaired. The probability that an item is conforming after the burn-in increases with s. As such, the reliability of the item population is improved through burn-in, but this is achieved at the expense of the burn-in cost and a useful life loss of s. A number of models have been developed to find the optimal testing scheme. These are discussed in Chap. 15.
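A minimal sketch of the burn-in trade-off, assuming an item population whose life follows a Weibull distribution with shape parameter below one (a decreasing failure rate) and illustrative parameter values: the reliability over a fixed mission improves with the burn-in period s, at the cost of s units of useful life.

```python
import math

def weibull_rel(t, beta=0.7, eta=1000.0):
    """Weibull reliability; shape < 1 gives a decreasing failure rate."""
    return math.exp(-((t / eta) ** beta))

mission = 500.0
for s in (0.0, 50.0, 100.0, 200.0):
    # Reliability over the mission for an item that survived a burn-in of length s.
    r = weibull_rel(s + mission) / weibull_rel(s)
    print(f"burn-in s = {s:5.0f}: mission reliability = {r:.4f}")
```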
2.4 Engineering Activities in Post-manufacturing Phase

The post-manufacturing phase can be further divided into three stages: marketing, post-sale support, and retirement. We discuss the main activities in each of these stages below.
Standard products involve a marketing stage; there is no such stage for custom-built products. Marketing deals with issues such as the logistics of getting the product to markets, sale price, promotion, warranty, channels of distribution, etc. To address these issues, one needs to respond to external factors such as competitors' actions, the economy, customer response, and so forth.
The support service is necessary to ensure satisfactory operation of the product, and can add value to the product from both the manufacturer's perspective (e.g., direct value in the initial sale of the product) and the customer's perspective (e.g., extending the life cycle, postponing product replacement, etc.). The support service includes one or more of the following activities:

Providing spare parts, information, and training
Installation
Warranties
Maintenance and service contracts
Design modication and customization.
Among these activities, warranties and maintenance are two major issues. Here, we
briefly discuss these two issues and more details are presented in Chaps. 16 and 17.
Warranty is an assurance that the manufacturer offers to the buyer of its product, and may be considered a contractual agreement between the buyer and the manufacturer (or seller). It specifies both the performance to be expected and the redress available to the buyer if a failure occurs or the performance is unsatisfactory. Usually, the manufacturer repairs or replaces the items that do not perform satisfactorily, or refunds a fraction or the whole of the sale price. Three important issues associated with product warranty are warranty policy, warranty servicing cost analysis, and warranty servicing strategy (e.g., repair or replace).
Maintenance comprises the actions taken to control the deterioration process leading to failure of a system, or to restore the system to its operational state through corrective actions after a failure. As such, maintenance can be broadly divided into two categories: PM and corrective maintenance (CM). Two important issues for the manufacturer of a product are maintainability and serviceability design, and the development of an appropriate PM program. The program will include various PM actions with different intervals or implementation rules for the components and assemblies of the product.
Carrying out maintenance involves additional costs to the buyer and is worthwhile only if the benefits derived from such actions exceed the costs. This implies that maintenance must be examined in terms of its impact on the system performance. For more details about maintenance, see Ref. [5].
Defective or retired products may be returned to the manufacturer, who can profit from the returns through recycling, refurbishing, or remanufacturing. These options differ significantly in process and in the performance of the resulting product.
Recycling is a process that involves disassembling the original product and reusing components in other ways; none of the original value is preserved. Recycling often discards many of the parts, uses large amounts of energy, and creates much waste and environmental burden.

Refurbishing is the servicing and/or renovation of older or damaged equipment to bring it to a workable or better-looking condition. A refurbished product is usually in worse condition than a new one.
Remanufacturing is the process of disassembly and recovery. In remanufacturing, the entire product is taken apart; all parts are cleaned and inspected; defective parts are repaired or replaced; and the product is reassembled and tested. As such, a remanufactured product can be close to a new product in condition and performance.
2.5 Approach to Solving Quality and Reliability Problems

Modern manufacturing deals with not only technical aspects but also commercial and managerial aspects. All these aspects need to be properly coordinated. Effectively managing product quality and reliability requires solving a variety of problems. These include:
deciding the reliability of a new product,
ensuring a certain level of quality of the product,
assessing the quality and reliability of the current products being manufactured, and
improving the reliability and quality of the current product.
Solving these problems generally involves the following four steps:

Step 1: Identify and clearly define a real-world problem.
Step 2: Collect the data and information needed for developing a proper model to assist the decision-making process.
Step 3: Develop the model for solving the problem.
Step 4: Develop the necessary tools and techniques for analyzing the model and solving the problem.
This approach can be jointly implemented with the plan-do-check-action (PDCA) management cycle (e.g., see Ref. [4]). Here, "Plan" deals with establishing the objectives and processes necessary to produce the expected output, "Do" means implementing the plan, "Check" deals with studying the actual results and comparing them with the expected ones, and "Action" means taking corrective actions (including adjustments) on significant differences between actual and expected results. The PDCA cycle is repeatedly implemented so that the ultimate goal is gradually approached.
References
1. El-Tamimi AM, Abidi MH, Mian SH et al (2012) Analysis of performance measures of flexible manufacturing system. J King Saud Univ Eng Sci 24(2):115–129
2. Farahani RZ, Rezapour S, Drezner T et al (2014) Competitive supply chain network design: an overview of classifications, models, solution techniques and applications. Omega 45(C):92–118
3. Inman RR, Blumenfeld DE, Huang N et al (2013) Survey of recent advances on the interface between production system design and quality. IIE Trans 45(6):557–574
Chapter 3
Fundamentals of Reliability

3.1 Introduction
This chapter introduces reliability basic concepts, basic functions, and various life
characteristics and measures. We also discuss the evolution of product reliability in
different phases of the product life cycle.
The outline of the chapter is as follows. We start with a brief discussion of basic
concepts of reliability in Sect. 3.2. Reliability basic functions are presented in
Sect. 3.3, the bathtub failure rate curve is discussed in Sect. 3.4, and life charac-
teristics are presented in Sect. 3.5. Failure processes and characteristics of repair-
able systems are introduced in Sect. 3.6. Evolution of reliability over product life
cycle is discussed in Sect. 3.7.
3.2 Concepts of Reliability and Failure

3.2.1 Reliability
Reliability is the probability that an item performs specific functions under given conditions for a specified period of time without failure. This definition contains four elements. First, reliability is a probability of no failure, and hence it is a number between zero and one. The probability element of reliability allows us to calculate reliabilities in a quantitative way. The second element of the reliability definition deals with function and failure, which are two closely linked terms. A failure means that a device cannot perform its function satisfactorily. There are several concepts of failure, and these are further discussed later. Third, reliability depends on operating conditions. In other words, a device may be reliable under given conditions but unreliable under more severe conditions. Finally, reliability usually varies with time, so the time to failure becomes a primary random variable. However, the time element of reliability is not applicable for one-shot devices such as automobile air-bags and the like. In this case, reliability may be defined as the proportion of the devices that will operate properly when used.
3.2.2 Failure
Failure can be any incident or condition that causes an item or system to be unable
to perform its intended function safely, reliably and cost-effectively. A fault is the
state of the product characterized by its inability to perform its required function.
Namely, a fault is a state resulting from a failure [1].
Some failures last only for a short time and are termed intermittent failures, while other failures continue until some corrective action rectifies them. Such failures are termed extended failures. Extended failures can be further divided into complete and partial failures. A complete failure results in total loss of function, while a partial failure results in partial loss of function. According to whether a failure occurs with warning or not, extended failures can also be divided into sudden and gradual failures. A complete and sudden failure is called a catastrophic failure, and a gradual and partial failure is called a degraded failure.
Engineering systems degrade with time and usage. Figure 3.1 displays a plot of the degradation amount (denoted as $D(t)$) versus time. Reliability-centered maintenance (RCM [2]) calls it the P-F curve. Here, the point P is called the potential failure, where the item has an identifiable defect or the degradation rate changes quickly. If the defect or degradation continues, the potential failure will evolve into a functional failure (i.e., the performance is lower than the required standard) at time point F. The time interval between these two points is called the P-F interval.

A failure can be self-announced (e.g., the failure of light bulbs) or non-self-announced (e.g., the failure of protective devices). In the case where the failure is not self-announced, it can be identified only by an inspection. Such a failure is called a hidden failure.
[Fig. 3.1 The P-F curve: degradation D(t) versus time t, with the potential failure point P, the functional failure point F, and the P-F interval between them]
3.2.3 Failure Mode and Cause
A failure mode is a description of a fault whereby the failure is observed, or the way
in which the failure happens. It is possible to have several causes for the same
failure mode. Knowledge of the cause of failure is useful in the prevention of
failures. A classication of failure causes is as follows:
Design failure due to inadequate design.
Weakness failure due to weakness in the system so that it is unable to withstand
the stress encountered in the normal operating environment.
Manufacturing failure due to the item not conforming to the design specifications.
Aging failure due to the effects of age and/or usage.
Misuse failure due to misuse of the system (e.g., operating in environments for
which it was not designed).
Mishandling failure due to incorrect handling and/or lack of care and
maintenance.
[Fig. 3.2 Stress versus time for an over-stress failure]
[Fig. 3.3 Cumulative damage D(t) versus time; the time to failure is reached when the accumulated damage attains the endurance limit]
In the wear-out case (see Fig. 3.3, where $D(t)$ indicates the cumulative damage amount), the stress causes damage that accumulates irreversibly. The accumulated damage does not disappear when the stress is removed, although sometimes annealing is possible. The item fails when the cumulative damage reaches the endurance limit. The deterioration process is a typical cumulative damage process.
3.2.5 Failure Severity and Consequences

The severity of a failure mode indicates the impact of the failure mode on the system and the outside environment. A severity ranking classification scheme is as follows [4]:

Catastrophic if failures result in death or total system loss.
Critical if failures result in severe injury or major system damage.
Marginal if failures result in minor injury or minor system damage.
Negligible if failures result in injury or damage less severe than marginal.
RCM [2] classifies the failure consequences into four levels in descending order of severity:
Failures with safety consequences,
Failures with environmental consequences,
Failures with operational consequences, and
Failures with non-operational consequences.
3.2.6 Modeling Failures

be done at the component level; and one might model failures at the system level if the interest is in determining the expected warranty servicing cost.
Modeling of failures also depends on the information available. At the compo-
nent level, a thorough understanding of the failure mechanisms will allow building
a physics-based model. When no such understanding exists, one might need to
model the failures based on failure data. In this case the modeling is data-driven.
The data-driven approach is the most basic approach in reliability study.
3.3 Reliability Basic Functions

3.3.1 Probability Density Function

For an item, the time to failure, $T$, is a nonnegative random variable (i.e., $T \ge 0$). If there is a set of complete failure observations, we can incorporate the observations into grouped data $(n_1, n_2, \ldots, n_m)$, where $n_i$ is the number of failures in the time interval $(t_{i-1}, t_i)$, $1 \le i \le m$. Usually, $t_0 = 0$, $t_m = \infty$, and $t_i - t_{i-1} = \Delta t$ with $\Delta t$ being the interval length. One can display the grouped data in a plot of $n_i$ versus $t_i$ as shown in Fig. 3.4. This plot is termed the histogram of the data.

Let $n = \sum_{i=1}^{m} n_i$ denote the total number of failures. The relative failure frequency in a unit time is given by

$$f_i = \frac{n_i}{n\,\Delta t}, \qquad \Delta t = t_i - t_{i-1}. \tag{3.1}$$

When $n$ tends to infinity and $\Delta t$ tends to zero, the relative frequency histogram will tend to a continuous curve. We denote it as $f(t)$ and call it the probability density function (pdf). A stricter definition of the pdf is given by

$$f(t) = \lim_{\Delta t \to 0} \frac{\Pr\{t < T \le t + \Delta t\}}{\Delta t}. \tag{3.2}$$
[Fig. 3.4 Histogram of the grouped failure data: number of failures n_i versus time t_i]
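As a numerical companion to Eq. (3.1), the following sketch (plain Python; the grouped counts and interval width are hypothetical) computes the relative failure frequencies and checks that they enclose unit area, mirroring the unit-area property of the pdf stated in Eq. (3.3) below.

```python
# Relative failure frequencies f_i = n_i / (n * dt), per Eq. (3.1).
# The counts n_i and interval width dt are hypothetical.
counts = [28, 24, 17, 12, 7, 4]    # n_i: failures observed in each interval
dt = 50.0                          # interval length, t_i - t_{i-1}
n = sum(counts)                    # total number of failures

freqs = [n_i / (n * dt) for n_i in counts]
for i, f_i in enumerate(freqs, start=1):
    print(f"interval {i}: f_{i} = {f_i:.5f}")

# Multiplied by dt, the frequencies sum to one, as the pdf's unit area requires.
print(sum(f_i * dt for f_i in freqs))   # -> 1.0
```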
[Fig. 3.5 Weibull pdf plots for η = 1 and several values of the shape parameter β (e.g., β = 0.8 and β = 1.8)]
The pdf has the following properties. First, it is nonnegative, i.e., $f(t) \ge 0$. The probability that the failure occurs in $(t, t + \Delta t)$ is approximately $f(t)\,\Delta t$, and hence the area under the pdf curve is the total probability, which equals one, i.e.,

$$\int_0^{\infty} f(t)\,dt = 1. \tag{3.3}$$
A typical pdf that has been widely used in the reliability field is that of the Weibull distribution, given by

$$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{-(t/\eta)^{\beta}}, \qquad \beta > 0,\ \eta > 0 \tag{3.4}$$

where $\beta$ is the shape parameter and $\eta$ is the scale parameter. Figure 3.5 shows plots of the Weibull pdf with $\eta = 1$ and $\beta$ ranging from 0.8 to 3.8. It is noted that the distribution becomes less dispersed as the Weibull shape parameter $\beta$ increases. When $\beta$ is about 3.44, the Weibull pdf is very close to the normal pdf with the same mean and variance.
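As a minimal numerical sketch (standard library only; the shape values are illustrative ones from the range above), the snippet below evaluates the Weibull pdf of Eq. (3.4) and verifies the unit-area property of Eq. (3.3) by a crude Riemann sum.

```python
import math

def weibull_pdf(t, beta, eta):
    """Weibull pdf of Eq. (3.4): (beta/eta) * (t/eta)**(beta-1) * exp(-(t/eta)**beta)."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

# Crude Riemann-sum check of Eq. (3.3): the area under each pdf is close to 1.
dt = 0.001
for beta in (0.8, 1.8, 3.8):
    area = sum(weibull_pdf(k * dt, beta, 1.0) * dt for k in range(1, 10_000))
    print(f"beta = {beta}: area under pdf = {area:.3f}")
```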
3.3.2 Cumulative Distribution and Reliability Functions

The cumulative distribution function (cdf) is the probability of the event $T \le t$ (i.e., the item fails by time $t$). Letting $F(t)$ denote the cdf, we have

$$F(t) = \Pr\{T \le t\} = \int_0^t f(x)\,dx. \tag{3.5}$$
[Fig. 3.6 The pdf curve, with the areas under it to the left and right of t representing F(t) and R(t)]
The reliability (or survival) function is the probability of the event $T > t$ (i.e., the item survives to time $t$). Letting $R(t)$ denote the reliability function, we have

$$R(t) = \Pr\{T > t\} = \int_t^{\infty} f(x)\,dx. \tag{3.6}$$

In particular,

$$R(0) = F(\infty) = 1, \qquad R(\infty) = F(0) = 0. \tag{3.7}$$
The relations between the reliability function, cdf, and pdf are graphically displayed in Fig. 3.6.

For the Weibull distribution, the cdf and reliability function are given, respectively, by

$$F(t) = 1 - e^{-(t/\eta)^{\beta}}, \qquad R(t) = e^{-(t/\eta)^{\beta}}. \tag{3.8}$$
3.3.3 Conditional Distribution and Residual Life

If an item has survived to age $x$, the residual lifetime of the item is given by $T - x$, which is a random variable. We call the distribution of $T - x$ the conditional (or residual life) distribution. The cdf of the residual lifetime is given by

$$F(t \mid x) = \Pr\{T \le t \mid T > x\} = \frac{F(t) - F(x)}{R(x)}, \quad t > x. \tag{3.9}$$
[Fig. 3.7 Relation between the underlying pdf f(t) and the conditional pdf f(t|x) for survival to age x]
The pdf and reliability function of the residual lifetime are given, respectively, by

$$f(t \mid x) = \frac{f(t)}{R(x)}, \qquad R(t \mid x) = \frac{R(t)}{R(x)}. \tag{3.10}$$

Figure 3.7 shows the relation between the underlying pdf and the conditional pdf. For the Weibull distribution, the reliability function of the residual life is given by

$$R(t \mid x) = e^{(x/\eta)^{\beta} - (t/\eta)^{\beta}}. \tag{3.11}$$
The probability that an item surviving to age $t$ fails in $(t, t + \Delta t]$ is given by

$$\Pr(t < T \le t + \Delta t\,|\,T > t) = \frac{F(t + \Delta t) - F(t)}{R(t)} \approx \frac{f(t)\Delta t}{R(t)}. \qquad (3.12)$$

The failure rate function $r(t)$ is defined as

$$r(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t\,|\,T > t)}{\Delta t} = \frac{f(t)}{R(t)}. \qquad (3.13)$$
It is noted that the failure rate is nonnegative and can take values in $(0, \infty)$. Therefore, it is not a probability. The probability that the item will fail in $(t, t + \Delta t]$, given that it has not failed prior to $t$, is given by $r(t)\Delta t$.
The failure rate explicitly characterizes the effect of age on item failure. The plot of $r(t)$ versus $t$ can be either monotonic or nonmonotonic. For the monotonic case, an item is said to exhibit positive aging if the failure rate is increasing, negative aging if the failure rate is decreasing, and no aging if the failure rate is constant.
For the Weibull distribution, the failure rate function is given by

$$r(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}. \qquad (3.14)$$

The cumulative hazard function (chf) is defined as

$$H(t) = \int_0^t r(x)\,dx. \qquad (3.15)$$
The pdf, cdf, reliability function, and failure rate function are four basic reliability
functions. Given any one of them, the other three can be derived from it. This is
shown in Table 3.1.
To illustrate, we look at a special case where the failure rate is a positive constant $\lambda$. From Eq. (3.15), we have $H(t) = \lambda t$. Using this in the last column of Table 3.1, we have

$$R(t) = e^{-\lambda t}, \quad f(t) = \lambda e^{-\lambda t}, \qquad (3.16)$$

which is the exponential distribution.
There is a close link between the shape of the failure rate function (e.g., increasing or decreasing) and the aging property (or failure modes) of an item. The shape of the failure rate function is sometimes termed the failure pattern [2]. A well-known nonmonotonic failure pattern is the bathtub-shaped failure rate (or bathtub curve) shown in Fig. 3.8.
The bathtub curve can be obtained from observations of many nominally identical nonrepairable items. It is composed of three parts, which correspond to the early failure phase, the normal use phase, and the wear-out phase, respectively. In the early phase of product use, the failure rate is usually high due to manufacturing and assembly defects, and it decreases with time as the defects are removed. In the normal use phase, the failure rate is low and failures are mainly due to occasional and random accidents or events (e.g., over-stress), so the failure rate remains roughly constant. In the wear-out phase, the failure rate is high again due to the accumulation of damage, gradual degradation, or aging, and hence it increases with time.
The time point where the failure rate quickly changes is called the change point of the failure rate. The bathtub curve has two change points. The first change point can be viewed as the partition point between the early failure phase and the normal use phase; and the second change point as the partition point between the normal use phase and the wear-out phase. A produced item is usually subjected to a burn-in test to reduce the failure rate in the early phase. In this case, the burn-in period should not exceed the first change point. The item can be preventively replaced after the second change point so as to prevent the wear-out failure.
The desired failure pattern for a product should have the following features [5]:
The initial failure resulting from manufacturing or assembly defects should be
reduced to zero so that there are only random failures in the early phase. This
can be achieved by quality control.
The random failures should be minimized, and no wear-out failure should occur during the normal use period. This can be achieved in the design and development process.
The occurrence of wear-out failure should be delayed to lengthen the useful life
of the product. This can be achieved by preventive maintenance.
This leads to a change from the bathtub curve to a hockey-stick line (i.e., the
dotted line shown in Fig. 3.8).
The mean time to failure (MTTF) describes the average lifetime and is given by

$$\mu = \int_0^\infty t f(t)\,dt = \int_0^\infty R(t)\,dt. \qquad (3.17)$$
where $u$ is the shape parameter, $v$ is the scale parameter, and $\Gamma(u)$ is the gamma function evaluated at $u$. Compared with Eq. (3.20), Eq. (3.19) can be rewritten in terms of $G(z; u, v)$, the cdf of the gamma distribution with shape parameter $u$ and scale parameter $v$. Noting that $G(\infty) = 1$, we have the MTTF of the Weibull life given by

$$\mu = m(1, \infty) = \eta\,\Gamma(1 + 1/\beta). \qquad (3.22)$$
Microsoft Excel has standard functions to evaluate the gamma function and the pdfs and cdfs of the Weibull and gamma distributions. Specific details can be found in Online Appendix B.
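As an alternative to the Excel functions mentioned above, the same quantities can be evaluated in Python. The following sketch assumes scipy is available (an assumption of this illustration; the book itself works with Excel), and the gamma-distribution parameters are illustrative:

```python
from scipy.special import gamma
from scipy.stats import weibull_min, gamma as gamma_dist

beta, eta = 2.5, 100.0
# MTTF of the Weibull life, Eq. (3.22): mu = eta * Gamma(1 + 1/beta)
mu = eta * gamma(1.0 + 1.0 / beta)
print(mu)  # about 88.73 for beta = 2.5, eta = 100 (cf. Example 3.1)

# Weibull and gamma pdfs/cdfs (scipy parameterization: shape, scale)
print(weibull_min.pdf(50, c=beta, scale=eta), weibull_min.cdf(50, c=beta, scale=eta))
print(gamma_dist.pdf(50, a=2.0, scale=30.0), gamma_dist.cdf(50, a=2.0, scale=30.0))
```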
3.5.1.2 BX Life
The BX life with $X = 10$ is called the B10 life, which has been widely used in industry. The BX life associated with $x = 1 - e^{-1} \approx 0.6321$ is called the characteristic life (denoted as $t_c$), which satisfies $H(t_c) = 1$.
Compared with the mean life, the BX life is more meaningful when an item is preventively replaced at age $B_X$ to avoid its failure. In this case, the probability that the item fails before $t = B_X$ is $x$. This implies that this measure links the life with the reliability (i.e., $1 - x$). For the Weibull distribution, we have

$$B_X = \eta\left[-\ln(1 - x)\right]^{1/\beta}. \qquad (3.23)$$
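A short sketch of the BX computation for the Weibull case, based on Eq. (3.23) as reconstructed above; the parameter values are illustrative:

```python
import math

def weibull_bx(X_percent, beta, eta):
    # Eq. (3.23): B_X = eta * (-ln(1 - x))^(1/beta), with x = X/100
    x = X_percent / 100.0
    return eta * (-math.log(1.0 - x)) ** (1.0 / beta)

beta, eta = 2.5, 100.0
print(weibull_bx(10, beta, eta))      # B10 life
print(weibull_bx(63.21, beta, eta))   # close to the characteristic life eta
```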
The mean residual life (MRL) is the expectation of the residual life and is given by

$$\mu(x) = \int_x^\infty (t - x) f(t\,|\,x)\,dt = \frac{1}{R(x)}\int_x^\infty R(t)\,dt. \qquad (3.26)$$
(Fig. 3.9 The MRL $\mu(x)$ for the Weibull distribution with $\eta = 100$; the curves for $\beta = 1.5$ and $\beta = 2.5$ are labeled)
For the Weibull distribution, from Eqs. (3.11) and (3.18), after some simplifications we have

$$\mu(x) = \frac{\mu}{R(x)}\left[1 - G\!\left(\left(\frac{x}{\eta}\right)^{\beta};\ 1 + 1/\beta,\ 1\right)\right] - x. \qquad (3.27)$$
It is noted that the MRL is measured from the age $x$, which is the lifetime already achieved without failure. Combining the lifetime already achieved with the expected remaining life, we have the expectation of the entire life given by $ML(x) = \mu(x) + x$. It is called the mean life with censoring.
For $\eta = 100$ and a set of values of $\beta$, Fig. 3.9 shows the plots of $\mu(x)$. As seen, $\mu(x)$ is increasing for $\beta < 1$, constant for $\beta = 1$, and decreasing for $\beta > 1$.
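The MRL can be computed either from Eq. (3.27) or by numerically integrating Eq. (3.26). A sketch of the latter approach, assuming scipy is available, which also reproduces the monotonicity pattern of Fig. 3.9:

```python
import numpy as np
from scipy.integrate import quad

def weibull_R(t, beta, eta):
    return np.exp(-((t / eta) ** beta))

def mrl(x, beta, eta):
    # Eq. (3.26): mu(x) = (1/R(x)) * integral_x^inf R(t) dt
    integral, _ = quad(weibull_R, x, np.inf, args=(beta, eta))
    return integral / weibull_R(x, beta, eta)

eta = 100.0
for beta in (0.8, 1.0, 1.5, 2.5):
    # mu(x) increases in x for beta < 1, is constant for beta = 1,
    # and decreases in x for beta > 1 (cf. Fig. 3.9)
    print(beta, [round(mrl(x, beta, eta), 2) for x in (0, 50, 100)])
```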
Suppose the life of a nonrepairable item follows the Weibull distribution. After operating for $T$ time units (i.e., one mission interval), the item is inspected. If the inspection at $t_i = (i-1)T$ indicates that the item is in the normal state, the mission reliability that the item will survive the next mission interval is evaluated using the conditional reliability given by

$$R_m(iT) = \frac{R(iT)}{R((i-1)T)} = \exp\!\left\{-\left(\frac{T}{\eta}\right)^{\beta}\left[i^{\beta} - (i-1)^{\beta}\right]\right\} = R(T)^{\,i^{\beta} - (i-1)^{\beta}}. \qquad (3.28)$$
Assume that the mission reliability is required not to be smaller than $\alpha$. For $\beta > 1$, $R_m(iT)$ decreases with $i$, so the item has to be replaced after surviving a certain number of mission intervals, say $I$, to ensure the required mission reliability. Clearly, $I$ must meet the following relations:

$$R(T)^{\,I^{\beta} - (I-1)^{\beta}} \ge \alpha; \quad R(T)^{\,(I+1)^{\beta} - I^{\beta}} < \alpha. \qquad (3.29)$$
As such, we have $I = \mathrm{int}(x)$, where $x$ solves the corresponding equality (Eq. (3.30)) and $\mathrm{int}(x)$ is the largest integer not larger than $x$. The largest useful life of the item is achieved when each inspection indicates the normal state, and it equals $IT$.
Example 3.1 Assume that the life of an item follows the Weibull distribution with parameters $\beta = 2.5$ and $\eta = 100$. The duration of each mission is $T = 16$, and the required mission reliability is $\alpha = 0.9$. The problem is to calculate the largest useful life of the item.
Solving Eq. (3.30) yields $x = 3.67$, i.e., $I = 3$. As such, the largest useful life of the item equals 48. It is noted that the mean life is 88.73 and the tradeoff BX life is 69.31 with $X = 32.97$. This implies that the selection of a life measure is application-specific.
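The calculation of Example 3.1 can be reproduced directly from Eqs. (3.28) and (3.29); a minimal sketch:

```python
import math

beta, eta, T, alpha = 2.5, 100.0, 16.0, 0.9

def mission_rel(i):
    # Eq. (3.28): R_m(iT) = exp{-(T/eta)^beta * [i^beta - (i-1)^beta]}
    return math.exp(-((T / eta) ** beta) * (i ** beta - (i - 1) ** beta))

I = 0
while mission_rel(I + 1) >= alpha:   # Eq. (3.29): find the largest I meeting alpha
    I += 1
print(I, I * T)   # I = 3, largest useful life = 48
```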
The variance of the lifetime is given by

$$\sigma^2 = \int_0^\infty (t - \mu)^2 f(t)\,dt = 2\int_0^\infty t\,R(t)\,dt - \mu^2. \qquad (3.31)$$

It describes the dispersion of the life and has the dimension of $t^2$. Its square root $\sigma$ is termed the standard deviation and has the dimension of $t$. The ratio of the standard deviation to the mean life is called the coefficient of variation (CV) and is given by

$$\rho = \sigma/\mu. \qquad (3.32)$$
For the Weibull distribution, the variance is given by

$$\sigma^2 = \eta^2\left[\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)\right]. \qquad (3.33)$$

The skewness and kurtosis are defined, respectively, as

$$\gamma_1 = \frac{1}{\sigma^3}\int_0^\infty (t - \mu)^3 f(t)\,dt, \qquad (3.34)$$

$$\gamma_2 = \frac{1}{\sigma^4}\int_0^\infty (t - \mu)^4 f(t)\,dt - 3. \qquad (3.35)$$
When a repairable system fails, it can be restored to its working state by a repair and then continues to work. This forms a failure–repair process. Depending on the depth of the repairs, the time between two successive failures usually decreases in a statistical sense. In other words, the times between failures are not independent and identically distributed random variables. As a result, the life distribution model and the associated life measures are generally no longer applicable for representing the failure behavior of a repairable system. In this section, we briefly introduce basic concepts and reliability measures of a repairable system.
(Fig. 3.10 The failure–repair process of a repairable system, with event times $t_1, t_2, \ldots, t_8$)
In Fig. 3.10, state 1 means the working (or up) state and state 0 means the failure (or down) state. The start point of the $i$th up–down cycle is at $t_{2i-2}$, $i = 1, 2, 3, \ldots$; the end point is at $t_{2i}$; and the failure occurs at $t_{2i-1}$. The uptime and the downtime of the cycle are given, respectively, by

$$U_i = t_{2i-1} - t_{2i-2}, \quad D_i = t_{2i} - t_{2i-1}. \qquad (3.36)$$
The downtime can be broadly divided into two parts: the direct repair time and other time. The direct repair time is related to the maintainability of the system, which is a design-related attribute; the other time depends on many factors such as the supportability of the system.
Let $s_i$ denote the direct repair time of the $i$th up–down cycle. The mean time to repair (MTTR) can be defined as below:

$$\bar{s}_i = \frac{1}{i}\sum_{k=1}^{i} s_k. \qquad (3.37)$$
Let $D(t)$ denote the total downtime by time $t$. The total uptime is given by

$$U(t) = t - D(t). \qquad (3.38)$$

Figure 3.11 shows the plots of $D(t)$ and $U(t)$ for the data of Table 3.2.
(Fig. 3.11 Plots of $U(t)$ and $D(t)$ for the data of Table 3.2)
The mean time between failures (MTBF) and the mean downtime (MDT) can be evaluated at the end of each up–down cycle (i.e., at $t_{2i}$) and are given, respectively, by

$$m_i = \frac{U(t_{2i})}{i}, \quad d_i = \frac{D(t_{2i})}{i}. \qquad (3.39)$$
For the data shown in Table 3.2, the MTBF and MDT are shown in the last two
columns of the table.
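The following sketch computes the MTBF, MDT, and availability from a sequence of event times like those in Fig. 3.10. The event times used here are hypothetical stand-ins for the Table 3.2 data, which is not reproduced in this chapter excerpt.

```python
# Hypothetical event times t0..t8 (alternating failure / repair completion)
t = [0.0, 4.0, 5.5, 11.0, 12.0, 18.5, 20.0, 26.0, 27.5]

n_cycles = (len(t) - 1) // 2
U = D = 0.0
for i in range(1, n_cycles + 1):
    up = t[2 * i - 1] - t[2 * i - 2]    # uptime of cycle i, Eq. (3.36)
    down = t[2 * i] - t[2 * i - 1]      # downtime of cycle i, Eq. (3.36)
    U += up
    D += down
    # Eq. (3.39): MTBF and MDT at the end of cycle i; Eq. (3.40): A(t_2i)
    print(i, U / i, D / i, U / t[2 * i])
```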
3.6.2.2 Availability
The availability at time $t$ is defined as

$$A(t) = \frac{U(t)}{t}. \qquad (3.40)$$
For the data shown in Table 3.2, Fig. 3.12 shows the plot of instantaneous
availability.
Depending on the depth of the maintenance activities performed previously, the MTBF, MDT, and availability may or may not be asymptotically constant when $i$ or $t$ is large. Usually, $A(t)$ decreases at first and then asymptotically tends to a constant. When $t \to \infty$, Eq. (3.40) can be written as

$$A(\infty) = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MDT}}. \qquad (3.41)$$
(Fig. 3.12 The instantaneous availability $A(t)$ for the data of Table 3.2)
For a new product, the MTBF and MTTR can be estimated based on test data. In this case, one can use the following to assess the inherent availability:

$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}. \qquad (3.42)$$

Since MDT > MTTR, the inherent availability is always larger than the field availability.
When the downtime is small relative to the uptime, it can be ignored. In this case, the failure–repair process reduces to a failure point process. A failure point process can be represented in two ways. In the first way, the process is represented by the time to the $i$th failure, $T_i$, which is a continuous random variable and can be represented by a distribution function. In the second way, the process is represented by the total number of failures by time $t$, $N(t)$, which is a discrete random variable. We briefly discuss these two representations as follows.
(Fig. 3.13 $N(t)$ for the data of Table 3.2, along with the fitted power-law model)
When the repair is imperfect, the distribution of $X_i$ is also different from the distribution of $T_1$. The models and methods for modeling $T_i$ or $X_i$ are discussed in Chap. 6.
A commonly used model for the expected cumulative number of failures is the power-law model

$$E[N(t)] = (t/\eta)^{\beta}. \qquad (3.43)$$

It has the same expression as the Weibull cumulative hazard function, but the two have completely different meanings.
Using a curve-fitting method such as the least squares method, we can obtain the estimates of $\beta$ and $\eta$ for the data in Table 3.2, which are $\hat\beta = 1.1783$ and $\hat\eta = 7.9095$. The plot of the fitted power-law model is also shown in Fig. 3.13 (the smooth curve).
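A sketch of a simple least squares fit of the power-law model: taking logs of the cumulative failure counts linearizes the model so that ordinary linear regression applies. The failure times below are hypothetical stand-ins for the Table 3.2 data, so the resulting estimates differ from those quoted above.

```python
import numpy as np

# Hypothetical cumulative failure times of a repairable system
times = np.array([4.0, 11.0, 18.5, 26.0])
N = np.arange(1, len(times) + 1)          # N(t_i) = i at the i-th failure

# Eq. (3.43) gives ln N = beta * ln t - beta * ln eta: regress ln N on ln t
b, a = np.polyfit(np.log(times), np.log(N), 1)
beta_hat = b
eta_hat = np.exp(-a / b)
print(beta_hat, eta_hat)
```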
The rate of occurrence of failures (or failure intensity function) is defined as

$$m(t) = \frac{d\,E[N(t)]}{dt}. \qquad (3.44)$$

It has the same expression as the Weibull failure rate function but with a completely different meaning.
The plot of $m(t)$ versus $t$ is often termed the failure pattern of a repairable system (see Ref. [7]). The plot can be bathtub-shaped; in this case, it is called the system bathtub curve. The system bathtub curve is different from the component bathtub curve: the former refers to the rate of occurrence of failures of a repairable system, while the latter refers to the failure rate function of a nonrepairable component. More specifically, the failure rate function represents the effect of the age of an item on failure and is independent of maintenance actions. On the other hand, the rate of occurrence of failures represents the intensity with which a repairable system experiences its next (or subsequent) failure, and it strongly depends on the maintenance actions completed previously.
The reliability of a product depends on technical decisions made during the design
and manufacturing phases of the product and is affected by many factors such as
use environment, operating conditions, maintenance activities, and so on. This
implies that product reliability evolves over time. In other words, the reliabilities
evaluated at different time points in the life cycle can be different.
According to the time points at which the reliabilities are evaluated, there are four different reliability notions [8]: design reliability, inherent reliability, reliability at sale, and field reliability. These notions are important for a complete understanding of product reliability, and are also useful for selecting an appropriate model for the reliability at a specific time point. We briefly discuss these notions as follows.
Design reliability is the reliability predicted at the end of the design and development phase. The design reliability is inferred based on the test data of product prototypes and their components, and on the corrective actions taken during the development process. The test data are obtained under strictly controlled conditions, without being impacted by actual operating conditions and maintenance activities. As such, the precision of the prediction will depend on the prediction method and on the agreement between the test conditions and the actual use conditions. If the prediction method is appropriate and the test conditions are close to the actual operating conditions, the design reliability can be viewed as the average field reliability of the product population.
Precise prediction of the reliability of new products in the design phase is desirable since it provides an adequate basis for comparing design options. Specific methods of reliability prediction will be discussed in Chaps. 9 and 11.
For a given product, there is a time interval from its assembly to customer delivery.
Usually, the customer delivery time is used as the origin of the product life. Before
this time point, the product is subjected to storage and transportation, which can
result in performance deterioration. The deterioration is equivalent to the product
having been used for a period of time. As a result, the reliability at sale is different from the inherent reliability and depends on the packaging and packing, the transportation process, the storage duration, and the storage environment.
The field reliability is evaluated based on field failure data. It is different from the reliability at sale due to the influence of various extrinsic factors on the reliability. These factors include
Usage mode (continuous or intermittent),
Usage intensity (high, medium, or low),
Assume that the inherent reliability, the reliability at sale, and the field reliability of a nonrepairable component can be represented by Weibull distributions with shape parameters $\beta_I$, $\beta_S$, and $\beta_F$, respectively. The variation sources that impact the inherent reliability are fewer than those that impact the reliability at sale. Similarly, the variation sources that impact the reliability at sale are fewer than those that impact the field reliability. Larger variability results in a larger life spread and a smaller Weibull shape parameter. As a result, it is expected that $\beta_F < \beta_S < \beta_I$, which has been empirically validated (see Ref. [11]).
References
1. Blischke WR, Murthy DNP (2000) Reliability: modeling, prediction, and optimization. Wiley, New York, pp 13–14
2. Moubray J (1997) Reliability-centered maintenance. Industrial Press Inc, New York
3. Dasgupta A, Pecht M (1991) Material failure mechanisms and damage models. IEEE Trans Reliab 40(5):531–536
4. US Department of Defense (1984) System safety program requirement. MIL-STD-882
5. Ryu D, Chang S (2005) Novel concepts for reliability technology. Microelectron Reliab 45(3–4):611–622
6. Jiang R (2013) A tradeoff BX life and its applications. Reliab Eng Syst Saf 113:1–6
7. Sherwin D (2000) A review of overall models for maintenance management. J Qual Maint Eng 6(3):138–164
8. Murthy DNP (2010) New research in reliability, warranty and maintenance. In: Proceedings of the 4th Asia-Pacific international symposium on advanced reliability and maintenance modeling, pp 504–515
9. Jiang R, Murthy DNP (2009) Impact of quality variations on product reliability. Reliab Eng Syst Saf 94(2):490–496
10. Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34(2):187–220
11. Jiang R, Tang Y (2011) Influence factors and range of the Weibull shape parameter. Paper presented at the 7th international conference on mathematical methods in reliability, pp 238–243
Chapter 4
Distribution Models
4.1 Introduction
In this chapter we introduce typical distributional models that have been widely
used in quality and reliability engineering.
The outline of the chapter is as follows. We start with discrete distributions in
Sect. 4.2, and then present simple continuous distributions in Sect. 4.3. The con-
tinuous distributions involving multiple simple distributions are presented in
Sect. 4.4. Finally, the delay time model involving two random variables is presented
in Sect. 4.5.
where $n_x$ is a nonnegative integer. We call the data given by Eq. (4.1) count data. There are many situations where count data arise, e.g., grouped failure data, the number of defects in product quality analysis, accident data in traffic safety studies, and so on.
The probability mass function (pmf) $f(x)$ is the probability of the event $X = x$, i.e.,

$$f(x) = \Pr(X = x). \qquad (4.2)$$

$$f(0) = F(0) = 1 - R(0); \quad R(x) = 1 - F(x); \quad f(x) = F(x) - F(x - 1),\ x \ge 1. \qquad (4.3)$$
For the data given by Eq. (4.1), we have the empirical cdf given by $\hat F(x) = \frac{1}{n}\sum_{i=0}^{x} n_i$.
The discrete failure rate function, $r(x)$, is defined as

$$r(0) = f(0); \quad r(x) = \frac{f(x)}{R(x - 1)}, \quad x \ge 1. \qquad (4.4)$$
Many discrete distribution models have been developed in the literature (e.g., see Refs. [1, 10]). Based on the number of distribution parameters, the discrete distributions can be classified into the following three categories:
Single-parameter models (e.g., geometric and Poisson distributions),
Two-parameter models (e.g., binomial, negative binominal, and zero-inflated
Poisson distributions), and
Models with more than two parameters (e.g., hypergeometric distribution).
Suppose that there is a sequence of independent Bernoulli trials and each trial has two potential outcomes: failure (or "no") and success (or "yes"). Let $p \in (0, 1)$ [$q = 1 - p$] denote the success [failure] probability in each trial. The geometric distribution is the probability distribution of the number of failures $X$ before one successful trial. As such, the pmf of the geometric distribution is given by

$$f(x) = q^x p. \qquad (4.5)$$
The cdf, reliability function, and failure rate function are given, respectively, by

$$F(x) = 1 - q^{x+1}; \quad R(x) = q^{x+1}; \quad r(x) = p. \qquad (4.6)$$
There is a close link between the exponential distribution and the geometric distribution. Suppose that a continuous random variable $T \ge 0$ follows the exponential distribution with rate $\lambda$ and the observation times are $t_x = (x + 1)\Delta t$. In this case, we have

$$f(x) = \Pr(x\Delta t < T \le (x + 1)\Delta t) = e^{-\lambda x \Delta t}\left(1 - e^{-\lambda\Delta t}\right). \qquad (4.8)$$

Letting $p = 1 - e^{-\lambda\Delta t}$, Eq. (4.8) becomes Eq. (4.5). This implies that $X = 0, 1, 2, \ldots$ follows the geometric distribution.
The pmf of the Poisson distribution is given by

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0. \qquad (4.9)$$

Its mean and variance are equal:

$$\mu = \sigma^2 = \lambda. \qquad (4.10)$$
The pmf of the binomial distribution is given by

$$f(x) = C(n, x)\,p^x (1 - p)^{n - x}, \quad 0 \le x \le n \qquad (4.11)$$

where $C(n, x)$ is the number of combinations choosing $x$ items from $n$ items. The mean and variance of $X$ are given, respectively, by

$$\mu = np; \quad \sigma^2 = \mu(1 - p). \qquad (4.12)$$
Example 4.1 Suppose that n = 10 items are tested and the success probability is
p = 95 %. Calculate the probability that the number of conforming items equals x.
The probability that the number of conforming items equals x is evaluated by
Eq. (4.11) and the results are shown in Table 4.1.
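Example 4.1 can be reproduced in a few lines; a sketch assuming scipy is available:

```python
from scipy.stats import binom

n, p = 10, 0.95
for x in range(5, 11):
    # Eq. (4.11): Pr{X = x} = C(n, x) p^x (1-p)^(n-x)
    print(x, round(binom.pmf(x, n, p), 4))
```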
The pmf of the negative binomial distribution is given by

$$f(x) = C(x + r - 1,\ x)\,p^x (1 - p)^r. \qquad (4.13)$$
The negative binomial distribution can be extended to the case where $r$ is a positive real number rather than an integer. In this case, $C(x + r - 1, x)$ is evaluated as

$$C(x + r - 1,\ x) = \frac{\Gamma(x + r)}{\Gamma(x + 1)\,\Gamma(r)}. \qquad (4.15)$$
In the above definition, the number of failures is fixed and the number of successes is a random variable. The negative binomial distribution can be defined differently. Let $x$ denote the number of failures and $r$ denote the number of successes. The experiment is stopped at the $r$th success. In this case, the pmf of $X$ is still given by Eq. (4.13).
Example 4.2 Suppose we need to have 100 conforming items, and the probability that an item is conforming is 0.95. The problem is to determine how many items we need to buy so that we can obtain the required number of conforming items with a probability of 90 %.
For this example, the number of successes is fixed and the number of failures is a random variable, so the second definition is more appropriate. In this case, the problem is to find the value of $x$ such that $F(x - 1) < 0.9$ and $F(x) \ge 0.9$ for $r = 100$ and $p = 0.95$. The computational process is shown in Table 4.2. As seen from the table, we need to buy $x + r = 108$ items to ensure a probability of 90 % of obtaining 100 conforming items.
The problem can also be solved using the binomial distribution. Suppose we buy $n$ items with $n > 100$. If the number of failures is not larger than $n - 100$, the requirement is met. The probability of this event is given by $F(n - 100;\ n,\ 1 - p)$, which must be larger than or equal to 90 %. The computational process based on this idea is shown in Table 4.3, and the solution is the same as the one obtained from the negative binomial distribution (see also the sketch after Table 4.3).
Table 4.3 Results from the binomial distribution
n      105     106     107     108     109     110
F(n)   0.5711  0.7200  0.8327  0.9081  0.9533  0.9779
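Both formulations of Example 4.2 can be checked numerically. The sketch below assumes scipy's convention that nbinom counts the number of failures before the $r$th success:

```python
from scipy.stats import nbinom, binom

r, p = 100, 0.95          # need r = 100 conforming items; p = Pr{conforming}
x = 0
while nbinom.cdf(x, r, p) < 0.9:   # smallest x with F(x) >= 0.9
    x += 1
print(x, r + x)            # x = 8 failures -> buy 108 items

# Cross-check via the binomial formulation: F(n-100; n, 1-p) >= 0.9
n = r + x
print(binom.cdf(n - r, n, 1 - p))  # about 0.9081, as in Table 4.3
```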
The pmf of the hypergeometric distribution is given by

$$f(x) = C_m^x\,C_{N-m}^{n-x}\big/C_N^n \qquad (4.19)$$

where $C_A^B = C(A, B)$ is the number of combinations choosing $B$ items from $A$ items. The mean and variance are given, respectively, by

$$\mu = nm/N; \quad \sigma^2 = \mu\,\frac{(N - m)(N - n)}{N(N - 1)}. \qquad (4.20)$$
Example 4.3 Assume $(N, m, n) = (50, 45, 10)$. In this case, we have $x \in [5, 10]$. The pmf of $X$ is shown in the second row of Table 4.4. For purposes of comparison, the last row shows the pmf from the binomial distribution with $n = 10$ and $p = 0.9$.
The univariate continuous distributions can be broadly divided into two categories:
simple distributions and complex distributions. The complex distributions can be
further divided into several sub-categories (e.g., see Ref. [12]). We present some
simple distributions in this section and several complex distributions that involve
two or more simple distributions in the following section.
The Weibull pdf is given by Eq. (3.4), the mean is given by Eq. (3.22), and the
variance is given by Eq. (3.33). The Weibull distribution is mathematically tractable
with closed-form expressions for all the reliability basic functions. It is also flexible
since the failure rate function can be decreasing, constant, or increasing. The shape
parameter b represents the aging characteristics and the scale parameter g is the
characteristic life and is proportional to various life measures. Jiang and Murthy [7] present a detailed study of the properties and significance of the Weibull shape parameter.
The three-parameter Weibull distribution is an extension of the two-parameter Weibull model, with the cdf given by a piecewise function:

$$F(t) = \begin{cases} 0, & t \in (0, \gamma] \\[4pt] 1 - \exp\!\left[-\left(\dfrac{t - \gamma}{\eta}\right)^{\beta}\right], & t > \gamma. \end{cases} \qquad (4.21)$$
Consider the Weibull transformations

$$x = \ln t, \quad y = \ln\{-\ln[1 - F(t)]\} = \ln H(t). \qquad (4.22)$$

For the two-parameter Weibull distribution, Eq. (4.22) yields

$$y = \beta(x - \ln\eta). \qquad (4.23)$$

It is a straight line in the x–y plane. The plot of $y$ versus $x$ is called the Weibull probability paper (WPP) plot. The Weibull transformations can be applied to any other distribution with nonnegative support, but the resulting WPP plot is no longer a straight line. For example, the WPP plot of the three-parameter Weibull distribution is concave.
Since $y = \ln H(t)$, we have $|y| \gg H(t)$ for small $t$ and $y \ll H(t)$ for large $t$. Similarly, since $x = \ln t$, we have $|x| \gg t$ for small $t$ ($t \in (0, 1)$) and $x \ll t$ for large $t$ ($t \gg 1$). As a result, the Weibull transformations produce an amplification effect for the lower-left part of the WPP plot and a compression effect for the upper-right part [3].
The pdf of the gamma distribution is given by Eq. (3.20). Generally, there are no closed-form expressions for the other three basic functions, but Microsoft Excel has standard functions to evaluate them (see Online Appendix B).
The k-th origin moment of the gamma distribution is given by

$$m_k = \int_0^\infty x^k g(x)\,dx = v^k\,\frac{\Gamma(k + u)}{\Gamma(u)}. \qquad (4.24)$$
As $t \to \infty$,

$$r(t) = \frac{f(t)}{1 - F(t)} \to -[\ln f(t)]' = -\frac{u - 1}{t} + \frac{1}{v} \to \frac{1}{v}.$$

This implies that the failure rate of the gamma distribution tends to a constant rather than to zero or infinity.
The gamma distribution has a long right tail. It reduces to the exponential distribution when $u = 1$, to the Erlang distribution when $u$ is a positive integer, and to the chi-squared distribution with $n$ degrees of freedom when $u = n/2$ and $v = 2$. The chi-squared distribution is the distribution of the random variable $Q_n = \sum_{i=1}^{n} X_i^2$, where the $X_i$ ($1 \le i \le n$) are independent standard normal random variables. The chi-squared distribution is widely used in hypothesis testing and in the design of acceptance sampling plans.
The cdf of the lognormal distribution is given by

$$F(t) = \Phi\!\left(\frac{\ln t - \mu_l}{\sigma_l}\right) = \Phi\{\ln[(t/e^{\mu_l})^{1/\sigma_l}]\} \qquad (4.26)$$

where $\Phi(\cdot)$ is the standard normal cdf. It is noted that $e^{\mu_l}$ is similar to the Weibull scale parameter and $1/\sigma_l$ is similar to the Weibull shape parameter. Therefore, we call $\mu_l$ and $\sigma_l$ the scale and shape parameters, respectively. The mean and variance are given, respectively, by

$$\mu = e^{\mu_l + \sigma_l^2/2}; \quad \sigma^2 = \left(e^{\sigma_l^2} - 1\right)e^{2\mu_l + \sigma_l^2}. \qquad (4.27)$$

The lognormal distribution has a longer right tail than the gamma distribution. The failure rate function is unimodal, and can be effectively viewed as increasing when $\sigma_l < 0.8$, roughly constant when $\sigma_l \in (0.8, 1.0)$, and decreasing when $\sigma_l > 1$ [9].
In a batch of products, some are normal while others are defective. The lifetime of a normal product is longer than that of a defective product, and hence the former is sometimes called the strong sub-population and the latter the weak sub-population.
In general, several different product groups are mixed together and this forms a
mixture population. Two main causes for the mixture are:
(a) product parts can come from different manufacturers, and
(b) products are manufactured in different production lines or by different oper-
ators or by different production technologies.
Let $F_j(t)$ denote the life distribution for sub-population $j$, and $p_j$ denote its proportion. The life distribution of the population is given by

$$F(t) = \sum_{j=1}^{n} p_j F_j(t); \quad 0 < p_j < 1, \quad \sum_{j=1}^{n} p_j = 1. \qquad (4.28)$$
When n 2 and Fj t is the Weibull distribution, we call Eq. (4.28) the twofold
Weibull mixture. The main characteristics for this special model are as follows [6]:
The WPP plot is S-shaped.
The pdf has four different shapes as shown in Fig. 4.1.
The failure rate function has eight different shapes.
The mixture model has many applications, e.g., burn-in time optimization and
warranty data analysis. We will further discuss these issues in Chaps. 15 and 16.
An item can fail due to several failure modes, and each can be viewed as a risk. All the risks compete, and the failure occurs due to the failure mode that is reached first. Such a model is termed a competing risk model. An example is a system composed of $n$ independent components without any redundant component. The system fails when any component fails; i.e., the system survives to $t$ only if each component of the system survives to $t$.
(Fig. 4.1 Shapes of the pdf of the twofold Weibull mixture)
Let $T_i$ denote the time to failure of component $i$, and $R_i(t)$ denote the probability that component $i$ survives to $t$. Similarly, let $T$ denote the time to failure of the system, and $R(t)$ denote the probability that the system survives to $t$. Clearly, $T = \min(T_i,\ 1 \le i \le n)$. As a result, under the independence assumption we have

$$R(t) = \prod_{i=1}^{n} R_i(t). \qquad (4.29)$$
If the $i$th item has an initial age $a_i$ at the time origin, $R_i(t)$ should be replaced by $R_i(t + a_i)/R_i(a_i)$.
From Eq. (4.29), the system failure rate function is given by

$$r(t) = \sum_{i=1}^{n} r_i(t). \qquad (4.30)$$

This implies that the system failure rate is the sum of the component failure rates.
If $n$ items are simultaneously tested and the test stops when the first failure occurs, the test is called sudden death testing. The test duration $T$ is a random variable and follows the n-fold competing risk model with $F_i(t) = F_1(t)$, $2 \le i \le n$. In this case, the cdf of $T$ is given by $F(t) = 1 - [1 - F_1(t)]^n$.
Another special case of Eq. (4.29) is $n = 2$, termed the twofold competing risk model. In this case, the item failure can occur due to one of two competing causes. The time to failure $T_1$ due to Cause 1 is distributed according to $F_1(t)$, and the time to failure $T_2$ due to Cause 2 is distributed according to $F_2(t)$. The item failure time is given by the minimum of $T_1$ and $T_2$, and $F(t)$ is given by

$$F(t) = 1 - [1 - F_1(t)][1 - F_2(t)]. \qquad (4.31)$$
In the multiplicative model, the system fails only when all the components have failed. Using the same notations as those in the competing risk model, the system life is given by $T = \max(T_i,\ 1 \le i \le n)$. Under the independence assumption, we have

$$F(t) = \prod_{i=1}^{n} F_i(t). \qquad (4.32)$$
If the $i$th item has an initial age $a_i$ at the time origin, $F_i(t)$ should be replaced by $1 - R_i(t + a_i)/R_i(a_i)$.
The multiplicative model has two typical applications. The first application is the hot standby system, where $n$ components with the same function operate simultaneously to achieve high reliability. The second application is in reliability testing, where $n$ items are tested simultaneously and the test stops when all the components fail. In this case, the test duration $T$ is a random variable and follows the n-fold multiplicative model with $F_i(t) = F_1(t)$, $2 \le i \le n$; the cdf of $T$ is given by $F(t) = F_1^n(t)$.
Another special case of the model given by Eq. (4.32) is $n = 2$. In this case, $F(t)$ is given by

$$F(t) = F_1(t)\,F_2(t). \qquad (4.33)$$
A sectional model represents the distribution piecewise over successive time intervals:

$$F(t) = G_i(t), \quad t \in (t_{i-1}, t_i],\ 1 \le i \le n; \quad t_0 = 0,\ t_n = \infty. \qquad (4.34)$$

It is noted that a step-stress testing model has the form of Eq. (4.34).
For the distribution to be continuous at the break points, the model parameters
need to be constrained. We consider two special cases as follows.
Consider the model given by Eq. (4.36). Assume that the $R_i(t)$ ($1 \le i \le n$) are two-parameter Weibull distributions, $k_1 = 1$, and $k_i > 0$ for $2 \le i \le n$. As such, the model has $3n - 1$ parameters (assuming that the $t_i$'s are known). To be continuous at the break points, the parameters must meet $n - 1$ relations. This twofold Weibull sectional model has only three independent parameters.
Consider the model given by Eq. (4.37). Assume that $F_1(t)$ is the two-parameter Weibull distribution, the $F_i(t)$ ($2 \le i \le n$) are three-parameter Weibull distributions with location parameters $\gamma_i$, and $k_i = 1$ for $1 \le i \le n$. To be continuous at the break points, the parameters must meet $n - 1$ relations.
(Fig. 4.2 Cdfs of Models (4.39) and (4.41) for Example 4.4)
As such, the model has $2n$ independent parameters (if the $t_i$'s are known). In particular, when $n = 2$ and $\beta_1 = \beta_2 = \beta$, Eq. (4.40) reduces to

$$F(t) = \begin{cases} 1 - \exp\!\left[-(t/\eta_1)^{\beta}\right], & t \in (0, t_1] \\[4pt] 1 - \exp\!\left[-\left(\dfrac{t - \gamma_2}{\eta_2}\right)^{\beta}\right], & t \in (t_1, \infty) \end{cases} \qquad (4.41)$$

with $\gamma_2 = \left(1 - \dfrac{\eta_2}{\eta_1}\right)t_1$.
Example 4.4 The models given by Eqs. (4.39) and (4.41) can be used to model simple step-stress testing data. Assume $t_1 = 8$ and $(\eta_1, \beta) = (10, 2.3)$. When $\eta_2 = 6.88$, we have $k_2 = 2.2639$ for Model (4.39); when $\eta_2 = 5$, we have $\gamma_2 = 4$ for Model (4.41). Figure 4.2 shows the plots of the distribution functions obtained from Models (4.39) and (4.41). As seen, they almost overlap, implying that the two models can provide almost the same fit to a given dataset.
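A sketch verifying Example 4.4 numerically. Model (4.39) is implemented here in the sectional form $F(t) = 1 - k_i R_i(t)$ with $k_2$ obtained from continuity at $t_1$ (this yields $k_2 \approx 2.26$, matching the example up to rounding); Model (4.41) uses $\gamma_2 = (1 - \eta_2/\eta_1)t_1 = 4$:

```python
import math

t1, eta1, beta = 8.0, 10.0, 2.3

def F_39(t, eta2=6.88):
    # Sectional model (4.39): F = 1 - R1(t) for t <= t1; 1 - k2*R2(t) for t > t1
    if t <= t1:
        return 1 - math.exp(-((t / eta1) ** beta))
    k2 = math.exp((t1 / eta2) ** beta - (t1 / eta1) ** beta)  # continuity at t1
    return 1 - k2 * math.exp(-((t / eta2) ** beta))

def F_41(t, eta2=5.0):
    # Sectional model (4.41) with gamma2 = (1 - eta2/eta1) * t1 = 4
    if t <= t1:
        return 1 - math.exp(-((t / eta1) ** beta))
    g2 = (1 - eta2 / eta1) * t1
    return 1 - math.exp(-(((t - g2) / eta2) ** beta))

for t in (2, 6, 8, 10, 12, 15):
    print(t, round(F_39(t), 4), round(F_41(t), 4))   # nearly overlapping curves
```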
The distribution models presented above involve only a single random variable. In
this section, we introduce a distribution model, which involves two random
variables.
Referring to Fig. 4.3, the item lifetime $T$ is divided into two parts: a normal part and a defective part. The normal part is the time interval (denoted as $U$) from the beginning to the time when a defect initiates; the defective part is the time period from the defect initiation to the failure, which is termed the delay time and denoted as $H$. Both $U$ and $H$ are random variables.
(Fig. 4.3 The delay time concept: the normal period $U$, the delay time $H$, and the two inspection cases)
The delay time concept and model are usually applied to optimize an inspection scheme, which is used to check whether the item is defective or not. Suppose an item is periodically inspected. If a defect is identified at an inspection (as in Case 1 in Fig. 4.3), the item is preventively replaced by a new one; if the item fails before the next inspection (as in Case 2 in Fig. 4.3), it is correctively replaced. As such, the maintenance action can be arranged in a timely way and the operational reliability is improved. For more details about the concept and applications of the delay time, see Ref. [13] and the literature cited therein.
Suppose a single item is subjected to a major failure mode (e.g., fatigue) and the failure process of the item can be represented by the delay time concept. Let $F_u(t)$ and $F_h(t)$ denote the distributions of $U$ and $H$, respectively. The time to failure is given by $T = U + H$. Assuming that $U$ and $H$ are mutually independent and that $F_u(t)$ and $F_h(t)$ are known, the distribution function of $T$ is given by

$$F(t) = \int_0^t F_h(t - x)\,dF_u(x). \qquad (4.42)$$
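Equation (4.42) is a convolution and can be evaluated numerically. The sketch below assumes, purely for illustration, that both $U$ and $H$ are Weibull distributed with the stated parameters:

```python
import numpy as np

def weibull_cdf(t, beta, eta):
    return 1.0 - np.exp(-((np.maximum(t, 0.0) / eta) ** beta))

def delay_time_cdf(t, n=4000):
    # Eq. (4.42): F(t) = integral_0^t F_h(t - x) dF_u(x), via a Riemann sum
    x = np.linspace(0.0, t, n + 1)
    Fu = weibull_cdf(x, 2.0, 100.0)       # distribution of U (assumed)
    Fh = weibull_cdf(t - x, 1.5, 20.0)    # distribution of H (assumed)
    dFu = np.diff(Fu)                     # increments of F_u
    return float(np.sum(Fh[:-1] * dFu))   # left-endpoint Riemann sum

for t in (50, 100, 150, 200):
    print(t, round(delay_time_cdf(t), 4))
```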
Chapter 5
Statistical Methods for Lifetime Data Analysis
5.1 Introduction
Data for reliability modeling and analysis come mainly from testing and from field use, and sometimes from the published literature and experts' judgments. The test data are obtained under controlled conditions, and the field data are usually recorded and stored in a management information system.
The data used for reliability modeling and analysis can be classified into the following three types:
Data of time to failure (TTF) or time between failures (TBF);
Data of performance degradation (or internal covariates resulting from degradation); and
Data of use conditions and environment (external covariates).
A non-repairable item fails (or is used) only once. In this case, a failure datum is an observation of the TTF, and the observations from nominally identical items are independent and identically distributed (i.i.d.). A life distribution can be used to fit TTF data.
On the other hand, a repairable item can fail or be used several times. In this case, a failure datum is an observation of the time between successive failures (including the time to the first failure, TTFF for short). Depending on the maintenance activities used to restore the item to its working state, the TBF data are generally not i.i.d., except for the TTFF data. The model for TBF data is usually a stochastic process model such as the power-law model. In this chapter, we focus on i.i.d. data, and we will look at the modeling problem for data that are not i.i.d. in the next chapter.
According to whether or not the failure time is exactly known, life data can be classified into complete data and incomplete data. For complete data, the exact failure times are known. In other words, each datum is a failure observation. It is often hard to get a complete dataset, since the test usually stops before all the tested items fail, or the items under observation are often in a normal operating state when data are collected from the field.
Incomplete data arise from censoring and truncation. In the censoring case, the value of an observation is only partially known. For example, one (or both) of the start point and endpoint of a life observation is (are) unknown, so we just know that the life is larger or smaller than a certain value, or falls in some closed interval. A censored datum contains partial life information and should not be ignored in data analysis.
In the truncation case, the items are observed in a time window and the failure information is completely unknown outside the observation window. For example, the failure information reported before the end of the automotive warranty period is known but completely unknown after the warranty period. There are also situations where the failure information before a certain time is unknown. For example, if the time for an item to start operating is earlier than the time for a management information system to begin running, the failure information of the item before the information system began running is unknown. However, truncation usually produces two censored observations, as shown in Fig. 5.1, where each failure is indicated by a marker. The censored observation on the left is actually a residual life observation and is usually termed left-truncated data; the censored observation on the right is usually termed right-truncated data.
(Fig. 5.1 The observation window and the censored observations produced by truncation)
There are three types of censoring: left censoring, right censoring, and interval censoring. In the left censoring case, we do not know the exact value of a TTF observation but know that it is below a certain value. Let $t_f$ denote the actual failure time and $t$ denote its known upper bound. A left-censored observation meets the following:

$$0 < t_f < t. \qquad (5.1)$$

Left censoring can occur when a failure is not self-announced and can be identified only by an inspection.
In the right censoring case, we only know that the TTF is above a certain value. Let $t$ denote the known lower bound. A right-censored observation meets the following:

$$t < t_f < \infty. \qquad (5.2)$$

Right censoring can occur when an item is preventively replaced or the life test is stopped before the failure of an item.
In the interval censoring case, we only know that the TTF is somewhere between two known values, $t_1$ and $t_2$. An interval-censored observation meets the following:

$$t_1 < t_f \le t_2. \qquad (5.3)$$

Interval censoring can occur when observational times are scheduled at some discrete time points.
It is noted that both left censoring and right censoring can be viewed as special cases of interval censoring, with the left endpoint of the interval at zero or the right endpoint at infinity, respectively.
Grouped data arise from interval censoring. Suppose $n$ items are under test, and the state of each item (working or failed) is observed at times $t_i = i\Delta t$, $i = 1, 2, \ldots$. If an item is in the failed state at $t_i$, the exact failure time is unknown, but we know $t_{i-1} < t_f \le t_i$. As a result, a grouped (or count) dataset can be represented as in Table 5.1, and the interval with $t_k = \infty$ is called a half-open interval.
When the sample size $n$ is large, a complete dataset can be simplified into the form of grouped data. Such an example is the well-known bus-motor major failure data from Ref. [4]. The data deal with the bus-motor major failure times. A major failure is defined as a serious accident (usually involving worn cylinders, pistons, piston rings, valves, camshafts, or connecting rod or crankshaft bearings) or performance deterioration (e.g., the maximum power produced fell below a specified proportion of the normal value).
Table 5.2 shows the times to the first through fifth major failures of the bus-motor fleet. The time unit is 1000 miles and $t_u$ is the upper bound of the final interval. It is noted that minor failures and preventive maintenance actions are not shown, and the total number of bus-motors varies, implying that a large amount of information is missing.
There are three typical life test schemes. The first scheme is test-to-failure. Suppose $n$ items are tested and the test ends when all the items fail. The times to failure are given by

$$t_1, t_2, \ldots, t_n. \qquad (5.4)$$

This dataset is a complete dataset with all failure times known exactly. With this scheme, the test duration is a random variable and equals the maximum of the lifetimes of the tested items.
Table 5.2 Grouped bus-motor failure data
$t_{i-1}$–$t_i$   1st   2nd   3rd   4th   5th
0–20                6    19    27    34    29
20–40              11    13    16    20    27
40–60              16    13    18    15    14
60–80              25    15    13    15     8
80–100             34    15    11    12     7
100–120            46    18    16
120–140            33     7
140–160            16     4
160–180             2
180–220             2
$t_u$             220   210   190   120   170
Total number      191   104   101    96    85
To reduce the test time, the second test scheme (termed Type I censoring or fixed-time testing) stops the test at a predetermined time $t = \tau$. In this scheme, the number of failure observations, $k$ ($0 \le k \le n$), is a random variable, and the data are given by

$$(t_1, t_2, \ldots, t_k;\ t_j = \tau^{+},\ k + 1 \le j \le n) \qquad (5.5)$$

where the superscript "+" indicates a censoring time.
For the manufacturer of a product, the field information can be used as feedback to learn about the reliability problems of the product and to improve future generations of the same or a similar product. For the user, the field information can be used to optimize the maintenance activities and the spare part inventory control policy. Many enterprises use a management information system to store the maintenance-related information. Most such systems are designed for the purpose of management rather than for the purpose of reliability analysis. As a result, the records are often ambiguous, and some important information useful for reliability analysis is missing.
In extracting field data from a management information system, there is a need to differentiate the item age from the calendar time and the inter-failure time. Figure 5.1 shows the failure point process of an item, with the repair time ignored. The
Table 5.3 An alternately censored dataset
$t_i$   110   151   255   343   404   438   644
$d_i$     1     1     0     0     1     1     0
$t_i$   658   784   796   803   958   995  1000
$d_i$     0     1     0     1     0     0     1
$t_i$  1005  1146  1204  1224  1342  1356  1657
$d_i$     1     1     1     1     1     1     1
horizontal axis is calendar time, on which the failure times can be denoted as $T = \{T_i,\ i = 1, 2, \ldots\}$. The time between two adjacent failures, $T_i - T_{i-1}$, can be either an item age (denoted as $X_i$) if the failure is corrected by a replacement, or an inter-failure time (denoted as $Y_i$) if the failure is corrected by a repair so that the item is repeatedly used. Though $X_i$ and $Y_i$ look similar, they are totally different characteristics: $X_i$ and $X_{i+1}$ come from two different items and can be i.i.d., whereas $Y_i$ and $Y_{i+1}$ come from the same item and are usually not i.i.d. As a result, the models for modeling $T$, $X$, and $Y$ can be considerably different.
When extracting the life data of multiple nominally identical and non-repairable items from a management information system, we will have many right-censored data. If the data are reordered in ascending order, we obtain an alternately censored dataset. The alternately censored dataset is different from the Type-I and Type-II datasets, where the censored observations are always larger than or equal to the failure observations. Table 5.3 shows a set of alternately censored data, where $t_i$ is a failure or censoring time and $d_i$ is the number of failures at $t_i$. In practice, we often observe several failures at the same time. Such data are called tied data and result from the grouping of data or from coarse measurement (e.g., rounding errors). In this case, we can have $d_i > 1$.
The performance of an item deteriorates with time and usage and leads to failure if
no preventive maintenance action is carried out. The degradation information is
usually obtained through condition monitoring or inspection, and is useful for life
prediction. The degradation can be measured by the variables or parameters (e.g.,
wear amount, vibration level, debris concentration in oil, noise level, etc.) that can
directly or indirectly reflect performance. Such variables or parameters are often
called covariates or condition parameters.
Another type of information useful for life prediction is the use condition and
environment data (e.g., load, stress level, use intensity, temperature, humidity, etc.).
As an example, consider accelerated life testing. It deals with testing items under conditions more severe than the normal use conditions (e.g., more intensive usage, higher stress levels, etc.) so as to make the items fail faster. Under a constant-stress accelerated testing scheme, the failure data are given by the paired data $(s_i, t_{ij})$, $1 \le i \le m$, $1 \le j \le n_i$, where $s_i$ is the stress level and can be viewed as data on the use condition and environment.
Probability plots are commonly used to identify an appropriate model for fitting a given dataset. To present the data on a plotting paper, the empirical cdf at each failure time must first be estimated using a nonparametric method. Some parameter estimation methods, such as the graphical and least squares methods, also require estimating the empirical cdf.
The nonparametric estimation method for the cdf depends on the type of dataset available. We first look at the case of complete data and then at the case of incomplete data.
For a complete dataset, the observations can be arranged in ascending order:

$$t_1 \le t_2 \le \cdots \le t_n. \qquad (5.7)$$

The empirical cdf at $t_i$ can be estimated by the median rank, $\mathrm{betainv}(0.5,\ i,\ n - i + 1)$, or by its approximation

$$\hat F_i = \frac{i - 0.3}{n + 0.4}. \qquad (5.8)$$

For grouped data, assume that there are $d_i$ failures in the interval $(t_{i-1}, t_i]$, $1 \le i \le m$. The total number of failures by $t_i$ equals $n_i = \sum_{j=1}^{i} d_j$, and the sample size is $n = n_m$. $F(t_i)$ can be estimated by $\mathrm{betainv}(0.5,\ n_i,\ n - n_i + 1)$ or by Eq. (5.8) with $i$ replaced by $n_i$.
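A sketch of the two estimators for a small complete sample; scipy's beta.ppf plays the role of Excel's betainv, and the sample values are hypothetical:

```python
from scipy.stats import beta as beta_dist

t = sorted([110, 151, 404, 438, 784, 803])   # hypothetical complete sample
n = len(t)
for i, ti in enumerate(t, start=1):
    median_rank = beta_dist.ppf(0.5, i, n - i + 1)   # betainv(0.5, i, n-i+1)
    approx = (i - 0.3) / (n + 0.4)                   # Eq. (5.8)
    print(ti, round(median_rank, 4), round(approx, 4))
```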
Example 5.1 Consider the rst set of bus-motor failure data given in Table 5.2.
Based on the median estimate of Fi , the estimates of empirical cdf are shown in the
third column of Table 5.4. The estimates obtained from Eq. (5.8) are shown in the
fourth column. As seen, the estimates from the two methods are very close to each
other. The last two columns of Table 5.4 are the corresponding Weibull transfor-
mations obtained from Eq. (4.22). The WPP plot (along with the regression straight
line of the data points) is shown in Fig. 5.2.
(Fig. 5.2 WPP plot of the first bus-motor failure dataset, with the fitted regression line)
In the Kaplan–Meier method, the number of surviving items just prior to the failure time $t_{i_k}$ is $n - i_k + 1$, and the number of surviving items just after $t_{i_k}$ is $n - i_k$. The conditional reliability is given by $R_k = \dfrac{n - i_k}{n - i_k + 1}$. As such, the empirical cdf at $t_{i_j}$ is estimated as

$$\hat F_j = 1 - \prod_{k=1}^{j} R_k, \quad t \in [t_{i_j}, t_{i_{j+1}}). \qquad (5.9)$$
The Nelson–Aalen method estimates the cumulative hazard function. Over the interval $(t_k, t_{k+1}]$, the increment of the chf is

$$\Delta H_k = \int_{t_k}^{t_{k+1}} r(t)\,dt \approx r(t)(t_{k+1} - t_k) = \frac{n f(t)\Delta t}{n R(t)} \approx \frac{d_k}{N_k}. \qquad (5.11)$$
(Fig. 5.3 WPP plots obtained from different methods for Example 5.2)
$$H_j = \sum_{k=1}^{j} \Delta H_k = \sum_{k=1}^{j} \frac{d_k}{N_k}, \quad t \in [t_j, t_{j+1}). \qquad (5.12)$$

The empirical chf is a staircase function of $t$ with $H(t_j^{-}) = H_{j-1}$ and $H(t_j) = H_j$. As such, the empirical cdf is evaluated by

$$\hat F(t_{i_j}) = 1 - e^{-H_j}. \qquad (5.13)$$
Example 5.2 (continued): For the data in Table 5.3, the empirical cdf evaluated by the Nelson–Aalen method is shown in the sixth column of Table 5.5, and the corresponding WPP plot is also displayed in Fig. 5.3.
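For the alternately censored data of Table 5.3, the Kaplan–Meier and Nelson–Aalen estimates can be computed directly; a sketch treating $d_i = 1$ as a failure and $d_i = 0$ as a censored observation:

```python
import math

ti = [110, 151, 255, 343, 404, 438, 644, 658, 784, 796, 803, 958, 995, 1000,
      1005, 1146, 1204, 1224, 1342, 1356, 1657]
di = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

n = len(ti)
S_km, H_na = 1.0, 0.0
for k, (t, d) in enumerate(zip(ti, di)):
    N = n - k                       # number at risk just prior to t
    if d:                           # failure observation
        S_km *= (N - d) / N         # Kaplan-Meier conditional reliability, Eq. (5.9)
        H_na += d / N               # Nelson-Aalen increment, Eq. (5.12)
        print(t, round(1 - S_km, 4), round(1 - math.exp(-H_na), 4))  # F_KM, F_NA
```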
Consider the data structure that the Kaplan–Meier method adopts. Let $r_j$ denote the rank order number of the $j$th failure observation. The rank order numbers are updated recursively as

$$r_j = r_{j-1} + \Delta r_j \qquad (5.15)$$

where $\Delta r_j = d_j$ for a complete dataset. For an incomplete dataset, the mean rank increment of the $j$th failure observation is

$$\Delta r_j = d_j\,\frac{n + 1 - r_{j-1}}{N_j + 1}. \qquad (5.16)$$

Using Eq. (5.16) in Eq. (5.15) and noting that $N_j + k_j = n - r_{j-1}$, we have, for $d_j = 1$,

$$r_j = r_{j-1} + \frac{N_j + k_j + 1}{N_j + 1} = r_{j-1} + \frac{n - r_{j-1} + 1}{n - i_j + 2}. \qquad (5.17)$$

This is the mean rank order estimator. The empirical cdf at $t_{i_j}$ can be evaluated by Eq. (5.8) with $i$ replaced by $r_j$, or by

$$\hat F_j = \mathrm{betainv}(0.5,\ r_j,\ n - r_j + 1). \qquad (5.18)$$
Due to the use of averaging and the median, this estimator is more robust than the Kaplan–Meier and Nelson–Aalen methods, where the cdf tends to be overestimated due to jumps in the failure number. This can be clearly seen from Table 5.5. As such, we recommend using this estimator.
Example 5.2 (continued): For the data in Table 5.3, the empirical cdf evaluated by the mean rank order method is shown in the seventh column of Table 5.5, and the corresponding WPP plot is also displayed in Fig. 5.3.
All the above three methods ignore the information in the exact positions of censored observations. The piecewise exponential method considers this information and can be viewed as an improvement of the Nelson–Aalen method.
Consider the dataset given by Eq. (5.10) and the time interval $(t_{j-1}, t_j]$. Let $\lambda_j$ denote the average failure rate in this interval. The chf at $t_j$ is given by

$$H_j = \sum_{k=1}^{j} \lambda_k (t_k - t_{k-1}), \quad t_0 = 0. \qquad (5.19)$$

Similar to the Nelson–Aalen method, the empirical chf is a staircase function of $t$. The empirical cdf is given by Eq. (5.13).
The remaining problem is to specify the value of $\lambda_j$. We first present the following relation here; its proof will be presented after we discuss the maximum likelihood method of parameter estimation:

$$\lambda_j = d_j / TTT_j \qquad (5.20)$$
where

$$TTT_j = \sum_{l=1}^{k_{j-1}} (s_{j-1,l} - t_{j-1}) + N_j (t_j - t_{j-1}). \qquad (5.21)$$
From Eq. (5.21), it is clear that the information in the exact position of a censored observation is included in the estimator of the empirical cdf.
Example 5.2 (continued): For the data in Table 5.3, the empirical cdf evaluated by the piecewise exponential method is shown in the last column of Table 5.5, and the corresponding WPP plot is also displayed in Fig. 5.3. As seen, the estimated cdf values are smaller than those from the Nelson–Aalen method and the Kaplan–Meier method.
5.3.3.5 Discussion
As seen from Table 5.5, the estimates obtained from all four methods are fairly close to each other for this example. It is worth noting that the increments of the empirical cdf from $t = 1000$ to $t = 1005$ are 0.0838, 0.0798, 0.0787, and 0.0829 for the KMM, NAM, MROM, and PEM, respectively. For such a small interval, a small increment is more reasonable; in this sense, the MROM indeed provides better estimates.
In addition, according to the WPP plots shown in Fig. 5.3, the Weibull distribution is not an appropriate model for fitting the dataset in this example.
For a given dataset and a given parametric model, parameter estimation deals with determining the model parameters. There are several methods to estimate the parameters, and different methods produce different estimates. Typical parameter estimation methods are:
Graphical method,
Method of moments,
Maximum likelihood method,
Least squares method, and
Expectation–maximization method.
The graphical parameter estimation method is useful for model selection and can be used to get initial estimates of the model parameters. Generally, the graphical method is associated with a probability plot, and different distributions have different probability plots. That is, the graphical method is distribution-specific.
In this subsection, we focus on the WPP plot, because the characteristics of the WPP plots of many Weibull-related models have been studied (see Ref. [11]).
The graphical method starts with the nonparametric estimate of the cdf. Once this is done, the data pairs $(t_j, \hat F(t_j))$ are transformed by Eq. (4.22). The WPP plot of the data is obtained by plotting $y_j$ versus $x_j$.
For the two-parameter Weibull distribution, we can fit the WPP plot of the data to a straight line $y = a + bx$ by regression. Comparing it with Eq. (4.23), we have the graphical estimates of the Weibull parameters given by

$$\hat\beta = b; \quad \hat\eta = e^{-a/b}. \qquad (5.22)$$
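A sketch of the full graphical pipeline: estimate the empirical cdf by Eq. (5.8), apply the Weibull transformations of Eq. (4.22), fit a regression line, and recover the parameters via Eq. (5.22). The data are hypothetical:

```python
import numpy as np

t = np.array([16.0, 34.0, 53.0, 75.0, 93.0, 120.0])   # hypothetical complete data
n = len(t)
F = (np.arange(1, n + 1) - 0.3) / (n + 0.4)            # Eq. (5.8)

x = np.log(t)                                          # Eq. (4.22)
y = np.log(-np.log(1.0 - F))

b, a = np.polyfit(x, y, 1)                             # fit y = a + b x
beta_hat, eta_hat = b, np.exp(-a / b)                  # Eq. (5.22)
print(beta_hat, eta_hat)
```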
For a complete dataset, the first two sample moments can be estimated as

$$m_1 = \frac{1}{n}\sum_{i=1}^{n} t_i; \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (t_i - m_1)^2. \qquad (5.23)$$

For grouped data with interval length $\Delta t$, under the assumption that the data points are uniformly distributed over each interval, the first two sample moments can be estimated as

$$m_1 = \sum_{i=1}^{m} \left(t_i - \frac{\Delta t}{2}\right)\frac{n_i}{n}; \quad s^2 = \frac{1}{3}\sum_{i=1}^{m} \frac{n_i}{n\Delta t}\left[(t_i - m_1)^3 - (t_{i-1} - m_1)^3\right]. \qquad (5.24)$$
On the other hand, the theoretical moments (e.g., $\mu$ and $\sigma^2$) of a distribution are functions of the distributional parameters. The parameters can be estimated by letting the theoretical moments equal the corresponding sample moments. This method is termed the method of moments. It requires solving an equation system using an analytical or numerical method.
For a single-parameter model, we use the first-order moment (i.e., the mean); for a two-parameter model, we can use both the first- and second-order moments (i.e., mean and variance). Clearly, the method of moments is applicable only for situations where the sample moments can be obtained. For example, this method is not applicable for Example 5.2.
To illustrate, we look at Example 5.1. From Eq. (5.24), the first two sample moments are estimated as $(m_1, s) = (96.91, 38.3729)$. Assume that the Weibull distribution is appropriate for fitting the data. We need to solve the following equation system:

$$\eta\,\Gamma(1 + 1/\beta) = m_1; \quad \eta^2\left[\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)\right] = s^2. \qquad (5.25)$$

Using the Solver of Microsoft Excel, we obtain the solution of Eq. (5.25) shown in the third row of Table 5.6.
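Instead of Excel's Solver, the moment equations (5.25) can be solved with a numerical root finder; a sketch assuming scipy is available:

```python
from scipy.optimize import fsolve
from scipy.special import gamma

m1, s = 96.91, 38.3729    # sample moments of Example 5.1

def equations(params):
    beta, eta = params
    mean = eta * gamma(1 + 1 / beta)                               # Eq. (3.22)
    var = eta ** 2 * (gamma(1 + 2 / beta) - gamma(1 + 1 / beta) ** 2)
    return [mean - m1, var - s ** 2]                               # Eq. (5.25)

beta_hat, eta_hat = fsolve(equations, x0=[2.0, 100.0])
print(beta_hat, eta_hat)
```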
Let $\theta$ denote the parameter set of a distribution function $F(t)$. The likelihood function of an observation is defined as follows:
$L(t) = f(t; \theta)$ for a failure observation $t$,
$L(t) = F(t; \theta)$ for a left-censored observation $t$,
$L(t) = R(t; \theta)$ for a right-censored observation $t$,
$L(t) = F(b; \theta) - F(a; \theta)$ for an interval observation $t \in (a, b]$.
For a given dataset, the overall likelihood function is given by

$$L(\theta) = \prod_{i=1}^{n} L_i(\theta) \qquad (5.26)$$
where $L_i(\theta)$ is the likelihood function of the $i$th observation and depends on the distributional parameters. The maximum likelihood method (MLM) is based on the idea that if an event occurs in a single sampling, it should have the greatest probability. As such, the parameter set is determined by maximizing the overall likelihood function given by Eq. (5.26) or its logarithm given by

$$\ln L(\theta) = \sum_{i=1}^{n} \ln L_i(\theta). \qquad (5.27)$$
Compared with Eq. (5.26), Eq. (5.27) is preferred since $L(\theta)$ is usually very small. The maximum likelihood estimates (MLEs) of the parameters can be obtained using Solver to directly maximize $\ln L(\theta)$.
The MLM has a sound theoretical basis and is suitable for various data types. Its major limitation is that the MLEs of the parameters may be nonexistent for distributions whose location parameter is the lower or upper limit of the lifetime. In this case, the maximum spacing method can be used to estimate the parameters for a complete dataset without ties (see Ref. [5]). For an incomplete dataset or a complete dataset with ties, one needs to use its variants (see Ref. [7] and the literature cited therein).
Using the MLM to fit the Weibull distribution to the dataset of Example 5.1, we obtained the estimated parameters shown in the fourth row of Table 5.6. The last row of Table 5.6 shows the average of the mean life estimates obtained from the different estimation methods. As seen, the mean life estimate obtained from the MLM is the closest to this average.
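A sketch of the MLM for a dataset containing failures and right-censored observations, maximizing Eq. (5.27) numerically for the Weibull distribution; the data are a hypothetical subset in the style of Table 5.3:

```python
import numpy as np
from scipy.optimize import minimize

t = np.array([110.0, 151.0, 255.0, 343.0, 404.0, 438.0, 644.0])
d = np.array([1, 1, 0, 0, 1, 1, 0])     # 1 = failure, 0 = right-censored

def neg_log_lik(params):
    beta, eta = params
    if beta <= 0 or eta <= 0:
        return np.inf
    z = (t / eta) ** beta
    # failures contribute ln f(t); censored observations contribute ln R(t)
    log_f = np.log(beta / eta) + (beta - 1) * np.log(t / eta) - z
    log_R = -z
    return -np.sum(d * log_f + (1 - d) * log_R)

res = minimize(neg_log_lik, x0=[1.5, 500.0], method="Nelder-Mead")
print(res.x)    # MLEs of (beta, eta)
```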
We now prove Eq. (5.20) using the MLM. Consider the time interval $(t_{j-1}, t_j]$. There are $M_{j-1}$ surviving items just after $t_{j-1}$; there are $k_{j-1}$ censored observations in this interval, with values $(s_{j-1,l},\ 1 \le l \le k_{j-1})$ satisfying $t_{j-1} \le s_{j-1,l} < t_j$; there are $d_j$ failure observations at $t_j$; and the other $M_{j-1} - k_{j-1} - d_j$ observations are larger than $t_j$. Assume that $F(t)$ can be locally approximated by the exponential distribution with failure rate $\lambda_j$. To derive the overall likelihood function, we divide the observations that have survived to $t_{j-1}$ into three parts: censored observations in the interval $(t_{j-1}, t_j)$, failure observations at $t_j$, and the other observations that are larger than $t_j$. Their log-likelihood functions are given, respectively, by

$$\ln L_1 = \sum_{l=1}^{k_{j-1}} \ln e^{-\lambda_j (s_{j-1,l} - t_{j-1})} = -\lambda_j \sum_{l=1}^{k_{j-1}} (s_{j-1,l} - t_{j-1}),$$

$$\ln L_2 = d_j \ln\!\left[\lambda_j e^{-\lambda_j (t_j - t_{j-1})}\right] = d_j \ln\lambda_j - \lambda_j d_j (t_j - t_{j-1}),$$

$$\ln L_3 = -\lambda_j (M_{j-1} - k_{j-1} - d_j)(t_j - t_{j-1}).$$

Summing these and using Eq. (5.21) gives

$$\ln L = \sum_{i=1}^{3} \ln L_i = d_j \ln\lambda_j - \lambda_j\,TTT_j. \qquad (5.28)$$

Maximizing Eq. (5.28) with respect to $\lambda_j$ yields Eq. (5.20).
The least squares method (LSM) is a curve-fitting technique. Similar to the graphical method, it needs the nonparametric estimate of the cdf. Let $\hat F_j$ be the nonparametric estimate at a failure time $t_j$, $1 \le j \le m$, and $F(t; \theta)$ denote the cdf to be fitted. The parameters are estimated by minimizing the SSE given by

$$SSE = \sum_{j=1}^{m} \left[F(t_j; \theta) - \hat F_j\right]^2. \qquad (5.29)$$
The least squares estimates of the parameters for Example 5.1 are shown in the fifth row of Table 5.6.
The expectation–maximization method iterates between two steps. The first step is the Expectation step: a right-censored observation $t^{+}$ is replaced by its expected life

$$t^{*} = t^{+} + MRL(t^{+}) \qquad (5.30)$$

where $MRL(t^{+})$ is the mean residual life function evaluated at $t^{+}$. Using $t^{*}$ to replace $t^{+}$, the incomplete dataset is transformed into an equivalently complete dataset.
The second step is the Maximization step. It applies the MLM to the equivalently complete dataset to estimate the parameters. After that, the expected life of each censored observation is updated using the new estimates of the parameters. The process is repeated until convergence. Using an Excel spreadsheet program, the iterative process can be completed conveniently.
Since the expectation step reduces the randomness of the censored data, the model fitted by this method tends to have smaller dispersion (i.e., it overestimates $\beta$ for the Weibull distribution).
When the sample size $n$ is not small and the dataset is given in the form of interval data or can be transformed into interval data, the goodness of fit of a fitted model (with $m$ parameters) can be evaluated using the chi-square statistic given by

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}, \quad E_i = n\left[F(t_i) - F(t_{i-1})\right]. \qquad (5.31)$$
The smaller the $\chi^2$, the better the fitted model. The goodness of fit can be measured by the p-value given by $p_v = \Pr\{Q > \chi^2\}$, where $Q$ is a chi-squared random variable with $k - 1 - m$ degrees of freedom. The larger the p-value (i.e., the smaller $\chi^2$), the better the goodness of fit. To accept a fitted model, we usually require $p_v \ge 0.1$–$0.3$ (e.g., see Ref. [2]). It is noted that this range is much larger than the commonly used significance level (0.01 or 0.05).
To illustrate, we look at Example 5.1. We merge the last two intervals to make $n_k$ larger than or close to 5. In this case, $k = 9$, $m = 2$, $\chi^2 = 24.42$, and $p_v = 0.0004$. Since $p_v \ll 0.1$, we conclude that the Weibull distribution is not an appropriate model for fitting the first bus-motor failure dataset.
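Once $\chi^2$ is known, the p-value is a one-line computation; a sketch confirming the figures quoted for Example 5.1:

```python
from scipy.stats import chi2

chi_sq, k, m = 24.42, 9, 2
df = k - 1 - m
p_value = chi2.sf(chi_sq, df)     # Pr{Q > chi^2} with 6 degrees of freedom
print(df, p_value)                # about 0.0004, so the Weibull fit is rejected
```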
Example 5.3 Consider the first bus-motor failure dataset. It was mentioned earlier that the bus-motor major failure is due to two failure modes: serious accidents and performance deterioration. This implies that the twofold Weibull competing risk model given by Eq. (4.31) can be appropriate for fitting this dataset. The MLEs of the model parameters are shown in the second row of Table 5.7. By merging the last …
Table 5.7 Parameters of the twofold competing risk models for Example 5.3
           $\beta_1$   $\eta_1$   $\beta_2$   $\eta_2$   ln L        m   AIC
Model 0    1.2939      279.84     4.2104      122.27     −384.962    4   1166.887
Model 1    1           530.11     3.9341      118.66     −385.182    3   1164.545
D_n = max_{1≤i≤n} { max( |F(t_i) − i/n|, |F(t_i) − (i − 1)/n| ) }.    (5.32)
If the sample comes from the distribution F(t), then D_n will be sufficiently small. The null hypothesis is rejected at level α if D_n > k_α, where k_α is the critical value at significance level α. The critical value of the test statistic can be approximated by
k_c = (a/√n)(1 + b/n)^c.    (5.33)
The coefficient set (a, b, c) is given in Table 5.8, and the relative error (ε, %) is shown in Fig. 5.4. As seen, the relative error is smaller than 0.7 % for n ≥ 5.
Often one considers more than one candidate model and chooses the best one from the fitted models based on some criterion. Determination of the candidate models can be based on the failure mechanism, experience, or a graphical approach.
The numbers of parameters of the candidate models can be the same or different. If the numbers of parameters are the same, we can directly compare the performance measures of the fitted models. The performance measure is the maximum log-likelihood value if the parameters are estimated by the MLM; it is the sum of squared errors if the parameters are estimated by the LSM. The selection gives the model with the largest log-likelihood value or the smallest sum of squared errors.
[Fig. 5.4 Relative error (ε, %) of the approximation given by Eq. (5.33)]
Suppose that there are two candidate models (denoted as Model 0 and Model 1, respectively). Model 1 is a special case of Model 0; namely, Model 1 is nested within Model 0. For example, the exponential distribution is a special case of the Weibull distribution or the gamma distribution. The MLM is used to fit the two candidate models to a given dataset. Let ln L₀ and ln L₁ denote their log-likelihood values, respectively. The test statistic is given by

LR = 2(ln L₀ − ln L₁),    (5.34)

which, under the null hypothesis that the nested model is adequate, asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference between the numbers of parameters of the two models.
The information criterion is appropriate for model selection when the candidate models are either nested or non-nested. A statistical model should strike an appropriate tradeoff between model simplicity and goodness of fit. The Akaike information criterion [1] incorporates these two concerns by penalizing extra model parameters to avoid possible over-fitting. In terms of the log-likelihood, the Akaike information criterion (AIC) is defined as

AIC = −2 ln L + 2m.    (5.35)

A smaller AIC implies a better model. As such, the best model is the one with the smallest AIC.
The AIC given by Eq. (5.35) is applicable for the cases where the sample size is large and m is comparatively small. If the sample size is small relative to m, the penalty given by the AIC is not enough, and several modifications have been proposed in the literature (e.g., see Refs. [3, 6]).
To illustrate, we look at Example 5.3. The values of AIC for the two candidate models are shown in the last column of Table 5.7. The exponential–Weibull competing risk model has a smaller AIC and hence is preferred. This is consistent with the conclusion obtained from the likelihood ratio test.
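The following sketch illustrates this kind of comparison for a nested exponential–Weibull pair, using the standard definitions AIC = 2m − 2 ln L and LR = 2(ln L₀ − ln L₁). The function name is illustrative and the data are assumed to be complete.

```python
import numpy as np
from scipy import stats

def aic_compare(data):
    """Fit nested candidate models by MLE and compare them by AIC.

    The exponential (m = 1) is nested within the Weibull (m = 2)
    at beta = 1, so a likelihood ratio test is also meaningful.
    """
    data = np.asarray(data, dtype=float)
    # Weibull MLE (location fixed at zero)
    c, _, scale = stats.weibull_min.fit(data, floc=0)
    ll_weib = stats.weibull_min.logpdf(data, c, 0, scale).sum()
    # Exponential MLE: lambda = 1 / sample mean
    lam = 1.0 / data.mean()
    ll_exp = np.sum(np.log(lam) - lam * data)
    aics = {"weibull": 2 * 2 - 2 * ll_weib, "exponential": 2 * 1 - 2 * ll_exp}
    # likelihood ratio test of the nested pair (1 degree of freedom)
    lr = 2 * (ll_weib - ll_exp)
    p = stats.chi2.sf(lr, df=1)
    return aics, lr, p
```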
References
1. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
2. Blischke WR, Murthy DNP (2000) Reliability: modeling, prediction, and optimization. Wiley, New York
3. Burnham KP, Anderson DR (2004) Multimodel inference: understanding AIC and BIC in model selection. Sociol Methods Res 33(2):261–304
4. Davis DJ (1952) An analysis of some failure data. J Am Stat Assoc 47(258):113–150
5. Ekström M (2008) Alternatives to maximum likelihood estimation based on spacings and the Kullback–Leibler divergence. J Stat Plan Infer 138(6):1778–1791
6. Hurvich CM, Tsai CL (1989) Regression and time series model selection in small samples. Biometrika 76(2):297–307
7. Jiang R (2013) A new bathtub curve model with a finite support. Reliab Eng Syst Saf 119:44–51
8. Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481
9. Kim JS, Proschan F (1991) Piecewise exponential estimator of the survivor function. IEEE Trans Reliab 40(2):134–139
10. Lehmann EL, Romano JP (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
11. Murthy DNP, Xie M, Jiang R (2003) Weibull models. Wiley, New York
12. Nelson W (1982) Applied life data analysis. Wiley, New York
Chapter 6
Reliability Modeling of Repairable
Systems
6.1 Introduction
Most of the models presented in Chaps. 3 and 4 are univariate life distributions. Such models are suitable for modeling an i.i.d. random variable (e.g., the time to the first failure), and represent the average behavior of the population's reliability characteristics.
A repairable system can fail several times since the failed system can be restored to its operating condition through corrective maintenance actions. If the repair time is neglected, the times to failure form a failure point process. The time between the (i − 1)th failure and the ith failure, X_i, is a continuous random variable. Depending on the effect of the maintenance actions, the inter-failure times X_i are generally not i.i.d. As such, we need new models and methods for modeling the failure process. This chapter focuses on such models and methods.
There are two categories of models for modeling a failure process. In the first category, the underlying random variable is N(t), the number of failures by time t; in the second category, the underlying random variable is X_i or T_i = Σ_{j=1}^{i} X_j, the time to the ith failure. We call the first category the discrete models (which are actually counting process models) and the second category the continuous models (which are actually variable-parameter distribution models).
The model and method for modeling a given failure process depend on whether
or not the inter-failure times have a trend. As such, the trend analysis for a failure
process plays a fundamental role in reliability analysis of repairable systems. When
the trend analysis indicates that there is no trend for a set of inter-failure times, a
further test for their randomness is needed.
This chapter is organized as follows. We first look at the failure counting process models in Sect. 6.2, and then look at the distribution models in Sect. 6.3.
M(t) = F(t) + ∫₀ᵗ M(t − x) f(x) dx.    (6.1)

For large t, M(t) can be approximated by the asymptotic expression

M(t) ≈ t/μ + σ²/(2μ²) − 1/2,    (6.2)

where μ and σ are the mean and standard deviation of the inter-failure time. The variance of N(t) is given by

V(t) = Σ_{n=1}^{∞} (2n − 1) F^{(n)}(t) − M²(t),    (6.3)

where F^{(n)}(t) is the n-fold convolution of F(t) with itself.
For a repairable system, a renewal process assumes that the system is returned to an as-new condition every time it is repaired. As such, the distribution of X_i is the same as the distribution of X₁. For a multi-component series system, if each component is replaced by a new one when it fails, then the system failure process is a superposed renewal process. In general, a superposed renewal process is not a renewal process; in fact, it is close to a minimal repair process when the number of components is large.
If the times between failures are independently and identically exponentially distributed, the renewal process reduces to a homogeneous Poisson process (also termed a stationary Poisson process). In this case, N(t) follows a Poisson distribution with Poisson parameter λt, where λ is the failure intensity.
Depending on the time origin and observation window, the power-law model can have two variants. If we begin the failure counting process at t = d (either known or unknown) and set this time as the time origin, then Eq. (6.5) can be revised as

M(t) = ((t + d)/η)^β − (d/η)^β.    (6.6)
Suppose we have several failure point processes that come from nominally identical systems with different observation windows (0, T_i), 1 ≤ i ≤ n. Arrange all the failure data in ascending order. The ordered data are denoted as

t₁(s₁) ≤ t₂(s₂) ≤ ⋯ ≤ t_J(s_J) ≤ T = max(T_i, 1 ≤ i ≤ n),    (6.8)
where the t_j are failure times (i.e., not including censored times) and s_j is the number of systems under observation at t_j. The nonparametric estimate of the MCF is given by the staircase function

M̂(t_j) = M̂(t_{j−1}) + 1/s_j,  M̂(t₀) = 0.
For a given theoretical model M_θ(t) such as the power-law model, the parameter set θ can be estimated by the MLM or the LSM. The LSM is simple and estimates the parameters by minimizing the sum of squared errors given by

SSE = Σ_{j=1}^{m} [M_θ(t_j) − M̂(t_j)]².    (6.11)
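As an illustration, the following Python sketch builds the empirical MCF described above and fits the power-law model by the LSM; the staircase estimate used and the helper names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_power_law_mcf(failure_times, windows):
    """Least-squares fit of the power-law MCF M(t) = (t/eta)**beta.

    failure_times : pooled failure times from all systems
    windows       : observation window length T_i of each system
    The empirical MCF rises by 1/s_j at each ordered failure time t_j,
    where s_j is the number of systems still under observation.
    """
    t = np.sort(np.asarray(failure_times, dtype=float))
    T = np.asarray(windows, dtype=float)
    s = np.array([(T >= tj).sum() for tj in t])   # systems observed at t_j
    mcf = np.cumsum(1.0 / s)                      # empirical MCF
    model = lambda x, beta, eta: (x / eta) ** beta
    (beta, eta), _ = curve_fit(model, t, mcf, p0=[1.0, t.mean()])
    return beta, eta, t, mcf
```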
We consider three categories of models that can be used for modeling failure
processes in different situations. They are:
Ordinary life distribution models;
Imperfect maintenance models; and
Distribution models with the parameters varying with the number of failures or the system age.
We briefly discuss them below.
Ordinary life distribution models can be used to model the renewal process and
minimal repair process. When each failure is corrected by a replacement or perfect
repair, times to failure form a renewal process, whose inter-failure times are i.i.d.
random variables and hence can be modeled by an ordinary life distribution.
When each failure is corrected by a minimal repair, the times to failure form a minimal repair process. After a minimal repair is completed at age t, the time to the next failure follows the conditional distribution of the underlying distribution (i.e., the distribution of X₁ = T₁). This implies that the distribution of the inter-failure times can be expressed in terms of the underlying distribution even though they are not i.i.d. random variables.
When each failure is corrected by either a replacement or a minimal repair, the inter-failure times can be modeled by a statistical distribution. Brown and Proschan [2] developed such a model. Here, the item is returned to the good-as-new state with probability p and to the bad-as-old state with probability q = 1 − p. The parameter p can be constant or time-varying. The process reduces to a renewal process when p = 1 and to a minimal repair process when p = 0.
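A minimal simulation sketch of the Brown–Proschan model, assuming a Weibull underlying distribution; the conditional sampling step uses the inverse of the conditional survivor function, and the function name is illustrative.

```python
import numpy as np

def simulate_brown_proschan(beta, eta, p, horizon,
                            rng=np.random.default_rng(1)):
    """Simulate the Brown-Proschan imperfect-repair model.

    After each failure the item is restored to good-as-new with
    probability p (renewal) and to bad-as-old with probability 1 - p
    (minimal repair).  The underlying life distribution is
    Weibull(beta, eta); sampling inverts the conditional survivor
    function given survival past the current virtual age v.
    """
    t, v, failures = 0.0, 0.0, []   # t = calendar age, v = virtual age
    while True:
        u = rng.random()
        # next failure at virtual age v': R(v')/R(v) = 1 - u inverted
        v_next = (v ** beta - eta ** beta * np.log(1 - u)) ** (1 / beta)
        t += v_next - v
        if t > horizon:
            return failures
        failures.append(t)
        v = 0.0 if rng.random() < p else v_next   # perfect vs minimal repair
```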
When each failure is corrected by imperfect maintenance, the time to the next failure depends on the effects of the prior maintenance actions. As such, the ordinary life distribution is no longer applicable, and a category of imperfect maintenance models can be used for modeling subsequent failures.
Preventive maintenance (PM) aims to maintain a working item in a satisfactory
condition. The PM is often imperfect, whose effect is in between the perfect
maintenance and minimal maintenance. As such, the effect of PM can be repre-
sented by an imperfect maintenance model.
There are a large number of imperfect maintenance models in the literature. Pham and Wang [11] present a review of imperfect maintenance models, and Wu [16] provides a comprehensive review of the PM models (which are actually imperfect maintenance models). Several typical imperfect maintenance models will be presented in Chap. 16.
This category of models assumes that the X_i can be represented by the same life distribution family F(x; θ_i) with the parameter set θ_i being a function of i or t_i. Clearly, when θ_i is independent of i or t_i, the model reduces to an ordinary distribution model.
For the bus-motor data shown in Table 5.2, Jiang [4] presents a normal variable-parameter model, whose parameters vary with i; and Jiang [6] presents a Weibull variable-parameter model, whose parameters are also functions of i. The main advantage of these models is that they can be used to infer the life distribution after a future failure.
6.4.1 An Illustration
Example 6.1 The data shown in Table 6.1 come from Ref. [12] and deal with
failure times (in 1000 h) of a repairable component in a manufacturing system.
Under the assumption that the times to failure form an RP with the underlying
distribution being the Weibull distribution, we obtained the MLEs of the parameters
shown in the second column of Table 6.2.
Under the NHPP assumption with the MCF given by Eq. (6.5) (i.e., the power-
law model), we obtained the MLEs of the parameters shown in the third column of
Table 6.2 (for the MLE of the power-law model, see Sect. 11.5.3.1). The empirical and fitted MCFs are shown in Fig. 6.1.
From Table 6.2 and Fig. 6.1, we have the following observations:
the parameters of the fitted models are significantly different, but
the plots of M(t) are close to each other.
The question is which model we should use. The answer depends on the appropriateness of the assumption for the failure process. This deals with testing whether the failure process is stationary and whether the inter-failure times are i.i.d. Such tests are called tests for trend and tests for randomness, respectively. As a result, a procedure is needed to combine these tests into the modeling process.
Modeling a failure point process involves a multi-step procedure. Specific steps are outlined as follows.
Step 1: Draw the plot of the MCF of the data and other plots (e.g., the running arithmetic average plot, which will be presented later). If the plots indicate that the trend is obvious, implement Step 3; otherwise, implement Step 2.
[Fig. 6.1 Empirical and fitted MCFs M(t) for the data in Table 6.1: NHPP (power-law) fit and asymptotic RF approximation]
Step 2: If the trend is not very obvious, carry out one or more tests for stationarity to further check for a trend. If no trend is confirmed, a further test of the i.i.d. assumption needs to be carried out. If the i.i.d. assumption is confirmed, the data can be modeled by an appropriate life distribution model.
Step 3: This step is implemented when the inter-failure times have a trend or are not i.i.d. In this case, the data should be modeled using nonstationary models such as the power-law model, variable-parameter models, or the like.
When the null hypothesis of no trend is rejected at the given level of significance, we conclude that there exists a trend in the process. However, when we cannot reject the null hypothesis at the given level of significance, it does not necessarily imply that we accept the null hypothesis unless the test has a particularly high power (which is the probability of correctly rejecting the null hypothesis given that it is false [10]). This is because the conclusion is made under the assumption that the null hypothesis is true and depends on the significance level (which is the probability that the null hypothesis is rejected given that it is true [10]), whose value is commonly small (0.05 or 0.01).
In this section, we present several tests for stationarity. We will use the data
shown in Table 6.1 to illustrate each test.
A plot of the data helps get a rough impression of trend before conducting a quantitative trend test. One such plot is the empirical MCF. If the process is stationary, the plot of the empirical MCF is approximately a straight line through the origin.
Another useful plot is the plot of the running arithmetic average. Consider a set of inter-failure times x_i, 1 ≤ i ≤ n. Let t_i = Σ_{j=1}^{i} x_j. The running arithmetic average is defined as

r_i = t_i / i,  i = 1, 2, ….    (6.12)
If the running arithmetic average increases with the failure number, the average time between failures is increasing, implying that the system's reliability improves with time. Conversely, if the running arithmetic average decreases with the failure number, the average time between failures is decreasing, implying that the system's reliability deteriorates with time. In other words, if the process is stationary, the plot of the running arithmetic average is approximately a horizontal line.
Figure 6.2 shows the plot of the running arithmetic average for the data in Table 6.1. As seen, the reliability improves at the beginning and then becomes stationary.
For this case, one could implement the second step or directly go to the third step.
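Computing the running arithmetic average is straightforward; a one-function sketch:

```python
import numpy as np

def running_average(inter_failure_times):
    """Running arithmetic average r_i = t_i / i of Eq. (6.12)."""
    x = np.asarray(inter_failure_times, dtype=float)
    t = np.cumsum(x)                       # failure times t_i
    return t / np.arange(1, len(x) + 1)    # r_i

# an increasing r_i suggests improving reliability, a decreasing r_i
# suggests deterioration, and a flat plot suggests stationarity
```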
Tests with the HPP null hypothesis include the Crow test, the Laplace test, and the Anderson–Darling test.
[Fig. 6.2 Running arithmetic average r(i) versus failure number i for the data in Table 6.1]
This test was developed by Crow [3] and is based on the power-law model given by Eq. (6.5). When β = 1, the failure process follows an HPP. As such, the test examines whether an estimate of β is significantly different from 1. The null hypothesis is β = 1 and the alternative hypothesis is β ≠ 1.
For one system on test, the maximum likelihood estimate of β is

β̂ = n / Σ_{i=1}^{n} ln(T/t_i),    (6.13)

where n is the number of observed failures and T is the censoring time, which can be larger than or equal to t_n. The test statistic 2n/β̂ follows a chi-squared distribution with 2n degrees of freedom. The rejection criterion for the null hypothesis H₀ is given by

2n/β̂ < χ²_{2n,1−α/2}  or  2n/β̂ > χ²_{2n,α/2},    (6.14)

where χ²_{k,p} is the inverse of the one-tailed probability of the chi-squared distribution associated with probability p and k degrees of freedom.
Example 6.2 Test the stationarity of the data in Table 6.1 using the Crow test.
From Eq. (6.13), we have β̂ = 0.9440 and 2n/β̂ = 25.423. For significance level α = 0.05, χ²_{2n,α/2} = 39.364 and χ²_{2n,1−α/2} = 12.401. As a result, we cannot reject H₀.
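A sketch of the Crow test following Eqs. (6.13) and (6.14); note that scipy's ppf is a lower-tail quantile, so the book's upper-tail notation is mapped accordingly.

```python
import numpy as np
from scipy import stats

def crow_test(failure_times, T, alpha=0.05):
    """Crow test for stationarity against the power-law NHPP.

    failure_times : ordered failure times t_1 <= ... <= t_n
    T             : end of the observation window (T >= t_n)
    """
    t = np.asarray(failure_times, dtype=float)
    n = len(t)
    beta_hat = n / np.sum(np.log(T / t))        # Eq. (6.13)
    stat = 2 * n / beta_hat                     # chi-square with 2n dof
    lo = stats.chi2.ppf(alpha / 2, 2 * n)       # book's chi2_{2n,1-alpha/2}
    hi = stats.chi2.ppf(1 - alpha / 2, 2 * n)   # book's chi2_{2n,alpha/2}
    reject = stat < lo or stat > hi             # Eq. (6.14)
    return beta_hat, stat, reject
```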
U = Σ_{i=1}^{n−1} t_i.    (6.15)

Under H₀, the mean and standard deviation of U are

μ_U = (n − 1)t_n/2,  σ_U = t_n √((n − 1)/12).    (6.16)

The test statistic is the standard normal score given by Z = (U − μ_U)/σ_U. For large n, Z approximately follows a standard normal distribution. The rejection criterion for H₀ is given by

Z < z_{α/2}  or  Z > z_{1−α/2}.    (6.17)
Example 6.2 (continued) Test the stationarity of the data in Table 6.1 using the Laplace test.
From Eqs. (6.15) and (6.16), we have U = 132.687, μ_U = 121.682, σ_U = 21.182, and Z = 0.5280. For α = 0.05, z_{α/2} = −z_{1−α/2} = −1.96. As a result, we cannot reject H₀.
The Anderson–Darling test for trend is based on the Anderson–Darling test statistic given by (see Ref. [8])

AD = −n₀ − (1/n₀) Σ_{i=1}^{n₀} (2i − 1)[ln(t_i/T) + ln(1 − t_{n₀+1−i}/T)].    (6.18)
From Eq. (6.18), we have AD = 0.2981, and hence the null hypothesis is not rejected for α = 0.05.
Tests with the RP null hypothesis include the Mann test, the Lewis–Robinson test, and the generalized Anderson–Darling test.
This test is presented in Ref. [1] and is sometimes called the reverse arrangement test or the pairwise-comparison nonparametric test (see Refs. [14, 15]). The null hypothesis is a renewal process and the alternative hypothesis is a non-renewal process. The test compares all the interarrival times x_j and x_i for j > i. Let u_{ij} = 1 if x_j > x_i; otherwise u_{ij} = 0. The number of reversals of the data is given by

U = Σ_{i<j} u_{ij}.    (6.19)

Too many reversals indicate an increasing trend, too few reversals imply a decreasing trend, and there is no trend if the number of reversals is neither large nor small.
Under H₀, the mean and variance of U are given, respectively, by

μ_U = n(n − 1)/4,  σ²_U = n(n − 1)(2n + 5)/72.    (6.20)
The test statistic is the standard normal score Z = (U − μ_U)/σ_U. For large n (e.g., n ≥ 10), Z approximately follows a standard normal distribution. The rejection criterion for H₀ is given by Eq. (6.17).
Example 6.2 (continued) Test the stationarity of the data in Table 6.1 using the Mann test.
From Eqs. (6.19) and (6.20), we have U = 44, μ_U = 33, σ_U = 7.2915 and Z = 1.5086. As a result, we cannot reject H₀ for α = 0.05.
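A sketch of the Mann test following Eqs. (6.19) and (6.20); the null mean and variance used are the standard ones quoted above.

```python
import numpy as np
from scipy import stats

def mann_test(x, alpha=0.05):
    """Mann reversal-count test with renewal-process null hypothesis.

    Counts the reversals U = #{(i, j): i < j and x_j > x_i} among the
    inter-failure times and standardizes with the null mean/variance.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    U = sum(x[j] > x[i] for i in range(n) for j in range(i + 1, n))
    mu = n * (n - 1) / 4                       # Eq. (6.20)
    var = n * (n - 1) * (2 * n + 5) / 72
    Z = (U - mu) / np.sqrt(var)
    reject = abs(Z) > stats.norm.ppf(1 - alpha / 2)
    return U, Z, reject
```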
The Lewis–Robinson test statistic is LR = Z/CV, where Z is the Laplace test statistic and CV is the coefficient of variation of the observed interarrival times. The critical value for rejecting H₀ is shown in the third column of Table 6.3.
Example 6.2 (continued) Test the stationarity of the data in Table 6.1 using the Lewis–Robinson test.
Using the approach outlined above, we have Z = 0.5280, CV = 0.4051 and LR = 1.3034. As a result, we still cannot reject H₀ for α = 0.05.
The test statistic of the generalized Anderson–Darling test is given by (see Ref. [8])

GAD = (4n x̄²/σ̂²) Σ_{i=1}^{n} [ q_i² ln( i²/((i − 1)(i + 1)) ) + 2 q_i r_i ln(1 − i/n) + r_i² ],    (6.21)

where

q_i = (t_i − i x̄)/t_n,  r_i = n x_i/t_n − 1,  σ̂² = (1/(2(n − 1))) Σ_{i=1}^{n−1} (x_{i+1} − x_i)²,

with the conventions

q_i² ln( i²/((i − 1)(i + 1)) ) |_{i=1} ≡ 0,  q_i r_i ln(1 − i/n) |_{i=n} ≡ 0.
The test is one-sided and the null hypothesis is rejected if GAD is greater than the critical value, which is shown in the last column of Table 6.3.
Example 6.2 (continued) Test the stationarity of the data in Table 6.1 using the generalized Anderson–Darling test.
From Eq. (6.21), we have GAD = 1.3826, and hence we cannot reject the null hypothesis for α = 0.05.
The performances of the tests discussed above have been studied (see Refs. [8, 9, 15]), and the results are summarized in Table 6.4. It is noted that no test performs very well for the decreasing-trend case.
Randomness means that the data are not deterministic and/or periodic. Tests for
randomness fall into two categories: nonparametric methods and parametric
methods. In this section, we focus on nonparametric methods.
(x_j, x_{j+1}),  1 ≤ j ≤ n − 1.    (6.22)
Under the null hypothesis that the data are random, the number of runs M is a discrete random variable with mean and variance given by

μ_M = 2n₁(n − n₁)/n + 1,  σ²_M = (μ_M − 1)[2n₁(n − n₁) − n]/[n(n − 1)],    (6.23)

where n₁ = Σ_{i=1}^{n} r_i. For n ≥ 10, M approximately follows the normal distribution. The test statistic is the standard normal score given by

Z = (M − μ_M)/σ_M.    (6.24)
For this example, the median is x₀.₅ = 1.771. The values of r_i are shown in the third column of Table 6.5. As seen, M₀ = 5 and hence M = 6. From Eqs. (6.23) and (6.24), we have μ_M = 7, σ_M = 1.6514 and Z = −0.6055. When α = 0.05, z_{α/2} = −1.96 and z_{1−α/2} = 1.96. As a result, the null hypothesis is not rejected.
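A sketch of the runs test about the median following Eqs. (6.23) and (6.24):

```python
import numpy as np
from scipy import stats

def median_runs_test(x, alpha=0.05):
    """Runs test about the median for randomness.

    r_i = 1 if x_i lies above the sample median, else 0; M is the
    number of runs of identical symbols in the r_i sequence.
    """
    x = np.asarray(x, dtype=float)
    r = (x > np.median(x)).astype(int)
    M = 1 + np.sum(r[1:] != r[:-1])          # number of runs
    n, n1 = len(x), r.sum()
    mu = 2 * n1 * (n - n1) / n + 1           # Eq. (6.23)
    var = (mu - 1) * (2 * n1 * (n - n1) - n) / (n * (n - 1))
    Z = (M - mu) / np.sqrt(var)              # Eq. (6.24)
    reject = abs(Z) > stats.norm.ppf(1 - alpha / 2)
    return M, Z, reject
```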
The test statistic is Z = (P − μ_P)/σ_P, and the critical values are given by Eq. (6.25).
Example 6.3 (continued) Test the randomness of the data in Table 6.1 using the sign test.
The values of S_i are shown in the fourth column of Table 6.5. From these values, we have m = 11, P = 7 and Z = 1.5667. As a result, the null hypothesis is not rejected at the significance level of 5 %.
Let R denote the number of runs of the S_i (which are defined in Sect. 6.6.2). Under the null hypothesis, R is approximately a normal random variable with mean and variance given by

μ_R = (2m + 1)/3,  σ²_R = (16m − 13)/90.    (6.27)

The test statistic is the normal score Z = (R − μ_R)/σ_R and the critical values are given by Eq. (6.25).
Example 6.3 (continued) Test the randomness of the data in Table 6.1 using the runs up and down test.
The numbers of runs of the S_i are shown in the last column of Table 6.5. We have m = 11, R = 6 and Z = −1.2384. As a result, the null hypothesis is not rejected at the significance level of 5 %.
The null hypothesis of the Mann–Kendall test is that the data are i.i.d. and the alternative hypothesis is that the data have a monotonic trend. The test statistic is

S = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} sign(x_j − x_i).    (6.28)

Under the null hypothesis and for large n, S approximately follows the normal distribution with zero mean and variance given by

σ²_S = n(n − 1)(2n + 5)/18.    (6.29)

The standardized test statistic is

Z = [S − sign(S)]/σ_S.    (6.30)

The critical values with significance level α are given by Eq. (6.25).
Example 6.3 (continued) Test the randomness of the data in Table 6.1 using the Mann–Kendall test.
Similar to the Mann–Kendall test, the null hypothesis of the Spearman test is that the data are i.i.d. and the alternative hypothesis is that the data have a monotonic trend. The test statistic is

D = 1 − 6 Σ_{i=1}^{n} [R(x_i) − i]² / [n(n² − 1)],    (6.31)

where R(x_i) is the rank of x_i in the sample, with the rank of the smallest observation being 1. Under the null hypothesis and for large n, D approximately follows the normal distribution with zero mean and variance given by

σ²_D = 1/(n − 1).    (6.32)

The standardized test statistic is given by Z = D/σ_D, and the critical values with significance level α are given by Eq. (6.25).
Example 6.3 (continued) Test the randomness of the data in Table 6.1 using the Spearman test.
6.6.6 Discussion
According to the results of Examples 6.2 and 6.3, the data in Table 6.1 can be modeled by an appropriate distribution model. After fitting the data to the normal, lognormal and Weibull distributions using the MLM, it is found that the Weibull distribution is the best in terms of the maximum log-likelihood value. The estimated parameters are shown in the second column of Table 6.2.
However, one may directly implement the third step after carrying out the first step, since Fig. 6.2 indicates that there is a trend in the early stage of use. In this case, a variable-parameter model can be used.
According to Fig. 6.2, the Weibull scale parameter can increase with i and tend to a constant. Therefore, we assume that the shape parameter remains constant and the scale parameter is given by

η_i = η_∞(1 − e^{−i/δ}),  i = 1, 2, ….    (6.33)

Using the MLM, we obtained the parameters shown in the last column of Table 6.2. In terms of the AIC (see the last row of Table 6.2), the variable-parameter Weibull model is much better than the two-parameter Weibull distribution. This illustrates the importance of graphical methods in modeling a failure process.
Some statistical analyses (e.g., regression analysis) sometimes need tests for normality and constant variance. We briefly discuss these two issues in this section.
The chi-square test and the Kolmogorov–Smirnov test discussed in Sect. 5.5 are general methods for testing the goodness of fit of a distribution, including the normal distribution. The normal Q–Q plot and the skewness–kurtosis-based method are two simple methods specific to testing normality.
The normal Q–Q plot is applicable to both complete and incomplete data and can be easily generated using Excel. For simplicity, we consider a complete ordered sample x₁ ≤ x₂ ≤ ⋯ ≤ x_n. The empirical cdf at x_i can be evaluated by F_i = i/(n + 1) or F_i = betainv(0.5, i, n − i + 1). Let z_i = Φ^{−1}(F_i; 0, 1). The normal Q–Q plot is the plot of x_i versus z_i. If the data come from a normal distribution, then the normal Q–Q plot of the data should be roughly linear.
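A sketch that produces the plotting coordinates, using scipy's beta.ppf as the counterpart of Excel's betainv:

```python
import numpy as np
from scipy import stats

def normal_qq_points(x):
    """Coordinates (z_i, x_i) of the normal Q-Q plot described above.

    F_i is taken as the median rank betainv(0.5, i, n - i + 1) and
    z_i = Phi^{-1}(F_i); a roughly linear plot supports normality.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = stats.beta.ppf(0.5, i, n - i + 1)    # Excel betainv(0.5, i, n-i+1)
    z = stats.norm.ppf(F)
    return z, x
```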
Example 6.4 Test the normality of the data in Table 6.1 using the normal Q–Q plot.
Using the approach outlined above, we obtained the normal Q–Q plot of the data shown in Fig. 6.3. As seen, the data points scatter roughly along a straight line, implying that the normality hypothesis cannot be rejected.
The skewness and kurtosis (i.e., γ₁ and γ₂; see Sect. 3.5.3) of a normal distribution are zero. If the sample skewness and kurtosis are significantly different from zero, the data may not be normally distributed. The Jarque–Bera statistic (see Ref. [13]) combines these two measures as

J = (n/6)(γ₁² + γ₂²/4).    (6.34)
[Fig. 6.3 Normal Q–Q plot (x versus z) of the data in Table 6.1]
The skewness and kurtosis of the data are γ₁ = 0.2316 and γ₂ = 0.0254, respectively. From Eq. (6.34), we have J = 0.1076, which is much smaller than 6, implying that the normality hypothesis cannot be rejected.
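A sketch of Eq. (6.34); scipy's skew and excess kurtosis correspond to the γ₁ and γ₂ used here.

```python
import numpy as np
from scipy import stats

def jarque_bera(x):
    """Jarque-Bera statistic J of Eq. (6.34).

    J is compared with the chi-square(2) critical value (about 5.99
    at the 5 % level); a large J indicates non-normality.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    g1 = stats.skew(x)                 # sample skewness
    g2 = stats.kurtosis(x)             # excess kurtosis (0 for normal)
    J = n / 6 * (g1 ** 2 + g2 ** 2 / 4)
    return J, stats.chi2.sf(J, 2)
```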
References
14. Tobias PA, Trindade D (2011) Applied reliability, 2nd edn. Van Nostrand Reinhold, New York
15. Wang P, Coit DW (2005) Repairable systems reliability trend tests and evaluation. In: Proceedings of the 51st annual reliability and maintainability symposium, pp 416–421
16. Wu S (2011) Preventive maintenance models: a review. In: Ouali MS, Tadj L, Yacout S et al (eds) Replacement models with minimal repair. Springer-Verlag, London, pp 129–140
Part II
Product Quality and Reliability
in Pre-manufacturing Phase
Chapter 7
Product Design and Design for X
7.1 Introduction
The life cycle of a product starts from the identification of a need. Product design transforms the need into the idea that produces the desired product. Traditionally, product design focused mainly on the acquisition phase of the product's life cycle and was completed purely based on the consideration of product functionality (see Ref. [5]). To produce a competitive product, product design needs to consider a wide range of requirements, including product features, cost, quality, reliability, manufacturability, and supportability. These requirements are often conflicting. Design for X (DFX for short) is a set of design methodologies to address these requirements.
In this chapter we briefly discuss the DFX in the context of product life cycle.
The outline of the chapter is as follows. We start with a brief discussion of
product design and relevant issues in Sect. 7.2. Section 7.3 deals with designs for
safety, environment, quality, and reliability. Designs for production-related per-
formances are discussed in Sect. 7.4; designs for use-related performances are
discussed in Sect. 7.5, and designs for retirement-related performances are dis-
cussed in Sect. 7.6.
Product design is the process of creating a new product. This process is generally divided into the following five distinct stages:
Product planning stage. It starts with setting up a project team and defines the major technical parameters and product requirements in terms of performance, costs, safety, etc.
Concept design stage. During this stage, several design concepts are generated and evaluated to determine whether the product requirements can be met and to assess their levels of technology and risk. The basic outcome of this stage is one or more product concepts or options for further consideration. A life cycle cost (LCC) analysis can be carried out for each design option.
System-level design stage. In this stage, more details are specified, detailed analysis is carried out, and subsystems begin to take shape for the selected concept(s).
Detail design stage. During this stage, all components and parts are defined in all details and most of the manufacturing documentation is produced.
Design refinement stage. In this stage, one or more product prototypes are made and tested so as to find possible design defects and accordingly modify the design.
The modern product design needs to use numerous working methods and
software packages such as computer-aided design (CAD), computer-aided engi-
neering (CAE), and computer-aided quality (CAQ). These packages are usually
integrated to a product lifecycle management (PLM) system (see Ref. [6] and the
literature cited therein).
Product design aims to create a product with excellent functional utility and sales
appeal at an acceptable cost and within a reasonable time. This deals with the
following three aspects:
Excellent functional utility and sales appeal. This actually deals with product
quality, including reliability and other performance characteristics. Design for X
can be used to address this issue.
Acceptable cost. The cost is evaluated through considering all cost elements
involved in product life cycle. Design for life cycle addresses this issue.
Reasonable time. Product design has become a regular and routine activity, and time-to-market has become shorter and shorter. A time-based product design approach is used to address this issue.
These approaches are further discussed as follows.
The main purpose of time-based product design is to reduce the time to market. The basic approach is to involve the key participants (e.g., marketing, research and development, engineering, operations, and suppliers) as early as possible. This implies (a) the use of a team-based concurrent design process and (b) early involvement of the key participants.
The objective of design for life cycle is to maximize the life cycle value of the product to its users and to minimize the LCC of the product. To maximize the life cycle value, the design needs to take into account various performance characteristics by using the methodology of Design for X, where X stands for a key performance characteristic of the product.
To minimize the LCC, the design needs to take into account the various activities and costs involved in the phases of the life cycle. Life-cycle assessment is a key activity of design for life cycle. It assesses materials, services, products, processes, and technologies over the entire life of a product, and identifies and quantifies the energy and materials used as well as the wastes released to the environment. The main outcome of the life-cycle assessment is the LCC of the product, which is often used as the decision objective to choose the best design alternative. Therefore, the LCC analysis is usually carried out in the early stage of product design and has become common practice in many organizations.
Life cycle cost is composed of numerous cost elements. The main cost elements for the manufacturer include research and development cost, manufacturing cost, marketing cost, operation and maintenance cost, environmental preservation cost, and disposal and recycling cost. For the user, they include purchase cost, operation and maintenance cost, environmental preservation cost, and the residual value of the product at retirement, which is an income. The LCC model of a product represents the cost elements and their interrelationships.
In addition to the LCC and time to market, the manufacturer is also concerned about other design factors such as manufacturability, assemblability, testability, and so on. On the other hand, several factors (e.g., price, quality, safety, serviceability, maintainability, etc.) impact the purchase decisions of customers. These imply that product design needs to consider many performance requirements or characteristics. Design for X addresses the issue of how to achieve the desired performances through design. There is a vast literature on DFX (see Refs. [1–5] and the literature cited therein).
The performance requirements for a product can be roughly divided into two categories: overall and phase-specific. The overall performances are those that are related to more than one phase of the product life cycle (e.g., safety, environment, quality and reliability). The phase-specific performances are those that are related to a certain specific phase of the product life cycle. For example, manufacturability is a production-related performance and maintainability is a use-related performance. As such, the phase-specific performances can be further divided into three subcategories: production-related, use-related, and retirement-related. Figure 7.1 displays this classification and the main performances in each category.
Safety refers to the relative protection from exposure to various hazards (e.g., death, injury, occupational illness, damage to the environment, loss of equipment, and so on). Hazards are unsafe acts or conditions that could lead to harm or damage
to humans or the environment. Human errors are typical unsafe acts that can occur
at any time throughout the product life cycle; and unsafe conditions can be faults,
failures, malfunctions, and anomalies.
Risk is usually defined as the product of the likelihood or probability of a hazard event and its negative consequence (e.g., the level of loss or damage). As such, a risk can be characterized by answering the following three questions:
What can happen?
How likely is it to happen?
If it does happen, what are the consequences?
Clearly, risk results from a hazard (which is related to the first question), but a hazard does not necessarily produce risk if there is no exposure to that hazard (which is related to the second question).
The levels of risk can be classified as:
Acceptable risk, which needs no immediate attention;
Tolerable risk, which needs immediate attention; and
Unacceptable risk.
A product is considered safe if the risks associated with the product are assessed to be acceptable.
Products must be produced safely and be safe for the user. System safety aims to optimize safety by identifying safety-related risks and eliminating or controlling them by design and/or procedures, based on an acceptable system safety level. A system safety study is usually carried out in both the concept design and system-level design phases. The main issues of the system safety study are hazard analysis, specification of safety requirements, and mitigation of safety risk through design. We discuss these three issues as follows.
Hazard analysis includes hazard identification, risk estimation, and risk evaluation. Preliminary hazard analysis is a procedure to identify potentially hazardous conditions inherent within the system based on engineering experience. It also determines the criticality of potential accidents. Functional hazard assessment is a technique to identify hazardous function failure conditions of part of a system and to mitigate their effects. Risk estimation deals with quantifying the probability of an identified hazard and its consequence value. If the risk is unacceptable, risk mitigation measures must be developed so that the risk is reduced to an acceptable level. Risk evaluation aims to validate and verify the risk mitigation measures.
Three typical hazard analysis techniques are failure mode, effects and criticality analysis (FMECA), fault tree analysis (FTA), and event tree analysis (ETA). FMECA is an extension of the failure mode and effects analysis (FMEA). FMEA aims to identify failure modes and their causes and effects. The criticality analysis uses a risk priority number (which is the product of the risk and the probability that the
hazard event would not be detected) to quantify each failure mode. It facilitates the identification of the design areas that need improvement.
FTA is a commonly used method to derive and analyze potential failures and
their potential influences on system reliability and safety. FTA builds a fault tree
with the top event being an undesired system state or failure condition. The analysis
helps to understand how systems can fail and to identify possible ways to reduce
risk. According to this analysis, the safety requirements of the system can be further
broken down.
ETA builds an event tree based on detailed product knowledge. The event tree starts from a basic initiating event and provides a systematic coverage of the time sequence of event propagation to its potential outcomes. The initiating event is usually identified by hazard analysis or FMECA.
According to the hazard analysis, a set of safety requirements for the product must be established in the early stages of product development. The product design must achieve these safety requirements.
Hierarchical design divides the product into a number of subsystems. In this
case, the product safety requirement will be further allocated to appropriate safety-
related subsystems.
Safety must be built into a product by considering safety at all phases. Typical
design techniques for a product to satisfy its safety requirements are redundancy
design, fail-safe design, and maintainability design. Maintainability design will be
discussed in Sect. 7.5 and hence we briefly discuss the first two techniques below.
The redundancy design is a kind of fault tolerance design. Fault tolerance means that a product can operate in the presence of faults; in other words, the failure of some part of the product does not result in the failure of the product. All methods of fault tolerance are based on some form of redundancy. The redundancy design uses additional components to provide protection against random component failures. It is less useful for dependent failures; design diversity is an appropriate way to deal with dependent failures. The redundancy design will be further discussed in Chap. 9.
Products can be designed to be fail-safe. A fail-safe device causes no harm to other devices or danger to personnel when a failure occurs. In other words, the fail-safe design focuses on mitigating the unsafe consequences of failures rather than on avoiding the occurrence of failures. The use of a protective device is a typical approach of fail-safe design. For example, devices that operate with fluids usually use safety valves as a fail-safe mechanism. In this case, the inspection of protective devices will be an important maintenance activity.
Quality must be designed into the product; poor design cannot be compensated for through inspection and statistical quality control. Design for quality is a set of methodologies to proactively assure high quality by design. It aims to offer excellent performances that meet or exceed customer expectations.
There are many design guidelines for quality. These include:
using quality function deployment (QFD) to capture the voice of the customer for product definition,
using the Taguchi method to optimize key parameters (e.g., tolerances),
reusing proven designs, parts, and modules to minimize risk,
simplifying the design with fewer parts, and
using high-quality parts.
The QFD and Taguchi method will be further discussed in the next chapter.
Reliability must be designed into products and processes using appropriate methods. Design for reliability (DFR) is a set of tools and methodologies to support product and process design so that customer expectations for reliability can be met. DFR begins early in the concept stage and involves the following four key activities:
Determining the usage and environmental conditions of the product and defining its reliability requirements. The requirements will be further allocated to assemblies, components and failure modes, and translated into specific design and manufacturing requirements using the QFD approach.
Identifying key reliability risks and corresponding mitigation strategies.
Predicting the product's reliability so that different design concepts can be evaluated.
Performing a reliability growth process. The process involves repeated prototype testing, failure analysis, design changes, and life data analysis. The process continues until the design is considered to be acceptable. The acceptability can be further confirmed by a reliability demonstration test.
The first three activities will be further discussed in Chap. 9, and reliability growth by development will be discussed in detail in Chap. 11.
The product design obtained after these activities may be modified based on feedback from the manufacturing process and field usage.
Manufacturability is a design attribute for the designed product to be easy and cost-effective to build. Design for manufacturability (DFM) uses a set of design guidelines to ensure the manufacturability of the product. It is initiated at the conceptual design.
DFM involves various selection problems on structure, raw material, manufacturing method and equipment, assembly process, and so on. The main guidelines of DFM include:
Reducing the total number of parts. For this purpose, one-piece structures or multi-functional parts should be used. Typical manufacturing processes associated with one-piece structures include injection molding and precision casting.
Usage of modules and standard components. The usage of modules can simplify manufacturing activities and add versatility, and the usage of standard components can minimize product variations and reduce manufacturing cost and lead times.
Usage of multi-use parts. Multi-use parts can be used in different products with the same or different functions. To develop multi-use parts, the parts that are used commonly in all products are identified and grouped into part families based on similarity. Multi-use parts are then created for the grouped part families.
There are many other considerations such as ease of fabrication, avoidance of
separate fasteners, assembly direction minimization, and so on.
Design for assembly (DFA) partly overlaps with DFM. In other words, some design guidelines for DFM (e.g., modularity design and minimization of the total number of parts) are also applicable for DFA.
The basic guidelines of DFA are:
to ensure the ease of assembly, e.g., minimizing assembly movements and assembly directions, and providing suitable lead-in chamfers and automatic alignment for locating surfaces and symmetrical parts;
to avoid or simplify certain assembly operations, e.g., avoiding visual obstructions, simultaneous fitting operations, and the possibility of assembly errors.
For a manufacturing company, logistics deals with the management of the flow of resources (e.g., materials, equipment, products, and information) from the procurement of raw materials to the distribution of finished products to the customer.
The product architecture has an important influence on the logistic performance of the product. For example, the make-or-buy decision for a specific part will result in considerably different logistic activities. Design for logistics (DFL) is a design method that aims at optimizing the product structure to minimize the use of resources.
A considerable part of the product cost stems from purchased materials and parts. DFL designs a product to minimize the total logistics cost by integrating the manufacturing and logistic activities. As such, DFL overlaps with DFM and DFA, and hence some guidelines of DFM and DFA (e.g., modular design and usage of multi-use parts) are also applicable for DFL.
The logistic system usually consists of three interlinked subsystems: the supply system, the production system, and the distribution system. A systematic approach is needed to scientifically organize the activities of purchasing, transporting, storing, distributing, and warehousing materials and finished products.
The supply system depends on the nature of the product and the make-or-buy decisions for its parts, and needs to be flexible in order to match different products.
In a production system, two key approaches to achieve the desired logistics performance are postponement and concurrent processing. Postponement means delaying the differentiation of products in the same family as late as possible (e.g., painting cars with different colors); concurrent processing means producing multiple different products concurrently. The main benefit of delaying product differentiation is more precise demand forecasts, due to the aggregation of the forecasts for each product variant into one forecast for the common parts. Precise demand forecasts can result in lower stock levels and better customer service. Product designs that allow for delaying product differentiation usually involve a modular structure of the product, and hence modularity is an important design strategy for achieving the desired logistic performance. However, postponement may result in higher manufacturing costs, adjustment of manufacturing processes, and the purchase of new equipment.
The performance requirements in the usage phase can be divided into two categories: user-focused and post-sale-focused. The user-focused performances include user-friendliness, ergonomics, and aesthetics. These can have high priority for consumer durables. The post-sale-focused performances include reliability, availability, maintainability, safety, serviceability, supportability, and testability, and have high priority for capital goods. Among these performances, reliability, availability, maintainability and safety or supportability (RAMS) are particularly important. Design for RAMS involves the development of a service system (including a preventive maintenance program) for the product.
Some of the post-sale-focused performances have been discussed earlier, and the others are briefly discussed as follows.
References
3. Huang GQ, Shi J, Mak KL (2000) Synchronized system for Design for X guidelines over the WWW. J Mater Process Tech 107(1–3):71–78
4. Keys LK (1990) System life cycle engineering and DFX. IEEE Trans CHMT 13(1):83–93
5. Kuo TC, Huang SH, Zhang HC (2001) Design for manufacture and design for X: concepts, applications, and perspectives. Comput Ind Eng 41(3):241–260
6. Saaksvuori A, Immonen A (2008) Product lifecycle management, 3rd edn. Springer, Berlin
Chapter 8
Design Techniques for Quality
8.1 Introduction
As mentioned earlier, the design and development process of a product starts with requirement definition and ends with a prototype version of the product that meets customer needs. The HOQ is a design technique developed to identify and transform customer needs into technical specifications. It is based on the belief that a product should be designed to reflect customer needs.
[Fig. 8.1 House of quality: correlations among the ECs (roof); CAs and their importance (left-hand side); relationship matrix between ECs and CAs (center); comprehensive effects and benchmarking (right-hand side); target levels of the ECs (bottom)]
Figure 8.1 shows an HOQ in the product planning stage. The customer's needs or attributes (CAs) are represented on the left-hand side (LHS) of the HOQ, and are usually qualitative and vague. The relative importance or weight of a CA helps to identify critical CAs and to prioritize design efforts. The weight of a CA can be determined using various approaches such as the AHP (see Online Appendix A). For example, for CA i, a score s_i ∈ {1, 2, …, 9} can be assigned to it based on the customers' preference. The weight of CA i can then be calculated by

ω_i = s_i / Σ_{k=1}^{m} s_k.    (8.1)
represent their correlation degree based on the experts' judgment. The correlation coefficient between ECs i and j can be calculated from these judgments.
The right-hand side (RHS) of the HOQ shows the comprehensive effects of all the ECs on each CA, and may also include a competitive benchmarking value for each CA. The bottom part of the HOQ may give the competing products' performance, a comprehensive evaluation, and conclusions about how the product being designed is superior to the competing products. The target levels of the ECs are determined using all the information in the HOQ.
There can be different versions of the LHS, RHS, and bottom part of the HOQ, depending on the specific application. For example, there can be a correlation matrix for the CAs, which is usually placed on the LHS of the HOQ.
The HOQ helps transform customer needs into engineering characteristics, prioritize each product characteristic, and set development targets. To achieve these purposes, an evaluation model is used to evaluate the importance rating of each EC, and another evaluation model is used to evaluate the comprehensive effect of all the ECs on each CA. As such, the future performance of the product being designed can be predicted by aggregating these effects. We discuss these models as follows.
The importance ratings or priorities of the ECs can be evaluated using the CAs' relative importance and the relationship matrix. If the correlations among the CAs can be ignored, the priorities of the ECs can be calculated by (see Ref. [3])

p_j = Σ_{i=1}^{m} ω_i r_{ij},  1 ≤ j ≤ n.    (8.4)
The normalized weights are given by

w_j = p_j / Σ_{k=1}^{n} p_k,  1 ≤ j ≤ n.    (8.5)
Example 8.1 Consider four CAs and five ECs. Their relationship matrix is shown in the top part of Table 8.1. Using Eq. (8.4) yields the priorities of the ECs shown in the third row from the bottom, and the normalized weights are shown in the second row from the bottom. The last row shows the ranking of each EC. As seen, EC 5 is the most important and EC 4 is the least important.
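A sketch of Eqs. (8.1), (8.4) and (8.5); the 4 × 5 relationship matrix below is hypothetical, not the one in Table 8.1.

```python
import numpy as np

def ec_priorities(scores, R):
    """EC priorities and normalized weights, Eqs. (8.1), (8.4), (8.5).

    scores : customer preference scores s_i (1-9) of the m CAs
    R      : m x n relationship matrix [r_ij] between CAs and ECs
    """
    s = np.asarray(scores, dtype=float)
    R = np.asarray(R, dtype=float)
    omega = s / s.sum()                 # CA weights, Eq. (8.1)
    p = omega @ R                       # EC priorities, Eq. (8.4)
    w = p / p.sum()                     # normalized weights, Eq. (8.5)
    rank = (-p).argsort().argsort() + 1 # 1 = most important EC
    return p, w, rank

# hypothetical 4 CAs x 5 ECs relationship matrix on a 0-1 scale
p, w, rank = ec_priorities([7, 5, 9, 3],
                           [[0.9, 0.1, 0.3, 0.0, 0.5],
                            [0.3, 0.7, 0.0, 0.1, 0.9],
                            [0.1, 0.3, 0.9, 0.0, 0.7],
                            [0.0, 0.5, 0.1, 0.3, 0.9]])
```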
D_i = 1 − Π_{j=1}^{n} (1 − S_i d_{ij}) ∈ [0, 1],  S_i = 1 − Π_{j=1}^{n} (1 − r_{ij}) ∈ [0, 1].    (8.6)

The overall performance of the design is

S = Σ_{i=1}^{m} ω_i S_i.    (8.7)
1 + K U_i = Π_{j=1}^{n} (1 + K a_j u_{ij}),  a_j ∈ [0, 1], K > −1,    (8.8)

where u_{ij} is the utility of attribute j, U_i is the overall utility, a_j is the attribute weight, and K is a constant. When all u_{ij} = 1, the overall utility should equal 1, i.e.,

1 + K = Π_{j=1}^{n} (1 + K a_j).    (8.9)

As such, K is the nonzero solution of Eq. (8.9). Clearly, Eq. (8.8) reduces to Eq. (8.6) when a_j = 1, K = −1 and u_{ij} = r_{ij}.
Example 8.1 (continued) Consider the relationship matrix shown in Table 8.1. The satisfaction degrees of the CAs calculated from Eq. (8.6) are shown in Table 8.2. As seen, CA 4 can be met well by the design but CA 2 is not met well. The overall performance of the design equals 0.8975, which is equivalent to about 8 points on the 9-point scale used in the AHP.
Quality function deployment is a series of HOQs, where the ECs of the current HOQ become the CAs of the next HOQ. Each HOQ relates the variables of one design stage to the variables of the subsequent design stage. The process stops at the stage when the design team has specified all the engineering and manufacturing details. In this way, QFD ensures quality throughout each stage of the product development and production process.
Typically, QFD is composed of four HOQs. The first HOQ transforms the customer's needs into the engineering characteristics (or design requirements); the second HOQ transforms the engineering characteristics into parts characteristics (or part requirements); the third HOQ transforms the part characteristics into
The quality costs include prevention cost (e.g., process improvement and training costs), appraisal or evaluation cost (e.g., inspection or test costs), external loss (e.g., warranty cost and sales loss), and internal loss (e.g., scrap and rework costs). These cost elements are highly correlated. For example, as the product quality level increases, the prevention and appraisal costs increase but the internal and external losses decrease. The traditional viewpoint on quality is to find an optimal quality level so that the total quality cost achieves its minimum. However, the modern viewpoint on quality holds that continuous quality improvement is more cost-effective, based on the following reasons:
It will result in an improved competitive position so that the product can be sold at a higher price and will have an increased market share.
It can result in decreases in failure costs and operational cost. These lead to an increase in profits.
As a result, quality should be continuously improved rather than held at an optimal quality level.
Let Y denote the quality characteristic and let LSL and USL denote the lower and upper specification limits, respectively. Items that conform to the design specifications are called conforming and those that do not are called nonconforming or defective. The quality loss depends on the value of Y (denote the value as y). Traditionally, the loss is thought to be zero when y falls inside the specification limits; otherwise, the loss is a positive constant. As such, the conventional quality loss function is a piecewise step function, as shown in Fig. 8.2. Such a function implies that any value of Y within the specification limits is equally desirable.
Taguchi [8] considers that any deviation from a predetermined target value T represents an economic loss to society. The loss can be incurred by the manufacturer as warranty or scrap costs; by the customer as maintenance or repair costs; or by society as pollution or environmental costs [2]. As such, there can be
[Fig. 8.2 The conventional quality loss function L(y): zero between LSL and USL (with the target value in between) and constant outside]
a quality cost for any conforming product as long as its quality characteristic is not at the target value.
Using the Taylor series expansion, Taguchi proposes a quadratic loss function to model the loss of the deviation of the quality characteristic from its target value:

L(y) = K(y − T)².    (8.10)

The expected loss over the specification interval is

μ_L = ∫_{LSL}^{USL} K(y − μ)² f(y) dy = KV,    (8.13)

where

V = σ²{1 − 2(Δ/σ)φ(Δ/σ; 0, 1)/[2Φ(Δ/σ; 0, 1) − 1]},    (8.14)

with Δ denoting the half-width of the specification interval. Equations (8.13) and (8.14) clearly show the benefit of reducing the variability.
When the target T is a finite value, we have the nominal-the-best case, where the quality characteristic Y should be densely distributed around the target value. Two other cases are smaller-the-better and larger-the-better. If Y is non-negative, a smaller-the-better quality characteristic has the target value T = 0, so that we have L(y) = Ky². A larger-the-better quality characteristic can be transformed into a smaller-the-better one using the transformation Y^{−1}, so that we have L(y) = K/y².
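A numerical sketch of the expected loss in Eq. (8.13) for a normal quality characteristic; the constants are hypothetical and the integral is evaluated numerically rather than via Eq. (8.14).

```python
import numpy as np
from scipy import integrate, stats

def expected_quality_loss(K, target, mu, sigma, LSL, USL):
    """Expected quadratic quality loss over the specification interval.

    Numerically evaluates mu_L = int_LSL^USL K (y - T)^2 f(y) dy for a
    normal quality characteristic, mirroring Eq. (8.13); with mu = T
    this equals K V of Eq. (8.14).
    """
    f = lambda y: stats.norm.pdf(y, mu, sigma)
    loss, _ = integrate.quad(lambda y: K * (y - target) ** 2 * f(y),
                             LSL, USL)
    return loss

# halving sigma cuts the expected loss roughly fourfold
print(expected_quality_loss(K=2.0, target=10, mu=10, sigma=1.0, LSL=7, USL=13))
print(expected_quality_loss(K=2.0, target=10, mu=10, sigma=0.5, LSL=7, USL=13))
```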
In a part batch production environment, Yacout and Boudreau [10] assess the
quality costs of the following quality policies:
Policy 1: Nothing is done to control or prevent variations
Policy 2: 100 % inspection
Policy 3: Prevention by statistical process control (SPC) techniques
Policy 4: A combination of Policies 2 and 3.
Policy 1 can incur a large external loss due to delivering nonconforming units to customers. The quality loss function with a relatively large value of K can be used to evaluate the expected costs in the in-control and out-of-control states, respectively. The total cost per cycle is the sum of the costs in the two states.
The quality costs of Policy 2 mainly include the inspection cost and the internal loss due to reworking or scrapping nonconforming units. The internal loss can be evaluated using the quality loss function with a relatively small value of K. Relative to Policy 1, this policy costs less, as it discovers the nonconforming units and prevents them from reaching the customer.
The quality costs of Policy 3 mainly include the external loss and the prevention cost. The prevention involves the use of control charts, which are used to detect whether the process is in control or not. If an out-of-control state is detected, the assignable causes can be corrected at the end of the cycle. This reduces the fraction of nonconforming units and hence also reduces the external loss. The prevention costs are the sampling costs, the costs of investigating false and true alarms, and the correction costs. If the detection and correction of assignable causes can effectively reduce the occurrence of the out-of-control state, this policy will result in quality improvement.
The quality costs of Policy 4 mainly include the internal loss, the inspection cost and the prevention cost. Compared with Policy 2, it has a smaller internal loss; and compared with Policy 3, it does not have the external loss.
The optimal quality policy can be determined by evaluating the effects of the different quality policies on the quality costs and the outgoing quality.
Taguchi [8] divides the design phase into three stages: systematic design, parametric
design, and tolerance design, and develops an experimental optimum method
for the parametric design. The basic idea is to divide the parameters that impact the
performance of a product (or process) into two categories: controllable and
uncontrollable. The controllable parameters are design variables, and the uncontrollable
parameters are called noises (e.g., manufacturing variation, environmental
and use conditions, and degradation or wear in components and materials).
The problem is to find the optimal levels of the controllable parameters so that the
performance is not sensitive to the uncontrollable parameters. The method requires
carrying out a set of experiments. To reduce the experimental effort while obtaining
sufficient information, the experiment design is stressed. This approach is called
the Taguchi method, which is based on orthogonal array experiments.
The Taguchi method is an experiment-based optimization method, applicable to
situations where no good mathematical model is available to represent the product
performance. The Taguchi method is not confined to the parametric design of a
product; in fact, it is applicable to optimizing any engineering process. The word
"parameter" or "variable" can refer to a design option or a type of part.
When the number of controllable parameters is large, the total number of experiments
to be completed can be specified as a constraint condition. For a given
variable P_i with k_i levels, let p_j (1 ≤ j ≤ k_i) denote the probability for level
j to occur. Further, let q_0 = 0, q_j = Σ_{l=1}^{j} p_l, and q_{k_i} = 1. The level of P_i can be
randomly generated as l_i = j if
q_{j−1} < r ≤ q_j    (8.15)
where r is a uniform random number between 0 and 1. The required number of
experiments can be obtained by repeatedly using Eq. (8.15).
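A minimal sketch of the random level generation of Eq. (8.15); the level probabilities below are hypothetical.

    import random

    def random_level(p):
        # Eq. (8.15): return level j such that q_{j-1} < r <= q_j,
        # where q_j are cumulative probabilities of the levels
        r = random.random()
        q = 0.0
        for j, pj in enumerate(p, start=1):
            q += pj
            if r <= q:
                return j
        return len(p)

    # Hypothetical: a variable with three equally likely levels
    experiments = [random_level([1/3, 1/3, 1/3]) for _ in range(10)]
    print(experiments)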
The Taguchi method is based on orthogonal array experiments, in which each
variable and each level are tested an equal number of times. The array is selected
according to the number of parameters and the number of levels. Here, the parameters
can be controllable or uncontrollable, and the experiment designs for controllable
and uncontrollable parameters are conducted separately.
Design for controllable parameters. Once the controllable parameters have been
determined, the levels of these parameters must be determined. Determining the
levels of a variable requires first specifying its minimum and maximum, and then
specifying the number of levels taking into account the change range and the cost of
conducting experiments. The number of levels for all parameters in the experi-
mental design is usually chosen to be the same so as to facilitate the selection of the
proper orthogonal array. Once the orthogonal array is determined, necessary
adjustment is allowed for different parameters to have different numbers of levels.
The proper orthogonal array can be selected based on the number of parameters
(n) and the number of levels (k) as indicated in Table 8.3, where the subscript of the
array indicates the number of experiments to be completed. Once the name of the
array has been determined, the predefined array can be looked up (see Refs. [7, 8]).
For example, when n = 4 and k = 3, the array is L_9. The corresponding
combinations of levels are shown in Table 8.4.
Design for uncontrollable parameters. Uncontrollable parameters or external
factors affect the performance of a product or process, and the experiments may
reflect their effects on the performance in two different ways. The first way is to
conduct several trials (i.e., repeated tests) for a given combination of controllable
parameters (i.e., an experiment). The second way is to explicitly consider a set of
uncontrollable parameters, and to generate several combinations of levels of these
parameters. The outcome can be an orthogonal array called the noise matrix. For a
given experiment for controllable parameters, all the combinations specied in the
noise matrix will be tested. As a result, each experiment corresponds to several
trials.
A mixed experiment design is a mixture of the factorial design, random design, and
Taguchi design. For example, if the number of experiments specified by the Taguchi
design is considered too small, a given number of additional experiments can be
carried out based on a random design.
Finally, it is worth noting that an experiment need not be physical. In other
words, the experiment can be computational, including simulation. In this case,
the experiment design is still needed to change the values of the parameters in an
appropriate way.
The data analysis deals with three issues: calculation of signal-to-noise ratios,
evaluation of the effects of the different parameters, and optimization of levels of
the controllable parameters. We separately discuss these issues in the next three
subsections.
where
σ_i² = (1/k_i) Σ_{j=1}^{k_i} y_{ij}².    (8.19)
After obtaining the signal-to-noise ratio for each experiment, the average signal-to-noise
ratio can be calculated for each level of a given parameter. Let SN_{ij} denote the
average of the signal-to-noise ratios for the jth level of parameter P_i. For example,
for the experiments shown in Table 8.4, the average for Level 1 of P_1 is
SN_{11} = (SN_1 + SN_2 + SN_3)/3, since P_1 is at Level 1 in Experiments 1–3.
Once the averages of the signal-to-noise ratios are obtained, the range of the
averages for parameter P_i can be calculated as
Δ_i = max_{1≤j≤k} SN_{ij} − min_{1≤j≤k} SN_{ij}.
A large value of Δ_i implies that P_i has a large effect on the output. As such, the
effects of the parameters can be ranked based on the values of Δ_i.
The correlation coefficient between (SN_{ij}; 1 ≤ j ≤ k) and (SN_{lj}; 1 ≤ j ≤ k) can
represent the interaction between P_i and P_l.
Example 8.2 Assume that the problem involves four variables, each with three
levels, and the performance characteristic is nominal-the-best. The orthogonal
array is the L_9 array shown in Table 8.4. Each experiment is repeated three times,
and the results are shown in the second to fourth columns of Table 8.5. The values
of the mean, standard deviation, and signal-to-noise ratio for each experiment are
shown in the last three columns of Table 8.5.
From the signal-to-noise ratios of the experiments, we can obtain the average
signal-to-noise ratios and the results are shown in Table 8.6. The range of the
averages for each parameter is shown in the second row from the bottom, and the
rank numbers of the parameters are shown in the last row. According to the ranking,
we can conclude that P_2 has the largest effect while P_3 and P_4 have the smallest
effect on the output.
From the average signal-to-noise ratios shown in Table 8.6, we can calculate
their correlation coefficients; the results are shown in Table 8.7. As seen, there
can be weak interactions between P_1 and P_4, P_2 and P_3, and P_2 and P_4.
For Example 8.2, Table 8.6 shows that the best level combination is (1, 2, 3, 1)
for (P_i, 1 ≤ i ≤ 4). It is noted that no such combination appears in Table 8.4; the
combination closest to it is Experiment 5. A supplementary test may be conducted
to verify this combination.
An approximate method can be used to predict the performance under the
optimal combination. It is noted that the optimal combination is obtained by
changing the level of P_1 in Experiment 5 from Level 2 to Level 1. Referring to the
second column of Table 8.4, the performance increment resulting from changing
Level 2 of P_1 to Level 1 of P_1 can be estimated as
Δy = (1/3) Σ_{i=1}^{3} y_i − (1/3) Σ_{i=4}^{6} y_i
where y can be μ or σ. As such, the performance under the optimal combination can
be estimated as ŷ = y_5 + Δy. The computational process is shown in Table 8.8, and
the last row gives the predicted performances.
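The following sketch illustrates the analysis behind Tables 8.6 and 8.8, using the standard L_9(3^4) orthogonal array; the signal-to-noise values are hypothetical, not those of Table 8.5.

    # Standard L9(3^4) orthogonal array: row = experiment, entry = level of each parameter
    L9 = [(1,1,1,1),(1,2,2,2),(1,3,3,3),
          (2,1,2,3),(2,2,3,1),(2,3,1,2),
          (3,1,3,2),(3,2,1,3),(3,3,2,1)]

    # Hypothetical signal-to-noise ratio of each experiment (dB)
    sn = [24.0, 26.5, 25.1, 27.2, 29.0, 26.8, 25.5, 27.9, 26.1]

    n_par, n_lev = 4, 3
    avg = [[0.0] * n_lev for _ in range(n_par)]
    for i in range(n_par):
        for j in range(n_lev):
            vals = [sn[e] for e in range(9) if L9[e][i] == j + 1]
            avg[i][j] = sum(vals) / len(vals)       # average SN ratio per level

    rng = [max(a) - min(a) for a in avg]            # range Delta_i of the averages
    best = [a.index(max(a)) + 1 for a in avg]       # level maximizing the average SN ratio
    rank = sorted(range(n_par), key=lambda i: -rng[i])
    print(avg, rng, best, [r + 1 for r in rank])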
As mentioned earlier, the Taguchi method is applicable to situations where no
good mathematical model is available. There are many problems where mathematical
models can be developed to optimize the design of products or processes. In
this section, we use the tolerance design problem as an example to illustrate the
model-based optimum method.
Tolerance design is an important issue in mass production. The design variables
are the tolerances of individual components, which are subject to constraints from
machine tool capabilities and functional requirements, and the objective function
is the total cost per produced unit [9]. As such, two key issues are (a) specifying the
constraint conditions and (b) developing the cost model.
In a tolerance chain, there exists a resultant dimension that is derived from the
primary dimensions. The tolerance of the resultant dimension is a function of the
tolerances of the primary dimensions. The statistical method is commonly
employed for tolerance analysis. It is based on the assumption that the primary
dimensions are normally distributed.
Let x_r and t_r denote the resultant dimension and its tolerance, respectively; x_i
denote the ith primary dimension and t_i its tolerance; and x_r = G(x_i; 1 ≤ i ≤ n)
denote the dimension relationship. From the dimension relationship and the statistical
method, the tolerance relationship is given by
t_r = [Σ_{i=1}^{n} (∂G/∂x_i)² t_i²]^{1/2}.    (8.22)
Due to the capability of machine tools, t_i has a lower bound t_i^min. The tolerance
of the resultant dimension determines the performance of the assembly, and hence an
upper bound t_r^max has to be specified for t_r. As a result, the constraints are given by
t_i ≥ t_i^min (1 ≤ i ≤ n), t_r ≤ t_r^max.    (8.23)
In the tolerance design problem, two main cost elements are manufacturing cost and
quality loss. As tolerances increase, the manufacturing cost decreases while the
quality loss increases, both in a complex way.
We first examine the manufacturing cost, which consists of machining cost and
assembly cost. Generally, small tolerances result in an increase in the manufacturing
cost due to the use of precision machining and measuring devices. The tolerance-cost
relationship can be obtained by fitting empirical data to a proper function. Two
typical functions are the negative exponential function and the reciprocal power
function; the former is given by
c_i(t_i) = a_i + b_i e^{−d_i t_i}.    (8.24)
The assembly cost is usually not sensitive to the tolerances of components, and
hence can be excluded from the optimal tolerance design problem. As such, the total
manufacturing cost of an assembly is given by
C_M = Σ_{i=1}^{n} c_i(t_i).    (8.25)
We now look at the quality loss. For an assembly, the functional requirement is
the resultant dimension. Let X denote the actual resultant dimension, and f(x)
denote the distribution of X for a batch of products. Assume that X follows the
normal distribution with mean x_r and standard deviation σ. We further assume
that the process capability index C_p = t_r/(3σ) is a constant, which is larger than or
equal to 1 (for process capability indices, see Sect. 14.4). Letting A = 1/(3C_p)², we have
σ² = A t_r².    (8.26)
For a given value of x, the loss function is given by Eq. (8.11). For a batch of
products, the average loss is given by
L(t_r) = ∫_{−∞}^{∞} L(x) f(x) dx = [(K_1 + K_2)/2] σ² = K A t_r²    (8.27)
where K = (K_1 + K_2)/2.
Under the assumption that C_M and L(t_r) are mutually independent, the total cost
is given by
C_T = Σ_{i=1}^{n} c_i(t_i) + K A t_r².    (8.28)
The optimal tolerances can be obtained by minimizing the total cost subject to the
constraints given by (8.23).
Example 8.3 Consider an assembly consisting of a shaft (x_1) and a hole (x_2). The
design variables are the tolerances of the shaft and hole, i.e., t_1 and t_2. The clearance is
x_r = x_2 − x_1, and the tolerance of x_r is given by t_r = (t_1² + t_2²)^{1/2}. The lower bound of
t_1 and t_2 is 0.05 mm, and the upper bound of t_r is 0.2 mm.
The empirical data for manufacturing costs are shown in Table 8.9. It is found
that the negative exponential model in Eq. (8.24) is suitable for fitting the data, and
the fitted parameters are shown in the last three rows of Table 8.9.
Assume that C_p = 1, i.e., A = 1/9. When t_r < 0.2, excess clearance has a larger
loss than insufficient clearance. Therefore, assume that K_1 = 130 and K_2 = 520 so
that K = 325 and L(t_r) = 36.11 t_r². As a result, the total cost is given by Eq. (8.28)
with these values.
The optimal solution is t_1 = 0.13 and t_2 = 0.15, corresponding to t_r = 0.20
and C_T = 6.86. The solution is found to be insensitive to the value of C_p in this
example.
References
1. Chan LK, Wu ML (2002) Quality function deployment: a literature review. Eur J Oper Res 143(3):463–497
2. Ganeshan R, Kulkarni S, Boone T (2001) Production economics and process quality: a Taguchi perspective. Int J Prod Econ 71(1):343–350
3. Han CH, Kim JK, Choi SH (2004) Prioritizing engineering characteristics in quality function deployment with incomplete information: a linear partial ordering approach. Int J Prod Econ 91(3):235–249
4. Hauser JR, Clausing D (1988) The house of quality. Harv Bus Rev 66(3):63–73
5. Keeney RL (1974) Multiplicative utility functions. Oper Res 22(1):22–34
6. Montgomery DC (2007) Introduction to statistical quality control, 4th edn. Wiley, New York
7. Roy RK (2001) Design of experiments using the Taguchi approach: 16 steps to product and process improvement. Wiley, New York
8. Taguchi G (1986) Introduction to quality engineering. Asian Productivity Organization, Tokyo
9. Wu CC, Chen Z, Tang GR (1998) Component tolerance design for minimum quality loss and manufacturing cost. Comput Ind 35(3):223–232
10. Yacout S, Boudreau J (1998) Assessment of quality activities using Taguchi's loss function. Comput Ind Eng 35(1):229–232
Chapter 9
Design Techniques for Reliability
9.1 Introduction
Design for reliability (DFR) is a process to ensure that customer expectations for
reliability are fully met. It begins early in the concept stage and focuses on iden-
tifying and designing out or mitigating potential failure modes. Many reliability
activities are conducted to determine, calculate, and achieve the desired reliability.
Main reliability-related issues considered in the design stage include
Specification of reliability requirements
Reliability analysis
Reliability prediction
Reliability allocation, and
Reliability improvement.
In this chapter, we focus on these issues.
The outline of the chapter is as follows. In Sect. 9.2, we discuss the process to
implement DFR. Sections 9.3–9.7 deal with each of the above-mentioned issues,
respectively. Finally, we briefly discuss reliability control and monitoring in
manufacturing and usage phases in Sect. 9.8.
Step 2: Carry out a qualitative reliability analysis to identify key reliability risks
and risk reduction strategies.
Step 3: Perform a reliability prediction so as to quantitatively evaluate design
options and identify the best.
Step 4: Allocate the product reliability requirements to its components (or failure
modes).
Step 5: Achieve the desired reliability using various reliability improvement
strategies such as deration, redundancy, preventive maintenance, and reliability
growth by development.
The DFR process continues to the manufacturing phase to control the reliability
of the produced product, and to the usage phase to monitor the field reliability of the
product so as to obtain necessary information for further design or process changes.
The main purpose is to compare feasible design options and prepare necessary
information for reliability allocation.
Once the quantitative analysis is carried out, the system-level reliability
requirements are allocated to the elements of the other hierarchical levels. The
specified reliability requirements at each hierarchical level are then further
translated into manufacturing requirements using the QFD discussed in the previous
chapter.
For more details about the reliability allocation process, see Refs. [4–6].
A product can be
a completely new product,
an upgrade of an existing product,
an existing product introduced to a new market or application, or
a product existing in the market but being new to the company.
Different types of product result in different changes in design, manufacturing,
usage environment, performance requirements, and so on. Changes imply risks, and
hence a thorough change point analysis will help to identify and understand design
and/or application changes introduced with this new product and associated risks in
a qualitative way.
9.4.2 FMEA
FMEA connects given initiating causes to their end consequences. For a given
design option, the main objectives of an FMEA are
to identify the items or processes to be analyzed,
to identify their functions, failure modes, causes, effects, and currently used
control strategies,
to evaluate the risk associated with the issues identified by the analysis, and
to identify corrective actions.
RPN = P × S × D    (9.1)
where D is the probability that the current control scheme cannot detect or prevent
the cause of failure.
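A minimal sketch of the RPN calculation of Eq. (9.1) and the resulting prioritization; the failure modes and their 1–10 ratings for occurrence (P), severity (S), and detection (D) are hypothetical.

    def rpn(p, s, d):
        # Eq. (9.1): risk priority number = occurrence x severity x detection
        return p * s * d

    # Hypothetical failure modes rated on 1-10 scales: (P, S, D)
    modes = {"seal leak": (4, 8, 6), "connector corrosion": (2, 5, 3)}
    ranked = sorted(modes.items(), key=lambda kv: -rpn(*kv[1]))
    for name, scores in ranked:
        print(name, rpn(*scores))   # highest RPN gets corrective action first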
FMEA can be extended to FMECA to include a criticality analysis. MIL-STD-1629A
[7] presents the procedures for conducting an FMECA. For each potential
failure mode, a criticality matrix is established to evaluate risk and prioritize cor-
rective actions. The horizontal axis of the matrix is the severity of the potential
effects of failure and the vertical axis is the likelihood of occurrence. For each
potential failure and for each item, a quantitative criticality value is calculated based
on failure probability analysis at a given operating time under the constant failure
rate assumption.
SAE J1739 [8] divides FMEA into design FMEA, process FMEA, and
machinery FMEA. The design FMEA is used to improve designs for products and
processes, the process FMEA can be used in quality control of manufacturing
process, and the machinery FMEA can be applied to the plant machinery and
equipment used to build the product. RCM is actually a systematic application of
the machinery FMEA.
Many applications (e.g., risk assessment, reliability prediction, etc.) require carrying
out a system reliability analysis. In system reliability analysis, system failure is
modeled in terms of the failures of the components of the system. There are two
different approaches to linking component failures to system failures: the bottom-up
approach and the top-down approach.
In the bottom-up approach, one starts with failure events at the component level
and then proceeds to the system level to evaluate the consequences of such failures
on system performance. FMEA uses this approach.
In the top-down approach, one starts at the system level and then proceeds
downward to the component level to link system performance to failures at the
component level. Fault tree analysis (FTA) uses this approach. A similar graphical
tool is reliability block diagram (RBD). In FTA or RBD, the state of the system can
be expressed in terms of the component states through the structure function. The
difference between FTA and RBD is that the former is failure-oriented and the latter
is success-oriented.
A fault tree is composed of basic events, top event, and logic gates. The basic
events are the bottom events of the fault tree, the top event is some particular system
failure mode, and the gates serve to permit or inhibit the passage of fault logic up
the tree. The inputs of the gate are the lower events, and the output is a higher event.
As such, the gates show the relationships between the input events and the output
event, and the gate symbol denotes the type of relationship. A fault tree shows the
logical interrelationships of basic events that lead to the top event.
A cut set is a combination of basic events that can cause the top event. A
minimal cut set is the smallest combination of basic events that result in the top
event. All the minimal cut sets for the top event represent the ways that the basic
events can cause the top event. Through identifying all realistic ways in which the
undesired event can occur, the characteristics of the top event can be calculated.
The fault tree includes only those faults that contribute to this top event and are
assessed to be realistic. It is often used to analyze safety-related systems.
In a RBD, the system is divided into blocks that represent distinct elements
(components or modules). These elemental blocks are then combined according to
system-success pathways. Each of the blocks is often comprised of units placed in
series, parallel, or a combination of both. Based on the RBD, all system-success
pathways are identified and the overall system reliability can be evaluated.
An RBD is developed for a given system function. If the system has more than
one function, each function must be considered individually. The RBD cannot
effectively deal with complex repair and maintenance strategies, and hence the
analysis is generally limited to the study of the time to first failure.
The state of the system is given by a function φ(X(t)), which is called the
structure function, with
X_S(t) = φ(X(t)).    (9.2)
The form of φ(X) depends on the RBD. The reliability structure of most systems
can be represented as a network involving series, parallel, and k-out-of-n connections.
For the system with series structure, the system fails whenever a component
fails. In this case, the structure function is given by
φ(X) = Π_{i=1}^{n} X_i.    (9.3)
For the system with parallel structure, the system fails only when all the
components fail. In this case, the structure function is given by
φ(X) = 1 − Π_{i=1}^{n} (1 − X_i).    (9.4)
For the k-out-of-n system, which functions if at least k of its n components
function, the structure function is
φ(X) = 1 if Σ_{i=1}^{n} X_i ≥ k, and φ(X) = 0 otherwise.    (9.5)
In particular, when k = 1, the k-out-of-n system reduces to the system with parallel
structure; and when k = n, it reduces to the system with series structure. In the latter
case, the components need not be identical or similar.
A component is said to be irrelevant if the system state is not affected by the state
of the component; and a system is said to be coherent if it does not have irrelevant
components.
Example 9.1 Consider a two-out-of-three system. In this case, y = X_1 + X_2 + X_3.
The event y ≥ 2 corresponds to the following four mutually exclusive events:
no component fails, so that X_s = X_1 X_2 X_3,
the third component fails, so that X_s = X_1 X_2 (1 − X_3),
the second component fails, so that X_s = X_1 (1 − X_2) X_3, and
the first component fails, so that X_s = (1 − X_1) X_2 X_3.
As a result, the structure function of the system is given by their sum, i.e.,
φ(X) = X_1 X_2 + X_1 X_3 + X_2 X_3 − 2 X_1 X_2 X_3.    (9.6)
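The structure functions above can be checked mechanically. The following sketch verifies Eq. (9.6) against the k-out-of-n definition by enumerating all component states.

    from itertools import product

    def phi_series(x):            # Eq. (9.3)
        out = 1
        for xi in x:
            out *= xi
        return out

    def phi_parallel(x):          # Eq. (9.4)
        out = 1
        for xi in x:
            out *= (1 - xi)
        return 1 - out

    def phi_k_of_n(x, k):         # Eq. (9.5)
        return 1 if sum(x) >= k else 0

    # Verify Eq. (9.6) against the 2-out-of-3 definition over all component states
    for x1, x2, x3 in product((0, 1), repeat=3):
        eq96 = x1*x2 + x1*x3 + x2*x3 - 2*x1*x2*x3
        assert eq96 == phi_k_of_n((x1, x2, x3), 2)
    print("Eq. (9.6) verified")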
The reliability function of a system can be derived from its structure function.
Assume that component failures are statistically independent, and that all components
are new and working at t = 0. For simplicity, we focus on the distribution of the
time to first failure of the system.
The reliability functions of the components are given by
R_i(t) = Pr{X_i(t) = 1}.    (9.7)
Let F_S(t) and F_i(t) denote the failure distributions for the system and component i,
respectively. We have R_S(t) = 1 − F_S(t) and R_i(t) = 1 − F_i(t). Since the component
and system states are binary valued, we have
R_S(t) = Pr{φ(X(t)) = 1} = E[φ(X(t))].    (9.9)
For independent components, this yields R_S(t) = φ(p(t)) with p(t) = (R_1(t), …, R_n(t)),
so that
F_S(t) = 1 − φ(p(t)).    (9.11)
For the system with series structure, the system reliability function is the
competing risk model given by Eq. (4.29). For the system with parallel structure, the
system distribution function is the multiplicative model given by Eq. (4.32). For the
k-out-of-n system with n identical components, the system reliability function is
given by
R_S(t) = Σ_{x=k}^{n} C(n, x) p^x (1 − p)^{n−x}, p = R_i(t).    (9.12)
[Fig. 9.1: R_{S,n}(t) versus t for n = 2, 3, and 4]
This implies that the k-out-of-n system with exponential components can be aging.
For k = 2, Fig. 9.1 shows the plots of R_{S,n}(t). As seen, the system reliability is
considerably improved as n increases. In fact, the B_10 life equals 0.0527, 0.2179,
and 0.3863 for n = 2, 3, and 4, respectively. As a result, B_10(n = 3)/B_10(n = 2) ≈ 4.1
and B_10(n = 4)/B_10(n = 2) ≈ 7.3.
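The B_10 values quoted above can be reproduced numerically. The sketch below evaluates Eq. (9.12) for a 2-out-of-n system of identical exponential components (taking λ = 1) and solves R(t) = 0.9 by bisection.

    from math import comb, exp

    def r_k_of_n(t, k, n, lam=1.0):
        # Eq. (9.12) for identical exponential components with rate lam
        p = exp(-lam * t)
        return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

    def b10(k, n):
        # Solve R(t) = 0.9 by bisection (R is decreasing in t)
        lo, hi = 0.0, 10.0
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if r_k_of_n(mid, k, n) > 0.9 else (lo, mid)
        return lo

    for n in (2, 3, 4):
        print(n, round(b10(2, n), 4))   # 0.0527, 0.2179, 0.3863 as in the text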
The overall product reliability should be estimated early in the design phase.
Reliability prediction uses mathematical models and component reliability data to
estimate the field reliability of a design before field failure data of the product are
available. Though the estimates obtained from reliability prediction can be rough
since real-world failure data are not available, such estimates are useful for identifying
potential design weaknesses and for comparing different designs and their life cycle costs.
Reliability prediction requires knowledge of the components' reliabilities,
the design, the manufacturing process, and the expected operating conditions.
Typical prediction methods include the empirical method, the physics of failure
method, the life testing method, and the simulation method (e.g., see Refs. [9, 10]).
Each method has its advantages and disadvantages and can be used in different
situations. We briefly discuss them as follows.
The empirical methods (also termed the parts count approach) are based on the
statistical analysis of historical failure data collected in the field, and can be used to
quickly obtain a rough estimate of product field reliability. The empirical methods
assume:
(a) all components in the system are in series,
(b) component failures are mutually independent, and
(c) failure rates of the components are constant.
Under these assumptions, the product failure rate is given by
λ = Σ_{i=1}^{n} λ_{b,i} π_S π_T π_E π_Q π_A    (9.13)
where λ_{b,i} is the basic failure rate of the ith component, which usually comes from
reliability databases or handbooks for similar components, and π_S, π_T, π_E, π_Q, and
π_A are adjustment coefficients reflecting the effects of stresses, temperature,
environment, quality specifications, and component complexity, respectively. A
coefficient is 1 if the actual conditions are consistent with the reference conditions;
otherwise, it is larger or smaller than 1.
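A minimal parts count sketch of Eq. (9.13); the base failure rates and π factors below are hypothetical rather than handbook values.

    # Parts count calculation of Eq. (9.13); all values are hypothetical
    parts = [
        # (base failure rate per 1e6 h, pi_S, pi_T, pi_E, pi_Q, pi_A)
        (0.020, 1.0, 1.2, 2.0, 1.0, 1.0),   # resistor
        (0.150, 1.1, 1.5, 2.0, 0.8, 1.3),   # capacitor
        (0.500, 1.0, 2.0, 2.0, 1.0, 1.5),   # IC
    ]

    lam = sum(lb * pS * pT * pE * pQ * pA for lb, pS, pT, pE, pQ, pA in parts)
    mtbf = 1e6 / lam                        # constant failure rate assumption
    print(f"lambda = {lam:.3f} per 1e6 h, MTBF = {mtbf:.0f} h")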
There are differences between the various prediction methods. For example, the
Telcordia predictive method [13, 14] allows combining historical data with data
from laboratory tests and field tracking data, and its correction part only considers
the quality factor, the electrical stress factor, and the temperature stress factor. The
correction part of the RDF 2000 method [15] is based on mission profiles, which
comprehensively reflect the effects of mission operational cycling, ambient temperature
variation, and so on.
The empirical methods are simple and can be used in early design phases when
information is limited. However, the data may be out-of-date, and it is hard to
adequately specify the adjustment coefficients.
When a component has multiple failure modes, the component's failure rate is the
sum of the failure rates of all failure modes. Similarly, the system's failure rate is
the sum of the failure rates of the components involved. Several popular models
are briefly outlined as follows (for more details, see Ref. [16] and the literature cited
therein).
Arrhenius model. This model describes the relation between the time to failure
and temperature. It is based on the phenomenon that chemical reactions can be
accelerated by increasing the operating temperature. The model is given by
L(T) = A e^{E_a/(kT)}    (9.14)
where L(T) is the life characteristic (e.g., MTBF or median life), T is the absolute
temperature, k is the Boltzmann constant (= 1/11605 eV/K), A is a constant to be
specified, and E_a is the activation energy, which depends on the product or material
characteristics.
Eyring model. The Eyring model extends the Arrhenius model to include a second
stress (e.g., voltage, V); see Eq. (9.16), in which L(T) is given by Eq. (9.14). The
model given by Eq. (9.16) can be extended to include a third stress (e.g., humidity, H)
with the inverse power relation given by Eq. (9.17), where L(T, V) is given by
Eq. (9.16) and c is a constant. A variant of Eq. (9.17) (also termed the corrosion
model) is given by Eq. (9.18).
Fatigue model. A fatigue failure occurs when the cumulative damage exceeds its
critical value. The number of cycles to failure is given by
N_f = L(T_max)/(f^a ΔT^b)    (9.19)
where L(T_max) has the form of Eq. (9.14), f is the cycling frequency, ΔT is the
temperature range, and a and b are constants.
The physics of failure methods can provide accurate predictions, but they need
detailed component manufacturing information (e.g., material, process, and design
data) and operational condition information (e.g., life cycle load profile). Due to the
need for detailed information, complex systems are difficult to model physically,
and hence these methods are only applicable at the component level.
The physics of failure models have important applications in accelerated life test
design and data analysis. This will be further discussed in the next chapter.
Life testing methods are used to determine reliability by testing a relatively large
sample of units operating under normal or higher stresses. The data can be fitted to
an appropriate life distribution using the statistical methods discussed in Chap. 5, and
reliability metrics can be estimated from the fitted life distribution model.
The prediction results obtained from the life testing method are usually more
accurate than those from the empirical method since the prediction is based on
failure data from the particular product.
The life testing method is product-specific. It is particularly suited to obtaining
realistic predictions at the system level because the prediction results at the system
level obtained from the empirical and physics of failure methods may be inaccurate
due to the fact that their assumptions can be unrealistic. However, the life testing
method can be costly and time-consuming.
Reliability allocation aims to establish a target reliability for each level in the
product structure. It first allocates the overall target reliability of a product to its
subsystems, and then allocates the sub-target reliability of each subsystem to its
components. Similar to reliability prediction, the underlying assumptions used in
reliability allocation are:
(a) all components in the system are in series,
(b) component failures are mutually independent, and
(c) failure rates of the components are constant.
The allocation methods depend on whether the system is nonrepairable or
repairable. We separately discuss these two cases as follows.
For a nonrepairable series system of n identical subsystems, equal allocation gives
R_s = Π_{i=1}^{n} R_i = R_0^n, R_i = R_s^{1/n}.    (9.20)
The ARINC method assumes that the current failure rates satisfy
Σ_{i=1}^{n} λ_i^c > λ_s    (9.21)
where λ_i^c is the current failure rate of subsystem i and λ_s is the required system
failure rate. To reach the system failure rate goal, some improvement efforts must
be made to reduce λ_i^c to λ_i. The ARINC method reduces the current failure rates by
equal percentages. The required failure rate reduction factor is given by
r = λ_s / Σ_{i=1}^{n} λ_i^c < 1.    (9.22)
As such, λ_i is calculated as
λ_i = r λ_i^c.    (9.23)
λ_0 = λ_s/N = −ln[R_s(τ)]/(τN),    (9.24)
λ_i = n_i λ_0,    (9.25)
where n_i is the number of components in subsystem i and N = Σ_{i=1}^{n} n_i is the total
number of components. Equation (9.25) indicates that the failure rate allocated to
each subsystem is proportional to the number of components it contains (i.e., its
complexity).
We now look at the second step. Let w_i (∈ (0, 1]) denote the importance of the
ith subsystem and λ_i^w denote the failure rate allocated to the ith subsystem after
considering subsystem importance. The importance is subjectively determined
based on experts' judgments. If subsystem i is important, a large value is assigned
to w_i; otherwise, a small value is assigned.
If w_i = 1, a high subsystem failure rate cannot be tolerated; if w_i = 0, a subsystem
failure actually has no impact on the system. This implies that the allocated failure
rate should be inversely proportional to w_i [19]. Based on this idea, the failure rate
adjusted after considering the importance is given by
λ_i^w = k λ_i / w_i, k = λ_s / Σ_{i=1}^{n} (λ_i/w_i).    (9.26)
Example 9.3 Consider a safety-related system with a system reliability goal of
10⁻³ failures per year. The system comprises three subsystems, and the complexity
and importance information of the subsystems is shown in the second and third
columns of Table 9.1, respectively. The problem is to allocate the target failure rate
to each subsystem.
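A sketch of the two-step allocation of Eqs. (9.24)–(9.26); the complexity counts and importance weights below are hypothetical stand-ins for Table 9.1.

    # Hypothetical subsystem data (Table 9.1 is not reproduced here)
    n_i = [20, 50, 30]            # component counts (complexity)
    w_i = [1.0, 0.5, 0.8]         # importance weights in (0, 1]
    lam_s = 1e-3                  # system failure rate goal (failures per year)

    # Equal-complexity allocation, Eqs. (9.24)-(9.25)
    N = sum(n_i)
    lam = [ni * lam_s / N for ni in n_i]

    # Importance adjustment, Eq. (9.26): allocated rate inversely proportional to w_i
    k = lam_s / sum(l / w for l, w in zip(lam, w_i))
    lam_w = [k * l / w for l, w in zip(lam, w_i)]
    print(lam_w, sum(lam_w))      # adjusted rates sum back to lam_s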
For the repairable case, the availability of subsystem i is
A_i = 1/(1 + λ_i τ_i)    (9.27)
where τ_i is its mean time to repair. Equal availability implies that λ_i τ_i is a constant.
Let k denote this constant, which will be derived later. As such, the failure rate
allocated to the ith subsystem is given by
λ_i = k/τ_i.    (9.28)
We now derive the expression for k. The failure rate of the system is given by
λ_s = Σ_{i=1}^{n} λ_i.    (9.29)
The mean time to failure of the system is given by μ_s = 1/λ_s. The expected
downtime for each system failure is given by
τ_s = Σ_{i=1}^{n} (λ_i/λ_s) τ_i = nk/λ_s.    (9.30)
A_s = 1/(1 + λ_s τ_s) = 1/(1 + nk).    (9.31)
k = (1/A_s − 1)/n.    (9.32)
Example 9.4 Consider the system discussed in Example 9.3 and assume that the
reliability goal is A_s = 0.999. The mean time to repair of each subsystem is
shown in the second row of Table 9.2. The problem is to determine the failure rate
of each subsystem.
From Eq. (9.32), we have k = 0.3337 × 10⁻³; and from Eq. (9.28), we have the
values of λ_i shown in the last row of Table 9.2.
When the predicted reliability is poorer than the required reliability target, the
reliability must be improved by design and development. The techniques to achieve
the desired reliability include component deration and selection, redundancy
design, preventive maintenance (including condition monitoring), and reliability
growth through development. We separately discuss these techniques as follows.
The probability of failure of the product can be decreased by limiting its maximum
operational and environmental stresses (e.g., temperature, pressure, etc.) to the level
below the capabilities of the components or by adopting the components with larger
capabilities. The former is called the component deration and the latter is called the
component selection. The criteria for the components (or materials) selection are
components reliability and its ability to withstand the expected environmental and
operational stresses. Components with better load bearing ability are preferred.
Selection of high performance components or materials increases service life and
reduces the maintenance cost but increases manufacturing cost. As such, the
selection decisions can be optimally made based on a life cycle costing analysis.
The component deration and selection require information on component
reliability and on operational and environmental stresses, and the failure probability
can be quantitatively evaluated using the stress–strength interference model if
component failure is due to an overstress mechanism.
Because of manufacturing variability, the strength of a component, X, may vary
significantly. For example, the fracture and fatigue properties of an engineering
material usually exhibit greater variability than the yield strength and the tensile
strength. As such, the strength is a random variable with distribution [density]
function F_X(x) [f_X(x)]. When the component is put into use, it is subjected to a
stress, Y, which is also a random variable. Let F_Y(y) [f_Y(y)] denote the distribution
[density] function of Y. If X is larger than Y, then the strength of the component is
sufficient to withstand the stress, and the component is functional. When a shock
occurs, the stress may be larger than the strength, so that the component fails
immediately because its strength is not sufficient to withstand the stress to which it
is subjected. Assume that X and Y are independent. The reliability R that the
component will not fail when put into operation can be obtained using a
conditional approach.
Conditional on Y = y, we have
R = P{X > Y} = ∫_0^∞ [1 − F_X(y)] f_Y(y) dy.    (9.34)
Alternatively, conditioning on X = x,
R = P{Y < X} = ∫_0^∞ F_Y(x) f_X(x) dx.    (9.35)
When X and Y are normally distributed with means μ_2 and μ_1 and standard
deviations σ_2 and σ_1, respectively, we have R = Φ[(μ_2 − μ_1)/(σ_1² + σ_2²)^{1/2}].
Clearly, the reliability increases as σ_2 decreases (i.e., the component has small
variability in strength) and as μ_2 − μ_1 increases (i.e., the component has a large
margin of safety).
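A minimal sketch of the normal stress–strength calculation; the strength and stress parameters below are hypothetical.

    from math import erf, sqrt

    def Phi(z):   # standard normal cdf
        return 0.5 * (1 + erf(z / sqrt(2)))

    def reliability(mu_str, sd_str, mu_stress, sd_stress):
        # Normal stress-strength interference: R = P{X > Y}
        # = Phi((mu_X - mu_Y) / sqrt(sd_X^2 + sd_Y^2)) for independent normals
        return Phi((mu_str - mu_stress) / sqrt(sd_str**2 + sd_stress**2))

    # Hypothetical strength X ~ N(500, 40^2) and stress Y ~ N(350, 30^2) (MPa)
    print(reliability(500, 40, 350, 30))   # about 0.9987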
In the above discussion, the distributions of stress and strength are assumed to be
independent of time. However, the strength X(t) usually degrades with time so that
it is nonincreasing, and the stress Y(t) can change with time in an uncertain manner.
In this case, the time to failure T is the first time instant at which X(t) falls below
Y(t), i.e., T = inf{t: X(t) < Y(t)}.
We first derive the expression for the failure rate function and then study the
effect of the strength on the failure rate. Suppose that shocks occur according to a
Poisson process with rate λ and that the item, with strength x, survives a given
shock with probability G(x). Let T denote the time to failure and F(t) denote its
distribution function. The probability that the item survives to time t is given by
R(t) = P{T > t} = Σ_{n=0}^{∞} p_n [G(x)]^n    (9.39)
where
p_n = (λt)^n e^{−λt}/n!.    (9.40)
Substituting Eq. (9.40) into Eq. (9.39) and simplifying, we have
R(t) = exp{−λ[1 − G(x)]t}. This implies that the time to failure follows an
exponential distribution with the failure rate function
r(t) = λ[1 − G(x)].    (9.42)
mean of the stress). As seen, the failure rate quickly decreases as the strength
increases. For large x/μ_y (e.g., > 2), the failure rate decreases as the dispersion of
the stress (represented by σ_l) decreases. This illustrates that the failure rate in the
normal usage phase can be controlled through design.
[Figure: failure rate versus x/μ_y for σ_l = 0.8, 1.0, and 1.5]
9.7.2 Redundancy
The component reliability target can be achieved through the use of preventive
maintenance. Typical preventive maintenance actions include inspection, replace-
ment, and condition monitoring.
The failure of safety-related components can be non-self-announced. In this
case, periodic inspection must be performed to check their state. The key parameter
to be determined is the inspection interval (RCM calls it the failure-finding interval). A
shorter interval leads to a smaller failure downtime, which is the time interval
between the occurrence and the identification of a failure. On the other hand, too frequent
inspections will lead to more production disturbances. As such, two issues for
scheduling the inspection are to optimize the inspection interval and to group the
inspection and preventive maintenance actions (e.g., see Ref. [22]).
The DFR efforts will continue to the manufacturing and usage phases. In this
section, we briefly discuss the reliability activities in these two phases.
When the product goes into production, the DFR efforts will focus primarily on
reducing or eliminating problems introduced by the manufacturing process, and
include the activities such as quality inspections, supplier control, routine tests,
9.8 Reliability Control and Monitoring 167
measurement system analysis, and so on. Relevant techniques associated with these
activities will be discussed in Chaps. 12–15.
Continuous monitoring and field data analysis are needed to observe and analyze
the behavior of the product in its actual use conditions. The experiences and lessons
obtained from this process are useful for further improvements or in future projects.
Failure reporting, analysis and corrective action systems (FRACAS) is a tool
used to capture such knowledge throughout the product development cycle. The
basic functions of FRACAS include data reporting, data storage, and data analysis.
When a failure is reported, failure analysis is carried out to identify the root cause of
the failure. Once this is done, the corrective action is identified using an appropriate
approach such as the identify–design–optimize–validate approach or the define–
measure–analyze–improve–control approach. In this way, FRACAS accumulates a
great deal of information useful for resolving reliability-related issues during the
product life cycle. For example, a model for field reliability can be obtained through
failure data analysis, and the fitted model can be used to predict expected failures
under warranty and the demand for key spare parts. In addition, field data analysis
helps to identify reliability bottleneck problems, which is useful for improving
future generations of the same or similar products.
References
1. Griffin A, Hauser JR (1993) The voice of the customer. Marketing Sci 12(1):1–27
2. Kano N, Seraku N, Takahashi F et al (1984) Attractive quality and must-be quality. J Jpn Soc Qual Control 14(2):39–48
3. Straker D (1995) A toolbook for quality improvement and problem solving. Prentice Hall, New York
4. Murthy DNP, Rausand M, Virtanen S (2009) Investment in new product reliability. Reliab Eng Syst Saf 94(10):1593–1600
5. Murthy DNP, Østerås T, Rausand M (2009) Component reliability specification. Reliab Eng Syst Saf 94(10):1609–1617
6. Murthy DNP, Hagmark PE, Virtanen S (2009) Product variety and reliability. Reliab Eng Syst Saf 94(10):1601–1608
7. US Department of Defense (1980) Procedures for performing a failure mode, effects and criticality analysis. MIL-STD-1629A
8. Society of Automotive Engineers (2000) Surface vehicle recommended practice. J1739
9. Denson W (1998) The history of reliability prediction. IEEE Trans Reliab 47(3):321–328
10. O'Connor PDT, Harris LN (1986) Reliability prediction: a state-of-the-art review. IEE Proceedings A: Physical Science, Measurement and Instrumentation, Management and Education, Reviews 133(4):202–216
11. US Military (1992) Reliability prediction of electronic equipment. MIL-HDBK-217F Notice 1
12. US Military (1995) Reliability prediction of electronic equipment. MIL-HDBK-217F Notice 2
13. Telcordia (2001) Reliability prediction procedure for electronic equipment. SR-332 Issue 1
14. Telcordia (2006) Reliability prediction procedure for electronic equipment. SR-332 Issue 2
15. IEC TR 62380 (2004) Reliability data handbook – universal model for reliability prediction of electronics components, PCBs and equipment. International Electrotechnical Commission, Geneva, Switzerland
16. Escobar LA, Meeker WQ (2006) A review of accelerated test models. Stat Sci 21(4):552–577
17. Minehane S, Duane R, O'Sullivan P et al (2000) Design for reliability. Microelectron Reliab 40(8–10):1285–1294
18. US Military (1998) Electronic reliability design handbook, Revision B. MIL-HDBK-338B, pp 6–19
19. Amari SV, Hegde V (2006) New allocation methods for repairable systems. In: Proceedings of the 2006 annual reliability and maintainability symposium, pp 290–295
20. Kuo W, Wan R (2007) Recent advances in optimal reliability allocation. IEEE Trans Syst Man Cybernet Part A 37(2):143–156
21. Murthy DNP, Xie M, Jiang R (2003) Weibull models. Wiley, New York
22. Jiang R, Murthy DNP (2008) Maintenance: decision models for management. Science Press, Beijing
Chapter 10
Reliability Testing and Data Analysis
10.1 Introduction
According to the time at which testing is conducted, testing can be grouped into three
categories: developmental testing, manufacturing testing, and field operational
testing.
The tests carried out during the product development stage focus on discovering
failure modes and improving reliability, and provide information on degradation and
the reliability of failure modes. The tests include development tests, reliability growth
tests, and reliability demonstration tests.
Development tests are conducted at material, part, and component levels, and can
be divided into performance testing and life testing. The performance testing
includes critical item evaluation and part qualification testing as well as environmental
and design limit testing; and the life testing includes testing to failure, ALT,
and ADT.
The critical item evaluation and part qualification testing is conducted at part
level. It deals with testing a part under the most severe conditions (i.e., maximum
operating stress level, which is larger than the nominal operating stress level)
encountered under normal use in order to verify that the part is suitable under those
conditions. The tests to be performed depend on the product. For example, for
electronic components the temperature and humidity tests are the most commonly
conducted tests.
The environmental and design limit testing is conducted at part, subsystem, and
system levels and at the extreme stress level (i.e., the worst-case operating condi-
tions with the stress level larger than the maximum operating stress level under
normal use). It applies environmentally induced stresses (e.g., vibration loading due
to road input for automotive components) to the product. The test can be conducted
using accelerated testing with time-varying load. These tests aim to assure that the
product can properly perform at the extreme conditions of its operating profile. Any
failures resulting from the test are analyzed through root cause analysis and fixed
through design changes.
Life testing deals with observing the times to failure for a group of similar items.
In some test situations (e.g., one-shot devices), one observes whether the test item is
a success or a failure rather than the time of failure.
Reliability growth testing is conducted at system or subsystem level by testing
their prototypes to failure under increasing levels of stress. Each failure is analyzed,
and some of the observed failure modes are fixed. The corrective actions lead to
reductions in failure intensities, and hence reliability is improved.
Reliability demonstration testing is conducted at system or subsystem level. The
purpose is to demonstrate that the designed product meets its requirements before it
is acceptable for large volume production or goes into service. It deals with testing a
sample of items under operational conditions.
This chapter focuses on life testing-related issues and the reliability growth
testing-related issues are discussed in the next chapter.
Tests carried out during the manufacturing phase are called manufacturing tests.
Manufacturing tests are used to verify or demonstrate final-product reliability or to
remove defective products before shipping. Such tests include environmental
stress screening and burn-in. These are further discussed in Chap. 15.
Field operational testing can provide useful information relating to product reliability
and performance in the real world. The testing needs the joint effort of the
manufacturer and the users. A useful tool to collect field reliability information is
FRACAS, which was mentioned in the previous chapter.
Accelerated testing has been widely used to obtain reliability information about a
product and to evaluate the useful life of critical parts in a relatively short test time.
There are two ways to accelerate the failure of a product [4]:
The product works under more severe conditions than the normal operating
conditions. This is called accelerated stress testing.
The product is used more intensively than in normal use without changing the
operating conditions. This is called accelerated failure time testing. This approach is
suitable for products or components that are not constantly used.
Accelerated stress testing (also termed ALT) is used for situations where
products are constantly used, such as the components of a power-generating unit.
Such tests are often used to evaluate the useful life of critical parts or components of
a system. Accelerated testing with an evaluation purpose is sometimes termed
quantitative accelerated testing. In this case, the results of accelerated stress testing
are related to the normal conditions by using a stress-life relationship model. The
underlying assumption for such models is that the components operating under
normal conditions experience the same failure mechanism as those occurring at
accelerated stress conditions. As such, the range of the stress level must be chosen
from operational conditions to the maximum design limits. Since the results are
obtained through extrapolation, the accuracy of the inference depends strongly on
the adequacy of the stress-life relationship model and on the degree of extrapolation
(i.e., the difference between the test stress and the normal stress). Compared with
accelerated stress testing, accelerated failure time testing is preferred since it does
not need the stress-life relationship for the purpose of extrapolation.
If the purpose is to identify failure modes rather than to evaluate the lifetime,
very high stress can be used and the testing is called highly accelerated stress testing
or qualitative accelerated testing. Usually, a single stress is increased step-by-step
from one level to another until the tested item fails. While the test time can be
considerably reduced, the testing may introduce new failure modes, and the interactions
among different stresses may be ignored (see Ref. [8]).
The models or/and methods for ALT data analysis depend on the loading scheme.
According to the number of stresses and whether the stresses change with time,
there are the following three typical loading schemes (see Ref. [9]):
Single factor constant stress scheme. It involves only a single stress; each test
item experiences a fixed stress level, but different items can experience different
stress levels.
Multiple factors constant stress scheme. It involves several stresses and the
levels of stresses remain unchanged during testing.
Time-varying stress scheme. It involves one or more stresses and the stresses
change with time. A typical example is the step-stress testing.
Accelerated testing usually involves a single stress factor. In this case, there are
three typical stress test plans: constant stress test plan, step-stress test plan, and tests
with progressive censoring [11].
Under a constant stress test (see Fig. 10.1), the test is conducted at several stress
levels. At the ith stress level, n_i items are tested. The test terminates when a
prespecified criterion (e.g., a prespecified test time or failure number) is met. Time
to failure depends on the stress level.
In the constant stress test plan, many of the test items will not fail during the
available time if the stress level is not high enough. The step-stress test plan can
avoid this problem. Referring to Fig. 10.2, items are first tested at a constant stress
level s_1 for a period of time t_1. The surviving items are then tested at the next higher
level of stress for another specified period of time (i.e., t_i − t_{i−1}). The process is
[Fig. 10.1: constant stress test plan — n_1, n_2, n_3 items tested at stress levels s_1, s_2, s_3, with times to failure decreasing as stress increases]
[Fig. 10.2: step-stress test plan — stress stepped up from s_1 to s_2 to s_3 at times t_1, t_2, t_3, with the resulting F(t)]
continued until a prespecified criterion is met. A simple step-stress test involves
only two stress levels.
In tests involving progressive censoring (see Fig. 10.3), a fraction of the surviving
items is removed at several prespecified time instants to carry out detailed
studies relating to the degradation mechanisms causing failure (e.g., to obtain the
degradation measurement of a certain performance characteristic). Two typical
progressive censoring schemes are the progressive type-I and type-II censoring schemes.
In the type-I censoring scheme, n items are put on life test at time zero, and n_i of
the K_i surviving items (n_i < K_i) are randomly removed from the test at the prespecified
censoring time t_i (1 ≤ i ≤ m − 1). The test terminates at the prespecified time t_m, or
at an earlier time instant when the last item fails. This is a general fixed-time
censoring scheme, as the case shown in Fig. 10.3.
In the type-II censoring scheme, the censoring time t_i is the time when the k_i-th
failure occurs. As such, the test terminates at the occurrence of the k_m-th failure, so
that the test duration is a random variable. This is a general fixed-number censoring
scheme.
[Fig. 10.3: progressive censoring — n_1, n_2, n_3 items removed at times t_1, t_2, t_3, with the test terminating at t_m]
Referring to Fig. 10.4, ALT data analysis involves two models. One is the distri-
bution model of lifetime of a product at a given stress level and the other is the
stress-life relationship model, which relates a certain life characteristic (e.g., mean
life, median life or scale parameter) to stress level.
Depending on the test plan, the life distribution models can be a simple model or a
piecewise (or sectional) model. We separately discuss them as follows.
Let s_0 denote the nominal stress level of a component under normal use conditions.
The components are tested at higher stress levels s. The time to failure is a random
variable that depends on s, so that the distribution function can be written as
F(t; s, θ). Here, t is called the underlying variable, s is sometimes called the covariate,
and θ is the set of distributional parameters. Some distributional parameters are
functions of stress while others are independent of stress.
Under the constant stress plan, the lifetime data obtained at different stress levels
are fitted to the same distribution family (e.g., the Weibull and lognormal distributions).
For the Weibull distribution with shape parameter β and scale parameter
η, it is usually assumed that the shape parameter is independent of stress and the
scale parameter depends on stress. For the lognormal distribution with parameters
μ_l and σ_l, the variable can be written as
x = (ln t − μ_l)/σ_l = ln[(t/e^{μ_l})^{1/σ_l}].    (10.1)
As seen from Eq. (10.1), e^{μ_l} is similar to the Weibull scale parameter and 1/σ_l is
similar to the Weibull shape parameter. Therefore, it is usually assumed that σ_l is
[Fig. 10.4: life distributions at increasing stress levels and the stress-life relationship L(s) = ψ(s), extrapolated to the distribution f(t; s_0) at the normal stress]
independent of stress, and μ_l is a function of stress. The life data analysis techniques
discussed in Chap. 5 can be used to estimate the distribution parameters.
Consider the step-stress test plan with k stress levels. Let τ_i (= t_i − t_{i−1}) denote the
test duration at stress level s_i and F_i(t) denote the life distribution associated with
the constant stress test at s_i. Assume that the F_i(t)'s come from the same distribution
family F(t). F_i(t) can be derived based on the concept of initial age.
When the test begins, the test item is new, so the item has an initial age of
zero. Therefore, we have F_1(t) = F(t) for t ∈ (0, t_1]. Now we consider F_2(t),
defined on t ∈ (t_1, t_2]. At t = t_1, the surviving item is no longer new since it has
operated for t_1 time units at s_1. Operating for t_1 time units at s_1 can be equivalently
viewed as operating for c_2 (c_2 < t_1) time units at s_2. The value of c_2 can be
determined by letting
F_1(t_1; θ_1) = F_2(c_2; θ_2)    (10.2)
where θ_i is the parameter set. When F_i(t) is the Weibull distribution with the common
shape parameter β and different scale parameters η_i, from Eq. (10.2) we have
c_2 = (t_1/η_1) η_2.    (10.3)
For the lognormal distribution with the common σ_l and different μ_i, we have
c_2 = t_1 exp(μ_2 − μ_1).    (10.4)
As a result, we have
F_2(t) = F(t − t_1 + c_2; θ_2).    (10.5)
Generally, c_i (i ≥ 2) is determined by
F_{i−1}(t_{i−1}; θ_{i−1}) = F(c_i; θ_i)    (10.6)
and
F_i(t) = F(t − t_{i−1} + c_i; θ_i).    (10.7)
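A sketch of the equivalent initial-age recursion for a Weibull step-stress test; the shape parameter, scale parameters, and switch times below are hypothetical.

    from math import exp

    def weibull_cdf(t, beta, eta):
        # F(t) = 1 - exp(-(t/eta)^beta)
        return 1 - exp(-((t / eta) ** beta))

    beta = 2.0                   # common shape parameter (assumed)
    eta = [100.0, 60.0, 35.0]    # hypothetical scale parameters at s1 < s2 < s3
    t1, t2 = 40.0, 70.0          # stress-change times

    c2 = (t1 / eta[0]) * eta[1]                 # Eq. (10.3)
    c3 = ((t2 - t1 + c2) / eta[1]) * eta[2]     # Eq. (10.6) for i = 3
    # Eqs. (10.5)/(10.7): F2(t) = F(t - t1 + c2; eta2), F3(t) = F(t - t2 + c3; eta3)
    print(c2, c3)
    print(weibull_cdf(t1, beta, eta[0]), weibull_cdf(c2, beta, eta[1]))  # equal by construction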
The stress-life relationship model is used to extrapolate the life distribution of the
item at stress level s_0. It relates a life characteristic L to the stress level s. Let
L(s) = ψ(s) denote this model. Generally, ψ(s) is a monotonically decreasing
function of s.
Relative to the life characteristic (e.g., MTTF or scale parameter) at the normal
stress level, an acceleration factor can be defined as
A(s) = L(s_0)/L(s) = ψ(s_0)/ψ(s).    (10.8)
Using the stress-life relationship or acceleration factor model, the life distribution
at stress level s_0 can be predicted. The accuracy of the life prediction strongly
depends on the adequacy of the model. As such, the key issue in ALT data analysis
is to appropriately determine the stress-life relationship.
Stress-life relationship models fall into three broad categories [4]: physics of
failure models, physics-experimental models, and statistical models. Physics of
failure models have been discussed in Sect. 9.5.2. We look at the other two cate-
gories of models as follows.
A physics-experimental model directly relates a life estimate to a physical
parameter (i.e., stress). For example, the relation between the median life and the
electric current stress is given by
t_{0.5} = a J^{−b}    (10.9)
where J is the current density. The relation between the median life and the humidity
stress is given by
t_{0.5} = a H^{−b} or t_{0.5} = a e^{−bH}    (10.10)
where H is the relative humidity. For more details about this category of models,
see Ref. [4].
The statistical models are also termed empirical models. Three typical
empirical models are the inverse power-law model, the proportional hazards model
(PHM), and the generalized proportional model. We discuss these models in the
following three subsections.
Let T_0 and T_s denote the times to failure of an item at stress levels s_0 and s > s_0,
respectively. T_s is related to T_0 by the inverse power-law relationship
T_s = T_0 (s_0/s)^c.    (10.11)
The PHM was developed by Cox [2] for modeling the failure rate involving
covariates. Let Z = (z_i; 1 ≤ i ≤ k) denote a set of covariates that affects the failure
rate of the item, and Z_0 = (z_{0i}; 1 ≤ i ≤ k) denote the covariates at the normal
conditions. The failure rate is modeled as
h(t; Z(t)) = h_0(t) φ(Z(t))    (10.14)
where h_0(t) is the baseline failure rate and φ(Z) is a positive covariate function
with φ(Z_0) = 1. A typical form is
φ(Z) = exp[Σ_{i=1}^{k} β_i (z_i − z_{0i})] = e^{−β_0} exp(Σ_{i=1}^{k} β_i z_i)    (10.16)
where β_0 = Σ_{i=1}^{k} β_i z_{0i}, and another is
φ(Z) = Π_{i=1}^{k} (z_i/z_{0i})^{β_i} = β_0 Π_{i=1}^{k} z_i^{β_i}    (10.17)
where β_0 = Π_{i=1}^{k} z_{0i}^{−β_i}. As such, ln φ(Z) is a linear function of Z (for
Eq. (10.16)) or of ln Z (for Eq. (10.17)).
When Z does not change with time, Eq. (10.14) can be written as
h(t; Z) = h_0(t) φ(Z).
In the ALT context, the PHM is particularly useful for modeling the effects of
multiple stresses on the failure rate; for example, when a product is subjected to
several stresses simultaneously, each stress can be treated as a covariate.
The generalized proportional model has the form Y(t; Z) = y_0(t) φ(Z) + ε(t; Z),
where Y(t; Z) can be the hazard rate, lifetime, residual life, failure intensity function,
cumulative failure number, or wear amount; t is the item's age or a similar variable;
y_0(t) is a deterministic function of t; φ(Z) > 0 is independent of t and meets
φ(Z_0) = 1; and ε(t; Z) is a stochastic process with a zero mean and standard
deviation function σ(t; Z). As such, the model consists of the baseline part y_0(t),
the covariate part φ(Z), and the stochastic part ε(t; Z).
The proportional intensity model is particularly useful for representing the
failure process of a repairable component or system. More details about the
generalized proportional model can be found in Ref. [6].
10.4.6 Discussion
In general, the reliability obtained from ALT data can be viewed as an approximation
of the inherent reliability. This is because it is hard for the test conditions to
be fully consistent with the actual use conditions. As such, accelerated testing is
used for the following purposes:
identifying problems,
comparing design options, and
obtaining a rough estimate of reliability at the component level.
Definition of stress is another issue that needs attention. A stress can be defined
in different ways, and hence the functional form of a stress-life relationship depends
on the way in which the stress is defined. For example, in the context of the Arrhenius
model, the temperature is measured using the Celsius scale T while the temperature
stress is usually written as
s = 1/(T + 273).    (10.21)
It is noted that a large stress level T gives a small value of s in Eq. (10.21).
Finally, when multiple failure modes and multiple stresses are involved, ALT data analysis and modeling are much more complex than in the cases discussed above.
Example 10.1 The data shown in Table 10.1 come from Example 6.8 of Ref. [4] and deal with the times to failure or censoring. The experiment is carried out at two temperature stress levels and the sample sizes n are different. Assume that the design temperature is 70 °C. The problem is to find the life distribution of the component at the design stress.
Assume that the time to failure follows the Weibull distribution and the shape parameter is independent of the stress. Since the stress is temperature, the stress-life relationship model can be represented by the Arrhenius model given by

$\eta(s) = ae^{cs}$  (10.22)

where s is given by Eq. (10.21). For the purpose of comparison, we also consider the Weibull PHM as an optional model. For the model associated with Eq. (10.16), we have

$\eta(s) = \eta_0 e^{-\beta s/b}.$  (10.23)

Letting η_0 = a and c = −β/b, Eq. (10.23) becomes Eq. (10.22). For the model associated with Eq. (10.17), we have

$\eta(s) = \eta_0 s^{c}.$  (10.24)

Noting that a small s implies a large stress level, Eq. (10.24) is consistent with the inverse power-law model given by Eq. (10.11). As a result, we consider the optional models given by Eqs. (10.23) and (10.24).
Using the maximum likelihood method for all the observations obtained at all the stress levels, we have the results shown in Table 10.2. As seen, though the two models have almost the same values of b and ln L, the predicted values of MTTF (μ_0) have a relative error of 13 %. This confirms the importance of using an appropriate stress-life relationship model. In this example, the appropriate model is the Arrhenius model given by Eq. (10.23). A sketch of such a fit is given below.
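The following is a minimal sketch of the ML fit; the censored data are hypothetical stand-ins (the Table 10.1 values are not reproduced here). The model is Weibull with a common shape b and an Arrhenius scale η(s) = a·e^{cs}, as in Eq. (10.22).

```python
import numpy as np
from scipy.optimize import minimize

def s_of(T_celsius):
    # Arrhenius-type stress transform, Eq. (10.21)
    return 1000.0 / (T_celsius + 273.0)

# (time, censored flag, stress) at two temperatures -- hypothetical data
data = [(150.0, 0, s_of(120)), (280.0, 0, s_of(120)), (410.0, 1, s_of(120)),
        (60.0, 0, s_of(160)), (110.0, 0, s_of(160)), (170.0, 1, s_of(160))]

def negloglik(theta):
    b, a, c = theta
    if b <= 0 or a <= 0:
        return np.inf
    ll = 0.0
    for t, cens, s in data:
        eta = a * np.exp(c * s)             # Arrhenius scale, Eq. (10.22)
        z = (t / eta) ** b
        if cens:
            ll += -z                        # survivor contribution
        else:
            ll += np.log(b / eta) + (b - 1.0) * np.log(t / eta) - z
    return -ll

fit = minimize(negloglik, x0=[1.5, 1.0, 2.0], method="Nelder-Mead")
print(fit.x)   # estimated (b, a, c); eta at 70 C is a * exp(c * s_of(70))
```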
10.5 Accelerated Degradation Testing Models

For some products, there is a gradual loss of performance, which accompanies one or more degradation processes. We confine our attention to the case where only a single degradation process is involved, which is usually a continuous stochastic process.
Let Y(t; s) denote the performance degradation quantity at time t and stress level s. Failure is defined at a specified degradation level, say y_f, so that the time to failure is given by

$t(s) = \inf\{t: Y(t; s) = y_f\}.$  (10.25)
If the life of an item is sufficiently long so that the time of testing to failure is still long even under accelerated stress conditions, one can stop the test before the item fails and extrapolate the time to failure based on the observed degradation measurements using a fitted degradation process model. This is the basic idea of ADT.
The ADT data analysis involves three models. One is the life distribution of t(s) given by Eq. (10.25). Let F(t; s) denote this distribution. The second model is the stress-life relationship model that represents the relationship between t(s) and s. The third model represents how Y(t; s) changes with t for a fixed stress level s, and it is called the degradation process model. The first two models are similar to the ALT models. As such, we focus on the degradation process model in this section.
The models for degradation can be roughly divided into two categories: physical-principle-based models and data-driven models. We discuss them separately below.
The data-driven models are also called statistical or empirical models. Two general degradation process models are the additive and multiplicative models. The general additive degradation model has the following form:

$Y(t) = \mu(t) + \varepsilon(t).$  (10.28)

The model can have an inverse-S-shaped mean degradation path, and the Wiener process model can be viewed as its special case (achieved when b = 1 and c = 0). In the general additive model, the degradation path can be nonmonotonic.
We now look at the multiplicative models. Let $c = \frac{Y(t + \Delta t) - Y(t)}{Y(t)}$ denote the degradation growth rate. The multiplicative degradation model assumes that the degradation growth rate is a small random perturbation ε(t). After n increments starting from the initial value y_0, we have $\ln(Y_n/y_0) = \sum_{i=1}^{n} \ln(1 + \varepsilon_i)$. According to the central limit theorem, ln(Y_n/y_0) approximately follows the normal distribution so that Y_n/y_0 approximately follows the lognormal distribution. As a result, the amount of degradation Y(t) approximately follows a lognormal degradation model at any time t. Assume that σ_l is independent of t and μ_l is dependent on t. The mean degradation function is given by

$\ln y_{0.5} = \mu_l(t).$  (10.33)
There are other degradation process models. Two such models are the gamma and Weibull process models. The stationary gamma process model assumes that the degradation increment over Δt follows the gamma distribution with shape parameter uΔt and scale parameter v. The mean degradation function is given by

$\mu(t) = uvt.$  (10.34)
The Weibull process model assumes that Y(t) follows the Weibull distribution with shape parameter β(t) and scale parameter η(t) for a given t [7]. The mean degradation function is given by

$\mu(t) = \eta(t)\,\Gamma(1 + 1/\beta(t)).$  (10.35)

A sketch of a gamma-process degradation simulation is given below.
10.5.3 Discussion
The underlying assumption for ADT is that the failure results from one or more observable degradation processes. A crucial issue is to appropriately select the degradation measurement based on engineering knowledge.

Different from ALT, ADT needs to extrapolate the time to failure. Predicted failure times can be considerably underestimated or overestimated if an improper degradation process model is fitted. Therefore, the necessity to appropriately specify the degradation process model must be emphasized.
Once the degradation process model is assumed, the lifetime distribution model is implicitly specified. This implicitly specified lifetime model may not match the explicitly assumed lifetime distribution in some characteristics, such as the shape of the failure rate function (see Ref. [1]). In this case, the assumption for the degradation model or the assumption for the lifetime model should be adjusted to make them consistent.
Finally, it is beneficial to combine ALT with ADT. This is because insufficient failure data can be supplemented by degradation data to increase product reliability information. The progressive censoring test plans discussed in Sect. 10.3.3 can be used for this purpose.
The data shown in Table 10.3 come from a type-I censoring ADT for electrical insulation. The degradation measurement is the breakdown strength in kV. The breakdown strength decreases with time in weeks, and depends on temperature (i.e., stress). The degradation tests are conducted at four stress levels. The failure threshold is a breakdown strength of 2 kV. In Ref. [10], the problem is to estimate the median lifetime at the design temperature of 150 °C. Here, we also consider the lifetime distribution.
The electrical insulation failure process is similar to the crack growth and propagation process, and hence the lognormal degradation process model appears appropriate. To specify this model, we need to specify the process parameters μ_l(t) and σ_l.
Let y(t) denote the breakdown voltage observed at time t. Noting that exp(μ_l(t)) is the median value of Y(t), the functional form of μ_l(t) can be obtained by examining the shape of the data plot of ln y versus t. It is found that ln y can be approximated by a linear function of t and hence μ_l(t) can be written as

$\mu_l(t) = a - t/b.$  (10.36)

For a given stress level, the parameters (a, b, σ_l) can be estimated using the maximum likelihood method. Once these parameters are specified, the median lifetime at each stress level can be obtained from Eq. (10.36) by letting μ_l(t) = ln y_{0.5}(t) = ln 2. As such, the median life can be estimated as

$t_{0.5} = b(a - \ln 2).$  (10.37)
To estimate the median lifetime at the design stress, we first fit the four median lifetime values at different stress levels to the Arrhenius model given by

$\ln t_{0.5} = c + ds$  (10.38)

where s is given by Eq. (10.21). The estimated model parameters are c = −11.1485 and d = 8.4294. From the fitted model, the log median lifetime at the design temperature 150 °C equals 8.7792 and the corresponding median lifetime equals 6497.5 weeks. The extrapolation process of the median lifetime is graphically shown in Fig. 10.5.
Let σ_0 and μ_0(t) denote the parameters of the lognormal degradation model at the design stress. The life distribution at the design stress is given by

$F(t) = \Phi(\ln 2;\ \mu_0(t), \sigma_0).$  (10.39)
Fig. 10.5 Extrapolation of the median lifetime: ln(t_{0.5}) versus s
Unless there is clear evidence that σ_l varies with stress, it is better to assume that it does not vary with stress. Under this assumption, we need to re-estimate the model parameters. The results are σ_l = 0.1477, and the values of a and b at different stress levels are almost the same as those shown in Table 10.4. This implies that the lognormal degradation model is insensitive to the value of σ_l in this example.
We now look at μ_0(t). Its functional form is the same as the one given by Eq. (10.36) with parameters a_0 and b_0. In Eq. (10.36), a = μ_l(0); in Eq. (10.38), $e^{\mu_l} = e^{c + ds}$. As such, a is a linear function of s. Based on the data of a and b in Table 10.4, regression and extrapolation yield a_0 = 2.8675. From Eq. (10.36), b has the dimension of the lifetime, and hence the relation between b and s follows the Arrhenius model. Through regression and extrapolation, we have b_0 = 2962.23. As such, from Eq. (10.39) the life distribution at the design stress is given by

$F(t) = \Phi\!\left(\frac{\ln 2 - a_0 + t/b_0}{\sigma_0}\right).$  (10.40)

Clearly, it is the normal distribution with μ = 6440.93 and σ = 437.55. This yields another estimate of the median lifetime, i.e., 6440.93. The relative error between this estimate and the estimate obtained earlier is 0.9 %.
The fitted lifetime distribution provides more reliability information than a single estimate of the median lifetime. For example, it is easy to infer B_10 = 5880.19 from the fitted model, as the sketch below illustrates.
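The following sketch reproduces the final numbers of the case study from the extrapolated parameters a_0, b_0, and σ_l; small differences from the quoted values are due to rounding.

```python
import math
from scipy.stats import norm

# Lifetime is normal with mean b0*(a0 - ln 2) and sd b0*sigma_l,
# per Eqs. (10.36), (10.37) and (10.40)
a0, b0, sigma_l = 2.8675, 2962.23, 0.1477
mu = b0 * (a0 - math.log(2.0))        # median (and mean) lifetime
sigma = b0 * sigma_l                  # lifetime standard deviation
B10 = mu + sigma * norm.ppf(0.10)     # 10th percentile life
print(round(mu, 2), round(sigma, 2), round(B10, 2))
# -> approximately 6440.93, 437.5, 5880.2
```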
Assume that the stress type, the design stress level s_0, and the extreme stress level s_u are known. The design variables are the following:

• Number of stress levels k (≥ 2),
• Magnitude of each stress level s_i with s_0 ≤ s_1 < s_2 < ⋯ < s_k ≤ s_u,
The main performance measures associated with a given test scheme are the required test effort and the amount of obtainable information. The main measures for the test effort are the required total test time and cost. At the ith stress level, the probability that an item will fail by t_i equals F_i(t_i), and the expected test time per item is given by

$\tau_i = \int_0^{t_i} [1 - F_i(t)]\,dt.$  (10.42)

The required total test time is given by

$T_{tt} = \sum_{i=1}^{k} n_i \tau_i.$  (10.43)

Let c_1 denote the cost per test item and c_2 denote the test cost per unit test time. The required total test cost can be computed as

$C = \sum_{i=1}^{k} (c_1 + c_2 \tau_i) n_i.$  (10.44)

The expected total number of failures is given by

$m_f = \sum_{i=1}^{k} n_i F_i(t_i).$  (10.45)

The information weight w_i of the ith stress level can be written as

$w_i = 1/w(s_i)$  (10.46)

where w(s) is a weight function of the stress level. To reflect the effect of the information weight on the reliability information quality, we define an equivalent total number of failures as

$n_f = \sum_{i=1}^{k} w_i n_i F_i(t_i).$  (10.47)

A sketch of computing these measures is given below.
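The sketch below evaluates these measures for a constant-stress scheme, assuming Weibull lifetimes at each level; the scales, censoring times, sample sizes, weights, and costs are hypothetical illustration values.

```python
import numpy as np

b = 2.0                               # assumed common Weibull shape
eta = np.array([800.0, 400.0, 200.0]) # scale at each stress level
t_c = np.array([500.0, 300.0, 200.0]) # censoring times t_i
n = np.array([20, 15, 10])            # sample sizes n_i
w = np.array([1.0, 0.7, 0.5])         # information weights w_i
c1, c2 = 1.0, 0.1                     # item cost, cost per unit time

def expected_test_time(ti, e):
    # tau_i = integral of [1 - F_i(t)] over (0, t_i), Eq. (10.42),
    # evaluated by the trapezoid rule
    x = np.linspace(0.0, ti, 2001)
    R = np.exp(-(x / e) ** b)
    return float(np.sum((R[:-1] + R[1:]) / 2.0) * (x[1] - x[0]))

F = 1.0 - np.exp(-(t_c / eta) ** b)   # F_i(t_i)
tau = np.array([expected_test_time(ti, e) for ti, e in zip(t_c, eta)])
T_tt = np.sum(n * tau)                # total test time, Eq. (10.43)
C = np.sum((c1 + c2 * tau) * n)       # total test cost, Eq. (10.44)
m_f = np.sum(n * F)                   # expected failures, Eq. (10.45)
n_f = np.sum(w * n * F)               # equivalent failures, Eq. (10.47)
print(T_tt, C, m_f, n_f)
```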
Consider the single-factor constant-stress ALT scheme. The design variables can be determined using an empirical approach [4]. Specific details are as follows.

Assume that the functional form of the stress-life relationship model is known but there are m unknown parameters. To specify these parameters, k ≥ m is required. This implies that m is the lower bound of k.
Let n denote the total number of test items at all the stress levels and t_α (α ≤ 0.5) denote the α-fractile of the time to failure, which is to be estimated. At each stress level, it is desired that the expected failure number is not smaller than 5. As such, the upper bound of k is given by nα ≥ 5k, or k ≤ nα/5. As a result, we have

$m \le k \le n\alpha/5.$  (10.48)

It is noted that a large value of k will make the design problem much more complex. As such, many test schemes take k = m + 1 or m + 2.
We first look at the highest stress level s_k. Clearly, s_k should not exceed the extreme stress level s_u and the validation range of the ALT model, which is determined based on engineering analyses. A large value of s_k results in a shorter test time but poorer information quality.

The basic criterion for determining the other stress levels is that they should be biased toward s_0. A preliminary selection can be given by the following relation:

$s_i = s_0 p^i, \quad p = (s_k/s_0)^{1/k}.$  (10.49)
If we use Eq. (10.49) to determine the intermediate stress levels for the case study in Sect. 10.5.4, they would be 175, 203, and 236, respectively. If using an equal-space method, they would be 181, 213, and 244. It is noted that the stress levels used in the case study are closer to the ones obtained from the equal-space method.

If tests at different stress levels are conducted simultaneously, the total test duration is determined by t_1, which depends on s_1. In this case, s_1 can be determined based on the total test duration requirement, and Eq. (10.49) can be revised as

$s_i = s_1 p^{i-1}, \quad p = (s_k/s_1)^{1/(k-1)}.$  (10.50)

If we use Eq. (10.50) to determine the intermediate stress levels for the case study in Sect. 10.5.4, they would be 200 and 230, respectively. The sketch below illustrates these spacing rules.
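The sketch below reproduces the spacing rules. The value s_k = 275 °C is inferred from the equal-space levels quoted above; in the sketch, s_1 for Eq. (10.50) is simply taken from Eq. (10.49), so its output differs from the duration-constrained values 200 and 230 quoted in the text.

```python
s0, sk, k = 150.0, 275.0, 4           # design stress, top stress, levels

p = (sk / s0) ** (1.0 / k)            # Eq. (10.49)
levels_49 = [s0 * p ** i for i in range(1, k)]
print([round(s) for s in levels_49])  # -> [175, 203, 236]

s1 = levels_49[0]                     # illustrative choice of s1
q = (sk / s1) ** (1.0 / (k - 1))      # Eq. (10.50)
levels_50 = [s1 * q ** (i - 1) for i in range(2, k)]
print([round(s) for s in levels_50])
```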
The censoring time at each stress level can be chosen such that

$F_i(t_i) = \alpha_1 + \frac{i-1}{k-1}(2\alpha - \alpha_1).$  (10.51)

It is preferred to allocate more units to low stress levels so as to obtain a nearly equal failure number at each stress level. It is desirable that the following relation is met:

$n_i F_i(t_i) \ge 5.$  (10.52)
Example 10.2 In this example, we consider the test scheme in Example 10.1, but progressive censoring is not allowed. The following conditions remain unchanged: the stress-life relationship, the design stress level, the number of stress levels, and the upper stress level. To calculate the cost, we assume that c_1 = 1 and c_2 = 0.1.
References

1. Bae SJ, Kuo W, Kvam PH (2007) Degradation models and implied lifetime distributions. Reliab Eng Syst Saf 92(5):601–608
2. Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34(2):187–220
3. Ebrahem MAH, Higgins JJ (2006) Non-parametric analysis of a proportional wearout model for accelerated degradation data. Appl Math Comput 174(1):365–373
4. Elsayed EA (1996) Reliability engineering. Addison Wesley Longman, New York
5. Jiang R (2010) Optimization of alarm threshold and sequential inspection scheme. Reliab Eng Syst Saf 95(3):208–215
6. Jiang R (2012) A general proportional model and modelling procedure. Qual Reliab Eng Int 28(6):634–647
7. Jiang R, Jardine AKS (2008) Health state evaluation of an item: a general framework and graphical representation. Reliab Eng Syst Saf 93(1):89–99
8. Lu Y, Loh HT, Brombacher AC et al (2000) Accelerated stress testing in a time-driven product development process. Int J Prod Econ 67(1):17–26
9. Meeker WQ, Hamada M (1995) Statistical tools for the rapid development and evaluation of high-reliability products. IEEE Trans Reliab 44(2):187–198
10. Nelson W (1981) Analysis of performance-degradation data from accelerated tests. IEEE Trans Reliab 30(2):149–155
11. Nelson W (2004) Accelerated testing: statistical models, test plans, and data analysis. Wiley, New York
12. Percy DF, Alkali BM (2006) Generalized proportional intensities models for repairable systems. IMA J Manag Math 17(2):171–185
13. Wang W, Carr M (2010) A stochastic filtering based data driven approach for residual life prediction and condition based maintenance decision making support. Paper presented at 2010 prognostics and system health management conference, pp 1–10
Chapter 11
Reliability Growth Process
and Data Analysis
11.1 Introduction
During the product development phase, the reliability of a product can be improved by a test-analysis-and-fix (TAF) process. This process is called the reliability growth process. A challenging issue in this process is to predict the ultimate reliability of the final product configuration based on all the test observations and the corrective actions taken. This requires appropriate reliability growth models. In this chapter we focus on reliability growth models and data analysis. The reliability demonstration test to verify the design is also briefly discussed.
This chapter is organized as follows. We discuss the TAF process in Sect. 11.2. The reliability growth plan model, corrective action effectiveness model, and reliability growth evaluation models are presented in Sects. 11.3–11.5, respectively. We discuss the reliability demonstration test in Sect. 11.6. Finally, a case study is presented in Sect. 11.7.
Referring to Fig. 11.1, the reliability growth process involves testing one or more prototype systems under operating stress conditions to find potential failure modes. The testing is conducted in several stages, and the test stress level can gradually increase from the nominal stress to overstress. When a failure occurs before the stage test ends, the failed part is replaced by a new one (which is equivalent to a minimal repair) and the test is continued. The stage test ends at a prefixed test time or a prefixed number of failures.

The observed failure modes are then analyzed, and the outcomes of the analysis are design changes, which lead to new configurations. The new configurations are tested in the next test stage.
Fig. 11.1 Reliability growth tests of multiple stages for several systems
For a given test stage, multiple failure point processes are observed. For a given prototype system, the inter-arrival times in different test stages are independent but nonidentically distributed. The reliability of the current configuration is assessed based on the observed failure processes. If the assessed reliability level is unacceptable for production, then the system design is modified, and the reliability of the new configuration is predicted based on the observed failure processes and planned corrective actions.

If the predicted reliability is still unacceptable, the growth testing is continued; otherwise, the new configuration may need to undergo a reliability demonstration test to verify the design. This is because the effectiveness of the last corrective actions and the possibility of introducing new failure modes are not observed due to time and budget constraints.
According to the time when the corrective actions are implemented, there are three different reliability growth testing strategies:

• Test-find-test strategy. This strategy focuses on discovering problems, and the corrective action is delayed to the end of the test.
• Test-fix-test strategy. In this strategy, the corrective action is implemented once problems are discovered.
• Test-fix-find-test strategy. This strategy is a combination of the above two strategies. It fixes some problems during the test, and the fixing of the other problems is delayed until the end of the test.
The reliability growth process involves three classes of models:

• Reliability growth plan models,
• Corrective action effectiveness models, and
• Reliability growth evaluation models.

We discuss these models separately in the following three sections.
The functional form of the planned growth curve is the well-known Duane model [6]. Suppose that the nth failure of a system occurs at t_n. The interval MTBF is given by l_n = t_n/n. Duane found the following empirical relation between l_n and t_n:

$\ln l_n = \alpha + b \ln t_n,$  (11.1)

which can be written as

$l_n = a t_n^b, \quad a = e^{\alpha}.$  (11.2)

The value of a depends on the initial reliability level at the start of testing, and b ∈ (0, 1) represents the rate of growth. A large value of b implies a large rate of growth.
If the reliability growth occurs continuously (e.g., the case in the test-fix-test strategy), the reliability achieved by t can be represented by the instantaneous MTBF. Let M(t) denote the expected number of failures in (0, t). The instantaneous failure intensity is given by m(t) = dM(t)/dt. The instantaneous MTBF is given by l(t) = 1/m(t).

Applying l_n = t_n/n to Eq. (11.2) and replacing n and t_n by M(t) and t, respectively, we have M(t) = t^{1−b}/a and

$m(t) = (1 - b)M(t)/t = (1 - b)m_0(t)$  (11.3)

where m_0(t) = M(t)/t is the interval average failure intensity over (0, t). As such, the instantaneous MTBF is given by

$l(t) = 1/m(t) = \frac{a}{1-b}\,t^{b}.$  (11.4)

In terms of the interval MTBF l_0 achieved over an initial test period g_0, Eq. (11.4) can be written as

$l(t) = \frac{l_0}{1-b}\left(\frac{t}{g_0}\right)^{b}.$  (11.5)
Different corrective actions can have different effects on reliability. The predicted reliability can be inaccurate if the effectiveness of a corrective action is not appropriately modeled (e.g., see Ref. [11]). There are two kinds of methods to model the effectiveness of corrective actions: implicit (or indirect) and explicit (or direct).

The implicit methods use extrapolation to predict the reliability after the corrective actions are implemented. This is a kind of empirical method. For example, the Duane model uses the instantaneous MTBF to predict the MTBF of the next configuration. Most discrete reliability growth models fall into this category.

The explicit methods use a specific value called the fix effectiveness factor (FEF) to model the effectiveness of a corrective action. The FEF is the fractional reduction in the failure intensity of a failure mode after it is fixed by a corrective action. Therefore, it takes a value between 0 and 1. In particular, it equals 0 if nothing is done, and equals 1 if the failure mode is fully removed.
Specifically, let d denote the FEF of a corrective action, and λ_0 and λ_1 denote the failure intensities before and after the corrective action is implemented, respectively. The FEF is defined as

$d = \frac{\lambda_0 - \lambda_1}{\lambda_0} = 1 - \frac{\lambda_1}{\lambda_0}.$  (11.6)

This implies that given the values of d and λ_0, one can calculate the value of λ_1 using Eq. (11.6), which is λ_1 = (1 − d)λ_0. Comparing this with Eq. (11.3), d is somewhat similar to the growth rate b.
Suppose that t_0 is a failure observation that occurred before the mode is corrected. This failure would occur at t_1 if the mode were corrected at t = 0. It is noted that the mean life is inversely proportional to the failure rate for the exponential distribution. Under the exponential distribution assumption, t_1 can be calculated as

$t_1 = \frac{\lambda_0}{\lambda_1}\,t_0 = \frac{t_0}{1-d}.$  (11.7)

In this way, a failure time observed before the corrective action is implemented can be transformed into a failure time that is equivalent to one observed for the new configuration. The benefit of doing so is that we can predict the life distribution of the new configuration by fitting the equivalent failure sample to a life distribution.
For a given corrective action, the FEF can be quantified based on subjective judgment and/or historical data. It is often difficult for experts to specify such information, and the historical data may not be suitable for the current situation. To address this problem, one needs to properly consider the root cause of the failure mode and the features of the corrective action. A sensitivity analysis that considers different FEF values can be carried out.
Example 11.1 Suppose that the strength of a component is considered to be insufficient. The corrective action is to replace the currently used weak component with a stronger one. Assume that the lifetimes of the components follow the exponential distribution and the mean lifetimes of the original and new components are μ_0 = 600 h and μ_1 = 1000 h, respectively. The problem is to calculate the FEF of the corrective action.

The failure rates of the original and new components are λ_0 = 1/μ_0 and λ_1 = 1/μ_1, respectively. From Eq. (11.6), we have

$d = 1 - \frac{\lambda_1}{\lambda_0} = 1 - \frac{\mu_0}{\mu_1} = 0.4.$

A sketch of this calculation, together with the failure time transformation of Eq. (11.7), is given below.
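The following is a minimal sketch; the pre-fix failure time t_0 = 450 h used for Eq. (11.7) is a hypothetical observation.

```python
# FEF of the corrective action, Eq. (11.6), with the Example 11.1 values
mu0, mu1 = 600.0, 1000.0          # mean lifetimes before/after the fix
lam0, lam1 = 1.0 / mu0, 1.0 / mu1

d = 1.0 - lam1 / lam0             # fix effectiveness factor -> 0.4
print(d)

# Equivalent post-fix failure time, Eq. (11.7); t0 is hypothetical
t0 = 450.0
t1 = t0 / (1.0 - d)               # -> 750.0 h
print(t1)
```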
Reliability growth models are used to evaluate the improvement achieved in reliability. According to the type of product, the models can be classified into two categories: software reliability growth models and reliability growth models for complex repairable systems that are comprised of mechanical, hydraulic, electronic, and electric units. These two categories of models are similar in the sense that the growth takes place due to corrective actions. However, the corrective actions for software can be unique and objective, whereas they can be multidimensional and subjective for complex systems.
The reliability growth models for complex systems can be further classified into two classes: discrete and continuous. Discrete models describe the reliability improvement as a function of a discrete variable (e.g., the test stage number), and continuous models describe the reliability improvement as a function of a continuous variable (e.g., the total test time).

Reliability growth models can be parametric or nonparametric. The parametric models are preferred since they can be used to extrapolate the future reliability when corrective actions have been planned but not yet implemented.
11.5.1.1 Models
During the software testing phase, a software system is tested to detect software faults remaining in the system and to fix them. This leads to a growth in software reliability. A software reliability growth model is usually used to predict the number of faults remaining in the system so as to determine when the software testing should be stopped.

Assume that a software failure occurs at a random time and the fault that caused the failure is immediately removed without introducing new faults. Let N(t) denote the cumulative number of failures detected in the time interval (0, t], and

$M(t) = E[N(t)], \quad m(t) = dM(t)/dt.$  (11.8)

We call M(t) the mean value function of N(t), and m(t) the failure intensity function, which represents the instantaneous fault detection or occurrence rate.

According to Ref. [9], a software reliability growth model can be written in the following general form:

$M(t) = M_{\infty} G(t)$  (11.9)

where M_∞ is the expected total number of faults in the system and G(t) is a distribution function. Typical forms of G(t) include the inverse Weibull distribution

$G(t) = \exp[-(g/t)^{b}]$  (11.10)

and the Pareto distribution

$F(t) = 1 - \left(1 + \frac{t}{g}\right)^{-b}, \quad b, g > 0.$  (11.11)
The software reliability growth models can be used to estimate the number of
unobserved failure modes for complex systems. This will be illustrated in
Sect. 11.7.
The model parameters can be estimated using the maximum likelihood method and the least squares method. Consider the failure point process t_1 ≤ t_2 ≤ ⋯ ≤ t_n < T, where t_i is the time to the ith failure and T is the censoring time. The distribution of the time to the first failure is given by

$f(t) = m(t)\exp[-M(t)].$  (11.12)

The log-likelihood function is given by

$\ln L = \sum_{i=1}^{n} \ln m(t_i) - M(T) = n\ln M_{\infty} - M_{\infty}G(T) + \sum_{i=1}^{n} \ln g(t_i)$  (11.14)

where g(t) = dG(t)/dt. Setting the derivative with respect to M_∞ to zero yields

$M_{\infty} = n/G(T).$  (11.15)

Since G(T) < 1, we have M_∞ > n. Substituting Eq. (11.15) into Eq. (11.14) and after some simplifications, we have

$\ln L' = \ln L - n\ln n + n = \sum_{i=1}^{n} \ln\frac{g(t_i)}{G(T)}.$  (11.16)

The least squares method minimizes

$SSE = \sum_{i=1}^{n} [M(t_i) - (i - 0.5)]^2$  (11.17)
subject to the constraint given by Eq. (11.15). Here, we take the empirical estimate of M(t_i) as i − 0.5 (since i − 1 ≤ M(t_i) ≤ i).

Fig. 11.3 Observed and fitted inverse Weibull reliability growth curves
Example 11.2 The data set shown in Table 11.1 comes from Ref. [8]. The problem is to fit the data to an appropriate reliability growth model.

In this example, T = t_n. The observed data are displayed in Fig. 11.3 (the dotted points). The plot indicates that a reliability growth model with an inverse-S-shaped growth curve is desired, and hence the lognormal and inverse Weibull models can be appropriate. For the purpose of illustration, we also consider the exponential and Pareto models as candidate models.

The maximum likelihood estimates of the parameters of the candidate models and the associated values of ln L and SSE are shown in Table 11.2. As seen, the best model is the inverse Weibull model. The reliability growth curve of the fitted inverse Weibull model is also shown in Fig. 11.3, which indicates a good agreement between the empirical and fitted growth curves.

According to the fitted model, there are about 9 faults remaining in the system. The time to the next failure can be estimated from Eq. (11.13) with M(t) being the fitted inverse Weibull model. The expected time to the next failure is 1021 h. A sketch of such a fit is given below.
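A minimal sketch of the inverse Weibull fit via Eq. (11.16) follows; the failure times are hypothetical stand-ins, since the Table 11.1 data are not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical failure times (hours)
t = np.array([25., 60., 110., 180., 260., 360., 480., 640., 830., 990.])
T = t[-1]

def G(x, b, g):
    # inverse Weibull cdf, Eq. (11.10)
    return np.exp(-(g / x) ** b)

def negloglik(theta):
    b, g = theta
    if b <= 0 or g <= 0:
        return np.inf
    # density of G: g'(t) = G(t) * b * g**b * t**(-b - 1)
    ln_pdf = (np.log(G(t, b, g)) + np.log(b) + b * np.log(g)
              - (b + 1) * np.log(t))
    # profile log-likelihood ln L' of Eq. (11.16)
    return -(np.sum(ln_pdf) - len(t) * np.log(G(T, b, g)))

fit = minimize(negloglik, x0=[1.0, 200.0], method="Nelder-Mead")
b_hat, g_hat = fit.x
M_inf = len(t) / G(T, b_hat, g_hat)     # Eq. (11.15)
print(b_hat, g_hat, M_inf - len(t))     # estimated remaining faults
```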
Let R_j, F_j, and λ_j denote the reliability, unreliability, and failure intensity (or failure rate) of the item at stage j (j = 1, 2, …), respectively. There are two general models for modeling the reliability growth process. One is defined in terms of R_j (or F_j = 1 − R_j) and is applicable for attribute data (i.e., data with the outcome of success or failure); the other is defined in terms of λ_j (or l_j = 1/λ_j) and is applicable for exponential life data. There are several specific models for each general model. Different models or different parameter estimation methods can give significantly different prediction results. As such, one needs to look at several models and select the best (e.g., see Ref. [7]).
In this class of models, the outcome of a test for an item (e.g., a one-shot device) is success or failure. Suppose that n_j items are tested at the jth stage and the number of successes is x_j. As such, the stage reliability R_j is estimated as r_j = x_j/n_j. The corrective actions are implemented at the end of each stage so that R_{j+1} is statistically not smaller than R_j. As such, R_j increases with j. A general reliability growth model in this context can be defined as

$R_j = R_{\infty} - h(S_j), \quad j = 1, 2, \ldots$  (11.18)

where R_∞ is the maximum achievable reliability and h(·) is a decreasing function of the test effort S_j. When R_∞ = 1, Eq. (11.18) becomes

$F_j = h(S_j).$  (11.19)

Two specific models are the inverse power and exponential models, given respectively by

$R_j = R_{\infty} - aj^{-b} \quad \text{and} \quad R_j = R_{\infty} - ae^{-bj}.$  (11.20)
Example 11.3 The data of this example come from Ref. [1]. A TAF process comprises 12 stages with n_j = 20 at each stage. The numbers of successes x_j are 14, 16, 15, 17, 16, 18, 17, 18, 19, 19, 20, and 19, respectively. The corrective action is implemented at the end of each stage. The problem is to estimate the reliability of the product after the last corrective action.
Fitting the data to the models in Eq. (11.20), we have the results shown in Table 11.3. In terms of the sum of squared errors, the exponential model provides a better fit to the data. As a result, the reliability is estimated as R_13 = 0.9631.

It is noted that the average reliability evaluated from the data of the last four stages is 0.9625. This implies that the inverse power model obviously underestimates the reliability. However, if we take R_∞ as an arbitrary real number, the inverse power model provides a reasonable estimate of the reliability (see the last row of Table 11.3). A sketch of such a fit is given below.
Similar to the models in Eq. (11.20), two specific models are given respectively by

$\lambda_j = \lambda_{\infty} + aj^{-b} \quad \text{and} \quad \lambda_j = \lambda_{\infty} + ae^{-bj}.$  (11.22)

Optionally, the reliability growth model can be defined in terms of MTBF. Such a model is the extended geometric process model [10]. Let Z_j = T_j − T_{j−1} (j = 1, 2, …) and l_j = E(Z_j). In the context of reliability growth, the MTBF (i.e., l_j) is increasing and asymptotically tends to a positive constant l_∞ (i.e., the maximum obtainable MTBF), and the stochastic process X = {l_∞ − Z_j} is stochastically decreasing and tends to zero. If X follows a geometric process with parameter a ∈ (0, 1), then Y = {(l_∞ − Z_j)/a^{j−1}} is a renewal process with mean θ and variance σ². As such, the mean function of the stochastic process Z = {Z_j} is given by

$l_j = l_{\infty} - \theta a^{j-1}.$  (11.23)

This model can be viewed as a variant of Eq. (11.18) with the reliability being replaced by the MTBF. We define the following two general models in terms of MTBF:

$l_j = l_{\infty} - aj^{-b} \quad \text{and} \quad l_j = l_{\infty} - ae^{-bj}.$  (11.24)
The Duane model given by Eq. (11.2) may be the most important continuous reliability growth model. Several variants and extensions of this model are presented as follows.

Crow [3] reformulates the Duane model as the power-law model with

$M(t) = (t/\eta)^{\beta}$  (11.25)

where, in terms of the Duane parameters,

$\beta = 1 - b, \quad \eta = a^{1/(1-b)}.$  (11.26)
The maximum likelihood method and the least squares method can be used to estimate the parameters of the Crow model. The least squares method has been presented in Sect. 6.2.4, and the maximum likelihood method is outlined as follows.
Consider the failure point processes of n nominally identical systems, $\{t_{ij}, \; 1 \le j \le J_i; \; T_i\}$, i = 1, …, n, where t_{ij} is the jth failure time of the ith system, J_i is the number of failures observed for the ith system, T_i is its censoring time, and

$M(t) = (t/\eta)^{\beta}, \quad m(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}, \quad R(t) = \exp[-M(t)], \quad f(t) = m(t)R(t).$  (11.29)

The log-likelihood function is given by $\ln L = \sum_{i=1}^{n} \ln L_i$, where

$\ln L_i = \ln R_c(T_i) + \sum_{j=1}^{J_i} \ln f_c(t_{ij}) = \sum_{j=1}^{J_i} \ln m(t_{ij}) - \left(\frac{T_i}{\eta}\right)^{\beta}.$  (11.30)
For a single system (n = 1), maximizing ln L yields

$\hat\beta = J_1 \Big/ \sum_{j=1}^{J_1} \ln(T_1/t_{1j}), \quad \hat\eta = T_1/J_1^{1/\hat\beta}.$  (11.31)
Example 11.5 The data shown in Table 11.5 come from Ref. [5] and deal with the reliability growth process of a system. Here, J = 40 and T = t_J. Crow fits the data to the power-law model using the maximum likelihood method. The estimated parameters are (β, η) = (0.4880, 1.6966). The failure intensity and MTBF at the test end are estimated as m(T) = βJ/T = 5.99 × 10⁻³ failures/h and l(T) = 1/m(T) = 166.8 h.

It is noted that the MTBF can also be estimated by t_{J+1} − t_J, where t_{J+1} can be estimated by letting M(t_{J+1}) = J + 1. Using this approach, we have t_{J+1} − t_J = 169.02, which is slightly larger than the maximum likelihood estimate, with a relative error of 1.3 %.

Using the least squares method, the estimated parameters are (β, η) = (0.4796, 1.4143). The failure intensity and MTBF at the test end are estimated as m(T) = 6.03 × 10⁻³ failures/h and l(T) = 165.7 h. The relative error between the MTBF estimates obtained from the two methods is 0.66 %. The sketch below reproduces the maximum likelihood estimates.
Table 11.5 Reliability growth test data (in hours) for Example 11.5
0.7 2.7 13.2 17.6 54.5 99.2 112.2
120.9 151.0 163.0 174.5 191.6 282.8 355.2
486.3 490.5 513.3 558.4 678.1 699.0 785.9
887.0 1010.7 1029.1 1034.4 1136.1 1178.9 1259.7
1297.9 1419.7 1571.7 1629.8 1702.3 1928.9 2072.3
2525.2 2928.5 3016.4 3181.0 3256.3
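The sketch below implements Eq. (11.31) on the Table 11.5 data; its output should be close to (0.4880, 1.6966) and an MTBF of about 167 h.

```python
import numpy as np

t = np.array([0.7, 2.7, 13.2, 17.6, 54.5, 99.2, 112.2, 120.9, 151.0,
              163.0, 174.5, 191.6, 282.8, 355.2, 486.3, 490.5, 513.3,
              558.4, 678.1, 699.0, 785.9, 887.0, 1010.7, 1029.1, 1034.4,
              1136.1, 1178.9, 1259.7, 1297.9, 1419.7, 1571.7, 1629.8,
              1702.3, 1928.9, 2072.3, 2525.2, 2928.5, 3016.4, 3181.0,
              3256.3])
J, T = len(t), t[-1]                     # failure-truncated test, T = t_J

beta = J / np.sum(np.log(T / t))         # Eq. (11.31); the last term is 0
eta = T / J ** (1.0 / beta)
m_T = (beta / eta) * (T / eta) ** (beta - 1.0)   # intensity at the test end
print(beta, eta, 1.0 / m_T)              # ~0.488, ~1.70, ~167 h
```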
When the test consists of several stages, the failure intensity in the jth stage can be represented by the power-law model

$m_j(t) = \frac{\beta_j}{\eta_j}\left(\frac{t}{\eta_j}\right)^{\beta_j - 1}.$  (11.33)

The interval MTBF at the jth stage is given by l_j = η_j Γ(1 + 1/β_j) and the interval failure intensity by λ_j = 1/l_j. If the test is no longer conducted after the last corrective actions are implemented, the failure intensity can be predicted by fitting the estimated stage failure intensities [or interval MTBFs] to the models given by Eq. (11.22) [or Eq. (11.24)]. The future failure intensity [or MTBF] is extrapolated using the fitted model in a way similar to the one in Example 11.4.
Consider a repairable system with K failure modes. The system failure intensity is the sum of the failure intensities of the independent failure modes, i.e.,

$\lambda_s = \sum_{i=1}^{K} \lambda_i$  (11.34)

where λ_i is the failure intensity of mode i evaluated at the end of a given stage, and λ_s is the system failure intensity at the end of this stage.
There are two methods to evaluate λ_i. One is to assume that the failure intensity for each mode is constant. In this case, the failure intensity of mode i is estimated as

$\lambda_i = \frac{n_i}{nT}$  (11.35)

where n_i is the number of failures of mode i during the current test stage, n is the number of tested items, and T is the test duration of this stage.

The other method is to assume that the failure intensity for each mode can be represented by the power-law model, whose parameters are given by

$\hat\beta_i = n_i \Big/ \sum_{j=1}^{n_i} \ln(T/t_{ij}), \quad \hat\eta_i = T(n/n_i)^{1/\hat\beta_i}.$  (11.36)
Equation (11.36) comes from Eq. (11.31). As a result, the failure intensity of mode i in the current stage is given by

$\lambda_i = \frac{1}{\eta_i \Gamma(1 + 1/\beta_i)}.$  (11.37)

When β_i = 1, the intensity estimated from Eq. (11.37) is the same as the one estimated from Eq. (11.35). However, when β_i ≠ 1, the intensity estimates from Eqs. (11.35) and (11.37) are different. The sketch below compares the two estimates.
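The sketch below computes both estimates for one mode, using the Mode 3 failure times pooled from Table 11.6; note that the form of η̂_i follows the reconstruction of Eq. (11.36) above.

```python
import numpy as np
from math import gamma

# Failure times of Mode 3 pooled over all systems (from Table 11.6)
t_ij = np.array([79.4, 67.3, 67.1, 70.8, 162.3, 102.7, 100.8, 126.3,
                 13.9, 64.1, 84.4, 39.2, 48.5, 45.6, 53.3, 68.7])
n, T = 25, 175.0                         # tested items and stage duration
n_i = len(t_ij)

lam_const = n_i / (n * T)                # constant intensity, Eq. (11.35)

beta_i = n_i / np.sum(np.log(T / t_ij))  # power-law parameters, Eq. (11.36)
eta_i = T * (n / n_i) ** (1.0 / beta_i)
lam_pl = 1.0 / (eta_i * gamma(1.0 + 1.0 / beta_i))   # Eq. (11.37)
print(lam_const, beta_i, lam_pl)
```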
11.6 Design Validation Test

Though the reliability predicted in the development stage is more accurate than the reliability predicted in the design stage, it is still inaccurate for the following reasons:

• the judgment of FEFs is subjective,
• the assumption for the failure process may be unrealistic,
• the test conditions and environment can be different from the real operating conditions,
• the test observations are insufficient,
• the test prototypes may have manufacturing defects, and
• the repairs may have quality problems.

As such, the predicted reliability should be viewed as an approximation of the product's inherent reliability. This necessitates a reliability demonstration test to confirm or validate the prediction obtained from the reliability growth analysis. Two key issues with the demonstration test are when and how the test is conducted.
Two key issues with the demonstration test are when and how this test is conducted.
The rst issue deals with the relationship between the demonstration test and the
reliability growth testing. To make the results reliable, the demonstration test should
have sufcient test time. On the other hand, the reliability growth testing and
demonstration testing usually require common test facilities and resources, and are
subject to the constraint on the total test time. This implies that more growth testing
can lead to a higher reliability level but will reduce demonstration test time and lead
to a lower demonstration condence, as shown in Fig. 11.2. As such, there is a need
to achieve an appropriate balance between growth testing and demonstration
testing.
The second issue deals with the design of demonstration test. The test plan
involves the determination of number of tested items, test duration, and accept-
ability criterion. Several factors that affect the test plan include the nature of the test
item, the type of demonstration, and the availability of test resources.
To test as more system interfaces as possible, the demonstration test should be
carried out on system or its critical subsystems. The test conditions must be as close
to the expected environmental and operating conditions as possible.
Standard test plans used in demonstration testing assume that the failure rate is
constant. The demonstration test is actually an acceptance sampling test, which will
be discussed in detail in Chap. 13.
11.7 A Case Study

The data shown in Table 11.6 come from Ref. [12] and deal with the failure occurrence times of 13 failure modes from a developmental test, where 25 items are tested for 175 h.

A test-fix-find-test strategy is used. Specifically, Mode 8 is partially corrected at 100 h; Modes 1, 5, 7, 8, 9, 10, and 11 are partially corrected at the test end; and the other six modes are not corrected. As such, the test consists of two stages: t ∈ (0, 100] and t ∈ (100, 175]. A simple prediction analysis for the future failure intensity has been carried out in Example 11.4.

In this section, we carry out a detailed analysis of the data using the power-law model for a system with multiple failure modes discussed in Sect. 11.5.3.3. We also predict the number of unobserved failure modes using the software reliability growth models presented in Sect. 11.5.1.
For the failure modes without corrective actions, we only need to assess their interval failure intensities based on the power-law model using the maximum likelihood method. The results are shown in Table 11.7, where l_∞ = ηΓ(1 + 1/β) and λ = 1/l_∞.
Table 11.6 Failure modes and occurrence times of failures and corrective actions
Mode FEF System Failure times Corrective action time
1 0.5 19 106.3 175
2 0 7 107.3
15 100.5
25 10.6
3 0 3 79.4
6 67.3, 67.1, 70.8, 162.3
14 102.7
17 100.8, 126.3
22 13.9, 64.1, 84.4
23 39.2, 48.5, 45.6, 53.3, 68.7
4 0 20 97.0
25 36.4
5 0.8 21 70.8 175
6 0 3 148.1
7 70.6
8 118.1
22 18.0
7 0.7 15 37.8 175
8 0.8 6 74.3 100
7 65.7, 93.1
9 0.5 13 90.8 175
10 0.5 13 99.2 175
11 0.5 17 130.8 175
12 0 5 169.1
15 114.7
16 6.6
17 154.8, 5.7, 21.4
23 125.4, 140.1
13 0 7 102.7
For the failure modes corrected at the test end, we first assess their interval failure intensities in t ∈ (0, 175], and then predict the post-correction intensities using Eq. (11.6). The assessed intensities are shown in the fifth column of Table 11.8 and the predicted intensities are shown in the last column. The last row shows the predicted MTBF (i.e., 93.19) without considering the contribution of Mode 8.

It is noted that the predicted MTBF is smaller than the observed MTBF (i.e., 25 × 175/43 = 101.74). This is because the observed MTBF is estimated based on a constant-intensity assumption. Actually, if no corrective action were taken, the total failure intensity obtained from the power-law model would be 13.80 × 10⁻³; the intensity after taking account of the corrective actions is 10.73 × 10⁻³. As a result, the reduction in the intensity is 3.07 × 10⁻³, and the average FEF equals 0.2863. In this sense, the reliability gets improved.
Table 11.8 Parameters of the power-law model and failure intensities for the failure modes corrected at t = 175

Mode  β       η         l_∞       λ (×10⁻³)  (1 − d)λ (×10⁻³)
1     2.0059  870.85    771.73    1.2958     0.6479
5     1.1051  3221.61   3103.87   0.3222     0.0644
7     0.6525  24286.12  33040.9   0.0303     0.0091
9     1.5241  1446.32   1303.22   0.7673     0.3837
10    1.7617  1087.87   968.51    1.0325     0.5163
11    3.4351  446.68    401.50    2.4906     1.2453
Sum                                          2.8667
MTBF                                         93.19
An interesting finding is that the corrected failure modes generally have larger values of β than the failure modes without corrective actions. In fact, the average of the β values in Table 11.7 is 1.0801 and the average in Table 11.8 is 1.7474. This implies that the value of β can provide a useful clue in failure cause analysis. It also confirms the observation that the constant-failure-intensity assumption is not always true.
The corrective action for Mode 8 is implemented at t = 100, and its effectiveness has been observed in the second test stage, when no failure occurs with this mode. The FEF value (=0.8) indicates that the corrective action cannot fully remove this mode. Clearly, we cannot directly assess the failure intensity of this mode in the second test stage based on the observed data because no failure observation is available. A method to solve this problem is outlined as follows.

Simultaneously consider the data from the two stages. In this case, the overall likelihood function consists of two parts. The first part is the likelihood function in the first test stage with parameters β_1 and η_1, and the second part is the likelihood function in the second test stage with parameters β_2 and η_2. Assume that β_1 = β_2 = β. The MTTF is given by

$l_i = \eta_i \Gamma(1 + 1/\beta), \quad i = 1, 2$  (11.38)

and, by the definition of the FEF, the scale parameters are related by

$\eta_2 = \eta_1/(1 - d).$  (11.39)

Maximizing the overall likelihood function yields the estimates of β and η_1, and the failure intensity after considering the effect of the corrective action is given by [4]

$\lambda = \frac{1}{\eta_2 \Gamma(1 + 1/\beta)} = \frac{1 - d}{\eta_1 \Gamma(1 + 1/\beta)}.$  (11.40)
Table 11.9 shows the estimated model parameters and the predicted failure intensity for both the constant-intensity and power-law models. In terms of the maximum log-likelihood value, the power-law model is more appropriate than the constant-intensity model. It is noted that the predicted failure intensity from the power-law model is much smaller than the one obtained from the constant-intensity model. This implies that the constant-intensity assumption can lead to an unrealistic estimate of the failure intensity when β is not close to 1.

The total failure intensity from all the failure modes is now 10.85 × 10⁻³, and hence the eventual MTBF is 92.17 h.
If we continue the testing, more failure modes may be found. Each failure mode has a contribution to the overall system failure intensity. As such, there is a need to consider the influence of unobserved failure modes on reliability. To address this issue, we need to look at the following two problems:

• estimating the cumulative number of failure modes expected in future testing, and
• estimating the contribution of the unobserved failure modes to the total failure intensity.

We look at these problems as follows.
Let t_i denote the earliest occurrence time of mode i, M(t) denote the expected number of failure modes observed by time t, and M_∞ denote the expected total number of failure modes in the system. The new failure mode introduction process in a complex system is somewhat similar to the software reliability growth process, and hence the software reliability growth models can be used to model this process.

For the case study under consideration, the earliest occurrence times of the failure modes are summarized in Table 11.10. Since the sample size is small, we consider two simple models: (a) G(t) is the exponential distribution, and (b) G(t) is the Pareto distribution with b = 1. The maximum likelihood estimates of the parameters and the performances are shown in Table 11.11. Figure 11.4 shows the plots of the fitted models. As seen, the growth curves are different for large t.
Fig. 11.4 Fitted failure mode introduction models: M(t) versus t for the exponential and Pareto models
In terms of ln L′, the exponential model provides a better fit to the data; in terms of SSE, the Pareto model provides a better fit. We combine these two criteria into the following criterion:

$I_c = n \ln\left(\frac{SSE}{n}\right) - 2\ln L'.$  (11.41)

The last column of Table 11.11 shows the values of I_c. As seen, the Pareto model provides a better fit to the data in terms of I_c.
If the test is continued, the expected time to the occurrence of the jth failure mode can be estimated from the fitted Pareto model, which is given by

$\tau_j = \frac{gj}{M_{\infty} - j}, \quad j > 13.$  (11.42)

For example, the 14th failure mode may appear at about t = 211 test hours.

Assume that the failure intensity of the jth mode is inversely proportional to its expected first occurrence time, i.e.,

$\lambda_j = k/\tau_j$  (11.43)

where k is a positive constant. The contribution of the first j failure modes to the total failure intensity is given by
$C(j) = \sum_{l=1}^{j} \lambda_l \Big/ \sum_{l=1}^{M_{\infty}} \lambda_l = \sum_{l=1}^{j} \tau_l^{-1} \Big/ \sum_{l=1}^{M_{\infty}} \tau_l^{-1}.$  (11.44)

As such, the contribution of the unobserved failure modes to the total failure intensity is given by 1 − C(j). Figure 11.5 shows the plots of C(j) for the two fitted growth models.

Fig. 11.5 Plots of C(j) for the exponential and Pareto models
Let λ_c denote the current estimate of the intensity (with J identified modes) without considering the contribution of the unobserved failure modes, and λ_a denote the additional intensity from the unobserved failure modes. We have

$\frac{\lambda_c}{\lambda_a} = \frac{C(J)}{1 - C(J)}, \quad \lambda_c + \lambda_a = \lambda_c/C(J), \quad l'_J = C(J)\,l_J.$  (11.45)
11.7.4 Discussion
For the data shown in Table 11.6, an intuitive estimate of the MTBF without considering the effect of corrective actions is given by 25 × 175/43 = 101.74 h. It appears that the predicted reliability is an underestimate of the MTBF. The two main causes of this impression are as follows.
• The intuitive estimate comes from the constant-intensity model and does not consider the effect of unobserved failure modes. However, the values of β associated with nine failure modes are larger than 1, and six of them are well above 1. As a result, the prediction based on the constant-intensity assumption may give an unrealistic estimate.
• The observed reliability growth may partially come from the corrective actions for the manufacturing and repair quality problems of the tested items. If this is the case, the predicted reliability may be an underestimate, since the manufacturing quality in mass production is expected to be better than that of the prototypes.
Finally, it is important to differentiate the instantaneous intensity from the interval intensity. The instantaneous intensity is suitable for the case where improvement in reliability occurs continuously, and its value at the end of a stage should be viewed as a prediction for the next stage. If the configuration is unchanged for a given test stage, no reliability growth occurs in this stage. In this case, we should use the interval intensity to evaluate the reliability.
We can use the data in Table 11.6 to derive a reliability growth plan curve. For the purpose of illustration, we make the constant-intensity assumption though it can be untrue. The estimates of the MTBF in the two stages are shown in Table 11.4, and the estimates are graphically displayed in Fig. 11.6 (the four dots). Fitting these points to the Duane model given by Eq. (11.4) using the least squares method yields

$l(t) = \frac{84.11}{1-b}\left(\frac{t}{100}\right)^{b}.$  (11.46)

The growth plan curve is shown in Fig. 11.6 (the continuous curve).

Fig. 11.6 Reliability growth plan curve for the case study
References

1. Blischke WR, Murthy DNP (2000) Reliability: modeling, prediction, and optimization. Wiley, New York
2. Calabria R, Guida M, Pulcini G (1996) A reliability-growth model in a Bayes-decision framework. IEEE Trans Reliab 45(3):505–510
3. Crow LH (1974) Reliability analysis for complex, repairable systems. In: Proschan F, Serfling RJ (eds) Reliability and biometry. SIAM, Philadelphia, pp 379–410
4. Crow LH (2004) An extended reliability growth model for managing and assessing corrective actions. In: Proceedings of the annual reliability and maintainability symposium, pp 73–80
5. Crow LH (2006) Useful metrics for managing failure mode corrective action. In: Proceedings of the annual reliability and maintainability symposium, pp 247–252
6. Duane JT (1964) Learning curve approach to reliability monitoring. IEEE Trans Aerosp 2(2):563–566
7. Fries A, Sen A (1996) A survey of discrete reliability-growth models. IEEE Trans Reliab 45(4):582–604
8. Hossain SA, Dahiya RC (1993) Estimating the parameters of a non-homogeneous Poisson-process model for software reliability. IEEE Trans Reliab 42(4):604–612
9. Jiang R (2009) Required characteristics for software reliability growth models. In: Proceedings of the 2009 world congress on software engineering, vol 4, pp 228–232
10. Jiang R (2011) Three extended geometric process models for modeling reliability deterioration and improvement. Int J Reliab Appl 12(1):49–60
11. Meth MA (1992) Reliability-growth myths and methodologies: a critical view. In: Proceedings of the annual reliability and maintainability symposium, pp 337–342
12. O'Connor PDT (2002) Practical reliability engineering, 4th edn. Wiley, New York
Part III
Product Quality and Reliability
in Manufacturing Phase
Chapter 12
Product Quality Variations and Control
Strategies
12.1 Introduction
Quality characteristics are the parameters that describe the product quality such as
length, weight, lifetime, number of defects, and so on. The data on quality
characteristics can be classied as attribute data (which take discrete integer values,
e.g., number of defects) and variable data (which take continuous values, e.g.,
lifetime).
Quality characteristics of a product are usually evaluated relative to design specifications. Specifications are the desired values of the quality characteristics of the product or its components. The specifications are usually stated in terms of a nominal value, a lower specification limit, and an upper specification limit. Components or products are nonconforming or defective if one or more of the specifications are not met.
Despite the efforts made during the design and development phases to ensure optimal production and assembly characteristics, no production system is able to produce two exactly identical outputs. Unit-to-unit difference in quality characteristics is referred to as variability.
The variability results from differences or variations in input materials, performance of manufacturing equipment, operator skills, and other factors. These factors are called sources of variation and can be divided into six aspects: Materials, Manufacture, Man, Machine, Measurements, and Environment, which are simply written as 5M1E (e.g., see Ref. [2]). To discover the key variation sources of a given quality problem, one can use a 5M1E approach. The approach first generates a checklist for each aspect of the 5M1E based on empirical knowledge, and then uses the checklist to identify the potential causes. The quality problem can be solved by removing the impacts of those causes on the quality variability.
The causes of quality variation can be roughly classified into two types:

• random causes (also termed common causes or background noise), and
• assignable causes (also termed special causes).

The random causes are many small and unavoidable causes, and they result in inherent variability. A process that is only subjected to random causes is said to be in statistical control. In practice, most of the variability is due to this type of cause. Generally, nothing can be done about these causes except to modify the process. Therefore, these causes are often called uncontrollable causes.
The assignable causes include improperly adjusted machines, operator errors, and defective raw materials. The variability due to assignable causes is generally large, so that the level of process performance is usually unacceptable. A process that is operating in the presence of assignable causes is said to be out of control. The variability due to this type of cause can be controlled through effective quality control schemes and process modifications such as machine adjustment, maintenance, and operator training.

The probability that an item produced is nonconforming depends on the state of the manufacturing process. When the state is in control, the probability that an item produced is nonconforming is very small, although nonconformance cannot be avoided entirely. When the state changes from in-control to out-of-control due to one or more of the controllable factors deviating significantly from their target values, this probability increases considerably.
As mentioned earlier, the lifetimes of nominally identical items can be different due to unit-to-unit variability. The product reliability realized in the manufacturing phase is called the inherent reliability, which is usually evaluated using the life test data of the product after the product is manufactured. The test data are obtained under strictly controlled conditions without being impacted by actual operating conditions and maintenance.

Assume that the life follows the Weibull distribution with parameters β and η. The life variability can be represented by the coefficient of variation σ/μ given by

$\sigma/\mu = \sqrt{\frac{\Gamma(1 + 2/\beta)}{\Gamma^2(1 + 1/\beta)} - 1}.$  (12.1)
Figure 12.1 shows the plots of σ/μ and p (with k = 0.5) versus β. It clearly shows that the smaller the variability is, the better the quality is. This is consistent with the conclusion obtained by Jiang and Murthy [5].
Fig. 12.1 Plots of σ/μ and p versus β
The items that do not conform to design specifications are nonconforming. There are two types of nonconforming items. In the first case, the item is not functional, and this can be detected immediately after it is put in use. This type of nonconformance is usually due to defects in assembly (e.g., a dry solder joint). In the other case, the item is functional after it is put in use but has inferior characteristics (e.g., a shorter mean life) compared with a conforming item. Such items usually contain weak or nonconforming components and cannot be detected easily. Jiang and Murthy [4] developed models to explicitly describe the effects of these two types of nonconformance on product reliability. We briefly outline them as follows.
We first look at the case of component nonconformance. Let F_1(t) denote the life distribution of a normal product and p the proportion of normal products. Assume that the life distribution of the product with weak components is F_2(t) and the proportion of defective products is q = 1 − p. The life distribution of the product population is given by

$F(t) = pF_1(t) + qF_2(t).$  (12.2)

Let δ = μ_2/μ_1 denote the ratio of the mean lives. After some simplifications, Eq. (12.5) can be written in the form of Eq. (12.6). Since the normal item has a longer life and smaller life dispersion, we have δ < 1 and ρ_1 ≤ ρ_2. From Eqs. (12.4) to (12.6), we have μ < μ_1 and ρ² > ρ_1², implying that the mean life of the item population is smaller than that of the normal item and the population has larger life dispersion.
We now look at the case of assembly errors without component nonconformance. Assume that the life distribution of the product with an assembly error is F_3(t) and the proportion of such products is r. For this case, the life distribution of the product population is given by

$F(t) = (1 - r)F_1(t) + rF_3(t).$  (12.7)

Considering the joint effect of both assembly errors and component nonconformance, the life distribution of the product population is given by

$F(t) = (1 - r)[pF_1(t) + qF_2(t)] + rF_3(t).$  (12.8)
For a product comprising n components in series, the product reliability is given by

$R_f(t) = \prod_{j=1}^{n} R_j(t).$  (12.9)
Table 12.1 Times to failure (h)

1124  667   2128  2785  700+   2500+  1642   2756
3467  800+  2489  2687  1974   1500+  1000+  2461
1945  1745  1300+ 1478  1000+  2894   1500+  1348
3097  1246  2497  2674  2056   2500+
For continuous production, let p(t) denote the probability that an item produced at time t is conforming. The expected proportion of conforming items over a production period of length T is given by

$\bar p(T) = \frac{1}{T}\int_0^T p(t)\,dt.$  (12.10)

For batch production, the items are produced in lots of size Q. At the start of each lot production, the state is in control. Let p(i) denote the probability that the ith item is conforming. The expected proportion of conforming items is given by

$\bar p(Q) = \frac{1}{Q}\sum_{i=1}^{Q} p(i).$  (12.11)
The wear rate of production tooling can be described by

$w(t) = w_{\infty} + (w_0 - w_{\infty})e^{-t/\eta}, \quad \eta > 0.$  (12.12)

Equation (12.12) does not reflect the wear behavior in the wear-out stage and hence is applicable only for the early and normal wear stages. The accumulated wear amount is given by

$W(t) = \int_0^t w(x)\,dx = w_{\infty}t + \eta(w_0 - w_{\infty})(1 - e^{-t/\eta}).$  (12.13)

A sketch of these two functions is given below.
The design of the production system has a significant impact on the fraction of conformance when the process is in control. Important issues for production system design include supply chain design, production planning, system layout, equipment selection, and production management.

Supply chain design involves supplier selection and contract specification. It also deals with choosing the shipping mode and warehouse locations.

Production planning deals with issues such as manufacturing tolerance allocation, process planning, and process capability analysis (which will be discussed in Chap. 14) to predict the performance of a production system.
The main considerations for system layout are the system's flexibility and robustness. Flexibility is the capability of producing several different products in one system with no interruption in production due to product differences. Flexibility is desired since it enables mass customization and high manufacturing utilization. A robust production system is desired so as to minimize the negative influence of fluctuations in operations on product quality. This can be achieved by using the Taguchi method to optimally choose the nominal values of the controllable factors.
Production equipment determines operating characteristics (e.g., production line
speed) and reliability. The speed impacts both quality and productivity. A high line
speed can increase productivity but harm quality. As such, the speed is a key factor
for equipment selection and needs to be optimized to achieve an appropriate
tradeoff between quality and productivity.
Production management focuses on the continuous improvement of product
quality. Quality improvements can be achieved by identifying and mitigating
quality bottlenecks. A quality bottleneck is a factor that can significantly impact
product quality. Improving the bottleneck factor will lead to the largest improve-
ment in product quality.
Machine breakdowns affect product quality. Preventive maintenance improves
the reliability of the production system and in turn improves quality. This necessitates
effective planning of preventive maintenance to mitigate machine deterioration.
Various statistical techniques have been developed to control and improve quality.
Major strategies for quality control and improvement in a production system are
shown in Fig. 12.2. As seen, the quality control and improvement strategies fall into
the following three categories:
- inspection and testing for raw materials and final product,
- statistical process control, and
- quality control by optimization.
[Fig. 12.2: schematic of the production process with controllable and uncontrollable inputs and a product test stage.]
The production process can be controlled using statistical process control techniques, which include off-line and online quality-control techniques, depending on
the type of manufacturing process. In continuous production, the process often first
operates in the in-control state and produces acceptable product for a relatively long
period of time; then assignable causes occur so that the process shifts to an out-of-control state and produces more nonconforming items. The change from in-control to out-of-control can be detected by regularly inspecting the items
produced and using control charts. A control chart is a graphical tool used to detect
process shifts. When an out-of-control state is identified, appropriate corrective
actions can be taken before many nonconforming units are manufactured. The
control chart technique will be discussed in detail in Chap. 14.
In batch production, the production system is set up and may be subjected to
preventive maintenance before going into production; hence the process starts in
control and can go out of control during the production of a lot. As the lot size
increases, the expected fraction of nonconforming items in a lot increases and the
set-up cost per manufactured unit decreases. Therefore, the optimal lot size can be
determined by a proper tradeoff between the manufacturing cost and the benefits
derived through better outgoing quality. This approach deals with quality control by
optimization. We look at the optimal lot size problem as follows.
Let Q denote the lot size. At the start of each lot production, the process is in
control. The state can change from in-control to out-of-control; once the state
changes to out of control, the process remains there until completion of the lot. Since the
expected fraction of nonconforming items increases and the setup cost per item
decreases as Q increases, an optimal batch size exists that minimizes the expected
manufacturing cost per conforming item.
Let p_0 [p_1] denote the probability of occurrence of nonconforming items when
the manufacturing process is in control [out of control]. Clearly, we have p_0 < p_1.
Let N ∈ [0, Q] denote the state change point, after which the process is out of
control. Since the probability of N = 0 is zero, N is a random positive integer.
When 1 ≤ N < Q, the process ends in the out-of-control state; otherwise, the
process ends in the in-control state. Assume that the probability that the in-control
state changes to the out-of-control state is q. For 1 ≤ N ≤ Q − 1, we have

p_i = P(N = i) = q (1 − q)^{i−1}.

It is noted that ∑_{i=1}^{Q−1} p_i + p_C = 1, where p_C is the probability that the
process remains in control throughout the lot.
Conditional on N = i ∈ [1, Q − 1], the expected number of nonconforming
items equals

n(i) = p_0 i + p_1 (Q − i) = p_1 Q − (p_1 − p_0) i.   (12.16)
We now look at the cost elements. The setup cost depends on the state at the end
of the previous run. It is c_s if the state at the end of the previous run is in control, and an
additional cost d is needed if the state at the end of the previous run is out of control.
The probability that the additional cost is needed is given by

p_A = ∑_{i=1}^{Q−1} p_i = 1 − (1 − q)^{Q−1}.   (12.19)
Let c1 denote the cost of producing an item (including material cost and labor
cost) and c2 denote the penalty cost of producing a nonconforming item. The
penalty cost depends on whether the nonconforming item has been identified before
being delivered to the customer. If yes, it includes disposal cost; if not, it includes
warranty cost. These costs are independent of Q.
The expected total cost per lot is given by

C(Q) = c_0 + c_1 Q + c_2 [1 − p̄(Q)] Q.   (12.21)

The expected cost per conforming item is then

J(Q) = {c_0 + c_1 Q + c_2 [1 − p̄(Q)] Q} / [p̄(Q) Q].   (12.22)
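To make the optimization concrete, the following Python sketch evaluates J(Q) under the geometric change-point model above and searches for the minimizing lot size. All parameter values here are hypothetical placeholders for illustration, not values from the book.

```python
import numpy as np

def expected_conforming_fraction(Q, q, p0, p1):
    # change point: P(N = i) = q(1 - q)**(i - 1) for i = 1..Q-1;
    # with probability (1 - q)**(Q - 1) the process stays in control for the whole lot
    i = np.arange(1, Q)
    pi = q * (1 - q) ** (i - 1)
    pc = (1 - q) ** (Q - 1)
    # expected nonconforming count, averaging Eq. (12.16) over the change point
    en = np.sum(pi * (p0 * i + p1 * (Q - i))) + pc * p0 * Q
    return 1.0 - en / Q  # expected fraction conforming, cf. Eq. (12.11)

def cost_per_conforming_item(Q, q, p0, p1, c0, c1, c2):
    pbar = expected_conforming_fraction(Q, q, p0, p1)
    C = c0 + c1 * Q + c2 * (1.0 - pbar) * Q  # Eq. (12.21)
    return C / (pbar * Q)                    # Eq. (12.22)

# hypothetical parameters for illustration
params = dict(q=0.01, p0=0.01, p1=0.2, c0=200.0, c1=1.0, c2=5.0)
Q_opt = min(range(2, 500), key=lambda Q: cost_per_conforming_item(Q, **params))
print(Q_opt, cost_per_conforming_item(Q_opt, **params))
```

With these placeholder costs the tradeoff is visible directly: a larger Q spreads the setup cost c_0 over more items but lowers p̄(Q).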
In statistics, sigma (σ) is usually used to denote the standard deviation, which represents the variation about the mean. Assume that a quality characteristic follows a
normal distribution with mean μ and standard deviation σ, and that the specification
limits are μ ± Δ. If Δ = 6σ, the probability that a product is within the specifications is nearly equal to 1. As such, the six sigma concept can be read as nearly
perfect, defect-free performance or world-class performance.
Six sigma quality is a systematic and fact-based process for continued improvement. It focuses on reducing variability in key product quality characteristics.
The six sigma implementation process involves five phases. The first phase is
the design or define phase. It involves identifying one or more project-driven
problems for improvement. The second phase is the measure phase. It involves collecting data on measures of quality so as to evaluate and understand the current state
of the process. The third phase is the analyze phase. It analyzes the data collected in
the second phase to determine root causes of the problems and to understand the
different sources of process variability. The fourth phase is the improve phase. Based
on the results obtained from the previous two phases, this step aims to determine
specific changes to achieve the desired improvement. Finally, the fifth phase is the
control phase. It involves the control of the improvement plan.
The ISO 9000 series are the quality standards developed by the International
Organization for Standardization [3]. These standards focus on the quality system
with components such as management responsibility for quality; design control;
purchasing and contract management; product identification and traceability;
inspection and testing; process control; handling of nonconforming product, corrective and
preventive actions; and so on. Many organizations require their partners or suppliers
to have ISO 9000 certification.
According to a number of comparative studies of the actual performance of
enterprises with and without ISO 9000 certification, its effectiveness strongly
depends on the motivation for the certification, i.e., just getting a pass or really
wanting to achieve an improvement in quality. This is because much of the focus of ISO
9000 is on formal documentation of the quality system rather than on variability
reduction and improvement of processes and products. As such, the certification
only certifies the processes and the system of an organization rather than its product
or service.
References
1. Blischke WR, Murthy DNP (2000) Reliability: modeling, prediction, and optimization. Wiley, New York, pp 492–493
2. Han C, Kim M, Yoon ES (2008) A hierarchical decision procedure for productivity innovation in large-scale petrochemical processes. Comput Chem Eng 32(4–5):1029–1041
3. International Organization for Standardization (2008) Quality management systems. ISO 9000:2000
4. Jiang R, Murthy DNP (2009) Impact of quality variations on product reliability. Reliab Eng Syst Saf 94(2):490–496
5. Jiang R, Murthy DNP (2011) A study of Weibull shape parameter: properties and significance. Reliab Eng Syst Saf 96(12):1619–1626
6. Montgomery DC (2007) Introduction to statistical quality control, 4th edn. Wiley, New York
7. Regattieri A (2012) Reliability evaluation of manufacturing systems: methods and applications. Manufacturing system. http://www.intechopen.com/books/manufacturing-system/reliability-evaluation-of-manufacturing-systemsmethods-and-applications. Accessed 16 May 2012
Chapter 13
Quality Control at Input
13.1 Introduction
The input material (raw material, components, etc.) is obtained from external
suppliers in batches, and the quality can vary from batch to batch. Acceptance
sampling is a way to ensure high input quality. It carries out tests with a small
sample from a batch. The batch is either accepted or rejected based on the test
outcome. According to the nature of quality characteristics, acceptance sampling
plans can be roughly divided into two types: acceptance sampling for attribute
(where the outcome of test is either normal or defective) and acceptance sampling
for variable (where the outcome of test is a numerical value).
As a quality assurance tool, acceptance sampling cannot be used to improve the
quality of the product. A way for manufacturers to improve the quality of their
products is to reduce the number of suppliers and to establish a strategic partnership
with their suppliers [5]. This deals with the supplier selection problem.
In this chapter, we focus on acceptance sampling and supplier selection. The chapter
is organized as follows. Section 13.2 deals with acceptance sampling for attribute.
Acceptance sampling for a normally distributed variable and acceptance sampling
for lifetime are discussed in Sects. 13.3 and 13.4, respectively. Acceptance sampling
for variable can be converted into acceptance sampling for attribute, and this is discussed
in Sect. 13.5. Finally, we discuss the supplier selection problem in Sect. 13.6.
items in a sample taken randomly from the lot. The lot is accepted if the number of
defects is not larger than a prespecified number; otherwise, the lot is rejected.
If the lot is rejected, it may be handled in different ways, e.g., returning it to
the supplier or inspecting every item. The latter case is called rectifying
inspection (or 100 % inspection). In rectifying inspection, the defective items
are removed or replaced with good ones.
Acceptance sampling can also be used by a manufacturer to inspect their own
products at various stages of production. The accepted lots are sent forward for
further processing, and the rejected lots may be reworked or scrapped.
An acceptance sampling plan deals with the design of the sampling scheme. Three typical
sampling plans are single sampling, double sampling, and sequential sampling. In a
single-sampling plan, one sample of items is randomly taken from the lot, and the
acceptance decision is made based on the information contained in the sample.
In a double-sampling plan, a decision based on the information in an initial
sample can be to accept the lot, reject the lot, or take a second sample. If a
second sample is taken, the final decision is made based on the information from the
initial and second samples.
In sequential sampling, a decision is made after inspection of each item randomly taken from the lot, and the decision can be to accept, to reject, or to continue
the process by inspecting another item. The process ends when an accept or
reject decision is made. Sequential sampling can substantially reduce
inspection costs. This is particularly true when the inspection is destructive and the
items are very expensive.
Depending on the specific situation, there are other sampling plans (e.g., see
Ref. [5]). For example, two extreme sampling plans are (a) accepting the lot with no
inspection and (b) inspecting every item in the lot and removing all defective units.
For conciseness, we focus on the single-sampling plan in this chapter.
[Figure: OC curve of acceptance probability P_a versus lot fraction defective p, with the customer's risk indicated.]
Consider the rectifying inspection where all defective items are replaced with good
ones. Let N and n denote the lot size and the sample size, respectively. The average
fraction defective obtained over a long sequence of lots is p and the acceptance
probability is P_a(p). When a lot is accepted, the total inspection number is n and the
outgoing lot has p(N − n) defective items. When a lot is rejected, the total
inspection number is N and the outgoing lot has zero defective items. As a result,
the average fraction defective (or average outgoing quality, AOQ) of all the
outgoing lots is given by

AOQ = P_a(p) p (N − n) / N.

When p → 0 or 1, AOQ → 0. This implies that the plot of AOQ versus p is unimodal with a maximum value, which is called the average outgoing quality limit.
The average total inspection number per lot is given by

ATI = n + [1 − P_a(p)](N − n).
Let n denote the sample size, n_d denote the number of defective items in the sample,
and c denote a critical defective number to be specified. Clearly, n_d/n can be viewed as
an estimate of p. The number of defectives in the sample follows the binomial
distribution with probability mass function p(x) (Eq. 13.3), and the lot is accepted
if n_d ≤ c, so the acceptance probability is

P_a(p) = ∑_{x=0}^{c} p(x).   (13.4)
[Fig. 13.2: producer's and customer's risks as functions of p, with p_0 indicated.]
In general, the OC curve cannot be made to pass exactly through these two known points. As such, we find (n, c) so that
the OC curve is closest to these two desired points and meets the inequalities

P_a(p_1) ≥ 1 − α,  P_a(p_2) ≤ β.   (13.5)

The closeness can be measured by the sum of squared errors

SSE = [1 − α − P_a(p_1)]² + [β − P_a(p_2)]².   (13.6)

As c increases, n increases and the risks decrease. The process is repeated until
Eq. (13.5) is met. We illustrate this approach as follows.
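As an illustration of this search, the following Python sketch (using SciPy's binomial distribution; all function names are ours) increases c and, for each c, picks the n that minimizes the SSE of Eq. (13.6), stopping when the inequalities of Eq. (13.5) hold.

```python
from scipy.stats import binom

def design_single_sampling(p1, p2, alpha, beta, n_max=2000, c_max=20):
    """Search for (n, c) meeting Eq. (13.5) while minimizing Eq. (13.6)."""
    for c in range(c_max + 1):
        # for this c, choose n minimizing the squared deviation from the two risk points
        sse, n = min(
            (((1 - alpha) - binom.cdf(c, n, p1)) ** 2
             + (beta - binom.cdf(c, n, p2)) ** 2, n)
            for n in range(c + 1, n_max)
        )
        if binom.cdf(c, n, p1) >= 1 - alpha and binom.cdf(c, n, p2) <= beta:
            return n, c
    return None

print(design_single_sampling(p1=0.005, p2=0.015, alpha=0.2, beta=0.2))
```

With the data of Example 13.1 below, this search should reproduce a plan close to (c, n) = (2, 295).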
Example 13.1 Assume that p_0 = 0.01, p_1 = 0.005, p_2 = 0.015 and α = β = 0.2.
The problem is to find the values of c and n.
Using the approach outlined above, we obtain the results shown in Table 13.1.
As seen, when c = 2, the inequalities given in Eq. (13.5) are met. It is noted that
the inequalities can also be met for c > 2, but more inspections are required.
For the sampling plan (c, n) = (2, 295) and N = 5000, Fig. 13.3 shows the
average outgoing quality curve. As seen, the average outgoing quality limit equals
0.4372 %, which is achieved at p = 0.7673 %. The acceptance probability when
p = 0.01 is 0.4334. This implies that the producer's risk is larger than the
customer's risk when p = p_0.
[Fig. 13.3: average outgoing quality (AOQ) curve.]
Discussion: The approach of specifying the two risk points can be troublesome and
potentially unfair. In fact, the producer's and customer's risks are generally unequal
at p = p_0 (see Fig. 13.2 and Example 13.1). As an improvement, Jiang [2] presents an
equal-risk acceptance sampling plan, in which the producer's average risk is defined as

r̄ = (1/p_0) ∫₀^{p_0} [1 − P_a(p)] dp.   (13.7)
For Example 13.1, when c = 2 and n = 267, the producer's average risk equals
0.1851, and the producer's and customer's risks at p = p_0 are 0.499829 and
0.500171, respectively, which are nearly equal. For this scheme, the producer's risk
is α = 0.1506 at p_1 = 0.005, and the customer's risk is β = 0.2352 at p_2 = 0.015.
This implies that the risk requirement given by the customer may be too high
relative to the risk requirement given by the producer.
When the lot size N is not very large, the acceptance sampling should be based on
the hypergeometric distribution, which describes the probability of x failures in n
draws from N items. Let m denote the number of conforming items; the number of
defective items in the lot is then N − m. Table 13.2 shows the possible cases among n, m
and N − m, where x_L and x_U are the lower and upper limits of X, respectively.
The probability of the event that there are x defective items in n items drawn
from N items is given by

p(x) = C_m^{n−x} C_{N−m}^{x} / C_N^{n},  x ∈ [x_L, x_U].   (13.8)
The acceptance probability is

P_a(p) = ∑_{x=x_L}^{c} p(x).   (13.9)
[Fig. 13.4: OC curves HG(150), BN(100) and BN(150); P_a versus p.]
As such, the OC curve is defined by Eq. (13.9). The approach for specifying n and c is
the same as that outlined in Sect. 13.2.5.
Figure 13.4 displays three OC curves:
(a) HG(150), which is associated with the sampling plan based on the hypergeometric distribution with (N, n, c) = (800, 150, 2),
(b) BN(100), which is associated with the sampling plan based on the binomial
distribution with (n, c) = (100, 2), and
(c) BN(150), which is associated with the sampling plan based on the binomial
distribution with (n, c) = (150, 2).
From the figure, we have the following observations:
(a) The OC curve associated with the hypergeometric distribution is not smooth
due to the rounding operation in Eq. (13.10).
(b) The OC curves for the binomial and hypergeometric distributions with the
same (n, c) are close to each other when n/N is small.
(c) For the same (n, c), the discriminatory power of the plan based on the hypergeometric distribution is slightly better than that of the plan based on the binomial
distribution.
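The following Python sketch compares the two OC curves for the plans plotted in Fig. 13.4; the number of defectives in the lot is rounded from pN, which is the rounding operation referred to above.

```python
import numpy as np
from scipy.stats import binom, hypergeom

def oc_binomial(p, n, c):
    return binom.cdf(c, n, p)

def oc_hypergeometric(p, N, n, c):
    D = int(round(p * N))             # number of defectives in the lot (rounded)
    return hypergeom.cdf(c, N, D, n)  # P(at most c defectives in the sample)

for p in np.arange(0.0, 0.051, 0.01):
    print(f"p={p:.2f}  HG(150)={oc_hypergeometric(p, 800, 150, 2):.3f}  "
          f"BN(150)={oc_binomial(p, 150, 2):.3f}")
```

The step-like behavior of the hypergeometric curve comes directly from the integer rounding of pN.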
Let X denote the quality characteristic with sample (x_i, 1 ≤ i ≤ n), and let Ȳ
denote the sample mean. The quality characteristic can be nominal-the-best,
smaller-the-better, or larger-the-better. For the larger-the-better case, we set a
lower limit y_L: if the sample average is less than the lower limit, the lot is rejected;
otherwise it is accepted. Similarly, we set an upper limit y_U for the smaller-the-better
case, and set both the upper and lower limits for the nominal-the-best case.
The lifetime of the product is an important quality characteristic. The sampling plan
for lifetime usually aims to control the mean life and deals with a statistical hypothesis
test. The hypothesis can be tested either based on the observed lifetimes or based on
the observed number of failures. For the former case, we let μ denote the average
life of a product and μ_0 denote the acceptable lot average life. A product is accepted
if the sample information supports the hypothesis

μ ≥ μ_0.   (13.14)

For the latter case, the observed number of failures m is compared with the
acceptable failure number c. The lot is rejected if m > c; otherwise, it is accepted.
Since life tests are expensive, it is desirable to shorten the test time. As such,
lifetime tests are commonly truncated. The fixed-time truncated test (type-I) and the fixed-number truncated test (type-II) are two conventional truncated test methods. Many
testing schemes can be viewed as extensions or mixtures of these two truncated
schemes. The choice among testing methods mainly depends on testing equipment and
environment.
Suppose that a tester can simultaneously test r items, which are called a group. If
g groups of items are tested, the sample size will be n = rg. A group acceptance
sampling plan is based on the information obtained from testing these groups of
items. Since r is usually known, the sample size depends on the number of groups g.
Sudden death testing is a special group acceptance sampling plan that can
considerably reduce testing time. Here, each group is tested simultaneously until the
first failure occurs. Clearly, this is a fixed-number truncated test for each group.
There are several approaches for designing a sudden death test, and we focus
on the approach presented in Ref. [4]. Assume that the product life T follows the
Weibull distribution with shape parameter β and scale parameter η. Let T_j
(1 ≤ j ≤ g) denote the time to the first failure for the jth group (termed the group
failure time). Since T_j = min{T_{ji}, 1 ≤ i ≤ r}, T_j follows the Weibull distribution with
shape parameter β and scale parameter s = η/r^{1/β}. It is noted that T_j^β is also a
random variable and follows the exponential distribution with scale parameter η^β/r.
Similarly, H_j = (T_j/η)^β is a random variable and follows the exponential distribution with scale parameter (or mean) 1/r. The sum of g independent and identically distributed exponential random variables (with mean μ) follows the Erlang
distribution, which is a special gamma distribution with shape parameter g (an
integer) and scale parameter μ. This implies that V = ∑_{j=1}^{g} H_j follows the
gamma distribution with shape parameter g and scale parameter 1/r.
There is a close relation between the gamma distribution and the chi-square
distribution. The chi-square pdf is given by

f_chi(x) = x^{q/2−1} e^{−x/2} / [2^{q/2} Γ(q/2)],   (13.15)

where q is the degrees of freedom. In particular, 2rV follows the chi-square
distribution with q = 2g degrees of freedom.
Let t_L denote the lower limit of the lifetime. The quality of the product can be
defined as the probability of failing before t_L, i.e.,

p = P(T < t_L) = 1 − e^{−(t_L/η)^β},   (13.16)

or, equivalently,

q(p) = −ln(1 − p) = (t_L/η)^β.   (13.17)

Letting H = ∑_{j=1}^{g} (T_j/t_L)^β, we have H = V/q(p), and the lot is accepted if
H ≥ c. From Eq. (13.18), the acceptance probability is

P_a(p) = 1 − F_chi(2rc q(p); 2g),   (13.20)

where F_chi is the chi-square distribution function. For given g and c, Eq. (13.20)
specifies the OC curve of the sampling plan.
Let α denote the producer's risk at the acceptable reliability level p_1 and β_r
denote the consumer's risk at the lot tolerance reliability level p_2. The parameters g
and c can be determined by solving the following inequalities:

P_a(p_1) ≥ 1 − α,  P_a(p_2) ≤ β_r,   (13.21)

or, equivalently,

2rc q(p_1) ≤ F_chi^{−1}(α; 2g),  2rc q(p_2) ≥ F_chi^{−1}(1 − β_r; 2g).   (13.22)

Dividing the two parts of Eq. (13.22) yields

ln(1 − p_1)/ln(1 − p_2) ≤ F_chi^{−1}(α; 2g) / F_chi^{−1}(1 − β_r; 2g).   (13.23)
As such, g is the smallest integer that meets Eq. (13.23). Once g is specified, we can
find the value of c by minimizing an SSE-type measure analogous to Eq. (13.6).
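A Python sketch of this two-step design, under the chi-square form of the OC curve reconstructed above (the function names and the risk points used below are our illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

def q_of_p(p):
    # q(p) = -ln(1 - p), Eq. (13.17)
    return -np.log(1.0 - p)

def oc(p, g, c, r):
    # acceptance probability, Eq. (13.20)
    return 1.0 - chi2.cdf(2.0 * r * c * q_of_p(p), 2 * g)

def smallest_g(p1, p2, alpha, beta_r, g_max=500):
    # smallest number of groups g satisfying Eq. (13.23)
    ratio = np.log(1.0 - p1) / np.log(1.0 - p2)
    for g in range(1, g_max + 1):
        if ratio <= chi2.ppf(alpha, 2 * g) / chi2.ppf(1.0 - beta_r, 2 * g):
            return g
    return None

# hypothetical risk points for illustration
print(smallest_g(p1=0.005, p2=0.03, alpha=0.05, beta_r=0.10))
```

Once g is fixed, c can then be searched over a grid so that the two risks computed from oc() are as close as possible to their targets.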
Without loss of generality, we consider the case where the quality characteristic is
the lifetime. Let F(t; θ_1, θ_2) denote the life distribution, where θ_1 is a shape
parameter (e.g., β for the Weibull distribution or σ_l for the lognormal distribution)
and θ_2 is a scale parameter proportional to the mean μ (i.e., θ_2 = aμ). The value of θ_1 is
usually specified based on experience but can be updated when sufficient data are
available to estimate a new shape parameter. Let μ_0 denote the critical value of the
mean, so that the lot is accepted when μ ≥ μ_0.
Consider the fixed-time test plan with truncation time τ. The failure probability
before τ is given by p = F(τ; θ_1, aμ). For a sample of size n, the probability of
x failures (0 ≤ x ≤ n) is given by the binomial distribution with the probability
mass function given by Eq. (13.3). Let c denote the critical failure number. The
acceptance probability is given by Eq. (13.4). Given one of the parameters n, τ and c,
the other two parameters can be determined by minimizing an SSE-type measure
of the form of Eq. (13.6), where α and β_r are the risks of the producer and customer,
respectively.
13.6 Supplier Selection
There are two kinds of supplier selection problem (SSP). One deals with a specific
purchasing decision and the other deals with establishing a strategic partnership
with suppliers. The purchasing decision problem is relatively simple, whereas the
strategic partnership with suppliers is a much more complicated problem since it
involves many qualitative and quantitative factors, which often conflict with
one another. We discuss these two kinds of SSP separately as follows.
Often, a manufacturer needs to select a component supplier from several suppliers. The reliability and price of the components differ across suppliers,
and the problem is to select the best component supplier. Jiang and Murthy [3] deal
with this problem for the situation where the other conditions are similar and the
main concern is reliability. Here, we extend their model to the situation where the
main concerns are reliability and cost, and the other conditions are similar.
Suppose a key component is used in a system with a known design life (e.g.,
preventive replacement age) L. For a given supplier, assume that the life distribution
of its component is F(t). If the actual life of a component is larger than L, the
associated life cycle cost is c_p; otherwise, the cost is c_f = (1 + δ)c_p with δ > 0. The
selection decision is made based on the expected cost rate; namely, the
selection will favor the supplier with the smallest expected cost rate. We derive the
expected cost rate as follows.
The expected life per cycle is given by

E(L) = L[1 − F(L)] + ∫₀^L t f(t) dt = ∫₀^L [1 − F(t)] dt.   (13.27)

The expected cost per cycle is

E(C) = c_p [1 − F(L)] + c_f F(L) = c_p [1 + δ F(L)],   (13.28)

and the expected cost rate is

J = E(C)/E(L).   (13.29)
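A small Python sketch of Eqs. (13.27)–(13.29) for Weibull component lives; the supplier parameters below are hypothetical placeholders, since the values of Table 13.4 are not reproduced here.

```python
import numpy as np
from scipy.integrate import quad

def cost_rate(L, beta, eta, cp, delta):
    F = lambda t: 1.0 - np.exp(-(t / eta) ** beta)  # Weibull cdf
    EL, _ = quad(lambda t: 1.0 - F(t), 0.0, L)      # Eq. (13.27)
    EC = cp * (1.0 + delta * F(L))                  # Eq. (13.28)
    return EC / EL                                  # Eq. (13.29)

# hypothetical suppliers: name -> (beta, eta, cp)
suppliers = {"A1": (1.5, 120.0, 100.0), "A2": (2.5, 90.0, 110.0), "A3": (3.5, 85.0, 120.0)}
rates = {k: cost_rate(50.0, *v, delta=0.5) for k, v in suppliers.items()}
print(min(rates, key=rates.get), rates)
```

The ranking is driven jointly by the shape parameter β (which controls how much life survives past L) and by the price c_p.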
Example 13.5 Suppose a certain component in a system has a design life L = 50.
The component can be purchased from three suppliers: A_1, A_2 and A_3. The lifetime
follows the Weibull distribution with the parameters shown in the second and third
columns of Table 13.4. The cost parameters are shown in the fourth column of the
table. We assume that δ = 0.5 for all the suppliers.
The expected cost, expected lifetime, and cost rate are shown in the last three
columns of the table, respectively. As seen, the expected lifetimes of the alternatives are nearly identical even though the mean lifetimes are quite different. This
results from the fact that Alternatives 2 and 3 have larger shape parameters. Based
on the cost rate criterion, Alternative 1 is the worst and Alternative 3 is the best.
If the selection criterion is the value of c_p/μ, Alternative 3 is the worst and
Alternative 1 is the best. This reflects that β has a considerable influence on the
purchasing decision.
total quality management program, corrective and preventive action system, process
control capability, and so on. The characteristics for the delivery criterion include
delivery lead time, delivery performance, and so on. The characteristics for the cost
criterion include competitiveness of cost, logistics cost, manufacturing cost,
ordering cost, fluctuation in costs, and so on.
The priorities of the criteria are derived through pairwise comparisons.
Supplier scores for each characteristic can be derived based on the indicators of the
characteristic or based on pairwise comparisons. Pairwise comparisons require
a significant effort, and hence voting and ranking methods can be used to
determine the relative importance ratings of alternatives.
Once the above tasks are completed, it is relatively simple to calculate the global
scores of the alternatives and make the relevant decision. For more details about AHP, see
Online Appendix A.
The supplier evaluation method based on AHP is useful for both manufacturers
and suppliers. The manufacturer may use this approach for managing the entire
supply system and adopt specic actions to support suppliers; and through the
evaluation process suppliers may identify their strengths and weaknesses, and adopt
corrective actions to improve their performance.
References
1. Ha SH, Krishnan R (2008) A hybrid approach to supplier selection for the maintenance of a competitive supply chain. Expert Syst Appl 34(2):1303–1311
2. Jiang R (2013) Equal-risk acceptance sampling plan. Appl Mech Mater 401–403:2234–2237
3. Jiang R, Murthy DNP (2011) A study of Weibull shape parameter: properties and significance. Reliab Eng Syst Saf 96(12):1619–1626
4. Jun CH, Balamurali S, Lee SH (2006) Variables sampling plans for Weibull distributed lifetimes under sudden death testing. IEEE Trans Reliab 55(1):53–58
5. Montgomery DC (2007) Introduction to statistical quality control, 4th edn. Wiley, New York
Chapter 14
Statistical Process Control
14.1 Introduction
The items of a product should be produced by a stable process so that the variability
of the product's quality characteristics is sufficiently small. Statistical process
control (SPC) is a tool to achieve process stability and improve process capability
through the reduction of variability. There are several graphical tools for the purpose of SPC: histograms, check sheets, Pareto charts, cause-and-effect
diagrams, defect concentration diagrams, scatter diagrams, and control charts [3]. In
this chapter, we focus on control charts. Typical control charts are presented;
process capability indices and multivariate statistical process control methods are
also discussed.
The outline of the chapter is as follows. Section 14.2 deals with control charts for
variable and Sect. 14.3 with design and use of the Shewhart control chart. Process
capability indices are presented in Sect. 14.4. Multivariate statistical process control
methods are discussed in Sect. 14.5. Finally, typical control charts for attribute are
presented in Sect. 14.6.
In continuous production, the process begins in an in-control state. When some
of the controllable factors significantly deviate from their nominal values, the state
of the production process changes from in-control to out of control. If the change is
detected, the state can be brought back to in-control in order to avoid the
situation where many nonconforming items are produced. Control charts can be
used to detect the state change of a process.
The basic principle of a control chart is to take small samples periodically and to
plot the sample statistics of one or more quality characteristics (e.g., mean, spread,
number, or fraction of defective items) on a chart. A significant deviation in the
statistics is likely to be the result of a change in the process state. When this
occurs, the process is stopped and the controllable factors that have deviated are
restored to their nominal values. As such, the process is monitored, out-of-control cases can be detected, and the number of defectives is reduced.
Let X denote the quality characteristic and x denote its realization. A sample of
n items is taken every h hours and the quality characteristic of each sample
item is measured. Let s_j = u(x) denote the sample statistic at t_j = jh (j = 1, 2, ...).
The horizontal axis of a control chart is t or j and the vertical axis is s. The point (t_j, s_j)
on a control chart is called a sample point.
Usually, a control chart has a center line and two control lines (or control limits).
The center line represents the average value of the quality characteristic corre-
sponding to the in-control state, and the two control lines are parallel to the center
line and called the upper control limit (UCL) and the lower control limit (LCL),
respectively. The control limits are chosen so that nearly all of the sample points
will fall between them in a random way if the process is in-control. In this case, the
process is assumed to be in-control and no corrective action is needed. If a sample
point falls outside the control limits or several successive sample points exhibit a
nonrandom pattern, this can be viewed as an indicator of a change in the process
state from in-control to out of control. In this case, investigation and corrective
action are required to nd and eliminate the assignable causes.
Most quality characteristics are continuously valued, and the statistic used to
represent a quality characteristic can usually be approximated by the normal distribution. Assume that the distribution parameters of X are μ_0 and σ_0 when the
process is in control. Consider a random sample of size n. Let X_{ji} denote the
observed value for the ith item at time instant t_j. The sample mean is given by

s_j = X̄_j = (1/n) ∑_{i=1}^{n} X_{ji}.   (14.1)
The control chart based on the sample average is called an X̄ control chart, which
monitors the process mean; the control chart based on the sample range is
called a range chart or R chart, which monitors the process variability.
Consider the mean control chart. When the process is in control, X̄ approximately follows a normal distribution with parameters μ_x̄ = μ_0 and σ_x̄ given by

σ_x̄ = σ_0/√n.   (14.3)

The 100(1 − α) % interval estimate of X̄ is given by

(μ_x̄ − z_{1−α/2} σ_x̄,  μ_x̄ + z_{1−α/2} σ_x̄).   (14.4)
The control charts developed according to Eq. (14.5) are called Shewhart
control charts. The L in Eq. (14.5) plays the role of z_{1−α/2} in Eq. (14.4). When L = 3,
the corresponding control limits are called the three-sigma control limits.
For the R chart, the control limits are

LCL = D_3 R̄,  UCL = D_4 R̄,   (14.6)

where D_3 and D_4 are constants depending on n; an approximation for D_4 is given by Eq. (14.7).
The maximum relative error between the value of D4 calculated from Eq. (14.7) and
the value obtained from Appendix VI of Ref. [3] is 0.1469 %, which is achieved
when n = 3.
A control chart can give two types of error. A Type I error occurs when the process
is actually in control but the control chart gives an out-of-control signal. This
false alarm leads to a stoppage of production while the process is in control.
A Type II error occurs when the process is actually out of control but the control
chart gives an in-control signal. This type of error delays the initiation of
corrective action. When this occurs, more nonconforming items are produced
because the process remains out of control.
The probability of a Type I error equals the probability that X̄ falls outside the
control limits and is given by P_I = 2[1 − Φ(L)].
The basic performance measure of a control chart is the average time to signal (ATS). This
includes two cases, which correspond to the concepts of Type I error and Type
II error, respectively. We use ATS_0 to denote the ATS associated with Type I error,
and ATS_1 to denote the ATS associated with Type II error. ATS_1 is an indicator
of the power (or effectiveness) of the control chart. A large ATS_0 and a small ATS_1
are desired.
We first look at ATS_0. For a given combination of n and h, the average number of
points (or samples) before a point wrongly indicates an out-of-control condition is
called the average run length (ARL) of the control chart. Let p denote the probability
that a single point falls outside the control limits when the process is in control;
clearly, p equals the Type I error probability. Each sampling can be viewed as an
independent Bernoulli trial, so the number of samples (or run length) needed to give
an out-of-control signal follows a geometric distribution with mean 1/p. As such, the
average run length is given by ARL_0 = 1/p.
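The same geometric argument gives the out-of-control ARL when the process mean shifts by k process standard deviations (the quantity plotted in Fig. 14.3 below). A minimal Python sketch, assuming a normally distributed sample mean and L-sigma limits:

```python
from scipy.stats import norm

def arl(k=0.0, n=5, L=3.0):
    # probability that a sample mean stays inside the +/- L sigma limits
    # when the process mean has shifted by k process standard deviations
    p_in = norm.cdf(L - k * n ** 0.5) - norm.cdf(-L - k * n ** 0.5)
    return 1.0 / (1.0 - p_in)  # mean of the geometric run-length distribution

print(arl())             # in-control ARL0 for 3-sigma limits, about 370
print(arl(k=1.0, n=5))   # ARL1 for a one-sigma mean shift with n = 5
```

For k = 0 this reproduces the familiar in-control ARL of about 370 samples for three-sigma limits.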
When samples are taken at a fixed time interval h, the average time to a false
alarm is given by ATS_0 = h · ARL_0.
Figure 14.3 shows the plot of ARL_1 versus k for the X̄ chart with 3-sigma limits.
As seen, for a fixed k, ARL_1 decreases as n increases. Since ATS_1 is proportional to
h, a small value of ATS_1 can be achieved using a small value of h.
[Fig. 14.3: ARL_1 versus the shift size k for several sample sizes n.]
The design of a control chart involves two phases. In the first phase, a trial control
chart is obtained based on data from pilot runs. In the second phase,
the trial control chart is used to monitor the actual process, and the control chart can
be periodically revised using the latest information.
The data for estimating μ_0 and σ_0 should contain at least 20–25 samples with a
sample size between 3 and 6. The estimate of μ_0 based on the observations obtained
during the pilot runs should be consistent with

μ_0 = (LSL + USL)/2,   (14.14)
where LSL and USL are the lower and upper specication limits of the quality
characteristic, respectively.
If the quality characteristic can be approximated by the normal distribution, the
control limits are determined by μ_0 ± 3σ_0/√n or μ_0 ± z_{1−α/2} σ_0/√n with α = 0.002.
The usual value of the sample size n is 4, 5, or 6. A large value of n will decrease the
probability of Type II error but increase the inspection cost. When the quality characteristic of the product changes relatively slowly, a small sample size should be used.
Sampling frequency is represented by the inspection interval h. A small value of
h implies better detection ability but more sampling effort. The sampling effort can be
represented by the inspection rate (number of items inspected per unit time), given by

r = n/h.   (14.15)
[Figure: X̄ chart for Example 14.1, with UCL, LCL and LSL indicated.]
h = n/r_m.   (14.16)

Generally, the value of h should be as small as possible, and hence we usually take
r = r_m.
Example 14.1 A manufacturing factory produces a type of bearing. The diameter of
the bearing is a key quality characteristic and is specified as 80 ± 0.008 mm. The
process mean can easily be adjusted to the nominal value. The pilot runs yield
σ_0 = 0.002 mm and R̄ = 0.0021 mm. The maximum allowable inspection rate is
r_m = 4 items per hour, and the minimum allowable ATS_0 is 400 h. The problem is
to design an X̄ chart and an R chart.
The control limits of the X̄ chart are

UCL = μ_0 + z_{1−α/2} σ_0/√n = 80.0026,  LCL = 79.9974.
A sampling strategy deals with how the samples are taken. An appropriate sampling
strategy can obtain as much useful information as possible from the control chart
analysis. Two typical sampling strategies are consecutive sampling and random
sampling.
[Figure: R chart for Example 14.1, with center line and LCL indicated.]
The consecutive sampling strategy takes the sample items from items produced at
almost the same time. Such samples have a small unit-to-unit variability within a
sample. This strategy is suitable for detecting process mean shifts.
The random sampling strategy randomly takes each sample from all items that
have been produced since the last sample was taken. If the process average drifts
between several levels during the inspection interval, the range of the observations
within a sample may be relatively large. In this case, the R chart tends to give
false alarms that are actually due to drifts in the process average rather than to
changes in the process variability. This strategy is often used when the control chart is
employed to make decisions about the acceptance of all items produced
since the last sample.
If a process consists of several machines and their outputs are pooled into a
common stream, control chart techniques should be applied to the output of each
machine so as to detect whether or not a certain machine is out of control.
Variability of process data can be random or nonrandom. Typically, there are three
types of variability in the use of a control chart. They are stationary and uncorre-
lated, stationary but autocorrelated, and nonstationary.
The process data from an in-control process vary around a fixed mean in a
random manner. This type of variability is stationary and uncorrelated. For this case
the Shewhart control charts can be used to effectively detect out-of-control
conditions.
If successive observations have a tendency to move on either side of the mean,
this type of variability is stationary but autocorrelated. The variability with this
phenomenon is nonrandom.
If the process does not have a stable mean, the variability with this phenomenon
is nonstationary. This kind of nonrandom pattern usually results from some external
factors such as environmental variables or properties of raw materials, and can be
avoided using engineering process control techniques such as feedback control.
When the plotted points exhibit some nonrandom pattern, it may indicate an out-
of-control condition. Three typical nonrandom patterns are
(a) the number of points above the center line is significantly different from the
number of points below the center line,
(b) several consecutive points increase or decrease in magnitude, and
(c) a cyclic pattern. This occurs possibly due to some periodic cause (e.g., operator
fatigue) and significantly affects the process standard deviation.
Several tests for randomness can be found in Sect. 6.6.
To help identify the nonrandom patterns, warning limits and one-sigma lines can be
displayed on control charts. The warning limits are the 2-sigma limits for a quality
characteristic with the normal distribution, or the 0.025-fractile and 0.975-fractile
for the case where the control limits are defined as the 0.001 probability limits (i.e.,
α = 0.002). All these limits and lines partition the control chart into three zones on
each side of the center line. The region between the control limit and the warning
limit is called Zone A; the region between the one-sigma line and the warning limit
is called Zone B; and the region between the one-sigma line and the center line is
called Zone C.
When a point falls outside the control limits, a search for an assignable cause is
made and corrective action is accordingly taken. If one or more points fall into Zone
A, one possible action is to increase the sampling frequency and/or the sample size
so that more information about the process can be obtained quickly. This adjusted
sample size and/or sampling frequency depend on the current sample value. The
process control schemes with variable sample size or sampling frequency are called
adaptive schemes.
The use of warning limits allows the control chart to signal a shift in the process
more quickly but can result in more false alarms. Therefore, it is not necessary to
use them if the process is reasonably stable.
The control chart does not indicate the cause of the change in the process state.
Usually, FMEA is used to identify the assignable causes, and an out-of-control
action plan (OCAP) provides countermeasures to eliminate the causes. The OCAP
is a flowchart of activities to follow when an out-of-control condition occurs, including
checkpoints and actions to eliminate the identified assignable cause. A control chart
and an OCAP should be used jointly and updated over time.
The process capability can be represented in terms of process capability indices and
fraction nonconforming, which can be used to compare different processes that are
in a state of statistical control.
The process capability is the ability of a process to produce the output that meets the
specication limits. A process capability index (PCI) is a measure for representing
the inherent variability of a quality characteristic relative to its specication limits.
It is useful for product and process design as well as for supplier selection and
control.
Consider a quality characteristic Y with specification limits LSL and USL. Assume that Y follows the normal distribution with process mean μ
and variance σ². The fraction of nonconformance (or defect rate) is given by

p = Φ((LSL − μ)/σ) + 1 − Φ((USL − μ)/σ),   (14.17)

where Φ(·) is the standard normal distribution function. Figure 14.6 shows the
influence of σ on p. As seen, a good process has a small σ and a small fraction of
nonconformance.
Under the following assumptions:
- the process is stable,
- the quality characteristic follows the normal distribution,
- the specification limits are two-sided and symmetrical, and
- the process mean is at the center of the specification limits,
the process capability index C_p is defined as

C_p = (USL − LSL)/(6σ).   (14.18)
[Fig. 14.6: normal distributions between LSL and USL; a poor process has a large σ.]
If the specification limits are one-sided, the process capability index is defined as

C_pl = (μ − LSL)/(3σ)  or  C_pu = (USL − μ)/(3σ).   (14.19)
If the process mean is not at the center of the specification limits, the process
capability can be represented by the index C_pk given by

C_pk = min(C_pl, C_pu).   (14.20)

This index can be used to judge how reliable a process is. When C_pk = 1.5, the defect
rate equals 3.4 parts per million, which corresponds to the famous Six Sigma quality.
If the process mean is not equal to the target value T, the process capability can
be represented by the index C_pm given by

C_pm = C_p / √(1 + [(μ − T)/σ]²).   (14.21)
If the process mean is neither at the center of the specification limits nor equal to
the target value, the process capability can be represented by the index C_pkm given by

C_pkm = C_pk / √(1 + [(μ − T)/σ]²).   (14.22)
More variants and details of the process capability indices can be found from
Refs. [1, 4, 5].
For a given process, a small PCI means a large variation. Therefore, a large PCI
(e.g., C_p > 1.0) is desirable. In general, the minimum acceptable PCI for a new
process is larger than that for an existing process; the PCI required for two-sided
specification limits is larger than that for a one-sided specification limit; and a
large PCI is required for a safety-related or critical quality characteristic.
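A compact Python sketch of Eqs. (14.17)–(14.22), evaluated on the data of Example 14.2 below (the function name is ours):

```python
from scipy.stats import norm

def capability(mu, sigma, lsl, usl, target):
    cp = (usl - lsl) / (6 * sigma)               # Eq. (14.18)
    cpk = min(mu - lsl, usl - mu) / (3 * sigma)  # Eq. (14.20)
    adj = (1 + ((mu - target) / sigma) ** 2) ** 0.5
    cpm, cpkm = cp / adj, cpk / adj              # Eqs. (14.21) and (14.22)
    # fraction of nonconformance, Eq. (14.17)
    p = norm.cdf((lsl - mu) / sigma) + 1 - norm.cdf((usl - mu) / sigma)
    return cp, cpk, cpm, cpkm, p

print(capability(50.0, 0.5, 48.0, 52.0, 50.5))
```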
Example 14.2 Assume that a quality characteristic follows the normal distribution
with standard deviation σ = 0.5, and the specification limits and target value
equal LSL = 48, USL = 52 and T = 50.5, respectively. The problem is to calculate
the values of the process capability indices for different process means.
For the set of process means shown in the first row of Table 14.1, the corresponding values of C_p are shown in the second row and the fractions of nonconformance in the third row. As seen, C_p remains constant, but p varies with
μ and achieves its minimum when the process mean is at the center of the specification limits (i.e., μ = 50).
The fourth row shows the values of C_pk. As seen, C_pk achieves its maximum when
the process mean is at the center of the specification limits. The fifth row shows the
values of C_pm, which achieves its maximum when the process mean equals the target
value. The last row shows the values of C_pkm, which achieves its maximum when
μ = 50.33, a value between the target value and the center of the specification limits.
The fraction nonconforming is the probability for the quality characteristic to fall
outside the specification limits, and can be estimated based on information from
control charts that exhibit statistical control. To illustrate, we consider the X̄
chart and R chart. From the in-control observations of the X̄ chart, we may estimate
the process mean μ and the average range R̄. The process standard deviation σ_s associated with X̄ can be estimated by

σ_s = R̄/d_2,   (14.23)

where d_2 is given by Eq. (14.24). The maximum relative error between the value of d_2 calculated from Eq. (14.24)
and the one given in Appendix VI of Ref. [3] is 0.5910 %, which is achieved at
n = 2. The process standard deviation σ associated with X is given by

σ = √n σ_s = √n R̄/d_2.   (14.25)

As a result, the process capability index can be written as

C_p = (USL − LSL) / [√n (UCL − LCL)].   (14.26)

This implies that the process capability index can be estimated from the information
given by control charts.
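A one-line sketch of Eq. (14.26), applied to the control limits of Example 14.1; the sample size n = 6 is our assumption:

```python
def cp_from_chart(usl, lsl, ucl, lcl, n):
    # Eq. (14.26): estimate Cp from the X-bar chart control limits
    return (usl - lsl) / (n ** 0.5 * (ucl - lcl))

print(cp_from_chart(80.008, 79.992, 80.0026, 79.9974, n=6))
```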
The Shewhart control charts deal with a single quality characteristic. A product can
have several key quality characteristics. In this case, several univariate control
charts can be used to monitor these quality characteristics separately if they are
independent of each other. However, if the quality characteristics are correlated, the
univariate approach is no longer appropriate and multivariate SPC methods
must be used.
Two typical multivariate SPC methods are multivariate control charts and pro-
jection methods. The multivariate control charts only deal with product quality
variables, and the projection methods deal with both quality and process variables.
We briefly discuss them as follows.
To be concise, we look at the multivariate Shewhart control chart with two correlated quality characteristics. Let Y_1 and Y_2 denote the quality characteristics,
which are normally distributed; let μ_1 and μ_2 denote their means and a_{ij} (i, j = 1, 2)
denote the elements of the inverse of the covariance matrix of Y_1 and Y_2. Let

T² = a_{11}(Ȳ_1 − μ_1)² + 2a_{12}(Ȳ_1 − μ_1)(Ȳ_2 − μ_2) + a_{22}(Ȳ_2 − μ_2)².   (14.27)
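A generic Python sketch of this quadratic-form statistic (our function name; it works for any number of correlated characteristics, with the two-variable case of Eq. (14.27) as a special case):

```python
import numpy as np

def hotelling_t2(y, mu, cov):
    """Quadratic-form statistic of Eq. (14.27) for an observation vector y."""
    a = np.linalg.inv(np.asarray(cov))  # elements a_ij of the inverse covariance matrix
    d = np.asarray(y) - np.asarray(mu)
    return float(d @ a @ d)

print(hotelling_t2([10.2, 5.1], [10.0, 5.0], [[0.04, 0.01], [0.01, 0.02]]))
```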
Let Y denote the quality characteristics set and X denote the process variables set.
When the number of optional quality variables is large, it is necessary to reduce the
number of the quality variables. In practice, most of the variability in the data can
be captured in the few principal process variables, which can explain most of the
predictable variations in the product. The principal component analysis (PCA, see
Online Appendix C) and partial least squares (PLS, e.g., see Ref. [6]) are two useful
tools for this purpose. A PCA or PLS model is established based on historical data
collected in the in-control condition, and hence it represents the normal operating
conditions for a particular process. Then, a multivariate control chart (e.g., T2-chart)
can be developed based on the few principal variables.
Unlike univariate control charts, which can give an out-of-control signal but
cannot diagnose the assignable cause, multivariate control charts based on PCA or
PLS can diagnose assignable causes using the underlying PCA or PLS model. More
details about multivariate SPC can be found in Ref. [2] and the literature cited
therein.
Suppose that we inspect m samples of size n. Let D_i denote the number of
defectives in the ith sample. The sample fraction nonconforming is given by

p_i = D_i/n.   (14.28)

The center line and control limits of the chart are

CL = p̄,  LCL = p̄ − 3√(p̄(1 − p̄)/n),  UCL = p̄ + 3√(p̄(1 − p̄)/n).   (14.29)

If LCL < 0, then take LCL = 0. The control chart defined by Eq. (14.29) is called
the p chart.
Another control chart (called the np chart) can be established for D; this follows
from Eq. (14.28). Clearly, D has mean np and variance σ² = np(1 − p). As a result,
the center line and control limits of the corresponding mean control chart are given by

μ = np,  LCL = np − 3√(np(1 − p)),  UCL = np + 3√(np(1 − p)).   (14.30)
The mean and standard deviation of p are 0.0373 and 0.0314, respectively. As
such, the center line and control limits of the p chart are 0.0373, 0 and 0.0942,
respectively.
The mean and standard deviation of D are 1.87 and 1.57, respectively. As such,
the center line and control limits of the np chart are 1.87, 0 and 6.58, respectively.
Let d denote the number of defects in an inspected item, and c_0 denote the maximum allowable number of defects of an item. If d > c_0, the inspected item is
defective; otherwise, it is normal. The c chart is designed to control the number of
defects per inspected item. Here, the defects can be voids of a casting item, a
component that must be resoldered on a printed circuit board, and so on.
Let c denote the total number of defects in an inspected item. It follows the
Poisson distribution with mean and variance λ. As such, the center line and control
limits of the c chart are given by

μ = λ,  LCL = max(0, λ − 3√λ),  UCL = λ + 3√λ.   (14.31)
Let u = D/n denote the average number of defects per item. Then u has mean ū
and standard deviation σ = √(ū/n). The u chart is developed to control the value of
u. The center line and control limits of the u chart are given by

μ = ū,  LCL = max(0, ū − 3√(ū/n)),  UCL = ū + 3√(ū/n).   (14.32)

Clearly, the u chart is somewhat similar to the p chart, and the c chart is somewhat
similar to the np chart.
References
1. Kotz S, Johnson NL (1993) Process capability indices. Chapman and Hall, New York, London
2. MacGregor JF, Kourti T (1995) Statistical process control of multivariate processes. Control Eng Pract 3(3):403–414
3. Montgomery DC (2007) Introduction to statistical quality control, 4th edn. Wiley, New York
4. Pearn WL, Chen KS (1999) Making decisions in assessing process capability index Cpk. Qual Reliab Eng Int 15(4):321–326
5. Porter LJ, Oakland JS (1991) Process capability indices: an overview of theory and practice. Qual Reliab Eng Int 7(6):437–448
6. Vinzi VE, Russolillo G (2013) Partial least squares algorithms and methods. Wiley Interdiscip Rev: Comput Stat 5(1):1–19
Chapter 15
Quality Control at Output
15.1 Introduction
The reliability of a manufactured product usually differs from its design reliability
due to various quality variations such as nonconforming components and assembly
errors. These variations lead to a relatively high early failure rate. Quality control at
output mainly deals with quality inspections and screening tests of components
and final products. The purpose is to identify and remove defective items before they
are released for sale.
An issue with product quality inspection is to classify the inspected product into
several grades based on the quality characteristics. The partitions between adjacent
grades can be optimized to achieve an appropriate tradeoff between manufacturing cost
and quality cost. This problem is called the optimal screening limit problem.
Two widely used screening tests are burn-in and environmental stress screening.
Such tests are required for products with high reliability requirements. The tests
are generally expensive, and the losses from field failures caused by defective items
are usually high. As such, the test costs and field losses must be properly traded off.
This involves optimization of the test scheme.
The outline of this chapter is as follows. Section 15.2 deals with the optimal
screening limit problem, and Sect. 15.3 deals with relevant concepts of screening
tests. Optimization models for component-level burn-in and system-level burn-in are
discussed in Sects. 15.4 and 15.5, respectively.
Consider the problem where the product items are classified into three grades
(acceptable, reworked, and scrapped) based on a single variable Y, which is highly
correlated with the quality characteristic of interest. Assume that Y follows the
normal distribution with mean μ and standard deviation σ, and has a target value
T. It is easy to adjust the process mean to the target value, so we assume μ = T.
Let d denote the screening limit, which is the decision variable. The manufactured
products are classified into the following three grades:
- acceptable if y ∈ [T − d, T + d],
- scrapped if y < T − d, and
- reworked if y > T + d.
Clearly, a small d results in more items being screened out as nonconforming. This
is why d is called the screening limit.
Consider two categories of costs: the manufacturing-related cost before the sale and
the quality loss after the product is delivered to the customer. As d decreases, the
manufacturing cost per sold item increases and the quality loss decreases. As such,
an optimum screening limit exists at which the expected total cost per sold item
achieves its minimum.
The manufacturing-related costs include three parts: the raw material cost c_m, the
production cost c_p, and the inspection cost c_I. Generally, these cost elements are constant
for a given manufacturing process. As such, the total manufacturing cost per
manufactured item is given by C_M = c_m + c_p + c_I.
The quality loss can be written as

c_q = K v,   (15.1)

where K is a constant and v is the variance of the doubly truncated normal distribution with support y ∈ [μ − d, μ + d] (Eq. 15.2). The expected total cost per
manufactured item is

C_T(d) = C_M + p_s c_s + p_r (c_p + c_I) + p_a K v,   (15.3)

where p_s, p_r and p_a are the probabilities that an item is scrapped, reworked and
accepted, respectively, and c_s is the scrap cost. The expected total cost per sold item is

J(d) = C_T(d)/P(d),   (15.4)

where P(d) is the probability that a manufactured item is sold.
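The following Python sketch optimizes J(d) under the model above, taking P(d) = p_a and writing c_rework for c_p + c_I; the cost values are hypothetical placeholders chosen only to exercise the code, not the values behind the example below.

```python
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def cost_rate(d, sigma, CM, cs, c_rework, K):
    z = d / sigma
    pa = 2 * norm.cdf(z) - 1   # accepted
    ps = norm.cdf(-z)          # scrapped; equals the rework probability here
    pr = ps                    # symmetric limits with the mean at the target
    # variance of the normal distribution truncated to [T - d, T + d], cf. Eq. (15.2)
    v = sigma ** 2 * (1 - 2 * z * norm.pdf(z) / pa)
    CT = CM + ps * cs + pr * c_rework + pa * K * v  # Eq. (15.3)
    return CT / pa                                  # Eq. (15.4), taking P(d) = pa

res = minimize_scalar(cost_rate, bounds=(1.0, 50.0), method="bounded",
                      args=(10.0, 1200.0, 300.0, 1100.0, 2.0))
print(res.x, res.fun)
```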
Using the approach outlined above, we obtain the optimal screening limit
d* = 16.0. The corresponding expected total cost per sold product item is J* =
1787.49, and the probability that a manufactured product item is scrapped is
p_s = 5.49 %.
If the scrap probability is considered too large, the manufacturing process
has to be improved to reduce the value of σ by using high-precision equipment and
machines. This will result in an increase of c_p. Assume that σ decreases from 10 to
8, and c_p increases from 1000 to 1100. The optimal solution is now d* = 14.9
with J* = 1774.46 and p_s = 3.11 %. Since both J* and p_s decrease, the improvement
is worthwhile.
Burn-in and environmental stress screening (ESS) are two typical screening tests for
electronic products. An electronic product is usually decomposed into three hierarchical levels: part (or component), unit (or assembly or subsystem), and
system (or product). We use "item" to represent any of them when we do not need
to differentiate the hierarchy level.
E(T; z) = 1/{λ[1 − G(x)]} = 1/{λ[1 − G(u(z))]}.   (15.6)

Equation (15.6) relates the defect size to the mean lifetime under the normal use
condition.
For a given screening test, the time to latent failure determines the detectability
of the part. Figure 15.1 shows the relations between defect size, strength, lifetime,
and detectability. As seen from the figure, a large defect is more likely to be patent
and can be detected by a functional test; a small defect is more likely to be latent and
can be transformed into a patent defect by ESS; a defect of intermediate size can
be either patent or latent and can be detected by burn-in.
The latent defects affect the failure pattern. If there are no latent defects, the
failure rate of a population of items can be represented by the classic bathtub curve.
If some of the items contain latent defects, they may fail in the normal use phase
under excessive stress conditions. Such failures result in jumps in the failure rate
curve. Figure 15.2 shows the failure rate curve obtained by superposing the classic bathtub
failure rate curve and the failure rate curve resulting from latent defects. In the
literature, this superposed failure rate curve is called the roller coaster curve (e.g.,
see Refs. [7, 8]).
[Fig. 15.1: relation between defect size, x = u(z), and expected life E(T).]
[Fig. 15.2: roller coaster failure rate curve; the burn-in period, PM and ESS regions are indicated.]
15.3.2 Burn-in
Burn-in is a test to expose defects of items or their components and screen
out those items with defects, in order to prevent early product failures. It is usually
applied to items with a high initial failure rate resulting from defective parts
and quality variations due to assembly-related problems. Typical assembly problems include component damage and component connection defects. As such,
burn-in can be used at the component level and the system level. Component-level burn-in
is often done by component suppliers to identify and eliminate defective components, and system-level burn-in is done by the manufacturer to remove component
defects and assembly defects.
The test conditions are application-specic. To accelerate the process, burn-in
can be conducted under relatively harsh environments. Burn-in of electronic
components is usually conducted at elevated temperature and/or voltage.
Figure 15.3 shows a typical temperature stress cycle used in burn-in.
The tested items usually operate for a fixed time period (called the burn-in period).
Any item that survives the burn-in is released for sale. If the product is
repairable, failures during burn-in are rectified and the item is tested again until it survives
the burn-in period. Burn-in incurs cost and consumes a part of the useful lifetime, but
can lead to lower field cost due to enhanced product reliability after burn-in. One of
the major problems with burn-in is to optimally determine the burn-in period for a
given criterion such as cost, reliability, or their combination.
The reliability measure of the burnt-in product item can be the survival probability
over a prespecified mission time (e.g., warranty period or planning horizon) or the
mean residual life. An age-based preventive replacement policy for the burnt-in
product can be implemented to further reduce the total costs. Jiang and Jardine [4]
simultaneously optimize the burn-in duration and the preventive replacement age
based on the cost rate, which considers both cost and mean residual life.
Most products are sold with warranty. Breakdown of a burnt-in item within the
warranty period causes warranty claims, which incur warranty costs. A balance
between burn-in costs and warranty costs can be achieved by minimizing the sum
of the two, as sketched below.
[Figure 15.3: a typical burn-in temperature stress cycle, alternating between the operational condition and a low temperature condition over time t.]
It is noted that eliminating the root cause of early failures, where possible, is better than doing a burn-in. As the various root causes of failures are identified and eliminated, burn-in may eventually no longer be needed. Block and Savits [1] present a literature review on burn-in.
15.3.3 Environmental Stress Screening

ESS is a process for accelerating the aging of latent defects by applying excessive stress without damaging the items [2]. The intensity and magnitude of the shocks produced by an ESS test can be sufficiently large, implying that λ is large and G(u(z)) is small in Eq. (15.6) so that the time to latent failure will be small; that is, ESS can be effective.
A key issue with ESS is to appropriately determine the types and ranges of the stresses (or shocks) to be applied. Typical stresses used in ESS are thermal stress, vibration, and shock. The thermal stress tests include the low temperature test, high temperature test, temperature cycling tests, and thermal shock test. Temperature cycling tests simulate a varying-temperature operating environment. The thermal shock test quickly changes the temperature by moving the tested item from one temperature environment to another. Vibration testing can be random vibration or sine vibration, and may be carried out on a single axis or on three mutually perpendicular axes. Random vibration testing can excite all resonant frequencies throughout the entire test and hence is preferred. Shock tests include the mechanical shock test and power cycling. A typical shock test simulates the stresses resulting from handling, transportation, and operation by applying five shock pulses at a selected peak acceleration level in each of the six possible orientations. Power cycling is implemented by turning the power on and off at predetermined intervals. Other extreme environments that ESS tests can simulate include high altitude, high voltage, humidity, salt spray, sand, dust, and so on. Some ESS tests can simulate two or more environments at a time.
ESS exposes defects by fatiguing weak or marginal mechanical interfaces [2]. Since fatigue is the result of repeated stress reversals, ESS usually applies stress cycles (e.g., thermal cycling, on-off cycling, and random vibration) to produce such stress reversals. Generally, temperature cycling, random vibration, and their combination are the most effective screening processes for electronic assemblies.
Both ESS and burn-in emphasize reducing early field failures. Generally, burn-in powers a product for a much longer time at an operating or accelerated stress condition. In contrast, ESS is generally conducted under accelerated conditions to stress a product for a limited number of stress cycles, and functional testing is needed to verify that the product is functioning after ESS testing. As such, the main differences between them are as follows (e.g., see Refs. [9, 11]):
(a) the tested item is powered for burn-in and stressed for ESS,
(b) the stress levels used for burn-in are usually lower than the stress levels used for ESS, and
(c) the test duration is from several hours to a few days for burn-in and from several minutes to a few hours for ESS.
Generally, ESS is more effective in screening out stress-dependent defects, which result in overstress failures, but it is less effective in screening out defects caused by time- or usage-dependent failure modes. Conversely, burn-in can screen out the time/usage-dependent defects and provides useful information for predicting the reliability performance of the product.
ESS and burn-in can be combined to reduce the burn-in time. For example, a two-level ESS-burn-in policy (see Ref. [10]) combines a part-level ESS and a unit-level burn-in. Under this policy, all parts are subjected to an ESS, and the parts passing the part-level screen are used in the unit. Then, all units are burned in, and the units passing burn-in are used in the final system, for which there is no burn-in or ESS.
We assume that the burnt-in item will be put into operation with a mission time L, which can be a warranty period or a planning horizon. Let R_b(L) denote the survival probability of this burnt-in item over the mission time. Assume that a breakdown cost c_f is incurred if the item fails within the mission time. As such, the field failure cost is given by
$$C(L) = [1 - R_b(L)]\,c_f. \qquad (15.12)$$
The probability that the item will pass the test is R(τ). To be concise, we simply write it as R and let F = 1 − R. Let K denote the number of repairs before the item passes the test. The probability that the item passes the test after k repairs is given by
$$p_k = F^k R, \quad k = 0, 1, 2, \ldots. \qquad (15.13)$$
$$\bar{b} = \frac{1}{F}\int_0^\tau x f(x)\,dx = \frac{p}{F}\int_0^\tau x f_c(x)\,dx + \frac{q}{F}\int_0^\tau x f_n(x)\,dx. \qquad (15.14)$$
Zs " #
s b
xgxdx lGa ; 1 1=b; 1 15:15
g
0
where G_a(·) is the gamma cdf. Using Eq. (15.15), the mean test time b̄ can be expressed in terms of the gamma cdf. The expected test time for an item to pass the test is given by
$$\bar{T} = \bar{b}(\bar{n} - 1) + \tau \qquad (15.16)$$
where n̄ = 1/R is the mean number of test runs.
Let c₁ denote the test cost per unit test time and c₂ denote the mean cost of each repair. The total burn-in cost is given by
$$C_B = c_1\bar{T} + c_2(\bar{n} - 1). \qquad (15.17)$$
The objective function is the sum of the field failure cost and the total burn-in cost, given by
$$J(\tau) = C(L) + C_B. \qquad (15.18)$$
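To make the optimization concrete, the following is a minimal computational sketch of Eqs. (15.12)-(15.18). The mixture life distribution, the cost values, and the approximation R_b(L) = R(τ + L)/R(τ) for the survivor's mission reliability are illustrative assumptions, not values from the text; scipy is assumed to be available.

# A sketch of the burn-in cost model of Eqs. (15.12)-(15.18) with a mixture
# life distribution F = p*Fc + (1-p)*Fn (defective/normal subpopulations).
# All parameter values are illustrative assumptions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import weibull_min

p = 0.1                                  # defective fraction
Fc = weibull_min(0.8, scale=1.0)         # defective items: early failures
Fn = weibull_min(2.5, scale=10.0)        # normal items
cf, c1, c2, L = 50.0, 0.2, 1.0, 5.0      # field cost, test cost rate, repair cost, mission time

def F(t):
    return p * Fc.cdf(t) + (1 - p) * Fn.cdf(t)

def pdf(t):
    return p * Fc.pdf(t) + (1 - p) * Fn.pdf(t)

def J(tau):
    R = 1 - F(tau)                       # probability of passing the test
    Rb = (1 - F(tau + L)) / R            # survivor's mission reliability (approximation)
    CL = (1 - Rb) * cf                   # field failure cost, Eq. (15.12)
    b, _ = quad(lambda x: x * pdf(x), 0, tau)
    b = b / F(tau) if F(tau) > 0 else 0.0   # mean test time of a failed item, Eq. (15.14)
    n = 1 / R                            # mean number of test runs
    T = b * (n - 1) + tau                # expected test time, Eq. (15.16)
    CB = c1 * T + c2 * (n - 1)           # total burn-in cost, Eq. (15.17)
    return CL + CB                       # objective, Eq. (15.18)

taus = np.linspace(0.01, 3.0, 60)
tau_star = min(taus, key=J)
print(f"tau* = {tau_star:.2f}, J(tau*) = {J(tau_star):.3f}")

A grid search is sufficient here because J(τ) is smooth and one-dimensional; in practice the grid range would be guided by the shape of the failure rate curve.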
[Figure: the cost curve C(τ) and the failure rate r(t) (in units of 10⁻⁴) plotted versus t over (0, 200).]
[Figure: the mission reliability R_b(L, τ) plotted versus τ over (0, 200).]
Table 15.2 shows the mission reliabilities and field failure probabilities before and after burn-in. The last row shows the relative reductions in field failure after burn-in. As seen, the reduction is significant, and the performances obtained from the two burn-in schemes are close to each other. This implies that the burn-in period can be any value within (29.42, 41.92).
The TTT approach appears to be more practical. We consider the latter approach and present the reliability and cost models as follows.
The reliability function of the system is given by
$$R(t) = \prod_{i=1}^{n} R_{Pi}(t), \quad R_{Pi}(t) = 1 - F_{Pi}(t). \qquad (15.20)$$
After the burn-in, the reliability function at the ith position is given by
$$R_{Bi}(x) = \frac{R_i^0(x + \tau)}{R_i^0(\tau)}\{1 - q_i[G_i(x + \tau) - G_i(\tau)]\}, \quad x \geq 0. \qquad (15.21)$$
Ri si
Y
n
RL; s RBi L: 15:22
i1
The cost of an item consists of the burn-in cost and the field operational cost. The burn-in cost consists of a component-level cost and a system-level cost. The component-level cost includes the component replacement cost and the connection repair cost. For Component i, the replacement cost is given by
$$C_{ri} = c_{ri} M_i(\tau)$$
where c_{ri} is the cost per replacement and M_i(t) is the renewal function associated with F_i(t). Assume that the repair of a connection failure is perfect (with a cost of c_{mi}) so that the connection failure for each component occurs at most once. As such, the connection repair cost is given by
$$C_{mi} = c_{mi} q_i G_i(\tau).$$
The system-level cost is the burn-in operational cost. Assume that the operational cost per unit time is a constant c₀. As such, the system-level cost is given by
$$C_s = c_0\tau. \qquad (15.26)$$
We now look at the field failure cost, which is given by Eq. (15.12) with R_b(L) given by Eq. (15.22). It is usually assumed that the cost of a field failure, c_f, is four to six times the actual repair cost, to reflect intangible losses such as reputation cost.
As a result, the total cost for each item is given by
$$J(\tau) = \sum_{i=1}^{n}(C_{ri} + C_{mi}) + c_0\tau + c_f[1 - R_b(L)]. \qquad (15.27)$$
Chapter 16
Product Warranty
16.1 Introduction
Product support (also known as customer support or after-sales support) deals with
product service, including installation, maintenance, repair, spare parts, warranty,
eld service, and so on. Product support plays a key role in the marketing of
products and the manufacturer can obtain prots through product servicing (e.g.,
provision of spare parts and maintenance servicing contracts).
Product warranty is a key part of product support. In this chapter we focus on
product warranty-related issues, including warranty policies, warranty cost analysis,
and warranty servicing.
The outline of the chapter is as follows. We start with a discussion of product
warranties in Sect. 16.2. Typical warranty policies are presented in Sect. 16.3.
Reliability models in warranty analysis are presented in Sect. 16.4, and warranty
cost analysis is dealt with in Sect. 16.5. Finally, related issues about warranty
servicing are discussed in Sect. 16.6.
servicing costs to the manufacturers, and hence reducing the warranty cost becomes of great importance to the manufacturers.
The expected warranty costs depend on the warranty requirements and the associated maintenance actions, and can be reduced through reliability improvement, product quality control, and adequate maintenance decisions within the warranty period.
Warranty policies can be classified in different ways. According to the definition of the warranty period, a warranty policy can be one- or two-dimensional. One-dimensional warranty policies are characterized by a warranty period, which is usually a time interval on the item's age. In contrast, two-dimensional warranty policies are characterized by a region on the two-dimensional plane, where the axes represent the age and the usage of the item. For vehicles, the usage is represented in terms of mileage.
According to whether or not the warranty period is fixed, a warranty policy can be nonrenewable or renewable. Renewable warranty policies are usually associated with the replacement of a failed item.
According to whether the warranty is an integral part of the product sale, warranty policies can be divided into base warranty (or standard warranty) and extended warranty (also called a service contract). An extended warranty is optional for the customer and is not free.
In terms of the cost structure of the warranty, a warranty policy can be simple or combined, where two simple policies are combined.
Depending on the type of product, warranty policies can be for consumer durables, commercial and industrial products, or defense products. The buyers of these products are individuals, organizations, and governments, respectively. When the buyers are organizations or governments, products are often sold in lots. This leads to a special type of warranty policy: cumulative warranty policies. For defense products, a specific reliability performance may be required. In this case, development effort is needed, and this leads to another special type of warranty policy: reliability improvement warranties.
In this subsection we present several typical warranty policies, which are applicable
for all types of products.
This policy is usually called the free replacement warranty (FRW), which is widely used for consumer products. Under this policy, the manufacturer agrees to repair or provide replacements for failed items free of charge up to a time W (the warranty period) from the time of the initial purchase.
This policy is one-dimensional and nonrenewing. The word replacement does not imply that failed items are always rectified by replacement. In fact, it is common to restore the failed item to an operational state by repair, especially by minimal repair.
This policy is usually called the pro-rata rebate warranty (PRW). Under this policy, the manufacturer agrees to refund a fraction of the purchase price when the item fails before time W from the time of the initial purchase. The refund depends on the age of the item at failure, X, and is a decreasing function of the remaining warranty time W − X. Let q(x) denote this refund function. A typical form of q(x) is
$$q(x) = \alpha c_b\left(1 - \frac{x}{W}\right) \qquad (16.1)$$
where α ∈ (0, 1], c_b is the unit sale price, and x is the age of the failed item.
When the first failure occurs and a fraction of the purchase price is refunded, the warranty expires. In other words, this policy expires at the time when the first failure occurs within the warranty period or at W. This policy is applicable for non-repairable products.
Under this policy, the warranty period is divided into two intervals: (0, W₁) and (W₁, W). If a failure occurs in the first interval, the FRW policy is implemented; if the failure occurs in the second interval, the PRW policy is implemented and the refund is calculated by
$$q(x) = \alpha c_b\left(1 - \frac{x - W_1}{W - W_1}\right). \qquad (16.2)$$
$$y = U(1 - x/W). \qquad (16.3)$$
[Figure: the two-dimensional warranty region in the age (x)-usage (y) plane.]
In addition to the policies discussed above, four special warranty policies that are
widely used for commercial and industrial products are one-dimensional cumulative
FRW, extended warranties, PM warranty, and reliability improvement warranties.
We briefly discuss them as follows.
Under a PM warranty policy, any product failure is rectified by minimal repair, and additional PM actions are carried out within the warranty period.
When the warranty period is relatively long (e.g., when the warranty covers the whole life of the product), the manufacturer needs to optimize the PM policies. Often, the burn-in and the PM are jointly optimized to reduce the total warranty servicing costs (e.g., see Ref. [8]).
All the policies discussed above are also applicable for defense products. A special policy associated with defense products is the reliability improvement warranty, which provides guarantees on the reliability (e.g., MTBF) of the purchased equipment. Under this policy, the manufacturer agrees to repair or provide replacements free of charge for any failed parts or items until time W after purchase. In the meantime, the manufacturer also guarantees the MTBF of the purchased equipment to be at least a certain level M. If the evaluated or demonstrated MTBF is smaller than M, the manufacturer will make design changes to meet the reliability requirements at its own cost.
The terms of reliability improvement warranties are negotiated between the
manufacturer and buyer, and usually include an incentive for the manufacturer to
increase the reliability of the products after they are put into service. The incentive
is an increased fee paid to the manufacturer if the required reliability level has been
achieved.
In this section, we discuss the reliability models that are needed in warranty
analysis.
$$F(t) \leq M(t) \leq H(t) = -\ln R(t) \qquad (16.5)$$
where H(t) is the cumulative hazard function. Generally, the renewal function can be approximated by [4]:
$$M(t) \approx F(t) + \sum_{i=2}^{N}\Phi\!\left(\frac{t - i\mu}{\sqrt{i}\,\sigma}\right) \qquad (16.6)$$
$$M(t) \approx F(t) + \sum_{i=2}^{N} G_a\!\left(t;\, i\left(\frac{\mu}{\sigma}\right)^2,\, \frac{\sigma^2}{\mu}\right) \qquad (16.7)$$
where G_a(t; u, v) is the gamma cdf with shape parameter u and scale parameter v.
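As an illustration of Eq. (16.7), the following sketch evaluates the Weibull renewal function by the gamma-series approximation. The truncation level N, the parameter values, and the use of scipy are illustrative assumptions.

# Gamma-series approximation of the Weibull renewal function, Eq. (16.7):
# M(t) ~ F(t) + sum_{i=2}^N Ga(t; i*(mu/sigma)^2, sigma^2/mu).
from math import gamma as gamma_fn
from scipy.stats import gamma, weibull_min

def weibull_renewal_fn(t, beta, eta, N=50):
    mu = eta * gamma_fn(1 + 1 / beta)                       # Weibull mean
    var = eta**2 * (gamma_fn(1 + 2 / beta) - gamma_fn(1 + 1 / beta)**2)
    sigma = var**0.5
    m = weibull_min.cdf(t, beta, scale=eta)                 # F(t), the i = 1 term
    for i in range(2, N + 1):
        # the i-fold convolution is approximated by a gamma cdf with
        # matched mean (i*mu) and variance (i*sigma^2)
        m += gamma.cdf(t, a=i * (mu / sigma)**2, scale=var / mu)
    return m

# e.g., M(W) for the product of Example 16.3 (W = 2, beta = 2.5, eta = 5)
print(weibull_renewal_fn(2.0, beta=2.5, eta=5.0))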
For the Weibull distribution with t ≤ η, we have the following approximation [5]:
where F_w(x; a, b) is the Weibull cdf (with shape parameter a and scale parameter b) evaluated at x. In the warranty analysis, the warranty period W is usually smaller than the characteristic life, so Eq. (16.8) is sufficiently accurate.
The failure of a system is often due to the failure of one or more of its components. At each system failure, the number of failed components is usually small relative to the total number of components in the system. The system is restored to its working state by repairing or replacing these failed components. Since most of the system's components are not repaired or replaced, this situation is equivalent to a minimal repair.
Let F(t) denote the distribution function of the time to the first failure, T_i denote the time to the ith failure, and F_i(t), t > t_{i−1}, denote the distribution of T_i for the repaired item. When a failed item is subjected to a minimal repair, the failure rate of the item after the repair is the same as the failure rate of the item immediately before it failed. In this case, we have
$$F_i(t) = \frac{F(t) - F(t_{i-1})}{1 - F(t_{i-1})}, \quad t > t_{i-1}. \qquad (16.9)$$
If the item is not subjected to any PM actions and all repairs are minimal, then the system failures can be modeled by a point process. Let N(t) denote the number of minimal repairs in (0, t). N(t) follows the Poisson distribution with the MCF given by [6]
$$M(t) = H(t) = -\ln R(t). \qquad (16.10)$$
It is noted that the variance of the Poisson distribution equals its mean. Therefore, the variance of N(t) is equal to M(t).
Specially, when F(t) is the Weibull cdf, we have
$$M(t) = (t/\eta)^\beta. \qquad (16.11)$$
PM actions can affect both the first and subsequent failures, and a PM is usually viewed as an imperfect repair. As such, the effect of a PM on the reliability improvement can be modeled by an imperfect maintenance model. Several specific imperfect maintenance models are outlined as follows.
Virtual age models are widely used to model the effect of PM on reliability improvement [10]. Suppose that a periodic PM scheme is implemented at t_i = iτ, where τ is the PM interval. A failure is rectified by minimal repair, which does not change the failure rate. Let v_i denote the virtual age after the ith PM. The virtual age Model I assumes that each PM reduces the virtual age of the product by a fraction of the previous PM interval length τ, i.e., ατ, where α is a number between 0 and 1 called the degree of restoration. When α = 0, the PM can be viewed as a minimal repair, and when α = 1 the PM is equivalent to a perfect repair. As such, we have
$$v_i = v_{i-1} + (1 - \alpha)\tau = i(1 - \alpha)\tau. \qquad (16.12)$$
It is noted that the actual age at the ith PM is t_i = iτ, the virtual age just before the ith PM is v_i⁻ = v_{i−1} + τ, and the virtual age just after the ith PM is v_i. As a result, the failure rate reduction due to the ith PM is given by Δr_i = r(v_i⁻) − r(v_i), and the failure rate grows according to r(t − t_i + v_i) rather than according to r(t). In other words, the effect of PM is twofold:
(a) the current failure rate gets reduced, and
(b) the growth of the failure rate gets slowed down.
Given the distribution of the time to the first failure, F(t), and the parameter of virtual age Model I (i.e., α), the conditional distribution function after the ith PM performed at t_i is given by
$$F_i(t) = 1 - \frac{1 - F(t - t_i + v_i)}{1 - F(v_i)}, \quad t \geq t_i. \qquad (16.13)$$
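The following sketch simulates the failure process under a periodic PM scheme with virtual age Model I, using Eqs. (16.12) and (16.13). The Weibull parameters, α, and the function name are illustrative assumptions.

# Counting minimal repairs between periodic PMs under virtual age Model I:
# the ith PM at t_i = i*tau resets the virtual age to v_i = i*(1-alpha)*tau,
# Eq. (16.12); failure times follow the conditional cdf of Eq. (16.13).
import random, math

def failures_under_model_I(beta, eta, tau, alpha, n_pm, seed=3):
    random.seed(seed)
    H = lambda t: (t / eta)**beta           # cumulative hazard of F(t)
    count = 0
    for i in range(n_pm):                   # interval (t_i, t_{i+1}]
        v = i * (1 - alpha) * tau           # virtual age after the ith PM
        a = v
        while True:
            # next failure by inverting exp(H(a) - H(a + x)) = u
            u = random.random()
            a = eta * (H(a) - math.log(u))**(1 / beta)
            if a > v + tau:                 # failure falls beyond the next PM
                break
            count += 1                      # minimal repair; virtual age keeps growing
    return count

print(failures_under_model_I(beta=2.5, eta=10.0, tau=5.0, alpha=0.5, n_pm=10))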
The virtual age Model II assumes that each PM reduces the virtual age of the product by a fraction of v_i⁻, i.e., α(v_{i−1} + τ). This implies that a PM under the virtual age Model II yields a larger reduction in virtual age than under Model I for the same value of α. Therefore, Model II is often used to represent the effect of an overhaul.
The virtual age just after the ith PM is given by
$$v_i = \tau\sum_{j=1}^{i}(1 - \alpha)^j = \frac{(1 - \alpha)[1 - (1 - \alpha)^i]}{\alpha}\tau. \qquad (16.14)$$
Canfield [3] introduces a PM model to optimize the PM policy during or after the warranty period. Let τ denote the PM interval and δ ∈ [0, τ] denote the level of restoration of each PM. A minimal repair has δ = 0, a perfect repair has δ = τ, and an imperfect repair has 0 < δ < τ. Clearly, δ has the same time unit as t.
The model assumes that the ith PM only slows down the growth of the failure rate and does not change the value of the current failure rate. As such, the failure rate after the ith PM is given by
$$r_i(t) = r(t - i\delta) + c_i \qquad (16.15)$$
where c_i is determined by the continuity condition
$$r_i(t_i) = r_{i-1}(t_i). \qquad (16.16)$$
$$c_i = c_{i-1} + \Delta_i = \sum_{j=1}^{i}\Delta_j \qquad (16.17)$$
Figure 16.2 shows the plots of M(t) versus t for the three PM models. As seen, the improvement effect associated with the virtual age Model II is the largest, and the improvement effect associated with the Canfield model is the smallest. As such, a PM with a large maintenance effort (e.g., an overhaul) can be represented by the virtual age Model II.
[Figure 16.2: M(t) versus t for the Canfield model, Model I, and Model II.]
For a given problem, both can be used as candidates, and the selection goes to the candidate with the smaller coefficient of variation. As such, the reliability model can be represented by the distribution of the random variable Y.
The third approach is to combine the two scales into the usage rate given by
$$\rho = u/t. \qquad (16.19)$$
For the population of a product, the usage rate ρ is a random variable and can be represented by a distribution G(x), where x represents ρ. For a specific item, it is usually assumed that the usage rate is a constant.
Consider the two-dimensional nonrenewing FRW with parameters W and U. Let ρ₀ = U/W and p₀ = G(ρ₀). It is clear that the warranty of a sold item with usage rate ρ < ρ₀ [ρ ≥ ρ₀] will expire at [before] t = W. The proportion of items whose warranty expires at t = W is 100p₀ %.
[Fig. 16.3: a simulated failure-repair process, N(t) versus t, with minor failures marked.]
For a repairable product, the time to the first failure can be represented by a distribution function F(t). A failure can be minor (type I failure) or major (type II failure). A minor failure is rectified by a minimal repair, and a major failure is rectified by an imperfect or perfect repair.
Let p(t) [q(t) = 1 − p(t)] denote the conditional probability that the failure is minor [major] given that a failure occurs at age t. Generally, a failure is more likely to be minor [major] when the age is small [large]. This implies that p(t) [q(t)] decreases [increases] with age. However, p(t) can be non-monotonic if there are early failures due to manufacturing quality problems. As such, the failure and repair process is characterized by F(t) and p(t). The characteristics of this process can be studied using simulation.
The simulation starts with an initial age a (= 0). Then, a random life x is generated according to F(t) conditional on survival to age a. The age of the item becomes a + x. The failure type is then simulated according to p(a + x), and the initial age is updated accordingly. We illustrate this approach as follows:
Example 16.2 Assume that the lifetime of a product follows the Weibull distribution with shape parameter 2.5 and scale parameter 10, and p(t) = e^{−t/8}. Further, we assume that a minor [major] failure is rectified by a minimal repair [replacement]. Using the approach outlined above, a failure-repair process is generated and displayed in Fig. 16.3. As seen, 10 of the 30 failures are major. The MTBF of this process is 3.56, which is much smaller than the MTTF = 8.87.
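A minimal simulation sketch of this process, using the parameter values of Example 16.2 (the implementation details are assumptions), is:

# Simulating the minor/major failure-repair process of Example 16.2:
# Weibull(2.5, 10) lifetimes, p(t) = exp(-t/8); minor failures get minimal
# repair (age unchanged), major failures get replacement (age reset).
import random, math

def simulate(horizon=120.0, beta=2.5, eta=10.0, seed=1):
    random.seed(seed)
    t, a, events = 0.0, 0.0, []          # t: calendar time, a: current item age
    while True:
        # residual life drawn from F conditional on survival to age a
        u = random.random()
        x = eta * ((a / eta)**beta - math.log(u))**(1 / beta) - a
        t += x
        if t > horizon:
            break
        a += x
        major = random.random() > math.exp(-a / 8.0)   # q(a) = 1 - p(a)
        events.append((t, 'major' if major else 'minor'))
        if major:
            a = 0.0                       # replacement renews the item
    return events

events = simulate()
print(len(events), 'failures,', sum(e[1] == 'major' for e in events), 'major')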
The manufacturer incurs various costs for the rectification of failed items under warranty and is interested in forecasting the warranty cost of the product for a given warranty policy. The warranty coverage may be changed since it can affect buying decisions. In this case, the manufacturer needs to estimate the warranty cost under the new warranty policy. These problems are dealt with by warranty cost analysis.
The outcomes of warranty cost analysis include the expected warranty cost per unit sale, the expected total costs over a given planning horizon L for the manufacturer and the buyer, and the profit of the manufacturer. In this section, we analyze these costs for three typical warranty policies: the one-dimensional FRW and PRW as well as the two-dimensional FRW.
For a non-repairable product, any failure during the warranty period is rectified by replacing the failed item with a new one. Failures over the warranty period occur according to a renewal process.
Let c_s denote the cost of replacing a failed item and c_b denote the sale price. The expected warranty cost per item to the manufacturer is given by
$$C(W) = c_s[1 + M(W)]. \qquad (16.20)$$
The relative warranty servicing cost is
$$r_w = c_s M(W)/c_b. \qquad (16.21)$$
The expected profit per item is
$$C_p = c_b - C(W). \qquad (16.22)$$
We now look at the costs of the manufacturer and the customer over a given planning horizon L. The sold item will be renewed by the customer at the first failure after W, when the expected number of renewals is M(W) + 1. Under the assumption μ ≪ L, we have M(t) ≈ t/μ. As such, the expected renewal cycle length E(T) can be obtained, and the expected number of cycles in the planning horizon is
$$\bar{n} = \frac{L}{E(T)} + 1. \qquad (16.24)$$
As such, the user's total cost in the planning horizon is given by n̄c_b, the manufacturer's total cost in the planning horizon is given by n̄C(W), and its total profit is given by n̄C_p.
Example 16.3 Assume that the lifetime of a product follows the Weibull distribution with shape parameter 2.5 and scale parameter 5 years. The warranty period is W = 2 years. The servicing and sale costs are 800 and 1000, respectively. The planning horizon is 10 years. The problem is to estimate the related costs.
We compute the renewal function using Eq. (16.8). The results are shown in Table 16.1. As seen, the warranty servicing cost is about 7.8 % of the sale price.
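As a rough check (not given in the text), the bound M(W) ≥ F(W) of Eq. (16.5) already reproduces this figure:
$$F(2) = 1 - e^{-(2/5)^{2.5}} \approx 0.096, \qquad r_w \approx \frac{800 \times 0.097}{1000} \approx 7.8\,\%,$$
since the higher-order renewal terms are negligible when W is well below the characteristic life.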
For a repairable product, all failures over the warranty period are usually minimally repaired. In this case, the expected number of minimal repairs over the warranty period is given by Eq. (16.10). Let c_m denote the cost of each repair. The expected warranty cost per item to the manufacturer is given by
$$C(W) = c_s + c_m M(W). \qquad (16.26)$$
For a given planning horizon L, the expected length of a renewal cycle depends on the replacement decision. Models for determining the optimal stopping time of a minimal repair process can be found in Ref. [6].
Under the PRW policy, the time to the first failure (i.e., the warranty expiration time) is a random variable. Conditional on X = x, the manufacturer's cost is given by
$$c(x) = c_s + q(x), \quad x \leq W. \qquad (16.27)$$
The expected warranty cost per item is
$$C(W) = \int_0^W c(x)f(x)\,dx + R(W)c_s. \qquad (16.28)$$
The expected cost to the customer is
$$C_u(W) = c_b - \int_0^W [c(x) - c_s]f(x)\,dx. \qquad (16.29)$$
$$C_p = c_b - C(W). \qquad (16.30)$$
For the refund function given by Eq. (16.1), we have
$$\int_0^W c(x)f(x)\,dx + R(W)c_s = c_s + \alpha c_b F(W) - \frac{\alpha c_b}{W}\mu_W \qquad (16.31)$$
where
$$\mu_W = \int_0^W x\,dF(x) = \mu\, G_a\!\left(H(W);\, 1 + 1/\beta,\, 1\right). \qquad (16.32)$$
Assume that the planning horizon L ≫ μ. The expected number of renewals in the planning horizon is then given by
$$\bar{n} = L/\mu. \qquad (16.33)$$
As such, the user's total cost in the planning horizon is given by n̄C_u(W), the manufacturer's total cost in the planning horizon is given by n̄C(W), and its total profit is given by n̄C_p.
We use the usage rate approach to analyze the related costs under this policy. Assume that the usage rate given by Eq. (16.19) is a random variable represented by a distribution G(x), x ∈ (0, ∞). Further, assume that any failure is rectified by minimal repair. Let
$$\rho_0 = U/W. \qquad (16.34)$$
Assume that the life at usage rate ρ₀ follows the Weibull distribution with shape parameter β and scale parameter η₀. Since the usage rate acts like an acceleration factor, we assume that the life at usage rate ρ follows the Weibull distribution with shape parameter β and scale parameter η given by
$$\eta = \frac{\rho_0}{\rho}\eta_0. \qquad (16.35)$$
Conditional on the usage rate ρ, the expected number of minimal repairs over the warranty region is n(ρ) = a₁ρ^β for ρ ≤ ρ₀ and n(ρ) = a₂ for ρ > ρ₀, where
$$a_1 = \left(\frac{W}{\rho_0\eta_0}\right)^\beta, \qquad a_2 = \left(\frac{U}{\rho_0\eta_0}\right)^\beta. \qquad (16.38)$$
It is noted that the expected repair number for ρ > ρ₀ is unchanged. As such, the usage limit U actually controls the total repair number.
Removing the condition, the expected repair number per sold item is given by
$$\bar{n} = \int_0^\infty n(x)\,dG(x). \qquad (16.39)$$
Specially, assume that the usage rate follows the lognormal distribution with parameters μ_l and σ_l. We have
$$\bar{n} = a_1 e^{\beta\mu_l + (\beta\sigma_l)^2/2}\,\Phi\!\left(\frac{\ln\rho_0 - \mu_l}{\sigma_l};\, \beta\sigma_l,\, 1\right) + a_2\left[1 - \Phi(\ln\rho_0;\, \mu_l,\, \sigma_l)\right]. \qquad (16.40)$$
[Figure: the expected repair number n̄ plotted over (1, 6).]
The expected warranty cost per item is then
$$C(W) = c_s + \bar{n}c_m. \qquad (16.41)$$
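A small computational sketch of Eqs. (16.34)-(16.41) follows; the parameter values are illustrative assumptions and scipy is assumed to be available.

# Expected repair number per sold item under the two-dimensional FRW,
# Eq. (16.40), with a lognormal usage rate; parameters are assumptions.
from math import exp, log
from scipy.stats import norm

def expected_repairs(W, U, beta, eta0, mu_l, sigma_l):
    rho0 = U / W                                   # Eq. (16.34)
    a1 = (W / (rho0 * eta0))**beta                 # Eq. (16.38)
    a2 = (U / (rho0 * eta0))**beta
    z = (log(rho0) - mu_l) / sigma_l
    # lognormal partial expectation of rho^beta over rho <= rho0
    t1 = a1 * exp(beta * mu_l + (beta * sigma_l)**2 / 2) * norm.cdf(z - beta * sigma_l)
    t2 = a2 * (1 - norm.cdf(z))                    # constant part for rho > rho0
    return t1 + t2

n_bar = expected_repairs(W=2.0, U=20.0, beta=2.5, eta0=5.0, mu_l=0.5, sigma_l=0.8)
print(n_bar)   # C(W) then follows from Eq. (16.41) as cs + n_bar*cm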
Product warranty servicing is important for reducing the warranty cost while ensuring customer satisfaction. In this section we briefly discuss three warranty servicing-related issues: spare part demand prediction, the optimal repair-replacement decision, and field information collection and analysis.
Delay in repairs due to awaiting spare parts is costly, but maintaining a spare part inventory also costs money. As such, the inventory can be optimized based on the estimation of spare part demand. The demand estimation deals with predicting the number of replacements for a specific component in a given time interval. The replacements over the warranty period occur according to a renewal process, and the demand prediction needs to evaluate the renewal function and to consider the variance of the number of replacements.
Spare part inventory optimization needs to consider the importance of a spare
part. The decision variables include inventory level, reordering time, and order
quantity. These are related to the sales over time and component reliability.
A lot of data is generated during warranty servicing, including [1, 2]:

- Technical data such as failure modes, times between failures, degradation data, operating environment, use conditions, etc. This type of information can be useful for reliability analysis and improvement (e.g., design changes).
- Servicing data such as spare parts inventories, etc. This type of information is important in the context of improving product support.
- Customer-related data (e.g., customer impressions of the product and warranty service) and financial data (e.g., costs associated with different aspects of warranty servicing). This type of information is useful for improving the overall business performance.
To effectively implement warranty servicing, adequate information systems are needed to collect data for detailed analysis. Such systems include warranty management systems and the FRACAS mentioned in Chap. 9.
References

1. Blischke WR, Murthy DNP (1994) Warranty cost analysis. Marcel Dekker, New York
2. Blischke WR, Murthy DNP (1996) Product warranty handbook. Marcel Dekker, New York
3. Canfield RV (1986) Cost optimization of periodic preventive maintenance. IEEE Trans Reliab 35(1):78-81
4. Jiang R (2008) A gamma-normal series truncation approximation for computing the Weibull renewal function. Reliab Eng Syst Saf 93(4):616-626
5. Jiang R (2010) A simple approximation for the renewal function with an increasing failure rate. Reliab Eng Syst Saf 95(9):963-969
6. Jiang R (2013) Life restoration degree of minimal repair and its applications. J Qual Maint Eng 19(4)
7. Jiang R, Jardine AKS (2006) Composite scale modeling in the presence of censored data. Reliab Eng Syst Saf 91(7):756-764
8. Jiang R, Jardine AKS (2007) An optimal burn-in preventive-replacement model associated with a mixture distribution. Qual Reliab Eng Int 23(1):83-93
9. Jiang R, Murthy DNP (2011) A study of Weibull shape parameter: properties and significance. Reliab Eng Syst Saf 96(12):1619-1626
10. Kijima M, Sumita U (1986) A useful generalization of renewal theory: counting processes governed by nonnegative Markovian increments. J Appl Prob 23(1):71-88
Chapter 17
Maintenance Decision Optimization
17.1 Introduction
Maintenance comprises the actions to restore a system to its operational state through corrective actions after a failure, or to control the deterioration process leading to failure of a system. The phrase "actions to restore" refers to corrective maintenance (CM), and the phrase "actions to control" refers to preventive maintenance (PM).
Maintenance management deals with many decision problems, including maintenance type selection (i.e., CM or PM), maintenance action selection (e.g., repair or replacement), maintenance policy selection (e.g., age-based or condition-based), and policy parameter optimization. In this chapter, we present an overview of the key issues in maintenance management decisions. Our focus is on typical maintenance policies and their optimization models. More on maintenance can be found in Ref. [5].
This chapter is organized as follows. We first discuss maintenance policy optimization in Sect. 17.2. Typical CM policies are presented in Sect. 17.3. Typical component-level PM policies are classified into three categories: time-based replacement policies, time-based inspection policies, and condition-based maintenance policies. They are discussed in Sects. 17.4 through 17.6, respectively. Typical system-level PM policies are group and opportunistic maintenance policies, and they are discussed in Sect. 17.7. Finally, we present a simple maintenance float system in Sect. 17.8.
The two types of basic maintenance actions are CM and PM. Typical PM tasks can be found in reliability-centered maintenance (RCM) and total productive maintenance (TPM). We first look at the choice problem between PM and CM before introducing RCM and TPM, and then summarize typical PM tasks.
The choice problem between CM and PM deals with two aspects: the applicability and the effectiveness of PM. Applicability deals with the failure mechanism, and effectiveness with economic sense, which is addressed by optimization and discussed later.
Failure mechanisms can be roughly divided into two categories: overstress mechanisms and wear-out mechanisms. Failures due to overstress are hard to predict and hence have to be rectified by CM. If the consequence of failure is unacceptable, redesign is the only improvement strategy.
Wear-out is a phenomenon whereby the effect of damage accumulates with time. The item fails when the accumulated damage reaches a certain critical level. As such, failure due to a wear-out mechanism implies that the item experiences a degradation process before its failure. Generally, an item failure can be the result of interactions among two or more mechanisms (e.g., stress-assisted corrosion). In all these cases, the item is aging, the failure is preventable, and PM is applicable.
support of the total workforce in all departments and levels is required to ensure effective equipment operation, it is sometimes called people-centered maintenance.
TPM increases equipment efficiency by eliminating the six big losses:

- breakdown losses caused by the equipment,
- setup and adjustment losses,
- idling and minor stoppage losses,
- speed losses,
- quality defect and rework losses, and
- startup and yield losses.
These losses are combined into one measure of overall equipment effectiveness (OEE), given by
$$OEE = A \times P \times Y \qquad (17.1)$$
where A, P, and Y denote the availability, performance rate, and quality (yield) rate, respectively.
17.2.1.4 Summary
According to the above discussion, typical maintenance actions or tasks are CM,
routine maintenance (or autonomous maintenance), replacement, overhaul,
inspection, CBM, and design-out maintenance.
The timing of a specific maintenance task deals with the conditions under which the task is triggered and implemented. Generally, there are three cases that trigger a maintenance task:
1. Failure triggered: it leads to a CM action,
2. Age or calendar time triggered: it leads to a time-based PM action, and
3. Condition triggered: it leads to a condition-based PM action.
There are several CM policies that involve an optimal selection between two optional actions: repair and replacement. Two typical types of CM policies are the repair limit policy and the failure counting policy.
There are many PM policies, and they can be divided into component-level policies and system-level policies. A component-level policy is defined for a single component, and a system-level policy is defined to simultaneously implement several maintenance tasks for several components.
Most PM policies are at the component level. These policies fall into two categories: time-based maintenance (TBM) and CBM. Here, the time can be age, calendar time, or usage; and the maintenance can be repair, replacement, or inspection.
There are several system-level maintenance policies; two typical ones are group and opportunistic maintenance policies. In group maintenance, the PM actions are combined into several groups. For each group, the tasks are simultaneously implemented in a periodic way. A main advantage of group maintenance is that it can significantly reduce maintenance interferences. A multi-level PM program is usually implemented for complex systems, and group maintenance is the basis for designing such a PM program.
A failure triggers a CM action. This provides an opportunity to simultaneously perform some PM actions by delaying the CM action or advancing the PM actions. Such PM policies are called opportunistic maintenance. A main advantage of opportunistic maintenance is that it can reduce both maintenance cost and maintenance interferences.
According to the above discussion, we have the following classification of maintenance policies:

- CM policies (or repair-replacement policies) at both the component level and the system level,
- Component-level PM policies, including TBM and CBM policies,
- Inspection policies at both the component level and the system level, and
- System-level PM policies.
Under this policy, the item runs to failure. When a failure occurs, the failed item is inspected and the repair cost is estimated; the item undergoes minimal repair if the estimated repair cost is less than a prespecified cost limit x₀; otherwise, it is replaced by a new one. The repair cost limit is the decision variable. The policy reduces to the renewal process when x₀ = 0 and to the minimal repair process when x₀ = ∞.
The appropriateness of the policy can be explained as follows. When the item fails, the decision-maker has two options: minimal repair and failure replacement. If the failure is rectified by a minimal repair, the direct repair cost may be smaller than the cost of a failure replacement, but this may lead to more frequent failures and hence incur more cost later.
The repair cost, X, is a random variable with cdf G(x) and pdf g(x). For a specified cost limit x₀, the probability that a failed item will be repaired is p(x₀) = G(x₀), and the probability that it will be replaced is q(x₀) = 1 − p(x₀). After a minimal repair the failure rate remains unchanged. The replacement rate (as opposed to the failure rate) of the item at time t is h(t; x₀) = q(x₀)r(t). As such, the intervals between failure replacements are independent and identically distributed with the distribution function
$$F_x(t; x_0) = 1 - [R(t)]^{q(x_0)}.$$
Letting U denote the time between two adjacent renewal points, F_x(t; x₀) represents the distribution of U. Let N(t; x₀) denote the expected number of failures in (0, t). The mean cost of a minimal repair is given by
$$\bar{c}_m(x_0) = \frac{1}{p(x_0)}\int_0^{x_0} u\,g(u)\,du. \qquad (17.4)$$
where c_r is the average cost of a replacement. The mean time between replacements is given by
$$\mu_q = \int_0^\infty [R(t)]^{q(x_0)}\,dt. \qquad (17.7)$$
When t → ∞, we have
$$\frac{M(t; x_0)}{t} \to \frac{1}{\mu_q}. \qquad (17.9)$$
The cost rate is then given by
$$J(x_0) = \frac{c_r + \bar{c}_m(x_0)\,p(x_0)/q(x_0)}{\mu_q}. \qquad (17.10)$$
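The following sketch evaluates and minimizes the cost rate of Eq. (17.10). The Weibull life, the lognormal repair cost distribution, and all parameter values are illustrative assumptions.

# Repair cost limit policy: J(x0) = [cr + cm_bar(x0)*p(x0)/q(x0)] / mu_q,
# with p(x0) = G(x0) and mu_q from Eq. (17.7); distributions are assumptions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm, weibull_min

beta, eta = 2.5, 10.0                    # life distribution F(t)
cr = 5.0                                 # average replacement cost
G = lognorm(s=0.8, scale=1.0)            # repair cost distribution G(x)

def cost_rate(x0):
    p = G.cdf(x0)                        # probability of repair
    q = 1.0 - p                          # probability of replacement
    num, _ = quad(lambda u: u * G.pdf(u), 0, x0)
    cm = num / p if p > 0 else 0.0       # mean repair cost, Eq. (17.4)
    mu_q, _ = quad(lambda t: weibull_min.sf(t, beta, scale=eta)**q, 0, np.inf)
    return (cr + cm * p / q) / mu_q      # Eq. (17.10)

x_grid = np.linspace(0.1, 5.0, 50)
x_star = min(x_grid, key=cost_rate)
print(f"x0* = {x_star:.2f}, J(x0*) = {cost_rate(x_star):.4f}")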
Under this policy, the repair time X is a random variable with cdf G(x) and pdf g(x). When an item fails, the completion time of the repair is estimated. The item is rectified by minimal repair if the estimated repair time is smaller than a prespecified time limit x₀; otherwise, it is replaced, and this involves ordering a spare item with a lead time.
The appropriateness of the policy can be explained as follows. In the context of product warranty, the repair cost is usually less than the replacement cost in most situations, and hence the manufacturer prefers repairing the failed product before providing a replacement service. If the failed item cannot be fixed within the prespecified time limit set by the manufacturer, the failed item has to be replaced by a new one so that the item can be returned to the customer as soon as possible.
Let c_m denote the repair cost per unit time, c_d denote the penalty cost per unit time when the system is in the down state, c₀ denote the fixed cost (including the price of the item) associated with ordering a new item, and L denote the lead time for delivery of a new item.
By a similar argument to the derivation of the cost rate for the repair cost limit policy, the sequence of failure replacements forms a renewal process, and the expected number of renewals in (0, t) is given by the renewal function M(t; x₀) associated with F_x(t; x₀). The mean rectification cost of a repaired failure is given by
$$\bar{c}_m(x_0) = \frac{c_d + c_m}{p(x_0)}\int_0^{x_0} u\,g(u)\,du. \qquad (17.13)$$
The cost of a failure replacement is given by
$$c_r = c_0 + c_d L. \qquad (17.14)$$
The expected cost per failure has the same expression as Eq. (17.5), and the expected cost per unit time has the same expression as Eq. (17.10).
Let T denote a reference age, and t_k denote the time when the kth failure occurs. Under this policy, the item is replaced at the kth failure if t_k > T, or at the (k + 1)st failure if t_k < T; the failures before the replacement are rectified by minimal repairs. It is noted that the event t_k < T includes two cases: t_{k+1} < T and t_{k+1} > T. This policy has two decision variables, k and T. When T = 0 [T = ∞], the item is always replaced at t_k [t_{k+1}] (i.e., a failure counting policy without a reference age); when k = ∞, the policy reduces to a minimal repair process without renewal.
A replacement cycle is the time between two successive failure replacements. Let X denote the cycle length, and n(x) and n(T) denote the number of failures in [0, x] and [0, T], respectively. Table 17.1 shows the relations among X, T, n(x), and n(T).
For x ≤ T, the reliability function of the cycle length is given by
$$R_1(x) = \Pr\{n(x) \leq k\} = P_k(x) = \sum_{n=0}^{k} p_n(x) \qquad (17.15)$$
where p_n(x) = H^n(x)e^{-H(x)}/\Gamma(n + 1).
When X > T, the reliability function is given by
$$R_2(x) = \Pr\{n(x) \leq k - 1 \text{ or } n(x) = n(T) = k\} = P_k(x) - p_k(x) + p_k(T)p_k(x). \qquad (17.16)$$
The expected cycle length is given by
$$W(T, k) = \int_0^T R_1(x)\,dx + \int_T^\infty R_2(x)\,dx = \int_0^\infty P_k(x)\,dx - [1 - p_k(T)]\int_T^\infty p_k(x)\,dx. \qquad (17.17)$$
The cost rate is given by
$$J(k, T) = \frac{c_m\bar{n}_m + c_r}{W(T, k)} \qquad (17.20)$$
where n̄_m is the expected number of minimal repairs in a replacement cycle. The optimal parameters of the policy are the values of k and T that minimize J(k, T).
17.4 Time-Based Preventive Replacement Policies
When a component fails in operation, rectifying the failure can be costly, and hence it can be much cheaper to preventively replace the item before failure. Such preventive replacement actions reduce the likelihood of failure and the resulting cost, but they increase the PM costs and sacrifice part of the useful life of the replaced item. This implies that the parameters characterizing a PM policy need to be selected properly to achieve an appropriate tradeoff between preventive and corrective costs.
Three preventive replacement policies that have been used extensively are the age replacement policy, the block replacement policy, and the periodic replacement policy with minimal repair. Each of them involves a single decision variable T. In this section, we look at these policies and their optimization decision models.
Under the age replacement policy, the item is replaced either at failure or on reaching a prespecified age T, whichever occurs first.
Let F(t) denote the cdf of the item life, and c_f [c_p] denote the failure [preventive] replacement cost. Preventive replacement of a component is appropriate only if the component's failure rate associated with F(t) is increasing and c_p < c_f.
A replacement cycle can be ended by a failure replacement with probability F(T) or by a preventive replacement with probability R(T). The expected cycle length for a preventive replacement cycle is T, and for a failure replacement cycle it is given by
$$T_c = \frac{1}{F(T)}\int_0^T t\,dF(t). \qquad (17.21)$$
The expected cycle length is
$$W(T) = T_c F(T) + T R(T) = \int_0^T R(t)\,dt. \qquad (17.22)$$
The expected cost per cycle is E(C) = c_f F(T) + c_p R(T), where ρ = c_f/c_p is called the cost ratio. The optimum replacement time T is the value that minimizes the cost rate given by
$$J(T) = \frac{E(C)}{W(T)}. \qquad (17.25)$$
$$y(t) = tR(t). \qquad (17.26)$$
Using the tradeoff B_X approach and the cost model, the optimal preventive replacement ages can be obtained. The results are shown in the second and third columns of Table 17.3, respectively. Since the cost ratios are large, the results obtained from the two approaches are significantly different. Generally, we take the results from the cost model if the cost parameters can be appropriately specified.
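A minimal sketch of the age replacement optimization of Eq. (17.25), assuming a Weibull life and illustrative cost values:

# Age replacement: J(T) = [cf*F(T) + cp*R(T)] / W(T), with W(T) from
# Eq. (17.22); the parameter values are illustrative assumptions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import weibull_min

beta, eta = 2.5, 10.0
cf, cp = 10.0, 1.0                       # failure / preventive replacement costs

def cost_rate(T):
    F = weibull_min.cdf(T, beta, scale=eta)
    W, _ = quad(lambda t: weibull_min.sf(t, beta, scale=eta), 0, T)
    return (cf * F + cp * (1 - F)) / W   # Eq. (17.25)

T_grid = np.linspace(0.5, 15.0, 200)
T_star = min(T_grid, key=cost_rate)
print(f"T* = {T_star:.2f}, J(T*) = {cost_rate(T_star):.4f}")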
Assume that the component is repairable. Under the periodic replacement policy with minimal repair, the item is preventively replaced at fixed time instants kT, and failures are removed by minimal repair.
Let c_p and c_m denote the costs of a preventive replacement and a minimal repair, respectively. The expected number of minimal repairs in a replacement cycle is given by the cumulative hazard function H(T). As a result, the cost rate function is given by
$$J(T) = \frac{c_p + c_m H(T)}{T}. \qquad (17.27)$$
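For a Weibull life with H(T) = (T/η)^β and β > 1, the optimum of Eq. (17.27) has a well-known closed form (stated here for convenience; it is not derived in the text):
$$\frac{dJ}{dT} = 0 \;\Rightarrow\; (\beta - 1)\,c_m\left(\frac{T^*}{\eta}\right)^{\beta} = c_p \;\Rightarrow\; T^* = \eta\left[\frac{c_p}{(\beta - 1)\,c_m}\right]^{1/\beta}.$$
For Example 17.2 below, with c_p/c_m = 2, this gives T* = η[2/(β − 1)]^{1/β} for each component.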
For a group of components replaced together at a common interval T, the total cost rate is
$$J(T) = \frac{1}{T}\sum_i\left[c_{p,i} + c_{m,i}H_i(T)\right] \qquad (17.28)$$
where c_{p,i}, c_{m,i}, and H_i(·) are the preventive replacement cost, repair cost, and cumulative hazard function of the ith component, respectively.
Example 17.2 Consider Components B and P in Table 17.2. It is noted that their reliability characteristics are similar. Assume that these two components are repairable with c_p/c_m = 2. The problem is to determine the individual preventive replacement intervals and the common replacement interval.
From Eq. (17.27), we obtain the individual preventive replacement intervals shown in the fourth column of Table 17.3. From Eq. (17.28), the optimum common replacement interval is 82.8, which is close to the individual replacement intervals.
The block replacement policy is similar to the periodic replacement policy with minimal repair. The difference is that "minimal repair" is replaced by "failure replacement". The cost rate function for this policy is given by
$$J(T) = \frac{c_p + c_f M(T)}{T} \qquad (17.29)$$
Example 17.3 Consider the components in Table 17.2. The problem is to find the preventive replacement interval of Component F, the common preventive replacement interval of the component group (A, O), and the common preventive replacement interval of the component group (B, P, S).
The renewal function is evaluated by Eq. (16.8). Using the cost models given by Eqs. (17.29) and (17.30), we obtain the results shown in the fifth column of Table 17.3. As seen, the results are close to those obtained from the age replacement policy.
17.4.4 Discussion
to detect the state: discrete inspection and continuous monitoring. Continuous monitoring is often impossible or too costly, so a discrete inspection scheme is often used. The key decision variables of a discrete inspection scheme are the inspection times. This is because over-inspection leads to high inspection cost and low availability, while under-inspection increases the risk of failure. Thus, the inspection times should be optimized.
An inspection scheme can be periodic, quasi-periodic, or sequential. Under a periodic scheme, inspections are conducted at time instants jT, where j = 1, 2, ..., and T is called the inspection interval. Under a quasi-periodic scheme, the first several inspections are conducted in a nonperiodic way and then a periodic inspection scheme is implemented. A simple quasi-periodic inspection scheme is defined as
$$t_j = t_1 + (j - 1)T \qquad (17.31)$$
where t_j is the time of the jth inspection. Under the sequential inspection scheme, the inspection interval t_j − t_{j−1} varies with j. For simplicity, we focus on the periodic inspection scheme in this section.
The inspection actions can influence the reliability characteristics of the inspected item. Two typical cases are:
(a) a thorough PM action is carried out at each inspection so that the item is good-as-new after the inspection, and
(b) nothing is done to the inspected item when it is in the working state, so that the inspection can be effectively viewed as a minimal repair.
In this section, we consider inspection policies associated with the above two cases and present the corresponding optimization models with the objective being cost or availability.
Under this policy, inspection actions are periodically performed and the item is preventively maintained at each inspection. The PM is assumed to be perfect. The decision variable is the inspection interval T.
Since the inspection is perfect, an inspection ends a cycle and resets the time to zero. The probability that an operating item survives until T is R(T), and the probability that the item fails before T is F(T). The mean downtime from the occurrence of a failure to the time when it is detected is given by
$$t_d(T) = T - \frac{1}{F(T)}\int_0^T t f(t)\,dt = \frac{1}{F(T)}\int_0^T F(t)\,dt. \qquad (17.32)$$
For the Weibull distribution, this can be written as
$$t_d(T) = T - \frac{\eta\Gamma(1 + 1/\beta)}{F(T)}\,G_a\!\left(H(T);\, 1 + 1/\beta,\, 1\right). \qquad (17.33)$$
The availability is given by
$$A(T) = \frac{T - t_d(T)F(T)}{T + \tau_1 R(T) + \tau_2 F(T)} \qquad (17.34)$$
and the cost rate is given by
$$J(T) = \frac{c_1 R(T) + c_2 F(T) + c_3 t_d(T)}{T}. \qquad (17.35)$$
$$J(T) = \sum_{j=1}^{\infty}(c_1 j + c_2 t_j)\left[F(t_j) - F(t_{j-1})\right] - c_2\mu \qquad (17.36)$$
where μ is the mean lifetime. The optimal solution corresponds to the minimum of J(T).
If c₁ represents the duration of an inspection and c₂ = 1, then J(T) represents the expected downtime. The optimal solution with the availability objective also corresponds to the minimum of J(T) given by Eq. (17.36).
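A small sketch that minimizes J(T) of Eq. (17.36) over the inspection interval, assuming a Weibull life and illustrative cost values:

# Periodic inspection (no PM at inspections): Eq. (17.36),
# J(T) = sum_j (c1*j + c2*t_j)[F(t_j) - F(t_{j-1})] - c2*mu.
import numpy as np
from math import gamma
from scipy.stats import weibull_min

beta, eta = 2.5, 10.0
c1, c2 = 0.5, 1.0                    # cost per inspection, downtime cost per unit time
mu = eta * gamma(1 + 1 / beta)       # mean lifetime

def J(T, jmax=500):
    j = np.arange(1, jmax + 1)
    tj = j * T
    dF = np.diff(weibull_min.cdf(np.r_[0.0, tj], beta, scale=eta))
    return float(np.sum((c1 * j + c2 * tj) * dF) - c2 * mu)

T_grid = np.linspace(0.5, 10.0, 100)
T_star = min(T_grid, key=J)
print(f"T* = {T_star:.2f}, J(T*) = {J(T_star):.3f}")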
the observed condition information into actionable knowledge about the health of the system.
Prognostics and health management (PHM) can be viewed as a systematic CBM approach for engineering asset health management. It attempts to integrate various kinds of knowledge and available information to optimize system-level maintenance decisions.
$$T_1 \leq T_2 \leq \cdots \leq T_n. \qquad (17.37)$$
Let τ_k denote the common PM interval of the components in the kth group. These PM intervals must satisfy τ_k = n_k τ_{k+1}, k = 1, ..., K − 1, where the n_k are integers. This implies that we just need to determine the value of τ_K. It can be optimally determined based on a periodic replacement policy discussed in Sect. 17.4, e.g., the model given by Eq. (17.28) if minimal repairs are allowed or the model given by Eq. (17.30) if minimal repairs are not allowed.
Example 17.4 Consider the PM intervals shown in the third column of Table 17.3. The problem is to divide the components into three groups and to determine the values of n₁, n₂, and τ₃.
[Fig. 17.1: Dividing the components into K = 3 groups based on the components' PM intervals; the ordered intervals T_(i) are plotted with break points b₁ and b₂.]
We first implement the first step. Using the approach outlined above, we have X₁ = 42.7 − 18.7 = 24 with b₁ = 30.7, and X₂ = 18.7 − 4.7 = 14 with b₂ = 11.7. As such, a component with T_i ≥ b₁ belongs to the first group, a component with T_i ≤ b₂ belongs to the third group, and the other components belong to the second group. As a result, the components in the first group are (B, P, S); the component in the second group is (F); and the components in the third group are (A, O).
We now implement the second step. The groups' mean PM intervals are (μ₃, μ₂, μ₁) = (4.65, 18.7, 52.13). This yields (n₁, n₂) = (2.79, 4.02) ≈ (3, 4). As such, the remaining problem is to adjust the value of τ₃ so that τ₂ = 4τ₃ and τ₁ = 3τ₂ = 12τ₃. Based on the total cost rate model given by Eq. (17.30), we have τ₃ = 4.6.
The final PM intervals of the components are shown in the last column of Table 17.3. As seen, they are almost the same as those in the fifth column, obtained from the block replacement policy for each group.
[Fig. 17.2: opportunistic maintenance windows; the replacement time T₁ and the window (T_L, T_R) around T on the time axis t.]
window given by (T_L, T_R) is set for component C₂. If T₁ falls into this window, the replacement of C₂ can be advanced to T₁.
We now look at cases (b) and (c) in the figure. An opportunistic PM window is set for a group of components. In case (b), a failure triggers an opportunistic PM action, which can be advanced. In case (c), the PM action cannot be advanced since the failure time is smaller than the lower limit of the opportunistic window. However, the CM action may be delayed to the opportunistic PM window if the delay is allowed.
According to the above discussion, we see that the key problem of opportunistic maintenance is to set the opportunistic maintenance window for key components and all PM packages. We look at this issue as follows.
An opportunistic maintenance action can save a setup cost for the combined maintenance actions. However, advancing the replacement time of a component reduces the useful life of the component, and delaying a CM may have a negative influence on production.
The opportunistic maintenance window can be derived by adjusting the relevant cost parameters. For simplicity, we consider the age replacement policy for a component. For the other cases, the method to determine the opportunistic maintenance window is similar but more complex.
Suppose that the preventive and failure replacement costs for a component are c_p and c_f, respectively, and its preventive replacement interval T is determined by the cost model of this policy. Let c_{s,p} denote the setup cost for a preventive replacement. Advancing the PM implies that the setup cost can be saved, so that the preventive replacement cost c_p in the normal condition is reduced to c_p′ = c_p − c_{s,p}. This results in an increase in the cost ratio and a decrease in the optimal PM interval. As such, the optimal PM interval obtained for this case is set as the lower limit of the opportunistic replacement window.
Similarly, let c_{s,f} denote the setup cost for a failure replacement, which is usually much larger than c_{s,p}. Delaying a CM implies that the setup cost can be saved, so that the failure replacement cost c_f in the normal condition is reduced to c_f′ = c_f − c_{s,f}. This can result in a significant reduction in the cost ratio and an increase in the optimal PM interval. As such, the optimal PM interval obtained for this case is set as the upper limit of the opportunistic replacement window. If the downtime loss must be considered, we can take c_f′ = c_f − c_{s,f} + c_d Δt, where c_d is the loss per unit downtime and Δt is the expected delay time.
Example 17.5 For the data shown in Table 17.2, assume that c_{s,p} = c_{s,f} = 0.5c_p. We do not consider the downtime loss. The problem is to find the opportunistic replacement windows of the components.
Using the approach outlined above, we have the results shown in Table 17.4, where w = (T_R − T_L)/T is the relative width of the opportunistic replacement window. Figure 17.3 shows the plot of w versus β. As seen, a large β allows only a small opportunistic window.
[Fig. 17.3: the relative window width w plotted versus β over (0, 5).]
[Fig. 17.4: a maintenance float system consisting of a working item, a backup item, and a repair workshop handling the item to be repaired.]
backup item gets repaired. Let X denote the operating time of the working item and Y denote the repair time of the backup item. The reliability of the system is the probability of the event X > Y, and can be evaluated using the stress-strength model (i.e., X is equivalent to strength and Y is equivalent to stress), given by
$$R = P\{X > Y\} = \int_0^\infty [1 - F(z)]\,dG(z). \qquad (17.40)$$
The expected operating time is
$$E(X) = \int_0^\infty [1 - F(x)]\,dx. \qquad (17.41)$$
Ignoring the item switch time, the cycle length is given by T = max(X, Y). This implies that T follows the twofold multiplicative model given by Eq. (4.33) with F₁(t) replaced by F(x) and F₂(t) replaced by G(y). The expected cycle length is given by
$$E(T) = \int_0^\infty [1 - F(z)G(z)]\,dz. \qquad (17.42)$$
The availability of the system is
$$A = E(X)/E(T). \qquad (17.43)$$
In complex maintenance float systems, the number of working items, the number of backup items, or the number of repair workshops can be larger than one. The items may be subject to a multi-level PM program. In such cases, Monte Carlo simulation is an appropriate approach to analyze the characteristics of the system.
Example 17.6 Assume that F(x) is the Weibull distribution with parameters β = 2.5 and η = 10, and G(y) is the lognormal distribution with parameters μ_l = 0.5 and σ_l = 0.8. Using numerical integration to evaluate the integrals in Eqs. (17.40) and (17.42), we obtain the results shown in the second row of Table 17.5.
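A numerical sketch of Eqs. (17.40)-(17.43) with the distributions of Example 17.6 (the implementation is an assumption; scipy is assumed available):

# Maintenance float system of Example 17.6: Weibull(2.5, 10) operating
# time X, lognormal(mu_l = 0.5, sigma_l = 0.8) repair time Y.
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm, weibull_min

X = weibull_min(2.5, scale=10.0)           # operating time of the working item
Y = lognorm(s=0.8, scale=np.exp(0.5))      # repair time of the backup item

R, _ = quad(lambda z: X.sf(z) * Y.pdf(z), 0, np.inf)        # Eq. (17.40)
EX = X.mean()                                               # Eq. (17.41)
ET, _ = quad(lambda z: 1 - X.cdf(z) * Y.cdf(z), 0, np.inf)  # Eq. (17.42)
A = EX / ET                                                 # Eq. (17.43)
print(f"R = {R:.4f}, E(X) = {EX:.3f}, E(T) = {ET:.3f}, A = {A:.4f}")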
References

1. Chen M, Tseng H (2003) An approach to design of maintenance float systems. Integr Manuf Syst 14(5):458-467
2. Haque L, Armstrong MJ (2007) A survey of the machine interference problem. Eur J Oper Res 179(2):469-482
3. Jiang R (2013) A tradeoff BX life and its applications. Reliab Eng Syst Saf 113:1-6
4. Jiang R (2013) A multivariate CBM model with a random and time-dependent failure threshold. Reliab Eng Syst Saf 119:178-185
5. Jiang R, Murthy DNP (2008) Maintenance: decision models for management. Science Press, Beijing
6. Jiang R, Murthy DNP (2011) A study of Weibull shape parameter: properties and significance. Reliab Eng Syst Saf 96(12):1619-1626
7. Lopes IS, Leitão ALF, Pereira GAB (2007) State probabilities of a float system. J Qual Maint Eng 13(1):88-102
8. Moubray J (1997) Reliability-centered maintenance. Industrial Press Inc, New York
9. Tajiri M, Gotoh F (1992) TPM implementation, a Japanese approach. McGraw-Hill, New York