Fault Tree Analysis in Product Reliability Improvement: Milena Krasich, P.E

Milena Krasich, PE; Bose Corporation; MS 450; The Mountain, Framingham MA 01701-7330 USA e-mail: milena_krasich@bose.com.
2002 Annual RELIABILITY and MAINTAINABILITY Symposium
SUMMARY & PURPOSE This tutorial introduces the use of the well-known technique of Fault Tree Analysis as a tool for reliability modeling and analysis of an electronic or mechanical design (including software), identification of potential failure modes that are high contributors to unreliability, and tradeoffs and mitigation of those failure modes. Applied early in the product design phase, this activity allows for relatively inexpensive and easy design and manufacturing process improvements and, in that manner, achieves considerable improvement of product reliability before the design is completed or the product is manufactured. Real examples of this analysis as applied to audio products are discussed along with the achieved reliability improvement.
Milena Krasich
Milena Krasich is the Senior Technical Lead of Reliability Engineering in Design Assurance Engineering of Bose Corporation. Before joining Bose, she was a Member of Technical Staff in the Reliability Engineering Group of General Dynamics Advanced Technology Systems (formerly Lucent Technologies), and prior to that, she worked for the Jet Propulsion Laboratory in Pasadena, California. While in California, she was a part-time professor at the California State University Dominguez Hills, where she taught graduate courses in System Reliability, Advanced Reliability and Maintainability, and Statistical Process Control. At that time, she was also a part-time professor at the California State Polytechnic University, Pomona, teaching undergraduate courses in Engineering Statistics, Reliability, Environmental Testing, Production Systems Design, Measurements, and Materials Procurement. She holds a BS and MS in Electrical Engineering from the University of Belgrade, Yugoslavia, and is a California registered professional electrical engineer. She is also a member of the IEEE and ASQC Reliability Society, a Fellow and the past president of the Institute of Environmental Sciences and Technology, and a member of the College of Fellows of the Institute for Advancement of Engineering. Currently, she is a US Delegate to the International Electrotechnical Commission (IEC), working on dependability/reliability standards, and is a project leader for the revision of international standards for reliability growth.
Table of Contents
1. Introduction
1.1 Notation and Acronyms
2. Reliability Improvement
2.1 Reliability Definitions Related to This Tutorial
3. Fault Tree Analysis and Its Use
3.1 Fault Tree Introduction
3.2 System Analysis Methodology
3.3 Building of a Fault Tree
3.4 Contribution of Manufacturing Defects
3.5 Origin of Values for the Basic Events
4. Failure Mode Detection and Mitigation
5. Summary and Conclusions
6. References and Bibliography
7. Attachment - Tutorial Visuals
1. INTRODUCTION
Multiple methods have been used for the estimation of product reliability over the many decades that reliability has been applied as a science. Many reasons, such as product criticality (medical devices, defense systems, transportation) or the need for competitiveness in the consumer industry, dictate the need for products with remarkably high reliability. Design alone, regardless of its features and technology, does not guarantee product reliability. A design team, conscious of good and reliable design methods such as proper component derating and ESD and EMI protection, may not be completely aware of all of the aspects of reliability modeling and potential reliability shortfalls. This is especially the case when a product must be designed to operate in multiple environments, or when the specifics of component reliability (such as the dependency of reliability on applied stresses) are not well understood. Therefore, the reliability of a completed design may not be as required or as expected.

In the past, attempts to improve product reliability were concentrated on various types of Failure Mode and Effects Analysis (FMEA), and/or on dedicated Reliability Growth test programs. Both of those methods, applied individually or in conjunction, even though useful, may not be cost effective or applicable. The first method, FMEA, is a valuable but very comprehensive attempt to identify the potential failure modes and to assure their mitigation. Starting from the bottom and going up, the analysis addresses each component (electrical or mechanical), the modes in which it might fail, and the effects that those failure modes might have on higher level assemblies and the system. The process is very tedious and is often completed well after the design is finished and the production period has begun. This might be too late to accomplish any measurable improvements without major expenses for redesign, new PC board layouts, and new tooling.
In addition, any type of FMEA normally does not produce a measure of overall product reliability; thus any achieved reliability improvement is also not measurable. One type of FMEA has a Risk Priority Number (RPN) associated with it; however, this number is a product of three numbers (each from 1 to 10) assigned to Severity, Occurrence, and Detection. Regardless of the strict rules applied in the estimation of these numbers, they are still only estimations, and thus might be subjective. Another FMEA type that includes criticality computation (FMECA) requires knowledge of failure rates; therefore, it cannot be applied to the analysis of systems with components for which failure probabilities, not failure rates, are a far better attribute. Those analyses also do not provide reliability estimates.

Test methods for reliability improvement are even more costly, keeping in mind that they are performed on preproduction or production runs, meaning that the design is mature. In addition, the test units might be complex and expensive, so that only a limited number might be available for testing.

Fault Tree Analysis combines many favorable aspects:
- It is timely and, therefore, low cost
- It is fast and easy to use
- It provides realistic reliability estimates at the same time as the failure mode analysis
- It measures the achieved reliability improvement and the final reliability of a product.
1.1 NOTATION AND ACRONYMS
λ(t) - Component failure rate, instantaneous failure rate
λ - Component failure rate if assumed to be constant
ESD - Electrostatic Discharge
EMI - Electromagnetic Interference
FTA - Fault Tree Analysis
FMEA - Failure Mode and Effects Analysis
FMECA - Failure Mode, Effects and Criticality Analysis
RPN - Risk Priority Number
MTTF - Mean Time to Failure
MTBF - Mean Time Between Failures
IEC - International Electrotechnical Commission
Q(t) - Unreliability as a function of time
Q - Unreliability assumed constant or calculated for a predetermined time
Pr - Probability
Pr(c) - Probability of occurrence of a cut set
FET - Field Effect Transistor
IC - Integrated Circuit
R - Reliability
F - Probability of failure (unreliability)
CODEC - Coder/Decoder
PRF - Part Random Failure
PCB - Printed Circuit Board
IEV - International Electrotechnical Vocabulary
2. RELIABILITY IMPROVEMENT
Reliability improvement can be undertaken and achieved in different phases of the product life:
- Design phase
- Product validation phase (test, reliability growth)
- During its fielded life

The first option, the design phase, offers the most cost effective opportunities for product reliability improvement. Before the design is finalized, even considerably involved changes do not pose a great expense, other than design time. If design improvements are not excessively extensive, the necessary changes can often be made painlessly. Then the rest of product preparation (such as layout of printed circuit boards, tooling, component procurement) can be done without interruption or modifications. In the design phase, reliability improvements are achieved by identification of potential design deficiencies or potential manufacturing problems/defects that may compromise the reliability of a design. Some potential design flaws that are likely to be identified are as follows:
- Electrical or mechanical overstress of components
- Components inadequate to be used in that design (unreliable or improperly used)
- Potential relationship between failures, that is, secondary failures caused by occurrence of another failure or by the presence of an environmental stress
- Parts of inferior quality (reliability) as built by their respective manufacturers.
2.1 RELIABILITY DEFINITIONS RELATED TO THIS TUTORIAL
To assure proper understanding of the terms as they are used in this tutorial, some reliability definitions are included. These are as follows:

Reliability: the probability that an item can perform a required function under given conditions for a given time interval (IEV 191-12-01). Here, the required function is defined by expected performance that may vary depending on the use of the item and on the expectations. For a high-fidelity stereo audio/video product, the expectations are, for example, no audible noise or distortion. For a mechanical device, a pipe or an underwater connector housing, the expected performance would be that there is no bending greater than a predefined angle under some expected force. The measures for reliability or its complement, unreliability, would be the probability of survival past the end of a predetermined period, or the probability of failure before the end of a predetermined period, respectively. The measurement that is best understood by management is the percent of items surviving a time period (life or warranty).

Failure: the termination of the ability of an item to perform a required function (IEV 191-04-01). A failure can be classified as a failure of the hardware to operate properly due to:
- Design failure: a failure due to the inadequate design of an item to withstand operational and/or environmental stresses, or due to the use of an improper part
- Manufacturing defect causing time-related failures that compromise design reliability
- Software interactions with hardware
A failure of an item can also be attributed to a fault in the software code, that is, a failure of the software design.

Failure Cause: the circumstances during design, manufacture, or use which have led to a failure (IEV 191-04-01).

Failure Mechanism: the physical, chemical, or other process which led to a failure. An example would be crack propagation through the dielectric of a ceramic capacitor causing the capacitor to develop a small resistance and ultimately a short circuit.

Failure Mode: the manner or state in which an item or a component might fail. Examples of failure modes are:
- Low or no output from an IC
- Separation of the IC packaging material
- Capacitor fails short due to crack propagation
- Resistor fails open due to the poor welding of the connections
- FET saturates and overheats
- Seal leaks, etc.

One failure mode can have multiple causes. Examples of those are:
- IC enclosure fails due to one or more of the following: high humidity; high temperature; thermal cycling; IC manufacturing process
- Capacitor short: electrical overstress; high temperature (use or soldering); vehicle vibration
- A seal in an underwater cable connector may leak due to: water pressure causing dilatation of the material; cold temperature; wearout from mating and de-mating of the connector; defect in manufacturing (undersize)

3. FAULT TREE ANALYSIS AND ITS USE
A fault tree is used as a Boolean representation of a product design: a system, its assemblies and functions, failure modes, and their respective causes. Fault tree analysis of a design serves multiple purposes. One of its applications is modeling of the product's architecture and functionality in a top-down manner, searching for potential failure modes and their causes that might produce an unfavorable outcome defined as a product failure. It also quantitatively estimates the reliability of an item and its assemblies. Based on this information, one can identify those failure modes that are the highest contributors to the product's unreliability, and follow the investigation down to identify their respective causes. This allows for tradeoff and mitigation of those potential failure modes and, finally, evaluation of the achieved reliability improvement.

3.1 FAULT TREE INTRODUCTION
A fault tree is a logic diagram that represents functional dependencies of parts of a system. The top gate represents the unfavorable outcome of the system, and all other unfavorable outcomes that contribute to the system failure are represented as gates, logically connected to the top gate. Components of a fault tree are:
- Gates, which are outcomes of one or a combination of input events or other gates
- Cut sets, which are groups of outcomes or events that, if they occurred, would cause a system failure. A minimal cut set contains the minimum number of events that are required for a failure outcome. The removal of one of them would result in the system surviving.
Types of events and gates, along with their definitions and graphical representations, are shown in Table 3.1.
Table 3.1. Graphical Representation and Definitions of Gates and Events

| Symbol Name | Description | Reliability Model | Inputs |
|---|---|---|---|
| BASIC EVENT | Basic event for which reliability information is available | Component failure mode, or a failure mode cause | 0 |
| CONDITIONAL EVENT | Event that is a condition of occurrence of another event when both must occur for the output to occur | Occurrence of event that must occur for another event to occur; conditional probability | 0 |
| DORMANT EVENT | A basic event that represents a dormant failure | Dormant component failure mode or dormant failure cause | 0 |
| UNDEVELOPED EVENT | A part of the system that has yet to be developed or defined | A contributor to the probability of failure. Structure of that system part is not yet defined | 0 |
| TRANSFER GATE | Gate indicating that this part of the system is developed in another part or page of the diagram | A partial reliability block diagram that is shown in another location of the overall system | |
| OR GATE | The output event occurs if any of its input events occur | Failure occurs if any of the parts of that system fails (series system) | |
| MAJORITY VOTE GATE | The output occurs if m of the inputs occur | Redundancy k out of n, where m = n-k+1 | |
| EXCLUSIVE OR | The output event takes place if one, but not the other, input occurs | A failure of the system occurring only if one, not both, of the two possible failures happens | |
| AND GATE | The output event takes place if all of the input events occur | Parallel redundancy, one out of n equal or different branches | |
| PRIORITY AND | The output event (failure) occurs only if the input events occur in sequence from left to right | Good for representation of secondary failures or for enabling sequence of events | |
| INHIBIT GATE | The output occurs only if both of the input events take place, one of them conditional | Conditional probability; occurrence of the final event | |
| NOT GATE | The outcome is present only if the input event does not occur | Exclusive events, or a preventive measure that does not take place | |
3.2 SYSTEM ANALYSIS METHODOLOGY
3.2.1 Classical System Reliability Analysis
When a system is complex to model, that is, if it contains many interlocked or common branches, standard modeling can become extremely cumbersome, lengthy, and subject to mathematical (computational) errors. An example of a simple-looking, yet complex, bridge circuit is shown in Figure 3.2-1.

Figure 3.2-1. Bridge Circuit

In the bridge circuit above, the signal must flow from input A to output B. It can flow through block 3 in both directions. The analytical solution is to model the system under two circumstances: assuming that block 3 is good, in which case the signal flows through blocks 1 or 2 and then 4 or 5, as if they were parallel blocks; or assuming that block 3 is bad (the condition that 3 failed), in which case blocks 1 and 4 in series are parallel to blocks 2 and 5, also in series. This is represented by the following equation:

$$R_S = (R_1 + R_2 - R_1 R_2)\,(R_4 + R_5 - R_4 R_5)\,R_3 + [R_1 R_4 + R_2 R_5 - R_1 R_2 R_4 R_5]\,(1 - R_3)$$

$$R_S = 0.991$$

(This value corresponds to the block failure probabilities used in the numerical example of Section 3.2.2.)

When a system contains a multitude of complex subsystems of different kinds, the algebraic representation rapidly becomes too involved and cumbersome to solve. In addition, these complex equations need to contain a multitude of conditional probabilities to account for environmental effects and secondary failures. This only adds to the already extensive complexity of the calculations.

3.2.2 System Reliability Analysis Using a Fault Tree
The complex system shown in Figure 3.2-1 can be easily modeled using Boolean algebra with a fault tree or success tree representation. Cut sets in this system are made of the following combinations:
- Blocks 1 and 2 (c1 = 1,2)
- Blocks 4 and 5 (c2 = 4,5)
- Blocks 1, 3, and 5 (c3 = 1,3,5)
- Blocks 2, 3, and 4 (c4 = 2,3,4)
Should any of the above combinations fail, the signal flow from A to B will be interrupted. With Boolean algebra, the probability of the system failure would be:

$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4)$$

The probability of cut set 1 is:

$$\Pr(c_1) = F_1 F_2 = (1 - R_1)(1 - R_2)$$

The Esary-Proschan calculation, which treats the cut sets as if they were independent, is then:

$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)]\,[1 - \Pr(c_2)]\,[1 - \Pr(c_3)]\,[1 - \Pr(c_4)]$$

With the RARE event approximation, this calculation would be:

$$F_S = \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4)$$

$$F_S = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$

While easy to implement, the RARE approximation may introduce sizeable errors into calculations when the failure probabilities are larger numbers. Any failure probability larger than a multiple of $10^{-2}$ will produce an unwanted error. This is shown in the example below:

$$F_1 = 2 \times 10^{-2},\quad F_2 = 5 \times 10^{-2},\quad F_3 = 8 \times 10^{-2},\quad F_4 = 2.5 \times 10^{-2},\quad F_5 = 3 \times 10^{-1}$$

Esary-Proschan:

$$F_S = 1 - (1 - F_1 F_2)(1 - F_4 F_5)(1 - F_1 F_3 F_5)(1 - F_2 F_3 F_4) = 9.068 \times 10^{-3}$$

RARE:

$$F_{SR} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4 = 9.08 \times 10^{-3}$$
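The three ways of evaluating the bridge circuit can be compared numerically. The following is a minimal Python sketch, assuming independent block failures and using the block failure probabilities and cut sets given in the text; the function names are illustrative only and are not taken from any FTA package.

```python
from math import prod

# Block failure probabilities from the numerical example (F_i = 1 - R_i).
F = {1: 2e-2, 2: 5e-2, 3: 8e-2, 4: 2.5e-2, 5: 3e-1}

# Minimal cut sets of the bridge circuit, as listed in the text.
CUT_SETS = [(1, 2), (4, 5), (1, 3, 5), (2, 3, 4)]


def cut_set_probability(cut_set, f):
    """Probability that every block in the cut set has failed."""
    return prod(f[block] for block in cut_set)


def esary_proschan(cut_sets, f):
    """1 - product over cut sets of (1 - Pr(c_i))."""
    return 1.0 - prod(1.0 - cut_set_probability(c, f) for c in cut_sets)


def rare_event(cut_sets, f):
    """Rare-event approximation: plain sum of the cut set probabilities."""
    return sum(cut_set_probability(c, f) for c in cut_sets)


def bridge_failure_by_conditioning(f):
    """Classical solution: condition on block 3 being good or failed."""
    r = {k: 1.0 - v for k, v in f.items()}
    block3_good = (r[1] + r[2] - r[1] * r[2]) * (r[4] + r[5] - r[4] * r[5])
    block3_bad = r[1] * r[4] + r[2] * r[5] - r[1] * r[2] * r[4] * r[5]
    reliability = block3_good * r[3] + block3_bad * (1.0 - r[3])
    return 1.0 - reliability


if __name__ == "__main__":
    print(f"Conditioning on block 3: {bridge_failure_by_conditioning(F):.4e}")
    print(f"Esary-Proschan:          {esary_proschan(CUT_SETS, F):.4e}")
    print(f"Rare event:              {rare_event(CUT_SETS, F):.4e}")
```

Both cut-set formulas come out slightly above the value obtained by conditioning on block 3, because the cut sets share blocks and are therefore not independent events.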
Software packages commercially available for FTA are based on Boolean algebra, and most of them contain the constant failure rate model for unavailability:

$$Q(t) = \frac{\lambda}{\lambda + \mu}\left[1 - e^{-(\lambda + \mu)t}\right]$$

where $\mu$ is the repair rate (the reciprocal of the mean time to repair, MTTR). If the time to repair is considered infinite (nonrepairable items), then $\mu = 0$, and:

$$Q(t) = F(t)$$

Other information that can be obtained with FTA software is:
- Failure frequency (hazard rate) of all gates
- Number of expected failures during the predetermined time
- Unavailability or probability of failure of the system at any gate
- Gate summaries in various forms
- Confidence intervals
- Sensitivity analysis
- Calculations using distributions other than exponential

The circuit from Figure 3.2-1 represented by FTA is shown in Figure 3.2-2.
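The constant failure rate unavailability model can be written as a short function. This is a sketch only: the parameter names and the numerical values in the example are hypothetical and are not taken from the tutorial or from any commercial FTA package.

```python
import math


def unavailability(t, failure_rate, repair_rate=0.0):
    """Q(t) = lambda / (lambda + mu) * (1 - exp(-(lambda + mu) * t)).

    failure_rate is lambda, repair_rate is mu (1/MTTR); both use the same
    time unit as t. With repair_rate = 0 (nonrepairable item) the result
    reduces to F(t) = 1 - exp(-lambda * t).
    """
    total = failure_rate + repair_rate
    if total == 0.0:
        return 0.0
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x.
    return failure_rate / total * -math.expm1(-total * t)


if __name__ == "__main__":
    # Hypothetical values: lambda = 1e-5 per hour, mission time 1000 hours.
    print(unavailability(1000.0, 1e-5))          # nonrepairable item
    print(unavailability(1000.0, 1e-5, 1e-2))    # repairable, MTTR = 100 h
```

For a repairable item the result approaches the steady-state value λ/(λ + μ) as t grows, while for a nonrepairable item it approaches 1.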
Figure 3.2-2. FTA Diagram of the Bridge Circuit

Different gates of a fault tree are used to represent different circuit models, as shown in the following examples.

Example 1: Combination of series and redundant blocks (events). The reliability block diagram of this combination is shown in Figure 3.2-3.

Figure 3.2-3. Series Parallel Circuit Configuration

The corresponding equations are as follows:

$$F_1 = 0.002,\quad F_2 = 0.0005,\quad F_3 = 0.0032,\quad n = 3,\quad m = 2$$

$$F_{Gate1} = 1 - (1 - F_1)(1 - F_2)$$

$$F_{Gate2} = \sum_{i=0}^{n-m} \frac{n!}{i!\,(n-i)!}\,(1 - F_3)^{i}\,F_3^{(n-i)} \quad \text{(2 out of 3)}$$

$$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2})$$

$$F_{TopGate} = 2.53 \times 10^{-3}$$

The FTA representation of the reliability block diagram in Figure 3.2-3 is shown in Figure 3.2-4.

Figure 3.2-4. FTA Representation of a Series-Parallel Reliability Block Diagram

With different redundant blocks (Figures 3.2-3 and 3.2-4), the redundant gates are different, F3, F4, and F5 instead of the repeated F3, and the calculations are done in a similar way (binomial). The three different redundant blocks are shown in Example 2 of the conditional probability, where Gate 2 has three different gates representing the three redundant blocks (Figure 3.2-5).

Example 2: Use of a priority gate. The event F2 will occur only if the event F1 has occurred (conditional probability). The equivalent fault tree is shown in Figure 3.2-5.

Figure 3.2-5. Example of a Priority Gate (Gate 1)

The associated mathematics is as follows:

$$n = 3,\quad m = 2$$

$$F_1 = 0.002,\quad F_2 = 0.0005,\quad F_3 = 0.00045,\quad F_4 = 0.00053,\quad F_5 = 0.0032$$

$$F_{Gate1} = F_1 F_2$$

$$F_{Gate2} = F_3 F_4 + F_3 F_5 + F_4 F_5$$

$$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2})$$

$$F_{TopGate} = 4.374 \times 10^{-6}$$

Example 3: A real life example of a priority gate is the analysis of a switching amplifier, where on all four outputs (+1, -1, +2, and -2) there are four switching FETs, followed by noise and EMI filtering. For the FETs to operate properly (in the switching mode), the Logic Ground (LGround) must be maintained at a certain voltage. This voltage is 5 V, maintained by a voltage regulator filtered by two ceramic capacitors. Should the LGround voltage decrease below 2 V, the FETs will start operating in linear mode and will then saturate. This condition not only constitutes a failure, but could eventually cause the FET to overheat. The voltage would decrease in the event that one of the voltage filtering capacitors developed a small resistance close to a short. Here, LGround below 2 V is the condition for the FETs to saturate and overheat.

In the old design, the voltage-filtering capacitors had a dielectric with Y5V characteristics, which has a higher concentration of voids and could develop and propagate a crack more easily than other ceramics (especially in harsher environments such as the one that this analysis was performed for). This characteristic, along with the less than adequate voltage rating, contributed to a relatively high projected probability of failure for the specified lifetime. With replacement of both of the voltage filtering capacitors by capacitors having a dielectric with X7R characteristics and a higher voltage rating, the 10-year probability of occurrence of FET overheat was reduced from 2.0969E-3 (per FET) to 1.0009E-4, which was an improvement by a factor of about 21. The original circuit, as modeled with the fault tree, is shown in Figure 3.2-6.
[Figure 3.2-6 diagram. Recoverable labels: top gate FET4 OVERHEAT, "Overheat of FET due to LGND <2V" (Q=2.0969e-3); conditional event FET 4 SATURATION, "FET saturates due to LGND <2V" (Q=3.5000e-1); gate LGROUND, "LGND shorted to ground causing improper FET bias and overheat" (Q=5.9911e-3); gates Short C905 and Short_C906, "Ceramic capacitor shorts to ground" (Q=3.0001e-3 each); part random failure of each capacitor (Q=3.0000e-3 each); manufacturing-defect shorts MFG_SHORT_C905 and MFG_Short_C906 (Q=7.0000e-8 each), caused by debris on the PCB (Q=5.0000e-8) or excessive solder between the pins or pads (Q=2.0000e-8); dendritic short (labeled DANDREIC SHORT), "Electrolyte mixed with debris making a short" (Q=4.6875e-13), from EL. CAP LEAK (Q=3.1250e-6; AGEING Q=1.0000e-6, HI-TEMP Q=1.2500e-7, HI_HUMIDITY Q=2.0000e-6) and DEBRIS, "Presence of debris on the board" (Q=1.5000e-7).]
Figure 3.2-6. Practical Example of a Priority Gate

Example 4: Use of an inhibit gate is shown in Figure 3.2-7. With the inhibit gate, for the outcome to constitute a failure, all of the input events (in our case three) must take place. A practical example of this modeling would be the connection of three EMI filtering capacitors. If the failure mode is defined as no filtering, all three would have to fail.
Figure 3.2-7. Example of an Inhibit Gate
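The gate arithmetic of Examples 1 and 2 can be checked with a few lines of Python. This is a sketch under the assumption of independent events; the helper names are hypothetical, and the binomial sum of the voting gate is written here over the number of failed inputs, which gives the same value as summing over the surviving ones.

```python
from math import comb, prod


def or_gate(probabilities):
    """Output occurs if any input occurs (series system)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Output occurs only if all inputs occur."""
    return prod(probabilities)


def vote_gate_identical(m, n, f):
    """Output occurs if at least m of n identical inputs occur."""
    return sum(comb(n, i) * f**i * (1.0 - f) ** (n - i) for i in range(m, n + 1))


# Example 1: series blocks F1, F2 and a 2-out-of-3 gate of identical blocks F3.
ex1_gate1 = or_gate([0.002, 0.0005])
ex1_gate2 = vote_gate_identical(2, 3, 0.0032)
ex1_top = or_gate([ex1_gate1, ex1_gate2])

# Example 2: priority gate (F2 conditional on F1, taken as the product F1*F2)
# and a 2-out-of-3 gate of different blocks, using the pairwise products
# F3*F4 + F3*F5 + F4*F5 exactly as written in the text.
f3, f4, f5 = 0.00045, 0.00053, 0.0032
ex2_gate1 = and_gate([0.002, 0.0005])
ex2_gate2 = f3 * f4 + f3 * f5 + f4 * f5
ex2_top = or_gate([ex2_gate1, ex2_gate2])

if __name__ == "__main__":
    print(f"Example 1 top gate: {ex1_top:.3e}")
    print(f"Example 2 top gate: {ex2_top:.4e}")
```

The inhibit gate of Example 4, where all three filtering capacitors must fail, would use the same product rule as `and_gate`.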
3.3 BUILDING OF A FAULT TREE
Building a fault tree of a product (a system made of subsystems, assemblies, and components) is a top-down process where, as a first step, one must define what constitutes the failure of that product. For a high quality audio amplifier, anything that the end user might hear and qualify as degraded performance constitutes the system failure. The next step is to outline the system architecture and the major functions such as:
- Power supply
- Video amplifier
- Audio amplifier

The further analysis going down determines what phenomena preclude proper operability of those parts or functions, i.e.:
- Shorted line voltage or no VCC supplied
- No video processing
- One or more audio channels not operational

More detailed analyses further determine the causes of those phenomena and their contributing factors, down to the causes of failure modes such as:
- Electrical overstress
- High humidity
- High temperature

A detailed example of how a fault tree analysis is done is shown in another real life example, an analog input to an analog to digital converter of an audio amplifier. The partial circuit of this amplifier is shown in Figure 3.3-1. This part of the amplifier is normally known as the CODEC, as analog input signals are converted into digital signals, and then again into linear output. The signals are directed into an IC that is an analog to digital converter. For the amplifier to be operational, all signals have to be processed by the CODEC, meaning that they have to be coded and decoded.

The input signal 1+ into the left channel of IC U20 is interrupted if:
- R200, R209, or C171 fail open
- C179 shorts to ground, shorting the signal to ground

The input signal 2+ into the right channel of U20 is interrupted if:
- R201, R205, or C172 fail open
- C177 shorts the signal to ground

The entire circuit will not work if no voltage is supplied to the analog input (pin 8), which happens if:
- R206 or R208 fail open, interrupting the supply of 2.3 V
Figure 3.3-1 Input into CODEC of an Audio Amplifier
The signal will be too noisy if C183 fails open (low frequency noise), or C181 fails open (high frequency noise). Other contributors to the failure are the lack of data inputs, which will not be considered in this example. The top level of the FTA representation of this analysis is shown in Figure 3.3-2.
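The failure logic described for the CODEC analog inputs can be written down as a Boolean structure function. The sketch below assumes that the top event is the OR of the four conditions named in the text (input 1 interrupted, input 2 interrupted, no 2.3 V supply, noisy signal); the function and variable names are hypothetical, and data inputs are left out, as in the example.

```python
# Components whose open or short failure triggers each intermediate event,
# taken from the description of the circuit in Figure 3.3-1.
INPUT1_OPEN = {"R200", "R209", "C171"}
INPUT1_SHORT = {"C179"}
INPUT2_OPEN = {"R201", "R205", "C172"}
INPUT2_SHORT = {"C177"}
SUPPLY_OPEN = {"R206", "R208"}
NOISE_OPEN = {"C183", "C181"}


def codec_analog_input_failed(failed_open, failed_short):
    """Return True if the given component failures cause the top event.

    failed_open and failed_short are collections of reference designators
    of components that have failed open or failed short, respectively.
    """
    failed_open = set(failed_open)
    failed_short = set(failed_short)

    input1_interrupted = bool(INPUT1_OPEN & failed_open or INPUT1_SHORT & failed_short)
    input2_interrupted = bool(INPUT2_OPEN & failed_open or INPUT2_SHORT & failed_short)
    no_analog_supply = bool(SUPPLY_OPEN & failed_open)
    signal_too_noisy = bool(NOISE_OPEN & failed_open)

    # Assumed top gate: OR of the four intermediate events.
    return (input1_interrupted or input2_interrupted
            or no_analog_supply or signal_too_noisy)


if __name__ == "__main__":
    print(codec_analog_input_failed({"R200"}, set()))   # input 1 interrupted
    print(codec_analog_input_failed(set(), {"C177"}))   # input 2 shorted
    print(codec_analog_input_failed({"C179"}, set()))   # C179 open is not listed
```

Keeping open and short failures in separate sets mirrors the way the text distinguishes failure modes of the same component.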
Circled in Figure 3.3-2 is the gate that needs to be developed for the analog inputs 1 and 2 described earlier. Figure 3.3-3 shows further development of that gate.
Figure 3.3-2 Top Level FTA of CODEC
Figure 3.3-3 Development of the FTA for inputs 1 and 2
Inputs 1 and 2 are then separately analyzed, and so are the noisy or absent analog voltages. Development of Input 1 is shown in Figure 3.3-4. The circle points out the open components that are to be further developed. The fault tree part in Figure 3.3-4 also contains a gate that points to the possible lack of the 2.3 V voltage. Capacitor C179, if failed short, would short the signal to ground. There are two possible reasons for this capacitor to fail short. One is the so-called part random failure. This term takes into consideration the environment that the capacitor is expected to be exposed to (temperature, vibration) as well as the operational stresses that the capacitor will see, such as its operating voltage. Thus, the term random failure does not describe just a failure that will occur at random; it describes the likelihood that a part having an intrinsic defect will fail under the given environmental and operational stresses.
Figure 3.3-4. Development of the FTA Down to Components and Their Failure Cause
3.4 CONTRIBUTION OF MANUFACTURING DEFECTS
Manufacturing defects causing time dependent failures are a vital contributor to product unreliability. Some contributors to components failing open are:
- Cold or insufficient solder, which after a period of time, due to relaxation and fatigue, causes connections to open. Vibration of a vehicle will cause the cold soldered joint to open as well.
- Missing components
- Components cracked during insertion
- Broken or bent pins or leads

Contributors of manufacturing flaws to components failing short are:
- Debris (at times un-cleaned flux) left on the board
- Excessive solder
- Bent pins (mostly ICs and connectors) shorting to another pin.

Another reason for the capacitor failure (Figure 3.3-4) would be a failure (a short) caused by manufacturing defects. Normally during production, if a PC board is not properly cleaned, debris left on it will produce so-called dendritic growth, which, in turn, might cause a short between terminals. A second manufacturing defect causing an electrical short is a result of inadequate soldering technique, where excessive solder develops a bridge between the terminals and causes a short. Further development of the fault tree will point to other components failing open or short, causing failure of the analog power supply or interruption of the second signal.

3.5 ORIGIN OF VALUES FOR THE BASIC EVENTS
To be able to estimate the final (top gate) product reliability, each of the events must have information on its reliability assigned to it. This information may be attached in the form of a failure rate, MTTF, or probability of failure. For mixtures of hardware, mechanical and electrical, perhaps the most straightforward way would be to represent all the information in the form of a probability of failure calculated for a predetermined time and a predetermined operational profile. For electrical components, data for event and failure mode probabilities comes from:
- Information from the manufacturer's life testing, which needs to be recalculated for the proper environmental and electrical stresses
- Software databases (commercially available)
- Field failure data, which would be the very last resort because of the many inconsistencies of data reporting and recording.

For mechanical components, the probability of failure needs to be calculated based on:
- Stresses (loads), and their geometry and distribution
- Materials
- Construction (design) of parts, such as shape and size
- Attachment of parts to other structures (adhesives, fasteners)

Based on all this information, the safety margin needs to be calculated, which in turn will produce a reliability value. For determination of the probability of occurrence of manufacturing defects, the approach may be two-fold. The probability associated with the manufacturing defects can come from factory or service data (field failure data). On the other hand, sometimes it is advisable to enter the required values into the reliability analysis, and then adjust the manufacturing process control to achieve this goal.
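The tutorial does not state how the safety margin of a mechanical part is turned into a reliability value. One common way, shown here purely as an illustrative assumption, is stress-strength interference with independent, normally distributed stress and strength; all numbers in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def safety_margin(strength_mean, strength_sd, stress_mean, stress_sd):
    """Safety margin for independent, normally distributed strength and stress."""
    return (strength_mean - stress_mean) / sqrt(strength_sd**2 + stress_sd**2)


def reliability_from_margin(margin):
    """Probability that strength exceeds stress: standard normal CDF of the margin."""
    return NormalDist().cdf(margin)


if __name__ == "__main__":
    # Hypothetical part: strength 500 +/- 30, applied stress 350 +/- 40 (same units).
    margin = safety_margin(500.0, 30.0, 350.0, 40.0)
    reliability = reliability_from_margin(margin)
    print(f"Safety margin: {margin:.2f}")
    print(f"Reliability:   {reliability:.6f}")
    print(f"Probability of failure for the basic event: {1.0 - reliability:.3e}")
```

The resulting probability of failure is in the same form as the values used for electrical components, so mechanical and electrical basic events can be mixed in one fault tree.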
4. FAILURE MODE DETECTION AND MITIGATION
When the probability of failure of the top-level gate has been calculated in a completed or partially completed fault tree analysis of a system, and it is concluded that reliability improvement is necessary, the next step is to identify the highest contributor to unreliability (a failure mode or a cause) and improve the design. The process then repeats for the next highest contributor. An example of such reliability improvement is a complex audio/video amplifier system. The top level of the system (the console) is shown in Figure 4-1. The Tuner is shown as an event because reference designator numbers are repeated between the bill of material of the system and that of the tuner. For that reason, the Tuner was analyzed separately, and its top-level unreliability is depicted as an event.
Figure 4-1 Top Level Fault Tree of Console and its Major Subsystems
For the given warranty period, the original unreliability value is not acceptable, as 7,365 systems out of 100,000 made would need service before the end of their respective warranty periods. The highest contributor to unreliability is the block marked SPDIF. This gate is developed on page 13 of the fault tree, as shown in Figure 4-2.
Figure 4-2 SPDIF Top Level Fault Tree
Within the SPDIF circuit, the highest contributor to unreliability is the part of the circuit that is an input to or output from the multiplexer. Further investigation leads to the SPDIF multiplexer, where the highest contributor is the IC U501 (Figure 4-3). The high failure probability of this IC is related to its package (TSSOP); in another package, SOIC, this IC is a reliable part. There were 3 of these units in the console. It was also apparent that the probability of failure of capacitors C513 and C517 was too high for ceramic capacitors, because they used the Y5V dielectric. There were about 116 capacitors of this type in the console.
Figure 4-3 The Components which were the Highest Contributors to the Console Unreliability
Once the design improvements were made, the console reliability improved to the point of almost meeting its aggressive goal. The resultant improvement is shown in Figure 4-4.
Figure 4-4 Console Reliability Goal, Planned Growth Curve, and the Actual Reliability
[Plot: console reliability (vertical axis, 0.91 to 1) versus duration of the design period in days (horizontal axis, 0 to 300). Curves: console goal R(1 year) = 0.992, planned reliability growth, and achieved reliability growth. Annotated points: initially calculated value; Y5V caps replaced by X7R (116); TSSOPs replaced by SOICs; transistors and FETs from a more reliable vendor.]
5. SUMMARY AND CONCLUSIONS
The Fault Tree Analysis can be successfully used for identification and mitigation of potential failure modes that contribute to the unreliability of a product. The FTA allows pictorial representation of the system, its architecture and functionality, and uses Boolean algebra and a multitude of modeling schemes to best represent the system operation and the interdependency of its failure modes. The FTA is used here to evaluate the individual failure mode contributions to the system unreliability and to arrive at the most viable solution for its reliability improvement. The methodology can be summarized as follows:
- Define what constitutes the system failure
- Start with the top level of the system, with an unfavorable outcome that defines the system failure
- Construct the fault tree downward, using logic to express reliability modeling techniques
- Follow the analysis down the fault tree to determine what assembly, signal, part, or manufacturing defect will cause a particular failure
- Develop the fault tree all the way down to the causes of pertinent failure modes
- Determine the respective probability of occurrence of individual causes. The software, when used for analysis, will roll up all information, producing the failure probability of the system, subsystems, and assemblies
- Identify those failure modes that are the highest contributors to unreliability and mitigate them
- Update the analysis, and monitor the resultant reliability improvement

Failure mode analysis with fault trees can be started at the start of a project and updated as more detailed information becomes available. There is no need to express reliability as a failure rate for all components, electrical, mechanical, and software. Fault tree modeling allows a mixture of various information (failure probability, different failure distributions) and does not require estimation of failure rates only, as the classical reliability predictions do. Modeling and reliability assessment of a product with fault tree analysis allows for timely design improvements while design changes are still possible, feasible, and inexpensive.

This methodology is also described in the draft IEC standards IEC 60300-3-1, Dependability management, Part 3: Application guide, Section 1: Analysis techniques for dependability: Guide on methodology, and IEC 61014, Reliability growth methods. The first standard is in its last draft for comments and vote. The second is also a draft, in circulation for comments.

6. REFERENCES AND BIBLIOGRAPHY
1. Joanne Bechta Dugan, Fault-Tree Analysis of Computer-Based Systems, 1999 Tutorial Notes, Reliability and Maintainability Symposium, Washington, DC.
2. Kiran Kumar Vemuri and Joanne Bechta Dugan, Reliability Analysis of Complex Hardware-Software Systems, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
3. Géza Szabó and Péter Gáspár, Practical Treatment Methods for Adaptive Components in the Fault-Tree Analysis, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
4. Alfredo H-S. Ang and Wilson H. Tang, Probability Concepts in Engineering Planning and Design, Volume II, Decision, Risk and Reliability, 1990.
5. Milena Krasich, Use of Fault Tree Analysis for Evaluation of System Reliability Improvements in Design Phase, Proceedings, Annual Reliability and Maintainability Symposium, January 2000, Los Angeles, California.
7. ATTACHMENT - TUTORIAL VISUALS

Fault Tree Analysis for Product Reliability Improvement
Milena Krasich, Bose Corporation, January 23, 2002

Tutorial Content
- General reliability definitions in accordance with IEC 60050(191) (1990), International Electrotechnical Vocabulary, Chapter 191: Dependability and quality of service
- Description of Fault Tree Analysis methodology
- Mathematics (statistics) associated with the Fault Tree Analysis
- Reliability modeling of a complex system using Fault Tree Analysis (FTA), in accordance with IEC 60300-3-1, Dependability Analysis Methods
- Examples of how the FTA is used for reliability improvement of electronics
- Methods for determination of failure probabilities for basic events
- Failure mode mitigation and reliability growth/improvement: a real-life example

Reliability Growth - Improvement
- Reliability improvement of a product can be achieved in various phases of its life:
  - Design phase
  - Test, product validation phase: test reliability growth
  - Fielded life: by upgrades, derivatives, recalls, etc.
- The most cost-effective reliability improvement is done during the product design
- Product reliability improvement is achieved by:
  - Identification of potential design flaws: component electrical overstress; potential mechanical overstress and failure; inadequate components or parts used; failure of one part caused by the failure of another part; use of parts that are of inferior quality/reliability
  - Identification of manufacturing problems

Reliability Definition and Considerations
- Reliability definition (IEV 191-12-01): probability that an item can perform a required function under given conditions for a given time interval
- Required function: defined by the expected performance, i.e.:
  - No audible noise
  - No distortion
  - No bending past the predetermined angle
- Measures:
  - Reliability: probability of survival after the end of a predetermined period
  - Unreliability: probability of failure before the end of the period
  - Measure as management sees it: percent of items surviving a predetermined time period, normally the warranty period, mission period, or other time period requiring proper product operation

Definition of Failure (IEV 191-04-01)
- The termination of the ability of an item to perform a required function
- Failure of hardware to operate properly due to:
  - Design failure: a failure due to inadequate design of an item (to withstand operational or environmental stresses), i.e. an improper part or improper use of a part in the design
  - Manufacturing defect causing time-related failures: a fault due to non-conformity during manufacture to the design of an item or to specified manufacturing processes
- Software failures: failure of software
- Failure cause: the circumstances during design, manufacture, or use which have led to failure
- Failure mechanism: the physical, chemical, or other process which led to a failure

Definition of Failure Mode
- Failure mode: manner or state in which an item or a component might fail
- Examples:
  - Low output of an IC
  - Separation of the IC packaging material
  - Capacitor fails short due to crack propagation in the dielectric (failure mechanism)
  - Resistor fails open; failure cause: poor lead welding
  - FET saturation and overheat
  - Gain change
  - Seal leakage
Cause of a Failure Mode
- Failure or failure mode cause
- One failure mode can have multiple causes
- Examples:
  - Causes of a capacitor short: electrical overstress, high temperature, vehicle vibration, high soldering temperature
  - Causes of an IC enclosure failure: moisture, high temperature, IC manufacturing process
  - Causes of a component open: poor soldering, manufacturing breakage in insertion
  - Causes of a seal leak in a communication application (under water, ocean bottom): water pressure causing dilatation, cold temperature, wearout during mating and de-mating, material degradation, manufacturing defect (under-size)

Use of a Fault Tree
- Fault Tree Analysis (FTA) is a Boolean representation of a system and its assemblies and functions, along with failure modes and their respective causes
- FTA serves multiple purposes:
  - To model the item/system architecture and functionality with a fault tree logic diagram, top down, in search of potential failure modes (and their respective causes) that might cause an unfavorable outcome defined as a failure of the system
  - To quantitatively estimate the item reliability
  - To identify those failure modes and causes that are the highest contributors to the item probability of failure
  - To evaluate necessary and possible improvements (trade-offs)
  - To assess the item reliability improvement as the potential failure modes are mitigated

Fault Tree - Introduction
- Fault tree: a logic diagram representing functional dependencies of parts of a system, and the arrangement of events causing unfavorable outcomes (system failure) that correspond to a predetermined failure definition
- Fault tree components:
  - Gates: outcomes of one or a combination of input events
  - Cut sets: groups of events that, if all occur, would cause a system failure. A minimal cut set contains the minimum number of events required for failure; removal of any one of them would result in the system not failing
  - Events: basic events, usually a failure cause. Each gets an assigned value: failure rate, MTBF, or failure probability

Events
- Basic event: an event for which reliability information is available. Reliability model: component failure mode, or a failure mode cause
- Conditional event: an event that is a condition of occurrence of another event, when both must occur for the output to occur. Reliability model: occurrence of an event that must occur for another event to occur

Events, cont.
- Dormant event: a basic event that represents a dormant failure. Reliability model: dormant component failure mode or dormant failure cause
- Undeveloped event: a part of a system not yet developed

Gates
- OR gate: the output event occurs if any of its input events occurs. Reliability model: failure occurs if any of the parts of the system fails (series system)
- AND gate: the output event takes place if all of the input events occur. Reliability model: parallel redundancy, one out of n equal or different branches
- Majority vote gate: the output occurs if m of the inputs occur. Reliability model: redundancy k out of n, where m = n - k + 1
- Priority AND gate: the output event (failure) occurs only if the input events occur in sequence from left to right. Reliability model: secondary failures or enabling events
Gates, cont.
- Exclusive OR gate: the output event takes place if one, but not the other, input occurs. Reliability model: a failure of the system occurring only if one, not both, of the two possible failures happens
- Inhibit gate: the output occurs only if both (or all) of the input events take place, one of them conditional. Reliability model: conditional probability of the final event
- Transfer gate: a gate indicating that this part of the system is developed in another part or page of the diagram. Reliability reference: a partial reliability block diagram that is shown in another location of the overall system block diagram

System Analysis Methods
- A complex system Reliability Block Diagram (RBD). Example failure: no signal flow from A to B
[Diagram: bridge network between A and B, with blocks 1 and 4 in the top branch, blocks 2 and 5 in the bottom branch, and block 3 as the cross link]
$R_S = (R_1 + R_2 - R_1 R_2)(R_4 + R_5 - R_4 R_5) R_3 + [R_1 R_4 + R_2 R_5 - R_1 R_2 R_4 R_5](1 - R_3)$
- Meaning of the algebraic solution: reliability of the system provided that block 3 is good, plus reliability of the system provided that block 3 is bad
- When a system is really complex, with a multitude of interrelationships between the assemblies, the algebraic solutions rapidly become too involved
- Environmental factors and manufacturing errors are left out

Modeling with a Fault Tree: Boolean Algebra
- Basis for the fault tree: Boolean algebra, used to produce minimal cut sets (or path sets)
- Cut sets: the system fails if any one of the cut sets occurs
$c_1 = \{1, 2\}$, $c_2 = \{4, 5\}$, $c_3 = \{1, 3, 5\}$, $c_4 = \{2, 3, 4\}$
$R_S = 1 - F_S$, $F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4)$
$\Pr(c_1) = F_1 F_2 = (1 - R_1)(1 - R_2)$
- Correct calculation (Esary-Proschan):
$\Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)][1 - \Pr(c_2)][1 - \Pr(c_3)][1 - \Pr(c_4)]$
- Rare event approximation:
$F_S = \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4) = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$

Comparison of the FTA Calculation Methods
- Esary-Proschan (correct calculations):
$F_s = 1 - (1 - F_1 F_2)(1 - F_4 F_5)(1 - F_1 F_3 F_5)(1 - F_2 F_3 F_4)$
- Rare approximation:
$F_{sr} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$
- Values: $F_1 = 2 \times 10^{-2}$, $F_2 = 5 \times 10^{-2}$, $F_3 = 8 \times 10^{-2}$, $F_4 = 2.5 \times 10^{-2}$, $F_5 = 3 \times 10^{-1}$
- Results: Esary-Proschan $F_s = 9.068 \times 10^{-3}$; rare approximation $F_{sr} = 9.080 \times 10^{-3}$
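The comparison above can be reproduced with a short script. This is a sketch, not the FTA software used in the tutorial: it takes the block failure probabilities and the four minimal cut sets from the visuals, evaluates the Esary-Proschan product and the rare event sum, and adds a brute-force enumeration of all block states as an extra cross-check. The enumeration is an addition for illustration; because the cut sets share blocks, it comes out slightly below the Esary-Proschan figure, and it agrees with the algebraic R_S expression of the block diagram.

```python
from itertools import product
from math import prod

# Block failure probabilities and minimal cut sets of the bridge example.
F = {1: 0.02, 2: 0.05, 3: 0.08, 4: 0.025, 5: 0.3}
CUT_SETS = [{1, 2}, {4, 5}, {1, 3, 5}, {2, 3, 4}]

def cut_set_probability(cut_set):
    """Probability that every block in the cut set has failed."""
    return prod(F[block] for block in cut_set)

def esary_proschan():
    """1 - product of (1 - Pr(c_i)), treating the cut sets as independent."""
    return 1.0 - prod(1.0 - cut_set_probability(c) for c in CUT_SETS)

def rare_event():
    """Sum of the cut set probabilities."""
    return sum(cut_set_probability(c) for c in CUT_SETS)

def exact_by_enumeration():
    """Add up the probability of every block-state combination that
    contains at least one fully failed cut set (cross-check only)."""
    blocks = sorted(F)
    total = 0.0
    for states in product([False, True], repeat=len(blocks)):
        failed = {b for b, is_failed in zip(blocks, states) if is_failed}
        if any(c <= failed for c in CUT_SETS):
            total += prod(F[b] if b in failed else 1.0 - F[b] for b in blocks)
    return total

print(f"Esary-Proschan : {esary_proschan():.4e}")
print(f"Rare event     : {rare_event():.4e}")
print(f"Enumeration    : {exact_by_enumeration():.4e}")
```

For failure probabilities of this size the three figures differ only in the third significant digit, which is why the rare event approximation is often acceptable for quick estimates.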
FTA Model with Esary-Proschan Calculation
[Fault tree of the bridge network between A and B. Top event "No signal at the output" (Failure, Q=9.068e-3), with four input gates:
- Cross 1, "Signal not going through the top first", Q=4.800e-4: block 1 fails, block 3 fails, block 5 fails
- Top, "Signal not passing through the top branch", Q=1.000e-3: block 1 fails, block 2 fails
- Bottom, "Signal not passing through the bottom branch", Q=7.500e-3: block 4 fails, block 5 fails
- Cross 2, "Signal not passing through the bottom block first", Q=1.000e-4: block 2 fails, block 3 fails, block 4 fails
Basic events: block 1 Q=2.000e-2, block 2 Q=5.000e-2, block 3 Q=8.000e-2, block 4 Q=2.500e-2, block 5 Q=3.000e-1]
FTA Representation of the RBD: Rare Approximation
[Fault tree of the bridge network between A and B. Top event "No signal at the output" (Failure, Q=9.080e-3), with four input gates: Cross 1 Q=4.800e-4 (blocks 1, 3, 5 fail), Top Q=1.000e-3 (blocks 1, 2 fail), Bottom Q=7.500e-3 (blocks 4, 5 fail), Cross 2 Q=1.000e-4 (blocks 2, 3, 4 fail). Basic events: block 1 Q=2.000e-2, block 2 Q=5.000e-2, block 3 Q=8.000e-2, block 4 Q=2.500e-2, block 5 Q=3.000e-1]

Example: Combination of Series and Redundant Events
[Fault tree: TOP1, "fails if Gate 1 OR Gate 2 fails", Q=2.530e-3
- GATE1, "Fails if event 1 OR event 2 occurs", Q=2.499e-3: EVENT1 Q=0.002, EVENT2 Q=0.0005
- GATE2, 2 out of 3, "Fails if 2 of the three events take place", Q=3.072e-5: EVENT3 Q=0.0032, EVENT4 Q=0.0032, EVENT5 Q=0.0032]
$F_1 = 0.002$, $F_2 = 0.0005$, $F_3 = 0.0032$, $n = 3$, $m = 2$
$F_{Gate1} = 1 - (1 - F_1)(1 - F_2)$
$F_{Gate2} = \sum_{i=m}^{n} \frac{n!}{i!\,(n-i)!} F_3^{\,i} (1 - F_3)^{n-i}$
$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2}) = 2.53 \times 10^{-3}$

Example: The Redundant Gates are Different
[Fault tree: TOP1, "fails if Gate 1 OR Gate 2 fails", Q=2.502e-3
- GATE1, "Fails if event 1 OR event 2 takes place", Q=2.499e-3: EVENT1 Q=0.002, EVENT2 Q=0.0005
- GATE2, 2 out of 3, "Fails if any two of the events take place", Q=3.374e-6: EVENT3 Q=0.00045, EVENT4 Q=0.00053, EVENT5 Q=0.0032]
$F_1 = 0.002$, $F_2 = 0.0005$, $F_3 = 0.00045$, $F_4 = 0.00053$, $F_5 = 0.0032$, $n = 3$, $m = 2$
$F_{Gate1} = 1 - (1 - F_1)(1 - F_2)$
$F_{Gate2} = 1 - (1 - F_3 F_4)(1 - F_3 F_5)(1 - F_4 F_5)$
$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2}) = 2.502 \times 10^{-3}$

Priority Gate - Example
[Fault tree: TOP1, "fails if Gate 1 OR Gate 2 fails", Q=4.374e-6
- GATE1, priority gate, "Fails only if EVENT1 occurs first", Q=1.000e-6: EVENT1 Q=0.002, EVENT2 Q=0.0005
- GATE2, 2 out of 3, "Fails if any of the two events takes place", Q=3.374e-6: EVENT3 Q=0.00045, EVENT4 Q=0.00053, EVENT5 Q=0.0032]
- Gate 1, conditional probability:
  - Probability of occurrence of EVENT1 = $F_1$
  - Probability of occurrence of event 2 if event 1 occurred = $F_2$
  - $F_{Gate1} = F(\mathrm{EVENT1}) \cdot F(\mathrm{EVENT2} \mid \mathrm{EVENT1})$
$F_{Gate2} = 1 - (1 - F_3 F_4)(1 - F_3 F_5)(1 - F_4 F_5)$
$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2})$
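The gate examples above can be checked with a few helper functions. This is only a sketch: the function names are hypothetical, and the majority vote gate is evaluated from its minimal cut sets (every combination of m inputs) combined with the Esary-Proschan product. That choice is an assumption about how the displayed Q values were obtained, made because it reproduces them to the digits shown.

```python
from itertools import combinations
from math import prod

def or_gate(probabilities):
    """Output occurs if any input occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def priority_and_gate(p_first, p_second_given_first):
    """Output occurs only if the first event occurs and then the second."""
    return p_first * p_second_given_first

def vote_gate(probabilities, m):
    """At least m of the inputs occur, evaluated from the cut sets formed by
    every combination of m inputs (Esary-Proschan product, an assumption)."""
    cut_sets = combinations(probabilities, m)
    return 1.0 - prod(1.0 - prod(cut_set) for cut_set in cut_sets)

# "The Redundant Gates are Different"
gate1 = or_gate([0.002, 0.0005])
gate2_different = vote_gate([0.00045, 0.00053, 0.0032], 2)
top_different = or_gate([gate1, gate2_different])

# "Combination of Series and Redundant Events" (three equal events)
gate2_equal = vote_gate([0.0032, 0.0032, 0.0032], 2)
top_equal = or_gate([gate1, gate2_equal])

# "Priority Gate - Example"
gate1_priority = priority_and_gate(0.002, 0.0005)
top_priority = or_gate([gate1_priority, gate2_different])

for label, value in [("different", top_different),
                     ("equal", top_equal),
                     ("priority", top_priority)]:
    print(f"TOP1 ({label}): {value:.3e}")
```

Swapping the OR gate for a priority gate on the same two events moves the top event from about 2.5e-3 to about 4.4e-6, which shows how strongly the choice of gate drives the result.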
Example Partial Schematic of a Switching Amplifier
[Figure: partial schematic of a switching amplifier]
Example of the Priority and AND Gate: Switching Amp Before Improvement
[Fault tree, page 1: top gate FET4 OVERHEAT, Q=2.0969e-3. Gates and events shown: FET 4 SATURATION; LGROUND Q=5.9911e-3; Short C905 Q=3.0001e-3 (MFG_SHORT_C905 Q=7.0000e-8, PRF_SHORT_C6905 Q=3.0000e-3); Short C906 Q=3.0001e-3 (MFG_Short_C906 Q=7.0000e-8, PRF_C6906 Q=3.0000e-3); DANDREIC SHORT; DEBRIS Q=1.5000e-7; AGEING Q=1.0000e-6; HI-TEMP Q=1.2500e-7; DEBRIS_C6905 and DEBRIS_C6906, with further basic-event values Q=5.0000e-8 and Q=2.0000e-8 shown twice each. Event descriptions: electrolyte leak due to high humidity; debris on the PCB causing a short; capacitor fails due to part random failure; capacitor leaking electrolyte due to ageing; electrolyte leak due to high temperature]

After Capacitor Improvement (0.033 µF replaced 0.1 µF)
[Fault tree, page 1: top gate FET4 OVERHEAT, Q=1.0009e-4. Gates and events shown: FET 4 SATURATION; LGROUND Q=2.8596e-4; Short C905 Q=1.4299e-4 (MFG_SHORT_C905 Q=7.0000e-8, PRF_SHORT_C6905 Q=1.4292e-4); Short C906 Q=1.4299e-4 (MFG_Short_C906 Q=7.0000e-8, PRF_C6906 Q=1.4292e-4); SOLDER SHORT_C6906; DANDREIC SHORT; DEBRIS Q=1.5000e-7; AGEING Q=1.0000e-6; HI-TEMP Q=1.2500e-7; DEBRIS_C6905 and DEBRIS_C6906, with further basic-event values Q=5.0000e-8 and Q=2.0000e-8 shown twice each. Event descriptions as in the tree before improvement]

Inhibit Gate - Example
[Fault tree: TOP1, "fails if Gate 1 OR Gate 2 fails", Q=1.001e-6
- GATE1, "Fails only if EVENT1 happens before EVENT2", Q=1.000e-6: EVENT1 Q=0.002, EVENT2 Q=0.0005
- GATE2, "Fails if all of the events take place", Q=7.632e-10: EVENT3 Q=0.00045, EVENT4 Q=0.00053, EVENT5 Q=0.0032]
- Gate 1: conditional probability
- Gate 2, inhibit: the outcome occurs only if all three (or any number) of events or gates take place
- Example: three EMI protection capacitors in parallel; no filtering if all three fail open
$F_{Gate2} = F_3 F_4 F_5 = 7.632 \times 10^{-10}$

Other Important Information from an FTA Software
- Failure frequency (hazard rate of all gates)
- Number of expected failures during the preset lifetime
- Unavailability (or availability) of the system or any gate (function or assembly), provided the system is assumed repairable
- Gate summary in various forms
- Confidence intervals on the provided information (failure probability or unavailability)
- Sensitivity analysis: the most critical component variation in probability of occurrence
- Results from failure distributions other than exponential (constant failure rate)
- Results calculated with multiple simulations (we normally set the number of simulations to 10,000)
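The capacitor branch of the switching amplifier example and the inhibit gate example can be checked numerically. The sketch below assumes that each capacitor short is an OR of its manufacturing and part-random-failure events, and that LGROUND is an OR of the two capacitor shorts. The tree structure is only partly legible in the visuals, so both are assumptions, adopted because they reproduce the displayed values.

```python
from math import prod

def or_gate(probabilities):
    """Output occurs if any input occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def and_gate(probabilities):
    """Output occurs only if all inputs occur (independent inputs)."""
    return prod(probabilities)

MFG_SHORT = 7.0000e-8         # manufacturing short, per capacitor
PRF_SHORT_BEFORE = 3.0000e-3  # part random failure, original capacitor
PRF_SHORT_AFTER = 1.4292e-4   # part random failure, replacement capacitor

def lground(prf_short):
    """Assumed structure: LGROUND = Short C905 OR Short C906."""
    short_c905 = or_gate([MFG_SHORT, prf_short])
    short_c906 = or_gate([MFG_SHORT, prf_short])
    return or_gate([short_c905, short_c906])

lground_before = lground(PRF_SHORT_BEFORE)
lground_after = lground(PRF_SHORT_AFTER)
improvement_factor = lground_before / lground_after

# Inhibit gate example: all three events must occur.
inhibit_gate2 = and_gate([0.00045, 0.00053, 0.0032])
inhibit_top = or_gate([1.000e-6, inhibit_gate2])

print(f"LGROUND before: {lground_before:.4e}, after: {lground_after:.4e}")
print(f"Improvement factor: {improvement_factor:.1f}")
print(f"Inhibit GATE2: {inhibit_gate2:.3e}, TOP1: {inhibit_top:.3e}")
```

The same ratio of roughly 21 appears between the top gate values before and after the capacitor change (2.0969e-3 and 1.0009e-4), which is consistent with the capacitor shorts dominating that branch.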
Building a Fault Tree
- Define the system
- Define its major parts or functions, i.e.:
  - Power supply
  - Video
  - Audio channels
- Determine what phenomenon precludes proper operability of those parts or functions, i.e.:
  - Shorted line voltage or no VCC supplied
  - No video
  - One or more audio channels not operational
- Determine the causes of those phenomena
- Determine the contributing factors to the causes, i.e.:
  - High temperature
  - High humidity
  - Electrical overstress
Example: Input to CODEC of an Amplifier
[Figure: schematic of the input circuit to the CODEC of an amplifier]

Rationale for Analysis of A to D Input Circuit
- For the amplifier to be operational, all signals have to be processed by the CODEC (coded and decoded)
- In the CODEC, the analog signal is converted to digital, and then again into analog for the analog output
- The input signal 1+ into the left channel of IC U20 is interrupted if:
  - Components fail open: R200, R209, C171
  - C179 shorts to ground (shorting the signal)
- The input signal 2+ into the right channel of IC U20 is interrupted if:
  - Components fail open: R201, R205, C172
  - C177 shorts to ground (shorting the signal)
- Opening of C117 might cause some noise, which will be filtered later in the circuit

Rationale for Analysis of A to D Conversion Input Circuit
- The entire circuit will not work if:
  - No voltage is supplied to the analog input (pin 8): open R206 or R208 (if open, slight non-audible distortion), or short C174 or C176 (if any of the caps opens, no failure)
  - No 5V analog is supplied to pin 7: C181 or C183 fails short
  - U20 fails in whichever mode (low, high, or no output)
- There will be no output to the D to A conversion and the rest of the amp if failed open: R214, R215, R218, and R219 (if shorted, not too much harm)
- Not all failure modes need to be considered if they are not important to the failure definition: realistic prediction

FTA Representation of CODEC Analysis
Failure: no analog output from CODEC; one of the reasons: no analog inputs into it (1 or 2)
[Fault tree, page 5: top gate "No analog output from CODEC available" (Analog Outputs 1 and 2, Q=3.3535e-2, page 1). Gates shown:
- "Failure of A to D conversion for channel 1 and 2": A to D 1 and 2, Q=1.7487e-2, page 30 (go to page 30 for the analog inputs)
- "One or more digital outputs from U20 not available": Digital from U20, Q=1.4648e-2, page 29
- "D to A conversion for analog outputs 1 and 2": D to A for A_OUT_1&2+, Q=1.5972e-2
- "Analog outputs not available": A_OUT_1 and 2, Q=5.5314e-4
Lower-level events: "No digital input provided for the U21" (D input to U21, Q=1.4648e-2, page 27); "No data available from CAD_1" (DAC_1_DAT, Q=0.0000); "U21 failure" (Fail_U21, Q=3.2979e-4, page 198); "5V Analog not delivered or noisy" (5 V ANA to U21, Q=1.0147e-3, page 43); "Analog output 1 not available" (A_OUT_1, Q=2.7661e-4, page 66); "Analog output 2 not available" (A_OUT_2, Q=2.7661e-4, page 65)]
FTA Representation of CODEC Analysis, cont.
Failure causes: one of the plus inputs (1 or 2) not provided to the converter; no 5V analog supply voltage provided; IC U20 not operational
[Fault tree, page 30: gate "Failure of A to D conversion for channel 1 and 2" (A to D 1 and 2, Q=1.7487e-2, from page 5). Gates and events shown:
- "Analog input 1 to CODEC not available": A_IN_1_+, Q=7.2087e-3, page 71
- "Analog input 2 to CODEC not available": A_IN_2_+, Q=8.1431e-3, page 69
- "U20 failure": Fail_U20, Q=1.4721e-3, page 200
- "Analog inputs 1 and/or 2 not available": Analog Inputs 1 & 2, Q=5.9081e-3
  - "Input 1 to CODEC A to D not available or too noisy": Input 1 into A to D, Q=3.1481e-3, page 68
  - "Input 2 to CODEC A to D not available": Input 2 into A to D, Q=4.2788e-3, page 125
- "5V Analog not delivered or noisy": 5 V ANA, Q=1.0147e-3
  - "5V analog not available": No 5V Analog, Q=7.7798e-4, page 44
  - "High or low frequency noise introduced to the signal": Noise on 5V ANA, Q=2.3692e-4, page 201]
Input 1 Not Available
[Fault tree, page 68: gate "Input 1 to CODEC A to D not available or too noisy" (Input 1 into A to D, Q=3.1481e-3, from page 30)
- "Capacitor fails short, shorting signal 1 to the ground": Short_C179, Q=2.0730e-4
  - "Connection short due to the manufacturing defect": MFG_Short_C179, Q=7.0000e-8
    - "Debris on the PCB causing dendritic growth and a short": Debris_C179, Q=2e-008
    - Solder_short_C179, Q=5e-008
  - "Capacitor fails short due to the part random failure": PRF_Short_C179, Q=0.000207226
- "Voltage not available" (2.3 V supply): 2.3V, Q=2.6595e-3, page 126
- "Open components interrupting the signal or causing noise": Open Comp, Q=4.1504e-4, page 140]
- PRF: random failure of the part
- Failure probabilities of manufacturing defects are assigned to the manufacturing process quality requirement

Signal Noisy or Interrupted Due to Open Components
[Fault tree, page 140: gate "Open components interrupting the signal or causing noise" (Open Comp, Q=4.1504e-4, from page 68)
- "Open capacitor causes high frequency noise on the input": Open_C179, Q=1.3238e-4
  - MFG_Open_C179, Q=1.3000e-8 (Cold solder_C179 Q=1.2e-008, Missing_C179 Q=1e-009)
  - PRF_Open_C179, Q=0.000132368
- "Resistor fails open, +2.3 V not available for the analog input": Open_R206, Q=5.1649e-5
  - MFG_Open_R206, Q=1.3000e-8 (Cold solder_R206 Q=1.2e-008, Missing_R206 Q=1e-009)
  - PRF_Open_R206, Q=5.16358e-005
- "Resistor fails open, signal interrupted": Open_R209, Q=5.1649e-5
  - MFG_Open_R209, Q=1.3000e-8 (Cold solder_R209 Q=1.2e-008)
  - PRF_Open_R209, Q=5.16358e-005
- "Open capacitor interrupts the signal": Open_El_C171, Q=1.2778e-4
  - MFG_Open_El_C171, Q=1.3000e-8 (Cold solder_El_C171 Q=1.2e-008, Missing_El_C171 Q=1e-009)
  - PRF_Open_El_C171, Q=0.000127767
- "Resistor fails open, signal interrupted": Open_R200, Q=5.1649e-5
  - MFG_Open_R200, Q=1.3000e-8 (Cold solder_R200 Q=1.2e-008)
  - PRF_Open_R200, Q=5.16358e-005
Event descriptions: connections open due to the manufacturing defect (connection opens due to insufficient or improper soldering; part not inserted during assembly); part fails open due to the part random failure]

Failure Due to No Analog Voltage Supply
[Fault tree: gate "5V analog not available" (No 5V Analog, Q=7.7798e-4, from page 30)
- "Capacitor fails short, shorting +5V analog to the ground": Short_EL_C183, Q=1.1136e-4
  - "Connection short due to the manufacturing defect": MFG_Short_El_C183, Q=7.0000e-8 (Debris_El_C183 Q=2e-008, Solder_short_El_C183 Q=5e-008)
  - "Capacitor fails short due to the part random failure": PRF_Short_El_C183, Q=9.36354e-005
  - "Electrolyte leak due to the capacitor random failure": PRF_Leak_El_C183, Q=1.76564e-005
- "The 5V analog shorts to ground, no voltage for pin 7 of U20": Short_C181, Q=2.7416e-4
  - MFG_Short_C181, Q=7.0000e-8 (Debris_C181 Q=2e-008, "Debris on the PCB causing dendritic growth and a short"; Solder_short_C181 Q=5e-008)
  - PRF_Short_C181, Q=0.000274094
- "Voltage not available": +5V_ANA, Q=3.9264e-4, page 67]

High or Low Frequency Noise into the CODEC
[Fault tree: gate "High or low frequency noise introduced to the signal" (Noise on 5V ANA, Q=2.3692e-4, from page 30)
- "Open capacitor causes low frequency noise": Open_El_C183, Q=6.1851e-5
  - "Capacitor connections open due to the manufacturing defect": MFG_Open_El_C183, Q=1.3000e-8 (Cold solder_El_C183 Q=1.2e-008, Missing_El_C183 Q=1e-009)
  - "Capacitor fails open due to the part random failure": PRF_Open_El_C183, Q=6.18377e-005
- "Open capacitor causes high frequency noise": Open_C181, Q=1.7508e-4
  - MFG_Open_C181, Q=1.3000e-8 (Cold solder_C181 Q=1.2e-008, Missing_C181 Q=1e-009)
  - PRF_Open_C181, Q=0.000175069]
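The roll-up that FTA software performs on such trees can be sketched with a small recursive evaluator. The example uses the "No 5V Analog" branch, with the structure inferred from the visual and every gate treated as an OR of independent inputs. Both points are assumptions: the gate types are not legible in the extracted figure, and they were chosen because the result reproduces the displayed gate values.

```python
from math import prod

def or_gate(probabilities):
    """Output occurs if any input occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def evaluate(node):
    """A node is (name, probability) for a basic event or
    (name, [child nodes]) for an OR gate. Returns the node probability."""
    _, content = node
    if isinstance(content, list):
        return or_gate(evaluate(child) for child in content)
    return content

short_el_c183 = ("Short_EL_C183", [
    ("MFG_Short_El_C183", [
        ("Debris_El_C183", 2e-8),
        ("Solder_short_El_C183", 5e-8),
    ]),
    ("PRF_Short_El_C183", 9.36354e-5),
    ("PRF_Leak_El_C183", 1.76564e-5),
])

short_c181 = ("Short_C181", [
    ("MFG_Short_C181", [
        ("Debris_C181", 2e-8),
        ("Solder_short_C181", 5e-8),
    ]),
    ("PRF_Short_C181", 0.000274094),
])

no_5v_analog = ("No 5V Analog", [
    short_el_c183,
    short_c181,
    ("+5V_ANA", 3.9264e-4),  # value transferred from another page
])

for gate in (short_el_c183, short_c181, no_5v_analog):
    print(f"{gate[0]}: Q={evaluate(gate):.4e}")
```

Because the manufacturing-defect events are several orders of magnitude below the part random failures, the roll-up makes it visible at a glance which leaves dominate each gate.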
Contribution of Manufacturing Defects
- Contribution to components failing open:
  - Cold or insufficient solder: the connection opens over time due to solder fatigue or vibrations
  - Missing components: an amazingly large number of components are not inserted during assembly, which is detected later when the function is exercised
  - Components cracked during insertion
  - Broken or bent pins or leads
- Contribution to failing short:
  - Debris (un-cleaned flux) left on the board that, with dendritic growth, causes a short
  - Excessive solder
  - Bent pins (mostly ICs and connectors) shorting with another pin
Example of Failure Probability Calculations

Automotive amplifier Life expectancy: 15 years Average active time (ON) daily: 2.7 hours Assumptions: Car stereo ON when driving automotive or Ground Mobile (GM) environment Car stereo OFF while car parked stationary thermally uncontrolled environment (GF) dormancy applies
F(15years) = 1 exp( GM t
Values for the Basic Events

Electrical components
Information from manufacturers (life test data)
Need to be adjusted for the proper environment and stresses
Software databases Field use (last resort)
Mechanical components
Determine stresses - loads (mechanical, environmental) Construct stress/strength equation for multiple loads if required Calculate design (safety) margin and reliability (probability of failure) for the required life
Manufacturing defects
Factory data Field failure data
1-23-2002 M. Krasich 38
1-23-2002
t ON = 365 15 (24 2.7) GFD = GF d where d = dormancy factor 0.1

M. Krasich
Component probability of failure can be calculated as: ) exp( GFD t ) ON OFF t ON = 365 15 2.7
40
Probability of the Seal Wear

The wear or spiral fracture of the Parker Fluorocarbon seals is noticed when the squeeze was 0.017 per side failure definition for a 0.210 cross section Abrasion resistance of Fluorocarbon is determined (Parker Handbook) to be good with the properly determined seal compression (squeeze) Radius of the above seal is found from:
Part of the Failure Mode Probability Worksheet

Ratios of the one-sided compression and the respective radii are: r1 = 0.017/0.21; r2 = 0.004/0.2585
The probability of the actual seal failure in ten years of life is:
F(10 years) = [expression in r1, r2, (0.3 r1) and (0.1 r2); layout lost in extraction] = 1.464 x 10^-6

pn            desc                      ref
191470-332    CAP,0603,X7R,50V,3300PF   C540
191470-473    CAP,0603,X7R,50V,.047UF   C541
254110        DIODE,SCHOTTKY,40V,3A,S   D803
135247-5232   DIODE,ZEN,5.6V,225MW,5%   D306
147239        DIODE,DUAL,SOT-23,BAW56   D206
147239        DIODE,DUAL,SOT-23,BAW56   D707
147239        DIODE,SWITCHING,75V,200   D702
147239        DIODE,SOT-23,BAV 99       D100
147239        DIODE,SOT-23,BAV 99       D101

rem                fr / Failure mode ratio   Failure rate   Dormant FR   R(Ta)      F0           F1
PRF_C540           0.0089                    8.937E-09      8.937E-10    0.999922   7.8285E-05   7.8285E-06
PRF_Short_C540     0.75                      6.7028E-09     6.7028E-10   0.999941   5.8714E-05   5.8714E-06
PRF_ChValue_C540   0.1                       8.937E-10      8.937E-11    0.999992   7.8288E-06   7.8288E-07
PRF_Open_C540      0.15                      1.3406E-09     1.3406E-10   0.999988   1.1743E-05   1.1743E-06
PRF_C541           0.0114                    1.1351E-08     1.1351E-09   0.999901   9.943E-05    9.943E-06
PRF_Short_C541     0.75                      8.5133E-09     8.5133E-10   0.999925   7.4573E-05   7.4573E-06
PRF_ChValue_C541   0.1                       1.1351E-09     1.1351E-10   0.99999    9.9434E-06   9.9434E-07
PRF_Open_C541      0.15                      1.7027E-09     1.7027E-10   0.999985   1.4915E-05   1.4915E-06
PRF_D803           0.01                      9.95E-09       9.95E-10     0.999913   8.7158E-05   8.7158E-06
PRF_Short_D803     0.2                       1.99E-09       1.99E-10     0.999983   1.7432E-05   1.7432E-06
PRF_Open_D803      0.45                      8.955E-10      8.955E-11    0.999992   7.8445E-06   7.8445E-07
PRF_ParamCh_D803   0.35                      6.965E-10      6.965E-11    0.999994   6.1013E-06   6.1013E-07
PRF_D306           0.003                     3E-09          3E-10        0.999974   2.628E-05    2.628E-06
PRF_Short_D306     0.2                       6E-10          6E-11        0.999995   5.256E-06    5.256E-07
PRF_Open_D306      0.45                      2.7E-10        2.7E-11      0.999998   2.3652E-06   2.3652E-07
PRF_ParamCh_D306   0.35                      2.1E-10        2.1E-11      0.999998   1.8396E-06   1.8396E-07
PRF_D206           0.0101                    1.0146E-08     1.0146E-09   0.999911   8.8875E-05   8.8875E-06
PRF_Short_D206     0.51                      5.1745E-09     5.1745E-10   0.999955   4.5327E-05   4.5327E-06
PRF_Open_D206      0.29                      1.5006E-09     1.5006E-10   0.999987   1.3145E-05   1.3145E-06
PRF_ParamCh_D206   0.2                       1.0349E-09     1.0349E-10   0.999991   9.0656E-06   9.0656E-07
PRF_D707           0.0101                    1.0146E-08     1.0146E-09   0.999911   8.8875E-05   8.8875E-06
PRF_Short_D707     0.51                      5.1745E-09     5.1745E-10   0.999955   4.5327E-05   4.5327E-06
PRF_Open_D707      0.29                      1.5006E-09     1.5006E-10   0.999987   1.3145E-05   1.3145E-06
PRF_ParamCh_D707   0.2                       1.0349E-09     1.0349E-10   0.999991   9.0656E-06   9.0656E-07
PRF_D702           0.0172                    1.72E-08       1.72E-09     0.999849   0.00015066   1.5066E-05
PRF_Short_D702     0.92                      1.5824E-08     1.5824E-09   0.999861   0.00013861   1.3861E-05
PRF_Open_D702      0.08                      1.2659E-09     1.2659E-10   0.999989   1.1089E-05   1.1089E-06
PRF_D100           0.0101                    1.0146E-08     1.0146E-09   0.999911   8.8875E-05   8.8875E-06
PRF_Short_D100     0.51                      5.1745E-09     5.1745E-10   0.999955   4.5327E-05   4.5327E-06
PRF_Open_D100      0.29                      1.5006E-09     1.5006E-10   0.999987   1.3145E-05   1.3145E-06
PRF_ParamCh_D100   0.2                       1.0349E-09     1.0349E-10   0.999991   9.0656E-06   9.0656E-07
PRF_D101           0.0101                    1.0146E-08     1.0146E-09   0.999911   8.8875E-05   8.8875E-06
PRF_Short_D101     0.51                      5.1745E-09     5.1745E-10   0.999955   4.5327E-05   4.5327E-06
PRF_Open_D101      0.29                      1.5006E-09     1.5006E-10   0.999987   1.3145E-05   1.3145E-06
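The C540 entries of the parts table (PRF_C540 and PRF_Short_C540) can be cross-checked with a few lines of Python. This is a minimal sketch, not the author's spreadsheet. It assumes that Ta is one year of continuous operation (8760 hours), that reliability follows the constant-failure-rate model R(Ta) = exp(-rate x Ta), that F0 = 1 - R(Ta), and that the 0.75 ratio belongs to the short failure mode (inferred from the row order). All function and variable names are illustrative.

```python
import math

# Assumption: Ta is one year of continuous operation, in hours.
HOURS_PER_YEAR = 8760.0


def mode_failure_rate(part_rate_per_hour, mode_ratio):
    """Failure rate of one failure mode as a share of the part failure rate."""
    return part_rate_per_hour * mode_ratio


def unreliability(rate_per_hour, hours=HOURS_PER_YEAR):
    """F = 1 - exp(-rate * hours) under a constant failure rate.

    -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x.
    """
    return -math.expm1(-rate_per_hour * hours)


# Values copied from the table for capacitor C540.
c540_rate = 8.937e-9          # failures per hour
c540_short_ratio = 0.75       # assumed to be the "short" failure mode

c540_short_rate = mode_failure_rate(c540_rate, c540_short_ratio)
c540_f0 = unreliability(c540_rate)
c540_reliability = 1.0 - c540_f0
c540_short_f0 = unreliability(c540_short_rate)

print(f"C540 short-mode rate: {c540_short_rate:.4e} per hour")
print(f"C540 R(Ta): {c540_reliability:.6f}, F0: {c540_f0:.4e}")
print(f"C540 short-mode F0: {c540_short_f0:.4e}")
```

Under these assumptions the computed values agree, within rounding, with the table entries for PRF_C540 and PRF_Short_C540.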
FTA Top Level Audio/Video Console example

- Start from the system top level
- Include only those failure modes that affect the system performance
- Represent the system architecture: functional, hardware, or mix
- When the work is completed, look for the highest contributor to unreliability

[Fault tree figure] Top event: System failure or improper operation (Postman System Console, Q=7.365e-2)

Intermediate events (IE) under the top event:

Intermediate event                                         Gate            Q          Page
Analog signal not available                                ANALOG SIGNAL   1.162e-2   2
No or improper power delivered to the system               Power Supply    3.280e-3   12
Tuner failure                                              Tuner           4.423e-3
No video                                                   Video           3.221e-3   8
No SPDIF both zones                                        SPDIF           4.946e-2   13
Failure of these functions causes noticeable difference    Functions       3.464e-2   11
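Finding the highest contributor among the top-level branches amounts to ranking the Q values shown on the slide. The Python sketch below copies those six values from the fault tree. The dictionary and function names are illustrative and are not part of any FTA tool.

```python
# Q values of the six branches under the top event, as read from the slide.
top_level_q = {
    "ANALOG SIGNAL": 1.162e-2,
    "Power Supply": 3.280e-3,
    "Tuner": 4.423e-3,
    "Video": 3.221e-3,
    "SPDIF": 4.946e-2,
    "Functions": 3.464e-2,
}


def rank_contributors(q_by_branch):
    """Return (branch, Q) pairs sorted from the highest Q to the lowest."""
    return sorted(q_by_branch.items(), key=lambda item: item[1], reverse=True)


ranking = rank_contributors(top_level_q)
highest_name, highest_q = ranking[0]

for name, q in ranking:
    print(f"{name:15s} Q = {q:.3e}")
```

With the slide's values, the first entry is SPDIF (Q = 4.946e-2), followed by Functions (Q = 3.464e-2), so SPDIF is the branch to follow down to its subassemblies.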

The Highest Contributor to Unreliability - Example

- Follow the highest hitter down to its subassemblies
- Look for the highest contributor to its unreliability

[Figure: fault tree, Page 13]

Audio/Video Console Reliability Growth Monitoring

[Figure: Console Reliability (y-axis, 0.91 to 1) versus Duration of the design period (days) (x-axis, 0 to 300), showing the Planned Reliability Growth and Achieved Reliability Growth curves and the Console goal R(1 year) = 0.992. Annotations: Initially calculated; Y5V caps replaced by X7R (116); TSSOPs replaced by SOICs; Transistors and FETs from a more reliable vendor.]
Fault Tree Analysis for Reliability Growth - Summary

- Define what constitutes a system failure
- Start with the unfavorable outcome that defines the system failure
- Construct the fault tree down, using logic to express reliability modeling techniques
- Follow the analysis: failure of what assembly, signal, or part will cause the particular failure
- Develop down to the causes of the pertinent failure modes
- Determine probabilities of occurrence of individual causes
- Identify the highest unreliability contributor or safety related failure modes and mitigate
- Improve reliability as necessary and possible
- Update the analysis, monitor reliability until the goal is met
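The steps of constructing the tree, assigning probabilities to the causes, and following the highest contributor downward can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool used for the tutorial: all events are treated as independent, only OR and AND gates are modeled, and every event name and probability in the example tree is hypothetical.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Union


@dataclass
class BasicEvent:
    """A cause at the bottom of the tree with a known probability."""
    name: str
    probability: float


@dataclass
class Gate:
    """A top or intermediate event that combines its inputs with OR or AND."""
    name: str
    kind: str  # "OR" or "AND"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def failure_probability(node):
    """Probability of a node, assuming independent input events."""
    if isinstance(node, BasicEvent):
        return node.probability
    input_probabilities = [failure_probability(child) for child in node.inputs]
    if node.kind == "OR":
        # The OR gate output occurs unless none of its inputs occur.
        return 1.0 - prod(1.0 - p for p in input_probabilities)
    if node.kind == "AND":
        # The AND gate output occurs only if all of its inputs occur.
        return prod(input_probabilities)
    raise ValueError(f"unsupported gate kind: {node.kind}")


def highest_contributor_path(node):
    """Follow the input with the highest probability down to a basic event."""
    path = [node.name]
    while isinstance(node, Gate):
        node = max(node.inputs, key=failure_probability)
        path.append(node.name)
    return path


# Hypothetical tree: names and probabilities are for illustration only.
top = Gate("system failure", "OR", [
    BasicEvent("cause A", 0.003),
    Gate("subassembly B", "OR", [
        BasicEvent("cause B1", 0.02),
        BasicEvent("cause B2", 0.03),
    ]),
    Gate("redundant pair C", "AND", [
        BasicEvent("cause C1", 0.01),
        BasicEvent("cause C2", 0.01),
    ]),
])

top_probability = failure_probability(top)
path = highest_contributor_path(top)
print(f"Top event probability: {top_probability:.6f}")
print("Highest contributor path:", " -> ".join(path))
```

After a mitigation changes one of the cause probabilities, rerunning the same two functions gives the updated top event probability and shows whether the highest contributor has moved to another branch.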
Detailed Failure Modes and Causes

Cause 1: TSSOPs
Cause 2: Caps with Y5V dielectric
The Benefit of FTA for the Design Reliability Growth

[Figure: Reliability (y-axis, 0.8 to 1.02) versus Design Time (Days) (x-axis, 0 to 300) for the System, Console, and Subwoofer.]
If 100,000 systems are produced in one year, 9,250 fewer will be returned for repair within the warranty period as a result of the reliability improvement.
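The arithmetic behind the warranty statement can be sketched in Python. The sketch assumes that the expected number of warranty returns equals the number of systems produced times the unreliability over the warranty period. The slide does not state the before and after reliability values, so the pair used in the second call is hypothetical and chosen only to reproduce the same difference. The function names are illustrative.

```python
def implied_reliability_gain(units_produced, fewer_returns):
    """Reliability improvement implied by a reduction in warranty returns."""
    return fewer_returns / units_produced


def avoided_returns(units_produced, reliability_before, reliability_after):
    """Reduction in expected returns when reliability improves."""
    returns_before = units_produced * (1.0 - reliability_before)
    returns_after = units_produced * (1.0 - reliability_after)
    return returns_before - returns_after


# Values stated on the slide: 100,000 systems and 9,250 fewer returns.
gain = implied_reliability_gain(100_000, 9_250)
print(f"Implied improvement in system reliability: {gain:.4f}")

# Hypothetical before/after reliabilities that differ by the same amount.
example = avoided_returns(100_000, 0.90, 0.9925)
print(f"Avoided returns in the hypothetical case: {example:.0f}")
```

The slide's figures imply an improvement in system reliability over the warranty period of 9,250 / 100,000 = 0.0925.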
