Milena Krasich, PE; Bose Corporation; MS 450; The Mountain, Framingham MA 01701-7330 USA e-mail: milena_krasich@bose.com.
SUMMARY & PURPOSE This tutorial introduces the use of the well-known technique of Fault Tree Analysis as a tool for reliability modeling and analysis of an electronic or mechanical design (including software), for identification of potential failure modes that are high contributors to unreliability, and for tradeoffs and mitigation of those failure modes. Applied early in the product design phase, this activity allows relatively inexpensive and easy improvements to the design and manufacturing process and, in that manner, achieves a considerable improvement in product reliability before the design is completed or the product is manufactured. A real example of this analysis, as applied to audio products, is discussed along with the achieved reliability improvement.
Milena Krasich
Milena Krasich is the Senior Technical Lead of Reliability Engineering in Design Assurance Engineering of Bose Corporation. Before joining Bose, she was a Member of Technical Staff in the Reliability Engineering Group of General Dynamics Advanced Technology Systems, formerly Lucent Technologies, and prior to that, she worked for the Jet Propulsion Laboratory in Pasadena, California. While in California, she was a part-time professor at the California State University Dominguez Hills, where she taught graduate courses in System Reliability, Advanced Reliability and Maintainability, and Statistical Process Control. At that time, she was also a part-time professor at the California State Polytechnic University, Pomona, teaching undergraduate courses in Engineering Statistics, Reliability, Environmental Testing, Production Systems Design, Measurements, and Materials Procurement. She holds a BS and MS in Electrical Engineering from the University of Belgrade, Yugoslavia, and is a California registered professional electrical engineer. She is also a member of the IEEE and ASQC Reliability Society, a Fellow and the past president of the Institute of Environmental Sciences and Technology, and a member of the College of Fellows of the Institute for Advancement of Engineering. Currently, she is a US Delegate to the International Electrotechnical Commission (IEC), working on dependability/reliability standards, and is a project leader for the revision of international standards for reliability growth.
Table of Contents
1. INTRODUCTION
1.1 Notation and Acronyms
2. Reliability Improvement
2.1 Reliability Definitions Related to This Tutorial
3. Fault Tree Analysis and Its Use
3.1 Fault Tree Introduction
3.2 System Analysis Methodology
3.3 Building of a Fault Tree
3.4 Contribution of Manufacturing Defects
3.5 Origin of Values for the Basic Events
4. Failure Mode Detection and Mitigation
5. Summary and Conclusions
6. References and Bibliography
7. Attachment - Tutorial Visuals
1. INTRODUCTION
Multiple methods have been used for the estimation of product reliability over the many decades in which reliability has been applied as a science. Many reasons, such as product criticality (medical devices, defense systems, transportation) or the need for competitiveness in the consumer industry, dictate the need for products with remarkably high reliability. Design alone, regardless of its features and technology, does not guarantee product reliability. A design team, conscious of good and reliable design methods such as proper component derating and ESD and EMI protection, may not be completely aware of all aspects of reliability modeling and of potential reliability shortfalls. This is especially the case when a product must be designed to operate in multiple environments, or when the specifics of component reliability (such as the dependency of reliability on applied stresses) are not well understood. Therefore, the reliability of a completed design may not be as required or as expected. In the past, attempts to improve product reliability concentrated on various types of Failure Mode and Effects Analysis (FMEA) and/or on dedicated Reliability Growth test programs. Both of those methods, applied individually or in conjunction, are useful but may not be cost effective or applicable. The first method, FMEA, is a valuable but very exhaustive attempt to identify the potential failure modes and to assure their mitigation. Starting from the bottom and going up, the analysis addresses each component (electrical or mechanical), the modes in which it might fail, and the effects that those failure modes might have on higher level assemblies and the system. The process is very tedious and is often completed well after the design is finished and the production period has begun. This might be too late to accomplish any measurable improvements without major expenses for redesign, new PC board layouts, and new tooling.
In addition, an FMEA of any type normally does not produce a measure of overall product reliability, so any achieved reliability improvement is also not measurable. One type of FMEA has a Risk Priority Number (RPN) associated with it; however, this number is a product of three numbers (each from 1 to 10) assigned to Severity, Occurrence, and Detection. Regardless of the strict rules applied in estimating these numbers, they are still only estimates and thus might be subjective. Another FMEA type, which includes criticality computation (FMECA), requires knowledge of failure rates; therefore, it cannot be applied to the analysis of systems with components for which failure probabilities, not failure rates, are a far better attribute. These methods also do not provide reliability estimates. Test methods for reliability improvement are even more costly, keeping in mind that they are performed on preproduction or production runs, meaning that the design is mature. In addition, the test units might be complex and expensive, so that only a limited number might be available for testing. Fault Tree Analysis combines many favorable aspects:
- It is timely and therefore low cost
- It is fast and easy to use
- It provides realistic reliability estimates at the same time as the failure mode analysis
- It measures the achieved reliability improvement and the final reliability of a product.
1.1 NOTATION AND ACRONYMS
λ(t) - Component failure rate, instantaneous failure rate
λ - Component failure rate if assumed to be constant
ESD - Electrostatic Discharge
EMI - Electromagnetic Interference
FTA - Fault Tree Analysis
FMEA - Failure Mode and Effects Analysis
FMECA - Failure Mode, Effects and Criticality Analysis
RPN - Risk Priority Number
MTTF - Mean Time to Failure
MTBF - Mean Time Between Failures
IEC - International Electrotechnical Commission
Q(t) - Unreliability as a function of time
Q - Unreliability assumed constant or calculated for a predetermined time
Pr - Probability
Pr(c) - Probability of occurrence of a cut set
FET - Field Effect Transistor
IC - Integrated Circuit
R - Reliability
F - Probability of failure (unreliability)
CODEC - Coder/Decoder
PRF - Part Random Failure
PCB - Printed Circuit Board
IEV - International Electrotechnical Vocabulary
2. RELIABILITY IMPROVEMENT
Reliability improvement can be undertaken and achieved in different phases of the product life:
- Design phase
- Product validation phase (test, reliability growth)
- During its fielded life
The first option, the design phase, offers the most cost effective opportunities for product reliability improvement. Before the design is finalized, even considerably involved changes do not pose a great expense other than design time. If the design improvements are not excessively extensive, the necessary changes can often be made painlessly. The rest of the product preparation (such as layout of printed circuit boards, tooling, and component procurement) can then be done without interruption or modifications. In the design phase, reliability improvements are achieved by identification of potential design deficiencies or potential manufacturing problems/defects that may compromise the reliability of a design. Some potential design flaws that are likely to be identified are as follows:
- Electrical or mechanical overstress of components
- Components inadequate for use in that design (unreliable or improperly used)
- Potential relationships between failures, that is, secondary failures caused by the occurrence of another failure or by the presence of an environmental stress
- Parts of inferior quality (reliability) as built by their respective manufacturers.
- Capacitor fails short due to crack propagation
- Resistor fails open due to the poor welding of the connections
- FET saturates and overheats
- Seal leaks, etc.
One failure mode can have multiple causes. Examples of those are:
IC enclosure fails due to one or more of the following:
- high humidity
- high temperature
- thermal cycling
- IC manufacturing process
Capacitor short:
- electrical overstress
- high temperature, use or soldering
- vehicle vibration
A seal in an underwater cable connector may leak due to:
- water pressure causing dilatation of the material
- cold temperature
- wearout from mating and de-mating of the connector
- defect in manufacturing (undersize)
2.1 RELIABILITY DEFINITIONS RELATED TO THIS TUTORIAL
To assure proper understanding of the terms as they are used in this tutorial, some reliability definitions are included. These are as follows:
Reliability: probability that an item can perform a required function under given conditions for a given time interval (IEV 191-12-01). Here, the required function is defined by the expected performance, which may vary depending on the use of the item and on the expectations. For a high-fidelity stereo audio/video product, the expectations are, for example, no audible noise or distortion. For a mechanical device, such as a pipe or an underwater connector housing, the expected performance would be that there is no bending greater than a predefined angle under some expected force. The measures for reliability or its complement, unreliability, would be the probability of survival past the end of a predetermined period, or the probability of failure before the end of a predetermined period, respectively. The measurement that is best understood by management is the percent of items surviving a time period (life or warranty).
Failure: the termination of the ability of an item to perform a required function (IEV 191-04-01). A failure can be classified as a failure of the hardware to operate properly due to:
- Design failure: a failure due to the inadequate design of an item to withstand operational and/or environmental stresses, or due to the use of an improper part
- Manufacturing defect: a defect causing time-related failures that compromise design reliability
3. FAULT TREE ANALYSIS AND ITS USE
A fault tree is used as a Boolean representation of a product design: a system, its assemblies and functions, failure modes, and their respective causes. Fault tree analysis serves several purposes in the analysis of a design. One of its applications is modeling the product's architecture and functionality in a top-down manner, searching for potential failure modes and their causes that might produce an unfavorable outcome defined as a product failure. It also estimates quantitatively the reliability of an item and its assemblies. Based on this information, one can identify those failure modes that are the highest contributors to the product's unreliability and follow the investigation down to identify their respective causes. This allows for tradeoff and mitigation of those potential failure modes and, finally, evaluation of the achieved reliability improvement.
3.1 FAULT TREE INTRODUCTION
A fault tree is a logic diagram that represents functional dependencies of parts of a system. The top gate represents the unfavorable outcome of the system, and all other unfavorable outcomes that contribute to the system failure are represented as gates, logically connected to the top gate. Components of a fault tree are:
- Gates, which are outcomes of one or a combination of input events or other gates
- Software interactions with hardware
A failure of an item can also be attributed to a fault in the software code (a failure of the software design).
Failure Cause: the circumstances during design, manufacture, or use which have led to a failure (IEV:191-0401).
Failure Mechanism: the physical, chemical, or other process which led to a failure. An example would be crack propagation through the dielectric of a ceramic capacitor, causing the capacitor to develop a small resistance and ultimately a short circuit.
Failure Mode: manner or state in which an item or a component might fail. Examples of failure modes are:
- Low or no output from an IC
- Separation of the IC packaging material
- Cut sets, which are groups of outcomes or events that, if they all occur, would cause a system failure. A minimal cut set contains the minimum number of events that are required for a failure outcome. The removal of any one of them would result in the system surviving.
Types of events and gates are shown in Table 3.1.
Table 3.1. Graphical Representation and Definitions of Gates and Events
Symbol Name | Description | Reliability Model
BASIC EVENT | Basic event for which reliability information is available | Component failure mode, or a failure mode cause
CONDITIONAL EVENT | Event that is a condition of occurrence of another event, when both must occur for the output to occur | Occurrence of an event that must occur for another event to occur; conditional probability
DORMANT EVENT | A basic event that represents a dormant failure | Dormant component failure mode or dormant failure cause
UNDEVELOPED EVENT | A part of the system that has yet to be developed or defined | A contributor to the probability of failure; the structure of that system part is not yet defined
TRANSFER GATE | Gate indicating that this part of the system is developed in another part or page of the diagram | A partial reliability block diagram that is shown in another location of the overall system
OR GATE | The output event occurs if any of its input events occurs | Failure occurs if any of the parts of that system fails (series system)
m OUT OF n GATE | The output occurs if m of the inputs occur | Redundancy k out of n, where m = n-k+1
EXCLUSIVE OR GATE | The output event takes place if one, but not the other, input occurs | A failure of the system occurring only if one, not both, of the two possible failures happens
AND GATE | The output event takes place if all of the input events occur | Parallel redundancy, one out of n equal or different branches
PRIORITY AND | The output event (failure) occurs only if the input events occur in sequence from left to right | Good for representation of secondary failures or for enabling sequence of events
INHIBIT GATE | The output occurs only if both of the input events take place, one of them conditional | Conditional probability of occurrence of the final event
NOT GATE | The outcome is present only if the input event does not occur | -
3.2 SYSTEM ANALYSIS METHODOLOGY
3.2.1 Classical System Reliability Analysis
When a system is complex to model, that is, if it contains many interlocked or common branches, standard modeling can become extremely cumbersome, lengthy, and subject to mathematical (computational) errors. An example of a simple, yet complex to model, bridge circuit is shown in Figure 3.2-1.
- Blocks 4 and 5 (c2 = 4,5)
- Blocks 1, 3, and 5 (c3 = 1,3,5)
- Blocks 2, 3, and 4 (c4 = 2,3,4)
Should all of the blocks in any of the above combinations fail, the signal flow from A to B will be interrupted. With Boolean algebra, the probability of the system failure would be:
$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4)$$
The probability of cut set 1 is:
$$\Pr(c_1) = F_1 F_2 = (1 - R_1)(1 - R_2)$$
The Esary-Proschan calculation, which treats the cut sets as independent and is an upper bound when they share blocks, is then:
$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)]\,[1 - \Pr(c_2)]\,[1 - \Pr(c_3)]\,[1 - \Pr(c_4)]$$
With the RARE event approximation, this calculation would be:
$$F_S = \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4)$$
$$F_S = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$
While easy to implement, the RARE approximation may introduce sizeable errors into calculations when the failure probabilities are larger numbers. Any failure probability larger than a small multiple of $10^{-2}$ will produce an unwanted error. This is shown in the example below:
$$F_1 = 2 \times 10^{-2},\quad F_2 = 5 \times 10^{-2},\quad F_3 = 8 \times 10^{-2},\quad F_4 = 2.5 \times 10^{-2},\quad F_5 = 3 \times 10^{-1}$$
Esary-Proschan:
$$F_S = 1 - (1 - F_1 F_2)(1 - F_4 F_5)(1 - F_1 F_3 F_5)(1 - F_2 F_3 F_4) = 9.068 \times 10^{-3}$$
RARE:
$$F_{SR} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4 = 9.08 \times 10^{-3}$$
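The comparison of the two cut-set calculations can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: the function names are hypothetical, the blocks are assumed to fail independently, and the brute-force enumeration over all block states is added here only as a reference value.

```python
from itertools import product

# Block failure probabilities from the worked example.
F = {1: 2e-2, 2: 5e-2, 3: 8e-2, 4: 2.5e-2, 5: 3e-1}
# Minimal cut sets of the bridge circuit.
CUT_SETS = [(1, 2), (4, 5), (1, 3, 5), (2, 3, 4)]


def cut_set_probability(cut):
    """Probability that every block in the cut set has failed."""
    p = 1.0
    for block in cut:
        p *= F[block]
    return p


def rare_event(cut_sets):
    """RARE approximation: sum of the cut set probabilities."""
    return sum(cut_set_probability(c) for c in cut_sets)


def esary_proschan(cut_sets):
    """Esary-Proschan: 1 minus the product of (1 - Pr(c_i))."""
    survive = 1.0
    for c in cut_sets:
        survive *= 1.0 - cut_set_probability(c)
    return 1.0 - survive


def exact_by_enumeration(cut_sets):
    """Add up every block state in which at least one cut set has fully failed."""
    blocks = sorted(F)
    total = 0.0
    for states in product([False, True], repeat=len(blocks)):  # True = failed
        failed = {b for b, s in zip(blocks, states) if s}
        if any(set(c) <= failed for c in cut_sets):
            p = 1.0
            for b in blocks:
                p *= F[b] if b in failed else 1.0 - F[b]
            total += p
    return total


print(f"RARE:           {rare_event(CUT_SETS):.4e}")
print(f"Esary-Proschan: {esary_proschan(CUT_SETS):.4e}")
print(f"Enumeration:    {exact_by_enumeration(CUT_SETS):.4e}")
```

With these inputs the enumeration gives the smallest of the three values, the Esary-Proschan result lies between it and the RARE sum, and the gap widens as the block failure probabilities grow.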
Figure 3.2-1. Bridge Circuit
In the bridge circuit above, the signal must flow from input A to output B. It can flow through block 3 in either direction. The analytical solution is to model the system under two conditions: assuming that block 3 is good, in which case the signal flows through blocks 1 or 2 and then through blocks 4 or 5, as if each pair were parallel blocks; or assuming that block 3 is bad (has failed), in which case blocks 1 and 4 are in series, in parallel with blocks 2 and 5, which are also in series. This is represented by the following equation:
$$R_s = (R_1 + R_2 - R_1 R_2)\,(R_4 + R_5 - R_4 R_5)\,R_3 + [R_1 R_4 + R_2 R_5 - R_1 R_2 R_4 R_5]\,(1 - R_3)$$
$$R_s = 0.991$$
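The decomposition on block 3 can be checked numerically with a short Python sketch. The function names are hypothetical; the inputs assume R_i = 1 - F_i with the block failure probabilities used in the numerical example of this section (2e-2, 5e-2, 8e-2, 2.5e-2, 3e-1), and independent blocks.

```python
def parallel(r_a, r_b):
    """Reliability of two independent branches in parallel (either one suffices)."""
    return r_a + r_b - r_a * r_b


def bridge_reliability(r1, r2, r3, r4, r5):
    """Condition on block 3: good with probability r3, failed with 1 - r3."""
    good_3 = parallel(r1, r2) * parallel(r4, r5)  # block 3 acts as a plain wire
    failed_3 = parallel(r1 * r4, r2 * r5)         # two independent series paths
    return good_3 * r3 + failed_3 * (1.0 - r3)


# R_i = 1 - F_i for blocks 1 to 5.
failure = [2e-2, 5e-2, 8e-2, 2.5e-2, 3e-1]
r_s = bridge_reliability(*(1.0 - f for f in failure))
print(f"R_s = {r_s:.3f}")
```

The printed value rounds to the 0.991 quoted above.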
When a system contains a multitude of complex subsystems of different kinds, the algebraic representation rapidly becomes too involved and cumbersome to solve. In addition, these complex equations need to contain a multitude of conditional probabilities to account for environmental effects and secondary failures. This only adds to the already extensive complexity of the calculations.
3.2.2 System Reliability Analysis Using a Fault Tree
The complex system shown in Figure 3.2-1 can be easily modeled using Boolean algebra with a fault tree or success tree representation. Cut sets in this system would be made of the following combinations:
- Blocks 1 and 2 (c1 = 1,2)
Commercially available software packages for FTA are based on Boolean algebra, and most of them contain the constant failure rate model for unavailability:
$$Q(t) = \frac{\lambda}{\lambda + \mu}\left[1 - e^{-(\lambda + \mu)t}\right]$$
If the time to repair (MTTR) is considered infinite (nonrepairable items), then the repair rate $\mu = 0$, and:
$$Q(t) = F(t)$$
Other information that can be obtained with FTA software is:
- Failure frequency (hazard rate) of all gates
- Number of expected failures during the predetermined time
- Unavailability or probability of failure of the system at any gate
- Gate summaries in various forms
- Confidence intervals
- Sensitivity analysis
- Calculations using distributions other than exponential
The circuit from Figure 3.2-1 represented by FTA is shown in Figure 3.2-2.
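The constant failure rate model for unavailability mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of any particular FTA package: the function name is hypothetical, the repair rate is taken as 1/MTTR, and the numerical rates are made up.

```python
import math


def unavailability(lam, mu, t):
    """Q(t) = lam / (lam + mu) * (1 - exp(-(lam + mu) * t)) for constant rates."""
    if lam < 0 or mu < 0 or t < 0:
        raise ValueError("rates and time must be non-negative")
    total = lam + mu
    if total == 0.0:
        return 0.0  # an item that never fails is never unavailable
    return lam / total * (1.0 - math.exp(-total * t))


# Hypothetical values: failure rate 1e-5 per hour, 8760 hours of operation.
lam, t = 1e-5, 8760.0
print(unavailability(lam, 0.0, t))         # nonrepairable: equals F(t) = 1 - exp(-lam * t)
print(unavailability(lam, 1.0 / 24.0, t))  # repairable, assuming a 24-hour MTTR
```

With the repair rate set to zero the expression reduces to the probability of failure F(t); with repair it levels off near lam / (lam + mu).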
Figure 3.2-2. FTA Diagram of the Bridge Circuit
Different gates of a fault tree are used to represent different circuit models, as shown in the following examples:
Example 1: Combination of series and redundant blocks (events). The reliability block diagram of this combination is shown in Figure 3.2-3.
$$F_{2\ \text{out of}\ 3} = \sum_{i=0}^{n-m} \frac{n!}{i!\,(n-i)!}\,(1 - F_3)^{i}\,F_3^{(n-i)}$$
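The binomial calculation for a redundant gate can be sketched as below. This is a hedged illustration: the function name and the block failure probability are hypothetical, the inputs are assumed identical and independent, and the sum is written over the number of failed inputs (at least m of n), which is the same event as fewer than k = n - m + 1 surviving blocks.

```python
from math import comb


def m_of_n_failure(f, n, m):
    """Probability that at least m of n identical, independent inputs fail."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    return sum(comb(n, j) * f**j * (1.0 - f) ** (n - j) for j in range(m, n + 1))


f3 = 0.05  # hypothetical failure probability of each redundant block
print(m_of_n_failure(f3, n=3, m=2))
```

Setting m = 1 reproduces an OR gate over the n inputs, and m = n reproduces an AND gate.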
The FTA representation of the reliability block diagram in Figure 3.2-3 is shown in Figure 3.2-4.
Figure 3.2-4. FTA Representation of a Series-Parallel Reliability Block Diagram
With different redundant blocks (Figures 3.2-3 and 3.2-4), the redundant gates are different, F3, F4, and F5 instead of the repeated F3, and the calculations are done in a similar way (binomial). The three different redundant blocks are shown with Example 2 of the conditional probability, where Gate 2 has three different gates representing the three redundant blocks (Figure 3.2-5).
Example 2: Use of a priority gate. The event F2 will occur only if the event F1 has occurred (conditional probability). The equivalent fault tree is shown in Figure 3.2-5.
Example 3: A real life example of a priority gate is the analysis of a switching amplifier, where on all four outputs (+1, -1, +2, and -2) there are four switching FETs, followed by noise and EMI filtering. For the FETs to operate properly (in the switching mode), the Logic Ground (LGround) must be maintained at a certain voltage. This voltage is 5 V, maintained by a voltage regulator and filtered by two ceramic capacitors. Should the LGround voltage decrease below 2 V, the FETs will start operating in linear mode and will then saturate. This condition not only constitutes a failure, but could eventually cause the FET to overheat. The voltage would decrease in the event that one of the voltage filtering capacitors developed a small resistance close to a short. Here, LGround below 2 V is the condition for the FETs to saturate and overheat. In the old design, the voltage filtering capacitors had a dielectric with Y5V characteristics, which has a higher concentration of voids and could develop and propagate a crack more easily than other ceramics (especially in harsher environments such as the one that this analysis was performed for). This characteristic, along with the less than adequate voltage rating, contributed to a relatively high projected probability of failure for the specified lifetime. After replacement of both voltage filtering capacitors with capacitors having a dielectric with X7R characteristics and a higher voltage rating, the 10-year probability of occurrence of FET overheat was reduced from 2.0969E-3 (per FET) to 1.0009E-4, which was an improvement by a factor of about 21. The original circuit, as modeled with the fault tree, is shown in Figure 3.2-6.
Figure 3.2-5. Example of a Priority Gate (Gate 1)
The associated mathematics is as follows: n = 3, m = 2
[Figure 3.2-6: fault tree diagram. Recoverable node labels and values: FET4 OVERHEAT (Q=2.0969e-3); LGROUND (Q=5.9911e-3); FET 4 SATURATION; DANDREIC SHORT; MFG_SHORT_C905; MFG_Short_C906; Short_C906 (Q=3.0001e-3); SOLDER SHORT_C6905 (excessive solder causing a short between the pins or pads); DEBRIS_C6905 and DEBRIS_C6906 (debris on the PCB causing a short); PRF_C6906 (Q=3.0000e-3; capacitor fails due to part random failure); AGEING (Q=1.0000e-6; capacitor leaking electrolyte due to ageing); HI-TEMP (Q=1.2500e-7; electrolyte leak due to high temperature); HI_HUMIDITY (Q=2.0000e-6). Values whose labels were not recovered: Q=7.0000e-8, Q=5.0000e-8, Q=2.0000e-8.]
Figure 3.2-6. Practical Example of a Priority Gate
Example 4: Use of an inhibit gate is shown in Figure 3.2-7. With the inhibit gate, for the outcome to constitute a failure, all of the input events (in our case three) must take place. A practical example of this modeling would be the connection of three EMI filtering capacitors. If the failure mode is defined as no filtering, all three would have to fail.
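How gate probabilities roll up from basic events can be sketched with two small Python helpers. This is an illustration only: the helper names are hypothetical, events are assumed independent, the assignment of the Figure 3.2-6 values to particular inputs is an assumption made here (the diagram itself was not recoverable), and the per-capacitor probability in the three-capacitor case is made up.

```python
def or_gate(probabilities):
    """Output occurs if any of the independent inputs occurs."""
    survive = 1.0
    for p in probabilities:
        survive *= 1.0 - p
    return 1.0 - survive


def and_gate(probabilities):
    """Output occurs only if all of the independent inputs occur."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


# Values appear in Figure 3.2-6; which event feeds which gate is assumed here.
mfg_short = or_gate([5.0e-8, 2.0e-8])       # two manufacturing-defect causes
short_c906 = or_gate([3.0e-3, mfg_short])   # part random failure or manufacturing short
print(f"{mfg_short:.4e}  {short_c906:.4e}")

# Example 4: "no filtering" requires all three EMI capacitors to fail.
# The per-capacitor probability of 3.0e-3 is hypothetical.
print(f"{and_gate([3.0e-3] * 3):.4e}")
```

Under these assumptions the first two results agree, to the displayed precision, with the 7.0000e-8 and 3.0001e-3 values that appear in the figure.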
3.3 BUILDING OF A FAULT TREE
Building a fault tree of a product (a system made of subsystems, assemblies, and components) is a top-down process in which, as a first step, one must define what constitutes the failure of that product. For a high quality audio amplifier, anything that the end user might hear and qualify as degraded performance constitutes the system failure. The next step is to outline the system architecture and the major functions such as:
- Power supply
- Video amplifier
- High temperature
A detailed example of how a fault tree analysis is done is shown in another real life example, an analog input to an analog to digital converter of an audio amplifier. The partial circuit of this amplifier is shown in Figure 3.3-1. This part of the amplifier is normally known as the CODEC, as analog input signals are converted into digital form and then again into a linear output. The signals are directed into an IC that is an analog to digital converter. For the amplifier to be operational, all signals have to be processed by the CODEC, that is, they have to be coded and decoded.
The input signal 1+ into the left channel of IC U20 is interrupted if:
- R200, R209, or C171 fail open
- C179 shorts to ground, shorting the signal to ground
The input signal 2+ into the right channel of U20 is interrupted if:
- R201, R205, or C172 fail open
- C177 shorts the signal to ground
The entire circuit will not work if:
- No voltage is supplied to the analog input (pin 8)
- R206 or R208 fail open, interrupting the supply of 2.3 V
- Audio amplifier
Further analysis going down determines what phenomena preclude proper operability of those parts or functions, i.e.:
- Shorted line voltage or no VCC supplied
- No video processing
- One or more audio channels not operational
More detailed analyses further determine the causes of those phenomena and the contributing factors, down to the causes of failure modes such as:
- Electrical overstress
- High humidity
The signal will be too noisy if C183 fails open (low frequency noise) or C181 fails open (high frequency noise). Other contributors to the failure are the lack of data inputs, which will not be considered in this example. The top level of the FTA representation of this analysis is shown in Figure 3.3-2.
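The failure logic described for the analog inputs can be captured in a small tree structure and rolled up to a top-level probability. The sketch below is a minimal illustration: the class and gate names are hypothetical, every Q value is a made-up placeholder rather than data from the analysis, the top gate is assumed to be an OR of the listed contributors, and events are assumed independent and not repeated in the tree.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Event:
    name: str
    q: float  # probability of failure over the analysis period


@dataclass
class Gate:
    name: str
    kind: str  # "OR" or "AND"
    inputs: List[Union["Gate", Event]] = field(default_factory=list)


def probability(node):
    """Roll probabilities up the tree, assuming independent, non-repeated events."""
    if isinstance(node, Event):
        return node.q
    values = [probability(child) for child in node.inputs]
    result = 1.0
    if node.kind == "AND":
        for v in values:
            result *= v
        return result
    if node.kind == "OR":
        for v in values:
            result *= 1.0 - v
        return 1.0 - result
    raise ValueError(f"unsupported gate kind: {node.kind}")


# All Q values below are hypothetical placeholders.
input_1 = Gate("INPUT_1_INTERRUPTED", "OR", [
    Event("R200_OPEN", 1e-5), Event("R209_OPEN", 1e-5),
    Event("C171_OPEN", 2e-5), Event("C179_SHORT", 3e-4),
])
input_2 = Gate("INPUT_2_INTERRUPTED", "OR", [
    Event("R201_OPEN", 1e-5), Event("R205_OPEN", 1e-5),
    Event("C172_OPEN", 2e-5), Event("C177_SHORT", 3e-4),
])
supply = Gate("NO_2V3_SUPPLY", "OR", [
    Event("R206_OPEN", 1e-5), Event("R208_OPEN", 1e-5),
])
noise = Gate("SIGNAL_TOO_NOISY", "OR", [
    Event("C183_OPEN", 2e-5), Event("C181_OPEN", 2e-5),
])
top = Gate("ANALOG_INPUT_FAILURE", "OR", [input_1, input_2, supply, noise])
print(f"{probability(top):.4e}")
```

A tree with repeated events or priority and inhibit gates needs cut-set based evaluation instead of this simple bottom-up roll-up.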
Circled in Figure 3.3-2 is the gate that needs to be developed for the analog inputs 1 and 2 described earlier. Figure 3.3-3 shows further development of that gate.
Inputs 1 and 2 are then separately analyzed, and so are the noisy or absent analog voltages. Development of Input 1 is shown in Figure 3.3-4. The circle points out the open components that are to be further developed. The fault tree part in Figure 3.3-4 also contains a gate that points to the possible lack of the 2.3 V voltage. Capacitor C179, if failed short, would short the signal to ground. There are two possible reasons for this capacitor to fail short. One is the so-called part random failure. This term takes into consideration the environment that the capacitor is expected to be exposed to (temperature, vibration) as well as the operational stresses that the capacitor will see, such as its operating voltage. Thus, the term random failure does not simply mean a failure that will occur at random; it describes the likelihood that a part having an intrinsic defect will fail under given environmental and operational stresses.
Figure 3.3-4. Development of the FTA Down to Components and Their Failure Cause
3.4 CONTRIBUTION OF MANUFACTURING DEFECTS
Manufacturing defects causing time-dependent failures are a vital contributor to product unreliability. Some contributions to components failing open are:
- Cold or insufficient solder, which after a period of time, due to relaxation and fatigue, causes connections to open. Vibration of a vehicle will cause the cold soldered joint to open as well.
- Broken or bent pins or leads
Contributors of manufacturing flaws to components failing short are:
- Debris (at times un-cleaned flux) left on the board
- Excessive solder
- Bent pins (mostly ICs and connectors) shorting to another pin.
Another reason for the capacitor failure (Figure 3.3-4) would be a short caused by manufacturing defects. Normally during production, if a PC board is not properly cleaned, debris left on it will produce so-called dendritic growth, which in turn might cause a short between terminals. A second manufacturing defect causing an electrical short is the result of inadequate soldering technique, where excessive solder forms a bridge between the terminals and causes a short. Further development of the fault tree will point to other components failing open or short, causing failure of the analog power supply or interruption of the second signal.
3.5 ORIGIN OF VALUES FOR THE BASIC EVENTS
To be able to estimate the final (top gate) product reliability, each of the events must have information on its reliability assigned to it. This information may be attached in the form of a failure rate, MTTF, or probability of failure. For mixtures of hardware, mechanical and electrical, perhaps the most straightforward way would be to represent all the information in the form of a probability of failure calculated for a predetermined time and a predetermined operational profile. For electrical components, data for event and failure mode probabilities comes from:
- Information from the manufacturer's life testing, which needs to be recalculated for the proper environmental and electrical stresses
- Software databases (commercially available)
- Field use (field failure data) information, which would be the very last resort because of many inconsistencies in data reporting and recording.
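Converting a failure rate or an MTTF into a probability of failure for a predetermined time can be sketched as follows. The sketch assumes a constant failure rate (exponential model), which is a modeling assumption rather than something the source data guarantees; the function names, the rate, and the operational profile are hypothetical.

```python
import math


def failure_probability_from_rate(lam_per_hour, hours):
    """F(t) = 1 - exp(-lambda * t), valid only for a constant failure rate."""
    return -math.expm1(-lam_per_hour * hours)


def failure_probability_from_mttf(mttf_hours, hours):
    """With a constant failure rate, lambda = 1 / MTTF."""
    return failure_probability_from_rate(1.0 / mttf_hours, hours)


# Hypothetical operational profile: 2 hours of use per day over 10 years.
hours = 2 * 365 * 10
print(failure_probability_from_rate(5e-8, hours))
print(failure_probability_from_mttf(2.0e7, hours))
```

Both calls describe the same part, since an MTTF of 2.0e7 hours corresponds to a rate of 5e-8 per hour, so they print the same probability.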
For mechanical components, probability of failure needs to be calculated based on:
- Stresses (loads), and their geometry and distribution
- Materials
- Construction (design) of parts, such as shape and size
- Attachment of parts to other structures (adhesives, fasteners)
Based on all the information, the safety margin needs to be calculated, which in turn will produce a reliability value. For determination of the probability of occurrence of manufacturing defects, the approach may be two-fold. The probability associated with the manufacturing defects can come from factory or service data (field failure data). On the other hand, sometimes it is advisable to enter the required values into the reliability analysis and then adjust the manufacturing process control to achieve this goal.
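One common way to turn a safety margin into a reliability value is the stress-strength interference model. The sketch below assumes independent, normally distributed strength and load, which is an assumption introduced here and not necessarily the method used in the original analysis; the function names and all numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stress_strength_reliability(mean_strength, sd_strength, mean_load, sd_load):
    """P(strength > load) for independent, normally distributed strength and load."""
    margin = (mean_strength - mean_load) / math.hypot(sd_strength, sd_load)
    return normal_cdf(margin)


# Hypothetical numbers in consistent units (for example, MPa).
margin = (500.0 - 380.0) / math.hypot(30.0, 40.0)
r = stress_strength_reliability(500.0, 30.0, 380.0, 40.0)
print(f"safety margin = {margin:.2f}, R = {r:.5f}")
```

The resulting probability of failure, 1 - R, can then be attached to the corresponding basic event.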
4. FAILURE MODE DETECTION AND MITIGATION
In a completed or partially completed fault tree analysis of a system, when the probability of failure of the top level gate is calculated and it is concluded that reliability improvement is necessary, the process that follows is to identify the highest contributor to unreliability (a failure mode or a cause) and improve the design. This process continues with the search for the next highest contributor. An example of such reliability improvement is shown for the case of a complex audio/video amplifier system. The top level of the system (the console) is shown in Figure 4-1. The Tuner is shown as an event because reference designator numbers are repeated between the bill of material of the system and that of the tuner. For that reason, the Tuner was analyzed separately, and its top-level unreliability is depicted as an event.
Figure 4-1. Top Level Fault Tree of Console and its Major Subsystems
For the given warranty period, the original unreliability value is not acceptable, as 7,365 systems out of 100,000 made would need service before the end of their respective warranty periods. The highest contributor to unreliability is the block marked SPDIF. This gate was developed on page 13, as shown in Figure 4-2.
Figure 4-2. SPDIF Top Level Fault Tree
The highest contributor to the SPDIF circuit unreliability is shown to be the part of the circuit that is an input to or output from the multiplexer. Further investigation leads to the SPDIF multiplexer, where the highest contributor is the IC U501 (Figure 4-3). The high failure probability of this IC is related to its package construction (TSSOP). In another package, SOIC, this IC is a reliable part. There were 3 of these units in the console. It was also apparent that the probability of failure of capacitors C513 and C517 was too high for ceramic capacitors. This is because they had the Y5V dielectric characteristic. There were about 116 capacitors of this type in the console.
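The effect of part count on a contributor can be illustrated with a short Python sketch. The part counts (3 ICs, about 116 capacitors) come from the text, but the per-part failure probabilities are hypothetical placeholders, the function name is made up, and the parts are assumed to fail independently with any single failure counting as a console failure.

```python
def any_of_n_fails(p, n):
    """Probability that at least one of n identical, independent parts fails."""
    return 1.0 - (1.0 - p) ** n


# Part counts come from the text; the per-part probabilities are hypothetical.
before = any_of_n_fails(2.0e-4, 116)  # 116 capacitors with the weaker dielectric
after = any_of_n_fails(1.0e-5, 116)   # same count after a hypothetical part change
ics = any_of_n_fails(5.0e-3, 3)       # three ICs of the same type
print(f"capacitors: {before:.4e} -> {after:.4e}, ICs: {ics:.4e}")
```

Even a modest per-part probability becomes a visible contributor when the part is used more than a hundred times, which is why a frequently used component type can rank among the highest contributors.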
Figure 4-3 The Components which were the Highest Contributors to the Console Unreliability
Once the design improvements were made, the console reliability improved to the point of almost meeting its aggressive goal. The resulting improvement is shown in Figure 4-4.
[Plot: vertical axis Console Reliability, 0.92 to 1; annotations: Transistors and FETs from a more reliable vendor; Planned Reliability Growth; TSSOPs replaced by SOICs]
Figure 4-4 Console Reliability Goal, Planned Growth Curve, and the Actual Reliability
5.
The Fault Tree Analysis can be successfully used for identification and mitigation of potential failure modes that contribute to unreliability of a product. The FTA allows pictorial representation of the system, its architecture and functionality, along with using Boolean algebra and the multitude of modeling schemes to best represent the system operation and interdependency of its failure modes. The FTA is used here to evaluate the individual failure mode contributions to the system unreliability and to arrive at the most viable solution for its reliability improvement. The methodology can be summarized as follows:
Define what constitutes the system failure
Start with the top level of the system with an unfavorable outcome that defines the system failure
Construct the fault tree down, using logic to express reliability modeling techniques
Follow the analysis down the fault tree to determine what assembly, signal, part, or manufacturing defect will cause a particular failure
Develop the fault tree all the way down to the causes of pertinent failure modes
Determine the respective probability of occurrence of individual causes. The software, when used for analysis, will roll up all information, producing the system, subsystem, and assembly failure probabilities
Identify those failure modes that are the highest contributors to unreliability and mitigate
6.
1. Joanne Bechta Dugan, Fault-Tree Analysis of Computer-Based Systems, 1999 Tutorial Notes, Reliability and Maintainability Symposium, Washington, DC.
2. Kiran Kumar Vemuri and Joanne Bechta Dugan, Reliability Analysis of Complex Hardware-Software Systems, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
3. Géza Szabó and Péter Gáspár, Practical Treatment Methods for Adaptive Components in the Fault-Tree Analysis, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
4. Alfredo H-S. Ang and Wilson H. Tang, Probability Concepts in Engineering Planning and Design, Volume II, Decision, Risk and Reliability, 1990.
5. Milena Krasich, Use of Fault Tree Analysis for Evaluation of System Reliability Improvements in Design Phase, Proceedings, Annual Reliability and Maintainability Symposium, January 2000, Los Angeles, California.
Update the analysis, and monitor the resultant reliability improvement
Failure mode analysis with fault trees can be started at the start of a project and updated as more detailed information becomes available. There is no need to derive failure rates as the reliability measure for all components, whether electrical, mechanical, or software. Fault tree modeling allows a mixture of various kinds of information (failure probabilities, different failure distributions) and, unlike classical reliability predictions, does not require estimation of failure rates only. Modeling and reliability assessment of a product system with fault tree analysis allows for timely design improvements while design changes are still possible, feasible, and inexpensive. This methodology is also described in the draft IEC standards IEC 60300-3-1, Dependability management, Part 3: Application guide, Section 1: Analysis techniques for dependability; Guide on methodology, and IEC 61014, Reliability growth methods. The first standard is in its last draft for comments
7.
Measures
Reliability: Probability of survival after the end of a predetermined period
Unreliability: Probability of failure before the end of the period
Measure as management sees it: Percent of items surviving a predetermined time period, normally the warranty period, mission period, or other time period requiring proper product operation
1-23-2002 M. Krasich 4
Failure of software
Failure Cause: The circumstances during design, manufacture, or use which have led to failure
Failure Mechanism: The physical, chemical, or other process which led to a failure
The most cost effective reliability improvement is done during the product design
Product reliability improvement is achieved by identification of potential design flaws:
Component electrical overstress
Potential mechanical overstress and failure
Inadequate components or parts used
Failure of one part caused by the failure of another part
Use of parts that are of inferior quality/reliability
Failure mode: Manner or state in which an item or a component might fail
Examples:
Low output of an IC
Separation of the IC packaging material
Capacitor fails short due to crack propagation in the dielectric (failure mechanism)
Resistor fails open, failure cause poor lead welding
FET saturation and overheat
Gain change
Seal leakage
Events
Basic event
Basic event for which reliability information is available
Reliability model: Component failure mode, or a failure mode cause
Examples:
Causes of capacitor short: electrical overstress, high temperature, vehicle vibration, high soldering temperature
Causes of an IC enclosure failure: moisture, high temperature, IC manufacturing process
Conditional event
Event that is a condition of occurrence of another event, when both must occur for the output to occur
Reliability model: Occurrence of an event that must occur for another event to occur
Events cont.
Dormant event
A basic event that represents a dormant failure Reliability model:
Dormant component failure mode or dormant failure cause
Undeveloped event
A part of a system not yet developed
Gates
OR gate
The output event occurs if any of its input events occurs
Reliability model: Failure occurs if any of the parts of that system fails (series system)
AND gate
The output event takes place if all of the input events occur Reliability model: Parallel redundancy, one out of n equal or different branches.
Cut sets
Groups of events that, if all occur, would cause a system failure. Minimal cut set: contains the minimum number of events that are required for failure. Removing any one of them would result in the system not failing.
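The reliability models behind the OR and AND gates can be written in a few lines. The following is a minimal sketch that assumes statistically independent input events; the function names and the input values are illustrative, not taken from the tutorial.

```python
from math import prod


def or_gate(input_failure_probs):
    """Output occurs if any input occurs (series system), independent inputs."""
    return 1.0 - prod(1.0 - q for q in input_failure_probs)


def and_gate(input_failure_probs):
    """Output occurs only if all inputs occur (parallel redundancy), independent inputs."""
    return prod(input_failure_probs)


# Hypothetical input failure probabilities.
inputs = [0.01, 0.02]
q_or = or_gate(inputs)    # 1 - 0.99 * 0.98
q_and = and_gate(inputs)  # 0.01 * 0.02
print(q_or, q_and)
```

A minimal cut set is then an AND of its member events, and the top event is an OR over the cut sets.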
Basis for the Fault Tree: Boolean algebra, used to produce minimal cut sets (or path sets)
Cut Sets
A system fails if any one of the cut sets occurs:
c1 = {1, 2}, c2 = {4, 5}, c3 = {1, 3, 5}, c4 = {2, 3, 4}
$$R_S = 1 - F_S, \qquad F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4)$$
$$\Pr(c_1) = F_1 F_2 = (1 - R_1)(1 - R_2)$$
Correct calculation (Esary-Proschan):
$$\Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)]\,[1 - \Pr(c_2)]\,[1 - \Pr(c_3)]\,[1 - \Pr(c_4)]$$
Rare event approximation:
$$F_S \approx \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4) = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$
Inhibit gate:
The output occurs only if both (or all) of the input events take place, one of them conditional Reliability model: Conditional probability of the final event
Transfer gate:
Gate indicating that this part of the system is developed in another part or page of the diagram Reliability reference: A partial reliability block diagram that is shown in other location of the overall system block diagram
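$$F_s = 1 - (1 - F_1 F_2)(1 - F_4 F_5)(1 - F_1 F_3 F_5)(1 - F_2 F_3 F_4)$$
Rare event approximation:
$$F_{sr} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$
System reliability (algebraic solution):
$$R_S = (R_1 + R_2 - R_1 R_2)(R_4 + R_5 - R_4 R_5)\,R_3 + [R_1 R_4 + R_2 R_5 - R_1 R_2 R_4 R_5]\,(1 - R_3)$$
Block failure probabilities:
$$F_1 = 2 \times 10^{-2}, \quad F_2 = 5 \times 10^{-2}, \quad F_3 = 8 \times 10^{-2}, \quad F_4 = 2.5 \times 10^{-2}, \quad F_5 = 3 \times 10^{-1}$$
Esary-Proschan:
$$F_s = 1 - (1 - F_1 F_2)(1 - F_4 F_5)(1 - F_1 F_3 F_5)(1 - F_2 F_3 F_4) = 9.068 \times 10^{-3}$$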
When a system is really complex, with a multitude of interrelationships between the assemblies, the algebraic solutions rapidly become too involved. Environmental factors and manufacturing errors are left out.
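As a numeric check of the cut set formulas for the five-block example, the sketch below evaluates the Esary-Proschan expression, the rare event approximation, and the algebraic solution side by side. The block failure probabilities are those shown for the example (block 5 is taken from the basic event value Q=3.000e-1 in the fault tree); the variable names are illustrative and the blocks are assumed independent.

```python
from math import prod

# Block failure probabilities for the five-block example.
F = {1: 2e-2, 2: 5e-2, 3: 8e-2, 4: 2.5e-2, 5: 3e-1}
cut_sets = [(1, 2), (4, 5), (1, 3, 5), (2, 3, 4)]

# Probability of each minimal cut set (all of its blocks fail).
cut_probs = [prod(F[b] for b in cs) for cs in cut_sets]

# Esary-Proschan expression and rare event approximation over the cut sets.
f_esary_proschan = 1.0 - prod(1.0 - p for p in cut_probs)
f_rare_event = sum(cut_probs)

# Algebraic solution, conditioning on block 3 working or failed.
R = {b: 1.0 - q for b, q in F.items()}
r_system = ((R[1] + R[2] - R[1] * R[2]) * (R[4] + R[5] - R[4] * R[5]) * R[3]
            + (R[1] * R[4] + R[2] * R[5] - R[1] * R[2] * R[4] * R[5]) * (1.0 - R[3]))
f_algebraic = 1.0 - r_system

print(f"Esary-Proschan: {f_esary_proschan:.4e}")
print(f"Rare event:     {f_rare_event:.4e}")
print(f"Algebraic:      {f_algebraic:.4e}")
```

With these inputs the Esary-Proschan and rare event results agree with the 9.068e-3 and 9.080e-3 shown in the fault trees, while the algebraic solution comes out slightly lower, at roughly 9.006e-3.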
[Fault tree of the five-block example. Top event: No signal at the output, Failure Q=9.068e-3. Cut set gates: Top (Block 1, Block 2) Q=1.000e-3; Bottom (Block 4, Block 5) Q=7.500e-3; Cross 1 (Block 1, Block 3, Block 5) Q=4.800e-4; Cross 2 (Block 2, Block 3, Block 4) Q=1.000e-4. Basic events: 1 Q=2.000e-2; 2 Q=5.000e-2; 3 Q=8.000e-2; 4 Q=2.500e-2; 5 Q=3.000e-1]
[Fault tree of the five-block example evaluated with the rare event approximation. Failure Q=9.080e-3. Cut set gates: Top (Block 1, Block 2) Q=1.000e-3; Bottom (Block 4, Block 5) Q=7.500e-3; Cross 1 (Block 1, Block 3, Block 5) Q=4.800e-4; Cross 2 (Block 2, Block 3, Block 4) Q=1.000e-4. Basic events: 1 Q=2.000e-2; 2 Q=5.000e-2; 3 Q=8.000e-2; 4 Q=2.500e-2; 5 Q=3.000e-1]
Example of gate calculations
Basic events: EVENT1 Q=0.002 (F1); EVENT2 Q=0.0005 (F2); EVENT3 Q=0.00045 (F3); EVENT4 Q=0.00053 (F4); EVENT5 Q=0.0032 (F5)
Gate 1 (OR gate): fails if EVENT1 OR EVENT2 occurs
$$F_{Gate1} = 1 - (1 - F_1)(1 - F_2) = 2.499 \times 10^{-3}$$
Gate 2 (voting gate, 2 out of 3, over EVENT3, EVENT4, EVENT5):
$$F_{Gate2} = 1 - (1 - F_3 F_4)(1 - F_3 F_5)(1 - F_4 F_5) = 3.374 \times 10^{-6}$$
Top gate: fails if Gate 1 OR Gate 2 fails
$$F_{TopGate} = 1 - (1 - F_{Gate1})(1 - F_{Gate2}) = 2.502 \times 10^{-3}$$
[Fault tree: GATE1 Q=2.499e-3; GATE2 Q=3.374e-6]
With equal inputs to the voting gate (EVENT3 Q=0.0032; EVENT4 Q=0.0032; EVENT5 Q=0.0032):
[Fault tree: TOP1 Q=2.530e-3; GATE1 Q=2.499e-3; GATE2 Q=3.072e-5; EVENT1 Q=0.002; EVENT2 Q=0.0005]
Gate that fails only if EVENT1 occurs first
Probability of occurrence of EVENT1 = F1
Probability of occurrence of EVENT2 if EVENT1 occurred = F2
$$F_{Gate1} = F(\text{EVENT1}) \cdot F(\text{EVENT2} \mid \text{EVENT1})$$
[Fault tree: GATE1 Q=1.000e-6; GATE2 Q=3.374e-6]
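The gate values in this example can be reproduced numerically. The sketch below is an illustration under stated assumptions: events are independent, and the voting gate is evaluated by applying the Esary-Proschan expression to its pairwise minimal cut sets, which is the calculation that reproduces the GATE2 values shown (3.374e-6 and 3.072e-5). Function names are hypothetical.

```python
from itertools import combinations
from math import prod


def or_gate(probs):
    """Output occurs if any independent input occurs."""
    return 1.0 - prod(1.0 - q for q in probs)


def voting_gate(probs, k):
    """Output occurs if at least k inputs occur.

    Evaluated over the minimal cut sets (every k-subset of the inputs)
    with the Esary-Proschan expression, an assumption of this sketch.
    """
    cut_probs = [prod(subset) for subset in combinations(probs, k)]
    return or_gate(cut_probs)


def priority_and_gate(p_first, p_second_given_first):
    """Output occurs only if the first event occurs and then the second."""
    return p_first * p_second_given_first


f1, f2, f3, f4, f5 = 0.002, 0.0005, 0.00045, 0.00053, 0.0032

gate1 = or_gate([f1, f2])
gate2 = voting_gate([f3, f4, f5], k=2)
top = or_gate([gate1, gate2])

# Variant with three equal inputs to the voting gate.
gate2_equal = voting_gate([0.0032, 0.0032, 0.0032], k=2)
top_equal = or_gate([gate1, gate2_equal])

# Sequence-dependent gate: EVENT1 first, then EVENT2 given EVENT1.
sequence_gate = priority_and_gate(f1, f2)

print(f"{gate1:.3e} {gate2:.3e} {top:.3e}")
print(f"{gate2_equal:.3e} {top_equal:.3e} {sequence_gate:.3e}")
```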
Example of the Priority and AND Gate Switching Amp Before Improvement
[Fault tree, priority AND gate example: TOP1 Q=1.001e-6; GATE1 (fails only if EVENT1 happens before EVENT2) Q=1.000e-6; GATE2 Q=7.632e-10; EVENT1 Q=0.002; EVENT2 Q=0.0005; EVENT3 Q=0.00045; EVENT4 Q=0.00053; EVENT5 Q=0.0032]
[Fault tree, Switching Amp before improvement: Overheat of FET due to LGND <2V; Page 1 Q=2.0969e-3; FET4 OVERHEAT; LGROUND Q=5.9911e-3; FET 4 SATURATION; DENDRITIC SHORT; MFG_SHORT_C905 and MFG_Short_C906 (values shown: Q=7.0000e-8, Q=3.0000e-3); SOLDER SHORT_C6905; DEBRIS_C6905; DEBRIS_C6906; PRF_C6906 Q=3.0000e-3; AGEING Q=1.0000e-6; HI-TEMP Q=1.2500e-7; HI_HUMIDITY Q=2.0000e-6; further basic event values shown: Q=7.0000e-8, Q=5.0000e-8, Q=2.0000e-8.
Cause descriptions: Excessive solder causing a short between the pins or pads; Debris on the PCB causing a short; Capacitor fails due to part random failure; Capacitor leaking electrolyte due to ageing; Electrolyte leak due to high temperature]
[Fault tree, Switching Amp, second version with lower failure probabilities: Page 1 Q=1.0009e-4; FET4 OVERHEAT; LGROUND Q=2.8596e-4; FET 4 SATURATION; DENDRITIC SHORT; MFG_SHORT_C905 and MFG_Short_C906 (values shown: Q=7.0000e-8, Q=1.4292e-4); SOLDER SHORT_C6905; DEBRIS_C6905; DEBRIS_C6906; PRF_C6906 Q=1.4292e-4; AGEING Q=1.0000e-6; HI-TEMP Q=1.2500e-7; HI_HUMIDITY Q=2.0000e-6; further basic event values shown: Q=7.0000e-8, Q=5.0000e-8, Q=2.0000e-8.
Cause descriptions: Excessive solder causing a short between the pins or pads; Debris on the PCB causing a short; Capacitor fails due to part random failure; Capacitor leaking electrolyte due to ageing; Electrolyte leak due to high temperature]
Determine the causes of those phenomena
Determine the contributing factors to the causes, i.e.:
High temperature
High humidity
Electrical overstress
No voltage supplied to the analog input (pin 8): open R206, or R208 (if open, slight non-audible distortion), or short C174 or C176 (if any of the caps open, no failure)
No 5V analog supplied to pin 7: C181 or C183 fail short
U20 fails in whichever mode (low, high, or no output)
There will be no output to the D to A conversion and the rest of the amp if any of R214, R215, R218, and R219 fail open (if shorted, not too much harm)
Not all failure modes need to be considered if not important to the failure definition: realistic prediction
[Fault tree, Page 1. Node labels: A_OUT_1 and 2; U21 failure; DAC_1_DAT Q=0.0000; 5 V ANA to U21. Values shown: Q=3.3535e-2; Q=1.4648e-2; Q=1.5972e-2; Q=5.5314e-4; Q=1.0147e-3. Transfers to Page 5, Page 29, and Page 43]
The input signal 2+ into the right channel of IC U20 is interrupted if:
Components fail open: R201, R205, C172
C177 shorts to ground (shorting the signal)
Opening of C117 might cause some noise, which will be filtered later in the circuit
[Fault tree, Page 5. Node labels: A to D 1 and 2 Q=1.7487e-2; U20 failure; 5 V ANA Q=1.0147e-3; Input 1 into A to D; Input 2 into A to D; Noise on 5V ANA. Further values shown: Q=5.9081e-3; Q=3.1481e-3. Transfers to Page 30 and Page 68]
One of the plus inputs (1 or 2) not provided to the converter
No 5V analog supply voltage provided
IC U20 not operational
[Fault trees, Page 30 and Page 68, developed down to basic events.
Gates: No 5V Analog Q=7.7798e-4; Input 1 into A to D Q=3.1481e-3; Noise on 5V ANA Q=2.3692e-4; 2.3 V supply.
Short failure modes: Short_EL_C183 Q=1.1136e-4 (MFG_Short_El_C183 Q=7.0000e-8; PRF_Short_El_C183 Q=9.36354e-005; PRF_Leak_El_C183 Q=1.76564e-005); Short_C181 Q=2.7416e-4 (MFG_Short_C181 Q=7.0000e-8; PRF_Short_C181 Q=0.000274094); Short_C179 Q=2.0730e-4 (MFG_Short_C179 Q=7.0000e-8; PRF_Short_C179 Q=0.000207226).
Open failure modes: Open_El_C183 Q=6.1851e-5 (MFG_Open_El_C183 Q=1.3000e-8; PRF_Open_El_C183 Q=6.18377e-005); Open_C179 Q=1.3238e-4 (MFG_Open_C179 Q=1.3000e-8; PRF_Open_C179 Q=0.000132368); Open_R206 Q=5.1649e-5; Open_R209 Q=5.1649e-5; Open_R200 Q=5.1649e-5 (each with MFG_Open Q=1.3000e-8 and PRF_Open Q=5.16358e-005); Open_El_C171 Q=1.2778e-4 (MFG_Open_El_C171 Q=1.3000e-8; PRF_Open_El_C171 Q=0.000127767); MFG_Open_C181 Q=1.3000e-8 and PRF_Open_C181; further value shown: Q=0.000175069.
Manufacturing basic events: Debris Q=2e-008; Solder short Q=5e-008; Cold solder Q=1.2e-008; Missing component Q=1e-009.
Cause descriptions: Debris on the PCB causing dendritic growth and a short; Resistor fails open, +2.3 V not available for the analog input; Capacitor connections open due to the manufacturing defect; Connection opens due to insufficient or improper soldering.
Transfers to Page 30, Page 68, and Page 140]
PRF: failure of the part, random
Failure probabilities are assigned to the manufacturing process quality requirement
Missing components
An amazingly large number of components are not inserted during assembly; this is detected later, when the function is exercised
Mechanical components
Determine stresses - loads (mechanical, environmental) Construct stress/strength equation for multiple loads if required Calculate design (safety) margin and reliability (probability of failure) for the required life
Manufacturing defects
Factory data Field failure data
Component probability of failure can be calculated as:
[Equation garbled in the source; surviving fragments: exp( GFD t ), ON, OFF, t ON = 365 15 2.7]
191470-473   CAP,0603,X7R,50V,.047UF    C541
254110       DIODE,SCHOTTKY,40V,3A,S    D803
147239       DIODE,DUAL,SOT-23,BAW56    D206
147239       DIODE,DUAL,SOT-23,BAW56    D707
147239       DIODE,SWITCHING,75V,200    D702
147239       DIODE,SOT-23,BAV 99        D100
147239       DIODE,SOT-23,BAV 99        D101
Ratios of the one sided compression and the respective radiuses are r1 and r2
[Values garbled in the source; surviving numbers: 0.017, 0.004, 0.21, 0.2585]
The probability of the actual seal failure in ten years of life is:
[Equation garbled in the source; surviving fragments: F(10 years) = ... (0.3 r1) ... + (0.1 r2) ... = 1.464 × 10^-6]
[Top level fault tree of the console: top event Q=7.365e-2; inputs shown include Tuner failure (Tuner Q=4.423e-3), No video, and ANALOG SIGNAL Q=1.162e-2 (developed on Page 2)]
Follow the highest hitter down to its subassemblies
Look for the highest contributor to its unreliability
[Fault tree: Page 13]
[Plot: console reliability, 0.92 to 0.98; annotations: Transistors and FETs from a more reliable vendor; Planned Reliability Growth; TSSOPs replaced by SOICs]
[Plot: Reliability (0.8 to 0.92) versus Design Time (Days) (0 to 300)]
If 100,000 systems are produced in one year, 9,250 fewer will be returned for repair within the warranty period as a result of the reliability improvement