
Practical System Reliability

By

Nicoline Reynecke 920210605


A research paper submitted for the subject Reliability Management

Faculty of Engineering and the Built Environment of the University of Johannesburg

November 2013

Table of Contents
Introduction
Practical system reliability
    Availability
        How does a high availability (HA) system work?
    Downtime budget
    Quality engineering
Principles of reliability
    How to predict system reliability?
    Hazard rate
    Reliability of systems
        Series systems
        Parallel systems
        Fault tree analysis (FTA)
        Failure mode and effects analysis (FMEA)
Maintenance Strategy
    Reactive maintenance
    Preventive maintenance
    Predictive maintenance
    Proactive Maintenance
    Reliability centered maintenance (RCM)
Conclusion
Bibliography

Introduction
In today's rapidly changing world, an organization is only as good as the service it provides. The service an organization can provide to the public hinges on the reliability of its systems and its products. Reliability and maintainability are among the most important aspects a company can invest in. If properly designed for, the reliability of a product will secure client goodwill as well as a good name for the company, a name that can be trusted. Certain tools and methods are available to organizations to evaluate and improve the reliability of their systems as well as the reliability of the products provided to the public. Some of those tools are fault tree analysis (FTA), failure mode and effects analysis (FMEA) and the maintenance strategies.

Practical system reliability


The rise of the internet, sophisticated computing and communication technologies, and globalization have raised customers' expectations of powerful, always-on services [1]. This makes system reliability very important. What is reliability engineering? Reliability engineering emphasizes dependability in the lifecycle management of a product [2]. If customers cannot get what they want in the time frame they need it, another service provider is just a click away. Because of this, highly available services are vital in any organization.

Availability
The costs associated with poor service availability or reliability are:
• Loss of brand reputation and customer goodwill
• Direct loss of customers and business
• Higher maintenance-related operating expenses
• Financial penalties or liquidated damages

Wikipedia states that availability is the degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, i.e. a random, time. Simply put, availability is the proportion of time a system is in a functioning condition [3]. Availability is calculated with the following formula:

Availability = Uptime / (Uptime + Downtime) = MTTF / (MTTF + MTTR)

where MTTF is the mean time to failure and MTTR is the mean time to repair.

High-availability systems are designed to detect, isolate, alarm and recover from the failures that will inevitably happen, because no system lasts forever. To ensure high availability, the system has redundant elements that can be switched to when there is a failure, so that no single failure can result in a loss of service. This high-availability design principle is known as no single point of failure [1].

How does a high availability (HA) system work?
An HA system has a suite of failure detectors, which typically contains both hardware and software mechanisms. When these detectors are triggered, the system isolates the failure to a specific point, be it software or hardware related, and then activates the suitable recovery scheme. If that recovery does not succeed, a secondary recovery is triggered. Ultimately a human operator is responsible for any system, and if the system does not recover fast enough or successfully, the operator will step in and perform a manual recovery.

Failures fall into two broad categories, namely sub-acute failures and acute failures. Sub-acute failures usually do not impact the system's performance suddenly and can be corrected as soon as the operator can attend to them. Acute failures suddenly and profoundly impact the service of a system and must be corrected immediately to ensure no loss of service.

HA systems must be able to detect both failure types and trigger the proper recovery action to ensure that the impact of failures on the system remains small. Both failure types cost the company not only money but also man-hours and repair time.
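As a rough illustration of the availability formula given earlier in this section, the following Python sketch computes availability from MTTF and MTTR. The function name and the sample figures are hypothetical and are not taken from the cited sources.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF + MTTR must be greater than zero")
    return mttf_hours / total


# Hypothetical figures: on average 999 hours of operation per failure
# and 1 hour to repair each failure.
example = availability(999.0, 1.0)
print(f"Availability: {example:.3%}")
```

Because MTTR sits in the denominator together with MTTF, shortening the repair time raises availability just as lengthening the time between failures does.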

Downtime budget
A company must create a downtime budget to deal with these types of failures. The downtime budget can be created and managed to ensure that there is a plan of action for any type of failure. The factors that contribute to downtime are divided into three categories, based on the party they relate to [1]:
• Customer attributable (due to the actions of the customer)
  o Procedural
  o Power failure, battery or generator
  o Internal environment
  o Traffic overload
  o Planned event
• Product attributable (due to the design and implementation of the product)
  o Hardware failure
  o Design, hardware
  o Design, software
  o Procedural
  o Planned event
• Third party attributable (due to the actions of others)
  o Facility related
  o Power failure, commercial
  o External environment
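One way to picture a downtime budget is as an allocation of allowed downtime per year across the three categories above. The sketch below is only an illustration: the availability target and the minutes assigned to each category are hypothetical values, not figures from [1].

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # minutes in a non-leap year


def allowed_downtime_minutes(target_availability: float) -> float:
    """Annual downtime permitted by an availability target between 0 and 1."""
    if not 0.0 <= target_availability <= 1.0:
        raise ValueError("target availability must lie between 0 and 1")
    return (1.0 - target_availability) * MINUTES_PER_YEAR


# Hypothetical budget in minutes per year for each category.
budget = {
    "customer attributable": 20.0,
    "product attributable": 25.0,
    "third party attributable": 5.0,
}

planned = sum(budget.values())
limit = allowed_downtime_minutes(0.9999)  # hypothetical target
print(f"Planned downtime: {planned:.1f} min/year, limit: {limit:.2f} min/year")
print("Within budget" if planned <= limit else "Over budget")
```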

In the article "Single points of failure within systems of systems" the author mentions that, throughout the research, the element that repeatedly showed vulnerability was data. If a system cannot create, store or transmit data, then that system as a whole can fail, and the data becomes a single point of failure. Data has a life cycle; it can be created, stored, used, shared, archived and then destroyed [4]. Today, communication networks and the internet form complex and dynamic systems of systems (SoS). A system or network must have multiple points of access, and people can attack those points with little knowledge and skill. Data is at risk not only from outside attack but also from the everyday user, the components within the SoS and the physical structure of the given network or internet. Any of these risks can become a single point of failure, and companies need to have a system in place for when this happens. For each safeguard that is put in place against these failures the benefits increase, but so do the costs. If companies are aware of these points of failure, downtime budgets can be put in place effectively beforehand.

Quality engineering
Quality is an important factor in system reliability. But what is quality? Quality is the totality of features and characteristics of a product, process or service that bear on its ability to satisfy stated or implied needs [5]. David Garvin defined the concept of eight dimensions of quality. Some of the dimensions are mutually reinforcing, whereas others are not; improvement in one may come at the expense of others. Understanding the trade-offs customers desire among these dimensions can help build a competitive advantage [6] [7]. The eight dimensions are:

Figure 1. The eight dimensions of quality: Performance, Reliability, Features, Durability, Conformance, Aesthetics, Perceived Quality and Serviceability, arranged around QUALITY. Source: BSI education, The concept of quality

Table 1

| Dimension | Description | Example |
| --- | --- | --- |
| Performance | How efficiently a product achieves its intended purpose | Does the ebook reader let you read electronic books and magazines with ease? |
| Features | The elements that supplement the product's basic performance | The reflow function on an ebook reader |
| Conformance | Does the product meet the specifications for its use? | Does the ereader meet the specification of being able to show electronic formats? |
| Serviceability | The ease with which a product can be repaired | If the ereader breaks, is it easy to repair? |
| Perceived Quality | This is based on the customer's view and opinions | How the customer sees the ereader |
| Aesthetics | How the product influences the customer's senses | Is the ereader easy to hold, small and travel size, or big enough to read comfortably on? |
| Durability | How much stress the product can withstand without failure | If the ereader falls, will it stay in one piece? |
| Reliability | How consistently the product performs over its life cycle | Will the ereader still work five years from now as well as on the day the customer buys it? |

Quality can be determined by applying statistical analysis to the product's characteristics. This analysis requires calculating the mean, the standard deviation, probabilities and the probability density function.
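The statistical quantities named above can be sketched with Python's standard `statistics` module. The measurements, the specification limits and the assumption that the characteristic is normally distributed are all hypothetical and serve only to illustrate the calculation.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical measurements of one product characteristic (for example a
# dimension in millimetres).
samples = [9.8, 10.1, 10.0, 9.9, 10.2]

mu = mean(samples)
sigma = stdev(samples)  # sample standard deviation

# Assumption: the characteristic is approximately normally distributed.
model = NormalDist(mu, sigma)

# Value of the probability density function at the mean.
density_at_mean = model.pdf(mu)

# Probability that a unit falls inside hypothetical limits of 9.7 to 10.3.
p_within_spec = model.cdf(10.3) - model.cdf(9.7)

print(f"mean={mu:.3f}, standard deviation={sigma:.3f}")
print(f"P(9.7 <= x <= 10.3) = {p_within_spec:.3f}")
```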

Principles of reliability
What is reliability? Reliability is typically defined as the ability to perform a specified or required function under specific conditions for a stated period of time [1]. There is a relationship between the quality and the reliability of a product: the reliability of a product is its ability to retain its quality as time progresses [8]. Reliability (R) and unreliability (F) vary with time. The reliability of a product decreases with time, while its unreliability increases with time. The events of reliability and unreliability are complementary, so their sum must equal 1.

R(t) + F(t) = 1
To measure the reliability of a product, non-repairable items and repairable items are considered separately.
Table 2

| Non-repairable items | Repairable items |
| --- | --- |
| Mean time to fail = total up time / number of failures | Mean down time = total down time / number of failures |
| Mean failure rate = number of failures / total up time | |
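The measures in Table 2 are simple ratios. The sketch below applies them to a hypothetical observation record; the function names and the numbers are illustrative assumptions.

```python
def mean_time_to_fail(total_up_time: float, failures: int) -> float:
    """Mean time to fail = total up time / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed")
    return total_up_time / failures


def mean_failure_rate(total_up_time: float, failures: int) -> float:
    """Mean failure rate = number of failures / total up time."""
    if total_up_time <= 0:
        raise ValueError("total up time must be positive")
    return failures / total_up_time


def mean_down_time(total_down_time: float, failures: int) -> float:
    """Mean down time = total down time / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed")
    return total_down_time / failures


# Hypothetical record: 5 failures, 9500 hours up, 25 hours down.
print(mean_time_to_fail(9500.0, 5))
print(mean_failure_rate(9500.0, 5))
print(mean_down_time(25.0, 5))
```

Note that the mean failure rate is the reciprocal of the mean time to fail when both are computed from the same record.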

How to predict system reliability?


Some of the methods used to predict the reliability of a system are fault tree analysis, network analysis and Monte Carlo simulation. All of these methods evaluate the probability of component failure in a system.
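To give a feel for the Monte Carlo approach, the following minimal sketch estimates system reliability by repeatedly sampling which components work. It assumes independent components, and the component reliabilities and the two example structures (all components required, or at least one required) are hypothetical.

```python
import random
from typing import Callable, Sequence


def estimate_reliability(
    component_reliabilities: Sequence[float],
    system_works: Callable[[Sequence[bool]], bool],
    trials: int = 100_000,
    seed: int = 0,
) -> float:
    """Estimate the probability that the system works by random sampling."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = 0
    for _ in range(trials):
        # Each component works with its own probability, independently.
        states = [rng.random() < r for r in component_reliabilities]
        if system_works(states):
            successes += 1
    return successes / trials


# Two hypothetical components with reliability 0.9 each.
all_required = estimate_reliability([0.9, 0.9], all)  # every component needed
one_required = estimate_reliability([0.9, 0.9], any)  # any one component suffices
print(all_required, one_required)
```

The estimate becomes more precise as the number of trials grows, which is the usual trade-off between accuracy and computing time in simulation.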

Hazard rate
Another important measure of a product's quality and reliability is the hazard rate function, or failure rate, denoted by λ(t). The bathtub curve shows the most general form of the failure rate and consists of three distinct phases, namely early failure, useful life and wear-out failure.

Figure 2. Source: Practical system reliability by E. Bauer, X. Zhang and D. Kimber

In the early failure phase the failure rate decreases with time. When the product is a new design, early failures can occur because of design faults, poor-quality components, manufacturing faults, installation errors and operating and maintenance errors. The hazard rate falls as time moves on because design faults are corrected, weak components are replaced with better ones, and the user becomes more familiar with the installation of the product. In the next phase, the useful life of the product, the failure rate is constant and low, as indicated in Figure 2. In this phase all the weak components have been replaced and the design, manufacture, installation, operation and maintenance errors have been corrected. In the last phase, known as the wear-out failure phase, the failure rate increases with time. The increase is due to individual components reaching the end of their expected design life for the particular product; in other words, the product is wearing out [8].
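The three phases can be imitated numerically. The sketch below is one common modelling choice and not the method of [1] or [8]: it adds a decreasing Weibull hazard (shape below 1), a constant hazard and an increasing Weibull hazard (shape above 1). All parameter values are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate: (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Hypothetical bathtub-shaped hazard rate, with t in hours."""
    early = weibull_hazard(t, shape=0.5, scale=1000.0)      # decreasing with time
    useful_life = 0.0005                                    # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=10000.0)  # increasing with time
    return early + useful_life + wear_out


for hours in (10.0, 2000.0, 12000.0):
    print(hours, bathtub_hazard(hours))
```

With these assumed parameters the hazard rate is highest shortly after start-up, settles to a low level during the useful life and rises again as the wear-out term takes over.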

Reliability of systems
A system is a set of interacting or interdependent components forming an integrated whole, or a set of elements (often called components) and relationships which are different from relationships of the set or its elements to other elements or sets [9]. The reliability of a system therefore depends on the smaller elements or components that make up the system. The configuration of these components plays a big role in the system's performance. A recent paper argued that the performance criteria of manufacturing systems, such as reliability, productivity and quality, are determined by different configurations. Two fundamental configurations for a system's components are the series design and the parallel design. These two configurations form the basis of the reliability modeling and analysis of more complex configurations [10].

Series systems
This type of system will fail if any one of its elements fails. The system's reliability is therefore equal to the product of the individual element reliabilities:

$R_{SYST} = R_1 R_2 \cdots R_i \cdots R_m$ (reliability of a series system)

The failure rate of the system is the sum of the individual element failure rates.

Figure 3. Series configuration (A)

Parallel systems
The system will still be able to function provided that any one of its components still functions. A system in parallel has active redundancy. Active redundancy is a design concept that increases operational availability and reduces operating cost by automating most critical maintenance actions [11]. The overall system unreliability is the product of the individual element unreliabilities:

$F_{SYST} = F_1 F_2 \cdots F_j \cdots F_n$ (unreliability of a parallel system)

Figure 4. Parallel configuration (B)
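The series and parallel formulas translate directly into code. The sketch below assumes that the elements fail independently; the function names and the example reliabilities are hypothetical.

```python
from math import prod
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """Series system: the product of the element reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """Parallel system: one minus the product of the element unreliabilities."""
    system_unreliability = prod(1.0 - r for r in reliabilities)
    return 1.0 - system_unreliability


# Two hypothetical elements with reliability 0.9 each.
print(series_reliability([0.9, 0.9]))
print(parallel_reliability([0.9, 0.9]))
```

For the same two elements the series arrangement is less reliable than either element on its own, while the parallel arrangement is more reliable, which is why redundancy is used to raise availability.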

Fault tree analysis (FTA)
The fault tree is an established form of analysis and is built from the top down, using logical AND/OR gates to combine the causal events [12]. Computer programs are available to perform these top-down calculations on a fault tree. The top event of a fault tree must be chosen carefully so that the analysis is neither too wide nor too narrow to produce the necessary results. More than one fault tree analysis can be done on a system, as each fault tree represents only one of the many possible types of failure in a system. FTA has been used in many industries, such as the air and space, chemical, electrical and transport industries. In a case study on a radio-based railroad crossing, the authors concluded that formal FTA is a promising, if not always easy, topic. Because FTA is human readable and understandable, with a logical background structure, industry will accept this method easily [13]. The following table and figure show the common gates used in a fault tree analysis, as well as an example of a fault tree analysis done on a press unit at a paper mill.

Table 3 Source: Reliability and risk assessment by JD. Andrews and TR. Moss

| No. | Gate name | Causal relation |
| --- | --- | --- |
| 1 | AND gate | Output event occurs if all input events occur simultaneously |
| 2 | OR gate | Output event occurs if at least one input event occurs |
| 3 | m-out-of-n gate | Output event occurs if m out of n input events occur |
| 4 | Exclusive OR gate | Output event occurs if one, but not both, of the two input events occurs |
| 5 | Inhibit gate | Input produces output when the input event and the conditional event occur |
| 6 | Priority AND gate | Output event occurs if all input events occur in the order from left to right |
| 7 | NOT gate | Output event occurs if the input event does not |
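The causal relations in Table 3 can be written as small boolean functions. The sketch below is an illustrative reading of the table rather than code from [12]: the m-out-of-n gate is taken to mean "at least m", and the priority AND gate is fed hypothetical occurrence times so that the left-to-right order can be checked.

```python
from typing import Optional, Sequence


def and_gate(inputs: Sequence[bool]) -> bool:
    return all(inputs)


def or_gate(inputs: Sequence[bool]) -> bool:
    return any(inputs)


def m_out_of_n_gate(inputs: Sequence[bool], m: int) -> bool:
    # Assumption: the output occurs when at least m of the n inputs occur.
    return sum(bool(x) for x in inputs) >= m


def exclusive_or_gate(a: bool, b: bool) -> bool:
    return bool(a) != bool(b)


def inhibit_gate(input_event: bool, conditional_event: bool) -> bool:
    return bool(input_event) and bool(conditional_event)


def priority_and_gate(occurrence_times: Sequence[Optional[float]]) -> bool:
    # occurrence_times lists, left to right, when each input occurred;
    # None means the input did not occur. Ties count as out of order
    # (an assumption made for this sketch).
    if any(t is None for t in occurrence_times):
        return False
    return all(a < b for a, b in zip(occurrence_times, occurrence_times[1:]))


def not_gate(input_event: bool) -> bool:
    return not input_event


print(and_gate([True, False]), or_gate([True, False]))
print(m_out_of_n_gate([True, True, False], m=2))
print(priority_and_gate([1.0, 2.5, 4.0]))
```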

Press unit failure (top event; the key distinguishes OR gates and AND gates)
    Felt (synthetic belt)
        Felt too old
        Felt damaged
    Top roller
        Top roller bending
        Top roller bearing
        Top roller rubber wear
    Bottom roller
        Bottom roller bending
        Bottom roller bearing
        Bottom roller rubber wear

Figure 5 Example of an FTA for a press unit in a paper mill
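As a hedged numerical illustration of the press unit tree, the sketch below computes the probability of the top event. Two things are assumed and are not taken from the figure: every gate is treated as an OR gate with independent input events, and the basic-event probabilities are invented for the example.

```python
from math import prod
from typing import Sequence


def or_probability(probabilities: Sequence[float]) -> float:
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities over some fixed period.
felt = or_probability([0.02, 0.01])                   # too old, damaged
top_roller = or_probability([0.005, 0.01, 0.02])      # bending, bearing, rubber wear
bottom_roller = or_probability([0.005, 0.01, 0.02])   # bending, bearing, rubber wear

# Assumption: the top event is an OR of the three branches.
press_unit_failure = or_probability([felt, top_roller, bottom_roller])
print(f"P(press unit failure) = {press_unit_failure:.4f}")
```

If any gate in the real tree is an AND gate, the corresponding step would multiply the input probabilities instead, which lowers the result.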


Failure mode and effects analysis (FMEA)
Failure mode and effects analysis (FMEA) is the procedure by which each potential failure mode in a system is analyzed to determine the effect it has on the system and then classified according to its severity [12]. The following shows an example of an FMEA of a common household washing machine.
Table 4 Source: Burgehugheswalsh.co.uk

FUNCTIONAL FMEA
Function: Load dirty clothes

| Functional failure mode | Potential effects of failure | Severity | Potential causes of failure | Occurrence | Current process controls (prevention) | Current process controls (detection) | Detection | *RPN (S x O x D) | Action taken | Responsibility and target completion date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No load | No wash | | User error | | None | Built in test | | 54 | Include sensor for load detection | John, completed |
| Over load | Very poor wash | | User error | | Weigh load function | Built in test | | 270 | Include sensor for load detection | John, completed |
| Under load | Poor wash | 4 | User error | | Weigh load function | Built in test | | 144 | Include sensor for load detection | Mike, completed |
| Hidden extreme mix of load | Colour run | | Items covered by others | 6 | None | None | | 486 | Clear lid for visual detection | Mike, completed |
| Hidden extreme mix of load | Fabric shrink | | Items covered by others | 7 | None | None | | 567 | None | Jane, completed |
| Unintended load, foreign object in load | Object damages items | | User error | | None | None | 10 | 210 | None | Jane, completed |
| Unintended load, foreign object in load | Object damages machine | | User error | | None | None | 10 | 160 | None | Jane, completed |

*RPN is the risk priority number
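The risk priority number is the product of the severity, occurrence and detection ratings, and failure modes are usually dealt with in descending order of RPN. The sketch below assumes the common 1 to 10 rating scale, which the table does not state, and reuses the RPN values as read from Table 4; the short row labels are paraphrased for the example.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """RPN = severity x occurrence x detection (1 to 10 scale assumed)."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings are assumed to lie between 1 and 10")
    return severity * occurrence * detection


# RPN values as read from Table 4, keyed by failure mode and effect.
table_4_rpn = {
    "No load / no wash": 54,
    "Over load / very poor wash": 270,
    "Under load / poor wash": 144,
    "Hidden extreme mix of load / colour run": 486,
    "Hidden extreme mix of load / fabric shrink": 567,
    "Foreign object / damages items": 210,
    "Foreign object / damages machine": 160,
}

# Highest RPN first, so the riskiest failure modes are addressed first.
ranked = sorted(table_4_rpn.items(), key=lambda item: item[1], reverse=True)
for name, rpn in ranked:
    print(f"{rpn:4d}  {name}")
```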


Maintenance Strategy
For a working reliability plan, maintenance must be taken into account. The approach to the maintenance of a system is as follows:

Unplanned maintenance: reactive maintenance.
Planned maintenance: proactive maintenance, comprising preventive maintenance and predictive maintenance.

Figure 6 Source: Reliability Strategy and Plan - www.utk.education

The benefits of a planned maintenance system are numerous and have progressive effects on a company. Some of the benefits include:
Table 5 Source: Reliability Strategy and Plan - www.utk.education

| Reduction in | Increase in |
| --- | --- |
| The size and scale of repairs | Accountability for all cash spent |
| Downtime | Equipment's useful life |
| Number of repairs | Operator, mechanic and public safety |
| Overtime | Consistency and quality of output |
| Maintenance costs | Equipment availability |
| Overall cost per product unit | Control over parts |

Reactive maintenance
Reactive maintenance is also known as breakdown or run-to-failure maintenance. This type of maintenance takes place only when it is absolutely necessary. Little expense or effort is allocated to it until it is required. Typical candidates for reactive maintenance are light bulbs and electronic circuit boards.

Preventive maintenance
Preventive maintenance is also known as time-based or interval-based maintenance. This type of maintenance is scheduled and carried out at fixed operating-time intervals to prolong the life of the equipment and to prevent equipment failure. Preventive maintenance does not take the condition of the equipment into consideration, so it can be costly as well as ineffective. The tasks performed during preventive maintenance include cleaning, inspection, adjustment, lubrication and parts replacement. Some examples of preventive maintenance are car maintenance and machine tooling.

Predictive maintenance
Predictive maintenance is also known as condition-based maintenance. This type of maintenance is done through failure forecasting, by analyzing the condition of the equipment. The analysis can be done by trending parameters such as vibration, temperature and flow. The maintenance is also scheduled so that it does not interfere with normal operation and production times. Predictive maintenance reduces costs and improves reliability. Its benefits include an improved mean time to repair and reduced inventory levels. The most commonly used predictive maintenance techniques include vibration monitoring, oil analysis, thermography, shock pulse measurement, ultrasonic scanning and x-ray scanning. By contrast, replacing a microwave just before its known service life of 5 years is up, even if it is still in working condition, is time-based and therefore preventive rather than predictive maintenance; a predictive approach would replace it only when its monitored condition indicates an impending failure.
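Trending a condition parameter can be sketched as a straight-line fit that is extrapolated to an alarm threshold. This is only one simple forecasting approach, and the readings, units and threshold below are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line through the readings: (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching readings")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_threshold(
    times: Sequence[float], values: Sequence[float], threshold: float
) -> Optional[float]:
    """Time at which the fitted trend reaches the threshold, or None if it never does."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None  # no upward trend, so no forecast crossing
    return (threshold - intercept) / slope


# Hypothetical vibration readings (mm/s) taken at weeks 0 to 3.
weeks = [0.0, 1.0, 2.0, 3.0]
vibration = [1.0, 1.5, 2.0, 2.5]
print(time_to_threshold(weeks, vibration, threshold=4.0))
```

The forecast crossing time gives the planner a window in which to schedule the repair outside normal production hours.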

Proactive Maintenance
Proactive maintenance combines preventive and predictive maintenance. It improves maintenance through better design, installation, maintenance procedures, workmanship and scheduling [14]. Proactive maintenance employs the following basic techniques to extend machinery life: specification for new/rebuilt equipment; precision rebuild and installation; failed-part analysis (FPA); root-cause failure analysis (RCFA); reliability engineering; rebuild certification/verification; age exploration; and recurrence control.

Reliability centered maintenance (RCM)


Reliability centered maintenance is the sum of all four maintenance methods mentioned earlier. RCM is an ongoing process which determines the optimum reactive, preventive, predictive and proactive maintenance practices in order to provide the required reliability at the minimum cost [14].

| Reactive Maintenance | Preventive Maintenance | Predictive Maintenance | Proactive Maintenance |
| --- | --- | --- | --- |
| Small items | Subject to wear-out | Random failure patterns | RCFA |
| Non-critical | Consumable replacement | Not subject to wear | FMEA |
| Inconsequential | Failure pattern known | PM-induced failures | AE |
| Unlikely to fail | | | |
| Redundant | | | |

Figure 7 Reliability Centered Maintenance. Source: Reliability strategy and plan - www.utk.education


RCM finds its roots in the early 1960s. The first industry to develop RCM was the North American civil aviation industry. In the mid-1970s the US Department of Defense commissioned a report on the subject of RCM. This report, written by Stanley Nowlan and Howard Heap, is still used today and is considered one of the most important documents available on the subject. The RCM analysis asks the following questions:
• What does the system or equipment do?
• What functional failures are likely to occur?
• What are the likely consequences of these failures?
• What can be done to prevent these functional failures?

An RCM decision logic tree is then drawn up based on the answers to the above questions. The following is an example of an RCM decision logic tree, in which PdM stands for predictive maintenance and PM for preventive maintenance.

Decision questions, each answered Yes or No:
• Will the failure of the facility or equipment item have a direct and adverse effect on safety or critical mission operations?
• Is the item expendable?
• Is there predictive technology that will monitor the condition and give sufficient warning of an impending failure?
• Is PdM cost and priority-justified?
• Can redesign solve the problem permanently and cost effectively?
• Is there an effective PM task that will minimize functional failure?
• Is establishing redundancy cost and priority-justified?

Outcomes:
• Redesign
• Accept risk
• Install redundant unit(s)
• Install PM task and schedule
• Define PM task and schedule

Figure 8 Source: Reliability strategy and plan - www.utk.education
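A decision logic tree of this kind can be encoded as a chain of yes/no checks. The sketch below is an assumption-laden illustration: the order in which the questions are asked, the branch each answer leads to, and the outcome label "Define PdM task and schedule" are hypothetical choices for this example and should not be read as the exact wiring of Figure 8.

```python
def rcm_decision(
    safety_or_mission_critical: bool,
    expendable: bool,
    pdm_gives_warning: bool,
    pdm_justified: bool,
    redesign_solves: bool,
    pm_task_effective: bool,
    redundancy_justified: bool,
) -> str:
    """Return a maintenance decision under an ASSUMED branch order."""
    # Assumption: a non-critical, expendable item is simply run to failure.
    if not safety_or_mission_critical and expendable:
        return "Accept risk"
    # Assumption: condition monitoring is preferred when available and justified.
    if pdm_gives_warning and pdm_justified:
        return "Define PdM task and schedule"  # hypothetical outcome label
    if redesign_solves:
        return "Redesign"
    if pm_task_effective:
        return "Define PM task and schedule"
    if redundancy_justified:
        return "Install redundant unit(s)"
    return "Accept risk"


# Hypothetical item: critical, with justified condition monitoring available.
print(rcm_decision(True, False, True, True, False, False, False))
```

Writing the tree as code makes the chosen branch order explicit, so it can be reviewed and corrected against the organization's own decision diagram.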


For a successful implementation of RCM the following factors must be taken into consideration [14]:
• Clear project goals
• Management support and a commitment to introduce a controlled maintenance environment
• Union involvement
• Good understanding of RCM philosophy by plant staff
• Pilot RCM application to demonstrate success and build support
• Sufficient resources for both the review and subsequent implementation of recommendations
• Clear documentation of results to facilitate acceptance of recommendations
• Integration with PdM maintenance capability


Conclusion
Reliability is one of the most important concepts in any organization. From paper mills right through to airlines, reliability is a good representation of an organization's worth. The more reliable a system or organization is, the more value a customer will place on that organization. A reliable system also saves the company money and helps keep the company's reputation in a positive light. Tools such as the functional FMEA and fault tree analysis are widely used to ensure an organization's reliability. Reliability centered maintenance (RCM) is another method for determining a system's reliability and saving the company money. Reliability is not just expected by the customer; it is a necessity that can have a wide and full effect on the customer, the organization and the industry.


Bibliography
[1] E. Bauer, X. Zhang and D. Kimber, Practical system reliability, John Wiley & Sons, Inc., 2009.
[2] Wikipedia, "Reliability Engineering," [Online]. Available: http://en.wikipedia.org/wiki/Reliability_engineering.
[3] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Availability. [Accessed 4 September 2013].
[4] Cloud Security Alliance, "Cloud Security Alliance: Security guidance for critical areas of focus in cloud computing V2.1," Cloud Security Alliance, 2009. [Online]. Available: https://cloudsecurityalliance.org/csaguide.pdf. [Accessed 5 September 2013].
[5] ISO, ISO 9000:2005 Quality management systems -- Fundamentals and vocabulary, ISO, 2005.
[6] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Eight_dimensions_of_quality. [Accessed 5 September 2013].
[7] D. Needham, "BSI education," [Online]. Available: www.bsieducation.org ... ecture-MaterialsConcept-of- uality.doc. [Accessed 30 September 2013].
[8] J. Bentley, An introduction to reliability and quality engineering, Addison-Wesley Longman, Ltd, 1999.
[9] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/System. [Accessed 6 September 2013].
[10] J. Sun, L. Xi, S. Du and B. Ju, "Reliability modeling and analysis of serial-parallel hybrid multioperational manufacturing system considering dimensional quality, tool degradation and system configuration," International Journal of Production Economics, vol. 114, no. 1, pp. 149-164, 2008.
[11] Wikipedia, "Wikipedia," [Online]. Available: http://en.wikipedia.org/wiki/Active_redundancy.
[12] J. Andrews and T. Moss, Reliability and risk assessment, Longman Group UK, Ltd., 1993.
[13] F. Ortmeier and G. Schellhorn, "Formal Fault Tree Analysis: Practical Experiences," Electronic Notes in Theoretical Computer Science, pp. 139-151, 2007.
[14] UTK Education, "UTK Education," [Online]. Available: http://web.utk.edu/~kkirby/IE591/Module03.pdf. [Accessed 25 September 2013].
