
Systems Engineering: A great definition

Ben Rockwood said something last December about the re-emergence of the
Systems Engineer and I agree with him, 100%.
NASA Systems Engineering Handbook, 2007
To add to that, I'd like to quote the excellent NASA Systems Engineering Handbook's
introduction. The emphasis is mine:
Systems engineering is a methodical, disciplined approach for the design,
realization, technical management, operations, and retirement of a system. A
system is a construct or collection of different elements that together produce
results not obtainable by the elements alone. The elements, or parts, can include
people, hardware, software, facilities, policies, and documents; that is, all things
required to produce system-level results. The results include system-level qualities,
properties, characteristics, functions, behavior, and performance. The value added
by the system as a whole, beyond that contributed independently by the parts, is
primarily created by the relationship among the parts; that is, how they are
interconnected. It is a way of looking at the big picture when making technical
decisions. It is a way of achieving stakeholder functional, physical, and operational
performance requirements in the intended use environment over the planned life of
the systems. In other words, systems engineering is a logical way of thinking.
Systems engineering is the art and science of developing an operable system
capable of meeting requirements within often opposed constraints. Systems
engineering is a holistic, integrative discipline, wherein the contributions of
structural engineers, electrical engineers, mechanism designers, power engineers,
human factors engineers, and many more disciplines are evaluated and balanced,
one against another, to produce a coherent whole that is not dominated by the
perspective of a single discipline.
Systems engineering seeks a safe and balanced design in the face of opposing
interests and multiple, sometimes conflicting constraints. The systems engineer
must develop the skill and instinct for identifying and focusing efforts on
assessments to optimize the overall design and not favor one system/subsystem at
the expense of another. The art is in knowing when and where to probe. Personnel
with these skills are usually tagged as systems engineers. They may have other
titles (lead systems engineer, technical manager, chief engineer) but for this
document, we will use the term systems engineer.
The exact role and responsibility of the systems engineer may change from project
to project depending on the size and complexity of the project and from phase to
phase of the life cycle. For large projects, there may be one or more systems
engineers. For small projects, sometimes the project manager may perform these
practices. But, whoever assumes those responsibilities, the systems engineering
functions must be performed. The actual assignment of the roles and
responsibilities of the named systems engineer may also therefore vary. The lead
systems engineer ensures that the system technically fulfills the defined needs and
requirements and that a proper systems engineering approach is being followed.
The systems engineer oversees the project's systems engineering activities as
performed by the technical team and directs, communicates, monitors, and
coordinates tasks. The systems engineer reviews and evaluates the technical
aspects of the project to ensure that the systems/subsystems engineering
processes are functioning properly and evolves the system from concept to product.
The entire technical team is involved in the systems engineering process.
I would imagine that a successful organization understands this concept of systems
engineering, but I don't think I've ever seen it put so well.
NASAs engineers have both common and conflicting goals, just like we do in web
operations. They weigh trade-offs in efficiency and thoroughness, and wade into the
constraints of better, cheaper, faster, and hopefully: more resilient.
This re-emergence of the systems engineering (or full-stack engineering) notion is
excellent and exciting to me, and I'm hoping that everyone in our field, when they
hear DevOps (and/or, as Theo says, *Ops), means taking a systems engineering view.

How a MTBF calculation becomes a trap if you are unaware of the dangers
in using Mean Time Between Failure (MTBF) for reliability analysis
MTBF is often used as an indicator of plant and equipment reliability. A MTBF value
is the average time between failures. There are serious dangers with use of MTBF
that need to be addressed when you do an MTBF calculation.
Take a look at the diagram below representing a period in the life of an imaginary
production line. What is the MTBF formula to use for the period of interest to
represent the production line's reliability over that time?
If MTBF is the mean time between failure (MTBF applies to repairable systems;
MTTF, Mean Time To Failure, applies to unrepairable systems) the MTBF formula
would need to have time units in the top line and a count of failures on the bottom
line.

In the diagram you will see the MTBF formula that I finally settled on: Mean Time
Between Failure (MTBF) = Sum of Actual Operating Times 1, 2, 3, 4 divided by the No.
of Breakdowns during the Period of Interest.
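As a rough illustration of that formula, here is a minimal sketch in Python; the operating times and breakdown count are hypothetical stand-ins for the diagram's four operating periods and two breakdowns, not figures from the article.

# Hypothetical operating periods (hours) between stoppages in the period of interest
operating_times = [120.0, 80.0, 150.0, 90.0]   # Operate 1..4 from the diagram (assumed values)
breakdowns = 2                                  # failures counted in the period of interest

mtbf = sum(operating_times) / breakdowns
print(f"MTBF = {mtbf:.0f} hours")               # (120 + 80 + 150 + 90) / 2 = 220 hours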
But the MTBF value you get from that MTBF calculation changes depending on the
choices you make.
To arrive at a MTBF equation there are assumptions and options to consider. Like,
what event is, and is not, a failure? What power-on time do you consider to be
equipment operating time? When do you start and end the period of interest for
which you are doing the MTBF calculation?
Definition of Failure
To measure MTBF you need to count the failures. But some failures are out of your
control and you cannot influence them, like lightning strikes that fry equipment
electronics, or floods that cause short circuits, or if your utility provider turns off the
power or water supply. Do you include Acts-of-God in your MTBF calculation?

In reliability engineering a failure is considered to be any unwanted or
disappointing performance of the item/system being investigated. That definition
leaves failure wide open to interpretation.
Is a failure only ever a breakdown? Is a failure anytime the production line stops,
no matter the cause? Is a power black-out caused by the utility provider a failure
you should count? Is an operator error that stops production but does no other harm
a failure? Should you include all types of failures in your MTBF calculation? That
will give you a short MTBF value. Or do you remove certain categories of stoppages
when using an MTBF formula? That will give you a longer MTBF value. But which
categories do you and don't you count?
If, in the imaginary production plant timeline modelled above, you included Forced
Outage 1 along with the two breakdowns in the MTBF calculation, you would get an
MTBF one-third lower. That is a substantial impact on the MTBF value.
To make sense of a MTBF calculation you need to know what failures are included
and which failures are not. And you also need to understand why those choices
were made.
Definition of Operating Time
When is an item of plant or equipment operating?
Equipment parts are degraded by the applied stresses put on their atomic structure.
The greater the stress suffered, the greater the resulting impact on the item's
operating life. When a vehicle is stopped at red traffic lights the engine is running
under the least working load. The gearbox and the rest of the drive train are not in
use. When parts are under no stress their atomic structure suffers no harm. When
equipment working assemblies are under the least stress their parts last longer. For the
MTBF calculation of the vehicle do you include its idle times, or just the times it
carries a sufficiently high working load that causes stress in the parts?
Would you consider the equipment operating time for the MTBF calculation as any
time it was turned on, or only when it was suffering under working loads? If in the
MTBF formula you included all operating time from when the vehicle started, and
not only when the parts were under working stress, your MTBF value would be
higher. But that Mean Time Between Failure value would not be representative of
those vehicles that are continually working and hardly ever idling.
You cannot use MTBF as an indicator to compare the same equipment model,
assembly number or part number if they are suffering under different working
situations.
To make sense of a MTBF calculation you need to know the specific situation and
operating scenarios being measured.

Selecting the Time Period


Because you count failure events during a period of time in your MTBF calculation
the period of interest selected affects the resulting MTBF value.
In the above timeline the period used in the MTBF reliability analysis is through to
the end of the second breakdown. If I had chosen to make the time period through
to the end of Operate 4 the second breakdown would not be counted in the MTBF
calculation and I would double the MTBF value. By altering the date to exclude one
failure event I doubled MTBF; see what magic you can do with MTBF.
Notice how the two equipment breakdowns are well toward the right-hand end of
the timeline. Even though the first breakdown happened long into the period of
interest, the MTBF for the period does not recognise the dates of those failures. The
MTBF was outstanding up until the first breakdown, then it dropped, and it dropped
again with the second breakdown. An MTBF calculation presumes failures are
distributed evenly across the period, even though that is not the real historic truth. An
MTBF value is hardly ever honest about what actually happened.
To make sense of a MTBF calculation you need to know the time period selected.
You also need to know why that duration was used and not some other period.
Selecting the Equipment to Monitor
One more issue to consider with regard to MTBF is whether you measure a whole
process or measure individual equipment within a process. A complete process
suffers MTBF loss every time one of its critical items fails. If you have a problem
piece of plant that brings down the MTBF performance of the whole process, the
bad actor needs to be flagged as the performance-destroying cause.
The companies who take the whole line/process into MTBF calculations often
struggle to get a high MTBF due to bad actors failing within the system being
monitored. Those companies also need to measure individual equipment MTBF to
identify the problem plant so its failure causes can be addressed and the bad actor
made more reliable.
How to Protect Yourself from the MTBF Calculation Trap
MTBF calculations are a statistical trap easily fallen into. An MTBF value can be a
total fabrication. Some managers remove or add all kinds of MTBF parameters to
make their department look good (like not counting failures, changing period
lengths, and the like). But that is a falsehood. I once came across a company that
did not count stoppages of less than 8 hours duration in their MTBF calculation. They
weren't breakdowns, but they were forced outages over which they had full control.
What a joke, I thought. What an absolute rubbish way to run a business. You can
never improve a company if people tell lies about its performance and hide the
truth of where the troubles lie.

You need to get agreement across the company as to what can be called a failure,
what can be called operating time and what are the end points of the time period
being analysed before you can use MTBF values as a believable Production
Reliability KPI (Key Performance Indicator).
Maybe it is more sensible to have MTBF by categories, e.g. 1) mean time between
machinery/equipment breakdowns caused by internal events, 2) mean time
between operator-induced stoppages, 3) mean time between externally caused
outages that you cannot control, like power or water loss, and so on.
Your second best protection against misinterpreting and misunderstanding MTBF is
to have honest, rigid rules covering the choices and options that arise when doing a
MTBF calculation.
The very best protection is to also get the timeline of the period being analysed
showing all the events (and their explanations) that happened, and then ask a lot of
questions about the assumptions and decisions that were made, and not made, to
arrive at those MTBF values.

Calculating MTBF from data


Fred Schenkelberg
There are occasions when we have either field or test data that includes the
duration of operation and whether or not the unit failed. This can be, say, 10
large motors. For sake of argument, the test ran each motor for 1,000 hours
and when a motor failed it was repaired quickly and returned to the test.
There were 3 failures.
Sadly, this is all we need to calculate an estimate for the motor MTBF.
Total time divided by number of failures: in this case, 10 times 1,000 hours
for a total time of 10,000 hours. Divide 10,000 by the three failures to find
a 3,333 hr MTBF.
What I find interesting is I could find the same MTBF value using 10,000
motors each run for one hour. Or, the same MTBF if we ran one motor for
10,000 hours. If in each case there were three failures we would find an
MTBF of 3,333 hours.
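A quick sketch of that point: the same total-time-over-failures arithmetic gives an identical MTBF for all three test configurations (the numbers are the ones from the example above).

def mtbf(total_unit_hours, failures):
    # MTBF estimate assuming a constant failure rate: total operating time / number of failures
    return total_unit_hours / failures

print(mtbf(10 * 1_000, 3))       # 10 motors for 1,000 hours each  -> about 3,333 hours
print(mtbf(10_000 * 1, 3))       # 10,000 motors for 1 hour each   -> about 3,333 hours
print(mtbf(1 * 10_000, 3))       # 1 motor for 10,000 hours        -> about 3,333 hours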
Now that works perfectly well when there is a constant failure rate. Meaning
there is equal chance of failure each hour of operation. Old motors would
have the same chance of failure as brand new motors.

Of course, you know why I chose motors for the example. To reinforce the
idea that the chance of failure is not always constant. Be sure to think
about the failure mechanisms before using MTBF (or MTTF). If the failure rate
is time dependent then this simple calculation is not useful.
I used this example during a class last week and it seemed to spark a good
discussion. How have you explained MTBF to others? Any suggestions on
how to best describe what the MTBF value really means, or doesn't mean?
Excerpts
In my case, trying to calculate reliability and/or MTBF for a subsystem is very
frustrating. The advertised reliability of many components is based on the
OEM's projection or engineering analysis because no one wants to spend the
dollars and time required to truly test the component thoroughly. Trying to
validate their projection or engineering analysis at the system level usually
means that my only data is based on one or two failures for a series of tests
that accumulate a total of 30 to 40 hours of operation. For simplicity, I am
usually forced to ignore the conditions of testing (e.g., temperature, altitude,
load). Components that require a high level of reliability (R) and confidence
(C) need several hundred hours of operation to fully demonstrate R&C.
Hi Bill, I feel your pain. Vendors have to deal with many operating conditions
and use cases. They tend to do what is requested by the majority.
Unfortunately, so many seem happy with very poor information that those
who need and request better information are thwarted. I suggest we
continue to ask for meaningful information, educate our peers to do likewise,
and when all else fails do the testing ourselves.
Hi Fred, At the moment I am doing an internship at a company concerning
MTBF. My research is forcing the same question into my mind every time: Is it
even wise to help them calculate their MTBF? The failure rate is not constant at all,
as they mainly produce flowmeters. MTBF for me is not an estimation of how
long an asset will last at all; for me it says more about the improvement or
decline of the reliability of an asset or system. Your topic intrigued me as I
am starting to very much agree about whether it is wise to use MTBF at all. I
would be very excited if you could tell me your experience with other,
similar and preferably more representative metrics.
Hi Stefan, you should be concerned, as MTBF most likely is misleading or does not
represent the actual failure rate at any particular time of interest. Instead
use reliability, the probability of success at a specific duration: 98% reliable over
1 year, for example. Use multiple points in time, or better, if you have the
time-to-failure information, fit a Weibull distribution (or another appropriate
distribution) and have the entire picture of probability of failure over time in
a CDF plot.
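A minimal sketch of what Fred suggests, using scipy to fit a two-parameter Weibull distribution to some hypothetical (complete, uncensored) failure times and read off the reliability at a duration of interest; a real analysis would also have to handle censored units.

from scipy.stats import weibull_min

# Hypothetical times to failure in hours (all units failed; no censoring)
failure_times = [820, 1150, 1350, 1620, 1780, 2100, 2400, 2900]

# Fit a 2-parameter Weibull (location fixed at 0)
shape, loc, scale = weibull_min.fit(failure_times, floc=0)

# Reliability (probability of surviving) at 1,000 hours
t = 1000
reliability = weibull_min.sf(t, shape, loc=0, scale=scale)
print(f"beta (shape) = {shape:.2f}, eta (scale) = {scale:.0f} h")
print(f"R({t} h) = {reliability:.2%}")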
Hi Fred, Thank you a lot! I will have a look into that; this is very useful and
fun information for me to work with! Thanks again.
Hi Fred, I appreciate your site a lot. How important a mission it is one can
understand by searching the internet for examples of MTTF calculation and FIT.
I've started reading about hazard rate, failure rate, MTTF etc. but can't
find any advice on how to interpret test data toward obtaining a failure rate. Let
me put an example here: I'm testing 10 devices (a non-repairable system) over
e.g. 400 hrs. Recorded failure times in hrs: {30, 45, 60, 90, 120, 180, 240,
300}; 2 devices survived. Can I say my failure rate is 8/
(30+45+60+90+120+180+240+300+2*400) and MTTF is the reciprocal of the
failure rate? Or maybe I shouldn't even try to calculate a failure rate from
this data? Does this method imply any problems with the reliability calculation? I
agree that the mean value for a particular distribution yields a different reliability, but
please advise how to process this data in the correct way. I appreciate your
feedback in advance.
Hi Rafal, thanks for the note and example problem. While you can estimate
the failure rate and MTTF as described, it is not all that useful in most
cases. Instead use a Weibull analysis (with so few data points Weibull is often
a great starting point as it is versatile). This will provide the probability of
failure at various points in time. I'm traveling at the moment and have
limited access, so will follow up later when I can either work out the problem
for you, or point to a better reference and example.
Hi Fred, Thank you for your interest. Meanwhile I was sitting and struggling to
understand the meaning of failure rate and its interpretation. I think it can be a
good supplement to the previous question if I ask whether, for example, a failure
rate of 0.004 fail/hr, or in other words 4 fails per 1,000 hrs, means 4 of them
will fail every 1,000 hrs, assuming an exponential distribution and constant
hazard rate. It also means that if I had 4 components then I could expect
none of them functioning after 1,000 hrs, but it also means that if I had 1,000
components 4 of them will fail within 1,000 hrs and 996 will remain healthy
for the next 1,000 hrs? I read somewhere a failure rate example: FR = 0.1
means 10% of the population will fail every time step, and in fact it plots an
exponential curve, but in this case, having a specific number of devices, e.g.
1,000 components, it linearly drops to 0 after 250,000 hrs have gone (1000 *
250), where 250 is the mean time to fail. Maybe it shouldn't be understood so
straightforwardly? Maybe if it is an average value and follows an assumed
exponential pdf then we can say on average 4/1000 hrs fail, but in our case
37% will fail in the first 250 hrs and the remaining 63% within the next 250,000 - 250 =
249,750 hrs. If this is correct I'm home, if not I'm lost.
Availability, MTBF, MTTR and other bedtime tales
If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
Of course, it's more interesting when you start looking at the things that
influence uptime and downtime. The most common measures that can be
used in this way are MTBF and MTTR.
MTBF is Mean Time Between Failures
MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
One interesting observation you can make when reading this formula is that
if you could instantly repair everything (MTTR = 0), then it wouldn't matter
what the MTBF is - Availability would be 100% (1) all the time.
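A small sketch of the formula, showing how availability climbs toward 1 as MTTR shrinks; the MTBF and MTTR values are illustrative only, not taken from the article.

def availability(mtbf_hours, mttr_hours):
    # A = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 2000.0                                   # assumed mean time between failures, hours
for mttr in (24.0, 1.0, 0.01, 0.0):             # repair times from a day down to "instant"
    print(f"MTTR = {mttr:>5} h -> A = {availability(mtbf, mttr):.6f}")
# As MTTR approaches 0, A approaches 1 regardless of the MTBF.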
That's exactly what HA clustering tries to do. It tries to make the MTTR as
close to zero as it can by automatically (autonomically) switching in
redundant components for failed components as fast as it can. Depending
on the application architecture and how fast failure can be detected and
repaired, a given failure might not be observable at all by a client of the
service. If it's not observable by the client, then in some sense it didn't
happen at all. This idea of viewing things from the client's perspective is an
important one in a practical sense, and I'll talk about that some more later
on.
It's important to realize that any given data center, or cluster provides many
services, and not all of them are related to each other. Failure of one
component in the system may not cause failure of the system. Indeed, good
HA design eliminates single points of failure by introducing redundancy. If
you're going to try and calculate MTBF in a real-life (meaning complex)
environment with redundancy and interrelated services, it's going to be very
complicated to do.

MTBFx is Mean Time Between Failures for entity x
MTTRx is Mean Time To Repair for entity x
Ax is the Availability of entity x
Ax = MTBFx / (MTBFx + MTTRx)
In practice, these measures (MTBFx and MTTRx) are hard to come by for
nontrivial real systems - in fact, they're so tied in to application reliability and
architecture, hardware architecture, deployment strategy, operational skill
and training, and a whole host of other factors, that you can actually
compute them only very very rarely. So, why did I spend your time talking
about it? That's simple - although you probably won't compute them, you
can learn some important things from these formulas, and you can see how
mistakes you make in viewing these formulas might lead you to some wrong
conclusions.
Let's get right into one example of a wrong conclusion you might draw from
incorrectly applying these formulas.
Let's say we have a service which runs on a single machine, which you put
onto a cluster composed of two computers with a certain individual MTBF
(Mi) and you can fail over to the other computer ("repair") a computer in a
certain repair time (Ri). With two computers, they'll fail twice as often as a
single computer, so the system MTBF becomes Mi/2. If you compute the
availability of the cluster, it then becomes:
A = Mi/2 / (Mi/2+Ri)
Using this (incorrect) analysis for a 1000 node cluster performing the same
service, the system MTBF becomes Mi/1000.
A = Mi/1000 / (Mi/1000+Ri)
If you take the number of nodes in the cluster to the limit (approaching
infinity), the Availability approaches zero.
A = 0/(0+Ri) = 0/Ri = 0
This makes it appear that adding cluster nodes decreases availability. Is this
really true? Of course not! The mistake here is thinking that the service
needed all those cluster nodes to make it go. If your service was a
complicated interlocking scientific computation that would stop if any cluster

node failed, then this model might be correct. But if the other nodes were
providing redundancy or unrelated services, then they would have no effect
on MTBF of the service in question. Of course, as they break, you'd have to
repair them, which would mean replacing systems more and more often,
which would be both annoying and expensive, but it wouldn't cause the
service availability to go down.
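To make the distinction concrete, here is a sketch (my own, not from the article) contrasting the naive "every node is essential" model with a redundant "any one node is enough" model; the node MTBF Mi and repair time Ri are arbitrary illustrative values and the redundant formula assumes independent node failures.

Mi = 10_000.0   # assumed MTBF of one node, hours
Ri = 1.0        # assumed repair/failover time, hours
A_node = Mi / (Mi + Ri)

for n in (1, 2, 10, 1000):
    # Naive model: the service needs every node, so node failures add up (system MTBF = Mi/n)
    a_all_required = (Mi / n) / (Mi / n + Ri)
    # Redundant model: the service survives as long as at least one node is up
    a_any_one_ok = 1 - (1 - A_node) ** n
    print(f"n={n:>4}  all-required A={a_all_required:.6f}  redundant A={a_any_one_ok:.10f}")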
To properly apply these formulas, even intuitively, you need to make sure
you understand what your service is, how you define a failure, how the
service components relate to each other, and what happens when one of
them fails. Here are a few rules of thumb for thinking about availability:
Complexity is the enemy of reliability (MTBF). This can take many forms:
Complex software fails more often than simple software
Complex hardware fails more often than simple hardware
Software dependencies usually mean that if any component fails, the whole
service fails
Configuration complexity lowers the chances of the configuration being
correct
Complexity drastically increases the possibility of human error
What is complex software? - Software whose model of the universe doesn't
match that of the staff who manage it.
Redundancy is the friend of availability - it allows for quick autonomic
recovery - significantly improving MTTR. Replication is another word for
redundancy.
Good failure detection is vital - HA and other autonomic software can only
recover from failures it detects. Undetected failures have human-speed
MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically
recovered and they respond more slowly than normal. In addition, the added
complexity of correcting an autonomic service and trying to keep their
fingers out of the gears may slow down their thought processes.
Non-essential components don't count - failure of inactive or non-essential
components doesn't affect service availability. These inactive components
can be hardware (spare machines), or software (like administrative
interfaces), or hardware only being used to run non-essential software. More
generally, for the purpose of calculating the availability of service X, non-essential
components include anything not running service X or services
essential to X.
The real world is much more complex than any simple rules of thumb like
these, but these are certainly worth taking into account.

Defining Failure: What Is MTTR, MTTF, and MTBF?


Most IT professionals are used to talking about uptime, downtime, and
system failure. But not everyone is entirely clear on the definition of the
terms widely used in the industry. What exactly differentiates mean time to
failure from mean time between failures? And how does mean time to
repair play into it? Let's get some definitions straight!
Definition of a Failure
I suppose it is wise to begin by considering what exactly qualifies as a
failure. Clearly, if the system is down, it has failed. But what about a
system running in degraded mode, such as a RAID array that is rebuilding?
And what about systems that are intentionally brought off-line?
Technically speaking, a failure is declared when the system does not meet its
desired objectives. When it comes to IT systems, including disk storage, this
generally means an outage or downtime. But I have experienced situations
where the system was running so slowly that it should be considered failed
even though it was technically still up. Therefore, I consider any system
that cannot meet minimum performance or availability requirements to be
failed.
Similarly, a return to normal operations signals the end of downtime or
system failure. Perhaps the system is still in a degraded mode, with some
nodes or data protection systems not yet online, but if it is available for
normal use I would consider it to be non-failed.

MTBF is the sum of MTTR and MTTF


Mean Time to Failure (MTTF)

The first metric that we should understand is the time that a system is not
failed, or is available. Often referred to as uptime in the IT industry, the
length of time that a system is online between outages or failures can be
thought of as the time to failure for that system.

For example, if I bring my RAID array online on Monday at noon and the
system functions normally until a disk failure Friday at noon, it was
available for exactly 96 hours. If this happens every week, with repairs
lasting from Friday noon until Monday noon, I could average these numbers
to reach a mean time to failure or MTTF of 96 hours. I would probably
also call my system vendor and demand that they replace this horribly
unreliable device!

Most systems only occasionally fail, so it is important to think of reliability in
statistical terms. Manufacturers often run controlled tests to see how reliable
a device is expected to be, and sometimes report these results to buyers.
This is a good indication of the reliability of a device, as long as these
manufacturer tests are reasonably accurate. Unfortunately, many vendors
refer to this metric as mean time between failure (MTBF), which is incorrect
as we shall soon see.

Note too that MTTF often exceeds the expected lifetime or usefulness of a
device by a good margin. A typical hard disk drive might list an MTTF of
1,000,000 hours, or over 100 years. But no one should expect a given hard
disk drive to last this long. In fact, disk replacement rate is much higher than
disk failure rate!
Mean Time to Repair (MTTR)

Many vendors suppose that repairs are instantaneous or non-existent, but IT
professionals know that this is not the case. In fact, I might still be a systems
administrator if it wasn't for the fact that I had to spend hours in freezing
cold datacenters trying to repair failed systems! The amount of time required
to repair a system and bring it back online is the time to repair, another
critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This
leaves three days, or 72 hours, to get things operational again. Over time,
we would come to expect a mean time to repair or MTTR of 72 hours for
any typical failure. Again, we would be justified in complaining to the vendor
at this point.
Repairs can be excruciating, but they often do not take anywhere near as
long as this. In fact, most computer systems and devices are wonderfully
reliable, with MTTF measured in months or years. But when things do go
wrong, it can often take quite a while to diagnose, replace, or repair the
failure. Even so, MTTR in IT systems tends to be measured in hours rather
than days.
Mean Time between Failures (MTBF)
The most common failure related metric is also mostly used incorrectly.
Mean time between failures or MTBF refers to the amount of time that
elapses between one failure and the next. Mathematically, this is the sum of
MTTF and MTTR, the total time required for a device to fail and that failure to
be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of
72 hours would have an MTBF of one week, or 168 hours. But many disk
drives only fail once in their life, and most never fail. So manufacturers don't
bother to talk about MTTR and instead use MTBF as a shorthand for average
failure rate over time. In other words, MTBF often reflects the number of
drives that fail rather than the rate at which they fail!
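The relationship described here reduces to a one-line sum; a minimal sketch using the flaky RAID array's numbers from the example above:

mttf = 96.0    # hours up before a failure (from the example)
mttr = 72.0    # hours to get it running again (from the example)

mtbf = mttf + mttr                    # time from one failure to the next
avail = mttf / (mttf + mttr)          # fraction of time the array is usable
print(f"MTBF = {mtbf:.0f} h, availability = {avail:.1%}")   # 168 h, 57.1%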
Stephen's Stance
Most computer industry vendors use the term MTBF rather indiscriminately.
But IT pros know that systems do not magically repair themselves, at least
not yet, so MTTR and MTTF are just as important!

A Question and my Response on MTTF


So this corrosion engineer walks into NoMTBF and sends me a message.
The Question: Hi, I am a corrosion engineer. Maybe you know that for risk
assessment of heat exchanger tube bundles in API-581, the mean time to failure
(MTTF) term is defined and used for risk assessment. Would you please give
me more information about MTTF and what history data is required to calculate
MTTF? Thank you so much.
My Response: Hi, First off, MTTF and similar metrics are used for situations
with a constant failure rate, meaning that every hour a piece of equipment
has the same chance of failure as any other hour, anytime. This is generally
not true and certainly not true for corrosion failures. When the right
conditions exist, corrosion starts, grows and eventually over time leads to
failures. The older the equipment the more likely it will fail due to corrosion,
thus not a constant failure rate. My advice is to avoid using MTTF or MTBF. I
would take a look at the models and data you have and use a Weibull or other
life data distribution to model the time to failure. From there you can convert
to MTTF, although it will not be meaningful during the first half of the lifetime,
generally by a wide margin. I would ask the risk analysis folks what time
frames they need failure rate information for and provide estimates suitable for
each time frame. An overall MTTF is pretty misleading and may alter the risk
assessment results. If you'd like to talk about better ways to work between
reliability and risk assessments, let me know. Just dawned on me I didn't
answer your question. The data you need for the calculation is the total hours
of operation of the equipment divided by the number of failures, pretty
simple. So if you have 100 pumps, and all but one run for 100 hours, and the
one fails at, say, 50 hours, then the calculation is ((99 x 100) + (1 x 50)) / 1
failure = an MTTF of 9,950 hours. If there are no failures, still tally operating
hours and divide by one (rather than zero, when bad things mathematically
occur).
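A sketch of the arithmetic in that reply: surviving units still contribute their operating hours, and a zero-failure case falls back to dividing by one.

def simple_mttf(unit_hours, failures):
    # Total operating hours across all units divided by the failure count;
    # with zero failures, divide by one so the estimate stays finite (per the reply above).
    return sum(unit_hours) / max(failures, 1)

hours = [100.0] * 99 + [50.0]          # 99 pumps ran 100 h each, one failed at 50 h
print(simple_mttf(hours, failures=1))  # 9950.0 hours
print(simple_mttf(hours, failures=0))  # 9950.0 hours (no failures observed)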
Summary
Two things.
Be sure what you are measuring and reporting using MTTF actually has a
constant failure rate, or close enough to constant that it doesn't matter. Send
over your questions and maybe become the next NoMTBF blog post. Third
and most important, if you have a NoMTBF button or mug or whatever,
please take a picture of it in the wild and send it over. Also send along a short
note on how it has helped start conversations around MTBF. I'll create a page
on the site showcasing the ways these devices are starting conversations.

MTBF and MTTF Definition(s)


January 31, 2014
MTBF
Fred Schenkelberg

Recently Glenn S. asked if I had a reference for clear definitions of MTBF and
MTTF. After a bit of a search I sent him a definition or two, meanwhile he
gathered a few more.

They are all basically the same, with some slight differences. What is
interesting to me is the amount of variability in the interpretation and
understanding.

Here's the list Glenn collected:

This is what I've compiled thus far on definitions for MTTF and MTBF. There
seem to be some variations in the definitions.

From a 1974 paper on DTIC.

Mean Time Between Failures (MTBF): The average operating time expected
between failures in a population of identical components. This measure has
meaning only when we are discussing a population where there is repair or
replacement.

Mean Time To Failure (MTTF): The average operating time expected before
failure of a component which is not repaired or replaced. This is simply the
average time to failure of n units, i.e., the sum of n individual unit times
to failure divided by n units.

MIL-HDBK-721 and MIL-STD-109 (Both Rescinded)

MEAN-TIME-TO-FAILURE (MTTF): A basic measure of reliability for non-repairable items: The total number of life units of an item divided by the total
number of failures within that population, during a particular measurement
interval under stated conditions.

MEAN-TIME-BETWEEN-FAILURE (MTBF): A basic measure of reliability for
repairable items: The mean number of life units during which all parts of the
item perform within their specified limits, during a particular measurement
interval under stated conditions.

Internet PDF File of definitions

Mean Time to Failure. MTTF is the expected value (mean) of an item's failure-free
operating time. It is obtained from the reliability function R(t) as
MTTF = ∫ R(t) dt (integrated from 0 to ∞), with TL as the upper limit of the integral
if the lifetime is limited to TL (R(t) = 0 for t > TL). MTTF applies to both non-repairable
and repairable items if one assumes that after a repair the item is as-good-as-new.
If this is not the case, a new MTTF (MTTFsi, starting from state Zj) can be considered
(Table 6.2). An unbiased (empirical) estimate for MTTF is MTTF = (t1 + ... + tn)/n,
where t1, ..., tn are observed failure-free operating times of statistically identical,
independent items.
MTBF = 1/λ. MTBF should be reserved for items with a constant failure rate λ. In
this case, MTBF = 1/λ is the expected value (mean) of the exponentially
distributed item's failure-free operating time, as expressed by Eqs. (1.9) and
(A6.84). The definition given here agrees with the statistical methods
generally used to estimate or demonstrate an MTBF. In particular MTBF = T/
k, where T is the given, fixed cumulative operating time (cumulated over an
arbitrary number of statistically identical and independent items) and k the
total number of failures (failed items) during T. The use of MTBF for mean
operating time between failures (or, as formerly, for mean time between
failures) has caused misuses (see the remarks on pp. 7, 318, 327, 416) and
should be dropped. The distinction often made between repairable and non-repairable items should also be avoided (see MTTF).
Electropedia (http://www.electropedia.org/)
Mean Operating Time Between Failures (MTBF): the expectation of the
operating time between failures
Mean Time to Failure (MTTF): the expectation of the time to failure
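As a brief numerical check of the integral definition quoted above, here is a sketch assuming a Weibull reliability function; for a Weibull with shape beta and scale eta, the closed form is MTTF = eta * Gamma(1 + 1/beta), and the numbers chosen are arbitrary.

from scipy.integrate import quad
from scipy.special import gamma
from scipy.stats import weibull_min

beta, eta = 1.5, 1000.0                         # assumed Weibull shape and scale (hours)

# MTTF as the integral of the reliability (survival) function from 0 to infinity
mttf_numeric, _ = quad(lambda t: weibull_min.sf(t, beta, scale=eta), 0, float("inf"))
mttf_closed = eta * gamma(1 + 1 / beta)

print(f"numerical  : {mttf_numeric:.1f} h")
print(f"closed form: {mttf_closed:.1f} h")      # both about 902.7 h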

Difference between hazard and failure rate


November 20, 2013
Engineering
Fred Schenkelberg

I too have found these terms used interchangeably in many papers and
references.

(This note is in response to a question on a forum asking about the
difference between these two terms. The question prompted some
interesting discussion and no clear resolution, as various authors and
authoritative works do not seem to agree. Therefore, I highly recommend
you define these terms with those with whom you converse.) While I do not have a
definitive source for the difference, I have this working understanding. The
hazard rate is a function, the function that describes the conditional
probability of failure in the next instant given survival up to a point in time, t:
h(t) = f(t) / R(t). Thus hazard rate is a value from 0 to 1. Failure rate is
broken down a couple of ways. Instantaneous failure rate is the probability of
failure at some specific point in time (or the limit, with continuous functions). It is
the chance of failure calculated by h(t) for a specific t. Failure rate can also
be an average chance of failure over some period of time, not as precise yet
very commonly used. I try to avoid this as it presumes a constant failure rate
over the duration. And some don't even provide the duration and simply
state a failure rate per hour, for example. Well, over which group of hours
(like a year) does this rate apply? Failure rate can also be the count of failures per
unit time, and can be a value greater than one. For example, 2 failures per
year, or a 200% annual failure rate. Given the number of different ways to
interpret the term failure rate, I suppose we should be careful. My definitions
may not clear up any confusion, it's just the way I think about these terms.
cheers, Fred http://creprep.wordpress.com/2011/09/20/the-four-functions/
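As a footnote to the distinction Fred draws, a minimal sketch that evaluates the hazard function h(t) = f(t)/R(t) of a Weibull at a few ages and compares it with an average failure rate over an interval; the shape and scale are arbitrary assumed values.

from scipy.stats import weibull_min

beta, eta = 2.0, 1000.0                 # assumed wear-out behaviour (beta > 1)

def hazard(t):
    # h(t) = f(t) / R(t): conditional failure intensity given survival to time t
    return weibull_min.pdf(t, beta, scale=eta) / weibull_min.sf(t, beta, scale=eta)

for t in (100, 500, 1000, 2000):
    print(f"h({t:>4} h) = {hazard(t):.5f} per hour")

# One common "average failure rate": fraction failing in an interval divided by its length
avg_rate = weibull_min.cdf(1000, beta, scale=eta) / 1000
print(f"average rate over first 1000 h = {avg_rate:.5f} per hour")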

Difficulties in calculating MTBF and reliability


Rory Dear, European Editor/Technical Contributor



Mean Time Between Failure (MTBF) is an important metric for determining
the lifetime of embedded system components, but it's difficult to calculate
with accurate results. These difficulties lead to unreliable figures, which has
generated a backlash against such calculations.

One typically defines reliability as the probability that said device will
perform functionally as required for a specified period of time. This all seems
rather simplistic, and it can be, to a degree, with a large enough sample size
and a long enough period of time. The main issue with deriving such figures
is that they are required for a product's release, not at the end of its lifetime
when actual reliability can be determined.

To retrospectively calculate the reliability a component or device provided
over its lifetime is fairly rudimentary math: total time/total failures. This is
all well and good when proving near-obsolete products and potentially useful
to prove the reliability of a typical product, but new integrators want to know
how reliable this product is, not the previous incarnation.

Increasingly often, beyond a general acceptance of the estimated lifetime of
industrial electronics, reliability is specified upfront at the earliest
specification stage. Whilst this is more than logical (for instance, one must
decide a warranty period based on an estimated lifetime), we now need to
quantify that reliability. This is where our old friend, or increasingly, enemy,
Mean Time Between Failure (MTBF) comes in. MTBF and asset life
increasingly go hand in hand, but how accurate are any of these figures and
what do they actually tell us?
It's also worth pointing out MTBF's less common cousin Mean Time To Failure
(MTTF), which differs in that the latter is generally used for an irreparable
product, so is used more often for atomic components rather than an
assembled product. MTTF is calculated as total time/number of units.
Both have gaping holes in their accuracy; reliability of a given individual unit
is a hugely complex calculation. To provide an example of this minefield, a
client recently asked if their bespoke product we manufacture is suitable for
a 10-year asset life. By querying this, they wanted us to provide evidence of
a 10-year MTBF.

Interestingly, what would seem the most logical way to calculate MTBF gave
the most bizarre result! Given the product has been in manufacture and
deployed for more than 5 years, unlike a new product, we had the gift of
substantial historical data. Unfortunately, that data of approximately 5,000
units, deployed over an average of 3 years, with around 14 failures provides
an MTBF of more than 1,000 years!
As much as I'd like to gloat about our bespoke product's reliability (and my
figures will entirely support that this is a true MTBF figure), no one could
realistically believe even the materials the product is constructed from will
survive this length of time, though that could well be true of the plastic
enclosure!
The second, perhaps more realistic method only considers one component:
the weakest link. It's perfectly logical that, by definition, the weakest link
is the most likely to fail, and thus most likely to fail first. So should no
calculation exist at all, and this figure just be passed through to the final
product?
The way in which MTBF is presented I liken to how automobile manufacturers
declare fuel consumption figures. Never in history has the real-world MPG
achieved in a vehicle actually matched the extravagant claims of the
manufacturer, as this figure was obtained in a far from real-world test with
vents sealed, no wind, etc. Likewise, a component manufacturer's MTBF is
unlikely to encompass all, or any, of the extraneous factors that will affect it,
be that humidity, temperature, vibration, or shock. What these constants
were during testing is almost never documented, thus any particular MTBF
figure is rarely comparable to the next. Unfortunately, this regress follows to
the final product; MTBF simply doesn't cover the expected usage conditions
or what the product lifetime should be.
The calculation of reliability and likelihood of failure has been studied in
depth. Well-known, observable phenomena such as the bathtub effect are
well documented but very difficult to encompass into a single hours
integer. Weibull analysis, determining where a population of product
currently lies in the bathtub, is well worth researching further alongside
Accelerated Life Testing, which tries to encompass an individual unit's passage
of time, though not quite for a millennium!

Using MTBF to Determine Maintenance Interval Frequency Is Wrong

Collecting failure data to calculate mean time between failures (MTBF) in order
to determine accurate maintenance task intervals is wrong and should not be
done. MTBF is a measure of reliability. It is a measurement of the time
between two successive failure events.

Failures fall predominantly into two categories: age related and random.
Typically, age related make up less than 20 percent of all failures while
random make up 80 percent or more.

For age related failures, it is not MTBF, but rather useful life that is significant
when attempting to determine maintenance task intervals to avoid failures.
There is a point in a piece of equipment's lifetime at which there is a rapid
increase in its conditional probability of failure. The measurement between
the point when the equipment is installed and the point where the
conditional probability of failure begins to sharply increase is the useful life
of the equipment. It is different than MTBF. The MTBF is defined as the
average life of all the population of that item in that service.

If we want to prevent a failure from occurring, using traditional preventive
maintenance, we would intervene just prior to the end of the equipment's
useful life, not just prior to MTBF. Incorrectly using MTBF to determine the
preventive maintenance interval will result in approximately 50 percent of all
failures occurring before the maintenance intervention. In addition,
approximately 50 percent of the remaining components that have additional
life will receive unnecessary maintenance attention; in both cases, not a
very effective maintenance program. Therefore we need to use useful life
and not MTBF when looking at age related failures and determining the
frequency of preventive maintenance tasks.
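A small simulation (my own sketch, not from the article) of the point above: for an age-related failure mode with lives scattered around a mean, replacing at the mean life still lets roughly half the failures happen before the intervention. The distribution and its parameters are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical age-related failure mode: lives normally distributed around 4 years
lives = rng.normal(loc=4.0, scale=0.8, size=100_000)
mean_life = lives.mean()                    # this is what an MTBF-style average reports

replace_at = mean_life                      # preventive replacement scheduled at the mean
failed_before_pm = np.mean(lives < replace_at)

print(f"mean life (MTBF-like) = {mean_life:.2f} years")
print(f"fraction failing before the scheduled replacement = {failed_before_pm:.0%}")  # about 50%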

Random failures make up the vast majority of failures on complex equipment,
as research has shown. For example, consider the failure of a component.
Assume that each time the component failed we tracked the length of time it
was in service. The first time the component is put into service it fails after 4
years, the second time after 6 years, and the third time after only 2 years (4
+ 6 + 2 = 12; 12/3 = 4). We know that the average lifespan of the component is
4 years (its MTBF is 4 years).

However, we do not know when the next component will fail. Therefore we
cannot successfully manage this failure by traditional time-based
maintenance (scheduled overhaul or replacement). It is important to know
the condition of the component and the life remaining before failure; in other
words, how fast can the component go from being OK to NOT OK. This is
sometimes referred to as the failure development period or potential failure
to functional failure (P-F) interval.

If the time from when the component initially develops signs of failure to the
time when it fails is 4 months, then maintenance inspections must be
performed at intervals of less than 4 months in order to catch the
degradation of the component condition. The inspection also must be
performed often enough to provide sufficient lead time to fix the equipment
before it functionally fails. In this case, we might want to schedule the
inspection every 2 months. This would ensure we catch the failure in the
process of occurring and give us approximately 2 months to schedule and
plan the repair.
Failure prevention requires the use of some form of condition-based
maintenance at appropriate inspection intervals (failure finding, visual
inspections, and predictive technology inspections).
My experience has been that for every $1 million in asset value as many as
150 condition inspection points must be monitored. Gathering and analyzing
condition monitoring data to identify impending failure for assets worth
billions of dollars is practically impossible without the use of reliability
software.
The reliability software you choose should be able to:

collect equipment condition data from controls, sensors, data historians,
predictive maintenance technologies, and visual inspections
use single or multiple data points to analyze the data, applying defined
rules and calculations to get a true picture of equipment health
perform the calculations and conduct the analysis automatically
present results visually through flashing alarms and trending graphs,
identifying potential failures and recommending corrective actions before
the equipment fails.

How to use MTTF?


Hi all, I understand that MTTF applies to the failure of a component of a big system,
where the component is replaced. We can use Minitab to estimate the MTTF. 1. If the
calculated MTTF = 300 hrs, what does it mean in layman's terms? (Is it that 50% of
the components will fail when operations reach 300 hrs?) 2. Another question:
if a vendor publishes its component MTBF as 500 hrs, do I change all
components that have reached 500 hrs? (Assuming the component cost and the
replacement cost are negligible.)
BH,
If this helps, here are some thoughts on the way I use MTTF, MTBF and MTTR
et al. This information is based on my own experience of using these
measures and I would not say the definitions are textbook, rather, user
friendly :)
MTTF
Mean Time To Failure (MTTF) is the average time that you would expect
a piece of plant to run before it fails. It is a simple indicator of
plant reliability. Example:
Plant runs for 500 hours. It breaks down 5 times during that period.
MTTF = 500/5 = 100 hours.
MTBF

Mean Time Between Failure (MTBF) is the average time between failures of a
plant, including time lost whilst repairs are undertaken. It is an
indicator of the combined reliability and maintenance
effectiveness/efficiency. Example:
Plant runs for 500 hours, has 200 hours downtime due to 5 failures.
MTBF = (500+200)/200= 140 hours
MTTR
Mean Time To Repair (MTTR) is the time, on average, that you would expect
a stoppage to last including time spent waiting for maintenance engineer,
diagnosis, waiting for parts, actual repair and testing. It is an indicator of
maintenance effectiveness/efficiency. Note that there is another indicator
that can be used here, Mean Corrective Repair Time (MCRT). MCRT is only
interested in the actual repair time assuming all tools, spares and required
manpower are available (efficiency). MTTR-MCRT= Waste and therefore gives
you an indication of how ineffective your stores and maintenance resource
strategy is.
Example.
As above, plant runs for 500 hours, has 200 hours downtime due to 5
failures.
MTTR= 200/5= 40 hours
Therefore MTBF = MTTF + MTTR.
Availability
MTTF, MTBF and MTTR can also be used to calculate the availability, or more
importantly, the unavailability of the plant due to maintenance causing
failures. Example.
As above, the plant runs 500 hours, breaks down 5 times and 200 hours are
spent waiting for repair/spares/repairing.
Availability:
= ((500 + 200)-200)/(500+200)
= (700-200)/700
= 71%

Or
Availability
= MTTF/ MTBF or MTTF/(MTTF+MTTR)
=100/140 = 71% or 100/(100+40) =71%
Unavailability
= 1 - availability
= 29%
Or
= 200/ (500 +200)
= 29%
Or
MTTR/ MTBF or MTTR/ (MTTF+MTTR)
= 40/140 or 40/ (100 +40)
= 29%
Summary
As a maintenance engineer your objectives must be to ensure that the plant
is available to be used as and when required by production, with no
compromise of quality and safety, all at minimum total cost. There are
indicators that can be used to measure performance against the aforementioned criteria. MTTF is an indicator of plant reliability and should be as
long as possible. Increased MTTF can be achieved through an effective
preventative and/or predictive maintenance plan. MTTR should be as short as
possible and is an indicator of the combined effects of the maintenance
strategy, organisation and systems used. MTTR can be reduced by having
the right person, with the right tools, with the right spares, in the right place
at the right time (sound familiar?). MTBF should be as long as possible and
is an overall indicator of reliability and effectiveness/efficiency.
Hi, MTBF = (500+200)/200= 140 hours should be changed to
(500+200)/5=140

Formula for MTBF


Which formula do you use to calculate MTBF?
a) MTBF = operating time / no. of failures in between, where operating time
= current date - date of first production. Of course, running hours are better if
available. b) MTBF = (start date of last failure - start date of first failure)/(No.
of failures - 1). This leaves out the running time before the first failure and after the
final failure from the calculation. Any advantage in doing this?
MTBF is usually applied to a group of similar equipment, for example all the
pumps in a refinery. The formula for this is (NUMBER OF PIECES OF
EQUIPMENT X TIME PERIOD) / NUMBER OF FAILURES DURING THAT TIME
Example: 1200 Pumps over one year with a total of 387 failures during the
year. (1200 X 12 (months)) / 387 = 37 months MTBF. For a single item, it is
just the time period / number of failures. Example: a pump failed twice in one
year; the MTBF would be 12 (months) / 2 = 6 months MTBF.
Yes, Cheddar, that's what I think but somebody said the second formula
above should be used because we cannot find the exact start and end dates.
Also it's time between Failures, not between the start point and end point. So
I just want to confirm what formula others use to calculate their MTBF,
especially the published MTBF for pumps, which can be increased from 1 year
to 3 to 4 years.
Although there is some controversy among peers with regard to what is
understood by the acronyms MTTF and MTBF, the attached simple example
illustrates the way I have long treated operating data, with satisfaction from
users and no misunderstandings along the way. From a management
perspective, the data accommodates any failure mode. The method can be
easily implemented in software and the indicators made available on a
timely basis. Of course, many more indicators can be developed at the
equipment level to reflect progress in the pursuit of objectives set by
management.
I wasn't being all that serious before. Wearing out the end of one's boot by
kicking a_- may not be exactly correct. Machines are like people, people are
like snowflakes - no two alike. You may take an exact model pump and
subject it to different environments and industries and come up with
different failure rates. Sitting in an office coming up with statistical data

doesn't do anything and I've never seen a computer pull a wrench although I
have seen it done remotely. Generally the weakest link is the equipment's
bearing. It is the first thing to fail usually. Rotors are generally rugged and up
to the task as is the shaft and other components. What's the L10 life of the
bearing? Do you realize it? You should! If not roll up your sleeves and get to
work: why isn't it lasting longer? Maximum is 7 years and has L10 of 12 then
its environment may be that 7 years MTBF is max life?????? But by getting
out there and interacting with people and training on alignment and
machinery setup will get you to max MTBF which will be reflected by smiling
faces. It has to go beyond compiling data to get an attaboy. You'll know the
MTBF when the machine has reached its max life maybe L10 +20%. Also
your paycheck should increase. Are your machines running longer now than
they were two years ago? I personally had first hand knowledge of a PdM
program where a guy got a $15,000 bonus for a successful program - he
didn't have a program. In 6 years he had written 2 reports totalling 3 pages
and no improvements but he had attaboys and pull or draft as the case may
be. I recently retained a contract where the head guy wanted to go with a
cheaper service vendor. The techs said, don't screw with what works, we
have planned scheduled maintenance and no call-outs and it's good that
way. Still there. When I started failures were ~1.5 yrs and now ~6-9 on all. Of
course lightning is hard to predict. We've got so many acronyms now they
are hard to keep up with. Seems there's 5-10 acronyms for every icon and
it's growing. My soapbox is getting shaky about now.
L10 rating is largely... inadequate... even when just considering the fatigue
question. Forget not the effect of contaminants, lubrication, etc., etc. In the
paper I presented at the CMVA last year (Canadian equivalent to VI), the
early portion deals with fatigue. See the ugly red thumbnail in the last
segment of http://vibra-k.com/?cat=15. To make a long story short: bearing
fatigue life has more to do with contaminant inclusions within the steel than
load and cycling. That is the principal reason why so many bearings which
should fail from fatigue keep going well beyond the life estimate: low
contaminant contents in the load zone. As a rule, the design target is 50,000
or 100,000 hours for many machines. It can be higher for electric motors
since the bearing size tends to be relative to shaft dimensions, and that last
size will be based on force couple and torque issues rather than the very
light rotor loads. It doesn't mean that you would not see cases with a more
extensive life: they abound. But the previous values of 50K hours (standard)
and 100K hours (by request) are good estimates.

Rui, Clarifications: What do you mean by a constant time window? How is it
equal to 6*25*8? Why do you say the no. of failures is 5? I thought the no. of
failures was 7. What do you mean by a moving constant time window?
First we need to understand how you plan to use MTBF in the first place. In
my training the formula can be quite confusing, and depending on your
application, MTBF can be used as:
MTBF by critical component (usually called MTTF)
MTBF by sub-assembly - so we can pinpoint the worst sub-assembly for a
particular machine
MTBF by machine - to determine the MTBF of a particular machine
regardless of which part fails
MTBF by process - so we can determine the part in the process that fails
most frequently
MTBF by group of machines - so we can determine the total MTBF
In the formula, MTBF = operating time / BDO,
where BDO is breakdown occurrence (frequency of failure) and operating time is
loading time minus machine downtime. Machine downtime can vary a lot
and is not only attributed to failures or breakdowns; if your equipment
is being converted to another product, that is also machine downtime. Hence I
would like to clarify the formula as:
MTBF = (Loading Time - Machine Downtime attributed to failures) / Frequency of
Breakdowns
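A minimal sketch of that last formula, assuming (purely for illustration) a one-month loading time and a short list of breakdown events; the numbers and variable names are hypothetical, not from any particular plant or CMMS.

# Hypothetical month of data: loading time and breakdown events in hours.
loading_time_h = 25 * 8                     # e.g. 25 working days of 8 hours
breakdown_downtime_h = [3.5, 1.0, 6.0]      # downtime caused by each failure

failures = len(breakdown_downtime_h)
operating_time_h = loading_time_h - sum(breakdown_downtime_h)

mtbf_h = operating_time_h / failures if failures else float("inf")
print(f"MTBF = {mtbf_h:.1f} h over {failures} breakdowns")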
MTBF is an indicator as to the reliability of a machine/group of machines.
Where you "start" your recording is up to you, however, the longer the time
period you can use, the more accurate the result will be. Go back to the
beginning of your reliable breakdown data. MTBF will also indicate the
effectiveness of your maintenance practices and of any steps you might
take to improve reliability. For example, in a refinery where I was the
Machinery Engineer, we had around 1,200 pumps of one sort or another. It
was an old refinery, built around 1936, and the failure rate was horrific, with an
MTBF of around 6 months. I looked back over the failures and identified
several key failure modes that kept repeating across the group. The most
significant was that most of the pumps had grease-lubricated gear
couplings. Another was that many of the pumps still had packing, leading to
expensive shaft replacements, etc. Another was poor alignment, and there
were many premature bearing failures. Over the next couple of years I
replaced all the gear couplings with membrane couplings, upgraded all pumps to
mechanical seals, and bought a couple of laser alignment kits and trained the
guys to use them properly. I also looked at the workshop practices for
bearing changes and, yes, they were "drifting" the new bearings onto the
shafts. This to me is criminal! I bought a couple of oil baths and some heavy-
duty gloves. The result of all this hard work, and yes expense, was that after
four years the MTBF was at a very healthy 37 months! At the same time I
introduced a vibration monitoring programme along with lube oil analysis
and thermography. The most important result from all this was that
Operations could plan long runs and we could schedule our maintenance to
suit. Needless to say, there was no war between Operations and
Maintenance in this refinery (Rolly!).
Josh, The moving time window is just a convention or a rule to set the time
span within which performance indicators (PI) are to be calculated. Its use is
quite common in management performance control at large. It is set
arbitrarily, though it should be sufficiently long to accommodate several
episodes (failures, in the case of MTTF or MTBF); the more episodes, the
higher the accuracy. I forgot to erase the product 6*25*8 in cell D21, but the
product "6 months, 25 days per month and 8 hours per day" was what I had in
mind when I created the example in Excel. I should simply have entered
1,200 hours straight into that cell. In this example, only the latest 5 failures
matter, because the other 2 are already outside the time window and are
consequently omitted (or discarded). Suppose you are using a time interval
of 6 weeks (your time window). If you calculate PIs, say, weekly, then the
time window is said to move (or slide) one week at a time. As time advances,
data pertaining to the seventh week back in time are abandoned and new
data, this time pertaining to the very last week, are now considered in
calculations. When a piece of equipment is undergoing change, it is common
practice to give different weights to the data (the older the data, the lower
the weight, with the weights summing to 1) in order to get a picture that
anticipates, in a certain way, the changes that are currently under way. PIs
are, in this particular circumstance, calculated by a weighted average. By
doing this, you attribute more importance (weight) to more recent events and
less importance (weight) to older ones. In the end, if you plot the PIs
collected over time on a graph, you get a trend which might be quite
informative on whether you are going the right way or not. There is another
method, called "exponential smoothing", that automatically attributes less
weight to older data while never "forgetting" any data, regardless of how old
it might be. This method doesn't use the constant time window like the
simpler methods I referred to above, and it is less popular among practitioners.
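A small sketch of both ideas, assuming hypothetical failure dates and window lengths chosen only for illustration: a sliding-window MTBF that drops events older than the window, and an exponentially smoothed estimate that down-weights older observations without ever discarding them.

from datetime import datetime, timedelta

# Hypothetical failure timestamps (most recent last).
failures = [datetime(2024, 1, 5), datetime(2024, 2, 20), datetime(2024, 4, 2),
            datetime(2024, 5, 15), datetime(2024, 6, 28)]

def windowed_mtbf(failures, now, window_days, operating_hours_per_day=8):
    """MTBF over a sliding window: only failures inside the window count."""
    start = now - timedelta(days=window_days)
    in_window = [f for f in failures if f >= start]
    if not in_window:
        return float("inf")
    return window_days * operating_hours_per_day / len(in_window)

def smoothed_mtbf(tbf_hours, alpha=0.3):
    """Exponential smoothing of successive time-between-failure values:
    the latest TBF gets weight alpha, older history decays but is never dropped."""
    estimate = tbf_hours[0]
    for tbf in tbf_hours[1:]:
        estimate = alpha * tbf + (1 - alpha) * estimate
    return estimate

now = datetime(2024, 7, 1)
print(windowed_mtbf(failures, now, window_days=6 * 25))   # ~6-month window
print(smoothed_mtbf([300, 420, 510, 480, 620]))           # hours, illustrative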
Rui, What is the purpose of arbitrarily setting the time span and moving the
time span forward as time goes by? Is this idea of a constant time window the
same as a moving average? Why not just calculate the MTBF from the
first failure to today, since you said the more episodes, the higher the
accuracy? I need to compare our MTBF values with similar industry values
and also use them for trending, so I need standard methods of calculating MTBF.
Why do we need to apply weighting to the data set for equipment undergoing
change? What change do you mean here? Is it a modification,
an operational mode change, an operating parameter change, etc.?
Cheddar, Good job on increasing the MTBF of your refinery pumps. Yes, we
could just take the straightforward approach that you used, i.e. summing
up operating time for all pumps (regardless of pump type) and dividing by the
number of failures in between (regardless of failure mode). This composite or
overall MTBF allows us to see the improvement from its trend. For further
simplification, did you take the operating time period to start from the year
the refinery was built, 1936, regardless of whether any pumps were replaced
over the years? Also, it looks like you did not mention whether you subtracted
any downtime from the operating time period. However, I would like to follow a
standard method to calculate MTBF so that I can compare my figures with
published MTBF data such as those in HP Bloch's books. One more question:
did you track the MTBF of the pump components individually, such as
bearings, seals, impellers, etc.?
When I arrived at the refinery I realised that they had a reliability problem,
particularly with the pumps. I took a long time and looked at the pump
failures for just the previous five years, and came up with the 6 months. I
then analysed the failures and started by addressing the biggest problem,
the couplings, then the next, the sealing, and so on. If you are looking at MTBF
to help analyse the effectiveness of your maintenance practices, then you
just need sufficient data to make your readings realistic. Obviously, the more
data you have, and the further back you can go with GOOD DATA, the more
reliable your initial MTBF calculation will be. So look at your records and see
how far back you can go with a good degree of accuracy. Even if it's only a
couple of years, it's a good starting point. Then look at what the most
common failures were, and ask why these failures were happening (as I
did with the bearings). There's nothing management likes better than seeing
an improving MTBF.
Hello All, I worked in a semiconductor plant before, and we had a machine
called a wirebonder, which places the gold wire in the circuit. One wirebond
machine (my estimate) is composed of around 500,000 parts, most of them
electronic (the machine is full of boards), and one station can have
around 100 or more wirebonders. We used to trend MTBF on a monthly basis
for all machines combined, and a perfect MTBF for a month is 168 hrs.
Let me give you an example : (MTBF for 100 WB)
Jan. 49 hrs / mo
Feb. 63 hrs / mo
Mar. 46 hrs / mo
Apr. 31 hrs / mo
Note that the trend is falling; the MTBF is not good.
Level 1 : MTBF = 49 hrs / month for Jan.
Now we want to prepare an MTBF analysis:
Level 2 : Identify which particular wirebond machines contribute to the low MTBF
Level 3 : Once we have pinpointed the particular machines with low MTBF, we go
on to which particular sub-assemblies are always failing
Level 4 : Then we identify the components that usually fail
If you were just hired at your plant today and you have no records of your
failures, then, as suggested by some, you start calculating your
MTBF today and wait until something fails.
So the start time is where you actually start to compute your MTBF. In
our case we compute the MTBF monthly so we can see the trend of
our line/station. What if there are no failures? Then the denominator (the
frequency of failures) is zero, which gives an MTBF of infinity. We have 2
options. In the case where we achieve the perfect MTBF of 168 hours in a month,
we assume a denominator of one, since anything divided by 1 gives
the numerator. The other option is to prolong the MTBF period until one
machine fails; however, with this option you cannot trend MTBF on a
monthly basis. Now, for a group of machines you have 3 options to get the
total MTBF:
1st Add the total MTBF
MTBF = (36 + 45 + 74 + 90) hrs
2nd You may get the average MTBF
MTBF = (36 + 45 + 74 + 90) / 4
3rd You may express the MTBF as a percentage of the perfect 168-hour month:
MTBF% = [(36 + 45 + 74 + 90) / 4] / 168 hrs x 100
What is important is the trend of the MTBF, whether it is tracked monthly,
quarterly, semi-annually or yearly, and it must be increasing.
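A sketch of the monthly trending described above, using made-up machine data; the 168-hour "perfect month" and the divide-by-one convention for zero-failure months are taken from the post, while everything else is illustrative.

# Hypothetical monthly failure counts for a group of machines.
monthly_failures = {"Jan": 3, "Feb": 2, "Mar": 0, "Apr": 5}
PERFECT_MONTH_H = 168   # the post's "perfect MTBF" for one month

def monthly_mtbf(operating_hours, failures):
    # Convention from the post: a zero-failure month is reported as the
    # full operating time (denominator of one), not as infinity.
    return operating_hours / max(failures, 1)

for month, n in monthly_failures.items():
    mtbf = monthly_mtbf(PERFECT_MONTH_H, n)
    print(f"{month}: {mtbf:5.1f} h  ({mtbf / PERFECT_MONTH_H:.0%} of perfect)")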
Josh, When dealing with historical numbers, the more data you consider
(going farther back in time) to compute an average, for instance, the
poorer the response will be to more recent inputs, but, in turn, the more
stable the indicator will be. The only way out is a trade-off. Please see the
example in Excel that I attach. It deals with data that,
although erratic from one period to the next, nevertheless shows a trend. This
trend can be due to improvements that are being implemented
consistently over time, such as the ones to be expected from a TPM program.
The shorter the period (not a constant time window in this case but rather a
fixed number of events), the better the response, that is, the better the
picture of the magnitude the TTFs are actually reaching. On the other
hand, if a piece of equipment undergoes a radical change in its reliability,
then past data will be of little use from that moment on. New
data will have to be collected in order to have refreshed indicators available
reflecting the new reality.
Rui, I have recalculated the MTBF, MTTF & MTTR using the total operating time
from the beginning to today, i.e. without using the constant time window. Is that
correct? From your second example attached above, what I understand is that
using a constant time window to calculate MTBF will give higher values if
there are improvements made during that time window. My problem
with using the constant time window to calculate MTBF etc. is that none of the
published MTBF data, such as HP Bloch's books, mentions using a constant time
window, or did I miss something? My second concern with using the constant
time window to calculate MTBF etc. is: what is the most appropriate
constant time window to use so that I can compare my MTBF data with
others' data? In view of this, it looks like calculating the MTBF every month
is OK for trending purposes and for comparing with the same industry data
published by others.
Josh, An indicator is very often meant to anticipate the behavior of a system
a few periods ahead. To do this, the indicator has to be built on data
collected over the recent past, typically a time window whose
length has to be adjusted now and then in order to compromise stability (the
property of not going up and down suddenly) and responsiveness (the
property of following the latest trend closely). Things that happened long
ago are only of interest for historical records. This way, the decision
maker gets the best possible picture of the direction an indicator is
taking and can act accordingly. Note that a sudden move up (or down) in the
last control period doesn't necessarily mean that you are heading for
disaster, or for heaven. Management should be more concerned with and
focused on trends rather than on specific values from one period to the next.
But the trend cannot be built on a long period; otherwise management gets
too much stability and too little responsiveness, and it will be too late by the
time an undesirable trend is finally noticed. The weighted-data methods that
I wrote about in my previous posts in this thread offer a way of getting the
best possible trend awareness. This is why you should use a time frame,
rather than the longest possible period, to calculate an indicator, unless that
period is the minimum time span needed to accommodate enough events to
be statistically significant. I saw your calculations, and I would agree with
them were I not aware that you are simply trying to get rid of the time
window constraint! I hope you agree this time. I am sorry I cannot give you a
straight answer where norms are concerned. I always follow the practices
that I think are best for a specific case whenever I implement a management
performance control system, and I never accept an indicator calculated in a
way imposed by some norm whose logic or justification I don't agree with.
That is to say, I don't follow norms where "management performance
control systems" are concerned, but I will, of course, consider any norm when
a client points me in that direction, on the condition that I freely decide
whether to advise (and justify) its adoption or not.
If you code every possible failure mode at component level, then you can
effectively judge equipment-level performance on a timely basis by simply
consolidating figures gathered at that lower level. This way, you get figures
useful for both reliability improvements and management performance
follow-up. By the way, I sometimes have doubts about how to translate a
specific failure mode into a code suitable to be handled by computers. Does
anyone in this forum have good experience with failure mode coding? A few
examples would be highly appreciated.
Rui, OK, your explanation is clearer this time. I think you said that my
approach of calculating the MTBF based on the longest time window gives
stability but not responsiveness. How do I strike a balance? "typically a time
window whose length has to be adjusted now and then in order to
compromise stability (the property of not going up and down suddenly) and
responsiveness (the property of following the latest trend closely)." In the
sentence above, don't you mean "...in order NOT to compromise BOTH
stability and responsiveness"? To get the optimum stability and
responsiveness, how do I select the constant time window for calculating the
MTBF? Normally we report KPIs monthly, but it looks like, from your example,
the constant 3-month time window shows good improvement in the MTTF.
Further, how do I select the appropriate weighting to apply to the
recent data?
In many situations a unit or system can be repaired immediately after a
breakdown. In such cases the mean time between failures refers to the average
time between breakdowns until the device is beyond repair. For an exponential
failure distribution, the mean time between failures is 1/failure rate, and this
indicates the average interval of time between failures of the equipment. The
mean time between failures is not the time for which the equipment can be
expected to operate before failure. For example, if equipment is to operate for
a period equal to the MTBF, then the probability of the equipment lasting for
that period of time without failure is only about 37% (e^-1). Obviously, if the
equipment is used for a shorter period of time, the probability of survival is
higher.
The MTBF is therefore the average time between failures, and in practical
terms it can be obtained by calculating:
MTBF = Total operating time of population / Number of failures that occurred
For example, suppose there are 100 identical pieces of equipment in
operation in various systems, each of which has been in operation for about
8,000 hrs; this means that the total operating time for the population is 800,000
hrs. Suppose also that during this period of time 80 failures have
occurred and have been repaired, so that each unit runs its 8,000 hrs: the MTBF is
therefore 800,000 / 80 = 10,000 hrs. The failure rate is the reciprocal of this,
0.0001 failures per hour.
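A quick check of those figures under the exponential assumption the paragraph uses; the population numbers are the ones from the example above, and the code is only a worked illustration of the survival formula.

import math

total_operating_h = 100 * 8_000     # population operating time from the example
failures = 80
mtbf_h = total_operating_h / failures           # 10,000 h
failure_rate = 1 / mtbf_h                       # 0.0001 failures per hour

def reliability(t_hours, mtbf_hours):
    """Probability of surviving t hours under an exponential failure model."""
    return math.exp(-t_hours / mtbf_hours)

print(mtbf_h, failure_rate)
print(f"P(survive one MTBF) = {reliability(mtbf_h, mtbf_h):.0%}")   # ~37%
print(f"P(survive 2,000 h)  = {reliability(2_000, mtbf_h):.0%}")    # higher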

PREVENTIVE MAINTENANCE (PM) DEFINED - PM includes all routine
preventive maintenance services, non-destructive testing and condition-
monitoring using applicable predictive maintenance techniques. All have the
same objectives of avoiding premature equipment failures and extending the
life of equipment. Predictive maintenance is sometimes referred to
separately as PdM. Through timely equipment inspection, non-destructive
testing and condition-monitoring, potential equipment failures are identified
before they reach the failure stage. In addition, the application of reliability
techniques such as FMEA (failure mode and effects analysis) can identify the
root causes of problems so they can be isolated, corrected and eliminated.
Lubrication, cleaning, adjusting, calibration and minor component
replacements (belts, filters etc.,) extend the life of equipment.
Characteristics of PM Services - PM services are routine, repetitive activities.
They are routine because the same checklist or procedure is used with each
repetition. They are repetitive because the same services are repeated at
regular intervals. PM services are also classified as static services that can
be carried out only when equipment is shut down, or dynamic services
that can be performed while equipment is running. PM services are also
carried out at fixed intervals like every two weeks or at variable intervals like
each 250 hours or after 10,000 tons are pushed through a system.
Detection Orientation of PM - Preventive maintenance is 'detection-oriented'
so that emphasis is given to finding problems before they create a failure or
stoppage. Then the time gained is used to plan and schedule the corrective
work rather than react to an emergency situation. A good analogy is to
compare a deteriorating piece of equipment to a stick of dynamite with a
burning fuse. If the fuse length is not checked it will continue to burn and
ignite the dynamite. Similarly, if the equipment is not inspected, checked or
monitored, the equipment problem will deteriorate into an emergency repair.
Timely inspections will find equipment deficiencies sooner. Thus, the
'detection orientation' of PM assures that early discovery of problems will
yield: Less serious problems; fewer failures and more time to plan.
PM AND PLANNING EFFECTIVENESS - The detection-orientation of PM makes
planning possible. Problems found well before failure are less serious and can
be corrected at less cost and in less elapsed downtime. When there is
sufficient lead time between the discovery of an equipment problem and the
time when it must be repaired, there is sufficient time to plan work. This also
avoids the need to make repairs under unfavorable emergency conditions.
Successful planning is a direct by-product of the 'detection-orientation' of PM.
When work is planned, it is done more deliberately resulting in higher quality
work done more productively. Conversely, problems found after the
equipment has deteriorated substantially are more serious and costlier to
repair. The detection-orientation of the PM program can increase the
opportunity for planning work to:
Maximize equipment availability
Minimize downtime
Prolong equipment life
Increase equipment reliability
Reduce emergencies
Routine Preventive Maintenance Services - Routine preventive maintenance
services include inspection, lubrication, cleaning, adjusting and calibration,
replacement of minor components like belts and filters and non-destructive
testing like oil sampling. Routine PM services are carried out by maintenance
craftsmen or operators as they perform visual inspections, lubrication,
cleaning, adjusting and calibration and non-destructive testing.
Condition-Monitoring - Condition-monitoring may supplement routine PM
services or be carried out separately. It uses a variety of predictive
techniques and specialized equipment to generate audio or visual signals
which are compared with signals depicting normal operation. Variations
enable analysts to identify problems, gauge the degree of deterioration and
determine corrective actions and their timing. Condition-monitoring has
vastly improved the capability of maintenance to detect potential problems
and has reinforced the concept of detection-orientation.
Types of Condition Monitoring
Critical components whose failure could be catastrophic are monitored
continuously. Sensors are located at the critical components and linked with
computers set to indicate normal operating conditions. The indication of an
abnormal condition signals the operator with an alarm (called a protected
system). The operator, often located in a remote control room, notifies
maintenance so that action can be taken to restore the component to a
normal operating condition.
* Continuous readings establish trends in the running condition of the
equipment and their analysis leads to the interception of a worsening
condition before more serious problems arise.
* Periodic applications mean that routine inspections and continuous monitoring could be applied jointly, with one supplementing the other.
* On-demand applications mean that existing inspection or monitoring
techniques have failed to identify a troublesome problem and additional,
more effective techniques must be used to identify the elusive problems.
SUMMARY - The objectives of preventive maintenance to avoid premature
equipment failure and to extend equipment life are achieved through the
effective application of routine PM services and condition-monitoring.
However, successful PM depends also on an effective planning effort to
convert PM deficiencies into planned work. Only then will the real benefits of
PM be realized. PM programs should be oriented toward these objectives.
http://www.maintenance.org/forums

Supervision 101: The Fundamentals


Imagine you're on a walk through your plant. What you see is far from a
structured use of management fundamentals by your supervisor. Boards that
should show production requirements are incomplete or not filled out at all.
Quality issues are evident. You notice considerable idle time -- employees are
standing around waiting for something. The supervisor is nowhere to be
found and you wait patiently. Sure enough, a few minutes later the
supervisor shows up on a forklift with a pallet of parts. Wow, you think, now
that's the type of dedication I like to see... or is it? What you should be asking
is why the line ran out of parts to begin with and why the supervisor had to
leave the area to get the parts. Why were the employees idle and not adding
some sort of value while the area was idled? These are subtle indications
that something isn't quite right with your operating / management system.
Throughout my time in industry I've had considerable exposure to front-line
supervisors. One thing usually rings true: all too often supervisors lack
effective skills in what I'll call the fundamentals. This lack of skills is not
always the supervisor's fault. Supervisor vacancies are often filled with the
person most proficient at performing a specific job. Rarely does a supervisor
receive training in how to lead a group of people effectively.
There are several aspects to being an effective supervisor, but let's review
the three fundamental behaviors that supervisors need to master. These
work best with a span of control of around 15 to 20 people per supervisor
and when the supervisor's primary job is to, well, supervise, not chase parts,
attend too many meetings, or perform maintenance work.
Set clear expectations
Supervisors need to establish expectations with individuals or work teams
that align with your business expectations. The first expectation usually
involves production output (quantity) at an expected quality over a specific
time period. Efficiency is the most common expectation but there are usually
more. Supervisors can communicate expectations during the morning startup
meeting or with individuals at their workplace. It's vital that supervisors get
buy-in to the expectations. When an individual or team understands what's
expected of them, the supervisor has a basis for accountability to meet
production requirements without having to rely on convincing, persuading or
demanding action. Setting expectations helps to remove some of the
emotion in supervising the workforce and, as you'll see with the next
fundamental, it's not just managing by the numbers.
Follow up
Follow up requires spending time in the production area visiting with each
person or team to see how they're satisfying the expectations. It's important
not to carry the sledgehammer during follow up; that approach will only lead
to other issues further down the road. Supervisors should ask specific
questions about any issues causing deficiencies in performance. If these
issues can't be addressed by the individual or team, then it's the supervisor's
job to determine a fix. Even the most effective supervisors don't always
have all the answers, but they know where to find the answers, seeking the
appropriate support of functional groups (management, quality, maintenance
or engineering) to remedy the issue. Follow up is crucial because it provides
an opportunity to interact with the workforce in a constructive, proactive way
and lets the team know that meeting their agreed-upon expectations is
important.
Provide feedback

Once expectations have been set and follow up has occurred, it's time to
provide feedback. This is when things can get interesting. Most supervisors
have little problem with providing positive feedback. However, those that
lack proper training and guidance may struggle when the necessary
feedback is negative. Negative feedback due to true inefficiencies caused by
an individual or team is the toughest. If a supervisor isn't careful, it can lead
to some sort of confrontation. The best approach is to remind people of the
initial expectations and seek their input on how to turn the negative into a
positive. There may be times when the supervisor must recommend
retraining or replacement, all of which is part of supervising people. While it
is not always easy, providing feedback is again an opportunity to interact
with the workforce.
Supervisors often face a lot of added pressure to do more, with little or
no support. Keep in mind that they are possibly the most important part of
your organization because they are the first-line contact for your most valued
asset, your people. Make sure your supervisors understand the three
fundamental behaviors for leading a team:
Set clear expectations that support the organization's goals
Follow up with your people to ensure expectations are understood and seek
input
Provide feedback when needed for both good and struggling performance
Applying these three fundamentals will take some effort and time. With
support from the rest of the organization, a struggling supervisor can
become an effective part of your organization and significantly improve their
team's performance.

Why Don't You Have a Tool Crib?

By Doug Wallace, CPIM

One of the many MRO best practices we look for at our client sites is the
existence of a tool crib designed to effectively and efficiently store, control
and maintain specialty items such as bolt cutters, shop vacs, sawzalls and other
power tools, generators, fans, and pipe threaders. These are often one-of-a-kind and sometimes expensive items that aren't required frequently, but
when they are, they need to be available and ready for use.

Ironically, our clients often lament about the lack of availability and poor
operating condition of their specialty tools, resulting in unnecessary delays in
completion of maintenance work, sometimes critical jobs. Yet when we ask
them if they have a tool crib, the response is usually an emphatic "No" or a
shrug of the shoulders that effectively says the same thing, sometimes with
a wistful look that suggests that they wish they did.

The more important question is "Why?" or, more appropriately, "Why not?"


The responses to that question vary from lack of resources to lack of space
to lack of management support. Despite overwhelming evidence to the
contrary (i.e. experiencing unplanned downtime due to specialty tools being
unavailable), some even tell us they don't think they need a tool crib!

Beyond the verbal responses are perhaps more deep-seated reasons, such as
ownership ("I bought this tool, I will take care of it"), territorialism ("These
are my tools and no one else is going to use them"), or, more often, trust ("I
don't believe they will have the tool when I need it"). Instead of recognizing
the problem and fixing it, they live with the current situation and complain
about the results.

The fact of the matter is that the only legitimate reason for not having a tool
crib is that you don't have any specialty tools at all, and that's a rarity.
How a Tool Crib Should Function

A tool crib should handle every specialty tool on site, and that could require a
significant amount of space to organize and manage effectively. Ideally that
space would be in a designated area of the MRO storeroom so the tools can
be properly organized, controlled, and kitted along with other MRO materials,
but having the tool crib in the Maintenance area is also a viable option.

Who will manage the tool crib is also a concern. Storeroom personnel
generally man and manage the tool crib if it's in the MRO storeroom, while
Maintenance often is responsible for operating the tool crib if it is in their
area. Part-time storeroom support can sometimes be adequate to staff tool
cribs in maintenance shops depending on the level of activity.

Once the tool crib itself is established, item numbers are set up for each of
the specialty tools. These items are essentially no different than any other
MRO material, with the exception that you expect to get them back after the
job is completed. With item numbers set up in the inventory management
system, specialty tools can be assigned a bin location in the storeroom and
the available inventory monitored. Tools can be planned, scheduled, issued
and tracked through the system, even to the point of knowing where the tool
should be, who should have it, and when it should be returned to the tool
crib.

As required tools are identified for a planned job, they are listed on the work
order. This reserves the tool, preventing it from being used on another job.
When the work order is released by the planner, the tools appear on a kit list
along with other MRO materials required for the job. The tools are kitted
along with the other materials, charged to the appropriate work order, and
held until everything, including the required tools, is available. After the job
is scheduled, the kitted parts and tools are delivered to the job site where
the work is to be completed. Once the job has been assigned to a craftsman,
that person assumes responsibility for the tool. The estimated duration on
the work order provides an idea of when to anticipate the return of the tool.

After the job is done, the tool should be returned to the tool crib so it can be
reused. If the tool is not returned as expected, the assigned craftsman can
easily be identified and contacted to find out what happened to it.
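A toy sketch of the check-out/check-in record keeping described above; the field names and the in-memory dictionary are hypothetical stand-ins for whatever the CMMS, inventory system, or even a spreadsheet would actually hold.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ToolRecord:
    item_number: str
    description: str
    bin_location: str
    issued_to: Optional[str] = None      # craftsman currently holding the tool
    work_order: Optional[str] = None
    due_back: Optional[date] = None

crib = {"T-0101": ToolRecord("T-0101", "Pipe threader", "A-03")}

def issue_tool(item, craftsman, work_order, due_back):
    rec = crib[item]
    rec.issued_to, rec.work_order, rec.due_back = craftsman, work_order, due_back

def return_tool(item):
    rec = crib[item]
    rec.issued_to = rec.work_order = rec.due_back = None   # back on the shelf

issue_tool("T-0101", "J. Smith", "WO-4711", date(2024, 7, 12))
overdue = [r for r in crib.values() if r.due_back and r.due_back < date.today()]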

Maintaining the quality of specialty tools is often an issue. There is an
expectation that the tool will be returned in good working condition, or that
the tool crib attendant will be notified if the tool needs to be repaired or
replaced. However, this doesn't always happen, so to be safe it is better to
have tools inspected to ensure that they are ready to be re-used. In many
cases storeroom personnel are not qualified to perform the inspection or to
repair tools. But that doesn't mean they can't contact someone to check the
tool over and make sure it is in good condition for the next person who needs
it before putting it back into the tool crib.

After verifying the quality of the tool, it is put back into the on-hand
inventory in the system. This credits the work order and shows the tool as
available for future jobs. The transactions to issue the tool to the job in the
first place, and subsequently return it to stock, offset each other, so it
doesn't matter what cost is associated with the tool, but generally these
items are held at zero value.

For unplanned work, obtaining specialty tools is really no different than any
other unplanned material request. The appropriate item number should be
identified by the person who needs the tool, and the available inventory
checked before going to the storeroom to pick it up or asking the storeroom
to deliver it.

Many storerooms are starting to use Radio Frequency Identification (RFID)
tags to track specialty tools. The tags allow the item to be located anywhere
throughout the site. In some cases, sensors can also be used in conjunction
with the tags so the tool is automatically issued from inventory when it is
taken from the storeroom and put back into inventory when it is returned.
Start with the Basics
These are all elements of a best-practice tool crib, and we certainly
encourage our clients to strive for this standard. But a world-class tool crib
can take time to set up, resulting in continued unnecessary delays. While
planning for the long term, a working tool crib can be set up in a matter of
days. All you need to get started is a list of the tools you have, a space big
enough to store them, and a system to track them.

The space can be anywhere as long as it is controlled with limited access.


Working tool cribs have actually been set up in closets! Toss in a few shelves
or racks, and label them with bin locations. Contact Maintenance to see what
tools they have or are at least willing to turn over until they know the
system works. Make a list of the tools in the tool crib, along with their
assigned location. Make a spreadsheet or even a manual sign out sheet to
track the tools in and out of the crib. Assign a person full-time to monitor the
activity, or rotate responsibility to distribute the work. Just make sure
everyone knows who the assigned attendant is at any given point in time,
and make it clear that they have to check in with that person to get a tool.
Is that world class? No. Is it functional? Yes. Will it result in fewer delays?
Absolutely! And that's what it's all about. Once people start to realize the
benefits of having a tool crib, they are more likely to support it, and help
make the necessary improvements to make it more effective and more
efficient.
So if you are one of those organizations dealing with unplanned
downtime due to the unavailability of specialty tools, stop complaining and start
setting up your tool crib. You'll be glad you did, and you'll wish you had done
it sooner!

The Elusive Weekly Maintenance Schedule


Is your organization (like so many others) missing out on the substantial
benefits that resource-leveled weekly reports can provide to a company? You
might want to rethink your strategy.

Scheduling has several variations: long-range planning (LRP),
shutdown/turnaround/outage scheduling, rolling schedules, weekly schedules
and daily plans. All of these are important, but the weekly schedule process
is by far the most significant. It also is the most underutilized tool for work
force efficiency.

Most companies assume that their scheduling tool add-on would make
weekly scheduling easy. They soon discover that what they bought is simply
an interface tool to a scheduling product. A further complication is that the
interface does not transfer all the needed information across at the right
level of detail. Upon discovering these problems, too many users say "this is
too hard to use" and give up on one of the most important benefits of a
Computerized Maintenance Management System (CMMS): increased labor
productivity.
Where is the problem? The problem could be the software, a lack of
perceived benefit for the process, or a training issue. In most cases, it's
simply a software design issue, or a lack of design. CMMS vendors have
historically relied on a third-party interface to facilitate the scheduling
function. They also seem to treat all scheduling requirements the same. This
generic approach has given the users a clumsy interface that, at best, only
sort of works. The result is that very few companies take the time to create
a weekly schedule, and even fewer understand how important such a
schedule can be to their success.

A resource-leveled weekly schedule adds even more value. This advanced
technique requires several processes to already be in place. For example, if
the backlog isn't planned, it will be very difficult to create a schedule. Too
often, this critical process is overlooked and a company will stumble when it
comes to actual implementation. The typical CMMS software and training
regimen has a work order screen for entering schedule dates and work
priority fields, plus an ability to print a report that lists work for the week.
However, it often overlooks both resource leveling and compliance analysis.
In other words, during implementations, the process of deciding what is the
most critical work for the best use of limited resources is overlooked. There
are several points to consider when determining if your company should
develop weekly schedules:
You should not have to hire additional staff to generate and operate a weekly
schedule.
You should not have to migrate this data outside the CMMS when the
majority of the information needed for processing is already in the CMMS.
Resource leveling is viewed as a scheduling process that should be a part of
the CMMS product.
There are few to no logic ties (work dependencies) that require critical path
analysis.
Interestingly, for any given site, more man-hours (across a one-year time
span) are spent developing and maintaining daily/weekly schedules than are
committed to shutdown/turnaround scheduling. The everyday planner/
scheduler not only represents the largest need for this capability, he/she also
uses the CMMS more than any other employee.
Weekly scheduling: what and why?
A weekly schedule is an excellent management tool since every employee
can easily relate to what needs to get done this week. More importantly,
this design promotes proactive maintenance, which is more cost-efficient
than traditional reactive maintenance practices.
One week also is an ideal amount of time for forecasting a set of work that
all departments can support. For example, warehouse and operations
employees can be more easily convinced that the specific jobs on the
schedule actually will be completed.
Management's goal should be to present a believable schedule that
maximizes the use of craft labor without incurring overtime, and that
effectively reduces backlog. Working with a schedule that accurately
forecasts work activities enhances worker productivity, builds teamwork and
keeps the staff focused on a common goal.

Resource leveling
A resource-leveled weekly schedule provides a logical way to balance
required work versus available man-hours. Once a week, the resource pool is
assessed for available man-hours. This information is then compared to the
backlog of work. This may be a manual process or it may utilize a resource-leveling program. A preliminary schedule is then taken to the weekly
schedule meeting where attendees can refine it.
Without resource leveling, the process becomes subjective and open to error.
That, unfortunately, is common practice for many sites.
The weekly schedule meeting
If the management team waits until the meeting to select the work, it is
already too late to gain maximum value from the meeting. The weekly
schedule meeting is the time to refine the schedule, not build it. That said,
the meeting should be flexible. This is the time to confirm whether the
scheduled work is actually what should be done. Work can be added or
subtracted, based on parameters not known to the CMMS. All affected
departments should be present to provide input and gain consensus. Good
communication between maintenance and operations will improve schedule
accuracy.
An example of an appropriate change at the weekly schedule meeting might
be selecting related work based on the craft traveling to a remote location.

This forced selection is called "opportunistic scheduling," and it is an
acceptable practice. Resource leveling would be performed a second time to
incorporate these changes, followed by re-issuance of the schedule. Since
the resource pool is fixed, some work may drop off.
When a user site initiates resource-leveled scheduling, it's typical to discover
inaccuracies in the maintenance backlog. This is because the automated
selection of work depends on accurate data.

The process
Simply implementing a fundamental planning and scheduling system should
help improve productivity. Before each work day, the maintenance supervisor
will create his daily schedule from the weekly schedule. The work is linked
to the worker in the daily schedule. Each day, progress is provided on work
performed and the CMMS is updated. Examples of progress could be: work
was started, completed or placed on hold.

The daily schedule should be created from the weekly schedule. However, the
typical daily schedule includes reactive maintenance not shown on the
weekly schedule.
If the maintenance organization is only issuing a daily schedule, this does not
eliminate the need for a weekly schedule. If a company relies only on a daily
schedule, it leads to increased reactive maintenance.
Schedule compliance

Once a schedule is issued, every attempt should be made to make sure
these activities occur. Sometimes unforeseen events prevent the start of
work. Possible reason codes might be:
Operations would not let maintenance take the equipment down
Parts not available (even though the job was planned)
Management said not to perform
Ran out of time or craft availability
Unexpected repair situation discovered causing job delay

This information should be recorded in the database in a compliance
tracking table, keyed by week number and work order record. The
goal is to make a schedule that is >80% accurate each week.
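A small illustration of tracking that weekly compliance figure; the reason codes echo the list above, while the week's records themselves are invented for the example.

# Hypothetical compliance records for one schedule week: each scheduled work
# order is either completed or carries a reason code for why it was not started.
week_records = [
    {"wo": "WO-101", "completed": True,  "reason": None},
    {"wo": "WO-102", "completed": True,  "reason": None},
    {"wo": "WO-103", "completed": False, "reason": "Parts not available"},
    {"wo": "WO-104", "completed": True,  "reason": None},
    {"wo": "WO-105", "completed": False, "reason": "Equipment not released"},
]

compliance = sum(r["completed"] for r in week_records) / len(week_records)
print(f"Schedule compliance: {compliance:.0%}")   # target is >80%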

Resource pool
To increase the efficiency of producing a weekly schedule, a CMMS should
provide easy entry screens for:

1. Worker labor information - including the labor identifier, craft code and the
assigned calendar/shift code.
2. Yes/No worker availability - is this craft person an available worker? A
worker, such as a leading hand, may be in a craft but not normally assigned
to work activities. (A leading hand may be the most senior person in the craft
for larger maintenance organizations.)
3. Yes/No craft availability - an entire craft code may be marked as "no
resource leveling necessary".
4. Calendar/shift definition - able to match any possible rotating shift
combination and company holiday schedule.
5. Planned worker absences for next week - data stored as non-available time
per worker.
6. Efficiency factor by craft - which relates to the percentage of time
expected to be available to work on the schedule each week. This factor
allows for an expected amount of reactive maintenance and is critical in
creating an accurate resource pool.

Given the above tools, it is easy to maintain a resource pool. The working
level normally stays on the same shift, although rotating, for years at a time.
The only variable is when someone says something like, "I have jury duty
next Wednesday and Thursday."
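A sketch of how those six inputs might roll up into next week's available man-hours per craft; the craft codes, numbers and the 40-hour base week are invented purely for illustration.

# Hypothetical resource pool for next week, per craft.
crafts = {
    "MECH": {"available_workers": 6, "base_hours": 40, "absence_hours": 16,
             "efficiency": 0.75},   # 25% reserved for reactive work
    "ELEC": {"available_workers": 3, "base_hours": 40, "absence_hours": 0,
             "efficiency": 0.80},
}

def schedulable_hours(craft):
    gross = craft["available_workers"] * craft["base_hours"] - craft["absence_hours"]
    return gross * craft["efficiency"]

pool = {code: schedulable_hours(c) for code, c in crafts.items()}
print(pool)   # e.g. {'MECH': 168.0, 'ELEC': 96.0}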

In the end, resource-pool management is not an exact science. We are just
trying to get close. Typically you can find a staff member in the maintenance
group who already maintains this information. The challenge is to get this
data into the CMMS.

The maintenance backlog - The accuracy of the maintenance backlog is a
critical part of the process. If it is not accurate, then one might wonder how
any analytical decisions can be made from the CMMS, including KPI
measurements. The minimum amount of information needed within the
maintenance backlog for this process to work is:
A valid work order record assigned to a supervisor or an area, with a clearly
defined scope.
The work order is in "ready" status, meaning it has been planned and is
ready to work with no material or operational constraints (such as requiring a
major system shutdown).

Estimated man-hours by craft and the (minimum) number of personnel needed
to perform this job are entered.
Any long-lead material items required for this work are on-site, and linked to
this work order.
The work order has a valid work type, such as repair activity, preventive
maintenance, major maintenance or design work.
The work order has an assigned priority; ideally, that's a calculated priority
based on asset criticality. There are many more steps to properly planning a
work order. But, from a resource-leveling viewpoint, these form the minimum
criteria.

Manual vs automatic resource leveling - Resource leveling balances the
resource demand (backlog) with the resource pool (worker availability). It can
be done using paper and pencil (manually) or with software (automatically).
Either approach involves a comparison of required work hours to available
hours. If done automatically, however, you save a substantial amount of
time. This factor is even more significant when the schedule has to be
regenerated during the course of a scheduling meeting.
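A toy version of that comparison, reusing the backlog and craft-pool shapes from the earlier sketches; work orders are simply filled in priority order until each craft's available hours run out (a crude stand-in for the "order of fire" idea discussed later in this article), and none of it reflects any particular CMMS.

# Hypothetical planned backlog: (work order, craft, estimated hours, priority).
backlog = [
    ("WO-201", "MECH", 12, 90), ("WO-202", "MECH", 30, 75),
    ("WO-203", "ELEC", 10, 95), ("WO-204", "MECH", 40, 60),
    ("WO-205", "ELEC", 60, 50), ("WO-206", "MECH", 8, 85),
]
pool = {"MECH": 168.0, "ELEC": 96.0}   # schedulable hours per craft

def level_schedule(backlog, pool):
    """Greedy resource leveling: highest priority first; skip jobs that no
    longer fit in the remaining hours for their craft."""
    remaining = dict(pool)
    scheduled = []
    for wo, craft, hours, _prio in sorted(backlog, key=lambda j: -j[3]):
        if hours <= remaining.get(craft, 0):
            remaining[craft] -= hours
            scheduled.append(wo)
    return scheduled, remaining

print(level_schedule(backlog, pool))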

Subjective selection: ineffective - Without resource leveling, the staff is
basically guessing how many jobs can be completed each week.
Maintenance supervisors will routinely guess at a "safe" number of jobs they
want to work on, or select priority work that might have come up in the last
two days, because this is what they (and management) remember as being
important.

This type of subjective selection technique often leads to a less-than-desired
backlog reduction rate. That's because there is no way the human mind can
evaluate an entire backlog of work that takes into consideration multi-craft
work orders, craft estimates, work priorities and worker/craft availability.

What if you have no job planners? Keep in mind that not all company sites
are the same.

Some are involved in manufacturing, some in heavy industry and some in
utilities; these typically have detailed job plans and work packages. On the
other hand, some facility maintenance organizations may not have job
planners to keep up with a weekly schedule.
With or without a planner, it's usually possible to find someone to create a
rough estimate and enter it into the CMMS. Here are the questions to be
asked and answered in this situation:
What is the repair problem? (Enter the problem description and work type.)
What should be the priority of this work?
What craft is required to perform this work? Can this be done with just one
craft, or does it require two? Number of personnel? Estimated man-hours?
Are there any long-lead type material requirements? (Yes/No)
Typical facility maintenance takes only five to 10 minutes to enter the above
information. Once entered, the status can be changed to "ready". This type
of interaction helps the maintenance department quickly develop an
accurate, useable planned backlog.
Depending on the situation, it may take several more-than-40-hour weeks
to catch up on backlog planning. The maintenance staff should not be afraid
of job planning. The worst situation is to not have any planned estimates
entered on the work order, thus leaving it up to the worker to define all
requirements up front as well as do the work.
Communication - The subject of communication between operations and
maintenance often raises strong opinions. Some companies simply say,
enter a job priority for all new work and apply
The time it takes - How much time is involved in creating an effective weekly
schedule? The answer depends on the amount of typing and screen
manipulations a person must perform to set up this type of schedule each
week.

For example, the person creating the schedule may be creating a list of work
and downloading this information to other software programs for further
editing and/or data sorting. They also may be pushing the data to a P3 or
Microsoft Project (MSP) tool. Those who track the process of pushing this data
from and back to the scheduling tool usually find that a substantial amount
of effort is involved.

Typically, the data moved outside the CMMS is quickly out of sync with reality
due to constant updates of the CMMS data from the insertion of new work
and changing priorities based on short-term emergencies. What if work
priorities or calendar data is entered on the schedule side and not updated
on the CMMS? Is it necessary to maintain work level calendars in two
systems? What if the resource leveling algorithm in the scheduling tool
doesn't use the "order of fire" concept? Where do you run weekly schedule
compliance?

Where do you stand? How does your company compare to the general CMMS
user community?

Table I is based on some informal surveys in the field. Looking at these
numbers, it would appear that very few sites are generating a resource-leveled weekly schedule.

The reason for this low adoption rate is simple. Most software vendors don't
make the development of resource-leveling software a priority. Likewise,
because a useable tool has been unavailable, users have not learned the
value of this process.

What now? Companies have learned that with a readily available CMMS add-in and adjustments in a few crucial processes, they can gain substantial
economic efficiencies. A surprising, but very significant bonus is that their
respective companies soon become far better places to work. Shared goals
built on inter-departmental cooperation have quickly lowered conflict and
increased job satisfaction.

If your organization is one of the estimated 53 companies worldwide that
regularly generate a resource-leveled weekly report, be proud. If not, it's
probably time to evaluate how you can join this elite group.

Start by comparing your current practices with those discussed in this article.
If you believe you have opportunities for improvement, take action. Change
what you can with your current skills and tools, then ask for any necessary
outside support to help you make it all the way. MT

John Reeve has spent the past 18 years helping clients solve real-world
CMMS problems. As a senior consultant for Synterprise Global Consulting, he
deals with "once in a lifetime" issues several times each year. He can be
reached at JReeve@synterprise.com.

Additional Concepts & Definitions

A. Weekly schedules do not assign worker names to work orders. That is
done with the daily schedule. The weekly schedule primarily states "this is
the set of work which maintenance needs to work on this coming week."

B. Weekly schedule compliance is a best practice, and should also be a KPI
(>80%).
There needs to be a separate table, other than the work order table, in which
to store these records by scheduled week.
This table also allows for reason codes as to why the work was not started.

C. PM work
The CMMS product automatically generates these records as PM work orders.
They have a work type of PM, a status of ready and a target start date. If
this target date falls within the upcoming weekly schedule range, it will be
scheduled.
Some sites may have a dedicated PM crew.
The processing order (order of fire) for the resource-leveling program
involving PM work would be selected by the client.

D. Order of fire
This is a unique concept that defines the order of backlog processing. A
primitive answer would be to simply take the highest priority. The order of
fire concept directs the planner/scheduler to develop statements that
control the exact order of evaluation.
Examples:
i. Emergency maintenance or fix-it-now (FIN) work types

ii. Carry-over work


iii. PM work with dates in range
iv. Scheduled modifications that require internal maintenance resources
v. All other maintenance work, ranked by calculated priority in descending
order and by report date
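A sketch of how that evaluation order might be encoded; the work-type codes and the tuple-based sort key are illustrative assumptions, not a feature of any particular CMMS or scheduling product.

from datetime import date

# Rank of each work type, following the example order of fire above
# (lower rank = processed first).
ORDER_OF_FIRE = {"FIN": 0, "CARRYOVER": 1, "PM": 2, "MOD": 3, "OTHER": 4}

def order_of_fire_key(wo):
    """Sort key: work-type rank first, then calculated priority (descending),
    then report date (oldest first)."""
    rank = ORDER_OF_FIRE.get(wo["work_type"], ORDER_OF_FIRE["OTHER"])
    return (rank, -wo["priority"], wo["report_date"])

backlog = [
    {"wo": "WO-301", "work_type": "OTHER", "priority": 80, "report_date": date(2024, 5, 1)},
    {"wo": "WO-302", "work_type": "PM",    "priority": 60, "report_date": date(2024, 6, 1)},
    {"wo": "WO-303", "work_type": "FIN",   "priority": 99, "report_date": date(2024, 6, 20)},
]
for wo in sorted(backlog, key=order_of_fire_key):
    print(wo["wo"])   # WO-303, then WO-302, then WO-301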

E. Opportunistic scheduling as a best practice


When reviewing a job on the weekly schedule, it is proper to also consider
including related work, especially if the work is at a remote location.
The weekly schedule meeting should allow for this process to work quickly
and efficiently. An effective technique is for the planner/scheduler to project
the computer-generated report on a screen. While reviewing a work order
record, the planner should then hyperlink to that work location (or asset
field) and bring up other related work. The attendees would then say, for
example, "select the first and third records and add them to the schedule."
By working as a team, the group can very quickly make decisions that will be
honored by everyone involved in making them.

F. Major maintenance; modifications; capital work; project work


Work can come from the long-term plan (LTP). Typically, an external group
meets periodically to review the entire LTP. They make decisions on budgets,
priority, system availability, shutdown requirements, contractor support and
long-lead material items.
Complications also can come from:
i. Jobs that cannot be done until a particular season
ii. Long-lead time material requirements
iii. Contractors may not be readily available.
iv. Planned operational downtime
Multi-craft coordination, where the major maintenance team might say, "the
following work is now ready for the weekly schedule." This should lead to the
development of a work order in the CMMS product with the proper work type
(i.e. CP) and a scheduled start date. If this start date falls within the
upcoming weekly schedule range, it will be processed.
upcoming weekly schedule range, it will be processed.
Major maintenance may or may not consume on-site labor resources, but it is
still beneficial to include this work in the weekly schedule. Adding this
information gives improved visibility to all departments and reduces work
coordination errors such as tearing up the parking lot twice in the same
month.

G. In-progress work (sometimes called carry-over work) considerations


Once a job is started, it makes sense to allow that work to be continued,
even if it crosses over onto multiple weeks.
Any work left unfinished at the end of the week must be changed to "in
progress" in the CMMS with these notations:
i. Remaining man-hours by craft
ii. Is the status changed to "hold", or is the unfinished work available for
the following week?

H. The importance of planners and job planning


Creating a weekly schedule is quite difficult without a planned backlog.
Maintenance work should be preplanned to the extent necessary to minimize
delays in work performance. Pre-planning doesn't just minimize downtime, it
also optimizes labor efficiency and job safety.
Planners provide:
i. Consistency of input with regard to craft
estimates, priority assignment, work-type
assignment, and proper asset identification.
ii. Interpretation of each work request by using clear and obvious wording
and a sufficient amount of detail.

iii. Links to work associated with future system shutdowns


iv. A proactive view of future work, not just short-term reactive maintenance.
v. An important service by identifying recurring repair problems and
informing engineering.
I. Shutdown/turnaround scheduling typically requires a robust scheduling
product. It involves the use of logic ties, critical path and resource analysis.
Conversely, weekly scheduling is mostly a collection of work activities with
no inter-dependencies.
Prioritization Issues
1. How can you determine if your system of prioritization is NOT working?
You are making use of deadline priorities. This means you are linking the
allowed time to repair to a priority value. This approach does not take into
account the available resources for any given day or week. There will be
violations because you only have so many resources to get the work done.
You review your backlog of work and find high priority work that is many
months old.
You review your backlog of work and find that the majority of all work has the
same priority.
2. What constitutes a good system of prioritization?
Backlog work priorities are periodically reviewed and adjusted, as needed.
The work order priority is combined with the asset/location priority. This
technique provides a normalized result and is ideal for ranking the work.
The higher the number, the higher the priority. With this approach, there is
no limit on processing new work that turns out to be more important than the
existing work in the backlog.
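One common way to combine the two priorities mentioned above is simply to multiply them, which yields the normalized ranking number the text describes; the 1-10 scales and work orders below are assumptions made only for illustration.

# Hypothetical 1-10 scales: work order priority and asset/location criticality.
work_orders = [
    {"wo": "WO-401", "wo_priority": 7, "asset_criticality": 9},
    {"wo": "WO-402", "wo_priority": 9, "asset_criticality": 3},
    {"wo": "WO-403", "wo_priority": 5, "asset_criticality": 10},
]

for wo in work_orders:
    wo["ranking"] = wo["wo_priority"] * wo["asset_criticality"]   # combined score

# Higher number = higher priority, as described above.
for wo in sorted(work_orders, key=lambda w: -w["ranking"]):
    print(wo["wo"], wo["ranking"])   # WO-401 (63), WO-403 (50), WO-402 (27)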

Polish Up Your Maintenance Planning

FYI: Effective planning and scheduling can have a profound impact on a
maintenance organization's productivity and compliance, not to mention
equipment reliability. Planned jobs require only half as much time to execute
as unplanned jobs. A rule of thumb for maintenance is that each dollar
invested in preparation saves 3 to 5 hours during work execution. What's the
situation with your organization?

Steve Mueller, director of commercial operations for the management consulting firm Daniel Penn Associates (DPA), Hartford, CT, says overcoming resistance to change, transitioning from a reactive to a proactive culture, and maintaining forward momentum are the biggest challenges that companies face when developing planning and scheduling programs. In a DPA Insights blog post (danielpenn.com/maintenance-organization), he pointed to five "must do" items for dealing with those challenges and jump-starting a maintenance-planning process. His advice is timeless:
Communicate the benefits to all stakeholders upfront.
Appoint qualified people as planners and schedulers.
Install a robust computerized maintenance management system (CMMS).
Use metrics to manage the new processes and show results.
Frequently follow up with all stakeholders to reinforce the new behaviors.
Measuring success

How will you know if your efforts are successful? Mueller listed the following
indicators:
The preventive maintenance (PM) program is on schedule.
Supervisors don't need to do their own planning and scheduling because they follow the schedules for their shops.
Your customers have a single point of communication.
All work is on a work order.
No work orders are released before the work is ready to be executed.

Every worker starts each day with ready work.


All planned work has estimated hours, status code, and priority.
Weekly schedules by crew, date, individual, and job are used.
Work-order actual hours divided by work-order planned hours is 90% to 100% of capacity (see the calculation sketch after this list).
100% of work is covered by a work order.
100% of PM work orders are planned and scheduled.
Emergency work comprises no more than 10% of labor hours.
You have the means to appropriately staff the organization by skill and workload demand.
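Two of the indicators above lend themselves to a quick calculation. The sketch below computes the actual-to-planned hours ratio and the emergency-work share of labor hours; all of the hour figures are invented for illustration.

# Quick checks of two indicators from the list above; the numbers are invented.
planned_hours     = 400.0   # work-order hours planned for the week
actual_hours      = 372.0   # work-order actual hours charged against planned work
emergency_hours   = 35.0    # labor hours spent on emergency work
total_labor_hours = 420.0   # all labor hours for the week

plan_ratio    = actual_hours / planned_hours                  # target: 0.90 to 1.00
emergency_pct = emergency_hours / total_labor_hours * 100     # target: no more than 10%

print(f"actual/planned = {plan_ratio:.0%}")      # 93%
print(f"emergency work = {emergency_pct:.1f}%")  # 8.3%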
Bring Out The Shine
What shines when maintenance resources are properly planned and
scheduled? Plenty, according to Steve Mueller of Daniel Penn Associates:
Equipment reliability increases.
Maintenance storeroom parts are more readily available.
Costs drop.
Wait times are reduced.
Excess inventory goes away.
The information you use to make decisions is more reliable.

8 Steps to Success in Maintenance Planning and Scheduling


Maintenance planning and scheduling are key elements that influence the true success of any organization. Many times we have a planner or planner/scheduler but do not know how to use him or her effectively or efficiently. When we talk about maintenance planning, we are talking about higher wrench time. In a time of economic uncertainty, higher wrench time equals lower cost, which results in job security for all. Past studies have shown that most companies do not perform maintenance planning effectively, which negatively impacts work effectiveness, wrench time, equipment uptime, equipment reliability, and cost. If we were effective in maintenance planning, the result would be higher wrench time and higher equipment reliability.
In maintenance scheduling, once we have achieved the required discipline and maintenance plans are completed on time, the reliability of facilities and assets increases rapidly.
Let's take a moment to look at what a proactive maintenance planner/scheduler's role looks like.

Where does proactive work come from?


Proactive work orders or requests come from an effective preventive maintenance (PM) program and an effective condition monitoring program. Here is how it breaks out:
PM execution: 15% of work
Results from PM execution: 15% of work (typically identifying functional failures)
A functional failure (high or critical defect severity; very little, if any, time to plan and schedule proactive work) is the inability of an item (or the equipment containing it) to meet a specified performance standard, and it is usually identified by an operator.
Condition monitoring execution: 15% of work
Results from condition monitoring: 35% of work (typically identifying potential failures)
A potential failure (low defect severity; time to plan and schedule proactive work) is an identifiable physical condition which indicates that a functional failure is imminent, and it is usually identified by a maintenance technician using condition monitoring or quantitative preventive maintenance.
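Restating that breakdown in one place, the sketch below simply tallies the four percentages and captures the planning-time rule implied by the two definitions; the labels and the helper function are added for illustration and are not the author's wording.

# The proactive work mix described above, expressed as fractions of total work.
proactive_mix = {
    "PM execution": 0.15,
    "Corrective work found by PM (functional failures)": 0.15,
    "Condition monitoring execution": 0.15,
    "Corrective work found by CM (potential failures)": 0.35,
}
print(f"Proactive share of total work: {sum(proactive_mix.values()):.0%}")  # 80%

def planning_window(defect_severity):
    """Rough rule implied by the definitions above: potential failures (low severity) leave
    time to plan and schedule; high/critical functional failures usually do not."""
    return "plan and schedule" if defect_severity == "low" else "little or no time to plan"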

PM vs. CBM
Prior to the beginning of the maintenance day shift:
The maintenance planner's day starts before the regular maintenance day shift in order to review the work orders that came in overnight. The planner will estimate the man-hours, number of personnel, and craft types needed for any emergency work orders that must be started that day, then move those work orders directly to the maintenance crew, followed by a quick phone call to notify the maintenance supervisor responsible for that area of the plant. The planner will also code these jobs as emergency work orders so the level of this type of work can be tracked over time. Application of well-disciplined proactive maintenance strategies (PM/CBM), coupled with effective planning and scheduling, will make these emergency jobs fewer and fewer over time.
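A rough sketch of that pre-shift triage is shown below; the work orders, craft estimate, and dispatch step are all invented for the example and are not meant to mirror any particular CMMS workflow.

# Sketch of the pre-shift triage described above; field names and values are illustrative.
overnight = [
    {"id": "WO-3001", "description": "Pump seal leaking badly", "emergency": True},
    {"id": "WO-3002", "description": "Loose handrail on mezzanine", "emergency": False},
]

def dispatch_to_crew(wo):
    # Hypothetical handoff: the work order goes straight to the crew and the area
    # supervisor gets a quick phone call.
    print(f"Emergency {wo['id']} sent to crew; supervisor notified.")

def triage(work_orders):
    for wo in work_orders:
        if wo["emergency"]:
            wo["work_type"] = "EM"  # coded as emergency so this type of work can be trended over time
            wo["estimate"] = {"craft": "millwright", "crew_size": 2, "man_hours": 4}  # quick, assumed estimate
            dispatch_to_crew(wo)
        else:
            wo["status"] = "Awaiting field inspection"  # goes into the planner's inspection schedule

triage(overnight)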
The planner should also apply good planning and scheduling techniques to his or her own responsibilities. Once any emergency work has been estimated and sent to the maintenance crew, the maintenance planner will plug new work requests into his/her field inspection schedule. Some jobs may need to be worked into today's field inspection schedule in order to be put on tomorrow's maintenance schedule. Other new requests can be scheduled for field inspection and planning later in the week. To be most effective, a proactive planner should schedule all of his or her jobs (other than emergency work) for field inspection on a particular day. The planner will also set the planning status for these new requests to "Planning" to show that planning is underway.
Early Morning:
Field inspections: Next, armed with an inspection schedule, job inspection forms, and a camera, the planner will begin inspecting all of the job sites. The planner has established a logical route to minimize travel time and will make notes of the specific needs of each request, any ancillary work that should be completed by the mechanic while at the job site, and all of the other applicable information required for a well-planned job. The planner will note the complexity and predictability of the various issues relative to the particular job in order to create a job plan that is both effective and suited to that job. The planner will also pay particular attention to job issues where significant delays were identified in the wrench-time study. Understanding and watching for complexity, predictability, and likely wrench-time losses will improve the likelihood of creating a job plan that minimizes delays during execution and results in a high-performing workforce. More on these topics can be found in the 3rd edition of Planning and Scheduling Made Simple, by Smith and Wilson.
Immediately after completing field inspections is a good time to start ordering parts, or at least to create a list of parts to order, depending on the time available before meeting with the supervisor, scheduler, and maintenance coordinator. In particular, it is important to identify the parts that will require more than 24 hours to obtain. These parts should be ordered today, and the status should be changed to "Waiting Parts." At this point in the process it is not known when the job will actually be scheduled, so any parts not on site should be ordered on the same day they are identified as a need. Parts that are available from the storeroom should be put on reserve so that they will be available for ordering the day before the job is scheduled for execution. The planner will also need to review the status of parts previously ordered and update the status to "Ready to Schedule" on any work request where all parts have arrived and storeroom parts are all on reserve. Some organizations go ahead and have storeroom parts delivered and placed in parts-kit boxes for each job. This process can work fine; however, one drawback is that when jobs get pushed into the future for execution, you can end up with a lot of parts kits to keep track of, or you can end up sending material back to the storeroom if jobs get canceled for whatever reason. If you have a firm parts-reservation system, you get the best of both worlds: the parts can't be taken for a different job, yet if the job gets cancelled nothing has to be returned. The reduced handling and better inventory accuracy provided by the reservation approach will reduce cost.
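The parts-driven status changes described above can be summarized in a short sketch. The "Ready to Schedule" condition (all purchased parts received and all storeroom parts on reserve) follows the text; the field names and record layout are illustrative assumptions.

def update_parts_status(job):
    """Order long-lead parts the day they are identified, reserve storeroom parts,
    and mark the job 'Ready to Schedule' only when everything has arrived or is on reserve."""
    for part in job["parts"]:
        if part["source"] == "purchase" and not part["ordered"]:
            part["ordered"] = True     # order today, even though the execution date is not yet known
        if part["source"] == "storeroom" and not part["reserved"]:
            part["reserved"] = True    # a reservation, not a kit: nothing to send back if the job is cancelled

    all_ready = all(part["received"] if part["source"] == "purchase" else part["reserved"]
                    for part in job["parts"])
    job["status"] = "Ready to Schedule" if all_ready else "Waiting Parts"
    return job["status"]

job = {"id": "WO-4001",
       "parts": [{"source": "purchase", "ordered": True, "received": False},
                 {"source": "storeroom", "reserved": False}],
       "status": "Planning"}
print(update_parts_status(job))   # "Waiting Parts" until the purchased part arrives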
Working from the job inspection form, the planner will identify the various needs of each job and start documenting the job plan. First and foremost is the Job Summary page, which contains the basic information that a fully qualified mechanic who is very familiar with this type of job would need. The Job Summary provides reference numbers pointing to the detailed information that follows in the job plan. This job-plan format allows those familiar with the task to quickly review the job using only the summary sheet. Anyone less familiar or less skilled can follow the references on each item of the job summary sheet to the specific section of the job plan to find the information they need. This gives maintenance personnel quick access to the information they need without having to read through information they don't.
All free time that the planner has should be spent refining and permanently documenting job plans. As the planner's job-plan database grows, he/she will have more and more plans that can be reused on future jobs with only minor refinements. This will allow the planner to plan for a greater number of field maintenance personnel. As job plans are completed, the planner should update the backlog status to "Planning Complete." When all parts not available through stores have been received and the storeroom parts are on reserve, the status should be changed to "Ready to Schedule," assuming the job plan has been completed. The scheduler will initiate the delivery of storeroom parts on reserve the day before the job is scheduled for execution.
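Pulling the status names together, the planning statuses mentioned in this article can be viewed as a simplified lifecycle, sketched below. The first and last entries ("Awaiting field inspection" and "Scheduled") and the strictly linear order are additions made for the illustration; real jobs can move back and forth (for example, from "Planning Complete" back to "Waiting Parts"), and exact status names vary by organization.

# Simplified planning-status lifecycle; names and ordering are a sketch, not a standard workflow.
PLANNING_STATUSES = [
    "Awaiting field inspection",  # new request plugged into the inspection schedule
    "Planning",                   # field inspection done, job plan being written
    "Waiting Parts",              # long-lead parts on order
    "Planning Complete",          # job plan finished
    "Ready to Schedule",          # plan complete, purchased parts received, storeroom parts on reserve
    "Scheduled",                  # on the schedule; reserved parts delivered the day before execution
]

def advance(job, new_status):
    """Move a job forward through the lifecycle; flag accidental backward moves."""
    if PLANNING_STATUSES.index(new_status) < PLANNING_STATUSES.index(job["status"]):
        raise ValueError(f"{job['id']}: cannot move back from {job['status']} to {new_status}")
    job["status"] = new_status

job = {"id": "WO-4001", "status": "Planning Complete"}
advance(job, "Ready to Schedule")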
Late Morning:
Planner meets with the maintenance supervisor, scheduler and maintenance
coordinator:

Now armed with the information gathered during the field inspection route, having processed parts needs and updated the status on jobs that have received some or all of the parts ordered, the planner should meet with the maintenance supervisor, scheduler, and coordinator. The planner should bring to the meeting a copy of the planning backlog with the current status updated. This meeting should be short, 30 minutes or less, and its purpose is twofold: 1) to provide preliminary information to those who will be building or amending the maintenance schedule, and 2) to ensure that the planner has sequenced the various jobs in his/her queue in a manner consistent with the needs of maintenance and production. The planner should share parts-issue updates and the schedule for his/her planning activities. Any other major constraints, such as boom-truck or crane needs, or some other special need for particular jobs, will need to be communicated. This will provide maintenance and operations with important information that will allow them to start planning for when particular jobs will be ready for placement in the maintenance schedule. This meeting will also allow maintenance and operations to provide feedback to the planner on any changes that need to be made to the planning schedule. For example, the planner may have a particular job on schedule for planning to be complete and in "Ready to Schedule" status by next Tuesday when, in fact, production needs it sooner or can wait longer.
Early Afternoon
Immediately after lunch, the planner will continue writing job plans, researching technical issues for particular jobs, obtaining approval for jobs that meet specific criteria, referring other jobs to Engineering for redesign as applicable, and updating the status of each request as appropriate.
Each day, the planner should designate a small amount of time for reviewing
the feedback from the mechanics on jobs recently completed. This is an
important step for the planner to be able to improve the effectiveness of the
plans he or she creates.
Late Afternoon
An hour or so before the daily scheduling meeting, the planner should review his/her email and phone messages to see if there have been any late changes to the general plan that has been forming for the next day's schedule. This information may have an impact on the Job Summary sheets the planner takes to the scheduling meeting.
The daily scheduling meeting is not a meeting where the planning backlog is reviewed and jobs are selected for scheduling. Because the planner meticulously keeps the status of all jobs updated, and because of the late-morning meeting between the planner, scheduler, maintenance coordinator, and maintenance supervisor, the schedule has inherently been forming on its own. The daily scheduling meeting is where the weekly schedule is either confirmed for the next day or slightly amended to respond to higher-priority needs that have presented themselves since the weekly schedule was posted, yet have still allowed time for the preparations necessary to reap the benefits of planning and scheduling. Also, changes may be made to more days than just tomorrow, depending on the needs and planning status of the jobs. This meeting should take 30 minutes or less if each role has prepared in advance and communicated effectively with the other players as needed. It is only to finalize what they have already been discussing and working toward since yesterday's daily planning meeting.

After the daily scheduling meeting, the scheduler will change the status of any work orders that are to be added to the maintenance schedule and will also order all parts that are on reserve in the storeroom. Following the daily planning meeting, the planner will amend the field inspection schedule and make any adjustments necessary to the overall planning schedule.
The planner will also need to update any measures the organization tracks relative to planning, such as man-hours planned and emergency man-hours per day.
End of the day
Make a quick review of the entire Planning Backlog:
Is the job status up to date on all jobs?
Is the Field Inspection schedule for tomorrow ready?
Have all parts coming from off-site been ordered and parts available from the
storeroom placed on reserve for jobs that have been inspected?
Conclusions
Notice that the planner has not had any involvement in work that is underway, and almost all of the planner's activities have been directed toward work that will leverage his/her time. The only exception to this should be the small amount of time it took the planner to make a quick labor estimate on emergency work. A planner who follows this type of rigor can be assured that he/she is leveraging the entire maintenance crew and helping to propel the organization to a more proactive state where emergency work and unexpected failures are the exception. This job requires discipline and patience as the transition from reactive maintenance to proactive maintenance occurs. Hopefully this opens your eyes to what a maintenance planner/scheduler's role looks like in a proactive environment.
