Ben Rockwood said something last December about the re-emergence of the
Systems Engineer and I agree with him, 100%.
NASA Systems Engineering Handbook, 2007
To add to that, I'd like to quote the introduction of the excellent NASA Systems
Engineering Handbook. The emphasis is mine:
Systems engineering is a methodical, disciplined approach for the design,
realization, technical management, operations, and retirement of a system. A
system is a construct or collection of different elements that together produce
results not obtainable by the elements alone. The elements, or parts, can include
people, hardware, software, facilities, policies, and documents; that is, all things
required to produce system-level results. The results include system-level qualities,
properties, characteristics, functions, behavior, and performance. The value added
by the system as a whole, beyond that contributed independently by the parts, is
primarily created by the relationship among the parts; that is, how they are
interconnected. It is a way of looking at the big picture when making technical
decisions. It is a way of achieving stakeholder functional, physical, and operational
performance requirements in the intended use environment over the planned life of
the systems. In other words, systems engineering is a logical way of thinking.
Systems engineering is the art and science of developing an operable system
capable of meeting requirements within often opposed constraints. Systems
engineering is a holistic, integrative discipline, wherein the contributions of
structural engineers, electrical engineers, mechanism designers, power engineers,
human factors engineers, and many more disciplines are evaluated and balanced,
one against another, to produce a coherent whole that is not dominated by the
perspective of a single discipline.
Systems engineering seeks a safe and balanced design in the face of opposing
interests and multiple, sometimes conflicting constraints. The systems engineer
must develop the skill and instinct for identifying and focusing efforts on
assessments to optimize the overall design and not favor one system/subsystem at
the expense of another. The art is in knowing when and where to probe. Personnel
with these skills are usually tagged as systems engineers. They may have other
titles (lead systems engineer, technical manager, chief engineer) but for this
document, we will use the term systems engineer.
The exact role and responsibility of the systems engineer may change from project
to project depending on the size and complexity of the project and from phase to
phase of the life cycle. For large projects, there may be one or more systems
engineers. For small projects, sometimes the project manager may perform these
practices. But, whoever assumes those responsibilities, the systems engineering
functions must be performed.
How an MTBF calculation becomes a trap if you are unaware of the dangers
of using Mean Time Between Failure (MTBF) for reliability analysis
MTBF is often used as an indicator of plant and equipment reliability. An MTBF value
is the average time between failures. There are serious dangers with the use of MTBF
that need to be addressed when you do an MTBF calculation.
Take a look at the diagram below representing a period in the life of an imaginary
production line. What is the MTBF formula to use for the period of interest to
represent the production line's reliability over that time?
If MTBF is the mean time between failure (MTBF applies to repairable systems;
MTTF, Mean Time To Failure, applies to unrepairable systems) the MTBF formula
would need to have time units in the top line and a count of failures on the bottom
line.
In the diagram you will see the MTBF formula that I finally settled on: Mean Time
Between Failure (MTBF) = Sum of Actual Operating Times 1, 2, 3, 4 divided by the
Number of Breakdowns during the Period of Interest.
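A minimal sketch of that formula in Python, assuming illustrative operating times and a breakdown count (the diagram's actual values are not reproduced here):

```python
# Assumed values standing in for the diagram: four actual operating
# segments (hours) and the number of breakdowns in the period of interest.
operating_times = [180.0, 220.0, 150.0, 170.0]  # actual operating times 1-4
breakdowns = 4

mtbf = sum(operating_times) / breakdowns
print(f"MTBF = {mtbf:.1f} hours")  # 180.0 hours with these assumed numbers
```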
But the MTBF value you get from that MTBF calculation changes depending on the
choices you make.
To arrive at an MTBF equation there are assumptions and options to consider. Like,
what event is, and is not, a failure? What power-on time do you consider to be
equipment operating time? When do you start and end the period of interest for
which you are doing the MTBF calculation?
Definition of Failure
To measure MTBF you need to count the failures. But some failures are out of your
control and you cannot influence them, like lightning strikes that fry equipment
electronics, or floods that cause short circuits, or if your utility provider turns off the
power or water supply. Do you include acts of God in your MTBF calculation?
You need to get agreement across the company as to what can be called a failure,
what can be called operating time and what are the end points of the time period
being analysed before you can use MTBF values as a believable Production
Reliability KPI (Key Performance Indicator).
Maybe it is more sensible to have MTBF by categories, e.g. 1) mean time between
machinery/equipment breakdowns caused by internal events, 2) mean time
between operator-induced stoppages, 3) mean time between externally caused
outages that you cannot control, like power or water loss, and so on.
Your second best protection against misinterpreting and misunderstanding MTBF is
to have honest, rigid rules covering the choices and options that arise when doing
an MTBF calculation.
The very best protection is to also get the timeline of the period being analysed
showing all the events (and their explanations) that happened, and then ask a lot of
questions about the assumptions and decisions that were made, and not made, to
arrive at those MTBF values.
Of course, you know why I chose motors for the example: to reinforce the
idea that the chance of failure is not always constant. Be sure to think
about the failure mechanisms before using MTBF (or MTTF). If the failure rate
is time dependent then this simple calculation is not useful.
I used this example during a class last week and it seemed to spark a good
discussion. How have you explained MTBF to others? Any suggestions on
how to best describe what the MTBF value really means, or doesn't mean?
Excerpts
In my case, trying to calculate reliability and/or MTBF for a subsystem is very
frustrating. The advertised reliability of many components is based on the
OEM's projection or engineering analysis because no one wants to spend the
dollars and time required to truly test the component thoroughly. Trying to
validate their projection or engineering analysis at the system level usually
means that my only data is based on one or two failures for a series of tests
that accumulate a total of 30 to 40 hours of operation. For simplicity, I am
usually forced to ignore the conditions of testing (e.g., temperature, altitude,
load). Components that require a high level of reliability (R) and confidence
(C) need several hundred hours of operation to fully demonstrate R&C.
Hi Bill, I feel your pain. Vendors have to deal with many operating conditions
and use cases. They tend to do what is requested by the majority.
Unfortunately, so many seem happy with very poor information that those
who need and request better information are thwarted. I suggest we
continue to ask for meaningful information, educate our peers to do likewise,
and when all else fails do the testing ourselves.
Hi Fred, At the moment I am doing an internship at a company concerning
MTBF. My research is forcing the same question into my mind every time: is it
even wise to help them calculate their MTBF? The FR is not constant at all, as
they mainly produce flowmeters. MTBF for me is not an estimation of how
long an asset will last; for me it says more about the improvement or
decline in the reliability of an asset or system. Your topic intrigued me as I
am starting to very much agree about whether it is wise to use MTBF at all. I
would be very excited if you could tell me your experience with other,
similar and preferably more representative metrics.
Hi Stefan, you should be concerned, as MTBF most likely is misleading or not
representative of the actual failure rate at any particular time of interest. Instead
use reliability, the probability of success at a specific duration: 98% reliable over
1 year, for example. Use multiple points in time, or better, if you have the
time-to-failure information, fit a Weibull distribution (or another appropriate
distribution) and have the entire picture of probability of failure over time in
a CDF plot.
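A short sketch of what Fred describes, using SciPy's weibull_min with hypothetical, complete (uncensored) failure times; real test data with survivors would need a censoring-aware fit:

```python
import numpy as np
from scipy import stats

# Hypothetical complete times to failure, in hours (assumed data)
failure_times = np.array([350, 520, 610, 780, 900, 1150, 1400, 1800])

# Fit a two-parameter Weibull (location fixed at zero)
shape, loc, scale = stats.weibull_min.fit(failure_times, floc=0)

# Reliability (probability of success) at a duration of interest
t = 1000.0
reliability = stats.weibull_min.sf(t, shape, loc=loc, scale=scale)
print(f"beta = {shape:.2f}, eta = {scale:.0f} h, R({t:.0f} h) = {reliability:.1%}")
```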
Hi Fred, Thank you a lot! I will have a look into that; this is very useful and
fun information for me to work with! Thanks again.
Hi Fred, I appreciate your site a lot. How important its mission is, one can
understand by searching the internet for examples of MTTF calculation and FIT.
I've started reading about hazard rate, failure rate, MTTF, etc., but can't
find any advice on how to interpret test data to obtain a failure rate. Let
me put an example here: I'm testing 10 devices (non-repairable systems) over,
e.g., 400 hrs. Recorded failure times in hrs: {30, 45, 60, 90, 120, 180, 240,
300}; 2 devices survived. Can I say my failure rate is 8/
(30+45+60+90+120+180+240+300+2*400) and MTTF is the reciprocal of the
failure rate? Or maybe I shouldn't even try to calculate a failure rate from
this data? Does this method imply any problems with the reliability calculation? I
agree that the mean value for a particular distribution yields a different reliability,
but please advise how to process this data in the correct way. I appreciate your
feedback in advance.
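For reference, the estimator Rafal proposes is the standard point estimate for a constant failure rate when survivors are counted as censored time; it is only meaningful if that constant-rate assumption holds. A minimal sketch with his numbers:

```python
# Total failures divided by total unit-hours, with the two survivors
# contributing their full 400 h of test time as censored (failure-free) time.
failure_times = [30, 45, 60, 90, 120, 180, 240, 300]  # hours
survivors = 2
test_length = 400  # hours

total_time = sum(failure_times) + survivors * test_length  # 1865 unit-hours
failure_rate = len(failure_times) / total_time             # ~0.00429 failures/hour
mttf = 1.0 / failure_rate                                  # ~233 hours

print(f"lambda = {failure_rate:.5f}/h, MTTF = {mttf:.0f} h")
```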
Hi Rafal, thanks for the note and example problem. While you can estimate
the failure rate and MTTF as described, it is not all that useful in most
cases. Instead use a Weibull analysis (with so few data points Weibull is often
a great starting point, as it is versatile). This will provide the probability of
failure at various points in time. I'm traveling at the moment and have
limited access, so will follow up later when I can either work out the problem
for you, or point to a better reference and example.
Hi Fred, Thank you for your interest. Meanwhile I was sitting and struggling to
understand the meaning of failure rate and its interpretation. I think it can be a
good supplement to the previous question if I ask whether, for example, a failure
rate equal to 0.004 fail/hr, or in other words 4 fails per 1000 hrs, means 4 of them
will fail every 1000 hrs, assuming an exponential distribution and constant
hazard rate. Does it also mean that if I had 4 components then I could expect
none of them functioning after 1000 hrs, but also that if I had 1000
components, 4 of them would fail within 1000 hrs and 996 would remain healthy
until the next 1000 hrs? I read somewhere a failure rate example: FR=0.1
means 10% of the population will fail every time step, and in fact it plots an
exponential curve; but in this case, having a specific number of devices, e.g.
1000 components, it linearly drops to 0 after 250,000 hrs have gone (1000 *
250), where 250 is the mean time to fail. Maybe it shouldn't be understood so
straightforwardly? Maybe if it is an average value and follows the assumed
exponential pdf, then we can say on average 4/1000 hrs fail, but in our case
37% will fail in the first 250 hrs and the remaining 63% within the next 250,000-250 =
249,750 hrs. If this is correct I'm home; if not, I'm lost.
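A quick arithmetic check on this exchange, assuming the exponential model Rafal describes: e^(-1) ≈ 37% is the fraction surviving at one MTTF (so about 63% have failed by then), and the population decays exponentially rather than linearly:

```python
import math

# Constant hazard of 0.004 failures/hour, i.e. MTTF = 250 h
lam = 0.004

for t in (250, 1000):
    fraction_failed = 1 - math.exp(-lam * t)
    print(f"by {t:>4} h: {fraction_failed:.1%} of the population has failed")

# by  250 h: 63.2%  (e^-1 = 36.8% is the fraction *surviving* at the MTTF)
# by 1000 h: 98.2%  (so "4 per 1000 h" is a rate, not a fixed head count)
```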
Availability, MTBF, MTTR and other bedtime tales
If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
Of course, it's more interesting when you start looking at the things that
influence uptime and downtime. The most common measures that can be
used in this way are MTBF and MTTR.
MTBF is Mean Time Between Failures
MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
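A minimal sketch of this relationship, with assumed MTBF and MTTR values:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability from MTBF and MTTR (same time units)."""
    return mtbf / (mtbf + mttr)

print(availability(mtbf=1000.0, mttr=4.0))  # ~0.996
print(availability(mtbf=1000.0, mttr=0.0))  # 1.0 -- instant repair, as discussed next
```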
One interesting observation you can make when reading this formula is that
if you could instantly repair everything (MTTR = 0), then it wouldn't matter
what the MTBF is - Availability would be 100% (1) all the time.
That's exactly what HA clustering tries to do. It tries to make the MTTR as
close to zero as it can by automatically (autonomically) switching in
redundant components for failed components as fast as it can. Depending
on the application architecture and how fast failure can be detected and
repaired, a given failure might not be observable at all by a client of the
service. If it's not observable by the client, then in some sense it didn't
happen at all. This idea of viewing things from the client's perspective is an
important one in a practical sense, and I'll talk about that some more later
on.
It's important to realize that any given data center, or cluster provides many
services, and not all of them are related to each other. Failure of one
component in the system may not cause failure of the system. Indeed, good
HA design eliminates single points of failure by introducing redundancy. If
you're going to try and calculate MTBF in a real-life (meaning complex)
environment with redundancy and interrelated services, it's going to be very
complicated to do.
If the service failed whenever any node failed, then this model might be
correct. But if the other nodes were
providing redundancy or unrelated services, then they would have no effect
on MTBF of the service in question. Of course, as they break, you'd have to
repair them, which would mean replacing systems more and more often,
which would be both annoying and expensive, but it wouldn't cause the
service availability to go down.
To properly apply these formulas, even intuitively, you need to make sure
you understand what your service is, how you define a failure, how the
service components relate to each other, and what happens when one of
them fails. Here are a few rules of thumb for thinking about availability:
Complexity is the enemy of reliability (MTBF). This can take many forms:
Complex software fails more often than simple software
Complex hardware fails more often than simple hardware
Software dependencies usually mean that if any component fails, the whole
service fails
Configuration complexity lowers the chances of the configuration being
correct
Complexity drastically increases the possibility of human error
What is complex software? - Software whose model of the universe doesn't
match that of the staff who manage it.
Redundancy is the friend of availability - it allows for quick autonomic
recovery - significantly improving MTTR. Replication is another word for
redundancy.
Good failure detection is vital - HA and other autonomic software can only
recover from failures it detects. Undetected failures have human-speed
MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed
MTTR because the humans are surprised that it wasn't automatically
recovered and they respond more slowly than normal. In addition, the added
complexity of correcting an autonomic service and trying to keep their
fingers out of the gears may slow down their thought processes.
Non-essential components don't count - failure of inactive or non-essential
components doesn't affect service availability. These inactive components
can be hardware (spare machines), or software (like administrative
The first metric that we should understand is the time that a system is not
failed, or is available. Often referred to as uptime in the IT industry, the
length of time that a system is online between outages or failures can be
thought of as the time to failure for that system.
For example, if I bring my RAID array online on Monday at noon and the
system functions normally until a disk failure Friday at noon, it was
available for exactly 96 hours. If this happens every week, with repairs
lasting from Friday noon until Monday noon, I could average these numbers
to reach a mean time to failure or MTTF of 96 hours. I would probably
also call my system vendor and demand that they replace this horribly
unreliable device!
Note too that MTTF often exceeds the expected lifetime or usefulness of a
device by a good margin. A typical hard disk drive might list an MTTF of
1,000,000 hours, or over 100 years. But no one should expect a given hard
disk drive to last this long. In fact, disk replacement rate is much higher than
disk failure rate!
Mean Time to Repair (MTTR)
In our example above, our flaky RAID array had an MTTF of 96 hours. This
leaves three days, or 72 hours, to get things operational again. Over time,
we would come to expect a mean time to repair or MTTR of 72 hours for
any typical failure. Again, we would be justified in complaining to the vendor
at this point.
Repairs can be excruciating, but they often do not take anywhere near as
long as this. In fact, most computer systems and devices are wonderfully
reliable, with MTTF measured in months or years. But when things do go
wrong, it can often take quite a while to diagnose, replace, or repair the
failure. Even so, MTTR in IT systems tends to be measured in hours rather
than days.
Mean Time between Failures (MTBF)
The most common failure-related metric is also the one most often used incorrectly.
Mean time between failures or MTBF refers to the amount of time that
elapses between one failure and the next. Mathematically, this is the sum of
MTTF and MTTR, the total time required for a device to fail and that failure to
be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of
72 hours would have an MTBF of one week, or 168 hours. But many disk
drives only fail once in their life, and most never fail. So manufacturers don't
bother to talk about MTTR and instead use MTBF as a shorthand for average
failure rate over time. In other words, MTBF often reflects the number of
drives that fail rather than the rate at which they fail!
Stephen's Stance
Most computer industry vendors use the term MTBF rather indiscriminately.
But IT pros know that systems do not magically repair themselves, at least
not yet, so MTTR and MTTF are just as important!
One common convention when there are no failures is to take the total operating
hours and divide by one (rather than zero, when bad things mathematically
occur).
Summary
Two things.
Be sure what you are measuring and reporting using MTTF actually has a
constant failure rate, or close enough to constant that it doesn't matter. Send
over your questions and maybe become the next NoMTBF blog post. Third,
and most important, if you have a NoMTBF button or mug or whatever,
please take a picture of it in the wild and send it over. Also send along a short
note on how it has helped start conversations around MTBF. I'll create a page
on the site showcasing the ways these devices are starting conversations.
Recently Glenn S. asked if I had a reference for clear definitions of MTBF and
MTTF. After a bit of a search I sent him a definition or two, meanwhile he
gathered a few more.
They are all basically the same, with some slight differences. What is
interesting to me is the amount of variability in the interpretation and
understanding.
This is what I've compiled thus far on definitions for MTTF and MTBF. There
seem to be some variations in the definitions.
Mean Time Between Failures (MTBF): The average operating time expected
between failures in a population of identical components. This measure has
meaning only when we are discussing a population where there is repair or
replacement.
Mean Time To Failure (MTTF): The average operating time expected before
failure of a component which is not repaired or replaced. This is simply the
average time to failure of n units, i.e., the sum of n individual unit times
to failure divided by n units.
MEAN-TIME-TO-FAILURE (MTTF): A basic measure of reliability for non-repairable
items: The total number of life units of an item divided by the total
number of failures within that population, during a particular measurement
interval under stated conditions.
Mean Time to Failure. MTTF is the expected value (mean) of an item's failure-free
operating time. It is obtained from the reliability function R(t) as MTTF =
∫ R(t) dt, integrated from 0 to ∞, with T_L as the upper limit of the integral
if the lifetime is limited to T_L (R(t) = 0 for t > T_L). MTTF applies to both
non-repairable and repairable items if one assumes that after a repair the item
is as-good-as-new. If this is not the case, a new MTTF (MTTF_Si, starting from
state Z_j) can be considered (Table 6.2). An unbiased (empirical) estimate for
MTTF is MTTF = (t_1 + ... + t_n)/n, where t_1, ..., t_n are observed failure-free
operating times of statistically identical, independent items.
MTBF = 1/λ. MTBF should be reserved for items with constant failure rate λ. In
this case, MTBF = 1/λ is the expected value (mean) of the exponentially
distributed item's failure-free operating time, as expressed by Eqs. (1.9) and
(A6.84). The definition given here agrees with the statistical methods
generally used to estimate or demonstrate an MTBF. In particular MTBF = T/k,
where T is the given, fixed cumulative operating time (cumulated over an
arbitrary number of statistically identical and independent items) and k the
total number of failures (failed items) during T. The use of MTBF for mean
operating time between failures (or, as formerly, for mean time between
failures) has caused misuses (see the remarks on pp. 7, 318, 327, 416) and
should be dropped. The distinction often made between repairable and
non-repairable items should also be avoided (see MTTF).
Electropedia (http://www.electropedia.org/)
Mean Operating Time Between Failures MTBF the expectation of the
operating time between failures
Mean Time to Failure MTTF the expectation of the time to failure
I too have found these terms used interchangeably in many papers and
references.
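A numerical check of the integral definition above, assuming Weibull parameters for illustration; the integral of R(t) matches the closed-form Weibull mean:

```python
import math
from scipy import integrate

# Weibull reliability function R(t) = exp(-(t/eta)**beta); assumed parameters
beta, eta = 1.5, 1000.0

mttf_integral, _ = integrate.quad(lambda t: math.exp(-((t / eta) ** beta)), 0, math.inf)
mttf_closed_form = eta * math.gamma(1 + 1 / beta)

print(mttf_integral, mttf_closed_form)  # both ~902.7 hours
```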
with accurate results. These difficulties lead to unreliable figures, which have
generated a backlash against such calculations.
One typically defines reliability as the probability that said device will
perform functionally as required for a specified period of time. This all seems
rather simplistic, and it can be, to a degree, with a large enough sample size
and a long enough period of time. The main issue with deriving such figures
is that they are required for a product's release, not at the end of its lifetime
when actual reliability can be determined.
Interestingly, what would seem the most logical way to calculate MTBF gave
the most bizarre result! Given the product has been in manufacture and
deployed for more than 5 years, unlike a new product, we had the gift of
substantial historical data. Unfortunately, that data of approximately 5,000
units, deployed over an average of 3 years, with around 14 failures provides
an MTBF of more than 1,000 years!
As much as I'd like to gloat about our bespoke product's reliability, and my
figures will entirely support that this is a true MTBF figure, no one could
realistically believe even the materials the product is constructed from will
survive this length of time, though that could well be true of the plastic
enclosure!
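A quick reproduction of that fleet arithmetic, using the approximate figures given in the text:

```python
# The "bizarre" fleet calculation: total deployed unit-years over failures.
units = 5000
years_deployed = 3   # average per unit, as stated
failures = 14        # approximate, as stated

mtbf_years = units * years_deployed / failures
print(f"MTBF ~ {mtbf_years:,.0f} years")          # ~1,071 years
print(f"     ~ {mtbf_years * 8760:,.0f} hours")   # ~9.4 million hours
```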
The second, perhaps more realistic method, only considers one component:
the weakest link. It's perfectly logical that, by definition, the weakest link
is the most likely to fail, and thus the most likely to fail first. So should no
calculation exist at all, and this figure just be passed through to the final
product?
The way in which MTBF is presented I liken to how automobile manufacturers
declare fuel consumption figures. Never in history has the real-world MPG
achieved in a vehicle actually matched the extravagant claims of the
manufacturer, as this figure was obtained in a far from real-world test with
vents sealed, no wind, etc. Likewise, a component manufacturer's MTBF is
unlikely to encompass all, or any, of the extraneous factors that will affect it,
be that humidity, temperature, vibration, or shock. What these constants
were during testing is almost never documented, thus any particular MTBF
figure is rarely comparable to the next. Unfortunately, this regress follows to
the final product; MTBF simply doesn't cover the expected usage conditions
or what the product lifetime should be.
The calculation of reliability and likelihood of failure has been studied in
depth. Well-known, observable phenomena such as the bathtub effect are
well documented but very difficult to encompass in a single figure of hours.
Weibull analysis, determining where a population of product currently lies in
the bathtub, is well worth researching further, alongside Accelerated Life
Testing, which tries to compress an individual unit's passage of time, though
not quite for a millennium!
Collecting failure data to calculate mean time between failures (MTBF) in order
to determine accurate maintenance task intervals is wrong and should not be
done. MTBF is a measure of reliability. It is a measurement of the time
between two successive failure events.
For age related failures, it is not MTBF, but rather useful life that is significant
when attempting to determine maintenance task intervals to avoid failures.
There is a point in a piece of equipment's lifetime at which there is a rapid
increase in its conditional probability of failure. The measurement between
the point when the equipment is installed and the point where the
conditional probability of failure begins to sharply increase is the useful life
of the equipment. It is different than MTBF. The MTBF is defined as the
average life of all the population of that item in that service.
However, we do not know when the next component will fail. Therefore we
cannot successfully manage this failure by traditional time-based
maintenance (scheduled overhaul or replacement). It is important to know
the condition of the component and the life remaining before failure; in other
words, how fast can the component go from being OK to NOT OK. This is
sometimes referred to as the failure development period or potential failure
to functional failure (P-F) interval.
If the time from when the component initially develops signs of failure to the
time when it fails is 4 months, then maintenance inspections must be
performed at intervals of less than 4 months in order to catch the
degradation of the component condition. The inspection also must be
performed often enough to provide sufficient lead time to fix the equipment
before it functionally fails. In this case, we might want to schedule the
inspection every 2 months. This would ensure we catch the failure in the
process of occurring and give us approximately 2 months to schedule and
plan the repair.
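A tiny sketch of that rule of thumb, using the 4-month P-F interval from the example:

```python
# Inspect at no more than half the P-F interval so a developing failure is
# caught with enough lead time to plan the repair (sketch, not a standard).
pf_interval_months = 4.0                       # first detectable sign -> functional failure
inspection_interval = pf_interval_months / 2   # 2 months, as in the example
lead_time = pf_interval_months - inspection_interval

print(f"inspect every {inspection_interval:.0f} months; "
      f"worst-case lead time ~{lead_time:.0f} months")
```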
Failure prevention requires the use of some form of condition-based
maintenance at appropriate inspection intervals (failure finding, visual
inspections, and predictive technology inspections).
My experience has been that for every $1 million in asset value as many as
150 condition inspection points must be monitored. Gathering and analyzing
condition monitoring data to identify impending failure for assets worth
billions of dollars is practically impossible without the use of reliability
software.
The reliability software you choose should be able to:
Mean Time Between Failure (MTBF) is the time, on average, between failures of a
plant, including time lost whilst repairs are undertaken. It is an
indicator of the combined reliability and maintenance
effectiveness/efficiency. Example:
Plant runs for 500 hours, has 200 hours downtime due to 5 failures.
MTBF = (500+200)/5 = 140 hours
MTTR
Mean Time To Repair (MTTR) is the time, on average, that you would expect
a stoppage to last, including time spent waiting for the maintenance engineer,
diagnosis, waiting for parts, actual repair and testing. It is an indicator of
maintenance effectiveness/efficiency. Note that there is another indicator
that can be used here, Mean Corrective Repair Time (MCRT). MCRT is only
interested in the actual repair time, assuming all tools, spares and required
manpower are available (efficiency). MTTR - MCRT = waste, and therefore gives
you an indication of how ineffective your stores and maintenance resource
strategy is.
Example.
As above, plant runs for 500 hours, has 200 hours downtime due to 5
failures.
MTTR= 200/5= 40 hours
Therefore MTBF = MTTF + MTTR (here MTTF = 500/5 = 100 hours).
Availability
MTTF, MTBF and MTTR can also be used to calculate the availability, or more
importantly, the unavailability of the plant due to maintenance causing
failures. Example.
As above, plant runs 500 hours, breaks down 5 times and 200 hours are
spent waiting for repair/spares/repairing.
Availability:
= ((500 + 200)-200)/(500+200)
= (700-200)/700
= 71%
Or
Availability
= MTTF/ MTBF or MTTF/(MTTF+MTTR)
=100/140 = 71% or 100/(100+40) =71%
Unavailability
= 1 - availability
= 29%
Or
= 200/ (500 +200)
= 29%
Or
MTTR/ MTBF or MTTR/ (MTTF+MTTR)
= 40/140 or 40/ (100 +40)
= 29%
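Pulling the worked example together in one short sketch, with the figures given above:

```python
# Reproducing the worked example: 500 h running, 200 h down, 5 failures.
run_hours, down_hours, failures = 500.0, 200.0, 5

mttf = run_hours / failures       # 100 h
mttr = down_hours / failures      # 40 h
mtbf = mttf + mttr                # 140 h, the same as (500+200)/5

availability = mttf / mtbf        # ~71%
unavailability = 1 - availability # ~29%
print(f"MTTF={mttf:.0f} h, MTTR={mttr:.0f} h, MTBF={mtbf:.0f} h, A={availability:.0%}")
```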
Summary
As a maintenance engineer your objectives must be to ensure that the plant
is available to be used as and when required by production with no
compromise of quality and safety, all at minimum total cost. There are
indicators that can be used to measure performance against the aforementioned
criteria. MTTF is an indicator of plant reliability and should be as
long as possible. Increased MTTF can be achieved through an effective
preventative and/or predictive maintenance plan. MTTR should be as short as
possible and is an indicator of the combined effects of the maintenance
strategy, organisation and systems used. MTTR can be reduced by having
the right person, with the right tools, with the right spares, in the right place,
at the right time (sound familiar?). MTBF should be as long as possible and
is an overall indicator of reliability and effectiveness/efficiency.
Hi, just to confirm the arithmetic: the MTBF calculation divides by the number
of failures (5), i.e. MTBF = (500+200)/5 = 140 hours.
doesn't do anything, and I've never seen a computer pull a wrench, although I
have seen it done remotely. Generally the weakest link is the equipment's
bearing. It is usually the first thing to fail. Rotors are generally rugged and up
to the task, as are the shaft and other components. What's the L10 life of the
bearing? Do you know it? You should! If not, roll up your sleeves and get to
work: why isn't it lasting longer? If the maximum is 7 years and the L10 is 12,
then the environment may be such that 7 years MTBF is the max life. But by getting
out there and interacting with people, and training on alignment and
machinery setup, you will get to max MTBF, which will be reflected by smiling
faces. It has to go beyond compiling data to get an attaboy. You'll know the
MTBF when the machine has reached its max life, maybe L10 + 20%. Also
your paycheck should increase. Are your machines running longer now than
they were two years ago? I personally had first-hand knowledge of a PdM
program where a guy got a $15,000 bonus for a successful program - he
didn't have a program. In 6 years he had written 2 reports totalling 3 pages
and no improvements, but he had attaboys and pull, or draft as the case may
be. I recently retained a contract where the head guy wanted to go with a
cheaper service vendor. The techs said, don't screw with what works, we
have planned scheduled maintenance and no call-outs and it's good that
way. Still there. When I started, failures were at ~1.5 yrs and now ~6-9 on all. Of
course lightning is hard to predict. We've got so many acronyms now they
are hard to keep up with. Seems there's 5-10 acronyms for every icon and
it's growing. My soapbox is getting shaky about now.
The L10 rating is largely... inadequate... even when just considering the fatigue
question. Forget not the effect of contaminants, lubrication, etc., etc. In the
paper I presented at the CMVA last year (the Canadian equivalent to the VI), the
early portion deals with fatigue. See the ugly red thumbnail in the last
segment of http://vibra-k.com/?cat=15 . To make a long story short: bearing
fatigue life has more to do with contaminant inclusions within the steel than
load and cycling. That is the principal reason why so many bearings which
should fail from fatigue keep going well beyond the life estimate: low
contaminant contents in the load zone. As a rule, the design target is 50,000
or 100,000 hours for many machines. It can be higher for electric motors,
since the bearing size tends to be relative to shaft dimensions, and that last
size will be based on force couple and torque issues rather than the very
light rotor loads. It doesn't mean that you would not see cases with a more
extensive life: they abound. But the previous values of 50K hours (standard)
and 100K hours (by request) are good estimates.
expensive shaft replacements etc. Another was poor alignment, and there
were many premature bearing failures. Over the next couple of years I
replaced all the gear couplings with membranes, upgraded all pumps to
mechanical seals and bought a couple of laser alignment kits and trained the
guys to use them properly. I also looked at the workshop practices for
bearing changes and yes, they were "drifting" the new bearings onto the
shafts. This to me is criminal! I bought a couple of oil baths and some heavy
duty gloves. The result of all this hard work, and yes expense, was that after
four years the MTBF was at a very healthy 37 months! At the same time I
introduced a vibration monitoring programme along with lube oil analysis
and thermography. The most important result from all this was that
Operations could plan long runs and we could schedule our maintenance to
suit. Needless to say, there was no war between Operations and
Maintenance in this refinery (Rolly!).
Josh, The moving time window is just a convention or a rule to set the time
span within which performance indicators (PIs) are to be calculated. Its use is
quite common in management performance control at large. It is set
arbitrarily, though it should be sufficiently long to accommodate several
episodes (failures, in the instance of MTTF or MTBF); the more episodes, the
higher the accuracy. I forgot to erase the product 6*25*8 in cell D21, but the
product "6 months, 25 days per month and 8 hours per day" was in my mind
at the time I created the example in Excel. I should simply have entered
1,200 hours straight in that cell. In this example, only the latest 5 failures
matter because the other 2 are already out of the time window, being
consequently omitted (or discarded). Suppose you are using a time interval
of 6 weeks (your time window). If you calculate PIs, say, weekly, then the
time window is said to move (or slide) one week at a time. As time advances,
data pertaining to the seventh week back in time are abandoned and new
data, this time pertaining to the very last week, are now considered in the
calculations. When a piece of equipment is under a process of change, it is
common practice to give different weights to data (the older the data, the
lesser the weight; the weights must sum up to 1) in order to get a picture
that anticipates, in a certain way, the changes that are currently under way.
PIs are, in this particular circumstance, calculated by a weighted average.
By doing this, you attribute more importance (weight) to more recent events
and less importance (weight) to older ones. In the end, if you plot the PIs
collected over time on a graph, you get a trend which might be quite
informative on whether you are going the right way or not. There is another
method, which was baptized "exponential smoothing", that automatically
attributes less weight to older data, never "forgetting" any data, regardless
of how old they might be. This method doesn't use the constant time window
like the simpler methods I referred to above and is less popular among
practitioners.
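A rough sketch of the two conventions Rui describes, with made-up failure data; both the window length and the smoothing weight below are arbitrary choices:

```python
# 1) Constant moving window: only failures inside [now - window, now] count
failure_hours = [120, 310, 560, 640, 900, 1050, 1180]  # cumulative failure times (assumed)
window = 600.0                                         # moving time window, hours
now = 1200.0

recent = [t for t in failure_hours if t > now - window]
mtbf_window = window / len(recent) if recent else float("inf")
print(f"windowed MTBF = {mtbf_window:.0f} h over the last {window:.0f} h")

# 2) Exponential smoothing of per-period MTBF values (never fully "forgets")
period_mtbf = [160, 150, 170, 140, 130, 125]  # hypothetical per-period values
alpha = 0.3                                   # weight given to the newest value

smoothed = period_mtbf[0]
for value in period_mtbf[1:]:
    smoothed = alpha * value + (1 - alpha) * smoothed
print(f"smoothed MTBF = {smoothed:.0f} h")
```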
Rui, What is the purpose of arbitrarily setting the time span and moving the
time span forward as time goes by? Is this idea of a constant time window the
same as a constant moving average? Why not just calculate the MTBF from the
first failure to today, because you said the more episodes, the higher the
accuracy? I need to compare our MTBF values with similar industrial values
and also for trending, so I need standard methods of calculating the MTBF.
Why do we need to apply weighting to the data set for equipment under a
process of change? What change do you mean here? Is it modification,
operational mode change, operating parameter change, etc.?
Cheddar, Good job increasing the MTBF for your refinery pumps. Yes, we
could just take the straightforward approach that you used, i.e. summing
up operating time for all pumps (regardless of pump types) divided by the
number of failures in between (regardless of failure modes). This composite
or overall MTBF allows us to see the improvement from its trending. For further
simplification, did you take the operating time period to start from the year
the refinery was built, which is 1936, regardless of whether any pumps were
replaced over the years? Also it looks like you did not mention whether you
subtracted any downtime from the operating time period. However, I would
like to follow a standard method to calculate the MTBF so that I can compare
my figures with published MTBF data such as those in HP Bloch's book. One
more question: did you track the MTBF of the pump components individually,
such as bearings, seals, impellers, etc.?
When I arrived at the refinery I realised that they had a reliability problem,
particularly with the pumps. I took a long time and looked at the pump
failures just for the previous five years, and came up with the 6 months. I
then analysed the failures and started by addressing the biggest problem,
the couplings, then the next, the sealing, and so on. If you are looking at MTBF
to help analyse the effectiveness of your maintenance practices, then you
just need sufficient data to make your readings realistic. Obviously, the more
data you have, the further back you can go with GOOD DATA, the more
reliable your initial MTBF calculation will be. So look at your records and see
how far back you can go with a good degree of accuracy. Even if it's only a
couple of years, it's a good starting point. Then look at what the most
common failures were, and possibly ask why these failures were happening (as I
did with the bearings). There's nothing management like better than seeing
an improving MTBF.
Hello All, I worked in a semiconductor plant before and we had a machine called a
wirebonder, which places the gold wire in the circuit. One wirebond machine
(my estimate) is composed of around 500,000 parts, most of them
electronic parts (this machine is full of boards), and one station can have
around 100 or more wirebonders. We used to trend MTBF on a monthly basis for
all machines in total, and a perfect MTBF for a month is 168 hrs.
Let me give you an example: (MTBF for 100 WB)
Jan. 49 hrs / mo
Feb. 63 hrs / mo
Mar. 46 hrs / mo
Apr. 31 hrs / mo
Note the trend is going down; the MTBF is not good.
Level 1 : MTBF = 49 hrs / month for Jan.
Now we want to prepare an MTBF analysis
Level 2 : What particular Wirebond machines contribute to the low MTBF
Level 3 : Once we pinpointed what particular machines with low MTBF we go
on to what particular sub-assembly are always failing
Level 4 : Then we identify the components that usually fail
If you were just hired at your plant today, and you have no records of your
failures, for example, then as suggested by some, you start calculating your
MTBF today and wait till something fails.
So the start time is where you start to actually compute your MTBF. Now in
our case we compute the MTBF monthly so we can see the trend of
our line/station. What if there are no failures? Then the denominator for
frequency of failures is zero, which gives an MTBF of infinity. We have 2
options. In the case where we achieve a perfect MTBF of 168 hours in a month,
we assume a denominator of one, since anything you divide by 1 will give
you the numerator. The other option is to prolong the duration of the MTBF until 1
machine fails; however, using this option you cannot trend MTBF on a
monthly basis. Now for a group of machines you have 3 options to get the
total MTBF:
total MTBF :
1st Add the total MTBF
MTBF = (36 + 45 + 74 + 90) hrs
2nd You may get the average MTBF
MTBF = (36 + 45 + 74 + 90) / 4
3rd, you may get the percentage MTBF against the perfect 168 hrs
MTBF = ((36 + 45 + 74 + 90) / 4) / 168 hrs
What is important is the trend of the MTBF, whether it be monthly,
quarterly, semi-annually or yearly, and it must be increasing.
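A sketch of the monthly trending described above, assuming 168 run-hours per month and the divide-by-one convention for zero-failure months; the failure counts here are hypothetical:

```python
def monthly_mtbf(run_hours: float, failures: int) -> float:
    """Monthly MTBF, dividing by 1 when nothing failed (per the convention above)."""
    return run_hours / max(failures, 1)

months = {"Jan": 3, "Feb": 2, "Mar": 4, "Apr": 5, "May": 0}  # hypothetical counts
for month, failures in months.items():
    print(month, f"{monthly_mtbf(168.0, failures):.0f} h")
# May shows the "perfect" 168 h, matching the zero-failure convention
```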
Josh, When dealing with historic numbers, the more data you consider
(farther back in time) in order to compute an average, for instance, the
poorer the response will be to more recent inputs, but, in turn, the more
stable the indicator will be. The only way out consists in performing a
trade-off. Please see the example in Excel that I attach. It deals with data
that, despite being erratic from one period to the other, presents a trend
nonetheless. This trend can be due to some improvements that are being
implemented consistently over time, such as the ones to be expected from a
TPM program. The shorter the period (not a constant time window in this case
but rather a fixed number of events), the better the response, that is, the
better the picture of the magnitude the TTFs are actually reaching. On the
other hand, if a piece of equipment suffers a radical change where its
reliability is concerned, then past data will be of little use from that moment
on. New data will have to be collected in order to have refreshed indicators
available reflecting the new reality.
Rui, I have re-calculated the MTBF, MTTF & MTTR using the total operating time
from the beginning to today, i.e. without using the constant time window. Is it
correct? From your second example attached above, what I understand is that
using a constant time window to calculate MTBF will give higher values if
there are improvements made during that constant time window. My problem
with using the constant time window to calculate MTBF etc. is that none of the
published MTBF data, like HP Bloch's books, mentions using a constant time
window, or did I miss something? My second concern with using the constant
time window to calculate MTBF etc. is: what is the most appropriate
constant time window to use so that I can compare my MTBF data with
others' data? In view of this, it looks like calculating the MTBF every month
seems OK for trending purposes and comparing with the same industrial data
published by others.
Josh, An indicator is very often meant to anticipate the behavior of a system
within a few periods ahead. To do this, the indicator has to be built on data
collected for a while in the recent past, typically a time window whose
length has to be adjusted now and then in order to balance stability (the
property of not going up and down suddenly) and responsiveness (the
property of following the latest trend closely). Things that happened long
ago are only interesting for historical records. This way, the decision
maker gets the best possible picture regarding the direction an indicator is
taking and can act accordingly. Note that a sudden move up (or down) in the
last control period doesn't necessarily mean that you are moving towards
disaster or heaven. Management should be more concerned and focused on
trends rather than specific values from one period to the other. But the trend
cannot be built on a long period; otherwise management won't get response
but too much stability instead, and it will be too late when they finally notice
an undesirable trend. The weighted data methods that I wrote about in my
previous posts on this thread offer a way of getting the best possible trend
awareness. This is why you should use a time frame, and not a period as long
as possible, to calculate an indicator, unless it corresponds to the very
minimum time span to accommodate enough events to be statistically
significant. I saw your calculations and I would agree with them if I were not
aware of the fact that you are simply trying to get rid of the time window
constraint! I hope you agree this time. I am sorry I cannot give you a straight
answer as far as norms are concerned. I always follow the practices that I
think are the best for a specific case whenever I implement a management
performance control system, and never accept an indicator calculated in a
way imposed by some norm whose logic or justification I don't agree with.
This is to say that I don't follow norms where "management performance
control systems" are concerned, but I will, of course, consider any norm when
a client points me in that direction, with the condition that I freely decide
whether to advise (and justify) its adoption or not.
If you code every possible failure mode at the component level, then you can be
effective at judging equipment-level performance on a timely basis, by just
consolidating the figures gathered at that low level. This way, you get figures
useful for both reliability improvements and management performance
follow-up. By the way, I sometimes have doubts on how to translate a
therefore 80,000/80 = 10,000 hrs. The failure rate is the reciprocal of this and
is 0.0001 failures per hour.
When there is sufficient lead time between the discovery of an equipment
problem and the time when it must be repaired, there is sufficient time to
plan the work. This also
avoids the need to make repairs under unfavorable emergency conditions.
Successful planning is a direct by-product of the 'detection-orientation' of PM.
When work is planned, it is done more deliberately resulting in higher quality
work done more productively. Conversely, problems found after the
equipment has deteriorated substantially are more serious and costlier to
repair. The detection-orientation of the PM program can increase the
opportunity for planning work to:
Maximize equipment availability
Minimize downtime
Prolong equipment life
Increase equipment reliability
Reduce emergencies
Routine Preventive Maintenance Services - Routine preventive maintenance
services include inspection, lubrication, cleaning, adjusting and calibration,
replacement of minor components like belts and filters and non-destructive
testing like oil sampling. Routine PM services are carried out by maintenance
craftsmen or operators as they perform visual inspections, lubrication,
cleaning, adjusting and calibration and non-destructive testing.
Condition-Monitoring - Condition-monitoring may supplement routine PM
services or be carried out separately. It uses a variety of predictive
techniques and specialized equipment to generate audio or visual signals
which are compared with signals depicting normal operation. Variations
enable analysts to identify problems, gauge the degree of deterioration and
determine corrective actions and their timing. Condition-monitoring has
vastly improved the capability of maintenance to detect potential problems
and has reinforced the concept of detection-orientation.
Types of Condition Monitoring
Critical components whose failure could be catastrophic are monitored
continuously. Sensors are located at the critical components and linked with
computers set to indicate normal operating conditions. The indication of an
abnormal condition signals the operator with an alarm (called a protected
system). The operator, often located in a remote control room, notifies
supervisors of the fault.
Supervisor vacancies are often filled with the person most
proficient at performing a specific job. Rarely does a supervisor receive
training in how to lead a group of people effectively.
There are several aspects to being an effective supervisor, but let's review
the three fundamental behaviors that supervisors need to master. These
work best with a span of control of around 15 to 20 people per supervisor,
and when the supervisor's primary job is to, well, supervise, not chase parts
or attend too many meetings or perform maintenance work.
Set clear expectations
Supervisors need to establish expectations with individuals or work teams
that align with your business expectations. The first expectation usually
involves production output (quantity) at an expected quality over a specific
time period. Efficiency is the most common expectation but there are usually
more. Supervisors can communicate expectations during the morning start-up
meeting or with individuals at their workplace. It's vital that supervisors get
buy-in to the expectations. When an individual or team understands what's
expected of them, the supervisor has a basis for accountability to meet
production requirements without having to rely on convincing, persuading or
demanding action. Setting expectations helps to remove some of the
emotion in supervising the workforce and, as you'll see with the next
fundamental, it's not just managing by the numbers.
Follow up
Follow up requires spending time in the production area visiting with each
person or team to see how they're satisfying the expectations. It's important
not to carry the sledgehammer during follow up; this method will only lead
to other issues further down the road. Supervisors should ask specific
questions about any issues causing deficiencies in performance. If these
issues can't be addressed by the individual or team, then it's the supervisor's
job to determine a fix. Even the most effective supervisors don't always
have all the answers but they know where to find the answers, seeking the
appropriate support of functional groups (management, quality, maintenance
or engineering) to remedy the issue. Follow up is crucial because it provides
an opportunity to interact with the workforce in a constructive, proactive way
and lets the team know that meeting their agreed-upon expectations is
important.
Provide feedback
Once expectations have been set and follow up has occurred, it's time to
provide feedback. This is when things can get interesting. Most supervisors
have little problem with providing positive feedback. However, those that
lack proper training and guidance may struggle when the necessary
feedback is negative. Negative feedback due to true inefficiencies caused by
an individual or team is the toughest. If a supervisor isn't careful, it can lead
to some sort of confrontation. The best approach is to remind people of the
initial expectations and seek their input on how to turn the negative into a
positive. There may be times when the supervisor must recommend
retraining or replacement, all of which is part of supervising people. While it
is not always easy, providing feedback is again an opportunity to interact
with the workforce.
Supervisors often receive a lot of added pressure to do more, with little or
no support. Keep in mind that they are possibly the most important part of
your organization because they are the first-line contact for your most valued
asset, your people. Make sure your supervisors understand the three
fundamental behaviors for leading a team:
Set clear expectations that support the organization's goals
Follow up with your people to ensure expectations are understood and seek
input
Provide feedback when needed for both good and struggling performance
Applying these three fundamentals will take some effort and time. With
support from the rest of the organization, a struggling supervisor can
become an effective part of your organization and significantly improve their
team's performance.
One of the many MRO best practices we look for at our client sites is the
existence of a tool crib designed to effectively and efficiently store, control
and
maintain specialty items such as bolt cutters, shop vacs, sawzalls and other
power tools, generators, fans, and pipe threaders. These are often one-of-a-kind
and sometimes expensive items that aren't required frequently, but
when they are, they need to be available and ready for use.
Ironically, our clients often lament the lack of availability and poor
operating condition of their specialty tools, resulting in unnecessary delays in
completion of maintenance work, sometimes critical jobs. Yet when we ask
them if they have a tool crib, the response is usually an emphatic No or a
shrug of the shoulders that effectively says the same thing, sometimes with
a wistful look that suggests that they wish they did.
Beyond the verbal responses are perhaps more deep-seated reasons, such as
ownership ("I bought this tool, I will take care of it"), or territorialism ("These
are my tools and no one else is going to use them"), or more often trust ("I
don't believe they will have the tool when I need it"). Instead of recognizing
the problem and fixing it, they live with the current situation and complain
about the results.
The fact of the matter is that the only legitimate reason for not having a tool
crib is that you don't have any specialty tools at all, and that's a rarity.
How a Tool Crib Should Function
A tool crib should handle every specialty tool on site, and that could require a
significant amount of space to organize and manage effectively. Ideally that
space would be in a designated area of the MRO storeroom so the tools can
be properly organized, controlled, and kitted along with other MRO materials,
but having the tool crib in the Maintenance area is also a viable option.
Who will manage the tool crib is also a concern. Storeroom personnel
generally man and manage the tool crib if it's in the MRO storeroom, while
Maintenance often is responsible for operating the tool crib if it is in their
area. Part-time storeroom support can sometimes be adequate to staff tool
cribs in maintenance shops depending on the level of activity.
Once the tool crib itself is established, item numbers are set up for each of
the specialty tools. These items are essentially no different than any other
MRO material, with the exception that you expect to get them back after the
job is completed. With item numbers set up in the inventory management
system, specialty tools can be assigned a bin location in the storeroom and
the available inventory monitored. Tools can be planned, scheduled, issued
and tracked through the system, even to the point of knowing where the tool
should be, who should have it, and when it should be returned to the tool
crib.
As required tools are identified for a planned job, they are listed on the work
order. This reserves the tool, preventing it from being used on another job.
When the work order is released by the planner, the tools appear on a kit list
along with other MRO materials required for the job. The tools are kitted
along with the other materials, charged to the appropriate work order, and
held until everything, including the required tools, is available. After the job
is scheduled, the kitted parts and tools are delivered to the job site where
the work is to be completed. Once the job has been assigned to a craftsman,
that person assumes responsibility for the tool. The estimated duration on
the work order provides an idea of when to anticipate the return of the tool.
After the job is done, the tool should be returned to the tool crib so it can be
reused. If the tool is not returned as expected, the assigned craftsman can
easily be identified and contacted to find out what happened to it.
After verifying the quality of the tool, it is put back into the on-hand
inventory in the system. This credits the work order and shows the tool as
available for future jobs. The transactions to issue the tool to the job in the
first place, and subsequently return it to stock, offset each other, so it
doesn't matter what cost is associated with the tool, but generally these
items are held at zero value.
For unplanned work, obtaining specialty tools is really no different than any
other unplanned material request. The appropriate item number should be
identified by the person who needs the tool, and the available inventory
checked before going to the storeroom to pick it up or asking the storeroom
to deliver it.
Resource leveling
A resource-leveled weekly schedule provides a logical way to balance
required work versus available man-hours. Once a week, the resource pool is
assessed for available man-hours. This information is then compared to the
backlog of work. This may be a manual process or it may utilize a
resource-leveling program. A preliminary schedule is then taken to the weekly
schedule meeting where attendees can refine it.
Without resource leveling, the process becomes subjective and open to error.
That, unfortunately, is common practice for many sites.
The weekly schedule meeting
If the management team waits until the meeting to select the work, it is
already too late to gain maximum value from the meeting. The weekly
schedule meeting is the time to refine the schedule, not build it. That said,
the meeting should be flexible. This is the time to confirm whether the
scheduled work is actually what should be done. Work can be added or
subtracted, based on parameters not known to the CMMS. All affected
departments should be present to provide input and gain consensus. Good
communication between maintenance and operations will improve schedule
accuracy.
An example of an appropriate change at the weekly schedule meeting might
be selecting related work based on the craft traveling to a remote location.
The process
Simply implementing a fundamental planning and scheduling system should
help improve productivity. Before each work day, the maintenance supervisor
will create his daily schedule from the weekly schedule. The work is linked
to the worker in the daily schedule. Each day, progress is provided on work
performed and the CMMS is updated. Examples of progress could be: work
was started, completed or placed on hold.
The daily schedule should be created from the weekly schedule. However, the
typical daily schedule includes reactive maintenance not shown on the
weekly schedule.
Issuing only a daily schedule does not eliminate the need for a weekly schedule; a company that relies solely on a daily schedule will see increased reactive maintenance.
Schedule compliance
Resource pool
To increase the efficiency of producing a weekly schedule, a CMMS should
provide easy entry screens for:
1. Worker labor information, including the labor identifier, craft code and the assigned calendar/shift code.
2. Yes/No worker availability: is this craft person an available worker? A worker, such as a leading hand, may be in a craft but not normally assigned to work activities. (A leading hand may be the most senior person in the craft in larger maintenance organizations.)
Given the above tools, it is easy to maintain a resource pool. The working
level normally stays on the same shift, although rotating, for years at a time.
The only variable is when someone says something like, "I have jury duty next Wednesday and Thursday."
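Those two entry screens map naturally onto a small data structure. Here is a hedged sketch in Python; the names and the shift-hours lookup are assumptions for illustration, not a vendor schema.

from dataclasses import dataclass

@dataclass
class Worker:
    labor_id: str
    craft: str            # craft code
    shift_calendar: str   # assigned calendar/shift code
    available: bool       # No for a leading hand not assigned to work

def pool_hours(workers, shift_hours, exceptions):
    # Sum weekly man-hours per craft, less known absences such as
    # jury duty. shift_hours: calendar code -> hours per week;
    # exceptions: labor_id -> hours lost this week.
    pool = {}
    for w in workers:
        if not w.available:
            continue
        hours = shift_hours[w.shift_calendar] - exceptions.get(w.labor_id, 0)
        pool[w.craft] = pool.get(w.craft, 0) + max(hours, 0)
    return pool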
What if you have no job planners? Keep in mind that not all company sites
are the same.
For example, the person creating the schedule may be creating a list of work
and downloading this information to other software programs for further
editing and/or data sorting. They may also be pushing the data to a Primavera (P3) or Microsoft Project (MSP) tool. Those who track the process of pushing this data
from and back to the scheduling tool usually find that a substantial amount
of effort is involved.
Typically, the data moved outside the CMMS is quickly out of sync with reality
due to constant updates of the CMMS data from the insertion of new work
and changing priorities based on short-term emergencies. What if work
priorities or calendar data are entered on the schedule side and not updated in the CMMS? Is it necessary to maintain work-level calendars in two systems? What if the resource-leveling algorithm in the scheduling tool doesn't use the order-of-fire concept? Where do you run weekly schedule compliance?
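For reference, weekly schedule compliance is commonly computed as the share of scheduled work orders actually completed in the week. The sketch below uses that common industry definition, assumed here since the article does not spell one out.

def schedule_compliance(scheduled, completed):
    # scheduled: work orders on the frozen weekly schedule
    # completed: set of work orders finished during that week
    done = sum(1 for wo in scheduled if wo in completed)
    return 100.0 * done / len(scheduled) if scheduled else 100.0

print(schedule_compliance(["WO-1", "WO-2", "WO-3", "WO-4"],
                          {"WO-1", "WO-2", "WO-3"}))  # 75.0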
Where do you stand? How does your company compare to the general CMMS
user community?
The reason for this low adoption rate is simple: most software vendors don't make the development of resource-leveling software a priority. Likewise, because a usable tool has been unavailable, users have not learned the value of this process.
What now? Companies have learned that with a readily available CMMS add-in and adjustments in a few crucial processes, they can gain substantial economic efficiencies. A surprising but very significant bonus is that their
respective companies soon become far better places to work. Shared goals
built on inter-departmental cooperation have quickly lowered conflict and
increased job satisfaction.
Start by comparing your current practices with those discussed in this article.
If you believe you have opportunities for improvement, take action. Change
what you can with your current skills and tools, then ask for any necessary
outside support to help you make it all the way. MT
John Reeve has spent the past 18 years helping clients solve real-world
CMMS problems. As a senior consultant for Synterprise Global Consulting, he
deals with once-in-a-lifetime issues several times each year. He can be
reached at JReeve@synterprise.com.
C. PM work
The CMMS product automatically generates these records as PM work orders.
They have a work type of PM, a status of ready and a target start date. If
this target date falls within the upcoming weekly schedule range, it will be
scheduled.
Some sites may have a dedicated PM crew.
The processing order (order of fire) for the resource-leveling program
involving PM work would be selected by the client.
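The selection rule for PM work is essentially a date-window test. The sketch below illustrates it in Python; the field names are assumptions, not a product's schema.

from datetime import date, timedelta

def pms_due_this_week(work_orders, week_start):
    # Select ready PM work orders whose target start date falls
    # within the upcoming weekly schedule range.
    week_end = week_start + timedelta(days=6)
    return [wo for wo in work_orders
            if wo["work_type"] == "PM"
            and wo["status"] == "ready"
            and week_start <= wo["target_start"] <= week_end]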
D. Order of fire
This is a unique concept that defines the order of backlog processing. A primitive approach would be simply to take the highest-priority work first. The order-of-fire concept instead directs the planner/scheduler to develop statements that control the exact order of evaluation, as in the examples and the sketch that follow.
Examples:
i. Emergency maintenance or fix-it-now (FIN) work types
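A sketch of the idea in Python, with hypothetical rule contents (only the emergency/FIN example above comes from the text; the remaining rules are assumed for illustration):

# Each rule consumes its matching backlog before the next rule is
# evaluated, so the sequence of rules, not a single priority field,
# controls the order of processing.
ORDER_OF_FIRE = [
    lambda wo: wo["work_type"] in ("EM", "FIN"),  # emergency / fix-it-now
    lambda wo: wo["work_type"] == "PM",           # preventive maintenance
    lambda wo: wo["priority"] == 1,               # top-priority corrective
    lambda wo: True,                              # everything else
]

def fire_order(backlog):
    ordered, remaining = [], list(backlog)
    for rule in ORDER_OF_FIRE:
        ordered.extend(wo for wo in remaining if rule(wo))
        remaining = [wo for wo in remaining if not rule(wo)]
    return ordered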
development of a work order in the CMMS product with the proper work type
(i.e. CP) and giving it a scheduled start date. If this start date falls within the
upcoming weekly schedule range, it will be processed.
Major maintenance may or may not consume on-site labor resources, but it is
still beneficial to include this work in the weekly schedule. Adding this
information gives improved visibility to all departments and reduces work
coordination errors such as tearing up the parking lot twice in the same
month.
How will you know if your efforts are successful? Mueller listed the following
indicators:
The preventive maintenance (PM) program is on schedule.
Supervisors don't need to do their own planning and scheduling because
they follow the schedules for their shops.
Your customers have a single point of communication.
All work is on a work order.
No work orders are released before the work is ready to be executed.
A Potential Failure (low defect severity; there is time to plan and schedule proactive work) is an identifiable physical condition which indicates a functional failure is imminent and is usually identified by a maintenance technician using condition monitoring or quantitative preventive maintenance techniques.
PM vs. CBM
Prior to the beginning of the maintenance day shift:
The maintenance planner's day starts before the regular maintenance day shift in order to review the work orders that came in overnight. The planner will make an estimate of the man-hours, number of personnel and craft types needed for any emergency work orders that must be started that day, then move those work orders directly to the maintenance crew, followed by a quick phone call to notify the maintenance supervisor responsible for that area of the plant. The planner will also code these jobs as Emergency work orders so the level of this type of work can be tracked over time. Application of
well-disciplined proactive maintenance strategies (PM/CBM) coupled with
effective planning and scheduling will make these emergency jobs fewer and
fewer over time.
The planner should also use good planning and scheduling techniques on his
own responsibilities. Once any emergency work has been estimated and
sent to the maintenance crew, the maintenance planner will plug new work
requests into his/her field inspection schedule. Some jobs may need to be
worked into today's field inspection schedule in order to be put on tomorrow's maintenance schedule. Other new requests can be scheduled for
field inspection and planning later in the week. It is important for a proactive
planner to schedule all of his jobs (other than emergency work) for field
inspections on a particular day to be most effective. The planner will also set storeroom parts on reserve rather than having them issued right away; the material stays in stores until the job is actually scheduled, yet if the job gets cancelled, it doesn't have to be returned. Less handling and better inventory accuracy provided by the reservation approach will reduce costs.
Working from the Job Inspection Form, the planner will identify the various
needs required by the jobs and will start documenting the job plan. First and foremost is the Job Summary page, which will contain the basic information that a fully qualified mechanic who is very familiar with this type of job would need. The Job Summary would provide reference numbers to the
detailed information for the job which would follow in the job plan. This type
of job plan format will allow those familiar with the task to quickly review the
job only using the summary sheet. Anyone less familiar or skilled would
have references on each item on the job summary sheet to the specific
section of the job plan to access the specific information they need. This
provides maintenance personnel with quick access to the information they
need without having to read through information they dont need.
All free time that the planner has should be spent refining and permanently
documenting job plans. As the planner's job plan database grows, he/she
will have more and more plans that can be used on future jobs with only
minor refinements. This will allow the planner to plan for a greater number
of field maintenance personnel. As job plans are completed, the planner
should update his/her backlog status to Planning Complete. When all parts
not available through stores have been received and the storeroom parts are
on reserve, the status should be changed to Ready to Schedule, assuming
the job plan has been completed. The Scheduler will initiate the delivery of
storeroom parts on reserve the day before the job is scheduled for execution.
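The status progression the planner maintains can be expressed as a simple state check. A minimal sketch follows, assuming hypothetical flags for plan and parts readiness and an assumed starting status of "In Planning"; the other status names come from the text.

def advance_status(job):
    # "Planning Complete" once the job plan is written; "Ready to
    # Schedule" once off-site parts are in and storeroom parts reserved.
    # "In Planning" is an assumed starting status for the sketch.
    if job["status"] == "In Planning" and job["plan_complete"]:
        job["status"] = "Planning Complete"
    if (job["status"] == "Planning Complete"
            and job["offsite_parts_received"]
            and job["stores_parts_reserved"]):
        job["status"] = "Ready to Schedule"
    return job["status"]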
Late Morning:
Planner meets with the maintenance supervisor, scheduler and maintenance
coordinator:
Now armed with the information gathered during the field inspection route, having processed parts needs and updated the status of jobs that have received some or all of the parts ordered, the planner should meet with the maintenance supervisor, scheduler and coordinator. The planner should bring a copy of the Planning Backlog, with current status updated, to the meeting. This meeting should be short, 30 minutes or less, and its purpose is twofold: 1) to provide preliminary information to those who will be building/amending the maintenance schedule, and 2) to ensure that the planner has scheduled the
various jobs in his/her queue in a manner consistent with the needs of
maintenance and production. The planner should share parts issue updates
and the schedule for his/her planning activities. Any other major constraints, such as boom truck or crane needs or some other special need for particular
jobs will need to be communicated. This will provide maintenance and
operations with important information that will allow them to start planning
for when particular jobs will be ready for placement in the maintenance
schedule. This meeting will also allow maintenance and operations to
provide feedback to the planner on any changes that need to be made to the
planning schedule. For example, the planner may have a particular job scheduled to reach Ready to Schedule status by next Tuesday when, in fact, production needs it sooner or can wait until later.
Early Afternoon
Immediately after lunch, the planner will continue writing job plans, researching technical issues for particular jobs, obtaining approval for jobs meeting specific criteria, referring other jobs to Engineering for redesign as applicable, and updating the status of the requests as appropriate.
Each day, the planner should designate a small amount of time for reviewing
the feedback from the mechanics on jobs recently completed. This is an
important step for the planner to be able to improve the effectiveness of the
plans he or she creates.
Late Afternoon
An hour or so before the daily scheduling meeting, the planner should review his/her email and phone messages to see if there have been any late changes to the general plan that has been forming for the next day's schedule. This information may affect the Job Summary sheets the planner takes to the scheduling meeting.
The daily scheduling meeting is not a meeting where the planning backlog
will be reviewed and jobs will be selected for scheduling. Because the
planner meticulously keeps the status of all jobs updated and because of the
late morning meeting between the planner, scheduler, maintenance
coordinator, and the maintenance supervisor, the schedule has inherently
been forming on its own. The daily scheduling meeting is where the weekly schedule will be either confirmed for the next day or slightly amended to respond to higher-priority needs that have presented themselves since the weekly schedule was posted, yet have still allowed time for the preparations necessary to reap the benefits of planning and scheduling. Also, changes may be made
to more days than just tomorrow depending on needs and planning status of
the jobs. This meeting should take 30 minutes or less if each role has
prepared in advance and communicated effectively with the other players as
needed. It is only to finalize what they have already been discussing and
working toward since yesterday's daily planning meeting.
After the daily scheduling meeting, the scheduler will change the status of any work orders that are to be added to the maintenance schedule and will also order all parts that are on reserve in the storeroom. Following the daily planning meeting, the planner will amend the Field Inspection schedule and make any adjustments necessary to the overall planning schedule.
The planner will need to update any measures the organization tracks relative to planning, such as man-hours planned and emergency man-hours per day.
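A minimal tally of those two measures might look like this; the work-type code "EM" and the field names are assumptions for illustration.

def daily_measures(work_orders_touched_today):
    # Split estimated man-hours into planned versus emergency work.
    planned = sum(wo["est_hours"] for wo in work_orders_touched_today
                  if wo["work_type"] != "EM")
    emergency = sum(wo["est_hours"] for wo in work_orders_touched_today
                    if wo["work_type"] == "EM")
    return {"planned_man_hours": planned, "emergency_man_hours": emergency}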
End of the day
Make a quick review of the entire Planning Backlog:
Is the job status up to date on all jobs?
Is the Field Inspection schedule for tomorrow ready?
Have all parts coming from off-site been ordered and parts available from the
storeroom placed on reserve for jobs that have been inspected?
Conclusions
Notice that the planner has not had any involvement in work that is underway, and almost all of the planner's activities have been directed toward work that will leverage his/her time. The only exception to this should be the small amount of time it took the planner to make a quick labor estimate on emergency work. A planner who follows this type of rigor can be assured that he/she is leveraging the entire maintenance crew by his/her
efforts and helping to propel the organization to a more proactive state
where emergency work and unexpected failures are the exception. This job
requires discipline and patience as the transition from reactive maintenance