By Carlo Lebrun (Ecisgroup Spa)
Some definitions first
Maintenance could be defined as:
“All actions performed with the objective to keep an item in its full functional state.”
In general maintenance can be further defined according to the following types:
1. Breakdown maintenance: An item gets repaired when it breaks or fails.
2. Preventive maintenance: regular activities (cleaning, lubricating, replacing, etc.) are performed in
order to preserve the functional condition. It can be further split into periodic maintenance (based on
absolute time) and predictive maintenance (based on working conditions, e.g. 2000 hrs running at
90% of load).
3. Corrective maintenance: It improves equipment with design issues. Weak design requires
improvement to keep or to improve functionality. Note: the term is often but improperly used
instead of “breakdown maintenance”.
4. Maintenance prevention: design actions are performed to reduce maintenance requirements,
based on the analysis of past experience with similar equipment (lessons learned).
And to keep our installed equipment in good shape we should perform at least the first two:
‐ restore the full functionality every time a certain type of failure is detected by the system itself
‐ regularly inspect the system, and re‐establish the full functionality in case the inspection shows
some performance decrease
Today we often lean towards Reliability Centered Maintenance (RCM), which aims at the optimal balance
between lower activity cost and higher equipment reliability.
The technical standard SAE JA 1011 “Evaluation Criteria for RCM Processes” sets the criteria for implementing
Reliability Centered Maintenance. It is based on the questions below, to be answered in the given
sequence:
1. What is the item supposed to do, and what are its associated performance standards?
2. In what ways can it fail to provide the required functions?
3. What are the events that cause each failure?
4. What happens when each failure occurs?
5. In what way does each failure matter?
6. What systematic task can be performed proactively to prevent, or to decrease the consequences of
the failure?
7. What should be done if a suitable proactive task cannot be found?
Maintenance improves Reliability
Automation systems like PLC, DCS, ESD, HIPPS logic solvers, etc. require maintenance just like any other
equipment. Regular preventive maintenance helps to keep the desired functionality and minimize the risk
of failures.
Although they appear to be static and reliable devices, they are subject to changes in their working
conditions which may degrade or defeat their functionality.
When the automation equipment is also in charge of a protective function (like a HIPPS logic solver, in
which case we call it a Safety Instrumented System: SIS) we enter the field of functional safety (the portion
of safety that depends on the correct functioning of a piece of equipment).
As a consequence, if we deal with functional safety requirements (often defined as SIL classification) we are
obliged to follow and document a precise maintenance plan, together with a periodic total or partial
verification of the equipment functionality at the “proof test interval”.
The declared reliability of a PLC (or other “logic solver”) is not an intrinsic and stable feature. It is a dynamic
condition, which can change for several reasons, including whether the planned maintenance is actually
carried out correctly. The usual metric is the Probability of Failure on Demand (PFD). PFD is normally
considered to increase with time. A 100% proof test of functionality makes the PFD restart from its lowest level.
SIL classification is based on the average PFD over the safety system lifecycle.
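The sawtooth behaviour described above can be sketched numerically. This is a minimal sketch, not a certified calculation: it assumes a single channel with a constant dangerous undetected failure rate (called lambda_du here, a name and a value that are not in the original article), a perfect proof test that resets the PFD to zero, and the low demand PFD bands of IEC 61508 for the SIL lookup.

```python
import math

HOURS_PER_YEAR = 8760


def pfd_at(t_hours, lambda_du, proof_test_interval):
    """PFD at time t, assuming a perfect proof test every proof_test_interval hours."""
    hours_since_test = t_hours % proof_test_interval
    return 1.0 - math.exp(-lambda_du * hours_since_test)


def pfd_average(lambda_du, proof_test_interval):
    """Average of 1 - exp(-lambda*t) over one proof test interval.

    For small lambda*T this is close to the usual approximation lambda*T/2.
    """
    x = lambda_du * proof_test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x


def sil_band(pfd_avg):
    """Map an average PFD to a SIL (IEC 61508 low demand bands), or None if outside."""
    bands = [(1e-5, 1e-4, 4), (1e-4, 1e-3, 3), (1e-3, 1e-2, 2), (1e-2, 1e-1, 1)]
    for low, high, sil in bands:
        if low <= pfd_avg < high:
            return sil
    return None


if __name__ == "__main__":
    lambda_du = 1e-6  # hypothetical failure rate, failures per hour
    for years in (1, 2, 5):
        avg = pfd_average(lambda_du, years * HOURS_PER_YEAR)
        print(f"proof test every {years} y: PFDavg = {avg:.2e}, SIL band = {sil_band(avg)}")
```

With the same equipment, stretching the proof test interval raises the average PFD roughly in proportion, which is why the interval is part of the SIL claim and not a free choice.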
Maintenance will ensure that any known possible degradation of functionality is prevented or detected in
time.
The value of declared failure rates
We often hear that electronic equipment has its own failure rate. It is not always easy to understand if a
failure of an electronic module is due to an intrinsic feature or to some external cause. But indeed we have
to admit that we are not really able to track all possible causes. So let’s accept there is an intrinsic failure
rate for each equipment.
When we read a manufacturer's IEC61508 certification for a component, we should read the declared data
as the intrinsic failure rates, assuming perfect design, installation, environmental conditions, etc.
And if the manufacturer declares only the SIL or the PFD values, that’s even worse: these should be taken as
the most optimistic data, assuming some hidden proof test time interval, some hidden percentage of
common cause failures, etc.
We should carefully evaluate the applicability of all these assumptions.
Do you know any PLC/DCS manufacturer who declared their equipment failure rates instead of just the PFD
values in their IEC61508 certification? I don’t.
Do you know any PLC/DCS manufacturer who declared their assumptions in their IEC61508 certification? I
don’t.
Negative answers, but it doesn’t really matter too much: their evaluations are usually theoretical
calculations based on the detailed list of electronic subcomponents, each of them with a specific failure rate.
Can you believe the result? I don’t. But you can use it as a starting value …
Therefore the IEC61508 manufacturer declaration should allow us to estimate the “ideal” failure rate of the
equipment in use.
As an alternative we can use a failure rate database issued by some organization or authority working
with functional safety (Exida, OREDA, OLF, etc.). At least these guys clearly declare their assumptions.
Here is a portion of table A.3 from APPLICATION OF IEC 61508 AND IEC 61511 IN THE NORWEGIAN
PETROLEUM INDUSTRY Rev. 2 by OLF (also called “OLF‐070 guidelines”, see www.itk.ntnu.no/sil )
Possible causes of failures
As we said, the declared failure rates mostly assume that everything is perfect outside the boundaries of our
component, which is not always true.
Several other causes may affect the frequency of failures.
We recall here several possible issues that we are aware of (there are surely others!):
‐ Ambient (and/or cabinets) temperature above tolerated limits: overheating can degrade
components, and even cause fires. Normally different conditions are specified for operation and for
storage.
‐ Ambient (and/or cabinets) temperature oscillations: thermal expansion could cause material
degradation or repeated movement, up to loss of electrical contact.
‐ Moisture / humidity above tolerated limits: Humidity can provoke short circuits, and facilitate
corrosion or materials degradation.
‐ Dust: deposits can limit the thermal exchange and promote overheating. Since our
equipment is normally installed in ventilated cabinets, dust can also clog the filters and reduce the
ventilation effect. In some cases dust can be conductive (from metals or from coal) and can
provoke short circuits.
‐ Corrosion: exposure to corrosive environments can cause material degradation. Corrosion may
often be caused or enhanced by dust and by humidity (even when below the tolerated limits!)
‐ Electromagnetic interference: radio waves can affect electrical equipment. Nowadays the
technology improvements have significantly reduced the importance of this issue.
‐ Power Supply stability: fluctuations beyond the specified limits can affect PLC/DCS functionality.
The use of separate power supplies for separate functions is generally recommended. Avoid using
the automation equipment power supply for other purposes like e.g. soldering, welding or brewing
coffee.
‐ Vibration: vibration could provoke physical breaks and disconnection of components or cables, with
loss of electrical contact. Vibration could be caused by earthquakes, but also by machinery in
operation and by human activities during construction.
‐ Age: it may degrade some materials, e.g. the filling and insulation (dielectric materials) of
capacitors and transformers. This effect is normally countered by regular replacement of some
components (typically power supply units) at a specified time interval.
‐ Grounding / Earthing: bad grounding can cause power supply instability and also network
communication problems. Complete isolation from the rest of the plant is essential. The neutral to ground
voltage should be less than 1 V. All metal parts of the cabinet and in the cabinet shall be connected to
ground.
‐ Induced currents: incorrect routing of power cables close to automation system cables (e.g. I/O
wiring) can cause induced currents which may lead to signal spikes or even I/O module failure.
Shielded and grounded cables for power and control circuits are usually recommended.
‐ Do not allow welding or any other high current activity near the cabinet or powered from the
cabinet.
‐ I/O short circuits: I/O cards should be isolated from external shorts
‐ Wildlife: ants, rats, squirrels, bats can damage equipment and gnaw cables. Sometimes they do not
cause direct damage but they can hinder maintenance (wasps, etc.)
‐ Design problems: let’s not forget they can exist.
‐ Maintenance: mistakes, distractions, underestimated impact on other equipment, missing or
wrong documentation, etc. can also lead to actions on the wrong piece of equipment.
‐ Unauthorized human access: deliberate sabotage or accidental damage should also be
considered when developing access procedures for control rooms and cabinet security.
‐ Housekeeping: beware of foreign objects, waste, and swarf in the cabinets, or around, or above,
etc. Some could tolerate different environmental conditions than the real
equipment. Some materials could be ignited by heating. Some could obstruct ventilation. Some
could be conductive and provoke short circuits. Also make sure that doors can open correctly and
keep the working spaces around the cabinets tidy.
If the frequency of failures in our plant is clearly higher than the theoretical failure rate, it is worth
investigating the topics listed above further.
Do you suffer too many DCS failures? A shortcut assessment
A standard industrial PLC is considered to have a failure rate of 5 failures per 10^6 hours (as per the table
included above). This number is given for a complete system composed of 1 CPU + 2 I/O cards.
This gives 5/3 failures per 10^6 hours for each card, which makes 0.0146 failures/year (at 8760 hours/year)
for each hardware module. If your complete system is composed of 100 modules you should then expect
about 1.5 failures/year.
Too rough? It is. But you should consider the result purely qualitatively, and look at the order of magnitude
only.
If you experience 5 failures/year you are still in the same order of magnitude.
If you experience e.g. 1 failure/month you should then look for some external failure cause.
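The shortcut arithmetic above is easy to repeat for a different module count. The sketch below reads the table figure as 5 failures per million hours for a 3-module reference system, which is what the 0.0146 failures/year result implies. The function names and the "same decade" rule used to flag a plant for investigation are illustrative assumptions, one possible reading of "same order of magnitude", not a rule from the article or from OLF‐070.

```python
import math

HOURS_PER_YEAR = 8760


def expected_failures_per_year(system_failures_per_1e6_h, reference_modules, installed_modules):
    """Scale a reference-system failure rate to the installed number of modules."""
    per_module_per_hour = system_failures_per_1e6_h / reference_modules / 1e6
    return per_module_per_hour * HOURS_PER_YEAR * installed_modules


def needs_investigation(observed_per_year, expected_per_year):
    """Flag the plant when the observed rate sits in a higher decade than expected.

    Assumption: 'order of magnitude' is taken as floor(log10(rate)).
    Both rates must be greater than zero.
    """
    observed_decade = math.floor(math.log10(observed_per_year))
    expected_decade = math.floor(math.log10(expected_per_year))
    return observed_decade > expected_decade


if __name__ == "__main__":
    expected = expected_failures_per_year(5, 3, 100)
    print(f"expected: {expected:.2f} failures/year")
    for observed in (5, 12):
        print(f"observed {observed}/year -> investigate: {needs_investigation(observed, expected)}")
```

The result is only as good as the reference figure, so it should be used the way the article uses it: as a qualitative screen, not as a prediction.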
Planning preventive PLC / DCS maintenance
For optimal planning of PLC and DCS preventive maintenance we should also balance the value of any
check and inspection against the risk associated with it. In fact some tests can require temporarily forcing the
system into a degraded state, and this may not be acceptable: e.g. we do not expect you to test the CPU
backup switchover while the plant is in operation.
We could split our preventive maintenance activities like this, assuming we already completed FAT, SAT and
commissioning:
‐ Periodic maintenance during plant operation: it will include all checks that can be safely done
during plant operation, and all checks which MUST be done only during operation.
‐ Periodic maintenance during plant shutdown
‐ Periodic proof test: it is normally required for SIS protective equipment, with a specific maximum
interval (T1) between tests to maintain the desired reliability target. It can be perfect (test of 100%
of functionality) or non‐perfect (test of less than 100% of functionality). In both cases it is a key
parameter in the average PFD estimation during the system lifecycle. The proof test procedure
should clearly define the associated plant conditions: in operation or shutdown.
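The effect of a non‐perfect proof test on the average PFD can be illustrated with a commonly used simplified approximation: the share of dangerous undetected failures revealed by the test is averaged over the interval T1, and the share never revealed is averaged over the whole lifetime. This is a sketch under stated assumptions (single channel, constant failure rate, small lambda times T). The parameter names, the coverage value and the 15 year lifetime are hypothetical and do not come from the article.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_imperfect(lambda_du, t1_hours, coverage, lifetime_hours):
    """Approximate average PFD with a proof test that reveals only part of the failures.

    coverage = 1.0 is a perfect proof test; coverage < 1.0 is a non-perfect one.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    tested_part = coverage * lambda_du * t1_hours / 2
    untested_part = (1.0 - coverage) * lambda_du * lifetime_hours / 2
    return tested_part + untested_part


if __name__ == "__main__":
    lambda_du = 1e-6               # hypothetical failure rate, failures per hour
    t1 = 1 * HOURS_PER_YEAR        # yearly proof test
    lifetime = 15 * HOURS_PER_YEAR # hypothetical lifetime
    for coverage in (1.0, 0.9, 0.7):
        avg = pfd_avg_imperfect(lambda_du, t1, coverage, lifetime)
        print(f"coverage {coverage:.0%}: PFDavg = {avg:.2e}")
```

Even a modest loss of test coverage can dominate the result over a long lifetime, so the coverage assumed in the SIL verification should match what the proof test procedure really exercises.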
Here is a suggested preventive maintenance plan for a BPCS (basic process control system) that does not
require SIL certification:
Note (*): actions which are not recommended as preventive actions during plant operation may still be
required in case of breakdown maintenance.
Here is the equivalent preventive maintenance plan for a SIL‐rated PLC protection system (HIPPS,
ESD, etc.):
Note (*): actions which are not recommended as preventive actions during plant operation may still be
required in case of breakdown maintenance.
Note (**): detailed procedures can refer to (or even fully match) the Factory Acceptance / Site Acceptance
Test procedures. These should in fact be considered the correct procedures to test functionality at 100%.
A proof test can also deliberately shorten some test procedures and target less than 100% (a so‐called
non‐perfect proof test).
Picture 1: US Navy Construction Electrician checks dirty and corroded wires and circuit breakers from a
generator damaged by water during the tsunami that hit Indonesia on Dec. 26, 2004 (Public Domain photo
by United States Navy, ID 050125‐N‐9712C‐001).
Picture 2: A clean cabinet filter (left) against a dirty cabinet filter (right). (photo Ecisgroup SpA)