
FUNCTIONAL SAFETY CERTIFICATION COURSE

Functional Safety for
Safety Instrumented System Professionals

White paper

Within the TÜV Functional Safety Program:


White Paper content

1. Functional Safety: A practical Approach for... / HIMA - FSCS
2. Certification, Proven in Use and Data / Inside Functional Safety
3. Achieving Plant Safety & Availability through... / Risknowlogy
4. Safety availability versus process availability / Risknowlogy
5. Comparison of PFD calculation / HIMA FSCS
6. Diagnostic test versus proof test / Risknowlogy
7. Modern 2oo4-processing architecture for safety systems / HIMA FSCS
8. The effect of diagnostics and periodic proof testing... / Risknowlogy
9. Certified level sensor for LNG industry / Risknowlogy
10. Achieving TÜV’s Maintenance Override Procedure / www.tuv-fs.com

Copyright © 2004 - 2012 HIMA. All rights reserved



White paper
01


White Paper
Functional Safety: A Practical Approach for
End-Users and System Integrators

Date: 01 June 2006


Author(s): Tino Vande Capelle
Dr. M.J.M. Houtermans

HIMA Paul Hildebrandt GmbH + Co KG
Albert-Bassermann-Strasse 28
68782 Bruehl
Germany
www.hima.com

HIMA Functional Safety Consulting Services



© 2006 HIMA

All Rights Reserved

LIMITATION OF LIABILITY - This report was prepared using best efforts. HIMA does not accept any responsibility for
omissions or inaccuracies in this report caused by the fact that certain information or documentation was not made available to
us. Any liability in relation to this report is limited to the indemnity as outlined in our Terms and Conditions. A copy is available
at all times from our website at www.hima.com.

Printed in Germany

This document is the property of, and is proprietary to, HIMA. It is not to be disclosed in whole or in part and no portion of this
document shall be duplicated in any manner for any purpose without HIMA’s express written authorization.

HIMA and the HIMA logo are registered service marks.


FUNCTIONAL SAFETY: A PRACTICAL APPROACH
FOR END-USERS AND SYSTEM INTEGRATORS

TINO VANDE CAPELLE (1), Dr. MICHEL HOUTERMANS (2, 3)

1 - HIMA Paul Hildebrandt GmbH + Co KG, Brühl, GERMANY
2 - Risknowlogy, Brunssum, THE NETHERLANDS
3 - TÜV Rheinland Group, Köln, GERMANY
tinovdc@hima.com, m.j.m.houtermans@risknowlogy.com, michel.houtermans@de.tuv.com
http://www.hima.com, http://www.risknowlogy.com

Abstract: - The objective of this paper is to demonstrate through a practical example how an
end-user should deal with functional safety while designing a safety instrumented function
and implementing it in a safety instrumented system. The paper starts by explaining the
problems that exist inherently in safety systems. It then takes the reader from the verbal
description of a safety function through the design of the architecture, the process for the
selection of safety components, and the role of reliability analysis. After reading this paper
the end-user should understand the practical process for implementing the design of safety
instrumented systems without going into the details of the requirements of the standards.

Key-Words: - Functional Safety – Hazard – Risk – Safety Instrumented Systems – Safety
Integrity Level – Reliability – PFD – PFH.

1. Introduction

Every day end-users around the world struggle with the design, implementation,
operation, maintenance and repair of safety instrumented systems. In practice the required
functionality and safety integrity of a safety instrumented system are determined
through a process hazard analysis. This task is typically performed by the end-users, as they
are the experts on their own production processes and understand the hazards of those
processes best. One result of such a process hazard analysis is a verbal
description of each safety function needed to protect the process. These safety
functions are allocated to one or more safety systems, which can be based on different kinds of
technology. Safety systems based on electronic or programmable electronic
technology need as a minimum to comply with the functional safety standards IEC 61508
and/or IEC 61511.

The end-user is typically not involved in the actual design of the safety system. Normally this
is outsourced to a system integrator. The system integrator determines the design of the
safety system architecture and selects the safety components based on the specification of
the end-user. No matter who designs the safety system according to the safety function
requirements, in the end it is the end-user who is responsible for the safety integrity of the
safety system. This means that the end-user needs to be assured that the chosen safety
architecture and the selected safety components meet the requirements of the applicable
standards, and must be able to defend these decisions to any third party performing the
functional safety assessment.


In reality end-users and system integrators are not experts in the hardware and software design
of programmable electronic systems. They know how to run and automate chemical or oil &
gas plants, but most likely they are not experts on how the operating system of a logic
solver works or whether the communication ASIC of a transmitter is capable of fault-free
safe communication. Even if they were experts, the suppliers of the safety components
would not give them enough information about the internals of the devices to
assure themselves of the safety integrity of these devices. Yet they are responsible for the
overall design and thus they need to assure themselves that functional safety is achieved.
But how can they deal with that in practice?

The objective of this paper is to demonstrate through a practical example how an end-user
and/or system integrator should deal with functional safety while designing a safety
instrumented function and implementing it in a safety instrumented system. The paper starts
by explaining the problems that exist inherently in safety systems. It then takes the reader
from the verbal description of a safety function through
the design of the architecture, the process for the selection of safety components, and the
role of reliability analysis. After reading this paper the end-user should understand the practical
process for implementing the design of safety instrumented systems without going into the details
of the requirements of the standards.

2. Why Safety Systems Fail

The hardware of a safety instrumented system can consist of sensors, logic solvers,
actuators and peripheral devices. With a programmable logic solver there is also application
software that needs to be designed. An end-user in the process industry uses the IEC 61511
standard as the basis for the design and selection of the safety devices. This standard
outlines requirements for the hardware and software and refers to the IEC 61508 standard if
the requirements of IEC 61511 cannot be met. This means that even if the IEC
61511 standard is used as the basis, some of the hardware and software needs to comply with
IEC 61508.

Like any piece of equipment, safety equipment can fail. One of the main objectives of
the IEC 61508 standard is to design a “safe” safety system. A “safe” safety system is a
system designed in such a way that it can either tolerate internal failures and still execute
the safety function or, if it can no longer carry out the safety function, at least notify
an operator via an alarm. If we want to design a safe safety system we should first
understand how safety systems can fail. According to IEC 61508 equipment can fail
because of three types of failures, i.e.,
• Random hardware failures,
• Common cause failures and
• Systematic failures.

2.1 Random Hardware Failures

Random hardware failures are failures that can occur at any given point in time because of
internal degradation mechanisms in the hardware. A typical example is wear-out. Any
rotating or moving equipment will eventually wear out and fail. There are two kinds of
random hardware failures (Rouvroye et al., 1997):
• Permanent
• Dynamic


Permanent random hardware failures exist until they are repaired. Dynamic random hardware
failures, in contrast, only appear under certain conditions (for example
when the temperature is above 80 °C). When the condition is removed the failure disappears
again. It is very difficult to test hardware for dynamic random hardware failures.

The IEC 61508 standard addresses random failures in two ways. First of all IEC 61508
requires a designer to implement measures to control failures. The appendix of IEC 61508
part 2 contains tables (Tables A16-A18) that list, per SIL level, the measures that need to
be implemented in order to control failures that might occur in hardware.

Secondly, IEC 61508 requires a qualitative and quantitative failure analysis on the
hardware. Via a failure mode and effect analysis the failure behaviour of the equipment
needs to be analysed and documented. For the complete safety function it is necessary to
carry out a probabilistic reliability calculation to determine the average probability of failure
on demand of the safety function.

2.2 Common Cause Failures

A common cause failure is defined as a failure that is the result of one or more events
causing coincident failures of two or more separate channels in a multiple channel system,
leading to total system failure. Thus a common cause failure can only occur if the safety
function is carried out by redundant hardware (dual, triple, quadruple, etc. redundancy).

Common cause failures are always related to environmental issues like temperature,
humidity, vibration, EMC, etc. If the cause is not related to environmental circumstances
then it is not a common cause. Typical examples of a common cause are the failure of a
redundant system due to flooding with water or due to an EMC field. A common cause failure is
only related to hardware, and not to software. A software failure is a systematic failure, which
is addressed in section 2.3.

The IEC 61508 standard has two ways to address common cause failures. First of all there
is one measure defined to control these failures, i.e., diversity. Diversity means that we still
carry out the safety function in a redundant manner but use different hardware, a
different design principle or even a completely different technology to carry out the same
safety function. For example, if we use a purely mechanical device and a programmable
electronic device to carry out the safety function, then a common cause failure of the safety
function due to an EMC field will never occur. The programmable electronic device might fail
due to EMC but the purely mechanical device will not.

In practice a real common cause failure is difficult to find because the failures of a
multi-channel system must, by definition of a common cause, occur at exactly the same time.
Identical hardware will always differ slightly in strength and thus fail at slightly different
times. A well designed safety system can take advantage of this gap in time and detect one
failure before the other failure occurs.

2.3 Systematic Failures

The most important failures to manage in a safety system are the systematic failures. A
systematic failure is defined as a failure related in a deterministic way to a certain cause,
which can only be eliminated by a modification of the design or of the manufacturing
process, operational procedures, documentation or other relevant factors. A systematic
failure can exist in hardware and software.

Systematic failures are the hardest failures to eliminate in a safety system. One can only
eliminate systematic failures if they are found during testing, either testing that takes place
during the development and design of the safety system or testing that takes place when the
system is in the field (the so-called proof test). The problem is that a systematic failure can
only be found if a specific test is carried out to find that failure. If we do not test for it
we do not find it.

The IEC 61508 standard addresses systematic failures in only one way. The standard
defines measures to avoid failures for hardware as well as software. These measures are
presented in the appendix of parts 2 and 3 of IEC 61508 (respectively tables B1-B5 and
tables A1-B9) and depend on the required safety integrity. The standard does not take
systematic failures into account in the failure analysis. The philosophy behind this is simple.
If all the required measures to avoid failures are implemented and correctly carried out then
there are no systematic failures (or at least their number is negligible for the desired safety
integrity) and thus their contribution to the probability of failure is (close to) zero.

2.4 End-user Responsibility

Although the end-user has no control over the actual design and internal testing of safety
equipment, ultimately they are still responsible when accidents occur due to any of the three
types of failures mentioned above. They need to assure themselves that the safety
equipment selected by themselves or their system integrators is compliant with either the
IEC 61508 or the IEC 61511 standard. In practice, though, neither end-users nor system integrators
have the knowledge to understand what is going on inside safety equipment. They will
have to rely on third party assessments of this equipment to assure themselves that the
equipment is suitable for their safety application. More on this topic is presented in
section 4.

3. From Hazard and Risk Analysis to Specification to Design

A safety requirement specification of a safety system must at all times be based on the
hazard and risk analysis. A good hazard and risk analysis includes the following steps:
• Hazard identification
• Hazard analysis (consequences)
• Risk analysis
• Risk management
  o Tolerable risk
  o Risk reduction through existing protection layers
  o Risk reduction through additional safety layers

Many techniques exist to support hazard identification and analysis. There is not one
ultimate technique that can do it all. A serious hazard and risk study is based on the use
of several techniques and methods. Typical hazard identification techniques include:
• Checklists
• What if study
• Failure mode and effect analysis (FMEA)
• Hazard and operability analysis (HAZOP)
• Dynamic flowgraph methodology (DFM)


Hazard analysis techniques include:
• Event tree analysis (ETA)
• Fault tree analysis (FTA)
• Cause consequence analysis

Risk reduction techniques include:
• Event tree analysis (ETA)
• Layer of protection analysis (LOPA, a variation on ETA)

Beyond the techniques listed above, more exist that can be used to carry out the hazard
and risk analysis. It is important to select the right technique for the right kind of analysis and
not to limit oneself to one technique.

3.1 Safety Requirement Specification

The hazard and risk analysis should, among other things, document in a repeatable and traceable
way those hazards and hazardous events that require protection via an additional safety
function. The results from the hazard and risk analysis are used to create the safety
requirement specification of each safety function needed to protect the process. The
specification as a minimum defines the following five elements for each safety function:
• Sensing
• Logic solving
• Actuating
• Safety integrity in terms of reliability
• Timing

Each safety function description should as a minimum consist of these five elements. The
sensing element of the specification describes what needs to be sensed (e.g., temperature,
pressure, speed, etc.). The logic solving element describes what needs to be done with the
sensed value when it meets certain conditions (e.g., if the temperature goes over 65 °C
then actuate the shutdown procedure). The actuating element explains what actually needs
to be done when the conditions of the logic solving element are met (e.g., open the
drain valve).

So far we have described the functionality of the safety function. But the functionality is not
complete if we do not know with how much safety integrity it needs to be carried out. The
safety integrity determines how reliable the safety function needs to be. The functional
safety standards have technical and non-technical safety integrity requirements that are
based on the so-called safety integrity level. There are four safety integrity levels (1 through
4) where 1 is the lowest integrity level and 4 the highest. In other words it is much more
difficult to build a SIL 4 safety function than it is to build a SIL 1 function. The SIL level
determines not only the measures to avoid and to control failures that need to be
implemented but also the required probability of failure on demand (PFD). The higher the
SIL level, the lower the required probability of failure on demand of the safety function.
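
As a rough illustration of the link between SIL level and PFD, the sketch below maps an average PFD to a SIL level using the low demand mode bands of IEC 61508 (SIL 1: 10^-2 to 10^-1, down to SIL 4: 10^-5 to 10^-4). The helper pfd_avg_1oo1 uses the widely used simplified single-channel approximation PFDavg = λDU × TI / 2. That approximation, the function names and the example failure rate are assumptions made for illustration and are not taken from this paper.

```python
from typing import Optional

# Low demand mode PFDavg bands per SIL (lower bound inclusive, upper bound exclusive).
SIL_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}


def sil_from_pfd(pfd_avg: float) -> Optional[int]:
    """Return the SIL whose band contains pfd_avg, or None if it lies outside SIL 1 to 4."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd_avg < high:
            return sil
    return None


def pfd_avg_1oo1(lambda_du_per_hour: float, proof_test_interval_hours: float) -> float:
    """Simplified single-channel approximation: PFDavg = lambda_DU * TI / 2 (assumption)."""
    return lambda_du_per_hour * proof_test_interval_hours / 2.0


# Hypothetical example: 2e-7 dangerous undetected failures per hour, yearly proof test.
example_pfd = pfd_avg_1oo1(2e-7, 8760.0)
example_sil = sil_from_pfd(example_pfd)
```

A PFD inside a SIL band is only one of the requirements; the measures to avoid and control failures mentioned above still apply independently of this number.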

The last element to be described is how fast the safety function should be carried out. This
is also a critical element as it depends on the so-called process safety time. This is the time
the process needs to develop a potential hazard into a real incident. For example, mixing
two chemicals at 30 °C is not a problem at all. Mixing the same two chemicals at 50 °C can
lead to a runaway reaction and result in an explosion. The process safety time is the time
the reaction needs to develop into an explosion.

It is common practice in the safety industry to define the time element of the safety function
as half of the process safety time. If the chemical reaction takes 2 hours to develop then we
have 1 hour to carry out our safety function. On the other hand if the reaction takes 10
seconds we have only 5 seconds to carry out the safety function. It is of utmost importance
to know this time for two reasons. First of all we need to build a safety function that can
actually be carried out in this time. Each device used to carry out the safety function takes a
piece of the total available time slot. If, for example, we use valves that need to be closed we
need to make sure that these valves can close fast enough. The second reason is that we
need to know whether the built-in diagnostics can diagnose a failure in less than half of the
process safety time. Before we need to actuate the safety function we should be able to
know whether the safety system has failed. This puts extra constraints on the internal
design of the safety devices when it comes to implementing fast enough diagnostics.
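
The timing rule can be checked with a few lines of code. The sketch below takes half of the process safety time as the budget and tests whether the summed response times of the devices fit into it. The device names and response times are hypothetical values chosen only for illustration.

```python
from typing import Dict


def response_time_budget(process_safety_time_s: float) -> float:
    """Time allowed for the safety function: half of the process safety time."""
    return process_safety_time_s / 2.0


def fits_budget(device_times_s: Dict[str, float], process_safety_time_s: float) -> bool:
    """True if the summed device response times fit within the budget."""
    return sum(device_times_s.values()) <= response_time_budget(process_safety_time_s)


# Hypothetical response times in seconds for the devices of one safety function.
loop = {"sensor": 0.5, "logic_solver": 0.2, "drain_valve": 3.0}

ok_10s = fits_budget(loop, 10.0)  # budget of 5 s
ok_6s = fits_budget(loop, 6.0)    # budget of 3 s
```

With these assumed numbers the loop fits a process safety time of 10 seconds but not one of 6 seconds, in which case a faster valve would be needed.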

The following is a bad example of a specified safety function:

“The main safety function of the HIPPS is to protect the separation vessels against
overpressure and to protect the low pressure equipment against high pressure.”

No system integrator can build the hardware and software from this definition.
The only clear aspect is the sensing element: somewhere the pressure needs to be
measured. After that the system integrator is lost. The logic, actuating, safety integrity
and timing elements are not covered by this specification. A specification like this will cost
every party involved in the project more time than necessary and will lead to a lot of
unnecessary discussion. A much better example of a safety function specification is the
following:

“Measure the pressure on two locations in vessel XYZ and if the pressure exceeds the high-
high pressure limit open the drain valve within 3 seconds. Perform the function with a safety
integrity of SIL 3.”

This specification gives much more complete information. The system integrator knows
exactly what the function should do and can now design the function according to the rules
of SIL 3 and select components and write application software that can perform this
function.
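
One way to make the five elements checkable is to capture each safety function in a small record and list what is missing. The sketch below does this for the two example specifications. The class name, field names and the wording of the field values are illustrative assumptions, not a format prescribed by the standards.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SafetyFunctionSpec:
    """The five minimum elements of a safety function specification."""
    sensing: Optional[str] = None
    logic_solving: Optional[str] = None
    actuating: Optional[str] = None
    safety_integrity: Optional[str] = None
    timing: Optional[str] = None

    def missing_elements(self) -> List[str]:
        """Names of the elements that have not been specified."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# The HIPPS description only implies that pressure must be measured somewhere.
bad = SafetyFunctionSpec(sensing="pressure (location not stated)")

# The improved description covers all five elements.
good = SafetyFunctionSpec(
    sensing="pressure at two locations in vessel XYZ",
    logic_solving="pressure exceeds the high-high pressure limit",
    actuating="open the drain valve",
    safety_integrity="SIL 3",
    timing="within 3 seconds",
)
```

Calling bad.missing_elements() lists the four elements the system integrator would have to ask about, while good.missing_elements() returns an empty list.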

For each safety function the end-user should provide the system integrator with a clear
definition containing as a minimum the five elements specified before. There are many other
requirements that the end-user can put into the specification, for example the environmental
conditions that the safety system should be able to handle (temperature ranges, humidity
levels, vibration levels, EMC levels, etc.), restart procedures, periodic test intervals, and
more.

A good system integrator will take the safety requirements specification of the end-user and
translate that into a requirement specification that is usable for the system integrator. The
specification created by the system integrator should be verified and approved by the end-
user. This is an excellent step to be performed as it assures that both parties can see that
they understand each other and that they interpreted the system to be designed correctly.


Needless to say this costs more time during specification, but that time is saved during actual
design and testing and in the modifications that are often required afterwards.

3.2 Architectural Design Safety Function

When the safety requirements specification is clear and agreed upon the system integrator
can start with the architectural design of the safety function and system. Figure 1 shows how
a safety function definition can be implemented in hardware. The safety function is divided
into three subsystems, i.e., sensing, logic solving, and actuating. The designer of the safety
function can decide how to divide the safety function into subsystems and to what level of
detail. In practice subsystems are determined by redundancy aspects or by whether the
component can still be repaired by the end-user.

[Figure 1: block diagram. Safety function: "Measure the temperature in the reactor and if the
temperature exceeds 65 °C then open the drain valve and stop the supply pumps to the reactor.
This function needs to be carried out within 3 seconds and with safety integrity SIL 3."
Sensing: T1 with TM 1 and T2 with TM 2. Logic Solving: inputs I1 to I8, common circuitry, CPU,
common circuitry, outputs O1 to O8. Actuating: R1 - Pump A, R2 - Pump B, SOV Drain V.]

Fig. 1. From specification to hardware design of the safety instrumented system

The IEC 61508 and IEC 61511 standards set limitations on the architecture of the
hardware. The concepts of the architectural constraints are the same for both standards
although the IEC 61508 standard requires some more detail. The architectural constraints of
the IEC 61508 standard are shown in Tables 1 and 2 and are based on the following aspects
per subsystem:

• SIL level of the safety function
• Type A or B
• Hardware fault tolerance
• Safe failure fraction


Table 1 Architectural Constraints Type A

Type A Subsystem

Safe Failure        Hardware Fault Tolerance (HFT)
Fraction (SFF)      0        1        2
< 60 %              SIL 1    SIL 2    SIL 3
60 % - < 90 %       SIL 2    SIL 3    SIL 4
90 % - < 99 %       SIL 3    SIL 4    SIL 4
≥ 99 %              SIL 3    SIL 4    SIL 4

Table 2 Architectural Constraints Type B

Type B Subsystem

Safe Failure        Hardware Fault Tolerance (HFT)
Fraction (SFF)      0        1        2
< 60 %              N.A.     SIL 1    SIL 2
60 % - < 90 %       SIL 1    SIL 2    SIL 3
90 % - < 99 %       SIL 2    SIL 3    SIL 4
≥ 99 %              SIL 3    SIL 4    SIL 4

The “type” designation of a subsystem refers to the internal complexity of the subsystem. A
type A subsystem has a defined failure behaviour and the effect of every failure mode of the
subsystem is clearly defined and well understood. Typical type A components are valves
and actuators. A subsystem is of type B if the effect of even one failure mode cannot be
understood. In practice any subsystem with an integrated circuit (IC) is by definition a type
B. Typical type B systems are programmable devices like logic solvers, smart transmitters, or
valve positioners.

The hardware fault tolerance (HFT) is the number of faults that can be tolerated
before the safety function is lost. It is thus a measure of redundancy. When determining the
hardware fault tolerance one should also take into account the voting aspects of the
subsystem. Both a 1oo3 and a 2oo3 subsystem carry out the safety function 3 times (triple
redundant), but because of the voting aspect the HFT of the 1oo3 subsystem equals 2 and
the HFT of the 2oo3 subsystem equals 1. A complete overview of the most common
architectures is given in Table 3.

Table 3 Redundancy versus HFT

Architecture / Voting    Redundancy       HFT
1oo1                     No redundancy    0
1oo2                     Dual             1
2oo2                     No redundancy    0
1oo3                     Triple           2
2oo3                     Triple           1
2oo4                     Quadruple        2
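
Every row of Table 3 follows the pattern HFT = N - M for an MooN architecture. The sketch below applies that relation to the voting strings of the table. The relation is read off the table rather than quoted from the standard, and the function name is a hypothetical choice.

```python
import re


def hardware_fault_tolerance(voting: str) -> int:
    """Return the HFT of an MooN architecture given as a string such as '2oo3'."""
    match = re.fullmatch(r"(\d+)oo(\d+)", voting)
    if match is None:
        raise ValueError(f"not an MooN voting string: {voting!r}")
    m, n = int(match.group(1)), int(match.group(2))
    if not 1 <= m <= n:
        raise ValueError("MooN requires 1 <= M <= N")
    # N - M channels may fail dangerously while M channels can still trip.
    return n - m


table_3 = {
    voting: hardware_fault_tolerance(voting)
    for voting in ["1oo1", "1oo2", "2oo2", "1oo3", "2oo3", "2oo4"]
}
```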


Another important factor is the safe failure fraction (SFF). This is basically a measure of the
fail-safe design and built-in diagnostics of the subsystem. A subsystem can fail safe or
dangerous. Safe failures are those failures that cause the subsystem to carry out the safety
function without a demand. For example, the safety function of an emergency shutdown
valve is to close upon demand. We call it a safe failure if the valve closes because of an
internal failure without a demand. A dangerous failure is the opposite. The valve has failed
dangerous if it cannot close upon demand because of an internal failure. Some components
also have internal diagnostics (diagnostics should not be confused with proof testing). If that
is the case it is possible to detect failures and act upon the detection. Smart sensors and
logic solvers typically have built-in diagnostics. Taking this into account a subsystem
can basically have four different kinds of failures:

• Safe detected (SD)
• Safe undetected (SU)
• Dangerous detected (DD)
• Dangerous undetected (DU)

If we know the failure rates for each subsystem in terms of these four failure categories then
we can calculate the SFF as follows:

$$\mathrm{SFF} = \frac{\lambda_{SD} + \lambda_{SU} + \lambda_{DD}}{\lambda_{SD} + \lambda_{SU} + \lambda_{DD} + \lambda_{DU}}$$

From the above formula you can see that the SFF is determined by the share of
dangerous undetected failures in the total failure rate. In other words if we make a fail-safe
design (lots of safe failures) and we diagnose a lot of dangerous failures (DD) then we will
have few dangerous undetected failures and thus a high SFF.
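
The SFF formula is easy to evaluate once the four failure rates are known. The sketch below computes it for two hypothetical sets of failure rates (the numbers and the FIT unit are assumptions for illustration) to show how diagnostics move dangerous failures from DU to DD and raise the SFF.

```python
def safe_failure_fraction(sd: float, su: float, dd: float, du: float) -> float:
    """SFF = (SD + SU + DD) / (SD + SU + DD + DU); all rates in the same unit."""
    total = sd + su + dd + du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (sd + su + dd) / total


# Hypothetical failure rates in FIT (failures per 1e9 hours).
# Same device with and without diagnostics: 50 dangerous failures in total.
sff_with_diagnostics = safe_failure_fraction(sd=40, su=10, dd=45, du=5)
sff_without_diagnostics = safe_failure_fraction(sd=40, su=10, dd=0, du=50)
```

With these assumed rates the SFF rises from 50 % without diagnostics to 95 % with diagnostics, although the total failure rate is unchanged.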

It is important to understand these concepts in order to be able to interpret Tables 1 and 2. A
system integrator receives from an end-user only the safety function definition with a SIL
level attached to it. From the SIL level the system integrator then needs to determine the
Type, HFT, and SFF of the subsystem. For example if the system integrator needs to
measure the temperature with a subsystem of SIL 3 then there are among others the
following options (see Tables 1 and 2):

• 1 type A sensor (HFT 0) with a SFF > 90%
• Type A sensors voted 1oo2 or 2oo3 (HFT 1) with a SFF of 60-90%
• 3 type A sensors, 1oo3 (HFT 2), with no diagnostics
• 1 type B sensor (HFT 0) with a SFF > 99%
• Type B sensors voted 1oo2 or 2oo3 (HFT 1) with a SFF of 90-99%
• 3 type B sensors, 1oo3 (HFT 2), with a SFF of 60-90%

In other words the system integrator has a lot of design options to choose from. The actual
design depends on many things. For example, what kinds of sensors are available on the
market? Which type are they and which SFF do they achieve? Does the end-user have a
preferred vendor list to choose from? And so on.
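
The lookup in Tables 1 and 2 can be sketched as a small function that returns the highest SIL a subsystem may claim for a given type, SFF and HFT. The names are hypothetical, the top band is treated as an SFF of 99 % and above, and reading an HFT above 2 from the HFT 2 column is an assumption of this sketch.

```python
from typing import Optional

# Maximum SIL per SFF band (rows) and HFT 0, 1, 2 (columns), as in Tables 1 and 2.
# None marks the "N.A." cell of Table 2.
MAX_SIL = {
    "A": [(1, 2, 3), (2, 3, 4), (3, 4, 4), (3, 4, 4)],
    "B": [(None, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 4)],
}


def sff_band(sff: float) -> int:
    """Row index: below 60 %, 60 % to below 90 %, 90 % to below 99 %, 99 % and above."""
    if sff < 0.60:
        return 0
    if sff < 0.90:
        return 1
    if sff < 0.99:
        return 2
    return 3


def max_sil(subsystem_type: str, sff: float, hft: int) -> Optional[int]:
    """Highest SIL the subsystem may claim (HFT above 2 uses the HFT 2 column)."""
    return MAX_SIL[subsystem_type][sff_band(sff)][min(hft, 2)]


# The six SIL 3 options listed above as (type, example SFF, HFT).
options = [
    ("A", 0.95, 0), ("A", 0.75, 1), ("A", 0.0, 2),
    ("B", 0.995, 0), ("B", 0.95, 1), ("B", 0.75, 2),
]
all_reach_sil3 = all(max_sil(t, sff, hft) >= 3 for t, sff, hft in options)
```

Each of the six options reaches at least SIL 3 in this lookup, which is why the system integrator can choose between them on other grounds such as availability and vendor preference.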


The IEC 61511 standard also defines architectural constraints. The principle is
similar to the above, only IEC 61511 is less complicated. IEC 61511 does not
differentiate between type A and B components but only between programmable electronic
logic solvers and all equipment except programmable electronic logic solvers. Smart
sensors with dual processors and software inside are apparently not considered complex
devices in terms of IEC 61511. The architectural constraints of IEC 61511 are shown in
Tables 4 and 5.

Table 4 Architectural Constraints PE Logic Solver

       Minimum hardware fault tolerance (HFT)
SIL    SFF < 60%    SFF 60% to 90%    SFF > 90%
1      1            0                 0
2      2            1                 0
3      3            2                 1
4      Special requirements apply, see IEC 61508

Table 5 Architectural Constraints All Equipment Except PE Logic Solvers

SIL    Minimum hardware fault tolerance
1      0
2      1
3      2
4      Special requirements apply, see IEC 61508

For all equipment except PE logic solvers it is possible to decrease the hardware fault
tolerance by 1 if the following conditions are met:

• The hardware is proven in use
• Only process related parameters can be adjusted
• Adjustment of process parameters is protected
• The SIL level of the safety function is less than 4

In reality every single product supplier will try to prove to an end-user that their equipment
meets the above conditions, but in practice it is hard to find a product that truly fulfils these
conditions. Especially the proven in use condition is hard to meet, or at least hard to prove.

On the other hand the HFT of a product needs to be increased by one if the dominant failure
mode of the product is not to the safe state and dangerous failures are not detected.
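
Tables 4 and 5 and the two adjustment rules can be combined into a short lookup. The sketch below is one possible reading: the function and parameter names are hypothetical, the flag proven_in_use_conditions_met stands for all four conditions listed above being fulfilled, and applying both adjustments to the same device and never going below zero are assumptions of this sketch.

```python
# SIL -> minimum HFT for (SFF < 60 %, SFF 60 % to 90 %, SFF > 90 %), as in Table 4.
PE_LOGIC_SOLVER_HFT = {1: (1, 0, 0), 2: (2, 1, 0), 3: (3, 2, 1)}

# SIL -> minimum HFT for all equipment except PE logic solvers, as in Table 5.
OTHER_EQUIPMENT_HFT = {1: 0, 2: 1, 3: 2}


def min_hft_pe_logic_solver(sil: int, sff: float) -> int:
    """Minimum HFT of a PE logic solver for a given SIL and SFF (fraction 0 to 1)."""
    if sil not in PE_LOGIC_SOLVER_HFT:
        raise ValueError("special requirements apply, see IEC 61508")
    column = 0 if sff < 0.60 else (1 if sff <= 0.90 else 2)
    return PE_LOGIC_SOLVER_HFT[sil][column]


def min_hft_other_equipment(
    sil: int,
    proven_in_use_conditions_met: bool = False,
    dominant_failure_unsafe_and_undetected: bool = False,
) -> int:
    """Minimum HFT for sensors, final elements and non-PE logic solvers."""
    if sil not in OTHER_EQUIPMENT_HFT:
        raise ValueError("special requirements apply, see IEC 61508")
    hft = OTHER_EQUIPMENT_HFT[sil]
    if proven_in_use_conditions_met:
        hft -= 1  # reduction allowed when all listed conditions are met
    if dominant_failure_unsafe_and_undetected:
        hft += 1  # increase required for unsafe, undetected dominant failure mode
    return max(hft, 0)
```

For example, min_hft_other_equipment(3) gives 2, and the same call with proven_in_use_conditions_met=True gives 1, which shows why suppliers are keen to claim proven in use.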

4. Selecting Suitable Equipment

The tables of IEC 61511 and IEC 61508 determine the hardware architecture of the safety
function. The starting point is always the SIL level of the safety function and from there the
system integrator has a certain degree of freedom to design a safety system architecture
depending on the hardware fault tolerance and the hardware complexity of the subsystem.


4.1 Selecting Hardware According to IEC 61511

Having two standards to deal with in order to determine the system architecture does not
make it easier for the end-user or system integrator. Many end-users and system integrators
do not realize that even if they work with the IEC 61511 standard, some subsystems of
the safety functions still need to comply with the IEC 61508 standard. Figure 2 gives
guidance. From this figure it becomes clear that we need by definition to follow IEC 61508 if
we want to apply new hardware that has not been developed yet. For any hardware
that meets the IEC 61511 requirements for proven in use or has been assessed according
to the requirements of IEC 61508 we can continue to follow the IEC 61511 requirements,
in particular Tables 4 and 5. IEC 61511 defines proven in use as follows:

“When a documented assessment has shown that there is appropriate evidence, based on
the previous use of the component, that the component is suitable for use in a safety
instrumented system”

Hardware
  New hardware development?                                     -> Follow IEC 61508-2
  Use hardware based on proven in use?                          -> Follow IEC 61511
  Use hardware developed and assessed according to IEC 61508?   -> Follow IEC 61511

Fig. 2. Which Standard to Follow: IEC 61508 or IEC 61511?

Although proven in use is typically something that only an end-user can determine, the
suppliers of safety components will do everything to convince end-users and system
integrators that their products are proven in use. The evidence that needs to be
delivered in order to “prove” proven in use is, however, not so easy to accumulate:

• Manufacturer's quality, management and configuration management systems
• Restricted functionality
• Identification and specification of the components or subsystems
• Performance of the components or subsystems in similar operating profiles and
  physical environments
• The volume of the operating experience
• Statistical evidence that the claimed failure rate is sufficiently low

Especially the last point is very difficult to meet, as failure track records are usually not
available. End-users do not always track failures, and product manufacturers do not have the
capability to track their products once they are sold and delivered.

4.2 Certification and Third Party Reports

End-users do not have the capabilities to verify for every single product that will be used in a
safety function whether it meets the proven in use requirements of IEC 61511 or to assess
them according to IEC 61508. Many end-users therefore make use of certified products or
third party reports. There is a big difference between a product with a certificate and a
product with a third party report.

When a product is certified according to the IEC 61508 standard, this means that every
single requirement of the standard has been verified for this product. It is, for example,
not possible to certify only the hardware of a programmable electronic system. Certification
is all-inclusive, and thus the software also needs to be addressed. A well-certified safety
product not only addresses functional safety according to IEC 61508 but also issues like:

• Electrical safety
• Environmental safety
• EMC/EMI
• User documentation
• Reliability analysis

A certified product always comes with a certificate and a report to the certificate. The report
to the certificate is very important as it explains how the verification or assessment has been
carried out and whether there are any restrictions on the use of the product.

A third party report is often used in industry but is limited in scope. The report itself will
outline what the scope of the analysis is. Many third party reports focus only on the
hardware analysis of the product. In principle this is no problem, as long as the end-user or
system integrator is aware that other aspects of the product, like the software, also need to
be addressed and that another third party report covering the software may need to be
requested.

4.3 Required hardware functional safety information

For each safety device the end-user should assure themselves that the device is either
compliant with IEC 61508 or with IEC 61511. Concerning the hardware the end-user should
as a minimum ask from their suppliers the information listed in Table 6.

Table 6 Hardware Checklist

Item
Applicable standard
Type
Hardware fault tolerance
Safe failure fraction
Safe detected failure rate
Safe undetected failure rate
Dangerous detected failure rate
Dangerous undetected failure rate
SIL level for which the product is fit for use
Recommended periodic proof test interval

With this information the end-user or system integrator can easily determine how to comply
with the architectural constraints tables and build the architecture of their loop as desired.
This information can be delivered by the supplier itself, through a third party report or
through a certification report. It is up to the end-user to decide what is actually required
(read: what can be trusted). The architecture needs to be redesigned until the architectural
constraints requirements are met.
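
The checklist items of Table 6 map naturally onto a small record, from which the safe
failure fraction can be recomputed as a cross-check on the supplier's figure. The sketch
below is a hedged illustration: the class and field names are hypothetical, the numbers
are made up, and the SFF is taken as the share of all failures that are either safe or
dangerous detected, which is the usual IEC 61508 definition rather than something stated
in this paper.

```python
from dataclasses import dataclass

@dataclass
class HardwareData:
    """Hypothetical record holding the Table 6 items for one device."""
    standard: str                 # applicable standard
    device_type: str              # device type as declared by the supplier
    hft: int                      # hardware fault tolerance
    lambda_sd: float              # safe detected failure rate [1/h]
    lambda_su: float              # safe undetected failure rate [1/h]
    lambda_dd: float              # dangerous detected failure rate [1/h]
    lambda_du: float              # dangerous undetected failure rate [1/h]
    claimed_sil: int              # SIL level for which the product is fit for use
    proof_test_interval_h: float  # recommended periodic proof test interval [h]

    def safe_failure_fraction(self) -> float:
        """Share of failures that are safe or dangerous detected."""
        total = self.lambda_sd + self.lambda_su + self.lambda_dd + self.lambda_du
        return (total - self.lambda_du) / total

# Made-up example values, not taken from any real data sheet
transmitter = HardwareData("IEC 61508", "B", 0, 2.0e-7, 1.0e-7, 6.0e-7, 1.0e-7, 2, 8760.0)
print(f"SFF = {transmitter.safe_failure_fraction():.1%}")
```

Keeping the four failure rates separate, rather than storing only the SFF, means the same
record can later feed the PFD calculation.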

5. THE ROLE OF RELIABILITY ANALYSIS

Once the architectural design of the safety loop complies with the architectural
constraints tables, the loop has already met one of the most important requirements of the
standards. Another important requirement is the calculation of the probability of failure on
demand (PFD) or, for continuous mode, the frequency of failure per hour (PFH). The PFD is the
probability that the safety function cannot be carried out upon demand from the process. It
needs to be calculated for those processes where the expected demand is less than once per
year. If a loop is used in continuous mode, it is necessary to calculate the frequency of
failure of this loop per hour instead, because the situation is different: where a demand
loop can only cause a problem when there is an actual demand from the process, a continuous
loop can itself be the cause of a process upset when it fails. Table 7 gives an overview of
the required probabilities and frequencies per SIL level.

Table 7 PFD versus PFH

       Demand Mode                        Continuous Mode
SIL    Probability of failure on demand   Frequency of the failure per hour
4      ≥ 10^-5 to < 10^-4                 ≥ 10^-9 to < 10^-8
3      ≥ 10^-4 to < 10^-3                 ≥ 10^-8 to < 10^-7
2      ≥ 10^-3 to < 10^-2                 ≥ 10^-7 to < 10^-6
1      ≥ 10^-2 to < 10^-1                 ≥ 10^-6 to < 10^-5
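
The bands of Table 7 can be turned into a simple lookup that maps a calculated PFD or PFH
to the SIL band it falls in. This is a minimal sketch: the names `PFD_BANDS`, `PFH_BANDS`
and `sil_from_value` are hypothetical, and returning `None` for values outside the table is
our own choice.

```python
# (SIL, lower bound inclusive, upper bound exclusive), as listed in Table 7
PFD_BANDS = [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3), (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]
PFH_BANDS = [(4, 1e-9, 1e-8), (3, 1e-8, 1e-7), (2, 1e-7, 1e-6), (1, 1e-6, 1e-5)]

def sil_from_value(value, bands):
    """Return the SIL whose band contains the value, or None if it is outside Table 7."""
    for sil, low, high in bands:
        if low <= value < high:
            return sil
    return None

print(sil_from_value(5e-4, PFD_BANDS))  # demand mode example
print(sil_from_value(2e-7, PFH_BANDS))  # continuous mode example
```

Note that the result only reflects the probabilistic requirement; the architectural
constraints still have to be met separately.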

In order to carry out a probability or frequency calculation the following information is
required:
• A reliability model per loop
• The reliability data for all equipment in the loop

5.1 Reliability modeling

A reliability model needs to be created for each loop of the safety system. Different
techniques are available to create reliability models. Well-known techniques include:
• Reliability block diagrams
• Fault tree analysis
• Markov analysis

The reliability block diagram technique is probably one of the simplest methods available. A
block diagram is a graphical representation of the required functionality. The block diagram
of the safety function of Figure 1 is given in Figure 3 below. A reliability block diagram is
always read from left to right, where each block represents a piece of the available success
path(s). As long as a block has not failed it is possible to go from left to right. Depending
on built-in redundancy, alternative paths may exist in the block diagram to go from
left to right. Once the block diagram is created it is possible to use simple probability
theory to calculate the probability of failure.

[Figure 3 blocks: T1, TM 1, I1 | T2, TM 2, I2 | CC, CPU, CC | O1, O2, O3 | R1, R2 | SOV | ESD SV]

Fig. 3. Block diagram safety function
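
The "simple probability theory" behind a reliability block diagram can be sketched in a
few lines of Python. The example works in the success domain, as the diagram does: blocks
in series must all work, while redundant paths only fail together. The function names and
all numbers are hypothetical, the blocks are assumed to fail independently, and the
structure shown (two redundant input paths followed by lumped logic and output blocks) is
a simplification, not an exact copy of Figure 3.

```python
def series_reliability(*reliabilities):
    """All blocks on a series path must work: multiply the block reliabilities."""
    result = 1.0
    for value in reliabilities:
        result *= value
    return result

def parallel_reliability(*reliabilities):
    """Redundant paths: the group works unless every single path has failed."""
    all_failed = 1.0
    for value in reliabilities:
        all_failed *= (1.0 - value)
    return 1.0 - all_failed

# Hypothetical probabilities that each block has NOT failed
path_1 = series_reliability(0.99, 0.995, 0.998)   # e.g. T1, TM 1, I1
path_2 = series_reliability(0.99, 0.995, 0.998)   # e.g. T2, TM 2, I2
inputs = parallel_reliability(path_1, path_2)
loop = series_reliability(inputs, 0.999, 0.98)    # lumped logic and output blocks
print(f"probability of failure of the loop = {1.0 - loop:.4f}")
```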

Another technique is fault tree analysis (FTA). FTA is a technique with a long history in
the aerospace and nuclear industries. Although the technique is more suitable for analysing
complete facilities, it is also used to calculate the probability of failure on demand of
safety loops. An FTA is created with a top event in mind, e.g., the safety function does not
actuate on demand. From this top event an investigation is started to determine the root
causes. Basically, an FTA is a graphical representation of combinations of basic events that
can lead to the top event. A simplified version of the FTA for the safety function in
Figure 1 is given in Figure 4. It is possible to quantify the FTA and calculate the
probability of occurrence of the top event when the probabilities of occurrence of the basic
events are known.

Safety Function Failed
  Input Failed
    Path 1 Failed: T1, Tm1, I1
    Path 2 Failed: T2, Tm2, I2
  Logic Failed: CC, CPU, CC, O1, O2, O3
  Output Failed: SOV, Drain Valve, R1, R2, R3

Fig. 4. Simplified fault tree diagram safety function
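
Quantifying a fault tree amounts to evaluating its gates from the basic events upwards.
The sketch below represents a tree as nested tuples and evaluates it recursively. It is
an illustration under stated assumptions: the gate types are our own guess (Figure 4 as
reproduced here does not show them), the logic and output branches are lumped into single
basic events, all probabilities are made up, and basic events are assumed independent and
not repeated in the tree.

```python
import math

def probability(node, basic):
    """Return the probability of a fault tree node.

    A node is either the name of a basic event or a tuple (gate, children).
    """
    if isinstance(node, str):
        return basic[node]
    gate, children = node
    values = [probability(child, basic) for child in children]
    if gate == "AND":   # all inputs must occur
        return math.prod(values)
    if gate == "OR":    # at least one input occurs
        return 1.0 - math.prod(1.0 - v for v in values)
    raise ValueError(f"unknown gate type: {gate}")

# Assumed structure: the input fails only if both paths fail;
# any failed branch fails the safety function.
top_event = ("OR", [
    ("AND", [("OR", ["T1", "Tm1", "I1"]), ("OR", ["T2", "Tm2", "I2"])]),
    "Logic",
    "Output",
])
basic_events = {name: 0.01 for name in ["T1", "Tm1", "I1", "T2", "Tm2", "I2"]}
basic_events.update({"Logic": 0.001, "Output": 0.02})
p_top = probability(top_event, basic_events)
print(f"probability of the top event = {p_top:.4f}")
```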

Research has indicated that Markov analysis is the most complete and suitable technique
for safety calculations. Markov is a technique that captures transitions between two unique
states. In terms of safety this means the working state and the failed state. Going from one
state to the other can either be caused by a failure of a component or by repair of a
component. Therefore a Markov model is also called a state transition diagram. See Figure
5 for the Markov model of the safety function of Figure 1. Once the Markov model is created
and the rate of transition is known between two states (that is the failure rate or repair rate) it
is possible to solve the Markov model and calculate the probability of being in a state.

[State transition diagram with the states: OK, Path 1 Failed, Path 2 Failed, System Failed]
Fig. 5. Markov model safety function
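
To make "solving the Markov model" concrete, the following Python sketch integrates a
four-state model with the same states as Figure 5 over one year. It is not the authors'
model: the transitions, the failure rate and the repair rate are assumptions chosen for
illustration, and a plain explicit Euler scheme is used instead of a dedicated solver.

```python
def solve_markov(rates, p0, hours, dt=1.0):
    """Integrate dP/dt = P*Q with explicit Euler steps.

    rates[i][j] is the transition rate [1/h] from state i to state j.
    """
    n = len(p0)
    p = list(p0)
    for _ in range(int(hours / dt)):
        flow = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j and rates[i][j] > 0.0:
                    moved = p[i] * rates[i][j] * dt  # probability moving from i to j
                    flow[i] -= moved
                    flow[j] += moved
        p = [p[k] + flow[k] for k in range(n)]
    return p

lam = 1.0e-5    # assumed failure rate of one path [1/h]
mu = 1.0 / 8.0  # assumed repair rate [1/h], i.e. 8 h repair time

# States: 0 = OK, 1 = Path 1 Failed, 2 = Path 2 Failed, 3 = System Failed
rates = [
    [0.0, lam, lam, 0.0],  # OK -> one of the two paths fails
    [mu, 0.0, 0.0, lam],   # path 1 failed -> repaired, or path 2 fails as well
    [mu, 0.0, 0.0, lam],   # path 2 failed -> repaired, or path 1 fails as well
    [mu, 0.0, 0.0, 0.0],   # system failed -> repaired
]
p = solve_markov(rates, [1.0, 0.0, 0.0, 0.0], hours=8760)
print(f"P(System Failed) after one year = {p[3]:.2e}")
```

The time step must be small compared with the fastest transition (here the repair rate),
otherwise the Euler scheme becomes inaccurate.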

The IEC 61508 standard also has standard formulas to calculate the PFD and PFH for each
loop. These formulas are called simplified equations. What many people do not realize
though is that these simplified equations are derived from Markov models and that in
practice it is not so simple to derive them. Another limitation of these equations is that
they only exist for 1oo1, 1oo2, 2oo2, 1oo2D, and 2oo3 architectures. For any other kind of
architecture the standards do not provide equations, and thus one needs to use one of
the above-mentioned techniques. The simplified equations are also not flexible enough to
handle diverse equipment, or different repair times and periodic proof test intervals. Hence
their name: simplified. For a complete list of simplified equations derived from Markov
models see Börcsök (2004). The following two equations are examples of the simplified
equations as they can be found in the standards:

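1oo1 PFD equations:

\[ PFD_{avg} = (\lambda_{du} + \lambda_{dd}) \cdot t_{CE} \]

1oo2 PFD equations:

\[ PFD_{avg} = 2\left((1 - \beta_D)\lambda_{dd} + (1 - \beta)\lambda_{du}\right)^2 t_{CE}\, t_{GE} + \beta_D \lambda_{dd}\, MTTR + \beta \lambda_{du}\left(\frac{T_1}{2} + MTTR\right) \]
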
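
As a numerical illustration, the two simplified equations can be evaluated with a few
lines of Python. The equivalent down times tCE and tGE are not defined in this paper; the
expressions used below are the ones commonly quoted from IEC 61508-6 and should be checked
against the standard before use. Function names and all input values are hypothetical.

```python
def t_ce(l_du, l_dd, t1, mttr):
    """Channel equivalent mean down time (assumed form, after IEC 61508-6)."""
    l_d = l_du + l_dd
    return (l_du / l_d) * (t1 / 2 + mttr) + (l_dd / l_d) * mttr

def t_ge(l_du, l_dd, t1, mttr):
    """Voted group equivalent mean down time for 1oo2 (assumed form, after IEC 61508-6)."""
    l_d = l_du + l_dd
    return (l_du / l_d) * (t1 / 3 + mttr) + (l_dd / l_d) * mttr

def pfd_1oo1(l_du, l_dd, t1, mttr):
    """Average PFD of a single channel."""
    return (l_du + l_dd) * t_ce(l_du, l_dd, t1, mttr)

def pfd_1oo2(l_du, l_dd, t1, mttr, beta, beta_d):
    """Average PFD of a 1oo2 group: independent part plus common cause part."""
    independent = (2 * ((1 - beta_d) * l_dd + (1 - beta) * l_du) ** 2
                   * t_ce(l_du, l_dd, t1, mttr) * t_ge(l_du, l_dd, t1, mttr))
    common_cause = beta_d * l_dd * mttr + beta * l_du * (t1 / 2 + mttr)
    return independent + common_cause

# Made-up data: rates in 1/h, yearly proof test, 8 h repair time
single = pfd_1oo1(1.0e-6, 4.0e-6, 8760, 8)
redundant = pfd_1oo2(1.0e-6, 4.0e-6, 8760, 8, beta=0.1, beta_d=0.05)
print(f"1oo1: {single:.2e}   1oo2: {redundant:.2e}")
```

With these inputs the common cause term dominates the 1oo2 result, which is typical once
the channels themselves are reasonably reliable.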
Rouvroye (1998) has compared different reliability techniques and their usefulness in the
safety industry. Figure 6 gives an overview of these techniques, and the result is that
Markov analysis is the most complete technique. With Markov it is possible to create one
model that takes into account any kind of component, diverse components, and different
repair and test strategies. It is possible to calculate the probability of failure on
demand, the failure rate per hour or the probability that the safety system causes a
spurious trip. No other technique can do this.

Criteria compared (columns): Safety ranking comparison, Availability prediction,
Probability of Unsafe Failure, Effects of test & repair, Trip rate prediction,
Time dependent effects

Techniques compared (rows): Expert analysis, FMEA, FTA, Reliability Block Diagram,
Parts Count Analysis, Markov Analysis

Fig. 6. Comparison reliability techniques for safety analysis (Rouvroye, 1998)

5.2 Reliability data

Every reliability model needs reliability data in order to actually perform the reliability
calculation and quantify the results. There are many sources for reliability data. The
following list is an overview of available data:
• End user maintenance records
• Databases
• Handbooks
• Manufacturer data
• Functional safety data sheets
• Documented reliability studies
• Published papers
• Expert opinions
• Reliability Standards

The absolute best data an end-user can use is its own maintenance data. Unfortunately, not
many end-users have their own reliability data collection program, and there is of course
always the problem that a new safety system contains devices that were not used before by
the end-user. Luckily, more and more databases are available that collect data from
different sources and can be used by end-users.

For the calculation we need the following reliability data for each device:
• Safe detected failure rate
• Safe undetected failure rate
• Dangerous detected failure rate
• Dangerous undetected failure rate

This data was already collected when the architectural constraints were verified, see Table
5. On plant level we also need to know the following reliability data:
• Repair rate per device
• Periodic proof test interval
• Common cause

The repair rate per device depends on the availability of the spare device and the
availability of a repair crew. Periodic proof test intervals can be determined by three
means:
• The supplier of the device specifies an interval
• Laws or standards determine a minimum inspection interval
• The desired SIL level determines the periodic proof test interval through the PFD
  calculation
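
For the third option, a first estimate of the interval can be obtained by inverting a PFD
formula. The sketch below assumes the simplest possible case, which is our own assumption
and not a method given in this paper: a single channel whose average PFD is approximated
by λdu · T/2, ignoring repair time and detected failures. The function name and numbers
are hypothetical.

```python
def max_proof_test_interval(pfd_target, lambda_du):
    """Longest interval T [h] for which lambda_du * T / 2 stays at or below pfd_target."""
    return 2.0 * pfd_target / lambda_du

# Made-up example: stay below the upper limit of the SIL 3 band (10^-3)
# with a dangerous undetected failure rate of 2e-7 per hour.
interval_h = max_proof_test_interval(1.0e-3, 2.0e-7)
print(f"maximum proof test interval = {interval_h:.0f} h")
```

In a real loop the target has to be shared between sensor, logic solver and final element,
so the interval found this way is an upper bound rather than a design value.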

Once the reliability model is created and the reliability data is collected, the actual
calculation can be performed. Figure 7 shows an example of a PFD calculation with and
without periodic proof testing. In this example an imperfect proof test is performed every
year, which ensures that the PFD of the safety function stays within the SIL 3 range.

[Plot "Cracker Safety Loop 1": probability to fail dangerous (logarithmic axis, 1e-005 to
0.01) versus hours (0 to 87600), one curve without proof test and one curve with proof test]

Fig. 7. Example PFD calculation with and without proof testing
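
The qualitative shape of such curves can be reproduced with a very small model. The
Python sketch below is not the calculation behind Figure 7: it assumes a single component
with a constant dangerous undetected failure rate, a proof test that reveals only a
fraction (`coverage`) of those failures, and immediate repair of whatever the test finds.
All names and numbers are hypothetical.

```python
import math

def pfd_over_time(lambda_du, hours, test_interval=None, coverage=1.0, step=730):
    """Return (t, PFD) samples from 0 to hours.

    Failures covered by the proof test are reset at every test; the uncovered
    fraction (1 - coverage) keeps accumulating over the whole period.
    """
    samples = []
    for t in range(0, hours + 1, step):
        if test_interval is None:
            pfd = 1.0 - math.exp(-lambda_du * t)
        else:
            since_test = t % test_interval
            ok_covered = math.exp(-coverage * lambda_du * since_test)
            ok_uncovered = math.exp(-(1.0 - coverage) * lambda_du * t)
            pfd = 1.0 - ok_covered * ok_uncovered
        samples.append((t, pfd))
    return samples

no_test = pfd_over_time(1.0e-6, 87600)
with_test = pfd_over_time(1.0e-6, 87600, test_interval=8760, coverage=0.9)
print(f"after 10 years: {no_test[-1][1]:.2e} without, {with_test[-1][1]:.2e} with proof test")
```

Because the test is imperfect, the curve with proof testing still creeps upwards over the
years, which is why the uncovered failures eventually need an overhaul or replacement.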

6. CONCLUSIONS

The purpose of this paper was to explain to end-users the most important high level
requirements when designing safety instrumented systems. The paper explained that a safety
system can fail in three different ways and that it is important to design, operate and
maintain a safety system in such a way that those three failure types are controlled. In
order to understand the functional requirements of a safety system it is important to carry
out a hazard and risk analysis. The paper explained several techniques that can be used for
this purpose. The results of the hazard and risk analysis are documented in the safety
function requirement specification. Next, starting from a top level safety function
description, the paper explained the high level requirements that end-users or system
integrators need to follow in order to design the actual safety instrumented systems. The
paper explained the significance of the architectural constraints from the point of view of
the IEC 61508 and IEC 61511 standards. For end-users and system integrators it is important
to collect reliability data, to perform reliability analysis, and to be able to calculate
the safe failure fraction and the probability of failure on demand.

References:

[1] Rouvroye, J.L., Houtermans, M.J.M., Brombacher, A.C. (1997). Systematic Failures in
Safety Systems: Some observations on the ISA-S84 standard. ISA-TECH 97, ISA,
Research Triangle Park, USA.

[2] IEC (1999). Functional safety of electrical, electronic, programmable electronic
safety-related systems, IEC 61508. IEC, Geneva.

[3] IEC (2003). Functional safety – safety instrumented systems for the process industry,
IEC 61511. IEC, Geneva.

[4] Houtermans, M.J.M., Velten-Philipp, W. (2005). The Effect of Diagnostics and Periodic
Proof Testing on Safety-Related Systems. TUV Symposium, Cleveland, Ohio.

[5] Börcsök, J. (2004). Electronic Safety Systems, Hardware Concepts, Models, and
Calculations. ISBN 3-7785-2944-7, Heidelberg, Germany.

FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
02

Within the TÜV Functional Safety Program:


This copy belongs to Administrator. Please do not distribute in any form.

Certification, Proven-in-Use,
and Reliability Data
Three of the most discussed topics
in the functional safety industry

Dr. Michel Houtermans, Risknowlogy GmbH, Switzerland

© 2009 Inside Publishing. All Rights Reserved.


© Inside Functional Safety


Abstract
Certification and Proven-in-Use can help us select safety equipment and devices that are
compliant with safety standards, and that help us build safety systems and solutions that
meet the required safety criteria set out in those standards. Users of these devices and
equipment need only to understand how to apply certification and Proven-in-Use in the
correct way, so that they can take full advantage of these solutions to achieve the full
benefit. In the end it will save them a lot of time, money and headaches. Therefore, the
purpose of this paper is to explain what certification and Proven-in-Use is, and what role
reliability data plays in these solutions. The paper will explain what to pay attention to
and how to apply certification, Proven-in-Use and reliability data.

Keywords: Certification, Proven in use, Prior use, Reliability data, Quality data,
IEC 61508, IEC 61511, IEC 62061, EN 50402

1. Why Is This A Topic of Interest?

It is always true that the end-user of a process, such as a factory or a plant, is
ultimately responsible for the safety of its people and the environment, as well as the
protection of its own capital equipment. It is for this reason that end-users continually
perform hazard and risk analysis, so that they can judge how much risk reduction is needed
to properly protect themselves from hazardous scenarios, especially those with the
potential to develop into catastrophic accidents. Depending on the outcome of a hazard and
risk analysis, different risk reduction and safety solutions might be developed or
implemented. These solutions can include:

• Altering the design of the process;
• A change to the basic material used in the process;
• Varying the location of the process;
• Development of new operational procedures;
• Additional (safety) training of employees;
• Installation of active and passive safety solutions; and
• Installation of safety instrumented systems and other safety systems;

2009 /01 • Houtermans


www.insidefunctionalsafety.com

We depend on solutions such as these to reduce the risk associated with these hazardous
events. This paper will concentrate on the performance of the safety (instrumented) system
solutions. If we are to build safety systems that perform well, we have to construct
solutions that are not only safe and reliable, but which also are compliant with applicable
functional safety standards like IEC 61508 [1], IEC 61511 [2], IEC 62061 [3], EN 50402 [4]
and others. In practice, end-users and other stakeholders, such as system integrators,
developers, manufacturers and consultants in this industry, face only two real problems.

As safety systems have become increasingly complex, with devices more commonly based on
electrical, electronic, and programmable electronic technology, standards such as IEC 61508
and EN 50402 have been developed, which detail specific requirements for building complex
safety components and systems. Following such standards is imperative, not only because the
devices themselves are complex, but also because it is always necessary for the end-user to
demonstrate to a responsible authority that the installed safety equipment meets the
current state of the art in safety technology, whenever they request such information, and
that state of the art necessarily includes compliance with such functional safety
standards.

Unfortunately these days, end-users often do not have the time, resources or, most
importantly, the knowledge to verify whether the equipment they use is actually compliant
with all applicable safety standards. End-users may be experts in running their own
production processes, but do not necessarily have adequate knowledge of software operating
systems, safety-related bus communication protocols, or reliability predictions of complex
logic solvers. So how do users of safety equipment assure themselves that the safety
equipment used is actually compliant with the safety standards applicable to their
jurisdiction? The answer is third party certification.

Likewise, existing plants use equipment and devices to carry out safety functions that
often existed long before current functional safety standards were released. It is rarely
practical, nor is it usually necessary, to replace all old safety equipment and devices
each time a new functional safety standard comes out. And if the process runs fine and the
equipment is operable, why should they? When a new standard as complex as the IEC 61508 is
released, it can often take years for compliant equipment to come on the market, anyway, so
even if an end-user wants to replace old safety equipment, in practice it’s often not
possible. So how can the end-user deal with his old safety equipment? The answer is
Proven-in-Use.

Certification and Proven-in-Use can help us select safety equipment and devices that are
compliant with safety standards and which help us design safety systems and solutions
capable of meeting the required safety criteria set out in those standards. Users of these
devices and equipment need only understand how to apply certification and Proven-in-Use in
the correct way, so as to receive the full benefit and advantage of these solutions. In the
end, proper understanding of how certification and Proven-in-Use work will save end users a
significant amount of time, money and headaches. Therefore, the purpose of this paper is to
explain the concepts of certification and Proven-in-Use, as well as to explain the role
reliability data play in these solutions. We will explain what to pay attention to and how
to apply certification, Proven-in-Use and reliability data.

2. Certification: The (Un)Necessary Evil?

2.1. What Is Certification Really?

The very discussion of functional safety certification often results in many wild stories
and ideas being told. Some companies love it, while others hate it. Some understand how to
apply it, while others see no benefit to it, and consider the process as an unnecessary
evil. One major reason there is so much confusion about certification is


that most people don’t understand what it really means, and they do not understand how to
use it to their advantage. So what is certification?

The dictionary defines certification as follows:

“to attest as being true or as represented or as meeting a standard” [5].

Essentially, certification that a device, system, organisation, or person meets certain
technical standards means that someone with expertise and authority has looked at one of
them, and has attested to the fact that statements or representations made about the
device, system, organisation, or person are true. In the technical industry, certification
to a particular standard means that an expert has examined each device, system,
organisation, or person, and confirmed that it is, in fact, compliant with one or more
standards important to that industry. That is the extent of certification; it is nothing
more, nothing less.

In the functional safety industry, of course, we use a more detailed definition for
certification. Here certification is defined as follows:

“a process by which sufficiently independent qualified entities can attest that the claimed
functions of a system or process are performed at a verifiable level of functional safety”.

As always in this industry, different stakeholders have diverse viewpoints when it comes to
certification. Depending on which stakeholder you represent, you might begin by asking
yourself some basic questions, such as:

• When is a system functionally safe?
• What should it mean when somebody says a product is certified?
• Is certification the same as assessment?
• Who is allowed to certify?
• Is certification needed?
• Can anything be certified?
• What standards should be used to certify?
• What is more important; the certificate or the report that comes with the certificate?
• What if no appropriate certification is available?

We will try to address these questions one at a time, and perhaps we can bring some order
to this certification jungle.

2.2. When Is A System Functionally Safe?

Before we can certify something we first need to understand what we are going to certify;
in this case, functional safety. In industry, functional safety has several definitions.
IEC 61508 defines functional safety as that part of the overall safety relating to the
equipment under control (EUC) and the EUC control system [1]. IEC 61511 defines functional
safety as part of the overall safety relating to the process and the basic process control
system (BPCS) which depends on the correct functioning of the safety instrumented system
(SIS) and other protection layers [2]. IEC 62061 defines functional safety as part of the
safety of the machine and the machine control system which depends on the correct
functioning of the safety-related electrical control system, other technology
safety-related systems and external risk reduction facilities [3].

One thing is clear from these definitions; when it comes to functional safety, we depend on
the correct functioning of the safety system that we are using for our protection. In other
words, we need to build safety systems that are dependable; that do not fail.
Unfortunately, everyone knows that any kind of system can fail, including safety systems.
They can fail


because the hardware or software has failed; because they were wrongly designed in the
first place; or because they were not operated, used or maintained as intended. But they
can also fail because of environmental influences like earthquakes, flooding or lightning.

Any failure will always fall into one of the following three categories: random hardware
failure, common cause failures, or systematic failures [6]. In practice, when dealing with
safety systems, we can define functional safety as follows:

A safety system is 100% functionally safe if random, systematic and common cause failures
do not lead to malfunction of the safety system and do not result in

• Injury or death of humans;
• Spills to the environment;
• Loss of equipment or production.

This is a very practical definition, but we also know that it is not possible to achieve
100% functional safety. But it is possible to certify that a safety system has met
functional safety standards, such as the requirements for SIL 1, 2, 3 and 4 functionally
safe systems, without requiring that a system never fail, as long as the certification is
properly performed.

2.3. What Should It Mean When Somebody Says This Product Is Certified?

Let’s take the example of the IEC 61508 standard. In practice, it is possible to certify a
product according to all requirements of the standard or to only certain parts of the
standard. For instance, a device without software does not need to comply with the software
requirements of IEC 61508 part 3. But if the product does have software inside and only a
hardware certification has been carried out, such certification is insufficient.
Unfortunately, today it is common industry practice to certify some safety systems this
way. Such partial certification is not necessarily a problem, as long as the certification
clearly states which aspects of the system have been certified. But this must be understood
by the end-user of the product. But in practice, often the end-user does not understand the
difference; he just sees a certificate stating SIL X, with a signature, and automatically
assumes that all requirements of the standards have been met. This is a real problem in
industry.

A manufacturer who claims compliance with the IEC 61508 standard for one of its devices
must demonstrate that the functional safety standard applies to all aspects of that device,
by hiring a third party to produce a report and certify this device. The question is, what
does this mean in practice? To what should the certifier attest? For a product to be fully
certified to the IEC 61508 standard, which requirements need to be addressed?

The IEC 61508 standard consists of literally thousands of requirements, and it is not easy
to filter out those requirements that do not apply to a product. The author has carried out
hundreds of certifications, and found that a complete functional safety certification
statement according to IEC 61508 should cover the following requirements:

1. Functional safety management;
2. Hardware requirements;
3. Hardware reliability analysis;
4. Software requirements;
5. Basic safety, environmental safety, EMC;
6. User documentation.

Basically, when a manufacturer states that their product is fully compliant with IEC 61508
then the above requirements should have been addressed. Functional safety management
addresses the life cycle requirements, documentation, verification, validation,
assessments, and measures to avoid and control failures. The hardware requirements and
hardware reliability analysis go hand in hand.


They address the requirements of IEC 61508 part 2 and include the implementation of measures to control failures and items like safe failure fractions, hardware constraints, probability calculations, and so on. The software requirements address IEC 61508 part 3 of the standard and also include the measures to avoid and control failures. In practice it is not always possible to separate hardware and software certification.

Completely overlooked by many certifiers are the requirements for basic safety, environmental safety, EMC, and user documentation. It is very important that we not only develop devices that are functionally safe but which also function properly in the environment for which they are intended. This is a requirement of IEC 61508. For each device, the environmental properties have to be defined upfront and then the device has to be designed to work under these environmental properties. In practice, this means that other standards besides IEC 61508 must be part of the certification as well. Typical standards are IEC 61131 [7], IEC 60101 [8], EN 50082-2 [9], EN 61000-6-3 [10], IEC 60068 [11], and many more including possible application-specific standards.

User documentation is the other often forgotten requirement that should be certified. It makes no sense to have a device that is functionally safe when used properly, but the end user has no idea how to use the product properly. Therefore, user documentation plays an important role. It should explain to an end-user how to install, commission, validate, operate, maintain and repair the device. So it is very important that the information in the user documentation is correct. A certifier should check this.

What actually is to be certified is the responsibility of the company that orders the certification, not the responsibility of the end-user of the product, unless the end-user pays for the certification. When it comes to devices, the manufacturer usually orders the certification, which means that they decide what will actually be certified, correct or not. Many certifications in today’s market only partially address the requirements of the standards. The next time you receive a certificate or a report to the certificate, ask yourself some questions:

• Does the certificate address those standards that are important to us?
• Was the certification report complete? Did it address all the requirements of the standard or only a part?
• Is the certification report clear about how the certification has taken place?
• Can I trust the certifier? Have they done certification work before or is this the first time? Did the manufacturer self-certify?

You do not need a positive answer to all these questions in order to decide whether the device is suitable or not. The questions are merely designed to help you make decisions. For example, just because you receive a certification from a third party you have never heard of does not mean that you cannot trust the results of their certification. But it should lead you to ask more questions or look more carefully into the product certification.

2.4. Is Certification The Same As Assessment?

Certainly not! Assessment is a term used in the functional safety standards themselves. Certification does not exist in these standards as such. While an assessor examines whether everything was carried out according to the safety plan and thus whether functional safety is or can be achieved, certification confirms (attests) the achievement. Often in the certification industry, the certifier is also the assessor. This is not a conflict of interest because certification has a limited, very specific meaning. Certification only confirms that statements made are true, which means that a certifier is in a very good position to do an assessment.



2.5. Who Is Allowed To Certify?

In the process industry, there are no rules when it comes to who is allowed to certify. The author of this paper has collected third party SIL statements from over 100 different organisations or individuals. In theory, this means that anyone can certify functional safety. Of course, in practice this does not work, because all stakeholders wish to have some level of trust when it comes to certification. The question to ask is not so much “Who is allowed to certify?” but rather, “Who do we trust?”. Unfortunately, the level of trust required by the stakeholders seems to vary quite a bit around the world, and occasionally, it is so low that any level of quality is acceptable.

In practice, a majority of end-users do not seem to really care about the quality of the certification. They only care to know that some third party signed a piece of paper, so that they can shift responsibility to that third party. Product developers look at this from a different point of view. They often don’t care about the certification company; they feel it is far more important who is doing the certification, to the point that some product developers will follow a certifier when he switches jobs. Remember: in the end, certification is up to the person performing the task. If you choose a competent certifier, you will end up with a good, functionally safe product. But if your choice is not so competent, you may save money, but you will have no idea whether or not the end product is actually functionally safe.

How the certification has been carried out or what actually has been certified is of lesser interest to the majority of stakeholders in this world, which is not a positive development, in part because companies that take functional safety seriously end up in the same league with companies that do not, only because both companies have a third party SIL X statement. Not only is it not good for safety but, from a commercial point of view, it is unsustainable in the future.

Although some companies apply self-certification, most in the industry agree that an independent third party should carry out certification. A third party is a party that is not involved with the manufacturer or the end-user and is therefore independent. Luckily, there are plenty of choices and thus fierce competition in this industry, which drives down the cost for certification. Unfortunately it also tends to drive down the quality.

Some companies claim that only a notified body can perform functional safety certification. Usually these companies are in some field identified by the European Union as notified bodies themselves, see note. Unfortunately for those companies, it is a fact that a notified body for functional safety does not exist, and even if they are a notified body themselves, they can’t be a notified body for functional safety.

Note: Notification is an act whereby a Member State informs the Commission and the other Member States that a body, which fulfils the relevant requirements has been designated to carry out conformity assessment according to a directive. Notification of Notified Bodies and their withdrawal are the responsibility of the notifying Member State [12].

2.6. Is Certification Needed?

No and yes, and really in that order. The answer is no, because in the functional safety business there is no mandatory third party certification. The law does not require it, nor do the standards. Therefore, officially speaking, there is no need to demonstrate through certification that functional safety statements made are true.

But, the answer is also yes, because there are very good reasons why certification is beneficial for all stakeholders. Certification helps to demonstrate to “outsiders,” including governments, local authorities, insurance companies, internal auditing departments, etc, that products, systems, organisations or people meet the requirements of the safety standards. The only thing the different stakeholders need to understand is how to use (or not use) certification.


Why is certification beneficial for end-users? It is because it is impossible for an end-user to analyze every single safety device in the plant, or that a sales person wants to sell to the end-user. End-users simply do not have the time, knowledge and resources to properly evaluate all these devices based on the requirements of the standards. The first end-user who understands how to make the “stack” functionally safe or how to guarantee that an ASIC is developed without systematic failures still needs to reveal himself. But even if end-users had the time, knowledge and resources, such an evaluation still would be practically impossible as the manufacturers of the devices will not give them access to the internal workings of a device, because such information is proprietary.

In practice there is only one way for end-users of safety devices to overcome this, and that is to rely on third party certification. An end-user often buys a certified product because a trusted third party has investigated that the statements made by the manufacturer are true, and has assured end-users that the product meets all the requirements of the standard. This also means that an end-user is basically responsible when a non-certified device is used in a safety loop. How will the end-user demonstrate that the product is suitable and meets the requirements of the standards? Well today this is a matter of risk management; as long as nothing happens he does not need to prove it. But are you willing to take that risk?

Why is certification beneficial for product developers? Product developers should consider certification because it is a one-stop-fits-all solution. If the product developer goes through certification once, it can sell the same product to any end-user without having to demonstrate to each one that the product is suitable. Of course, the main task for product developers is to figure out what potential end-users want to have certified and to find a trustworthy certifier. For product developers, certification is often used as a marketing and sales feature. It differentiates them from the competition (as long as the competition does not have certification or suddenly has a “better” certification).



What if a product is not certified? Well then we have a number of options.

• We can try to demand certification from our product supplier. This might work, but even if the product supplier is willing to get third party certification, that does not mean that it is a workable solution. Functional safety certification can often take from several months to several years to complete.¹ This is not an option if the product is needed quickly.
• We can try to find a different manufacturer, who sells similar devices with certification. This option is becoming more appealing over time, as more devices are being certified, giving the end-users more options to choose from. But unfortunately, it does not always work, because sometimes certified alternatives are just not available.
• If the device is older we can also try to apply Proven-in-Use. Proven-in-Use is also a hot topic in industry. A separate paragraph is dedicated to explaining the (mis)use of Proven-in-Use. Proven-in-Use is only an option for older devices; a newly developed device cannot be claimed as Proven-in-Use.

¹ For example, certification of a smart transmitter takes anywhere from 6 months to 2 years. A logic solver takes anywhere from 1 year to 5 years, if successful at all.

Note: Don’t forget that in the end, good engineering practice is always more important than a certificate or third party report. Always use common sense to make decisions, and if the decision-making process includes information from third parties then so much the better. Do not use a device only because a third party certified it.

2.7. Can Anything Be Certified?

It certainly seems so. In the functional safety business, it is possible to literally get anything certified if it has anything to do directly or indirectly with safety systems. These days, functional safety certification addresses such items as:

• Individual safety devices, hardware as well as software. And not only electrical, electronic and programmable electronic devices, but also mechanical and electromechanical devices;
• Software: embedded software as well as application software;
• Safety functions: a collection of individual safety devices carrying out one safety function;
• Safety systems: a system that carries out one or more safety functions;
• Organisations: mainly the management systems that deal with safety devices, functions and systems;
• People: a professional dealing with safety devices.

Functional safety certification, as we know it today, started in 1970 when HIMA had their PLANAR system certified for the first time by TÜV SÜD. Today any device that is used in a safety instrumented loop and is required to carry out the safety function can be certified. Typical devices that manufacturers have certified are:

• Sensing elements;
• Transmitters;
• Barriers;
• Logic solvers;
• Safety relays;
• Actuators;
• Solenoid valves;
• Valves;
• Bus networks and other communication peripherals.

Besides equipment, since the late 1990s there has been a growing trend to certify organizations and people. Some organisations, like product


manufacturers, have their functional safety management (FSM) system certified according to the FSM requirements of IEC 61508 and IEC 61511. The first organisation to do so was Honeywell, the Netherlands. This practice is not really taking off throughout industry, however. It is one of those typical certifications where the manufacturer understands why he does it but his customers have no clue what it is and thus do not know how to appreciate it.

Since 1999, when TÜV SÜD first introduced a program (CFSE²) to certify people according to functional safety principles, more and more professionals and managers are getting certified worldwide. People certification started first with functional safety consultants, followed by system integrators. Today, end-users are still significantly behind the curve. Why do the end-users of the safety devices resist the trend toward certification?

² The author, at the time a department manager, was personally responsible for the CFSE program at TÜV SÜD. He developed the program in 1998 from an idea into a marketable product. Today over 3000 professionals are certified worldwide for functional safety. Today this program is called TÜV Functional Safety Certification Program (FSCP).

Whether the issue is products, systems, organisations or people, those companies and people who deliver certified products or services have a real interest in showing customers that their products, services or even they themselves are certified, in order to demonstrate, through independence, their qualification to work in the functional safety industry. For end-users, this is good news, because they feel less need to investigate whether a product, service or person is “compliant,” although certification alone can never be cited as a reason to not investigate any further; just because something is certified doesn’t automatically mean that it is the best choice or option. Certification just helps to get a better picture of a product, service or person; it is just one element that supports the decision making process. Many other factors contribute to selecting the right choice and these other factors can sometimes lead an end-user to another, non-certified choice.

2.8. Which Standards Should I Certify To?

Many standards exist to deal with requirements for safety devices or solutions. These standards can be divided into product-specific standards and application-specific standards. In terms of functional safety the most important standards are:

• IEC 61508;
• EN 50402;
• IEC 61511;
• IEC 62061.

IEC 61508 and EN 50402 are typical product standards while IEC 61511 and IEC 62061 are typical application standards. The IEC 61508 standard was officially released in 1999 and deals with any type of safety system with one or more electrical, electronic and programmable electronic (E/E/PE) devices. Every manufacturer of safety devices based on E/E/PE technology must comply with this standard. The standard itself has many detailed requirements that deal with electronic components and software issues.

When an individual product is certified, it should first be certified against a product standard and not an application standard. Often, certificates show that a product is certified against both product standards and application standards. In theory it usually makes no sense to certify a product against an application standard, but it is done anyway. Failing to put the application standard on the certificate turns out to be a marketing and sales problem. Many end-users do not understand the difference between a product standard and an application standard; they care more about applications than about individual products. In the end, one of the most important reasons why application-specific standards are put on certificates is to make sure that the end-user actually buys the product, not because it is really needed from a certification point of view.


For example, IEC 61508 does not address applications; it is a general purpose standard. EN 54-2 [13], on the other hand, is an application-specific standard for fire detection and alarm systems. End-users often look for products that can work in an EN 54-2 environment, while in reality they need a product that works according to the requirements of IEC 61508. If an end-user sees a certificate stating that the product is IEC 61508 compliant, he might still think that he cannot use it in an EN 54-2 environment. Marketing and sales teams want the EN 54-2 certification for their products, as well, despite the fact that there are almost no requirements on product level in this standard.

Finally, we have the aforementioned environmental, basic safety and EMC standards. Every product must work in a particular environment, and it is important that the product is tested for that environment. Many different standards exist that can be used for this kind of testing, some of which are international standards, and others of which are country-, industry- or application-specific.

2.9. What Is More Important The Report Or The Certificate?

A good certification always consists of a technical report, and it may include a certificate, but there should never be a certificate without a technical report. The technical report is always more important than the certificate. The certificate is like a degree from a university. The degree demonstrates graduation, but more important is the transcript, which shows how you passed all the exams. It is the same for certification. The certificate itself is only a summary of the results, while the report to the certificate contains all the details of how the third party did the verification and assessment of the product.

Another reason the report is very important is that it not only explains how the certification was performed, but often also lists possible restrictions on use of the device. These restrictions are important to know. Do you want these restrictions? Are they limiting you in any way? Do they force you to buy more equipment or to perform testing when you do not want it? All of this


is important information for you to know before you decide to buy the device.

In other words, before you buy a certified device it is always important to request the report to the certificate. Alarm bells should go off if a supplier does not want to show you the report before the purchase is made, because it’s possible there is information in there they do not want you to know.

2.10. What If Certification Does Not Exist?

It still happens a lot in industry that we want to use devices in safety applications for which there is no available proof of compliance with safety standards. There may be many reasons why sufficient information about a device is not available, including lack of certification, or for some safety solutions, it is possible that no devices exist that are capable of demonstrating compliance with the standards. What can we do in such a case? Here are some solutions:

1. Try to push the manufacturer to supply the appropriate documentation. If you would like to use this device, try to convince them to go through third party certification;
2. If the manufacturer is not willing or able to provide certification, try to find another device on the market with the appropriate third party statements;
3. If there are no other devices on the market, try to make a safety case for this device. Collect as much information as possible about the device and try to find the answers to the following questions, in order to determine if the device is or is not suitable for safety purposes:
   o Why do you think this device is suitable?
   o How many of these devices are currently in use? How long have these devices been in use?
   o How often has this device been used in a safety function? Has the safety function been activated? Did the device work?
   o How many of these devices have failed? How did they fail? Safe or dangerous?
   o Does the device contain software? How do you know the software has no bugs? Which version is running? Is there a bug list?
   o Have you ever had to send a device back to the manufacturer? Did they respond to your needs in a satisfactory manner?
   o Are you convinced that this device is the right choice?

If you do not have the above information, how can you be convinced that the device is suitable in a safety application? If you still want to use the device (and there might be good reasons to do so), then introduce the device slowly. First, test it in a non-safety application for a while, to see if it works the way you expect. Activate its safety function many times. Just keep in mind that slow and old is good in the safety business; we do not need the latest high-tech features. We should have simple safety problems, such as: measure the pressure and, if it is too high, stop the flow. This we can do with simple solutions. Keep it simple, as simple as possible, Einstein once said. If it looks too complex it probably is, and most likely another, simpler solution exists.

2.11. Conclusions Certification

Today, certification is really a necessary evil. It is almost impossible in the functional safety industry to sell products without certification. When looking for certified devices, always pay more attention to the report to the certificate than to the certificate itself. In the end, it does not matter who certified a product as long as you feel that you can trust them. Don’t forget though; just because you use a certified product does not



reduce your responsibility as an end-user concerning functional safety.

3. Proven-in-Use: Heaven Or Hell?

3.1. Why Is Proven-in-Use So Attractive?

Proven-in-Use is an option in the standard that can be used for “older” safety devices. Unfortunately, the standards do not explain to us explicitly why such a concept as Proven-in-Use is actually required. Even though there are good reasons to have it, few people seem to understand the real background of Proven-in-Use and therefore, it is often misused in industry. So why does Proven-in-Use exist?

Well, consider this. A new functional safety standard came out in 1999 (IEC 61508). Just because this standard was released at that time did not mean that every device manufacturer was able to design devices to meet the requirements of this standard immediately. And even if an end-user wanted to comply with that standard, he couldn’t do so when devices were not available. So, what is he expected to do; shut down the plant? That’s not a viable option, of course.

Just because the industry introduces a new safety standard doesn’t mean that all old safety equipment becomes obsolete and does not work. Many plants have run without problems for many years, because the equipment has worked. So why replace old equipment that works with new equipment that has yet to prove itself? In other words, it is not always possible, nor is it necessary, for an end-user to implement fully compliant safety devices, which is why the standard has requirements for Proven-in-Use. But there is a catch. If you give an end-user the option to do something or to do nothing, he will usually choose to do nothing. And that is the danger with Proven-in-Use.

Of course, if the end-user does not want to replace anything then why should a manufacturer of devices develop devices which are compliant with the latest standard? If they do not need to comply with IEC 61508 then they do not need to change their development procedures, they do not need to change the design of their devices, they do not need to implement functional safety management and measures to avoid and control failures. Proven-in-Use saves them a lot of development time and cost.³

³ Depending how fit they are in functional safety matters they save about 25-75% in development cost.

Unfortunately the result is that everybody is trying to comply with Proven-in-Use and this cannot be the intent of the standards. We do not develop new standards so that stakeholders work with Proven-in-Use devices for many years to come. So let’s dive more deeply into the Proven-in-Use topic and try to answer the following questions:

• What is Proven-in-Use?
• What are the requirements for Proven-in-Use?
• What kind of device can be Proven-in-Use?
• Does Proven-in-Use make sense?
• What about operational hours?

3.2. What is Proven-in-Use?

Proven-in-Use is an option in the standards that allows us to use “older” devices that weren’t designed according to the rules of these standards but that are suitable due to their proven performance history. The current IEC 61508 does not have a definition for Proven-in-Use, but has many requirements for it. The new draft version of IEC 61508 defines Proven-in-Use as:

“demonstration, based on an analysis of operational experience for a specific configuration of an element, that the likelihood of an dangerous systematic faults is low enough so that every safety function that uses the element achieves its required safety integrity level” [14]


The current version of IEC 61511 defines Proven-in-Use as follows:

“when a documented assessment has shown that there is appropriate evidence, based on the previous use of the component, that the component is suitable for use in a safety instrumented system (see “prior use” in 11.5)” [2]

The definition of IEC 61511 is very weak and leaves too much room for interpretation. It fails to address the actual problem. Even worse, in IEC 61511 two terms are used, i.e., prior use and Proven-in-Use. Prior use is even weaker than Proven-in-Use, as it only indicates that a device was used previously. A term like prior use does not address the question of whether the device worked or not; therefore the term Proven-in-Use is much more suitable. Now that we know what Proven-in-Use is, we can look into the requirements for Proven-in-Use.

3.3. What Are The Requirements For Proven-in-Use?

As noted above, both IEC 61508 and IEC 61511 have requirements for Proven-in-Use. The IEC 61508 standard has one requirement that clearly shows the advantage for claiming Proven-in-Use. In case a previously developed subsystem is regarded as Proven-in-Use, then information regarding the measures and techniques to avoid and control failures is not required any more. For most people using safety instrumented systems, this does not ring a bell. For developers of safety systems or devices this sounds like music to their ears. There are about 30 tables in the IEC 61508 standards that require per SIL level certain measures to avoid and control failures. Basically, these tables determine how much safety needs to go into a device. For example, typical measures are the level of documentation, project management, testing, as well as the types of diagnostic features, types of software languages to use, etc. And all of this is a function of the SIL level. The higher the SIL level, the more work a developer has. If a device can be regarded as Proven-in-Use, then it is unnecessary to carry out this development, and that is a big incentive.

The most important requirement of IEC 61508 is the necessary documentary evidence regarding the use of the device. This evidence should typically include requirements for both hardware and software, addressing:

• Failure recording and failure data indicating that the probability of failure is low enough;
• Any testing that has taken place (a demand is considered a test of the safety function);
• The current condition of use must be similar to the previous conditions of use. Conditions of use include:
   o Environment;
   o Modes of use;
   o Functions performed;
   o Configuration;
   o Interfaces to other systems;
   o Etc;
• If the conditions are not the same then additional analysis and testing is required to demonstrate that, under the new conditions, the probability of failure is still low enough;
• That the device has restricted functionality;
• Statistical evidence that the claimed failure rates are sufficiently low;
   o Single-sided lower confidence limit of at least 70%;
   o Excluding operation times of less than 1 year;
   o Only the operational time of the device, where all failures have been detected and reported.
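
The statistical requirement in the list above can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not a method prescribed by the standards: it assumes a constant failure rate (so that the number of failures in the accumulated device hours follows a Poisson distribution), it reads the 70% single-sided confidence limit as a one-sided upper bound on that failure rate, and it takes one year as 8760 operating hours. The function names and the example figures are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: continuous operation, 365 days per year


def qualifying_device_hours(hours_per_device, min_hours=HOURS_PER_YEAR):
    """Sum operating hours, skipping devices with less than min_hours in service."""
    return sum(h for h in hours_per_device if h >= min_hours)


def poisson_cdf(k, mu):
    """Probability of observing at most k events when the expected count is mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def failure_rate_upper_bound(failures, device_hours, confidence=0.70):
    """One-sided upper bound on a constant failure rate, in failures per hour.

    Finds the expected failure count mu at which seeing `failures` or fewer
    failures has probability (1 - confidence), then divides by the hours.
    """
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(failures, hi) > target:  # expand until the bound is bracketed
        hi *= 2.0
    for _ in range(100):  # bisection on the expected failure count
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi / device_hours


if __name__ == "__main__":
    # Hypothetical field data: operating hours of four identical devices.
    hours = qualifying_device_hours([30000, 45000, 5000, 26000])
    print("qualifying device hours:", hours)
    print("upper bound, 0 failures:", failure_rate_upper_bound(0, hours))
    print("upper bound, 1 failure: ", failure_rate_upper_bound(1, hours))
```

In this hypothetical data set the device with 5000 hours is left out because it has been in service for less than a year. The sketch also presumes that every failure in the counted hours was detected and reported; if that is not the case, the resulting bound is too optimistic.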


When considering the above requirements one should always take into account:

• The complexity of the subsystem;
• The contribution made by the subsystem to the risk reduction;
• The consequence associated with a failure of the subsystem;
• The novelty of the design.

The approach in IEC 61508 is very clear. The more complex a device is and the higher the SIL level of the loop, the more attention needs to be placed on proving Proven-in-Use. The actual information that needs to be documented is not as complex either. The problem is, of course, that almost no one ever wrote this information down. Only today, companies are starting to document that information, but 10 years have passed since the release of IEC 61508. Many companies do not realize that evidence for Proven-in-Use can be relatively easy to document. Evidence can come from data such as:

• Documented loop checks after plant turnarounds;
• Documented spurious trips and scheduled shutdowns;
• Maintenance records demonstrating repair activities and potential failures found.

The requirements of IEC 61511 are similar to those of IEC 61508. Additionally though, IEC 61511 requires that the devices considered for Proven-in-Use be manufactured under a quality management system and extra requirements exist for devices based on Fixed Programming Languages and Limited Variability Languages. There is a practical problem, however, with some of the additional requirements in IEC 61511. For example, for a fixed programming language (FPL) product, you must demonstrate that unused features are unlikely to jeopardize the safety function in case of failure, or that appropriate standards for hardware and software have been applied. For an end-user, that can be impossible to demonstrate, because an end-user cannot look inside a product. No amount of operating experience and statistical data analysis can give you answers to questions such as these, which only product developers can answer. In practice, however, in most cases, only the end-user knows how many devices he has, how long he has been using them and when and how they failed (if he documented this). In other words, only end-users can determine Proven-in-Use, although there are some requirements they will be unable to prove.

3.4. Which Devices Can Be Proven-in-Use?

The standards have one requirement about the devices that can be claimed Proven-in-Use. Both standards state that only devices with restricted functionality can be claimed as Proven-in-Use. Unfortunately, the standards do not explain what restricted functionality is. Typical devices with restricted functionality are:

• Sensors and transmitters;
• Barriers;
• Relays;
• Valves;
• Solenoids;
• Etc.

These devices basically carry out only one function. A pressure transmitter transmits pressure from the field. A relay only opens or closes its contacts. Other devices capable of carrying out many different functions simultaneously, such as logic solvers, are not considered to have restricted functionality, because they are too complex. There are too many (side) functions with such a device that can influence the behaviour of the safety function when they fail, making it impossible to apply Proven-in-Use.

3.5. Does Proven-in-Use Make Sense?

Yes, but … As explained before, Proven-in-Use makes sense for devices that were developed before the standards became effective, but we also have to apply our own common sense when applying Proven-in-Use. If a manufacturer develops a new device today, can Proven-in-Use then be claimed a few years from now, when sufficient operating experience has been gathered? Unfortunately, this completely defeats the purpose of Proven-in-Use, although it is exactly what some product manufacturers and end-users have been doing since 1999. Devices that were not developed according to the requirements of IEC 61508, but which were released after 1999, are now claimed as Proven-in-Use. Why, then, do we have these new standards?

Of course we cannot expect or demand that product developers have their products ready on the day a new standard is released. What end-users and product developers need is a grace period, which would make sense for Proven-in-Use. If IEC 61508 is released in 1999 and we agree on a grace period of five years, for example, then end-users can, for the next five years, still buy safety devices that may not be compliant with IEC 61508, but which are Proven-in-Use. And product developers have five years to update their existing devices, or to develop new devices which are compliant with the standards. Just a thought from a safety engineer.

3.6. What About Operational Hours?

Probably the most important aspect of Proven-in-Use is that we can demonstrate that the safety function of a device really works. But we can only demonstrate this if we have sufficient evidence, and that is exactly where the standards fall short. They do not really explain when the evidence is sufficient.

A Proven-in-Use device should have sufficient statistical evidence about the failure rate. This should be expressed as a number of identical devices and their operational hours. The old DIN V VDE 0801 [15] standard requires 100’000 operational hours and a minimum of 1 year of service history. 100’000 hours of operation is not really that much; it is roughly 10 devices for 1 year. Would you take medicine that has only been tested on 10 people for 1 year? IEC 61508 talks about 1’000’000 hours, but this is not really a requirement; it is only mentioned as an example in a note to one of the requirements. The reason for this is that IEC 61508 makes it dependent on the statistical confidence instead of the operational hours. The standard requires a single-sided lower confidence limit of at least 70%. This does not depend on the operational hours alone but on the combination of the operational hours and the number of dangerous undetected failures. Nevertheless, many companies use 1’000’000 operational hours as a guideline, and together with low expected dangerous undetected failure rates the 70% is often easily achievable.

The necessary time, in terms of operational hours, required to establish the claimed rates of failure can result from the operation of a number of identical applications, provided that failures from all the applications have been effectively detected and reported. If, for example, each of 10 applications works fault-free for 90’000 h (just over 10 years), then the total time of fault-free operation may be considered as approximately 900’000 h. In this case, each application has been in use for over 10 years and therefore its operation counts towards the total number of operational hours considered.

All failures count towards Proven-in-Use. Not only dangerous undetected⁴ failures need to be recorded, but safe detected, safe undetected and dangerous detected failures as well. Even if no dangerous undetected failures have been recorded over time, it is statistically possible to make a statement about how often the device would suffer a dangerous undetected failure. This is not a problem in the statistics industry.

⁴ A failure is only detected when it is detected by built-in diagnostics. Failures that are revealed by proof tests are still considered as being undetected. This is just a matter of definition.

3.7. Conclusions Proven-in-Use

Proven-in-Use is a necessary industry feature. It would be impossible to achieve functional safety today if we did not have this option. Unfortunately, Proven-in-Use is a very popular subject for the wrong reasons. It is often misused, which can lead to superficial safety. Many companies try to claim Proven-in-Use so that they do not need to comply with the measures to control and avoid failures listed in IEC 61508.

We always need to keep in mind why Proven-in-Use exists. It exists only to bridge a grace period. Proven-in-Use should never be used as a long-term solution. Eventually new, compliant devices should replace all old Proven-in-Use devices.

Although Proven-in-Use is often claimed by product manufacturers, in reality Proven-in-Use can only really be claimed by end-users of devices. Only they can know how many devices they have in operation, when they failed, and how they failed. Each end-user has a specific operational environment. Each end-user has his own maintenance philosophy, strategy and schedule. All these aspects influence the failure behaviour of a device. Product developers know a lot, but they do not know everything about their customers and how they use their devices. In most cases, they cannot claim Proven-in-Use even if they want to.

4. Why Do We Need Reliability Data?

Sometimes a safety device meets all other criteria for certification or Proven-in-Use, but we still do not have sufficient data to demonstrate that the failure rates are low enough. We need this failure rate data because, in the end, the standards require us to perform reliability calculations for:

• The safe failure fraction (SFF), and
• The probability of failure on demand (PFD).

Furthermore, even though it is not a requirement in the standards, end-users would also like to know how likely it is that the safety system causes a spurious trip. In this case, the probability of fail safe (PFS) must be calculated for the safety loop; it is the determining factor for the STL. Basically, we need data in order to carry out reliability calculations.

4.1. What Kind of Data Do We Need?

In order to fulfil the reliability requirements of the standards, we need different kinds of reliability data. For each device used in a safety loop we need as a minimum the following:

• Safe detected failure rate;
• Safe undetected failure rate;
• Dangerous detected failure rate;
• Dangerous undetected failure rate.

We need this data because we are required to calculate the safe failure fraction per device. This data can be expressed in different ways. For example, many companies prefer to work with Mean Time To Failure (MTTF) instead of rates. Other data, like safe ratios or diagnostic coverage factors, may also be useful, although they are not strictly necessary. Today, data is typically delivered by device manufacturers, but in the case of Proven-in-Use we probably need to look for other sources as well.

In the end, though, we need to combine safety devices to make a safety loop. At the safety loop level, more reliability data is needed. This data includes (basically per device):

• Repair time (mean time to repair, MTTR);
• Periodic proof test interval;
• Periodic proof test coverage;

• Common cause factors.

This data is typically only available from end-users of safety devices. Product manufacturers have no influence on this information. Only an end-user knows how long it takes to repair a device, or how often the proof test is carried out and how effective it is. This data is required in order to calculate the probability of failure on demand (PFD). Furthermore, although not required by the standards, we can use this information to calculate the probability of fail safe (PFS), a measure for the number of spurious trips caused by the safety loop equipment (expressed as Spurious Trip Level® or STL™).

4.2. Data sources

Today, many sources are available from which engineers can get the reliability data described above. Typical data sources include:

• Maintenance records from end-users;
• Databases;
• Reliability studies;
• Manufacturer data;
• Standards.

The best available data comes from the end-user of the product. Unfortunately, this is only the case for existing devices, and only when an end-user has a documented reliability data collection program. Fortunately, more end-users are collecting reliability data. Many end-users, though, still have no idea how often a device fails, how it fails, how long it takes to replace it, or what needs to be done to correct it, other than the “we need to replace it quickly” solution. Instead, for each device that fails, end-users should ask themselves some questions, in order to help them build a knowledge base:

• When did the device fail?
• How did the device fail? Was it a random, common cause, or systematic failure?
• How did this failure affect the safety function? Was the safety function carried out, or was it impossible to carry it out? Or did it have no effect on the function at all?
• How many of these devices are in operation? When were they installed?
• How often do we proof test them? Have there been any unscheduled trips that proved whether or not they worked?
• Do we need a procedural change because of this failure? Do we need to buy a more reliable alternative? What do we need to do?

Most companies do not ask themselves these questions, for various reasons. One typical reason is that some devices are so cheap that it just does not seem worth the effort to collect this kind of information. Of course, that is a misconception, as knowledge is priceless. Another reason cited (although this, too, could be put into the “lack of knowledge” category) is that the company simply doesn’t know what to do with the information once it is obtained. For a true reliability engineer, however, there is no doubt that more information leads to more knowledge and thus better decision-making.

In the past, many industry initiatives have been undertaken to collect data on an industry basis. One famous database is OREDA [16], which collects data from the North Sea operators. This project is ongoing and continuously updated. The American Institute of Chemical Engineers carried out a similar project for the chemical industry in the USA. They have the CCPS [17] databases. Both projects focus on many types of industry equipment. SINTEF [18] and exida.com [19] have handbooks that collect data specific to safety instrumented systems. Many other sources exist as well.

Manufacturers often carry out their own reliability studies and produce their own data for their devices. These reliability studies are carried out on individual devices, examining every single component of the device, as well as the entire device. Life and stress testing and reliability modelling are carried out to find failures or to predict the failure behaviour of the device. Even the failure rates of individual device components (resistors, capacitors, integrated circuits, relays, springs, and so on) can be predicted using reliability standards like Military Handbook 217 [20], IEC 62380 [21], Telcordia [22], Mechanical Handbook [23], PRISM [24], etc., and serve as a basis to predict the reliability of complete devices.

4.3. Data Quality, Quality Data, Does It Exist?

Even though there are many sources, the quality of the data is never guaranteed. What this means is that, even though we have the data, we never know how reliable that data is. Regardless of the source, we don’t know where the data came from, or how it was collected and processed. Even worse, most of the time we do know that the data itself is unreliable. In the end, of course, bad data can affect everything. If we cannot trust the failure rates of a device, we cannot trust the safe failure fraction of that device. If all devices in a loop are based on bad data, then we also cannot trust the probability calculations (PFD, PFS) based on that data, and any proof test intervals derived from that data also come into question. In other words, we should ask ourselves how we can trust the claimed SIL for a loop if we cannot trust the data.

Despite the uncertainty that exists in today’s data, bad data is still better than no data at all. In reality, it is far more important to be consistent with the data than to have the best quality data. To have devices in a loop that are all based on different data sources is much worse than to have all devices in a loop based on the same data source. In the first case, we have bad data from different sources. In the second, we have bad data from at least a consistent source. In the latter case, at least we are comparing apples to apples, while in the first case we are comparing apples with pretty much anything.

Though today the functional safety industry is not concerned with uncertain data, and thus does not address the problem, the reliability industry recognized this problem ages ago and has already solved it. Reliability engineers use uncertainty and sensitivity analyses to address the consequences and effects of bad quality data. How long will it be before the functional safety industry picks up on this problem?

The data used in our industry today is simply taken for granted. Lots of meaningless calculations are carried out to prove that the PFD of a SIL has been reached. No one asks himself whether or not the provided data makes any sense. For example, the product supplier tells you their product has a lifetime of 10 years. Ten years ago, you installed 1’000 of these products in your plant. Did all 1’000 products fail in the meantime? Based on the data, you should have replaced most of them by now, but did you? Or even better, if another product supplier tells you their product has a lifetime of 1’000 years, you should expect to have almost no failed products of this type. Or have you already replaced half of them?

4.4. Conclusions Reliability Data

Good reliability data is a requirement in the functional safety industry. Not only do the standards require us to perform reliability calculations for the SFF and the PFD / PFH, but we also need to understand how often our safety system causes a spurious trip. The better the data, the more reliable our calculations will be. Many sources are available today, but that does not guarantee that we have good quality data. Even worse, in reality most data is of very bad quality. Unfortunately, in the functional safety industry, data quality is not yet an issue. Solutions are available in the reliability industry, but it will take many years before they filter through to the functional safety industry.
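
The calculations referred to in Sections 3.6 and 4 can be illustrated with a minimal sketch. It is not a method prescribed by this paper or by the standards. It assumes constant failure rates, reads the 70% confidence requirement as a one-sided upper bound on the failure rate derived from a Poisson model of the field data, and uses a commonly quoted simplified formula for a single (1oo1) device with a perfect proof test. All function names and all numbers are hypothetical and chosen for illustration only.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """Return P(N <= k) for a Poisson-distributed count N with mean mu."""
    term = math.exp(-mu)  # P(N = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i  # P(N = i) computed from P(N = i - 1)
        total += term
    return total


def failure_rate_upper_bound(failures: int, hours: float,
                             confidence: float = 0.70) -> float:
    """One-sided upper bound (per hour) on a constant failure rate.

    Assumption: failures follow a Poisson process. The bound is the rate
    for which observing `failures` or fewer events in `hours` has
    probability 1 - confidence; it is found by bisection.
    """
    target = 1.0 - confidence
    low, high = 0.0, 1.0
    while poisson_cdf(failures, high) > target:  # bracket the root
        high *= 2.0
    for _ in range(200):
        mid = (low + high) / 2.0
        if poisson_cdf(failures, mid) > target:
            low = mid
        else:
            high = mid
    return high / hours


def safe_failure_fraction(sd: float, su: float, dd: float, du: float) -> float:
    """SFF: share of all failures that are safe or dangerous detected."""
    return (sd + su + dd) / (sd + su + dd + du)


def pfd_avg_1oo1(dd: float, du: float,
                 proof_test_interval_h: float, mttr_h: float) -> float:
    """Simplified average PFD of one device (assumed 1oo1, perfect proof test)."""
    return du * (proof_test_interval_h / 2.0 + mttr_h) + dd * mttr_h


# Hypothetical field data: zero dangerous undetected failures in 1'000'000 h.
lambda_du_bound = failure_rate_upper_bound(failures=0, hours=1_000_000)

# Hypothetical failure rates per hour for a single device.
sff = safe_failure_fraction(sd=2.0e-6, su=1.0e-6, dd=1.5e-6, du=0.5e-6)
pfd = pfd_avg_1oo1(dd=1.5e-6, du=0.5e-6, proof_test_interval_h=8760, mttr_h=8)

print(f"70% upper bound on lambda_DU: {lambda_du_bound:.3e} per hour")
print(f"SFF: {sff:.1%}, PFDavg: {pfd:.3e}")
```

With zero recorded failures the bound reduces to -ln(0.3) / T, about 1.2 divided by the operational hours. The same fault-free record therefore supports a ten times lower failure rate claim at 1’000’000 hours than at 100’000 hours, which is why the amount of field experience matters so much.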

5. Conclusions

This paper addressed three of the most discussed topics in the functional safety industry today, i.e., certification, Proven-in-Use, and reliability data. All three topics play an important role and, without them, functional safety would not work. Certification is seen as a costly but necessary evil that we cannot get rid of anymore. Despite the negative aspects of certification, it also offers many advantages to end-users, product suppliers and service professionals. When done well, and when the certification can be trusted, it saves all of us a lot of work. Despite its high cost, it actually saves money in the end.

Proven-in-Use is one of those topics in the industry that too many stakeholders try to take advantage of in order to cut corners when it comes to achieving functional safety. Proven-in-Use has a clear function for older devices and should only be used during a transition period. Unfortunately, some end-users and product developers see it as a quick solution to achieve functional safety, even with new devices, and want to apply it for the life of the device. This is not the purpose of Proven-in-Use.

Reliability data is a must in the functional safety business. We need to perform reliability calculations, whether for certification purposes, for Proven-in-Use purposes, or just for plant availability calculation purposes. Unfortunately, good quality data is still not available today; therefore, we should be very careful when interpreting the results in terms of functional safety.

Even though certification, Proven-in-Use and reliability data are good aspects of the functional safety industry, we still need to rely on our engineering brains and use common sense when applying them. We should stop blindly trusting reports or certificates addressing these topics and should begin reading them with great interest again. Even if, after examining these reports, we conclude that they do make sense, we should still only use them as one aspect of the decision-making process. It is much better to build a (safety) case that consists of many different viewpoints than to accept that a product is the best solution based on a single third-party statement.

The author

Michel Houtermans

Dr. Houtermans has an MSc. in mechanical engineering and a Ph.D. in safety and risk management (Eindhoven University of Technology, The Netherlands). At Factory Mutual Global he held the positions of research and project engineer. For TÜV he has held the positions of project engineer, project manager and department manager. Today he is the president of Risknowlogy.

Dr. Houtermans has over 15 years of experience in functional safety and has published numerous papers on risk, reliability and safety. He actively certifies products, loops, systems, people and organizations according to functional safety standards and audits safety management systems for internationally operating companies. Furthermore, he acts as an independent safety auditor for governments around the world and has served as an expert witness in safety-related court cases. Dr. Houtermans actively participates in the development of national and international safety standards including ISA S84, TR84, IEC 61508, and IEC 61511.

Risknowlogy GmbH
Dr. Michel Houtermans
Industriestrasse 47
6300 Zug
Switzerland
+41 41 511 2338
m.j.m.houtermans@risknowlogy.com

6. References

1. Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC 61508. IEC, Geneva, 1999
2. Functional safety - Safety instrumented systems for the process industry sector, IEC 61511. IEC, Geneva, 2003
3. Safety of machinery - Functional safety of safety-related electrical, electronic and programmable electronic control systems, IEC 62061. IEC, Geneva, 2005
4. Electrical apparatus for the detection and measurement of combustible or toxic gases or vapours or of oxygen. Requirements on the functional safety of fixed gas detection systems, EN 50402. 2005
5. The Merriam-Webster English Dictionary. Merriam-Webster; Revised edition, ISBN-13: 978-0877799306, July 2004
6. Houtermans, M.J.M., IEC 61508: An Introduction to the Safety Standard for End-Users. SISIS, Buenos Aires, 2003
7. Programmable controllers - Part 3: Programming languages, IEC 61131. IEC, Geneva, 2003
8. Safety requirements for electrical equipment for measurement, control, and laboratory use, IEC 61010. IEC, Geneva, 2001
9. Electromagnetic Compatibility - Generic Immunity Standard - Part 2: Industrial Environment, EN 50082-2, 1999
10. Electromagnetic compatibility (EMC). Generic standards. Emission standard for residential, commercial and light-industrial environments, EN 61000-6-3, 2007
11. Environmental testing, IEC 60068. IEC, Geneva, 1988
12. European Union website: http://ec.europa.eu/enterprise/newapproach/legislation/nb/notified_bodies.htm
13. Fire detection and fire alarm systems - Part 2: Control and indicating equipment, EN 54-2, 1997
14. Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC 61508. IEC, Geneva, new draft
15. Principles for computers in safety-related systems, DIN V VDE 0801
16. Sintef, Offshore Reliability Data Handbook 4th Edition, OREDA, 2002
17. CCPS, Guidelines for Process Equipment Reliability Data, with data tables. AIChE, ISBN 0-8169-0422-7
18. SINTEF, Reliability data for safety instrumented systems, PDS Data Handbook, 2006 Edition
19. Exida.com, SERH, 3rd Edition
20. MIL-HDBK-217, reliability prediction of electronic equipment
21. IEC, Reliability data handbook – Universal model for reliability prediction of electronics components, PCBs and equipment, IEC TR 62380. IEC, Geneva, 2004
22. ATT Labs, Telcordia Issue 2, Telcordia
23. Handbook of Reliability Prediction Procedures for Mechanical Equipment (NSWC-98/LE1), Naval Surface Warfare Center, 1992
24. System Reliability Center, PRISM software tool, Alion Science


Inside Publishing GmbH


www.insidepublishing.ch
FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
03

Within the TÜV Functional Safety Program:


White Paper
Achieving Plant Safety & Availability Through
Reliability Engineering and Data Collection

Date: 19 December 2006


Author(s): Dr. M.J.M. Houtermans, T. Vande Capelle, M. AL-Ghumgham

Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com

RISKNOWLOGY Experts in Risk, Reliability and Safety


White Paper
Dr. M.J.M. Houtermans, T. Vande Capelle, M. AL-Ghumgham
Achieving Plant Safety & Availability Through Reliability Engineering and Data Collection

© 2002 - 2007 Risknowlogy B.V.

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, and Functional Safety Data Sheet are registered service marks.


Achieving Plant Safety & Availability Through Reliability Engineering and Data Collection

Dr. M.J.M. Houtermans
Risknowlogy B.V., Brunssum, The Netherlands
T. Vande Capelle
HIMA Paul Hildebrandt GmbH + Co KG, Brühl, Germany
M. Al-Ghumgham
SAFCO, Jubail, Kingdom of Saudi Arabia

Abstract
Worldwide, chemical and other processing plants are trying to implement reliability programs to improve plant safety while maintaining plant availability. These programs can vary significantly in size and complexity. Any kind of reliability program, such as a preventive maintenance (PM) program, always consists of one or more reliability models and the reliability data needed to execute these models. Needless to say, the successful implementation and utilization of these reliability programs depends heavily on the accuracy of the reliability models and the availability of realistic data, or at least data that is as close to realistic as possible.
The objective of this paper is to give the reader a better understanding of the importance of reliability engineering, focusing on the collection of reliability data for the purpose of process availability, safety and preventive maintenance programs. The paper first explains what reliability engineering is and why it is important for processing plants to have a reliability program. Second, the paper gives an overview of different programs in use by companies today (preventive maintenance, risk based inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It explains the kind of reliability data needed and the currently available sources for this data. The focus is on the data available within the plant itself, which can be collected, among other ways, from the data archived in the DCS. An excellent time to collect data is during scheduled plant turnarounds. Based on the example of a control valve, the paper explains the role that humans can play in improving the administration of reliability data during a plant turnaround. The importance of failure analysis is also addressed. Finally, the paper concludes with an example of a decision support model that depends heavily on accurate reliability data. This example demonstrates how a plant owner can benefit from a practical application of reliability engineering.

1 Introduction
Worldwide, chemical and other processing plants are trying to implement reliability programs to improve plant safety while maintaining plant availability. These programs can vary significantly in size and complexity. Any kind of reliability program, such as a preventive maintenance (PM) program, always consists of one or more reliability models and the reliability data needed to execute these models. Needless to say, the successful implementation and utilization of these reliability programs depends heavily on the accuracy of the reliability models and the availability of realistic data, or at least data that is as close to realistic as possible.


At the start of any reliability data program, good data is usually missing. In that case companies depend on external data sources (e.g., handbooks, databases, and expert opinions) that do not necessarily represent the situation at their own plant. Data needs to be collected for each piece of equipment, device or instrument needed to operate the plant. Many companies observe that, during the first use of the reliability program, further fine-tuning of the collected data is needed because there is an offset between the current model and the actual situation observed in the plant. Once the lack of data or the uncertainty in the data starts to decrease, the models become more accurate and the companies start to reap the benefits of the implemented reliability programs. Plant availability and safety will both increase, more preventive maintenance will take place, and the total lifetime operating cost (TLOC) will decrease because of less unscheduled maintenance and fewer associated spurious trips of the plant.
The objective of this paper is to give the reader a better understanding of the importance of reliability engineering, focusing on the collection of reliability data for the purpose of process availability, safety and preventive maintenance programs. The paper first explains what reliability engineering is and why it is important for processing plants to have a reliability program. Second, the paper gives an overview of different programs in use by companies today (preventive maintenance, risk based inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It explains the kind of reliability data needed and the currently available sources for this data. The focus is on the data available within the plant itself, which can be collected, among other ways, from the data archived in the DCS. An excellent time to collect data is during scheduled plant turnarounds. Based on the example of a control valve, the paper explains the role that humans can play in improving the administration of reliability data during a plant turnaround. The importance of failure analysis is also addressed. Finally, the paper concludes with an example of a decision support model that depends heavily on accurate reliability data. This example demonstrates how a plant owner can benefit from a practical application of reliability engineering.

2 Reliability engineering
Reliability engineering plays an important but undervalued role in today’s processing plants around the world. Many companies might not realize it, but reliability engineering lies at the heart of total asset management, a popular buzzword in industry today. What total asset management entails is not yet clearly defined, but it incorporates elements such as reliability centered maintenance (RCM), total productive maintenance (TPM), design for maintainability, design for reliability, life cycle costing, loss prevention, probabilistic risk assessment and others. The objective of total asset management is to arrive at the optimum cost-benefit-risk asset solution to meet our desired production levels. In other words, how can we spend the least money on our plant and still meet our production targets while maintaining process availability and process safety? Many aspects are involved in achieving this, but when it comes to the hardware and software that we use in our plant, reliability engineering is the discipline to use.
Reliability engineering is a very broad discipline. It is practiced by engineers who design the hardware and/or software of individual products, but also by engineers who use these products and integrate them into larger systems. A reliability engineer in a plant has a task similar to that of a reliability engineer who is responsible for the design of a transmitter or valve. They apply similar techniques to perform their jobs, only on a different scale and with a different focus.
Reliability itself is defined as the probability that a product or system meets its specification over a given period of time [1]. The word specification is of course very broad, and a product might have several functions. One can calculate the reliability of each individual function or of all functions together that make up the specification. The term time can also be replaced by distance, cycles or other units as appropriate. In other words, it is very important to be clear when we talk about “reliability”, as it can have different meanings to different people and in different situations. In a plant we can calculate process availability, unavailability, probability of fail dangerous, fail safe, etc., which are all aspects of, and related to, reliability. In general, reliability deals with the probability of failure of components, products and systems and is therefore at the heart of disciplines like hazard and risk analysis, loss prevention, maintenance programs, quality assurance and so on.
Reliability engineering is thus the discipline of ensuring that a product or system will be reliable
when operated in a specified manner. It is performed throughout the entire life cycle of a product or
system, including design, development, test, manufacturing, operation, maintenance and repair. In
process plants it is often a staff function whose prime responsibility is to ensure that maintenance
techniques are effective, that equipment is designed and modified to improve maintainability, that
ongoing technical maintenance problems are investigated, and that appropriate corrective and
improvement actions are taken. But in reality it is much broader than that. Reliability engineering
deals with every aspect of a component or system: from making a reliable design, to reviewing
operating and maintenance procedures, or even to setting up a reliability data collection program. In many
plants reliability engineering is often also called maintenance engineering.

3 Overview of reliability programs


Reliability engineering plays a role in many well known reliability programs. Typical programs with which
reliability engineering is not always associated, but in which it is often an important pillar, are:
• (Probabilistic) risk assessment
• (Functional) safety assessment
• Condition based maintenance
• Preventive maintenance (PM)
• Risk based inspections (RBI)
• Reliability centered maintenance (RCM)

Probabilistic risk and safety assessment heavily depends on reliability engineering techniques and
theory. With risk assessments we try to establish the risk associated with operating a process plant.
Often risk assessment uses a "top-down" approach to establish and rank the risk of individual areas of
a plant and process equipment to eventually establish the risk associated with the complete facility or
process plant. Risk is defined as the combination of consequences and frequencies. We can only
determine the frequency of an event occurring if we know the individual probabilities for equipment
failure associated with that event. In order to be able to carry out a risk assessment we need to know
how often a pump fails, a valve sticks open or the instrument air is lost. Determining these
probabilities is the discipline of reliability engineering. Without proper failure rate data of equipment
we cannot establish a quantitative risk level.
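As a minimal sketch of how frequency and consequence can be combined, the fragment below multiplies an initiating event frequency by the probability that a protective device fails, and then by a consequence. Treating risk as the product of frequency and consequence is one common convention and is an assumption here; all names and numbers are hypothetical and do not come from this paper.

```python
# Hypothetical figures, for illustration only; none come from the paper.
initiating_events_per_year = 0.5   # e.g. loss of instrument air
p_protection_fails = 0.01          # probability that a protective device fails when needed
consequence_cost = 2_000_000.0     # monetary loss per hazardous event


def hazardous_event_frequency(initiating_frequency, p_fail):
    """Frequency of the hazardous event in events per year."""
    return initiating_frequency * p_fail


def risk(frequency, consequence):
    """Risk expressed as expected loss per year (frequency times consequence)."""
    return frequency * consequence


freq = hazardous_event_frequency(initiating_events_per_year, p_protection_fails)
annual_risk = risk(freq, consequence_cost)
print(f"{freq:.4f} events/year, expected loss {annual_risk:,.0f} per year")
```

The equipment failure probability is the only input here that reliability engineering has to supply, which is why the quality of the failure rate data drives the quality of the risk figure.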
When the risk level is established it can be that it is too high, and therefore needs to be reduced, or
that it is low enough, but needs to be maintained at that level. Standards like IEC 61508 [2] and IEC
61511 [3] are based on this concept. When we need to reduce the risk we can either reduce the
consequence or reduce the frequency of the hazardous event. We can reduce the frequency if we
implement a safety system. But this safety system needs to be reliable enough. We need to design a
safety system that is so reliable that it reduces the risk to a level where we can accept it again. This
means that we not only need to have reliable safety system components but also need to make an
appropriate safety system design to achieve overall reliability. In order to maintain our level of risk we
also need to maintain our process plant and safety system. This is why reliability engineering is often
called maintenance engineering. It is there to make sure that the assumptions we made during our risk and
safety assessments are maintained throughout the life of our facility. Being able to collect failure rate
data or predict failure behavior can help us in our maintenance strategy.
One program used for the prediction of failures is condition based maintenance or predictive
maintenance [5]. As its name implies, it means that we perform maintenance based on the condition
of the equipment subject to maintenance. We try to measure the condition of equipment in order to
assess whether it will fail during some future period. The objective is to avoid failure and thus we
either maintain or replace the product just in time. What actually needs to be monitored depends on
the equipment. It can mean, for example, that we measure particles in the lubrication oil of a gearbox
or that we apply statistical process control techniques and monitor the performance of
equipment. If we associate reliability theory with maintenance then we can try to probabilistically
predict when to perform maintenance. This is called reliability centered maintenance [5]. It is a
structured process, originally developed in the airline industry, which heavily depends on reliability
data and expert systems to interpret that data.
Condition based maintenance or reliability centered maintenance can still mean that we are too late
or that the maintenance occurs at an inconvenient time. In order to prevent this we can instead
implement preventive maintenance. The strategy of preventive maintenance is to replace or overhaul
a piece of equipment at a fixed interval, regardless of its condition at that time or the expected
probability of failure. It is purely based on time. Reliability and decision modeling can demonstrate that
it is often more cost effective to replace a piece of equipment before it has failed and at a scheduled
time than to wait until it fails unexpectedly. In this way replacements can be made at, for example,
scheduled plant shutdowns.
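The cost argument for preventive replacement can be illustrated with a classic age-replacement model. This is a swapped-in textbook technique, not a method taken from this paper: the item is assumed to wear out according to a Weibull distribution, and it is replaced either at failure or when it reaches a chosen age, whichever comes first. The shape, scale and cost values below are hypothetical.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that the item survives to age t (Weibull model, an assumption)."""
    return math.exp(-((t / scale) ** shape))


def mean_cycle_length(interval, shape, scale, steps=2000):
    """Trapezoidal estimate of the integral of R(t) from 0 to `interval`."""
    h = interval / steps
    total = 0.5 * (1.0 + weibull_reliability(interval, shape, scale))
    for k in range(1, steps):
        total += weibull_reliability(k * h, shape, scale)
    return total * h


def cost_rate_age_replacement(interval, shape, scale, cost_planned, cost_failure):
    """Expected cost per hour when replacing at failure or at age `interval`."""
    r = weibull_reliability(interval, shape, scale)
    expected_cost = cost_planned * r + cost_failure * (1.0 - r)
    return expected_cost / mean_cycle_length(interval, shape, scale)


def cost_rate_run_to_failure(shape, scale, cost_failure):
    """Expected cost per hour when the item is only replaced after it fails."""
    mttf = scale * math.gamma(1.0 + 1.0 / shape)
    return cost_failure / mttf


# Hypothetical inputs: wear-out behavior (shape > 1), failure ten times as costly.
SHAPE, SCALE = 3.0, 10_000.0
COST_PLANNED, COST_FAILURE = 1_000.0, 10_000.0

candidates = range(1_000, 10_001, 500)  # candidate replacement ages in hours
best_interval = min(
    candidates,
    key=lambda t: cost_rate_age_replacement(t, SHAPE, SCALE, COST_PLANNED, COST_FAILURE),
)
best_rate = cost_rate_age_replacement(best_interval, SHAPE, SCALE, COST_PLANNED, COST_FAILURE)
run_to_failure_rate = cost_rate_run_to_failure(SHAPE, SCALE, COST_FAILURE)
print(f"replace at {best_interval} h: {best_rate:.3f}/h, run to failure: {run_to_failure_rate:.3f}/h")
```

With a shape parameter of 1 (no wear-out) the advantage disappears, which matches the remark that no single program is best for every piece of equipment.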
There is not one program that is the best strategy for a plant. Most likely, different programs are
applied to different pieces of equipment. Some equipment lends itself perfectly to condition based
or reliability centered maintenance; for other equipment these do not fit at all and preventive
maintenance is more appropriate.

4 Reliability modeling
Reliability engineering heavily depends on probabilistic methods. In order to predict something,
whether it is the reliability of a piece of equipment or a complete process plant, we first need a
reliability model. There are many different techniques and methods developed over time that we can
use to make models. If we make models of (complex) systems for the purpose of prediction we
usually depend on one or more techniques like:
• Reliability block diagrams
• Fault trees
• Markov models
• Monte Carlo simulation

Other techniques exist as well but these are very common ones. Figure 1 shows a safety function
required to reduce the risk associated with high temperature in a vessel. In order to protect the vessel
against over temperature a safety system has been built with two temperature sensors connected via
two transmitters to a logic solver. The logic solver consists of an input board, a CPU board and an
output board. The input board utilizes two input channels while the output board utilizes three output
channels. These three output channels are required because we need to open two relays to stop two
pumps and to close one solenoid valve in order to open a drain valve.
For this system we can do all kinds of analyses, e.g., calculation of the probability of failing safe, the
probability of failing dangerous, the availability of the safety function, the unavailability of the process due
to spurious trips, the desired periodic proof test interval, optimization of maintenance strategies and
so on. In order to perform these analyses we need a reliability model of this safety function. Three
different reliability models have been created of the same function, i.e., a reliability block diagram, a
fault tree and a Markov model, represented in Figure 2, Figure 3 and Figure 4 respectively. In order to
actually perform calculations we need to fill the models with reliability data.

[Figure 1: the specification of the safety function reads "Measure the temperature in the reactor and if the temperature exceeds 65 °C then open the drain valve and stop the supply pumps to the reactor. This function needs to be carried out within 3 seconds and with safety integrity SIL 3". The hardware design is divided into sensing (T1 with TM 1, T2 with TM 2), logic solving (input channels I1 to I8, common circuitry, CPU, common circuitry, output channels O1 to O8) and actuating (R1 - Pump A, R2 - Pump B, SOV - Drain V).]

Figure 1 - From specification to hardware design of the safety instrumented system [6]

[Figure 2: two parallel input paths (T1 - TM 1 - I1 and T2 - TM 2 - I2) followed in series by CC, CPU, CC, O1, O2, O3, R1, R2, SOV and ESD SV. © Risknowlogy 2002-2005]

Figure 2 - Block diagram safety function [6]
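A block diagram such as the one in Figure 2 can be evaluated with two rules: blocks in series multiply their success probabilities, and redundant blocks in parallel multiply their failure probabilities. The sketch below applies these rules to the structure of Figure 2 under the assumption that all blocks fail independently. The success probabilities are hypothetical placeholders, not data from this paper.

```python
from math import prod


def series(*reliabilities):
    """All blocks must work: multiply the success probabilities."""
    return prod(reliabilities)


def parallel(*reliabilities):
    """At least one block must work: one minus the product of failure probabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical success probabilities per block over some mission time.
R = {
    "T": 0.98, "TM": 0.99, "I": 0.999,        # sensor, transmitter, input channel
    "CC": 0.9995, "CPU": 0.999, "O": 0.999,   # common circuitry, CPU, output channel
    "R": 0.995, "SOV": 0.99, "ESD_SV": 0.98,  # relay, solenoid valve, final valve
}

input_path = series(R["T"], R["TM"], R["I"])    # one path: T - TM - I
input_stage = parallel(input_path, input_path)  # two redundant input paths
logic_stage = series(R["CC"], R["CPU"], R["CC"])
output_stage = series(R["O"], R["O"], R["O"], R["R"], R["R"], R["SOV"], R["ESD_SV"])

r_function = series(input_stage, logic_stage, output_stage)
print(f"Reliability of the safety function: {r_function:.4f}")
```

In this sketch the redundant input stage ends up far more reliable than a single path, so the non-redundant output chain dominates the result.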

[Figure 3: fault tree with the top event "Safety Function Failed" and the branches "Input Failed" (Path 1 Failed: T1, Tm1, I1; Path 2 Failed: T2, Tm2, I2), "Logic Failed" (CC, CPU, CC) and "Output Failed" (O1, O2, O3, SOV, Drain Valve, R1, R2, R3).]

Figure 3 - Simplified fault tree diagram safety function [6]

[Figure 4: Markov model with the states "OK", "Path 1 Failed", "Path 2 Failed" and "System Failed".]

Figure 4 - Markov model safety function [6]
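The four states of Figure 4 can be turned into a small discrete-time Markov calculation. The sketch below is an illustration under stated assumptions: each input path fails at a constant rate, the shared logic and output part fails at a separate constant rate, and there is no repair or proof test. The rates and the one-hour time step are hypothetical and are not taken from this paper.

```python
LAMBDA_PATH = 2.0e-5    # hypothetical failure rate of one input path, per hour
LAMBDA_COMMON = 1.0e-6  # hypothetical failure rate of shared logic and output, per hour
DT = 1.0                # time step in hours
HOURS = 8760            # one year of operation

OK, PATH1_FAILED, PATH2_FAILED, SYSTEM_FAILED = range(4)


def build_transition_matrix(lam_path, lam_common, dt):
    """One-step transition probabilities; each row sums to 1."""
    a = lam_path * dt
    c = lam_common * dt
    return [
        [1.0 - 2.0 * a - c, a, a, c],    # OK: either path or the shared part fails
        [0.0, 1.0 - a - c, 0.0, a + c],  # path 1 failed: remaining path or shared part fails
        [0.0, 0.0, 1.0 - a - c, a + c],  # path 2 failed: remaining path or shared part fails
        [0.0, 0.0, 0.0, 1.0],            # system failed: absorbing without repair
    ]


def step(state, matrix):
    """Multiply the state probability vector by the transition matrix."""
    n = len(state)
    return [sum(state[i] * matrix[i][j] for i in range(n)) for j in range(n)]


matrix = build_transition_matrix(LAMBDA_PATH, LAMBDA_COMMON, DT)
state = [1.0, 0.0, 0.0, 0.0]  # start in the OK state
for _ in range(int(HOURS / DT)):
    state = step(state, matrix)

print(f"P(system failed after one year) = {state[SYSTEM_FAILED]:.4f}")
```

Repair and periodic proof testing would add transitions back toward the OK state; they are left out here to keep the sketch short.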

5 Reliability data and data collection


The reliability models in the previous section are useless if we cannot fill them with appropriate
reliability data. Not many companies have their own reliability database and collection program, and
they therefore need to depend on other sources for the data. It is very important, though, to utilize the
best possible data. If data is uncertain, i.e., we do not know whether the data is correct, the results
will be uncertain as well. In [7] Rouvroye demonstrates the effect of uncertain
data on the safety performance of a HIPPS installation. The probability of a dangerous failure on
demand, i.e., the probability that the safety function of the HIPPS cannot be carried out when
required, is shown in Figure 5. This figure shows that the results can be a factor of 10 better or worse,
which can have a significant impact in the safety world, where a factor of 10 means a difference of one
SIL [2]. Whether uncertain data really has an impact depends on the kind of problem we are trying to
solve. In general, the more accurate the data, the better the results will be. There are also
techniques available that allow us to determine the influence of uncertain data on the results. These
techniques fall under sensitivity analysis and make it possible to determine whether it is worth
spending time and resources on finding better data.
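One simple way to see the influence of uncertain data is a Monte Carlo experiment: sample the uncertain failure rate many times, push each sample through the model, and look at the spread of the results. The sketch below does this for a single channel using the widely used simplification PFDavg ≈ λ_DU × TI / 2. The lognormal distribution, its parameters and the one-year test interval are assumptions for illustration; this is not the calculation performed in [7].

```python
import math
import random

HOURS_PER_YEAR = 8760.0


def pfd_avg_1oo1(lambda_du, proof_test_interval_h):
    """Simplified average PFD of a single channel: lambda_DU * TI / 2."""
    return lambda_du * proof_test_interval_h / 2.0


def percentile(sorted_values, fraction):
    """Simple rank-based percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


rng = random.Random(42)      # fixed seed so the run is repeatable
MEDIAN_LAMBDA_DU = 1.0e-6    # hypothetical median dangerous undetected failure rate, per hour
SIGMA = 0.7                  # hypothetical spread of ln(lambda_DU)

samples = sorted(
    pfd_avg_1oo1(rng.lognormvariate(math.log(MEDIAN_LAMBDA_DU), SIGMA), HOURS_PER_YEAR)
    for _ in range(20_000)
)
p10, p50, p90 = (percentile(samples, f) for f in (0.10, 0.50, 0.90))
print(f"PFDavg 10th percentile {p10:.2e}, median {p50:.2e}, 90th percentile {p90:.2e}")
```

If the 10th and 90th percentiles fall in different SIL bands, better data is probably worth the effort; if they fall in the same band, it may not be.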

[Figure 5: probability of failure on demand (PFD) on a logarithmic axis from 1E-05 to 1E-01, plotted against time in hours (0 to 17472), with curves for the 10th percentile, the median and the 90th percentile.]

Figure 5 - Reliability calculations of a HIPPS with uncertain data, two different periodic proof test intervals and two periodic proof test coverages [7]
Basically the following data sources exist in industry:
• End user maintenance records
• Industry databases
• Reliability standards
• Handbooks
• Manufacturer data
• Documented reliability studies
• Expert opinions
• Published papers

The most preferred data is always the data from the plant itself. Usually this data is collected via
maintenance records. Your own data is the best data for obvious reasons. Look at it this way. When
two companies buy the same valve but one company uses the valve on an offshore platform in the
North Sea while the other company uses the same valve in a plant in the desert, then we cannot
expect both valves to have the same failure behavior. Not only do the environmental parameters
influence the failure behavior of a device, but so do its operational use and the maintenance strategy
of the company. Since no two companies are the same (and probably not even two factories within
a company are the same), their similar devices will not fail at the same rate either. Thus the best data is
the data you collect yourself.
If this kind of data is not available then the next best possible source is data from industry
databases. Figure 6 shows two industry databases or handbooks that can be used. One is the
OREDA [8] database and the other is the SINTEF [9] handbook. Both have collected reliability data
over time. OREDA holds reliability data collected from offshore companies in the North Sea and
the SINTEF handbook holds reliability data specifically for safety equipment. Several other databases
and handbooks exist in the world that can be utilized, but no matter who delivers the data it is
important to tailor the data so that it is useful for the applicable situation.

Figure 6 - Two examples of industry databases, OREDA [8] and SINTEF [9]
When collecting reliability data we need to make sure that we document the right information when
a piece of equipment fails. Basically we are interested in three types of information, i.e., the failure
rate, the failure modes, and the repair times of a device. Unfortunately a lot of the maintenance records
that we use are not suitable for reliability data collection, as the desired information is not recorded or
is recorded in a way that it cannot be used. It is very important that we get an overview of how often a
device fails and how it has failed. Before we can document that information we first need to be clear
about the function of that device.
For example the function of an ESD valve is to close upon demand. This valve can have the
following general failure modes:
• Stuck open
• Stuck closed
• Stuck in position
• Leakage

A control valve has a different function than an ESD valve. For a control valve we might be
interested in failure modes like:
• Moves too fast
• Moves too slow
• Stuck in position
• Leakage

What the real meaning of a failure mode is can only be determined when we understand the failure
mode in the larger context of the plant. We need to understand what the functionality of a device is
when it is used. Consider the following two valves. One valve controls the flow of an inlet pipe of a
vessel, while the other valve is a drain valve for the same vessel. The valve on the inlet pipe is
normally open and should close upon demand. The drain valve is normally closed and should open
upon demand. Both valves have the same failure modes, such as stuck open, stuck closed, stuck in position
or leakage. But the effects of these failure modes are the opposite for the two valves. Thus, it is
important to understand the function of a device on device level, and on system level, in order to be
able to properly document the failure behavior.
Collecting failure rates can also be done on different levels but the only correct level is the
failure mode level. In practice the maintenance department should track, for each type of device, the number
of devices installed, the operating hours of each device and the times at which a device has failed. This
information, in combination with the failure modes, allows us to calculate the failure rate per failure mode
and that is exactly what we need for our reliability models.
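As a minimal sketch of this bookkeeping, the failure rate per failure mode can be estimated as the number of failures observed in that mode divided by the accumulated operating hours of all devices of that type. The tag names, hours and failures below are hypothetical, and the estimate assumes a constant failure rate.

```python
from collections import Counter

# Hypothetical maintenance records for one device type.
operating_hours = {"XV-101": 25_000, "XV-102": 25_000, "XV-103": 25_000, "XV-104": 25_000}
failure_log = [
    ("XV-101", "stuck open"),
    ("XV-103", "leakage"),
    ("XV-104", "stuck open"),
]


def failure_rate_per_mode(hours_by_device, failures):
    """Point estimate per mode: failures in that mode / total operating hours."""
    total_hours = sum(hours_by_device.values())
    counts = Counter(mode for _device, mode in failures)
    return {mode: n / total_hours for mode, n in counts.items()}


rates = failure_rate_per_mode(operating_hours, failure_log)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.1e} /h")
```

With only a handful of failures the estimate is very uncertain, which is one reason to keep collecting data over the life of the plant.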
For each device we should basically collect the failure rates per failure mode but in practice many
companies do not have this kind of information. Often they need to work the other way around in
order to determine the failure rates per mode. Consider the safety industry where they are only
interested in 4 different failure modes [11]:
• Safe detected
• Safe undetected
• Dangerous detected
• Dangerous undetected

Only electronic devices can benefit from diagnostics and have detected failure modes. A partial
stroke test is not a diagnostic test, as diagnostic tests are defined as frequent tests that run fully
automatically [10]. Most partial stroke setups require human interaction, though. Therefore it is in most
cases not possible to define safe detected and dangerous detected failure modes for mechanical
devices like valves, only undetected failure modes. This also makes sense. When a valve is stuck
open and one performs a partial stroke test once every 6 months, then potentially we do not know about
this stuck-at failure for 6 months. Detecting this failure with a partial stroke test after 6 months is good,
but too slow to take much advantage of it. Therefore only devices that have built-in tests which run
automatically and frequently are useful, as we can act upon a failure immediately and do something
about it.
If we have the following information available then we can calculate the four failure rates as
desired:
• The overall failure rate of a device (λ); this includes all failures of the device regardless of
their failure mode
• The safe ratio of the failures (SR), i.e., the fraction of all failures of a device that are safe
failures
• The safe diagnostic coverage of a device (SDC), i.e., the percentage of all safe failures that
can be detected through diagnostic tests
• The dangerous diagnostic coverage of a device (DDC), i.e., the percentage of all dangerous
failures that can be detected through diagnostic tests

Consider the following example where we can calculate the four failure rates important in the safety
industry from the following basic data:
• λ = 5.5E-6 /h
• SR = 80%
• SDC = 90%
• DDC = 90%

This basic data results in the following four failure rates:
• Safe detected failure rate λ_sd = 5.5E-6 x 0.8 x 0.9 = 3.96E-6 /h
• Safe undetected failure rate λ_su = 5.5E-6 x 0.8 x (1.0 - 0.9) = 0.44E-6 /h
• Dangerous detected failure rate λ_dd = 5.5E-6 x (1.0 - 0.8) x 0.9 = 0.99E-6 /h
• Dangerous undetected failure rate λ_du = 5.5E-6 x (1.0 - 0.8) x (1.0 - 0.9) = 0.11E-6 /h
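The four products in this example can be wrapped in a small helper that reproduces the numbers. The function and variable names are illustrative choices, not part of any standard or tool.

```python
def split_failure_rate(lam, safe_ratio, sdc, ddc):
    """Split an overall failure rate into the four safety-related failure rates."""
    lam_safe = lam * safe_ratio
    lam_dangerous = lam * (1.0 - safe_ratio)
    return {
        "sd": lam_safe * sdc,               # safe detected
        "su": lam_safe * (1.0 - sdc),       # safe undetected
        "dd": lam_dangerous * ddc,          # dangerous detected
        "du": lam_dangerous * (1.0 - ddc),  # dangerous undetected
    }


# Values from the example in the text: lambda = 5.5E-6 /h, SR = 80%, SDC = DDC = 90%.
rates = split_failure_rate(5.5e-6, 0.80, 0.90, 0.90)
for mode, value in rates.items():
    print(f"lambda_{mode} = {value:.2e} /h")
```

By construction the four rates always add up to the overall failure rate again, which is a useful sanity check on any data sheet.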

These are the kinds of calculations that companies make when they do not have their own reliability
data. In reality one does not get the information in this form, as the maintenance department collects
information at failure mode level. If you have the failure rate information at failure mode level it is possible to
calculate the overall failure rate, the safe ratio and the safe and dangerous diagnostic coverage
factors. More and more suppliers of devices are providing end-users with this kind of detailed product
information, though. Consider the functional safety data sheet© in Figure 7, where this basic failure
rate information was used to also calculate factors like the safe failure fraction, the MTTFsafe,
MTTFdangerous, etc.
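Going the other way, from failure rates per mode back to the summary factors, can be sketched as follows. The safe failure fraction is computed as the share of failures that are either safe or dangerous but detected, and the MTTF values as the reciprocal of the corresponding failure rate, which assumes constant failure rates. How the data sheet in Figure 7 defines these factors is not reproduced here, so treat the formulas as common conventions rather than as that sheet's method.

```python
def summarize(sd, su, dd, du):
    """Derive summary factors from the four failure rates (all in failures per hour)."""
    total = sd + su + dd + du
    safe = sd + su
    dangerous = dd + du
    return {
        "lambda": total,
        "SR": safe / total,          # safe ratio
        "SDC": sd / safe,            # safe diagnostic coverage
        "DDC": dd / dangerous,       # dangerous diagnostic coverage
        "SFF": (safe + dd) / total,  # safe failure fraction
        "MTTF_safe_h": 1.0 / safe,
        "MTTF_dangerous_h": 1.0 / dangerous,
    }


# The four failure rates from the worked example in the text.
summary = summarize(sd=3.96e-6, su=0.44e-6, dd=0.99e-6, du=0.11e-6)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```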

Figure 7 - Functional safety data sheet© with basic reliability data

The only reliability data still missing in order to make the model complete is repair data and proof
test data. Many product suppliers make statements about how long it takes to repair their transmitter
or valve, and often this is considered to be 8 or 24 hours. In practice only the end-user knows how
long it will take to repair a particular device. It depends on many different factors. For example, is the
failed device in stock or not? If we do not have it in stock, how long does it take to order it and to have
it shipped to the desired location? If we do have it in stock, then how long does it actually take to
replace it? Do we have only one repair crew or do we have multiple repair crews available? In our
model we can make the assumption that something takes only 8 hours to repair, but if in reality it takes
30 days to repair then our calculation results are not worth much. The closer we can bring our
model to practice, the more useful reliability engineering will be.
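The effect of an optimistic repair time assumption can be made visible with the steady-state unavailability of a single repairable device, MTTR / (MTTF + MTTR). The sketch below assumes a constant failure rate and immediate detection of the failure; it reuses the overall failure rate of 5.5E-6 /h from the earlier example purely as an illustrative input.

```python
def steady_state_unavailability(failure_rate_per_h, mttr_h):
    """Long-run fraction of time a repairable device is down: MTTR / (MTTF + MTTR)."""
    mttf_h = 1.0 / failure_rate_per_h
    return mttr_h / (mttf_h + mttr_h)


FAILURE_RATE = 5.5e-6  # per hour, illustrative input

u_8h = steady_state_unavailability(FAILURE_RATE, 8.0)         # assumed repair time
u_30d = steady_state_unavailability(FAILURE_RATE, 30 * 24.0)  # repair time in practice
print(f"8 h repair: {u_8h:.2e}, 30 day repair: {u_30d:.2e}, ratio {u_30d / u_8h:.0f}")
```

The unavailability scales almost linearly with the repair time, so a model built on 8 hours underestimates the downtime by nearly two orders of magnitude when the real figure is 30 days.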

5.1 Reliability Practiced Program


Companies collect reliability data in many different ways and where possible they try to automate the
data collection process. Normally reliability-centered maintenance programs work from an offline
database, which is developed for chemical equipment plants. Reliability engineers feed the plant
specific static data into the database and use it for RCM modeling. The more data becomes available,
the closer the reliability model gets to the actual situation in the field. Fortunately, these
days many chemical plants are operated through DCS controls and a lot of the plant data is archived in
the DCS and/or the plant information system. Unfortunately, the collected data is often not utilized or
transmitted to the RCM database, which limits its application and utilization.
In addition, although data on plant hardwired failures is required to construct RCM programs, it is observed
that chemical plants have a tendency to avoid component failure analysis. Unfortunately the
maintenance role is often solely responsible for replacing failed components to assure plant
availability. In order to have better feedback and results from a reliability centered maintenance
program it is of utmost importance that root cause analysis of failed components is practiced more
often. This would lead to better data and thus improve the overall maintenance and repair strategy.

5.2 Component Failure Analysis

Many plant operators have established a methodology or procedure for repairing or
replacing failed components in the plant. The following is an example of such a
procedure [11,12]. The purpose of this procedure is to provide a methodology to
analyze component failures that happen in a plant and its systems. It is assumed that
component failures occur in random fashion, which makes failure occurrence a difficult,
complex process to predict. The objective of this procedure is to document available data
on past failures and to increase our knowledge of failures. This will certainly help to
predict, and may even prevent, future failures and thus incidents. Success can only be
achieved if the maintenance, operations and engineering departments work together
when applying this policy. The benefit of this procedure is that it establishes a follow-up
system with an objective approach to question and demand close
coordination to implement incident recommendations.

1. The failed component in a system “X” shall be isolated and
removed. If the failure has caused a plant shutdown, then a new certified
component shall be installed to manage and speed up plant start-up activities.
The new component certificate shall be issued by the OEM (Original Equipment
Manufacturer).
2. The failed component shall be clearly tagged. The maintenance engineer shall
record the component's ID, its function, the failure description, and the physical
and environmental state. A sample form is shown in Figure 8. Then, it shall be
shipped to the OEM by the maintenance department. This is required to analyze
the faulty component and issue the final failure report.
3. The maintenance department shall issue a copy of the inspection report to operations
and engineering.
4. Operations shall call for an overview meeting to discuss the failure report and
prepare an action plan if the inspection report contains serious
recommendations.
5. Examples of component failures can be tabulated for the past two years. This
table can be updated every 6 months, and maintenance will review and insert the
correct updated information, with a target date to complete each activity.
6. Random regular site visits shall be made to ensure that regular preventive
maintenance has been done. The team should be led by operations.

COMPONENT SYSTEM FAILURE ANALYSIS - REQUEST FORM

Plant / Unit                 ______________________________________________
Engineer                     ______________________________________________
Equipment ID                 ______________________________________________
Equipment Function           ______________________________________________
Time of Failure              ______________________________________________
Fault Description            ______________________________________________
                             ______________________________________________
Unit Operations Approval     ______________  __________  _________________
                             Name            ID          Signature
OEM Address                  Name:      ___________________________________
                             Email:     ___________________________________
                             Fax / Tel: ___________________________________
Surrounding Status Record    ______________________________________________
                             ______________________________________________

Figure 8 - Sample Form Maintenance Record
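If a form like the one in Figure 8 is captured electronically, a simple record type with a completeness check helps to make sure the information needed for reliability data collection is actually recorded. The sketch below is hypothetical: the field names follow the form, the `failure_mode` field is an addition suggested by the earlier discussion of failure modes, and the example values are invented.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class FailureRecord:
    plant_unit: str
    engineer: str
    equipment_id: str
    equipment_function: str
    time_of_failure: datetime
    fault_description: str
    failure_mode: str        # e.g. "stuck open"; not on the form, added for data collection
    surrounding_status: str

    def missing_fields(self):
        """Names of text fields that were left empty."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str) and not getattr(self, f.name).strip()
        ]


record = FailureRecord(
    plant_unit="Unit 100",
    engineer="J. Doe",
    equipment_id="XV-101",
    equipment_function="Close inlet line on demand",
    time_of_failure=datetime(2006, 6, 1, 14, 30),
    fault_description="Valve did not move during test",
    failure_mode="stuck open",
    surrounding_status="",
)
print(record.missing_fields())  # ['surrounding_status']
```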

5.3 Human Interaction & Control Valve Example


In many plants, human interaction plays a major role in daily job planning as well as in major plant
turnaround jobs. A plant turnaround is normally scheduled every three years. For every
plant turnaround there is a defined list of jobs that are being planned, and people accordingly plan
material and resources to accomplish these big tasks. Therefore, the human factor will play a main
role in organizing job resources and administrative follow-up, which will lead to an optimized execution
for all jobs, respecting cost and time constraints. An RCM program alone cannot be used to manage
plant turnaround jobs, but integrating the human role into this process will facilitate it.
There are some further questions that need to be asked, and in order to make the typical questions clear,
we will use a control valve example to illustrate the approach.
For preventive maintenance (PM) of control valves, we can ask many questions, such as:
• What is the frequency of PM for critical valves?
• Is this a critical control valve? Is this an emergency shutdown valve? Is this a high pressure
service valve? Is this valve fully open, closed, or regulating?
• How many items require checking in the control valve? (We can have many accessories
in a control valve, such as solenoid valve, positioner, regulator set, booster set, I/P unit, valve
internals.)
• What sort of tests are required as part of the PM process? (Leak test based on valve leakage
class, and hydro test based on line pressure rating.)
• Is there any certificate check for major accessory items such as the solenoid valve, positioner and
I/P?
• Is there any internal component to be replaced, such as plug, seat, and valve soft kit?
• Is there any outside component that shall be replaced (such as solenoid valve, I/P, pneumatic
set, diaphragm, etc.) based on plant standards or applied practices?
• Is there any certificate to be issued for each small component whose failure can cause the
valve to malfunction?
• Is there any bypass line on the valve that can help maintenance to do PM for the valve when
the plant is running?

As one can see, after establishing answers to these questions, and others, we can take a closer
look at real reliability models and see how, through an audit and validation process, we can enhance
the RCM models in the real world.

6 Practical Example of How to Benefit from Reliability Engineering
The following is an example of how end-users can benefit from reliability engineering. In this example
the question is whether a plant should use a single pressure transmitter (1oo1) for a certain
application, or whether it would benefit from redundant (1oo2) or even triple redundant (2oo3) pressure
sensors. In particular, the question was whether there would be any financial benefit in using more than one
pressure sensor. A risk based methodology was used to determine the scenario cost associated with
the three pressure sensor architectures.
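Before looking at the detailed model, the basic trade-off between the three architectures can be sketched with simple voting logic. Assuming independent, identical sensors and ignoring common cause failures, diagnostics and repair (all simplifications that the Markov models used later do not make), the probability of a dangerous subsystem failure and of a spurious trip follows directly from the per-sensor probabilities. The value 0.1 used below is hypothetical.

```python
def dangerous_failure_probability(architecture, p):
    """Probability that the subsystem cannot trip, given per-sensor dangerous failure probability p."""
    if architecture == "1oo1":
        return p
    if architecture == "1oo2":
        return p ** 2                    # both sensors must fail dangerous
    if architecture == "2oo3":
        return 3 * p ** 2 - 2 * p ** 3   # at least two of three fail dangerous
    raise ValueError(f"unknown architecture: {architecture}")


def spurious_trip_probability(architecture, q):
    """Probability of a spurious trip, given per-sensor safe failure probability q."""
    if architecture == "1oo1":
        return q
    if architecture == "1oo2":
        return 1.0 - (1.0 - q) ** 2      # either sensor failing safe trips the plant
    if architecture == "2oo3":
        return 3 * q ** 2 - 2 * q ** 3   # at least two of three fail safe
    raise ValueError(f"unknown architecture: {architecture}")


P = 0.1  # hypothetical per-sensor probability, used for both failure directions
for arch in ("1oo1", "1oo2", "2oo3"):
    print(arch, dangerous_failure_probability(arch, P), spurious_trip_probability(arch, P))
```

The 1oo2 arrangement is the safest but trips most often; 2oo3 improves both safety and availability compared with a single sensor, which is why the cost of trips and accidents decides between them.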
It is assumed that a sensor can fail in four ways:
• Safe detected
• Safe undetected
• Dangerous detected
• Dangerous undetected

The failure rates for each of these possible failure modes are calculated with the values from Table 1.

Table 1 – Reliability data sensor
Parameter                             Value
Overall failure rate of the sensor    8.6E-6 / h
Safe ratio                            50%
Safe detected diagnostics             25%
Dangerous detected diagnostics        25%
Common cause                          5%
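For the redundant architectures the common cause percentage matters, because a common cause failure defeats all sensors at once. One widely used way to handle this is the beta-factor model, in which a fraction β of a failure rate is treated as common to all channels. Reading the 5% in Table 1 as such a β is an assumption made for this sketch; the paper does not state which common cause model was used.

```python
LAMBDA_TOTAL = 8.6e-6  # overall failure rate of the sensor, per hour (Table 1)
SAFE_RATIO = 0.50      # Table 1
DDC = 0.25             # dangerous detected diagnostics (Table 1)
BETA = 0.05            # common cause, interpreted here as a beta factor (assumption)

# Dangerous undetected failure rate of one sensor.
lambda_du = LAMBDA_TOTAL * (1.0 - SAFE_RATIO) * (1.0 - DDC)


def beta_split(rate, beta):
    """Split a failure rate into an independent part and a common cause part."""
    return (1.0 - beta) * rate, beta * rate


lambda_du_independent, lambda_du_common = beta_split(lambda_du, BETA)
print(f"lambda_du = {lambda_du:.3e} /h")
print(f"independent = {lambda_du_independent:.3e} /h, common cause = {lambda_du_common:.3e} /h")
```

In a 1oo2 or 2oo3 arrangement the independent part is reduced by the redundancy, while the common cause part acts like a single non-redundant sensor.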

Table 2 shows data related to the process. The mission time is the time the pressure sensors are
operated. The periodic test interval is the time between periodic proof tests. The periodic test
coverage represents the percentage of failures that can be detected. The demand rate represents the
number of demands that come from the process. A demand means that the safety function needs to
be carried out and thus needs to be available (the pressure sensor needs to work in order to carry out
the safety function).

Table 2 – Process data
Parameter                 Value
Mission time              10 years
Periodic test interval    6 months
Periodic test coverage    100%
Process demand rate       6 per year

The financial data from Table 3 is used to estimate the cost associated with three different sensor
architectures.
Table 3 – Financial data
Parameter                                             Value
Cost sensor*                                          $ 5000.00 / sensor
Cost associated with a spurious trip of the plant**   $ 1,000,000.00 / trip
Cost associated with an accident**                    $ 15,000,000.00 / accident
* These costs include all costs of the sensor, including installation, repair cost, etc.
** These costs include all costs associated with the event (repair, production loss, etc.).

The three different architectures all have the same possible operating modes or failure scenarios.
These scenarios are:
• Operational – The sensor subsystem has no failures that affect the measurement of the
sensor
• Trip – The sensor subsystem has failed in a way that the associated logic solver (DCS or
safety PLC) can only decide to trip the process
• Dangerous – The sensor subsystem has failed in a way that the associated logic solver (DCS
or safety PLC) cannot take any action when demanded from the process.

For each of these scenarios the probability of occurrence is calculated. The
probabilities are calculated using the Markov modeling technique [12]. For
all architectures Markov models are created that allow us to calculate the probabilities associated with
these scenarios. Once the probabilities for each scenario are known we can calculate the cost
associated with each scenario. The expected cost over the mission time for each sensor subsystem is then
the sum of the costs of the scenarios.
Based on these assumptions, the results are presented in Figure 9 and Figure 10. Figure 9 shows the
results over a mission time of 10 years without periodic proof testing. Figure 10 shows the same
model with a periodic proof test performed every 6 months. The results are based on the
probability-weighted scenario cost. For each sensor subsystem we calculate the probability that the
subsystem:
• is operational;
• has caused a plant trip;
• has failed dangerous.
As the subsystem must be in one of these three states at all times, the three probabilities add up
to 1. Please note that the results apply only under these assumptions, which were specific to this
particular customer. With proof testing, the results clearly favor the 2oo3 sensor architecture, and
the pressure sensor system clearly benefits from a scheduled proof test every 6 months. The
probability-weighted scenario costs for the three architectures with proof testing are:
• 1oo1 subsystem: $1,185,421.60;
• 1oo2 subsystem: $103,572.38;
• 2oo3 subsystem: $60,792.63.
In all three cases the dangerous scenario contributes the most to the overall weighted cost. This is
due to the 6 demands per year and the high cost associated with a possible accident. Periodic proof
testing improves the system significantly, as a comparison of Figure 9 with Figure 10 shows. The
overall improvement achieved by periodic proof testing is summarized in Table 4.
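
The weighting behind Figure 9 and Figure 10 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' decision model: the function and constant names are illustrative, and multiplying the dangerous scenario by the demand rate of 6 per year is an inference from the printed totals rather than something the paper states as a formula. The trip cost is the $100,000 business interruption value that appears in the figures.

```python
# Minimal sketch of the probability-weighted scenario cost (names are illustrative).
SENSOR_COST = 5_000.0          # cost per sensor, from Table 3
TRIP_COST = 100_000.0          # business interruption cost of a trip as used in Figures 9 and 10
ACCIDENT_COST = 15_000_000.0   # cost of an accident, from Table 3
DEMANDS_PER_YEAR = 6           # from Table 2; assumed to scale the dangerous scenario


def weighted_cost(n_sensors, p_operational, p_trip, p_dangerous):
    """Return the sum of the probability-weighted costs of the three scenarios."""
    investment = n_sensors * SENSOR_COST
    operational = investment * p_operational
    trip = (investment + TRIP_COST) * p_trip
    # Assumption: the printed totals are reproduced only when the dangerous
    # scenario is multiplied by the number of demands per year.
    dangerous = (investment + ACCIDENT_COST) * p_dangerous * DEMANDS_PER_YEAR
    return operational + trip + dangerous


# Scenario probabilities with proof testing, as printed (rounded) in Figure 10.
cost_1oo1 = weighted_cost(1, 0.98688, 0.00000, 0.01311)
cost_1oo2 = weighted_cost(2, 0.99895, 0.00001, 0.00104)
cost_2oo3 = weighted_cost(3, 0.99949, 0.00000, 0.00051)
```

Because the probabilities in the figures are rounded to five decimals, the computed totals agree with the printed ones only to within a fraction of a percent.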


Installed        Failure       Initial      Business              Total cost        Scenario      Probability-weighted
subsystem        mode          investment   interruption cost                       probability   scenario cost

1oo1 subsystem   Operational   $5,000.00    $0.00                 $5,000.00         0.76632       $3,831.62
                 Trip          $5,000.00    $100,000.00           $105,000.00       0.00006       $5.87
                 Dangerous     $5,000.00    $15,000,000.00        $15,005,000.00    0.23362       $21,032,862.62
                 Total                                                              1.00000       $21,036,700.10

1oo2 subsystem   Operational   $10,000.00   $0.00                 $10,000.00        0.93564       $9,356.35
                 Trip          $10,000.00   $100,000.00           $110,000.00       0.00012       $12.71
                 Dangerous     $10,000.00   $15,000,000.00        $15,010,000.00    0.06425       $5,786,273.05
                 Total                                                              1.00000       $5,795,642.11

2oo3 subsystem   Operational   $15,000.00   $0.00                 $15,000.00        0.87195       $13,079.24
                 Trip          $15,000.00   $100,000.00           $115,000.00       0.00000       $0.18
                 Dangerous     $15,000.00   $15,000,000.00        $15,015,000.00    0.12805       $11,535,745.22
                 Total                                                              1.00000       $11,548,824.64

Figure 9 – Decision model result without periodic proof testing

Installed        Failure       Initial      Business              Total cost        Scenario      Probability-weighted
subsystem        mode          investment   interruption cost                       probability   scenario cost

1oo1 subsystem   Operational   $5,000.00    $0.00                 $5,000.00         0.98688       $4,934.42
                 Trip          $5,000.00    $100,000.00           $105,000.00       0.00000       $0.31
                 Dangerous     $5,000.00    $15,000,000.00        $15,005,000.00    0.01311       $1,180,486.86
                 Total                                                              1.00000       $1,185,421.60

1oo2 subsystem   Operational   $10,000.00   $0.00                 $10,000.00        0.99895       $9,989.55
                 Trip          $10,000.00   $100,000.00           $110,000.00       0.00001       $0.68
                 Dangerous     $10,000.00   $15,000,000.00        $15,010,000.00    0.00104       $93,582.16
                 Total                                                              1.00000       $103,572.38

2oo3 subsystem   Operational   $15,000.00   $0.00                 $15,000.00        0.99949       $14,992.37
                 Trip          $15,000.00   $100,000.00           $115,000.00       0.00000       $0.00
                 Dangerous     $15,000.00   $15,000,000.00        $15,015,000.00    0.00051       $45,800.25
                 Total                                                              1.00000       $60,792.63

Figure 10 – Decision model result with periodic proof testing

Table 4 – How the periodic proof test improves the system significantly
Architecture   Without proof test   With proof test   Improvement factor
1oo1           $21.0 Mil            $1.2 Mil          17.5
1oo2           $5.8 Mil             $103 k            58
2oo3           $11.5 Mil            $60 k             191

7 Conclusions
This paper has addressed plant safety and availability through the lens of reliability engineering and
reliability data collection. The paper explained what reliability engineering is, how reliability models
can be made and what kind of data needs to be collected. It demonstrated through practical examples
how reliability data can be collected, what problems may arise and how plants can benefit from good
reliability data.

8 References
1. Bently, J.P., An Introduction to Reliability & Quality Engineering. John Wiley & Sons,
ISBN 0-582-08970-0, 1993
2. IEC, Functional safety for electrical / electronic / programmable electronic safety-related
systems. IEC 61508, IEC, Geneva, 1999
3. IEC, Functional safety: safety instrumented systems for the process industry sector.
IEC 61511, IEC, Geneva, 2003
4. Condition based maintenance
5. Moubray J., Reliability Centered Maintenance, 2nd Edition, ISBN: 0831130784, April
1997
6. Vande Capelle, T., Houtermans, M.J.M., Functional Safety For End-users and system
integrators,
7. Rouvroye, J.L., et al., Uncertainty in safety, New Techniques For The Assessment And
Optimisation Of Safety In Process Industry. American Society for Mechanical
Engineers, 1994
8. Det Norske Veritas, OREDA, Offshore Reliability Data, 2nd Edition, ISBN 82 515 0188 1,
1992
9. SINTEF, Reliability data for safety instrumented systems. PDS Data handbook, 2004
Edition. SINTEF, September 2004.
10. Velten-Philipp, W., Houtermans, M.J.M., The effect of diagnostic and proof testing on
safety related systems. Control 2006, Glasgow, Scotland, 30 August – 1 September,
2006
11. Houtermans M.J.M, IEC 61508: An Introduction to Safety Standard for End-Users.
SISIS 2004, Buenos Aires, Argentina, September 2004
12. Billinton R., Allan R.N., Reliability Evaluation of Engineering Systems, Concepts and
Techniques. Pitman Books Limited, London, 1983.
13. Al-Ghumgham M.A., “On A Neural Network-Based Fault Detection Algorithm”; Chapter 4 of
Master Research Thesis in fulfillment of Master Degree program for Control Systems
Engineering, KFUPM 1992.
14. Al-Ghumgham M.A., Angelito Hermoso, Humaidi M.A., “Safety and reliability: Two faces of a
coin for Ammonia Plant ESD System”; ISA EXPO 2005, Chicago, USA.



FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
04

Within the TÜV Functional Safety Program:


White Paper
Safety Availability Versus Process Availability
Introducing Spurious Trip Levels™

Date: 25 May 2006


Author(s): Dr. M.J.M. Houtermans

RISKNOWLOGY B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com

RISKNOWLOGY Experts in Risk, Reliability and Safety


White Paper
Dr. M.J.M. Houtermans
Safety Availability Versus Process Availability

© 2002 - 2007 Risknowlogy

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, functional safety data sheet, and spurious trip level are registered service marks.


Safety Availability Versus Process Availability


Introducing Spurious Trip Levels™

Dr. Michel J.M. Houtermans 1


Risknowlogy B.V., Brunssum, The Netherlands

1 Introduction
The functional safety industry is driven by the international standards IEC 61508 [1] and IEC 61511
[2]. These standards describe performance levels for safety functions and the devices and systems
that carry out these safety functions. This performance is expressed as the so called safety integrity
level (SIL). In practice there are four levels, SIL 1-4. The required SIL level is directly derived from the
process which needs to be protected with a safety function of certain safety integrity. The more
dangerous the process the more safety integrity is required for the safety function.
The SIL level is a measure of the qualitative and quantitative performance of the safety function.
The higher the SIL level, the more difficult it is for a product supplier to design and manufacture the
safety device, and the more difficult it is for end-users and system integrators to integrate safety
devices from different manufacturers into a complete safety system. The higher the SIL level, the more
safety has been or needs to be built into the devices and systems.
The quantitative part of the SIL level is expressed as the probability of failure on demand. This
means that we need to calculate the probability that the safety function cannot be carried out in case
of a demand from the process. In other words how likely is it that the safety function does not work
when we require it to work? The higher the SIL level the more likely it is that the safety function works.
Besides demand mode functions, the IEC 61511 standard also refers to continuous mode
functions. Unlike demand mode functions, these safety functions have a direct impact
on the process when an internal failure occurs. The failure measure of continuous mode functions is
therefore calculated per hour (PFH) and not per demand, see Figure 1.

SIL   Demand mode                                  Continuous mode
      PFDavg              Risk reduction           PFH
4     ≥10^-5 to <10^-4    >10,000 to ≤100,000      ≥10^-9 to <10^-8
3     ≥10^-4 to <10^-3    >1000 to ≤10,000         ≥10^-8 to <10^-7
2     ≥10^-3 to <10^-2    >100 to ≤1000            ≥10^-7 to <10^-6
1     ≥10^-2 to <10^-1    >10 to ≤100              ≥10^-6 to <10^-5

Figure 1 – Safety Integrity Levels – A Measure of Safety Availability
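
As an illustration of how the demand mode column of Figure 1 is read, the following Python sketch looks up the SIL band for a given PFDavg and derives the corresponding risk reduction as the reciprocal of the PFDavg. The function names are illustrative, and the lookup covers only the quantitative PFDavg criterion; it says nothing about the qualitative requirements or architectural constraints that also limit the achievable SIL.

```python
# Demand mode bands of Figure 1: SIL -> (lower limit inclusive, upper limit exclusive).
SIL_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}


def sil_from_pfdavg(pfd_avg):
    """Return the SIL whose PFDavg band contains pfd_avg, or None if no band does."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd_avg < high:
            return sil
    return None  # outside the range covered by Figure 1


def risk_reduction(pfd_avg):
    """Risk reduction is the reciprocal of the average probability of failure on demand."""
    return 1.0 / pfd_avg
```

For example, a PFDavg of 1.86e-3 falls in the SIL 2 band and corresponds to a risk reduction of roughly 540.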

1
Corresponding author: m.j.m.houtermans@risknowlogy.com


2 End-users need safety availability and process availability


The probability of failure on demand (PFD) is a measure of safety availability, not of process
availability. The PFD gives us a feeling for how likely it is that the safety function is available, or
rather not available, when we need it. From an end-user and from a safety point of view this is an
important measure as it directly relates to the risk reduction achieved when running the process. But
a safety function is of no use when it causes too many spurious trips, i.e., undesired process
shutdowns while the process was running normally. These spurious trips are caused by internal
failure(s) of the safety device(s) due to random hardware failures, common cause failures or
systematic failures.
Safety functions that cause spurious trips are undesired for two reasons. First of all, the most
dangerous phases of running a process are process startup and process shutdown. Undesired
process shutdowns are especially critical as they are not controlled shutdowns. A safety function
that causes undesired process shutdowns creates more safety problems than it resolves, so
we should avoid unnecessary shutdowns as much as possible. Second, a spurious process
shutdown results in a production loss and thus in an undesired economic loss. It has a direct negative
impact on the economic performance of a company.
For an end-user it is important to have safety functions that offer both sufficient safety availability
and process availability. Unfortunately process availability is of almost no interest in the existing
functional safety standards like IEC 61508 and IEC 61511. These standards define the SIL level but
do not define performance levels for spurious trips. For this purpose Risknowlogy has defined the so
called Spurious Trip Level™. The purpose of the spurious trip level™ is to give end-users an attribute
that helps them define the desired process availability of safety functions.

3 Spurious Trip Levels™


The spurious trip level™ (STL) complements the SIL level. The STL level is a measure of how
often the safety function is carried out without a demand from the process. As of today the STL level
is only expressed quantitatively; there are no qualitative requirements. The quantitative requirements
are listed in Figure 2 and are expressed as the probability of fail safe (PFS) per year. The PFS is the
probability that the safety function causes a spurious trip because of an internal failure of the safety
function. The PFS complements the PFD value.


STL   Probability of fail safe per year
x     ≥10^-(x+1) to <10^-x
…     …
5     ≥10^-6 to <10^-5
4     ≥10^-5 to <10^-4
3     ≥10^-4 to <10^-3
2     ≥10^-3 to <10^-2
1     ≥10^-2 to <10^-1

Figure 2 - Spurious Trip Levels™


Unlike the SIL level there are an unlimited number of STL levels. The better the performance of the
safety function the higher the STL level.
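
Because every STL band in Figure 2 spans one decade, the level can be derived directly from the logarithm of the PFS. The following Python sketch does this; the function name is illustrative, and values of 0.1 or more are treated as having no STL because Figure 2 defines no band for them.

```python
import math


def stl_from_pfs(pfs_per_year):
    """Return the STL x for which 10^-(x+1) <= PFS < 10^-x, or None if no band applies."""
    if not 0.0 < pfs_per_year < 0.1:
        return None  # Figure 2 defines no level for PFS >= 10^-1
    # From 10^-(x+1) <= PFS < 10^-x it follows that x < -log10(PFS) <= x + 1.
    return math.ceil(-math.log10(pfs_per_year)) - 1


stl_1oo1 = stl_from_pfs(1.154e-6)   # PFS of the 1oo1 level sensor after 1 year
stl_2oo3 = stl_from_pfs(1.918e-10)  # PFS of the 2oo3 level sensor after 1 year
```

Values that lie exactly on a decade boundary are subject to floating point rounding in the logarithm, so a production implementation would want an explicit tolerance there.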

4 STL for product suppliers


Today suppliers of safety devices are providing end-users with (third party) statements about the
achieved SIL level for their devices. System integrators are providing end-users with PFD statements
about the complete safety loop. End-users have a good impression about the safety availability of
these devices and complete safety systems. Now the end-users can demand from the suppliers and
system integrators also statements about the probability of fail safe and achieved STL level.

4.1 Example – LNG Level sensor


The following is an example of the PFD and PFS calculations for a level sensor used to measure the
level of LNG in storage tanks. The level sensor itself consists of mechanical hardware, electronic
hardware and software. For a complete description of the level sensor see [3]. The functional safety
characteristics of a single sensor are given in Table 1. From this table it can be concluded that a
single sensor can be used in applications up to SIL 2. Because the software of a single
sensor was developed according to the SIL 3 requirements, it is possible to use multiple sensors to
achieve SIL 3. Table 2 gives a complete overview of the possible architectures for the level sensor
and their achievable SIL levels according to IEC 61508.

Table 1 – Functional safety characteristics of a single level sensor

Hardware subsystem                           Mechanical   Electronics   Electronics
Type                                         A            A             B
Hardware fault tolerance                     0            0             0
Safe failure fraction                        99.0%        93.2%         95.8%
Safe detected failure rate [/h]              1.49E-9      2.00E-7       1.80E-8
Safe undetected failure rate [/h]            1.50E-11     3.86E-8       9.50E-9
Dangerous detected failure rate [/h]         1.32E-7      3.07E-7       1.80E-8
Dangerous undetected failure rate [/h]       1.34E-9      3.97E-8       2.00E-9
Maximum achievable SIL based on hardware     3            3             2

Software
SIL 3
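
The safe failure fractions in Table 1 follow from the four failure rates listed for each subsystem: every failure except a dangerous undetected one counts as safe for this purpose. The short Python sketch below recomputes them; the function and variable names are illustrative.

```python
def safe_failure_fraction(safe_detected, safe_undetected,
                          dangerous_detected, dangerous_undetected):
    """Fraction of the total failure rate that is not dangerous undetected."""
    total = (safe_detected + safe_undetected
             + dangerous_detected + dangerous_undetected)
    return (safe_detected + safe_undetected + dangerous_detected) / total


# Failure rates per hour, taken from Table 1.
sff_mechanical = safe_failure_fraction(1.49e-9, 1.50e-11, 1.32e-7, 1.34e-9)
sff_electronics_a = safe_failure_fraction(2.00e-7, 3.86e-8, 3.07e-7, 3.97e-8)
sff_electronics_b = safe_failure_fraction(1.80e-8, 9.50e-9, 1.80e-8, 2.00e-9)
```

Rounded to one decimal place, the three results reproduce the 99.0%, 93.2% and 95.8% printed in the table.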

Table 2 – Overview of the possible architectures and their achievable SIL level

Architecture
Attribute 1oo1 1oo2 2oo3
Hardware fault tolerance 0 1 1
Fit for use in SIL 2 3 3

Table 3 gives an overview of the PFD, the PFS and the achieved SIL and STL levels of the LNG level
sensors in the different architectures. This table is particularly useful for end-users and system
integrators as it shows how much of the overall PFD budget for a given SIL level the level sensor
consumes. For example, in a SIL 2 safety loop the level sensor takes only 0.18% of the total PFD
value of SIL 2. For SIL 3 the level sensor takes even less, 0.004% and 0.033% for the
1oo2 and 2oo3 configurations respectively. Even when the safety loop is calculated over a period of
10 years, the level sensor consumes only a small part of the overall required PFD value. The PFS
values are also calculated for the different architectures of the level sensor. The best STL level is
achieved by the 2oo3 sensor architecture.

Table 3 – Architecture and configuration overview

Architecture
Attribute 1oo1 1oo2 2oo3

PFD after 1 year 1.802e-004 4.404e-008 3.287e-007

Percentage of PFD after 1 year 0.180% 0.004% 0.033%

PFD after 10 years 1.771e-003 4.181e-006 3.201e-005

Percentage of PFD after 10 year 17.7% 0.42% 3.20%

Fit for use in SIL 2 3 3

PFS after 1 year 1.154e-006 9.701e-005 1.918e-010

Fit for use in STL 5 4 9
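
Table 3 does not state what the percentages are measured against. One reading that reproduces five of the six printed values is the sensor PFD divided by the upper PFDavg limit of the targeted SIL band (10^-2 for SIL 2, 10^-3 for SIL 3). The Python sketch below makes that assumption explicit; the names are illustrative and the reference values are an inference, not something the paper states.

```python
# Assumed reference: upper PFDavg limit of each SIL band (see Figure 1).
SIL_UPPER_LIMIT = {1: 1e-1, 2: 1e-2, 3: 1e-3, 4: 1e-4}


def share_of_sil_band(pfd, target_sil):
    """Return the PFD as a percentage of the upper limit of the target SIL band."""
    return 100.0 * pfd / SIL_UPPER_LIMIT[target_sil]


share_1oo2_1y = share_of_sil_band(4.404e-8, 3)    # about 0.004 %
share_2oo3_1y = share_of_sil_band(3.287e-7, 3)    # about 0.033 %
share_1oo1_10y = share_of_sil_band(1.771e-3, 2)   # about 17.7 %
share_1oo1_1y = share_of_sil_band(1.802e-4, 2)    # about 1.8 %
```

Under this reading the 1oo1 value after 1 year evaluates to about 1.8% rather than the printed 0.180%, so that entry either uses a different reference value or differs by a factor of ten.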


Figure 3 shows how the probability of failure on demand develops over time for all three architectures.
A graphical representation like this can be used by an end-user to determine periodic proof test
intervals. This can only be done, though, if the logic solver and actuating part are also included in the
calculation. The 1oo1 architecture clearly performs the worst of the three architectures. The 1oo2
architecture performs better than the 2oo3 architecture because the 2oo3 has more ways to fail: a
1oo2 system fails dangerously only when both channels have failed, whereas a 2oo3 system fails
dangerously as soon as any two of its three channels have failed.
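
The remark about the number of ways to fail can be made concrete by enumerating channel states. The Python sketch below counts, for an MooN voter, the combinations of failed channels that leave fewer than M healthy channels and therefore defeat the safety function. It is a purely combinatorial illustration with illustrative names; it ignores failure rates, diagnostics and common cause failures, which the Markov models behind Figure 3 take into account.

```python
from itertools import product


def dangerous_combinations(m, n):
    """Count channel states of an MooN voter with fewer than m healthy channels."""
    count = 0
    for state in product((True, False), repeat=n):  # True means the channel is healthy
        if sum(state) < m:
            count += 1
    return count


ways_1oo1 = dangerous_combinations(1, 1)  # the single channel has failed
ways_1oo2 = dangerous_combinations(1, 2)  # both channels have failed
ways_2oo3 = dangerous_combinations(2, 3)  # any two channels, or all three, have failed
```

The 1oo2 voter has a single defeating combination while the 2oo3 voter has four, which is consistent with the ordering of the two curves.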

Figure 3 – Probability of Failure on Demand for 1oo1, 1oo2, and 2oo3 architectures.

Figure 4 – Safety availability calculations for 1oo1, 1oo2, and 2oo3 architectures


Figure 5 – Process availability calculations for 1oo1, 1oo2, and 2oo3 architectures


5 STL for end-users


End-users can use the STL to specify a process availability target for their safety instrumented systems.
Specifying the target SIL together with a target STL assures the end-user that the safety
instrumented system delivers sufficient safety and does not cause unnecessary shutdowns.

5.1 Example – safety instrumented system


The following is an example of the PFD and PFS calculations for a complete safety instrumented
system. In a storage tank liquefied gas is stored which needs to be processed through a vaporizer so
that the actual gas is suitable for consumption by the client. Under no circumstance should liquefied
gas flow to the piping system at the client side. This piping system is not suitable to handle liquefied
gas and would fail instantly not only damaging equipment but also causing a hazardous situation and
loss of production.
The safety instrumented system consists of a sensor section, logic solver section and an actuator
section. The sensor section is a 2oo3 system where each leg consists of an RTD connected to a
threshold relay. Each threshold relay is connected to an input of the logic solver. The architecture of
the logic solver is not clear but the 2oo3 voting of the sensors takes place inside the logic solver. The
logic solver will activate, via four output channels, the actuators if the temperature set point is
reached. The actuator section consists of a 1oo2 valve section. Each leg of the actuator consists of
2oo2 solenoid valves driving a 1oo1 pneumatic valve.

[Diagram: three sensor legs (TT and TR each) feed the logic solver, which drives two actuator legs, each consisting of two SOVs and one FCV.]
Figure 6 – Architecture Safety Instrumented System

The following component reliability data has been used for the components listed in Figure 6.


Table 4 – Equipment reliability data

# Model OFR [/h] SF [%] DDC [%] SDC [%] A SFF Type

1 PT100 1.688E-6 50% 0% 0% 1oo1 50% A

2 TR 1.688E-6 50% 0% 0% 1oo1 50% A

3 Logic solver 4.566E-8 50% 50% 50% 1oo1 75% B

4 SOV 2.000E-6 50% 0% 0% 1oo1 50% A

5 FCV 2.283E-7 50% 0% 0% 1oo1 50% A

The following reliability properties are calculated using Markov modeling:


• PFD: The probability that the safety function has failed upon demand;
• PFDavg: The average probability that the safety function has failed upon demand;
• PFS: The probability that the safety function causes a spurious trip of the process;
• Safety Availability: The probability that the safety function is available to protect the process.

Based on the reliability data and the Markov model the results are presented in Table 5. An overview
of the development of the PFD, PFDavg and Safety Availability is given respectively in Figure 7,
Figure 8, and Figure 9:
Table 5 – Results analysis

Parameter Value

Mission Time 1 Year

PFD 3.916168e-003

PFDavg 1.860371e-003

PFS 2.962833e-003

Safety Availability 9.960626e-001

SIL based on PFDavg 2

STL based on PFS 2


Figure 7 – PFD and PFD average

Figure 8 – PFS


Figure 9 – Safety Availability

References

1. IEC 61508, Functional safety for electrical, electronic, programmable electronic safety related
systems. Geneva, Switzerland, 1999
2. IEC 61511, Functional safety – Safety instrumented systems for the process industry sector.
Geneva, Switzerland, 2003
3. L. Monfilliette, P. Versluys, M.J.M. Houtermans, Certified Level Sensor For The Liquefied
Natural Gas Industry, TÜV Symposium, Cologne, Germany, May 2006


Appendix - Frequently Asked Questions

1. Why is the STL level an important property?


The STL level is an important property for two reasons. First of all it gives us an indication of
how many times a device or safety function will cause a spurious trip. Second, it allows
us to compare devices and safety functions with each other, so that we can choose the most
appropriate devices and safety function architectures.
2. Our system has a high SIL level and yet it causes a lot of trips. How can that be?
The SIL analysis as required by the standards is a theoretical analysis. If there is a mismatch
between what is calculated in theory and how the system performs in reality then this means
that we need to adjust our theoretical analysis and redo the calculations. On the other hand it
is more a problem of the designers. They need to redesign their safety devices in a way that
they do meet the appropriate SIL level. Their initial analysis was not based on the right
“theory”.
3. What is the difference between PFD and PFS?
The PFD and PFS are both properties of the safety function. The PFD is a measure of safety
availability and is calculated by determining the probability of a failure on demand of the
safety function. The PFS is a measure of process availability and is calculated by determining
the probability of causing a spurious trip failure, i.e., the probability of fail safe.
4. What is the difference between SIL and STL?
The PFD value of a safety loop is one requirement that determines the SIL level of that loop.
The PFS value of a safety loop is one requirement that determines the STL level of that loop.
Both values play an important role. The SIL of a safety function states how reliable the
safety function needs to be in order to achieve process safety or safety availability. The STL
of a safety function states how reliable a safety function needs to be in order to achieve
process availability.
5. Is the STL defined by the IEC 61508 or 61511 standards?
No, there are no standards that define STL levels. The STL levels originated from
Risknowlogy, which also defined the ranges for the STL levels. The IEC 61511 standard
requires the spurious trip rate to be defined for each safety function but makes no statement
about what it should be.
6. Which STL level does my process need?
Like the SIL level the required STL level needs to be determined by the end-users. At the
moment no end-user is setting targets for their STL levels. In the future the desired STL level
will be determined via and be part of the risk analysis just like the desired SIL level.



FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
05

Within the TÜV Functional Safety Program:


Comparison of PFD calculation

Prof. Dr.-Ing. habil. Josef Börcsök


Prof. Dr.-Ing. habil. Josef Börcsök is vice president of R&D at HIMA Paul
Hildebrandt GmbH + Co KG, Industrial Automation. He has worked for
many years in the field of safety technology and is a member of several
DKE committees. For many years he has lectured at universities
and colleges on automation technology, computer architectures
and safety computer architectures.

Address:
HIMA Paul Hildebrandt GmbH + Co KG
Albert-Bassermann-Str. 28
D-68782 Brühl near Mannheim
Tel. +49-6202 709 270
E-Mail: j.boercsoek@hima.com

Keywords
IEC/EN 61508, ISA-TR84.0.02, normal failure, common cause failures,
1oo1-system, safety related 1oo2-system, safety related 2oo3-system,
safety integrity levels (SIL), SIL-requirement, probability of failure on
demand (PFD), probability of failure per hour (PFH), safe failure fraction
(SFF), type A subsystem, type B subsystem, hardware fault tolerance,
diagnostic coverage factor (DC), proof-test interval, loop calculation

Abstract
Safety systems are used in a wide range of technical applications. Besides the
availability of such systems, their safety characteristics, e.g. the PFD and PFH figures,
must be considered. Calculating these figures requires the use of standards, several of
which are available worldwide. The newest is IEC 61508, which is accepted worldwide.
Another standard that has been in use for years is ISA-TR84.0.02, under which a safety
calculation can be performed without using MTTR and common cause failures. Since the
introduction of IEC 61508 there has been considerable discussion in industry about the
PFD figure. The reason for this discussion is the way this figure is calculated. This
contribution compares the two calculation methods.

Introduction
In the process industry the use of safety related controllers and systems is
increasing as a result of regulatory measures. To validate applications of
those systems, specific failure rate figures are used. VDE 0801
parts 1 to 7, "Functional safety, Safety related systems" (also known as
IEC 65A/179/CDV, Draft IEC1508), was until recently the state of the art
among national and international standards. It describes the procedures and
calculations for complex electronics and microcomputers in safety related
applications. With the introduction of IEC/EN 61508, a common national and
international standard was created that specifies generic safety related systems.
Today various publications present different ways of calculating the PFD
figures and availability figures. Some of them are based on
ISA-TR84.0.02 (1998) and the equations described therein.

To obtain a meaningful analysis of safety and failure probabilities, the
comparison must be made on the same basis.

Failure analysis basically distinguishes between safe and dangerous failures.
Safe failures are further divided into

• safe detectable
• safe undetectable.

Safe failures are failures that have no effect on the safety function of
the system, whether they are detected or undetected.
This does not hold for dangerous failures. When they occur, these failures
lead to a dangerous situation in the application, which under certain
circumstances can pose a massive risk to human life. These
failures are likewise divided into

• dangerous detectable
• dangerous undetectable.

When the safety related system is designed properly, it reaches the safe
state when a detectable dangerous failure occurs. In these cases the safety
related system is able to bring the complete system or the plant into the
safe state.

The critical state is caused by undetectable dangerous failures. When they
occur, no safety related system is able to detect them. They can persist in
the system until the system is shut down. In the worst case they remain
present, undetected and unknown to the user, until the system causes a hazard.

[Diagram: a sensor feeds input 1 and input 2; each input is processed by its own safety related CPU (cpu 1, cpu 2) driving output 1 and output 2, which act on the actuator / final element.]

Figure 1: Safety related 1oo2-system


[Diagram: a sensor feeds input 1, input 2 and input 3; each input is processed by its own safety related CPU (cpu 1, cpu 2, cpu 3) in channels A, B and C, each with two outputs (1A/2A, 1B/2B, 1C/2C) acting on the actuator / final element.]

Figure 2: Safety related 2oo3-system

SIL-requirements according to IEC/EN 61508 and ISA-TR84.0.02 (1998)


The following tables show the fundamental requirements of the different
safety integrity levels (SIL) according to IEC/EN 61508 and
ISA-TR84.0.02 (1998).

Table 1: SIL for systems operating in low and high demand or continuous mode
of operation according to IEC/EN 61508

Safety integrity   Low demand mode of operation               High demand or continuous mode of operation
level (SIL)        (average probability of failure to         (probability of dangerous failure per hour)
                   perform its design function on demand)
4                  ≥10^-5 to <10^-4                           ≥10^-9 to <10^-8
3                  ≥10^-4 to <10^-3                           ≥10^-8 to <10^-7
2                  ≥10^-3 to <10^-2                           ≥10^-7 to <10^-6
1                  ≥10^-2 to <10^-1                           ≥10^-6 to <10^-5

Table 2: SIL according to ISA-TR84.0.02 (1998)

Safety integrity   Demand mode of operation
level (SIL)        (average probability of failure on demand)
3                  ≥10^-4 to <10^-3
2                  ≥10^-3 to <10^-2
1                  ≥10^-2 to <10^-1

In principle, the tables show that both standards specify the failure
probabilities in the same ranges.
Advanced considerations of PFD-values according to IEC/EN 61508
Part 2 of this standard specifies the hardware requirements. It also defines
the safety life cycle of the hardware, the architectural constraints for
type A subsystems (whose behavior in the case of a fault is well known) and
type B subsystems (whose behavior in the case of a fault is not completely
known), and finally the required safe failure fraction (SFF).

Table 3: Type A subsystems and type B subsystems

Safe failure     Type A: hardware fault tolerance       Type B: hardware fault tolerance
fraction         0 faults    1 fault    2 faults        0 faults       1 fault    2 faults
< 60 %           SIL 1       SIL 2      SIL 3           Not allowed    SIL 1      SIL 2
60 % - < 90 %    SIL 2       SIL 3      SIL 4           SIL 1          SIL 2      SIL 3
90 % - < 99 %    SIL 3       SIL 4      SIL 4           SIL 2          SIL 3      SIL 4
≥ 99 %           SIL 3       SIL 4      SIL 4           SIL 3          SIL 4      SIL 4
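
Table 3 is effectively a lookup keyed by subsystem type, safe failure fraction and hardware fault tolerance. The Python sketch below encodes it that way; the function and table names are illustrative, and the top row is treated as covering an SFF of 99 % and above so that every value falls into exactly one row.

```python
# Rows: SFF < 60 %, 60 to < 90 %, 90 to < 99 %, 99 % and above.
# Columns: hardware fault tolerance 0, 1, 2. None means "not allowed".
MAX_SIL = {
    "A": [[1, 2, 3], [2, 3, 4], [3, 4, 4], [3, 4, 4]],
    "B": [[None, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 4]],
}


def max_sil(subsystem_type, sff_percent, hardware_fault_tolerance):
    """Return the maximum SIL allowed by the architectural constraints of Table 3."""
    if sff_percent < 60.0:
        row = 0
    elif sff_percent < 90.0:
        row = 1
    elif sff_percent < 99.0:
        row = 2
    else:
        row = 3
    return MAX_SIL[subsystem_type][row][hardware_fault_tolerance]
```

As a usage example, a type B subsystem with an SFF of 95.8 % and no hardware fault tolerance is limited to SIL 2, while the same subsystem with a fault tolerance of 1 may be used up to SIL 3.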

Calculation of PFD-values according to IEC/EN 61508


Besides parts 2 and 3, part 6 of IEC/EN 61508 is one of the central
parts for the development of safety related systems. It gives detailed
information on the quantitative calculation of safety related systems.
For example, it shows block diagrams and formulas for calculating the
PFD values. It also contains tables for determining the β factor and
equations for calculating the diagnostic coverage (DC) and the safe
failure fraction (SFF). Furthermore, it presents tables of calculated PFD
values for all system configurations covered in the standard, with
variations of all relevant parameters. The equations for the PFD values
of different systems are presented here as examples:

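Equation to quantify a 1oo1-System:

$$PFD_{G,1oo1} = (\lambda_{DU} + \lambda_{DD}) \cdot t_{CE} = \lambda_{D} \cdot t_{CE} = \lambda_{DU} \cdot \left(\frac{T_1}{2} + MTTR\right) + \lambda_{DD} \cdot MTTR \qquad (1)$$

with

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_{D}} \cdot \left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_{D}} \cdot MTTR \qquad (2)$$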

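Equation to quantify a 1oo2-System:

$$PFD_{G,1oo2} = 2 \cdot \left((1 - \beta_D) \cdot \lambda_{DD} + (1 - \beta) \cdot \lambda_{DU}\right)^2 \cdot t_{CE} \cdot t_{GE} + \beta_D \cdot \lambda_{DD} \cdot MTTR + \beta \cdot \lambda_{DU} \cdot \left(\frac{T_1}{2} + MTTR\right) \qquad (3)$$

with

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_{D}} \cdot \left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_{D}} \cdot MTTR \qquad (4)$$

$$t_{GE} = \frac{\lambda_{DU}}{\lambda_{D}} \cdot \left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_{D}} \cdot MTTR \qquad (5)$$
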
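Equation to quantify a 2oo3-System:

$$PFD_{G,2oo3} = 6 \cdot \left((1 - \beta_D) \cdot \lambda_{DD} + (1 - \beta) \cdot \lambda_{DU}\right)^2 \cdot t_{CE} \cdot t_{GE} + \beta_D \cdot \lambda_{DD} \cdot MTTR + \beta \cdot \lambda_{DU} \cdot \left(\frac{T_1}{2} + MTTR\right) \qquad (6)$$

with

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_{D}} \cdot \left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_{D}} \cdot MTTR \qquad (7)$$

$$t_{GE} = \frac{\lambda_{DU}}{\lambda_{D}} \cdot \left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_{D}} \cdot MTTR \qquad (8)$$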
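
Equations (1) to (8) translate almost one to one into code. The sketch below is one possible transcription; the function and parameter names are my own, all times are assumed to be in hours and all rates per hour. The example call uses the failure rates of the fictive module from the comparison section of this paper with a proof-test interval of one year, taken as 8760 h.

```python
def t_ce(l_du, l_dd, t1, mttr):
    """Channel equivalent mean down time, equations (2), (4) and (7)."""
    l_d = l_du + l_dd
    return l_du / l_d * (t1 / 2 + mttr) + l_dd / l_d * mttr


def t_ge(l_du, l_dd, t1, mttr):
    """Voted group equivalent mean down time, equations (5) and (8)."""
    l_d = l_du + l_dd
    return l_du / l_d * (t1 / 3 + mttr) + l_dd / l_d * mttr


def pfd_1oo1(l_du, l_dd, t1, mttr):
    """Equation (1)."""
    return (l_du + l_dd) * t_ce(l_du, l_dd, t1, mttr)


def _pfd_voted(factor, l_du, l_dd, t1, mttr, beta, beta_d):
    """Shared form of equations (3) and (6); only the leading factor differs."""
    independent = ((1 - beta_d) * l_dd + (1 - beta) * l_du) ** 2
    down_times = t_ce(l_du, l_dd, t1, mttr) * t_ge(l_du, l_dd, t1, mttr)
    common_cause = beta_d * l_dd * mttr + beta * l_du * (t1 / 2 + mttr)
    return factor * independent * down_times + common_cause


def pfd_1oo2(l_du, l_dd, t1, mttr, beta, beta_d):
    """Equation (3)."""
    return _pfd_voted(2, l_du, l_dd, t1, mttr, beta, beta_d)


def pfd_2oo3(l_du, l_dd, t1, mttr, beta, beta_d):
    """Equation (6)."""
    return _pfd_voted(6, l_du, l_dd, t1, mttr, beta, beta_d)


if __name__ == "__main__":
    args = dict(l_du=8.5e-10, l_dd=8.415e-8, t1=8760, mttr=8)
    print(f"1oo1: {pfd_1oo1(**args):.6e}")
    print(f"1oo2: {pfd_1oo2(**args, beta=0.02, beta_d=0.01):.6e}")
    print(f"2oo3: {pfd_2oo3(**args, beta=0.02, beta_d=0.01):.6e}")
```

Keeping the common form of (3) and (6) in one helper makes the only structural difference between the two architectures, the factor 2 versus 6, explicit.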

Two further important indicators for safety related systems are the safe
failure fraction (SFF) and the diagnostic coverage factor (DC). The SFF
is calculated by the equation:

$$SFF = \frac{\lambda_{S} + \lambda_{DD}}{\lambda_{S} + \lambda_{DD} + \lambda_{DU}} \qquad (9)$$

The DC factor can be determined by the equation:

$$DC = \frac{\sum \lambda_{DD}}{\sum \lambda_{D}} \qquad (10)$$

The SFF is the fraction of all failures that are not safety critical
(safe failures and detected dangerous failures). The DC factor describes
the fraction of dangerous failures which are detected by automatic
diagnostic tests. The individual factors in these equations have the
following meaning:

β      The fraction of undetected failures that have a common cause

βD     Of those failures that are detected by the diagnostic tests, the
       fraction that have a common cause

λD     Dangerous failure rate (per hour) of a channel in a subsystem, equal
       to 0,5 λ (assumes 50 % dangerous failures and 50 % safe failures)

λDD    Detected dangerous failure rate (per hour) of a channel in a
       subsystem (this is the sum of all the detected dangerous failure
       rates within the channel of the subsystem)

λDU    Undetected dangerous failure rate (per hour) of a channel in a
       subsystem (this is the sum of all the undetected dangerous failure
       rates within the channel of the subsystem)

MTTR   Mean time to restoration (hour)

PFDG   Average probability of failure on demand for the group of voted
       channels

T1     Proof-test interval (hour)

tCE    Channel equivalent mean down time (hour) for 1oo1, 1oo2, 2oo2 and
       2oo3 architectures (this is the combined down time for all the
       components in the channel of the subsystem)

tGE    Voted group equivalent mean down time (hour) for 1oo2 and 2oo3
       architectures (this is the combined down time for all the channels
       in the voted group)
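
Equations (9) and (10) are simple ratios of failure rates. As a minimal sketch, with function names of my own choosing and the failure rates of the fictive module used in the comparison section as example input:

```python
def safe_failure_fraction(l_s, l_dd, l_du):
    """Equation (9): share of failures that are safe or dangerous but detected."""
    return (l_s + l_dd) / (l_s + l_dd + l_du)


def diagnostic_coverage(l_dd, l_du):
    """Equation (10): share of dangerous failures found by automatic diagnostics."""
    return l_dd / (l_dd + l_du)


if __name__ == "__main__":
    # Rates per hour: safe, dangerous detected, dangerous undetected.
    l_s, l_dd, l_du = 8.5e-8, 8.415e-8, 8.5e-10
    print(f"SFF = {safe_failure_fraction(l_s, l_dd, l_du):.4f}")
    print(f"DC  = {diagnostic_coverage(l_dd, l_du):.4f}")
```

With equal safe and dangerous failure rates, a DC of 99 % already gives an SFF above 99 %, which is the last row of Table 3.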

The standard shows by example the procedure for determining hardware
failures. First the basics and assumptions underlying the calculations
are specified. In principle there are several methods for analysing the
safety integrity of safety related systems. The most frequently applied
methods are reliability block diagrams and Markov models. Correctly
applied, both methods give almost equivalent results. Markov models are
the more exact, although more difficult, method and deliver accurate
values for more complex systems.
A further characteristic value is the average probability of failure of a
complete system or loop, PFDsys. This value is calculated by adding the
average probabilities of the individual subsystems (sensor subsystem S,
logic subsystem L and final element subsystem FE):

$$PFD_{sys} = PFD_{S} + PFD_{L} + PFD_{FE} \qquad (11)$$

In order to determine the average probability of failure for each
subsystem, the following information must be available:

• the system architecture
• the diagnostic coverage of each channel
• the failure rate per hour for each channel
• the factors β and βD for the failures with common cause.

The list above introduces the common cause factor. The β-factor is the
ratio of the probability of failures with a common cause to the
probability of random dangerous failures. The following example
illustrates this.

The factors are specified as follows:

βD   = common cause factor of detectable failures
β    = common cause factor of undetectable failures
T1   = proof-test interval
MTTR = mean time to restoration

with the following values:

βD   = 1 %
β    = 2 %
T1   = 3 years
MTTR = 8 hours

With these assumptions the PFD calculations can be carried out.

PFD-calculation according to ISA-TR84.0.02 (1998)


To allow a direct comparison of the equations for the PFD calculations,
the ISA equations are listed below. Basically there are two different
calculation methods: with and without the common cause factor.

Equations to quantify a 1oo1, 1oo2 and 2oo3-system according to
ISA-TR84.0.02 (1998). Remark: the first equation takes the common-cause
failure and the MTTR into account. The second equation is the simplified
equation.

1oo1-system

$$PFD_{avg} = \lambda_{DU} \cdot \frac{TI}{2}$$

$$PFD_{avg} = \lambda_{DU} \cdot \frac{TI}{2}$$

The factors in this configuration have the meaning:

λDU  = dangerous undetectable failure rate
TI   = time interval between manual functional tests of the component

1oo2-system

$$PFD_{avg} = \left[(\lambda_{DU})^2 \cdot \frac{TI^2}{3}\right] + \left[\lambda_{DU} \cdot \lambda_{DD} \cdot MTTR \cdot TI\right] + \left[\beta \cdot \lambda_{DU} \cdot \frac{TI}{2}\right]$$

$$PFD_{avg} = \frac{(\lambda_{DU})^2 \cdot TI^2}{3}$$

The factors in this configuration have the meaning:

λDD  = dangerous detectable failure rate
λDU  = dangerous undetectable failure rate
β    = percentage of failures that impact more than one channel of a
       redundant system (common cause)
TI   = time interval between manual functional tests of the component
MTTR = mean time to repair

2oo3-system

$$PFD_{avg} = \left[(\lambda_{DU})^2 \cdot TI^2\right] + \left[3 \cdot \lambda_{DU} \cdot \lambda_{DD} \cdot MTTR \cdot TI\right] + \left[\beta \cdot \lambda_{DU} \cdot \frac{TI}{2}\right]$$

$$PFD_{avg} = (\lambda_{DU})^2 \cdot TI^2$$

The factors in this configuration have the same meaning as for the
1oo2-system:

λDD  = dangerous detectable failure rate
λDU  = dangerous undetectable failure rate
β    = percentage of failures that impact more than one channel of a
       redundant system (common cause)
TI   = time interval between manual functional tests of the component
MTTR = mean time to repair
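
The ISA equations above can be transcribed in the same way as the IEC equations. The following is a sketch under the assumption that the bracketed terms are read as three summands (independent undetected failures, one undetected failure combined with a repair, and common cause); the function names are hypothetical, rates are per hour and times in hours. The example uses the fictive module of the comparison section with TI = 1 year, taken as 8760 h.

```python
def isa_1oo1(l_du, ti):
    """ISA 1oo1; the full and the simplified equation are identical."""
    return l_du * ti / 2


def isa_1oo2(l_du, l_dd, ti, mttr, beta):
    """ISA 1oo2 with MTTR and common cause."""
    return l_du**2 * ti**2 / 3 + l_du * l_dd * mttr * ti + beta * l_du * ti / 2


def isa_1oo2_simplified(l_du, ti):
    """ISA 1oo2 without MTTR and without common cause."""
    return l_du**2 * ti**2 / 3


def isa_2oo3(l_du, l_dd, ti, mttr, beta):
    """ISA 2oo3 with MTTR and common cause."""
    return l_du**2 * ti**2 + 3 * l_du * l_dd * mttr * ti + beta * l_du * ti / 2


def isa_2oo3_simplified(l_du, ti):
    """ISA 2oo3 without MTTR and without common cause."""
    return l_du**2 * ti**2


if __name__ == "__main__":
    l_du, l_dd, ti, mttr, beta = 8.5e-10, 8.415e-8, 8760, 8, 0.02
    print(f"1oo1:            {isa_1oo1(l_du, ti):.6e}")
    print(f"1oo2 full:       {isa_1oo2(l_du, l_dd, ti, mttr, beta):.6e}")
    print(f"1oo2 simplified: {isa_1oo2_simplified(l_du, ti):.6e}")
    print(f"2oo3 full:       {isa_2oo3(l_du, l_dd, ti, mttr, beta):.6e}")
    print(f"2oo3 simplified: {isa_2oo3_simplified(l_du, ti):.6e}")
```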

Comparison between IEC 61508 and ISA-TR84.0.02 (1998)

The following differences between IEC 61508 and ISA-TR84.0.02 (1998)
have to be considered.
The ISA standard does not consider the safe failure fraction (SFF). The
diagnostic coverage factor (DC) is defined differently; the IEC
definition is more detailed. In addition, the beta factor β is
considered only for the failure rate λDU.
The contribution of λDD failures during the repair time (MTTR) caused by
common-cause failures is not calculated.
In a 1oo1 architecture the ISA standard does not consider the
contribution of the failure rate λDD.
With a very high DC factor it is possible that the IEC 61508 standard
gives worse values than the ISA standard, because the IEC considers the
term λDD ⋅ MTTR.
For redundant systems like 1oo2 or 2oo3 with a high DC factor
(assumption: > 99,9 %) and a high MTTR compared with TI, it is likewise
possible that the IEC equations give worse values than the ISA standard,
because the IEC takes the term βD ⋅ λDD ⋅ MTTR into account.
These points are serious because, depending on the chosen system
configuration, they create the need for additional hardware and software
measures in safety systems.
Comparison of the results
Basically both calculation methods offer ways to calculate the
probability of failure. To clarify the comparison, a fictive module is
considered.

The following values are applied for the fictive module:

Fictive   λb          MTTF      λS          λD          λDD         λDU         MTTR   βD     β
module    [1/h]       [years]   [1/h]       [1/h]       [1/h]       [1/h]       [h]
          1,700E-07   671,50    8,500E-08   8,500E-08   8,415E-08   8,500E-10   8      0,01   0,02

Additional fixed parameters:


Diagnostic coverage factor DC = 99 %
Safety relevant factor S = 50 %
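
The values in the module table follow from the base failure rate λb and the two fixed parameters. The sketch below shows one way to derive them. It assumes that the factor S splits λb into a safe and a dangerous part (with S = 50 % both readings give the same result), that MTTF = 1/λb, and that one year has 8760 hours; the function name `module_rates` is hypothetical.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days of 24 hours


def module_rates(l_b, safe_fraction, dc):
    """Split a base failure rate (per hour) into the rates used in the PFD equations."""
    l_s = safe_fraction * l_b           # safe failures
    l_d = (1.0 - safe_fraction) * l_b   # dangerous failures
    return {
        "l_s": l_s,
        "l_d": l_d,
        "l_dd": dc * l_d,               # dangerous detected
        "l_du": (1.0 - dc) * l_d,       # dangerous undetected
        "mttf_years": 1.0 / l_b / HOURS_PER_YEAR,
    }


if __name__ == "__main__":
    for name, value in module_rates(1.7e-7, 0.5, 0.99).items():
        print(f"{name:>10}: {value:.4e}")
```

Calling the same function with `dc=0.9999` gives the rates behind the DC = 99,99 % tables.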

Please observe: The scale on the y-axis is logarithmic.

PFD-calculation for a 1oo1-system

Diagram of the different PFD-values for a 1oo1-system:

Figure 3: PFD-diagram for a 1oo1-system with DC = 99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-07 to 1,00E-04) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure



Table 4: PFD-values for a 1oo1-system with DC = 99 %

Columns: proof-test interval T1 / TI; PFD1oo1 [1] according to IEC 61508; PFD1oo1 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD1oo1 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 9,902500E-07 3,102500E-07 3,102500E-07
3 months 1,610750E-06 9,307500E-07 9,307500E-07
6 months 2,541500E-06 1,861500E-06 1,861500E-06
1 year 4,403000E-06 3,723000E-06 3,723000E-06
2 years 8,126000E-06 7,446000E-06 7,446000E-06
3 years 1,184900E-05 1,116900E-05 1,116900E-05
4 years 1,557200E-05 1,489200E-05 1,489200E-05
5 years 1,929500E-05 1,861500E-05 1,861500E-05
6 years 2,301800E-05 2,233800E-05 2,233800E-05
7 years 2,674100E-05 2,606100E-05 2,606100E-05
8 years 3,046400E-05 2,978400E-05 2,978400E-05
9 years 3,418700E-05 3,350700E-05 3,350700E-05
10 years 3,791000E-05 3,723000E-05 3,723000E-05

For a 1oo1-system both ISA graphs are identical, because no common
cause failure exists in this system configuration.
The PFD values based on IEC and ISA are of the same order of magnitude,
see figure 3. Dramatic changes occur at low T1/TI and a very high
DC factor, e. g. DC = 99,99 %, see figure 4.

Figure 4: PFD-diagram for a 1oo1-system with DC = 99,99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-08 to 1,00E-05) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure


Table 5: PFD-values for a 1oo1-system with DC = 99,99 %

Columns: proof-test interval T1 / TI; PFD1oo1 [1] according to IEC 61508; PFD1oo1 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD1oo1 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 6,831025E-07 3,102500E-09 3,102500E-09
3 months 6,893075E-07 9,307500E-09 9,307500E-09
6 months 6,986150E-07 1,861500E-08 1,861500E-08
1 year 7,172300E-07 3,723000E-08 3,723000E-08
2 years 7,544600E-07 7,446000E-08 7,446000E-08
3 years 7,916900E-07 1,116900E-07 1,116900E-07
4 years 8,289200E-07 1,489200E-07 1,489200E-07
5 years 8,661500E-07 1,861500E-07 1,861500E-07
6 years 9,033800E-07 2,233800E-07 2,233800E-07
7 years 9,406100E-07 2,606100E-07 2,606100E-07
8 years 9,778400E-07 2,978400E-07 2,978400E-07
9 years 1,015070E-06 3,350700E-07 3,350700E-07
10 years 1,052300E-06 3,723000E-07 3,723000E-07
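
The constant offset between the IEC and the ISA columns of Tables 4 and 5 follows directly from equation (1) and the ISA 1oo1 equation. A short worked check, writing T1 = TI and using the fictive module values λD = 8,5E-08 1/h and MTTR = 8 h:

```latex
\begin{aligned}
PFD_{IEC,1oo1} - PFD_{ISA,1oo1}
  &= \lambda_{DU}\left(\frac{T_1}{2} + MTTR\right) + \lambda_{DD} \cdot MTTR - \lambda_{DU} \cdot \frac{T_1}{2} \\
  &= (\lambda_{DU} + \lambda_{DD}) \cdot MTTR = \lambda_{D} \cdot MTTR \\
  &= 8{,}5 \cdot 10^{-8}\,\mathrm{h^{-1}} \cdot 8\,\mathrm{h} = 6{,}8 \cdot 10^{-7}
\end{aligned}
```

The offset depends neither on the proof-test interval nor on DC. The 1 month rows confirm it: 9,9025E-07 − 3,1025E-07 = 6,8E-07 in Table 4 and 6,831025E-07 − 3,1025E-09 = 6,8E-07 in Table 5. Because the ISA value shrinks with (1 − DC) while the offset stays fixed, the relative gap between the two standards grows as DC increases.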

PFD-calculation for a 1oo2-system


Diagram of the different PFD-values for a 1oo2-system:

Figure 5: PFD-diagram for a 1oo2-system with DC = 99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-14 to 1,00E-06) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure


Table 6: PFD-values for a 1oo2-system with DC = 99 %

Columns: proof-test interval T1 / TI; PFD1oo2 [1] according to IEC 61508; PFD1oo2 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD1oo2 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 1,307472E-08 6,205546E-09 1,283401E-13
3 months 2,548711E-08 1,861741E-08 1,155061E-12
6 months 4,410757E-08 3,723713E-08 4,620243E-12
1 year 8,135528E-08 7,448349E-08 1,848097E-11
2 years 1,558779E-07 1,490039E-07 7,392389E-11
3 years 2,304367E-07 2,235614E-07 1,663287E-10
4 years 3,050317E-07 2,981557E-07 2,956956E-10
5 years 3,796630E-07 3,727871E-07 4,620243E-10
6 years 4,543305E-07 4,474554E-07 6,653150E-10
7 years 5,290342E-07 5,221607E-07 9,055676E-10
8 years 6,037741E-07 5,969029E-07 1,182782E-09
9 years 6,785502E-07 6,716821E-07 1,496959E-09
10 years 7,533626E-07 7,464982E-07 1,848097E-09

For a 1oo2-system at low T1/TI and DC = 99 %, the ISA graph that
considers the MTTR and the common-cause failure lies three to four
orders of magnitude above the ISA graph that does not consider these two
parameters, see figure 5. The deviation between the two graphs increases
as the DC factor becomes higher, see figure 6.

Figure 6: PFD-diagram for a 1oo2-system with DC = 99,99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-18 to 1,00E-07) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure


With DC = 99 %, the ISA graph and the IEC graph that consider MTTR and
common-cause failure are of the same order of magnitude, see figure 5.
When the DC factor is increased to DC = 99,99 %, see figure 6, the two
graphs deviate from each other by two orders of magnitude at low T1/TI.
The main reason for this deviation is that the IEC considers the
contribution of λDD failures caused by common-cause failures during the
repair time MTTR, by means of the term βD ⋅ λDD ⋅ MTTR.

Table 7: PFD-values for a 1oo2-system with DC = 99,99 %

Columns: proof-test interval T1 / TI; PFD1oo2 [1] according to IEC 61508; PFD1oo2 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD1oo2 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 6,863643E-09 6,205423E-11 1,283401E-17
3 months 6,987757E-09 1,861628E-10 1,155061E-16
6 months 7,173928E-09 3,723258E-10 4,620243E-16
1 year 7,546271E-09 7,446525E-10 1,848097E-15
2 years 8,290959E-09 1,489309E-09 7,392389E-15
3 years 9,035651E-09 2,233969E-09 1,663287E-14
4 years 9,780346E-09 2,978632E-09 2,956956E-14
5 years 1,052505E-08 3,723299E-09 4,620243E-14
6 years 1,126975E-08 4,467970E-09 6,653150E-14
7 years 1,201445E-08 5,212645E-09 9,055676E-14
8 years 1,275916E-08 5,957323E-09 1,182782E-13
9 years 1,350388E-08 6,702005E-09 1,496959E-13
10 years 1,424859E-08 7,446691E-09 1,848097E-13
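
The weight of the term βD ⋅ λDD ⋅ MTTR in Table 7 can be checked by hand. The following worked example assumes λDD = 0,9999 ⋅ 8,5E-08 1/h = 8,49915E-08 1/h for DC = 99,99 %, together with βD = 0,01 and MTTR = 8 h:

```latex
\begin{aligned}
\beta_D \cdot \lambda_{DD} \cdot MTTR
  &= 0{,}01 \cdot 8{,}49915 \cdot 10^{-8}\,\mathrm{h^{-1}} \cdot 8\,\mathrm{h}
   = 6{,}79932 \cdot 10^{-9} \\
\frac{\beta_D \cdot \lambda_{DD} \cdot MTTR}{PFD_{G,1oo2}(1\ \text{month})}
  &= \frac{6{,}79932 \cdot 10^{-9}}{6{,}863643 \cdot 10^{-9}} \approx 0{,}99
\end{aligned}
```

At a proof-test interval of one month this single term therefore accounts for about 99 % of the IEC value. The ISA equation has no corresponding term, which is why its value of 6,205423E-11 is lower by a factor of about 110.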
PFD-calculation for a 2oo3-system
Diagram of the different PFD-values for a 2oo3-system:

Figure 7: PFD-diagram for a 2oo3-system with DC = 99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-13 to 1,00E-06) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure


Table 8: PFD-values for a 2oo3-system with DC = 99 %

Columns: proof-test interval T1 / TI; PFD2oo3 [1] according to IEC 61508; PFD2oo3 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD2oo3 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 1,307816E-08 6,206638E-09 3,850203E-13
3 months 2,549532E-08 1,862222E-08 3,465182E-12
6 months 4,412670E-08 3,725138E-08 1,386073E-11
1 year 8,140985E-08 7,453048E-08 5,544292E-11
2 years 1,560576E-07 1,491718E-07 2,217717E-10
3 years 2,308141E-07 2,239241E-07 4,989862E-10
4 years 3,056792E-07 2,987872E-07 8,870867E-10
5 years 3,806530E-07 3,737613E-07 1,386073E-09
6 years 4,557354E-07 4,488462E-07 1,995945E-09
7 years 5,309265E-07 5,240420E-07 2,716703E-09
8 years 6,062262E-07 5,993487E-07 3,548347E-09
9 years 6,816346E-07 6,747662E-07 4,490876E-09
10 years 7,571517E-07 7,502947E-07 5,544292E-09
General:
The difference between the PFD values of a 1oo2 and a 2oo3 architecture
is marginal, regardless of whether the equations from the ISA or the IEC
standard are used.
Reason:
Mathematically, the difference between the two architectures lies in the
different weighting of the single faults; in the 2oo3 architecture the
single faults are weighted more strongly.
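
For the IEC equations the size of this difference can be written down explicitly, because equations (3) and (6) differ only in the leading factor (2 versus 6). A worked example for T1 = 1 year (taken as 8760 h) and DC = 99 %, using the fictive module values:

```latex
\begin{aligned}
PFD_{G,2oo3} - PFD_{G,1oo2}
  &= 4 \cdot \left((1 - \beta_D) \cdot \lambda_{DD} + (1 - \beta) \cdot \lambda_{DU}\right)^2 \cdot t_{CE} \cdot t_{GE} \\
  &= 4 \cdot \left(8{,}41415 \cdot 10^{-8}\,\mathrm{h^{-1}}\right)^2 \cdot 51{,}8\,\mathrm{h} \cdot 37{,}2\,\mathrm{h}
   \approx 5{,}457 \cdot 10^{-11}
\end{aligned}
```

This agrees with the 1 year rows of Tables 6 and 8 (8,140985E-08 − 8,135528E-08 = 5,457E-11) and is less than 0,1 % of either value, since both results are dominated by the common-cause terms that the two architectures share.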

For a 2oo3-system at low T1/TI and DC = 99 %, the ISA graph that
considers the MTTR and the common-cause failure lies three to four
orders of magnitude above the ISA graph that does not consider these two
parameters, see figure 7. The deviation between the two graphs increases
as the DC factor becomes higher, see figure 8.

With DC = 99 %, the ISA graph and the IEC graph that consider MTTR and
common-cause failure are of the same order of magnitude, see figure 7.
When the DC factor is increased to DC = 99,99 %, see figure 8, the two
graphs deviate from each other by two orders of magnitude at low T1/TI.
The main reason for this deviation is that the IEC considers the
contribution of λDD failures caused by common-cause failures during the
repair time MTTR, by means of the term βD ⋅ λDD ⋅ MTTR.

Figure 8: PFD-diagram for a 2oo3-system with DC = 99,99 %
[Diagram: PFD (logarithmic y-axis, 1,00E-17 to 1,00E-07) versus proof-test interval T1 / TI (1 to 10 years)]

Legend (three curves):
according to IEC 61508, with MTTR and common-cause-failure
according to ISA standard, with MTTR and common-cause-failure
according to ISA standard, without MTTR and without common-cause-failure


Table 9: PFD-values for a 2oo3-system with DC = 99,99 %

Columns: proof-test interval T1 / TI; PFD2oo3 [1] according to IEC 61508; PFD2oo3 [1] according to ISA TR 84.0.02 with MTTR and common cause (cc); PFD2oo3 [1] according to ISA TR 84.0.02 without MTTR and cc

1 month 6,865470E-09 6,206270E-11 3,850202E-17
3 months 6,989612E-09 1,861883E-10 3,465182E-16
6 months 7,175825E-09 3,723773E-10 1,386073E-15
1 year 7,548253E-09 7,447574E-10 5,544292E-15
2 years 8,293117E-09 1,489526E-09 2,217717E-14
3 years 9,037992E-09 2,234306E-09 4,989862E-14
4 years 9,782879E-09 2,979096E-09 8,870867E-14
5 years 1,052778E-08 3,723898E-09 1,386073E-13
6 years 1,127268E-08 4,468711E-09 1,995945E-13
7 years 1,201760E-08 5,213535E-09 2,716703E-13
8 years 1,276253E-08 5,958370E-09 3,548347E-13
9 years 1,350747E-08 6,703216E-09 4,490876E-13
10 years 1,425242E-08 7,448073E-09 5,544292E-13

Summary
This short comparison demonstrates how difficult it is to compare directly PFD values
that were generated by different procedures. With both standards the quantitative
values lie in the same ranges, as long as the simplified calculations of the ISA standard
are not applied. The parameters leading to these calculations differ, however. For
example, the ISA standard contains no definitions regarding the SFF or DC factor. It
also does not differentiate between type A and type B subsystems, which has a
considerable influence on the structure and on the integrity level of the system. Nor
does the ISA standard differentiate between β and βD; only the beta factor for the
failure rate λDU is considered. A further difference is the contribution of λDD failures
during the repair time caused by common-cause failures, which the ISA standard does
not consider.

In contrast to the ISA standard, IEC 61508 considers all these factors. This increases
the demand for safety measures in the hardware and software of the system. A system
designed in this way is altogether better suited for a safety related application.

In summary, because IEC 61508 has a universal application approach and does not
apply only to the pure safety calculation of systems, this new standard for functional
safety will open a wide spectrum of applications. In the author's opinion the IEC
standard has succeeded in its goal of creating a generic standard for safety related
applications.

For the certification of complex systems that are already in use, it is necessary to prove
conformity to the standard. Both standards, as all standards do, allow a certain latitude
at the different integrity levels. For example, a system certification according to SIL 3 is
not necessarily a decision criterion for a complete system. The described system does
fulfil the requirements of the safety integrity level, but the prerequisites written down in
the certification report must be kept in mind. The limitations of the certified system or
plant are generally to be found in this document.
Literature
[1] IEC/EN 61508: International Standard 61508 Functional Safety: Safety-Related
System. Geneva, International Electrotechnical Commission
[2] ISA-TR84.0.02 (1998) Technical Report; Safety Instrumented Systems (SIS) –
Safety Integrity Level (SIL) Evaluation Techniques. Instrument Society of America
[3] Börcsök, J.: Internationale-/Europa Norm 61508, Vortrag bei der VD-Tagung der
HIMA GmbH + Co KG, 2002
[4] Börcsök, J.: Konzepte zur methodischen Untersuchung von Hardwarearchitekturen
in sicherheitsgerichteten Anwendungen, 2002
[5] Börcsök, J.: Sicherheits-Rechnerarchitekturen Teil 1 und 2, Vorlesung Universität
Kassel 2000/2001
[6] Börcsök, J.: Echtzeit-Betriebssysteme für sicherheitsgerichtete Realzeitrechner,
Vorlesung Universität Kassel 2001/2002
[7] VDE 0801 part 1 to 7: Functional safety, Safety related systems, IEC
65A/179/CDV, Draft IEC1508, part 6, p. 26f, August 1998.
[8] DIN V 19250: Grundlegende Sicherheitsbetrachtungen für MSR-Schutzeinrichtungen.
Beuth Verlag Berlin 1998
[9] DIN VDE 0801/A1: Grundsätze für Rechner in Systemen mit Sicherheitsaufgaben.
Beuth Verlag
[10] IEC 60880-2: Software für Rechner mit sicherheitskritischer Bedeutung. 12/2001
FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
06

Within the TÜV Functional Safety Program:


White Paper
Diagnostic Test versus Proof Test
The IEC 61508 way

Date: 15 June 2006


Author(s): Dr. M.J.M. Houtermans (Risknowlogy), W. Velten-Philipp (TUV)

Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com

RISKNOWLOGY Experts in Risk, Reliability and Safety



© 2002 - 2007 Risknowlogy

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, are registered service marks.

Diagnostic Test versus Proof Test


The IEC 61508 way

Dr. M.J.M. Houtermans 1 ,


Risknowlogy B.V., Brunssum, The Netherlands
W. Velten-Philipp
TUV Rheinland, Cologne, Germany

Abstract
The purpose of this paper is to show the influence of online diagnostics and periodic proof testing on
the performance of safety functions in terms of the PFD. For three different architectures the influence
of the diagnostic coverage, the proof test coverage, and the proof test interval on the PFD is
determined. Performance indicators are used to express this influence and show the effect.

1 Introduction
Safety systems carry out one or more safety functions. Each safety function consists of a sensing
element, a logic solving element and an actuating element. Typical sensing elements are, for example,
sensors, switches or emergency push buttons. The logic solving element is usually a general-purpose
safety computer which can carry out several safety functions at once. Valves, pumps or alarms are
typical actuating elements of a safety function. The performance of these safety functions is
determined by several design parameters of the individual components of the safety functions. In
(Houtermans, 1999) the following design parameters of the safety function are identified:
• Architecture
• Hardware failures
• Software failures
• Systematic failures
• Common cause failures
• Online diagnostics, and
• Periodic test intervals.

The performance of a safety function can be expressed as the probability of failure on demand
(PFD) and the probability of fail safe (PFS), i.e. of a spurious trip. Both attributes are important in the
safety world, as their values represent respectively a measure of the level of safety achieved and of
the financial loss caused by the safety system because of spurious trips. The PFD value is one
requirement to meet the safety integrity level of the IEC 61508 standard (IEC, 1999). For the PFS
value there are currently no requirements in the international safety world, although end-users of
safety systems require a PFS value that is as low as possible.
The purpose of this paper is to show the effect that two of the above mentioned design parameters,
namely online diagnostics and periodic proof testing, have on the performance of the safety function
in terms of the PFD. For three different architectures the influence of the diagnostic coverage, the
proof test coverage, and the proof test interval on the PFD is determined. Performance indicators are
used to express this influence and show the effect.

1
Corresponding author: m.j.m.houtermans@risknowlogy.com

2 Diagnostic test versus periodic proof test


Before we can discuss the influence of diagnostic tests and periodic proof tests on the performance
of the safety functions, we first need to explain the difference between the two. To diagnose
something means "to distinguish through knowledge" [Houtermans, 1998]. In the safety world we use
diagnostic tests to identify failures within the system which otherwise would not be revealed [Howland,
1995]. In other words, a diagnostic test is performed to find failures inside the safety system. But only
revealing the failures is not sufficient. Once a failure is found, a decision needs to be made on what to
do with that failure. Typical decisions are to shut down, or to switch to degraded mode if the safety
system has sufficient redundancy. It is also possible to just notify an operator if the detected
failure is not significant. A diagnostic test is not something that is clearly defined in IEC 61508
(footnote 2). But everywhere it is mentioned it is clear that we deal with a test that is automatic.
The proof test is defined in IEC 61508 as well as in IEC 61511 [IEC, 1999, 2003]. IEC 61508
defines the proof test as follows:

“Periodic test performed to detect failures in a safety-related system so that, if necessary,
the system can be restored to an “as new” condition or as close as practical to this
condition”

IEC 61511 has a similar definition:

“test performed to reveal undetected faults in a safety instrumented system so that, if
necessary, the system can be restored to its designed functionality”.

Thus, in other words, both diagnostic tests and periodic proof tests try to detect failures inside the
safety system. At first sight there seems to be no difference, yet the actual difference is quite an
important one. Note 3 of paragraph 7.4.3.2.2 of IEC 61508-2 explains that a test is a diagnostic test if
the test interval is at least an order of magnitude less than the expected demand rate. Based on this
extra information we can conclude that, in theory, a test to detect failures is called a diagnostic test if
it is carried out automatically and at an interval at least an order of magnitude shorter than the
expected interval between demands. In all other cases we can refer to a test as a proof test.
The difference between a proof test and a diagnostic test is also important in the case of a safety
system with a single architecture, i.e., hardware fault tolerance zero. In this case a proof test is only
sufficient if the safety function operates in low demand mode. For a high demand safety function,
diagnostic tests are required that are able to detect dangerous faults within the process
safety time (see IEC 61508-2, chapter 7.4.3.2.5). In other words, you cannot build a single-channel
high demand safety system that depends only on proof tests.
In practice we define a test as a diagnostic test if it fulfils the following three criteria:
1. It is carried out automatically (without human interaction) and frequently (related to the
process safety time considering the hardware fault tolerance) by the system software and/or
hardware;
2. The test is used to find failures that can prevent the safety function from being available; and
3. The system automatically acts upon the results of the test.
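
The three practical criteria can be expressed as a simple check. The sketch below is only an illustration of the rule: the class `CheckProfile`, its field names and the parameter `max_diagnostic_interval_s` are hypothetical, and the interval limit would in practice have to be derived from the process safety time and the hardware fault tolerance.

```python
from dataclasses import dataclass


@dataclass
class CheckProfile:
    """Hypothetical description of a test performed on a safety system."""
    automatic: bool                # criterion 1: runs without human interaction
    interval_s: float              # criterion 1: time between two test runs
    finds_blocking_failures: bool  # criterion 2: targets failures that block the safety function
    acts_automatically: bool       # criterion 3: the system itself reacts to the result


def classify(check, max_diagnostic_interval_s):
    """Return 'diagnostic test' only if all three criteria hold, else 'proof test'."""
    is_diagnostic = (
        check.automatic
        and check.interval_s <= max_diagnostic_interval_s
        and check.finds_blocking_failures
        and check.acts_automatically
    )
    return "diagnostic test" if is_diagnostic else "proof test"


if __name__ == "__main__":
    memory_check = CheckProfile(True, 0.05, True, True)
    valve_stroke = CheckProfile(False, 90 * 24 * 3600, True, False)
    print(classify(memory_check, max_diagnostic_interval_s=1.0))
    print(classify(valve_stroke, max_diagnostic_interval_s=1.0))
```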

2
The term diagnostic test is not defined and does not exist in part 4 of the standard.


A typical example of a diagnostic test is a CPU test, a memory check or program flow control
monitoring. A proof test, on the other hand, is a test which is carried out manually and where it is often
necessary to have external hardware and/or software to determine the result of the test. The
proof test interval is much longer than the process safety time and orders of magnitude longer than
the period chosen for diagnostic tests. A typical example of a proof test is full or partial valve stroke
testing [McCrea-Steele, 2006]. Partial valve stroke testing is seldom carried out without some form of
human interaction (in other words, we depend on a human to carry out the test, to determine
the actual results of the test and/or to take the appropriate action based on the results) and often
needs additional equipment.
The advantage of a diagnostic test over a proof test is that failures can be detected very quickly. If
a proof test is carried out once every three months, the safety function may run with a failure for up to
three months before we find out about it. With diagnostics we often know
about the problem within milliseconds and thus can repair the failure very quickly. On the other hand,
good diagnostics require a more complicated design and additional hardware and software
built into the system. This additional hardware and software is often difficult to build and costs extra.
There is another important reason to make a distinction between diagnostic tests and periodic proof
tests. IEC 61508 requires the calculation of the safe failure fraction for subsystems. The safe failure
fraction of a subsystem is defined as the ratio of the average rate of safe failures plus dangerous
detected failures of the subsystem to the total average failure rate of the subsystem, see the formula
below:

$$SFF = \frac{\lambda_{SD} + \lambda_{SU} + \lambda_{DD}}{\lambda_{SD} + \lambda_{SU} + \lambda_{DD} + \lambda_{DU}}$$

A high safe failure fraction can be accomplished if we either have a lot of safe failures (detected or
undetected does not really matter for the SFF) or if we can detect a lot of the possible dangerous
failures. Only failures detected by diagnostic tests can be accounted for in the safe failure fraction
calculation. Failures detected by periodic proof tests cannot be accounted for in the safe failure
fraction calculations. This is logical of course, as we do not want to count on humans to carry out our
safety.
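
The point that only diagnostic coverage raises the SFF can be made concrete with a small calculation. In this sketch the function name and the example rates are illustrative assumptions; note that proof test coverage does not appear as a parameter at all, because failures found only by a proof test stay in the undetected rate.

```python
def sff_from_coverage(l_safe, l_dangerous, dc):
    """SFF for a total safe rate (SD + SU), a total dangerous rate and a diagnostic coverage."""
    l_dd = dc * l_dangerous          # dangerous failures found by diagnostic tests
    l_du = (1.0 - dc) * l_dangerous  # everything else, including what a proof test finds later
    return (l_safe + l_dd) / (l_safe + l_dd + l_du)


if __name__ == "__main__":
    # Hypothetical subsystem with equal safe and dangerous failure rates (per hour).
    for dc in (0.0, 0.6, 0.9, 0.99):
        print(f"DC = {dc:4.2f} -> SFF = {sff_from_coverage(5e-8, 5e-8, dc):.4f}")
```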

3 Architectures
To show the effect that diagnostic and proof tests have on safety functions we introduce three
common safety system architectures. The presented architectures are oversimplified but represent
common structures that are used to implement safety functions. In practice these systems are much
more complex and can consist of many more components. The three basic architectures presented
can be characterized by their redundancy and voting properties, i.e., as XooY (i.e., “X out of Y”) and
are
• The 1oo1 architecture;
• The 1oo2 architecture; and
• The 2oo3 architecture.
Each architecture consists of one or more sensors, one or more logic solvers, and one or more
actuating elements. The following paragraphs explain the three different architectures in more
detail.

RISKNOWLOGY Experts in Risk, Reliability and Safety Page 5


White Paper
Dr. M.J.M. Houtermans (Risknowlogy), W. Velten-Philipp (TUV)
Diagnostic Test versus Proof Test

3.1 The 1oo1 Architecture


The 1oo1 architecture is the simplest safety system around and consists of one single channel (see
Figure 1). If any of the components within a channel fails dangerously the safety function cannot be
executed anymore. If any of the components fail safe the safety function will be executed and a
spurious trip will result.

[Figure: one channel consisting of sensor S1, logic solver LS1 and actuator A1]

Fig. 1. Functional block diagram 1oo1

3.2 The 1oo2 Architecture


The 1oo2 architecture, see Figure 2, consists of two channels, where each channel can execute the
safety function by itself. If one channel fails dangerously the other channel is still able to execute the
safety function and thus the safety system is still available. If one channel has a safe failure the safety
function will be executed and a spurious trip will follow.

[Figure: two channels, S1, LS1, A1 and S2, LS2, A2]

Fig. 2. Functional block diagram 1oo2

3.3 The 2oo3 Architecture


The third architecture often used in practice is a 2oo3 voter structure, see Figure 3. This architecture
consists of three independent channels. Each channel can carry out the safety function. The
execution of the safety function requires two equal votes from the available channels. That means the
safety function is executed if two channels have the same result. In other words, we need two
dangerous failures in two channels before the safety system fails. We also need two safe failures in
two channels before the safety function is executed and a spurious trip results.

[Figure: three channels, S1, LS1, A1; S2, LS2, A2; and S3, LS3, A3]

Fig. 3. Functional block diagram 2oo3
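
The voting behaviour described for the three architectures can be sketched in a few lines of Python. This is a simplified illustration, not the model used for the calculations in this paper: the function name and the three channel states are assumptions, and each channel is treated as a single block.

```python
def evaluate(m, channels):
    """Evaluate an MooN voter, where N is len(channels).

    Each channel is "ok", "safe" (failed safe, so it votes for a trip) or
    "dangerous" (failed dangerously, so it can no longer vote for a trip).
    """
    n = len(channels)
    safe = channels.count("safe")
    dangerous = channels.count("dangerous")
    if safe >= m:
        # Enough channels vote for a trip without a real demand.
        return "spurious trip"
    if dangerous > n - m:
        # Fewer than m channels are left that could vote for a trip.
        return "safety function lost"
    return "available"


print(evaluate(1, ["dangerous"]))                    # 1oo1
print(evaluate(1, ["dangerous", "ok"]))              # 1oo2
print(evaluate(2, ["dangerous", "dangerous", "ok"])) # 2oo3
```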


4 Evaluation Procedure
The procedure used to calculate the PFD is outlined in detail in (Houtermans, 1999) and is in short as
follows. A functional block diagram of the hardware of the safety system is drawn. For each element
of the block diagram the typical failure modes are listed. The effect on system level is determined for
each of them. This information is used to construct the reliability model, which is in this case a Markov
model (ISA, 1997, IEC, 2001, Börcsök, 2004). The last step in the procedure includes the
quantification of the models by adding the failure and repair rates of the different components.

5 Reliability Data
One of the objectives of this study was to make sure that it was possible to compare the results for
the different architectures. The actual value of the outcome is less important, as we are more
interested in the relative results. All calculations are carried out with the values in Table 1
(the reference model) unless otherwise noted in the specific study.

Table 1. Default reliability data

Component                          Sensor    Logic     Actuator
Failure rate [/h]                  35E-6     15E-6     50E-6
Safe failures [%]                  50        50        50
Diagnostic coverage [%]            0         0         0
Online repair [h]                  8         8         8
Repair after spurious [h]          24        24        24
Periodic proof test interval [y]   none      none      none
Proof test coverage [%]            100       100       100
Mission time [h]                   87600
Common cause 2 failures [%]        0.0225
Common cause 3 failures [%]        0.00225
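
As an illustration of how the Table 1 data feed a calculation, the following Python sketch splits each failure rate into safe, dangerous detected and dangerous undetected parts. Two points are assumptions of this sketch rather than statements from the paper: the diagnostic coverage is applied to the dangerous failures only, and the rates of sensor, logic and actuator are added to obtain the rates of one channel.

```python
# Failure rates per hour, taken from Table 1.
FAILURE_RATES = {"sensor": 35e-6, "logic": 15e-6, "actuator": 50e-6}


def split_failure_rate(total, safe_fraction=0.5, diagnostic_coverage=0.0):
    """Split a total failure rate into (safe, dangerous detected, dangerous undetected)."""
    lam_safe = total * safe_fraction
    lam_dangerous = total - lam_safe
    lam_dd = lam_dangerous * diagnostic_coverage
    lam_du = lam_dangerous - lam_dd
    return lam_safe, lam_dd, lam_du


def channel_rates(diagnostic_coverage=0.0):
    """Sum the split rates of all components in one channel (series assumption)."""
    parts = [
        split_failure_rate(rate, 0.5, diagnostic_coverage)
        for rate in FAILURE_RATES.values()
    ]
    return tuple(sum(column) for column in zip(*parts))


lam_s, lam_dd, lam_du = channel_rates(diagnostic_coverage=0.0)
print(f"safe: {lam_s:.1e}/h, DD: {lam_dd:.1e}/h, DU: {lam_du:.1e}/h")
```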

6 Performance Indicators
In order to examine the results and the effects of changing the diagnostic coverage, the proof test
coverage and other parameters, performance indicators are introduced. A performance indicator helps us
understand what impact changing a parameter has on the PFD value of the system at hand. The performance
indicator (PI) is calculated relatively simply. We change the parameter, for example the diagnostic
coverage, stepwise from its minimum value to its maximum value. This results in PFD values that change
according to the value of the parameter. To determine the impact of the parameter on the PFD value
we calculate the change of the PFD relative to the 50% value of the parameter. In the case of diagnostic
coverage, 50% means that 50% of the failures are detected by diagnostics. The 50% value is in this case
the reference value that we use to normalize the PFD value to 1. To get an impression of the
influence below 50% DC and above 50% DC we choose 25% DC and 75% DC for the PFD values to
compare with.


The reason we do this is that we want to determine whether changing a parameter has a large or
only a small effect on the PFD. In other words, if a real system does not meet
our PFD target, we know which design parameter to address in order to make the
changes that have the most influence and thus give the fastest results. We do not want to spend time and
money on improving parameters that show little or no effect at all.
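
The performance indicator can be sketched as follows. The text does not give the exact formula, so this sketch assumes that each indicator is the ratio of the PFD at the lower parameter value to the PFD at the higher one, with the 50% value as the common reference. The function name and the PFD numbers are hypothetical and are not results from this study.

```python
def performance_indicators(pfd_by_value, low=0.25, ref=0.50, high=0.75):
    """Return the PFD improvement factors for the steps low->ref and ref->high."""
    return {
        "25%-50%": pfd_by_value[low] / pfd_by_value[ref],
        "50%-75%": pfd_by_value[ref] / pfd_by_value[high],
    }


# Hypothetical PFD values for one architecture at 25%, 50% and 75% coverage.
example_pfd = {0.25: 0.08, 0.50: 0.04, 0.75: 0.01}
pi = performance_indicators(example_pfd)
print(pi)
```

A value well above 1 for a step indicates that improving the parameter over that range pays off, while a value near 1 indicates little effect.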

7 Calculation results
7.1 Calculation results with and without diagnostics – no proof test
In this paragraph we study the influence diagnostics have on the PFDavg values for the different
system architectures as presented in paragraph 3. The diagnostic coverage is varied over the
following percentages: 0%, 25%, 50%, 75%, and 99%. The results are presented in the figure below.

[Figure: "PFDavg for variable DC"; probability (logarithmic axis, 0.0001 to 1) versus diagnostic coverage (0 to 1) for the 1oo1, 1oo2 and 2oo3 architectures]

Fig. 4. PFDavg in relation to diagnostic coverage

Next we calculate the performance indicator of the diagnostic coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 5 shows the change of the PFDavg for 25% DC
compared to 50% DC and the change for 75% DC compared with 50% DC. The numbers are
normalized with the 50% value.


[Figure: bar chart "Compare for variable DC"; PI-PFD (axis 0.00 to 3.00) for the 1oo1, 1oo2 and 2oo3 architectures over the ranges 25%-50% and 50%-75%; bar labels 2.73, 2.08, 1.70, 1.52, 1.40 and 1.19]

Fig. 5. Performance indicators for variable diagnostic coverage

From Figure 4 and Figure 5 we can draw the following conclusions:

• The diagnostic coverage factor has a significant influence on the PFDavg value. For all
architectures, improving the diagnostic coverage from 0 to 100% improves the
PFDavg by a factor of approximately 10 to 1000;
• The 1oo1 system is the least sensitive to modifications of the diagnostic coverage;
• The 1oo2 system is the most sensitive to modifications of the diagnostic coverage factor, for
low diagnostic coverage factors as well as for high values;
• Both the 1oo2 and the 2oo3 are one-fault-tolerant systems. The difference in performance can be
explained by probability theory. Both systems need two failures in two different
channels to lose the safety function. The 2oo3 system, though, has 3 possible combinations
of two failures, while the probability of repair is equally fast in both systems;
• For all three architectures, improvements have the most effect above 50% diagnostic
coverage.
In other words, all architectures are sensitive to the diagnostic coverage factor, and increasing the
diagnostic coverage factor can yield major improvements. The redundant structures are much more
sensitive than the single structure. Diagnostic coverage really makes an impact on the PFDavg value
when it is above 50%, preferably above 75%.
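
The combinatorial argument about the 1oo2 and 2oo3 systems can be checked with a short Python sketch. It assumes, as a simplification, that an MooN system loses its safety function as soon as N - M + 1 channels have failed dangerously; the function name is chosen for illustration.

```python
from math import comb


def dangerous_combinations(m, n):
    """Number of minimal sets of failed channels that defeat an MooN voter."""
    failures_needed = n - m + 1
    return comb(n, failures_needed)


for m, n in [(1, 1), (1, 2), (2, 3)]:
    print(f"{m}oo{n}: {dangerous_combinations(m, n)} combination(s)")
```

Both redundant systems need two failed channels, but the 2oo3 system offers three different pairs of channels in which this can happen, against one pair for the 1oo2 system.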

7.2 Calculation results for variable proof test coverage


In this paragraph we study the influence the proof test coverage has on the PFDavg value for the
different architectures as presented in paragraph 3. The proof test coverage is varied over the following
percentages: 0%, 25%, 50%, 75%, and 100%. The results are presented in the figures below.


Next we calculate the performance indicator of the proof test coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 7 shows the change of the PFDavg for 25% test
coverage compared to 50% and the change for 75% compared with 50%. The numbers are
normalized with the 50% value.

[Figure: "PFDavg for variable proof test coverages"; probability (logarithmic axis, 0.0001 to 1) versus proof test coverage (0 to 1) for the 1oo1, 1oo2 and 2oo3 architectures]

Fig. 6. PFDavg in relation to proof test coverage

[Figure: bar chart "Compare for variable Proof Test Coverage"; PI-PFD (axis 0.00 to 2.00) for the 1oo1, 1oo2 and 2oo3 architectures over the ranges 25%-50% and 50%-75%; bar labels 1.96, 1.89, 1.73, 1.70, 1.42 and 1.40]

Fig. 7. Performance indicators for variable proof test coverage


From Figure 6 and Figure 7 we can conclude the following:

• The proof test coverage also helps improve the PFDavg, but not as significantly as the
diagnostic coverage. An improvement by a factor of 10-100 is achievable;
• Just as with the diagnostic coverage, the redundant architectures react better to proof test
coverage improvements than the single architecture;
• The sensitivity concerning the PFDavg improvement is nearly the same for all architectures.
The PFDavg is less sensitive to proof test coverage than to diagnostic coverage. This is
understandable, as diagnostics lead to very fast repair compared to proof testing.
Failures detected by diagnostics are detected within seconds, while it can take 1 year or
longer before a proof test is carried out. Failures can thus exist much longer in the system. Therefore the
PFD does not, on average, improve that much.

7.3 Calculation results with variable proof test interval


In this paragraph we study the influence of the proof test interval on the PFDavg value for the
architectures as presented in paragraph 3. The proof test interval was varied from none to 3
months, 1 year, 2 years, and 5 years with a proof test coverage of 100%. The results are presented in
the figures below.

[Figure: "PFDavg for variable proof test interval"; probability (logarithmic axis, 0.0001 to 1) versus proof test interval in years (0 to 10) for the 1oo1, 1oo2 and 2oo3 architectures]

Fig. 8. PFDavg in relation to the proof test interval

Next we calculate the performance indicator of the proof test interval, i.e., how it influences the
PFDavg of the different architectures. Figure 9 shows the change of the PFDavg for a proof test
interval of 1-5 years and a proof test interval of 5-10 years. The numbers are normalized with the
5-year value.

[Figure: bar chart "Compare for variable Proof Test Interval"; PI-PFD (axis 0.00 to 8.00) for the 1oo1, 1oo2 and 2oo3 architectures over the ranges 1y-5y and 5y-10y; bar labels 7.87, 5.79, 3.13, 1.97, 1.65 and 1.30]

Fig. 9. PFDavg for different proof test intervals

From Figure 8 and Figure 9 we can conclude that

• There is a high sensitivity to the proof test interval, with the PFDavg varying by approximately a
factor of 10-1000 depending on the architecture;
• The redundant architectures benefit the most. The shorter the proof test interval, the greater the
benefit. This is logical, of course, as the shorter the interval, the more the proof test resembles a
diagnostic test;
• The 1oo2 architecture benefits the most from frequent proof tests;
• From a probability point of view there is almost no difference between performing the proof test
at a 5-year or at a 10-year interval.
Just like the diagnostic coverage, the proof test interval has a significant impact on the PFDavg
value for all architectures. It is more important to carry out proof tests frequently than to achieve a
high test coverage. In other words, even a simple but frequent proof test can help reduce the PFDavg
value significantly. This is good news for partial stroke testing of valves. With partial stroke
testing of valves we do not know the actual proof test coverage, but if it is done frequently it will still help
significantly.
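
The qualitative effect of the proof test interval can be reproduced with a strongly simplified single-channel sketch in Python. It is not the Markov model used for the results above. It assumes a constant dangerous failure rate, no diagnostics, and that each proof test instantly finds and repairs the covered share of the failures, so the absolute numbers will differ from the figures. The function name is hypothetical; the failure rate and mission time follow Table 1.

```python
import math


def pfd_avg_single_channel(lam_d, mission_h, interval_h=None, coverage=1.0, steps=20000):
    """Time-averaged probability that one channel is in a dangerous failed state.

    interval_h=None means that no proof test is carried out during the mission.
    """
    lam_tested = lam_d * coverage if interval_h else 0.0
    lam_untested = lam_d - lam_tested
    dt = mission_h / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt  # midpoint rule
        # Covered failures are removed at every proof test, uncovered ones stay.
        since_test = math.fmod(t, interval_h) if interval_h else 0.0
        total += 1.0 - math.exp(-lam_tested * since_test - lam_untested * t)
    return total / steps


LAMBDA_D = 50e-6    # dangerous failure rate of one channel per hour (Table 1, 0% DC)
MISSION_H = 87600.0  # mission time in hours (Table 1)

results = {
    label: pfd_avg_single_channel(LAMBDA_D, MISSION_H, interval_h=hours)
    for label, hours in [("3 months", 2190.0), ("1 year", 8760.0),
                         ("2 years", 17520.0), ("5 years", 43800.0), ("none", None)]
}
for label, value in results.items():
    print(f"proof test interval {label:>8}: PFDavg = {value:.3f}")
```

In this sketch the PFDavg rises steadily as the interval grows, in line with the trend shown in Figure 8.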


8 Conclusions
The purpose of this paper was to show the effect of diagnostic coverage, proof test coverage and the
proof test interval on the PFDavg value for different safety architectures. To gain more insight into the
influence of these design parameters a performance indicator was introduced. The choice of
diagnostic coverage and the proof test interval have the most influence on the PFDavg. The proof test
coverage also improves the PFDavg, but less significantly than the other two parameters. The 1oo2
architecture gains the most benefit, while the 1oo1 is the least sensitive. For redundant
architectures there is more chance of finding failures than for single architectures.
Therefore they perform better in terms of improving the PFDavg. The authors are currently working on
a more extended version of this paper that takes into account, among other things, more architectures
and the PFS calculation per architecture.

REFERENCES
• IEC 61508 (1999) Functional safety of E/E/PE safety-related systems
• IEC 61511 (2003) Safety instrumented systems for the process industry
• IEC 61165 (2001) Ed. 2: Application of Markov techniques
• ISA TR84.0.04 Part 4 (1997) Determining the SIL of a SIS via Markov Analysis
• Börcsök, J. (2004) Electronic Safety Systems, Hardware Concepts, Models, and Calculations,
Huthig GmbH & Co. KG, Heidelberg, Germany
• Houtermans, M.J.M., Rouvroye, J.L. (1999) The Influence of Design Parameters on the
Performance of Safety-Related Systems. International Conference on Safety, IRRST,
Montreal, Canada
• Houtermans, M.J.M., Brombacher, A.C., Karydas, D.M. (1998) Diagnostic Systems of
Programmable Electronic Systems. PSAM IV, New York, U.S.A.
• Howland, R.E. (1995) Computer Hardware Diagnostics for Engineers, ISBN 0-07-030561-7,
McGraw-Hill, U.S.A.
• Karydas, D.M., Houtermans, M.J.M. (1998) A Practical Approach for the Selection of
Programmable Electronic Systems used for Safety Functions in the Process Industry. 9th
International Symposium on Loss Prevention and Safety Promotion in the Process Industries,
Barcelona, Spain
• McCrea-Steele, R. (2006) Partial Stroke Testing - The Good, the Bad and the Ugly. 7th
International Symposium Programmable Electronic Systems in Safety Related Applications,
Cologne, Germany



FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
07

Within the TÜV Functional Safety Program:


Modern 2oo4-processing architecture for safety systems
PROF. DR.-ING. HABIL. JOSEF BÖRCSÖK
HIMA Paul Hildebrandt GmbH + Co KG
68782 Brühl, Albert-Bassermann-Str. 28
GERMANY
j.boercsoek@hima.com

Abstract: - An advanced safety architecture is the 2 out of 4-system (2oo4). In order to trigger the safety function
at least two of the four channels must work correctly. It is said: “A 2oo4-system is 2-failure safe”. In order to
classify the quality of a system we calculate different parameters. In the report equations are indicated for PFD for
normal and common-cause-failures. Also the Markov-model for a 2oo4-architecture is introduced. We can
calculate the MTTF (Mean Time To Failure) of this architecture with this Markov-model. The results are high
availability and a high reliability.

Key-Words: - 2oo4-Architecture, Availability, IEC/EN 61508, Reliability, Markov-model, MTTF, PFD, SIL

1 Introduction
Modern technical systems that control and steer safety-relevant processes are becoming more and
more complex. There are various reasons for this. On the one hand, the demands on high-quality,
high-performance systems increase while the space available for components has to decrease. On the
other hand, steadily growing global competition makes it necessary to offer technically enhanced and
safer systems in order to remain competitive. This applies especially to the field of safety-relevant
digital processing and automation, in which complex digital circuits are integrated.
Digital processing systems of every size are used in particular for safety-related tasks. Such tasks
might be the supervision or control of vehicles, trains, aeroplanes, power plants or chemical
processing units. Another important and growing field of application is medicine. In each of these
sectors, failures and errors of the systems increase the risk of immense damage, up to and including
the threat to human lives.
Today's control or application systems used for safety-critical missions commonly consist of highly
complex single components, implemented either as software or hardware. A hardware model and a
software model have to be generated to evaluate aspects like reliability and safety of a complex
system. Reliability means functioning without any failure under all circumstances. Safety here means
that the system will not enter a critical state even if a failure occurs. The safe state of the
process is the state in which no danger occurs. If a failure occurs, the system has to be able to
reach the safe state.
The various functional, non-functional and safety-related demands on the system, along with common
system characteristics, lead to a list of system-specific features. This list contains:

• Reliability, availability and fail-safe operation
• System integrity and data integrity
• Maintenance and system restoration.

In order to have measurable parameters, the widely used parameters "mean time to failure" (MTTF) and
"probability of failure on demand" (PFD) were defined. The PFD characterizes the quality of a
faultless system. The smaller the value, the better the safety of the system. A system's safety
refers to all items in the loop. In automation, the loop of a safety-related system consists, among
others, of the following components:

• Computing elements (logic processing devices such as analog and digital inputs and outputs, CPU)
• Sensors
• Termination elements such as actuators

By combining all elements of a system in a safety architecture, the system can be assigned a defined
safety level, the safety integrity level (SIL). Table 1 shows the various classifications of safety
systems. The standard IEC 61508 defines two different criteria for the classification of safety
systems into the individual safety levels.
On the one hand, a system can be judged by its probability of a dangerous failure, i.e. an error
occurs on demand of a safety function and the system can no longer perform its safety function.
IEC 61508 implies that the so-called proof check interval lies at

• two years or
• ten years.

This probability of failure is defined as "probability of failure on demand" (PFD). It is
dimensionless.

Table 1: SIL classification

Safety Integrity   Low demand mode of operation      High demand or continuous mode of operation
Level (SIL)        (T1 = 2 years or T1 = 10 years)   (T1 = 1 month or T1 = 2 months or
                                                     T1 = 6 months or T1 = 1 year)
1                  ≥ 10^-2 to < 10^-1                ≥ 10^-6 to < 10^-5
2                  ≥ 10^-3 to < 10^-2                ≥ 10^-7 to < 10^-6
3                  ≥ 10^-4 to < 10^-3                ≥ 10^-8 to < 10^-7
4                  ≥ 10^-5 to < 10^-4                ≥ 10^-9 to < 10^-8

IEC 61508 proposes a second possibility for the classification of safety systems. The probability of
a failure occurring on demand that leaves the system unable to perform its safety functions is
calculated here as well. For this purpose a certain period of time is required for the proof check
interval, either

• one month or
• three months or
• six months or
• one year.

This probability of failure is defined as the probability of failure per hour (PFH). Unlike a
probability it has a dimension of 1/h. Systems demanding continuous operation are highly significant
for industrial systems.
Note that comparing two systems by their PFD or PFH values is only possible within limits, as the
two values refer to different bases.
The probability of a failure on demand always has to be regarded as a statistical term. Even in
safety systems there is no absolute safety, since these systems may fail on demand.
Based on long-term empirical studies of corresponding applications, the distribution of a system's
failures can commonly be assumed as follows:

• 15 % of computing elements
• 50 % of sensors
• 35 % of termination elements such as actuators.
The whole system's failure rate λ is subdivided into safe failures λ_S and dangerous failures λ_D. In
addition, safe failures are subdivided into safe undetected failures λ_SU and safe detected failures
λ_SD, whereas dangerous failures are subdivided into dangerous undetected failures λ_DU and dangerous
detected failures λ_DD. Fig. 1 shows this subdivision of the failure rates. Failure rates can be
specified with the aid of standard specifications.

[Figure: subdivision of the failure rate into safe detected and safe undetected failures, λ_S = λ_SD + λ_SU, and into dangerous detected and dangerous undetected failures, λ_D = λ_DD + λ_DU]

Fig. 1: Structural Software Creation for Safe Systems

A system's quality can be specified by defining its PFD value with reference to its correct
functioning. The smaller this value, the better the system. However, the longer the system runs, the
higher the PFD value becomes. The PFD value is calculated for a period of time called the proof check
interval (T1). After maintenance of the system we proceed on the assumption that it works without any
failures. Systems are mostly judged and compared by the average PFD value (PFDavg) over a whole proof
check interval.
The best-known architectures in use for safety systems are the 1oo2- and 2oo3-architectures. 1oo2-
(read: 1 out of 2) and 2oo3- (read: 2 out of 3) architectures are common for safety-related systems
in industry.
A 1oo2-architecture, see figure 2, contains two independent channels which are connected in such a
manner that if one of the two serial output circuits has a safety-related failure, the other channel
must work correctly and brings the controlled process into the safe state.

[Figure: a sensor feeds two safe logic solvers, A and B, each with input channels 1 to 4 and output channels 1 and 2; the outputs act on the actuator through a connecting element]

Fig. 2: Reliability block diagram of 1oo2-architecture


The 2oo3-architecture, see figure 3, is distinguished by the fact that at least two of the three
channels must work correctly in order to trigger the safety function. In order to meet all
requirements for safety, the 1oo2-architecture is sufficient. If you additionally require high
reliability, you have to choose a 2oo3-architecture. In order to take advantage of both systems in
industry, you must develop a 2oo4-architecture. This architecture is described in the following.

[Figure: a sensor feeds three safe logic solvers, A, B and C, each with input channels 1 to 4 and output channels 1 and 2; the outputs act on the actuator through a connecting element]

Fig. 3: Reliability block diagram of 2oo3-architecture

2 Description of the 2oo4-architecture for safety-related technology
The 2oo4-system normally contains four independent channels. The four channels are connected with
one another. In order to trigger the safety function, at least two of the four channels must work
correctly. Even if two failures occur in two different channels, the system can be brought into the
safe state. It is said: "a 2oo4-system is 2-failure safe".
A dangerous breakdown of the system results if three of the four channels have dangerous failures.
Figure 4 shows a reliability block diagram of a 2oo4-architecture. Each single channel consists of an
input circuit, a safe processing unit and two serial output circuits.
With a fault tree analysis you can determine the following causes that bring the system into a
dangerous, unsafe state:

• All four channels have a dangerous detectable failure with a common cause
• All four channels have a dangerous undetectable failure with a common cause
• Three of the four channels have a dangerous detectable or a dangerous undetectable failure, with
no common cause

Theoretically, a 2oo4-system is brought into the safe state immediately if a dangerous failure
arises. In practice, however, each detection of a failure takes time. If a further failure occurs
during this time, two failures are present at the same moment. However, due to its 2-failure safety,
the 2oo4-system can definitely reach the safe state, in contrast to a 2oo3-system. When a dangerous
failure occurs, the system switches off the affected channel, so the 2oo4-system degrades to a
2oo3-system. In this new system a further failure is still possible in the three correctly operating
channels.
In a 2oo3-system a majority of channels still works correctly if one dangerous failure happens. The
system is in a defined state and decides to go into the safe state. In a 1oo2-system one of the two
channels must work correctly. However, if both channels have failed, there is no possibility of
bringing the process into a safe state. So the difference between the 2oo4- and a 1oo2-system is the
higher availability of the 2oo4-system, together with a slightly better probability of performing
the safety function.
[Figure: a sensor feeds four safe logic solvers, A, B, C and D, each with input channels 1 to 4 and output channels 1 and 2; the outputs act on the actuator through a connecting element]

Fig. 4: Reliability block diagram of 2oo4-architecture

3 Calculation of probability distributions
You can apply the following basic approach to determine the PFDavg equation of a 2oo4-system:

$$P(t) = P_{single} + P_{common\ cause} = 4 \cdot P_1(t) \cdot P_2(t) \cdot P_3(t) + P_{DUC}(t) + P_{DDC}(t) \qquad (1)$$

The index DUC denotes a dangerous undetected common-cause failure, whereas DDC denotes a dangerous
detected common-cause failure.

3.1 Calculation of probability of normal failures
As already mentioned, the 2oo4-system is 2-failure tolerant. Before we calculate the probability of
normal failures for a 2oo4-system, we should consider the probability for a simpler system with the
same failure tolerance, e.g. a 1oo3-system. For a 1oo3-system to fail with normal failures, each of
the three channels must have a dangerous failure. The probability for this case is the product of
the probabilities of failure of the individual channels. The following equation results:

$$P_{normal}(t) = P_1(t) \cdot P_2(t) \cdot P_3(t) \qquad (2)$$

P_i(t) describes the probability of failure of the i-th channel with the failure rate

$$\lambda = \lambda_{Di} \qquad (3)$$

for a dangerous, normal failure in channel i, and the probability of failure is

$$P_i(t) = 1 - e^{-\lambda_{Di} \cdot t} \qquad (4)$$

If equations (4) and (2) are used with the generally applicable PFDavg equation

$$PFD_{avg}(T) = \frac{1}{T} \int_0^T P(t)\, dt \qquad (5)$$

we get the result

$$\begin{aligned}
PFD_{avg,normal}(T) = 1 &+ \frac{e^{-\lambda_{D1} T} - 1}{\lambda_{D1} T} + \frac{e^{-\lambda_{D2} T} - 1}{\lambda_{D2} T} + \frac{e^{-\lambda_{D3} T} - 1}{\lambda_{D3} T} \\
&- \frac{e^{-(\lambda_{D1} + \lambda_{D2}) T} - 1}{(\lambda_{D1} + \lambda_{D2}) T} - \frac{e^{-(\lambda_{D1} + \lambda_{D3}) T} - 1}{(\lambda_{D1} + \lambda_{D3}) T} - \frac{e^{-(\lambda_{D2} + \lambda_{D3}) T} - 1}{(\lambda_{D2} + \lambda_{D3}) T} \\
&+ \frac{e^{-(\lambda_{D1} + \lambda_{D2} + \lambda_{D3}) T} - 1}{(\lambda_{D1} + \lambda_{D2} + \lambda_{D3}) T} \qquad (6)
\end{aligned}$$
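
As a numerical cross-check, the closed form of equation (6) can be compared with a direct numerical evaluation of the integral in equation (5). The Python sketch below is illustrative only: the function names and the example failure rates are assumptions, and the closed form is written as a sum over all subsets of channels, which for three channels gives exactly the eight terms of equation (6).

```python
import math
from itertools import combinations


def pfd_avg_closed_form(rates, T):
    """Average over [0, T] of the product of (1 - exp(-rate * t)) for all rates."""
    total = 1.0
    for k in range(1, len(rates) + 1):
        sign = (-1) ** k
        for subset in combinations(rates, k):
            s = sum(subset)
            # The average of exp(-s*t) over [0, T] is (1 - exp(-s*T)) / (s*T).
            total += sign * (-math.expm1(-s * T)) / (s * T)
    return total


def pfd_avg_numeric(rates, T, steps=20000):
    """Midpoint-rule evaluation of (1/T) * integral of P(t) dt."""
    dt = T / steps
    acc = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        acc += math.prod(1.0 - math.exp(-r * t) for r in rates)
    return acc / steps


rates = (1e-5, 2e-5, 3e-5)  # hypothetical dangerous failure rates per hour
T = 8760.0                  # hypothetical time period in hours
closed = pfd_avg_closed_form(rates, T)
numeric = pfd_avg_numeric(rates, T)
print(f"closed form: {closed:.6e}, numeric: {numeric:.6e}")

# For equal rates and a small product lambda*T the result approaches the
# cubic approximation (lambda*T)**3 / 4 stated in equation (7).
lam, T_small = 1e-6, 1e4
ratio = pfd_avg_closed_form((lam, lam, lam), T_small) / ((lam * T_small) ** 3 / 4)
print(f"ratio to cubic approximation: {ratio:.4f}")
```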
This function can be expanded into a power series by means of a Taylor expansion (more precisely, a
Maclaurin series). The condition that PFDavg,single(T) is a continuous function which has a removable
singularity at T = 0, so that all derivatives exist at this point, can be proved, e.g. in [3], [4].
After some calculation, see also [3], [4], we get the result

$$PFD_{avg,normal}(T) = \frac{(\lambda_D)^3 \cdot T^3}{4} \qquad (7)$$

This is the probability of failure on demand for a 1oo3-system for normal failures. Be aware that the
parameter T is not equivalent to the parameter T1 (proof check interval) in IEC/EN 61508, see [1].
T1 is only a part of T!
For the calculation of the PFDavg value of a 2oo4-system in the case of normal failures we can use
equation (7) of the 1oo3-system. This equation must be multiplied by a factor of four, because with
four channels there are four possible combinations of three failed channels (remember that the
2oo4-system is two-failure tolerant).
The probability of failure of the 2oo4-system for normal failures is

$$PFD_{avg,normal}(T) = (\lambda_D)^3 \cdot T^3 \qquad (8)$$

3.2 Calculation of probability of common-cause failures
Now we want to calculate the failure probability for dangerous undetectable and dangerous detectable
common cause failures, P_DUC and P_DDC. Common cause failures are those failures that occur in all
system channels at the same time and which have a common cause. When determining the PFDavg, this
kind of failure is rated for a multi-channel system through the β-factor. One differentiates between
the β-factor for dangerous undetectable failures, with the weight β, and the β-factor for dangerous
detectable failures, with the weight β_D. To calculate the common cause part of the total
probability, you have to add the failure probabilities P_DUC and P_DDC:

$$P_\beta(t) = P_{DUC}(t) + P_{DDC}(t) \qquad (9)$$

Analogously, these failure probabilities can be derived for a 1oo1-system with
λ_D,1oo1 = β · λ_DU and λ_D,1oo1 = β_D · λ_DD respectively. A random common cause failure represents a
1oo1 function block! Therefore it is possible to apply the derived PFDavg equation of the 1oo1-system
to the calculation of the probability of common cause failure, see [3], [4]. The general solution for
the probability of failure is

$$PFD_{avg} = \frac{\lambda_D \cdot T}{2} \qquad (10)$$

Since we have two common cause failure modes, λ_DUC = β · λ_DU and λ_DDC = β_D · λ_DD, and with the
two assumptions that

• a dangerous undetected common cause failure occurs within the time period T1 + MTTR (T1 means the
proof test interval, MTTR means the mean time to repair) and
• a dangerous detected common cause failure occurs within the repair time MTTR,

we can calculate the PFDavg value for common cause failures as

$$PFD_{avg,\beta} = \frac{\beta \cdot \lambda_{DU}}{2} (T_1 + MTTR) + \frac{\beta_D \cdot \lambda_{DD}}{2} \cdot MTTR \qquad (11)$$

3.3 PFDavg-equation for a 2oo4-system
The PFDavg equation of a 2oo4-system, taking into account the normal failures, equation (8), and the
common cause failures, equation (11), is therefore:

$$PFD_{avg} = (\lambda_D)^3 \cdot T^3 + \frac{\beta \cdot \lambda_{DU}}{2} (T_1 + MTTR) + \frac{\beta_D \cdot \lambda_{DD}}{2} \cdot MTTR \qquad (12)$$

4 Markov-model of a 2oo4-architecture
The Markov model of a 2oo4 "Single-Board System" is basically set up with conventional calculation
methods. The individual transitions are shown in figure 5.
State 0 represents correct operation of all 4 channels. State 1 is the safe state into which the
system passes if a safe failure occurs. The transition rate from state 0 to state 1 is 4 · λ_S,
because a safe failure is possible in each of the four channels. The different transitions are
described on the basis of state 3. The same considerations apply to all other states.

[Figure: Markov state diagram with states 0 to 15; each state lists the status of the four channels (Sys. OK, Sys. DD or Sys. DU); the transitions are labelled with the rates 4·λ_S, 4·λ_DD, 4·λ_DU, 3·λ_DD, 3·λ_DU, 2·λ_DD, 2·λ_DU, λ_DD, λ_DU, β_D·λ_DD, β·λ_DU, µ0, µR and µLT]

Fig. 5: 2oo4-Markov-model

In state 3 one of the four channels is operating with a failure. This failure is dangerous and is not
recognized by the failure diagnostics. The transition rate between the states 0 and 3 has the value
4 ⋅ λDU, as a dangerous undetected failure can occur in any one of the four channels. No transition
exists for the system from state 3 into the safe state 1 because the failure cannot be recognized
within the test interval τTest = 1 / µ0. From state 3 a transition takes place into state 5 or state 6
if a failure occurs in one of the channels that were failure free until then. The system can only
change to state 0 again, where the system is failure free, after τLT, provided that no further
failures occur while the system is in state 3. In practice this means: after time τLT the complete
system is exchanged.
If the second failure in state 3 is a dangerous detected failure, then a transition takes place into
state 5. The transition rate is 3 ⋅ λDD. Therefore, in state 5 a dangerous undetected failure exists
in one channel while at the same time a dangerous detected failure has occurred in one of the other
three channels. If no further dangerous failures occur while the system is in state 5, the dangerous
detected failure is revealed within the test interval, and state 5 changes with transition rate
µ0 = 1 / τTest into the safe state 1. The system enters state 6 if a second dangerous undetected
failure occurs in one of the other three channels while the system is in state 3. The transition rate
is 3 ⋅ λDU. State 6 is characterized by two dangerous undetected failures, one in each of two of the
four channels. No transition exists for the system from state 6 into the safe state 1 because the
failures are not recognized within the test interval τTest = 1 / µ0.
Because the failure is detected within the test interval τTest, a transition exists for the system
from the states 7, 8, 9, 11, 12, 13 and 15 into the safe state 1. The transition rate for this
transition is µ0 = 1 / τTest.
The following two cases can be differentiated if a common cause failure occurs in a 2oo4-system:
• The common cause leads to dangerous detected failures. Then a transition exists from state 0
  directly into state 11. The transition rate is βD ⋅ λDD.
• The common cause leads to dangerous undetected failures. Then a transition exists from state 0
  directly into state 14. The transition rate is β ⋅ λDU.

In summary we can note the following:

• If state 7 occurs the system immediately switches to state S.
• Failures that bring the 2oo4-system into the states 8, 9, 12, 13 and 15 result in a transition of
  the system into the safe state 1 after a time which is smaller than 4 ⋅ τTest. The transition rate
  from these states into state 1 is always equal to µ0 = 1 / τTest.
• The states 1, 7, 11, 12, 13, 14 and 15 are absorbing states. That means these states only have a
  transition to the safe state or to the state "system fully operational", and no further transitions
  exist.
In the states 0, 2, 3, 4, 5, 6, 8, 9 and 10 the system is operational. These states must be taken into
account for the MTTF calculation of the 2oo4-system.

4.1 Calculation of MTTF-value for a 2oo4-system
A transition matrix P exists for the 2oo4 Markov model. This transition matrix is a 16 x 16 matrix,
see [3], [4], because we have 16 states.
The P matrix is the basis for the Q matrix. The elements of the Q matrix are composed of the
respective probability densities, where the corresponding states meet the following criteria:

• System operational
• Non absorbing state.
An operational system is possible for a 2oo4-system in the states 0, 2, 3, 4, 5, 6, 8, 9 and 10. The
states 1, 7, 11, 12, 13, 14 and 15 should not be considered during the MTTF calculation, as they are
absorbing states. Therefore the Q matrix has a 9 x 9 matrix form, see [3], [4].
For the considered Markov model we make the assumption τLT = ∞. It follows that

$$\mu_{LT} = \frac{1}{\tau_{LT}} = 0 \tag{13}$$

The next step is to calculate the M-matrix. We get the M-matrix with the following formula:

$$I - Q = \begin{bmatrix} M_1 & M_2 \\ M_3 & M_4 \end{bmatrix} \cdot dt = M \cdot dt \tag{14}$$

For the 2oo4-system the M-matrix is also a 9 x 9 matrix. Now we can calculate the N-matrix. The
N-matrix needs to be composed to derive the MTTF value of the system. The N-matrix is the inverse
matrix of the M-matrix.
The MTTF value describes the mean time between the occurrences of two failures. One assumes state 0 at
the start time, i.e. the state in which the system operates failure free. After the inversion the
elements of the new matrix represent time dependent values. One needs to sum the first row of the
N-matrix in order to derive the MTTF value of the system. The MTTF term of a 2oo4-system has the
following form, see also [3], [4]:

$$MTTF_{2oo4} = \frac{1}{A_1} + \frac{4 \lambda_{DD}}{A_1 A_2} + \frac{4 \lambda_{DU}}{A_1 A_3} + \frac{12 \lambda_{DD}^2}{A_1 A_2 A_4} + A_{11} + \frac{12 \lambda_{DU}^2}{A_1 A_3 A_6} + A_{12} + A_{13} + A_{14} \tag{15}$$

with
$$\begin{aligned}
A_1 &= 4 \lambda_S + 4 \lambda_{DD} + 4 \lambda_{DU} + \beta_D \lambda_{DD} + \beta \lambda_{DU} \\
A_2 &= \mu_0 + 3 \lambda_{DD} + 3 \lambda_{DU} \\
A_3 &= \mu_{LT} + 3 \lambda_{DD} + 3 \lambda_{DU} \\
A_4 &= \mu_0 + 2 \lambda_{DD} + 2 \lambda_{DU} \\
A_5 &= \mu_0 + 2 \lambda_{DD} + 2 \lambda_{DU} \\
A_6 &= \mu_{LT} + 2 \lambda_{DD} + 2 \lambda_{DU} \\
A_7 &= \mu_0 \\
A_8 &= \mu_0 + \lambda_{DU} \\
A_9 &= \mu_0 + \lambda_{DD} + \lambda_{DU} \\
A_{10} &= \mu_{LT} + \lambda_{DD} + \lambda_{DU} \\
A_{11} &= \frac{12 \lambda_{DD} \lambda_{DU} (A_2 + A_3)}{A_1 A_2 A_3 A_5} \\
A_{12} &= \frac{24 \lambda_{DD}^2 \lambda_{DU} (A_2 A_4 + A_3 A_4 + A_3 A_5)}{A_1 A_2 A_3 A_4 A_5 A_8} \\
A_{13} &= \frac{24 \lambda_{DU}^2 \lambda_{DD} (A_2 A_5 + A_2 A_6 + A_3 A_6)}{A_1 A_2 A_3 A_5 A_6 A_9} \\
A_{14} &= \frac{24 \lambda_{DU}^3}{A_2 A_3 A_6 A_{10}} \\
A_{15} &= \frac{6 \lambda_{DD} \lambda_{DU} (A_4 + A_5)}{A_2 A_4 A_5 A_8} \\
A_{16} &= \frac{6 \lambda_{DD} \lambda_{DU} (A_5 + A_6)}{A_3 A_5 A_6 A_9}
\end{aligned}$$

5 Conclusion
The safer 2oo4-architecture will become established in computers of high safety classes in the
future. Such computers will be applied in various fields which require both availability and maximum
safety at the same time. They are applied where human lives need to be protected and/or saved, whether
in material handling, energy production and distribution, the medical field or future industrial
power plants in space.
As already mentioned in the introduction, today's technical systems are becoming more and more
complex. Humans will no longer be able to provide appropriate safety in the processes which have to be
monitored. Future safety control systems must support them, either in recording and analysing data or
in the operations resulting from this. Advanced safety architectures like the 2oo4-system introduced
here have to be used in order to guarantee the required safety. This system combines the benefits of
the 1oo2- and the 2oo3-system: a higher availability and, at the same time, a higher safety than
today's systems.

References:
[1] IEC/EN 61508: International Standard 61508 Functional safety: Safety-related System, Geneva,
    International Electrotechnical Commission
[2] Börcsök, J.: International and EU Standard 61508, Presentation within the VD Conference of HIMA
    GmbH + CO KG, 2002
[3] Börcsök, J.: Elektronische Sicherheitssysteme, Hüthig publishing company, 2004.
[4] Börcsök, J.: Electronic Safety Systems, Hüthig publishing company, 2004.
[5] Börcsök, J.: Sicherheits-Rechnerarchitektur Teil 1 und 2, lecture of University of Kassel,
    2000/2001.
[6] Börcsök, J.: Echtzeitbetriebssysteme für sicherheitsgerichtete Realzeitrechner, lecture of
    University of Kassel, 2000/2001.
[7] DIN VDE 0801: Funktionale Sicherheit, sicherheitsbezogener elektrischer/elektronischer/
    programmierbarer elektronischer Systeme (E/E/PES), (IEC 65A/255/CDV: 1998), Page: 27f, August 1998
[8] DIN V 19250: Grundlegende Sicherheitsbetrachtungen für MSR-Schutzeinrichtungen, Beuth publishing
    company, Berlin 1998
[9] DIN VDE 0801/A1: Grundsätze für Rechner in Systemen mit Sicherheitsaufgaben, Beuth publishing
    company
[10] IEC 60880-2: Software für Rechner mit sicherheitskritischer Bedeutung, 12/2001
White paper
08



White Paper
The effect of diagnostic and periodic proof
testing on the availability of programmable
safety systems

Date: 15 June 2006


Author(s): M.J.M. Houtermans, W. Velten-Philipp

Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com

RISKNOWLOGY Experts in Risk, Reliability and Safety


© 2002 - 2007 Risknowlogy B.V.

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, and Functional Safety Data Sheet are registered service marks.

The effect of diagnostic and periodic proof testing on the availability of programmable safety systems

M.J.M. Houtermans 1
Risknowlogy B.V., Brunssum, The Netherlands
W. Velten-Philipp
TUV Rheinland, Cologne, Germany

Abstract
The purpose of this paper is to show the effect that diagnostic and periodic proof testing have on
the availability of the safety function carried out by programmable electronic systems. For three
different architectures the influence of the diagnostic coverage, the proof test coverage, and the
proof test interval on the probability of failure on demand is determined. Performance indicators are
used to express this influence and show the effect.

1 Introduction
Safety systems carry out one or more safety functions. Each safety function consists of a sensing
element, a logic solving element and an actuation element. Typical sensing elements are for example
sensors, switches or emergency push buttons. The logic solving element is usually a general-purpose
safety computer which can carry out several safety functions at once. Valves, pumps or alarms are
typical actuating elements of a safety function. The performance of these safety functions is
determined by several design parameters of the individual components of the safety functions. In [1]
the following design parameters of the safety function are identified:
• Architecture
• Hardware failures
• Software failures
• Systematic failures
• Common cause failures
• Online diagnostics, and
• Periodic test intervals.

The performance of a safety function can be expressed as the probability of failure on demand
(PFD) and the probability of fail safe (PFS), i.e., of a spurious trip. Both attributes are important
in the safety world, as their values represent respectively a measure of the level of safety achieved
and of the financial loss caused by the safety system because of spurious trips. The PFD value is one
requirement to meet the safety integrity level of the IEC 61508 standard [2]. For the PFS value there
are currently no requirements in the international safety world, although end-users of safety systems
require a PFS value that is as low as possible.

1 Corresponding author: m.j.m.houtermans@risknowlogy.com


The purpose of this paper is to show the effect that two of the above mentioned design parameters,
namely online diagnostics and periodic proof testing, have on the performance of the safety function
in terms of the PFD. For three different architectures the influence of the diagnostic coverage, the
proof test coverage, and the proof test interval on the PFD is determined. A performance indicator is
used to express this influence and show the effect.

2 Diagnostic test versus periodic proof test


Before we can discuss the influence of diagnostic tests and periodic proof tests on the performance
of the safety functions we first need to explain the difference between the two. To diagnose
something means "to distinguish through knowledge" [3]. In the safety world we use diagnostic tests
to identify failures within the system which otherwise would not be revealed. In other words, a
diagnostic test is performed to find failures inside the safety system. But only revealing the
failures is not sufficient. Once a failure is found, a decision needs to be made on what to do with
that failure. Typical decisions are to shut down, or to switch to a degraded mode if the safety
system has sufficient redundancy. It is also possible just to notify an operator if the detected
failure is not significant. A diagnostic test is not clearly defined in IEC 61508, but wherever it is
mentioned it is clear that we are dealing with a test that is automatic.

The proof test is defined in IEC 61508 as well as in IEC 61511 [2,4]. IEC 61508 defines the proof
test as follows:

“Periodic test performed to detect failures in a safety-related system so that, if necessary,
the system can be restored to an “as new” condition or as close as practical to this
condition”

IEC 61511 has a similar definition:

“test performed to reveal undetected faults in a safety instrumented system so that, if
necessary, the system can be restored to its designed functionality”.

In other words, both diagnostic tests and periodic proof tests try to detect failures inside the
safety system. At first sight there seems to be no difference, yet the actual difference is quite an
important one. Note 3 of paragraph 7.4.3.2.2 of IEC 61508-2 explains that a test is a diagnostic test
if the test rate is at least an order of magnitude higher than the expected demand rate. Based on this
extra information we can conclude that, in theory, a test to detect failures is called a diagnostic
test if the test is carried out automatically and at least an order of magnitude more often than the
expected demand rate. In all other cases we can refer to a test as a proof test.
The difference between a proof test and a diagnostic test is also important in the case of a safety
system with a single architecture, i.e., hardware fault tolerance zero. In this case a proof test is
only sufficient if the safety function operates in a low demand mode. In the case of a high demand
safety function, diagnostic tests are required that are able to detect dangerous faults within the
process safety time (see IEC 61508-2, chapter 7.4.3.2.5). In other words, you cannot build a single
safety system for a high demand function that depends only on proof tests.
In practice we define a test as a diagnostic test if it fulfils the following three criteria:

1. It is carried out automatically (without human interaction) and frequently (related to the
process safety time considering the hardware fault tolerance) by the system software and/or
hardware;
2. The test is used to find failures that can prevent the safety function from being available; and
3. The system automatically acts upon the results of the test.
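
The three criteria can be read as a simple predicate. The sketch below is only an illustration: the class and field names are hypothetical, and the rule that "frequently" means a test interval no longer than the process safety time is an assumption made here for a channel without hardware fault tolerance, not a requirement quoted from IEC 61508.

```python
from dataclasses import dataclass


@dataclass
class CheckDescription:
    """Hypothetical description of a test; all field names are illustrative."""
    automatic: bool                 # criterion 1: runs without human interaction
    interval_s: float               # time between two executions of the test
    finds_dangerous_failures: bool  # criterion 2: targets failures that block the safety function
    automatic_reaction: bool        # criterion 3: the system acts on the result by itself


def is_diagnostic_test(check: CheckDescription, process_safety_time_s: float) -> bool:
    """Return True if the test meets all three practical criteria.

    Assumption: "frequently" is modelled as interval <= process safety time.
    """
    frequent = check.interval_s <= process_safety_time_s
    return (check.automatic and frequent
            and check.finds_dangerous_failures and check.automatic_reaction)


# A memory check that runs every 100 ms and trips the system on failure.
memory_check = CheckDescription(True, 0.1, True, True)
# A manual valve stroke test performed every three months.
valve_stroke = CheckDescription(False, 90 * 24 * 3600.0, True, False)

print(is_diagnostic_test(memory_check, process_safety_time_s=1.0))
print(is_diagnostic_test(valve_stroke, process_safety_time_s=1.0))
```

Under these assumptions the memory check qualifies as a diagnostic test and the valve stroke test does not, so it would be treated as a proof test.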

Typical examples of a diagnostic test are a CPU test, a memory check or program flow control
monitoring. A proof test on the other hand is a test which is carried out manually and where it is
often necessary to have external hardware and/or software to determine the result of the test. The
interval between proof tests is much longer than the process safety time and orders of magnitude
longer than the period chosen for diagnostic tests. A typical example of a proof test is full or
partial valve stroke testing. Partial valve stroke testing is seldom carried out without some form of
human interaction (in other words we need to depend on the human to carry out the test, to determine
the actual results of the test and/or to take the appropriate action based on the results) and often
needs additional equipment to be carried out.
The advantage of a diagnostic test over a proof test is that failures can be detected very quickly.
If a proof test is carried out once in three months then there is a possibility that the safety
function is running with a failure for three months before we find out about it. With diagnostics we
often know about the problem within milliseconds and thus can repair the failure very quickly. On the
other hand, good diagnostics require a more complicated design and additional hardware and software
built into the system. This additional hardware and software is often difficult to build and costs
extra.
There is another important reason to make a distinction between diagnostic tests and periodic proof
tests. IEC 61508 requires the calculation of the safe failure fraction (SFF) for subsystems. The safe
failure fraction of a subsystem is defined as the ratio of the average rate of safe failures plus
dangerous detected failures of the subsystem to the total average failure rate of the subsystem, see
the formula below:

$$SFF = \frac{\lambda_{SD} + \lambda_{SU} + \lambda_{DD}}{\lambda_{SD} + \lambda_{SU} + \lambda_{DD} + \lambda_{DU}}$$

A high safe failure fraction can be accomplished if we either have a lot of safe failures (detected or
undetected does not really matter for the SFF) or if we can detect a lot of the possible dangerous
failures. Only failures detected by diagnostic tests can be accounted for in the safe failure fraction
calculation. Failures detected by periodic proof tests cannot be accounted for in the safe failure
fraction calculations. This is logical of course, as we do not want to count on humans to carry out our
safety.
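
As a small worked example of the safe failure fraction, the sketch below evaluates the formula for one subsystem. The function name and the failure rates are hypothetical values chosen only for illustration. They are not data from this paper.

```python
def safe_failure_fraction(lambda_sd, lambda_su, lambda_dd, lambda_du):
    """SFF = (safe + dangerous detected) / total failure rate.

    All rates must use the same unit, for example failures per hour.
    Only failures detected by diagnostic tests may be counted in lambda_dd.
    """
    total = lambda_sd + lambda_su + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_sd + lambda_su + lambda_dd) / total


# Hypothetical subsystem: 50E-6 /h safe failures (split evenly between detected
# and undetected) and 50E-6 /h dangerous failures, of which 90% are detected.
sff = safe_failure_fraction(25e-6, 25e-6, 45e-6, 5e-6)
print(f"SFF = {sff:.2%}")
```

With these numbers only the 5E-6 /h of dangerous undetected failures count against the SFF. Improving a proof test would not change the result, because failures found by proof tests stay in the dangerous undetected term.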

3 Architectures
To show the effect that diagnostic and proof tests have on safety functions we introduce three
common safety system architectures. The presented architectures are simplified but represent common
structures that are used to implement safety functions. In practice these systems are much more
complex and can consist of many more components. The three basic architectures presented can be
characterized by their redundancy and voting properties, i.e., as XooY (“X out of Y”), and are
• The 1oo1 architecture;
• The 1oo2 architecture; and
• The 2oo3 architecture.

Each architecture consists of one or more sensors, one or more logic solvers, and one or more
actuating elements. The following paragraphs explain the three architectures in more detail.

3.1 The 1oo1 Architecture


The 1oo1 architecture is the simplest safety system around and consists of one single channel (see
Figure 1). If any of the components within a channel fails dangerously the safety function cannot be
executed anymore. If any of the components fail safe the safety function will be executed and a
spurious trip will result.

S1 LS1 A1

Fig. 1. Functional block diagram 1oo1

3.2 The 1oo2 Architecture


The 1oo2 architecture, see Figure 2, consists of two channels, where each channel can execute the
safety function by itself. If one channel fails dangerously the other channel is still able to execute the
safety function and thus the safety system is still available. If one channel has a safe failure the safety
function will be executed and a spurious trip will follow.

S1 LS1 A1

S2 LS2 A2

Fig. 2. Functional block diagram 1oo2

3.3 The 2oo3 Architecture


The third architecture often used in practice is a 2oo3 voter structure, see Figure 3. This
architecture consists of three independent channels. Each channel can carry out the safety function.
The execution of the safety function requires two equal votes from the available channels. That means
the safety function is executed if two channels have the same result. In other words, we need
dangerous failures in two channels before the safety system fails. We also need safe failures in two
channels before the safety function is executed and a spurious trip results.

S1 LS1 A1

S2 LS2 A2

S3 LS3 A3

Fig. 3. Functional block diagram 2oo3
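
The voting behaviour of the three architectures can be illustrated with a few lines of code. This is a minimal sketch under simplifying assumptions: each channel is treated as one unit that is either working ("ok"), failed dangerously (never votes to trip) or failed safe (always votes to trip). The function names and state labels are hypothetical.

```python
def channel_vote(state, demand):
    """Return True if a single channel votes to execute the safety function."""
    if state == "safe":        # safe failure: channel votes to trip regardless of the process
        return True
    if state == "dangerous":   # dangerous failure: channel can no longer vote to trip
        return False
    return demand              # healthy channel follows the real process demand


def system_trips(states, m, demand):
    """MooN voting: the safety function is executed if at least m channels vote to trip."""
    votes = [channel_vote(s, demand) for s in states]
    return sum(votes) >= m


# 1oo2: one dangerous failure is tolerated, one safe failure causes a spurious trip.
print(system_trips(["dangerous", "ok"], m=1, demand=True))
print(system_trips(["safe", "ok"], m=1, demand=False))

# 2oo3: one failure of either kind is tolerated, two are not.
print(system_trips(["dangerous", "ok", "ok"], m=2, demand=True))
print(system_trips(["dangerous", "dangerous", "ok"], m=2, demand=True))
print(system_trips(["safe", "ok", "ok"], m=2, demand=False))
print(system_trips(["safe", "safe", "ok"], m=2, demand=False))
```

The sketch shows why the 1oo2 structure improves safety at the cost of spurious trips, while the 2oo3 structure tolerates one failure of either kind.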

4 Evaluation Procedure
The procedure used to calculate the PFD is outlined in detail in [1] and is in short as follows. A
functional block diagram of the hardware of the safety system is drawn. For each element of the block
diagram the typical failure modes are listed. The effect on system level is determined for each of
them. This information is used to construct the reliability model, which is in this case a Markov model
[5,6]. The last step in the procedure includes the quantification of the models by adding the failure and
repair rates of the different

5 Reliability Data
One of the objectives of this study was to make sure that it was possible to compare the results for
the different architectures. The actual value of the outcome is not so important, as we are more
interested in the relative results. All calculations are carried out with the values in Table 1
(reference model) unless otherwise noted in the specific study.

Table 1. Default reliability data

Component                           Sensor   Logic   Actuator
Failure rate [/h]                   35E-6    15E-6   50E-6
Safe failures [%]                   50       50      50
Diagnostic coverage [%]             0        0       0
Online repair [h]                   8        8       8
Repair after spurious [h]           24       24      24
Periodic proof test interval [y]    none     none    none
Proof test coverage [%]             100      100     100
Mission time [h]                    87600
Common cause 2 failures [%]         0.0225
Common cause 3 failures [%]         0.00225
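
To make the use of Table 1 concrete, the sketch below splits the component failure rates into safe, dangerous detected and dangerous undetected parts and sums them for one channel. It assumes that the three components of a channel are in series and that the diagnostic coverage applies to the dangerous failures only. Both are simplifications made here, and the function names are hypothetical.

```python
# Failure rates per hour, taken from Table 1.
TOTAL_RATES = {"sensor": 35e-6, "logic": 15e-6, "actuator": 50e-6}


def split_failure_rate(total_rate, safe_percent, dc_percent):
    """Split a total failure rate into (lambda_S, lambda_DD, lambda_DU)."""
    lambda_s = total_rate * safe_percent / 100.0
    lambda_d = total_rate - lambda_s
    lambda_dd = lambda_d * dc_percent / 100.0   # detected by diagnostics
    lambda_du = lambda_d - lambda_dd            # left for proof tests to find
    return lambda_s, lambda_dd, lambda_du


def channel_rates(safe_percent=50.0, dc_percent=0.0):
    """Sum the split rates of sensor, logic and actuator for one channel."""
    parts = [split_failure_rate(rate, safe_percent, dc_percent)
             for rate in TOTAL_RATES.values()]
    return tuple(sum(column) for column in zip(*parts))


lambda_s, lambda_dd, lambda_du = channel_rates()
print(f"lambda_S={lambda_s:.1e}/h lambda_DD={lambda_dd:.1e}/h lambda_DU={lambda_du:.1e}/h")
```

With the default data (50% safe failures, no diagnostics) one channel has a dangerous undetected failure rate of 50E-6 per hour.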

6 Performance Indicators
In order to examine the results and the effects of changing the diagnostic coverage, the proof test
coverage and other parameters, performance indicators are introduced. A performance indicator helps
us understand what impact changing a parameter has on the PFD value of the system at hand. The
performance indicator (PI) is calculated relatively simply. We change the parameter, for example the
diagnostic coverage, stepwise from its minimum value to its maximum value. This results in PFD values
that change according to the value of the parameter. To determine the impact of the parameter on the
PFD value we calculate the change of the PFD relative to the 50% value of the parameter. In the case
of diagnostic coverage (DC), 50% means that 50% of the failures are detected by diagnostics. The 50%
value is in this case the reference value that we use to normalize the PFD value to 1. To get an
impression of the influence below 50% DC and above 50% DC we choose 25% DC and 75% DC for the PFD
values to compare with.
The reason we do this is that we want to determine whether changing a parameter has a lot of or only
a little effect on the PFD. In other words, should we have in real life a system that does not meet
our PFD target, then we know which design parameter we have to address in order to make the changes
that have the most influence and thus the fastest results. We don’t want to spend time and money on
improving parameters that show little or no effect at all.
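
The calculation of the performance indicator can be sketched as follows. The exact ratio is not spelled out in the text, so the sketch assumes one plausible reading: the PI for the lower range is PFD(25%) / PFD(50%) and the PI for the upper range is PFD(50%) / PFD(75%), so that both values are larger than 1 when a higher parameter value lowers the PFD. The PFD numbers in the example are hypothetical.

```python
def performance_indicators(pfd_25, pfd_50, pfd_75):
    """Return (PI for 25%-50%, PI for 50%-75%) relative to the 50% reference.

    Assumed definition: each PI is the ratio of the PFD at the lower
    parameter value to the PFD at the higher parameter value.
    """
    if min(pfd_25, pfd_50, pfd_75) <= 0:
        raise ValueError("PFD values must be positive")
    return pfd_25 / pfd_50, pfd_50 / pfd_75


# Hypothetical PFD values for 25%, 50% and 75% of some design parameter.
pi_low, pi_high = performance_indicators(0.30, 0.20, 0.10)
print(f"PI 25%-50%: {pi_low:.2f}, PI 50%-75%: {pi_high:.2f}")
```

A PI close to 1 means that the parameter has little influence in that range. In this hypothetical example the PI is larger in the upper range, so improving the parameter beyond 50% pays off more than improving it from 25% to 50%.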

7 Calculation results
7.1 Calculation results with and without diagnostics – no proof test
In this paragraph we study the influence that diagnostics have on the PFDavg values for the different
system architectures as presented in paragraph 3. The diagnostic coverage is varied over the
following percentages: 0%, 25%, 50%, 75%, and 99%. The results are presented in the figure below.

[Plot: PFDavg for variable DC. Logarithmic probability axis (0.0001 to 1) versus diagnostic coverage
(0 to 1), with one curve each for the 1oo1, 1oo2 and 2oo3 architectures.]

Fig. 4. PFDavg in relation to diagnostic coverage

Next we calculate the performance indicator of the diagnostic coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 5 shows the change of the PFDavg for 25% DC
compared to 50% DC and the change for 75% DC compared with 50% DC. The numbers are
normalized with the 50% value.

[Bar chart: Compare for variable DC. PI-PFD (axis 0.00 to 3.00) for the 1oo1, 1oo2 and 2oo3
architectures over the ranges 25%-50% and 50%-75%. Bar labels shown: 2.73, 2.08, 1.70, 1.52, 1.40,
1.19.]

Fig. 5. Performance indicators for variable diagnostic coverage

From Figure 4 and Figure 5 we can draw the following conclusions:

• The diagnostic coverage factor has a significant influence on the PFDavg value. For all
  architectures, improving the diagnostic coverage from 0 to 100% improves the PFDavg by a factor of
  approximately 10 to 1000;
• The 1oo1 system is the least sensitive to modifications of the diagnostic coverage;
• The 1oo2 system is the most sensitive to modifications of the diagnostic coverage factor, for low
  diagnostic coverage factors as well as for high values;
• Both the 1oo2 and 2oo3 are one-fault-tolerant systems. The difference in performance can be
  explained by probability theory. Both systems need two failures in two different channels to lose
  the safety function. The 2oo3 system, though, has three possible combinations of two failures,
  while the repair is equally fast in both systems;
• For all three architectures, improvements have the most effect above 50% diagnostic coverage.

In other words, all architectures are sensitive to the diagnostic coverage factor, and increasing the
diagnostic coverage factor can make major improvements. The redundant structures are much more
sensitive than the single structure. Diagnostic coverage really makes an impact on the PFDavg value
when it is over 50%, preferably over 75%.
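
The trend for the 1oo1 architecture can be reproduced roughly with a closed-form estimate instead of the Markov model used in this paper. The sketch below is a simplification under stated assumptions: undetected dangerous failures accumulate over the whole mission time without proof test, detected dangerous failures contribute only during the online repair time, and the two contributions are simply added. The function name is hypothetical. The input values are the channel data from Table 1.

```python
import math


def pfd_avg_1oo1_no_proof_test(lambda_d, dc, mission_time_h, repair_time_h):
    """Rough PFDavg of a single channel without proof test (not the paper's Markov model)."""
    lambda_dd = dc * lambda_d
    lambda_du = (1.0 - dc) * lambda_d
    x = lambda_du * mission_time_h
    # Time average of 1 - exp(-lambda_du * t) over the mission time.
    undetected = 1.0 + math.expm1(-x) / x if x > 0 else 0.0
    # Detected failures make the channel unavailable for about the repair time.
    detected = lambda_dd * repair_time_h
    return undetected + detected


LAMBDA_D = 50e-6    # dangerous failure rate of one channel [/h], from Table 1
MISSION_H = 87600   # mission time [h], from Table 1
REPAIR_H = 8        # online repair time [h], from Table 1

results = {dc: pfd_avg_1oo1_no_proof_test(LAMBDA_D, dc, MISSION_H, REPAIR_H)
           for dc in (0.0, 0.25, 0.5, 0.75, 0.99)}
for dc, pfd in results.items():
    print(f"DC = {dc:4.0%}  PFDavg = {pfd:.4f}")

ratio_low = results[0.25] / results[0.5]
ratio_high = results[0.5] / results[0.75]
print(f"PFD(25%)/PFD(50%) = {ratio_low:.2f}, PFD(50%)/PFD(75%) = {ratio_high:.2f}")
```

With the Table 1 data this estimate gives ratios of about 1.19 and 1.51, close to two of the values shown in Figure 5 (1.19 and 1.52). The redundant architectures need a model that includes common cause failures and the combinations of channel failures, which this sketch does not cover.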

7.2 Calculation results for variable proof test coverage


In this paragraph we study the influence that the proof test coverage has on the PFDavg value for the
different architectures as presented in paragraph 3. The proof test coverage was varied over the
following percentages: 0%, 25%, 50%, 75%, and 100%. The results are presented in Figure 6.
Next we calculate the performance indicator of the proof test coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 7 shows the change of the PFDavg for 25% test
coverage compared to 50% and the change for 75% compared with 50%. The numbers are
normalized with the 50% value.

[Plot: PFDavg for variable proof test coverages. Logarithmic probability axis (0.0001 to 1) versus
proof test coverage (0 to 1), with one curve each for the 1oo1, 1oo2 and 2oo3 architectures.]

Fig. 6. PFDavg in relation to proof test coverage

[Bar chart: Compare for variable Proof Test Coverage. PI-PFD (axis 0.00 to 2.00) for the 1oo1, 1oo2
and 2oo3 architectures over the ranges 25%-50% and 50%-75%. Bar labels shown: 1.96, 1.89, 1.73, 1.70,
1.42, 1.40.]

Fig. 7. Performance indicators for variable proof test coverage

From Figure 6 and Figure 7 we can conclude the following:

• The proof test coverage also helps improve the PFDavg, but not as significantly as the diagnostic
  coverage. An improvement by a factor of 10-100 is achievable;
• Just as with the diagnostic coverage, the redundant architectures react better to proof test
  coverage improvements than the single architecture;
• The sensitivity concerning the PFDavg improvement is nearly the same for all architectures.

The PFDavg is less sensitive to proof test coverage than to diagnostic coverage. This is
understandable, as diagnostics lead to very fast detection and repair compared to proof testing.
Failures detected with diagnostics are detected within seconds, while it can take 1 year or longer
before a proof test is carried out. Failures can thus exist much longer in the system. Therefore the
PFD does on average not improve that much.

7.3 Calculation results with variable proof test interval


In this paragraph we study the influence of the proof test interval on the PFDavg value for the
architectures as presented in paragraph 3. The proof test interval was varied from none to 3 months,
1 year, 2 years, and 5 years with a proof test coverage of 100%. The results are presented in
Figure 8.

[Plot: PFDavg for variable proof test interval. Logarithmic probability axis (0.0001 to 1) versus
proof test interval in years (0 to 10), with one curve each for the 1oo1, 1oo2 and 2oo3
architectures.]

Fig. 8. PFDavg in relation to the proof test interval

Next we calculate the performance indicator of the proof test interval, i.e., how it influences the
PFDavg of the different architectures. Figure 9 shows the change of the PFDavg for a proof test
interval of 1-5 years and a proof test interval of 5-10 years. The numbers are normalized with the 5
years value.

[Bar chart: Compare for variable Proof Test Interval. PI-PFD (axis 0.00 to 8.00) for the 1oo1, 1oo2
and 2oo3 architectures over the ranges 1y-5y and 5y-10y. Bar labels shown: 7.87, 5.79, 3.13, 1.97,
1.65, 1.30.]

Fig. 9. PFDavg for different proof test intervals

From Figure 8 and Figure 9 we can conclude that:

• There is a high sensitivity to the proof test interval, varying approximately by a factor of
  10-1000 depending on the architecture;
• The redundant architectures benefit the most. The shorter the proof test interval, the more
  benefit. This is logical of course, as the shorter the interval, the more the proof test resembles
  a diagnostic test;
• The 1oo2 architecture benefits the most from frequent proof tests;
• There is almost no difference from a probability point of view between performing the proof test
  with a 5 or a 10 years interval.

Just like the diagnostic coverage, the proof test interval has a significant impact on the PFDavg
value for all architectures. It is more important to carry out proof tests frequently than to carry
them out with a high test coverage. In other words, even a simple but frequent proof test can help
reduce the PFDavg value significantly. This is good news for partial stroke testing of valves.
Especially with partial stroke testing of valves we do not know the actual proof test coverage, but if
it is done frequently it will still help significantly.
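
The trade-off between proof test interval and proof test coverage can be illustrated with a commonly used simplified approximation for a single channel. This is not the Markov model used in this paper, and it is only valid when the failure rate multiplied by the time span is much smaller than 1. The function name and the failure rate of 1E-6 per hour are hypothetical values chosen to stay within that range.

```python
def pfd_avg_1oo1_with_proof_test(lambda_du, test_interval_h, test_coverage, mission_time_h):
    """Simplified PFDavg of one channel with an imperfect periodic proof test.

    Failures covered by the proof test stay hidden for half a test interval on
    average; failures the test cannot find stay hidden for half the mission time.
    """
    covered = test_coverage * lambda_du * test_interval_h / 2.0
    uncovered = (1.0 - test_coverage) * lambda_du * mission_time_h / 2.0
    return covered + uncovered


LAMBDA_DU = 1e-6      # hypothetical dangerous undetected failure rate [/h]
MISSION_H = 87600     # mission time [h], from Table 1

# A simple test (60% coverage) every 3 months versus a complete test every 5 years.
frequent_simple = pfd_avg_1oo1_with_proof_test(LAMBDA_DU, 2190, 0.6, MISSION_H)
rare_complete = pfd_avg_1oo1_with_proof_test(LAMBDA_DU, 43800, 1.0, MISSION_H)
print(f"simple test every 3 months: {frequent_simple:.4f}")
print(f"complete test every 5 years: {rare_complete:.4f}")
```

In this example the frequent but incomplete test gives the lower PFDavg. The result depends on the chosen numbers: with a lower coverage, the failures that the test never finds dominate, and the comparison can turn the other way.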

8 Conclusions
The purpose of this paper was to show the effect of the diagnostic coverage, the proof test coverage
and the proof test interval on the PFDavg value for different safety architectures. To get more
insight into the influence of these design parameters a performance indicator was introduced. The
diagnostic coverage and the proof test interval have the most influence on the PFDavg. The proof
test coverage also improves the PFDavg, but less significantly than the other two parameters. The
1oo2 architecture gains the most benefit, while the 1oo1 is the least sensitive. For redundant
architectures there is more chance of finding failures than there is for single architectures.
Therefore they perform better in terms of improving the PFDavg. The authors are currently working on
a more extended version of this paper, taking into account, among other things, more architectures
and the PFS calculation per architecture.

REFERENCES

1. Houtermans, M.J.M., Rouvroye, J.L., (1999) The Influence Of Design Parameters On The
Performance Of Safety- Related Systems. International Conference on Safety, IRRST,
Montreal, Canada
2. IEC 61508 (1999) Functional safety of E/E/PE Safety-related systems, IEC 1999;
3. Houtermans, M.J.M., Brombacher, A.C., Karydas, D.M.,(1998) Diagnostic Systems of
Programmable Electronic Systems. PSAM IV, New York, U.S.A
4. IEC, 61511 (2003) Safety instrumented systems for the process industry, 2003
5. IEC 61165, (2001) Ed.2: Application of Markov techniques
6. ISA TR84.0.04 Part 4 (1997) Determining the SIL of a SIS via Markov Analysis
7. Karydas, D.M., Houtermans, M.J.M., (1998) A Practical Approach for the Selection of
Programmable Electronic Systems used for Safety Functions in the Process Industry. 9th
International Symposium on Loss Prevention and Safety Promotion in the Process Industries,
Barcelona, Spain



FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
09

Within the TÜV Functional Safety Program:


White Paper
Certified Level Sensor for the Liquefied Natural
Gas Industry

Date: 3 May 2006


Author(s): L. Monfilliette, P. Versluys, M.J.M. Houtermans

Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com


© 2002 - 2007 Risknowlogy B.V.

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, and Functional Safety Data Sheet are registered service marks.


Certified Level Sensor for the Liquefied Natural Gas Industry

L. Monfilliette, P. Versluys
Whessoe S.A., Calais, France
M.J.M. Houtermans 1
TUV Industrie Service, Cologne, Germany
Risknowlogy B.V., Brunssum, The Netherlands

Abstract
This paper demonstrates in a practical way how liquefied gas storage facilities around the world
can benefit from IEC 61508 compliant level sensor systems. Worldwide there are about 100
liquefied natural gas (LNG) storage facilities. This market has grown by about 8% on average over the
past 5 years due to the increased demand for clean fuels, the development of new gas fields and the
consequential requirement for more storage facilities.
Storing liquefied natural gas is inherently hazardous, so the industry requires a very high level of
safety. One of the main problems is that a storage tank may get overfilled, which results in structural
damage to the tank and a spill of liquefied natural gas into the environment, bringing with it
unpredictable hazardous situations and their associated risks. LNG storage tanks need to be equipped
with special level sensors and emergency shutdown equipment to ensure that the tank cannot be
overfilled.
The purpose of this paper is to demonstrate how the liquefied gas market can benefit from SIL
certified level sensors. The safety requirements are explained from an application level point of view.
The paper discusses the IEC 61508 requirements as well as the specific requirements of NFPA 59 A.
It explains how the sensor system fulfills these requirements and what efforts the company had to
make to meet them. For end-users the paper explains why and how the sensor system has been
tested. Furthermore, via a practical example of an LNG storage tank, the paper demonstrates the
achieved probability of failure on demand and the required proof test interval.

1 Introduction
The level sensor described in this paper consists of hardware, software and mechanical sub modules.
The applicable functional safety standard for these kinds of systems is IEC 61508. Although this
standard only has technical safety requirements for electrical, electronic and programmable electronic
systems, it clearly states that other technologies, like the mechanical part of the level sensor, can
also be covered as long as the framework and lifecycle approach of IEC 61508 are followed. As the
standard only has detailed requirements for electrical, electronic and programmable electronic
devices, additional requirements were defined during the certification of the level sensor to address
the mechanical parts.
This paper will demonstrate in a practical way how liquefied gas storage facilities around the
world can benefit from IEC 61508 compliant, SIL certified level sensor systems. The safety
requirements are explained from an application level point of view. This paper will discuss the
IEC 61508 requirements as well as the specific requirements of NFPA 59 A.

1
Corresponding author: m.j.m.houtermans@risknowlogy.com

2 LNG Storage Tank Solution


More and more LNG tank farms are being built all over the world to temporarily store LNG before it
is either shipped to its final destination or further processed. The filling of the tank is
typically controlled by a basic process control system and, for safety purposes, monitored by a safety
instrumented system (SIS). The main safety function of the SIS deals with potential overfilling of the
tank and measures the level with custom designed process level gauges. Besides the level gauges,
other instrumentation and equipment is typically present, like an LTD gauge, in-tank temperature
sensing and transmission devices, as well as leak detection and cool-down monitoring systems.
Figure 1 gives an overview of a typical tank with three level gauges. The level gauges are placed
on top of the tank. In this case all three gauges are connected to both the safety instrumented system
(ESD system) and the basic process control system (DCS). Typically two level gauges serve as level
meters and one level gauge operates as overspill alarm. All three gauges are used to determine
whether the overspill position has indeed been reached. Besides the overspill alarm there are also
low, high and high-high alarms.

Figure 1 – Overview LNG Tank with Level Gauges

The actual level sensor is shown in more detail in Figure 2. The gauge itself is built up of
software, electronics, and a mechanical part, all enclosed in a rugged metal housing. The level gauge
sensing head is composed of two parts: the main sensing head body and a PVC displacer. The
main sensing head accommodates a coil and a linkage to the level gauge tape. The displacer floats
on the LNG surface. Following the changes in the actual LNG level, the displacer drives a core up or
down in the above-mentioned coil, thus changing the induction in the coil. The level gauge tape,
connected to the coil, consists of 2 conductors linking this coil to the gauge’s electronics, where the
changes in induction (and thus in the actual LNG level) are measured. The Tefzel® coated
stainless steel tape can be as long as 75 meters, depending on the storage tank height. The software
and electronics not only enable communication with the BPCS and SIS but also ensure a high level
of self-diagnostics. The level gauge tape also plays an integral part in the diagnostic capabilities of
the level gauge.
The displacer continuously floats on top of the liquid surface and senses any movement of the
surface, which continuously changes the induction in the sensing head coil. The rate of change of
this induction is measured and analyzed by the electronics, with the following results:
ƒ The speed at which the induction changes is proportional to the speed at which the servo
motor should be driven.
ƒ The direction in which the induction changes sets the direction of the servo motor (up or
down).
ƒ If the up and down changes cancel out to zero, the surface is merely showing wave action
and there is no real level change, so the servo motor is not activated.
ƒ If no changes are measured during a 10 minute time span, the servo motor travels a short
distance up and re-finds the actual level immediately afterwards. This is a self-check to ensure
that the system is still functioning properly.
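The four rules above can be sketched as a small decision function. This is an illustration only and not the gauge's actual firmware. The function name, the sign convention (a positive rate means a rising level), the dead band used to decide that the net change is "zero" and the gain are all assumptions. Only the 10 minute self-check period comes from the text.

```python
SELF_CHECK_TIMEOUT_S = 10 * 60  # from the text: self-check after 10 minutes without change
WAVE_DEADBAND = 1e-3            # hypothetical threshold below which the net change counts as zero
SPEED_GAIN = 1.0                # hypothetical proportionality constant between rate and servo speed


def servo_command(net_induction_rate, idle_seconds):
    """Return (action, speed) for the servo motor.

    net_induction_rate  net rate of change of the coil induction over the averaging
                        window; positive is assumed to mean a rising level
    idle_seconds        time since a real level change was last measured
    """
    # Rule 4: nothing measured for 10 minutes, so move up briefly and re-find the level.
    if idle_seconds >= SELF_CHECK_TIMEOUT_S:
        return ("self_check", 0.0)
    # Rule 3: up and down changes cancel out, which indicates wave action only.
    if abs(net_induction_rate) <= WAVE_DEADBAND:
        return ("hold", 0.0)
    # Rule 2: the sign of the change sets the direction.
    direction = "up" if net_induction_rate > 0 else "down"
    # Rule 1: the servo speed is proportional to the rate of change.
    return (direction, SPEED_GAIN * abs(net_induction_rate))


print(servo_command(0.5, 12))    # rising level
print(servo_command(-0.2, 30))   # falling level
print(servo_command(0.0, 45))    # waves only
print(servo_command(0.0, 600))   # 10 minutes without change
```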

Figure 2 - Level transmitter gauge, Model 1143-Mark II

The safety function of the level transmitter gauge is defined as follows:

To measure continuously the level of product and compare it to a “High over
spill set point”; should this set point be reached or passed, trigger the safety
relay that is connected to the Emergency Shutdown loop of the unit. This
safety function needs to be carried out with a safety integrity of SIL 3 and
needs to process its level signal within 10 seconds.

When the sensor was certified, the basis of the certification was the safety function defined above.
Without a well-defined safety function it is impossible to test the level sensors against the
requirements of the applicable standards. It is crucial that the safety function definition is not too
narrow, since otherwise end-users will not be able to use the device for safety.
If the safety function is defined too widely, it becomes too difficult to certify the device, as many
requirements possibly cannot be met. A well-defined safety function also makes the testing task
clear for everybody involved in certification.
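A well-defined safety function translates almost directly into test cases. The following sketch, assuming hypothetical function names and a hypothetical set point of 38.5 m, shows the two testable parts of the definition: the comparison against the set point ("reached or passed") and the 10 second processing time.

```python
MAX_RESPONSE_S = 10.0  # from the safety function definition


def safety_relay_state(level_m, overspill_setpoint_m):
    """Return 'TRIP' when the set point is reached or passed, otherwise 'NORMAL'."""
    # "Reached or passed" means the comparison has to include equality.
    return "TRIP" if level_m >= overspill_setpoint_m else "NORMAL"


def response_time_ok(measured_at_s, relay_switched_at_s):
    """Check that the level signal was processed within the allowed 10 seconds."""
    return (relay_switched_at_s - measured_at_s) <= MAX_RESPONSE_S


SETPOINT_M = 38.5  # hypothetical high overspill set point, in meters

for level in (38.0, 38.5, 39.2):
    print(level, safety_relay_state(level, SETPOINT_M))
```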

3 Certification of the level sensors


For companies new to the world of functional safety, product development is a challenging task. At
Whessoe too, IEC 61508 and related terms like SIL were unheard of when, in 2004, the first customers
started to ask whether their sensor could fulfill these requirements.
After some initial investigation, Whessoe’s management decided that complying with functional
safety should be an additional safety requirement for their products. It was considered a “must have”
to survive the stringent and ever increasing safety requirements imposed by the market place. It was
decided to have the level sensors tested by an independent party in order to show clients Whessoe’s
commitment to safety.
As the company was new to functional safety and IEC 61508, the first kick-off meeting included
training for the personnel to be involved in the project. As senior management was also involved, the
training proved to be particularly useful. Normally only engineers attend these trainings, which makes
it difficult for management to fully understand the impact a new standard might have on the way the
company develops products and does business. In this case the involvement of management made it
easier for engineers to explain the time and resources needed to make product changes in order to
obtain the certification.
Sensors like these need to be very robust as they need to withstand harsh environmental conditions.
Therefore, besides the functional safety and application specific standards, the sensor has also been
tested and certified to specific environmental, electrical safety and EMC standards. The following is
an overview of all the standards this level gauge has been tested and certified against:
ƒ IEC 61508 basic standard for functional safety.
ƒ IEC 61511 application specific standard for functional safety.
ƒ NFPA 59 A application specific standard for LNG storage.
ƒ 49 CFR. Part 13 US federal standards.
ƒ EN1473. 4. 5. 8 standards for seismic behavior.
ƒ EN 61326. 1 standard for electromagnetic compatibility.
ƒ IEC 61010 standard for electrical safety.
ƒ ATEX: EN50014 standard for explosive atmosphere general requirements.
ƒ ATEX: EN50018 standard for explosive atmosphere flameproof.
ƒ ATEX: EN50020 standard for explosive atmosphere intrinsic safety.

This is quite a list of requirements to manage. Therefore any TÜV certification project always starts
with three important documents:

ƒ Safety plan
ƒ Verification & validation plan
ƒ Safety requirements specification


These documents are not demanded by TÜV; they are a direct result of the requirements of
IEC 61508. Besides that, TÜV has years of experience in dealing with functional safety projects,
and having these three documents ready at the start of the project is a guarantee that the project
runs faster and that everybody involved has a clear understanding of the project plan on how to
achieve safety.
The safety plan outlines the management of functional safety requirements and is basically the plan
or approach on how to achieve functional safety for the project. It outlines the people, departments
and organizations involved, the lifecycle to follow, the activities and documents in each step of the
lifecycle, and the tools and measures that will need to be applied to avoid failures.
In other words, it is a document that outlines who will do what, how and at what time.
Since it is a plan, it is a living document that can be updated over the course of the project
whenever necessary.
The verification & validation plan is a document that outlines who will perform which verification
activities at what point in time. It does not outline actual tests but only the activities leading up to
these tests. For example, in the case of Whessoe, one of the activities was to understand the IEC 61508
standard. One cannot create and verify a design without understanding the requirements of
IEC 61508.
The third document is the safety requirements specification (SRS). Where the first two documents
are process related (that is, how we manage functional safety), the SRS is about the requirements
of the actual product or system. The SRS is the most important safety document as it outlines the
basic and top-level safety requirements of the product. It is a well-focused document, which does not
go into detail and does not include any non-safety requirements. For this project, a lot of time was
spent upfront to generate these three documents. This time was considered well spent, though, and
was gained back during the remainder of the project as fewer mistakes were made and fewer
“surprises” revealed themselves during the project. The following paragraphs give a more detailed
overview of the requirements directly related to functional safety and applied during certification of
the product.

4 IEC 61508 requirements


The basis for functional safety is always the IEC 61508 standard. No matter which other standards
are involved, the basic requirements of IEC 61508 need to be met. Besides IEC 61508 other
standards can easily be involved for application specific purposes like IEC 61511 or NFPA 59 A. For a
product to comply with IEC 61508 the following requirements need to be addressed:
ƒ Functional safety management,
ƒ Hardware,
ƒ Software,
ƒ Reliability and
ƒ Documentation.

The functional safety management requirements are in general dealt with in the safety plan and the
verification & validation plan. Detailed verification & validation documentation is created for each step
of the lifecycle, both for hardware as well as for software. The hardware and software requirements
are, on a general level, explained in the safety requirements specification and in more detail in the
supporting design specifications.
A qualitative and quantitative reliability analysis needs to be carried out on the hardware and is part
of the hardware verification documentation. Besides specifications, verification and validation,
supporting documentation such as a user manual, including the safety manual, also needs to be created.


One of the most important IEC 61508 concepts that needs to be addressed is the architectural
constraint. According to IEC 61508 it is not possible to build just any kind of safety system, as the
architecture is limited according to the requirements in Table 1. This table applies to so-called
subsystems. Typical subsystems are sensors, valves, logic solvers, etc.
For each subsystem we need to determine the following:

ƒ The type
ƒ The safe failure fraction
ƒ The hardware fault tolerance

The type of the subsystem deals with the complexity of the component.
There are two types, A and B. Type A subsystems are simple systems with well-defined failure
modes and failure behavior. Type B subsystems are complex systems where one or more failure
modes are not clear or where we cannot fully understand the failure behavior of the system.
The safe failure fraction is a measure of the “fail-safe” design and built-in diagnostics of the safety
system. The more internal failures go to the safe side, or the more failures we can detect via built-in
diagnostics, the higher the safe failure fraction.
The hardware fault tolerance is a measure of redundancy. A hardware fault tolerance of N means
that the safety function of the subsystem is lost when N+1 dangerous failures occur; with a hardware
fault tolerance of 0, a single dangerous failure (0+1=1) is enough.
A single subsystem has a hardware fault tolerance of zero,
a redundant subsystem has a hardware fault tolerance of 1, and so on.

Table 1 – Architectural constraints for subsystems

Safe Failure        Type A                      Type B
Fraction (SFF)      Hardware Fault Tolerance    Hardware Fault Tolerance
                    0       1       2           0       1       2
< 60 %              SIL 1   SIL 2   SIL 3       N.A.    SIL 1   SIL 2
60 % - < 90 %       SIL 2   SIL 3   SIL 4       SIL 1   SIL 2   SIL 3
90 % - < 99 %       SIL 3   SIL 4   SIL 4       SIL 2   SIL 3   SIL 4
≥ 99 %              SIL 3   SIL 4   SIL 4       SIL 3   SIL 4   SIL 4

The level gauge can actually be divided into three subsystems as shown in Figure 3.
The division is based on the type of the subsystem according to IEC 61508. A single level gauge is
a mixed-type subsystem as it consists of Type A mechanical hardware, Type A electronic hardware
and Type B electronic hardware. For a single level gauge to achieve SIL 2, the following
conditions need to be met:
ƒ Type A mechanical hardware needs to have a safe failure fraction of 60-90%
ƒ Type A electronic hardware needs to have a safe failure fraction of 60-90%
ƒ Type B electronic hardware needs to have a safe failure fraction of 90-99%
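The lookup in Table 1 and the three conditions above can be sketched as follows. The table values are transcribed from Table 1. The rule that a gauge made of several subsystems in series is limited by its weakest subsystem is an assumption of this sketch, and the safe failure fractions used in the example are hypothetical values inside the bands named in the text.

```python
# Maximum SIL per SFF band (rows) and hardware fault tolerance 0, 1, 2 (columns),
# transcribed from Table 1. None stands for "N.A." (not allowed).
TABLE_1 = {
    "A": [[1, 2, 3], [2, 3, 4], [3, 4, 4], [3, 4, 4]],
    "B": [[None, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 4]],
}


def sff_band(sff):
    """Map a safe failure fraction (0..1) to the row index of Table 1."""
    if sff < 0.60:
        return 0
    if sff < 0.90:
        return 1
    if sff < 0.99:
        return 2
    return 3


def max_sil(subsystem_type, sff, hft):
    """Maximum SIL a subsystem may claim according to Table 1."""
    if hft not in (0, 1, 2):
        raise ValueError("Table 1 only covers a hardware fault tolerance of 0, 1 or 2")
    return TABLE_1[subsystem_type][sff_band(sff)][hft]


def gauge_sil(subsystems, hft):
    """SIL of a gauge built from subsystems in series (assumed: weakest one decides)."""
    sils = [max_sil(kind, sff, hft) for kind, sff in subsystems]
    return None if None in sils else min(sils)


# Hypothetical SFF values: mechanical Type A, electronic Type A, electronic Type B.
gauge = [("A", 0.75), ("A", 0.75), ("B", 0.95)]
print("single gauge (HFT 0):", gauge_sil(gauge, 0))
print("redundant gauges (HFT 1):", gauge_sil(gauge, 1))
```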


[Figure 3 diagram: the level sensor 1143-2 subsystem is divided into three subsystems: mechanical
hardware (Type A), electronic hardware (Type A) and electronic hardware (Type B).]

Figure 3 – Subsystems 1143-2 level sensor

To verify the safe failure fraction of a single sensor a detailed component level failure modes and
effects analysis (FMEA) has been carried out. This FMEA addresses the mechanical as well as the
electronic hardware of the sensor. For every single internal component of the level gauge, the failure
modes were listed and the effects of these failure modes were analyzed taking into account the safety
function as defined before. This was indeed a tedious task but it documented the full possible failure
behavior of the level sensor as required by the standard.

Failure rate data was added to the FMEA in order to calculate the safe failure fraction. During the
FMEA also existing diagnostics features of the gauge were taken into account.
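As a minimal sketch of this step, the safe failure fraction can be computed from FMEA rows once each failure mode has a failure rate and a category. The sketch uses the common definition SFF = (λS + λDD) / (λS + λDD + λDU), with safe (S), dangerous detected (DD) and dangerous undetected (DU) failures. The components, failure modes and failure rates below are hypothetical and are not taken from the actual FMEA of the gauge.

```python
def safe_failure_fraction(rows):
    """Compute the SFF from FMEA rows of (component, failure mode, category, rate).

    category is 'S' (safe), 'DD' (dangerous detected) or 'DU' (dangerous undetected);
    rate is the failure rate in any consistent unit, for example FIT.
    """
    totals = {"S": 0.0, "DD": 0.0, "DU": 0.0}
    for _component, _mode, category, rate in rows:
        totals[category] += rate
    total = sum(totals.values())
    if total <= 0:
        raise ValueError("no failure rates given")
    return (totals["S"] + totals["DD"]) / total


# Hypothetical FMEA rows before additional diagnostics are implemented.
before = [
    ("power supply", "output lost", "S", 120.0),
    ("coil", "open circuit", "DD", 40.0),
    ("microcontroller", "program flow error", "DU", 30.0),
    ("relay", "contacts welded", "DU", 10.0),
]

# The same rows after a (hypothetical) software diagnostic detects the program flow error.
after = [
    ("power supply", "output lost", "S", 120.0),
    ("coil", "open circuit", "DD", 40.0),
    ("microcontroller", "program flow error", "DD", 30.0),
    ("relay", "contacts welded", "DU", 10.0),
]

print(f"SFF before: {safe_failure_fraction(before):.0%}")
print(f"SFF after:  {safe_failure_fraction(after):.0%}")
```

In this invented example a single added diagnostic moves the subsystem from the 60 % - < 90 % band to the 90 % - < 99 % band of Table 1.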
Not all diagnostics required by the standard were available in the first design of the level sensor.
The FMEA revealed that several improvements had to be made in order to achieve the
required safe failure fractions, and additional software diagnostics were implemented. The accepted
design for a single level sensor currently meets the safe failure fractions for SIL 2. As it is possible to
use multiple sensors in different architectures, it is also possible to achieve SIL 3.
Table 2 gives a complete overview of the possible architectures for the level sensor and their
achievable SIL levels according to IEC 61508.

Table 2 – Overview of the possible architectures and their achievable SIL level

                            Architecture
Attribute                   1oo1    1oo2    2oo3
Hardware fault tolerance    0       1       1
Fit for use in SIL          2       3       3
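The rows of Table 2 can be reproduced with a short sketch. For an MooN voting architecture (M out of N channels have to demand the trip) the hardware fault tolerance is N − M. The second function assumes that each additional fault tolerated raises the achievable SIL by one level, capped at SIL 4, which is how the rows of Table 1 relevant to this gauge behave. The function names are hypothetical.

```python
SINGLE_GAUGE_SIL = 2  # from the text: a single level sensor meets the SFF for SIL 2


def hardware_fault_tolerance(m, n):
    """Number of dangerous channel failures an MooN architecture can tolerate."""
    if not 1 <= m <= n:
        raise ValueError("MooN requires 1 <= M <= N")
    return n - m


def fit_for_sil(m, n):
    """Achievable SIL for MooN voting of identical gauges (assumed: +1 SIL per HFT step)."""
    return min(4, SINGLE_GAUGE_SIL + hardware_fault_tolerance(m, n))


for m, n in ((1, 1), (1, 2), (2, 3)):
    print(f"{m}oo{n}: HFT {hardware_fault_tolerance(m, n)}, fit for use in SIL {fit_for_sil(m, n)}")
```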

5 NFPA 59 A requirements
In many countries level sensors for LNG tanks need to comply with the US standard NFPA 59 A. This
standard is application specific, which means that besides the IEC 61508 requirements it is also
necessary for these level gauges to comply with the NFPA 59 A standard.


Although this is a US standard, many countries in the world that store LNG apply it as
a basis when building LNG storage tanks. There are a few very significant requirements in the
standard that need to be considered when using level gauges. The requirements within NFPA 59 A
call for three level gauges, one being dedicated to high-high level alarming only.
In other words, no matter how well the level sensors perform according to the IEC 61508 standard, if
a company needs to comply with NFPA 59 A then by definition it needs to use three level gauges.
At the time the NFPA 59 A standard was written, IEC 61508 was not known to the committee. Possibly
in the future the requirement of using 3 sensors may be reduced to 1 or 2 level gauges fit for use in a
certain SIL level according to IEC 61508.

6 Environmental conditions
At the design stage the safety system should take the following environmental factors into account:
ƒ Temperature range: 20°C to + 50°C
ƒ Enclosure: IP 65
ƒ Components Tropical type protection: Optional coating for PCB
ƒ Pressure range: Up to 500 mBar relative to atmospheric pressure
ƒ Seismic resistance: Up to 2g in all directions

Besides the above, the level gauges must also comply with the EMC requirements.

7 Quantitative reliability analysis


IEC 61508 requires the calculation of the probability of failure on demand for a safety function. A
safety function usually consists of sensors, logic solvers, actuators and other peripheral equipment.
The probability of failure on demand is the probability that the safety function cannot be carried out
because of an internal failure of the safety system. For each SIL level the PFD range is presented in
the following table.
Table 3 - Safety Integrity Levels
SIL    Average Probability of Failure on Demand
1      ≥ 10^-2 to < 10^-1
2      ≥ 10^-3 to < 10^-2
3      ≥ 10^-4 to < 10^-3
4      ≥ 10^-5 to < 10^-4
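The bands of Table 3 can be turned into a small lookup, sketched below. Reporting values below the SIL 4 band as SIL 4 is an assumption of this sketch, and the function name is hypothetical. The PFD value alone does not decide the SIL: the architectural constraints of Table 1 also have to be met, so the lower of the two results applies.

```python
# (SIL, lower bound inclusive, upper bound exclusive) of the average PFD, from Table 3.
SIL_BANDS = [
    (1, 1e-2, 1e-1),
    (2, 1e-3, 1e-2),
    (3, 1e-4, 1e-3),
    (4, 1e-5, 1e-4),
]


def sil_from_pfd(pfd_avg):
    """Return the SIL band a PFD value falls in, or None if it is worse than SIL 1."""
    if pfd_avg < 1e-5:
        return 4  # assumption: better than the SIL 4 band is still reported as SIL 4
    for sil, lower, upper in SIL_BANDS:
        if lower <= pfd_avg < upper:
            return sil
    return None


def claimable_sil(pfd_avg, sil_by_architecture):
    """The SIL that can be claimed is limited by both the PFD and the architecture."""
    by_pfd = sil_from_pfd(pfd_avg)
    return None if by_pfd is None else min(by_pfd, sil_by_architecture)


print(sil_from_pfd(2.5e-3))      # falls in the SIL 2 band
print(claimable_sil(2.5e-4, 2))  # SIL 3 by PFD, limited to SIL 2 by the architecture
```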

Although the PFD can only be calculated for a complete safety function, in this paper we
calculate the contribution of the level sensor to the overall safety function. One of the most
advanced techniques for reliability calculations is Markov analysis [11]. Three Markov models were
created, one for each of the three possible architectures in which the level sensors can be used.
For each Markov model the reliability data resulting from the FMEA were used as
failure rate inputs. The actual voting of the 1oo2 and 2oo3 systems occurs in the logic solver of the
ESD system. As the level sensors have excellent diagnostic capabilities, it is possible to send the
logic solver signals indicating safe and dangerous detected failures. In other words, the logic solver
knows which signal from which sensor to trust and which not to trust. This helps significantly in
deciding whether to shut down or to indicate to the operators that the sensors need repair. The
results of the PFD calculation are presented in Table 4. The PFD calculations are performed for 1 and
10 years of continuous operation.

Table 4 – Architecture and configuration overview

                                   Architecture
Attribute                          1oo1          1oo2          2oo3
PFD after 1 year                   1.802e-004    4.404e-008    3.287e-007
Percentage of PFD after 1 year     0.180%        0.004%        0.033%
PFD after 10 years                 1.771e-003    4.181e-006    3.201e-005
Percentage of PFD after 10 years   17.7%         0.42%         3.20%
Fit for use in SIL                 2             3             3
PFS after 1 year                   1.154e-006    9.701e-005    1.918e-010
Fit for use in STL                 5             4             9

Figure 4 shows how the probability of failure on demand develops over time for all three
architectures. A graphical representation like this can be used by an end user to determine the
periodic proof test interval. This can only be done, though, if the logic solver and actuating part are
also included in the calculation. The 1oo1 architecture clearly performs the worst of the three
architectures. The reason that the 1oo2 architecture performs better than the 2oo3 architecture is
that the 2oo3 has more possibilities to fail.

Figure 4 – Probability of Failure on Demand for 1oo1, 1oo2, and 2oo3 architectures.
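How a curve like the one in Figure 4 arises, and how it can be read to find a proof test interval, can be sketched with the simplest possible Markov model. This is not one of the three models used for Table 4. It is a two-state chain for a single sensor (working, or failed dangerous undetected) without diagnostics, repair or common cause, stepped hour by hour. The failure rate and the PFD limit are hypothetical.

```python
HOURS_PER_YEAR = 8760


def pfd_over_time(lambda_du_per_hour, years):
    """Propagate a two-state Markov chain hour by hour.

    Returns the probability of being in the failed state at the end of each year.
    """
    p_ok, p_failed = 1.0, 0.0
    yearly = []
    for _ in range(years):
        for _ in range(HOURS_PER_YEAR):
            moved = p_ok * lambda_du_per_hour  # probability flowing to the failed state
            p_ok -= moved
            p_failed += moved
        yearly.append(p_failed)
    return yearly


def years_until_limit(yearly_pfd, limit):
    """First year at whose end the PFD has reached the limit, or None if it never does."""
    for year, pfd in enumerate(yearly_pfd, start=1):
        if pfd >= limit:
            return year
    return None


LAMBDA_DU = 2.0e-8  # hypothetical dangerous undetected failure rate, per hour
curve = pfd_over_time(LAMBDA_DU, 10)
for year, pfd in enumerate(curve, start=1):
    print(f"year {year:2d}: PFD = {pfd:.3e}")

# A proof test would have to be carried out before the PFD reaches the chosen limit.
print("limit of 1e-3 reached in year:", years_until_limit(curve, 1.0e-3))
```

The model gives the PFD at a point in time rather than the average, and it covers the sensor only, so the logic solver and actuating part would still have to be added before a real proof test interval could be derived.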


8 Conclusions
The paper presented the work performed by Whessoe S.A. to certify their LNG level sensor to IEC
61508 and related standards. The level sensors were rigorously tested, not only for functional safety,
but also for specific environmental conditions. Whessoe decided to have the level sensor certified by
TÜV. This certification assures end-users that they no longer need to evaluate the sensor themselves
against the IEC 61508 standard. The independent review by TÜV demonstrated that the level
sensor is capable of achieving SIL 2 in a 1oo1 configuration and SIL 3 in a 1oo2 or 2oo3
configuration.

9 References
1 IEC 61508, Functional safety of electrical, electronic, programmable electronic safety-related
systems. International Electrotechnical Commission, Geneva, Switzerland, 1999
2 IEC 61511, Functional safety – safety instrumented systems for the process industry.
International Electrotechnical Commission, Geneva, Switzerland, 2003
3 NFPA 59, Utility LP-Gas Plant Code. National Fire Protection Association, Quincy, MA,
USA, 2004
4 49 CFR. Part 13 USA
5 EN1473. 4. 5. 8, Installation and equipment for liquefied natural gas. Design of onshore
installations, 1997
6 EN 61326. 1, Electrical equipment for measurement, control and laboratory use - EMC
requirements. International Electrotechnical Commission, Geneva, Switzerland, 2005
7 IEC 61010, Safety requirements for electrical equipment for measurement, control, and
laboratory use. International Electrotechnical Commission, Geneva, Switzerland, 2003
8 EN50014, Electrical apparatus for potentially explosive atmospheres. General requirements,
1998
9 EN50018, Electrical apparatus for potentially explosive atmospheres. Flameproof enclosure 'd',
2000
10 EN50020, Electrical apparatus for potentially explosive atmospheres. Intrinsic safety 'i', 2002
11 Börcsök, J., Electronic Safety Systems, Hardware Concepts, Models, and Calculations,
ISBN 3-7785-2944-7, Heidelberg, Germany, 2004



FUNCTIONAL SAFETY CERTIFICATION COURSE
Functional Safety for
Safety Instrumented System Professionals

White paper
10

Within the TÜV Functional Safety Program:


HIMA Functional Safety Program

TÜV’s Maintenance Override Procedure


Draft Version 3.0 - 20. October 2000
1. Preface to Draft Version 3.0

This version specifically addresses:


ƒ Maintenance by use of tools (PC/Laptop/workstation/DCS)
ƒ Maintenance by use of public networks (Internet, communication networks, RF-networks,
remote servicing)
ƒ General requirements for safety-related communication protocols
ƒ Security aspects in addition to safety-related communication
ƒ Availability aspects
ƒ Change of system data (set-points, parameter, etc.)
ƒ Exchange of sensors and actuators and related modification of the application programs.

Not all aspects are yet addressed in this draft. Comments and suggestions that can improve this
maintenance override paper in terms of safety are very welcome.

2. Introduction
The purpose of this document is to describe the procedures for the use of maintenance override of
safety related programmable electronic systems, like sensors, controllers, and actuators. The
document also shows how to overcome safety problems and the inconvenience of hardwired
solutions.

2.1. Maintenance Override


There are basically two methods in use to check safety relevant peripherals connected to PLCs:
ƒ Special switches are connected to the inputs of the PLC. These inputs are used to deactivate
sensors and actuators that are under maintenance. The maintenance condition is handled as
part of the application program of the PLC.
ƒ During maintenance sensors and actuators are electrically isolated (disconnected) from the
PLC and checked manually by special measures.
In some cases, for example, where space is limited, there is the desire to integrate the maintenance
console to the operator display, or to have the maintenance condition covered by other strategies.
This introduces a third alternative:
ƒ Maintenance overrides initiated by serial communication to the PLC.
The available maintenance options and communication protocols must be part of the TÜV Type
Approval of the safety system in order to be applied safely. If communication takes place over open
networks, then in addition to the functional safety requirements additional requirements must also be
in place that guarantee security. The end user needs to take into account the advice described in the
safety manual.
This option is to be handled with care and is further explained in this document.

Version 2.0 © 2006 HIMA 1 of 3


We strongly recommend keeping the tools for programming and debugging separate from the tools
used for maintenance override. The engineering workstation, which is used for programming, should
not be used for maintenance.

2.2. Procedure for Maintenance Override


The use of non-approved maintenance tools demands a complete test of the requirements after any
change has been made. The thoroughness of the test is equal to the initial acceptance test. The tests
should not only focus on the changed programmed parts but also on the non-changed parts, as it
cannot be guaranteed that these changes do not have an impact on the unchanged parts. Because of
the cost associated with this it is often not feasible to use non-approved tools.
When using approved tools it is possible to make changes to the program, taking into account the
appropriate measures to maintain the required safety integrity level. After changes have been made to
the program, verification activities may be limited if an analysis of the required regression tests
confirms this. The procedures required for override or online changes must
be described in the safety manual. Approved tools generally meet the following requirements:
ƒ They incorporate measures to control random failures when the program is created or
changed.
ƒ They incorporate measures to control random data communication failures to the PLC.
ƒ They were developed using version maintenance and control tools.
ƒ They were developed using tools to verify changes.
ƒ They were developed using tools to verify the program.
Communication is established using approved protocols. It is possible to use protocols that are
universally valid for the current safety level (e.g., Modbus RTU) or vendor specific, proprietary
protocols that have been taken into account during the type approval process of the PLC. In general it
is only allowed to use tools that have been approved for their current use.
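
As an illustration of one common measure against random data communication failures, the sketch below computes and checks the CRC-16 used by Modbus RTU frames (initial value 0xFFFF, reflected polynomial 0xA001, low byte transmitted first). The function names are hypothetical and are not taken from any approved tool. A checksum of this kind only detects random corruption on the link; whether a protocol may be used for overrides still depends on the type approval of the PLC.

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16 as used by Modbus RTU: init 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def append_crc(payload: bytes) -> bytes:
    """Return the payload followed by its CRC, low byte first (Modbus RTU order)."""
    crc = crc16_modbus(payload)
    return payload + bytes([crc & 0xFF, crc >> 8])


def frame_is_intact(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the received CRC."""
    if len(frame) < 3:
        return False  # too short to hold a payload byte plus a two-byte CRC
    payload, received = frame[:-2], frame[-2:]
    return append_crc(payload)[-2:] == received
```

A receiver would discard any frame for which `frame_is_intact` returns `False`, so that a corrupted override command is never acted upon.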

2.3. Guidelines to carry out Maintenance Override


These guidelines apply mainly to application engineering and operation of a plant.

2.3.1. Application engineering:


1. The maintenance strategy and procedures need to be established before or during application
engineering.
2. While the PLC application program is being created, it should be determined whether it will later
be permissible to override a particular signal.
3. Maintenance overrides are enabled for the whole PLC or a subsystem (process unit) by the DCS
or by other applicable authorized procedures (e.g., key switch or password authorization).
Note: “enabling” the overrides permits them but does not necessarily turn them on.
4. As an organizational measure, the operator should confirm the override condition.
5. Direct overrides on inputs and outputs are not allowed (e.g., using clamps). Overrides have to be
checked and implemented in relation to the application. Multiple overrides in a PLC are allowed
as long as only one override is used in a given safety related group.
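
The application engineering guidelines above can be sketched as a small piece of override logic. This is a minimal illustration only, not the logic of any particular safety PLC: the class `OverrideManager`, its method names, and the tag and group names in the usage note are all hypothetical. It separates enabling from activating (guideline 3), requires operator confirmation (guideline 4), and permits only one active override per safety related group (guideline 5).

```python
from dataclasses import dataclass, field


class OverrideRejected(Exception):
    """Raised when an override request violates one of the guidelines."""


@dataclass
class OverrideManager:
    # Tag -> safety related group, decided during application engineering
    # (guideline 2). Tags that are absent may never be overridden.
    overridable: dict[str, str]
    enabled: bool = False  # guideline 3: enabling permits, it does not activate
    active: dict[str, str] = field(default_factory=dict)  # group -> overridden tag

    def enable(self, authorized: bool) -> None:
        if not authorized:
            raise OverrideRejected("enabling requires key switch or password authorization")
        self.enabled = True

    def disable(self) -> None:
        # Assumption of this sketch: withdrawing the enable removes all overrides.
        self.enabled = False
        self.active.clear()

    def activate(self, tag: str, operator_confirmed: bool) -> None:
        if not self.enabled:
            raise OverrideRejected("overrides are not enabled")
        if tag not in self.overridable:
            raise OverrideRejected(f"{tag} was not engineered as overridable")
        if not operator_confirmed:
            raise OverrideRejected("operator has not confirmed the override")
        group = self.overridable[tag]
        if self.active.get(group, tag) != tag:
            raise OverrideRejected(f"group {group} already has an active override")
        self.active[group] = tag

    def remove(self, tag: str) -> None:
        group = self.overridable.get(tag)
        if group is not None and self.active.get(group) == tag:
            del self.active[group]
```

With a configuration such as `{"PT-101": "G1", "PT-102": "G1", "LT-201": "G2"}`, overrides on `PT-101` and `LT-201` can be active together, while a request for `PT-102` is rejected until the override on `PT-101` has been removed.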

2.3.2. Operation:
1. The alarm shall not be overridden. It should always be clear that signals are in a maintenance
condition.
2. The PLC alerts the operator (e.g., via the DCS) to the override condition. The operator remains
warned until the override is removed.



3. During the period of override, proper operational measures have to be implemented to ensure that
the intervention can be removed again.
4. During the period of override, proper operational measures have to be implemented to ensure that
the intervention into the process does not lead to unacceptable conditions.
5. A program in the DCS checks regularly that no discrepancies exist between the override
command signals from the DCS and the override activated signals received by the DCS from the
PLC.
6. The use of the maintenance override function should be documented in the DCS and in the
programming environment, if connected. The print-out should include:
• The time stamp of the start and end of the maintenance override
• The ID of the person who activated the maintenance override (maintenance engineer or
operator)
• The tag name of the signal being overridden
If the override information cannot be printed online (the preferred option), it should be entered in
the work permit.
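
Operation guidelines 5 and 6 can be illustrated with the sketch below. It is a hedged example under stated assumptions: the function `find_discrepancies`, the record type `OverrideRecord`, and the semicolon-separated line format are hypothetical and do not come from any DCS product. The first part compares the overrides commanded by the DCS with those the PLC reports as active; the second part holds the three items the print-out should contain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


def find_discrepancies(commanded: set[str], reported_active: set[str]) -> dict[str, set[str]]:
    """Compare DCS override commands with the PLC's 'override activated' feedback."""
    return {
        # Commanded by the DCS, but the PLC does not report the override as active.
        "commanded_not_active": commanded - reported_active,
        # Reported active by the PLC, although the DCS never commanded it.
        "active_not_commanded": reported_active - commanded,
    }


@dataclass
class OverrideRecord:
    tag: str                         # tag name of the overridden signal
    person_id: str                   # ID of the person who activated the override
    start: datetime                  # time stamp of the start of the override
    end: Optional[datetime] = None   # time stamp of the end, None while active

    def to_line(self) -> str:
        """Format one print-out line (hypothetical layout)."""
        end = self.end.isoformat() if self.end else "STILL ACTIVE"
        return f"{self.tag};{self.person_id};{self.start.isoformat()};{end}"
```

Run cyclically, a non-empty set in either entry of the result would be raised as an alarm to the operator, and each record line would be printed or, failing that, copied into the work permit.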



[Figure: HIMA Lifecycle Services. 1. Analysis & Specification; 2. Planning & Design; 3. Installation & Commissioning; 4. Validation; 5. Operation & Maintenance; 6. Modification & Retrofits; 7. Decommissioning. Training. Nonstop.]

www.hima.com
fscs@hima.com


Copyright © 2004 - 2012 HIMA Paul Hildebrandt GmbH + Co KG
Specifications are subject to change. All rights reserved.

For all HIMA Functional Safety queries, please contact: fscs@hima.com

For a detailed list of all our subsidiaries and representatives, please
visit our website: www.hima.com/contact
