White paper
© 2006 HIMA
LIMITATION OF LIABILITY - This report was prepared using best efforts. HIMA does not accept any responsibility for
omissions or inaccuracies in this report caused by the fact that certain information or documentation was not made available to
us. Any liability in relation to this report is limited to the indemnity as outlined in our Terms and Conditions. A copy is available
at all times from our website at www.hima.com.
Printed in Germany
This document is the property of, and is proprietary to, HIMA. It is not to be disclosed in whole or in part and no portion of this document shall be duplicated in any manner for any purpose without HIMA's express written authorization.
Abstract: The objective of this paper is to demonstrate through a practical example how an end-user should deal with functional safety while designing a safety instrumented function and implementing it in a safety instrumented system. The paper starts by explaining the problems that exist inherently in safety systems. It then takes the reader from the verbal description of a safety function through the design of the architecture, the process for the selection of safety components, and the role of reliability analysis. After reading this paper the end-user understands the practical process for implementing the design of safety instrumented systems without going into the details of the requirements of the standards.
1. Introduction
Every day end-users around the world are struggling with the design, implementation,
operation, maintenance and repair of safety instrumented systems. The required
functionality and safety integrity of safety instrumented systems is in practice determined
through a process hazard analysis. This task is typically performed by the end-users as they
are the experts on their own production processes and understand the hazards of their
processes best. The result of such a process hazard analysis is among others a verbal
description of each required safety function needed to protect the process. These safety
functions are allocated to one or more safety systems, which can be of different kinds of
technology. Those safety systems that are based on electronic or programmable electronic
technology need as a minimum to comply with the functional safety standards IEC 61508
and/or 61511.
The end-user is typically not involved in the actual design of the safety system. Normally this
is outsourced to a system integrator. The system integrator determines the design of the
safety system architecture and selects the safety components based on the specification of
the end-user. No matter who designs the safety system according to the safety function requirements, in the end it is the end-user who is responsible for the safety integrity of the safety system. This means that the end-user needs to assure himself that the chosen safety architecture and the selected safety components meet the requirements of the applicable standards, and must be able to defend this decision to any third party performing the functional safety assessment.
In reality end-users and system integrators are not experts in hardware and software design
of programmable electronic systems. They know how to run and automate chemical or oil &
gas plants, but most likely they are not experts on how the operating system of a logic solver works or whether the communication ASIC of a transmitter is capable of fault-free safe communication. Even if they were experts, the suppliers of the safety components
will not give them sufficient information about the internals of the devices so that they can
assure themselves of the safety integrity of these devices. Yet they are responsible for the
overall design and thus they need to assure themselves that functional safety is achieved.
But how can they deal with that in practice?
The objective of this paper is to demonstrate through a practical example how an end-user
and/or system integrator should deal with functional safety while designing a safety
instrumented function and implementing it in a safety instrumented system. The paper starts
with explaining the problems that exist inherently in safety systems. After understanding the
problems the paper takes the reader from the verbal description of a safety function through
the design of the architecture, the process for the selection of safety components, and the
role of reliability analysis. After reading this paper the end-user understands the practical process for implementing the design of safety instrumented systems without going into the details of the requirements of the standards.
The hardware of a safety instrumented system can consist of sensors, logic solvers,
actuators and peripheral devices. With a programmable logic solver there is also application
software that needs to be designed. An end-user in the process industry uses as basis for
the design and selection of the safety devices the IEC 61511 standard. This standard outlines requirements for the hardware and software and refers to the IEC 61508 standard if the requirements of IEC 61511 cannot be met. This means that even if the IEC 61511 standard is used as a basis, some of the hardware and software needs to comply with IEC 61508.
As with any piece of equipment, safety equipment too can fail. One of the main objectives of the IEC 61508 standard is to design a “safe” safety system. A “safe” safety system means a system that is designed so that it can either tolerate internal failures and still execute the safety function, or, if it can no longer carry out the safety function, at least notify an operator via an alarm. If we want to design a safe safety system we should first
understand how safety systems can fail. According to IEC 61508 equipment can fail
because of three types of failures, i.e.,
Random hardware failures,
Common cause failures and
Systematic failures.
Random hardware failures are failures that can occur at any given point in time because of
internal degradation mechanisms in the hardware. A typical example is wear out. Any
rotating or moving equipment will eventually wear out and fail. There are two kinds of
random hardware failures (Rouvroye et. al., 1997):
Permanent
Dynamic
Permanent random hardware failures exist until they are repaired. This is in contrast to dynamic random hardware failures, which only appear under certain conditions (for example when the temperature is above 80 C). When the condition is removed the failure disappears again. It is very difficult to test hardware for dynamic random hardware failures.
The IEC 61508 standard addresses random failures in two ways. First of all IEC 61508
requires a designer to implement measures to control failures. The appendix of IEC 61508 part 2 contains tables (Tables A16-A18) which list, per SIL level, the measures that need to be implemented in order to control failures that might occur in hardware.
Secondly, IEC 61508 requires a qualitative and quantitative failure analysis on the
hardware. Via a failure mode and effect analysis the failure behaviour of the equipment
needs to be analysed and documented. For the complete safety function it is necessary to
carry out a probabilistic reliability calculation to determine the average probability of failure
on demand of the safety function.
A common cause failure is defined as a failure, which is the result of one or more events,
causing coincident failures of two or more separate channels in a multiple channel system,
leading to total system failure. Thus a common cause can only occur if the safety function is
carried out with hardware more than once (dual, triple, quadruple, etc. redundancy).
Common cause failures are always related to environmental issues like temperature, humidity, vibration, EMC, etc. If the cause is not related to environmental circumstances then it is not a common cause. Typical examples of a common cause are the failure of a redundant system due to flooding with water or due to an EMC field. A common cause failure is
only related to hardware, and not to software. A software failure is a systematic failure which
is addressed in the next paragraph.
The IEC 61508 standard has two ways to address common cause failures. First of all there is one measure defined to control them, i.e., diversity. Diversity means that we still carry out the safety function in a redundant manner but we use different hardware, or a
different design principle or even completely different technology to carry out the same
safety function. For example if we use a pure mechanical device and a programmable
electronic device to carry out the safety function then a common cause failure of the safety
function due to an EMC field will never occur. The programmable electronic device might fail
due to EMC but the pure mechanical device will never fail due to EMC.
In practice a real common cause is difficult to find because, per the definition of a common cause, the failures of a multi-channel system must occur at exactly the same time. Identical hardware will always have slightly different strengths and thus fail at slightly different times. A well
designed safety system can take advantage of this gap in time and detect one failure before
the other failure occurs.
The most important failures to manage in a safety system are the systematic failures. A systematic failure is defined as a failure related in a deterministic way to a certain cause, which can only be eliminated by a modification of the design or of the manufacturing process. Systematic failures are the hardest failures to eliminate in a safety system. One can only eliminate systematic failures if they are found during testing: testing that either takes place during the development and design of the safety system, or testing that takes place when the system exists in the field (the so-called proof test). The problem is that systematic failures can only be found if a specific test is carried out to find that failure. If we do not test for it we do not find it.
The IEC 61508 standard addresses systematic failures in only one way. The standard defines measures to avoid failures for hardware as well as software. These measures are
presented in the appendix of part 2 and 3 of IEC 61508 (respectively tables B1-B5 and
tables A1-B9) and depend on the required safety integrity. The standard does not take
systematic failures into account in the failure analysis. The philosophy behind this is simple.
If all the required measures to avoid failures are implemented and correctly carried out then there are no systematic failures (or at least their number is negligible for the desired safety integrity) and thus the contribution to the probability of failure is (close to) zero.
Although the end-user has no control over the actual design and internal testing of safety equipment, ultimately they are still responsible when accidents occur due to any of the three types of failures mentioned above. They need to assure themselves that the safety equipment selected by themselves or their system integrators is compliant with either the IEC 61508 or the IEC 61511 standard. In practice though, neither end-users nor system integrators have the knowledge to understand what is going on inside safety equipment. They will
have to rely on third party assessments of this equipment to assure themselves that the
equipment is suitable for their safety application. More on this topic is presented in
paragraph 4.
A safety requirement specification of a safety system must at all times be based on the
hazard and risk analysis. A good hazard and risk analysis includes the following steps:
Hazard identification
Hazard analysis (consequences)
Risk analysis
Risk management
o Tolerable risk
o Risk reduction through existing protection layers
o Risk reduction through additional safety layers
Many techniques exist to support hazard identification and analysis. There is no one ultimate technique that can do it all. A serious hazard and risk study is based on the use of several techniques and methods. Typical hazard identification techniques include:
Checklists
What if study
Failure mode and effect analysis (FMEA)
Hazard and operability analysis (HAZOP)
Dynamic flowgraph methodology (DFM)
More techniques exist than the ones listed above that can be used to carry out the hazard and risk analysis. It is important to select the right technique for the right kind of analysis and not to limit oneself to one technique.
The hazard and risk analysis should among others document in a repeatable and traceable
way those hazards and hazard events that require protection via an additional safety
function. The results from the hazard and risk analysis are used to create the safety
requirement specification of each safety function needed to protect the process. The
specification as a minimum defines the following 5 elements for each safety function:
Sensing
Logic solving
Actuating
Safety integrity in terms of reliability
Timing
Each safety function description should as a minimum consist of these five elements. The
sensing element of the specification describes what needs to be sensed (e.g., temperature,
pressure, speed, etc.). The logic solving element describes what needs to be done with the
sensing element when it meets certain conditions (e.g., if the temperature goes over 65 C
then actuate the shutdown procedure). The actuating element explains what actually needs to be done when the conditions of the logic solving element are met (e.g., open the drain valve).
So far we have described the functionality of the safety function. But the functionality is not
complete if we do not know with how much safety integrity this needs to be carried out. The
safety integrity determines how reliable the safety function needs to be. The functional
safety standards have technical and non-technical safety integrity requirements that are
based on the so called safety integrity level. There are four safety integrity levels (1 through
4) where 1 is the lowest integrity level and 4 the highest. In other words it is much more
difficult to build a SIL 4 safety function than it is to build a SIL 1 function. The SIL level
determines not only the measures to avoid and to control failures that need to be
implemented but also the required probability of failure on demand (PFD). The higher the
SIL level the lower the probability of failure on demand of this safety function.
The last element to be described is how fast the safety function should be carried out. This too is a critical element as it depends on the so-called process safety time. This is the time
the process needs to develop a potential hazard into a real incident. For example, mixing
two chemicals at 30 C is not a problem at all. Mixing the same two chemicals at 50 C can
lead to a runaway reaction and result in an explosion. The process safety time is the time
the reaction needs to develop into an explosion.
It is common practice in the safety industry to define the time element of the safety function
as half of the process safety time. If the chemical reaction takes 2 hours to develop then we
have 1 hour to carry out our safety function. On the other hand if the reaction takes 10
seconds we have only 5 seconds to carry out the safety function. It is of utmost importance to know this time for two reasons. First of all we need to build a safety function that can actually be carried out in this time. Each device used to carry out the safety function takes a piece of the available total time slot. If we for example use valves that need to be closed we need to make sure that these valves can close fast enough. The second reason is that we need to know whether the built-in diagnostics can diagnose a failure in less than half of the process safety time. Before the safety function must be actuated we should be able to know that the safety system has not failed. This puts extra constraints on the internal design of the safety devices when it comes to implementing fast enough diagnostics.
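The budgeting described above can be sketched as a simple check. This is our own illustration: the device names and times below are hypothetical, and the half-of-process-safety-time rule is taken from the paper.

```python
# Sketch: budget check against the process safety time.
# Common practice per the paper: the safety function (and its diagnostics)
# must fit within half of the process safety time.

def within_budget(process_safety_time_s, device_times_s, diagnostic_test_interval_s):
    """Return True if the safety function and its diagnostics fit the budget."""
    budget = process_safety_time_s / 2.0           # half the process safety time
    total_response = sum(device_times_s.values())  # each device consumes a slice
    return total_response <= budget and diagnostic_test_interval_s <= budget

# Example: a 10 s process safety time leaves a 5 s budget.
devices = {"transmitter": 0.5, "logic_solver": 0.1, "valve_closing": 3.0}
print(within_budget(10.0, devices, diagnostic_test_interval_s=4.0))  # True
```

A valve that needs 6 s to close, or diagnostics slower than the budget, would make the same check fail.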
Consider the following example of a poorly specified safety function for a high integrity pressure protection system (HIPPS):

“The main safety function of the HIPPS is to protect the separation vessels against overpressure and to protect the low pressure equipment against high pressure.”
No system integrator can build the hardware and software from this definition. The only clear aspect is the sensing element: somewhere the pressure needs to be measured. After that the system integrator will be lost. The logic, actuating, safety integrity and timing elements are not covered by this specification. Specifications like this will cost every party involved in the project more time than necessary and will lead to a lot of unnecessary discussion. A much better example of a safety function specification is the following:
“Measure the pressure on two locations in vessel XYZ and if the pressure exceeds the high-
high pressure limit open the drain valve within 3 seconds. Perform the function with a safety
integrity of SIL 3.”
This specification gives much more complete information. The system integrator knows
exactly what the function should do and can now design the function according to the rules
of SIL 3 and select components and write application software that can perform this
function.
For each safety function the end-user should provide the system integrator with a clear
definition containing as a minimum the 5 elements specified before. There are many other
requirements that the end-user can put into the specification. For example environmental
conditions that the safety system should be able to handle (temperature ranges, humidity
levels, vibration levels, EMC levels, etc.) or restart procedures, periodic test intervals, and
more.
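As an illustration, the five minimum elements can be captured in a structured record, filled in with the paper's drain-valve example. The field names below are our own invention, not terms from the standards.

```python
# Sketch: a safety-function specification record with the five minimum
# elements the paper requires (sensing, logic, actuating, integrity, timing).

from dataclasses import dataclass

@dataclass
class SafetyFunctionSpec:
    sensing: str         # what is measured
    logic: str           # condition that triggers the function
    actuating: str       # action to take
    sil: int             # required safety integrity level (1-4)
    time_limit_s: float  # maximum time to complete the action

spec = SafetyFunctionSpec(
    sensing="pressure at two locations in vessel XYZ",
    logic="pressure exceeds the high-high limit",
    actuating="open the drain valve",
    sil=3,
    time_limit_s=3.0,
)
print(spec.sil, spec.time_limit_s)  # 3 3.0
```

Extra requirements (environmental conditions, restart procedures, proof test intervals) would simply be further fields on such a record.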
A good system integrator will take the safety requirements specification of the end-user and
translate that into a requirement specification that is usable for the system integrator. The
specification created by the system integrator should be verified and approved by the end-
user. This is an excellent step to be performed as it assures that both parties can see that
they understand each other and that they interpreted the system to be designed correctly.
Needless to say this costs more time during specification, but that time is saved during actual design and testing and in the often required modifications afterwards.
When the safety requirements specification is clear and agreed upon the system integrator
can start with the architectural design of the safety function and system. Figure 1 shows how
a safety function definition can be implemented in hardware. The safety function is divided
into three subsystems, i.e., sensing, logic solving, and actuating. The designer of the safety
function can decide how to divide the safety function into subsystems and to what level of detail. In practice subsystems are determined by redundancy aspects or by whether a component can still be repaired by the end-user or not.
Fig. 1. Hardware implementation of the safety function: sensors T1/T2 with transmitters TM1/TM2 wired to redundant input channels (I1-I8), common circuitry and a CPU in the logic solver, and output channels (O1-O8) driving relays R1/R2 (Pump A and Pump B) and a solenoid valve (SOV) on the drain valve.
The IEC 61508 and IEC 61511 standards have set limitations on the architecture of the hardware. The concepts of the architectural constraints are the same for both standards although the IEC 61508 standard requires some more detail. The architectural constraints of
the IEC 61508 standard are shown in Table 1 and 2 and are based on the following aspects
per subsystem:
the type of the subsystem (A or B), its hardware fault tolerance (HFT) and its safe failure fraction (SFF).

Table 1. Architectural constraints for type A subsystems

Safe Failure      Hardware Fault Tolerance (HFT)
Fraction (SFF)    0        1        2
< 60 %            SIL 1    SIL 2    SIL 3
60 % - < 90 %     SIL 2    SIL 3    SIL 4
90 % - < 99 %     SIL 3    SIL 4    SIL 4
> 99 %            SIL 3    SIL 4    SIL 4

Table 2. Architectural constraints for type B subsystems

Safe Failure      Hardware Fault Tolerance (HFT)
Fraction (SFF)    0        1        2
< 60 %            N.A.     SIL 1    SIL 2
60 % - < 90 %     SIL 1    SIL 2    SIL 3
90 % - < 99 %     SIL 2    SIL 3    SIL 4
> 99 %            SIL 3    SIL 4    SIL 4
The “type” designation of a subsystem refers to the internal complexity of the subsystem. A type A subsystem has a defined failure behaviour: the effect of every failure mode of the subsystem is clearly defined and well understood. Typical type A components are valves and actuators. A subsystem is of type B if the effect of even one failure mode cannot be completely determined. In practice any subsystem with an integrated circuit (IC) is per definition type B. Typical type B systems are programmable devices like logic solvers, smart transmitters, or valve positioners.
The hardware fault tolerance (HFT) determines the number of faults that can be tolerated
before the safety function is lost. It is thus a measure of redundancy. When determining the
hardware fault tolerance one should also take into account the voting aspects of the
subsystem. A 1oo3 and a 2oo3 subsystem both carry out the safety function 3 times (triple redundant) but because of the voting the HFT of the 1oo3 subsystem equals 2 while the HFT of the 2oo3 subsystem equals 1. A complete overview of the most common
architectures is given in Table 3.
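A sketch of how Tables 1 and 2 and the MooN voting rule (HFT = N − M) combine into a lookup of the maximum achievable SIL per subsystem. The handling of values exactly on a band boundary is our own simplification.

```python
# Sketch: maximum achievable SIL from the IEC 61508 architectural-constraint
# tables (Tables 1 and 2 of this paper), combined with the HFT of a MooN
# voting scheme (HFT = N - M, e.g. 1oo3 -> HFT 2, 2oo3 -> HFT 1).

SIL_TABLE = {  # rows: SFF band upper bound, columns: HFT 0, 1, 2; None = not allowed
    "A": [(0.60, (1, 2, 3)), (0.90, (2, 3, 4)), (0.99, (3, 4, 4)), (1.01, (3, 4, 4))],
    "B": [(0.60, (None, 1, 2)), (0.90, (1, 2, 3)), (0.99, (2, 3, 4)), (1.01, (3, 4, 4))],
}

def max_sil(subsystem_type, sff, m, n):
    """Maximum SIL for a MooN subsystem of the given type and SFF."""
    hft = min(n - m, 2)  # the tables stop at HFT 2
    for upper, sils in SIL_TABLE[subsystem_type]:
        if sff < upper:
            return sils[hft]
    return None

print(max_sil("B", 0.95, 2, 3))  # 3: type B, 90-99 % SFF, HFT 1
print(max_sil("A", 0.55, 1, 1))  # 1: type A, < 60 % SFF, HFT 0
```

Such a lookup lets a system integrator iterate over candidate architectures until the constraints are met for the target SIL.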
Another important factor is the safe failure fraction (SFF). This is basically a measure of the fail-safe design and built-in diagnostics of the subsystem. A subsystem can fail safe or dangerous. Safe failures are those failures that cause the subsystem to carry out the safety function without a demand. For example, the safety function of an emergency shutdown valve is to close upon demand. We call it a safe failure if the valve closes because of an internal failure without a demand. A dangerous failure is the opposite: the valve has failed dangerously if it cannot close upon demand because of an internal failure. Some components also have internal diagnostics (diagnostics should not be confused with proof testing). If that is the case it is possible to detect failures and act upon the detection. Smart sensors and logic solvers typically have built-in diagnostics. Taking this into account a subsystem can basically have four different kinds of failures:

Safe detected failures
Safe undetected failures
Dangerous detected failures
Dangerous undetected failures
If we know the failure rates for each subsystem in terms of these four failure categories then we can calculate the SFF as follows:

SFF = (λSD + λSU + λDD) / (λSD + λSU + λDD + λDU)

where λSD, λSU, λDD and λDU are the safe detected, safe undetected, dangerous detected and dangerous undetected failure rates respectively.
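A minimal sketch of this calculation, using the standard IEC 61508 definition in which all safe failures plus the dangerous detected failures count toward the SFF. The example failure rates are hypothetical.

```python
# Sketch: safe failure fraction from the four failure-rate categories
# (safe detected, safe undetected, dangerous detected, dangerous undetected).
# Only dangerous undetected failures do NOT count toward the SFF.

def sff(l_sd, l_su, l_dd, l_du):
    total = l_sd + l_su + l_dd + l_du
    return (l_sd + l_su + l_dd) / total

# Hypothetical failure rates in failures per hour:
print(sff(l_sd=2e-7, l_su=1e-7, l_dd=5e-7, l_du=2e-7))  # 0.8, i.e. 80 %
```

With an SFF of 80 % this subsystem falls in the 60 %-90 % band of Tables 1 and 2.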
In other words the system integrator has a lot of design options to choose from. The actual design depends on many things. For example, what kind of sensors are available on the market? Which type are they, and which SFF do they achieve? Does the end-user have a preferred vendor list to choose from? And so on.
The IEC 61511 standard also has architectural constraints defined. The principle is similar to the above, only IEC 61511 is less complicated. IEC 61511 does not differentiate between type A and B components but only between programmable electronic logic solvers and all equipment except programmable electronic logic solvers. Smart sensors with dual processors and software inside are apparently not considered complex devices in terms of IEC 61511. The architectural constraints of IEC 61511 are shown in
Table 4 and 5.
For all equipment except PE logic solvers it is possible to decrease the hardware fault
tolerance by 1 if the following conditions are met:
In reality every single product supplier will try to prove to an end-user that their equipment meets the above conditions, but in practice it is hard to find a product that truly fulfils these conditions. Especially the proven-in-use condition is hard to meet, or at least hard to prove.
On the other hand, the HFT of a product needs to be increased by one if the dominant failure mode of the product is not to the safe mode and dangerous failures are not detected.
The tables of IEC 61511 and IEC 61508 determine the hardware architecture of the safety
function. The starting point is always the SIL level of the safety function and from there the
system integrator has a certain degree of freedom to design a safety system architecture
depending on the hardware fault tolerance and the hardware complexity of the subsystem.
Having two standards to deal with in order to determine the system architecture does not make it easier for the end-user or system integrator. Many end-users and system integrators do not realize that, even if they work with the IEC 61511 standard, some subsystems of the safety functions still need to comply with the IEC 61508 standard. Figure 2 gives guidance. From this figure it becomes clear that we need per definition to follow IEC 61508 if we want to apply new hardware, which has not been developed yet. For any hardware which meets the IEC 61511 requirements for proven in use, or has been assessed according to the requirements of IEC 61508, we can continue to follow the IEC 61511 requirements, in particular Tables 4 and 5. IEC 61511 defines proven in use as follows:
“When a documented assessment has shown that there is appropriate evidence, based on
the previous use of the component, that the component is suitable for use in a safety
instrumented system”
Although proven in use is typically something that only an end-user can determine, the suppliers of safety components will do everything to convince end-users and system integrators that their products are proven in use. The evidence that needs to be delivered in order to demonstrate proven in use is, though, not so easy to accumulate:
Especially the last point is very difficult to meet as failure track records are usually not available. End-users don't always track them and product manufacturers do not have the capability to track their products once they are sold and delivered.
End-users do not have the capabilities to verify for every single product that will be used in a
safety function whether it meets the proven in use requirements of IEC 61511 or to assess
them according to IEC 61508. Many end-users therefore make use of certified products or
third party reports. There is a big difference between a product with a certificate and a
product with a third party report.
When a product is certified according to the IEC 61508 standard this means that every single requirement of the standard has been verified for this product. It is for example not possible to certify only the hardware of a programmable electronic system. Certification is all-inclusive and thus the software also needs to be addressed. A well certified safety product
not only addresses functional safety according to IEC 61508 but also issues like:
Electrical safety
Environmental safety
EMC/EMI
User documentation
Reliability analysis
A certified product always comes with a certificate and a report to the certificate. The report
to the certificate is very important as it explains how the verification or assessment has been
carried out and whether there are any restrictions on the use of the product.
A third party report is often used in industry but is limited in scope. The report itself will outline what the scope of the analysis is. Many third party reports only focus on the hardware analysis of the product. In principle this is no problem as long as the end-user or system integrator is aware that other aspects of the product, like the software, also need to be addressed and that maybe another third party report should be requested that covers the software.
For each safety device the end-user should assure themselves that the device is either
compliant with IEC 61508 or with IEC 61511. Concerning the hardware, the end-user should as a minimum ask their suppliers for the information listed in Table 6.
Table 6. Minimum information to request from the supplier of each safety device:

Applicable standard
Type (A or B)
Hardware fault tolerance
Safe failure fraction
Safe detected failure rate
Safe undetected failure rate
Dangerous detected failure rate
Dangerous undetected failure rate
SIL level for which the product is fit for use
Recommended periodic proof test interval
With this information the end-user or system integrator can easily determine how to comply
with the architectural constraints tables and build the architecture of their loop as desired.
This information can be delivered by the supplier itself, through a third party report or
through a certification report. It is up to the end-user to decide what is actually required
(read what can be trusted). The architecture needs to be redesigned until the architectural
constraints requirements are met.
Once the architectural system design of the safety loop complies with the architectural constraints tables, the loop has already met one of the most important requirements of the standards. Another important requirement is the calculation of the probability of failure on demand (PFD) or, for continuous mode, the probability of failure per hour (PFH). The PFD is the probability that the safety function cannot be carried out upon demand from the process. It needs to be calculated for those processes where the expected demand is less than once per year. If a loop is used in continuous mode then it is necessary to calculate the frequency of failure of this loop per hour. This is necessary as we are now in a different situation: where the demand-mode loop can only cause a problem when there is an actual demand from the process, the continuous-mode loop can actually be the cause of a process upset when the loop itself has failed. Table 7 gives an overview of the required probabilities and frequencies per SIL level.
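The SIL target bands referred to here are standardised in IEC 61508. As a sketch, mapping a calculated PFD or PFH back to the SIL band it falls in:

```python
# Sketch: SIL target bands for low-demand (PFDavg) and continuous (PFH)
# mode as standardised in IEC 61508 (the content of Table 7 of this paper).
# Bands are half-open intervals [low, high).

PFD_BANDS = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
PFH_BANDS = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}

def achieved_sil(value, bands):
    """Return the SIL whose band contains the calculated PFD or PFH."""
    for sil, (low, high) in sorted(bands.items(), reverse=True):
        if low <= value < high:
            return sil
    return None  # outside the defined bands

print(achieved_sil(3.5e-4, PFD_BANDS))  # 3
print(achieved_sil(2.0e-8, PFH_BANDS))  # 3
```

Note that the band only limits the achievable SIL; the architectural constraints and the systematic requirements must be met as well.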
A reliability model needs to be created for each loop of the safety system. There are
different techniques available in the world to create reliability models. Well known
techniques include:
Reliability block diagrams
Fault tree analysis
Markov analysis
The reliability block diagram technique is probably one of the simplest methods available. A
block diagram is a graphical representation of the required functionality. The block diagram
of the safety function of Figure 1 is given in Figure 3 below. A reliability block diagram is
always read from left to right where each block represents a piece of the available success
path(s). As long as a block has not failed it is possible to go from left to right. Depending on built-in redundancy, alternative paths may exist in the block diagram to go from left to right. Once the block diagram is created it is possible to use simple probability theory to calculate the probability of failure.
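The "simple probability theory" step can be sketched as follows, assuming independent block failures. The example failure probabilities are hypothetical.

```python
# Sketch: probability of failure of a reliability block diagram.
# Blocks in series fail if any block fails; parallel (redundant) paths
# fail only if all paths fail. Block failures are assumed independent.

from functools import reduce

def series_fail(probs):
    # success requires every block, so P(fail) = 1 - product of successes
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)

def parallel_fail(probs):
    # one surviving path is enough, so P(fail) = product of path failures
    return reduce(lambda acc, p: acc * p, probs, 1.0)

# Hypothetical: two redundant sensor channels (each 1e-2), in series with
# a logic solver (1e-4) and a final element (5e-3).
sensors = parallel_fail([1e-2, 1e-2])       # 1e-4
loop = series_fail([sensors, 1e-4, 5e-3])
print(loop)  # dominated by the final element, roughly 5.2e-3
```

The final element dominates the result here, which is a common finding in practice: valves and actuators usually contribute most to the PFD.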
Fig. 3. Reliability block diagram of the safety function of Figure 1 (transmitters T1/TM1 and T2/TM2 with input channels I1/I2, common circuitry, CPU, output channels O1-O3, relays R1/R2, SOV and ESD shutdown valve).
Another technique is fault tree analysis (FTA). FTA is a technique that originates from the nuclear industry. Although the technique is more suitable for analysing complete facilities, it is also used to calculate the probability of failure on demand of safety loops. An FTA is created with a top event in mind, e.g., the safety function does not actuate on demand. From this top event an investigation is started to determine the root causes. Basically an FTA is a graphical representation of combinations of basic events that can lead to the top event. A simplified version of the FTA for the safety function in Figure 1 is given in Figure 4. It is possible to quantify the FTA and calculate the probability of occurrence of the top event when the probabilities of occurrence of the basic events are known.
Fig. 4. Simplified fault tree of the safety function of Figure 1 (basic events T1, TM1, I1, T2, TM2, I2, …).
Research has indicated that Markov analysis is the most complete and suitable technique for safety calculations. Markov analysis is a technique that captures transitions between unique system states; in terms of safety this means the working state and the failed state(s). Going from one state to the other can be caused either by a failure of a component or by the repair of a component. Therefore a Markov model is also called a state transition diagram. See Figure 5 for the Markov model of the safety function of Figure 1. Once the Markov model is created and the rate of transition between two states is known (that is, the failure rate or repair rate), it is possible to solve the Markov model and calculate the probability of being in a state.
Fig. 5. Markov model of the safety function (states: System OK, Path 1 Failed, Path 2 Failed, System Failed).
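As a minimal sketch of the state-transition idea, reduced to two states (OK and Failed) and solved by stepping through time. The failure and repair rates below are hypothetical; the paper's Figure 5 model has more states but is solved the same way.

```python
# Sketch: a two-state Markov model (OK <-> Failed) with constant failure
# rate lam and repair rate mu, integrated with small explicit time steps.

def p_failed(lam, mu, t_hours, dt=0.1):
    p_ok, p_fail = 1.0, 0.0   # start in the OK state
    for _ in range(int(t_hours / dt)):
        # transitions during dt: OK -> Failed at rate lam, Failed -> OK at mu
        p_fail += p_ok * lam * dt - p_fail * mu * dt
        p_ok = 1.0 - p_fail
    return p_fail

lam, mu = 1e-4, 1.0 / 8.0   # fail every ~10 000 h, repair in ~8 h (MTTR)
print(p_failed(lam, mu, 1000.0))  # approaches the steady state
print(lam / (lam + mu))           # analytic steady-state unavailability
```

After the transient dies out, the numeric result matches the analytic steady-state unavailability lam/(lam + mu), which is a standard check on such a model.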
The IEC 61508 standard also provides standard formulas to calculate the PFD and PFH for each loop. These formulas are called simplified equations. What many people do not realize though is that these simplified equations are derived from Markov models, and that in practice it is not so simple to derive them. Another limitation of these equations is that they only exist for 1oo1, 1oo2, 2oo2, 1oo2D, and 2oo3 architectures. For any other kind of architecture the standards do not provide equations and thus one needs to refer to any of the above mentioned techniques. Also, the simplified equations are not flexible enough to handle diverse equipment, or different repair times and periodic proof test intervals; hence the name “simplified”. For a complete list of simplified equations derived from Markov models see Börcsök (2004). The following two equations are examples of the simplified equations as they can be found in the standards:
⎛ T1 ⎞
+ β D λdd MTTR + βλdu ⎜ + MTTR ⎟
⎝2 ⎠
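Evaluating such an equation numerically is straightforward once the failure rates are known. The sketch below evaluates the 1oo2 PFD formula of IEC 61508-6, using the channel and group equivalent down times t_CE and t_GE as defined there; all numeric inputs are illustrative assumptions, not values taken from this paper:

```python
# Illustrative inputs (failure rates per hour, times in hours):
lam_dd, lam_du = 2.0e-6, 5.0e-7  # dangerous detected / undetected failure rates
beta, beta_d   = 0.10, 0.05      # common cause factors (undetected / detected)
mttr           = 8.0             # mean time to restoration
t1             = 8760.0          # periodic proof test interval (one year)

lam_d = lam_dd + lam_du
# channel and voted-group equivalent mean down times (IEC 61508-6):
t_ce = (lam_du / lam_d) * (t1 / 2 + mttr) + (lam_dd / lam_d) * mttr
t_ge = (lam_du / lam_d) * (t1 / 3 + mttr) + (lam_dd / lam_d) * mttr

# simplified 1oo2 PFD equation:
pfd_1oo2 = (2 * ((1 - beta_d) * lam_dd + (1 - beta) * lam_du) ** 2 * t_ce * t_ge
            + beta_d * lam_dd * mttr
            + beta * lam_du * (t1 / 2 + mttr))
```

With these inputs the common cause term β·λ_DU·(T1/2 + MTTR) dominates the result, which is typical for redundant architectures and is why the β factors deserve careful estimation.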
Rouvroye (1998) has compared different reliability techniques and their usefulness in the
safety industry. Figure 6 gives an overview of these techniques, and the result is that Markov
analysis is the most complete technique. With Markov analysis it is possible to create one model
that takes into account any kind of component, diverse components, and different repair
and test strategies. It is possible to calculate the probability of failure on demand, the
probability of a dangerous failure per hour, or the probability that the safety system causes a
spurious trip. No other technique can do all of this.
[Figure 6: comparison of reliability analysis techniques — expert analysis, FMEA, FTA, reliability block diagram, parts count analysis, Markov analysis]
Every reliability model needs reliability data in order to actually perform the reliability
calculation and quantify the results. There are many sources for reliability data. The
following list is an overview of available data:
End user maintenance records
Databases
Handbooks
Manufacturer data
Functional safety data sheets
Documented reliability studies
Published papers
Expert opinions
Reliability Standards
The absolute best data end-users can use is their own maintenance data. Unfortunately, not
many end-users have their own reliability data collection program, and there is of course
always the problem that a new safety system contains devices that the end-user has not used
before. Luckily, more and more databases are becoming available in which data from
different sources is collected and can be used by end-users.
For the calculation we need the following reliability data for each device:
Safe detected failure rate
Safe undetected failure rate
Dangerous detected failure rate
Dangerous undetected failure rate
This data was already collected when the architectural constraints were verified, see Table
5. On plant level we also need to know the following reliability data:
Repair rate per device
Periodic proof test interval
Common cause
The repair rate per device depends on the availability of the spare device and the availability
of a repair crew. Periodic proof test intervals can be determined by three means:
The supplier of the device specifies a rate
Laws or standards determine a minimum inspection interval
The desired SIL level determines the periodic proof test interval through the PFD
calculation
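The third route can be sketched as a simple inversion: for a single (1oo1) device the average PFD is approximately λ_DU·T/2, so a target PFD dictated by the SIL band yields the longest acceptable proof test interval. The numbers below are illustrative only:

```python
def max_proof_test_interval(pfd_target, lam_du):
    # invert PFD_avg ~= lam_du * T / 2 for a simple 1oo1 device
    return 2.0 * pfd_target / lam_du

# Example: target average PFD of 1e-3 (upper limit of the SIL 3 band) and
# an assumed dangerous undetected failure rate of 5e-7 per hour:
t_max_hours = max_proof_test_interval(1.0e-3, 5.0e-7)
```

In practice the shortest interval of the three determination methods (supplier specification, legal minimum, PFD-derived maximum) governs.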
Once the reliability model is created and the reliability data is collected, the actual calculation
can be performed. Figure 7 shows an example of a PFD calculation with and without periodic
proof testing. In this example an imperfect proof test is performed every year, which assures that the
PFD level of the safety function stays within the SIL 3 range.
[Figure 7: PFD over a 10-year mission time (0–87,600 hours); the vertical PFD axis spans roughly 1e-5 to 1e-3]
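The behaviour shown in Figure 7 can be reproduced with a simple time-stepping sketch: the probability of an undetected dangerous failure accumulates between tests, and each imperfect proof test removes only part of it. All parameters are illustrative assumptions, not values from this paper:

```python
import math

lam_du        = 5.0e-7  # dangerous undetected failure rate per hour (assumed)
test_interval = 8760    # proof test once a year
coverage      = 0.9     # imperfect test: 90 % of accumulated failures revealed

def pfd_no_testing(t_hours):
    # without proof tests the PFD keeps growing toward 1
    return 1.0 - math.exp(-lam_du * t_hours)

def pfd_with_testing(t_hours, step=24):
    pfd, elapsed = 0.0, 0
    while elapsed < t_hours:
        # accumulate failure probability over one time step
        pfd += (1.0 - pfd) * (1.0 - math.exp(-lam_du * step))
        elapsed += step
        if elapsed % test_interval == 0:
            pfd *= (1.0 - coverage)  # each test clears part of the accumulated PFD
    return pfd
```

The resulting saw-tooth curve, with a residual floor left by the imperfect tests, is the typical shape of plots like Figure 7.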
6. CONCLUSIONS
The purpose of this paper was to explain to end-users the most important high-level
requirements when designing a safety instrumented system. The paper explained that safety
systems can fail in three different ways and that it is important to design, operate and
maintain a safety system in a way that keeps those three failure types under control. In order to
understand the functional requirements of a safety system it is important to carry out a hazard
and risk analysis, and the paper explained several techniques that can be used for this purpose.
The results of the hazard and risk analysis are documented in the safety requirement
specification. Next, the paper explained, starting from a top-level safety function
description, the high-level requirements that end-users or system integrators need to follow
in order to design the actual safety instrumented system. The paper explained the
significance of the architectural constraints from the point of view of the IEC 61508 and IEC
61511 standards. For end-users and system integrators it is important to collect reliability
data, to perform reliability analysis, and to be able to calculate the safe failure fraction and
the probability of failure on demand.
References:
[1] Rouvroye, J.L., Houtermans, M.J.M., Brombacher, A.C. (1997). Systematic Failures in
Safety Systems: Some Observations on the ISA-S84 Standard. ISA-TECH 97, ISA,
Research Triangle Park, USA.
[2] IEC (1999). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
Related Systems, IEC 61508. IEC, Geneva.
[3] IEC (2003). Functional Safety – Safety Instrumented Systems for the Process Industry,
IEC 61511. IEC, Geneva.
[4] Houtermans, M.J.M., Velten-Philipp, W. (2005). The Effect of Diagnostics and Periodic
Proof Testing on Safety-Related Systems. TÜV Symposium, Cleveland, Ohio.
[5] Börcsök, J. (2004). Electronic Safety Systems: Hardware Concepts, Models, and Calculations.
ISBN 3-7785-2944-7, Heidelberg, Germany.
White paper
02
Certification, Proven-in-Use,
and Reliability Data
Three of the most discussed topics
in the functional safety industry
Abstract: Certification and Proven-in-Use can help us select safety equipment and devices
that are compliant with safety standards, and that help us build safety systems
and solutions that meet the required safety criteria set out in those standards.
Users of these devices and equipment need only to understand how to apply
certification and Proven-in-Use in the correct way, so that they can take full
advantage of these solutions. In the end it will save them a lot of time, money
and headaches. Therefore, the purpose of this paper is to explain what
certification and Proven-in-Use are, and what role reliability data plays in these
solutions. The paper will explain what to pay attention to and how to apply
certification, Proven-in-Use and reliability data.
Keywords: Certification, Proven in use, Prior use, Reliability data, Quality data,
IEC 61508, IEC 61511, IEC 62061, EN 50402
© 2009 Inside Publishing. All Rights Reserved.
We depend on solutions such as these to reduce the risk associated with these hazardous
events. This paper will concentrate on the performance of the safety (instrumented) system
solutions. If we are to build safety systems that perform well, we have to construct solutions
that are not only safe and reliable, but which are also compliant with applicable functional
safety standards like IEC 61508 [1], IEC 61511 [2], IEC 62061 [3], EN 50402 [4] and others.
In practice, end-users and other stakeholders, such as system integrators, developers,
manufacturers and consultants in this industry, face only two real problems.

As safety systems have become increasingly complex, with devices more commonly based
on electrical, electronic, and programmable electronic technology, standards such as IEC
61508 and EN 50402 have been developed, which detail specific requirements for building
complex safety components and systems. Following such standards is imperative, not only
because the devices themselves are complex, but also because it is always necessary for
the end-user to demonstrate to a responsible authority, whenever they request such
information, that the installed safety equipment meets the current state of the art in safety
technology, and that state of the art necessarily includes compliance with such functional
safety standards.

Unfortunately these days, end-users often do not have the time, resources or, most
importantly, the knowledge to verify whether the equipment they use is actually compliant
with all applicable safety standards. End-users may be experts in running their own
production processes, but do not necessarily have adequate knowledge of software
operating systems, safety-related bus communication protocols, or reliability predictions of
complex logic solvers. So how do users of safety equipment assure themselves that the
safety equipment used is actually compliant with the safety standards applicable to their
jurisdiction? The answer is third party certification.

Likewise, existing plants use equipment and devices to carry out safety functions that often
existed long before current functional safety standards were released. It is rarely practical,
nor is it usually necessary, to replace all old safety equipment and devices each time a new
functional safety standard comes out. And if the process runs fine and the equipment is
operable, why should they? When a new standard as complex as IEC 61508 is released, it
can often take years for compliant equipment to come on the market anyway, so even if an
end-user wants to replace old safety equipment, in practice it is often not possible. So how
can the end-user deal with his old safety equipment? The answer is Proven-in-Use.

Certification and Proven-in-Use can help us select safety equipment and devices that are
compliant with safety standards and which help us design safety systems and solutions
capable of meeting the required safety criteria set out in those standards. Users of these
devices and equipment need only understand how to apply certification and Proven-in-Use
in the correct way, so as to receive the full benefit and advantage of these solutions. In the
end, proper understanding of how certification and Proven-in-Use work will save end users a
significant amount of time, money and headaches. Therefore, the purpose of this paper is to
explain the concepts of certification and Proven-in-Use, as well as to explain the role
reliability data plays in these solutions. We will explain what to pay attention to and how to
apply certification, Proven-in-Use and reliability data.

Safety systems can fail because the hardware or software has failed; because they were
wrongly designed in the first place; or because they were not operated, used or maintained
as intended. But they can also fail because of environmental influences like earthquakes,
flooding or lightning.

Any failure will always fall into one of the following three categories: random hardware
failures, common cause failures, or systematic failures [6].

In practice, when dealing with safety systems, we can define functional safety as follows:

A safety system is 100% functionally safe if random, systematic and common cause
failures do not lead to malfunction of the safety system and do not result in:
Injury or death of humans;
Spills to the environment;
Loss of equipment or production.

This is a very practical definition, but we also know that it is not possible to achieve 100%
functional safety. But it is possible to certify that a safety system has met functional safety
standards, such as the requirements for SIL 1, 2, 3 and 4 functionally safe systems, without
requiring that a system never fail, as long as the certification is properly performed.

2. Certification: The (Un)Necessary Evil?

2.1. What Is Certification Really?
The very discussion of functional safety certification often results in many wild stories and
ideas being told. Some companies love it, while others hate it. Some understand how to
apply it, while others see no benefit to it, and consider the process an unnecessary evil. One
major reason there is so much confusion about certification is…

2.3. What Should It Mean When Somebody Says This Product Is Certified?
Let's take the example of the IEC 61508 standard. In practice, it is possible to certify a
product according to all requirements of the standard or according to only certain parts of
the standard. For instance, a device without software does not need to comply with the
software requirements of IEC 61508 part 3. But if the product does have software inside and
only a hardware certification has been carried out, such certification is insufficient.
Unfortunately, today it is common industry practice to certify some safety systems this way.
Such partial certification is not necessarily a problem, as long as the certification clearly
states which aspects of the system have been certified. But this must be understood by the
end-user of the product. In practice, the end-user often does not understand the difference;
he just sees a certificate stating SIL X, with a signature, and automatically assumes that all
requirements of the standards have been met. This is a real problem in industry.

A manufacturer who claims compliance with the IEC 61508 standard for one of its devices
must demonstrate that the functional safety standard applies to all aspects of that device, by
hiring a third party to produce a report and certify this device. The question is, what does this
mean in practice? To what should the certifier attest? For a product to be fully certified to the
IEC 61508 standard, which requirements need to be addressed?

The IEC 61508 standard consists of literally thousands of requirements, and it is not easy to
filter out those requirements that do not apply to a product. The author has carried out
hundreds of certifications, and found that a complete functional safety certification statement
according to IEC 61508 should cover the following requirements:
1. Functional safety management;
2. Hardware requirements;
3. Hardware reliability analysis;
4. Software requirements;
5. Basic safety, environmental safety, EMC;
6. User documentation.

Basically, when a manufacturer states that their product is fully compliant with IEC 61508,
then the above requirements should have been addressed. Functional safety management
addresses the life cycle requirements, documentation, verification, validation, assessments,
and measures to avoid and control failures. The hardware requirements and the hardware
reliability analysis go hand in hand. They address the requirements of IEC 61508 part 2 and
include the implementation of measures to control failures and items like safe failure fraction,
hardware constraints, probability calculations, and so on. The software requirements
address IEC 61508 part 3 of the standard and also include the measures to avoid and
control failures. In practice it is not always possible to separate hardware and software
certification.

Completely overlooked by many certifiers are the requirements for basic safety,
environmental safety, EMC, and user documentation. It is very important that we not only
develop devices that are functionally safe but which also function properly in the
environment for which they are intended. This is a requirement of IEC 61508. For each
device, the environmental properties have to be defined upfront, and then the device has to
be designed to work under these environmental properties. In practice, what this means is
that other standards besides IEC 61508 must be part of the certification as well. Typical
standards are IEC 61131 [7], IEC 60101 [8], EN 50082-2 [9], EN 61000-6-3 [10], IEC 60068
[11], and many more, including possible application-specific standards.

User documentation is the other often forgotten requirement that should be certified. It
makes no sense to have a device that is functionally safe when used properly if the end user
has no idea how to use the product properly. Therefore, user documentation plays an
important role. It should explain to an end-user how to install, commission, validate, operate,
maintain and repair the device. So it is very important that the information in the user
documentation is correct. A certifier should check this.

What actually is to be certified is the responsibility of the company that orders the
certification, not the responsibility of the end-user of the product, unless the end-user pays
for the certification. When it comes to devices, the manufacturer usually orders the
certification, which means that they decide what will actually be certified, correct or not.
Many certifications in today's market only partially address the requirements of the
standards. The next time you receive a certificate or a report to the certificate, ask yourself
some questions:
Does the certificate address those standards that are important to us?
Was the certification report complete? Did it address all the requirements of the standard or
only a part?
Is the certification report clear about how the certification has taken place?
Can I trust the certifier? Have they done certification work before or is this the first time? Did
the manufacturer self-certify?

You do not need a positive answer to all these questions in order to decide whether the
device is suitable or not. The questions are merely designed to help you make decisions. For
example, just because you receive a certification from a third party you have never heard of
does not mean that you cannot trust the results of their certification. But it should lead you to
ask more questions or look more carefully into the product certification.

2.4. Is Certification The Same As Assessment?
Certainly not! Assessment is a term used in the functional safety standards themselves;
certification does not exist in these standards as such. While an assessor examines whether
everything was carried out according to the safety plan, and thus whether functional safety is
or can be achieved, certification confirms (attests) the achievement. Often in the certification
industry, the certifier is also the assessor. This is not a conflict of interest, because
certification has a limited, very specific meaning. Certification only confirms that statements
made are true, which means that a certifier is in a very good position to do an assessment.
Why is certification beneficial for end-users? Because it is impossible for an end-user to
analyze every single safety device in the plant, or every device that a salesperson wants to
sell to the end-user. End-users simply do not have the time, knowledge and resources to
properly evaluate all these devices based on the requirements of the standards. The first
end-user who understands how to make the "stack" functionally safe, or how to guarantee
that an ASIC is developed without systematic failures, still needs to reveal himself. But even
if end-users had the time, knowledge and resources, such an evaluation would still be
practically impossible, as the manufacturers of the devices will not give them access to the
internal workings of a device, because such information is proprietary.

In practice there is only one way for end-users of safety devices to overcome this, and that is
to rely on third party certification. An end-user often buys a certified product because a
trusted third party has investigated that the statements made by the manufacturer are true,
and has assured end-users that the product meets all the requirements of the standard. This
also means that an end-user is basically responsible when a non-certified device is used in a
safety loop. How will the end-user demonstrate that the product is suitable and meets the
requirements of the standards? Well, today this is a matter of risk management; as long as
nothing happens he does not need to prove it. But are you willing to take that risk?

Why is certification beneficial for product developers? Product developers should consider
certification because it is a one-stop-fits-all solution. If the product developer goes through
certification once, it can sell the same product to any end-user without having to
demonstrate to each one that the product is suitable. Of course, the main task for product
developers is to figure out what potential end-users want to have certified and to find a
trustworthy certifier. For product developers, certification is often used as a marketing and
sales feature. It differentiates them from the competition (as long as the competition does not
have certification or suddenly has a "better" certification).
2.7. Can Anything Be Certified?
It certainly seems so. In the functional safety business, it is possible to get literally anything
certified if it has anything to do, directly or indirectly, with safety systems.1 These days,
functional safety certification addresses such items as:
Actuators;
Solenoid valves;
Valves;
Bus networks and other communication peripherals.

1 For example, certification of a smart transmitter takes anywhere from 6 months to 2 years.
A logic solver takes anywhere from 1 year to 5 years, if successful at all.

Besides equipment, since the late 1990s there has been a growing trend to certify
organizations and people. Some organisations, like product manufacturers, have their
functional safety management (FSM) system certified according to the FSM requirements of
IEC 61508 and IEC 61511. The first organisation to do so was Honeywell, the Netherlands.
This practice is not really taking off throughout industry, however. It is one of those typical
certifications where the manufacturer understands why he does it, but his customers have
no clue what it is and thus do not know how to appreciate it.

Since 1999, when TÜV SÜD first introduced a program (CFSE2) to certify people according
to functional safety principles, more and more professionals and managers are getting
certified worldwide. People certification started first with functional safety consultants,
followed by system integrators. Today, end-users are still significantly behind the curve. Why
do the end-users of the safety devices resist the trend toward certification?

Whether the issue is products, systems, organisations or people, those companies and
people who deliver certified products or services have a real interest in showing customers
that their products, services or even they themselves are certified, in order to demonstrate,
through independence, their qualification to work in the functional safety industry. For
end-users, this is good news, because they feel less need to investigate whether a product,
service or person is "compliant," although certification alone can never be cited as a reason
not to investigate any further; just because something is certified doesn't automatically mean
that it is the best choice or option. Certification just helps to get a better picture of a product,
service or person; it is just one element that supports the decision making process. Many
other factors contribute to selecting the right choice, and these other factors can sometimes
lead an end-user to another, non-certified choice.

2 The author, at the time a department manager, was personally responsible for the CFSE
program at TÜV SÜD. He developed the program in 1998 from an idea into a marketable
product. Today over 3000 professionals worldwide are certified for functional safety. Today
this program is called the TÜV Functional Safety Certification Program (FSCP).

2.8. Which Standards Should I Certify To?
Many standards exist that deal with requirements for safety devices or solutions. These
standards can be divided into product-specific standards and application-specific standards.
In terms of functional safety the most important standards are:
IEC 61508;
EN 50402;
IEC 61511;
IEC 62061.
IEC 61508 and EN 50402 are typical product standards, while IEC 61511 and IEC 62061
are typical application standards. The IEC 61508 standard was officially released in 1999
and deals with any type of safety system with one or more electrical, electronic and
programmable electronic (E/E/PE) devices. Every manufacturer of safety devices based on
E/E/PE technology must comply with this standard. The standard itself has many detailed
requirements that deal with electronic components and software issues.

When an individual product is certified, it should first be certified against a product standard
and not an application standard. Often, certificates show that a product is certified against
both product standards and application standards. In theory it usually makes no sense to
certify a product against an application standard, but it is done anyway. Failing to put the
application standard on the certificate turns out to be a marketing and sales problem. Many
end-users do not understand the difference between a product standard and an application
standard; they care more about applications than about individual products. In the end, one
of the most important reasons why application-specific standards are put on certificates is to
make sure that the end-user actually buys the product, not because it is really needed from a
certification point of view.
For example, IEC 61508 does not address applications; it is a general purpose standard.
EN 54-2 [13], on the other hand, is an application-specific standard for fire detection and
alarm systems. End-users often look for products that can work in an EN 54-2 environment,
while in reality they need a product that works according to the requirements of IEC 61508.
If an end-user sees a certificate stating that the product is IEC 61508 compliant, he might
still think that he cannot use it in an EN 54-2 environment. Marketing and sales teams want
the EN 54-2 certification for their products as well, despite the fact that there are almost no
requirements on the product level in this standard.

Finally, we have the aforementioned environmental, basic safety and EMC standards. Every
product must work in a particular environment, and it is important that the product is tested
for that environment. Many different standards exist that can be used for this kind of testing,
some of which are international standards, and others of which are country-, industry- or
application-specific.

2.9. What Is More Important, The Report Or The Certificate?
A good certification always consists of a technical report, and it may include a certificate, but
there should never be a certificate without a technical report. The technical report is always
more important than the certificate. The certificate is like a degree from a university. The
degree demonstrates graduation, but more important is the transcript, which shows how you
passed all the exams. It is the same for certification. The certificate itself is only a summary
of the results, while the report to the certificate contains all the details of how the third party
did the verification and assessment of the product.

Another reason the report is very important is that it not only explains how the certification
was performed, but often also lists possible restrictions on the use of the device. These
restrictions are important to know. Do you want these restrictions? Are they limiting you in
any way? Do they force you to buy more equipment or to perform testing when you do not
want it? All of this is important information for you to know before you decide to buy the
device.

In other words, before you buy a certified device it is always important to request the report
to the certificate. Alarm bells should go off if a supplier does not want to show you the report
before the purchase is made, because it is possible there is information in there they do not
want you to know.

2.10. What If Certification Does Not Exist?
It still happens a lot in industry that we want to use devices in safety applications for which
there is no available proof of compliance with safety standards. There may be many reasons
why sufficient information about a device is not available, including lack of certification; for
some safety solutions, it is possible that no devices exist that are capable of demonstrating
compliance with the standards. What can we do in such a case? Here are some solutions:
1. Try to push the manufacturer to supply the appropriate documentation. If you would like to
use this device, try to convince them to go through third party certification;
2. If the manufacturer is not willing or able to provide certification, try to find another device
on the market with the appropriate third party statements;
3. If there are no other devices on the market, try to make a safety case for this device.
Collect as much information as possible about the device, and try to find the answers to the
following questions, in order to determine whether the device is or is not suitable for safety
purposes:
o Why do you think this device is suitable?
o How many of these devices are currently in use? How long have these devices been in
use?
o How often has this device been used in a safety function? Has the safety function been
activated? Did the device work?
o How many of these devices have failed? How did they fail? Safe or dangerous?
o Does the device contain software? How do you know the software has no bugs? Which
version is running? Is there a bug list?
o Have you ever had to send a device back to the manufacturer? Did they respond to your
needs in a satisfactory manner?
o Are you convinced that this device is the right choice?

If you do not have the above information, how can you be convinced that the device is
suitable for a safety application? If you still want to use the device (and there might be good
reasons to do so), then introduce the device slowly. First, test it in a non-safety application
for a while, to see if it works the way you expect. Activate its safety function many times. Just
keep in mind that slow and old is good in the safety business; we do not need the latest
high-tech features. We should have simple safety problems, like measuring the pressure
and, if it is too high, stopping the flow. This we can do with simple solutions. Keep it as
simple as possible, as Einstein once said. If it looks too complex, it probably is, and most
likely another, simpler solution exists.

2.11. Conclusions Certification
Today, certification is really a necessary evil. It is almost impossible in the functional safety
industry to sell products without certification. When looking for certified devices, always pay
more attention to the report to the certificate than to the certificate itself. In the end, it does
not matter who certified a product as long as you feel that you can trust them. Don't forget,
though: just because you use a certified product does not…
The current version of IEC 61511 defines Proven- higher the SIL level, the more work a developer
in-Use as follows: has. If a device can be regarded as Proven-in-Use,
then it is unnecessary to carry out this develop-
“when a documented assessment has
ment, and that is a big incentive.
shown that there is appropriate evidence,
based on the previous use of the compo- The most important requirement of IEC 61508 is
nent, that the component is suitable for use the necessary documentary evidence regarding
in a safety instrumented system (see “prior the use of the device. This evidence should typi-
use” in 11.5)” [2] cally include requirements for both hardware and
software, addressing:
The definition of IEC 61511 is very weak and
leaves too much room for interpretation. It fails Failure recording and failure data indicat-
to address the actual problem. Even worse, in IEC ing that the probability of failure is low
14 61511 two terms are used, i.e., prior use and Prov- enough;
en-in-Use. Prior use is even weaker than Proven-
Any testing that has taken place (a demand
in-Use, as it only indicates that a device was used
is considered a test of the safety function);
previously. A term like prior use does not address
the question of whether the device worked or The current condition of use must be simi-
not; therefore the term Proven-in-Use is much lar to the previous conditions of use. Condi-
more suitable. Now that we know what Proven- tions of use include:
in-Use is, we can look into the requirements for
o Environment;
Proven-in-Use.
o Modes of use;
When considering the above requirements one should always take into account:

The complexity of the subsystem;
The contribution made by the subsystem to the risk reduction;
The consequence associated with a failure of the subsystem;
The novelty of the design.

The approach in IEC 61508 is very clear. The more complex a device is, and the higher the SIL level of the loop, the more attention needs to be placed on proving Proven-in-Use. The actual information that needs to be documented is not that complex either. The problem is, of course, that almost no one ever wrote this information down. Only today are companies starting to document that information, but 10 years have passed since the release of IEC 61508. Many companies do not realize that evidence for Proven-in-Use can be relatively easy to document. Evidence can come from data such as:

Documented loop checks after plant turnarounds;

... ware have been applied. For an end-user, that can be impossible to demonstrate, because an end-user cannot look inside a product. No amount of operating experience and statistical data analysis can give you answers to questions such as these, which only product developers can answer. In practice, however, in most cases, only the end-user knows how many devices he has, how long he has been using them and when and how they failed (if he documented this). In other words, only end-users can determine Proven-in-Use, although there are some requirements they will be unable to prove.

3.4. Which Devices Can Be Proven-in-Use?

The standards have one requirement about the devices that can be claimed Proven-in-Use. Both standards state that only devices with restricted functionality can be claimed as Proven-in-Use. Unfortunately, the standards do not explain what restricted functionality is. Typical devices with restricted functionality are:

Sensors and transmitters;
A Proven-in-Use device should have sufficient statistical evidence about the failure rate. This ... would suffer a dangerous undetected failure. This is not a problem in the statistics industry.

4 A failure is only detected when it is detected by built-in diagnostics. Failures that are revealed by proof tests are still considered as being undetected. This is just a matter of definition.

The safe failure fraction (SFF), and
The probability of failure on demand (PFD).

The best available data comes from the end-user of the product. Unfortunately, this is only the case for existing devices, and only when an end-user has a documented reliability data collection program. Fortunately, more end-users are collecting reliability data, though many still have no idea how often a device fails, how it fails, how long it takes to replace it, or what needs to be done to correct it, other than the “we need to replace it quickly” solution. Instead, for each device that fails, end-users should ask themselves some questions, in order to help them build a knowledge base:

When did the device fail?
How did the device fail? Was it a random, common cause, or systematic failure?

In the past, many industry initiatives have been undertaken to collect data on an industry basis. One famous database is OREDA [16], which collects data from the North Sea operators. This project is ongoing and continuously updated. The American Institute of Chemical Engineers carried out a similar project for the chemical industry in the USA; they have the CCPS [17] databases. Both projects focus on many types of industry equipment. SINTEF [18] and exida.com [19] have handbooks that collect data specific to safety instrumented systems. And many other sources exist as well.

Manufacturers often carry out their own reliability studies and produce their own data for their devices. These reliability studies are carried out on individual devices, examining every single component of the device, as well as the entire device. Life and stress testing and reliability modelling are carried out to find failures or to predict the failure behaviour of the device. Even the failure rates of individual device components (resistors, capacitors, integrated circuits, relays, springs, and so on) can be predicted using reliability standards like Military Handbook 217 [20], IEC 62380 [21], Telcordia [22], the Mechanical Handbook [23], PRISM [24], etc., which serve as a basis to predict the reliability of complete devices.

... while in the first case we are comparing apples with pretty much anything.

Though today the functional safety industry is not concerned with uncertain data, and thus does not address the problem, the reliability industry recognized this problem ages ago and has already solved it. Reliability engineers use uncertainty and sensitivity analyses to address the consequences and effects of bad quality data. How long will it be before the functional safety industry picks up on this problem?
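The kind of uncertainty analysis the reliability industry applies can be illustrated with a small Monte Carlo sketch. All numbers below, including the lognormal spread around a handbook-style failure rate, are illustrative assumptions, not values from this paper:

```python
import random

def pfd_avg(lam_du: float, proof_test_interval: float) -> float:
    """Average probability of failure on demand of a single channel,
    using the common approximation PFDavg = lambda_du * TI / 2."""
    return lam_du * proof_test_interval / 2.0

# Instead of a single point value, treat the dangerous undetected failure
# rate as uncertain: a lognormal spread around roughly 1.0e-6 /h (assumed).
random.seed(42)
samples = sorted(
    pfd_avg(random.lognormvariate(-13.8, 0.8), 8760.0) for _ in range(10_000)
)
median = samples[len(samples) // 2]
p90 = samples[int(0.9 * len(samples))]
print(f"median PFDavg: {median:.2e}, 90th percentile: {p90:.2e}")
```

Reporting percentiles instead of a single number, as in Figure 5, makes visible how much a conclusion depends on the quality of the input data.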
6. References
1. Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC 61508. IEC, Geneva, 1999
2. Functional safety - Safety instrumented systems for the process industry sector, IEC 61511. IEC, Geneva, 2003
3. Safety of machinery - Functional safety of safety-related electrical, electronic and programmable electronic control systems, IEC 62061. IEC, Geneva, 2005
4. Electrical apparatus for the detection and measurement of combustible or toxic gases or vapours or of oxygen - Requirements on the functional safety of fixed gas detection systems, EN 50402, 2005
5. The Merriam-Webster English Dictionary. Merriam-Webster; Revised edition, ISBN-13: 978-0877799306, July 2004
6. Houtermans, M.J.M., IEC 61508: An Introduction to the Safety Standard for End-Users. SISIS, Buenos Aires, 2003
7. Programmable controllers - Part 3: Programming languages, IEC 61131. IEC, Geneva, 2003
8. Safety requirements for electrical equipment for measurement, control, and laboratory use, IEC 61010. IEC, Geneva, 2001
9. Electromagnetic Compatibility - Generic Immunity Standard - Part 2: Industrial Environment, EN 50082-2, 1999
10. Electromagnetic compatibility (EMC) - Generic standards - Emission standard for residential, commercial and light-industrial environments, EN 61000-6-3, 2007
11. Environmental testing, IEC 60068. IEC, Geneva, 1988
12. European Union website: http://ec.europa.eu/enterprise/newapproach/legislation/nb/notified_bodies.htm
13. Fire detection and fire alarm systems - Part 2: Control and indicating equipment, EN 54-2, 1997
14. Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC 61508. IEC, Geneva, new draft
15. Principles for computers in safety-related systems, DIN V VDE 0801
16. SINTEF, Offshore Reliability Data Handbook, 4th Edition, OREDA, 2002
17. CCPS, Guidelines for Process Equipment Reliability Data, with data tables. AIChE, ISBN 0-8169-0422-7
18. SINTEF, Reliability data for safety instrumented systems, PDS Data Handbook, 2006 Edition
19. Exida.com, SERH, 3rd Edition
20. MIL-HDBK-217, Reliability prediction of electronic equipment
21. IEC, Reliability data handbook - Universal model for reliability prediction of electronics components, PCBs and equipment, IEC TR 62380. IEC, Geneva, 2004
22. ATT Labs, Telcordia Issue 2, Telcordia
23. Handbook of Reliability Prediction Procedures for Mechanical Equipment (NSWC-98/LE1), Naval Surface Warfare Center, 1992
24. System Reliability Center, PRISM software tool, Alion Science
Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com
This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.
Risknowlogy, the Risknowlogy logo, and Functional Safety Data Sheet are registered service marks.
Abstract
Worldwide, chemical and other processing plants are trying to implement reliability programs to improve plant safety while maintaining plant availability. These programs can vary significantly in size and complexity. Any kind of reliability program, such as a preventive maintenance (PM) program, always consists of one or more reliability models and the reliability data needed to execute these models. Needless to say, the successful implementation and utilization of these reliability programs depends heavily on the accuracy of the reliability models and the availability of realistic data, or at least data that is as close to reality as possible.
The objective of this paper is to give the reader a better understanding of the importance of reliability engineering, focusing on the collection of reliability data for the purpose of process availability, safety and preventive maintenance programs. The paper first explains what reliability engineering is and why it is important for processing plants to have a reliability program. Second, it gives an overview of different programs in use by companies today (preventive maintenance, risk based inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It explains the kind of reliability data needed and the currently available sources for this data. The focus is on the actual data available within the plant, which can, among other ways, be collected from the archived data in the DCS systems. An excellent time to collect data is during scheduled plant turnarounds. Based on the example of a control valve, the paper explains the role that humans can play in improving the administration of reliability data during plant turnaround. The importance of failure analysis is also addressed. Finally, the paper concludes with an example of a decision support model that depends heavily on accurate reliability data. This example demonstrates how a plant owner can benefit from a practical application of reliability engineering.
1 Introduction
Worldwide, chemical and other processing plants are trying to implement reliability programs to improve plant safety while maintaining plant availability. These programs can vary significantly in size and complexity. Any kind of reliability program, such as a preventive maintenance (PM) program, always consists of one or more reliability models and the reliability data needed to execute these models. Needless to say, the successful implementation and utilization of these reliability programs depends heavily on the accuracy of the reliability models and the availability of realistic data, or at least data that is as close to reality as possible.
At the start of any reliability data program good data is usually missing. In that case companies depend on external data sources (e.g., handbooks, databases, and expert opinions) that do not necessarily represent the situation at their own plant. Data needs to be collected for each piece of equipment, device or instrument needed to operate the plant. Many companies observe during the first use of the reliability program that further fine tuning of the collected data is needed, as there is an offset between the current model and the actual situation observed in the plant. Once the lack of data or the uncertainty in the data starts to decrease, the models become more accurate and the companies start to reap the benefits of the implemented reliability programs. Plant availability and safety will both increase, more preventive maintenance will take place, and the total lifetime operating cost (TLOC) will decrease because of fewer unscheduled maintenance actions and the associated spurious trips of the plant.
The objective of this paper is to give the reader a better understanding of the importance of reliability engineering, focusing on the collection of reliability data for the purpose of process availability, safety and preventive maintenance programs. The paper first explains what reliability engineering is and why it is important for processing plants to have a reliability program. Second, it gives an overview of different programs in use by companies today (preventive maintenance, risk based inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It explains the kind of reliability data needed and the currently available sources for this data. The focus is on the actual data available within the plant, which can, among other ways, be collected from the archived data in the DCS systems. An excellent time to collect data is during scheduled plant turnarounds. Based on the example of a control valve, the paper explains the role that humans can play in improving the administration of reliability data during plant turnaround. The importance of failure analysis is also addressed. Finally, the paper concludes with an example of a decision support model that depends heavily on accurate reliability data. This example demonstrates how a plant owner can benefit from a practical application of reliability engineering.
2 Reliability engineering
Reliability engineering plays an important but undervalued role in today's processing plants around the world. Many companies might not realize it, but reliability engineering lies at the heart of total asset management, a popular buzz term in industry today. What total asset management entails is not clearly defined yet, but it incorporates elements such as reliability centered maintenance (RCM), total productive maintenance (TPM), design for maintainability, design for reliability, life cycle costing, loss prevention, probabilistic risk assessment and others. The objective of total asset management is to arrive at the optimum cost-benefit-risk asset solution to meet our desired production levels. In other words, how can we spend the least money on our plant while meeting our production targets and maintaining process availability and process safety? Many aspects are involved in achieving this, but when it comes to the hardware and software that we use in our plant, reliability engineering is the discipline to utilize.
Reliability engineering is a very broad discipline, practiced both by engineers who design the hardware and/or software of individual products and by engineers who use these products and integrate them into larger systems. A reliability engineer in a plant has a task similar to that of a reliability engineer responsible for the design of a transmitter or valve. They apply similar techniques to perform their jobs, only at a different scale and with a different focus.
Reliability itself is defined as the probability that a product or system meets its specification over a given period of time [1]. The word specification is of course very broad, and a product might have several functions. One can calculate the reliability of each individual function, or of all the functions together that make up the specification. The term time can also be replaced by distance, cycles or other units as appropriate. In other words, it is very important to be clear when we talk about "reliability", as it can have different meanings to different people and in different situations. In a plant we can calculate process availability, unavailability, the probability of failing dangerously or failing safely, etc., which are all aspects of, and related to, reliability. In general, reliability deals with the probability of failure of components, products and systems, and is therefore at the heart of disciplines like hazard and risk analysis, loss prevention, maintenance programs, quality assurance and so on.
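For a component with a constant failure rate, this definition reduces to the familiar exponential survival formula R(t) = exp(-λt). A minimal sketch; the failure rate used is an assumed example value, not a figure from this paper:

```python
import math

def reliability(failure_rate: float, t_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of operating t hours without
    failure, assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * t_hours)

lam = 2.0e-6        # assumed failure rate, failures per hour
one_year = 8760.0   # hours in one year
r = reliability(lam, one_year)
print(f"R(1 year) = {r:.4f}")       # probability of surviving the year
print(f"F(1 year) = {1 - r:.4f}")   # probability of at least one failure
```

Replacing hours by cycles or distance only changes the unit of the rate, which is exactly why the text stresses being explicit about what "reliability" refers to.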
Reliability engineering is thus the discipline of ensuring that a product or system will be reliable when operated in a specified manner. It is performed throughout the entire life cycle of a product or system, including design, development, test, manufacturing, operation, maintenance and repair. In process plants it is often a staff function whose prime responsibility is to ensure that maintenance techniques are effective, that equipment is designed and modified to improve maintainability, that ongoing maintenance technical problems are investigated, and that appropriate corrective and improvement actions are taken. In reality, though, it is much broader than that. Reliability engineering deals with every aspect of a component or system: from making a reliable design, to reviewing operating and maintenance procedures, to setting up a reliability data collection program. In many plants reliability engineering is also called maintenance engineering.
Probabilistic risk and safety assessment depends heavily on reliability engineering techniques and theory. With risk assessments we try to establish the risk associated with operating a process plant. Risk assessment often uses a "top-down" approach to establish and rank the risk of individual areas of a plant and of process equipment, to eventually establish the risk associated with the complete facility or process plant. Risk is defined as the combination of consequences and frequencies. We can only determine the frequency of an event occurring if we know the individual probabilities of the equipment failures associated with that event. In order to carry out a risk assessment we need to know how often a pump fails, a valve gets stuck open, or the instrument air is lost. Determining these probabilities is the discipline of reliability engineering. Without proper failure rate data for equipment we cannot establish a quantitative risk level.
When the risk level is established, it may turn out to be too high, and therefore need to be reduced, or low enough, but in need of being maintained at that level. Standards like IEC 61508 [2] and IEC 61511 [3] are based on this concept. When we need to reduce the risk we can either reduce the consequence or reduce the frequency of the hazardous event. We can reduce the frequency by implementing a safety system. But this safety system needs to be reliable enough. We need to design a safety system that is so reliable that it reduces the risk to a level where we can accept it again. This means that we not only need reliable safety system components but also an appropriate safety system design to achieve overall reliability. In order to maintain our level of risk we also need to maintain our process plant and safety system. This is why reliability engineering is often called maintenance engineering: it makes sure that the assumptions we made during our risk and safety assessments remain valid throughout the life of our facility. Being able to collect failure rate data or predict failure behavior can help us in our maintenance strategy.
One program used for the prediction of failures is condition based maintenance, or predictive maintenance [5]. As its name implies, maintenance is performed based on the condition of the equipment subject to maintenance. We try to measure the condition of equipment in order to assess whether it will fail during some future period. The objective is to avoid failure, and thus we either maintain or replace the product just in time. What actually needs to be monitored depends on the equipment; it can mean that we measure, for example, particles in the lubrication oil of a gearbox, or that we apply statistical process control techniques and monitor the performance of equipment. If we combine reliability theory with maintenance, we can try to predict probabilistically when to perform maintenance. This is called reliability centered maintenance [5]. It is a structured process, originally developed in the airline industry, which depends heavily on reliability data and expert systems to interpret that data.
Condition based maintenance or reliability centered maintenance can still mean that we are too late, or that the maintenance occurs at an inconvenient time. In order to prevent this we can instead implement preventive maintenance. The strategy of preventive maintenance is to replace or overhaul a piece of equipment at a fixed interval, regardless of its condition at that time or the expected probability of failure. It is purely based on time. Reliability and decision modeling can demonstrate that it is often more cost effective to replace a piece of equipment before it has failed, at a scheduled time, than to wait until it fails unexpectedly. In this way replacements can be made at, for example, scheduled plant shutdowns.
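The trade-off behind scheduled replacement can be sketched with a standard age-replacement cost model: replace at age T (cheap, planned) or at failure (expensive, unplanned), whichever comes first. The Weibull parameters and the cost figures below are illustrative assumptions, not data from this paper:

```python
import math

def replacement_cost_rate(T, beta, eta, c_planned, c_failure, steps=10_000):
    """Expected cost per operating hour of an age-based replacement policy.
    Survival follows a Weibull law R(t) = exp(-(t/eta)**beta); the expected
    cycle length is the integral of R(t) from 0 to T (trapezoidal rule)."""
    R = lambda t: math.exp(-((t / eta) ** beta))
    dt = T / steps
    mean_cycle = sum((R(i * dt) + R((i + 1) * dt)) * dt / 2 for i in range(steps))
    expected_cost = c_planned * R(T) + c_failure * (1.0 - R(T))
    return expected_cost / mean_cycle

# Hypothetical wear-out item: beta=3 (ageing), eta=10000 h characteristic life,
# planned replacement 1 cost unit, unplanned failure 20 cost units.
for T in (2000.0, 5000.0, 10000.0):
    print(f"replace at {T:>7.0f} h -> cost rate {replacement_cost_rate(T, 3.0, 10000.0, 1.0, 20.0):.2e}")
```

With these assumed numbers the early, scheduled replacement has the lowest cost per hour, which is exactly the effect the paragraph above describes.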
There is no single program that is the best strategy for a plant. Most likely, different programs are applied to different pieces of equipment. Some equipment lends itself perfectly to condition based or reliability centered maintenance; for other equipment preventive maintenance is more appropriate.
4 Reliability modeling
Reliability engineering heavily depends on probabilistic methods. In order to predict something,
whether it is the reliability of a piece of equipment or a complete process plant, we first need a
reliability model. There are many different techniques and methods developed over time that we can
use to make models. If we make models of (complex) systems for the purpose of prediction we
usually depend on one or more techniques like:
Reliability block diagrams
Fault trees
Markov models
Other techniques exist as well, but these are very common ones. Figure 1 shows a safety function required to reduce the risk associated with high temperature in a vessel. In order to protect the vessel against overtemperature, a safety system has been built with two temperature sensors connected via two transmitters to a logic solver. The logic solver consists of an input board, a CPU board and an output board. The input board utilizes two input channels, while the output board utilizes three output channels. These three output channels are required because we need to open two relays to stop two pumps, and one solenoid valve needs to close in order to open a drain valve.
For this system we can do all kinds of analyses, e.g., calculation of the probability of failing safely, the probability of failing dangerously, the availability of the safety function, the unavailability of the process due to spurious trips, the desired periodic proof test interval, optimization of maintenance strategies and so on. In order to perform these analyses we need a reliability model of this safety function. Three different reliability models have been created of the same function, i.e., a reliability block diagram, a fault tree and a Markov model, represented respectively in Figure 2, Figure 3, and Figure 4. In order to actually perform calculations we need to fill the models with reliability data.
Figure 1 - From specification to hardware design of the safety instrumented system [6]
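A reliability block diagram of a function like this can be evaluated with simple series/parallel probability rules: series blocks fail if any block fails, redundant blocks fail only if all fail. The sketch below mirrors the structure of Figure 1 in simplified form; all failure probabilities are assumed placeholder values, not figures from this paper:

```python
def series(*p_fail):
    """Failure probability of blocks in series: fails if any block fails."""
    ok = 1.0
    for p in p_fail:
        ok *= (1.0 - p)
    return 1.0 - ok

def parallel(*p_fail):
    """Failure probability of redundant blocks: fails only if all fail."""
    fail = 1.0
    for p in p_fail:
        fail *= p
    return fail

# Hypothetical per-block probabilities of dangerous failure
p_channel = series(1e-3, 5e-4, 2e-4)          # sensor + transmitter + input channel
p_measurement = parallel(p_channel, p_channel)  # two redundant measurement channels
p_function = series(p_measurement, 1e-4, 3e-4)  # plus CPU and output path in series
print(f"probability safety function fails on demand: {p_function:.2e}")
```

Note how the redundant sensor pair contributes almost nothing compared with the non-redundant CPU and output path; that is the kind of insight these models deliver once filled with data.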
Figure 5 - Reliability calculations of a HIPPS with uncertain data, two different periodic proof
test intervals and two periodic proof test coverages [7]
Basically the following data sources exist in industry:
End user maintenance records
Industry databases
Reliability standards
Handbooks
Manufacturer data
Expert opinions
Published papers
The most preferred data is always the data from the plant itself. Usually this data is collected via maintenance records. Your own data is the best data, for obvious reasons. Look at it this way: when two companies buy the same valve, but one company uses the valve on an offshore platform in the North Sea while the other uses it in a plant in the desert, then we cannot expect both valves to have the same failure behavior. Not only do environmental parameters influence the failure behavior of a device, but so do its operational use and the maintenance strategy of the company. Since no two companies are the same (and probably not even two factories within a company are the same), their similar devices will not fail at the same rate either. Thus the best data is the data you collect yourself.
If this kind of data is not available, then the next best source is data from industry databases. Figure 6 shows two industry databases or handbooks that can be used. One is the OREDA [8] database and the other is the SINTEF [9] handbook. Both have collected reliability data over time. OREDA holds reliability data collected from offshore companies in the North Sea, and the SINTEF handbook holds reliability data specifically for safety equipment. Several other databases and handbooks exist in the world that can be utilized, but no matter who delivers the data, it is important to tailor the data in a way that is useful for the applicable situation.
Figure 6 - Two examples of industry databases, OREDA [8] and SINTEF [9]
When collecting reliability data we need to make sure that we document the right information when a piece of equipment fails. Basically we are interested in three types of information, i.e., the failure rate, the failure modes, and the repair times of a device. Unfortunately, a lot of the maintenance records that we use are not suitable for reliability data collection, as the desired information is either not recorded or recorded in a way that it cannot be used. It is very important that we get an overview of how often a device fails and how it has failed. Before we can document that information we first need to be clear about the function of that device.
For example, the function of an ESD valve is to close upon demand. This valve can have the following general failure modes:
Stuck open
Stuck closed
Stuck in position
Leakage
A control valve has a different function than an ESD valve. For a control valve we might be interested in failure modes like:
Moves too fast
Stuck in position
Leakage
What a failure mode really means can only be determined when we understand the failure mode in the larger context of the plant. We need to understand what the functionality of a device is when it is used. Consider the following two valves. One valve controls the flow of an inlet pipe of a vessel, while the other valve is a drain valve for the same vessel. The valve on the inlet pipe is normally open and should close upon demand. The drain valve is normally closed and should open upon demand. Both valves have the same failure modes, like stuck open, stuck closed, stuck in position or leakage. But the effects of these failure modes are opposite for the two valves. Thus, it is important to understand the function of a device on device level, and on system level, in order to be able to properly document the failure behavior.
Collecting failure rates can also be done on different levels, but the only correct level is the failure mode level. In practice, the maintenance department should track, for each device type, the number of devices installed, the operating hours of each device, and the times at which devices have failed. This information, in combination with the failure modes, allows us to calculate the failure rate per failure mode, and that is exactly what we need for our reliability models.
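Estimated from maintenance records, the failure rate per mode is simply the number of failures observed in that mode divided by the cumulative operating hours of the installed population. A minimal sketch; the device counts, operating hours and failure records are invented for illustration:

```python
from collections import Counter

def failure_rates_per_mode(failure_modes, n_devices, hours_per_device):
    """Estimate lambda per failure mode as
    (failures observed in that mode) / (total operating hours).
    `failure_modes` is a list of failure-mode strings from maintenance records."""
    total_hours = n_devices * hours_per_device
    counts = Counter(failure_modes)
    return {mode: n / total_hours for mode, n in counts.items()}

# Hypothetical records: 50 ESD valves, roughly 35,000 operating hours each
records = ["stuck open", "leakage", "stuck open", "stuck in position"]
rates = failure_rates_per_mode(records, 50, 35000.0)
for mode, lam in rates.items():
    print(f"{mode:>17}: {lam:.2e} /h")
```

The per-mode rates produced this way are exactly the numbers that feed the reliability block diagram, fault tree or Markov model of the safety function.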
For each device we should basically collect the failure rates per failure mode, but in practice many companies do not have this kind of information. Often they need to work the other way around in order to determine the failure rates per mode. Consider the safety industry, where only 4 different failure modes are of interest [11]:
Safe detected
Safe undetected
Dangerous detected
Dangerous undetected
Only electronic devices can benefit from diagnostics and have detected failure modes. A partial stroke test is not a diagnostic test, as diagnostic tests are defined as frequent tests that run fully automatically [10], while most partial stroke setups require human interaction. Therefore it is in most cases not possible for mechanical devices, like valves, to define safe detected and dangerous detected failure modes, only undetected failure modes. This also makes sense. When a valve is stuck open and one performs a partial stroke test once every 6 months, then potentially we do not know about this stuck-at failure for 6 months. Detecting the failure after 6 months is good, but far too slow to act upon. Therefore only devices with built-in tests that run automatically and frequently are useful in this respect, as we can act upon a failure immediately and do something about it.
If we have the following information available then we can calculate the four failure modes as
desired:
The overall failure rate of a device (λ), this includes all failures of the device regardless of
their failure mode
The safe ratio of the failures (SR), i.e., the ratio between all safe failures and all dangerous
failures of a device
The safe diagnostic coverage of a device (SDC), i.e., the percentage of all safe failures that
can be detected through diagnostic tests
The dangerous diagnostic coverage of a device (DDC), i.e., the percentage of all dangerous
failures that can be detected through diagnostic tests
Consider the following example where we can calculate the four failure rates important in the safety
industry from the following basic data:
λ = 5.5 E-6 /h
SR = 80%
SDC = 90%
DDC = 90%
Safe detected failure rate λ_sd = 5.5E-6 x 0.8 x 0.9 = 3.96E-6 /h
Safe undetected failure rate λ_su = 5.5E-6 x 0.8 x (1.0 - 0.9) = 0.44E-6 /h
Dangerous detected failure rate λ_dd = 5.5E-6 x (1.0 - 0.8) x 0.9 = 0.99E-6 /h
Dangerous undetected failure rate λ_du = 5.5E-6 x (1.0 - 0.8) x (1.0 - 0.9) = 0.11E-6 /h
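The same arithmetic can be wrapped in a small helper that also derives the safe failure fraction; the input values reproduce the example above:

```python
def split_failure_rate(lam: float, sr: float, sdc: float, ddc: float):
    """Split an overall failure rate into the four modes used in the safety
    industry, given the safe ratio (sr) and the safe/dangerous diagnostic
    coverages (sdc, ddc)."""
    lam_sd = lam * sr * sdc                   # safe detected
    lam_su = lam * sr * (1.0 - sdc)           # safe undetected
    lam_dd = lam * (1.0 - sr) * ddc           # dangerous detected
    lam_du = lam * (1.0 - sr) * (1.0 - ddc)   # dangerous undetected
    sff = (lam_sd + lam_su + lam_dd) / lam    # safe failure fraction
    return lam_sd, lam_su, lam_dd, lam_du, sff

# Basic data from the example above
lam_sd, lam_su, lam_dd, lam_du, sff = split_failure_rate(5.5e-6, 0.80, 0.90, 0.90)
print(lam_sd, lam_su, lam_dd, lam_du, sff)
```

Only the dangerous undetected portion, here the smallest of the four, drives the probability of failure on demand, which is why the split matters so much.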
These are the kinds of calculations that companies make when they do not have their own reliability data. In reality one does not get the information in this form, as the maintenance department collects information on failure mode level. If you have the failure rate information on failure mode level, it is possible to calculate the overall failure rate, the safe ratio, and the safe and dangerous diagnostic coverage factors. More and more device suppliers are providing end-users with this kind of detailed product information, though. Consider the functional safety data sheet© in Figure 7, where this basic failure rate information was also used to calculate factors like the safe failure fraction, the MTTFsafe, the MTTFdangerous, etc.
The only reliability data still missing in order to make the model complete is repair data and proof test data. Many product suppliers make statements about how long it takes to repair their transmitter or valve, and often this is assumed to be 8 or 24 hours. In practice, only the end-user knows how long it will take to repair a particular device. It depends on many different factors. For example, is the failed device in stock or not? If we do not have it in stock, how long does it take to order it and have it shipped to the desired location? If we do have it in stock, how long does it actually take to replace it? Do we have only one repair crew or multiple repair crews available? In our model we can assume that a repair takes only 8 hours, but if in reality it takes 30 days, then our calculation results are not worth much. The closer we can make our model to practice, the more useful reliability engineering will be.
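The effect of the real repair time can be estimated with a rough steady-state unavailability formula. The failure rates below reuse the worked example from earlier in this paper; the repair times contrast the commonly assumed 8-hour figure with a 30-day reality:

```python
def mean_unavailability(lam_dd, lam_du, mttr_hours, proof_test_interval):
    """Rough single-device unavailability: detected failures are down for the
    repair time; undetected failures go unnoticed, on average, for half the
    proof test interval, plus the repair time once found."""
    return lam_dd * mttr_hours + lam_du * (proof_test_interval / 2.0 + mttr_hours)

lam_dd, lam_du = 0.99e-6, 0.11e-6   # /h, from the earlier example
ti = 8760.0                          # yearly proof test, hours
u_optimistic = mean_unavailability(lam_dd, lam_du, 8.0, ti)
u_realistic = mean_unavailability(lam_dd, lam_du, 720.0, ti)
print(f"8 h repair:  {u_optimistic:.2e}")
print(f"30 d repair: {u_realistic:.2e}")
```

With these numbers the realistic repair time more than doubles the computed unavailability, illustrating how an optimistic repair assumption quietly invalidates the whole calculation.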
1. The component failure that happens in a system “X” shall be isolated and removed. If the failure has caused a plant shutdown, then a new certified component shall be installed to manage and speed up plant start-up activities. The new component certificate shall be issued by the OEM (Original Equipment Manufacturer).
2. The failed component shall be clearly tagged. The maintenance engineer shall record the component’s ID, its function, the failure description, and the physical and environmental state. A sample form is shown in Figure 8. Then, it shall be
Engineer ______________________________________________
Equipment ID ______________________________________________
Fault Description ______________________________________________
OEM Address: Name ______________ Email ______________ Fax / Tel ______________
for all jobs, respecting cost and time constraints. An RCM program alone cannot be used to manage
plant turnaround jobs, but integrating the human role into this process facilitates it.
There are some further questions that need to be asked, and to make the typical questions concrete
we will use a control valve example to illustrate the approach.
For preventive maintenance (PM) of control valves, we can ask many questions, such as:
What is the PM frequency for critical valves?
Is this a critical control valve? Is it an emergency shutdown valve? Is it a high-pressure
service valve? Is the valve normally fully open, closed, or regulating?
How many items in the control valve require checking? (A control valve can have many
accessories, such as a solenoid valve, positioner, regulator set, booster set, I/P unit, and the
valve internals.)
What tests are required as part of the PM process? (A leak test based on the valve leakage
class, and a hydro test based on the line pressure rating.)
Is there a certificate check for major accessory items such as the solenoid valve, positioner,
and I/P?
Are there internal components to be replaced, such as the plug, seat, or valve soft kit?
Are there external components that shall be replaced (such as the solenoid valve, I/P,
pneumatic set, diaphragm, etc.) based on plant standards or applied practices?
Is there a certificate to be issued for each small component whose malfunction can
jeopardize the valve?
Is there a bypass line on the valve that allows maintenance to perform PM on the valve while
the plant is running?
As one can see, after establishing answers to these and other questions, we can take a closer
look at real reliability models and see how, through an audit and validation process, the RCM models
can be enhanced in the real world.
• Safe undetected
• Dangerous detected
• Dangerous undetected
The failure rates for each of these possible failure modes are calculated with the values from Table
1.
Table 1 – Reliability data sensor
Parameter Value
Overall failure rate of the sensor 8.6E-6 / h
Table 2 shows data related to the process. The mission time is the time the pressure sensors are
operated. The periodic test interval is the time between periodic proof tests. The periodic test
coverage represents the percentage of failures that can be detected. The demand rate represents the
number of demands that come from the process. A demand means that the safety function needs to
be carried out and thus needs to be available (the pressure sensor needs to work in order to carry out
the safety function).
The financial data from Table 3 is used to estimate the costs associated with three different sensor
architectures.
Table 3 – Financial data
Parameter Value
Cost per sensor* $ 5,000.00 / sensor
Cost associated with a spurious trip of the plant** $ 1,000,000.00 / trip
Cost associated with an accident** $ 15,000,000.00 / accident
* These costs include all costs of the sensor, including installation, repair costs, etc.
** These costs include all costs associated with the event (repair, production loss, etc.).
The three different architectures all have the same possible operating modes or failure scenarios.
These scenarios are:
Operational – the sensor subsystem has no failures that affect the measurement of the
sensor
Trip – the sensor subsystem has failed in such a way that the associated logic solver (DCS or
safety PLC) can only decide to trip the process
Dangerous – the sensor subsystem has failed in such a way that the associated logic solver
(DCS or safety PLC) cannot take any action when demanded by the process.
For each of these scenarios the probability of occurrence is calculated using the Markov modeling
technique [12]. For all architectures Markov models are created that allow us to calculate the
probabilities associated with these scenarios. Once the probabilities for each scenario are known, we
can calculate the cost associated with each scenario. The expected cost over the mission time for
each sensor subsystem is then the total cost over all scenarios.
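The expected-cost combination described above can be sketched as follows. The scenario probabilities in the demo are placeholders standing in for the Markov model output; the costs are those of Table 3.

```python
COST_TRIP = 1_000_000.0       # $/spurious trip (Table 3)
COST_ACCIDENT = 15_000_000.0  # $/accident (Table 3)

def weighted_scenario_cost(p_trip: float, p_dangerous: float,
                           demands: float) -> float:
    """Probability-weighted cost over the mission time.

    p_trip      -- probability the subsystem caused a plant trip
    p_dangerous -- probability the subsystem is failed dangerously
    demands     -- expected number of process demands over the mission time
    """
    # A dangerous failure only turns into an accident when a demand arrives.
    return p_trip * COST_TRIP + p_dangerous * demands * COST_ACCIDENT

# Placeholder probabilities for a hypothetical subsystem, with 6 demands
# per year over a 10-year mission (60 demands):
print(f"$ {weighted_scenario_cost(0.05, 0.001, 60):,.2f}")
```

Because each dangerous-state probability is multiplied by the number of demands and by the $15M accident cost, even small dangerous probabilities dominate the weighted cost, as the results below confirm.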
Based on these assumptions the results are presented in Figure 9 and Figure 10. Figure 9 shows the
results over a mission time of 10 years without performing a periodic proof test. Figure 10 shows the
same model but with a periodic proof test performed every 6 months. The results are based on
the weighted scenario cost. For each sensor subsystem we calculate the probability that the sensor is
either:
Operational;
Caused a plant trip;
Failed dangerously.
As the subsystem must be in one of these three states at all times, the total probability adds up
to 1. Please note that the results only apply to these assumptions, as they were applicable to this
particular customer. In this case the results clearly favor the 2oo3 sensor architecture. The pressure
sensor system clearly benefits from a scheduled periodic proof test every 6 months. The probability-
weighted scenario costs for the three architectures are in this case:
1oo1 subsystem: $1,185,421.60;
1oo2 subsystem: $103,572.38;
2oo3 subsystem: $60,792.63.
In all three cases the dangerous scenarios contribute the most to the overall weighted cost. This is
due to the 6 demands per year and the high cost associated with a possible accident. The periodic
proof test improves the system significantly, see Figures 9 and 10. The overall improvement
achieved by periodic proof testing is shown in Table 4.
Table 4 – Improvement achieved by the periodic proof test
Architecture Without Proof Test With Proof Test Improvement
1oo1 $21.0 Mil $1.2 Mil 17.5
7 Conclusions
This paper has addressed plant safety and availability through the eyes of reliability engineering and
reliability data collection. The paper explained what reliability engineering is, how reliability models
can be made, and what kind of data needs to be collected. It demonstrated through practical examples
how reliability data can be collected, what problems may arise, and how plants can benefit from good
reliability data.
8 References
1. Bentley, J.P., An Introduction to Reliability & Quality Engineering. John Wiley & Sons,
ISBN 0-582-08970-0, 1993
2. IEC, Functional safety for electrical / electronic / programmable electronic safety-related
systems. IEC 61508, IEC, Geneva, 1999
3. IEC, Functional safety: safety instrumented systems for the process industry sector.
IEC 61511, IEC, Geneva, 2003
4. Condition based maintenance
5. Moubray, J., Reliability Centered Maintenance, 2nd Edition, ISBN 0831130784, April
1997
6. Vande Capelle, T., Houtermans, M.J.M., Functional Safety for End-Users and System
Integrators
7. Rouvroye, J.L., et al., Uncertainty in Safety, New Techniques for the Assessment and
Optimisation of Safety in Process Industry. American Society of Mechanical
Engineers, 1994
8. Det Norske Veritas, OREDA, Offshore Reliability Data, 2nd Edition, ISBN 82-515-0188-1,
1992
9. SINTEF, Reliability data for safety instrumented systems. PDS Data Handbook, 2004
Edition. SINTEF, September 2004
10. Velten-Philipp, W., Houtermans, M.J.M., The effect of diagnostic and proof testing on
safety related systems. Control 2006, Glasgow, Scotland, 30 August – 1 September
2006
11. Houtermans, M.J.M., IEC 61508: An Introduction to the Safety Standard for End-Users.
SISIS 2004, Buenos Aires, Argentina, September 2004
12. Billinton, R., Allan, R.N., Reliability Evaluation of Engineering Systems, Concepts and
Techniques. Pitman Books Limited, London, 1983
13. Al-Ghumgham, M.A., “On a Neural Network-Based Fault Detection Algorithm”; Chapter 4 of
Master's Thesis, Control Systems Engineering, KFUPM, 1992
14. Al-Ghumgham, M.A., Hermoso, A., Humaidi, M.A., “Safety and Reliability: Two Faces of a
Coin for an Ammonia Plant ESD System”; ISA EXPO 2005, Chicago, USA
White paper
04
RISKNOWLOGY B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com
This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.
Risknowlogy, the Risknowlogy logo, functional safety data sheet, and spurious trip level are registered service marks.
1 Introduction
The functional safety industry is driven by the international standards IEC 61508 [1] and IEC 61511
[2]. These standards describe performance levels for safety functions and the devices and systems
that carry out these safety functions. This performance is expressed as the so called safety integrity
level (SIL). In practice there are four levels, SIL 1-4. The required SIL level is directly derived from the
process which needs to be protected with a safety function of certain safety integrity. The more
dangerous the process the more safety integrity is required for the safety function.
The SIL level is a measure of the qualitative and quantitative performance of the safety function.
The higher the SIL level, the more difficult it is for a product supplier to design and manufacture the
safety device, and the more difficult it is for end-users and system integrators to integrate safety
devices from different manufacturers into a complete safety system. The higher the SIL level, the more
safety has been, or needs to be, built into the devices and systems.
The quantitative part of the SIL level is expressed as the probability of failure on demand. This
means that we need to calculate the probability that the safety function cannot be carried out in case
of a demand from the process. In other words, how likely is it that the safety function does not work
when we require it to work? The higher the SIL level, the more likely it is that the safety function works.
Besides demand mode functions, the IEC 61511 standard also refers to continuous mode
functions. In contrast to demand mode functions, these safety functions have a direct impact
on the process when an internal failure occurs. Therefore continuous mode functions are
calculated per hour and not per demand, see Figure 1.
¹ Corresponding author: m.j.m.houtermans@risknowlogy.com
Level | Probability range
X | ≥10^-(x+1) to <10^-x
… | …
5 | ≥10^-6 to <10^-5
4 | ≥10^-5 to <10^-4
3 | ≥10^-4 to <10^-3
2 | ≥10^-3 to <10^-2
1 | ≥10^-2 to <10^-1
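The band structure in this table can be expressed as a small lookup. The sketch below covers the demand-mode SIL bands 1–4 of IEC 61508; the function name is our own.

```python
def sil_from_pfd(pfd_avg: float):
    """Map an average PFD to a demand-mode SIL band.

    SIL x corresponds to 10**-(x+1) <= PFDavg < 10**-x.
    Returns None when the PFD is too high even for SIL 1.
    """
    for sil in (4, 3, 2, 1):
        if 10.0 ** -(sil + 1) <= pfd_avg < 10.0 ** -sil:
            return sil
    return None

print(sil_from_pfd(5e-4))  # a PFDavg of 5e-4 falls in the SIL 3 band
```

The explicit range check in the loop also handles the band boundaries correctly (a PFDavg of exactly 10^-3 falls in the SIL 2 band, not SIL 3).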
Subsystem overview:
Hardware | Mechanical | Electronics | Electronics
Type | A | A | B
Software | SIL 3
Table 2 – Overview of the possible architectures and their achievable SIL level
Attribute | 1oo1 | 1oo2 | 2oo3
Hardware fault tolerance | 0 | 1 | 1
Fit for use in SIL | 2 | 3 | 3
Table 3 gives an overview of the PFD, the PFS and the achieved SIL and STL levels of the LNG level
sensors in the different architectures. This table is particularly useful for end-users and system
integrators, as it demonstrates how much of the overall SIL budget the level sensor consumes. For
example, in a SIL 2 safety loop the level sensor takes only 0.18% of the total PFD budget of
SIL 2. For SIL 3 the level sensor takes even less: 0.004% and 0.033% respectively for the
1oo2 and 2oo3 configurations. Even when the safety loop is calculated over a period of 10 years, the
level sensor consumes only very little of the overall required PFD value. The PFS values are also
calculated for the different architectures of the level sensor. The best STL level is achieved by the
2oo3 sensor architecture.
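The allocation percentages quoted above amount to dividing a subsystem's PFDavg by the upper bound of the SIL band. A sketch, where the sensor PFDavg of 1.8e-5 is an illustrative value chosen to reproduce the 0.18 % figure:

```python
# Upper PFDavg bound of each demand-mode SIL band (IEC 61508).
SIL_BUDGET = {1: 1e-1, 2: 1e-2, 3: 1e-3, 4: 1e-4}

def budget_fraction_pct(pfd_avg_subsystem: float, sil: int) -> float:
    """Percentage of a SIL band's PFD budget consumed by one subsystem."""
    return 100.0 * pfd_avg_subsystem / SIL_BUDGET[sil]

# e.g. an illustrative level sensor with PFDavg = 1.8e-5 in a SIL 2 loop:
print(f"{budget_fraction_pct(1.8e-5, 2):.2f} %")  # 0.18 %
```

A small sensor fraction leaves most of the loop's PFD budget for the logic solver and, above all, the final elements, which typically dominate the loop PFD.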
[Table 3 – PFD, PFS, SIL and STL results per attribute for the 1oo1, 1oo2 and 2oo3 architectures]
Figure 3 shows how the probability of failure on demand develops over time for all three architectures.
A graphical representation like this can be used by an end-user to determine periodic proof test
intervals. This can only be done, though, if the logic solver and actuating part are also included in the
calculation. The 1oo1 architecture clearly performs the worst of the three architectures. The reason
that the 1oo2 architecture performs better than the 2oo3 architecture is that the 2oo3 architecture
has more possibilities to fail.
Figure 3 – Probability of Failure on Demand for 1oo1, 1oo2, and 2oo3 architectures.
Figure 4 – Safety availability calculations for 1oo1, 1oo2, and 2oo3 architectures
Figure 5 – Process availability calculations for 1oo1, 1oo2, and 2oo3 architectures
[Diagram: three redundant sensor channels (TT, TR) connected to a common logic solver driving solenoid valves (SOV) and flow control valves (FCV)]
Figure 6 – Architecture Safety Instrumented System
The following component reliability data has been used for the components listed in Figure 6.
# | Model | OFR [/h] | SF [%] | DDC [%] | SDC [%] | A | SFF | Type
Based on the reliability data and the Markov model the results are presented in Table 5. An overview
of the development of the PFD, PFDavg and Safety Availability is given respectively in Figure 7,
Figure 8, and Figure 9:
Table 5 – Results analysis
Parameter Value
PFD 3.916168e-003
PFDavg 1.860371e-003
PFS 2.962833e-003
Figure 8 – PFS
References
1. IEC 61508, Functional safety for electrical, electronic, programmable electronic safety related
systems. Geneva, Switzerland, 1999
2. IEC 61511, Functional safety – Safety instrumented systems for the process industry sector.
Geneva, Switzerland, 2003
3. L. Monfilliette, P. Versluys, M.J.M. Houtermans, Certified Level Sensor For The Liquefied
Natural Gas Industry, TÜV Symposium, Cologne, Germany, May 2006
White paper
05
Address:
HIMA Paul Hildebrandt GmbH + Co KG
Albert-Bassermann-Str. 28
D-68782 Brühl near Mannheim
Tel. +49-6202 709 270
E-Mail: j.boercsoek@hima.com
Keywords
IEC/EN 61508, ISA-TR84.0.02, normal failure, common cause failures,
1oo1-system, safety related 1oo2-system, safety related 2oo3-system,
safety integrity levels (SIL), SIL-requirement, probability of failure on
demand (PFD), probability of failure per hour (PFH), safe failure fraction
(SFF), type A subsystem, type B subsystem, hardware fault tolerance,
diagnostic coverage factor (DC), proof-test interval, loop calculation
Abstract
Safety systems are used in a wide range of technical applications. Besides the availability
of such systems, the safety figures, e.g. PFD and PFH, must be observed. The calculation
of these figures in particular requires the use of standards, and standards for this calculation
are available worldwide. The newest, and internationally accepted, standard is IEC 61508.
Another standard, which has been in use for years, is ISA-TR84.0.02. With this standard a
safety calculation can be performed without using MTTR and common cause failures. Since
the introduction of IEC 61508, much discussion concerning the PFD figures has appeared in
the industry. The reason for this discussion is the way these figures are calculated. This
contribution compares both calculation methods.
Introduction
In the process industry the use of safety related controllers and systems
is increasing due to regulatory measures. For the validation of applications
of those systems, specific failure rate figures are used. VDE 0801,
parts 1 to 7, "Functional safety, Safety related systems", was until recently
the state of the art for national and international standards (also known as
IEC 65A/179/CDV, Draft IEC 1508). It describes the procedures and the
calculations of complex electronics and microcomputers for safety related
applications. With the introduction of IEC/EN 61508, a common national
and international standard was created that describes and specifies
generic safety related systems.
Today various publications present different ways of calculating the PFD
and availability figures. Some of them are based on ISA-TR84.0.02 (1998)
and the equations described therein.
• safe detectable
• safe undetectable.
Safe failures are failures which have no effect on the safety function of
the system, whether detected or undetected.
For dangerous failures this is not the case. When they occur, these
failures lead to dangerous situations in the application, which under
certain circumstances can pose a massive risk to human life. These
failures are likewise differentiated into
• dangerous detectable
• dangerous undetectable.
When the safety related system is designed properly, the system reaches
the safe state upon detectable dangerous failures. In these cases the safety
related system is able to bring the complete system or the plant into the
safe state.
[Diagrams: redundant safety related systems built from a sensor, inputs, safety related CPUs, outputs, and an actuator / final element; the 1oo2 system uses two channels (cpu 1, cpu 2), the 2oo3 system three channels (cpu 1, cpu 2, cpu 3) with cross-connected outputs]
Figure 2: Safety related 2oo3-system
Table 1: SIL for systems operating in low and high demand or continuous mode
of operation according to IEC/EN 61508
In principle it can be derived from the tables that the probabilities of
failure are specified in the same ranges.
Advanced considerations of PFD-values according to IEC/EN 61508
Part 2 of this standard specifies the hardware requirements. It further
defines the safety life cycle of the hardware, the architectural constraints
for type A subsystems (whose behavior in the case of a fault is well
known) and type B subsystems (whose behavior in the case of a fault is
not completely known), and finally the required safe failure fraction (SFF).
Safe failure fraction | Type A, hardware fault tolerance | | | Type B, hardware fault tolerance | |
 | 0 faults | 1 fault | 2 faults | 0 faults | 1 fault | 2 faults
< 60 % | SIL 1 | SIL 2 | SIL 3 | Not allowed | SIL 1 | SIL 2
60 % – < 90 % | SIL 2 | SIL 3 | SIL 4 | SIL 1 | SIL 2 | SIL 3
90 % – < 99 % | SIL 3 | SIL 4 | SIL 4 | SIL 2 | SIL 3 | SIL 4
> 99 % | SIL 3 | SIL 4 | SIL 4 | SIL 3 | SIL 4 | SIL 4
Equation to quantify a 1oo1-system:

$PFD_{G,1oo1} = (\lambda_{DU} + \lambda_{DD}) \cdot t_{CE} = \lambda_D \cdot t_{CE} = \lambda_{DU} \cdot \left(\frac{T_1}{2} + MTTR\right) + \lambda_{DD} \cdot MTTR$   (1)

with

$t_{CE} = \frac{\lambda_{DU}}{\lambda_D} \cdot \left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D} \cdot MTTR$   (2)

$t_{GE} = \frac{\lambda_{DU}}{\lambda_D} \cdot \left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D} \cdot MTTR$   (5)

Equation to quantify a 2oo3-system, with

$t_{CE} = \frac{\lambda_{DU}}{\lambda_D} \cdot \left(\frac{T_1}{2} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D} \cdot MTTR$   (7)

$t_{GE} = \frac{\lambda_{DU}}{\lambda_D} \cdot \left(\frac{T_1}{3} + MTTR\right) + \frac{\lambda_{DD}}{\lambda_D} \cdot MTTR$   (8)
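The down-time expressions above translate directly into code. The sketch below also adds the 1oo2 and 2oo3 voted-group PFD formulas, which follow the simplified equations of IEC 61508-6; the demo values used in the test are illustrative.

```python
# IEC 61508-6 simplified PFD equations for 1oo1, 1oo2 and 2oo3 groups.
# t1 is the proof-test interval T1 [h], mttr the mean time to restoration [h],
# beta / beta_d are the common cause factors.

def t_ce(l_du, l_dd, t1, mttr):
    """Channel equivalent mean down time, eq. (2)."""
    l_d = l_du + l_dd
    return (l_du / l_d) * (t1 / 2 + mttr) + (l_dd / l_d) * mttr

def t_ge(l_du, l_dd, t1, mttr):
    """Voted group equivalent mean down time, eq. (5)."""
    l_d = l_du + l_dd
    return (l_du / l_d) * (t1 / 3 + mttr) + (l_dd / l_d) * mttr

def pfd_1oo1(l_du, l_dd, t1, mttr):
    """Eq. (1): PFD_G = lambda_D * t_CE."""
    return (l_du + l_dd) * t_ce(l_du, l_dd, t1, mttr)

def pfd_1oo2(l_du, l_dd, t1, mttr, beta, beta_d):
    """1oo2 voted group, per IEC 61508-6."""
    lam = (1 - beta_d) * l_dd + (1 - beta) * l_du
    return (2 * lam ** 2 * t_ce(l_du, l_dd, t1, mttr) * t_ge(l_du, l_dd, t1, mttr)
            + beta_d * l_dd * mttr + beta * l_du * (t1 / 2 + mttr))

def pfd_2oo3(l_du, l_dd, t1, mttr, beta, beta_d):
    """2oo3 voted group, per IEC 61508-6 (factor 6 instead of 2)."""
    lam = (1 - beta_d) * l_dd + (1 - beta) * l_du
    return (6 * lam ** 2 * t_ce(l_du, l_dd, t1, mttr) * t_ge(l_du, l_dd, t1, mttr)
            + beta_d * l_dd * mttr + beta * l_du * (t1 / 2 + mttr))
```

For example, with λDU = λDD = 5·10⁻⁷ /h, T1 = 1 year and MTTR = 8 h, the 1oo1 group yields a PFD of about 2.2·10⁻³, while the redundant groups land far below that, limited mainly by the common cause terms.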
Two further important indicators for safety related systems are the safe
failure fraction (SFF) and the diagnostic coverage factor (DC). The SFF
is calculated by the equation:

$SFF = \frac{\lambda_S + \lambda_{DD}}{\lambda_S + \lambda_{DD} + \lambda_{DU}}$   (9)

The DC factor can be determined by the equation:

$DC = \frac{\sum \lambda_{DD}}{\sum \lambda_D}$   (10)
The SFF represents the ratio of non-safety-critical failures, and the
DC factor describes the fraction of dangerous failures which are detected
by automatic diagnostic tests. The individual factors in these
equations have the following meaning:
βD — of those failures that are detected by the diagnostic tests, the fraction that have a common cause
λD — dangerous failure rate (per hour) of a channel in a subsystem, equal to 0,5 λ (assuming 50 % dangerous failures and 50 % safe failures)
λDD — detected dangerous failure rate (per hour) of a channel in a subsystem (the sum of all the detected dangerous failure rates within the channel of the subsystem)
λDU — undetected dangerous failure rate (per hour) of a channel in a subsystem (the sum of all the undetected dangerous failure rates within the channel of the subsystem)
MTTR — mean time to restoration (hour)
PFDG — average probability of failure on demand for the group of voted channels
T1 — proof-test interval (h)
tCE — channel equivalent mean down time (hour) for 1oo1, 1oo2, 2oo2 and 2oo3 architectures (the combined down time for all the components in the channel of the subsystem)
tGE — voted group equivalent mean down time (hour) for 1oo2 and 2oo3 architectures (the combined down time for all the channels in the voted group)
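Equations (9) and (10) translate directly into code; a minimal sketch with illustrative failure rates:

```python
def sff(l_s: float, l_dd: float, l_du: float) -> float:
    """Safe failure fraction, eq. (9)."""
    return (l_s + l_dd) / (l_s + l_dd + l_du)

def dc(l_dd: float, l_du: float) -> float:
    """Diagnostic coverage of dangerous failures, eq. (10)."""
    return l_dd / (l_dd + l_du)

# Illustrative rates [1/h]: 5e-6 safe, 4e-6 dangerous detected,
# 1e-6 dangerous undetected.
print(f"SFF = {sff(5e-6, 4e-6, 1e-6):.0%}, DC = {dc(4e-6, 1e-6):.0%}")
```

Note that both indicators improve when diagnostics convert dangerous undetected failures into dangerous detected ones, which is exactly the design lever the DC factor measures.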
In the list above the term common cause factor is introduced. The β-factor
is the ratio of the probability of failures with a common cause to
the probability of random dangerous failures. The following example
illustrates this:
βD = 1 %
β = 2 %
T1 = 3 years
MTTR = 8 hours
1oo1-system

$PFD_{avg} = \lambda_{DU} \cdot \frac{TI}{2}$

1oo2-system

$PFD_{avg} = \left[(\lambda_{DU})^2 \cdot \frac{TI^2}{3}\right] + \left[\lambda_{DU} \cdot \lambda_{DD} \cdot MTTR \cdot TI\right] + \left[\beta \cdot \lambda_{DU} \cdot \frac{TI}{2}\right]$

$PFD_{avg} = (\lambda_{DU})^2 \cdot \frac{TI^2}{3}$   (simplified)

2oo3-system

$PFD_{avg} = \left[(\lambda_{DU})^2 \cdot TI^2\right] + 3\left[\lambda_{DU} \cdot \lambda_{DD} \cdot MTTR \cdot TI\right] + \left[\beta \cdot \lambda_{DU} \cdot \frac{TI}{2}\right]$

$PFD_{avg} = (\lambda_{DU})^2 \cdot TI^2$   (simplified)

The factors in this configuration have the same meaning as for the 1oo2-system.
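The difference between the simplified form and the form that retains the MTTR and common cause terms can be checked numerically for the 1oo2 case; the rates below are illustrative:

```python
def pfd_1oo2_simplified(l_du, ti):
    """Simplified form: (lambda_DU)^2 * TI^2 / 3."""
    return l_du ** 2 * ti ** 2 / 3

def pfd_1oo2_full(l_du, l_dd, beta, ti, mttr):
    """1oo2 with the MTTR and common cause terms retained."""
    return (l_du ** 2 * ti ** 2 / 3
            + l_du * l_dd * mttr * ti
            + beta * l_du * ti / 2)

# Illustrative: lambda_DU = lambda_DD = 5e-7 /h, TI = 3 years, beta = 2 %
TI = 3 * 8760.0
simple = pfd_1oo2_simplified(5e-7, TI)
full = pfd_1oo2_full(5e-7, 5e-7, 0.02, TI, 8.0)
print(f"simplified: {simple:.2e}  full: {full:.2e}")
```

With these numbers the β · λDU · TI/2 common cause term dominates, so dropping it, as the simplified calculation does, makes the redundant architecture look considerably better than it is.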
[Figures: PFD versus proof-test interval T1 / TI (1 to 10 years) on logarithmic PFD axes for the configurations above. Legend: according to IEC 61508, with MTTR and common-cause-failure]
[Figure: PFD versus proof-test interval T1 / TI (1 to 9 years), logarithmic PFD axis]
Comparing the ISA and the IEC graphs under consideration of MTTR
and common-cause-failure with DC = 99 %, see figure 7, both graphs lie
in the same order of magnitude. When the DC-factor is increased to DC =
99,99 %, see figure 8, the two graphs deviate from each other by two orders
of magnitude at low T1 / TI. The main reason for this deviation is that
the IEC considers the portion of λDD failures during the repair
time MTTR caused by common-cause-failures, through the term
$\beta_D \cdot \lambda_{DD} \cdot MTTR$.
[Figure: PFD versus proof-test interval T1 / TI (1 to 10 years), logarithmic PFD axis. Legend: according to IEC 61508, with MTTR and common-cause-failure]
Summary
This short comparison demonstrates the difficulty of directly comparing PFD-values that
are generated by different procedures. In fact, using both standards, the quantitative values
lie in the same ranges as long as the simplified calculations of the ISA-standard are not
applied, although the parameters leading to these calculations differ. For example, the ISA
standard contains no definitions of the SFF- and DC-factors. A further difference is the missing
differentiation between type A and type B subsystems, which has a remarkable influence on
the structure and on the integrity level of the system. There is also no differentiation in
the ISA standard between β and βD; only the more favorable factor for the failure rate λDU is
considered. Furthermore, the portion of λDD failures during the repair time caused by
common-cause-failures is not considered in the ISA-standard.
In contrast, IEC 61508 considers all these factors. This consideration increases the
demands on safety measures in hardware and software in the system. A system designed this
way is altogether better qualified for a safety related application.
In summary it can be stated that, because IEC 61508 has a universal application approach
and does not only apply to the pure safety calculation of systems, this new standard for
functional safety will open up a wide spectrum of applications. The IEC-standard pursues the
goal of creating a generic standard for safety related applications and, in the author's opinion,
succeeds.
For the certification of complex systems already in use it is necessary to prove conformity
to the standard. Both standards, as all standards do, tolerate certain latitude at the different
integrity levels. It should be noted that, for example, a system certification according to SIL 3
does not necessarily represent a decision criterion for a complete system. The described
system fulfills the requirements of the safety integrity level, but it is necessary to keep in mind
the prerequisites written down in the so-called certification reports. Generally, the limitations
of the certified system or plant are to be found in this document.
Literature
[1] IEC/EN 61508: International Standard 61508 Functional Safety: Safety-Related
System. Geneva, International Electrotechnical Commission
[2] ISA-TR84.0.02 (1998) Technical Report; Safety Instrumented Systems (SIS) –
Safety Integrity Level (SIL) Evaluation Techniques. Instrument Society of America
[3] Börcsök, J.: Internationale-/Europa Norm 61508, Vortrag bei der VD-Tagung der
HIMA GmbH + Co KG, 2002
[4] Börcsök, J.: Konzepte zur methodischen Untersuchung von Hardwarearchitekturen
in sicherheitsgerichteten Anwendungen, 2002
[5] Börcsök, J.: Sicherheits-Rechnerarchitekturen Teil 1 und 2, Vorlesung Universität
Kassel 2000/2001
[6] Börcsök, J.: Echtzeit-Betriebssysteme für sicherheitsgerichtete Realzeitrechner,
Vorlesung Universität Kassel 2001/2002
[7] VDE 0801 part 1 to 7: Functional safety, Safety related systems, IEC
65A/179/CDV, Draft IEC1508, part 6, p. 26f, August 1998.
[8] DIN V 19250: Grundlegende Sicherheitsbetrachtungen für MSR- Schutzeinrichtun-
gen. Beuth Verlag Berlin 1998
[9] DIN VDE 0801/A1: Grundsätze für Rechner in Systemen mit Sicherheitsaufgaben.
Beuth Verlag
[10] IEC 60880-2: Software für Rechner mit sicherheitskritischer Bedeutung. 12/2001
White paper
06
Abstract
The purpose of this paper is to show the influence of online diagnostic and periodic proof testing on
the performance of safety functions in terms of the PFD. For three different architectures the influence
of the diagnostic coverage, the proof test coverage, and the proof test interval on the PFD are
determined. Performance indicators are used to express this influence and show the effect.
1 Introduction
Safety systems carry out one or more safety functions. Each safety function consists of a sensing
element, a logic solving element and an actuation element. Typical sensing elements are for example
sensors, switches or emergency push buttons. The logic solving element is usually a general-purpose
safety computer which can carry out several safety functions at once. Valves, pumps or alarms are
typical actuating elements of a safety function. The performance of these safety functions is
determined by several design parameters of the individual components of the safety functions. In
(Houtermans, 1999) the following design parameters of the safety function are identified:
Architecture
Hardware failures
Software failures
Systematic failures
Common cause failures
Online diagnostics, and
Periodic test intervals.
The performance of a safety function can be expressed as the probability of failure on demand
(PFD) and the probability of failing safe, or spurious trip (PFS). Both attributes are important in the
safety world, as their values represent respectively a measure of the level of safety achieved and of the
financial loss caused by the safety system through spurious trips. The PFD value is one requirement to
meet the safety integrity level of the IEC 61508 standard (IEC, 1999). For the PFS value there are
currently no requirements in the international safety world, although end-users of safety systems
require as low a PFS value as possible.
The purpose of this paper is to show the effect that the above mentioned design parameters,
namely online diagnostics and periodic proof testing, have on the performance of the safety function in
terms of the PFD. For three different architectures the influence of the diagnostic coverage, the proof
test coverage, and the proof test interval on the PFD is determined. Performance indicators are
used to express this influence and show the effect.
¹ Corresponding author: m.j.m.houtermans@risknowlogy.com
In other words, both diagnostic tests and periodic proof tests try to detect failures inside the
safety system. At first sight there seems to be no difference, yet the actual difference is quite an
important one. Note 3 of paragraph 7.4.3.2.2 of IEC 61508-2 explains that a test is a diagnostic test if
the test interval is at least an order of magnitude shorter than the expected demand interval. Based on
this extra information we can conclude that, in theory, a test to detect failures is called a diagnostic
test if the test is carried out automatically and at least an order of magnitude more often than the
expected demand rate. In all other cases we refer to the test as a proof test.
The difference between a proof test and a diagnostic test is also important in the case of a safety
system with a single-channel architecture, i.e., hardware fault tolerance zero. In this case a proof test is
only sufficient if the safety function operates in low demand mode. In the case of a high demand safety
function, diagnostic tests are required that are able to detect dangerous faults within the process
safety time (see IEC 61508-2, chapter 7.4.3.2.5). In other words, you cannot build a single-channel
safety system that depends only on proof tests.
In practice we define a test as a diagnostic test if it fulfils the following three criteria:
1. It is carried out automatically (without human interaction) and frequently (related to the
process safety time considering the hardware fault tolerance) by the system software and/or
hardware;
2. The test is used to find failures that can prevent the safety function from being available; and
3. The system automatically acts upon the results of the test.
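The three criteria above can be encoded as a simple predicate. Everything in this sketch (the class, field names, and frequency threshold) is an illustrative assumption, not terminology from the standard:

```python
from dataclasses import dataclass

@dataclass
class SafetyCheck:
    automatic: bool         # criterion 1: runs without human interaction
    interval_h: float       # how often the check is executed [h]
    detects_failures: bool  # criterion 2: finds failures threatening the function
    acts_on_result: bool    # criterion 3: the system reacts automatically

def is_diagnostic(check: SafetyCheck, process_safety_time_h: float) -> bool:
    """True if the check qualifies as a diagnostic test under the three
    criteria; otherwise treat it as a proof test."""
    frequent = check.interval_h <= process_safety_time_h
    return (check.automatic and frequent
            and check.detects_failures and check.acts_on_result)

cpu_self_test = SafetyCheck(True, 0.001, True, True)          # runs every few seconds
valve_stroke = SafetyCheck(False, 3 * 30 * 24.0, True, True)  # quarterly, manual
print(is_diagnostic(cpu_self_test, 1.0), is_diagnostic(valve_stroke, 1.0))
```

The quarterly valve stroke test fails on both the automation and the frequency criterion, which is consistent with the classification of stroke testing as a proof test later in this section.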
² The term diagnostic test is not defined and does not exist in part 4 of the standard.
A typical example of a diagnostic test is a CPU test, a memory check, or program flow control
monitoring. A proof test, on the other hand, is a test which is carried out manually and where it is often
necessary to have external hardware and/or software to determine the result of the test. The
interval of the proof test is much longer than the process safety time and magnitudes bigger than
the period chosen for diagnostic tests. A typical example of a proof test is full or partial valve stroke
testing [McCrea-Steele, 2006]. Partial valve stroke testing is seldom carried out without some form of
human interaction (in other words, we depend on the human to carry out the test, to determine
the actual results of the test, and/or to take the appropriate action based on the results) and often
needs additional equipment to be carried out.
The advantage of a diagnostic test over a proof test is that failures can be detected very quickly. If
a proof test is carried out once every three months, then the safety function may be running with a
failure for up to three months before we find out about it. With diagnostics we often know
about the problem within milliseconds and can thus repair the failure very quickly. On the other hand,
good diagnostics require a more complicated design and additional hardware and software
built into the system. This additional hardware and software is often difficult to build, and costs extra.
There is another important reason to make a distinction between diagnostic tests and periodic proof
tests. IEC 61508 requires the calculation of the safe failure fraction for subsystems. The safe failure
fraction of a subsystem is defined as the ratio of the average rate of safe failures plus dangerous
detected failures of the subsystem to the total average failure rate of the subsystem, see formula
below
SFF = (λSD + λSU + λDD) / (λSD + λSU + λDD + λDU)
A high safe failure fraction can be accomplished if we either have a lot of safe failures (detected or
undetected does not really matter for the SFF) or if we can detect a lot of the possible dangerous
failures. Only failures detected by diagnostic tests can be accounted for in the safe failure fraction
calculation. Failures detected by periodic proof tests cannot be accounted for in the safe failure
fraction calculations. This is logical of course, as we do not want to count on humans to carry out our
safety.
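The SFF formula above can be sketched in a few lines; the failure rates and the diagnostic coverage below are illustrative assumptions, not data from this paper or the standard:

```python
# Safe failure fraction (SFF): all safe failures plus the dangerous failures
# found by diagnostic tests, divided by the total failure rate.
def safe_failure_fraction(l_sd, l_su, l_dd, l_du):
    return (l_sd + l_su + l_dd) / (l_sd + l_su + l_dd + l_du)

# Illustrative failure rates in failures per hour (assumed values).
l_sd, l_su = 2e-7, 2e-7          # safe detected / safe undetected
l_d_total = 4e-7                 # total dangerous failure rate
dc = 0.90                        # diagnostic coverage of dangerous failures
l_dd = dc * l_d_total            # dangerous detected (found by diagnostics)
l_du = (1 - dc) * l_d_total      # dangerous undetected (proof test only)

print(round(safe_failure_fraction(l_sd, l_su, l_dd, l_du), 3))  # 0.95
```

Raising the diagnostic coverage moves failures from λDU to λDD and so raises the SFF; failures found only by periodic proof tests stay in λDU and do not count.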
3 Architectures
To show the effect that diagnostic and proof tests have on safety functions we introduce three
common safety system architectures. The presented architectures are oversimplified but represent
common structures that are used to implement safety functions. In practice these systems are much
more complex and can consist of many more components. The three basic architectures presented
can be characterized by their redundancy and voting properties, i.e., as XooY (i.e., “X out of Y”), and are:
The 1oo1 architecture;
The 1oo2 architecture; and
The 2oo3 architecture.
Each architecture consists of one or more sensors, one or more logic solvers, and one or more
actuating elements. The following paragraphs explain the three different architectures in more
detail.
[Figure: block diagrams of the 1oo1, 1oo2 and 2oo3 architectures, each channel consisting of a sensor (S), a logic solver (LS) and an actuating element (A)]
4 Evaluation Procedure
The procedure used to calculate the PFD is outlined in detail in (Houtermans, 1999) and is in short as
follows. A functional block diagram of the hardware of the safety system is drawn. For each element
of the block diagram the typical failure modes are listed. The effect on system level is determined for
each of them. This information is used to construct the reliability model, which is in this case a Markov
model (ISA, 1997, IEC, 2001, Börcsök, 2004). The last step in the procedure includes the
quantification of the models by adding the failure and repair rates of the different components.
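As a minimal sketch of the quantification step, consider a two-state model of a single channel (operational versus failed dangerous undetected), far simpler than the Markov models actually used in the study; the failure rate and proof test interval are assumed values:

```python
import math

# Two-state sketch: the channel drifts from OK into a dangerous undetected
# failure at rate l_du.  PFD(t) is the probability it has failed by time t.
def pfd(t, l_du):
    return 1.0 - math.exp(-l_du * t)

def pfd_avg(T, l_du, steps=10000):
    """Average PFD over one proof test interval T (numerical midpoint mean)."""
    return sum(pfd((i + 0.5) * T / steps, l_du) for i in range(steps)) / steps

l_du = 1e-6   # dangerous undetected failures per hour (assumed)
T = 8760.0    # one-year proof test interval in hours (assumed)

# For small l_du*T the average is close to the familiar l_du*T/2 approximation.
print(pfd_avg(T, l_du), l_du * T / 2)
```

Full Markov models add states for detected failures, repair and common cause, but the quantification step (insert failure and repair rates, then solve) is the same in spirit.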
5 Reliability Data
One of the objectives of this study was to make sure that it was possible to compare the results for
the different architectures. The actual value of the outcome is less important, as we are mainly
interested in the relative results. All calculation studies are carried out with these values unless
otherwise noted in the specific study (reference model).
6 Performance Indicators
In order to examine the results and the effects of changing the diagnostic coverage, proof test
coverage and test parameters, performance indicators are introduced. A performance indicator helps
us understand what impact changing a parameter has on the PFD value of the system at hand. The
performance indicator (PI) is calculated relatively simply. We change the parameter, for example the
diagnostic coverage, stepwise from its minimum value to its maximum value. This results in PFD
values changing according to the value of the parameter. To determine the impact of the parameter
on the PFD value we calculate the change of the PFD relative to the 50% value of the parameter. In
the case of diagnostic coverage, 50% means that 50% of the failures are detected by diagnostics.
The 50% value is in this case the reference value that we use to normalize the PFD value to 1. To get
an impression of the influence below 50% DC and above 50% DC we choose the PFD values at 25%
DC and 75% DC for comparison.
The reason we do this is that we want to determine whether changing a parameter has a large or
only a small effect on the PFD. In other words, should we have in real life a system that does not
meet our PFD target, then we know which design parameter to address in order to make the changes
with the most influence and thus the fastest results. We do not want to spend time and money on
improving parameters that show little or no effect at all.
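The PI calculation can be sketched with the crude single-channel approximation PFDavg ≈ (1 - DC)·λD·T/2. This approximation is an illustration only, not one of the Markov models of this study; the rate and interval are assumed values:

```python
# Performance indicator (PI) sketch for diagnostic coverage (DC), using the
# crude 1oo1 approximation PFDavg ~ (1 - DC) * lambda_D * T / 2.
def pfd_avg(dc, l_d=1e-6, T=8760.0):
    return (1.0 - dc) * l_d * T / 2.0

ref = pfd_avg(0.50)          # 50% coverage is the reference (PI normalized to 1)
pi_25 = pfd_avg(0.25) / ref  # how much worse the PFD gets with less diagnostics
pi_75 = pfd_avg(0.75) / ref  # how much better it gets with more diagnostics

print(pi_25, pi_75)  # approximately 1.5 and 0.5 for this simple model
```

In this crude model halving the undetected fraction halves the PFDavg; the Markov models of the study give architecture-dependent indicator values instead.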
7 Calculation results
7.1 Calculation results with and without diagnostics – no proof test
In this paragraph we study the influence diagnostics has on the PFDavg values for the different
system architectures as presented in paragraph 3.1. The diagnostic coverage is varied over the
following percentages: 0%, 25%, 50%, 75%, and 99%. The results are presented in the figure below.
[Figure: PFDavg (probability, log scale 0.0001-0.1) as a function of diagnostic coverage, 0-100%, for the three architectures]
Next we calculate the performance indicator of the diagnostic coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 5 shows the change of the PFDavg for 25% DC
compared to 50% DC and the change for 75% DC compared with 50% DC. The numbers are
normalized with the 50% value.
[Figure 5: PI-PFD per architecture for 25%-50% DC and 50%-75% DC, normalized to the 50% value; bar values between 1.19 and 2.73]
Next we calculate the performance indicator of the proof test coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 7 shows the change of the PFDavg for 25% test
coverage compared to 50% and the change for 75% compared with 50%. The numbers are
normalized with the 50% value.
[Figure: PFDavg (probability, log scale 0.0001-0.1) as a function of proof test coverage, 0-100%]
[Figure 7: PI-PFD per architecture for 25%-50% and 50%-75% proof test coverage, normalized to the 50% value]
[Figure: PFDavg (probability, log scale 0.0001-0.1) as a function of the proof test interval, 0-10 years]
Next we calculate the performance indicator of the proof test interval, i.e., how it influences the
PFDavg of the different architectures. Figure 9 shows the change of the PFDavg for a proof test
interval of 1-5 years and a proof test interval of 5-10 years. The numbers are normalized with the
5-year value.
[Figure 9: PI-PFD per architecture for proof test intervals 1y-5y and 5y-10y, normalized to the 5-year value; bar values between 1.30 and 7.87]
8 Conclusions
The purpose of this paper was to show the effect of diagnostic coverage, proof test coverage and the
proof test interval on the PFDavg value for different safety architectures. To get more insight into the
influence of these design parameters a performance indicator was introduced. The choice of
diagnostic coverage and the proof test interval have the most influence on the PFDavg. The proof test
coverage also improves the PFDavg, but less significantly than the other two parameters. The 1oo2
architecture gains the most benefit, while the 1oo1 is the least sensitive. For redundant architectures
there is a higher chance of finding failures than for single architectures; therefore they perform better
in terms of improving the PFDavg. The authors are currently working on an extended version of this
paper taking into account, among others, more architectures and the PFS calculation per architecture.
REFERENCES
IEC 61508 (1999) Functional Safety of E/E/PE Safety-Related Systems, IEC, 1999
IEC 61511 (2003) Safety Instrumented Systems for the Process Industry, 2003
IEC 61165 (2001) Ed. 2: Application of Markov Techniques
ISA TR84.0.04 Part 4 (1997) Determining the SIL of a SIS via Markov Analysis
Börcsök, J. (2004) Electronic Safety Systems, Hardware Concepts, Models, and Calculations,
Hüthig GmbH & Co. KG, Heidelberg, Germany
Houtermans, M.J.M., Rouvroye, J.L. (1999) The Influence of Design Parameters on the
Performance of Safety-Related Systems. International Conference on Safety, IRSST,
Montreal, Canada
Houtermans, M.J.M., Brombacher, A.C., Karydas, D.M. (1998) Diagnostic Systems of
Programmable Electronic Systems. PSAM IV, New York, U.S.A.
Howland, R.E. (1995) Computer Hardware Diagnostics for Engineers, ISBN 0-07-030561-7,
McGraw-Hill, U.S.A.
Karydas, D.M., Houtermans, M.J.M. (1998) A Practical Approach for the Selection of
Programmable Electronic Systems Used for Safety Functions in the Process Industry. 9th
International Symposium on Loss Prevention and Safety Promotion in the Process Industries,
Barcelona, Spain
McCrea-Steele, R. (2006) Partial Stroke Testing - The Good, the Bad and the Ugly, 7th
International Symposium Programmable Electronic Systems in Safety Related Applications,
Cologne, Germany, 2006
White paper
07
Abstract: - An advanced safety architecture is the 2-out-of-4 (2oo4) system. In order to trigger the safety function
at least two of the four channels must work correctly. It is said: “A 2oo4-system is 2-failure safe”. In order to
classify the quality of a system we calculate different parameters. In this report, equations for the PFD are given for
normal and common cause failures. The Markov model for a 2oo4-architecture is also introduced, with which we
can calculate the MTTF (Mean Time To Failure) of this architecture. The results are high availability and high
reliability.
Key-Words: - 2oo4-Architecture, Availability, IEC/EN 61508, Reliability, Markov-model, MTTF, PFD, SIL
IEC 61508 proposes a second possibility for classification of safety systems. The probability of an
occurring failure on demand leaving the system unable to perform its safety functions is calculated as
well. Therefore a certain period of time is demanded for the proof check interval, either
• one month or
• three months or
• six months or
• one year.
This probability of failure is defined as probability of failure per hour (PFH). Unlike probabilities it has
a dimension of 1/h. Systems demanding continuous operation are highly significant for industrial
systems. Note that comparing systems by their PFD or PFH value is only possible within limits, as
these values refer to different bases.
The probability of a failure on demand always has to be regarded as a statistical term. Even in safety
systems there is no absolute safety given, since these systems may fail on demand. By long-lasting
empirical studies on corresponding applications the distribution of a system's failures can commonly
be assumed as follows:
• 15 % of computing elements
• 50 % of sensors
• 35 % of termination elements such as actuators.
The whole system's failure rate λ is subdivided into safe failures λS and dangerous failures λD. In
addition, safe failures are subdivided into safe undetected failures λSU and safe detected failures
λSD, whereas dangerous failures are subdivided into dangerous undetected failures λDU and
dangerous detected failures λDD. Fig. 1 shows the spreading of failure rates. Failure rates can be
specified with the aid of standard specifications:

λS = λSD + λSU
λD = λDD + λDU

[Fig. 1: subdivision of the failure rate λ into safe detected, safe undetected, dangerous detected and dangerous undetected failures]
A system's quality can be specified by defining its PFD value referred to its accuracy. The smaller
this value, the better the system. However, the longer the system runs, the higher the PFD value will
be. The PFD value is calculated for a period of time called the proof check interval T1. After the
maintenance of the system we proceed on the assumption that it works without any failures. Judging
and comparing systems is mostly done by the PFD average value (PFDavg) over a whole proof
check interval.
The best known architectures in use for safety systems are the 1oo2- and 2oo3-architectures. 1oo2-
(reading 1 out of 2) and 2oo3- (reading 2 out of 3) architectures are common for safety-related
systems in industry. A 1oo2-architecture, see figure 2, contains two independent channels which are
connected in such a manner that if one of the two serial output circuits has a safety-related failure,
the other channel must work correctly and transfer the controlled process into the safe state.
[Fig. 2: 1oo2-architecture with one sensor, two safe logic solvers (A and B, each with four input channels and two output channels) and an actuator connecting element]
[Fig. 3: 2oo3-architecture with three safe logic solvers and an actuator connecting element]
[Fig. 4: 2oo4-architecture with one sensor, four safe logic solvers (A, B, C and D) and an actuator connecting element]
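The XooY voting behind these diagrams can be sketched as a simple trip function: the safety action is taken as soon as at least X of the Y channels demand it. A minimal illustration (the channel states are hypothetical):

```python
# Generic XooY voter: the safety function trips when at least x of the y
# channel votes demand a trip.  True means "channel demands trip".
def xooy_trip(x, votes):
    return sum(votes) >= x

# 1oo2: one tripping channel out of two is enough.
print(xooy_trip(1, [True, False]))             # True
# 2oo3: a single deviating channel is outvoted.
print(xooy_trip(2, [True, False, False]))      # False
# 2oo4: at least two of the four channels must demand the trip.
print(xooy_trip(2, [True, True, False, False]))  # True
```

The same function also shows why redundancy helps availability: a 2oo4 voter tolerates a spurious vote from one channel without tripping the process.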
3 Calculation of probability distributions
You can apply the basic approach for the determination of the PFDavg equation of a 2oo4-system:

P(t) = Psingle + Pcommon cause
     = 4 · P1(t) · P2(t) · P3(t) + PDUC(t) + PDDC(t)   (1)

The index DUC means a dangerous undetected common cause failure, whereas DDC accounts for a
dangerous detected common cause failure.

3.1 Calculation of probability of normal failures
As already mentioned, the 2oo4-system is 2-failure tolerant. Before we calculate the probability of
normal failures for a 2oo4-system, we should reflect on the probability for a 1-failure tolerant system,
e.g. a 1oo3-system. If a 1oo3-system should fail with normal failures, we have the condition that each
of the three channels must have a dangerous failure. If the probability is calculated for this case, the
product is derived from the probability of failure of each channel. The following equation results:

Pnormal(t) = P1(t) · P2(t) · P3(t)   (2)

Pi(t) describes the probability of failure for the ith channel with the failure rate

λ = λDi   (3)

for a dangerous, normal failure in channel i and the probability of failure

Pi(t) = 1 - e^(-λDi·t)   (4)

If equations (4) and (2) are used with the generally applicable PFDavg equation

PFDavg(T) = (1/T) · ∫[0,T] P(t) dt   (5)

we get the result

PFDavg,normal(T) = 1 + (e^(-λD1·T) - 1)/(λD1·T) + (e^(-λD2·T) - 1)/(λD2·T) + (e^(-λD3·T) - 1)/(λD3·T)
                 - (e^(-(λD1+λD2)·T) - 1)/((λD1+λD2)·T) - (e^(-(λD1+λD3)·T) - 1)/((λD1+λD3)·T)
                 - (e^(-(λD2+λD3)·T) - 1)/((λD2+λD3)·T)
                 + (e^(-(λD1+λD2+λD3)·T) - 1)/((λD1+λD2+λD3)·T)   (6)
This function can be developed into a power series with the help of a Taylor development (more
exactly, a MacLaurin series). That PFDavg,single(T) is a continuous function with a removable
singularity at T = 0, so that all derivatives at this point exist, can be proved, e.g. in [3], [4]. After some
calculation, see also [3], [4], we get the result

PFDavg,normal(T) = (λD)³ · T³ / 4   (7)

This is the result for the probability of failure on demand of a 1oo3-system for normal failures. You
have to be aware that the parameter T is not equivalent to the parameter T1 (proof check interval) in
IEC/EN 61508, see [1]. T1 is only a part of T! For the calculation of the PFDavg value for a 2oo4-
system in case of normal failures we can use equation (7) of the 1oo3-system. This equation must be
extended by the factor four, as with four channels there are four possibilities that a failure exists in two
channels (remember the 2oo4-system is two-failure tolerant). The probability of failure for the 2oo4-
system for normal failures is therefore four times the result of equation (7).
Analogously, these failure probabilities can be derived for a 1oo1-system with λD,1oo1 = β · λDU
respectively λD,1oo1 = βD · λDD. A random common cause failure represents a 1oo1 function block!
Therefore it is possible to apply the derived PFDavg equation of the 1oo1-system for the calculation of
the probability of a common cause failure, see [3], [4]. The general solution for the probability of
failure results in

PFDavg = λD · T / 2   (10)

Since we have two common cause failure modes, λDUC = β · λDU and λDDC = βD · λDD, and with
the two assumptions that
• a dangerous undetected common cause failure occurs within the time period T1 + MTTR (T1
means the proof test interval, MTTR the mean time to repair) and
• a dangerous detected common cause failure occurs within the repair time MTTR,
we can calculate the PFDavg value for common cause failures, see [3], [4].
[Fig. 5: 2oo4-Markov-model. State transition diagram with 16 states: state 0 (system fully operational), the safe state S (state 1), operational degraded states with dangerous detected and/or undetected failures, and absorbing system-failure states, connected by transition rates composed of λS, λDD, λDU, β·λDU, βD·λDD, µ0, µR and µLT]
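Section 4.1 below extracts the MTTF from such a model by restricting the chain to the operational states, inverting the resulting matrix M, and summing the first row of N = M⁻¹. The same recipe can be tried on a deliberately tiny model: a hypothetical 1oo2 pair with channel failure rate λ and repair rate µ for the first (detected) failure, whose MTTF has the known closed form (3λ + µ)/(2λ²). All rates are illustrative assumptions:

```python
# MTTF of a toy 1oo2 Markov model via inversion of the transient-state matrix,
# mirroring the N = inverse(M) procedure of section 4.1 on a 2x2 example.
# State 0: both channels ok; state 1: one channel failed (repaired at rate mu);
# the system fails (absorbing state) when the second channel also fails.
lam = 1e-4   # channel failure rate per hour (assumed)
mu = 1e-1    # repair rate per hour (assumed)

# M = -Q restricted to the transient (operational) states {0, 1}.
m = [[2 * lam, -2 * lam],
     [-mu, lam + mu]]

# Invert the 2x2 matrix by hand (no external libraries needed).
det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
n = [[m[1][1] / det, -m[0][1] / det],
     [-m[1][0] / det, m[0][0] / det]]

# MTTF starting from state 0 = sum of the first row of N.
mttf = n[0][0] + n[0][1]
print(mttf, (3 * lam + mu) / (2 * lam ** 2))  # the two values agree
```

The 2oo4 calculation is identical in spirit, only with the 9 x 9 matrix built from the operational states of the 16-state model.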
In state 3 one of the four channels is operating with a failure. The occurring failure is dangerous and
is not recognized by the failure diagnostics. The transition rate between the states 0 and 3 has the
value 4·λDU, as a dangerous undetected failure can exist in any one of the four channels. No
transition possibility exists for the system from state 3 into safe state 1 because the failure cannot be
recognized within the test interval τTest = 1/µ0. From state 3 a transition takes place into state 5
respectively 6 if a failure occurs in the until then still failure-free channels. The system can only
change to state 0 again, where the system is failure free, after τLT if during the total lifetime of the
system in state 3 no further failures occur. In practice this means: after time τLT the total system is
exchanged.
If the second failure in state 3 is a dangerous detected failure then a transition takes place into state
5. The transition rate is 3·λDD. Therefore, in state 5 a dangerous undetected failure exists in one
channel while at the same time in one of the other three channels a dangerous detected failure has
occurred. The dangerous detected failure is revealed within the test interval; if the system is in state
5 and no further dangerous failures occur, state 5 changes with transition rate µ0 = 1/τTest into the
safe state 1. The system reaches state 6 if a second dangerous undetected failure occurs in one of
the three remaining channels while the system is in state 3. The transition rate is 3·λDU. State 6 is
characterized by two dangerous undetected failures, one in each of two of the four channels. No
transition possibility exists for the system from state 6 into the safe state 1 because the failures are
not recognized within the test interval τTest = 1/µ0.
Because the failure is detected within the test interval τTest, a transition possibility exists for the
system from the states 7, 8, 9, 11, 12, 13 and 15 into the safe state 1. The transition rate for this
transition is µ0 = 1/τTest.
The following two cases can be differentiated if a common cause failure occurs in a 2oo4-system:
• The common failure cause leads to dangerous detected failures. Then a transition exists from
state 0 directly into state 11. The transition rate is βD·λDD.
• The common failure cause leads to dangerous undetected failures. Then a transition exists
from state 0 directly into state 14. The transition rate is β·λDU.
In summary we can note the following:
• If state 7 occurs the system immediately switches to state S.
• Failures that bring the 2oo4-system into the states 8, 9, 12, 13, and 15 result in a transition of
the system into the safe state 1 after a time which is smaller than 4·τTest. The transition rate
from these states into state 1 is always equal to µ0 = 1/τTest.
• The states 1, 7, 11, 12, 13, 14 and 15 are absorbing states; that means these states have
only a transition to the safe state or to the state “system fully operational” and no further
transitions exist.
In the states 0, 2, 3, 4, 5, 6, 8, 9 and 10 the system is operational. These states must be taken into
account for the MTTF calculation of the 2oo4-system.

4.1 Calculation of the MTTF-value for a 2oo4-system
For the 2oo4 Markov model there exists the transition matrix P. This transition matrix is a 16 x 16
matrix, see [3], [4], because we have 16 states. The P matrix is the basis for the Q matrix. The
elements of the Q matrix are composed of the respective probability densities, where the
corresponding states meet the following criteria:
• system operational
• non-absorbing state.
An operational system is possible for a 2oo4-system in the states 0, 2, 3, 4, 5, 6, 8, 9 and 10. The
states 1, 7, 11, 12, 13, 14 and 15 should not be considered during the MTTF calculation, as they are
absorbing states. Therefore the Q matrix has a 9 x 9 form, see [3], [4].
For the considered Markov model we make the assumption τLT = ∞. As such applies

µLT = 1/τLT = 0   (13)

The next step is to calculate the M-matrix. We get the M-matrix with the following formula:

I - Q = [M1 M2; M3 M4] · dt = M · dt   (14)

For the 2oo4-system the M-matrix is also a 9 x 9 matrix. Now we can calculate the N-matrix. The
N-matrix needs to be composed to derive the MTTF value of the system. The N-matrix is the inverse
matrix of the M-matrix. The MTTF value describes the mean time between the occurrences of two
failures. One assumes state 0 at the start time, i.e. the state in which the system operates failure
free. After the inversion the elements of the new matrix represent time dependent values. One needs
to sum the first row of the N-matrix in order to derive the MTTF value of the system. The MTTF term
of a 2oo4-system has the following form, see also [3], [4]:

MTTF2oo4 = 1/A1 + 4·λDD/(A1·A2) + 4·λDU/(A1·A3) + 12·λDD²/(A1·A2·A4)
         + 12·λDU²/(A1·A3·A6) + A11 + A12 + A13 + A14   (15)

with
A1 = 4·λS + 4·λDD + 4·λDU + βD·λDD + β·λDU
A2 = µ0 + 3·λDD + 3·λDU
A3 = µLT + 3·λDD + 3·λDU
A4 = µ0 + 2·λDD + 2·λDU
A5 = µ0 + 2·λDD + 2·λDU
A6 = µLT + 2·λDD + 2·λDU
A7 = µ0
A8 = µ0 + λDU
A9 = µ0 + λDD + λDU
A10 = µLT + λDD + λDU
A11 = 12·λDD·λDU·(A2 + A3) / (A1·A2·A3·A5)
A12 = 24·λDD²·λDU·(A2·A4 + A3·A4 + A3·A5) / (A1·A2·A3·A4·A5·A8)
A13 = 24·λDU²·λDD·(A2·A5 + A2·A6 + A3·A6) / (A1·A2·A3·A5·A6·A9)
A14 = 24·λDU³ / (A2·A3·A6·A10)
A15 = 6·λDD·λDU·(A4 + A5) / (A2·A4·A5·A8)
A16 = 6·λDD·λDU·(A5 + A6) / (A3·A5·A6·A9)

5 Conclusion
The safer 2oo4-architecture will be established within high safety class computers in the future. Such
computers will be applied in various fields which simultaneously require both availability and maximal
safety. They are applied where human lives need to be protected and/or saved, whether in material
handling, energy production/distribution, in the medical field or in future industrial power plants in
space.
As already mentioned in the introduction, today's technical systems are becoming more and more
complex. Man will no longer be able to provide appropriate safety in processes which have to be
monitored. Future safety controls must support him, either in recording and analysing data, or in the
operations resulting from this. Advanced safety architectures like the introduced 2oo4-system have
to be utilized in order to guarantee the required safety. This system combines the benefits of the
1oo2- and the 2oo3-system: simultaneously a higher availability and a higher safety than today's
systems.

References:
[1] IEC/EN 61508: International Standard 61508 Functional Safety: Safety-Related Systems,
Geneva, International Electrotechnical Commission
[2] Börcsök, J.: International and EU Standard 61508, presentation within the VD Conference of
HIMA GmbH + Co KG, 2002
[3] Börcsök, J.: Elektronische Sicherheitssysteme, Hüthig publishing company, 2004
[4] Börcsök, J.: Electronic Safety Systems, Hüthig publishing company, 2004
[5] Börcsök, J.: Sicherheits-Rechnerarchitektur Teil 1 und 2, lecture, University of Kassel,
2000/2001
[6] Börcsök, J.: Echtzeitbetriebsysteme für sicherheitsgerichtete Realzeitrechner, lecture,
University of Kassel, 2000/2001
[7] DIN VDE 0801: Funktionale Sicherheit, sicherheitsbezogener elektrischer/elektronischer/
programmierbarer elektronischer Systeme (E/E/PES), (IEC 65A/255/CDV: 1998), p. 27f,
August 1998
[8] DIN V 19250: Grundlegende Sicherheitsbetrachtungen für MSR-Schutzeinrichtungen, Beuth
publishing company, Berlin 1998
[9] DIN VDE 0801/A1: Grundsätze für Rechner in Systemen mit Sicherheitsaufgaben, Beuth
publishing company
[10] IEC 60880-2: Software für Rechner mit sicherheitskritischer Bedeutung, 12/2001
White paper
08
Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com
M.J.M. Houtermans 1
Risknowlogy B.V., Brunssum, The Netherlands
W. Velten-Philipp
TUV Rheinland, Cologne, Germany
Abstract: - The purpose of this paper is to show the effect that diagnostic and periodic proof testing have
on the availability of the safety function carried out by programmable electronic systems. For three
different architectures the influence of the diagnostic coverage, the proof test coverage, and the proof
test interval on the probability of failure on demand are determined. Performance indicators are used
to express this influence and show the effect.
1 Introduction
Safety systems carry out one or more safety functions. Each safety function consists of a sensing
element, a logic solving element and an actuation element. Typical sensing elements are for example
sensors, switches or emergency push buttons. The logic solving element is usually a general-purpose
safety computer which can carry out several safety functions at once. Valves, pumps or alarms are
typical actuating elements of a safety function. The performance of these safety functions is
determined by several design parameters of the individual components of the safety functions. In [1]
the following design parameters of the safety function are identified:
• Architecture
• Hardware failures
• Software failures
• Systematic failures
• Common cause failures
• Online diagnostics, and
• Periodic test intervals.
The performance of a safety function can be expressed as the probability of failure on demand
(PFD) and the probability of fail safe or spurious trip. Both attributes are important in the safety world
as their values represent respectively a measurement for the level of safety achieved and the financial
loss caused by the safety system because of spurious trips. The PFD value is one requirement to
meet the safety integrity level of the IEC 61508 standard [2]. For the PFS value there are currently no
requirements in the international safety world, although end-users of safety systems require as low a
PFS value as possible.
1 Corresponding author: m.j.m.houtermans@risknowlogy.com
The purpose of this paper is to show the effect that the above mentioned design parameters,
namely online diagnostic and periodic proof testing, have on the performance of the safety function in
terms of the PFD. For three different architectures the influence of the diagnostic coverage, the proof
test coverage, and the proof test interval on the PFD are determined. A performance indicator is used
to express this influence and show the effect.
The proof test is defined in IEC 61508 as well as in IEC 61511 [2,4]. IEC 61508 defines the proof
test as follows:
Thus, in other words, both diagnostic tests and periodic proof tests try to detect failures inside the
safety system. At first sight there seems to be no difference, yet the actual difference is quite an
important one. Note 3 of paragraph 7.4.3.2.2 of IEC 61508-2 explains that a test is a diagnostic test if
the test interval is at least a magnitude less than the expected demand rate. Based on this extra
information we can conclude that, in theory, a test to detect failures is called a diagnostic test if the
test is carried out automatically and with a test interval at least a magnitude less than the expected
demand rate. In all other cases we can refer to the test as a proof test.
The difference between a proof test and a diagnostic test is also important in case of a safety
system with a single architecture, i.e., hardware fault tolerance zero. In this case a proof test is only
sufficient if the safety function operates in a low demand mode. In case of a high demand safety
function diagnostic tests are required that are able to detect dangerous faults within the process
safety time (see IEC 61508-2, chapter 7.4.3.2.5). In other words you cannot build a single safety
system that only depends on proof tests.
In practice we define a test as a diagnostic test if it fulfils the following three criteria:
1. It is carried out automatically (without human interaction) and frequently (related to the
process safety time considering the hardware fault tolerance) by the system software and/or
hardware;
2. The test is used to find failures that can prevent the safety function from being available; and
3. The system automatically acts upon the results of the test.
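As a toy illustration of these three criteria (not taken from the paper), a periodic memory checksum test that trips the system on a mismatch could look as follows; the memory contents and checksum scheme are purely hypothetical stand-ins for real diagnostics:

```python
# Toy illustration of the three diagnostic test criteria.  The memory array
# and checksum are hypothetical stand-ins for real diagnostics such as CPU,
# memory or program flow checks.
MEMORY = [3, 1, 4, 1, 5]
REFERENCE_CHECKSUM = sum(MEMORY)

def diagnostic_cycle(state):
    # Criterion 1: runs automatically (in a real system, driven by a timer).
    # Criterion 2: looks for failures that would defeat the safety function.
    failure_found = sum(MEMORY) != REFERENCE_CHECKSUM
    # Criterion 3: the system itself acts on the result: go to the safe state.
    if failure_found:
        state["tripped"] = True
    return state

state = {"tripped": False}
diagnostic_cycle(state)
print(state["tripped"])  # False: memory intact, no trip

MEMORY[0] = 9  # inject a dangerous failure (memory corruption)
diagnostic_cycle(state)
print(state["tripped"])  # True: the diagnostic forced the safe state
```

A proof test of the same memory would instead be a manual procedure whose result a human reads and acts upon, which is exactly why it cannot count toward the safe failure fraction.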
A typical example of a diagnostic test is a CPU test, a memory check, or program flow control
monitoring. A proof test, on the other hand, is a test which is carried out manually and where it is often
necessary to have external hardware and/or software to determine the result of the test. The proof
test interval is much longer than the process safety time and orders of magnitude longer than the
period chosen for diagnostic tests. A typical example of a proof test is full or partial valve stroke
testing. Partial valve stroke testing is seldom carried out without some form of human interaction (in
other words, we depend on a human to carry out the test, to determine the actual results of the test,
and/or to take the appropriate action based on the results) and often requires additional equipment.
The advantage of a diagnostic test over a proof test is that failures can be detected very quickly. If
a proof test is carried out once every three months, the safety function may be running with a failure
for up to three months before we find out about it. With diagnostics we often know about the problem
within milliseconds and can thus repair the failure very quickly. On the other hand, good diagnostics
require a more complicated design and additional hardware and software built into the system. This
additional hardware and software is often difficult to build and costs extra.
There is another important reason to make a distinction between diagnostic tests and periodic proof
tests. IEC 61508 requires the calculation of the safe failure fraction for subsystems. The safe failure
fraction of a subsystem is defined as the ratio of the average rate of safe failures plus dangerous
detected failures of the subsystem to the total average failure rate of the subsystem, see formula
below
SFF = (λSD + λSU + λDD) / (λSD + λSU + λDD + λDU)
A high safe failure fraction can be accomplished if we either have a lot of safe failures (detected or
undetected does not really matter for the SFF) or if we can detect a lot of the possible dangerous
failures. Only failures detected by diagnostic tests can be accounted for in the safe failure fraction
calculation. Failures detected by periodic proof tests cannot be accounted for in the safe failure
fraction calculations. This is logical of course, as we do not want to count on humans to carry out our
safety.
3 Architectures
To show the effect that diagnostic and proof tests have on safety functions we introduce three
common safety system architectures. The presented architectures are oversimplified but represent
common structures that are used to implement safety functions. In practice these systems are much
more complex and can consist of many more components. The three basic architectures presented
can be characterized by their redundancy and voting properties, i.e., as XooY (“X out of Y”), and are:
The 1oo1 architecture;
The 1oo2 architecture;
The 2oo3 architecture.
(Block diagrams: the 1oo1 architecture is a single channel S1-LS1-A1; the 1oo2 architecture has two
channels; the 2oo3 architecture has three channels. Each channel consists of a sensor (S), a logic
solver (LS), and an actuator (A).)
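The three architectures can be compared with the textbook simplified PFDavg approximations (these are not the paper's Markov results, and they ignore common-cause failures, diagnostics, and repair; the failure rate and test interval below are assumed values):

```python
# Simplified textbook PFDavg approximations for the three architectures.
# lam: dangerous failure rate per hour, ti: proof test interval in hours.
# Common-cause failures, diagnostics and repair are ignored here.

def pfd_1oo1(lam, ti):
    return lam * ti / 2.0          # single channel

def pfd_1oo2(lam, ti):
    return (lam * ti) ** 2 / 3.0   # both channels must fail dangerously

def pfd_2oo3(lam, ti):
    return (lam * ti) ** 2         # any two of three channels fail

lam, ti = 1e-6, 8760.0             # assumed: 1e-6 /h, one-year interval
for name, f in [("1oo1", pfd_1oo1), ("1oo2", pfd_1oo2), ("2oo3", pfd_2oo3)]:
    print(f"{name}: PFDavg ~ {f(lam, ti):.2e}")
```

With these assumptions the redundant structures come out orders of magnitude better than the single channel, which is the qualitative behaviour the calculation studies below confirm.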
4 Evaluation Procedure
The procedure used to calculate the PFD is outlined in detail in [1] and is, in short, as follows. A
functional block diagram of the hardware of the safety system is drawn. For each element of the block
diagram the typical failure modes are listed. The effect on system level is determined for each of
them. This information is used to construct the reliability model, which is in this case a Markov model
[5,6]. The last step in the procedure is the quantification of the models by adding the failure and
repair rates of the different components.
5 Reliability Data
One of the objectives of this study was to make sure that it was possible to compare the results for
the different architectures. The actual value of the outcome is not so important as we are more
interested in the relative results. All calculation studies are carried out with these values unless
otherwise noted in the specific study (Reference model).
6 Performance Indicators
In order to examine the results and the effects of changing the diagnostic coverage, proof test
coverage and other parameters, performance indicators are introduced. A performance indicator helps
us understand what impact changing a parameter has on the PFD value of the system at hand. The
performance indicator (PI) is calculated relatively simply. We change the parameter, for example the
diagnostic coverage, stepwise from its minimum value to its maximum value. This results in PFD
values that change with the value of the parameter. To determine the impact of the parameter on the
PFD value we calculate the change of the PFD relative to its value at the 50% setting of the
parameter. In the case of diagnostic coverage, 50% means that 50% of the failures are detected by
diagnostics. The 50% value is in this case the reference value that we use to normalize the PFD value
to 1. To get an impression of the influence below 50% DC and above 50% DC we choose the PFD
values at 25% DC and 75% DC for comparison.
The reason we do this is that we want to determine whether changing a parameter has a large or
only a small effect on the PFD. In other words, should we have in real life a system that does not meet
our PFD target, we then know which design parameter to address in order to make the changes with
the most influence and thus the fastest results. We do not want to spend time and money on
improving parameters that show little or no effect at all.
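The normalization step can be sketched as follows; the simple 1oo1 relation PFDavg ≈ λD·(1−DC)·T/2 and the numeric values are assumptions for illustration, not the paper's Markov results:

```python
# Sketch of the performance-indicator normalization: compute PFDavg at
# 25%, 50% and 75% diagnostic coverage and normalize to the 50% value.
# Simple 1oo1 approximation: PFDavg ~ lambda_D * (1 - DC) * T / 2.

LAMBDA_D = 1e-6   # dangerous failure rate per hour (assumed)
T = 8760.0        # exposure interval in hours (assumed: one year)

def pfd_avg(dc):
    return LAMBDA_D * (1.0 - dc) * T / 2.0

ref = pfd_avg(0.50)               # reference value, normalized to 1
for dc in (0.25, 0.50, 0.75):
    print(f"DC={dc:.0%}: normalized PFDavg = {pfd_avg(dc) / ref:.2f}")
# DC=25% gives 1.50, DC=50% gives 1.00, DC=75% gives 0.50
```

The two ratios (the 25% value versus the reference, and the 75% value versus the reference) are exactly the performance indicators plotted for each architecture below.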
7 Calculation results
7.1 Calculation results with and without diagnostics – no proof test
In this paragraph we study the influence diagnostics has on the PFDavg values for the different
system architectures as presented in paragraph 3. The diagnostic coverage is varied over the
following percentages: 0%, 25%, 50%, 75%, and 99%. The results are presented in the figure below.
(Figure: PFDavg probability versus diagnostic coverage (0 to 1), logarithmic scale from 0.0001 to 0.1.)
Next we calculate the performance indicator of the diagnostic coverage, i.e., how it influences the
PFDavg of the different architectures. Figure 5 shows the change of the PFDavg for 25% DC
compared to 50% DC and the change for 75% DC compared with 50% DC. The numbers are
normalized with the 50% value.
(Figure 5: performance indicator PI-PFD per architecture for 25%-50% DC and 50%-75% DC;
values range from 1.19 to 2.08.)
In other words, all architectures are sensitive to the diagnostic coverage factor, and increasing the
diagnostic coverage factor can yield major improvements. The redundant structures are much more
sensitive than the single structure. Diagnostic coverage really makes an impact on the PFDavg value
when it exceeds 50%, and preferably 75%.
(Figure: PFDavg probability versus proof test coverage (0 to 1), logarithmic scale from 0.0001 to 0.1.)
(Figure: performance indicator PI-PFD per architecture for proof test coverage, 25%-50% and
50%-75%; values range from 1.40 to 1.96.)
The PFDavg is less sensitive to proof test coverage than to diagnostic coverage. This is
understandable, as diagnostic coverage enables almost immediate repair compared to proof testing.
Failures detected with diagnostics are detected within seconds, while proof tests may be carried out
only once a year or even less frequently. Failures detected only by proof tests can thus exist much
longer in the system, and therefore the PFD does on average not improve that much.
Next the proof test interval is varied between 1 year, 2 years, and 5 years with a proof test coverage
of 100%. The results are presented in Figure 8.
(Figure 8: PFDavg probability versus time (0 to 10 years), logarithmic scale from 0.0001 to 0.1.)
Next we calculate the performance indicator of the proof test interval, i.e., how it influences the
PFDavg of the different architectures. Figure 9 shows the change of the PFDavg for a proof test
interval of 1-5 years and a proof test interval of 5-10 years. The numbers are normalized with the 5
years value.
(Figure 9: performance indicator PI-PFD per architecture for the proof test interval, 1y-5y and
5y-10y; the largest value is 5.79.)
There is almost no difference, from a probability point of view, between performing the proof test at
a 5 or a 10 year interval.
Just like the diagnostic coverage, the proof test interval has a significant impact on the PFDavg
value for all architectures. It is more important to carry out proof tests frequently than to carry them
out with a high test coverage. In other words, even a simple but frequent proof test can help reduce
the PFDavg value significantly. This is good news for partial stroke testing of valves: with partial
stroke testing we do not know the actual proof test coverage, but if it is done frequently it will still help
significantly.
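The partial stroke argument can be made concrete with a common approximation (an illustration under stated assumptions, not the paper's model): dangerous failures covered by the proof test are exposed for half the test interval on average, while uncovered failures remain for half the system lifetime:

```python
# Compare a frequent partial proof test with a rare complete one.
# Covered failures (fraction ptc) are exposed for ti/2 on average,
# uncovered failures for lifetime/2.

HOURS_PER_YEAR = 8760.0

def pfd_avg(lam_du, ptc, ti, lifetime):
    return lam_du * (ptc * ti + (1.0 - ptc) * lifetime) / 2.0

lam_du = 1e-6                       # dangerous undetected rate /h (assumed)
lifetime = 10 * HOURS_PER_YEAR      # assumed 10-year system lifetime
quarterly_partial = pfd_avg(lam_du, 0.7, 0.25 * HOURS_PER_YEAR, lifetime)
five_yearly_full = pfd_avg(lam_du, 1.0, 5 * HOURS_PER_YEAR, lifetime)
print(f"quarterly partial stroke (70% coverage): {quarterly_partial:.2e}")
print(f"complete proof test every 5 years:       {five_yearly_full:.2e}")
```

With these assumed numbers the frequent partial test yields the lower PFDavg, matching the observation above that test frequency matters more than test coverage.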
8 Conclusions
The purpose of this paper was to show the effect of diagnostic coverage, proof test coverage and the
proof test interval on the PFDavg value for different safety architectures. To get more insight into the
influence of these design parameters a performance indicator was introduced. The choice of
diagnostic coverage and the proof test interval have the most influence on the PFDavg. The proof test
coverage also improves the PFDavg, but less significantly than the other two parameters. The 1oo2
architecture benefits the most, while the 1oo1 is the least sensitive. For redundant architectures there
is a higher chance of finding failures than for single architectures, and therefore they perform better in
terms of improving the PFDavg. The authors are currently working on a more extended version of this
paper, taking into account among others more architectures and the PFS calculation per architecture.
REFERENCES
1. Houtermans, M.J.M., Rouvroye, J.L. (1999) The Influence of Design Parameters on the
Performance of Safety-Related Systems. International Conference on Safety, IRRST,
Montreal, Canada
2. IEC 61508 (1999) Functional Safety of E/E/PE Safety-Related Systems
3. Houtermans, M.J.M., Brombacher, A.C., Karydas, D.M. (1998) Diagnostic Systems of
Programmable Electronic Systems. PSAM IV, New York, U.S.A.
4. IEC 61511 (2003) Safety Instrumented Systems for the Process Industry
5. IEC 61165 (2001) Ed. 2: Application of Markov Techniques
6. ISA TR84.0.04 Part 4 (1997) Determining the SIL of a SIS via Markov Analysis
7. Karydas, D.M., Houtermans, M.J.M. (1998) A Practical Approach for the Selection of
Programmable Electronic Systems Used for Safety Functions in the Process Industry. 9th
International Symposium on Loss Prevention and Safety Promotion in the Process Industries,
Barcelona, Spain
White paper
09
L. Monfilliette, P. Versluys
Whessoe S.A., Calais, France
M.J.M. Houtermans 1
TUV Industrie Service, Cologne, Germany
Risknowlogy B.V., Brunssum, The Netherlands
Abstract
This paper will demonstrate in a practical way how liquefied gas storage facilities around the
world can benefit from IEC 61508 compliant level sensor systems. Worldwide there are about 100
liquefied natural gas (LNG) storage facilities. This market has grown by about 8% per year on
average over the past 5 years due to the increased demand for clean fuels, the development of new
gas fields and the consequent requirement for more storage facilities.
The inherently hazardous situation of storing liquefied natural gas means that the industry
requires a very high level of safety. One of the main problems is that a storage tank may be
overfilled, which results in structural damage to the tank and a spill of the liquefied natural gas into
the environment, bringing with it unpredictable hazardous situations and their associated risks. LNG
storage tanks therefore need to be equipped with special level sensors and emergency shutdown
equipment to ensure that it is not possible to overfill the tank.
The purpose of this paper is to demonstrate how the liquefied gas market can benefit from SIL
certified level sensors. From an application level point of view the safety requirements are explained.
This paper will discuss the IEC 61508 requirements as well as the specific requirements of NFPA 59.
The paper explains how the sensor system fulfills these requirements and what efforts the company
had to make to meet them. To end-users the paper will explain why and how the sensor
system has been tested. Furthermore, via a practical example of an LNG storage tank the paper will
demonstrate the achieved probability of failure on demand and the required proof test interval.
1 Introduction
The level sensor described in this paper consists of hardware, software and mechanical sub-modules.
The applicable functional safety standard for these kinds of systems is IEC 61508. Although this
standard only contains technical safety requirements for electrical, electronic and programmable
electronic systems, the standard clearly states that other technology, like the mechanical part of the
level sensor, is also supported and can be used as long as it follows the framework and lifecycle
approach of IEC 61508. As the standard only has detailed requirements for electrical, electronic, and
programmable electronic devices, additional requirements were defined during the certification of the
level sensor to address the mechanical parts.
This paper demonstrates in a practical way how liquefied gas storage facilities around the world
can benefit from IEC 61508 compliant, SIL certified level sensors. The safety requirements are
explained from an application point of view, and the paper discusses the IEC 61508 requirements as
well as the specific requirements of NFPA 59A.
1 Corresponding author: m.j.m.houtermans@risknowlogy.com
The actual level sensor is shown in more detail in Figure 2. The gauge itself is built up of
software, electronics, and a mechanical part, all enclosed in a rugged metal housing. The level gauge
sensing head is composed of two parts: the main sensing head body and a PVC displacer. The
main sensing head accommodates a coil and a linkage to the level gauge tape. The displacer floats
on the LNG surface. Following the changes in the actual LNG level, the displacer drives a core up or
down in the above-mentioned coil, thus changing the inductance of the latter. The level gauge tape,
connected to the coil, consists of 2 conductors, linking this coil to the gauge's electronics, where the
changes in inductance (and thus of the actual LNG level) are measured. The Tefzel® coated
stainless steel tape can be as long as 75 meters, depending on the storage tank height. The software
and electronics not only enable communication with the BPCS and SIS systems but also
ensure a high level of self-diagnostics. The level gauge tape also plays an integral part in the
diagnostic capabilities of the level gauge.
The displacer continuously floats on top of the liquid surface, continuously sensing any
movements of the surface and thus continuously changing the inductance in the sensing head coil.
The rate of change of this inductance is measured and analyzed by the electronics, with the following
results:
The speed at which the inductance changes is proportional to the speed at which the servo
motor should be driven.
The direction in which the inductance changes sets the direction of the servo motor (up or
down).
If the up and down changes sum to zero, this indicates that the surface is merely showing
wave action and no real level change, in which case the servo motor is not activated.
If no changes are measured during a 10 minute time span, the servo motor travels a short
distance up and immediately re-finds the actual level. This is a self-check to ensure that the
system is still functioning properly.
When the sensor was certified, the basis of the certification was the above defined safety function.
Without a well-defined safety function it is impossible to test the level sensors against the
requirements of the applicable standards. It is crucial to have a safety function definition that is not too
narrow, since otherwise end-users will not be able to use the device for safety.
If the safety function is defined too widely, the device becomes too difficult to certify, as many
requirements may not be met. A well-defined safety function also makes the testing task clear for
everybody involved in certification.
This is quite a list of requirements to manage. Therefore any TÜV certification project always starts
with three important documents. These three documents are:
Safety plan
Verification & validation plan
Safety requirements specification
These documents are not demanded by TÜV as such but are a direct result of the requirements of
IEC 61508. Besides that, with TÜV's years of experience in functional safety projects, having these
three documents ready at the start of the project is a guarantee that the project runs faster and that
everybody involved has a clear understanding of the project plan and of how to achieve safety.
The safety plan outlines the management of functional safety requirements and is basically the plan
or approach on how to achieve functional safety for the project. It outlines the people, departments
and organizations involved, the lifecycle to follow, the activities and documents in each step of the
lifecycle, the tools and measures that will need to be applied to avoid failures.
In other words, it is a document that outlines who will do what, how and at what time.
Since it is a plan, it is a living document that can be updated over the course of the project
whenever necessary.
The verification & validation plan is a document that outlines who will perform which verification
activities at what point in time. It does not outline the actual tests but only the activities leading to
these tests. For example, in the case of Whessoe, one of the activities was to understand the IEC
61508 standard: one cannot design and verify a product if one does not understand the requirements
of IEC 61508.
The third document is the safety requirements specification (SRS). Where the first two documents
were process related, (that is: how we manage functional safety), the SRS is about the requirements
of the actual product or system. The SRS is the most important safety document as it outlines the
basic and top-level safety requirements of the product. It is a well-focused document, which does not
go into detail and does not include any non-safety requirements. For this project, a lot of time was
spend upfront to generate these three documents. This time was considered well spend though and
was gained back during the remainder of the project as less mistakes were made and less “surprises”
revealed themselves during the project. The following paragraphs give a more detailed overview of
the requirements directly related to functional safety and applied during certification of the product.
The functional safety management requirements are in general dealt with in the safety plan and the
verification & validation plan. Detailed verification & validation documentation is created for each step
of the lifecycle, both for hardware as well as for software. The hardware and software requirements
are, on a general level, explained in the safety requirements specification and in more detail in the
supporting design specifications.
A qualitative and quantitative reliability analysis needs to be carried out on the hardware and is part
of the hardware verification documentation. Besides specifications, verification and validation,
supporting documentation also needs to be created, such as a user manual including the safety manual.
One of the most important IEC 61508 concepts that need to be addressed is the architectural
constraint. According to IEC 61508, it is not possible to just build any kind of safety system, as the
architecture is limited according to the requirements in Table 1. This table applies to so-called
subsystems. Typical subsystems are sensors, valves, logic solvers, etc.
For each subsystem we need to determine the following:
The type
The safe failure fraction
The hardware fault tolerance
The type of the subsystem deals with the complexity of the component.
There are two types, A or B. Type A subsystems are simple systems with well-defined failure
modes and failure behavior. Type B subsystems are complex systems where one or more failure
modes are not clear or where we cannot fully understand the failure behavior of the system.
The safe failure fraction is a measure of the “fail-safe” design and built-in diagnostics of the safety
system. The more internal failures go to the safe side, or the more failures we can detect via built-in
diagnostics, the higher the safe failure fraction.
The hardware fault tolerance is a measure of redundancy. A hardware fault tolerance of N means
that N+1 dangerous failures cause the safety function of the subsystem to be lost; a hardware fault
tolerance of 0 means that the safety function is lost when one (0+1=1) dangerous failure occurs.
A single subsystem has a hardware fault tolerance of zero;
A redundant subsystem has a hardware fault tolerance of 1, and so on.
Table 1 – Architectural constraints: achievable SIL per subsystem

                     Type A, Hardware            Type B, Hardware
Safe Failure         Fault Tolerance (HFT)       Fault Tolerance (HFT)
Fraction (SFF)       0       1       2           0       1       2
< 60 %               SIL 1   SIL 2   SIL 3       N.A.    SIL 1   SIL 2
The level gauge can actually be divided into three subsystems as shown in Figure 3.
The division is based on the type of the subsystem according to IEC 61508. A single level gauge is
a mixed Type subsystem as it consists of Type A mechanical hardware, Type A electronic hardware
and type B electronic hardware. In order for a single level gauge to achieve SIL 2 the following
conditions need to be met:
Type A mechanical hardware needs to have a safe failure fraction of 60-90%
Type A electronic hardware needs to have a safe failure fraction of 60-90%
Type B electronic hardware needs to have a safe failure fraction of 90-99%
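The three conditions above can be sketched as a simple check; the measured SFF values below are hypothetical stand-ins for the actual FMEA results:

```python
# SIL 2 check for the mixed-type level gauge: every subsystem must reach
# the minimum SFF band required for its type (at hardware fault tolerance 0).

REQUIRED_MIN_SFF = {
    "mechanical (Type A)": 0.60,
    "electronic (Type A)": 0.60,
    "electronic (Type B)": 0.90,
}

measured_sff = {  # hypothetical values standing in for the FMEA results
    "mechanical (Type A)": 0.72,
    "electronic (Type A)": 0.85,
    "electronic (Type B)": 0.94,
}

fit_for_sil2 = all(
    measured_sff[sub] >= REQUIRED_MIN_SFF[sub] for sub in REQUIRED_MIN_SFF
)
print("single level gauge fit for SIL 2:", fit_for_sil2)
```

The gauge only qualifies if every subsystem passes its own band; one subsystem below its band pulls the whole sensor down, which is why the FMEA had to cover the mechanical parts as well.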
(Figure 3 – Division of the level sensor into subsystems.)
To verify the safe failure fraction of a single sensor a detailed component level failure modes and
effects analysis (FMEA) has been carried out. This FMEA addresses the mechanical as well as the
electronic hardware of the sensor. For every single internal component of the level gauge, the failure
modes were listed and the effects of these failure modes were analyzed taking into account the safety
function as defined before. This was indeed a tedious task but it documented the full possible failure
behavior of the level sensor as required by the standard.
Failure rate data was added to the FMEA in order to calculate the safe failure fraction. During the
FMEA, existing diagnostic features of the gauge were also taken into account.
Not all diagnostics as required by the standard were available in the first design of the level sensor.
The FMEA revealed that there were several improvements to be made in order to achieve the
required safe failure fractions. Additional software diagnostics were implemented. The accepted
design for a single level sensor currently meets the safe failure fractions for SIL 2. As it is possible to
use multiple sensors in different architectures, it is also possible to achieve SIL 3.
Table 2 gives a complete overview of the possible architectures for the level sensor and their
achievable SIL levels according to IEC 61508.
Table 2 – Overview of the possible architectures and their achievable SIL level

Attribute                   1oo1    1oo2    2oo3
Hardware fault tolerance    0       1       1
Fit for use in SIL          2       3       3
5 NFPA 59 A requirements
Level sensors for LNG tanks in many countries need to comply with the US standard NFPA 59A. This
standard is application specific, which means that besides the IEC 61508 requirements it is also
necessary for these level gauges to comply with the NFPA 59A standard.
Although this is a US standard, many countries in the world storing LNG apply it as a basis when
building LNG storage tanks. There are a few very significant requirements in the standard that need
to be considered when using level gauges. The requirements within NFPA 59A call for three level
gauges, one being dedicated to high-high level alarming only.
In other words, no matter how well the level sensors perform according to the IEC 61508 standard, if
a company needs to comply with NFPA 59A then by definition it needs to use three level gauges.
At the time of writing the NFPA 59A standard, IEC 61508 was not known to the committee. Possibly
in the future the requirement of using 3 sensors may be reduced to 1 or 2 level gauges fit for use at a
certain SIL level according to IEC 61508.
6 Environmental conditions
At the design stage the safety system should take the following environmental factors into account:
Temperature range: −20 °C to +50 °C
Enclosure: IP 65
Tropical protection of components: optional coating for the PCBs
Pressure range: up to 500 mbar relative to atmospheric pressure
Seismic resistance: up to 2 g in all directions
Besides the above, the level gauges must also comply with the EMC requirements.
7 PFD calculations
Although the PFD can only be calculated for a complete safety function, in this paper we will
calculate the contribution of the level sensor to the overall safety function. One of the most
advanced techniques for reliability calculations is Markov analysis [11]. Three Markov models were
created for the three possible architectures the level sensors can be used in. For each Markov model
the reliability data resulting from the FMEA were used as failure rate inputs. The actual voting of the
1oo2 and 2oo3 system occurs in the logic solver of the ESD system. As the level sensors have
excellent diagnostic capabilities, it is possible to send signals to the logic solver indicating safe and
dangerous detected failures. In other words, the logic solver knows which signal from which sensor to
trust and which not to trust. This helps significantly in deciding whether to shut down or to indicate to
the operators to repair the sensors. The results of the PFD calculation are presented in Table 4. The
PFD calculations are performed for 1 and 10 years of continuous operation.
Table 4 – PFD and PFS calculation results per architecture

Attribute                          1oo1         1oo2         2oo3
PFD after 1 year                   1.802e-004   4.404e-008   3.287e-007
Percentage of PFD after 1 year     0.180%       0.004%       0.033%
PFD after 10 years                 1.771e-003   4.181e-006   3.201e-005
Percentage of PFD after 10 years   17.7%        0.42%        3.20%
Fit for use in SIL                 2            3            3
PFS after 1 year                   1.154e-006   9.701e-005   1.918e-010
Fit for use in STL                 5            4            9
Figure 4 shows how the probability of failure on demand develops over time for all three
architectures. A graphical representation like this can be used by an end-user to determine the
periodic proof test interval. This can only be done, though, if the logic solver and actuating part are
also included in the calculation. The 1oo1 architecture clearly performs the worst of the three
architectures. The reason that the 1oo2 architecture performs better than the 2oo3 architecture is
that the 2oo3 architecture has more possibilities to fail.
Figure 4 – Probability of Failure on Demand for 1oo1, 1oo2, and 2oo3 architectures.
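As a rough cross-check of the 1oo1 numbers in Table 4 (an approximation, not the paper's Markov model), a single sensor without proof testing accumulates unavailability roughly as PFD(t) ≈ 1 − exp(−λDU·t):

```python
# Back out an effective dangerous undetected rate from the 1-year 1oo1
# value in Table 4, then predict the 10-year value with the simple
# exponential model PFD(t) = 1 - exp(-lambda_du * t).

import math

HOURS_PER_YEAR = 8760.0
pfd_1y = 1.802e-4                                  # from Table 4
lam_du = -math.log(1.0 - pfd_1y) / HOURS_PER_YEAR  # per hour
pfd_10y = 1.0 - math.exp(-lam_du * 10 * HOURS_PER_YEAR)
print(f"lambda_du ~ {lam_du:.2e}/h, predicted 10-year PFD ~ {pfd_10y:.2e}")
# Table 4 reports 1.771e-3 for 10 years, close to this simple estimate;
# the small difference comes from the repair transitions in the Markov model.
```

The near-linear growth of PFD over time is exactly why the figure can be used to pick a proof test interval: the test resets the accumulated unavailability.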
8 Conclusions
The paper presented the work performed by Whessoe S.A. to certify their LNG level sensor to IEC
61508 and related standards. The level sensors were rigorously tested, not only for functional safety,
but also for specific environmental conditions. Whessoe decided to have the level sensor certified by
TÜV. This certification assures end-users that they do not need to evaluate the sensor against the
IEC 61508 standard themselves. The independent review by TÜV demonstrated that the level
sensor is capable of achieving SIL 2 in a 1oo1 configuration and SIL 3 in a 1oo2 or 2oo3
configuration.
9 References
1 IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related
systems. International Electrotechnical Commission, Geneva, Switzerland, 1999
2 IEC 61511, Functional safety – Safety instrumented systems for the process industry.
International Electrotechnical Commission, Geneva, Switzerland, 2003
3 NFPA 59: Utility LP-Gas Plant Code. National Fire Protection Association, Quincy, MA,
USA, 2004
4 49 CFR Part 13, USA
5 EN 1473, Installation and equipment for liquefied natural gas. Design of onshore
installations, 1997
6 EN 61326-1, Electrical equipment for measurement, control and laboratory use – EMC
requirements. International Electrotechnical Commission, Geneva, Switzerland, 2005
7 IEC 61010, Safety requirements for electrical equipment for measurement, control, and
laboratory use. International Electrotechnical Commission, Geneva, Switzerland, 2003
8 EN 50014, Electrical apparatus for potentially explosive atmospheres. General requirements,
1998
9 EN 50018, Electrical apparatus for potentially explosive atmospheres. Flameproof enclosure 'd',
2000
10 EN 50020, Electrical apparatus for potentially explosive atmospheres. Intrinsic safety 'i', 2002
11 Börcsök, J., Electronic Safety Systems, Hardware Concepts, Models, and Calculations, ISBN
3-7785-2944-7, Heidelberg, Germany, 2004
White paper
10
Not all aspects are yet addressed in this draft. Comments and suggestions that can improve this
maintenance override paper in terms of safety are very welcome.
2. Introduction
The purpose of this document is to describe the procedures for the use of maintenance overrides of
safety-related programmable electronic systems, like sensors, controllers, and actuators. The
document also shows how to overcome the safety problems and the inconvenience of hardwired
solutions.
2.3.2. Operation:
1. The alarm shall not be overridden. It shall always be clear which signals are in a maintenance
condition.
2. The PLC alerts the operator (e.g. via the DCS) indicating the override condition. The operator will
be warned until the override is removed.
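A minimal sketch of this behaviour (the class and tag names below are hypothetical, not from the document): the operator keeps receiving a warning for as long as a signal is overridden, and the warning clears only when the override is removed.

```python
class OverridableSignal:
    """Hypothetical sketch of the maintenance override rule above."""

    def __init__(self, tag):
        self.tag = tag
        self.overridden = False

    def set_override(self):
        self.overridden = True

    def remove_override(self):
        self.overridden = False

    def operator_warning(self):
        # The operator is warned until the override is removed.
        if self.overridden:
            return f"{self.tag}: signal in maintenance override"
        return None

signal = OverridableSignal("LT-101")   # hypothetical tag name
signal.set_override()
print(signal.operator_warning())       # warning active while overridden
signal.remove_override()
print(signal.operator_warning())       # None once the override is removed
```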
(Figure: HIMA Lifecycle Services – 3. Installation & Commissioning, 4. Validation, 5. Operation &
Maintenance, 6. Modification & Retrofit, 7. Decommissioning.)
www.hima.com
fscs@hima.com
Copyright © 2004-2012 HIMA Paul Hildebrandt GmbH + Co KG. Specifications are subject to change.
All rights reserved.
For a detailed list of all our subsidiaries and representatives, please visit our website:
www.hima.com/contact