
International Journal of Engineering Sciences, 2(7) July 2013, Pages: 266-277
TI Journals
ISSN 2306-6474
www.tijournals.com

Common Cause Failure Analysis (CCFA) of a Spacecraft Embedded Computing System

Osamu Saotome *1, Paulo Elias 2
1,2 Technology Institute for Aeronautics, ITA, Brazil.

ARTICLE INFO

Keywords:
Fault
Error
Integrity
Risk
Common-cause failure
Analysis
Embedded Computer
SEU

ABSTRACT

This paper highlights concerns regarding the effects of single event effects (SEEs) on a spacecraft embedded computing system by applying analysis and mitigation techniques for handling the SEE hazard caused by space radiation (e.g. cosmic rays and solar storms). The purpose is to demonstrate that common causes of system failures are, in many cases, underestimated in spacecraft mission risk assessment; consequently, the risks associated with external events are under-evaluated, which may result in erroneous risk quantification in the risk assessment process. The methodology used for performing common cause failure analysis of a spacecraft embedded computing system that performs a critical function is called risk tree analysis (RTA). The risk scenario considered in the analysis is that embedded computers may malfunction because of space radiation acting on the electronic hardware within them, leading to common cause failures at the spacecraft level. That scenario is motivated by the hostile environmental conditions under which the spacecraft must operate without mission losses, whether caused by loss of the spacecraft or loss of the communication link. From the launch phase until reaching Earth's orbit, the vehicle crosses the atmosphere and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation can cause single event effects in the electronic devices within onboard computers, producing malfunctions at the functional levels of the spacecraft. A case study is used to demonstrate the RTA method, and the analysis results are presented at the end of this paper.

© 2013 Int. j. eng. sci. All rights reserved for TI Journals.

1. Introduction

Critical systems are systems whose failures may endanger human life, lead to substantial economic losses, or cause extensive environmental damage. Spacecraft embedded computing systems combine electronic hardware and software and are capable of processing large amounts of data in a very short time. Although a space mission is considered critical from the program management viewpoint, most of the hardware components used in the spacecraft architecture design are COTS (commercial off-the-shelf), and fault tolerance cannot feasibly be added to such components by physical means because their architectures are not subject to change after release to the market. In this case, means to achieve system dependability, e.g. fault tolerance techniques, may be implemented on top of the hardware design by adding software (i.e. specific algorithms) at the application layer of these electronic hardware components. Such a technique is known as software-implemented hardware fault tolerance (SIHFT) and is widely used to improve the error detection and recovery capability of COTS electronic devices, making COTS usage feasible and reliable for space missions.
Spacecraft computing systems need dependable computer devices in order to improve the total system dependability. If fault/error tolerant devices are used in the computer design, the final result can be satisfactory from the dependability point of view.
Because of their design complexity and the risk-related analyses they require, embedded computing systems are subject to specific hazards which may cause multiple component failures from the same cause, usually called common cause event(s), that can defeat the system's redundancies [3]. Another characteristic of such a hazard is the physical allocation of the computers' internal devices; these devices are electronic hardware components susceptible to the high-energy particles released by cosmic rays or solar storms, which cause single event effects in the microelectronics of even the smallest electronic components of such computers. All sub-micron integrated electronic devices are susceptible to single event effects (SEEs) to some degree. The effects can range from transients causing logical errors, to upsets changing data, or to destructive single-event latch-up (SEL). From the launch phase until reaching Earth's orbit, the vehicle crosses the atmosphere and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation can cause single event effects in the electronic devices within onboard computers, producing malfunctions at the functional levels of the spacecraft. Figure 1 illustrates cosmic rays and a solar storm showering a spacecraft in orbit.

* Corresponding author.
Email address: osaotome@ita.br


Figure 1. Cosmic Rays and Solar Storms Showering the Space Vehicle
[from: ESA, http://www.esa.int/Our_Activities/Technology/Proba_Missions/Detecting_radiation]

SEE is a basic hardware issue: it occurs when a bit is flipped in hardware due to, among other causes, the effects of radiation on microelectronic circuits. SEEs may be non-destructive (typically transient errors that cause a temporary change of combinational logic, called Single Event Transients or SETs, or permanent errors that cause, for example, a change of a memory cell value, called Single Bit Upsets, Multiple Bit Upsets, Single Event Functional Interruptions or Single Event Latchups) or destructive (Single Event Burnouts, Single Event Gate Ruptures or Stuck Bits) [17].
Hardware can be damaged, as in the case of a burnout or gate rupture, but most often the failures are non-destructive. Single event upsets are the most common type of event [12], [17].
SEE is defined in [19] as a disturbance of an active electronic device (transistor/gate) caused by the energy deposited by the interaction with a single energetic particle. An event occurs when the ionization charge from the energy deposition exceeds the device's critical charge.
Failure characteristics include:
- Single or multiple bit flips,
- Single event functional interrupt,
- Single event transients,
- Errors in entire blocks, and/or
- Latch-up condition.

In space, energetic heavy ions passing through materials generate intense tracks of ionisation. If the ion passes through a sensitive part of a
semiconductor chip, for example parts of a "bit", the free charge generated in the track is often sufficient to flip the logic state of the bit.
This results in a single-event upset (SEU).
A SEU can also result from energetic protons or ions hitting the nucleus of an atom in a sensitive component location. The nuclear interaction can produce spallation, which is the splitting of the nucleus, the heavy debris from which carries away a sizeable portion of the initial particle's energy. The spallation products generate the ionisation which can flip the bit state [10], [11], and [12].
In the presence of SEUs, the failure rates of embedded computer components can increase by up to 100 times [23]; consequently, the production of erroneous data, especially within the CPU, becomes expected during the spacecraft mission.
A single event upset (SEU) is defined by [19] as a change of state in a memory or latch in a device induced by the energy deposited by an energetic particle. That hazard may lead to a system failure condition of Undetected Erroneous Data (UED) produced by the embedded computers and consumed by the end users of such data. This failure scenario may be catastrophic to the spacecraft and consequently cause the loss of vehicle (LOV). In this case, the system architectural design can be used as a mitigation means to reduce the impact of a SEU on the electronic devices, and thus on the spacecraft functions hosted on the computers. The residual risk modelled in this scenario is therefore the UED itself being used by the end users.

2. Common-Cause Failure Analysis (CCFA)

A common cause is an event or mechanism that can cause two or more failures (basic events) to occur simultaneously. The failures resulting from the common cause are called common cause failures (CCFs). Because common causes can induce the failure of multiple components, they have the potential to increase system failure probabilities. Thus, the elimination of common causes can appreciably improve system reliability. To eliminate common causes, analysts must be able to recognize the failure sources that are responsible for CCFs and implement specific solutions to deal with them. Table 1 lists examples of common causes that are frequently encountered, by type.
Table 1. Types of common-cause events

Mechanical: abnormally high or low temperature; abnormally high or low pressure; stress above design limits; impact; vibration.
Electrical: abnormally high voltage; abnormally high current; electromagnetic interference (EMI).
Chemical: corrosion; chemical reaction.
Other: earthquake; tornado; flood; lightning; fire; radiation; moisture; dust; design or production defect; test/maintenance/operation error.

There are, basically, four models for quantifying systems subject to common cause failures. Table 2 describes the existing models for performing common cause failure analysis. The Beta model is the only model that can consider combinations of more than four events in a CCF group; all other models can consider combinations of two, three, or four events in a CCF group.
Table 2. CCFA models

Alpha: This model represents the probability of failure of a specified number of items at the same time. For example, Alpha 2 is the probability that exactly two items fail at the same time; Alpha 3 is the probability that exactly three items fail at the same time.
Beta: This model is the most basic. It assumes that all components belonging to a CCF group fail when the common cause occurs. By definition, this model distinguishes between individual failures and CCFs, with the assumption that if the CCF occurs, all components fail simultaneously from the common cause. Multiple independent failures are neglected.
BFR (Binomial Failure Rate): This model is also known as a shock model.
MGL (Multiple Greek Letter): This model is a generalization of the Beta model.

Because the Alpha, BFR, and MGL models do not distinguish CCFs of order 4 or higher, the size of the input parameters for these models must be restricted to support these calculations:
- If the number of basic events in the CCF group is less than four, only the meaningful parameters are considered.
- If the number of basic events in the CCF group is more than four, the values of the other parameters required to calculate CCF combinations of order higher than four are assumed to be zero.
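As a concrete illustration of the simplest of these models, the sketch below splits a component's total unavailability into an independent part and a common-cause part using a beta factor, and evaluates a 1-out-of-2 redundant pair. The numeric values of Qt and beta are assumptions chosen only for this example and are not taken from the paper's data.

```python
# Minimal sketch of beta-factor CCF quantification (illustrative values only).
# Qt   : total unavailability of each basic event in the CCF group.
# beta : fraction of Qt attributed to the common cause shared by the whole group.

def beta_factor_split(Qt: float, beta: float) -> tuple[float, float]:
    """Return (independent unavailability, common-cause unavailability)."""
    q_ccf = beta * Qt          # all components of the group fail together
    q_ind = (1.0 - beta) * Qt  # each component fails independently
    return q_ind, q_ccf

def duplex_unavailability(Qt: float, beta: float) -> float:
    """Unavailability of a 1-out-of-2 redundant pair under the beta model."""
    q_ind, q_ccf = beta_factor_split(Qt, beta)
    # Either both channels fail independently, or the common cause takes out both.
    return q_ind ** 2 + q_ccf

if __name__ == "__main__":
    Qt, beta = 1.0e-3, 0.1     # assumed per-channel unavailability and CCF fraction
    print(duplex_unavailability(Qt, beta))  # ~1.0e-4, dominated by the common-cause term
```

Even with a modest beta, the common-cause term dominates the squared independent term, which is precisely why CCFs cannot be neglected in redundant architectures.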

2.1 The Proposed CCFA Methodology

Systems affected by common cause failures are systems in which two or more events have the potential of occurring due to the same cause. Some typical common causes include impact, vibration, pressure, grit, stress, temperature, radiation, high-intensity radio frequency, and so on. This article deals with an unmanned spacecraft that depends on an embedded computer architecture for its correct operation throughout the mission.
The proposed CCFA methodology is aimed at such a scenario, where more than four CCF combinations may need to be calculated. The method is based on a mathematical model combined with a graphical model called a risk tree, structured similarly to a fault tree, whose logical gates therefore combine not only component faults but also input events, either failures and/or hazards, that may influence, and often increase, the likelihood of system failure. The method is called Risk Tree Analysis (RTA) and is demonstrated as follows.
The following example illustrates how the mathematical model is applied.
Assume that there are four basic events belonging to the CCF group: A, B, C, and D. When the minimal cut sets for this fault tree are calculated, the following CCF events are automatically created:
AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD
For calculation purposes, each of the four original basic events (A, B, C, D) is replaced with an OR gate. The inputs to the OR gate include the individual basic event and the CCF events that contain that basic event. For example, basic event A is replaced by an OR gate with A (individual failure) and AB, AC, AD, ABC, ABD, ACD, and ABCD (CCF events) as inputs.
The following parameters are used to calculate CCF events:
Qt = the total unavailability of each basic event in the CCF group.
Qk = the unavailability of a CCF event of order k, that is, a CCF involving k components.
n = the number of basic events in the CCF group.
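The expansion step described above is mechanical, so a small sketch is given below. It enumerates the CCF events of order 2 to n for a group and lists, for each original basic event, the inputs of the OR gate that would replace it; the event names follow the A, B, C, D example above, and the helper names are ours, not the paper's.

```python
from itertools import combinations

def ccf_events(basic_events):
    """All common-cause events of order 2..n for a CCF group."""
    n = len(basic_events)
    events = []
    for k in range(2, n + 1):
        events += ["".join(c) for c in combinations(basic_events, k)]
    return events

def or_gate_inputs(basic_events):
    """For each basic event, the inputs of the OR gate replacing it:
    the individual failure plus every CCF event that contains it."""
    all_ccf = ccf_events(basic_events)
    return {e: [e] + [c for c in all_ccf if e in c] for e in basic_events}

if __name__ == "__main__":
    group = ["A", "B", "C", "D"]
    print(ccf_events(group))           # the 11 CCF events: AB, AC, ..., BCD, ABCD
    print(or_gate_inputs(group)["A"])  # ['A', 'AB', 'AC', 'AD', 'ABC', 'ABD', 'ACD', 'ABCD']
```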
2.2 Defining the System Architecture to be Analyzed
The system architecture under analysis is a generic embedded computing system which provides critical functions for the spacecraft. These functions are performed in software and are hosted on dual-redundant computers. To limit the analysis boundary it is necessary to define exactly the object under analysis, and then the specific system architecture to be analyzed in the risk assessment process.
In this work, two computers are used in the system architecture: Computer #1 and Computer #2. The basic components of each computer, shown in Figure 2, are:
a) a central processing unit (CPU);
b) a memory, comprising both read/write and read-only devices (commonly called RAM and ROM respectively);
c) a means of providing input and output (I/O), for example a keypad for input and a display for output.

In the microprocessor-based architecture the functions of the CPU are provided by a single very large scale integrated (VLSI)
microprocessor chip. This chip is equivalent to many thousands of individual transistors.
Semiconductor devices are also used to provide the read/write and read-only memory. Strictly speaking, both types of memory permit random access, since any item of data can be retrieved with equal ease regardless of its actual location within the memory. Despite this, the term RAM has become synonymous with semiconductor read/write memory.
The basic components of the system (CPU, RAM, ROM and I/O) are linked together using a multiple-wire connecting system known as a bus (see Figure 2). Three different buses are present:
(1) the address bus, used to specify memory locations;
(2) the data bus, on which data is transferred between devices; and
(3) the control bus, which provides timing and control signals throughout the system.

The number of individual lines present within the address bus and data bus depends upon the particular microprocessor employed. Signals
on all lines, no matter whether they are used for address, data, or control, can exist in only two basic states: logic 0 (low) or logic 1 (high).
Data and addresses are represented by binary numbers (a sequence of 1s and 0s) that appear respectively on the data and address bus.
Some basic microprocessors designed for control and instrumentation applications have an 8-bit data bus and a 16-bit address bus. More sophisticated processors can operate with as many as 64 or 128 bits at a time.
The largest binary number that can appear on an 8-bit data bus corresponds to the condition when all eight lines are at logic 1. Therefore the
largest value of data that can be present on the bus at any instant of time is equivalent to the binary number 11111111 (or 255). Similarly,
the highest address that can appear on a 16-bit address bus is 1111111111111111 (or 65,535). The full range of data values and addresses
for a simple microprocessor of this type is thus:
Data: from 00000000 to 11111111.
Address: from 0000000000000000 to 1111111111111111.
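These limits follow directly from the bus widths, as the short check below shows (illustrative only):

```python
# Largest unsigned values representable on the example buses.
data_max = 2 ** 8 - 1    # 8-bit data bus     -> 255   (binary 11111111)
addr_max = 2 ** 16 - 1   # 16-bit address bus -> 65535 (binary 1111111111111111)
print(data_max, addr_max)
```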

Finally, a locally generated clock signal provides a time reference for synchronizing the transfer of data within the system. The clock
usually consists of a high-frequency square wave pulse train derived from a quartz crystal.


[Figure 2. Embedded Computer Architecture: block diagram of the CPU, ROM, RAM, parallel and serial I/O, and clock interconnected by the address, data, and control buses]

The single computer shown in Figure 2 can host multiple software-based spacecraft functions, both critical and non-critical, which are performed by the hardware and software architectures within the computer. When it is used in the spacecraft we call this system an embedded computing system. Such computers are susceptible to SEEs because they fly at high altitudes (above 100,000 ft), where these events occur more frequently than at low or ground-level altitudes.
The proposed system architecture shown in Figure 3 is a dual-redundant computer system architecture interconnected with a comparator device which performs the error detection, correction, and alerting function.

[Figure 3. Dual-Redundant Embedded Computer System Architecture: two computer subsystems (Subsystem 1 and Subsystem 2), each with CPU, ROM, RAM, parallel and serial I/O, clock, and address/data/control buses, whose outputs feed a comparator connected to the central bus]

The reliability of the duplex system architecture can be written as follows:

R_duplex(t) = R_comp(t) · [R(t)² + 2·c·R(t)·(1 − R(t))]

MTTF_duplex = (1 + 2c) / (2λ)

where
c = coverage factor; it is the probability that a faulty processor will be correctly diagnosed, identified, and disconnected;
λ = failure rate of a single computer channel (assuming R(t) = e^(−λt));


MTTF = mean time to failure


The comparator unit shown in Figure 3 is an EDAC (error detection and correction) algorithm and it basically performs two functions:
(1) comparison of the two computers' outputs to detect incorrect results through differences between them, and
(2) correction of the detected erroneous data resulting from the computation.
The probability that the EDAC fails to detect and correct an error is on the order of 0.002 failures per hour of operation [18]. So, the coverage factor, c, is as follows:
c = (1 − probability of EDAC failure) per hour of operation = 1 − 0.002 = 0.998 (/h)
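To make the duplex expressions above concrete, the sketch below evaluates R_duplex(t) and MTTF_duplex for an assumed per-channel failure rate. The λ value is an assumption (roughly the per-computer rate that appears later in Table 3); only c = 0.998 comes from the EDAC figure just derived, and the comparator is treated as perfect for simplicity.

```python
import math

def r_duplex(t_hours: float, lam: float, c: float, r_comp: float = 1.0) -> float:
    """R_duplex(t) = R_comp(t) * [R(t)^2 + 2*c*R(t)*(1 - R(t))], with R(t) = exp(-lam*t)."""
    r = math.exp(-lam * t_hours)
    return r_comp * (r ** 2 + 2.0 * c * r * (1.0 - r))

def mttf_duplex(lam: float, c: float) -> float:
    """Mean time to failure of the duplex pair: (1 + 2c) / (2*lam)."""
    return (1.0 + 2.0 * c) / (2.0 * lam)

if __name__ == "__main__":
    lam = 0.27e-6   # assumed per-computer failure rate in failures/hour (~0.27 FPMH)
    c = 0.998       # coverage factor from the EDAC figure above
    print(r_duplex(720.0, lam, c))  # duplex reliability over a 720-hour mission
    print(mttf_duplex(lam, c))      # mean time to failure in hours
```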
Once the reliability of the duplex system architecture is modeled, the next step is modeling the risk.
2.3 The Risk Model
The risk model developed here considers the radiation effects as a specific hazard which may affect electronic hardware components within
the embedded computer.
The SEU occurrence is added to the computer system risk model shown in Figure 4.

SEU

SEU

CPU

SEU

Address Bus

Parallel
I/O

SEU

ROM

RAM

Data Bus

I/O

Erroneous
Data
Serial
I/O

Control Bus

Clock
Figure 4. Embedded Computer System Architecture (with the SEU hazard producing Erroneous Data)

The microprocessor central processing unit (CPU) forms the heart of any computer system and, consequently, its operation is crucial to the entire system. The primary function of the microprocessor is that of fetching, decoding, and executing instructions resident in memory. As such, it must be able to transfer data from external memory into its own internal registers and vice versa. Furthermore, it must operate predictably, distinguishing, for example, between an operation contained within an instruction and any accompanying addresses of read/write memory locations. In addition, various system housekeeping tasks need to be performed, including the ability to suspend normal processing in order to respond to an external device that needs attention. As the spacecraft operates in the space environment, specific electronic devices such as microprocessors and memories become susceptible to SEU effects that may adversely affect multiple different spacecraft functions, applications, and partitions [13], [14] hosted on such computers.
Since the spacecraft embedded computer system hosts mission-critical functions using shared resources such as electrical power, data processing, and memory, there is a potential for erroneous operation (or malfunction) induced by a SEU (either SBU or MBU) caused by external cosmic radiation. The system failure mechanism is driven by error propagation from one component to another, eventually causing an incorrect service at the computer system output, as illustrated in Figure 5.
Designing the appropriate level of redundancy into the embedded computers to assure system reliability, as well as providing means of fault/error management, should address the potential SEU hazard and its effects on system performance. The design implementation can either eliminate the hazard or mitigate it, depending on the available engineering resources, and should provide appropriate means for recovering the computers' functions in case of failure or malfunction.


[Figure 5. Error Propagation Model, adapted from [19], with radiation as the triggering fault source]

2.4 Risk Tree Analysis (RTA)

The initial RT is developed around basic independent failure events, which provides a first approximation of the cut sets and probability. Many component failure dependencies are not accounted for explicitly in this first-approximation RT model, resulting in an underestimation of the risk of the RT top-level event [20]. Once the SEU rate is quantified in the next sub-section, the RT model can be expanded to take into account the SEU event probability and its rate. Thus, the final RT model includes the identified CCF events around SEU in the space environment, as shown in Figure 6.

Figure 6. Dual Embedded Computers Systems Fault Tree Model (without SEU hazard)


Table 3. One Computer's Parts Failure Rates

Name           | Qty | Category           | Subcategory  | FR Type (Calculated, MIL-HDBK-217FN2) | Failure Rate (FPMH)
System         |     |                    |              |                                       | Σ = 0.269867
I/O Device     | 1   | Integrated Circuit | Linear       | Relex Prediction                      | 0.05
BUS Interface  | 1   | Integrated Circuit | Linear       | Relex Prediction                      | 0.05
Memory         | 1   | Integrated Circuit | Memory       | Relex Prediction                      | 0.005715
Clock          | 1   | Integrated Circuit | ASIC         | Relex Prediction                      | 0.00312
CPU            | 1   | Assembly           |              |                                       | Σ = 0.161033
Microprocessor | 1   | Integrated Circuit | VLSI CMOS    | Relex Prediction                      | 0.154319
Error Detector | 1   | Software           | Algorithm    | Field data [18]                       | 0.002
Interface      | 1   | Integrated Circuit | GaAs Digital | Relex Prediction                      | 0.006714

2.5 Quantifying the SEU rate

Instead of the many existing methods for quantifying the SEU rate in either the atmospheric environment or low/high orbit environments, in this paper we propose a different way to take the SEU hazard into account when calculating the effect of upsets on the failure rate of the affected components; in this case, the affected components are the microprocessor and memory devices based on VLSI technology.
According to [19], in a worst-case scenario SEU events can increase the failure rate of the electronic hardware on the order of 100× (or 10²); this is considered high from the risk assessment point of view because the system mission reliability could be strongly affected and the probability of a successful mission could fall below the acceptable level (in terms of probability calculation).
Table 4. One Computer's Parts Failure Rates (Updated)

Name           | Qty | Category           | Subcategory  | FR Type (Calculated, MIL-HDBK-217FN2) | Failure Rate (FPMH) | Updated FR with SEU rate (= FR × 100)
System         |     |                    |              |                                       | Σ = 0.269867        |
I/O Device     | 1   | Integrated Circuit | Linear       | Relex Prediction                      | 0.05                | 5
BUS Interface  | 1   | Integrated Circuit | Linear       | Relex Prediction                      | 0.05                | 5
Memory         | 1   | Integrated Circuit | Memory       | Relex Prediction                      | 0.005715            | 0.5715
Clock          | 1   | Integrated Circuit | ASIC         | Relex Prediction                      | 0.00312             | 0.312
CPU            | 1   | Assembly           |              |                                       | Σ = 0.161033        |
Microprocessor | 1   | Integrated Circuit | VLSI CMOS    | Relex Prediction                      | 0.154319            | 15
Error Detector | 1   | Software           | Algorithm    | Field data [18]                       | 0.002               | 0.002*
Interface      | 1   | Integrated Circuit | GaAs Digital | Relex Prediction                      | 0.006714            | 0.6714

* The EDAC probability of missed detection and correction is constant over time.

It is important to note that: (1) the EDAC error rate is a constant probability measured from field data [18]; and (2) the updated failure rate of the hardware components is based on the assumption that a SEU will produce erroneous data at the integrated circuit's output, either in the microprocessor or in the memory. The bit error considered here is a bit-flip that changes the data content; the observed effect of such a bit-flip may be either misleading information or corrupted data, which may then be processed by end users, so that the end effect on the spacecraft might be a malfunction or LOV. Note that both of these effects are catastrophic.
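The update step itself is a simple scaling of the hardware failure rates; a short sketch is shown below. The dictionary mirrors the entries of Table 3, the software EDAC rate is kept constant as noted above, and the 100× factor is the worst-case assumption taken from the text.

```python
# Component failure rates in FPMH (failures per million hours), from Table 3.
BASE_FR = {
    "I/O Device": 0.05,
    "BUS Interface": 0.05,
    "Memory": 0.005715,
    "Clock": 0.00312,
    "Microprocessor": 0.154319,
    "Interface": 0.006714,
}
EDAC_FR = 0.002          # software error detector: constant probability, not scaled
SEU_MULTIPLIER = 100.0   # worst-case SEU degradation factor

def updated_rates(base: dict, factor: float) -> dict:
    """Apply the SEU multiplier to every hardware component failure rate."""
    return {name: fr * factor for name, fr in base.items()}

if __name__ == "__main__":
    upd = updated_rates(BASE_FR, SEU_MULTIPLIER)
    print(round(sum(BASE_FR.values()), 6))  # ~0.269868 FPMH, the system total of Table 3
    print(round(sum(upd.values()), 4))      # ~26.9868 FPMH after the 100x update
```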

3. Case Study

This section presents an application of the proposed methodology to assess the space mission risk related to functional failures of the spacecraft. For an unmanned spacecraft, mission criticality is measured in terms of the loss of scientific data and expectations, and also of the financial budget spent on research and spacecraft construction (including infrastructure investment and supporting costs); no loss of human life is expected to occur because the spacecraft is unmanned.
The mission duration (from the launch phase to landing on Earth's surface) is 30 days, or 720 hours.
In the proposed system architecture (dual-redundant computers) shown in Figure 3, the redundant computers work in parallel and their outputs are connected to the comparator before being connected to the central bus. End users of the computers' data receive the computers' output data through the central bus connection; after the comparator step, the end users process the received data to produce the intended function at their outputs. In other cases, the end users can select only the first valid data and discard the unused data by a simpler logic, without requiring a new check between the computers' outputs and their inputs. The criticality of each data item produced by the computers depends on its applicability for producing the end users' functions, i.e. the same data produced by each computer could be used by either critical or non-critical systems connected to the spacecraft central bus.


Thus, considering the worst-case scenario, the computers' output data will always be considered critical data for the spacecraft mission. This assumption simplifies the risk analysis in terms of classification, although the analysis thereby becomes quite conservative in terms of calculation. This assumption is made before assessing the risks and is necessary for modelling the scenario.
Figure 7 illustrates the mission profile used for the case study.

SAFE ZONE

UNSAFE ZONE

LUNAR-ORBIT

Critical Point
Earths ORBIT

Figure 7. Space Mission Profile Illustration

If the spacecraft follows the correct path (shown in green in Figure 7) during the space crossing, the probability of accomplishing the mission will be high; but if its trajectory does not follow the predetermined path, accomplishing the mission becomes unlikely. Some risks are visible in this scenario; for example, if the spacecraft suffers excessive drag during the transition from Earth's orbit to lunar orbit, the critical point may not be correctly crossed and its trajectory will probably go wrong, resulting in a transition from the safe to the unsafe zone. Thus, if the spacecraft crosses the safe-zone boundary, it will be lost in space. Any error in its trajectory could be catastrophic for the mission, leading to a loss of vehicle (LOV). In this scenario, any malfunction of the embedded computers might affect the spacecraft navigation function, leading the spacecraft onto an incorrect path and consequently causing a LOV. So, a reliability requirement can be written as follows:
Requirement #1:
The embedded computing system shall be designed so that its reliability (or probability of success) must be at
least 0.98 (or 98%) per mission.
Note that the longer the mission duration, the lower the probability of success, because reliability is directly dependent on time; this is the time-dependent attribute of the spacecraft, where the exposure time of the spacecraft systems is the mission duration itself.
It is important to remember that in space no maintenance action is available to maintain the availability of the spacecraft systems, so the embedded computing system must be fault/error tolerant and highly reliable. Another requirement, related to recoverability, can therefore be written as follows:
Requirement #2:
The embedded computing system shall be designed so that any computer functional error must be detected
and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99
(or 99%) per mission.
3.1 Requirements Analysis
The analysis of spacecraft mission requirements is presented in Table 5.


Table 5. Requirements Analysis Table

Req #1: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.
Compliance Analysis: A reliability of 0.98 per mission can be re-written as: probability of failure = Q = 1 - R = 1 - 0.98 = 0.02.

Req #2: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.
Compliance Analysis: The probability that a faulty computer will be detected and recovered to its normal state shall be 0.99 per mission. It can be represented as the probability that the tolerable errors of any computer will be 1% of the total error events. The EDAC algorithm shall be designed to detect and correct CPU errors in a manner that satisfies this probability requirement.

Total System Reliability = System Reliability × Recovery Probability = 0.98 × 0.99 = 0.9702 per mission
Mission duration = 30 days = 720 hours
Thus, Mission Unreliability = Q = 1 - R = 1 - 0.9702 = 0.0298
When considering the space radiation effects on the CPU error rate caused by SEU, the CPU reliability might be degraded by bit-flips in the microprocessor and memory, leading to a data integrity issue. It is recommended to re-evaluate the specific risk, in this case the undetected erroneous data (UED) in the computers' electronic devices caused by single event upsets, to implement protection against that specific risk, and then to update the risk tree to calculate the top-event failure probability. It is expected that, after the EDAC implementation, the calculated system reliability will be compliant with the requirement. Figure 8 shows the RT model representing the system architecture of Figure 3.

Figure 8. Dual Embedded Computers Systems Fault Tree Model (with SEU hazard)
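The arithmetic of the requirement budget above (0.9702 total reliability, 0.0298 unreliability), and a comparison of a duplex top-event probability against the 0.02 budget of Requirement #1, can be reproduced with the sketch below. The per-computer failure rate after the SEU update and the beta split are assumed inputs of the same order of magnitude as Table 4, not the exact values fed to the paper's RT tool, so the printed result only illustrates the procedure rather than reproducing the 0.022549 figure reported in Table 6.

```python
import math

MISSION_HOURS = 720.0                  # 30-day mission

# Requirement-level numbers from the text above.
total_req_reliability = 0.98 * 0.99    # 0.9702
print(1.0 - total_req_reliability)     # 0.0298 mission unreliability budget

# Illustrative top-event evaluation for the duplex architecture (assumed inputs).
lam_total = 27.0e-6    # per-computer failure rate after the 100x SEU update (~27 FPMH)
beta = 0.1             # assumed fraction of that rate due to the shared SEU cause
coverage = 0.998       # comparator/EDAC coverage factor derived earlier

lam_ind = (1.0 - beta) * lam_total
lam_ccf = beta * lam_total
q_ind = 1.0 - math.exp(-lam_ind * MISSION_HOURS)   # independent channel failure
q_ccf = 1.0 - math.exp(-lam_ccf * MISSION_HOURS)   # both channels lost to the common cause

# Both channels fail independently, or a single failure escapes coverage, or the CCF hits.
q_top = q_ind ** 2 + 2.0 * (1.0 - coverage) * q_ind * (1.0 - q_ind) + q_ccf
print(q_top)           # compare against the 0.02 requirement budget above
```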

As can be noted, the top-event probability is on the order of 0.02, which is considered compliant with the specified reliability requirement.
The next section summarizes the achieved results and their compliance.


4. Results

Table 6 shows the results achieved in the case study, where the system requirements are demonstrated to be accomplished and their compliance is substantiated as appropriate.
Table 6. Results of case study

Req #1: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.
Compliance Analysis: A reliability of 0.98 per mission can be re-written as: probability of failure = Q = 1 - R = 1 - 0.98 = 0.02. Achieved result = 0.022549.
Compliant? Yes

Req #2: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.
Compliance Analysis: The probability that a faulty computer will be detected and recovered to its normal state shall be 0.99 per mission; i.e., the tolerable errors of any computer shall be 1% of the total error events. The EDAC algorithm is designed to detect and correct CPU errors; only one missed detection and correction is expected in 20 days [18] per EDAC unit, so 1.5 missed events are expected over the 30-day mission, which yields 0.0020833 per hour. Thus, the implemented EDAC's failure importance is << 1% of the total system failure rate. Achieved result << 1%.
Compliant? Yes
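The per-hour rate quoted in the compliance analysis is a simple unit conversion, which the two-line check below reproduces (illustrative only):

```python
missed_per_20_days = 1.0                       # expected missed EDAC detections in 20 days [18]
rate_per_hour = missed_per_20_days / (20 * 24)
print(rate_per_hour)                           # ~0.0020833 per hour, as used in Table 6
```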

5. Conclusion

Safety-critical systems embedded in spacecraft computers are a complex subject. Although such safety-critical systems have been in use for many years, the development of an embedded computing system which hosts spacecraft critical functions is still a relatively new and immature subject.
When applying the CCFA methodology, the specific risks must be considered and verified by a systematic process during the system architecture design and analysis. During architecture selection and verification, it is essential to include the effects of common cause failures in the risk tree analysis, since they may cause malfunctions and/or catastrophic failures. If common cause failures are not considered in the spacecraft system analysis, the results will probably be erroneous. In this context, the presented CCFA methodology, which combines RTA and probabilistic assessment, has demonstrated that the final result is satisfactory from the mission point of view. Such an approach has proved efficient in identifying and analyzing risks and in implementing corrective actions for the system. It is important to note that the analyst's background and skills are very important to correctly apply the CCFA method and evaluate the achieved results.
In this article, a simplified CCFA has been presented, and the only CCBE modelled was the Single Event Upset (SEU) potentially causing Undetected Erroneous Data (UED) within the embedded computers during the spacecraft mission.
To achieve compliance with the mission requirements, stated as the system requirements, the system architecture had to be designed in an incremental way, and the calculated system failure probability was then compared against the system requirements to demonstrate compliance. Although reliability requirements are established by mission board management, the accomplishment of these requirements is an engineering issue. Hence, the implemented system architecture, with the EDAC algorithm used to enhance the reliability attribute of the COTS hardware components, has proved satisfactory, and the proposed solution is useful not only for spacecraft computers but also for other embedded computers that perform critical functions.
Normally, in the overall risk assessment process, all the CCBEs must be identified and modelled in the RTA to perform a robust CCFA, in order to provide valid (realistic) results on hazards that may actually affect the redundant systems/subsystems and components.
To conclude, it is important to note that although this paper has focused on the space domain, the CCFA methodology can be successfully used in other contexts concerned with critical systems.


References
[1] Dyer, C., Rodgers, D.: Effects on Spacecraft & Aircraft Electronics. Space Department, DERA, Farnborough, Hampshire, UK. British Crown (1998).
[2] Kang, D., Han, S. H., and Park, J. H.: Common Cause Failure Analyses by Using the Decomposition Approach. Integrated Safety Assessment Center, KAERI, 1045 Daedeokdaero, Yuseong-Gu, Daejon, KOREA. Transactions, SMiRT 19, Toronto (2007).
[3] Ericson II, C. A.: Hazard Analysis Techniques for System Safety. Fredericksburg, Virginia, USA. Wiley-Interscience, pp. 397-421 (2005).
[4] Donaldson, J., Jenkins, J.: Systems Failures: An approach to understanding what can go wrong. In: European Software Day of Euromicro'00. ISBN 0-7695-0872-4 (2000).
[5] U.S. Nuclear Regulatory Commission, NUREG/CR-6268, Rev. 1, Common-Cause Failure (CCF) Database and Analysis System: Event Data Collection, Classification and Coding (2007).
[6] Wood, R. T.: Diversity Strategies to Mitigate Postulated Common Cause Failure Vulnerabilities. In: Seventh American Nuclear Society International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface Technologies, NPIC&HMIT 2010, Las Vegas, Nevada, November 7-11 (2010).
[7] Tang, Z., Dugan, J.: An Integrated Method for Incorporating Common Cause Failures in System Analysis. In: IEEE Reliability and Maintainability, 2004 Annual Symposium - RAMS.
[8] Balen, T., Leite, F., Kastensmidt, F., Lubaszewski, M.: A Self-Checking Scheme to Mitigate Single Event Upset Effects in SRAM-Based FPAAs. IEEE Transactions on Nuclear Science, Vol. 56, No. 4, Aug 2009. ISSN: 0018-9499.
[9] Dion, M., Dominik, L.: Incorporation of Atmospheric Neutron Single Event Effects Analysis into a System Safety Assessment. SAE Int. J. Aerosp. 4(2):619-632, 2011, doi: 10.4271/2011-01-2497.
[10] White, D.: Single event effects (SEEs) in FPGAs, ASICs, and processors. EE Times University, Design Article, January 12, 2012. USA.
[11] Dominik, L.: Atmospheric Radiation Testing. In: 2012 Annual NUFO (National User Facility Organization) Meeting.
[12] Normand, E.: Single Event Effects (SEE) on Avionics Systems. Boeing Radiation Effects Laboratory. August 29th, 2012.
[13] Radio Technical Commission for Aeronautics, RTCA DO-178C, Standard for Software Considerations in Airborne Systems and Equipment Certification, December 13, 2011.
[14] Amarendra, K., Rao, A.: Safety Critical Systems Analysis. Global Journal of Computer Science and Technology, Volume 11, Issue 21, Version 1.0, December 2011. Publisher: Global Journals Inc. (USA). Online ISSN: 0975-4172 & Print ISSN: 0975-4350.
[15] Di Leo, D., Ayatolahi, F., Sangchoolie, B., Karlsson, J., and Johansson, R.: On the Impact of Hardware Faults - An Investigation of the Relationship between Workload Inputs and Failure Mode Distributions. In: SAFECOMP 2012, LNCS 7612, pp. 198-209, Springer-Verlag Berlin Heidelberg, 2012.
[16] Tarasyuk, A., Pereverzeva, I., Troubitsyna, E., Latvala, T., and Nummila, L.: Formal Development and Assessment of a Reconfigurable On-board Satellite System. In: F. Ortmeier and P. Daniel (Eds.): SAFECOMP 2012, LNCS 7612, pp. 210-222, Springer-Verlag Berlin Heidelberg, 2012.
[17] Pintard, L., Seguin, C., and Blanquart, J.-P.: Which Automata for Which Safety Assessment Step of Satellite FDIR? In: SAFECOMP 2012, LNCS 7612, pp. 235-246, Springer-Verlag Berlin Heidelberg, 2012.
[18] Yenier, U.: Fault Tolerant Computing In Space Environment And Software Implemented Hardware Fault Tolerance Techniques. Department of Computer Engineering, Bosphorus University, Istanbul (2002).
[19] Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.
[20] Elias, P., Saotome, O.: System Architecture-based Design Methodology for Monitoring the Ground-based Augmentation System: Category I Integrity Risk. J. Aerosp. Technol. Manag., São José dos Campos, Vol. 4, No. 2, pp. 205-218, Apr.-Jun., 2012.
[21] NASA Probabilistic Risk Assessment (PRA) Guide. 2002.
[22] Turner, J.V., Fragola, J. R.: Re-inventing How NASA uses Safety and Reliability Analysis to Develop the Next Generation of Human Spacecraft. 2010. Available at: http://www.valador.com/wp-content/uploads/2010/10/Re-Inventing-How-NASA-Uses-Safety-and-Reliability-Analysis-to-Develop-the-Next-Generation-of-Human-Spacecraft.pdf. Last accessed on April 28th, 2013.
[23] Vranish, K.: The Growing Impact of Atmospheric Radiation Effects on Semiconductor Devices and the Associated Impact on Avionics Suppliers. KVA Engineering Company. FAA Conference, 2007.
