Capability of single hardware channel for automotive safety applications according to ISO 26262

Juergen Braun
University of Applied Sciences Regensburg, LaS³
Seybothstraße 2, D-93053 Regensburg
Juergen.Braun@hs-regensburg.de

Christian Miedl, Dirk Geyer
AVL Software & Functions GmbH
Im Gewerbepark B27, D-93059 Regensburg
{Christian.Miedl;Dirk.Geyer}@avl.com

Juergen Mottok
University of Applied Sciences Regensburg, LaS³
Seybothstraße 2, D-93053 Regensburg
Juergen.Mottok@hs-regensburg.de

Mark Minas
Universität der Bundeswehr München
D-85577 Neubiberg
Mark.Minas@unibw.de

Abstract - There is no doubt that electromobility will be the future. All-electric vehicles were already available on the market in 2011, and 14 new vehicles will be commercially available in 2012. Because automotive applications are governed by the safety requirements of ISO 26262, the use of new technologies in increasingly complex systems nowadays demands a growing understanding of fail-safe and fault-tolerant design. The safety of electric vehicles has the highest priority because it contributes to customer confidence and thereby ensures further growth of the electromobility market. Therefore, in series production, redundant hardware concepts such as dual-core microcontrollers running in lock-step mode are used to meet the ASIL D requirements of ISO 26262. In this paper, redundant hardware concepts and coded processing, both of which are listed in the current ISO 26262 standard as recommended safety measures, are taken into account.

Key Words - fail-safe, fault-tolerant, failure rates, failure probability, safety measure, mechanism, ISO 26262, electromobility, series system, parallel system, coded processing, Safely Embedded Software, SES, backward recovery, reliability, proof test, diversity

I. INTRODUCTION

Upcoming innovations in the area of electric vehicles as well as active safety and driver assistance systems require achieving the requirements of the corresponding ASIL (automotive safety integrity level). Thus it is becoming mandatory for many automotive applications to implement one of the two highest safety levels, ASIL C or ASIL D. At the same time, the implementation of these safety functions must be done in the most efficient and cost-effective way. Duplicating the hardware seems to be the easiest way to do this, but it is obviously not the most cost-effective one. Redundancy does not only mean the duplication of systems: by definition, it includes all additional units and methods that are implemented for error detection and error avoidance. Examples of such methods are integrated test units as well as error-detecting codes and coded processing, as with SES [1]. During my research studies on coded processing, I have been inspired by the work [2] for the following considerations.

II. FAILURE RATES OF MECHANICAL AND ELECTRONIC SYSTEMS

Fault-tolerant measures require a minimum level of understanding of probability calculations. For example, probability calculations are needed to compute the failure probability of systems or functions. It is important to know that there is a major difference between the failure probability of a mechanical and of an electronic system.

Mechanical systems have a very high probability of working at the beginning. However, over time they tend to fail more often due to several factors, such as wear, fatigue of the material, and aging (see Fig. 1).

Figure 1. Failure probability of mechanical systems

Electronic systems, in contrast, behave as if they could fail at nearly any time with the same failure rate λ. There is an exception for the early failures and for failures due to aging, which also exist in electronic systems (see Fig. 2). Early failures can be "burn-in" failures or software failures that are a consequence of not yet resolved teething troubles. Consequently, early failures are often based on errors that were already present at delivery of the product. The late failures include the aging of components such as flash memories (e.g. reaching the maximum number of erase cycles), but also errors due to contamination and oxidation. Based on these facts, the traditional Weibull distribution is unable to model the complete lifetime of electronic systems. Therefore the extended Weibull distribution has been developed, which matches such systems with a bathtub-shaped failure rate function. The characteristics of the extended Weibull distribution are that at the beginning the Weibull distribution dominates the gradient of the curve, at the end the function increases rapidly according to a normal distribution, and in between the failure rate is almost constant, which corresponds to exponentially distributed random failures.

For simplification, an ideal model is assumed for electronic systems [3]. In this ideal model the early and late failures are neglected (see Fig. 2). This results in a constant failure rate:

λ(t) = const. = λ    (1)

Figure 2. Failure probability of electronic systems (line) and simplified failure probability of electronic systems (dashed line)

In an ideal system the failure rate would be λ = 0, which means that no subsystem would ever fail. The lower the value of λ, the lower is the probability of a failure in a given time interval. For real systems with λ > 0, every subsystem will fail over an infinite period of time.

This is the basis for making statements about redundancy principles: the failure probability of redundant systems can be determined by calculations based on certain function parameters. It should be stressed that at the beginning the failure rate of mechanical systems is significantly smaller than the failure rate of electronic systems.

A. Series system
The series system works only if all the individual subsystems work. Consequently, the series system fails if one or more of the subsystems fail. Therefore the reliability RS of a series system is the product of the individual reliabilities Ri of the subsystems (with i = [1;n]):

RS(t) = ∏ Ri(t), i = 1..n = e^(-λ1 t) · e^(-λ2 t) · ... · e^(-λn t) = e^(-(λ1 + λ2 + ... + λn) t)    (2)

For the consideration of the reliability of series systems it is assumed that the failure rates of the individual subsystems are equal:

λ = λ1 = λ2 = ... = λi = ... = λn    (3)

For the representation of series systems, the value λ = 0.3 FIT (failures in time, i.e. failures per 10^9 hours) is used, which is a realistic failure rate for subsystems like RAM or clock. Illustrated are the curves for series systems from n = 2 to n = 4, compared to the single system n = 1 (see Fig. 3). The x-axis is the time axis, divided by the normalized time

τ = (1/λ) · 10^9    (4)

Figure 3. Reliability of series systems with series connection of one to four subsystems (line: n=1, short dashed: n=2, long dashed: n=3, dotted: n=4)

It can be seen in the diagram that the reliability is the higher, the fewer subsystems are working in series. The reliability RS decreases rapidly with increasing n. The more subsystems are connected in series, the larger is the failure probability of the entire system:

FS(t) = 1 - RS(t)    (5)

B. Parallel system
In contrast to the series system, the failure probability FP of the parallel system with diagnosis is the product of the individual failure probabilities Fi of the subsystems. This results in the formula for the reliability of a parallel system RP(t), where Fi(t) is the complement of Ri(t) and must be taken into account as (1 - e^(-λi t)):

RP(t) = 1 - ∏ Fi(t), i = 1..n = 1 - (1 - e^(-λ1 t)) · (1 - e^(-λ2 t)) · ... · (1 - e^(-λn t))    (6)

For the consideration of the reliability of parallel systems it is again assumed that the failure rates of the individual subsystems are equal (cp. (3)). For the representation of parallel systems, the value λ = 0.3 FIT is used as well (see Fig. 4). The x-axis is again the time axis divided by the normalized time (cp. (4)).

It can be seen in Fig. 4 that the higher the number of parallel subsystems, the higher is the reliability. However, it must also be recognized that the reliability improvement gained by adding a further subsystem decreases with the number of subsystems already working in parallel. At some point the disadvantages, such as the system costs, predominate in comparison to the minor increase of the system reliability.
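The series and parallel formulas can be checked numerically. The following minimal Python sketch (our own illustration, not part of the original paper; the rate of 0.3 FIT and the normalized time of eq. (4) are taken from the text, equal subsystem rates per eq. (3) are assumed) evaluates RS and RP for one to four subsystems:

```python
import math

LAMBDA = 0.3e-9  # 0.3 FIT = 0.3 failures per 10^9 hours, as an hourly rate

def r_series(t, lam=LAMBDA, n=1):
    """Eq. (2) with equal rates (3): R_S(t) = exp(-n*lam*t)."""
    return math.exp(-n * lam * t)

def r_parallel(t, lam=LAMBDA, n=1):
    """Eq. (6) with equal rates: R_P(t) = 1 - (1 - exp(-lam*t))^n."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n

tau = 1.0 / LAMBDA  # eq. (4): normalized time, here expressed in hours
t = 0.5 * tau       # evaluate the curves at t/tau = 0.5

for n in range(1, 5):
    print(n, r_series(t, n=n), r_parallel(t, n=n))
```

Running the loop reproduces the qualitative behavior of Figs. 3 and 4: RS falls with every additional series subsystem, while RP rises with every additional parallel subsystem, but with diminishing gains.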

Figure 4. Reliability of parallel systems with parallel connection of one to four subsystems (line: n=1, short dashed: n=2, long dashed: n=3, dotted: n=4) and diagnosis

III. CODED PROCESSING

The concept of coded processing is capable of reducing redundancy in hardware by adding diverse redundancy in software, e.g. by specific coding of data and instructions. Hardware and software coding can be combined using approaches like the Vital Coded Processor [6]. Consequently, besides the actual safety-critical control program, other programs can also run on the same hardware. Thus it is possible to specifically protect only the safety-critical parts of the control program. Coded processing enables the verification of safety properties and fulfills the condition of single fault detection [1]. Coded processing does not constrict capabilities but rather supplements multi-version software fault tolerance techniques like N-version programming, consensus recovery block techniques, or N self-checking programming [1]. In this paper the Safely Embedded Software approach is described in more detail.

A. Safely Embedded Software (SES)
The SES approach generates the safety of the overall system at the application software level. SES is based on the (AN+B)-code of the Coded Monoprocessor, transforming original integer data xf into diverse coded data xc [1]. Coded data fulfills the relation

xc = A · xf + Bx + D    (7)

where xc, xf ∈ ℤ; A ∈ ℕ+; Bx ∈ ℕ; D ∈ ℕ0; and Bx + D < A.

The prime number A determines safety characteristics like the Hamming distance and the residual error probability P = 1/A of the code [1]. A has to be prime because, in case of a sequence of i faulty operations with constant offset f, the final offset will be i · f. If A is not a prime number, several combinations of i and f may produce multiples of A. If a prime number is used as A, this offset is only a multiple of A if i or f itself is divisible by A. The same holds for the multiplication of two faulty operands. Additionally, deterministic criteria like the above-mentioned Hamming distance and the arithmetic distance must be considered when choosing an adequate prime number. Other functional characteristics, like the necessary bit-field size and the resulting handling of overflow, are also determined by the value of A. The simple transformation xc = A · xf is illustrated in Fig. 5.

Figure 5. Simple transformation for xc = A · xf.

To ensure the correct memory addresses of the variables, any user-defined specific number or the memory address of the variable itself can be used as static signature Bx [1]. The dynamic signature D ensures that the variable is used in the correct task cycle.

The two software channels (original data and coded data) can be verified, for example, at the end of each task cycle before the calculated values are output. Therefore the instructions are coded in such a way that either a comparator can verify the diverse channel results for the condition zc = A · zf + Bz + D, or the coded channel can be checked directly by the verification condition (zc - Bz - D) mod A = 0 (cf. (7)).
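Relation (7) and the verification condition can be made concrete with a small sketch. This is a hedged illustration, not the authors' implementation: the prime A = 97, the signatures, and the data values are chosen arbitrarily, and the addition rule shown (with Bz = Bx + By) is one common way to keep (AN+B)-coded sums inside the code:

```python
A = 97   # prime multiplier; residual error probability P = 1/A
D = 3    # dynamic signature of the current task cycle (illustrative value)

def encode(xf, bx, d=D):
    """Eq. (7): transform original data xf into coded data xc = A*xf + Bx + D."""
    assert 0 <= bx + d < A
    return A * xf + bx + d

def check(zc, bz, d=D):
    """Verification condition on the coded channel: (zc - Bz - D) mod A == 0."""
    return (zc - bz - d) % A == 0

def decode(zc, bz, d=D):
    return (zc - bz - d) // A

BX, BY, BZ = 5, 11, 16   # static signatures, e.g. derived from variable addresses

# Coded addition: with Bz = Bx + By, subtracting D once keeps the sum coded.
xf, yf = 20, 22
xc, yc = encode(xf, BX), encode(yf, BY)
zc = xc + yc - D         # zc = A*(xf + yf) + (Bx + By) + D
print(check(zc, BZ), decode(zc, BZ))   # True 42
```

A corrupted operand, e.g. a flipped bit in xc, makes the verification condition fail, which is exactly the single fault detection property claimed above.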
B. Backward Recovery
Backward recovery is a fault reaction technique that prevents failures from occurring in continuously running systems. At certain points in the program flow, the currently used variables and the corresponding program counter are stored on a stack. If an error is detected, backward recovery can restore the program counter and the variables from the stack. Consequently, the execution of the function can be restarted at this point in the program flow. The main aim of SES in the case of backward recovery is to determine the time at which to perform backward recovery. In this way the reliability of the system can be increased.

With backward recovery, errors caused by dynamic faults can be resolved. Dynamic faults are, for example, bit errors due to electromagnetic interference. The case where backward recovery fails to fulfill its purpose is a static failure. With static failures, like a faulty ALU (arithmetic logic unit), backward recovery will not solve the problem, because SES will detect the error again. As a consequence, the fault reaction will enter a safe state, no increase of the reliability is possible, and the availability of the system will decrease.
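The scheme above can be sketched in a few lines. This is a simplified illustration under our own assumptions (one checkpoint per step, a bounded retry count, and a `check` callback standing in for the SES verification), not the authors' implementation:

```python
def run_with_recovery(steps, state, check, max_retries=3):
    """Run a list of state -> state functions with backward recovery.

    Before each step the current variables are saved as a checkpoint.
    If `check` (standing in for the SES verification) fails, the step is
    restarted from the checkpoint; a persistent failure is treated as a
    static fault and the system enters a safe state.
    """
    for step in steps:
        checkpoint = dict(state)              # recovery point on the "stack"
        for _ in range(max_retries + 1):
            state = step(dict(checkpoint))    # (re)start from the checkpoint
            if check(state):
                break                         # dynamic fault resolved
        else:
            raise RuntimeError("static fault: entering safe state")
    return state

# Demo: a transient bit flip (e.g. due to EMI) on the first execution only.
attempts = {"n": 0}

def incr(s):
    attempts["n"] += 1
    s["x"] += 1
    if attempts["n"] == 1:
        s["x"] ^= 8                           # simulated dynamic fault
    return s

result = run_with_recovery([incr], {"x": 1}, check=lambda s: s["x"] == 2)
print(result["x"], attempts["n"])             # 2 2
```

The demo shows both behaviors described in the text: the transient fault is repaired by re-executing from the checkpoint, while a fault that survived every retry would end in the safe state.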

IV. INCREASING THE RELIABILITY OF SERIES AND PARALLEL SYSTEMS

Safety-critical embedded systems in the current automotive industry are safeguarded using redundant hardware. The target is that their complexity and costs should not increase. To support this, an approach for constructing a single-channel fail-safe microcontroller is proposed, which is able to support a variety of safety functions.

The easiest way to improve the reliability of a system is to add additional parallel redundancy. Another common possibility is to improve the quality of the used components, but this is usually the most expensive option. Of course, a combination of both is possible to reach the required reliability. A further way to improve the reliability by software is coded processing with backward recovery. Besides this, with coded processing it is also possible to reach higher ASIL levels.

Assume a series system with λS1 = 0.3 FIT. This series system can be compared to a series system consisting of an equal number of components, but with implemented coded processing and backward recovery (see Fig. 6). We assume that hereby the failure probability has been improved to one third, with λS2 = 0.1 FIT.

Figure 6. Comparison of the reliability of a series system with coded processing (line: S2 = 0.1 FIT) and a series system without coded processing (dashed: S1 = 0.3 FIT).

It can be clearly seen that the long-term behavior becomes much better. Such measures, which contribute to the reduction of the failure probability, increase the reliability for large working times. Apart from that, the quality improvement meets higher safety standards: according to ISO 26262 part 5 [4], with one-channel coded processing the maximum diagnostic coverage that can be considered achievable for processing units is "high".

Now the improved system with coded processing and backward recovery (λS2 = 0.1 FIT) is compared to a parallel system (λP1 = 0.3 FIT) in which each channel is identical to the initial series system (see Fig. 7). For the parallel system with λP1, a large improvement of the reliability can be seen near the zero point and for small working times, up to the intersection with the curve of the series system (λS2 = 0.1 FIT). The time derivative of the reliability function with λP1 around t/τ = 0 yields an exit angle of 90 degrees, which means that the curve initially runs parallel to the time axis (see Fig. 7). For very small working times t/τ << 1 (e.g. around zero), the reliability is thus almost constant. Consequently, such systems have a poor long-term behavior, though of course better than the initial series system (λS1 = 0.3 FIT), and a very high reliability for short working periods. A typical field of application would be, for example, systems where a proof test [5] is performed frequently, so that the system is regularly brought into an "as new" condition, or as close to this state as practically possible.

Figure 7. Comparison of the reliability of a series system with coded processing (line: S2 = 0.1 FIT) and a parallel system (dashed: P1 = 0.3 FIT).

Another possible step to improve the failure probability would be to use an additional parallel system with implemented coded processing. In this case, the curve of the parallel system with λP1 = 0.3 FIT shows the parallel connection of the initial system. The curve with λP2 = 0.1 FIT shows the same parallel system, but with implemented coded processing and backward recovery (see Fig. 8). As shown in the figure, this parallel system with coded processing and backward recovery increases the reliability of the system at any time. However, the disadvantage of both parallel solutions is the higher system cost caused by the increased redundancy in hardware.

Figure 8. Comparison of the reliability of a series system with coded processing (line: S2 = 0.1 FIT), a parallel system without coded processing (short dashed: P1 = 0.3 FIT) and a parallel system with coded processing (long dashed: P2 = 0.1 FIT).
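The crossover behavior visible in Fig. 7 can be reproduced numerically. This sketch is our own illustration, not the computation behind the original figures: it uses the rates from the text (S2 = 0.1 FIT, P1 = 0.3 FIT), assumes a two-channel parallel system, and measures time in units of 10^9 hours so that λ·t is dimensionless:

```python
import math

def r_series(t, lam):
    """Single-channel series system, cf. eq. (2)."""
    return math.exp(-lam * t)

def r_parallel2(t, lam):
    """Two identical parallel channels with diagnosis, cf. eq. (6)."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** 2

S2, P1 = 0.1, 0.3   # failure rates in FIT; t below is in units of 10^9 h

# Short working time: the parallel system P1 is more reliable.
print(r_parallel2(0.1, P1) > r_series(0.1, S2))   # True

# Long working time: the coded series system S2 is more reliable.
print(r_series(20.0, S2) > r_parallel2(20.0, P1))  # True
```

The two checks bracket the intersection of the curves: hardware redundancy wins for short missions or frequently proof-tested systems, while the lower rate achieved by coded processing wins in the long run, as argued above.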

V. CONCLUSION

As seen before, the parallel system without coded processing is advantageous for systems with short working times or systems with frequent proof tests. In contrast, systems with coded processing and backward recovery have a better long-term behavior. Furthermore, coded processing can be used if an improvement of the failure probability cannot be reached with further redundancy.

There is one major disadvantage in improving the quality of the used components with the aim of reducing the failure probability: components of higher quality may simply not be available, so that such systems could not be realized at all. In contrast, with coded processing and backward recovery the reliability of each system can be improved independently of the quality of the components used. In a further step, with coded processing and backward recovery it would be imaginable to use COTS (components off the shelf) in order to save costs while the reliability of the system remains identical. Another possibility offered by coded processing and backward recovery would be to run the systems at lower voltage. Microcontrollers used at lower voltages tend to show bit failures, for example in the RAM. As a countermeasure, coded processing can detect these bit failures reliably, and backward recovery can restart the function at the checkpoint. With this approach, the power consumption of the control unit can be reduced, and it is probable that the operating life can also be increased.
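The claim that coded processing reliably detects such bit failures can be made concrete: for the (AN+B)-code of eq. (7) with an odd prime A, a single bit flip changes the coded value by ±2^k, and a power of two is never a multiple of an odd prime, so the verification condition always fails. The parameters below are illustrative only:

```python
A, BX, D = 97, 5, 3   # illustrative code parameters, cf. eq. (7)

def encode(xf):
    return A * xf + BX + D

def check(xc):
    return (xc - BX - D) % A == 0

# A single bit flip changes xc by +/- 2^k; since A is an odd prime,
# 2^k mod A is never 0, so every single-bit RAM fault is detected.
xc = encode(1234)
detected = all(not check(xc ^ (1 << k)) for k in range(32))
print(detected)   # True
```

This is exactly the single fault detection property of the code: the fault is found by the check, and backward recovery then restores the last valid state.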
To reach the targets of ISO 26262 with single-channel microcontrollers, there must be an independent shutdown control decision logic which realizes an independent shutdown control path. This independence has to be proven according to the measures of ISO 26262, so that common cause failures are avoided. For example, a failure of the power supply must not influence the functionality of the shutdown path. Furthermore, with single-channel microcontrollers the standard methods, like comparing two sensor values with each other and providing end-to-end protection over the communication bus, must be implemented. All these normatively recommended safety measures ensure that, in addition to the coded processing approach, the system enters a safe state within the system safety time when required safety-relevant data does not arrive or a system failure occurs. Even a dual-core lock-step architecture is unable to reach ASIL D without additional measures. The reason for this is the likelihood of common-cause failures that affect both CPUs and therefore cannot be detected afterwards by a comparison. Consequently, even with the dual-core lock-step, an intelligent watchdog must be used which independently verifies the calculations in a simplified way. Beyond that, the intelligent watchdog must be able to bring the system into a safe state via a shutdown path to fulfill the requirements of a fail-safe system.

In contrast to the state of the art in automotive embedded systems, with SES safety is either a result of diverse software channels or of one channel of specifically coded software. Due to the high pressure on costs, fault tolerance in hardware should only be implemented at the points where it is really needed. Consequently, redundant solutions in software with coded processing should be aspired to as far as possible.

So far, according to ISO 26262, the techniques of coded processing have been applied only to relatively simple circuits and are not widespread. However, future developments are not excluded [4].

Future steps will be to determine the MTBF (mean time between failures) and the diagnostic coverage with Markov models, in order to validate the proposed measures.
ACKNOWLEDGMENT

This work is supported by the research project SEMO of LaS³ in cooperation with AVL Software and Functions GmbH and the Universität der Bundeswehr München. For more information visit www.las3.de.
REFERENCES

[1] J. Mottok, F. Schiller, T. Zeitler, "Embedded Systems - Theory and Design Methodology", Chapter 2: "Safely Embedded Software for State Machines in Automotive Applications", InTech, March 2012.
[2] B. Weis, "Grundlagen und Einteilungsmöglichkeiten für fehlertolerante Systeme" (Fundamentals and classification options for fault-tolerant systems), Technische Universität Wien, 2002.
[3] J. Börcsök, "Electronic Safety Systems - Hardware Concepts, Models, and Calculations", Hüthig, 2004.
[4] ISO/FDIS 26262: "Road vehicles - Functional safety", 2011.
[5] J. Braun, J. Mottok, C. Miedl, D. Geyer, and M. Minas, "Overview and Evaluation of Fail-Safe and Fault-Tolerant Mechanisms for the Electromobility according to ISO 26262", in Proceedings of the 1st International Electromobility Congress, Prague, p. 5, May 2011.
[6] P. Forin, "Vital Coded Microprocessor Principles and Application for Various Transit Systems", IFAC Control, Computers, Communications, pp. 79-84, 1989.
