ISRN LUTFD2/TFRT--5782--SE
Safety-Critical Communication
in Avionics
Dan Gunnarsson
Abstract
The aircraft of today use electrical fly-by-wire systems for manoeuvring. These safety-critical distributed
systems are called flight control systems and place high requirements on the communication networks that
interconnect the parts of the systems. Reliability, predictability, flexibility, low weight and cost are important
factors that all need to be taken into consideration when designing a safety-critical communication system.
In this thesis certification issues, requirements in avionics, fault management, protocols and topologies for
safety-critical communication systems in avionics are discussed and investigated. The protocols investigated
are TTP/C, FlexRay and AFDX; MIL-STD-1553 is used as a reference protocol and analogue point-to-point
as a reference architecture. The protocols are described and evaluated regarding features such as services,
maturity, supported physical layers and topologies. Pros and cons of each protocol are then illustrated by a
theoretical implementation of a flight control system that uses each protocol for the highly critical
communication between sensors, actuators and flight computers. The results show that, from a theoretical
point of view, TTP/C could be used as a replacement for a point-to-point flight control system. However,
there are a number of issues regarding the physical layer that need to be examined.
Finally, a TTP/C cluster has been implemented and basic functionality tests have been conducted. The plan was
to perform tests on delays, start-up time and reintegration time, but the time needed to acquire the proper
hardware for these tests exceeded the time frame of the thesis work.
More advanced testing will be continued here at Saab beyond the time frame of this thesis.
The report may be ordered from the Department of Automatic Control or borrowed through: University Library, Box 3, SE-221 00 Lund, Sweden, Fax +46 46
222 42 43.
Acknowledgements
First of all I want to thank my advisor at Saab, Kristina Forsberg. Your friendly attitude, ideas
and extensive knowledge in the area of safety-critical systems have made my work interesting
and my time here at Saab enjoyable. Without your help this work would not have been
possible!
I want to thank my friends and work colleagues at Saab: Stellan Nordenbro and Åke
Cederbom. Other people at Saab who have given me useful advice are Jonas Nordqvist, Jonas
Dahlqvist, Håkan Forsberg and Anita Karlsson. Per-Olof Bergman has been very helpful
with the numerous modifications and adapters that have been manufactured in order to make a
complete setup in the laboratory. Jan-Olof Nilsson has been very helpful in providing the
excellent photographs used in the report and presentation.
Special gratitude goes to Thomas Mörth, who first negotiated the contact that gave me
the opportunity to write this thesis in cooperation with Saab.
I also want to thank my advisor at LTH, Anton Cervin, for giving useful comments on my
work and for taking care of the administrative parts of the thesis at the university.
Thanks to Saab for sponsoring the TTP Seminar and Workshop at TTTech in Vienna and for
employing me!
Table of contents
1 INTRODUCTION ....................................................................................................................................... 9
1.1 BACKGROUND ...................................................................................................................................... 9
1.2 SCOPE AND GOALS .............................................................................................................................. 10
1.3 THESIS OUTLINE.................................................................................................................................. 10
2 AVIONIC STANDARDS AND REGULATIONS .................................................................................. 12
3 REQUIREMENTS ON COMMUNICATION NETWORKS IN AVIONICS ..................................... 15
3.1 FUNCTIONAL REQUIREMENTS ............................................................................................................. 15
3.2 ENVIRONMENTAL REQUIREMENTS ...................................................................................................... 15
3.3 TEST REQUIREMENTS .......................................................................................................................... 16
3.4 SAFETY REQUIREMENTS...................................................................................................................... 16
3.5 BROADCAST BUS REQUIREMENTS ....................................................................................................... 17
4 FAULT MANAGEMENT ........................................................................................................................ 18
4.1 FAULT TYPES ...................................................................................................................................... 18
4.1.1 Design faults.................................................................................................................................. 18
4.1.2 Hardware and software faults....................................................................................................... 19
4.1.3 Common mode faults..................................................................................................................... 19
4.1.4 Specific fault types......................................................................................................................... 19
4.1.5 SEU and MBU ............................................................................................................................... 21
1 Introduction
This thesis has been carried out at Saab Avitronics in Jönköping with the overall goal to gain
knowledge about safety-critical communication systems. In particular, physical
implementation issues are addressed via practical experience. In this work the safety-critical
communication is exemplified with communication within a flight control system.
Fly-by-wire systems are becoming increasingly common in civil transport aircraft due to the
economic and technological benefit that the technology provides. These systems are
composed of two major components: the flight control laws, which govern the aircraft's
handling characteristics, and the flight control architecture, or the hardware, which is used to
implement the control laws. This thesis addresses only the latter.
1.1 Background
Over the years new aircraft have brought with them new opportunities to introduce new
technologies. The demands for increased functionality conflict with the demand for lower cost;
hence, the challenge avionics engineers face today is to build systems that meet this
requirement for high functionality and less maintenance at reduced cost.
Fly-by-wire systems have followed this development, exploiting the benefits of the new
technology. These systems use electrical signalling to control the control surfaces with the aid
of flight control computers that contain control laws. The fact that a large number of
subsystems in an aircraft use microprocessors has led to the development of a variety of
digital data buses. In an integrated architecture the flight control system takes advantage of
these subsystems to perform its tasks.
Most flight control systems are traditionally designed with a central computer that is
connected to sensors and actuators using point-to-point connections. Traditional point-to-
point communication systems are very reliable since they use a federated architecture where
the whole system is physically isolated from other systems. Physically separated subsystems
make the system robust against fault propagation, local disturbances, fire, etc.
In order to decrease weight and increase flexibility, an integrated system with broadcast bus
communication could be an alternative to the traditional point-to-point system.
The bus communication topology decreases weight but introduces new concerns, such as
possible failure modes, which emphasize the need for fault containment techniques, a structured
medium access method and communication scheduling. These issues might be handled by a
communication protocol; hence, the choice of protocol is a key decision that strongly
influences the design.
There are both Time-Triggered (TT) and Event-Triggered (ET) protocols that could be used
for broadcast bus communication. Communication in a safety-critical application, such as
flight control, must be predictable and analysable. Hence, TT communication rules out ET
communication, since it guarantees bounded delays and does not suffer the risk of collisions as
ET does. This thesis concentrates on protocols using TT medium access, comparing the chosen
protocols and investigating problems involved in implementing them in a flight control system.
Problem statement:
The main question that this thesis aims to answer is whether TT broadcast bus communication
(exemplified by TTP/C) is suitable for usage in safety-critical communication system in
avionics (exemplified by a flight control system).
The first two parts give the reader a basic understanding of the concepts and terminology, and a
brief introduction to protocols that are currently used or are considered for use in avionics.
Chapter 4 illustrates different fault types and the importance of fault tolerance and error
detection.
Chapter 5 describes protocols used in safety-critical communication. MIL-STD-1553, AFDX
and FlexRay are described in brief, for comparison, and TTP/C in detail.
Chapter 6 gives an introduction to network topologies with focus on safety-critical
communication.
Chapter 7 contains a conceptual study of communication architectures based on the protocols
described in Chapter 5.
Chapters 8-12 describe the experiments conducted, analyse and discuss the results, and give
the conclusions of the study.
The authorities that perform the final certification of civil aircraft are the FAA in the United States and
EASA (formerly known as JAA) in Europe. The rules and regulations stated by these
authorities are almost identical. However, the assessment can vary between the two. FAR and
JAR are regulation documents authored by the FAA and EASA, respectively. They contain rules and
regulations for how an aircraft should function in order to pass certification and be allowed to
fly.
To aid the certification there are guideline documents that specify how design,
implementation and usage of hardware and software should be done to provide a satisfactory level
of reliability (pictured in Figure 2.1). These guidelines are international standards and have to
be followed in order to manufacture aircraft. The guidelines contain detailed requirements on
how every part of the systems treated should function and behave under certain conditions.
Figure 2.1: Certification guidance documents covering system, safety, hardware and software processes.
Figure copied from [10]
Systems discussed in this thesis are safety-critical systems of level A and B (see Table 2.1). The
focus is on communication within these systems, and the relevant certification guidelines are:
• ARP4754 [10] discusses the certification aspects of highly-integrated or complex systems
installed on aircraft, defining requirements for the overall aircraft operating environment
and functions. Highly-integrated system refers to systems that perform multiple aircraft-
level functions. Complex refers to systems whose safety cannot be shown solely by test and
whose logic is difficult to comprehend without the aid of analytical tools. ARP4754
addresses the total life cycle of systems that implement aircraft functions. It excludes
information of detailed system, software and hardware design processes.
• ARP4761 [16] describes the safety-assessment process, which includes requirements generation
and verification. The process provides a methodology to evaluate aircraft functions and to
determine associated hazards.
• RTCA/DO-178B [11] deals with software life cycles. It specifies how to manage the design
process and how to prove that the output meets the requirements. It is notable that
RTCA/DO-248B is a clarification document for DO-178B, which gives a hint of the complexity
of this process.
• RTCA/DO-254 [12] deals with hardware design assurance for developing complex
hardware for safety-critical applications (see Figure 2.1).
• For environmental requirements, the guidance document is RTCA/DO-160D, Environmental
Conditions and Test Procedures for Airborne Equipment.
To classify the criticality of a part of a system five development assurance levels (see Table
2.1) have been introduced.
Table 2.1: System development assurance levels for civil aircraft [9] [11] [12]

Level A — failure rate λ < 10E-9 failures/hour — Catastrophic:
Failure conditions that would prevent continued safe flight and landing.

Level B — 10E-9 < λ < 10E-7 — Hazardous / Severe-Major:
Failure conditions that would reduce the capability of the aircraft or the ability of the flight
crew to cope with adverse operating conditions to the extent that there would be: a large
reduction in safety margins or functional capabilities, physical distress or higher workload
such that the flight crew could not be relied on to perform their tasks accurately or
completely, or adverse effects on occupants including serious or potentially fatal injuries to a
small number of those occupants.

Level C — 10E-7 < λ < 10E-5 — Major:
Failure conditions which would reduce the capability of the aircraft or the ability of the crew
to cope with adverse operating conditions to the extent that there would be, for example, a
significant reduction in safety margins or functional capabilities, a significant increase in
crew workload or in conditions impairing crew efficiency, or discomfort to occupants,
possibly including injuries.

Level D — 10E-5 < λ < 1 — Minor:
Failure conditions which would not significantly reduce aircraft safety, and which would
involve crew actions that are well within their capabilities. Minor failure conditions may
include, for example, a slight reduction in safety margins or functional capabilities, a slight
increase in crew workload, such as routine flight plan changes, or some inconvenience to
occupants.

Level E — any failure rate — No safety effect:
Failure conditions that do not affect the operational capability of the aircraft or increase crew
workload.
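The failure-rate boundaries of Table 2.1 can be sketched as a small classification function. This is an illustrative sketch only: the function name is an invention, the handling of rates exactly on a boundary is an assumption, and level E is folded into the final branch since it has no failure-rate bound.

```python
def assurance_level(failure_rate):
    """Map a failure rate (failures/hour) to a development assurance
    level following Table 2.1. Boundary handling is an assumption."""
    if failure_rate < 1e-9:
        return "A"  # Catastrophic conditions must be this improbable
    elif failure_rate < 1e-7:
        return "B"  # Hazardous / Severe-Major
    elif failure_rate < 1e-5:
        return "C"  # Major
    else:
        return "D"  # Minor (level E, "no safety effect", has no bound)

print(assurance_level(5e-10))  # A
print(assurance_level(1e-6))   # C
```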
3 Requirements on communication networks in avionics
Interface requirements include the physical system and interconnections between subsystems.
For the systems discussed in this thesis many functional requirements, such as usability, are of
no importance, since the purpose is to evaluate time-triggered broadcast buses. Focus is put
on the technology itself, what constraints and features it has, and whether it is suitable
for use in airborne systems or not. Some aspects of functional requirements are treated in
Chapters 7 and 10.
Tests are conducted on all parts of a system. Documentation of the tests is crucial so that a second
party, i.e. the certification authorities, can reproduce and verify the test results.
For each test, the required inputs, the actions required and the expected results with tolerances should
be specified. Test result data should at least contain the version of the test and the item being tested,
the versions of tools and test equipment, the test results, deviations from expected values (if any) and
conclusions stating the success or failure of the testing process.
Tests are often associated with the development phase of a system. Equally important are
built-in self-tests that are carried out every time the system is powered up, and continuous
self-tests during operation to detect and handle errors.
In general, when a new technique is introduced in the aviation industry a backup system is
required until the technique is proven reliable; this is often the case even when using an
already proven technique. The possibility of design faults (described in Section 4.1.1) must be
assessed by service history, safety specific analysis or diversity in hard- and software.
The most common fault hypothesis is that no single point of failure is allowed to affect the
performance of the system. Techniques to achieve this are to use redundancy both in software
and hardware. Fault containment zones are also important to keep errors from propagating in
the system.
A basic problem with a communication bus is that when nodes are connected to each other,
a short circuit could prevent communication or destroy nodes connected to the bus. The
situation is slightly different if a central hub or star coupler is used in a star topology, where
the star can filter out errors. However, if the star coupler breaks, all communication will be
down.
The vicinity problem is another aspect that must be considered. In case of a fire a whole area
of the aircraft might be destroyed. Even if the system has been designed with the most critical
part replicated to tolerate a single point of failure this might destroy all replicas. Hence, it
would be necessary to put the replicated nodes in different zones of the aircraft to create
physical independence.
4 Fault Management
There are two approaches to achieve reliability in a system. Fault avoidance aims at
preventing faults from occurring in the first place. This approach is implemented both in the
design phase and in the planning of service and maintenance, where components are replaced
at certain time intervals to avoid faults. The second approach is to design the system with
fault tolerance mechanisms, such as redundancy, so that if one part fails another part can resume its
function. The change-over in the system can either use a mechanism where the fault is recognized
and some defined action is taken, or the fault can be masked using replicated hardware.
Section 4.1 describes fault types that occur in electrical systems. Mechanisms that are suitable
to handle the specific fault types are briefly discussed. Section 4.2 describes fault tolerance
and how to achieve it using error detection and fault tolerance mechanisms, and Section 4.3
summarizes fault management of the different fault types.
This section gives a short description of these different faults and their appearance, which is
essential in order to choose appropriate and efficient fault tolerance mechanisms.
Mechanisms to detect or avoid design faults are design reviews, rigid development processes,
simulations, use of formal methods and testing (independent of design).
Hardware design faults can be mitigated through “service history” i.e. the complex hardware
device can be proven to be free from design faults if it has been used successfully earlier or by
safety specific analysis (e.g. element analysis, formal methods). Hardware design faults can
be tolerated by hardware diversity. However software design faults are quite different and are
best addressed with fault prevention. Software diversity adds cost and complexity and is not
sufficient for tolerating “bugs” [19] .
Hardware faults cause errors in the software, e.g. "stuck-high or stuck-low" faults in
interfaces or memory cells cause data faults. These faults are tolerated using data redundancy
(there is more than one source of the information). Detection and localisation are done
through monitoring and/or self-tests.
Diversity is a way to avoid some common mode faults. In practice, redundant system functions
are implemented using different software and different hardware. This, however, is a very
expensive and strenuous approach, since components have different life cycles; hence service
and replacement might be hard to coordinate.
A Byzantine fault is when a component behaves in an arbitrary way. It might even send
different information to different components.
According to [4] , [13] for a system to exhibit a Byzantine failure there must be a system-level
requirement for consensus. If there is no consensus requirement, a Byzantine fault will not
result in a Byzantine failure. A class of systems that exhibit this requirement strongly are
time-triggered systems where the failure of the global clock will lead to system failure.
This also applies to most asynchronous approaches as well since a coordinated system action
will require consensus. Redundancy, which is widely used in safety-critical systems, is nearly
impossible to create without consensus.
One possible Byzantine fault is when a digital signal is in between the voltage thresholds for a
"0" and a "1" and may be interpreted differently by receiving nodes, see Figure 4.2. This kind
of signal is called a 1/2 and can be the source of a fault that propagates through several
parts of a system, since most systems allow all voltages that are within the range specified
for the system.
Figure 4.2: Gate transfer function with 1/2 area defined. Figure copied from [4]
Slightly-off-specification (SOS) faults are faults where components deviate from their
specification in e.g. jitter or voltage range. SOS faults can appear as Byzantine faults. An
example of an SOS fault is a corruption in a node's time base that leads it to send
messages at periods slightly outside specification, slightly too early or too late.
Failure modes like clique formation might then occur if receiving nodes with a somewhat
fast or slow clock (but within specification) accept these messages while other correct
nodes do not, which could create disagreement on whether to accept the sender as
functional or disregard it as dysfunctional.
These kinds of faults are very hard to detect and sometimes hard to account for in design. Creating
fault containment regions in the design is, however, the only way to prevent errors from
propagating in an uncontrolled way.
Babbling idiot faults refer to a node that is sending/talking in an uncontrolled way. This is
particularly dangerous in a bus topology, where such a fault is a single point of failure for the
bus. These faults are prevented with bus guardians.
Timing faults appear when the timing requirements are not met. In a communication network
this can be due to bottlenecks or jamming in systems using heuristic scheduling techniques, or
due to insufficient timeslots using TDMA scheduling. Fault prevention (analysis, simulations,
testing, experience, etc.) is the only way to address these faults.
Design considerations include: a) limiting the use of RAM (EPROM is preferred), b) assessing the
risk of registers in microprocessors, programmable circuits, etc. For memory protection,
error detection and correction (EDAC) codes can be used. New memories are also being
developed that use distributed storage of data, where an MBU only causes one bit to flip per
word; hence common ECC protection (an error-correcting code for one bit) is sufficient to
tolerate both SEU and MBU.
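As a concrete illustration of the one-bit ECC mentioned above, the classic Hamming(7,4) code corrects any single flipped bit in a 7-bit codeword. This is a minimal sketch for illustration, not the coding scheme of any particular avionics memory device.

```python
def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into a 7-bit Hamming codeword
    (positions 1..7 hold p1 p2 d1 p3 d2 d3 d4)."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based error position, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode(1, 0, 1, 1)
codeword[4] ^= 1                      # simulate an SEU: flip one bit
print(hamming74_decode(codeword))     # [1, 0, 1, 1] - data recovered
```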
4.2.1 Redundancy
Static redundancy
In static redundancy (also called active replication) the application is executed at N redundant
nodes in parallel and a majority vote is performed to prevent faulty data from propagating to other
parts of the system (see Figure 4.3).
Dynamic redundancy
In dynamic redundancy (also called passive replication) some nodes are active and some are
used as stand-by spares, which are activated in case of failure. The stand-by spares can either
be cold (unpowered) or hot (powered). A shadow node is hot but does not deliver anything to
the bus except for in case of failure.
A hot spare generally needs less time to replace the faulty node than a cold spare. However, the
cold spare generally has a lower failure rate.
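The majority vote of static redundancy can be sketched as follows. This is a minimal illustration assuming the replicas deliver exactly comparable values; real systems must also handle inexact replica agreement.

```python
from collections import Counter

def majority_vote(replica_outputs):
    """Return the value delivered by a majority of the redundant nodes,
    masking a minority of faulty replicas; raise if no majority exists."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count > len(replica_outputs) // 2:
        return value
    raise ValueError("no majority - the fault cannot be masked")

# Triple modular redundancy: one faulty node out of three is masked.
print(majority_vote([42, 42, 17]))  # 42
```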
4.2.2 Recovery
Roll-back recovery
Roll-back recovery is the simplest scheme, where calculations are allowed to be re-executed,
using additional time, when a failure is detected.
Self-tests
Components are checked by running a program with known input and comparing the output
with a known result.
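A power-up self-test of the kind just described can be sketched like this; the component under test and the test vectors are hypothetical.

```python
def self_test(component, test_vectors):
    """Run the component on known inputs and compare the outputs with
    known-good results; True means the component passed."""
    return all(component(x) == expected for x, expected in test_vectors)

# Hypothetical component: a sensor scaling function.
def scale(raw):
    return raw * 0.5

print(self_test(scale, [(0, 0.0), (2, 1.0), (10, 5.0)]))  # True
```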
A reliable communication system is designed using several of the mechanisms described here.
For hardware faults, the single fault hypothesis duplicates the network; design fault mitigation
(for a broadcast bus) adds a backup network (hardware and software diversity); coding (CRC)
is used for error detection of messages; and bus guardians prevent babbling idiot faults. If the
nodes are FCUs (see below), SOS and Byzantine faults cannot occur.
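Message error detection by CRC, as mentioned above, can be sketched with a bitwise CRC-8. The generator polynomial here is an arbitrary illustrative choice; TTP/C specifies its own, longer CRC polynomial.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 over a byte string (polynomial x^8 + x^2 + x + 1)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

frame = b"sensor-frame"
checksum = crc8(frame)           # sender appends this to the message
corrupted = b"sensor-frbme"      # a burst error within one byte
print(crc8(frame) == checksum)      # True  - intact frame accepted
print(crc8(corrupted) == checksum)  # False - corruption detected
```

Any burst error no longer than the CRC width (8 bits here) is guaranteed to be detected, which is why CRCs suit serial communication where bit errors tend to cluster.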
A fault containment unit (FCU) is created with the purpose of preventing faults from propagating
to other parts of the system after occurring. The main goal is to minimize dependence
between parts of a system so that a faulty component cannot affect the operation of other FCUs. A
system design using FCUs has the additional benefit that it simplifies the safety assessment
work.
5 Protocols
Three time-triggered protocols are included here. TTP/C is chosen since it is the most
mature time-triggered protocol available today (2006) addressing fault tolerance and
predictability. AFDX is a protocol that is in consideration for safety-critical applications and
is chosen because it could, in some cases, be an alternative to TTP/C. FlexRay is included
since it is likely to become the next big standard in the automotive industry and can also use
both bus and star topology. MIL-STD-1553 is widely used in both military and civil aircraft,
e.g. in Gripen. This protocol is included for comparison. If a digital bus architecture
shall replace a traditional point-to-point architecture it must be proven to be as reliable as the
former.
The choice of protocol is, of course, an application-dependent process. Protocols that
communicate over a joint medium, such as a bus, have to coordinate the communication to
avoid collisions. In a point-to-point system every communication link is dedicated to the
communication between two nodes, and collisions are not a problem. The medium access method is
one of the most critical parts of a communication protocol. An access attempt can either be
triggered by an event or scheduled in advance.
An event-triggered protocol does not guarantee deterministic behaviour, since at high loads
more collisions will occur, which increases delays. It is possible to analyze a specific
schedule to guarantee e.g. response times. A protocol that uses time-triggered (TT) medium
access provides deterministic behaviour, since all nodes know at all times which node is next
to transmit. Because of its deterministic behaviour and periodic nature, this is well suited for
safety-critical tasks like control loops that periodically read sensors and update actuators.
TDMA is the medium access method used by TTP/C, FlexRay and MIL-STD-1553. It assigns
all nodes a time window to transmit according to a predefined scheme (MEDL in TTP/C).
Within the specified time window a node is granted exclusive access to transmit on the bus.
The communication schedule is cyclic and divided into TDMA rounds where a node is
allowed to transmit once. A number of TDMA rounds form a cluster cycle that is repeated
continuously. One trade-off when using time-triggered communication is the loss of efficiency
when nodes only send sporadic messages but are still assigned a time window. The effect
of this can be reduced by using different schemes in different modes of operation, e.g. take-off,
flight and landing in an aircraft. Some protocols, e.g. FlexRay, can combine static TT frames
with dynamic ET frames for sporadic messages.
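The cyclic TDMA scheme described above can be sketched as follows. The node names, the slot length, and the single-slot-per-node round are illustrative assumptions, not values from any specific protocol.

```python
# One TDMA round: every node gets exactly one fixed transmission slot,
# in an order defined at design time (the MEDL in TTP/C terms).
nodes = ["sensor-1", "sensor-2", "flight-computer", "actuator"]
slot_length_us = 250
round_length_us = slot_length_us * len(nodes)   # 1000 us per round

def transmitting_node(time_us):
    """Which node owns the bus at a given time. Every node can compute
    this locally from the schedule, so collisions cannot occur."""
    slot = (time_us % round_length_us) // slot_length_us
    return nodes[slot]

print(transmitting_node(0))     # sensor-1
print(transmitting_node(300))   # sensor-2
print(transmitting_node(1100))  # sensor-1 (the round repeats)
```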
This chapter describes protocols that could be interesting for safety-critical applications in
avionics. Sections 5.1-5.4 give a brief description of each protocol.
5.1 MIL-STD-1553
The aircraft internal time division command/response multiplex data bus is a military standard
with the designation MIL-STD-1553B. This bus was published in 1978 and is one of the first data
communication buses to send digital data between parts of a system over a common set of
wires (a bus). Some of its applications are in the F-16 and the AH-64A Apache helicopter. It
is also used in satellites, space shuttles and the International Space Station.
MIL-STD-1553 defines a redundant serial communication bus that interconnects nodes on a
network. The medium is twisted pair and the maximum length of the bus is not defined; it is,
however, recommended to test an implementation before deploying it.
The medium access is TDMA and a Bus Controller (BC) controls access to the bus. The BC
contains a clock and commands nodes when to communicate, which removes the requirement
for a global clock. The BC is replicated, since a malfunctioning BC would mean that there
could be no communication on the bus.
The communication is normally not configured for redundant communication on both buses
but rather with the secondary bus on hot backup in case of failure such as babbling idiot
failure on the primary bus. One of the drawbacks of MIL-STD-1553 is that it is limited to 1
Mbit/s. There are implementations extended to 10 Mbit/s or faster, but these require star or
hub coupling [14].
5.2 AFDX
Avionics Full-Duplex Switched Ethernet (AFDX) is a trademark of Airbus and was
developed for the A380 passenger plane. It is a standard that defines the electrical and
protocol specifications for the exchange of data between avionic subsystems using IEEE
802.3 (100 Base-TX) for the communication architecture. AFDX has been derived from
Ethernet adding deterministic timing and redundancy management to the widely used
protocol.
The deterministic communication is achieved by virtual links (VLs) that specify virtual
connections between parts of the system through the shared 100 Mbit/s physical link.
Since queues at switches might introduce jitter and message latency, there is a
requirement that the delay from sender to receiver must be less than 500 μs, which does not
include jitter at switches and the receiver [14]. Messages are numbered by the sender and
checked at the receiver to ensure that the order of packets is correct.
The protocol does not support a bus topology, since the physical layer requires switches. The
routing information is contained in tables in the switches. A redundant set of communication
links and switches is required, and messages are sent redundantly on both channels. The
message that arrives first is used. AFDX has a slightly different area of application than
FlexRay and TTP/C, namely less critical applications with higher requirements on bandwidth,
but it is an interesting upcoming technology.
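The "first arrival wins" handling of the two redundant channels can be sketched on the receiving side as follows. This is a deliberately simplified illustration using the sender-assigned sequence numbers mentioned above, not the full ARINC 664 integrity-checking and redundancy-management algorithm.

```python
class RedundancyManager:
    """Receiving end of a redundant AFDX-style link: each frame carries
    a sender-assigned sequence number and is sent on both channels; the
    first copy of each number is delivered, the duplicate is dropped."""
    def __init__(self):
        self.seen = set()

    def receive(self, seq, payload):
        if seq in self.seen:
            return None               # duplicate from the other channel
        self.seen.add(seq)
        return payload                # first arrival wins

rx = RedundancyManager()
print(rx.receive(1, "cmd-A"))  # cmd-A (first copy, e.g. via channel A)
print(rx.receive(1, "cmd-A"))  # None  (redundant copy via channel B)
print(rx.receive(2, "cmd-B"))  # cmd-B
```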
5.3 FlexRay
According to [14] the FlexRay protocol is specifically designed to address the needs of a
dependable automotive network for applications like drive-by-wire, brake-by-wire, and power
train control. It is designed to support communication over single or redundant
communication channels. It includes synchronous frames and asynchronous communication
frames in a single communication cycle. The synchronous communication frames are
transmitted during the static segment of a communication cycle. All slots are the same length
and are repeated in the same order every communication cycle. Each node is provided one or
more slots whose position in the order is determined at design time. Every node interface is
provided only with the information concerning its time to send messages in this segment and
must count slots on each communication channel. After this segment, the dynamic segment
begins with the time divided into minislots. At the beginning of each minislot there is the
opportunity to send a message, if one is sent the minislot expands into the message frame. If a
message is not sent the minislot elapses as a short idle period. Messages are arbitrated in this
segment by sending the message with the lowest message ID. It is not required that messages
are sent over both communication channels when a redundant channel exists.
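The minislot arbitration of the dynamic segment described above can be sketched as follows. This is a simplified single-channel model under assumed time units; the function name, segment length and frame length are illustrative, not taken from the FlexRay specification.

```python
# Simplified sketch of FlexRay dynamic-segment arbitration: the minislot
# counter steps through frame IDs in ascending order, so the pending
# message with the lowest ID wins the bus first.

def dynamic_segment(pending, segment_len, frame_len=5):
    """pending: {frame_id: payload}; returns the list of transmitted IDs."""
    sent, t, slot_id = [], 0, 1
    while t < segment_len:
        if slot_id in pending:
            if t + frame_len > segment_len:
                break  # frame no longer fits in this communication cycle
            sent.append(slot_id)   # the minislot expands into a full frame
            t += frame_len
        else:
            t += 1  # empty minislot elapses as a short idle period
        slot_id += 1
    return sent

# IDs 2 and 7 are pending: the lower ID is arbitrated onto the bus first.
assert dynamic_segment({7: "a", 2: "b"}, segment_len=20) == [2, 7]
```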
The current version of FlexRay lacks some of the fault-tolerance mechanisms that TTP/C
features. No membership service is provided to detect faulty nodes. There is no bus
guardian specification currently published and no published fault hypothesis.
The FlexRay consortium, consisting of many major automotive companies, has indicated it
has no current interest in any field of application other than the automotive industry. The
hardware that has been developed is only available to the consortium members and cannot be
purchased by non-members. The protocol and physical layer specification was not publicly
available until recently.
However, FlexRay is an interesting protocol because it is considered to be the next big
standard, as CAN is today, in the automotive industry. This means that many benefits can be
drawn from the large volumes that enable large-scale development and testing and make tools
and test equipment available at low cost.
5.4 TTP/C
Figure 5.1: Cluster using TTP/C in a bus topology. Figure copied from [1]
The Time-Triggered Protocol (TTP) was developed by Hermann Kopetz and colleagues at the
Vienna University of Technology and is commercially developed by TTTech.
The protocol is a real-time communication protocol for the interconnection of nodes in
distributed fault-tolerant real-time systems. TTP/C is designed to meet the strong
requirements of safety, availability and composability in the fields of aerospace and
automotive electronics [1], [7]. For a typical bus layout when using TTP/C, see Figure 5.1.
TTP/C can be implemented using a bus topology (see Figure 6.2), star topology (see Figure
6.3) or a combination of the two [1].
The communication in TTP/C is redundant on two buses. Each communication interface has a
bus guardian that prevents it from transmitting on the bus outside the predefined time-
window. A faulty node that transmits on the bus outside of the defined time-window is called
a babbling idiot. The protocol has a fault-tolerant time synchronization mechanism to
establish a global time base and a membership service to notify all nodes in the cluster of
which nodes are operational and whether any failure has been detected.
The protocol is masterless, which allows communication to continue on the bus even if some
nodes have failed.
TTP/C follows the single-point-of-failure fault hypothesis, meaning that it should tolerate a
failure in any single part of the communication system without degradation of performance. The
term TTA (the Time-Triggered Architecture) is commonly used together with TTP/C. TTA is
a concept where tasks are executed synchronously to the global time in TTP/C [3].
5.4.1 Membership
The membership service informs all nodes in a cluster about the operational state of each node
with a latency of one TDMA round. A node is operational if it has updated its life-sign
in the membership vector within the last TDMA round. The requirements for updating the
membership are that the controller is operating and synchronized with the rest of the cluster.
If a node fails to update its life-sign it is considered non-operational. It then stays
synchronized to receive frames but does not send.
The membership vector is sent by each node in every slot and is checked by all nodes to
maintain an updated view of the cluster. The membership vector is also used for implicit
acknowledgement. Implicit acknowledgement means that the receiver does not send an
explicit acknowledgement that it has received data correctly. Instead, the sender checks its
own life-sign bit in the membership vector that the receiver transmits in its next transmission.
If the sender is flagged as a correct node the transmission was correct; if not, the
transmission failed. In this scenario the sender (A) might be the faulty node, but since nodes
initially always consider themselves to be correct, the sender will consider the receiver (B) to
be faulty. However, if the next node to transmit (C) tells A that B was correct, sender A
sets its life-sign bit to zero until the problem is resolved and the sender can be reintegrated.
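The implicit-acknowledgement decision described above, seen from sender A, can be sketched as follows. The boolean inputs stand for the membership bits A reads out of the next two frames; the function and verdict names are illustrative, not from the TTP/C specification.

```python
# Sketch of implicit acknowledgement from the point of view of sender A:
# A's verdict is derived from B's and C's membership vectors.

def implicit_ack(b_says_a_ok, c_says_b_ok):
    """Return A's verdict after transmitting: 'ok', 'B faulty' or 'A faulty'."""
    if b_says_a_ok:
        return "ok"            # B's membership vector acknowledges A's frame
    if not c_says_b_ok:
        # A initially assumes itself correct, so a disagreeing B is blamed.
        return "B faulty"
    # C confirms B: A must be the faulty one and clears its own life-sign
    # until it can be reintegrated.
    return "A faulty"

assert implicit_ack(True, True) == "ok"
assert implicit_ack(False, False) == "B faulty"
assert implicit_ack(False, True) == "A faulty"
```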
Figure 5.2: Visualisation of a MEDL, showing the identification section and the round slot
section for two cluster modes; in Mode 2 four TDMA rounds form a cluster cycle.
A MEDL contains a few key elements that determine the communication on the bus (see
Figure 5.2). The schedule parameters describe the basic communication behaviour of the
node and are necessary to start up or integrate a cluster.
The identification section contains data used to check that the MEDL is compatible
with the cluster. The round slot section holds information about the rounds used in the
different operating modes. The example in Figure 5.2 is a visualisation of a MEDL containing
two modes of operation: Mode 1 with two identical TDMA rounds and Mode 2 with four
TDMA rounds that form a cluster cycle. In Mode 2, nodes 1 and 2 are assigned slots in every
round, nodes 3 and 4 transmit every second round, and nodes 5 and 6 only once every cluster
cycle. This scheduling is suitable for control loops where some values and commands are
needed more frequently than others. The empty slots are used to send I-frames with
synchronization information etc.
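The Mode 2 cluster cycle described above can be sketched in code. The slot layout and the use of None for empty I-frame slots are illustrative assumptions about the MEDL contents, not the tool-generated format.

```python
# Sketch of the Mode 2 cluster cycle: nodes 1 and 2 send every TDMA
# round, nodes 3 and 4 every second round, and nodes 5 and 6 once per
# cluster cycle; remaining slots stay empty for I-frames.

def build_cluster_cycle(rounds=4):
    cycle = []
    for r in range(rounds):
        round_slots = [1, 2]           # nodes 1 and 2 send every round
        round_slots.append(3 + r % 2)  # nodes 3 and 4 alternate rounds
        if r == 0:
            round_slots.append(5)      # node 5 once per cluster cycle
        elif r == rounds - 1:
            round_slots.append(6)      # node 6 once per cluster cycle
        else:
            round_slots.append(None)   # empty slot, used for I-frames
        cycle.append(round_slots)
    return cycle

cycle = build_cluster_cycle()
flat = [n for rnd in cycle for n in rnd]
assert flat.count(1) == 4 and flat.count(3) == 2 and flat.count(5) == 1
```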
If the sender is not a master clock or the frame is corrupt, the transmission is disregarded.
The time difference between the expected arrival time from the MEDL and the actual arrival
time is measured locally in microticks at the receiver.
The MEDL also contains a correction term Δcorr(s,r) that holds the number of microticks that a
transmission takes between sender s and receiver r. This is added to the expected time of arrival.
The clock values received are stored in a four-value push-down stack. New values replace old
values, which are pushed off the stack upon the correct arrival of a new clock value. The
largest and the smallest values are discarded and the average is taken
from the two remaining values. This forms the clock state correction term. If the stack does
not contain four values the average is taken from the available values.
It is also possible to make use of an external clock correction term from e.g. a GPS. This term
is calculated at the host of the node that is connected to the GPS and sent to all the other
nodes as normal data. It is extracted and added as external rate correction field in the CNI of
the controller.
If the absolute value of the external clock correction term or the absolute value of the total
correction term is larger than Π/2 (where Π is the precision and has to be smaller than one
macrotick) the node raises a synchronization error and freezes.
This clock synchronization procedure applies to a single channel. With two channels, the
correction must be handled separately per channel because of their different propagation delays.
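The clock-state correction described above (a fault-tolerant average over a four-value push-down stack) can be sketched as follows. The class name and the microtick values are illustrative assumptions; only the discard-extremes-and-average rule comes from the text.

```python
# Sketch of the clock-state correction: keep the four most recent clock
# deviation values, discard the largest and smallest, average the rest.

from collections import deque

class ClockSync:
    def __init__(self):
        self.stack = deque(maxlen=4)  # four-value push-down stack

    def push(self, deviation_microticks):
        # A newly arrived correct value pushes the oldest off the stack.
        self.stack.append(deviation_microticks)

    def correction(self):
        values = sorted(self.stack)
        if len(values) == 4:
            values = values[1:3]  # drop the extreme (possibly faulty) values
        # With fewer than four values, average whatever is available.
        return sum(values) / len(values)

sync = ClockSync()
for dev in [3, -8, 5, 4]:
    sync.push(dev)
# Extremes -8 and 5 are discarded; the correction is (3 + 4) / 2.
assert sync.correction() == 3.5
```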
6 Topologies
There are two types of data communication between processors and between processors and
peripherals: channels and networks. A channel provides a direct or switched point-to-point
connection between communicating devices. Channels are usually hardware intensive and
provide high bandwidth with low overhead. A network is a collection of processors and
peripherals that interact with each other using a protocol. Networks require more software to
handle the communication but can be used for a larger variety of tasks. Each communication
technique has its advantages and disadvantages, and within aircraft different topologies,
protocols and media are used.
The most common topologies are point-to-point, bus, star, ring and combinations of these. In
avionics, point-to-point topologies are dominant because of their superior reliability. Over the
last decades the aviation industry's focus on environmental issues has grown, making weight
reduction a big issue today. Large savings can be made if a sparse (partly connected)
mesh (see section 6.1) or a bus can be used instead of a fully connected mesh topology.
A ring topology is a point-to-point connection where every node has two connections to other
nodes so that all nodes form a ring. Messages are sent in one direction and repeated by every
node until they reach the sender again. In order to tolerate a single fault on the ring it is
possible to reverse the communication direction. A system with redundant rings and
bidirectional transmission is very robust.
A big advantage with a point-to-point topology is that it creates physical separation between
communication channels. Separation between the channels creates natural fault containment
zones.
A star interconnect is also a kind of point-to-point topology and is further discussed in
section 6.3.
The broadcast bus topology consists of a medium (which in principle can be an electric wire,
optical fibre or radio link) to which all nodes are connected. Buses always broadcast
messages to all members on the bus. Because of this, only one member can send at a time,
despite the fact that several paths usually exist within the bus. Since all nodes see
the signal on the bus at virtually the same time, timing analysis becomes easy compared to
analysing a complete point-to-point communication system.
The bus topology is flexible and composable and can easily be reconfigured without any
major hardware redesign. Buses generally use less wiring and fewer interfaces than e.g. point-
to-point and star topologies. Since all nodes are connected to the bus there is a single-point-
of-failure issue which must be handled. Fault containment needs to be created using
mechanisms such as membership to prevent faulty members from disrupting the
communication between other nodes on the bus.
• Short-circuit: With a short circuit between the wires in a twisted-pair connection the whole
bus fails to transfer information. This is a rare event with very low probability, and
the only way to handle this fault is to use static replication in the form of a dual (or
higher) redundant bus. A more frequent fault mode is a short circuit to ground in the node's
bus interface or in the connector. It is usually possible to maintain communication with a
lower capacity in this fault mode if the system is designed for "graceful degradation". The
influence of short-circuit fault modes can be reduced by using galvanic isolation at the stub
of each node on the bus.
• Circuit cut-off: A cut-off wire is a low-probability fault which divides the system into
two parts that might or might not be recoverable. Redundant communication buses
routed separately handle this fault mode. Circuit cut-off in a connector is the most
common connector fault mode and only affects the node, not the bus function.
• SOS faults: A node sends messages on the limits of, or slightly outside, its assigned time
window, such that some of the receiving nodes receive the messages correctly and some
do not. In this case the nodes' membership opinions will not be consistent unless an
atomic broadcast mechanism is provided.
• Babbling idiot: A faulty node sends outside its pre-assigned time window, which destroys
the communication scheduled at that time on the bus. Only one message may be sent at a
time, and depending on the chosen protocol different mechanisms prevent nodes from
sending at the wrong time. All nodes in a safety-critical bus communication system must
therefore be designed with fail-silent behaviour in the time domain, or at least must not
violate the protocol rules. Bus guardians are one way of preventing babbling-idiot faults.
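The bus guardian idea from the last bullet can be sketched as follows: an independent gate only opens the node's bus driver inside its pre-assigned time window, so a babbling node cannot destroy other traffic. The function names and macrotick values are illustrative assumptions.

```python
# Sketch of a bus guardian gating a node's transmitter: transmission is
# only allowed inside the node's pre-assigned time window.

def guardian_allows(now, window_start, window_len):
    """True if the node may drive the bus at time `now` (in macroticks)."""
    return window_start <= now < window_start + window_len

def transmit(now, window):
    start, length = window
    if not guardian_allows(now, start, length):
        return "blocked"   # attempted babbling-idiot transmission
    return "sent"

slot = (100, 10)            # this node owns macroticks 100-109
assert transmit(105, slot) == "sent"
assert transmit(120, slot) == "blocked"
```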
A star-coupled system relies on one central node, a star or a hub, that is connected to all other
nodes. The star is the obvious master of the communication since there are no direct
connections between the other nodes. The central position puts the star in charge of all
communication and faulty nodes can be excluded since it controls the access to the
communication medium. In a star topology it is natural to implement the guardians in the star
coupler since this makes them physically independent from the node.
However, if the star breaks down or becomes faulty the whole network experiences loss of
communication. The probability that this fault mode causes a total communication blackout
can be significantly reduced by using redundant stars.
Star topology is a hardware-intensive topology that requires more wiring than most other
topologies; an exception is the fully connected mesh. Additionally, the star coupler is much
more complex and has a larger failure rate than a passive component like a cable in a mesh or
bus interconnect. Delays are increased when using a star coupler since the routing at the
switch is not instant.
Fault-tolerance mechanisms placed in the star (filters, pulse shaping, etc.) can handle
asymmetric faults such as Byzantine and SOS faults. However, the star is an active and rather
complex component which may have many fault modes. In the worst case the star coupler
becomes Byzantine: it tells different things to different nodes in the system. That is a very
hard fault mode to detect and resolve.
The star topology is inflexible to hardware reconfigurations since the star couplers need to be
reconfigured. One advantage a star topology has over a bus is that the physical layers
available for star topologies today support higher bandwidth: TTP/C using 100Base-TX has
a maximum bandwidth of 25 Mbit/s and AFDX 100 Mbit/s.
7 Architecture comparison
This chapter contains a conceptual study of communication architectures where the protocols
investigated in this thesis are theoretically evaluated for use as a primary communication
system in a Flight Control System (FCS).
The flight control system that has been used as a template for this study is a fictitious flight
control system based on JAS 39 Gripen. To avoid confidentiality issues, the details of this
flight control system have been entirely collected from [15].
Note that power supplies and extra connections for e.g. emergency shutdown are not included
in this study.
Section 7.1 describes the flight control system that has been used as a template for this
chapter. Sections 7.2-7.4 describe the communication architectures and discuss their strengths
and weaknesses. Section 7.5 contains reliability calculations and failure-rate estimates of the
architectures. Section 7.6 contains estimated data specific to the architectures and a
comparison of the features of the protocols included.
Figure 7.1: Distributed flight control architecture with analogue point to point connections
The FCS in JAS 39 Gripen has one system core (flight control computer) with triple-
redundant channels. In order to simplify the "porting" and make it more understandable to the
reader, the FCC is described as one function distributed over three FCC nodes (see Figure
7.1). The FCCs use the sensor data as input to the control laws to calculate control
commands that are sent to the actuators. Data are compared between the FCCs, and the control
commands go through a voter at the actuator in order to prevent a faulty control command
from reaching the actuator.
The flight control system contains a total of 20 nodes that are connected to the FCC nodes
using point-to-point connections. The FCCs are connected to each other to make data
exchange possible, e.g. for state comparison. This is a traditional point-to-point flight
control system, used here as a benchmark.
Figure 7.2: Distributed flight control architecture with dual redundant broadcast bus connections using
TTP/C
TTP/C is specifically designed for safety-critical hard real-time systems and has several
features built in to provide dependable communication. These mechanisms are crucial in a
safety-critical communication system because of the fault modes that are introduced when
using a bus, described in Section 6.2.
The protocol is not specified for a certain physical layer, which makes it flexible. It has been
tested with bus lengths up to 100 m using MFM/Manchester coding over an RS485
interface, which would make implementation feasible even in a large aircraft.
The maximum bandwidth using RS485 with MFM/Manchester coding is 5 Mbit/s. MII
100Base-TX, which provides bit rates up to 25 Mbit/s, can also be used if the architecture is
changed to use star couplers, i.e. a topology similar to the one in section 7.3.
In order to protect the nodes from lightning effects, short circuits and other electrical
impulses that might be distributed by the bus, galvanic isolation is needed at each node. This
will reduce the transmission rate on the network to somewhere in the range 1-2 Mbit/s. As can
be seen in the cluster example in APPENDIX C, 2 Mbit/s or lower is sufficient to schedule a
system of this magnitude on a TTP/C bus.
A small CPU is required at each node to handle the FT-COM layer that is either included in
TTP-OS or implemented by the user at the host CPU. It is possible to implement the TTP
controller in an FPGA or ASIC in order to reduce the amount of hardware at critical parts of a
system.
Environmental testing of the controller and the host CPU also needs to be performed to
ensure that the severe environment at a sensor or actuator can be handled.
It is highly probable that for a flight control system using this technique to pass certification it
would have to have a backup system that has been fully certified for operating without
support from the main FCS.
Figure 7.3: Distributed flight control architecture with dual redundant star couplers using AFDX
The architecture using AFDX uses Ethernet as physical layer and consequently needs four
AFDX switches. The placement of these switches in the body of the aircraft will affect the
amount of cabling needed. Since the switches are complex hardware and the most
critical part of the system, they will have to be placed inside the aircraft body.
Each node in the system (including the FCCs) has four redundant links; one to each of the
redundant switches.
AFDX requires one switch for each channel, which makes it hardware intensive and
complex compared to a bus or point-to-point architecture. The introduction of additional
complex hardware increases the failure rate significantly, which can be seen in the reliability
calculations in 7.5.
According to the ethernet specification a shielded link can be up to 100 m which would be
sufficient for implementation even in a large civil aircraft.
My conclusion regarding AFDX in a flight control system is that there are two major
disadvantages compared to the two previous architectures that make the architecture unsuited
for that application. First, the protocol requires a powerful CPU in each node to process the
packing and unpacking of frames of the data packets from the 100 Mbit/s data stream. Putting
such a computer by every primary control surface is not feasible. Secondly, the delays
introduced at the switches, combined with the significantly higher latency jitter bound of 500 μs,
are not acceptable in a communication system that transports highly critical data used for flight
control.
Figure 7.4: Distributed flight control architecture with dual redundant broadcast bus connections using
FlexRay
An architecture using FlexRay as primary communication system has many similarities to the
TTP/C architecture in 7.2. However, as previously mentioned, FlexRay lacks fault-tolerance
mechanisms such as bus guardians and a membership service that are needed to handle fault
modes such as babbling-idiot faults. Exclusion of fault-tolerance mechanisms such as
membership from the protocol layer means that such mechanisms must be implemented at
application level. As a result, FlexRay would need a more powerful CPU at each node than a
protocol with these services built in at protocol level.
The physical layer of FlexRay only allows bus lengths up to 24 m, which is a major
disadvantage considering that a civil aircraft usually has a wing span larger than 50 m.
This is clearly a protocol that is designed for the automotive industry. However, it is not
impossible that a future FlexRay version targeting a broader market will be published.
The physical layer of FlexRay is specified for up to 10 Mbit/s and allows event-triggered
communication to be fitted in the time-triggered schedule.
The conclusion is that FlexRay is still an immature technology that is only available to
members of the FlexRay consortium. As long as the maximum bus length is 24 meters, an
implementation in an FCS would not be feasible.
Point-to-Point

(Reliability block diagram: three redundant FCCs, each with its point-to-point connections.)

    p1 = e^(-λ1),   p2 = e^(-λ2)

    P = (1 - (1 - p1)^3)(1 - (1 - p2)^3)

p1 is the reliability of a single FCC
p2 is the reliability of a single point-to-point connection
P is the reliability of the complete system
TTP/C

(Reliability block diagram: three redundant FCCs connected by a dual-redundant TTP bus.)

    p1 = e^(-λ1),   p2 = e^(-λ2)

    P = (1 - (1 - p1)^3)(2·p2 - p2^2)

p1 is the reliability of a single FCC
p2 is the reliability of a TTP bus
P is the reliability of the complete system

P(system failure) = 1 - P(TTP/C) = 76·10^-12
AFDX

(Reliability block diagram: three redundant FCCs with AFDX interfaces connected to
dual-redundant AFDX switches.)

    p1 = e^(-λ1),   p2 = e^(-λ2),   p3 = e^(-λ3)

    P(AFDX) = (1 - (1 - p1)^3)(2·p2 - p2^2)(2·p3 - p3^2)

p1 is the reliability of a single FCC
p2 is the reliability of an AFDX interface
p3 is the reliability of an AFDX switch
P is the reliability of the complete system
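The reliability expressions above can be evaluated numerically as sketched below. The failure rates (λ values) used here are placeholders chosen for illustration, not the figures behind the P(system failure) values quoted in Table 7.1.

```python
# Sketch of the series-parallel reliability expressions for the three
# architectures: triplicated FCCs combined with dual-redundant
# communication paths.

import math

def rel(lam, t=1.0):
    """Reliability of one component over mission time t: p = exp(-lambda*t)."""
    return math.exp(-lam * t)

def p_ptp(p1, p2):
    # Triplicated FCCs and triplicated point-to-point channels.
    return (1 - (1 - p1) ** 3) * (1 - (1 - p2) ** 3)

def p_ttp(p1, p2):
    # Triplicated FCCs and a dual-redundant TTP bus: 2*p2 - p2^2.
    return (1 - (1 - p1) ** 3) * (2 * p2 - p2 ** 2)

def p_afdx(p1, p2, p3):
    # Adds dual-redundant AFDX interfaces and dual-redundant switches.
    return (1 - (1 - p1) ** 3) * (2 * p2 - p2 ** 2) * (2 * p3 - p3 ** 2)

# Hypothetical failure rates (per hour), purely for demonstration.
p1, p2, p3 = rel(1e-4), rel(1e-5), rel(1e-4)
for name, p in [("P-t-P", p_ptp(p1, p2)),
                ("TTP/C", p_ttp(p1, p2)),
                ("AFDX", p_afdx(p1, p2, p3))]:
    print(f"{name}: P(system failure) = {1 - p:.3e}")
```

Note that the AFDX expression multiplies in an extra factor at most 1, so with equal component reliabilities the AFDX system reliability can never exceed the TTP/C figure; this matches the ranking seen in Table 7.1.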
7.6 Comparison
Table 7.1: Architecture summary

                        TTP/C (bus)    TTP/C (star)    Point-to-point   AFDX            FlexRay (bus)
Network interfaces      92             4*23*2 = 184    120 + 6 = 126    4*23*2 = 184    92
Cable length (m)        4*50 = 200 ¹   23*4*7 = 644 ²  23*3*7 = 483     23*4*7 = 644    4*50 = 200
Star couplers/switches  None           4               None             4               None
Bandwidth               5 Mbit/s       25 Mbit/s       Depends on       100 Mbit/s      10 Mbit/s
                        (RS485)        (100Base-TX)    physical layer   (100Base-TX)    (twisted pair)
P(system failure)       76·10^-12      NA              29·10^-12        10·10^-9        NA

¹ A bus length of 50 meters is used.
² An average link length of 7 meters from the star/hub coupler is used.
Physical layer

                      TTP/C             FlexRay            AFDX                Point-to-point
Medium access method  TDMA              TDMA               CSMA/CD             Event-triggered
Topology              Bus / Star        Bus / Star         Star                Point-to-point
Physical layer        Yes               No, twisted-pair   No, traffic         No
independent                             PL in              control using VLs
                                        specification      maintained by the
                                                           end system
Cable length          Up to 100 m bus,  24 m bus           100 m cable         Not specified,
                      depending on      specified          (100 m from node    > 100 m
                      number of nodes                      to star)

Fault tolerance

Inherent redundancy   Dual-redundant    Dual-redundant     Dual-redundant      Dual-redundant
                      bus               bus                links and switches  bus
Redundancy            Yes, node and     No, must be        No, must be         No, must be
management            task replication  implemented at     implemented at      implemented at
                      in both HW and SW application level  application level    application level
Membership service    Yes               No                 No                  No
Fault containment     Yes, in hardware: No, must be        Yes, if the fault   Yes, if the
                      membership,       implemented at     is limited to one   secondary bus can
                      message status,   application level  of the redundant    be used to remove
                      dual-redundant                       switch networks     or tolerate the
                      bus                                                      fault
8 Experimental setup
The TTTech equipment used in the lab includes four nodes. Below is the work, problems
experienced described when trying to setup a four node cluster. The intention was to setup a
cluster where we could measure/test
• Delays
• Start-up time
• Resynchronization
• Fault tolerance mechanisms (replication, reliable message passing, etc)
• Physical layer issues
Figure 8.2: Schematic lab setup using two TTP-IP modules and a monitor node
Due to one broken cPCI IP-carrier board (see Section 8.4), the connection in Figure 8.2 was
the only one possible. A VME IP-carrier card was ordered to be able to run a four-node
cluster, but the delivery time was outside the time line of the thesis work. The actual lab
connection is shown in Figure 8.3.
All tests described here use RS485 as physical layer and 2Mbit/s bandwidth. In order to have
a properly working bus it is necessary to have termination resistors at the end of the bus,
either in the last node on the bus or at the end of the bus past the last stub.
The TTP-IP modules do, by default, have RS485 termination resistors mounted on them. In
the connection below there is no need to change the configuration since the TTP Monitoring
node is not terminated by default.
Figure 8.3: Laboratory setup using two TTP-IP modules and a monitor node
The connectors used to connect the IP modules to the bus are standard RJ45 connectors where
pins 3-6 are used. For simplicity, both channels of the TTP/C bus are connected to the bus
using one cable. The bus here is an RJ45 connection block that has all pin-3 terminals
connected together, all pin-4 terminals connected together, and so on.
Figure 8.4: The TTP software development tool-chain. Figure copied from [17]
The connectors on the Tews Compact PCI carrier boards are of the type Champ 50, a
high-density shielded connector similar to a high-density SCSI connector. A
search for this connector showed that it had to be ordered from the US with a delivery
time of at least 8 weeks, which was too long. The purpose of these connectors
was to connect the TTP/C bus.
The problem was solved by soldering wires directly from the Compact PCI board (see Figure
8.6).
One of the Compact PCI carrier boards was found to be broken when testing the wires
soldered onto it. It is probable that the faulty backplane was the cause of this, since the error
in the backplane initially used put 12 V on a 3.3 V pin.
The fact that one of the carrier boards was broken was a major drawback in the testing, since
it meant that only two TTP-IP modules could be used at the same time.
9 Tests performed
This chapter describes the tests performed and their results. For more information about
the test cases, see APPENDIX B.
Figure 9.1: Schedule of the test application used in TC1 and TC2
Since TTP/C is masterless, a node will synchronize and establish communication if any
other participant on the network is alive. The signals on the bus can be seen in Figure 9.2,
where it is visible that nodes 1 and 2 in the cluster are present and transmit on the bus,
while the slots of the absent nodes 3 and 4 are empty.
This test illustrates the physical signalling on the bus and shows conformance to the generated
schedule. By visual inspection and comparison of Figures 9.1 and 9.2 it is illustrated that the
schedule and the real signalling correspond to each other.
Test case 6 - Time measurement of start-up, reintegration and restart: Did not finish
This test case did not finish. The start-up time of the cluster was not measured because it
depends on how many nodes are operational, and the start-up time of only two nodes is not
considered interesting.
Reintegration and restart would require at least three "normal" nodes on the bus, and since
only two nodes have been operational during testing (discussed in Section 8.4) these tests
could not be carried out.
Certification:
TTTech, which develops TTP/C commercially, has put a lot of effort into making TTP/C
certifiable. Since the protocol is used in the new Airbus A380 it will, eventually, be certified
when the final certification of the aircraft is completed.
The AS8202NF controller has been designed to meet criticality level A of RTCA/DO-254.
The TTP driver, TTP-OS, TTP-Verify (a tool to verify communication schedules) and the TTP
loading library have been designed to meet level A of RTCA/DO-178B.
This indicates that certification of TTP/C itself will most likely not be a problem if it were
used in a safety-critical system in avionics. However, when using a new technology in the
avionics industry, a backup system is needed for certification.
Functional requirements:
Bandwidth: Yes. The 5 Mbit/s supported by RS485 using Manchester/MFM coding is
sufficient for a flight control system (2 Mbit/s has been found sufficient with proper
scheduling and communication running at 80-100 Hz). Bandwidth using transformer
coupling is yet to be investigated.
Latency jitter: Yes (configurable in TTP/C, 1-10 μs).
Maintenance: Yes. Nodes can easily be added, but the communication schedule needs to be
rebuilt if empty slots have not been included for future additions in the original design. TTP/C
even supports hot-swap: a node can be replaced without disrupting the other
communication.
Number of nodes: Yes. TTP/C supports up to 64 nodes in one cluster which is considered to
be more than enough for a flight control system.
Environmental requirements:
Temperature: Not shown. The TTP chip AS8202NF is specified for -40°C to 125°C. For
electrical components that are to be placed in hazardous areas like the wings, an operating
range of -60°C to 70°C is required. Considering that powered components generate heat, it
may well turn out that the AS8202NF chip would survive in such an environment.
However, further investigation and testing need to be conducted.
EMC: Not shown. Among other disturbances, in the wings of an airplane electromagnetic
fields will induce disturbances in wires. It is probable that a multi-drop bus will suffer more
from this than point-to-point connections, since the bus has a greater area over which to absorb
disturbances.
TTP/C is independent of the physical layer which makes it possible to choose a physical layer
that will meet the requirements in exposed environments.
The nodes need to have galvanic isolation from the communication bus to be robust against
severe electromagnetic disturbances and lightning effects. It is not considered to be a problem
to implement since TTP/C is physical layer independent.
A short-circuit that prevents all communication on a channel can be handled by having double
redundant channels.
Implementation in an aircraft would in some cases mean that a bus would be up to 100 m long
which can be accomplished using RS485.
Safety requirements:
The predefined communication schedule makes the communication predictable and hence
easy to analyze. It guarantees latency and jitter which is a strong requirement for distributed
control functions to be able to send and receive data in time. TTP/C is masterless and uses
fail-silence, which means that faulty nodes will not affect the rest of the communication on the
network. This is ensured using bus guardians and the membership service. Data-consistency
checking in the form of CRC is performed by the TTP controller, i.e. at controller level,
which saves host CPU power.
10. The physical layer should be able to use transformer couplers to achieve galvanic isolation
at each node.
Based on the results and analysis (chapter 10) the conclusion is drawn that a broadcast bus
using TTP/C would be suitable for use in a flight control system.
Regarding the protocols included for comparison, AFDX and FlexRay, the conclusion is
that AFDX is not really suitable for a flight control system but is interesting as a
different approach to TT communication. FlexRay would be an interesting alternative to
TTP/C if an updated version with a more complete set of fault-tolerance mechanisms were to
be released and the consortium would allow usage outside of the automotive industry.
Unfortunately, the practical part of the evaluation was delayed due to hardware issues
(described in chapter 11), which meant that not nearly as many measurements as planned
could be carried out. Contact with the hardware suppliers concerning availability and
delivery time is extremely important in any project, since such delays fundamentally affect the
project and its outcome.
As a result of this thesis a lot of thoughts and ideas have been born. Some of the questions in
this thesis remain unanswered and will need a significant amount of analysis and testing to be
answered.
Future work for further evaluation of the bus architecture discussed in section 7.2 involves
testing in a full-scale network. Environmental stress testing of the chosen components is
absolutely necessary for certification. A safety analysis of the bus guardian
implementation in TTP/C is needed to ensure that the system can be proved to meet a
sufficient level of reliability for certification. Testing is also needed to decide whether RS485
can provide sufficient bandwidth in the bus configuration discussed in chapter 7 using
transformer coupling.
12 References
[1] TTTech, Time-Triggered Protocol TTP/C High-Level Specification Document
Protocol Version 1.1, version 1.4.3 19 Nov 2003
[2] TTTech Computertechnik AG, Homepage of TTTech at www.tttech.com
[3] H. Kopetz, G. Bauer, The Time-Triggered Architecture, In Proceedings of the IEEE
Special Issue on modelling and Design of Embedded Software, Oct 2002
[4] K. Driscoll, B. Hall, M. Paulitsch, P. Zumsteg, H. Sivencrona, The Real Byzantine
Generals, In Proceedings of the 23rd DASC, Oct 24-28, 2004
[5] J. Rushby, A Comparison of Bus Architectures for Safety-Critical Embedded
Systems. Technical report, Computer Science Laboratory, SRI 2003
[6] GAST project, The GAST project homepage at www.chl.chalmers.se/gast
[7] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded
Applications, Kluwer International Series in Engineering and Computer Science,
1997
[8] Jean-Claude Laprie, “Dependable Computing: Concepts Limits and Challenges”, In
Proceedings of the 25th IEEE International Symposium on Fault-Tolerant
Computing, Pasadena, California, June 27-30, 1995, pp 42-54.
[9] K. Forsberg, NFFP3+ WP2 Certification guidance documents, Saab technical report
NFFP-2006:002, 2006-04-12, (Not public)
[10] SAE ARP 4754, Certification Considerations for Highly Integrated or Complex
Aircraft Systems
[11] DO-178B, Software Considerations in Airborne Systems and Equipment Certification,
RTCA Inc, 1140 Connecticut Avenue, N.W. Suite 1020, Washington D.C. 20036
[12] DO-254, Design Assurance Guidance for Airborne Electronic Hardware, RTCA inc,
1140 Connecticut Avenue, N.W. Suite 1020, Washington D.C. 20036
[13] K. Forsberg, Analysis and Calculations for Dependable DIMA Architectures, Report
Safety Critical issues (WP5) SD25101, Saab technical report, MOEL-2004:046,
2004-11-01, (Not public)
[14] D.A. Gwaltney, J.M. Briscoe, Comparison of Communication Architectures for
Spacecraft Modular Avionics Systems, NASA Technical Report
NASA/TM-2006-214431, June 2006
[15] K. Forsberg, Design Principles for Fly-By-Wire Architectures, PhD Thesis,
Department of Computer Engineering
[16] SAE ARP 4761, Guidelines and Methods for Conducting the Safety Assessment
Process on Civil Airborne Systems and Equipment
[17] TTTech, TTP-Plan manual, manual edition 5.3.8 11-Nov-2005
[18] H. Sivencrona, Heavy-Ion Fault Injection in TTP-C2 Implementation. Report of the
SP Swedish National Testing and Research Institute, September 2003.
[19] John C. Knight and Nancy G. Leveson, An Experimental Evaluation of the
Assumption of Independence in Multiversion Programming, IEEE Transactions on
Software Engineering, SE-12(1):96-109, January 1986
[20] K Forsberg, B Habberstad, J Torin. FoT25 Systems Architecture WP6, Detailed
DIMA and DTT UAV architectures. Saab technical report (Not public)
[21] IEEE, IEEE Recommended Practice for Architectural Description of Software-
Intensive Systems, ISBN 0-7381-2518-0, E-ISBN 0-7381-2519-9
Definitions and terminology
Attributes
The dependability attributes characterize the dependability of a given system.
Availability is a measure of how probable it is that the system is operational and able to
provide service at any given time. Higher availability means a higher probability that the
system can provide the requested service.
Reliability is a measure of whether a system can provide the intended service within the
specified time, i.e. with an accurate response time.
Safety is a measure of the extent to which a system can provide service to its users without
being a threat to its environment, e.g. by performing services that it was not originally
intended to perform.
Impairments
The impairments of a system are divided into faults, errors and failures. Although none of
these are wanted in a system, they are unavoidable. Fortunately there are ways to prevent and
deal with these problems. A fault may, if the part of the system where it occurred is activated,
lead to an error. An error might lead to other errors, and if an error prevents the system from
providing the intended service it leads to a failure. An error or a failure that stops all operation
within a system is called a system failure.
Means
The means for dependability are methods to increase the trustworthiness of a system. During
the design process, fault prevention aims at preventing faults from occurring and from being
introduced. It is however impossible to prevent all faults, so the design must be made in such
a way that faults can be tolerated and prevented from propagating to failures. This is
accomplished through fault tolerance. Fault removal tries to deal with and minimize the
effects of faults, while fault forecasting estimates the probability and severity of faults. For
more information about the means, see Chapter 4.
A.2 Definitions
A.2.1 General
Term Definition
100BASE-TX 100BASE-TX is the predominant form of Fast Ethernet, providing
100 Mbit/s Ethernet.
Architecture The fundamental organization of a system, embodied in its
components, their relationships to each other and the environment,
and the principles governing its design and evolution [21].
Avionics Aviation electronics
Certification Used here as short for the certification process that an aircraft
must pass before the certification authorities give permission
for it to be used.
Complex system Complex refers to systems whose safety cannot be shown solely by
test and whose logic is difficult to comprehend without the aid of
analytical tools.
Control surface Parts of an aircraft such as rudder, flaps and air brake that are used
to affect the air stream to be able to control the manoeuvring of the
aircraft.
Composability The ability to build new systems from existing pieces, to run
different applications on a platform composed from a pool of
reusable system components.
Distributed system A system where functions are distributed to several nodes.
Error An error is a deviation from the required operation of the system
due to a fault.
Failure A system failure occurs when the system fails to perform its
required function due to an error.
Fault A fault is a defect within the system.
Fault Forecasting Fault forecasting is conducted by performing an evaluation of
system behaviour with respect to fault occurrence and activation.
Fault Prevention Fault prevention is a technique that aims to prevent faults from
entering the system in the design stage. It includes structured
programming, information hiding and modularisation for software
and rigorous design rules for hardware.
56 Definitions and terminology
Fault Removal Fault removal both tries to track and locate faults in a system
before it enters service and after the system have been taken into
service. This includes both hardware and software testing
techniques.
Fault Tolerance Fault tolerance is a method that aims to design a system so faults
can be tolerated. If fault tolerance is not implemented a single fault
may lead to global system failure.
Federated system A system that is built to be isolated from other systems and to
only carry out the system-specific function. This is the opposite
of an integrated system.
Hard real-time system A real-time system where failure to complete its tasks before
deadline expiration will have catastrophic consequences, e.g. a
control loop.
Integrated system A system where resources such as a computational node are shared.
Node A node is a device connected to a communication network. It
consists of a communication interface, a host computer and a buffer
interface connecting the two.
RJ45 A standard connector with 8 pins used for 100BASE-TX among
other Ethernet standards.
RS485 RS-485 (also known as EIA-485) is an OSI Model physical layer
electrical specification of a two-wire, half-duplex, multipoint serial
connection.
Predictability The ability to determine the behaviour of a system, e.g. its
timing, in advance.
Overhead The amount of data in a transmission that is not part of the
payload but is needed e.g. by the communication protocol, such
as CRC and routing information.
Time-triggered In time-triggered communication each node is allowed to send
communication according to a predefined cyclic schedule. The communication is
deterministic since unexpected events cannot occur as in event-
triggered communication. The deterministic characteristics are
crucial when e.g. performing a safety analysis. Since a node is
only allowed to send in its time slot, network load will not affect
the delays.
Clique formation Formation of cliques, i.e. groups of nodes in a cluster that
interpret a message in a different way than the rest of the cluster
due to some fault. (TTP/C)
Macrotick A periodic signal that delimits a granule of the global time.
(TTP/C)
Microtick A periodic signal that is generated by the oscillator of the
controller. Each macrotick is made up of a number of microticks.
(TTP/C)
A.3 Abbreviations
Abbreviation Description
AFDX Avionics Full-Duplex Switched Ethernet
ARINC Aeronautical Radio Incorporated
ARP Aerospace Recommended Practice
ASIC Application-specific integrated circuit
AT Action Time (TTP)
BC Bus Controller (MIL-STD-1553)
BIT Built-in test
CAN Controller Area Network
CMA Common mode analysis
CNI Communication Network Interface, TTP
CPCI Compact Peripheral Component Interconnect
CRC Cyclic Redundancy Check
CSMA/CD Carrier Sense Multiple Access with Collision Detection
ECAC European Civil Aviation Conference
ECC Error Correction Code
EDAC Error Detection Automatic Correction
EMC Electromagnetic Compatibility
EMI Electromagnetic Interference
EPROM Erasable Programmable Read Only Memory
ESD Electrostatic Discharge
ET Event-triggered
FAA Federal Aviation Administration
FAR Federal Aviation Regulations
FCC Flight Control Computer
FCS Flight Control System
FCU Fault Containment Unit
FMEA Failure Mode Effect Analysis
FT-COM Fault Tolerant Communication (TTP)
FTA Fault Tolerant Average algorithm, TTP
Test 4: Reintegration
Goal: Achieve successful reintegration of a node after power off during operation.
Purpose: To evaluate the fault tolerance mechanisms of TTP/C.
Requirement: A node should resume normal operation without disrupting the other
communication on the bus in case of e.g. a power failure.
A TDMA round of 6000 μs is used, which gives a cluster cycle of 12000 μs; hence the
messages will be sent at ~83 Hz. Message sizes in Table C.1 are from [15].
A short analysis of the schedule shows that 6208 bits are sent every round, and with 83.3
rounds per second the effective data rate over the bus is 517 Kbit/s. From the schedule data
we get that the total transmission rate on the bus is 1010 Kbit/s, and transmission time is
78.9 % of that, i.e. 797 Kbit/s. This means that the synchronization information and overhead
on the bus amount to approximately 35%. The reason for this is the high number of nodes on
the bus and the fact that the FCC sends much more data than the other nodes (see Figure C.2).
If the functionality of the FCC were to be distributed over a few small subsystems, the
communication is believed to become more efficient. This experiment, however, is left to
future work.
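The overhead arithmetic above can be reproduced in a few lines; the constants are taken directly from the schedule analysis in this appendix.

```python
# Effective data rate and protocol overhead for the TTP/C schedule,
# using the figures from the schedule analysis above.
BITS_PER_ROUND = 6208      # application data bits sent per round
ROUNDS_PER_S   = 83.3      # ~1 / (12000 us cluster cycle)
TOTAL_RATE     = 1010e3    # total transmission rate on the bus, bit/s
TX_FRACTION    = 0.789     # fraction of the total rate spent transmitting

effective = BITS_PER_ROUND * ROUNDS_PER_S   # ~517 Kbit/s of useful data
tx_rate   = TOTAL_RATE * TX_FRACTION        # ~797 Kbit/s actually transmitted
overhead  = 1 - effective / tx_rate         # ~0.35, i.e. ~35% overhead
```

The ~35% figure falls out directly: only about 517 of the 797 Kbit/s transmitted carry application data, the rest being synchronization information and protocol overhead.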
Table C.1 contains message sizes, whether feedback is used and the number of replicated
nodes. Table C.2 contains schedule data. Figures C.3 and C.4 give a graphical overview of
the cluster communication schedule.
The Simulink model of the cluster is displayed in Figure C.1. Each block connected to a TTP-
block represents a subsystem, and the TTP-blocks represent messages, or in this case arrays
of messages, that are sent over the communication bus. A subsystem can be distributed over
several nodes and contain one or more tasks. TTP-Matlink is used to generate the MEDL and
node-specific code for each node without the need for prewritten tasks, which is very useful
for prototyping.
1. Create a basic cluster design in Simulink, including subsystems, tasks, messages, the
mapping between subsystems and hosts, the TDMA round and what kind of hardware
target is used.
2. Run TTP-Plan to generate a cluster database and a schedule.
3. Import the schedule parameters from TTP-Plan to TTP-Matlink.
4. Generate the node database, MEDL and TTP-OS code for each node with TTP-Build.
5. Import data such as the task sample times generated by TTP-Build.
6. Generate application code using Real-Time Workshop Embedded Coder.
7. Make nodes using the compiler of choice (I used the Diab C compiler from Wind River).
8. Load the compiled code onto the embedded target over the TTP-bus using TTP-Load.
9. Use TTP-View to monitor the communication on the bus.
Figure C.2: Design steps in TTP-Matlink
Figure C.3: Part of the schedule highlighting data transmitted by FCC_A in one TDMA round
Appendix C