Reaction 2016
Preface
This volume contains the papers presented at Reaction 2016: 4th International Workshop on Real Time Computing and Distributed Systems in Emergent Applications, held on November 29, 2016 in Porto, Portugal. This is the fourth edition of an event that started in 2012 in San Juan, Puerto Rico. The 2013 edition took place in Vancouver, Canada, and in 2014 we met in Rome, Italy.
A total of eight papers were accepted, covering a broad range of domains with the goal of attracting different complementary research lines.
Program Committee
Table of Contents
Analyzing End-to-End Delays in Automotive Systems at Various Levels of Timing Information . . . 1
Matthias Becker, Dakshina Dasari, Saad Mubeen, Moris Behnam and Thomas Nolte
Using DDS middleware in distributed partitioned systems . . . 7
Marisol García-Valls, Jorge Domínguez-Poblete and Imad Eddine Touahria
Towards Architectural Support for Bandwidth Management in Mixed-Critical Embedded Systems . . . 13
Georgios Kornaros, Ioannis Christoforakis and Maria Astrinaki
Programming Support for Time-sensitive Adaptation in Cyberphysical Systems . . . 19
Mikhail Afanasov, Aleksandr Iavorskii and Luca Mottola
FLOPSYNC-QACS: Quantization-Aware Clock Synchronization for Wireless Sensor Networks . . . 25
Federico Terraneo, Alessandro Vittorio Papadopoulos, Alberto Leva and Maria Prandini
Cost minimization of network services with buffer and end-to-end deadline constraints . . . 31
Victor Millnert, Johan Eker and Enrico Bini
Go-RealTime: A Lightweight Framework for Multiprocessor Real-Time System in User Space . . . 37
Zhou Fang, Mulong Luo, Fatima M. Anwar, Hao Zhuang and Rajesh Gupta
DTFM: A Flexible Model for Schedulability Analysis of Real-Time Applications on NoC-based Architectures . . . 43
Mourad Dridi, Stéphane Rubini, Frank Singhoff and Jean-Philippe Diguet
Analyzing End-to-End Delays in Automotive
Systems at Various Levels of Timing Information
Matthias Becker∗ , Dakshina Dasari† , Saad Mubeen∗ , Moris Behnam∗ , Thomas Nolte∗
∗ MRTC / Mälardalen University, Sweden
{matthias.becker, saad.mubeen, moris.behnam, thomas.nolte}@mdh.se
† Robert Bosch GmbH, Renningen, Germany
dakshina.dasari@de.bosch.com
Abstract—Software design for automotive systems is highly complex due to the presence of strict data age constraints for event chains in addition to task-specific requirements. These age constraints define the maximum time for the propagation of data through an event chain consisting of independently triggered tasks. Tasks in event chains can have different periods, introducing over- and under-sampling effects, which additionally aggravate their timing analysis. Furthermore, different functionality in these systems is developed by different suppliers before the final system integration on the ECU. The software itself is developed in a hardware-agnostic manner, and this uncertainty and limited information at the early design phases may not allow effective analysis of end-to-end delays during that phase. In this paper, we present a method to compute end-to-end delays given the information available in the design phases, thereby enabling timing analysis throughout the development process. The presented methods are evaluated with extensive experiments where the decreasing pessimism with increasing system information is shown.

I. INTRODUCTION

Automotive systems are getting complex with respect to traditional components like the Engine Management System (EMS) as well as modern features like assisted driving. While the increase in EMS complexity is attributed to newer hybrid engines and stricter emission norms, assisted driving requires the perfect convergence of various technologies to provide safe, efficient and accurate guidance. This has led to software-intensive cars containing several million lines of code, spread over up to a hundred Electronic Control Units (ECUs) [1].

Given this complexity, over the last decades, standards like AUTOSAR [2] have been formulated in order to provide a common platform for the development of automotive software. These standards allow software components provided by different suppliers to be integrated on the same ECU, since they provide for hardware-agnostic software development. Such robust interfaces enable designers to design software at early stages without knowledge of the concrete hardware platform on which it will eventually be executed. Thus, during the development it is often not known which other applications share the same execution platform.

Most of these automotive applications typically have strict real-time requirements – it is not only important that a computation result is the correct result, but also that the result is presented at the correct time. In addition to the timing requirements for each task execution (i.e. the task's deadline), these applications often have constraints for the data propagation through a chain of tasks, so-called end-to-end timing constraints, one of which is the age constraint. The age constraint specifies the maximum time from reading a sensor value until the corresponding output value is produced at the end of the chain. This kind of constraint is especially important for control systems, such as the EMS, where it directly influences the quality of control.

Many design decisions have a direct influence on the data age. Thus, bounding the data age of a chain early during the design process can potentially avoid costly software redesigns at later development stages. The analysis gets complex as a chain may consist of tasks with different periods, leading to over- and under-sampling situations. Most of the available analysis methods for such systems analyze existing schedules [3] and thus are not applicable during early phases. In [4], a generic framework to calculate the data age of a cause-effect chain is presented, which targets single-processor systems and is agnostic of the scheduling algorithm used. In this paper we show how this analysis can be extended to cater to the needs of the complete development process of automotive applications. The increased system information during the design process can thus be used to obtain end-to-end latencies with decreasing pessimism at various levels of timing information.

Contributions: In this work, we extend our earlier work on analyzing end-to-end delays among multi-rate effect chains [4] to utilize the information available in different development stages. Specifically, we highlight the generic nature of the framework by showing how extensions with varied levels of information can be used to compute the maximum data age given the:

1) Knowledge of offsets: The analysis for systems without knowledge of the schedule is extended to allow for task release-offset specifications.

2) Knowledge of the scheduling algorithm (like Fixed Priority Scheduling (FPS)): Most ECUs utilize operating systems which schedule tasks based on FPS. This allows utilizing existing analysis for such systems to determine worst-case response times of the individual tasks. It is then shown how the concepts of the analysis can be adapted to account for this information.

3) Knowledge of the exact schedule: Similar to most of the existing end-to-end delay analyses, we show how the exact schedule can be used to determine the exact delays with low computational overheads.

4) Knowledge of the communication semantic: We extend the analysis to incorporate Logical Execution Time (LET), an important paradigm guiding how and when data is exchanged between tasks of an automotive application [5].

Finally, we compare these different scenarios with extensive evaluations, considering i) the tightness of the computed bounds and ii) the computation time for the analysis.

II. RELATED WORK

The end-to-end timing constraints found in automotive multi-rate systems were first described in [6]. Here, the authors describe the different design phases and link them to EAST-ADL [7] and AUTOSAR [2]. An increased level of system knowledge during the consecutive design phases is outlined.

A method to compute the different end-to-end delays of multi-rate cause-effect chains is presented in [3]. In addition, the authors relate the reaction delay to "button to reaction" functionality and the maximum data age delay to "control" functionality. In this work the focus lies on the maximum data age and hence on control applications.

A model-checking based technique to compute the end-to-end latencies in automotive systems is proposed in [8]. The authors generate a formal model based on the system description which is then analyzed.

The end-to-end timing analysis in an industrial tool suite is discussed in [9]. Two different activation methods are discussed: trigger chains, where a predecessor task triggers the release of a successor task, and data chains, where tasks are individually triggered and hence over- and under-sampling may occur. In this work we focus on chains with the latter activations.

End-to-end delays in heterogeneous multiprocessor systems are analyzed in [10]. Ashjaei et al. [11] propose a model for end-to-end resource reservation in distributed embedded systems, and also present the analysis, based on [3], for end-to-end delays under their model.

Additionally, several industrial tools implement the end-to-end delay analysis for multi-rate effect chains [12], [13], [14]. However, all of the discussed works require system information which is only available at the implementation level. In [4], a scheduling-agnostic end-to-end delay analysis for data age is described, where only information about the tasks of the cause-effect chains is required. In this work, we extend the results presented in [4], and show that by augmenting the information available during the different design phases, we can analyze the maximum data age with a decreasing degree of pessimism.

III. SYSTEM MODEL

This section introduces the application model, inter-task communication mechanisms, and the notion of cause-effect chains as used in this work.

A. Application Model

We model the application as a set of periodic tasks Γ. Each task τi ∈ Γ can be described by the tuple {Ci, Ti, Ψi}, where Ci is the task's Worst Case Execution Time (WCET), and Ti is the task's period. All tasks have implicit deadlines, i.e. the deadline of τi is equal to Ti. A task can further have a release offset Ψi. For all tasks executing on a processor, the hyperperiod can be defined as the least common multiple of all periods, HP = LCM(∀Ti, i ∈ Γ). Hence, a task τi executes a number of jobs during one HP, where the j-th job is denoted as τi,j.

B. Communication between Tasks

In this work inter-task communication is realized via shared registers. This is a common form of communication and can be found in many industrial application areas. In such a communication model, a sending task writes an output value to a shared register. Similarly, a receiving task reads the current value of this register. Hence, there is no signaling between the communicating tasks, and a receiving task always consumes the newest value (i.e. last-is-best).

In order to increase determinism, tasks operate on the read-execute-write model, meaning a task reads all its input values into local copies before the execution starts. During the execution phase only those local copies are accessed. Finally, at the end of the execution the task writes the output values to the shared registers, making them available to other tasks. In short, reading and writing of input and output values is done at deterministic points in time, at the beginning and end of the task's execution respectively. This is a common communication form found in several industrial standards (e.g. in AUTOSAR this model is defined as the implicit communication [15]; the standard IEC 61131-3 for automation systems also defines similar communication mechanisms [16]).

C. Cause-Effect Chains

For many systems it is not only important that the individual tasks execute within their timing constraints but also that the data propagates through a chain of tasks within certain time bounds. One example is the Air Intake System (AIS), which is part of the Engine Management System (EMS) in a modern car. For smooth operation, the air and fuel mixture inside the engine must be controlled, and the AIS is responsible for injecting the correct amount of air. To do so, an initial sensor task periodically samples the position of the pedal, followed by a number of control tasks that process this information, and finally an actuator task actuates the throttle to regulate the amount of air inside the engine. For the control algorithm, it is important that the sensed input data is fresh in order to reach the required control quality. Hence, the time from reading the data until the actuation is subject to delay constraints in addition to the tasks' individual timing constraints.

In AUTOSAR such constraints are described by cause-effect chains [17]. For a task set Γ, a set of cause-effect chains Π can be specified, where Π contains the individual cause-effect chains ζi. A chain ζi is represented by a Directed Acyclic Graph (DAG) {V, E}. The nodes of the graph are represented by the set V and contain the tasks involved in the cause-effect chain. The set E includes all edges between the nodes. An edge from τi to τk implies that τi has at least one output variable which is consumed by τk. A cause-effect chain can have forks and joins, but the first and the last task in the chain must be the same for all possible data paths. To simplify the analysis, chains with fork/join operations are decomposed into individual sequential chains. Hence, all cause-effect chains in Π are sequential.
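The application model above can be made concrete with a small sketch. The task set, the tuple layout, and all numbers below are hypothetical illustrations, not values from the paper:

```python
from math import gcd
from functools import reduce

# Hypothetical task set: each task tau_i is the tuple {C_i, T_i, Psi_i}
# (WCET, period, release offset), all in milliseconds.
tasks = {
    "tau_1": (1, 5, 0),
    "tau_2": (2, 10, 1),
    "tau_3": (3, 20, 0),
}

def hyperperiod(task_set):
    """HP = LCM of all task periods."""
    periods = [T for (_C, T, _Psi) in task_set.values()]
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

HP = hyperperiod(tasks)
# Number of jobs tau_{i,j} each task releases during one HP.
jobs_per_hp = {name: HP // T for name, (_C, T, _Psi) in tasks.items()}
```

For the periods above, HP = 20 ms, so τ1 releases four jobs, τ2 two jobs, and τ3 one job per hyperperiod.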
Fig. 1: Data propagation between tasks of a cause-effect chain in a real-time system with maximum data age specified. (Legend: task arrival, task execution, data propagation, overwritten data.)

that a job can be scheduled anywhere, as long as it starts not before its release and finishes not after its deadline. In [4], the notations to define the intervals for systems without offset are defined as follows:

Rmin(τi,j) = (j − 1) · Ti    (1)
Rmax(τi,j) = Rmin(τi,j+1) − Ci    (2)
Dmin(τi,j) = Rmin(τi,j) + Ci    (3)
Dmax(τi,j) = Rmax(τi,j+1) + Ci    (4)
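Equations (1)–(4) translate directly into code. A minimal sketch, assuming 1-indexed jobs and the implicit-deadline model above (the function names are mine, not the paper's):

```python
def r_min(j, T):
    """Eq. (1): earliest start of job j of a task with period T."""
    return (j - 1) * T

def r_max(j, T, C):
    """Eq. (2): latest start such that job j still finishes by its deadline."""
    return r_min(j + 1, T) - C

def d_min(j, T, C):
    """Eq. (3): earliest time the output data of job j becomes available."""
    return r_min(j, T) + C

def d_max(j, T, C):
    """Eq. (4): latest time the data of job j may remain the newest value."""
    return r_max(j + 1, T, C) + C
```

For example, with T = 10 and C = 2 the first job may start anywhere in [0, 8], its output is available at time 2 at the earliest, and it can remain the newest value until the second job's output is available at the latest.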
found where Follows() returns true. To these nodes a logical data path is created. The same principle is applied from each of these nodes to the jobs of the next lower level of the cause-effect chain. Once the last level is reached, all possible paths are calculated and the function returns. Interested readers are referred to [4] for a detailed explanation.

C. Constructing Data Propagation Paths and Maximum Data Age

For one data path the maximum end-to-end latency, and thus the data age, can be computed as follows, where τstart is a job of the first task of the cause-effect chain, and τend is a job of the last task of the cause-effect chain:

AgeMax(τstart, τend) = (Rmax(τend) + Cend) − Rmin(τstart)

In order to compute the maximum data age for any possible path in the system, AgeMax() must be computed for all computed data paths. The maximum of these values is the maximum data age of the cause-effect chain.

V. REACHABILITY BETWEEN JOBS IN DIFFERENT SYSTEMS

of a task can now read its input data only after Ψ time units after the start of its period. The end of the read interval is unchanged at C time units before the next period starts. Since Dmin and Dmax are described by Rmin and Rmax, no direct changes are required in the formulation.

B. Reachability in known Schedules

Many real-time systems deploy time-triggered schedules in order to guarantee deterministic system behavior. In such a schedule it is known at design time when the different jobs of the different tasks are executed. Thus, complete knowledge of the system is available. On the other hand, for dynamically scheduled systems it is often possible to compute the Worst-Case Response Time (WCRT). In that case the exact execution times of a task are not known, but the earliest and latest time a task can execute is known.

1) Schedule is Available: Let us assume an offline schedule is available for the system. Then for each job τi,j of the task set its exact start time is known as starti,j, and similarly its finishing time is known as endi,j, see Fig. 3. With this additional knowledge the read and data interval can be adjusted as shown in Table I.

Since the start of the job's execution, and hence the time it reads its input data, is known, the read interval collapses to a point. This also leads to smaller data intervals, resulting in no overlap between consecutive jobs.

Fig. 4: Read and data intervals of consecutive jobs of τi if the WCRT of the tasks is available.
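The per-path computation and the maximization over all paths can be sketched as follows. The path representation is a hypothetical simplification of mine: each path is reduced to the three quantities AgeMax actually needs.

```python
def age_max(r_min_start, r_max_end, c_end):
    """AgeMax for one data path: latest completion of the last job
    minus the earliest release of the first job."""
    return (r_max_end + c_end) - r_min_start

def chain_age_max(paths):
    """Maximum data age of a cause-effect chain: the maximum of
    AgeMax over all computed data paths. Each path is given as a
    (r_min_start, r_max_end, c_end) triple."""
    return max(age_max(*p) for p in paths)
```

A path with Rmin(τstart) = 0, Rmax(τend) = 18 and Cend = 2 yields an age of 20; the chain's maximum data age is the largest such value over all enumerated paths.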
C. Reachability in the LET model

The LET model provides an abstraction to the system designer by temporally decoupling the communication among tasks from the tasks' execution. In this model, the input values of a task are always read at the release of the task. The output values become available once the next period starts. In Fig. 5 these points are highlighted by the thick orange line below the arrows marking the job releases. This temporal decoupling of communication and execution has significant advantages for the end-to-end delay calculations.

The periodic access to all input variables at the beginning of the period collapses the read interval to a point. The data interval is also defined, making the output data available for exactly the period after the job's execution. These modifications are shown in Table I. As can be seen, all descriptions are independent of the actual execution time of the job.

Fig. 5: Read and data intervals of consecutive jobs of τi if the system operates based on the LET model.

D. Discussion

All presented modifications affect solely the read and/or data intervals of the jobs. Hence, the existing calculations for the maximum data age, as presented in [4] and recapitulated in Section IV, can be applied without modification. This fact allows the system designer to perform the calculation of maximum data age during early design phases, where only limited knowledge is present, or at the end of the system design, where more complete system knowledge can be obtained.

A tradeoff between required system knowledge and accuracy of the obtained maximum data age exists. For systems with exact knowledge, and for the LET system, it can be observed that data intervals of different jobs of the same task never overlap with each other. This means that it is always certain which data is consumed by a job, and thus all computed data paths are observed in the execution of the system.

Lemma 1. If it holds for all tasks in a chain that the read intervals of a task are reduced to a point (i.e. Rmin(τi,j) = Rmax(τi,j)), then the calculated end-to-end delays are exact. Here "exact" means that all calculated end-to-end delays are observed during the execution of the real system.

Proof. From the definition of the read and data interval in Section V we can see that once Rmin(τi,j) = Rmax(τi,j), the resulting data intervals of consecutive jobs are not overlapping. Given that data intervals are not overlapping and read intervals are reduced to points, the function Follows(τi,j, τk,l) only returns true for the jobs which actually consume the respective data during the execution of the real system.

VI. EVALUATION

This section presents the evaluation of the proposed approaches to analyze the end-to-end delay based on various levels of system knowledge. First, the computed worst-case data age based on various information levels is compared. Further, the required computation time to perform the analysis at the presented information states is evaluated.

A. Experimental Setup

The analyzed cause-effect chains for the experiments are generated according to the automotive benchmarks described in [18]. The task periods are uniformly selected out of the set {1, 2, 5, 10, 20, 50, 100, 200, 1000} ms. The individual task utilization is computed by UUnifast [19]. As stated in [18], an individual cause-effect chain is comprised of either 2 or 3 different periods, where tasks of the same period appear in sequence in the chain. Note that not all period pairs are valid predecessors in a cause-effect chain [18], which is accounted for during the random generation of the cause-effect chain. For each of the presented data points, 1000 random cause-effect chains are examined.

Fixed priority scheduling is used as the scheduling algorithm in order to compute the response times and the information for the exact schedule of the tasks. Priorities are assigned based on the Rate Monotonic [20] policy, where priorities between tasks of the same period are assigned in arbitrary order. For the evaluation of the systems with required response times or known schedule, the response times are calculated based on the well-known analysis for task scheduling presented in [21], and the schedule is generated by simulation of the tasks' execution.

B. Pessimism during the Individual Design Phases

The first experiment examines systems under all four presented information states: no information, response times, known schedule, and LET model. The same cause-effect chain is analyzed with increasing available system information. The system contains 30 tasks, the cause-effect chain is comprised of 4 to 10 tasks in the case of two activation patterns (i.e. periods), and 6 to 15 tasks in the case of three activation patterns, and the system utilization is set to 80%. The results are presented in Fig. 6. The calculated end-to-end latencies are normalized with respect to the chain's hyperperiod and shown on the y-axis. The decreased pessimism in the analysis with increased system knowledge is visible for all experiments. The computed worst-case data age of the same scenario includes less pessimism from systems with no prior information, to systems with known response times, up to systems where the schedule is available. Additionally, we present the maximum data age under the LET model, which behaves close to the computed results based on response times in our setting. The difference of the execution semantic becomes visible when comparing with the results for known schedules. Both results are exact under the respective execution semantic, but the observed maximum data age for the LET model is two times as large as the value for the known schedule.
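The UUnifast procedure [19] used to draw the individual task utilizations can be sketched as follows. This is the standard formulation of the algorithm; the parameter names and the seeded generator are my own choices:

```python
import random

def uunifast(n, total_util, rng=None):
    """Draw n task utilizations, uniformly distributed over the
    simplex of vectors summing to total_util."""
    rng = rng or random.Random(0)
    utilizations = []
    remaining = total_util
    for i in range(1, n):
        # Split off one utilization; the exponent keeps the split unbiased.
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utilizations.append(remaining - next_remaining)
        remaining = next_remaining
    utilizations.append(remaining)
    return utilizations

# E.g. 30 tasks at the 80% system utilization used in the experiments:
utils = uunifast(30, 0.8)
```

The returned values always sum to the requested total utilization, which is what makes the generated task sets comparable across experiments.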
Fig. 6: Comparison of the max. data age (normalized to the chain's HP) for chains with 2 and 3 activation patterns under various levels of system knowledge. (a) Cause-effect chains with 2 involved periods; (b) cause-effect chains with 3 involved periods. Each panel plots the max data age in chain-HP against the number of tasks in the chain, for the information levels No Information, Response Times, and Known Schedule.
Abstract—Communication middleware technologies are slowly being integrated also into critical domains that are progressively transitioning to partitioned systems. In particular, avionics systems have transitioned from federated architectures to the IMA (Integrated Modular Avionics) standard, which targets partitioned systems to comply with requirements on cost, safety, and weight. Future developments fully consider the integration of middleware to support data communication and application interoperability. As specified in FACE (Future Airborne Capability Environment), middleware will be integrated into mixed-criticality systems to ease the development of portable components that can interoperate effectively. Still, in real-time environments, communication middleware is perceived as a source of unpredictability, and there are very few contributions that present real applications of the integration of communication middleware into partitioned systems to support distribution. This paper describes the usage of a publish-subscribe middleware (precisely, DDS – Data Distribution Service for real-time systems) in a fully distributed partitioned system. We explain the design of a reliable communication setting enabled by the middleware, and we exemplify it using a distributed monitoring application for an emulated partitioned system with the goal of obtaining the middleware communication overhead. Implementation results show stable communication times that can be integrated in the resource assignment to partitions.

I. INTRODUCTION

Communication middleware and virtualization technologies are two main contributions to the development and maintainability of software systems as well as to machine consolidation [2]. These were initially used in mainstream applications, but are progressively entering critical environments and complex systems, where their role is increasingly important. In fact, in the avionics domain, the combination of IMA [9] and FACE [4] requires the usage of both virtualization technologies, to develop partitioned systems, and middleware, to ease interoperability and portability of components. This satisfies key requirements regarding cost, space, weight, power consumption, and temperature.

On the one hand, middleware brings in the capacity to abstract the low-level details of the networking protocols and the associated specifics of the physical platforms (e.g. endianness, frame structure, and packaging, among others). Consequently, the productivity of systems development is augmented by easing programmability, maintainability, and debugging.

On the other hand, the penetration of virtualization technology has opened the door to the integration of heterogeneous functions over the same physical platform. This effect of virtualization technology has also reached the real-time systems area. The design of mixed-criticality systems (MCS) [7] is an important trend that supports the execution of various applications and functions of different criticality levels [6] on the same physical machine. The term criticality refers to the levels of assurance over the system execution with respect to failures. For example, in avionics systems, software design follows DO-178B, a de facto standard for software safety; software is guided by DAL (Design Assurance Levels), and failure conditions are categorized by their consequences: from catastrophic (DAL A) to no effect (DAL E). Then, an MCS is one that has at least two functions of different criticalities on the same physical machine.

Over the past 30 years, middleware technology has been applied in critical domains, but in those subsystems of lower criticality levels. This is the case of, e.g., CORBA applied to combat systems [25] or, recently, DDS [21] applied to the control of interoperability of unmanned aircraft and air traffic management1, mainly for ground segment control. Still, middleware is mostly used directly on bare-machine deployments; it is not yet used in partitioned software systems.

This paper provides an initial design of a fully distributed partitioned deployment that integrates DDS middleware. We exemplify this concept on a data monitoring application that has been developed to provide hands-on experience with the actual technology and to analyze the temporal behavior of the overall distributed partitioned setting. The system is fully distributed across different physical machines to perform data sampling and transmission that is later received, processed and displayed at a remote node. The nodes of the monitoring system emulate a mixed-criticality system. The setting replicates that of a partitioned system in a FACE-compliant architecture. We concentrate on the design of the software stack and the analysis of the middleware performance over an Ethernet network that emulates an AFDX (Avionics Full DupleX) [5] compliant communication.

The paper is structured as follows. Section II presents the work background and selected related contributions, including concepts and technologies relative to partitioned systems

1 http://www.atlantida-cenit.org
and distribution middleware technologies for critical domains. Section III analyzes the most important characteristics and properties of DDS for partitioned systems within FACE. Section IV presents the design of the distributed partitioned system, illustrated for a remote monitoring application. Section V provides the implementation of the system and presents the results obtained for the communication. Section VI draws some conclusions and describes future work.

II. BACKGROUND

IMA has been a very successful approach to transition from the former federated architectures to a more efficient design and final deployment in avionics systems. In this context, different standards are proposed to facilitate componentization, portability and interoperability at different levels of a system. In this way, the ARINC 653 standard decouples the real-time operating system platform from the application software. For this purpose, it defines an APEX (APplication EXecutive) where each application software unit is called a partition, having its own memory space. Each partition has dedicated time slots allocated through the APEX API, so that each partition can have multi-tasking and its own scheduling policy. Overall, the execution is embodied in a hierarchical scheduling policy where the top level is a cyclic schedule. Current work aims to enhance ARINC 653 for multi-core processor architectures. The underlying network is AFDX, which uses commercial technology with redundancy to support safe transmission. The integration of networked and distributed systems follows ARINC 429, the data bus standard defining the characteristics of the data exchanges among the connected subsystems. It defines the physical and electrical interfaces and a data protocol for a local avionics network.

In other real-time systems, network scheduling typically relies on either: an off-line network transmission plan given a-priori knowledge of the generated traffic (e.g. [26]); architectures such as TTA (Time Triggered Architecture) [17]; or distributed component models highly related to the hardware on-chip design, such as Genesis [19].

Nevertheless, in the last decade the trend in the development of complex systems is to move to more productive ways of designing the communication and interaction. The avionics industry has developed the FACE (Future Airborne Capability Environment) standard to facilitate the development

execution of DDS over a real-time hypervisor [20], although the actual network stack processing was not measured; or [13], [14] for network-level P/S evaluation, and [15] for bare-machine deployments. Overall, there is not sufficient analysis on the actual execution characteristics of specific middleware technologies in general partitioned environments. Moreover, there are no practical design models of partitioned systems that can comprehensively put forward the required software-level integration and their actual performance results. This paper contributes in this direction with a practical design of a distributed partitioned environment based on DDS, providing a study of the behavior of a partitioned system that communicates using this technology.

III. MIDDLEWARE WITHIN A PARTITIONED SYSTEM

A. Middleware in FACE Standard

The FACE standard defines the software computing environment and interfaces designed to support the development of portable components across the general-purpose, safety, and security profiles required by the avionics domain. Its goal is to define the interaction points between the different technologies that ease systems integration. Indeed, FACE uses industry standards for distributed communications, programming languages, graphics, and operating systems, among other areas. Version 2.0 further promoted application interoperability and portability, with enhanced requirements for exchanging data among FACE components; v2.0 also emphasizes defining common language requirements for the standard. Precisely, the operating system segment of FACE defines different levels: the portable components segment (the applications), the transport services segment (where middleware technologies are employed for interoperability), the platform specific services segment (for the common services of a given domain), and the I/O services segment (related to the low-level adapters and drivers to access peripherals, including the actual network).

At the transport services segment, many technological choices can be employed (e.g. CORBA, Web services, DDS, etc.). At the moment, the most widely accepted is DDS.

B. Data Distribution Service for real-time systems

The Data Distribution Service (DDS) is an OMG standard that provides a publish-subscribe (P/S), i.e., decoupled, interaction model among remote components. DDS relies on
and easy integration of portable components. The communi- the concept of a global data space where entities exchange
cation middleware is given a key role as an interoperability messages based on their type and content. Such entities are
enabler. Different technological choices can be used in FACE remote nodes or remote processes, although it is also possible
such as CORBA, Web services, or DDS, among others. The to communicate from within the same local machine.
most popular technology at the moment is (probably) OMG’s Entities can take two roles; they can be publishers or
DDS standard [21] that has been applied in a number of subscribers of a given data type. Types are based on the
domains such as remote systems control [24]. It provides concept of topics that are constructions supporting the actual
an asynchronous interoperability via a publish-subscribe (P/S) data exchange. Topics are identified by a unique name, a data
paradigm that is data-centric, and the communication can be type and a set of QoS policies; also, they can use keys that
fine tuned through quality of service (QoS) policies. enable the existance of different instances of a topic so that
There are some related works such as a thorough perfor- the receiving entities can differentiate the data source.
mance study of DDS for desktop virtualization technologies Applications organize the communicating entities into do-
[22] but was not dealing with partioned systems; or the mains. Essentially, a domain defines an application range
where communication among related entities (an application) can be established. A domain becomes alive when a participant is created. A participant is an entity that owns a set of resources such as memory and transport. If an application has different transport needs, then two participants can be created. A participant may contain the following child entities: publishers, subscribers, data writers, data readers, and topics.

When a partition needs to transmit or receive data, it sends it to the communication partition via a shared memory space. When the kernel schedules the partition at its assigned time slot, the transmission takes place.
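The DDS entity hierarchy just described (domain → participant → publishers/subscribers with their writers/readers, matched via topics) can be sketched as a toy model. This is a minimal illustration in plain Python, not the OMG DDS API; all class and method names here are hypothetical and only mirror the standard's concepts:

```python
# Toy model of the DDS entity hierarchy: a domain scopes communication,
# a participant owns resources, and writers/readers exchange samples
# that match on topic name. NOT the real DDS API.

class Topic:
    def __init__(self, name, data_type, qos=None, key=None):
        self.name, self.data_type = name, data_type
        self.qos = qos or {}
        self.key = key  # optional key: distinguishes instances of a topic

class Domain:
    def __init__(self, domain_id):
        self.domain_id = domain_id
        self.readers = []  # registry used for topic matching

class Participant:
    """Owns a set of resources; here, a registry of child entities."""
    def __init__(self, domain):
        self.domain = domain
        self.writers, self.readers = [], []

    def create_writer(self, topic):
        w = DataWriter(topic, self.domain)
        self.writers.append(w)
        return w

    def create_reader(self, topic):
        r = DataReader(topic)
        self.domain.readers.append(r)
        self.readers.append(r)
        return r

class DataWriter:
    def __init__(self, topic, domain):
        self.topic, self.domain = topic, domain

    def write(self, sample):
        # Deliver to every reader in the same domain whose topic matches.
        for r in self.domain.readers:
            if r.topic.name == self.topic.name:
                r.samples.append(sample)

class DataReader:
    def __init__(self, topic):
        self.topic = topic
        self.samples = []

# Two participants in one domain, decoupled through the topic.
dom = Domain(0)
pub, sub = Participant(dom), Participant(dom)
temp = Topic("Temperature", float)
writer = pub.create_writer(temp)
reader = sub.create_reader(temp)
writer.write(21.5)
print(reader.samples)  # [21.5]
```

The writer never references the reader directly: matching happens through the domain and the topic name, which is the decoupling property the P/S model provides.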
The major frame is the sum of the q minor frames: FM = Σ_{j=1}^{q} Fm. Each minor frame is divided into a number r of time slots of different durations (Ck). Each of these time slots is assigned to a specific partition Vk: Fm = Σ_{k=1}^{r} Ck. During its assigned time slot, a partition has uninterrupted access to common resources.
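The frame relations above (the major frame FM as the sum of the q minor frames, and each minor frame Fm as the sum of its slot durations Ck) can be checked with a small sketch; the slot values below are illustrative, not taken from the paper:

```python
# Illustrative cyclic schedule: a major frame FM is the sum of q minor
# frames; each minor frame is split into r slots Ck, one per partition Vk.
minor_frames = [
    [5, 3, 2],   # Fm_1: slot durations C1..C3 (ms) for partitions V1..V3
    [4, 4, 2],   # Fm_2
]

fm_lengths = [sum(slots) for slots in minor_frames]  # each Fm = sum of Ck
FM = sum(fm_lengths)                                 # FM = sum of Fm_j

print(fm_lengths)  # [10, 10]
print(FM)          # 20
```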
Apart from the numbers provided by middleware vendors, a middleware stability analysis is performed off-line. The analysis considers the message size required by the specific application (1024 KB in our case), the required load conditions, and the specific hardware. In the case of the used implementation, RTI Connext DDS, this yields efficient values and a stable behaviour in different situations. The maximum value of the middleware communication cost (cmw) is associated with very exceptional situations (< 0.1), although possible. Consequently, the maximum value cmw is considered in the schedule as part of the corresponding partition(s) Ck that make use of it. For a partition, cmw = (1/δ)·Ck, where δ is the ratio between the overall partition time and the middleware cost.

V. RESULTS

Experimental results are presented with the goal of assessing the cost of using DDS in a partitioned context for supporting remote communication across local and remote partitions. We intend to validate the suitability of the proposed system design; we analyze the system in terms of the communication time from the side of the invoking node. Even for an unreliable setting, we have measured a partition-level reliable communication implementation, i.e., including the client response reception. The goal is to show that the chosen middleware provides communication bounds for this distributed partitioned system. Also, the results show the stability of the middleware in the partitioned system, which is compared against the control group (a bare machine deployment in an unreliable DDS configuration). Results of the execution of the implemented system show its feasibility. The communication cost was increased by 7x in the partitioned scenario. Nevertheless, it was shown that the average case times ranged from 3x to 7x lower than the worst case.

Figure 4 shows the results of a bare machine deployment of the emulated scenario for the distributed monitoring application. This scenario fully replicates the system and it also provides a DDS reliable communication configuration. The scenario of figure 4(a) (left side graph) is executed with no additional load, whereas 4(b) (at the right side) presents a progressively loaded system. Resulting times follow the expected pattern, and a similar phenomenon as in the previous scenario is observed. Average behavior values for (a) are between 3.7x and 5.33x smaller than the maximum times. Again, the communication proves to be very stable; dispersion is around 50µs for all the cases except for the largest message size, where it is 98.2µs. For scenario (b), which shows progressive load, the average times are between 11x and 4x smaller than the worst case. Nevertheless, the system shows to be stable, and the dispersion is between 47.2µs and 123.20µs.

Fig. 4: Bare machine; reliable communication; varying load. (Communication time in microseconds vs. payload in bytes on the left, and vs. load in % on the right; Min/Avg series.)

In the last scenario, the full-fledged deployment of the system is shown. Figure 5 shows the reliable communication setting on a distributed partitioned system deployment with no additional load. Figure 6 shows the same scenario with progressive load.
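The partition budget relation given earlier (cmw = Ck/δ, with δ the ratio between the overall partition time and the middleware cost) can be turned into a two-line check; the numbers below are illustrative, not measurements from the paper:

```python
# If delta is the ratio partition_time / middleware_cost, then the
# middleware budget inside a partition slot Ck is cmw = Ck / delta.
def middleware_budget(Ck_us, delta):
    return Ck_us / delta

# Example: a 5000 us partition slot with delta = 10 reserves 500 us
# of the slot for middleware processing (illustrative values).
print(middleware_budget(5000, 10))  # 500.0
```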
[Fig. 5: Communication time across partitions; Min/Avg/Max/Std series; time in microseconds.]

Fig. 6: Partitioned system; reliable communication; varying load. (Communication time in microseconds vs. load in %; Min/Avg/Max/Std series.)

Communication times are in the order of 300µs for the empty scenario and between 212 and 1400µs for the progressive load. Moreover, the worst case is, for the empty scenario, between 8x and 18x larger than the average case. For the progressive load tests, the average case is between 8x and 39x smaller than the worst case. Also, it can be seen that the minimum and average cases are very close. Results show that the fully distributed partitioned environment yields a stable execution, and the communication overhead of the middleware follows the expected pattern.

VI. CONCLUSION

The paper describes the design of a distributed partitioned system that supports communication of remote partitions through DDS middleware. Topics are defined to support the data-centric model, and this is exemplified for a distributed monitoring application. The communication overhead caused by the middleware and the partitioned setting is analyzed for a sufficient number of trials. The novelty of the paper lies in the exhaustive trials on the specific DDS technology and the measurements that provide the overhead of the whole middleware processing stack. Results show that the communication is stable even in the presence of very high loads. The average case times are significantly smaller than the worst case (from 8x to 39x), and dispersion is 1.4ms for the worst possible scenario. We show that the overhead can be obtained for its integration in the partition resource assignment.

ACKNOWLEDGMENT

This work has been partly supported by the Spanish Ministry of Economy and Competitiveness through projects REM4VSS (TIN2011-28339) and M2C2 (TIN2014-56158-C4-3-P).

REFERENCES

[1] S. Trujillo, A. Crespo, A. Alonso, J. Perez. MultiPARTES: Multi-core partitioning and virtualization for easing the certification of mixed-criticality systems. Microprocessors and Microsystems, vol. 38, pp. 921-932. November 2014.
[2] M. García Valls, T. Cucinotta, C. Lu. Challenges in real-time virtualization and predictable cloud computing. Journal of Systems Architecture, vol. 60(9), pp. 736-740. October 2014.
[3] RTCA DO-297 / EUROCAE ED-124. Integrated Modular Avionics (IMA) Development Guidance and Certification Considerations. www.rtca.org and www.eurocae.org (Accessed 2016)
[5] Airbus Deutschland GmbH. AFDX: Avionics Full Duplex Switched Ethernet. http://www.afdx.com (Accessed 2016)
[6] S. Vestal. Preemptive scheduling of multi-criticality systems with varying degrees of execution time assurance. In Proc. of the IEEE Real-Time Systems Symposium (RTSS). 2007.
[10] ARINC Specification 653: Part 1, Avionics Application Software Standard Interface, Required Services. https://www.arinc.com/ (Accessed 2016)
[11] S. Groesbrink, S. Oberthür, D. Baldin. Architecture for adaptive resource assignment to virtualized mixed-criticality real-time systems. 4th Workshop on Adaptive and Reconfigurable Embedded Systems (APRES'12), vol. 10(1). ACM SIGBED Review, 2013.
[12] C. Gu, et al. Partitioned mixed-criticality scheduling on multiprocessor platforms. In Proc. of IEEE Design, Automation and Test in Europe Conference and Exhibition (DATE), pp. 1-6. 2014.
[13] C. Esposito, D. Cotroneo, S. Russo. On reliability in publish/subscribe services. Computer Networks, vol. 57(5), pp. 1318-1343. April 2013.
[14] A. Hakiri, P. Berthou, A. Gokhale, D. C. Schmidt, T. Gayraud. Supporting end-to-end quality of service properties in OMG data distribution service publish/subscribe middleware over wide area networks. Journal of Systems and Software, vol. 86(10), pp. 2574-2593. October 2013.
[15] T. Rizano, L. Abeni, L. Palopoli. Experimental evaluation of the real-time performance of publish-subscribe middleware. In Proc. of 2nd International Workshop on Real-Time and Distributed Computing in Emerging Applications (REACTION 2013). Vancouver, Canada. December 2013.
[16] Object Management Group (OMG). The Real-time Publish-Subscribe Wire Protocol DDS Interoperability Wire Protocol Specification, v2.2. September 2014.
[17] H. Kopetz, G. Bauer. The time-triggered architecture. Proceedings of the IEEE, vol. 91(1), pp. 112-126. 2003.
[18] C. M. Otero Pérez, et al. QoS-Based Resource Management for Ambient Intelligence. Chapter in Ambient Intelligence: Impact on Embedded System Design, pp. 159-182. Kluwer Academic Publishers. 2003.
[19] R. Obermaisser, et al. Fundamental Design Principles for Embedded Systems: The Architectural Style of the Cross-Domain Architecture GENESYS. IEEE ISORC 2009, pp. 3-11. 2009.
[20] H. Pérez, J. J. Gutiérrez, S. Peiró, A. Crespo. Distributed architecture for developing mixed-criticality systems in multi-core platforms. The Journal of Systems and Software, vol. 123, pp. 145-159. 2017.
[21] Object Management Group (OMG). A Data Distribution Service for Real-time Systems, Version 1.4. http://www.omg.org/spec/DDS/1.4. 2015.
[22] M. García-Valls, P. Basanta-Val. Analyzing point-to-point DDS communication over desktop virtualization software. Computer Standards & Interfaces, vol. 49, pp. 11-21. January 2017.
[23] M. García-Valls. A Proposal for Cost-Effective Server Usage in CPS in the Presence of Dynamic Client Requests. 19th IEEE International Symposium on Real-Time Distributed Computing (ISORC), pp. 19-26. York, UK. May 2016.
[24] M. García-Valls, P. Basanta-Val. Usage of DDS Data-Centric Middleware for Remote Monitoring and Control Laboratories. IEEE Transactions on Industrial Informatics, vol. 9(1), pp. 567-574. 2013.
[25] R. Schantz, et al. Towards adaptive and reflective middleware for network-centric combat systems. Encyclopedia of Software Engineering. Wiley & Sons. 2002.
[26] R. Davis, A. Burns, R. J. Bril, J. J. Lukkien. Controller Area Network (CAN) schedulability analysis: Refuted, revisited and revised. Real-Time Systems, vol. 35(3), pp. 239-272. April 2007.
[27] M. García-Valls, L. Fernández Villar, I. Rodríguez López. iLAND: An enhanced middleware for real-time reconfiguration of service oriented distributed real-time systems. IEEE Transactions on Industrial Informatics, vol. 9(1), pp. 228-236. February 2013.
Towards Architectural Support for Bandwidth
Management in Mixed-Critical Embedded Systems
Ioannis Christoforakis, Maria Astrinaki and George Kornaros
Computer Engineering Department
Technological Educational Institute of Crete
71500, Heraklion, Crete, Greece
Email: kornaros@ie.teicrete.gr
Abstract—Mixed-critical platforms require an on-chip interconnect and a memory controller capable of providing sufficient timing independence for critical applications. Existing real-time memory controllers, however, either do not support mixed criticality or still fail to ensure negligible interference between applications. On the other hand, Networks-on-Chip manage the traffic injection rate mainly by employing complex techniques: either back-pressure based flow-control mechanisms or rate-control of traffic load (i.e., traffic shaping). This work proposes a Traffic Shaper Module that supports both monitoring and traffic control at the on-chip network interface or the memory controller. The advantage of this Traffic Shaper Module is that, at system level, it provides guaranteed memory bandwidth to the critical applications by limiting the traffic of non-critical tasks. The system is developed on the Xilinx ZYNQ7000 System-on-Chip, while the measurements were captured on a Zedboard development board. By enabling the Traffic Shaper in our architecture we achieved fine-grain bandwidth control with negligible overhead, while providing bandwidth only 0.5-5 percent less than the theoretical specified bandwidth.

I. INTRODUCTION

Modern Systems-on-Chip (SoCs) exhibit growing complexity. Power, area and cost constraints [13], due to resource sharing, arise since the number of applications mapped to a single system has been increasing. Applications characterized by real-time constraints exist alongside applications with no constraints, thus creating mixed time-criticality systems. Large numbers of applications run simultaneously in different combinations or use-cases and are dynamic; they can be stopped or started at run time. In order to guarantee that application requirements are satisfied, the system needs to be verified for each of the use-cases. The verification complexity grows exponentially with the number of applications if they can be combined arbitrarily.

When the active use-case changes, the hardware components may have to be reconfigured in order to adapt to the new set of applications and requirements. Even during reconfiguration, the running applications must demonstrate correct behavior. For this reason, the transitions between use-cases result in additional analysis complexity. Finding a common analysis model has proved difficult when mixed time-critical applications and different models of computation are combined. Even if such a model exists, a single change in an application in general requires all use-cases in which the application is active to be re-verified.

A recent trend in real-time systems is to integrate tasks and components of different criticality levels on the same hardware platform. The objective of such mixed-criticality systems [1], [2] is to save space, weight, or energy by reducing the number of computation platforms, and at the same time to provide safety guarantees for the critical components of the system. Timing predictability is an important property when designing such systems, especially for the safety-critical components. Worst-case execution time (WCET) analysis [3], [4] becomes significantly easier if the hardware is more predictable. Many researchers have explored the possibility of designing predictable processors [5], [6], [7] and predictable memory hierarchies [9].

To provide guarantees for time-critical applications it is mandatory to precisely control the usage of system resources. This work describes a methodology for bandwidth management on Network-on-Chip (NoC) interfaces and system memory traffic, using custom circuitry with minimal overhead. In particular, we demonstrate how to achieve this with the use of a specialized hardware Traffic Shaper Module (TSM); by attaching a TSM instance to any initiator Network Interface (NI) we control the number of accesses based on the configured bandwidth and regulate the flow of traffic. We present a prototype system based on the ZYNQ7000 Processing System-on-Chip [12]. More specifically, we make the following contributions:

• we introduce a low-overhead Traffic Shaper Module at register-transfer level (RTL) that allows precise fine-grain traffic control in bytes/clock cycle; the block may provide the possibility for both monitoring and guaranteeing maximum bandwidth at an initiator on-chip network interface;
• we propose a technique that can be used to enforce bitrates different than what a physical interface is capable of;
• we show that, by enabling the Traffic Shaper Module in our architecture, the delivered bandwidth deviates by just 0.5-5% from the theoretical bandwidth, with negligible performance overhead.

The rest of this paper is organized as follows. In Section II this work is positioned with respect to related works. Section III presents the proposed architecture and the developed traffic control methodologies, experiments are shown in Section IV, and finally Section V concludes the paper.
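The "bytes/clock cycle" control and the 0.5-5% deviation claim above can be made concrete with a little arithmetic: a shaper that admits a byte budget per fixed window of clock cycles enforces a rate of budget/window bytes per cycle, and the relative error against a measured rate follows directly. A sketch with made-up numbers, not measurements from the paper:

```python
# Shaped bandwidth from a window-based shaper (illustrative):
# window_bytes are allowed per window_cycles clock cycles.
def shaped_rate(window_bytes, window_cycles, clk_hz):
    # bytes/cycle scaled to bytes/second
    return window_bytes * clk_hz / window_cycles

def relative_error(theoretical, measured):
    return abs(theoretical - measured) / theoretical

rate = shaped_rate(window_bytes=36, window_cycles=250, clk_hz=100_000_000)
print(rate)  # 14400000.0 bytes/s (0.144 bytes/cycle at 100 MHz)
print(relative_error(rate, 0.95 * rate))  # ~0.05, i.e. a 5% deviation
```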
II. RELATED WORK

There is a range of previous works on creating systems that support quality of service with mixed-criticality techniques, both at register-transfer level and at the operating system and kernel level. For instance, several techniques for predictable real-time DRAM controllers have been proposed in previous works, although not all are suitable for mixed-criticality systems, since non-critical task performance cannot be sacrificed too much for critical task predictability. In a real-time system, bounded latency and interference must be considered, in addition to overall throughput and fairness. Kim et al. [14] exhibited a DRAM controller that is able to separate the critical from the non-critical memory access groups (MAGs). They define algorithms for computing safe and tight upper bounds of worst-case latencies, resulting in predictable memory accesses for critical MAGs. In [15] the developers designed a reconfigurable SDRAM controller. This controller offers both predictable and composable service, making it a suitable SDRAM resource for use in virtual platforms. Composability is enabled by the use of composable memory patterns combined with TDM arbitration, and they show that the worst-case performance degradation is negligible. The work in [16] deals with the problem of mixed time-criticality workloads in the context of an SDRAM controller. The main idea is to allow the command scheduler to exploit locality that presents itself within the time window that naturally exists between opening and closing a row. The controller can exploit a fraction of the locality available in the request stream without increasing the worst-case schedule length. Compared to a close-page policy (keeping a DRAM bank in idle state), the average execution time of the benchmark applications is reduced by 7.9% using the conservative open-page policy, while still satisfying the constraints of the Field Resource Tracker (FRT) application (guaranteeing enough worst-case performance to satisfy requirements).

Another technique is the one presented in [17]. This paper presents FlexPRET, a fine-grained multi-threaded processor designed to exhibit architectural techniques useful for mixed-criticality systems (e.g. scratchpad memories). The designers claim that their system provides hardware-based isolation to hard real-time threads while allowing soft real-time threads to efficiently utilize processor resources. There are also timing instructions, extending the RISC-V ISA, that enable cycles to be reallocated to other threads when not needed to satisfy a temporal constraint. They even provide a concrete soft-core FPGA implementation, which is evaluated for resource usage.

A further option that, from our perspective, can also be combined with our proposed mechanisms is the implementation of algorithms for Mixed Criticality Optimization as in [18], where the optimization of mapping and partitioning is performed at the same time, and not separately.

III. TRAFFIC CONTROL METHODOLOGY

Initially, we introduce the design of a modern heterogeneous SoC to support a mixture of critical and non-critical computing entities, with the requirement to provide application isolation in terms of usage of system resources, e.g., memory bandwidth.

A. System Architecture

We target embedded devices that can host applications with different requirements in terms of bandwidth, either of system memory or of other on-chip components. We assume a Network-on-Chip viewed as an interconnect intellectual property module with conventional read and write semantics; the network interface offers read/write and block transfers. Our architecture is based on the Xilinx ZYNQ System-on-Chip. We utilize both the Processing System, which contains the dual ARM Cortex-A9 CPU [12], and the Programmable Logic (PL) area, wherein we integrate the Traffic Shaper Module (TSM). More precisely, our architecture involves the following components and attributes.

• It utilizes the Processing System (PS) Direct Memory Access (DMA) controller, which is used for burst transactions.
• The CPU routes accesses to the memory through the GP port to the PS DDR3 memory. A central interconnect is located within the PS that comprises multiple switches in order to connect to the system resources, using the AXI point-to-point channels for communicating addresses, data, and response transactions between master and slave ports [12]. This ARM AMBA 3.0 interconnect implements a full array of the interconnect communications capabilities and overlays for QoS, debug, and test monitoring. The interconnect manages multiple outstanding transactions and is architected for low-latency paths for the ARM CPUs and for the PL master controllers. The accesses from the PS DMA cross only one GP port and HP port (HP0), as shown in Fig. 1.
• The device integrates a MicroBlaze soft-processor [10] that is used to monitor resources used by the main processor (A9 CPU) and to collect information from the AXI performance monitor. Even though monitoring and controlling of system resources, i.e. NoC and memory bandwidth, must be implemented with customized hardware to offer fine-grain control and response times, a software-oriented approach (using the MicroBlaze) offers a coarser-grain but more flexible solution.
• The device uses a custom Address Remap Block to remap addresses that arrive through the PS and redirect them to the DDR.
• Custom circuitry is integrated at RTL level, which provides the ability of monitoring, control and supplying guaranteed bandwidth for critical applications.

B. Traffic Shaper Module (TSM)

Shaping is a QoS (Quality-of-Service) technique that we can use to enforce pre-specified bitrates, different than what the physical interface is capable of. The TSM is a hardware block that implements control of incoming data traffic, which actually consists of read and write transactions following the AXI4 protocol [22]. More precisely, the developed TSM controls the number of accesses based on the configured/programmed bandwidth and regulates the flow of traffic, in order to apply restrictions on the consumed bandwidth. Figure 1 shows the prototype system that integrates the TSM. The proposed architecture contains the registers that are listed in Table I. The designer's options are: a) to attach it on a Master Interface and control it via the main CPU; or b) to connect the TSM control port and establish the overall management of the TSM at another independent core, which thus offloads the burden to dynamically monitor and evaluate QoS per connection or per process; the main CPU (ARM Cortex-A9) communicates with that core only to provide configuration parameters. In this work we chose the second method.

The TSM utilizes a separate AXI-lite interface for configuration. The TSM main components are shown in Figure 2. The TSM configuration is done via the MicroBlaze. The programming of the registers determines the maximum number of data bytes to transfer in a time window. The AxLen, AxSize and AxValid signals are responsible for indicating a new transaction and whether this is a single word or a burst. The DMA inside the PS supports AXI burst lengths of 1 to 16 transfers, for all burst types [3]. However, the AXI4 protocol extends the burst length support for the INCR burst type to support 1 to 256 transfers. The burst length for AXI3 is defined as Burst Length = AxLEN[3:0] + 1, while the burst length for AXI4 is defined as Burst Length = AxLEN[7:0] + 1, to accommodate the extended burst length of the INCR burst type in AXI4.

TABLE I
DESCRIPTION OF TRAFFIC SHAPER MODULE BASIC COMPONENTS

Register | Description
maxcc    | Stores the maximum time limit (cycles)
maxbw    | Stores the maximum transfer bytes limit: maximum bytes to transfer in the maxcc time interval
acc      | Accumulating counter adding the new transaction bytes; must remain below the maxbw limit
totalacc | Sums all the new acc values and keeps the values after the reset
burst    | Stores each new AXI4 transaction, burst or single word
cccnt    | Clock cycle counter; increases by 1 in each clock cycle, measuring total clock cycles

Fig. 1. Architecture layout supporting precise bandwidth shaping through the use of the hardware Traffic Shaper and an embedded manager in a dedicated MicroBlaze soft-processor (MB); memory traffic flows through the GP port to the DDR3 memory while the TSM monitors and controls at byte granularity.

Fig. 2. Traffic Shaper Module (TSM) internals as implemented in ZYNQ-7000; TSM configuration can occur on the fly and takes effect only after the current time window.

IV. EVALUATION

A. Methodology

Several different ways and methods exist to monitor the activity of a processor or a module. As we presented previously, this can be done by designing a controller that supports mixed-criticality service. Software techniques also exist, which support control at task level or at OS level [19].

Our methodology supports both control and monitoring on the Network-on-Chip (NoC) interface, without requiring the re-design of, or tampering with, the NoC routers or the memory controller [20]. The Traffic Shaper Module provides guaranteed bandwidth to the critical applications by limiting the consumed bandwidth of the non-critical tasks.

We used a tightly system-coupled configuration to run our applications. Tightly system-coupled software means routines (modules or full applications) that work on only one type of system, since they are dependent on each other and on the system configuration. For example, the driver of an embedded cyber-physical system device requires extensive programming changes to work in another environment, but is usually optimized to offer predictability, minimal latency and even fault-tolerance. We ran two stand-alone (baremetal) applications on the A9 CPU and the MicroBlaze. The MicroBlaze is responsible for monitoring and controlling the configuration of the TSM registers. The A9 CPU can be configured to be in control of this block (start, restart, counter initialization and monitoring), but the main goal is to dedicate the processor only to the application running on it. In our case the accesses cross only one GP port (GP0) towards the slave PS interconnect (as shown in Fig. 1). If the DMA engine is utilized to perform bulk data transfers, then the DMA sequence of operations is as follows:
• DMA Configuration
• Setup DMA Interrupt
• DMA Initialization
• DMA Start Transfer
• DMA Check handler done
• DMA Stop and reset

Fig. 3. Comparison of accuracy when scaling the time window using varying data unit size.

Fig. 4. For four different time intervals (1500, 1000, 500, 250 cycles) we measure the error rate that arises between the theoretical and the measured traffic bandwidth without activation of the TSM. For each interval we count three different data size transfers (20, 24, 36 bytes). For example, as we observe in the graph, for 1500 cycles the average difference rises to 17.48% of the bandwidth that was measured compared to the theoretical values that were calculated. By enabling the TSM the error rate in the worst case reaches 6.07%.

Fig. 6. System impact when the TSM is deactivated; random DMAs cause irregular and uncontrolled memory bandwidth consumption.
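The register behavior summarized in Table I (acc accumulating transaction bytes against the maxbw budget within a window of maxcc cycles counted by cccnt, with totalacc keeping the running total) suggests a simple software model of the shaper. This is a behavioral sketch for illustration, not the RTL:

```python
# Behavioral model of the Traffic Shaper Module window logic:
# within each window of maxcc cycles (tracked by cccnt), at most
# maxbw bytes may pass; acc accumulates accepted bytes and resets
# with the window, while totalacc keeps the running total.

class TrafficShaper:
    def __init__(self, maxcc, maxbw):
        self.maxcc, self.maxbw = maxcc, maxbw
        self.cccnt = 0      # clock-cycle counter within the window
        self.acc = 0        # bytes accepted in the current window
        self.totalacc = 0   # bytes accepted since reset

    def tick(self):
        self.cccnt += 1
        if self.cccnt == self.maxcc:   # window rolls over
            self.cccnt = 0
            self.acc = 0

    def offer(self, burst_bytes):
        """Accept a transaction only if it fits in the window budget."""
        if self.acc + burst_bytes <= self.maxbw:
            self.acc += burst_bytes
            self.totalacc += burst_bytes
            return True
        return False       # stalled until the next window

tsm = TrafficShaper(maxcc=100, maxbw=64)
accepted = []
for cycle in range(200):
    accepted.append(tsm.offer(32))  # try a 32-byte burst each cycle
    tsm.tick()

# Only 64 bytes (two 32-byte bursts) pass per 100-cycle window.
print(sum(accepted))   # 4
print(tsm.totalacc)    # 128
```

This matches the shaping idea in the paper at the conceptual level: an aggressive initiator is throttled to the configured bytes-per-window rate regardless of how many transactions it attempts.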
V. C ONCLUSIONS
Abstract—Cyberphysical systems (CPS) integrate embedded sensors, actuators, and computing elements for controlling physical processes. Due to the intimate interactions with the surrounding environment, CPS software must continuously adapt to changing conditions. Enacting adaptation decisions is often subject to strict time requirements to ensure control stability, while CPS software must operate within the tight resource constraints that characterize CPS platforms. Developers are typically left without dedicated programming support to cope with these aspects. As a result, they either neglect functional or timing issues that may potentially arise, or invest significant effort in implementing hand-crafted solutions. We provide programming constructs that allow developers to simplify the specification of adaptive processing and to rely on well-defined time semantics. Our evaluation shows that using these constructs simplifies implementations while reducing developers' effort, at the price of a modest memory and processing overhead.

I. INTRODUCTION

Cyberphysical systems (CPS) enable the tight integration of embedded sensors, actuators, and computing elements into feedback loops for controlling physical processes. Example applications include factory automation, automotive systems, and robotics [25]. CPS operate at the fringe between the cyber domain and the real world [11]. Both the execution of the control logic and the platforms it runs on are thus inherently affected by environmental dynamics [25]. This requires CPS software to continuously adapt to these dynamics. To enact the needed adaptations, developers may employ various approaches, including dynamically changing the control logic.

To complicate matters, control loops are most often time-sensitive [24]; the control logic must be executed at a given frequency to ensure the stability of processes. Adaptation is an integral part of the control loop, and thus subject to the same time constraints. Thus, the timing aspects of taking and enforcing adaptation decisions become crucial. Such complex time-sensitive adaptive processing must withstand the strict resource constraints of CPS platforms; the most advanced CPS devices feature 32-bit micro-controller units (MCUs) with tens of KBytes of RAM, while being battery-operated.

As we illustrate in Sec. II, developers are often left without dedicated support to implement adaptive time-sensitive CPS software. This leads to easily overlooking the potential issues related to the timing aspects of run-time adaptation, ultimately affecting the stability of the controlled processes. Fritsch et al. [7] show, for example, that timing aspects are often neglected in developing adaptive automotive software. Similar observations also apply to robot controllers [2], [18]. Whenever developers do recognize these issues, they tend to implement complex hand-crafted solutions, mostly due to the lack of time semantics in mainstream programming abstractions.

To address these issues, we design and implement a custom realization of context-oriented programming [10], [22] that is: i) conceived for resource-constrained embedded devices, and ii) embeds well-specified notions of adaptation modality and time. As described in Sec. III, these notions allow developers to distinguish different ways to schedule an adaptation decision and to place an upper bound on the time taken to apply such decisions. The former is useful, for example, to avoid functional conflicts when switching from one control logic to another. The latter provides a specified time semantics when adaptation decisions need to abide by real-time deadlines. We render these notions in a dedicated extension of the C++ language we call COP-C++, supported by a corresponding tool-chain we develop.

As reported in Sec. IV, we assess our work by quantitatively comparing the complexity of representative implementations of CPS software using traditional programming constructs against those we design. Our results indicate that the developers' effort is reduced using COP-C++. The cost to gain this benefit is a modest increase in resource consumption, especially in processing time and memory occupation. For example, the worst-case processing overhead we measure through real-world experiments on modern 32-bit MCUs amounts to only 20µs, negligible given the time scales of the considered control loops. We end the paper by discussing related efforts in Sec. V and with brief remarks in Sec. VI.

II. PROBLEM

Consider the need to localize a gas leak in an indoor environment. Tiny aerial drones are envisioned to perform this task efficiently and with minimal cost [4]. Their behavior, as dictated by a given control logic, depends on surrounding conditions, application state, and sensed data [4].

Fig. 1: Software controller for gas leak localization. [State machine: start → Navigation; on arrival → Hovering; when the gas concentration is above threshold or an alert beacon is received → LeakLocalization; on emergency → Landing → end.]

Fig. 1 depicts a possible design for such an application. Initially, every drone moves to a predefined location using
a Navigation controller. Upon arriving, a drone switches to a Hovering controller to sample the gas concentration. Whenever it detects a value above a safety threshold, a drone switches to a LeakLocalization controller that broadcasts alert beacons using a low-range radio. Nearby drones that receive the beacon also switch to the LeakLocalization controller, and come closer to the drone that initially detected the leak to obtain finer-grained measurements. In case of emergencies, such as a hardware failure, the Landing controller is activated.

1  static int8_t current_controller = NONE;
2  static void step() {// should be called at 100Hz or more
3    switch (current_controller) {
4      case NAVIGATE: navigate_step(); break;
5      case HOVERING: hovering_step(); break;
6      case LEAK_LOC: leak_loc_step(); break;
7      case LANDING: landing_step(); break;}}
8  static bool set_controller(uint8_t controller) {
9    if (controller == current_controller) { return true;}
10   bool success = false;
11   switch(controller) {
12     case HOVERING: success = hovering_init(); break;
13     case LEAK_LOC: success = leak_loc_init(); break;
14     case NAVIGATE: success = navigate_init(); break;
15     case LANDING: success = landing_init(); break;
16     default: success = false; break;}
17   if (success) {// update controller
18     exit_mode(current_controller, controller);
19     current_controller = controller;
20   } else {// log error
21     Log_Write_Error(controller);}
22   return success;}

Fig. 2: Example implementation of adaptive controller.

Fig. 2 depicts an example implementation of the required adaptive behavior. The structure of the code reflects real implementations in the considered systems; for example, in the Ardupilot [2] autopilot software for drones. Control is triggered in function step() in line 2, which is called at 100 Hz. Depending on the global variable current_controller, different concrete controllers are executed. Controller adaptation is implemented in function set_controller() in line 8. Depending on what controller is to be activated, an individual controller is first initialized, the clean-up routine of the previous controller is executed on line 18, and the global variable indicating the active controller is updated in line 19.

Despite being simplified, the code in Fig. 2 already shows several issues, some not even entirely evident:

I1) The processing is strictly coupled with the different controllers. For example, adding a new controller, or removing an existing one, would require changing the code in several places. On the other hand, resource constraints prevent using high-level languages that ameliorate these issues. As a result, implementations are typically entangled and thus difficult to debug and to maintain.

I2) In Ardupilot, control runs at 100 Hz: every 10 ms a controller must perform the necessary actuation. However, current implementations enforce no time limit on adaptation. This may become an issue when executing set_controller: should a controller's initialization or clean-up take too long, the controller will not be executed in time, which may affect the system's stability.

I3) When switching controller, the previous and new controllers may conflict with each other. For example, the LeakLocalization controller includes a periodic task to transmit alert beacons that may still operate when the Landing mode is possibly initialized. This would mean the drone keeps beaconing also when it lands, which may affect the system's correctness. Asynchronous operations, such as interrupt handlers firing while switching controller, may create similar issues.

As we argued earlier, these problems are often overlooked by CPS developers. Issue I1 impacts the quality of implementations, whereas issues I2 and I3 potentially affect dependability. Addressing these issues, however, is also not trivial without proper programming support. For example, issue I3 often emerges as developers intentionally overlap initialization and clean-up operations to increase parallelism when performing I/O operations. Solving issue I1 by simply switching the ordering of clean-up and initialization decreases parallelism, possibly prolonging the time required for switching controllers and thus exacerbating issue I2. To remedy this, developers implement hand-crafted solutions to regulate the time for switching controllers, further impacting the quality of implementations, making issue I1 even worse.

III. COP-C++

Context-oriented programming (COP) [10] is a paradigm to simplify the implementation of adaptive software. It fosters a strict separation of concerns, achieved through two key notions: i) the different situations where the software needs to operate are mapped to different contexts, and ii) the context-dependent behaviors are encapsulated in layered functions, that is, functions whose behavior changes—transparently to the caller—depending on the active context.

COP already proved effective in creating adaptive software in mainstream applications, such as user interfaces [13] and text editors [12]. To that end, COP extensions of popular high-level languages, such as Java and Erlang, emerged [12], [22]. Embedding COP extensions within an existing language, however, often relies on features such as reflection and dynamic binary loading, which are difficult to implement on resource-constrained platforms, such as those employed in CPS.

A. Context-oriented C++

To address issue I1 in Sec. II, we embed the key COP abstractions within C++, as it arguably represents a large fraction of the existing codebase in CPS. The resulting language, called COP-C++, retains the original C++ semantics with the addition of custom semantics and keywords. We focus on local adaptation, while any distributed functionality is orthogonal to our efforts. Indeed, our approach can be used with any middleware for adaptation in distributed systems; for example, iLand [8].

For simplicity, we illustrate the language through examples here. The full grammar is publicly available together with the corresponding tool-chain.¹

Context groups and individual contexts. Similar to previous work [1], we group together contexts sharing common characteristics, such as the functionality provided or the environment dimension that drives the transition between contexts. In the example of Sec. II, individual contexts would map to the individual controllers in Fig. 1. These contexts would be grouped together as they all provide control functionality.

¹ https://bitbucket.org/neslabpolimi/cop_cpp_translator
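Before turning to the concrete syntax, the group-and-dispatch scheme just described can be emulated in plain C++. The sketch below is our own illustration, not the translator's actual output; all class and member names are ours, and it assumes a lazy-style activation in which clean-up and initialization never overlap:

```cpp
// Illustrative sketch (our assumption, NOT generated by the COP-C++
// tool-chain): a context group with one layered function, emulated in
// plain C++. The group hides the individual contexts and dispatches
// step() to whichever context is currently active.
#include <cassert>

struct Context {
    virtual bool initialize() { return true; }  // predefined activation hook
    virtual void cleanup() {}                   // predefined deactivation hook
    virtual void step() = 0;                    // the layered function
    virtual ~Context() = default;
};

// Stand-ins for two of the controllers of Sec. II; the counters only
// make the dispatch observable.
struct Hovering : Context {
    int steps = 0;
    void step() override { ++steps; }
};
struct Landing : Context {
    int steps = 0;
    void step() override { ++steps; }
};

class FlightControlGroup {
    Context* active_ = nullptr;
public:
    // lazy-style activation: clean up the previous context before
    // initializing the new one, so the two never overlap in time
    bool activate(Context& next) {
        if (active_) active_->cleanup();
        active_ = next.initialize() ? &next : nullptr;
        return active_ != nullptr;
    }
    // dispatched layered call: a no-op while no context is active
    void step() { if (active_) active_->step(); }
};
```

A caller only ever sees FlightControlGroup::step(); which body actually runs depends on the activation history, which is the essence of the layered-function dispatch the translator generates statically.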
1  context group FlightControlGroup {
2    context Hovering; context Navigate; context LeakLoc;
3    context Landing;
4   public:
5    layered void step() = 0;};

Fig. 3: Example definition of context group.

1  context LeakLoc : private FlightControlGroup {
2   public:
3    LeakLoc(): _ticker(new Ticker()) {};
4    virtual ~LeakLoc();
5   private:
6    layered void step() {// ... controller functionality}
7    bool initialize(){_ticker->attach(&broadcast, 0.3);}
8    void cleanup() {_ticker->detach();}
9    void broadcast(){// broadcast routine}
10   Ticker* _ticker;};

Fig. 4: Example implementation of individual context.

Fig. 3 shows the specification of a context group in the application of Sec. II using COP-C++, which extends the notion and syntax of C++ classes. Context groups are declared with the keyword context group, as shown in line 1. They include the declaration of the contexts that belong to the group, as in lines 2 to 3. In addition, they specify the layered functions that the group offers to other classes. These are indicated using the layered keyword, as in line 5, and implemented differently by the individual contexts.

Only one context is active inside a group at every point in time; the active context is in charge of executing the corresponding layered functions. As a result, the context group acts as a container that hides the individual contexts from the users of a specific functionality. This helps developers decouple context-dependent functionality from their use. Further, it makes it possible to statically generate the code to dispatch layered function calls to the corresponding implementation in the active context, ameliorating the need for advanced language features hardly feasible on resource-constrained platforms.

In COP-C++, individual contexts also extend the notion and syntax of C++ classes. Fig. 4 shows an example based on the application of Sec. II. The context keyword in line 1 specifies that this class is an individual context, part of the FlightControlGroup it inherits from. In line 6, the programmer implements the context-dependent behavior for the layered function step defined in the corresponding context group. In every individual context, programmers may add initialization and clean-up routines with the predefined initialize() and cleanup() methods, as in lines 7 and 8. These are useful when starting a new controller or stopping the currently executing one.

Adaptation. Using COP-C++, enacting an adaptation decision simply amounts to switching between the controllers. For example, the command activate FlightControlGroup::Hovering fast within 5ms performs the change of the currently-executing controller, including initialization and clean-up. The specific scheduling of these operations depends on the qualifiers lazy and fast[within] in the same lines, whose semantics we explain next. The call to function FlightControlGroup::step() is then transparently dispatched to the currently active context.

Compared to plain C/C++, as shown in Fig. 2, the code appears much simplified. No global variables are necessary to keep track of the current controller. No cumbersome switch statements are required, either. Only a single call to the step function appears in the code, which is automatically dispatched to the active context. The interleaving of controllers' initialization and clean-up while switching, as well as the corresponding time semantics, are completely encapsulated in the aforementioned qualifiers, as explained next.

B. Qualifiers

As described in Sec. II, functionality meant to operate in different situations may conflict with each other during the switch, as per issue I3. In addition, potential issues may emerge as current implementations enforce no time limit on the execution of adaptation decisions, as per I2. To help programmers cope with these issues, the qualifiers lazy and fast[within] indicate different modes to enact adaptation decisions and possibly specify time constraints.

Modes. Fig. 5 illustrates the operations performed during a context switch using lazy or fast. Using lazy, the clean-up of the previous context needs to complete before initializing the new context. As a result, no functional conflicts may ever arise. However, during the switch, the system rests in an uncertainty state where no context is active. A call to a layered function within this time results in no operation. In addition, as no parallel execution occurs, the latency grows as the sum of the time to clean-up from the previous context and the time to initialize the new one.

Fig. 5: Lazy and fast activation of a new context. [Timeline diagram: with lazy, the LeakLoc clean-up completes before the Landing initialize; with fast, the Landing initialize starts before the LeakLoc clean-up; in both cases the switch spans from "adaptation starts" to "new context is operational".]

In contrast, as shown in Fig. 5, using fast the system first initializes the new context, then performs the clean-up of the previous one. If some operations inside initialize are non-blocking and may be asynchronously executed, such as those involving I/O operations, fast allows the system to increase parallelism. As a result, the time to apply an adaptation decision reduces, yet programmers must take additional care to avoid functional conflicts between contexts during the switch. There is indeed a time where both contexts might possibly be executing simultaneously.

Deadlines and rapid context switches. To control the time invested in switching between contexts, programmers may define an optional activation deadline. Say, for example, that in the scenario of Sec. II the context switch needs to complete within T ms to let the new controller perform the actuation within the next (10 − T) ms. The optional qualifier within allows programmers to specify the upper bound T on the time to switch contexts, as shown in Sec. III-A. Should the upper bound be violated, the initialization is interrupted and the programmer is notified through a callback, which can be used to implement application-specific countermeasures.

All the activate commands are placed in a queue that is asynchronously checked by the system. The latter pulls out the
first activate command and executes it. In an emergency situation, such as a collision threat, programmers may want to switch the context immediately. To this end, an instruction activate FlightControlGroup::Landing immediately cancels all pending activate commands and immediately switches to the Landing context.

The qualifiers we discuss here also naturally apply across multiple context groups, defined as explained in Sec. III-A, in applications with parallel adaptive controllers.

Name (Definition)                  Description
Length (LTH = OP + OD)             How long the program is, in terms of "words", where a "word" is an operator or an operand.
Vocabulary (VOC = UOP + UOD)       The number of unique "words" in the program.
Difficulty (DIF = UOP/2 × OD/UOD)  How difficult it is to understand the program.
Volume (VOL = LTH × log2 VOC)      How much information a reader has to absorb to understand the program semantics.
Effort (EFF = DIF × VOL)           The effort required to recreate, or to initially write, the program.

Fig. 6: Definition of Halstead metrics.
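The queueing behavior of activate and the purge-on-immediately semantics described in Sec. III-B can be sketched in plain C++. This is our own minimal illustration, not the actual COP-C++ runtime; the class, method, and context names are assumptions:

```cpp
// Hedged sketch (ours, not the COP-C++ runtime): pending activate
// commands sit in a queue that the system polls asynchronously; an
// "immediately" activation purges all pending commands and takes
// effect at once.
#include <cassert>
#include <deque>
#include <string>

struct Activation {
    std::string context;  // e.g. "FlightControlGroup::Landing"
    bool fast = false;    // fast vs. lazy scheduling of the switch
};

class ActivationQueue {
    std::deque<Activation> pending_;
    std::string active_;
public:
    // enqueue a regular activate command
    void activate(Activation a) { pending_.push_back(std::move(a)); }

    // called asynchronously by the system: serve one pending command
    void poll() {
        if (pending_.empty()) return;
        active_ = pending_.front().context;
        pending_.pop_front();
    }

    // emergency path: cancel everything pending, switch right away
    void activate_immediately(std::string context) {
        pending_.clear();
        active_ = std::move(context);
    }

    const std::string& active() const { return active_; }
    std::size_t pending() const { return pending_.size(); }
};
```

The key design point mirrored here is that an emergency switch does not merely jump the queue; it empties it, so that no stale adaptation decision can fire after the emergency context becomes operational.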
C. Translator

We implement the translator as an extension to the CDT plug-in for Eclipse. It allows programmers to translate from COP-C++ to pure C++ and to rely on standard tool-chains to generate executable binaries. First, the translator ensures the consistency of the context-oriented design. For example, a context may fail to implement a layered function, or may not belong to any context group; in such cases the translation stops and the developer is informed. Second, the generated source code remains human-readable; programmers can modify it to perform further fine-grained optimizations and tuning.

Metric                    PURE C++    TIME C++    COP-C++
Operators count (OP)      981         1068        750
Unique operators (UOP)    32          33          36
Operands count (OD)       439         478         383
Distinct operands (UOD)   121         124         101
Program length (LTH)      1420        1546        1133
Program vocabulary (VOC)  153         157         137
Difficulty (DIF)          58          61          68
Volume (VOL)              10305.49    11227.48    8042.07
Effort (EFF)              597718.46   687926.5    546860.78

Fig. 7: Halstead metrics applied to the COP-C++ implementation of the application in Sec. II against baselines.
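As a sanity check on Fig. 7, the derived metrics of Fig. 6 can be computed directly from the four basic counts. The sketch below is our own illustration, not the paper's measurement tooling; it reproduces the PURE C++ column of Fig. 7 within rounding:

```cpp
// Our own illustration of the derived Halstead metrics of Fig. 6,
// checked against the PURE C++ counts reported in Fig. 7.
#include <cassert>
#include <cmath>

struct Halstead {
    double OP, UOP, OD, UOD;  // total and unique operator/operand counts
    double length()     const { return OP + OD; }                    // LTH
    double vocabulary() const { return UOP + UOD; }                  // VOC
    double difficulty() const { return (UOP / 2.0) * (OD / UOD); }   // DIF
    double volume()     const { return length() * std::log2(vocabulary()); } // VOL
    double effort()     const { return difficulty() * volume(); }    // EFF
};
```

Plugging in the PURE C++ counts {OP = 981, UOP = 32, OD = 439, UOD = 121} yields LTH = 1420, VOC = 153, DIF ≈ 58, and VOL ≈ 10305, matching the corresponding rows of Fig. 7.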
IV. EVALUATION

We assess the effectiveness of COP-C++ along several dimensions. Sec. IV-A quantitatively demonstrates the benefits of COP-C++ in complexity of the implementations and required programming effort. Such benefits come at a price of processing and memory overhead, which we report in Sec. IV-B. We compare the performance of different combinations of qualifiers for context switch in Sec. IV-C, whereas Sec. IV-D shows the programming effort required to manually cope with functional conflicts when using fast switching.

Our evaluation targets modern 32-bit ARM Cortex M MCUs that are often employed in CPS. We employ STM Nucleo prototyping boards equipped with Cortex M3 MCUs running at 32 MHz and with 80 KBytes of RAM. The architecture of these is similar, if not the same, to devices employed in real-world CPS applications, while the board also offers convenient testing facilities that enable the kind of fine-grained measurements we discuss next.

As input to our evaluation, we implement the gas leak localization application described in Sec. II using COP-C++. We use two functionally-equivalent implementations as baselines. The first one uses pure C++ by following the structure of an original ArduPilot-like implementation, discussed in Sec. II. We call this implementation PURE C++. The second baseline is similar to PURE C++, with the addition of manually-implemented functionality to control the time for switching controllers, that is, the same functionality the within qualifier provides declaratively. Such a baseline is instrumental to examine the difference between the manual implementation and the automatic generation of this functionality. We call this implementation TIME C++. We implement all three versions of the application using the mBed [15] libraries provided by STM. We use the standard ARM gcc tool-chain for compiling.

A. Complexity and Effort

We compare the complexity and effort for the three implementations using Halstead metrics [9]. These are intended to investigate the properties of source code independently of the programming language. Halstead [9] uses four basic metrics: the total number of operators (OP), the total number of operands (OD), the number of unique operators (UOP), and the number of unique operands (UOD). An operator is a language-specific keyword, whereas variables and constants are operands. For example, in Fig. 4, layered is an operator, whereas _ticker is an operand. Based on the four basic metrics, other metrics are derived as shown in Fig. 6.

Results. Fig. 7 reports the values of the Halstead metrics for the three aforementioned implementations. The Difficulty in a COP-C++ implementation is slightly higher than for the baselines. This is due to the additional keywords we add to define context groups and individual contexts, as well as the use of qualifiers.

On the other hand, COP-C++ reduces the Volume of the program; therefore, maintenance and debugging should be facilitated, as programmers need to absorb less information to understand a program. Programmers also spend less Effort to realize the program in the first place. The benefits of COP-C++ in these regards become even more apparent when considering TIME C++. In this case, both the Volume and Effort further increase, amplifying the benefits of COP-C++.

B. Processing and Memory Overhead

The benefits above incur a run-time cost in processing time and memory occupation. To assess these, we separately compare a COP-C++ implementation that only uses the fast qualifier against PURE C++, and a COP-C++ implementation that also employs the within qualifier against TIME C++. The original Ardupilot implementation only uses a kind of controller switch similar to the semantics of the fast qualifier, so we do not study here the run-time overhead for the lazy one. We investigate this in Sec. IV-C.

According to Fig. 1, different environmental events may trigger the adaptation. We emulate these events on the Nucleo board as external interrupts through GPIO pins. We use a Tektronix TBS 1072B-EDU oscilloscope attached to the board to measure the time to perform the controller switch. Memory usage statistics are obtained from the mBed [15] on-line IDE.

Fig. 8: Processing time for switching controller depending on the external event. [Bar chart, y-axis in µSec, comparing COP-C++ against PURE C++ and COP-C++ (within) against TIME C++ for each triggering event of Fig. 1, e.g., arrived, alert beacon, low and high gas concentration, and emergencies during navigation, hovering, and leak localization.]

Fig. 9: Time to handle asynchronous events, either using standard handlers or wrappers. [Bar chart, y-axis in µSec, for the Ticker, Timeout, and InterruptIn classes.]

Parameter            COP-C++     COP-C++ with wrappers
Operators count      750         1237
Distinct operators   36          52
Operands count       383         646
Distinct operands    101         160
Program length       1133        1883
Program vocabulary   137         212
Difficulty           68          104
Volume               8042.07     14551.67
Effort               546860.78   1513374.12

Fig. 10: Halstead metrics for the application of Sec. II using COP-C++ with and without manually-implemented wrappers to cope with functional conflicts during a context switch.

Results. Fig. 8 shows the processing time to switch controller depending on the external event. Adaptation in COP-C++ takes slightly more time—approximately 11 µSec—compared to PURE C++. The absolute values vary due to different initialization routines; for example, periodic beaconing is only initialized when LeakLocalization is activated by an alert beacon or a high gas concentration. With the within qualifier, the processing overhead compared to TIME C++ reaches 20 µSec. Such a penalty, however, is still almost unnoticeable, as the typical control loop runs at hundreds of Hz.

COP-C++ shows a mere 200 B RAM overhead, which is negligible compared to both baselines that consume 2.1 kB of RAM. RAM consumption is often an issue when developing CPS software; therefore, minimizing the impact on this figure is key. On the other hand, the program memory usage using COP-C++ appears 14.6 kB higher than in the baselines that use 20 kB. This is mainly due to the simplified implementation of the control loops we employ for this study, where only basic functionality is included and platform-specific libraries are replaced with empty stubs. Functionally-complete implementations are much larger; for example, the full ArduPilot [2] requires 792 KB of program memory. A major fraction of these is not processed by our translator; therefore, we expect the relative overhead in program memory to amortize.

C. Qualifiers

As the example application in Sec. II only uses the fast qualifier, we quantitatively study here the trade-offs between the adaptation qualifiers in COP-C++, described in Sec. III-B. The memory overhead is the same independent of the specific combination of qualifiers that appear in the code, and corresponds to the values shown in Sec. IV-B. Therefore, here we focus on the latency to perform the context switch depending on the combination of qualifiers. The experimental setup is the same as in Sec. IV-B.

Results. Our investigations show that the lazy qualifier requires 33.6 µSec to complete the context switch. This latency is the price programmers pay to ensure that no functional conflicts arise during the switch. To reduce this time, programmers can use the fast activation type, which requires only 21.4 µSec without the optional qualifier within. This choice, however, requires programmers to handle potential functional conflicts by hand, increasing the programming effort. We investigate this aspect in Sec. IV-D.

Using the optional within or immediately qualifier only marginally increases the latency in switching context. As for the former, the latency grows by 10 µSec because of the additional processing required to initialize a dedicated timer that fires if the switch takes too much time. In the latter case, the additional processing time amounts to only ≈1 µSec, required to purge the queue of pending context switches. In both cases, the absolute values are very limited, and they should not impact the timings of control loops that typically run with periods that are orders of magnitude larger.

D. Development Trade-offs

Using the fast qualifier may result in functional conflicts because the initialization and clean-up routines of different contexts overlap in time. Programmers need to handle these conflicts by hand. To assess the additional programming effort required, we implement a new version of the application in Sec. II with the addition of simple wrappers around any of the classes where asynchronous events may fire during a fast context switch. These include timers and classes that signal hardware interrupts. The wrappers intercept any such asynchronous event and forward it further only after checking that the active context corresponds to the one the event is addressed to. Cleaner, yet more complex solutions are also possible. Considering simple wrappers provides a lower bound in terms of the additional programming effort.

We assess the added programming effort by re-calculating the Halstead metrics of Sec. IV-A on the new implementation. Further, the wrappers obviously add latency to the context switch. We measure this with the same setup of Sec. IV-B.

Result. As shown in Fig. 9, the wrappers add a mere ≈2 µSec in latency during a context switch. Thus, their performance impact is negligible. However, the complexity of the implementation increases considerably, as reported in Fig. 10. Programmers need to invest significant efforts not only in implementing the wrappers, but also in nailing down all the classes that could possibly lead to functional conflicts, and providing a wrapper for each of these. Fig. 10 reports a considerable increase in all the complexity metrics when wrappers are used. For example, the Volume of the source code increases by 80%, the source code is 53% more Difficult, and requires almost 3 times the Effort to be written.

V. RELATED WORK

Time-sensitive software adaptation in CPS is a multi-faceted problem. Albeit comprehensive programming support largely
lacks, works exist that tackle individual aspects.

Parameter and component configurations. Adjusting the software operating parameters is one way to adapt. For example, Garcia-Valls et al. [8] focus on the reconfiguration of the nodes in a distributed soft real-time system that meets stated performance goals. In the area of adaptive controllers, Mokhtari et al. [17] and Frew et al. [6] tune the operation of unmanned aerial vehicles (UAVs) based on sensor inputs. These approaches focus on adapting a specific fraction of the system's functionality, for example, motors' parameters or nodes' configuration, and cannot be applied where the whole control logic must be changed on a single node. Our work does not focus on the mechanism to adapt a specific functionality, but provides generic programming support for implementing adaptive CPS software under time constraints.

In component-based systems, software reconfiguration often occurs by plugging components in/out or by changing component wirings [21]. Dedicated component models exist that allow developers to verify—using formal techniques such as model-checking—the correctness of new component configurations [20]. Some of these works focus on specific application domains such as automotive [26] and autonomous underwater vehicles [16]. Unlike our work, these approaches offer no programming support to deal with enacting time-sensitive adaptation decisions.

Programming support for adaptation. Software adaptation for traditional platforms is extensively studied. Some of the works explicitly focus on programming support. For example, COP [10] itself was implemented in a number of high-level languages [3], [12], [22], [23]. The techniques normally used to embed COP abstractions into a host language tend to be impractical in CPS because of resource constraints. Similar observations apply to Meta- and Aspect-oriented programming (AOP) [22]. The corresponding abstractions often require self-modification of the deployed binaries [14], which is hard to implement on resource-constrained platforms. Our work renders COP concepts amenable for implementation on typical CPS devices, while adding semantics useful when enacting time-constrained adaptation decisions.

The need to provide programming support for time-sensitive adaptation was also recognized in the area of real-time operating systems (RTOSes). Dedicated programming abstractions based on reflection were added to existing RTOSes [19], or specific services were made available that perform the needed reconfiguration in a safe manner [5]. These

… from COP and realized them in a way that is feasible on resource-constrained devices, while adding semantics to govern the time aspects during the adaptation process. We implemented a dedicated translator from COP-C++ to pure C++. Our quantitative evaluation shows that COP-C++ simplifies implementations of paradigmatic CPS functionality while reducing programmers' effort, at the price of a modest run-time processing and memory overhead. For example, processing overhead in our experiments is limited to tens of µSec, while RAM overhead is negligible. Program memory overhead, on the other hand, should be amortized with the increasing size of implementations.

REFERENCES

[1] M. Afanasov et al. Context-oriented programming for adaptive wireless sensor network software. In Proc. of DCOSS, 2014.
[2] ArduPilot. www.ardupilot.com.
[3] J. E. Bardram. The Java context awareness framework (JCAF) – A service infrastructure and programming framework for context-aware applications. In Proc. of PERVASIVE, 2005.
[4] T. R. Bretschneider and K. Shetti. UAV-based gas pipeline leak detection. In Proc. of ARCS, 2015.
[5] Y. Eustache and J. P. Diguet. Reconfiguration management in the context of RTOS-based HW/SW embedded systems. J. Emb. Sys., 2008.
[6] E. W. Frew et al. Adaptive receding horizon control for vision-based navigation of small unmanned aircraft. In Proc. of ACC, 2006.
[7] S. Fritsch et al. Time-bounded adaptation for automotive system software. In Proc. of ICSE, 2008.
[8] M. García-Valls et al. iLand: An enhanced middleware for real-time reconfiguration of service oriented distributed real-time systems. IEEE Transactions on Industrial Informatics, 9(1):228–236, 2013.
[9] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). 1977.
[10] R. Hirschfeld et al. Context-oriented programming. Journal of Object Technology, 2008.
[11] M. Jackson. The world and the machine. In Proc. of ICSE, 1995.
[12] T. Kamina et al. EventCJ: A context-oriented programming language with declarative event-based context transition. In Proc. of AOSD, 2011.
[13] R. Keays et al. Context-oriented programming. In Proc. of MobiDe, 2003.
[14] G. Kiczales et al. Aspect-oriented programming. In Proc. of ECOOP, 1997.
[15] mBed on-line IDE. developer.mbed.org.
[16] C. McGann et al. Adaptive control for autonomous underwater vehicles. In Proc. of AAAI, 2008.
[17] A. Mokhtari and A. Benallegue. Dynamic feedback controller of Euler angles and wind parameters estimation for a quadrotor unmanned aerial vehicle. In Proc. of ICRA, 2004.
[18] OpenROV. www.openrov.com.
[19] A. Patil and N. Audsley. An application adaptive generic module-based reflective framework for real-time operating systems. In Proc. of RTSS, 2004.
[20] F. J. Rammig et al. Designing self-adaptive embedded real-time software – towards system engineering of self-adaptation. In Proc. of SBESC, 2014.
[21] J. C. Romero and M. García-Valls. Scheduling component replacement
attempts utilize language- or operating system-specific features for timely execution in dynamic systems. Software: Practice and
that are often not available in typical CPS platforms, because Experience, 44(8):889–910, 2014.
of resource-constraints. Target platforms of these approaches [22] G. Salvaneschi et al. Context-oriented programming: A software
are either the traditional computing machines [8], [21] or engineering perspective. J. Syst. Softw., 2012.
FPGAs [5], which greatly surpass CPS platforms in terms [23] S. Sehic et al. COPAL-ML: A macro language for rapid development
of context-aware applications in wireless sensor networks. In Proc. of
of available resources and energy consumption. Our solution, SESENA, 2011.
instead, is designed for resource-constrained devices, does not [24] J. A. Stankovic et al. Real-time communication and coordination in
require any language-specific features such as reflection, and embedded sensor networks. Proceedings of the IEEE, 91(7), 2003.
remains decoupled from the the underlying operating system. [25] J. A. Stankovic et al. Opportunities and obligations for physical
computing systems. IEEE Computer, 38(11), 2005.
VI. C ONCLUSION [26] M. Trapp et al. Runtime adaptation in safety-critical automotive
systems. In Proc. of SE, 2007.
We presented COP-C++, an extension to C++ we ex-
pressly conceived to simplify the implementation of time-
sensitive CPS software. To that end, we borrowed concepts
FLOPSYNC-QACS: Quantization-Aware Clock Synchronization for
Wireless Sensor Networks
Abstract—The development of distributed real-time systems often relies on clock synchronization. However, achieving precise synchronization in the field of Wireless Sensor Networks (WSNs) is hampered by competing design challenges, such as energy consumption and cost constraints, e.g., in Internet of Things applications. For these reasons many WSN hardware platforms rely on a low-frequency clock crystal to provide the local timebase. Although this solution is inexpensive and allows for a remarkably low energy consumption, it limits the resolution at which time can be measured. The FLOPSYNC synchronization scheme provides low-energy synchronization that takes into account the quartz crystal imperfections. The main limitation of the approach is the effect of quantization. In this paper we propose a clock synchronization scheme that explicitly takes into account the quantization effects caused by a low-frequency clock crystal, thus addressing the clock synchronization issue in cost-sensitive WSN node platforms. The solution adopts switched control for minimizing the effect of quantization, with minimal overhead. We provide experimental evidence that the approach manages to reach a synchronization error of at most 1 clock tick in a real WSN.

This work was partially supported by the European Commission under the project UnCoVerCPS with grant number 643921, and by the Swedish Foundation for Strategic Research under the project "Future factories in the cloud (FiC)" with grant number GMT14-0032.

I. INTRODUCTION

During the past decade, the Internet of Things (IoT) has gained significant attention both in academia and industry. The basic idea of IoT is to connect objects to the internet and make them communicate with each other. According to a recent study by Gartner [4], the current count of IoT devices is around 6.4 billion devices (not including smartphones, tablets, and computers), and it is expected to grow up to 21 billion by 2020. In such a scenario, wireless communication technologies will play a major role, and in particular, Wireless Sensor Networks (WSNs) will proliferate in many application domains. Indeed, the small, inexpensive and low-power WSN nodes can be an enabling technology for IoT [8, 13].

Among the various challenges, time synchronization is one of the most important for the correct operation of WSNs [23, 27]. In particular, it allows for successful communication between the different nodes, but also for location and proximity estimation [14], energy efficiency [11, 21], and mobility [22]. In general, WSN node coordination is crucial, especially when real-time operations must be executed. Since many IoT devices are battery powered, and time synchronization is continuously executed, energy-efficient architectures and protocols are necessary [1]. Moreover, since the cost of the nodes is relevant and high-precision time synchronization may not be needed in some IoT applications, the use of low-resolution clock oscillators is often a viable choice, see e.g. [17, 18]. Clearly, the clock resolution affects the minimum achievable synchronization error, and protocols able to push the performance to this limit are needed. In this paper, we propose a synchronization mechanism based on a switched control policy aimed at minimizing the effect of quantization on the synchronization error, with minimal overhead. The following work is cast in the framework of multi-hop master-slave clock synchronization, i.e., the network is composed of a master and a number of slaves which synchronize their clocks to the master's clock.

The rest of the paper is organized as follows. Section II presents a technological view on the clock synchronization problem. Section III describes the proposed switched control scheme, while Section IV presents simulation and experimental results. Section V concludes the paper.

II. THE SYNCHRONIZATION PROBLEM

A master-slave clock synchronization scheme works by disseminating the timing information using packets transmitted by the radio transceiver of the WSN nodes. In simple topologies, such as star networks, synchronization packets can be transmitted by the master node. In multi-hop networks, flooding schemes [3, 15] allow for the dissemination of synchronization packets also to nodes that are not directly in the radio range of the master node. Packets may contain a timestamp with the time of the master node, or this information can be implicit if packets are transmitted periodically over a contention-delay-free MAC [24]. A clock skew compensation scheme [10, 19, 28] is often used to minimize the divergence of the nodes' local clocks in between synchronizations. Drift compensation is also possible, accounting for common error causes such as temperature variations [20, 24]. Finally, propagation delay estimation and compensation [12, 25] can be employed for ultra-high-precision synchronization.
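The skew-compensation idea mentioned above can be sketched as follows. This is a generic linear clock model fitted from two synchronization points, given purely as an illustration: it is not the specific scheme of FLOPSYNC or of any cited work, and all numbers are made up.

```python
# Generic clock-skew compensation sketch (illustrative only).
# The slave records the local arrival time of each master sync packet
# and fits a linear clock model: local ≈ offset + (1 + skew) * master.
sync_master = [0.0, 10.0, 20.0]      # master send times of sync packets (s)
sync_local = [0.5, 10.502, 20.504]   # local receive times (s): 0.5 s offset,
                                     # 200 ppm skew, zero propagation delay

# Estimate the skew from the last two synchronization points.
skew = (sync_local[-1] - sync_local[-2]) / (sync_master[-1] - sync_master[-2]) - 1.0

def to_master_time(t_local):
    """Map a local timestamp to estimated master time, correcting offset
    and skew relative to the most recent synchronization point."""
    return sync_master[-1] + (t_local - sync_local[-1]) / (1.0 + skew)

print(round(skew * 1e6))                 # estimated skew in ppm
print(round(to_master_time(25.505), 3))  # local 25.505 s maps back to master 25.0 s
```

Between synchronizations the slave extrapolates with the corrected rate, which is what bounds the divergence of the local clocks mentioned above.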
Despite the many different features a master-slave clock
[Figure 2: FLOPSYNC control scheme with quantizers.]
u(k + 1) = u(k) + ⌊e(k)⌋ − α ⌊e(k + 1)⌋   (4)

where α is the only design parameter. Figure 2 shows the resulting control scheme, where P is the dynamic process (3), and R is the controller (4). Plugging (4) into (3), we get that:

e(k + 1) = e(k) + d(k) + ρ(u(k − 1) + ⌊e(k − 1)⌋ − α ⌊e(k)⌋)

In [10] the FLOPSYNC control scheme was designed by ignoring the quantizers. More specifically, when no quantization is in place, inverting (3) leads to:

e(k + 1) = (1 − α)e(k) + e(k − 1) + u(k − 1) + d(k)
         = (2 − α)e(k) + d(k) − d(k − 1)

which corresponds to an asymptotically stable linear system if 1 < α < 3. For a constant disturbance, i.e., d(k) = d(k − 1) = d, the scheme guarantees the convergence of the synchronization error to zero, with a rate of convergence that depends on the parameter α.

When quantizers are in place, on the other hand, the synchronization error still converges towards zero, but, intuitively, it is not possible to distinguish errors below the clock resolution. Moreover, the disturbance d is integrated over time according to (3). The integrated residual disturbance is not detectable on the quantized output ⌊e⌋ as long as it is smaller than the clock resolution. This causes the controller to react whenever the quantization of the integrated residual disturbance switches to 1 or −1, steering the controlled system to a limit cycle of amplitude 2. An example of this effect is shown in Figure 3, with α = 1.5, d(k) = d = √2, and the control system initialized as e(0) = 2, u(0) = 0.

This paper proposes a switched control scheme that re-

In this section, we propose a switched variant of the classical FLOPSYNC controller which improves its performance in the presence of quantization effects; it is called FLOPSYNC-QACS.

The controller is composed of a linear part:

ũ(k + 1) = ⌊e(k)⌋ − α ⌊e(k + 1)⌋   (5)

and of a switched part, where the control action ũ is set as the input to the following modified discrete-time integrator:

u(k + 1) = u(k) + ũ(k + 1),      if ⌊e(k + 1)⌋ ≠ 0
u(k + 1) = ρ(u(k)) + ũ(k + 1),   if ⌊e(k + 1)⌋ = 0

which finally computes the actual corrective action u, based on the quantized synchronization error measurement ⌊e⌋. Figure 4 shows the resulting switched control scheme, with R̃ being (5), P being the synchronization error dynamics (3), and z⁻¹ being a unit delay operator in discrete time.

[Figure 4: Control scheme of FLOPSYNC-QACS.]

The switched control system dynamics is characterized by e as per (3), and by u, which is computed as:
• if ⌊e(k + 1)⌋ = ⌊e(k) + ρ(u(k)) + d(k)⌋ = 0, then:

u(k + 1) = ρ(u(k)) + ⌊e(k)⌋   (6)
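The two update laws above can be transcribed directly; the sketch below is a hypothetical Python rendering of Eqs. (4)–(5) and the switch, where ⌊·⌋ is the floor quantizer of the scheme, and the actuator quantizer ρ(·) — whose exact definition is not reproduced in this excerpt — is modeled, as an assumption, by rounding to the nearest clock tick.

```python
import math

ALPHA = 1.5  # the only design parameter; stability requires 1 < ALPHA < 3

def rho(u):
    # Assumed model of the actuator quantizer ρ(·): round the correction
    # to the nearest clock tick (the paper's exact ρ is not shown here).
    return float(round(u))

def flopsync_update(u, e_k, e_k1):
    # Classical FLOPSYNC controller, Eq. (4):
    #   u(k+1) = u(k) + ⌊e(k)⌋ − α⌊e(k+1)⌋
    return u + math.floor(e_k) - ALPHA * math.floor(e_k1)

def qacs_update(u, e_k, e_k1):
    # FLOPSYNC-QACS linear part, Eq. (5) ...
    u_tilde = math.floor(e_k) - ALPHA * math.floor(e_k1)
    # ... fed into the switched integrator: integrate normally while the
    # quantized error is non-zero; restart from the quantized state ρ(u(k))
    # once the measured error reads zero.
    if math.floor(e_k1) != 0:
        return u + u_tilde
    return rho(u) + u_tilde

# When the quantized error is non-zero the two laws coincide:
print(flopsync_update(0.0, 2.0, 1.0))  # 0 + 2 - 1.5*1 = 0.5
print(qacs_update(0.0, 2.0, 1.0))      # 0.5 as well
# Once ⌊e(k+1)⌋ = 0, QACS restarts the integrator from ρ(u(k)):
print(qacs_update(0.4, 1.0, 0.3))      # ρ(0.4) + ⌊1⌋ = 0 + 1 = 1.0
```

The switch is what suppresses the residual-disturbance limit cycle described above: when the quantized error reads zero, the integrator state itself is snapped to the quantizer grid instead of accumulating a sub-resolution residue.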
[Figure 3: Quantized synchronization error ⌊e(k)⌋ under FLOPSYNC (top) and FLOPSYNC-QACS (bottom), together with the disturbance d(k) and its quantization ρ(d(k)).]

art approaches.

REFERENCES

[1] N. Aakvaag et al. "Timing and Power Issues in Wireless Sensor Networks: An Industrial Test Case". In: ICPP, 2005, pp. 419–426.
services, so-called service chains, which consist of several interconnected virtual network functions. This allows for the capacity to be scaled up and down by adding or removing virtual resources. In this work, we develop a model of a service chain and pose the dynamic allocation of resources as an optimization problem. We design and present a set of strategies to allot virtual network nodes in an optimal fashion subject to latency and buffer constraints.

[Figure 1: Several virtual network functions (VNFs) are connected together to provide a set of services. Packets flow through a specific path in the VNFs (a virtual forwarding graph). The VNFs are mapped onto physical hardware referred to as NFVI.]
1. Introduction
Over the last years, cloud computing has swiftly transformed the IT infrastructure landscape, leading to large cost-savings for deployment of a wide range of IT applications. Some main characteristics of cloud computing are resource pooling, elasticity, and metering. Physical resources such as compute nodes, storage nodes, and network fabrics are shared among tenants. Virtual resource elasticity brings the ability to dynamically change the amount of allocated resources, for example as a function of workload or cost. Resource usage is metered, and in most pricing models the tenant only pays for the allocated capacity.

While cloud technology initially was mostly used for IT applications, e.g. web servers, databases, etc., it is rapidly finding its way into new domains. One such domain is the processing of network packets. Today network services are packaged as physical appliances that are connected together using a physical network. Network services consist of interconnected network functions (NFs). Examples of network functions are firewalls, deep packet inspection, transcoding, etc. A recent initiative from the standardisation body ETSI (European Telecommunications Standards Institute) addresses the standardisation of virtual network services under the name Network Functions Virtualisation (NFV) [1]. The expected benefits from this are, among others, better hardware utilisation and more flexibility, which translate into reduced capital and operating expenses (CAPEX and OPEX). A number of interesting use cases are found in [2], and in this paper we investigate the one referred to as Virtual Network Functions Forwarding Graphs, see Figure 1.

We investigate the allocation of virtual resources to a given packet flow, i.e. what is the most cost-efficient way to allocate VNFs with a given capacity that still provide a network service within a given latency bound? The distilled problem is illustrated by the packet flows in Figure 1. The forwarding graph is implemented as a chain of virtual network nodes, also known as a service chain. To ensure that the capacity of a service chain matches the time-varying load, the number of instances m_i of each individual network function VNF_i may be scaled up or down.

The contributions of the paper are:
• a mathematical model of the virtual resources supporting the packet flows in Figure 1,
• the set-up of an optimization problem for controlling the number of machines needed by each function in the service chain,
• a solution of the optimization problem, leading to a control scheme for the number of machines needed to guarantee that the end-to-end deadline is met for incoming packets under a constant input flow.

Related work

There are a number of well-known and established resource management frameworks for data centers, but few of them explicitly address the issue of latency. Sparrow [3] presents an approach for scheduling a large number of parallel jobs with short deadlines. The problem domain is different compared to our work in that we focus on sequential rather than parallel jobs. Chronos [4] focuses on reducing latency in the communication stack. RT-OpenStack [5] adds real-time performance to OpenStack by usage of a real-time hypervisor and a timing-aware VM-to-host mapping.

The enforcement of an end-to-end (E2E) deadline for a sequence of jobs to be executed through a sequence of computing elements was addressed by several works, possibly
under different terminologies. In the holistic analysis [6], [7], [8] the schedulability analysis is performed locally. At the global level, the local response times are transformed into jitter or offset constraints for the subsequent tasks.

A second approach to guarantee an E2E deadline is to split the constraint into several local deadline constraints. While this approach avoids the iteration of the analysis, it requires an effective splitting method. Di Natale and Stankovic [9] proposed to split the E2E deadline proportionally to the local computation time, or to divide the slack time equally. Later, Jiang [10] used time slices to decouple the schedulability analysis of each node, reducing the complexity of the analysis. Such an approach improves the robustness of the schedule, and allows each pipeline to be analysed in isolation. Serreli et al. [11], [12] proposed to assign local deadlines to minimize a linear upper bound of the resulting local demand bound functions. More recently, Hong et al. [13] formulated the local deadline assignment problem as a Mixed-Integer Linear Program (MILP) with the goal of maximizing the slack time. After local deadlines are assigned, the processor demand criterion was used to analyze distributed real-time pipelines [14], [12].

In all the mentioned works, jobs have non-negligible execution times. Hence, their delay is caused by the preemption experienced at each function. In our context, which is scheduling of virtual network services, jobs are executed non-preemptively and in FIFO order. Hence, the impact of the local computation on the E2E delay of a request is minor compared to the queueing delay. This type of delay is intensively investigated in the networking community in the broad area of queuing systems [15]. In this area, Henriksson et al. [16] proposed a feedforward/feedback controller to adjust the processing speed to match a given delay target.

Most of the works in queuing theory assume a stochastic (usually Markovian) model of job arrivals and service times. A solid contribution to the theory of deterministic queuing systems is due to Baccelli et al. [17], Cruz [18], and Parekh & Gallager [19]. These results built the foundation for network calculus [20], later applied to real-time systems in the real-time calculus [21]. The advantage of network/real-time calculus is that, together with an analysis of the E2E delays, the sizes of the queues are also modelled. In the cloud computing scenario the impact of the queue is very relevant, since it is part of the resource usage which we aim to minimize; hence we follow this type of modeling.

2. Problem formulation

Section 1. We consider a service-chain consisting of n functions F_1, . . . , F_n, as illustrated in Figure 2. Packets are flowing through the service-chain and they must be processed by each function in the chain within some end-to-end deadline, denoted by D_max. A fluid model is used to approximate the packet flow, and at time t there are r_{i−1}(t) ∈ R+ packets per second (pps) entering the i-th function; the cumulative arrived requests for this function is

R_{i−1}(t) = ∫_0^t r_{i−1}(τ) dτ.   (1)

In a recent benchmarking study it was shown that a typical virtual machine can process around 0.1–2.8 million packets per second [22]. Hence, in this work the number of packets flowing through the functions is assumed to be in the order of millions of packets per second, supporting the use of a fluid model.

[Figure 2: Illustration of the service-chain: r_0(t) → F_1 → r_1(t) → F_2 → r_2(t) → · · · → r_{n−1}(t) → F_n → r_n(t).]

2.1. Service model

As illustrated in Figure 3, the incoming requests to function F_i are stored in the queue and then processed once they reach the head of the queue. At time t there are m_i(t) ∈ Z+ machines ready to serve the requests, each with a nominal speed of s̄_i ∈ R+ (note that this nominal speed might differ between different functions in the service chain, i.e. it does not in general hold that s̄_i = s̄_j for i ≠ j). The maximum speed at which function F_i can process requests is thus m_i(t) s̄_i. The rate at which F_i is actually processing requests at time t is denoted s_i(t) ∈ R+. The cumulative served requests is defined as

S_i(t) = ∫_0^t s_i(τ) dτ.   (2)

At time t the number of requests stored in the queue is defined as the queue length q_i(t) ∈ R+:

q_i(t) = ∫_0^t (r_{i−1}(τ) − s_i(τ)) dτ = R_{i−1}(t) − S_i(t).   (3)

Each function has a fixed maximum queue capacity q_i^max ∈ R+, representing the largest number of requests that can be stored at the function F_i.

The queueing delay depends on the status of the queue as well as on the service rate. We denote by D_{i,j}(t) the time taken by a request from when it enters function F_i to when it exits F_j, with j ≥ i, where t is the time when the request exits function F_j:

D_{i,j}(t) = inf{τ ≥ 0 : R_{i−1}(t − τ) ≤ S_j(t)}.

The maximum queueing delay then is D̂_{i,j} = max_{t≥0} D_{i,j}(t). The requirement that a request meets its end-to-end deadline is D̂_{1,n} ≤ D_max.

[Figure 3: Illustration of the structure and different entities of the service chain.]

To control the queueing delay, it is necessary to control the service rate of the function. Therefore, we assume that it is possible to change the maximum service-rate of a function by changing the number of machines that are on, i.e.
changing m_i(t). However, turning on a machine takes ∆_i^on time units, and turning off a machine takes ∆_i^off time units. Together they account for a time delay, ∆_i = ∆_i^on + ∆_i^off, associated with turning a machine on/off.

In 2012 Google profiled where the latency in a data center occurred [4]. They showed that less than 1% (≈ 1 µs) of the latency was due to propagation in the network fabric. The other 99% (≈ 85 µs) occurred somewhere in the kernel, the switches, the memory, or the application. Since it is difficult to say exactly which part of this 99% is due to processing or queueing, we make the abstraction of considering queueing delay and processing delay together, simply as queueing delay. Furthermore, we assume that no request is lost in the communication links, and that there is no propagation delay. Hence the concatenation of the functions F_1 through F_n implies that the input of function F_i is exactly the output of function F_{i−1}, for i = 2, . . . , n, as illustrated in Figure 2.

2.2. Cost model

To be able to provide guarantees about the behaviour of the service chain, it is necessary to make hard reservations of the resources needed by each function in the chain. This means that when a certain resource is reserved, it is guaranteed to be available for utilization. Reserving this resource results in a cost, and due to the hard reservation, the cost does not depend on the actual utilisation, but only on the resource reserved.

The computation cost per time-unit per machine is denoted c_i, and can be seen as the cost for the CPU-cycles needed by one machine in F_i. This cost will also occur during the time-delay ∆_i. Without being too conservative, this time-delay can be assumed to occur only when a machine is started. The average computing cost per time-unit for the whole function F_i is then

J_i^c(m_i(t)) = lim_{t→∞} (c_i / t) ∫_0^t [ m_i(s) + ∆_i · (∂_− m_i(s))_+ ] ds   (4)

where (x)_+ = max(x, 0), and ∂_− m_i(t) is the left derivative of m_i(t):

∂_− m_i(t) = lim_{a→t−} (m_i(t) − m_i(a)) / (t − a),

that is, a sequence of Dirac deltas at all points where the number of machines changes. This means that the left derivative of m_i(t) only adds to the computation cost when it is positive, i.e. when a machine is switched on.

The queue cost per time-unit per storage space for a request is denoted q_i. This can be seen as the cost that comes from the fact that physical storage needs to be reserved such that a queue can be hosted on it; normally this would correspond to the RAM of the network card. Reserving the capacity q_i^max would thus result in a cost per time-unit of

J_i^q(q_i^max) = q_i q_i^max.   (5)

2.3. Problem definition

The aim of this paper is to control the number m_i(t) of machines running in function F_i, such that the total average cost is minimized, while the E2E constraint D_max is not violated and the maximum queue sizes q_i^max are not exceeded. This can be posed as the following problem:

minimize   J = Σ_{i=1}^{n} [ J_i^c(m_i(t)) + J_i^q(q_i^max) ]
subject to D̂_{1,n} ≤ D_max
           q_i(t) ≤ q_i^max,  ∀t ≥ 0,  i = 1, 2, . . . , n   (6)

with J_i^c and J_i^q as in (4) and (5), respectively. In this paper the optimization problem (6) will be solved for a service-chain fed with a constant incoming rate r.

A valid lower bound J^lb on the cost achieved by any feasible solution of (6) is found by assuming that all functions are capable of providing exactly a service rate equal to the input rate r. This is possible by running a fractional number of machines r/s̄_i at function F_i. In such an ideal case, buffers can be of zero size (∀i, q_i^max = 0), and there is no queueing delay (D̂_{1,n} = 0) since the service and arrival rates are the same at all functions. Hence, the lower bound on the cost is

J^lb = Σ_{i=1}^{n} c_i r / s̄_i.   (7)

Such a lower bound will be used to compare the quality of the solution found later on.

In Section 3 we make a general consideration about the on/off scheme of each machine in the presence of a constant input rate r. Later, in Section 4, the optimal design problem (6) is solved.

3. Machine switching scheme

In the presence of an incoming flow of requests at a constant rate r_0(t) = r, a number

m̄_i = ⌊r / s̄_i⌋   (8)

of machines running in function F_i must always stay on. To match the incoming rate r, in addition to the m̄_i machines always on, another machine must be on for some time in order to process a request rate of s̄_i ρ_i, where ρ_i is the normalized residual request rate:

ρ_i = r / s̄_i − m̄_i,   (9)

where ρ_i ∈ [0, 1).

In our scheme, the extra machine is switched on at a desired on-time t_i^on:
• off → on: function F_i switches on the additional machine when the time t exceeds t_i^on.
Since the additional machine does not need to be always on, it can be switched off after some time. The off-switching is also based on a time-condition, the desired stop-time t_i^off, i.e. the time instant at which the machine should be switched off, given by

t_i^off = t_i^on + T_i^on,

where T_i^on is the duration that the machine should be on for, and something that needs to be found. The off-switching is then triggered in the following way:
• on → off: function F_i switches off the additional machine when the time t exceeds t_i^off.
Note that this control scheme, together with the constant input, results in the extra machine being switched on/off
periodically, with a period T_i. We thus assume that the extra machine can process requests for a time T_i^on every period T_i. The time during each period where the machine is not processing any requests is denoted T_i^off = T_i − T_i^on. Notice, however, that the actual time the extra machine is consuming power is T_i^on + ∆_i, due to the time-delay for starting a new machine.

In the presence of a constant input, it is straightforward to find the necessary on-time during each period: in order for the additional machine to provide the residual processing capacity of r − m̄_i s̄_i, its on-time T_i^on must be such that

T_i^on s̄_i = T_i (r − m̄_i s̄_i),

which implies

T_i^on = T_i ρ_i,   T_i^off = T_i − T_i^on = T_i (1 − ρ_i).   (10)

With each additional machine being switched on/off periodically, it is also straightforward to find the computation cost for each function. If m̄_i + 1 machines are on for a time T_i^on, and only m̄_i machines are on for a time T_i^off, then the cost J_i^c of (4) becomes

J_i^c = c_i ( m̄_i + (T_i^on + ∆_i) / T_i ) = c_i ( m̄_i + ρ_i + ∆_i / T_i )   (11)

if T_i^off ≥ ∆_i. If instead T_i^off < ∆_i, that is if

T_i < T̄_i := ∆_i / (1 − ρ_i),   (12)

then there is no time to switch the additional machine off and then on again before the new period starts. Hence, we keep the last machine on, even if it is not processing packets, and the computing cost becomes

J_i^c = c_i ( m̄_i + ρ_i + T_i^off / T_i ) = c_i (m̄_i + 1).   (13)

Next, using this control scheme, the optimization problem (6) will be studied and solved under the assumption that every function switches its additional machine on/off with the same period, T.

4. Design of machine-switching period

In this section we solve the optimization problem (6) under the assumption of a constant input. In order to somewhat reduce the complexity of the solution we also make the assumption of letting every function switch its additional machine on/off with the same period, T_i = T. The common period T of the schedule, by which every function switches its additional machine on/off, is the only design variable in the optimization problem (6). In Lemma 1 and Lemma 2 below, the maximum queue size q_i^max of any function F_i and the end-to-end delay D̂_{1,n} are both shown to be proportional to the switching period T. The intuition behind this fact is that the longer the period T is, the longer a function will have to wait with the additional machine off before turning it on again. During this interval of time, each function accumulates work, and consequently both the maximum queue size and the delay grow with T.

Figure 4 illustrates how two functions in a service-chain, F_i and F_{i−1}, switch their additional machine on/off with the same period T. However, one should note that they are not on for the same duration. In this example, the input rate to the service-chain is r = 17, and the nominal service-rates of the first and second function are s̄_{i−1} = 6 and s̄_i = 8 respectively. The machine-switching period is T = 120, and the on-times for the two additional machines are T_{i−1}^on = 100 and T_i^on = 15 respectively. The maximum queue-size needed for the second function is q_i^max = 90.

[Figure 4: Example of an on/off-switching scheme when functions F_i and F_{i−1} switch their additional machine on/off with the same period, and how it affects the queue-size q_i(t) of the functions. For this example: r = 17, s̄_{i−1} = 6, s̄_i = 8, T = 120, T_{i−1}^on = 100, T_i^on = 15, q_i^max = 90.]

In order to solve the optimization problem one needs two ingredients. The first ingredient is the expression for the maximum queue-size needed for a given period T:

Lemma 1. With a constant input rate r_0(t) = r, and with all functions switching their additional machine on/off with a common period T, the maximum queue size q_i^max at function F_i is

q_i^max = T × α_i,   (14)

where

α_i = max{ ρ_i ( s̄_i (1 − ρ_i) − s̄_{i−1} (1 − ρ_{i−1}) ),
           (1 − ρ_{i−1}) ( s̄_{i−1} ρ_{i−1} − s̄_i ρ_i ),
           ρ_{i−1} ( s̄_{i−1} (1 − ρ_{i−1}) − s̄_i (1 − ρ_i) ),
           (1 − ρ_i) ( s̄_i ρ_i − s̄_{i−1} ρ_{i−1} ) },

with ρ_i as defined in (9), and T being the period of the switching scheme, common to all functions.

Proof. Due to limited space the proof is shown in a technical report published at Lund University Publications [23].¹

The expression for q_i^max in Eq. (14) suggests a property that is condensed in the next corollary.

Corollary 1. The maximum queue size q_i^max at any function F_i is bounded, regardless of the rate r of the input.

Proof. From the definition of ρ_i in Eq. (9), it always holds that ρ_i ∈ [0, 1). Hence, from the expression of (14), it follows that q_i^max is always bounded.

The second ingredient needed to solve the optimal design problem is the expression of how the end-to-end delay relates to the switching period T.

¹ https://lup.lub.lu.se/search/publication/8c7b837e-bca3-4375-bb9d-28ce6bbc889a
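Eqs. (8)–(9) and Lemma 1 can be checked numerically against the worked example above (r = 17, s̄_{i−1} = 6, s̄_i = 8, T = 120, expected q_i^max = 90). The sketch below is ours, not part of the paper; the buffer costs q_i = 0.5 and the dummy function F_0 with s̄_0 = r are taken from the example worked out in the next section.

```python
import math

def m_bar(r, s):
    # Eq. (8): number of machines that must always stay on.
    return math.floor(r / s)

def rho(r, s):
    # Eq. (9): normalized residual request rate, in [0, 1).
    return r / s - m_bar(r, s)

def alpha(r, s_prev, s_cur):
    # The four candidate terms of Lemma 1, Eq. (14).
    rp, rc = rho(r, s_prev), rho(r, s_cur)
    return max(
        rc * (s_cur * (1 - rc) - s_prev * (1 - rp)),
        (1 - rp) * (s_prev * rp - s_cur * rc),
        rp * (s_prev * (1 - rp) - s_cur * (1 - rc)),
        (1 - rc) * (s_cur * rc - s_prev * rp),
    )

r, T = 17.0, 120.0
print(m_bar(r, 6.0), m_bar(r, 8.0))  # 2 2: two machines always on at each function
print(T * alpha(r, 6.0, 8.0))        # q_max = T * alpha, approx. 90 as in Figure 4

# With the dummy F_0 (s_0 = r, so rho_0 = 0) and q_i = 0.5 from Table 1,
# the slope a = sum(q_i * alpha_i) of the later cost function is recovered:
a = 0.5 * alpha(r, r, 6.0) + 0.5 * alpha(r, 6.0, 8.0)
print(round(a, 3))                   # 0.792, matching the value used in Section 4
```

That the reconstruction of (14) reproduces both q_i^max = 90 from the Figure 4 caption and a = 0.792 from the numerical example is a useful consistency check.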
Lemma 2. With a constant input rate r_0(t) = r, the longest end-to-end delay D̂_{1,n} for any request passing through functions F_1 through F_n is

D̂_{1,n} = T × Σ_{i=1}^{n} δ_i,   (15)

with δ_i being an opportune constant that depends on r, s̄_i, and s̄_{i−1}.

Proof. Due to limited space the proof is shown in a technical report published at Lund University Publications [23].

Solution to the optimization problem

With these hypotheses, the cost function of the optimization problem (6) becomes

J(T) = aT + Σ_{i : T < T̄_i} c_i (1 − ρ_i) + Σ_{i : T ≥ T̄_i} c_i ∆_i / T + J^lb,   (16)

where J^lb is the lower bound given by (7) and a = Σ_{i=1}^{n} q_i α_i, with α_i given by Lemma 1. Furthermore, T̄_i (defined in (12)) represents the value of the period below which it is not feasible to switch the additional machine off and then on again (T < T̄_i ⇔ T_i^off < ∆_i). In fact, for every i with T < T̄_i we pay the full cost of having m̄_i + 1 machines always on.

The deadline constraint in (6) can simply be written as

T ≤ c := D_max / Σ_{i=1}^{n} δ_i,

with δ_i given in Lemma 2.

The cost J(T) of (16) is a continuous function of the single variable T. It has to be minimized over the closed interval [0, c]. Hence, by Weierstraß's extreme-value theorem, it has a minimum. To find this minimum, we just check all (finitely many) points at which the cost is not differentiable and the ones where the derivative is equal to zero. Let us define all

TABLE 1: Parameters of the example.

  i   s̄_i   c_i   q_i   ∆_i
  1   6     6     0.5   0.01
  2   8     8     0.5   0.01

The input r_0(t) = r can be seen as a dummy function F_0 preceding F_1, with s̄_0 = r, m̄_0 = 1, and ρ_0 = 0. From (8) and (9) it follows that m̄_1 = m̄_2 = 2, and ρ_1 = 5/6, ρ_2 = 1/8, implying that both functions must always keep two machines on, and then periodically switch a third one on/off. This leads to T̄_1 = 60.0 × 10⁻³ and T̄_2 = 11.4 × 10⁻³, where T̄_i is the threshold period for function F_i, as defined in (12). From Lemma 1 it follows that the parameter a of the cost function (16) is a = 0.792, while from Lemma 2 the parameters δ_i determining the queuing delay introduced by each function are δ_1 = 49.0 × 10⁻³ and δ_2 = 22.1 × 10⁻³, which in turn leads to

c = D_max / (δ_1 + δ_2) = 0.02 / (71.1 × 10⁻³) = 281 × 10⁻³.

Since T̄_2 < T̄_1 < c, the set C of (17) containing the boundary is

C = {0, T̄_2, T̄_1, c} = {0, 0.0114, 0.060, 0.281}.

To compute the set C* of interior points with derivative equal to zero defined in (18), which is needed to compute the period with minimum cost from (19), we must check all intervals with boundaries at two consecutive points in C. In the interval (0, T̄_2) the derivative of J is never zero. When checking the interval (T̄_2, T̄_1), the derivative is zero at

T = √(c_2 ∆_2 / a) = 0.318,

which, however, falls outside the interval. Finally, when checking the interval (T̄_1, c) the derivative is zero at
points in [0, c] in which J(T ) is not differentiable: r
C = {T i : T i < c} ∪ {0} ∪ {c}. c1 ∆1 + c2 ∆2
(17) c∗2 = = 0.421 > c = 0.281.
We denote by p = |C| ≤ n + 2 the number of points in a
C . Also, we denote by ck ∈ C the points in C and we Hence, the set of points with derivative equal to zero is
assume they are ordered increasingly c1 < c2 < . . . < cp . C ∗ = ∅. By inspecting the cost at points in C we find
Since the cost J(T ) is differentiable over the open interval that the minimum occurs at T ∗ = c = 0.281, with cost
(ck , ck+1 ), the minimum may also occur at an interior point J(T ∗ ) = 34.7. To conclude the example we show the state-
of (ck , ck+1 ) with derivative equal to zero. Let us denote by space trajectory for the two queues in Figure 5. Again, it
C ∗ the set of all interior points of (ck , ck+1 ) with derivative should be noted that this example is meant to illustrate how
of J(T ) equal to zero, that is one can use the design methodology of this section in order
C ∗ = {c∗k : k = 1, . . . , p − 1, ck < c∗k < ck+1 } (18) to find the best period T . In a real setting the incoming traffic
will likely be around millions of requests per second, [24].
with sP
c
i:T i <ck+1 i ∆i q2 (t)
c∗k = . q2max
a
on )
(o n
n, ,o
T ∗ = arg min ∗ {J(T )}. (19) off ff )
T ∈C∪C )
Next, we illustrate an example of how to use this to find (off
a solution to the design problem (6). , off q1 (t)
)
Example. We use an example to illustrate the solu- q1max
tion of the optimization problem of a service chain with two
functions. The input rate of the service-chain is r0 (t) = r = Figure 5: State-space trajectory for the example in Section 4. (on, off) corre-
spond to F1 having its additional machine on, while F2 has its extra machine
17. Every request has an E2E-deadline of Dmax = 0.02. The off.
parameters of the two functions are reported in Table 1.
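The search over C ∪ C* in (17)-(19) is easy to mechanize. The sketch below (our own illustration, not the authors' published REACTION source code) recomputes the two-function example: the constants a, δi and Dmax are taken from the text, while ρi and T̄i are reproduced under the assumption that (9) and (12) reduce to m̄i = ⌊r/s̄i⌋, ρi = r/s̄i − m̄i and T̄i = ∆i/(1 − ρi), which matches every value reported above; the constant offset J^lb is dropped since it does not change the minimizer.

```python
import math

# Data of the two-function example (Table 1) and constants from the text.
r = 17.0                           # input rate
s = [6.0, 8.0]                     # nominal service rates s̄i
c = [6.0, 8.0]                     # machine costs ci
Delta = [0.01, 0.01]               # switch-off/on times ∆i
a = 0.792                          # from Lemma 1 (given in the text)
delta = [49.0e-3, 22.1e-3]         # delay constants δi from Lemma 2
Dmax = 0.02                        # end-to-end deadline

# Assumed forms of (9) and (12), consistent with all values in the example.
rho = [r / si - math.floor(r / si) for si in s]    # ρ1 = 5/6, ρ2 = 1/8
Tbar = [d / (1 - p) for d, p in zip(Delta, rho)]   # T̄1 = 0.060, T̄2 = 0.0114
cmax = Dmax / sum(delta)                           # deadline bound c = 0.281

def J(T):
    """Cost (16) without the constant J^lb (irrelevant to the argmin)."""
    if T == 0.0:
        return sum(ci * (1 - pi) for ci, pi in zip(c, rho))
    return (a * T
            + sum(ci * (1 - pi) for ci, pi, Ti in zip(c, rho, Tbar) if T < Ti)
            + sum(ci * di / T for ci, di, Ti in zip(c, Delta, Tbar) if T >= Ti))

# Non-differentiable points (17) and interior stationary points (18).
C = sorted({0.0, cmax} | {Ti for Ti in Tbar if Ti < cmax})
Cstar = []
for lo, hi in zip(C, C[1:]):
    ck = math.sqrt(sum(ci * di for ci, di, Ti in zip(c, Delta, Tbar)
                       if Ti < hi) / a)
    if lo < ck < hi:
        Cstar.append(ck)

Tstar = min(C + Cstar, key=J)      # optimal period (19)
print(round(Tstar, 3))             # → 0.281, i.e. T* = c as in the text
```

With these numbers C* comes out empty (0.318 and 0.421 both fall outside their intervals), so the minimum sits on the deadline boundary, exactly as derived above.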
5. Summary

In this paper we have developed a general mathematical model for a service-chain residing in a Cloud environment. This model includes an input model, a service model, and a cost model. The input-model defines the input-stream of requests to each NFV along with end-to-end deadlines for the requests, meaning that they have to pass through the service-chain before this deadline. In the service-model, we define an abstract model of a NFV, in which requests are processed by a number of machines inside the service function. It is assumed that each function can change the number of machines that are up and running, but doing so is assumed to take some time. The cost-model defines the cost for allocating compute- and storage capacity, and naturally leads to the optimization problem of how to allocate the resources. We analyze the case with a constant input-stream of requests and derive control-strategies for this. This is a simplified case; it will constitute the foundation of adaptive schemes for time-varying requests in the future.

We plan to extend this work by allowing for a dynamic input as well as uncertainties in the true performance of the machines running in the functions, leading to the need of using a more advanced feedback loop to guarantee the desired performance.

Acknowledgements. The authors would like to thank Karl-Erik Årzén and Bengt Lindoff for the useful comments on early versions of this paper.

Source code. The source code used to compute the solution of the example in Section 4 can be found on Github at https://github.com/vmillnert/REACTION-source-code.

References

[1] ETSI, "Network Functions Virtualization (NFV)," https://portal.etsi.org/nfv/nfv_white_paper.pdf, October 2012.
[2] ——, "Network Functions Virtualization (NFV); Use Cases," October 2013.
[3] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, "Sparrow: Distributed, low latency scheduling," in Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 69–84.
[4] R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat, "Chronos: Predictable low latency for data center applications," in Proceedings of the Third ACM Symposium on Cloud Computing, ser. SoCC '12. New York, NY, USA: ACM, 2012, pp. 9:1–9:14. [Online]. Available: http://doi.acm.org/10.1145/2391229.2391238
[5] S. Xi, C. Li, C. Lu, C. D. Gill, M. Xu, L. T. Phan, I. Lee, and O. Sokolsky, "RT-OpenStack: CPU resource management for real-time cloud computing," in Cloud Computing (CLOUD), 2015 IEEE 8th International Conference on. IEEE, 2015, pp. 179–186.
[6] K. W. Tindell, A. Burns, and A. Wellings, "An extendible approach for analysing fixed priority hard real-time tasks," Journal of Real-Time Systems, vol. 6, no. 2, pp. 133–152, Mar. 1994.
[7] J. Palencia and M. G. Harbour, "Offset-based response time analysis of distributed systems scheduled under EDF," in 15th Euromicro Conference on Real-Time Systems, Porto, Portugal, July 2003.
[8] R. Pellizzoni and G. Lipari, "Holistic analysis of asynchronous real-time transactions with earliest deadline scheduling," Journal of Computer and System Sciences, vol. 73, no. 2, pp. 186–206, Mar. 2007.
[9] M. Di Natale and J. A. Stankovic, "Dynamic end-to-end guarantees in distributed real-time systems," in Proceedings of the 15th IEEE Real-Time Systems Symposium, Dec. 1994, pp. 215–227.
[10] S. Jiang, "A decoupled scheduling approach for distributed real-time embedded automotive systems," in Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, 2006, pp. 191–198.
[11] N. Serreli, G. Lipari, and E. Bini, "Deadline assignment for component-based analysis of real-time transactions," in 2nd Workshop on Compositional Real-Time Systems, Washington, DC, USA, Dec. 2009.
[12] ——, "The demand bound function interface of distributed sporadic pipelines of tasks scheduled by EDF," in Proceedings of the 22nd Euromicro Conference on Real-Time Systems, Bruxelles, Belgium, July 2010.
[13] S. Hong, T. Chantem, and X. S. Hu, "Local-deadline assignment for distributed real-time systems," IEEE Transactions on Computers, vol. 64, no. 7, pp. 1983–1997, July 2015.
[14] A. Rahni, E. Grolleau, and M. Richard, "Feasibility analysis of non-concrete real-time transactions with EDF assignment priority," in Proceedings of the 16th Conference on Real-Time and Network Systems, Rennes, France, Oct. 2008, pp. 109–117.
[15] L. Kleinrock, Queueing Systems. John Wiley & Sons, 1975.
[16] D. Henriksson, Y. Lu, and T. Abdelzaher, "Improved prediction for web server delay control," in Proceedings of the 16th Euromicro Conference on Real-Time Systems, June 2004, pp. 61–68.
[17] F. Baccelli, G. Cohen, G. J. Olsder, and J.-P. Quadrat, Synchronization and Linearity. Wiley, New York, 1992, vol. 3.
[18] R. L. Cruz, "A calculus for network delay, part I: Network elements in isolation," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 114–131, Jan. 1991.
[19] A. K. Parekh and R. G. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single-node case," IEEE/ACM Transactions on Networking, vol. 1, no. 3, pp. 344–357, June 1993.
[20] J.-Y. Le Boudec and P. Thiran, Network Calculus: A Theory of Deterministic Queuing Systems for the Internet, ser. Lecture Notes in Computer Science. Springer, 2001, vol. 2050.
[21] S. Chakraborty and L. Thiele, "A new task model for streaming applications and its schedulability analysis," in Design, Automation and Test in Europe Conference and Exposition, Mar. 2005, pp. 486–491.
[22] R. Bonafiglia, I. Cerrato, F. Ciaccia, M. Nemirovsky, and F. Risso, "Assessing the performance of virtualization technologies for NFV: a preliminary benchmarking," in 2015 Fourth European Workshop on Software Defined Networks. IEEE, 2015, pp. 67–72.
[23] V. Millnert, J. Eker, and E. Bini, "Cost minimization of network services with buffer and end-to-end deadline constraints," p. 11, Sep. 2016. [Online]. Available: https://lup.lub.lu.se/search/publication/8c7b837e-bca3-4375-bb9d-28ce6bbc889a
[24] W. Zhang, T. Wood, and J. Hwang, "NetKV: Scalable, self-managing, load balancing as a network function," in Proceedings of the 13th IEEE International Conference on Autonomic Computing, 2016.
DTFM: a Flexible Model for Schedulability Analysis of
Real-Time Applications on NoC-based Architectures

Abstract—Many-core processors are expected to be hardware targets to support the execution of real-time applications. In a many-core processor, cores communicate through a Network-On-Chip (NoC), which offers high bandwidth and scalability, but also introduces contentions leading to additional variability in task execution times. Such contentions also strongly increase the pessimism of worst-case execution time estimation. Consequently, modeling and analyzing network contention interferences on many-core processors is a challenge for supporting real-time applications. In this article, we formalize a dual task and flow model called DTFM. From the specification of a real-time application composed of a set of tasks and their communication dependencies, DTFM allows us to compute flow requirements and to assess the predictability of the tasks. DTFM is extensible enough to be adapted to various NoCs and task models, allowing designers to compare candidate software and NoC architectures. Furthermore, we introduce an original validation approach based on the cross-use of a task-level real-time scheduling analysis tool and a cycle-accurate SystemC NoC simulator.

I. INTRODUCTION

NoCs are widely used in many-core systems since they provide scalability, modularity, and communication parallelism. Many-core processors allow multiple applications to run at the same time on the same processor. These applications exchange messages through the communication infrastructure. Real-time applications have very stringent communication service requirements: the correctness of real-time communications depends not only on the communication result but also on the completion time bound [1].

In this article, we address temporal failures, which occur when a real-time application is unable to meet its deadlines. To predict whether a given application is able to meet its deadlines, usual schedulability analysis methods require knowing the Worst Case Execution Time (WCET) of each task [2]. Then, schedulability analysis methods allow designers to investigate how the tasks will be scheduled on the hardware platform and how task executions will be delayed due to hardware contentions.

A. Problem Statement

Many task models were proposed to deal with real-time applications [2]. All these models are not immediately suitable for NoC architectures: they do not take into account the communications through the network, the task mapping, and the latencies introduced by this new type of shared resource. Many traffic-flow models are also proposed to analyze the real-time communications of NoC architectures. All these models focus on real-time traffic-flows and do not take into account the schedulability of tasks; a schedulable traffic-flow does not allow us to conclude that the whole system is schedulable. Moreover, they do not offer the flexibility which is necessary to consider different communication protocols. Considering the incomplete state of the art, we investigate how to formalize and implement task and flow models that will provide a common platform for the evaluation of different solutions. Such a platform is required to run simulations and perform schedulability analysis with real-life cases in order to compare different solutions at the hardware, software, scheduling, or protocol levels.

B. Contributions

We first propose DTFM, a Dual Task and Flow Model, that allows designers to perform a complete schedulability analysis in order to assess the predictability of real-time tasks running over NoC-based architectures.

The second contribution is a unique validation environment, which combines a tick-accurate real-time simulator and analysis tool that implements the DTFM model and produces a task scheduling, which is then used to feed a SystemC cycle-accurate NoC simulator. Such a validation environment could be applied to investigate the temporal behaviour of real-time applications running on various other hardware components such as cache hierarchies. The analysis tool that implements DTFM allows us to evaluate and compare different NoC architectures, which can be standard or designed for mixed-criticality systems. It also allows us to consider different types of task communication protocols.

The remainder of the paper is organized as follows. The next section presents related works. Section III introduces background elements about the NoC we consider. Then, Section IV proposes our dual flow and task model. Implementation and evaluation of DTFM are explained in Section V, and Section VI concludes the article.

II. RELATED WORKS

Several schedulability analysis approaches for real-time on-chip communication have been explored.

In [3], the authors propose an analysis approach for real-time on-chip communication with wormhole switching and fixed-priority scheduling. [1] focuses on a real-time communication service with a priority-share policy.

In all these works, the scheduling analysis of the real-time applications is computed for a given flow model and for a given task model, and they do not take into account the schedulability of tasks. In our work, by contrast, we compute the flow model from a given task model, so we examine task and flow schedulability simultaneously. Moreover, we provide a flexible model in order to consider different communication protocols and different NoC architectures.

There are also several existing tools and frameworks which support mapping of applications onto NoC-based architectures. [4] proposes a simulator developed in SystemC with a C++ plugin,
in order to model concurrent hardware modules at the cycle-level accuracy. There are also other simulators, such as [5], which allows users to set up and simulate a broad range of network configurations defined by network size, buffer size, and routing algorithm.

In order to evaluate the usage of many-core architectures for critical applications, the authors in [6] propose a simple one-task-to-one-core execution model and develop test-drive programs. The developed programs evaluate the NoC latency performance.

All of these simulators provide a Transaction-Level Model of NoC architectures, without any consideration for task schedulability. On the other hand, DTFM, with the tools that support it, allows us to analyze the schedulability of real-time systems along with a cycle-accurate implementation of communications, so that a real-time communication analysis can also be performed.

III. BACKGROUND

In this section, we introduce the NoC we assume in this work. Then, we identify the timing issues a real-time application faces when running over NoC architectures.

A. NoC

A NoC is a network of nodes, where a node can be a processor, a memory, a peripheral, or a cluster. A node accesses the network through a network interface (NI) that usually transmits and receives data in the form of packets.

Wormhole switching is largely adopted in NoCs because it does not require large-capacity buffers. In a wormhole network, the packet is divided into a number of fixed-size flits [7]: a header flit, one or several body flits, and a tail flit. The header flit stores the routing information and builds the route. As the header flit moves ahead along the selected path, the remaining flits follow in a pipelined way and possibly span a number of routers. If the chosen link is busy, the header flit is blocked until the link becomes available; in this case, all the packets will be blocked in the routers along the selected path. The network then introduces a latency for the task that we cannot ignore during schedulability analysis.

We assume a standard NoC based on packet-based wormhole switching, with a 2D-mesh topology and deterministic XY routing. In this study we do not consider NoCs with flit-level preemption or with any specific protocol for real-time traffic.

B. Types of delays in a NoC

For each NoC configuration, we have different types of latency. In this section we focus on the latencies introduced by the type of NoC assumed in this paper. Let us review the different types of delays that we must understand to assess the schedulability of real-time tasks deployed on such a NoC [3], [1].

1) Path delay: The first type of delay is straightforward and can be called the Path Delay:

Definition 1: The Path Delay DP is the time required to send the packet from the transmitter to the receiver when no traffic-flow contention exists.

We assume here that packets of a flow do not share any physical link with other flows along the path. This delay is variable and depends on several parameters such as the flow rate and the size of the message [3]. Note that with the path delay, we consider the worst-case latency introduced by the network without any contention. If the real-time flow meets its deadline in the worst case, then it will meet its deadline in any other scenario.

2) Delay due to Direct Interference: A second kind of delay occurs due to direct interference:

Definition 2: There is a direct interference when two flows share the same physical link at the same time. We call the delay caused by a direct interference between flows a Direct Delay, noted DD.

In all situations, the direct interference can be presented as a non-deterministic source of delay. In our analysis, we consider the worst-case latency introduced by the direct interference. We assume the packet from the observed traffic-flow arrives just after other packets on each shared link. We also assume a maximum size of the packets [1].

3) Delay due to Indirect Interference: The last delay we address in this article is called Indirect Interference:

Definition 3: There is an indirect interference when two flows do not share the same physical link but they interfere directly with the same flows. We call the delay caused by an indirect interference an Indirect Delay, noted DI.

There can be many cases where a flow is blocked by another flow which is itself blocked by others, and so on. In this case, the first flow must wait for a non-deterministic delay. Like the direct delay, an indirect delay is not deterministic because it depends on the history of the network, on the flow rates, the message sizes, and the link states [1].

In this paper, we compute path delays and direct delays according to the equations given in [1], and we consider indirect interferences as unpredictable delays; we therefore do not handle indirect delays.

IV. THE DUAL TASK AND FLOW MODEL

In this section, we introduce our task and flow model. We also show how to calculate the flow model from the task model. A simple example is presented in order to illustrate those models. Finally, we explain how to compute flow parameters for a given communication protocol.

A. Presentation of the approach

Figure 1 shows the overall approach we propose in this article. From the task model, the mapping of the tasks on the processing units, and the NoC model, DTFM allows us to generate the associated flow model. The flow model specifies the messages that have to be transmitted in the NoC to enforce the communications between the tasks. From a flow model, and after identification of the different delays elapsed in the NoC, DTFM checks the predictability of the flows, i.e. detects when unpredictable delays may occur. If there are no unpredictable flows, the analysis tool implementing DTFM can compute the deadline and the communication delays of each message, and then decides whether the system is schedulable. In the sequel, we introduce the task and flow notations used in this article.

B. Task and flow Models

The task model Γ comprises n periodic tasks:

    Γ = {τ1, τ2, . . . , τn}
Fig. 1: The Dual Task and Flow Model (inputs: NoC parameters, Task Model, Task Mapping).

Each task leads to a sequence of jobs. Each task τi has a set of properties and timing requirements as follows:

    τi = { Oi, Ti, Ci, Di, Πi, Nodei, Ei }

where:

• Oi ∈ N+ is the first release time of the task τi, i.e. the release time of the first job of τi.

• Ti ∈ N+ is the period of the task, i.e. the fixed delay between two successive jobs of the task.

• Ci ∈ N+ specifies the worst-case execution time of a job of the task.

• Di ∈ N+ is the deadline. We note that each task deadline must be less than or equal to its period.

• Πi ∈ N+ is the fixed priority of the task. The value 1 denotes the highest priority level, while a larger value is a lower priority.

• Nodei ∈ N+ identifies the node running the task. This parameter allows us to introduce the mapping configuration, i.e. on which processing unit each task is assigned.

• Ei : Γ −→ Γ^p, τi −→ E(τi) = {τj, . . . , τk}. The function E allows us to introduce the precedence constraints in the task model. For a given task τi, the function E determines all the tasks τj that receive messages from the task τi. Furthermore, we also assume that tasks use a given communication protocol to exchange their messages.

The flow model comprises m periodic traffic flows:

    ψ = {ρ1, ρ2, . . . , ρm}

Each flow ρi raises a sequence of messages in a similar way that a task raises a sequence of jobs. The flow is defined as follows:

    ρi = { Oi, Ti, Di, Πi, NodeSi, NodeDi, Fi }

where:

• Oi ∈ N+ is the release time of the messages, i.e. the first time when a message of the flow becomes ready to be transmitted.

• NodeDi ∈ N+ is the node running the receiver task.

• Fi : ψ −→ Ω^p, ρi −→ F(ρi) = {ep,k, . . . , ei,j}, where Ω is the set of physical links in the NoC. If a and b are neighbor routers, ea,b (resp. eb,a) is the physical link from the router a (resp. b) to the router b (resp. a). Fi is a function that computes the links used by a flow; said differently, Fi identifies the physical links that will be used by a given message. The function Fi helps us to understand the relationships between the various flows that transit through the network and to determine the interferences between the messages sent by the tasks.

From the task model, we can compute the corresponding flow model. For each task given by the E function, we have a flow ρ where the transmitter is the task antecedent and the receiver is the task image.

We define the function G with two variables to calculate the flow model from the task model:

    G : Γ² −→ ψ
    (τi, τj) −→ G((τi, τj)) = ρ,    i ∈ [1..n]; j ∈ [1..n]

If τj ∈ E(τi), then G((τi, τj)) = ρ = ρi,j with:

    Π(ρi,j) = Π(τi)
    NodeS(ρi,j) = Node(τi) = Nodei
    NodeD(ρi,j) = Node(τj) = Nodej

Function G is surjective: every element of the codomain is mapped to at least one element of the domain:

    ∀ρ ∈ ψ, ∃(τi, τj) such that ρ = G((τi, τj))

You should notice that G depends on the communication protocol used by tasks to exchange their messages.
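To make the construction concrete, the sketch below (our own illustration, not the DTFM/Cheddar implementation) builds a flow ρi,j from a sender/receiver task pair via G and computes F with deterministic XY routing, assuming the mesh nodes are numbered row-major starting at 1; this assumption is consistent with the routes reported in Table II.

```python
def xy_route(src, dst, width=4):
    """Links e(a,b) crossed by a flow from node src to node dst under
    deterministic XY routing on a width x width 2D mesh (nodes numbered
    row-major from 1, an assumption consistent with Table II)."""
    x, y = (src - 1) % width, (src - 1) // width     # (column, row)
    dx, dy = (dst - 1) % width, (dst - 1) // width
    links, cur = [], src
    while x != dx:                                   # route along X first
        x += 1 if dx > x else -1
        nxt = y * width + x + 1
        links.append((cur, nxt)); cur = nxt
    while y != dy:                                   # then along Y
        y += 1 if dy > y else -1
        nxt = y * width + x + 1
        links.append((cur, nxt)); cur = nxt
    return links

def G(tau_i, tau_j):
    """Flow rho_i,j for a pair (tau_i, tau_j) with tau_j in E(tau_i): it
    inherits the sender's priority, connects the tasks' nodes, and its F
    is the XY route between those nodes."""
    return {"prio": tau_i["prio"], "nodeS": tau_i["node"],
            "nodeD": tau_j["node"],
            "F": xy_route(tau_i["node"], tau_j["node"])}

tau3 = {"prio": 1, "node": 10}                       # values from Table I
tau4 = {"prio": 2, "node": 8}
print(G(tau3, tau4)["F"])          # → [(10, 11), (11, 12), (12, 8)]
```

The printed route matches the F column of ρ4 = ρ3,4 in Table II, and the same two functions reproduce the other six flows of the example.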
C. Example of a real-time system over NoCs

The following example illustrates our task model and shows how we can generate the corresponding flow model. For this example, we use a 4 × 4 mesh NoC and the routing algorithm is XY routing. We consider a real-time application composed of 5 tasks, as shown in Figure 2.

Fig. 2: Task Graph Model

Table I presents the task model. As shown in Table I, the task τ3 runs on node 10. Its parameters are priority Π = 1, period T = 2 and deadline D = 2. E(τ3) indicates that τ3 sends messages to τ4 and τ5.

    Task   O(s)   T(s)   C(µs)   D(s)   Π   Node   E
    τ1     1      6      100     2      1   3      τ2 τ3
    τ2     3      2      100     2      2   5      τ5
    τ3     3      2      300     2      1   10     τ4 τ5
    τ4     5      2      500     2      2   8      τ5
    τ5     7      2      100     2      1   16     τ1

TABLE I: Task Model

From this task model and with the G function, we can compute the flow model. As shown in Table II, the flow ρ4 is sent from τ3 to τ4; it is thus sent from node 10 to node 8, because NodeS(ρ4) = Node(τ3) = 10 and NodeD(ρ4) = Node(τ4) = 8. It will therefore use the links e(10,11), e(11,12) and e(12,8).

    Flow         Π   NodeS   NodeD   F
    ρ1 = ρ1,2    1   3       5       e(3,2) e(2,1) e(1,5)
    ρ2 = ρ1,3    1   3       10      e(3,2) e(2,6) e(6,10)
    ρ3 = ρ2,5    2   5       16      e(5,6) e(6,7) e(7,8) e(8,12) e(12,16)
    ρ4 = ρ3,4    1   10      8       e(10,11) e(11,12) e(12,8)
    ρ5 = ρ3,5    1   10      16      e(10,11) e(11,12) e(12,16)
    ρ6 = ρ4,5    2   8       16      e(8,12) e(12,16)
    ρ7 = ρ5,1    1   16      3       e(16,15) e(15,11) e(11,7) e(7,3)

TABLE II: Flow Model computed from the task model

D. Example of flow parameters for a given task communication protocol

You should notice that the Oi, Ti and Di parameters of a flow ρi depend on the communication protocol used by tasks to exchange their messages. Then, to produce those parameters from the task model, the behavior of the task communication protocol has to be considered.

In this paper, we assume the immediate data connection protocol, which is defined in the Architecture Analysis and Design Language (AADL) [8]. This protocol works as follows: tasks send their packets at the end of their execution, and the packets are read before the execution of the receiving tasks. We assume the period of the flow is equal to the period of the transmitter task.

Let us consider an example with the two tasks τ1 and τ2, where τ1 sends a flow ρ1 to τ2:

    τ1 = { O1, T1, C1, D1, Π1, Node1, E1 }
    τ2 = { O2, T2, C2, D2, Π2, Node2, E2 }
    ρ1 = { O1, T1, D1, Π1, NodeS1, NodeD1, F1 }

For such flows, the parameters Oi, Di and Ti can be computed as follows:

    Oρ1 = Oτ1 + Cτ1 = O1 + C1
    Tρ1 = Tτ1
    Dρ1 = Dτ2 − (Cτ1 + Cτ2 + Oτ1 − Oτ2)

V. IMPLEMENTATION AND EVALUATION OF DTFM

In this section, we first explain the implementation of DTFM. Second, we present its evaluation.

A. Implementation into Cheddar

We implemented DTFM into Cheddar [9]¹. Cheddar is a GPL framework that provides a scheduling simulator, schedulability tests, and various features related to the design and the scheduling analysis of real-time systems. To perform analysis, Cheddar handles an architecture design language called CheddarADL [10]. CheddarADL allows users to describe both the software and the hardware parts of the system they expect to analyze. To implement DTFM into Cheddar, we extended CheddarADL to model the NoC and its size, the processor locations in the mesh, the usual scheduling parameters, the task parameters, the task mapping, and the dependencies between tasks. From the task set and the NoC model, DTFM produces the associated flow model, checks if deterministic communication delays can be produced by the NoC, and finally computes those communication delays.

¹ DTFM is available at http://beru.univ-brest.fr/svn/CHEDDAR/.

B. Evaluation with a multiscale toolset

We have produced several experiments in order to evaluate the correctness and the scalability of DTFM.

To perform such evaluations, we have implemented and used a multiscale toolset composed of the tick-accurate real-time scheduling simulator of Cheddar, which simulates the behavior of the task set, and of SHOC [11], a cycle-accurate SystemC-TLM NoC simulator that allows us to observe the traffic in the NoC. SHOC supports different types of traffic generators and consumers; in our study we consider ad hoc producers and consumers. SHOC provides all NoC components and IPs required for MPSoC simulations. All components can be parameterized. In this experiment we have designed and simulated
    Flow    U = 0.3           U = 0.5           U = 0.7
            DP(ns)  DP(ns)    DP(ns)  DP(ns)    DP(ns)  DP(ns)
            (SHOC)  (DTFM)    (SHOC)  (DTFM)    (SHOC)  (DTFM)
    ρ1      300     303       300     303       300     303
    ρ2      297     298       297     298       297     298
    ρ3      290     293       290     293       290     293
    ρ4      295     298       295     298       295     298
    ρ5      292     293       292     293       292     293
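Definitions 2 and 3 can be checked mechanically from the link sets of the flows. The sketch below (our own illustration, not the analysis equations of [1]) classifies interference for three flows of Table II; note that links are directed, so e(12,8) and e(8,12) are distinct.

```python
def direct(f, g):
    """Direct interference (Definition 2): the flows share a physical link."""
    return bool(set(f) & set(g))

def indirect(f, g, others):
    """Indirect interference (Definition 3): no shared link, but both flows
    interfere directly with a common third flow."""
    return not direct(f, g) and any(direct(f, h) and direct(g, h)
                                    for h in others)

# Directed link sets of ρ3 = ρ2,5, ρ4 = ρ3,4 and ρ5 = ρ3,5 from Table II.
rho3 = [(5, 6), (6, 7), (7, 8), (8, 12), (12, 16)]
rho4 = [(10, 11), (11, 12), (12, 8)]
rho5 = [(10, 11), (11, 12), (12, 16)]

print(direct(rho4, rho5))            # → True: share e(10,11) and e(11,12)
print(direct(rho3, rho4))            # → False: e(12,8) differs from e(8,12)
print(indirect(rho3, rho4, [rho5]))  # → True: both interfere with ρ5
```

Flow pairs flagged by `indirect` are precisely the unpredictable cases that DTFM detects and rejects in this evaluation.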
In this experiment, we change the release times of tasks in order to study different situations of contention between flows. We present in Figure 3 the difference between SHOC simulations and DTFM computations for some of the analyzed task sets. Each curve represents the difference for the same flow with different release times. We can see that DTFM is able to predict an upper bound on the direct delays given by SHOC.

Figure 4 presents the percentage of correct analyses computed by DTFM. We can see that, if there are no indirect delays, any task set said schedulable by DTFM is also said schedulable by SHOC. We also notice that flows with indirect delays are detected and rejected by DTFM: the corresponding task sets are said unschedulable by DTFM. But in some cases, indirect interference may only add a few extra delays and the system could still be schedulable. In such cases, DTFM can make too conservative decisions by considering as unschedulable cases which are schedulable.

3) Evaluation of the scalability of DTFM: In order to evaluate the scalability of DTFM, we measured the response time of DTFM to perform the analysis of a given task set.

For such evaluations, we call DTFM on several task sets with different numbers of dependencies ranging from 10 to 200. Figure 5 presents the response times of DTFM and shows how much time the tool needs to verify a task set. As shown in this figure, DTFM takes 5 seconds to analyze a task set with 10 dependencies and up to 9 seconds with 200 task dependencies. Cycle-accurate simulation of a few seconds of the system execution can take several hours, while the proposed model decides about the schedulability of the whole system faster, with cycle accuracy limited to the NoC communications.

Fig. 5: Response time of DTFM

VI. CONCLUSION

In this article, we introduced DTFM, a dual task and flow model, in order to improve the predictability of real-time applications over NoC architectures. DTFM allows us, from the task model and the task mapping, to check the schedulability of the system. Sharing resources between tasks, contentions of the networks and interferences lead

Such a multiscale toolset simulator could be applied to investigate the temporal behavior of real-time applications running on various other hardware components such as cache hierarchies. With this multiscale simulator, we made experiments that evaluate DTFM correctness and scalability.

In future work, we will enrich DTFM with new delay models for different and usual task communication protocols and NoC architectures. We also intend to use DTFM and its associated tools to investigate task mapping and arbitration techniques.

VII. ACKNOWLEDGEMENT

The authors would like to thank Dr Martha Johanna Sepúlveda for her help on the SHOC simulator. This work and Cheddar are supported by BMO, Ellidiss Technologies, CR de Bretagne, CG du Finistère, and Campus France.

REFERENCES

[1] Z. Shi and A. Burns, "Real-time communication analysis with a priority share policy in on-chip networks," in Proceedings of the 21st Euromicro Conference on Real-Time Systems (ECRTS), July 2009, pp. 3–12.
[2] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard-real-time environment," Journal of the ACM (JACM), vol. 20, no. 1, pp. 46–61, 1973.
[3] Z. Shi and A. Burns, "Real time communication analysis for on-chip networks with wormhole switching," in Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip (NOCS), Nov 2008, pp. 161–170.
[4] A. T. Tran and B. Baas, "NoCTweak: a highly parameterizable simulator for early exploration of performance and energy of networks on-chip," VLSI Computation Lab, ECE Department, University of California, Davis, Tech. Rep. ECE-VCL-2012-2, July 2012, http://www.ece.ucdavis.edu/vcl/pubs/2012.07.techreport.noctweak/.
[5] S. P. Azad, B. Niazmand, P. Ellervee, J. Raik, G. Jervan, and T. Hollstein, "SoCDep2: a framework for dependable task deployment on many-core systems under mixed-criticality constraints," in Proceedings of the 11th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), June 2016, pp. 1–6.
[6] B. d'Ausbourg, M. Boyer, E. Noulard, and C. Pagetti, "Deterministic execution on many-core platforms: application to the SCC," in Many-core Applications Research Community Symposium (MARC), Dec 2011.
[7] Z. Shi, "Real-time communication services for networks on chip," Ph.D. dissertation, University of York, Nov 2009.
[8] P. H. Feiler and D. P. Gluch, Model-Based Engineering with AADL: An Introduction to the SAE Architecture Analysis & Design Language.
to non-deterministic communication delays which make complex the Addison-Wesley, 2012.
schedulability analysis of tasks. In DTFM, we take into account all [9] F. Singhoff, J. Legrand, L. Nana, and L. Marcé, “Cheddar: a flexible
of these factors in order to assess the schedulability of tasks. real-time scheduling framework,” ACM SIGAda Ada Letters, vol. 24,
no. 4, pp. 1–8, Dec 2004.
We use a standard NoC architecture to illustrate DTFM. Such
[10] C. Fotsing, F. Singhoff, A. Plantec, V. Gaudel, S. Rubini, S. Li, H. N.
NoCs are unsafe platforms for real-time applications because they Tran, L. Lemarchand, P. Dissaux, and J. Legrand, “Cheddar architecture
introduce various types of latencies which may prevent tasks to meet description language,” Lab-STICC Technical report, 2014.
their deadlines. [11] M. Sepúlveda, M. Strum, and W. Chau, “Performance impact of QoSS
(quality-of-security-service) inclusion for NoC-based systems,” in 17th
We have implemented DTFM into Cheddar. From a set of tasks IFIP/IEEE International Conference on Very Large Scale Integration
and their dependencies, we can automatically check and compute (VLSI-SoC), Oct 2009, pp. 12–14.
the various latencies in the NoC and perform schedulability analysis. [12] E. Bini and G. C. Buttazzo, “Measuring the performance of schedula-
Furthermore, when a task set leads to flows with indirect delays, such bility tests,” Real-Time Systems, vol. 30, no. 1-2, pp. 129–154, 2005.
flows are detected by DTFM and the task set is said unschedulable.
Besides DTFM and its corresponding tool, a second contribution is
the tool-based validation approach which is based on a multiscale
simulator toolset. The validation tool is composed of the Cheddar
real-time scheduling simulator and SHOC, a SystemC NoC simulator.
The task scheduling produced by Cheddar is used as an input of the
SystemC simulator, which is used to measure delays in the NoC.
Go-RealTime: A Lightweight Framework for
Multiprocessor Real-Time System in User Space
Zhou Fang¹, Mulong Luo¹, Fatima M. Anwar², Hao Zhuang¹, and Rajesh K. Gupta¹
¹Department of Computer Science and Engineering, UCSD
²Department of Electrical Engineering, UCLA
Abstract—We present the design of Go-RealTime, a lightweight framework for real-time parallel programming based on the Go language. Go-RealTime is implemented in user space with portability and efficiency as important design goals. It takes advantage of the Go language's ease of programming and natural model of concurrency. Goroutines are adapted to provide scheduling for real-time tasks, with resource reservation enabled by exposing Linux APIs to Go. We demonstrate nearly full utilization on 32 processors when scheduling periodic heavy tasks using the Least Laxity First (LLF) algorithm. With its abstractions and system support, Go-RealTime greatly simplifies the setup of sequential and parallel real-time programs on multiprocessor systems.

Index Terms—Real-Time Systems; Multiprocessor Systems; Parallel Programming; Go Language; Scheduling Algorithms

I. INTRODUCTION

Multiprocessing is now the primary means of improving performance on diverse platforms, from embedded systems to data centers. Many frameworks and associated tools make parallel programming easier, such as compiler support (OpenMP [1]), language extensions (Intel Cilk Plus [2]) and libraries (Intel Thread Building Blocks (TBB) [3]). These tools provide easy-to-use programming support to construct parallel tasks and orchestrate execution details behind the scenes. Yet the goal of minimizing execution time is only partially achieved due to the lack of resource reservation. This is particularly limiting for real-time (RT) applications.

The majority of prior work on real-time systems focuses on operating system (OS) support for scheduling sequential tasks on uniprocessor models. Multiprocessor systems, which have more complex schedulers, are attracting increasing research interest. Among notable multiprocessor system implementations, Calandrino et al. proposed LITMUS^RT [4], a Linux-based testbed for real-time multiprocessor schedulers. Based on LITMUS^RT, several different implementations of the global Earliest Deadline First (EDF) algorithm [5] were tested and compared [6]. The drawback of the global EDF algorithm on multiprocessor systems is that the schedulability of heavy-utilization tasks degrades dramatically [7]. Dynamic-priority algorithms, such as Least Laxity First (LLF) [8], [9] and P-fairness [10], are able to achieve higher utilization than EDF on multiprocessor systems. However, they induce larger system cost due to frequent context switching and cache pollution.

Prior to scheduling, because there are many choices for decomposing a parallel task into a group of sequential sub-tasks, parallel task scheduling algorithms have additional complexity on top of sequential scheduling. Lakshmanan et al. [11] studied fork-join tasks and proposed the task stretch transform to reduce the scheduling penalty. For a general class of parallel tasks, Saifullah et al. [12] proposed decomposing a parallel task into several sequential sub-tasks with carefully computed deadlines, which are then scheduled using global EDF. This algorithm was later implemented by Ferry et al. [13]. Due to the lack of a direct API to switch threads, the implementation in [13] switches threads indirectly by changing their priority. This indirect approach results in a complex implementation and limits performance.

In this work, we propose a new RT framework, Go-RealTime, which addresses the above challenges using the Go language [14]. It supports both real-time scheduling and parallel programming with a set of APIs. Go provides native support for concurrency by means of goroutines. We leverage this advantage of Go for real-time programs running on concurrency units in user space. The OS thread scheduler is integrated into the Go runtime and used for resource reservation. In Go-RealTime, we import OS APIs (thread scheduler, CPU affinity and timers) into Go, modify the Go runtime to enable direct goroutine switches, and build external Go packages to support asynchronous events, parallel programming and real-time scheduling. Go-RealTime is able to make a given portion of the processors (or a fraction of one processor) real-time and keep the rest unchanged, so it supports the implementation of mixed-criticality systems. This approach is also safer and more portable than earlier works [4], [6] because no modifications to the OS kernel are needed.

In Go-RealTime, a user can program a parallel task as a Directed Acyclic Graph (DAG) [15] of sub-tasks. Sub-tasks are executed as subroutines, so the overhead of process-level concurrency is saved. The framework supports multiple scheduling algorithms for different tasks running at the same time; each type of task is scheduled by a suitable algorithm (EDF [5], LLF [8]). The total processor resources are partitioned dynamically among all schedulers by the resource balancing mechanism. Implementation results show that our framework is lightweight, flexible and efficient for deploying real-time programs on platforms ranging from embedded devices to large-scale servers.

The main contributions of our work are: (1) we design a real-time parallel framework using the Go language without modifying the OS kernel; (2) we present the resource balancing approach to execute multiple scheduling algorithms in parallel; (3) we implement system support for handling asynchronous events in user space; (4) we implement a multiprocessor LLF scheduler with a simple tie-breaking method, which can achieve near-full utilization for randomly generated heavy task sets (0.5 ≤ µtask ≤ 0.9) on a 32-processor machine; (5) we evaluate the implementation of EDF and LLF in Go-RealTime by analyzing system overhead and running schedulability tests.

The rest of the paper is organized as follows. We introduce the system architecture of Go-RealTime in Section II, which presents Go-RealTime's APIs, system model, resource balancing, and asynchronous event handling. Programming parallel tasks in the framework is presented in Section III. The design, implementation and evaluation of the real-time schedulers are presented in Section IV. Section V gives the conclusion and future work.

II. THE GO-REALTIME ARCHITECTURE

In this section we introduce the overall system architecture, APIs and data structures of Go-RealTime. It is currently implemented on top of the Linux version of Go 1.4. Go-RealTime works concurrently with the original runtime of the Go language. It creates RT goroutines which run real-time tasks. The threads carrying RT goroutines are separated from native Go threads (no goroutine migration). In the following discussion, threads and goroutines are those created by Go-RealTime unless otherwise stated.
TABLE I
GO-REALTIME APIS

Type       | Method                             | Description
-----------|------------------------------------|--------------------------
Resource   | SetSched(thread, policy, priority) | Linux thread scheduler
           | BindCPU(CPU)                       | processor affinity
           | SetTimerFd(file descriptor, time)  | Linux timer
Scheduling | GoSwitch(goroutine)                | switch to a goroutine
Task       | NewWorker(Nw)                      | create Nw workers
           | task.SetTimeSpec(ts, tp, td, tr)   | set timing specification
           | task.Run(func, arg)                | start a RT task
Fig. 1. System model of Go-RealTime. A worker is the abstraction of a thread with one processor resource reservation. Goroutines are sorted in a few priority queues. A parallel task is decomposed into a DAG of sequential sub-tasks.

Go-RealTime targets applications which require millisecond-level timing precision. The requirement is given by δasync, the upper bound of tolerable asynchronous event timing uncertainty. A real-time asynchronous event handler is designed to meet the requirement dynamically. The framework turns off Go's garbage collector to reduce timing uncertainty. In the current implementation, memory usage is managed by recycling dynamically allocated objects. Replacing the Go garbage collector with a more controllable approach that meets real-time requirements is left for future work.

A. Go-RealTime's APIs

Go's concurrency is enabled through goroutines and invoked with the keyword go. A user creates a goroutine and associates it with a function using go func(arg). After creation, the Go runtime scheduler automatically allocates goroutines to run on OS threads. In the original Go runtime, a simple scheduling algorithm is adopted: each thread created by the Go runtime keeps a runnable queue of goroutines in First-In-First-Out (FIFO) order, and it tries to make a goroutine switch every 10 ms. A user can yield the execution of a goroutine using the API GoSched(). Compared to Linux's Completely Fair Scheduler (CFS), this design is more lightweight and scalable. It is also much easier to use than thread libraries such as Pthreads. However, the Go runtime completely hides system details from the programmer, making it difficult to carry out RT task scheduling due to the lack of control over threads and processors. Go-RealTime modifies Go by adding resource and scheduling APIs (Table I). These include:

• SetSched: use the Linux system call sched_setscheduler to change the scheduling policy of a Go thread from time-sharing (CFS by default) to RT (FIFO, Round-Robin).
• BindCPU: bind a Go thread to a processor using the sched_setaffinity system call. The Go thread then runs exclusively on that processor.
• GoSwitch: directly switch to a specific target goroutine. It is implemented by modifying the Go runtime source code.

These modifications and additions make the execution of goroutines fully controllable in user space. The APIs can also reserve a fraction f of one processor: two Linux timers are used to set the Go thread to RT at trt and set it back to time-sharing at tts. Given the period of the timers ttimer, it satisfies tts − trt = f · ttimer, so the Go thread occupies fraction f of the processor deterministically. This is useful for running RT tasks on uniprocessor systems. In this work we focus on taking Ncpu processors completely for RT tasks.

Go-RealTime relies on these APIs to implement the RT schedulers of Section IV. As an alternative to using these APIs directly, users can also use the simpler Task APIs (Table I), which assign the timing specifications and submit tasks to the RT scheduler.

B. Go-RealTime's System Model

The system model of Go-RealTime consists of three objects: worker, task and goroutine, as shown in Figure 1. A worker is a thread created by Go-RealTime. It is an abstraction of reserved processor resource. A Go-RealTime program creates a group of workers via the NewWorker(Nw) method. The number of workers (Nw) is usually equal to Ncpu to maximize parallelism.

Task is the object describing an RT program. It has the following features:

• Time specification: starting time ts, period tp, deadline td, and worst-case running budget tr, set by task.SetTimeSpec(ts, tp, td, tr).
• Utilization: µtask = tr/tp (µtask can be larger than 1 for parallel tasks).

Goroutine is the concurrency unit. A new RT goroutine is created by calling task.Run(func, arg); the task then starts to run on the new goroutine. A group of workers shares a number of runnable queues of goroutines. Nqueue, the number of queues, is a parameter of the Go-RealTime scheduler. A priority queue is used to sort goroutines by a key ("deadline" in EDF, "laxity" in LLF). We use a lock-free priority queue data structure [16] to reduce the cost of parallel access contention.

Workers fetch the head goroutine of a queue to execute: when a worker becomes idle, it fetches the first goroutine in the queue. The scheduler is preemptive: if a goroutine g0 becomes runnable and its priority is higher than that of the running goroutine g1, the scheduler switches off g1 and lets g0 run. A sequential task is executed by one goroutine. A parallel task is decomposed into sequential sub-tasks described by a DAG model and executed by a group of goroutines. Since no preemption happens among sub-tasks of the same parallel task, they are implemented simply as subroutines.

C. Resource Balancing in Go-RealTime

Go-RealTime can keep more than one goroutine queue to support multiple scheduling algorithms in parallel. This approach has two benefits: (1) it supports partitioned scheduling policies, and different algorithms can be deployed for different sets of tasks; (2) the scheduling algorithm can be changed on the fly without interrupting running tasks.

An example with two queues is shown in Figure 2: one queue uses the EDF algorithm, and the other uses LLF. Load imbalance happens when the total task utilizations of the EDF and LLF queues do not match their allocated processor resources, which induces utilization loss. A load balancing technique such as work stealing [17] is one solution. For RT scheduling, however, work stealing becomes complex because it must respect task priority during stealing; otherwise it may run lower-priority tasks while leaving higher-priority tasks waiting. Go-RealTime uses resource balancing instead of load balancing. Because all tasks are periodic and their utilizations µtask are known, the total utilization of a queue is simply the sum µqueue = Σ µtask. It is the amount of processor resource the queue should get. The total resources, Nw workers, are allocated to the queues proportionally according to µqueue.

When the load of one queue (µqueue) changes, the number of workers allocated to each queue (Nwqueue) is recomputed. Go-RealTime dynamically assigns workers to queues according to Nwqueue. In most cases, Nwqueue is not an integer, so a queue may be assigned a fraction of a worker. In the example of Figure 2, worker 1 is allocated to both queues: a fraction fEDF of this worker should be allocated to the EDF queue and fLLF to the LLF queue (fEDF + fLLF = 1). Worker 1 tries to give the right fraction of its total running time to each queue. It records the total running time of each queue; when it becomes idle and is about to grab a goroutine, it first checks the current fraction of total running time of each queue and selects the queue which has received the smallest fraction. Go-RealTime asynchronously checks and enforces resource balancing while goroutines are running, repeated at a given time interval tbalance. The implementation is explained in the next section.

Fig. 2. Resource balancing: allocating the correct fraction of the workers to goroutine queues.

D. Handling of Asynchronous Events

Go-RealTime handles asynchronous events via the check_async method. It checks all asynchronous events (timeouts, messages, etc.) and responds to the events which should have occurred. The challenge is how to call the check_async method repeatedly in an asynchronous manner. Our current implementation is to insert check_async calls in the program until the interval between two consecutive checks is less than δasync. By skipping some of these checks, we can control the timing precision. Algorithm 1 gives an example of instrumenting a ConvertGray function [18] with check_async. The function converts a colored image in memory to gray scale. The three locations to insert check_async (@out, @lpy, @lpx) result in different timing precisions and overheads. To tune precision more finely, a fixed number of check_async calls can be skipped using a counter. For example, check_async can be called once every ten times the program reaches location lpy (denoted "lpy, 1/10").

Algorithm 1 Instrumented ConvertGray
1:  function ConvertGray
2:    @out: check_async()
3:    for Loop over Y coordinate do
4:      @lpy: check_async()
5:      for Loop over X coordinate do
6:        @lpx: check_async()
7:        SetPixelGray(image, x, y)
8:      end for
9:    end for
10: end function

The check_async facility is used in a number of components of Go-RealTime, as discussed below.

1) Timer: Go-RealTime implements a software timer on top of check_async. It keeps a priority queue of active timers at each worker; the queue uses the "time" value to assign priorities. The current running goroutine on a worker is responsible for checking the timers: the check_timer method is called inside check_async. It compares the head timer of the queue with the current clock time; if any timeout is due, it calls the handler function of the timer and removes the timer from the queue.

We use timer uncertainty to evaluate the timing precision of the framework. Timer uncertainty is computed as ∆timer = tnotify − texpect, where texpect is the expected timeout of the timer and tnotify is when the program is notified and responds to the timer.

We compare the timers of Linux (notification via file descriptor), Go and Go-RealTime. ∆timer is measured by repeatedly setting a timer and computing tnotify − texpect. All tests run on real-time threads on a server with 32 Intel Xeon E5-2640 2.6 GHz CPUs running CentOS 6.6. The statistics of ∆timer are shown in Figure 3: the x-coordinate is the value of ∆timer and the y-coordinate gives the cumulative probability function. The overhead of the Go-RealTime timer is given near the curve. The results show that the uncertainty of the Linux timer is around 100 µs. The native Go timer daemon runs on a goroutine, which may be influenced by other goroutines. In order to test its performance under load, we create a few dummy goroutines running on the same thread as the timer goroutine (load = 1 means one dummy goroutine). We see that 10 goroutines increase the uncertainty to sub-second level, which means the native Go timer cannot guarantee its precision. The precision of the Go-RealTime timer depends on program details (check_async locations). We use the program of Algorithm 1 for the test. Because all Go-RealTime goroutines also check the timers generated by other goroutines, loading goroutines do not influence the timer precision. The result shows that the check_async method is able to achieve millisecond-level precision with small overhead (2.9 ms/s for lpy, 0.3 ms/s for lpy 1/10). The Go-RealTime timer does not rely on the OS timer. Compared to the Linux timer, it provides the ability to tune precision and overhead over a large range.

Fig. 3. Timer uncertainty comparison of Linux, Go and Go-RealTime. Each test contains 500 samples.

2) Messages: Passing asynchronous messages and getting on-time responses is important in RT systems. In Go-RealTime, each worker has a message queue that is updated by asynchronous events. The current running goroutine checks the messages when check_async runs; in this way the running goroutine can respond to asynchronous messages quickly. For instance, in EDF and LLF scheduling, an asynchronous message is sent to the running task when a new task becomes runnable. Upon receiving the message, the running task compares its priority with the new one, and the task with higher priority continues to run.

3) Other Events: Go-RealTime implements a few other asynchronous events based on check_async. Two representative examples are: (i) resource balancing is checked in the check_async method every tbalance, and (ii) in LLF scheduling, all goroutines actively update and compare laxity in the check_async method every tllf.

III. PARALLEL TASKS IN GO-REALTIME

Go provides goroutine and channel constructs [14] to build scalable concurrent programs. Parallel programming in Go is studied in [19] using a dynamic programming example that we will use to illustrate the advantages of Go-RealTime. The example addresses a search problem: given Nkey keys and the probability of each key, find the optimal binary search tree that minimizes the average search time. The graph of computing nodes is shown in Figure 5(a): it starts from the diagonal nodes and ends at the upper-right corner node. The program is parallelized by grouping nodes into sub-tasks, as in Figure 5(b). In the reference implementation of [19], each sub-task is executed as a goroutine. Because there is a fixed dependency among all sub-tasks, there is no need for each sub-task to exclusively occupy a concurrency unit, which costs resources and increases switching time. Executing each sub-task as a subroutine is more efficient.

Fig. 5. Parallel programming example. (a) dynamic programming grids are grouped into boxes as sub-tasks, (b) DAG of sub-tasks and (c) impact of decomposition granularity.

Go-RealTime models a parallel task as a DAG of sub-tasks. It stores runnable sub-tasks in a global pool protected by a spin lock. A sub-task is pushed into the runnable pool after all its predecessors have completed. When a goroutine is idle, it fetches a sub-task to execute from the pool of its associated parallel task. A goroutine sleeps if the pool is empty and is woken up when new sub-tasks become runnable. Go-RealTime creates a group of goroutines for a parallel task: an initializing goroutine, a finalizing goroutine, and a few goroutines to execute the parallel section, constructed in the fork-join pattern, as shown in Figure 4. A goroutine is blocked until all its predecessors finish. The initializing goroutine is responsible for preparing the parallel computation, such as loading data and initializing the sub-task pool. The finalizing goroutine does the cleanup work and handles the computation result. The API to construct a parallel task is the subtask object: users create a set of sub-tasks, and assign each sub-task a sequential program to run and a list of predecessors/followers in the DAG.
Fig. 6. Parallelize dynamic programming on 16 processors with (a) Np = 16 and (b) Np = 128.

Fig. 7. Real-time scheduler performance. The tests for EDF on heavy tasks contain 400 samples; the other tests contain 100 samples each. (a) 4-processor, light tasks, (b) 4-processor, heavy tasks, (c) 32-processor, light tasks and (d) 32-processor, heavy tasks.

Fig. 8. System cost of the EDF and LLF schedulers on a 32-processor machine: (a) ultra-light tasks, (b) light tasks and (c) heavy tasks.

Fig. 9. Average count of goroutine switches per second·processor: (a) ultra-light tasks, (b) light tasks and (c) heavy tasks.
B. System Cost

System cost is an important consideration for the timing overhead it represents. The cost includes direct and indirect components. Direct cost consists of the time consumed by the framework code and goroutine switches. The indirect cost we consider is mainly cache pollution.

1) Direct Cost: We classify direct cost into three sources: queueing, switching and timer. Queueing cost is the time consumed by scheduler code; the majority comes from the cost of insert and fetch operations on goroutine queues. An increasing number of concurrent tasks leads to higher queueing cost. Shorter task periods also increase the queueing cost, because goroutine queueing operations become more frequent, which results in intense contention on parallel accesses. Switching cost is the direct cost of a goroutine switch. The overhead of the timer is mainly induced by check_timer calls and determined by their calling interval; it also includes the cost of operations on the timer queue, such as insertion.

The histogram of direct cost in the 32-processor tests is given in Figure 8. Besides light and heavy tasks, we consider ultra-light tasks with 0.01 ≤ µtask ≤ 0.1 to study the queueing cost with a large number of tasks in the queue. The results show that the direct cost is still small when 500 ultra-light tasks run on 32 processors (smaller than 0.055%, shown in (a)). As the number of tasks decreases and periods increase (as in most application cases), the direct cost becomes even smaller and can be ignored.

2) Cache Pollution: When a goroutine switch happens, the current cache contents may become invalid due to the change of data and code context. The time required to update the cache from memory is called cache pollution. This cost highly depends on programs and machines, so it is hard to quantify exactly; the cost reported in [20] is in the range of milliseconds. Therefore it should be considered a major cost. We use the number of goroutine switches as the metric of indirect cost. Figure 9 gives the histogram of the average number of goroutine switches per second·processor. It shows that the switching induced by LLF is evidently larger than for EDF. Because of the similar performance in scheduling ultra-light and light tasks, EDF is preferred in these cases to reduce switching cost.

C. Schedulability Evaluation

We give a detailed schedulability evaluation of Go-RealTime for EDF and LLF. We use the convention of [21], [22], [23], [24] to generate the test sets. That is, starting with zero tasks in the task set, we randomly generate tasks with uniformly distributed period and utilization and add them to the current task set. We check the total utilization of the task set: if it is less than the lower bound of the predetermined utilization range, we continue to add new tasks until the total utilization is in the range, and the task set is then tagged as valid for the schedulability test. If the total utilization exceeds the upper bound, we abandon the task set and start again with zero tasks. We set the total utilization of each task set to be within 2.0 to 3.9. The period of each task is uniformly distributed between 100 ms and 300 ms. For light tasks, the utilization per task is uniformly distributed between 0.1 and 0.5; for heavy tasks, between 0.5 and 0.9. We generate 5000 task sets for each test case.

Figure 10 shows the results of the schedulability experiments. Specifically, Figure 10(a) shows the number of schedulable task sets with only light tasks. For light tasks, the number of task sets for each total utilization is almost uniformly distributed. We can see that for both the EDF and LLF algorithms, all the task sets are schedulable. Figure 10(b) shows the number of schedulable task sets with heavy tasks. For heavy tasks, the distribution of task sets under different total utilizations is not uniform: it first increases with the total utilization and then decreases. We see that for EDF, the number of successful task sets drops quickly as the total utilization increases, while for LLF the number of successful task sets decreases slowly. This implies that LLF performs better than EDF in this setting in Go-RealTime.

Fig. 10. Number of schedulable task sets for (a) light and (b) heavy tasks out of 5000 task sets on 4 processors.

D. On-the-fly Scheduling Algorithm Change

The support for multiple schedulers in Go-RealTime is the key to combining the advantages of different algorithms, such as EDF's low cost and LLF's high schedulability. The other benefit of the design is that one scheduling algorithm can be changed to another on the fly when the features of the task set change. For example, when the total utilization increases, the scheduler should change from EDF to LLF in order to avoid deadline misses. During the changing process, both the EDF and LLF goroutine queues are kept. Each task is inserted into the LLF queue at the beginning of its next period, while the current state of the task (which may be "running", "runnable in EDF queue" or "idle until next period starts") remains unchanged. An example of this process is shown in Figure 11: the scheduler changes from EDF to LLF at time 4 s. It can be seen that after the change, the schedulability of the task set is improved, at the cost of more frequent task switches.

V. CONCLUSION

... check_async method based on the timing profile of programs; (2) design a controllable approach to run garbage collection which respects the priority of RT tasks.

REFERENCES

[1] OpenMP. [Online]. Available: http://openmp.org/wp/
[2] Intel. Cilk Plus. [Online]. Available: https://www.cilkplus.org/
[3] Intel. Thread Building Blocks. [Online]. Available: https://www.threadingbuildingblocks.org/
[4] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. H. Anderson, "LITMUS^RT: A testbed for empirically comparing real-time multiprocessor schedulers," in 27th IEEE International Real-Time Systems Symposium (RTSS), Dec 2006, pp. 111–126.
[5] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard-real-time environment," J. ACM, vol. 20, no. 1, pp. 46–61, Jan. 1973.
[6] B. B. Brandenburg and J. H. Anderson, "On the implementation of global real-time schedulers," in 30th IEEE International Real-Time Systems Symposium (RTSS), Dec 2009, pp. 214–224.
[7] J. Goossens, S. Funk, and S. Baruah, "Priority-driven scheduling of periodic task systems on multiprocessors," Real-Time Syst., vol. 25, no. 2-3, pp. 187–205, Sep. 2003.
[8] J. Y. T. Leung, "A new algorithm for scheduling periodic, real-time tasks," Algorithmica, vol. 4, no. 1, pp. 209–219, 1989.
[9] S.-H. Oh and S.-M. Yang, "A modified least-laxity-first scheduling algorithm for real-time tasks," in Fifth International Conference on Real-Time Computing Systems and Applications, Oct 1998, pp. 31–36.
[10] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel, "Proportionate progress: A notion of fairness in resource allocation," in Proceedings of the Twenty-fifth Annual ACM Symposium on Theory of Computing. New York, NY, USA: ACM, 1993, pp. 345–354.
[11] K. Lakshmanan, S. Kato, and R. Rajkumar, "Scheduling parallel real-time tasks on multi-core processors," in 31st IEEE International Real-Time Systems Symposium (RTSS), Nov 2010, pp. 259–268.
[12] A. Saifullah, K. Agrawal, C. Lu, and C. Gill, "Multi-core real-time scheduling for generalized parallel task models," in 32nd IEEE International Real-Time Systems Symposium (RTSS), Nov 2011, pp. 217–226.
[13] K. Agrawal, C. Gill, J. Li, M. Mahadevan, D. Ferry, and C. Lu, "A real-time scheduling service for parallel tasks," in Proceedings of the 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS). Washington, DC, USA: IEEE Computer Society, 2013, pp. 261–272.
[14] Google. Go language. [Online]. Available: https://golang.org/
[15] M. McCool, A. D. Robison, and J. Reinders, Structured Parallel
Programming: Patterns for Efficient Computation. Morgan Kaufmann
Publishers, 2012.
[16] T. L. Harris, “A pragmatic implementation of non-blocking linked-lists,”
in Proceedings of the 15th International Conference on Distributed
Computing. London, UK, UK: Springer-Verlag, 2001, pp. 300–314.
[Online]. Available: http://dl.acm.org/citation.cfm?id=645958.676105
[17] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded
Fig. 11. On-the-fly scheduling algorithm change: a task is represented by a computations by work stealing,” J. ACM, vol. 46, no. 5, pp. 720–
colored box. The box boarder means an attempt of task switch. For EDF, this 748, Sep. 1999. [Online]. Available: http://doi.acm.org/10.1145/324133.
attempt is only made when a current task ends or a new task is runnable. LLF 324234
makes a switch attempt per tllf (tllf = 20ms used here). For both EDF and [18] A sample implementation of HoG in golang. [Online]. Available:
https://github.com/satojkovic/go-HoG-sample
LLF, an attempt successes if a current task ends or a new task has higher [19] P. Tang, “Multi-core parallel programming in Go,” in the First Interna-
priority. An arrow stands for the time when a deadline miss happens. tional Conference on Advanced Computing and Communications, 2010.
[20] C. Li, C. Ding, and K. Shen, “Quantifying the cost of context switch,”
in Proceedings of the 2007 Workshop on Experimental Computer
V. C ONCLUSIONS AND F UTURE W ORK Science. New York, NY, USA: ACM, 2007. [Online]. Available:
http://doi.acm.org/10.1145/1281700.1281702
In this paper, we present the design and implementation [21] T. P. Baker, “Comparison of empirical success rates of global vs.
of Go-RealTime, a real-time parallel programming framework partitioned fixed-priority and EDF scheduling for hard real time,” FSU
implemented in user space using Go programming language. Technical Report, TR-050601, 2005.
Important design choices related to resource reservation, asyn- [22] J. Lee and I. Shin, “EDZL schedulability analysis in real-time multicore
scheduling,” IEEE Transactions on Software Engineering, vol. 39, no. 7,
chronous event handling and programming interface make pp. 910–916, July 2013.
it possible for the application developer to embed strong [23] M. Bertogna, M. Cirinei, and G. Lipari, “Schedulability analysis of
timing requirements and ensure their satisfaction through a global scheduling algorithms on multiprocessor platforms,” IEEE Trans-
actions on Parallel and Distributed Systems, vol. 20, no. 4, pp. 553–566,
flexible runtime scheduler. It also supports DAG-based parallel April 2009.
programming to deploy parallel program on multiprocessor [24] M. Cirinei and T. P. Baker, “EDZL scheduling analysis,” in 19th
system. Euromicro Conference on Real-Time Systems (ECRTS’07), July 2007,
Go-RealTime is implemented by modifying open source pp. 9–18.
Go runtime, implementing Go packages and by importing
important Linux system calls into Go. It uses Linux thread
scheduler for resource reservation. Goroutine running on top of
thread is the unit of concurrency. Our framework greatly sim-
plifies the implementation and deployment of RT programs,
and improves the flexibility in system extension.
Our prototype of Go-RealTime is moving forward in the fol-
lowing directions: (1) design a strategy to automatically place
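The dual-queue mechanism behind the on-the-fly scheduler change can be sketched in Go. This is a minimal illustration, not Go-RealTime's actual API: the type and field names are assumptions, and ranking the migrated (LLF) queue before the remaining EDF queue is a simplification of the real dispatch rule. It shows the two properties the text describes: both queues coexist during the change, and a task keeps its state while moving to the LLF queue at the start of its next period.

```go
package main

import "fmt"

// State mirrors the task states named in the paper: "running",
// "runnable in EDF queue", or "idle until next period starts".
type State int

const (
	Idle State = iota
	Runnable
	Running
)

// Task carries both scheduling keys so either policy can rank it.
// Field names are illustrative, not Go-RealTime's actual interface.
type Task struct {
	Name     string
	Deadline int // absolute deadline (ms), used by EDF
	Laxity   int // deadline minus remaining execution time (ms), used by LLF
	St       State
}

// Scheduler keeps one ready queue per policy. During an on-the-fly
// EDF-to-LLF change both queues coexist: already-admitted tasks stay
// in the EDF queue with their state unchanged.
type Scheduler struct {
	edf, llf []*Task
}

// OnPeriodStart migrates a task into the LLF queue at the beginning
// of its next period, preserving its state, as described in the text.
func (s *Scheduler) OnPeriodStart(t *Task) {
	for i, u := range s.edf {
		if u == t {
			s.edf = append(s.edf[:i], s.edf[i+1:]...)
			break
		}
	}
	t.St = Runnable
	s.llf = append(s.llf, t)
}

// Pick selects the next task: least laxity among migrated tasks,
// falling back to earliest deadline among tasks still in the EDF
// queue (a simplification of the real dispatch rule).
func (s *Scheduler) Pick() *Task {
	best := func(q []*Task, key func(*Task) int) *Task {
		var b *Task
		for _, t := range q {
			if t.St != Idle && (b == nil || key(t) < key(b)) {
				b = t
			}
		}
		return b
	}
	if t := best(s.llf, func(t *Task) int { return t.Laxity }); t != nil {
		return t
	}
	return best(s.edf, func(t *Task) int { return t.Deadline })
}

func main() {
	a := &Task{"A", 40, 15, Runnable}
	b := &Task{"B", 30, 20, Runnable}
	c := &Task{"C", 50, 5, Runnable}
	s := &Scheduler{edf: []*Task{a, b, c}}

	fmt.Println(s.Pick().Name) // EDF still in effect: earliest deadline wins (B)
	s.OnPeriodStart(c)         // C's next period begins; it joins the LLF queue
	fmt.Println(s.Pick().Name) // C is now ranked by laxity in the LLF queue
	fmt.Println(len(s.edf), len(s.llf))
}
```

Because each task carries both keys, no per-task bookkeeping is lost when the policy flips, which is what allows the change to happen without stopping the task set.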
Reaction 2016 Author Index
Author Index
Afanasov, Mikhail 19
Astrinaki, Maria 13
Becker, Matthias 1
Behnam, Moris 1
Bini, Enrico 31
Christoforakis, Ioanins 13
Dasari, Dakshina 1
Diguet, Jean-Philippe 43
Domı́nguez-Poblete, Jorge 7
Dridi, Mourad 43
Eker, Johan 31
Fang, Zhou 37
Garcı́a-Valls, Marisol 7
Gupta, Rajesh 37
Iavorskii, Aleksandr 19
Kornaros, Georgios 13
Leva, Alberto 25
Luo, Mulong 37
M. Anwar, Fatima 37
Millnert, Victor 31
Mottola, Luca 19
Mubeen, Saad 1
Nolte, Thomas 1
Rubini, Stéphane 43
Singhoff, Frank 43
Terraneo, Federico 25
Touahria, Imad Eddine 7
Zhuang, Hao 37
Reaction 2016 Keyword Index
Keyword Index
adaptive software 19
architectural support for QoS 13
Automotive 1
Cheddar 43
Clock Synchronization 25
Cloud computing 31
Communication delay 43
context oriented 19
Control Theory 25
cyber-physical systems 19
Data Age 1
DDS 7
Design Phase 1
End-to-End 1
Go Language 37
guaranteeing on-chip throughput 13
Parallel Programming 37
Partitioned systems 7
programming paradigm 19
Publish-subscribe 7
Real-time 7
Real-Time Systems 37
Real-time systems 43
resource management 13
Scheduling Algorithms 37
service chain 31
Switched control 25
SystemC NoC simulator 43
task model 43
time-sensitive adaptation 19
traffic-flow model 43