
Int. J. Aviation Management, Vol. 1, No. 3, 2012

Reinforcement learning agents to tactical air traffic flow management
Antonio Marcio Ferreira Crespo*
Comando da Aeronáutica,
Centro de Gerenciamento da Navegação Aérea,
Praça Sen. Salgado Filho S/N,
Centro, Rio de Janeiro, 71615-600, RJ, Brazil
E-mail: crespo@cgna.gov.br
*Corresponding author

Li Weigang
Department of Computer Science,
University of Brasilia,
Caixa Postal 4466, 70919-970 Brasilia, DF, Brazil
Fax: (061) 32733589
E-mail: weigang@unb.br

Alexandre Gomes de Barros


Department of Civil Engineering,
Schulich School of Engineering,
University of Calgary,
2500 University Drive NW, Calgary,
Alberta, T2N 1N4, Canada
Fax: (403) 2827026
E-mail: debarros@ucalgary.ca
Abstract: Air traffic flow management (ATFM) is of crucial importance for
the airspace control system, due to two factors: first, the impact of ATFM on
air traffic control, including inherent safety implications on air operations;
second, the possible consequences of ATFM measures on airport operations.
Thus, it is imperative to establish procedures and develop systems that help
traffic flow managers to take optimal actions. In this context, this work presents
a comparative study of ATFM measures generated by a computational agent
based on artificial intelligence (reinforcement learning). The goal of the agent
is to establish delays upon takeoff schedules of aircraft departing from certain
terminal areas so as to avoid congestion or saturation in the air traffic control
sectors due to a possible imbalance between demand and capacity. The paper
includes a case study comparing the ATFM measures generated by the agent
autonomously and measures generated taking into account the experience of
human traffic flow managers. The experiments showed satisfactory results.
Keywords: ATFM measures; agents; artificial intelligence; AI; reinforcement
learning.

Copyright 2012 Inderscience Enterprises Ltd.



Reference to this paper should be made as follows: Crespo, A.M.F., Weigang, L. and de Barros, A.G. (2012) 'Reinforcement learning agents to tactical air traffic flow management', Int. J. Aviation Management, Vol. 1, No. 3, pp.145–161.
Biographical notes: Antonio Marcio Ferreira Crespo received his Bachelor of
Aeronautical Science from the Air Force Academy in 1994, Bachelor of Social
Science from Federal University of Santa Catarina in 2003, specialisation in
Electronic Engineering from the Technological Institute of Aeronautics in
2004, and Master in Computer Science from the University of Brasilia in 2010.
He is a Brazilian Air Force Officer with experience in sociology (with emphasis on social indicators), electronic engineering (with emphasis on antennas applied to military systems), and air traffic control and management, acting mainly in the following areas: air traffic flow management, electronic warfare programmes, and operational safety and quality assurance in air traffic services.
Li Weigang is Associate Professor and coordinator of TransLab, the Laboratory of Computational Models for Air Transport, Department of Computer Science (CIC) of the University of Brasilia (UnB). He is Vice-President for Scientific Merit of the Brazilian Air Transportation Research Society (SBTA). He was coordinator of the graduate department (2003–2006, 2008–2009), general coordinator of the Brazilian Symposium of Air Transportation (SITRAER, 2006), and member of the editorial board of The Journal of the Brazilian Air Transportation Research Society (2007). He received his Doctor of Science from the Technological Institute of Aeronautics (ITA) in 1994. He completed his post-doctorate in the area at the University of Calgary, Canada, with support from CNPq (2002–2003). His lines of research include artificial intelligence and its application in related areas such as air transport (ATFM), frequent flyer programmes (FFP), finance and the web.
Alexandre Gomes de Barros received his Bachelor's degree in Civil Engineering from Universidade Estadual de Campinas in 1991, Master's in Operations Research from the Technological Institute of Aeronautics in 1994, and PhD in Transportation Engineering from the University of Calgary, Canada in 2001. He was the Director of Airport Infrastructure at the National Civil Aviation Agency (ANAC) from 2008 to 2010, where he contributed to a meaningful restructuring of the Brazilian civil aviation system. He is currently an Assistant Professor of
Transportation Engineering at the University of Calgary, Canada. He is also a
member of the editorial board of the Journal of Advanced Transportation. He
has experience in transport engineering, with emphasis on airports and air
transport, acting on the following topics: planning and design of airports;
planning air transport systems; intelligent transport systems.
This paper is a revised and expanded version of a paper entitled 'Reinforcement learning agents to tactical air traffic flow management' presented at ATRS 2010, Porto, Portugal, 6–9 July 2010.

1 Introduction

Artificial intelligence (AI) is a valuable technique in the development of computer tools for the improvement of air traffic management (ATM). AI tools have been developed for all three categories of ATM: airspace management (ASM), air traffic control (ATC) and air traffic flow management (ATFM). Recent literature on the topic has discussed the use of several forms of AI in ATM, such as automata theory (Bayen et al., 2005), intelligent agents (Wolf, 2007), and reinforcement learning with multi-agent theory (Alves et al., 2008).
Under normal operational conditions, the demand for air traffic services is satisfied without delays imposed on aircraft. However, occasional operational constraints such as bad weather, equipment and telecommunication malfunctions can create an imbalance between demand and ATC capacity, resulting in overloads at one or more ATC sectors. Such overloads require the adoption of traffic flow restrictions, known as tactical ATFM. Such restrictions can include both ground and airborne holding delays.
Figure 1 SISCONFLUX-MAAD architecture

In 2007, the Brazilian Centre for Air Navigation Management (CGNA) started to develop
a modular system to assist air traffic flow managers in the application and management of
air traffic flow control measures. This system, called SISCONFLUX (the Portuguese abbreviation for 'decision support system for tactical traffic flow management'), is capable of stipulating intentional delays for aircraft departing from certain terminal manoeuvring areas (TMA) in order to avoid overloads on en route ATC sectors due to occasional, temporary imbalances between demand and capacity. Figure 1 illustrates the system, which is comprised of three modules:

•	scenario prediction and monitoring (MAPC)
•	flow balancing (MBF)
•	decision evaluation and support (MAAD).

At first, the actions are generated by maximum flow algorithms in MBF, representing
solutions that will not saturate the ATC sectors. From a certain point, however, the
actions suggested are generated by the reinforcement learning algorithm in MAAD.
SISCONFLUX was developed with the following goals:
1	to assist in the implementation of an ATFM system that improves the efficiency of the traffic flow control measures (holding aircraft departures on the ground) taken by the CGNA
2	to conceive an automated process, supported by a reinforcement learning agent, that is capable of incorporating the experience of human experts, i.e., CGNA's air traffic flow managers
3	to develop a decision support system that can provide CGNA's flow managers with traffic flow control measures according to pre-established criteria such as total system delay and ATC workload.

As part of the SISCONFLUX system development, two prototypes were built for the
decision evaluation and support module (MAAD). The first prototype suggests traffic flow control measures without taking the human agents' experience into account. The second one is a decision support system that incorporates the experience of CGNA's flow managers. This paper describes these two prototypes and discusses the results of their
application to scenarios based on real demand.

2 Literature review

Several techniques are used in the existing literature for ATFM. The most commonly
used are mixed integer linear programming (Ball et al., 2003), dynamic programming
(Zhang et al., 2005), and AI techniques such as expert systems (Weigang et al., 1997) and
reinforcement learning (Alves et al., 2008; Agogino and Tumer, 2009). In the application
of AI, ATFM is a complex decision-making process that involves several entities, where
intelligent software agents can be used both in simulations and real operations, assisting
human operators (Wolf, 2007).
In this context, agents can be conceived to collect, process and disseminate relevant
information about the ATFM environment, such as demand forecasts, projected impacts
and variations in airspace capacity. Ultimately, intelligent agents can be constructed to
generate ATFM actions/measures from the analysis of several scenarios, acting directly
in the decision-making process.
Ball et al. (2003) used a mixed integer linear programming approach to ATFM.
They proposed a generalisation of the classic network flow model by replacing the
deterministic demand with a stochastic one. Although this generalisation destroys the network's original structure, they show that the matrix underlying the stochastic model constitutes a dual network; therefore, the integer programming problem associated with the stochastic model can be solved efficiently. This model was applied to

Reinforcement learning agents to tactical air traffic flow management

149

solve the ground holding problem (GHP) within ATFM. Simply put, the authors
presented an objective function that consisted of minimising the linear relation between
supply and demand in a classic network flow model, based on origin and destination node
sets, where the nodes represent the airports.
Ma et al. (2004) proposed a model based on graph theory where the airport and
ATC sector network is represented by nodes connected by arcs. The computing time to
reach the optimal solution, however, was prohibitively long. To address the problem of
computing time, Zhang et al. (2005) used dynamic programming and focused on
minimising congestion at the ATC sectors as opposed to minimising congestion at the
airports, using both ground and airspace holdings. The network is represented by a
three-dimensional directed graph. Each airport, transfer point and navigational aid is represented by a vertex, whereas the air route segments and approach corridors are represented by directed arcs. An aircraft's flight plan is modelled as a sequence of
nodes on the graph. The objective function includes fuel consumption, ground and
airborne holding delay costs.
Recent research in ATFM combines reinforcement learning and multi-agent
techniques. Agogino and Tumer (2009) use this combination to incorporate the human
agents' experience into automated ATM processes. ATC domains are too complex to be
completely automated and are directly impacted by tactical ATFM. Figure 2 illustrates
the flow of information and actions considering a system where the agent is fully
automated.
Figure 2 Fully automated system (source: Agogino and Tumer, 2009)

In the fully automated process, the control measure suggested by the ATFM agent is
applied directly, modifying the traffic flow. The modified scenario is then assessed
according to pre-established criteria and the system computes a reward for the agent. This reward is then used to modify the control policy that the agent adopts, which may change its next recommendations. In ATFM, however, the most adequate configuration suggests a
decision-making process in which a software agent offers recommendations to a human

agent, who acts as a filter by either accepting or rejecting the software agent's suggestion.
Figure 3 illustrates this concept.
Figure 3 Semi-automated system

Agogino and Tumer (2009) developed metrics to evaluate the efficiency of traffic flow
control measures generated by reinforcement learning agents. As a result, it was possible
to improve the actions taken by an agent conceived to simulate the behaviour of air traffic
controllers, based on the suggestions made by the ATFM agent and on whether such suggestions were accepted by the air traffic controller. Figure 4 illustrates this model.
Figure 4 Semi-automated system (source: Agogino and Tumer, 2009)

It should be noted that, in the system proposed by Agogino and Tumer (2009), the ATFM
agent is a reinforcement learning algorithm that seeks to maximise the reward that
derives from the scenarios generated, which will be evaluated based on the total delays generated and the number of aircraft in the ATC sectors, both resulting from the flow control policy. In turn, the air traffic controller is characterised as a reward-maximising reinforcement learning agent that seeks only the minimisation of the number of aircraft in


the ATC sectors. This assumed behaviour for the air traffic controller is not entirely valid,
as the minimisation of traffic in the ATC sectors is not ATC's only goal. In reality, ATC seeks a balance between safety, which benefits from a low number of aircraft in the ATC sectors, and operational efficiency, which depends on the maximisation of
flows. This assumed behaviour is, therefore, a significant limitation of the model.
In the process illustrated in Figure 4, the ATFM agent suggests a control measure.
The air traffic controller has a choice between accepting the measure and rejecting it. The
air traffic controller has a reward scheme conceived to encourage acceptance of the
suggestions. When the measure is adopted, the new scenario is projected and the total
delay and number of aircraft in the ATC sectors are computed. Based on these values, the
system determines the value of the reward, which will impact the policy used in future
suggestions. The simulations performed with real data showed a reduction in air traffic
congestion of about 20% compared to when the air traffic controller acts alone.

3 Methodology

MAAD was developed with the use of AI techniques. As the air traffic system can be modelled as a Markov decision process, an agent was developed based on multi-attribute utility theory, decision theory and reinforcement learning (Russell and Norvig, 2003). The first two were the basis for the development of a state evaluation function used to evaluate ATFM scenarios. The results of this evaluation are used to produce the rewards awarded to the agent.
Reinforcement learning was used in the implementation of a software agent for decision support in ATFM, with the ability to incorporate the experience of decision makers. Hence, a device was developed that is capable of generating flow adjustment policies, i.e., sets of traffic flow control measures that are suggested to the flow managers (the human agents).
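As a rough illustration of this formulation, the sketch below shows the acting/learning cycle implied by treating the system as a Markov decision process: the agent observes a scenario, suggests a measure, the projected scenario is evaluated, and the result feeds back as a reward. All names and the toy projection/evaluation functions are assumptions for illustration only; they do not reproduce SISCONFLUX internals.

```python
# Minimal sketch of the MDP-style acting/learning cycle, with toy stand-ins
# (project_scenario and evaluate_scenario are hypothetical, not the real system's API).
import random

def project_scenario(state, measure):
    # Toy projection: assume the delay relieves the busiest sectors first.
    relieved = min(len(state), measure // 5)
    kept = sorted(state)[: len(state) - relieved]
    return tuple(kept + [0] * relieved)

def evaluate_scenario(state, measure):
    # Toy reward: penalise saturated sectors (>= 14 aircraft) and the delay itself.
    saturated = sum(1 for n in state if n >= 14)
    return -(10 * saturated + measure)

def run_episode(policy, initial_state, steps=10):
    state, total_reward = initial_state, 0.0
    for _ in range(steps):
        measure = policy(state)                        # ground delay in minutes
        next_state = project_scenario(state, measure)  # projected sector occupancies
        reward = evaluate_scenario(next_state, measure)
        total_reward += reward
        state = next_state
    return total_reward

random_policy = lambda state: random.choice([0, 5, 10, 15, 20, 25])
print(run_episode(random_policy, initial_state=(16, 15, 9, 7, 3)))
```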

3.1 Model development


As part of this research, several meetings were held with CGNA managers, software
engineers and analysts responsible for the development of the ATFM system
SYNCROMAX, currently used by CGNA (Timoszczuk et al., 2009). A database
containing historical information on traffic flows was made available to the authors by
CGNA, as well as all reports related to traffic flow control measures applied in Brazil
from January 2008 to March 2010. This material was used to create the scenarios used in
the training and testing of the agent. The history of traffic flow control measures allowed
for the construction of an action set for the software agent that incorporated the CGNA
flow managers' experience.

3.1.1 Action set


Two action sets were defined for the implementation and testing of the algorithms used in
both MAAD prototypes. The first set was defined autonomously, without human
intervention, from flow adjustment measures produced by the flow balancing module in
SISCONFLUX. The second set consists of the autonomous action set with the addition of
actions effectively applied by CGNA flow managers in real situations between 2008 and

152

A.M.F. Crespo et al.

2010. Such actions were extracted from CGNA's daily status reports. These action sets will be referred to as flow adjustment measures, or simply measures.
In the definition of the measures, all records from the daily status reports between
January 2008 and March 2010 were surveyed. In general, a measure consists of delays
(ground holdings) of 5, 7, 10, 15, 20 and 25 minutes imposed on aircraft departure times from the TMAs. These delays can either be applied indiscriminately to all aircraft departing from a TMA or selectively to certain aircraft with specific destinations. Among
the 37 TMAs in the system, only 19 featuring an annual traffic of at least 20,000
movements were included in the model. These 19 TMAs represent approximately 93%
of the total traffic at the airports operated by INFRAERO, the Brazilian state-owned
airport operator (INFRAERO, 2009).
Table 1 shows two examples of measures contained in the action sets developed.
Table 1 Flow adjustment measures. For each TMA listed (SP, BH, AN, RE, BR and CY), each example measure assigns a delay of 5 to 25 minutes, marked 'X' when applied to all departures from that TMA or with a two-letter destination code when applied selectively.

As can be seen in Table 1, an action consists of a set of delays specific for each terminal.
The 'X' represents delays being applied to all aircraft leaving the TMA. If, for a certain TMA, the delay is to be applied only to airplanes with a specific destination, a two-letter code representing that destination is used.
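Purely for illustration, a measure of this kind could be represented as a set of per-TMA delays with an optional destination filter. The field names and the example values below are assumptions, not the format used by SISCONFLUX nor the actual contents of Table 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TmaDelay:
    tma: str          # departure TMA, e.g. "SP"
    minutes: int      # one of the delays used in practice: 5, 7, 10, 15, 20 or 25
    applies_to: str   # "ALL" (the 'X' in Table 1) or a destination TMA code, e.g. "BR"

# Two illustrative measures (cell values are invented, not taken from Table 1).
measure_1 = (TmaDelay("SP", 5, "ALL"), TmaDelay("BH", 10, "ALL"))
measure_2 = (TmaDelay("SP", 7, "BR"),)   # selective: only SP departures bound for BR

def delay_for(measure, flight):
    """Minutes of ground holding the measure imposes on a flight {'origin', 'destination'}."""
    for d in measure:
        if d.tma == flight["origin"] and d.applies_to in ("ALL", flight["destination"]):
            return d.minutes
    return 0

print(delay_for(measure_2, {"origin": "SP", "destination": "BR"}))  # 7
print(delay_for(measure_2, {"origin": "SP", "destination": "RJ"}))  # 0
```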

3.1.2 State set


The reinforcement learning algorithm used in MAAD is known as Q-learning (Watkins and Dayan, 1992). This algorithm uses a table, denoted the Q-table, that, as part of the learning process, incorporates the agent's experience, i.e., it stores action-scenario pairs that represent the states observed by the system. The scenario configuration consists of the number of aircraft in the ATC sectors at certain times. The typical number of aircraft in each sector ranges from 0 to 16, but can surpass 20 at times. Table 2 shows the number of aircraft in a portion of the Brazilian airspace known as FIR BS on March 12, 2010, at times chosen randomly.
The Brazilian airspace is divided into 46 ATC sectors. If each sector can have between 0 and 16 aircraft, then the number of possible states for each scenario is 17^46. In order to facilitate the Q-table learning process, a simplification was made to reduce the number of possible states. The sectors were categorised by ranges of numbers of aircraft. Individual sectors were classified as 'normal' if the number of aircraft is under 14 and 'saturated' otherwise. For ATC purposes, some sectors may be grouped. In this case, grouped sectors are considered saturated if their number of aircraft is equal to or larger than 12, and normal otherwise. Instead of containing the exact number of aircraft in each sector, the Q-table will only contain the information 'saturated' (S) or 'normal' (N). Table 3 shows an example of a simplified Q-table.

Table 2 Projected scenario for FIR BS on 12/03/2010 (number of aircraft in each ATC sector, from sector group 1-2 through sector 12, at randomly chosen times between 00:15 and 23:25)

Table 3 Projected scenario for FIR BS on 12/03/2010, simplified compilation (each sector marked 'S' for saturated or 'N' for normal at the same times as in Table 2)
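A minimal sketch of this state simplification is given below. The thresholds (14 aircraft for individual sectors, 12 for grouped sectors) are those stated in this section; the function and sector names are illustrative.

```python
def encode_state(occupancy, grouped):
    """occupancy: {sector: projected aircraft count}; grouped: set of grouped-sector ids."""
    state = []
    for sector in sorted(occupancy):
        threshold = 12 if sector in grouped else 14   # thresholds quoted in Section 3.1.2
        state.append("S" if occupancy[sector] >= threshold else "N")
    return tuple(state)   # hashable, so it can be used directly as a Q-table key

occupancy = {"Sector 1-2": 19, "Sector 5": 10, "Sector 7": 13, "Sector 9": 14}
print(encode_state(occupancy, grouped={"Sector 1-2"}))   # ('S', 'N', 'N', 'S')
```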

3.2 Scenario evaluation


In the development of a reward scheme for the reinforcement learning agent, it is
important to define a scenario evaluation function that effectively reflects the agent's
objectives. Each of the MAAD prototypes will have its own scenario evaluation function,
which includes TMA delay and ATC load. The scenario evaluation function is defined as:
TC(e) = αD(e) + βC(e)	(1)

where
e	state
TC(e)	scenario evaluation function
D(e)	total delay
C(e)	ATC load factor
α, β	normalisation coefficients, with α + β = 1.


3.2.1 Total delay


The total delay criterion, D(e), is calculated as the sum of the delays measured in each
TMA, given by
D(e) = Σt∈T Dt(e)	(2)

where
Dt(e)	delay calculated for TMA t
T	set of TMAs.

The delay for each TMA is the product of the number of aircraft set to depart in the next 2 hours and the suggested action (delay in minutes per aircraft), i.e.,

Dt(e) = mt × pt	(3)

where
mt	number of aircraft in TMA t to which the measure is applied
pt	measure applied to TMA t, in minutes of delay per aircraft.

3.2.2 ATC load factor


The number of aircraft in the ATC sectors, despite not being the preponderant factor in the determination of the air traffic controllers' workload, is included in the scenario evaluation function as a proxy for ATC complexity, which is currently measured only in a limited number of ATC sectors in Brazil. The ATC load factor is calculated for each sector as:

Cs(e) = 10 if ms ≥ cs, and Cs(e) = 0 if ms < cs	(4)

where
Cs(e)	ATC load factor for sector s
ms	number of aircraft in sector s
cs	capacity of sector s.

The total ATC load factor used in the scenario evaluation function is the sum of the factors calculated for all sectors:

C(e) = Σs∈S Cs(e)	(5)

where
S	set of ATC sectors.


In summary, MAAD was built as a software agent with the goal of choosing the measure
that produces the best evaluation of the scenario, i.e., the highest reward. The agent seeks
to maximise the rewards using the Q-learning algorithm.
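Equations (1) to (5) translate almost directly into code. The sketch below is illustrative only: the normalisation constants are taken from Table 4 (presented later), and the remaining names and example values are assumptions.

```python
ALPHA, BETA = 0.0136985, 0.9863015   # normalisation constants, alpha + beta = 1 (Table 4)

def total_delay(departures, measure):
    """Eq. (2)-(3): sum over TMAs of (aircraft affected) x (minutes of delay)."""
    return sum(departures.get(tma, 0) * minutes for tma, minutes in measure.items())

def atc_load(occupancy, capacity):
    """Eq. (4)-(5): each sector contributes 10 when at or above capacity, 0 otherwise."""
    return sum(10 if occupancy[s] >= capacity[s] else 0 for s in occupancy)

def scenario_evaluation(departures, measure, occupancy, capacity):
    """Eq. (1): TC(e) = alpha * D(e) + beta * C(e)."""
    return ALPHA * total_delay(departures, measure) + BETA * atc_load(occupancy, capacity)

departures = {"SP": 12, "BR": 8}     # aircraft set to depart in the next 2 hours, per TMA
measure = {"SP": 10, "BR": 5}        # minutes of ground holding per TMA
occupancy = {"S1": 15, "S2": 9}      # projected aircraft per sector
capacity = {"S1": 14, "S2": 14}      # declared sector capacities
print(scenario_evaluation(departures, measure, occupancy, capacity))
```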

4 Proposed learning structure for ATFM applications

This research was preceded by several other studies on the application of reinforcement
learning agents to ATFM. Such studies have demonstrated that the use of an agent system
for ATM requires intensive knowledge (Wolf, 2007). Some of this knowledge can be
obtained from experts and in real time, but this is not always practical, especially when
the experts do not have all the knowledge required. In such cases, knowledge acquisition
can be done in alternative ways, such as historical data. These data can, in turn, consist of
a set of actions applied by human experts in past scenarios. This mode of knowledge
acquisition is defined as historical data learning (Wolf, 2007).
In historical data learning, the data when available can serve as the basis for the
development of human behaviour models, either manually or through machine learning
methods. The scenario configuration and the inputs from human experts must form the
attribute set, from which any correct decision will be learned.
Figure 5 Proposed learning cycle

It should be noted that in the studies on the application of software agents to ATFM
found in the literature, the behaviour of air traffic controllers and traffic flow managers
was simulated (e.g., Agogino and Tumer, 2009). As a consequence, the agents did not
learn from real human agents. This research proposes a novel methodology in which the
experience of human ATFM experts is incorporated by the software agent using historical

156

A.M.F. Crespo et al.

data, extracted from ATFM measures applied by CGNA flow managers between 2008
and 2010. Figure 5 illustrates the proposed learning cycle.
In Figure 5, it can be seen that the ATFM agent is fed with ATFM measures
previously taken by CGNA flow managers in real situations. These measures, together
with those generated by the flow balancing module (MBF) in SISCONFLUX, will make
up the agent's action set. From this action set, the agent will suggest measures and the acting/learning cycle will proceed as normal. In summary, the learning cycle effectively incorporates the human agents' experience. This concept represents a clear advantage of this approach over previous works, as it allows the evaluation, through comparative analysis, of typical actions taken by both human and software agents.
Since the ATFM agent developed in this research has the ability to suggest actions typically taken by human agents, it became possible to address an extremely relevant problem in ATFM: the need to apply flow control measures in a more localised, selective manner, affecting only aircraft with specific destinations. This application aims to restrict the impact of those measures to the areas most directly related to the imbalance between demand and capacity. Such measures could not have been generated by a system equipped only with maximum network flow algorithms such as the ones used in MBF.
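A rough sketch of the step that distinguishes this approach, assembling the agent's action set from both MBF-generated measures and historical CGNA measures, is shown below. The loader functions and the example measures are hypothetical stand-ins for the actual data sources.

```python
def load_mbf_measures():
    # Stand-in for measures produced by the flow balancing module (MBF),
    # e.g. uniform delays per TMA derived from maximum-flow solutions.
    return [{"SP": (5, "ALL")}, {"SP": (15, "ALL"), "BH": (10, "ALL")}]

def load_cgna_historical_measures():
    # Stand-in for measures extracted from CGNA daily status reports (2008-2010),
    # including selective measures such as delaying only SP departures bound for BR.
    return [{"SP": (7, "BR")}, {"SP": (5, "ALL")}]   # may duplicate MBF measures

def build_action_set(include_historical):
    actions = {frozenset(m.items()) for m in load_mbf_measures()}
    if include_historical:   # MAAD* only
        actions |= {frozenset(m.items()) for m in load_cgna_historical_measures()}
    return [dict(a) for a in actions]

print(len(build_action_set(False)), len(build_action_set(True)))   # MAAD vs. MAAD* set sizes
```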

5 Case study

In the case study, two agent prototypes were used: MAAD and MAAD*. The difference
between them is in the action set used. MAAD uses an action set generated exclusively
by the flow balancing module (MBF). On the other hand, MAAD* uses an action set
containing a subset of historical actions taken by CGNA traffic flow
managers. The use of an action set that includes historical actions taken by human flow
managers is one of the main contributions of this work.
The scenarios submitted to the prototypes were defined using traffic flows predicted
by CGNA. The prototypes were trained with the use of congested scenarios projected
from the demand forecasts supplied by CGNA. To accelerate the learning process, a
scenario configuration that reduces airspace capacity and consequently increases the
number of saturation points was used.

5.1 Case study preparation


The case study was developed through the following steps:
a	Collection of data on air traffic demand for the Brazilian airspace between January 2008 and March 2010.
b	Collection of data on air traffic flow control measures taken by CGNA traffic flow managers in real situations for the same period as above. These measures were added to the action set used in MAAD*.
c	Insertion of the flights into the system's database, allowing the occupancy of the ATC sectors to be projected.
d	Definition of the scenario configurations.
e	Initial tests for validation of the prototypes, using the scenario for 6 March 2008.
f	Verification that the Q-learning algorithm used in MAAD converges, using the scenario for 18 March 2010.
g	MAAD prototype training, using 20 different scenarios selected randomly.
h	Application of the MAAD prototype to the scenario for 17 March 2010.
i	Verification that the Q-learning algorithm used in MAAD* converges, using the scenario for 18 March 2010.
j	MAAD* prototype training, using 20 different scenarios selected randomly.
k	Application of the MAAD* prototype to the scenario for 17 March 2010.

5.2 Study settings


The scenarios used in this case study were projected from real demand observed in
specific periods. The prototypes processed scenarios under four different conditions. The
first one was a validation exercise, which consisted of processing a simple scenario to validate the prototypes' learning process.
In the second run, the prototypes processed a complex scenario, so that it could be verified that the Q-learning algorithm used in the prototypes converges. In the third run, each
prototype processed 20 scenarios chosen randomly. The processing of these scenarios
aimed to train the prototypes, i.e., to confirm the Q-table. Finally, in the fourth run,
the prototypes processed an unknown scenario in a real situation, allowing for the
evaluation of their performance and applicability.

5.2.1 Scenario configurations


The Brazilian airspace is divided into 46 ATC sectors. However, depending on demand and following specific operational rules, the ATC configuration can be changed by grouping certain sectors. In this experiment, the scenarios were built using an airspace configuration with 25 sectors or groups of sectors. Since each sector can have two possible states, normal and saturated, the resulting state set has 2^25 possible states.

5.2.2 Non-scheduled flight plans


Before departure, aircraft must file a flight plan detailing their intended routes through
the airspace. Authorisation for departure will be given only after the flight plan has been
evaluated and approved by ATC. Regularly scheduled flights follow repetitive flight
plans that are filed by the operator at least 15 days in advance of the flight departure date. Repetitive flight plans are evaluated and approved by CGNA and can have their effect on ATC evaluated well in advance of their actual realisation. Non-scheduled flights, on the other hand, are required to file their flight plans (FPL) up to 45 minutes before their departure time. These plans are approved by the local airport ATC authority and are
sent to flow managers for demand/capacity analysis only 20 minutes before departure
time. ATC does not keep these plans for statistical analysis.
Given these limitations imposed by non-scheduled flight plans, CGNA ran a survey
of non-regularly scheduled flight plans filed in high-volume areas. The data collected in
this survey allowed CGNA to extract statistics on the proportion of non-scheduled flights

158

A.M.F. Crespo et al.

with respect to total traffic. To account for this non-scheduled demand, several TMAs in
the case study had their demands increased by a factor ranging from 5% to 20%.
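In scenario preparation, this adjustment amounts to scaling each TMA's projected departures by its estimated non-scheduled share, as in the small sketch below; the per-TMA factors shown are invented examples within the 5% to 20% range quoted above.

```python
import math

# Per-TMA non-scheduled shares are placeholders within the 5%-20% range cited above.
non_scheduled_share = {"SP": 0.08, "RJ": 0.12, "BR": 0.20}

def inflate_demand(scheduled, shares):
    """Projected demand per TMA, rounded up to whole aircraft."""
    return {tma: math.ceil(n * (1 + shares.get(tma, 0.0))) for tma, n in scheduled.items()}

print(inflate_demand({"SP": 60, "RJ": 40, "BH": 25}, non_scheduled_share))
# {'SP': 65, 'RJ': 45, 'BH': 25}
```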

5.2.3 Learning parameters


The Q-learning algorithm used in the prototypes was set with the learning parameters
presented in Table 4.
Table 4	Learning parameters for the Q-learning algorithm

Parameter	Value
Normalisation constants	α = 0.0136985, β = 0.9863015
Temporal discount	0.8
Learning rate	0.2
Exploration/exploitation criteria	1 to 25,000 iterations: 20%/80%; 25,001 to 40,000 iterations: 5%/95%; 40,001 or more iterations: 1%/99%

In Table 4, the normalisation constants were set such that each factor, in the worst-case
scenario, contributes with approximately 50% to the reinforcement value. The temporal
discount and the learning rate were set using the values adopted by Ribeiro et al. (2006) to evaluate the efficiency of the Q-learning algorithm. Finally, the exploration/exploitation criteria were set empirically.
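With the parameters of Table 4, one iteration of the Q-learning algorithm (Watkins and Dayan, 1992) can be sketched as follows. The tabular representation and the ε-greedy schedule follow the values listed above; the state and action encodings are illustrative placeholders.

```python
import random
from collections import defaultdict

GAMMA = 0.8          # temporal discount (Table 4)
LEARNING_RATE = 0.2  # learning rate (Table 4)

def exploration_rate(iteration):
    # Exploration/exploitation schedule from Table 4: 20%, then 5%, then 1% exploration.
    if iteration <= 25_000:
        return 0.20
    if iteration <= 40_000:
        return 0.05
    return 0.01

Q = defaultdict(float)   # Q-table keyed by (state, action); unseen pairs default to 0.0

def choose_action(state, actions, iteration):
    if random.random() < exploration_rate(iteration):
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit

def q_update(state, action, reward, next_state, actions):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += LEARNING_RATE * (reward + GAMMA * best_next - Q[(state, action)])

# One illustrative step: states are the N/S tuples of Section 3.1.2.
actions = ["delay_5_SP", "delay_10_SP", "no_action"]
state, next_state = ("S", "N"), ("N", "N")
a = choose_action(state, actions, iteration=1)
q_update(state, a, reward=-12.0, next_state=next_state, actions=actions)
print(a, Q[(state, a)])
```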

5.3 Results and discussion


The two prototypes developed, MAAD and MAAD*, were applied to the scenario for 17 March 2010 and generated, for the TMAs listed, the flow adjustment measures shown
in Table 5.
In Table 5, one can notice the differences in measures suggested by the two
prototypes. For example, at 10:20 h, MAAD suggested a 15-minute delay for all
departures from TMA SP, whereas MAAD* suggested 10-minute delays. At 12:00 h,
MAAD suggested 5-minute delays for all departures from TMA SP; MAAD*, on the
other hand, suggested 7-minute delays only for aircraft departing from TMA SP to TMA
BR. This last suggestion is an action found exclusively in MAAD*'s action set, representing a typical human experience that was incorporated into the ATFM agent's
knowledge base.
The performance of the prototypes can be evaluated using the quality values Q(s, a)
calculated by the Q-learning algorithm. Figure 6 illustrates the performance of the
prototypes throughout the experiment.
The values of Q(s, a) were presented and evaluated after each iteration, and a new scenario was projected from the action suggested by the prototype. The increasing shape of the Q(s, a) curve shows that the scenarios presented a progressive reduction in saturation.
The evolution of the Q(s, a) was slightly different for each prototype, as can be seen in
Figure 6, but both presented satisfactory levels of convergence.

MAAD

15

15

15

10

TMA

SP

BR

BH

AN

UL

RJ

CT

PA

FL

10

10

15

15

15

MAAD*

10

15

15

MAAD

15

15

10

MAAD*

10:20
Delay (minutes)

09:55

Delay (minutes)

10:45

10

15

10

MAAD

10

10

10

MAAD*

Delay (minutes)

11:10

10

10

MAAD

10

10

MAAD*

Delay (minutes)

11:35

10

10

MAAD

10

MAAD*

Delay (minutes)

12:00

MAAD

5SP

5SP

5SP

5SP

5SP

7BR

MAAD*

Delay (minutes)

12:25

MAAD

5SP

5SP

5BR

MAAD*

Delay (minutes)

Table 5

Time

Reinforcement learning agents to tactical air traffic flow management


Flow adjustment measures

159


Figure 6 Performance comparison: MAAD vs. MAAD*

An important finding from the observation of the values in Table 5 and the evolution of Q(s, a) from the 26th iteration onwards is that, as the scenario becomes less complex, MAAD* tends to suggest actions from the subset of actions extracted from the CGNA flow managers' experience. This confirms the assertion that human expertise becomes more efficient than the machine's as the scenario complexity decreases. Indeed, in the
last iterations, the actions typically taken by human experts resulted in larger values of
Q(s, a) when compared with the actions suggested by the MAAD prototype without any
built-in human experience. The results of this case study indicate that the prototypes
developed can be of great value in ATFM decision support, especially when such
measures are more localised and applied to departures in a more selective fashion.

6 Conclusions

ATFM is of crucial importance to the airspace control system. The use of reinforcement learning in ATFM applications is relatively recent. This paper presented an application of a reinforcement learning agent that suggests actions to be taken by traffic flow managers in terms of delays imposed on departing aircraft through ground holding. The objective of these actions is to avoid congestion in the ATC sectors.
In the case study, comparisons were made between actions defined exclusively by
computer algorithms and actions that included the experience of human managers. The
results indicate that the incorporation of the human experience improves the actions
suggested by the algorithm, especially when the complexity of the scenario is low.

Acknowledgements
This research is partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq) (Procs. 306065/2004-5 and 485940/07-8) and by ATECH Tecnologias Críticas, Brazil.


References
Agogino, A. and Tumer, K. (2009) 'Learning indirect actions in complex domains: action suggestions for air traffic control', Advances in Complex Systems, Vol. 12, Nos. 4–5, pp.493–512, World Scientific Company.
Alves, D.P., Weigang, L. and Souza, B.B. (2008) 'Reinforcement learning to support meta-level control in air traffic management', in Weber, C., Elshaw, M. and Mayer, N. (Eds.): Reinforcement Learning: Theory and Applications, ARS, Vienna.
Ball, M.O., Hoffman, R., Odoni, A. and Rifkin, R. (2003) 'Stochastic integer program with dual network structure and its application to the ground holding problem', Operations Research, Vol. 51, No. 1, pp.167–171, Institute for Operations Research and the Management Sciences (INFORMS), Linthicum, Maryland, USA, ISSN: 0030-364X.
Bayen, A.M., Grieder, P., Meyer, G. and Tomlin, C.J. (2005) 'Lagrangian delay predictive model for sector-based air traffic flow', AIAA Journal of Guidance, Control, and Dynamics, Vol. 28, No. 5, pp.1015–1026.
INFRAERO (2009) Movimentos nos Aeroportos, available at http://www.infraero.gov.br (accessed July 2009).
Ma, Z., Cui, D. and Cheng, P. (2004) 'Dynamic network flow model for short-term air traffic flow management', IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, Vol. 34, No. 3, pp.351–358.
Ribeiro, R., Koerich, A.L. and Enembreck, F. (2006) 'Uma nova metodologia para avaliação de desempenho de algoritmos baseados em aprendizagem por reforço', XXXIII Seminário Integrado de Software e Hardware, Anais do XXVI Congresso da Sociedade Brasileira de Computação.
Russell, S. and Norvig, P. (2003) Artificial Intelligence: A Modern Approach, 2nd ed., Pearson Education, Inc., New Jersey.
Timoszczuk, A.P., Pizzo, W.N., Staniscia, G.F. and Siewerdt, E. (2009) 'The SYNCROMAX solution for air traffic flow management in Brazil', in Weigang, L., de Barros, A. and de Oliveira, I.R. (Eds.): Computational Models, Software Engineering, and Advanced Technologies in Air Transportation: Next Generation Applications, pp.23–37, IGI Global, Hershey.
Watkins, C.J.C.H. and Dayan, P. (1992) 'Q-learning', Machine Learning, Vol. 8, No. 3, pp.279–292.
Weigang, L., Alves, C.J.P. and Omar, N. (1997) 'An expert system for air traffic flow management', Journal of Advanced Transportation, Vol. 31, No. 3, pp.343–361, ISSN: 0197-6729.
Wolf, S.R. (2007) 'Supporting air traffic flow management with agents', American Association for Artificial Intelligence Spring Symposium: Interaction Challenges for Intelligent Assistants.
Zhang, Z., Gao, W. and Wang, L. (2005) 'Short-term flow management based on dynamic flow programming network', Journal of the Eastern Asia Society for Transportation Studies, Vol. 6, pp.640–647.
