
Dispatching Algorithm Design for Elevator Group Control System with Q-Learning based on a Recurrent Neural Network


Weipeng Liu1, Ning Liu1, Hexu Sun2,3, Guansheng Xing1,2, Yan Dong1,2, Haiyong Chen1,2
1. School of Control Science and Engineering, Hebei University of Technology, Tianjin 300130
E-mail: liuweipeng@hebut.edu.cn
2. Education Ministry Engineering Research Center of Intelligent Rehabilitation Equipment and Detection Technology,
Tianjin 300130
E-mail: xinggs@gmail.com
3. Hebei University of Science and Technology, Shijiazhuang 050018
E-mail: hxsun@hebust.edu.cn

Abstract: A dispatching algorithm for the elevator group control system is proposed based on reinforcement learning. Elevator dispatching is modeled as a Markov Decision Process. Then an internally recurrent neural network based reinforcement learning method is designed to find the optimal dispatching policy while the state-action value function is iteratively approximated. Finally, several simulated experiments are carried out to compare the trained dispatching policy with traditional ones. The experimental results demonstrate the effectiveness of the proposed dispatching method.
Key Words: Elevator Group Control, Dispatching Algorithm, Reinforcement Learning, Neural Network

1 INTRODUCTION

The dispatching problem in elevator group supervisory control systems has been investigated extensively due to its high practical significance. Stochastic models such as the Markov decision process (MDP) are used to model the elevator group control problem [1][2]. Reinforcement learning, as an approximate method of dynamic programming for solving an MDP, has drawn increasing attention from researchers in the fields of artificial intelligence, control theory and operations research, for it has the ability to learn the optimal policy from interaction with the environment.

In the literature, some results have been attained when researchers use reinforcement learning methods to design supervisory control and optimization algorithms [3-7]. Q-learning based elevator group control algorithms are the most widely discussed. For example, reference [3] designs multiple agents with Q-learning ability to decide whether each elevator car should stop or not. When Q-learning is used to solve a large-scale, complex dynamic optimization problem, value function approximation is critical. A neural network is an effective solution for storing the mapping from state/action pairs to values and finding the optimal value function. Feedforward neural networks are widely used, such as the BP neural network in references [3][4][5] and the CMAC neural network in reference [6].

In this paper we focus on the elevator dispatching problem, where the objective is to allocate a car to serve a new hall call, rather than on elevator group control that makes a car run or stop. Firstly, we model the dispatching problem as an MDP and define the state set, action set and immediate reward. Then an internally recurrent neural network based reinforcement learning method is designed to find the optimal dispatching policy while the state-action value function is iteratively approximated. The recurrent neural network can increase the learning speed due to its memory of past input/output information. To balance exploration and exploitation, a Boltzmann distribution is used to select an action in the action space. The algorithm is trained on data derived from simulated experience in a virtual environment. Finally, under several different traffic flows, simulated experiments are done to compare the learned policy with other traditional dispatching methods. The experimental results demonstrate the effectiveness of the proposed dispatching method.

(This work is supported by Hebei Province's University Scientific Research Program under Grant z2012016 and the Tianjin Education Commission Scientific Research Program under Grant 20120833.)

2 PRELIMINARIES

2.1 Dispatching Problem of Elevator Group System

The elevator group control system can be considered as a discrete event dynamic system. An elevator system schematic diagram is shown in Fig. 1. Passengers arrive at the elevator system randomly. When a passenger arrives at a landing floor and gives a hall call, the group control system allocates the call to the most suitable elevator. Each elevator control system handles the functions of car running, stopping, door opening, etc., based on the call allocation message sent from the group controller. The allocation decision is made by optimizing a cost function. A number of costs, such as call time, passenger waiting and journey times, car load factor, energy consumption, transportation capacity, and number of starts, can be considered during the call allocation [8]. So the dispatching problem is usually modeled as a static or dynamic optimization problem. In this paper, we formulate the dispatching problem from the viewpoint of stochastic sequential decision making, especially Markov Decision Processes (MDPs).
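As a concrete illustration of this cost-based allocation step, the short Python sketch below scores each car with a toy cost function and assigns the new hall call to the cheapest car. The Car fields and the cost terms are illustrative assumptions for this sketch only, not the cost model proposed in this paper.

```python
# Minimal sketch of cost-based hall-call allocation (illustration only).
# The Car fields and cost terms are assumptions, not the paper's cost model.
from dataclasses import dataclass, field

@dataclass
class Car:
    position: int                                   # current floor
    direction: int                                  # +1 up, -1 down, 0 idle
    car_calls: set = field(default_factory=set)     # floors requested inside the car

def allocation_cost(car: Car, hall_floor: int, hall_dir: int) -> float:
    """Toy cost: distance to the call plus a penalty per pending car call."""
    # hall_dir is kept for completeness; a realistic cost would also use it.
    return abs(car.position - hall_floor) + 2.0 * len(car.car_calls)

def allocate(cars: list[Car], hall_floor: int, hall_dir: int) -> int:
    """Return the index of the car with the lowest allocation cost."""
    costs = [allocation_cost(c, hall_floor, hall_dir) for c in cars]
    return min(range(len(cars)), key=costs.__getitem__)

if __name__ == "__main__":
    fleet = [Car(1, 0), Car(9, -1, {3}), Car(16, 0)]
    print(allocate(fleet, hall_floor=7, hall_dir=+1))   # nearest, least-loaded car wins
```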

Fig. 1 Elevator system schematic diagram

2.2 MDP and RL

The Markov Decision Process (MDP) is a model for stochastic sequential decision problems. An MDP can be described as a tuple of six elements:

\{ S, \cup_{s \in S} U_s, \pi: S \to U_s, R: S \times U \to \mathbb{R}, P: S \times U \times S \to [0,1], V^\pi \}

where S is a finite set of states of the world, U_s is the finite set of actions available in state s, R: S \times U \to \mathbb{R} is the reward function, P: S \times U \times S \to [0,1] is the state-transition probability function, \pi is the decision policy, and V^\pi is the state-value function of policy \pi. In the framework of an infinite-horizon discounted MDP, when an agent observes that the system is in state s, it executes an action u governed by the policy \pi, receives an immediate reward with expected value R(s,u), and the system state transits to s' with probability P(s' \mid s, u). Given a policy \pi, the state-value function V^\pi(s) describes the long-run reward defined by

V^\pi(s) = E[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \pi ] = E[ \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \mid s_t = s, \pi ]    (1)

where \gamma \in [0,1) is the discount factor and E[\cdot] denotes expectation over random samples generated by following policy \pi. The agent's goal is to find a policy \pi^* that maximizes the expected discounted sum of future rewards. That is,

V^*(s) \triangleq \max_\pi V^\pi(s)    (2)

\pi^* is called the optimal policy and V^*(s) is the optimal value function.

Analogously, the action-value function Q^\pi(s,u) can be defined, which evaluates the value of taking action u from state s:

Q^\pi(s,u) = E[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, u_t = u, \pi ]    (3)

and the optimal action-value function is

Q^*(s,u) \triangleq \max_\pi Q^\pi(s,u)    (4)

The MDP can be solved by dynamic programming (DP) when equation (1) is rewritten in the form of the Bellman equation

V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V^\pi(s')    (5)

But dynamic programming suffers from "the curse of dimensionality" when the state space and action space are huge.

Reinforcement learning (RL) is an approximate form of dynamic programming that learns by interaction with an environment to accomplish a goal. The ultimate goal of an RL agent is to take a sequence of actions that maximizes the rewards received from the environment in the long run. There are many different reinforcement learning methods, such as TD(0), TD(\lambda), Q-learning, SARSA, etc. In the RL framework, the optimal value function V^*(s) or action-value function Q^*(s,u) is approximated iteratively in the course of taking an action, receiving a reward and updating the state value or state/action value. In this paper we use Q-learning to optimize the action-value function for finding an optimal dispatching policy.

3 PROBLEM MODEL

Suppose the elevator group system is composed of N cars running in a building with M floors. To formulate the dispatching problem in the framework of an MDP, the state space, action space, cost (reward) and value function must be defined.

3.1 State Set and Action Set

We define S = \{s_t\} as the discrete state set of the elevator group control system, including the direction and current position of each car, the car calls belonging to each car, the hall calls already existing, and the new hall call which has not yet been allocated. Each hall call has a direction and the number of the floor where the button is pressed.

Let U = \{u_i\} denote the action set. u_i refers to the dispatching algorithm allocating the new hall call to car i, 0 < i \le N.
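For illustration, the sketch below encodes such a group state as a fixed-length input vector, following the input layout described later in Section 4.1 (2N direction/position inputs, 2M-2 hall-call bits, M x N car-call bits, and two inputs for the new call). The position scaling and the NumPy representation are assumptions made only for this sketch.

```python
# Sketch of a state-vector encoding for the elevator group state of Sections 3.1/4.1:
# 2N car direction/position inputs, 2(M-1) hall-call bits, M*N car-call bits,
# and 2 inputs for the new, unallocated hall call.  Scaling choices are assumptions.
import numpy as np

M, N = 16, 4   # floors and cars, the values used in the paper's experiments

def encode_state(directions, positions, hall_up, hall_down, car_calls,
                 new_floor, new_dir):
    """Return a 1-D input vector of length 2N + 2(M-1) + M*N + 2 (= 104 for M=16, N=4)."""
    x = []
    for d, p in zip(directions, positions):              # 2N entries
        x += [float(d), p / M]                           # direction in {-1,0,+1}, scaled floor
    x += [float(hall_up[f]) for f in range(M - 1)]       # up calls (floors 1..M-1)
    x += [float(hall_down[f]) for f in range(M - 1)]     # down calls (floors 2..M)
    for i in range(N):                                   # M*N car-call bits
        x += [float(car_calls[i][f]) for f in range(M)]
    x += [new_floor / M, float(new_dir)]                 # the call to be allocated
    return np.asarray(x)

# The action set simply indexes the car that receives the new hall call:
actions = list(range(1, N + 1))   # u_i = "allocate the new call to car i"
```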



3.2 Cost and Action-value Function

The definition of the cost (reward) is closely related to the performance indices of the elevator group scheduling system. From the viewpoint of a sequential decision process, we take into account the waiting and journey times of the passengers already in the elevator group system when the last decision was made, and the number of future stops implied by the existing car calls.

1) Waiting time:

Let T_w(p) be the waiting time of passenger p who arrived before the last decision was made. The total waiting time of all passengers is defined by

R_{wt} = \sum_p T_w(p) = \sum_p (t - t_p)    (6)

where t_p is the arrival time of passenger p.

2) Journey time:

Let T_r(p') be the journey time of passenger p' already in a car before the last decision was made. The total journey time of all passengers in cars is defined by

R_{jny} = \sum_{p'} T_r(p') = \sum_{p'} (t - t_r)    (7)

where t_r denotes the time at which passenger p' entered the car.

3) Number of stops:

R_{stp} = \frac{1}{N} \sum_{i=1}^{N} C_i    (8)

where C_i denotes the number of stops of car i. Thus, the total cost is

R = R_{wt} + R_{jny} + R_{stp}    (9)

According to the cost defined above, we define the state-value function of an infinite-horizon discounted MDP and then obtain the action-value function used in Q-learning. The state-value function is

V^\pi(s) = E[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s_t = s, \pi ] = \sum_u \pi(s,u) \Big( R_s^u + \sum_{s'} \gamma P_{ss'}^u V^\pi(s') \Big)

where

P(s_{t+1} = s' \mid s_t = s, u_t = u) = P_{ss'}^u,    R(s, u, s') = R_{ss'}^u

Since R defined above is a cost, the optimal policy minimizes the value function:

V^*(s) = \min_\pi V^\pi(s) = \min_u \Big( R_s^u + \sum_{s'} \gamma P_{ss'}^u V^*(s') \Big)

Accordingly, the optimal action-value function is

Q^*(s,u) = R_s^u + \sum_{s'} \gamma P_{ss'}^u V^*(s')    (10)

where R_s^u is the expected cost when action u_t is taken with the system in state s_t, and P_{ss'}^u is the probability of the state transition from s_t to s_{t+1} after action u_t is taken. In this paper, R_s^u and P_{ss'}^u are unknown.

4 DISPATCHING ALGORITHM DESIGN

In this section we use Q-learning as the basic iteration algorithm to approximate the action-value function of the elevator group dispatching problem modeled as an MDP. In the algorithm implementation, two critical problems must be addressed. One is generalization of the Q-value function, which is addressed by applying an internally recurrent neural network; through network training, the iterative action-value updates can be stored in the neural network. The other is the trade-off between exploration and exploitation during action selection, which is solved by a stochastic exploration strategy based on the Gibbs distribution.

4.1 Value Iteration and Recurrent NN Generalization

To solve equation (10), the Q-learning method is used. Q-learning is one of the most popular RL algorithms. It does not require models of the expected cost R_s^u and the transition probability P_{ss'}^u. It approximates the optimal action-value function Q^*(s,u) by iteratively updating the action value according to

Q_{t+1}(s_t, u_t) = Q_t(s_t, u_t) + \alpha [ R_{t+1} + \gamma \min_{u \in U_{s_{t+1}}} Q_t(s_{t+1}, u) - Q_t(s_t, u_t) ]    (11)

where \alpha is the step-size parameter and \gamma is the discount factor.

From equation (11), we can see that Q-learning is off-policy, because the agent learns about a policy that differs from the policy it is following. That is, at time t the RL agent is in state s_t and takes the action u_t according to the policy \pi, and the agent's state transfers to s_{t+1}. The RL agent updates its action values by learning about the value of the greedy action, not the value of the executed action.
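Read as code, one application of update (11) looks like the following sketch. A simple dictionary table stands in for the neural approximator introduced next, and the backup uses min over actions because the rewards defined above are costs; the table storage is only for illustration.

```python
# One tabular Q-learning step implementing update (11) with a cost-minimizing target.
# The dictionary table is a stand-in for the neural approximator used in the paper.
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> estimated cost-to-go
ALPHA, GAMMA = 0.2, 0.9           # step size and discount, as set in Section 5.1

def q_update(s_t, u_t, cost, s_next, actions):
    """Q_{t+1}(s_t,u_t) = Q_t(s_t,u_t) + alpha*(R + gamma*min_u Q_t(s_next,u) - Q_t(s_t,u_t))."""
    best_next = min(Q[(s_next, u)] for u in actions)
    td_error = cost + GAMMA * best_next - Q[(s_t, u_t)]
    Q[(s_t, u_t)] += ALPHA * td_error
    return td_error

# Example call (states may be any hashable encoding):
# q_update(s_t="s17", u_t=2, cost=37.5, s_next="s18", actions=range(1, 5))
```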
Because the state space of the elevator group control system is huge, the Q-values and their updates are very difficult to store explicitly. A neural network is a popular solution for function approximation. To increase the iteration speed, we apply an internally recurrent neural network instead of a feedforward neural network. The network structure is shown in Fig. 2. This kind of network has three layers: an input layer, a hidden layer and an output layer. Different from a feedforward neural network, the outputs of the neurons in the hidden layer are fed back to the input layer.

We use N recurrent neural networks of this kind, where N is the number of cars in the elevator group system. Network i, i = 1, 2, ..., N, is associated with action u_i and stores the action-value function Q^\pi(s, u_i). All networks have the same inputs, which are the states of the elevator group system: 2N inputs indicate the cars' current directions and positions, 2M - 2 binary inputs indicate the states of the hall calls, M \times N binary inputs indicate the states of the car calls, and two inputs give the state of the newly arrived hall call. The output of network i is the Q-value, Q_i = \psi_i(s, u_i, W_i), where W_i denotes the weights of neural network i. Let N_I be the number of inputs of network i and N_H the number of neurons in its hidden layer. According to the structure of the internally recurrent neural network shown in Fig. 2, the number of units feeding the input layer, N_w, is N_I + N_H. All networks have only one output.

Fig. 2 The structure of the proposed recurrent neural network
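A minimal sketch of one such internally recurrent (Elman-style) Q-network is given below: the previous hidden activations are fed back and concatenated with the state input, and a single linear output neuron yields the Q-value. The NumPy implementation, the weight initialization and the hidden-layer size are assumptions for illustration, not the authors' code.

```python
# Sketch of one internally recurrent (Elman-style) Q-network as in Fig. 2:
# previous hidden outputs are fed back into the input layer, so the effective
# input size is N_I + N_H; a single linear output neuron gives Q(s, u_i).
# Initialization and hidden size are illustrative assumptions.
import numpy as np

class RecurrentQNet:
    def __init__(self, n_inputs, n_hidden, rng=np.random.default_rng(0)):
        n_w = n_inputs + n_hidden                       # input layer also receives the context
        self.W = rng.normal(0, 0.1, (n_hidden, n_w))    # input+context -> hidden weights {w_ij}
        self.theta = np.zeros(n_hidden)                 # hidden biases {theta_j}
        self.v = rng.normal(0, 0.1, n_hidden)           # hidden -> output weights {v_j}
        self.phi = 0.0                                  # output bias phi
        self.context = np.zeros(n_hidden)               # fed-back hidden activations

    def forward(self, state_vec):
        u = np.concatenate([state_vec, self.context])            # U^m in the text
        b = 1.0 / (1.0 + np.exp(-(self.W @ u + self.theta)))     # sigmoid hidden layer, eq. (12)
        q = float(self.v @ b + self.phi)                          # linear output neuron
        self.context = b                                          # remember for the next step
        return q, b, u

# One network per car; all share the same state input (Section 4.1).
# The hidden size 20 is an arbitrary choice for this sketch.
nets = [RecurrentQNet(n_inputs=104, n_hidden=20) for _ in range(4)]
```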



We adopt the back-propagation algorithm to update the weights of all N networks. The Q-learning iteration updates derived from equation (11) provide the training data for these networks: according to the output Q-value, the error is computed and the weights are updated. Since the N networks have the same structure, we use network i as an example to explain the weight-adjusting algorithm for this kind of recurrent neural network.

For the m-th weight update of the network, let U^m = [u_1^m, u_2^m, ..., u_{N_w}^m] be the input-layer vector, B^m = [b_1^m, b_2^m, ..., b_{N_H}^m] be the output of the hidden layer, and c^m be the current network output. The expected output, y^m = Q_{t+1}(s_t, u_i), is calculated by the Q-learning update equation (11) before the current network training starts. Let \{w_{ij}\} denote the weights from the input layer to the hidden layer, \{v_j\} the weights from the hidden layer to the output layer, \{\theta_j\} the biases of the hidden neurons, and \varphi the bias of the output neuron, where i = 1, 2, ..., N_w and j = 1, 2, ..., N_H. The activation function of the hidden layer is the nonlinear sigmoid function

f(x) = \frac{1}{1 + e^{-x}}    (12)

and the activation function of the output neuron is linear. Let

E^m = \frac{1}{2} (y^m - c^m)^2 = \frac{1}{2} (\delta^m)^2    (13)

Based on the back-propagation algorithm, we first update the value of v_j with the following law:

\Delta v_j = -\alpha \frac{\partial E^m}{\partial v_j} = -\alpha \frac{\partial E^m}{\partial c^m} \frac{\partial c^m}{\partial v_j} = \alpha \delta^m b_j^m    (14)

\frac{\partial E^m}{\partial w_{ij}} = \frac{\partial E^m}{\partial b_j^m} \frac{\partial b_j^m}{\partial w_{ij}} = -\delta^m v_j b_j^m (1 - b_j^m) u_i^m    (15)

So the update law for w_{ij} is

\Delta w_{ij} = -\beta \frac{\partial E^m}{\partial w_{ij}} = \beta \delta^m v_j b_j^m (1 - b_j^m) u_i^m    (16)

Analogously, we have the update laws for the biases \varphi and \theta_j:

\Delta \varphi = \alpha \delta^m    (17)

\Delta \theta_j = \beta \delta^m v_j b_j^m (1 - b_j^m)    (18)
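The update laws (12)-(18) translate almost directly into code. The sketch below applies them for one training pair of the recurrent Q-network sketched earlier, reusing its assumed fields W, v, theta, phi and its forward method; re-running the forward pass to obtain the stored activations is a simplification of this sketch, since the fed-back context has advanced in the meantime.

```python
# Manual weight updates implementing (12)-(18) for one training pair:
# delta = y - c, Dv_j = alpha*delta*b_j, Dw_ij = beta*delta*v_j*b_j*(1-b_j)*u_i,
# Dphi = alpha*delta, Dtheta_j = beta*delta*v_j*b_j*(1-b_j).
import numpy as np

def train_step(net, state_vec, target_q, alpha=0.2, beta=0.2):
    """One back-propagation update of the sketched RecurrentQNet toward the Q-learning target y^m."""
    # Simplification: re-run the forward pass to get c^m, B^m, U^m for this input.
    c, b, u = net.forward(state_vec)
    delta = target_q - c                           # delta^m = y^m - c^m, from (13)
    grad_hidden = delta * net.v * b * (1.0 - b)    # shared factor in (15), (16), (18)
    net.v += alpha * delta * b                     # (14)
    net.phi += alpha * delta                       # (17)
    net.W += beta * np.outer(grad_hidden, u)       # (16)
    net.theta += beta * grad_hidden                # (18)
    return 0.5 * delta ** 2                        # squared error E^m from (13)
```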
4.2 Stochastic Action Selection

One of the challenges in the implementation of reinforcement learning is the trade-off between exploration and exploitation. When making the dispatching decision, on one hand the agent has to exploit the policy it already knows to obtain the best rewards, and on the other hand it has to explore the action space, which may lead to better action selections in the future. In this paper, we apply the Gibbs distribution as the action selection method. It chooses elevator i as the dispatched elevator to serve the newly arrived call with probability

prob(u_t = u_i) = \frac{ e^{-Q(s_t, u_i)/T} }{ \sum_u e^{-Q(s_t, u)/T} }    (19)

where T > 0 is called the temperature and is gradually decreased.

4.3 RL based Dispatching Algorithm

The complete algorithm for elevator group scheduling based on reinforcement learning is as follows:

1: Initialize Q(s_0, u_0), s \in S, u \in U_s
2: Observe the state s_t and compute the reward R_t according to equations (6)-(9)
3: Forward-compute \min_{u \in U_s} Q(s_t, u) in the N neural networks using s_t as their input
4: Compute the training target Q_t = Q(s_{t-1}, u_{t-1}) + \Delta Q(s_{t-1}, u_{t-1}), where
   \Delta Q(s_{t-1}, u_{t-1}) = \alpha [ R_t + \gamma \min_{u \in U_s} Q(s_t, u) - Q(s_{t-1}, u_{t-1}) ]
5: Update the weights of the neural networks with the values Q_t, Q_{t-1}, s_{t-1} and u_{t-1} according to equations (12)-(18)
6: Choose the dispatching action u_t according to the Gibbs distribution in equation (19)
7: s_{t-1} \leftarrow s_t, u_{t-1} \leftarrow u_t
8: Return to Step 2 until the task is finished

The whole process diagram is shown in Fig. 3.

Fig. 3 Diagram of dispatching algorithm
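Putting the pieces together, the following sketch outlines the loop of Section 4.3 with Boltzmann (Gibbs) action selection per (19) and the delayed Q-update of steps 4-5. It reuses the RecurrentQNet and train_step sketches above; the env object with observe/reward/dispatch/done methods is a hypothetical stand-in for the simulation environment, not an interface defined in the paper.

```python
# Sketch of the Section 4.3 dispatching loop: Gibbs selection per (19) with
# cost-minimizing Q-values, delayed Q-update and network training.
# `env` is a hypothetical placeholder for the simulated elevator environment.
import numpy as np

def boltzmann_select(q_values, temperature, rng=np.random.default_rng()):
    """prob(u_i) = exp(-Q(s,u_i)/T) / sum_u exp(-Q(s,u)/T): lower cost, higher probability."""
    prefs = np.exp(-np.asarray(q_values) / temperature)
    return int(rng.choice(len(q_values), p=prefs / prefs.sum()))

def run_episode(env, nets, alpha=0.2, beta=0.2, gamma=0.9, t0=1000.0, decay=0.98):
    temperature = t0
    prev_state = prev_action = prev_q = None
    for k in range(10000):                                    # maximum iteration number (Section 5.1)
        state = env.observe()                                 # encoded state vector s_t (step 2)
        cost = env.reward()                                   # R_t, a cost, from (6)-(9)
        q_values = [net.forward(state)[0] for net in nets]    # step 3
        if prev_state is not None:                            # steps 4-5: delayed update and training
            dq = alpha * (cost + gamma * min(q_values) - prev_q)
            train_step(nets[prev_action], prev_state, prev_q + dq, alpha, beta)
        action = boltzmann_select(q_values, temperature)      # step 6, Gibbs rule (19)
        env.dispatch(action)                                  # allocate the new hall call to this car
        prev_state, prev_action, prev_q = state, action, q_values[action]   # step 7
        temperature *= decay                                  # T_K = d^K * T_0 with d = 0.98
        if env.done():
            break
```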



5 SIMULATED EXPERIMENT

5.1 Experiment Parameters Setup

In this section we carry out training and simulation tests of the algorithm described above. An elevator group system with four cars operating in an office building with 16 floors is investigated. The parameters of the elevators and the building are listed in the following tables.

Table 1 Elevator Parameters

Speed (m/s)                              1.75
Acceleration (m/s^2)                     1
Jerk (m/s^3)                             1
Capacity (persons)                       15
Door closing time (s)                    1
Average passenger transfer time (s)      1

Table 2 Building Parameters

Floor number                 16
Height of lobby (m)          4
Height of other floors (m)   3

The parameters of the dispatching algorithm are set as follows: the maximum iteration number is 10000, the learning rates are \alpha = 0.2 and \beta = 0.2, the discount factor is \gamma = 0.9, and the temperature T in the Gibbs distribution is annealed as T_K = d^K T_0 (d = 0.98, T_0 = 1000).
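For reference, the settings quoted above and the annealing schedule can be collected in a small helper; the helper itself is only an illustrative convenience, with the values taken from this subsection.

```python
# Hyperparameters quoted in Section 5.1 and the temperature schedule T_K = d^K * T_0.
PARAMS = {"max_iters": 10000, "alpha": 0.2, "beta": 0.2, "gamma": 0.9, "T0": 1000.0, "d": 0.98}

def temperature(k, t0=PARAMS["T0"], d=PARAMS["d"]):
    """Annealed Gibbs temperature after k iterations."""
    return t0 * d ** k

# temperature(0) -> 1000.0, temperature(100) -> about 132.6
```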
5.2 Simulation Results

The learning process and the simulated experiments are all carried out in a virtual environment [9], as shown in Fig. 4.

Fig. 4 Elevator group system simulation software

We use three different traffic flows for training the dispatching agent and for comparison between the proposed method and two classic ones. The data sets of these traffic flows are shown in Table 3. The three modes are pure up-peak traffic (TF1, Table 3(a)), pure down-peak traffic (TF2, Table 3(b)) and down-peak with light up traffic (TF3, Table 3(c)). After 20 training runs in each traffic mode, the dispatching policy is nearly stable; that is, the dispatching scheme produced by the algorithm no longer changes when it runs in situations it has already learned.

Table 3 (a) Traffic 1: Pure Up-Peak Traffic

Time      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Persons  18  11  15   8   8  34  46  40  34  46   9  10   6   6   9

Table 3 (b) Traffic 2: Pure Down-Peak Traffic

Time      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Persons  17  17  13  13  10  34  31  26  41  38  13  12  14   8  12

Table 3 (c) Traffic 3: Down-Peak with Light Up Traffic

Time             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Persons (Up)     0   1   2   2   0   2   2   0   2   0   1   2   1   0   0
Persons (Down)   7  10   7   9  12  10   8   8  10  18  11   6  12   2  15

Then we use two classic methods, Static Zoning (SZ) and a Genetic Algorithm based method [10] (GA), with the same traffic data to evaluate the performance of the reinforcement learning based dispatching method. The results of the comparison are shown in Table 4. Three common performance indices are used: average waiting time, average journey time and average crowding.

Table 4 Experimental Results

               AvgWait (s)   AvgJourney (s)   AvgCrowding (%)
TF1   RL          50.19          45.82             68.09
      GA          75.16          53.89             89.13
      SZ          31.78          38.26             68.87
TF2   RL          33.48          40.56             68.91
      GA          64.81          52.85             88.27
      SZ          25.56          39.95             66.20
TF3   RL          11.95          15.38             11.47
      GA           7.35          17.53             15.60
      SZ          20.20          21.81             26.53

According to the data in Table 4, the RL based method has better adaptability to different traffic flows than SZ and GA, although it is not better than SZ in TF1 and TF2 and is a little worse than GA in TF3. The RL method is not customized to a specific traffic flow mode. It is the nature of learning from experience that enables this dispatching algorithm to handle varying, unknown traffic and thus achieve good average performance.

6 CONCLUSIONS

Elevator group systems have the characteristics of a huge state space and random passenger arrivals, which make the dispatching problem nontrivial. In this paper we design the dispatching policy in a learning way. The Markov decision process and reinforcement learning are suitable models for this problem. Q-learning is used to find the optimal dispatching action while it approximates the action-value function of the MDP. Internally recurrent neural networks are introduced to store the value function; the recurrent neural network can increase the learning speed due to its memory of past input/output information. Several simulated experiments verify the effectiveness of the learning-style dispatching method under different modes of traffic flow. The method has better adaptability to varying traffic load than the other two.



REFERENCES

[1] D. L. Pepyne, C. G. Cassandras, Optimal dispatching control for elevator systems during uppeak traffic, IEEE Transactions on Control Systems Technology, Vol. 5, No. 6, 629-643, 1997.
[2] M. Brand, D. Nikovski, Optimal parking in group elevator control, Proceedings of the IEEE International Conference on Robotics and Automation, 1002-1008, 2004.
[3] R. H. Crites, A. G. Barto, Elevator group control using multiple reinforcement learning agents, Machine Learning, Vol. 33, No. 2, 235-262, 1998.
[4] Z. L. Zong, X. G. Wang, Z. Tang, G. Z. Zeng, Elevator group control algorithm based on residual gradient and Q-learning, SICE 2004 Annual Conference, Vol. 1, 329-331, 2004.
[5] Q. Zong, C. F. Song, G. S. Xing, A study of elevator dynamic scheduling policy based on reinforcement learning, Elevator World, Vol. 1, 58-64, 2006.
[6] Y. Gao, J. K. Hu, B. N. Wang, D. L. Wang, Elevator group control using reinforcement learning with CMAC, Acta Electronica Sinica, Vol. 35, No. 2, 362-365, 2007.
[7] F. L. Zeng, Q. Zong, Z. Y. Sun, L. Q. Dou, Self-adaptive multi-objective optimization method design based on agent reinforcement learning for elevator group control systems, Proceedings of the 8th World Congress on Intelligent Control and Automation, 2577-2582, 2010.
[8] A. Fujino, T. Tobita, K. Yoneda, An on-line tuning method for multi-objective control of elevator group, Proceedings of the International Conference on Industrial Electronics, Control, Instrumentation and Automation, Vol. 2, 795-800, 1992.
[9] Q. Zong, Y. Z. He, L. J. Wei, Modeling and research for agent-oriented elevator group control simulation system, Journal of System Simulation, Vol. 18, No. 5, 1391-1393, 2006.
[10] L. H. Xue, Fuzzy neural network based elevator group control method with genetic algorithm, Master Thesis, Tianjin University, 2002.

