
Future Generation Computer Systems 99 (2019) 500–507


A smart agriculture IoT system based on deep reinforcement learning



Fanyu Bu a,∗, Xin Wang b

a College of Computer and Information Management, Inner Mongolia University of Finance and Economics, Hohhot, China
b Center of Information and Network Technology, Inner Mongolia Agricultural University, Hohhot, China

∗ Corresponding author. E-mail address: bufanyu@imufe.edu.cn (F. Bu).

highlights

• We design a smart agriculture IoT system based on edge–cloud computing.


• We present several representative deep reinforcement learning models.
• We discuss the possible challenges and applications of deep reinforcement learning in smart agriculture.

Article info

Article history:
Received 15 March 2019
Received in revised form 1 April 2019
Accepted 17 April 2019
Available online 4 May 2019

Keywords:
Deep reinforcement learning
Smart agriculture IoT
Edge computing
Cloud computing

Abstract

Smart agriculture systems based on the Internet of Things are among the most promising ways to increase food production and reduce the consumption of resources such as fresh water. In this study, we present a smart agriculture IoT system based on deep reinforcement learning which consists of four layers, namely the agricultural data collection layer, the edge computing layer, the agricultural data transmission layer, and the cloud computing layer. The presented system integrates advanced information techniques, especially artificial intelligence and cloud computing, with agricultural production to increase food production. Specifically, a state-of-the-art artificial intelligence model, deep reinforcement learning, is deployed in the cloud layer to make immediate smart decisions, such as determining the amount of irrigation water needed to improve the crop growth environment. We present several representative deep reinforcement learning models together with their broad applications. Finally, we discuss the open challenges and the potential applications of deep reinforcement learning in smart agriculture IoT systems.

© 2019 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.future.2019.04.041

1. Introduction

Agriculture is one of the biggest concerns of all mankind, since most food is produced by agriculture. At present, many people still suffer from hunger due to a lack of food in some countries, especially in Africa. In particular, hunger caused the chronic undernourishment of more than 800 million people in the world in 2016. More notably, more than 10 million people die from hunger per year. Increasing food production is therefore an effective means of eradicating hunger and poverty. However, agriculture is far from modernized in most developing countries, leading to low food production.

Recently, smart agriculture has been proposed to promote the modernization of agriculture and thereby greatly increase food production [1,2]. Specifically, smart agriculture systems introduce many advanced computer and information technologies, such as the Internet of Things, artificial intelligence, and cloud computing, into agricultural production. The Internet of Things (IoT) and artificial intelligence are two core techniques for building smart agriculture systems. IoT is mainly used to automatically collect agricultural data and to transmit the collected data to the data centers [3,4], while artificial intelligence techniques such as artificial neural networks and clustering are used to analyze the agricultural data for smart decision-making. An example is deciding the amount of irrigation water needed by analyzing the collected agricultural environment data. In detail, once the need for irrigation is detected because the environment lacks water, an action is immediately taken to supply the appropriate amount of water. As a recently emerged artificial intelligence technique, deep reinforcement learning (DRL) was proposed for smart decision-making several years ago and has enjoyed great success in several domains, such as computer games, medical diagnosis and energy management, through the deep combination of deep learning models and reinforcement learning strategies [5,6]. Representative examples of deep reinforcement learning are the deep Q-network, multi-agent deep reinforcement learning, and the deep successor network. Therefore, DRL is a promising model for building smart agriculture systems. Cloud computing is another crucial computing technique for building smart agriculture systems, since it provides a powerful computing infrastructure for data analytics and deep learning models. Generally, large amounts of agricultural data are collected from agricultural production and must be processed and analyzed at high speed. Cloud computing has demonstrated its strong power in the area of big data computing [7]. For example, cloud computing has been successfully used to improve the efficiency of deep computation models and big data processing [8,9].
In this paper, we design a smart agriculture IoT system that is made up of four layers, from bottom to top: the agricultural data collection layer, the edge computing layer, the data transmission layer, and the cloud computing layer. As a crucial component, deep reinforcement learning is deployed in the cloud layer for making immediate smart decisions. In particular, we present several representative DRL models that can potentially be used in building smart agriculture systems. This work is expected to promote the development of smart agriculture and, furthermore, to contribute to increasing food production.
The paper has three contributions, listed below.

• We design a smart agriculture IoT system based on edge–cloud computing.
• We present several representative deep reinforcement learning models.
• We discuss the possible challenges and applications of deep reinforcement learning in smart agriculture.

In the following parts, the smart agriculture IoT system is


presented in Section 2 and representative deep reinforcement
learning models are described in Section 3. Section 4 discusses
the potential applications and challenges of deep reinforcement
learning models in building smart agriculture systems.
2. Smart agriculture IoT system

This section presents the smart agriculture IoT system. In particular, the framework of the smart agriculture IoT system is illustrated in Fig. 1.

Fig. 1. Architecture of smart agriculture IoT system.

In the smart agriculture IoT system, the bottom layer is aimed at agricultural data collection. In this layer, various sensors are deployed to collect the environmental parameters that are important for crop growth. Representative environmental parameters include the air temperature and humidity, carbon dioxide concentration, soil moisture and temperature, and light intensity. Specifically, these parameters are collected every five minutes.

Before the collected parameters are transmitted to the data center in the cloud for analytics, they are preprocessed in the edge computing layer. In particular, two preprocessing tasks are carried out in the edge layer. The objective of the first task is to monitor the state of each sensor, which is accomplished by detecting incomplete data or outliers in the collected parameters. The objective of the other task is to greatly reduce the amount of data, which is accomplished by compressing the collected parameters. For example, if the air temperature stays at 25 degrees for ten sampling periods of five minutes each, only one value, namely 25, is transmitted to the data center. In this way, bandwidth is saved by reducing the amount of data transferred.
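To make the edge-layer compression concrete, the following minimal Python sketch forwards a reading only when it differs from the last forwarded value by more than a deadband. The five-minute timestamps and the 0.5-degree threshold are illustrative assumptions rather than values prescribed by the system.

```python
from typing import List, Tuple

def compress_readings(readings: List[Tuple[int, float]],
                      deadband: float = 0.5) -> List[Tuple[int, float]]:
    """Keep a (timestamp, value) sample only when it differs from the last
    forwarded value by more than the deadband; otherwise drop it at the edge."""
    forwarded = []
    last = None
    for ts, value in readings:
        if last is None or abs(value - last) > deadband:
            forwarded.append((ts, value))
            last = value
    return forwarded

# Ten five-minute samples of a constant 25-degree air temperature collapse to one record.
samples = [(300 * i, 25.0) for i in range(10)]
print(compress_readings(samples))   # -> [(0, 25.0)]
```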
the optimal policy directly by parameterizing a policy. They first
The collected data is analyzed by various data mining algo-
estimate the gradients of the objective with regard to the pa-
rithms and artificial intelligence models in the cloud layer. In a rameters, and then use the gradient descent strategy to update
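To illustrate how such a decision task can be handed to a reinforcement learning agent, the sketch below frames irrigation control as a toy Markov Decision Process and solves it with tabular Q-learning. The moisture buckets, irrigation amounts, target level, toy dynamics and reward shaping are all assumptions chosen for illustration; the paper does not prescribe a specific formulation, and the cloud layer would use a deep model rather than a table.

```python
import random

MOISTURE_LEVELS = list(range(0, 101, 10))   # state: bucketed soil moisture (%)
ACTIONS = [0, 5, 10, 20]                    # action: irrigation amount (mm), hypothetical
TARGET = 60                                 # assumed ideal soil moisture for the crop

def transition(moisture, action):
    """Toy dynamics: irrigation raises moisture, evapotranspiration lowers it."""
    evaporation = random.choice([5, 10, 15])
    new = max(0, min(100, moisture + action - evaporation))
    return 10 * round(new / 10)             # keep the state on the grid

def reward(moisture, action):
    """Penalize deviation from the target moisture and the water consumed."""
    return -abs(moisture - TARGET) - 0.5 * action

Q = {(s, a): 0.0 for s in MOISTURE_LEVELS for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1
state = 50
for _ in range(20000):
    a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(state, x)])
    nxt = transition(state, a)
    r = reward(nxt, a)
    Q[(state, a)] += alpha * (r + gamma * max(Q[(nxt, b)] for b in ACTIONS) - Q[(state, a)])
    state = nxt

# Learned irrigation amount per moisture bucket.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in MOISTURE_LEVELS})
```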
3. Deep reinforcement learning

Deep reinforcement learning, a novel model that combines deep learning, such as stacked auto-encoders and convolutional neural networks, with reinforcement learning, such as Q-learning, was initially presented by DeepMind to optimize a policy for complex tasks [10]. In detail, deep learning aims to learn abstract feature representations for the high-dimensional raw state input from the environment, while reinforcement learning methods are used to find optimal policies that enable the agent to obtain the maximum cumulative reward [11–13]. In 2015, DeepMind, the artificial intelligence research group in Google, published its research results on deep reinforcement learning in Nature, implying that deep reinforcement learning had made an exciting advance [10]. Subsequently, DeepMind developed a Go program, AlphaGo, based on deep reinforcement learning. In the five-game Go match between AlphaGo and Lee Sedol, a South Korean professional Go player of 9-dan rank, AlphaGo defeated Lee Sedol 4 : 1, indicating that the deep reinforcement learning model has reached the human level in complex board games. Besides, deep reinforcement learning has also achieved excellent performance in computer vision, robot control, and language modeling.

One of the most commonly used reinforcement learning approaches is the policy gradient. In particular, policy gradient methods find the optimal policy directly by parameterizing a policy. They first estimate the gradients of the objective with respect to the parameters, and then use the gradient descent strategy to update the parameters toward the optimal policy. Different from value-based methods, which can only be used for deterministic policies, policy gradient methods can obtain the optimal policy for both deterministic policies and stochastic policies.
A parameterized policy π is defined as a probability distribution from the state space s ∈ S to the action space a ∈ A, i.e., a_t ∼ π(s, a; θ) at the time step t, where θ ∈ R^l for l ≪ |S| denotes a parameter vector for the policy π.
It is assumed that the policy π(s, a; θ) is differentiable with respect to the parameters θ. The objective can be defined as:

\[ J(\theta) = \int_{S} d^{\pi}(s) \int_{A} \pi(s,a;\theta)\, r(s,a)\, \mathrm{d}a\, \mathrm{d}s, \tag{1} \]

where d^π(s) = lim_{t→∞} P(s_t = s | s_0, π) denotes the state distribution under the policy π. Policy gradient methods obtain the optimal policy π_θ by using the gradient descent strategy to update the parameters θ until ∂J(θ)/∂θ → 0.
Since π(s, a; θ) cannot be represented explicitly, the gradient ∂J(θ)/∂θ can only be estimated approximately. An approximation of the gradient ∂J(θ)/∂θ was proposed for any Markov Decision Process as [14]:

\[ \frac{\partial J(\theta)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a), \tag{2} \]

where Q^π(s, a) denotes the return of the policy π(s, a; θ). Generally, the analytic expression for Q^π(s, a) cannot be obtained directly either, so several policy gradient methods were presented to estimate Q^π(s, a).
The earliest policy gradient algorithm, called REINFORCE [15], used the actual return, R_t = Σ_{k=1}^{∞} γ^{k−1} r(s_{t+k}, a_{t+k}), as the approximation of Q^π(s_t, a_t) at the time step t:

\[ \frac{\partial J(\theta)}{\partial \theta} \propto R_t\, \frac{\partial \pi(s_t,a_t)}{\partial \theta}\, \frac{1}{\pi(s_t,a_t)}. \tag{3} \]
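The following numpy sketch illustrates the REINFORCE estimator of Eq. (3) for a softmax policy over a few discrete actions; note that R_t (∂π/∂θ)(1/π) equals R_t ∇_θ log π, which is the form implemented below. The three-action, single-state toy problem, reward means and step size are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)                 # softmax policy parameters, one logit per action
true_means = np.array([0.2, 0.5, 0.8])      # assumed expected rewards of a toy one-state task
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(n_actions, p=probs)
    R = rng.normal(true_means[a], 0.1)      # sampled return plays the role of R_t
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                   # d log pi(a) / d theta for a softmax policy
    theta += alpha * R * grad_log_pi        # REINFORCE update: R_t * grad log pi

print(softmax(theta))                       # probability mass concentrates on the best action
```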
The infinite-horizon policy-gradient algorithm (GPOMDP) was proposed to obtain a biased approximation of the gradient of the average reward in partially observable Markov Decision Processes (POMDPs) [16]. Specifically, GPOMDP estimates the policy gradient as:

\[ \frac{\partial J(\theta)}{\partial \theta} \approx Q^{\pi}(s_t,a_t)\, \frac{\partial \pi(s,a)}{\partial \theta}\, \frac{1}{\pi(s,a)}, \tag{4} \]

where Q^π(s_t, a_t) is estimated by the return of the current state–action pair:

\[ Q^{\pi}(s_t,a_t) = \sum_{k=1}^{L-t} \gamma^{k-1}\, r(s_{t+k}, a_{t+k}). \tag{5} \]

GPOMDP updates the eligibility trace e_t and the gradient Δ_t as:

\[ e_{t+1} = \gamma e_t + \frac{\partial \pi(s_t,a_t)}{\partial \theta}\, \frac{1}{\pi(s_t,a_t)}, \tag{6} \]

\[ \Delta_{t+1} = \Delta_t + r(s_{t+1}, a_{t+1})\, e_{t+1}. \tag{7} \]

After L iterations, the gradient ∂J(θ)/∂θ is estimated as ∂J(θ)/∂θ = (1/L) Δ_L. Furthermore,

\[ \Delta_L = \sum_{t=1}^{L} r(s_t,a_t)\, e_t = \sum_{t=0}^{L} \Bigg[ \frac{\partial \pi(s_t,a_t)}{\partial \theta}\, \frac{1}{\pi(s_t,a_t)} \sum_{k=1}^{L-t} \gamma^{k-1}\, r(s_{t+k}, a_{t+k}) \Bigg]. \tag{8} \]

Although GPOMDP extended REINFORCE to infinite-horizon Markov Decision Processes, it still converges slowly.
Sutton et al. introduced function approximation into the policy gradient (PGFA) for reinforcement learning with continuous state spaces [14]. Assume that Q^π is approximated by a function f_w : S × A → R with the parameters w satisfying the following property:

\[ w = \arg\min_{w} \sum_{t} \big( \hat{Q}^{\pi}_t - f_w(s_t,a_t) \big)^2, \tag{9} \]

where Q̂^π_t denotes an estimation, such as R_t, of Q^π_t. If f satisfies Eq. (9) and the compatibility property simultaneously, i.e.,

\[ \frac{\partial f_w(s,a)}{\partial w} = \frac{\partial \pi(s,a)}{\partial \theta}\, \frac{1}{\pi(s,a)}, \tag{10} \]

the gradient can be obtained via:

\[ \frac{\partial J(\theta)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, f_w(s,a). \tag{11} \]

More recently, deterministic policy gradient algorithms have been proposed [17]. Deterministic policy gradient algorithms perform better than stochastic policy gradient algorithms in high-dimensional action spaces, so they have attracted much interest in recent years. Other policy gradient algorithms, including regularized policy gradients and batch policy gradients, have also been presented in the past few years for reinforcement learning.
Another idea of the policy gradient methods is to increase the occurrence probability of the good actions. Based on this idea, advantage functions can be used to evaluate an action in reinforcement learning. Specifically, the policy gradient g with the advantage function is constructed as:

\[ g = \sum_{t=0}^{T-1} \hat{A}_t\, \nabla_{\theta} \log \pi(s_t, a_t; \theta), \tag{12} \]

where Â_t denotes an estimator for the advantage function of the action–state pair (s_t, a_t) that is basically defined as:

\[ \hat{A}^{\gamma}_t = r_t + \gamma r_{t+1} + \cdots - V(s_t), \tag{13} \]

where V(s_t) denotes a baseline associated with the current trajectory. Â^γ_t > 0 will increase the probability that the action is selected.
Some deep reinforcement learning models based on the policy gradient have been presented in the past few years. Basic deep policy gradient methods use a deep learning model to parameterize a policy and then find the optimal policy by utilizing policy gradient methods to optimize the parameters [18]. These methods require N sample trajectories {τ_i}_{i=1}^{N} to update the policy gradient in each iteration. However, it is difficult to obtain a large number of training samples online in many complex tasks, which leads to locally optimal solutions. To tackle this problem, the actor–critic model has been introduced into the deep policy gradient method. The actor–critic model combines the value function and the policy gradient for challenging tasks in traditional reinforcement learning [19]. Specifically, it is composed of two independent components, i.e., an actor that is used to select an action, and a critic that estimates the value function and calculates the temporal difference to evaluate the selected action. The architecture of the actor–critic model is shown in Fig. 2.
In the actor–critic model, the agent selects an action depending on the actor's policy. The critic receives the immediate reward by interacting with the environment under the selected action, updates the value function and calculates the temporal-difference error accordingly. The actor updates the policy depending on the temporal-difference error to increase the occurrence probability of the good actions and to decrease the occurrence probability of the bad actions. Given the parameterized policy π(s; θ) with the parameters θ ∈ R^n and the value function V(s; ϕ) with the parameters ϕ ∈ R^n, the temporal-difference error from the state s_t to the state s_{t+1} is defined as:

\[ \delta_{TD,t} = r_{t+1} + \gamma V(s_{t+1}; \phi_t) - V(s_t; \phi_t). \tag{14} \]

Accordingly, the policy parameters and the value parameters are updated via:

\[ \theta_{t+1} = \theta_t + \alpha_{A,t}\, [u_t - \pi(s_t;\theta_t)]\, \delta_{TD,t}\, \frac{\partial \pi(s_t;\theta_t)}{\partial \theta}, \tag{15} \]
\[ \phi_{t+1} = \phi_t + \alpha_{C,t}\, \delta_{TD,t}\, \frac{\partial V(s_t;\phi_t)}{\partial \phi}. \tag{16} \]

Fig. 2. The architecture of the actor–critic model.
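A compact tabular sketch of the actor–critic loop around Eqs. (14)–(16): the critic learns V(s) from the temporal-difference error and the actor shifts its action preferences in the direction signalled by that error. The three-state chain environment and learning rates are assumptions, both functions are tables rather than the deep networks used in the paper's setting, and the actor step uses the common δ · ∇_θ log π form rather than the exact expression in Eq. (15).

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.95
prefs = np.zeros((n_states, n_actions))   # actor: action preferences (softmax policy)
V = np.zeros(n_states)                    # critic: state-value estimates
alpha_actor, alpha_critic = 0.05, 0.1

def step(s, a):
    """Toy chain: action 1 moves right and pays 1 at the last state; action 0 resets."""
    if a == 1 and s < n_states - 1:
        return s + 1, 0.0
    if a == 1 and s == n_states - 1:
        return 0, 1.0
    return 0, 0.0

s = 0
for _ in range(20000):
    probs = np.exp(prefs[s] - prefs[s].max()); probs /= probs.sum()
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)
    td_error = r + gamma * V[s_next] - V[s]            # Eq. (14)
    V[s] += alpha_critic * td_error                    # critic update, cf. Eq. (16)
    grad_log = -probs
    grad_log[a] += 1.0
    prefs[s] += alpha_actor * td_error * grad_log      # actor update, cf. Eq. (15)
    s = s_next

print(V, prefs.argmax(axis=1))                         # action 1 should dominate in every state
```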
Lillicrap et al. presented a deep deterministic policy gradient algorithm (DDPG) based on the actor–critic model for deep reinforcement learning in continuous action spaces [20]. DDPG uses a deep neural network with parameters θ^μ to represent the deterministic policy π(s; θ^μ) and a deep neural network with parameters θ^Q to represent the value function Q(s, a; θ^Q). Besides, DDPG defines the objective function as the expected discounted cumulative reward:

\[ J(\theta^{\mu}) = \mathbb{E}_{\theta^{\mu}}\big[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \big]. \tag{17} \]

Furthermore, DDPG uses the gradient descent strategy to optimize the objective function. The gradient of the objective function with respect to θ^μ equals the expected gradient of the value function with respect to θ^μ:

\[ \frac{\partial J(\theta^{\mu})}{\partial \theta^{\mu}} = \mathbb{E}_s\bigg[ \frac{\partial Q(s,a;\theta^{Q})}{\partial \theta^{\mu}} \bigg]. \tag{18} \]

Depending on the deterministic policy a = π(s; θ^μ),

\[ \frac{\partial J(\theta^{\mu})}{\partial \theta^{\mu}} = \mathbb{E}_s\bigg[ \frac{\partial Q(s,a;\theta^{Q})}{\partial a}\, \frac{\partial \pi(s;\theta^{\mu})}{\partial \theta^{\mu}} \bigg]. \tag{19} \]

The critic network is updated with the same method that the deep Q-network uses to update the value network:

\[ \frac{\partial L(\theta^{Q})}{\partial \theta^{Q}} = \mathbb{E}_{s,a,s' \sim D}\bigg[ \big( y - Q(s,a;\theta^{Q}) \big)\, \frac{\partial Q(s,a;\theta^{Q})}{\partial \theta^{Q}} \bigg], \tag{20} \]

where y = r + γ Q(s′, π(s′; θ̂^μ); θ̂^Q), with θ̂^μ and θ̂^Q denoting the parameters of the target policy network and the target value network, respectively. DDPG selects samples from the replay memory D and transfers the gradient of the Q value function with respect to the actions from the critic network to the actor network. The actor updates the parameters of the policy network according to Eq. (19).
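The sketch below performs one DDPG-style update in PyTorch following Eqs. (17)–(20): the critic regresses toward the bootstrapped target y = r + γQ(s′, π(s′; θ̂^μ); θ̂^Q) and the actor ascends the critic's value estimate. The network sizes, the random mini-batch standing in for replay-memory samples, and the soft target-update rate are illustrative assumptions, not details taken from [20].

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 4, 1, 0.99, 0.005

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
actor_targ = mlp(state_dim, action_dim); actor_targ.load_state_dict(actor.state_dict())
critic_targ = mlp(state_dim + action_dim, 1); critic_targ.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A fake mini-batch standing in for transitions sampled from the replay memory D.
s = torch.randn(32, state_dim); a = torch.randn(32, action_dim)
r = torch.randn(32, 1); s2 = torch.randn(32, state_dim)

# Critic update, cf. Eq. (20): regress Q(s, a) toward the bootstrapped target y.
with torch.no_grad():
    y = r + gamma * critic_targ(torch.cat([s2, actor_targ(s2)], dim=1))
critic_loss = ((critic(torch.cat([s, a], dim=1)) - y) ** 2).mean()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update, cf. Eqs. (18)-(19): follow the gradient of Q with respect to the policy.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Soft update of the target networks (a common DDPG implementation detail).
with torch.no_grad():
    for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
    for p, p_targ in zip(critic.parameters(), critic_targ.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```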
However, DDPG is only suitable for training deterministic policies. To tackle this problem, Heess et al. proposed a unified framework to learn optimal policies for continuous control tasks, called stochastic value gradients (SVG) [21]. Specifically, they extend the deterministic value gradient methods to the optimization of stochastic policies by using "re-parameterization", a mathematical tool used for generative models. Balduzzi and Ghifary presented a deep reinforcement learning method based on compatible function approximation [22]. In this method, they proposed a deviator actor–critic model consisting of three neural networks used to estimate the value function, the gradient, and the actor's policy, respectively.
To alleviate the instability that arises when combining traditional policy gradient methods and deep neural networks, deep policy gradient methods use experience replay to select training samples. However, experience replay requires large memory and computation during interaction with the environment, and it requires off-policy learning algorithms which can only update from training samples produced by an older policy. Aiming at these problems, Mnih et al. presented a lightweight framework for deep reinforcement learning based on asynchronous gradient descent [23]. This framework uses asynchronous gradient descent to optimize the parameters of deep learning model controllers and combines many reinforcement learning models. In particular, the asynchronous advantage actor–critic (A3C) performs best for control tasks in continuous action spaces.
In traditional deep reinforcement learning methods, the agent can only solve a single task after each round of training. However, in some complex settings, the agent should be able to accomplish multiple tasks simultaneously, that is, multi-task and transfer reinforcement learning [24]. Specifically, an agent that is trained to accomplish multiple tasks should have the ability to generalize its knowledge among the tasks, and further to transfer its knowledge to new tasks. Using multi-task and transfer learning, the agent can speed up learning effectively and therefore reduce training time.
Parisotto et al. proposed a deep multi-task and transfer policy gradient method based on actor-mimic (AM-DMTRL) [25]. AM-DMTRL first designs an actor-mimic method to train a single multi-task policy network which is able to accomplish multiple given source tasks S_1, S_2, . . . , S_N. The basic idea of actor-mimic is to force the student network to mimic the expert network when making decisions at each state. To achieve a single multi-task policy network, this algorithm constructs the deep Q-networks E_1, E_2, . . . , E_N for S_1, S_2, . . . , S_N, respectively, as the expert networks. Specifically, this algorithm transforms every expert deep Q-network into a policy network:

\[ \pi_{E_i}(a\,|\,s) = \frac{ e^{\tau^{-1} Q_{E_i}(s,a)} }{ \sum_{a' \in A_{E_i}} e^{\tau^{-1} Q_{E_i}(s,a')} }, \tag{21} \]

where τ denotes a temperature parameter and A_{E_i} denotes the action space of E_i. For each state s of the task S_i, the policy objective function is defined as the cross-entropy between π_{E_i}(a|s) and the multi-task policy π_{AMN}(a|s; θ) parameterized by θ:

\[ L^{i}_{policy}(\theta) = \sum_{a \in A_{E_i}} \pi_{E_i}(a\,|\,s)\, \log \pi_{AMN}(a\,|\,s;\theta). \tag{22} \]

To obtain further guidance from the expert networks, a feature regression network f_i(h_{AMN}(s)) is defined to estimate the feature values h_{E_i}(s) from h_{AMN}(s), where h_{E_i}(s) and h_{AMN}(s) denote the activations in the final hidden layer of the ith expert network and the multi-task network for the state s, respectively. Specifically, f_i is trained by minimizing the following objective function:

\[ L^{i}_{FR}(\theta, \theta_{f_i}) = \big( f_i(h_{AMN}(s;\theta); \theta_{f_i}) - h_{E_i}(s) \big)^2, \tag{23} \]

where θ and θ_{f_i} denote the parameters of the multi-task network and the ith feature regression network, respectively. Training this objective forces the multi-task network to compute features which can estimate the expert's features. After training with this objective, the feature information of the ith expert is incorporated into the multi-task network.
Furthermore, the objective function of actor-mimic is defined as the combination of the policy regression function and the feature regression function:

\[ L^{i}_{AM} = L^{i}_{policy}(\theta) + L^{i}_{FR}(\theta, \theta_{f_i}). \tag{24} \]
Finally, the multi-task network can be obtained using the gradient descent strategy to train this objective.
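A small numpy sketch of the actor-mimic policy objective: Eq. (21) turns an expert's Q-values into a Boltzmann policy and Eq. (22) measures the cross-entropy between that policy and the student's multi-task policy (written here with the conventional minus sign so that it is minimized). The Q-values, student logits and temperature are placeholder numbers; in AM-DMTRL the student distribution would come from a trainable network.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

tau = 1.0
q_expert = np.array([1.0, 2.5, 0.3, 1.7])        # Q_Ei(s, .) from one expert DQN (placeholder)
student_logits = np.array([0.2, 0.1, 0.0, 0.4])  # multi-task network output for the same state

pi_expert = softmax(q_expert / tau)              # Eq. (21): Boltzmann policy of the expert
pi_student = softmax(student_logits)

# Eq. (22): cross-entropy between the expert policy and the student policy.
policy_loss = -(pi_expert * np.log(pi_student)).sum()
print(pi_expert, policy_loss)
```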
Rusu et al. proposed a method of progressive neural networks for deep multi-task and transfer policy gradients, which is effective for continual learning [26]. Progressive neural networks achieve two goals simultaneously, i.e., avoiding catastrophic forgetting of previously acquired knowledge and transferring prior knowledge to new tasks, by means of lateral connections to learned features. Specifically, forgetting is avoided by instantiating a column for each task, and transfer is achieved by lateral connections to previously learned features. Fig. 3 shows a progressive neural network with three columns.

Fig. 3. Progressive neural network with three columns.

In the first column, a deep learning model with L layers, the ith layer having an activation output h_i^{(1)} ∈ R^{n_i} with n_i denoting the number of units, is trained for the first task. Specifically, θ^{(1)} is used to represent the parameters of this deep neural network. In order to train the second task, a new deep neural network with parameters θ^{(2)} is constructed in the second column with random initialization. During training of the second neural network, the parameters θ^{(1)} are fixed, and the ith layer h_i^{(2)} obtains input from h_{i−1}^{(2)} and h_{i−1}^{(1)} via lateral connections. This scheme generalizes to K tasks:

\[ h^{(k)}_i = f\bigg( W^{(k)}_i h^{(k)}_{i-1} + \sum_{j<k} U^{(k:j)}_i h^{(j)}_{i-1} \bigg), \tag{25} \]

where W_i^{(k)} denotes the weight matrix of the ith layer of the kth deep neural network, U_i^{(k:j)} denotes the lateral connection from the (i−1)th layer of the jth deep neural network to the ith layer of the kth deep neural network, and f is an activation function, f(x) = max(0, x).
When solving a sequence of tasks, progressive neural networks achieve transfer learning by connecting the previously learned features to the subsequent deep neural networks and extracting valuable features. Furthermore, when constructing a new deep neural network to solve a new task, the parameters of the other deep neural networks are preserved, which prevents catastrophic forgetting.
To apply progressive neural networks to reinforcement learning, each deep neural network is trained to solve a Markov Decision Process. Specifically, the lth deep neural network determines a policy π^{(l)}(s, a) which produces probabilities over actions for a state s:

\[ \pi^{(l)}(s,a) := h^{(l)}_L(s). \tag{26} \]

Furthermore, an action is selected from this distribution and yields the subsequent state.
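The following numpy sketch implements the forward pass of Eq. (25) for a progressive network: each new column processes the input through its own weights and additionally receives lateral connections from all previously trained (frozen) columns. Layer widths and random weights are placeholders chosen only to show the wiring.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0.0, x)
layer_sizes = [8, 16, 16, 4]                     # placeholder widths: input -> 2 hidden -> output
n_layers = len(layer_sizes) - 1

def new_column(columns):
    """Weights for a new column: W for its own path, U for laterals from earlier columns."""
    W = [rng.normal(0, 0.1, (layer_sizes[i + 1], layer_sizes[i])) for i in range(n_layers)]
    U = [[rng.normal(0, 0.1, (layer_sizes[i + 1], layer_sizes[i])) for _ in columns]
         for i in range(n_layers)]
    return {"W": W, "U": U}

def forward(columns, x):
    """h[k][i] is the activation of layer i in column k, computed as in Eq. (25)."""
    h = [[x] for _ in columns]
    for i in range(n_layers):
        for k, col in enumerate(columns):
            pre = col["W"][i] @ h[k][i]
            pre += sum(col["U"][i][j] @ h[j][i] for j in range(k))   # lateral connections
            h[k].append(relu(pre))
    return [h[k][-1] for k in range(len(columns))]

columns = []
columns.append(new_column(columns))   # column for task 1 (would be trained, then frozen)
columns.append(new_column(columns))   # column for task 2, with laterals from column 1
outputs = forward(columns, rng.normal(size=layer_sizes[0]))
print([o.shape for o in outputs])
```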
Another multi-task reinforcement learning method was presented based on a hierarchical Bayesian approach. Specifically, a hierarchical Bayesian mixture framework is used to model a Markov Decision Process. When encountering a new Markov Decision Process, the learned distribution is used as prior knowledge for model-based Bayesian learning to speed up adaptation to new environments. Li et al. proposed a multi-task reinforcement learning method for partially observable stochastic environments by designing a regionalized policy representation to describe the behavior of the agent in different tasks [27].
Recently, a deep multi-task and transfer reinforcement learning method, called policy distillation, was presented which determines the Q-value regression function based on the error between the student network and the expert network. Specifically, policy distillation transfers knowledge from an expert network T to a student network S. Given the training dataset D^T = {(s_i, q_i)}_{i=0}^{N}, where s_i and q_i denote an observation sequence and an unnormalized Q-value vector, respectively, generated by the expert network T, three methods can be used for policy distillation from T to S.
In the first method, only the action with the biggest Q-value, a_{i,best} = arg max(q_i), is transferred from T to S. In this case, the parameters of S are trained using a negative log-likelihood loss to estimate the same action:

\[ L(D^{T}, \theta_S) = - \sum_{i=1}^{|D|} \log P(a_i = a_{i,best} \mid x_i, \theta_S). \tag{27} \]

The second method uses the mean squared error loss to train the parameters of S by transferring all Q-values from T to S:

\[ L(D^{T}, \theta_S) = \sum_{i=1}^{|D|} \big\| q^{T}_i - q^{S}_i \big\|_2^2, \tag{28} \]

where q^T and q^S denote the Q-value vectors of the expert network and the student network, respectively. This method preserves all the action-values in the trained student network.
The last method utilizes the Kullback–Leibler divergence to transfer the Q-values from T to S:

\[ L(D^{T}, \theta_S) = \sum_{i=1}^{|D|} \operatorname{softmax}\!\Big(\frac{q^{T}_i}{\tau}\Big)\, \ln \frac{ \operatorname{softmax}(q^{T}_i/\tau) }{ \operatorname{softmax}(q^{S}_i) }, \tag{29} \]

where τ denotes a temperature parameter, and softmax is a function which transforms a D-dimensional vector into another D-dimensional vector in which each element lies in (0, 1) and all elements sum to 1. Increasing τ transfers more secondary knowledge from T to S.
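The sketch below computes the three distillation losses of Eqs. (27)–(29) for a single batch of placeholder teacher and student Q-values; the temperature and the numbers are assumptions. In practice the student Q-values would come from a trainable network and the chosen loss would be minimized by gradient descent.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

tau = 0.01                                                  # temperature for the teacher policy
q_teacher = np.array([[3.0, 1.0, 0.5], [0.2, 2.2, 1.9]])    # placeholder expert Q-values
q_student = np.array([[2.0, 1.5, 0.1], [0.3, 1.0, 2.5]])    # placeholder student Q-values

# Eq. (27): negative log-likelihood of the teacher's greedy action under the student policy.
best = q_teacher.argmax(axis=1)
pi_student = softmax(q_student)
nll_loss = -np.log(pi_student[np.arange(len(best)), best]).sum()

# Eq. (28): mean squared error between the full Q-vectors.
mse_loss = ((q_teacher - q_student) ** 2).sum()

# Eq. (29): KL divergence between the tempered teacher policy and the student policy.
p_teacher = softmax(q_teacher / tau)
kl_loss = (p_teacher * np.log(p_teacher / pi_student)).sum()

print(nll_loss, mse_loss, kl_loss)
```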
Experiments conducted in [28] demonstrated that the second method performed worst while the third method performed best.
Rusu et al. presented two policy distillation settings, i.e., single-task policy distillation and multi-task policy distillation, as shown in Figs. 4 and 5 [28], respectively.

Fig. 4. Single-task policy distillation.

Fig. 5. Multi-task policy distillation.

Multi-task policy distillation uses N separately trained single-task deep Q-network experts to generate inputs, task ids and targets, and stores them in separate memory buffers. The id is used to identify each expert network, which has a distinct output layer since different tasks usually have different action spaces. The student network learns from the N expert networks sequentially by selecting some samples from each replay memory every episode.
In some real dynamic settings, the tasks of reinforcement learning are notoriously complex and require multiple agents to accomplish them together. For example, in Atari 2600 games such as Pong where multiple agents participate at the same time, learning policies for only a single agent cannot achieve the best results. In this case, policy learning for multiple decision-making agents has to be solved, which involves the problems of cooperation, communication and competition between agents. While multi-agent systems enhance the learning ability to solve complex tasks, some novel challenges for multi-agent deep reinforcement learning arise. For example, the convergence and consistency of reinforcement learning methods cannot be guaranteed in multi-agent settings. Besides, each agent needs to track the other agents, since its action value relies on their actions.
Tampuu et al. generalized the deep Q-network to multi-agent environments to explore how multiple agents cooperate and compete in the video game Pong [29]. This work constructs a multi-agent deep reinforcement learning system by assigning an independent deep Q-network to each agent for distributed learning, which can reduce the learning difficulty and the computational complexity. Various kinds of collective behavior of multiple agents are investigated by adjusting the rewarding schemes, i.e., changing the reward an agent can receive while playing the game Pong.
In the first case, where the agent who wins a point obtains a positive reward of +1 and the agent who fails obtains a negative reward of −1, a fully competitive policy is obtained.
In the second setting, both agents are penalized with a reward of −1 whenever a point is won or lost, resulting in a fully cooperative policy.
When the winner obtains a reward r ∈ [−1, 1] and the loser is penalized with a reward of −1, increasing r results in a policy oriented toward competition, while reducing r leads to a more cooperative policy.
In the fully competitive setting, each agent receives an immediate reward of either +1 for winning or −1 for losing. In this case, each agent tries to learn to win a match as soon as possible, so the players become increasingly professional as training continues.
In the fully cooperative setting, both agents obtain an immediate punishment whenever a point is won or lost, motivating them to keep the game going as long as possible. In this case, the agents aim to keep the match going for a long time by learning suitable policies.
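These rewarding schemes can be summarized in one function: the scoring agent receives ρ, the conceding agent always receives −1, so ρ = 1 is fully competitive, ρ = −1 fully cooperative, and intermediate values interpolate between the two. This is only a restatement of the scheme from [29] as code, with variable names chosen here.

```python
from typing import Tuple

def pong_rewards(scoring_agent: int, rho: float) -> Tuple[float, float]:
    """Return (reward_agent_0, reward_agent_1) when `scoring_agent` wins a point.

    The conceding agent always gets -1; the scoring agent gets rho in [-1, 1].
    rho = 1 reproduces the fully competitive scheme, rho = -1 the fully cooperative one.
    """
    rewards = [-1.0, -1.0]
    rewards[scoring_agent] = rho
    return rewards[0], rewards[1]

print(pong_rewards(scoring_agent=0, rho=1.0))    # competitive: (1.0, -1.0)
print(pong_rewards(scoring_agent=0, rho=-1.0))   # cooperative: both penalized on every point
```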
Multi-agent deep reinforcement learning assumes that all the agents can fully observe the state as input; however, this assumption is not valid in some cases. In partially observable settings, each agent must communicate with the others to learn policies for maximum returns, which poses a challenge for deep reinforcement learning. The deep Q-network for a single agent solving a partially observable Markov Decision Process has been successfully investigated, namely the deep recurrent Q-network. The deep recurrent Q-network (DRQN) estimates Q(o, a) with a recurrent neural network, with the Q function represented by Q(o_t, h_{t−1}, a; θ_i), where o_t denotes the partial observation at time t, h_{t−1} denotes the hidden state of the long short-term memory network, a denotes the taken action and θ_i denotes the parameters of the network in the ith iteration. The recurrent neural network for estimating Q(o, a) is able to aggregate observations over time. Therefore, the outputs of DRQN include Q_t and h_t. The most straightforward method for the multi-agent environment is thus to combine the deep recurrent Q-network with independent Q-learning. However, this method performs poorly in most cases.
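A minimal PyTorch sketch of the DRQN idea: an LSTM cell aggregates partial observations over time and a linear head maps the hidden state to Q-values, so the network returns both Q_t and the updated hidden state h_t. The observation size, hidden size, and action count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Q(o_t, h_{t-1}, .): recurrent Q-network over partial observations."""
    def __init__(self, obs_dim: int = 8, hidden_dim: int = 32, n_actions: int = 4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        x = torch.relu(self.encoder(obs))
        h, c = self.lstm(x, hidden)
        return self.q_head(h), (h, c)    # outputs Q_t and the new hidden state h_t

net = DRQN()
batch = 1
hidden = (torch.zeros(batch, 32), torch.zeros(batch, 32))
for t in range(5):                        # feed a short sequence of partial observations
    obs = torch.randn(batch, 8)
    q_values, hidden = net(obs, hidden)
    action = q_values.argmax(dim=1)
print(q_values.shape, action)
```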
Therefore, Foerster et al. presented deep distributed recurrent Q-networks (DDRQN) to solve communication-based cooperative tasks for multiple agents by making three major changes to the above method [30].
First, at each time step, the previous action is fed into the input of the agent to enable the agent to estimate action–observation histories.
The second modification is inter-agent weight sharing. Each agent is still trained depending on its own observations. Meanwhile, the number of parameters required to be learned is reduced significantly via weight sharing, resulting in a great speedup of learning.
Finally, DDRQN no longer uses experience replay.
Based on these three changes, the Q-function of DDRQN has the form Q(o_t^m, h_{t−1}^m, m, a_{t−1}^m, a_t^m; θ_i), where m denotes the index of the mth agent, a_{t−1}^m denotes a portion of the history, and a_t^m denotes the action taken according to the estimator of the Q-network.
In order to enhance the ability to solve hierarchical tasks, the agent is required to be equipped with both sensing and memory functions. Therefore, deep reinforcement learning with memory networks has been studied, building on the fact that neural networks with external memory have made great progress. For example, Graves et al. presented a neural Turing machine that can be trained using gradient descent [31]. The neural Turing machine is able to achieve some simple memory and inference functions, including copying, sorting and associative recall. Afterwards, Sukhbaatar et al. presented a memory network that is trained end-to-end for question answering and language modeling [32].
More recently, Oh et al. presented a memory-based deep reinforcement learning architecture by adding a network with external memory to traditional deep reinforcement learning models [33]. Specifically, they constructed three memory-based deep reinforcement learning models, i.e., the memory Q-network (MQN), the recurrent memory Q-network (RMQN), and the feedback recurrent memory Q-network (FRMQN), as shown in Fig. 6.

Fig. 6. Different memory-based deep Q-network architectures.

Recent years have witnessed great advances in deep reinforcement learning. Deep reinforcement learning combines deep learning models, for learning abstract feature representations from high-dimensional raw state input, with reinforcement learning methods that enable the agent to learn an optimal policy for maximizing its cumulative reward. Deep reinforcement learning has been used to solve many challenging tasks.
One of the most representative applications of deep reinforcement learning is high-dimensional robot control. For example, Zhang et al. applied deep reinforcement learning with internal memory to complex robotic manipulators [34], while Finn et al. focused on torque control of high-dimensional robotic systems by combining deep reinforcement learning and inverse optimal control [35]. Another representative application of deep reinforcement learning is in computer vision, including video prediction in computer games and visual navigation [36]. In addition, Zhang et al. [37] applied deep reinforcement learning to energy management. Besides, deep reinforcement learning has shown superior performance in natural language processing [38]. For instance, Li et al. applied deep policy gradients to dialog generation by modeling reward sequences based on informativity, coherence and ease of answering [39]. Satija et al. [40] achieved a simultaneous machine translation system by integrating deep reinforcement learning with neural machine translation. Other application examples of deep reinforcement learning include parameter optimization and game theory with deep learning models and reinforcement learning strategies [41,42].

4. Conclusion

In this paper, we presented a smart agriculture system based on deep reinforcement learning. Specifically, deep reinforcement learning models are used to make smart decisions that adjust the environment to suit crop growth. We also presented recently developed representative deep reinforcement learning models and algorithms.
Although deep reinforcement learning has shown great progress in model design and training algorithms, it cannot yet achieve human-level performance in adapting to dynamic environments and solving complex tasks. In the future, efforts can be made to improve the performance of deep reinforcement learning in the following directions. The first direction is to design incremental models to speed up the training of deep reinforcement learning in dynamic environments for smart agriculture systems. Another is to integrate different memory units, such as long short-term memory and the neural Turing machine, into deep reinforcement learning to improve its performance for active reasoning and cognition. A further direction is to apply more effective transfer learning methods to deep reinforcement learning for settings that are short of training data. Cloud computing should also be used to improve the training efficiency of large-scale deep reinforcement learning for complex tasks. Finally, improving the versatility of deep reinforcement learning by combining multi-task learning and deep computation is also an important topic for smart agriculture in the future.
Acknowledgments

This paper is supported by the National Natural Science Foundation of China (Grant Nos. 61762068 and 61702309), the Inner Mongolia Talent Development Funded Project, China, the CERNET Innovation Project, China (No. NGII20161209), the Natural Science Foundation of Inner Mongolia Autonomous Region of China (No. 2017MS0610), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region, China (No. NJYT-18-A13), and the Inner Mongolia Key Laboratory of Economic Data Analysis and Mining, China.

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] N. Gondchawar, R.S. Kawitkar, IoT based smart agriculture, Int. J. Adv. Res. Comput. Commun. Eng. 5 (6) (2016) 838–842.
[2] M. Roopaei, P. Rad, K.R. Choo, Cloud of things in smart agriculture: Intelligent irrigation monitoring by thermal imaging, IEEE Cloud Comput. 4 (1) (2017) 10–15.
[3] Q. Zhang, L.T. Yang, Z. Chen, P. Li, F. Bu, An adaptive dropout deep computation model for industrial IoT big data learning with crowdsourcing to cloud computing, IEEE Trans. Ind. Inf. (2018) http://dx.doi.org/10.1109/TII.2018.2791424.
[4] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of things (IoT): A vision, architectural elements, and future directions, Future Gener. Comput. Syst. 29 (7) (2013) 1645–1660.
[5] H. Huang, M. Lin, Q. Zhang, Double-Q learning-based DVFS for multi-core real-time systems, in: Proceedings of IEEE International Conference on Green Computing and Communications, 2017.
[6] Z. Liu, C. Yao, H. Yu, T. Wu, Deep reinforcement learning with its application for lung cancer detection in medical internet of things, Future Gener. Comput. Syst. 97 (2019) 1–9.
[7] Q. Zhang, C. Bai, Z. Chen, P. Li, H. Yu, S. Wang, H. Gao, Deep learning models for diagnosing spleen and stomach diseases in smart Chinese medicine with cloud computing, Concurr. Comput.: Pract. Exper. (2019) http://dx.doi.org/10.1002/cpe.5252.
[8] Q. Zhang, L.T. Yang, Z. Chen, Privacy preserving deep computation model on cloud for big data feature learning, IEEE Trans. Comput. 65 (5) (2016) 1351–1362.
[9] Q. Zhang, H. Zhong, L.T. Yang, Z. Chen, F. Bu, PPHOCFS: Privacy preserving high-order CFS algorithm on the cloud for clustering multimedia data, ACM Trans. Multimedia Comput. Commun. Appl. 12 (4s) (2016) 66.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
[11] Q. Zhang, L.T. Yang, Z. Chen, P. Li, Incremental deep computation model for wireless big data feature learning, IEEE Trans. Big Data (2019) http://dx.doi.org/10.1109/TBDATA.2019.2903092.
[12] M.L. Littman, Reinforcement learning improves behaviour from evaluative feedback, Nature 521 (7553) (2015) 445–451.
[13] Q. Zhang, C. Bai, Z. Chen, P. Li, S. Wang, H. Gao, Smart Chinese medicine for hypertension treatment with a deep learning model, J. Netw. Comput. Appl. 129 (2019) 1–8.
[14] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Proceedings of Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[15] R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn. 8 (1992) 229–256.
[16] J. Baxter, P.L. Bartlett, Infinite-horizon policy-gradient estimation, J. Artificial Intelligence Res. 15 (2001) 319–350.
[17] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: Proceedings of International Conference on Machine Learning, 2014, pp. 387–395.
[18] S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies, J. Mach. Learn. Res. 17 (2016) 1–40.
[19] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, M. Lee, Incremental natural actor-critic algorithms, in: Proceedings of Advances in Neural Information Processing Systems, 2008, pp. 105–112.
[20] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous Control with Deep Reinforcement Learning, (2015), arXiv:1509.02971.
[21] N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, T. Erez, Learning continuous control policies by stochastic value gradients, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2944–2952.
[22] D. Balduzzi, M. Ghifary, Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies, (2015), arXiv:1509.03005.
[23] V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Harley, T.P. Lillicrap, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: Proceedings of International Conference on Machine Learning, 2016, pp. 1928–1937.
[24] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 819–827.
[25] E. Parisotto, J. Ba, R. Salakhutdinov, Actor-mimic deep multitask and transfer reinforcement learning, in: Proceedings of International Conference on Learning Representations, 2016, pp. 156–171.
[26] A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive Neural Networks, (2016), arXiv:1606.04671.
[27] H. Li, X. Liao, L. Carin, Multi-task reinforcement learning in partially observable stochastic environments, J. Mach. Learn. Res. 10 (2009) 1131–1186.
[28] A.A. Rusu, S.G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, R. Hadsell, Policy Distillation, (2015), arXiv:1511.06295.
[29] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, R. Vicente, Multiagent Cooperation and Competition with Deep Reinforcement Learning, (2015), arXiv:1511.08779.
[30] J.N. Foerster, Y.M. Assael, N. de Freitas, S. Whiteson, Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks, (2016), arXiv:1602.02672.
[31] A. Graves, G. Wayne, I. Danihelka, Neural Turing Machines, (2014), arXiv:1410.5401.
[32] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, End-to-end memory networks, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
[33] J. Oh, V. Chockalingam, S. Singh, H. Lee, Control of memory, active perception, and action in Minecraft, in: Proceedings of International Conference on Machine Learning, 2016, pp. 2790–2799.
[34] M. Zhang, Z. McCarthy, C. Finn, S. Levine, P. Abbeel, Learning deep neural network policies with continuous memory states, in: Proceedings of IEEE International Conference on Robotics and Automation, 2016, pp. 520–527.
[35] C. Finn, S. Levine, P. Abbeel, Guided cost learning: Deep inverse optimal control via policy optimization, in: Proceedings of International Conference on Machine Learning, 2016, pp. 49–58.
[36] Y. Rao, J. Lu, J. Zhou, Attention-aware deep reinforcement learning for video face recognition, in: Proceedings of IEEE International Conference on Computer Vision, 2017, pp. 3951–3960.
[37] Q. Zhang, M. Lin, L.T. Yang, Z. Chen, S.U. Khan, P. Li, A double deep Q-learning model for energy-efficient edge scheduling, IEEE Trans. Serv. Comput. (2018) http://dx.doi.org/10.1109/TSC.2018.2867482.
[38] A. Das, S. Kottur, J.M.F. Moura, S. Lee, D. Batra, Learning cooperative visual dialog agents with deep reinforcement learning, in: Proceedings of IEEE International Conference on Computer Vision, 2017, pp. 2970–2979.
[39] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, D. Jurafsky, Deep reinforcement learning for dialogue generation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1192–1202.
[40] H. Satija, M. McGill, J. Pineau, Simultaneous machine translation using deep reinforcement learning, in: Proceedings of the Workshops of International Conference on Machine Learning, 2016, pp. 110–119.
[41] Q. Zhang, L.T. Yang, Z. Chen, P. Li, Dependable deep computation model for feature learning on big data in cyber-physical systems, ACM Trans. Cyber-Phys. Syst. 3 (1) (2018) 11.
[42] Q. Zhang, L.T. Yang, Z. Yan, Z. Chen, P. Li, An efficient deep learning model to predict cloud workload for industry informatics, IEEE Trans. Ind. Inf. 14 (7) (2018) 3170–3178.

Fanyu Bu received the B.Sc. degree in computer science from Inner Mongolia Agricultural University, Hohhot, China, in 2003, the M.Sc. degree in computer application from Inner Mongolia University, Hohhot, China, in 2009, and the Ph.D. degree in computer application technology from Dalian University of Technology, Dalian, China, in 2018. He is currently an assistant professor in the Department of Computer and Information Management at Inner Mongolia University of Finance and Economics, China. His research interests include big data, smart agriculture and the Internet of Things.
