Article info

Article history:
Received 15 March 2019
Received in revised form 1 April 2019
Accepted 17 April 2019
Available online 4 May 2019

Keywords:
Deep reinforcement learning
Smart agriculture IoT
Edge computing
Cloud computing

Abstract

Smart agriculture systems based on the Internet of Things are among the most promising ways to increase food production and reduce the consumption of resources such as fresh water. In this study, we present a smart agriculture IoT system based on deep reinforcement learning which includes four layers, namely the agricultural data collection layer, the edge computing layer, the agricultural data transmission layer, and the cloud computing layer. The presented system integrates advanced information techniques, especially artificial intelligence and cloud computing, with agricultural production to increase food production. Specifically, deep reinforcement learning, one of the most advanced artificial intelligence models, is embedded in the cloud layer to make immediate smart decisions, such as determining the amount of irrigation water needed to improve the crop growth environment. We present several representative deep reinforcement learning models together with their broad applications. Finally, we discuss the open challenges and the potential applications of deep reinforcement learning in smart agriculture IoT systems.
A parameterized policy π is defined as a probability distribution from the state space s ∈ S to the action space a ∈ A, i.e., a_t ∼ π(s, a; θ) at the time step t, where θ ∈ R^l with l ≪ |S| denotes a parameter vector for the policy π. It is assumed that the policy π(s, a; θ) is differentiable with respect to the parameters θ. The objective can be defined as:

$$J(\theta) = \int_{S}\int_{A} d^{\pi}(s)\,\pi(s, a; \theta)\, r(s, a)\,\mathrm{d}s\,\mathrm{d}a, \tag{1}$$

where d^π(s) = lim_{t→∞} P(s_t = s | s_0, π) denotes the state distribution under the policy π. Policy gradient methods obtain the optimal policy π_θ by using the gradient descent strategy to update the parameters θ so that ∂J(θ)/∂θ → 0.

Since π(s, a; θ) cannot be represented explicitly, the gradient ∂J(θ)/∂θ can only be estimated approximately. An approximation of the gradient ∂J(θ)/∂θ was proposed for any Markov Decision Process as [14]:

$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta}\, Q^{\pi}(s, a), \tag{2}$$

where Q^π(s, a) denotes the return of the policy π(s, a; θ). Generally, the analytic expression for Q^π(s, a) cannot be obtained directly either, so several policy gradient methods have been proposed to estimate Q^π(s, a).

The earliest policy gradient algorithm, called REINFORCE [15], used the actual return R_t = Σ_{k=1}^{∞} γ^{k−1} r(s_{t+k}, a_{t+k}) as the approximation of Q^π(s_t, a_t) at the time step t:

$$\frac{\partial J(\theta)}{\partial \theta} \propto \frac{\partial \pi(s_t, a_t)}{\partial \theta}\, R_t\, \frac{1}{\pi(s_t, a_t)}. \tag{3}$$
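As a concrete illustration of Eq. (3), the following is a minimal Python/NumPy sketch of a REINFORCE-style update, assuming a toy tabular softmax policy and a hypothetical stand-in environment (the `step` function, the state/action sizes, and the reward rule are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setting: 4 discrete states, 2 discrete actions,
# tabular softmax policy with one logit theta[s, a] per state-action pair.
N_STATES, N_ACTIONS, GAMMA, LR = 4, 2, 0.9, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))

def policy(s):
    """pi(s, .; theta): softmax over the logits of state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Stand-in environment: random next state, reward 1 for action 0."""
    return rng.integers(N_STATES), 1.0 if a == 0 else 0.0

for episode in range(500):
    # Roll out one trajectory under the current policy.
    s, traj = rng.integers(N_STATES), []
    for t in range(20):
        a = rng.choice(N_ACTIONS, p=policy(s))
        s2, r = step(s, a)
        traj.append((s, a, r))
        s = s2
    # Compute the discounted return R_t for every step (here including
    # r_t, the common convention) and ascend R_t * grad log pi, Eq. (3).
    R = 0.0
    for s_t, a_t, r_t in reversed(traj):
        R = r_t + GAMMA * R
        p = policy(s_t)
        grad_log = -p            # d log pi(a_t|s_t) / d theta[s_t, :]
        grad_log[a_t] += 1.0
        theta[s_t] += LR * R * grad_log
```

For a softmax policy, the factor (∂π(s_t, a_t)/∂θ)(1/π(s_t, a_t)) in Eq. (3) reduces to the score function ∇_θ log π(s_t, a_t; θ), which is what `grad_log` computes.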
The infinite-horizon policy-gradient algorithm (GPOMDP) was proposed to obtain a biased approximation of the gradient of the average reward in partially observable Markov Decision Processes (POMDPs) [16]. Specifically, GPOMDP estimates the policy gradient as:

$$\frac{\partial J(\theta)}{\partial \theta} \approx \frac{\partial \pi(s_t, a_t)}{\partial \theta}\, Q^{\pi}(s_t, a_t)\, \frac{1}{\pi(s_t, a_t)}, \tag{4}$$

where Q^π(s_t, a_t) is estimated by the return of the current state–action pair:

$$Q^{\pi}(s_t, a_t) = \sum_{k=1}^{L-t} \gamma^{k-1}\, r(s_{t+k}, a_{t+k}). \tag{5}$$

GPOMDP updates the eligibility trace e_t and the gradient Δ_t as:

$$e_{t+1} = \gamma e_t + \frac{\partial \pi(s_t, a_t)}{\partial \theta}\, \frac{1}{\pi(s_t, a_t)}, \tag{6}$$

$$\Delta_{t+1} = \Delta_t + r(s_{t+1}, a_{t+1})\, e_{t+1}. \tag{7}$$

After L iterations, the gradient ∂J(θ)/∂θ is estimated as ∂J(θ)/∂θ = (1/L) Δ_L. Furthermore,

$$\Delta_L = \sum_{t=1}^{L} r(s_t, a_t)\, e_t = \sum_{t=0}^{L} \left[ \frac{\partial \pi(s_t, a_t)}{\partial \theta}\, \frac{1}{\pi(s_t, a_t)} \sum_{k=1}^{L-t} \gamma^{k-1}\, r(s_{t+k}, a_{t+k}) \right]. \tag{8}$$

Although GPOMDP extended REINFORCE to the infinite-horizon Markov Decision Process, it still converges slowly.
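A minimal sketch of GPOMDP's online accumulators may make Eqs. (6)–(8) concrete; it reuses the same toy softmax policy and stand-in environment as above (all of which are hypothetical), and implements only the trace and gradient updates plus the final (1/L)Δ_L estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical toy setting as the REINFORCE sketch.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9
theta = np.zeros((N_STATES, N_ACTIONS))

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    return rng.integers(N_STATES), 1.0 if a == 0 else 0.0

# Online GPOMDP accumulators: eligibility trace e (Eq. (6)) and
# gradient accumulator delta (Eq. (7)), both shaped like theta.
e = np.zeros_like(theta)
delta = np.zeros_like(theta)
L = 1000

s = rng.integers(N_STATES)
for t in range(L):
    p = policy(s)
    a = rng.choice(N_ACTIONS, p=p)
    # grad log pi(a|s) for a softmax policy, i.e. the factor
    # (d pi / d theta) * (1 / pi) appearing in Eq. (6).
    grad_log = np.zeros_like(theta)
    grad_log[s] = -p
    grad_log[s, a] += 1.0
    e = GAMMA * e + grad_log          # Eq. (6)
    s, r = step(s, a)
    delta = delta + r * e             # Eq. (7): reward at t+1 times e_{t+1}

grad_estimate = delta / L             # (1/L) * Delta_L
```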
Sutton et al. introduced function approximation into the policy gradient (PGFA) for reinforcement learning with a continuous state space [14]. Assume that Q^π is approximated by a function f_w : S × A → R with the parameters w satisfying the following property:

$$w = \arg\min_{w} \sum_{t} \left( \hat{Q}^{\pi}_{t} - f(s_t, a_t) \right)^{2}, \tag{9}$$

where Q̂_t^π denotes an estimation, such as R_t, of Q_t^π. If f satisfies Eq. (9) and the compatibility property simultaneously, i.e.,

$$\frac{\partial f_w(s, a)}{\partial w} = \frac{\partial \pi(s, a)}{\partial \theta}\, \frac{1}{\pi(s, a)}, \tag{10}$$

the gradient can be obtained via:

$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta}\, f_w(s, a). \tag{11}$$
More recently, deterministic policy gradient algorithms have been proposed [17]. Deterministic policy gradient algorithms perform better than stochastic policy gradient algorithms in high-dimensional action spaces, so they have attracted much interest in recent years. Other policy gradient algorithms, including regularized policy gradients and batch policy gradients, have also been presented for reinforcement learning in the past few years.

Another idea behind policy gradient methods is to increase the occurrence probability of good actions. Based on this idea, advantage functions can be used to evaluate an action in reinforcement learning. Specifically, the policy gradient g with the advantage function is constructed as:

$$g = \sum_{t=0}^{T-1} \hat{A}_t\, \nabla_{\theta} \log \pi(s_t, a_t; \theta), \tag{12}$$

where Â_t denotes an estimator of the advantage function of the action–state pair (s_t, a_t) that is basically defined as:

$$\hat{A}^{\gamma}_{t} = r_t + \gamma r_{t+1} + \cdots - V(s_t), \tag{13}$$

where V(s_t) denotes a baseline associated with the current trajectory. Â_t^γ > 0 will increase the probability that the action is selected.

Some deep reinforcement learning models based on the policy gradient have been presented in the past few years. Basic deep policy gradient methods use a deep learning model to parameterize a policy and then find the optimal policy by utilizing policy gradient methods to optimize the parameters [18]. These methods require N sample trajectories {τ_i}_{i=1}^{N} to update the policy gradient in each iteration. However, it is difficult to obtain a large number of training samples online in many complex tasks, which leads to locally optimal solutions. To tackle this problem, the actor–critic model has been introduced into the deep policy gradient method. The actor–critic model combines the value function and the policy gradient for challenging tasks in traditional reinforcement learning [19]. Specifically, it is composed of two independent components: the actor, which is used to select an action, and the critic, which estimates the value function and calculates the temporal difference to evaluate the selected action. The architecture of the actor–critic model is shown in Fig. 2.

In the actor–critic model, the agent selects an action depending on the actor's policy. The critic receives the immediate reward by interacting with the environment under the selected action, updates the value function, and calculates the temporal-difference error accordingly. The actor updates the policy depending on the temporal-difference error to increase the occurrence probability of good actions and to decrease the occurrence probability of bad actions. Given the parameterized policy π(s; θ) with the parameters θ ∈ R^n and the value function V(s; ϕ) with the parameters ϕ ∈ R^n, the temporal-difference error from the state s_t to the state s_{t+1} is defined as:

$$\delta_{TD,t} = r_{t+1} + \gamma V(s_{t+1}; \phi_t) - V(s_t; \phi_t). \tag{14}$$

Accordingly, the policy parameters and the value parameters are updated via:

$$\theta_{t+1} = \theta_t + \alpha_{A,t}\, [u_t - \pi(s_t; \theta_t)]\, \delta_{TD,t}\, \frac{\partial \pi(s_t; \theta_t)}{\partial \theta}. \tag{15}$$
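The following is a minimal sketch of a tabular actor–critic loop in the spirit of Eqs. (14)–(15), again assuming the hypothetical toy environment used above. Note that the [u_t − π(s_t; θ_t)] factor of Eq. (15) is replaced here by the more common score-function form ∇_θ log π, and the one-step TD error δ_TD also serves as a one-step estimate of the advantage in Eq. (13); this is a sketch of the standard variant rather than a literal transcription:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy setting: tabular actor-critic with a softmax actor
# theta[s, a] and a critic value table V[s]; `step` is a stand-in.
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.9, 0.1, 0.2
theta = np.zeros((N_STATES, N_ACTIONS))
V = np.zeros(N_STATES)

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    return rng.integers(N_STATES), 1.0 if a == 0 else 0.0

s = rng.integers(N_STATES)
for t in range(5000):
    a = rng.choice(N_ACTIONS, p=policy(s))
    s2, r = step(s, a)
    # Eq. (14): one-step temporal-difference error.
    delta_td = r + GAMMA * V[s2] - V[s]
    # Critic update: move V(s_t) toward the TD target.
    V[s] += ALPHA_CRITIC * delta_td
    # Actor update in the spirit of Eq. (15): scale the score function
    # grad log pi(a_t|s_t) by the TD error.
    p = policy(s)
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += ALPHA_ACTOR * delta_td * grad_log
    s = s2
```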
where y = r + γ Q(s′, π(s′; θ̂^μ); θ̂^Q), with θ̂^μ and θ̂^Q denoting the parameters of the target policy network and the target value network, respectively. DDPG selects samples from the replay memory D, and transfers the gradient of the Q value function with respect to the actions from the critic network to the actor network. The actor updates the parameters of the policy network according to Eq. (20).
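To make the role of the target networks concrete, here is a minimal sketch of computing the DDPG regression target y for a replay minibatch; the linear stand-ins for the target policy and value networks, the dimensions, and the replay buffer contents are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimal sketch of the DDPG target value y = r + gamma * Q'(s', mu'(s')).
STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99
W_mu_target = rng.normal(size=(ACTION_DIM, STATE_DIM))   # target policy theta-hat^mu
W_q_target = rng.normal(size=(STATE_DIM + ACTION_DIM,))  # target critic theta-hat^Q

def mu_target(s):
    """Target policy network pi(s'; theta-hat^mu): here a linear map."""
    return W_mu_target @ s

def q_target(s, a):
    """Target value network Q(s', a; theta-hat^Q): here a linear map."""
    return W_q_target @ np.concatenate([s, a])

# A minibatch of (s, a, r, s') transitions sampled from the replay memory D.
batch = [(rng.normal(size=STATE_DIM), rng.normal(size=ACTION_DIM),
          rng.normal(), rng.normal(size=STATE_DIM)) for _ in range(32)]

# y serves as the regression target for the critic; the actor is then
# updated with the critic's gradient with respect to the action.
ys = [r + GAMMA * q_target(s2, mu_target(s2)) for s, a, r, s2 in batch]
```

Keeping y a function of slowly changing target parameters, rather than the live networks, is what stabilizes the critic's regression.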
However, DDPG is only suitable for training deterministic policies. To tackle this problem, Heess et al. proposed a unified framework to learn optimal policies for continuous control tasks, called stochastic value gradients (SVG) [21]. Specifically, they extend the deterministic value gradient methods to the optimization of stochastic policies by using "re-parameterization", a mathematical tool used for generative models. Balduzzi and Ghifary presented a deep reinforcement learning method based on compatible function approximation [22]. In this method, they proposed a deviator actor–critic model consisting of three neural networks used to estimate the value function, the gradient, and the actor's policy, respectively.

To get further guidance from the expert networks, a feature regression network f_i(h^{AMN}(s)) is defined to estimate the feature values h^{E_i}(s) from h^{AMN}(s), where h^{AMN}(s) and h^{E_i}(s) denote the activations in the final hidden layer of the multi-task network and the ith expert network for the state s, respectively. Specifically, f_i is trained by minimizing the following objective function:

$$\mathcal{L}^{i}_{FR}(\theta, \theta_{f_i}) = \left( f_i\!\left(h^{AMN}(s; \theta); \theta_{f_i}\right) - h^{E_i}(s) \right)^{2}, \tag{23}$$

where θ and θ_{f_i} denote the parameters of the multi-task network and the feature regression network, respectively. Training with this objective function forces the multi-task network to compute features from which the expert's features can be estimated. After training with this objective function, the feature information of the ith expert is incorporated into the multi-task network.

Furthermore, the objective function of actor-mimic is defined as the combination of the policy regression function and the feature regression function:

$$\mathcal{L}^{i}_{AM} = \mathcal{L}^{i}_{policy}(\theta) + \mathcal{L}^{i}_{FR}(\theta, \theta_{f_i}). \tag{24}$$
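A minimal sketch of the combined actor-mimic objective of Eqs. (23)–(24) for a single state follows; the stand-in activations, the linear feature-regression network, and all dimensions are hypothetical, and cross-entropy to the expert's policy is used as a typical choice for the policy regression term:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dimensions for the multi-task (AMN) and expert features.
HIDDEN_AMN, HIDDEN_EXP, N_ACTIONS = 8, 6, 4
h_amn = rng.normal(size=HIDDEN_AMN)                # h^AMN(s; theta)
h_exp = rng.normal(size=HIDDEN_EXP)                # h^Ei(s), ith expert
W_fi = rng.normal(size=(HIDDEN_EXP, HIDDEN_AMN))   # theta_fi, linear f_i

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Feature regression loss, Eq. (23): predict expert features from AMN features.
loss_fr = np.sum((W_fi @ h_amn - h_exp) ** 2)

# Policy regression loss: cross-entropy of the AMN policy against the
# expert's policy (a common choice for L^i_policy in actor-mimic).
pi_expert = softmax(rng.normal(size=N_ACTIONS))
pi_amn = softmax(rng.normal(size=N_ACTIONS))
loss_policy = -np.sum(pi_expert * np.log(pi_amn))

# Combined actor-mimic objective, Eq. (24).
loss_am = loss_policy + loss_fr
```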
Fig. 4. Single-task policy distillation.

Fig. 5. Multi-task policy distillation.

to accomplish together. For example, in Atari 2600 games where multiple agents participate at the same time, learning policies for a single agent only cannot achieve the best results. In this case, policy learning for multiple decision-making agents should be addressed, which involves the problems of cooperation, communication, and competition between agents. While multi-agent systems enhance the learning ability to solve complex tasks, they also raise novel challenges for multi-agent deep reinforcement learning. For example, the convergence and consistency of reinforcement learning methods cannot be guaranteed in multi-agent settings. Besides, each agent needs to track the other agents, since its action value depends on their actions.

Tampuu et al. generalized the deep Q-network to multi-agent environments to explore how multiple agents cooperate and compete in the video game Pong [29]. This work constructs a multi-agent deep reinforcement learning system by assigning an independent deep Q-network to each agent for distributed learning, which reduces the learning difficulty and the computational complexity. Various kinds of collective behavior of multiple agents are investigated by adjusting the rewarding schemes, i.e., changing the reward an agent can receive while playing Pong.

In the first case, where the winning agent obtains a positive reward of +1 and the losing agent obtains a negative reward of −1, a fully competitive policy is obtained.

In the second setting, both agents are penalized with a reward of −1 whenever either wins or loses, resulting in a fully cooperative policy.

When the winner obtains a reward r ∈ [−1, 1] and the loser is penalized with a reward of −1, increasing r results in a policy towards competition, while reducing r leads to a more cooperative policy.

In the fully competitive policy, each agent receives an immediate reward of either +1 for winning or −1 for losing. In this case, each agent tries to learn to win a match as soon as possible, so the players become increasingly professional as training continues.

In the fully cooperative policy, both agents obtain an immediate punishment whenever they win or lose, motivating them to keep the game going as long as possible. In this case, the agents aim to keep the match running for a long time by learning policies.
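The three rewarding schemes can be summarized in a small sketch; the function and its `rho` parameter are illustrative, not from [29]:

```python
# Minimal sketch of the Pong rewarding schemes described above:
# a single parameter rho sets the winner's reward while the loser
# is always penalized with -1.
def match_rewards(rho):
    """Return (reward_winner, reward_loser) for one finished match.

    rho = +1.0 reproduces the fully competitive scheme (+1 / -1),
    rho = -1.0 the fully cooperative scheme (-1 / -1), and values
    in between interpolate, as in the r in [-1, 1] scheme.
    """
    return rho, -1.0

print(match_rewards(1.0))    # fully competitive: (1.0, -1.0)
print(match_rewards(-1.0))   # fully cooperative: (-1.0, -1.0)
print(match_rewards(0.25))   # intermediate: (0.25, -1.0)
```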
Multi-agent deep reinforcement learning as described above assumes that all the agents can fully observe the state as input; however, this assumption does not hold in some cases. In partially observable settings, each agent must communicate with the others to learn policies that maximize returns, which poses a challenge for deep reinforcement learning. A deep Q-network for a single agent solving a partially observable Markov Decision Process has been successfully investigated, namely the deep recurrent Q-network. The deep recurrent Q-network (DRQN) estimates Q(o, a) with a recurrent neural network, with the Q function represented by Q(o_t, h_{t−1}, a; θ_i), where o_t denotes the partial observation at the time t, h_{t−1} denotes the hidden state of the long short-term memory network, a denotes the taken action, and θ_i denotes the parameters of the network in the ith iteration. The recurrent neural network for estimating Q(o, a) is able to aggregate observations over time. Therefore, the outputs of DRQN include Q_t and h_t. The most straightforward method for the multi-agent environment is thus to combine the deep recurrent Q-network with independent Q-learning. However, this method performs poorly in most cases.

Therefore, Foerster et al. presented deep distributed recurrent Q-networks (DDRQN) to solve communication-based cooperative tasks for multiple agents by making three major changes to the above method [30].

First, at each time step, the previous action is fed into the input of the agent to enable the agent to estimate action–observation histories.

The second modification is inter-agent weight sharing. Each agent is still trained depending on its own observations. Meanwhile, the number of parameters required to be learned is reduced significantly via weight sharing, resulting in a great speedup for learning.

Finally, DDRQN no longer uses experience replay.

Based on these three changes, the Q-function of DDRQN has the form Q(o_t^m, h_{t−1}^m, m, a_{t−1}^m, a_t^m; θ_i), where m denotes the index of the mth agent, a_{t−1}^m denotes a portion of the history, and a_t^m denotes the action taken according to the estimator of the Q-network.
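A minimal sketch of a DRQN-style estimator may clarify the recurrent aggregation of partial observations; a plain tanh recurrent cell stands in for DRQN's LSTM, and all weight shapes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Minimal sketch of a DRQN-style estimator Q(o_t, h_{t-1}, a; theta).
OBS_DIM, HIDDEN_DIM, N_ACTIONS = 5, 8, 3
W_o = rng.normal(scale=0.1, size=(HIDDEN_DIM, OBS_DIM))    # observation -> hidden
W_h = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM)) # hidden -> hidden
W_q = rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN_DIM))  # hidden -> Q values

def drqn_step(o_t, h_prev):
    """One recurrent step: fold o_t into the hidden state and return
    (Q_t for all actions, h_t), mirroring the two outputs of DRQN."""
    h_t = np.tanh(W_o @ o_t + W_h @ h_prev)
    return W_q @ h_t, h_t

# Roll the network over a sequence of partial observations so the
# hidden state accumulates the action-observation history.
h = np.zeros(HIDDEN_DIM)
for t in range(10):
    o = rng.normal(size=OBS_DIM)
    q_values, h = drqn_step(o, h)
    a = int(np.argmax(q_values))  # greedy action from Q(o_t, h_{t-1}, .)
```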
In order to enhance the ability to solve hierarchical tasks, the agent is required to be equipped with both a sensing function and a memory function. Therefore, deep reinforcement learning with memory networks has been studied, building on the fact that neural networks with external memory have made great progress. For example, Graves et al. presented a neural Turing machine that can be trained using gradient descent [31]. The neural Turing machine is able to achieve some simple memory and inference functions, including copying, sorting, and associative recall. Afterwards, Sukhbaatar et al. presented a memory network that is trained end-to-end for question answering and language modeling [32].

More recently, Oh et al. presented a memory-based deep reinforcement learning architecture by adding a network with external memory to the traditional deep reinforcement learning models [33]. Specifically, they constructed three memory-based deep reinforcement learning models, i.e., the memory Q-network (MQN), the recurrent memory Q-network (RMQN), and the feedback recurrent memory Q-network (FRMQN), as shown in Fig. 6.
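The following is a minimal sketch of the content-based memory read that such memory-augmented models build on (a soft attention over stored observation slots); the dimensions and weights are hypothetical rather than the exact MQN/RMQN/FRMQN architecture:

```python
import numpy as np

rng = np.random.default_rng(6)

# Minimal sketch of an attention-based read over an external memory:
# recent encoded observations are stored as key/value slots, and a
# query derived from the current context retrieves a soft combination.
N_SLOTS, KEY_DIM, VAL_DIM = 10, 4, 6
keys = rng.normal(size=(N_SLOTS, KEY_DIM))     # memory keys (past frames)
values = rng.normal(size=(N_SLOTS, VAL_DIM))   # memory values (past frames)
query = rng.normal(size=KEY_DIM)               # query from current context

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Content-based addressing: attention weights over the memory slots,
# then a weighted sum of values as the retrieved memory output.
attention = softmax(keys @ query)
memory_out = attention @ values   # fed to the Q-value head downstream
```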
Recent years have witnessed great advances in deep reinforcement learning. Deep reinforcement learning combines deep learning models, which learn abstract feature representations from high-dimensional raw state input, with reinforcement learning methods, which enable the agent to learn an optimal policy that maximizes its cumulative reward. Deep reinforcement learning has been used to solve many challenging tasks. One most
Conflict of interest
References

[22] D. Balduzzi, M. Ghifary, Compatible value gradients for reinforcement learning of continuous deep policies, (2015), arXiv:1509.03005.
[23] V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Harley, T.P. Lillicrap, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: Proceedings of International Conference on Machine Learning, 2016, pp. 1928–1937.
[24] D. Calandriello, A. Lazaric, M. Restelli, Sparse multi-task reinforcement learning, in: Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 819–827.
[25] E. Parisotto, J. Ba, R. Salakhutdinov, Actor-mimic: Deep multitask and transfer reinforcement learning, in: Proceedings of International Conference on Learning Representations, 2016, pp. 156–171.
[26] A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, (2016), arXiv:1606.04671.
[27] H. Li, X. Liao, L. Carin, Multi-task reinforcement learning in partially observable stochastic environments, J. Mach. Learn. Res. 10 (2009) 1131–1186.
[28] A.A. Rusu, S.G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, R. Hadsell, Policy distillation, (2015), arXiv:1511.06295.
[29] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, R. Vicente, Multiagent cooperation and competition with deep reinforcement learning, (2015), arXiv:1511.08779.
[30] J.N. Foerster, Y.M. Assael, N. de Freitas, S. Whiteson, Learning to communicate to solve riddles with deep distributed recurrent Q-networks, (2016), arXiv:1602.02672.
[31] A. Graves, G. Wayne, I. Danihelka, Neural Turing machines, (2014), arXiv:1410.5401.
[32] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, End-to-end memory networks, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
[33] J. Oh, V. Chockalingam, S. Singh, H. Lee, Control of memory, active perception, and action in Minecraft, in: Proceedings of International Conference on Machine Learning, 2016, pp. 2790–2799.
[34] M. Zhang, Z. McCarthy, C. Finn, S. Levine, P. Abbeel, Learning deep neural network policies with continuous memory states, in: Proceedings of IEEE International Conference on Robotics and Automation, 2016, pp. 520–527.
[35] C. Finn, S. Levine, P. Abbeel, Guided cost learning: Deep inverse optimal control via policy optimization, in: Proceedings of International Conference on Machine Learning, 2016, pp. 49–58.
[36] Y. Rao, J. Lu, J. Zhou, Attention-aware deep reinforcement learning for video face recognition, in: Proceedings of IEEE International Conference on Computer Vision, 2017, pp. 3951–3960.
[37] Q. Zhang, M. Lin, L.T. Yang, Z. Chen, S.U. Khan, P. Li, A double deep Q-learning model for energy-efficient edge scheduling, IEEE Trans. Serv. Comput. (2018) http://dx.doi.org/10.1109/TSC.2018.2867482.
[38] A. Das, S. Kottur, J.M.F. Moura, S. Lee, D. Batra, Learning cooperative visual dialog agents with deep reinforcement learning, in: Proceedings of IEEE International Conference on Computer Vision, 2017, pp. 2970–2979.
[39] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, D. Jurafsky, Deep reinforcement learning for dialogue generation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1192–1202.
[40] H. Satija, M. McGill, J. Pineau, Simultaneous machine translation using deep reinforcement learning, in: Proceedings of the Workshops of International Conference on Machine Learning, 2016, pp. 110–119.
[41] Q. Zhang, L.T. Yang, Z. Chen, P. Li, Dependable deep computation model for feature learning on big data in cyber-physical systems, ACM Trans. Cyber-Phys. Syst. 3 (1) (2018) 11.
[42] Q. Zhang, L.T. Yang, Z. Yan, Z. Chen, P. Li, An efficient deep learning model to predict cloud workload for industry informatics, IEEE Trans. Ind. Inf. 14 (7) (2018) 3170–3178.

Fanyu Bu received the BSc degree in computer science from Inner Mongolia Agricultural University, Hohhot, China, in 2003, and the MSc degree in computer application from Inner Mongolia University, Hohhot, China, in 2009. He received the Ph.D. degree in computer application technology from Dalian University of Technology, Dalian, China, in 2018. He is currently an assistant professor at the Department of Computer and Information Management at Inner Mongolia University of Finance and Economics, China. His research interests include big data, smart agriculture, and the Internet of Things.