
Most Influential Data Science Research Papers

These papers provide a breadth of information about data science that is generally useful and interesting from an AI perspective.
Contents
• A General and Adaptive Robust Loss Function
• On the Origin of Deep Learning
• Neural Style Transfer: A Review
• Deep Learning: A Critical Appraisal
• Recent Advances in Recurrent Neural Networks
• Deep Learning: An Introduction for Applied Mathematicians
• Deep Learning for Sentiment Analysis: A Survey
• A New Backpropagation Algorithm without Gradient Descent
• The Matrix Calculus You Need For Deep Learning
• Averaging Weights Leads to Wider Optima and Better Generalization
• Group Normalization
• A Survey on Neural Network-Based Summarization Methods
• geomstats: a Python Package for Riemannian Geometry in Machine Learning
• Backdrop: Stochastic Backpropagation
• Relational Deep Reinforcement Learning
• An intriguing failing of convolutional neural networks and the CoordConv solution
• Backprop Evolution
• Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks
• Neural Approaches to Conversational AI: Question Answering, Task-Oriented Dialogues and Social Chatbots
• Reversible Recurrent Neural Networks


A General and Adaptive Robust Loss Function

Jonathan T. Barron
Google Research

arXiv:1701.03077v10 [cs.CV] 4 Apr 2019

Abstract

We present a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, our loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting our loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning.

Many problems in statistics and optimization require robustness: that a model be less influenced by outliers than by inliers [18, 20]. This idea is common in parameter estimation and learning tasks, where a robust loss (say, absolute error) may be preferred over a non-robust loss (say, squared error) due to its reduced sensitivity to large errors. Researchers have developed various robust penalties with particular properties, many of which are summarized well in [3, 40]. In gradient descent or M-estimation [17] these losses are often interchangeable, so researchers may experiment with different losses when designing a system. This flexibility in shaping a loss function may be useful because of non-Gaussian noise, or simply because the loss that is minimized during learning or parameter estimation is different from how the resulting learned model or estimated parameters will be evaluated. For example, one might train a neural network by minimizing the difference between the network’s output and a set of images, but evaluate that network in terms of how well it hallucinates random images.

In this paper we present a single loss function that is a superset of many common robust loss functions. A single continuous-valued parameter in our general loss function can be set such that it is equal to several traditional losses, and can be adjusted to model a wider family of functions. This allows us to generalize algorithms built around a fixed robust loss with a new “robustness” hyperparameter that can be tuned or annealed to improve performance.

Though new hyperparameters may be valuable to a practitioner, they complicate experimentation by requiring manual tuning or time-consuming cross-validation. However, by viewing our general loss function as the negative log-likelihood of a probability distribution, and by treating the robustness of that distribution as a latent variable, we show that maximizing the likelihood of that distribution allows gradient-based optimization frameworks to automatically determine how robust the loss should be without any manual parameter tuning. This “adaptive” form of our loss is particularly effective in models with multivariate output spaces (say, image generation or depth estimation) as we can introduce independent robustness variables for each dimension in the output and thereby allow the model to independently adapt the robustness of its loss in each dimension.

The rest of the paper is as follows: In Section 1 we define our general loss function, relate it to existing losses, and enumerate some of its useful properties. In Section 2 we use our loss to construct a probability distribution, which requires deriving a partition function and a sampling procedure. Section 3 discusses four representative experiments: In Sections 3.1 and 3.2 we take two

Figure 1. Our general loss function (left) and its gradient (right) for different values of its shape parameter α. Several values of α reproduce existing loss functions: L2 loss (α = 2), Charbonnier loss (α = 1), Cauchy loss (α = 0), Geman-McClure loss (α = −2), and Welsch loss (α = −∞).
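The special cases named in the Figure 1 caption can be checked numerically. The following is a minimal Python sketch written from the piecewise loss of Eq. 8 and its derivative in Eq. 9 (both given in Section 1). The function names `general_loss` and `general_loss_grad` are illustrative and do not come from the paper. The sketch assumes scalar inputs and makes no attempt to stabilize values of α that are close to, but not exactly, 0 or 2.

```python
import math


def general_loss(x, alpha, c):
    """General robust loss rho(x, alpha, c), following the cases of Eq. 8."""
    z = (x / c) ** 2
    if alpha == 2:                      # L2 loss
        return 0.5 * z
    if alpha == 0:                      # Cauchy / Lorentzian loss
        return math.log(0.5 * z + 1.0)
    if alpha == -math.inf:              # Welsch / Leclerc loss
        return 1.0 - math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


def general_loss_grad(x, alpha, c):
    """Derivative of rho with respect to x, following the cases of Eq. 9."""
    z = (x / c) ** 2
    if alpha == 2:
        return x / c ** 2
    if alpha == 0:
        return 2.0 * x / (x ** 2 + 2.0 * c ** 2)
    if alpha == -math.inf:
        return (x / c ** 2) * math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (x / c ** 2) * (z / b + 1.0) ** (alpha / 2.0 - 1.0)


if __name__ == "__main__":
    x, c = 1.5, 0.7
    for alpha in (2, 1, 0, -2, -math.inf):
        print(alpha, general_loss(x, alpha, c), general_loss_grad(x, alpha, c))
```

Setting α = 1 or α = −2 exercises the generic branch and gives the Charbonnier and Geman-McClure forms, while α = 2, 0, and −∞ use the explicit special cases.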
vision-oriented deep learning models (variational autoencoders for image synthesis and self-supervised monocular depth estimation), replace their losses with the negative log-likelihood of our general distribution, and demonstrate that allowing our distribution to automatically determine its own robustness can improve performance without introducing any additional manually-tuned hyperparameters. In Sections 3.3 and 3.4 we use our loss function to generalize algorithms for the classic vision tasks of registration and clustering, and demonstrate the performance improvement that can be achieved by introducing robustness as a hyperparameter that is annealed or manually tuned.

1. Loss Function

The simplest form of our loss function is:

$$f(x, \alpha, c) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right) \tag{1}$$

Here α ∈ ℝ is a shape parameter that controls the robustness of the loss and c > 0 is a scale parameter that controls the size of the loss’s quadratic bowl near x = 0.

Though our loss is undefined when α = 2, it approaches L2 loss (squared error) in the limit:

$$\lim_{\alpha \to 2} f(x, \alpha, c) = \frac{1}{2} (x/c)^2 \tag{2}$$

When α = 1 our loss is a smoothed form of L1 loss:

$$f(x, 1, c) = \sqrt{(x/c)^2 + 1} - 1 \tag{3}$$

This is often referred to as Charbonnier loss [6], pseudo-Huber loss (as it resembles Huber loss [19]), or L1-L2 loss [40] (as it behaves like L2 loss near the origin and like L1 loss elsewhere).

Our loss’s ability to express L2 and smoothed L1 losses is shared by the “generalized Charbonnier” loss [35], which has been used in flow and depth estimation tasks that require robustness [7, 24] and is commonly defined as:

$$\left( x^2 + \epsilon^2 \right)^{\alpha/2} \tag{4}$$

Our loss has significantly more expressive power than the generalized Charbonnier loss, which we can see by setting our shape parameter α to nonpositive values. Though f(x, 0, c) is undefined, we can take the limit of f(x, α, c) as α approaches zero:

$$\lim_{\alpha \to 0} f(x, \alpha, c) = \log\left( \frac{1}{2} (x/c)^2 + 1 \right) \tag{5}$$

This yields Cauchy (aka Lorentzian) loss [2]. By setting α = −2, our loss reproduces Geman-McClure loss [14]:

$$f(x, -2, c) = \frac{2 (x/c)^2}{(x/c)^2 + 4} \tag{6}$$

In the limit as α approaches negative infinity, our loss becomes Welsch [21] (aka Leclerc [26]) loss:

$$\lim_{\alpha \to -\infty} f(x, \alpha, c) = 1 - \exp\left( -\frac{1}{2} (x/c)^2 \right) \tag{7}$$

With this analysis we can present our final loss function, which is simply f(·) with special cases for its removable singularities at α = 0 and α = 2 and its limit at α = −∞.

$$\rho(x, \alpha, c) = \begin{cases} \frac{1}{2} (x/c)^2 & \text{if } \alpha = 2 \\ \log\left( \frac{1}{2} (x/c)^2 + 1 \right) & \text{if } \alpha = 0 \\ 1 - \exp\left( -\frac{1}{2} (x/c)^2 \right) & \text{if } \alpha = -\infty \\ \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right) & \text{otherwise} \end{cases} \tag{8}$$

As we have shown, this loss function is a superset of the Welsch/Leclerc, Geman-McClure, Cauchy/Lorentzian, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions.

To enable gradient-based optimization we can derive the derivative of ρ(x, α, c) with respect to x:

$$\frac{\partial \rho}{\partial x}(x, \alpha, c) = \begin{cases} \frac{x}{c^2} & \text{if } \alpha = 2 \\ \frac{2x}{x^2 + 2c^2} & \text{if } \alpha = 0 \\ \frac{x}{c^2} \exp\left( -\frac{1}{2} (x/c)^2 \right) & \text{if } \alpha = -\infty \\ \frac{x}{c^2} \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{(\alpha/2 - 1)} & \text{otherwise} \end{cases} \tag{9}$$

Our loss and its derivative are visualized for different values of α in Figure 1.

The shape of the derivative gives some intuition as to how α affects behavior when our loss is being minimized by gradient descent or some related method. For all values of α the derivative is approximately linear when |x| < c, so the effect of a small residual is always linearly proportional to that residual’s magnitude. If α = 2, the derivative’s magnitude stays linearly proportional to the residual’s magnitude: a larger residual has a correspondingly larger effect. If α = 1 the derivative’s magnitude saturates to a constant 1/c as |x| grows larger than c, so as a residual increases its effect never decreases but never exceeds a fixed amount. If α < 1 the derivative’s magnitude begins to decrease as |x| grows larger than c (in the language of M-estimation [17], the derivative, aka “influence”, is “redescending”) so as the residual of an outlier increases, that outlier has less effect during gradient descent. The effect of an outlier diminishes as α becomes more negative, and as α approaches −∞ an outlier whose residual magnitude is larger than 3c is almost completely ignored.

We can also reason about α in terms of averages. Because the empirical mean of a set of values minimizes total squared error between the mean and the set, and the empirical median similarly minimizes absolute error, minimizing
our loss with α = 2 is equivalent to estimating a mean, and with α = 1 is similar to estimating a median. Minimizing our loss with α = −∞ is equivalent to local mode-finding [36]. Values of α between these extents can be thought of as smoothly interpolating between these three kinds of averages during estimation.

Our loss function has several useful properties that we will take advantage of. The loss is smooth (i.e., in C^∞) with respect to x, α, and c > 0, and is therefore well-suited to gradient-based optimization over its input and its parameters. The loss is zero at the origin, and increases monotonically with respect to |x|:

$$\rho(0, \alpha, c) = 0 \qquad \frac{\partial \rho}{\partial |x|}(x, \alpha, c) \ge 0 \tag{10}$$

The loss is invariant to a simultaneous scaling of c and x:

$$\forall_{k > 0} \quad \rho(kx, \alpha, kc) = \rho(x, \alpha, c) \tag{11}$$

The loss increases monotonically with respect to α:

$$\frac{\partial \rho}{\partial \alpha}(x, \alpha, c) \ge 0 \tag{12}$$

This is convenient for graduated non-convexity [4]: we can initialize α such that our loss is convex and then gradually reduce α (and therefore reduce convexity and increase robustness) during optimization, thereby enabling robust estimation that (often) avoids local minima.

We can take the limit of the loss as α approaches infinity, which due to Eq. 12 must be the upper bound of the loss:

$$\rho(x, \alpha, c) \le \lim_{\alpha \to +\infty} \rho(x, \alpha, c) = \exp\left( \frac{1}{2} (x/c)^2 \right) - 1 \tag{13}$$

We can bound the magnitude of the gradient of the loss, which allows us to better reason about exploding gradients:

$$\left| \frac{\partial \rho}{\partial x}(x, \alpha, c) \right| \le \begin{cases} \frac{1}{c} \left( \frac{\alpha - 2}{\alpha - 1} \right)^{\left( \frac{\alpha - 1}{2} \right)} \le \frac{1}{c} & \text{if } \alpha \le 1 \\ \frac{|x|}{c^2} & \text{if } \alpha \le 2 \end{cases} \tag{14}$$

L1 loss is not expressible by our loss, but if c is much smaller than x we can approximate it with α = 1:

$$f(x, 1, c) \approx \frac{|x|}{c} - 1 \quad \text{if } c \ll x \tag{15}$$

See Appendix E for other potentially-useful properties that are not used in our experiments.

2. Probability Density Function

With our loss function we can construct a general probability distribution, such that the negative log-likelihood (NLL) of its PDF is a shifted version of our loss function:

$$p(x \mid \mu, \alpha, c) = \frac{1}{c Z(\alpha)} \exp\left( -\rho(x - \mu, \alpha, c) \right) \tag{16}$$

$$Z(\alpha) = \int_{-\infty}^{\infty} \exp\left( -\rho(x, \alpha, 1) \right) \, dx \tag{17}$$

where p(x | µ, α, c) is only defined if α ≥ 0, as Z(α) is divergent when α < 0. For some values of α the partition function is relatively straightforward:

$$Z(0) = \pi \sqrt{2} \qquad Z(1) = 2 e K_1(1)$$
$$Z(2) = \sqrt{2\pi} \qquad Z(4) = e^{1/4} K_{1/4}(1/4) \tag{18}$$

where K_n(·) is the modified Bessel function of the second kind. For any rational positive α (excluding a singularity at α = 2) where α = n/d with n, d ∈ ℕ, we see that

$$Z\left( \frac{n}{d} \right) = \frac{e^{\left| \frac{2d}{n} - 1 \right|} \sqrt{\left| \frac{2d}{n} - 1 \right|}}{(2\pi)^{(d-1)}} \, G^{\,0,0}_{p,q} \left( \begin{matrix} \mathbf{a_p} \\ \mathbf{b_q} \end{matrix} \,\middle|\, \left( \frac{1}{n} - \frac{1}{2d} \right)^{2d} \right)$$
$$\mathbf{b_q} = \left\{ \frac{i}{n} \,\middle|\, i = -\frac{1}{2}, \ldots, n - \frac{3}{2} \right\} \cup \left\{ \frac{i}{2d} \,\middle|\, i = 1, \ldots, 2d - 1 \right\}$$
$$\mathbf{a_p} = \left\{ \frac{i}{n} \,\middle|\, i = 1, \ldots, n - 1 \right\} \tag{19}$$

where G(·) is the Meijer G-function and b_q is a multiset (items may occur twice). Because the partition function is difficult to evaluate or differentiate, in our experiments we approximate log(Z(α)) with a cubic Hermite spline (see Appendix C for details).

Just as our loss function includes several common loss functions as special cases, our distribution includes several common distributions as special cases. When α = 2 our distribution becomes a normal (Gaussian) distribution, and when α = 0 our distribution becomes a Cauchy distribution. These are also both special cases of Student’s t-distribution (ν = ∞ and ν = 1, respectively), though these are the only two points where these two families of distributions intersect. Our distribution resembles the generalized Gaussian distribution [29, 34], except that it is “smoothed” so as to approach a Gaussian distribution near the origin regardless of the shape parameter α. The PDF and NLL of our distribution for different values of α can be seen in Figure 2.

In later experiments we will use the NLL of our general distribution − log(p(· | α, c)) as the loss for training our neural networks, not our general loss ρ(·, α, c). Critically, using the NLL allows us to treat α as a free parameter, thereby allowing optimization to automatically determine the degree of robustness that should be imposed by the loss being used during training. To understand why the NLL must be used for this, consider a training procedure in which we simply minimize ρ(·, α, c) with respect to α and our model weights. In this scenario, the monotonicity of our general loss with respect to α (Eq. 12) means that optimization can trivially minimize the cost of outliers by setting α to be as small as possible. Now consider that same training procedure in which we minimize the NLL of our distribution
instead of our loss. As can be observed in Figure 2, reducing α will decrease the NLL of outliers but will increase the NLL of inliers. During training, optimization will have to choose between reducing α, thereby getting a “discount” on large errors at the cost of paying a penalty for small errors, or increasing α, thereby incurring a higher cost for outliers but a lower cost for inliers. This tradeoff forces optimization to judiciously adapt the robustness of the NLL being minimized. As we will demonstrate later, allowing the NLL to adapt in this way can increase performance on a variety of learning tasks, in addition to obviating the need for manually tuning α as a fixed hyperparameter.

Sampling from our distribution is straightforward given the observation that − log(p(x | 0, α, 1)) is bounded from below by ρ(x, 0, 1) + log(Z(α)) (shifted Cauchy loss). See Figure 2 for visualizations of this bound when α = ∞, which also bounds the NLL for all values of α. This lets us perform rejection sampling using a Cauchy as the proposal distribution. Because our distribution is a location-scale family, we sample from p(x | 0, α, 1) and then scale and shift that sample by c and µ respectively. This sampling approach is efficient, with an acceptance rate between ∼45% (α = ∞) and 100% (α = 0). Pseudocode for sampling is shown in Algorithm 1.

3. Experiments

We will now demonstrate the utility of our loss function and distribution with four experiments. None of these results are intended to represent the state-of-the-art for any particular task; our goal is to demonstrate the value of our loss and distribution as useful tools in isolation. We will show that across a variety of tasks, just replacing the loss function of an existing model with our general loss function can enable significant performance improvements.

In Sections 3.1 and 3.2 we focus on learning-based vision tasks in which training involves minimizing the difference between images: variational autoencoders for image synthesis and self-supervised monocular depth estimation. We will generalize and improve models for both tasks by using our general distribution (either as a conditional distribution in a generative model or by using its NLL as an adaptive loss) and allowing the distribution to automatically determine its own degree of robustness. Because robustness is automatic and requires no manually-tuned hyperparameters, we can even allow for the robustness of our loss to be adapted individually for each dimension of our output space: we can have a different degree of robustness at each pixel in an image, for example. As we will show, this approach is particularly effective when combined with image representations such as wavelets, in which we expect to see non-Gaussian, heavy-tailed distributions.

In Sections 3.3 and 3.4 we will build upon existing algorithms for two classic vision tasks (registration and clustering) that both work by minimizing a robust loss that is subsumed by our general loss. We will then replace each algorithm’s fixed robust loss with our loss, thereby introducing a continuous tunable robustness parameter α. This generalization allows us to introduce new models in which α is manually tuned or annealed, thereby improving performance. These results demonstrate the value of our loss function when designing classic vision algorithms, by allowing model robustness to be introduced into the algorithm design space as a continuous hyperparameter.

3.1. Variational Autoencoders

Variational autoencoders [23, 31] are a landmark technique for training autoencoders as generative models, which can then be used to draw random samples that resemble training data. We will demonstrate that our general distribution can be used to improve the log-likelihood performance of VAEs for image synthesis on the CelebA dataset [27]. A common design decision for VAEs is to model images using an independent normal distribution on a vector of RGB pixel values [23], and we use this design as our baseline model. Recent work has improved upon this model by using deep, learned, and adversarial loss functions [9, 16, 25]. Though it’s possible that our general loss or distribution

Figure 2. The negative log-likelihoods (left) and probability densities (right) of the distribution corresponding to our loss function when it is defined (α ≥ 0). NLLs are simply losses (Fig. 1) shifted by a log partition function. Densities are bounded by a scaled Cauchy distribution.

Algorithm 1 Sampling from our general distribution
Input: Parameters for the distribution to sample {µ, α, c}
Output: A sample drawn from p(x | µ, α, c).
1: while True:
2:     x ∼ Cauchy(x0 = 0, γ = √2)
3:     u ∼ Uniform(0, 1)
4:     if u < p(x | 0, α, 1) / exp(−ρ(x, 0, 1) − log(Z(α))):
5:         return cx + µ
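Algorithm 1 can be sketched in plain Python as follows. Because p(x | 0, α, 1) = exp(−ρ(x, α, 1)) / Z(α), the partition function cancels in the ratio of step 4, which reduces to exp(ρ(x, 0, 1) − ρ(x, α, 1)), so this sketch never evaluates Z(α). The names `rho` and `sample_general` are illustrative, the sketch assumes a finite α ≥ 0 (the α = ∞ limit is left out to avoid overflow), and the Cauchy proposal is drawn by inverting its CDF.

```python
import math
import random


def rho(x, alpha):
    """General loss with unit scale, rho(x, alpha, 1), for finite alpha >= 0."""
    z = x * x
    if alpha == 2:
        return 0.5 * z
    if alpha == 0:
        return math.log(0.5 * z + 1.0)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


def sample_general(mu, alpha, c, rng=random):
    """Draw one sample from p(x | mu, alpha, c) by rejection sampling."""
    while True:
        # Step 2: Cauchy(x0 = 0, gamma = sqrt(2)) proposal via inverse CDF.
        x = math.sqrt(2.0) * math.tan(math.pi * (rng.random() - 0.5))
        # Step 3: uniform variate for the accept/reject test.
        u = rng.random()
        # Step 4: acceptance ratio with Z(alpha) cancelled.
        if u < math.exp(rho(x, 0) - rho(x, alpha)):
            # Step 5: scale and shift (location-scale family).
            return c * x + mu


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_general(mu=1.0, alpha=2, c=0.5, rng=rng) for _ in range(5)]
    print(draws)
```

With α = 2 the accepted samples follow a normal distribution with mean µ and standard deviation c, and with α = 0 every proposal is accepted, which matches the 100% acceptance rate quoted in the text.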
can add value in these circumstances, to more precisely isolate our contribution we will explore the hypothesis that the baseline model of normal distributions placed on a per-pixel image representation can be improved significantly with the small change of just modeling a linear transformation of a VAE’s output with our general distribution. Again, our goal is not to advance the state of the art for any particular image synthesis task, but is instead to explore the value of our distribution in an experimentally controlled setting.

In our baseline model we give each pixel’s normal distribution a variable scale parameter σ^(i) that will be optimized over during training, thereby allowing the VAE to adjust the scale of its distribution for each output dimension. We can straightforwardly replace this per-pixel normal distribution with a per-pixel general distribution, in which each output dimension is given a distinct shape parameter α^(i) in addition to its scale parameter c^(i) (i.e., σ^(i)). By letting the α^(i) parameters be free variables alongside the scale parameters, training is able to adaptively select both the scale and robustness of the VAE’s posterior distribution over pixel values. We restrict all α^(i) to be in (0, 3), which allows our distribution to generalize Cauchy (α = 0) and Normal (α = 2) distributions and anything in between, as well as more platykurtic distributions (α > 2) which helps for this task. We limit α to be less than 3 because of the increased risk of numerical instability during training as α increases. We also compare against a Cauchy distribution as an example of a fixed heavy-tailed distribution, and against Student’s t-distribution as an example of a distribution that can adjust its own robustness similarly to ours.

Regarding implementation, for each output dimension i we construct unconstrained TensorFlow variables {α_ℓ^(i)} and {c_ℓ^(i)} and define

$$\alpha^{(i)} = (\alpha_{\max} - \alpha_{\min}) \, \mathrm{sigmoid}\left( \alpha_\ell^{(i)} \right) + \alpha_{\min} \tag{20}$$

$$c^{(i)} = \mathrm{softplus}\left( c_\ell^{(i)} \right) + c_{\min} \tag{21}$$

$$\alpha_{\min} = 0, \quad \alpha_{\max} = 3, \quad c_{\min} = 10^{-8} \tag{22}$$

The c_min offset avoids degenerate optima where likelihood is maximized by having c^(i) approach 0, while α_min and α_max determine the range of values that α^(i) can take. Variables are initialized such that initially all α^(i) = 1 and c^(i) = 0.01, and are optimized simultaneously with the autoencoder’s weights using the same Adam [22] optimizer instance.

Though modeling images using independent distributions on pixel intensities is a popular choice due to its simplicity, classic work in natural image statistics suggests that images are better modeled with heavy-tailed distributions on wavelet-like image decompositions [10, 28]. We therefore train additional models in which our decoded RGB per-pixel images are linearly transformed into spaces that better model natural images before computing the NLL of our distribution. For this we use the DCT [1] and the CDF 9/7 wavelet decomposition [8], both with a YUV colorspace. These representations resemble the JPEG and JPEG 2000 compression standards, respectively.

Our results can be seen in Table 1, where we report the validation set evidence lower bound (ELBO) for all combinations of our four distributions and three image representations, and in Figure 3, where we visualize samples from these models. We see that our general distribution performs similarly to a Student’s t-distribution, with both producing higher ELBOs than any fixed distribution across all representations. These two adaptive distributions appear to have complementary strengths: ours can be more platykurtic (α > 2) while a t-distribution can be more leptokurtic (ν < 1), which may explain why neither model consistently outperforms the other across representations. Note that the t-distribution’s NLL does not generalize the Charbonnier, L1, Geman-McClure, or Welsch losses, so unlike ours it will not generalize the losses used in the other tasks we will address. For all representations, VAEs trained with our general distribution produce sharper and more detailed samples than those trained with normal distributions. Models trained with Cauchy and t-distributions preserve high-frequency detail and work well on pixel representations, but systematically fail to synthesize low-frequency image content when given non-pixel representations, as evidenced by the gray backgrounds of those samples. Comparing performance across image representations shows that the “Wavelets + YUV” representation best maximizes validation set ELBO, though if we were to limit our model to only normal distributions the “DCT + YUV” model would appear superior, suggesting that there is value in reasoning jointly about distributions and image representations. After training we see shape parameters {α^(i)} that span (0, 2.5), suggesting that an adaptive mixture of normal-like and Cauchy-like distributions is useful in modeling natural images, as has been observed previously [30]. Note that this adaptive robustness is just a consequence of allowing {α_ℓ^(i)} to be free variables during training, and requires no manual parameter tuning. See Appendix G for more samples and reconstructions from these models, and a review of our experimental procedure.

                  Normal   Cauchy   t-dist.   Ours
Pixels + RGB       8,662    9,602    10,177   10,240
DCT + YUV         31,837   31,295    32,804   32,806
Wavelets + YUV    31,505   35,779    36,373   36,316

Table 1. Validation set ELBOs (higher is better) for our variational autoencoders. Models using our general distribution better maximize the likelihood of unseen data than those using normal or Cauchy distributions (both special cases of our model) for all three image representations, and perform similarly to Student’s t-distribution (a different generalization of normal and Cauchy distributions). The best and second best performing techniques for each representation are colored orange and yellow respectively.

[Figure 3 image grid. Columns: Normal, Cauchy, t-distribution, Ours. Rows: Pixels + RGB, DCT + YUV, Wavelets + YUV.]

Figure 3. Random samples from our variational autoencoders. We use either normal, Cauchy, Student’s t, or our general distributions (columns) to model the coefficients of three different image representations (rows). Because our distribution can adaptively interpolate between Cauchy-like or normal-like behavior for each coefficient individually, using it results in sharper and higher-quality samples (particularly when using DCT or wavelet representations) and does a better job of capturing low-frequency image content than Student’s t-distribution.

3.2. Unsupervised Monocular Depth Estimation

Due to the difficulty of acquiring ground-truth direct depth observations, there has been recent interest in “unsupervised” monocular depth estimation, in which stereo pairs and geometric constraints are used to directly train a neural network [11, 12, 15, 42]. We use [42] as a representative model from this literature, which is notable for its estimation of depth and camera pose. This model is trained by minimizing the differences between two images in a stereo pair, where one image has been warped to match the other according to the depth and pose predictions of a neural network. In [42] that difference between images is defined as the absolute difference between RGB values. We will replace that loss with different varieties of our general loss, and demonstrate that using annealed or adaptive forms of our loss can improve performance.

The absolute loss in [42] is equivalent to maximizing the likelihood of a Laplacian distribution with a fixed scale on RGB pixel values. We replace that fixed Laplacian distribution with our general distribution, keeping our scale fixed but allowing the shape parameter α to vary. Following our observation from Section 3.1 that YUV wavelet representations work well when modeling images with our loss function, we impose our loss on a YUV wavelet decomposition instead of the RGB pixel representation of [42]. The only changes we made to the code from [42] were to replace its loss function with our own and to remove the model components that stopped yielding any improvement after the loss function was replaced (see Appendix H for details). All training and evaluation were performed on the KITTI dataset [13] using the same training/test split as [42].

Results can be seen in Table 2. We present the error and accuracy metrics used in [42] and our own “average” error measure: the geometric mean of the four errors and one minus the three accuracies. The “Baseline” models use the loss function of [42], and we present both the numbers in [42] (“as reported”) and our own numbers from running

[Figure 4 image rows: Input, Baseline, Ours, Truth.]

Figure 4. Monocular depth estimation results on the KITTI benchmark using the “Baseline” network of [42]. Replacing only the network’s loss function with our “adaptive” loss over wavelet coefficients results in significantly improved depth estimates.

                              |------------ lower is better ------------|  |---- higher is better ----|
                              Avg     AbsRel   SqRel    RMS     logRMS     <1.25    <1.25^2   <1.25^3
Baseline [42] as reported     0.407   0.221    2.226    7.527   0.294      0.676    0.885     0.954
Baseline [42] reproduced      0.398   0.208    2.773    7.085   0.286      0.726    0.895     0.953
Ours, fixed α = 1             0.356   0.194    2.138    6.743   0.268      0.738    0.906     0.960
Ours, fixed α = 0             0.350   0.187    2.407    6.649   0.261      0.766    0.911     0.960
Ours, fixed α = 2             0.349   0.190    1.922    6.648   0.267      0.737    0.904     0.961
Ours, annealing α = 2 → 0     0.341   0.184    2.063    6.697   0.260      0.756    0.911     0.963
Ours, adaptive α ∈ (0, 2)     0.332   0.181    2.144    6.454   0.254      0.766    0.916     0.965

Table 2. Results on unsupervised monocular depth estimation using the KITTI dataset [13], building upon the model from [42] (“Baseline”). By replacing the per-pixel loss used by [42] with several variants of our own per-wavelet general loss function in which our loss’s shape parameters are fixed, annealed, or adaptive, we see a significant performance improvement. The top three techniques are colored red, orange, and yellow for each metric.
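The “average” error measure of Table 2 is described in the text as the geometric mean of the four errors and one minus the three accuracies. A minimal Python sketch of that computation follows, using rows copied from Table 2. The function name `average_error` and the variable names are illustrative and not from the paper.

```python
import math


def average_error(errors, accuracies):
    """Geometric mean of the error metrics and of (1 - accuracy) per accuracy metric."""
    terms = list(errors) + [1.0 - a for a in accuracies]
    return math.exp(sum(math.log(t) for t in terms) / len(terms))


# Rows from Table 2: (AbsRel, SqRel, RMS, logRMS) and (<1.25, <1.25^2, <1.25^3).
baseline_reported = average_error((0.221, 2.226, 7.527, 0.294), (0.676, 0.885, 0.954))
baseline_reproduced = average_error((0.208, 2.773, 7.085, 0.286), (0.726, 0.895, 0.953))
adaptive = average_error((0.181, 2.144, 6.454, 0.254), (0.766, 0.916, 0.965))

# Relative reduction of the adaptive model against the reproduced baseline.
relative_reduction = (baseline_reproduced - adaptive) / baseline_reproduced

if __name__ == "__main__":
    print(baseline_reported, baseline_reproduced, adaptive, relative_reduction)
```

Computed this way, the three values agree with the Avg column of Table 2 to within about 0.001, the difference being attributable to rounding of the tabulated inputs.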
Mean RMSE ×100 Max RMSE ×100
the code from [42] ourselves (“reproduced”). The “Ours” σ= 0 0.0025 0.005 0 0.0025 0.005
entries all use our general loss imposed on wavelet coeffi- FGR [41] 0.373 0.518 0.821 0.591 1.040 1.767
cients, but for each entry we use a different strategy for set- shape-annealed gFGR 0.374 0.510 0.802 0.590 0.997 1.670
ting the shape parameter or parameters. We keep our loss’s gFGR* 0.370 0.509 0.806 0.545 0.961 1.669

scale c fixed to 0.01, thereby matching the fixed scale as- Table 3. Results on the registration task of [41], in which we
compare their “FGR” algorithm to two versions of our “gFGR”
sumption of the baseline model and roughly matching the
generalization.
shape of its L1 loss (Eq. 15). To avoid exploding gradients
we multiply the loss being minimized by c, thereby bound-
ing gradient magnitudes by residual magnitudes (Eq. 14).
For the “fixed” models we use a constant value for α for all wavelet coefficients, and
observe that though performance is improved relative to the baseline, no single value of α
is optimal. The α = 1 entry is simply a smoothed version of the L1 loss used by the
baseline model, suggesting that just using a wavelet representation improves performance.
In the “annealing α = 2 → 0” model we linearly interpolate α from 2 (L2) to 0 (Cauchy) as a
function of training iteration, which outperforms all “fixed” models. In the
“adaptive α ∈ (0, 2)” model we assign each wavelet coefficient its own shape parameter as a
free variable and we allow those variables to be optimized alongside our network weights
during training as was done in Section 3.1, but with αmin = 0 and αmax = 2. This “adaptive”
strategy outperforms the “annealing” and all “fixed” strategies, thereby demonstrating the
value of allowing the model to adaptively determine the robustness of its loss during
training. Note that though the “fixed” and “annealed” strategies only require our general
loss, the “adaptive” strategy requires that we use the NLL of our general distribution as
our loss; otherwise training would simply drive α to be as small as possible due to the
monotonicity of our loss with respect to α, causing performance to degrade to the
“fixed α = 0” model. Comparing the “adaptive” model’s performance to that of the “fixed”
models suggests that, as in Section 3.1, no single setting of α is optimal for all wavelet
coefficients. Overall, we see that just replacing the loss function of [42] with our
adaptive loss on wavelet coefficients reduces average error by ∼ 17%.
In Figure 4 we compare our “adaptive” model’s output to the baseline model and the
ground-truth depth, and demonstrate a substantial qualitative improvement. See Appendix H
for many more results, and for visualizations of the per-coefficient robustness selected by
our model.

3.3. Fast Global Registration

Robustness is often a core component of geometric registration [38]. The Fast Global
Registration (FGR) algorithm of [41] finds the rigid transformation T that aligns point sets
{p} and {q} by minimizing the following loss:

    \sum_{(p,q)} \rho_{gm}(\|p - Tq\|, c)    (23)

where ρgm (·) is Geman-McClure loss. By using the Black and Rangarajan duality between
robust estimation and line processes [3] FGR is capable of producing high-quality
registrations at high speeds. Because Geman-McClure loss is a special case of our loss, and
because we can formulate our loss as an outlier process (see Appendix A), we can generalize
FGR to an arbitrary shape parameter α by replacing ρgm (·, c) with our ρ(·, α, c) (where
setting α = −2 reproduces FGR).

Figure 5. Performance (lower is better) of our gFGR algorithm on the task of [41] as we
vary our shape parameter α, with the lowest-error point indicated by a circle. FGR
(equivalent to gFGR with α = −2) is shown as a dashed line and a square, and shape-annealed
gFGR for each noise level is shown as a dotted line.

This generalized FGR (gFGR) enables algorithmic improvements. FGR iteratively solves a
linear system while annealing its scale parameter c, which has the effect of gradually
introducing nonconvexity. gFGR enables an alternative strategy in which we directly
manipulate convexity by annealing α instead of c. This “shape-annealed gFGR” follows the
same procedure as [41]: 64 iterations in which a parameter is annealed every 4 iterations.
Instead of annealing c, we set it to its terminal value and instead anneal α over the
following values:

    2, 1, 1/2, 1/4, 0, −1/4, −1/2, −1, −2, −4, −8, −16, −32

Table 3 shows results for the 3D point cloud registration task of [41] (Table 1 in that
paper), which shows that annealing shape produces moderately improved performance over FGR
for high-noise inputs, and behaves equivalently in low-noise inputs. This suggests that
performing graduated non-convexity by directly adjusting a shape parameter that controls
non-convexity (a procedure that is enabled by our general loss) is preferable to indirectly
controlling non-convexity by annealing a scale parameter.
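A minimal Python sketch of the shape-annealing idea described above. The loss follows the paper's ρ(x, α, c) with its limiting cases at α = 2, 0, and −∞. The function names are illustrative, and the rule of holding the last α once the 13 listed values are used up (64 iterations at 4 per step would allow 16 steps) is an assumption made here, not a detail taken from the FGR code.

```python
import math

# Shape values listed in the text for "shape-annealed gFGR".
ALPHA_SCHEDULE = [2, 1, 1/2, 1/4, 0, -1/4, -1/2, -1, -2, -4, -8, -16, -32]


def general_loss(x, alpha, c):
    """General robust loss rho(x, alpha, c), with limiting cases handled explicitly."""
    sq = (x / c) ** 2
    if alpha == 2:             # L2 loss
        return 0.5 * sq
    if alpha == 0:             # Cauchy / Lorentzian loss
        return math.log(0.5 * sq + 1.0)
    if alpha == -math.inf:     # Welsch / Leclerc loss
        return 1.0 - math.exp(-0.5 * sq)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((sq / b + 1.0) ** (alpha / 2.0) - 1.0)


def alpha_at(iteration, steps_per_value=4):
    """Shape parameter at an iteration (assumption: hold the last value at the end)."""
    index = min(iteration // steps_per_value, len(ALPHA_SCHEDULE) - 1)
    return ALPHA_SCHEDULE[index]


if __name__ == "__main__":
    residual, c = 3.0, 1.0  # c stays fixed at its terminal value while alpha changes
    for it in range(0, 64, 4):
        a = alpha_at(it)
        print(f"iter {it:2d}  alpha {a:6.2f}  loss {general_loss(residual, a, c):.4f}")
```

With α = −2 the general case reduces to 2(x/c)² / ((x/c)² + 4), the Geman-McClure form used by FGR, so the schedule passes through the baseline on its way to more strongly redescending shapes.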
Dataset       AC-W   N-Cuts [33]  LDMGI [37]  PIC [39]  RCC-DR [32]  RCC [32]  gRCC*  Rel. Impr.
YaleB         0.767  0.928        0.945       0.941     0.974        0.975     0.975  0.4%
COIL-100      0.853  0.871        0.888       0.965     0.957        0.957     0.962  11.6%
MNIST         0.679  -            0.761       -         0.828        0.893     0.901  7.9%
YTF           0.801  0.752        0.518       0.676     0.874        0.836     0.888  31.9%
Pendigits     0.728  0.813        0.775       0.467     0.854        0.848     0.871  15.1%
Mice Protein  0.525  0.536        0.527       0.394     0.638        0.649     0.650  0.2%
Reuters       0.471  0.545        0.523       0.057     0.553        0.556     0.561  1.1%
Shuttle       0.291  0.000        0.591       -         0.513        0.488     0.493  0.9%
RCV1          0.364  0.140        0.382       0.015     0.442        0.138     0.338  23.2%

Table 4. Results on the clustering task of [32] where we compare their “RCC” algorithm to
our “gRCC*” generalization in terms of AMI on several datasets. We also report the AMI
increase of “gRCC*” with respect to “RCC”. Baselines are taken from [32].
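The “Rel. Impr.” column is not defined in this section. As a hedged reading, several of the reported values appear consistent with the relative reduction in (1 − AMI) when going from RCC to gRCC*. The formula below is an assumption inferred from the table, not a definition given by the authors, and small mismatches on other rows may simply reflect rounding of the printed AMI values.

```python
def relative_improvement(ami_rcc, ami_grcc):
    """Assumed definition: percent reduction of the gap (1 - AMI) from RCC to gRCC*."""
    return 100.0 * (ami_grcc - ami_rcc) / (1.0 - ami_rcc)


# (RCC, gRCC*) AMI pairs copied from Table 4.
ROWS = {
    "COIL-100": (0.957, 0.962),
    "Pendigits": (0.848, 0.871),
    "RCV1": (0.138, 0.338),
}

if __name__ == "__main__":
    for name, (rcc, grcc) in ROWS.items():
        print(f"{name:10s} {relative_improvement(rcc, grcc):5.1f}%")
```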

Another generalization is to continue using the c-annealing strategy of [41], but treat α
as a hyperparameter and tune it independently for each noise level in this task. In Figure 5
we set α to a wide range of values and report errors for each setting, using the same
evaluation as [41]. We see that for high-noise inputs more negative values of α are
preferable, but for low-noise inputs values closer to 0 are optimal. We report the
lowest-error entry for each noise level as “gFGR*” in Table 3 where we see a significant
reduction in error, thereby demonstrating the improvement that can be achieved from treating
robustness as a hyperparameter.

3.4. Robust Continuous Clustering

In [32] robust losses are used for unsupervised clustering, by minimizing:

    \sum_i \|x_i - u_i\|_2^2 + \lambda \sum_{(p,q) \in E} w_{p,q} \rho_{gm}(\|u_p - u_q\|_2)    (24)

where {xi } is a set of input datapoints, {ui } is a set of “representatives” (cluster
centers), and E is a mutual k-nearest neighbors (m-kNN) graph. As in Section 3.3, ρgm (·) is
Geman-McClure loss, which means that our loss can be used to generalize this algorithm.
Using the RCC code provided by the authors (and keeping all hyperparameters fixed to their
default values) we replace Geman-McClure loss with our general loss and then sweep over
values of α. In Figure 6 we show the adjusted mutual information (AMI, the metric used by
[32]) of the resulting clustering for each value of α on the datasets used in [32], and in
Table 4 we report the AMI for the best-performing value of α for each dataset as “gRCC*”.
On some datasets performance is insensitive to α, but on others adjusting α improves
performance by as much as 32%. This improvement demonstrates the gains that can be achieved
by introducing robustness as a hyperparameter and tuning it accordingly.

Figure 6. Performance (higher is better) of our gRCC algorithm on the clustering task of
[32], for different values of our shape parameter α, with the highest-accuracy point
indicated by a dot. Because the baseline RCC algorithm is equivalent to gRCC with α = −2,
we highlight that α value with a dashed line and a square.

4. Conclusion

We have presented a two-parameter loss function that generalizes many existing
one-parameter robust loss functions: the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc,
generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By reducing
a family of discrete single-parameter losses to a single function with two continuous
parameters, our loss enables the convenient exploration and comparison of different robust
penalties. This allows us to generalize and improve algorithms designed around the
minimization of some fixed robust loss function, which we have demonstrated for registration
and clustering. When used as a negative log-likelihood, this loss gives a general
probability distribution that includes normal and Cauchy distributions as special cases.
This distribution lets us train neural networks in which the loss has an adaptive degree of
robustness for each output dimension, which allows training to automatically determine how
much robustness should be imposed by the loss without any manual parameter tuning. When this
adaptive loss is paired with image representations in which variable degrees of heavy-tailed
behavior occur, such as wavelets, this adaptive training approach allows us to improve the
performance of variational autoencoders for image synthesis and of neural networks for
unsupervised monocular depth estimation.

Acknowledgements: Thanks to Rob Anderson, Jesse Engel, David Gallup, Ross Girshick, Jaesik
Park, Ben Poole, Vivek Rathod, and Tinghui Zhou.
References

[1] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE
Transactions on Computers, 1974.
[2] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric
and piecewise-smooth flow fields. CVIU, 1996.
[3] Michael J. Black and Anand Rangarajan. On the unification of line processes, outlier
rejection, and robust statistics with applications in early vision. IJCV, 1996.
[4] Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press, 1987.
[5] Richard H. Byrd and David A. Pyne. Convergence of the iteratively reweighted
least-squares algorithm for robust regression. Technical report, Dept. of Mathematical
Sciences, The Johns Hopkins University, 1979.
[6] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two
deterministic half-quadratic regularization algorithms for computed imaging. ICIP, 1994.
[7] Qifeng Chen and Vladlen Koltun. Fast mrf optimization with application to depth
reconstruction. CVPR, 2014.
[8] Albert Cohen, Ingrid Daubechies, and J-C Feauveau. Biorthogonal bases of compactly
supported wavelets. Communications on pure and applied mathematics, 1992.
[9] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics
based on deep networks. NIPS, 2016.
[10] David J. Field. Relations between the statistics of natural images and the response
properties of cortical cells. JOSA A, 1987.
[11] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to
predict new views from the world’s imagery. CVPR, 2016.
[12] Ravi Garg, BG Vijay Kumar, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single
view depth estimation: Geometry to the rescue. ECCV, 2016.
[13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving?
the kitti vision benchmark suite. CVPR, 2012.
[14] Stuart Geman and Donald E. McClure. Bayesian image analysis: An application to single
photon emission tomography. Proceedings of the American Statistical Association, 1985.
[15] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth
estimation with left-right consistency. CVPR, 2017.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 2014.
[17] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust
Statistics: The Approach Based on Influence Functions. Wiley, 1986.
[18] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with
Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[19] Peter J. Huber. Robust estimation of a location parameter. Annals of Mathematical
Statistics, 1964.
[20] Peter J. Huber. Robust Statistics. Wiley, 1981.
[21] John E. Dennis Jr. and Roy E. Welsch. Techniques for nonlinear least squares and robust
regression. Communications in Statistics-simulation and Computation, 1978.
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[23] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
[24] Philipp Krähenbühl and Vladlen Koltun. Efficient nonlocal regularization for optical
flow. ECCV, 2012.
[25] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther.
Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.
[26] Yvan G Leclerc. Constructing simple stable descriptions for image partitioning. IJCV,
1989.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in
the wild. ICCV, 2015.
[28] Stéphane Mallat. A theory for multiresolution signal decomposition: The wavelet
representation. TPAMI, 1989.
[29] Saralees Nadarajah. A generalized normal distribution. Journal of Applied Statistics,
2005.
[30] Javier Portilla, Vasily Strela, Martin J. Wainwright, and Eero P. Simoncelli. Image
denoising using scale mixtures of gaussians in the wavelet domain. IEEE TIP, 2003.
[31] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. ICML, 2014.
[32] Sohil Atul Shah and Vladlen Koltun. Robust continuous clustering. PNAS, 2017.
[33] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[34] M Th Subbotin. On the law of frequency of error. Matematicheskii Sbornik, 1923.
[35] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of optical flow estimation and
their principles. CVPR, 2010.
[36] Rein van den Boomgaard and Joost van de Weijer. On the equivalence of local-mode
finding, robust estimation and mean-shift analysis as used in early vision tasks. ICPR, 2002.
[37] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using
local discriminant models and global integration. TIP, 2010.
[38] Christopher Zach. Robust bundle adjustment revisited. ECCV, 2014.
[39] Wei Zhang, Deli Zhao, and Xiaogang Wang. Agglomerative clustering via maximum
incremental path integral. Pattern Recognition, 2013.
[40] Zhengyou Zhang. Parameter estimation techniques: A tutorial with application to conic
fitting, 1995.
[41] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. ECCV, 2016.
[42] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of
depth and ego-motion from video. CVPR, 2017.

A. Alternative Forms

The registration and clustering experiments in the paper require that we formulate our loss
as an outlier process. Using the equivalence between robust loss minimization and outlier
processes established by Black and Rangarajan [3], we can derive our loss’s Ψ-function:

    \Psi(z, \alpha) = \begin{cases}
        -\log(z) + z - 1 & \text{if } \alpha = 0 \\
        z \log(z) - z + 1 & \text{if } \alpha = -\infty \\
        \frac{|\alpha - 2|}{\alpha} \left( \left(1 - \frac{\alpha}{2}\right) z^{\frac{\alpha}{\alpha - 2}} + \frac{\alpha z}{2} - 1 \right) & \text{if } \alpha < 2
    \end{cases}

    \rho(x, \alpha, c) = \min_{0 \le z \le 1} \left( \frac{1}{2} (x/c)^2 z + \Psi(z, \alpha) \right)    (25)

Ψ(z, α) is not defined when α ≥ 2 because for those values the loss is no longer robust, and
so is not well described as a process that rejects outliers.

We can also derive our loss’s weight function to be used during iteratively reweighted
least squares [5, 17]:

    \frac{1}{x} \frac{\partial \rho}{\partial x}(x, \alpha, c) = \begin{cases}
        \frac{1}{c^2} & \text{if } \alpha = 2 \\
        \frac{2}{x^2 + 2c^2} & \text{if } \alpha = 0 \\
        \frac{1}{c^2} \exp\left(-\frac{1}{2} (x/c)^2\right) & \text{if } \alpha = -\infty \\
        \frac{1}{c^2} \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{(\alpha/2 - 1)} & \text{otherwise}
    \end{cases}    (26)

Curiously, these IRLS weights resemble a non-normalized form of Student’s t-distribution.
These weights are not used in any of our experiments, but they are an intuitive way to
demonstrate how reducing α attenuates the effect of outliers. A visualization of our loss’s
Ψ-functions and weight functions for different values of α can be seen in Figure 7.

Figure 7. Our general loss’s IRLS weight function (left) and Ψ-function (right) for
different values of the shape parameter α.

B. Practical Implementation

The special cases in the definition of ρ (·) that are required because of the removable
singularities of f (·) at α = 0 and α = 2 can make implementing our loss somewhat
inconvenient. Additionally, f (·) is numerically unstable near these singularities, due to
divisions by small values. Furthermore, many deep learning frameworks handle special cases
inefficiently by evaluating all cases of a conditional statement, even though only one case
is needed. To circumvent these issues we can slightly modify our loss (and its gradient and
Ψ-function) to guard against singularities and make implementation easier:

    \rho(x, \alpha, c) = \frac{b}{d} \left( \left( \frac{(x/c)^2}{b} + 1 \right)^{(d/2)} - 1 \right)

    \frac{\partial \rho}{\partial x}(x, \alpha, c) = \frac{x}{c^2} \left( \frac{(x/c)^2}{b} + 1 \right)^{(d/2 - 1)}

    \Psi(z, \alpha) = \begin{cases}
        \frac{b}{d} \left( \left(1 - \frac{d}{2}\right) z^{\frac{d}{d - 2}} + \frac{d z}{2} - 1 \right) & \text{if } \alpha < 2 \\
        0 & \text{if } \alpha = 2
    \end{cases}

    b = |\alpha - 2| + \epsilon, \qquad
    d = \begin{cases} \alpha + \epsilon & \text{if } \alpha \ge 0 \\ \alpha - \epsilon & \text{if } \alpha < 0 \end{cases}

Where ε is some small value, such as 10^−5. Note that even very small values of ε can cause
significant inaccuracy between our true partition function Z (α) and the effective partition
function of our approximate distribution when α is near 0, so this approximate
implementation should be avoided when accurate values of Z (α) are necessary.

C. Partition Function Approximation

Implementing the negative log-likelihood of our general distribution (ie, our adaptive loss)
requires a tractable and differentiable approximation of its log partition function. Because
the analytical form of Z (α) detailed in the paper is difficult to evaluate efficiently for
any real number, and especially difficult to differentiate with respect to α, we approximate
log(Z (α)) using cubic hermite spline interpolation in a transformed space. Efficiently
approximating log(Z (α)) with a spline is difficult, as we would like a concise
approximation that holds over the entire valid range α ≥ 0, and we would like to allocate
more precision in our spline interpolation to values near α = 2 (which is where log(Z (α))
varies most rapidly). To accomplish this, we first apply a monotonic nonlinearity to α that
stretches values near α = 2 (thereby increasing the density of spline knots in this region)
and compresses values as α ≫ 4, for which we use:

    \text{curve}(\alpha) = \begin{cases}
        \frac{9(\alpha - 2)}{4|\alpha - 2| + 1} + \alpha + 2 & \text{if } \alpha < 4 \\
        \frac{5}{18} \log(4\alpha - 15) + 8 & \text{otherwise}
    \end{cases}    (27)

This curve is roughly piecewise-linear in [0, 4] with a slope


of ∼1 at α = 0 and α = 4, but with a slope of ∼10 at
α = 2. When α > 4 the curve becomes logarithmic. This
function is continuously differentiable, as is required for our
log-partition approximation to also be continuously differ-
entiable.
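The properties claimed for Equation 27 can be checked numerically. The sketch below implements the curve as written and estimates its slope with a central finite difference; the helper names `curve` and `slope` and the step size are choices made here for illustration, not part of the authors' implementation.

```python
import math


def curve(alpha):
    """Monotonic nonlinearity of Equation 27, applied to alpha before spline lookup."""
    if alpha < 4:
        return 9.0 * (alpha - 2.0) / (4.0 * abs(alpha - 2.0) + 1.0) + alpha + 2.0
    return (5.0 / 18.0) * math.log(4.0 * alpha - 15.0) + 8.0


def slope(alpha, h=1e-6):
    """Central finite-difference estimate of the derivative of curve at alpha."""
    return (curve(alpha + h) - curve(alpha - h)) / (2.0 * h)


if __name__ == "__main__":
    for a in (0.0, 2.0, 4.0, 10.0):
        print(f"alpha {a:5.1f}  curve {curve(a):8.4f}  slope {slope(a):8.4f}")
```

Both branches give 8 at α = 4 and share a slope of 10/9 there, which is the continuous differentiability the text relies on; the slope at α = 2 is 10 and at α = 0 is again 10/9.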
We transform α with this nonlinearity, and then approx-
imate log(Z (α)) in that transformed space using a spline
with knots in the range of [0, 12] evenly spaced apart by
1/1024. Values for each knot are set to their true value, and

tangents for each knot are set to minimize the squared er-
ror between the spline and the true log partition function.
Because our spline knots are evenly spaced in this trans-
formed space, spline interpolation can be performed in con-
stant time with respect to the number of spline knots. For all
values of α this approximation is accurate to within 10^−6, which appears to be sufficient
for our purposes. Our nonlinearity and our spline approximation to the true partition
function for small values of α can be seen in Figure 8.

Figure 8. Because our distribution’s log partition function log(Z (α)) is difficult to
evaluate for arbitrary inputs, we approximate it using cubic hermite spline interpolation in
a transformed space: first we curve α by a continuously differentiable nonlinearity that
increases knot density near α = 2 and decreases knot density when α > 4 (top) and then we
fit an evenly-sampled cubic hermite spline in that curved space (bottom). The dots shown in
the bottom plot are a subset of the knots used by our cubic spline, and are presented here
to demonstrate how this approach allocates spline knots with respect to α.

D. Motivation and Derivation

Our loss function is derived from the “generalized Charbonnier” loss [35], which itself
builds upon the Charbonnier loss function [6]. To better motivate the construction of our
loss function, and to clarify its relationship to prior work, here we work through how our
loss function was constructed.
Generalized Charbonnier loss can be defined as:

    d(x, \alpha, c) = \left( x^2 + c^2 \right)^{\alpha/2}    (28)

Here we use a slightly different parametrization from [35] and use α/2 as the exponent
instead of just α. This makes the generalized Charbonnier somewhat easier to reason about
with respect to standard loss functions: d (x, 2, c) resembles L2 loss, d (x, 1, c)
resembles L1 loss, etc.
We can reparametrize generalized Charbonnier loss as:

    d(x, \alpha, c) = c^\alpha \left( (x/c)^2 + 1 \right)^{\alpha/2}    (29)

We omit the c^α scale factor, which gives us a loss that is scale invariant with respect
to c:

    g(x, \alpha, c) = \left( (x/c)^2 + 1 \right)^{\alpha/2}    (30)
    \forall_{k > 0} \; g(kx, \alpha, kc) = g(x, \alpha, c)    (31)

This lets us view the c “padding” variable as a “scale” parameter, similar to other common
robust loss functions. Additionally, only after dropping this scale factor does setting α to
a negative value yield a family of meaningful robust loss functions, such as Geman-McClure
loss.
But this loss function still has several unintuitive properties: the loss is non-zero when
x = 0 (assuming a non-zero value of c), and the curvature of the quadratic “bowl” near x = 0
varies as a function of c and α. We therefore construct a shifted and scaled version of
Equation 30 that does not have these properties:

    \frac{g(x, \alpha, c) - g(0, \alpha, c)}{c^2 g''(0, \alpha, c)} = \frac{1}{\alpha} \left( \left( (x/c)^2 + 1 \right)^{\alpha/2} - 1 \right)    (32)

This loss generalizes L2, Cauchy, and Geman-McClure loss, but it has the unfortunate
side-effect of flattening out to 0 when α ≪ 0, thereby prohibiting many annealing
strategies. This can be addressed by modifying the 1/α scaling to approach 1 instead of 0
when α ≪ 0 by introducing another scaling that cancels out the division by α. To preserve
the scale-invariance of Equation 31, this scaling also needs to be applied to the (x/c)^2
term in the loss. This scaling also needs to maintain the monotonicity of our loss with
respect to α so as to make annealing possible. There are several scalings that satisfy this
property, so we select one
that is efficient to evaluate and which keeps our loss function smooth (ie, having
derivatives of all orders everywhere) with respect to x, α, and c, which is |α − 2|. This
gives us our final loss function:

    f(x, \alpha, c) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right)    (33)

Using |α − 2| satisfies all of our criteria, though it does introduce a removable
singularity into our loss function at α = 2 and reduces numerical stability near α = 2.

E. Additional Properties

Here we enumerate additional properties of our loss function that were not used in our
experiments.
At the origin the IRLS weight of our loss is 1/c^2:

    \frac{1}{x} \frac{\partial \rho}{\partial x}(0, \alpha, c) = \frac{1}{c^2}    (34)

For all values of α, when |x| is small with respect to c the loss is well-approximated by a
quadratic bowl:

    \rho(x, \alpha, c) \approx \frac{1}{2} (x/c)^2 \quad \text{if } |x| < c    (35)

Because the second derivative of the loss is maximized at x = 0, this quadratic
approximation tells us that the second derivative is bounded from above:

    \frac{\partial^2 \rho}{\partial x^2}(x, \alpha, c) \le \frac{1}{c^2}    (36)

When α is negative the loss approaches a constant as |x| approaches infinity, letting us
bound the loss:

    \forall_{x, c} \; \rho(x, \alpha, c) \le \frac{\alpha - 2}{\alpha} \quad \text{if } \alpha < 0    (37)

The loss’s Ψ-function increases monotonically with respect to α when α < 2 for all values
of z in [0, 1]:

    \frac{\partial \Psi}{\partial \alpha}(z, \alpha) \ge 0 \quad \text{if } 0 \le z \le 1    (38)

The roots of the second derivative of ρ (x, α, c) are:

    x = \pm c \sqrt{\frac{\alpha - 2}{\alpha - 1}}    (39)

This tells us at what value of x the loss begins to redescend. This point has a magnitude
of c when α = −∞, and that magnitude increases as α increases. The root is undefined when
α ≥ 1, as our loss is redescending iff α < 1. Our loss is strictly convex iff α ≥ 1,
non-convex iff α < 1, and pseudoconvex for all values of α.

F. Wavelet Implementation

Two of our experiments impose our loss on images reparametrized with the
Cohen-Daubechies-Feauveau (CDF) 9/7 wavelet decomposition [8]. The analysis filters used for
these experiments are:

    lowpass           highpass
     0.852698679009    0.788485616406
     0.377402855613   -0.418092273222
    -0.110624404418   -0.040689417609
    -0.023849465020    0.064538882629
     0.037828455507

Here the origin coefficient of the filter is listed first, and the rest of the filter is
symmetric. The synthesis filters are defined as usual, by reversing the sign of alternating
wavelet coefficients in the analysis filters. The lowpass filter sums to √2, which means
that image intensities are doubled at each scale of the wavelet decomposition, and that the
magnitude of an image is preserved in its wavelet decomposition. Boundary conditions are
“reflecting”, or half-sample symmetric.

G. Variational Autoencoders

Our VAE experiments were performed using the code included in the TensorFlow Probability
codebase at
http://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/vae.py.
This code was designed for binarized MNIST data, so adapting it to the real-valued color
images in CelebA [27] required the following changes:

• Changing the input and output image resolution from (28, 28, 1) to (64, 64, 3).

• Increasing the number of training steps from 5000 to 50000, as CelebA is significantly
larger than MNIST.

• Delaying the start of cosine decay of the learning rate until the final 10000 training
iterations.

• Changing the CNN architecture from a 5-layer network with 5-tap and 7-tap filters with
interleaved strides of 1 and 2 (which maps from a 28 × 28 image to a vector) to a 6-layer
network consisting of all 5-tap filters with strides of 2 (which maps from a 64×64 input to
a vector). The number of hidden units was left unchanged, and the one extra layer we added
at the end of our decoder (and beginning of our decoder) was given the same number of hidden
units as the layer before it.

• In our “DCT + YUV” and “Wavelets + YUV” models, before imposing our model’s posterior we
apply an RGB-to-YUV transformation and then a per-channel DCT or
wavelet transformation to the YUV images, and then invert these transformations to
visualize each sampled image. In the “Pixels + RGB” model this transformation and its
inverse are the identity function.

• As discussed in the paper, for each output coefficient (pixel value, DCT coefficient, or
wavelet coefficient) we add a scale variable (σ when using normal distributions, c when
using our general distributions) and a shape variable α (when using our general
distribution).

We made as few changes to the reference code as possible so as to keep our model
architecture as simple as possible, as our goal is not to produce state-of-the-art image
synthesis results for some task, but is instead to simply demonstrate the value of our
general distribution in isolation.
CelebA [27] images are processed by extracting a square 160 × 160 image region at the center
of each 178 × 218 image and downsampling it to 64 × 64 by a factor of 2.5× using
TensorFlow’s bilinear interpolation implementation. Pixel intensities are scaled to [0, 1].
In the main paper we demonstrated that using our general distribution to independently
model the robustness of each coefficient of our image representation works better than
assuming a Cauchy (α = 0) or normal distribution (α = 2) for all coefficients (as those two
distributions lie within our general distribution). To further demonstrate the value of
independently modeling the robustness of each individual coefficient, we ran a more thorough
experiment in which we densely sampled values for α in [0, 2] that are used for all
coefficients. In Figure 9 we visualize the validation set ELBO for each fixed value of α
compared to an independently-adapted model. As we can see, though quality can be improved by
selecting a value for α in between 0 and 2, no single global setting of the shape parameter
matches the performance achieved by allowing each coefficient’s shape parameter to
automatically adapt itself to the training data. This observation is consistent with earlier
results on adaptive heavy-tailed distributions for image data [30].

Figure 9. Here we compare the validation set ELBO of our adaptive “Wavelets + YUV” VAE model
with the ELBO achieved when setting all wavelet coefficients to have the same fixed shape
parameter α. We see that allowing our distribution to individually adapt its shape parameter
to each coefficient outperforms any single fixed shape parameter.

In our Student’s t-distribution experiments, we parametrize each “degrees of freedom”
parameter as the exponentiation of some latent free parameter:

    \nu^{(i)} = \exp\left( \nu_\ell^{(i)} \right)    (40)

where all ν_ℓ^(i) are initialized to 0. Technically, these experiments are performed with
the “Generalized Student’s t-distribution”, meaning that we have an additional scale
parameter σ^(i) that is divided into x before computing the log-likelihood and is accounted
for in the partition function. These scale parameters are parametrized identically to the
c^(i) parameters used by our general distribution.
Comparing likelihoods across our different image representations requires that the
“Wavelets + YUV” and “DCT + YUV” representations be normalized to match the “Pixels + RGB”
representation. We therefore construct the linear transformations used for the
“Wavelets + YUV” and “DCT + YUV” spaces to have determinants of 1 as per the change of
variable formula (that is, both transformations are in the “special linear group”). Our
wavelet construction in Section F satisfies this criterion, and we use the orthonormal
version of the DCT which also satisfies this criterion. However, the standard RGB to YUV
conversion matrix does not have a determinant of 1, so we scale it by the inverse of the
cube root of the determinant of the standard conversion matrix, thereby forcing its
determinant to be 1. The resulting matrix is:

    \begin{bmatrix}
         0.47249 &  0.92759 &  0.18015 \\
        -0.23252 & -0.45648 &  0.68900 \\
         0.97180 & -0.81376 & -0.15804
    \end{bmatrix}

Naturally, its inverse maps from YUV to RGB.
Because our model can adapt the shape and scale parameters of our general distribution to
each output coefficient, after training we can inspect the shapes and scales that have
emerged during training, and from them gain insight into how optimization has modeled our
training data. In Figures 10 and 11 we visualize the shape and scale parameters for our
“Pixels + RGB” and “Wavelets + YUV” VAEs respectively. Our “Pixels” model is easy to
visualize as each output coefficient simply corresponds to a pixel in a channel, and our
“Wavelets” model can be visualized by flattening each wavelet scale and orientation into an
image (our DCT-based model is difficult to visualize in any intuitive way). In both models
we observe that training has
determined that these face images should be modeled using normal-like distributions near
the eyes and mouth, presumably because these structures are consistent and repeatable on
human faces, and Cauchy-like distributions on the background and in flat regions of skin.
Though our “Pixels + RGB” model has estimated similar distributions for each color channel,
our “Wavelets + YUV” model has estimated very different behavior for luma and chroma: more
Cauchy-like behavior is expected in luma variation, especially at fine frequencies, while
chroma variation is modeled as being closer to a normal distribution across all scales.

Figure 10. The final shape and scale parameters {α(i) } and {c(i) } for our “Pixels + RGB”
VAE after training has converged. We visualize α with black=0 and white=2 and log(c) with
black=log(0.002) and white=log(0.02). (Rows: {α(i) }, {log c(i) }; columns: R, G, B.)

Figure 11. The final shape and scale parameters {α(i) } and {c(i) } for our “Wavelets + YUV”
VAE after training has converged. We visualize α with black=0 and white=2 and log(c) with
black=log(0.00002) and white=log(0.2). (Rows: {α(i) }, {log c(i) }; columns: Y, U, V.)

See Figure 14 for additional samples from our models, and see Figure 15 for reconstructions
from our models on validation-set images. As is common practice, the samples and
reconstructions in those figures and in the paper are the means of the output distributions
of the decoder, not samples from those distributions. That is, we draw samples from the
latent encoded space and then decode them, but we do not draw samples in our output space.
Samples drawn from these output distributions tend to look noisy and irregular across all
distributions and image representations, but they provide a good intuition of how our
general distribution behaves in each image representation, so in Figure 12 we present
side-by-side visualizations of decoded means and samples.

Figure 12. As is common practice, the VAE samples shown in this paper are samples from the
latent space (left) but not from the final conditional distribution (right). Here we
contrast decoded means and samples from VAEs using our different output spaces, all using
our general distribution. (Columns: Mean Reconstruction, Sampled Reconstruction; rows:
Pixels + RGB, DCT + YUV, Wavelets + YUV.)

H. Unsupervised Monocular Depth Estimation

Our unsupervised monocular depth estimation experiments use the code from
https://github.com/tinghuiz/SfMLearner, which appears to correspond to the “Ours (w/o
explainability)” model from Table 1 of [42]. The only changes we made to this code were:
replacing its loss function with our own, reducing the number of training iterations from
200000 to 100000 (training converges faster when using our loss function) and disabling the
smoothness term and multi-scale side predictions used by [42], as neither yielded much
benefit when combined with our new loss function and they complicated experimentation by
introducing hyperparameters. Because the reconstruction loss in [42] is the sum of the means
of the losses imposed at each scale in a D-level pyramid of side predictions, we use a
D-level normalized wavelet decomposition (wherein images in [0, 1] result in wavelet
coefficients in [0, 1]) and then scale each coefficient’s loss by 2^d, where d is the
coefficient’s level.
In Figure 13 we visualize the final shape parameters for each output coefficient that were
converged upon during training. These results provide some insight into why our adaptive
model produces improved results compared to the ablations of our model in which we use a
single fixed or annealed value for α for all output coefficients. From the low α values in
the luma channel we can infer that training has decided that luma variation often has
outliers, and from the high α values in the chroma channel we can infer that chroma
variation rarely has outliers. Horizontal luma variation (upper right) tends to have larger
α values than vertical luma variation (lower left), perhaps because depth in this dataset is
largely due to horizontal motion, and so horizontal gradients tend to provide more depth
information than vertical gradients. Looking at the sides and the bottom of all scales and
channels we see that the model expects more outliers in these regions, which is likely due
to boundary effects: these areas often contain consistent errors due to there not being a
matching pixel in the alternate view.
In Figures 16 and 17 we present many more results from the test split of the KITTI dataset,
in which we compare our “adaptive” model’s output to the baseline model and the ground-truth
depth. The improvement we see is substantial and consistent across a variety of scenes.

Figure 13. The final shape parameters α for our unsupervised monocular depth estimation
model trained on KITTI data. The parameters are visualized in the same “YUV + Wavelet”
output space as was used during training, where black is α = 0 and white is α = 2.
(Panels: Y, U, V.)

I. Fast Global Registration

Our registration results were produced using the code re-
lease corresponding to [41]. Because the numbers presented
in [41] have low precision, we reproduced the performance
of the baseline FGR algorithm using this code. This code
included some evaluation details that were omitted from the
paper that we determined through correspondence with the
author: for each input, FGR is run 20 times with random
initialization and the median error is reported. We use this
procedure when reproducing the baseline performance of
[41] and when evaluating our own models.
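
The evaluation protocol described above (20 randomly initialized runs per input, median error reported) can be sketched as follows. This is only an illustration under stated assumptions: `run_once` and `toy_registration` are hypothetical stand-ins, not part of the FGR code release, and the error values are invented for the demonstration.

```python
import random
import statistics


def median_error_over_restarts(run_once, n_runs=20, seed=0):
    """Run a randomized procedure n_runs times and return the median error.

    run_once is a hypothetical callable that takes a random.Random instance
    (used for the random initialization) and returns a scalar error.
    """
    rng = random.Random(seed)
    errors = [run_once(rng) for _ in range(n_runs)]
    return statistics.median(errors)


def toy_registration(rng):
    # Hypothetical stand-in for one run: usually a small error,
    # occasionally a large failure caused by a bad initialization.
    return 10.0 if rng.random() < 0.2 else 0.01 + 0.001 * rng.random()


median_error = median_error_over_restarts(toy_registration)
```

Reporting the median rather than the mean keeps a few failed initializations from dominating the reported number.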
[Figure 14 image grid. Columns: Pixels + RGB, DCT + YUV, Wavelets + YUV. Rows: Normal distribution, Cauchy distribution, Student’s t-distribution, Our distribution.]

Figure 14. Random samples (more precisely, means of the output distributions decoded from random samples in our latent space) from
our family of trained variational autoencoders.
[Figure 15 image grid. Super-columns: Pixels + RGB, DCT + YUV, Wavelets + YUV. An Input column is followed by the sub-columns Normal, Cauchy, t-dist, Ours within each super-column.]

Figure 15. Reconstructions from our family of trained variational autoencoders, in which we use one of three different image represen-
tations for modeling images (super-columns) and use either normal, Cauchy, Student’s t, or our general distributions for modeling the
coefficients of each representation (sub-columns). The leftmost column shows the images which are used as input to each autoencoder.
Reconstructions from models using general distributions tend to be sharper and more detailed than reconstructions from the correspond-
ing model that uses normal distributions, particularly for the DCT or wavelet representations, though this difference is less pronounced
than what is seen when comparing samples from these models. The DCT and wavelet models trained with Cauchy distributions or Stu-
dent’s t-distributions systematically fail to preserve the background of the input image, as was noted when observing samples from these
distributions.
[Figure 16 image grid. Each example shows four rows: Input, Baseline, Ours, Truth.]

Figure 16. Monocular depth estimation results on the KITTI benchmark using the “Baseline” network of [42] and our own variant in
which we replace the network’s loss function with our own adaptive loss over wavelet coefficients. Changing only the loss function results
in significantly improved depth estimates.
[Figure 17 image grid. Each example shows four panels: Input, Baseline, Ours, Truth.]

Figure 17. Additional monocular depth estimation results, in the same format as Figure 16.
On the Origin of Deep Learning

Haohan Wang haohanw@cs.cmu.edu


Bhiksha Raj bhiksha@cs.cmu.edu
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
arXiv:1702.07800v4 [cs.LG] 3 Mar 2017

Abstract
This paper is a review of the evolutionary history of deep learning models. It covers the
period from the genesis of neural networks, when associationist models of the brain were
studied, to the models that have dominated the last decade of deep learning research, such
as convolutional neural networks, deep belief networks, and recurrent neural networks. In
addition to reviewing these models, this paper primarily focuses on their precedents,
examining how the initial ideas were assembled to construct the early models and how these
preliminary models developed into their current forms. Many of these evolutionary paths span
more than half a century and branch in diverse directions. For example, the CNN is built on
prior knowledge of the biological vision system; the DBN evolved from a trade-off between
the modeling power and the computational complexity of graphical models; and many
present-day models are neural counterparts of ancient linear models. This paper reviews
these evolutionary paths, offers a concise account of how these models were developed, and
aims to provide a thorough background for deep learning. More importantly, along the way,
this paper summarizes the gist behind these milestones and proposes many directions to guide
future research in deep learning.

Wang and Raj

1. Introduction
Deep learning has dramatically improved the state of the art in many artificial intelligence
tasks, such as object detection, speech recognition, and machine translation (LeCun et al.,
2015). Its deep architecture gives deep learning the potential to solve many more
complicated AI tasks (Bengio, 2009). As a result, researchers are extending deep learning
to a variety of modern domains and tasks in addition to traditional ones like object
detection, face recognition, or language modeling. For example, Osako et al. (2015) use a
recurrent neural network to denoise speech signals, Gupta et al. (2015) use stacked
autoencoders to discover clustering patterns of gene expressions, Gatys et al. (2015) use a
neural model to generate images with different styles, and Wang et al. (2016) use deep
learning to perform sentiment analysis from multiple modalities simultaneously. This period
is witnessing the blooming of deep learning research.
However, to fundamentally push the deep learning research frontier forward, one needs
to thoroughly understand what has been attempted in the past and why current models
exist in their present forms. This paper summarizes the evolutionary history of several
deep learning models and explains the main ideas behind these models and their relationship
to their ancestors. Understanding the past work is not trivial, as deep learning has evolved
over a long period of history, as shown in Table 1. Therefore, this paper aims to offer
readers a walk-through of the major milestones of deep learning research. We will cover the
milestones shown in Table 1, as well as many additional works. We will split the story
into different sections for clarity of presentation.
This paper starts the discussion from research on modeling the human brain. Although
the success of deep learning nowadays is not necessarily due to its resemblance to the human
brain (it is due more to its deep architecture), the ambition to build a system that
simulates the brain did drive the initial development of neural networks. Therefore, the
next section begins with connectionism and naturally leads to the age when the shallow
neural network matured.
With the maturity of neural networks, this paper continues to briefly discuss the ne-
cessity of extending shallow neural networks into deeper ones, as well as the promises deep
neural networks make and the challenges deep architecture introduces.
With the establishment of the deep neural network, this paper diverges into three dif-
ferent popular deep learning topics. Specifically, in Section 4, this paper elaborates how
Deep Belief Nets and its construction component Restricted Boltzmann Machine evolve as a
trade-off of modeling power and computation loads. In Section 5, this paper focuses on the
development history of Convolutional Neural Network, featured with the prominent steps
along the ladder of ImageNet competition. In Section 6, this paper discusses the develop-
ment of Recurrent Neural Networks, its successors like LSTM, attention models and the
successes they achieved.
While this paper primarily discusses deep learning models, the optimization of deep
architectures is an unavoidable topic in this community. Section 7 is devoted to a brief
summary of optimization techniques, including advanced gradient methods, Dropout, Batch
Normalization, etc.
This paper can be read as a complement to (Schmidhuber, 2015). Schmidhuber’s paper aims
to assign credit to all those who contributed to the present state of the art, so it
focuses on every single incremental work along the path and therefore cannot elaborate well
enough on each of them. Our paper, on the other hand, aims to provide the background for
readers to understand how these models were developed. Therefore, we emphasize the
milestones and elaborate on those ideas to help build associations between them. In
addition to the paths of classical deep learning models covered in (Schmidhuber, 2015), we
also discuss recent deep learning work that builds on classical linear models. Another
article that readers could read as a complement is (Anderson and Rosenfeld, 2000), in which
the authors conducted extensive interviews with well-known scientific leaders in the 90s on
the topic of the history of neural networks.

Table 1: Major milestones that will be covered in this paper

Year    Contributor               Contribution
300 BC  Aristotle                 introduced Associationism, started the history of human’s attempt to understand brain.
1873    Alexander Bain            introduced Neural Groupings as the earliest models of neural network, inspired Hebbian Learning Rule.
1943    McCulloch & Pitts         introduced MCP Model, which is considered as the ancestor of Artificial Neural Model.
1949    Donald Hebb               considered as the father of neural networks, introduced Hebbian Learning Rule, which lays the foundation of modern neural network.
1958    Frank Rosenblatt          introduced the first perceptron, which highly resembles modern perceptron.
1974    Paul Werbos               introduced Backpropagation
1980    Teuvo Kohonen             introduced Self Organizing Map
1980    Kunihiko Fukushima        introduced Neocognitron, which inspired Convolutional Neural Network
1982    John Hopfield             introduced Hopfield Network
1985    Hinton & Sejnowski        introduced Boltzmann Machine
1986    Paul Smolensky            introduced Harmonium, which is later known as Restricted Boltzmann Machine
1986    Michael I. Jordan         defined and introduced Recurrent Neural Network
1990    Yann LeCun                introduced LeNet, showed the possibility of deep neural networks in practice
1997    Schuster & Paliwal        introduced Bidirectional Recurrent Neural Network
1997    Hochreiter & Schmidhuber  introduced LSTM, solved the problem of vanishing gradient in recurrent neural networks
2006    Geoffrey Hinton           introduced Deep Belief Networks, also introduced layer-wise pretraining technique, opened current deep learning era.
2009    Salakhutdinov & Hinton    introduced Deep Boltzmann Machines
2012    Geoffrey Hinton           introduced Dropout, an efficient way of training neural networks


2. From Aristotle to Modern Artificial Neural Networks


The study of deep learning and artificial neural networks originates from our ambition to
build a computer system that simulates the human brain. Building such a system requires
an understanding of how our cognitive system functions. Therefore, this paper traces all
the way back to the origins of attempts to understand the brain and starts the discussion
with Aristotle’s Associationism around 300 B.C.

2.1 Associationism
“When, therefore, we accomplish an act of reminiscence, we pass through a
certain series of precursive movements, until we arrive at a movement on which
the one we are in quest of is habitually consequent. Hence, too, it is that we
hunt through the mental train, excogitating from the present or some other,
and from similar or contrary or coadjacent. Through this process reminiscence
takes place. For the movements are, in these cases, sometimes at the same time,
sometimes parts of the same whole, so that the subsequent movement is already
more than half accomplished.”

This remarkable paragraph of Aristotle is seen as the starting point of Associationism
(Burnham, 1888). Associationism is a theory stating that the mind is a set of conceptual
elements organized as associations between these elements. Inspired by Plato,
Aristotle examined the processes of remembrance and recall and came up with four laws
of association (Boeree, 2000).

• Contiguity: Things or events with spatial or temporal proximity tend to be associated
in the mind.

• Frequency: The number of occurrences of two events is proportional to the strength
of association between these two events.

• Similarity: Thought of one event tends to trigger the thought of a similar event.

• Contrast: Thought of one event tends to trigger the thought of an opposite event.

Back then, Aristotle described the implementation of these laws in our mind as common
sense. For example, the feel, the smell, or the taste of an apple should naturally lead to
the concept of an apple, as common sense. Nowadays, it is surprising to see that these
laws proposed more than 2000 years ago still serve as the fundamental assumptions of
machine learning methods. For example, samples that are near each other (under a defined
distance) are clustered into one group; explanatory variables that frequently occur with
response variables draw more attention from the model; similar/dissimilar data are usually
represented with more similar/dissimilar embeddings in latent space.
Contemporaneously, similar laws were also proposed by Zeno of Citium, Epicurus and
St Augustine of Hippo. The theory of associationism was later strengthened by a variety
of philosophers and psychologists. Thomas Hobbes (1588-1679) stated that complex
experiences were associations of simple experiences, which were in turn associations of
sensations. He also believed that association exists by means of coherence, with frequency
as its strength factor. Meanwhile, John Locke (1632-1704) introduced the concept of
“association of ideas”. He separated ideas of sensation from ideas of reflection, and he
stated that complex ideas could be derived from a combination of these two kinds of simple
ideas. David Hume (1711-1776) later reduced Aristotle’s four laws to three: resemblance
(similarity), contiguity, and cause and effect. He believed that whatever coherence the
world seemed to have was a matter of these three laws. Dugald Stewart (1753-1828) extended
these three laws with several other principles, among them an obvious one: accidental
coincidence in the sounds of words. Thomas Reid (1710-1796) believed that no original
quality of mind, other than habit, was required to explain the spontaneous recurrence of
thinking. James Mill (1773-1836) emphasized the law of frequency as the key to learning,
which is very similar to later stages of research.
David Hartley (1705-1757), a physician, is widely regarded as the one who made
associationism popular (Hartley, 2013). In addition to the existing laws, he proposed that
memory could be conceived as smaller-scale vibrations in the same regions of the brain as
the original sensory experience. These vibrations can link up to represent complex ideas
and therefore act as a material basis for the stream of consciousness. This idea
potentially inspired the Hebbian learning rule, which laid the foundation of neural
networks and will be discussed later in this paper.

Figure 1: Illustration of neural groupings in (Bain, 1873)

2.2 Bain and Neural Groupings


Besides David Hartley, Alexander Bain (1818-1903) also contributed to the fundamental
ideas of the Hebbian Learning Rule (Wilkes and Wade, 1997). In his book, Bain (1873) related
the processes of associative memory to the distribution of activity of neural groupings (a
term that he used to denote neural networks back then). He proposed a constructive mode
of storage capable of assembling what was required, in contrast to the alternative
traditional mode of storage with prestored memories.
To further illustrate his ideas, Bain first described the computational flexibility that
allows a neural grouping to function when multiple associations are to be stored. With
a few hypotheses, Bain managed to describe a structure that highly resembles the neural
networks of today: an individual cell summarizes the stimulation from other selected
linked cells within a grouping, as shown in Figure 1. The joint stimulation from a and c
triggers X, stimulation from b and c triggers Y and stimulation from a and c triggers Z. In
his original illustration, a, b, c stand for stimulations, and X, Y and Z are outcomes of
cells.
With the establishment of how this associative structure of neural grouping can function
as memory, Bain proceeded to describe the construction of these structures. He followed the
directions of associationism and stated that relevant impressions of neural groupings must
be made in temporal contiguity for a period, either on one occasion or repeated occasions.
Further, Bain described the computational properties of neural grouping: connections
are strengthened or weakened through experience via changes of intervening cell-substance.
Therefore, the induction of these circuits would be selected comparatively strong or weak.
As we will see in the following section, Hebb’s postulate highly resembles Bain’s
description, although nowadays we usually label this postulate as Hebb’s rather than Bain’s,
according to (Wilkes and Wade, 1997). This omission of Bain’s contribution may also be
due to Bain’s lack of confidence in his own theory: eventually, Bain was not convinced by
his own theory and doubted the practical value of neural groupings.

2.3 Hebbian Learning Rule


The Hebbian Learning Rule is named after Donald O. Hebb (1904-1985), since it was introduced
in his work The Organization of Behavior (Hebb, 1949). Hebb is also seen as the father of
Neural Networks because of this work (Didier and Bigand, 2011).
In 1949, Hebb stated the famous rule: “Cells that fire together, wire together”, which
emphasizes the activation behavior of co-fired cells. More specifically, in his book, he
stated that:

“When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or metabolic change
takes place in one or both cells such that A’s efficiency, as one of the cells firing
B, is increased.”

This archaic paragraph can be re-written into modern machine learning languages as the
following:

$$\Delta w_i = \eta x_i y \qquad (1)$$

where $\Delta w_i$ stands for the change of the synaptic weight $w_i$ of Neuron $i$, whose
input signal is $x_i$; $y$ denotes the postsynaptic response and $\eta$ denotes the learning
rate. In other words, the Hebbian Learning Rule states that the connection between two units
should be strengthened as the frequency of co-occurrences of these two units increases.
Although the Hebbian Learning Rule is seen as laying the foundation of neural networks,
its drawbacks are obvious from today’s perspective: as co-occurrences accumulate, the
weights of connections keep increasing, and the weights of a dominant signal increase
exponentially. This is known as the instability of the Hebbian Learning Rule (Principe et
al., 1999). Fortunately, these problems did not affect Hebb’s standing as the father of
neural networks.
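
The unbounded weight growth described above can be seen in a small numerical sketch of Equation 1. It assumes a linear neuron with response y = sum_i w_i x_i, which the text does not specify, and the input pattern, initial weights and learning rate are arbitrary illustrative choices.

```python
def hebbian_step(w, x, eta):
    """One Hebbian update, delta w_i = eta * x_i * y.

    Assumption: the postsynaptic response is linear, y = sum_i w_i * x_i.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * xi * y for wi, xi in zip(w, x)]


def norm(w):
    return sum(wi * wi for wi in w) ** 0.5


w = [0.1, 0.1]
x = [1.0, 0.5]            # the same input pattern, presented repeatedly
norms = [norm(w)]
for _ in range(50):
    w = hebbian_step(w, x, eta=0.1)
    norms.append(norm(w))
```

Under these assumptions the component of the weight vector along x is multiplied by a constant factor greater than one at every step, so the recorded norms grow geometrically and never settle.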


2.4 Oja’s Rule and Principal Component Analyzer


Erkki Oja extended the Hebbian Learning Rule to avoid this instability, and he also
showed that a neuron following his updating rule approximates the behavior of a
Principal Component Analyzer (PCA) (Oja, 1982).
In short, Oja introduced a normalization term to rescue the Hebbian Learning Rule, and he
further showed that his learning rule is simply an online update of a Principal
Component Analyzer. We present the details of this argument in the following paragraphs.
Starting from Equation 1 and following the same notation, Oja showed:

$$w_i^{t+1} = w_i^t + \eta x_i y$$

where t denotes the iteration. A straightforward way to avoid the exploding of weights is
to apply normalization at the end of each iteration, yielding:

$$w_i^{t+1} = \frac{w_i^t + \eta x_i y}{\left(\sum_{i=1}^{n} (w_i^t + \eta x_i y)^2\right)^{\frac{1}{2}}}$$

where n denotes the number of neurons. The above equation can be further expanded into
the following form:

$$w_i^{t+1} = \frac{w_i^t}{Z} + \eta\left(\frac{y x_i}{Z} - \frac{w_i \sum_{j}^{n} y x_j w_j}{Z^3}\right) + O(\eta^2)$$

where $Z = (\sum_i^n w_i^2)^{\frac{1}{2}}$. Further, two more assumptions are introduced:
1) $\eta$ is small, therefore $O(\eta^2)$ is approximately 0; 2) the weights are normalized,
therefore $Z = (\sum_i^n w_i^2)^{\frac{1}{2}} = 1$.
When these two assumptions are substituted back into the previous equation, Oja’s rule
is obtained as follows:

$$w_i^{t+1} = w_i^t + \eta y (x_i - y w_i^t) \qquad (2)$$

Oja took a step further to show that a neuron that was updated with this rule was
effectively performing Principal Component Analysis on the data. To show this, Oja first
re-wrote Equation 2 as the following forms with two additional assumptions (Oja, 1982):

$$\frac{d}{d(t)} w_i^t = C w_i^t - ((w_i^t)^T C w_i^t) w_i^t$$

where C is the covariance matrix of the input X. He then proceeded to show this property
using many conclusions from another of his works (Oja and Karhunen, 1985) and linked it back
to PCA through the fact that the components from PCA are eigenvectors and the first
component is the eigenvector corresponding to the largest eigenvalue of the covariance
matrix. Intuitively, we can interpret this property with a simpler explanation: the
eigenvectors of C are the solutions when we maximize the rule updating function. Since
$w_i^t$ converges to an eigenvector of the covariance matrix of X, $w_i^t$ gives the PCA
component.
Oja’s learning rule concludes our story of learning rules of the early-stage neural network.
Now we proceed to visit the ideas on neural models.
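
As an informal check of the claim that Oja’s rule behaves like an online principal component analyzer, the following sketch applies Equation 2 to synthetic two-dimensional data. It assumes a linear response y = sum_i w_i x_i; the data distribution, learning rate, iteration count and starting weights are illustrative choices, not taken from (Oja, 1982).

```python
import math
import random


def oja_step(w, x, eta):
    """One update of Equation 2: w_i <- w_i + eta * y * (x_i - y * w_i).

    Assumption: the response is linear, y = sum_i w_i * x_i.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]


rng = random.Random(0)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))     # direction of largest variance
v = (-1 / math.sqrt(2), 1 / math.sqrt(2))    # orthogonal direction

w = [1.0, 0.0]
for _ in range(20000):
    a = rng.gauss(0.0, 2.0)   # standard deviation 2.0 along u
    b = rng.gauss(0.0, 0.5)   # standard deviation 0.5 along v
    x = [a * u[0] + b * v[0], a * u[1] + b * v[1]]
    w = oja_step(w, x, eta=0.005)

w_norm = math.sqrt(sum(wi * wi for wi in w))
alignment = abs(w[0] * u[0] + w[1] * u[1]) / w_norm   # |cosine| between w and u
```

In this setup the weight vector should stay close to unit length, in contrast to plain Hebbian updates, and should point (up to sign) along u, the leading eigenvector of the data covariance.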


2.5 MCP Neural Model


While Donald Hebb is seen as the father of neural networks, the first model of a neuron
can be traced back to six years before the publication of the Hebbian Learning Rule, when
the neurophysiologist Warren McCulloch and the mathematician Walter Pitts speculated on the
inner workings of neurons and modeled a primitive neural network with electrical circuits
based on their findings (McCulloch and Pitts, 1943). Their model, known as the MCP neural
model, was a linear step function upon weighted linearly interpolated data that could be
described as:

$$y = \begin{cases} 1, & \sum_i w_i x_i \geq \theta \text{ AND } z_j = 0, \forall j \\ 0, & \text{otherwise} \end{cases}$$

where y stands for output, xi stands for input of signals, wi stands for the corresponding
weights and zj stands for the inhibitory input. θ stands for the threshold. The function is
designed in a way that the activity of any inhibitory input completely prevents excitation
of the neuron at any time.
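
The definition above translates almost directly into code. The sketch below is a minimal reading of it; the function name `mcp_neuron` and the example weights are illustrative and not taken from (McCulloch and Pitts, 1943).

```python
def mcp_neuron(x, w, theta, z):
    """MCP unit: outputs 1 iff sum_i w_i * x_i >= theta and every inhibitory input z_j is 0."""
    if any(zj != 0 for zj in z):
        return 0          # any active inhibitory input vetoes firing
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total >= theta else 0


# With fixed weights (1, 1) and threshold 2, the unit behaves like a logical AND
# of its two excitatory inputs, unless an inhibitory input is active.
fires = mcp_neuron([1, 1], [1, 1], theta=2, z=[0])
vetoed = mcp_neuron([1, 1], [1, 1], theta=2, z=[1])
```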
Despite the resemblance between the MCP Neural Model and the modern perceptron, they
still differ distinctly in several aspects:

• MCP Neural Model is initially built as electrical circuits. Later we will see that the
study of neural networks has borrowed many ideas from the field of electrical circuits.

• The weights of MCP Neural Model wi are fixed, in contrast to the adjustable weights
in modern perceptron. All the weights must be assigned with manual calculation.

• The idea of inhibitory input is quite unconventional even seen today. It might be an
idea worth further study in modern deep learning research.

2.6 Perceptron
With the success of the MCP Neural Model, Frank Rosenblatt gave the Hebbian Learning Rule
concrete form with the introduction of perceptrons (Rosenblatt, 1958). While theorists
like Hebb were focusing on the biological system in the natural environment, Rosenblatt
constructed an electronic device named the Perceptron that was shown to have the ability to
learn in accordance with associationism.
Rosenblatt (1958) introduced the perceptron in the context of the vision system, as
shown in Figure 2(a). He described the rules of the organization of a perceptron as
follows:

• Stimuli impact on a retina of sensory units, which respond in such a manner that the
pulse amplitude or frequency is proportional to the stimulus intensity.

• Impulses are transmitted to the Projection Area (A_I). This projection area is optional.

• Impulses are then transmitted to the Association Area through random connections. If
the sum of impulse intensities is equal to or greater than the threshold (θ) of this unit,
then this unit fires.

• Response units work in the same fashion as those intermediate units.

Figure 2: Perceptrons: (a) A new figure of the illustration of organization of perceptron as
in (Rosenblatt, 1958). (b) A typical perceptron in modern machine learning literature, in
which A_I (Projection Area) is omitted.

Figure 2(a) illustrates his explanation of the perceptron. From left to right, the four
units are the sensory unit, projection unit, association unit and response unit,
respectively. The projection unit receives the information from the sensory unit and passes
it on to the association unit. This unit is often omitted in other descriptions of similar
models. With the projection unit omitted, the structure resembles that of a present-day
perceptron in a neural network (as shown in Figure 2(b)): sensory units collect data,
association units linearly add these data with different weights and apply a non-linear
transform to the thresholded sum, and then pass the results to response units.
One distinction between the early stage neuron models and modern perceptrons is the
introduction of non-linear activation functions (we use sigmoid function as an example
in Figure 2(b)). This originates from the argument that linear threshold function should
be softened to simulate biological neural networks (Bose et al., 1996) as well as from the
consideration of the feasibility of computation to replace step function with a continuous
one (Mitchell et al., 1997).
After Rosenblatt’s introduction of Perceptron, Widrow et al. (1960) introduced a follow-
up model called ADALINE. However, the difference between Rosenblatt’s Perceptron and
ADALINE is mainly on the algorithm aspect. As the primary focus of this paper is neural
network models, we skip the discussion of ADALINE.

2.7 Perceptron’s Linear Representation Power


A perceptron is fundamentally a linear function of its input signals; it is therefore
limited to representing linear decision boundaries, such as those of the logical operations
NOT, AND or OR, but not XOR, for which a more sophisticated decision boundary is required.
This limitation was highlighted by Minsky and Papert (1969), when they attacked the
limitations of perceptrons by emphasizing that perceptrons cannot solve functions like XOR
or NXOR. As a result, very little research was done in this area until about the 1980s.


Figure 3: The linear representation power of perceptron (panels (a)-(d))

To show a more concrete example, we introduce a linear perceptron with only two inputs
x1 and x2; the decision boundary w1 x1 + w2 x2 = θ therefore forms a line in a
two-dimensional space. The choice of the threshold magnitude shifts the line and the sign of
the function picks one side of the line as the halfspace the function represents. The
halfspace is shown in Figure 3 (a).
In Figure 3 (b)-(d), we present two nodes a and b to denote the inputs, as well as a node
to denote the situation when both of them trigger and a node to denote the situation when
neither of them triggers. Figure 3 (b) and Figure 3 (c) show clearly that a linear
perceptron can be used to describe the AND and OR operations of these two inputs. However,
in Figure 3 (d), when we are interested in the XOR operation, the operation can no longer be
described by a single linear decision boundary.
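
The same point can be illustrated numerically. The sketch below searches a small grid of weights and thresholds for a single linear threshold unit that reproduces AND, OR and XOR. The grid and the helper names are illustrative assumptions, and a failed grid search is only suggestive rather than a proof.

```python
from itertools import product


def threshold_unit(w1, w2, theta, x1, x2):
    """Linear threshold unit: outputs 1 iff w1*x1 + w2*x2 >= theta."""
    return 1 if w1 * x1 + w2 * x2 >= theta else 0


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]


def realizable(target, grid):
    """Return some (w1, w2, theta) on the grid that reproduces target, or None."""
    for w1, w2, theta in product(grid, repeat=3):
        outputs = [threshold_unit(w1, w2, theta, a, b) for a, b in INPUTS]
        if outputs == target:
            return (w1, w2, theta)
    return None


grid = [k / 2 for k in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
and_params = realizable(AND, grid)
or_params = realizable(OR, grid)
xor_params = realizable(XOR, grid)
```

The search finds parameters for AND and OR but none for XOR. The algebra agrees for any real-valued parameters: XOR would require θ > 0, w1 ≥ θ and w2 ≥ θ, and yet w1 + w2 < θ, which is impossible.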
In the next section, we will show that the representation ability is greatly enlarged when
we put perceptrons together to make a neural network. However, when we keep stacking
one neural network upon the other to make a deep learning model, the representation power
will not necessarily increase.


3. From Modern Neural Network to the Era of Deep Learning


In this section, we will introduce some important properties of neural networks. These
properties partially explain the popularity neural networks have gained these days and also
motivate the necessity of exploring deeper architectures. To be specific, we will discuss a
set of universal approximation properties, each of which holds under its own condition.
Then, we will show that although a shallow neural network is a universal approximator, a
deeper architecture can significantly reduce the resources required while retaining the
representation power. Finally, we will also show some interesting properties of
backpropagation discovered in the 1990s, which may inspire some related research today.

3.1 Universal Approximation Property


The step from perceptrons to basic neural networks only requires placing the perceptrons
together. By placing the perceptrons side by side, we get a single one-layer neural network,
and by stacking one one-layer neural network upon the other, we get a multi-layer neural
network, which is often known as a multi-layer perceptron (MLP) (Kawaguchi, 2000).
One remarkable property of neural networks, widely known as the universal approximation
property, roughly states that an MLP can represent any function. Here we discuss
this property in three different aspects:

• Boolean Approximation: an MLP of one hidden layer (see footnote 1) can represent any
boolean function exactly.

• Continuous Approximation: an MLP of one hidden layer can approximate any bounded
continuous function with arbitrary accuracy.

• Arbitrary Approximation: an MLP of two hidden layers can approximate any function
with arbitrary accuracy.

We will discuss these three properties in detail in the following paragraphs. To suit different
readers’ interest, we will first offer an intuitive explanation of these properties and then offer
the proofs.

3.1.1 Representation of any Boolean Functions


This approximation property is very straightforward. In the previous section we have shown
that a linear perceptron can perform either AND or OR. Using De Morgan’s laws, every
propositional formula can be converted into an equivalent Disjunctive Normal Form, which
is an OR of multiple AND functions. Therefore, we simply rewrite the target Boolean
function as an OR of multiple AND operations. Then we design the network in such a way: the
hidden layer performs all the AND operations, and the output layer is simply an OR
operation.
The formal proof is not very different from this intuitive explanation, so we skip it for
simplicity.
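
The construction can be sketched in a few lines: one layer of AND units, one per input pattern that the target function maps to 1, followed by a single OR unit. The helper names and the step-activation convention below are illustrative assumptions rather than anything prescribed by the text.

```python
from itertools import product


def step(v):
    return 1 if v >= 0 else 0


def build_dnf_network(truth_table):
    """Create one AND unit per input pattern that the target maps to 1.

    truth_table maps tuples of 0/1 inputs to a 0/1 output. Each unit is
    stored as (weights, bias) and fires only on its own pattern.
    """
    units = []
    for pattern, out in truth_table.items():
        if out == 1:
            # +1 for bits that must be 1, -1 for bits that must be 0 (a NOT).
            weights = [1 if bit == 1 else -1 for bit in pattern]
            bias = -sum(pattern)   # the weighted sum reaches sum(pattern) only on an exact match
            units.append((weights, bias))
    return units


def forward(units, x):
    h = [step(sum(w * xi for w, xi in zip(weights, x)) + bias)
         for weights, bias in units]
    # The final unit is an OR: it fires if at least one AND unit fires.
    return step(sum(h) - 1)


# Example: 3-bit parity, which no single linear threshold unit can represent.
parity = {bits: sum(bits) % 2 for bits in product([0, 1], repeat=3)}
net = build_dnf_network(parity)
```

Note that this construction may need as many AND units as there are input patterns mapped to 1, which can grow exponentially with the number of inputs.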

1. Through this paper, we will follow the most widely accepted naming convention that calls a two-layer
neural network as one hidden layer neural network.


Figure 4: Example of Universal Approximation of any Bounded Continuous Functions (panels (a)-(d))

3.1.2 Approximation of any Bounded Continuous Functions


Continuing from the linear representation power of the perceptron discussed previously, if we
want to represent a more complex function, such as the one shown in Figure 4 (a), we can use
a set of linear perceptrons, each of them describing a halfspace. One of these perceptrons is
shown in Figure 4 (b); we will need five of them. With these perceptrons, we
can bound the target function out, as shown in Figure 4 (c). The numbers shown in
Figure 4 (c) represent the number of halfspaces described by perceptrons that contain the
corresponding region. As we can see, with an appropriate selection of the threshold (e.g.
θ = 5 in Figure 4 (c)), we can bound the target function out. Therefore, we can describe any
bounded continuous function with only one hidden layer, even if it is a shape as complicated
as Figure 4 (d).
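The counting argument of Figure 4 (c) can be illustrated with a small sketch. It assumes a regular pentagon around the origin as the target region; the normals, the offset of 1, and the function names are illustrative choices, not taken from the figure.

```python
import math

N_SIDES = 5
# One outward unit normal per perceptron; each perceptron describes a halfspace.
NORMALS = [(math.cos(2 * math.pi * i / N_SIDES), math.sin(2 * math.pi * i / N_SIDES))
           for i in range(N_SIDES)]

def halfspace_unit(point, normal, offset=1.0):
    """Linear perceptron: outputs 1 when the point lies in the halfspace n.x <= offset."""
    return 1 if normal[0] * point[0] + normal[1] * point[1] <= offset else 0

def count_halfspaces(point):
    """Hidden-layer sum: the number of halfspaces that contain the point."""
    return sum(halfspace_unit(point, n) for n in NORMALS)

def inside_region(point, threshold=N_SIDES):
    """Output unit: thresholding the sum at 5 keeps only the bounded polygon."""
    return count_halfspaces(point) >= threshold
```

A point at the origin is counted by all five perceptrons, while a point such as (3, 0) is missed by at least one of them and therefore falls below the threshold.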
This property was first shown in (Cybenko, 1989) and (Hornik et al., 1989). To be
specific, Cybenko (1989) showed that functions of the following form:

$$ f(x) = \sum_i \omega_i \, \sigma(w_i^T x + \theta) \qquad (3) $$

are dense in the space to which they belong. In other words, for an arbitrary function
g(x) in the same space as f(x), there is an f(x) of this form such that

$$ |f(x) - g(x)| < \epsilon $$

for any $\epsilon > 0$. In Equation 3, σ denotes the activation function (a squashing function back
then), $w_i$ denotes the weights for the input layer and $\omega_i$ denotes the weights for the hidden
layer.
This conclusion was drawn with a proof by contradiction: by the Hahn-Banach Theorem
and the Riesz Representation Theorem, supposing that the closure of the set of such f(x) is
not the whole space contradicts the assumption that σ is an activation (squashing) function.
To date, this property has drawn thousands of citations. Unfortunately, many of
the later works cite this property inappropriately (Castro et al., 2000): Equation 3
is not the widely accepted form of a one-hidden-layer neural network, because it delivers
a linear output rather than a thresholded/squashed one. About ten years after
this property was shown, Castro et al. (2000) concluded this story by showing that
the universal approximation property still holds when the final output is squashed.
Note that this property was shown in the context where activation functions are squashing
functions. By definition, a squashing function σ : R → [0, 1] is a non-decreasing function


Figure 5: Threshold is not necessary with a large number of linear perceptrons.

with the properties $\lim_{x \to \infty} \sigma(x) = 1$ and $\lim_{x \to -\infty} \sigma(x) = 0$. Many activation functions of
recent deep learning research do not fall into this category.

3.1.3 Approximation of Arbitrary Functions

Before we move on to explain this property, we first need to show a major property regarding
combining linear perceptrons into neural networks. Figure 5 shows that as the number of
linear perceptrons used to bound the target function increases, the area outside the polygon
with a sum close to the threshold shrinks. Following this trend, we can use a large number of
perceptrons to bound a circle, and this can be achieved even without knowing the threshold
because the area close to the threshold shrinks to nothing. What is left outside the circle is,
in fact, the area that sums to $N/2$, where N is the number of perceptrons used.
Therefore, a neural network with one hidden layer can represent a circle with arbitrary
diameter. Further, we introduce another hidden layer that is used to combine the outputs of
many different circles. This newly added hidden layer is only used to perform the OR operation.
Figure 6 shows an example in which the extra hidden layer is used to merge the circles
from the previous layer, so that the neural network can be used to approximate any function.
The target function is not necessarily continuous. However, each circle requires a large number
of neurons; consequently, the entire function requires even more.
This property was shown in (Lapedes and Farber, 1988) and (Cybenko, 1988) respectively.
Looking back at this property today, it is not hard to connect it to Fourier series
approximation, which, informally, states that every function curve can be decomposed into
the sum of many simpler curves. With this linkage, to show this universal approximation
property is to show that a one-hidden-layer neural network can represent one simple surface;
the second hidden layer then sums up these simple surfaces to approximate an arbitrary function.
As we know, a one-hidden-layer neural network simply performs a thresholded sum operation;
therefore, the only step left is to show that the first hidden layer can represent a
simple surface. To understand the “simple surface” through the link to the Fourier transform,
one can imagine one cycle of a sinusoid in the one-dimensional case or a “bump” on a plane
in the two-dimensional case.


Figure 6: How a neural network can be used to approximate a leaf shaped function.

For one dimension, to create a simple surface, we only need two sigmoid functions
appropriately placed, for example, as follows:

$$ f_1(x) = \frac{h}{1 + e^{-(x + t_1)}} $$

$$ f_2(x) = \frac{h}{1 + e^{x - t_2}} $$

Then, with $f_1(x) + f_2(x)$, we create a simple surface that reaches height 2h for
$-t_1 \le x \le t_2$ and stays near h elsewhere, provided the interval is wide compared with
the transition width of the sigmoids.
This could be easily generalized to the n-dimensional case, where we need 2n sigmoid functions
(neurons) for each simple surface. Then for each simple surface that contributes to the final
function, one neuron is added onto the second hidden layer. Therefore, regardless of the number
of neurons needed, one will never need a third hidden layer to approximate any function.
Similarly to how the Gibbs phenomenon affects Fourier series approximation, this
approximation cannot guarantee an exact representation.
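As a quick numerical check of the one-dimensional construction, the sketch below evaluates the two sigmoids exactly as written in the formulas for f1 and f2. The values h = 1 and t1 = t2 = 10 are arbitrary assumptions chosen so that the plateau is wide compared with the sigmoid transitions; with these formulas the plateau lies between -t1 and t2.

```python
import math

def f1(x, h, t1):
    """Rising sigmoid h / (1 + exp(-(x + t1))): switches on around x = -t1."""
    return h / (1.0 + math.exp(-(x + t1)))

def f2(x, h, t2):
    """Falling sigmoid h / (1 + exp(x - t2)): switches off around x = t2."""
    return h / (1.0 + math.exp(x - t2))

def simple_surface(x, h=1.0, t1=10.0, t2=10.0):
    """Sum of the two sigmoids: close to 2h on the plateau, close to h outside it."""
    return f1(x, h, t1) + f2(x, h, t2)

plateau_value = simple_surface(0.0)   # inside the interval
left_value = simple_surface(-30.0)    # far to the left of the interval
right_value = simple_surface(30.0)    # far to the right of the interval
```

Subtracting the constant h (for example through a bias in the next layer) would turn this into a bump of height h on a zero baseline.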
The universal approximation properties show the great potential of shallow neural networks,
at the price of exponentially many neurons at these layers. One follow-up question
is how to reduce the number of required neurons while maintaining the representation
power. This question motivates people to proceed to deeper neural networks even though
shallow neural networks already have infinite modeling power. Another issue worth attention
is that, although neural networks can approximate any function, it is not trivial to
find the set of parameters that explains the data. In the next two sections, we will discuss
these two questions respectively.


3.2 The Necessity of Depth

The universal approximation properties of shallow neural networks come at a price of
exponentially many neurons and therefore are not realistic. The question of how to maintain
this expressive power of the network while reducing the number of computation units has
been asked for years. Intuitively, Bengio and Delalleau (2011) suggested that it is natural
to pursue deeper networks because 1) the human neural system is a deep architecture (as we
will see in Section 5 with examples about the human visual cortex) and 2) humans tend to
represent concepts at one level of abstraction as the composition of concepts at lower levels.
Nowadays, the solution is to build deeper architectures, which follows from a conclusion
stating that the representation power of a k-layer neural network with polynomially many
neurons needs to be expressed with exponentially many neurons if a (k − 1)-layer structure
is used. However, theoretically, this conclusion is still being completed.
This conclusion can be traced back three decades to when Yao (1985) showed the
limitations of shallow circuits. Hastad (1986) later showed this property with
parity circuits: “there are functions computable in polynomial size and depth k but requires
exponential size when depth is restricted to k − 1”. He showed this property mainly by
applying De Morgan’s law, which allows any AND of ORs to be rewritten
as an OR of ANDs and vice versa. Therefore, he simplified a circuit where ANDs and ORs
appear one after the other by rewriting one layer of ANDs into ORs and thereby merging
this operation into its neighboring layer of ORs. By repeating this procedure, he was able to
represent the same function with fewer layers, but more computations.
Moving from circuits to neural networks, Delalleau and Bengio (2011) compared deep
and shallow sum-product neural networks. They showed that a function that could be
expressed with O(n) neurons on a network of depth k required at least $O(2^{\sqrt{n}})$ and
$O((n-1)^k)$ neurons on a two-layer neural network.
Further, Bianchini and Scarselli (2014) extended this study to general neural networks
with many major activation functions, including tanh and sigmoid. They derived
the conclusion with the concept of Betti numbers, and used this number to describe the
representation power of neural networks. They showed that for a shallow network, the
representation power can only grow polynomially with respect to the number of neurons, but
for a deep architecture, the representation power can grow exponentially with respect to the
number of neurons. They also related their conclusion to the VC-dimension of neural networks,
which is $O(p^2)$ for tanh (Bartlett and Maass, 2003), where p is the number of parameters.
Recently, Eldan and Shamir (2015) presented a more thorough proof to show that the depth
of a neural network is exponentially more valuable than its width, for a
standard MLP with any popular activation function. Their conclusion is drawn with only a
few weak assumptions that constrain the activation functions to be mildly increasing,
measurable, and able to allow shallow neural networks to approximate any univariate Lipschitz
function. Finally, we have a well-grounded theory to support the fact that deeper networks
are preferred over shallow ones. However, in reality, many problems will arise if we keep
increasing the number of layers. Among them, the increased difficulty of learning proper
parameters is probably the most prominent one. In the next section, we will discuss the
main driver of parameter search for a neural network: Backpropagation.


3.3 Backpropagation and Its Properties


Before we proceed, we need to clarify that the name backpropagation originally did not
refer to an algorithm used to learn the parameters of a neural network; instead,
it stood for a technique that helps efficiently compute the gradient of the parameters when
the gradient descent algorithm is applied to learn them (Hecht-Nielsen, 1989). However,
nowadays it is widely used as the term for the gradient descent algorithm combined with such
a technique.
Compared to a standard gradient descent, which updates all the parameters with re-
spect to error, backpropagation first propagates the error term at output layer back to
the layer at which parameters need to be updated, then uses standard gradient descent
to update parameters with respect to the propagated error. Intuitively, the derivation of
backpropagation is about organizing the terms when the gradient is expressed with the
chain rule. The derivation is neat but skipped in this paper due to the extensive resources
available (Werbos, 1990; Mitchell et al., 1997; LeCun et al., 2015). Instead, we will discuss
two interesting and seemingly contradictory properties of backpropagation.
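To make the "organizing the terms of the chain rule" intuition concrete, here is a minimal sketch for a network with one input, one sigmoid hidden unit and one linear output, trained with squared error. The architecture, the loss and all names are assumptions made for illustration; the finite-difference helper is only there to check the propagated gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Forward pass: hidden activation h and linear output y."""
    h = sigmoid(w1 * x)
    return h, w2 * h

def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2

def backprop(x, target, w1, w2):
    """Propagate the output error back one layer at a time (chain rule)."""
    h, y = forward(x, w1, w2)
    delta_out = y - target                         # dL/dy at the output layer
    grad_w2 = delta_out * h                        # dL/dw2
    delta_hidden = delta_out * w2 * h * (1.0 - h)  # error propagated to the hidden pre-activation
    grad_w1 = delta_hidden * x                     # dL/dw1
    return grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)
```

A gradient descent step would then subtract a learning rate times `grad_w1` and `grad_w2` from the respective weights.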

3.3.1 Backpropagation Finds Global Optimal for Linear Separable Data


Gori and Tesi (1992) studied the problem of local minima in backpropagation. Interestingly,
while the community believed that neural networks or deep learning approaches
suffer from local optima, they proposed an architecture where the global optimum
is guaranteed. Only a few weak assumptions about the network are needed to reach the global
optimum, including

• Pyramidal Architecture: upper layers have fewer neurons

• Weight matrices are full row rank

• The number of input neurons cannot be smaller than the number of classes/patterns of data.

However, their approach may no longer be relevant, as it requires the data to be
linearly separable, a condition under which many other models can be applied.

3.3.2 Backpropagation Fails for Linear Separable Data


On the other hand, Brady et al. (1989) studied the situations when backpropagation fails
on linearly separable data sets. They showed that there could be situations when the data
is linearly separable, but a neural network learned with backpropagation cannot find that
boundary. They also showed examples of when this situation occurs.
Their illustrative examples only hold when the misclassified data samples are significantly
fewer than the correctly classified data samples; in other words, the misclassified data samples
might be just outliers. Therefore, this interesting property, when viewed today, is arguably
a desirable property of backpropagation, as we typically expect a machine learning model
to neglect outliers. As a result, this finding has not attracted much attention.
However, no matter whether the data is an outlier or not, a neural network should be
able to overfit the training data given sufficient training iterations and a legitimate learning
algorithm, especially considering that Brady et al. (1989) showed that an inferior algorithm


was able to overfit the data. Therefore, this phenomenon should have played a critical role
in the research on improving optimization techniques. Recently, the study of the cost
surfaces of neural networks has indicated the existence of saddle points (Choromanska
et al., 2015; Dauphin et al., 2014; Pascanu et al., 2014), which may explain the findings of
Brady et al. (1989) back in the late 80s.
Backpropagation enables the optimization of deep neural networks. However, there is
still a long way to go before we can optimize them well. Later in Section 7, we will briefly
discuss more techniques related to the optimization of neural networks.


4. The Network as Memory and Deep Belief Nets

Figure 7: Trade off of representation power and computation complexity of several models,
that guides the development of better models

With the background of how a modern neural network is set up, we proceed to visit
each prominent branch of the current deep learning family. Our first stop is the branch that
leads to the popular Restricted Boltzmann Machines and Deep Belief Nets, and it starts as
a model to understand data in an unsupervised manner.
Figure 7 summarizes the models that will be covered in this section. The horizontal axis
stands for the computation complexity of these models while the vertical axis stands for the
representation power. The six milestones that this section focuses on are placed in
the figure.

4.1 Self Organizing Map


The discussion starts with the Self Organizing Map (SOM) invented by Kohonen (1990). SOM
is a powerful technique that is primarily used for reducing the dimension of data, usually
to one or two dimensions (Germano, 1999). While reducing the dimensionality, SOM also
retains the topological similarity of data points. It can also be seen as a tool for clustering
while imposing the topology on the clustered representation. Figure 8 is an illustration of a
Self Organizing Map with a two-dimensional grid of hidden neurons; it therefore learns a
two-dimensional representation of data. The upper shaded nodes denote the units of SOM that are used to


Figure 8: Illustration of Self-Organizing Map

represent data while the lower circles denote the data. There is no connection between the
nodes in SOM².
The position of each node is fixed. The representation should not be viewed as only a
numerical value; the position of it also matters. This property is different from
some widely accepted representation criteria. For example, we compare the case when
a one-hot vector and a one-dimensional SOM are used to denote colors: to denote green out
of a set C = {green, red, purple}, the one-hot representation can use any vector of (1, 0, 0),
(0, 1, 0) or (0, 0, 1) as long as we specify the bit for green correspondingly. However, for a
one-dimensional SOM, only two vectors are possible: (1, 0, 0) or (0, 0, 1). This is because
SOM aims to represent the data while retaining the similarity: since red and
purple are much more similar than green and red or green and purple, green should not be
represented in a way that splits red and purple. One should notice that this example is
only used to demonstrate that the position of each unit in SOM matters. In practice, the
values of SOM units are not restricted to integers.
The learned SOM is usually a good tool for visualizing data. For example, suppose we conduct
a survey on the happiness level and richness level of each country and feed the data into
a two-dimensional SOM. Then the trained units should represent the happiest and richest
country at one corner and the opposite country at the furthest corner. The remaining
two corners represent the richest yet unhappiest and the poorest but happiest countries.
The remaining countries are positioned accordingly. The advantage of SOM is that it allows one

2. In some other literature, (Bullinaria, 2004) as an example, one may notice that there are connections
in the illustrations of models. However, those connections are only used to represent the neighborhood
relationship of nodes, and there is no information flowing via those connections. In this paper, as we
will show many other models that rely on a clear illustration of information flow, we reserve
connections to denote information flow.


to easily tell how a country ranks in the world with a simple glance at the learned
units (Guthikonda, 2005).

4.1.1 Learning Algorithm

With an understanding of the representation power of SOM, we now proceed to its parameter
learning algorithm. The classic algorithm is heuristic and intuitive, as shown below:

    Initialize weights of all units, w_{i,j} ∀ i, j
    for t ≤ N do
        Pick v_k randomly
        Select Best Matching Unit (BMU) as p, q := arg min_{i,j} ||w_{i,j} − v_k||_2^2
        Select the nodes of interest as the neighbors of BMU: I = {w_{i,j} | dist(w_{i,j}, w_{p,q}) < r(t)}
        Update weights: w_{i,j} = w_{i,j} + P(i, j, p, q) l(t) (v_k − w_{i,j}), ∀ i, j ∈ I
    end for

Here we use a two-dimensional SOM as an example: i, j are indexes of units; w is the weight
of the unit; v denotes a data vector; k is the index of data; t denotes the current iteration; N
constrains the maximum number of steps allowed; P(·) denotes the penalty considering the
distance between unit p, q and unit i, j; l is the learning rate; r denotes a radius used to select
neighbor nodes. Both l and r typically decrease as t increases. ||·||_2^2 denotes the squared
Euclidean distance and dist(·) denotes the distance between the positions of units. The update
moves each selected weight toward the sampled data vector.
This algorithm explains how SOM can be used to learn a representation and how the
similarities are retained: it always selects a subset of units that are similar to the sampled
data and adjusts the weights of these units to match the sampled data.
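The training loop can be sketched in plain Python as follows. Several details are assumptions made for illustration only: the update moves each selected weight toward the sample, w ← w + P · l(t) · (v − w); the penalty P is a Gaussian of the grid distance to the BMU; l(t) and r(t) decay linearly; and the function names are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_matching_unit(weights, v):
    """Grid position (p, q) of the unit whose weight vector is closest to v."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], v))

def train_som(data, rows, cols, n_steps, l0=0.5, r0=None, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    r0 = r0 if r0 is not None else max(rows, cols) / 2.0
    # Random initialization of one weight vector per grid position.
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(rows) for j in range(cols)}
    for t in range(n_steps):
        decay = 1.0 - t / n_steps
        lr = l0 * decay                       # l(t): decreasing learning rate
        radius = max(r0 * decay, 0.5)         # r(t): decreasing neighborhood radius
        v = rng.choice(data)                  # pick a data vector at random
        p, q = best_matching_unit(weights, v)
        for (i, j), w in weights.items():
            grid_d = math.hypot(i - p, j - q)  # distance on the grid, not in data space
            if grid_d < radius:
                penalty = math.exp(-grid_d ** 2 / (2.0 * radius ** 2))
                for d in range(dim):
                    w[d] += penalty * lr * (v[d] - w[d])
    return weights
```

For example, `train_som([(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)], 3, 3, 200)` returns a dictionary that maps each of the nine grid positions to its learned two-dimensional weight vector.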
However, this algorithm relies on a careful selection of the radius of neighbor selection
and a good initialization of weights. Otherwise, although the learned weights will have a
local property of topological similarity, this property is lost globally: sometimes, two similar
clusters of similar events are separated by another dissimilar cluster of similar events. In
simpler words, units of green may actually separate units of red and units of purple if the
network is not appropriately trained (Germano, 1999).

4.2 Hopfield Network

Hopfield Network is historically described as a form of recurrent³ neural network, first
introduced in (Hopfield, 1982). “Recurrent” in this context refers to the fact that the
weights connecting the neurons are bidirectional. Hopfield Network is widely recognized
because of its content-addressable memory property. This content-addressable memory
property is a simulation of the spin glass theory. Therefore, we start the discussion from
spin glass.

3. The term “recurrent” is very confusing nowadays because of the popularity that the recurrent
neural network (RNN) has gained.


Figure 9: Illustration of Hopfield Network. It is a fully connected network of six binary
thresholding neural units. Every unit is connected with data, therefore these
units are denoted as unshaded nodes.

4.2.1 Spin Glass


Spin glass is a physics term used to describe a magnetic phenomenon. Much work has been
done on the detailed study of the related theory (Edwards and Anderson, 1975;
Mézard et al., 1990), so in this paper we only describe it intuitively.
When a group of dipoles is placed together in any space, each dipole is forced to align
itself with the field generated by these dipoles at its location. However, by aligning itself,
it changes the field at other locations, leading other dipoles to flip, causing the field in the
original location to change. Eventually, these changes will converge to a stable state.
To describe the stable state, we first define the total field at location j as

$$ s_j = o_j + c_t \sum_k \frac{s_k}{d_{jk}^2} $$

where $o_j$ is an external field, $c_t$ is a constant that depends on temperature t, $s_k$ is the
polarity of the kth dipole and $d_{jk}$ is the distance from location j to location k. Therefore,
the total potential energy of the system is:

$$ PE = \sum_j s_j o_j + c_t \sum_j s_j \sum_k \frac{s_k}{d_{jk}^2} \qquad (4) $$

The magnetic system will evolve until this potential energy is minimum.

4.2.2 Hopfield Network


Hopfield Network is a fully connected neural network with binary thresholding neural units.
The values of these units are either 0 or 1⁴. These units are fully connected with bidirectional
weights.
4. Some other literature may use -1 and 1 to denote the values of these units. While the choice of values
does not affect the idea of Hopfield Network, it changes the formulation of the energy function. In this
paper, we only discuss it in the context of 0 and 1 as values.


With this setting, the energy of a Hopfield Network is defined as:

$$ E = -\sum_i s_i b_i - \sum_{i,j} s_i s_j w_{i,j} \qquad (5) $$

where s is the state of a unit, b denotes the bias, w denotes the bidirectional weights and i, j
are indexes of units. This energy function closely connects to the potential energy function
of spin glass, as shown in Equation 4.
Hopfield Network is typically applied to memorize the state of data. The weights of a
network are designed or learned to make sure that the energy is minimized given the state
of interest. Therefore, when another state is presented to the network while the weights are
fixed, Hopfield Network can search for the states that minimize the energy and recover the
state in memory. For example, in a face completion task, when some images of faces are
presented to Hopfield Network (in a way that each unit of the network corresponds to each
pixel of one image, and images are presented one after the other), the network can calculate
the weights to minimize the energy given these faces. Later, if one image is corrupted or
distorted and presented to this network again, the network is able to recover the original
image by searching for a configuration of states that minimizes the energy, starting from the
corrupted input presented.
The term “energy” may remind people of physics. Explaining how Hopfield Network
works through a physics scenario makes it clearer: nature uses a Hopfield Network to memorize
the equilibrium position of a pendulum because, in an equilibrium position, the pendulum has
the lowest gravitational potential energy. Therefore, wherever a pendulum is placed, it will
converge back to the equilibrium position.

4.2.3 Learning and Inference


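Learning of the weights of a Hopfield Network is straightforward (Gurney, 1997). The
weights can be calculated as:

$$ w_{i,j} = \sum_{m=1}^{M} (2 s_i^{(m)} - 1)(2 s_j^{(m)} - 1) $$

where the sum runs over the M states to be memorized, $s^{(m)}$ denotes the m-th such state,
and the other notations are the same as in Equation 5.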


This learning procedure is simple, but still worth mentioning as it is an essential step of
a Hopfield Network when it is applied to solve practical problems. However, we find that
many online tutorials omit this step and, to make it worse, refer to the inference of states
as learning/training. To remove the confusion, in this paper, similar to how terms are used
in the standard machine learning community, we refer to the calculation of the weights of a
model (either from a closed-form solution or a numerical solution) as “parameter learning” or
“training”. We refer to the process of applying an existing model with known weights to
solving a real-world problem as “inference”⁵ or “testing” (to decode a hidden state of data,
e.g. to predict a label).
The inference of Hopfield Network is also intuitive. For a state of data, the network
tests whether inverting the state of one unit would decrease the energy. If so, the
5. “Inference” is conventionally used in such a way in the machine learning community, although some
statisticians may disagree with this usage.


network will invert the state and proceed to test the next unit. This procedure is called
Asynchronous update, and it is obviously subject to the sequential order in which
units are selected. Its counterpart is known as Synchronous update, in which the network first
tests all the units and then inverts all the units to be inverted simultaneously. Both of these
methods may lead to a local optimum. Synchronous update may even result in an increase
of energy and may converge to an oscillation or loop of states.
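Learning and asynchronous inference can be put together in a short sketch. It assumes 0/1 states, zero biases, no self-connections, and weights accumulated over the stored states as the sum of (2s_i − 1)(2s_j − 1); the six-unit pattern and the function names are illustrative only.

```python
def train_hopfield(patterns):
    """Accumulate (2*s_i - 1) * (2*s_j - 1) over all stored 0/1 patterns."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # assumption: no self-connections
                    w[i][j] += (2 * s[i] - 1) * (2 * s[j] - 1)
    return w

def energy(state, w, b=None):
    """Energy of Equation 5; biases default to zero (an assumption)."""
    n = len(state)
    b = b if b is not None else [0] * n
    bias_term = sum(state[i] * b[i] for i in range(n))
    pair_term = sum(state[i] * state[j] * w[i][j] for i in range(n) for j in range(n))
    return -bias_term - pair_term

def recall(state, w, max_sweeps=10):
    """Asynchronous update: flip one unit at a time whenever that lowers the energy."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            candidate = state[:]
            candidate[i] = 1 - candidate[i]
            if energy(candidate, w) < energy(state, w):
                state, changed = candidate, True
        if not changed:
            break
    return state

stored = [1, 0, 1, 0, 1, 0]
weights = train_hopfield([stored])
recovered = recall([1, 0, 1, 0, 0, 0], weights)  # one unit corrupted
```

Because units are visited in a fixed order, a different visiting order may end in a different local minimum, which is the order dependence mentioned above.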

4.2.4 Capacity
One distinct disadvantage of Hopfield Network is that it cannot store memory very
efficiently: a network of N units can only store up to $0.15N^2$ bits of memory, while a
network with N units has $N^2$ edges. In addition, after storing M memories (M instances
of data), each connection has an integer value in the range [−M, M]. Thus, the number of bits
required to store the weights of N units is $N^2 \log(2M + 1)$ (Hopfield, 1982). Therefore, we
can safely draw the conclusion that although Hopfield Network is a remarkable idea that enables
the network to memorize data, it is extremely inefficient in practice.
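A quick back-of-the-envelope calculation shows the size of this gap. The sketch below assumes the logarithm is taken in base 2 (so that the result is in bits) and uses N = 100 units and M = 15 memories as arbitrary example values.

```python
import math

def hopfield_capacity_bits(n_units):
    """Memory the network can hold, using the 0.15 * N^2 figure quoted in the text."""
    return 0.15 * n_units ** 2

def weight_storage_bits(n_units, n_memories):
    """Bits needed for N^2 weights, each an integer in [-M, M] (base-2 log assumed)."""
    return n_units ** 2 * math.log2(2 * n_memories + 1)

capacity = hopfield_capacity_bits(100)       # about 1500 bits
storage = weight_storage_bits(100, 15)       # 10000 * log2(31) bits
overhead = storage / capacity                # storage cost per bit of memory
```

With these example values the weights take more than thirty times as many bits as the memories they encode.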
As follow-ups to the invention of Hopfield Network, many works have attempted to study
and increase the capacity of the original Hopfield Network (Storkey, 1997; Liou and Yuan,
1999; Liou and Lin, 2006). Despite these attempts, Hopfield Network has gradually
faded out of the community. It has been replaced by other models that were inspired by it.
Immediately following this section, we will discuss the popular Boltzmann Machine and
Restricted Boltzmann Machine and study how these models were upgraded from the initial
ideas of Hopfield Network and evolved to replace it.

4.3 Boltzmann Machine


Boltzmann Machine, invented by Ackley et al. (1985), is a stochastic version of Hopfield
Network with hidden units. It got its name from the Boltzmann Distribution.

4.3.1 Boltzmann Distribution


Boltzmann Distribution is named after Ludwig Boltzmann and was investigated extensively by
(Willard, 1902). It was originally used to describe the probability distribution of particles in
a system over various possible states as follows:

$$ F(s) \propto e^{-\frac{E_s}{kT}} $$

where s stands for the state and $E_s$ is the corresponding energy. k and T are Boltzmann’s
constant and thermodynamic temperature respectively. Naturally, the ratio of two
distributions is only characterized by the difference of energies, as follows:

$$ r = \frac{F(s_1)}{F(s_2)} = e^{\frac{E_{s_2} - E_{s_1}}{kT}} $$

which is known as the Boltzmann factor.


Figure 10: Illustration of Boltzmann Machine. With the introduction of hidden units
(shaded nodes), the model conceptually splits into two parts: visible units and
hidden units. The red dashed line is used to highlight the conceptual separation.

With how the distribution is specified by the energy, the probability is defined as the
term of each state divided by a normalizer, as follows:

$$ p_{s_i} = \frac{e^{-\frac{E_{s_i}}{kT}}}{\sum_j e^{-\frac{E_{s_j}}{kT}}} $$
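The normalized probabilities and the Boltzmann factor can be computed directly, as in this small sketch. Treating kT as a single parameter (set to 1 by default) and subtracting the minimum energy for numerical stability are implementation choices, not part of the definition.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)."""
    e_min = min(energies)  # shifting all energies leaves the probabilities unchanged
    terms = [math.exp(-(e - e_min) / kT) for e in energies]
    normalizer = sum(terms)
    return [t / normalizer for t in terms]

def boltzmann_factor(e1, e2, kT=1.0):
    """Ratio F(s1) / F(s2) = exp((E_s2 - E_s1) / kT)."""
    return math.exp((e2 - e1) / kT)

probs = boltzmann_probabilities([0.0, 1.0, 2.0])
ratio = probs[0] / probs[1]  # equals boltzmann_factor(0.0, 1.0)
```

Lower-energy states receive higher probability, and the ratio between any two of them depends only on their energy difference.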

4.3.2 Boltzmann Machine


As we mentioned previously, Boltzmann Machine is a stochastic version of Hopfield Network
with hidden units. Figure 10 illustrates how the introduction of hidden units
turns a Hopfield Network into a Boltzmann Machine. In a Boltzmann Machine, only visible
units are connected with data, and hidden units are used to assist visible units in describing
the distribution of data. Therefore, the model conceptually splits into the visible part and
hidden part, while it still maintains a fully connected network among these units.
“Stochastic” is introduced so that Boltzmann Machine improves on Hopfield Network
with regard to leaping out of local optima or oscillations of states. Inspired by physics,
a method to transfer state regardless of the current energy is introduced: set a state to State 1
(which means the state is on) regardless of the current state with the following probability:

$$ p = \frac{1}{1 + e^{-\frac{\Delta E}{T}}} $$

where ∆E stands for the difference of energies when the state is off and on, i.e. ∆E =
$E_{s=0} − E_{s=1}$, so that the lower-energy state is the more probable one, in agreement with
the Boltzmann Distribution. T stands for the temperature. The idea of T is inspired by a physics
process in which the higher the temperature is, the more likely the state will transfer⁶. In
addition, the probability of a higher energy state transferring to a lower energy state will
always be greater than that of the reverse process⁷. This idea is highly related to a very
popular optimization
6. Molecules move faster when more kinetic energy is provided, which could be achieved by heating.
7. This corresponds to the Zeroth Law of Thermodynamics.


algorithm called Simulated Annealing (Khachaturyan et al., 1979; Aarts and Korst, 1988)
back then, but Simulated Annealing is hardly relevant to today’s deep learning community.
Regardless of the historical importance that the term T introduces, within this section, we
will assume T = 1 as a constant, for the sake of simplification.
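The effect of the temperature on the stochastic update can be seen in a few lines. The sketch takes ∆E as the energy with the unit off minus the energy with the unit on, which is the sign convention under which the formula follows from the Boltzmann Distribution; this convention and the function names are assumptions of the example.

```python
import math
import random

def p_on(delta_e, temperature=1.0):
    """Probability of switching a unit on, with delta_e = E(off) - E(on)."""
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

def sample_unit(delta_e, temperature=1.0, rng=random):
    """Draw the new 0/1 state of the unit, regardless of its current state."""
    return 1 if rng.random() < p_on(delta_e, temperature) else 0

cold = p_on(2.0, temperature=0.1)   # almost deterministic: the lower-energy state wins
warm = p_on(2.0, temperature=1.0)
hot = p_on(2.0, temperature=10.0)   # close to a coin flip
```

At low temperature the rule approaches the deterministic Hopfield update, while at high temperature the unit flips almost at random, which is what lets the network leave local optima.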

4.3.3 Energy of Boltzmann Machine


The energy function of Boltzmann Machine is defined in the same way as Equation 5 is defined
for Hopfield Network, except that now visible units and hidden units are denoted separately,
as follows:

$$ E(v, h) = -\sum_i v_i b_i - \sum_k h_k b_k - \sum_{i,j} v_i v_j w_{i,j} - \sum_{i,k} v_i h_k w_{i,k} - \sum_{k,l} h_k h_l w_{k,l} $$

where v stands for visible units and h stands for hidden units. This equation also connects
back to Equation 4, except that Boltzmann Machine splits the energy function according
to hidden units and visible units.
Based on this energy function, the probability of a joint configuration over both the visible
units and the hidden units can be defined as follows:

$$ p(v, h) = \frac{e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}} $$

The probability of visible/hidden units can be obtained by marginalizing this joint
probability.
For example, by marginalizing out hidden units, we can get the probability distribution
of visible units:

$$ p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}} $$

which could be used to sample visible units, i.e. generating data.
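For a model small enough to enumerate, the joint and marginal probabilities can be computed exactly. The sketch below uses two visible units and one hidden unit with made-up parameters; the dictionary layout and function names are assumptions of the example, and the enumeration grows exponentially with the number of units.

```python
import math
from itertools import product

def energy(v, h, params):
    """E(v, h): bias terms plus visible-visible, visible-hidden and hidden-hidden terms."""
    bv, bh = params["bv"], params["bh"]
    w_vv, w_vh, w_hh = params["w_vv"], params["w_vh"], params["w_hh"]
    nv, nh = len(v), len(h)
    e = -sum(v[i] * bv[i] for i in range(nv))
    e -= sum(h[k] * bh[k] for k in range(nh))
    e -= sum(v[i] * v[j] * w_vv[i][j] for i in range(nv) for j in range(nv))
    e -= sum(v[i] * h[k] * w_vh[i][k] for i in range(nv) for k in range(nh))
    e -= sum(h[k] * h[l] * w_hh[k][l] for k in range(nh) for l in range(nh))
    return e

def joint_distribution(params, nv, nh):
    """p(v, h) for every configuration, normalized by the sum over all states."""
    states = [(v, h) for v in product((0, 1), repeat=nv) for h in product((0, 1), repeat=nh)]
    unnormalized = {s: math.exp(-energy(s[0], s[1], params)) for s in states}
    z = sum(unnormalized.values())
    return {s: u / z for s, u in unnormalized.items()}

def marginal_visible(joint):
    """p(v): sum the joint probability over all hidden configurations."""
    p_v = {}
    for (v, _h), p in joint.items():
        p_v[v] = p_v.get(v, 0.0) + p
    return p_v

params = {
    "bv": [0.5, -0.5], "bh": [0.0],
    "w_vv": [[0.0, 1.0], [1.0, 0.0]],
    "w_vh": [[2.0], [-1.0]],
    "w_hh": [[0.0]],
}
joint = joint_distribution(params, nv=2, nh=1)
p_v = marginal_visible(joint)
```

Sampling visible configurations in proportion to `p_v` is what "generating data" amounts to for such a toy model.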


When Boltzmann Machine is trained to its stable state, which is called thermal
equilibrium, the distribution of these probabilities p(v, h) will remain constant because the
distribution of energy will be constant. However, the probability for each visible unit or
hidden unit may vary and the energy may not be at its minimum. This is related to how
thermal equilibrium is defined, where the only constant factor is the distribution of each
part of the system.
Thermal equilibrium can be a hard concept to understand. One can imagine
pouring a cup of hot water into a bottle and then pouring a cup of cold water onto the
hot water. At the start, the bottle feels hot at the bottom and cold at the top, and gradually
the bottle feels mild as the cold water and hot water mix and heat is transferred. However,
the fact that the temperature of the bottle settles at a mild level (corresponding to the
distribution of p(v, h)) does not necessarily mean that the molecules cease to move
(corresponding to each p(v, h)).

4.3.4 Parameter Learning


The common way to train the Boltzmann machine is to determine the parameters that
maximize the likelihood of the observed data. Gradient ascent on the log of the likelihood


function is usually performed to determine the parameters. For simplicity, the following
derivation is based on a single observation.
First, we have the log likelihood function of visible units as
X X
l(v; w) = log p(v; w) = log e−Ev,h − log e−Em,n
h m,n

where the second term on RHS is the normalizer.


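Taking the derivative of the log likelihood function with respect to w and simplifying, we have:
$$\frac{\partial l(v; w)}{\partial w} = -\sum_{h} p(h|v) \frac{\partial E(v,h)}{\partial w} + \sum_{m,n} p(m,n) \frac{\partial E(m,n)}{\partial w}$$
$$= -\mathbb{E}_{p(h|v)}\left[\frac{\partial E(v,h)}{\partial w}\right] + \mathbb{E}_{p(m,n)}\left[\frac{\partial E(m,n)}{\partial w}\right]$$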
where E with a distribution as subscript denotes expectation. Thus the gradient of the
likelihood function is composed of two parts. The first part is the expected gradient of the
energy function with respect to the conditional distribution p(h|v). The second part is the
expected gradient of the energy function with respect to the joint distribution over all
variable states. However, calculating these expectations is generally infeasible for any
realistically-sized model, as it involves summing over a huge number of possible
states/configurations. The general approach for solving this problem is to use Markov Chain
Monte Carlo (MCMC) to approximate these sums:
$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p(h_{data}|v_{data})} + \langle s_i, s_j \rangle_{p(h_{model}|v_{model})} \qquad (6)$$
where the angle brackets < · > denote expectation.
Equation 6 is the difference between the expected product of states while the data is fed
into the visible units and the expected product of states while no data is fed.
The first term is calculated by taking the average value of the energy function gradient when
the visible and hidden units are being driven by observed data samples. In practice, this
first term is generally straightforward to calculate. Calculating the second term is generally
more complicated and involves running a set of Markov chains until they reach the current
model's equilibrium distribution, then taking the average energy function gradient based on
those samples.
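The expectation form of the log-likelihood gradient can be checked numerically on a model small enough to enumerate. The sketch below assumes a toy energy E(v, h) = −Σ v_i w_ik h_k with no biases and arbitrary weights (all of these are illustrative choices); it evaluates both expectations exactly and compares the result with a finite-difference estimate of the log likelihood. No sampling is involved, which is exactly what stops being possible for larger models.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                                 # hypothetical toy sizes
W = rng.normal(scale=0.5, size=(n_v, n_h))      # arbitrary toy weights

def energy(v, h, W):
    return -(v @ W @ h)                         # assumed energy, no biases

V = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_v)]
H = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_h)]

def log_likelihood(v, W):
    numerator = sum(np.exp(-energy(v, h, W)) for h in H)
    Z = sum(np.exp(-energy(m, n, W)) for m in V for n in H)
    return np.log(numerator) - np.log(Z)

def exact_gradient(v, W):
    # For this energy, dE/dW is -outer(v, h).
    w_h = np.array([np.exp(-energy(v, h, W)) for h in H])
    p_h_given_v = w_h / w_h.sum()
    # First term: minus the expectation of dE/dW under p(h | v).
    term1 = -sum(p * (-np.outer(v, h)) for p, h in zip(p_h_given_v, H))
    pairs = [(m, n) for m in V for n in H]
    w_mn = np.array([np.exp(-energy(m, n, W)) for m, n in pairs])
    p_mn = w_mn / w_mn.sum()
    # Second term: plus the expectation of dE/dW under the joint p(m, n).
    term2 = sum(p * (-np.outer(m, n)) for p, (m, n) in zip(p_mn, pairs))
    return term1 + term2

v_obs = np.array([1.0, 0.0, 1.0])
grad = exact_gradient(v_obs, W)

# Central finite differences on the log likelihood as an independent check.
eps = 1e-6
fd = np.zeros_like(W)
for i in range(n_v):
    for k in range(n_h):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, k] += eps
        W_minus[i, k] -= eps
        fd[i, k] = (log_likelihood(v_obs, W_plus) - log_likelihood(v_obs, W_minus)) / (2 * eps)

max_err = np.abs(grad - fd).max()
```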
However, this sampling procedure can be computationally very expensive, which
motivates the topic of the next section, the Restricted Boltzmann Machine.

4.4 Restricted Boltzmann Machine


Restricted Boltzmann Machine (RBM), originally known as Harmonium when invented by
Smolensky (1986), is a version of the Boltzmann Machine with the restriction that there are
no connections either between visible units or between hidden units.
Figure 11 illustrates how a Restricted Boltzmann Machine is obtained from a Boltzmann
Machine (Figure 10): the connections between hidden units, as well as the connections
between visible units, are removed, and the model becomes a bipartite graph.
With this restriction introduced, the energy function of the RBM is much simpler:
$$E(v, h) = -\sum_{i} v_i b_i - \sum_{k} h_k b_k - \sum_{i,k} v_i h_k w_{ik} \qquad (7)$$
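Equation 7 translates directly into a few vector products. The following is a minimal sketch; the function name `rbm_energy` and the example values are hypothetical and chosen only so that the result is easy to check by hand.

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Energy of Equation 7: visible bias term, hidden bias term, pairwise term."""
    return -(v @ b_v) - (h @ b_h) - (v @ W @ h)

v = np.array([1.0, 0.0, 1.0])                  # visible states
h = np.array([1.0, 1.0])                       # hidden states
W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                     # w_ik, visible i by hidden k
b_v = np.array([0.5, 0.5, 0.5])                # visible biases
b_h = np.array([-1.0, 1.0])                    # hidden biases

E = rbm_energy(v, h, W, b_v, b_h)
# By hand: bias terms give 1.0 and 0.0, the pairwise term gives 14.0, so E = -15.0
```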

Figure 11: Illustration of Restricted Boltzmann Machine. With the restriction that there are
no connections between hidden units (shaded nodes) and no connections between
visible units (unshaded nodes), the Boltzmann Machine turns into a Restricted
Boltzmann Machine. The model is now a bipartite graph.

4.4.1 Contrastive Divergence


An RBM can still be trained in the same way as a Boltzmann Machine. Since the
energy function of the RBM is much simpler, the sampling method used to infer the second
term in Equation 6 becomes easier. Despite this relative simplicity, the learning procedure
still requires a large number of sampling steps to approximate the model distribution.
To emphasize the difficulties of such a sampling mechanism, as well as to simplify the
following discussion, we rewrite Equation 6 with a different notation, as follows:

$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_\infty} \qquad (8)$$
Here we use p0 to denote the data distribution and p∞ to denote the model distribution. The
other notation remains unchanged. The difficulty of the methods mentioned above is therefore
that learning the parameters requires potentially “infinitely” many sampling steps to
approximate the model distribution.
Hinton (2002) overcame this issue with the introduction of a method named
Contrastive Divergence. Empirically, he found that one does not have to perform “infinitely”
many sampling steps to converge to the model distribution; a finite number k of sampling
steps is enough. Therefore, Equation 8 is effectively rewritten as:

$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_k}$$
Remarkably, Hinton (2002) showed that k = 1 is sufficient for the learning algorithm to
work well in practice.
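A minimal sketch of one Contrastive Divergence step with k = 1 for a binary RBM is given below. It is an illustration under stated assumptions rather than Hinton's exact procedure: the sigmoid conditionals are the standard ones for binary units and are not spelled out in the text, and the function name `cd1_update`, the toy data, and the learning rate are hypothetical. The update moves the weights toward the pairwise statistics measured with the data clamped (p0) and away from those measured after one sampling step (p1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr, rng):
    """One CD-1 step on a batch v0 of shape (batch, n_visible)."""
    # Statistics under p0: hidden units driven by the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One sampling step (k = 1): reconstruct the visibles, then the hiddens.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # Difference of pairwise statistics under p0 and p1, averaged over the batch.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b_v = b_v + lr * (v0 - v1).mean(axis=0)
    b_h = b_h + lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h

rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)  # toy patterns
W = 0.01 * rng.normal(size=(4, 2))
b_v = np.zeros(4)
b_h = np.zeros(2)
for _ in range(200):
    W, b_v, b_h = cd1_update(data, W, b_v, b_h, lr=0.1, rng=rng)
```

Larger k would simply repeat the reconstruction step inside the function before the statistics under pk are collected.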
Carreira-Perpinan and Hinton (2005) attempted to justify Contrastive Divergence in
theory, but their derivation led to the negative conclusion that Contrastive Divergence is a
biased algorithm and that a finite k cannot represent the model distribution. However, their
empirical results suggested that a finite k can approximate the model distribution well
enough, resulting in a small enough bias. In addition, the algorithm works well in practice,
which strengthened the case for Contrastive Divergence.
With reasonable modeling power and a fast approximation algorithm, the RBM quickly
drew great attention and became one of the most fundamental building blocks of deep
neural networks. In the following two sections, we introduce two distinguished deep
neural networks that are built on the RBM/Boltzmann Machine, namely Deep Belief
Nets and the Deep Boltzmann Machine.

Figure 12: Illustration of Deep Belief Networks. A Deep Belief Network is not just a stack
of RBMs. The bottom layers (all layers except the top one) do not have
bi-directional connections, but only top-down connections.

4.5 Deep Belief Nets


Deep Belief Networks were introduced by Hinton et al. (2006)^8, who showed that RBMs
can be stacked and trained in a greedy manner.
Figure 12 shows the structure of a three-layer Deep Belief Network. Unlike a plain stack
of RBMs, a DBN only allows bi-directional connections (RBM-type connections) in the
top layer, while the layers below have only top-down connections. Probably
a better way to understand a DBN is to think of it as a multi-layer generative model. Although
a DBN is generally described as a stacked RBM, it is quite different from putting
one RBM on top of another. It is probably more appropriate to think of a DBN as a
one-layer RBM with extended layers specially devoted to generating patterns of data.
Therefore, the model only needs to sample for the thermal equilibrium at the topmost
layer and then pass the visible states top down to generate the data.

8. This paper is generally seen as the opening of today's Deep Learning era, as it first introduced the
possibility of training a deep neural network by layer-wise training.

4.5.1 Parameter Learning


Parameter learning of a Deep Belief Network falls into two steps: the first step is layer-wise
pre-training and the second step is fine-tuning.
Layerwise Pre-training The success of Deep Belief Networks is largely due to the in-
troduction of layer-wise pre-training. The idea is simple, but the reason why it works
still attracts researchers. Pre-training simply trains the network component
by component, bottom up: the first two layers are treated as an RBM and trained, then
the second and third layers are treated as another RBM and trained for their parameters.
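The bottom-up procedure described above can be sketched as follows. This is an illustrative simplification, not the training recipe of Hinton et al. (2006): each RBM is trained with a bare-bones one-step Contrastive Divergence loop that uses mean-field reconstructions, and the names `train_rbm` and `pretrain_stack`, the layer sizes, and the random toy data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with a simplified CD-1 loop (assumed, for illustration)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(h0 @ W.T + b_v)          # mean-field reconstruction
        ph1 = sigmoid(v1 @ W + b_h)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        b_v += lr * (data - v1).mean(axis=0)
        b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h

def pretrain_stack(data, layer_sizes, epochs=50, lr=0.1, seed=0):
    """Greedy layer-wise pre-training: one RBM per pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    params = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b_v, b_h = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        params.append((W, b_v, b_h))
        # The hidden activations of this RBM act as the data for the next one.
        layer_input = sigmoid(layer_input @ W + b_h)
    return params

data = (np.random.default_rng(1).random((32, 6)) < 0.5).astype(float)
params = pretrain_stack(data, layer_sizes=[4, 3])
```

The returned parameters would then serve as the initialization for the fine-tuning step.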
This idea turns out to offer critical support for the success of the later fine-
tuning process. Several explanations of the mechanism of pre-training have been
proposed:

• Intuitively, pre-training is a clever way of initialization. It puts the parameter values
in the appropriate range for further fine-tuning.

• Bengio et al. (2007) suggested that unsupervised pre-training initializes the model to
a point in parameter space which leads to a more effective optimization process, that
the optimization can find a lower minimum of the empirical cost function.

• Erhan et al. (2010) empirically argued for a regularization explanation, that unsu-
pervised pretraining guides the learning towards basins of attraction of minima that
support better generalization from the training data set.

In addition to Deep Belief Networks, this pre-training mechanism also inspired pre-
training for many other classical models, including autoencoders (Poultney et al., 2006;
Bengio et al., 2007), Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) and some
models inspired by these classical models, such as that of Yu et al. (2010).
After pre-training is performed, fine-tuning is carried out to further optimize the net-
work, searching for parameters that lead to a lower minimum. For Deep Belief Networks,
there are two different fine-tuning strategies, depending on the goal of the network.
Fine Tuning for Generative Model Fine-tuning for a generative model is achieved
with a contrastive version of the wake-sleep algorithm (Hinton et al., 1995). This algorithm
is intriguing because it was designed to interpret how the brain works. Scientists have
found that sleeping is a critical process of brain function, and it seems to be an inverse
version of how we learn when we are awake. The wake-sleep algorithm also has two steps. In
the wake phase, we propagate information bottom up to adjust the top-down weights for
reconstructing the layer below. The sleep phase is the inverse of the wake phase: we
propagate information top down to adjust the bottom-up weights for reconstructing the layer
above.
The contrastive version of the wake-sleep algorithm adds one Contrastive
Divergence phase between the wake phase and the sleep phase. The wake phase only goes up to
the visible layer of the top RBM, then we sample the top RBM with Contrastive Divergence,
and then a sleep phase starts from the visible layer of the top RBM.
Fine Tuning for Discriminative Model The strategy for fine-tuning a DBN as a
discriminative model is simply to apply standard backpropagation to the pre-trained model,
since we have labels for the data. However, pre-training is still necessary in spite of the
generally good performance of backpropagation.

Figure 13: Illustration of Deep Boltzmann Machine. Deep Boltzmann Machine is more like
stacking RBMs together. Connections between every two layers are bidirectional.

4.6 Deep Boltzmann Machine


The last milestone we introduce in the family of deep generative models is the Deep Boltzmann
Machine introduced by Salakhutdinov and Hinton (2009).
Figure 13 shows a three-layer Deep Boltzmann Machine (DBM). The distinction be-
tween the DBM and the DBN of the previous section is that the DBM allows bidirectional
connections in the bottom layers. Therefore, the DBM represents the idea of stacking RBMs
much better than the DBN does, although it might be clearer if the DBM were named Deep
Restricted Boltzmann Machine.
Due to the nature of the DBM, its energy function is defined as an extension of the energy
function of an RBM (Equation 7), as shown in the following:

$$E(v, h) = -\sum_{i} v_i b_i - \sum_{n=1}^{N} \sum_{k} h_{n,k} b_{n,k} - \sum_{i,k} v_i w_{ik} h_k - \sum_{n=1}^{N-1} \sum_{k,l} h_{n,k} w_{n,k,l} h_{n+1,l}$$

for a DBM with N hidden layers.
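The four terms of this energy function can be written out as a short sketch. It assumes that the h_k in the visible-to-hidden term refers to the first hidden layer; the function name `dbm_energy` and the example values are hypothetical and chosen so that the result can be checked by hand.

```python
import numpy as np

def dbm_energy(v, hs, b_v, b_hs, W_vh, W_hh):
    """Energy of a DBM with N = len(hs) hidden layers.

    hs   : list of N hidden state vectors
    b_hs : list of N hidden bias vectors
    W_vh : weights between the visible layer and the first hidden layer
    W_hh : list of N - 1 weight matrices between consecutive hidden layers
    """
    e = -(v @ b_v)                                        # visible bias term
    e -= sum(h @ b for h, b in zip(hs, b_hs))             # hidden bias terms
    e -= v @ W_vh @ hs[0]                                 # visible-hidden term
    e -= sum(hs[n] @ W_hh[n] @ hs[n + 1] for n in range(len(hs) - 1))
    return e

v = np.array([1.0, 0.0])
hs = [np.array([1.0, 1.0]), np.array([1.0])]
b_v = np.array([1.0, 1.0])
b_hs = [np.array([0.5, 0.5]), np.array([2.0])]
W_vh = np.array([[1.0, 2.0], [3.0, 4.0]])
W_hh = [np.array([[1.0], [1.0]])]

E = dbm_energy(v, hs, b_v, b_hs, W_vh, W_hh)
# By hand: 1 (visible bias) + 3 (hidden biases) + 3 (v-h) + 2 (h-h), so E = -9.0
```

With a single hidden layer the last sum is empty and the expression reduces to the RBM energy of Equation 7.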


This similarity of the energy functions makes it possible to train a DBM with contrastive
divergence. However, pre-training is typically necessary.

4.6.1 Deep Boltzmann Machine (DBM) vs. Deep Belief Networks (DBN)
As their acronyms suggest, Deep Boltzmann Machines and Deep Belief Networks have many
similarities, especially at first glance. Both of them are deep neural networks that origi-
nate from the idea of the Restricted Boltzmann Machine. (The name “Deep Belief Network”
seems to indicate that it also partially originates from the Bayesian Network (Krieg, 2001).)
Both of them also rely on layer-wise pre-training for successful parameter learning.
However, the fundamental differences between these two models are dramatic, and they are
introduced by how the connections are made between the bottom layers (undirected/bi-directed
vs. directed). The bidirectional structure of the DBM allows it to learn
more complex patterns in the data. It also allows the approximate inference
procedure to incorporate top-down feedback in addition to an initial bottom-up pass, allow-
ing Deep Boltzmann Machines to better propagate uncertainty about ambiguous inputs.

4.7 Deep Generative Models: Now and the Future


Deep Boltzmann Machine is the last milestone we discuss in the history of generative models,
but there has been much work after the DBM, and there is even more to be done in the future.
Lake et al. (2015) introduced a Bayesian Program Learning framework that can simulate
human learning abilities with large-scale visual concepts. In addition to its performance on
a one-shot learning classification task, their model passes a visual Turing Test in terms of
generating handwritten characters from the world's alphabets. In other words, the generative
performance of their model is indistinguishable from human behavior. Although it is not a
deep neural model itself, their model outperforms several concurrent deep neural networks. A
deep neural counterpart of the Bayesian Program Learning framework with even better
performance can surely be expected.
Conditional image generation (given part of the image) is another interesting recent
topic. The problem is usually solved by Pixel Networks (Pixel CNN (van den Oord et al.,
2016) and Pixel RNN (Oord et al., 2016)). However, being given a part of the image seems to
simplify the generation task.
Another contribution to generative models is the Generative Adversarial Network (Good-
fellow et al., 2014); however, GAN is still too young to be discussed in this paper.

5. Convolutional Neural Networks and Vision Problems


In this section, we start to discuss a different family of models: the Convolutional Neu-
ral Network (CNN) family. Distinct from the family in the previous section, the Convolutional
Neural Network family mainly evolved from knowledge of the human visual cortex. There-
fore, in this section, we first introduce one of the most important reasons that account
for the success of convolutional neural networks in vision problems: their bionic design,
which replicates the human vision system. Today's convolutional neural networks probably
originate more from such a design than from their early-stage ancestors. With this
background set up, we then briefly introduce the successful models that made them-
selves famous through the ImageNet Challenge (Deng et al., 2009). Finally, we present
some known problems of vision tasks that may guide future research directions.

5.1 Visual Cortex


Convolutional Neural Networks are widely known to be inspired by the visual cortex; however,
although some publications discuss this inspiration briefly (Poggio and Serre, 2013; Cox
and Dean, 2014), few resources present it thoroughly. In this section, we focus
on the basics of the visual cortex (Hubel and Wiesel, 1959), which lay the
ground for further study of Convolutional Neural Networks.
The visual cortex of the brain, located in the occipital lobe at the back
of the skull, is a part of the cerebral cortex that plays an important role in processing
visual information. Visual information coming from the eye goes through a series of brain
structures and reaches the visual cortex. The part of the visual cortex that receives the
sensory inputs is known as the primary visual cortex, or area V1. Visual
information is further processed by extrastriate areas, including visual areas two (V2) and
four (V4). There are also other visual areas (V3, V5, and V6), but in this paper we
primarily focus on the visual areas that are related to object recognition. These form the
ventral stream, which consists of areas V1, V2, V4 and the inferior temporal gyrus. The
inferior temporal gyrus is one of the higher levels of the ventral stream of visual
processing and is associated with the representation of complex object features, such as
global shape, as in face perception (Haxby et al., 2000).
Figure 14 is an illustration of the ventral stream of the visual cortex. It shows how
information is processed from the retina, which receives the image information,
all the way to the inferior temporal gyrus. For each component:

• Retina converts the light energy that comes from the rays bouncing off an object
into chemical energy. This chemical energy is then converted into action potentials
that are transferred to the primary visual cortex. (In fact, there are several other
brain structures involved between the retina and V1, but we omit these structures for
simplicity.^9)

9. We deliberately discuss the components that have connections with established technologies in convo-
lutional neural networks. Readers interested in developing more powerful models are encouraged to
investigate other components.

Figure 14: A brief illustration of ventral stream of the visual cortex in human vision system.
It consists of primary visual cortex (V1), visual areas (V2 and V4) and inferior
temporal gyrus.

• Primary visual cortex (V1) mainly fulfills the task of edge detection, where an edge
is an area with the strongest local contrast in the visual signals.

• V2, also known as the secondary visual cortex, is the first region within the visual as-
sociation area. It receives strong feedforward connections from V1 and sends strong
connections to later areas. In V2, cells are tuned to extract mainly simple properties
of the visual signals such as orientation, spatial frequency, and color, and a few more
complex properties.

• V4 detects object features of intermediate complexity, like simple geometric shapes,
in addition to orientation, spatial frequency, and color. V4 also shows strong
attentional modulation (Moran and Desimone, 1985), and it receives direct input
from V1.

• Inferior temporal gyrus (IT) is responsible for identifying an object based on its color
and form, comparing that processed information to stored memories
of objects (Kolb et al., 2014). In other words, IT performs
semantic-level tasks, like face recognition.

Many of these descriptions of the functions of the visual cortex should remind readers who
have been exposed to the relevant technical literature of convolutional neural networks.
Later in this section, we discuss convolutional neural networks in more detail, which
will help build explicit connections. Even for readers who have barely any knowledge of
convolutional neural networks, this hierarchical structure of the visual cortex
should immediately ring a bell about neural networks.
Besides convolutional neural networks, the visual cortex has been inspiring work in
computer vision for a long time. For example, Li (1998) built a neural model inspired
by the primary visual cortex (V1). At another granularity, Serre et al. (2005) introduced
a system with feature detection inspired by the visual cortex. De Ladurantaye et al.
(2012) published a book describing models of information processing in the visual cortex.
Poggio and Serre (2013) conducted a more comprehensive survey on the relevant topic, but
they did not focus on any particular subject in detail. In this section, we
discuss the connections between the visual cortex and convolutional neural networks in detail.
We begin with the Neocognitron, which borrows some ideas from the visual cortex and later
inspired the convolutional neural network.

5.2 Neocognitron and Visual Cortex

The Neocognitron, proposed by Fukushima (1980), is generally seen as the model that inspired
Convolutional Neural Networks on the computation side. It is a neural network that con-
sists of two different kinds of layers: the S-layer as a feature extractor and the C-layer as
structured connections that organize the extracted features.
The S-layer consists of a number of S-cells that are inspired by the cells in the primary
visual cortex. It serves as a feature extractor. Each S-cell can ideally be trained to be
responsive to a particular feature presented in its receptive field. Generally, local features
such as edges in particular orientations are extracted in lower layers, while global features
are extracted in higher layers. This structure highly resembles how humans perceive objects.
The C-layer resembles the complex cells in the higher pathway of the visual cortex. It is
mainly introduced to make the features extracted by the S-layer shift invariant.

5.2.1 Parameter Learning

During the parameter learning process, only the parameters of the S-layer are updated. The
Neocognitron can also be trained in an unsupervised manner to obtain a good feature extractor
from the S-layers. The training process for the S-layer is very similar to the Hebbian
Learning rule: it strengthens the connections between the S-layer and the C-layer for
whichever S-cell shows the strongest response.
This training mechanism also inherits the problem of the Hebbian Learning rule, namely
that the strength of the connections will saturate (since it keeps increasing). A solution was
also introduced by Fukushima (1980) under the name “inhibitory
cell”. It performs a normalization that avoids the problem.

5.3 Convolutional Neural Network and Visual Cortex

Now we proceed from the Neocognitron to the Convolutional Neural Network. First, we in-
troduce the building components: the convolutional layer and the subsampling layer. Then we
assemble these components to present the Convolutional Neural Network, using LeNet as an
example.

Figure 15: A simple illustration of two dimension convolution operation.

5.3.1 Convolution Operation


The convolution operation is strictly just a mathematical operation, which should be treated
like other operations such as addition or multiplication and would not normally need
particular discussion in the machine learning literature. However, we still discuss it here
for completeness and for readers who may not be familiar with it.
Convolution is a mathematical operation on two functions (e.g. f and g) that produces
a third function h, an integral that expresses the amount of overlap of one function
(f) as it is shifted over the other function (g). It is described formally as follows:
$$h(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

and denoted as h = f ∗ g.
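For sampled signals the integral becomes a sum over τ. The sketch below is a discrete analogue written for illustration (the function name `conv1d` and the example sequences are hypothetical); its output is compared against NumPy's `np.convolve`, which computes the same full discrete convolution.

```python
import numpy as np

def conv1d(f, g):
    """Discrete analogue of h(t) = sum over tau of f(tau) * g(t - tau)."""
    h = np.zeros(len(f) + len(g) - 1)
    for t in range(len(h)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):      # g is treated as zero outside its support
                h[t] += f[tau] * g[t - tau]
    return h

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
h = conv1d(f, g)
print(h)                      # [0.  1.  2.5 4.  1.5]
print(np.convolve(f, g))      # same values from the library routine
```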
Convolutional neural networks typically work with the two-dimensional convolution opera-
tion, which is summarized in Figure 15.
As shown in Figure 15, the leftmost matrix is the input matrix. The middle one is
usually called the kernel matrix. Convolution is applied to these matrices, and the result
is shown as the rightmost matrix. The convolution process is an element-wise product
followed by a sum, as shown in the example. When the upper-left 3×3 matrix is convolved
with the kernel, the result is 29. Then we slide the target 3 × 3 window one column to the
right, convolve it with the kernel, and get the result 12. We keep sliding and record the
results as a matrix. Because the kernel is 3 × 3, every target window is 3 × 3; thus every
3 × 3 window is convolved into one number, and the whole 5 × 5 matrix shrinks into a 3 × 3
matrix (because 5 − (3 − 1) = 3, where the 3 inside the parentheses is the size of the
kernel matrix).
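The sliding-window procedure can be sketched in a few lines. The input and kernel values below are hypothetical (they are not the numbers of Figure 15), and the function name `conv2d_valid` is illustrative. As in the description above, the kernel is applied as an element-wise product followed by a sum, without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding; one output per window."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - (kh - 1)      # e.g. 5 - (3 - 1) = 3
    out_w = image.shape[1] - (kw - 1)
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)   # element-wise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5 x 5 input
kernel = np.ones((3, 3))                          # hypothetical 3 x 3 kernel
result = conv2d_valid(image, kernel)
print(result.shape)   # (3, 3)
print(result[0, 0])   # 54.0, the sum of the upper-left 3 x 3 window
```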
One should realize that convolution is locally shift invariant, which means that many
different arrangements of the nine numbers in the upper-left 3 × 3 matrix yield the same
convolved result of 29. This invariance plays a critical role in vision problems
because, in an ideal case, the recognition result should not change due to a shift or
rotation of features. This problem used to be addressed elegantly by Lowe (1999) and
Bay et al. (2006), but convolutional neural networks brought the performance up to a new
level.

5.3.2 Connection between CNN and Visual Cortex


With the idea of two-dimensional convolution in place, we further discuss how convolution is
a useful operation that can simulate the tasks performed by the visual cortex.

Figure 16: Convolutional kernels example, with panels (a) identity kernel, (b) edge detection
kernel, (c) blur kernel, (d) sharpen kernel, (e) lighten kernel, (f) darken kernel,
(g) random kernel 1 and (h) random kernel 2. Different kernels applied to the same
image will result in differently processed images. Note that there is a 1/9 divisor
applied to these kernels.

The matrix used in a convolution operation is usually known as a kernel. By choosing
different kernels, different operations on images can be achieved. Typical operations include
identity, edge detection, blur, sharpening etc. By introducing random matrices as convolu-
tion operators, some interesting properties might be discovered.
Figure 16 is an illustration of some example kernels that are applied to the same image.
One can see that different kernels can be applied to fulfill different tasks. Random kernels
can also be applied to transform the image into some interesting outcomes.
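The effect of different kernels can be reproduced on a toy image. The kernel values below are common textbook choices and are assumptions, since the actual values behind Figure 16 are not given in the text; the helper `apply_kernel` and the synthetic image are likewise hypothetical.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide the kernel over the image (no padding) and sum element-wise products."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Assumed illustrative kernels, not read off Figure 16.
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
blur = np.ones((3, 3)) / 9.0

image = np.zeros((6, 6))
image[:, 3:] = 1.0            # dark left half, bright right half: one vertical edge

same = apply_kernel(image, identity)    # reproduces the interior of the image
edges = apply_kernel(image, edge)       # non-zero only next to the edge
smooth = apply_kernel(image, blur)      # the sharp step becomes a gradual ramp
print(edges[0])                         # [ 0. -3.  3.  0.]
```

The edge kernel responds only where neighbouring pixels differ, which is the behaviour attributed to the primary visual cortex in the discussion that follows.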
Figure 16 (b) shows that edge detection, which is one of the central tasks of the primary
visual cortex, can be fulfilled by a clever choice of kernel. Furthermore, a clever selection
of kernels can lead us to a successful replication of the visual cortex. As a result,
learning a meaningful convolutional kernel (i.e. parameter learning) is one of the central
tasks in convolutional neural networks when applied to vision tasks. This also explains why
many well-trained popular models can usually perform well in other tasks with only a limited
fine-tuning process: the kernels have been well trained and can be universally applicable.
With this understanding of the essential role the convolution operation plays in vision tasks,
we proceed to investigate some major milestones along the way.

Figure 17: An illustration of LeNet, where Conv stands for convolutional layer and
Sampling stands for SubSampling Layer.

5.4 The Pioneer of Convolutional Neural Networks: LeNet


This section is devoted to a model that is widely recognized as the first convolutional neural
network: LeNet, invented by Le Cun et al. (1990) and further popularized by LeCun et al.
(1998a). It is inspired by the Neocognitron. In this section, we introduce the convolutional
neural network by introducing LeNet.
Figure 17 shows an illustration of the architecture of LeNet. It consists of two pairs of
convolutional and subsampling layers, followed by a fully connected
layer and an RBF layer for classification.

5.4.1 Convolutional Layer


A convolutional layer is primarily a layer that performs the convolution operation. As we have
discussed previously, a clever selection of convolution kernels can effectively simulate the
task of the visual cortex. The convolutional layer introduces another operation after
convolution to make the simulation more successful: the non-linearity transform.
Consider the ReLU (Nair and Hinton, 2010) non-linearity transform, which is defined
as follows:

$$f(x) = \max(0, x)$$

This transform removes the negative part of the input, resulting in a clearer
contrast between meaningful features and the other by-products the kernel produces.
Therefore, this non-linearity grants the convolution more power in extracting useful
features and allows it to simulate the functions of the visual cortex more closely.
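As a small illustration, ReLU applied element-wise to a feature map keeps the positive responses and zeroes out the negative ones. The feature map values here are hypothetical.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

feature_map = np.array([[-3.0, 0.5],
                        [2.0, -0.1]])     # hypothetical kernel responses
activated = relu(feature_map)
print(activated)
# [[0.  0.5]
#  [2.  0. ]]
```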

5.4.2 Subsampling Layer


The subsampling layer performs a simpler task. It outputs only one value for every region
it looks into. Several sampling strategies can be considered, like max-pooling
(taking the maximum value of the input), average-pooling (taking the average value of the
input) or even probabilistic pooling (taking a random one) (Lee et al., 2009).
Sampling turns the input representations into smaller and more manageable embeddings.
More importantly, sampling makes the network invariant to small transformations, distor-
tions, and translations in the input image. A small distortion in the input will not change
the outcome of pooling since we take the maximum/average value in a local neighborhood.
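Max-pooling and average-pooling over non-overlapping regions can be sketched with a reshape. The function name `pool2d` and the example input are hypothetical; the last lines illustrate the invariance claim for max-pooling by perturbing a value that is not the maximum of its region.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Pool non-overlapping size x size regions of a 2-D array."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # blocks[i, a, j, b] is x[i * size + a, j * size + b]
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [9.0, 10.0, 13.0, 14.0],
              [11.0, 12.0, 15.0, 16.0]])

max_pooled = pool2d(x, 2, "max")     # [[ 4.  8.] [12. 16.]]
avg_pooled = pool2d(x, 2, "mean")    # [[ 2.5  6.5] [10.5 14.5]]

# A small distortion of a non-maximal entry leaves the max-pooled output unchanged.
x_distorted = x.copy()
x_distorted[0, 0] = 0.0
unchanged = np.array_equal(pool2d(x_distorted, 2, "max"), max_pooled)
```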

5.4.3 LeNet
With the two most important components introduced, we can stack them together to as-
semble a convolutional neural network. Following the recipe of Figure 17, we end up
with the famous LeNet.
LeNet is known for its ability to classify digits and can handle a variety of
variations in digits, including variances in position and scale, rotation and squeezing of
digits, and even different stroke widths. Meanwhile, with the introduction of LeNet,
LeCun et al. (1998b) also introduced the MNIST database, which later became the standard
benchmark in the digit recognition field.

5.5 Milestones in ImageNet Challenge


With the success of LeNet, the Convolutional Neural Network showed great po-
tential in solving vision tasks. This potential attracted a large number of researchers
aiming to solve object recognition tasks such as CIFAR classification (Krizhevsky
and Hinton, 2009) and the ImageNet challenge (Russakovsky et al., 2015). Along this path,
several superstar milestones have attracted great attention and have been applied to other
fields with good performance. In this section, we briefly discuss these models.

5.5.1 AlexNet
While LeNet is the one that started the era of convolutional neural networks, AlexNet,
invented by Krizhevsky et al. (2012), is the one that started the era of CNNs used for
ImageNet classification. AlexNet is the first evidence that a CNN can perform well on the
historically difficult ImageNet dataset, and it performed so well that it led the community
into a competition of developing CNNs.
The success of AlexNet is due not only to its unique architecture but also
to its clever training mechanism. To cope with the computationally expensive training
process, AlexNet was split into two streams and trained on two GPUs. It also used
data augmentation techniques consisting of image translations, horizontal reflections, and
patch extractions.
The recipe of AlexNet is shown in Figure 18. However, few lessons can be
learned from the architecture of AlexNet despite its remarkable performance. Even more
unfortunately, the fact that this particular architecture does not have well-
grounded theoretical support pushed many researchers to blindly burn computing resources
to test new architectures. Many models were introduced during this period, but
only a few may be worth mentioning in the future.

Figure 18: An illustration of AlexNet

5.5.2 VGG
In the blind competition of exploring different architectures, Simonyan and Zisserman (2014)
showed with a model named VGG that simplicity is a promising direction. Although VGG
is deeper (19 layers) than other models around that time, its architecture is extremely
simple: all the convolutional layers are 3 × 3, and the pooling layers are 2 × 2. This
simple usage of convolutional layers simulates a larger filter while keeping the benefits of
smaller filter sizes, because the combination of two 3×3 convolutional layers has the
effective receptive field of a 5 × 5 convolutional layer, but with fewer parameters.
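The receptive-field and parameter claims can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions, ignores biases, and uses a hypothetical channel count of 64 for both input and output; the function names are illustrative.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weight_count(kernel_size, channels):
    """Weights of one convolutional layer with `channels` inputs and outputs (no biases)."""
    return kernel_size * kernel_size * channels * channels

channels = 64                                   # hypothetical channel count
two_3x3 = 2 * conv_weight_count(3, channels)    # 2 * 9 * 64 * 64 = 73728
one_5x5 = conv_weight_count(5, channels)        # 25 * 64 * 64 = 102400

print(stacked_receptive_field(3, 2))            # 5
print(two_3x3, one_5x5)                         # 73728 102400
```

Per pair of channels, two 3 × 3 layers use 18 weights where a single 5 × 5 layer uses 25, and the stacked version also applies a non-linearity between the two layers.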
The spatial size of the input volumes at each layer decreases as a result of the
convolutional and pooling layers, but the depth of the volumes increases because of the
increased number of filters (in VGG, the number of filters doubles after each pooling layer).
This behavior reinforces the idea of VGG: shrink the spatial dimensions, but grow the depth.
VGG was not the winner of the ImageNet competition that year (the winner was
GoogLeNet, invented by Szegedy et al. (2015)). GoogLeNet introduced several important
concepts, like the Inception module and the concept later used by R-CNN (Girshick et al.,
2014; Girshick, 2015; Ren et al., 2015), but its arbitrary/creative architecture design barely
contributed more to the community than VGG did, especially considering that Residual
Net, following the path of VGG, won the ImageNet challenge at an unprecedented level.

5.5.3 Residual Net


Residual Net (ResNet) is a 152 layer network, which was ten times deeper than what was
usually seen during the time when it was invented by He et al. (2015). Following the path
VGG introduces, ResNet explores deeper structure with simple layer. However, naively


Figure 19: An illustration of Residual Block of ResNet

increasing the number of layers only leads to worse results, in both training and
testing (He et al., 2015).
The breakthrough ResNet introduces, which allows it to be substantially deeper
than previous networks, is called the Residual Block. The idea behind a Residual Block is
that the input of a certain layer (denoted as x) can reach the component two
layers later in two ways: by following the traditional path through a succession of
convolutional layers and ReLU transforms (we denote the result as f(x)), or by taking an
express way that passes x there directly. As a result, the input to the component two
layers later is f(x) + x instead of the usual f(x). The idea of the Residual Block is
illustrated in Figure 19.
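A minimal NumPy sketch of this idea follows. It is not the ResNet implementation: as a simplifying assumption, dense matrices `W1` and `W2` stand in for the convolutional layers, and all names are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, W1, W2):
    """Return f(x) + x, where f is two linear maps with a ReLU in between."""
    f_x = W2 @ relu(W1 @ x)   # the "traditional path"
    return f_x + x            # the "express way" adds the input back


rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = np.zeros((4, 4))         # a learned path that outputs zero
out = residual_block(x, W1, W2)
print(np.allclose(out, x))
```

With `W2` set to zero the learned path contributes nothing and the block passes `x` through unchanged.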
In a complementary work, He et al. (2016) validated that residual blocks are essential
for propagating information smoothly and therefore simplify the optimization. They also
extended ResNet to a 1000-layer version with success on the CIFAR data set.
Another interesting perspective on ResNet is provided by Veit et al. (2016). They
showed that ResNet behaves like an ensemble of shallow networks: the express way
allows ResNet to perform as a collection of independent networks, each of which
is significantly shallower than the integrated ResNet itself. This also explains why the
gradient can pass through the ultra-deep architecture without vanishing. (We will talk
more about the vanishing gradient problem when we discuss recurrent neural networks in the
next section.) Another work, which is not directly related to ResNet but may help to
understand it, was conducted by Hariharan et al. (2015). They showed that features from
lower layers are informative in addition to what can be summarized from the final layer.
ResNet is still not completely free of clever designs. The number of layers in the
whole network and the number of layers that a Residual Block allows the identity to bypass are
still choices that require experimental validation. Nonetheless, to some extent, ResNet
has shown that critical reasoning can help the development of CNNs better than blind


experimental trials. In addition, a structure resembling the Residual Block has been found in
the actual visual cortex (in the ventral stream of the visual cortex, V4 can directly accept
signals from the primary visual cortex), although ResNet was not designed according to this in
the first place.
With the introduction of these state-of-the-art neural models that are successful in these
challenges, Canziani et al. (2016) conducted a comprehensive experimental study comparing
these models. Upon comparison, they showed that there is still room for improvement on
fully connected layers, which show strong inefficiencies for smaller batches of images.

5.6 Challenges and Chances for Fundamental Vision Problems


ResNet is not the end of the story. New models and techniques appear every day to push
the limit of CNNs further. For example, Zhang et al. (2016b) took a step further and
put a Residual Block inside a Residual Block. Zagoruyko and Komodakis (2016) attempted to
decrease the depth of the network by increasing the width. However, incremental works of this
kind are not in the scope of this paper.
We would like to end the story of Convolutional Neural Networks with some of the
current challenges of fundamental vision problems that may not be solvable naively
by investigating machine learning techniques.

5.6.1 Network Property and Vision Blindness Spot


Convolutional Neural Networks have reached unprecedented accuracy in object detection.
However, they may still be far from reliable industrial application because of some intriguing
properties found by Szegedy et al. (2013).
Szegedy et al. (2013) showed that they could force a deep learning model to misclassify an
image simply by adding perturbations to that image. More importantly, these perturbations
may not even be observable to the naked human eye. In other words, two objects that look
almost the same to a human may be recognized as different objects by a well-trained neural
network (for example, AlexNet). They also showed that this property is more likely to
be a modeling problem than a problem caused by insufficient training.
On the other hand, Nguyen et al. (2015) showed that they could generate patterns
that convey almost no information to humans but are recognized as objects by neural
networks with high confidence (sometimes more than 99%). Since neural networks are typically
forced to make a prediction, it is not surprising to see a network classify a meaningless
pattern as something. However, this high confidence may indicate fundamental
differences between how neural networks and humans learn to know this world.
Figure 20 shows some examples from the aforementioned two works. By construction,
we can show that a neural network may misclassify an object, which a human would easily
recognize, as something unusual. On the other hand, a neural network may
also classify some weird patterns, which humans do not believe to be objects, as
something we are familiar with. Both of these properties may restrict the use of deep
learning in real-world applications where a reliable prediction is necessary.
Even without these examples, one may also realize that the reliable prediction of neural
networks could be an issue due to the fundamental property of a matrix: the existence
of null space. As long as the perturbation happens within the null space of a matrix,
one may be able to alter an image dramatically while the neural network still makes the


Figure 20: Illustrations of some mistakes of neural networks. (a)-(d) (from Szegedy et al.
(2013)) are adversarial images generated from original images. The
differences between these and the original ones are unobservable to the naked eye,
but the neural network classifies the original ones correctly and fails on the adversarial
ones. (e)-(h) (from Nguyen et al. (2015)) are generated patterns. A
neural network classifies them as (e) school bus, (f) guitar, (g) peacock and (h)
Pekinese respectively.

misclassification with high confidence. The null space works like a blind spot of a matrix:
changes within the null space are never sensed by the corresponding matrix.
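The null-space argument can be illustrated numerically. The sketch below assumes a single hypothetical linear layer `W` with more inputs than outputs (so a non-trivial null space exists); it does not model a full network with nonlinearities.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))      # hypothetical linear layer: 5 inputs, 3 outputs
x = rng.normal(size=5)

# With the full SVD, the rows of Vt beyond the rank of W span its null space.
_, _, Vt = np.linalg.svd(W)
null_direction = Vt[-1]          # unit vector with W @ null_direction ~ 0

x_perturbed = x + 100.0 * null_direction   # a large change to the input
print(np.linalg.norm(x_perturbed - x))     # size of the perturbation
print(np.allclose(W @ x, W @ x_perturbed)) # the layer's output is unchanged
```

The perturbation is far larger than the input itself, yet the layer cannot sense it.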
This blind spot should not cloud the promising future of neural networks. On the
contrary, it makes the convolutional neural network resemble the human vision system at a
deeper level. In the human vision system, blind spots (Gregory and Cavanagh, 2011) also
exist (Wandell, 1995). Future work may link the flaws of the human
vision system to the defects of neural networks and help to overcome these defects.

5.6.2 Human Labeling Preference


Finally, we present some of the images misclassified by ResNet on the ImageNet Challenge.
Hopefully, some of these examples can inspire new methodologies
for the fundamental vision problem.
Figure 21 shows some images misclassified by ResNet when applied to the ImageNet Challenge.
The labels, provided by human annotators, are unexpected even to many other
humans. Therefore, the 3.6% error rate of ResNet (a typical human predicts with an
error rate of 5%-10%) is probably hitting the limit, since the labeling preference of an
annotator is harder to predict than the actual labels. For example, Figure 21 (a), (b), (h) are
labeled as a tiny part of the image, while the image expresses more important content.
On the other hand, Figure 21 (d), (e) are annotated as the background of the
image, while the image is obviously centered on another object.


Figure 21: Some failed images of ImageNet classification by ResNet and the primary label
associated with each image: (a) flute, (b) guinea pig, (c) wig, (d) seashore, (e) alp,
(f) screwdriver, (g) comic book, (h) sunglass.

To further improve the performance ResNet reached, one direction might be to model
the annotators’ labeling preference. One assumption could be that annotators prefer to label
an image in a way that makes it distinguishable. Some established work on modeling human
factors (Wilson et al., 2015) could be helpful.
However, the more important question is whether it is worth optimizing the model
to increase the test results on the ImageNet dataset, since the remaining misclassifications may
not result from the incompetence of the model, but from problems in the annotations.
The introduction of other data sets, like COCO (Lin et al., 2014), Flickr (Plummer et al.,
2015), and VisualGenome (Krishna et al., 2016) may open a new era of vision problems with
more competitive challenges. However, the fundamental problems and experiences that this
section introduces should never be forgotten.


6. Time Series Data and Recurrent Networks


In this section, we start to discuss a new family of deep learning models that has
attracted much attention, especially for tasks on time series data, or sequential data.
The Recurrent Neural Network (RNN) is a class of neural network whose connections between
units form a directed cycle; this nature grants it the ability to work with temporal data. It has
also been discussed in literature such as Grossberg (2013) and Lipton et al. (2015). In this
paper, we continue to offer complementary views to other surveys, with an emphasis on
the evolutionary history of the milestone models, and aim to provide insights into the future
direction of coming models.

6.1 Recurrent Neural Network: Jordan Network and Elman Network


As we have discussed previously, the Hopfield Network is widely recognized as a recurrent neural
network, although its formalization is distinctly different from how recurrent neural networks
are defined nowadays. Therefore, although other literature tends to begin the discussion
of RNNs with the Hopfield Network, we will not treat it as a member of the RNN family, to avoid
unnecessary confusion.
The modern definition of “recurrent” was initially introduced by Jordan (1986) as:
If a network has one or more cycles, that is, if it is possible to follow a
path from a unit back to itself, then the network is referred to as recurrent. A
nonrecurrent network has no cycles.
His model in (Jordan, 1986) was later referred to as the Jordan Network. For a simple neural
network with one hidden layer, with the input denoted as $X$, the weights of the hidden layer
denoted as $W_h$, the weights of the output layer denoted as $W_y$, the weights of the recurrent
computation denoted as $W_r$, the hidden representation denoted as $h$ and the output denoted
as $y$, the Jordan Network can be formulated as

$$h^t = \sigma(W_h X + W_r y^{t-1})$$
$$y = \sigma(W_y h^t)$$
A few years later, another RNN was introduced by Elman (1990), who formalized
the recurrent structure slightly differently. His network was later known as the Elman Network.
The Elman Network is formalized as follows:

$$h^t = \sigma(W_h X + W_r h^{t-1})$$
$$y = \sigma(W_y h^t)$$

The only difference is whether the information from the previous time step is provided by
the previous output or the previous hidden layer. This difference is further illustrated in Figure 22.
The difference is illustrated to respect the historical contribution of these works. One may
notice that there is no fundamental difference between these two structures since, up to the
output nonlinearity, $y^t = W_y h^t$; therefore, the only difference lies in the choice of $W_r$.
(Originally, Elman only introduced his network with $W_r = I$, but more general cases can be
derived from there.)
Nevertheless, the step from the Jordan Network to the Elman Network is still remarkable, as
it introduces the possibility of passing information from hidden layers, which significantly
improves the flexibility of structure design in later work.
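The two recurrences can be placed side by side in code. This is a sketch under stated assumptions: σ is taken to be the logistic sigmoid, the layer sizes and random weights are hypothetical, and the input is allowed to vary per time step.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def jordan_step(x, y_prev, W_h, W_r, W_y):
    h = sigmoid(W_h @ x + W_r @ y_prev)   # recurrence on the previous output
    return h, sigmoid(W_y @ h)


def elman_step(x, h_prev, W_h, W_r, W_y):
    h = sigmoid(W_h @ x + W_r @ h_prev)   # recurrence on the previous hidden state
    return h, sigmoid(W_y @ h)


rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W_h = rng.normal(size=(n_hid, n_in))
W_y = rng.normal(size=(n_out, n_hid))
W_r_jordan = rng.normal(size=(n_hid, n_out))  # maps the output back to the hidden layer
W_r_elman = rng.normal(size=(n_hid, n_hid))   # maps the hidden layer back to itself

y = np.zeros(n_out)
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):          # a short hypothetical sequence
    _, y = jordan_step(x, y, W_h, W_r_jordan, W_y)
    h, y_elman = elman_step(x, h, W_h, W_r_elman, W_y)
```

The only structural difference is the shape of the recurrent weight matrix and which vector it is applied to.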


Figure 22: The difference in recurrent structure between (a) the Jordan Network and (b) the
Elman Network.

6.1.1 Backpropagation through Time


The recurrent structure makes traditional backpropagation infeasible because, with
the recurrent structure, there is no end point where the backpropagation can stop.
Intuitively, one solution is to unfold the recurrent structure and expand it into a feedforward
neural network with a certain number of time steps, and then apply traditional backpropagation
to this unfolded neural network. This solution is known as Backpropagation through Time
(BPTT), independently invented by several researchers, including Robinson and Fallside
(1987), Werbos (1988), and Mozer (1989).
However, as a recurrent neural network usually has a more complex cost surface, naive
backpropagation may not work well. Later in this paper, we will see that the recurrent
structure introduces some critical problems, for example, the vanishing gradient problem,
which makes the optimization of RNNs a great challenge for the community.
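To make the unfolding concrete, here is a small sketch of BPTT for a scalar recurrent unit. Several choices are assumptions made for brevity rather than anything prescribed by the cited works: the activation is tanh, the loss is a squared error on the last hidden state only, and the weights `w_x` and `w_r` are scalars.

```python
import math


def forward(xs, w_x, w_r):
    """Unfold the recurrence; returns hidden states h_0..h_T with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_r * hs[-1]))
    return hs


def loss(xs, target, w_x, w_r):
    return 0.5 * (forward(xs, w_x, w_r)[-1] - target) ** 2


def bptt(xs, target, w_x, w_r):
    """Gradients of the loss with respect to the shared weights w_x and w_r."""
    hs = forward(xs, w_x, w_r)
    grad_w_x = grad_w_r = 0.0
    dh = hs[-1] - target                  # dL/dh_T
    for t in range(len(xs), 0, -1):       # walk the unfolded network backwards
        dz = dh * (1.0 - hs[t] ** 2)      # through tanh at step t
        grad_w_x += dz * xs[t - 1]        # shared weights accumulate over steps
        grad_w_r += dz * hs[t - 1]
        dh = dz * w_r                     # pass the error on to h_{t-1}
    return grad_w_x, grad_w_r


xs = [0.5, -0.3, 0.8, 0.1]                # a short hypothetical input sequence
grad_w_x, grad_w_r = bptt(xs, target=0.2, w_x=0.7, w_r=-0.4)
```

Because the same weights are used at every time step, their gradients are sums over all steps of the unfolded network.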

6.2 Bidirectional Recurrent Neural Network


If we unfold an RNN, we get the structure of a feedforward neural network with infinite
depth. Therefore, we can build a conceptual connection between an RNN and a feedforward
network with infinitely many layers. Since bidirectional neural networks (like the Hopfield
Network, RBM, and DBM) have played important roles throughout the history of neural
networks, a follow-up question is which recurrent structures correspond to the infinite-layer
version of bidirectional models. The answer is the Bidirectional Recurrent Neural Network.


Figure 23: The unfolded structure of a BRNN. The temporal order is from left to right.
Hidden layer 1 is unfolded in the standard way of an RNN. Hidden layer 2 is
unfolded to simulate the reverse connection.

The Bidirectional Recurrent Neural Network (BRNN) was invented by Schuster and Paliwal
(1997) with the goal of introducing a structure that unfolds into a bidirectional neural
network. Therefore, when it is applied to time series data, not only can information
be passed along the natural temporal sequence, but future information can also
flow backwards to provide knowledge to previous time steps.
Figure 23 shows the unfolded structure of a BRNN. Hidden layer 1 is unfolded in the
standard way of an RNN. Hidden layer 2 is unfolded to simulate the reverse connection.
Transparency (in Figure 23) is applied to emphasize that unfolding an RNN is only a concept
used for illustration purposes. The actual model handles data from different time
steps with the same single model.
BRNN is formulated as follows:

$$h_1^t = \sigma(W_{h1} X + W_{r1} h_1^{t-1})$$
$$h_2^t = \sigma(W_{h2} X + W_{r2} h_2^{t+1})$$
$$y = \sigma(W_{y1} h_1^t + W_{y2} h_2^t)$$

where the subscripts 1 and 2 denote the variables associated with hidden layers 1 and 2
respectively.
With the introduction of “recurrent” connections back from the future, Backpropagation
through Time is no longer directly feasible. The solution is to treat this model as a
combination of two RNNs, a standard one and a reverse one, and then apply BPTT to each
of them. The weights are updated simultaneously once the two gradients are computed.
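The forward pass of the formulation above can be sketched as two independent sweeps over the sequence, one in each temporal direction. The sigmoid activation, zero initial states, layer sizes and random weights are assumptions for illustration; training is not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def brnn_forward(xs, W_h1, W_r1, W_h2, W_r2, W_y1, W_y2):
    """Forward pass over a whole sequence xs of shape (T, n_in)."""
    T, n_hid = len(xs), W_r1.shape[0]
    h1 = np.zeros((T, n_hid))             # hidden layer 1: past -> future
    h2 = np.zeros((T, n_hid))             # hidden layer 2: future -> past
    state = np.zeros(n_hid)
    for t in range(T):
        state = sigmoid(W_h1 @ xs[t] + W_r1 @ state)
        h1[t] = state
    state = np.zeros(n_hid)
    for t in reversed(range(T)):
        state = sigmoid(W_h2 @ xs[t] + W_r2 @ state)
        h2[t] = state
    return sigmoid(h1 @ W_y1.T + h2 @ W_y2.T)   # one output per time step


rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 4, 3, 5, 2
shapes = [(n_hid, n_in), (n_hid, n_hid), (n_hid, n_in), (n_hid, n_hid),
          (n_out, n_hid), (n_out, n_hid)]
weights = [rng.normal(size=s) for s in shapes]
xs = rng.normal(size=(T, n_in))
ys = brnn_forward(xs, *weights)
```

Because of the reverse sweep, the output at the first time step depends on inputs that arrive later in the sequence.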

6.3 Long Short-Term Memory


Another breakthrough in the RNN family was introduced in the same year as BRNN. Hochreiter
and Schmidhuber (1997) introduced a new neuron for the RNN family, named Long Short-Term
Memory (LSTM). When it was invented, the term “LSTM” was used to refer to the algorithm


that is designed to overcome the vanishing gradient problem with the help of a specially
designed memory cell. Nowadays, “LSTM” is widely used to denote any recurrent network
with that memory cell, which is nowadays referred to as an LSTM cell.
LSTM was introduced to overcome the problem that RNNs cannot learn long-term dependencies
(Bengio et al., 1994). To overcome this issue, it requires the specially designed memory
cell, as illustrated in Figure 24 (a).
LSTM consists of several critical components.

• states: values that are used to offer the information for output.

  ∗ input data: it is denoted as $x$.

  ∗ hidden state: values of the previous hidden layer. This is the same as in a traditional
    RNN. It is denoted as $h$.

  ∗ input state: values that are a (linear) combination of the hidden state and the input of
    the current time step. It is denoted as $i$, and we have:

    $$i^t = \sigma(W_{ix} x^t + W_{ih} h^{t-1}) \tag{9}$$

  ∗ internal state: values that serve as “memory”. It is denoted as $m$.

• gates: values that are used to decide the information flow of states.

  ∗ input gate: it decides whether the input state enters the internal state. It is denoted as
    $g$, and we have:

    $$g^t = \sigma(W_{gi} i^t) \tag{10}$$

  ∗ forget gate: it decides whether the internal state forgets the previous internal state.
    It is denoted as $f$, and we have:

    $$f^t = \sigma(W_{fi} i^t) \tag{11}$$

  ∗ output gate: it decides whether the internal state passes its value to the output and
    the hidden state of the next time step. It is denoted as $o$, and we have:

    $$o^t = \sigma(W_{oi} i^t) \tag{12}$$

Finally, considering how gates decide the information flow of states, we have the last two
equations to complete the formulation of LSTM:

$$m^t = g^t \odot i^t + f^t \odot m^{t-1} \tag{13}$$

$$h^t = o^t \odot m^t \tag{14}$$

where $\odot$ denotes element-wise product.
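Equations 9 to 14 translate almost line by line into code. The sketch below follows this paper's simplified formulation (which differs from most library implementations of LSTM); the sigmoid choice for σ, the dictionary of weights `W`, the sizes, and the zero initial states are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, m_prev, W):
    """One time step following Equations 9-14; W is a dict of weight matrices."""
    i = sigmoid(W["ix"] @ x + W["ih"] @ h_prev)  # Eq. 9: input state
    g = sigmoid(W["gi"] @ i)                     # Eq. 10: input gate
    f = sigmoid(W["fi"] @ i)                     # Eq. 11: forget gate
    o = sigmoid(W["oi"] @ i)                     # Eq. 12: output gate
    m = g * i + f * m_prev                       # Eq. 13: internal state
    h = o * m                                    # Eq. 14: hidden state
    return h, m


rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {"ix": rng.normal(size=(n_hid, n_in))}
for name in ("ih", "gi", "fi", "oi"):
    W[name] = rng.normal(size=(n_hid, n_hid))

h = m = np.zeros(n_hid)
for x in rng.normal(size=(6, n_in)):             # a short hypothetical sequence
    h, m = lstm_step(x, h, m, W)
```

The internal state `m` is the only quantity carried forward additively, which is what lets information persist across many steps when the forget gate stays close to one.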


Figure 24 describes the details of how the LSTM cell works. Figure 24 (b) shows
how the input state is constructed, as described in Equation 9. Figure 24 (c) shows how


Figure 24: The LSTM cell and its detailed functions. (a) LSTM “memory” cell. (b) Input data
and previous hidden state form the input state. (c) Calculating the input gate and forget
gate. (d) Calculating the output gate. (e) Update of the internal state. (f) Output and
update of the hidden state.


the input gate and forget gate are computed, as described in Equation 10 and Equation 11.
Figure 24 (d) shows how the output gate is computed, as described in Equation 12. Figure 24
(e) shows how the internal state is updated, as described in Equation 13. Figure 24 (f) shows
how the output and hidden state are updated, as described in Equation 14.
All the weights are parameters that need to be learned during training. Therefore,
theoretically, LSTM can learn to memorize long-term dependencies when necessary and can
learn to forget the past when necessary, making it a powerful model.
With this important theoretical guarantee, many works have attempted to improve
LSTM. For example, Gers and Schmidhuber (2000) added a peephole connection that allows
the gate to use information from the internal state. Cho et al. (2014) introduced the Gated
Recurrent Unit, known as GRU, which simplified LSTM by merging the internal state and
hidden state into one state, and merging the forget gate and input gate into a single update
gate. Integrating the LSTM cell into a bidirectional RNN is also an intuitive follow-up to look
into (Graves et al., 2013).
Interestingly, despite the novel LSTM variants proposed now and then, Greff et al.
(2015) conducted a large-scale experiment investigating the performance of LSTMs and
concluded that none of the variants improves upon the standard LSTM architecture
significantly. Probably, the improvement of LSTM lies in a direction other than
updating the structure inside a cell. Attention models seem to be a direction to go.

6.4 Attention Models

Attention Models are loosely based on a bionic design to simulate the behavior of human
vision attention mechanism: when humans look at an image, we do not scan it bit by bit
or stare at the whole image, but we focus on some major part of it and gradually build the
context after capturing the gist. Attention mechanisms were first discussed by Larochelle
and Hinton (2010) and Denil et al. (2012). The attention models mostly refer to the models
that were introduced in (Bahdanau et al., 2014) for machine translation and soon applied to
many different domains like (Chorowski et al., 2015) for speech recognition and (Xu et al.,
2015) for image caption generation.
Attention models are mostly used for sequence output prediction. Instead of seeing the
whole sequential data and making one single prediction (for example, a language model), the
model needs to make a sequential prediction for the sequential input in tasks like machine
translation or image caption generation. Therefore, the attention model is mostly used to
answer the question of where to pay attention, based on previously predicted labels or
hidden states.
The output sequence does not have to be linked one-to-one to the input sequence, and the
input data may not even be a sequence. Therefore, usually, an encoder-decoder framework
(Cho et al., 2015) is necessary. The encoder is used to encode the data into representations
and the decoder is used to make sequential predictions. The attention mechanism is used to
locate a region of the representation for predicting the label at the current time step.
Figure 25 shows a basic attention model under the encoder-decoder network structure. The
representation produced by the encoder is entirely accessible to the attention model, and the
attention model only selects some regions to pass on to the LSTM cell for making predictions.


Figure 25: The unfolded structure of an attention model. Transparency is used to show
that unfolding is only conceptual. The representations the encoder learns are all
available to the decoder across all time steps. The attention module only selects
some to pass on to the LSTM cell for prediction.

Therefore, all the magic of attention models is about how this attention module in
Figure 25 helps to localize the informative representations.
To formalize how it works, we use $r$ to denote the encoded representation (there is a
total of $M$ regions of representation) and $h$ to denote the hidden states of the LSTM cell.
Then, the attention module generates the unscaled weight for the $i$th region of the encoded
representation as:

$$\beta_i^t = f(h^{t-1}, r, \{\alpha_j^{t-1}\}_{j=1}^{M})$$

where $\alpha_j^{t-1}$ are the attention weights computed at the previous time step. The weights
at the current time step are computed with a simple softmax function:

$$\alpha_i^t = \frac{\exp(\beta_i^t)}{\sum_{j=1}^{M} \exp(\beta_j^t)}$$

Therefore, we can further use the weights $\alpha$ to reweight the representation $r$ for prediction.
There are two ways for the representation to be reweighted:
• Soft attention: The result is a simple weighted sum of the context vectors such that:

$$r^t = \sum_{j=1}^{M} \alpha_j^t c_j$$

• Hard attention: The model is forced to make a hard decision by only localizing one
region: sampling one region out following a multinoulli distribution.
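Both reweighting schemes can be sketched in a few lines. The scoring function f is left unspecified in the text, so the sketch assumes a plain dot product between the previous hidden state and each region and ignores the previous attention weights; the sizes and random values are likewise hypothetical.

```python
import numpy as np


def softmax(beta):
    e = np.exp(beta - beta.max())     # subtract the max for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
M, d = 5, 8
r = rng.normal(size=(M, d))           # M encoded regions, each of dimension d
h_prev = rng.normal(size=d)           # previous hidden state of the decoder

beta = r @ h_prev                     # assumed scoring function f: dot product
alpha = softmax(beta)                 # attention weights over the M regions

soft_context = alpha @ r              # soft attention: weighted sum of regions
hard_index = rng.choice(M, p=alpha)   # hard attention: sample a single region
hard_context = r[hard_index]
```

The soft variant is a differentiable function of `alpha`, while the hard variant involves a sampling step.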


Figure 26: Three different formulations of deep recurrent neural network: (a) deep input
architecture, (b) deep recurrent architecture, (c) deep output architecture.

One problem with hard attention is that sampling from a multinoulli distribution is
not differentiable. Therefore, gradient-based methods can hardly be applied. Variational
methods (Ba et al., 2014) or policy-gradient-based methods (Sutton et al., 1999) can be
considered.

6.5 Deep RNNs and the future of RNNs

In this very last section of the evolutionary path of RNN family, we will visit some ideas
that have not been fully explored.

6.5.1 Deep Recurrent Neural Network

Although recurrent neural networks suffer from many of the issues that deep neural networks
have because of the recurrent connections, current RNNs are still not deep models in terms of
representation learning compared to models in other families.
Pascanu et al. (2013a) formalized the idea of constructing deep RNNs by extending
current RNNs. Figure 26 shows three different directions for constructing a deep recurrent
neural network: increasing the layers of the input component (Figure 26 (a)), the recurrent
component (Figure 26 (b)) and the output component (Figure 26 (c)) respectively.


6.5.2 The Future of RNNs


RNNs have been improved in a variety of ways, such as assembling the pieces together
with a Conditional Random Field (Yang et al., 2016) or with CNN components
(Ma and Hovy, 2016). In addition, the convolutional operation can be built directly into LSTM,
resulting in ConvLSTM (Xingjian et al., 2015), and this ConvLSTM can in turn be connected
with a variety of different components (De Brabandere et al., 2016; Kalchbrenner et al.,
2016).
One of the most fundamental problems of training RNNs is the vanishing/exploding
gradient problem, introduced in detail in (Bengio et al., 1994). The problem basically
states that for traditional activation functions, the gradient is bounded. When gradients are
computed by backpropagation following the chain rule, the error signal decreases exponentially
over the time steps that BPTT traces back, so the long-term dependency is lost.
LSTM and ReLU are known to be good solutions for the vanishing/exploding gradient problem.
However, these solutions bypass the problem with clever design instead
of solving it fundamentally. While these methods work well in practice, the fundamental
problem for a general RNN is still to be solved. Pascanu et al. (2013b) attempted some
solutions, but there is still more to be done.
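The exponential decay of the error signal can be observed directly on a scalar recurrence. The sketch assumes a sigmoid unit with a hypothetical recurrent weight of 0.9 and a constant input; since the sigmoid derivative is at most 0.25, every backward step multiplies the signal by at most 0.225.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


w_r = 0.9                 # hypothetical recurrent weight
h, factor = 0.0, 1.0
factors = []
for t in range(50):
    h = sigmoid(w_r * h + 0.5)        # constant input of 0.5, for illustration only
    factor *= w_r * h * (1.0 - h)     # chain rule: d h_t / d h_{t-1}
    factors.append(abs(factor))

print(factors[0], factors[9], factors[-1])
```

After 50 steps the accumulated factor is many orders of magnitude below the first one, which is the sense in which the long-term dependency is lost.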


7. Optimization of Neural Networks


The primary focus of this paper is deep learning models. However, optimization is an
inevitable topic in the development history of deep learning models. In this section, we will
briefly revisit the major topics of optimization of neural networks. During our introduction
of the models, some algorithms have been discussed along with the models. Here, we will
only discuss the remaining methods that have not been mentioned previously.

7.1 Gradient Methods


Despite the fact that neural networks have been developed for over fifty years, the optimization
of neural networks still relies heavily on gradient descent methods within the algorithm
of backpropagation. This paper does not intend to introduce classical backpropagation,
the gradient descent method with its stochastic and batch versions, or simple techniques
like the momentum method, but starts right after these topics.
Therefore, the discussion of the following gradient methods starts from vanilla gradient
descent:

$$\theta^{t+1} = \theta^t - \eta \nabla_\theta^t$$

where $\nabla_\theta$ is the gradient of the parameter $\theta$ and $\eta$ is a hyperparameter, usually known as
the learning rate.

7.1.1 Rprop
Rprop was introduced by Riedmiller and Braun (1993). It is a unique method that is still studied
today, as it does not fully utilize the information of the gradient but only considers its
sign. In other words, it updates the parameters following:

$$\theta^{t+1} = \theta^t - \eta I(\nabla_\theta^t > 0) + \eta I(\nabla_\theta^t < 0)$$

where $I(\cdot)$ stands for an indicator function.
This unique formalization allows the gradient method to overcome some cost curvatures
that may not be easily handled by today’s dominant methods. This two-decade-old method
may be worth some further study these days.
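The effect of using only the sign can be seen on a steep quadratic. The sketch implements the sign-only rule exactly as written above, next to vanilla gradient descent; the per-parameter step-size adaptation of the published Rprop algorithm is not modeled, and the cost function and learning rate are hypothetical.

```python
def sign_step(theta, grad, lr):
    """Sign-only update: move by a fixed amount against the gradient's sign."""
    if grad > 0:
        return theta - lr
    if grad < 0:
        return theta + lr
    return theta


def grad(theta):
    """Gradient of the hypothetical steep cost f(theta) = 1000 * theta**2."""
    return 2000.0 * theta


lr = 0.01
theta_gd = theta_sign = 1.0
for _ in range(150):
    theta_gd = theta_gd - lr * grad(theta_gd)                  # vanilla gradient descent
    theta_sign = sign_step(theta_sign, grad(theta_sign), lr)   # sign-only rule

print(theta_gd, theta_sign)
```

With this learning rate vanilla gradient descent overshoots further on every step, while the sign-only rule is insensitive to the gradient's magnitude and ends up within one step size of the minimum.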

7.1.2 AdaGrad
AdaGrad was introduced by Duchi et al. (2011). It follows the idea of an
adaptive learning rate mechanism that assigns a higher learning rate to the parameters that
have been updated more mildly and a lower learning rate to the parameters that have
been updated dramatically. The measure of the degree of the update applied is the $\ell_2$
norm of historical gradients, $S^t = \|\nabla_\theta^1, \nabla_\theta^2, \ldots, \nabla_\theta^t\|_2$; therefore we have the update rule
as follows:

$$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \nabla_\theta^t$$

where $\epsilon$ is a small term to avoid division by zero.


AdaGrad has been shown to improve greatly on the robustness of the traditional gradient
method (Dean et al., 2012). However, the problem is that as the $\ell_2$ norm accumulates,
the fraction of $\eta$ over the $\ell_2$ norm decays to a very small term.
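The decay of the effective learning rate is easy to observe on a scalar problem. The sketch below minimizes the hypothetical cost f(θ) = θ² and takes $S^t$ to be the $\ell_2$ norm of all gradients seen so far; the function name and default values are illustrative.

```python
import math


def adagrad(grad_fn, theta, lr=0.5, eps=1e-8, steps=100):
    """Scalar AdaGrad; returns the final parameter and effective learning rate."""
    sum_sq = 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        sum_sq += g * g                               # (S^t)^2: accumulated squared gradients
        theta -= lr / (math.sqrt(sum_sq) + eps) * g
    return theta, lr / (math.sqrt(sum_sq) + eps)


theta, effective_lr = adagrad(lambda th: 2.0 * th, theta=5.0)
print(theta, effective_lr)
```

Because `sum_sq` never shrinks, the effective learning rate can only decrease over the run.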

7.1.3 AdaDelta
AdaDelta is an extension of AdaGrad that aims to reduce the decay rate of the learning
rate, proposed in (Zeiler, 2012). Instead of accumulating the gradients of each time step as
in AdaGrad, AdaDelta re-weights the previous accumulation before adding the current term to
it, resulting in:

$$(S^t)^2 = \beta (S^{t-1})^2 + (1 - \beta)(\nabla_\theta^t)^2$$

where $\beta$ is the weight for re-weighting. Then the update rule is the same as AdaGrad:

$$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \nabla_\theta^t$$

which is almost the same as another famous gradient variant named RMSprop (see footnote 10).
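A scalar sketch of the rule as stated above follows. It implements only the two equations given here (a decayed average of squared gradients and the AdaGrad-style step); the cost f(θ) = θ², the function name, and the default values are assumptions for illustration.

```python
import math


def decayed_average_descent(grad_fn, theta, lr=0.1, beta=0.9, eps=1e-8, steps=200):
    """Scalar update with a re-weighted (decayed) accumulation of squared gradients."""
    mean_sq = 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        mean_sq = beta * mean_sq + (1.0 - beta) * g * g   # (S^t)^2
        theta -= lr / (math.sqrt(mean_sq) + eps) * g
    return theta


theta = decayed_average_descent(lambda th: 2.0 * th, theta=5.0)
print(theta)
```

Unlike the accumulated sum in AdaGrad, the decayed average can shrink again when recent gradients are small, so the effective learning rate does not decay monotonically.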

7.1.4 Adam
Adam stands for Adaptive Moment Estimation, proposed in (Kingma and Ba, 2014). Adam
is like a combination of the momentum method and the AdaGrad method, but each component is
re-weighted at time step $t$. Formally, at time step $t$, we have:

$$\Delta_\theta^t = \alpha \Delta_\theta^{t-1} + (1 - \alpha) \nabla_\theta^t$$
$$(S^t)^2 = \beta (S^{t-1})^2 + (1 - \beta)(\nabla_\theta^t)^2$$
$$\theta^{t+1} = \theta^t - \frac{\eta}{S^t + \epsilon} \Delta_\theta^t$$

All these modern gradient variants have been published with the promising claim that they
help to improve the convergence rate of previous methods. Empirically, these methods
do seem to be helpful; however, in many cases, a good choice among these methods seems
to bring only limited benefit.
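The three equations can be written as a single update step. The sketch follows them literally and therefore omits the bias-correction terms of the published Adam algorithm; the function name, default hyperparameters and the cost f(θ) = θ² are assumptions for illustration.

```python
import math


def adam_step(theta, g, delta, mean_sq, lr=0.001, alpha=0.9, beta=0.999, eps=1e-8):
    """One scalar update following the three equations above (no bias correction)."""
    delta = alpha * delta + (1.0 - alpha) * g           # re-weighted momentum term
    mean_sq = beta * mean_sq + (1.0 - beta) * g * g     # re-weighted squared gradients
    theta = theta - lr / (math.sqrt(mean_sq) + eps) * delta
    return theta, delta, mean_sq


theta, delta, mean_sq = 5.0, 0.0, 0.0
for _ in range(100):
    # gradient of the hypothetical cost f(theta) = theta**2 is 2 * theta
    theta, delta, mean_sq = adam_step(theta, 2.0 * theta, delta, mean_sq, lr=0.1)
print(theta)
```

Keeping the optimizer state (`delta`, `mean_sq`) explicit makes the relation to the momentum and AdaDelta rules visible: each state variable is one of their running averages.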

7.2 Dropout
Dropout was introduced in (Hinton et al., 2012; Srivastava et al., 2014). The technique soon
became influential, not only because of its good performance but also because of its simplicity
of implementation. The idea is very simple: randomly drop out some of the units during
training. More formally: on each training case, each hidden unit is randomly omitted from
the network with a probability of p.
As suggested by Hinton et al. (2012), Dropout can be seen as an efficient way to perform
model averaging across a large number of different neural networks, where overfitting can
be avoided at a much lower computational cost.
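The training-time operation described above amounts to multiplying the hidden activations by a random binary mask. The sketch below shows only that masking step; the function name and the example sizes are hypothetical, and any rescaling applied at test time is outside its scope.

```python
import numpy as np


def dropout(h, p, rng):
    """Omit each hidden unit independently with probability p."""
    mask = rng.random(h.shape) >= p    # True where the unit is kept
    return h * mask, mask


rng = np.random.default_rng(0)
h = np.ones(10000)                     # hypothetical hidden activations
dropped, mask = dropout(h, p=0.5, rng=rng)
print(mask.mean())                     # fraction of units kept
```

Each call draws a new mask, so every training case effectively sees a different thinned network.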
10. It seems this method was never formally published; the resources all trace back to Hinton’s slides at
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf


Because of the performance it delivers, Dropout became very popular soon after
its introduction, and a lot of work has attempted to understand its mechanism from different
perspectives, including (Baldi and Sadowski, 2013; Cho, 2013; Ma et al., 2016). It has also
been applied to train other models, like SVM (Chen et al., 2014).

7.3 Batch Normalization and Layer Normalization


Batch Normalization, introduced by Ioffe and Szegedy (2015), is another breakthrough in the
optimization of deep neural networks. They addressed the problem they named internal
covariate shift. Intuitively, the problem can be understood in the following two steps: 1) a
learned function is barely useful if its input changes (in statistics, the input of a function is
sometimes denoted as covariates); 2) each layer is a function, and changes to the parameters
of the layers below change the input of the current layer. This change could be dramatic, as it
may shift the distribution of inputs.
Ioffe and Szegedy (2015) proposed Batch Normalization to solve this issue, formally
following the steps:

$$\mu_B = \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$\sigma_B^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B + \epsilon}$$
$$y_i = \sigma_L \hat{x}_i + \mu_L$$

where $\mu_B$ and $\sigma_B^2$ denote the mean and variance of that batch, $\mu_L$ and $\sigma_L$ are two
parameters learned by the algorithm to rescale and shift the output, and $x_i$ and $y_i$ are the
inputs and outputs of that function respectively.
These steps are performed for every batch during training. Batch Normalization turned
out to work very well empirically in training and soon became popular.
As a follow-up, Ba et al. (2016) proposed the technique of Layer Normalization, where
they “transpose” batch normalization into layer normalization by computing the mean and
variance used for normalization from all of the summed inputs to the neurons in a layer on a
single training case. Therefore, this technique has the natural advantage of being
straightforwardly applicable to recurrent neural networks. However, it seems that this
“transposed batch normalization” cannot be implemented as simply as Batch Normalization.
Therefore, it has not become as influential as Batch Normalization.
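The "transpose" relation between the two techniques shows up as a change of axis. The sketch below applies the normalization steps given above to a hypothetical batch of 32 training cases with 8 features each; the scalar defaults for $\sigma_L$ and $\mu_L$ and the value of ε are assumptions, and no learning of these parameters is shown.

```python
import numpy as np


def batch_norm(x, sigma_L=1.0, mu_L=0.0, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch."""
    mu_B = x.mean(axis=0)
    sigma_B = x.std(axis=0)                    # square root of the batch variance
    x_hat = (x - mu_B) / (sigma_B + eps)
    return sigma_L * x_hat + mu_L


def layer_norm(x, sigma_L=1.0, mu_L=0.0, eps=1e-5):
    """Normalize each training case (row) using statistics over its own features."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return sigma_L * (x - mu) / (sigma + eps) + mu_L


rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(32, 8)) + 7.0       # rows: training cases, columns: features
bn, ln = batch_norm(x), layer_norm(x)
```

The layer-normalized output of a training case does not depend on the other cases in the batch, which is what makes the technique directly usable at every time step of a recurrent network.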

7.4 Optimization for “Optimal” Model Architecture


In the very last section on optimization techniques for neural networks, we revisit some old
methods that attempted to learn the “optimal” model architecture.
Many of these methods are known as constructive network approaches. Most of them
were proposed decades ago and did not make enough impact back then. Nowadays,
with more powerful computation resources, people are starting to consider these methods again.

Two remarks need to be made before we proceed: 1) Most of these methods can obviously
be traced back to counterparts in the non-parametric machine learning field, but because
most of them did not perform well enough to have an impact, discussing their
evolutionary path may mislead readers. Instead, we only list these methods for readers
who seek inspiration. 2) Many of these methods are not exclusively optimization
techniques, because they are usually proposed together with a specifically designed
architecture. Technically speaking, these methods should be distributed among the previous
sections according to their associated models. However, because these methods can barely
inspire modern modeling research but may have a chance to inspire modern optimization
research, we list them in this section.

7.4.1 Cascade-Correlation Learning

One of the earliest and most important works on this topic is by Fahlman and Lebiere
(1989), who introduced a model together with its corresponding algorithm, named
Cascade-Correlation Learning. The idea is that the algorithm starts with a minimal network
and builds up towards a bigger one. Whenever another hidden unit is added, the parameters
of the previous hidden units are fixed, and the algorithm only searches for optimal
parameters for the newly added hidden unit.
Interestingly, the unique architecture of Cascade-Correlation Learning allows the network
to grow deeper and wider at the same time, because every newly added hidden unit
takes the data together with the outputs of previously added units as input.
Two important questions for this algorithm are 1) when to fix the parameters of the current
hidden units and proceed to add and tune a new one, and 2) when to terminate the
entire algorithm. These two questions are answered in a similar manner: the algorithm
adds a new hidden unit when there are no significant changes in the existing architecture
and terminates when the overall performance is satisfactory. This training process may
introduce problems of overfitting, which might account for the fact that this method is not
seen much in modern deep learning research.
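The growth procedure can be sketched as follows. This is a simplified illustration under several assumptions rather than the algorithm of Fahlman and Lebiere (1989) as published: the output weights are refit by least squares, a single candidate unit (instead of a pool) is trained by plain gradient ascent on the magnitude of the covariance between its output and the current residual, and the function names, step counts, and tolerance are all hypothetical.

```python
import numpy as np

def train_candidate(features, residual, steps=200, lr=0.1, seed=0):
    """Fit one tanh unit whose output covaries with the current residual."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0, size=features.shape[1])
    r = residual - residual.mean()            # centred residual
    for _ in range(steps):
        h = np.tanh(features @ w)
        cov = np.sum((h - h.mean()) * r)
        # Gradient of |cov| w.r.t. w; r is centred, so the mean term vanishes.
        grad = np.sign(cov) * (features.T @ (r * (1.0 - h ** 2)))
        w += lr * grad / len(r)
    return w

def cascade_fit(X, y, max_units=5, tol=1e-3):
    """Grow a cascade network one hidden unit at a time."""
    n = X.shape[0]
    features = np.hstack([X, np.ones((n, 1))])   # raw inputs plus a bias column
    hidden, errors = [], []
    for k in range(max_units + 1):
        # Refit only the output weights; earlier hidden units stay frozen.
        out_w, *_ = np.linalg.lstsq(features, y, rcond=None)
        residual = y - features @ out_w
        errors.append(float(np.mean(residual ** 2)))
        if errors[-1] < tol or k == max_units:
            break
        w = train_candidate(features, residual, seed=k)
        hidden.append(w)                          # frozen from now on
        # The new unit sees the data AND all previously added units.
        features = np.hstack([features, np.tanh(features @ w)[:, None]])
    return hidden, out_w, errors
```

Each new weight vector is one entry longer than the previous one, which reflects how every added unit receives the outputs of all earlier units and why the network grows deeper and wider at once. The loop stops either when the training error falls below `tol` or when `max_units` units have been added, mirroring the two stopping questions above.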

7.4.2 Tiling Algorithm

Mézard and Nadal (1989) presented the Tiling Algorithm, which simultaneously learns the
parameters, the number of layers, and the number of hidden units in each layer of a
feedforward neural network on Boolean functions. This algorithm was later extended to a
multi-class version by Parekh et al. (1997).
On every layer, the algorithm tries to build a set of hidden units that groups the data
into clusters such that each cluster contains only one label. It keeps increasing the
number of hidden units until such a clustering pattern is achieved, and then proceeds to
add another layer.
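The stopping condition for growing a layer, namely that every cluster of identical hidden activations contains a single label, can be checked with a few lines of Python. This sketch covers only that check, not the procedure for adding units; the function name `is_faithful` and the representation of hidden activations as binary tuples are assumptions made for this illustration.

```python
def is_faithful(hidden_codes, labels):
    """Return True if no two samples with different labels share a hidden code."""
    seen = {}
    for code, label in zip(hidden_codes, labels):
        key = tuple(code)
        # Remember the first label seen for this code; any later mismatch
        # means the cluster mixes labels and the layer needs more units.
        if seen.setdefault(key, label) != label:
            return False
    return True
```

A layer for which this check fails would receive additional hidden units; once it passes, the hidden codes can serve as the input to the next layer.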
Mézard and Nadal (1989) also offered a proof of theoretical guarantees for the Tiling
Algorithm. Basically, the theorem says that the Tiling Algorithm can greedily improve the
performance of a neural network.

7.4.3 Upstart Algorithm


Frean (1990) proposed the Upstart Algorithm. Long story short, this algorithm is simply a
neural network version of the standard decision tree (Safavian and Landgrebe, 1990) in which
each tree node is replaced with a linear perceptron. The tree is therefore seen as a neural
network because it uses the core component of neural networks as a tree node. As a result,
the standard way of building a tree is advertised as building a neural network automatically.
Similarly, Bengio et al. (2005) proposed a boosting algorithm in which the weak classifiers
are replaced with neurons.

7.4.4 Evolutionary Algorithm


Evolutionary Algorithms are a family of algorithms that use mechanisms inspired by
biological evolution to search a parameter space for the optimal solution. Some prominent
examples in this family are the genetic algorithm (Mitchell, 1998), which simulates natural
selection, and the ant colony optimization algorithm (Colorni et al., 1991), which simulates
the cooperation of an ant colony exploring its surroundings.
Yao (1999) offered an extensive survey of the use of evolutionary algorithms for the
optimization of neural networks, in which Yao introduced several encoding schemes that
enable the neural network architecture to be learned with evolutionary algorithms. The
encoding schemes basically convert the network architecture into vectors, so that a
standard algorithm can take them as input and optimize them.
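One simple way to turn an architecture into a vector is a direct encoding, in which the connectivity matrix of a feedforward network is flattened into a bit string that a genetic algorithm can mutate. The sketch below illustrates that idea only; the function names `encode`, `decode`, and `mutate`, and the choice to store just the strict upper triangle (node i may feed node j only if i < j), are assumptions for this example and are not claimed to match any particular scheme in Yao (1999).

```python
import numpy as np

def encode(adj):
    """Flatten the strict upper triangle of a connectivity matrix into bits."""
    iu = np.triu_indices(adj.shape[0], k=1)
    return adj[iu].astype(np.int8)

def decode(bits, n_nodes):
    """Rebuild the connectivity matrix from its bit-vector encoding."""
    adj = np.zeros((n_nodes, n_nodes), dtype=np.int8)
    adj[np.triu_indices(n_nodes, k=1)] = bits
    return adj

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flips = rng.random(bits.shape) < rate
    return np.where(flips, 1 - bits, bits).astype(np.int8)
```

A search loop would repeatedly mutate encoded architectures, decode them, train and score the resulting networks, and keep the best candidates. That loop is omitted here because the scoring step depends on the task.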
So far, we have discussed some representative algorithms that aim to learn the network
architecture automatically. Most of these algorithms eventually faded out of modern deep
learning research. We conjecture two main reasons for this outcome: 1) most of these
algorithms tend to overfit the data; 2) most of these algorithms follow a greedy search
paradigm, which is unlikely to find the optimal architecture.
However, with the rapid development of machine learning methods and computational
resources in the last decade, we hope the constructive network methods listed here
can still inspire readers to make substantial contributions to modern deep learning research.

8. Conclusion
In this paper, we have revisited the evolutionary path of today’s deep learning models.
We revisited the paths of three major families of deep learning models: the deep generative
model family, the convolutional neural network family, and the recurrent neural network
family, as well as some topics in optimization techniques.
This paper serves two goals: 1) First, it documents the major milestones in the history of
science that have influenced the current development of deep learning. These milestones
are not limited to developments in computer science. 2) More importantly, by revisiting
the evolutionary path of these major milestones, this paper should be able to show
readers how these remarkable works were developed among thousands of other
contemporaneous publications. Here we briefly summarize three directions that many of
these milestones pursue:
• Occam’s razor: While part of the community seems to favor more complex
models, layering one architecture onto another and hoping backpropagation can
find the optimal parameters, history suggests that masterminds tend to think simply:
Dropout is widely recognized not only because of its performance, but even more because
of its simplicity of implementation and its intuitive (if tentative) reasoning. From the
Hopfield Network to the Restricted Boltzmann Machine, models were simplified over
successive iterations until the RBM was ready to be stacked.
• Be ambitious: If a model is proposed with substantially more parameters than
contemporaneous ones, it must solve a problem that no others can solve well in order to
be remarkable. LSTM is much more complex than a traditional RNN, but it bypasses the
vanishing gradient problem nicely. The Deep Belief Network is famous not because its
authors were the first to come up with the idea of putting one RBM onto another,
but because they came up with an algorithm that allows deep architectures to be trained
effectively.
• Read widely: Many models are inspired by domain knowledge from outside the machine
learning or statistics fields. The human visual cortex has greatly inspired the development
of convolutional neural networks. Even the recently popular Residual Networks can find
a corresponding mechanism in the human visual cortex. The Generative Adversarial Network
can also find some connection with game theory, which was developed more than fifty years ago.
We hope these directions can help some readers have a greater impact on current society.
Readers should also be able to distill further directions from our revisit of these
milestones.

Acknowledgements
Thanks to the demo from http://beej.us/blog/data/convolution-image-processing/ for a
quick generation of examples in Figure 16. Thanks to Bojian Han at Carnegie Mellon
University for the examples in Figure 21. Thanks to the blog at
http://sebastianruder.com/optimizing-gradient-descent/index.html for a summary of
gradient methods in Section 7.1. Thanks to Yutong Zheng and Xupeng Tong at Carnegie
Mellon University for suggesting some relevant content.

References
Emile Aarts and Jan Korst. Simulated annealing and boltzmann machines. 1988.

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
boltzmann machines. Cognitive science, 9(1):147–169, 1985.

James A Anderson and Edward Rosenfeld. Talking nets: An oral history of neural networks.
MIT Press, 2000.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint
arXiv:1701.07875, 2017.

Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with
visual attention. arXiv preprint arXiv:1412.7755, 2014.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Alexander Bain. Mind and Body the Theories of Their Relation by Alexander Bain. Henry
S. King & Company, 1873.

Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Infor-
mation Processing Systems, pages 2814–2822, 2013.

Peter L Bartlett and Wolfgang Maass. Vapnik chervonenkis dimension of neural nets. The
handbook of brain theory and neural networks, pages 1188–1192, 2003.

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In
European conference on computer vision, pages 404–417. Springer, 2006.

Yoshua Bengio. Learning deep architectures for ai. Foundations and trends® in Machine
Learning, 2(1):1–127, 2009.

Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In
International Conference on Algorithmic Learning Theory, pages 18–36. Springer, 2011.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.

Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte.
Convex neural networks. In Advances in neural information processing systems, pages
123–130, 2005.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise
training of deep networks. Advances in neural information processing systems, 19:153,
2007.

Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural
network classifiers. In ESANN, 2014.

CG Boeree. Psychology: the beginnings. Retrieved April, 26:2008, 2000.

James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods
with an automated monte carlo em algorithm. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(1):265–285, 1999.

Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint
arXiv:1406.2751, 2014.

Nirmal K Bose et al. Neural network fundamentals with graphs, algorithms, and applications.
Number 612.82 BOS. 1996.

Martin L Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674,
1989.

George W Brown. Iterative solution of games by fictitious play. Activity analysis of pro-
duction and allocation, 13(1):374–376, 1951.

John A Bullinaria. Self organizing maps: Fundamentals. Introduction to Neural, 2004.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders.
arXiv preprint arXiv:1509.00519, 2015.

WH Burnham. Memory, historically and experimentally considered. i. an historical sketch
of the older conceptions of memory. The American Journal of Psychology, 2(1):39–90,
1888.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural
network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.

Miguel A Carreira-Perpinan and Geoffrey Hinton. On contrastive divergence learning. In
AISTATS, volume 10, pages 33–40. Citeseer, 2005.

Juan Luis Castro, Carlos Javier Mantas, and JM Benítez. Neural networks with a continuous
squashing function in the output are universal approximators. Neural Networks, 13(6):
561–563, 2000.

Ning Chen, Jun Zhu, Jianfei Chen, and Bo Zhang. Dropout training for support vector
machines. arXiv preprint arXiv:1404.4171, 2014.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.
Infogan: Interpretable representation learning by information maximizing generative ad-
versarial nets. In Advances In Neural Information Processing Systems, pages 2172–2180,
2016.

KyungHyun Cho. Understanding dropout: training multi-layer perceptrons with auxiliary
independent stochastic neurons. In International Conference on Neural Information
Processing, pages 474–481. Springer, 2013.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078,
2014.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia content
using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17
(11):1875–1886, 2015.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun.
The loss surfaces of multilayer networks. In AISTATS, 2015.

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Ben-
gio. Attention-based models for speech recognition. In Advances in Neural Information
Processing Systems, pages 577–585, 2015.

Avital Cnaan, NM Laird, and Peter Slasor. Tutorial in biostatistics: Using the general
linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat
Med, 16:2349–2380, 1997.

Alberto Colorni, Marco Dorigo, Vittorio Maniezzo, et al. Distributed optimization by ant
colonies. In Proceedings of the first European conference on artificial life, volume 142,
pages 134–142. Paris, France, 1991.

David Daniel Cox and Thomas Dean. Neural networks and neuroscience-inspired computer
vision. Current Biology, 24(18):R921–R929, 2014.

G Cybenko. Continuous valued neural networks with two hidden layers are sufficient. 1988.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics
of control, signals and systems, 2(4):303–314, 1989.

Zihang Dai, Amjad Almahairi, Bachman Philip, Eduard Hovy, and Aaron Courville. Cali-
brating energy-based generative adversarial networks. ICLR submission, 2017.

Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in neural information processing systems, pages
2933–2941, 2014.

Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks.
In Neural Information Processing Systems (NIPS), 2016.

Vincent De Ladurantaye, Jacques Vanden-Abeele, and Jean Rouat. Models of information
processing in the visual cortex. Citeseer, 2012.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew
Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks.
In Advances in neural information processing systems, pages 1223–1231, 2012.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances
in Neural Information Processing Systems, pages 666–674, 2011.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to
attend with deep architectures for image tracking. Neural computation, 24(8):2151–2184,
2012.

Jean-Pierre Didier and Emmanuel Bigand. Rethinking physical and rehabilitation medicine:
New technologies induce new learning strategies. Springer Science & Business Media,
2011.

Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):
2121–2159, 2011.

Angela Lee Duckworth, Eli Tsukayama, and Henry May. Establishing causality using lon-
gitudinal hierarchical linear modeling: An illustration predicting achievement from self-
control. Social psychological and personality science, 2010.

Samuel Frederick Edwards and Phil W Anderson. Theory of spin glasses. Journal of Physics
F: Metal Physics, 5(5):965, 1975.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv
preprint arXiv:1512.03965, 2015.

Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent,
and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of
Machine Learning Research, 11(Feb):625–660, 2010.

Scott E Fahlman and Christian Lebiere. The cascade-correlation learning architecture.
1989.

Marcus Frean. The upstart algorithm: A method for constructing and training feedforward
neural networks. Neural computation, 2(2):198–209, 1990.

Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):
193–202, 1980.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic
style. arXiv preprint arXiv:1508.06576, 2015.

Tom Germano. Self organizing maps. Available in http://davis.wpi.edu/~matt/courses/soms,
1999.

Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint
Conference on, volume 3, pages 189–194. IEEE, 2000.

Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Com-
puter Vision, pages 1440–1448, 2015.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 580–587, 2014.

Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint
arXiv:1701.00160, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, pages 2672–2680, 2014.

Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.

Céline Gravelines. Deep Learning via Stacked Sparse Autoencoders for Automated Voxel-
Wise Brain Parcellation Based on Functional Connectivity. PhD thesis, The University
of Western Ontario, 1991.

Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with
deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU),
2013 IEEE Workshop on, pages 273–278. IEEE, 2013.

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen
Schmidhuber. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.

Richard Gregory and Patrick Cavanagh. The blind spot. Scholarpedia, 6(10):9618, 2011.

Stephen Grossberg. Recurrent neural networks. Scholarpedia, 8(2):1888, 2013.

Aman Gupta, Haohan Wang, and Madhavi Ganapathiraju. Learning structure in gene
expression data using deep architectures, with an application to gene clustering. In
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages
1328–1335. IEEE, 2015.

Kevin Gurney. An introduction to neural networks. CRC press, 1997.

Shyam M Guthikonda. Kohonen self-organizing maps. Wittenberg University, 2005.

Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for
object segmentation and fine-grained localization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 447–456, 2015.

David Hartley. Observations on Man, volume 1. Cambridge University Press, 2013.

Johan Hastad. Almost optimal lower bounds for small depth circuits. In Proceedings of the
eighteenth annual ACM symposium on Theory of computing, pages 6–20. ACM, 1986.

James V Haxby, Elizabeth A Hoffman, and M Ida Gobbini. The distributed human neural
system for face perception. Trends in cognitive sciences, 4(6):223–233, 2000.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. arXiv preprint arXiv:1603.05027, 2016.

Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology
Press, 1949.

Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks,
1989. IJCNN., International Joint Conference on, pages 593–605. IEEE, 1989.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence.
Neural computation, 14(8):1771–1800, 2002.

Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The “wake-sleep”
algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep
belief nets. Neural computation, 18(7):1527–1554, 2006.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature de-
tectors. arXiv preprint arXiv:1207.0580, 2012.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.

John J Hopfield. Neural networks and physical systems with emergent collective computa-
tional abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural networks, 2(5):359–366, 1989.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep
neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.

David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s striate
cortex. The Journal of physiology, 148(3):574–591, 1959.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Michael I Jordan. Serial order: A parallel distributed processing approach. Advances in
psychology, 121:471–495, 1986.

Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals,
Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint
arXiv:1610.00527, 2016.

Kiyoshi Kawaguchi. A multithreaded software model for backpropagation neural network
applications. 2000.

AG Khachaturyan, SV Semenovskaya, and B Vainstein. A statistical-thermodynamic approach
to determination of structure amplitude phases. Sov. Phys. Crystallogr, 24:519–524, 1979.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-
supervised learning with deep generative models. In Advances in Neural Information
Processing Systems, pages 3581–3589, 2014.

Tinne Hoff Kjeldsen. John von neumann’s conception of the minimax theorem: a journey
through different mathematical contexts. Archive for history of exact sciences, 56(1):
39–68, 2001.

Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.

Bryan Kolb, Ian Q Whishaw, and G Campbell Teskey. An introduction to brain and behav-
ior, volume 1273. 2014.

Mark L Krieg. A tutorial on bayesian belief networks. 2001.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense image annotations. arXiv
preprint arXiv:1602.07332, 2016.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105, 2012.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convo-
lutional inverse graphics network. In Advances in Neural Information Processing Systems,
pages 2539–2547, 2015.
Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept
learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
Alan S Lapedes and Robert M Farber. How neural nets work. In Neural information
processing systems, pages 442–456, 1988.
Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-
order boltzmann machine. In Advances in neural information processing systems, pages
1243–1251, 2010.
B Boser Le Cun, John S Denker, D Henderson, Richard E Howard, W Hubbard, and
Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In
Advances in neural information processing systems. Citeseer, 1990.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
Yann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist database of hand-
written digits, 1998b.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep
belief networks for scalable unsupervised learning of hierarchical representations. In Pro-
ceedings of the 26th annual international conference on machine learning, pages 609–616.
ACM, 2009.
Zhaoping Li. A neural model of contour integration in the primary visual cortex. Neural
computation, 10(4):903–940, 1998.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European Conference on Computer Vision, pages 740–755. Springer, 2014.
Cheng-Yuan Liou and Shiao-Lin Lin. Finite memory loading in hairy neurons. Natural
Computing, 5(1):15–42, 2006.
Cheng-Yuan Liou and Shao-Kuo Yuan. Error tolerant associative memory. Biological Cy-
bernetics, 81(4):331–342, 1999.
Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and
David Heckerman. Fast linear mixed models for genome-wide association studies. Nature
methods, 8(10):833–835, 2011.
Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural
networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.

David G Lowe. Object recognition from local scale-invariant features. In Computer vision,
1999. The proceedings of the seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999.
Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf.
arXiv preprint arXiv:1603.01354, 2016.
Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy.
Dropout with expectation-linear regularization. arXiv preprint arXiv:1609.08017, 2016.
M Maschler, Eilon Solan, and Shmuel Zamir. Game theory. translated from the hebrew by
ziv hellman and edited by mike borns, 2013.
Charles E McCulloch and John M Neuhaus. Generalized linear mixed models. Wiley Online
Library, 2001.
Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
Marc Mézard and Jean-P Nadal. Learning in feedforward layered networks: The tiling
algorithm. Journal of Physics A: Mathematical and General, 22(12):2191, 1989.
Marc Mézard, Giorgio Parisi, and Miguel-Angel Virasoro. Spin glass theory and beyond.
1990.
Marvin L Minsky and Seymour A Papert. Perceptrons: an introduction to computational
geometry. MA: MIT Press, Cambridge, 1969.
Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998.
Tom M Mitchell et al. Machine learning. wcb, 1997.
Jeffrey Moran and Robert Desimone. Selective attention gates visual processing in the
extrastriate cortex. Frontiers in cognitive neuroscience, 229:342–345, 1985.
Michael C Mozer. A focused back-propagation algorithm for temporal pattern recognition.
Complex systems, 3(4):349–381, 1989.
Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pages 807–814, 2010.
John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
John F Nash et al. Equilibrium points in n-person games. Proc. Nat. Acad. Sci. USA, 36
(1):48–49, 1950.
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 427–436. IEEE, 2015.

Danh V Nguyen, Damla Şentürk, and Raymond J Carroll. Covariate-adjusted linear mixed
effects model with an application to longitudinal data. Journal of nonparametric statis-
tics, 20(6):459–481, 2008.
Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathe-
matical biology, 15(3):267–273, 1982.
Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigen-
values of the expectation of a random matrix. Journal of mathematical analysis and
applications, 106(1):69–84, 1985.
Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. arXiv preprint arXiv:1601.06759, 2016.
Keiichi Osako, Rita Singh, and Bhiksha Raj. Complex recurrent neural networks for denois-
ing speech signals. In Applications of Signal Processing to Audio and Acoustics (WAS-
PAA), 2015 IEEE Workshop on, pages 1–5. IEEE, 2015.
Rajesh G Parekh, Jihoon Yang, and Vasant Honavar. Constructive neural network learning
algorithms for multi-category real-valued pattern classification. 1997.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct
deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013a.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. ICML (3), 28:1310–1318, 2013b.
Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point
problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier,
and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences
for richer image-to-sentence models. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2641–2649, 2015.
Tomaso Poggio and Thomas Serre. Models of visual cortex. Scholarpedia, 8(4):3516, 2013.
Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse
representations with an energy-based model. In Advances in neural information processing
systems, pages 1137–1144, 2006.
Jose C Principe, Neil R Euliano, and W Curt Lefebvre. Neural and adaptive systems:
fundamentals through simulations with CD-ROM. John Wiley & Sons, Inc., 1999.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropa-
gation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International
Conference On, pages 586–591. IEEE, 1993.

AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network.
University of Cambridge Department of Engineering, 1987.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and orga-
nization in the brain. Psychological review, 65(6):386, 1958.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal repre-
sentations by error propagation. Technical report, DTIC Document, 1985.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large
scale visual recognition challenge. International Journal of Computer Vision, 115(3):
211–252, 2015.
S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology.
1990.
Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS,
volume 1, page 3, 2009.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
Xi Chen. Improved techniques for training gans. In Advances in Neural Information
Processing Systems, pages 2226–2234, 2016.
Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks,
61:85–117, 2015.
Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans-
actions on Signal Processing, 45(11):2673–2681, 1997.
Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired
by visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), volume 2, pages 994–1000. IEEE, 2005.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hin-
ton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-
experts layer. arXiv preprint arXiv:1701.06538, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
Paul Smolensky. Information processing in dynamical systems: Foundations of harmony
theory. Technical report, DTIC Document, 1986.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation
using deep conditional generative models. In Advances in Neural Information Processing
Systems, pages 3483–3491, 2015.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research, 15(1):1929–1958, 2014.

Amos Storkey. Increasing the capacity of a hopfield network without sacrificing functionality.
In International Conference on Artificial Neural Networks, pages 451–456. Springer, 1997.

Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Pol-
icy gradient methods for reinforcement learning with function approximation. In NIPS,
volume 99, pages 1057–1063, 1999.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1–9, 2015.

Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixelcnn decoders. In Advances In Neural Information
Processing Systems, pages 4790–4798, 2016.

Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like en-
sembles of relatively shallow networks. In Advances in Neural Information Processing
Systems, pages 550–558, 2016.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–
3408, 2010.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families,
and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305,
2008.

Brian A Wandell. Foundations of vision. Sinauer Associates, 1995.

Haohan Wang and Jingkang Yang. Multiple confounders correction with regularized linear
mixed effect models, with application in biological processes. 2016.

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. Select-additive
learning: Improving cross-individual generalization in multimodal sentiment analysis.
arXiv preprint arXiv:1609.05244, 2016.

Paul J Werbos. Generalization of backpropagation with application to a recurrent gas
market model. Neural Networks, 1(4):339–356, 1988.

Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings
of the IEEE, 78(10):1550–1560, 1990.

Bernard Widrow et al. Adaptive “adaline” neuron using chemical “memistors”. 1960.

Alan L Wilkes and Nicholas J Wade. Bain on neural networks. Brain and cognition, 33(3):
295–305, 1997.

Gibbs J Willard. Elementary principles in statistical mechanics. The Rational Foundation of
Thermodynamics, New York, Charles Scribners sons and London, Edward Arnold, 1902.

Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In
Advances in Neural Information Processing Systems, pages 2854–2862, 2015.

SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-
chun Woo. Convolutional lstm network: A machine learning approach for precipitation
nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdi-
nov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption
generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.

Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence
tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.

Andrew Chi-Chih Yao. Separating the polynomial-time hierarchy by oracles. In 26th Annual
Symposium on Foundations of Computer Science (sfcs 1985), 1985.

Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447,
1999.

Dong Yu, Li Deng, and George Dahl. Roles of pre-training and fine-tuning in context-
dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on
Deep Learning and Unsupervised Feature Learning, 2010.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.

Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning requires rethinking generalization. arXiv preprint
arXiv:1611.03530, 2016a.

Ke Zhang, Miao Sun, Tony X Han, Xingfang Yuan, Liru Guo, and Tao Liu. Resid-
ual networks of residual networks: Multilevel residual networks. arXiv preprint
arXiv:1608.02908, 2016b.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial net-
work. arXiv preprint arXiv:1609.03126, 2016.

Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for asso-
ciation studies. Nature genetics, 44(7):821–824, 2012.


Neural Style Transfer: A Review


Yongcheng Jing, Yezhou Yang, Member, IEEE, Zunlei Feng, Jingwen Ye,
Yizhou Yu, Senior Member, IEEE, and Mingli Song, Senior Member, IEEE

Abstract—The seminal work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating artistic
imagery by separating and recombining image content and style. This process of using CNNs to render a content image in different
styles is referred to as Neural Style Transfer (NST). Since then, NST has become a trending topic both in academic literature and
industrial applications. It is receiving increasing attention and a variety of approaches are proposed to either improve or extend the
original NST algorithm. In this paper, we aim to provide a comprehensive overview of the current progress towards NST. We first
propose a taxonomy of current algorithms in the field of NST. Then, we present several evaluation methods and compare different NST
algorithms both qualitatively and quantitatively. The review concludes with a discussion of various applications of NST and open
problems for future research. A list of papers discussed in this review, corresponding codes, pre-trained models and more comparison
results are publicly available at: https://github.com/ycjing/Neural-Style-Transfer-Papers.

arXiv:1705.04058v7 [cs.CV] 30 Oct 2018

Index Terms—Neural style transfer (NST), convolutional neural network

• Y. Jing, Z. Feng, J. Ye, and M. Song are with Microsoft Visual Perception Laboratory, College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China. E-mails: {ycjing, zunleifeng, yejingwen, brooksong}@zju.edu.cn.
• Y. Yang is with School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281, USA. E-mail: yz.yang@asu.edu.
• Y. Yu is with the Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong. E-mail: yizhouy@acm.org.

1 INTRODUCTION

Painting is a popular form of art. For thousands of years, people have been attracted by the art of painting with the advent of many appealing artworks, e.g., van Gogh’s “The Starry Night”. In the past, re-drawing an image in a particular style required a well-trained artist and lots of time.

Since the mid-1990s, the art theories behind the appealing artworks have been attracting the attention of not only the artists but many computer science researchers. There are plenty of studies and techniques exploring how to automatically turn images into synthetic artworks. Among these studies, the advances in non-photorealistic rendering (NPR) [1], [2], [3] are inspiring, and nowadays, it is a firmly established field in the community of computer graphics. However, most of these NPR stylisation algorithms are designed for particular artistic styles [3], [4] and cannot be easily extended to other styles. In the community of computer vision, style transfer is usually studied as a generalised problem of texture synthesis, which is to extract and transfer the texture from the source to target [5], [6], [7], [8]. Hertzmann et al. [9] further propose a framework named image analogies to perform a generalised style transfer by learning the analogous transformation from the provided example pairs of unstylised and stylised images. However, the common limitation of these methods is that they only use low-level image features and often fail to capture image structures effectively.

Figure 1: Example of NST algorithm to transfer the style of a Chinese painting onto a given photograph. The style image is named “Dwelling in the Fuchun Mountains” by Gongwang Huang. [Panels: Input Content, Input Style, Output.]

Recently, inspired by the power of Convolutional Neural Networks (CNNs), Gatys et al. [10] first studied how to use a CNN to reproduce famous painting styles on natural images. They proposed to model the content of a photo as the feature responses from a pre-trained CNN, and further model the style of an artwork as the summary feature statistics. Their experimental results demonstrated that a CNN is capable of extracting content information from an arbitrary photograph and style information from a well-known artwork. Based on this finding, Gatys et al. [10] first proposed to exploit CNN feature activations to recombine the content of a given photo and the style of famous artworks. The key idea behind their algorithm is to iteratively optimise an image with the objective of matching desired CNN feature distributions, which involves both the photo’s content information and artwork’s style information. Their proposed algorithm successfully produces stylised images with the appearance of a given artwork. Figure 1 shows an example of transferring the style of a Chinese painting
“Dwelling in the Fuchun Mountains” onto a photo of The Great Wall. Since the algorithm of Gatys et al. does not have any explicit restrictions on the type of style images and also does not need ground truth results for training, it breaks the constraints of previous approaches. The work of Gatys et al. opened up a new field called Neural Style Transfer (NST), which is the process of using a Convolutional Neural Network to render a content image in different styles.

The seminal work of Gatys et al. has attracted wide attention from both academia and industry. In academia, lots of follow-up studies were conducted to either improve or extend this NST algorithm. Related research on NST has also led to many successful industrial applications (e.g., Prisma [11], Ostagram [12], Deep Forger [13]). However, there is no comprehensive survey summarising and discussing recent advances as well as challenges within this new field of Neural Style Transfer.

In this paper, we aim to provide an overview of current advances (up to March 2018) in Neural Style Transfer (NST). Our contributions are threefold. First, we investigate, classify and summarise recent advances in the field of NST. Second, we present several evaluation methods and experimentally compare different NST algorithms. Third, we summarise current challenges in this field and propose possible directions on how to deal with them in future works.

The organisation of this paper is as follows. We start our discussion with a brief review of previous artistic rendering methods without CNNs in Section 2. Then Section 3 explores the derivations and foundations of NST. Based on the discussions in Section 3, we categorise and explain existing NST algorithms in Section 4. Some improvement strategies for these methods and their extensions will be given in Section 5. Section 6 presents several methodologies for evaluating NST algorithms and aims to build a standardised benchmark for follow-up studies. Then we demonstrate the commercial applications of NST in Section 7, including both current successful usages and its potential applications. In Section 8, we summarise current challenges in the field of NST, as well as propose possible directions on how to deal with them in future works. Finally, Section 9 concludes the paper and delineates several promising directions for future research.

2 STYLE TRANSFER WITHOUT NEURAL NETWORKS

Artistic stylisation is a long-standing research topic. Due to its wide variety of applications, it has been an important research area for more than two decades. Before the appearance of NST, the related research had expanded into an area called non-photorealistic rendering (NPR). In this section, we briefly review some of these artistic rendering (AR) algorithms without CNNs. Specifically, we focus on artistic stylisation of 2D images, which is called image-based artistic rendering (IB-AR) in [14]. For a more comprehensive overview of IB-AR techniques, we recommend [3], [14], [15]. Following the IB-AR taxonomy defined by Kyprianidis et al. [14], we first introduce each category of IB-AR techniques without CNNs and then discuss their strengths and weaknesses.

Stroke-Based Rendering. Stroke-based rendering (SBR) refers to a process of placing virtual strokes (e.g., brush strokes, tiles, stipples) upon a digital canvas to render a photograph with a particular style [16]. The SBR process generally starts from a source photo, incrementally composites strokes to match the photo, and finally produces non-photorealistic imagery, which looks like the photo but with an artistic style. During this process, an objective function is designed to guide the greedy or iterative placement of strokes.

The goal of SBR algorithms is to faithfully depict a prescribed style. Therefore, they are generally effective at simulating certain types of styles (e.g., oil paintings, watercolours, sketches). However, each SBR algorithm is carefully designed for only one particular style and is not capable of simulating an arbitrary style, which makes it inflexible.

Region-Based Techniques. Region-based rendering incorporates region segmentation to enable the adaption of rendering based on the content in regions. Early region-based IB-AR algorithms exploit the shape of regions to guide the stroke placement [17], [18]. In this way, different stroke patterns can be produced in different semantic regions in an image. Song et al. [19] further propose a region-based IB-AR algorithm to manipulate geometry for artistic styles. Their algorithm creates simplified shape rendering effects by replacing regions with several canonical shapes.

Considering regions in rendering allows local control over the level of detail. However, the problem in SBR persists: one region-based rendering algorithm is not capable of simulating an arbitrary style.

Example-Based Rendering. The goal of example-based rendering is to learn the mapping between an exemplar pair. This category of IB-AR techniques is pioneered by Hertzmann et al., who propose a framework named image analogies [9]. Image analogies aim to learn a mapping between a pair of source images and target stylised images in a supervised manner. The training set of image analogy comprises pairs of unstylised source images and the corresponding stylised images with a particular style. The image analogy algorithm then learns the analogous transformation from the example training pairs and creates analogous stylised results when given a test input photograph. Image analogy can also be extended in various ways, e.g., to learn stroke placements for portrait painting rendering [20].

In general, image analogies are effective for a variety of artistic styles. However, pairs of training data are usually unavailable in practice. Another limitation is that image analogies only exploit low-level image features. Therefore, image analogies typically fail to effectively capture content and style, which limits the performance.

Image Processing and Filtering. Creating an artistic image is a process that aims for image simplification and abstraction. Therefore, it is natural to consider adopting and combining some related image processing filters to render a given photo. For example, in [21], Winnemöller et al. for the first time exploit bilateral [22] and difference of Gaussians filters [23] to automatically produce cartoon-like effects.

Compared with other categories of IB-AR techniques, image-filtering based rendering algorithms are generally straightforward to implement and efficient in practice. As a trade-off, they are very limited in style diversity.
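
To make the filtering idea concrete, here is a minimal NumPy sketch of a cartoon-like effect built from a difference-of-Gaussians (DoG) edge map. It is an illustration under stated assumptions, not the method of [21]: a plain Gaussian blur stands in for the bilateral filter, the input is assumed to be a greyscale image with values in [0, 1], and the function names, the scale ratio `k=1.6`, the number of grey levels and the edge threshold are all hypothetical choices.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about 3 sigma and normalised to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D greyscale image, using edge padding."""
    kernel = gaussian_kernel(sigma)
    pad = len(kernel) // 2
    padded = np.pad(img, pad, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def difference_of_gaussians(img, sigma=1.0, k=1.6):
    """DoG response: a narrow blur minus a wider blur (k is an assumed ratio)."""
    return gaussian_blur(img, sigma) - gaussian_blur(img, k * sigma)


def cartoon_like(img, sigma=1.0, k=1.6, levels=4, edge_threshold=0.02):
    """Flatten tones to a few grey levels and draw dark lines on DoG edges."""
    # Assumption: Gaussian smoothing replaces the bilateral filter here.
    smooth = gaussian_blur(img, sigma)
    quantised = np.round(smooth * (levels - 1)) / (levels - 1)
    edges = np.abs(difference_of_gaussians(img, sigma, k)) > edge_threshold
    return np.where(edges, 0.0, quantised)


if __name__ == "__main__":
    # Toy input: a dark half and a bright half separated by a vertical step.
    image = np.zeros((16, 16))
    image[:, 8:] = 1.0
    print(cartoon_like(image)[0])
```

The sketch also shows why such pipelines offer little style diversity: the look is fixed by a handful of filter parameters rather than learned from an example artwork.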
[Figure 2: taxonomy diagram. Recoverable node labels: Example-Based Techniques; Image Analogy; Neural Style Transfer; Image-Optimisation-Based Online Neural Methods (Parametric Neural Methods with Summary Statistics, Non-parametric Neural Methods with MRFs); Model-Optimisation-Based Offline Neural Methods (Per-Style-Per-Model, Multiple-Style-Per-Model, Arbitrary-Style-Per-Model Neural Methods). Per-method leaf labels omitted.]

Figure 2: A taxonomy of NST techniques. Our proposed NST taxonomy extends the IB-AR taxonomy proposed by
Kyprianidis et al. [14].

Summary. Based on the above discussions, although some IB-AR algorithms without CNNs are capable of faithfully depicting certain prescribed styles, they typically have limitations in flexibility, style diversity, and effective image structure extraction. Therefore, there is a demand for novel algorithms to address these limitations, which gives birth to the field of NST.

3 DERIVATIONS OF NEURAL STYLE TRANSFER

For a better understanding of the NST development, we start by introducing its derivations. To automatically transfer an artistic style, the first and most important issue is how to model and extract style from an image. Since style is closely related to texture¹, a straightforward way is to relate Visual Style Modelling back to previously well-studied Visual Texture Modelling methods. After obtaining the style representation, the next issue is how to reconstruct an image with desired style information while preserving its content, which is addressed by the Image Reconstruction techniques.

¹ We clarify that style is closely related to texture but not limited to texture. Style also involves a large degree of simplification and shape abstraction effects, which falls back to the composition or alignment of texture features.

3.1 Visual Texture Modelling

Visual texture modelling [24] is previously studied as the heart of texture synthesis [25], [26]. Historically, there are two distinct approaches to model visual textures, which are Parametric Texture Modelling with Summary Statistics and Non-parametric Texture Modelling with Markov Random Fields (MRFs).

1) Parametric Texture Modelling with Summary Statistics. One path towards texture modelling is to capture image statistics from a sample texture and exploit summary statistical property to model the texture. The idea is first proposed by Julesz [27], who models textures as pixel-based $N$-th order statistics. Later, the work in [28] exploits filter responses to analyse textures, instead of direct pixel-based measurements. After that, Portilla and Simoncelli [29] further introduce a texture model based on multi-scale orientated filter responses and use gradient descent to improve synthesised results. A more recent parametric texture modelling approach proposed by Gatys et al. [30] is the first to measure summary statistics in the domain of a CNN. They design a Gram-based representation to model textures, which consists of the correlations between filter responses in different layers of a pre-trained classification network (VGG network) [31]. More specifically, the Gram-based representation encodes the second order statistics of the set of CNN filter responses. Next, we will explain this representation in detail for the usage of the following sections.

Assume that the feature map of a sample texture image $I_s$ at layer $l$ of a pre-trained deep classification network is $F^l(I_s) \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, and $H$ and $W$ represent the height and width of the feature map $F^l(I_s)$. Then the Gram-based representation can be obtained by computing the Gram matrix $G(F^l(I_s)') \in \mathbb{R}^{C \times C}$ over the feature map $F^l(I_s)' \in \mathbb{R}^{C \times (HW)}$ (a reshaped version of $F^l(I_s)$):

$$G(F^l(I_s)') = [F^l(I_s)'][F^l(I_s)']^T. \quad (1)$$

This Gram-based texture representation from a CNN is effective at modelling wide varieties of both natural and non-natural textures. However, the Gram-based representation is designed to capture global statistics and discards spatial arrangements, which leads to unsatisfying results for modelling regular textures with long-range symmetric structures. To address this problem, Berger and Memisevic [32] propose to horizontally and vertically translate feature
maps by δ pixels to correlate the feature at position (i, j) with those at positions (i + δ, j) and (i, j + δ). In this way, the representation incorporates spatial arrangement information and is therefore more effective at modelling textures with symmetric properties.

2) Non-parametric Texture Modelling with MRFs. Another notable texture modelling methodology is to use non-parametric resampling. A variety of non-parametric methods are based on the MRF model, which assumes that in a texture image, each pixel is entirely characterised by its spatial neighbourhood. Under this assumption, Efros and Leung [25] propose to synthesise each pixel one by one by searching similar neighbourhoods in the source texture image and assigning the corresponding pixel. Their work is one of the earliest non-parametric algorithms with MRFs. Following their work, Wei and Levoy [26] further speed up the neighbourhood matching process by always using a fixed neighbourhood.

3.2 Image Reconstruction

In general, an essential step for many vision tasks is to extract an abstract representation from the input image. Image reconstruction is a reverse process, which is to reconstruct the whole input image from the extracted image representation. It is previously studied to analyse a particular image representation and discover what information is contained in the abstract representation. Here our major focus is on CNN representation based image reconstruction algorithms, which can be categorised into Image-Optimisation-Based Online Image Reconstruction (IOB-IR) and Model-Optimisation-Based Offline Image Reconstruction (MOB-IR).

1) Image-Optimisation-Based Online Image Reconstruction. The first algorithm to reverse CNN representations is proposed by Mahendran and Vedaldi [33], [34]. Given a CNN representation to be reversed, their algorithm iteratively optimises an image (generally starting from random noise) until it has a similar desired CNN representation. The iterative optimisation process is based on gradient descent in image space. Therefore, the process is time-consuming especially when the desired reconstructed image is large.

2) Model-Optimisation-Based Offline Image Reconstruction. To address the efficiency issue of [33], [34], Dosovitskiy and Brox [35] propose to train a feed-forward network in advance and put the computational burden at training stage. At testing stage, the reverse process can be simply done with a network forward pass. Their algorithm significantly speeds up the image reconstruction process. In their later work [36], they further combine Generative Adversarial Network (GAN) [37] to improve the results.

4 A TAXONOMY OF NEURAL STYLE TRANSFER ALGORITHMS

NST is a subset of the aforementioned example-based IB-AR techniques. In this section, we first provide a categorisation of NST algorithms and then explain major 2D image based non-photorealistic NST algorithms (Figure 2, purple boxes) in detail. More specifically, for each algorithm, we start by introducing the main idea and then discuss its weaknesses and strengths. Since it is complex to define the notion of style [3], [38] and therefore very subjective to define what criteria are important to make a successful style transfer algorithm [39], here we try to evaluate these algorithms in a more structural way by only focusing on details, semantics, depth and variations in brush strokes². We will discuss more about the problem of aesthetic evaluation criterion in Section 8 and also present more evaluation results in Section 6.

² We claim that the visual criteria with respect to a successful style transfer are definitely not limited to these factors.

Our proposed taxonomy of NST techniques is shown in Figure 2. We keep the taxonomy of IB-AR techniques proposed by Kyprianidis et al. [14] unaffected and extend it by NST algorithms. Current NST methods fit into one of two categories, Image-Optimisation-Based Online Neural Methods (IOB-NST) and Model-Optimisation-Based Offline Neural Methods (MOB-NST). The first category transfers the style by iteratively optimising an image, i.e., algorithms belonging to this category are built upon IOB-IR techniques. The second category optimises a generative model offline and produces the stylised image with a single forward pass, which exploits the idea of MOB-IR techniques.

4.1 Image-Optimisation-Based Online Neural Methods

DeepDream [40] is the first attempt to produce artistic images by reversing CNN representations with IOB-IR techniques. By further combining Visual Texture Modelling techniques to model style, IOB-NST algorithms are subsequently proposed, which build the early foundations for the field of NST. Their basic idea is to first model and extract style and content information from the corresponding style and content images, recombine them as the target representation, and then iteratively reconstruct a stylised result that matches the target representation. In general, different IOB-NST algorithms share the same IOB-IR technique, but differ in the way they model the visual style, which is built on the aforementioned two categories of Visual Texture Modelling techniques. The common limitation of IOB-NST algorithms is that they are computationally expensive, due to the iterative image optimisation procedure.

4.1.1 Parametric Neural Methods with Summary Statistics

The first subset of IOB-NST methods is based on Parametric Texture Modelling with Summary Statistics. The style is characterised as a set of spatial summary statistics.

We start by introducing the first NST algorithm proposed by Gatys et al. [4], [10]. By reconstructing representations from intermediate layers of the VGG-19 network, Gatys et al. observe that a deep convolutional neural network is capable of extracting image content from an arbitrary photograph and some appearance information from the well-known artwork. According to this observation, they build the content component of the newly stylised image by penalising the difference of high-level representations derived from content and stylised images, and further build the style component by matching Gram-based summary statistics of style and stylised images, which is derived from their proposed texture modelling technique [30] (Section 3.1). The details of their algorithm are as follows.
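
Before the formal definitions, the two components described above can be sketched in NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the feature maps are assumed to be already extracted as `(C, H, W)` arrays for a single layer (random arrays stand in for VGG responses here), normalisation constants and per-layer weights are omitted, and all function names are hypothetical.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, following Eq. (1) of Section 3.1."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)  # reshaped feature map, C x (HW)
    return flat @ flat.T                   # C x C channel correlations


def content_loss(f_content, f_stylised):
    """Squared Euclidean distance between two feature maps of one layer."""
    return float(np.sum((f_content - f_stylised) ** 2))


def style_loss(f_style, f_stylised):
    """Squared Euclidean distance between the Gram matrices of one layer."""
    diff = gram_matrix(f_style) - gram_matrix(f_stylised)
    return float(np.sum(diff ** 2))


def total_loss(f_content, f_style, f_stylised, alpha=1.0, beta=1.0):
    """Weighted sum of the content and style components (single-layer sketch)."""
    return (alpha * content_loss(f_content, f_stylised)
            + beta * style_loss(f_style, f_stylised))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for CNN feature maps of the content, style and stylised images.
    f_c = rng.standard_normal((3, 4, 5))
    f_s = rng.standard_normal((3, 4, 5))
    f_x = rng.standard_normal((3, 4, 5))
    print(total_loss(f_c, f_s, f_x, alpha=1.0, beta=0.5))
```

Because the Gram matrix sums over all spatial positions, shuffling the positions of a feature map leaves the style term unchanged while the content term grows. This is the sense in which the Gram-based representation keeps global statistics and drops spatial arrangement.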
5

Given a content image $I_c$ and a style image $I_s$, the algorithm in [4] seeks a stylised image $I$ that minimises the following objective:

$$I^* = \arg\min_I \mathcal{L}_{total}(I_c, I_s, I) = \arg\min_I \alpha \mathcal{L}_c(I_c, I) + \beta \mathcal{L}_s(I_s, I), \quad (2)$$

where $\mathcal{L}_c$ compares the content representation of a given content image to that of the stylised image, and $\mathcal{L}_s$ compares the Gram-based style representation derived from a style image to that of the stylised image. $\alpha$ and $\beta$ are used to balance the content component and style component in the stylised result.

The content loss $\mathcal{L}_c$ is defined by the squared Euclidean distance between the feature representations $\mathcal{F}^l$ of the content image $I_c$ in layer $l$ and that of the stylised image $I$, which is initialised with a noise image:

$$\mathcal{L}_c = \sum_{l \in \{l_c\}} \| \mathcal{F}^l(I_c) - \mathcal{F}^l(I) \|^2, \quad (3)$$

where $\{l_c\}$ denotes the set of VGG layers for computing the content loss. For the style loss $\mathcal{L}_s$, [4] exploits the Gram-based visual texture modelling technique to model the style, which has already been explained in Section 3.1. Therefore, the style loss is defined by the squared Euclidean distance between the Gram-based style representations of $I_s$ and $I$:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \| \mathcal{G}(\mathcal{F}^l(I_s)') - \mathcal{G}(\mathcal{F}^l(I)') \|^2, \quad (4)$$

where $\mathcal{G}$ is the aforementioned Gram matrix to encode the second order statistics of the set of filter responses. $\{l_s\}$ represents the set of VGG layers for calculating the style loss.

The choice of content and style layers is an important factor in the process of style transfer. Different positions and numbers of layers can result in very different visual experiences. Given the pre-trained VGG-19 [31] as the loss network, Gatys et al.’s choice of $\{l_s\}$ and $\{l_c\}$ in [4] is $\{l_s\}$ = {relu1_1, relu2_1, relu3_1, relu4_1, relu5_1} and $\{l_c\}$ = {relu4_2}. For $\{l_s\}$, the idea of combining multiple layers (up to higher layers) is critical for the success of Gatys et al.’s NST algorithm. Matching the multi-scale style representations leads to a smoother and more continuous stylisation, which gives the visually most appealing results [4]. For the content layer $\{l_c\}$, matching the content representations on a lower layer preserves the undesired fine structures (e.g., edges and colour map) of the original content image during stylisation. In contrast, by matching the content on a higher layer of the network, the fine structures can be altered to agree with the desired style while preserving the content information of the content image. Also, using VGG-based loss networks for style transfer is not the only option. Similar performance can be achieved by selecting other pre-trained classification networks, e.g., ResNet [41].

In Equation (2), both $\mathcal{L}_c$ and $\mathcal{L}_s$ are differentiable. Thus, with random noise as the initial $I$, Equation (2) can be minimised by using gradient descent in image space with backpropagation. In addition, a total variation denoising term is usually added in practice to encourage smoothness in the stylised result.

The algorithm of Gatys et al. does not need ground truth data for training and also does not have explicit restrictions on the type of style images, which addresses the limitations of previous IB-AR algorithms without CNNs (Section 2). However, the algorithm of Gatys et al. does not perform well in preserving the coherence of fine structures and details during stylisation since CNN features inevitably lose some low-level information. Also, it generally fails for photorealistic synthesis, due to the limitations of Gram-based style representation. Moreover, it does not consider the variations of brush strokes and the semantics and depth information contained in the content image, which are important factors in evaluating the visual quality.

In addition, a Gram-based style representation is not the only choice to statistically encode style information. There are also some other effective statistical style representations, which are derived from a Gram-based representation. Li et al. [42] derive some different style representations by considering style transfer in the domain of transfer learning, or more specifically, domain adaptation [43]. Given that training and testing data are drawn from different distributions, the goal of domain adaptation is to adapt a model trained on labelled training data from a source domain to predict labels of unlabelled testing data from a target domain. One way to perform domain adaptation is to match a sample in the source domain to that in the target domain by minimising their distribution discrepancy, for which Maximum Mean Discrepancy (MMD) is a popular choice to measure the discrepancy between two distributions. Li et al. prove that matching Gram-based style representations between a pair of style and stylised images is intrinsically minimising MMD with a quadratic polynomial kernel. Therefore, it is expected that other kernel functions for MMD can be equally applied in NST, e.g., the linear kernel, polynomial kernel and Gaussian kernel. Another related representation is the batch normalisation (BN) statistic representation, which is to use mean and variance of the feature maps in VGG layers to model style:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \frac{1}{C^l} \sum_{c=1}^{C^l} \Big( \| \mu(\mathcal{F}_c^l(I_s)) - \mu(\mathcal{F}_c^l(I)) \|^2 + \| \sigma(\mathcal{F}_c^l(I_s)) - \sigma(\mathcal{F}_c^l(I)) \|^2 \Big), \quad (5)$$

where $\mathcal{F}_c^l \in \mathbb{R}^{H \times W}$ is the $c$-th feature map channel at layer $l$ of the VGG network, and $C^l$ is the number of channels.

The main contribution of Li et al.’s algorithm is to theoretically demonstrate that the Gram matrices matching process in NST is equivalent to minimising MMD with the second order polynomial kernel, thus proposing a timely interpretation of NST and making the principle of NST clearer. However, the algorithm of Li et al. does not resolve the aforementioned limitations of Gatys et al.’s algorithm.

One limitation of the Gram-based algorithm is its instability during optimisation. Also, it requires manually tuning the parameters, which is very tedious. Risser et al. [44] find that feature activations with quite different means and variances can still have the same Gram matrix, which is the main cause of the instabilities. Inspired by this observation, Risser et al. introduce an extra histogram loss, which guides the optimisation to match the entire histogram of feature activations. They also present a preliminary solution to automatic parameter tuning, which is to explicitly prevent gradients with extreme values through extreme gradient normalisation.
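As a rough illustration of Equations (2) to (4), the following NumPy sketch computes the content loss and the Gram-based style loss from feature maps that are assumed to have been extracted already (e.g., from VGG) as arrays of shape (C, H, W). The function names, the default weights, and the absence of any normalisation constant in the Gram matrix are assumptions made here for illustration; they are not details taken from [4].

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix G(F') of a feature map given as an array of shape (C, H, W)."""
    flat = feat.reshape(feat.shape[0], -1)   # F': one row per channel, shape (C, H*W)
    return flat @ flat.T                     # (C, C) inner products between channels

def content_loss(feat_content, feat_stylised):
    """One layer's term of Eq. (3): squared Euclidean distance between features."""
    return float(np.sum((feat_content - feat_stylised) ** 2))

def style_loss(feat_style, feat_stylised):
    """One layer's term of Eq. (4): squared distance between Gram matrices."""
    diff = gram_matrix(feat_style) - gram_matrix(feat_stylised)
    return float(np.sum(diff ** 2))

def total_loss(content_pairs, style_pairs, alpha=1.0, beta=1.0):
    """Eq. (2): weighted sum over the chosen content and style layers.

    Each argument is a list of (reference_features, stylised_features) pairs,
    one pair per layer in {l_c} or {l_s}. alpha and beta are placeholder values.
    """
    l_c = sum(content_loss(ref, out) for ref, out in content_pairs)
    l_s = sum(style_loss(ref, out) for ref, out in style_pairs)
    return alpha * l_c + beta * l_s

# Toy check: shuffling spatial positions changes the content term
# but leaves the Gram matrix (and hence the style term) unchanged.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
perm = rng.permutation(8 * 8)
shuffled = feat.reshape(4, -1)[:, perm].reshape(4, 8, 8)

print("content loss:", content_loss(feat, shuffled))
print("style loss:  ", style_loss(feat, shuffled))
```

The toy check makes one property of the Gram representation concrete: because it only sums products of channel responses over all positions, it is insensitive to where in the image a response occurs.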

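The equivalence between Gram matching and MMD with a second order polynomial kernel, attributed above to Li et al. [42], can be checked numerically. If every spatial position of a feature map is treated as one C-dimensional sample, the squared Frobenius distance between two unnormalised Gram matrices equals the unnormalised, biased squared MMD estimate under the kernel $k(x, y) = (x^\top y)^2$. The sketch below assumes unnormalised Gram matrices; the helper names are illustrative and do not come from [42].

```python
import numpy as np

def gram(flat):
    """Unnormalised Gram matrix of features flattened to shape (C, N)."""
    return flat @ flat.T

def poly2_kernel_sum(x, y):
    """Sum of k(x_i, y_j) = (x_i^T y_j)^2 over all pairs of columns of x and y."""
    return float(np.sum((x.T @ y) ** 2))

def unnormalised_mmd2(x, y):
    """Biased squared MMD estimate without the 1/N^2 factors."""
    return (poly2_kernel_sum(x, x)
            - 2.0 * poly2_kernel_sum(x, y)
            + poly2_kernel_sum(y, y))

rng = np.random.default_rng(0)
style = rng.standard_normal((8, 50))      # C = 8 channels, N = 50 positions
stylised = rng.standard_normal((8, 50))

gram_distance = float(np.sum((gram(style) - gram(stylised)) ** 2))
mmd2 = unnormalised_mmd2(style, stylised)
print(gram_distance, mmd2)
```

When both feature maps have the same number of positions N, dividing either quantity by N squared gives the usual biased MMD estimate, so minimising the Gram-based style loss and minimising this MMD are the same problem up to a constant factor.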
By additionally matching the histogram of feature activations, the algorithm of Risser et al. achieves a more stable style transfer with fewer iterations and less parameter tuning effort. However, its benefit comes at the expense of a high computational complexity. Also, the aforementioned weaknesses of Gatys et al.’s algorithm still exist, e.g., a lack of consideration of depth and the coherence of details.

All these aforementioned neural methods only compare content and stylised images in the CNN feature space to make the stylised image semantically similar to the content image. But since CNN features inevitably lose some low-level information contained in the image, there are usually some unappealing distorted structures and irregular artefacts in the stylised results. To preserve the coherence of fine structures during stylisation, Li et al. [45] propose to incorporate additional constraints upon low-level features in pixel space. They introduce an additional Laplacian loss, which is defined as the squared Euclidean distance between the Laplacian filter responses of a content image and stylised result. The Laplacian filter computes the second order derivatives of the pixels in an image and is widely used for edge detection.

The algorithm of Li et al. performs well in preserving the fine structures and details during stylisation. But it still does not consider semantics, depth, variations in brush strokes, etc.

4.1.2 Non-parametric Neural Methods with MRFs

Non-parametric IOB-NST is built on the basis of Non-parametric Texture Modelling with MRFs. This category considers NST at a local level, i.e., operating on patches to match the style.

Li and Wand [46] are the first to propose an MRF-based NST algorithm. They find that the parametric NST method with summary statistics only captures the per-pixel feature correlations and does not constrain the spatial layout, which leads to a less visually plausible result for photorealistic styles. Their solution is to model the style in a non-parametric way and introduce a new style loss function which includes a patch-based MRF prior:

$$\mathcal{L}_s = \sum_{l \in \{l_s\}} \sum_{i=1}^{m} \| \Psi_i(\mathcal{F}^l(I)) - \Psi_{NN(i)}(\mathcal{F}^l(I_s)) \|^2, \quad (6)$$

where $\Psi(\mathcal{F}^l(I))$ is the set of all local patches from the feature map $\mathcal{F}^l(I)$. $\Psi_i$ denotes the $i$-th local patch and $\Psi_{NN(i)}$ is the style patch most similar to the $i$-th local patch in the stylised image $I$. The best matching $\Psi_{NN(i)}$ is obtained by calculating normalised cross-correlation over all style patches in the style image $I_s$. $m$ is the total number of local patches. Since their algorithm matches a style at the patch level, the fine structure and arrangement can be preserved much better.

The advantage of the algorithm of Li and Wand is that it performs especially well for photorealistic styles, or more specifically, when the content photo and the style are similar in shape and perspective, due to the patch-based MRF loss. However, it generally fails when the content and style images have strong differences in perspective and structure since the image patches cannot be correctly matched. It is also limited in preserving sharp details and depth information.

4.2 Model-Optimisation-Based Offline Neural Methods

Although IOB-NST is able to yield impressive stylised images, there are still some limitations. The limitation of greatest concern is efficiency. The second category, MOB-NST, addresses the speed and computational cost issue by exploiting MOB-IR to reconstruct the stylised result, i.e., a feed-forward network $g$ is optimised over a large set of images $I_c$ for one or more style images $I_s$:

$$\theta^* = \arg\min_\theta \mathcal{L}_{total}(I_c, I_s, g_\theta(I_c)), \quad I^* = g_{\theta^*}(I_c). \quad (7)$$

Depending on the number of artistic styles a single $g$ can produce, MOB-NST algorithms are further divided into Per-Style-Per-Model (PSPM) MOB-NST methods, Multiple-Style-Per-Model (MSPM) MOB-NST methods, and Arbitrary-Style-Per-Model (ASPM) MOB-NST methods.

4.2.1 Per-Style-Per-Model Neural Methods

1) Parametric PSPM with Summary Statistics. The first two MOB-NST algorithms are proposed by Johnson et al. [47] and Ulyanov et al. [48] respectively. These two methods share a similar idea, which is to pre-train a feed-forward style-specific network and produce a stylised result with a single forward pass at testing stage. They only differ in the network architecture, for which Johnson et al.’s design roughly follows the network proposed by Radford et al. [49] but with residual blocks as well as fractionally strided convolutions, and Ulyanov et al. use a multi-scale architecture as the generator network. The objective function is similar to the algorithm of Gatys et al. [4], which indicates that they are also Parametric Methods with Summary Statistics.

The algorithms of Johnson et al. and Ulyanov et al. achieve real-time style transfer. However, their algorithm design basically follows the algorithm of Gatys et al., which makes them suffer from the same aforementioned issues as Gatys et al.’s algorithm (e.g., a lack of consideration of the coherence of details and depth information).

Shortly after [47], [48], Ulyanov et al. [50] further find that simply applying normalisation to every single image rather than a batch of images (precisely batch normalisation (BN)) leads to a significant improvement in stylisation quality. This single image normalisation is called instance normalisation (IN), which is equivalent to batch normalisation when the batch size is set to 1. The style transfer network with IN is shown to converge faster than BN and also achieves visually better results. One interpretation is that IN is a form of style normalisation and can directly normalise the style of each content image to the desired style [51]. Therefore, the objective is easier to learn as the rest of the network only needs to take care of the content loss.

2) Non-parametric PSPM with MRFs. Another work by Li and Wand [52] is inspired by the MRF-based NST [46] algorithm in Section 4.1.2. They address the efficiency issue by training a Markovian feed-forward network using adversarial training. Similar to [46], their algorithm is a Patch-based Non-parametric Method with MRFs. Their method is shown to outperform the algorithms of Johnson et al. and Ulyanov et al. in the preservation of coherent textures in complex images, thanks to their patch-based design. However, their algorithm has a less satisfying performance with non-texture styles (e.g., face images), since their algorithm
lacks a consideration in semantics. Other weaknesses of their algorithm include a lack of consideration of depth information and variations of brush strokes, which are important visual factors.

4.2.2 Multiple-Style-Per-Model Neural Methods

Although the above PSPM approaches can produce stylised images two orders of magnitude faster than previous IOB-NST methods, separate generative networks have to be trained for each particular style image, which is quite time-consuming and inflexible. But many paintings (e.g., impressionist paintings) share similar paint strokes and only differ in their colour palettes. Intuitively, it is redundant to train a separate network for each of them. MSPM is therefore proposed, which improves the flexibility of PSPM by further incorporating multiple styles into one single model. There are generally two paths towards handling this problem: 1) tying only a small number of parameters in a network to each style ([53], [54]) and 2) still exploiting only a single network like PSPM but combining both style and content as inputs ([55], [56]).

1) Tying only a small number of parameters to each style. An early work by Dumoulin et al. [53] is built on the basis of the proposed IN layer in the PSPM algorithm [50] (Section 4.2.1). They surprisingly find that using the same convolutional parameters but only scaling and shifting parameters in IN layers is sufficient to model different styles. Therefore, they propose an algorithm to train a conditional multi-style transfer network based on conditional instance normalisation (CIN), which is defined as:

$$\mathrm{CIN}(\mathcal{F}(I_c), s) = \gamma^s \left( \frac{\mathcal{F}(I_c) - \mu(\mathcal{F}(I_c))}{\sigma(\mathcal{F}(I_c))} \right) + \beta^s, \quad (8)$$

where $\mathcal{F}$ is the input feature activation and $s$ is the index of the desired style from a set of style images. As shown in Equation (8), the conditioning for each style $I_s$ is done by scaling and shifting parameters $\gamma^s$ and $\beta^s$ after normalising feature activation $\mathcal{F}(I_c)$, i.e., each style $I_s$ can be achieved by tuning parameters of an affine transformation. The interpretation is similar to that for [50] in Section 4.2.1, i.e., the normalisation of feature statistics with different affine parameters can normalise the input content image to different styles. Furthermore, the algorithm of Dumoulin et al. can also be extended to combine multiple styles in a single stylised result by combining affine parameters of different styles.

Another algorithm which follows the first path of MSPM is proposed by Chen et al. [54]. Their idea is to explicitly decouple style and content, i.e., using separate network components to learn the corresponding content and style information. More specifically, they use mid-level convolutional filters (called “StyleBank” layer) to individually learn different styles. Each style is tied to a set of parameters in the “StyleBank” layer. The remaining components in the network are used to learn content information, which is shared by different styles. Their algorithm also supports flexible incremental training, which is to fix the content components in the network and only train a “StyleBank” layer for a new style.

In summary, both the algorithms of Dumoulin et al. and Chen et al. have the benefits of little effort needed to learn a new style and a flexible control over style fusion. However, they do not address the common limitations of NST algorithms, e.g., a lack of details, semantics, depth and variations in brush strokes.

2) Combining both style and content as inputs. One disadvantage of the first category is that the model size generally becomes larger as the number of learned styles increases. The second path of MSPM addresses this limitation by fully exploring the capability of one single network and combining both content and style into the network for style identification. Different MSPM algorithms differ in the way they incorporate style into the network.

In [55], given $N$ target styles, Li et al. design a selection unit for style selection, which is an $N$-dimensional one-hot vector. Each bit in the selection unit represents a specific style $I_s$ in the set of target styles. For each bit in the selection unit, Li et al. first sample a corresponding noise map $f(I_s)$ from a uniform distribution and then feed $f(I_s)$ into the style sub-network to obtain the corresponding style encoded features $\mathcal{F}(f(I_s))$. By feeding the concatenation of the style encoded features $\mathcal{F}(f(I_s))$ and the content encoded features $Enc(I_c)$ into the decoder part $Dec$ of the style transfer network, the desired stylised result can be produced: $I = Dec(\mathcal{F}(f(I_s)) \oplus Enc(I_c))$.

Another work by Zhang and Dana [56] first forwards each style image in the style set through the pre-trained VGG network and obtains multi-scale feature activations $\mathcal{F}(I_s)$ in different VGG layers. Then the multi-scale $\mathcal{F}(I_s)$ are combined with multi-scale encoded features $Enc(I_c)$ from different layers in the encoder through their proposed inspiration layers. The inspiration layers are designed to reshape $\mathcal{F}(I_s)$ to match the desired dimension, and also have a learnable weight matrix to tune feature maps to help minimise the objective function.

The second type of MSPM addresses the limitation of the increased model size in the first type of MSPM. The price is that the style scalability of the second type of MSPM is much smaller, since only one single network is used for multiple styles. We will quantitatively compare the style scalability of different MSPM algorithms in Section 6. In addition, some aforementioned limitations in the first type of MSPM still exist, i.e., the second type of MSPM algorithms are still limited in preserving the coherence of fine structures and also depth information.

4.2.3 Arbitrary-Style-Per-Model Neural Methods

The third category, ASPM-MOB-NST, aims at one-model-for-all, i.e., one single trainable model to transfer arbitrary artistic styles. There are also two types of ASPM, one built upon Non-parametric Texture Modelling with MRFs and the other one built upon Parametric Texture Modelling with Summary Statistics.

1) Non-parametric ASPM with MRFs. The first ASPM algorithm is proposed by Chen and Schmidt [57]. They first extract a set of activation patches from content and style feature activations computed in a pre-trained VGG network. Then they match each content patch to the most similar style patch and swap them (called “Style Swap” in [57]). The stylised result can be produced by reconstructing the resulting activation map after “Style Swap”, with either IOB-IR or MOB-IR techniques. The algorithm of Chen and
Schmidt is more flexible than the previous approaches due to its characteristic of one-model-for-all-style. But the stylised results of [57] are less appealing since the content patches are typically swapped with the style patches which are not representative of the desired style. As a result, the content is well preserved while the style is generally not well reflected.

2) Parametric ASPM with Summary Statistics. Considering [53] in Section 4.2.2, the simplest approach for arbitrary style transfer is to train a separate parameter prediction network $P$ to predict $\gamma^s$ and $\beta^s$ in Equation (8) with a number of training styles [58]. Given a test style image $I_s$, CIN layers in the style transfer network take affine parameters $\gamma^s$ and $\beta^s$ from $P(I_s)$, and normalise the input content image to the desired style with a forward pass.

Another similar approach based on [53] is proposed by Huang and Belongie [51]. Instead of training a parameter prediction network, Huang and Belongie propose to modify conditional instance normalisation (CIN) in Equation (8) to adaptive instance normalisation (AdaIN):

$$\mathrm{AdaIN}(\mathcal{F}(I_c), \mathcal{F}(I_s)) = \sigma(\mathcal{F}(I_s)) \left( \frac{\mathcal{F}(I_c) - \mu(\mathcal{F}(I_c))}{\sigma(\mathcal{F}(I_c))} \right) + \mu(\mathcal{F}(I_s)). \quad (9)$$

AdaIN transfers the channel-wise mean and variance feature statistics between content and style feature activations, which also shares a similar idea with [57]. Different from [53], the encoder in the style transfer network of [51] is fixed and comprises the first few layers in a pre-trained VGG network. Therefore, $\mathcal{F}$ in [51] is the feature activation from a pre-trained VGG network. The decoder part needs to be trained with a large set of style and content images to decode resulting feature activations after AdaIN to the stylised result: $I = Dec(\mathrm{AdaIN}(\mathcal{F}(I_c), \mathcal{F}(I_s)))$.

The algorithm of Huang and Belongie [51] is the first ASPM algorithm that achieves real-time stylisation. However, the algorithm of Huang and Belongie [51] is data-driven and limited in generalising on unseen styles. Also, simply adjusting the mean and variance of feature statistics makes it hard to synthesise complicated style patterns with rich details and local structures.

A more recent work by Li et al. [59] attempts to exploit a series of feature transformations to transfer arbitrary artistic style in a style-learning-free manner. Similar to [51], Li et al. use the first few layers of pre-trained VGG as the encoder and train the corresponding decoder. But they replace the AdaIN layer [51] in between the encoder and decoder with a pair of whitening and colouring transformations (WCT): $I = Dec(\mathrm{WCT}(\mathcal{F}(I_c), \mathcal{F}(I_s)))$. Their algorithm is built on the observation that the whitening transformation can remove the style related information and preserve the structure of content. Therefore, receiving content activations $\mathcal{F}(I_c)$ from the encoder, the whitening transformation can filter the original style out of the input content image and return a filtered representation with only content information. Then, by applying the colouring transformation, the style patterns contained in $\mathcal{F}(I_s)$ are incorporated into the filtered content representation, and the stylised result $I$ can be obtained by decoding the transformed features. They also extend this single-level stylisation to multi-level stylisation to further improve visual quality.

The algorithm of Li et al. is the first ASPM algorithm to transfer artistic styles in a learning-free manner. Therefore, compared with [51], it does not have the limitation in generalisation capabilities. But the algorithm of Li et al. is still not effective at producing sharp details and fine strokes. The stylisation results will be shown in Section 6. Also, it does not consider preserving depth information and variations in brush strokes.

5 IMPROVEMENTS AND EXTENSIONS

Since the emergence of NST algorithms, there has also been research devoted to improving current NST algorithms by controlling perceptual factors (e.g., stroke size control, spatial style control, and colour control) (Figure 2, green boxes). Also, all of the aforementioned NST methods are designed for general still images. They may not be appropriate for specialised types of images and videos (e.g., doodles, head portraits, and video frames). Thus, a variety of follow-up studies (Figure 2, pink boxes) aim to extend general NST algorithms to these particular types of images and even extend them beyond artistic image style (e.g., audio style).

Controlling Perceptual Factors in Neural Style Transfer. Gatys et al. themselves [60] propose several slight modifications to improve their previous algorithm [4]. They demonstrate a spatial style control strategy to control the style in each region of the content image. Their idea is to define guidance channels for the feature activations for both content and style image. The guidance channel has values in [0, 1] specifying which style should be transferred to which content region, i.e., the content regions where the content guidance channel is 1 should be rendered with the style where the style guidance channel is equal to 1. As for colour control, the original NST algorithm produces stylised images with the colour distribution of the style image. However, sometimes people prefer a colour-preserving style transfer, i.e., preserving the colour of the content image during style transfer. The corresponding solution is to first transform the style image’s colours to match the content image’s colours before style transfer, or alternatively perform style transfer only in the luminance channel.

For stroke size control, the problem is much more complex. We show sample results of stroke size control in Figure 3. The discussion of stroke size control strategies needs to be split into several cases [61]:

1) IOB-NST with non-high-resolution images: Since current style statistics (e.g., Gram-based and BN-based statistics) are scale-sensitive [61], to achieve different stroke sizes, the solution is simply resizing a given style image to different scales.

2) MOB-NST with non-high-resolution images: One possible solution is to resize the input image to different scales before the forward pass, which inevitably hurts stylisation quality. Another possible solution is to train multiple models with different scales of a style image, which is space and time consuming. Also, this solution fails to preserve stroke consistency among results with different stroke sizes, i.e., the results vary in stroke orientations, stroke configurations, etc. However, users generally desire to only change
the stroke size but not others. To address this problem, Jing et al. [61] propose a stroke controllable PSPM algorithm. The core component of their algorithm is a StrokePyramid module, which learns different stroke sizes with adaptive receptive fields. Without trading off quality and speed, their algorithm is the first to exploit one single model to achieve flexible continuous stroke size control while preserving stroke consistency, and further achieve spatial stroke size control to produce new artistic effects. Although one can also use an ASPM algorithm to control stroke size, ASPM trades off quality and speed. As a result, ASPM is not effective at producing fine strokes and details compared with [61].

Figure 3: Control the brush stroke size in NST. Panels: (a) content, (b) style, (c) output with a smaller brush size, (d) output with a larger brush size. The style image is “The Starry Night” by Vincent van Gogh.

3) IOB-NST with high-resolution images: For high-resolution images (e.g., 3000 × 3000 pixels in [60]), a large stroke size cannot be achieved by simply resizing the style image to a large scale. Since only the region in the content image with a receptive field size of VGG can be affected by a neuron in the loss network, there is almost no visual difference between a large and a larger brush stroke in a small image region with receptive field size. Gatys et al. [60] tackle this problem by proposing a coarse-to-fine IOB-NST procedure with several steps of downsampling, stylising, upsampling and final stylising.

4) MOB-NST with high-resolution images: Similar to 3), stroke size in the stylised result does not vary with style image scale for high-resolution images. The solution is also similar to Gatys et al.’s algorithm in [60], which is a coarse-to-fine stylisation procedure [62]. The idea is to exploit a multimodel, which comprises multiple subnetworks. Each subnetwork receives the upsampled stylised result of the previous subnetwork as the input, and stylises it again with finer strokes.

Another limitation of current NST algorithms is that they do not consider the depth information contained in the image. To address this limitation, the depth preserving NST algorithm [63] is proposed. Their approach is to add a depth loss function based on [47] to measure the depth difference between the content image and the stylised image. The image depth is acquired by applying a single-image depth estimation algorithm (e.g., Chen et al.’s work in [64]).

Semantic Style Transfer. Given a pair of style and content images which are similar in content, the goal of semantic style transfer is to build a semantic correspondence between the style and content, which maps each style region to a corresponding semantically similar content region. Then the style in each style region is transferred to the semantically similar content region.

1) Image-Optimisation-Based Semantic Style Transfer. Since the patch matching scheme naturally meets the requirements of the region-based correspondence, Champandard [65] proposes to build a semantic style transfer algorithm based on the aforementioned patch-based algorithm [46] (Section 4.1.2). Although the result produced by the algorithm of Li and Wand [46] is close to the target of semantic style transfer, [46] does not incorporate an accurate segmentation mask, which sometimes leads to a wrong semantic match. Therefore, Champandard augments [46] with an additional semantic channel, which is a downsampled semantic segmentation map. The segmentation map can be either manually annotated or obtained from a semantic segmentation algorithm [66], [67]. Despite the effectiveness of [65], an MRF-based design is not the only choice. Instead of combining an MRF prior, Chen and Hsu [68] provide an alternative way for semantic style transfer, which is to exploit a masking out process to constrain the spatial correspondence and also a higher order style feature statistic to further improve the result. More recently, Mechrez et al. [69] propose an alternative contextual loss to realise semantic style transfer in a segmentation-free manner.

2) Model-Optimisation-Based Semantic Style Transfer. As before, efficiency is a major issue. Both [65] and [68] are based on IOB-NST algorithms and therefore leave much room for improvement. Lu et al. [70] speed up the process by optimising the objective function in feature space, instead of in pixel space. More specifically, they propose to do feature reconstruction, instead of image reconstruction as previous algorithms do. This optimisation strategy reduces the computation burden, since the loss does not need to propagate through a deep network. The resulting reconstructed feature is decoded into the final result with a trained decoder. Since the speed of [70] does not reach real-time, there is still considerable room for further research.

Instance Style Transfer. Instance style transfer is built on instance segmentation and aims to stylise only a single user-specified object within an image. The challenge mainly lies in the transition between a stylised object and non-stylised background. Castillo et al. [71] tackle this problem by adding an extra MRF-based loss to smooth and anti-alias boundary pixels.

Doodle Style Transfer. An interesting extension can be found in [65], which is to exploit NST to transform rough sketches into fine artworks. The method simply discards the content loss term and uses doodles as the segmentation map to do semantic style transfer.
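The semantic channel used for semantic and doodle style transfer can be pictured as ordinary nearest-neighbour patch matching in which every feature patch is extended with the corresponding entries of a segmentation map. The NumPy sketch below is only an illustration under several assumptions: patches are single spatial positions (1×1) for brevity, labels are one-hot encoded, matching uses normalised cross-correlation as described for the patch-based loss in Section 4.1.2, and `semantic_weight` is a hypothetical parameter that is not taken from [65].

```python
import numpy as np

def one_hot(labels, num_labels):
    """Turn an (H, W) integer label map into (K, H, W) one-hot channels."""
    return np.eye(num_labels)[labels].transpose(2, 0, 1)

def augment(features, seg_channels, semantic_weight):
    """Stack (C, H, W) features with weighted (K, H, W) semantic channels."""
    return np.concatenate([features, semantic_weight * seg_channels], axis=0)

def match_patches(query, style):
    """For each 1x1 query patch, return the index of the style patch with the
    highest normalised cross-correlation (cosine similarity of the vectors)."""
    q = query.reshape(query.shape[0], -1)            # (D, Nq)
    s = style.reshape(style.shape[0], -1)            # (D, Ns)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    return np.argmax(q.T @ s, axis=1)                # (Nq,)

# Toy data: two semantic regions that sit in different places in each image.
rng = np.random.default_rng(0)
content_labels = np.array([[0, 0, 1, 1]] * 4)        # (4, 4)
style_labels = np.array([[1, 1, 0, 0]] * 4)          # (4, 4)
content_feat = rng.standard_normal((3, 4, 4))
style_feat = rng.standard_normal((3, 4, 4))

matches = match_patches(
    augment(content_feat, one_hot(content_labels, 2), semantic_weight=10.0),
    augment(style_feat, one_hot(style_labels, 2), semantic_weight=10.0),
)
matched_labels = style_labels.reshape(-1)[matches]
print(matched_labels.reshape(4, 4))
```

With a large weight on the semantic channels, each content position is matched only to style positions that carry the same label, which is the region-to-region correspondence that semantic style transfer aims for; with a weight of zero the matching reduces to the purely appearance-based patch matching of [46].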

Stereoscopic Style Transfer. Driven by the demand of diction. By training these two networks jointly, font style
AR/VR, Chen et al. [72] propose a stereoscopic NST al- transfer can be realised in an end-to-end manner.
gorithm for stereoscopic images. They propose a disparity Photorealistic Style Transfer. Photorealistic style trans-
loss to penalise the bidirectional disparity. Their algorithm fer (also known as colour style transfer) aims to transfer
is shown to produce more consistent strokes for different the style of colour distributions. The general idea is to
views. build upon current semantic style transfer but to eliminate
Portrait Style Transfer. Current style transfer algorithms distortions and preserve the original structure of the content
are usually not optimised for head portraits. As they do not image.
impose spatial constraints, directly applying these existing 1) Image-Optimisation-Based Photorealistic Style Transfer.
algorithms to head portraits will deform facial structures, The earliest photorealistic style transfer approach is pro-
which is unacceptable for the human visual system. Selim et posed by Luan et al. [84]. They propose a two-stage opti-
al. [73] address this problem and extend [4] to head portrait misation procedure, which is to initialise the optimisation
painting transfer. They propose to use the notion of gain by stylising a given photo with non-photorealistic style
maps to constrain spatial configurations, which can preserve transfer algorithm [65] and then penalise image distortions
the facial structures while transferring the texture of the by adding a photorealism regularization. But since Luan
style image. et al.’s algorithm is built on the Image-Optimisation-Based
Video Style Transfer. NST algorithms for video sequences Semantic Style Transfer method [65], their algorithm is computationally
were proposed shortly after Gatys et expensive. Similar to [84], another algorithm
al.’s first NST algorithm for still images [4]. Different proposed by Mechrez et al. [85] also adopts a two-stage
from still image style transfer, the design of video style optimisation procedure. They propose to refine the non-
transfer algorithm needs to consider the smooth transi- photorealistic stylised result by matching the gradients in
tion between adjacent video frames. Like before, we di- the output image to those in the content photo. Compared
vide related algorithms into Image-Optimisation-Based and to [84], the algorithm of Mechrez et al. achieves a faster
Model-Optimisation-Based Video Style Transfer. photorealistic stylisation speed.
1) Image-Optimisation-Based Online Video Style Transfer. 2) Model-Optimisation-Based Photorealistic Style Transfer. Li
The first video style transfer algorithm is proposed by Ruder et al. [86] address the efficiency issue of [84] by handling this
et al. [74], [75]. They introduce a temporal consistency loss problem with two steps, the stylisation step and smoothing
based on optical flow to penalise the deviations along point step. The stylisation step is to apply the NST algorithm in
trajectories. The optical flow is calculated by using novel [59] but replace upsampling layers with unpooling layers
optical flow estimation algorithms [76], [77]. As a result, to produce the stylised result with fewer distortions. Then
their algorithm eliminates temporal artefacts and produces the smoothing step further eliminates structural artefacts.
smooth stylised videos. However, they build their algorithm These two aforementioned algorithms [84], [86] are mainly
upon [4] and need several minutes to process a single frame. designed for natural images. Another work in [87] proposes
2) Model-Optimisation-Based Offline Video Style Transfer. to exploit GAN to transfer the colour from human-designed
Several follow-up studies are devoted to stylising a given anime images to sketches. Their algorithm demonstrates a
video in real-time. Huang et al. [78] propose to augment promising application of Photorealistic Style Transfer, which
Ruder et al.’s temporal consistency loss [74] upon a current is the automatic image colourisation.
PSPM algorithm. Given two consecutive frames, the Attribute Style Transfer. Image attributes generally
temporal consistency loss is directly computed using two refer to image colours, textures, etc. Previously, image
corresponding outputs of style transfer network to encour- attribute transfer is accomplished through image analogy
age pixel-wise consistency, and a corresponding two-frame [9] in a supervised manner (Section 2). Derived from the
synergic training strategy is introduced for the computa- idea of patch-based NST [46], Liao et al. [88] propose a deep
tion of temporal consistency loss. Another concurrent work image analogy to study image analogy in the domain of
which shares a similar idea with [78] but with an additional CNN features. Their algorithm is based on a patch matching
exploration of style instability problem can be found in [79]. technique and realises a weakly supervised image analogy,
Different from [78], [79], Chen et al. [80] propose a flow i.e., their algorithm only needs a single pair of source and
subnetwork to produce feature flow and incorporate optical target images instead of a large training set.
flow information in feature space. Their algorithm is built Fashion Style Transfer. Fashion style transfer receives
on a pre-trained style transfer network (an encoder-decoder fashion style image as the target and generates clothing
pair) and wraps feature activations from the pre-trained images with desired fashion styles. The challenge of Fashion
stylisation encoder using the obtained feature flow. Style Transfer lies in the preservation of similar design
Character Style Transfer. Given a style image containing with the basic input clothing while blending desired style
multiple characters, the goal of Character Style Transfer is to patterns. This idea is first proposed by Jiang and Fu [89].
apply the idea of NST to generate new fonts and text effects. They tackle this problem by proposing a pair of fashion style
In [81], Atarsaikhan et al. directly apply the algorithm in [4] generator and discriminator.
to font style transfer and achieve visually plausible results. Audio Style Transfer. In addition to transferring image
Yang et al. [82] instead propose to first characterise style styles, [90], [91] extend the domain of image style to
elements and exploit extracted characteristics to guide the audio style, and synthesise new sounds by transferring
generation of text effects. A more recent work [83] designs the desired style from a target audio. The study of audio
a conditional GAN model for glyph shape prediction, and style transfer also follows the route of image style transfer,
also an ornamentation network for colour and texture pre- i.e., Audio-Optimisation-Based Online Audio Style Transfer and
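The temporal consistency loss described for video style transfer can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Ruder et al. [74] or Huang et al. [78]: frames are single-channel nested lists, the optical flow is a backward flow with integer offsets, and occlusions are given as a binary mask. The names `warp` and `temporal_consistency_loss` are hypothetical. The previous stylised frame is warped along the flow, and the masked mean squared difference to the current stylised frame is returned.

```python
def warp(prev_frame, flow):
    """Warp prev_frame to the current frame using a backward flow.

    flow[y][x] = (dy, dx) says that pixel (y, x) of the current frame
    came from (y + dy, x + dx) in the previous frame. Integer offsets
    are an assumption that keeps the sketch free of interpolation.
    """
    h, w = len(prev_frame), len(prev_frame[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(y + dy, 0), h - 1)  # clamp to the image border
            sx = min(max(x + dx, 0), w - 1)
            warped[y][x] = prev_frame[sy][sx]
    return warped


def temporal_consistency_loss(stylised_prev, stylised_cur, flow, mask):
    """Masked mean squared error between the current stylised frame and
    the flow-warped previous stylised frame.

    mask[y][x] is 1 where the flow is trusted and 0 at occlusions.
    """
    warped = warp(stylised_prev, flow)
    total, count = 0.0, 0
    for y, row in enumerate(stylised_cur):
        for x, value in enumerate(row):
            if mask[y][x]:
                total += (value - warped[y][x]) ** 2
                count += 1
    return total / count if count else 0.0


# Toy example: the scene shifts one pixel to the right between frames.
prev = [[0.0, 1.0, 2.0, 3.0]]
cur = [[0.0, 0.0, 1.0, 2.0]]
flow = [[(0, -1)] * 4]   # each pixel came from its left neighbour
mask = [[0, 1, 1, 1]]    # the leftmost pixel has no valid source
consistent = temporal_consistency_loss(prev, cur, flow, mask)
inconsistent = temporal_consistency_loss(prev, prev, flow, mask)
```

In this toy case the correctly shifted frame incurs no penalty, while repeating the previous frame unchanged is penalised because it ignores the motion described by the flow.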

on cardboard or wool, cotton, polyester, etc. In addition, we


also try to cover a range of image characteristics (such as de-
tails, contrast, complexity and color distributions), inspired
by the works in [92], [93], [95]. More detailed information of
our style images are given in Table 1.
For content images, there are already carefully selected
and well-described benchmark datasets for evaluating styli-
sation by Mould and Rosin [92], [93], [95]. Their proposed
NPR benchmark called NPRgeneral consists of the images
that cover a wide range of characteristics (e.g., contrast,
texture, edges and meaningful structures) and satisfy lots
Figure 4: Diversified style images used in our experiment. of criteria. Therefore, we directly use the selected twenty
images in their proposed NPRgeneral benchmark as our
content images.
For the algorithms based on offline model optimisation, the
MS-COCO dataset [96] is used to perform the training. None
of the content images are used in training.
Principles. To maximise the fairness of the comparisons,
we also obey the following principles during our experiment:
1) In order to cover every detail in each algorithm, we try
to use the implementations provided with the published
papers. To maximise the fairness of comparison, especially
for speed comparison, for [10], we use a popular torch-based
open source code [97], which is also acknowledged by the
authors. In our experiment, except for [32], [53] which are
based on TensorFlow, all the other codes are implemented
based on Torch 7.

Table 1: Detailed information of our style images.

No.  Author                      Name & Year
1    Claude Monet                Three Fishing Boats (1886)
2    Georges Rouault             Head of a Clown (1907)
3    Henri de Toulouse-Lautrec   Divan Japonais (1893)
4    Wassily Kandinsky           White Zig Zags (1922)
5    John Ruskin                 Trees in a Lane (1847)
6    Severini Gino               Ritmo plastico del 14 luglio (1913)
7    Juan Gris                   Portrait of Pablo Picasso (1912)
8    Vincent van Gogh            Landscape at Saint-Rémy (1889)
9    Pieter Bruegel the Elder    The Tower of Babel (1563)
10   Egon Schiele                Edith with Striped Dress (1915)

Note: All our style images are in the public domain.
2) Since the visual effect is influenced by the content and
then Model-Optimisation-Based Offline Audio Style Transfer. style weight, it is difficult to compare results with different
Inspired by image-based IOB-NST, Verma and Smith [90] degrees of stylisation. Simply giving the same content and
propose an Audio-Optimisation-Based Online Audio Style Transfer style weight is not an optimal solution due to the different
algorithm based on online audio optimisation. They start ways to calculate losses in each algorithm (e.g., different
from a noise signal and optimise it iteratively using back- choices of content and style layers, different loss functions).
propagation. [91] improves the efficiency by transferring an Therefore, in our experiment, we try our best to balance the
audio in a feed-forward manner and can produce the result content and style weight among different algorithms.
in real-time. 3) We try to use the default parameters (e.g., choice of
layers, learning rate, etc) suggested by the authors except
for the aforementioned content and style weight. Although
6 EVALUATION METHODOLOGY the results for some algorithms may be further improved by
The evaluations of NST algorithms remain an open and im- more careful hyperparameter tuning, we select the authors’
portant problem in this field. In general, there are two major default parameters since we hold the point that the sensitiv-
types of evaluation methodologies that can be employed in ity for hyperparameters is also an important implicit criterion
the field of NST, i.e., qualitative evaluation and quantitative for comparison. For example, we cannot say an algorithm
evaluation. Qualitative evaluation relies on the aesthetic is effective if it needs heavy work to tune its parameters for
judgements of observers. The evaluation results are related each style.
to lots of factors (e.g., age and occupation of participants). There are also some other implementation details to be
Quantitative evaluation, in contrast, focuses on precise evaluation noted. For [47] and [48], we use the instance normalisation
metrics, which include time complexity, loss variation, strategy proposed in [50], which is not covered in the
etc. In this section, we experimentally compare different published papers. Also, we do not consider the diversity
NST algorithms both qualitatively and quantitatively. loss term (proposed in [50], [55]) for all algorithms, i.e., one
pair of content and style images corresponds to one stylised
6.1 Experimental Setup result in our experiment. For Chen and Schmidt’s algorithm
[57], we use the feed-forward reconstruction to reconstruct
Evaluation datasets. In total, there are ten style images and
the stylised results.
twenty content images used in our experiment.
For style images, we select artworks of diversified styles,
as shown in Figure 4. For example, there are impressionism,
6.2 Qualitative Evaluation
cubism, abstract, contemporary, futurism, surrealist, and
expressionism art. Regarding the mediums, some of these Example stylised results are shown in Figure 5, Figure 7 and
artworks are painted on canvas, while others are painted Figure 9. More results can be found in the supplementary

[Figure panels: columns Group I to Group VI; rows: Content & Style, Gatys et al. [4], Johnson et al. [47], Ulyanov et al. [48], Li and Wand [52].]

Figure 5: Some example results of IOB-NST and PSPM-MOB-NST for qualitative evaluation. The content images are
from the benchmark dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed
information of our style images can be found in Table 1.

[Figure panels: columns Group I to Group VI; rows: Content, Gatys et al. [4], Johnson et al. [47], Ulyanov et al. [48], Li and Wand [52].]

Figure 6: Saliency detection results of IOB-NST and PSPM-MOB-NST, corresponding to Figure 5. The results are produced
by using the discriminative regional feature integration approach proposed by Wang et al. [94].

[Figure panels: columns Group I to Group VI; rows: Content & Style, Dumoulin et al. [53], Chen et al. [54], Li et al. [55], Zhang and Dana [56].]

Figure 7: Some example results of MSPM-MOB-NST for qualitative evaluation. The content images are from the
benchmark dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed information
of our style images can be found in Table 1.

[Figure panels: columns Group I to Group VI; rows: Content, Dumoulin et al. [53], Chen et al. [54], Li et al. [55], Zhang and Dana [56].]

Figure 8: Saliency detection results of MSPM-MOB-NST, corresponding to Figure 7. The results are produced by using the
discriminative regional feature integration approach proposed by Wang et al. [94].

[Figure panels: columns Group I to Group VI; rows: Content & Style, Chen and Schmidt [57], Ghiasi et al. [58], Huang and Belongie [51], Li et al. [59].]

Figure 9: Some example results of ASPM-MOB-NST for qualitative evaluation. The content images are from the benchmark
dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed information of our
style images can be found in Table 1.
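Among the ASPM methods shown in Figure 9, Huang and Belongie [51] match global summary feature statistics between content and style. As a rough illustration of what such matching can mean, the sketch below aligns the per-channel mean and standard deviation of content features to those of style features. The function names, the flat-list feature layout and the `eps` constant are assumptions made for illustration; the published method operates on CNN feature maps inside an encoder-decoder network, which is not reproduced here.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return the mean and standard deviation of one flattened channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def match_channel_statistics(content_feats, style_feats, eps=1e-5):
    """Shift and scale each content channel so that its mean and standard
    deviation match those of the corresponding style channel.

    Both arguments are lists of channels; each channel is a flat list of
    activations (spatial positions flattened). The channel counts must
    match, but the spatial sizes may differ.
    """
    output = []
    for c_chan, s_chan in zip(content_feats, style_feats):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        output.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return output


# Two channels of toy activations; style has a different spatial size.
content = [[0.0, 2.0, 4.0, 6.0], [1.0, 1.0, 3.0, 3.0]]
style = [[10.0, 20.0, 30.0], [-1.0, 0.0, 1.0]]
stylised = match_channel_statistics(content, style)
```

The spatial arrangement of the content activations is kept, and only their first- and second-order statistics are replaced, which gives some intuition for why a purely statistical match may struggle with complex style patterns.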

[Figure panels: columns Group I to Group VI; rows: Content, Chen and Schmidt [57], Ghiasi et al. [58], Huang and Belongie [51], Li et al. [59].]

Figure 10: Saliency detection results of ASPM-MOB-NST, corresponding to Figure 9. The results are produced by using
the discriminative regional feature integration approach proposed by Wang et al. [94].
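The saliency maps in Figure 6, Figure 8 and Figure 10 are compared visually in this review. One possible numeric proxy, which is an assumption of this sketch and not a metric used in the experiments, is the Pearson correlation between the saliency map of the content image and that of the stylised result. Because the correlation is unchanged when a map is uniformly weakened or enhanced, it reacts mainly to changes in where the salient regions lie. The function name and the toy maps are hypothetical.

```python
import math


def pearson_correlation(map_a, map_b):
    """Pearson correlation between two equally sized 2-D saliency maps."""
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    if len(a) != len(b):
        raise ValueError("saliency maps must have the same size")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a constant map carries no structure to compare
    return cov / (norm_a * norm_b)


content_saliency = [[0.9, 0.8, 0.1], [0.7, 0.9, 0.0]]
weakened = [[0.45, 0.4, 0.05], [0.35, 0.45, 0.0]]  # same layout, half as strong
scrambled = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]     # dominant region moved

same_structure = pearson_correlation(content_saliency, weakened)
moved_structure = pearson_correlation(content_saliency, scrambled)
```

A value near 1 for `same_structure` indicates that the salient layout is preserved even though the saliency is weaker, whereas the negative value for `moved_structure` signals that the visually dominant locations have changed.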

material3 . But [59] is not effective at producing sharp details and fine
1) Results of IOB-NST. Following the content and style strokes.
images, Figure 5 contains the results of Gatys et al.’s IOB- Saliency Comparison. NST is an art creation process.
NST algorithm based on online image optimisation [4]. The As indicated in [3], [38], [39], the definition of style is
style transfer process is computationally expensive, but in subjective and also very complex, which involves personal
contrast, the results are appealing in visual quality. There- preferences, texture compositions as well as the used tools
fore, the algorithm of Gatys et al. is usually regarded as the and medium. As a result, it is difficult to define the aesthetic
gold-standard method in the community of NST. criterion for a stylised artwork. For the same stylised result,
2) Results of PSPM-MOB-NST. Figure 5 shows the different people may have different or even opposite views.
results of Per-Style-Per-Model MOB-NST algorithms (Section Nevertheless, our goal is to compare the results of different
4.2). Each model only fits one style. It can be noticed that NST techniques (shown in Figure 5, Figure 7 and Figure 9)
the stylised results of Ulyanov et al. [48] and Johnson et as objectively as possible. Here, we consider comparing
al. [47] are somewhat similar. This is not surprising since saliency maps, as proposed in [63]. The corresponding re-
they share a similar idea and only differ in their detailed sults are shown in Figure 6, Figure 8 and Figure 10. Saliency
network architectures. For the results of Li and Wand [52], maps can demonstrate visually dominant locations in images.
the results are slightly less impressive. Since [52] is based Intuitively, a successful style transfer could weaken or
on Generative Adversarial Network (GAN), to some extent, enhance the saliency maps in content images, but should not
the training process is not that stable. But we believe that change the integrity and coherence. From Figure 6 (saliency
GAN-based style transfer is a very promising direction, and detection results of IOB-NST and PSPM-MOB-NST), it can
there are already some other GAN-based works [83], [87], be noticed that the stylised results of [4], [47], [48] preserve
[98] (Section 5) in the field of NST. the structures of content images well; however, for [52], it
might be harder for an observer to recognise the objects after
3) Results of MSPM-MOB-NST. Figure 7 demonstrates
stylisation. Using a similar analysis, from Figure 8
the results of Multiple-Style-Per-Model MOB-NST algorithms.
(saliency detection results of MSPM-MOB-NST), [53] and
Multiple styles are incorporated into a single model. The
[54] preserve similar saliency of the original content images
idea of both Dumoulin et al.’s algorithm [53] and Chen et
since they both tie a small number of parameters to each
al.’s algorithm [54] is to tie a small number of parameters to
style. [56] and [55] are also similar regarding the ability to
each style. Also, both of them build their algorithm upon the
retain the integrity of the original saliency maps, because
architecture of [47]. Therefore, it is not surprising that their
they both use a single network for all styles. As shown
results are visually similar. Although the results of [53], [54]
in Figure 10, for the saliency detection results of ASPM-
are appealing, their model size will become larger with the
MOB-NST, [58] and [51] perform better than [57] and [59];
increase of the number of learned styles. In contrast, Zhang
however, both [58] and [51] are data-driven methods and
and Dana’s algorithm [56] and Li et al.’s algorithm [55] use
their quality depends on the diversity of training styles.
a single network with the same trainable network weights
In general, it seems that the results of MSPM-MOB-NST
for multiple styles. The model size issue is tackled, but there
preserve better saliency coherence than ASPM-MOB-NST,
seem to be some interferences among different styles, which
but are slightly inferior to IOB-NST and PSPM-MOB-NST.
slightly influences the stylisation quality.
4) Results of ASPM-MOB-NST. Figure 9 presents the
last category of MOB-NST algorithms, namely Arbitrary- 6.3 Quantitative Evaluation
Style-Per-Model MOB-NST algorithms. Their idea is one- Regarding the quantitative evaluation, we mainly focus on
model-for-all. Globally, the results of ASPM are slightly less five evaluation metrics, which are: generating time for a
impressive than other types of algorithms. This is acceptable single content image of different sizes; training time for a
in that a three-way trade-off between speed, flexibility and single model; average loss for content images to measure
quality is common in research. Chen and Schmidt’s patch- how well the loss function is minimised; loss variation
based algorithm [57] seems to not combine enough style during training to measure how fast the model converges;
elements into the content image. Their algorithm is based style scalability to measure how large the learned style set
on similar patch swap. When lots of content patches are can be.
swapped with style patches that do not contain enough style 1) Stylisation speed. The issue of efficiency is the focus
elements, the target style will not be reflected well. Ghiasi of MOB-NST algorithms. In this subsection, we compare
et al.’s algorithm [58] is data-driven and their stylisation different algorithms quantitatively in terms of the stylisation
quality is very dependent on the varieties of training styles. speed. Table 2 demonstrates the average time to stylise one
For the algorithm of Huang and Belongie [51], they propose image with three resolutions using different algorithms. In
to match global summary feature statistics and successfully our experiment, the style images have the same size as the
improve the visual quality compared with [57]. However, content images. The fifth column in Table 2 represents the
their algorithm seems not good at handling complex style number of styles one model of each algorithm can produce.
patterns, and their stylisation quality is still related to the k (k ∈ Z⁺) denotes that a single model can produce multiple
varieties of training styles. The algorithm of Li et al. [59] re- styles, which corresponds to MSPM algorithms. ∞ means
places the training process with a series of transformations. a single model works for any style, which corresponds to
ASPM algorithms. The numbers reported in Table 2 are
3. https://www.dropbox.com/s/5xd8iizoigvjcxz/ obtained by averaging the generating time of 100 images.
SupplementaryMaterial neuralStyleReview.pdf?dl=0 Note that we do not include the speed of [53], [58] in Table 2

Table 2: Average speed comparison of NST algorithms for images of size 256 × 256 pixels, 512 × 512 pixels and 1024 × 1024
pixels (on an NVIDIA Quadro M6000)

Methods                    Time (s)                                        Styles/Model
                           256 × 256       512 × 512       1024 × 1024
Gatys et al. [10]          14.32           51.19           200.3           ∞
Johnson et al. [47]        0.014           0.045           0.166           1
Ulyanov et al. [48]        0.022           0.047           0.145           1
Li and Wand [52]           0.015           0.055           0.229           1
Zhang and Dana [56]        0.019 (0.039)   0.059 (0.133)   0.230 (0.533)   k (k ∈ Z⁺)
Li et al. [55]             0.017           0.064           0.254           k (k ∈ Z⁺)
Chen and Schmidt [57]      0.123 (0.130)   1.495 (1.520)   −               ∞
Huang and Belongie [51]    0.026 (0.037)   0.095 (0.137)   0.382 (0.552)   ∞
Li et al. [59]             0.620           1.139           2.947           ∞

Note: The fifth column shows the number of styles that a single model can produce. Times are shown both excluding (outside parentheses) and including (in
parentheses) the style encoding process, since [56], [57] and [51] support storing encoded style statistics in advance to further speed up
the stylisation process for the same style but different content images. The time of [57] for producing 1024 × 1024 images is not shown due to the
memory limitation. The speeds of [53], [58] are similar to that of [47] since they share a similar architecture, so they are not listed in this table.
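The timing protocol behind Table 2, averaging the generating time over 100 images, can be sketched as below. This is only an illustration under assumptions: `stylise` is a hypothetical callable standing in for any NST model, `dummy_stylise` is a placeholder that inverts pixel values, and the synthetic images are kept small so that the sketch runs quickly on a CPU.

```python
import time


def average_stylisation_time(stylise, images, warmup=1):
    """Return the mean wall-clock time in seconds of stylise(image).

    `stylise` is a hypothetical callable standing in for any NST model.
    A few warm-up calls are made first so that one-off start-up costs
    are not counted. On a GPU, the callable would also need to block
    until the device has finished, otherwise the timings are too small.
    """
    if not images:
        raise ValueError("need at least one image")
    for image in images[:warmup]:
        stylise(image)
    start = time.perf_counter()
    for image in images:
        stylise(image)
    return (time.perf_counter() - start) / len(images)


def dummy_stylise(image):
    """Placeholder model: inverts 8-bit pixel values."""
    return [[255 - v for v in row] for row in image]


# 100 small synthetic images, mirroring the averaging over 100 images.
images = [[[(x + y + i) % 256 for x in range(32)] for y in range(32)]
          for i in range(100)]
mean_seconds = average_stylisation_time(dummy_stylise, images)
```

For methods that can cache encoded style statistics, the same harness could be run twice, once with and once without the style encoding step inside `stylise`, to obtain both kinds of figures reported in the table.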

Table 3: A summary of the advantages and disadvantages of the mentioned algorithms in our experiment.

Types          Methods                    E    AS   LF
IOB-NST        Gatys et al. [4]           ×    √    √
PSPM-MOB-NST   Ulyanov et al. [47]        √    ×    ×
               Johnson et al. [50]        √    ×    ×
               Li and Wand [52]           √    ×    ×
MSPM-MOB-NST   Dumoulin et al. [53]       √    ×    ×
               Chen et al. [54]           √    ×    ×
               Li et al. [55]             √    ×    ×
               Zhang and Dana [56]        √    ×    ×
ASPM-MOB-NST   Chen and Schmidt [57]      √    √    ×
               Ghiasi et al. [58]         √    √    ×
               Huang and Belongie [51]    √    √    ×
               Li et al. [59]             √    √    √

VQ (Pros & Cons):
IOB-NST: Good and usually regarded as a gold standard.
PSPM-MOB-NST: The results of [47], [50] are close to [4]. [52] is generally less appealing than [47], [50].
MSPM-MOB-NST: The results of [53] and [54] are close to [4], but the model size generally becomes larger with the increase of the number of learned styles. [55], [56] have a fixed model size but there seem to be some interferences among different styles.
ASPM-MOB-NST: In general, the results of ASPM are less impressive than other types of NST algorithms. [57] does not combine enough style elements. [51], [58] are generally not effective at producing complex style patterns. [59] is not good at producing sharp details and fine strokes.

Note: E, AS, LF, and VQ represent Efficient, Arbitrary Style, Learning-Free, and Visual Quality, respectively. IOB-NST denotes the category
Image-Optimisation-Based Neural Style Transfer and MOB-NST represents Model-Optimisation-Based Neural Style Transfer.

as their algorithm is to scale and shift parameters based on a few iterations is capable of producing enough visually
the algorithm of Johnson et al. [47]. The time required to appealing results. So we just outline our training time of
stylise one image using [32], [53] is very close to [47] under different algorithms (under the same setting) as a reference
the same setting. For Chen et al.’s algorithm in [54], since for follow-up studies. On an NVIDIA Quadro M6000, the
their algorithm is protected by patent and they do not make training time for a single model is about 3.5 hours for the
public the detailed architecture design, here we just attach algorithm of Johnson et al. [47], 3 hours for the algorithm
the speed information provided by the authors for reference: of Ulyanov et al. [48], 2 hours for the algorithm of Li
On a Pascal Titan X GPU, 256×256: 0.007s; 512×512: 0.024s; and Wand [52], 4 hours for Zhang and Dana [56], and 8
1024 × 1024: 0.089s. For Chen and Schmidt’s algorithm [57], hours for Li et al. [55]. Chen and Schmidt’s algorithm [57]
the time for processing a 1024 × 1024 image is not reported and Huang and Belongie’s algorithm [51] take much longer
due to the limit of video memory. Swapping patches for (e.g., a couple of days), which is acceptable since a pre-
two 1024 × 1024 images needs more than 24 GB video trained model can work for any style. The training time
memory and thus, the stylisation process is not practical. of [58] depends on how large the training style set is. For
We can observe that except for [57], [59], all the other MOB- MSPM algorithms, the training time can be further reduced
NST algorithms are capable of stylising even high-resolution through incremental learning over a pre-trained model. For
content images in real-time. ASPM algorithms are generally example, the algorithm of Chen et al. only needs 8 minutes
slower than PSPM and MSPM, which demonstrates the to incrementally learn a new style, as reported in [54].
aforementioned three-way trade-off again.
3) Loss comparison. One way to evaluate some MOB-
2) Training time. Another concern is the training time for NST algorithms which share the same loss function is to
one single model. The training time of different algorithms compare their loss variation during training, i.e., the train-
is hard to compare as sometimes the model trained with just ing curve comparison. It helps researchers to justify the

 

 

 


[Figure panels: (a) Total Loss Curve, (b) Style Loss Curve, (c) Content Loss Curve; y-axis: natural logarithm of the respective loss; x-axis: iterations.]

Figure 11: Training curves of total loss, style loss and content loss of different algorithms. Solid curves represent the loss
variation of the algorithm of Ulyanov et al. [48], while the dashed curves represent the algorithm of Johnson et al. [47].
Different colours correspond to different randomly selected styles from our style set.

[Figure panels: (a) Total Loss, (b) Style Loss, (c) Content Loss; legend: Gatys et al., Ulyanov et al., Johnson et al.; loss axes scaled by ×10^4; other axis labels recovered from the extraction: Iterations, Content Image.]

Figure 12: Average total loss, style loss and content loss of different algorithms [4], [47], [48]. The reported numbers are
averaged over our set of style and content images.
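The loss comparison presupposes a shared content and style objective. A minimal single-layer sketch in the spirit of the Gram-based formulation of Gatys et al. [4] is given below. The feature maps are toy nested lists rather than CNN activations, and the normalisation constants and the weights `alpha` and `beta` are assumptions, since these details differ between implementations.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as a list of flattened channels.

    Entry (i, j) is the inner product of channels i and j, divided by the
    number of spatial positions (one of several normalisations in use).
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
            for ci in features]


def content_loss(feat_out, feat_content):
    """Mean squared difference between two feature maps of equal shape."""
    diffs = [(a - b) ** 2
             for co, cc in zip(feat_out, feat_content)
             for a, b in zip(co, cc)]
    return sum(diffs) / len(diffs)


def style_loss(feat_out, feat_style):
    """Mean squared difference between the Gram matrices of two feature maps."""
    g_out, g_style = gram_matrix(feat_out), gram_matrix(feat_style)
    diffs = [(a - b) ** 2
             for ro, rs in zip(g_out, g_style)
             for a, b in zip(ro, rs)]
    return sum(diffs) / len(diffs)


def total_loss(feat_out, feat_content, feat_style, alpha=1.0, beta=10.0):
    """Weighted sum of content and style loss; alpha and beta are the
    content and style weights (the values here are arbitrary)."""
    return (alpha * content_loss(feat_out, feat_content)
            + beta * style_loss(feat_out, feat_style))


feat_content = [[1.0, 2.0], [0.0, 1.0]]
feat_style = [[2.0, 1.0], [1.0, 0.0]]  # same values, permuted positions
loss_at_content = total_loss(feat_content, feat_content, feat_style)
```

In this toy case the style features are a spatial permutation of the content features, so the Gram matrices coincide and the total loss at the content features is zero. This illustrates that the Gram-based style term ignores spatial arrangement, while the content term does not.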

choice of architecture design by measuring how fast the 4) Style scalability. Scalability is a very important cri-
model converges and how well the same loss function can terion for MSPM algorithms. However, it is very hard to
be minimised. Here we compare training curves of two measure since the maximum capabilities of a single model
popular MOB-NST algorithms [47], [48] in Figure 11, since is highly related to the set of particular styles. If most styles
most of the follow-up works are based on their architecture have somewhat similar patterns, a single model can pro-
designs. We remove the total variation term and keep the duce thousands of styles or even more, since these similar
same objective for both two algorithms. Other settings (e.g., styles share somewhat similar distribution of style feature
loss network, chosen layers) are also kept the same. For the statistics. In contrast, if the style patterns vary a lot among
style images, we randomly select four styles from our style different style images, the capability of a single model will
set and represent them in different colours in Figure 11. It be much smaller. But it is hard to measure how much these
can be observed that the two algorithms are similar in terms styles differ from each other in style patterns. Therefore, to
of the convergence speed. Also, both algorithms minimise provide the reader a reference, here we just summarise the
the content loss well during training, and they mainly differ authors’ attempt for style scalability: the number is 32 for
in the speed of learning the style objective. The algorithm in [53], 1000 for both [54] and [55], and 100 for [56].
[47] minimises the style loss better. A summary of the advantages and disadvantages of
Another related criterion is to compare the final loss the mentioned algorithms in this experiment section can be
values of different algorithms over a set of test images. This found in Table 3.
metric demonstrates how well the same loss function can be
minimised by using different algorithms. For a fair compar-
ison, the loss function and other settings are also required
to be kept the same. We show the results of one IOB-NST 7 APPLICATIONS
algorithm [4] and two MOB-NST algorithms [47], [48] in
Figure 12. The result is consistent with the aforementioned Due to the visually plausible stylised results, the research of
trade-off between speed and quality. Although MOB-NST NST has led to many successful industrial applications and
algorithms are capable of stylising images in real-time, they begun to deliver commercial benefits. In this section, we
are not as good as IOB-NST algorithms in terms of minimising summarise these applications and present some potential
the same loss function. usages.

7.1 Social Communication


One reason why NST attracts attention in both academia and
industry is its popularity in some social networking sites,
e.g., Facebook and Twitter. A recently emerged mobile ap-
plication named Prisma [11] is one of the first industrial
applications that provide the NST algorithm as a service.
Due to its high stylisation quality, Prisma achieved great
success and is becoming popular around the world. Some
other applications providing the same service appeared one
after another and began to deliver commercial benefits,
e.g., a web application Ostagram [12] requires users to
pay for a faster stylisation speed. With the help of these Figure 13: Example of aesthetic preference scores for the
industrial applications [13], [99], [100], people can create outputs of different algorithms given the same style and
their own art paintings and share their artwork with others content (y-axis: aesthetics score; x-axis: different observers).
on Twitter and Facebook, which is a new form of social
communication. There are also some related application
papers: [101] introduces an iOS app Pictory which combines et al. explore the use of NST in redrawing some scenes in a
style transfer techniques with image filtering; [102] further movie named Come Swim [105], which indicates the promis-
presents the technical implementation details of Pictory; ing potential applications of NST in this field. In [106], Fišer
[103] demonstrates the design of another GPU-based mobile et al. study an illumination-guided style transfer algorithm
app ProsumerFX. for stylisation of 3D renderings. They demonstrate how to
The application of NST in social communication rein- exploit their algorithm for rendering previews on various
forces the connections between people and also has positive geometries, autocomplete shading, and transferring style
effects on both academia and industry. For academia, when without a reference 3D model.
people share their own masterpiece, their comments can
help the researchers to further improve the algorithm. Moreover,
the application of NST in social communication also 8 FUTURE CHALLENGES
drives the advances of other new techniques. For instance, The advances in the field of NST are inspiring and some
inspired by the real-time requirements of NST for videos, algorithms have already found use in industrial applica-
Facebook AI Research (FAIR) first developed a new mobile- tions. Although current algorithms are capable of good
embedded deep learning system Caffe2Go and then Caffe2 performance, there are still several challenges and open
(now merged with PyTorch), which can run deep neural issues. In this section, we summarise key challenges within
networks on mobile phones [104]. For industry, the application this field of NST and discuss possible strategies on how to
brings commercial benefits and promotes the economic deal with them in future works. Since NST is closely related
development. to NPR, some critical problems in NPR (summarised in [3],
[14], [107], [108], [109], [110]) also remain future challenges
7.2 User-assisted Creation Tools for the research of NST. Therefore, we first review some of
the major challenges existing in both NPR and NST and
Another use of NST is to make it act as user-assisted
then discuss the research questions specialised for the field
creation tools. Although there are no popular applications
of NST.
that applied the NST technique in creation tools, we believe
that it will be a promising potential usage in the future.
As a creation tool for painters and designers, NST can 8.1 Evaluation Methodology
make it more convenient for a painter to create an artwork of Aesthetic evaluation is a critical issue in both NPR and NST.
a particular style, especially when creating computer-made In the field of NPR, the necessity of aesthetic evaluation is
artworks. Moreover, with NST algorithms, it is trivial to explained by many researchers [3], [14], [107], [108], [109],
produce stylised fashion elements for fashion designers and [110], e.g., in [3], Rosin and Collomosse use two chapters
stylised CAD drawings for architects in a variety of styles, to explore this issue. This problem is increasingly critical as
which will be costly when creating them by hand. the fields of NPR and NST mature. As pointed out in [3],
researchers need some reliable criteria to assess the benefits
7.3 Production Tools for Entertainment Applications of their proposed approach over the prior art and also a
Some entertainment applications such as movies, anima- way to evaluate the suitability of one particular approach
tions and games are probably the most application forms of to one particular scenario. However, most NPR and NST
NST. For example, creating an animation usually requires 8 papers evaluate their proposed approach with side-by-side
to 24 painted frames per second. The production costs will subjective visual comparisons, or through measurements
be largely reduced if NST can be applied to automatically derived from various user studies [59], [111], [112]. For
stylise a live-action video into an animation style. Similarly, example, to evaluate the proposed universal style transfer
NST can significantly save time and costs when applied to algorithm, Li et al. [59] conduct a user study which is to ask
the creation of some movies and computer games. participants to vote for their favourite stylised results. We
There are already some application papers aiming at argue that it is not an optimal solution since the results vary
introducing how to apply NST for production, e.g., Joshi a lot with different observers. Inspired by [113], we conduct

a simple experiment for user studies with the stylised results of different NST algorithms. In
our experiment, each stylised image is rated by 8 different raters (4 males and 4 females) with
the same occupation and age. As depicted in Figure 13, given the same stylised result,
different observers with the same occupation and age still have quite different ratings.
Nevertheless, there is currently no gold standard evaluation method for assessing NPR and NST
algorithms. This challenge of aesthetic evaluation will continue to be an open question in both
NPR and NST communities, the solution of which might require collaboration with professional
artists and efforts in the identification of underlying aesthetic principles.

In the field of NST, there is another important issue related to aesthetic evaluation.
Currently, there is no standard benchmark image set for evaluating NST algorithms. Different
authors typically use their own images for evaluation. In our experiment, we use the carefully
selected NPR benchmark image set named NPRgeneral [92], [93] as our content images to compare
different techniques, which is backed by the comprehensive study in [92], [93]; however, we
have to admit that the selection of our style images is far from being a standard NST benchmark
style set. Different from NPR, NST algorithms do not have explicit restrictions on the types of
style images. Therefore, to compare the style scalability of different NST methods, it is
critical to seek a benchmark style set which collectively exhibits a broad range of possible
properties, accompanied by a detailed description of adopted principles, numerical measurements
of image characteristics as well as a discussion of limitations like the works in [92], [93],
[95]. Based on the above discussion, seeking an NST benchmark image set is quite a separate and
important research direction, which provides not only a way for researchers to demonstrate the
improvement of their proposed approach over the prior art, but also a tool to measure the
suitability of one particular NST algorithm to one particular requirement. In addition, with
the emergence of several NST extensions (Section 5), it remains another open problem to study
the specialised benchmark data set and also the corresponding evaluation criteria for assessing
those extended works (e.g., video style transfer, audio style transfer, stereoscopic style
transfer, character style transfer and fashion style transfer).

8.2 Interpretable Neural Style Transfer

Another challenging problem is the interpretability of NST algorithms. Like many other
CNN-based vision tasks, the process of NST is like a black box, which makes it quite
uncontrollable. In this part, we focus on three critical issues related to the interpretability
of NST, i.e., interpretable and controllable NST via disentangled representations,
normalisation methods associated with NST, and adversarial examples in NST.

Representation disentangling. The goal of representation disentangling is to learn
dimension-wise interpretable representations, where some changes in one or more specific
dimensions correspond to changes precisely in a single factor of variation while being
invariant to other factors [114], [115], [116], [117]. Such representations are useful to a
variety of machine learning tasks, e.g., visual concepts learning [118] and transfer learning
[119]. For example, in style transfer, if one could learn a representation where the factors of
variation (e.g., colour, shape, stroke size, stroke orientation and stroke composition) are
precisely disentangled, these factors could then be freely controlled during stylisation. For
example, one could change the stroke orientations in a stylised image by simply changing the
corresponding dimension in the learned disentangled representation. Towards the goal of
disentangled representation, current methods fit into two categories, which are supervised
approaches and unsupervised ones. The basic idea of supervised disentangling methods is to
exploit annotated data to supervise the mapping between inputs and attributes [120], [121].
Despite their effectiveness, supervised disentangling approaches typically require large
numbers of training samples. However, in the case of NST, it is quite complicated to model and
capture some of those aforementioned factors of variation. For example, it is hard to collect a
set of images which have different stroke orientations but exactly the same colour
distribution, stroke size and stroke composition. By contrast, unsupervised disentangling
methods do not require annotations; however, they usually yield disentangled representations
which are dimension-wise uncontrollable and uninterpretable [122], i.e., one cannot control
what would be encoded in each specific dimension. Based on the above discussion, to acquire
disentangled representations in NST, the first issue to be addressed is how to define, model
and capture the complicated factors of variation in NST.

Table 4: Normalisation methods in NST.

Paper   Author                Name
[50]    Ulyanov et al.        Instance Normalisation
[53]    Dumoulin et al.       Conditional Instance Normalisation
[51]    Huang and Belongie    Adaptive Instance Normalisation

Normalisation methods. The advances in the field of NST are closely related to the emergence of
novel normalisation methods, as shown in Table 4. Some of these normalisation methods also have
an influence on a larger vision community beyond style transfer (e.g., image recolourisation
[123] and video colour propagation [124]). In this part, we first briefly review these
normalisation methods in NST and then discuss the corresponding problem. The first
normalisation method to emerge in NST is instance normalisation (or contrast normalisation)
proposed by Ulyanov et al. [50]. Instance normalisation is equivalent to batch normalisation
when the batch size is one. It is shown that a style transfer network with instance
normalisation layers converges faster and produces visually better results compared with a
network with batch normalisation layers. Ulyanov et al. believe that the superior performance
of instance normalisation results from the fact that instance normalisation enables the network
to discard contrast information in content images and therefore makes learning simpler. Another
explanation proposed by Huang and Belongie [51] is that instance normalisation performs a kind
of style normalisation by normalising

feature statistics (i.e., the mean and variance). With instance normalisation, the style of
each individual image could be directly normalised to the target style. As a result, the rest
of the network only needs to take care of the content loss, making the objective easier to
learn. Based on instance normalisation, Dumoulin et al. [53] further propose conditional
instance normalisation, which conditions the scaling and shifting parameters in instance
normalisation layers on the style (shown in Equation (8)). Following the interpretation
proposed by Huang and Belongie, by using different affine parameters, the feature statistics
could be normalised to different values. Correspondingly, the style of each individual sample
could be normalised to different styles. Furthermore, in [51], Huang and Belongie propose
adaptive instance normalisation to adaptively instance-normalise the content features using the
style feature statistics (shown in Equation (9)). In this way, they believe that the style of
an individual image could be normalised to arbitrary styles. Despite the superior performance
achieved by instance normalisation, conditional instance normalisation and adaptive instance
normalisation, the reason behind their success still remains unclear. Although Ulyanov et al.
[50] and Huang and Belongie [51] propose their own hypotheses based on pixel space and feature
space respectively, there is a lack of theoretical proof for their proposed theories. In
addition, their proposed theories are also built on other hypotheses, e.g., Huang and Belongie
propose their interpretation based on the observation by Li et al. [42]: channel-wise feature
statistics, namely mean and variance, could represent styles. However, it remains uncertain why
feature statistics could represent the style, or even whether the feature statistics could
represent all styles, which relates back to the interpretability of style representations.

Adversarial examples. Several studies have shown that deep classification networks are easily
fooled by adversarial examples [125], [126], which are generated by applying perturbations to
input images (e.g., Figure 14(c)). Previous studies on adversarial examples mainly focus on
deep classification networks. However, as shown in Figure 14, we find that adversarial examples
also exist in generative style transfer networks. In Figure 14(d), one can hardly recognise the
content, which is originally contained in Figure 14(c). It reveals the difference between
generative networks and the human vision system. The perturbed image is still recognisable to
humans but leads to a different result for generative style transfer networks. However, it
remains unclear why some perturbations could make such a difference, and whether some similar
noised images uploaded by the user could still be stylised into the desired style. Interpreting
and understanding adversarial examples in NST could help to avoid some failure cases in
stylisation.

Figure 14: Adversarial example for NST: (a) is the original content and style image pair and
(b) is the stylised result of (a) with [47]; (c) is the generated adversarial example and (d)
is the stylised result of (c) with the same model as (b).

8.3 Three-way Trade-off in Neural Style Transfer

In the field of NST, there is a three-way trade-off between speed, flexibility and quality.
IOB-NST achieves superior performance in quality but is computationally expensive. PSPM-MOB-NST
achieves real-time stylisation; however, PSPM-MOB-NST needs to train a separate network for
each style, which is not flexible. MSPM-MOB-NST improves the flexibility by incorporating
multiple styles into one single model, but it still needs to pre-train a network for a set of
target styles. Although ASPM-MOB-NST algorithms successfully transfer arbitrary styles, they
are less satisfying in perceptual quality and speed. The quality of data-driven ASPM relies
heavily on the diversity of training styles. However, one can hardly cover every style due to
the great diversity of artworks. The image transformation based ASPM algorithm transfers
arbitrary styles in a learning-free manner, but it is behind others in speed. Another related
issue is the problem of hyperparameter tuning. To produce the most visually appealing results,
it remains uncertain how to set the value of content and style weights, how to choose layers
for computing content and style loss, which optimiser to use and how to set the value of the
learning rate. Currently, researchers empirically set these hyperparameters; however, one set
of hyperparameters does not necessarily work for any style and it is tedious to manually tune
these parameters for each combination of content and style images. One of the keys for this
problem is a better understanding of the optimisation procedure in NST. A deep understanding of
the optimisation procedure would help understand how to find the local minima that lead to high
quality results.

9 DISCUSSIONS AND CONCLUSIONS

Over the past several years, NST has continued to become an inspiring research area, motivated
by both scientific challenges and industrial demands. A considerable amount of research has
been conducted in the field of NST. Key advances in this field are summarised in Figure 2. A
summary of the corresponding style transfer loss functions can be found in Table 5. NST is
quite a fast-paced area, and we are looking forward to more exciting works devoted to advancing
the development of this field.

During the period of preparing this review, we are also delighted to find that related research
on NST also brings new inspirations for other areas [127], [128], [129], [130], [131] and
accelerates the development of a wider vision community. For the area of Image Reconstruction,
inspired by NST, Ulyanov et al. [127] propose a novel deep image prior, which replaces the
manually-designed total variation regulariser in [33] with a randomly initialised deep neural
network. Given a task-dependent loss function $\mathcal{L}$, an image $I_o$ and a fixed uniform
noise $z$ as inputs, their algorithm can be formulated as:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(g_{\theta^*}(z), I_o), \quad I^* = g_{\theta^*}(z). \tag{10}$$

One can easily notice that Equation (10) is very similar to Equation (7). The process in [127]
is equivalent to

Table 5: An overview of major style transfer loss functions.

Paper                      Loss                             Description
Gatys et al. [4]           Gram Loss                        The first proposed style loss based on Gram-based style representations.
Johnson et al. [47]        Perceptual Loss                  Widely adopted content loss based on perceptual similarity.
Berger and Memisevic [32]  Transformed Gram Loss            Computing Gram Loss over horizontally and vertically translated feature representations. More effective at modelling style with symmetric properties, compared with Gram Loss.
Li et al. [55]             Mean-subtraction Gram Loss       Subtracting the mean of feature representations before computing Gram Loss. Eliminating large discrepancy in scale. Effective at multi-style transfer with one single network.
Zhang and Dana [56]        Multi-scale Gram Loss            Computing Gram Loss over multi-scale feature representations. Eliminating a few artefacts.
Li et al. [42]             MMD Loss with Different Kernels  Gram Loss is equivalent to MMD Loss with Second Order Polynomial Kernel. MMD Loss with Linear Kernel is capable of comparable quality with Gram Loss, but with lower computational complexity.
Li et al. [42]             BN Loss                          Achieving comparable quality with Gram Loss, but conceptually clearer in theory.
Risser et al. [44]         Histogram Loss                   Matching the entire histogram of feature representations. Eliminating instability artefacts, compared with single Gram Loss.
Li et al. [45]             Laplacian Loss                   Eliminating distorted structures and irregular artefacts.
Li and Wand [46]           MRF Loss                         More effective when the content and style are similar in shape and perspective, compared with Gram Loss.
Champandard [65]           Semantic Loss                    Incorporating a segmentation mask over MRF Loss. Enabling a more accurate semantic match.
Li and Wand [52]           Adversarial Loss                 Computed based on PatchGAN. Utilising contextual correspondence between patches. More effective at preserving coherent textures in complex images, compared with Gram Loss.
Jing et al. [61]           Stroke Loss                      Achieving continuous stroke size control while preserving stroke consistency.
Wang et al. [62]           Hierarchical Loss                Enabling a coarse-to-fine stylisation procedure. Capable of producing large but also subtle strokes for high-resolution content images.
Liu et al. [63]            Depth Loss                       Preserving depth maps of content images. Effective at retaining spatial layout and structure of content images, compared with single Gram Loss.
Ruder et al. [74]          Temporal Consistency Loss        Designed for video style transfer. Penalising the deviations along point trajectories based on optical flow. Capable of maintaining temporal consistency among stylised video frames.
Chen et al. [72]           Disparity Loss                   Designed for stereoscopic style transfer. Penalising bidirectional disparity. Capable of consistent strokes for different views.
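
To make two of the entries in Table 5 more concrete, the following is a minimal sketch of a Gram-based style loss and of a channel-wise statistics loss in the spirit of the BN Loss entry. It is an illustration under stated assumptions rather than the implementation used in any of the cited papers: the function names are hypothetical, the normalisation constants are chosen for simplicity, and feature maps are plain Python lists of C channels with N flattened activations each, whereas in practice they would be taken from a pre-trained CNN.

```python
from statistics import fmean, pstdev


def gram_matrix(features):
    """Gram matrix G[i][j] = sum_k F[i][k] * F[j][k] of a C x N feature map."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def gram_loss(generated, style):
    """Mean squared difference between the Gram matrices of two feature maps.

    Dividing by C * C is an assumption made here; published variants use
    different normalisation constants.
    """
    g_gen, g_sty = gram_matrix(generated), gram_matrix(style)
    c = len(generated)
    return sum((g_gen[i][j] - g_sty[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


def bn_statistics_loss(generated, style):
    """Mean squared difference of channel-wise means and standard deviations."""
    total = 0.0
    for f_gen, f_sty in zip(generated, style):
        total += (fmean(f_gen) - fmean(f_sty)) ** 2
        total += (pstdev(f_gen) - pstdev(f_sty)) ** 2
    return total / len(generated)


# Toy feature maps with C = 2 channels and N = 4 spatial positions.
style = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# Same activations with the spatial positions reversed in every channel.
shuffled = [[4.0, 3.0, 2.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
# A feature map with different channel statistics.
other = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

print(gram_loss(shuffled, style))  # 0.0
print(gram_loss(other, style))     # 188.0
print(bn_statistics_loss(other, style))
```

The zero loss for the spatially reversed feature map shows that the Gram matrix discards the spatial arrangement of activations and keeps only channel correlations, while the statistics loss compares only the per-channel mean and standard deviation.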

the training process of MOB-NST when there is only one available image in the training set, but
replacing $I_c$ with $z$ and $\mathcal{L}_{total}$ with $\mathcal{L}$. In other words, $g$ in
[127] is trained to overfit one single sample. Inspired by NST, Upchurch et al. [128] propose a
deep feature interpolation technique and provide a new baseline for the area of Image
Transformation (e.g., face aging and smiling). Upon the procedure of the IOB-NST algorithm [4],
they add an extra step which is interpolating in the VGG feature space. In this way, their
algorithm successfully changes image contents in a learning-free manner. Another field closely
related to NST is Face Photo-sketch Synthesis. For example, [132] exploits style transfer to
generate shadings and textures for final face sketches. Similarly, for the area of Face
Swapping, the idea of the MOB-NST algorithm [48] can be directly applied to build a feed-forward
Face-Swap algorithm [133]. NST also provides a new way for Domain Adaptation, as is validated in
the work of Atapour-Abarghouei and Breckon [131]. They apply a style transfer technique to
translate images from different domains so as to improve the generalisation capabilities of
their Monocular Depth Estimation model.

Despite the great progress in recent years, the area of NST is far from a mature state.
Currently, the first stage of NST is to refine and optimise recent NST algorithms, aiming to
perfectly imitate varieties of styles. This stage involves two technical directions. The first
one is to reduce failure cases and improve stylised quality on a wider variety of style and
content images. Although there is not an explicit restriction on the type of styles, NST does
have styles it is particularly good at and also certain styles it is weak at. For example, NST
typically performs well in producing irregular style elements (e.g., paintings), as
demonstrated in many NST papers [4], [47], [53], [59]; however, for some styles with regular
elements such as low-poly styles [134], [135] and pixelator styles [136], NST generally
produces distorted and irregular results due to the property of CNN-based image reconstruction.
For content images, previous NST papers usually use natural images as content to demonstrate
their proposed algorithms; however, given abstract images (e.g., sketches and cartoons) as
input content, NST typically does not combine enough style elements to match the content [137],
since a pre-trained classification network could not extract proper image content from these
abstract images. The other technical direction of the first stage lies in deriving more
extensions from general NST algorithms. For example, with the emergence of 3D vision techniques,

it is promising to study 3D surface stylisation, which is to directly optimise and produce 3D
objects for both photorealistic and non-photorealistic stylisation. After moving beyond the
first stage, a further trend of NST is to not just imitate human-created art with NST
techniques, but rather to create a new form of AI-created art under the guidance of underlying
aesthetic principles. The first step towards this direction has been taken, i.e., using current
NST methods [53], [54], [62] to combine different styles. For example, in [62], Wang et al.
successfully utilise their proposed algorithm to produce a new style which fuses the coarse
texture distortions of one style with the fine brush strokes of another style image.

REFERENCES

[1] B. Gooch and A. Gooch, Non-photorealistic rendering. Natick, MA, USA: A. K. Peters, Ltd., 2001.
[2] T. Strothotte and S. Schlechtweg, Non-photorealistic computer graphics: modeling, rendering, and animation. Morgan Kaufmann, 2002.
[3] P. Rosin and J. Collomosse, Image and video-based artistic stylisation. Springer Science & Business Media, 2012, vol. 42.
[4] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[5] A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and transfer,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 2001, pp. 341–346.
[6] I. Drori, D. Cohen-Or, and H. Yeshurun, “Example-based style synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 2003, pp. II–143.
[7] O. Frigo, N. Sabater, J. Delon, and P. Hellier, “Split and match: Example-based adaptive patch sampling for unsupervised style transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 553–561.
[8] M. Elad and P. Milanfar, “Style transfer via texture synthesis,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2338–2351, 2017.
[9] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, “Image analogies,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 2001, pp. 327–340.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” ArXiv e-prints, Aug. 2015.
[11] I. Prisma Labs, “Prisma: Turn memories into art using artificial intelligence,” 2016. [Online]. Available: http://prisma-ai.com
[12] “Ostagram,” 2016. [Online]. Available: http://ostagram.ru
[13] A. J. Champandard, “Deep forger: Paint photos in the style of famous artists,” 2015. [Online]. Available: http://deepforger.com
[14] J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg, “State of the ‘art’: A taxonomy of artistic stylization techniques for images and video,” IEEE transactions on visualization and computer graphics, vol. 19, no. 5, pp. 866–885, 2013.
[15] A. Semmo, T. Isenberg, and J. Döllner, “Neural style transfer: A paradigm shift for image-based artistic rendering?” in Proceedings of the Symposium on Non-Photorealistic Animation and Rendering. ACM, 2017, pp. 5:1–5:13.
[16] A. Hertzmann, “Painterly rendering with curved brush strokes of multiple sizes,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques. ACM, 1998, pp. 453–460.
[17] A. Kolliopoulos, “Image segmentation for stylized non-photorealistic rendering and animation,” Ph.D. dissertation, University of Toronto, 2005.
[18] B. Gooch, G. Coombe, and P. Shirley, “Artistic vision: painterly rendering using computer vision techniques,” in Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering. ACM, 2002, pp. 83–ff.
[19] Y.-Z. Song, P. L. Rosin, P. M. Hall, and J. Collomosse, “Arty shapes,” in Proceedings of the Fourth Eurographics conference on Computational Aesthetics in Graphics, Visualization and Imaging. Eurographics Association, 2008, pp. 65–72.
[20] M. Zhao and S.-C. Zhu, “Portrait painting using active templates,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Non-Photorealistic Animation and Rendering. ACM, 2011, pp. 117–124.
[21] H. Winnemöller, S. C. Olsen, and B. Gooch, “Real-time video abstraction,” in ACM Transactions On Graphics (TOG), vol. 25, no. 3. ACM, 2006, pp. 1221–1226.
[22] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 1998, pp. 839–846.
[23] B. Gooch, E. Reinhard, and A. Gooch, “Human facial illustrations: Creation and psychophysical evaluation,” ACM Transactions on Graphics, vol. 23, no. 1, pp. 27–44, 2004.
[24] L.-Y. Wei, S. Lefebvre, V. Kwatra, and G. Turk, “State of the art in example-based texture synthesis,” in Eurographics 2009, State of the Art Report, EG-STAR. Eurographics Association, 2009, pp. 93–117.
[25] A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 1033–1038.
[26] L.-Y. Wei and M. Levoy, “Fast texture synthesis using tree-structured vector quantization,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 479–488.
[27] B. Julesz, “Visual pattern discrimination,” IRE transactions on Information Theory, vol. 8, no. 2, pp. 84–92, 1962.
[28] D. J. Heeger and J. R. Bergen, “Pyramid-based texture analysis/synthesis,” in Proceedings of the 22nd annual conference on Computer graphics and interactive techniques. ACM, 1995, pp. 229–238.
[29] J. Portilla and E. P. Simoncelli, “A parametric texture model based on joint statistics of complex wavelet coefficients,” International journal of computer vision, vol. 40, no. 1, pp. 49–70, 2000.
[30] L. A. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 262–270.
[31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[32] G. Berger and R. Memisevic, “Incorporating long-range consistency in cnn-based texture generation,” in International Conference on Learning Representations, 2017.
[33] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5188–5196.
[34] A. Mahendran and A. Vedaldi, “Visualizing deep convolutional neural networks using natural pre-images,” International Journal of Computer Vision, vol. 120, no. 3, pp. 233–255, 2016.
[35] A. Dosovitskiy and T. Brox, “Inverting visual representations with convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4829–4837.
[36] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[38] X. Xie, F. Tian, and H. S. Seah, “Feature guided texture synthesis (fgts) for artistic style transfer,” in Proceedings of the 2nd international conference on Digital interactive media in entertainment and arts. ACM, 2007, pp. 44–49.
[39] M. Ashikhmin, “Fast texture transfer,” IEEE Computer Graphics and Applications, no. 4, pp. 38–43, 2003.
[40] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks,” 2015. [Online]. Available: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
[41] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[42] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 2230–2236. [Online]. Available: https://doi.org/10.24963/ijcai.2017/310
[43] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A survey of recent advances,” IEEE signal processing magazine, vol. 32, no. 3, pp. 53–69, 2015.
[44] E. Risser, P. Wilmot, and C. Barnes, “Stable and controllable neural texture synthesis and style transfer using histogram losses,” ArXiv e-prints, Jan. 2017.
[45] S. Li, X. Xu, L. Nie, and T.-S. Chua, “Laplacian-steered neural style transfer,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 1716–1724.
[46] C. Li and M. Wand, “Combining markov random fields and convolutional neural networks for image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2479–2486.
[47] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016, pp. 694–711.
[48] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky, “Texture networks: Feed-forward synthesis of textures and stylized images,” in International Conference on Machine Learning, 2016, pp. 1349–1357.
[49] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” ArXiv e-prints, Nov. 2015.
[50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6924–6932.
[51] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501–1510.
[52] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European Conference on Computer Vision, 2016, pp. 702–716.
[53] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representation for artistic style,” in International Conference on Learning Representations, 2017.
[64] W. Chen, Z. Fu, D. Yang, and J. Deng, “Single-image depth perception in the wild,” in Advances in Neural Information Processing Systems, 2016, pp. 730–738.
[65] A. J. Champandard, “Semantic style transfer and turning two-bit doodles into fine artworks,” ArXiv e-prints, Mar. 2016.
[66] J. Ye, Z. Feng, Y. Jing, and M. Song, “Finer-net: Cascaded human parsing with hierarchical granularity,” in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2018, pp. 1–6.
[67] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 2018.
[68] Y.-L. Chen and C.-T. Hsu, “Towards deep style transfer: A content-aware perspective,” in Proceedings of the British Machine Vision Conference, 2016.
[69] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in European Conference on Computer Vision, 2018.
[70] M. Lu, H. Zhao, A. Yao, F. Xu, Y. Chen, and L. Zhang, “Decoder network over lightweight reconstructed feature for fast semantic style transfer,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2469–2477.
[71] C. Castillo, S. De, X. Han, B. Singh, A. K. Yadav, and T. Goldstein, “Son of zorn’s lemma: Targeted style transfer using instance-aware semantic segmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 1348–1352.
[72] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stereoscopic neural style transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[73] A. Selim, M. Elgharib, and L. Doyle, “Painting style transfer for head portraits using convolutional neural networks,” ACM Transactions on Graphics, vol. 35, no. 4, p. 129, 2016.
[74] M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for videos,” in German Conference on Pattern Recognition, 2016, pp. 26–36.
[75] M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for videos and spherical images,” International Journal of Computer Vision, 2018.
[76] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2013, pp. 1385–1392.
[54] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: An explicit representation for neural image style transfer,” in
[77] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid,
Proceedings of the IEEE Conference on Computer Vision and Pattern
“Epicflow: Edge-preserving interpolation of correspondences for
Recognition, 2017, pp. 1897–1906.
optical flow,” in Proceedings of the IEEE Conference on Computer
[55] Y. Li, F. Chen, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, Vision and Pattern Recognition, 2015, pp. 1164–1172.
“Diversified texture synthesis with feed-forward networks,” in
[78] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and
Proceedings of the IEEE Conference on Computer Vision and Pattern
W. Liu, “Real-time neural style transfer for videos,” in Proceedings
Recognition, 2017, pp. 3920–3928.
of the IEEE Conference on Computer Vision and Pattern Recognition,
[56] H. Zhang and K. Dana, “Multi-style generative network for real- 2017, pp. 783–791.
time transfer,” arXiv preprint arXiv:1703.06953, 2017. [79] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei, “Characterizing
[57] T. Q. Chen and M. Schmidt, “Fast patch-based style transfer of ar- and improving stability in neural style transfer,” in Proceedings
bitrary style,” in Proceedings of the NIPS Workshop on Constructive of the IEEE International Conference on Computer Vision, 2017, pp.
Machine Learning, 2016. 4067–4076.
[58] G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens, [80] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, “Coherent online
“Exploring the structure of a real-time, arbitrary neural artistic video style transfer,” in Proceedings of the IEEE International Con-
stylization network,” in Proceedings of the British Machine Vision ference on Computer Vision, 2017, pp. 1105–1114.
Conference, 2017. [81] G. Atarsaikhan, B. K. Iwana, A. Narusawa, K. Yanai, and
[59] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Univer- S. Uchida, “Neural font style transfer,” in Proceedings of the IAPR
sal style transfer via feature transforms,” in Advances in Neural International Conference on Document Analysis and Recognition,
Information Processing Systems, 2017, pp. 385–395. vol. 5. IEEE, 2017, pp. 51–56.
[60] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shecht- [82] S. Yang, J. Liu, Z. Lian, and Z. Guo, “Awesome typography:
man, “Controlling perceptual factors in neural style transfer,” in Statistics-based text effects transfer,” in Proceedings of the IEEE
Proceedings of the IEEE Conference on Computer Vision and Pattern Conference on Computer Vision and Pattern Recognition, 2017, pp.
Recognition, 2017, pp. 3985–3993. 7464–7473.
[61] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song, [83] S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Dar-
“Stroke controllable fast style transfer with adaptive receptive rell, “Multi-content gan for few-shot font style transfer,” in
fields,” in European Conference on Computer Vision, 2018. Proceedings of the IEEE Conference on Computer Vision and Pattern
[62] X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang, “Multimodal Recognition, 2018.
transfer: A hierarchical deep convolutional neural network for [84] F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style
fast artistic style transfer,” in Proceedings of the IEEE Conference on transfer,” in Proceedings of the IEEE Conference on Computer Vision
Computer Vision and Pattern Recognition, 2017, pp. 5239–5247. and Pattern Recognition. IEEE, 2017, pp. 6997–7005.
[63] X.-C. Liu, M.-M. Cheng, Y.-K. Lai, and P. L. Rosin, “Depth-aware [85] R. Mechrez, E. Shechtman, and L. Zelnik-Manor, “Photorealistic
neural style transfer,” in Proceedings of the Symposium on Non- style transfer with screened poisson equation,” in Proceedings of
Photorealistic Animation and Rendering, 2017, pp. 4:1–4:10. the British Machine Vision Conference, 2017.
24

[86] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed- [109] D. DeCarlo and M. Stone, “Visual explanations,” in Proceedings of
form solution to photorealistic image stylization,” in European the 8th international symposium on non-photorealistic animation and
Conference on Computer Vision, 2018. rendering. ACM, 2010, pp. 173–178.
[87] L. Zhang, Y. Ji, and X. Lin, “Style transfer for anime sketches [110] A. Hertzmann, “Non-photorealistic rendering and the science
with enhanced residual u-net and auxiliary classifier gan,” in of art,” in Proceedings of the 8th International Symposium on Non-
Proceedings of the Asian Conference on Pattern Recognition, 2017. Photorealistic Animation and Rendering. ACM, 2010, pp. 147–157.
[88] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang, “Visual attribute [111] D. Mould, “Authorial subjective evaluation of non-photorealistic
transfer through deep image analogy,” ACM Transactions on images,” in Proceedings of the Workshop on Non-Photorealistic Ani-
Graphics (TOG), vol. 36, no. 4, p. 120, 2017. mation and Rendering. ACM, 2014, pp. 49–56.
[89] S. Jiang and Y. Fu, “Fashion style generator,” in Proceedings of the [112] T. Isenberg, P. Neumann, S. Carpendale, M. C. Sousa, and J. A.
26th International Joint Conference on Artificial Intelligence. AAAI Jorge, “Non-photorealistic rendering in context: an observational
Press, 2017, pp. 3721–3727. study,” in Proceedings of the 4th international symposium on Non-
[90] P. Verma and J. O. Smith, “Neural style transfer for audio spec- photorealistic animation and rendering. ACM, 2006, pp. 115–126.
tograms,” in Proceedings of the NIPS Workshop on Machine Learning [113] J. Ren, X. Shen, Z. Lin, R. Mech, and D. J. Foran, “Personalized
for Creativity and Design, 2017. image aesthetics,” in Proceedings of the IEEE International Confer-
ence on Computer Vision, 2017, pp. 638–647.
[91] P. K. Mital, “Time domain neural audio style transfer,” in Pro-
[114] Y. Bengio, A. Courville, and P. Vincent, “Representation learning:
ceedings of the NIPS Workshop on Machine Learning for Creativity
A review and new perspectives,” IEEE transactions on pattern
and Design, 2018.
analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[92] D. Mould and P. L. Rosin, “A benchmark image set for evaluating
[115] H. Kim and A. Mnih, “Disentangling by factorising,” in Interna-
stylization,” in Proceedings of the Joint Symposium on Computa-
tional Conference on Machine Learning, 2018.
tional Aesthetics and Sketch Based Interfaces and Modeling and Non-
[116] Z. Feng, X. Wang, C. Ke, A. Zeng, D. Tao, and M. Song, “Dual
Photorealistic Animation and Rendering. Eurographics Association,
swap disentangling,” in Advances in neural information processing
2016, pp. 11–20.
systems, 2018.
[93] ——, “Developing and applying a benchmark for evaluating [117] Z. Feng, Z. Yu, Y. Yang, Y. Jing, J. Jiang, and M. Song, “In-
image stylization,” Computers & Graphics, vol. 67, pp. 58–76, 2017. terpretable partitioned embedding for customized multi-item
[94] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, fashion outfit composition,” in Proceedings of the 2018 ACM on
“Salient object detection: A discriminative regional feature inte- International Conference on Multimedia Retrieval. ACM, 2018, pp.
gration approach.” International Journal of Computer Vision, vol. 143–151.
123, no. 2, 2017. [118] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess,
[95] P. L. Rosin, D. Mould, I. Berger, J. Collomosse, Y.-K. Lai, C. Li, M. Botvinick, D. Hassabis, and A. Lerchner, “Scan: learning ab-
H. Li, A. Shamir, M. Wand, T. Wang, and H. Winnemöller, stract hierarchical compositional visual concepts,” in International
“Benchmarking non-photorealistic rendering of portraits,” in Conference on Learning Representations, 2018.
Proceedings of the Symposium on Non-Photorealistic Animation and [119] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman,
Rendering. ACM, 2017, p. 11. “Building machines that learn and think like people,” Behavioral
[96] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, and Brain Sciences, vol. 40, 2017.
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects [120] D. Bouchacourt, R. Tomioka, and S. Nowozin, “Multi-level vari-
in context,” in European conference on computer vision. Springer, ational autoencoder: Learning disentangled representations from
2014, pp. 740–755. grouped observations,” in AAAI Conference on Artificial Intelli-
[97] J. Johnson, “neural-style,” https://github.com/jcjohnson/ gence, 2018.
neural-style, 2015. [121] C. Wang, C. Wang, C. Xu, and D. Tao, “Tag disentangled gen-
[98] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- erative adversarial networks for object image re-rendering,” in
image translation using cycle-consistent adversarial networks,” International Joint Conference on Artificial Intelligence, 2017.
in Proceedings of the IEEE Conference on Computer Vision and Pattern [122] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick,
Recognition, 2017, pp. 2223–2232. S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual
[99] “DeepArt,” 2016. [Online]. Available: https://deepart.io/ concepts with a constrained variational framework,” in Interna-
[100] R. Sreeraman, “Neuralstyler: Turn your videos/photos/gif into tional Conference on Learning Representations, 2017.
art,” 2016. [Online]. Available: http://neuralstyler.com/ [123] J. Cho, S. Yun, K. Lee, and J. Y. Choi, “Palettenet: Image recol-
[101] A. Semmo, M. Trapp, J. Döllner, and M. Klingbeil, “Pictory: orization with given color palette,” in Proceedings of the IEEE
Combining neural style transfer and image filtering,” in ACM Conference on Computer Vision and Pattern Recognition Workshops,
SIGGRAPH 2017 Appy Hour. ACM, 2017, pp. 5:1–5:2. 2017, pp. 62–70.
[124] S. Meyer, V. Cornillère, A. Djelouah, C. Schroers, and M. Gross,
[102] S. Pasewaldt, A. Semmo, M. Klingbeil, and J. Döllner, “Pictory
“Deep video color propagation,” in Proceedings of the British
- neural style transfer and editing with coreml,” in SIGGRAPH
Machine Vision Conference, 2018.
Asia 2017 Mobile Graphics & Interactive Applications. ACM, 2017,
[125] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Good-
pp. 12:1–12:2.
fellow, and R. Fergus, “Intriguing properties of neural networks,”
[103] T. Dürschmid, M. Söchting, A. Semmo, M. Trapp, and J. Döllner, in International Conference on Learning Representations, 2014.
“Prosumerfx: Mobile design of image stylization components,”
[126] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and
in SIGGRAPH Asia 2017 Mobile Graphics & Interactive Applications.
harnessing adversarial examples,” in International Conference on
ACM, 2017, pp. 1:1–1:8.
Learning Representations, 2015.
[104] Y. Jia and P. Vajda, “Delivering real-time ai [127] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in
in the palm of your hand,” 2016. [Online]. Proceedings of the IEEE Conference on Computer Vision and Pattern
Available: https://code.facebook.com/posts/196146247499076/ Recognition, 2018.
delivering-real-time-ai-in-the-palm-of-your-hand [128] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala,
[105] B. J. Joshi, K. Stewart, and D. Shapiro, “Bringing impressionism and K. Weinberger, “Deep feature interpolation for image content
to life with neural style transfer in come swim,” ArXiv e-prints, changes,” in Proceedings of the IEEE Conference on Computer Vision
Jan. 2017. and Pattern Recognition, 2017, pp. 7064–7073.
[106] J. Fišer, O. Jamriška, M. Lukáč, E. Shechtman, P. Asente, J. Lu, and [129] M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan, “Deep
D. Sỳkora, “Stylit: illumination-guided example-based stylization exemplar-based colorization,” ACM Transactions on Graphics
of 3d renderings,” ACM Transactions on Graphics (TOG), vol. 35, (Proc. of Siggraph 2018), 2018.
no. 4, p. 92, 2016. [130] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “Decouple
[107] D. H. Salesin, “Non-photorealistic animation & rendering: 7 learning for parameterized image operators,” in European Confer-
grand challenges,” Keynote talk at NPAR, 2002. ence on Computer Vision, 2018.
[108] A. A. Gooch, J. Long, L. Ji, A. Estey, and B. S. Gooch, “View- [131] A. Atapour-Abarghouei and T. Breckon, “Real-time monocular
ing progress in non-photorealistic rendering through heinlein’s depth estimation using synthetic data with domain adaptation
lens,” in Proceedings of the 8th International Symposium on Non- via image style transfer,” in Proceedings of the IEEE Conference on
Photorealistic Animation and Rendering. ACM, 2010, pp. 165–171. Computer Vision and Pattern Recognition, 2018.
25

[132] C. Chen, X. Tan, and K.-Y. K. Wong, “Face sketch synthesis with
style transfer using pyramid column feature,” in IEEE Winter
Conference on Applications of Computer Vision. Lake Tahoe, USA,
2018.
[133] I. Korshunova, W. Shi, J. Dambre, and L. Theis, “Fast face-swap
using convolutional neural networks,” in Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp. 3677–3685.
[134] W. Zhang, S. Xiao, and X. Shi, “Low-poly style image and video
processing,” in Systems, Signals and Image Processing (IWSSIP),
2015 International Conference on. IEEE, 2015, pp. 97–100.
[135] M. Gai and G. Wang, “Artistic low poly rendering for images,”
The visual computer, vol. 32, no. 4, pp. 491–500, 2016.
[136] T. Gerstner, D. DeCarlo, M. Alexa, A. Finkelstein, Y. Gingold,
and A. Nealen, “Pixelated image abstraction,” in Proceedings
of the Symposium on Non-Photorealistic Animation and Rendering.
Eurographics Association, 2012, pp. 29–36.
[137] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multi-
modal unsupervised image-to-image translation,” arXiv preprint
arXiv:1804.04732, 2018.
Deep Learning:
A Critical Appraisal

Gary Marcus1
New York University

Abstract
Although deep learning has historical roots going back decades, neither the term “deep
learning” nor the approach was popular just over five years ago, when the field was
reignited by papers such as Krizhevsky, Sutskever and Hinton’s now classic 2012 deep net
model of ImageNet (Krizhevsky, Sutskever, & Hinton, 2012).

What has the field discovered in the five subsequent years? Against a background of
considerable progress in areas such as speech recognition, image recognition, and game
playing, and considerable enthusiasm in the popular press, I present ten concerns for deep
learning, and suggest that deep learning must be supplemented by other techniques if we
are to reach artificial general intelligence.


1 Departments of Psychology and Neural Science, New York University, gary.marcus at nyu.edu. I thank Christina
Chen, François Chollet, Ernie Davis, Zack Lipton, Stefano Pacifico, Suchi Saria, and Athena Vouloumanos for
sharp-eyed comments, all generously supplied on short notice during the holidays at the close of 2017.

For most problems where deep learning has enabled
transformationally better solutions (vision, speech), we've
entered diminishing returns territory in 2016-2017.
François Chollet, Google, author of Keras
neural network library
December 18, 2017

‘Science progresses one funeral at a time.’ The future
depends on some graduate student who is deeply suspicious
of everything I have said.
Geoff Hinton, grandfather of deep learning
September 15, 2017

1. Is deep learning approaching a wall?


Although deep learning has historical roots going back decades (Schmidhuber, 2015), it
attracted relatively little notice until just over five years ago. Virtually everything
changed in 2012, with the publication of a series of highly influential papers such as
Krizhevsky, Sutskever and Hinton’s 2012 ImageNet Classification with Deep
Convolutional Neural Networks (Krizhevsky, Sutskever, & Hinton, 2012), which
achieved state-of-the-art results on the object recognition challenge known as ImageNet
(Deng et al., ). Other labs were already pursuing similar work (Cireşan, Meier, Masci,
& Schmidhuber, 2012). Before the year was out, deep learning made the front page of
The New York Times2, and it rapidly became the best-known technique in artificial
intelligence, by a wide margin. If the general idea of training neural networks with
multiple layers was not new, it was, in part because of increases in computational power
and data, the first time that deep learning truly became practical.

Deep learning has since yielded numerous state-of-the-art results in domains such as
speech recognition, image recognition, and language translation, and plays a role in a
wide swath of current AI applications. Corporations have invested billions of dollars
fighting for deep learning talent. One prominent deep learning advocate, Andrew Ng, has
gone so far as to suggest that “If a typical person can do a mental task with less than one
second of thought, we can probably automate it using AI either now or in the near

2 http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html

future.” (A, 2016). A recent New York Times Sunday Magazine article3, largely about
deep learning, implied that the technique is “poised to reinvent computing itself.”

Yet deep learning may well be approaching a wall, much as I anticipated earlier, at the
beginning of the resurgence (Marcus, 2012), and as leading figures like Hinton (Sabour,
Frosst, & Hinton, 2017) and Chollet (2017) have begun to imply in recent months.

What exactly is deep learning, and what has it shown about the nature of intelligence?
What can we expect it to do, and where might we expect it to break down? How close or
far are we from “artificial general intelligence”, and a point at which machines show a
human-like flexibility in solving unfamiliar problems? The purpose of this paper is both
to temper some irrational exuberance and also to consider what we as a field might need
to move forward.

This paper is written simultaneously for researchers in the field and for a growing set of
AI consumers with less technical background who may wish to understand where the
field is headed. As such I will begin with a very brief, nontechnical introduction4 aimed at
elucidating what deep learning systems do well and why (Section 2), before turning to an
assessment of deep learning’s weaknesses (Section 3) and some fears that arise from
misunderstandings about deep learning’s capabilities (Section 4), and closing with a
perspective on going forward (Section 5).

Deep learning is not likely to disappear, nor should it. But five years into the field’s
resurgence seems like a good moment for a critical reflection on what deep learning has
and has not been able to achieve.

2. What deep learning is, and what it does well


Deep learning, as it is primarily used, is essentially a statistical technique for classifying
patterns, based on sample data, using neural networks with multiple layers.5

3 https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
4 For a more technical introduction, there are many excellent recent tutorials on deep learning including (Chollet,
2017) and (Goodfellow, Bengio, & Courville, 2016), as well as insightful blogs and online resources from Zachary
Lipton, Chris Olah, and many others.
5 Other applications of deep learning beyond classification are possible, too, though currently less popular, and
outside of the scope of the current article. These include using deep learning as an alternative to regression, as a
component in generative models that create (e.g.) synthetic images, as a tool for compressing images, as a tool for
learning probability distributions, and (relatedly) as an important technique for approximation known as variational
inference.

Neural networks in the deep learning literature typically consist of a set of input units that
stand for things like pixels or words, multiple hidden layers (the more such layers, the
deeper a network is said to be) containing hidden units (also known as nodes or neurons),
and a set of output units, with connections running between those nodes. In a typical
application such a network might be trained on a large set of handwritten digits (these
are the inputs, represented as images) and labels (these are the outputs) that identify the
categories to which those inputs belong (this image is a 2, that one is a 3, and so forth).

[Figure: schematic of a neural network with an input layer, hidden layers, and an output layer.]

Over time, an algorithm called back-propagation allows a process called gradient descent
to adjust the connections between units, such that any given input tends to
produce the corresponding output.
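
To make the training loop concrete, here is a minimal sketch in plain Python of back-propagation feeding gradient descent. It is an illustration under assumptions of my own, not code from any system discussed in this paper: the XOR task, the three hidden units, the sigmoid units, the squared-error loss, the learning rate, and the function name train_xor are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_xor(epochs=5000, lr=0.5, hidden=3, seed=0):
    """Train a tiny 2-input, `hidden`-unit, 1-output network on XOR.

    Returns the mean squared error after each epoch and a predict function.
    All sizes and hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    # w1[j] = [weight from x0, weight from x1, bias] for hidden unit j
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # w2 = one weight per hidden unit, followed by the output bias
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
        y = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + w2[hidden])
        return h, y

    losses = []
    for _ in range(epochs):
        g1 = [[0.0] * 3 for _ in range(hidden)]
        g2 = [0.0] * (hidden + 1)
        total = 0.0
        for x, t in data:
            h, y = forward(x)
            total += (y - t) ** 2
            # Back-propagation: apply the chain rule from the output error
            # back to every connection weight.
            dy = 2 * (y - t) * y * (1 - y)
            for j in range(hidden):
                g2[j] += dy * h[j]
                dh = dy * w2[j] * h[j] * (1 - h[j])
                g1[j][0] += dh * x[0]
                g1[j][1] += dh * x[1]
                g1[j][2] += dh
            g2[hidden] += dy
        # Gradient descent: nudge each weight against its averaged gradient.
        n = len(data)
        for j in range(hidden):
            for k in range(3):
                w1[j][k] -= lr * g1[j][k] / n
        for j in range(hidden + 1):
            w2[j] -= lr * g2[j] / n
        losses.append(total / n)
    return losses, lambda x: forward(x)[1]


losses, predict = train_xor()
```

Comparing losses[0] with losses[-1] shows the error shrinking as the connections are adjusted; how closely the outputs approach the XOR targets depends on the random starting weights.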

Collectively, one can think of the relation between inputs and outputs that a neural
network learns as a mapping. Neural networks, particularly those with multiple hidden
layers (hence the term deep), are remarkably good at learning input-output mappings.

Such systems are commonly described as neural networks because the input nodes,
hidden nodes, and output nodes can be thought of as loosely analogous to biological
neurons, albeit greatly simplified, and the connections between nodes can be thought of
as in some way reflecting connections between neurons. A longstanding question, outside
the scope of the current paper, concerns the degree to which artificial neural networks are
biologically plausible.

Most deep learning networks make heavy use of a technique called convolution (LeCun,
1989), which constrains the neural connections in the network such that they innately
capture a property known as translational invariance. This is essentially the idea that an
object can slide around an image while maintaining its identity; a circle in the top left can
be presumed (even absent direct experience) to be the same as a circle in the bottom right.
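
A small one-dimensional sketch may help show why sharing weights across positions has this effect. The kernel, the two signals, and the final max-pooling step are assumptions introduced for illustration. Convolution by itself makes the response shift along with the input; taking the maximum over positions, a step not described above, is one common way to turn that shifted response into a position-independent one.

```python
def conv1d(signal, kernel):
    """Slide the same kernel weights across every position of the signal."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical detector for a small "bump" pattern.
kernel = [1, 2, 1]

# The same bump placed at the far left and at the far right of the input.
bump_left = [1, 2, 1, 0, 0, 0, 0, 0]
bump_right = [0, 0, 0, 0, 0, 1, 2, 1]

response_left = conv1d(bump_left, kernel)
response_right = conv1d(bump_right, kernel)

# The strongest response moves with the pattern...
peak_left = response_left.index(max(response_left))
peak_right = response_right.index(max(response_right))

# ...but its strength is the same wherever the pattern appears.
pooled_left = max(response_left)
pooled_right = max(response_right)
```

Because one set of weights is reused at every position, the detector never has to see the bump in the right-hand location in order to respond to it there.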

Deep learning is also known for its ability to self-generate intermediate representations,
such as internal units that may respond to things like horizontal lines, or more complex
elements of pictorial structure.

In principle, given infinite data, deep learning systems are powerful enough to represent
any finite deterministic “mapping” between any given set of inputs and a set of
corresponding outputs, though in practice whether they can learn such a mapping
depends on many factors. One common concern is getting caught in local minima, in
which a system gets stuck on a suboptimal solution, with no better solution nearby in the
space of solutions being searched. (Experts use a variety of techniques to avoid such
problems, to reasonably good effect.) In practice, results with large data sets are often
quite good, on a wide range of potential mappings.

In speech recognition, for example, a neural network learns a mapping between a set of
speech sounds and a set of labels (such as words or phonemes). In object recognition, a
neural network learns a mapping between a set of images and a set of labels (such that,
for example, pictures of cars are labeled as cars). In DeepMind’s Atari game system
(Mnih et al., 2015), neural networks learned mappings between pixels and joystick
positions.

Deep learning systems are most often used as classification systems, in the sense that the
mission of a typical network is to decide which of a set of categories (defined by the
output units on the neural network) a given input belongs to. With enough imagination,
the power of classification is immense; outputs can represent words, places on a Go
board, or virtually anything else.
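
As a sketch of what deciding among categories defined by the output units can look like, the snippet below turns one raw score per output unit into a category choice. The softmax normalization is a widely used convention that I am assuming here, since the text above does not name one; the scores and labels are invented for the example.

```python
import math


def softmax(scores):
    """Convert raw output-unit scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Pick the category whose output unit has the highest probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical output-unit activations for a single input image.
labels = ["car", "cat", "dog"]
scores = [0.1, 2.0, -1.0]
label, confidence = classify(scores, labels)
```

Swapping the label list for words or board positions changes nothing in the code, which is the sense in which classification can stand in for a very wide range of tasks.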

In a world with infinite data, and infinite computational resources, there might be little
need for any other technique.

3. Limits on the scope of deep learning


Deep learning’s limitations begin with the contrapositive: we live in a world in which
data are never infinite. Instead, systems that rely on deep learning frequently have to
generalize beyond the specific data that they have seen, whether to a new pronunciation
of a word or to an image that differs from one that the system has seen before, and where
data are less than infinite, the ability of formal proofs to guarantee high-quality
performance is more limited.

As discussed later in this article, generalization can be thought of as coming in two
flavors, interpolation between known examples, and extrapolation, which requires going
beyond a space of known training examples (Marcus, 1998a).
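
The difference between the two flavors can be seen even in a toy learner. The sketch below is an assumption-laden stand-in rather than a neural network: it uses a nearest-neighbor predictor, a made-up target function (squaring), and training inputs confined to the range from 0 to 1. It then compares the error on a query inside that range with the error on a query far outside it.

```python
def nearest_neighbor_predict(train, x):
    """Predict y for x by copying the label of the closest training input."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def target(x):
    # Hypothetical function the learner is supposed to capture.
    return x * x


# Training inputs cover only the interval [0, 1].
train = [(i / 10, target(i / 10)) for i in range(11)]

inside = 0.52   # lies between known examples: interpolation
outside = 3.0   # lies beyond every known example: extrapolation

err_inside = abs(nearest_neighbor_predict(train, inside) - target(inside))
err_outside = abs(nearest_neighbor_predict(train, outside) - target(outside))
```

Inside the training range the prediction is close to the target; outside it the learner can only repeat the nearest answer it has already seen, and the error grows accordingly.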

For neural networks to generalize well, there generally must be a large amount of data,
and the test data must be similar to the training data, allowing new answers to be
interpolated in between old ones. In Krizhevsky et al.’s paper (Krizhevsky, Sutskever, &
Hinton, 2012), a nine-layer convolutional neural network with 60 million parameters and
650,000 nodes was trained on roughly a million distinct examples drawn from
approximately one thousand categories.6
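
The data augmentation described in footnote 6 can be sketched as follows. This is a simplified illustration, not the procedure used by Krizhevsky et al.: the tiny image, the crop size, the additive brightness shifts, and all function names are assumptions chosen to keep the example short.

```python
def mirror(image):
    """Mirror-reverse an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, delta, lo=0, hi=255):
    """Shift every pixel by delta, clamped to the valid intensity range."""
    return [[min(hi, max(lo, p + delta)) for p in row] for row in image]


def crops(image, size):
    """All size-by-size windows, i.e. the image seen at different locations."""
    h, w = len(image), len(image[0])
    return [
        [row[c:c + size] for row in image[r:r + size]]
        for r in range(h - size + 1)
        for c in range(w - size + 1)
    ]


def augment(image, label, crop_size, deltas=(-20, 0, 20)):
    """Expand one labeled example into many labeled variants."""
    out = []
    for patch in crops(image, crop_size):
        for variant in (patch, mirror(patch)):
            for d in deltas:
                out.append((adjust_brightness(variant, d), label))
    return out


# A hypothetical 3x3 grayscale image with a single label.
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
examples = augment(image, "car", crop_size=2)
```

Here one labeled image becomes 24 training examples (4 locations, each in 2 orientations and at 3 brightness levels), all sharing the original label.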

This sort of brute force approach worked well in the very finite world of ImageNet, in
which all stimuli can be classified into a comparatively small set of categories. It also
works well in stable domains like speech recognition in which exemplars are mapped in
a constant way onto a limited set of speech sound categories, but for many reasons deep
learning cannot be considered (as it sometimes is in the popular press) as a general
solution to artificial intelligence.

Here are ten challenges faced by current deep learning systems:

3.1. Deep learning thus far is data hungry

Human beings can learn abstract relationships in a few trials. If I told you that a schmister
was a sister over the age of 10 but under the age of 21, perhaps giving you a single
example, you could immediately infer whether you had any schmisters, whether your best
friend had a schmister, whether your children or parents had any schmisters, and so forth.
(Odds are, your parents no longer do, if they ever did, and you could rapidly draw that
inference, too.)

In learning what a schmister is, in this case through explicit definition, you rely not on
hundreds or thousands or millions of training examples, but on a capacity to represent
abstract relationships between algebra-like variables.
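
One way to picture what an explicit definition over variables provides is to write the schmister rule down as a short function. This is only a sketch: the Sibling record, its fields, and the sample names and ages are hypothetical, and nothing here is meant as a model of how people actually represent such rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sibling:
    name: str
    age: int
    is_sister: bool


def is_schmister(sibling):
    """A schmister is a sister over the age of 10 but under the age of 21."""
    return sibling.is_sister and 10 < sibling.age < 21


def schmisters(siblings):
    """Apply the rule to anyone's siblings, with no training examples."""
    return [s.name for s in siblings if is_schmister(s)]


# Hypothetical siblings; the rule applies to any individuals whatsoever.
my_siblings = [
    Sibling("Ana", 15, True),
    Sibling("Ben", 15, False),
    Sibling("Cleo", 30, True),
]
found = schmisters(my_siblings)
```

The rule is stated once, over the variables age and is_sister, and then applies immediately to every new case, which is the kind of generalization the paragraph above describes.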

Humans can learn such abstractions, both through explicit definition and more implicit
means (Marcus, 2001). Indeed even 7-month-old infants can do so, acquiring learned
abstract language-like rules from a small number of unlabeled examples, in just two

6 Using a common technique known as data augmentation, each example was actually presented along with its label
in many different locations, both in its original form and in mirror-reversed form. A second type of data
augmentation varied the brightness of the images, yielding still more examples for training, in order to train the
network to recognize images with different intensities. Part of the art of machine learning involves knowing what
forms of data augmentation will and won’t help within a given system.

minutes (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Subsequent work by Gervain
and colleagues (2012) suggests that newborns are capable of similar computations.

Deep learning currently lacks a mechanism for learning abstractions through explicit,
verbal definition, and works best when there are thousands, millions or even billions of
training examples, as in DeepMind’s work on board games and Atari. As Brenden Lake
and his colleagues have recently emphasized in a series of papers, humans are far more
efficient in learning complex rules than deep learning systems are (Lake, Salakhutdinov,
& Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016). (See also related
work by George et al. (2017), and my own work with Steven Pinker on children’s
overregularization errors in comparison to neural networks (Marcus et al., 1992).)

Geoff Hinton has also worried about deep learning’s reliance on large numbers of labeled
examples, and expressed this concern in his recent work on capsule networks with his
coauthors (Sabour et al., 2017) noting that convolutional neural networks (the most
common deep learning architecture) may face “exponential inefficiencies that may lead to
their demise. A good candidate is the difficulty that convolutional nets have in
generalizing to novel viewpoints [i.e., perspectives on objects in visual recognition tasks].
The ability to deal with translation[al invariance] is built in, but for the other ... [common
type of] transformation we have to chose between replicating feature detectors on a grid
that grows exponentially ... or increasing the size of the labelled training set in a similarly
exponential way.”

In problems where data are limited, deep learning often is not an ideal solution.

3.2. Deep learning thus far is shallow and has limited capacity for transfer

Although deep learning is capable of some amazing things, it is important to realize that
the word “deep” in deep learning refers to a technical, architectural property (the large
number of hidden layers used in modern neural networks, where their predecessors
used only one) rather than a conceptual one (the representations acquired by such
networks don’t, for example, naturally apply to abstract concepts like “justice”,
“democracy” or “meddling”).

Even more down-to-earth concepts like “ball” or “opponent” can lie out of reach.
Consider for example DeepMind’s Atari game work (Mnih et al., 2015) on deep
reinforcement learning, which combines deep learning with reinforcement learning (in
which a learner tries to maximize reward). Ostensibly, the results are fantastic: the system
meets or beats human experts on a large sample of games using a single set of
“hyperparameters” that govern properties such as the rate at which a network alters its
weights, and no advance knowledge about specific games, or even their rules. But it is

easy to wildly overinterpret what the results show. To take one example, according to a
widely-circulated video of the system learning to play the brick-breaking Atari game
Breakout, “after 240 minutes of training, [the system] realizes that digging a tunnel
through the wall is the most effective technique to beat the game”.

But the system has learned no such thing; it doesn’t really understand what a tunnel or
a wall is; it has just learned specific contingencies for particular scenarios. Transfer
tests, in which the deep reinforcement learning system is confronted with scenarios
that differ in minor ways from the ones on which the system was trained, show that
deep reinforcement learning’s solutions are often extremely superficial. For example, a
team of researchers at Vicarious showed that a more efficient successor to
DeepMind’s Atari system, Asynchronous Advantage Actor-Critic (also known as A3C),
failed on a variety of minor perturbations to Breakout relative to the training set
(Kansky et al., 2017), such as moving the Y coordinate (height) of the paddle, or inserting
a wall midscreen. These demonstrations make clear that it is misleading to credit deep
reinforcement learning with inducing concepts like wall or paddle; rather, such remarks
are what comparative (animal) psychology sometimes calls overattributions. It’s not that
the Atari system genuinely learned a concept of wall that was robust but rather the system
superficially approximated breaking through walls within a narrow set of highly trained
circumstances.7

My own team of researchers at a startup company called Geometric Intelligence (later
acquired by Uber) found similar results as well, in the context of a slalom game. In 2017,
a team of researchers at Berkeley and OpenAI showed that it was not difficult to
construct comparable adversarial examples in a variety of games, undermining not only
DQN (the original DeepMind algorithm) but also A3C and several other related
techniques (Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017).

Recent experiments by Robin Jia and Percy Liang (2017) make a similar point, in a
different domain: language. Various neural networks were trained on a question
answering task known as SQuAD (the Stanford Question Answering
Dataset), in which the goal is to highlight the words in a particular passage that
correspond to a given question. In one sample, for instance, a trained system correctly,
and impressively, identified the quarterback on the winning team of Super Bowl XXXIII as
John Elway, based on a short paragraph. But Jia and Liang showed that the mere insertion of
distractor sentences (such as a fictional one about the alleged victory of Google’s Jeff

7 In the same paper, Vicarious proposed an alternative to deep learning called schema networks (Kansky et al., 2017)
that can handle a number of variations in the Atari game Breakout, albeit apparently without the multi-game
generality of DeepMind’s Atari system.

Dean in another Bowl game 8) caused performance to drop precipitously. Across sixteen
models, accuracy dropped from a mean of 75% to a mean of 36%.

As is so often the case, the patterns extracted by deep learning are more superficial than
they initially appear.

3.3. Deep learning thus far has no natural way to deal with
hierarchical structure
To a linguist like Noam Chomsky, the troubles Jia and Liang documented would be
unsurprising. Fundamentally, most current deep-learning based language models
represent sentences as mere sequences of words, whereas Chomsky has long argued that
language has a hierarchical structure, in which larger structures are recursively
constructed out of smaller components. (For example, in the sentence the teenager who
previously crossed the Atlantic set a record for flying around the world, the main clause is
the teenager set a record for flying around the world, while who
previously crossed the Atlantic is an embedded clause that specifies which teenager.)

In the 1980s, Fodor and Pylyshyn (1988) expressed similar concerns, with respect to an
earlier breed of neural networks. Likewise, in (Marcus, 2001), I conjectured that simple
recurrent networks (SRNs; a forerunner to today’s more sophisticated deep
learning based recurrent neural networks, known as RNNs; Elman, 1990) would have
trouble systematically representing and extending recursive structure to various kinds of
unfamiliar sentences (see the cited articles for more specific claims about which types).

Earlier this year, Brenden Lake and Marco Baroni (2017) tested whether such pessimistic
conjectures continued to hold true. As they put it in their title, contemporary neural nets
were “Still not systematic after all these years”. RNNs could “generalize well when the
differences between training and test ... are small [but] when generalization requires
systematic compositional skills, RNNs fail spectacularly”.

Similar issues are likely to emerge in other domains, such as planning and motor control,
in which complex hierarchical structure is needed, particularly when a system is likely to
encounter novel situations. One can see indirect evidence for this in the struggles with
transfer in Atari games mentioned above, and more generally in the field of robotics, in
which systems generally fail to generalize abstract plans well in novel environments.

8 Here’s the full Super Bowl passage; Jia and Liang’s distractor sentence that confused the model is at the end.
Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also
the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the
Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football
Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.

The core problem, at least at present, is that deep learning learns correlations between
sets of features that are themselves “flat” or nonhierarchical, as if in a simple, unstructured
list, with every feature on equal footing. Hierarchical structures (e.g., syntactic trees that
distinguish between main clauses and embedded clauses in a sentence) are not inherently
or directly represented in such systems, and as a result deep learning systems are forced
to use a variety of proxies that are ultimately inadequate, such as the sequential position
of a word presented in a sequence.

Systems like Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) that represent
individual words as vectors have been modestly successful; a number of systems
have used clever tricks to
try to represent complete sentences in deep-learning compatible vector spaces (Socher,
Huval, Manning, & Ng, 2012). But, as Lake and Baroni’s experiments make clear,
recurrent networks continue to be limited in their capacity to represent and generalize rich
structure in a faithful manner.

3.4. Deep learning thus far has struggled with open-ended inference
If you can’t represent nuance like the difference between “John promised Mary to leave”
and “John promised to leave Mary”, you can’t draw inferences about who is leaving
whom, or what is likely to happen next. Current machine reading systems have achieved
some degree of success in tasks like SQuAD, in which the answer to a given
question is explicitly contained within a text, but far less success in tasks in which
inference goes beyond what is explicit in a text, either by combining multiple sentences
(so-called multi-hop inference) or by combining explicit sentences with background
knowledge that is not stated in a specific text selection. Humans, as they read texts,
frequently derive wide-ranging inferences that are both novel and only implicitly
licensed, as when they, for example, infer the intentions of a character based only on
indirect dialog.

Although Bowman and colleagues (Bowman, Angeli, Potts, & Manning, 2015; Williams,
Nangia, & Bowman, 2017) have taken some important steps in this direction, there is, at
present, no deep learning system that can draw open-ended inferences based on real-
world knowledge with anything like human-level accuracy.

3.5. Deep learning thus far is not sufficiently transparent
The relative opacity of “black box” neural networks has been a major focus of discussion
in the last few years (Samek, Wiegand, & Müller, 2017; Ribeiro, Singh, & Guestrin,
2016). In their current incarnation, deep learning systems have millions or even billions
of parameters, identifiable to their developers not in terms of the sort of human-
interpretable labels that canonical programmers use (“last_character_typed”) but only in
terms of their geography within a complex network (e.g., the activity value of the ith node
in layer j in network module k). Although some strides have been made in visualizing the
contributions of individual nodes in complex networks (Nguyen, Clune, Bengio,
Dosovitskiy, & Yosinski, 2016), most observers would acknowledge that neural networks
as a whole remain something of a black box.

How much that matters in the long run remains unclear (Lipton, 2016). If systems are
robust and self-contained enough it might not matter; if it is important to use them in the
context of larger systems, it could be crucial for debuggability.

The transparency issue, as yet unsolved, is a potential liability when using deep learning
for problem domains like financial trades or medical diagnosis, in which human users
might like to understand how a given system made a given decision. As Cathy
O’Neil (2016) has pointed out, such opacity can also lead to serious issues of bias.

3.6. Deep learning thus far has not been well integrated with prior
knowledge
The dominant approach in deep learning is hermetic, in the sense of being self-
contained and isolated from other, potentially useful knowledge. Work in deep learning
typically consists of finding a training database, sets of inputs associated with respective
outputs, and learning all that is required for the problem by learning the relations between
those inputs and outputs, using whatever clever architectural variants one might devise,
along with techniques for cleaning and augmenting the data set. With just a handful of
exceptions, such as LeCun’s convolutional constraint on how neural networks are
wired (LeCun, 1989), prior knowledge is often deliberately minimized.

Thus, for example, in a system like Lerer et al.’s (2016) efforts to learn about the physics
of falling towers, there is no prior knowledge of physics (beyond what is implied in
convolution). Newton’s laws, for example, are not explicitly encoded; the system instead
(to some limited degree) approximates them by learning contingencies from raw, pixel-
level data. As I note in a forthcoming paper on innateness (Marcus, in prep), researchers in
deep learning appear to have a very strong bias against including prior knowledge even
when (as in the case of physics) that prior knowledge is well known.

It is also not straightforward in general how to integrate prior knowledge into a deep
learning system, in part because the knowledge represented in deep learning systems
pertains mainly to (largely opaque) correlations between features, rather than to
abstractions like quantified statements (e.g., all men are mortal; see the discussion of
universally-quantified one-to-one mappings in Marcus (2001)) or generics (violable
statements like dogs have four legs or mosquitos carry West Nile virus (Gelman, Leslie,
Was, & Koch, 2015)).

A related problem stems from a culture in machine learning that emphasizes competition
on problems that are inherently self-contained, with little need for broad general
knowledge. This tendency is well exemplified by the machine learning contest platform
known as Kaggle, in which contestants vie for the best results on a given data set.
Everything they need for a given problem is neatly packaged, with all the relevant input
and output files. Great progress has been made in this way; speech recognition and some
aspects of image recognition can be largely solved in the Kaggle paradigm.

The trouble, however, is that life is not a Kaggle competition; children don’t get all the
data they need neatly packaged in a single directory. Real-world learning offers data
much more sporadically, and problems aren’t so neatly encapsulated. Deep learning
works great on problems like speech recognition in which there are lots of labeled
examples, but scarcely anyone even knows how to apply it to more open-ended problems.
What’s the best way to fix a bicycle that has a rope caught in its spokes? Should I major
in math or neuroscience? No training set will tell us that.

Problems that have less to do with categorization and more to do with commonsense
reasoning essentially lie outside the scope of what deep learning is appropriate for, and so
far as I can tell, deep learning has little to offer such problems. In a recent review of
commonsense reasoning, Ernie Davis and I (2015) began with a set of easily-drawn
inferences that people can readily answer without anything like direct training, such as
Who is taller, Prince William or his baby son Prince George? Can you make a salad out
of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the
pin?

As far as I know, nobody has even tried to tackle this sort of thing with deep learning.

Such apparently simple problems require humans to integrate knowledge across vastly disparate
sources, and as such are a long way from the sweet spot of deep learning-style perceptual
classification. Instead, they are perhaps best thought of as a sign that entirely different
sorts of tools are needed, along with deep learning, if we are to reach human-level
cognitive flexibility.

3.7. Deep learning thus far cannot inherently distinguish causation
from correlation
If it is a truism that causation does not equal correlation, the distinction between the two
is also a serious concern for deep learning. Roughly speaking, deep learning learns
complex correlations between input and output features, but with no inherent
representation of causality. A deep learning system can easily learn that height and
vocabulary are, across the population as a whole, correlated, but less easily represent the
way in which that correlation derives from growth and development (kids get bigger as
they learn more words, but that doesn’t mean that growing tall causes them to learn more
words, nor that learning new words causes them to grow). Causality has been a central
strand in some other approaches to AI (Pearl, 2000) but, perhaps because deep learning is
not geared towards such challenges, relatively little work within the deep learning
tradition has tried to address it. 9
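
The height-and-vocabulary example can be made concrete with a small simulation. The sketch below is only an illustration under invented assumptions: the population, the growth rates, and the noise levels are all hypothetical and are not drawn from any real data set. Age is the common cause of both variables, and neither variable influences the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: age (in years) is the common cause.
n = 10_000
age = rng.uniform(2.0, 12.0, size=n)

# Invented, purely illustrative relationships; neither variable affects the other.
height_cm = 80.0 + 6.0 * age + rng.normal(0.0, 5.0, size=n)
vocabulary = 500.0 + 900.0 * age + rng.normal(0.0, 800.0, size=n)


def residualize(y, x):
    """Return y minus its least-squares linear fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Raw correlation: the kind of regularity a purely correlational learner picks up.
raw_corr = np.corrcoef(height_cm, vocabulary)[0, 1]

# Correlation after removing the linear effect of age from both variables.
partial_corr = np.corrcoef(residualize(height_cm, age),
                           residualize(vocabulary, age))[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"controlling for age: {partial_corr:.3f}")
```

In this toy setting the raw correlation is strong while the correlation that remains after controlling for age is close to zero. Recovering that structure required knowing in advance that age was the variable to control for, which is exactly the kind of causal knowledge that a correlation between input and output features does not supply by itself.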

3.8. Deep learning presumes a largely stable world, in ways that may
be problematic
The logic of deep learning is such that it is likely to work best in highly stable worlds,
like the board game Go, which has unvarying rules, and less well in systems such as
politics and economics that are constantly changing. To the extent that deep learning is
applied in tasks such as stock prediction, there is a good chance that it will eventually
face the fate of Google Flu Trends, which initially did a great job of predicting
epidemiological data based on search trends, only to completely miss things like the peak of the
2013 flu season (Lazer, Kennedy, King, & Vespignani, 2014).

3.9. Deep learning thus far works well as an approximation, but its
answers often cannot be fully trusted
In part as a consequence of the other issues raised in this section, deep learning systems
are quite good at some large fraction of a given domain, yet easily fooled.

An ever-growing array of papers has shown this vulnerability, from the linguistic
examples of Jia and Liang mentioned above to a wide range of demonstrations in the
domain of vision, where deep learning systems have mistaken yellow-and-black patterns
of stripes for school buses (Nguyen, Yosinski, & Clune, 2014) and sticker-clad parking
signs for well-stocked refrigerators (Vinyals, Toshev, Bengio, & Erhan, 2014) in the
context of a captioning system that otherwise seems impressive.

More recently, there have been real-world stop signs, lightly defaced, that have been
mistaken for speed limit signs (Evtimov et al., 2017) and 3D-printed turtles that have been
mistaken for rifles (Athalye, Engstrom, Ilyas, & Kwok, 2017). A recent news story

9 One example of interesting recent work is (Lopez-Paz, Nishihara, Chintala, Schölkopf, & Bottou, 2017), albeit
focused specifically on a rather unusual sense of the term causation as it relates to the presence or absence of
objects (e.g., “the presence of cars cause the presence of wheel[s]”). This strikes me as quite different from the sort of
causation one finds in the relation between a disease and the symptoms it causes.

recounts the trouble a British police system has had in distinguishing nudes from sand
dunes.10

The “spoofability” of deep learning systems was perhaps first noted by Szegedy et
al. (2013). Four years later, despite much active research, no robust solution has been
found.11

3.10. Deep learning thus far is difficult to engineer with


Another fact that follows from all the issues raised above is that it is simply hard to do
robust engineering with deep learning. As a team of authors at Google put it in 2014, in
the title of an important, and as yet unanswered, essay (Sculley, Phillips, Ebner,
Chaudhary, & Young, 2014), machine learning is “the high-interest credit card of
technical debt”, meaning that it is comparatively easy to make systems that work in some
limited set of circumstances (short term gain), but quite difficult to guarantee that they
will work in alternative circumstances with novel data that may not resemble previous
training data (long term debt, particularly if one system is used as an element in another
larger system).

In an important talk at ICML, Leon Bottou (2015) compared machine learning to the
development of an airplane engine, and noted that while the airplane design relies on
building complex systems out of simpler systems for which it was possible to create
sound guarantees about performance, machine learning lacks the capacity to produce
comparable guarantees. As Google’s Peter Norvig (2016) has noted, machine learning as
yet lacks the incrementality, transparency and debuggability of classical programming,
trading off a kind of simplicity for deep challenges in achieving robustness.

Henderson and colleagues have recently extended these points, with a focus on deep
reinforcement learning, noting some serious issues in the field related to robustness and
replicability (Henderson et al., 2017).

Although there has been some progress in automating the process of developing machine
learning systems (Zoph, Vasudevan, Shlens, & Le, 2017), there is a long way to go.

10 https://gizmodo.com/british-cops-want-to-use-ai-to-spot-porn-but-it-keeps-m-1821384511/amp
11 Deep learning’s predecessors were vulnerable to similar problems, as Pinker and Prince (1988) pointed out, in a
discussion of neural networks that produced bizarre past tense forms for a subset of its inputs. The verb to mail, for
example, was inflected in the past tense as membled, the verb tour as toureder. Children rarely if ever make mistakes
like these.

3.11. Discussion
Of course, deep learning is, by itself, just mathematics; none of the problems identified
above are because the underlying mathematics of deep learning are somehow flawed. In
general, deep learning is a perfectly fine way of optimizing a complex system for
representing a mapping between inputs and outputs, given a sufficiently large data set.

The real problem lies in misunderstanding what deep learning is, and is not, good for. The
technique excels at solving closed-end classification problems, in which a wide range of
potential signals must be mapped onto a limited number of categories, given that there is
enough data available and the test set closely resembles the training set.

But deviations from these assumptions can cause problems; deep learning is just a
statistical technique, and all statistical techniques suffer from deviation from their
assumptions.

Deep learning systems work less well when there are limited amounts of training data
available, or when the test set differs importantly from the training set, or when the space
of examples is broad and filled with novelty. And some problems cannot, given real-
world limitations, be thought of as classification problems at all. Open-ended natural
language understanding, for example, should not be thought of as a classifier mapping
between a large finite set of sentences and a large, finite set of sentences, but rather a
mapping between a potentially infinite range of input sentences and an equally vast array
of meanings, many never previously encountered. In a problem like that, deep learning
becomes a square peg slammed into a round hole, a crude approximation when there
must be a solution elsewhere.

One clear way to get an intuitive sense of why something is amiss is to consider a set of
experiments I did long ago, in 1997, when I tested some simplified aspects of language
development on a class of neural networks that were then popular in cognitive science.
The 1997-vintage networks were, to be sure, simpler than current models — they used no
more than three layers (input nodes connected to hidden nodes connected to output
nodes), and lacked LeCun’s powerful convolution technique. But they were driven by
backpropagation just as today’s systems are, and just as beholden to their training data.

In language, the name of the game is generalization — once I hear a sentence like John
pilked a football to Mary, I can infer that it is also grammatical to say John pilked Mary the
football, and Eliza pilked the ball to Alec; equally if I can infer what the word pilk means,
I can infer what the latter sentences would mean, even if I had not heard them before.

Distilling the broad-ranging problems of language down to a simple example that I
believe still has resonance now, I ran a series of experiments in which I trained three-
layer perceptrons (fully connected in today’s technical parlance, with no convolution) on
the identity function, f(x) = x, e.g., f(12) = 12.

Training examples were represented by a set of input nodes (and corresponding output
nodes) that represented numbers in terms of binary digits. The number 7 for example,
would be represented by turning on the input (and output) nodes representing 4, 2, and 1.
As a test of generalization, I trained the network on various sets of even numbers, and
tested it on all possible inputs, both odd and even.

Every time I ran the experiment, using a wide variety of parameters, the results were the
same: the network would (unless it got stuck in a local minimum) correctly apply the
identity function to the even numbers that it had seen before (say 2, 4, 8 and 12), and to
some other even numbers (say 6 and 14) but fail on all the odd numbers, yielding, for
example f(15) = 14.
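
A minimal sketch of this kind of experiment follows. It is not a reconstruction of the original 1997 simulations: the hidden-layer size, learning rate, number of epochs, random seed, and squared-error loss are all assumptions chosen for illustration. What the sketch does show reliably is one mechanical reason for the failure: because the lowest-order input bit is 0 in every training example, gradient descent never updates the weights attached to it, so the network receives no training signal about what that bit means.

```python
import numpy as np

N_BITS = 4  # numbers 0..15, most significant bit first


def to_bits(n):
    """Encode an integer as a vector of N_BITS binary digits (MSB first)."""
    return np.array([(n >> i) & 1 for i in reversed(range(N_BITS))], dtype=float)


def from_bits(bits):
    """Decode a vector of (thresholded) bits back into an integer."""
    return int(sum(int(b) << i for i, b in zip(reversed(range(N_BITS)), bits)))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n_hidden = 8  # assumed size; not taken from the original experiments

# Small random initial weights for a perceptron with one hidden layer.
W1 = rng.normal(0.0, 0.1, size=(N_BITS, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(n_hidden, N_BITS))
b2 = np.zeros(N_BITS)
W1_initial = W1.copy()

# Train on even numbers only: the lowest-order input bit is always 0.
train_numbers = [2, 4, 8, 12]
X = np.stack([to_bits(n) for n in train_numbers])
Y = X.copy()  # identity function: the target equals the input

learning_rate = 0.5
for _ in range(5000):
    H = sigmoid(X @ W1 + b1)                 # hidden activations
    out = sigmoid(H @ W2 + b2)               # output activations
    d_out = (out - Y) * out * (1 - out)      # squared-error gradient at the outputs
    d_hid = (d_out @ W2.T) * H * (1 - H)     # backpropagated to the hidden layer
    W2 -= learning_rate * H.T @ d_out
    b2 -= learning_rate * d_out.sum(axis=0)
    W1 -= learning_rate * X.T @ d_hid
    b1 -= learning_rate * d_hid.sum(axis=0)


def predict(n):
    """Run the trained network on n and threshold the outputs at 0.5."""
    out = sigmoid(sigmoid(to_bits(n) @ W1 + b1) @ W2 + b2)
    return from_bits(out > 0.5)


for n in range(2 ** N_BITS):
    print(n, "->", predict(n))

# The weights leaving the lowest-order input bit never moved during training.
print(np.array_equal(W1[-1], W1_initial[-1]))
```

Whether a particular run reproduces a specific error such as f(15) = 14 depends on the initialization and hyperparameters, so the printed mapping should be read as one sample rather than a guaranteed outcome.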

In general, the neural nets I tested could learn their training examples, and interpolate to a
set of test examples that were in a cloud of points around those examples in n-
dimensional space (which I dubbed the training space), but they could not extrapolate
beyond that training space.

Odd numbers were outside the training space, and the networks could not generalize
identity outside that space.12 Adding more hidden units didn’t help, and nor did adding
more hidden layers. Simple multilayer perceptrons simply couldn’t generalize outside
their training space (Marcus, 1998a; Marcus, 1998b; Marcus, 2001). (Chollet makes quite
similar points in the closing chapters of his (Chollet, 2017) text.)

What we have seen in this paper is that challenges in generalizing beyond a space of
training examples persist in current deep learning networks, nearly two decades later.
Many of the problems reviewed in this paper — the data hungriness, the vulnerability to
fooling, the problems in dealing with open-ended inference and transfer — can be seen
as extensions of this fundamental problem. Contemporary neural networks do well on
challenges that remain close to their core training data, but start to break down on cases
further out in the periphery.

12 Of course, the network had never seen an odd number before, but pretraining the network on odd numbers in a
different context didn’t help. And of course people, in contrast, readily generalize to novel words immediately upon
hearing them. Likewise, the experiments I did with seven-month-olds consisted entirely of novel words.

The widely-adopted addition of convolution guarantees that one particular class of
problems that are akin to my identity problem can be solved: so-called translational
invariances, in which an object retains its identity when it is shifted to a new location. But the
solution is not general, as for example Lake’s recent demonstrations show. (Data
augmentation offers another way of dealing with deep learning’s challenges in
extrapolation, by trying to broaden the space of training examples itself, but such
techniques are more useful in 2D vision than in language.)

As yet there is no general solution within deep learning to the problem of generalizing
outside the training space. And it is for that reason, more than any other, that we need to
look to different kinds of solutions if we want to reach artificial general intelligence.

4. Potential risks of excessive hype


One of the biggest risks in the current overhyping of AI is another AI winter, such as the
one that devastated the field in the 1970s, after the Lighthill report (Lighthill, 1973)
suggested that AI was too brittle, too narrow and too superficial to be used in practice.
Although there are vastly more practical applications of AI now than there were in the
1970s, hype is still a major concern. When a high-profile figure like Andrew Ng writes in
the Harvard Business Review promising a degree of imminent automation that is out of
step with reality, there is fresh risk for seriously dashed expectations. Machines cannot in
fact do many things that ordinary humans can do in a second, ranging from reliably
comprehending the world to understanding sentences. No healthy human being would
ever mistake a turtle for a rifle or a parking sign for a refrigerator.

Executives investing massively in AI may turn out to be disappointed, especially given
the poor state of the art in natural language understanding. Already, some major projects
have been largely abandoned, like Facebook’s M project, which was launched in August
2015 with much publicity13 as a general purpose personal assistant, and then later
downgraded to a significantly smaller role, helping users with a vastly smaller range of
well-defined tasks such as calendar entry.

It is probably fair to say that chatbots in general have not lived up to the hype they
received a couple years ago. If, for example, driverless cars should also disappoint,
relative to their early hype, by proving unsafe when rolled out at scale, or simply not
achieving full autonomy after many promises, the whole field of AI could be in for a
sharp downturn, both in popularity and funding. We already may be seeing hints of this,

13 https://www.wired.com/2015/08/how-facebook-m-works/

as in a just published Wired article14 that was entitled “After peak hype, self-driving cars
enter the trough of disillusionment.”

There are other serious fears, too, and not just of the apocalyptic variety (which for now
still seem to be the stuff of science fiction). My own largest fear is that the field of AI
could get trapped in a local minimum, dwelling too heavily in the wrong part of
intellectual space, focusing too much on the detailed exploration of a particular class of
accessible but limited models that are geared around capturing low-hanging fruit —
potentially neglecting riskier excursions that might ultimately lead to a more robust path.

I am reminded of Peter Thiel’s famous (if now slightly outdated) damning of an often
too-narrowly focused tech industry: “We wanted flying cars, instead we got 140
characters”. I still dream of Rosie the Robot, a full-service domestic robot that takes care of
my home; but for now, six decades into the history of AI, our bots do little more than play
music, sweep floors, and bid on advertisements.

If we didn’t make more progress, it would be a shame. AI comes with risk, but also great
potential rewards. AI’s greatest contributions to society, I believe, could and should
ultimately come in domains like automated scientific discovery, leading among other
things towards vastly more sophisticated versions of medicine than are currently possible.
But to get there we need to make sure that the field as a whole doesn’t first get stuck in a
local minimum.

5. What would be better?


Despite all of the problems I have sketched, I don’t think that we need to abandon deep
learning.

Rather, we need to reconceptualize it: not as a universal solvent, but simply as one tool
among many, a power screwdriver in a world in which we also need hammers, wrenches,
and pliers, not to mention chisels and drills, voltmeters, logic probes, and oscilloscopes.

In perceptual classification, where vast amounts of data are available, deep learning is a
valuable tool; in other, richer cognitive domains, it is often far less satisfactory.

The question is, where else should we look? Here are four possibilities.

14 https://www.wired.com/story/self-driving-cars-challenges/

5.1. Unsupervised learning

In interviews, deep learning pioneers Geoff Hinton and Yann LeCun have both recently
pointed to unsupervised learning as one key way in which to go beyond supervised, data-
hungry versions of deep learning.

To be clear, deep learning and unsupervised learning are not in logical opposition. Deep
learning has mostly been used in a supervised context with labeled data, but there are
ways of using deep learning in an unsupervised fashion. But there are certainly reasons in
many domains to move away from the massive demands on data that supervised deep
learning typically makes.

Unsupervised learning, as the term is commonly used, tends to refer to several kinds of
systems. One common type of system “clusters” together inputs that share properties,
even without having them explicitly labeled. Google’s cat detector model (Le et al., 2012)
is perhaps the most publicly prominent example of this sort of approach.

Another approach, advocated by researchers such as Yann LeCun (Luc, Neverova, Couprie,
Verbeek, & LeCun, 2017), and not mutually exclusive with the first, is to replace labeled
data sets with things like movies that change over time. The intuition is that systems
trained on videos can use each pair of successive frames as a kind of ersatz teaching
signal, in which the goal is to predict the next frame; frame t becomes a predictor for
frame t+1, without the need for any human labeling.
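
The idea of using successive frames as a teaching signal can be sketched in a few lines. The function name `next_frame_pairs` and the toy array below are hypothetical and serve only to illustrate how the training pairs are formed; no model is trained here, and this is not the method of any particular cited system.

```python
import numpy as np


def next_frame_pairs(video):
    """Split a video of shape (T, H, W) into (input, target) pairs.

    Frame t is the input and frame t+1 is the target, so the "labels"
    come from the data itself rather than from human annotators.
    """
    inputs = video[:-1]   # frames 0 .. T-2
    targets = video[1:]   # frames 1 .. T-1
    return inputs, targets


# A toy "video": 5 frames of 2x2 pixels with made-up values.
video = np.arange(5 * 2 * 2, dtype=float).reshape(5, 2, 2)
inputs, targets = next_frame_pairs(video)

print(inputs.shape, targets.shape)
```

A video of T frames yields T - 1 such pairs, which is why this approach removes the labeling cost without necessarily reducing the overall appetite for data.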

My view is that both of these approaches are useful (and so are some others not discussed
here), but that neither inherently solves the sorts of problems outlined in section 3. One is
still left with data-hungry systems that lack explicit variables, and I see no advance there
towards open-ended inference, interpretability or debuggability.

That said, there is a different notion of unsupervised learning, less discussed, which I find
deeply interesting: the kind of unsupervised learning that human children do. Children
often set themselves a novel task, like creating a tower of Lego bricks or climbing
through a small aperture, as my daughter recently did in climbing through a chair, in the
space between the seat and the chair back. Often, this sort of exploratory problem
solving involves (or at least appears to involve) a good deal of autonomous goal setting
(what should I do?) and high level problem solving (how do I get my arm through the
chair, now that the rest of my body has passed through?), as well as the integration of
abstract knowledge (how bodies work, what sorts of apertures and affordances various
objects have, and so forth). If we could build systems that could set their own goals and
do reasoning and problem-solving at this more abstract level, major progress might
quickly follow.

5.2. Symbol-manipulation, and the need for hybrid models

Another place that we should look is towards classic, “symbolic” AI, sometimes referred
to as GOFAI (Good Old-Fashioned AI). Symbolic AI takes its name from the idea, central
to mathematics, logic, and computer science, that abstractions can be represented by
symbols. Equations like f = ma allow us to calculate outputs for a wide range of inputs,
irrespective of whether we have seen any particular values before; lines in computer
programs do the same thing (if the value of variable x is greater than the value of variable
y, perform action a).
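
The point about abstraction over variables can be made concrete with a small Python sketch. The function names force and choose_action are illustrative assumptions rather than part of any system discussed here; the sketch only shows that a rule stated over variables applies to values it has never encountered.

```python
def force(mass, acceleration):
    """f = ma: defined for any inputs, not just previously seen ones."""
    return mass * acceleration


def choose_action(x, y, action_a, default=None):
    """If the value of x is greater than the value of y, perform action a."""
    return action_a if x > y else default


# The rule holds for arbitrary values; nothing has to be learned from examples.
print(force(2, 3))
print(force(12345, 2))
print(choose_action(10, 7, "a"))
print(choose_action(1, 7, "a"))
```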

By themselves, symbolic systems have often proven to be brittle, but they were largely
developed in an era with vastly less data and computational power than we have now. The
right move today may be to integrate deep learning, which excels at perceptual
classification, with symbolic systems, which excel at inference and abstraction. One
might think of such a potential merger by analogy to the brain; perceptual input systems,
like primary sensory cortex, seem to do something like what deep learning does, but there
are other areas, like Broca’s area and prefrontal cortex, that seem to operate at a much
higher level of abstraction. The power and flexibility of the brain comes in part from its
capacity to dynamically integrate many different computations in real-time. The process
of scene perception, for instance, seamlessly integrates direct sensory information with
complex abstractions about objects and their properties, lighting sources, and so forth.

Some tentative steps towards integration already exist, including neurosymbolic
modeling (Besold et al., 2017) and a recent trend towards systems such as differentiable
neural computers (Graves et al., 2016), programming with differentiable interpreters
(Bošnjak, Rocktäschel, Naradowsky, & Riedel, 2016), and neural programming with
discrete operations (Neelakantan, Le, Abadi, McCallum, & Amodei, 2016). While none
of this work has yet fully scaled towards anything like full-service artificial general
intelligence, I have long argued (Marcus, 2001) that more work on integrating
microprocessor-like operations into neural networks could be extremely valuable.

To the extent that the brain might be seen as consisting of “a broad array of reusable
computational primitives—elementary units of processing akin to sets of basic
instructions in a microprocessor—perhaps wired together in parallel, as in the
reconfigurable integrated circuit type known as the field-programmable gate array”, as I
have argued elsewhere (Marcus, Marblestone, & Dean, 2014), steps towards enriching the
instruction set out of which our computational systems are built can only be a good thing.

5.3. More insight from cognitive and developmental psychology

Another potentially valuable place to look is human cognition (Davis & Marcus, 2015;
Lake et al., 2016; Marcus, 2001; Pinker & Prince, 1988). There is no need for machines
to literally replicate the human mind, which is, after all, deeply error prone, and far from
perfect. But there remain many areas, from natural language understanding to
commonsense reasoning, in which humans still retain a clear advantage; learning the
mechanisms underlying those human strengths could lead to advances in AI, even if the
goal is not, and should not be, an exact replica of the human brain.

For many people, learning from humans means neuroscience; in my view, that may be
premature. We don’t yet know enough about neuroscience to literally reverse engineer the
brain, per se, and may not for several decades, possibly until AI itself gets better. AI can
help us to decipher the brain, rather than the other way around.

Either way, in the meantime, it should certainly be possible to use techniques and insights
drawn from cognitive and developmental psychology, now, in order to build more
robust and comprehensive artificial intelligence, building models that are motivated not
just by mathematics but also by clues from the strengths of human psychology.

A good starting point might be first to try to understand the innate machinery in human
minds, as a source of hypotheses into mechanisms that might be valuable in developing
artificial intelligences; in a companion article to this one (Marcus, in prep) I summarize a
number of possibilities, some drawn from my own earlier work (Marcus, 2001) and
others from Elizabeth Spelke’s (Spelke & Kinzler, 2007). Those drawn from my own
work focus on how information might be represented and manipulated, such as by
symbolic mechanisms for representing variables and distinctions between kinds and
individuals from a class; those drawn from Spelke focus on how infants might represent
notions such as space, time, and object.

A second focal point might be on common sense knowledge, both in how it develops
(some might be part of our innate endowment, much of it is learned), how it is
represented, and how it is integrated on line in the process of our interactions with the
real world (Davis & Marcus, 2015). Recent work by Lerer et al. (2016), Watters and
colleagues (2017), Tenenbaum and colleagues (Wu, Lu, Kohli, Freeman, & Tenenbaum,
2017) and Davis and myself (Davis, Marcus, & Frazier-Logue, 2017) suggests some
competing approaches to how to think about this, within the domain of everyday physical
reasoning.

A third focus might be on human understanding of narrative, a notion long ago suggested
by Roger Schank and Abelson (1977) and due for a refresh (Marcus, 2014; Kočiský et al.,
2017).

5.4. Bolder challenges
Whether deep learning persists in current form, morphs into something new, or gets
replaced altogether, one might consider a variety of challenge problems that push systems
to move beyond what can be learned in supervised learning paradigms with large
datasets. Drawing in part from a recent special issue of AI Magazine devoted to
moving beyond the Turing Test that I edited with Francesca Rossi and Manuela Veloso
(Marcus, Rossi, & Veloso, 2016), here are a few suggestions:

• A comprehension challenge (Paritosh & Marcus, 2016; Kočiský et al., 2017), which
would require a system to watch an arbitrary video (or read a text, or listen to a
podcast) and answer open-ended questions about what is contained therein. (Who is the
protagonist? What is their motivation? What will happen if the antagonist succeeds in
her mission?) No specific supervised training set can cover all the possible
contingencies; inference and real-world knowledge integration are necessities.
• Scientific reasoning and understanding, as in the Allen AI Institute’s 8th grade science
challenge (Schoenick, Clark, Tafjord, Turney, & Etzioni, 2017; Davis, 2016). While the
answers to many basic science questions can simply be retrieved from web searches,
others require inference beyond what is explicitly stated, and the integration of general
knowledge.
• General game playing (Genesereth, Love, & Pell, 2005), with transfer between games
(Kansky et al., 2017), such that, for example, learning about one first-person shooter
enhances performance on another with entirely different images, equipment and so
forth. (A system that can learn many games, separately, without transfer between them,
such as DeepMind’s Atari game system, would not qualify; the point is to acquire
cumulative, transferrable knowledge).
• A physically embodied test: an AI-driven robot that could build things (Ortiz Jr, 2016),
ranging from tents to IKEA shelves, based on instructions and real-world physical
interactions with the objects’ parts, rather than vast amounts of trial-and-error.

No one challenge is likely to be sufficient. Natural intelligence is multi-dimensional
(Gardner, 2011), and given the complexity of the world, generalized artificial intelligence
will necessarily be multi-dimensional as well.

By pushing beyond perceptual classification and into a broader integration of inference
and knowledge, artificial intelligence will advance, greatly.

6. Conclusions
As a measure of progress, it is worth considering a somewhat pessimistic piece I wrote
for The New Yorker five years ago15, conjecturing that “deep learning is only part of the
larger challenge of building intelligent machines” because “such techniques lack ways of
representing causal relationships (such as between diseases and their symptoms), and are
likely to face challenges in acquiring abstract ideas like “sibling” or “identical to.” They
have no obvious ways of performing logical inferences, and they are also still a long way
from integrating abstract knowledge, such as information about what objects are, what
they are for, and how they are typically used.”

As we have seen, many of these concerns remain valid, despite major advances in
specific domains like speech recognition, machine translation, and board games, and
despite equally impressive advances in infrastructure and the amount of data and compute
available.

Intriguingly, in the last year, a growing array of other scholars, coming from an
impressive range of perspectives, have begun to emphasize similar limits. A partial list
includes Brenden Lake and Marco Baroni (2017), François Chollet (2017), Robin Jia and
Percy Liang (2017), Dileep George and others at Vicarious (Kansky et al., 2017) and
Pieter Abbeel and colleagues at Berkeley (Stoica et al., 2017).

Perhaps most notably of all, Geoff Hinton has been courageous enough to reconsider his
own beliefs, revealing in an August interview with the news site Axios 16 that he is
“deeply suspicious” of back-propagation, a key enabler of deep learning that he helped
pioneer, because of his concern about its dependence on labeled data sets.

Instead, he suggested (in Axios’ paraphrase) that “entirely new methods will probably
have to be invented.”

I share Hinton’s excitement in seeing what comes next.

15 https://www.newyorker.com/news/news-desk/is-deep-learning-a-revolution-in-artificial-intelligence
16 https://www.axios.com/ai-pioneer-advocates-starting-over-2485537027.html

References
Athalye, A., Engstrom, L., Ilyas, A., & Kwok, K. (2017). Synthesizing Robust Adversarial
Examples. arXiv, cs.CV.
Besold, T. R., Garcez, A. D., Bader, S., Bowman, H., Domingos, P., Hitzler, P. et al. (2017).
Neural-Symbolic Learning and Reasoning: A Survey and Interpretation. arXiv, cs.AI.
Bošnjak, M., Rocktäschel, T., Naradowsky, J., & Riedel, S. (2016). Programming with a
Differentiable Forth Interpreter. arXiv.
Bottou, L. (2015). Two big challenges in machine learning. Proceedings from 32nd International
Conference on Machine Learning.
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for
learning natural language inference. arXiv, cs.CL.
Chollet, F. (2017). Deep Learning with Python. Manning Publications.
Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network
for traffic sign classification. Neural networks.
Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in
artificial intelligence. Communications of the ACM, 58(9), 92-103.
Davis, E. (2016). How to Write Science Questions that Are Easy for People and Hard for
Computers. AI magazine, 37(1), 13-22.
Davis, E., Marcus, G., & Frazier-Logue, N. (2017). Commonsense reasoning about containers
using radically incomplete information. Artificial Intelligence, 248, 46-84.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A
large-scale hierarchical image database. Proceedings from Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on.
Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2), 179-211.
Evtimov, I., Eykholt, K., Fernandes, E., Kohno, T., Li, B., Prakash, A. et al. (2017). Robust
Physical-World Attacks on Deep Learning Models. arXiv, cs.CR.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: a critical
analysis. Cognition, 28(1-2), 3-71.
Gardner, H. (2011). Frames of mind: The theory of multiple intelligences. Basic books.
Gelman, S. A., Leslie, S. J., Was, A. M., & Koch, C. M. (2015). Children’s interpretations of
general quantifiers, specific quantifiers, and generics. Lang Cogn Neurosci, 30(4),
448-461.
Genesereth, M., Love, N., & Pell, B. (2005). General game playing: Overview of the AAAI
competition. AI magazine, 26(2), 62.
George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla, M., Laan, C., Marthi, B. et al. (2017). A
generative vision model that trains with high data efficiency and breaks text-based
CAPTCHAs. Science, 358(6368).
Gervain, J., Berent, I., & Werker, J. F. (2012). Binding at birth: the newborn brain detects identity
relations and sequential position in speech. J Cogn Neurosci, 24(3), 564-574.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A. et al.
(2016). Hybrid computing using a neural network with dynamic external memory. Nature,
538(7626), 471-476.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2017). Deep
Reinforcement Learning that Matters. arXiv, cs.LG.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., & Abbeel, P. (2017). Adversarial Attacks on
Neural Network Policies. arXiv, cs.LG.
Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension
Systems. arXiv.
Kahneman, D. (2013). Thinking, fast and slow (1st pbk. ed.). New York: Farrar, Straus and
Giroux.
Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X. et al. (2017).
Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive
Physics. arXiv, cs.AI.
Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G. et al. (2017). The
NarrativeQA Reading Comprehension Challenge. arXiv, cs.CL.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning
through probabilistic program induction. Science, 350(6266), 1332-1338.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building Machines
That Learn and Think Like People. Behav Brain Sci, 1-101.
Lake, B. M., & Baroni, M. (2017). Still not systematic after all these years: On the compositional
skills of sequence-to-sequence recurrent networks. arXiv.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). Big data. The parable of Google Flu:
traps in big data analysis. Science, 343(6176), 1203-1205.
Le, Q. V., Ranzato, M.-A., Monga, R., Devin, M., Chen, K., Corrado, G. et al. (2012). Building
high-level features using large scale unsupervised learning. Proceedings from International
Conference on Machine Learning.
LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4.
Lerer, A., Gross, S., & Fergus, R. (2016). Learning Physical Intuition of Block Towers by
Example. arXiv, cs.AI.
Lighthill, J. (1973). Artificial Intelligence: A General Survey. Artificial Intelligence: a paper
symposium.
Lipton, Z. C. (2016). The Mythos of Model Interpretability. arXiv, cs.LG.
Lopez-Paz, D., Nishihara, R., Chintala, S., Schölkopf, B., & Bottou, L. (2017). Discovering
causal signals in images. Proceedings from Proceedings of Computer Vision and Pattern
Recognition (CVPR).
Luc, P., Neverova, N., Couprie, C., Verbeek, J., & LeCun, Y. (2017). Predicting Deeper into the
Future of Semantic Segmentation. International Conference on Computer Vision (ICCV
2017).

Marcus, G., Rossi, F., & Veloso, M. (2016). Beyond the Turing Test. AI
Magazine, Whole issue.
Marcus, G., Marblestone, A., & Dean, T. (2014). The atoms of neural computation. Science,
346(6209), 551-552.
Marcus, G. (in prep). Innateness, AlphaZero, and Artificial Intelligence.
Marcus, G. (2014). What Comes After the Turing Test? The New Yorker.
Marcus, G. (2012). Is “Deep Learning” a Revolution in Artificial Intelligence? The New Yorker.
Marcus, G. F. (2008). Kluge : the haphazard construction of the human mind. Boston: Houghton
Mifflin.
Marcus, G. F. (2001). The Algebraic Mind: Integrating Connectionism and cognitive
science. Cambridge, Mass.: MIT Press.
Marcus, G. F. (1998a). Rethinking eliminative connectionism. Cogn Psychol, 37(3), 243-282.
Marcus, G. F. (1998b). Can connectionism save constructivism? Cognition, 66(2), 153-182.
Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., & Xu, F. (1992).
Overregularization in language acquisition. Monogr Soc Res Child Dev, 57(4), 1-182.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by
seven-month-old infants. Science, 283(5398), 77-80.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word
Representations in Vector Space. arXiv.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G. et al. (2015).
Human-level control through deep reinforcement learning. Nature, 518(7540),
529-533.
Neelakantan, A., Le, Q. V., Abadi, M., McCallum, A., & Amodei, D. (2016). Learning a Natural
Language Interface with Neural Programmer. arXiv.
Ng, A. (2016). What Artificial Intelligence Can and Can’t Do Right Now. Harvard Business
Review.
Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., & Yosinski, J. (2016). Plug & Play
Generative Networks: Conditional Iterative Generation of Images in Latent Space. arXiv,
cs.CV.
Nguyen, A., Yosinski, J., & Clune, J. (2014). Deep Neural Networks are Easily Fooled: High
Confidence Predictions for Unrecognizable Images. arXiv, cs.CV.
Norvig, P. (2016). State-of-the-Art AI: Building Tomorrow’s Intelligent Systems. Proceedings
from EmTech Digital, San Francisco.
O’Neil, C. (2016). Weapons of math destruction : how big data increases inequality and threatens
democracy.
Ortiz Jr, C. L. (2016). Why we need a physically embodied Turing test and what it might look
like. AI magazine, 37(1), 55-63.
Paritosh, P., & Marcus, G. (2016). Toward a comprehension challenge, using crowdsourcing as a
tool. AI Magazine, 37(1), 23-31.
Pearl, J. (2000). Causality: models, reasoning, and inference. Cambridge, U.K.; New York:
Cambridge University Press.

Pinker, S., & Prince, A. (1988). On language and connectionism: analysis of a parallel distributed
processing model of language acquisition. Cognition, 28(1-2), 73-193.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the
Predictions of Any Classifier. arXiv, cs.LG.
Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic Routing Between Capsules. arXiv,
cs.CV.
Samek, W., Wiegand, T., & Müller, K.-R. (2017). Explainable Artificial Intelligence:
Understanding, Visualizing and Interpreting Deep Learning Models. arXiv, cs.AI.
Schank, R. C., & Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding: an Inquiry into
Human Knowledge Structures. Hillsdale, NJ: L. Erlbaum.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks.
Schoenick, C., Clark, P., Tafjord, O., Turney, P., & Etzioni, O. (2017). Moving beyond the Turing Test
with the Allen AI Science Challenge. Communications of the ACM, 60(9), 60-64.
Sculley, D., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2014). Machine learning: The
high-interest credit card of technical debt. Proceedings from SE4ML: Software
Engineering for Machine Learning (NIPS 2014 Workshop).
Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through
recursive matrix-vector spaces. Proceedings from Proceedings of the 2012 joint conference
on empirical methods in natural language processing and computational natural language
learning.
Spelke, E. S., & Kinzler, K. D. (2007). Core knowledge. Dev Sci, 10(1), 89-96.
Stoica, I., Song, D., Popa, R. A., Patterson, D., Mahoney, M. W., Katz, R. et al. (2017). A
Berkeley View of Systems Challenges for AI. arXiv, cs.AI.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. et al. (2013).
Intriguing properties of neural networks. arXiv, cs.CV.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2014). Show and Tell: A Neural Image Caption
Generator. arXiv, cs.CV.
Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P., & Zoran, D. (2017). Visual
Interaction Networks. arXiv.
Williams, A., Nangia, N., & Bowman, S. R. (2017). A Broad-Coverage Challenge Corpus for
Sentence Understanding through Inference. arXiv, cs.CL.
Wu, J., Lu, E., Kohli, P., Freeman, B., & Tenenbaum, J. (2017). Learning to See Physics via
Visual De-animation. Proceedings from Advances in Neural Information Processing
Systems.
Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2017). Learning Transferable Architectures for
Scalable Image Recognition. arXiv, cs.CV.


Recent Advances in Recurrent Neural Networks


Hojjat Salehinejad, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee

H. Salehinejad is with the Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada, and Department of Medical Imaging, St. Michael’s Hospital, University of Toronto, Toronto, Canada, e-mail: salehinejadh@smh.ca.
S. Sankar is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada, e-mail: sdsankar@edu.uwaterloo.ca.
J. Barfett and E. Colak are with the Department of Medical Imaging, St. Michael’s Hospital, University of Toronto, Toronto, Canada, e-mail: {barfettj,colake}@smh.ca.
S. Valaee is with the Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada, e-mail: valaee@ece.utoronto.ca.

arXiv:1801.01078v3 [cs.NE] 22 Feb 2018

Abstract—Recurrent neural networks (RNNs) are capable of learning features and long term dependencies from sequential and time-series data. The RNNs have a stack of non-linear units where at least one connection between units forms a directed cycle. A well-trained RNN can model any dynamical system; however, training RNNs is mostly plagued by issues in learning long-term dependencies. In this paper, we present a survey on RNNs and several new advances for newcomers and professionals in the field. The fundamentals and recent advances are explained and the research challenges are introduced.

Index Terms—Deep learning, long-term dependency, recurrent neural networks, time-series analysis.

I. INTRODUCTION

Artificial neural networks (ANNs) are made from layers of connected units called artificial neurons. A “shallow network” refers to an ANN with one input layer, one output layer, and at most one hidden layer without a recurrent connection. As the number of layers increases, the complexity of network increases too. More number of layers or recurrent connections generally increases the depth of the network and empowers it to provide various levels of data representation and feature extraction, referred to as “deep learning”. In general, these networks are made from nonlinear but simple units, where the higher layers provide a more abstract representation of data and suppresses unwanted variability [1]. Due to optimization difficulties caused by composition of the nonlinearity at each layer, not much work occurred on deep network architectures before significant advances in 2006 [2], [3]. ANNs with recurrent connections are called recurrent neural networks (RNNs), which are capable of modelling sequential data for sequence recognition and prediction [4]. RNNs are made of high dimensional hidden states with non-linear dynamics [5]. The structure of hidden states work as the memory of the network and state of the hidden layer at a time is conditioned on its previous state [6]. This structure enables the RNNs to store, remember, and process past complex signals for long time periods. RNNs can map an input sequence to the output sequence at the current timestep and predict the sequence in the next timestep.

A large number of papers are published in the literature based on RNNs, from architecture design to applications development. In this paper, we focus on discussing discrete-time RNNs and recent advances in the field. Some of the major advances in RNNs through time are listed in Table I. The development of back-propagation using gradient descent (GD) has provided a great opportunity for training RNNs. This simple training approach has accelerated practical achievements in developing RNNs [5]. However, it comes with some challenges in modelling long-term dependencies such as vanishing and exploding gradient problems, which are discussed in this paper.

TABLE I: Some of the major advances in recurrent neural networks (RNNs) at a glance.

Year  First Author   Contribution
1990  Elman          Popularized simple RNNs (Elman network)
1993  Doya           Teacher forcing for gradient descent (GD)
1994  Bengio         Difficulty in learning long term dependencies with gradient descent
1997  Hochreiter     LSTM: long-short term memory for vanishing gradients problem
1997  Schuster       BRNN: Bidirectional recurrent neural networks
1998  LeCun          Hessian matrix approach for vanishing gradients problem
2000  Gers           Extended LSTM with forget gates
2001  Goodman        Classes for fast Maximum entropy training
2005  Morin          A hierarchical softmax function for language modeling using RNNs
2005  Graves         BLSTM: Bidirectional LSTM
2007  Jaeger         Leaky integration neurons
2007  Graves         MDRNN: Multi-dimensional RNNs
2009  Graves         LSTM for hand-writing recognition
2010  Mikolov        RNN based language model
2010  Nair           Rectified linear unit (ReLU) for vanishing gradient problem
2011  Martens        Learning RNN with Hessian-free optimization
2011  Mikolov        RNN by back-propagation through time (BPTT) for statistical language modeling
2011  Sutskever      Hessian-free optimization with structural damping
2011  Duchi          Adaptive learning rates for each weight
2012  Gutmann        Noise-contrastive estimation (NCE)
2012  Mnih           NCE for training neural probabilistic language models (NPLMs)
2012  Pascanu        Avoiding exploding gradient problem by gradient clipping
2013  Mikolov        Negative sampling instead of hierarchical softmax
2013  Sutskever      Stochastic gradient descent (SGD) with momentum
2013  Graves         Deep LSTM RNNs (Stacked LSTM)
2014  Cho            Gated recurrent units
2015  Zaremba        Dropout for reducing overfitting
2015  Mikolov        Structurally constrained recurrent network (SCRN) to enhance learning longer memory for vanishing gradient problem
2015  Visin          ReNet: A RNN-based alternative to convolutional neural networks
2015  Gregor         DRAW: Deep recurrent attentive writer
2015  Kalchbrenner   Grid long-short term memory
2015  Srivastava     Highway network
2017  Jing           Gated orthogonal recurrent units


The rest of the paper is organized as follows. The fundamentals of RNNs are presented in Section II. Methods for training RNNs are discussed in Section III and a variety of RNNs architectures are presented in Section IV. The regularization methods for training RNNs are discussed in Section V. Finally, a brief survey on major applications of RNN in signal processing is presented in Section VI.

II. A SIMPLE RECURRENT NEURAL NETWORK

RNNs are a class of supervised machine learning models, made of artificial neurons with one or more feedback loops [7]. The feedback loops are recurrent cycles over time or sequence (we call it time throughout this paper) [8], as shown in Figure 1. Training a RNN in a supervised fashion requires a training dataset of input-target pairs. The objective is to minimize the difference between the output and target pairs (i.e., the loss value) by optimizing the weights of the network.

Fig. 1: A simple recurrent neural network (RNN) and its unfolded structure through time t. Each arrow shows a full connection of units between the layers. To keep the figure simple, biases are not shown. Panels: (a) Folded RNN; (b) Unfolded RNN through time.

A. Model Architecture

A simple RNN has three layers which are input, recurrent hidden, and output layers, as presented in Figure 1a. The input layer has N input units. The inputs to this layer is a sequence of vectors through time t such as $\{..., x_{t-1}, x_t, x_{t+1}, ...\}$, where $x_t = (x_1, x_2, ..., x_N)$. The input units in a fully connected RNN are connected to the hidden units in the hidden layer, where the connections are defined with a weight matrix $W_{IH}$. The hidden layer has M hidden units $h_t = (h_1, h_2, ..., h_M)$, that are connected to each other through time with recurrent connections, Figure 1b. The initialization of hidden units using small non-zero elements can improve overall performance and stability of the network [9]. The hidden layer defines the state space or “memory” of the system as

$$h_t = f_H(o_t), \tag{1}$$

where

$$o_t = W_{IH} x_t + W_{HH} h_{t-1} + b_h, \tag{2}$$

$f_H(\cdot)$ is the hidden layer activation function, and $b_h$ is the bias vector of the hidden units. The hidden units are connected to the output layer with weighted connections $W_{HO}$. The output layer has P units $y_t = (y_1, y_2, ..., y_P)$ that are computed as

$$y_t = f_O(W_{HO} h_t + b_o) \tag{3}$$

where $f_O(\cdot)$ is the activation functions and $b_o$ is the bias vector in the output layer. Since the input-target pairs are sequential through time, the above steps are repeated consequently over time $t = (1, ..., T)$. The Eqs. (1) and (3) show a RNN is consisted of certain non-linear state equations, which are iterable through time. In each timestep, the hidden states provide a prediction at the output layer based on the input vector. The hidden state of a RNN is a set of values, which apart from the effect of any external factors, summarizes all the unique necessary information about the past states of the network over many timesteps. This integrated information can define future behaviour of the network and make accurate predictions at the output layer [5]. A RNN uses a simple nonlinear activation function in every unit. However, such simple structure is capable of modelling rich dynamics, if it is well trained through timesteps.

B. Activation Function

For linear networks, multiple linear hidden layers act as a single linear hidden layer [10]. Nonlinear functions are more powerful than linear functions as they can draw nonlinear boundaries. The nonlinearity in one or successive hidden layers in a RNN is the reason for learning input-target relationships.

Some of the most popular activation functions are presented in Figure 2. The “sigmoid”, “tanh”, and rectified linear unit (ReLU) have received more attention than the other activation functions recently. The “sigmoid” is a common choice, which takes a real-value and squashes it to the range [0, 1]. This activation function is normally used in the output layer, where a cross-entropy loss function is used for training a classification model. The “tanh” and “sigmoid” activation functions are defined as

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1} \tag{4}$$

and

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \tag{5}$$

respectively. The “tanh” activation function is in fact a scaled “sigmoid” activation function such as

$$\sigma(x) = \frac{\tanh(x/2) + 1}{2}. \tag{6}$$

ReLU is another popular activation function, which is open-ended for positive input values [3], defined as

$$y(x) = \max(x, 0). \tag{7}$$

Selection of the activation function is mostly dependent on the problem and nature of the data. For example, “sigmoid” is suitable for networks where the output is in the range [0, 1]. However, the “tanh” and “sigmoid” activation functions saturate the neuron very fast and can vanish the gradient. Despite “tanh”, the non-zero centered output from “sigmoid”
can cause unstable dynamics in the gradient updates for the weights. The ReLU activation function leads to sparser gradients and greatly accelerates the convergence of stochastic gradient descent (SGD) compared to the “sigmoid” or “tanh” activation functions [11]. ReLU is computationally cheap, since it can be implemented by thresholding an activation value at zero. However, ReLU is not resistant against a large gradient flow and, as the weight matrix grows, the neuron may remain inactive during training.

Fig. 2: Most common activation functions, each plotted as output (out) against net input (net). (a) Linear. (b) Piecewise linear. (c) tanh(net). (d) Threshold. (e) sin(net) until saturation. (f) Sigmoid.

C. Loss Function

The loss function evaluates the performance of the network by comparing the output $y_t$ with the corresponding target $z_t$, defined as

$$L(y, z) = \sum_{t=1}^{T} L_t(y_t, z_t), \tag{8}$$

which is an overall summation of the losses in each timestep [12]. Selection of the loss function is problem dependent. Some popular loss functions are Euclidean distance and Hamming distance for forecasting of real values and cross-entropy over the probability distribution of outputs for classification problems [13].

III. TRAINING RECURRENT NEURAL NETWORK

Efficient training of an RNN is a major problem. The difficulty is in proper initialization of the weights in the network and the optimization algorithm to tune them in order to minimize the training loss. The relationship among network parameters and the dynamics of the hidden states through time causes instability [4]. A glance at the proposed methods in the literature shows that the main focus is to reduce complexity of training algorithms, while accelerating the convergence. However, generally such algorithms take a large number of iterations to train the model. Some of the approaches for training RNNs are multi-grid random search, time-weighted pseudo-Newton optimization, GD, extended Kalman filter (EKF) [15], Hessian-free, expectation maximization (EM) [16], approximated Levenberg-Marquardt [17], and global optimization algorithms. In this section, we discuss some of these methods in detail. A detailed comparison is available in [18].

A. Initialization

Initialization of weights and biases in RNNs is critical. A general rule is to assign small values to the weights. A Gaussian draw with a standard deviation of 0.001 or 0.01 is a reasonable choice [9], [19]. The biases are usually set to zero, but the output bias can also be set to a very small value [9]. However, the initialization of parameters is dependent on the task and properties of the input data such as dimensionality [9]. Setting the initial weights using prior knowledge or in a semi-supervised fashion are other approaches [4].

B. Gradient-based Learning Methods

Gradient descent (GD) is a simple and popular optimization method in deep learning. The basic idea is to adjust the weights of the model by finding the error function derivatives with respect to each member of the weight matrices in the model [4]. To minimize total loss, GD changes each weight in proportion to the derivative of the error with respect to that weight, provided that the non-linear activation functions are differentiable. GD is also known as batch GD, as it computes the gradient for the whole dataset in each optimization iteration to perform a single update as

$$\theta_{t+1} = \theta_t - \frac{\lambda}{U} \sum_{k=1}^{U} \frac{\partial L_k}{\partial \theta}, \tag{9}$$

where $U$ is the size of the training set, $\lambda$ is the learning rate, and $\theta$ is the set of parameters. This approach is computationally expensive for very large datasets and is not suitable for online training (i.e., training the models as inputs arrive).

Since an RNN is a structure through time, we need to extend GD through time to train the network, called back-propagation through time (BPTT) [20]. However, computing error-derivatives through time is difficult [21]. This is mostly due to the relationship among the parameters and the dynamics of the RNN, which is highly unstable and makes GD ineffective. Gradient-based algorithms have difficulty in capturing dependencies as the duration of dependencies increases [4]. The derivatives of the loss function with respect to the weights only consider the distance between the current output and the corresponding target, without using the history information for updating the weights [22]. RNNs cannot learn long-range temporal

TABLE II: Comparing major gradient descent (GD) methods, where N is the number of nodes in the network and O(·) is per data point. More details in [14].

| Method | Description | Advantages | Disadvantages | O(·) |
|---|---|---|---|---|
| RTRL | Computing error gradient after obtaining gradients of the network states w.r.t. weights at time t in terms of those at time t − 1 | online updating of weights; suitable for online adaption property applications | large computational complexity | O(N^4) |
| BPTT | Unfolding time iterations into layers with identical weights converts the recurrent network into an equivalent feedforward network, suitable for training with back-propagation method | computationally efficient; suitable for offline training | not practical for online training | O(N^2) |
| FFP | Recursive computing of boundary conditions of back-propagated gradients at time t = 1 | on-line technique; solving the gradient recursion forward in time, rather than backwards | more computational complexity than BPTT method | O(N^3) |
| GF | Computing the solution using the sought error gradient based on the recursive equations for the output gradients and a dot product | improving RTRL computational complexity; online method | more computational complexity than BPTT method | O(N^3) |
| BU | Updating the weights every O(N) data points using some aspects of the RTRL and BPTT methods | online method | more computational complexity than BPTT method | O(N^3) |
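
To make the BPTT row of Table II concrete, the sketch below unfolds a small network over a short sequence, back-propagates through the unfolded steps with the same weights at every step, and compares one gradient entry with a finite-difference estimate. It is a minimal illustration under stated assumptions, not the implementation of the cited works: tanh hidden units, a linear output layer, a squared-error loss summed over timesteps as one possible choice for Eq. (8), and the helper names `forward`, `sequence_loss` and `bptt` are all choices made for this example.

```python
import numpy as np

def forward(params, xs):
    """Recurrence of Eqs. (1)-(3), assuming tanh hidden units and a linear output."""
    W_ih, W_hh, W_ho, b_h, b_o = params
    hs = [np.zeros(W_hh.shape[0])]          # hs[0] is the initial hidden state
    ys = []
    for x in xs:
        hs.append(np.tanh(W_ih @ x + W_hh @ hs[-1] + b_h))
        ys.append(W_ho @ hs[-1] + b_o)
    return hs, ys

def sequence_loss(params, xs, zs):
    """Squared error summed over timesteps (an assumed instance of Eq. 8)."""
    _, ys = forward(params, xs)
    return 0.5 * sum(np.sum((y - z) ** 2) for y, z in zip(ys, zs))

def bptt(params, xs, zs):
    """Gradients of sequence_loss for every parameter via the unfolded network."""
    W_ih, W_hh, W_ho, b_h, b_o = params
    hs, ys = forward(params, xs)
    g_ih, g_hh, g_ho = np.zeros_like(W_ih), np.zeros_like(W_hh), np.zeros_like(W_ho)
    g_bh, g_bo = np.zeros_like(b_h), np.zeros_like(b_o)
    dh_next = np.zeros_like(b_h)            # error arriving from later timesteps
    for t in reversed(range(len(xs))):
        dy = ys[t] - zs[t]
        g_ho += np.outer(dy, hs[t + 1])
        g_bo += dy
        dh = W_ho.T @ dy + dh_next
        do = dh * (1.0 - hs[t + 1] ** 2)    # back through tanh
        g_ih += np.outer(do, xs[t])
        g_hh += np.outer(do, hs[t])         # the same W_hh is reused at every step
        g_bh += do
        dh_next = W_hh.T @ do
    return g_ih, g_hh, g_ho, g_bh, g_bo

# Illustrative sizes and random data (all hypothetical).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_steps = 3, 4, 2, 6
params = [
    rng.normal(0, 0.5, (n_hidden, n_in)),
    rng.normal(0, 0.5, (n_hidden, n_hidden)),
    rng.normal(0, 0.5, (n_out, n_hidden)),
    np.zeros(n_hidden),
    np.zeros(n_out),
]
xs = rng.normal(size=(n_steps, n_in))
zs = rng.normal(size=(n_steps, n_out))

analytic = bptt(params, xs, zs)[1][0, 1]    # d loss / d W_hh[0, 1]

# Central finite difference on the same entry.
eps = 1e-6
params[1][0, 1] += eps
loss_plus = sequence_loss(params, xs, zs)
params[1][0, 1] -= 2 * eps
loss_minus = sequence_loss(params, xs, zs)
params[1][0, 1] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)

print("analytic:", analytic, "numeric:", numeric)
```

The backward loop visits every stored hidden state once, which is why the whole sequence has to be available before an update can be made, in line with the table's remark that BPTT suits offline rather than online training.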

dependencies when GD is used for training [4]. This is due to the exponential decay of the gradient as it is back-propagated through time, which is called the vanishing gradient problem. In another occasional situation, the back-propagated gradient can exponentially blow up, which increases the variance of the gradients and results in a very unstable learning situation, called the exploding gradient problem [5]. These challenges are discussed in this section. A comparison of major GD methods is presented in Table II and an overview of gradient-based optimization algorithms is provided in [18].

1) Back-propagation through time (BPTT): BPTT is a generalization of back-propagation for feed-forward networks. The standard BPTT method for learning RNNs “unfolds” the network in time and propagates error signals backwards through time. By considering the network parameters in Figure 1b as the set $\theta = \{W_{HH}, W_{IH}, W_{HO}, b_H, b_I, b_O\}$ and $h_t$ as the hidden state of the network at time $t$, we can write the gradients as

$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta}, \tag{10}$$

where the expansion of loss function gradients at time $t$ is

$$\frac{\partial L_t}{\partial \theta} = \sum_{k=1}^{t} \left( \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k^{+}}{\partial \theta} \right), \tag{11}$$

where $\partial h_k^{+} / \partial \theta$ is the partial derivative (i.e., “immediate” partial derivative). It describes how the parameters in the set $\theta$ affect the loss function at the previous timesteps (i.e., $k < t$). In order to transport the error through time from timestep $t$ back to timestep $k$ we can have

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}. \tag{12}$$

We can consider Eq. (12) as a Jacobian matrix for the hidden state parameters in Eq. (1) as

$$\prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W_{HH}^{T} \operatorname{diag}\left| f_H'(h_{i-1}) \right|, \tag{13}$$

where $f'(\cdot)$ is the element-wise derivative of function $f(\cdot)$ and $\operatorname{diag}(\cdot)$ is the diagonal matrix.

We can generally recognize the long-term and short-term contribution of hidden states over time in the network. The long-term dependency refers to the contribution of inputs and corresponding hidden states at time $k \ll t$, and short-term dependencies refer to other times [19]. Figure 3 shows that as the network makes progress over time, the contribution of the inputs $x_{t-1}$ at discrete time $t - 1$ vanishes through time to the timestep $t + 1$ (the dark grey in the layers decays to lighter grey). On the other hand, the contribution of the loss function value $L_{t+1}$ with respect to the hidden state $h_{t+1}$ at time $t + 1$ in BPTT is more than the previous timesteps.

Fig. 3: As the network is receiving new inputs over time, the sensitivity of units decays (lighter shades in layers) and the back-propagation through time (BPTT) overwrites the activation in hidden units. This results in forgetting the early visited inputs.

2) Vanishing Gradient Problem: According to the literature, it is possible to capture complex patterns of real-world data by using a strong nonlinearity [6]. However, this may cause RNNs to suffer from the vanishing gradient problem [4]. This problem refers to the exponential shrinking of gradient magnitudes as they are propagated back through time. This phenomenon causes the memory of the network to ignore long-term dependencies and hardly learn the correlation between temporally distant events. There are two reasons for that:
1) Standard nonlinear functions such as the sigmoid function have a gradient which is almost everywhere close to zero;
2) The magnitude of the gradient is multiplied over and over by the recurrent matrix as it is back-propagated through time. In this case, when the eigenvalues of the recurrent matrix become less than one, the gradient converges to zero rapidly. This happens normally after 5∼10 steps of back-propagation [6].
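
The second reason can be checked numerically. The sketch below is a minimal illustration, not the procedure of the cited works: it accumulates the factors of the Jacobian product in Eq. (13) for a small tanh network and records the spectral norm of the running product. The helper name `jacobian_product_norms`, the network size, the random inputs, and the rescaling of the recurrent matrix to a largest singular value of 0.9 are all assumptions made for the example. For tanh units the derivative is evaluated from the unit outputs as 1 − h².

```python
import numpy as np

def jacobian_product_norms(W_hh, hidden_states):
    """Spectral norm of the running product of W_hh^T diag(f'(h)) factors (cf. Eq. 13).

    Assumes tanh hidden units, so f' is evaluated as 1 - h**2 from the unit outputs.
    """
    product = np.eye(W_hh.shape[0])
    norms = []
    for h in hidden_states:
        # W_hh.T * d scales column j of W_hh.T by d[j], i.e. W_hh^T diag(d).
        product = product @ (W_hh.T * (1.0 - h ** 2))
        norms.append(np.linalg.norm(product, 2))
    return norms

rng = np.random.default_rng(0)
n_hidden, n_steps, sigma_max = 16, 50, 0.9  # illustrative values

# Hypothetical recurrent matrix, rescaled so its largest singular value is sigma_max.
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hh *= sigma_max / np.linalg.norm(W_hh, 2)

# Hidden states obtained by running the recurrence on random inputs.
h = np.zeros(n_hidden)
hidden_states = []
for _ in range(n_steps):
    h = np.tanh(W_hh @ h + rng.normal(size=n_hidden))
    hidden_states.append(h)

norms = jacobian_product_norms(W_hh, hidden_states)
for steps in (1, 5, 10, 50):
    print(f"product over {steps:2d} steps: norm = {norms[steps - 1]:.3e}")
```

With these settings every factor has a norm of at most 0.9, so the product over k steps is bounded by 0.9^k. In practice it shrinks faster, because the tanh derivative is below one for every unit that is not exactly at zero.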


In training the RNNs on long sequences (e.g., 100 timesteps), the gradients shrink when the weights are small. The product of a set of real numbers can shrink/explode to zero/infinity, respectively. For matrices the same analogy exists, but shrinkage/explosion happens along some directions. In [19], it is shown that, by considering $\rho$ as the spectral radius of the recurrent weight matrix $W_{HH}$, $\rho > 1$ is a necessary condition for the long-term components to explode as $t \to \infty$. It is possible to use singular values to generalize it to the non-linear function $f_H(\cdot)$ in Eq. (1) by bounding its derivative with $\gamma \in \mathbb{R}$ such that

$$\left\| \operatorname{diag}(f_H'(h_k)) \right\| \le \gamma. \tag{14}$$

Using Eq. (13), the Jacobian matrix $\partial h_{k+1} / \partial h_k$, and the bound in Eq. (14), we can have

$$\left\| \frac{\partial h_{k+1}}{\partial h_k} \right\| \le \left\| W_{HH}^{T} \right\| \cdot \left\| \operatorname{diag}(f_H'(h_k)) \right\| \le 1. \tag{15}$$

We can consider $\| \partial h_{k+1} / \partial h_k \| \le \delta < 1$ with $\delta \in \mathbb{R}$ for each step $k$. By continuing it over different timesteps and adding the loss function component we can have

$$\left\| \frac{\partial L_t}{\partial h_t} \left( \prod_{i=k}^{t-1} \frac{\partial h_{i+1}}{\partial h_i} \right) \right\| \le \delta^{t-k} \left\| \frac{\partial L_t}{\partial h_t} \right\|. \tag{16}$$

This equation shows that as $t - k$ gets larger, the long-term dependencies move toward zero and the vanishing problem happens. Finally, we can see that the sufficient condition for the gradient vanishing problem to appear is that the largest singular value of the recurrent weights matrix $W_{HH}$ (i.e., $\lambda_1$) satisfies $\lambda_1 < 1/\gamma$ [19].

3) Exploding Gradient Problem: One of the major problems in training RNNs using BPTT is the exploding gradient problem [4]. Gradients in training RNNs on long sequences may explode as the weights become larger and the norm of the gradient during training largely increases. As stated in [19], the necessary condition for this situation to happen is $\lambda_1 > 1/\gamma$.

In order to overcome the exploding gradient problem, many methods have been proposed recently. In 2012, Mikolov proposed a gradient norm-clipping method to avoid the exploding gradient problem in training RNNs with simple tools such as BPTT and SGD on large datasets [23], [24]. In a similar approach, Pascanu proposed an almost identical method by introducing a hyper-parameter as the threshold for norm-clipping the gradients [19]. This parameter can be set by heuristics; however, the training procedure is not very sensitive to it and behaves well for rather small thresholds.

4) Stochastic Gradient Descent: SGD (also called on-line GD) is a generalization of GD that is widely in use for machine learning applications [12]. SGD is robust, scalable, and performs well across many different domains ranging from smooth and strongly convex problems to complex non-convex objectives. Unlike GD, which performs redundant computations, SGD performs one update at a time [25]. For an input-target pair $\{x_k, z_k\}$ in which $k \in \{1, ..., U\}$, the parameters in $\theta$ are updated according to

$$\theta_{t+1} = \theta_t - \lambda \frac{\partial L_k}{\partial \theta}. \tag{17}$$

Such frequent updates cause fluctuation in the loss function outputs, which helps SGD to explore the problem landscape with higher diversity with the hope of finding better local minima. An adaptive learning rate can control the convergence of SGD, such that as the learning rate decreases, the exploration decreases and exploitation increases. This leads to faster convergence to a local minimum. A classical technique to accelerate SGD is using momentum, which accumulates a velocity vector in directions of persistent reduction towards the objective across iterations [26]. The classical version of momentum applies to the loss function $L$ at time $t$ with a set of parameters $\theta$ as

$$v_{t+1} = \mu v_t - \lambda \nabla L(\theta_t), \tag{18}$$

where $\nabla L(\cdot)$ is the gradient of the loss function and $\mu \in [0, 1]$ is the momentum coefficient [9], [12]. As Figure 4a shows, the parameters in $\theta$ are updated as

$$\theta_{t+1} = \theta_t + v_{t+1}. \tag{19}$$

By considering $R$ as the condition number of the curvature at the minimum, the momentum can considerably accelerate convergence to a local minimum, requiring $\sqrt{R}$ times fewer iterations than steepest descent to reach the same level of accuracy [26]. In this case, it is suggested to set the momentum coefficient to $\mu = (\sqrt{R} - 1)/(\sqrt{R} + 1)$ [26].

The Nesterov accelerated gradient (NAG) is a first-order optimization method that provides a more efficient convergence rate for particular situations (e.g., convex functions with deterministic gradient) than GD [27]. The main difference between NAG and classical momentum is in the updating rule of the velocity vector $v$, as presented in Figure 4b, defined as

$$v_{t+1} = \mu v_t - \lambda \nabla L(\theta_t + \mu v_t), \tag{20}$$

where the parameters in $\theta$ are updated using Eq. (19). By reasonable fine-tuning of the momentum coefficient $\mu$, it is possible to increase the optimization performance [9].

Fig. 4: The classical momentum and the Nesterov accelerated gradient schemes. (a) Classical momentum. (b) Nesterov accelerated gradient.

5) Mini-Batch Gradient Descent: The mini-batch GD computes the gradient of a batch of training data which has more than one training sample. The typical mini-batch size is $50 \le b \le 256$, but can vary for different applications. Feeding
the training samples in mini-batches accelerates GD and is suitable for processing load distribution on graphics processing units (GPUs). The update rule modifies the parameters after $b$ examples rather than needing to wait to scan all the examples, such that

$$\theta_t = \theta_{t-1} - \frac{\lambda}{b} \sum_{k=i}^{i+b-1} \frac{\partial L_k}{\partial \theta}. \tag{21}$$

Since the GD-based algorithms are generally dependent on instantaneous estimations of the gradient, they are slow for time series data [22] and ineffective on optimization of non-convex functions [28]. They also require setting of the learning rate, which is often tricky and application dependent.

SGD is much faster than GD and is usable for tracking of updates. However, since the mini-batch GD is easier to parallelize and can take advantage of vectorized implementation, it performs significantly better than GD and SGD [25]. A good vectorization can even lead to faster results compared to SGD. Also, non-random initialization schemes, such as layer-wise pre-training, may help with faster optimization [29]. A deeper analysis is provided in [30].

6) Adam Stochastic Optimization: Adaptive Moment Estimation (Adam) is a first-order gradient-based optimization algorithm, which uses estimates of lower-order moments to optimize a stochastic objective function [31]. It needs initialization of the first moment vector $m_0$ and the second moment vector $v_0$ at timestep zero. These vectors are updated as

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_{t+1} \tag{22}$$

and

$$v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_{t+1}^2, \tag{23}$$

where $g_{t+1}$ is the gradient of the loss function. The exponential decay rates for the moment estimates are recommended to be $\beta_1 = 0.9$ and $\beta_2 = 0.999$ [31]. The bias corrections of the first and second moment estimates are

$$\hat{m}_{t+1} = m_{t+1} / (1 - \beta_1^{t+1}) \tag{24}$$

and

$$\hat{v}_{t+1} = v_{t+1} / (1 - \beta_2^{t+1}). \tag{25}$$

Then, the parameters are updated as

$$\theta_{t+1} = \theta_t - \frac{\alpha \cdot \hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}, \tag{26}$$

where $\alpha$ is the step size and $\epsilon = 10^{-8}$. The Adam algorithm is relatively simple to implement and is suitable for problems with very large datasets [31].

C. Extended Kalman Filter-based Learning

The Kalman filter is a method of predicting the future state of a system based on a series of measurements observed over time by using Bayesian inference and estimating a joint probability distribution over the variables for each timestep [32]. The extended Kalman filter (EKF) is the nonlinear version of the Kalman filter. It relaxes the linear prerequisites of the state transition and observation models. However, they may instead need to be differentiable functions. The EKF trains RNNs with the assumption that the optimum setting of the weights is stationary [22], [33]. Compared to back-propagation, the EKF helps RNNs to reach the training steady state much faster for non-stationary processes. It can outperform the back-propagation algorithm in training with limited data [15]. Similar to SGD, it can train an RNN with incoming input data in an online manner [33].

A more efficient and effective version of EKF is the decoupled EKF (DEKF) method, which ignores the interdependencies of mutually exclusive groups of weights [32]. This technique can lower the computational complexity and the required storage per training instance. The DEKF applies the extended Kalman filter independently to each neuron in order to estimate the optimum weights feeding it. By proceeding this way, only local interdependencies are considered. The training procedure is modeled as an optimal filtering problem. It recursively and efficiently computes a solution to the least-squares problem to find the best fitted curve for a given set of data in terms of minimizing the average distance between data and curve. At a timestep $t$, all the information supplied to the network until time $t$ is used, including all derivatives computed since the first iteration of the learning process. However, computation requires just the results from the previous step and there is no need to store results beyond that step [22]. Kalman-based models in RNNs are computationally expensive and have received little attention in the past years.

D. Second Order Optimization

The second order optimization algorithms use information of the second derivative of a function. With the assumption of having a quadratic function with a good second order expansion approximation, Newton’s method can perform better and faster than GD by moving toward the global minimum [34]. In contrast, the direction of optimization in GD is against the gradient, and GD can get stuck near saddle points or local extrema. The other challenge with GD-based models is setting of the learning rate, which is often tricky and application dependent. However, second order methods generally require computing the Hessian matrix and the inverse of the Hessian matrix, which is a difficult task to perform in RNNs compared to GD approaches.

A general recursive Bayesian Levenberg-Marquardt algorithm can sequentially update the weights and the Hessian matrix in recursive second-order training of an RNN [35]. Such an approach outperforms standard real-time recurrent learning and EKF training algorithms for RNNs [35]. The challenges in computing the Hessian matrix for time-series are addressed by introducing Hessian-free (HF) optimization [34].

E. Hessian-Free Optimization

A well-designed and well-initialized HF optimizer can work very well for optimizing non-convex functions, such as training the objective function for deep neural networks, given sensible random initializations [34]. Since RNNs share weights across time, the HF optimizer should be a good optimization candidate [5]. Training RNNs via HF optimization
can reduce training difficulties caused by gradient-based optimization [36]. In general, HF and truncated Newton methods compute a new estimate of the Hessian matrix before each update step and can take into account abrupt changes in curvature [19]. HF optimization targets unconstrained minimization of real-valued smooth objective functions [28]. Like standard Newton’s method, it uses local quadratic approximations to generate update proposals. It belongs to the broad class of approximate Newton methods that are practical for problems of very high dimensionality, such as the training objectives of large neural networks [28].

With the addition of a novel damping mechanism to an HF optimizer, the optimizer is able to train an RNN on pathological synthetic datasets, which are known to be impossible to learn with GD [28]. Multiplicative RNNs (MRNNs) use multiplicative (also called “gated”) connections to allow the current input character to determine the transition matrix from one hidden state vector to the next [5]. This method demonstrates the power of a large RNN trained with this optimizer by applying it to the task of predicting the next character in a stream of text [5], [12].

The HF optimizer can be used in conjunction with or as an alternative to existing pre-training methods and is more widely applicable, since it relies on fewer assumptions about the specific structure of the network. HF optimization operates on large mini-batches and is able to detect promising directions in the weight space that have very small gradients but even smaller curvature. Similar results have been achieved by using SGD with momentum and initializing weights to small values close to zero [9].

F. Global Optimization

In general, evolutionary computing methods initialize a population of search agents and evolve them to find local/global optimization solution(s) [37]. These methods can solve a wide range of optimization problems including multimodal, ill-behaved, high-dimensional, convex, and non-convex problems. However, evolutionary algorithms have some drawbacks in optimization of RNNs, including getting stuck in local minima/maxima, slow speed of convergence, and network stagnancy.

Optimization of the parameters in RNNs can be modelled as a nonlinear global optimization problem. The most common global optimization method for training RNNs is genetic algorithms [38]. The Alopex-based evolutionary algorithm (AEA) uses local correlations between changes in individual weights and changes in the global error measure and simultaneously updates all the weights in the network using only local computations [39]. Selecting the optimal topology of a neural network for a particular application is a different task from optimizing the network parameters. A hybrid multi-objective evolutionary algorithm that trains and optimizes the structure of an RNN for time series prediction is proposed in [40]. Some models simultaneously acquire both the structure and weights for recurrent networks [38]. The covariance matrix adaptation evolution strategy (CMA-ES) is a global optimization method used for tuning the parameters of an RNN for language models [41]. Published literature on global optimization methods for RNNs is scattered and has not received much attention from the research community. This lack is mainly due to the computational complexity of these methods. However, the multi-agent philosophy of such methods in a low computational complexity manner, such as models with small population size [42], may result in much better performance than SGD.

IV. RECURRENT NEURAL NETWORKS ARCHITECTURES

This section aims to provide an overview of the different architectures of RNNs and discuss the nuances between these models.

A. Deep RNNs with Multi-Layer Perceptron

Deep architectures of neural networks can represent a function exponentially more efficiently than shallow architectures. While recurrent networks are inherently deep in time, given that each hidden state is a function of all previous hidden states [43], it has been shown that the internal computation is in fact quite shallow [44]. In [44], it is argued that adding one or more nonlinear layers in the transition stages of an RNN can improve overall performance by better disentangling the underlying variations of the original input. The deep structures in RNNs with perceptron layers can fall under three categories: input to hidden, hidden to hidden, and hidden to output [44].

1) Deep input to hidden: One of the basic ideas is to bring the multi-layer perceptron (MLP) structure into the transition and output stages, called deep transition RNNs and deep output RNNs, respectively. To do so, two operators can be introduced. The first is a plus operator $\oplus$, which receives two vectors, the input vector $x$ and hidden state $h$, and returns a summary as

$$h_t = x_t \oplus h_{t-1}. \tag{27}$$

This operator is equivalent to Eq. (1). The other operator is a predictor denoted as $\rhd$, which is equivalent to Eq. (3) and predicts the output of a given summary $h$ as

$$y_t = \rhd h_t. \tag{28}$$

Higher level representation of input data means easier representation of relationships between temporal structures of data. This technique has achieved better results than feeding the network with original data in speech recognition [43] and word embedding [45] applications. The structure of an RNN with an MLP in the input to hidden layers is presented in Figure 5a. In order to enhance long-term dependencies, an additional connection makes a short-cut between the input and hidden layer as in Figure 5b [44].

2) Deep hidden to hidden and output: Most of the focus for deep RNNs is on the hidden layers. At this level, the procedure of data abstraction and/or hidden state construction from previous data abstractions and new inputs is highly non-linear. An MLP can model this non-linear function, which helps an RNN to quickly adapt to fast changing input modes while still having a good memory of past events. An RNN can have both an MLP in transition and an MLP before the
output layer (an example is presented in Figure 5c) [44]. A deep hidden to output function can disentangle the factors of variations in the hidden state and facilitate prediction of the target. This function allows a more compact hidden state of the network, which may result in a more informative historical summary of the previous inputs.

3) Stack of hidden states: Another approach to construct deep RNNs is to have a stack of hidden recurrent layers as shown in Figure 5d. This style of recurrent levels encourages the network to operate at different timescales and enables it to deal with multiple time scales of input sequences [44]. However, the transitions between consecutive hidden states are often shallow, which results in a limited family of functions it can represent [44]. Therefore, this function cannot act as a universal approximator, unless the higher layers have feedback to the lower layers.

Fig. 5: Some deep recurrent neural network (RNN) architectures with multi-layer perceptron (MLP). (a) Input to hidden. (b) Input to hidden with short-cut. (c) Hidden to hidden and output. (d) Stack of hidden states.

Fig. 7: Forward pass with sequence ordering in two-dimensional recurrent neural network (RNN). The connections within the hidden layer plane are recurrent. The lines along x1 and x2 show the scanning strips along which previous points were visited, starting at the top left corner.

B. Bidirectional RNN

Conventional RNNs only consider the previous context of data for training. While simply looking at previous context is sufficient in many applications such as speech recognition, it is also useful to explore the future context [43]. Previously, the use of future information as context for current prediction has been attempted in the basic architecture of RNNs by delaying the output by a certain number of time frames. However, this method required a handpicked optimal delay to be chosen for any implementation. A bi-directional RNN (BRNN) considers all available input sequence in both the past and future for estimation of the output vector [46]. To do so, one RNN processes the sequence from start to end in a forward time direction. Another RNN processes the sequence backwards from end to start in a negative time direction as demonstrated in Figure 6. Outputs from forward states are not connected to inputs of backward states and vice versa, and there are no interactions between the two types of state neurons [46].

Fig. 6: Unfolded through time bi-directional recurrent neural network (BRNN).

In Figure 6, the forward and backward hidden sequences are denoted by $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively, at time $t$. The forward hidden sequence is computed as

$$\overrightarrow{h}_t = f_H\left( W_{\overrightarrow{IH}} x_t + W_{\overrightarrow{HH}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}} \right), \tag{29}$$

where it is iterated over $t = (1, ..., T)$. The backward layer is

$$\overleftarrow{h}_t = f_H\left( W_{\overleftarrow{IH}} x_t + W_{\overleftarrow{HH}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}} \right), \tag{30}$$

which is iterated backward over time $t = (T, ..., 1)$. The output sequence $y_t$ at time $t$ is

$$y_t = W_{\overrightarrow{HO}} \overrightarrow{h}_t + W_{\overleftarrow{HO}} \overleftarrow{h}_t + b_o. \tag{31}$$
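
The bidirectional forward pass of Eqs. (29)-(31) can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the formulation of [46]: tanh hidden units, zero initial states for both directions, a backward state at time t that is computed from the state at t + 1 (which is how the backward iteration over t = (T, ..., 1) is read here), and hypothetical names such as `brnn_forward` and `init_direction`.

```python
import numpy as np

def brnn_forward(xs, fwd, bwd, W_ho_fwd, W_ho_bwd, b_o):
    """Forward pass of a bidirectional RNN with tanh hidden units (cf. Eqs. 29-31).

    fwd and bwd are (W_ih, W_hh, b_h) tuples for the forward and backward layers.
    """
    n_steps, n_hidden = len(xs), fwd[1].shape[0]
    h_fwd = np.zeros((n_steps, n_hidden))
    h_bwd = np.zeros((n_steps, n_hidden))

    W_ih, W_hh, b_h = fwd
    h = np.zeros(n_hidden)
    for t in range(n_steps):                  # t = 1, ..., T
        h = np.tanh(W_ih @ xs[t] + W_hh @ h + b_h)
        h_fwd[t] = h

    W_ih, W_hh, b_h = bwd
    h = np.zeros(n_hidden)
    for t in reversed(range(n_steps)):        # t = T, ..., 1
        h = np.tanh(W_ih @ xs[t] + W_hh @ h + b_h)
        h_bwd[t] = h

    # The output at each timestep combines both directions.
    ys = h_fwd @ W_ho_fwd.T + h_bwd @ W_ho_bwd.T + b_o
    return ys, h_fwd, h_bwd

# Illustrative sizes and random parameters (all hypothetical).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_steps = 3, 5, 2, 4

def init_direction():
    return (
        rng.normal(0, 0.5, (n_hidden, n_in)),
        rng.normal(0, 0.5, (n_hidden, n_hidden)),
        np.zeros(n_hidden),
    )

fwd, bwd = init_direction(), init_direction()
W_ho_fwd = rng.normal(0, 0.5, (n_out, n_hidden))
W_ho_bwd = rng.normal(0, 0.5, (n_out, n_hidden))
b_o = np.zeros(n_out)
xs = rng.normal(size=(n_steps, n_in))

ys, h_fwd, h_bwd = brnn_forward(xs, fwd, bwd, W_ho_fwd, W_ho_bwd, b_o)

# Perturb only the last input and look at what happens at the first timestep.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
ys_changed, h_fwd_changed, _ = brnn_forward(xs_changed, fwd, bwd, W_ho_fwd, W_ho_bwd, b_o)

print("forward state at t=1 unchanged:", np.allclose(h_fwd[0], h_fwd_changed[0]))
print("output at t=1 unchanged:", np.allclose(ys[0], ys_changed[0]))
```

The two loops never read each other's states, matching the statement that forward and backward state neurons do not interact. The perturbation at the end shows that the first output still depends on the last input through the backward layer, which is the future context a unidirectional RNN cannot use.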
While the augmentation of an RNN for leveraging the benefits of deep networks has been shown to yield performance improvements, it has also been shown to introduce potential issues. By adding nonlinear layers to the network transition stages, there now exist additional layers through which the gradient must travel back. This can lead to issues such as vanishing and exploding gradients, which can cause the network to fail to adequately capture long-term dependencies [44]. The addition of nonlinear layers in the transition stages of an RNN can also significantly increase the computational cost and reduce the speed of the network. Additional layers can significantly increase the training time of the network, must be unrolled at each iteration of training, and can thus not be parallelized.

BPTT is one option to train BRNNs. However, the forward and backward pass procedures are slightly more complicated because the update of state and output neurons can no longer be conducted one at a time [46]. While simple RNNs are constrained by inputs leading to the present time, BRNNs extend this model by using both past and future information. However, the shortcoming of BRNNs is their requirement to know the start and end of input sequences in advance. An example is labeling spoken sentences by their phonemes [46].

C. Recurrent Convolutional Neural Networks

The rise in popularity of RNNs can be attributed to their ability to model sequential data. Previous models examined
9

have augmented the underlying structure of a simple RNN
to improve its performance on learning the contextual
dependencies of single dimension sequences. However, there exist
several problems that require understanding of contextual
dependencies over multiple dimensions. The most popular network
architectures use convolutional neural networks (CNNs)
to tackle these problems.
CNNs are very popular models for machine vision applications.
CNNs may consist of multiple convolutional layers,
optionally with pooling layers in between, followed by fully
connected perceptron layers [11]. Typical CNNs learn through
the use of convolutional layers to extract features using shared
weights in each layer. The feature pooling layer (i.e., sub-sampling)
generalizes the network by reducing the resolution
of intermediate representations (i.e., feature maps) as well as
the sensitivity of the output to shifts and distortions. The
extracted features, at the very last convolutional layer, are fed
to a fully connected perceptron model for dimensionality
reduction of features and classification.
Incorporation of recurrent connections into each convolutional
layer can shape a recurrent convolutional neural network
(RCNN) [47]. The activation of units in an RCNN evolves over
time, as they are dependent on the neighboring units. This
approach can integrate the context information, important for
object recognition tasks. This approach increases the depth of
the model, while the number of parameters is kept constant by
weight sharing between layers. Using recurrent connections from
the output into the input of the hidden layer allows the network to
model label dependencies and smooth its own outputs based
on its previous outputs [48]. This RCNN approach allows a
large input context to be fed to the network while limiting the
capacity of the model. This system can model complex spatial
dependencies with low inference cost. As the context size
increases with the built-in recurrence, the system identifies and
corrects its own errors [48]. Quad-directional 2-dimensional
RNNs can enhance CNNs to model long range spatial
dependencies [49]. This method efficiently embeds the global spatial
context into the compact local representation [49].

D. Multi-Dimensional Recurrent Neural Networks
Multi-dimensional recurrent neural networks (MDRNNs)
are another implementation of RNNs for high dimensional
sequence learning. This network utilizes recurrent connections
for each dimension to learn correlations in the data. MDRNNs
are a special case of directed acyclic graph RNNs [50],
generalized to multidimensional data by replacing the
one-dimensional chain of network updates with a D-dimensional
grid [51]. In this approach, the single recurrent connection
is replaced with recurrent connections of size D. A
2-dimensional example is presented in Figure 7. During the
forward pass at each timestep, the hidden layer receives an
external input as well as its own activation from one step
back along all dimensions. A combination of the input and the
previous hidden layer activation at each timestep is fed in the
order of the input sequence. Then, the network stores the resulting
hidden layer activation [52]. The error gradient of an MDRNN
can be calculated with BPTT. As with one dimensional BPTT,
the sequence is processed in the reverse order of the forward
pass. At each timestep, the hidden layer receives both the
output error derivatives and its own future derivatives [52].
RNNs have suitable properties for multidimensional domains,
such as robustness to warping and flexible use of context.
Furthermore, RNNs can also leverage inherent sequential
patterns in image analysis and video processing that are often
ignored by other architectures [53]. However, memory usage
can become a significant problem when trying to model
multidimensional sequences. As the number of recurrent
connections in the network increases, so too does the number of
saved states that the network must conserve. This can result in
huge memory requirements if there is a large number of saved
states in the network. MDRNNs also fall victim to vanishing
gradients and can fail to learn long-term sequential information
along all dimensions. While applications of the MDRNN fall
in line with RCNNs, there has yet to be any comparative
examination performed on the two models.

E. Long-Short Term Memory
Recurrent connections can improve performance of neural
networks by leveraging their ability to understand sequential
dependencies. However, the memory produced from the
recurrent connections can be severely limited by the algorithms
employed for training RNNs. All the models thus far have
fallen victim to exploding or vanishing gradients during the
training phase, resulting in the network failing to learn
long-term sequential dependencies in data. The following models
are specifically designed to tackle this problem, the most
popular being the long-short term memory (LSTM) RNNs.
LSTM is one of the most popular and efficient methods for
reducing the effects of vanishing and exploding gradients [54].
This approach changes the structure of hidden units from
“sigmoid” or “tanh” to memory cells, in which their inputs and
outputs are controlled by gates. These gates control the flow of
information to hidden neurons and preserve extracted features
from previous timesteps [21], [54].
It is shown that for a continual sequence, the LSTM model’s
internal values may grow without bound [55]. Even when
continuous sequences have naturally reoccurring properties,
the network has no way to detect which information is no
longer relevant. The forget gate learns weights that control the
rate at which the value stored in the memory cell decays [55].
For periods when the input and output gates are off and the
forget gate is not causing decay, a memory cell simply holds
its value over time so that the gradient of the error stays
constant during back-propagation over those periods [21].
This structure allows the network to potentially remember
information for longer periods.
LSTM suffers from high complexity in the hidden layer. For
an identical size of hidden layers, a typical LSTM has about four
times more parameters than a simple RNN [6]. The objective
at the time of proposing the LSTM method was to introduce a
scheme that could improve learning long-range dependencies,
rather than to find the minimal or optimal scheme [21].
Multi-dimensional and grid LSTM networks have partially
enhanced learning of long-term dependencies compared to
simple LSTM, and are discussed in this section.
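The "about four times more parameters" figure for an LSTM can be checked with a rough count. The sketch below is illustrative only and rests on assumptions not stated in the text: a single layer, one weight matrix from the input, one from the previous hidden state and one bias per gate or cell input, and optional diagonal cell-to-gate ("peephole") weights. The function names are hypothetical.

```python
def simple_rnn_params(n_in: int, n_hidden: int) -> int:
    """Parameters of one simple recurrent layer: W_IH, W_HH and a bias."""
    return n_hidden * n_in + n_hidden * n_hidden + n_hidden


def lstm_params(n_in: int, n_hidden: int, peepholes: bool = False) -> int:
    """Parameters of one LSTM layer under the assumptions in the lead-in.

    The input, forget and output gates and the cell input each have their
    own input weights, hidden-state weights and bias (four blocks in total).
    """
    per_block = n_hidden * n_in + n_hidden * n_hidden + n_hidden
    total = 4 * per_block
    if peepholes:
        # Assumption: diagonal cell-to-gate weights, one vector per gate.
        total += 3 * n_hidden
    return total


if __name__ == "__main__":
    n_in, n_hidden = 128, 256  # arbitrary example sizes
    rnn = simple_rnn_params(n_in, n_hidden)
    lstm = lstm_params(n_in, n_hidden)
    print(f"simple RNN: {rnn} parameters")
    print(f"LSTM:       {lstm} parameters ({lstm / rnn:.2f}x)")
```

With full (non-diagonal) cell-to-gate matrices the ratio would be somewhat higher than four, which is consistent with the "about" in the text.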

Fig. 8: The LSTM memory block with one cell. The dashed lines represent
time lag.

1) Standard LSTM: A typical LSTM cell is made of input,
forget, and output gates and a cell activation component as
shown in Figure 8. These units receive the activation signals
from different sources and control the activation of the cell by
the designed multipliers. The LSTM gates can prevent the rest
of the network from modifying the contents of the memory
cells for multiple timesteps. LSTM networks preserve signals
and propagate errors for much longer than ordinary RNNs.
These properties allow LSTM networks to process data with
complex and separated interdependencies and to excel in a
range of sequence learning domains.
The input gate of LSTM is defined as

$$g_t^i = \sigma(W_{Ig^i} x_t + W_{Hg^i} h_{t-1} + W_{g^c g^i} g_{t-1}^c + b_{g^i}), \tag{32}$$

where $W_{Ig^i}$ is the weight matrix from the input layer to the
input gate, $W_{Hg^i}$ is the weight matrix from hidden state to
the input gate, $W_{g^c g^i}$ is the weight matrix from cell activation
to the input gate, and $b_{g^i}$ is the bias of the input gate. The
forget gate is defined as

$$g_t^f = \sigma(W_{Ig^f} x_t + W_{Hg^f} h_{t-1} + W_{g^c g^f} g_{t-1}^c + b_{g^f}), \tag{33}$$

where $W_{Ig^f}$ is the weight matrix from the input layer to the
forget gate, $W_{Hg^f}$ is the weight matrix from hidden state
to the forget gate, $W_{g^c g^f}$ is the weight matrix from cell
activation to the forget gate, and $b_{g^f}$ is the bias of the forget
gate. The cell gate is defined as

$$g_t^c = g_t^i \tanh(W_{Ig^c} x_t + W_{Hg^c} h_{t-1} + b_{g^c}) + g_t^f g_{t-1}^c, \tag{34}$$

where $W_{Ig^c}$ is the weight matrix from the input layer to the
cell gate, $W_{Hg^c}$ is the weight matrix from hidden state to the
cell gate, and $b_{g^c}$ is the bias of the cell gate. The output gate
is defined as

$$g_t^o = \sigma(W_{Ig^o} x_t + W_{Hg^o} h_{t-1} + W_{g^c g^o} g_t^c + b_{g^o}), \tag{35}$$

where $W_{Ig^o}$ is the weight matrix from the input layer to the
output gate, $W_{Hg^o}$ is the weight matrix from hidden state
to the output gate, $W_{g^c g^o}$ is the weight matrix from cell
activation to the output gate, and $b_{g^o}$ is the bias of the output
gate. Finally, the hidden state is computed as

$$h_t = g_t^o \tanh(g_t^c). \tag{36}$$

Fig. 9: An example of S-LSTM, a long-short term memory network on tree
structures. A tree node can consider information from multiple descendants.
Information of the other nodes in white is blocked. The short line (-) at each
arrowhead indicates a block of information.

2) S-LSTM: While the LSTM internal mechanics help the
network to learn longer sequence correlations, they may fail to
understand input structures more complicated than a sequence.
The S-LSTM model is designed to overcome the gradient
vanishing problem and learn longer term dependencies from
the input. An S-LSTM network is made of S-LSTM memory
blocks and works based on a hierarchical structure. A typical
memory block is made of input and output gates. In this
tree structure, presented in Figure 9, the memory of multiple
descendant cells over time periods is reflected on a memory
cell recursively. This method learns long term dependencies
over the input by considering information from long distances
on the tree (i.e., branches) to the principal (i.e., root). A typical
S-LSTM has a “sigmoid” function and therefore the gating
signal works in the range of [0,1]. Figure 9 shows that the
gates closer to the root suffer less from the gradient vanishing
problem (darker circles), while the branches at lower levels
of the tree lose their memory due to gradient vanishing (lighter
circles). A gate can be closed, shown with a dash, so that it does
not receive the signal from lower branches.
The S-LSTM method can achieve competitive results compared
to the recursive and LSTM models. It has the potential
of extension to other LSTM models. However, its performance
has not been compared with other state-of-the-art LSTM models.
The reader may refer to [56] for more details about the S-LSTM
memory cell.

3) Stacked LSTM: The idea of depth in ANNs is also
applicable to LSTMs by stacking different hidden layers with
LSTM cells in space to increase the network capacity [43],
[57]. A hidden layer $l$ in a stack of $L$ LSTMs using the hidden
layer in Eq. (1) is defined as

$$h_t^l = f_H(W_{IH} h_t^{l-1} + W_{HH} h_{t-1}^l + b_h^l), \tag{37}$$

where the hidden vector sequence $h_t^l$ is computed over
time $t = (1, ..., T)$ for $l = (1, ..., L)$. The initial hidden
vector sequence is defined using the input sequence $h^0 =
(x_1, ..., x_T)$ [43]. Then, the output of the network is

$$y_t = f_O(W_{HO} h_t^L + b_0). \tag{38}$$

In stacked LSTM, a stack pointer can determine which cell
in the LSTM provides state and memory cell of a previous
in the LSTM provides state and memory cell of a previous

timestep [58]. In such a controlled structure, not only can the
controller push to and pop from the top of the stack in
constant time, but an LSTM can also maintain a continuous
space embedding of the stack contents [58], [59].
The combination of stacked LSTM with different RNN
structures for different applications needs investigation. One
example is the combination of stacked LSTM with a frequency
domain CNN for speech processing [43], [60].

4) Bidirectional LSTM: It is possible to increase the capacity
of BRNNs by stacking hidden layers of LSTM cells
in space, called deep bidirectional LSTM (BLSTM) [43].
BLSTM networks are more powerful than unidirectional
LSTM networks [61]. These networks theoretically involve
all information of input sequences during computation. The
distributed representation feature of BLSTM is crucial for
different applications such as language understanding [62].
The BLSTM model leverages the same advantages discussed
in the Bidirectional RNN section, while also overcoming
the vanishing gradient problem.

5) Multidimensional LSTM: The classical LSTM model has
a single self-connection which is controlled by a single forget
gate. Its activation is considered as one dimensional LSTM.
Multi-dimensional LSTM (MDLSTM) uses interconnection
from the previous state of the cell to extend the memory of LSTM
along each of the N dimensions [52], [63]. The MDLSTM receives
inputs in an N-dimensional arrangement (e.g. two dimensions
for an image). Hidden state vectors $(h_1, ..., h_N)$ and memory
vectors $(m_1, ..., m_N)$ are fed to each input of the array from
the previous state for each dimension. The memory vector is
defined as

$$m = \sum_{j=1}^{N} g_j^f \odot m_j + g_j^i \odot g_j^c, \tag{39}$$

where $\odot$ is the element-wise product and the gates are
computed using Eq. (32) to Eq. (36) [57].
Spatial LSTM is a particular case of MDLSTM [64],
which is a two-dimensional grid for image modelling. This
model generates a hidden state vector for a particular pixel
in an image by sequentially reading the pixels in its small
neighbourhood [64]. The state of the pixel is generated by
feeding the state hidden vector into a factorized mixture of
conditional Gaussian scale mixtures (MCGSM) [64].

6) Grid LSTM: The MDLSTM model becomes unstable
as the grid size and LSTM depth in space grow. The grid
LSTM model provides a solution by altering the computation
of output memory vectors. This method targets deep sequential
computation of multi-dimensional data. The model connects
LSTM cells along the spatiotemporal dimensions of input data
and between the layers. Unlike the MDLSTM model, the block
computes N transforms and outputs N hidden state vectors
and N memory vectors. The hidden state vector for dimension
$j$ is

$$h_j = \mathrm{LSTM}(H, m_j, W_{g_j^i}, W_{g_j^f}, W_{g_j^o}, W_{g_j^c}), \tag{40}$$

where $\mathrm{LSTM}(\cdot)$ is the standard LSTM procedure [57] and $H$
is the concatenation of input hidden state vectors defined as

$$H = [h_1, ..., h_N]^T. \tag{41}$$

A two-dimensional grid LSTM network adds LSTM cells
along the spatial dimension to a stacked LSTM. A grid LSTM
with three or more dimensions is similar to MDLSTM, but
adds LSTM cells along the spatial depth and performs
N-way interaction. More details on grid LSTM are provided
in [57].

Fig. 10: Architecture of the differential recurrent neural network (dRNN) at
time t. The input gate i and the forget gate f are controlled by the DoS at
times t − 1 and t, respectively, [65].

7) Differential Recurrent Neural Networks: While LSTMs
have shown improved learning ability in understanding long
term sequential dependencies, it has been argued that their gating
mechanisms have no way of comprehensively discriminating
between salient and non-salient information in a sequence [65].
Therefore, LSTMs fail to capture spatio-temporal dynamic
patterns in tasks such as action recognition [65], in which
sequences can often contain many non-salient frames.
Differential recurrent neural networks (dRNNs) detect
and capture important spatio-temporal sequences to learn the
dynamics of actions in the input [65]. An LSTM gate in dRNNs
monitors alterations in the information gain of important motions
between successive frames. This change of information is
detectable by computing the derivative of hidden states (DoS).
A large DoS reveals a sudden change of action state, which
means the spatio-temporal structure contains informative
dynamics. In this situation, the gates in Figure 10 allow flow of
information to update the memory cell defined as

$$s_t = g_t^f \odot s_{t-1} + g_t^i \odot s_{t-1/2}, \tag{42}$$

where

$$s_{t-1/2} = \tanh(W_{hs} h_{t-1} + W_{xs} x_t + b_s). \tag{43}$$

The DoS $ds_t/dt$ quantifies the change of information at each
time $t$. A small DoS keeps the memory cell away from any
influence by the input. More specifically, the cell controls the
input gate as

$$g_t^i = \sigma\Big(\sum_{r=0}^{R} W_{dg^i}^{(r)} \frac{d^{(r)} s_{t-1}}{dt^{(r)}} + W_{hg^i} h_{t-1} + W_{xg^i} x_t + b_{g^i}\Big), \tag{44}$$

the forget gate unit as

$$g_t^f = \sigma\Big(\sum_{r=0}^{R} W_{dg^f}^{(r)} \frac{d^{(r)} s_{t-1}}{dt^{(r)}} + W_{hg^f} h_{t-1} + W_{xg^f} x_t + b_{g^f}\Big), \tag{45}$$

and the output gate unit as

$$g_t^o = \sigma\Big(\sum_{r=0}^{R} W_{dg^o}^{(r)} \frac{d^{(r)} s_t}{dt^{(r)}} + W_{hg^o} h_{t-1} + W_{xg^o} x_t + b_{g^o}\Big), \tag{46}$$

TABLE III: A comparison between major long-short term memory (LSTM) architectures.

| Method | Advantages | Disadvantages |
|---|---|---|
| LSTM | models long-term dependencies better than a simple RNN; more robust to vanishing gradients than a simple RNN | higher memory requirement and computational complexity than a simple RNN due to multiple memory cells |
| S-LSTM | models complicated inputs better than LSTM | higher computational complexity in comparison with LSTM |
| Stacked LSTM | models long-term sequential dependencies due to deeper architecture | higher memory requirement and computational complexity than LSTM due to a stack of LSTM cells |
| Bidirectional LSTM | captures both future and past context of the input sequence better than LSTM and S-LSTM | increases computational complexity in comparison with LSTM due to the forward and backward learning |
| Multidimensional LSTM | models multidimensional sequences | higher memory requirement and computational complexity than LSTM due to multiple hidden state vectors; instability of the network as grid size and depth grow |
| Grid LSTM | models multidimensional sequences with increased grid size | higher memory requirement and computational complexity than LSTM due to multiple recurrent connections |
| Differential RNN | discriminates between salient and non-salient information in a sequence; better captures spatiotemporal patterns | increases computational complexity in comparison with LSTM due to the differential operators |
| Local-Global LSTM | improves exploitation of local and global contextual information in a sequence | increases computational complexity in comparison with LSTM due to a larger number of parameters for local and global representations |
| Matching LSTM | optimizes LSTM for natural language inference tasks | increases computational complexity due to word-by-word matching of hypothesis and premise |
| Frequency-Time LSTM | models both time and frequency | higher computational complexity than LSTM due to a larger number of parameters to model time and frequency |

where the DoS has an upper order limit of R. BPTT can
train dRNNs. The first-order and second-order dRNNs perform
better in training than the simple LSTM;
however, they have additional computational complexity.

8) Other LSTM Models: The local-global LSTM (LG-LSTM)
architecture was initially proposed for semantic object
parsing [66]. The objective is to improve exploitation of
complex local (neighbourhood of a pixel) and global (whole
image) contextual information on each position of an image.
The current version of LG-LSTM appends a stack
of LSTM layers to intermediate convolutional layers. This
technique directly enhances visual features and allows
end-to-end learning of network parameters [66]. Performance
comparison of LG-LSTM with a variety of CNN models shows
high accuracy [66]. It is expected that this model
can achieve more success by replacing all convolutional layers
with LG-LSTM layers.
The matching LSTM (mLSTM) was initially proposed for
natural language inference. The matching mechanism stores
(remembers) the critical results for the final prediction and
forgets the less important matchings [62]. The last hidden state
of the mLSTM is useful to predict the relationship between the
premise and the hypothesis. The difference with other methods
is that instead of a whole sentence embedding of the premise
and the hypothesis, the mLSTM performs a word-by-word
matching of the hypothesis with the premise [62].
The recurrence in both time and frequency for RNN, named
F-T-LSTM, is proposed in [67]. This model generates a summary
of the spectral information by scanning the frequency
bands using a frequency LSTM. Then, it feeds the output
layer activations as inputs to an LSTM. The formulation
of frequency LSTM is similar to the time LSTM [67]. A
convolutional LSTM (ConvLSTM) model with convolutional
structures in both the input-to-state and state-to-state transitions
for precipitation now-casting is proposed in [68]. This
model uses a stack of multiple ConvLSTM layers to construct
an end-to-end trainable model [68]. A comparison between
major LSTM models is provided in Table III.

Fig. 11: A gated recurrent unit (GRU). The update gate z decides if the hidden
state is to be updated with a new hidden state h̃. The reset gate r controls if
the previous hidden state needs to be ignored.

F. Gated Recurrent Unit
While LSTMs have been shown to be a viable option for avoiding
vanishing or exploding gradients, they have a higher memory
requirement given multiple memory cells in their architecture.
In gated recurrent units (GRUs), each recurrent unit adaptively
captures dependencies of different time scales [69]. Similar to
the LSTM unit, the GRU has gating units that modulate the
flow of information inside the unit, however, without having
separate memory cells. In contrast to LSTM, the GRU exposes
the whole state at each timestep [70] and computes a linear
sum between the existing state and the newly computed state.
The block diagram of a GRU is presented in Figure 11. The
activation in a GRU is linearly modeled as

$$h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t, \tag{47}$$

where the update gate $z_t$ controls the update value of the
activation, defined as

$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \tag{48}$$

where $W$ and $U$ are weight matrices to be learned. The
candidate activation is

$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})), \tag{49}$$

where $r_t$ is a set of reset gates defined as

$$r_t = \sigma(W_r x_t + U_r h_{t-1}), \tag{50}$$

which allows the unit to forget the previous state, as if it were
reading the first symbol of an input sequence. Several similarities
and differences between GRU networks and LSTM networks are
outlined in [69]. The study found that each model outperformed
the other only on certain tasks, so no general conclusion can be
drawn as to which model is better.

G. Memory Networks
Conventional RNNs have a small memory size to store features
from past inputs [71], [72]. Memory neural networks
(MemNN) utilize successful learning methods for inference
with a readable and writable memory component. A MemNN
is an array of objects and consists of input, response,
generalization, and output feature map components [71], [73]. It
converts the input to an internal feature representation and then
updates the memories based on the new input. Then, it uses
the input and the updated memory to compute output features
and decode them to produce an output [71]. This network is
not easy to train using BPTT and requires supervision at each
layer [74]. A less supervision-oriented version of MemNN
is end-to-end MemNN, which can be trained end-to-end from
input-output pairs [74]. It generates an output after a number of
timesteps and the intermediary steps use memory input/output
operations to update the internal state [74].
Recurrent memory networks (RMN) take advantage of the
LSTM as well as the MemNN [75]. The memory block in
RMN takes the hidden state of the LSTM and compares
it to the most recent inputs using an attention mechanism.
The RMN algorithm analyses the attention weights of the trained
model and extracts knowledge from the retained information
in the LSTM over time [75]. This model is developed for
language modeling and is tested on three large datasets. The
results show the performance of the algorithm versus the LSTM
model; however, this model inherits the complexity of LSTM
and MemNN and needs further development.
Episodic memory is inspired by semantic and episodic
memories, which are necessary for complex reasoning in the
brain [73]. Episodic memory is the name given to the memory of
the dynamic memory network framework, which remembers
autobiographical details [73]. This memory refers to the generated
representation of stored experiential facts. The facts are
retrieved from the inputs conditioned on the question. This
results in a final representation by reasoning on the facts. The
module performs several passes over the facts, while focusing
on different facts. The output of each pass is called an episode,
which is summarized into the memory [73]. A work related to
MemNN is the dynamic memory network (DMN). An added
memory component to the MemNN can boost its performance
in learning long-term dependencies [71]. This approach has
demonstrated its performance on natural language question answering
applications [73]. The generalization and output feature map
parts of the MemNN have some similar functionalities to
the episodic memory in DMN. The MemNN processes sentences
independently [73], while the DMN processes sentences
via a sequence model [73]. The performance results on the
Facebook bAbI dataset show the DMN passes 18 tasks with
accuracy of more than 95% while the MemNN passes 16 tasks
with lower accuracy [73]. Several steps of episodic memory
are discussed in [73].

Fig. 12: Recurrent neural network with context features (longer memory).

H. Structurally Constrained Recurrent Neural Network
Another model which aims to deal with the vanishing gradient
problem is the structurally constrained recurrent neural
network (SCRN). This network is based on the observation
that the hidden states change rapidly during training, as
presented in Figure 12, [6]. In this approach, the SCRN structure
is extended by adding a specific recurrent matrix equal to the
identity to capture longer term dependencies. The fully connected
recurrent matrix (called hidden layer) produces a set of quickly
changing hidden units, while the diagonal matrix (called context
layer) supports slow change of the state of the context units [6]. In
this way, the state of the hidden layer stays static and changes
are fed from external inputs. Although this model can prevent
gradients of the recurrent matrix from vanishing, it is not efficient
in training [6]. In this model, for a dictionary of size $d$, $s_t$ is
the state of the context units defined as

$$s_t = (1 - \alpha) B x_t + \alpha s_{t-1}, \tag{51}$$

where $\alpha$ is the context layer weight, normally set to 0.95,
$B_{d \times s}$ is the context embedding matrix, and $x_t$ is the input.
The hidden layer is defined as

$$h_t = \sigma(P s_t + A x_t + R h_{t-1}), \tag{52}$$

where $A_{d \times m}$ is the token embedding matrix, $P_{p \times m}$ is the
connection matrix between hidden and context layers, $R_{m \times m}$
is the hidden layer $h_{t-1}$ weights matrix, and $\sigma(\cdot)$ is the
“sigmoid” activation function. Finally, the output $y_t$ is defined
as

$$y_t = f(U h_t + V s_t), \tag{53}$$

where $f$ is the “softmax” activation function, and $U$ and $V$
are the output weight matrices of hidden and context layers,
respectively.
Analysis using adaptive context features, where the weights
of the context layer are learned for each unit to capture context

TABLE IV: A comparison between major recurrent neural network (RNN) architectures.

| Method | Advantages | Disadvantages |
|---|---|---|
| Deep RNN | disentangles variations of input sequence; network can adapt to quickly changing input nodes; develops more compact hidden state | increases computational complexity due to a larger number of parameters compared to an RNN; deeper networks are more susceptible to vanishing gradients |
| Bidirectional RNN | predicts in both the positive and negative time directions simultaneously | must know both start and end of sequence; increases computational complexity due to a larger number of parameters compared to an RNN |
| Recurrent Convolutional Neural Network | models long range spatial dependencies; embeds global spatial context into compact local representation; activation evolves over time | increases computational complexity compared to an RNN |
| Multi-Dimensional RNN | models high dimensional sequences; more robust to warping than an RNN | increases computational complexity compared to an RNN; significantly increases memory requirements for training and testing due to multiple recurrent connections |
| Long-short term memory (LSTM) | capable of modeling long-term sequential dependencies; more robust to vanishing gradients than an RNN | increases computational complexity compared to an RNN; higher memory requirement than an RNN due to multiple memory cells |
| Gated Recurrent Unit | capable of modeling long-term sequential dependencies; more robust to vanishing gradients; lower memory requirements than LSTM | higher computational complexity and memory requirement than an RNN due to multiple hidden state vectors |
| Recurrent Memory Networks | capable of storing larger memory than an RNN | higher memory requirements than an RNN |
| Structurally Constrained RNN | stores larger memory than an RNN; more robust to vanishing gradients than a simple RNN | not efficient in training |
| Unitary RNN | models long-term sequential dependencies; robust to vanishing gradients; lower computational and memory requirements than gated RNN architectures | requires more research and comparative study |
| Gated Orthogonal Recurrent Unit | models long-term sequential dependencies; robust to vanishing gradients | requires more research and comparative study |
| Hierarchical Subsampling RNN | more robust to vanishing gradients than an RNN | sensitive to sequential distortions; requires tuning window size |

from different time delays, shows that learning of the
self-recurrent weights does not seem to be important, as long as
one also uses the standard hidden layer in the model [6].
Fixing the weights of the context layer to be constant, in turn,
forces the hidden units to capture information on the same time
scale. The SCRN model is evaluated on the Penn Treebank
dataset. The presented results in [6] show that the SCRN
method has bigger gains compared to the proposed model
in [3]. Also, the learning-longer-memory model is claimed to
have similar performance to the LSTM model, but with less
complexity [6].
While adding the simple constraint to the matrix results
in lower computation compared to its gated counterparts,
the model is not efficient in training.

I. Unitary Recurrent Neural Networks
A simple approach to alleviating the vanishing and exploding
gradients problem is to simply use unitary matrices in
an RNN. The problem of vanishing or exploding gradients
can be attributed to the eigenvalues of the hidden to hidden
weight matrix deviating from one [76]. Thus, to prevent these
eigenvalues from deviating, unitary matrices can be used to
replace the general matrices in the network.
Unitary matrices are orthogonal matrices in the complex
domain [76]. They have absolute eigenvalues of exactly one,
which preserves the norm of vector flows and allows the gradients
to propagate through longer timesteps. This prevents
vanishing or exploding gradient problems from arising [77].
However, it has been argued that the ability to back propagate
gradients without any vanishing could lead to the output
being equally dependent on all inputs regardless of the time
differences [77]. This also causes the network to waste
memory due to storing redundant information.
Unitary RNNs have significant advantages over previous
architectures which have attempted to solve the vanishing
gradient problem. A unitary RNN architecture keeps the internal
workings of a vanilla RNN without adding any additional
memory requirements. Additionally, by maintaining the same
architecture, unitary RNNs do not noticeably increase the
computational cost.

J. Gated Orthogonal Recurrent Unit
Thus far, implementations of RNNs have taken two separate
approaches to tackle the issues of exploding and vanishing
gradients. The first is implementation of additional gates to
improve memory of the system, such as the LSTM and
GRU architectures. The second is implementation of unitary
matrices to maintain absolute eigenvalues of one.
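The role of eigenvalue magnitude can be illustrated with a small numerical sketch. It is a linear toy model and not an RNN: it ignores inputs and nonlinearities and simply applies the same 2x2 matrix repeatedly to a vector, which is an assumption made here for illustration. A rotation matrix stands in for a real-valued unitary (orthogonal) matrix; the helper names and the chosen values (0.3, 0.9, 1.1, 100 steps) are arbitrary.

```python
import math


def matvec(m, v):
    """Multiply a 2x2 matrix (nested lists) by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def norm(v):
    """Euclidean norm of a length-2 vector."""
    return math.hypot(v[0], v[1])


def norm_after(m, v, steps):
    """Norm of v after applying the same linear map `steps` times."""
    for _ in range(steps):
        v = matvec(m, v)
    return norm(v)


theta = 0.3  # arbitrary rotation angle
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]  # orthogonal: |eigenvalues| = 1
shrink = [[0.9, 0.0], [0.0, 0.9]]                # |eigenvalues| = 0.9
grow = [[1.1, 0.0], [0.0, 1.1]]                  # |eigenvalues| = 1.1

v0 = [1.0, 0.0]
for name, m in [("rotation", rotation), ("shrink", shrink), ("grow", grow)]:
    print(f"{name:8s} norm after 100 steps: {norm_after(m, v0, 100):.3e}")
```

Under these assumptions the rotation keeps the norm at one, while the other two matrices drive it toward zero or let it grow geometrically, which mirrors the vanishing and exploding behaviour described above.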

The gated orthogonal recurrent unit replaces the hidden state loop matrix with an orthogonal matrix and introduces an augmentation of the ReLU activation function, which allows it to handle complex-valued inputs [77]. This unit is capable of capturing long-term dependencies of the data using unitary matrices, while also leveraging the forgetting mechanisms present in the GRU structure [77].

K. Hierarchical Subsampling Recurrent Neural Networks

It has been shown that RNNs particularly struggle with learning long sequences. While previous architectures have aimed to change the mechanics of the network to better learn long-term dependencies, a simpler solution exists: shortening the sequences using methods such as subsampling. Hierarchical subsampling recurrent neural networks (HSRNNs) aim to better learn large sequences by performing subsampling at each level using a fixed window size [78]. Training this network follows the same process as training a regular RNN, with a few modifications based on the window sizes at each level.

HSRNNs can be extended to multidimensional networks by simply replacing the subsampling windows with multidimensional windows [78]. In multidirectional HSRNNs, each level consists of two recurrent layers scanning in two separate directions with a feedforward layer in between. However, in reducing the size of the sequences, the HSRNN becomes less robust to sequential distortions. This requires more tuning of the network than other RNN models, since the optimal window size varies depending on the task [78]. HSRNNs have been shown to be a viable option as a model for learning long sequences due to their lower computational costs when compared to their counterparts. RNNs, regardless of their internal architecture, are activated at each time step of the sequence. This can cause extremely high computational costs for the network to learn information in long sequences [78]. Additionally, information can be widely dispersed in a long sequence, making inter-dependencies harder to find. A comparison between major RNN architectures is provided in Table IV.

V. REGULARIZING RECURRENT NEURAL NETWORKS

Regularization refers to controlling the capacity of the neural network by adding or removing information to prevent overfitting. For better training of a RNN, a portion of the available data is held out as a validation dataset. The validation set is used to watch the training procedure and prevent the network from underfitting and overfitting [79]. Overfitting refers to the gap between the training loss and the validation loss (including the test loss), which increases after a number of training epochs as the training loss decreases, as demonstrated in Figure 13. Successful training of RNNs requires good regularization [80]. This section aims to introduce common regularization methods in training RNNs.

Fig. 13: Overfitting in training neural networks (plot of loss versus epoch for the train, validation, and test curves, with the “Best Performance” epoch marked). To avoid overfitting, it is possible to early-stop the training at the “Best Performance” epoch, where the training loss is decreasing but the validation loss starts increasing.

A. L1 and L2

The L1 and L2 regularization methods add a regularization term to the loss function to penalize certain parameter configurations and prevent the coefficients from fitting so perfectly as to overfit. The loss function in Eq. (8) with added regularization term is

$$L(y, z) = L(y, z) + \eta \|\theta\|_p^p, \tag{54}$$

where θ is the set of network parameters (weights), η controls the relative importance of the regularization term, and

$$\|\theta\|_p = \Big( \sum_{j=0}^{|\theta|} |\theta_j|^p \Big)^{1/p}. \tag{55}$$

If p = 1 the regularizer is L1 and if p = 2 the regularizer is L2. The L1 penalty is the sum of the absolute values of the weights and the L2 penalty is the sum of the squares of the weights.

B. Dropout

In general, dropout randomly omits a fraction of the connections between two layers of the network during training. For example, for the hidden layer outputs in Eq. (1) we have

$$h_t = k \odot h_t, \tag{56}$$

where k is a binary vector mask and ⊙ is the element-wise product [81]. The mask can also follow a statistical pattern in applying the withdrawal. During testing, all units are retained and their activations may be weighted.

A dropout specifically tailored to RNNs, called RNNDrop, is introduced in [82]. This method generates a single dropout mask at the beginning of each training sequence and applies it unchanged over the duration of the sequence. This allows the network connections to remain constant through time. Other implementations of dropout for RNNs suggest simply dropping the previous hidden state of the network. A model similar to RNNDrop is introduced in [83], where instead of dropout, it masks data samples at every input sequence per step. This small adjustment has competitive performance to RNNDrop.

C. Activation Stabilization

Another recently proposed method of regularization involves stabilizing the activations of the RNNs [84]. The norm-

stabilizer is an additional cost term to the loss function defined as

$$L(y, z) = L(y, z) + \beta \frac{1}{T} \sum_{t=1}^{T} \left( \|h_t\|_2 - \|h_{t-1}\|_2 \right)^2 \tag{57}$$

where h_t and h_{t−1} are the vectors of the hidden activations at time t and t − 1, respectively, and β controls the relative importance of the regularization. This additional term stabilizes the norms of the hidden vectors when generalizing to long-term sequences.

Other implementations have been made to stabilize the hidden-to-hidden transition matrix, such as the use of orthogonal matrices; however, inputs and nonlinearities can still affect the stability of the activations. Experiments on language modelling and phoneme recognition show state-of-the-art performance of this approach [84].

Fig. 14: Dropout applied to feed-forward connections in a RNN (diagram of inputs x1 ... xN, hidden units h1 ... hM, and outputs y1 ... yP at times t and t+1, with weights W_IH, W_HH, and W_HO). The recurrent connections are shown as full connections with solid lines. The connections between hidden units and output units are shown in dashed lines. The dropped-out connections between the hidden units and output units are shown by dotted lines.

D. Hidden Activation Preservation

The zoneout method is a very special case of dropout. It forces some units to keep their activation from the previous timestep (i.e., h_t = h_{t−1}) [85]. This approach injects stochasticity (by adding noise) into the network, which makes the network more robust to changes in the hidden state and helps the network to avoid overfitting. Zoneout uses a Bernoulli mask k to modify the dynamics of h_t as

$$h_t = k \odot h_t + (1 - k) \odot h_{t-1} \tag{58}$$

which improves the flow of information in the network [85]. Zoneout has slightly better performance than dropout. However, it can also work together with dropout and other regularization methods [85].

VI. RECURRENT NEURAL NETWORKS FOR SIGNAL PROCESSING

RNNs have various applications in different fields and a large number of research articles are published in that regard. In this section, we review different applications of RNNs in signal processing, particularly text, audio and speech, image, and video processing.

A. Text

RNNs are developed for various applications in natural language processing and language modeling. RNNs can outperform n-gram models and are widely used as language modelers [86]. However, RNNs are computationally more expensive and challenging to train. A method based on factorization of the output layer is proposed in [87], which can speed up the training of a RNN for language modeling up to 100 times. In this approach, words are assigned to specific categories based on their unigram frequency and only words belonging to the predicted class are evaluated in the output layer [86]. HF optimization is used in [5] to train RNNs for character-level language modeling. This model uses gated connections to allow the current input character to determine the transition matrix from one hidden state vector to the next [5]. LSTMs have improved RNN models for language modeling due to their ability to learn long-term dependencies in a sequence better than a simple hidden state [88]. LSTMs are also used in [89] to generate complex text and online handwriting sequences with long-range structure, simply by predicting one data point at a time. RNNs are also used to capture poetic style in works of literature and generate lyrics, for example Rap lyric generation [90]–[92]. A variety of document classification tasks is proposed in the literature using RNNs. In [93], a GRU is adapted to perform document-level sentiment analysis. In [94], RCNNs are used for text classification on several datasets. In such approaches, generally the words are mapped to a feature vector and the sequence of feature vectors is passed as input to the RNN model. The same sequence of feature vectors can also be represented as a feature matrix (i.e., an image) to be fed as input to a CNN. CNNs are used in [95] to classify radiology reports. The proposed model is particularly developed for chest pathology and mammogram reports. However, RNNs have not yet been examined for medical report interpretation and can potentially result in very high classification performance.

B. Speech and Audio

Speech and audio signals continuously vary over time. The inherent sequential and time-varying nature of audio signals makes RNNs the ideal model to learn features in this field. Until recently, RNNs had limited contribution in labelling unsegmented speech data, primarily because this task requires pre-segmented data and post-processing to produce outputs [96]. Early models in speech recognition, such as time-delay neural networks, often try to make use of the sequential nature of the data by feeding an ANN a set of frames [97]. Given that both past and future sequential information can be of use in speech recognition predictions, the concept of BRNNs was introduced for speech recognition [98]. Later, RNNs were combined with hidden Markov models (HMM) in which the HMM acted as an acoustic model while the RNN acted as the language model [99]. With the introduction of the connectionist temporal classification (CTC) function, RNNs are capable of leveraging sequence learning on unsegmented speech data [96]. Since then, the popularity of RNNs in speech recognition has exploded. Developments in speech recognition then used the CTC function alongside newer recurrent network architectures, which were more robust to vanishing gradients, to improve performance and perform recognition on larger vocabularies [100]–[102]. Iterations of the CTC model, such as the sequence transducer and neural transducer [89], [103]

have incorporated a second RNN to act as a language model to tackle tasks such as online speech recognition. These augmentations allow the model to make predictions based not only on the linguistic features, but also on the previous transcriptions made.

Speech emotion recognition is very similar to speech recognition, in that a segment of speech must be classified as an emotion. Thus the development of speech emotion recognition followed the same path as that of speech recognition. HMMs were initially used for their wide presence in speech applications [104]. Later, Gaussian mixture models (GMMs) were adapted to the task for their lower training requirements and efficient multi-modal distribution modeling [104]. However, these models often require hand-crafted and feature-engineered input data. Some examples are mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) coefficients, and supra-segmental features [105]. With the introduction of RNNs, the trend of input data began to shift from such feature engineering to simply feeding the raw signal as the input, since the networks themselves were able to learn these features on their own. Several RNN models have been introduced since then to perform speech emotion recognition. In [106], an LSTM network is shown to have better performance than support vector machines (SVMs) and conditional random fields (CRFs). This improved performance is attributed to the network’s ability to capture emotions by better modeling long-range dependencies. In [107], a deep BLSTM is introduced for speech emotion recognition. Deep BLSTMs are able to capture more information by taking in a larger number of frames, while a feed-forward DNN simply uses the frame with the highest energy in a sequence [107]. However, comparisons to previous RNNs used for speech emotion recognition were not made. Given that this model used a different architecture than the LSTM model described above, no comparison could be found as to which architecture performs better. Recently, a deep convolutional LSTM is adapted in [105]. This model gives state-of-the-art performance when tested on the RECOLA dataset, as the convolutional layers learn to remove background noise and outline important features in the speech, while the LSTM models the temporal structure of the speech sequence.

Much like speech recognition, speech synthesis also requires long-term sequence learning. HMM-based models often produce synthesized speech that does not sound natural. This is due to the overly smooth trajectories produced by the model, as a result of statistical averaging during the training phase [108]. Recent advancements in ANNs have shown that deep MLP neural networks can synthesize speech. However, these models take each frame as an independent entity from its neighbours and fail to take into account the sequential nature of speech [108]. RNNs were first used for speech synthesis to leverage these sequential dependencies [109], [110], and were then replaced with LSTM models to better learn long-term sequential dependencies [111]. The BLSTM has been shown to perform very well in speech synthesis due to the ability to integrate the relationship with neighbouring frames in both future and past time steps [112], [113]. CNNs have been shown to perform better than state-of-the-art LSTM models, in particular the WaveNet model [114]. WaveNet is a newly introduced CNN capable of generating speech, using dilated convolutions. Through the use of dilated causal convolutions, WaveNet can model long-range temporal dependencies by increasing its receptive field of input. WaveNet has shown better performance than LSTMs and HMMs [114].

The modelling of polyphonic music presents another task with inherent contextual dependencies. In [115], a RNN combined with a restricted Boltzmann machine (RBM) is introduced, which is capable of modeling temporal information in a music track. This model has a sequence of conditional RBMs, which are fed as parameters to a RNN, so that it can learn harmonic and rhythmic probability rules from polyphonic music of varying complexity [115]. It has been shown that RNN models struggle to keep track of distant events that indicate the temporal structure of music [116]. LSTM models have since been adapted in music generation to better learn the long-term temporal structure of certain genres of music [116], [117].

C. Image

Learning the spatial dependencies is generally the main focus in machine vision. While CNNs have dominated most applications in computer vision and image processing, RNNs have also shown promising results in tasks such as image labeling, image modeling, and handwriting recognition.

Scene labeling refers to the task of associating every pixel in an image to a class. This inherently involves the classification of a pixel to be associated with the class of its neighbour pixels. However, models such as CNNs have not been completely successful in using these underlying dependencies in their model. These dependencies have been shown to be leveraged in numerous implementations of RNNs. A set of images is represented as undirected cyclic graphs (UCGs) in [118]. To feed these images into a RNN, the UCGs are decomposed into several directed acyclic graphs (DAGs) meant to approximate the original images. Each DAG image involves a convolutional layer to produce discriminative feature mapping, a DAG-RNN to model the contextual dependencies between pixels, and a deconvolutional layer to up-sample the feature map to its original image size. This implementation has better performance than other state-of-the-art models on popular datasets such as SiftFlow, CamVid, and Barcelona [118]. A similar implementation is shown in [49], where instead of decomposing the image into several DAGs, the image is first fed into a CNN to extract features for a local patch, which is then fed to a 2D-RNN. This 2D-RNN is similar to a simple RNN, except for its ability to store hidden states in two dimensions. The two hidden neurons flow in different directions towards the same neuron to create the hidden memory. To encode the entire image, multiple starting points are chosen to create the multiple hidden states of the same pixel. This architecture is developed further by introducing 2D-LSTM units to better retain long-term information [119].

Image modeling is the task of assigning a probability distribution to an image. RNNs are naturally the best choice for image modeling tasks given their inherent ability to be used as a

generative model. The deep recurrent attentive writer (DRAW) combines a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows iterative construction of complex images [120]. Most image generative models aim to generate scenes all at once, which causes all pixels to be modelled on a single latent distribution. The DRAW model generates images by first generating sections of the scene independently of each other before going through iterations of refinement. The recent introduction of PixelRNN, involving LSTMs and BLSTMs, has shown improvements in modelling natural images with scalability [121]. The PixelRNN uses up to 12 2-dimensional LSTM layers, each of which has an input-to-state component and a recurrent state-to-state component. These components then determine the gates inside each LSTM. To compute these states, masked convolutions are used to gather states along one of the dimensions of the image. This model has better log-likelihood scores than other state-of-the-art models evaluated on the MNIST and CIFAR-10 datasets. While PixelRNN has been shown to perform better than DRAW on the MNIST dataset, there has been no comparison between the two models that explains why this might be.

The task of handwriting recognition combines both image processing and sequence learning. This task can be divided into two types, online and offline recognition. RNNs perform well on this task, given the contextual dependencies in letter sequences [122]. For the task of online handwriting recognition, the position of the pen-tip is recorded at intervals and these positions are mapped to the sequence of words [122]. In [122], a BLSTM model is introduced for online handwriting recognition. Performance of this model is better than conventionally used HMM models due to its ability to make use of information in both past and future time steps. BLSTMs perform well when combined with a probabilistic language model and trained with CTC. For offline handwriting recognition, only the image of the handwriting is available. To tackle this problem, an MDLSTM is used to convert the 2-dimensional inputs into a 1-dimensional sequence [52]. The data is then passed through a hierarchy of MDLSTMs, which incrementally decrease the size of the data. While such tasks are often implemented using CNNs, it is argued that due to the absence of recurrent connections in such networks, CNNs cannot be used for cursive handwriting recognition without the input first being pre-segmented [52]. The MDLSTM model proposed in [52] offers a simple solution which does not need segmented input and can learn the long-term temporal dependencies.

Recurrent generative networks are developed in [123] to automatically recover images from compressed linear measurements. In this model, a novel proximal learning framework is developed, which adopts ResNets to model the proximals, and a mixture of pixel-wise and perceptual costs is used for training. Deep convolutional generative adversarial networks are developed in [124] to generate artificial chest radiographs for automated abnormality detection in chest radiographs. This model can be extended to medical image modalities which have spatial and temporal dependencies, such as head magnetic resonance imaging (MRI), using RCNNs. Since RNNs can model non-linear dynamical systems, recent RNN architectures can potentially enhance performance of these models.

D. Video

A video is a sequence of images (i.e., frames) with temporal and spatial dependencies between frames and pixels in each frame, respectively. A video file has far more pixels in comparison to a single image, which results in a greater number of parameters and computational cost to process it. While different tasks have been performed on videos using RNNs, they are most prevalent in video description generation. This application involves components of both image processing and natural language processing. The method proposed in [125] combines a CNN for visual feature extraction with an LSTM model capable of decoding the features into a natural language string, known as long-term recurrent convolutional networks (LRCNs). However, this model was not an end-to-end solution and required supervised intermediate representations of the features generated by the CNN. This model is built upon in [126], which introduces a solution capable of being trained end-to-end. This model utilizes an LSTM model, which directly connects to a deep CNN. This model was further improved in [127], in which a 3-dimensional convolutional architecture was introduced for feature extraction. These features were then fed to an LSTM model based on a soft-attention mechanism to dynamically control the flow of information from multiple video frames. RNNs have seen fewer advances in video processing compared with the other types of signals, which introduces new opportunities for temporal-spatial machine learning.

VII. CONCLUSION AND POTENTIAL DIRECTIONS

In this paper, we systematically review major and recent advancements of RNNs in the literature and introduce the challenging problems in training RNNs. A RNN refers to a network of artificial neurons with recurrent connections among them. The recurrent connections learn the dependencies among input sequential or time-series data. The ability to learn sequential dependencies has allowed RNNs to gain popularity in applications such as speech recognition, speech synthesis, machine vision, and video description generation.

One of the main challenges in training RNNs is learning long-term dependencies in data. It occurs generally due to the large number of parameters that need to be optimized during training in RNNs over long periods of time. This paper discusses several architectures and training methods that have been developed to tackle the problems associated with training of RNNs. The following are some major opportunities and challenges in developing RNNs:
• The introduction of the BPTT algorithm has facilitated efficient training of RNNs. However, this approach introduces gradient vanishing and exploding problems. Recent advances in RNNs have since aimed at tackling this issue. However, these challenges are still the main bottleneck of training RNNs.
• Gating mechanisms have been a breakthrough in allowing RNNs to learn long-term sequential dependencies.

  Architectures such as the LSTM and GRU have shown significantly high performance in a variety of applications. However, these architectures introduce higher complexity and computation than simple RNNs. Reducing the internal complexity of these architectures can help reduce training time for the network.
• The unitary RNN has potentially solved the above issue by introducing a simple architecture capable of learning long-term dependencies. By replacing the internal weights with unitary matrices, the architecture keeps the same complexity as a simple RNN while providing stronger modeling ability. Further research into the use of unitary RNNs can help in validating their performance against their gated RNN counterparts.
• Several regularization methods such as dropout, activation stabilization, and activation preservation have been adapted for RNNs to avoid overfitting. While these methods have been shown to improve performance, there is no standard for regularizing RNNs. Further research into RNN regularization can help introduce potentially better regularization methods.
• RNNs have a great potential to learn features from 3-dimensional medical images, such as head MRI scans, lung computed tomography (CT), and abdominal MRI. In such modalities, the temporal dependency between images is very important, particularly for cancer detection and segmentation.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[3] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in optimizing recurrent networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8624–8628.
[4] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
[5] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1017–1024.
[6] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, “Learning longer memory in recurrent neural networks,” arXiv preprint arXiv:1412.7753, 2014.
[7] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.
[8] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, 2010, p. 3.
[9] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147.
[10] Y. Bengio, Y. LeCun et al., “Scaling learning algorithms towards ai,” Large-scale kernel machines, vol. 34, no. 5, pp. 1–41, 2007.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[12] I. Sutskever, “Training recurrent neural networks,” University of Toronto, Toronto, Ont., Canada, 2013.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[14] R. J. Williams and D. Zipser, “Gradient-based learning algorithms for recurrent networks and their computational complexity,” Backpropagation: Theory, architectures, and applications, vol. 1, pp. 433–486, 1995.
[15] G. V. Puskorius and L. A. Feldkamp, “Neurocontrol of nonlinear dynamical systems with kalman filter trained recurrent networks,” IEEE Transactions on neural networks, vol. 5, no. 2, pp. 279–297, 1994.
[16] S. Ma and C. Ji, “A unified approach on fast training of feedforward and recurrent networks using em algorithm,” IEEE transactions on signal processing, vol. 46, no. 8, pp. 2270–2274, 1998.
[17] L.-W. Chan and C.-C. Szeto, “Training recurrent network with block-diagonal approximated levenberg-marquardt algorithm,” in Neural Networks, 1999. IJCNN’99. International Joint Conference on, vol. 3. IEEE, 1999, pp. 1521–1526.
[18] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[19] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013, pp. 1310–1318.
[20] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[21] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to initialize recurrent networks of rectified linear units,” arXiv preprint arXiv:1504.00941, 2015.
[22] J. A. Pérez-Ortiz, F. A. Gers, D. Eck, and J. Schmidhuber, “Kalman filters improve lstm network performance in problems unsolvable by traditional recurrent nets,” Neural Networks, vol. 16, no. 2, pp. 241–250, 2003.
[23] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky, “Subword language modeling with neural networks,” preprint (http://www. fit. vutbr. cz/imikolov/rnnlm/char. pdf), 2012.
[24] T. Mikolov and G. Zweig, “Context dependent recurrent neural network language model.” SLT, vol. 12, pp. 234–239, 2012.
[25] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 9–48.
[26] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[27] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan, “Better mini-batch algorithms via accelerated gradient methods,” in Advances in neural information processing systems, 2011, pp. 1647–1655.
[28] J. Martens and I. Sutskever, “Training deep and recurrent networks with hessian-free optimization,” Neural networks: Tricks of the trade, pp. 479–535, 2012.
[29] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
[30] L. Bottou, “Stochastic learning,” in Advanced lectures on machine learning. Springer, 2004, pp. 146–168.
[31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[32] S. S. Haykin et al., Kalman filtering and neural networks. Wiley Online Library, 2001.
[33] R. J. Williams, “Training recurrent networks using the extended kalman filter,” in Neural Networks, 1992. IJCNN., International Joint Conference on, vol. 4. IEEE, 1992, pp. 241–246.
[34] J. Martens, “Deep learning via hessian-free optimization,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 735–742.
[35] D. T. Mirikitani and N. Nikolaev, “Recursive bayesian recurrent neural networks for time-series modeling,” IEEE Transactions on Neural Networks, vol. 21, no. 2, pp. 262–274, 2010.
[36] J. Martens and I. Sutskever, “Learning recurrent neural networks with hessian-free optimization,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1033–1040.
[37] H. Salehinejad, S. Rahnamayan, and H. R. Tizhoosh, “Micro-differential evolution: Diversity enhancement and a comparative study,” Applied Soft Computing, vol. 52, pp. 812–833, 2017.
[38] P. J. Angeline, G. M. Saunders, and J. B. Pollack, “An evolutionary algorithm that constructs recurrent neural networks,” IEEE transactions on Neural Networks, vol. 5, no. 1, pp. 54–65, 1994.
[39] K. Unnikrishnan and K. P. Venugopal, “Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks,” Neural Computation, vol. 6, no. 3, pp. 469–490, 1994.

[40] C. Smith and Y. Jin, “Evolutionary multi-objective generation of [65] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural net-
recurrent neural network ensembles for time series prediction,” Neuro- works for action recognition,” in Proceedings of the IEEE International
computing, vol. 143, pp. 302–311, 2014. Conference on Computer Vision, 2015, pp. 4041–4049.
[41] T. Tanaka, T. Moriya, T. Shinozaki, S. Watanabe, T. Hori, and K. Duh, [66] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan, “Seman-
“Evolutionary optimization of long short-term memory neural network tic object parsing with local-global long short-term memory,” arXiv
language model,” The Journal of the Acoustical Society of America, preprint arXiv:1511.04510, 2015.
vol. 140, no. 4, pp. 3062–3062, 2016. [67] J. Li, A. Mohamed, G. Zweig, and Y. Gong, “Lstm time and frequency
[42] H. Salehinejad, S. Rahnamayan, H. R. Tizhoosh, and S. Y. Chen, recurrence for automatic speech recognition,” 2015.
“Micro-differential evolution with vectorized random mutation factor,” [68] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c.
in Evolutionary Computation (CEC), 2014 IEEE Congress on. IEEE, Woo, “Convolutional lstm network: A machine learning approach for
2014, pp. 2055–2062. precipitation nowcasting,” arXiv preprint arXiv:1506.04214, 2015.
[43] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with [69] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
deep recurrent neural networks,” in Acoustics, speech and signal gated recurrent neural networks on sequence modeling,” arXiv preprint
processing (icassp), 2013 ieee international conference on. IEEE, arXiv:1412.3555, 2014.
2013, pp. 6645–6649. [70] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the
[44] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct properties of neural machine translation: Encoder-decoder approaches,”
deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013. arXiv preprint arXiv:1409.1259, 2014.
[45] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, [71] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv
“Distributed representations of words and phrases and their composi- preprint arXiv:1410.3916, 2014.
tionality,” in Advances in neural information processing systems, 2013, [72] J. Weston, A. Bordes, S. Chopra, and T. Mikolov, “Towards ai-complete
pp. 3111–3119. question answering: a set of prerequisite toy tasks,” arXiv preprint
arXiv:1502.05698, 2015.
[46] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-
[73] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce,
works,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp.
P. Ondruska, I. Gulrajani, and R. Socher, “Ask me anything: Dynamic
2673–2681, 1997.
memory networks for natural language processing,” arXiv preprint
[47] M. Liang and X. Hu, “Recurrent convolutional neural network for ob- arXiv:1506.07285, 2015.
ject recognition,” in Proceedings of the IEEE Conference on Computer [74] S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory
Vision and Pattern Recognition, 2015, pp. 3367–3375. networks,” in Advances in Neural Information Processing Systems,
[48] P. Pinheiro and R. Collobert, “Recurrent convolutional neural networks 2015, pp. 2431–2439.
for scene labeling,” in International Conference on Machine Learning, [75] K. Tran, A. Bisazza, and C. Monz, “Recurrent memory network for
2014, pp. 82–90. language modeling,” arXiv preprint arXiv:1601.01272, 2016.
[49] B. Shuai, Z. Zuo, and G. Wang, “Quaddirectional 2d-recurrent neural [76] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent
networks for image labeling,” IEEE Signal Processing Letters, vol. 22, neural networks,” in International Conference on Machine Learning,
no. 11, pp. 1990–1994, 2015. 2016, pp. 1120–1128.
[50] P. Baldi and G. Pollastri, “The principled design of large-scale recur- [77] L. Jing, C. Gulcehre, J. Peurifoy, Y. Shen, M. Tegmark, M. Soljačić, and
sive neural network architectures–dag-rnns and the protein structure Y. Bengio, “Gated orthogonal recurrent units: On learning to forget,”
prediction problem,” Journal of Machine Learning Research, vol. 4, arXiv preprint arXiv:1706.02761, 2017.
no. Sep, pp. 575–602, 2003. [78] A. Graves, Supervised sequence labelling with recurrent neural net-
[51] A. Graves, S. Fernández, and J. Schmidhuber, “Multi- works. Springer Science & Business Media, 2012, vol. 385.
dimensional recurrent neural networks,” 2007. [Online]. Available: [79] C. M. Bishop, Pattern recognition and machine learning. springer,
http://arxiv.org/abs/0705.2011 2006.
[52] A. Graves and J. Schmidhuber, “Offline handwriting recognition with [80] N. Srivastava, “Improving neural networks with dropout,” University
multidimensional recurrent neural networks,” in Advances in neural of Toronto, vol. 182, 2013.
information processing systems, 2009, pp. 545–552. [81] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout
[53] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and improves recurrent neural networks for handwriting recognition,” in
Y. Bengio, “Renet: A recurrent neural network based alternative to Frontiers in Handwriting Recognition (ICFHR), 2014 14th Interna-
convolutional networks,” arXiv preprint arXiv:1505.00393, 2015. tional Conference on. IEEE, 2014, pp. 285–290.
[54] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural [82] T. Moon, H. Choi, H. Lee, and I. Song, “Rnndrop: A novel dropout
computation, vol. 9, no. 8, pp. 1735–1780, 1997. for rnns in asr,” in Automatic Speech Recognition and Understanding
[55] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 65–70.
Continual prediction with lstm,” 1999. [83] S. Semeniuta, A. Severyn, and E. Barth, “Recurrent dropout without
[56] X. Zhu, P. Sobihani, and H. Guo, “Long short-term memory over re- memory loss,” arXiv preprint arXiv:1603.05118, 2016.
cursive structures,” in International Conference on Machine Learning, [84] D. Krueger and R. Memisevic, “Regularizing rnns by stabilizing
2015, pp. 1604–1612. activations,” arXiv preprint arXiv:1511.08400, 2015.
[57] N. Kalchbrenner, I. Danihelka, and A. Graves, “Grid long short-term [85] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke,
memory,” arXiv preprint arXiv:1507.01526, 2015. A. Goyal, Y. Bengio, H. Larochelle, A. Courville et al., “Zoneout:
Regularizing rnns by randomly preserving hidden activations,” arXiv
[58] K. Yao, T. Cohn, K. Vylomova, K. Duh, and C. Dyer, “Depth-gated
preprint arXiv:1606.01305, 2016.
lstm,” arXiv preprint arXiv:1508.03790, 2015.
[86] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Černockỳ,
[59] M. Ballesteros, C. Dyer, and N. A. Smith, “Improved transition-based “Empirical evaluation and combination of advanced language modeling
parsing by modeling characters instead of words with lstms,” arXiv techniques,” in Twelfth Annual Conference of the International Speech
preprint arXiv:1508.00657, 2015. Communication Association, 2011.
[60] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying [87] T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur,
convolutional neural networks concepts to hybrid nn-hmm model “Extensions of recurrent neural network language model,” in Acoustics,
for speech recognition,” in Acoustics, Speech and Signal Processing Speech and Signal Processing (ICASSP), 2011 IEEE International
(ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. Conference on. IEEE, 2011, pp. 5528–5531.
4277–4280. [88] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural net-
[61] A. Graves and J. Schmidhuber, “Framewise phoneme classification works for language modeling,” in Thirteenth Annual Conference of the
with bidirectional lstm and other neural network architectures,” Neural International Speech Communication Association, 2012.
Networks, vol. 18, no. 5, pp. 602–610, 2005. [89] A. Graves, “Generating sequences with recurrent neural networks,”
[62] S. Wang and J. Jiang, “Learning natural language inference with lstm,” arXiv preprint arXiv:1308.0850, 2013.
arXiv preprint arXiv:1512.08849, 2015. [90] X. Zhang and M. Lapata, “Chinese poetry generation with recurrent
[63] A. Graves, S. Fernandez, and J. Schmidhuber, “Multi-dimensional neural networks.” in EMNLP, 2014, pp. 670– 680.
recurrent neural networks,” arXiv preprint arXiv:0705.2011v1, 2007. [91] P. Potash, A. Romanov, and A. Rumshisky, “Ghostwriter: using an
[64] L. Theis and M. Bethge, “Generative image modeling using spatial lstm for automatic rap lyric generation,” in Proceedings of the 2015
lstms,” in Advances in Neural Information Processing Systems, 2015, Conference on Empirical Methods in Natural Language Processing,
pp. 1918–1926. 2015, pp. 1919– 1924.

[92] M. Ghazvininejad, X. Shi, Y. Choi, and K. Knight, “Generating topical [115] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling
poetry.” in EMNLP, 2016, pp. 1183– 1191. temporal dependencies in high-dimensional sequences: Application
[93] D. Tang, B. Qin, and T. Liu, “Document modeling with gated recurrent to polyphonic music generation and transcription,” arXiv preprint
neural network for sentiment classification.” in EMNLP, 2015, pp. arXiv:1206.6392, 2012.
1422– 1432. [116] D. Eck and J. Schmidhuber, “A first look at music composition using
[94] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural lstm recurrent neural networks,” Istituto Dalle Molle Di Studi Sull
networks for text classification.” in AAAI, vol. 333, 2015, pp. 2267– Intelligenza Artificiale, vol. 103, 2002.
2273. [117] ——, “Finding temporal structure in music: Blues improvisation with
[95] H. Salehinejad, J. Barfett, S. Valaee, E. Colak, A. Mnatzakanian, lstm recurrent networks,” in Neural Networks for Signal Processing,
and T. Dowdell, “Interpretation of mammogram and chest radiograph 2002. Proceedings of the 2002 12th IEEE Workshop on. IEEE, 2002,
reports using deep neural networks-preliminary results,” arXiv preprint pp. 747–756.
arXiv:1708.09254, 2017. [118] B. Shuai, Z. Zuo, B. Wang, and G. Wang, “Dag-recurrent neural
[96] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connection- networks for scene labeling,” in Proceedings of the IEEE Conference
ist temporal classification: labelling unsegmented sequence data with on Computer Vision and Pattern Recognition, 2016, pp. 3620–3629.
recurrent neural networks,” in Proceedings of the 23rd international [119] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with
conference on Machine learning. ACM, 2006, pp. 369–376. lstm recurrent neural networks,” in Proceedings of the IEEE Conference
[97] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
[120] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra,
“Phoneme recognition using time-delay neural networks,” IEEE trans-
“Draw: A recurrent neural network for image generation,” arXiv
actions on acoustics, speech, and signal processing, vol. 37, no. 3, pp.
preprint arXiv:1502.04623, 2015.
328–339, 1989.
[121] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent
[98] M. Schuster, “Bi-directional recurrent neural networks for speech
neural networks,” arXiv preprint arXiv:1601.06759, 2016.
recognition,” Technical report, Tech. Rep., 1996.
[122] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández,
[99] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a “Unconstrained on-line handwriting recognition with recurrent neural
hybrid approach. Springer Science & Business Media, 2012, vol. 247. networks,” in Advances in Neural Information Processing Systems,
[100] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with 2008, pp. 577–584.
recurrent neural networks,” in Proceedings of the 31st International [123] M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. Donoho,
Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772. and J. Pauly, “Recurrent generative adversarial networks for proximal
[101] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based learning and automated compressive image recovery,” arXiv preprint
recurrent neural network architectures for large vocabulary speech arXiv:1711.10046, 2017.
recognition,” arXiv preprint arXiv:1402.1128, 2014. [124] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, and J. Barfett,
[102] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “Generalization of deep neural networks for chest pathology classifi-
“End-to-end attention-based large vocabulary speech recognition,” in cation in x-rays using generative adversarial networks,” arXiv preprint
Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE Inter- arXiv:1712.01636, 2017.
national Conference on. IEEE, 2016, pp. 4945–4949. [125] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu-
[103] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
“A neural transducer,” arXiv preprint arXiv:1511.04868, 2015. networks for visual recognition and description,” in Proceedings of the
[104] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion IEEE conference on computer vision and pattern recognition, 2015,
recognition: Features, classification schemes, and databases,” Pattern pp. 2625–2634.
Recognition, vol. 44, no. 3, pp. 572–587, 2011. [126] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and
[105] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, K. Saenko, “Translating videos to natural language using deep recurrent
B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech neural networks,” arXiv preprint arXiv:1412.4729, 2014.
emotion recognition using a deep convolutional recurrent network,” [127] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and
in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE A. Courville, “Video description generation incorporating spatio-
International Conference on. IEEE, 2016, pp. 5200–5204. temporal features and a soft-attention mechanism,” arXiv preprint
[106] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas- arXiv:1502.08029, 2015.
Cowie, and R. Cowie, “Abandoning emotion classes-towards continu-
ous emotion recognition with modelling of long-range dependencies,”
in Ninth Annual Conference of the International Speech Communica-
tion Association, 2008.
[107] J. Lee and I. Tashev, “High-level feature representation using recurrent
neural network for speech emotion recognition.” in INTERSPEECH,
2015, pp. 1537–1540.
[108] K. Fan, Z. Wang, J. Beck, J. Kwok, and K. A. Heller, “Fast second
order stochastic backpropagation for variational inference,” in Advances
in Neural Information Processing Systems, 2015, pp. 1387–1395.
[109] O. Karaali, G. Corrigan, I. Gerson, and N. Massey, “Text-to-speech
conversion with neural networks: A recurrent tdnn approach,” arXiv
preprint cs/9811032, 1998.
[110] C. Tuerk and T. Robinson, “Speech synthesis using artificial neural
networks trained on cepstral coefficients.” in EUROSPEECH, 1993.
[111] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent
neural network with recurrent output layer for low-latency speech
synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015
IEEE International Conference on. IEEE, 2015, pp. 4470–4474.
[112] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “Tts synthesis with
bidirectional lstm based recurrent neural networks,” in Fifteenth Annual
Conference of the International Speech Communication Association,
2014.
[113] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, “Prosody
contour prediction with long short-term memory, bi-directional, deep
recurrent neural networks.” in Interspeech, 2014, pp. 2268–2272.
[114] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
A generative model for raw audio,” arXiv preprint arXiv:1609.03499,
2016.
Deep Learning: An Introduction for Applied
Mathematicians
arXiv:1801.05894v1 [math.HO] 17 Jan 2018

Catherine F. Higham∗ Desmond J. Higham†


January 19, 2018

Abstract
Multilayered artificial neural networks are becoming a pervasive tool
in a host of application fields. At the heart of this deep learning revolution
are familiar concepts from applied and computational mathematics; no-
tably, in calculus, approximation theory, optimization and linear algebra.
This article provides a very brief introduction to the basic ideas that un-
derlie deep learning from an applied mathematics perspective. Our target
audience includes postgraduate and final year undergraduate students in
mathematics who are keen to learn about the area. The article may also
be useful for instructors in mathematics who wish to enliven their classes
with references to the application of deep learning techniques. We focus
on three fundamental questions: what is a deep neural network? how is a
network trained? what is the stochastic gradient method? We illustrate
the ideas with a short MATLAB code that sets up and trains a network.
We also show the use of state-of-the-art software on a large scale image
classification problem. We finish with references to the current literature.

1 Motivation
Most of us have come across the phrase deep learning. It relates to a set of
tools that have become extremely popular in a vast range of application fields,
from image recognition, speech recognition and natural language processing to
targeted advertising and drug discovery. The field has grown to the extent
where sophisticated software packages are available in the public domain, many
produced by high-profile technology companies. Chip manufacturers are also
customizing their graphics processing units (GPUs) for kernels at the heart of
deep learning.
∗ School of Computing Science, University of Glasgow, UK
(Catherine.Higham@glasgow.ac.uk). This author was supported by the EPSRC UK
Quantum Technology Programme under grant EP/M01326X/1.
† Department of Mathematics and Statistics, University of Strathclyde, UK
(d.j.higham@strath.ac.uk). This author was supported by the EPSRC/RCUK Digi-
tal Economy Programme under grant EP/M00158X/1.

Whether or not its current level of attention is fully justified, deep learning is
clearly a topic of interest to employers, and therefore to our students. Although
there are many useful resources available, we feel that there is a niche for a
brief treatment aimed at mathematical scientists. For a mathematics student,
gaining some familiarity with deep learning can enhance employment prospects.
For mathematics educators, slipping “Applications to Deep Learning” into the
syllabus of a class on calculus, approximation theory, optimization, linear al-
gebra, or scientific computing is a great way to attract students and maintain
their interest in core topics. The area is also ripe for independent project study.
There is no novel material in this article, and many topics are glossed over
or omitted. Our aim is to present some key ideas as clearly as possible while
avoiding non-essential detail. The treatment is aimed at readers with a back-
ground in mathematics who have completed a course in linear algebra and are
familiar with partial differentiation. Some experience of scientific computing is
also desirable.
To keep the material concrete, we list and walk through a short MATLAB
code that illustrates the main algorithmic steps in setting up, training and
applying an artificial neural network. We also demonstrate the high-level use of
state-of-the-art software on a larger scale problem.
Section 2 introduces some key ideas by creating and training an artificial
neural network using a simple example. Section 3 sets up some useful notation
and defines a general network. Training a network, which involves the solution
of an optimization problem, is the main computational challenge in this field.
In Section 4 we describe the stochastic gradient method, a variation of a tra-
ditional optimization technique that is designed to cope with very large scale
sets of training data. Section 5 explains how the partial derivatives needed for
the stochastic gradient method can be computed efficiently using back propa-
gation. First-principles MATLAB code that illustrates these ideas is provided
in section 6. A larger scale problem is treated in section 7. Here we make
use of existing software. Rather than repeatedly acknowledge work throughout
the text, we have chosen to place the bulk of our citations in Section 8, where
pointers to the large and growing literature are given. In that section we also
raise issues that were not mentioned elsewhere, and highlight some current hot
topics.

2 Example of an Artificial Neural Network


This article takes a data fitting view of artificial neural networks. To be concrete,
consider the set of points shown in Figure 1. This shows labeled data—some
points are in category A, indicated by circles, and the rest are in category B,
indicated by crosses. For example, the data may show oil drilling sites on a
map, where category A denotes a successful outcome. Can we use this data
to categorize a newly proposed drilling site? Our job is to construct a
transformation that takes any point in $\mathbb{R}^2$ and returns either a circle or a cross.
Of course, there are many reasonable ways to construct such a transformation.

Figure 1: Labeled data points in $\mathbb{R}^2$. Circles denote points in category A.
Crosses denote points in category B.

The artificial neural network approach uses repeated application of a simple,
nonlinear function.
We will base our network on the sigmoid function
\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \tag{1} \]
which is illustrated in the upper half of Figure 2 over the interval −10 ≤ x ≤ 10.
We may regard σ(x) as a smoothed version of a step function, which itself mimics
the behavior of a neuron in the brain—firing (giving output equal to one) if the
input is large enough, and remaining inactive (giving output equal to zero)
otherwise. The sigmoid also has the convenient property that its derivative
takes the simple form
\[ \sigma'(x) = \sigma(x) \left( 1 - \sigma(x) \right), \tag{2} \]
which is straightforward to verify.
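One quick way to check identity (2) numerically is to compare it with a central finite difference. The sketch below is in Python rather than the MATLAB used for the article's own listing; the helper names, the sample points and the step size `h` are illustrative assumptions, not taken from the article.

```python
import math

def sigmoid(x):
    """Sigmoid function (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative from identity (2): sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Finite-difference approximation to f'(x); h is an assumed step size."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest discrepancy between (2) and the finite difference on a few sample points.
points = [-10.0, -2.0, 0.0, 0.5, 3.0, 10.0]
max_error = max(abs(sigmoid_prime(x) - central_difference(sigmoid, x)) for x in points)
print(max_error)
```

The discrepancy should be at the level of the finite-difference truncation and rounding error, which is consistent with (2) holding exactly.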
The steepness and location of the transition in the sigmoid function may
be altered by scaling and shifting the argument or, in the language of neural
networks, by weighting and biasing the input. The lower plot in Figure 2 shows
σ (3(x − 5)). The factor 3 has sharpened the changeover and the shift −5 has
altered its location. To keep our notation manageable, we need to interpret the
sigmoid function in a vectorized sense. For $z \in \mathbb{R}^m$, $\sigma : \mathbb{R}^m \to \mathbb{R}^m$ is defined by
applying the sigmoid function in the obvious componentwise manner, so that
\[ (\sigma(z))_i = \sigma(z_i). \]

With this notation, we can set up layers of neurons. In each layer, every neuron
outputs a single real number, which is passed to every neuron in the next layer.
At the next layer, each neuron forms its own weighted combination of these
values, adds its own bias, and applies the sigmoid function. Introducing some
mathematics, if the real numbers produced by the neurons in one layer are

Figure 2: Upper: sigmoid function (1). Lower: sigmoid with shifted and scaled
input.

collected into a vector, a, then the vector of outputs from the next layer has the
form
σ(W a + b). (3)
Here, W is a matrix and b is a vector. We say that W contains the weights and
b contains the biases. The number of columns in W matches the number of
neurons that produced the vector a at the previous layer. The number of rows
in W matches the number of neurons at the current layer. The number of
components in b also matches the number of neurons at the current layer. To
emphasize the role of the ith neuron in (3), we could pick out the ith component
as
\[ \sigma\Big( \sum_j w_{ij} a_j + b_i \Big), \]

where the sum runs over all entries in a. Throughout this article, we will be
switching between the vectorized and componentwise viewpoints to strike a
balance between clarity and brevity.
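To make the two viewpoints concrete, the following Python sketch computes the output of one layer both componentwise and in the vectorized form (3). The function names and the numerical values of `W`, `a` and `b` are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_componentwise(W, a, b):
    """Output of each neuron i: sigma(sum_j w_ij * a_j + b_i)."""
    out = []
    for i in range(len(W)):        # one row of W per neuron in the current layer
        z_i = b[i]
        for j in range(len(a)):    # one column of W per neuron in the previous layer
            z_i += W[i][j] * a[j]
        out.append(sigmoid(z_i))
    return out

def layer_vectorized(W, a, b):
    """Same computation written as sigma(W a + b), with sigma applied componentwise."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i for row, b_i in zip(W, b)]
    return [sigmoid(z_i) for z_i in z]

# Hypothetical values: 2 neurons in the previous layer, 3 in the current layer.
W = [[0.5, -1.0], [2.0, 0.0], [-0.5, 0.25]]
a = [1.0, 2.0]
b = [0.0, -2.0, 0.0]
out = layer_vectorized(W, a, b)
```

Here `W` has three rows and two columns, matching the rule that rows count neurons at the current layer and columns count neurons at the previous layer.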
In the next section, we introduce a full set of notation that allows us to define
a general network. Before reaching that stage, we will give a specific example.
Figure 3 represents an artificial neural network with four layers. We will apply
this form of network to the problem defined by Figure 1. For the network in
Figure 3 the first (input) layer is represented by two circles. This is because our
input data points have two components. The second layer has two solid circles,
indicating that two neurons are being employed. The arrows from layer one to
layer two indicate that both components of the input data are made available

Figure 3: A network with four layers. Layer 1 is the input layer and layer 4 is the output layer.

to the two neurons in layer two. Since the input data has the form $x \in \mathbb{R}^2$, the
weights and biases for layer two may be represented by a matrix $W^{[2]} \in \mathbb{R}^{2 \times 2}$
and a vector $b^{[2]} \in \mathbb{R}^2$, respectively. The output from layer two then has the
form
\[ \sigma(W^{[2]} x + b^{[2]}) \in \mathbb{R}^2. \]
Layer three has three neurons, each receiving input in $\mathbb{R}^2$. So the weights and
biases for layer three may be represented by a matrix $W^{[3]} \in \mathbb{R}^{3 \times 2}$ and a vector
$b^{[3]} \in \mathbb{R}^3$, respectively. The output from layer three then has the form
\[ \sigma\big( W^{[3]} \sigma(W^{[2]} x + b^{[2]}) + b^{[3]} \big) \in \mathbb{R}^3. \]

The fourth (output) layer has two neurons, each receiving input in $\mathbb{R}^3$. So the
weights and biases for this layer may be represented by a matrix $W^{[4]} \in \mathbb{R}^{2 \times 3}$
and a vector $b^{[4]} \in \mathbb{R}^2$, respectively. The output from layer four, and hence
from the overall network, has the form
\[ F(x) = \sigma\Big( W^{[4]} \sigma\big( W^{[3]} \sigma(W^{[2]} x + b^{[2]}) + b^{[3]} \big) + b^{[4]} \Big) \in \mathbb{R}^2. \tag{4} \]

The expression (4) defines a function $F : \mathbb{R}^2 \to \mathbb{R}^2$ in terms of its 23
parameters, the entries in the weight matrices and bias vectors. Recall that
our aim is to produce a classifier based on the data in Figure 1. We do this
by optimizing over the parameters. We will require $F(x)$ to be close to $[1, 0]^T$
for data points in category A and close to $[0, 1]^T$ for data points in category B.
Then, given a new point $x \in \mathbb{R}^2$, it would be reasonable to classify it according
to the largest component of $F(x)$; that is, category A if $F_1(x) > F_2(x)$ and
category B if $F_1(x) < F_2(x)$, with some rule to break ties. This requirement on
$F$ may be specified through a cost function. Denoting the ten data points in
Figure 1 by $\{x^{\{i\}}\}_{i=1}^{10}$, we use $y(x^{\{i\}})$ for the target output; that is,
\[ y(x^{\{i\}}) = \begin{cases} \begin{bmatrix} 1 \\ 0 \end{bmatrix} & \text{if } x^{\{i\}} \text{ is in category A}, \\[2ex] \begin{bmatrix} 0 \\ 1 \end{bmatrix} & \text{if } x^{\{i\}} \text{ is in category B}. \end{cases} \tag{5} \]
Figure 4: Visualization of output from an artificial neural network applied to
the data in Figure 1.

Our cost function then takes the form
\[ \mathrm{Cost}\big( W^{[2]}, W^{[3]}, W^{[4]}, b^{[2]}, b^{[3]}, b^{[4]} \big) = \frac{1}{10} \sum_{i=1}^{10} \frac{1}{2} \| y(x^{\{i\}}) - F(x^{\{i\}}) \|_2^2. \tag{6} \]

Here, the factor 1/2 is included for convenience; it simplifies matters when we start
differentiating. We emphasize that Cost is a function of the weights and biases;
the data points are fixed. The form in (6), where discrepancy is measured by
averaging the squared Euclidean norm over the data points, is often referred to as
a quadratic cost function. In the language of optimization, Cost is our objective
function.
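As a rough illustration of the quadratic cost (6) and of the largest-component classification rule, here is a minimal Python sketch. The map `toy_F`, the four data points and the tie-breaking rule are made-up assumptions; they stand in for a trained network and for the data in Figure 1, which are not reproduced here.

```python
def quadratic_cost(F, points, targets):
    """Quadratic cost (6): mean over the data of half the squared Euclidean error."""
    total = 0.0
    for x, y in zip(points, targets):
        fx = F(x)
        total += 0.5 * sum((y_k - f_k) ** 2 for y_k, f_k in zip(y, fx))
    return total / len(points)

def classify(F, x):
    """Category A if F1(x) > F2(x), category B if F1(x) < F2(x).
    Ties are assigned to A here, which is an assumed tie-breaking rule."""
    f1, f2 = F(x)
    return "A" if f1 >= f2 else "B"

# Hypothetical stand-in for a trained network: any map from R^2 to R^2 will do.
def toy_F(x):
    return [1.0, 0.0] if x[0] < 0.5 else [0.0, 1.0]

points = [[0.1, 0.1], [0.3, 0.4], [0.6, 0.9], [0.9, 0.2]]    # made-up data, not Figure 1
targets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # toy_F gets the last point wrong
cost = quadratic_cost(toy_F, points, targets)
```

Only the misclassified point contributes to the sum, so the cost is positive; a map that reproduced every target exactly would give a cost of zero.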
Choosing the weights and biases in a way that minimizes the cost function
is referred to as training the network. We note that, in principle, rescaling an
objective function does not change an optimization problem. We should arrive
at the same minimizer if we change Cost to, for example, 100 Cost or Cost/30.
So the factors 1/10 and 1/2 in (6) should have no effect on the outcome.
For the data in Figure 1, we used the MATLAB optimization toolbox to
minimize the cost function (6) over the 23 parameters defining $W^{[2]}$, $W^{[3]}$,
$W^{[4]}$, $b^{[2]}$, $b^{[3]}$ and $b^{[4]}$. More precisely, we used the nonlinear least-squares
solver lsqnonlin. For the trained network, Figure 4 shows the boundary of the region where
$F_1(x) > F_2(x)$. So, with this approach, any point in the shaded region would
be assigned to category A and any point in the unshaded region to category B.
Figure 5 shows how the network responds to additional training data. Here
we added one further category B point, indicated by the extra cross at (0.3, 0.7),
and re-ran the optimization routine.
The example illustrated in Figure 4 is small-scale by the standards of today’s
deep learning tools. However, the underlying optimization problem, minimizing
a non-convex objective function over 23 variables, is fundamentally difficult.
We cannot exhaustively search over a 23-dimensional parameter space, and
we cannot guarantee to find the global minimum of a non-convex function.

Figure 5: Repeat of the experiment in Figure 4 with an additional data point.

Indeed, some experimentation with the location of the data points in Figure 4
and with the choice of initial guess for the weights and biases makes it clear
that lsqnonlin, with its default settings, cannot always find an acceptable
solution. This motivates the material in sections 4 and 5, where we focus on
the optimization problem.

3 The General Set-up


The four layer example in section 2 introduced the idea of neurons, represented
by the sigmoid function, acting in layers. At a general layer, each neuron receives
the same input—one real value from every neuron at the previous layer—and
produces one real value, which is passed to every neuron at the next layer. There
are two exceptional layers. At the input layer, there is no “previous layer” and
each neuron receives the input vector. At the output layer, there is no “next
layer” and these neurons provide the overall output. The layers in between
these two are called hidden layers. There is no special meaning to this phrase;
it simply indicates that these neurons are performing intermediate calculations.
Deep learning is a loosely-defined term which implies that many hidden layers
are being used.
We now spell out the general form of the notation from section 2. We
suppose that the network has $L$ layers, with layers 1 and $L$ being the input and
output layers, respectively. Suppose that layer $l$, for $l = 1, 2, 3, \ldots, L$, contains
$n_l$ neurons. So $n_1$ is the dimension of the input data. Overall, the network maps
from $\mathbb{R}^{n_1}$ to $\mathbb{R}^{n_L}$. We use $W^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$ to denote the matrix of weights at
layer $l$. More precisely, $w^{[l]}_{jk}$ is the weight that neuron $j$ at layer $l$ applies to the
output from neuron $k$ at layer $l - 1$. Similarly, $b^{[l]} \in \mathbb{R}^{n_l}$ is the vector of biases
for layer $l$, so neuron $j$ at layer $l$ uses the bias $b^{[l]}_j$.
In Figure 6 we give an example with $L = 5$ layers. Here, $n_1 = 4$, $n_2 = 3$, $n_3 = 4$,
$n_4 = 5$ and $n_5 = 2$, so $W^{[2]} \in \mathbb{R}^{3 \times 4}$, $W^{[3]} \in \mathbb{R}^{4 \times 3}$, $W^{[4]} \in \mathbb{R}^{5 \times 4}$, $W^{[5]} \in \mathbb{R}^{2 \times 5}$,
$b^{[2]} \in \mathbb{R}^3$, $b^{[3]} \in \mathbb{R}^4$, $b^{[4]} \in \mathbb{R}^5$ and $b^{[5]} \in \mathbb{R}^2$.
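The bookkeeping of dimensions can be sketched in a few lines of Python. The function names are hypothetical; the layer sizes are those of the five-layer example above and of the four-layer network in section 2.

```python
def layer_shapes(sizes):
    """For layer sizes [n_1, ..., n_L], return (l, shape of W^[l], length of b^[l]) for l = 2..L."""
    shapes = []
    for l in range(2, len(sizes) + 1):
        n_l, n_prev = sizes[l - 1], sizes[l - 2]   # sizes is 0-indexed; layers are 1-indexed
        shapes.append((l, (n_l, n_prev), n_l))
    return shapes

def parameter_count(sizes):
    """Total number of weights and biases."""
    return sum(rows * cols + n_bias for _, (rows, cols), n_bias in layer_shapes(sizes))

five_layer = [4, 3, 4, 5, 2]   # n_1, ..., n_5 from the five-layer example
four_layer = [2, 2, 3, 2]      # the four-layer network of section 2
for l, w_shape, n_bias in layer_shapes(five_layer):
    print(f"layer {l}: W is {w_shape[0]} x {w_shape[1]}, b has {n_bias} entries")
print(parameter_count(five_layer), parameter_count(four_layer))
```

For the four-layer network the count agrees with the 23 parameters quoted in section 2.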

Figure 6: A network with five layers. The edge corresponding to the weight
$w^{[3]}_{43}$ is highlighted. The output from neuron number 3 at layer 2 is weighted by
the factor $w^{[3]}_{43}$ when it is fed into neuron number 4 at layer 3.

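Given an input $x \in \mathbb{R}^{n_1}$, we may then neatly summarize the action of the
network by letting $a^{[l]}_j$ denote the output, or activation, from neuron $j$ at layer
$l$. So, we have
\begin{align}
a^{[1]} &= x \in \mathbb{R}^{n_1}, \tag{7} \\
a^{[l]} &= \sigma\big( W^{[l]} a^{[l-1]} + b^{[l]} \big) \in \mathbb{R}^{n_l}, \quad \text{for } l = 2, 3, \ldots, L. \tag{8}
\end{align}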

It should be clear that (7) and (8) amount to an algorithm for feeding the input
forward through the network in order to produce an output $a^{[L]} \in \mathbb{R}^{n_L}$. At the
end of section 5 this algorithm appears within a pseudocode description of an
approach for training a network.
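The forward pass defined by (7) and (8) might be sketched in Python as follows. This is not the article's MATLAB listing; the function names, the random initialisation and the sample input are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x, weights, biases):
    """Apply (7) and (8): a^[1] = x, then a^[l] = sigma(W^[l] a^[l-1] + b^[l])."""
    a = list(x)                                    # (7)
    for W, b in zip(weights, biases):              # layers l = 2, ..., L
        a = [sigmoid(sum(w * a_k for w, a_k in zip(row, a)) + b_j)
             for row, b_j in zip(W, b)]            # (8)
    return a

def random_parameters(sizes, rng):
    """Hypothetical initialisation: independent standard normal entries."""
    weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_prev)] for _ in range(n_l)]
               for n_prev, n_l in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.gauss(0.0, 1.0) for _ in range(n_l)] for n_l in sizes[1:]]
    return weights, biases

rng = random.Random(0)
weights, biases = random_parameters([4, 3, 4, 5, 2], rng)
output = feed_forward([0.2, -0.1, 0.7, 1.0], weights, biases)
```

With layer sizes 4, 3, 4, 5 and 2, the output has two components, each strictly between 0 and 1 because the sigmoid is applied at the final layer.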
Now suppose we have $N$ pieces of data, or training points, in $\mathbb{R}^{n_1}$, $\{x^{\{i\}}\}_{i=1}^{N}$,
for which there are given target outputs $\{y(x^{\{i\}})\}_{i=1}^{N}$ in $\mathbb{R}^{n_L}$. Generalizing (6),
the quadratic cost function that we wish to minimize has the form
\[ \mathrm{Cost} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \| y(x^{\{i\}}) - a^{[L]}(x^{\{i\}}) \|_2^2, \tag{9} \]
where, to keep notation under control, we have not explicitly indicated that
Cost is a function of all the weights and biases.

4 Stochastic Gradient
We saw in the previous two sections that training a network corresponds to
choosing the parameters, that is, the weights and biases, that minimize the cost
function. The weights and biases take the form of matrices and vectors, but at
this stage it is convenient to imagine them stored as a single vector that we call
p. The example in Figure 3 has a total of 23 weights and biases. So, in that case,
$p \in \mathbb{R}^{23}$. Generally, we will suppose $p \in \mathbb{R}^s$, and write the cost function in (9)
as Cost(p) to emphasize its dependence on the parameters. So $\mathrm{Cost} : \mathbb{R}^s \to \mathbb{R}$.
We now introduce a classical method in optimization that is often referred
to as steepest descent or gradient descent. The method proceeds iteratively,
computing a sequence of vectors in $\mathbb{R}^s$ with the aim of converging to a vector
that minimizes the cost function. Suppose that our current vector is $p$. How
should we choose a perturbation, $\Delta p$, so that the next vector, $p + \Delta p$, represents
an improvement? If $\Delta p$ is small, then ignoring terms of order $\|\Delta p\|^2$, a Taylor
series expansion gives
\[ \mathrm{Cost}(p + \Delta p) \approx \mathrm{Cost}(p) + \sum_{r=1}^{s} \frac{\partial\, \mathrm{Cost}(p)}{\partial p_r} \Delta p_r. \tag{10} \]

Here $\partial\, \mathrm{Cost}(p)/\partial p_r$ denotes the partial derivative of the cost function with
respect to the $r$th parameter. For convenience, we will let $\nabla \mathrm{Cost}(p) \in \mathbb{R}^{s}$
denote the vector of partial derivatives, known as the gradient, so that

$$(\nabla \mathrm{Cost}(p))_r = \frac{\partial\, \mathrm{Cost}(p)}{\partial p_r}.$$

Then (10) becomes

$$\mathrm{Cost}(p + \Delta p) \approx \mathrm{Cost}(p) + \nabla \mathrm{Cost}(p)^T \Delta p. \tag{11}$$

Our aim is to reduce the value of the cost function. The relation (11) motivates
the idea of choosing $\Delta p$ to make $\nabla \mathrm{Cost}(p)^T \Delta p$ as negative as possible. We can
address this problem via the Cauchy–Schwarz inequality, which states that for
any $f, g \in \mathbb{R}^{s}$, we have $|f^T g| \le \|f\|_2 \|g\|_2$. So the most negative that $f^T g$
can be is $-\|f\|_2 \|g\|_2$, which happens when $f = -g$. Hence, based on (11), we
should choose $\Delta p$ to lie in the direction $-\nabla \mathrm{Cost}(p)$. Keeping in mind that (11)
is an approximation that is relevant only for small ∆p, we will limit ourselves
to a small step in that direction. This leads to the update

p → p − η∇Cost(p). (12)

Here η is a small stepsize that, in this context, is known as the learning rate.
This equation defines the steepest descent method. We choose an initial vector
and iterate with (12) until some stopping criterion has been met, or until the
number of iterations has exceeded the computational budget.
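
To make the iteration (12) concrete, here is a minimal Python sketch of steepest descent with a fixed iteration budget. The test problem, a simple quadratic with known minimizer, and the names `gradient_descent` and `grad_cost` are assumptions introduced only for illustration.

```python
def gradient_descent(grad, p0, eta, n_iter):
    """Iterate p -> p - eta * grad(p), as in (12), for a fixed number of steps."""
    p = list(p0)
    for _ in range(n_iter):
        g = grad(p)
        p = [pr - eta * gr for pr, gr in zip(p, g)]
    return p

# Hypothetical test problem: Cost(p) = (p1 - 1)^2 + 2 (p2 + 3)^2,
# whose unique minimizer is p = (1, -3).
def grad_cost(p):
    return [2.0 * (p[0] - 1.0), 4.0 * (p[1] + 3.0)]

p_min = gradient_descent(grad_cost, [0.0, 0.0], eta=0.1, n_iter=200)
```

For this quadratic each step shrinks the error in the two components by factors 0.8 and 0.6, so the iterates approach (1, -3) geometrically; a larger learning rate (above 0.5 here) would make the second component diverge.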

Our cost function (9) involves a sum of individual terms that runs over the
training data. It follows that the gradient $\nabla \mathrm{Cost}(p)$ is a sum over the
training data of individual gradients. More precisely, let

$$C_{x^{\{i\}}} = \tfrac{1}{2} \left\| y(x^{\{i\}}) - a^{[L]}(x^{\{i\}}) \right\|_2^2. \tag{13}$$

Then, from (9),

$$\nabla \mathrm{Cost}(p) = \frac{1}{N} \sum_{i=1}^{N} \nabla C_{x^{\{i\}}}(p). \tag{14}$$

When we have a large number of parameters and a large number of training
points, computing the gradient vector (14) at every iteration of the steepest
descent method (12) can be prohibitively expensive. A much cheaper alternative
is to replace the mean of the individual gradients over all training points by the
gradient at a single, randomly chosen, training point. This leads to the simplest
form of what is called the stochastic gradient method. A single step may be
summarized as
1. Choose an integer $i$ uniformly at random from $\{1, 2, 3, \ldots, N\}$.
2. Update
$$p \to p - \eta \nabla C_{x^{\{i\}}}(p). \tag{15}$$
In words, at each step, the stochastic gradient method uses one randomly chosen
training point to represent the full training set. As the iteration proceeds, the
method sees more training points. So there is some hope that this dramatic
reduction in cost-per-iteration will be worthwhile overall. We note that, even
for very small η, the update (15) is not guaranteed to reduce the overall cost
function—we have traded the mean for a single sample. Hence, although the
phrase stochastic gradient descent is widely used, we prefer to use stochastic
gradient.
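
The single-sample iteration (15) can be sketched as follows. The toy problem, in which each individual cost is half the squared distance from $p$ to one data value, is an assumption chosen because the full cost is then minimized at the data mean; the function name `stochastic_gradient` is likewise our own.

```python
import random

def stochastic_gradient(grad_i, n_points, p0, eta, n_steps, rng):
    """Simplest stochastic gradient method (15): one random point per step,
    sampled with replacement."""
    p = list(p0)
    for _ in range(n_steps):
        i = rng.randrange(n_points)          # uniform over 0, ..., N-1
        g = grad_i(p, i)
        p = [pr - eta * gr for pr, gr in zip(p, g)]
    return p

# Hypothetical problem: C_i(p) = 0.5 * (p - x_i)^2, so the full cost is
# minimized at the mean of the data, here 3.0.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
grad_i = lambda p, i: [p[0] - data[i]]
p_sg = stochastic_gradient(grad_i, len(data), [0.0], eta=0.01,
                           n_steps=5000, rng=random.Random(0))
```

With a constant learning rate the iterate does not settle exactly at the minimizer; it keeps fluctuating around it, which is consistent with the remark that an individual step need not reduce the overall cost.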
The version of the stochastic gradient method that we introduced in (15) is
the simplest from a large range of possibilities. In particular, the index i in (15)
was chosen by sampling with replacement—after using a training point, it is
returned to the training set and is just as likely as any other point to be chosen
at the next step. An alternative is to sample without replacement; that is, to
cycle through each of the N training points in a random order. Performing N
steps in this manner, referred to as completing an epoch, may be summarized as
follows:
1. Shuffle the integers $\{1, 2, 3, \ldots, N\}$ into a new order, $\{k_1, k_2, k_3, \ldots, k_N\}$.
2. For $i = 1$ up to $N$, update
$$p \to p - \eta \nabla C_{x^{\{k_i\}}}(p). \tag{16}$$

If we regard the stochastic gradient method as approximating the mean over
all training points in (14) by a single sample, then it is natural to consider a
compromise where we use a small sample average. For some $m \ll N$ we could
take steps of the following form.

1. Choose $m$ integers, $k_1, k_2, \ldots, k_m$, uniformly at random from $\{1, 2, 3, \ldots, N\}$.
2. Update
$$p \to p - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla C_{x^{\{k_i\}}}(p). \tag{17}$$

In this iteration, the set $\{x^{\{k_i\}}\}_{i=1}^{m}$ is known as a mini-batch. There is a without
replacement alternative where, assuming N = Km for some K, we split the
training set randomly into K distinct mini-batches and cycle through them.
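
The without-replacement mini-batch variant might be sketched as below: the indices are shuffled, split into $K$ batches of size $m$, and one averaged step of the form (17) is taken per batch. The function name `minibatch_epoch` and the toy least-squares problem are assumptions made for illustration.

```python
import random

def minibatch_epoch(grad_i, n_points, p, eta, m, rng):
    """One epoch of mini-batch steps (17), sampling without replacement.

    Assumes n_points = K * m, so the shuffled indices split into K batches.
    """
    order = list(range(n_points))
    rng.shuffle(order)
    batches = [order[s:s + m] for s in range(0, n_points, m)]
    for batch in batches:
        # Average the individual gradients over the mini-batch.
        g = [0.0] * len(p)
        for i in batch:
            g = [gr + gir / len(batch) for gr, gir in zip(g, grad_i(p, i))]
        p = [pr - eta * gr for pr, gr in zip(p, g)]
    return p, batches

# Hypothetical problem: C_i(p) = 0.5 * (p - x_i)^2, minimized at the mean 3.5.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grad_i = lambda p, i: [p[0] - data[i]]
rng = random.Random(0)
p = [0.0]
for _ in range(300):
    p, batches = minibatch_epoch(grad_i, len(data), p, eta=0.02, m=2, rng=rng)
```

Every training point is used exactly once per epoch, and reshuffling at the start of each epoch changes which points share a batch.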
Because the stochastic gradient method is usually implemented within the
context of a very large scale computation, algorithmic choices such as mini-
batch size and the form of randomization are often driven by the requirements
of high performance computing architectures. Also, it is, of course, possible to
vary these choices, along with others, such as the learning rate, dynamically as
the training progresses in an attempt to accelerate convergence.
Section 6 describes a simple MATLAB code that uses a vanilla stochastic
gradient method. In section 7 we use a state-of-the-art implementation and
section 8 has pointers to the current literature.

5 Back Propagation
We are now in a position to apply the stochastic gradient method in order
to train an artificial neural network. So we switch from the general vector
of parameters, p, used in section 4 to the entries in the weight matrices and
bias vectors. Our task is to compute partial derivatives of the cost function
with respect to each $w_{jk}^{[l]}$ and $b_j^{[l]}$. We saw that the idea behind the stochastic
gradient method is to exploit the structure of the cost function: because (9) is a
linear combination of individual terms that runs over the training data the same
is true of its partial derivatives. We therefore focus our attention on computing
those individual partial derivatives.
Hence, for a fixed training point we regard $C_{x^{\{i\}}}$ in (13) as a function of the
weights and biases. So we may drop the dependence on $x^{\{i\}}$ and simply write

$$C = \tfrac{1}{2} \| y - a^{[L]} \|_2^2. \tag{18}$$

We recall from (8) that $a^{[L]}$ is the output from the artificial neural network.
The dependence of $C$ on the weights and biases arises only through $a^{[L]}$.
To derive worthwhile expressions for the partial derivatives, it is useful to
introduce two further sets of variables. First we let

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \in \mathbb{R}^{n_l}, \quad \text{for } l = 2, 3, \ldots, L. \tag{19}$$

We refer to $z_j^{[l]}$ as the weighted input for neuron $j$ at layer $l$. The fundamental
relation (8) that propagates information through the network may then be
written

$$a^{[l]} = \sigma\left(z^{[l]}\right), \quad \text{for } l = 2, 3, \ldots, L. \tag{20}$$

Second, we let $\delta^{[l]} \in \mathbb{R}^{n_l}$ be defined by

$$\delta_j^{[l]} = \frac{\partial C}{\partial z_j^{[l]}}, \quad \text{for } 1 \le j \le n_l \text{ and } 2 \le l \le L. \tag{21}$$

This expression, which is often called the error in the $j$th neuron at layer $l$,
is an intermediate quantity that is useful both for analysis and computation.
However, we point out that this usage of the term error is somewhat ambiguous.
At a general, hidden layer, it is not clear how much to “blame” each neuron for
discrepancies in the final output. Also, at the output layer, $L$, the expression
(21) does not quantify those discrepancies directly. The idea of referring to $\delta_j^{[l]}$
in (21) as an error seems to have arisen because the cost function can only be at
a minimum if all partial derivatives are zero, so $\delta_j^{[l]} = 0$ is a useful goal. As we
mention later in this section, it may be more helpful to keep in mind that $\delta_j^{[l]}$
measures the sensitivity of the cost function to the weighted input for neuron $j$
at layer $l$.
At this stage we also need to define the Hadamard, or componentwise, prod-
uct of two vectors. If x, y ∈ Rn , then x ◦ y ∈ Rn is defined by (x ◦ y)i = xi yi .
In words, the Hadamard product is formed by pairwise multiplication of the
corresponding components.
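
For readers who prefer code to index notation, the Hadamard product is a one-line operation. The function name `hadamard` below is our own choice for this small sketch.

```python
def hadamard(x, y):
    """Componentwise product: (x o y)_i = x_i * y_i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [xi * yi for xi, yi in zip(x, y)]

product = hadamard([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # [4.0, 10.0, 18.0]
```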
With this notation, the following results are a consequence of the chain rule.
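Lemma 1 We have

$$\delta^{[L]} = \sigma'(z^{[L]}) \circ (a^{[L]} - y), \tag{22}$$

$$\delta^{[l]} = \sigma'(z^{[l]}) \circ (W^{[l+1]})^T \delta^{[l+1]}, \quad \text{for } 2 \le l \le L - 1, \tag{23}$$

$$\frac{\partial C}{\partial b_j^{[l]}} = \delta_j^{[l]}, \quad \text{for } 2 \le l \le L, \tag{24}$$

$$\frac{\partial C}{\partial w_{jk}^{[l]}} = \delta_j^{[l]} a_k^{[l-1]}, \quad \text{for } 2 \le l \le L. \tag{25}$$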

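Proof We begin by proving (22). The relation (20) with $l = L$ shows that
$z_j^{[L]}$ and $a_j^{[L]}$ are connected by $a^{[L]} = \sigma\left(z^{[L]}\right)$, and hence

$$\frac{\partial a_j^{[L]}}{\partial z_j^{[L]}} = \sigma'(z_j^{[L]}).$$

Also, from (18),

$$\frac{\partial C}{\partial a_j^{[L]}} = \frac{\partial}{\partial a_j^{[L]}} \sum_{k=1}^{n_L} \tfrac{1}{2} (y_k - a_k^{[L]})^2 = -(y_j - a_j^{[L]}).$$

So, using the chain rule,

$$\delta_j^{[L]} = \frac{\partial C}{\partial z_j^{[L]}} = \frac{\partial C}{\partial a_j^{[L]}} \frac{\partial a_j^{[L]}}{\partial z_j^{[L]}} = (a_j^{[L]} - y_j)\, \sigma'(z_j^{[L]}),$$

which is the componentwise form of (22).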
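To show (23), we use the chain rule to convert from $z_j^{[l]}$ to $\{z_k^{[l+1]}\}_{k=1}^{n_{l+1}}$.
Applying the chain rule, and using the definition (21),

$$\delta_j^{[l]} = \frac{\partial C}{\partial z_j^{[l]}} = \sum_{k=1}^{n_{l+1}} \frac{\partial C}{\partial z_k^{[l+1]}} \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = \sum_{k=1}^{n_{l+1}} \delta_k^{[l+1]} \frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}}. \tag{26}$$

Now, from (19) we know that $z_k^{[l+1]}$ and $z_j^{[l]}$ are connected via

$$z_k^{[l+1]} = \sum_{s=1}^{n_l} w_{ks}^{[l+1]} \sigma\left(z_s^{[l]}\right) + b_k^{[l+1]}.$$

Hence,

$$\frac{\partial z_k^{[l+1]}}{\partial z_j^{[l]}} = w_{kj}^{[l+1]} \sigma'\left(z_j^{[l]}\right).$$

In (26) this gives

$$\delta_j^{[l]} = \sum_{k=1}^{n_{l+1}} \delta_k^{[l+1]} w_{kj}^{[l+1]} \sigma'\left(z_j^{[l]}\right),$$

which may be rearranged as

$$\delta_j^{[l]} = \sigma'\left(z_j^{[l]}\right) \left( (W^{[l+1]})^T \delta^{[l+1]} \right)_j.$$

This is the componentwise form of (23).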


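To show (24), we note from (19) and (20) that $z_j^{[l]}$ is connected to $b_j^{[l]}$ by

$$z_j^{[l]} = \left( W^{[l]} \sigma\left(z^{[l-1]}\right) \right)_j + b_j^{[l]}.$$

Since $z^{[l-1]}$ does not depend on $b_j^{[l]}$, we find that

$$\frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} = 1.$$

Then, from the chain rule,

$$\frac{\partial C}{\partial b_j^{[l]}} = \frac{\partial C}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial b_j^{[l]}} = \frac{\partial C}{\partial z_j^{[l]}} = \delta_j^{[l]},$$

using the definition (21). This gives (24).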


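Finally, to obtain (25) we start with the componentwise version of (19),

$$z_j^{[l]} = \sum_{k=1}^{n_{l-1}} w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]},$$

which gives

$$\frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = a_k^{[l-1]}, \quad \text{independently of } j, \tag{27}$$

and

$$\frac{\partial z_s^{[l]}}{\partial w_{jk}^{[l]}} = 0, \quad \text{for } s \ne j. \tag{28}$$

In words, (27) and (28) follow because the $j$th neuron at layer $l$ uses the weights
from only the $j$th row of $W^{[l]}$, and applies these weights linearly. Then, from
the chain rule, (27) and (28) give

$$\frac{\partial C}{\partial w_{jk}^{[l]}} = \sum_{s=1}^{n_l} \frac{\partial C}{\partial z_s^{[l]}} \frac{\partial z_s^{[l]}}{\partial w_{jk}^{[l]}} = \frac{\partial C}{\partial z_j^{[l]}} \frac{\partial z_j^{[l]}}{\partial w_{jk}^{[l]}} = \frac{\partial C}{\partial z_j^{[l]}} a_k^{[l-1]} = \delta_j^{[l]} a_k^{[l-1]},$$

where the last step used the definition of $\delta_j^{[l]}$ in (21). This completes the proof.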

There are many aspects of Lemma 1 that deserve our attention. We recall
from (7), (19) and (20) that the output a[L] can be evaluated from a forward pass
through the network, computing a[1] , z [2] , a[2] , z [3] , . . . , a[L] in order. Having
done this, we see from (22) that δ [L] is immediately available. Then, from (23),
δ [L−1] , δ [L−2] , . . . , δ [2] may be computed in a backward pass. From (24) and
(25), we then have access to the partial derivatives. Computing gradients in
this way is known as back propagation.
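
One way to gain confidence in Lemma 1 is to compare a back propagated derivative with a finite-difference estimate. The Python sketch below does this for a small sigmoid network; the helper names, the 2-3-2 architecture and the parameter values are assumptions made for the example, and we use the fact that for the sigmoid $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, Ws, bs):
    """Forward pass; returns the list of activations a[1], ..., a[L]."""
    acts = [list(x)]
    for W, b in zip(Ws, bs):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bj
             for row, bj in zip(W, b)]
        acts.append([sigmoid(zj) for zj in z])
    return acts

def cost(x, y, Ws, bs):
    """C in (18) for a single training point."""
    aL = forward(x, Ws, bs)[-1]
    return 0.5 * sum((yk - ak) ** 2 for yk, ak in zip(y, aL))

def backprop(x, y, Ws, bs):
    """Return (dC/dW, dC/db) per layer using (22)-(25).

    For the sigmoid, sigma'(z) = a (1 - a), so only activations are stored.
    """
    acts = forward(x, Ws, bs)
    # (22): delta at the output layer.
    delta = [a * (1 - a) * (a - yk) for a, yk in zip(acts[-1], y)]
    grads_W, grads_b = [], []
    for idx in range(len(Ws) - 1, -1, -1):
        a_prev = acts[idx]
        # (25): outer product of delta with a_prev; (24): dC/db = delta.
        grads_W.insert(0, [[dj * ak for ak in a_prev] for dj in delta])
        grads_b.insert(0, list(delta))
        if idx > 0:
            # (23): delta[l] = sigma'(z[l]) o (W[l+1]^T delta[l+1]).
            W = Ws[idx]
            back = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                    for j in range(len(a_prev))]
            delta = [a * (1 - a) * bj for a, bj in zip(a_prev, back)]
    return grads_W, grads_b

# Hypothetical 2-3-2 network and a single training point.
Ws = [[[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],
      [[0.7, -0.5, 0.2], [-0.6, 0.3, 0.9]]]
bs = [[0.0, 0.1, -0.1], [0.2, -0.2]]
x, y = [0.1, 0.3], [1.0, 0.0]
gW, gb = backprop(x, y, Ws, bs)

# Central finite difference for one weight in the first weight matrix.
h = 1e-6
Ws[0][1][0] += h
c_plus = cost(x, y, Ws, bs)
Ws[0][1][0] -= 2 * h
c_minus = cost(x, y, Ws, bs)
Ws[0][1][0] += h                      # restore the original value
fd = (c_plus - c_minus) / (2 * h)
```

The finite-difference value `fd` should agree with `gW[0][1][0]` to many digits. The check costs two extra forward passes per parameter, whereas back propagation delivers all partial derivatives from one forward and one backward pass.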
To gain further understanding of the back propagation formulas (24) and
(25) in Lemma 1, it is useful to recall the fundamental definition of a partial
derivative. The quantity $\partial C/\partial w_{jk}^{[l]}$ measures how $C$ changes when we make
a small perturbation to $w_{jk}^{[l]}$. For illustration, Figure 6 highlights the weight
$w_{43}^{[3]}$. It is clear that a change in this weight has no effect on the output of
previous layers. So to work out $\partial C/\partial w_{43}^{[3]}$ we do not need to know about
partial derivatives at previous layers. It should, however, be possible to express
$\partial C/\partial w_{43}^{[3]}$ in terms of partial derivatives at subsequent layers. More precisely,
the activation feeding into the 4th neuron on layer 3 is $z_4^{[3]}$, and, by definition,
$\delta_4^{[3]}$ measures the sensitivity of $C$ with respect to this input. Feeding in to this
neuron we have $w_{43}^{[3]} a_3^{[2]} + \text{constant}$, so it makes sense that

$$\frac{\partial C}{\partial w_{43}^{[3]}} = \delta_4^{[3]} a_3^{[2]}.$$

Similarly, in terms of the bias, $b_4^{[3]} + \text{constant}$ is feeding in to the neuron, which
explains why

$$\frac{\partial C}{\partial b_4^{[3]}} = \delta_4^{[3]} \times 1.$$

We may avoid the Hadamard product notation in (22) and (23) by introducing
diagonal matrices. Let $D^{[l]} \in \mathbb{R}^{n_l \times n_l}$ denote the diagonal matrix
with $(i, i)$ entry given by $\sigma'(z_i^{[l]})$. Then we see that $\delta^{[L]} = D^{[L]} (a^{[L]} - y)$ and
$\delta^{[l]} = D^{[l]} (W^{[l+1]})^T \delta^{[l+1]}$. We could expand this out as

$$\delta^{[l]} = D^{[l]} (W^{[l+1]})^T D^{[l+1]} (W^{[l+2]})^T \cdots D^{[L-1]} (W^{[L]})^T D^{[L]} (a^{[L]} - y).$$

We also recall from (2) that $\sigma'(z)$ is trivial to compute.

The relation (24) shows that $\delta^{[l]}$ corresponds precisely to the gradient of the
cost function with respect to the biases at layer $l$. If we regard $\partial C/\partial w_{jk}^{[l]}$ as
defining the $(j, k)$ component in a matrix of partial derivatives at layer $l$, then
(25) shows this matrix to be the outer product $\delta^{[l]} (a^{[l-1]})^T \in \mathbb{R}^{n_l \times n_{l-1}}$.
Putting this together, we may write the following pseudocode for an algo-
rithm that trains a network using a fixed number, Niter, of stochastic gradient
iterations. For simplicity, we consider the basic version (15) where single sam-
ples are chosen with replacement. For each training point, we perform a forward
pass through the network in order to evaluate the activations, weighted inputs
and overall output a[L] . Then we perform a backward pass to compute the
errors and updates.

For counter = 1 upto Niter
    Choose an integer $k$ uniformly at random from $\{1, 2, 3, \ldots, N\}$
    $x^{\{k\}}$ is current training data point
    $a^{[1]} = x^{\{k\}}$
    For $l = 2$ upto $L$
        $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
        $a^{[l]} = \sigma(z^{[l]})$
        $D^{[l]} = \mathrm{diag}(\sigma'(z^{[l]}))$
    end
    $\delta^{[L]} = D^{[L]} (a^{[L]} - y(x^{\{k\}}))$
    For $l = L - 1$ downto 2
        $\delta^{[l]} = D^{[l]} (W^{[l+1]})^T \delta^{[l+1]}$
    end
    For $l = L$ downto 2
        $W^{[l]} \to W^{[l]} - \eta\, \delta^{[l]} (a^{[l-1]})^T$
        $b^{[l]} \to b^{[l]} - \eta\, \delta^{[l]}$
    end
end
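
The pseudocode above can be transcribed almost line by line into a general-depth loop. The Python sketch below is one such transcription under stated assumptions: sigmoid activations (so the diagonal of $D^{[l]}$ is $a(1-a)$), weight matrices stored as lists of rows, and a hypothetical 1-2-1 network trained on two made-up data points. It is not the MATLAB code of the next section.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, Ws, bs, eta, n_iter, rng):
    """Stochastic gradient training loop; Ws and bs are updated in place."""
    for _ in range(n_iter):
        k = rng.randrange(len(xs))
        # Forward pass: store activations and the diagonals of D[l].
        acts, diags = [list(xs[k])], []
        for W, b in zip(Ws, bs):
            a = [sigmoid(sum(w * v for w, v in zip(row, acts[-1])) + bj)
                 for row, bj in zip(W, b)]
            acts.append(a)
            diags.append([aj * (1 - aj) for aj in a])   # sigma'(z) for sigmoid
        # Backward pass: delta at the output, then layer by layer.
        deltas = [None] * len(Ws)
        deltas[-1] = [d * (a - y)
                      for d, a, y in zip(diags[-1], acts[-1], ys[k])]
        for l in range(len(Ws) - 2, -1, -1):
            back = [sum(Ws[l + 1][i][j] * deltas[l + 1][i]
                        for i in range(len(deltas[l + 1])))
                    for j in range(len(diags[l]))]
            deltas[l] = [d * v for d, v in zip(diags[l], back)]
        # Gradient step on every layer.
        for l in range(len(Ws)):
            for j, dj in enumerate(deltas[l]):
                bs[l][j] -= eta * dj
                for i, ai in enumerate(acts[l]):
                    Ws[l][j][i] -= eta * dj * ai

def total_cost(xs, ys, Ws, bs):
    """Quadratic cost (9) over the whole data set."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = list(x)
        for W, b in zip(Ws, bs):
            a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bj)
                 for row, bj in zip(W, b)]
        total += 0.5 * sum((yk - ak) ** 2 for yk, ak in zip(y, a))
    return total / len(xs)

xs, ys = [[0.0], [1.0]], [[1.0], [0.0]]
Ws = [[[0.3], [-0.4]], [[0.2, -0.1]]]      # hypothetical 1-2-1 network
bs = [[0.1, -0.1], [0.0]]
cost_before = total_cost(xs, ys, Ws, bs)
train(xs, ys, Ws, bs, eta=0.5, n_iter=5000, rng=random.Random(1))
cost_after = total_cost(xs, ys, Ws, bs)
```

Storing the layers in lists is what allows the forward and backward passes to be written as loops, at the price of slightly heavier indexing.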

6 Full MATLAB Example
We now give a concrete illustration involving back propagation and the stochas-
tic gradient method. Listing 6.1 shows how a network of the form shown in
Figure 3 may be used on the data in Figure 1. We note that this MATLAB
code has been written for clarity and brevity, rather than efficiency or elegance.
In particular, we have “hardwired” the number of layers and iterated through
the forward and backward passes line by line. (Because the weights and biases
do not have the same dimension in each layer, it is not convenient to store
them in a three-dimensional array. We could use a cell array or structure array,
[18], and then implement the forward and backward passes in for loops. How-
ever, this approach produced a less readable code, and violated our self-imposed
one page limit.)
The function netbp in Listing 6.1 contains the nested function cost, which
evaluates a scaled version of Cost in (6). Because this function is nested, it
has access to the variables in the main function, notably the training data. We
point out that the nested function cost is not used directly in the forward and
backward passes. It is called at each iteration of the stochastic gradient method
so that we can monitor the progress of the training.
Listing 6.2 shows the function activate, used by netbp, which applies the
sigmoid function in vectorized form.
At the start of netbp we set up the training data and target y values, as
defined in (5). We then initialize all weights and biases using the normal pseu-
dorandom number generator randn. For simplicity, we set a constant learning
rate eta = 0.05 and perform a fixed number of iterations Niter = 1e6.
We use the basic stochastic gradient iteration summarized at the end
of Section 5. Here, the command randi(10) returns a uniformly and indepen-
dently chosen integer between 1 and 10.
Having stored the value of the cost function at each iteration, we use the
semilogy command to visualize the progress of the iteration.
In this experiment, our initial guess for the weights and biases produced a
cost function value of 5.3. After $10^6$ stochastic gradient steps this was reduced
to $7.3 \times 10^{-4}$. Figure 7 shows the semilogy plot, and we see that the decay is
not consistent: the cost undergoes a flat period towards the start of the process.
After this plateau, we found that the cost decayed at a very slow linear rate; the
ratio between successive values was typically within around $10^{-6}$ of unity.
An extended version of netbp can be found in the supplementary material.
This version has the extra graphics commands that make Figure 7 more read-
able. It also takes the trained network and produces Figure 8. This plot shows
how the trained network carves up the input space. Eagle-eyed readers will spot
that the solution in Figure 8 differs slightly from the version in Figure 4, where
the same optimization problem was tackled by the nonlinear least-squares solver
lsqnonlin. In Figure 9 we show the corresponding result when an extra data
point is added; this can be compared with Figure 5.

function netbp
%NETBP Uses backpropagation to train a network

%%%%%%% DATA %%%%%%%%%%%


x1 = [0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7];
x2 = [0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6];
y = [ones(1,5) zeros(1,5); zeros(1,5) ones(1,5)];

% Initialize weights and biases


rng(5000);
W2 = 0.5*randn(2,2); W3 = 0.5*randn(3,2); W4 = 0.5*randn(2,3);
b2 = 0.5*randn(2,1); b3 = 0.5*randn(3,1); b4 = 0.5*randn(2,1);

% Forward and Back propagate


eta = 0.05; % learning rate
Niter = 1e6; % number of SG iterations
savecost = zeros(Niter,1); % value of cost function at each iteration
for counter = 1:Niter
k = randi(10); % choose a training point at random
x = [x1(k); x2(k)];
% Forward pass
a2 = activate(x,W2,b2);
a3 = activate(a2,W3,b3);
a4 = activate(a3,W4,b4);
% Backward pass
delta4 = a4.*(1-a4).*(a4-y(:,k));
delta3 = a3.*(1-a3).*(W4'*delta4);
delta2 = a2.*(1-a2).*(W3'*delta3);
% Gradient step
W2 = W2 - eta*delta2*x';
W3 = W3 - eta*delta3*a2';
W4 = W4 - eta*delta4*a3';
b2 = b2 - eta*delta2;
b3 = b3 - eta*delta3;
b4 = b4 - eta*delta4;
% Monitor progress
newcost = cost(W2,W3,W4,b2,b3,b4) % display cost to screen
savecost(counter) = newcost;
end

% Show decay of cost function


save costvec
semilogy([1:1e4:Niter],savecost(1:1e4:Niter))

function costval = cost(W2,W3,W4,b2,b3,b4)


costvec = zeros(10,1);
for i = 1:10
x =[x1(i);x2(i)];
a2 = activate(x,W2,b2);
a3 = activate(a2,W3,b3);
a4 = activate(a3,W4,b4);
costvec(i) = norm(y(:,i) - a4,2);
end
costval = norm(costvec,2)^2;
end % of nested function

end

Listing 6.1: M-file netbp.m.


function y = activate(x,W,b)
%ACTIVATE Evaluates sigmoid function.
%
% x is the input vector, y is the output vector
% W contains the weights, b contains the shifts
%
% The ith component of y is activate((Wx+b)_i)
% where activate(z) = 1/(1+exp(-z))

y = 1./(1+exp(-(W*x+b)));

Listing 6.2: M-file activate.m.

Figure 7: Vertical axis shows a scaled value of the cost function (6). Horizontal
axis shows the iteration number. Here we used the stochastic gradient method
to train a network of the form shown in Figure 3 on the data in Figure 1. The
resulting classification function is illustrated in Figure 8.

Figure 8: Visualization of output from an artificial neural network applied to
the data in Figure 1. Here we trained the network using the stochastic gradient
method with back propagation—behaviour of cost function is shown in Figure 7.
The same optimization problem was solved with the lsqnonlin routine from
MATLAB in order to produce Figure 4.

Figure 9: Visualization of output from an artificial neural network applied
to the data in Figure 1 with an additional data point. Here we trained the
network using the stochastic gradient method with back propagation. The same
optimization problem was solved with the lsqnonlin routine from MATLAB
in order to produce Figure 5.

7 Image Classification Example
We now move on to a more realistic task, which allows us to demonstrate the
power of the deep learning approach. We make use of the MatConvNet tool-
box [33], which is designed to offer key deep learning building blocks as simple
MATLAB commands. So MatConvNet is an excellent environment for pro-
totyping and for educational use. Support for GPUs also makes MatConvNet
efficient for large scale computations, and pre-trained networks may be down-
loaded for immediate use.
Applying MatConvNet on a large scale problem also gives us the oppor-
tunity to outline further concepts that are relevant to practical computation.
These are introduced in the next few subsections, before we apply them to the
image classification exercise.

7.1 Convolutional Neural Networks


MatConvNet uses a special class of artificial neural networks known as Con-
volutional Neural Networks (CNNs), which have become a standard tool in
computer vision applications. To motivate CNNs, we note that the general
framework described in section 3 does not scale well in the case of digital image
data. Consider a color image made up of 200 by 200 pixels, each with a red,
green and blue component. This corresponds to an input vector of dimension
$n_1 = 200 \times 200 \times 3 = 120{,}000$, and hence a weight matrix $W^{[2]}$ at level 2 that
has 120,000 columns. If we allow general weights and biases, then this approach
is clearly infeasible. CNNs get around this issue by constraining the values that
are allowed in the weight matrices and bias vectors. Rather than a single full-
sized linear transformation, CNNs repeatedly apply a small-scale linear kernel,
or filter, across portions of their input data. In effect, the weight matrices used
by CNNs are extremely sparse and highly structured.
To understand why this approach might be useful, consider premultiplying
an input vector in $\mathbb{R}^6$ by the matrix

$$\begin{bmatrix} 1 & -1 & & & & \\ & 1 & -1 & & & \\ & & 1 & -1 & & \\ & & & 1 & -1 & \\ & & & & 1 & -1 \end{bmatrix} \in \mathbb{R}^{5 \times 6}. \tag{29}$$

This produces a vector in $\mathbb{R}^5$ made up of differences between neighboring values.


In this case we are using a filter [1, −1] and a stride of length one—the filter
advances by one place after each use. Appropriate generalizations of this matrix
to the case of input vectors arising from 2D images can be used to detect edges
in an image—returning a large absolute value when there is an abrupt change in
neighboring pixel values. Moving a filter across an image can also reveal other
features, for example, particular types of curves or blotches of the same color.
So, having specified a filter size and stride length, we can allow the training
process to learn the weights in the filter as a means to extract useful structure.

The word “convolutional” arises because the linear transformations involved
may be written in the form of a convolution. In the 1D case, the convolution of
the vector $x \in \mathbb{R}^p$ with the filter $g_{1-p}, g_{2-p}, \ldots, g_{p-2}, g_{p-1}$ has $k$th component
given by

$$y_k = \sum_{n=1}^{p} x_n g_{k-n}.$$

The example in (29) corresponds to a filter with $g_0 = 1$, $g_{-1} = -1$ and all other
$g_k = 0$. In the case

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} a & b & c & d & & & & & & \\ & & a & b & c & d & & & & \\ & & & & a & b & c & d & & \\ & & & & & & a & b & c & d \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \\ 0 \end{bmatrix}, \tag{30}$$

we are applying a filter with four weights, a, b, c, and d, using a stride length
of two. Because the dimension of the input vector x is not compatible with the
filter length, we have padded with an extra zero value.
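
The action of the structured matrices in (29) and (30) can be reproduced without forming the matrices at all, by sliding the filter along the input. The Python sketch below does this; the function name `apply_filter` and the rule of padding zeros at the end of the input are assumptions made for illustration. It implements the row-by-row sliding inner product of (29) and (30) rather than the index-reversed convolution sum.

```python
def apply_filter(x, filt, stride):
    """Slide a filter along x with the given stride, zero-padding the end
    of x just enough for the final window to fit."""
    f = len(filt)
    x = list(x)
    # Pad so that (len(x) - f) is a nonnegative multiple of the stride.
    while len(x) < f or (len(x) - f) % stride != 0:
        x.append(0.0)
    return [sum(g * x[start + t] for t, g in enumerate(filt))
            for start in range(0, len(x) - f + 1, stride)]

# Differences between neighbours, as in (29): filter [1, -1], stride one.
diffs = apply_filter([1.0, 4.0, 9.0, 16.0, 25.0, 36.0], [1.0, -1.0], 1)
# Four weights with stride two on nine values, as in (30): one zero is padded.
y = apply_filter([1.0] * 9, [1.0, 2.0, 3.0, 4.0], 2)
```

In the second call the last window covers three input values and the padded zero, so its output differs from the other three even though the input is constant.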
In practice, image data is typically regarded as a three dimensional tensor:
each pixel has two spatial coordinates and a red/green/blue value. With this
viewpoint, the filter takes the form of a small tensor that is successively applied
to patches of the input tensor and the corresponding convolution operation is
multi-dimensional. From a computational perspective, a key benefit of CNNs is
that the matrix-vector products involved in the forward and backward passes
through the network can be computed extremely efficiently using fast transform
techniques.
A convolutional layer is often followed by a pooling layer, which reduces
dimension by mapping small regions of pixels into single numbers. For example,
when these small regions are taken to be squares of four neighboring pixels in a
2D image, a max pooling or average pooling layer replaces each set of four by
their maximum or average value, respectively.
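
A pooling layer of this kind is simple to write down. The following Python sketch pools non-overlapping 2 × 2 squares of a small image; the function name `pool2x2` and the sample image are assumptions for the example, and both image dimensions are taken to be even.

```python
def pool2x2(image, mode="max"):
    """Pool non-overlapping 2 x 2 squares of a 2D image (list of rows).

    Assumes both dimensions are even.
    """
    combine = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            block = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            row.append(combine(block))
        pooled.append(row)
    return pooled

image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [0, 0, 1, 1],
         [0, 4, 1, 1]]
max_pooled = pool2x2(image, "max")      # [[4, 8], [4, 1]]
avg_pooled = pool2x2(image, "average")  # [[2.5, 6.5], [1.0, 1.0]]
```

Either choice halves each spatial dimension, so a 4 × 4 image becomes 2 × 2.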

7.2 Avoiding Overfitting


Overfitting occurs when a trained network performs very accurately on the
given data, but cannot generalize well to new data. Loosely, this means that the
fitting process has focussed too heavily on the unimportant and unrepresentative
“noise” in the given data. Many ways to combat overfitting have been suggested,
some of which can be used together.
One useful technique is to split the given data into two distinct groups.

• Training data is used in the definition of the cost function that defines the
optimization problem. Hence this data drives the process that iteratively
updates the weights.
• Validation data is not used in the optimization process—it has no effect
on the way that the weights are updated from step to step. We use the
validation data only to judge the performance of the current network. At
each step of the optimization, we can evaluate the cost function corre-
sponding to the validation data. This measures how well the current set
of trained weights performs on unseen data.
Intuitively, overfitting corresponds to the situation where the optimization pro-
cess is driving down its cost function (giving a better fit to the training data),
but the cost function for the validation data is no longer decreasing (so the
performance on unseen data is not improving). It is therefore reasonable to ter-
minate the training at a stage where no improvement is seen on the validation
data.
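
One simple way to turn this stopping idea into a rule is to halt once the validation cost has failed to improve for a fixed number of consecutive epochs. The sketch below illustrates this; the `patience` parameter and the validation history are hypothetical, since the text only says to stop when no improvement is seen.

```python
def early_stopping_epoch(validation_costs, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation cost has failed to improve on its
    best value for `patience` consecutive epochs (a hypothetical rule).
    """
    best = float("inf")
    since_best = 0
    for epoch, cost in enumerate(validation_costs):
        if cost < best:
            best, since_best = cost, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(validation_costs) - 1

history = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.53, 0.49]
stop = early_stopping_epoch(history, patience=3)
```

With this made-up history the rule stops at epoch index 6 and therefore never sees the later improvement to 0.49, which shows the trade-off involved in choosing the patience.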
Another popular approach to tackle overfitting is to randomly and indepen-
dently remove neurons during the training phase. For example, at each step of
the stochastic gradient method, we could delete each neuron with probability
p and train on the remaining network. At the end of the process, because the
weights and biases were produced on these smaller networks, we could multiply
each by a factor of 1 − p (the probability that a neuron is kept) for use on the full-sized network. This technique, known
as dropout, has the intuitive interpretation that we are constructing an average
over many trained networks, with such a consensus being more reliable than
any individual.
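
The random deletion step can be sketched with a 0/1 mask that is redrawn at every stochastic gradient step. In the Python fragment below, the names `dropout_mask` and `apply_mask` and the deletion probability 0.35 are assumptions for illustration.

```python
import random

def dropout_mask(n_neurons, p_drop, rng):
    """Return a 0/1 mask that deletes each neuron independently with
    probability p_drop."""
    return [0.0 if rng.random() < p_drop else 1.0 for _ in range(n_neurons)]

def apply_mask(activations, mask):
    """Zero the outputs of the deleted neurons."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)
p_drop = 0.35
kept = sum(sum(dropout_mask(100, p_drop, rng)) for _ in range(200))
kept_fraction = kept / (100 * 200)   # close to 1 - p_drop
```

On average a fraction 1 − `p_drop` of the neurons survives each step, and this is the quantity that any rescaling for the full-sized network has to compensate for.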

7.3 Activation and Cost Functions


In Sections 2 to 6 we used activation functions of sigmoid form (1) and a
quadratic cost function (9). There are many other widely used choices, and
their relative performance is application-specific. In our image classification
setting it is common to use a rectified linear unit, or ReLU,

$$\sigma(x) = \begin{cases} 0, & \text{for } x \le 0, \\ x, & \text{for } x > 0, \end{cases} \tag{31}$$

as the activation.
In the case where our training data $\{x^{\{i\}}\}_{i=1}^{N}$ comes from $K$ labeled categories,
let $l_i \in \{1, 2, \ldots, K\}$ be the given label for data point $x^{\{i\}}$. As an
alternative to the quadratic cost function (9), we could use a softmax log loss
approach, as follows. Let the output $a^{[L]}(x^{\{i\}}) =: v^{\{i\}}$ from the network take
the form of a vector in $\mathbb{R}^K$ such that the $j$th component is large when the image
is believed to be from category $j$. The softmax operation

$$(v^{\{i\}})_s \mapsto \frac{e^{v_s^{\{i\}}}}{\sum_{j=1}^{K} e^{v_j^{\{i\}}}}$$

boosts the large components and produces a vector of positive weights summing
to unity, which may be interpreted as probabilities. Our aim is now to force
the softmax value for training point $x^{\{i\}}$ to be as close to unity as possible
in component $l_i$, which corresponds to the correct label. Using a logarithmic
rather than quadratic measure of error, we arrive at the cost function

$$-\sum_{i=1}^{N} \log\left( \frac{e^{v_{l_i}^{\{i\}}}}{\sum_{j=1}^{K} e^{v_j^{\{i\}}}} \right). \tag{32}$$
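
The softmax operation and the cost (32) can be evaluated in a few lines. In the Python sketch below, the function names and the sample outputs are assumptions; labels are 0-based, unlike the 1-based labels in the text, and the maximum is subtracted before exponentiating, a standard device that leaves the softmax unchanged while avoiding overflow.

```python
import math

def softmax(v):
    """Softmax of a list; subtracting the maximum first leaves the result
    unchanged but avoids overflow in exp."""
    vmax = max(v)
    exps = [math.exp(vs - vmax) for vs in v]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_log_loss(outputs, labels):
    """Cost (32): sum over points of -log(softmax component of the true label).

    Labels are 0-based here.
    """
    return -sum(math.log(softmax(v)[l]) for v, l in zip(outputs, labels))

probs = softmax([2.0, 1.0, 0.1])
# First point is undecided between two classes, second is confidently correct.
loss = softmax_log_loss([[0.0, 0.0], [5.0, -5.0]], [0, 0])
```

The undecided point contributes log 2 to the loss, while the confidently correct point contributes almost nothing.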

7.4 Image Classification Experiment


We now show results for a supervised learning task in image classification. To
do this, we rely on the codes cnn_cifar.m and cnn_cifar_init.m that are
available via the MatConvNet website. We made only minor edits, including
some changes that allowed us to test the use of dropout. Hence, in particular,
we are using the network architecture and parameter choices from those codes.
We refer to the MatConvNet documentation and tutorial material for the fine
details, and focus here on some of the bigger picture issues.
We consider a set of images, each of which falls into exactly one of the
following ten categories: airplane, automobile, bird, cat, deer, dog, frog, horse,
ship, truck. We use labeled data from the freely available CIFAR-10 collection
[20]. The images are small, having 32 by 32 pixels, each with a red, green, and
blue component. So one piece of training data consists of 32 × 32 × 3 = 3,072
values. We use a training set of 50,000 images, and use 10,000 more images as
our validation set. Having completed the optimization and trained the network,
we then judge its performance on a fresh collection of 10,000 test images, with
1,000 from each category.
Following the architecture used in the relevant MatConvNet codes, we
set up a network whose layers are divided into five blocks as follows. Here we
describe the dimensions of the inputs/outputs and weights in compact tensor
notation. (Of course, the tensors could be stretched into sparse vectors and
matrices in order to fit in with the general framework of sections 2 to 6. But we
feel that the tensor notation is natural in this context, and it is consistent with
the MatConvNet syntax.)
Block 1 consists of a convolution layer followed by a pooling layer and ac-
tivation. This converts the original 32 × 32 × 3 input into dimension
16 × 16 × 32. In more detail, the convolutional layer uses 5 × 5 filters
that also scan across the 3 color channels. There are 32 different filters, so
overall the weights can be represented in a 5×5×3×32 array. The output
from each filter may be described as a feature map. The filters are applied
with unit stride. In this way, each of the 32 feature maps has dimension
32 × 32. Max pooling is then applied to each feature map using stride
length two. This reduces the dimension of the feature maps to 16 × 16. A
ReLU activation is then used.

Figure 10: Overview of the CNN used for the image classification task.

Block 2 applies convolution followed by activation and then a pooling layer.
This reduces the dimension to 8 × 8 × 32. In more detail, we use 32
filters. Each is 5 × 5 across the dimensions of the feature maps, and also
scans across all 32 feature maps. So the weights could be regarded as a
5 × 5 × 32 × 32 tensor. The stride length is one, so the resulting 32 feature
maps are still of dimension 16 × 16. After ReLU activation, an average
pooling layer of stride two is then applied, which reduces each of the 32
feature maps to dimension 8 × 8.
Block 3 applies a convolution layer followed by the activation function, and
then performs a pooling operation, in a way that reduces dimension to
4 × 4 × 64. In more detail, 64 filters are applied. Each filter is 5 × 5 across
the dimensions of the feature maps, and also scans across all 32 feature
maps. So the weights could be regarded as a 5 × 5 × 32 × 64 tensor. The
stride has length one, resulting in feature maps of dimension 8 × 8. After
ReLU activation, an average pooling layer of stride two is applied, which
reduces each of the 64 feature maps to dimension 4 × 4.
Block 4 does not use pooling, just convolution followed by activation, leading
to dimension 1 × 1 × 64. In more detail, 64 filters are used. Each filter is
4 × 4 across the 64 feature maps, so the weights could be regarded as a
4 × 4 × 64 × 64 tensor, and each filter produces a single number.
Block 5 does not involve convolution. It uses a general (fully connected) weight
matrix of the type discussed in sections 2 to 6 to give output of dimension
1 × 1 × 10. This corresponds to a weight matrix of dimension 10 × 64.
A final softmax operation transforms each of the ten output components to
the range [0, 1].
Figure 10 gives a visual overview of the network architecture.
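
The dimensions quoted for the five blocks can be checked with the usual output-size formula for a filter of width $f$. In the Python sketch below, the padding of 2 for the 5 × 5 convolutions and the 2 × 2 pooling window are assumptions chosen so that the sizes match those stated above; the MatConvNet codes may use different padding and window settings to reach the same dimensions.

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial size after a filter of width f:
    floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

blocks = [
    # (filter size, padding, pooled?, output channels)
    (5, 2, True, 32),    # Block 1
    (5, 2, True, 32),    # Block 2
    (5, 2, True, 64),    # Block 3
    (4, 0, False, 64),   # Block 4
]
size, channels = 32, 3
shapes, n_weights = [], 0
for f, pad, pooled, out_channels in blocks:
    n_weights += f * f * channels * out_channels   # filter weights, no biases
    size = conv_out(size, f, stride=1, pad=pad)
    if pooled:
        size = conv_out(size, 2, stride=2)         # assumed 2 x 2 window
    channels = out_channels
    shapes.append((size, size, channels))
# Block 5: fully connected 10 x 64 weight matrix.
n_weights += 10 * channels
shapes.append((1, 1, 10))
```

Under these assumptions the network has 145,376 weights, excluding biases, which is tiny compared with what a fully connected layer acting on the 3,072 input values would need at comparable width.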
Our output is a vector of ten real numbers. The cost function in the opti-
mization problem takes the softmax log loss form (32) with K = 10. We specify
stochastic gradient with momentum, which uses a “moving average” of current

and past gradient directions. We use mini-batches of size 100 (so m = 100 in
(17)) and set a fixed number of 45 epochs. We predefine the learning rate for
each epoch: η = 0.05, η = 0.005 and η = 0.0005 for the first 30 epochs, next
10 epochs and final 5 epochs, respectively. Running on a Tesla C2075 GPU in
single precision, the 45 epochs can be completed in just under 4 hours.
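
The learning rate schedule and a momentum step can be written down explicitly. In the Python sketch below, the schedule follows the values quoted above, whereas the "heavy ball" form of the momentum update and the coefficient 0.9 are assumptions; the text does not specify which momentum variant or coefficient is used.

```python
def learning_rate(epoch):
    """Piecewise-constant schedule from the text; epochs are numbered from 1."""
    if epoch <= 30:
        return 0.05
    if epoch <= 40:
        return 0.005
    return 0.0005

def momentum_step(p, velocity, grad, eta, mu=0.9):
    """One stochastic gradient step with momentum.

    Uses the common form v -> mu*v - eta*g, p -> p + v; the value
    mu = 0.9 is an assumption, not taken from the text.
    """
    velocity = [mu * v - eta * g for v, g in zip(velocity, grad)]
    p = [pr + v for pr, v in zip(p, velocity)]
    return p, velocity

schedule = [learning_rate(e) for e in range(1, 46)]
batches_per_epoch = 50000 // 100      # mini-batches of size 100
p_new, v_new = momentum_step([1.0], [0.0], [2.0], eta=0.1)
```

With 50,000 training images and mini-batches of size 100, each epoch consists of 500 parameter updates, so the 45 epochs amount to 22,500 steps in total.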
As an additional test, we also train the network with dropout. Here, on
each stochastic gradient step, any neuron has its output re-set to zero with
independent probability
• 0.15 in block 1,
• 0.15 in block 2,

• 0.15 in block 3,
• 0.35 in block 4,
• 0 in block 5 (no dropout).

We emphasize that in this case all neurons become active when the trained
network is applied to the test data.
In Figure 11 we illustrate the training process in the case of no dropout.
For the plot on the left, circles are used to show how the objective function
(32) decreases after each of the 45 epochs. We also use crosses to indicate the
objective function value on the validation data. (More precisely, these error
measures are averaged over the individual batches that form the epoch—note
that weights are updated after each batch.) Given that our overall aim is to
assign images to one of the ten classes, the middle plot in Figure 11 looks
at the percentage of errors that take place when we classify with the highest
probability choice. Similarly, the plot on the right shows the percentage of cases
where the correct category is not among the top five. We see from Figure 11 that
the validation error starts to plateau at a stage where the stochastic gradient
method continues to make significant reductions on the training error. This gives
an indication that we are overfitting—learning fine details about the training
data that will not help the network to generalize to unseen data.
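The two classification error measures plotted in Figure 11 (highest-probability choice, and correct category not among the top five) are both instances of a top-k error. The sketch below is illustrative only: the function name and the small data set are made up, and k = 2 is used in place of k = 5 so that a four-class toy example suffices.

```python
def top_k_error(probabilities, labels, k):
    """Percentage of examples whose true label is not among the k largest outputs."""
    misses = 0
    for probs, label in zip(probabilities, labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)

# Three made-up softmax outputs over four classes (illustrative data only).
outputs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.15, 0.50, 0.25, 0.10],
    [0.10, 0.20, 0.30, 0.40],
]
true_labels = [0, 2, 0]
top1 = top_k_error(outputs, true_labels, k=1)
top2 = top_k_error(outputs, true_labels, k=2)
```

By construction the top-k error cannot increase as k grows, which is why the right-hand plot sits below the middle plot.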
Figure 12 shows the analogous results in the case where dropout is used. We
see that the training errors are significantly larger than those in Figure 11 and
the validation errors are of a similar magnitude. However, two key features in the
dropout case are that (a) the validation error is below the training error, and (b)
the validation error continues to decrease in sync with the training error, both of
which indicate that the optimization procedure is extracting useful information
over all epochs.
Figure 13 gives a summary of the performance of the trained network with
no dropout (after 45 epochs) in the form of a confusion matrix. Here, the
integer value in the general i, j entry shows the number of occasions where the
network predicted category i for an image from category j. Hence, off-diagonal
elements indicate mis-classifications. For example, the (1,1) element equal to
814 in Figure 13 records the number of airplane images that were correctly
classified as airplanes, and the (1,2) element equal to 21 records the number
of automobile images that were incorrectly classified as airplanes. Below each
integer is the corresponding percentage, rounded to one decimal place, given
that the test data has 1,000 images from each category. The extra row, labeled
“all”, summarizes the entries in each column. For example, the value 81.4%
in the first column of the final row arises because 814 of the 1,000 airplane
images were correctly classified. Beneath this, the value 18.6% arises because
186 of these airplane images were incorrectly classified. The final column of the
matrix, also labeled “all”, summarizes each row. For example, the value 82.4%
in the final column of the first row arises because 988 images were classified
by the network as airplanes, with 814 of these classifications being correct.
Beneath this, the value 17.6% arises because the remaining 174 out of these
988 airplane classifications were incorrect. Finally, the entries in the lower right
corner summarize over all categories. We see that 80.1% of all images were
correctly classified (and hence 19.9% were incorrectly classified).

Figure 11: Errors for the trained network. Horizontal axis runs over the 45
epochs of the stochastic gradient method (that is, 45 passes through the training
data). Left: Circles show cost function on the training data; crosses show
cost function on the validation data. Middle: Circles show the percentage of
instances where the most likely classification from the network does not match
the correct category, over the training data images; crosses show the same measure
computed over the validation data. Right: Circles show the percentage
of instances where the five most likely classifications from the network do not
include the correct category, over the training data images; crosses show the
same measure computed over the validation data.

Figure 12: As for Figure 11 in the case where dropout was used.

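The airplane percentages quoted from Figure 13 follow directly from the counts given in the text. The short sketch below reproduces them; the helper names are ours, and only the three counts 814, 1,000 and 988 come from the paper.

```python
def column_summary(correct, total_in_category):
    """Percent of images from one true category that were classified correctly."""
    return round(100.0 * correct / total_in_category, 1)

def row_summary(correct, total_predicted):
    """Percent of predictions of one category that were correct."""
    return round(100.0 * correct / total_predicted, 1)

# Airplane figures quoted in the text: 814 correct, 1,000 airplane images,
# 988 images predicted to be airplanes.
airplane_column = column_summary(814, 1000)
airplane_column_errors = round(100.0 * (1000 - 814) / 1000, 1)
airplane_row = row_summary(814, 988)
airplane_row_errors = round(100.0 * (988 - 814) / 988, 1)
```

The column summary answers "how many airplanes were found?", while the row summary answers "how many airplane predictions can be trusted?"; the two need not agree.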
Figure 14 gives the corresponding results in the case where dropout was
used. We see that the use of dropout has generally improved performance,
and in particular has increased the overall success rate from 80.1% to 81.1%.
Dropout gives larger values along the diagonal elements of the confusion matrix
in nine out of the ten categories.
To give a feel for the difficulty of this task, Figure 15 shows 16 images randomly
sampled from those that were misclassified by the non-dropout network.

Figure 13: Confusion matrix for the trained network from Figure 11.

Figure 14: Confusion matrix for the trained network from Figure 12, which
used dropout.

Figure 15: Sixteen of the images that were misclassified by the trained network
from Figure 11. Predicted category is indicated, with correct category shown in
parentheses. Note that images are low-resolution, having 32 × 32 pixels.

8 Of Things Not Treated
This short introductory article is aimed at those who are new to deep learning.
In the interests of brevity and accessibility we have ruthlessly omitted many
topics. For those wishing to learn more, a good starting point is the free online
book [26], which provides a hands-on tutorial style description of deep learning
techniques. The survey [22] gives an intuitive and accessible overview of many
of the key ideas behind deep learning, and highlights recent success stories. A
more detailed overview of the prize-winning performances of deep learning tools
can be found in [29], which also traces the development of ideas across more than
800 references. The review [35] discusses the pre-history of deep learning and
explains how key ideas evolved. For a comprehensive treatment of the state-of-
the-art, we recommend the book [10] which, in particular, does an excellent job
of introducing fundamental ideas from computer science/discrete mathematics,
applied/computational mathematics and probability/statistics/inference before
pulling them all together in the deep learning setting. The recent review article
[3] focuses on optimization tasks arising in machine learning. It summarizes
the current theory underlying the stochastic gradient method, along with many
alternative techniques. Those authors also emphasize that optimization tools
must be interpreted and judged carefully when operating within this inherently
statistical framework. Leaving aside the training issue, a mathematical frame-
work for understanding the cascade of linear and nonlinear transformations used
by deep networks is given in [24].
To give a feel for some of the key issues that can be followed up, we finish
with a list of questions that may have occurred to interested readers, along with
brief answers and further citations.

Why use artificial neural networks? Looking at Figure 4, it is clear that
there are many ways to come up with a mapping that divides the x-y plane
into two regions: a shaded region containing the circles and an unshaded
region containing the crosses. Artificial neural networks provide one useful
approach. In real applications, success corresponds to a small generalization
error; the mapping should perform well when presented with new
data. In order to make rigorous, general statements about performance,
we need to make some assumptions about the type of data. For example,
we could analyze the situation where the data consists of samples drawn
independently from a certain probability distribution. If an algorithm is
trained on such data, how will it perform when presented with new data
from the same distribution? The authors in [15] prove that artificial neural
networks trained with the stochastic gradient method can behave well in
this sense. Of course, in practice we cannot rely on the existence of such a
distribution. Indeed, experiments in [36] indicate that the worst case can
be as bad as possible. These authors tested state-of-the-art convolutional
networks for image classification. In terms of the heuristic performance
indicators used to monitor the progress of the training phase, they found
that the stochastic gradient method appears to work just as effectively
when the images are randomly re-labelled. This implies that the network
is happy to learn noise—if the labels for the unseen data are similarly
randomized then the classifications from the trained network are no bet-
ter than random choice. Other authors have established negative results
by showing that small and seemingly unimportant perturbations to an
image can change its predicted class, including cases where one pixel is
altered [32]. Related work in [4] showed proof-of-principle for an adver-
sarial patch, which alters the classification when added to a wide range
of images; for example, such a patch could be printed as a small sticker
and used in the physical world. Hence, although artificial neural networks
have outperformed rival methods in many application fields, the reasons
behind this success are not fully understood. The survey [34] describes
a range of mathematical approaches that are beginning to provide useful
insights, whilst the discussion piece [25] includes a list of ten concerns.
Which nonlinearity? The sigmoid function (1), illustrated in Figure 2, and
the rectified linear unit (31) are popular choices for the activation function.
Alternatives include the step function,

\[
\begin{cases} 0, & \text{for } x \le 0, \\ 1, & \text{for } x > 0. \end{cases}
\]

Each of these can undergo saturation: produce very small derivatives that
thereby reduce the size of the gradient updates. Indeed, the step function
and rectified linear unit have completely flat portions. For this reason, a
leaky rectified linear unit, such as

\[
f(x) = \begin{cases} 0.01x, & \text{for } x \le 0, \\ x, & \text{for } x > 0, \end{cases}
\]

is sometimes preferred, in order to force a nonzero derivative for negative
inputs. The back propagation algorithm described in section 5 carries
through to general activation functions.
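To make the saturation argument concrete, the activation functions mentioned here and their derivatives can be written out as below. This is a sketch with our own function names; the slope 0.01 matches the leaky example in the text, and the derivative values chosen at x = 0 are a convention we assume.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # becomes tiny for large |x| (saturation)

def step(x):
    return 1.0 if x > 0 else 0.0

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0  # completely flat for negative inputs

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_derivative(x, slope=0.01):
    return 1.0 if x > 0 else slope  # nonzero for negative inputs
```

For instance, at x = -2 the rectified linear unit has derivative 0, whereas the leaky variant has derivative 0.01, so a gradient update can still pass through.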
How do we decide on the structure of our net? Often, there is a natu-
ral choice for the size of the output layer. For example, to classify images
of individual handwritten digits, it would make sense to have an output
layer consisting of ten neurons, corresponding to 0, 1, 2, . . . , 9, as used in
Chapter 1 of [26]. In some cases, a physical application imposes natural
constraints on one or more of the hidden layers [16]. However, in gen-
eral, choosing the overall number of layers, the number of neurons within
each layer, and any constraints involving inter-neuron connections, is not
an exact science. Rules of thumb have been suggested, but there is no
widely accepted technique. In the context of image processing, it may
be possible to attribute roles to different layers; for example, detecting
edges, motifs and larger structures as information flows forward [22], and
our understanding of biological neurons provides further insights [10]. But
specific roles cannot be completely hardwired into the network design: the
weights and biases, and hence the tasks performed by each layer, emerge
from the training procedure. We note that the use of back propagation to
compute gradients is not restricted to the types of connectivity, activation
functions and cost functions discussed here. Indeed, the method fits into a
very general framework of techniques known as automatic differentiation
or algorithmic differentiation [13].
How big do deep learning networks get? The AlexNet architecture [21] achieved
groundbreaking image classification results in 2012. This network used
650,000 neurons, with five convolutional layers followed by two fully con-
nected layers and a final softmax. The programme AlphaGo, developed
by the Google DeepMind team to play the board game Go, rose to fame
by beating the human European champion by five games to nil in Octo-
ber 2015 [30]. AlphaGo makes use of two artificial neural networks with
13 layers and 15 layers, some convolutional and others fully connected,
involving millions of weights.
Didn’t my numerical analysis teacher tell me never to use steepest descent?
It is known that the steepest descent method can perform poorly on ex-
amples where other methods, notably those using information about the
second derivative of the objective function, are much more efficient. Hence,
optimization textbooks typically downplay steepest descent [9, 27]. How-
ever, it is important to note that training an artificial neural network is a
very specific optimization task:
• the problem dimension and the expense of computing the objective
function and its derivatives, can be extremely high,
• the optimization task is set within a framework that is inherently
statistical in nature,
• a great deal of research effort has been devoted to the development
of practical improvements to the basic stochastic gradient method in
the deep learning context.
Currently, a theoretical underpinning for the success of the stochastic gra-
dient method in training networks is far from complete [3]. A promising
line of research is to connect the stochastic gradient method with dis-
cretizations of stochastic differential equations, [31], generalizing the idea
that many deterministic optimization methods can be viewed as timestep-
ping methods for gradient ODEs, [17]. We also note that the introduction
of more traditional tools from the field of optimization may lead to im-
proved training algorithms.
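The timestepping view mentioned above can be illustrated on a one-dimensional example: steepest descent with step size h is the forward Euler method applied to the gradient ODE x'(t) = -∇f(x(t)). The objective f(x) = x²/2 and all names below are our own illustrative choices, not taken from [17] or [31].

```python
import math

def gradient_descent(grad, x0, step, iterations):
    """Steepest descent x_{k+1} = x_k - step * grad(x_k).

    This coincides with forward Euler, timestep `step`, for x'(t) = -grad(x(t)).
    """
    x = x0
    for _ in range(iterations):
        x = x - step * grad(x)
    return x

# Illustrative objective f(x) = x^2 / 2, so grad f(x) = x and the gradient ODE
# x' = -x has exact solution x(t) = x0 * exp(-t).
x_numeric = gradient_descent(lambda x: x, x0=1.0, step=0.01, iterations=100)
x_exact = math.exp(-1.0)  # ODE solution at time t = 100 * 0.01 = 1
euler_error = abs(x_numeric - x_exact)
```

With a small step the iterates track the ODE solution closely; the gap shrinks as the step size is reduced, as expected for a first-order method.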

Is it possible to regularize? As we discussed in section 7, overfitting occurs
when a trained network performs accurately on the given data, but cannot
generalize well to new data. Regularization is a broad term that describes
attempts to avoid overfitting by rewarding smoothness. One approach is
to alter the cost function in order to encourage small weights. For example,
(9) could be extended to

\[
\mathrm{Cost} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \left\| y(x^{\{i\}}) - a^{[L]}(x^{\{i\}}) \right\|_2^2 + \frac{\lambda}{N} \sum_{l=2}^{L} \left\| W^{[l]} \right\|_2^2 . \tag{33}
\]

Here λ > 0 is the regularization parameter. One motivation for (33) is
that large weights may lead to neurons that are sensitive to their inputs,
and hence less reliable when new data is presented. This argument does
not apply to the biases, which typically are not included in such a regu-
larization term. It is straightforward to check that using (33) instead of
(9) makes a very minor and inexpensive change to the back propagation
algorithm.
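A minimal sketch of the regularized cost (33) follows. It assumes that the norm of each weight matrix is taken as the sum of its squared entries, which is our reading and not something the text spells out; the function names and the tiny example are also ours. Under that reading, the only change to back propagation is an extra term (2λ/N) W in the gradient with respect to each weight matrix W.

```python
def regularized_cost(targets, outputs, weight_matrices, lam):
    """Cost (33): mean of half squared errors plus (lam / N) * sum of squared weights.

    targets, outputs: lists of N vectors; weight_matrices: list of matrices
    (lists of rows) for layers 2..L. Biases are intentionally not penalized.
    """
    n = len(targets)
    data_term = sum(
        0.5 * sum((y - a) ** 2 for y, a in zip(y_vec, a_vec))
        for y_vec, a_vec in zip(targets, outputs)
    ) / n
    penalty = sum(w * w for W in weight_matrices for row in W for w in row)
    return data_term + lam / n * penalty

def penalty_gradient(W, lam, n):
    """Extra term added to the gradient with respect to W: (2 * lam / N) * W."""
    return [[2.0 * lam / n * w for w in row] for row in W]

# Tiny made-up example: one data point, one 1 x 2 weight matrix.
cost = regularized_cost([[1.0, 0.0]], [[0.0, 0.0]], [[[1.0, 2.0]]], lam=0.5)
extra = penalty_gradient([[1.0, 2.0]], lam=0.5, n=1)
```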
What about ethics and accountability? The use of “algorithms” to aid
decision-making is not a recent phenomenon. However, the increasing
influence of black-box technologies is naturally causing concerns in many
quarters. The recent articles [7, 14] raise several relevant issues and il-
lustrate them with concrete examples. They also highlight the particular
challenges arising from massively-parameterized artificial neural networks.
Professional and governmental institutions are, of course, alert to these
matters. In 2017, the Association for Computing Machinery’s US Pub-
lic Policy Council released seven Principles for Algorithmic Transparency
and Accountability.¹ Among their recommendations are that
• “Systems and institutions that use algorithmic decision-making are
encouraged to produce explanations regarding both the procedures
followed by the algorithm and the specific decisions that are made”,
and
• “A description of the way in which the training data was collected
should be maintained by the builders of the algorithms, accompanied
by an exploration of the potential biases induced by the human or
algorithmic data-gathering process.”
Article 15 of the European Union’s General Data Protection Regulation
2016/679,² which takes effect in May 2018, concerns “Right of access
by the data subject,” and includes the requirement that “The data subject
shall have the right to obtain from the controller confirmation as to
whether or not personal data concerning him or her are being processed,
and, where that is the case, access to the personal data and the following
information:”. Item (h) on the subsequent list covers
• “the existence of automated decision-making, including profiling,
referred to in Article 22(1) and (4) and, at least in those cases,
meaningful information about the logic involved, as well as the significance
and the envisaged consequences of such processing for the data
subject.”

¹ https://www.acm.org/
² https://www.privacy-regulation.eu/en/15.htm

What are some current research topics? Deep learning is a fast-moving,
high-bandwidth field, where many new advances are driven by the needs
of specific application areas and the features of new high performance
computing architectures. Here, we briefly mention three hot-topic areas
that have not yet been discussed.
Training a network can be an extremely expensive task. When a trained
network is seen to make a mistake on new data, it is therefore tempting to
fix this with a local perturbation to the weights and/or network structure,
rather than re-training from scratch. Approaches for this type of on the
fly tuning can be developed and justified using the theory of measure
concentration in high dimensional spaces [12].
Adversarial networks [11] are based on the concept that an artificial neural
network may be viewed as a generative model: a way to create realistic
data. Such a model may be useful, for example, as a means to produce
realistic sentences, or very high resolution images. In the adversarial set-
ting, the generative model is pitted against a discriminative model. The
role of the discriminative model is to distinguish between real training
data and data produced by the generative model. By iteratively improv-
ing the performance of these models, the quality of both the generation
and discrimination can be increased dramatically.
The idea behind autoencoders [28] is, perhaps surprisingly, to produce
an overall network whose output matches its input. More precisely, one
network, known as the encoder, corresponds to a map $F$ that takes an
input vector, $x \in \mathbb{R}^s$, and produces a lower dimensional output vector
$F(x) \in \mathbb{R}^t$. So $t \ll s$. Then a second network, known as the decoder,
corresponds to a map $G$ that takes us back to the same dimension as $x$;
that is, $G(F(x)) \in \mathbb{R}^s$. We could then aim to minimize the sum of the
squared error $\| x - G(F(x)) \|_2^2$ over a set of training data. Note that this
technique does not require the use of labelled data—in the case of images
we are attempting to reproduce each picture without knowing what it
depicts. Intuitively, a good encoder is a tool for dimension reduction. It
extracts the key features. Similarly, a good decoder can reconstruct the
data from those key features.
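The reconstruction objective can be sketched with a deliberately simple stand-in: here the encoder F and decoder G are single linear maps rather than trained networks, which is an assumption made purely to keep the example short. The data, matrices and function names are all made up for illustration.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def reconstruction_error(data, encoder, decoder):
    """Sum over the data of ||x - G(F(x))||_2^2 for linear maps F and G."""
    total = 0.0
    for x in data:
        code = matvec(encoder, x)      # F(x), a vector of length t
        x_hat = matvec(decoder, code)  # G(F(x)), a vector of length s
        total += sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
    return total

# Illustrative data with s = 2 lying on the line x2 = 2 * x1, so t = 1 suffices.
data = [[1.0, 2.0], [-0.5, -1.0], [3.0, 6.0]]
encoder = [[1.0, 0.0]]         # F keeps the first coordinate
good_decoder = [[1.0], [2.0]]  # G rebuilds both coordinates from it
poor_decoder = [[1.0], [0.0]]  # G that ignores the structure of the data
good_error = reconstruction_error(data, encoder, good_decoder)
poor_error = reconstruction_error(data, encoder, poor_decoder)
```

No labels appear anywhere in this computation: the data itself supplies the target, and a well-matched encoder and decoder pair drives the error to zero on this toy set.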
Where can I find code and data? There are many publicly available codes
that provide access to deep learning algorithms. In addition to MatCon-
vNet [33], we mention Caffe [19], Keras [5], TensorFlow [1], Theano [2]
and Torch [6]. These packages differ in their underlying platforms and in
the extent of expert knowledge required. Your favorite scientific comput-
ing environment may also offer a range of proprietary and user-contributed
deep learning toolboxes. However, it is currently the case that making se-
rious use of modern deep learning technology requires a strong background
in numerical computing. Among the standard benchmark data sets are
the CIFAR-10 collection [20] that we used in section 7, and its big sibling
CIFAR-100, ImageNet [8], and the handwritten digit database MNIST
[23].

Acknowledgements
We are grateful to the MatConvNet team for making their package available
under a permissive BSD license. The MATLAB code in Listings 6.1 and 6.2 can
be found at
http://personal.strath.ac.uk/d.j.higham/algfiles.html
as well as an extended version that produces Figures 7 and 8, and a MATLAB
code that uses lsqnonlin to produce Figure 4.

References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean,
M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur,
J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner,
P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and
X. Zheng, TensorFlow: A system for large-scale machine learning, in
12th USENIX Symposium on Operating Systems Design and Implemen-
tation (OSDI 16), 2016, pp. 265–283.
[2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bah-
danau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Be-
lopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson,
J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski,
X. Bouthillier, A. de Brébisson, O. Breuleux, P.-L. Carrier,
K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Côté,
M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. De-
mouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Du-
moulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Ger-
main, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre,
P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari,
A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb,
P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois,
S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz,
J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T.
McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski,
M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raf-
fel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski,
J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz,
I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann,
S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van
Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries,
D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao,
S. Zhang, and Y. Zhang, Theano: A Python framework for fast compu-
tation of mathematical expressions, arXiv e-prints, abs/1605.02688 (2016).
[3] L. Bottou, F. Curtis, and J. Nocedal, Optimization methods for
large-scale machine learning, arXiv:1606.04838, version 2, (2017).
[4] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, Adversarial
patch, arXiv:1712.09665 [cs.CV], (2017).
[5] F. Chollet et al., Keras, GitHub, (2015).
[6] R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A Matlab-
like environment for machine learning, in BigLearn, NIPS Workshop, 2011.
[7] J. H. Davenport, The debate about algorithms, Mathematics Today,
(2017), p. 162.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, Im-
ageNet: A large-scale hierarchical image database., in CVPR, IEEE Com-
puter Society, 2009, pp. 248–255.
[9] R. Fletcher, Practical Methods of Optimization, Wiley, Chichester, sec-
ond ed., 1987.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT
Press, Boston, 2016.
[11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
Farley, S. Ozair, A. C. Courville, and Y. Bengio, Generative ad-
versarial nets, in Advances in Neural Information Processing Systems 27,
Montreal, Canada, 2014, pp. 2672–2680.
[12] A. N. Gorban and I. Y. Tyukin, Stochastic separation theorems, Neural
Networks, 94 (2017), pp. 255–259.
[13] A. Griewank and A. Walther, Evaluating Derivatives: Principles and
Techniques of Algorithmic Differentiation, Society for Industrial and Ap-
plied Mathematics, Philadelphia, second ed., 2008.
[14] P. Grindrod, Beyond privacy and exposure: ethical issues within citizen-
facing analytics, Phil. Trans. of the Royal Society A, 374 (2016), p. 2083.
[15] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better:
Stability of stochastic gradient descent, in Proceedings of the 33rd Interna-
tional Conference on Machine Learning, 2016, pp. 1225–1234.
[16] C. F. Higham, R. Murray-Smith, M. J. Padgett, and M. P. Edgar,
Deep learning for real-time single-pixel video, Scientific Reports, (to ap-
pear).

[17] D. J. Higham, Trust region algorithms and timestep selection, SIAM Jour-
nal on Numerical Analysis, 37 (1999), pp. 194–210.
[18] D. J. Higham and N. J. Higham, MATLAB Guide, Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, third ed., 2017.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
shick, S. Guadarrama, and T. Darrell, Caffe: Convolutional archi-
tecture for fast feature embedding, arXiv preprint arXiv:1408.5093, (2014).
[20] A. Krizhevsky, Learning multiple layers of features from tiny images,
tech. rep., 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classi-
fication with deep convolutional neural networks, in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, eds., 2012, pp. 1097–1105.
[22] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521
(2015), pp. 436–444.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based
learning applied to document recognition, Proceedings of the IEEE, 86
(1998), pp. 2278–2324.
[24] S. Mallat, Understanding deep convolutional networks, Philosophical
Transactions of the Royal Society of London A, 374 (2016), p. 20150203.
[25] G. Marcus, Deep learning: A critical appraisal, arXiv:1801.00631 [cs.AI],
(2018).
[26] M. Nielsen, Neural Networks and Deep Learning, Determination Press,
2015.
[27] J. Nocedal and S. J. Wright, Numerical Optimization, Springer,
Berlin, second ed., 2006.
[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel dis-
tributed processing: Explorations in the microstructure of cognition, vol. 1,
MIT Press, Cambridge, MA, USA, 1986, ch. Learning Internal Represen-
tations by Error Propagation, pp. 318–362.
[29] J. Schmidhuber, Deep learning in neural networks: An overview, Neural
Networks, 61 (2015), pp. 85–117.
[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre,
G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Pan-
neershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham,
N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach,
K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game
of Go with deep neural networks and tree search, Nature, 529 (2016),
pp. 484–489.
[31] J. Sirignano and K. Spiliopoulos, Stochastic gradient descent in con-
tinuous time, SIAM J. Finan. Math., 8 (2017), pp. 933–961.
[32] J. Su, D. V. Vargas, and K. Sakurai, One pixel attack for fooling deep
neural networks, arXiv:1710.08864 [cs.LG], (2017).
[33] A. Vedaldi and K. Lenc, MatConvNet: Convolutional neural networks
for MATLAB, in ACM International Conference on Multimedia, Brisbane,
2015, pp. 689–692.
[34] R. Vidal, R. Giryes, J. Bruna, and S. Soatto, Mathematics of deep
learning, Proc. of the Conf. on Decision and Control (CDC), (2017).
[35] H. Wang and B. Raj, On the origin of deep learning, arXiv:1702.07800
[cs.LG], (2017).
[36] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Un-
derstanding deep learning requires rethinking generalization, in 5th Inter-
national Conference on Learning Representations, 2017.

Deep Learning for Sentiment Analysis: A Survey
Lei Zhang, LinkedIn Corporation, lzhang32@gmail.com
Shuai Wang, University of Illinois at Chicago, shuaiwanghk@gmail.com
Bing Liu, University of Illinois at Chicago, liub@uic.edu

Abstract
Deep learning has emerged as a powerful machine learning technique that learns multiple layers of
representations or features of the data and produces state-of-the-art prediction results. Along with
the success of deep learning in many other application domains, deep learning is also popularly used
in sentiment analysis in recent years. This paper first gives an overview of deep learning and then
provides a comprehensive survey of its current applications in sentiment analysis.

INTRODUCTION
Sentiment analysis or opinion mining is the computational study of people’s opinions, sentiments,
emotions, appraisals, and attitudes towards entities such as products, services, organizations,
individuals, issues, events, topics, and their attributes.1 The inception and rapid growth of the field
coincide with those of the social media on the Web, for example, reviews, forum discussions, blogs,
micro-blogs, Twitter, and social networks, because for the first time in human history, we have a
huge volume of opinionated data recorded in digital forms. Since early 2000, sentiment analysis has
grown to be one of the most active research areas in natural language processing (NLP). It is also
widely studied in data mining, Web mining, text mining, and information retrieval. In fact, it has
spread from computer science to management sciences and social sciences such as marketing,
finance, political science, communications, health science, and even history, due to its importance to
business and society as a whole. This proliferation is due to the fact that opinions are central to
almost all human activities and are key influencers of our behaviours. Our beliefs and perceptions of
reality, and the choices we make, are, to a considerable degree, conditioned upon how others see
and evaluate the world. For this reason, whenever we need to make a decision we often seek out
the opinions of others. This is not only true for individuals but also true for organizations.
Nowadays, if one wants to buy a consumer product, one is no longer limited to asking one’s friends
and family for opinions because there are many user reviews and discussions about the product in
public forums on the Web. For an organization, it may no longer be necessary to conduct surveys,
opinion polls, and focus groups in order to gather public opinions because there is an abundance of
such information publicly available. In recent years, we have witnessed that opinionated postings in
social media have helped reshape businesses and sway public sentiments and emotions, which has
profoundly affected our social and political systems. Such postings have also mobilized masses
for political changes, such as those that happened in some Arab countries in 2011. It has thus become a
necessity to collect and study opinions.1
However, finding and monitoring opinion sites on the Web and distilling the information contained
in them remains a formidable task because of the proliferation of diverse sites. Each site typically
contains a huge volume of opinion text that is not always easily deciphered in long blogs and forum
postings. The average human reader will have difficulty identifying relevant sites and extracting and
summarizing the opinions in them. Automated sentiment analysis systems are thus needed. Because
of this, there are many start-ups focusing on providing sentiment analysis services. Many big
corporations have also built their own in-house capabilities. These practical applications and
industrial interests have provided strong motivations for research in sentiment analysis.
Existing research has produced numerous techniques for various tasks of sentiment analysis, which
include both supervised and unsupervised methods. In the supervised setting, early papers used all
types of supervised machine learning methods (such as Support Vector Machines (SVM), Maximum
Entropy, Naïve Bayes, etc.) and feature combinations. Unsupervised methods include various
methods that exploit sentiment lexicons, grammatical analysis, and syntactic patterns. Several
survey books and papers have been published, which cover those early methods and applications
extensively.1,2,3
Since about a decade ago, deep learning has emerged as a powerful machine learning technique4
and produced state-of-the-art results in many application domains, ranging from computer vision
and speech recognition to NLP. Applying deep learning to sentiment analysis has also become very
popular recently. This paper first gives an overview of deep learning and then provides a
comprehensive survey of the sentiment analysis research based on deep learning.

NEURAL NETWORKS
Deep learning is the application of artificial neural networks (neural networks for short) to learning
tasks using networks of multiple layers. It can exploit much more learning (representation) power of
neural networks, which once were deemed to be practical only with one or two layers and a small
amount of data.

Inspired by the structure of the biological brain, neural networks consist of a large number of
information processing units (called neurons) organized in layers, which work in unison. They can learn
to perform tasks (e.g., classification) by adjusting the connection weights between neurons,
resembling the learning process of a biological brain.


Figure 1: Feedforward neural network

Based on network topologies, neural networks can generally be categorized into feedforward neural
networks and recurrent/recursive neural networks, which can also be mixed and matched. We will
describe recurrent/recursive neural networks later. A simple example of a feedforward neural
network is given in Figure 1, which consists of three layers $L_1$, $L_2$ and $L_3$. $L_1$ is the input layer, which
corresponds to the input vector $(x_1, x_2, x_3)$ and intercept term +1. $L_3$ is the output layer, which
corresponds to the output vector $(s_1)$. $L_2$ is the hidden layer, whose output is not visible as a
network output. A circle in $L_1$ represents an element in the input vector, while a circle in $L_2$ or $L_3$
represents a neuron, the basic computation element of a neural network. Each neuron applies an
activation function. A line between two neurons represents a connection for the flow of information.
Each connection is associated with a weight, a value controlling the signal between two neurons.
The learning of a neural network is achieved by adjusting the weights between neurons with the
information flowing through them. Neurons read output from neurons in the previous layer, process
the information, and then generate output to neurons in the next layer. As in Figure 1, the neural
network alters weights based on training examples $(x^{(i)}, y^{(i)})$. After the training process, it will
obtain a complex form of hypotheses $h_{W,b}(x)$ that fits the data.

Diving into the hidden layer, we can see that each neuron in $L_2$ takes input $x_1, x_2, x_3$ and intercept
+1 from $L_1$, and outputs a value $f(W^{T}x) = f(\sum_{i=1}^{3} W_i x_i + b)$ by the activation function $f$. $W_i$ are
weights of the connections; $b$ is the intercept or bias; $f$ is normally non-linear. The common choices
of $f$ are the sigmoid function, the hyperbolic tangent function (tanh), or the rectified linear function (ReLU).
Their equations are as follows.

$$f(W^{T}x) = \mathrm{sigmoid}(W^{T}x) = \frac{1}{1 + \exp(-W^{T}x)} \qquad (1)$$

$$f(W^{T}x) = \tanh(W^{T}x) = \frac{e^{W^{T}x} - e^{-W^{T}x}}{e^{W^{T}x} + e^{-W^{T}x}} \qquad (2)$$

$$f(W^{T}x) = \mathrm{ReLU}(W^{T}x) = \max(0, W^{T}x) \qquad (3)$$

The sigmoid function takes a real-valued number and squashes it to a value in the range between 0
and 1. The function has been in frequent use historically due to its nice interpretation as the firing
rate of a neuron: 0 for not firing or 1 for firing. But the sigmoid non-linearity has recently
fallen out of favour because its activations can easily saturate at either tail (near 0 or 1), where gradients
are almost zero and the information flow would be cut. In addition, its output is not zero-
centered, which could introduce undesirable zig-zagging dynamics in the gradient updates for the
connection weights in training. Thus, the tanh function is often preferred in practice as its
output range is zero-centered, [-1, 1] instead of [0, 1]. The ReLU function has also become popular
lately. Its activation is simply thresholded at zero when the input is less than 0. Compared with the
sigmoid function and the tanh function, ReLU is easy to compute, fast to converge in training and
yields equal or better performance in neural networks.5
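
As a minimal sketch of Equations (1) to (3), the three activation functions can be written directly in NumPy. The helper name sigmoid_grad and the sample inputs are illustrative assumptions, added only to show how the sigmoid gradient nearly vanishes at both tails.

```python
import numpy as np

def sigmoid(z):
    """Equation (1): squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Equation (2): zero-centered output in the range (-1, 1)."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    """Equation (3): thresholds the input at zero."""
    return np.maximum(0.0, z)

def sigmoid_grad(z):
    """Derivative of the sigmoid (hypothetical helper): s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, 0.0, 10.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("relu:   ", relu(z))
# The gradient is largest at z = 0 and close to zero at both tails (saturation).
print("sigmoid gradient:", sigmoid_grad(z))
```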

In $L_3$, we can use the softmax function as the output neuron, which is a generalization of the logistic
function that squashes a K-dimensional vector $X$ of arbitrary real values to a K-dimensional vector
$\sigma(X)$ of real values in the range (0, 1) that add up to 1. The function definition is as follows.

$$\sigma(X)_j = \frac{e^{X_j}}{\sum_{k=1}^{K} e^{X_k}} \quad \text{for } j = 1, \ldots, K \qquad (4)$$

Generally, softmax is used in the final layer of neural networks for final classification in feedforward
neural networks.
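
A small sketch of Equation (4) follows. Subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the text above; it leaves the result unchanged because the same factor cancels in the numerator and the denominator.

```python
import numpy as np

def softmax(x):
    """Equation (4): map a K-dimensional real vector to probabilities summing to 1."""
    shifted = x - np.max(x)      # assumption: stability trick, does not change the output
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 3.0])
probs = softmax(scores)
print(probs, probs.sum())
```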

By connecting together all neurons, the neural network in Figure 1 has parameters $(W, b) =
(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where $W_{ij}^{(l)}$ denotes the weight associated with the connection between
neuron $j$ in layer $l$, and neuron $i$ in layer $l + 1$. $b_i^{(l)}$ is the bias associated with neuron $i$ in layer $l + 1$.

To train a neural network, stochastic gradient descent via backpropagation6 is usually employed to
minimize the cross-entropy loss, which is a loss function for softmax output. Gradients of the loss
function with respect to weights from the last hidden layer to the output layer are first calculated,
and then gradients of the expressions with respect to weights between upper network layers are
calculated recursively by applying the chain rule in a backward manner. With those gradients, the
weights between layers are adjusted accordingly. It is an iterative refinement process until certain
stopping criteria are met. The pseudo code for training the neural network in Figure 1 is as follows.
Training algorithm: stochastic gradient descent via backpropagation

Initialize weights $W$ and biases $b$ of the neural network $N$ with random values
do
    for each training example $(x_i, y_i)$
        $p_i$ = neural-network-prediction($N$, $x_i$)
        calculate gradients of loss function ($p_i$, $y_i$) with respect to $w^{(2)}$ at layer $L_3$
        get $\Delta w^{(2)}$ for all weights from hidden layer $L_2$ to output layer $L_3$
        calculate gradient with respect to $w^{(1)}$ by chain rule at layer $L_2$
        get $\Delta w^{(1)}$ for all weights from input layer $L_1$ to hidden layer $L_2$
        update ($w^{(1)}$, $w^{(2)}$)
until all training examples are classified correctly or other stopping criteria are met
return the trained neural network

Table 1: Training the neural network in Figure 1.

The above algorithm can be extended to generic feedforward neural network training with multiple
hidden layers. Note that stochastic gradient descent updates the parameters after every training
example, as opposed to once per pass over the whole set of training examples in batch gradient descent. Therefore, the
parameter updates have a high variance and cause the loss function to fluctuate with varying
intensity, which helps discover new and possibly better local minima.
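
The training loop in Table 1 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the hidden layer has four sigmoid neurons, the single output uses a sigmoid with binary cross-entropy (the two-class special case of softmax with cross-entropy), and the toy data, learning rate and function names are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Prediction for one example: input layer -> hidden layer -> output."""
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)          # hidden layer activations
    p = sigmoid(W2 @ h + b2)          # predicted probability of class 1
    return h, p

def mean_loss(params, X, Y):
    """Average binary cross-entropy over a data set."""
    p = np.array([forward(params, x)[1] for x in X])
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(Y * np.log(p) + (1.0 - Y) * np.log(1.0 - p)))

def train_sgd(X, Y, n_hidden=4, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, y = X[i], Y[i]
            h, p = forward((W1, b1, W2, b2), x)
            # Gradient at the output layer (sigmoid + cross-entropy gives p - y).
            delta_out = p - y
            # Chain rule: propagate the error back to the hidden layer.
            delta_hid = delta_out * W2 * h * (1.0 - h)
            # Update hidden-to-output weights, then input-to-hidden weights.
            W2 = W2 - lr * delta_out * h
            b2 = b2 - lr * delta_out
            W1 = W1 - lr * np.outer(delta_hid, x)
            b1 = b1 - lr * delta_hid
    return W1, b1, W2, b2

# Hypothetical toy data: three features, label 1 when the features sum to more than 0.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
Y = (X.sum(axis=1) > 0).astype(float)

loss_before = mean_loss(train_sgd(X, Y, epochs=0), X, Y)   # random initial weights
loss_after = mean_loss(train_sgd(X, Y, epochs=20), X, Y)
print(loss_before, loss_after)
```

Shuffling the examples in every epoch is a common convention rather than something Table 1 prescribes.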

DEEP LEARNING
The research community lost interest in neural networks in the late 1990s, mainly because they were
regarded as only practical for “shallow” neural networks (neural networks with one or two layers), as
training a “deep” neural network (a neural network with more layers) is complicated and
computationally very expensive. However, in the past 10 years, deep learning has made breakthroughs
and produced state-of-the-art results in many application domains, starting from computer vision,
then speech recognition, and more recently, NLP.7,8 The renaissance of neural networks can be
attributed to many factors. The most important ones include: (1) the availability of computing power
due to the advances in hardware (e.g., GPUs), (2) the availability of huge amounts of training data,
and (3) the power and flexibility of learning intermediate representations.9

In a nutshell, deep learning uses a cascade of multiple layers of nonlinear processing units for
feature extraction and transformation. The lower layers close to the data input learn simple features,
while higher layers learn more complex features derived from lower layer features. The architecture
forms a hierarchical and powerful feature representation. Figure 2 shows the feature hierarchy from
the left (a lower layer) to the right (a higher layer) learned by deep learning in face image
classification.10 We can see that the learned image features grow in complexity, starting from
blobs/edges, then noses/eyes/cheeks, to faces.


Figure 2: Feature hierarchy by deep learning
In recent years, deep learning models have been extensively applied in the field of NLP and have shown
great potential. In the following several sections, we briefly describe the main deep learning
architectures and related techniques that have been applied to NLP tasks.

WORD EMBEDDING
Many deep learning models in NLP need word embedding results as input features.7 Word
embedding is a technique for language modelling and feature learning, which transforms words in a
vocabulary to vectors of continuous real numbers (e.g., the word “hat” → (…, 0.15, …, 0.23, …, 0.41, …)).
The technique normally involves a mathematical
embedding from a high-dimensional sparse vector space (e.g., one-hot encoding vector space, in
which each word takes a dimension) to a lower-dimensional dense vector space. Each dimension of
the embedding vector represents a latent feature of a word. The vectors may encode linguistic
regularities and patterns.

The learning of word embeddings can be done using neural networks11-15 or matrix factorization.16,17
One commonly used word embedding system is Word2Vec[i], which is essentially a computationally
efficient neural network prediction model that learns word embeddings from text. It contains the
Continuous Bag-of-Words model (CBOW)13 and the Skip-Gram model (SG)14. The CBOW model predicts
the target word (e.g., “wearing”) from its context words (“the boy is _ a hat”, where “_” denotes the
target word), while the SG model does the inverse, predicting the context words given the target
word. Statistically, the CBOW model smooths over a great deal of distributional information by
treating the entire context as one observation. It is effective for smaller datasets. The SG
model, by contrast, treats each context-target pair as a new observation and is better for larger datasets.
Another frequently used learning approach is Global Vectors[ii] (GloVe)17, which is trained on the non-
zero entries of a global word-word co-occurrence matrix.
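
To make the difference between the two training schemes concrete, the sketch below only builds the training observations for each model; it does not train any embeddings. The function names and the fixed context window of two words on each side are assumptions for illustration.

```python
def cbow_examples(tokens, window=2):
    """CBOW: one observation per position, (all context words, target word)."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-Gram: one observation per (target word, single context word) pair."""
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]

tokens = "the boy is wearing a hat".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow[3])                    # context words and target for "wearing"
print(len(cbow), len(skipgram))   # Skip-Gram yields more observations from the same text
```

Because Skip-Gram splits every context into separate pairs, the same sentence produces more training observations than CBOW, which matches the remark above about dataset size.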

AUTOENCODER AND DENOISING AUTOENCODER


An autoencoder neural network is a three-layer neural network that sets the target values to be
equal to the input values. Figure 3 shows an example of an autoencoder architecture.


Figure 3: Autoencoder neural network


[i] Source code: https://code.google.com/archive/p/word2vec/
[ii] Source code: https://github.com/stanfordnlp/GloVe
Given the input vector $x \in [0,1]^{d}$, the autoencoder first maps it to a hidden representation
$y \in [0,1]^{d'}$ by an encoder function $h(\cdot)$ (e.g., the sigmoid function). The latent representation $y$ is
then mapped back by a decoder function $g(\cdot)$ into a reconstruction $r(x) = g(h(x))$. The
autoencoder is typically trained to minimize a form of reconstruction error $loss(x, r(x))$. The
objective of the autoencoder is to learn a representation of the input, which is the activation of the
hidden layer. Due to the nonlinear functions $h(\cdot)$ and $g(\cdot)$, the autoencoder is able to learn non-
linear representations, which give it much more expressive power than its linear counterparts, such
as Principal Component Analysis (PCA) or Latent Semantic Analysis (LSA).

One often stacks autoencoders into layers. A higher level autoencoder uses the output of the lower
one as its training data. Stacked autoencoders18, along with Restricted Boltzmann Machines
(RBMs)19, were among the earliest approaches to building deep neural networks. Once a stack of autoencoders
has been trained in an unsupervised fashion, their parameters describing multiple levels of
representations for 𝑥 (intermediate representations) can be used to initialize a supervised deep
neural network, which has been shown empirically better than random parameter initialization.

The Denoising Autoencoder (DAE)20 is an extension of the autoencoder, in which the input vector $x$ is
stochastically corrupted into a vector $\tilde{x}$. The model is then trained to denoise it, that is, to minimize a
denoising reconstruction error $loss(x, r(\tilde{x}))$. The idea behind DAE is to force the hidden layer to
discover more robust features and prevent it from simply learning the identity. A robust model
should be able to reconstruct the input well even in the presence of noise. For example, deleting or
adding a few words from or to a document should not change the semantics of the document.
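
The following sketch shows the forward computation of an autoencoder and the denoising variant with untrained random weights. The layer sizes, the squared-error loss and the masking noise (randomly zeroing input components) are assumptions chosen for illustration; other corruption schemes and losses are possible.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_hidden = 8, 3                       # assumed sizes of the input and hidden layers
W_enc = rng.normal(scale=0.1, size=(d_hidden, d))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d, d_hidden))
b_dec = np.zeros(d)

def h(x):
    """Encoder: input vector -> hidden representation y."""
    return sigmoid(W_enc @ x + b_enc)

def g(y):
    """Decoder: hidden representation -> reconstruction."""
    return sigmoid(W_dec @ y + b_dec)

def r(x):
    """Reconstruction r(x) = g(h(x))."""
    return g(h(x))

def corrupt(x, drop_prob=0.3):
    """Masking noise (assumed): zero each component with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def loss(x, reconstruction):
    """Squared reconstruction error (one possible choice of loss)."""
    return float(np.sum((x - reconstruction) ** 2))

x = rng.random(d)                        # input in [0, 1]^d
x_tilde = corrupt(x)                     # stochastically corrupted input
plain_loss = loss(x, r(x))               # autoencoder objective
denoising_loss = loss(x, r(x_tilde))     # DAE objective: target is the clean x
print(plain_loss, denoising_loss)
```

The key point is in the last lines: the denoising loss feeds the corrupted vector through the network but still compares the reconstruction against the clean input.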

CONVOLUTIONAL NEURAL NETWORK


A Convolutional Neural Network (CNN) is a special type of feedforward neural network originally
employed in the field of computer vision. Its design is inspired by the visual cortex, the visual
mechanism in the animal brain. The visual cortex contains many cells that are responsible for detecting
light in small and overlapping sub-regions of the visual field, which are called receptive fields. These
cells act as local filters over the input space. A CNN consists of multiple convolutional layers, each of
which plays the role performed by the cells in the visual cortex.

Figure 4 shows a CNN for recognizing traffic signs.21 The input is a 32x32x1 pixel image (32 x 32
represents image width x height; 1 represents the input channel). In the first stage, a filter (size 5x5x1)
is used to scan the image. Each region in the input image that the filter projects onto is a receptive
field. The filter is an array of numbers (called weights or parameters). As the filter slides
(or convolves) over the image, it multiplies its weight values with the original pixel values of the image (element-
wise multiplication). The products are summed to a single number, which
represents the receptive field. Every receptive field produces one number. After the filter
finishes scanning the image, we get an array (size 28x28x1), which is called the activation
map or feature map. In a CNN, we use different filters to scan the input. In Figure 4, we apply
108 different filters and thus have 108 stacked feature maps in the first stage, which make up the
first convolutional layer. Following the convolutional layer, a subsampling (or pooling) layer is
usually used to progressively reduce the spatial size of the representation, and thus the
number of features and the computational complexity of the network. For example, after
subsampling in the first stage, the output of the convolutional layer is reduced to 14x14x108. Note
that while the dimensionality of each feature map is reduced, the subsampling step retains the most
important information; a commonly used subsampling operation is max pooling.
Afterwards, the output from the first stage becomes the input to the second stage, where new filters
are employed. The new filter size is 5x5x108, where 108 is the number of feature maps produced by the previous stage.
After the second stage, the CNN uses a fully connected layer and then a softmax readout layer with
output classes for classification.
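
The size arithmetic of the first stage (32x32 input, 5x5 filter, 28x28 feature map, 14x14 after subsampling) can be checked with a naive single-filter sketch. Stride 1 without padding and a 2x2 max-pooling window are assumptions that reproduce the sizes quoted above; the loops are written for clarity, not speed.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over the image (stride 1, no padding).

    Each output value is the sum of element-wise products between the filter
    weights and the pixels of one receptive field.
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    H, W = feature_map.shape
    trimmed = feature_map[:H - H % size, :W - W % size]
    blocks = trimmed.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((32, 32))           # one-channel 32x32 input
kernel = rng.normal(size=(5, 5))       # one 5x5 filter
feature_map = conv2d_valid(image, kernel)
pooled = max_pool(feature_map)
print(feature_map.shape, pooled.shape)
```

Repeating this with 108 different filters and stacking the results would give the 28x28x108 and 14x14x108 volumes described for Figure 4.
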
Convolutional layers in a CNN play the role of feature extractor, which extracts local features as they
restrict the receptive fields of the hidden layers to be local. This means that a CNN exploits spatially
local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. Such
a characteristic is useful for classification in NLP, in which we expect to find strong local clues
regarding class membership, but these clues can appear in different places in the input. For example,
in a document classification task, a single key phrase (or an n-gram) can help in determining the
topic of the document. We would like to learn that certain sequences of words are good indicators
of the topic, and do not necessarily care where they appear in the document. Convolutional and
pooling layers allow the CNN to learn to find such local indicators, regardless of their positions.8


Figure 4: Convolutional neural network

RECURRENT NEURAL NETWORK


Recurrent Neural Network (RNN)22 is a class of neural networks whose connections between
neurons form a directed cycle. Unlike feedforward neural networks, RNN can use its internal
“memory” to process a sequence of inputs, which makes it popular for processing sequential
information. The “memory” means that RNN performs the same task for every element of a
sequence with each output being dependent on all previous computations, which is like
“remembering” information about what has been processed so far.


Figure 5: Recurrent neural network

Figure 5 shows an example of an RNN. The left graph is a folded network with cycles, while the
right graph is the same network unfolded into a sequence with three time steps. The number of time steps is
determined by the length of input. For example, if the word sequence to be processed is a sentence
of six words, the RNN would be unfolded into a neural network with six time steps or layers. One
layer corresponds to a word.
In Figure 5, $x_t$ is the input vector at time step $t$. $h_t$ is the hidden state at time step $t$, which is
calculated based on the previous hidden state and the input at the current time step.

$$h_t = f(w^{hh} h_{t-1} + w^{hx} x_t) \qquad (5)$$

In Equation (5), the activation function $f$ is usually the tanh function or the ReLU function. $w^{hx}$ is the
weight matrix used to condition the input $x_t$. $w^{hh}$ is the weight matrix used to condition the
previous hidden state $h_{t-1}$.

$y_t$ is the output probability distribution over the vocabulary at step $t$. For example, if we want to
predict the next word in a sentence, it would be a vector of probabilities across the word vocabulary.

$$y_t = \mathrm{softmax}(w^{yh} h_t) \qquad (6)$$

The hidden state $h_t$ is regarded as the memory of the network. It captures information about what
happened in all previous time steps. $y_t$ is calculated solely based on the memory $h_t$ at time $t$ and the
corresponding weight matrix $w^{yh}$.

Unlike a feedforward neural network, which uses different parameters at each layer, RNN shares the
same parameters ($w^{hh}$, $w^{hx}$, $w^{yh}$) across all steps. This means that it performs the same task at
each step, just with different inputs. This greatly reduces the total number of parameters needed to
learn.
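
A forward pass through Equations (5) and (6) can be sketched as below. The dimensions, the random untrained weights and the choice of tanh for the activation are illustrative assumptions; the point is that the same three weight matrices are reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

input_dim, hidden_dim, vocab_size = 5, 4, 7    # assumed sizes
w_hx = rng.normal(scale=0.5, size=(hidden_dim, input_dim))    # conditions the input
w_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # conditions the previous state
w_yh = rng.normal(scale=0.5, size=(vocab_size, hidden_dim))   # maps the state to output scores

def rnn_forward(xs):
    """Run the RNN over a sequence, reusing the same parameters at each step."""
    h = np.zeros(hidden_dim)                   # initial hidden state
    hidden_states, outputs = [], []
    for x_t in xs:
        h = np.tanh(w_hh @ h + w_hx @ x_t)     # Equation (5)
        y_t = softmax(w_yh @ h)                # Equation (6)
        hidden_states.append(h)
        outputs.append(y_t)
    return hidden_states, outputs

xs = rng.normal(size=(6, input_dim))           # e.g., a six-word sentence
hidden_states, outputs = rnn_forward(xs)
print(len(hidden_states), outputs[-1].round(3))
```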

Theoretically, RNN can make use of the information in arbitrarily long sequences, but in practice, the
standard RNN is limited to looking back only a few steps due to the vanishing gradient or exploding
gradient problem.23


Figure 6: Bidirectional RNN (left) and deep bidirectional RNN (right)

Researchers have developed more sophisticated types of RNN to deal with the shortcomings of the
standard RNN model: Bidirectional RNN, Deep Bidirectional RNN and Long Short Term Memory
network. Bidirectional RNN is based on the idea that the output at each time may not only depend
on the previous elements in the sequence, but also depend on the next elements in the sequence.
For instance, to predict a missing word in a sequence, we may need to look at both the left and the
right context. A bidirectional RNN24 consists of two RNNs stacked on top of each other:
one processes the input in its original order and the other processes the reversed input
sequence. The output is then computed based on the hidden states of both RNNs. Deep bidirectional
RNN is similar to bidirectional RNN. The only difference is that it has multiple layers per time step,
which provides higher learning capacity but needs a lot of training data. Figure 6 shows examples of
bidirectional RNN and deep bidirectional RNN (with two layers) respectively.
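
As a rough sketch of the bidirectional idea, the code below runs one simple RNN over the sequence and a second one over the reversed sequence, then aligns the two sets of hidden states position by position. Concatenating the two states is an assumed way of combining them; the text above only says the output is computed from both. All names and sizes are hypothetical.

```python
import numpy as np

def run_rnn(xs, w_hx, w_hh):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(w_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(w_hh @ h + w_hx @ x_t)
        states.append(h)
    return np.stack(states)

def bidirectional_states(xs, fwd_params, bwd_params):
    """Combine a left-to-right pass with a right-to-left pass."""
    forward = run_rnn(xs, *fwd_params)
    # Process the reversed sequence, then flip back so row t matches input t.
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 4
fwd = (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
bwd = (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))

xs = rng.normal(size=(6, input_dim))
states = bidirectional_states(xs, fwd, bwd)
print(states.shape)    # each position now sees both its left and right context
```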

LSTM NETWORK
Long Short Term Memory network (LSTM)25 is a special type of RNN, which is capable of learning
long-term dependencies.

All RNNs have the form of a chain of repeating modules. In standard RNNs, this repeating module
normally has a simple structure. However, the repeating module for LSTM is more complicated.
Instead of having a single neural network layer, there are four layers interacting in a special way.
Besides, it has two states: hidden state and cell state.


Figure 7: Long Short Term Memory network

Figure 7 shows an example of LSTM. At time step $t$, LSTM first decides what information to dump
from the cell state. This decision is made by a sigmoid function/layer $\sigma$, called the “forget gate”. The
function takes $h_{t-1}$ (output from the previous hidden layer) and $x_t$ (current input), and outputs a
number in [0, 1], where 1 means “completely keep” and 0 means “completely dump” in Equation (7).

$$f_t = \sigma(W^{f} x_t + U^{f} h_{t-1}) \qquad (7)$$

Then LSTM decides what new information to store in the cell state. This has two steps. First, a
sigmoid function/layer, called the “input gate” as Equation (8), decides which values LSTM will
update. Next, a tanh function/layer creates a vector of new candidate values $\tilde{C}_t$, which will be
added to the cell state. LSTM combines these two to create an update to the state.

$$i_t = \sigma(W^{i} x_t + U^{i} h_{t-1}) \qquad (8)$$

$$\tilde{C}_t = \tanh(W^{n} x_t + U^{n} h_{t-1}) \qquad (9)$$

It is now time to update the old cell state $C_{t-1}$ into the new cell state $C_t$ as in Equation (10). Note that
the forget gate $f_t$ can control the gradient that passes through it and allows for explicit “memory” deletes and
updates, which helps alleviate the vanishing gradient or exploding gradient problem in standard RNN.

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (10)$$

Finally, LSTM decides the output, which is based on the cell state. LSTM first runs a sigmoid layer,
called the “output gate”, which decides which parts of the cell state to output in Equation (11). Then,
LSTM puts the cell state through the tanh function and multiplies it by the output of the sigmoid
gate, so that LSTM only outputs the parts it decides to, as in Equation (12).

$$o_t = \sigma(W^{o} x_t + U^{o} h_{t-1}) \qquad (11)$$

$$h_t = o_t * \tanh(C_t) \qquad (12)$$
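
Equations (7) to (12) translate almost line by line into a single LSTM step. The sketch below follows the equations as written (no bias terms); the dictionary keys, dimensions and random untrained weights are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(input_dim, hidden_dim, rng):
    """One (W, U) pair per layer: forget gate, input gate, candidate, output gate."""
    names = ("forget", "input", "candidate", "output")
    W = {n: rng.normal(scale=0.5, size=(hidden_dim, input_dim)) for n in names}
    U = {n: rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)) for n in names}
    return W, U

def lstm_step(x_t, h_prev, C_prev, W, U):
    """One time step of the LSTM, following Equations (7) to (12)."""
    f_t = sigmoid(W["forget"] @ x_t + U["forget"] @ h_prev)            # (7)
    i_t = sigmoid(W["input"] @ x_t + U["input"] @ h_prev)              # (8)
    C_tilde = np.tanh(W["candidate"] @ x_t + U["candidate"] @ h_prev)  # (9)
    C_t = f_t * C_prev + i_t * C_tilde                                 # (10)
    o_t = sigmoid(W["output"] @ x_t + U["output"] @ h_prev)            # (11)
    h_t = o_t * np.tanh(C_t)                                           # (12)
    return h_t, C_t

rng = np.random.default_rng(0)
W, U = init_lstm_params(input_dim=5, hidden_dim=4, rng=rng)
h, C = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(6, 5)):     # a sequence of six input vectors
    h, C = lstm_step(x_t, h, C, W, U)
print(h, C)
```

Note that the cell state is updated additively in Equation (10), while the hidden state is a gated, squashed view of it.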

LSTM is commonly applied to sequential data but can also be used for tree-structured data. Tai et
al.26 introduced a generalization of the standard LSTM to Tree-structured LSTM (Tree-LSTM) and
showed better performance for representing sentence meaning than a sequential LSTM.

A slight variation of LSTM is the Gated Recurrent Unit (GRU).27,28 It combines the “forget” and “input”
gates into a single update gate. It also merges the cell state and hidden state, and makes some other
changes. The resulting model is simpler than the standard LSTM model, and has been growing in
popularity.

ATTENTION MECHANISM WITH RECURRENT NEURAL NETWORK


Supposedly, bidirectional RNN and LSTM should be able to deal with long-range dependencies in
data. But in practice, the long-range dependencies are still problematic to handle. Thus, a technique
called the Attention Mechanism was proposed.

The attention mechanism in neural networks is inspired by the visual attention mechanism found in
humans. That is, the human visual attention is able to focus on a certain region of an image with
“high resolution” while perceiving the surrounding image in “low resolution” and then adjusting the
focal point over time. In NLP, the attention mechanism allows the model to learn what to attend to
based on the input text and what it has produced so far, rather than encoding the full source text
into a fixed-length vector like standard RNN and LSTM.


Figure 8: Attention mechanism in bidirectional recurrent neural network

Bahdanau et al.29 first utilized the attention mechanism for machine translation in NLP. They
proposed an encoder-decoder framework where an attention mechanism is used to select reference
words in the original language for words in the target language before translation. Figure 8
illustrates the use of the attention mechanism in their bidirectional RNN. Note that each decoder
output word $y_t$ depends on a weighted combination of all the input states, not just the last state as
in the normal case. $a_{t,t'}$ are weights that define how much each input state should be weighted
for each output. For example, if $a_{2,2}$ has a big value, it means that the decoder pays a lot of
attention to the second state in the source sentence while producing the second word of the target
sentence. The weights $a_{t,t'}$ normally sum to 1 over the input states.
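
The weighted combination of input states can be sketched as follows. For brevity this example scores each input state with a plain dot product, which is a swapped-in simplification: Bahdanau et al. compute the scores with a small feedforward alignment network instead. Function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return attention weights over the input states and their weighted sum."""
    scores = encoder_states @ decoder_state   # assumption: dot-product scoring
    weights = softmax(scores)                 # one weight per input state, summing to 1
    context = weights @ encoder_states        # weighted combination of all input states
    return weights, context

# Three toy input states; the decoder state is most similar to the second one.
encoder_states = 5.0 * np.eye(3)
decoder_state = np.array([0.0, 1.0, 0.0])
weights, context = attend(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```

Here the second weight dominates, so the context vector is drawn mostly from the second input state.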

MEMORY NETWORK

Weston et al.30 introduced the concept of Memory Networks (MemNN) for the question answering
problem. It works with several inference components combined with a large long-term memory. The
components can be neural networks. The memory acts as a dynamic knowledge base. The four
learnable/inference components function as follows: the I component converts the incoming input to the
internal feature representation; the G component updates old memories given the new input; the O
component generates output (also in the feature representation space); the R component converts the
output into a response format. For instance, given a list of sentences and a question for question
answering, MemNN finds evidence in those sentences and generates an answer. During
inference, the I component reads one sentence at a time and encodes it into a vector representation.
Then the G component updates a piece of memory based on the current sentence representation.
After all sentences are processed, a memory matrix (each row representing a sentence) is generated,
which stores the semantics of the sentences. For a question, MemNN encodes it into a vector
representation, then the O component uses the vector to select some related evidence from the
memory and generates an output vector. Finally, the R component takes the output vector as the
input and outputs a final response.

Based on MemNN, Sukhbaatar et al.31 proposed an End-to-End Memory Network (MemN2N), which
is a neural network architecture with a recurrent attention mechanism over the long-term memory
component and it can be trained in an End-to-End manner through standard backpropagation. It
demonstrates that multiple computational layers (hops) in the O component can uncover more
abstract evidence than a single layer and yield improved results for question answering and
language modelling. It is worth noting that each computational layer can be a content-based
attention model. Thus, MemN2N refines the attention mechanism to some extent. Note also a
similar idea is the Neural Turing Machines reported by Graves et al.32
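
The idea of repeated content-based attention over a memory can be sketched as below. This is a heavily simplified illustration, not the MemN2N architecture itself: it uses the same memory matrix for addressing and for output, random vectors in place of learned sentence encodings, and hypothetical function names.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def memory_hops(question_vec, memory, n_hops=3):
    """Read from the memory several times, refining the query after each hop."""
    u = question_vec
    attention_per_hop = []
    for _ in range(n_hops):
        p = softmax(memory @ u)     # content-based attention over memory rows
        o = p @ memory              # evidence read from memory (weighted sum of rows)
        u = u + o                   # updated query for the next hop
        attention_per_hop.append(p)
    return u, attention_per_hop

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 4))    # five stored sentences, one row each (toy vectors)
question = rng.normal(size=4)
final_state, attention_per_hop = memory_hops(question, memory)
print(final_state.round(3))
```

Because everything in the loop is differentiable, such a model can be trained with standard backpropagation, as noted above.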

RECURSIVE NEURAL NETWORK


Recursive Neural Network (RecNN) is a type of neural network that is usually used to learn a
directed acyclic graph structure (e.g., tree structure) from data. A recursive neural network can be
seen as a generalization of the recurrent neural network. Given the structural representation of a
sentence (e.g., a parse tree), RecNN recursively generates parent representations in a bottom-up
fashion, by combining tokens to produce representations for phrases, eventually the whole sentence.
The sentence level representation can then be used to make a final classification (e.g., sentiment
classification) for a given input sentence. An example process of vector composition in RecNN is
shown in Figure 9 33. The vector of node “very interesting” is composed from the vectors of the node
“very” and the node “interesting”. Similarly, the node “is very interesting” is composed from the
phrase node “very interesting” and the word node “is”.
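
The bottom-up composition for “is very interesting” can be sketched as follows. The composition function tanh(W [left; right] + b) is a commonly used form and is assumed here; the vector size, the random untrained parameters and the word vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                          # assumed vector size
W = rng.normal(scale=0.5, size=(dim, 2 * dim))   # shared composition weights
b = np.zeros(dim)

def compose(left, right):
    """Parent vector from two child vectors (assumed form: tanh(W [left; right] + b))."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Placeholder word vectors; in practice these would be word embeddings.
word_vec = {w: rng.normal(size=dim) for w in ("is", "very", "interesting")}

# Follow the parse tree bottom-up: (is (very interesting)).
very_interesting = compose(word_vec["very"], word_vec["interesting"])
is_very_interesting = compose(word_vec["is"], very_interesting)
print(is_very_interesting)
```

The same weights are applied at every node of the tree, and the final vector could be fed to a classifier for the whole sentence.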

Figure 9: Recursive Neural network

SENTIMENT ANALYSIS TASKS


We are now ready to survey deep learning applications in sentiment analysis. But before doing that,
we first briefly introduce the main sentiment analysis tasks in this section. For additional details,
please refer to Liu’s book1 on sentiment analysis.

Researchers have mainly studied sentiment analysis at three levels of granularity: document level,
sentence level, and aspect level. Document level sentiment classification classifies an opinionated
document (e.g., a product review) as expressing an overall positive or negative opinion. It considers
the whole document as the basic information unit and assumes that the document is known to be
opinionated and contain opinions about a single entity (e.g., a particular phone). Sentence level
sentiment classification classifies individual sentences in a document. However, each sentence
cannot be assumed to be opinionated. Traditionally, one often first classifies a sentence as
opinionated or not opinionated, which is called subjectivity classification. Then the resulting
opinionated sentences are classified as expressing positive or negative opinions. Sentence level
sentiment classification can also be formulated as a three-class classification problem, that is, to
classify a sentence as neutral, positive or negative. Compared with document level and sentence
level sentiment analysis, aspect level sentiment analysis or aspect-based sentiment analysis is more
fine-grained. Its task is to extract and summarize people’s opinions expressed on entities and
aspects/features of entities, which are also called targets. For example, in a product review, it aims
to summarize positive and negative opinions on different aspects of the product respectively,
although the general sentiment on the product could be positive or negative. The whole task of
aspect-based sentiment analysis consists of several subtasks such as aspect extraction, entity
extraction, and aspect sentiment classification. For example, from the sentence, “the voice quality
of iPhone is great, but its battery sucks”, entity extraction should identify “iPhone” as the entity, and
aspect extraction should identify that “voice quality” and “battery” are two aspects. Aspect
sentiment classification should classify the sentiment expressed on the voice quality of the iPhone as
positive and on the battery of the iPhone as negative. Note that for simplicity, in most algorithms
aspect extraction and entity extraction are combined and are called aspect extraction or
sentiment/opinion target extraction.

Apart from these core tasks, sentiment analysis also studies emotion analysis, sarcasm detection,
multilingual sentiment analysis, etc. See Liu’s book1 for more details. In the following sections, we
survey the deep learning applications in all these sentiment analysis tasks.
DOCUMENT LEVEL SENTIMENT CLASSIFICATION
Sentiment classification at the document level is to assign an overall sentiment orientation/polarity
to an opinion document, i.e., to determine whether the document (e.g., a full online review) conveys
an overall positive or negative opinion. In this setting, it is a binary classification task. It can also be
formulated as a regression task, for example, to infer an overall rating score from 1 to 5 stars for the
review. Some researchers also treat this as a 5-class classification task.

Sentiment classification is commonly regarded as a special case of document classification. In such a
classification, document representation plays an important role, which should reflect the original
information conveyed by words or sentences in a document. Traditionally, the bag-of-words model
(BoW) is used to generate text representations in NLP and text mining, by which a document is
regarded as a bag of its words. Based on BoW, a document is transformed to a numeric feature
vector with a fixed length, each element of which can be the word occurrence (absence or presence),
word frequency, or TF-IDF score. Its dimension equals the size of the vocabulary. A document
vector from BoW is normally very sparse since a single document only contains a small number of
words in a vocabulary. Early neural networks adopted such feature settings.

Despite its popularity, BoW has some disadvantages. Firstly, the word order is ignored, which means
that two documents can have exactly the same representation as long as they share the same words.
Bag-of-N-Grams, an extension for BoW, can consider the word order in a short context (n-gram), but
it also suffers from data sparsity and high dimensionality. Secondly, BoW can barely encode the
semantics of words. For example, the words “smart”, “clever” and “book” are equidistant from
one another in BoW, but “smart” should be closer to “clever” than to “book” semantically.
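
Both shortcomings can be seen in a few lines of code. The sketch below builds BoW vectors for two toy sentences with the same words in a different order, and compares one-hot word vectors; the example sentences and function names are made up for illustration.

```python
from collections import Counter

def bow_vector(tokens, vocab, mode="frequency"):
    """Fixed-length BoW vector: word frequency, or presence/absence (0 or 1)."""
    counts = Counter(tokens)
    if mode == "presence":
        return [1 if counts[w] > 0 else 0 for w in vocab]
    return [counts[w] for w in vocab]

def one_hot(word, vocab):
    return [1 if w == word else 0 for w in vocab]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# 1) Word order is ignored: different meanings, identical representation.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()
vocab = sorted(set(doc_a) | set(doc_b))
print(vocab, bow_vector(doc_a, vocab), bow_vector(doc_b, vocab))

# 2) Semantics are not encoded: every pair of distinct words is equally far apart.
words = ["smart", "clever", "book"]
d_smart_clever = squared_distance(one_hot("smart", words), one_hot("clever", words))
d_smart_book = squared_distance(one_hot("smart", words), one_hot("book", words))
print(d_smart_clever, d_smart_book)
```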

To tackle the shortcomings of BoW, word embedding techniques based on neural networks
(introduced in the aforementioned section) were proposed to generate dense vectors (or low-
dimensional vectors) for word representation, which are, to some extent, able to encode some
semantic and syntactic properties of words. With word embeddings as input of words, document
representation as a dense vector (or called dense document vector) can be derived using neural
networks.

Notice that in addition to the above two approaches, i.e., using BoW and learning dense vectors for
documents through word embeddings, one can also learn a dense document vector directly from
BoW. We distinguish the different approaches used in related studies in Table 2.

When documents are properly represented, sentiment classification can be conducted using a
variety of neural network models following the traditional supervised learning setting. In some cases,
neural networks may only be used to extract text features/text representations, and these features
are fed into some other non-neural classifiers (e.g., SVM) to obtain a final global optimum classifier.
The properties of neural networks and SVM complement each other in such a way that their
advantages are combined.

Besides sophisticated document/text representations, researchers have also leveraged the characteristics
of the data (product reviews) for sentiment classification. For product reviews, several researchers
found it beneficial to jointly model sentiment and some additional information (e.g., user
information and product information) for classification. Additionally, since a document often
contains long dependency relations, the attention mechanism is also frequently used in document
level sentiment classification. We summarize the existing techniques in Table 2.


Research Work | Document/Text Representation | Neural Network Model | Use Attention Mechanism | Joint Modelling with Sentiment
--------------------------------------------------------------------------------------------------------------------------------
Moraes et al.34 | BoW | ANN (Artificial Neural Network) | No | -
Le and Mikolov35 | Learning dense vector at sentence, paragraph, document level | Paragraph Vector | No | -
Glorot et al.36 | BoW to dense document vector | SDA (Stacked Denoising Autoencoder) | No | Unsupervised data representation from target domains (in transfer learning settings)
Zhai and Zhang37 | BoW to dense document vector | DAE (Denoising Autoencoder) | No | -
Johnson and Zhang38 | BoW to dense document vector | BoW-CNN and Seq-CNN | No | -
Tang et al.39 | Word embeddings to dense document vector | CNN/LSTM (to learn sentence representation) + GRU (to learn document representation) | No | -
Tang et al.40 | Word embeddings to dense document vector | UPNN (User Product Neural Network) based on CNN | No | User information and product information
Chen et al.41 | Word embeddings to dense document vector | UPA (User Product Attention) based on LSTM | Yes | User information and product information
Dou42 | Word embeddings to dense document vector | Memory Network | Yes | User information and product information
Xu et al.43 | Word embeddings to dense document vector | LSTM | No | -
Yang et al.44 | Word embeddings to dense document vector | GRU-based sequence encoder | Hierarchical attention | -
Yin et al.45 | Word embeddings to dense document vector | Input encoder and LSTM | Hierarchical attention | Aspect/target information
Zhou et al.46 | Word embeddings to dense document vector | LSTM | Hierarchical attention | Cross-lingual information
Li et al.47 | Word embeddings to dense document vector | Memory Network | Yes | Cross-domain information

Table 2: Deep learning methods for document level sentiment classification

Below, we also give a brief description of these existing representative works.

Moraes et al.34 made an empirical comparison between Support Vector Machines (SVM) and
Artificial Neural Networks (ANN) for document level sentiment classification, which demonstrated
that ANN produced results competitive with those of SVM in most cases.

To overcome the weakness of BoW, Le and Mikolov35 proposed Paragraph Vector, an unsupervised
learning algorithm that learns vector representations for variable-length texts such as sentences,
paragraphs and documents. The vector representations are learned by predicting the surrounding
words in contexts sampled from the paragraph.

Glorot et al.36 studied the domain adaptation problem for sentiment classification. They proposed a
deep learning system based on Stacked Denoising Autoencoder with sparse rectifier units, which can
perform an unsupervised text feature/representation extraction using both labeled and unlabeled
data. The features are highly beneficial for domain adaptation of sentiment classifiers.

Zhai and Zhang37 introduced a semi-supervised autoencoder, which further considers the sentiment
information in its learning stage in order to obtain better document vectors, for sentiment
classification. More specifically, the model learns a task-specific representation of the textual data
by relaxing the loss function in the autoencoder to the Bregman Divergence and also deriving a
discriminative loss function from the label information.

Johnson and Zhang38 proposed a CNN variant named BoW-CNN that employs bag-of-word
conversion in the convolution layer. They also designed a new model, called Seq-CNN, which keeps
the sequential information of words by concatenating the one-hot vectors of multiple words.
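
The difference between the two region representations can be sketched as follows. This is only an illustration of the input encoding, not of the convolution itself. The function names and the toy vocabulary are assumptions.

```python
def one_hot(word, vocabulary):
    """One-hot vector of a single word over the vocabulary."""
    return [1 if word == v else 0 for v in vocabulary]

def seq_region(words, vocabulary):
    """Seq-CNN style input: concatenate one-hot vectors, so word order is kept.
    The length is len(words) * len(vocabulary)."""
    vector = []
    for word in words:
        vector.extend(one_hot(word, vocabulary))
    return vector

def bow_region(words, vocabulary):
    """BoW-CNN style input: bag-of-words of the region, so word order is lost.
    The length is len(vocabulary)."""
    return [1 if v in words else 0 for v in vocabulary]

# Toy vocabulary and two regions with the same words in a different order.
vocabulary = ["good", "not", "very"]
region_a = ["not", "good"]
region_b = ["good", "not"]

seq_a = seq_region(region_a, vocabulary)
seq_b = seq_region(region_b, vocabulary)
bow_a = bow_region(region_a, vocabulary)
bow_b = bow_region(region_b, vocabulary)
```

The concatenated vectors seq_a and seq_b differ, while bow_a and bow_b are identical. The price of keeping order is a region vector that grows with the region size.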

Tang et al.39 proposed a neural network to learn document representation, with the consideration of
sentence relationships. It first learns the sentence representation with CNN or LSTM from word
embeddings. Then a GRU is utilized to adaptively encode semantics of sentences and their inherent
relations in document representations for sentiment classification.

Tang et al.40 applied user representations and product representations in review classification. The
idea is that those representations can capture important global clues such as individual preferences
of users and overall qualities of products, which can provide better text representations.

Chen et al.41 also incorporated user information and product information for classification but via
word and sentence level attentions, which can take into account the global user preference and
product characteristics at both the word level and the semantic level. Likewise, Dou42 used a deep
memory network to capture user and product information. The proposed model can be divided into
two separate parts. In the first part, LSTM is applied to learn a document representation. In the
second part, a deep memory network consisting of multiple computational layers (hops) is used to
predict the review rating for each document.

Xu et al.43 proposed a cached LSTM model to capture the overall semantic information in a long text.
The memory in the model is divided into several groups with different forgetting rates. The intuition
is to enable the memory groups with low forgetting rates to capture global semantic features and
the ones with high forgetting rates to learn local semantic features.

Yang et al.44 proposed a hierarchical attention network for document level sentiment rating
prediction of reviews. The model includes two levels of attention mechanisms: one at the word level
and the other at the sentence level, which allow the model to pay more or less attention to
individual words or sentences in constructing the representation of a document.
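
The two levels of attention can be sketched in a few lines. This is a simplified illustration under stated assumptions: the word vectors are given directly rather than produced by a GRU encoder, attention scores are plain dot products with a context vector, and all names (attention_pool, document_vector, word_context, sentence_context) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(vectors, context):
    """Weight each vector by softmax(vector . context) and return the weighted sum."""
    weights = softmax([dot(v, context) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights

def document_vector(document, word_context, sentence_context):
    """Word-level attention builds sentence vectors; sentence-level attention
    then builds the document vector from them."""
    sentence_vectors = [attention_pool(sentence, word_context)[0] for sentence in document]
    return attention_pool(sentence_vectors, sentence_context)

# A toy document: two sentences, each a list of 2-dimensional word vectors.
document = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
word_context = [1.0, 0.0]
sentence_context = [0.0, 1.0]
doc_vec, sentence_weights = document_vector(document, word_context, sentence_context)
```

The returned sentence_weights show how much each sentence contributes to the document representation. In a trained model the context vectors would be learned jointly with the classifier.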

Yin et al.45 formulated the document-level aspect-sentiment rating prediction task as a machine
comprehension problem and proposed a hierarchical interactive attention-based model. Specifically,
documents and pseudo aspect-questions are interleaved to learn aspect-aware document
representation.

Zhou et al.46 designed an attention-based LSTM network for cross-lingual sentiment classification at
the document level. The model consists of two attention-based LSTMs for bilingual representation,
and each LSTM is also hierarchically structured. In this setting, it effectively adapts the sentiment
information from a resource-rich language (English) to a resource-poor language (Chinese) and helps
improve the sentiment classification performance.

Li et al.47 proposed an adversarial memory network for cross-domain sentiment classification in a
transfer learning setting, where the data from the source and the target domain are modelled
together. It jointly trains two networks for sentiment classification and domain classification (i.e.,
whether a document is from the source or target domain).

SENTENCE LEVEL SENTIMENT CLASSIFICATION


Sentence level sentiment classification is to determine the sentiment expressed in a single given
sentence. As discussed earlier, the sentiment of a sentence can be inferred with subjectivity
classification48 and polarity classification, where the former classifies whether a sentence is
subjective or objective and the latter decides whether a subjective sentence expresses a negative or
positive sentiment. In existing deep learning models, sentence sentiment classification is usually
formulated as a joint three-way classification problem, namely, to predict a sentence as positive,
neutral, or negative.

As with document level sentiment classification, the sentence representation produced by neural
networks is also important for sentence level sentiment classification. Additionally, since a sentence
is usually short compared to a document, some syntactic and semantic information (e.g., parse
trees, opinion lexicons, and part-of-speech tags) may be used to help. Additional information such as
review ratings, social relationship, and cross-domain information can be considered too. For
example, social relationships have been exploited in discovering sentiments in social media data
such as tweets.

In early research, parse trees (which provide some semantic and syntactic information) were used
together with the original words as the input to neural models, so that the sentiment composition
can be better inferred. More recently, CNN and RNN have become more popular, and they do not need parse
trees to extract features from sentences. Instead, CNN and RNN use word embeddings as input,
which already encode some semantic and syntactic information. Moreover, the model architecture
of CNN or RNN can help learn intrinsic relationships between words in a sentence too. The related
works are introduced in detail below.

Socher et al.49 first proposed a semi-supervised Recursive Autoencoders Network (RAE) for
sentence level sentiment classification, which obtains a reduced dimensional vector representation
for a sentence. Later on, Socher et al.50 proposed a Matrix-vector Recursive Neural Network (MV-
RNN), in which each word is additionally associated with a matrix representation (besides a vector
representation) in a tree structure. The tree structure is obtained from an external parser. In Socher
et al.51, the authors further introduced the Recursive Neural Tensor Network (RNTN), where tensor-
based compositional functions are used to better capture the interactions between elements. Qian
et al.33 proposed two more advanced models, Tag-guided Recursive Neural Network (TG-RNN),
which chooses a composition function according to the part-of-speech tags of a phrase, and
Tag-embedded Recursive Neural Network / Recursive Neural Tensor Network (TE-RNN/RNTN), which
learns tag embeddings and then combines tag and word embeddings together.

Kalchbrenner et al.52 proposed a Dynamic CNN (called DCNN) for semantic modelling of sentences.
DCNN uses the dynamic K-Max pooling operator as a non-linear subsampling function. The feature
graph induced by the network is able to capture word relations. Kim53 also proposed to use CNN for
sentence-level sentiment classification and experimented with several variants, namely CNN-rand
(where word embeddings are randomly initialized), CNN-static (where word embeddings are pre-
trained and fixed), CNN-non-static (where word embeddings are pre-trained and fine-tuned) and
CNN-multichannel (where multiple sets of word embeddings are used).
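
As a small illustration of the K-Max pooling operator mentioned above, the following sketch keeps the k largest activations of a feature sequence while preserving their original order. The function name k_max_pool is an assumption. In the dynamic variant, k is chosen as a function of the sentence length and the layer depth rather than fixed, which is not shown here.

```python
def k_max_pool(values, k):
    """Keep the k largest values of a sequence, preserving their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest activations.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    # Re-sort the selected indices so the output follows the input order.
    return [values[i] for i in sorted(top)]

# Activations of one feature map over a five-word sentence (toy values).
pooled = k_max_pool([0.1, 0.9, 0.3, 0.7, 0.2], k=3)
```

Unlike ordinary max pooling, the result keeps several activations and their relative positions, so some order information survives the subsampling.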

dos Santos and Gatti54 proposed a Character to Sentence CNN (CharSCNN) model. CharSCNN uses
two convolutional layers to extract relevant features from words and sentences of any size to
perform sentiment analysis of short texts. Wang et al.55 utilized LSTM for Twitter sentiment
classification by simulating the interactions of words during the compositional process.
Multiplicative operations between word embeddings through gate structures are used to provide
more flexibility and to produce better compositional results compared to the additive ones in a simple
recurrent neural network. Similar to a bidirectional RNN, the unidirectional LSTM can be extended to a
bidirectional LSTM56 by allowing bidirectional connections in the hidden layer.

Wang et al.57 proposed a regional CNN-LSTM model, which consists of two parts: regional CNN and
LSTM, to predict the valence arousal ratings of text.

Wang et al.58 described a joint CNN and RNN architecture for sentiment classification of short texts,
which takes advantage of the coarse-grained local features generated by CNN and long-distance
dependencies learned via RNN.

Guggilla et al.59 presented an LSTM- and CNN-based deep neural network model, which utilizes
word2vec and linguistic embeddings for claim classification (classifying sentences as factual or
feeling).

Huang et al.60 proposed to encode the syntactic knowledge (e.g., part-of-speech tags) in a tree-
structured LSTM to enhance phrase and sentence representation.

Akhtar et al.61 proposed several multi-layer perceptron based ensemble models for fine-grained
sentiment classification of financial microblogs and news.

Guan et al.62 employed a weakly-supervised CNN for sentence (and also aspect) level sentiment
classification. It contains a two-step learning process: it first learns a sentence representation weakly
supervised by overall review ratings and then uses the sentence (and aspect) level labels for fine-
tuning.

Teng et al.63 proposed a context-sensitive lexicon-based method for sentiment classification based
on a simple weighted-sum model, using bidirectional LSTM to learn the sentiment strength,
intensification and negation of lexicon sentiments in composing the sentiment value of a sentence.

Yu and Jiang64 studied the problem of learning generalized sentence embeddings for cross-domain
sentence sentiment classification and designed a neural network model containing two separate
CNNs that jointly learn two hidden feature representations from both the labeled and unlabeled
data.

Zhao et al.65 introduced a recurrent random walk network learning approach for sentiment
classification of opinionated tweets by exploiting the deep semantic representation of both user
posted tweets and their social relationships.

Mishra et al.66 utilized CNN to automatically extract cognitive features from the eye-movement (or
gaze) data of human readers reading the text and used them as enriched features along with textual
features for sentiment classification.

Qian et al.67 presented a linguistically regularized LSTM for the task. The proposed model
incorporates linguistic resources such as sentiment lexicon, negation words and intensity words into
the LSTM so as to capture the sentiment effect in sentences more accurately.

ASPECT LEVEL SENTIMENT CLASSIFICATION


Unlike document level and sentence level sentiment classification, aspect level
sentiment classification considers both the sentiment and the target information, as a sentiment
always has a target. As mentioned earlier, a target is usually an entity or an entity aspect. For
simplicity, both entity and aspect are usually just called aspect. Given a sentence and a target aspect,
aspect level sentiment classification aims to infer the sentiment polarity/orientation of the sentence
toward the target aspect. For example, in the sentence “the screen is very clear but the battery life is
too short.” the sentiment is positive if the target aspect is “screen” but negative if the target aspect
is “battery life”. We will discuss automated aspect or target extraction in the next section.

Aspect level sentiment classification is challenging because modelling the semantic relatedness of a
target with its surrounding context words is difficult. Different context words have different
influences on the sentiment polarity of a sentence towards the target. Therefore, it is necessary
to capture semantic connections between the target word and the context words when building
learning models using neural networks.

There are three important tasks in aspect level sentiment classification using neural networks. The
first task is to represent the context of a target, where the context means the contextual words in a
sentence or document. This issue can be similarly addressed using the text representation
approaches mentioned in the above two sections. The second task is to generate a target
representation, which can properly interact with its context. A general solution is to learn a target
embedding, which is similar to word embedding. The third task is to identify the important
sentiment context (words) for the specified target. For example, in the sentence “the screen of
iPhone is clear but battery life is short”, “clear” is the important context word for “screen” and “short”
is the important context word for “battery life”. This task has recently been addressed by the attention
mechanism. Although many deep learning techniques have been proposed to deal with aspect level
sentiment classification, to our knowledge, there are still no dominating techniques in the literature.
Related works and their main focuses are introduced below.

Dong et al.68 proposed an Adaptive Recursive Neural Network (AdaRNN) for target-dependent
Twitter sentiment classification, which learns to propagate the sentiments of words towards the
target depending on the context and syntactic structure. It uses the representation of the root node
as the features, and feeds them into the softmax classifier to predict the distribution over classes.

Vo and Zhang69 studied aspect-based Twitter sentiment classification by making use of rich
automatic features, which are additional features obtained using unsupervised learning methods.
The paper showed that multiple embeddings, multiple pooling functions, and sentiment lexicons can
offer rich sources of feature information and help achieve performance gains.

Since LSTM can capture semantic relations between the target and its context words in a more
flexible way, Tang et al.70 proposed Target-dependent LSTM (TD-LSTM) and Target-connection LSTM
(TC-LSTM) to extend LSTM by taking the target into consideration. They regarded the given target as
a feature and concatenated it with the context features for aspect sentiment classification.

Ruder et al.71 proposed to use a hierarchical and bidirectional LSTM model for aspect level sentiment
classification, which is able to leverage both intra- and inter-sentence relations. The sole
dependence on sentences and their structures within a review renders the proposed model
language-independent. Word embeddings are fed into a sentence-level bidirectional LSTM. Final
states of the forward and backward LSTM are concatenated together with the target embedding and
fed into a bidirectional review-level LSTM. At every time step, the output of the forward and
backward LSTM is concatenated and fed into a final layer, which outputs a probability distribution
over sentiments.

Considering the limitations of the work by Dong et al.68 and Vo and Zhang69, Zhang et al.72 proposed a
sentence level neural model to address the weakness of pooling functions, which do not explicitly
model tweet-level semantics. To achieve that, two gated neural networks are presented. First, a bi-
directional gated neural network is used to connect the words in a tweet so that pooling functions
can be applied over the hidden layer instead of words for better representing the target and its
contexts. Second, a three-way gated neural network structure is used to model the interaction
between the target mention and its surrounding contexts, addressing the limitations by using gated
neural network structures to model the syntax and semantics of the enclosing tweet, and the
interaction between the surrounding contexts and the target respectively. Gated neural networks
have been shown to reduce the bias of standard recurrent neural networks towards the ends of a
sequence by better propagation of gradients.

Wang et al.73 proposed an attention-based LSTM method with target embedding, in which the attention
mechanism forces the model to attend to the part of a sentence that is important with respect to a
specific aspect; this was shown to be effective. Likewise, Yang et al.74 proposed two attention-based bidirectional
LSTMs to improve the classification performance. Liu and Zhang75 extended the attention modelling
by differentiating the attention obtained from the left context and the right context of a given
target/aspect. They further controlled their attention contribution by adding multiple gates.

Tang et al.76 introduced an end-to-end memory network for aspect level sentiment classification,
which employs an attention mechanism with an external memory to capture the importance of each
context word with respect to the given target aspect. This approach explicitly captures the
importance of each context word when inferring the sentiment polarity of the aspect. Such
importance degree and text representation are calculated with multiple computational layers, each
of which is a neural attention model over an external memory.
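
The idea of target-dependent attention computed over several layers can be sketched as follows. This is a simplified illustration, not the model of any cited work: the word vectors are hand-picked toy values rather than learned embeddings, the scores are plain dot products, and the update between hops (adding the attention output to the query) is an assumption made to keep the sketch short. All function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(memory, query):
    """One attention layer (hop): score each memory slot against the query."""
    weights = softmax([sum(m_i * q_i for m_i, q_i in zip(m, query)) for m in memory])
    output = [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(len(query))]
    return output, weights

def multi_hop_attention(memory, target, hops=2):
    """Apply several attention layers over the same external memory."""
    query = list(target)
    weights = []
    for _ in range(hops):
        output, weights = attend(memory, query)
        # Simplification: the next query is the attention output plus the current query.
        query = [o + q for o, q in zip(output, query)]
    return query, weights

# External memory: toy vectors for the context words "clear" and "short".
context_words = ["clear", "short"]
memory = [[2.0, 0.0], [0.0, 2.0]]

# Toy target vectors for the aspects "screen" and "battery life".
_, weights_screen = multi_hop_attention(memory, target=[1.0, 0.0])
_, weights_battery = multi_hop_attention(memory, target=[0.0, 1.0])
```

With these toy vectors, the target “screen” places most of its attention on “clear” and the target “battery life” on “short”, which mirrors the example sentence discussed earlier in this section.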

Lei et al.77 proposed to use a neural network approach to extract pieces of input text as rationales
(reasons) for review ratings. The model consists of a generator and an encoder. The generator
specifies a distribution over possible rationales (extracted text) and the encoder maps any such text
to a task-specific target vector. For multi-aspect sentiment analysis, each coordinate of the target
vector represents the response or rating pertaining to the associated aspect.

Li et al.78 integrated the target identification task into the sentiment classification task to better model
aspect-sentiment interaction. They showed that sentiment identification can be solved with an end-to-end
machine learning architecture, in which the two sub-tasks are interleaved by a deep memory
network. In this way, signals produced in target detection provide clues for polarity classification,
and conversely, the predicted polarity provides feedback to the identification of targets.

Ma et al.79 proposed an Interactive Attention Network (IAN) that considers both attentions on
target and context. That is, it uses two attention networks to interactively detect the important
words of the target expression/description and the important words of its full context.

Chen et al.80 proposed to utilize a recurrent attention network to better capture the sentiment of
complicated contexts. To achieve that, their proposed model uses a recurrent/dynamic attention
structure and learns a non-linear combination of the attention in GRUs.

Tay et al.81 designed a Dyadic Memory Network (DyMemNN) that models dyadic interactions
between aspect and context, by using either neural tensor compositions or holographic
compositions for memory selection operation.

ASPECT EXTRACTION AND CATEGORIZATION


To perform aspect level sentiment classification, one needs to have aspects (or targets), which can
be manually given or automatically extracted. In this section, we discuss existing work for automated
aspect extraction (or aspect term extraction) from a sentence or document using deep learning
models. To illustrate the problem, in the sentence “the image is very
clear” the word “image” is an aspect term (or sentiment target). The associated problem of aspect
categorization is to group the same aspect expressions into a category. For instance, the aspect
terms “image”, “photo” and “picture” can be grouped into one aspect category named Image. In the
review below, we include the extraction of both aspect and entity that are associated with opinions.

One reason why deep learning models can be helpful for this task is that deep learning is essentially
good at learning (complicated) feature representations. When an aspect is properly characterized in
some feature space, for example, in one or some hidden layer(s), the semantics or correlation
between an aspect and its context can be captured with the interplay between their corresponding
feature representations. In other words, deep learning provides a possible approach to automated
feature engineering without human involvement.

Katiyar and Cardie82 investigated the use of deep bidirectional LSTMs for joint extraction of opinion
entities and the IS-FROM and IS-ABOUT relationships that connect the entities. Wang et al.83
proposed a joint model integrating RNN and Conditional Random Fields (CRF) to co-extract aspects
and opinion terms or expressions. The proposed model can learn high-level discriminative features
and double-propagate information between aspect and opinion terms simultaneously. Wang et al.84
further proposed a Coupled Multi-Layer Attention Model (CMLA) for co-extraction of aspect and
opinion terms. The model consists of an aspect attention and an opinion attention using GRU units.
An improved LSTM-based approach was reported by Li and Lam85, specifically for aspect term
extraction. It consists of three LSTMs, of which two LSTMs are for capturing aspect and sentiment
interactions. The third LSTM is to use the sentiment polarity information as an additional guidance.

He et al.86 proposed an attention-based model for unsupervised aspect extraction. The main
intuition is to utilize the attention mechanism to focus more on aspect-related words while de-
emphasizing aspect-irrelevant words during the learning of aspect embeddings, similar to the
autoencoder framework.

Zhang et al.87 extended a CRF model using a neural network to jointly extract aspects and
corresponding sentiments. The proposed CRF variant replaces the original discrete features in CRF
with continuous word embeddings, and adds a neural layer between the input and output nodes.

Zhou et al.88 proposed a semi-supervised word embedding learning method to obtain continuous
word representations on a large set of reviews with noisy labels. With the word vectors learned,
deeper and hybrid features are learned by stacking on the word vectors through a neural network.
Finally, a logistic regression classifier trained with the hybrid features is used to predict the aspect
category.

Yin et al.89 first learned word embeddings by considering the dependency path connecting words.
Then they designed some embedding features that consider the linear context and dependency
context information for CRF-based aspect term extraction.

Xiong et al.90 proposed an attention-based deep distance metric learning model to group aspect
phrases. The attention-based model is to learn feature representation of contexts. Both aspect
phrase embedding and context embedding are used to learn a deep feature subspace metric for K-
means clustering.

Poria et al.91 proposed to use CNN for aspect extraction. They developed a seven-layer deep
convolutional neural network to tag each word in opinionated sentences as either aspect or
non-aspect word. Some linguistic patterns are also integrated into the model for further improvement.

Ying et al.92 proposed two RNN-based models for cross-domain aspect extraction. They first used
rule-based methods to generate an auxiliary label sequence for each sentence. They then trained
the models using both the true labels and auxiliary labels, which showed promising results.

OPINION EXPRESSION EXTRACTION


In this and the next few sections, we discuss deep learning applications to some other sentiment
analysis related tasks. This section focuses on the problem of opinion expression extraction (or
opinion term extraction, or opinion identification), which aims to identify the expressions of
sentiment in a sentence or a document.

Similar to aspect extraction, opinion expression extraction using deep learning models is
workable because the characteristics of opinion expressions can also be identified in some feature space.

Irsoy and Cardie93 explored the application of deep bidirectional RNN for the task, which
outperforms traditional shallow RNNs with the same number of parameters and also previous CRF
methods.94

Liu et al.95 presented a general class of discriminative models based on the RNN architecture and
word embedding. The authors used pre-trained word embeddings from three external sources in
different RNN architectures including Elman-type, Jordan-type, LSTM and their variations.

Wang et al.83 proposed a model integrating recursive neural networks and CRF to co-extract aspect
and opinion terms. The aforementioned CMLA is also proposed for co-extraction of aspect and
opinion terms.84

SENTIMENT COMPOSITION

The principle of sentiment composition states that the sentiment orientation of an opinion expression is determined
by the meaning of its constituents as well as the grammatical structure. Due to its particular tree-structure
design, RecNN is naturally suitable for this task.51 Irsoy and Cardie96 reported that the
RecNN with a deep architecture can more accurately capture different aspects of compositionality in
language, which benefits sentiment compositionality. Zhu et al.97 proposed a neural network for
integrating the compositional and non-compositional sentiment in the process of sentiment
composition.

OPINION HOLDER EXTRACTION


Opinion holder (or source) extraction is the task of recognizing who holds the opinion (or
whom/where the opinion is from).1 For example, in the sentence “John hates his car”, the opinion
holder is “John”. This problem is commonly formulated as a sequence labelling problem like opinion
expression extraction or aspect extraction. Notice that opinion holder can be either explicit (from a
noun phrase in the sentence) or implicit (from the writer) as shown by Yang and Cardie98. Deng and
Wiebe99 proposed to use word embeddings of opinion expressions as features for recognizing
sources of participant opinions and non-participant opinions, where a source can be the noun
phrase or writer.
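
To make the sequence labelling formulation concrete, the sketch below decodes a per-token tag sequence into text spans. The B-/I-/O tagging scheme, the label names HOLDER and ASPECT, and the function name decode_spans are assumptions chosen for illustration. The tags are written by hand here, whereas in practice they would be predicted by a model such as an RNN or a CRF.

```python
def decode_spans(tokens, tags, label):
    """Collect the token spans marked with B-<label> / I-<label> tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + label:
            # A new span starts; close any span that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-" + label and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Opinion holder extraction for the example sentence above.
holders = decode_spans(
    "John hates his car".split(),
    ["B-HOLDER", "O", "O", "O"],
    label="HOLDER",
)

# The same decoding applies to aspect term extraction.
aspects = decode_spans(
    "the battery life is too short".split(),
    ["O", "B-ASPECT", "I-ASPECT", "O", "O", "O"],
    label="ASPECT",
)
```

An implicit holder (the writer) has no span in the sentence, so under this scheme it would have to be handled separately, for example by an additional sentence-level prediction.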

TEMPORAL OPINION MINING


Time is also an important dimension in the problem definition of sentiment analysis (see Liu’s book1).
As time passes by, people may maintain or change their mind, or even give new viewpoints.
Therefore, predicting future opinion is important in sentiment analysis. Some research using neural
networks has been reported recently to tackle this problem.

Chen et al.100 proposed a Content-based Social Influence Model (CIM) to make opinion behaviour
predictions of Twitter users. That is, it uses the past tweets to predict users’ future opinions. It is
based on a neural network framework to encode both the user content and social relation factor
(one’s opinion about a target is influenced by one’s friends).

Rashkin et al.101 used LSTMs for targeted sentiment forecast in the social media context. They
introduced multilingual connotation frames, which aim at forecasting implied sentiments among
world event participants engaged in a frame.

SENTIMENT ANALYSIS WITH WORD EMBEDDING


It is clear that word embeddings played an important role in deep learning based sentiment analysis
models. It is also shown that even without the use of deep learning models, word embeddings can
be used as features for non-neural learning models for various tasks. This section thus specifically
highlights word embeddings’ contribution to sentiment analysis.

We first present the works of sentiment-encoded word embeddings. For sentiment analysis, directly
applying regular word embedding methods like CBOW or Skip-gram to learn word embeddings from context can
encounter problems, because words with similar contexts but opposite sentiment polarities (e.g.,
“good” and “bad”) may be mapped to nearby vectors in the embedding space. Therefore,
sentiment-encoded word embedding methods have been proposed. Maas et al.102 learned word embeddings
that can capture both semantic and sentiment information. Bespalov et al.103 showed that an n-gram
model combined with latent representation would produce a more suitable embedding for
sentiment classification. Labutov and Lipson104 re-embedded existing word embeddings with logistic
regression by regarding sentiment supervision of sentences as a regularization term.

Le and Mikolov35 proposed the concept of paragraph vector to learn fixed-length representations
for variable-length pieces of texts, including sentences, paragraphs and documents. They
experimented on both sentence and document-level sentiment classification tasks and achieved
performance gains, which demonstrates the merit of paragraph vectors in capturing semantics to
help sentiment classification. Tang et al.105,106 presented models to learn Sentiment-specific Word
Embeddings (SSWE), in which not only the semantic but also sentiment information is embedded in
the learned word vectors. Wang and Xia107 developed a neural architecture to train a sentiment-
bearing word embedding by integrating the sentiment supervision at both the document and word
levels. Yu et al.108 adopted a refinement strategy to obtain joint semantic-sentiment bearing word
vectors.

Feature enrichment and multi-sense word embeddings are also investigated for sentiment analysis.
Vo and Zhang69 studied aspect-based Twitter sentiment classification by making use of rich
automatic features, which are additional features obtained using unsupervised learning techniques.
Li and Jurafsky109 experimented with the utilization of multi-sense word embeddings on various NLP
tasks. Experimental results show that while such embeddings do improve the performance of some
tasks, they offer little help to sentiment classification tasks. Ren et al.110 proposed methods to learn
topic-enriched multi-prototype word embeddings for Twitter sentiment classification.

Multilingual word embeddings have also been applied to sentiment analysis. Zhou et al.111
reported a Bilingual Sentiment Word Embedding (BSWE) model for cross-language sentiment
classification. It incorporates the sentiment information into English-Chinese bilingual embeddings
by employing labeled corpora and their translation, instead of large-scale parallel corpora. Barnes et
al.112 compared several types of bilingual word embeddings and neural machine translation
techniques for cross-lingual aspect-based sentiment classification.

Zhang et al.113 integrated word embeddings with matrix factorization for personalized review-based
rating prediction. Specifically, the authors refine existing semantics-oriented word vectors (e.g.,
word2vec and GloVe) using sentiment lexicons. Sharma et al.114 proposed a semi-supervised
technique to use sentiment bearing word embeddings for ranking sentiment intensity of adjectives.
Word embedding techniques have also been utilized or improved to help address various sentiment
analysis tasks in many other recent studies.55,62,87,89,95

SARCASM ANALYSIS
Sarcasm is a form of verbal irony and a concept closely related to sentiment analysis. Recently, there
has been growing interest in sarcasm detection in NLP communities. Researchers have attempted to solve
it using deep learning techniques because of their impressive success in many other NLP problems.

Zhang et al.115 constructed a deep neural network model for tweet sarcasm detection. Their network
first uses a bidirectional GRU model to capture the syntactic and semantic information over tweets
locally, and then uses a pooling neural network to extract contextual features automatically from
historical tweets for detecting sarcastic tweets.

Joshi et al.116 investigated word embedding-based features for sarcasm detection. They
experimented with four past algorithms for sarcasm detection, augmented with word embedding
features, and showed promising results.

Poria et al.117 developed a CNN-based model for sarcasm detection (sarcastic or non-sarcastic tweets
classification), by jointly modelling pre-trained emotion, sentiment and personality features, along
with the textual information in a tweet.

Peled and Reichart118 proposed to interpret sarcastic tweets based on an RNN neural machine
translation model.

Ghosh and Veale119 proposed a CNN and bidirectional LSTM hybrid for sarcasm detection in tweets,
which models both linguistic and psychological contexts.

Mishra et al.66 utilized a CNN to automatically extract cognitive features from eye-movement (or
gaze) data to enrich information for sarcasm detection. Word embeddings are also used for irony
recognition in English tweets120 and for controversial word identification in debates.121

EMOTION ANALYSIS
Emotions are the subjective feelings and thoughts of human beings. The primary emotions include
love, joy, surprise, anger, sadness and fear. The concept of emotion is closely related to sentiment.
For example, the strength of a sentiment can be linked to the intensity of certain emotions such as joy
and anger. Thus, many deep learning models are also applied to emotion analysis in the same way as
in sentiment analysis.

Wang et al.122 built a bilingual attention network model for code-switched emotion prediction. An
LSTM model is used to construct a document-level representation of each post, and the attention
mechanism is employed to capture the informative words from both the monolingual and bilingual
contexts.

Zhou et al.123 proposed an emotional chatting machine to model the emotion influence in
large-scale conversation generation based on GRU. The technique has also been applied in other
papers.39,72,115

Abdul-Mageed and Ungar124 first built a large dataset for emotion detection automatically by using
distant supervision and then used a GRU network for fine-grained emotion detection.

Felbo et al.125 used millions of emoji occurrences in social media for pretraining neural models in
order to learn better representations of emotional contexts.

A question-answering approach is proposed using a deep memory network for emotion cause
extraction.126 Emotion cause extraction aims to identify the reasons behind a certain emotion
expressed in text.

MULTIMODAL DATA FOR SENTIMENT ANALYSIS


Multimodal data, such as the data carrying textual, visual, and acoustic information, has been used
to help sentiment analysis as it provides additional sentiment signals to the traditional text features.
Since deep learning models can map inputs to some latent space for feature representation, the
inputs from multimodal data can also be projected simultaneously to learn multimodal data fusion,
for example, by using feature concatenation, joint latent space, or other more sophisticated fusion
approaches. There is now a growing trend of using multimodal data with deep learning techniques.
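
Feature concatenation, the simplest of the fusion strategies mentioned above, can be sketched as follows. This is only an illustration under stated assumptions: the per-modality feature vectors, the weights and the function names are hypothetical placeholders, and the linear scorer stands in for whatever classifier a real system would train on the fused representation.

```python
def fuse_by_concatenation(text_feats, visual_feats, acoustic_feats):
    """Early fusion: join the per-modality feature vectors into one vector."""
    return list(text_feats) + list(visual_feats) + list(acoustic_feats)

def linear_score(features, weights, bias=0.0):
    """Score a fused vector with a linear model (stand-in for a trained classifier)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias

# Placeholder features, as if produced by three separate modality encoders.
text_feats = [1.0, 0.0]
visual_feats = [0.5]
acoustic_feats = [2.0, 1.0]

fused = fuse_by_concatenation(text_feats, visual_feats, acoustic_feats)

# Placeholder weights; in practice these would be learned from labeled data.
weights = [1.0, 1.0, 2.0, 0.5, -1.0]
score = linear_score(fused, weights)
print(fused, score)
```

Concatenation keeps each modality's features intact and leaves it to the downstream model to learn cross-modal interactions, whereas joint latent space methods learn a shared representation first.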

Poria et al.127 proposed a way of extracting features from short texts based on the activation values
of an inner layer of CNN. The main novelty of the paper is the use of a deep CNN to extract features
from text and the use of multiple kernel learning (MKL) to classify heterogeneous multimodal fused
feature vectors.

Bertero et al.128 described a CNN model for emotion and sentiment recognition in acoustic data from
interactive dialog systems.

Fung et al.129 demonstrated a virtual interaction dialogue system that incorporates sentiment,
emotion and personality recognition capabilities trained by deep learning models.

Wang et al.130 reported a CNN structured deep network, named Deep Coupled Adjective and Noun
(DCAN) neural network, for visual sentiment classification. The key idea of DCAN is to harness the
adjective and noun text descriptions, treating them as two (weak) supervision signals to learn two
intermediate sentiment representations. Those learned representations are then concatenated and
used for sentiment classification.

Yang et al.131 developed two algorithms based on a conditional probability neural network to analyse
visual sentiment in images.

Zhu et al.132 proposed a unified CNN-RNN model for visual emotion recognition. The architecture
leverages a CNN with multiple layers to extract different levels of features (e.g., colour, texture,
object) within a multi-task learning framework. A bidirectional RNN is then used to integrate the
learned features from different layers in the CNN model.

You et al.133 adopted the attention mechanism for visual sentiment analysis, which can jointly
discover the relevant local image regions and build a sentiment classifier on top of these local
regions.
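
The general idea of attending over local regions can be sketched with a generic soft attention pooling step. This is not the model of You et al.; it is a minimal dot-product attention, and the region features, the query vector and the function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(region_feats, query):
    """Weight each region by its dot-product relevance to `query`, then average."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, pooled

# Hypothetical features for three image regions and a hypothetical query vector.
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

weights, pooled = attention_pool(region_feats, query)
print(weights, pooled)
```

The pooled vector would then be passed to a sentiment classifier, so regions with higher attention weights contribute more to the prediction. In a trained model the query and the region features are learned jointly rather than fixed by hand.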

Poria et al.134 proposed a deep learning model for multi-modal sentiment analysis and emotion
recognition on video data. In particular, an LSTM-based model is proposed for utterance-level
sentiment analysis, which can capture contextual information from the surrounding utterances in the
same video.

Tripathi et al.135 used deep and CNN-based models for emotion classification on a multimodal
dataset DEAP, which contains electroencephalogram and peripheral physiological and video signals.

Zadeh et al.136 formulated the problem of multimodal sentiment analysis as modelling intra-modality
and inter-modality dynamics and introduced a new neural model named Tensor Fusion Network to
tackle it.

Long et al.137 proposed an attention neural model trained with cognition grounded eye-tracking data
for sentence-level sentiment classification. A Cognition Based Attention (CBA) layer is built for
neural sentiment analysis.

Wang et al.138 proposed a Select-Additive Learning (SAL) approach to tackle the confounding factor
problem in multimodal sentiment analysis, which removes the individual specific latent
representations learned by neural networks (e.g., CNN). To achieve it, two learning phases are
involved, namely, a selection phase for confounding factor identification and a removal phase for
confounding factor removal.

RESOURCE-POOR LANGUAGE AND MULTILINGUAL SENTIMENT ANALYSIS


Recently, sentiment analysis in resource-poor languages (compared to English) has also achieved
significant progress due to the use of deep learning models. Additionally, multilingual features
can also help sentiment analysis, just as multimodal data can. Accordingly, deep learning has been
applied to the multilingual sentiment analysis setting.

Akhtar et al.139 reported a CNN-based hybrid architecture for sentence and aspect level sentiment
classification in a resource-poor language, Hindi.

Dahou et al.140 used word embeddings and a CNN-based model for Arabic sentiment classification at
the sentence level.

Singhal and Bhattacharyya141 designed a solution for multilingual sentiment classification at the
review/sentence level and experimented with multiple languages, including Hindi, Marathi, Russian,
Dutch, French, Spanish, Italian, German, and Portuguese. The authors applied machine translation
tools to translate these languages into English and then used English word embeddings, polarities
from a sentiment lexicon and a CNN model for classification.

Joshi et al.142 introduced a sub-word level representation in an LSTM architecture for sentiment
classification of Hindi-English code-mixed sentences.

OTHER RELATED TASKS


There are also applications of deep learning in some other sentiment analysis related tasks.

Sentiment Intersubjectivity: Gui et al.143 tackled the intersubjectivity problem in sentiment analysis,
where the problem is to study the gap between the surface form of a language and the
corresponding abstract concepts, and incorporate the modelling of intersubjectivity into a proposed
CNN.

Lexicon Expansion: Wang et al.144 proposed a PU learning-based neural approach for opinion lexicon
expansion.

Financial Volatility Prediction: Rekabsaz et al.145 made volatility predictions using financial disclosure
sentiment with word embedding-based information retrieval models, where word embeddings are
used in similar word set expansion.

Opinion Recommendation: Wang and Zhang146 introduced the task of opinion recommendation,
which aims to generate a customized review score of a product that the particular user is likely to
give, as well as a customized review that the user would have written for the target product if the
user had reviewed the product. A multiple-attention memory network was proposed to tackle the
problem, which considers users’ reviews, product’s reviews, and users’ neighbours (similar users).

Stance Detection: Augenstein et al.147 proposed a bidirectional LSTM with a conditional encoding
mechanism for stance detection in political Twitter data. Du et al.148 designed a target-specific neural
attention model for stance classification.

CONCLUSION
Applying deep learning to sentiment analysis has become a popular research topic lately. In this
paper, we introduced various deep learning architectures and their applications in sentiment
analysis. Many of these deep learning techniques have shown state-of-the-art results for various
sentiment analysis tasks. With the advances in deep learning research and applications, we believe
that there will be more exciting research on deep learning for sentiment analysis in the near future.

Acknowledgments
Bing Liu and Shuai Wang’s work was supported in part by the National Science Foundation (NSF) under
grant nos. IIS-1407927 and IIS-1650900, and by a research gift from Huawei Technologies Co. Ltd.

References
[1] Liu B. Sentiment analysis: mining opinions, sentiments, and emotions. The Cambridge University Press,
2015.

[2] Liu B. Sentiment analysis and opinion mining (introduction and survey), Morgan & Claypool, May 2012.

[3] Pang B and Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval,
2008. 2(1–2): pp. 1–135.

[4] Goodfellow I, Bengio Y, Courville A. Deep learning. The MIT Press. 2016.

[5] Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In Proceedings of the International
Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.

[6] Rumelhart D.E, Hinton G.E, Williams R.J. Learning representations by back-propagating errors. Cognitive
modelling, 1988.

[7] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, and Kuksa P. Natural language processing (almost)
from scratch. Journal of Machine Learning Research, 2011.

[8] Goldberg Y. A primer on neural network models for natural language processing. Journal of Artificial
Intelligence Research, 2016.

[9] Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2013.

[10] Lee H, Grosse R, Ranganath R, and Ng A.Y. Convolutional deep belief networks for scalable unsupervised
learning of hierarchical representations. In Proceedings of the International Conference on Machine Learning
(ICML 2009), 2009.

[11] Bengio Y, Ducharme R, Vincent P, and Jauvin C. A neural probabilistic language model. Journal of Machine
Learning Research, 2003.

[12] Morin F, Bengio Y. Hierarchical probabilistic neural network language model. In Proceedings of the
International Workshop on Artificial Intelligence and Statistics, 2005.

[13] Mikolov T, Chen K, Corrado G, and Dean J. Efficient estimation of word representations in vector space. In
Proceedings of International Conference on Learning Representations (ICLR 2013), 2013.

[14] Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J. Distributed representations of words and phrases
and their compositionality. In Proceedings of the Annual Conference on Advances in Neural Information
Processing Systems (NIPS 2013), 2013.

[15] Mnih A, Kavukcuoglu K. Learning word embeddings efficiently with noise-contrastive estimation. In
Proceedings of the Annual Conference on Advances in Neural Information Processing Systems (NIPS 2013),
2013.

[16] Huang E.H, Socher R, Manning C.D. and Ng A.Y. Improving word representations via global context and
multiple word prototypes. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics (ACL 2012), 2012.

[17] Pennington J, Socher R, Manning C.D. GloVe: global vectors for word representation. In Proceedings of the
Conference on Empirical Methods on Natural Language Processing (EMNLP 2014), 2014.

[18] Bengio Y, Lamblin P, Popovici D, and Larochelle H. Greedy layer-wise training of deep networks. In
Proceedings of the Annual Conference on Advances in Neural Information Processing Systems (NIPS 2006),
2006.

[19] Hinton G.E, Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science, July
2006.

[20] Vincent P, Larochelle H, Bengio Y, and Manzagol P-A. Extracting and composing robust features with
denoising autoencoders. In Proceedings of the International Conference on Machine Learning (ICML 2008),
2008.

[21] Sermanet P, LeCun Y. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of
the International Joint Conference on Neural Networks (IJCNN 2011), 2011.

[22] Elman J.L. Finding structure in time. Cognitive Science, 1990.

[23] Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE
Transactions on Neural Networks, 1994.

[24] Schuster M, Paliwal K.K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing,
1997.

[25] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 9(8): 1735-1780, 1997.

[26] Tai K.S, Socher R, Manning C. D. Improved semantic representations from tree-structured long short-term
memory networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL
2015), 2015.

[27] Cho K, Bahdanau D, Bougares F, Schwenk H and Bengio Y. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing (EMNLP 2014), 2014.

[28] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on
sequence modelling. arXiv preprint arXiv:1412.3555, 2014.

[29] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv
preprint arXiv:1409.0473, 2014.

[30] Weston J, Chopra S, Bordes A. Memory networks. arXiv preprint arXiv:1410.3916. 2014.

[31] Sukhbaatar S, Weston J, Fergus R. End-to-end memory networks. In Proceedings of the 29th Conference on
Neural Information Processing Systems (NIPS 2015), 2015.

[32] Graves A, Wayne G, Danihelka I. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[33] Qian Q, Tian B, Huang M, Liu Y, Zhu X and Zhu X. Learning tag embeddings and tag-specific composition
functions in the recursive neural network. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL 2015), 2015.

[34] Moraes R, Valiati J.F, Neto W.P. Document-level sentiment classification: an empirical comparison
between SVM and ANN. Expert Systems with Applications. 2013.

[35] Le Q, Mikolov T. Distributed representations of sentences and documents. In Proceedings of the
International Conference on Machine Learning (ICML 2014), 2014.

[36] Glorot X, Bordes A, Bengio Y. Domain adaptation for large-scale sentiment classification: a deep learning
approach. In Proceedings of the International Conference on Machine Learning (ICML 2011), 2011.

[37] Zhai S, Zhang Z. Semisupervised autoencoder for sentiment analysis. In Proceedings of AAAI
Conference on Artificial Intelligence (AAAI 2016), 2016.

[38] Johnson R, Zhang T. Effective use of word order for text categorization with convolutional neural networks.
In Proceedings of the Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL-HLT 2015), 2015.

[39] Tang D, Qin B, Liu T. Document modelling with gated recurrent neural network for sentiment classification.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), 2015.

[40] Tang D, Qin B, Liu T. Learning semantic representations of users and products for document level
sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
(ACL 2015), 2015.

[41] Chen H, Sun M, Tu C, Lin Y, and Liu Z. Neural sentiment classification with user and product attention. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[42] Dou ZY. Capturing user and product information for document level sentiment analysis with deep memory
network. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP
2017), 2017.

[43] Xu J, Chen D, Qiu X, and Huang X. Cached long short-term memory neural networks for document-level
sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2016), 2016.

[44] Yang Z, Yang D, Dyer C, He X, Smola AJ, and Hovy EH. Hierarchical attention networks for document
classification. In Proceedings of the Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), 2016.

[45] Yin Y, Song Y, Zhang M. Document-level multi-aspect sentiment classification as machine comprehension.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), 2017.

[46] Zhou X, Wan X, Xiao J. Attention-based LSTM network for cross-lingual sentiment classification. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[47] Li Z, Zhang Y, Wei Y, Wu Y, and Yang Q. End-to-end adversarial memory network for cross-domain
sentiment classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI
2017), 2017.

[48] Wiebe J, Bruce R, and O’Hara T. Development and use of a gold standard data set for subjectivity
classifications. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL
1999), 1999.

[49] Socher R, Pennington J, Huang E.H, Ng A.Y, and Manning C.D. Semi-supervised recursive autoencoders for
predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2011), 2011.

[50] Socher R, Huval B, Manning C.D, and Ng A.Y. Semantic compositionality through recursive matrix-vector
spaces. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2012),
2012.

[51] Socher R, Perelygin A, Wu J. Y, Chuang J, Manning C.D, Ng A. Y, and Potts C. Recursive deep models for
semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods
on Natural Language Processing (EMNLP 2013), 2013.

[52] Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. In
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014.

[53] Kim Y. Convolutional neural networks for sentence classification. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics (ACL 2014), 2014.

[54] dos Santos, C. N., Gatti M. Deep convolutional neural networks for sentiment analysis for short texts. In
Proceedings of the International Conference on Computational Linguistics (COLING 2014), 2014.

[55] Wang X, Liu Y, Sun C, Wang B, and Wang X. Predicting polarities of tweets by composing word embeddings
with long short-term memory. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics (ACL 2015), 2015.

[56] Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural
network architectures. Neural Networks, 2005.

[57] Wang J, Yu L-C, Lai R.K., and Zhang X. Dimensional sentiment analysis using a regional CNN-LSTM model.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2016), 2016.

[58] Wang X, Jiang W, Luo Z. Combination of convolutional and recurrent neural network for sentiment
analysis of short texts. In Proceedings of the International Conference on Computational Linguistics (COLING
2016), 2016.

[59] Guggilla C, Miller T, Gurevych I. CNN-and LSTM-based claim classification in online user comments. In
Proceedings of the International Conference on Computational Linguistics (COLING 2016), 2016.

[60] Huang M, Qian Q, Zhu X. Encoding syntactic knowledge in neural networks for sentiment classification.
ACM Transactions on Information Systems, 2017.

[61] Akhtar MS, Kumar A, Ghosal D, Ekbal A, and Bhattacharyya P. A multilayer perceptron based ensemble
technique for fine-grained financial sentiment analysis. In Proceedings of the Conference on Empirical Methods
on Natural Language Processing (EMNLP 2017), 2017.

[62] Guan Z, Chen L, Zhao W, Zheng Y, Tan S, and Cai D. Weakly-supervised deep learning for customer review
sentiment classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI
2016), 2016.

[63] Teng Z, Vo D-T, and Zhang Y. Context-sensitive lexicon features for neural sentiment analysis. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[64] Yu J, Jiang J. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[65] Zhao Z, Lu H, Cai D, He X, Zhuang Y. Microblog sentiment classification via recurrent random walk network
learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.

[66] Mishra A, Dey K, Bhattacharyya P. Learning cognitive features from gaze data for sentiment and sarcasm
classification using convolutional neural network. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL 2017), 2017.

[67] Qian Q, Huang M, Lei J, and Zhu X. Linguistically regularized LSTM for sentiment classification. In
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017), 2017.

[68] Dong L, Wei F, Tan C, Tang D, Zhou M, and Xu K. Adaptive recursive neural network for target-dependent
Twitter sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics (ACL 2014), 2014.

[69] Vo D-T, Zhang Y. Target-dependent twitter sentiment classification with rich automatic features. In
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2015), 2015.

[70] Tang D, Qin B, Feng X, and Liu T. Effective LSTMs for target-dependent sentiment classification. In
Proceedings of the International Conference on Computational Linguistics (COLING 2016), 2016.

[71] Ruder S, Ghaffari P, Breslin J.G. A hierarchical model of reviews for aspect-based sentiment analysis. In
Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2016), 2016.

[72] Zhang M, Zhang Y, Vo D-T. Gated neural networks for targeted sentiment analysis. In Proceedings of AAAI
Conference on Artificial Intelligence (AAAI 2016), 2016.

[73] Wang Y, Huang M, Zhu X, and Zhao L. Attention-based LSTM for aspect-level sentiment classification. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[74] Yang M, Tu W, Wang J, Xu F, and Chen X. Attention-based LSTM for target-dependent sentiment
classification. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2017), 2017.

[75] Liu J, Zhang Y. Attention modeling for targeted sentiment. In Proceedings of the Conference of the
European Chapter of the Association for Computational Linguistics (EACL 2017), 2017.

[76] Tang D, Qin B, and Liu T. Aspect-level sentiment classification with deep memory network. arXiv preprint
arXiv:1605.08900, 2016.

[77] Lei T, Barzilay R, Jaakkola T. Rationalizing neural predictions. In Proceedings of the Conference on Empirical
Methods on Natural Language Processing (EMNLP 2016), 2016.

[78] Li C, Guo X, Mei Q. Deep memory networks for attitude identification. In Proceedings of the ACM
International Conference on Web Search and Data Mining (WSDM 2017), 2017.

[79] Ma D, Li S, Zhang X, Wang H. Interactive attention networks for aspect-level sentiment classification. In
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.

[80] Chen P, Sun Z, Bing L, and Yang W. Recurrent attention network on memory for aspect sentiment analysis.
In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[81] Tay Y, Tuan LA, Hui SC. Dyadic memory networks for aspect-based sentiment analysis. In Proceedings of
the International Conference on Information and Knowledge Management (CIKM 2017), 2017.

[82] Katiyar A, Cardie C. Investigating LSTMs for joint extraction of opinion entities and relations. In
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2016), 2016.

[83] Wang W, Pan SJ, Dahlmeier D, and Xiao X. Recursive neural conditional random fields for aspect-based
sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP 2016), 2016.

[84] Wang W, Pan SJ, Dahlmeier D, and Xiao X. Coupled multi-layer attentions for co-extraction of aspect and
opinion terms. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2017), 2017.

[85] Li X, Lam W. Deep multi-task learning for aspect term extraction with memory interaction. In Proceedings
of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[86] He R, Lee WS, Ng HT, and Dahlmeier D. An unsupervised neural attention model for aspect extraction. In
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017), 2017.

[87] Zhang M, Zhang Y, Vo D-T. Neural networks for open domain targeted sentiment. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), 2015.

[88] Zhou X, Wan X, Xiao J. Representation learning for aspect category detection in online reviews. In
Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2015), 2015.

[89] Yin Y, Wei F, Dong L, Xu K, Zhang M, and Zhou M. Unsupervised word and dependency path embeddings
for aspect term extraction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI
2016), 2016.

[90] Xiong S, Zhang Y, Ji D, and Lou Y. Distance metric learning for aspect phrase grouping. In Proceedings of
the International Conference on Computational Linguistics (COLING 2016), 2016.

[91] Poria S, Cambria E, Gelbukh A. Aspect extraction for opinion mining with a deep convolutional neural
network. Journal of Knowledge-based Systems. 2016.

[92] Ying D, Yu J, Jiang J. Recurrent neural networks with auxiliary labels for cross-domain opinion target
extraction. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2017), 2017.

[93] Irsoy O, Cardie C. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference
on Empirical Methods on Natural Language Processing (EMNLP 2014), 2014.

[94] Yang B, Cardie C. Extracting opinion expressions with semi-markov conditional random fields. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2012), 2012.

[95] Liu P, Joty S, Meng H. Fine-grained opinion mining with recurrent neural networks and word embeddings.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), 2015.

[96] Irsoy O, Cardie C. Deep recursive neural networks for compositionality in language. In Proceedings of the
Annual Conference on Advances in Neural Information Processing Systems (NIPS 2014), 2014.

[97] Zhu X, Guo H, Sobhani P. Neural networks for integrating compositional and non-compositional sentiment
in sentiment composition. In Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies (NAACL-HLT 2015), 2015.

[98] Yang B, Cardie C. Joint inference for fine-grained opinion extraction. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics (ACL 2013), 2013.

[99] Deng L, Wiebe J. Recognizing opinion sources based on a new categorization of opinion types. In
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2016), 2016.

[100] Chen C, Wang Z, Lei Y, and Li W. Content-based influence modelling for opinion behaviour prediction. In
Proceedings of the International Conference on Computational Linguistics (COLING 2016), 2016.

[101] Rashkin H, Bell E, Choi Y, and Volkova S. Multilingual connotation frames: a case study on social media
for targeted sentiment analysis and forecast. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL 2017), 2017.

[102] Maas A. L, Daly R. E, Pham P. T, Huang D, Ng A. Y, and Potts C. Learning word vectors for sentiment
analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011),
2011.

[103] Bespalov D, Bai B, Qi Y, and Shokoufandeh A. Sentiment classification based on supervised latent n-gram
analysis. In Proceedings of the International Conference on Information and Knowledge Management (CIKM
2011), 2011.

[104] Labutov I, Lipson H. Re-embedding words. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL 2013), 2013.

[105] Tang D, Wei F, Yang N, Zhou M, Liu T, and Qin B. Learning sentiment-specific word embedding for twitter
sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
(ACL 2014), 2014.

[106] Tang D, Wei F, Qin B, Yang N, Liu T, and Zhou M. Sentiment embeddings with applications to sentiment
analysis. IEEE Transactions on Knowledge and Data Engineering, 2016.

[107] Wang L, Xia R. Sentiment Lexicon construction with representation Learning based on hierarchical
sentiment Supervision. In Proceedings of the Conference on Empirical Methods on Natural Language
Processing (EMNLP 2017), 2017.

[108] Yu LC, Wang J, Lai KR, and Zhang X. Refining word embeddings for sentiment analysis. In Proceedings of
the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[109] Li J, Jurafsky D. Do multi-sense embeddings improve natural language understanding? In Proceedings of
the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), 2015.

[110] Ren Y, Zhang Y, Zhang M, and Ji D. Improving Twitter sentiment classification using topic-enriched
multi-prototype word embeddings. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2016), 2016.

[111] Zhou H, Chen L, Shi F, Huang D. Learning bilingual sentiment word embeddings for cross-language
sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
(ACL 2015), 2015.

[112] Barnes J, Lambert P, Badia T. Exploring distributional representations and machine translation for aspect-
based cross-lingual sentiment classification. In Proceedings of the 27th International Conference on
Computational Linguistics (COLING 2016), 2016.

[113] Zhang W, Yuan Q, Han J, and Wang J. Collaborative multi-Level embedding learning from reviews for
rating prediction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2016),
2016.

[114] Sharma R, Somani A, Kumar L, and Bhattacharyya P. Sentiment intensity ranking among adjectives using
sentiment bearing word embeddings. In Proceedings of the Conference on Empirical Methods on Natural
Language Processing (EMNLP 2017), 2017.

[115] Zhang M, Zhang Y, Fu G. Tweet sarcasm detection using deep neural network. In Proceedings of the
International Conference on Computational Linguistics (COLING 2016), 2016.

[116] Joshi A, Tripathi V, Patel K, Bhattacharyya P, and Carman M. Are word embedding-based features useful
for sarcasm detection? In Proceedings of the Conference on Empirical Methods on Natural Language
Processing (EMNLP 2016), 2016.

[117] Poria S, Cambria E, Hazarika D, and Vij P. A deeper look into sarcastic tweets using deep convolutional
neural networks. In Proceedings of the International Conference on Computational Linguistics (COLING 2016),
2016.

[118] Peled L, Reichart R. Sarcasm SIGN: Interpreting sarcasm with sentiment based monolingual machine
translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017),
2017.

[119] Ghosh A, Veale T. Magnets for sarcasm: making sarcasm detection timely, contextual and very personal.
In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[120] Van Hee C, Lefever E, Hoste V. Monday mornings are my fave:)# not exploring the automatic recognition
of irony in english tweets. In Proceedings of the International Conference on Computational Linguistics (COLING
2016), 2016.

[121] Chen WF, Lin FY, Ku LW. WordForce: visualizing controversial words in debates. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016.

[122] Wang Z, Zhang Y, Lee S, Li S, and Zhou G. A bilingual attention network for code-switched emotion
prediction. In Proceedings of the International Conference on Computational Linguistics (COLING 2016), 2016.

[123] Zhou H, Huang M, Zhang T, Zhu X and Liu B. Emotional chatting machine: emotional conversation
generation with internal and external memory. arXiv preprint. arXiv:1704.01074, 2017.

[124] Abdul-Mageed M, Ungar L. EmoNet: fine-grained emotion detection with gated recurrent neural
networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017),
2017.

[125] Felbo B, Mislove A, Søgaard A, Rahwan I, and Lehmann S. Using millions of emoji occurrences to learn
any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the Conference
on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[126] Gui L, Hu J, He Y, Xu R, Lu Q, and Du J. A question answering approach to emotion cause extraction. In
Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[127] Poria S, Cambria E, Gelbukh A. Deep convolutional neural text features and multiple kernel learning for
utterance-level multimodal sentiment analysis. In Proceedings of the Conference on Empirical Methods on
Natural Language Processing (EMNLP 2015), 2015.

[128] Bertero D, Siddique FB, Wu CS, Wan Y, Chan R.H, and Fung P. Real-time speech emotion and sentiment
recognition for interactive dialogue systems. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2016), 2016.

[129] Fung P, Dey A, Siddique FB, Lin R, Yang Y, Bertero D, Wan Y, Chan RH, and Wu CS. Zara: a virtual
interactive dialogue system incorporating emotion, sentiment and personality recognition. In Proceedings of
the International Conference on Computational Linguistics (COLING 2016), 2016.

[130] Wang J, Fu J, Xu Y, and Mei T. Beyond object recognition: visual sentiment analysis with deep coupled
adjective and noun neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI 2016), 2016.

[131] Yang J, Sun M, Sun X. Learning visual sentiment distributions via augmented conditional probability
neural network. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI 2017), 2017.

[132] Zhu X, Li L, Zhang W, Rao T, Xu M, Huang Q, and Xu D. Dependency exploitation: a unified CNN-RNN
approach for visual emotion recognition. In Proceedings of the International Joint Conference on Artificial
Intelligence (IJCAI 2017), 2017.

[133] You Q, Jin H, Luo J. Visual sentiment analysis by attending on local image regions. In Proceedings of AAAI
Conference on Artificial Intelligence (AAAI 2017), 2017.

[134] Poria S, Cambria E, Hazarika D, Majumder N, Zadeh A, and Morency LP. Context-dependent sentiment
analysis in user-generated videos. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics (ACL 2017), 2017.

[135] Tripathi S, Acharya S, Sharma RD, Mittal S, and Bhattacharya S. Using deep and convolutional neural
networks for accurate emotion classification on DEAP dataset. In Proceedings of AAAI Conference on Artificial
Intelligence (AAAI 2017), 2017.

[136] Zadeh A, Chen M, Poria S, Cambria E, and Morency LP. Tensor fusion network for multimodal sentiment
analysis. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP
2017), 2017.

[137] Long Y, Qin L, Xiang R, Li M, and Huang CR. A cognition based attention model for sentiment analysis. In
Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[138] Wang H, Meghawat A, Morency LP, and Xing E.X. Select-additive learning: improving generalization in
multimodal sentiment analysis. In Proceedings of the International Conference on Multimedia and Expo (ICME
2017), 2017.

[139] Akhtar MS, Kumar A, Ekbal A, and Bhattacharyya P. A hybrid deep learning architecture for sentiment
analysis. In Proceedings of the International Conference on Computational Linguistics (COLING 2016), 2016.

[140] Dahou A, Xiong S, Zhou J, Haddoud MH, and Duan P. Word embeddings and convolutional neural
network for Arabic sentiment classification. In Proceedings of the International Conference on Computational
Linguistics (COLING 2016), 2016.

[141] Singhal P, Bhattacharyya P. Borrow a little from your rich cousin: using embeddings and polarities of
english words for multilingual sentiment classification. In Proceedings of the International Conference on
Computational Linguistics (COLING 2016), 2016.

[142] Joshi A, Prabhu A, Shrivastava M, and Varma V. Towards sub-word level compositions for sentiment
analysis of Hindi-English code mixed text. In Proceedings of the International Conference on Computational
Linguistics (COLING 2016), 2016.

[143] Gui L, Xu R, He Y, Lu Q, and Wei Z. Intersubjectivity and sentiment: from language to knowledge. In
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2016), 2016.

[144] Wang Y, Zhang Y, Liu B. Sentiment lexicon expansion based on neural PU Learning, double dictionary
lookup, and polarity association. In Proceedings of the Conference on Empirical Methods on Natural Language
Processing (EMNLP 2017), 2017.

[145] Rekabsaz N, Lupu M, Baklanov A, Hanbury A, Dür A, and Anderson L. Volatility prediction using financial
disclosures sentiments with word embedding-based IR models. In Proceedings of the Annual Meeting of the
Association for Computational Linguistics (ACL 2017), 2017.

[146] Wang Z, Zhang Y. Opinion recommendation using a neural model. In Proceedings of the Conference on
Empirical Methods on Natural Language Processing (EMNLP 2017), 2017.

[147] Augenstein I, Rocktäschel T, Vlachos A, Bontcheva K. Stance detection with bidirectional conditional
encoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP
2016), 2016.

[148] Du J, Xu R, He Y, Gui L. Stance classification with target-specific neural attention networks. In Proceedings
of the International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.


A New Backpropagation Algorithm without Gradient Descent
arXiv:1802.00027v1 [cs.LG] 25 Jan 2018

Varun Ranganathan, Student at PES University, varunranga1997@hotmail.com
S. Natarajan, Professor at PES University, natarajan@pes.edu

January 2018

Abstract
The backpropagation algorithm, originally introduced in the 1970s, is the workhorse of
learning in neural networks. Backpropagation makes use of the well-known machine learning
algorithm Gradient Descent, a first-order iterative optimization algorithm for finding
the minimum of a function. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate
gradient) of the function at the current point. In this paper, we develop an alternative
to backpropagation that does not use Gradient Descent; instead, we devise a new algorithm
that finds the error in the weights and biases of an artificial neuron using the
Moore-Penrose Pseudo Inverse. Numerical studies and experiments performed on various
datasets are used to verify the working of this alternative algorithm.

Index Terms – Machine Learning, Artificial Neural Network (ANN), Backpropagation,
Moore-Penrose Pseudo Inverse.

1 Introduction
An Artificial Neural Network (ANN), inspired by biological neural networks, is based on
a collection of connected units or nodes called artificial neurons. These systems are
used as a learning algorithm that tries to mimic how the brain works. ANNs are considered
universal function approximators, that is, they can approximate the function underlying
the data sent through them. The ANN is based on the multilayer perceptron [3] model, a
class of feedforward artificial neural networks consisting of at least three layers of
nodes. Learning occurs in the perceptron by changing connection weights after each piece
of data is processed, based on the amount of error in the output compared to the expected
result. This is an example of supervised learning, and is carried out through
backpropagation, a generalization of the least mean squares algorithm in the linear
perceptron. The multilayer perceptron model coupled with the backpropagation algorithm
gave rise to the Artificial Neural Network, which can be used effectively and efficiently
as a learning algorithm.

Backpropagation [1] is a method used in artificial neural networks to calculate the error
contribution of each neuron after a batch of data is processed. It is commonly used by
the gradient descent optimization algorithm to adjust the weights of neurons by
calculating the gradient of the loss function. This technique is also sometimes called
backward propagation of errors, because the error is calculated at the output and
distributed back through the network layers. It also requires a known, desired output for
each input value; it is therefore considered to be a supervised learning method.

Gradient Descent [4] is an iterative approach that takes small steps to reach a local
minimum of a function. It is used to update the weights and biases of each neuron in a
neural network. Gradient descent is based on the observation that if the multivariable
function $F(x)$ is defined and differentiable in a neighborhood of a point $a$, then
$F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient
of $F$ at $a$, $-\nabla F(a)$. It follows that, if
$a_{n+1} = a_n - \gamma \nabla F(a_n)$ for $\gamma$ small enough, then
$F(a_n) \geq F(a_{n+1})$.

In other words, the term $\gamma \nabla F(a)$ is subtracted from $a$ because we want to
move against the gradient, toward the minimum. With this observation in mind, one starts
with a guess $x_0$ for a local minimum of $F$, and considers the sequence
$x_0, x_1, x_2, \dots$ such that $x_{n+1} = x_n - \gamma_n \nabla F(x_n)$, for
$n \geq 0$. We have $F(x_0) \geq F(x_1) \geq F(x_2) \geq \dots$, so hopefully the
sequence $(x_n)$ converges to the desired local minimum.
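
The update rule described above can be illustrated with a minimal sketch. The function name `gradient_descent`, the fixed step size, and the quadratic test function $F(x) = (x - 3)^2$ are illustrative assumptions and do not come from the paper.

```python
def gradient_descent(grad, x0, gamma=0.1, steps=100):
    """Iterate x_{n+1} = x_n - gamma * grad(x_n) with a fixed step size."""
    x = x0
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x


# Assumed example: F(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # approaches the minimizer x = 3
```

With a step size this small relative to the curvature, each iteration shrinks the distance to the minimizer by a constant factor, which is the slow tail behaviour the limitations below refer to.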

Even though this method works well in general, it has a few limitations. Firstly, due to
the iterative nature of the algorithm, it can take a long time to converge to a local
minimum of the function. Secondly, gradient descent is relatively slow close to the
minimum: technically, its asymptotic rate of convergence is inferior to that of many
other methods. Thirdly, gradient methods are ill-defined for non-differentiable
functions.

Throughout the paper we will refer to the Moore-Penrose Pseudo Inverse [8]. In
mathematics, and in particular linear algebra, a pseudoinverse $A^+$ of a matrix $A$ is a
generalization of the inverse matrix. The most widely known type of matrix pseudoinverse
is the Moore–Penrose inverse, which was independently described by E. H. Moore in 1920,
Arne Bjerhammar in 1951, and Roger Penrose in 1955. A common use of the pseudoinverse is
to compute a 'best fit' (least squares) solution to a system of linear equations that
lacks a unique solution. Another use is to find the minimum (Euclidean) norm solution to
a system of linear equations with multiple solutions. The pseudoinverse facilitates the
statement and proof of results in linear algebra. The pseudoinverse is defined and unique
for all matrices whose entries are real or complex numbers. It can be computed using the
singular value decomposition.
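
The two uses mentioned above can be checked numerically. The sketch below uses NumPy's `numpy.linalg.pinv` (which is computed via the singular value decomposition); the matrices are made-up examples, not data from the paper.

```python
import numpy as np

# Overdetermined system (3 equations, 2 unknowns): pinv gives the least-squares fit.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
x_ls = np.linalg.pinv(A) @ y
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)  # reference least-squares solver

# Underdetermined system (1 equation, 2 unknowns): pinv gives the minimum-norm solution.
B = np.array([[2.0, 1.0]])
z = np.array([5.0])
x_mn = np.linalg.pinv(B) @ z

print(x_ls, x_ref)  # the two least-squares answers agree
print(x_mn)         # satisfies B @ x_mn = z with the smallest Euclidean norm
```

The underdetermined case is the one that matters for this paper, since the row matrix built from a single input and a constant 1 has more unknowns than equations.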

In this paper, we formulate another method of finding the errors in the weights and
biases of the neurons in a neural network. But first, we would like to present a few
assumptions made in the model of the neural network, to make our method feasible.

2 Modifications to the neuron structure


We have made one change to the structure of an artificial neuron. We assume that there is
a weight and a bias associated with each input, that is, each element in the input vector
is multiplied by a weight and a bias is added to it. This is a slight alteration from the
traditional artificial neuron, where a common bias is applied to the overall output of
the neuron. This change will not alter the goal or the end result of a neural network.
The proof of this statement is shown below:

For an input vector of size $n$:

$$c_1 w_1 + b_1 + c_2 w_2 + b_2 + c_3 w_3 + b_3 + \dots + c_n w_n + b_n \tag{1}$$

$$= c_1 w_1 + c_2 w_2 + c_3 w_3 + \dots + c_n w_n + b \tag{2}$$

where

$$b = b_1 + b_2 + b_3 + \dots + b_n \tag{3}$$

Therefore, having a separate bias for each input element will make no difference to the
end result.
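
As a quick numerical check of equations (1) to (3), the following sketch compares the two forms on random values. The variable names and the vector length of 5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=5)    # input elements c_1..c_n
w = rng.normal(size=5)    # one weight per input
b_i = rng.normal(size=5)  # one bias per input (the modified neuron)

per_input_bias = np.sum(c * w + b_i)     # equation (1)
single_bias = np.dot(c, w) + b_i.sum()   # equations (2) and (3)
print(per_input_bias, single_bias)
```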

Figure 1: Neuron

3 The New Backpropagation Algorithm
3.1 Calculating new weights and biases for a neuron
Taking one neuron at a time, there is one input entering the neuron, which is multiplied
by some weight, and a bias is added to this product. This value is then sent through an
activation function, and the output of the activation function is taken as the output of
the neuron.

Let $C$ be the input into the neuron, $w$ the original weight applied to that input, and
$b$ the original bias applied to that input.

Let $x$ be the output given initially when the input $C$ passes through the neuron.

Let $x_n$ be the output that we require.

Based on the required output, we will require a different weight and bias value, say
$w_n$ and $b_n$ respectively.

The original output is calculated as

$$C w + b = x \tag{4}$$

But we require $x_n$ as the output. Therefore,

$$C w_n + b_n = x_n \tag{5}$$

Let

$$w_n = w - \Delta w \tag{6}$$
$$b_n = b - \Delta b \tag{7}$$

where $\Delta w$ is the error in the weight and $\Delta b$ is the error in the bias. Then

$$C w_n + b_n = x_n \tag{8}$$
$$C (w - \Delta w) + (b - \Delta b) = x_n \tag{9}$$
$$C w - C \Delta w + b - \Delta b = x_n \tag{10}$$
$$(C w + b) - (C \Delta w + \Delta b) = x_n \tag{11}$$
$$x - x_n = C \Delta w + \Delta b \tag{12}$$

Therefore,

$$C \Delta w + \Delta b = x - x_n \tag{13}$$

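Now,

$$\begin{bmatrix} C & 1 \end{bmatrix} \begin{bmatrix} \Delta w \\ \Delta b \end{bmatrix} = \begin{bmatrix} x - x_n \end{bmatrix} \tag{14}$$

$$\begin{bmatrix} \Delta w \\ \Delta b \end{bmatrix} = \begin{bmatrix} C & 1 \end{bmatrix}^{-1} \begin{bmatrix} x - x_n \end{bmatrix} \tag{15}$$

But $\begin{bmatrix} C & 1 \end{bmatrix}$ is not a square matrix. Therefore, we will have
to find the Moore-Penrose Pseudo-Inverse of the matrix $\begin{bmatrix} C & 1 \end{bmatrix}$.

$$\begin{bmatrix} \Delta w \\ \Delta b \end{bmatrix} = \begin{bmatrix} C & 1 \end{bmatrix}^{+} \begin{bmatrix} x - x_n \end{bmatrix} \tag{16}$$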

After obtaining $\Delta w$ and $\Delta b$, change the original weight and bias to the new
weight and bias according to

$$w_n = w - \alpha \, \Delta w \tag{17}$$

$$b_n = b - \alpha \, \Delta b \tag{18}$$

where $\alpha$ is the learning rate.
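
Equations (4) and (13) to (18) can be sketched for a single scalar input as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pinv_update` and the example numbers are hypothetical, and the activation function is left out.

```python
import numpy as np


def pinv_update(C, w, b, x_target, alpha=1.0):
    """One pseudoinverse-based update of a single weight-bias pair."""
    x = C * w + b                                   # current output, eq. (4)
    A = np.array([[C, 1.0]])                        # the 1x2 matrix [C 1]
    delta = np.linalg.pinv(A) @ np.array([x - x_target])  # eq. (16)
    dw, db = delta
    return w - alpha * dw, b - alpha * db           # eqs. (17) and (18)


# Assumed example values.
C, w, b, x_target = 2.0, 0.5, 0.1, 3.0
w_new, b_new = pinv_update(C, w, b, x_target, alpha=1.0)
print(C * w_new + b_new)  # with alpha = 1 the neuron reproduces x_target
```

With $\alpha = 1$ the corrected pair satisfies equation (5) exactly for this one sample; a smaller $\alpha$ moves only part of the way, which is what makes training over many samples possible.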

3.2 Tackling multiple inputs


The above method of changing the weights and biases of the neuron can be extended to a
vector input of length $n$.

Let the input vector $C$ be $n$-dimensional.

In this case, each element of the input vector will be multiplied by its respective
weight in the neuron, and a bias will be added to each of the products. Therefore, there
will be $n$ input elements, $n$ corresponding weights and biases, and $n$ outputs, one
from each weight-bias block. These outputs are added up to give one single output, which
is passed on to the activation function.

During the backpropagation stage, the desired output is distributed amongst all the
weight-bias pairs such that, for weight-bias block $i$, $(w_i, b_i)$, the required output
for that block is $1/n$ of the required output.

That is, for all weight-bias blocks $(w_i, b_i)$,

$$x_{n_i} = x_n / n \tag{19}$$

Figure 2: Neuron for vector length of 'n'

The weights and biases are initialized to random values in the beginning, that is, the
absolute weightage given to each element in the input vector is randomized. The relative
weightage given to each element in the input vector should be the same. Each weight-bias
block will give the same output, so that the cumulative output will give us the required
answer. Therefore, this method of dividing the required output equally among the
weight-bias blocks will work.
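
The multi-input case can be sketched by applying the single-pair update to every weight-bias block with the per-block target from equation (19). The function name `update_neuron` and the example values are assumptions; the sketch also uses the closed form of the pseudoinverse of a non-zero row vector, $[c_i \; 1]^{+} = [c_i \; 1]^{T} / (c_i^2 + 1)$, instead of calling a library routine.

```python
import numpy as np


def update_neuron(c, w, b, target, alpha=1.0):
    """Update all n weight-bias blocks of one neuron toward a pre-activation target."""
    c, w, b = (np.asarray(v, dtype=float) for v in (c, w, b))
    n = c.size
    block_out = c * w + b          # output of each weight-bias block
    block_target = target / n      # eq. (19): each block gets 1/n of the target
    err = block_out - block_target
    denom = c ** 2 + 1.0           # [c_i 1] [c_i 1]^T, always >= 1
    dw = c * err / denom           # first component of [c_i 1]^+ * err
    db = err / denom               # second component
    return w - alpha * dw, b - alpha * db


# Assumed example values.
c = [1.0, -2.0, 0.5]
w_new, b_new = update_neuron(c, w=[0.3, 0.1, -0.4], b=[0.0, 0.2, 0.1], target=1.5)
print(np.sum(np.asarray(c) * w_new + b_new))  # equals the target when alpha = 1
```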

3.3 Activation Function for non-linearity


To achieve non-linearity, the general approach is to pass the sum of the outputs from all
weight-bias pairs through a non-linear activation function [6]. During the
backpropagation phase, to correct the weights and biases of the neuron, we cannot simply
pass the required output vector as it is. If we did so, the weights and biases would be
changed as though there were no activation function, and when the same vector is forward
propagated, the neuron outputs would go through the activation function and give a wrong
result. Therefore, we must pass the required output vector through the inverse of the
activation function. We must choose an activation function whose domain and range are the
same, so as to avoid math errors and loss of data. The vector obtained after applying the
inverse activation function is the actual vector sent to the layers of the network during
the backpropagation phase.
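
For the Softplus activation used in the experiments, the inverse can be written in closed form: if $y = \ln(1 + e^{z})$ then $z = \ln(e^{y} - 1)$, which is only defined for $y > 0$. The sketch below is an illustration under that assumption; the function names are hypothetical, and how the paper's own code handles non-positive targets is not stated.

```python
import numpy as np


def softplus(z):
    """Softplus activation, log(1 + exp(z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, z)


def softplus_inverse(y):
    """Inverse of softplus; only defined for strictly positive outputs."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("softplus only produces values > 0, so y must be > 0")
    return np.log(np.expm1(y))


z = np.array([-2.0, 0.0, 3.0])
print(softplus_inverse(softplus(z)))  # recovers z up to rounding
```

The explicit error for non-positive targets makes visible the kind of "math error" the paragraph above warns about when a required output falls outside the range of the activation function.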

3.4 Network Architecture


Figure 3 shows the representation of a neural network. Each neuron outputs one value. The
output of every neuron in one layer is sent as the input to every neuron in the next
layer. Therefore, each layer can be associated with a buffer list, so that the output
from each neuron in that layer can be stored and passed on to the next layer as input.
This helps in the implementation of a neural network by simplifying the forward
propagation task.

Figure 3: Neural Network Representation

Figure 4: Neural Network Implementation

The input forward propagates through the network, and the last (output) layer gives out
an output vector. For this last layer, the required output is known. Therefore the
weights and biases of the neurons of the last layer can be easily changed.

We do not know the required output vectors for the previous layers; we can only make a
calculated guess. We ask ourselves the question, "What should the input to the last layer
(which is the output vector of the previous layer) be, such that the output would be
correct?" We then conclude that this input, which is the correct required output for the
previous layer, is the vector that would have given no error in the output of the last
layer. This can be illustrated by the following equations.

If

$$C w + b = x \tag{20}$$

then the vector $C_n$ that satisfies the equation $C_n w + b = x_n$ is

$$C_n = (x_n - b) / w \tag{21}$$

This approach can be extended to all the previous layers.

Another issue is that each neuron in a layer gives its own 'required' input, chosen so
that its own output will be correct. This could happen in a multiclass classification
problem, wherein the required output vector is a one-hot encoded vector (where the
element of the vector at the position of the required class is 1, and the other elements
in the vector are 0). Therefore, we take the average of all these vectors. This gives
equal weightage to the feedback from each neuron. This averaged required input vector is
passed to the previous layer as the required output of that layer.
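
One way to read equation (21) together with the averaging step is sketched below. It is an interpretation, not the authors' code: the function name `required_inputs` is hypothetical, equation (21) is applied per weight-bias block with the per-block target $x_n / n$ from equation (19), all weights are assumed non-zero, and the targets are assumed to be pre-activation values.

```python
import numpy as np


def required_inputs(W, B, targets):
    """Average, over the neurons of a layer, the input each neuron would need.

    W, B    : arrays of shape (m, n), weights and per-input biases of m neurons
    targets : array of shape (m,), required pre-activation output of each neuron
    """
    n = W.shape[1]
    per_neuron = (targets[:, None] / n - B) / W   # eq. (21) applied block by block
    return per_neuron.mean(axis=0)                # equal weightage for every neuron


# Assumed example with a single neuron, where the average is that neuron's own request.
W = np.array([[2.0, 4.0]])
B = np.array([[1.0, 1.0]])
targets = np.array([6.0])
c_req = required_inputs(W, B, targets)
print(c_req, np.sum(c_req * W[0] + B[0]))  # feeding c_req reproduces the target
```

With several neurons the averaged vector generally satisfies none of them exactly, which is why the procedure is described above as a calculated guess.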

This concludes the complete working of the neural network with our devised
backpropagation algorithm.

4 Differences with Extreme Learning Machines


Extreme learning machines [7] are feedforward neural networks for classification,
regression, clustering, sparse approximation, compression and feature learning with a
single layer or multiple layers of hidden nodes, where the parameters of the hidden nodes
(not just the weights connecting inputs to hidden nodes) need not be tuned. These hidden
nodes can be randomly assigned and never updated (i.e. they are random projections but
with nonlinear transforms), or can be inherited from their ancestors without being
changed. In most cases, the output weights of hidden nodes are learned in a single step,
which essentially amounts to learning a linear model.

Even though both methods use the Moore-Penrose Pseudo Inverse, there are a few
significant differences between the ELM and the backpropagation method proposed in this
paper. The ELM is a feedforward network which aims at replacing the traditional
artificial neural network, whereas this paper provides an alternative to the
backpropagation algorithm used in traditional artificial neural networks. The ELM
algorithm provides only a forward propagation technique to change the weights and biases
of the neurons in the last hidden layer, whereas we have provided a method of
backpropagation to change the weights and biases of all neurons in every layer.

5 Results
5.1 Telling-Two-Spirals-Apart Problem
Alexis P. Wieland proposed a useful benchmark task for neural networks: distin-
guishing between two intertwined spirals. Although this task is easy to visualize,
it is hard for a network to learn due to its extreme non-linearity. In this report
we exhibit a network architecture that facilitates the learning of the spiral task,
and then compare the learning speed of several variants of the backpropagation
algorithm.

In our experiment, we are using the spiral dataset which contains 193 data
points of each class. We have decided to model the network with a 16-32-64-32-
2 configuration, with ‘Softplus’ activation function on all neurons of the network.
We trained the model for 1000 epochs, with a learning rate of 0.0002.

Figure 5: Training data points for the Two-Spirals problem

Figure 6: Testing data points for the Two-Spirals problem

From the above two figures, we can see that although the network does not distinguish
between the two spirals very well, we are able to get an accuracy of about 63%. This is
because the Softplus activation function is not the recommended activation function for
this particular problem. The recommended activation function is 'Tanh' but, because the
domain of the inverse of the Tanh function is $(-1, 1)$ and not $(-\infty, +\infty)$, it
cannot be used in our backpropagation method without causing some loss of data.

Looking at Figure 6, we can observe the non-linearity in the classification of the two
sets of spirals, which indicates that this backpropagation method is working.

5.2 Separating-Concentric-Circles Problem


Another type of natural pattern is concentric rings. As a test, we use the
sklearn.datasets.make_circles function to create 2 concentric circles with 100 data
points each, which were respectively assigned to two classes. We used an artificial
neural network model with the configuration 16-64-32-2, again using the 'Softplus'
activation function on all neurons of the network. We trained the model for 1000 epochs
with a learning rate of 0.00001.

Observing Figure 8, we can see that there is a slight non-linearity in the classification
of the two classes. We observe an accuracy of 61%. This low accuracy can again be
attributed to the fact that the Softplus activation function is not suitable for such
types of data.

5.3 XOR Problem


Continuing our tests of this alternative algorithm, we create a dataset of 1000 data
points, with each data sample containing 2 numbers and 1 class number. If the 2 numbers
are both positive or both negative, the class is 0; otherwise, the class is 1. That is,
the XOR function is applied to the signs of the numbers.
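
A dataset of this kind could be generated as in the sketch below. The sampling range $[-1, 1)$, the uniform distribution, the seed, and the function name `make_sign_xor` are assumptions, since the paper does not state how the numbers were drawn.

```python
import numpy as np


def make_sign_xor(n_samples=1000, low=-1.0, high=1.0, seed=0):
    """Return points X of shape (n_samples, 2) and labels y = XOR of the signs."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n_samples, 2))
    # Class 1 when exactly one coordinate is positive, class 0 otherwise.
    y = ((X[:, 0] > 0) != (X[:, 1] > 0)).astype(int)
    return X, y


X, y = make_sign_xor()
print(X.shape, y.shape)
```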

Figure 7: Training data points for the Concentric-Circles problem

Figure 8: Testing data points for the Concentric-Circles problem

Our model had the configuration 4-8-16-32-1, where the 'Softplus' activation function is
applied by all neurons. The learning rate was set to 0.0001 and the network was trained
for 100 epochs.

A validation accuracy of 81% was achieved.

Figure 9: Training data points for the XOR problem

Figure 10: Testing data points for the XOR problem

5.4 Wisconsin Breast Cancer Dataset


To further test our neural network model, we used a real-world dataset. This dataset
contains 699 samples, where each sample has 10 attributes as the features, and 1 class
attribute. This dataset is taken from the UCI Machine Learning Repository, where samples
arrive periodically as Dr. Wolberg reports his clinical cases.

The model had a configuration of 16-2, and the 'Softplus' activation function is applied
by all neurons. We trained the model for 1000 epochs with a learning rate of 0.0001. We
observed that the validation accuracy reached up to 90.4% at the 78th epoch. Even though
the values of validation error and training error are erratic at the start, they seem to
reach an almost constant value after some number of epochs.

Figure 11: Validation Accuracy while training for Wisconsin Breast Cancer
Dataset

Figure 12: Training Error while training for Wisconsin Breast Cancer Dataset

From the above experiments, we can conclude that the Softplus activation
function is more suited to the Wisconsin Breast Cancer Dataset and that our
proposed backpropagation algorithm truly works.

Figure 13: Validation Error while training for Wisconsin Breast Cancer Dataset

6 Discussions and Conclusion


From the above results, we can observe a few properties of this method. The proposed
method of backpropagation can be used very well with activation functions where the
domain of the activation function matches the range of its inverse. This property eases
the requirement that the activation function must be differentiable. Therefore, ReLU-like
activation functions such as LeakyReLU, Softplus, S-shaped rectified linear activation
unit (SReLU), etc. will be a good match for this method.

Further optimizations must be made to this method so that it can be used efficiently. The
requirement of a different type of activation function could accelerate the discovery of
many more activation functions which could fit various different models.

We believe that because this backpropagation method suits ReLU-like [2] activation
functions, it can be enhanced for use in the fields of biomedical engineering, due to the
asymmetric behaviour of data collected in such fields, where the number of data points in
different classes is not balanced. Possibly in the future, if suitable replacements for
activation functions such as Sigmoid and Tanh are created, this method could be used more
frequently.

References
[1] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986).
"Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.
doi:10.1038/323533a0

[2] arXiv:1710.05941 - Prajit Ramachandran, Barret Zoph, Quoc V. Le - Searching for
Activation Functions

[3] Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information
Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological
Review, v65, No. 6, pp. 386–408. doi:10.1037/h0042519

[4] Snyman, Jan (3 March 2005). Practical Mathematical Optimization: An Introduction to
Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer
Science & Business Media. ISBN 978-0-387-24348-1

[5] arXiv:1606.04474 - Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman,
David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas - Learning to learn by
gradient descent by gradient descent

[6] arXiv:1602.05980v2 - Bing Xu, Ruitong Huang, Mu Li - Revise Saturated Activation
Functions

[7] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew - Extreme learning machine: a new
learning scheme of feedforward neural networks. - ISBN: 0-7803-8359-1

[8] Weisstein, Eric W. "Moore-Penrose Matrix Inverse." From MathWorld–A Wolfram Web
Resource. http://mathworld.wolfram.com/Moore-PenroseMatrixInverse.html

[9] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

The Matrix Calculus You Need For Deep Learning

Terence Parr and Jeremy Howard

July 3, 2018
arXiv:1802.01528v3 [cs.LG] 2 Jul 2018

(We teach in University of San Francisco’s MS in Data Science program and have other nefarious
projects underway. You might know Terence as the creator of the ANTLR parser generator. For
more material, see Jeremy’s fast.ai courses and University of San Francisco’s Data Institute in-
person version of the deep learning course.)

HTML version (The PDF and HTML were generated from markup using bookish)

Abstract

This paper is an attempt to explain all the matrix calculus you need in order to understand
the training of deep neural networks. We assume no math knowledge beyond what you learned
in calculus 1, and provide links to help you refresh the necessary math where needed. Note that
you do not need to understand this material before you start learning to train and use deep
learning in practice; rather, this material is for those who are already familiar with the basics of
neural networks, and wish to deepen their understanding of the underlying math. Don’t worry
if you get stuck at some point along the way—just go back and reread the previous section, and
try writing down and working through some examples. And if you’re still stuck, we’re happy
to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference
section at the end of the paper summarizing all the key matrix calculus rules and terminology
discussed here.

Contents
1 Introduction
2 Review: Scalar derivative rules
3 Introduction to vector calculus and partial derivatives
4 Matrix calculus
4.1 Generalization of the Jacobian
4.2 Derivatives of vector element-wise binary operators
4.3 Derivatives involving scalar expansion
4.4 Vector sum reduction
4.5 The Chain Rules
4.5.1 Single-variable chain rule
4.5.2 Single-variable total-derivative chain rule
4.5.3 Vector chain rule
5 The gradient of neuron activation
6 The gradient of the neural network loss function
6.1 The gradient with respect to the weights
6.2 The derivative with respect to the bias
7 Summary
8 Matrix Calculus Reference
8.1 Gradients and Jacobians
8.2 Element-wise operations on vectors
8.3 Scalar expansion
8.4 Vector reductions
8.5 Chain rules
9 Notation
10 Resources

1 Introduction

Most of us last saw calculus in school, but derivatives are a critical part of machine learning,
particularly deep neural networks, which are trained by optimizing a loss function. Pick up a
machine learning paper or the documentation of a library such as PyTorch and calculus comes
screeching back into your life like distant relatives around the holidays. And it’s not just any old
scalar calculus that pops up—you need differential matrix calculus, the shotgun wedding of linear
algebra and multivariate calculus.

Well... maybe need isn’t the right word; Jeremy’s courses show how to become a world-class deep
learning practitioner with only a minimal level of scalar calculus, thanks to leveraging the automatic
differentiation built into modern deep learning libraries. But if you really want to understand
what’s going on under the hood of these libraries, and grok academic papers discussing the latest
advances in model training techniques, you’ll need to understand certain bits of the field of matrix
calculus.

For example, the activation of a single computation unit in a neural network is typically calculated
using the dot product (from linear algebra) of an edge weight vector $\mathbf{w}$ with an input vector
$\mathbf{x}$ plus a scalar bias (threshold): $z(\mathbf{x}) = \sum_{i}^{n} w_i x_i + b = \mathbf{w} \cdot \mathbf{x} + b$.
Function $z(\mathbf{x})$ is called the unit’s affine function and is followed by a rectified linear unit,
which clips negative values to zero: $\max(0, z(\mathbf{x}))$. Such a computational unit is sometimes
referred to as an “artificial neuron” and looks like:

[Figure: diagram of an artificial neuron]

Neural networks consist of many of these units, organized into multiple collections of neurons called
layers. The activation of one layer’s units becomes the input to the next layer’s units. The activation
of the unit or units in the final layer is called the network output.
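
The single-unit computation described above (an affine function followed by a rectified linear unit) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the weight, input, and bias values are hypothetical and not taken from the paper.

```python
def affine(w, x, b):
    """Affine function z(x) = w . x + b for a weight list w, input list x, and scalar bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def neuron_activation(w, x, b):
    """Rectified linear unit applied to the affine function: max(0, z(x))."""
    return max(0.0, affine(w, x, b))


# Hypothetical values chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
b = 0.5

z = affine(w, x, b)                     # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
a = neuron_activation(w, x, b)          # max(0, 5.0) = 5.0
a_neg = neuron_activation(w, x, -10.0)  # z = -5.5, so the unit clips it to 0.0
```

With a strongly negative bias the affine result is below zero and the unit outputs zero, which is the clipping behavior the text describes.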

Training this neuron means choosing weights w and bias b so that we get the desired output for all N
inputs x. To do that, we minimize a loss function that compares the network’s final activation(x)
with the target(x) (desired output of x) for all input x vectors. To minimize the loss, we use
some variation on gradient descent, such as plain stochastic gradient descent (SGD), SGD with
momentum, or Adam. All of those require the partial derivative (the gradient) of activation(x)
with respect to the model parameters w and b. Our goal is to gradually tweak w and b so that the
overall loss function keeps getting smaller across all x inputs.

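If we’re careful, we can derive the gradient by differentiating the scalar version of a common loss
function (mean squared error):

$$\frac{1}{N} \sum_{\mathbf{x}} (target(\mathbf{x}) - activation(\mathbf{x}))^2 = \frac{1}{N} \sum_{\mathbf{x}} \Big(target(\mathbf{x}) - \max\big(0, \sum_{i}^{|\mathbf{x}|} w_i x_i + b\big)\Big)^2$$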

But this is just one neuron, and neural networks must train the weights and biases of all neurons
in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network
outputs, we really need general rules for the derivative of a function with respect to a vector and
even rules for the derivative of a vector-valued function with respect to a vector.
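
To make the single-neuron mean squared error concrete, here is a small sketch in plain Python. The data set, the targets, and the function names are hypothetical; the loss simply averages the squared difference between each target and the neuron's rectified affine output.

```python
def activation(w, x, b):
    """Single-neuron activation: max(0, w . x + b)."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)


def mse_loss(w, b, inputs, targets):
    """Mean squared error of the neuron over all N input vectors."""
    n = len(inputs)
    return sum((t - activation(w, x, b)) ** 2 for x, t in zip(inputs, targets)) / n


# Hypothetical training set: three 2-element inputs whose targets equal x1 + x2.
inputs = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
targets = [3.0, 1.0, 2.0]

loss_good = mse_loss([1.0, 1.0], 0.0, inputs, targets)  # perfect fit, loss is 0
loss_bad = mse_loss([0.0, 0.0], 0.0, inputs, targets)   # (9 + 1 + 4) / 3
```

Training amounts to moving `w` and `b` from something like the second setting toward the first, which is why we need the derivative of this loss with respect to those parameters.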

This article walks through the derivation of some important rules for computing partial derivatives
with respect to vectors, particularly those useful for training neural networks. This field is known as
matrix calculus, and the good news is, we only need a small subset of that field, which we introduce
here. While there is a lot of online material on multivariate calculus and linear algebra, they are
typically taught as two separate undergraduate courses so most material treats them in isolation.
The pages that do discuss matrix calculus often are really just lists of rules with minimal explanation
or are just pieces of the story. They also tend to be quite obscure to all but a narrow audience
of mathematicians, thanks to their use of dense notation and minimal discussion of foundational
concepts. (See the annotated list of resources at the end.)

In contrast, we’re going to rederive and rediscover some key matrix calculus rules in an effort to
explain them. It turns out that matrix calculus is really not that hard! There aren’t dozens of new
rules to learn; just a couple of key concepts. Our hope is that this short paper will get you started
quickly in the world of matrix calculus as it relates to training neural networks. We’re assuming
you’re already familiar with the basics of neural network architecture and training. If you’re not,
head over to Jeremy’s course and complete part 1 of that, then we’ll see you back here when you’re
done. (Note that, unlike many more academic approaches, we strongly suggest first learning to
train and use neural networks in practice and then study the underlying math. The math will be
much more understandable with the context in place; besides, it’s not necessary to grok all this
calculus to become an effective practitioner.)

A note on notation: Jeremy’s course exclusively uses code, instead of math notation, to explain
concepts since unfamiliar functions in code are easy to search for and experiment with. In this
paper, we do the opposite: there is a lot of math notation because one of the goals of this paper is
to help you understand the notation that you’ll see in deep learning papers and books. At the end
of the paper, you’ll find a brief table of the notation used, including a word or phrase you can use
to search for more details.

2 Review: Scalar derivative rules

Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy
on this, have a look at the Khan Academy video on scalar derivative rules.

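| Rule | $f(x)$ | Scalar derivative notation with respect to $x$ | Example |
|---|---|---|---|
| Constant | $c$ | $0$ | $\frac{d}{dx} 99 = 0$ |
| Multiplication by constant | $cf$ | $c \frac{df}{dx}$ | $\frac{d}{dx} 3x = 3$ |
| Power Rule | $x^n$ | $n x^{n-1}$ | $\frac{d}{dx} x^3 = 3x^2$ |
| Sum Rule | $f + g$ | $\frac{df}{dx} + \frac{dg}{dx}$ | $\frac{d}{dx} (x^2 + 3x) = 2x + 3$ |
| Difference Rule | $f - g$ | $\frac{df}{dx} - \frac{dg}{dx}$ | $\frac{d}{dx} (x^2 - 3x) = 2x - 3$ |
| Product Rule | $fg$ | $f \frac{dg}{dx} + \frac{df}{dx} g$ | $\frac{d}{dx} x^2 x = x^2 + x 2x = 3x^2$ |
| Chain Rule | $f(g(x))$ | $\frac{df(u)}{du} \frac{du}{dx}$, let $u = g(x)$ | $\frac{d}{dx} \ln(x^2) = \frac{1}{x^2} 2x = \frac{2}{x}$ |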

There are other rules for trigonometry, exponentials, etc., which you can find in the Khan Academy
differential calculus course.

When a function has a single parameter, $f(x)$, you’ll often see $f'$ and $f'(x)$ used as shorthands for
$\frac{d}{dx} f(x)$. We recommend against this notation as it does not make clear the variable we’re taking
the derivative with respect to.

You can think of $\frac{d}{dx}$ as an operator that maps a function of one parameter to another function.
That means that $\frac{d}{dx} f(x)$ maps $f(x)$ to its derivative with respect to $x$, which is the same thing
as $\frac{df(x)}{dx}$. Also, if $y = f(x)$, then $\frac{dy}{dx} = \frac{df(x)}{dx} = \frac{d}{dx} f(x)$. Thinking of the derivative as an operator
helps to simplify complicated derivatives because the operator is distributive and lets us pull out
constants. For example, in the following equation, we can pull out the constant 9 and distribute
the derivative operator across the elements within the parentheses.

$$\frac{d}{dx} 9(x + x^2) = 9 \frac{d}{dx} (x + x^2) = 9\left(\frac{d}{dx} x + \frac{d}{dx} x^2\right) = 9(1 + 2x) = 9 + 18x$$

That procedure reduced the derivative of $9(x + x^2)$ to a bit of arithmetic and the derivatives of $x$
and $x^2$, which are much easier to solve than the original derivative.
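
One way to sanity-check a hand-derived scalar derivative such as $9 + 18x$ is to compare it with a finite-difference approximation. The sketch below uses a central difference; the helper name `derivative`, the step size `h`, and the evaluation point are illustrative assumptions, not part of the paper.

```python
def derivative(f, x, h=1e-6):
    """Central-difference approximation of df/dx at the point x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def f(x):
    """The example function 9(x + x^2)."""
    return 9 * (x + x ** 2)


def analytic(x):
    """The derivative obtained by pulling out the constant and distributing d/dx."""
    return 9 + 18 * x


x0 = 2.0                      # arbitrary evaluation point
numeric = derivative(f, x0)   # approximately 45
exact = analytic(x0)          # 9 + 18 * 2 = 45
```

The two values agree up to floating-point rounding, which is a cheap check to run whenever a derivation feels uncertain.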

3 Introduction to vector calculus and partial derivatives

Neural network layers are not single functions of a single parameter, $f(x)$. So, let’s move on to
functions of multiple parameters such as $f(x, y)$. For example, what is the derivative of $xy$ (i.e.,
the multiplication of $x$ and $y$)? In other words, how does the product $xy$ change when we wiggle
the variables? Well, it depends on whether we are changing $x$ or $y$. We compute derivatives with
respect to one variable (parameter) at a time, giving us two different partial derivatives for this
two-parameter function (one for $x$ and one for $y$). Instead of using operator $\frac{d}{dx}$, the partial derivative
operator is $\frac{\partial}{\partial x}$ (a stylized d and not the Greek letter δ). So, $\frac{\partial}{\partial x} xy$ and $\frac{\partial}{\partial y} xy$ are the partial derivatives
of $xy$; often, these are just called the partials. For functions of a single parameter, operator $\frac{\partial}{\partial x}$ is
equivalent to $\frac{d}{dx}$ (for sufficiently smooth functions). However, it’s better to use $\frac{d}{dx}$ to make it clear
you’re referring to a scalar derivative.

The partial derivative with respect to $x$ is just the usual scalar derivative, simply treating any other
variable in the equation as a constant. Consider function $f(x, y) = 3x^2 y$. The partial derivative
with respect to $x$ is written $\frac{\partial}{\partial x} 3x^2 y$. There are three constants from the perspective of $\frac{\partial}{\partial x}$: 3, 2,
and $y$. Therefore, $\frac{\partial}{\partial x} 3yx^2 = 3y \frac{\partial}{\partial x} x^2 = 3y2x = 6yx$. The partial derivative with respect to $y$ treats
$x$ like a constant: $\frac{\partial}{\partial y} 3x^2 y = 3x^2 \frac{\partial}{\partial y} y = 3x^2 \frac{\partial y}{\partial y} = 3x^2 \times 1 = 3x^2$. It’s a good idea to derive these
yourself before continuing; otherwise, the rest of the article won’t make sense. Here’s the Khan
Academy video on partials if you need help.

To make it clear we are doing vector calculus and not just multivariate calculus, let’s consider what
we do with the partial derivatives $\frac{\partial f(x,y)}{\partial x}$ and $\frac{\partial f(x,y)}{\partial y}$ (another way to say $\frac{\partial}{\partial x} f(x, y)$ and $\frac{\partial}{\partial y} f(x, y)$)
that we computed for $f(x, y) = 3x^2 y$. Instead of having them just floating around and not organized
in any way, let’s organize them into a horizontal vector. We call this vector the gradient of $f(x, y)$
and write it as:

$$\nabla f(x, y) = \left[\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right] = [6yx, 3x^2]$$

So the gradient of $f(x, y)$ is simply a vector of its partials. Gradients are part of the vector calculus
world, which deals with functions that map $n$ scalar parameters to a single scalar. Now, let’s get
crazy and consider derivatives of multiple functions simultaneously.
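
The gradient $[6yx, 3x^2]$ can be checked numerically by wiggling one variable at a time while holding the other fixed, which is exactly what a partial derivative means. The following is a minimal sketch in plain Python; the helper names `partial` and `gradient` and the evaluation point are assumptions made for illustration.

```python
def partial(f, args, i, h=1e-6):
    """Central-difference partial derivative of f with respect to its i-th argument."""
    up = list(args)
    down = list(args)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)


def gradient(f, args):
    """Horizontal vector (here a list) of all partials of a scalar-valued function."""
    return [partial(f, args, i) for i in range(len(args))]


def f(x, y):
    return 3 * x ** 2 * y


x, y = 2.0, 5.0                      # arbitrary evaluation point
numeric = gradient(f, [x, y])
analytic = [6 * y * x, 3 * x ** 2]   # [60.0, 12.0]
```

At the chosen point the numeric and analytic gradients match to within rounding error.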

4 Matrix calculus

When we move from derivatives of one function to derivatives of many functions, we move from
the world of vector calculus to matrix calculus. Let’s compute partial derivatives for two functions,
both of which take two parameters. We can keep the same $f(x, y) = 3x^2 y$ from the last section,
but let’s also bring in $g(x, y) = 2x + y^8$. The gradient for $g$ has two entries, a partial derivative for
each parameter:

$$\frac{\partial g(x, y)}{\partial x} = \frac{\partial 2x}{\partial x} + \frac{\partial y^8}{\partial x} = 2 \frac{\partial x}{\partial x} + 0 = 2 \times 1 = 2$$

and

$$\frac{\partial g(x, y)}{\partial y} = \frac{\partial 2x}{\partial y} + \frac{\partial y^8}{\partial y} = 0 + 8y^7 = 8y^7$$

giving us gradient $\nabla g(x, y) = [2, 8y^7]$.

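Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two
functions, we can also organize their gradients into a matrix by stacking the gradients. When we
do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:

$$J = \begin{bmatrix} \nabla f(x, y) \\ \nabla g(x, y) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6yx & 3x^2 \\ 2 & 8y^7 \end{bmatrix}$$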

Welcome to matrix calculus!
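
Stacking gradients as rows is easy to mirror in code. The sketch below builds a numerical Jacobian for the two example functions and compares it with the hand-derived matrix; the helper name `jacobian`, the step size, and the evaluation point are illustrative assumptions.

```python
def jacobian(funcs, args, h=1e-6):
    """Numerical Jacobian: one row per function, one column per parameter."""
    rows = []
    for func in funcs:
        row = []
        for j in range(len(args)):
            up = list(args)
            down = list(args)
            up[j] += h
            down[j] -= h
            row.append((func(*up) - func(*down)) / (2 * h))
        rows.append(row)
    return rows


def f(x, y):
    return 3 * x ** 2 * y


def g(x, y):
    return 2 * x + y ** 8


x, y = 2.0, 1.5   # arbitrary evaluation point
J = jacobian([f, g], [x, y])
J_exact = [[6 * y * x, 3 * x ** 2],
           [2.0, 8 * y ** 7]]

# Largest absolute difference between the numerical and hand-derived entries.
max_error = max(abs(J[i][j] - J_exact[i][j]) for i in range(2) for j in range(2))
```

Each row of `J` is the gradient of one function, matching the row-per-gradient layout used in the text.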

Note that there are multiple ways to represent the Jacobian. We are using the so-called
numerator layout but many papers and software will use the denominator layout. This is just
the transpose of the numerator layout Jacobian (flip it around its diagonal):

$$\begin{bmatrix} 6yx & 2 \\ 3x^2 & 8y^7 \end{bmatrix}$$

4.1 Generalization of the Jacobian

So far, we’ve looked at a specific example of a Jacobian matrix. To define the Jacobian matrix
more generally, let’s combine multiple parameters into a single vector argument: $f(x, y, z) \Rightarrow f(\mathbf{x})$.
(You will sometimes see notation $\vec{x}$ for vectors in the literature as well.) Lowercase letters in bold
font such as $\mathbf{x}$ are vectors and those in italics font like $x$ are scalars. $x_i$ is the $i$th element of vector
$\mathbf{x}$ and is in italics because a single vector element is a scalar. We also have to define an orientation
for vector $\mathbf{x}$. We’ll assume that all vectors are vertical by default of size $n \times 1$:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
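With multiple scalar-valued functions, we can combine them all into a vector just like we did with
the parameters. Let $\mathbf{y} = \mathbf{f}(\mathbf{x})$ be a vector of $m$ scalar-valued functions that each take a vector $\mathbf{x}$
of length $n = |\mathbf{x}|$ where $|\mathbf{x}|$ is the cardinality (count) of elements in $\mathbf{x}$. Each $f_i$ function within $\mathbf{f}$
returns a scalar just as in the previous section:

$$\begin{array}{l} y_1 = f_1(\mathbf{x}) \\ y_2 = f_2(\mathbf{x}) \\ \vdots \\ y_m = f_m(\mathbf{x}) \end{array}$$

For instance, we’d represent $f(x, y) = 3x^2 y$ and $g(x, y) = 2x + y^8$ from the last section as

$$\begin{array}{l} y_1 = f_1(\mathbf{x}) = 3x_1^2 x_2 \quad \text{(substituting } x_1 \text{ for } x \text{, } x_2 \text{ for } y \text{)} \\ y_2 = f_2(\mathbf{x}) = 2x_1 + x_2^8 \end{array}$$

It’s very often the case that $m = n$ because we will have a scalar function result for each element
of the $\mathbf{x}$ vector. For example, consider the identity function $\mathbf{y} = \mathbf{f}(\mathbf{x}) = \mathbf{x}$:

$$\begin{array}{l} y_1 = f_1(\mathbf{x}) = x_1 \\ y_2 = f_2(\mathbf{x}) = x_2 \\ \vdots \\ y_n = f_n(\mathbf{x}) = x_n \end{array}$$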
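So we have $m = n$ functions and parameters, in this case. Generally speaking, though, the Jacobian
matrix is the collection of all $m \times n$ possible partial derivatives ($m$ rows and $n$ columns), which is
the stack of $m$ gradients with respect to $\mathbf{x}$:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \ldots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \frac{\partial}{\partial \mathbf{x}} f_2(\mathbf{x}) \\ \ldots \\ \frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1} f_1(\mathbf{x}) & \frac{\partial}{\partial x_2} f_1(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_1(\mathbf{x}) \\ \frac{\partial}{\partial x_1} f_2(\mathbf{x}) & \frac{\partial}{\partial x_2} f_2(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_2(\mathbf{x}) \\ & & \ldots & \\ \frac{\partial}{\partial x_1} f_m(\mathbf{x}) & \frac{\partial}{\partial x_2} f_m(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_m(\mathbf{x}) \end{bmatrix}$$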


Each $\frac{\partial}{\partial \mathbf{x}} f_i(\mathbf{x})$ is a horizontal $n$-vector because the partial derivative is with respect to a vector, $\mathbf{x}$,
whose length is $n = |\mathbf{x}|$. The width of the Jacobian is $n$ if we’re taking the partial derivative with
respect to $\mathbf{x}$ because there are $n$ parameters we can wiggle, each potentially changing the function’s
value. Therefore, the Jacobian is always $m$ rows for $m$ equations. It helps to think about the
possible Jacobian shapes visually:

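| | scalar $x$ | vector $\mathbf{x}$ |
|---|---|---|
| scalar $f$ | $\frac{\partial f}{\partial x}$ (scalar) | $\frac{\partial f}{\partial \mathbf{x}}$ (horizontal vector, $1 \times n$) |
| vector $\mathbf{f}$ | $\frac{\partial \mathbf{f}}{\partial x}$ (vertical vector, $m \times 1$) | $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ (matrix, $m \times n$) |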

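The Jacobian of the identity function $\mathbf{f}(\mathbf{x}) = \mathbf{x}$, with $f_i(\mathbf{x}) = x_i$, has $n$ functions and each function
has $n$ parameters held in a single vector $\mathbf{x}$. The Jacobian is, therefore, a square matrix since $m = n$:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \frac{\partial}{\partial \mathbf{x}} f_2(\mathbf{x}) \\ \ldots \\ \frac{\partial}{\partial \mathbf{x}} f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1} f_1(\mathbf{x}) & \frac{\partial}{\partial x_2} f_1(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_1(\mathbf{x}) \\ \frac{\partial}{\partial x_1} f_2(\mathbf{x}) & \frac{\partial}{\partial x_2} f_2(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_2(\mathbf{x}) \\ & & \ldots & \\ \frac{\partial}{\partial x_1} f_m(\mathbf{x}) & \frac{\partial}{\partial x_2} f_m(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n} f_m(\mathbf{x}) \end{bmatrix}$$

$$= \begin{bmatrix} \frac{\partial}{\partial x_1} x_1 & \frac{\partial}{\partial x_2} x_1 & \ldots & \frac{\partial}{\partial x_n} x_1 \\ \frac{\partial}{\partial x_1} x_2 & \frac{\partial}{\partial x_2} x_2 & \ldots & \frac{\partial}{\partial x_n} x_2 \\ & & \ldots & \\ \frac{\partial}{\partial x_1} x_n & \frac{\partial}{\partial x_2} x_n & \ldots & \frac{\partial}{\partial x_n} x_n \end{bmatrix}$$

(and since $\frac{\partial}{\partial x_j} x_i = 0$ for $j \neq i$)

$$= \begin{bmatrix} \frac{\partial}{\partial x_1} x_1 & 0 & \ldots & 0 \\ 0 & \frac{\partial}{\partial x_2} x_2 & \ldots & 0 \\ & & \ddots & \\ 0 & 0 & \ldots & \frac{\partial}{\partial x_n} x_n \end{bmatrix}$$

$$= \begin{bmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ & & \ddots & \\ 0 & 0 & \ldots & 1 \end{bmatrix}$$

$$= I \quad \text{(} I \text{ is the identity matrix with ones down the diagonal)}$$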

Make sure that you can derive each step above before moving on. If you get stuck, just consider each
element of the matrix in isolation and apply the usual scalar derivative rules. That is a generally
useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the
partials, combining the results appropriately into vectors and matrices at the end.
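
The "reduce to scalars, then recombine" trick is also how a numerical Jacobian is computed: perturb one input element at a time and record how every output element changes. The sketch below applies this to the identity function and recovers the identity matrix. The helper name `jacobian`, the step size, and the input vector are hypothetical choices for illustration.

```python
def jacobian(f, x, h=1e-6):
    """Numerical Jacobian of a vector-valued f (list -> list) at x.

    Entry J[i][j] approximates the partial of output i with respect to input j,
    so the result has m rows (outputs) and n columns (inputs).
    """
    m = len(f(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        f_up, f_down = f(up), f(down)
        for i in range(m):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J


def identity(x):
    """Identity function y = f(x) = x, so f_i(x) = x_i."""
    return list(x)


x = [1.0, -2.0, 3.0]   # arbitrary input vector
J = jacobian(identity, x)

# Compare with the identity matrix: ones on the diagonal, zeros elsewhere.
max_error = max(abs(J[i][j] - (1.0 if i == j else 0.0))
                for i in range(3) for j in range(3))
```

Off-diagonal entries come out as exactly zero because perturbing $x_j$ leaves $x_i$ untouched when $j \neq i$.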

Also be careful to track whether a matrix is vertical, $\mathbf{x}$, or horizontal, $\mathbf{x}^T$ where $\mathbf{x}^T$ means $\mathbf{x}$
transpose. Also make sure you pay attention to whether something is a scalar-valued function,
$y = \ldots$, or a vector of functions (or a vector-valued function), $\mathbf{y} = \ldots$.

4.2 Derivatives of vector element-wise binary operators

Element-wise binary operations on vectors, such as vector addition w + x, are important because
we can express many common vector operations, such as the multiplication of a vector by a scalar,
as element-wise binary operations. By “element-wise binary operations” we simply mean applying
an operator to the first item of each vector to get the first item of the output, then to the second
items of the inputs for the second item of the output, and so forth. This is how all the basic math
operators are applied by default in numpy or tensorflow, for example. Examples that often crop
up in deep learning are max(w, x) and w > x (returns a vector of ones and zeros).

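We can generalize the element-wise binary operations with notation $\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})$ where $m =
n = |\mathbf{y}| = |\mathbf{w}| = |\mathbf{x}|$. (Reminder: $|\mathbf{x}|$ is the number of items in $\mathbf{x}$.) The $\bigcirc$ symbol represents
any element-wise operator (such as $+$) and not the $\circ$ function composition operator. Here’s what
equation $\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})$ looks like when we zoom in to examine the scalar equations:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x}) \\ f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x}) \end{bmatrix}$$

where we write $n$ (not $m$) equations vertically to emphasize the fact that element-wise
operators give $m = n$ sized vector results.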

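Using the ideas from the last section, we can see that the general case for the Jacobian with respect
to $\mathbf{w}$ is the square matrix:

$$J_{\mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial w_1}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) \\ \frac{\partial}{\partial w_1}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) \\ & & \ldots & \\ \frac{\partial}{\partial w_1}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) \end{bmatrix}$$

and the Jacobian with respect to $\mathbf{x}$ is:

$$J_{\mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial x_1}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \frac{\partial}{\partial x_2}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \ldots & \frac{\partial}{\partial x_n}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) \\ \frac{\partial}{\partial x_1}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \frac{\partial}{\partial x_2}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \ldots & \frac{\partial}{\partial x_n}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) \\ & & \ldots & \\ \frac{\partial}{\partial x_1}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \frac{\partial}{\partial x_2}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \ldots & \frac{\partial}{\partial x_n}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) \end{bmatrix}$$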

That’s quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that
is zero everywhere but the diagonal. Because this greatly simplifies the Jacobian, let’s examine in
detail when the Jacobian reduces to a diagonal matrix for element-wise operations.


In a diagonal Jacobian, all elements off the diagonal are zero, $\frac{\partial}{\partial w_j}(f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})) = 0$ where $j \neq i$.
(Notice that we are taking the partial derivative with respect to $w_j$ not $w_i$.) Under what conditions
are those off-diagonal elements zero? Precisely when $f_i$ and $g_i$ are constants with respect to $w_j$,
$\frac{\partial}{\partial w_j} f_i(\mathbf{w}) = \frac{\partial}{\partial w_j} g_i(\mathbf{x}) = 0$. Regardless of the operator, if those partial derivatives go to zero, the
operation goes to zero, $0 \bigcirc 0 = 0$ no matter what, and the partial derivative of a constant is zero.

Those partials go to zero when $f_i$ and $g_i$ are not functions of $w_j$. We know that element-wise
operations imply that $f_i$ is purely a function of $w_i$ and $g_i$ is purely a function of $x_i$. For example,
$\mathbf{w} + \mathbf{x}$ sums $w_i + x_i$. Consequently, $f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})$ reduces to $f_i(w_i) \bigcirc g_i(x_i)$ and the goal becomes
$\frac{\partial}{\partial w_j} f_i(w_i) = \frac{\partial}{\partial w_j} g_i(x_i) = 0$. $f_i(w_i)$ and $g_i(x_i)$ look like constants to the partial differentiation
operator with respect to $w_j$ when $j \neq i$ so the partials are zero off the diagonal. (Notation $f_i(w_i)$ is
technically an abuse of our notation because $f_i$ and $g_i$ are functions of vectors not individual
elements. We should really write something like $\hat{f}_i(w_i) = f_i(\mathbf{w})$, but that would muddy the equations
further, and programmers are comfortable overloading functions, so we’ll proceed with the notation
anyway.)

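We’ll take advantage of this simplification later and refer to the constraint that $f_i(\mathbf{w})$ and $g_i(\mathbf{x})$
access at most $w_i$ and $x_i$, respectively, as the element-wise diagonal condition.

Under this condition, the elements along the diagonal of the Jacobian are $\frac{\partial}{\partial w_i}(f_i(w_i) \bigcirc g_i(x_i))$:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)) & & & 0 \\ & \frac{\partial}{\partial w_2}(f_2(w_2) \bigcirc g_2(x_2)) & & \\ & & \ldots & \\ 0 & & & \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n)) \end{bmatrix}$$

(The large “0”s are a shorthand indicating all of the off-diagonal elements are 0.)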

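More succinctly, we can write:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \text{diag}\left(\frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial w_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$

and

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \text{diag}\left(\frac{\partial}{\partial x_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial x_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial x_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$

where $\text{diag}(\mathbf{x})$ constructs a matrix whose diagonal elements are taken from vector $\mathbf{x}$.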

Because we do lots of simple vector arithmetic, the general function $\mathbf{f}(\mathbf{w})$ in the binary element-wise
operation is often just the vector $\mathbf{w}$. Any time the general function is a vector, we know
that $f_i(\mathbf{w})$ reduces to $f_i(w_i) = w_i$. For example, vector addition $\mathbf{w} + \mathbf{x}$ fits our element-wise
diagonal condition because $\mathbf{f}(\mathbf{w}) + \mathbf{g}(\mathbf{x})$ has scalar equations $y_i = f_i(\mathbf{w}) + g_i(\mathbf{x})$ that reduce to just
$y_i = f_i(w_i) + g_i(x_i) = w_i + x_i$ with partial derivatives:

$$\frac{\partial}{\partial w_i}(f_i(w_i) + g_i(x_i)) = \frac{\partial}{\partial w_i}(w_i + x_i) = 1 + 0 = 1$$

$$\frac{\partial}{\partial x_i}(f_i(w_i) + g_i(x_i)) = \frac{\partial}{\partial x_i}(w_i + x_i) = 0 + 1 = 1$$

That gives us $\frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I$, the identity matrix, because every element along the diagonal
is 1. $I$ represents the square identity matrix of appropriate dimensions that is zero everywhere but
the diagonal, which contains all ones.

Given the simplicity of this special case, fi (w) reducing to fi (wi ), you should be able to derive the
Jacobians for the common element-wise binary operations on vectors:

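| Op | Partial with respect to $\mathbf{w}$ |
|---|---|
| $+$ | $\frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = \text{diag}(\ldots \frac{\partial(w_i + x_i)}{\partial w_i} \ldots) = \text{diag}(\vec{1}) = I$ |
| $-$ | $\frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{w}} = \text{diag}(\ldots \frac{\partial(w_i - x_i)}{\partial w_i} \ldots) = \text{diag}(\vec{1}) = I$ |
| $\otimes$ | $\frac{\partial(\mathbf{w} \otimes \mathbf{x})}{\partial \mathbf{w}} = \text{diag}(\ldots \frac{\partial(w_i \times x_i)}{\partial w_i} \ldots) = \text{diag}(\mathbf{x})$ |
| $\oslash$ | $\frac{\partial(\mathbf{w} \oslash \mathbf{x})}{\partial \mathbf{w}} = \text{diag}(\ldots \frac{\partial(w_i / x_i)}{\partial w_i} \ldots) = \text{diag}(\ldots \frac{1}{x_i} \ldots)$ |

| Op | Partial with respect to $\mathbf{x}$ |
|---|---|
| $+$ | $\frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I$ |
| $-$ | $\frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{x}} = \text{diag}(\ldots \frac{\partial(w_i - x_i)}{\partial x_i} \ldots) = \text{diag}(-\vec{1}) = -I$ |
| $\otimes$ | $\frac{\partial(\mathbf{w} \otimes \mathbf{x})}{\partial \mathbf{x}} = \text{diag}(\mathbf{w})$ |
| $\oslash$ | $\frac{\partial(\mathbf{w} \oslash \mathbf{x})}{\partial \mathbf{x}} = \text{diag}(\ldots \frac{-w_i}{x_i^2} \ldots)$ |

The $\otimes$ and $\oslash$ operators are element-wise multiplication and division; $\otimes$ is sometimes called the
Hadamard product. There isn’t a standard notation for element-wise multiplication and division so
we’re using an approach consistent with our general binary operation notation.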
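
A few of the element-wise Jacobians in the tables above can be checked numerically. The sketch below is an illustration in plain Python: the helper names (`jacobian`, `diag`, `max_abs_diff`) and the vectors `w` and `x` are hypothetical, and the derivatives are approximated with central differences rather than derived symbolically.

```python
def jacobian(f, v, h=1e-6):
    """Numerical Jacobian of f (list -> list) at v; rows are outputs, columns are inputs."""
    m = len(f(v))
    J = [[0.0] * len(v) for _ in range(m)]
    for j in range(len(v)):
        up = list(v)
        down = list(v)
        up[j] += h
        down[j] -= h
        f_up, f_down = f(up), f(down)
        for i in range(m):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J


def diag(v):
    """Square matrix with the elements of v on the diagonal and zeros elsewhere."""
    n = len(v)
    return [[v[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def max_abs_diff(A, B):
    """Largest absolute element-wise difference between two equally sized matrices."""
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


w = [1.0, 2.0, 3.0]   # arbitrary example vectors
x = [4.0, 5.0, 8.0]

# Element-wise multiplication and division, differentiated with respect to w.
J_mul_w = jacobian(lambda w_: [wi * xi for wi, xi in zip(w_, x)], w)
J_div_w = jacobian(lambda w_: [wi / xi for wi, xi in zip(w_, x)], w)
# Element-wise subtraction and division, differentiated with respect to x.
J_sub_x = jacobian(lambda x_: [wi - xi for wi, xi in zip(w, x_)], x)
J_div_x = jacobian(lambda x_: [wi / xi for wi, xi in zip(w, x_)], x)

err_mul_w = max_abs_diff(J_mul_w, diag(x))                              # diag(x)
err_div_w = max_abs_diff(J_div_w, diag([1 / xi for xi in x]))           # diag(1/x_i)
err_sub_x = max_abs_diff(J_sub_x, diag([-1.0] * 3))                     # -I
err_div_x = max_abs_diff(J_div_x, diag([-wi / xi ** 2 for wi, xi in zip(w, x)]))
```

All four error values are tiny, and every off-diagonal entry is exactly zero, which is the element-wise diagonal condition in action.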

4.3 Derivatives involving scalar expansion

When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector
and then performing an element-wise binary operation. For example, adding scalar $z$ to vector $\mathbf{x}$,
$\mathbf{y} = \mathbf{x} + z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z)$ where $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1} z$. (The notation $\vec{1}$ represents
a vector of ones of appropriate length.) $z$ is any scalar that doesn’t depend on $\mathbf{x}$, which is useful
because then $\frac{\partial z}{\partial x_i} = 0$ for any $x_i$ and that will simplify our partial derivative computations. (It’s
okay to think of variable $z$ as a constant for our discussion here.) Similarly, multiplying by a scalar,
$\mathbf{y} = \mathbf{x} z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) \otimes \mathbf{g}(z) = \mathbf{x} \otimes \vec{1} z$ where $\otimes$ is the element-wise multiplication (Hadamard
product) of the two vectors.

The partial derivatives of vector-scalar addition and multiplication with respect to vector $\mathbf{x}$ use our
element-wise rule:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \text{diag}\left(\ldots \frac{\partial}{\partial x_i}(f_i(x_i) \bigcirc g_i(z)) \ldots\right)$$

This follows because functions $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1} z$ clearly satisfy our element-wise diagonal
condition for the Jacobian (that $f_i(\mathbf{x})$ refers at most to $x_i$ and $g_i(z)$ refers to the $i$th value of the $\vec{1} z$
vector).

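Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of
the Jacobian for vector-scalar addition:

$$\frac{\partial}{\partial x_i}(f_i(x_i) + g_i(z)) = \frac{\partial(x_i + z)}{\partial x_i} = \frac{\partial x_i}{\partial x_i} + \frac{\partial z}{\partial x_i} = 1 + 0 = 1$$

So, $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} + z) = \text{diag}(\vec{1}) = I$.

Computing the partial derivative with respect to the scalar parameter $z$, however, results in a
vertical vector, not a diagonal matrix. The elements of the vector are:

$$\frac{\partial}{\partial z}(f_i(x_i) + g_i(z)) = \frac{\partial(x_i + z)}{\partial z} = \frac{\partial x_i}{\partial z} + \frac{\partial z}{\partial z} = 0 + 1 = 1$$

Therefore, $\frac{\partial}{\partial z}(\mathbf{x} + z) = \vec{1}$.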

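The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for
scalar derivatives:

$$\frac{\partial}{\partial x_i}(f_i(x_i) \otimes g_i(z)) = x_i \frac{\partial z}{\partial x_i} + z \frac{\partial x_i}{\partial x_i} = 0 + z = z$$

So, $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} z) = \text{diag}(\vec{1} z) = I z$.

The partial derivative with respect to scalar parameter $z$ is a vertical vector whose elements are:

$$\frac{\partial}{\partial z}(f_i(x_i) \otimes g_i(z)) = x_i \frac{\partial z}{\partial z} + z \frac{\partial x_i}{\partial z} = x_i + 0 = x_i$$

This gives us $\frac{\partial}{\partial z}(\mathbf{x} z) = \mathbf{x}$.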
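
The four scalar-expansion results above can be checked with finite differences: the derivatives with respect to the vector are diagonal matrices, while the derivatives with respect to the scalar are vectors. This is only a sketch; the helper names `jacobian_wrt_vector` and `derivative_wrt_scalar` and the values of `x` and `z` are assumptions made for illustration.

```python
def jacobian_wrt_vector(f, x, h=1e-6):
    """Numerical Jacobian of f (list -> list) with respect to the vector x."""
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        f_up, f_down = f(up), f(down)
        for i in range(m):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J


def derivative_wrt_scalar(f, z, h=1e-6):
    """Numerical derivative of a vector-valued f (float -> list) with respect to scalar z."""
    f_up, f_down = f(z + h), f(z - h)
    return [(u - d) / (2 * h) for u, d in zip(f_up, f_down)]


x = [1.0, 2.0, 3.0]   # arbitrary example vector
z = 4.0               # arbitrary example scalar

J_add_x = jacobian_wrt_vector(lambda x_: [xi + z for xi in x_], x)   # expect I
J_mul_x = jacobian_wrt_vector(lambda x_: [xi * z for xi in x_], x)   # expect I z
d_add_z = derivative_wrt_scalar(lambda z_: [xi + z_ for xi in x], z) # expect vector of ones
d_mul_z = derivative_wrt_scalar(lambda z_: [xi * z_ for xi in x], z) # expect x
```

Note the shape difference: `J_add_x` and `J_mul_x` are 3 × 3 lists of lists, whereas `d_add_z` and `d_mul_z` are plain length-3 lists standing in for vertical vectors.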

4.4 Vector sum reduction

Summing up the elements of a vector is an important operation in deep learning, such as the
network loss function, but we can also use it as a way to simplify computing the derivative of
vector dot product and other operations that reduce vectors to scalars.

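Let $y = \text{sum}(\mathbf{f}(\mathbf{x})) = \sum_{i=1}^{n} f_i(\mathbf{x})$. Notice we were careful here to leave the parameter as a vector $\mathbf{x}$
because each function $f_i$ could use all values in the vector, not just $x_i$. The sum is over the results
of the function and not the parameter. The gradient ($1 \times n$ Jacobian) of vector summation is:

$$\begin{array}{lcl} \frac{\partial y}{\partial \mathbf{x}} & = & \left[ \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n} \right] \\ & = & \left[ \frac{\partial}{\partial x_1} \sum_i f_i(\mathbf{x}), \frac{\partial}{\partial x_2} \sum_i f_i(\mathbf{x}), \ldots, \frac{\partial}{\partial x_n} \sum_i f_i(\mathbf{x}) \right] \\ & = & \left[ \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_1}, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_2}, \ldots, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_n} \right] \quad \text{(move derivative inside } \sum \text{)} \end{array}$$

(The summation inside the gradient elements can be tricky so make sure to keep your notation
consistent.)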

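Let’s look at the gradient of the simple $y = \text{sum}(\mathbf{x})$. The function inside the summation is just
$f_i(\mathbf{x}) = x_i$ and the gradient is then:

$$\nabla y = \left[ \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_1}, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_2}, \ldots, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_n} \right] = \left[ \sum_i \frac{\partial x_i}{\partial x_1}, \sum_i \frac{\partial x_i}{\partial x_2}, \ldots, \sum_i \frac{\partial x_i}{\partial x_n} \right]$$

Because $\frac{\partial}{\partial x_j} x_i = 0$ for $j \neq i$, we can simplify to:

$$\nabla y = \left[ \frac{\partial x_1}{\partial x_1}, \frac{\partial x_2}{\partial x_2}, \ldots, \frac{\partial x_n}{\partial x_n} \right] = [1, 1, \ldots, 1] = \vec{1}^T$$

Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is
$\vec{1}^T$. (The $T$ exponent of $\vec{1}^T$ represents the transpose of the indicated vector. In this case, it flips a
vertical vector to a horizontal vector.) It’s very important to keep the shape of all of your vectors
and matrices in order; otherwise, it’s impossible to compute the derivatives of complex functions.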

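As another example, let’s sum the result of multiplying a vector by a constant scalar. If $y = \text{sum}(\mathbf{x} z)$
then $f_i(\mathbf{x}, z) = x_i z$. The gradient is:

$$\begin{array}{lcl} \frac{\partial y}{\partial \mathbf{x}} & = & \left[ \sum_i \frac{\partial}{\partial x_1} x_i z, \sum_i \frac{\partial}{\partial x_2} x_i z, \ldots, \sum_i \frac{\partial}{\partial x_n} x_i z \right] \\ & = & \left[ \frac{\partial}{\partial x_1} x_1 z, \frac{\partial}{\partial x_2} x_2 z, \ldots, \frac{\partial}{\partial x_n} x_n z \right] \\ & = & [z, z, \ldots, z] \end{array}$$

The derivative with respect to scalar variable $z$ is $1 \times 1$:

$$\begin{array}{lcl} \frac{\partial y}{\partial z} & = & \frac{\partial}{\partial z} \sum_{i=1}^{n} x_i z \\ & = & \sum_i \frac{\partial}{\partial z} x_i z \\ & = & \sum_i x_i \\ & = & \text{sum}(\mathbf{x}) \end{array}$$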
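
The sum-reduction results can be verified the same way as the earlier rules. The sketch below checks that the gradient of $\text{sum}(\mathbf{x})$ is a vector of ones, that the gradient of $\text{sum}(\mathbf{x} z)$ with respect to $\mathbf{x}$ is $[z, z, \ldots, z]$, and that its derivative with respect to $z$ is $\text{sum}(\mathbf{x})$. The helper name `gradient`, the step size, and the sample values are illustrative assumptions.

```python
def gradient(f, v, h=1e-6):
    """Numerical gradient (a list standing in for a horizontal vector) of scalar-valued f at v."""
    grad = []
    for j in range(len(v)):
        up = list(v)
        down = list(v)
        up[j] += h
        down[j] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


x = [1.0, 2.0, 3.0]   # arbitrary example vector
z = 4.0               # arbitrary example scalar
h = 1e-6

grad_sum = gradient(lambda x_: sum(x_), x)                        # expect [1, 1, 1]
grad_sum_xz = gradient(lambda x_: sum(xi * z for xi in x_), x)    # expect [z, z, z]

# Derivative of sum(x z) with respect to the scalar z is a single number.
y_up = sum(xi * (z + h) for xi in x)
y_down = sum(xi * (z - h) for xi in x)
dy_dz = (y_up - y_down) / (2 * h)                                 # expect sum(x) = 6
```

The gradient with respect to the vector has one entry per element of `x`, while the derivative with respect to `z` collapses to a single scalar.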

4.5 The Chain Rules

We can’t compute partial derivatives of very complicated functions using just the basic matrix
calculus rules we’ve seen so far. For example, we can’t take the derivative of nested expressions like
sum(w + x) directly without reducing it to its scalar equivalent. We need to be able to combine
our basic vector rules using what we can call the vector chain rule. Unfortunately, there are a
number of rules for differentiation that fall under the name “chain rule” so we have to be careful
which chain rule we’re talking about. Part of our goal here is to clearly define and name three
different chain rules and indicate in which situation they are appropriate. To get warmed up, we’ll
start with what we’ll call the single-variable chain rule, where we want the derivative of a scalar
function with respect to a scalar. Then we’ll move on to an important concept called the total
derivative and use it to define what we’ll pedantically call the single-variable total-derivative chain
rule. Then, we’ll be ready for the vector chain rule in its full glory as needed for neural networks.

The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks compli-
cated expressions into subexpressions whose derivatives are easier to compute. Its power derives
from the fact that we can process each simple subexpression in isolation yet still combine the
intermediate results to get the correct overall result.

The chain rule comes into play when we need the derivative of an expression composed of nested
subexpressions. For example, we need the chain rule when confronted with expressions like $\frac{d}{dx} \sin(x^2)$.
The outermost expression takes the sin of an intermediate result, a nested subexpression that
squares $x$. Specifically, we need the single-variable chain rule, so let’s start by digging into that in
more detail.

4.5.1 Single-variable chain rule

Let’s start with the solution to the derivative of our nested expression: $\frac{d}{dx} \sin(x^2) = 2x\cos(x^2)$. It
doesn’t take a mathematical genius to recognize components of the solution that smack of scalar
differentiation rules, $\frac{d}{dx} x^2 = 2x$ and $\frac{d}{du} \sin(u) = \cos(u)$. It looks like the solution is to multiply
the derivative of the outer expression by the derivative of the inner expression or “chain the pieces
together,” which is exactly right. In this section, we’ll explore the general principle at work and
provide a process that works for highly-nested expressions of a single variable.

Chain rules are typically defined in terms of nested functions, such as $y = f(g(x))$ for single-variable
chain rules. (You will also see the chain rule defined using function composition $(f \circ g)(x)$, which
is the same thing.) Some sources write the derivative using shorthand notation $y' = f'(g(x))g'(x)$,
but that hides the fact that we are introducing an intermediate variable: $u = g(x)$, which we’ll see
shortly. It’s better to define the single-variable chain rule of $f(g(x))$ explicitly so we never take the
derivative with respect to the wrong variable. Here is the formulation of the single-variable chain
rule we recommend:

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$$
To deploy the single-variable chain rule, follow these steps:

1. Introduce intermediate variables for nested subexpressions and subexpressions for both binary
and unary operators; e.g., × is binary, sin(x) and other trigonometric functions are usually
unary because there is a single operand. This step normalizes all equations to single operators
or function applications.
2. Compute derivatives of the intermediate variables with respect to their parameters.
3. Combine all derivatives of intermediate variables by multiplying them together to get the
overall result.
4. Substitute intermediate variables back in if any are referenced in the derivative equation.

The third step puts the “chain” in “chain rule” because it chains together intermediate results.
Multiplying the intermediate derivatives together is the common theme among all variations of the
chain rule.

Let’s try this process on $y = f(g(x)) = \sin(x^2)$:

1. Introduce intermediate variables. Let $u = x^2$ represent subexpression $x^2$ (shorthand for
$u(x) = x^2$). This gives us:

$$u = x^2 \quad (\text{relative to definition } f(g(x)),\ g(x) = x^2)$$
$$y = \sin(u) \quad (y = f(u) = \sin(u))$$

The order of these subexpressions does not affect the answer, but we recommend working in
the reverse order of operations dictated by the nesting (innermost to outermost). That way,
expressions and derivatives are always functions of previously-computed elements.
2. Compute derivatives.

$$\frac{du}{dx} = 2x \quad (\text{take derivative with respect to } x)$$
$$\frac{dy}{du} = \cos(u) \quad (\text{take derivative with respect to } u \text{, not } x)$$

3. Combine.

$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = \cos(u)\,2x$$

4. Substitute.

$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = \cos(x^2)\,2x = 2x\cos(x^2)$$

Notice how easy it is to compute the derivatives of the intermediate variables in isolation! The
chain rule says it’s legal to do that and tells us how to combine the intermediate results to get
$2x\cos(x^2)$.
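
The four steps can be sketched in a few lines of Python. This is only an illustrative check, not part of the original article: the function names `dy_dx_chain` and `dy_dx_numeric` are made up here, and the finite-difference step size `h` is an arbitrary choice.

```python
import math

def dy_dx_chain(x):
    """Derivative of sin(x^2) built from intermediate variables."""
    u = x ** 2               # step 1: intermediate variable u = x^2
    du_dx = 2 * x            # step 2: derivative of u with respect to x
    dy_du = math.cos(u)      # step 2: derivative of y = sin(u) with respect to u
    return dy_du * du_dx     # step 3: multiply the intermediate derivatives

def dy_dx_numeric(x, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    f = lambda t: math.sin(t ** 2)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
print(dy_dx_chain(x), dy_dx_numeric(x))
```

The two printed values should agree to several decimal places, which is a cheap sanity check on any hand-derived derivative.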

You can think of the combining step of the chain rule in terms of units canceling. If we let y be miles,
x be the gallons in a gas tank, and u be gallons, we can interpret $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ as
$\frac{\text{miles}}{\text{tank}} = \frac{\text{miles}}{\text{gallon}}\frac{\text{gallon}}{\text{tank}}$.
The gallon denominator and numerator cancel.

Another way to think about the single-variable chain rule is to visualize the overall expression
as a dataflow diagram or chain of operations (or abstract syntax tree for compiler people):

Changes to function parameter x bubble up through a squaring operation then through a sin
operation to change result y. You can think of $\frac{du}{dx}$ as “getting changes from x to u” and $\frac{dy}{du}$ as
“getting changes from u to y.” Getting from x to y requires an intermediate hop. The chain rule
is, by convention, usually written from the output variable down to the parameter(s), $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$.
But, the x-to-y perspective would be more clear if we reversed the flow and used the equivalent
$\frac{dy}{dx} = \frac{du}{dx}\frac{dy}{du}$.

Conditions under which the single-variable chain rule applies. Notice that there is a single
dataflow path from x to the root y. Changes in x can influence output y in only one way. That is the
condition under which we can apply the single-variable chain rule. An easier condition to remember,
though one that’s a bit looser, is that none of the intermediate subexpression functions, $u(x)$ and
$y(u)$, have more than one parameter. Consider $y(x) = x + x^2$, which would become $y(x, u) = x + u$
after introducing intermediate variable u. As we’ll see in the next section, $y(x, u)$ has multiple
paths from x to y. To handle that situation, we’ll deploy the single-variable total-derivative chain
rule.

As an aside for those interested in automatic differentiation, papers and library documentation use
terminology forward differentiation and backward differentiation (for use in the back-propagation
algorithm). From a dataflow perspective, we are computing a forward differentiation because it
follows the normal data flow direction. Backward differentiation, naturally, goes the other direc-
tion and we’re asking how a change in the output would affect function parameter x. Because
backward differentiation can determine changes in all function parameters at once, it turns out to
be much more efficient for computing the derivative of functions with lots of parameters. Forward
differentiation, on the other hand, must consider how a change in each parameter, in turn, affects
the function output y. The following table emphasizes the order in which partial derivatives are
computed for the two techniques.

Forward differentiation from x to y              Backward differentiation from y to x
$\frac{dy}{dx} = \frac{du}{dx}\frac{dy}{du}$     $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$

Automatic differentiation is beyond the scope of this article, but we’re setting the stage for a future
article.

Many readers can solve $\frac{d}{dx}\sin(x^2)$ in their heads, but our goal is a process that will work even for
very complicated expressions. This process is also how automatic differentiation works in libraries
like PyTorch. So, by solving derivatives manually in this way, you’re also learning how to define
functions for custom neural networks in PyTorch.

With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler
unravels nested function calls like $f_4(f_3(f_2(f_1(x))))$ into a sequence (chain) of calls. The result of
calling function $f_i$ is saved to a temporary variable called a register, which is then passed as a
parameter to $f_{i+1}$. Let’s see how that looks in practice by using our process on a highly-nested
equation like $y = f(x) = \ln(\sin(x^3)^2)$:

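1. Introduce intermediate variables.

$$u_1 = f_1(x) = x^3$$
$$u_2 = f_2(u_1) = \sin(u_1)$$
$$u_3 = f_3(u_2) = u_2^2$$
$$u_4 = f_4(u_3) = \ln(u_3) \quad (y = u_4)$$

2. Compute derivatives.

$$\frac{du_1}{dx} = \frac{d}{dx}x^3 = 3x^2$$
$$\frac{du_2}{du_1} = \frac{d}{du_1}\sin(u_1) = \cos(u_1)$$
$$\frac{du_3}{du_2} = \frac{d}{du_2}u_2^2 = 2u_2$$
$$\frac{du_4}{du_3} = \frac{d}{du_3}\ln(u_3) = \frac{1}{u_3}$$

3. Combine four intermediate values.

$$\frac{dy}{dx} = \frac{du_4}{dx} = \frac{du_4}{du_3}\frac{du_3}{du_2}\frac{du_2}{du_1}\frac{du_1}{dx} = \frac{1}{u_3}\,2u_2\cos(u_1)\,3x^2 = \frac{6u_2x^2\cos(u_1)}{u_3}$$

4. Substitute.

$$\frac{dy}{dx} = \frac{6\sin(u_1)x^2\cos(x^3)}{u_2^2} = \frac{6\sin(x^3)x^2\cos(x^3)}{\sin(u_1)^2} = \frac{6\sin(x^3)x^2\cos(x^3)}{\sin(x^3)^2} = \frac{6x^2\cos(x^3)}{\sin(x^3)}$$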

Here is a visualization of the data flow through the chain of operations from x to y:

At this point, we can handle derivatives of nested expressions of a single variable, x, using the
chain rule but only if x can affect y through a single data flow path. To handle more complicated
expressions, we need to extend our technique, which we’ll do next.
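
The register-style evaluation of $y = \ln(\sin(x^3)^2)$ can be sketched as follows. This is a minimal illustration under our own naming (`nested_derivative` and `closed_form` are hypothetical helpers, not from the article); it stores each intermediate value, computes each local derivative, and multiplies them together.

```python
import math

def nested_derivative(x):
    """Return (y, dy/dx) for y = ln(sin(x^3)^2) using intermediate variables."""
    # Forward pass: each result is saved like a compiler register.
    u1 = x ** 3
    u2 = math.sin(u1)
    u3 = u2 ** 2
    u4 = math.log(u3)
    # Local derivatives of each intermediate variable.
    du1_dx = 3 * x ** 2
    du2_du1 = math.cos(u1)
    du3_du2 = 2 * u2
    du4_du3 = 1 / u3
    # Chain the pieces together by multiplication.
    return u4, du4_du3 * du3_du2 * du2_du1 * du1_dx

def closed_form(x):
    """The simplified result 6 x^2 cos(x^3) / sin(x^3)."""
    return 6 * x ** 2 * math.cos(x ** 3) / math.sin(x ** 3)

y, dy_dx = nested_derivative(0.9)
print(dy_dx, closed_form(0.9))
```

The input must keep $\sin(x^3)$ away from zero, since the logarithm and the final quotient are undefined there.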

4.5.2 Single-variable total-derivative chain rule

Our single-variable chain rule has limited applicability because all intermediate variables must be
functions of single variables. But, it demonstrates the core mechanism of the chain rule, that of
multiplying out all derivatives of intermediate subexpressions. To handle more general expressions
such as $y = f(x) = x + x^2$, however, we need to augment that basic chain rule.

Of course, we immediately see $\frac{dy}{dx} = \frac{d}{dx}x + \frac{d}{dx}x^2 = 1 + 2x$, but that is using the scalar addition
derivative rule, not the chain rule. If we tried to apply the single-variable chain rule, we’d get
the wrong answer. In fact, the previous chain rule is meaningless in this case because derivative
operator $\frac{d}{dx}$ does not apply to multivariate functions, such as $u_2$ among our intermediate variables:

$$u_1(x) = x^2$$
$$u_2(x, u_1) = x + u_1 \quad (y = f(x) = u_2(x, u_1))$$

Let’s try it anyway to see what happens. If we pretend that $\frac{du_2}{du_1} = 0 + 1 = 1$ and $\frac{du_1}{dx} = 2x$, then
$\frac{dy}{dx} = \frac{du_2}{dx} = \frac{du_2}{du_1}\frac{du_1}{dx} = 2x$ instead of the right answer $1 + 2x$.

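Because $u_2(x, u_1) = x + u_1$ has multiple parameters, partial derivatives come into play. Let’s blindly
apply the partial derivative operator to all of our equations and see what we get:

$$\frac{\partial u_1(x)}{\partial x} = 2x \quad (\text{same as } \tfrac{du_1(x)}{dx})$$
$$\frac{\partial u_2(x, u_1)}{\partial u_1} = \frac{\partial}{\partial u_1}(x + u_1) = 0 + 1 = 1$$
$$\frac{\partial u_2(x, u_1)}{\partial x} = \frac{\partial}{\partial x}(x + u_1) = 1 + 0 = 1 \quad (\text{something’s not quite right here!})$$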

Ooops! The partial $\frac{\partial u_2(x, u_1)}{\partial x}$ is wrong because it violates a key assumption for partial derivatives.
When taking the partial derivative with respect to x, the other variables must not vary as x varies.
Otherwise, we could not act as if the other variables were constants. Clearly, though, $u_1(x) = x^2$
is a function of x and therefore varies with x. $\frac{\partial u_2(x, u_1)}{\partial x} \neq 1 + 0$ because $\frac{\partial u_1(x)}{\partial x} \neq 0$. A quick look
at the data flow diagram for $y = u_2(x, u_1)$ shows multiple paths from x to y, thus, making it clear
we need to consider direct and indirect (through $u_1(x)$) dependencies on x:

A change in x affects y both as an operand of the addition and as the operand of the square
operator. Here’s an equation that describes how tweaks to x affect the output:

$$\hat{y} = (x + \Delta x) + (x + \Delta x)^2$$

Then, $\Delta y = \hat{y} - y$, which we can read as “the change in y is the difference between the original y
and y at a tweaked x.”

If we let $x = 1$, then $y = 1 + 1^2 = 2$. If we bump x by 1, $\Delta x = 1$, then $\hat{y} = (1+1) + (1+1)^2 = 2 + 4 = 6$.
The change in y is not 1, as $\partial u_2/\partial u_1$ would lead us to believe, but $6 - 2 = 4$!

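Enter the “law” of total derivatives, which basically says that to compute $\frac{dy}{dx}$, we need to sum up
all possible contributions from changes in x to the change in y. The total derivative with respect to
x assumes all variables, such as $u_1$ in this case, are functions of x and potentially vary as x varies.
The total derivative of $f(x) = u_2(x, u_1)$ that depends on x directly and indirectly via intermediate
variable $u_1(x)$ is given by:

$$\frac{dy}{dx} = \frac{\partial f(x)}{\partial x} = \frac{\partial u_2(x, u_1)}{\partial x} = \frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$

Using this formula, we get the proper answer:

$$\frac{dy}{dx} = \frac{\partial f(x)}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 1 \times 2x = 1 + 2x$$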
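That is an application of what we can call the single-variable total-derivative chain rule:

$$\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u_1}\frac{\partial u_1}{\partial x} + \frac{\partial f}{\partial u_2}\frac{\partial u_2}{\partial x} + \ldots + \frac{\partial f}{\partial u_n}\frac{\partial u_n}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$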

The total derivative assumes all variables are potentially codependent whereas the partial derivative
assumes all variables but x are constants.
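
To make the difference concrete, here is a small sketch for $y = x + x^2$ that compares the naive single-path product with the total derivative and a finite-difference estimate. The function names are hypothetical and chosen only for this illustration.

```python
def naive_chain(x):
    """Single-path product (du2/du1)(du1/dx); ignores the direct x dependency."""
    du2_du1 = 1.0
    du1_dx = 2 * x
    return du2_du1 * du1_dx

def total_derivative(x):
    """Direct contribution plus the indirect contribution through u1 = x^2."""
    du2_dx_direct = 1.0      # partial of u2 = x + u1 with u1 held fixed
    du2_du1 = 1.0
    du1_dx = 2 * x
    return du2_dx_direct + du2_du1 * du1_dx

def numeric(x, h=1e-6):
    """Central finite-difference estimate of d/dx (x + x^2)."""
    f = lambda t: t + t ** 2
    return (f(x + h) - f(x - h)) / (2 * h)

print(naive_chain(1.0), total_derivative(1.0), numeric(1.0))
```

Only the total derivative matches the numerical estimate; the naive product misses the path where x feeds the addition directly.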

There is something subtle going on here with the notation. All of the derivatives are shown as
partial derivatives because f and ui are functions of multiple variables. This notation mirrors
that of MathWorld’s notation but differs from Wikipedia, which uses df (x, u1 , . . . , un )/dx instead
(possibly to emphasize the total derivative nature of the equation). We’ll stick with the partial
derivative notation so that it’s consistent with our discussion of the vector chain rule in the next
section.

In practice, just keep in mind that when you take the total derivative with respect to x, other
variables might also be functions of x so add in their contributions as well. The left side of the
equation looks like a typical partial derivative but the right-hand side is actually the total derivative.
It’s common, however, that many temporary variables are functions of a single parameter, which
means that the single-variable total-derivative chain rule degenerates to the single-variable chain
rule.

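Let’s look at a nested subexpression, such as $f(x) = \sin(x + x^2)$. We introduce three intermediate
variables:

$$u_1(x) = x^2$$
$$u_2(x, u_1) = x + u_1$$
$$u_3(u_2) = \sin(u_2) \quad (y = f(x) = u_3(u_2))$$

and partials:

$$\frac{\partial u_1}{\partial x} = 2x$$
$$\frac{\partial u_2}{\partial x} = \frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 1 \times 2x = 1 + 2x$$
$$\frac{\partial f(x)}{\partial x} = \frac{\partial u_3}{\partial x} + \frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial x} = 0 + \cos(u_2)\frac{\partial u_2}{\partial x} = \cos(x + x^2)(1 + 2x)$$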

where both $\frac{\partial u_2}{\partial x}$ and $\frac{\partial f(x)}{\partial x}$ have $\frac{\partial u_i}{\partial x}$ terms that take into account the total derivative.

Also notice that the total derivative formula always sums versus, say, multiplies terms $\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$.
It’s tempting to think that summing up terms in the derivative makes sense because, for example,
$y = x + x^2$ adds two terms. Nope. The total derivative is adding terms because it represents a
weighted sum of all x contributions to the change in y. For example, given $y = x \times x^2$ instead
of $y = x + x^2$, the total-derivative chain rule formula still adds partial derivative terms. ($x \times x^2$
simplifies to $x^3$ but for this demonstration, let’s not combine the terms.) Here are the intermediate
variables and partial derivatives:
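
$$u_1(x) = x^2$$
$$u_2(x, u_1) = x u_1 \quad (y = f(x) = u_2(x, u_1))$$

$$\frac{\partial u_1}{\partial x} = 2x$$
$$\frac{\partial u_2}{\partial x} = u_1 \quad (\text{for } u_2 = x + u_1,\ \tfrac{\partial u_2}{\partial x} = 1)$$
$$\frac{\partial u_2}{\partial u_1} = x \quad (\text{for } u_2 = x + u_1,\ \tfrac{\partial u_2}{\partial u_1} = 1)$$
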
The form of the total derivative remains the same, however:

$$\frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = u_1 + x\,2x = x^2 + 2x^2 = 3x^2$$

It’s the partials (weights) that change, not the formula, when the intermediate variable operators
change.

Those readers with a strong calculus background might wonder why we aggressively introduce
intermediate variables even for the non-nested subexpressions such as x2 in x + x2 . We use this
process for three reasons: (i) computing the derivatives for the simplified subexpressions is usually
trivial, (ii) we can simplify the chain rule, and (iii) the process mirrors how automatic differentiation
works in neural network libraries.

Using the intermediate variables even more aggressively, let’s see how we can simplify our single-
variable total-derivative chain rule to its final form. The goal is to get rid of the $\frac{\partial f}{\partial x}$ sticking out on
the front like a sore thumb:

$$\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$

We can achieve that by simply introducing a new temporary variable as an alias for x: $u_{n+1} = x$.
Then, the formula reduces to our final form:

$$\frac{\partial f(u_1, \ldots, u_{n+1})}{\partial x} = \sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$

This chain rule that takes into consideration the total derivative degenerates to the single-variable
chain rule when all intermediate variables are functions of a single variable. Consequently, you
can remember this more general formula to cover both cases. As a bit of dramatic foreshadowing,
notice that the summation sure looks like a vector dot product, $\frac{\partial f}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial x}$, or a vector multiply $\frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x}$.
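
The final form can be sketched as a dot product of two lists of partials. The example below reuses $y = x \times x^2$ from earlier in this section, with $u_1 = x^2$ and the alias $u_2 = x$; the function name `total_derivative_dot` is our own and purely illustrative.

```python
def total_derivative_dot(x):
    """Sum over i of (df/du_i)(du_i/dx) for f(u1, u2) = u2 * u1."""
    u1 = x ** 2
    u2 = x                    # alias variable u_{n+1} = x
    df_du = [u2, u1]          # partials of f with respect to u1 and u2
    du_dx = [2 * x, 1.0]      # derivatives of u1 and u2 with respect to x
    return sum(a * b for a, b in zip(df_du, du_dx))

print(total_derivative_dot(2.0))   # compare with 3 * x**2 at x = 2
```

Because the alias absorbs the direct dependency on x, no separate leading term is needed.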

Before we move on, a word of caution about terminology on the web. Unfortunately, the chain rule
given in this section, based upon the total derivative, is universally called “multivariable chain rule”

in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate
functions. The overall function, say, f (x) = x + x2 , is a scalar function that accepts a single
parameter x. The derivative and parameter are scalars, not vectors, as one would expect with a
so-called multivariate chain rule. (Within the context of a non-matrix calculus class, “multivariate
chain rule” is likely unambiguous.) To reduce confusion, we use “single-variable total-derivative
chain rule” to spell out the distinguishing feature between the simple single-variable chain rule,
$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$, and this one.

4.5.3 Vector chain rule

Now that we’ve got a good handle on the total-derivative chain rule, we’re ready to tackle the
chain rule for vectors of functions and vector variables. Surprisingly, this more general chain rule
is just as simple looking as the single-variable chain rule for scalars. Rather than just presenting
the vector chain rule, let’s rediscover it ourselves so we get a firm grip on it. We can start by
computing the derivative of a sample vector function with respect to a scalar, y = f (x), to see if
we can abstract a general formula.

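$$\begin{bmatrix} y_1(x) \\ y_2(x) \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} \ln(x^2) \\ \sin(3x) \end{bmatrix}$$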

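Let’s introduce two intermediate variables, $g_1$ and $g_2$, one for each $f_i$ so that y looks more like
$\mathbf{y} = \mathbf{f}(\mathbf{g}(x))$:

$$\begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ 3x \end{bmatrix}$$

$$\begin{bmatrix} f_1(\mathbf{g}) \\ f_2(\mathbf{g}) \end{bmatrix} = \begin{bmatrix} \ln(g_1) \\ \sin(g_2) \end{bmatrix}$$
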
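The derivative of vector y with respect to scalar x is a vertical vector with elements computed
using the single-variable total-derivative chain rule:

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial f_1(\mathbf{g})}{\partial x} \\ \frac{\partial f_2(\mathbf{g})}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{1}{g_1}2x + 0 \\ 0 + \cos(g_2)3 \end{bmatrix} = \begin{bmatrix} \frac{2x}{x^2} \\ 3\cos(3x) \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\ 3\cos(3x) \end{bmatrix}$$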

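Ok, so now we have the answer using just the scalar rules, albeit with the derivatives grouped into
a vector. Let’s try to abstract from that result what it looks like in vector form. The goal is to
convert the following vector of scalar operations to a vector operation.

$$\begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix}$$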

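If we split the $\frac{\partial f_i}{\partial g_j}\frac{\partial g_j}{\partial x}$ terms, isolating the $\frac{\partial g_j}{\partial x}$ terms into a vector, we get a matrix by vector
multiplication:

$$\begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} \end{bmatrix}\begin{bmatrix} \frac{\partial g_1}{\partial x} \\ \frac{\partial g_2}{\partial x} \end{bmatrix} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x}$$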

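That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool.
Let’s check our results:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x} = \begin{bmatrix} \frac{1}{g_1} & 0 \\ 0 & \cos(g_2) \end{bmatrix}\begin{bmatrix} 2x \\ 3 \end{bmatrix} = \begin{bmatrix} \frac{1}{g_1}2x + 0 \\ 0 + \cos(g_2)3 \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\ 3\cos(3x) \end{bmatrix}$$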

Whew! We get the same answer as the scalar approach. This vector chain rule for vectors of
functions and a single parameter appears to be correct and, indeed, mirrors the single-variable
chain rule. Compare the vector rule:

$$\frac{\partial}{\partial x}\mathbf{f}(\mathbf{g}(x)) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x}$$

with the single-variable chain rule:

$$\frac{d}{dx}f(g(x)) = \frac{df}{dg}\frac{dg}{dx}$$

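To make this formula work for multiple parameters or vector $\mathbf{x}$, we just have to change x to vector
$\mathbf{x}$ in the equation. The effect is that $\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$ and the resulting Jacobian, $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$, are now matrices instead
of vertical vectors. Our complete vector chain rule is:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$$

(Note: matrix multiply doesn’t commute; the order of $\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$ matters.)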

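The beauty of the vector formula over the single-variable chain rule is that it automatically takes
into consideration the total derivative while maintaining the same notational simplicity. The Jacobian
contains all possible combinations of partials of $f_i$ with respect to $g_j$ and of $g_i$ with respect to $x_j$. For
completeness, here are the two Jacobian components in their full glory:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} & \ldots & \frac{\partial f_1}{\partial g_k} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} & \ldots & \frac{\partial f_2}{\partial g_k} \\ & \ldots & & \\ \frac{\partial f_m}{\partial g_1} & \frac{\partial f_m}{\partial g_2} & \ldots & \frac{\partial f_m}{\partial g_k} \end{bmatrix}\begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \ldots & \frac{\partial g_1}{\partial x_n} \\ \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \ldots & \frac{\partial g_2}{\partial x_n} \\ & \ldots & & \\ \frac{\partial g_k}{\partial x_1} & \frac{\partial g_k}{\partial x_2} & \ldots & \frac{\partial g_k}{\partial x_n} \end{bmatrix}$$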

where m = |f |, n = |x|, and k = |g|. The resulting Jacobian is m × n (an m × k matrix multiplied
by a k × n matrix).
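
As a sketch of the Jacobian product, the following plain-Python code multiplies the two Jacobians for the running example $\mathbf{f}(\mathbf{g}) = [\ln(g_1), \sin(g_2)]$, $\mathbf{g}(x) = [x^2, 3x]$. The `matmul` helper and the function name `dy_dx` are assumptions made for this illustration; a real project would more likely use a numerical library.

```python
import math

def matmul(a, b):
    """Multiply an (m x k) list-of-lists matrix by a (k x n) one."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def dy_dx(x):
    """Jacobian of [ln(x^2), sin(3x)] with respect to scalar x (a 2 x 1 matrix)."""
    g1, g2 = x ** 2, 3 * x
    df_dg = [[1 / g1, 0.0],
             [0.0, math.cos(g2)]]   # m x k Jacobian of f with respect to g
    dg_dx = [[2 * x],
             [3.0]]                 # k x n Jacobian of g with respect to x
    return matmul(df_dg, dg_dx)

print(dy_dx(2.0))   # compare with [[2/x], [3*cos(3x)]] at x = 2
```

The order of the two factors matters: swapping them would not even produce compatible shapes in this example.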

Even within this $\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$ formula, we can simplify further because, for many applications, the Jacobians
are square ($m = n$) and the off-diagonal entries are zero. It is the nature of neural networks
that the associated mathematics deals with functions of vectors not vectors of functions. For example,
the neuron affine function has term $sum(\mathbf{w} \otimes \mathbf{x})$ and the activation function is $\max(0, \mathbf{x})$;
we’ll consider derivatives of these functions in the next section.

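As we saw in a previous section, element-wise operations on vectors w and x yield diagonal matrices
with elements $\frac{\partial w_i}{\partial x_i}$ because $w_i$ is a function purely of $x_i$ but not $x_j$ for $j \neq i$. The same thing happens
here when $f_i$ is purely a function of $g_i$ and $g_i$ is purely a function of $x_i$:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{g}} = diag\left(\frac{\partial f_i}{\partial g_i}\right)$$
$$\frac{\partial \mathbf{g}}{\partial \mathbf{x}} = diag\left(\frac{\partial g_i}{\partial x_i}\right)$$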

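In this situation, the vector chain rule simplifies to:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = diag\left(\frac{\partial f_i}{\partial g_i}\right)diag\left(\frac{\partial g_i}{\partial x_i}\right) = diag\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
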
Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain
rule values.

After slogging through all of that mathematics, here’s the payoff. All you need is the vector chain
rule because the single-variable formulas are special cases of the vector chain rule. The following
table summarizes the appropriate components to multiply in order to get the Jacobian.

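$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$

            scalar x, scalar u                                          scalar x, vector u                                                            vector x, vector u
scalar f    $\frac{\partial f}{\partial u}\frac{\partial u}{\partial x}$            $\frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x}$            $\frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$
vector f    $\frac{\partial \mathbf{f}}{\partial u}\frac{\partial u}{\partial x}$   $\frac{\partial \mathbf{f}}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x}$   $\frac{\partial \mathbf{f}}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$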

5 The gradient of neuron activation

We now have all of the pieces needed to compute the derivative of a typical neuron activation for
a single neural network computation unit with respect to the model parameters, w and b:
activation(x) = max(0, w · x + b)
(This represents a neuron with fully connected weights and rectified linear unit activation. There
are, however, other affine functions such as convolution and other activation functions, such as
exponential linear units, that follow similar logic.)

Let’s worry about max later and focus on computing $\frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \cdot \mathbf{x} + b)$ and $\frac{\partial}{\partial b}(\mathbf{w} \cdot \mathbf{x} + b)$. (Recall that
neural networks learn through optimization of their weights and biases.) We haven’t discussed the
derivative of the dot product yet, $y = \mathbf{f}(\mathbf{w}) \cdot \mathbf{g}(\mathbf{x})$, but we can use the chain rule to avoid having to
memorize yet another rule. (Note notation $y$ not $\mathbf{y}$ as the result is a scalar not a vector.)

The dot product $\mathbf{w} \cdot \mathbf{x}$ is just the summation of the element-wise multiplication of the elements:
$\sum_i^n (w_i x_i) = sum(\mathbf{w} \otimes \mathbf{x})$. (You might also find it useful to remember the linear algebra notation
$\mathbf{w} \cdot \mathbf{x} = \mathbf{w}^T\mathbf{x}$.) We know how to compute the partial derivatives of $sum(\mathbf{x})$ and $\mathbf{w} \otimes \mathbf{x}$ but haven’t
looked at partial derivatives for $sum(\mathbf{w} \otimes \mathbf{x})$. We need the chain rule for that and so we can
introduce an intermediate vector variable u just as we did using the single-variable chain rule:

u = w⊗x
y = sum(u)

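Once we’ve rephrased y, we recognize two subexpressions for which we already know the partial
derivatives:

$$\frac{\partial \mathbf{u}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \otimes \mathbf{x}) = diag(\mathbf{x})$$
$$\frac{\partial y}{\partial \mathbf{u}} = \frac{\partial}{\partial \mathbf{u}}sum(\mathbf{u}) = \vec{1}^T$$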

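The vector chain rule says to multiply the partials:

$$\frac{\partial y}{\partial \mathbf{w}} = \frac{\partial y}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{w}} = \vec{1}^T diag(\mathbf{x}) = \mathbf{x}^T$$
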
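To check our results, we can grind the dot product down into a pure scalar function:

$$y = \mathbf{w} \cdot \mathbf{x} = \sum_i^n (w_i x_i)$$
$$\frac{\partial y}{\partial w_j} = \frac{\partial}{\partial w_j}\sum_i (w_i x_i) = \sum_i \frac{\partial}{\partial w_j}(w_i x_i) = \frac{\partial}{\partial w_j}(w_j x_j) = x_j$$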

Then:

$$\frac{\partial y}{\partial \mathbf{w}} = [x_1, \ldots, x_n] = \mathbf{x}^T$$

Hooray! Our scalar results match the vector chain rule results.
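
The same check can be sketched in code by forming $\vec{1}^T diag(\mathbf{x})$ explicitly. The helpers `diag` and `grad_dot_wrt_w` are hypothetical names for this illustration, written with plain lists rather than a matrix library.

```python
def diag(v):
    """Square matrix with v on the diagonal and zeros elsewhere."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))]
            for i in range(len(v))]

def grad_dot_wrt_w(w, x):
    """Row vector dy/dw for y = sum(w ⊗ x), computed as 1^T diag(x)."""
    ones = [1.0] * len(w)     # dy/du for y = sum(u)
    du_dw = diag(x)           # du/dw for u = w ⊗ x
    return [sum(ones[i] * du_dw[i][j] for i in range(len(w)))
            for j in range(len(w))]

print(grad_dot_wrt_w([0.5, -1.0, 2.0], [3.0, 4.0, 5.0]))
```

The result depends only on x, not on w, which matches $\frac{\partial y}{\partial \mathbf{w}} = \mathbf{x}^T$.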

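Now, let $y = \mathbf{w} \cdot \mathbf{x} + b$, the full expression within the max activation function call. We have two
different partials to compute, but we don’t need the chain rule:

$$\frac{\partial y}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}\mathbf{w} \cdot \mathbf{x} + \frac{\partial}{\partial \mathbf{w}}b = \mathbf{x}^T + \vec{0}^T = \mathbf{x}^T$$
$$\frac{\partial y}{\partial b} = \frac{\partial}{\partial b}\mathbf{w} \cdot \mathbf{x} + \frac{\partial}{\partial b}b = 0 + 1 = 1$$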

Let’s tackle the partials of the neuron activation, $\max(0, \mathbf{w} \cdot \mathbf{x} + b)$. The use of the $\max(0, z)$
function call on scalar z just says to treat all negative z values as 0. The derivative of the max
function is a piecewise function. When $z \le 0$, the derivative is 0 because the output is the constant 0. When
$z > 0$, the derivative of the max function is just the derivative of z, which is 1:

$$\frac{\partial}{\partial z}\max(0, z) = \begin{cases} 0 & z \le 0 \\ \frac{dz}{dz} = 1 & z > 0 \end{cases}$$

An aside on broadcasting functions across scalars. When one or both of the max arguments are
vectors, such as $\max(0, \mathbf{x})$, we broadcast the single-variable function max across the elements. This
is an example of an element-wise unary operator. Just to be clear:

$$\max(0, \mathbf{x}) = \begin{bmatrix} \max(0, x_1) \\ \max(0, x_2) \\ \ldots \\ \max(0, x_n) \end{bmatrix}$$

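For the derivative of the broadcast version then, we get a vector of zeros and ones where:

$$\frac{\partial}{\partial x_i}\max(0, x_i) = \begin{cases} 0 & x_i \le 0 \\ \frac{dx_i}{dx_i} = 1 & x_i > 0 \end{cases}$$

$$\frac{\partial}{\partial \mathbf{x}}\max(0, \mathbf{x}) = \begin{bmatrix} \frac{\partial}{\partial x_1}\max(0, x_1) \\ \frac{\partial}{\partial x_2}\max(0, x_2) \\ \ldots \\ \frac{\partial}{\partial x_n}\max(0, x_n) \end{bmatrix}$$
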
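
A minimal sketch of the broadcast max and its element-wise derivative, assuming plain Python lists stand in for vectors (the names `relu` and `relu_grad` are our own):

```python
def relu(x):
    """Broadcast max(0, x_i) across the elements of x."""
    return [max(0.0, xi) for xi in x]

def relu_grad(x):
    """Element-wise derivative: 0 where x_i <= 0, 1 where x_i > 0."""
    return [1.0 if xi > 0 else 0.0 for xi in x]

x = [-2.0, 0.0, 3.5]
print(relu(x), relu_grad(x))
```

Following the piecewise definition in the text, the derivative at exactly zero is taken to be 0.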
To get the derivative of the activation(x) function, we need the chain rule because of the nested
subexpression, w · x + b. Following our process, let’s introduce intermediate scalar variable z to
represent the affine function giving:

z(w, b, x) = w · x + b

activation(z) = max(0, z)
The vector chain rule tells us:

$$\frac{\partial activation}{\partial \mathbf{w}} = \frac{\partial activation}{\partial z}\frac{\partial z}{\partial \mathbf{w}}$$

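which we can rewrite as follows:

$$\frac{\partial activation}{\partial \mathbf{w}} = \begin{cases} 0\,\frac{\partial z}{\partial \mathbf{w}} = \vec{0}^T & z \le 0 \\ 1\,\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T & z > 0 \end{cases} \quad (\text{we computed } \tfrac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T \text{ previously})$$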

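and then substitute $z = \mathbf{w} \cdot \mathbf{x} + b$ back in:

$$\frac{\partial activation}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ \mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$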

That equation matches our intuition. When the activation function clips affine function output z
to 0, the derivative is zero with respect to any weight wi . When z > 0, it’s as if the max function
disappears and we get just the derivative of z with respect to the weights.

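Turning now to the derivative of the neuron activation with respect to b, we get:

$$\frac{\partial activation}{\partial b} = \begin{cases} 0\,\frac{\partial z}{\partial b} = 0 & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ 1\,\frac{\partial z}{\partial b} = 1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$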
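
The two piecewise results can be collected into one small function. This is a sketch under our own naming (`activation_grads` is hypothetical), with lists standing in for the row vectors $\vec{0}^T$ and $\mathbf{x}^T$.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def activation_grads(w, x, b):
    """Return (d activation / d w, d activation / d b) for max(0, w.x + b)."""
    z = dot(w, x) + b
    if z <= 0:
        # The max clips the output to 0, so no parameter has any influence.
        return [0.0] * len(w), 0.0
    # The max passes z through, so we get the derivatives of z itself.
    return list(x), 1.0

print(activation_grads([1.0, 2.0], [3.0, -1.0], 0.5))
print(activation_grads([1.0, 2.0], [3.0, -1.0], -2.0))
```

The first call has $z = 1.5 > 0$ and returns the input vector and 1; the second has $z = -1$ and returns zeros.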

Let’s use these partial derivatives now to handle the entire loss function.

6 The gradient of the neural network loss function

Training a neuron requires that we take the derivative of our loss or “cost” function with respect to
the parameters of our model, w and b. Because we train with multiple vector inputs (e.g., multiple
images) and scalar targets (e.g., one classification per image), we need some more notation. Let

$$X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^T$$

where N = |X|, and then let

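$$\mathbf{y} = [target(\mathbf{x}_1), target(\mathbf{x}_2), \ldots, target(\mathbf{x}_N)]^T$$

where $y_i$ is a scalar. Then the cost equation becomes:

$$C(\mathbf{w}, b, X, \mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N}(y_i - activation(\mathbf{x}_i))^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \max(0, \mathbf{w} \cdot \mathbf{x}_i + b))^2$$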

Following our chain rule process introduces these intermediate variables:

$$u(\mathbf{w}, b, \mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b)$$
$$v(y, u) = y - u$$
$$C(v) = \frac{1}{N}\sum_{i=1}^{N} v^2$$

Let’s compute the gradient with respect to w first.

6.1 The gradient with respect to the weights

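From before, we know:

$$\frac{\partial}{\partial \mathbf{w}}u(\mathbf{w}, b, \mathbf{x}) = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ \mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$

and

$$\frac{\partial v(y, u)}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}(y - u) = \vec{0}^T - \frac{\partial u}{\partial \mathbf{w}} = -\frac{\partial u}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ -\mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$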

Then, for the overall gradient, we get:

N
∂C(v) ∂ 1 X 2
= v
∂w ∂w N
i=1
N
1 X ∂ 2
= v
N ∂w
i=1
N
1 X ∂v 2 ∂v
=
N ∂v ∂w
i=1
N
1 X ∂v
= 2v
N ∂w
i=1
N
(
1 X 2v~0T = ~0T w · xi + b ≤ 0
=
N
i=1
−2vxT w · xi + b > 0

26
N
(
1 X ~0T w · xi + b ≤ 0
= T
N −2(yi − u)xi w · xi + b > 0
i=1
N
(
1 X ~0T w · xi + b ≤ 0
= T
N −2(yi − max(0, w · xi + b))xi w · xi + b > 0
i=1
N
(
1 X ~0T w · xi + b ≤ 0
= T
N −2(yi − (w · xi + b))xi w · xi + b > 0
i=1
(
~0T w · xi + b ≤ 0
= −2 PN T
N i=1 (yi − (w · xi + b))xi w · xi + b > 0
(
~0T w · xi + b ≤ 0
= 2 PN T
N i=1 (w · xi + b − yi )xi w · xi + b > 0

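To interpret that equation, we can substitute an error term $e_i = \mathbf{w} \cdot \mathbf{x}_i + b - y_i$ yielding:

$$\frac{\partial C}{\partial \mathbf{w}} = \frac{2}{N}\sum_{i=1}^{N} e_i\mathbf{x}_i^T \quad (\text{for the nonzero activation case})$$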

From there, notice that this computation is a weighted average across all $\mathbf{x}_i$ in X. The weights are
the error terms, the difference between the target output and the actual neuron output for each $\mathbf{x}_i$
input. The resulting gradient will, on average, point in the direction of higher cost or loss because
large $e_i$ emphasize their associated $\mathbf{x}_i$. Imagine we only had one input vector, $N = |X| = 1$, then
the gradient is just $2e_1\mathbf{x}_1^T$. If the error is 0, then the gradient is zero and we have arrived at the
minimum loss. If $e_1$ is some small positive difference, the gradient is a small step in the direction
of $\mathbf{x}_1$. If $e_1$ is large, the gradient is a large step in that direction. If $e_1$ is negative, the gradient is
reversed, meaning the highest cost is in the negative direction.

Of course, we want to reduce, not increase, the loss, which is why the gradient descent recurrence
relation takes the negative of the gradient to update the current position (for scalar learning rate
$\eta$):

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\frac{\partial C}{\partial \mathbf{w}}$$

Because the gradient indicates the direction of higher cost, we want to update $\mathbf{w}$ in the opposite
direction.
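
The weight gradient can be sketched and checked numerically as follows. This is an illustration, not the authors’ code: the function names and the toy data are invented here, and the activation test is applied per sample inside the sum so that inactive samples contribute zero.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cost(w, b, X, y):
    """Mean squared error of a single ReLU neuron over all samples."""
    return sum((yi - max(0.0, dot(w, xi) + b)) ** 2
               for xi, yi in zip(X, y)) / len(X)

def grad_w(w, b, X, y):
    """(2/N) * sum of e_i * x_i over samples whose activation is nonzero."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        z = dot(w, xi) + b
        if z > 0:
            e = z - yi                      # error term e_i
            for j in range(len(w)):
                g[j] += 2.0 * e * xi[j] / len(X)
    return g

def numeric_grad_w(w, b, X, y, h=1e-6):
    """Central finite differences, one weight at a time."""
    g = []
    for j in range(len(w)):
        wp = list(w); wp[j] += h
        wm = list(w); wm[j] -= h
        g.append((cost(wp, b, X, y) - cost(wm, b, X, y)) / (2 * h))
    return g

X = [[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0]]   # toy inputs (assumed)
y = [1.0, 0.0, 2.0]                          # toy targets (assumed)
w, b = [0.5, 0.25], 0.1
print(grad_w(w, b, X, y), numeric_grad_w(w, b, X, y))
```

The comparison is only meaningful away from the kink at $\mathbf{w} \cdot \mathbf{x}_i + b = 0$, where the max function is not differentiable.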

6.2 The derivative with respect to the bias

To optimize the bias, b, we also need the partial with respect to b. Here are the intermediate
variables again:

$$u(\mathbf{w}, b, \mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b)$$
$$v(y, u) = y - u$$
$$C(v) = \frac{1}{N}\sum_{i=1}^{N} v^2$$

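We computed the partial with respect to the bias for equation $u(\mathbf{w}, b, \mathbf{x})$ previously:

$$\frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ 1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$

For v, the partial is:

$$\frac{\partial v(y, u)}{\partial b} = \frac{\partial}{\partial b}(y - u) = 0 - \frac{\partial u}{\partial b} = -\frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \le 0 \\ -1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$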

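And for the partial of the cost function itself we get:

$$
\begin{aligned}
\frac{\partial C(v)}{\partial b} &= \frac{\partial}{\partial b}\frac{1}{N}\sum_{i=1}^{N} v^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\frac{\partial}{\partial b}v^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\frac{\partial v^2}{\partial v}\frac{\partial v}{\partial b} \\
&= \frac{1}{N}\sum_{i=1}^{N} 2v\frac{\partial v}{\partial b} \\
&= \frac{1}{N}\sum_{i=1}^{N}\begin{cases} 0 & \mathbf{w} \cdot \mathbf{x}_i + b \le 0 \\ -2v & \mathbf{w} \cdot \mathbf{x}_i + b > 0 \end{cases} \\
&= \frac{1}{N}\sum_{i=1}^{N}\begin{cases} 0 & \mathbf{w} \cdot \mathbf{x}_i + b \le 0 \\ -2(y_i - \max(0, \mathbf{w} \cdot \mathbf{x}_i + b)) & \mathbf{w} \cdot \mathbf{x}_i + b > 0 \end{cases} \\
&= \frac{1}{N}\sum_{i=1}^{N}\begin{cases} 0 & \mathbf{w} \cdot \mathbf{x}_i + b \le 0 \\ 2(\mathbf{w} \cdot \mathbf{x}_i + b - y_i) & \mathbf{w} \cdot \mathbf{x}_i + b > 0 \end{cases} \\
&= \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x}_i + b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\mathbf{w} \cdot \mathbf{x}_i + b - y_i) & \mathbf{w} \cdot \mathbf{x}_i + b > 0 \end{cases}
\end{aligned}
$$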

As before, we can substitute an error term:

$$\frac{\partial C}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} e_i \quad (\text{for the nonzero activation case})$$

The partial derivative is then just twice the average error or zero, according to the activation level. To
update the neuron bias, we nudge it in the opposite direction of increased cost:

$$b_{t+1} = b_t - \eta\frac{\partial C}{\partial b}$$

In practice, it is convenient to combine w and b into a single vector parameter rather than having
to deal with two different partials: $\hat{\mathbf{w}} = [\mathbf{w}^T, b]^T$. This requires a tweak to the input vector x as
well but simplifies the activation function. By tacking a 1 onto the end of x, $\hat{\mathbf{x}} = [\mathbf{x}^T, 1]^T$, $\mathbf{w} \cdot \mathbf{x} + b$
becomes $\hat{\mathbf{w}} \cdot \hat{\mathbf{x}}$.
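
A brief sketch of that trick, with hypothetical helper names `augment_params` and `augment_input`:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def augment_params(w, b):
    """Append the bias to the weight vector."""
    return list(w) + [b]

def augment_input(x):
    """Append a constant 1 to the input vector."""
    return list(x) + [1.0]

w, b, x = [0.5, -1.0], 0.25, [2.0, 3.0]
lhs = dot(w, x) + b
rhs = dot(augment_params(w, b), augment_input(x))
print(lhs, rhs)
```

Both expressions evaluate the same affine function, so a single gradient with respect to the augmented vector covers the weights and the bias together.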

This finishes off the optimization of the neural network loss function because we have the two
partials necessary to perform a gradient descent.
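
Putting the two partials together, a bare-bones gradient descent loop for a single neuron might look like the sketch below. Everything here is illustrative: the function names, the learning rate, the step count, and the toy data (generated from $y = 2x + 1$) are assumptions, and the starting point is chosen so that the neuron is active on every sample.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gradients(w, b, X, y):
    """Return (dC/dw, dC/db) using the piecewise formulas from this section."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi) + b
        if z > 0:                         # clipped samples contribute zero
            e = z - yi                    # error term e_i
            for j, xij in enumerate(xi):
                grad_w[j] += 2.0 * e * xij / n
            grad_b += 2.0 * e / n
    return grad_w, grad_b

def train(X, y, w, b, eta=0.05, steps=2000):
    """Repeatedly step w and b against their gradients."""
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b = b - eta * grad_b
    return w, b

X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w, b = train(X, y, w=[0.5], b=0.5)
print(w, b)
```

On this toy problem the learned parameters should approach a weight of 2 and a bias of 1; a poorly chosen starting point could leave the neuron clipped on every sample, in which case both gradients are zero and nothing is learned.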

7 Summary

Hopefully you’ve made it all the way through to this point. You’re well on your way to understand-
ing matrix calculus! We’ve included a reference that summarizes all of the rules from this article
in the next section. Also check out the annotated resource link below.

Your next step would be to learn about the partial derivatives of matrices not just vectors. For
example, you can take a look at the matrix differentiation section of Matrix calculus.

Acknowledgements. We thank Yannet Interian (Faculty in MS data science program at University
of San Francisco) and David Uminsky (Faculty/director of MS data science) for their help with
the notation presented here.

8 Matrix Calculus Reference

8.1 Gradients and Jacobians

The gradient of a function of two variables is a horizontal 2-vector:

$$\nabla f(x, y) = \left[\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right]$$

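The Jacobian of a vector-valued function that is a function of a vector is an $m \times n$ ($m = |f|$ and
$n = |x|$) matrix containing all possible scalar partial derivatives:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \ldots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}}f_1(\mathbf{x}) \\ \frac{\partial}{\partial \mathbf{x}}f_2(\mathbf{x}) \\ \ldots \\ \frac{\partial}{\partial \mathbf{x}}f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial x_1}f_1(\mathbf{x}) & \frac{\partial}{\partial x_2}f_1(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n}f_1(\mathbf{x}) \\ \frac{\partial}{\partial x_1}f_2(\mathbf{x}) & \frac{\partial}{\partial x_2}f_2(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n}f_2(\mathbf{x}) \\ & \ldots & & \\ \frac{\partial}{\partial x_1}f_m(\mathbf{x}) & \frac{\partial}{\partial x_2}f_m(\mathbf{x}) & \ldots & \frac{\partial}{\partial x_n}f_m(\mathbf{x}) \end{bmatrix}$$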

The Jacobian of the identity function f (x) = x is I.

8.2 Element-wise operations on vectors

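Define generic element-wise operations on vectors w and x using operator $\bigcirc$, which stands for an operator such as +:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x}) \\ f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x}) \end{bmatrix}$$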

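The Jacobian with respect to w (similar for x) is:

$$J_{\mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial w_1}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) \\ \frac{\partial}{\partial w_1}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x})) \\ & \ldots & & \\ \frac{\partial}{\partial w_1}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \frac{\partial}{\partial w_2}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \ldots & \frac{\partial}{\partial w_n}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) \end{bmatrix}$$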

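Given the constraint (element-wise diagonal condition) that $f_i(\mathbf{w})$ and $g_i(\mathbf{x})$ access at most $w_i$ and
$x_i$, respectively, the Jacobian simplifies to a diagonal matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = diag\left(\frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)),\ \frac{\partial}{\partial w_2}(f_2(w_2) \bigcirc g_2(x_2)),\ \ldots,\ \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$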

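Here are some sample element-wise operators:

Op          Partial with respect to w                                                                                  Partial with respect to x
$+$         $\frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = I$                                          $\frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I$
$-$         $\frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{w}} = I$                                          $\frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{x}} = -I$
$\otimes$   $\frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{w}} = diag(\mathbf{x})$                     $\frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{x}} = diag(\mathbf{w})$
$\oslash$   $\frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{w}} = diag(\ldots \frac{1}{x_i} \ldots)$    $\frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{x}} = diag(\ldots \frac{-w_i}{x_i^2} \ldots)$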

8.3 Scalar expansion

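Adding scalar z to vector x, $\mathbf{y} = \mathbf{x} + z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z)$ where $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1}z$.

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} + z) = diag(\vec{1}) = I$$
$$\frac{\partial}{\partial z}(\mathbf{x} + z) = \vec{1}$$

Scalar multiplication yields:

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}z) = Iz$$
$$\frac{\partial}{\partial z}(\mathbf{x}z) = \mathbf{x}$$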

8.4 Vector reductions

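The partial derivative of a vector sum with respect to one of the vectors is:

$$\nabla_{\mathbf{x}}y = \frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n}\right] = \left[\sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_1}, \sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_2}, \ldots, \sum_i\frac{\partial f_i(\mathbf{x})}{\partial x_n}\right]$$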

For $y = sum(\mathbf{x})$:

$$\nabla_{\mathbf{x}}y = \vec{1}^T$$

For $y = sum(\mathbf{x}z)$ and $n = |x|$, we get:

$$\nabla_{\mathbf{x}}y = [z, z, \ldots, z]$$
$$\nabla_z y = sum(\mathbf{x})$$

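Vector dot product $y = \mathbf{f}(\mathbf{w}) \cdot \mathbf{g}(\mathbf{x}) = \sum_i^n (w_i x_i) = sum(\mathbf{w} \otimes \mathbf{x})$. Substituting $\mathbf{u} = \mathbf{w} \otimes \mathbf{x}$ and using
the vector chain rule, we get:

$$\frac{d\mathbf{u}}{d\mathbf{x}} = \frac{d}{d\mathbf{x}}(\mathbf{w} \otimes \mathbf{x}) = diag(\mathbf{w})$$
$$\frac{dy}{d\mathbf{u}} = \frac{d}{d\mathbf{u}}sum(\mathbf{u}) = \vec{1}^T$$
$$\frac{dy}{d\mathbf{x}} = \frac{dy}{d\mathbf{u}} \times \frac{d\mathbf{u}}{d\mathbf{x}} = \vec{1}^T \times diag(\mathbf{w}) = \mathbf{w}^T$$

Similarly, $\frac{dy}{d\mathbf{w}} = \mathbf{x}^T$.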

8.5 Chain rules

The vector chain rule is the general form as it degenerates to the others. When f is a function
of a single variable x and all intermediate variables u are functions of a single variable, the single-
variable chain rule applies. When some or all of the intermediate variables are functions of multiple
variables, the single-variable total-derivative chain rule applies. In all other cases, the vector chain
rule applies.

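Single-variable rule                             Single-variable total-derivative rule                                                                                             Vector rule
$\frac{df}{dx} = \frac{df}{du}\frac{du}{dx}$     $\frac{\partial f(u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x}$     $\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$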

9 Notation

Lowercase letters in bold font such as $\mathbf{x}$ are vectors and those in italics font like $x$ are scalars. $x_i$
is the ith element of vector $\mathbf{x}$ and is in italics because a single vector element is a scalar. $|\mathbf{x}|$ means
“length of vector $\mathbf{x}$.”

The T exponent of $\mathbf{x}^T$ represents the transpose of the indicated vector.

$\sum_{i=a}^{b} x_i$ is just a for-loop that iterates i from a to b, summing all the $x_i$.

Notation f (x) refers to a function called f with an argument of x.

I represents the square “identity matrix” of appropriate dimensions that is zero everywhere but
the diagonal, which contains all ones.

diag(x) constructs a matrix whose diagonal elements are taken from vector x.

The dot product w · x is the summation of the element-wise multiplication of the elements:
\sum_i^n (w_i x_i) = sum(w ⊗ x). Or, you can look at it as w^T x.

Differentiation \frac{d}{dx} is an operator that maps a function of one parameter to another function. That
means that \frac{d}{dx} f(x) maps f(x) to its derivative with respect to x, which is the same thing as
\frac{df(x)}{dx}. Also, if y = f(x), then \frac{dy}{dx} = \frac{df(x)}{dx} = \frac{d}{dx} f(x).


The partial derivative of the function with respect to x, \frac{\partial}{\partial x} f(x), performs the usual
scalar derivative holding all other variables constant.

The gradient of f with respect to vector x, ∇f (x), organizes all of the partial derivatives for a
specific scalar function.

The Jacobian organizes the gradients of multiple functions into a matrix by stacking them:

\[ J = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \end{bmatrix} \]
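
To make the stacking concrete, here is a minimal sketch under assumed example functions f_1(x) = x_1 x_2 and f_2(x) = x_1 + x_2^2 (chosen for illustration, not taken from the article). Each row of the Jacobian is the gradient of one function, and a finite-difference Jacobian serves as a check.

```python
def f(x):
    """Two scalar functions of x = [x1, x2], stacked into a vector."""
    return [x[0] * x[1], x[0] + x[1] ** 2]


def analytic_jacobian(x):
    """Rows are the gradients of f1 and f2 (numerator layout)."""
    return [
        [x[1], x[0]],      # gradient of f1 = x1 * x2
        [1.0, 2 * x[1]],   # gradient of f2 = x1 + x2^2
    ]


def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian: entry (i, j) approximates d f_i / d x_j."""
    m = len(func(x))
    rows = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            rows[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return rows


x0 = [3.0, 2.0]  # arbitrary evaluation point
J_exact = analytic_jacobian(x0)
J_approx = numerical_jacobian(f, x0)
```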
The following notation means that y has the value a upon condition_1 and value b upon condition_2.

\[ y = \begin{cases} a & \text{condition}_1 \\ b & \text{condition}_2 \end{cases} \]

10 Resources

Wolfram Alpha can do symbolic matrix algebra and there is also a cool dedicated matrix calculus
differentiator.

When looking for resources on the web, search for “matrix calculus” not “vector calculus.” Here
are some comments on the top links that come up from a Google search:

• https://en.wikipedia.org/wiki/Matrix_calculus The Wikipedia entry is actually quite good
and they have a good description of the different layout conventions. Recall that we use the
numerator layout where the variables go horizontally and the functions go vertically in the
Jacobian. Wikipedia also has a good description of total derivatives, but be careful that they
use slightly different notation than we do. We always use the ∂x notation not dx.

• http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html This page has a section on matrix
differentiation with some useful identities; this person uses numerator layout. This might
be a good place to start after reading this article to learn about matrix versus vector
differentiation.

• https://www.colorado.edu/engineering/CAS/courses.d/IFEM.d/IFEM.AppC.d/IFEM.AppC.pdf
This is part of the course notes for “Introduction to Finite Element Methods” I believe by Carlos
A. Felippa. His Jacobians are transposed from our notation because he uses denominator
layout.

• http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html This page has a huge number of
useful derivatives computed for a variety of vectors and matrices. A great cheat sheet. There
is no discussion to speak of, just a set of rules.

• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf Another cheat sheet that
focuses on matrix operations in general with more discussion than the previous item.

• https://www.comp.nus.edu.sg/~cs5240/lecture/matrix-differentiation.pdf A useful set of slides.

To learn more about neural networks and the mathematics behind optimization and back propa-
gation, we highly recommend Michael Nielsen’s book.

For those interested specifically in convolutional neural networks, check out A guide to convolution
arithmetic for deep learning.

We reference the law of total derivative, which is an important concept that just means derivatives
with respect to x must take into consideration the derivative with respect to x of all variables that
are a function of x.

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov∗1, Dmitrii Podoprikhin∗2,3, Timur Garipov∗4,5, Dmitry Vetrov2,3, Andrew Gordon Wilson1
1 Cornell University, 2 Higher School of Economics, 3 Samsung-HSE Laboratory,
4 Samsung AI Center in Moscow, 5 Lomonosov Moscow State University
∗ Equal contribution.
arXiv:1803.05407v3 [cs.LG] 25 Feb 2019

Abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant,
in conjunction with a decaying learning rate, until convergence. We show that simple averaging
of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads
to better generalization than conventional training. We also show that this Stochastic Weight
Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent
Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable
improvement in test accuracy over conventional SGD training on a range of state-of-the-art
residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100,
and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has
almost no computational overhead.

1 INTRODUCTION

With a better understanding of the loss surfaces for multilayer networks, we can accelerate the
convergence, stability, and accuracy of training procedures in deep learning. Recent work
[Garipov et al., 2018, Draxler et al., 2018] shows that local optima found by SGD can be
connected by simple curves of near constant loss. Building upon this insight, Garipov et al.
[2018] also developed Fast Geometric Ensembling (FGE) to sample multiple nearby points in
weight space to create high performing ensembles in the time required to train a single DNN.

FGE uses a high frequency cyclical learning rate with SGD to select networks to ensemble. In
Figure 1 (left) we see that the weights of the networks ensembled by FGE are on the periphery
of the most desirable solutions. This observation suggests it is promising to average these
points in weight space, and use a network with these averaged weights, instead of forming an
ensemble by averaging the outputs of networks in model space. Although the general idea of
maintaining a running average of weights traversed by SGD dates back to Ruppert [1988], this
procedure is not typically used to train neural networks. It is sometimes applied as an
exponentially decaying running average in combination with a decaying learning rate (where it
is called an exponential moving average), which smooths the trajectory of conventional SGD but
does not perform very differently. However, we show that an equally weighted average of the
points traversed by SGD with a cyclical or high constant learning rate, which we refer to as
Stochastic Weight Averaging (SWA), has many surprising and promising features for training deep
neural networks, leading to a better understanding of the geometry of their loss surfaces.
Indeed, SWA with cyclical or constant learning rates can be used as a drop-in replacement for
standard SGD training of multilayer networks, but with improved generalization and essentially
no overhead. In particular:

• We show that SGD with cyclical [e.g., Loshchilov and Hutter, 2017] and constant learning
rates traverses regions of weight space corresponding to high-performing networks. We find
that while these models are moving around this optimal set they never reach its central
points. We show that we can move into this more desirable space of points by averaging the
weights proposed over SGD iterations.

• While FGE ensembles [Garipov et al., 2018] can be trained in the same time as a single
model, test predictions for an ensemble of k models require k times more computation. We show
that SWA can be interpreted as an approximation to FGE ensembles but with the test-time,
convenience, and interpretability of a single model.

• We demonstrate that SWA leads to solutions that are wider than the optima found by SGD.
Keskar et al. [2017] and Hochreiter and Schmidhuber [1997] conjecture that the width of the
optima is critically related to generalization. We illustrate that the train loss is shifted
with respect to the test error (Figure 1, middle and right panels, and sections 3, 4). We show
that SGD generally converges to a point near the boundary of the wide flat region of optimal
points. SWA on the other hand is able to find a point centered in this region, often with
slightly worse train loss but with substantially better test error.

• We show that the loss function is asymmetric in the direction connecting SWA with SGD. In
this direction, SGD is near the periphery of sharp ascent. Part of the reason SWA improves
generalization is that it finds solutions in flat regions of the training loss in such
directions.

• SWA achieves notable improvement for training a broad range of architectures over several
consequential benchmarks. In particular, running SWA for just 10 epochs on ImageNet we are
able to achieve 0.8% improvement for ResNet-50 and DenseNet-161, and 0.6% improvement for
ResNet-150. We achieve improvement of over 1.3% on CIFAR-100 and of over 0.4% on CIFAR-10
with Preactivation ResNet-164, VGG-16 and Wide ResNet-28-10. We also achieve substantial
improvement for the recent Shake-Shake Networks and PyramidNets.

• SWA is extremely easy to implement and has virtually no computational overhead compared to
the conventional training schemes.

• We provide an implementation of SWA at https://github.com/timgaripov/swa.

We emphasize that SWA is finding a solution in the same basin of attraction as SGD, as can be
seen in Figure 1, but in a flatter region of the training loss. SGD typically finds points on
the periphery of a set of good weights. By running SGD with a cyclical or high constant
learning rate, we traverse the surface of this set of points, and by averaging we find a more
centred solution in a flatter region of the training loss. Further, the training loss for SWA
is often slightly worse than for SGD, suggesting that the SWA solution is not a local optimum
of the loss. In the title of this paper, optima is used in a general sense to mean solutions
(converged points of a given procedure), rather than different local minima of the same
objective.

[Figure 1 image: three contour plots. Left: test error (%) with points W1, W2, W3 and WSWA.
Middle: test error (%) with WSGD and WSWA, starting from epoch 125. Right: train loss with
WSGD and WSWA, starting from epoch 125.]

Figure 1: Illustrations of SWA and SGD with a Preactivation ResNet-164 on CIFAR-100¹. Left:
test error surface for three FGE samples and the corresponding SWA solution (averaging in
weight space). Middle and Right: test error and train loss surfaces showing the weights
proposed by SGD (at convergence) and SWA, starting from the same initialization of SGD after
125 training epochs.

¹ Suppose we have three weight vectors w_1, w_2, w_3. We set u = (w_2 − w_1),
v = (w_3 − w_1) − ⟨w_3 − w_1, w_2 − w_1⟩ / ‖w_2 − w_1‖² · (w_2 − w_1). Then the normalized
vectors û = u/‖u‖, v̂ = v/‖v‖ form an orthonormal basis in the plane containing w_1, w_2, w_3.
To visualize the loss in this plane, we define a Cartesian grid in the basis û, v̂ and evaluate
the networks corresponding to each of the points in the grid. A point P with coordinates (x, y)
in the plane would then be given by P = w_1 + x · û + y · v̂.

2 RELATED WORK

This paper is fundamentally about better understanding the geometry of loss surfaces and
generalization in deep learning. We follow the trajectory of weights traversed by SGD, leading
to new geometric insights and the intuition that SWA will lead to better results than standard
training. Empirically, we make the discovery that SWA notably improves training of many
state-of-the-art deep neural networks over a range of consequential benchmarks, with
essentially no overhead.

The procedures for training neural networks are constantly being improved. New methods are
being proposed for architecture design, regularization and optimization. The SWA approach is
related to work in both optimization and regularization.

In optimization, there is great interest in how different types of local solutions affect
generalization in deep learning. Keskar et al. [2017] claim that SGD is more likely to converge
to broad local optima than batch gradient methods, which tend to converge to sharp optima.
Moreover, they argue that the broad optima found by SGD are more likely to have good test
performance, even if the training loss is worse than for the sharp optima. On the other hand
Dinh et al. [2017] argue that all the known definitions of sharpness are unsatisfactory and
cannot on their own explain generalization. Chaudhari et al. [2017] propose the Entropy-SGD
method that explicitly forces optimization towards wide valleys. They report that although the
optima found by Entropy-SGD are wider than those found by conventional SGD, the generalization
performance is still comparable.

The SWA method is based on averaging multiple points along the trajectory of SGD with cyclical
or constant learning rates. The general idea of maintaining a running average of weights
proposed by SGD was first considered in convex optimization by Ruppert [1988] and later by
Polyak and Juditsky [1992]. However, this procedure is not typically used to train neural
networks. Practitioners instead sometimes use an exponentially decaying running average of the
weights found by SGD with a decaying learning rate, which smooths the trajectory of SGD but
performs comparably.

SWA is making use of multiple samples gathered through exploration of the set of points
corresponding to high performing networks. To enforce exploration we run SGD with constant or
cyclical learning rates. Mandt et al. [2017] show that under several simplifying assumptions
running SGD with a constant learning rate is equivalent to sampling from a Gaussian
distribution centered at the minimum of the loss, and the covariance of this Gaussian is
controlled by the learning rate. Following this explanation from [Mandt et al., 2017], we can
interpret points proposed by SGD as being constrained to the surface of a sphere, since they
come from a high dimensional Gaussian distribution. SWA effectively allows us to go inside the
sphere to find higher density solutions.

In a procedure called Fast Geometric Ensembling (FGE), Garipov et al. [2018] showed that using
a cyclical learning rate it is possible to gather models that are spatially close to each other
but produce diverse predictions. They used the gathered models to train ensembles with no
computational overhead compared to training a single DNN model. In recent work Neklyudov et al.
[2018] also discuss an efficient approach for model averaging of Bayesian neural networks. SWA
was inspired by following the trajectories of FGE proposals, in order to find a single model
that would approximate an FGE ensemble, but provide greater interpretability, convenience, and
test-time scalability.

Dropout [Srivastava et al., 2014] is an extremely popular approach to regularizing DNNs. Across
each mini-batch used for SGD, a different architecture is created by randomly dropping out
neurons. The authors make analogies between dropout, ensembling, and Bayesian model averaging.
At test time, an ensemble approach is proposed, but then approximated with similar results by
multiplying each connection by the dropout rate. At a high level, SWA and Dropout are both at
once regularizers and training procedures, motivated to approximate an ensemble. Each approach
implements these high level ideas quite differently, and as we show in our experiments, can be
combined for improved performance.

3 STOCHASTIC WEIGHT AVERAGING

We present Stochastic Weight Averaging (SWA) and analyze its properties. In section 3.1, we
consider trajectories of SGD with a constant and cyclical learning rate, which helps understand
the geometry of SGD training for neural networks, and motivates the SWA procedure. Then in
section 3.2 we present the SWA algorithm in detail, in section 3.3 we derive its complexity,
and in section 3.4 we analyze the width of solutions found by SWA versus conventional SGD
training. In section 3.5 we then examine the relationship between SWA and the recently proposed
Fast Geometric Ensembling [Garipov et al., 2018]. Finally, in section 3.6 we consider SWA from
the perspective of stochastic convex optimization.

We note the name SWA has two meanings: on the one hand, it is an average of SGD weights. On the
other, with a cyclical or constant learning rate, SGD proposals are approximately sampling from
the loss surface of the DNN, leading to stochastic weights.

3.1 ANALYSIS OF SGD TRAJECTORIES

SWA is based on averaging the samples proposed by SGD using a learning rate schedule that
allows exploration of the region of weight space corresponding to high-performing networks. In
particular we consider cyclical and constant learning rate schedules.

The cyclical learning rate schedule that we adopt is inspired by Garipov et al. [2018] and
Smith and Topin [2017]. In each cycle we linearly decrease the learning rate from α_1 to α_2.
The formula for the learning rate at iteration i is given by

\[ \alpha(i) = (1 - t(i))\,\alpha_1 + t(i)\,\alpha_2, \]

\[ t(i) = \frac{1}{c} \left( \mathrm{mod}(i - 1, c) + 1 \right). \]
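
The schedule above translates almost directly into code. The following is a minimal sketch; the function name `cyclical_lr` and the numeric values of α_1, α_2 and c are illustrative assumptions and are not taken from the paper.

```python
def cyclical_lr(i, alpha1, alpha2, c):
    """Learning rate at 1-based iteration i for cycle length c.

    t(i) runs over 1/c, 2/c, ..., 1 within each cycle, so the rate decays
    linearly toward alpha2 and then jumps back up at the start of the next cycle.
    """
    t = ((i - 1) % c + 1) / c
    return (1 - t) * alpha1 + t * alpha2


# Illustrative values only (assumed, not from the paper).
alpha1, alpha2, c = 0.05, 0.01, 4
schedule = [cyclical_lr(i, alpha1, alpha2, c) for i in range(1, 2 * c + 1)]
```

Note that, with the formula as written, the first iteration of a cycle uses t = 1/c, so the rate starts slightly below α_1 and reaches α_2 exactly on the last iteration of each cycle.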
[Figure 2 image: two stacked plots against iteration number (0, 1c, 2c, 3c, 4c). Top: learning
rate, varying between α_1 and α_2. Bottom: test error (%).]

Figure 2: Top: cyclical learning rate as a function of iteration. Bottom: test error as a
function of iteration for cyclical learning rate schedule with Preactivation-ResNet-164 on
CIFAR-100. Circles indicate iterations corresponding to the minimum learning rates.

The base learning rates α_1 ≥ α_2 and the cycle length c are the hyper-parameters of the
method. Here by iteration we assume the processing of one batch of data. Figure 2 illustrates
the cyclical learning rate schedule and the test error of the corresponding points. Note that
unlike the cyclical learning rate schedule of Garipov et al. [2018] and Smith and Topin [2017],
here we propose to use a discontinuous schedule that jumps directly from the minimum to maximum
learning rates, and does not steadily increase the learning rate as part of the cycle. We use
this more abrupt cycle because for our purposes exploration is more important than the accuracy
of individual proposals. For even greater exploration, we also consider constant learning rates
α(i) = α_1.

We run SGD with cyclical and constant learning rate schedules starting from a pretrained point
for a Preactivation ResNet-164 on CIFAR-100. We then use the first, middle and last point of
each of the trajectories to define a 2-dimensional plane in the weight space containing all
affine combinations of these points. In Figure 3 we plot the loss on train and error on test
for points in these planes. We then project the other points of the trajectory to the plane of
the plot. Note that the trajectories do not generally lie in the plane of the plot, except for
the first, last and middle points, shown by black crosses in the figure. Therefore for other
points of the trajectories it is not possible to tell the value of train loss and test error
from the plots.

The key insight from Figure 3 is that both methods explore points close to the periphery of the
set of high-performing networks. The visualizations suggest that both methods are doing
exploration in the region of space corresponding to DNNs with high accuracy. The main
difference between the two approaches is that the individual proposals of SGD with a cyclical
learning rate schedule are in general much more accurate than the proposals of a fixed-learning
rate SGD. After making a large step, SGD with a cyclical learning rate spends several epochs
fine-tuning the resulting point with a decreasing learning rate. SGD with a fixed learning rate
on the other hand is always making steps of relatively large sizes, exploring more efficiently
than with a cyclical learning rate, but the individual proposals are worse.

Another important insight we can get from Figure 3 is that while the train loss and test error
surfaces are qualitatively similar, they are not perfectly aligned. The shift between train and
test suggests that more robust central points in the set of high-performing networks can lead
to better generalization. Indeed, if we average several proposals from the optimization
trajectories, we get a more robust point that has a substantially higher test performance than
the individual proposals of SGD, and is essentially centered on the shifted mode for test
error. We further discuss the reasons for this behaviour in sections 3.4, 3.5, 3.6.

3.2 SWA ALGORITHM

We now present the details of the Stochastic Weight Averaging algorithm, a simple but effective
modification for training neural networks, motivated by our observations in section 3.1.

Following Garipov et al. [2018], we start with a pretrained model ŵ. We will refer to the
number of epochs required to train a given DNN with the conventional training procedure as its
training budget and will denote it by B. The pretrained model ŵ can be trained with the
conventional training procedure for the full training budget or a reduced number of epochs
(e.g. 0.75B). In the latter case we just stop the training early without modifying the learning
rate schedule. Starting from ŵ we continue training, using a cyclical or constant learning rate
schedule. When using a cyclical learning rate we capture the models w_i that correspond to the
minimum values of the learning rate (see Figure 2), following Garipov et al. [2018]. For
constant learning rates we capture models at each epoch. Next, we average the weights of all
the captured networks w_i to get our final model w_SWA.

Note that for a cyclical learning rate schedule, the SWA algorithm is related to FGE [Garipov
et al., 2018], except that instead of averaging the predictions of the models, we average their
weights, and we use a different type of learning rate cycle. In section 3.5 we show how SWA can
approximate FGE, but with a single model.
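
The procedure described in this section can be sketched on a toy problem. The code below is an illustrative sketch only, not the authors' implementation (which is linked in the introduction). It assumes weights stored as plain Python lists, a hypothetical `grad_fn` callback that returns a stochastic gradient, and a noisy quadratic loss standing in for a DNN. It uses the running-average update from Algorithm 1 of the paper, captures a model at the end of every cycle, and omits the BatchNorm statistics pass.

```python
import random


def cyclical_lr(i, alpha1, alpha2, c):
    """Learning rate at 1-based iteration i: linear decay toward alpha2 in each cycle."""
    t = ((i - 1) % c + 1) / c
    return (1 - t) * alpha1 + t * alpha2


def swa(w_hat, grad_fn, alpha1, alpha2, c, n):
    """Run n SGD steps from w_hat and average the weights captured at the end of each cycle.

    grad_fn(w) is a hypothetical callback returning a stochastic gradient as a list.
    Returns (last SGD iterate, averaged weights).
    """
    w = list(w_hat)
    w_swa = list(w)  # the average starts from the pretrained weights
    for i in range(1, n + 1):
        alpha = cyclical_lr(i, alpha1, alpha2, c)
        g = grad_fn(w)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
        if i % c == 0:  # learning rate is at its minimum: capture this model
            n_models = i // c
            w_swa = [(sj * n_models + wj) / (n_models + 1) for sj, wj in zip(w_swa, w)]
    return w, w_swa


def noisy_quadratic_grad(rng, noise=1.0):
    """Gradient of the toy loss 0.5 * ||w||^2 plus Gaussian noise (stand-in for minibatch noise)."""
    return lambda w: [wj + rng.gauss(0.0, noise) for wj in w]


# Toy run with assumed hyper-parameters; alpha1 == alpha2 gives a constant learning rate.
rng = random.Random(0)
w_last, w_avg = swa([1.0, -2.0], noisy_quadratic_grad(rng), alpha1=0.1, alpha2=0.1, c=5, n=500)
```

In this sketch a constant learning rate is obtained by passing the same value for both bounds, since with c = 1 the schedule formula always returns α_2. On a toy problem like this one would typically expect `w_avg` to lie closer to the minimum at the origin than `w_last`, although any single run can vary.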
[Figure 3 image: four contour plots. Left two: train loss and test error (%) for the cyclical
learning rate schedule. Right two: train loss and test error (%) for the constant learning rate
schedule.]

Figure 3: The L2-regularized cross-entropy train loss and test error surfaces of a
Preactivation ResNet-164 on CIFAR-100 in the plane containing the first, middle and last points
(indicated by black crosses) in the trajectories with (left two) cyclical and (right two)
constant learning rate schedules.
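
The planes shown in Figures 1 and 3 are spanned by three weight vectors, using the construction given in footnote 1. A minimal sketch of that construction follows; the helper names (`plane_basis`, `point_in_plane`, `project`) and the small example vectors are assumptions for illustration, and the three points are assumed not to be collinear.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def norm(a):
    return math.sqrt(dot(a, a))


def plane_basis(w1, w2, w3):
    """Orthonormal basis (u_hat, v_hat) of the plane through w1, w2, w3.

    u = w2 - w1, and v is w3 - w1 with its component along u removed.
    Assumes the three points are not collinear (otherwise v is zero).
    """
    u = sub(w2, w1)
    d = sub(w3, w1)
    coef = dot(d, u) / dot(u, u)
    v = [di - coef * ui for di, ui in zip(d, u)]
    u_hat = [ui / norm(u) for ui in u]
    v_hat = [vi / norm(v) for vi in v]
    return u_hat, v_hat


def point_in_plane(w1, u_hat, v_hat, x, y):
    """Weights P = w1 + x * u_hat + y * v_hat for plane coordinates (x, y)."""
    return [a + x * b + y * c for a, b, c in zip(w1, u_hat, v_hat)]


def project(w, w1, u_hat, v_hat):
    """Plane coordinates of the orthogonal projection of w onto the plane."""
    d = sub(w, w1)
    return dot(d, u_hat), dot(d, v_hat)


# Tiny three-dimensional example (real weight vectors have millions of entries).
w1, w2, w3 = [0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 3.0, 0.0]
u_hat, v_hat = plane_basis(w1, w2, w3)
```

To draw a surface like those in the figures, one would evaluate the network at `point_in_plane(...)` for every (x, y) on a grid; `project` gives the plotted position of trajectory points that do not lie in the plane.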

Algorithm 1 Stochastic Weight Averaging
Require: weights ŵ, LR bounds α_1, α_2, cycle length c (for constant learning rate c = 1),
         number of iterations n
Ensure: w_SWA
  w ← ŵ {Initialize weights with ŵ}
  w_SWA ← w
  for i ← 1, 2, . . . , n do
    α ← α(i) {Calculate LR for the iteration}
    w ← w − α∇L_i(w) {Stochastic gradient update}
    if mod(i, c) = 0 then
      n_models ← i/c {Number of models}
      w_SWA ← (w_SWA · n_models + w) / (n_models + 1) {Update average}
    end if
  end for
  {Compute BatchNorm statistics for w_SWA weights}

Batch normalization. If the DNN uses batch normalization [Ioffe and Szegedy, 2015], we run one
additional pass over the data, as in Garipov et al. [2018], to compute the running mean and
standard deviation of the activations for each layer of the network with w_SWA weights after
the training is finished, since these statistics are not collected during training. For most
deep learning libraries, such as PyTorch or Tensorflow, one can typically collect these
statistics by making a forward pass over the data in training mode.

The SWA procedure is summarized in Algorithm 1.

3.3 COMPUTATIONAL COMPLEXITY

The time and memory overhead of SWA compared to conventional training is negligible. During
training, we need to maintain a copy of the running average of DNN weights. Note however that
the memory consumption in storing a DNN is dominated by its activations rather than its
weights, and thus is only slightly increased by the SWA procedure, even for large DNNs (e.g.,
on the order of 10%). After the training is complete we only need to store the model that
aggregates the average, leading to the same memory requirements as standard training.

During training extra time is only spent to update the aggregated weight average. This
operation is of the form

\[ w_{\mathrm{SWA}} \leftarrow \frac{w_{\mathrm{SWA}} \cdot n_{\mathrm{models}} + w}{n_{\mathrm{models}} + 1}, \]

and it only requires computing a weighted sum of the weights of two DNNs. As we apply this
operation at most once per epoch, SWA and SGD require practically the same amount of
computation. Indeed, a similar operation is performed as a part of each gradient step, and each
epoch consists of hundreds of gradient steps.

3.4 SOLUTION WIDTH

Keskar et al. [2017] and Chaudhari et al. [2017] conjecture that the width of a local optimum
is related to generalization. The general explanation for the importance of width is that the
surfaces of train loss and test error are shifted with respect to each other and it is thus
desirable to converge to the modes of broad optima, which stay approximately optimal under
small perturbations. In this section we compare the solutions found by SWA and SGD and show
that SWA generally leads to much wider solutions.

Let w_SWA and w_SGD denote the weights of DNNs trained using SWA and conventional SGD,
respectively. Consider the rays

\[ w_{\mathrm{SWA}}(t, d) = w_{\mathrm{SWA}} + t \cdot d, \]

\[ w_{\mathrm{SGD}}(t, d) = w_{\mathrm{SGD}} + t \cdot d, \]

which follow a direction vector d on the unit sphere, starting at w_SWA and w_SGD,
respectively. In Figure 4 we plot train loss and test error of w_SWA(t, d_i) and w_SGD(t, d_i)
as a function of t for 10 random directions d_i, i = 1, 2, . . . , 10 drawn from a uniform
distribution on the unit sphere. For this visualization we use a Preactivation ResNet-164 on
CIFAR-100.

[Figure 4 image: two line plots against distance t from 0 to 20. Left: test error (%). Right:
train loss.]

Figure 4: (Left) Test error and (Right) L2-regularized cross-entropy train loss as a function
of a point on a random ray starting at SWA (blue) and SGD (green) solutions for Preactivation
ResNet-164 on CIFAR-100. Each line corresponds to a different random ray.

First, while the loss values on train for w_SGD and w_SWA are quite similar (and in fact w_SGD
has a slightly lower train loss), the test error for w_SWA is lower by 1.5% (at the converged
value corresponding to t = 0). Further, the shapes of both train loss and test error curves are
considerably wider for w_SWA than for w_SGD, suggesting that SWA indeed converges to a wider
solution: we have to step much further away from w_SWA to increase error by a given amount. We
even see the error curve for SGD has an inflection point that is not present for these
distances with SWA.

Notice that in Figure 4 any of the random directions from w_SGD increase test error. However,
we know that the direction from w_SGD to w_SWA would decrease test error, since w_SWA has
considerably lower test error than w_SGD. In other words, the path from w_SGD to w_SWA is
qualitatively different from all directions shown in Figure 4, because along this direction
w_SGD is far from optimal. We therefore consider the line segment connecting w_SGD and w_SWA:

\[ w(t) = t \cdot w_{\mathrm{SGD}} + (1 - t) \cdot w_{\mathrm{SWA}}. \]

In Figure 5 we plot the train loss and test error of w(t) as a function of signed distance from
w_SWA for Preactivation ResNet-164 and VGG-16 on CIFAR-100.

[Figure 5 image: two plots of test error (%) and train loss against distance, each with the SWA
and SGD solutions marked. Left: Preactivation ResNet-164. Right: VGG-16.]

Figure 5: L2-regularized cross-entropy train loss and test error as a function of a point on
the line connecting SWA and SGD solutions on CIFAR-100. Left: Preactivation ResNet-164. Right:
VGG-16.

We can extract several key insights about w_SWA and w_SGD from Figure 5. First, the train loss
and test error plots are indeed substantially shifted, and the point obtained by minimizing the
train loss is far from optimal on test. Second, w_SGD lies near the boundary of a wide flat
region of the train loss. Further, the loss is very steep near w_SGD.

Keskar et al. [2017] argue that the loss near sharp optima found by SGD with very large batches
is actually flat in most directions, but there exist directions in which the optima are
extremely steep. They conjecture that because of this sharpness the generalization performance
of large batch optimization is substantially worse than that of solutions found by small batch
SGD. Remarkably, in our experiments in this section we observe that there exist directions of
steep ascent even for small batch optima, and that SWA provides even wider solutions (at least
along random directions) with better generalization. Indeed, we can see clearly in Figure 5
that SWA is not finding a different minimum than SGD, but rather a flatter region in the same
basin of attraction. We can also see clearly that the significant asymmetry of the loss
function in certain directions, such as the direction SWA to SGD, has a role in understanding
why SWA provides better generalization than SGD. In these directions SWA finds a much flatter
solution than SGD, which can be near the periphery of sharp ascent.

3.5 CONNECTION TO ENSEMBLING

Garipov et al. [2018] proposed the Fast Geometric Ensembling (FGE) procedure for training
ensembles in the
time required to train a single model. Using a cyclical learning rate, FGE generates a sequence
of points that are close to each other in the weight space, but produce diverse predictions. In
SWA instead of averaging the predictions of the models we average their weights. However, the
predictions proposed by FGE ensembles and SWA models have similar properties.

Let f(·) denote the predictions of a neural network parametrized by weights w. We will assume
that f is a scalar (e.g. the probability for a particular class) twice continuously
differentiable function with respect to w.

Consider points w_i proposed by FGE. These points are close in the weight space by design, and
concentrated around their average w_SWA = \frac{1}{n} \sum_{i=1}^{n} w_i. We denote
∆_i = w_i − w_SWA. Note \sum_{i=1}^{n} ∆_i = 0. Ensembling the networks corresponds to
averaging the function values

\[ \bar{f} = \frac{1}{n} \sum_{i=1}^{n} f(w_i). \]

Consider the linearization of f at w_SWA.

\[ f(w_j) = f(w_{\mathrm{SWA}}) + \langle \nabla f(w_{\mathrm{SWA}}), \Delta_j \rangle + O(\|\Delta_j\|^2), \]

where ⟨·, ·⟩ denotes the dot product. Thus, the difference between averaging the weights and
averaging the predictions

\[ \bar{f} - f(w_{\mathrm{SWA}}) = \frac{1}{n} \sum_{i=1}^{n} \left( \langle \nabla f(w_{\mathrm{SWA}}), \Delta_i \rangle + O(\|\Delta_i\|^2) \right)
   = \left\langle \nabla f(w_{\mathrm{SWA}}), \frac{1}{n} \sum_{i=1}^{n} \Delta_i \right\rangle + O(\Delta^2) = O(\Delta^2), \]

where ∆ = \max_{i=1}^{n} ‖∆_i‖. Note that the difference between the predictions of different
perturbed networks is

\[ f(w_i) - f(w_j) = \langle \nabla f(w_{\mathrm{SWA}}), \Delta_i - \Delta_j \rangle + O(\Delta^2), \]

and is thus of the first order of smallness, while the difference between averaging predictions
and averaging weights is of the second order of smallness. Note that for the points proposed by
FGE the distances between proposals are relatively small by design, which justifies the local
analysis.

To analyze the difference between ensembling and averaging the weights of FGE proposals in
practice, we run FGE for 20 epochs and compare the predictions of different models on the test
dataset with a Preactivation ResNet-164 [He et al., 2016] on CIFAR-100. The norm of the
difference between the class probabilities of consecutive FGE proposals averaged over the test
dataset is 0.126. We then average the weights of the proposals and compute the class
probabilities on the test dataset. The norm of the difference of the probabilities for the SWA
model and the FGE ensemble is 0.079, which is substantially smaller than the difference between
the probabilities of consecutive FGE proposals. Further, the fraction of objects for which
consecutive FGE proposals output the same labels is not greater than 87.33%. For FGE and SWA
the fraction of identically labeled objects is 95.26%.

The theoretical considerations and empirical results presented in this section suggest that SWA
can approximate the FGE ensemble with a single model.

3.6 CONNECTION TO CONVEX MINIMIZATION

Mandt et al. [2017] showed that under strong simplifying assumptions SGD with a fixed learning
rate approximately samples from a Gaussian distribution centered at the minimum of the loss.
Suppose this is the case when we run SGD with a fixed learning rate for training a DNN.

Let us denote the dimensionality of the weight space of the neural network by d. Denote the
samples produced by SGD by w_i, i = 1, 2, . . . , k. Assume the points w_i are concentrated
around the local optimum ŵ. The SWA solution is given by w_SWA = \frac{1}{k} \sum_{i=1}^{k} w_i.
The points w_i are samples from a multidimensional Gaussian N(ŵ, Σ) for some covariance matrix
Σ defined by the curvature of the loss, batch size and the learning rate. Note that the samples
from a multidimensional Gaussian are concentrated on the ellipsoid

\[ \left\{ z \in \mathbb{R}^d \;\middle|\; \|\Sigma^{-\frac{1}{2}} (z - \hat{w})\| = \sqrt{d} \right\}, \]

and the probability mass for a sample to end up inside the ellipsoid near ŵ is negligible. On
the other hand, w_SWA is guaranteed to converge to ŵ as k → ∞.

Moreover, Polyak and Juditsky [1992] showed that averaging SGD proposals achieves the best
possible convergence rate among all stochastic gradient algorithms. The proof relies on the
convexity of the underlying problem and in general there are no convergence guarantees if the
loss function is non-convex [see e.g. Ghadimi and Lan, 2013]. While DNN loss functions are
known to be non-convex [e.g. Choromanska et al., 2015], over the trajectory of SGD these loss
surfaces are approximately convex [e.g. Goodfellow et al., 2015]. However, even when the loss
is locally non-convex, SWA can improve generalization. For example, in Figure 5 we see that SWA
converges to a central point of the training loss.

In other words, there is a set of points that all achieve low training loss. By running SGD
with a high constant
Table 1: Accuracies (%) of SWA, SGD and FGE methods on CIFAR-100 and CIFAR-10 datasets for different training budgets. Accuracies for the FGE ensemble are from Garipov et al. [2018].

DNN (Budget)              SGD            FGE (1 Budget)   SWA (1 Budget)   SWA (1.25 Budgets)   SWA (1.5 Budgets)

CIFAR-100
VGG-16 (200)              72.55 ± 0.10   74.26            73.91 ± 0.12     74.17 ± 0.15         74.27 ± 0.25
ResNet-164 (150)          78.49 ± 0.36   79.84            79.77 ± 0.17     80.18 ± 0.23         80.35 ± 0.16
WRN-28-10 (200)           80.82 ± 0.23   82.27            81.46 ± 0.23     81.91 ± 0.27         82.15 ± 0.27
PyramidNet-272 (300)      83.41 ± 0.21   –                –                83.93 ± 0.18         84.16 ± 0.15

CIFAR-10
VGG-16 (200)              93.25 ± 0.16   93.52            93.59 ± 0.16     93.70 ± 0.22         93.64 ± 0.18
ResNet-164 (150)          95.28 ± 0.10   95.45            95.56 ± 0.11     95.77 ± 0.04         95.83 ± 0.03
WRN-28-10 (200)           96.18 ± 0.11   96.36            96.45 ± 0.11     96.64 ± 0.08         96.79 ± 0.05
ShakeShake-2x64d (1800)   96.93 ± 0.10   –                –                97.16 ± 0.10         97.12 ± 0.06

or cyclical schedule, we traverse over the surface of this set. Then by averaging the corresponding iterates, we get to move inside the set. This observation explains both convergence rates and generalization. In deep learning we mostly observe benefits in generalization from averaging. Averaging can move to a more central point, which means one has to move further from this point to increase the loss by a given amount, in virtually any direction. By contrast, conventional SGD with a decaying schedule will converge to a point on the periphery of this set. With different initializations conventional SGD will find different points on the boundary of the set of solutions with low training loss, but it will not move inside.

4 EXPERIMENTS

We compare SWA against conventional SGD training on CIFAR-10, CIFAR-100 and ImageNet ILSVRC-2012 [Russakovsky et al., 2012]. We also compare to Fast Geometric Ensembling (FGE) [Garipov et al., 2018], but we note that FGE is an ensemble whereas SWA corresponds to a single model. Conventional SGD training uses a standard decaying learning rate schedule (details in the Appendix) until convergence. We found an exponentially decaying average of SGD to perform comparably to conventional SGD at convergence. We release the code for reproducing the results in this paper at https://github.com/timgaripov/swa.

4.1 CIFAR DATASETS

For the experiments on CIFAR datasets we use VGG-16 [Simonyan and Zisserman, 2014], a 164-layer Preactivation-ResNet [He et al., 2016] and Wide ResNet-28-10 [Zagoruyko and Komodakis, 2016] models. Additionally, we experiment with the recent Shake-Shake-2x64d [Gastaldi, 2017] on CIFAR-10 and PyramidNet-272 (bottleneck, α = 200) [Han et al., 2016] on CIFAR-100. All models are trained using L2-regularization, and VGG-16 also uses dropout.

For each model we define budget as the number of epochs required to train the model until convergence with conventional SGD training, such that we do not see improvement with SGD beyond this budget. We use the same budgets for VGG, Preactivation ResNet and Wide ResNet models as Garipov et al. [2018]. For Shake-Shake and PyramidNets we use the budgets indicated by the papers that proposed these models [Gastaldi, 2017, Han et al., 2016]. We report the results of SWA training within 1, 1.25 and 1.5 budgets of epochs.

For VGG, Wide ResNet and Preactivation-ResNet models we first run standard SGD training for ≈ 75% of the training budget, and then use the weights at the last epoch as an initialization for SWA with a fixed learning rate schedule. We ran SWA for 0.25, 0.5 and 0.75 budget to complete the training within 1, 1.25 and 1.5 budgets respectively.

For Shake-Shake and PyramidNet architectures we do not report the results in one budget. For these models we use a full budget to get an initialization for the procedure, and then train with a cyclical learning rate schedule for 0.25 and 0.5 budgets. We used long cycles of small learning rates for Shake-Shake, because this architecture already involves many stochastic components.

We present the details of the learning rate schedules for each of these models in the Appendix.

For each model we also report the results of conventional SGD training, which we denote by SGD. For VGG, Preactivation ResNet and Wide ResNet we also provide the results of the FGE method with one budget reported in
Garipov et al. [2018]. Note that for FGE we report the accuracy of an ensemble of 6 to 12 networks, while for SWA we report the accuracy of a single model.

We summarize the experimental results in Table 1. For all models we report the mean and standard deviation of test accuracy over 3 runs. In all conducted experiments SWA substantially outperforms SGD in one budget, and improves further, as we allow more training epochs. Across different architectures we see consistent improvement by ≈ 0.5% on CIFAR-10 (excluding Shake-Shake, for which SGD performance is already extremely high) and by 0.75-1.5% on CIFAR-100. Amazingly, SWA is able to achieve comparable or better performance than FGE ensembles with just one model. On CIFAR-100 SWA usually needs more than one budget to get results comparable with FGE ensembles, but on CIFAR-10 even with 1 budget SWA outperforms FGE.

4.2 IMAGENET

On ImageNet we experimented with ResNet-50, ResNet-152 [He et al., 2016] and DenseNet-161 [Huang et al., 2017]. For these architectures we used pretrained models from PyTorch.torchvision. For each of the models we ran SWA for 10 epochs with a cyclical learning rate schedule with the same parameters for all models (the details can be found in the Appendix), and report the mean and standard deviation of test error averaged over 3 runs. The results are shown in Table 2.

Table 2: Top-1 accuracies (%) on ImageNet for SWA and SGD with different architectures.

DNN            SGD     SWA (5 epochs)   SWA (10 epochs)
ResNet-50      76.15   76.83 ± 0.01     76.97 ± 0.05
ResNet-152     78.31   78.82 ± 0.01     78.94 ± 0.07
DenseNet-161   77.65   78.26 ± 0.09     78.44 ± 0.06

For all 3 architectures SWA provides consistent improvement by 0.6-0.9% over the pretrained models.

4.3 EFFECT OF THE LEARNING RATE SCHEDULE

In this section we explore how the learning rate schedule affects the performance of SWA. We run experiments on Preactivation ResNet-164 on CIFAR-100. For all schedules we use the same initialization from a model trained for 125 epochs using the conventional SGD training. As a baseline we use a fully-trained model trained with conventional SGD for 150 epochs.

We consider a range of constant and cyclical learning rate schedules. For cyclical learning rates we fix the cycle length to 5, and consider the pairs of base learning rate parameters $(\alpha_1, \alpha_2) \in \{(10^{-1}, 10^{-3}), (5 \cdot 10^{-2}, 5 \cdot 10^{-4}), (10^{-2}, 10^{-4}), (5 \cdot 10^{-3}, 5 \cdot 10^{-5})\}$. Among the constant learning rates we consider $\alpha_1 \in \{10^{-1}, 5 \cdot 10^{-2}, 10^{-2}, 10^{-3}\}$.

We plot the test error of the SWA procedure for different learning rate schedules as a function of the number of training epochs in Figure 6.

[Plot: test error (%) versus epochs (140 to 220). Curves: Baseline; LR = 0.1; LR = 0.05; LR = 0.01; LR = 0.001; CLR(0.1, 0.001); CLR(0.05, 0.0005); CLR(0.01, 0.0001); CLR(0.005, 0.00005).]

Figure 6: Test error as a function of training epoch for SWA with different learning rate schedules with a Preactivation ResNet-164 on CIFAR-100.

We find that in general the more aggressive constant learning rate schedule leads to faster convergence of SWA. In our experiments we found that setting the learning rate to some intermediate value between the largest and the smallest learning rate used in the annealing scheme in conventional training usually gave us the best results. The approach is however universal and can work well with different learning rate schedules tailored for particular tasks.

4.4 DNN TRAINING WITH A FIXED LEARNING RATE

In this section we show that it is possible to train DNNs from scratch with a fixed learning rate using SWA. We run SGD with a fixed learning rate of 0.05 on a Wide ResNet-28-10 [Zagoruyko and Komodakis, 2016] for 300 epochs from a random initialization on CIFAR-100. We then averaged the weights at the end of each epoch from epoch 140 and until the end of training. The final test accuracy of this SWA model was 81.7.

Figure 7 illustrates the test error as a function of the number of training epochs for SWA and conventional training. The accuracy of the individual models with weights averaged by SWA stays at the level of ≈ 65% which is 16% less than the accuracy of the SWA model. These results correspond to our intuition presented in section 3.6 that SGD with a constant learning rate oscillates around
the optimum, but SWA converges.

[Plot: test error (%) versus epochs (0 to 300). Curves: SGD; Const LR SGD; Const LR SWA.]

Figure 7: Test error as a function of training epoch for constant (green) and decaying (blue) learning rate schedules for a Wide ResNet-28-10 on CIFAR-100. In red we average the points along the trajectory of SGD with constant learning rate starting at epoch 140.

While being able to train a DNN with a fixed learning rate is a surprising property of SWA, for practical purposes we recommend initializing SWA from a model pretrained with conventional training (possibly for a reduced number of epochs), as it leads to faster and more stable convergence than running SWA from scratch.

5 DISCUSSION

We have presented Stochastic Weight Averaging (SWA) for training neural networks. SWA is extremely easy to implement, architecture-agnostic, and improves generalization performance at virtually no additional cost over conventional training.

There are so many exciting directions for future research. SWA does not require each weight in its average to correspond to a good solution, due to the geometry of weights traversed by the algorithm. It therefore may be possible to develop SWA for much faster convergence than standard SGD. One may also be able to combine SWA with large batch sizes while preserving generalization performance, since SWA discovers much broader optima than conventional SGD training. Furthermore, a cyclic learning rate enables SWA to explore regions of high posterior density over neural network weights. Such learning rate schedules could be developed in conjunction with stochastic MCMC approaches, to encourage exploration while still providing high quality samples. One could also develop SWA to average whole regions of good solutions, using the high-accuracy curves discovered in Garipov et al. [2018].

A better understanding of the loss surfaces for multilayer networks will help continue to unlock the potential of these rich models. We hope that SWA will inspire further progress in this area.

Acknowledgements. This work was supported by NSF IIS-1563887, Samsung Research, Samsung Electronics and Russian Science Foundation grant 17-11-01027. We also thank Vadim Bereznyuk for helpful comments.

References

P. Chaudhari, Anna Choromanska, S. Soatto, Yann LeCun, C. Baldassi, C. Borgs, J. Chayes, Levent Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR), 2017.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028, 2017.

Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning, pages 1308–1317, 2018.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv preprint arXiv:1802.10026, 2018.

Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations, 2015.

Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. arXiv preprint arXiv:1610.02915, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2017.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. International Conference on Learning Representations, 2017.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.

Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variance networks: When expectation does not meet your expectations. arXiv preprint arXiv:1803.03764, 2018.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2012.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Leslie N Smith and Nicholay Topin. Exploring loss function topology with cyclical learning rates. arXiv preprint arXiv:1702.04283, 2017.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

A Appendix

A.1 EXPERIMENTAL DETAILS

For the experiments on CIFAR datasets (section 4.1) we used the following implementations (embedded links):

• Shake-Shake-2x64d
• PyramidNet-272
• VGG-16
• Preactivation-ResNet-164
• Wide ResNet-28-10

Models for ImageNet are from here. Pretrained networks can be found here.

SWA learning rates. For PyramidNet SWA uses a cyclic learning rate with α1 = 0.05 and α2 = 0.001 and cycle length 3. For VGG and Wide ResNet we used a constant learning rate α1 = 0.01. For ResNet we used constant learning rates α1 = 0.01 on CIFAR-10 and 0.05 on CIFAR-100.

For Shake-Shake Net we used a custom cyclic learning rate based on the cosine annealing used when training Shake-Shake with SGD. Each of the cycles replicates the learning rates corresponding to epochs 1600–1700 of the standard training, and the cycle length is c = 100 epochs. The learning rate schedule is depicted in Figure 8 and follows the formula

$$\alpha(i) = 0.1 \cdot \left( 1 + \cos\left( \pi \cdot \frac{1600 + \mathrm{epoch}(i) \bmod 100}{1800} \right) \right),$$

where epoch(i) is the number of data passes completed before iteration i.

For all experiments with ImageNet we used a cyclic learning rate schedule with the same hyperparameters α1 = 0.001, α2 = 10^{-5} and c = 1.

SGD learning rates. For conventional SGD training we used SGD with momentum 0.9 and with an annealed learning rate schedule. For VGG, Wide ResNet and Preactivation ResNet we fixed the learning rate to α1 for the first half of epochs (0B–0.5B), then linearly decreased the learning rate to 0.01α1 for the next 40% of epochs (0.5B–0.9B), and then kept it constant for the last 10% of epochs (0.9B–1B). For VGG we set α1 = 0.05, and for Preactivation ResNet and Wide ResNet we set
α1 = 0.1. For Shake-Shake Net and PyramidNets we used the cosine and piecewise-constant learning rate schedules described in Gastaldi [2017] and Han et al. [2016] respectively.

[Plot: learning rate (between α2 and α1) versus epochs (0 to 400).]

Figure 8: Cyclical learning rate used for Shake-Shake as a function of iteration.
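The two schedules that Section A.1 specifies in closed form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sgd_lr` and `shake_shake_swa_lr` are hypothetical, the schedules are evaluated once per epoch rather than per iteration, and the code is not taken from the authors' repository.

```python
import math


def sgd_lr(epoch, budget, alpha1):
    """Annealed schedule for conventional SGD as described in Section A.1.

    alpha1 for the first half of the budget, a linear decrease to
    0.01 * alpha1 between 0.5B and 0.9B, then constant until 1B.
    """
    t = epoch / budget
    if t <= 0.5:
        return alpha1
    if t <= 0.9:
        frac = (t - 0.5) / 0.4          # 0 at 0.5B, 1 at 0.9B
        return alpha1 * (1.0 - 0.99 * frac)
    return 0.01 * alpha1


def shake_shake_swa_lr(epoch, cycle_length=100):
    """Cyclic cosine schedule for Shake-Shake SWA (formula in Section A.1).

    Each cycle replays epochs 1600-1700 of an 1800-epoch cosine annealing.
    """
    t = 1600 + (epoch % cycle_length)
    return 0.1 * (1.0 + math.cos(math.pi * t / 1800))


if __name__ == "__main__":
    for e in (0, 100, 140, 180, 200):
        print("SGD, VGG budget 200, epoch", e, "->", sgd_lr(e, 200, 0.05))
    for e in (0, 50, 99, 100):
        print("Shake-Shake SWA, epoch", e, "->", shake_shake_swa_lr(e))
```

Keeping both schedules as pure functions of the epoch index makes it easy to plot them or to plug them into any training loop that accepts a per-epoch learning rate.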

A.2 TRAINING RESNET WITH A CONSTANT LEARNING RATE

In this section we present the experiment on training Preactivation ResNet-164 using a constant learning rate. The experimental setup is the same as in section 4.4. We set the learning rate to α1 = 0.1 and start averaging after epoch 200. The results are presented in Figure 9.
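The procedure described here (SGD with a constant learning rate, with the weights averaged at the end of every epoch after a start epoch) can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's experiment: it replaces the network with a noisy quadratic loss 0.5·‖w‖², and the names `swa_update` and `run` as well as the dimension, step counts and noise model are hypothetical choices made for illustration.

```python
import numpy as np


def swa_update(w_swa, w, n_averaged):
    """Running mean: fold a new iterate w into an average of n_averaged iterates."""
    return (w_swa * n_averaged + w) / (n_averaged + 1)


def run(lr=0.1, n_epochs=300, steps_per_epoch=50, swa_start=200, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)                   # random initialization
    w_swa, n_averaged = None, 0
    for epoch in range(1, n_epochs + 1):
        for _ in range(steps_per_epoch):
            # Noisy gradient of the toy loss 0.5 * ||w||^2 (optimum at w = 0).
            grad = w + rng.normal(size=dim)
            w = w - lr * grad                  # constant learning rate SGD step
        if epoch > swa_start:                  # start averaging after swa_start
            w_swa = w.copy() if n_averaged == 0 else swa_update(w_swa, w, n_averaged)
            n_averaged += 1
    return w, w_swa


w_sgd, w_swa = run()
print("distance to optimum, last SGD iterate:", np.linalg.norm(w_sgd))
print("distance to optimum, SWA average:     ", np.linalg.norm(w_swa))
```

In this toy setting the last SGD iterate keeps fluctuating around the optimum while the averaged weights land much closer to it, which mirrors the qualitative behaviour the section describes for constant learning rate training.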
[Plot: test error (%) versus epochs (0 to 300). Curves: SGD; Const LR SGD; Const LR SWA.]

Figure 9: Test error as a function of training epoch for constant (green) and decaying (blue) learning rate schedules for a Preactivation ResNet-164 on CIFAR-100. In red we average the points along the trajectory of SGD with constant learning rate starting at epoch 200.
Group Normalization

Yuxin Wu, Kaiming He
Facebook AI Research (FAIR)
{yuxinwu,kaiminghe}@fb.com

arXiv:1803.08494v3 [cs.CV] 11 Jun 2018

Abstract

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO,¹ and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.

1. Introduction

Batch Normalization (Batch Norm or BN) [26] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [59, 20] and beyond [54]. BN normalizes the features by the mean and variance computed within a (mini-)batch. This has been shown by many practices to ease optimization and enable very deep networks to converge. The stochastic uncertainty of the batch statistics also acts as a regularizer that can benefit generalization. BN has been a foundation of many state-of-the-art computer vision algorithms.

Despite its great success, BN exhibits drawbacks that are also caused by its distinct behavior of normalizing along the batch dimension. In particular, it is required for BN to work with a sufficiently large batch size (e.g., 32 per worker² [26, 59, 20]). A small batch leads to inaccurate estimation of the batch statistics, and reducing BN’s batch size increases the model error dramatically (Figure 1). As a result, many recent models [59, 20, 57, 24, 63] are trained with non-trivial batch sizes that are memory-consuming. The heavy reliance on BN’s effectiveness to train models in turn prohibits people from exploring higher-capacity models that would be limited by memory.

[Plot: error (%) versus batch size (images per worker: 32, 16, 8, 4, 2). Curves: Batch Norm; Group Norm.]

Figure 1. ImageNet classification error vs. batch sizes. This is a ResNet-50 model trained in the ImageNet training set using 8 workers (GPUs), evaluated in the validation set.

The restriction on batch sizes is more demanding in computer vision tasks including detection [12, 47, 18], segmentation [38, 18], video recognition [60, 6], and other high-level systems built on them. For example, the Fast/er and Mask R-CNN frameworks [12, 47, 18] use a batch size of 1 or 2 images because of higher resolution, where BN is “frozen” by transforming to a linear layer [20]; in video classification with 3D convolutions [60, 6], the presence of spatial-temporal features introduces a trade-off between the temporal length and batch size. The usage of BN often requires these systems to compromise between the model design and batch sizes.

¹ https://github.com/facebookresearch/Detectron/blob/master/projects/GN.
² In the context of this paper, we use “batch size” to refer to the number of samples per worker (e.g., GPU). BN’s statistics are computed for each worker, but not broadcast across workers, as is standard in many libraries.
This paper presents Group Normalization (GN) as a simple alternative to BN. We notice that many classical features like SIFT [39] and HOG [9] are group-wise features and involve group-wise normalization. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram. Analogously, we propose GN as a layer that divides channels into groups and normalizes the features within each group (Figure 2). GN does not exploit the batch dimension, and its computation is independent of batch sizes.

GN behaves very stably over a wide range of batch sizes (Figure 1). With a batch size of 2 samples, GN has 10.6% lower error than its BN counterpart for ResNet-50 [20] in ImageNet [50]. With a regular batch size, GN is comparably good as BN (with a gap of ∼0.5%) and outperforms other normalization variants [3, 61, 51]. Moreover, although the batch size may change, GN can naturally transfer from pre-training to fine-tuning. GN shows improved results vs. its BN counterpart on Mask R-CNN for COCO object detection and segmentation [37], and on 3D convolutional networks for Kinetics video classification [30]. The effectiveness of GN in ImageNet, COCO, and Kinetics demonstrates that GN is a competitive alternative to BN that has been dominant in these tasks.

There have been existing methods, such as Layer Normalization (LN) [3] and Instance Normalization (IN) [61] (Figure 2), that also avoid normalizing along the batch dimension. These methods are effective for training sequential models (RNN/LSTM [49, 22]) or generative models (GANs [15, 27]). But as we will show by experiments, both LN and IN have limited success in visual recognition, for which GN presents better results. Conversely, GN could be used in place of LN and IN and thus is applicable for sequential or generative models. This is beyond the focus of this paper, but it is suggestive for future research.

2. Related Work

Normalization. It is well-known that normalizing the input data makes training faster [33]. To normalize hidden features, initialization methods [33, 14, 19] have been derived based on strong assumptions of feature distributions, which can become invalid when training evolves.

Normalization layers in deep networks had been widely used before the development of BN. Local Response Normalization (LRN) [40, 28, 32] was a component in AlexNet [32] and following models [64, 53, 58]. Unlike recent methods [26, 3, 61], LRN computes the statistics in a small neighborhood for each pixel.

Batch Normalization [26] performs more global normalization along the batch dimension (and as importantly, it suggests to do this for all layers). But the concept of “batch” is not always present, or it may change from time to time. For example, batch-wise normalization is not legitimate at inference time, so the mean and variance are pre-computed from the training set [26], often by running average; consequently, there is no normalization performed when testing. The pre-computed statistics may also change when the target data distribution changes [45]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as aforementioned, reducing the batch size can have dramatic impact on the estimated batch statistics.

Several normalization methods [3, 61, 51, 2, 46] have been proposed to avoid exploiting the batch dimension. Layer Normalization (LN) [3] operates along the channel dimension, and Instance Normalization (IN) [61] performs BN-like computation but only for each sample (Figure 2). Instead of operating on features, Weight Normalization (WN) [51] proposes to normalize the filter weights. These methods do not suffer from the issues caused by the batch dimension, but they have not been able to approach BN’s accuracy in many visual recognition tasks. We provide comparisons with these methods in context of the remaining sections.

Addressing small batches. Ioffe [25] proposes Batch Renormalization (BR) that alleviates BN’s issue involving small batches. BR introduces two extra parameters that constrain the estimated mean and variance of BN within a certain range, reducing their drift when the batch size is small. BR has better accuracy than BN in the small-batch regime. But BR is also batch-dependent, and when the batch size decreases its accuracy still degrades [25].

There are also attempts to avoid using small batches. The object detector in [43] performs synchronized BN whose mean and variance are computed across multiple GPUs. However, this method does not solve the problem of small batches; instead, it migrates the algorithm problem to engineering and hardware demands, using a number of GPUs proportional to BN’s requirements. Moreover, the synchronized BN computation prevents using asynchronous solvers (ASGD [10]), a practical solution to large-scale training widely used in industry. These issues can limit the scope of using synchronized BN.

Instead of addressing the batch statistics computation (e.g., [25, 43]), our normalization method inherently avoids this computation.

Group-wise computation. Group convolutions have been presented by AlexNet [32] for distributing a model into two GPUs. The concept of groups as a dimension for model design has been more widely studied recently. The work of ResNeXt [63] investigates the trade-off between depth, width, and groups, and it suggests that a larger number of groups can improve accuracy under similar computational cost. MobileNet [23] and Xception [7] exploit channel-wise (also called “depth-wise”) convolutions, which are group convolutions with a group number equal to the channel
Batch Norm Layer Norm Instance Norm Group Norm

H, W

H, W

H, W

H, W
C N C N C N C N

Figure 2. Normalization methods. Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W )
as the spatial axes. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.

number. ShuffleNet [65] proposes a channel shuffle oper- 3.1. Formulation


ation that permutes the axes of grouped features. These methods all involve dividing the channel dimension into groups. Despite the relation to these methods, GN does not require group convolutions. GN is a generic layer, as we evaluate in standard ResNets [20].

3. Group Normalization

The channels of visual representations are not entirely independent. Classical features of SIFT [39], HOG [9], and GIST [41] are group-wise representations by design, where each group of channels is constructed by some kind of histogram. These features are often processed by group-wise normalization over each histogram or each orientation. Higher-level features such as VLAD [29] and Fisher Vectors (FV) [44] are also group-wise features where a group can be thought of as the sub-vector computed with respect to a cluster.

Analogously, it is not necessary to think of deep neural network features as unstructured vectors. For example, for conv1 (the first convolutional layer) of a network, it is reasonable to expect a filter and its horizontal flipping to exhibit similar distributions of filter responses on natural images. If conv1 happens to approximately learn this pair of filters, or if the horizontal flipping (or other transformations) is made into the architectures by design [11, 8], then the corresponding channels of these filters can be normalized together.

The higher-level layers are more abstract and their behaviors are not as intuitive. However, in addition to orientations (SIFT [39], HOG [9], or [11, 8]), there are many factors that could lead to grouping, e.g., frequency, shapes, illumination, textures. Their coefficients can be interdependent. In fact, a well-accepted computational model in neuroscience is to normalize across the cell responses [21, 52, 55, 5], “with various receptive-field centers (covering the visual field) and with various spatiotemporal frequency tunings” (p183, [21]); this can happen not only in the primary visual cortex, but also “throughout the visual system” [5]. Motivated by these works, we propose new generic group-wise normalization for deep neural networks.

We first describe a general formulation of feature normalization, and then present GN in this formulation. A family of feature normalization methods, including BN, LN, IN, and GN, perform the following computation:

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i). \tag{1}$$

Here $x$ is the feature computed by a layer, and $i$ is an index. In the case of 2D images, $i = (i_N, i_C, i_H, i_W)$ is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial height and width axes.

µ and σ in (1) are the mean and standard deviation (std) computed by:

$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \epsilon}, \tag{2}$$

with $\epsilon$ as a small constant. $S_i$ is the set of pixels in which the mean and std are computed, and $m$ is the size of this set. Many types of feature normalization methods mainly differ in how the set $S_i$ is defined (Figure 2), discussed as follows.

In Batch Norm [26], the set $S_i$ is defined as:

$$S_i = \{k \mid k_C = i_C\}, \tag{3}$$

where $i_C$ (and $k_C$) denotes the sub-index of $i$ (and $k$) along the C axis. This means that the pixels sharing the same channel index are normalized together, i.e., for each channel, BN computes µ and σ along the (N, H, W) axes. In Layer Norm [3], the set is:

$$S_i = \{k \mid k_N = i_N\}, \tag{4}$$

meaning that LN computes µ and σ along the (C, H, W) axes for each sample. In Instance Norm [61], the set is:

$$S_i = \{k \mid k_N = i_N,\ k_C = i_C\}, \tag{5}$$

meaning that IN computes µ and σ along the (H, W) axes for each sample and each channel. The relations among BN, LN, and IN are in Figure 2.
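As an illustration of Eqs. (1) to (5), the following is a minimal NumPy sketch (not the authors' code) in which the only thing that changes between BN, LN, and IN is the set of axes over which the mean and variance are taken. The helper name `normalize`, the tensor sizes, and the random input are assumptions made for this example; the learned scale and shift are left out.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Apply Eq. (1), with the statistics of Eq. (2) taken over `axes`.

    `axes` plays the role of the set S_i: every position that shares the
    remaining (non-reduced) indices is normalized with the same mean/std.
    """
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)  # biased variance, i.e. (1/m) * sum
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical feature map in (N, C, H, W) order.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 6, 5, 5))

bn = normalize(x, axes=(0, 2, 3))   # Batch Norm, Eq. (3): per channel, over (N, H, W)
ln = normalize(x, axes=(1, 2, 3))   # Layer Norm, Eq. (4): per sample, over (C, H, W)
inn = normalize(x, axes=(2, 3))     # Instance Norm, Eq. (5): per sample and channel, over (H, W)
```

Only the BN call reduces over the batch axis, which is why it is the only one of the three whose output for a given sample depends on the other samples in the batch.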

As in [26], all methods of BN, LN, and IN learn a per-channel linear transform to compensate for the possible loss of representational ability:

$$y_i = \gamma \hat{x}_i + \beta, \tag{6}$$

where γ and β are trainable scale and shift (indexed by $i_C$ in all cases, which we omit for simplifying notations).

Group Norm. Formally, a Group Norm layer computes µ and σ in a set $S_i$ defined as:

$$S_i = \{k \mid k_N = i_N,\ \lfloor \tfrac{k_C}{C/G} \rfloor = \lfloor \tfrac{i_C}{C/G} \rfloor\}. \tag{7}$$

Here G is the number of groups, which is a pre-defined hyper-parameter (G = 32 by default). C/G is the number of channels per group. $\lfloor\cdot\rfloor$ is the floor operation, and “$\lfloor \tfrac{k_C}{C/G} \rfloor = \lfloor \tfrac{i_C}{C/G} \rfloor$” means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels is stored in a sequential order along the C axis. GN computes µ and σ along the (H, W) axes and along a group of C/G channels. The computation of GN is illustrated in Figure 2 (rightmost), which is a simple case of 2 groups (G = 2) each having 3 channels.

Given $S_i$ in Eqn.(7), a GN layer is defined by Eqn.(1), (2), and (6). Specifically, the pixels in the same group are normalized together by the same µ and σ. GN also learns the per-channel γ and β.

Relation to Prior Work. LN, IN, and GN all perform independent computations along the batch axis. The two extreme cases of GN are equivalent to LN and IN (Figure 2).

Relation to Layer Normalization [3]. GN becomes LN if we set the group number as G = 1. LN assumes all channels in a layer make “similar contributions” [3]. Unlike the case of fully-connected layers studied in [3], this assumption can be less valid with the presence of convolutions, as discussed in [3]. GN is less restricted than LN, because each group of channels (instead of all of them) is assumed to be subject to the shared mean and variance; the model still has the flexibility of learning a different distribution for each group. This leads to improved representational power of GN over LN, as shown by the lower training and validation error in experiments (Figure 4).

Relation to Instance Normalization [61]. GN becomes IN if we set the group number as G = C (i.e., one channel per group). But IN can only rely on the spatial dimension for computing the mean and variance and it misses the opportunity of exploiting the channel dependence.

3.2. Implementation

GN can be easily implemented by a few lines of code in PyTorch [42] and TensorFlow [1] where automatic differentiation is supported. Figure 3 shows the code based on TensorFlow. In fact, we only need to specify how the mean and variance (“moments”) are computed, along the appropriate axes as defined by the normalization method.

    import tensorflow as tf

    def GroupNorm(x, gamma, beta, G, eps=1e-5):
        # x: input features with shape [N,C,H,W]
        # gamma, beta: scale and offset, with shape [1,C,1,1]
        # G: number of groups for GN
        N, C, H, W = x.shape
        # split the C axis into G groups of C // G channels each
        x = tf.reshape(x, [N, G, C // G, H, W])
        # moments per sample and per group (older TF 1.x releases spell this keep_dims)
        mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
        x = (x - mean) / tf.sqrt(var + eps)
        x = tf.reshape(x, [N, C, H, W])
        return x * gamma + beta

Figure 3. Python code of Group Norm based on TensorFlow.

4. Experiments

4.1. Image Classification in ImageNet

We experiment in the ImageNet classification dataset [50] with 1000 classes. We train on the ∼1.28M training images and evaluate on the 50,000 validation images, using the ResNet models [20].

Implementation details. As standard practice [20, 17], we use 8 GPUs to train all models, and the batch mean and variance of BN are computed within each GPU. We use the method of [19] to initialize all convolutions for all models. We use 1 to initialize all γ parameters, except for each residual block’s last normalization layer where we initialize γ by 0 following [16] (such that the initial state of a residual block is identity). We use a weight decay of 0.0001 for all weight layers, including γ and β (following [17] but unlike [20, 16]). We train 100 epochs for all models, and decrease the learning rate by 10× at 30, 60, and 90 epochs. During training, we adopt the data augmentation of [58] as implemented by [17]. We evaluate the top-1 classification error on the center crops of 224×224 pixels in the validation set. To reduce random variations, we report the median error rate of the final 5 epochs [16]. Other implementation details follow [17].

Our baseline is the ResNet trained with BN [20]. To compare with LN, IN, and GN, we replace BN with the specific variant. We use the same hyper-parameters for all models. We set G = 32 for GN by default.

Comparison of feature normalization methods. We first experiment with a regular batch size of 32 images (per GPU) [26, 20]. BN works successfully in this regime, so this is a strong baseline to compare with. Figure 4 shows the error curves, and Table 1 shows the final results.

[Figure 4: two panels plotting error (%) against epochs (0 to 100): train error (left) and val error (right), each with curves for Batch Norm (BN), Layer Norm (LN), Instance Norm (IN), and Group Norm (GN).]

Figure 4. Comparison of error curves with a batch size of 32 images/GPU. We show the ImageNet training error (left) and validation error (right) vs. numbers of training epochs. The model is ResNet-50.

             BN     LN     IN     GN
val error    23.6   25.3   28.4   24.1
Δ (vs. BN)   -      1.7    4.8    0.5

Table 1. Comparison of error rates (%) of ResNet-50 in the ImageNet validation set, trained with a batch size of 32 images/GPU. The error curves are in Figure 4.

Figure 4 shows that all of these normalization methods are able to converge. LN has a small degradation of 1.7% compared with BN. This is an encouraging result, as it suggests that normalizing along all channels (as done by LN) of a convolutional network is reasonably good. IN also makes the model converge, but is 4.8% worse than BN.³

In this regime where BN works well, GN is able to approach BN’s accuracy, with a decent degradation of 0.5% in the validation set. Actually, Figure 4 (left) shows that GN has lower training error than BN, indicating that GN is effective for easing optimization. The slightly higher validation error of GN implies that GN loses some regularization ability of BN. This is understandable, because BN’s mean and variance computation introduces uncertainty caused by the stochastic batch sampling, which helps regularization [26]. This uncertainty is missing in GN (and LN/IN). But it is possible that GN combined with a suitable regularizer will improve results. This can be a future research topic.

³ For completeness, we have also trained ResNet-50 with WN [51], which is filter (instead of feature) normalization. WN’s result is 28.2%.

Small batch sizes. Although BN benefits from the stochasticity under some situations, its error increases when the batch size becomes smaller and the uncertainty gets bigger. We show this in Figure 1, Figure 5, and Table 2.

[Figure 5: two panels plotting error (%) against epochs (0 to 100): Batch Norm (BN) on the left and Group Norm (GN) on the right, each with curves for 32, 16, 8, 4, and 2 ims/gpu.]

Figure 5. Sensitivity to batch sizes: ResNet-50’s validation error of BN (left) and GN (right) trained with 32, 16, 8, 4, and 2 images/GPU.

batch size   32     16     8      4      2
BN           23.6   23.7   24.8   27.3   34.7
GN           24.1   24.2   24.0   24.2   24.1
Δ            0.5    0.5    -0.8   -3.1   -10.6

Table 2. Sensitivity to batch sizes. We show ResNet-50’s validation error (%) in ImageNet. The last row shows the differences between BN and GN. The error curves are in Figure 5. This table is visualized in Figure 1.

We evaluate batch sizes of 32, 16, 8, 4, 2 images per GPU. In all cases, the BN mean and variance are computed within each GPU and not synchronized. All models are trained on 8 GPUs. In this set of experiments, we adopt the linear learning rate scaling rule [31, 4, 16] to adapt to batch size changes: we use a learning rate of 0.1 [20] for the batch size of 32, and 0.1N/32 for a batch size of N. This linear scaling rule works well for BN if the total batch size changes (by changing the number of GPUs) but the per-GPU batch size does not change [16]. We keep the same number of training epochs for all cases (Figure 5, x-axis). All other hyper-parameters are unchanged.
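The learning-rate settings described above can be written down in a few lines. The sketch below is an illustration only, not the training code used for the experiments: the function names are hypothetical, and treating the drops at 30, 60, and 90 as applying from those zero-based epoch indices onward is an assumption.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=32):
    """Linear scaling rule: 0.1 for a batch size of 32, and 0.1 * N / 32 for N."""
    return base_lr * batch_size / base_batch_size

def step_schedule(epoch, base_lr, milestones=(30, 60, 90)):
    """Divide the learning rate by 10 at each milestone epoch (assumed 0-based)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (0.1 ** drops)

# Starting learning rate for each per-GPU batch size studied in Table 2.
for n in (32, 16, 8, 4, 2):
    print(n, scaled_learning_rate(n))
```

Under this rule only the starting learning rate changes with the batch size; the number of epochs and the positions of the drops stay the same for every setting.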

[Figure 6: three panels plotting feature responses against epochs (0 to 100) for none (w/o norm), Batch Norm, and Group Norm, each showing the 1st, 20th, 80th, and 99th percentile curves. A table on the right lists the error: none 29.2, BN 28.0, GN 27.6.]

Figure 6. Evolution of feature distributions of conv5_3’s output (before normalization and ReLU) from VGG-16, shown as the {1, 20, 80, 99} percentile of responses. The table on the right shows the ImageNet validation error (%). Models are trained with 32 images/GPU.

Figure 5 (left) shows that BN’s error becomes considerably higher with small batch sizes. GN’s behavior is more stable and insensitive to the batch size. Actually, Figure 5 (right) shows that GN has very similar curves (subject to random variations) across a wide range of batch sizes from 32 to 2. In the case of a batch size of 2, GN has 10.6% lower error rate than its BN counterpart (24.1% vs. 34.7%).

These results indicate that the batch mean and variance estimation can be overly stochastic and inaccurate, especially when they are computed over 4 or 2 images. However, this stochasticity disappears if the statistics are computed from 1 image, in which case BN becomes similar to IN at training time. We see that IN has a better result (28.4%) than BN with a batch size of 2 (34.7%).

The robust results of GN in Table 2 demonstrate GN’s strength. It removes the batch size constraint imposed by BN, which can give considerably more memory (e.g., 16× or more). This will make it possible to train higher-capacity models that would be otherwise bottlenecked by memory limitation. We hope this will create new opportunities in architecture design.

Comparison with Batch Renorm (BR). BR [25] introduces two extra parameters (r and d in [25]) that constrain the estimated mean and variance of BN. Their values are controlled by $r_{max}$ and $d_{max}$. To apply BR to ResNet-50, we have carefully chosen these hyper-parameters, and found that $r_{max}$ = 1.5 and $d_{max}$ = 0.5 work best for ResNet-50. With a batch size of 4, ResNet-50 trained with BR has an error rate of 26.3%. This is better than BN’s 27.3%, but still 2.1% higher than GN’s 24.2%.

Group division. Thus far all presented GN models are trained with a group number of G = 32. Next we evaluate different ways of dividing into groups. With a given fixed group number, GN performs reasonably well for all values of G we studied (Table 3, top panel). In the extreme case of G = 1, GN is equivalent to LN, and its error rate is higher than all cases of G > 1 studied.

We also evaluate fixing the number of channels per group (Table 3, bottom panel). Note that because the layers can have different channel numbers, the group number G can change across layers in this setting. In the extreme case of 1 channel per group, GN is equivalent to IN. Even if using as few as 2 channels per group, GN has substantially lower error than IN (25.6% vs. 28.4%). This result shows the effect of grouping channels when performing normalization.

# groups (G)
64     32     16     8      4      2      1 (=LN)
24.6   24.1   24.6   24.4   24.6   24.7   25.3
0.5    -      0.5    0.3    0.5    0.6    1.2

# channels per group
64     32     16     8      4      2      1 (=IN)
24.4   24.5   24.2   24.3   24.8   25.6   28.4
0.2    0.3    -      0.1    0.6    1.4    4.2

Table 3. Group division. We show ResNet-50’s validation error (%) in ImageNet, trained with 32 images/GPU. (Top): a given number of groups. (Bottom): a given number of channels per group. The last rows show the differences with the best number.

Deeper models. We have also compared GN with BN on ResNet-101 [20]. With a batch size of 32, our BN baseline of ResNet-101 has 22.0% validation error, and the GN counterpart has 22.4%, slightly worse by 0.4%. With a batch size of 2, GN ResNet-101’s error is 23.0%. This is still a decently stable result considering the very small batch size, and it is 8.9% better than the BN counterpart’s 31.9%.

Results and analysis of VGG models. To study GN/BN compared to no normalization, we consider VGG-16 [56] that can be healthily trained without normalization layers. We apply BN or GN right after each convolutional layer. Figure 6 shows the evolution of the feature distributions of conv5_3 (the last convolutional layer). GN and BN behave qualitatively similarly, while being substantially different from the variant that uses no normalization; this phenomenon is also observed for all other convolutional layers. This comparison suggests that performing normalization is essential for controlling the distribution of features.

For VGG-16, GN is better than BN by 0.4% (Figure 6, right). This possibly implies that VGG-16 benefits less from BN’s regularization effect, and GN (that leads to lower training error) is superior to BN in this case.
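The two extreme cases named in the group-division study (G = 1 matching LN, and one channel per group matching IN) can be checked numerically. The following NumPy sketch is a stand-in for illustration, not the implementation used in the experiments: it omits the learned per-channel γ and β, and the tensor sizes and helper names are assumptions.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Normalize x of shape [N, C, H, W] over (H, W) and groups of C // G channels."""
    N, C, H, W = x.shape
    assert C % G == 0, "C must be divisible by G"
    g = x.reshape(N, G, C // G, H, W)              # channels stored group by group
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)

def normalize(x, axes, eps=1e-5):
    """Reference normalization over an explicit set of axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))                  # hypothetical N=2, C=8, H=W=4

layer_norm_out = normalize(x, axes=(1, 2, 3))
instance_norm_out = normalize(x, axes=(2, 3))

same_as_ln = np.allclose(group_norm(x, G=1), layer_norm_out)      # G = 1
same_as_in = np.allclose(group_norm(x, G=8), instance_norm_out)   # G = C
# No statistic crosses the batch axis, so a sample's output is the same
# whether it is processed alone or together with other samples.
batch_independent = np.allclose(group_norm(x, G=4)[:1], group_norm(x[:1], G=4))
```

The reshape assumes, as Eq. (7) does, that the channels of each group are stored contiguously along the C axis.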

4.2. Object Detection and Segmentation in COCO

Next we evaluate fine-tuning the models for transferring to object detection and segmentation. These computer vision tasks in general benefit from higher-resolution input, so the batch size tends to be small in common practice (1 or 2 images/GPU [12, 47, 18, 36]). As a result, BN is turned into a linear layer $y = \frac{\gamma}{\sigma}(x - \mu) + \beta$ where µ and σ are pre-computed from the pre-trained model and frozen [20]. We denote this as BN*, which in fact performs no normalization during fine-tuning. We have also tried a variant that fine-tunes BN (normalization is performed and not frozen) and found it works poorly (reducing ∼6 AP with a batch size of 2), so we ignore this variant.

We experiment on the Mask R-CNN baselines [18], implemented in the publicly available codebase of Detectron [13]. We use the end-to-end variant with the same hyper-parameters as in [13]. We replace BN* with GN during fine-tuning, using the corresponding models pre-trained from ImageNet.⁴ During fine-tuning, we use a weight decay of 0 for the γ and β parameters, which is important for good detection results when γ and β are being tuned. We fine-tune with a batch size of 1 image/GPU and 8 GPUs.

The models are trained in the COCO train2017 set and evaluated in the COCO val2017 set (a.k.a. minival). We report the standard COCO metrics of Average Precision (AP), AP_50, and AP_75, for bounding box detection (AP^bbox) and instance segmentation (AP^mask).

Results of C4 backbone. Table 4 shows the comparison of GN vs. BN* on Mask R-CNN using a conv4 backbone (“C4” [18]). This C4 variant uses ResNet’s layers of up to conv4 to extract feature maps, and ResNet’s conv5 layers as the Region-of-Interest (RoI) heads for classification and regression. As they are inherited from the pre-trained model, the backbone and head both involve normalization layers.

backbone   AP^bbox   AP^bbox_50   AP^bbox_75   AP^mask   AP^mask_50   AP^mask_75
BN*        37.7      57.9         40.9         32.8      54.3         34.7
GN         38.8      59.2         42.2         33.6      55.9         35.4

Table 4. Detection and segmentation ablation results in COCO, using Mask R-CNN with ResNet-50 C4. BN* means BN is frozen.

On this baseline, GN improves over BN* by 1.1 box AP and 0.8 mask AP. We note that the pre-trained GN model is slightly worse than BN in ImageNet (24.1% vs. 23.6%), but GN still outperforms BN* for fine-tuning. BN* creates inconsistency between pre-training and fine-tuning (frozen), which may explain the degradation.

We have also experimented with the LN variant, and found it is 1.9 box AP worse than GN and 0.8 worse than BN*. Although LN is also independent of batch sizes, its representational power is weaker than GN.

Results of FPN backbone. Next we compare GN and BN* on Mask R-CNN using a Feature Pyramid Network (FPN) backbone [35], the currently state-of-the-art framework in COCO. Unlike the C4 variant, FPN exploits all pre-trained layers to construct a pyramid, and appends randomly initialized layers as the head. In [35], the box head consists of two hidden fully-connected layers (2fc). We find that replacing the 2fc box head with 4conv1fc (similar to [48]) can better leverage GN. The resulting comparisons are in Table 5.

backbone   box head   AP^bbox   AP^bbox_50   AP^bbox_75   AP^mask   AP^mask_50   AP^mask_75
BN*        -          38.6      59.5         41.9         34.2      56.2         36.1
BN*        GN         39.5      60.0         43.2         34.4      56.4         36.3
GN         GN         40.0      61.0         43.3         34.8      57.3         36.3

Table 5. Detection and segmentation ablation results in COCO, using Mask R-CNN with ResNet-50 FPN and a 4conv1fc bounding box head. BN* means BN is frozen.

As a baseline, BN* has 38.6 box AP using the 4conv1fc head, on par with its 2fc counterpart using the same pre-trained model (38.5 AP). By adding GN to all convolutional layers of the box head (but still using the BN* backbone), we increase the box AP by 0.9 to 39.5 (2nd row, Table 5). This ablation shows that a substantial portion of GN’s improvement for detection is from normalization in the head (which is also done by the C4 variant). On the contrary, applying BN to the box head (that has 512 RoIs per image) does not provide a satisfactory result and is ∼9 AP worse: in detection, the batch of RoIs is sampled from the same image and their distribution is not i.i.d., and the non-i.i.d. distribution is also an issue that degrades BN’s batch statistics estimation [25]. GN does not suffer from this problem.

Next we replace the FPN backbone with the GN-based counterpart, i.e., the GN pre-trained model is used during fine-tuning (3rd row, Table 5). Applying GN to the backbone alone contributes a 0.5 AP gain (from 39.5 to 40.0), suggesting that GN helps when transferring features.

                 AP^bbox   AP^bbox_50   AP^bbox_75   AP^mask   AP^mask_50   AP^mask_75
R50 BN*          38.6      59.8         42.1         34.5      56.4         36.3
R50 GN           40.3      61.0         44.0         35.7      57.9         37.7
R50 GN, long     40.8      61.6         44.4         36.1      58.5         38.2
R101 BN*         40.9      61.9         44.8         36.4      58.5         38.7
R101 GN          41.8      62.5         45.4         36.8      59.2         39.0
R101 GN, long    42.3      62.8         46.2         37.2      59.7         39.5

Table 6. Detection and segmentation results in COCO using Mask R-CNN and FPN. Here BN* is the default Detectron baseline [13], and GN is applied to the backbone, box head, and mask head. “long” means training with more iterations. Code for these results is at https://github.com/facebookresearch/Detectron/blob/master/projects/GN.

⁴ Detectron [13] uses pre-trained models provided by the authors of [20]. For fair comparisons, we instead use the models pre-trained in this paper. The object detection and segmentation accuracy is statistically similar between these pre-trained models.
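To make concrete why BN* “performs no normalization” during fine-tuning, the sketch below writes the frozen layer from the start of this section as a fixed per-channel scale and shift. It is an illustrative NumPy example under assumed shapes and random values, not Detectron code; the function names are hypothetical.

```python
import numpy as np

def frozen_bn(x, gamma, beta, mu, sigma):
    """BN* with fixed, pre-computed statistics: y = gamma / sigma * (x - mu) + beta."""
    return gamma / sigma * (x - mu) + beta

def fold_frozen_bn(gamma, beta, mu, sigma):
    """Rewrite the frozen layer as y = a * x + b with per-channel constants a and b."""
    a = gamma / sigma
    b = beta - a * mu
    return a, b

rng = np.random.default_rng(0)
C = 4
x = rng.normal(size=(2, C, 3, 3))                  # hypothetical [N, C, H, W] features
gamma = rng.normal(size=(1, C, 1, 1))
beta = rng.normal(size=(1, C, 1, 1))
mu = rng.normal(size=(1, C, 1, 1))                 # frozen mean from pre-training
sigma = rng.uniform(0.5, 2.0, size=(1, C, 1, 1))   # frozen std from pre-training

a, b = fold_frozen_bn(gamma, beta, mu, sigma)
y = frozen_bn(x, gamma, beta, mu, sigma)
```

Because a and b do not depend on the current input, the statistics of the fine-tuning data never enter the computation, which is the inconsistency with pre-training that the text points to.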

[Figure 7: two panels plotting error (%) against epochs (0 to 100): Batch Norm (BN) on the left and Group Norm (GN) on the right, each with curves for 8 clips/gpu and 4 clips/gpu.]

Figure 7. Error curves in Kinetics with an input length of 32 frames. We show ResNet-50 I3D’s validation error of BN (left) and GN
(right) using a batch size of 8 and 4 clips/GPU. The monitored validation error is the 1-clip error under the same data augmentation as the
training set, while the final validation accuracy in Table 8 is 10-clip testing without data augmentation.

Table 6 shows the full results of GN (applied to the backbone, box head, and mask head), compared with the standard Detectron baseline [13] based on BN*. Using the same hyper-parameters as [13], GN increases over BN* by a healthy margin. Moreover, we found that GN is not fully trained with the default schedule in [13], so we also tried increasing the iterations from 180k to 270k (BN* does not benefit from longer training). Our final ResNet-50 GN model (“long”, Table 6) is 2.2 points box AP and 1.6 points mask AP better than its BN* variant.

Training Mask R-CNN from scratch. GN allows us to easily investigate training object detectors from scratch (without any pre-training). We show the results in Table 7, where the GN models are trained for 270k iterations.⁵ To our knowledge, our numbers (41.0 box AP and 36.4 mask AP) are the best from-scratch results in COCO reported to date; they can even compete with the ImageNet-pretrained results in Table 6. As a reference, with synchronous BN [43], a concurrent work [34] achieves a from-scratch result of 34.5 box AP using R50 (Table 7), and 36.3 using a specialized backbone.

from scratch   AP^bbox   AP^bbox_50   AP^bbox_75   AP^mask   AP^mask_50   AP^mask_75
R50 BN [34]    34.5      55.2         37.7         -         -            -
R50 GN         39.5      59.8         43.6         35.2      56.9         37.6
R101 GN        41.0      61.1         44.9         36.4      58.2         38.7

Table 7. Detection and segmentation results trained from scratch in COCO using Mask R-CNN and FPN. Here the BN baseline’s results are from [34], and BN is synced across GPUs [43] and is not frozen. Code for these results is at https://github.com/facebookresearch/Detectron/blob/master/projects/GN.

⁵ For models trained from scratch, we turn off the default StopGrad in Detectron that freezes the first few layers.

4.3. Video Classification in Kinetics

Lastly we evaluate video classification in the Kinetics dataset [30]. Many video classification models [60, 6] extend the features to 3D spatial-temporal dimensions. This is memory-demanding and imposes constraints on the batch sizes and model designs.

We experiment with Inflated 3D (I3D) convolutional networks [6]. We use the ResNet-50 I3D baseline as described in [62]. The models are pre-trained from ImageNet. For both BN and GN, we extend the normalization from over (H, W) to over (T, H, W), where T is the temporal axis. We train in the 400-class Kinetics training set and evaluate in the validation set. We report the top-1 and top-5 classification accuracy, using standard 10-clip testing that averages softmax scores from 10 clips regularly sampled.

We study two different temporal lengths: 32-frame and 64-frame input clips. The 32-frame clip is regularly sampled with a frame interval of 2 from the raw video, and the 64-frame clip is sampled continuously. The model is fully convolutional in spacetime, so the 64-frame variant consumes about 2× more memory. We study a batch size of 8 or 4 clips/GPU for the 32-frame variant, and 4 clips/GPU for the 64-frame variant due to memory limitation.

clip length   32            32            64
batch size    8             4             4
BN            73.3 / 90.7   72.1 / 90.0   73.3 / 90.8
GN            73.0 / 90.6   72.8 / 90.6   74.5 / 91.7

Table 8. Video classification results in Kinetics: ResNet-50 I3D top-1 / top-5 accuracy (%).

Results of 32-frame inputs. Table 8 (col. 1, 2) shows the video classification accuracy in Kinetics using 32-frame clips. For the batch size of 8, GN is slightly worse than BN by 0.3% top-1 accuracy and 0.1% top-5. This shows that GN is competitive with BN when BN works well. For the smaller batch size of 4, GN’s accuracy is kept similar (72.8 / 90.6 vs. 73.0 / 90.6), but is better than BN’s 72.1 / 90.0. BN’s accuracy is decreased by 1.2% when the batch size decreases from 8 to 4.

Figure 7 shows the error curves. BN’s error curves (left) have a noticeable gap when the batch size decreases from 8 to 4, while GN’s error curves (right) are very similar.

Results of 64-frame inputs. Table 8 (col. 3) shows the results of using 64-frame clips. In this case, BN has a result of 73.3 / 90.8. These appear to be acceptable numbers (vs. 73.3 / 90.7 of 32-frame, batch size 8), but the trade-off between the temporal length (64 vs. 32) and batch size (4 vs. 8) could have been overlooked. Comparing col. 3 and col. 2 in Table 8, we find that the temporal length actually has a positive impact (+1.2%), but it is veiled by BN’s negative effect of the smaller batch size.

GN does not suffer from this trade-off. The 64-frame variant of GN has 74.5 / 91.7 accuracy, showing healthy gains over its BN counterpart and all BN variants. GN helps the model benefit from temporal length, and the longer clip boosts the top-1 accuracy by 1.7% (top-5 1.1%) with the same batch size.

The improvement of GN on detection, segmentation, and video classification demonstrates that GN is a strong alternative to the powerful and currently dominant BN technique in these tasks.

5. Discussion and Future Work

We have presented GN as an effective normalization layer without exploiting the batch dimension. We have evaluated GN’s behaviors in a variety of applications. We note, however, that BN has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, which may not be optimal for GN-based models. It is possible that re-designing the systems or searching new hyper-parameters for GN will give better results.

In addition, we have shown that GN is related to LN and IN, two normalization methods that are particularly successful in training recurrent (RNN/LSTM) or generative (GAN) models. This suggests studying GN in those areas in the future. We will also investigate GN’s performance on learning representations for reinforcement learning (RL) tasks, e.g., [54], where BN is playing an important role for training very deep models [20].

Acknowledgement. We would like to thank Piotr Dollár and Ross Girshick for helpful discussions.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In Operating Systems Design and Implementation (OSDI), 2016.
[2] D. Arpit, Y. Zhou, B. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
[5] M. Carandini and D. J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 2012.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
[11] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, 2016.
[12] R. Girshick. Fast R-CNN. In ICCV, 2015.
[13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[16] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
[17] S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 1992.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
[23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.

[24] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[25] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NIPS, 2017.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[27] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[28] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[30] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
[31] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
[32] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[33] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade. 1998.
[34] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: A backbone network for object detection. arXiv:1804.06215, 2018.
[35] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. 2014.
[46] M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2017.
[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[48] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. TPAMI, 2017.
[49] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[51] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
[52] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature neuroscience, 2001.
[53] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[54] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 2017.
[55] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual review of neuroscience, 2001.
[56] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[57] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.
[38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional Rethinking the inception architecture for computer vision. In
networks for semantic segmentation. In CVPR, 2015. CVPR, 2016.
[39] D. G. Lowe. Distinctive image features from scale-invariant [60] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
keypoints. IJCV, 2004. Learning spatiotemporal features with 3D convolutional net-
[40] S. Lyu and E. P. Simoncelli. Nonlinear image representation works. In ICCV, 2015.
using divisive normalization. In CVPR, 2008. [61] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance nor-
[41] A. Oliva and A. Torralba. Modeling the shape of the scene: malization: The missing ingredient for fast stylization.
A holistic representation of the spatial envelope. IJCV, 2001. arXiv:1607.08022, 2016.
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. De- [62] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural
Vito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Auto- networks. In CVPR, 2018.
matic differentiation in pytorch. 2017. [63] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated
[43] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, residual transformations for deep neural networks. In CVPR,
and J. Sun. MegDet: A large mini-batch object detector. In 2017.
CVPR, 2018. [64] M. D. Zeiler and R. Fergus. Visualizing and understanding
[44] F. Perronnin and C. Dance. Fisher kernels on visual vocabu- convolutional neural networks. In ECCV, 2014.
laries for image categorization. In CVPR, 2007. [65] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An
[45] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple extremely efficient convolutional neural network for mobile
visual domains with residual adapters. In NIPS, 2017. devices. In CVPR, 2018.

10
A Survey on Neural Network-Based
Summarization Methods
Yue Dong
April, 2018

1 Introduction
arXiv:1804.04589v1 [cs.CL] 19 Mar 2018

Every day, enormous amounts of text are published online, and quick access to the major points
of these documents is critical for decision making. However, manually producing summaries
for such large numbers of documents in a timely manner is no longer feasible. Automatic text
summarization, the automated process of shortening a text while preserving the main ideas of
the document(s), has consequently become popular.
Up until recently, text summarization was dominated by unsupervised information retrieval
models. In 2014, [Kågebäck et al., 2014] demonstrated that the neural-based continuous vector
models are promising for text summarization. This marked the beginning of the widespread
use of neural network-based text summarization models, because of their superior performance
compared to the traditional techniques.
The aim of this literature review is to survey the recent work on neural-based models in
automatic text summarization. This survey starts with the general background on document
summarization (Section 2), including the factors by which summarization tasks may be classi-
fied, evaluation issues, and a brief history of the traditional summarization techniques. Section
3 examines in detail ten neural-based summarizers. Section 4 discusses the related techniques
and presents promising paths for future research, and Section 5 concludes the paper.

2 Background
2.1 Summarization Factors
According to [Jones et al., 1999], text summarization tasks can be defined and classified by the
following factors: input, purpose and output.

2.1.1 Input Factors


Single-document vs. multi-document: [Jones et al., 1999] defines this factor as the unit
input parameter, which simply is the number of input documents that the summarization
system takes.
Monolingual, multilingual vs. cross-lingual: Monolingual summarizers produce
summaries in the same language as the input, while multilingual systems can handle
several languages, with each input-output pair in the same language. In contrast,
cross-lingual summarization systems operate on input-output pairs that are not
necessarily in the same language.

2.1.2 Purpose Factors
Informative vs. indicative: An indicative summary serves as a road map that conveys the
relevant contents of the original documents, so that readers can select documents that align
with their interests to read further. An indicative summary is not meant to be a substitute
for the source documents. In contrast, the purpose of an informative summary is to replace
the original documents as far as the important content is concerned.
Generic vs. user-oriented: This factor concerns the coverage of the original documents
conditioned on the potential readers of the summary. Generic systems create summaries which
consider all the information found in the documents. In contrast, user-oriented systems produce
personalized summaries that focus on the information in the source document(s) that is
consistent with a user query.
General purpose vs. domain-specific: General-purpose summarizers can be used across
any domain(s) with little or no modification. On the other hand, domain-specific systems are
designed for processing documents in a specific domain.

2.1.3 Output Factors


Extractive vs. abstractive: In relation to the source document(s), a summary can either
be extractive or abstractive. There is no clear agreement on the definition of the two. In this
literature review, the definition of [See et al., 2017] is adopted where an extractive summarizer
explicitly selects text snippets (words, phrases, sentences, etc.) from the source document(s),
while an abstractive summarizer generates novel text snippets to convey the most salient con-
cepts prevalent in the source document(s).

2.2 Evaluation of Summarization Systems


Evaluation is critical for developing summarization systems. However, what evaluation criteria
should be used for assessing summarization systems still remains unclear due to the subjective
aspect of what makes for a good summary. In general, existing evaluation techniques can be
split into either intrinsic or extrinsic [Jones et al., 1999]. Intrinsic methods directly evaluate
the outcome of a summarization system and extrinsic methods evaluate summaries based on
the performance of the down-stream tasks that the system summaries are used for.
The most prevalent intrinsic evaluation compares system-generated summaries (system
summaries) with human-created “gold” summaries (reference summaries). This allows the use
of quantitative measures such as precision and recall in ROUGE [Lin, 2004]. However,
the problem with ROUGE is that people usually disagree on what a “gold” summary should
be.
Evaluation methods such as Pyramid [Nenkova and Passonneau, 2004] address this problem
and assume that no single best gold summary exists. However, Pyramid is very expensive in
terms of the human involvement.
To date, no single best summarization evaluation method exists, and researchers
usually adopt the cheap automated evaluation metric ROUGE coupled with human ratings.

2.2.1 ROUGE [Lin, 2004]


Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of evaluation methods
that automatically determine the quality of a system summary by comparing it to human-
created summaries. ROUGE-N, ROUGE-L and ROUGE-SU are commonly used in the
summarization literature.

ROUGE-N computes the percentage of n-grams that overlap between the system and reference
summaries. It requires consecutive matches of words within n-grams (n must be defined and
fixed in advance), which is often too strict an assumption.
ROUGE-L computes the sum of the longest in-sequence matches of each reference sentence
to the system summary. It takes sentence-level word order into account and automatically
identifies the longest in-sequence word overlap without a pre-defined n.
ROUGE-SU measures the percentage of overlapping skip-bigrams and unigrams. A skip-bigram
consists of two words from a sentence, in their sentence order, with an arbitrary gap between
them. Applying skip-bigrams without any constraint on the distance between the words usually
produces spurious bigram matches [Lin, 2004]. Therefore, ROUGE-SU is usually used with a
limited maximum skip distance, such as ROUGE-SU4 with a maximum skip distance of 4.
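
To make the ROUGE variants concrete, the following is a minimal Python sketch, not the official ROUGE toolkit. It assumes whitespace-tokenized, single-sentence summaries and reports recall only; the function names (`ngrams`, `rouge_n_recall`, `lcs_length`) and the toy sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system, reference, n):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    sys_counts, ref_counts = ngrams(system, n), ngrams(reference, n)
    overlap = sum((sys_counts & ref_counts).values())  # multiset intersection
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence (in order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Toy example (hypothetical sentences).
reference = "the cat sat on the mat".split()
system = "the cat lay on the mat".split()
rouge1 = rouge_n_recall(system, reference, 1)            # 5 of 6 unigrams match
rouge2 = rouge_n_recall(system, reference, 2)            # 3 of 5 bigrams match
rougeL = lcs_length(system, reference) / len(reference)  # LCS has 5 words
```

The single substituted word costs ROUGE-2 two bigrams but ROUGE-1 and ROUGE-L only one word each, which illustrates why a fixed n can be an overly strict matching assumption.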

2.2.2 Pyramid [Nenkova and Passonneau, 2004]


Instead of matching exact phrase units as in [Lin, 2004], Pyramid scores summaries
based on semantic matches of content units. It works under the assumption that there is no
single best summary, and therefore multiple reference summaries are necessary.
Given a document with n human-created reference summaries $r_1, \dots, r_n$, the Pyramid score
of an arbitrary summary s is roughly computed as follows:

1. Human annotators first identify all Summarization Content Units (SCUs)
in $r_1, \dots, r_n$ and in s, where an SCU is the smallest content unit that carries some
semantic meaning [Nenkova and Passonneau, 2004].

2. Each SCU is then associated with a weight by counting how many reference summaries
in the cluster contain this SCU.

3. Suppose summary s is annotated with k SCUs with weights $w_1, \dots, w_k$. The pyramid
score of s is computed as $\sum_{i=1}^{k} w_i / S_{optimal}$, where $S_{optimal}$ is the sum of
the k largest SCU weights.
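
The scoring steps above can be sketched in a few lines of Python. This is only an illustration: it assumes the human annotation of step 1 has already been done and that SCUs are given as plain string labels; `pyramid_score` and the toy labels are hypothetical names, not part of the original method description.

```python
from collections import Counter

def pyramid_score(reference_scus, summary_scus):
    """reference_scus: list of sets of SCU labels, one set per reference summary.
    summary_scus: set of SCU labels annotated in the candidate summary."""
    # Step 2: weight of an SCU = number of reference summaries containing it.
    weights = Counter(scu for ref in reference_scus for scu in ref)
    # Step 3: total weight of the k SCUs found in the candidate summary.
    k = len(summary_scus)
    observed = sum(weights[scu] for scu in summary_scus)
    # S_optimal: sum of the k largest SCU weights.
    optimal = sum(sorted(weights.values(), reverse=True)[:k])
    return observed / optimal if optimal else 0.0

# Toy example: weights are A=3, B=2, C=1, D=1.
references = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
score = pyramid_score(references, {"A", "C"})  # (3 + 1) / (3 + 2)
```

An SCU that appears in the candidate but in no reference gets weight 0 here, so it counts toward k without adding to the observed weight.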

2.3 Summarization Techniques


2.3.1 Scope of this review
In this literature review, we primarily consider neural-based extractive and abstractive
summarization techniques with the following factors: single document, English, informative,
generic and general purpose. As far as I know, related surveys either investigate traditional
models [Afantenos et al., 2005, Das and Martins, 2007, Nenkova et al., 2011] or give few
details on neural-based summarizers [Gambhir and Gupta, 2017].

2.3.2 Brief History of Pre-Neural Networks Era


Extractive models Most early works on single-document extractive summarization employ
statistical techniques based on the “Edmundsonian paradigm” [Afantenos et al., 2005]. Such
algorithms rank each sentence based on its relation to the other sentences by using pre-defined
formulas¹, such as the sum of frequencies of significant words (Luhn algorithm [Luhn, 1958]);
the overlap rate with the document title (PyTeaser [Xu, 2004]); the correlation with salient
concepts/topics (Latent Semantic Analysis [Gong and Liu, 2001]); and the sum of weighted
similarities to other sentences (TextRank [Mihalcea and Tarau, 2004]).
¹ These formulas usually do not contain hyper-parameters, and therefore training is not required.

Later works on text summarization address the problem by creating sentence representations
of the documents and applying machine learning algorithms. These models rely on manually
selected features and train supervised models to classify whether to include a sentence in
the summary. For example, [Wong et al., 2008] extracted surface, content, event and relevance
features for the sentence representation, and used Support Vector Machines (SVM) and Naïve
Bayes models for the classification. In addition, sequential models such as Hidden Markov
Models (HMMs) [Conroy and O’leary, 2001] were proposed to improve the results by considering
the sentence order in the documents.

Abstractive models The core of abstractive summarization techniques is to identify the
main ideas in the documents and encode them into feature representations. These encoded
features are then passed to natural language generation (NLG) systems, such as the one proposed
in [Reiter and Dale, 1997], for summary generation.
Most of the early work on abstractive summarization uses a semi-manual process to identify
the main ideas of the document(s). Prior knowledge such as scripts and templates is usually
used to produce summaries. Thus, the abstractive summary is produced through slot filling
and simple smoothing techniques such as in [DeJong, 1982, Radev and McKeown, 1998].

3 Neural-Based Summarization Techniques


Since the rise of deep learning, neural-based summarizers have attracted considerable
attention for automatic summarization. Compared to the traditional models, neural-based models
achieve better performance with less human involvement if the training data is abundant. In
this section, five extractive and five abstractive neural-based models are examined in detail.
Most neural-based summarizers use the following pipeline: 1) words are transformed to
continuous vectors, called word embeddings, by a look-up table; 2) sentences/documents are
encoded as continuous vectors using the word embeddings; 3) sentence/document represen-
tations (sometimes also word embeddings) are then fed to a model for selection (extractive
summarization) or generation (abstractive summarization).
Neural networks can be used in any of the above three steps. In step 1, we can use neural
networks to obtain pre-learned look-up tables (such as Word2Vec, CW vectors, and GloVe).
In step 2, neural networks, such as convolutional neural networks (CNNs) or recurrent neural
networks (RNNs), can be used as encoders for extracting sentence/document features. In step 3,
neural network models can be used as regressors for ranking/selection (extraction) or decoders
for generation (abstraction).

CNNs and RNNs: CNNs and RNNs are commonly used in neural-based summarizers.
Both CNNs and RNNs serve the same purpose: transforming a sequence of word embeddings
$x_1, \dots, x_T \in \mathbb{R}^d$ into a vector (sentence representation) $s \in \mathbb{R}^h$.

• CNNs achieve this purpose by using h filters and sliding them over the input sequence.
Each filter performs local convolution² on the sub-sequences of the input to obtain a
feature map (one scalar per position), and a global max-pooling-over-time then reduces
the feature map to a single scalar. The scalars from the h filters are concatenated into
the sequence representation vector $s \in \mathbb{R}^h$.
² The convolution operation used here is basically element-wise matrix multiplication followed by a summation.

• RNNs achieve this purpose by introducing time-dependent neural networks. At time
step t, an RNN computes a hidden state vector $h_t$, which is obtained by a non-linear
transformation of two inputs, the previous hidden state $h_{t-1}$ and the current word
input $x_t$:
$$h_t = f(h_{t-1}, x_t).$$
The most basic RNN is called the Elman RNN:
$$h_t = \sigma(W_1 h_{t-1} + W_2 x_t).$$

Two other popular RNNs, which address the problem of long-term dependencies by adding
extra parameters, are as follows:

Gated Recurrent Unit (GRU):
$$z_t = \sigma(W_1 h_{t-1} + W_3 x_t)$$
$$r_t = \sigma(W_2 h_{t-1} + W_4 x_t)$$
$$\tilde{h}_t = \tanh(W_5 (r_t \odot h_{t-1}) + W_6 x_t)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Long short-term memory (LSTM):
$$i_t = \sigma(W_1 h_{t-1} + W_5 x_t)$$
$$f_t = \sigma(W_2 h_{t-1} + W_6 x_t)$$
$$o_t = \sigma(W_3 h_{t-1} + W_7 x_t)$$
$$c'_t = \tanh(W_4 h_{t-1} + W_8 x_t)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot c'_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\odot$ denotes element-wise multiplication and the $W_i$ are matrices with the
corresponding dimensions. The last hidden state $h_T$ is usually used as the sequence
representation $s = h_T \in \mathbb{R}^h$.
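
As a rough illustration of the two encoders, the NumPy sketch below maps a toy sequence of embeddings to a vector in $\mathbb{R}^h$, once with h convolution filters plus max-pooling over time and once with the GRU equations above. The sizes, the random weights and the function names are assumptions made only for this example; biases are omitted, as in the equations.

```python
import numpy as np

def cnn_encode(x, filters):
    """x: (T, d) word embeddings; filters: (h, width, d). Returns s in R^h."""
    width = filters.shape[1]
    # All sub-sequences of `width` consecutive words: (T - width + 1, width, d).
    windows = np.stack([x[t:t + width] for t in range(len(x) - width + 1)])
    # Element-wise product followed by a sum: one scalar per window and filter.
    feature_maps = np.einsum("twd,hwd->th", windows, filters)
    return feature_maps.max(axis=0)  # max-pooling over time

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(x, W):
    """x: (T, d). W1, W2, W5 have shape (h, h); W3, W4, W6 have shape (h, d)."""
    h_t = np.zeros(W["W1"].shape[0])
    for x_t in x:
        z = sigmoid(W["W1"] @ h_t + W["W3"] @ x_t)  # update gate
        r = sigmoid(W["W2"] @ h_t + W["W4"] @ x_t)  # reset gate
        h_tilde = np.tanh(W["W5"] @ (r * h_t) + W["W6"] @ x_t)
        h_t = (1 - z) * h_t + z * h_tilde
    return h_t  # s = h_T

rng = np.random.default_rng(0)
T, d, h, width = 6, 4, 3, 2  # toy sizes, chosen arbitrarily
x = rng.normal(size=(T, d))
filters = rng.normal(size=(h, width, d))
W = {k: rng.normal(scale=0.5, size=(h, h if k in ("W1", "W2", "W5") else d))
     for k in ("W1", "W2", "W3", "W4", "W5", "W6")}
s_cnn, s_rnn = cnn_encode(x, filters), gru_encode(x, W)
```

Both calls return a vector of length h regardless of the sequence length T, which is the property the summarizers in this section rely on.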

3.1 Extractive Models


Extractive summarizers, which are selection-based methods, need to solve the following two
critical challenges: 1) how to represent sentences; 2) how to select the most appropriate
sentences, taking into account coverage and redundancy.
In this section, we review five extractive neural-based summarizers in chronological order.
Each summarization system is presented based on its sentence representation model and its
sentence selection model. At the end of this section, the techniques used in the extractive
neural-based models are summarized and the models’ performance is compared.

3.1.1 Continuous Vector Space Models [Kågebäck et al., 2014]


Sentence Representation [Kågebäck et al., 2014] proposes to represent sentences as
continuous vectors that are obtained either by adding the word embeddings or by using an
unfolding Recursive Auto-encoder (RAE) on the word embeddings. The RAE combines two text
units into one in a recursive manner, until only one vector (the sentence representation) is left.
The RAE is trained in an unsupervised manner by backpropagation with the
self-reconstruction errors. The pre-computed word embeddings from Collobert and Weston’s model
(CW vectors) or Mikolov et al.’s model (W2V vectors) are used directly without fine-tuning.

Sentence Selection [Kågebäck et al., 2014] formulates the task of choosing a summary S as
an optimization problem that maximizes a linear combination of the coverage of the input
text L and the diversity of the sentences R:
$$F(S) = L(S) + \lambda R(S) \qquad (1)$$
where $\lambda$ is the trade-off between coverage and diversity.

According to [Kågebäck et al., 2014], this optimization problem is NP-hard. However, there
exist fast, scalable approximation algorithms with theoretical guarantees if the objective
function is submodular³. The authors choose two submodular functions, which are computed
based on sentence similarities, as the diversity function and the coverage function, respectively.
The objective function is therefore submodular, and an approximation optimization algorithm
described in [Kågebäck et al., 2014] is used for selecting the sentences.
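
The idea of greedily maximizing a submodular objective of the form (1) can be sketched as follows. This is not the algorithm or the exact functions used by the authors: the facility-location coverage term, the cluster-count diversity term, the sentence-count budget and all names are assumptions chosen only because they are simple monotone submodular functions of sentence similarities.

```python
def coverage(S, sim):
    """Assumed L(S): each sentence is covered by its most similar selected sentence."""
    return sum(max((sim[i][j] for j in S), default=0.0) for i in range(len(sim)))

def diversity(S, cluster):
    """Assumed R(S): number of distinct sentence clusters represented in S."""
    return len({cluster[j] for j in S})

def greedy_select(sim, cluster, budget, lam=1.0):
    """Repeatedly add the sentence with the largest marginal gain in F = L + lam * R."""
    def F(S):
        return coverage(S, sim) + lam * diversity(S, cluster)

    S = []
    while len(S) < budget:
        candidates = [j for j in range(len(sim)) if j not in S]
        if not candidates:
            break
        S.append(max(candidates, key=lambda j: F(S + [j]) - F(S)))
    return S

# Toy similarity matrix: sentences 0-1 and 2-3 form two near-duplicate pairs.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
cluster = [0, 0, 1, 1]  # hypothetical cluster labels for the diversity term
S = greedy_select(sim, cluster, budget=2)
```

Because of diminishing returns, the second pick comes from the pair that is not yet covered rather than from the near-duplicate of the first pick.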

3.1.2 CNNLM [Yin and Pei, 2015]


Sentence Representation [Yin and Pei, 2015] uses convolutional neural networks (CNNs),
similar to the basic CNN model we described previously, on pre-trained word embeddings to
obtain the sentence representation.
The learnable parameters (including the word embeddings) in the CNN are trained by
unsupervised learning. Noise-contrastive estimation (NCE) [Mnih and Teh, 2012] is used
as the cost function. With this cost function, the model is essentially trained as a language
model (LM): it learns to discriminate between true next words and noise words.

Sentence Selection As in [Kågebäck et al., 2014], the authors frame sentence
selection as a direct optimization problem with the following objective function:
$$Q(S) = \alpha \sum_{i \in S} p_i^2 - \sum_{i,j \in S} p_i M_{i,j} p_j. \qquad (2)$$

Here, the matrix M is obtained by calculating the pairwise cosine similarities of the learned
sentence representations. The prestige vector p is derived by using the PageRank algorithm on
M.
The goal is to find a summary S (a set of sentences) that maximizes the above objective
function. Fortunately, equation (2) is also submodular (proof in [Yin and Pei, 2015]).
Therefore, as stated in [Kågebäck et al., 2014], a near-optimal solution can be found
efficiently, and the corresponding algorithm is presented in [Yin and Pei, 2015].
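
A small NumPy sketch of objective (2) is given below. It is an illustration under stated assumptions, not the authors' implementation: the prestige vector is computed by a plain power-iteration PageRank with an assumed damping factor of 0.85, the selection is a simple greedy loop with a sentence-count budget, and the value of $\alpha$ and the toy matrix are hypothetical.

```python
import numpy as np

def pagerank(M, damping=0.85, iters=100):
    """Power iteration on the column-normalized similarity matrix M."""
    n = len(M)
    P = M / M.sum(axis=0, keepdims=True)  # column-stochastic transition matrix
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = (1 - damping) / n + damping * P @ p
    return p

def Q(S, p, M, alpha):
    """Objective (2): prestige reward minus pairwise similarity penalty."""
    if not S:
        return 0.0
    idx = np.array(S, dtype=int)
    ps = p[idx]
    return float(alpha * np.sum(ps ** 2) - ps @ M[np.ix_(idx, idx)] @ ps)

def greedy(p, M, alpha, budget):
    """Add the sentence with the largest positive gain in Q until the budget is hit."""
    S = []
    while len(S) < budget:
        gains = {j: Q(S + [j], p, M, alpha) - Q(S, p, M, alpha)
                 for j in range(len(p)) if j not in S}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.append(best)
    return S

# Toy cosine-similarity matrix: sentences 0 and 1 are near-duplicates.
M = np.array([[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
p = pagerank(M)
S = greedy(p, M, alpha=2.0, budget=2)
```

Since the double sum in (2) includes the terms with i = j and a cosine self-similarity is 1, a single sentence only has positive value when $\alpha > 1$, which is why the toy example uses $\alpha = 2$. The penalty term keeps the two near-duplicate sentences from being selected together.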

3.1.3 PriorSum [Cao et al., 2015]


Sentence Representation PriorSum uses the CNN-learned features concatenated with
document-independent features as the sentence representation. Three document-independent
features are used: 1) sentence position; 2) averaged term frequency of words in the sentence based
on the document; 3) averaged term frequency of words in the sentence based on the cluster
(multi-document summarization).
As in [Yin and Pei, 2015], CNNs with multiple filters are used to capture sentence
features. However, PriorSum employs a deeper and more complicated CNN. The CNN used
in PriorSum has multiple layers with alternating convolution and pooling operations. The
filters in the convolution layers have different window sizes, and two-stage max-over-time
pooling operations are performed in the pooling layers. The parameters of this CNN are updated
by applying the diagonal variant of AdaGrad with mini-batches as described in [Yin and Pei, 2015].

Sentence Selection Unlike the previous two extractive neural-based models, PriorSum is a
supervised model that requires the gold standard summaries during training. PriorSum follows
the traditional supervised extractive framework: it first ranks each sentence and then selects
the top k ranked non-redundant sentences as the final summary.
³ A function F is called submodular on the set S if, for all $s \in S$, $A \subset B \subset S \setminus \{s\}$
implies $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$. This condition is also called the
diminishing-returns property.

The authors frame the sentence ranking process as a regression problem. During training,
each sentence in the document is associated with the ROUGE-2 score (stopwords removed) with
respect to the gold standard summary. Then a linear regression model is trained to estimate
these ROUGE-2 scores by updating the regression weights.
During testing, non-redundant sentences are selected by a simple greedy algorithm. The
greedy selection algorithm first ranks all sentences with more than 8 words in descending order
based on the estimated informative scores. The top k sentences are then selected in order as
long as the sentence is not redundant with respect to the current summary. A sentence is
considered non-redundant with respect to a summary if more than 50% of its words do not
appear in the summary.
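
The greedy selection step can be sketched as below. The sketch assumes that the regression scores are already available and reads "top k" as "keep going down the ranking until k non-redundant sentences have been collected"; the function name, parameter names and toy data are illustrative.

```python
def select_sentences(sentences, scores, k, min_words=8, new_word_ratio=0.5):
    """sentences: list of token lists; scores: estimated scores from the regressor."""
    # Rank only the sentences with more than `min_words` words, best score first.
    ranked = sorted(
        (i for i, s in enumerate(sentences) if len(s) > min_words),
        key=lambda i: scores[i],
        reverse=True,
    )
    summary, summary_words = [], set()
    for i in ranked:
        if len(summary) == k:
            break
        words = sentences[i]
        new = sum(1 for w in words if w not in summary_words)
        # Non-redundant: more than 50% of the words are not yet in the summary.
        if new / len(words) > new_word_ratio:
            summary.append(i)
            summary_words.update(words)
    return summary

# Toy data: sentence 1 repeats sentence 0, and sentence 3 is too short.
sentences = [
    "a b c d e f g h i".split(),
    "a b c d e f g h z".split(),
    "q r s t u v w x y".split(),
    "short one".split(),
]
scores = [0.9, 0.8, 0.5, 1.0]
selected = select_sentences(sentences, scores, k=2)
```

In the toy data, the highest-scoring sentence is skipped because it is too short, and the second-ranked long sentence is skipped because only one of its nine words is new.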

3.1.4 NN-SE [Cheng and Lapata, 2016]


Sentence Representation In [Cheng and Lapata, 2016], sentence representations are
obtained by using a CNN followed by an RNN. The CNN extractor, which is similar to the one
in 3.1.2, has multiple feature maps with different window sizes. Once the sentence representations
$(s_1, \dots, s_T)$ are obtained by the CNN sentence extractor, they are fed into an LSTM
encoder. The LSTM’s hidden states $(h_1, \dots, h_T)$ are then used as the final sentence
representations. Compared to $(s_1, \dots, s_T)$, the authors believe that $(h_1, \dots, h_T)$ capture
sentence dependency information and are therefore better suited as sentence representations.

Sentence Selection Similar to [Cao et al., 2015]’s work, NN-SE is a supervised model that
first scores the sentences and then selects them based on the estimated scores. Instead of using a
simple linear regressor as in [Cao et al., 2015], NN-SE utilizes an LSTM decoder with a sigmoid
layer (equation 4) for scoring sentences. During training, the ground truth labels are given (1
for sentences included in the reference summary and 0 otherwise) and the decoder is trained to
label sentences sequentially by zeros and ones.
Given the vectors $(s_1, \dots, s_T)$ obtained by the CNN and the LSTM encoder’s hidden states
$(h_1, \dots, h_T)$, the decoder’s hidden states $(\bar{h}_1, \dots, \bar{h}_T)$ are computed as:
$$\bar{h}_t = \mathrm{LSTM}(p_{t-1} s_{t-1}, \bar{h}_{t-1}) \qquad (3)$$
where $p_{t-1}$ is the probability that the decoder assigns to including the previous sentence
in the summary. The binary decision of whether to include sentence t is modeled by the
following sigmoid layer:
$$p(y(t) = 1 \mid D) = \sigma(\mathrm{MLP}(\bar{h}_t : h_t)) \qquad (4)$$
where MLP is a multi-layer neural network and $\bar{h}_t : h_t$ denotes the concatenation of the
two hidden states.

Joint Training with a Large-Scale Dataset NN-SE is a sequence-to-sequence model with
a CNN+RNN encoder and an LSTM+sigmoid decoder. The encoder (sentence representation
model) and the decoder (sentence selection model) can be jointly trained by stochastic
gradient descent (SGD), with the objective of minimizing the negative log-likelihood
(NLL):
$$-\log p(y \mid D, \theta) = -\sum_{i=1}^{m} y_i \log p(y_i \mid D, \theta). \qquad (5)$$
Training a sequence-to-sequence summarizer requires a large-scale dataset with extractive
labels, i.e., documents with sentences labeled as summary-worthy or not. The authors created a
large-scale dataset, the DailyMail dataset, with about 200K training examples. Each data
instance contains an extractive reference summary that is obtained by labeling sentences based
on a set of rules such as sentence position and n-gram overlap.

3.1.5 SummaRuNNer [Nallapati et al., 2017]
Sentence Representation SummaRuNNer employs a two-layer bi-directional RNN for sentence
and document representations. The first layer of the RNN is a bi-directional GRU that
runs at the word level: it takes the word embeddings in a sentence as inputs and produces a set
of hidden states. These hidden states are averaged into a vector, which is used as the sentence
representation. The second layer of the RNN is also a bi-directional GRU, and it runs at the
sentence level by taking the sentence representations obtained by the first layer as inputs. The
hidden states of the second layer are then combined into a vector d (document representation)
through a non-linear transformation.

Sentence Selection The authors frame sentence selection as a sequential
sentence-labeling problem, similar to the setting of [Cheng and Lapata, 2016]. Unlike
[Cheng and Lapata, 2016], instead of using another RNN as the decoder, SummaRuNNer
uses the hidden states $(h_1, \dots, h_m)$ from the second layer of the encoder RNN directly for
the binary decision (modeled by a sigmoid function):

$$P(y_t = 1 \mid h_t, s_t, d) = \sigma(w_c h_t + h_t^T W_1 d - h_t^T W_2 \tanh(s_t) + b) \qquad (6)$$

where b includes the information of the sentence’s absolute and relative position, as well as the
bias. $s_t$ can be viewed as a “soft” summary representation that is computed as the running
weighted sum of sentence representations until time t: $s_t = \sum_{i=1}^{t-1} h_i P(y_i = 1 \mid h_i, s_i, d)$.
The sigmoid decision layer (6) and the two-layer encoder RNN (GRUword+GRUsent) are
jointly trained by SGD with an objective function similar to (5) in [Cheng and Lapata, 2016].
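
The sequential decision in (6) can be sketched in NumPy as follows. The sketch assumes the sentence-level hidden states and the document vector are already computed, uses random toy weights, and simplifies b to a scalar bias (the positional terms mentioned above are left out); the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def summarunner_probs(H, d, w_c, W1, W2, b):
    """H: (m, n) sentence-level hidden states h_1..h_m; d: (n,) document vector."""
    s = np.zeros(H.shape[1])  # s_1 = 0: nothing has been summarized yet
    probs = []
    for h_t in H:
        content = w_c @ h_t                   # w_c h_t
        salience = h_t @ W1 @ d               # h_t^T W_1 d
        redundancy = h_t @ W2 @ np.tanh(s)    # h_t^T W_2 tanh(s_t)
        p_t = sigmoid(content + salience - redundancy + b)
        probs.append(p_t)
        s = s + p_t * h_t                     # running weighted sum of sentences
    return np.array(probs)

rng = np.random.default_rng(0)
m, n = 4, 3  # toy sizes: 4 sentences, hidden size 3
H = rng.normal(size=(m, n))
d = rng.normal(size=n)
w_c = rng.normal(size=n)
W1, W2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
b = 0.1
probs = summarunner_probs(H, d, w_c, W1, W2, b)
```

For the first sentence the soft summary is empty, so the subtracted term vanishes and the decision depends only on the first two terms and b.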

Comparison of the Extractive Models and Their Performance


Table 1 compares and summarizes the five extractive models discussed above. Almost all of
these models are evaluated on the DUC2002 dataset, and we therefore compare their performance
on the DUC2002 dataset in Table 2.

Table 1: Comparison of the techniques used in the extractive summarizers

| Model | Sentence representation | Training of sentence representation | Sentence selection | Training of sentence selection |
|---|---|---|---|---|
| Continuous Vector Space models (2014) | adding word embeddings or using RAE | no training for adding; RAE is trained unsupervised with REs | direct optimization on submodular objectives | no training |
| CNNLM (2015) | CNN | unsupervised learning with NCE | direct optimization on submodular objectives | no training |
| PriorSum (2015) | CNN | unsupervised learning with diagonal variant of AdaGrad | sentence ranking by linear regression | supervised learning with ROUGE-2 scores |
| NN-SE (2016) | CNN+RNN | supervised co-training with decoder | sentence ranking from LSTM+sigmoid | supervised learning with SGD and NLL |
| SummaRuNNer (2017) | GRU+GRU | supervised co-training with decoder | sentence ranking from sigmoid | supervised learning with SGD and NLL |

RAE: recursive auto-encoder, REs: reconstruction errors, NCE: noise-contrastive estimation

Table 2: ROUGE F-scores of the extractive summarizers on the DUC2002 dataset

| Model | Extra data used for training (in addition to DUC2002) | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU |
|---|---|---|---|---|---|
| Continuous Vector Space models (2014) | pre-trained W2V and CW word embeddings | - | - | - | - |
| CNNLM (2015) | pre-trained W2V word embeddings | 51.0 | 27.0 | - | 29.4 |
| PriorSum (2015) | 1. pre-trained CW word embeddings 2. Gigaword for CNN | 36.63 | 8.97 | - | - |
| NN-SE (2016) | DailyMail | 47.4 | 23.0 | 43.5 | - |
| SummaRuNNer (2017) | 1. pre-trained GloVe word embeddings 2. DailyMail | 46.6 | 23.1 | 43.0 | - |

3.2 Abstractive Models


Abstractive summarizers focus on capturing the meaning representation of the whole document
and then generate an abstractive summary based on this meaning representation. Therefore,
neural-based abstractive summarizers, which are generation-based methods, need to make the
following two decisions: 1) how to represent the whole document with an encoder; 2) how to
generate the word sequence with a decoder.
In this section, we review five abstractive neural-based summarizers in chronological order.
Each summarization system is presented based on its encoder and its decoder. At the end of
this section, the techniques used in the abstractive neural-based models are summarized and
the models’ performance is compared.

3.2.1 ABS [Rush et al., 2015]


Encoder [Rush et al., 2015] proposes three encoder structures to capture the meaning
representation of a document. The common goal of these encoders is to transform a sequence of
word embeddings $x_1, \dots, x_T$ into a vector d, which is used as the meaning representation of the
document.
1. Bag-of-Words Encoder: The first encoder computes the average of the word
embeddings in the sequence: $d_1 = \frac{1}{T} \sum_{i=1}^{T} x_i$. Word order is not preserved by
this bag-of-words encoder.
2. Convolutional Encoder: This encoder uses a CNN model with multiple alternating
convolution and 2-element-max-pooling layers. In each layer, the convolution operations extract
a sequence of feature vectors $(u_1, \dots, u_l)$, and the number of these feature vectors is reduced
by a factor of two with the 2-element-max-pooling: $\bar{u}_i = \tanh(\max\{u^l_{2i-1}, u^l_{2i}\})$. After L layers
of convolution and max-pooling, a max-pooling-over-time is performed to obtain the document
representation $d_2$.
3. Attention-Based Encoder: This encoder produces a document representation at each
time step based on the previous C words (context) generated by the decoder. At time step
t, given the inputs’ word embeddings $X = [x_1, \dots, x_m]$ and the decoder’s context
$y_C^{t-1} = \mathrm{concat}(y_{t-C}, \dots, y_{t-1})$, the encoder produces a document representation
(for time step t) as follows:
$$d_3^t = p^T X, \quad \text{where } p \in \mathbb{R}^m \text{ and } p \propto \exp(X P y_C^{t-1}).$$

Decoder [Rush et al., 2015] uses a feed-forward neural network-based language model (NNLM)
for estimating the probability distribution that generates the word at each time step t:

$$p(y_t \mid y_C^{t-1}, d_t) \propto \exp(W_1 h_t + W_2 d_t), \quad \text{where } h_t = \tanh(W_3 y_C^{t-1}).$$
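
One decoding step with the attention-based encoder and the NNLM decoder can be sketched in NumPy as below. The shapes are assumptions made for the example: X stores one input embedding per row, the context vector is the concatenation of C previous word embeddings, and all weights are random toy values; `abs_step` is an illustrative name.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def abs_step(X, y_ctx, P, W1, W2, W3):
    """X: (m, d) input embeddings; y_ctx: (C*e,) concatenated context embeddings."""
    p = softmax(X @ P @ y_ctx)  # attention weights over the m input words
    d_t = p @ X                 # document representation for this time step
    h_t = np.tanh(W3 @ y_ctx)   # NNLM hidden layer
    return softmax(W1 @ h_t + W2 @ d_t), p

rng = np.random.default_rng(0)
m, d, C, e, hidden, vocab = 4, 3, 2, 3, 5, 7  # toy sizes, chosen arbitrarily
X = rng.normal(size=(m, d))
y_ctx = rng.normal(size=C * e)
P = rng.normal(size=(d, C * e))
W1 = rng.normal(size=(vocab, hidden))
W2 = rng.normal(size=(vocab, d))
W3 = rng.normal(size=(hidden, C * e))
word_probs, attention = abs_step(X, y_ctx, P, W1, W2, W3)
```

The proportionality signs in the two formulas become softmax normalizations, so both the attention weights and the word distribution sum to one.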

Training In [Rush et al., 2015], the encoder and decoder are trained jointly in mini-batches.
Suppose $\{(x^{(1)}; y^{(1)}), \dots, (x^{(J)}; y^{(J)})\}$ are J input-summary pairs; then the negative
log-likelihood (NLL) loss based on the parameters $\theta$ is computed as:
$$\mathrm{NLL}(\theta) = -\sum_{j=1}^{J} \log p(y^{(j)} \mid x^{(j)}; \theta) = -\sum_{j=1}^{J} \sum_{t=1}^{T} \log p(y_t^{(j)} \mid x^{(j)}; \theta). \qquad (7)$$

The training objective is to minimize the NLL, which is achieved by using mini-batch stochastic
gradient descent.

3.2.2 RAS-LSTM and RAS-Elman [Chopra et al., 2016]


Encoder The CNN-based attentive encoder used in [Chopra et al., 2016] is similar to the
attentive encoder proposed by [Rush et al., 2015], except that the weights $\alpha_i$ are computed
based on the aggregated vectors obtained by a CNN model. At time step t, the attention weights
are calculated from the aggregated vectors $(z_1, \dots, z_T)$ and the decoder’s hidden state $h_t$:
$\alpha_{j,t} = \exp(z_j \cdot h_t) / \sum_{i=1}^{T} \exp(z_i \cdot h_t)$. These attention weights are then
combined with the inputs’ word embeddings to form the document representation $d_t$⁴:
$d_t = \sum_{j=1}^{T} \alpha_{j,t-1} x_j$.

Decoder [Chopra et al., 2016] replaces the NNLM model used in [Rush et al., 2015] with a
recurrent neural network. Instead of using only the previously generated C words for decoding
as in NNLM, the RNN decoder’s hidden state $h_t$ can keep the information of all the words
generated up to time t.
The authors propose two decoder models based on the Elman RNN and the LSTM⁵. In
addition to the previously generated word $y_{t-1}$ and the previous hidden state $h_{t-1}$, the Elman
RNN and the LSTM take the encoder’s context vector $d_t$ (document representation at t) as an
additional input. For example, the Elman RNN’s hidden state is computed as
$h_t = \sigma(W_1 y_{t-1} + W_2 h_{t-1} + W_3 d_t)$.
Once the decoder’s hidden state $h_t$ is computed, it is combined with the document
representation $d_t$ to decide which word to generate⁶ at time step t. The decision is modeled by a
softmax function, which gives the probability distribution over all the words in the dictionary:

$$P_t = \mathrm{softmax}(W_4 h_t + W_5 d_t)$$
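One decoding step of the Elman variant can be sketched as follows. This is an illustration under assumptions, not the code of [Chopra et al., 2016]: σ is taken to be the logistic sigmoid, the aggregated vectors `Z` are random stand-ins for the CNN outputs, and all dimensions are arbitrary.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def ras_elman_step(Z, X, y_prev, h_prev, W1, W2, W3, W4, W5):
    """Attention, hidden-state update and word distribution for one time step."""
    alpha = softmax(Z @ h_prev)     # attention weights from the previous hidden state
    d_t = alpha @ X                 # document representation (encoder context)
    h_t = sigmoid(W1 @ y_prev + W2 @ h_prev + W3 @ d_t)
    P_t = softmax(W4 @ h_t + W5 @ d_t)
    return h_t, d_t, P_t


rng = np.random.default_rng(2)
T, emb, hid, vocab = 5, 4, 3, 7              # illustrative sizes
Z = rng.normal(size=(T, hid))                # aggregated vectors z_1..z_T
X = rng.normal(size=(T, emb))                # input word embeddings
y_prev = rng.normal(size=emb)
h_prev = np.zeros(hid)                       # initial hidden state
W1, W2, W3 = (rng.normal(size=s) for s in [(hid, emb), (hid, hid), (hid, emb)])
W4, W5 = rng.normal(size=(vocab, hid)), rng.normal(size=(vocab, emb))

h_t, d_t, P_t = ras_elman_step(Z, X, y_prev, h_prev, W1, W2, W3, W4, W5)
```

With a zero initial hidden state the attention is uniform, so the first document representation is just the mean of the input embeddings.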

3.2.3 Hierarchical Attentive RNNs [Nallapati et al., 2016]


Encoder [Nallapati et al., 2016] proposes a feature-rich hierarchical attentive encoder based
on the bidirectional-GRU to represent the document.
Feature-rich inputs: The encoder takes the input vector obtained by concatenating the word
embedding with additional linguistic features. The additional linguistic features used in their
model are parts-of-speech (POS) tags, named-entity (NER) tags, term-frequency (TF) and
inverse document frequency (IDF) of the word. The continuous features (TF and IDF) are first
discretized into a fixed number of bins and then encoded into one-hot vectors as other discrete
features. All the one-hot vectors are then transformed into continuous vectors by embedding
matrices and these continuous vectors are concatenated into a single long vector, which is then
fed into the encoder.
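The construction of a feature-rich input vector can be sketched as below. The bin edges, the embedding sizes and the feature ids are all hypothetical values chosen for the example; indexing a row of an embedding table is equivalent to multiplying the embedding matrix by a one-hot vector.

```python
import numpy as np


def bin_index(value, edges):
    """Index of the bin that contains a continuous value (edges sorted ascending)."""
    return int(np.digitize(value, edges))


def feature_rich_input(word_vec, pos_id, ner_id, tf, idf, tables, tf_edges, idf_edges):
    """Concatenate the word embedding with the embedded linguistic features."""
    parts = [
        word_vec,
        tables["pos"][pos_id],
        tables["ner"][ner_id],
        tables["tf"][bin_index(tf, tf_edges)],
        tables["idf"][bin_index(idf, idf_edges)],
    ]
    return np.concatenate(parts)


rng = np.random.default_rng(3)
tf_edges, idf_edges = [1, 2, 5], [0.5, 1.0]   # hypothetical bin boundaries
tables = {
    "pos": rng.normal(size=(5, 2)),           # 5 POS tags
    "ner": rng.normal(size=(3, 2)),           # 3 NER tags
    "tf": rng.normal(size=(len(tf_edges) + 1, 2)),
    "idf": rng.normal(size=(len(idf_edges) + 1, 2)),
}
word_vec = rng.normal(size=4)

vec = feature_rich_input(word_vec, 1, 0, 3.0, 0.2, tables, tf_edges, idf_edges)
```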
⁴ $d_t$, the document representation at time step t, is also called the encoder’s context at time step t, which is
commonly denoted as $c_t$ in the literature.
⁵ How the Elman RNN and LSTM produce the hidden states is explained earlier, in the section on CNNs
and RNNs.
⁶ The words are generated from a pre-fixed dictionary.

Hierarchical attention: The hierarchical encoder has two RNNs with a structure similar to
that in [Nallapati et al., 2017]: one runs at the word level and one runs at the sentence level. The
hierarchical attention proposed by the authors re-weights the word-level attentions by the
corresponding sentence-level attention. The document representation $d_t$ is then obtained as
the weighted sum of the feature-rich input vectors.

Decoder [Nallapati et al., 2016] uses an RNN decoder based on a uni-directional GRU, which
works similarly to the decoder in [Chopra et al., 2016]. In addition, the following two mechanisms
are used in [Nallapati et al., 2016]’s decoder:

1. The large vocabulary trick (LVT): This trick reduces the computation time in the softmax
   layer by limiting the number of words the decoder can generate during training.
   It defines a small dictionary for each mini-batch during training. The dictionary
   contains only the words that appear in the source documents of that batch and the k most
   frequent words in the global dictionary.

2. Decoder/pointer switch: A pointer network, which directly copies words from the
   source, can improve the summaries’ quality by including rare words from the source
   documents. A pointer network can be modeled simply on the encoder’s attention
   weights, where the word with the largest weight is the word to copy. The decision
   of whether to copy or generate is controlled by a switch, which is modeled by a sigmoid
   function $P(s_i = 1) = \sigma(f(h_t, y_{t-1}, d_t))$.
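Both mechanisms are easy to illustrate with plain Python. The sketch below builds the per-batch dictionary of the large vocabulary trick and then applies a hard copy-or-generate decision. The toy sentences, the helper names, and the convention that a switch probability of at least 0.5 means "generate" are assumptions of this example, not details taken from [Nallapati et al., 2016].

```python
def lvt_vocabulary(batch_sources, global_vocab_by_freq, k):
    """Decoder dictionary for one mini-batch: source words plus the k most frequent words."""
    vocab = set(global_vocab_by_freq[:k])
    for tokens in batch_sources:
        vocab.update(tokens)
    return vocab


def emit_word(switch_prob, generated_word, attention, source_tokens):
    """Generate when the switch is on; otherwise copy the most-attended source word.

    Treating switch_prob >= 0.5 as "generate" is an assumption of this sketch.
    """
    if switch_prob >= 0.5:
        return generated_word
    best = max(range(len(source_tokens)), key=lambda i: attention[i])
    return source_tokens[best]


batch = [["the", "zyx", "fund", "rose"], ["shares", "of", "zyx", "fell"]]
frequent = ["the", "of", "a", "in", "and"]   # global dictionary sorted by frequency

vocab = lvt_vocabulary(batch, frequent, k=3)
copied = emit_word(0.2, "<unk>", [0.1, 0.7, 0.1, 0.1], batch[0])
generated = emit_word(0.9, "rose", [0.1, 0.7, 0.1, 0.1], batch[0])
```

Here the rare token "zyx" can still be produced through the pointer even when the generator would fall back to an unknown-word symbol.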

3.2.4 Pointer-Generator Networks [See et al., 2017]


Encoder The encoder of the Pointer-Generator network is simply a single-layer bidirectional
LSTM. It computes the document representation dt based on the attention weights and the
encoder’s hidden states, which is exactly the same as the encoder in [Chopra et al., 2016].

Decoder The basic building block of [See et al., 2017]’s decoder is a single-layer uni-directional
LSTM. In addition, a decoder/pointer switch similar to [Nallapati et al., 2016] is used for pointing.
Moreover, the authors propose a coverage mechanism for penalizing repeated attention
on already attended words. This is achieved by using a coverage vector $c^t$, which tracks the
attention that each word in the source document has received up to time t: $c^t = \sum_{t'=0}^{t-1} a^{t'}$. The
coverage vector is then used in the attention computation at time step t + 1, as well as in the
objective function (acting as a regularizer):

$$L_t = -\log p(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)$$

Here, $w_t^*$ is the true label at time step t and λ is a hyperparameter controlling the strength
of the coverage regularizer.
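The coverage penalty can be traced by hand on a small example. The sketch below accumulates the coverage vector and evaluates only the regularization term of the loss; the attention rows are made-up values.

```python
import numpy as np


def coverage_penalties(attentions, lam=1.0):
    """Per-step coverage penalty lam * sum_i min(a_i^t, c_i^t).

    attentions: (T_dec, n_src) array whose rows are attention distributions.
    Returns the penalties and the final coverage vector.
    """
    coverage = np.zeros(attentions.shape[1])
    penalties = []
    for a_t in attentions:
        penalties.append(lam * np.minimum(a_t, coverage).sum())
        coverage = coverage + a_t   # c^{t+1} = c^t + a^t
    return np.array(penalties), coverage


# The decoder attends to source word 0 twice, then to source word 1.
attentions = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
penalties, coverage = coverage_penalties(attentions)
```

Only the second step is penalized, because it is the only one that revisits an already attended word.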

3.2.5 Neural Intra-attention Model [Paulus et al., 2017]


Encoder [Paulus et al., 2017] also uses a bi-directional LSTM encoder for modeling the
document representation. The model is similar to the encoder in [See et al., 2017], except that the
attention scores are computed by linear transformations and a softmax function⁷, which is
called the intra-attention mechanism by the authors. $d_t$ is then computed based on these
intra-attentions and the encoder’s hidden states.
⁷ The attention scores in all other models we reviewed are computed by sigmoid functions followed by a
softmax function.

Decoder A uni-directional LSTM is used as the decoder in [Paulus et al., 2017]. In addition,
the authors employ an intra-attention mechanism on the decoder to prevent it from generating repeated
phrases: a decoder context vector $c_t$ is computed based on the intra-attentions over the already
generated sequence and then used as an additional input to the softmax layer that generates the output word.
A generator/pointer switch similar to the ones in [Nallapati et al., 2016] and [See et al., 2017]
is also employed in the decoder.

Hybrid Training Objectives In terms of the encoder-decoder model, [Paulus et al., 2017]
and [See et al., 2017] are very similar. What is novel in [Paulus et al., 2017] is how the
parameters of the model are updated: the authors use both stochastic gradient descent and
reinforcement learning to update the model parameters with a hybrid training objective.
Stochastic gradient descent (SGD) is used in abstractive summarization models
to minimize the negative log-likelihood of the ground-truth values during training, as
explained for the previous models [Rush et al., 2015, Chopra et al., 2016, Nallapati et al., 2016,
See et al., 2017]. We denote this NLL objective as $L_{ml}$. Using SGD to minimize $L_{ml}$ has two
shortcomings: 1) it creates a discrepancy between training and testing, since there are no
ground-truth values during testing; 2) optimizing this objective does not always correlate with a high
score on discrete evaluation metrics such as ROUGE.
Therefore, the authors propose to use another objective for training, based on the reinforcement
learning method REINFORCE:

$$L_{rl} = (r(\hat{y}) - r(y^s)) \sum_{t=1}^{T} \log p(y_t^s \mid y_1^s, \ldots, y_{t-1}^s, x)$$

where $y^s$ is obtained by sampling from $p(y_t^s \mid y_1^s, \ldots, y_{t-1}^s, x)$ at each decoding time step t. $\hat{y}$
acts as the REINFORCE baseline, which is obtained by performing a greedy selection rather
than sampling at each decoding time step. r(y) is the reward score for an output sequence y,
which is usually obtained by an automated evaluation method, such as ROUGE.
The authors noticed that optimizing $L_{rl}$ directly would lead to sequences with high ROUGE
scores that are ungrammatical. Therefore, a mixed training objective with hyperparameter γ
is used for balancing the ROUGE score and the readability of the generated sequence:

$$L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{ml}.$$
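A small numeric sketch shows how the two objectives combine. The rewards, log-probabilities, maximum-likelihood loss and the value of γ below are illustrative numbers only, not values reported by [Paulus et al., 2017].

```python
def rl_loss(reward_greedy, reward_sampled, sampled_log_probs):
    """Self-critical loss: (r(y_hat) - r(y_s)) * sum_t log p(y_t^s | ...)."""
    return (reward_greedy - reward_sampled) * sum(sampled_log_probs)


def mixed_loss(l_rl, l_ml, gamma):
    """Convex combination of the reinforcement and maximum-likelihood objectives."""
    return gamma * l_rl + (1.0 - gamma) * l_ml


# The sampled sequence scores higher than the greedy baseline (0.5 vs 0.3),
# so minimizing l_rl raises the log-probability of the sampled tokens.
l_rl = rl_loss(reward_greedy=0.3, reward_sampled=0.5, sampled_log_probs=[-1.0, -2.0])
l_mixed = mixed_loss(l_rl, l_ml=3.0, gamma=0.9)
```

When the sampled sequence scores below the baseline, the factor changes sign and the same update lowers the probability of that sample.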

Comparison of the Abstractive Models and Their Performance


Table 3 compares and summarizes the above five abstractive models. Two large-scale datasets –
the Gigaword dataset and the CNN/DailyMail dataset – are commonly used as the abstractive
summarization benchmarks. We therefore compare the five abstractive models’ performance
on these two datasets as in Table 4.

Table 3: Comparison of the techniques used in the abstractive summarizers

| models | encoder | decoder | training |
|---|---|---|---|
| ABS (2015) | 1. bag-of-words encoder, 2. CNN, 3. attention-based encoder | NNLM | SGD |
| RAS-LSTM and RAS-Elman (2016) | CNN + attention | Elman RNN or LSTM | SGD |
| Hierarchical Attentive RNNs (2016) | feature-rich bidirectional-GRU + hierarchical attention | GRU + LVT + pointer switch | SGD |
| Pointer-Generator Networks (2017) | bidirectional LSTM + attention | LSTM + pointer switch + coverage mechanism | SGD |
| Neural Intra-attention Model (2017) | bidirectional LSTM + intra-attention | LSTM + pointer switch + intra-attention | SGD + REINFORCE |

Table 4: ROUGE F-scores of the abstractive summarizers on the Gigaword (G) / CNN-DailyMail (C) datasets

| models | ROUGE-1 (G) | ROUGE-1 (C) | ROUGE-2 (G) | ROUGE-2 (C) | ROUGE-L (G) | ROUGE-L (C) | ROUGE-SU (G) | ROUGE-SU (C) |
|---|---|---|---|---|---|---|---|---|
| ABS (2015) | 29.78 | - | 11.89 | - | 26.97 | - | - | - |
| RAS-LSTM and RAS-Elman (2016) | 33.78 | - | 15.97 | - | 31.15 | - | - | - |
| Hierarchical Attentive RNNs (2016) | 35.30 | 35.46 | 16.64 | 13.30 | 32.62 | 32.65 | - | - |
| Pointer-Generator Networks (2017) | - | 39.53 | - | 17.28 | - | 36.38 | - | 29.4 |
| Neural Intra-attention Model (2017) | - | 39.87 | - | 15.82 | - | 36.90 | - | - |

4 Discussions and the Promising Paths for Future Research
4.1 Other Related Tasks and Techniques
4.1.1 Reinforcement Learning Methods for Sequence Prediction [Bahdanau et al., 2017]
[Paulus et al., 2017] shows a promising path for applying reinforcement learning (RL)
to abstractive summarization. [Paulus et al., 2017] applies REINFORCE, which is an unbiased
estimator with large variance, to sequence prediction in summarization.
In [Bahdanau et al., 2017], the authors apply the actor-critic algorithm, which is a biased
estimator with smaller variance, to machine translation. In addition to the policy network (an
encoder-decoder model), they introduce a critic network that is trained to predict the values of
output tokens. This critic network is based on a bidirectional GRU and is trained in a supervised
manner with the ground-truth labels.
The key difference between the REINFORCE algorithm and the actor-critic algorithm is which
rewards the actor uses to update its parameters. REINFORCE uses the overall reward from
the whole sequence and performs the update only after obtaining the whole trajectory. The
actor-critic algorithm uses the TD errors [Bahdanau et al., 2017] calculated based on the critic
network and can update the actor during the generation process. Compared to the REINFORCE
algorithm, the actor-critic method has lower variance and a faster convergence rate,
which makes it a promising algorithm for summarization.
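The contrast between the two learning signals can be illustrated with a toy trajectory. The sketch below uses the textbook one-step TD error, $\delta_t = r_t + \gamma V_{t+1} - V_t$, as a stand-in; it is an assumption of this example and not necessarily the exact quantity used in [Bahdanau et al., 2017]. The rewards and critic values are made-up numbers.

```python
def reinforce_weights(rewards):
    """REINFORCE: every step is weighted by the total reward, known only at the end."""
    total = sum(rewards)
    return [total] * len(rewards)


def td_errors(rewards, values, gamma=1.0):
    """One-step TD errors delta_t = r_t + gamma * V_{t+1} - V_t, with V = 0 after the end."""
    next_values = values[1:] + [0.0]
    return [r + gamma * v_next - v for r, v_next, v in zip(rewards, next_values, values)]


rewards = [0.0, 0.0, 1.0]   # reward arrives only with the last token
values = [0.2, 0.5, 0.8]    # critic's value estimates per step (illustrative)

whole_sequence_signal = reinforce_weights(rewards)
per_step_signal = td_errors(rewards, values)
```

The TD errors are available after each generated token, whereas the REINFORCE weights require the complete sequence.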

4.1.2 Text Simplification [Xu et al., 2015, Zhang and Lapata, 2017]
The goal of text simplification is to rewrite complex documents into simpler ones that are
easier to understand. This is usually achieved by three operations: splitting, deletion and
paraphrasing [Xu et al., 2015]. Text simplification can help improve the performance of many
natural language processing (NLP) tasks. For example, text simplification techniques can
transform long, complex sentences into ones that are more easily processed by automatic text
summarizers.
One challenge in developing text simplification models is the lack of datasets with parallel
complex/simple sentence pairs. [Xu et al., 2015] created a good-quality simplification dataset,
called the Newsela dataset, for text simplification tasks. Their analyses show
that the word distributions are significantly different in complex and simple texts. In addition,
the distributions of syntactic patterns are also very different. These findings indicate that a text
simplification model needs to consider both the semantic meaning of words and the syntactic
patterns of sentences.
[Zhang and Lapata, 2017] propose a sequence-to-sequence model with attentions based on
LSTMs for text simplification. This encoder-decoder model, called Deep REinforcement Sen-
tence Simplification (DRESS), is trained with the reinforcement learning method that optimizes
a task-specific discrete reward function. This discrete reward function encourages the outputs to
be simple, grammatical, and semantically related to the inputs. Experiments on three datasets
demonstrate that their model is promising for text simplification tasks.

4.2 Discussions
In summarization, one critical issue is to represent the semantic meanings of the sentences and
documents. Neural-based models display superior performance in automatically extracting
these feature representations. However, deep neural network models are not sufficiently
transparent, nor do they integrate prior knowledge well. More analysis and understanding of the
neural-based models are needed to exploit these models further.
In addition, the current neural-based models have the following limitations: 1) they are
unable to deal with sequences longer than a few thousand words due to the large memory
requirement of these models; 2) they are unable to work well on small-scale datasets due to
the large number of parameters these models have; 3) they are very slow to train due to the
complexity of the models.
There are many very interesting and promising directions for future research on text sum-
marization. We proposed two directions in this review: 1) using the reinforcement learning
approaches, such as the actor-critic algorithm, to train the neural-based models; 2) exploiting
techniques in text simplification to transform documents into simpler ones for summarizers to
process.

5 Conclusion
This survey presented the potential of neural-based techniques in automatic text summarization,
based on an examination of state-of-the-art extractive and abstractive summarizers.
Neural-based models are promising for text summarization in terms of performance when
large-scale datasets are available for training. However, many challenges with neural-based
models remain unsolved. We suggest future research directions such as adding reinforcement
learning algorithms and text simplification methods to the current neural-based models.

References
[Afantenos et al., 2005] Afantenos, S., Karkaletsis, V., and Stamatopoulos, P. (2005). Summarization from
medical documents: a survey. Artificial intelligence in medicine, 33(2):157–177.
[Bahdanau et al., 2017] Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and
Bengio, Y. (2017). An actor-critic algorithm for sequence prediction.
[Cao et al., 2015] Cao, Z., Wei, F., Li, S., Li, W., Zhou, M., and Wang, H. (2015). Learning summary prior
representation for extractive summarization. In ACL.
[Cheng and Lapata, 2016] Cheng, J. and Lapata, M. (2016). Neural summarization by extracting sentences and
words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 484–494, Berlin, Germany. Association for Computational Linguistics.
[Chopra et al., 2016] Chopra, S., Auli, M., and Rush, A. M. (2016). Abstractive sentence summarization with
attentive recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego Cali-
fornia, USA, June 12-17, 2016, pages 93–98.
[Conroy and O’leary, 2001] Conroy, J. M. and O’leary, D. P. (2001). Text summarization via hidden markov
models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development
in information retrieval, pages 406–407. ACM.
[Das and Martins, 2007] Das, D. and Martins, A. F. (2007). A survey on automatic text summarization.
[DeJong, 1982] DeJong, G. F. (1982). An overview of the frump system. In Lehnert, W. G. and Ringle, M. H.,
editors, Strategies for Natural Language Processing, pages 149–176. Lawrence Erlbaum.
[Gambhir and Gupta, 2017] Gambhir, M. and Gupta, V. (2017). Recent automatic text summarization tech-
niques: a survey. Artificial Intelligence Review, 47(1):1–66.
[Gong and Liu, 2001] Gong, Y. and Liu, X. (2001). Generic text summarization using relevance measure and
latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 19–25. ACM.
[Jones et al., 1999] Jones, K. S. et al. (1999). Automatic summarizing: factors and directions. Advances in
automatic text summarization, pages 1–12.
[Kågebäck et al., 2014] Kågebäck, M., Mogren, O., Tahmasebi, N., and Dubhashi, D. (2014). Extractive sum-
marization using continuous vector space models. In Proceedings of the 2nd Workshop on Continuous Vector
Space Models and their Compositionality (CVSC)@ EACL, pages 31–39.
[Lin, 2004] Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Marie-
Francine Moens, S. S., editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop,
pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
[Luhn, 1958] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of research and
development, 2(2):159–165.
[Mihalcea and Tarau, 2004] Mihalcea, R. and Tarau, P. (2004). Textrank: Bringing order into text. In Pro-
ceedings of the 2004 conference on empirical methods in natural language processing.
[Nallapati et al., 2017] Nallapati, R., Zhai, F., and Zhou, B. (2017). SummaRuNNer: A recurrent neural
network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages
3075–3081.
[Nallapati et al., 2016] Nallapati, R., Zhou, B., dos Santos, C. N., Gülçehre, Ç., and Xiang, B. (2016). Abstrac-
tive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL
Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12,
2016, pages 280–290.
[Nenkova et al., 2011] Nenkova, A., McKeown, K., et al. (2011). Automatic summarization. Foundations and
Trends® in Information Retrieval, 5(2–3):103–233.

[Nenkova and Passonneau, 2004] Nenkova, A. and Passonneau, R. J. (2004). Evaluating content selection in
summarization: The pyramid method. In Human Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA,
May 2-7, 2004, pages 145–152.

[Paulus et al., 2017] Paulus, R., Xiong, C., and Socher, R. (2017). A deep reinforced model for abstractive
summarization. arXiv preprint arXiv:1705.04304.
[Radev and McKeown, 1998] Radev, D. R. and McKeown, K. R. (1998). Generating natural language sum-
maries from multiple on-line sources. Computational Linguistics, 24(3):470–500.
[Reiter and Dale, 1997] Reiter, E. and Dale, R. (1997). Building applied natural language generation systems.
Nat. Lang. Eng., 3(1):57–87.
[Rush et al., 2015] Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive
sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 379–389.
[See et al., 2017] See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: Summarization with
pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1073–1083.
[Wong et al., 2008] Wong, K.-F., Wu, M., and Li, W. (2008). Extractive summarization using supervised and
semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics-
Volume 1, pages 985–992. Association for Computational Linguistics.
[Xu et al., 2015] Xu, W., Callison-Burch, C., and Napoles, C. (2015). Problems in current text simplification
research: New data can help. TACL, 3:283–297.
[Xu, 2004] Xu, X. (2004). Pyteaser.
[Yin and Pei, 2015] Yin, W. and Pei, Y. (2015). Optimizing sentence modeling and selection for document
summarization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence,
IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1383–1389.
[Zhang and Lapata, 2017] Zhang, X. and Lapata, M. (2017). Sentence simplification with deep reinforcement
learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 595–605.

geomstats: a Python Package for Riemannian Geometry in Machine Learning

Nina Miolane, Stanford University, Stanford, CA 94305, nmiolane@stanford.edu
Johan Mathe, Froglabs AI, San Francisco, CA 94103, USA, johan@froglabs.ai
Claire Donnat, Stanford University, Stanford, CA 94305, USA, cdonnat@stanford.edu
Mikael Jorda, Stanford University, Stanford, CA 94305, USA, mjorda@stanford.edu
Xavier Pennec, Inria Sophia-Antipolis, 06902 Valbonne, France, xavier.pennec@inria.fr

arXiv:1805.08308v2 [cs.LG] 6 Nov 2018

Abstract
We introduce geomstats, a Python package that performs computations on manifolds
such as hyperspheres, hyperbolic spaces, spaces of symmetric positive definite
matrices and Lie groups of transformations. We provide efficient and extensively
unit-tested implementations of these manifolds, together with useful Riemannian
metrics and associated Exponential and Logarithm maps. The corresponding
geodesic distances provide a range of intuitive choices of Machine Learning’s loss
functions. We also give the corresponding Riemannian gradients. The operations
implemented in geomstats are available with different computing backends such
as numpy, tensorflow and keras. We have enabled GPU implementation and
integrated geomstats’ manifold computations into keras’ deep learning frame-
work. This paper also presents a review of manifolds in machine learning and
an overview of the geomstats package with examples demonstrating its use for
efficient and user-friendly Riemannian geometry.

1 Introduction
There is a growing interest in using Riemannian geometry in machine learning. To illustrate the
reason for this interest, consider a standard supervised learning problem: given an input X, we want
to predict an output Y . We can model the relation between X and Y by a function fθ parameterized
by a parameter θ. There are three main situations where Riemannian geometry can naturally appear
in this setting: through the input X, the output Y , or the parameter θ. For example the input X can
belong to a Riemannian manifold or be an image defined on a Riemannian manifold. The input X
can also be a manifold itself, for example a 2D smooth surface representing a shape such as a human
pose [5]. Similarly, the output Y may naturally belong to a Riemannian manifold, as in [18] where a
neural network is used to predict the pose of a camera which is an element of the Lie group SE(3).
Finally, the parameter θ of a model can be constrained on a Riemannian manifold as in the work of
[19] which constrains the weights of a neural network on multiple dependent Stiefel manifolds.
There are intuitive and practical advantages for modeling inputs, outputs and parameters on manifolds.
Computing on a lower dimensional space leads to manipulating fewer degrees of freedom, which can
potentially imply faster computations and less memory allocation. Moreover, the non-linear degrees
of freedom that arise in a lower dimensional space often make more intuitive sense: cities on the
earth are better localized by giving their longitude and latitude, i.e., their manifold coordinates, than
by giving their position x, y, z in the 3D space.

Preprint. Work in progress.


Yet, the adoption of Riemannian geometry by the larger machine learning community has been
inhibited by the lack of a modular framework for implementing such methods. Code sequences
are often custom tailored for specific problems, and are not easily reused. To address this issue,
some packages have been written to perform computations on manifolds. The theanogeometry
package [25] provides an implementation of differential geometric tensors on manifolds where
closed forms do not necessarily exist, using the automatic differentiation tool theano to integrate
differential equations that define the geometric tensors. The pygeometry package [7] offers an
implementation primarily focused on the Lie groups SO(3) and SE(3) for robotics applications.
However, there is no implementation of non-canonical metrics on these Lie groups. The pymanopt
package [39] (which builds upon the matlab package manopt [6] but is otherwise independent)
provides a very comprehensive toolbox for optimization on an extensive list of manifolds. Still, the
choice of metrics is restricted on these manifolds, which are often implemented using canonical
embeddings in higher-dimensional Euclidean spaces.
This paper presents geomstats, a package specifically targeted to the machine learning community
to perform computations on Riemannian manifolds with a flexible choice of Riemannian metrics. The
geomstats package makes four contributions. First, geomstats is the first Riemannian geometry
package to be extensively unit-tested with more than 90 % code coverage. Second, geomstats
implements numpy [31] and tensorflow [29] backends, making computations intuitive, vectorized
for batch computations, and available for GPU implementation. Third, we provide an updated version
of the keras deep learning framework equipped with Riemannian gradient descent on manifolds.
Fourth, geomstats has an educational role on Riemannian geometry for computer scientists that can
be used as a complement to theoretical papers or books. We refer to [35] for the theory and expect
the reader to have a high-level understanding of Riemannian geometry.
An overview of geomstats is given in Section 2. We then present concrete use cases of geomstats
for machine learning on manifolds of increasing geometric complexity, starting with manifolds
embedded in flat spaces in Section 3, to a manifold embedded in a Lie group with a Lie group action
in Section 4, to the Lie groups SO(n) and SE(n) in Section 5. Along the way, we present a review of
the occurrences of each manifold in the machine learning literature, some educational visualizations
of the Riemannian geometry as well as implementations of machine learning models where the inputs,
the outputs and the parameters successively belong to manifolds.

2 The Geomstats Package


2.1 Geometry

The geomstats package implements Riemannian geometry using a natural object-oriented ap-
proach with two main families of classes: the manifolds, inherited from the class Manifold
and the Riemannian metrics, inherited from the class RiemannianMetric. Children classes of
Manifold considered here include: LieGroup, EmbeddedManifold, SpecialOrthogonalGroup,
SpecialEuclideanGroup, Hypersphere, HyperbolicSpace and SPDMatricesSpace. Then,
the Riemannian metrics can equip the manifolds. Instantiations of the RiemannianMetric class and
its children classes are attributes of the manifold objects.
The class RiemannianMetric implements the usual methods of Riemannian geometry, such as the
inner product of two tangent vectors at a base point, the (squared) norm of a tangent vector at a
base point, the (squared) distance between two points, the Riemannian Exponential and Logarithm
maps at a base point and a geodesic characterized by an initial tangent vector at an initial point
or by an initial point and an end point. Children classes of RiemannianMetric include the class
InvariantMetric, which implements the left- and right- invariant metrics on Lie groups.
The methods of the above classes have been extensively unit-tested, with more than 90% code
coverage. The code is provided with numpy and tensorflow backends. The code is vectorized
through the use of arrays, to facilitate intuitive batch computations. The tensorflow backend also
enables running the computations on GPUs.

2.2 Statistics and Machine Learning

The package geomstats also implements the “Statistics” aspect of Geometric Statistics, specifically
Riemannian statistics, through the class RiemannianMetric [33]. The class RiemannianMetric
implements the weighted Fréchet mean of a dataset through a Gauss-Newton gradient descent iteration
[12], the variance of a dataset with respect to a point on the manifold, as well as tangent principal
component analysis [11].
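To give an idea of how a weighted Fréchet mean can be computed, the sketch below runs a simple fixed-point iteration on the sphere $S^2$ in plain NumPy: the current estimate is moved along the weighted average of the Logarithms of the data points. It does not use the geomstats API and is a simplified stand-in for the Gauss-Newton iteration mentioned above; the function names and the test points are assumptions of this example.

```python
import numpy as np


def sphere_exp(base, tangent):
    """Riemannian Exponential on the unit sphere at `base`."""
    norm = np.linalg.norm(tangent)
    if norm < 1e-12:
        return base
    return np.cos(norm) * base + np.sin(norm) * tangent / norm


def sphere_log(base, point):
    """Riemannian Logarithm on the unit sphere at `base`."""
    cos_angle = np.clip(base @ point, -1.0, 1.0)
    direction = point - cos_angle * base
    norm = np.linalg.norm(direction)
    if norm < 1e-12:
        return np.zeros_like(base)
    return np.arccos(cos_angle) * direction / norm


def frechet_mean(points, weights=None, n_iter=50, tol=1e-12):
    """Weighted Fréchet mean by iterating mean <- Exp_mean(sum_i w_i Log_mean(p_i))."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    mean = points[0]
    for _ in range(n_iter):
        step = np.sum([w * sphere_log(mean, p) for w, p in zip(weights, points)], axis=0)
        if np.linalg.norm(step) < tol:
            break
        mean = sphere_exp(mean, step)
    return mean


# Four points placed symmetrically around the north pole of S^2.
s, c = np.sin(0.3), np.cos(0.3)
points = np.array([[s, 0, c], [-s, 0, c], [0, s, c], [0, -s, c]])
mean = frechet_mean(points)
```

By symmetry the mean of these points is the north pole, which is not what a plain Euclidean average would return without re-projection onto the sphere.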
The package facilitates the use of Riemannian geometry in machine learning and deep learning
settings. To train a neural network whose predictions lie on a manifold, geomstats provides
off-the-shelf loss functions on Riemannian manifolds, implemented as squared geodesic distances
between the predicted output and the ground truth. These loss functions are consistent with the geo-
metric structure of the Riemannian manifold. The package gives the closed forms of the Riemannian
gradients corresponding to these losses, so that back-propagation can be easily performed.
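For the unit sphere, such a loss and its Riemannian gradient have a simple closed form: the squared geodesic distance is the squared angle between the two points, and its gradient with respect to the prediction is $-2 \,\mathrm{Log}_{pred}(target)$. The NumPy sketch below illustrates this; it is written independently of geomstats, and the function names are assumptions of the example.

```python
import numpy as np


def sphere_log(base, point):
    """Riemannian Logarithm on the unit sphere at `base`."""
    cos_angle = np.clip(base @ point, -1.0, 1.0)
    direction = point - cos_angle * base
    norm = np.linalg.norm(direction)
    if norm < 1e-12:
        return np.zeros_like(base)
    return np.arccos(cos_angle) * direction / norm


def squared_geodesic_loss(pred, target):
    """Squared geodesic (great-circle) distance between two unit vectors."""
    return np.arccos(np.clip(pred @ target, -1.0, 1.0)) ** 2


def riemannian_gradient(pred, target):
    """Gradient of the loss with respect to `pred`, a tangent vector at `pred`."""
    return -2.0 * sphere_log(pred, target)


pred = np.array([1.0, 0.0, 0.0])
target = np.array([0.0, 1.0, 0.0])

loss = squared_geodesic_loss(pred, target)   # (pi / 2) ** 2
grad = riemannian_gradient(pred, target)     # points away from the target
```

The gradient is orthogonal to `pred`, so a descent step along its negative stays tangent to the sphere.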
Suppose we want to constrain the parameters of a model, for example the weights of a neural network,
to belong to a manifold. We provide modified versions of keras and tensorflow, so that they can
constrain weights on manifolds during training.
In the following sections, we demonstrate the use of the manifolds implemented in geomstats. For
each manifold, we present a literature review of its appearance in machine learning and we describe
its implementation in geomstats together with a concrete use case.

3 Embedded Manifolds - Hypersphere and Hyperbolic Space


We consider the hypersphere and the hyperbolic space, respectively implemented in the classes
Hypersphere and HyperbolicSpace. The logic of the Riemannian structure of these two manifolds
is very similar. They are both manifolds defined by their embedding in a flat Riemannian or pseudo-
Riemannian manifold.
The n-dimensional hypersphere $S^n$ is defined by its embedding in the (n + 1)-dimensional Euclidean space,
which is a flat Riemannian manifold, as

$$S^n = \{x \in \mathbb{R}^{n+1} : x_1^2 + \ldots + x_{n+1}^2 = 1\}. \tag{1}$$

Similarly, the n-dimensional hyperbolic space $H^n$ is defined by its embedding in the (n + 1)-dimensional
Minkowski space, which is a flat pseudo-Riemannian manifold, as

$$H^n = \{x \in \mathbb{R}^{n+1} : -x_1^2 + \ldots + x_{n+1}^2 = -1\}. \tag{2}$$

The classes Hypersphere and HyperbolicSpace therefore inherit from the class
EmbeddedManifold. They implement methods such as: conversion functions from intrin-
sic n-dimensional coordinates to extrinsic (n + 1)-dimensional coordinates in the embedding space
(and vice-versa); projection of a point in the embedding space to the embedded manifold; projection
of a vector in the embedding space to a tangent space at the embedded manifold.
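For the sphere $S^2$ embedded in $\mathbb{R}^3$, these operations can be sketched in a few lines of NumPy. The function names and the choice of spherical angles (colatitude and longitude) as intrinsic coordinates are assumptions made for this illustration; they are not taken from the geomstats API.

```python
import numpy as np


def project_to_sphere(point):
    """Project a nonzero point of the embedding space onto the unit sphere."""
    return point / np.linalg.norm(point)


def project_to_tangent(base, vector):
    """Project an ambient vector onto the tangent space at `base` (a unit vector)."""
    return vector - (vector @ base) * base


def intrinsic_to_extrinsic(theta, phi):
    """Colatitude theta and longitude phi -> unit vector in R^3."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])


def extrinsic_to_intrinsic(point):
    """Unit vector in R^3 -> (colatitude, longitude)."""
    x, y, z = point
    return np.arccos(np.clip(z, -1.0, 1.0)), np.arctan2(y, x)


on_sphere = project_to_sphere(np.array([3.0, 0.0, 4.0]))           # [0.6, 0.0, 0.8]
tangent = project_to_tangent(on_sphere, np.array([1.0, 2.0, 3.0]))
theta, phi = extrinsic_to_intrinsic(intrinsic_to_extrinsic(0.7, 1.2))
```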
The Riemannian metric defined on $S^n$ is derived from the Euclidean metric in the embedding space,
while the Riemannian metric defined on $H^n$ is derived from the Minkowski metric in the embedding space.
They are respectively implemented in the classes HypersphereMetric and HyperbolicMetric.

3.1 Hypersphere - Use Cases Review in Machine Learning

We review the contexts where it is natural to embed data on a hypersphere. Examples include circular
statistics, directional statistics or orientation statistics which focus on data on circles, spheres and
rotation groups. Applications are obviously extremely diverse and include biology and physics,
among many others [28]. In biology, the sphere $S^2$ is used in the analysis of protein structures [23]. In
physics, the semi-hypersphere $S^4_+$ is used to encode the projective space P4 for representing crystal
orientations in applied crystallography [36].
The shape statistics literature also manipulates data on abstract hyperspheres. Kendall studies
shapes of k landmarks in m dimensions and introduces the pre-shape spaces, which are hyperspheres
$S^{m(k-1)}$ [22]. The s-rep, a skeletal representation of 3D shapes, also deals with hyperspheres $S^{3n-4}$,
as the object under study is represented by n points along its boundary [17].
Lastly, hyperspheres can be used to constrain the parameters of a machine learning model. For
example, training a neural net with parameters constrained on a hypersphere results in an easier
optimization, faster convergence and comparable (even better) classification accuracy [27].

3.2 Geomstats Use Case - Optimization and Deep Learning on Hyperspheres

We demonstrate the use of geomstats for constraining neural networks’ weights on manifolds during
training following the deep learning literature [27]. The folder deep_learning of the supplementary
materials contains the implementation of this use case.

Figure 1: Minimization of a scalar field on the sphere $S^2$. The color map indicates the scalar field
values, where blue is the minimum and red the maximum. The red curve shows the trajectory taken
by the Riemannian gradient descent, which converges to a minimum (blue region).

First, however, we provide the implementation of the Riemannian gradient descent on the hypersphere.
Our example minimizes a quadratic form $x^T A x$ with $A \in \mathbb{R}^{n \times n}$ and $x^T A x > 0$, constrained on
the hypersphere $S^{n-1}$. Geomstats allows us to conveniently generate a symmetric positive definite matrix
by doing a random uniform sampling on the SPDMatricesSpace manifold. Figure 1 illustrates the
Riemannian optimization process.
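As a rough illustration of this optimization, the sketch below runs Riemannian gradient descent on $S^2$ in plain NumPy. It is not the implementation shipped with geomstats: the matrix A is written by hand instead of being sampled from SPDMatricesSpace, and the function names, step size and iteration count are assumptions made for the example.

```python
import numpy as np


def sphere_exp(base_point, tangent_vec):
    """Exponential map of the unit sphere: follow the great circle from base_point."""
    norm = np.linalg.norm(tangent_vec)
    if norm < 1e-12:
        return base_point
    return np.cos(norm) * base_point + np.sin(norm) * tangent_vec / norm


def riemannian_gradient_descent(A, x0, step=0.1, n_iter=300):
    """Minimize x^T A x over the unit sphere (A is assumed symmetric)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        euclidean_grad = 2.0 * A @ x
        # Riemannian gradient: tangential part of the Euclidean gradient.
        riemannian_grad = euclidean_grad - (euclidean_grad @ x) * x
        x = sphere_exp(x, -step * riemannian_grad)
        x = x / np.linalg.norm(x)  # remove accumulated round-off drift
    return x


# Hand-written SPD matrix used only for this demonstration.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 3.0]])
x_min = riemannian_gradient_descent(A, np.array([1.0, 1.0, 1.0]))
minimum_value = x_min @ A @ x_min
```

On the sphere, the minimum of $x^T A x$ is the smallest eigenvalue of A, which gives a simple way to check the result against `np.linalg.eigvalsh(A)[0]`.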
For neural network training, the optimization step has been modified in keras such that
the stochastic gradient descent is performed on the manifold through the Exponential map. In our
implementation, the user can pass a manifold parameter to each neural network layer. The stochastic
gradient descent optimizer has been modified to perform the Riemannian gradient descent in parallel.
It infers the number of manifolds directly from the dimensionality by finding out how many manifolds
are needed in order to optimize the number of kernel weights of a given layer.
We provide a modified version of a simple deep convolutional neural network and a resnet [16] with
its convolutional layers’ weights trained on the hypersphere. They were trained respectively on the
MNIST [26] and CIFAR-10 [24] datasets.

3.3 Hyperbolic Space - Use Case Reviews in Machine Learning

We review the machine learning literature that deals with hyperbolic spaces. Hyperbolic spaces arise
in information and learning theory. The space of univariate Gaussian densities endowed with the Fisher
metric is a hyperbolic space. This characterization is used in various fields, for example in
image processing where each image pixel is represented by a Gaussian distribution [1] and in radar
signal processing where the corresponding echo is represented by a stationary Gaussian process [2].
The hyperbolic spaces can also be seen as continuous versions of trees and are therefore interesting
when learning hierarchical representations of data [30]. Hyperbolic geometric graphs (HGG) have
also been suggested as a promising model for social networks, where the hyperbolicity appears
through a competition between similarity and popularity of an individual [32].

3.4 Geomstats Use Case - Visualization on the Hyperbolic space H2

We present the visualization toolbox of geomstats, which plays an educational role by enabling
users to test their intuition on Riemannian manifolds. They can run and adapt the examples provided
in the geomstats/examples folder of the supplementary materials. For example, we can visualize
the hyperbolic space $H^2$ through the Poincaré disk representation, where the border of the disk is at
infinity. The user can then observe how a geodesic grid and a geodesic square are deformed in the
hyperbolic geometry in Figure 2.

[Figure 2 image: two panels, each with axes X and Y spanning −1.0 to 1.0.]

Figure 2: Left: Regular geodesic grid on the hyperbolic space $H^2$ in Poincaré disk representation.
Right: Geodesic square on the hyperbolic space $H^2$, with points regularly spaced on the geodesics
defining the square’s edges.
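To make the Poincaré disk representation concrete, the following NumPy sketch maps points of the hyperboloid model (with the sign convention of equation (2), first coordinate time-like) to the open unit disk and compares the two standard closed-form distances. The function names are hypothetical and this is not the geomstats visualization code.

```python
import numpy as np


def lift_to_hyperboloid(y):
    """Map intrinsic coordinates y in R^2 to the point (sqrt(1 + |y|^2), y) of H^2."""
    y = np.asarray(y, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + y @ y)], y))


def to_poincare_disk(x):
    """Stereographic projection from the hyperboloid onto the open unit disk."""
    return x[1:] / (1.0 + x[0])


def hyperboloid_distance(x, y):
    """Geodesic distance in the hyperboloid model: arccosh(-<x, y>_Minkowski)."""
    minkowski_inner = -x[0] * y[0] + x[1:] @ y[1:]
    return np.arccosh(max(-minkowski_inner, 1.0))


def poincare_distance(u, v):
    """Geodesic distance between two points of the Poincaré disk."""
    numerator = 2.0 * np.sum((u - v) ** 2)
    denominator = (1.0 - u @ u) * (1.0 - v @ v)
    return np.arccosh(1.0 + numerator / denominator)


p = lift_to_hyperboloid([0.3, -0.7])
q = lift_to_hyperboloid([2.0, 1.5])
u, v = to_poincare_disk(p), to_poincare_disk(q)
distance_hyperboloid = hyperboloid_distance(p, q)
distance_disk = poincare_distance(u, v)
```

Points far from the origin of the hyperboloid land close to the boundary of the disk, which is why the border of the disk is said to be at infinity.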

4 Manifold of Symmetric Positive Definite (SPD) Matrices


We have seen the Hypersphere and the Hyperbolic space that are manifolds embedded in flat spaces.
Now we increase the geometric complexity and consider a manifold embedded in the General Linear
group of invertible matrices. The manifold of symmetric positive definite (SPD) matrices in n
dimensions is indeed defined as
$$SPD = \left\{ S \in \mathbb{R}^{n \times n} : S^T = S,\ \forall z \in \mathbb{R}^n,\ z \neq 0,\ z^T S z > 0 \right\}. \quad (3)$$
The class SPDMatricesSpace thus inherits from the class EmbeddedManifold and has an
embedding_manifold attribute which stores an object of the class GeneralLinearGroup. We
equip the manifold of SPD matrices with an object of the class SPDMetric that implements the
affine-invariant Riemannian metric of [34] and inherits from the class RiemannianMetric.
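Definition (3) can be checked numerically: a symmetric matrix satisfies $z^T S z > 0$ for all nonzero z exactly when all of its eigenvalues are positive. The sketch below is a minimal membership test under that characterization; the function name and tolerance are assumptions made for this example, not part of the SPDMatricesSpace class.

```python
import numpy as np


def is_spd(matrix, atol=1e-10):
    """Return True if `matrix` is square, symmetric and has only positive eigenvalues."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    if not np.allclose(matrix, matrix.T, atol=atol):
        return False
    # eigvalsh is the eigenvalue routine for symmetric matrices.
    return bool(np.all(np.linalg.eigvalsh(matrix) > atol))


spd_example = is_spd([[2.0, 1.0], [1.0, 2.0]])         # eigenvalues 1 and 3
indefinite_example = is_spd([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues -1 and 3
nonsymmetric_example = is_spd([[1.0, 2.0], [0.0, 1.0]])
```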

4.1 SPD Matrices Manifold - Use Cases in Machine Learning

SPD matrices are used for data representation in many fields [8]. This subsection lists their use cases.
In diffusion tensor imaging (DTI), the “diffusion tensors” are ellipsoids represented by 3×3 SPD matrices at
each voxel. They spatially characterize the diffusion of water molecules in the tissues. These fields of
SPD matrices are inputs to regression models, for example the intrinsic local polynomial regression
used in [41] to compare fiber tracts of HIV subjects with those of a control group.
In functional magnetic resonance imaging (fMRI), the brain connectome framework extracts connec-
tivity graphs from a set of patients’ resting-state images’ time series [38, 40, 20]. The regularized
graph Laplacians of the respective graphs form a dataset of SPD matrices. They represent a compact
summary of the brain’s connectivity patterns which is used to assess neurological responses to a
variety of stimuli (drug, pathology, patient’s activity, etc.).
In medical imaging and computational anatomy, SPD matrices can also encode anatomical shape
changes observed in images. The SPD matrix $(J^T J)^{1/2}$ represents the directional information of shape
change captured by the Jacobian matrix J at a given voxel [14].
Covariance matrices are also SPD matrices which appear in many settings. We find covariance
clustering used for sound compression in acoustic models of automatic speech recognition (ASR)
systems [37] and covariance clustering for material classification [10], among others. Covariance
descriptors are also popular image descriptors or video descriptors [15].
Lastly, SPD matrices have found applications in deep learning, where they are used as features
extracted by a neural network. The authors of [13] show that aggregating learned deep
convolutional features into an SPD matrix creates a robust representation of images that
outperforms state-of-the-art methods on visual classification.

Distance        Accuracy   F1-Score
Riemannian      30.8%      47.1
Log Euclidean   62.5%      36.4
Frobenius       46.2%      0.00

[Right panel: clustermap image, color scale from 0.5 to 1.0.]

Figure 3: Left: Connectome classification results. Right: Clustermap of the recovered similarities
using the Riemannian distance on the SPD Manifold. We note in particular the identification of
several clusters (red blocks on the diagonal).

4.2 Geomstats Use Case - Connectivity Graph Classification

We show through a concrete brain connectome application how geomstats can be easily leveraged
for efficient supervised learning on the space of SPD matrices. The folder brain_connectome of
the supplementary materials contains the implementation of this use case.
We consider the fMRI data from the 2014 MLSP Schizophrenia Classification challenge¹, consisting
of the resting-state fMRIs of 86 patients split into two balanced categories: controls vs. people suffering
from schizophrenia. Consistent with the connectome literature, we approach the classification task by
using an SVM classifier on the pre-computed pairwise similarities between brains. The critical step lies
in our ability to correctly identify similar brain structures, here represented by regularized Laplacian
SPD matrices L̂ = (D − A) + γI, where A and D are respectively the adjacency and the degree
matrices of a given connectome. The parameter γ is a regularization shown to have little effect on the
classification performance [9].
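The construction of the regularized Laplacian can be sketched in a few lines of NumPy. The toy adjacency matrix and the value of γ below are invented for illustration, and the sketch assumes nonnegative edge weights, for which D − A is positive semidefinite so that adding γI with γ > 0 yields an SPD matrix.

```python
import numpy as np


def regularized_laplacian(adjacency, gamma=0.5):
    """Return (D - A) + gamma * I for a symmetric, nonnegative adjacency matrix A."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency + gamma * np.eye(adjacency.shape[0])


# Toy weighted graph on three nodes (hypothetical data, not a real connectome).
adjacency = np.array([[0.0, 1.0, 0.5],
                      [1.0, 0.0, 0.0],
                      [0.5, 0.0, 0.0]])
laplacian = regularized_laplacian(adjacency, gamma=0.5)
eigenvalues = np.linalg.eigvalsh(laplacian)
```

The smallest eigenvalue of the unregularized graph Laplacian is 0, so the smallest eigenvalue of the regularized matrix equals γ.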
Following two popular approaches in the literature [9], we define similarities between connectomes
through kernels relying on the Riemannian distance
$$d_R(\hat{L}_1, \hat{L}_2) = \left\| \log\left( \hat{L}_1^{-1/2} \hat{L}_2 \hat{L}_1^{-1/2} \right) \right\|_F$$
and on the log-Euclidean distance, a computationally lighter proxy for the first:
$$d_{LED}(\hat{L}_1, \hat{L}_2) = \left\| \log_I(\hat{L}_2) - \log_I(\hat{L}_1) \right\|_F.$$
In these formulae, log is the matrix logarithm and $\|\cdot\|_F$ refers to the
Frobenius norm. Both of these similarities are easily computed with geomstats; for example, the
Riemannian distance is obtained through metric.squared_dist where metric is an instance of
the class SPDMetric.
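For readers who want to see the two formulas spelled out, here is a minimal NumPy sketch that evaluates them through eigendecompositions. It is independent of geomstats and of SPDMetric; the helper names are hypothetical, and the matrix functions are only valid for symmetric positive definite inputs.

```python
import numpy as np


def _apply_to_eigenvalues(spd, func):
    """Apply a scalar function to an SPD matrix through its eigendecomposition."""
    spd = 0.5 * (spd + spd.T)  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(spd)
    return (eigvecs * func(eigvals)) @ eigvecs.T


def spd_log(spd):
    return _apply_to_eigenvalues(spd, np.log)


def spd_inv_sqrt(spd):
    return _apply_to_eigenvalues(spd, lambda w: 1.0 / np.sqrt(w))


def riemannian_distance(a, b):
    """Affine-invariant distance || log(a^{-1/2} b a^{-1/2}) ||_F."""
    a_inv_sqrt = spd_inv_sqrt(a)
    return np.linalg.norm(spd_log(a_inv_sqrt @ b @ a_inv_sqrt), "fro")


def log_euclidean_distance(a, b):
    """Log-Euclidean distance || log(b) - log(a) ||_F."""
    return np.linalg.norm(spd_log(b) - spd_log(a), "fro")


a = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([[3.0, 0.0], [0.0, 1.0]])
g = np.array([[1.0, 2.0], [0.0, 1.0]])  # an arbitrary invertible matrix

d_riemannian = riemannian_distance(a, b)
d_riemannian_transformed = riemannian_distance(g @ a @ g.T, g @ b @ g.T)
d_log_euclidean = log_euclidean_distance(a, b)
```

The pair `d_riemannian` and `d_riemannian_transformed` illustrates the affine invariance of the first distance: applying the same congruence by an invertible matrix to both arguments leaves it unchanged, which is not true of the log-Euclidean proxy in general.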
Figure 3 (left) shows the performance of these similarities for graph classification, which we bench-
mark against a standard Frobenius distance. With an out-of-sample accuracy of 61.2%, the log-
Euclidean distance here achieves the best performance. Interestingly, the affine-invariant Riemannian
distance on SPD matrices is the distance that picks up the most differences between connectomes.
While both the Frobenius and the log-Euclidean distances recover only very slight differences between
connectomes (placing them almost uniformly far from each other), the Riemannian distance exhibits
greater variability, as shown by the clustermap in Figure 3 (right). Given the ease of implementation
of these similarities with geomstats, comparing them further opens research directions for in-depth
connectome analysis.

¹ Data openly available at https://www.kaggle.com/c/mlsp-2014-mri

5 Lie Groups SO(n) and SE(n) - Rotations and Rigid Transformations
We have seen manifolds embedded in other manifolds, where the embedding manifolds were either
flat or had a Lie group structure. Now we turn to manifolds that are Lie groups themselves. The
special orthogonal group SO(n) is the group of rotations in n dimensions defined as
$$SO(n) = \left\{ R \in \mathbb{R}^{n \times n} : R^T R = \mathrm{Id}_n \text{ and } \det R = 1 \right\}. \quad (4)$$

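The special Euclidean group SE(n) is the group of rotations and translations in n dimensions defined
by its homogeneous representation as
$$SE(n) = \left\{ X \in \mathbb{R}^{(n+1) \times (n+1)} \;\middle|\; X = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix},\ t \in \mathbb{R}^n,\ R \in SO(n) \right\}. \quad (5)$$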

The classes SpecialOrthogonalGroup and SpecialEuclideanGroup both inherit from the
classes LieGroup and EmbeddedManifold, as embedded in the General Linear group. They
both have an attribute metrics which can store a list of metric objects, instantiations of the class
InvariantMetric. A left- or right-invariant metric object is instantiated through an inner-product
matrix at the tangent space at the identity of the group.
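The two group definitions translate directly into code. The sketch below, written in plain NumPy with hypothetical function names, tests membership in SO(n) and assembles the homogeneous matrix of a rigid transformation, which has one more row and column than the rotation block R. It does not use the geomstats classes named above.

```python
import numpy as np


def is_rotation(matrix, atol=1e-9):
    """Check R^T R = Id and det R = 1 up to a numerical tolerance."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    identity = np.eye(matrix.shape[0])
    orthogonal = np.allclose(matrix.T @ matrix, identity, atol=atol)
    return bool(orthogonal and np.isclose(np.linalg.det(matrix), 1.0, atol=atol))


def se_matrix(rotation, translation):
    """Homogeneous matrix [[R, t], [0, 1]] of a rigid transformation."""
    n = rotation.shape[0]
    out = np.eye(n + 1)
    out[:n, :n] = rotation
    out[:n, n] = translation
    return out


angle = 0.3
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
reflection = np.diag([1.0, -1.0])  # orthogonal but det = -1, so not in SO(2)

first = se_matrix(rot, np.array([1.0, 2.0]))
second = se_matrix(rot.T, np.array([-0.5, 0.0]))
composed = first @ second
```

Composition of rigid transformations is matrix multiplication in this representation: the product of (R1, t1) and (R2, t2) has rotation R1 R2 and translation R1 t2 + t1.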

5.1 Lie Groups SO(n) and SE(n) - Use Cases in Machine Learning

This subsection enumerates the use cases of the Lie groups SO(n) and SE(n) for data and parameter
representation. In 3D, SO(3) and SE(3) appear naturally when dealing with articulated objects.
A spherical robot arm is an example of an articulated object, whose positions can be modeled as the
elements of SO(3). The human spine can also be modeled as an articulated object where each
vertebra is represented as an orthonormal frame that encodes the rigid body transformation from the
previous vertebra [3, 4].
In computer vision, elements of SO(3) or SE(3) are used to represent the orientation or pose of
cameras [21]. Supervised learning algorithms predicting such orientations or poses have numerous
applications for robots and autonomous vehicles which need to localize themselves in their
environment.
Lastly, the Lie group SO(n) and its extension to the Stiefel manifold are very useful in the
training of deep neural networks. The authors of [19] suggest constraining the network’s weights on
a Stiefel manifold, i.e. forcing the weights to be orthogonal to each other. Enforcing the geometry
significantly improves performance, reducing for example the test error of a wide residual network on
CIFAR-100 from 20.04% to 18.61%.

5.2 Geomstats Use Case - Geodesics on SO(3)

Riemannian geometry can be easily integrated into machine learning applications in robotics
using geomstats. We demonstrate this by presenting the interpolation of a robot arm trajectory
by geodesics. The folder robotics of the supplementary materials contains the implementation of
this use case.
In robotics, it is common to control a manipulator in Cartesian space rather than configuration space.
This allows for a much more intuitive task specification, and makes the computations easier by
solving several low-dimensional problems instead of a high-dimensional one. Most robotic tasks require
generating and following a position trajectory as well as an orientation trajectory.
While it is quite easy to generate a trajectory for position using interpolation between several via
points, it is less trivial to generate one for orientations that are commonly represented as rotation
matrices or quaternions. Here, we show that we can actually easily generate an orientation trajectory
as a geodesic between two elements of SO(3) (or as a sequence of geodesics between several via
points in SO(3)). We generate a geodesic on SO(3) between the initial orientation of the robot and
its desired final orientation, and use the generated trajectory as an input to the robot controller. The
trajectory obtained is illustrated in Figure 4.
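A geodesic of this kind can be written in closed form as $R(t) = R_0 \exp(t \log(R_0^T R_1))$ for the bi-invariant metric. The sketch below implements it in NumPy with Rodrigues' formula. The function names are hypothetical, the logarithm is only valid for rotation angles strictly below π, and this is not the code of the robotics folder.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix associated with a rotation vector w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues' formula: rotation matrix of the rotation vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = hat(w / theta)
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)


def so3_log(rotation):
    """Rotation vector of a rotation matrix (angle assumed strictly below pi)."""
    cos_theta = np.clip((np.trace(rotation) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros(3)
    axis = np.array([rotation[2, 1] - rotation[1, 2],
                     rotation[0, 2] - rotation[2, 0],
                     rotation[1, 0] - rotation[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis


def so3_geodesic(start, end, t):
    """Point at time t in [0, 1] on the geodesic from `start` to `end`."""
    return start @ so3_exp(t * so3_log(start.T @ end))


initial_orientation = so3_exp(np.array([0.0, 0.0, 0.3]))
final_orientation = so3_exp(np.array([0.4, -0.2, 1.0]))
trajectory = [so3_geodesic(initial_orientation, final_orientation, t)
              for t in np.linspace(0.0, 1.0, 5)]
```

Each element of `trajectory` is a rotation matrix, so the list could be fed as orientation set points to a controller. A sequence of via points would simply chain several such segments.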
This opens the door for research at the intersection of Riemannian geometry, robotics and machine
learning. We could ask the robot arm to perform a trajectory towards an element of SE(3) or SO(3)
predicted by a supervised learning algorithm trained for a specific task. The next subsection presents
the concrete use case of training a neural network to predict on Lie groups using geomstats.

Figure 4: A Riemannian geodesic computed with the canonical bi-invariant metric of SO(3), applied
to the extremity of the robotic arm.

5.3 Geomstats Use Case - Deep Learning Predictions on SE(3)

We show how to use geomstats to train supervised learning algorithms to predict on manifolds,
specifically here: to predict on the Lie group SE(3). This use case is presented in more detail in the
paper [18], and the open-source implementation is given. The authors of [18] consider the problem of
pose estimation, which consists in predicting the position and orientation of the camera that has taken a
picture given as input.
The outputs of the algorithm belong to the Lie group SE(3). The geomstats package is used to train
the CNN to predict on SE(3) equipped with a left-invariant Riemannian metric. At each training
step, they use the loss given by the squared Riemannian geodesic distance between the predicted
pose and the ground truth. The Riemannian gradients required for back-propagation are given by the
closed forms implemented in geomstats.
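As a simplified illustration of such a loss, the sketch below assumes the left-invariant metric on SE(3) whose inner product at the identity is the identity matrix in (rotation vector, translation) coordinates. Under that assumption the squared distance separates into the squared rotation angle plus the squared translation difference. The metric choice, the function names and the test poses are assumptions made for this example; the closed forms used in geomstats and in [18] may differ.

```python
import numpy as np


def rotation_angle(rotation):
    """Angle of a 3D rotation matrix, i.e. the norm of its rotation vector."""
    cos_theta = np.clip((np.trace(rotation) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_theta)


def se3_squared_distance(rot_a, trans_a, rot_b, trans_b):
    """Squared rotation angle of rot_a^T rot_b plus squared translation gap."""
    angle = rotation_angle(rot_a.T @ rot_b)
    return angle ** 2 + np.sum((trans_b - trans_a) ** 2)


def rotation_about_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_about_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


# Hypothetical ground-truth and predicted poses.
rot_true, trans_true = rotation_about_z(0.2), np.array([0.0, 0.0, 0.0])
rot_pred, trans_pred = rotation_about_z(0.5), np.array([3.0, 4.0, 0.0])
loss = se3_squared_distance(rot_true, trans_true, rot_pred, trans_pred)

# Left-invariance: moving both poses by the same rigid transformation (g_rot, g_trans).
g_rot, g_trans = rotation_about_x(0.7), np.array([1.0, -2.0, 0.5])
loss_moved = se3_squared_distance(g_rot @ rot_true, g_rot @ trans_true + g_trans,
                                  g_rot @ rot_pred, g_rot @ trans_pred + g_trans)
```

Because the loss only depends on the relative pose, composing both the prediction and the ground truth on the left with the same rigid transformation leaves it unchanged.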

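[Figure 5 diagram: a block labeled "Any CNN Architecture" outputs a predicted pose; the loss compares the prediction with the ground truth $p = (r_1, r_2, r_3, t_1, t_2, t_3) \in \mathfrak{se}(3)$.]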

Figure 5: Image courtesy of [18]. CNN with a squared Riemannian distance as the loss on SE(3).

The effectiveness of the Riemannian loss is demonstrated by experiments showing significant
improvements in accuracy for image-based 2D to 3D registration. The loss functions and gradients
provided in geomstats extend this research direction to CNNs predicting on other Lie groups and
manifolds.

6 Conclusion and Outlook


We introduce the open-source package geomstats to democratize the use of Riemannian geometry
in machine learning for a wide range of applications. Regarding the geometry, we have presented
manifolds of increasing complexity: manifolds embedded in flat Riemannian spaces, then the case of
the SPD matrices space and lastly Lie groups with invariant Riemannian metrics. This provides an
educational tool for users who want to delve into Riemannian geometry through a hands-on approach,
with intuitive visualizations for example in subsections 3.4 and 5.2.
In regard to machine learning, we have presented concrete use cases where inputs, outputs and
parameters belong to manifolds, in the respective examples of subsection 4.2, subsection 5.3 and
subsection 3.2. They demonstrate the usability of the geomstats package for efficient and user-friendly
Riemannian geometry. Regarding the machine learning applications, we have reviewed the
occurrences of each manifold in the literature across many different fields. We kept the range of
applications very wide to show the many new research avenues that open at the crossroads of
Riemannian geometry and machine learning.
geomstats implements manifolds where closed-forms for the Exponential and the Logarithm maps
of the Riemannian metrics exist. Future work will involve implementing manifolds where these
closed forms do not necessarily exist. We will also provide the pytorch backend.

References
[1] Angulo, J., Velasco-Forero, S.: Morphological processing of univariate Gaussian distribution-valued
images based on Poincaré upper-half plane representation. In: Nielsen, F. (ed.) Geometric Theory of
Information, pp. 331–366. Signals and Communication Technology, Springer International Publishing
(may 2014)

[2] Arnaudon, M., Barbaresco, F., Yang, L.: Riemannian Medians and Means With Applications to Radar
Signal Processing. IEEE Journal of Selected Topics in Signal Processing 7(4), 595–604 (aug 2013)

[3] Arsigny, V.: Processing Data in Lie Groups: An Algebraic Approach. Application to Non-Linear
Registration and Diffusion Tensor MRI. PhD thesis, École polytechnique (nov 2006)

[4] Boisvert, J., Cheriet, F., Pennec, X., Labelle, H., Ayache, N.: Articulated Spine Models for 3D Reconstruc-
tion from Partial Radiographic Data. IEEE Transactions on Bio-Medical Engineering 55(11), 2565–2574
(nov 2008)

[5] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric Deep Learning: Going
beyond Euclidean data. IEEE Signal Processing Magazine 34(4), 18–42 (jul 2017)

[6] Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab Toolbox for Optimization on
Manifolds. Journal of Machine Learning Research 15, 1455–1459 (2014)

[7] Censi, A.: Pygeometry: library for handling various differentiable manifolds. (2010),
https://github.com/AndreaCensi/geometry

[8] Cherian, A., Sra, S.: Positive Definite Matrices: Data Representation and Applications to Computer Vision.
In: Algorithmic Advances in Riemannian Geometry and Applications. Springer (2016)

[9] Dodero, L., Minh, H.Q., Biagio, M.S., Murino, V., Sona, D.: Kernel-based classification for brain
connectivity graphs on the Riemannian manifold of positive definite matrices. In: 2015 IEEE 12th
International Symposium on Biomedical Imaging (ISBI). pp. 42–45 (apr 2015)

[10] Faraki, M., Harandi, M.T., Porikli, F.: Material Classification on Symmetric Positive Definite Manifolds.
In: 2015 IEEE Winter Conference on Applications of Computer Vision. pp. 749–756 (jan 2015)

[11] Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics
of shape. IEEE transactions on medical imaging 23(8), 995–1005 (2004)

[12] Fréchet, M.: Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut
Henri Poincaré 10(4), 215–310 (1948)

[13] Gao, Z., Wu, Y., Bu, X., Jia, Y.: Learning a Robust Representation via a Deep Network on Symmetric
Positive Definite Manifolds. CoRR abs/1711.06540 (2017), http://arxiv.org/abs/1711.06540

[14] Grenander, U., Miller, M.: Pattern Theory: From Representation to Inference. Oxford University Press,
Inc., New York, NY, USA (2007)

[15] Harandi, M.T., Hartley, R.I., Lovell, B.C., Sanderson, C.: Sparse Coding on Symmetric Positive Definite
Manifolds using Bregman Divergences. CoRR abs/1409.0083 (2014), http://arxiv.org/abs/1409.0083

[16] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. CoRR abs/1512.03385
(2015), http://arxiv.org/abs/1512.03385

[17] Hong, J., Vicory, J., Schulz, J., Styner, M., Marron, J.S., Pizer, S.: Non-Euclidean Classification of
Medically Imaged Objects via s-reps. Med Image Anal 31, 37–45 (2016)

[18] Hou, B., Miolane, N., Khanal, B., Lee, M., Alansary, A., McDonagh, S., Hajnal, J., Rueckert, D., Glocker,
B., Kainz, B.: Deep Pose Estimation for Image-Based Registration. Submitted to MICCAI 2018. (2018)

[19] Huang, L., Liu, X., Lang, B., Yu, A.W., Li, B.: Orthogonal Weight Normalization: Solution to Optimization
over Multiple Dependent Stiefel Manifolds in Deep Neural Networks. CoRR abs/1709.06079 (2017)

[20] Ingalhalikar, M., Smith, A., Parker, D., Satterthwaite, T.D., Elliott, M.A., Ruparel, K., Hakonarson, H., Gur,
R.E., Gur, R.C., Verma, R.: Sex differences in the structural connectome of the human brain. Proceedings
of the National Academy of Sciences 111(2), 823–828 (2014)

[21] Kendall, A., Grimes, M., Cipolla, R.: Convolutional networks for real-time 6-DOF camera relocalization.
CoRR abs/1505.07427 (2015), http://arxiv.org/abs/1505.07427

[22] Kendall, D.G.: A Survey of the Statistical Theory of Shape. Statistical Science 4(2), pp. 87–99 (1989)

[23] Kent, J.T., Hamelryck, T.: Using the Fisher-Bingham distribution in stochastic models for protein structure.
Quantitative Biology, Shape Analysis, and Wavelets pp. 57–60 (2005)

[24] Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). (2010),
http://www.cs.toronto.edu/~kriz/cifar.html

[25] Kühnel, L., Sommer, S.: Computational Anatomy in Theano. CoRR abs/1706.07690 (2017),
http://arxiv.org/abs/1706.07690

[26] LeCun, Y., Cortes, C.: MNIST handwritten digit database. (2010),
http://yann.lecun.com/exdb/mnist/

[27] Liu, W., Zhang, Y.M., Li, X., Yu, Z., Dai, B., Zhao, T., Song, L.: Deep Hyperspherical Learning. In:
Advances in Neural Information Processing Systems. pp. 3953–3963 (2017)

[28] Mardia, K.V., Jupp, P.E.: Directional statistics. Wiley series in probability and statistics, Wiley (2000),
https://books.google.com/books?id=zjPvAAAAMAAJ

[29] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J.,
Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L.,
Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-Scale Machine Learning on
Heterogeneous Systems (2015), https://www.tensorflow.org/

[30] Nickel, M., Kiela, D.: Poincaré Embeddings for Learning Hierarchical Representations. In: Guyon, I.,
Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in
Neural Information Processing Systems 30, pp. 6338–6347. Curran Associates, Inc. (2017)

[31] Oliphant, T.E.: Guide to NumPy. CreateSpace Independent Publishing Platform, USA, 2nd edn. (2015)

[32] Papadopoulos, F., Kitsak, M., Serrano, M.Á., Boguñá, M., Krioukov, D.: Popularity versus similarity in
growing networks. Nature 489, 537 (2012), http://dx.doi.org/10.1038/nature11459

[33] Pennec, X.: Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements.
Journal of Mathematical Imaging and Vision 25(1), 127–154 (2006)

[34] Pennec, X., Fillard, P., Ayache, N.: A Riemannian Framework for Tensor Computing. International
Journal of Computer Vision 66(1), 41–66 (jan 2006)

[35] Postnikov, M.: Riemannian Geometry. Encyclopaedia of Mathem. Sciences, Springer (2001)

[36] Schaeben, H.: Towards statistics of crystal orientations in quantitative texture analysis. Journal of Applied
Crystallography 26(1), 112–121 (feb 1993), https://doi.org/10.1107/S0021889892009270

[37] Shinohara, Y., Masuko, T., Akamine, M.: Covariance clustering on Riemannian manifolds for acoustic
model compression. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.
pp. 4326–4329 (mar 2010)

[38] Sporns, O., Tononi, G., Kötter, R.: The human connectome: a structural description of the human brain.
PLoS computational biology 1(4), e42 (2005)

[39] Townsend, J., Koep, N., Weichwald, S.: Pymanopt: A Python Toolbox for Optimization on Manifolds
using Automatic Differentiation. Journal of Machine Learning Research 17(137), 1–5 (2016),
http://jmlr.org/papers/v17/16-177.html

[40] Wang, J., Zuo, X., Dai, Z., Xia, M., Zhao, Z., Zhao, X., Jia, J., Han, Y., He, Y.: Disrupted functional brain
connectome in individuals at risk for Alzheimer’s disease. Biological psychiatry 73(5), 472–481 (2013)

[41] Yuan, Y., Zhu, H., Lin, W., Marron, J.S.: Local polynomial regression for symmetric positive definite
matrices. Journal of the Royal Statistical Society Series B 74(4), 697–719 (2012),
https://econpapers.repec.org/RePEc:bla:jorssb:v:74:y:2012:i:4:p:697-719

Backdrop: Stochastic Backpropagation

Siavash Golkar
New York University
golkar@nyu.edu

Kyle Cranmer
New York University
kyle.cranmer@nyu.edu

arXiv:1806.01337v1 [stat.ML] 4 Jun 2018

Abstract
We introduce backdrop, a flexible and simple-to-implement method, intuitively
described as dropout acting only along the backpropagation pipeline. Backdrop is
implemented via one or more masking layers which are inserted at specific points
along the network. Each backdrop masking layer acts as the identity in the forward
pass, but randomly masks parts of the backward gradient propagation. Intuitively,
inserting a backdrop layer after any convolutional layer leads to stochastic gradients
corresponding to features of that scale. Therefore, backdrop is well suited for
problems in which the data have a multi-scale, hierarchical structure. Backdrop can
also be applied to problems with non-decomposable loss functions where standard
SGD methods are not well suited. We perform a number of experiments and
demonstrate that backdrop leads to significant improvements in generalization.

1 Introduction
Stochastic gradient descent (SGD) and its minibatch variants are ubiquitous in virtually all learning
tasks [1, 13]. SGD enables deep learning to scale to large datasets, decreases training time and
improves generalization. However, there are many problems where there exists a large amount of
information but the data is packaged in a small number of information-rich samples. These problems
cannot take full advantage of the benefits of SGD as the number of training samples is limited.
Examples of these include learning from high resolution medical images, satellite imagery and
GIS data, cosmological simulations, lattice simulations for quantum systems, and many others. In
all of these situations, there are relatively few training samples, but each training sample carries a
tremendous amount of information and can be intuitively considered as being comprised of many
independent subsamples.
Since the benefits of SGD require the existence of a large number of samples, efficiently employing
these techniques on the above problems requires careful analysis of the problem in order to restructure
the information-rich data samples into smaller independent pieces. This is not always a straight-
forward procedure, and may even be impossible without destroying the structure of the data [27, 29].
This motivates us to introduce backdrop, a new technique for stochastic gradient optimization which
does not require the user to reformulate the problem or restructure the training data. In this method,
the loss function is unmodified and takes the entirety of each sample as input, but the gradient is
computed stochastically on a fraction of the paths in the backpropagation pipeline.
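The idea of leaving the forward pass untouched while masking the backward pass can be sketched with a hand-written layer in NumPy. This is only an illustration of the principle and not the authors' implementation: the class name, the per-sample Bernoulli mask, the keep probability and the absence of any gradient rescaling are all assumptions.

```python
import numpy as np


class BackdropMask:
    """Identity in the forward pass; random per-sample mask in the backward pass."""

    def __init__(self, keep_prob, seed=0):
        self.keep_prob = keep_prob
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, hidden):
        # Draw one keep/drop decision per row (sample); values pass through unchanged.
        self.mask = self.rng.random(hidden.shape[0]) < self.keep_prob
        return hidden

    def backward(self, upstream_grad):
        # Gradients of dropped samples are zeroed; kept ones are left as they are.
        return upstream_grad * self.mask[:, None]


layer = BackdropMask(keep_prob=0.5, seed=0)
hidden = np.arange(12.0).reshape(6, 2)       # 6 samples, 2 features each
output = layer.forward(hidden)               # identical to `hidden`
grad = layer.backward(np.ones_like(hidden))  # rows are either all 0 or all 1
```

Since the forward output is unchanged, a loss computed downstream still sees every sample, while only the samples kept by the mask contribute to the parameter update.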
A closely related class of problems, which can also benefit from backdrop, is optimization objectives
defined via non-decomposable loss functions. The rank statistic, a differentiable approximation of
the ROC AUC, is an example of such loss functions [10]. Here again, minibatch SGD optimization is
not viable as the loss function cannot be well approximated over small batch sizes. Backdrop can
also be applied in these problems to significantly improve generalization performance.

Main contributions. In this paper, (a) we establish a means for stochastic gradient optimization
which does not require a modification to the forward pass needed to evaluate the loss function and is
particularly useful for problems with non-trivial subsample structure or with non-decomposable loss.
(b) We explore the technique empirically and demonstrate the significant gains that can be achieved
using backdrop in a number of synthetic and real world examples. (c) We introduce the “multi-
scale GP texture generator”, a flexible synthetic texture generator with clear hierarchical subsample
structure which can be used as a benchmark to measure performance of networks and optimization
tools in problems with hierarchical subsample structure. The source code for our implementation of
backdrop and the multi-scale GP texture generator is available at github.com/dexgen/backdrop.

Preprint. Work in progress.

1.1 Related work

Non-decomposable loss optimization. The study of optimization problems with complex non-
decomposable loss functions has been the subject of active research in the last few years. A number of
methods have been suggested including methods for online optimization [14, 15, 25], methods to solve
problems with constraints [24], and indirect plug-in methods [17, 26]. There are also methods that
indirectly optimize specific performance measures [5, 33]. Our contribution is fundamentally different
from the previous work on the subject. Firstly, our approach can be applied to any differentiable
loss function and does not change based on the exact form of the loss. Secondly, implementation of
backdrop is simple, does not require changing the network structure or the optimization technique
and can therefore be used in conjunction with any other batch optimization method. Finally, backdrop
optimization can be applied to a more general class of problems where the loss is decomposable but
the samples have complex hierarchical subsample structure.

Study of minibatch sizes. Recently there have been many studies analyzing the effects of different
batch sizes on generalization [16, 21, 23, 32] and on computation efficiency and data parallelization [3,
4, 12]. These studies, however, are primarily concerned with problems such as classification accuracy
with cross-entropy or similar decomposable losses. We are not aware of any in-depth studies specifically
targeting the effects of varying the batch size for non-decomposable loss functions.
We address this in passing as part of one of our examples in Sec. 3.

2 Backdrop stochastic gradient descent


Let us consider the following problem. Given dataset S = {x1 , · · · , xN } with iid samples xn , we
wish to optimize an empirical loss function L = Lθ (S), where θ are parameters in a given model.
We can employ gradient descent (GD), an iterative optimization strategy for finding the minimum
of L wherein we take steps proportional to −∇θ L. An alternate strategy which can result in better
generalization performance is to use stochastic gradient descent (SGD), where instead we take steps
according to −∇θ Ln = −∇θ L(xn ), i.e. the gradient of the loss computed on a single sample. We
can further generalize SGD by evaluating the iteration step on a number of randomly chosen samples
called minibatches [1, 13]. We cast GD and minibatch SGD as graphical models in Figs. 1a and 1b,
where the solid and dashed lines respectively denote the forward and backward passes of the network
and N , B, and N/B denote the number of samples, the number of minibatches and the minibatch
size.
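
As a minimal sketch of the three update rules above, the step directions can be written out for a toy least-squares problem. The model, the data and all function names here are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3                       # number of samples, number of parameters
X = rng.normal(size=(N, D))       # toy dataset S = {x_1, ..., x_N}
y = rng.normal(size=N)            # toy regression targets (illustrative)
theta = np.zeros(D)

def sample_grads(theta, X, y):
    """Per-sample gradients of L_n = (x_n . theta - y_n)^2, shape (len(X), D)."""
    residual = X @ theta - y
    return 2.0 * residual[:, None] * X

def gd_step(theta, X, y):
    """Full-batch direction: minus the gradient of L = sum_n L_n."""
    return -sample_grads(theta, X, y).sum(axis=0)

def sgd_step(theta, X, y, n):
    """Single-sample direction: minus the gradient of L_n."""
    return -sample_grads(theta, X[n:n + 1], y[n:n + 1])[0]

def minibatch_steps(theta, X, y, num_batches):
    """One direction per minibatch, for B = num_batches random batches."""
    idx = rng.permutation(len(X))
    return [-sample_grads(theta, X[b], y[b]).sum(axis=0)
            for b in np.array_split(idx, num_batches)]

full = gd_step(theta, X, y)
parts = minibatch_steps(theta, X, y, num_batches=4)
```

Because this toy loss is decomposable, the minibatch directions evaluated at a fixed θ add up to the full-batch direction. It is exactly this property that is lost for the non-decomposable losses considered next.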

Figure 1: Graphical model representation of (a) gradient descent, (b) minibatch SGD and (c) backdrop
with Bernoulli masking. The solid and dashed lines represent the forward pass and gradient
computation respectively.

Now we consider two variations of this problem. First let us assume that the total loss we wish to
optimize is not decomposable over the individual samples, i.e. it cannot be written as a sum of sample
losses, for example it might depend on some global property of the dataset and include cross terms
between different samples, i.e. $L = f(\vec{h})$, where hn is the hidden state computed from sample xn
and f is some non-decomposable function. There are many problems where these kinds of losses
arise, and we discuss a few examples in Sec. 3. In extreme cases, the loss requires all N data points
and batching the data is no longer meaningful (i.e. B = 1). In other cases, a large number of samples
is required to accurately approximate the loss (i.e. N/B ≫ 1) and hence minibatch SGD is also not
optimal.
In order to deal with this problem and recover stochasticity to the gradient, we propose the following.
During optimization, we evaluate the total loss during the forward pass, but for gradient calculation
during the backward pass, we only compute δθ with respect to one (or some) of the samples. Note
that if the loss L can be decomposed to a sum of sample losses $L = \sum_n L_n$, this procedure would be
identically equal to SGD. In practice, we implement backdrop by using a Bernoulli random variable
to choose which samples to perform the gradient update with respect to (Fig. 1c). We call this method
backdrop as it is reminiscent of dropout applied to the backpropagation pipeline.
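
The reduction to SGD for a decomposable loss can be checked numerically. The sketch below assumes a hypothetical linear model h_n = x_n · θ with a squared-error loss, and compares the backdrop gradient (forward pass on all samples, backward pass on the selected ones) with the ordinary gradient computed on the selected samples alone. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, p = 6, 2, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
theta = rng.normal(size=D)

# Bernoulli selection: b_n = 1 keeps sample n in the backward pass.
b = (rng.random(N) >= p).astype(float)
b[0] = 1.0  # illustrative guard so that at least one sample is kept

# Forward pass on ALL samples: h_n = x_n . theta, L = sum_n (h_n - y_n)^2.
h = X @ theta
total_loss = np.sum((h - y) ** 2)

# Backward pass restricted to the selected samples.
dL_dh = 2.0 * (h - y)              # gradient of L with respect to each h_n
grad_backdrop = (b * dL_dh) @ X    # sum_n b_n * dL/dh_n * dh_n/dtheta

# Reference: ordinary gradient of the loss evaluated on the kept samples only.
kept = b.astype(bool)
grad_subset = (2.0 * (X[kept] @ theta - y[kept])) @ X[kept]
```

The two gradients agree because each term of a decomposable loss depends on a single sample. For a non-decomposable loss the forward pass on the full set changes the result, which is the case backdrop is designed for.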
We now consider a different variation of the problem. Let us assume that each sample x consists of a
number of subsamples x = {σ1 , · · · , σM } where the subsamples σ are no longer required to be iid.
For example, an image can be considered as a collection of smaller patches, or a sequence can be
considered as a collection of individual sequence elements. σi can also be defined implicitly by the
network, for example it could refer to the (possibly overlapping) receptive fields of the individual
outputs of a convolutional layer. Similar to above, we take the sample loss to be a non-linear function
of the subsamples, i.e. $L_n = f(\vec{h})$, which again cannot be decomposed into a sum over the individual
subsamples. We can implement backdrop by defining δθBD in a manner analogous to the previous
case. Figure 2 depicts this implementation of backdrop on subsamples, where the B and N/B are as
before and the M plate represents the subsamples.

Figure 2: Backdrop on subsamples with Bernoulli masking. The dotted plate denotes the fact that the
subsamples are generically not iid.

Note that while it is possible to think of the first class of problems with non-decomposable losses as
a special case of the second class of problems with subsamples, we find that it is conceptually easier
to consider these two classes separately.

2.1 Implementation

Since backdrop does not modify the forward pass of the network, it can be easily applied in most
architectures without any modification of the structure. The simplest method to implement backdrop
is via the insertion of a masking layer Mp defined as follows. For vector $\vec{v} = (v_1, \cdots, v_N)$ we have:

$$ M_p(v_n) \equiv v_n, \qquad \nabla_{v_n} M_p(v_m) \equiv \frac{N}{|\vec{b}|_1}\, \delta_{nm}\, b_n(1-p), \tag{1} $$

where $\vec{b}$ is a length N vector of Bernoulli random variables $b_n$, each equal to 1 with probability
1 − p (the argument of $b_n(1-p)$ in Eq. (1)), and $|\vec{b}|_1$ counts the number of its nonzero elements. In
words, Mp acts as identity during the forward pass but drops a random number of gradients given by
probability p during gradient computation. The factor in front of the gradient is a normalization
chosen such that the ℓ1 norm of the gradient is preserved.
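
A minimal NumPy sketch of such a masking layer is given below, assuming a one-dimensional input and a hand-written forward/backward interface. The class name, the seed argument and the redraw of an all-zero mask are assumptions of this sketch, not details from the paper.

```python
import numpy as np

class BackdropMask:
    """Identity in the forward pass, random gradient mask in the backward pass."""

    def __init__(self, p, seed=0):
        if not 0.0 <= p < 1.0:
            raise ValueError("p must lie in [0, 1)")
        self.p = p
        self.rng = np.random.default_rng(seed)
        self.b = None

    def forward(self, v):
        # M_p(v_n) = v_n: the activations pass through unchanged.
        v = np.asarray(v, dtype=float)
        # Draw b_n ~ Bernoulli(1 - p); redraw if every entry is zero
        # (an assumption of this sketch, to avoid dividing by zero).
        while True:
            self.b = (self.rng.random(v.shape[0]) >= self.p).astype(float)
            if self.b.sum() > 0:
                break
        return v

    def backward(self, grad_out):
        # Gradient of M_p: (N / |b|_1) * b_n on the diagonal, zero elsewhere.
        n = self.b.shape[0]
        scale = n / self.b.sum()
        return scale * self.b * np.asarray(grad_out, dtype=float)

mask = BackdropMask(p=0.75, seed=0)
v = np.arange(8.0)
out = mask.forward(v)
grad_in = mask.backward(np.ones(8))
```

For a uniform incoming gradient, as in this usage example, the ℓ1 norm is preserved exactly. For a general gradient it is preserved on average over draws of the mask, since each sample is equally likely to be kept.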
For the problem where the total loss is a non-decomposable function of the sample latent variables
$L = f(\vec{h})$, we can implement backdrop simply by inserting a masking layer between hn and L
(Fig. 1c):

$$ L_{BD} = f(M_p(\vec{h})) \;\Rightarrow\; \delta\theta_{BD} = \nabla_\theta f(M_p(\vec{h})) = \frac{N}{|\vec{b}|_1} \sum_n b_n \nabla_\theta h_n \cdot \nabla_{h_n} f(\vec{h}), \tag{2} $$

resulting in only a fraction of the gradients being accumulated for the gradient descent update as
desired. We would say backdrop in this case is masking its input along the minibatch direction n,
resulting in a smaller effective batch size or EBS, which we define as EBS ≡ N/B × (1 − p).
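
Eq. (2) and the effective batch size can be sketched as follows. The model h_n = tanh(x_n · θ) and the loss f = Var(h) are hypothetical choices, made only because the variance is a simple loss that cannot be written as a sum over samples. The function names are likewise illustrative.

```python
import numpy as np

def backdrop_gradient(theta, X, b):
    """delta theta_BD of Eq. (2) for h_n = tanh(x_n . theta), f(h) = Var(h)."""
    h = np.tanh(X @ theta)                      # forward pass on every sample
    n = h.shape[0]
    df_dh = 2.0 * (h - h.mean()) / n            # gradient of Var(h) w.r.t. h_n
    dh_dtheta = (1.0 - h ** 2)[:, None] * X     # Jacobian rows: grad_theta h_n
    scale = n / b.sum()                         # N / |b|_1
    return scale * (b * df_dh) @ dh_dtheta      # sum over the kept samples

def effective_batch_size(batch_size, p):
    """EBS = (N/B) * (1 - p), with N/B the minibatch size."""
    return batch_size * (1.0 - p)

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 3))
theta = rng.normal(size=3)

b = (rng.random(16) >= 0.5).astype(float)
b[0] = 1.0                                      # guard: keep at least one sample
g_masked = backdrop_gradient(theta, X, b)
g_full = backdrop_gradient(theta, X, np.ones(16))
```

With all b_n = 1 the expression reduces to the ordinary gradient of f, while a random mask yields a stochastic estimate built from the same full forward pass. For instance, a minibatch of 2048 samples with p = 0.75 has an effective batch size of 512.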

In more complicated cases, for example if we want to implement backdrop on the subsamples as in
Fig. 2, the location of the masking layer along the pipeline of the network may depend on the details
of the problem. We will discuss this point with a number of examples in Sec. 3. However, even
when the scale and structure of the subsamples is not known, it is possible to insert masking layers at
multiple points along the network and treat the various masking probabilities as hyperparameters. In
other words, it is not necessary to know the details of the structure of the data in order to implement
and take advantage of the generalization benefits provided by backdrop.
We note that while this implementation of backdrop is reminiscent of dropout [11], there are major
differences. Most notably, the philosophy behind the two schemes is fundamentally different.
Dropout is a network averaging method that simultaneously trains a multitude of networks and takes
an average of the outcomes to reduce over-fitting. Backdrop, on the other hand, is a tool designed
to take advantage of the structure of the data and introduce stochasticity in cases where SGD is not
viable. On a technical level, with backdrop, the masking takes place during the backward pass while
the forward pass remains undisturbed. Because of this, backdrop can be applied in scenarios where
the entire input is required for the task and it can also be simultaneously used with other optimization
techniques. For example, it does not suffer the same incompatibility issues that dropout has with
batch-normalization [22].

3 Examples
In this section, we discuss a number of examples in detail in order to clarify the effects of backdrop in
different scenarios. We focus on the two problem classes discussed in Sec. 2, i.e. problems with non-
decomposable losses and problems where we want to take advantage of the hierarchical subsample
structure of the data. For our experiments, we use CIFAR10 [18], the Ponce texture dataset [20] and
a new synthetic dataset comprised of Gaussian processes with hierarchical correlation lengths [8].
Note that in these experiments, the purpose is not to achieve state of the art performance, but to
exemplify how backdrop can be used and what measure of performance gains one can expect. In each
experiment we scan over a range of learning rates and weight decays. We train a number of models
in each case and report the average of results for the configuration that achieves the best performance.
The structure of the models used in the experiments is given in Tab. 1.
Ponce GP CIFAR
400×300 monochrome 1024×1024 monochrome 32×32 RGB image RGB

3×3 conv 64
4× 3×3 conv 64 3×3 conv 96
3×3 mp stride 2
7×7 conv 64 3×3 conv 96

3×3 conv 64 stride 2 masking layer 3×3 conv 96 stride 2

3×3 mp stride 2

3×3 conv 64
2× {3×3 conv 192
3× 3×3 conv 64
3×3 mp stride 2 3×3 conv 192 stride 2
2× {3×3 conv 192
3×3 conv 64
3×3 conv 10
3×3 conv 4
masking layer *
average-pool over remaining spatial dimensions
softmax
Table 1: Network structures. With the exception of the final layer, all conv layers are followed by ReLU
and batch-norm. The networks used in CIFAR10 problems with non-decomposable loss do not employ the (*)
masking layer. In these problems backdrop is implemented as in Eq. (4).

3.1 Backdrop on a non-decomposable loss

There are an increasing number of learning tasks which require the optimization of loss functions
that cannot be written as a sum of per-sample losses. Examples of such tasks include manifold
learning, problems with constraints, classification problems with class imbalance, problems where
the top/bottom results are more valued than the overall precision, or in general any problem which
utilizes performance measures that require the entire dataset for evaluation, such as F-measures,
the area under the ROC curve and others.

As discussed in Sec. 2, SGD and minibatching schemes are either not applicable to these optimization
problems or not optimal. We propose to approach this class of problems by using one or more
backdrop masking layers to improve generalization (Fig. 1c).

Optimizing ROC AUC on CIFAR10. As the first example in this class of problems we consider
optimizing the ROC AUC on images from the CIFAR10 dataset. Specifically, we take the images of
cats and dogs from this dataset. To make the problem more challenging and relevant for ROC AUC,
we construct a 5:1 imbalanced dataset for training and test sets (e.g. 5000 cat images and 1000 dog
images on the training set). In a binary classification problem, the ROC AUC measures the probability
that a member of class 1 receives a higher score than a member of class 0. Since the ROC AUC is
itself a non-differentiable function, we use the rank statistic, which replaces the non-differentiable
Heaviside function in the definition of the ROC AUC with a sigmoid approximation, as our loss
function [10].
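
The relation between the exact ROC AUC and its sigmoid-smoothed rank statistic can be sketched as below. The sharpness parameter beta, the tie handling and the function names are assumptions made for illustration. The exact form used in [10] may differ.

```python
import numpy as np

def roc_auc(scores, labels):
    """Exact ROC AUC: fraction of (class 1, class 0) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correct pair (a common convention, assumed here).
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def soft_auc_loss(scores, labels, beta=10.0):
    """1 - smoothed AUC, with the Heaviside step replaced by a sigmoid."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return 1.0 - np.mean(1.0 / (1.0 + np.exp(-beta * diff)))

scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1, 0.6])
labels = np.array([1, 1, 0, 0, 0, 1])
auc = roc_auc(scores, labels)
loss = soft_auc_loss(scores, labels)
```

Every term of the loss couples one sample of class 1 with one sample of class 0, so it cannot be written as a sum of per-sample losses, and its value on a small minibatch depends on how many pairs that minibatch happens to contain.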
The results of this experiment are given in Fig. 3. We see that for all batch sizes, implementing
backdrop improves the performance of the network. Moreover, the improvement that is achieved by
implementing backdrop at batch size 2048 is greater than the improvement that we see when reducing
batch size from 2048 to 32 without backdrop. This implies that in this problem, backdrop is a superior
method of stochastic optimization than minibatching alone. For this experiment, performance
deteriorates when reducing the batch size below 32, as the batch size becomes too small to
approximate the rank statistic well. However, the effective batch size defined in Sec. 2.1 has no such
restriction.

Figure 3: ROC AUC vs. mask probability p for different batch sizes (curves shown for batch sizes 32,
128, 512 and 2048).

CIFAR10 with imposed latent space structure. Consider again the image classification problem
on CIFAR10 with a multi-class cross-entropy loss function. For demonstration purposes, we will
try to impose a specific latent space structure by adding an extra non-decomposable term to the loss,
demanding that the per-class averages of the hidden state be separated from each other by a fixed
pair-wise ℓ2 distance. This loss is similar in nature to disentangled representation
learning problems [6, 19, 31].
If we denote the output of the final and next to final layers of the network on the n’th sample as fn
and vn , we can write the total loss as:

$$ L = L_{XE}(\vec{f}) + L_{dist}(\vec{v}), \qquad L_{dist}(\vec{v}) = \sum_{c<c'} \Big[ +\,|\bar{v}_c - \bar{v}_{c'}|^2 - d^2 \Big]^2, \qquad \bar{v}_c = E[v \mid c], \tag{3} $$

where c and c′ are different image classes and v̄c is the class average of the hidden state. The total
loss L is no longer decomposable as a sum of individual sample losses. Furthermore, to make the
problem more challenging, we truncate the CIFAR10 training set to a subset of 5500 images with an
imbalanced number of samples per class. Specifically, on both train and test datasets, we truncate the
i’th class down to 100i samples, i.e. 100 samples for class 1, 200 for class 2 and so on. We impose
the same imbalance ratio on the test set.
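
A small sketch of the distance term, reading Eq. (3) as a squared penalty on the difference between the squared distance of each pair of class means and d². The function name and the toy data are illustrative assumptions. The example also shows why the term is not decomposable: its average over minibatches differs from its value on the full set.

```python
import numpy as np

def distance_loss(v, labels, d):
    """Sum over class pairs c < c' of (|mean_c - mean_c'|^2 - d^2)^2."""
    classes = np.unique(labels)
    means = np.stack([v[labels == c].mean(axis=0) for c in classes])
    loss = 0.0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            sq_dist = np.sum((means[i] - means[j]) ** 2)
            loss += (sq_dist - d ** 2) ** 2
    return loss

# Two classes whose means sit exactly a distance d = 2 apart.
v = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 0, 1, 1])
full_loss = distance_loss(v, labels, d=2.0)

# The same loss averaged over two minibatches of size 2.
batches = [np.array([0, 3]), np.array([1, 2])]
batch_loss = np.mean([distance_loss(v[i], labels[i], d=2.0) for i in batches])
```

On the full set the class means are (0, 1) and (2, 1), so the loss vanishes, whereas each minibatch sees a squared distance of 8 and reports a loss of 16.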
Before proceeding to the results, it is important to note that we expect the two terms in the loss LXE
and Ldist to behave differently when optimized using minibatch SGD. The cross-entropy loss would
benefit from small batch sizes and its generalization performance is expected to suffer as the batch
size is increased. However, as we will see, the second term becomes a poor estimator of the real loss
when evaluated on minibatches of size smaller than 200. We would therefore expect a trade-off in the
generalization performance of this problem if optimized using traditional methods.
To address this trade-off we use two masking layers with different masking probabilities for the two
terms in the loss, i.e. we take the loss function:

$$ L_{BD} = L_{XE}(M_{p_X}(\vec{f})) + L_{dist}(M_{p_D}(\vec{v})), \tag{4} $$

where Mp are the backdrop masking layers defined in Eqs. (1) and (2). In this way we can benefit
from two different effective batch sizes in a consistent manner and avoid the aforementioned trade-off
between LXE and Ldist .
We train 10 models in each configuration given by batch sizes ranging from 32 to 2048 with pX , pD
ranging from 0 to 0.97. The results of this experiment are reported in Fig. 4. Let us first consider the
performance of the network without backdrop as a function of the minibatch size (Fig. 4a purple lines).
The solid lines denote Ldist evaluated on the entire test dataset and the dashed lines are the average of
Ldist evaluated on minibatches, which we denote as $L^{MB}_{dist}$. Indeed we see that $L^{MB}_{dist}$ becomes smaller
as we train the network with smaller minibatches, however, Ldist remains roughly the same, implying
that $L^{MB}_{dist}$ becomes a poorer estimator of Ldist as batch sizes get smaller. As claimed, minibatch
SGD does not benefit this optimization task.

Figure 4: The results of the CIFAR10 experiment with non-decomposable loss. (a) Test Ldist vs. batch
size. (b) Test LXE , batch size 2048. (c) Test Ldist , batch size 2048. Panels (b) and (c) are plotted
against the classification mask and distance mask probabilities. The solid lines in (a) denote Ldist and
the dashed lines represent $L^{MB}_{dist}$, the average of Ldist evaluated on minibatches.

Training the network with backdrop, however, results in significant gains. In Figs. 4b and 4c, we see
the test results for LXE and Ldist as a function of pX and pD , i.e. the mask probabilities applied
to LXE and Ldist respectively (Eq. (4)). Note that increasing pX and pD generally reduces both
losses, but each is more effective in reducing the loss term which it directly masks. We note
that at batch size 2048, the best performance of the network is achieved with masking probabilities
(pX ; pD ) = (0.94; 0.97).
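
As a worked instance of the definition EBS ≡ N/B × (1 − p) from Sec. 2.1, applied separately to each masked term (the subscripted names below are introduced here for illustration only):

```latex
\mathrm{EBS}_X = 2048 \times (1 - 0.94) \approx 123, \qquad
\mathrm{EBS}_D = 2048 \times (1 - 0.97) \approx 61 .
```

Each term is thus updated from roughly a hundred samples or fewer per step, while the forward pass still evaluates both terms on all 2048 samples of the batch.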

3.2 Backdrop on subsamples

The second class of problems where backdrop can have a significant impact consists of those where
each sample has a hierarchical structure with multiple subsamples at each level of the hierarchy. There are
many cases where this situation arises naturally. Image classification, time series analysis, translation
and natural language processing all have intricate hierarchical subsample structure and we would like
to be able to take advantage of this during training. In this section we provide three examples which
demonstrate how backdrop takes advantage of this hierarchical subsample structure.

(a) Small scale GP (b) Large scale GP (c) Final image


Figure 5: Each sample in the two-scale GP dataset is the convolution of two GP processes with different scales.

Gaussian process textures. For our first example, we created a flexible synthetic texture dataset
generator using Gaussian processes (GPs). The classes of any generated dataset are created by taking
the convolution of a small-scale GP (5a) and a large-scale GP (5b) resulting in a 2-scale hierarchical
structure (5c). In simple terms, each sample is comprised of a number of large blobs, each of which
contains many much smaller blobs. Because of this hierarchical structure and the many realizations
at each scale of the hierarchy, this dataset generator is the ideal test bed for our discussion. The
generator also provides the flexibility to tune the difficulty of the problem via changing the individual
correlation lengths as well as the number of subsample hierarchies.
A naive approach to problems with many realizations of the subsamples would be to crop each
training sample into smaller images corresponding to the subsample sizes. In this multi-scale problem
however, cropping is problematic. If we crop at the scale of the larger blobs, we are not fully taking
advantage of the subsample structure of the smaller blobs. Whereas if we crop at the scale of the
smaller blobs we destroy the large scale structure of the data. Furthermore, in many problems the
precise subsample structure of the data is not known, making this cropping approach even harder to
implement.

Figure 6: Cartoon of a masked convolutional network with masking layers at two different scales. The masking
layers are transparent during the forward pass (solid lines) but block some percentage of the gradients during the
backward pass (denoted by the dashed lines). The exact number of convolution, max pool and masking layers
for each experiment is given in Tab. 1.

Backdrop provides an elegant solution to this problem. We utilize a convolutional network which takes
the entirety of each image as input. In order to take advantage of the hierarchical subsample structure,
we use two masking layers Mpl and Mps respectively for the long and short range fluctuations. We
insert these at positions along the network which correspond to convolution layers whose receptive
field is roughly equal to the scale of the blobs (Fig. 6). This will result in a gradient update rule
which updates the weights of the coarser feature detector convolutional layers according to a subset
of the larger GP patches and will update the weights of the finer feature detector layers according
to a subset of the small GP patches. pl and ps respectively determine how many of the large and
small patches will be dropped during gradient computation. We therefore expect the classification
accuracy of the different scales to change based on the two masking probabilities. Note that unlike
cropping, this does not require knowing the exact size of the blobs. Firstly, since the network takes
the entire image as an input, it will not lose classification power if we underestimate the size of the
blobs. Furthermore, even if we have no idea about the structure of the data we can insert masking
layers at more positions corresponding to a variety of receptive field sizes and use their respective
masking probabilities as hyperparameters.
To highlight the improvements derived from backdrop masking, we treat this problem as a one-shot
learning classification task, where each class has a single large training example. Each sample is a
1024 × 1024 monochrome image and the 4 different classes of the dataset have (small, large) GP scales
corresponding to (9.5, 80), (10, 80), (9.5, 140) and (10, 140) pixels [8]. The details of the network
structure used are given in Tab. 1. In particular, the large masking layer acts on a 4 × 4 spatial lattice
corresponding to patches of size 256 × 256 and the small masking layer acts on a 64 × 64 lattice
corresponding to patches of size 16 × 16. We train 10 models for masking probabilities
pl ∈ {0, 0.75, 0.94}, corresponding to keeping all, 4 or only one of the 16 large patches and
ps ∈ {0, 0.99, 0.999} which correspond to keeping all, 40 or 4 of the 64² small patches for gradient
computation. We refrain from random cropping, flipping or any other type of data augmentation to
keep the one-shot learning spirit of the problem where in many cases data augmentation is not
possible.

Table 2: Classification accuracy (%) of the small and large scales and overall accuracy and comparison
with cropping.

(ps , pl )         Total   Small   Large
(0, 0)             45.9    64.7    74.1
(0, 0.75)          54.1    67.4    82.9
(0, 0.94)          49.1    55.3    88.2
(0.99, 0)          57.1    77.9    73.8
(0.999, 0)         57.4    68.2    84.1
(0.99, 0.75)       53.5    55.0    98.2
(0.99, 0.94)       66.2    78.7    84.6
(0.999, 0.75)      54.7    61.5    88.8
(0.999, 0.94)      55.0    58.5    95.0
512 × 512 crop     45.3    50.9    87.9
256 × 256 crop     34.3    53.9    55.3
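
The patch counts quoted above follow from the lattice sizes and the masking probabilities, and a spatial mask of this kind can be sketched in a few lines. Sharing one Bernoulli variable across all channels of a lattice site, the redraw of an empty mask and the function names are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def expected_kept(num_patches, p):
    """Expected number of patches that survive a mask with probability p."""
    return num_patches * (1.0 - p)

def spatial_backdrop_backward(grad, p, rng):
    """Mask a (channels, height, width) gradient on its spatial lattice.

    One Bernoulli(1 - p) variable is drawn per lattice site and shared
    across channels (an assumption of this sketch); kept sites are rescaled
    by (number of sites) / (number of kept sites).
    """
    _, height, width = grad.shape
    while True:
        b = (rng.random((height, width)) >= p).astype(float)
        if b.sum() > 0:          # redraw in the rare case nothing is kept
            break
    return grad * b[None, :, :] * (b.size / b.sum()), b

rng = np.random.default_rng(3)
grad = np.ones((2, 4, 4))                       # stand-in for a 4 x 4 lattice
masked, b = spatial_backdrop_backward(grad, p=0.75, rng=rng)
large_counts = [round(expected_kept(16, p)) for p in (0.0, 0.75, 0.94)]
small_counts = [round(expected_kept(64 ** 2, p)) for p in (0.0, 0.99, 0.999)]
```

The rounded expectations are 16, 4 and 1 for the large lattice and 4096, 41 and 4 for the small one, in line with the approximate counts stated in the text.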

The results of the experiment, evaluated on 100 test images, can be seen in Tab. 2. We report the total
classification accuracy, as well as the accuracy for correctly classifying each scale while ignoring
the other. For reference, random guessing would yield 25% and 50% for total and scale-specific
accuracies. We see that training with backdrop dramatically increases classification accuracy. Note
that for networks trained with a single masking layer, classification is improved for the scale at which
the masking is taking place, i.e. if we are masking the larger patches, the large scale classification
is improved. The best performance is achieved with both masking layers present. However, we
also see that having both masking layers at high masking probabilities can lead to a deterioration
in classification performance. For comparison, we also provide the results of training the network
with 512 × 512 and 256 × 256 fixed-location cropping augmentation, i.e. we respectively chop each
training sample up into 4 or 16 equal sized images and train the network on these smaller sized
images. With 512 × 512 cropping, the large scale discrimination power increases but this comes at
the cost of the small scale discrimination power. The situation is worse with the tighter 256 × 256
crops, where the network is doing little better than random guessing.

Ponce texture dataset. The second example we consider is the Ponce texture dataset [20], which
consists of 25 texture classes each with 40 samples of 400 × 300 monochrome images. In this case, the
hierarchical subsamples are still present but the exact structure of the data is less apparent compared
to the GPs considered above. For training, we use 475 samples keeping the remaining 525 samples as
a test set. We use a convolutional neural net similar to the previous example but with a single backdrop
layer which masks patches that are roughly 70 × 60 pixels in size. The details of the model are given
in Tab. 1.

Figure 7: Test accuracy (%) vs. mask probability p for different batch sizes (curves shown for batch
sizes 8, 16, 32, 64, 128 and 256).

We train 25 models with batch sizes ranging from 8 to 256 and masking probabilities ranging from
0 to 93.75% (i.e. from keeping all of the patches down to keeping only about 2 of the 30 patches
during gradient evaluation). The results of the experiment are reported in Fig. 7. For every value of
the batch size, test performance improves as we increase the masking probability, but the gains are
more pronounced for larger batch sizes. The best performance of the network is achieved by using
small minibatches with high masking probability.

Figure 8: Samples from the Ponce texture dataset.

CIFAR10. Finally, we would like to demonstrate that even when the dataset does not have any
apparent subsample structure, it is still possible to take advantage of generalization improvements
provided by backdrop masking if we employ a network that implicitly defines such a structure. We
demonstrate this using the task of image classification on the CIFAR10 dataset.
We employ the all-convolutional network structure introduced in [30] with the slight modification
that we use batch-normalization after all convolutional layers. The details of the model are given in
Tab. 1. This model has a similar structure to the other models used in this section in that its output is
a 6 × 6 classification heatmap (akin to semantic segmentation models) and the final classification
decision is made by taking an average of this heatmap. The fact that this model works well (it achieves
state of the art performance if trained with data augmentation) implies that each of the 6 × 6 points
on the heatmap has a receptive field that is large enough for the purpose of classification.
We can use this heatmap output of the network for implementing backdrop, in a similar manner to the
previous two examples.

Figure 9: Ratio of test loss with backdrop to test loss without backdrop vs. mask probability p (curves
shown for batch sizes 16, 128, 512 and 2048).
For this experiment, we use minibatch sizes from 16 to 2048 and backdrop mask probabilities from
0 to 0.9 (corresponding to keeping about 4 of the 36 points on the heatmap) and train 5 models for
each individual configuration. We note that using backdrop in this situation leads to small and in
some cases not statistically significant gains in test accuracy. However, as can be seen in Fig. 9, using
backdrop can result in significant gains in test loss (up to 40% in some cases), leading to the model
making more confident classifications. The gains are especially pronounced for smaller batch sizes.
Note, however, that excessive masking for small batch sizes can deteriorate generalization.

4 Discussion
Backdrop is a flexible strategy for introducing data-dependent stochasticity into the gradient, and
can be thought of as a generalization of minibatch SGD. Backdrop can be implemented without
modifying the structure of the network, simply by adding masking layers which are transparent
during the forward pass but drop randomly chosen elements of the gradient during the backward
pass. We have shown in a number of examples how backdrop masking can dramatically improve
generalization performance of a network in situations where minibatching is not viable. These fall
into two categories: problems where evaluation of the loss on a small minibatch leads to a poor
approximation of the real loss, e.g. problems with losses that are non-decomposable over the samples,
and problems with hierarchical subsample structure. We discussed examples for both cases and
demonstrated how backdrop can lead to significant improvements. We also demonstrated that even in
scenarios where the loss is decomposable and there is no obvious hierarchical subsample structure,
using backdrop can lead to lower test loss and higher classification confidence. It would therefore
be of interest to explore any possible effects of backdrop on vulnerability to adversarial attacks. We
leave this avenue of research to future work.
In our experiments, we repeatedly noticed that the best performance of the network, especially for
larger batch sizes, is achieved when the masking probabilities of backdrop layers are extremely
high (in some cases > 98%). As a result, it takes more epochs of training to be exposed to the
full information contained in the dataset, which can lead to longer training times. In our initial
implementation of backdrop, the blocked gradients are still computed but are simply multiplied by
zero. A more efficient implementation, where the gradients are not computed for the blocked paths,
would therefore lead to a significant decrease in computation time as well as in memory requirements.
In a similar fashion, we would expect backdrop to be a natural addition to gradient checkpointing
schemes whose aim is to reduce memory requirements [2, 9].
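
The difference between the two implementations can be sketched for a toy model h_n = tanh(x_n · θ) with f(h) = Var(h), both hypothetical choices made for illustration. The first function computes every Jacobian row and multiplies the blocked ones by zero, while the second computes rows only for the kept samples. The function names are illustrative.

```python
import numpy as np

def grad_multiply_by_zero(theta, X, b):
    """Compute every row of the Jacobian, then zero the blocked rows."""
    h = np.tanh(X @ theta)
    n = h.shape[0]
    df_dh = 2.0 * (h - h.mean()) / n
    jac = (1.0 - h ** 2)[:, None] * X               # all N rows
    return (n / b.sum()) * (b * df_dh) @ jac

def grad_skip_blocked(theta, X, b):
    """Compute Jacobian rows only for the samples that were kept."""
    h = np.tanh(X @ theta)                          # forward pass is unchanged
    n = h.shape[0]
    kept = np.flatnonzero(b)
    df_dh = 2.0 * (h[kept] - h.mean()) / n
    jac = (1.0 - h[kept] ** 2)[:, None] * X[kept]   # only |b|_1 rows
    return (n / len(kept)) * df_dh @ jac

rng = np.random.default_rng(4)
X = rng.normal(size=(32, 5))
theta = rng.normal(size=5)
b = (rng.random(32) >= 0.9).astype(float)
b[0] = 1.0                                          # guard: keep one sample
slow = grad_multiply_by_zero(theta, X, b)
fast = grad_skip_blocked(theta, X, b)
```

Both functions return the same update, but the second touches only the kept rows in the backward computation, which is where the savings in time and memory would come from at high masking probabilities.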
Similar to other tools in a machine-learning toolset, backdrop masking should be used judiciously and
after considerations of the structure of the problem. However, it can also be used in autoML schemes
where a number of masking layers are inserted and the masking probabilities are then fine-tuned as
hyperparameters.
The empirical success of backdrop in the above examples warrants further analytic study of the
technique. In particular, it would be interesting to carry out a martingale analysis of backdrop as
found in stochastic optimization literature [7, 28].

Acknowledgments. We would like to thank Léon Bottou, Joan Bruna, Kyunghyun Cho and Yann
LeCun for interesting discussions and input. We are also grateful to Kyunghyun Cho for suggesting
the name backdrop. KC is supported through the NSF grants ACI-1450310 and PHY-1505463 and
would like to acknowledge the Moore-Sloan data science environment at NYU. SG is supported by
the James Arthur Postdoctoral Fellowship.

References
[1] Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311.

[2] Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Training deep nets with sublinear memory
cost. CoRR, abs/1604.06174.

[3] Das, D., Avancha, S., Mudigere, D., Vaidyanathan, K., Sridharan, S., Kalamkar, D. D., Kaul, B.,
and Dubey, P. (2016). Distributed deep learning using synchronous stochastic gradient descent.
CoRR, abs/1602.06709.
[4] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M. A., Senior,
A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. (2012). Large scale distributed deep networks.
In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural
Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc.
[5] Dembczynski, K. J., Waegeman, W., Cheng, W., and Hüllermeier, E. (2011). An exact algorithm
for f-measure maximization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and
Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 1404–
1412. Curran Associates, Inc.
[6] Ganin, Y. and Lempitsky, V. (2014). Unsupervised Domain Adaptation by Backpropagation.
ArXiv e-prints.
[7] Gladyshev, E. G. (1965). On the stochastic approximation. Theory Probab. Appl., 10:275–278.
[8] Golkar, S. and Cranmer, K. (2018). Multi-scale gaussian process dataset. Zenodo,
http://doi.org/10.5281/zenodo.1252464.
[9] Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. (2016). Memory-efficient
backpropagation through time. CoRR, abs/1606.03401.
[10] Herschtal, A. and Raskutti, B. (2004). Optimising area under the roc curve using gradient
descent. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML
’04, pages 49–, New York, NY, USA. ACM.
[11] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012).
Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
[12] Hoffer, E., Hubara, I., and Soudry, D. (2017). Train longer, generalize better: closing the
generalization gap in large batch training of neural networks. ArXiv e-prints.
[13] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
[14] Kar, P., Narasimhan, H., and Jain, P. (2014). Online and stochastic gradient methods for
non-decomposable loss functions. CoRR, abs/1410.6776.
[15] Kar, P., Narasimhan, H., and Jain, P. (2015). Surrogate Functions for Maximizing Precision at
the Top. ArXiv e-prints.
[16] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-
batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836.
[17] Koyejo, O., Natarajan, N., Ravikumar, P., and Dhillon, I. S. (2014). Consistent binary classifica-
tion with generalized performance metrics. In Proceedings of the 27th International Conference
on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2744–2752, Cambridge,
MA, USA. MIT Press.
[18] Krizhevsky, A., Nair, V., and Hinton, G. (2009). Cifar-10 (canadian institute for advanced
research).
[19] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., and Ranzato, M. (2017).
Fader networks: Manipulating images by sliding attributes. CoRR, abs/1706.00409.
[20] Lazebnik, S., Schmid, C., and Ponce, J. (2005). A sparse texture representation using local affine
regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278.
[21] LeCun, Y., Bottou, L., Orr, G. B., and Müller, K. R. (1998). Efficient BackProp, pages 9–50.
Springer Berlin Heidelberg, Berlin, Heidelberg.

[22] Li, X., Chen, S., Hu, X., and Yang, J. (2018). Understanding the disharmony between dropout
and batch normalization by variance shift. CoRR, abs/1801.05134.
[23] Masters, D. and Luschi, C. (2018). Revisiting Small Batch Training for Deep Neural Networks.
ArXiv e-prints.
[24] Narasimhan, H. (2018). Learning with complex loss functions and constraints. In Storkey, A.
and Perez-Cruz, F., editors, Proceedings of the Twenty-First International Conference on Artificial
Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages
1646–1654, Playa Blanca, Lanzarote, Canary Islands. PMLR.
[25] Narasimhan, H., Kar, P., and Jain, P. (2015). Optimizing Non-decomposable Performance
Measures: A Tale of Two Classes. ArXiv e-prints.
[26] Narasimhan, H., Vaish, R., and Agarwal, S. (2014). On the statistical consistency of plug-in
classifiers for non-decomposable performance measures. In Ghahramani, Z., Welling, M., Cortes,
C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing
Systems 27, pages 1493–1501. Curran Associates, Inc.
[27] Ravanbakhsh, S., Oliva, J., Fromenteau, S., Price, L. C., Ho, S., Schneider, J., and Poczos, B.
(2017). Estimating Cosmological Parameters from the Dark Matter Distribution. ArXiv e-prints.
[28] Robbins, H. and Siegmund, D. (1971). A convergence theorem for nonnegative almost super-
martingales and some applications. Optimizing methods in statistics, pages 233–257.
[29] Shanahan, P. E., Trewartha, D., and Detmold, W. (2018). Machine learning action parameters in
lattice quantum chromodynamics.
[30] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. (2014). Striving for
simplicity: The all convolutional net. CoRR, abs/1412.6806.
[31] Whitney, W. (2016). Disentangled representations in neural models. CoRR, abs/1602.02383.
[32] Wilson, D. R. and Martinez, T. R. (2003). The general inefficiency of batch training for gradient
descent learning. Neural Netw., 16(10):1429–1451.
[33] Ye, N., Chai, K. M. A., Lee, W. S., and Chieu, H. L. (2012). Optimizing f-measures: A tale of
two approaches. In Proceedings of the 29th International Conference
on Machine Learning, ICML’12, pages 1555–1562, USA. Omnipress.
[34] Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. CoRR,
abs/1511.07122.

Relational Deep Reinforcement Learning
Vinicius Zambaldi∗, David Raposo∗, Adam Santoro∗, Victor Bapst, Yujia Li, Igor Babuschkin,
Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan,
Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, Peter Battaglia

Contact: vzambaldi@google.com, draposo@google.com, adamsantoro@google.com

DeepMind
London, United Kingdom
arXiv:1806.01830v2 [cs.LG] 28 Jun 2018

Abstract
We introduce an approach for deep reinforcement learning (RL) that improves upon the
efficiency, generalization capacity, and interpretability of conventional approaches through
structured perception and relational reasoning. It uses self-attention to iteratively reason about
the relations between entities in a scene and to guide a model-free policy. Our results show that
in a novel navigation and planning task called Box-World, our agent finds interpretable solutions
that improve upon baselines in terms of sample complexity, ability to generalize to more complex
scenes than experienced during training, and overall performance. In the StarCraft II Learning
Environment, our agent achieves state-of-the-art performance on six mini-games – surpassing
human grandmaster performance on four. By considering architectural inductive biases, our
work opens new directions for overcoming important, but stubborn, challenges in deep RL.

1 Introduction
Recent advances in deep reinforcement learning (deep RL) [1, 2, 3] are in part driven by a capacity
to learn good internal representations to inform an agent’s policy. Unfortunately, deep RL models
still face important limitations, namely, low sample efficiency and a propensity not to generalize to
seemingly minor changes in the task [4, 5, 6, 7]. These limitations suggest that large capacity deep
RL models tend to overfit to the abundant data on which they are trained, and hence fail to learn an
abstract, interpretable, and generalizable understanding of the problem they are trying to solve.
Here we improve on deep RL architectures by leveraging insights introduced in the RL literature
over 20 years ago under the Relational RL umbrella (RRL, [8, 9]). RRL advocated the use of relational
state (and action) space and policy representations, blending the generalization power of relational
learning (or inductive logic programming) with reinforcement learning. We propose an approach
that exploits these advantages concurrently with the learning power afforded by deep learning. Our
approach advocates learned and reusable entity- and relation-centric functions [10, 11, 12] to implicitly
reason [13] over relational representations.
Our contributions are as follows: (1) we create and analyze an RL task called Box-World that
explicitly targets relational reasoning, and demonstrate that agents with a capacity to produce
relational representations using a non-local computation based on attention [14] exhibit interesting
generalization behaviors compared to those that do not, and (2) we apply the agent to a difficult
problem – the StarCraft II mini-games [15] – and achieve state-of-the-art performance on six mini-
games.
∗ Equal contribution.

Figure 1: Box-World and StarCraft II tasks demand reasoning about entities and their relations.

2 Relational reinforcement learning


The core idea behind RRL is to combine reinforcement learning with relational learning or Inductive
Logic Programming [16] by representing states, actions and policies using a first order (or relational)
language [8, 9, 17, 18]. Moving from a propositional to a relational representation facilitates general-
ization over goals, states, and actions, exploiting knowledge learnt during an earlier learning phase.
Additionally, a relational language also facilitates the use of background knowledge. Background
knowledge can be provided by logical facts and rules relevant to the learning problem.
For example, in a blocks world, one could use the predicate above(S, A, B) to indicate that block
A is above block B in state S when specifying background knowledge. Such predicates can then be
used during learning for blocks C and D, for example. The representational language, background,
and assumptions form the inductive bias, which guides (and restricts) the search for good policies.
The language (or declarative) bias determines the way concepts can be represented.
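As a rough illustration of the relational representation described above, the following sketch encodes a blocks-world state as a set of ground facts and derives `above` from them. The `on` facts, the state layout, and the function name are assumptions made for this example; they are not taken from the RRL systems cited here.

```python
def above(state, a, b):
    """Return True if block a is above block b in `state`.

    `state` is a set of hypothetical ground facts (x, y), each meaning
    "block x sits directly on y". The relation is followed transitively.
    """
    below = {x: y for (x, y) in state}
    current = a
    seen = set()
    while current in below and current not in seen:
        seen.add(current)
        current = below[current]
        if current == b:
            return True
    return False


# One state with a three-block tower and a separate block on the table.
s = frozenset({("a", "b"), ("b", "c"), ("d", "table")})
above_a_c = above(s, "a", "c")  # a is on b, and b is on c
above_c_a = above(s, "c", "a")
above_d_a = above(s, "d", "a")
```

Because the predicate is defined over variables rather than specific blocks, the same definition applies unchanged to any pair of blocks, which is the kind of generalization a relational language is meant to provide.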
Neural nets have traditionally been associated with the attribute-value, or propositional, RL
approaches [19]. Here we translate ideas from RRL into architecturally specified inductive biases
within a deep RL agent, using neural network models that operate on structured representations of a
scene – sets of entities – and perform relational reasoning via iterated, message-passing-like modes of
processing. The entities correspond to local regions of an image, and the agent learns to attend to
key objects and compute their pairwise and higher-order interactions.

3 Architecture
We equip a deep RL agent with architectural inductive biases that may be better suited for learning
(and computing) relations, rather than specifying them as background knowledge as in RRL. This
approach builds on previous work suggesting that relational computations need not be
biased by entities’ spatial proximity [20, 10, 21, 11, 13, 22], and may also profit from iterative
structured reasoning [23, 24, 25, 26].
Our contribution is founded on two guiding principles: non-local computations using a shared
function and iterative computation. We show that an agent which computes pairwise interactions
between entities, independent of their spatial proximity, using a shared function, will be better
suited for learning important relations than an agent that only computes local interactions, such as
in translation-invariant convolutions.¹ Moreover, an iterative computation may be better able to
capture higher-order interactions between entities.

Computing non-local interactions using a shared function


Among a family of related approaches for computing non-local interactions [20], we chose a computa-
tionally efficient attention mechanism. This mechanism has parallels with graph neural networks
and, more generally, message passing computations [27, 28, 29, 12, 30]. In these models entity-entity
¹ Intuitively, a ball can be related to a square by virtue of it being “left of”, and this relation may hold whether the
two objects are separated by a centimetre or a kilometre.

[Figure 2 diagram: Input → Conv. 2 × 2, stride 1 + ReLU (×2) → relational module (multi-head dot-product
attention over query, key, value) → feature-wise max pooling → FC 256 + ReLU (×4)]

Figure 2: Box-World agent architecture and multi-head dot-product attention. E is a matrix that
compiles the entities produced by the visual front-end; fθ is a multilayer perceptron applied in parallel
to each row of the output of an MHDPA step, A, and producing updated entities, Ẽ.

relations are explicitly computed when considering the messages passed between connected nodes of
the graph.
We start by assuming that we already have a set of entities for which interactions must be
computed. We consider multi-head dot-product attention (MHDPA), or self-attention [14], as the
operation that computes interactions between these entities.
For N entities ($e_{1:N}$), MHDPA projects each entity i’s state vector, $e_i$, into query, key, and value
vector representations, $q_i$, $k_i$, $v_i$, respectively, whose activities are subsequently normalized to have
zero mean and unit variance using the method from [31]. Each $q_i$ is compared to all entities’ keys
$k_{1:N}$ via a dot product to compute unnormalized saliencies, $s_i$. These are normalized into weights,
$w_i = \mathrm{softmax}(s_i)$. For each entity, the cumulative interactions are computed as the weighted mixture
of all entities’ value vectors, $a_i = \sum_{j=1:N} w_{i,j} v_j$. This can be compactly computed using matrix
multiplications:

$$A = \underbrace{\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)}_{\text{attention weights}} V \tag{1}$$

where A, Q, K, and V compile the cumulative interactions, queries, keys, and values into matrices,
and d is the dimensionality of the key vectors, used as a scaling factor. Like [14], we use multiple,
independent attention “heads”, applied in parallel, which our attention visualisation analyses (see
Results 4.1) suggest may assume different relational semantics through training. The $a_i^h$ vectors,
where h indexes the head, are concatenated together, passed to a multilayer perceptron (2-layer
MLP with ReLU non-linearities) with the same layer sizes as $e_i$, summed with $e_i$ (i.e., a residual
connection), and transformed via layer normalization [31], to produce an output. Figure 2 depicts
this mechanism.
We refer to one application of this process as an “attention block”. A single block performs
non-local pairwise relational computations, analogous to relation networks [13] and non-local neural
networks [20]. Multiple blocks with shared (recurrent) or unshared (deep) parameters can be
composed to more easily approximate higher order relations, analogous to message-passing on graphs.
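The attention block described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight matrices are random placeholders, the normalization omits learned gain and bias terms, ReLU is applied after both MLP layers, and all function and variable names are invented for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def layer_norm(row, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.5):
    """Random placeholder weights; a trained agent would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def attention_head(E, Wq, Wk, Wv):
    """One head: returns (A, W), where A = softmax(Q K^T / sqrt(d)) V and W holds the weights."""
    Q = [layer_norm(r) for r in matmul(E, Wq)]
    K = [layer_norm(r) for r in matmul(E, Wk)]
    V = [layer_norm(r) for r in matmul(E, Wv)]
    d = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    S = matmul(Q, K_t)  # unnormalized saliencies, one row per querying entity
    W = [softmax([s / math.sqrt(d) for s in row]) for row in S]
    return matmul(W, V), W


def attention_block(E, heads, W1, W2):
    """Concatenate heads, apply a 2-layer ReLU MLP per entity, add the residual, normalize."""
    outputs = [attention_head(E, *h)[0] for h in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(E))]
    hidden = [[max(0.0, x) for x in row] for row in matmul(concat, W1)]
    mlp = [[max(0.0, x) for x in row] for row in matmul(hidden, W2)]
    return [layer_norm([m + e for m, e in zip(m_row, e_row)])
            for m_row, e_row in zip(mlp, E)]


rng = random.Random(0)
N, d_e, d_h, H = 4, 6, 3, 2  # entities, entity size, per-head size, number of heads
E = rand_matrix(rng, N, d_e, scale=1.0)
heads = [tuple(rand_matrix(rng, d_e, d_h) for _ in range(3)) for _ in range(H)]
W1 = rand_matrix(rng, H * d_h, d_e)
W2 = rand_matrix(rng, d_e, d_e)
E_new = attention_block(E, heads, W1, W2)
_, weights = attention_head(E, *heads[0])
```

Calling `attention_block(E_new, heads, W1, W2)` again with the same parameters corresponds to the shared (recurrent) stacking mentioned above; passing a fresh set of parameters corresponds to the unshared (deep) variant.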

Extracting entities
When dealing with unstructured inputs – e.g., RGB pixels – we need a mechanism to represent the
relevant entities. We make the minimal assumption that entities are things located at a
particular point in space. We use a convolutional neural network (CNN) to parse pixel inputs into k
feature maps of size n × n, where k is the number of output channels of the CNN. We then concatenate
x and y coordinates to each k-dimensional pixel feature-vector to indicate the pixel’s position in the
map. We treat the resulting n² pixel-feature vectors as the set of entities by compiling them into an
n² × k matrix E. As in [13], this provides an efficient and flexible way to learn representations of the
relevant entities, while being agnostic to what may constitute an entity for the particular problem at
hand.
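A minimal sketch of this entity-extraction step is shown below. It assumes the CNN output is already available as a nested list of shape n × n × k, and it uses evenly spaced coordinate values between −1 and 1, as the Box-World architecture description specifies. The function name is hypothetical, and each row here has k + 2 entries because the two coordinates are appended to the k features.

```python
def extract_entities(feature_map):
    """Turn an n x n x k feature map (nested lists) into a list of n*n entity vectors.

    Each entity is the feature vector of one cell with its x and y
    coordinates appended, using evenly spaced values in [-1, 1].
    """
    n = len(feature_map)
    # Evenly spaced coordinates between -1 and 1 (a single cell maps to 0).
    coords = [(-1.0 + 2.0 * i / (n - 1)) if n > 1 else 0.0 for i in range(n)]
    entities = []
    for row_idx, row in enumerate(feature_map):
        for col_idx, features in enumerate(row):
            entities.append(list(features) + [coords[col_idx], coords[row_idx]])
    return entities


n, k = 3, 2
# Dummy feature map: every cell holds k copies of its flat index.
fmap = [[[float(r * n + c)] * k for c in range(n)] for r in range(n)]
E = extract_entities(fmap)
```

The resulting list plays the role of the entity matrix E: one row per spatial location, with no assumption about which locations correspond to meaningful objects.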

Agent architecture for Box-World


We adopted an actor-critic set-up, using a distributed agent based on an Importance Weighted
Actor-Learner Architecture [32]. The agent consists of 100 actors, which generate trajectories of
experience, and a single learner, which directly learns a policy π and a baseline function V , using the
actors’ experiences. The model updates were performed on GPU using mini-batches of 32 trajectories
provided by the actors via a queue.
The complete network architecture is as follows. The input observation is first processed through
two convolutional layers with 12 and 24 kernels, 2 × 2 kernel sizes and a stride of 1, followed by
a rectified linear unit (ReLU) activation function. The output is tagged with two extra channels
indicating the spatial position (x and y) of each cell in the feature map using evenly spaced values
between −1 and 1. This is then passed to the relational module (described above) consisting of
a variable number of stacked MHDPA blocks, using shared weights. The output of the relational
module is aggregated using feature-wise max-pooling across space (i.e., pooling a n × n × k tensor to
a k-dimensional vector), and finally passed to a small MLP to produce policy logits (normalized and
used as a multinomial distribution from which the action is sampled) and a baseline scalar V.
Our baseline control agent replaces the MHDPA blocks with a variable number of residual
convolution blocks. Please see the Appendix for further details, including hyperparameter choices.
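The final stages of the Box-World network (feature-wise max pooling followed by sampling an action from normalized logits) can be illustrated with the short sketch below. The entity values and logits are made-up numbers, the small MLP that would produce the logits is omitted, and the function names are assumptions of this example.

```python
import math
import random


def feature_wise_max_pool(entities):
    """Pool a list of entity vectors (each of length k) into one k-dimensional vector."""
    return [max(column) for column in zip(*entities)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Normalize logits and sample an action index from the resulting distribution."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


# Three entities with k = 3 features each (illustrative values).
entities = [[0.1, -2.0, 3.0], [0.5, 0.0, -1.0], [-0.3, 1.5, 2.0]]
pooled = feature_wise_max_pool(entities)

# Hypothetical logits for the four moves: up, down, left, right.
logits = [1.0, 2.0, 0.5, -1.0]
action, probs = sample_action(logits, random.Random(0))
```

Max pooling over entities makes the pooled vector independent of the order and number of entities, which is one reason the same policy head can be reused on scenes of different sizes.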

Agent architecture for StarCraft II


The same set-up was used for the StarCraft II agent, with a few differences in the network architecture
to accommodate the specific requirements of the StarCraft II Learning Environment (SC2LE, [15]). In
particular, we increased its capacity using 2 residual blocks, each consisting of 3 convolutional layers
with 3 × 3 kernels, 32 channels and stride 1. We added a 2D-ConvLSTM immediately downstream of
the residual blocks, to give the agent the ability to deal with recent history. We noticed that this
was critical for StarCraft because the consequences of an agent’s actions are not necessarily part of
its future observations. For example, suppose the agent chooses to move a marine along a certain
path at timestep t. At t + τ the agent’s observation may depict the marine in a different location,
but the details of the path are not depicted. In these situations, the agent is prone to re-select the
path it had already chosen, rather than, say, move on to choose another action.
For the output, alongside action a and value V , the network produces two sets of action-related
arguments: non-spatial arguments (Args) and spatial arguments (Args x,y ). These arguments are used
as modifiers of particular actions (see [15]). Args are produced from the output of the aggregation
function, whereas Args x,y result from upsampling the output of the relational module.
As in Box-World, our baseline control agent replaces the MHDPA blocks with a variable number
of residual convolution blocks. Please see the Appendix for further details.

[Figure 3 panels: Observation; Underlying graph; training curves plotting fraction solved (0.0 to 1.0)
against environment steps (×1e8). Branch length = 1, legend: Relational (1 block), Relational (2 blocks),
Baseline (3 blocks), Baseline (6 blocks). Branch length = 3, legend: Relational (2 blocks), Relational (4 blocks).]

Figure 3: Box-World task: example observations (left), underlying graph structure that determines
the proper path to the goal and any distractor branches (middle) and training curves (right).

4 Experiments and results


4.1 Box-World
Task description
Box-World² is a perceptually simple but combinatorially complex environment that requires abstract
relational reasoning and planning. It consists of a 12 × 12 pixel room with keys and boxes randomly
scattered. The room also contains an agent, represented by a single dark gray pixel, which can move
in four directions: up, down, left, right (see Figure 1).
Keys are represented by a single colored pixel. The agent can pick up a loose key (i.e., one not
adjacent to any other colored pixel) by walking over it. Boxes are represented by two adjacent colored
pixels – the pixel on the right represents the box’s lock and its color indicates which key can be used
to open that lock; the pixel on the left indicates the content of the box which is inaccessible while
the box is locked.
To collect the content of a box the agent must first collect the key that opens the box (the one
that matches the lock’s color) and walk over the lock, which makes the lock disappear. At this point
the content of the box becomes accessible and can be picked up by the agent. Most boxes contain
keys that, if made accessible, can be used to open other boxes. One of the boxes contains a gem,
represented by a single white pixel. The goal of the agent is to collect the gem by unlocking the
box that contains it and picking it up by walking over it. Keys that an agent has in possession are
depicted in the input observation as a pixel in the top-left corner.
In each level there is a unique sequence of boxes that need to be opened in order to reach the gem.
Opening one wrong box (a distractor box) leads to a dead-end where the gem cannot be reached
and the level becomes unsolvable. There are three user-controlled parameters that contribute to
the difficulty of the level: (1) the number of boxes in the path to the goal (solution length); (2)
the number of distractor branches; (3) the length of the distractor branches. In general, the task is
computationally difficult for a few reasons. First, a key can only be used once, so the agent must be
able to reason about whether a particular box is along a distractor branch or along the solution path.
Second, keys and boxes appear in random locations in the room, emphasising a capacity to reason
about keys and boxes based on their abstract relations, rather than based on their spatial positions.
² The Box-World environment will be made publicly available online.
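The key and lock rules above can be made concrete with a small abstract rollout that ignores spatial layout entirely. This is a sketch under assumptions: the agent is taken to hold at most one key at a time, the level is a hand-written dictionary rather than a generated one, and `run_level` and the box names are hypothetical.

```python
def run_level(boxes, loose_key, plan):
    """Abstract Box-World rollout that ignores where objects are placed.

    boxes: dict mapping a box name to (lock_color, content), where content is
           either a key color or the string "gem".
    loose_key: color of the key the agent can pick up at the start.
    plan: ordered list of box names the agent tries to open.
    Returns True if the gem is collected.
    """
    held = loose_key
    opened = set()
    for name in plan:
        lock, content = boxes[name]
        if name in opened or held != lock:
            continue  # cannot open: already opened, or the held key does not match
        opened.add(name)
        held = None  # a key can only be used once
        if content == "gem":
            return True
        held = content  # pick up the key that was inside the box
    return False


# Hypothetical level: solution path red -> blue -> green -> gem,
# plus one distractor box that also accepts the blue key.
boxes = {
    "b1": ("red", "blue"),
    "b2": ("blue", "green"),
    "b3": ("green", "gem"),
    "distractor": ("blue", "yellow"),
}
solved = run_level(boxes, "red", ["b1", "b2", "b3"])
dead_end = run_level(boxes, "red", ["b1", "distractor", "b2", "b3"])
```

In the second plan the blue key is spent on the distractor box, so the boxes on the solution path can no longer be opened. This is the dead-end situation that makes forward planning necessary.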

Figure 4: Visualization of attention weights. (a) The underlying graph of one example level; (b)
the result of the analysis for that level, using each of the entities along the solution path (1–5) as
the source of attention. Arrows point to the entities that the source is attending to. An arrow’s
transparency is determined by the corresponding attention weight.

Training results
The training set-up consisted of Box-World levels with solution lengths of at least 1 and up to 4.
This ensured that an untrained agent would have a small probability of reaching the goal by chance,
at least on some levels.³ The number of distractor branches was randomly sampled from 0 to 4.
Training was split into two variants of the task: one with distractor branches of length 1; another
one with distractor branches of length 3 (see Figure 3).
Agents augmented with our relational module achieved close to optimal performance in the two
variants of this task, solving more than 98% of the levels. In the task variant with short distractor
branches an agent with a single attention block was able to achieve top performance. In the variant
with long distractor branches a greater number of attention blocks was required, consistent with
the conjecture that more blocks allow higher-order relational computations. In contrast, our control
agents, which can only rely on convolutional and fully-connected layers, performed significantly worse,
solving fewer than 75% of the levels across the two task variants.
We repeated these experiments, this time with backward branching in the underlying graph used
to generate the level. With backward branching the agent does not need to plan far into the future;
when it is in possession of a key, a successful strategy is always to open the matching lock. In contrast,
with forward branching the agent can use a key on the wrong lock (i.e. on a lock along a distractor
branch). Thus, forward branching demands more complicated forward planning to determine the
correct locks to open, in contrast to backward branching where an agent can adopt a more reactive
policy, always opting to open the lock that matches the key in possession (see Figure 6 in Appendix).

Visualization of attention weights


We next looked at specific rows of the matrix produced by $\mathrm{softmax}(QK^T/\sqrt{d})$; specifically, those rows
mapping onto relevant objects in the observation space. Figure 4 shows the result of this analysis
when the attending entities (source of the attention) are objects along the solution path. For one of
the attention heads, each key attends mostly to the locks that can be unlocked with that key. In
³ An agent with a random policy solves by chance 2.3% of levels with solution lengths of 1 and 0.0% of levels with
solution lengths of 4.

[Figure 5 plots: fraction solved (0.0 to 1.0) for the Relational and Baseline agents. (a) Longer solution
path lengths, x-axis ticks: Train., 4, 6, 8, 10 (Test). (b) Withheld key (not required during training),
x-axis ticks: Train, Test.]

Figure 5: Generalization in Box-World. Zero-shot transfer to levels that required: (a) opening a
longer sequence of boxes; (b) using a key-lock combination that was never required during training.

other words, the attention weights reflect the options available to the agent once a key is collected.
For another attention head, each key attends mostly to the agent icon. This suggests that it is
relevant to relate each object with the agent, which may, for example, provide a measure of relative
position and thus influence the agent’s navigation.
In the case of RGB pixel inputs, the relationship between keys and locks that can be opened
with that key is confounded with the fact that keys and the corresponding locks have the same RGB
representation. We therefore repeated the analysis, this time using one-hot representation of the
input, where the mapping between keys and the corresponding locks is arbitrary. We found evidence
for the following: (1) keys attend to the locks they can unlock; (2) locks attend to the keys that can
be used to unlock them; (3) all the objects attend to the agent location; (4) agent and gem attend to
each other and themselves.

Generalization capability: testing on withheld environments


As we observed, the attention weights captured a link between a key and its corresponding lock, using
a shared computation across entities. If the function used to compute the weights (and hence, used
to determine that certain keys and locks are related) has learned to represent some general, abstract
notion of what it means to “unlock” – e.g., unlocks(key, lock) – then this function should be
able to generalize to key-lock combinations that it has never observed during training. Similarly, a
capacity to understand “unlocking” shouldn’t necessarily be affected by the number of locks that
need to be unlocked to reach a solution.
And so, we tested the model under two conditions, without further training: (1) on levels that
required opening a longer sequence of boxes than it had ever observed (6, 8 and 10), and (2) on levels
that required using a key-lock combination that was never required for reaching the gem during
training, instead only being placed on distractor paths. In the first condition the agent with the
relational module solved more than 88% of the levels, across all three solution length conditions. In
contrast, the agent trained without the relational module had its performance collapse to 5% when
tested on sequences of 6 boxes and to 0% on sequences of 8 and 10. On levels with new key-lock
combinations, the agent augmented with a relational module solved 97% of the new levels. The
agent without the relational module performed poorly, reaching only 13%. Together, these results
show that the relational module confers on our agents, at least to a certain extent, the ability to
do zero-shot transfer to more complex and previously unseen problems, a skill that so far has been
difficult to attain using neural networks.

                                                  Mini-game
Agent                            1       2       3       4       5       6       7
DeepMind Human Player [15]      26     133      46      41     729    6880     138
StarCraft Grandmaster [15]      28     177      61     215     727    7566     133
Random Policy [15]               1      17       4       1      23      12      <1
FullyConv LSTM [15]             26     104      44      98      96    3351       6
PBT-A3C [33]                     –     101      50     132     125    3345       0
Relational agent                27     196 ↑    62 ↑   303 ↑   736 ↑  4906     123
Control agent                   27     187 ↑    61     295 ↑   602    5055     120

Table 1: Mean scores achieved in the StarCraft II mini-games using full action set. ↑ denotes a score
that is higher than a StarCraft Grandmaster. Mini-games: (1) Move To Beacon, (2) Collect Mineral
Shards, (3) Find And Defeat Zerglings, (4) Defeat Roaches, (5) Defeat Zerglings And Banelings, (6)
Collect Minerals And Gas, (7) Build Marines.
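The ↑ markers in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the three relevant rows from the table and lists the mini-games in which each agent scores strictly higher than the StarCraft Grandmaster; the helper name `beats` is an assumption of this example.

```python
# Mean scores copied from Table 1 (mini-games 1 to 7).
grandmaster = [28, 177, 61, 215, 727, 7566, 133]
relational = [27, 196, 62, 303, 736, 4906, 123]
control = [27, 187, 61, 295, 602, 5055, 120]


def beats(agent, reference):
    """Return the 1-based mini-game numbers where `agent` scores strictly higher."""
    return [i + 1 for i, (a, r) in enumerate(zip(agent, reference)) if a > r]


relational_over_gm = beats(relational, grandmaster)  # [2, 3, 4, 5]
control_over_gm = beats(control, grandmaster)        # [2, 4]
```

The relational agent exceeds the Grandmaster score on four mini-games and the control agent on two, which matches the arrows in the table.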

4.2 StarCraft II mini-games


Task description
StarCraft II is a popular video game that presents a very hard challenge for reinforcement learning.
It is a multi-agent game where each player controls a large number (hundreds) of units that need to
interact and collaborate (see Figure 1). It is partially observable and has a large action space, with
more than 100 possible actions. The consequences of any single action – in particular, early decisions
in the game – are typically only observed many frames later, posing difficulties in temporal credit
assignment and exploration.
We trained our agents on the suite of 7 mini-games developed for the StarCraft II Learning
Environment (SC2LE, [15]). These mini-games were proposed as a set of specific scenarios that are
representative of the mechanics of the full game and can be used to test agents in a simpler set up
with a better defined reward structure, compared to the full game.

Training results
For these results we used the full action set provided by SC2LE and performance was measured as
the mean score over 30 episodes for each mini-game. Our agent implementations achieved high scores
across all the mini-games (Table 1). In particular, the agent augmented with a relational module
achieved state-of-the-art results in six mini-games and its performance surpassed that of the human
grandmaster in four of them.⁴
Head-to-head comparisons between our two implementations show that the agent with the
relational component (relational) achieves equal or better results than the one without (control)
across all mini-games. We note that both models improved substantially over the previous best [15].
This can be attributed to a number of factors: better RL algorithm [32], better hyperparameter
tuning to address issues of credit assignment and exploration, longer training, improved architecture,
and a different action selection procedure. Next, we focus on differences afforded by relational
inductive biases and turn to particular generalization tests to determine the behavioural traits of the
control and relational agents.
⁴ For replay videos visit: http://bit.ly/2kQWMzE

Generalization capability
As observed in Box-World, a capacity to better understand underlying relational structure – rather
than latch onto superficial statistics – may manifest in better generalization to never-before-seen
situations. To test generalization in SC2 we took agents trained on Collect Mineral Shards, which
involved using two marines to collect randomly scattered minerals, and tested them, without further
training, on modified levels that allowed the agents to instead control five marines. Intuitively, if
an agent understands that marines are independent units that can be coordinated, yet controlled
independently to collect resources, then increasing the number of marines available should only
affect the underlying strategy of unit deployment, and should not catastrophically break model
performance.
We observed that – at least for medium size networks – there may be some interesting generalization
capabilities, with the best seed of the relational agent achieving better generalization scores in the
test scenario. However, we noticed high variability in these results, with the effect diminishing when
using larger models (which may be more prone to overfitting on the training set). Therefore, more
work is needed to understand the generalization effects of using a relational agent in StarCraft II
(see Figure 7 in Appendix).
Given the combinatorial richness of the full game, an agent is frequently exposed to situations on
which it was not trained. Thus, an improved capacity to generalize to new situations caused by a
better understanding of underlying, abstract relations is important.

5 Conclusion
By introducing structured perception and relational reasoning into deep RL architectures, our agents
can learn interpretable representations, and exceed baseline agents in terms of sample complexity,
ability to generalize, and overall performance. This demonstrates key benefits of marrying insights
from RRL with the representational power of deep learning. Instead of trying to directly characterize
the internal representations, we appealed to: (1) a behavioural analysis, and (2) an analysis of the
inner workings of the attention mechanism we used to compute entity-entity interactions. (1)
showed that the learned representations allowed for better generalization, which is characteristic of
relational representations. (2) showed that the model’s internal computations were interpretable, and
congruent with the computations we would expect from a model computing task-relevant relations.
Future work could draw on computer vision for more sophisticated structured perceptual reasoning
mechanisms (e.g., [34]), and hierarchical RL and planning [35, 36] to allow structured representations
and reasoning to translate more fully into structured behaviors. It will also be important to further
explore the semantics of the agent’s learned representations, through the lens of what one might
hard-code in traditional RRL.
More speculatively, this work blurs the line between model-free agents, and those with a capacity
for more abstract planning. An important feature of model-based approaches is making general
knowledge of the environment available for decision-making. Here our inductive biases for entity- and
relation-centric representations and iterated reasoning reflect key knowledge about the structure of
the world. While not a model in the technical sense, it is possible that the agent learns to exploit this
relational architectural prior similarly to how an imagination-based agent’s forward model operates
[37, 38, 39]. More generally, our work opens new directions for RL via a principled hybrid of flexible
statistical learning and more structured approaches.

Acknowledgments
We would like to thank Richard Evans, Théophane Weber, André Barreto, Daan Wierstra, John
Agapiou, Petko Georgiev, Heinrich Küttler, Andrew Dudzik, Aja Huang, Ivo Danihelka, Timo Ewalds
and many others on the DeepMind team.

References
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control
through deep reinforcement learning. Nature, 518(7540):529, 2015.
[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the
game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[3] Andrei A. Rusu, Matej Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell.
Sim-to-real robot learning from pixels with progressive nets. In 1st Annual Conference on Robot Learning,
CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings, pages 262–270, 2017.
[4] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning.
arXiv preprint arXiv:1609.05518, 2016.
[5] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep
reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
[6] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines
that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[7] Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou,
Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot
transfer with a generative causal model of intuitive physics. arXiv preprint arXiv:1706.04317, 2017.
[8] Saso Dzeroski, Luc De Raedt, and Hendrik Blockeel. Relational reinforcement learning. In Inductive
Logic Programming, 8th International Workshop, ILP-98, Madison, Wisconsin, USA, July 22-24, 1998,
Proceedings, pages 11–22, 1998.
[9] Saso Dzeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Machine Learning,
43(1/2):7–52, 2001.
[10] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for
learning about objects, relations and physics. In Advances in neural information processing systems,
pages 4502–4510, 2016.
[11] David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter
Battaglia. Discovering objects and their relations from entangled scene representations. arXiv preprint
arXiv:1702.05068, 2017.
[12] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi,
Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre,
Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles
Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick,
Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph
networks. arXiv, 2018.
[13] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia,
and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in neural
information processing systems, pages 4974–4983, 2017.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing
Systems, pages 6000–6010, 2017.
[15] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle
Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: a new
challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
[16] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. J. Log.
Program., 19/20:629–679, 1994.
[17] Kurt Driessens and Jan Ramon. Relational instance based regression for relational reinforcement learning.
In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August
21-24, 2003, Washington, DC, USA, pages 123–130, 2003.
[18] Kurt Driessens and Saso Dzeroski. Integrating guidance into relational reinforcement learning. Machine
Learning, 57(3):271–304, 2004.

[19] M. van Otterlo. Relational representations in reinforcement learning: Review and open problems,
July 2002.
[20] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. arXiv
preprint arXiv:1711.07971, 2017.
[21] Nicholas Watters, Andrea Tacchetti, Theophane Weber, Razvan Pascanu, Peter Battaglia, and Daniel
Zoran. Visual interaction networks. arXiv preprint arXiv:1706.01433, 2017.
[22] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection.
arXiv preprint arXiv:1711.11575, 2017.
[23] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial
optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
[24] Hanjun Dai, Elias Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization
algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6351–6361, 2017.
[25] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner.
In NIPS 2017 Workshop on Meta-Learning, 2017.
[26] WWM Kool and M Welling. Attention solves your tsp. arXiv preprint arXiv:1803.08475, 2018.
[27] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[28] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks
for graphs. In International conference on machine learning, pages 2014–2023, 2016.
[29] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907, 2016.
[30] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Pro-
grammable agents. arXiv preprint arXiv:1706.06383, 2017.
[31] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[32] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam
Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Importance weighted actor-learner architecture:
Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint
arXiv:1802.01561, 2018.
[33] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi,
Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural
networks. arXiv preprint arXiv:1711.09846, 2017.
[34] Xinlei Chen, Li-Jia Li, Li Fei-Fei, and Abhinav Gupta. Iterative visual reasoning beyond convolutions.
arXiv preprint arXiv:1803.11189, 2018.
[35] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David
Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint
arXiv:1703.01161, 2017.
[36] Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra,
Rémi Munos, and David Silver. Learning to search with mctsnets. arXiv preprint arXiv:1802.04697,
2018.
[37] Jessica B Hamrick, Andrew J Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W
Battaglia. Metacontrol for adaptive imagination-based optimization. arXiv preprint arXiv:1705.02670,
2017.
[38] Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sebastien Racanière, David
Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from
scratch. arXiv preprint arXiv:1707.06170, 2017.
[39] Théophane Weber, Sébastien Racanière, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez
Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented
agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks.
In European Conference on Computer Vision, pages 630–645. Springer, 2016.

Appendix
A Box-world
Task
Each level in Box-world is procedurally generated. We start by generating a random graph (a tree) that
defines the correct path to the goal – i.e., the sequence of boxes that need to be opened to reach the gem.
This graph also defines multiple distractor branches – boxes that lead to dead-ends. The agent, keys and
boxes, including the one containing the gem, are positioned randomly in the room, assuring that there is
enough space for the agent to navigate between boxes. There is a total of 20 keys and 20 locks that are
randomly sampled to produce the level. An agent receives a reward of +10 for collecting the gem, +1 for
opening a box in the solution path and −1 for opening a distractor box. A level terminates immediately
after the gem is collected or a distractor box is opened.
The generation process produces a very large number of possible trees, making it extremely unlikely that
the agent will face the same level twice. The procedural generation of levels also allows us to create different
training-test splits by withholding levels that conform to a particular case during training and presenting
them to the agent at test time.

Agent architecture
The agent had an entropy cost of 0.005, discount (γ) of 0.99 and unroll length of 40 steps. Queries, keys and
values were produced by 2 to 4 attention heads and had an embedding size (d) of 64. The output of this
module was aggregated using a feature-wise max pooling function and passed to 4 fully connected layers,
each followed by a ReLU. Policy logits (π, size 4) and baseline function (V , size 1) were produced by a linear
projection. The policy logits were normalized and used as a multinomial distribution from which the action (a)
was sampled.
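The output stage described above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the hidden width of 256, the random initialisation, the number of entities and the helper names (`init_mlp`, `output_head`) are assumptions made for the example; only the embedding size of 64, the 4 ReLU layers and the output sizes (4 logits, 1 baseline value) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random (weights, bias) pairs for a stack of fully connected layers."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def output_head(entities, mlp, w_pi, w_v, rng):
    """entities: [n_entities, d] array produced by the relational module."""
    x = entities.max(axis=0)                  # feature-wise max pooling -> [d]
    for weights, bias in mlp:                 # fully connected layers, each followed by ReLU
        x = np.maximum(x @ weights + bias, 0.0)
    logits = x @ w_pi                         # policy logits, size 4
    value = (x @ w_v).item()                  # baseline V, size 1
    z = logits - logits.max()                 # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    action = rng.choice(len(probs), p=probs)  # sample from the multinomial
    return action, probs, value

d, hidden, n_actions = 64, 256, 4             # hidden width is an assumption
mlp = init_mlp([d, hidden, hidden, hidden, hidden], rng)   # 4 layers
w_pi = rng.normal(0.0, 0.1, size=(hidden, n_actions))
w_v = rng.normal(0.0, 0.1, size=(hidden, 1))
entities = rng.normal(size=(10, d))           # 10 entities, chosen arbitrarily
action, probs, value = output_head(entities, mlp, w_pi, w_v, rng)
```

Max pooling over the entity axis makes the head independent of the number and order of entities, which is why the same head can follow any number of attention blocks.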
Training was done using RMSprop optimiser with momentum of 0, ε of 0.1 and a decay term of 0.99. The
learning rate was tuned, taking values between 1e−5 and 2e−4. Informally, we note that we could replicate
these results using an A3C setup, though training took longer.

Control agent architecture


As a baseline control agent we used the same architecture as the relational agent but replaced the relational
module with a variable number (3 to 6) of residual-convolutional blocks. Each residual block comprised two
convolutional layers, with 3 × 3 kernels, stride of 1 and 26 output channels.

B StarCraft II mini-games
StarCraft II agents were trained with the Adam optimiser for a total of 10 billion steps using batches of 32
trajectories, each unrolled for 80 steps. A linear decay was applied to the optimiser learning rate and entropy

[Figure 6 plots: two panels, Relational (x-axis: environment steps, 0.5e8 to 4.0e8) and Baseline (x-axis: environment steps, 0.0 to 1.0e9); y-axis: fraction solved, 0.0 to 1.0; curves: forward branching and backward branching.]

Figure 6: Box-World: forward branching versus backward branching. With backward branching, any
given key can only open one box; however, each key type (i.e. color), can appear in multiple boxes.
This means that an agent can adopt a more reactive policy without planning beyond which box to
open next.

[Figure 7 plot: bar chart of mean score (y-axis, 0 to 200) for the Relational and Control agents on levels with 1, 2, 3, 4, 5 and 10 marines.]

Figure 7: Generalization results on the StarCraft II mini-game Collect Mineral Shards. Agents were
trained on levels with 2 marines and tested on levels with 1, 2, 3, 4, 5 or 10 marines. Colored bars
indicate mean score of the ten best seeds; error bars indicate standard error.

loss scaling throughout training (see Table 2 for details). We ran approximately 100 experiments for each
mini-game, following Table 4 hyperparameter settings and 3 seeds.

Relational Agent architecture


The StarCraft II (SC2) agent architecture follows closely the one we adopted in Box-World. Here we highlight
the changes needed to satisfy SC2 constraints.
Input-preprocessing. At each time step agents are presented with 4 sources of information: minimap,
screen, player, and last-action. These tensors share the same pre-processing: numerical features are rescaled
with a logarithmic transformation and categorical features are embedded into a continuous 10-dimensional
space.
State encoding. Spatially encoded inputs (minimap and screen) are tiled with binary masks denoting
whether the previous action constituted a screen- or minimap-related action. These tensors are then fed to
independent residual convolutional blocks, each consisting of one convolutional layer (4 × 4 kernels and stride
2) followed by a residual block with 2 convolutional layers (3 × 3 kernels and stride 1), which process and
downsample the inputs to [8 × 8 × #channels_1] outputs. These tensors are concatenated along the depth
dimension to form a singular spatial input (inputs3D). The remaining inputs (player and last-action) are
concatenated and passed to a 2-layer MLP (128 units, ReLU, 64 units) to form a singular non-spatial input
(inputs2D).
Memory processing. Next, inputs2D is passed to the Conv2DLSTM along with its previous state to
produce a new state and outputs2D, which represents an aggregated history of input observations.
Relational processing. outputs2D is flattened and passed to the stacked MHDPA blocks (see Table 3
for details). Its output tensors follow two separate pathways – relational-spatial: reshapes the tensors to
their original spatial shape [8 × 8 × #channels_2]; relational-nonspatial: aggregates through a feature-wise
max-pooling operation and further processes using a 2-layer MLP (512 units per layer, ReLU activations).
Output processing. inputs2D and relational-nonspatial are concatenated to form a set of shared features.
Policy logits are produced by feeding shared features to a 2-layer MLP (256 units, ReLU, |actions| units) and
masking unavailable actions (following [15]). Similarly, baseline values V are generated by feeding shared
features to a separate 2-layer MLP (256 units, ReLU, 1 unit).
Actions are sampled using computed policy logits and embedded into a 16-dimensional vector. This
embedding is used to condition shared features and generate logits for non-spatial arguments (Args) through
independent linear combinations (one for each argument). Finally, spatial arguments (Args_x,y) are obtained
by first deconvolving relational-spatial to [32 × 32 × #channels_3] tensors using Conv2DTranspose layers,
then conditioning them by tiling the action embedding along the depth dimension and passing the result to
1 × 1 × 1 convolution layers (one for each spatial argument). Spatial arguments (x, y) are produced by sampling
from the resulting tensors and selecting the corresponding row and column indexes.
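The last step, turning a spatial logit map into a sampled location, can be sketched in NumPy as below. This is a minimal sketch under assumptions: the function name `sample_spatial_argument` is hypothetical, the logits are treated as a single [32 × 32] map per argument, and the text does not say which of row and column corresponds to x and which to y, so the function simply returns (row, column).

```python
import numpy as np

def sample_spatial_argument(logits, rng):
    """Sample one (row, col) location from a [H, W] map of logits."""
    flat = logits.reshape(-1)
    z = flat - flat.max()                     # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    index = rng.choice(flat.size, p=probs)    # sample a flattened position
    row, col = np.unravel_index(index, logits.shape)
    return int(row), int(col)

rng = np.random.default_rng(0)
logits = np.full((32, 32), -1e9)              # effectively zero probability everywhere
logits[5, 17] = 0.0                           # except at one location
row, col = sample_spatial_argument(logits, rng)
```

Sampling over the flattened map and then unravelling the index keeps the row and column choices jointly distributed, rather than sampling each axis independently.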

Control agent architecture


The baseline control agent architecture only differs on the relational processing part of the pipeline. Analogous
to the relational agent, outputs2D are obtained from Conv2DLSTM layers. These tensors are first passed

Hyperparameter                        Value
Conv2DLSTM
  Output channels (#channels_1)       96
  Kernel shape                        (3, 3)
  Stride                              (1, 1)
Conv2DTranspose
  Output channels (#channels_3)       16
  Kernel shape                        (4, 4)
  Stride                              (2, 2)
Discount (γ)                          0.99
Batch size                            32
Unroll length                         80
Baseline loss scaling                 0.1
Clip global gradient norm             100.0
Adam β1                               0.9
Adam β2                               0.999
Adam ε                                1e−8

Table 2: Shared fixed hyperparameters across mini-games.

Setting                               Value
MLP layers                            2
Units per MLP layer                   384
MLP activations                       ReLU
Attention embedding size              32
Weight sharing                        shared MLP across blocks;
                                      shared embedding across blocks

Table 3: Fixed MHDPA settings for StarCraft II mini-games.

to a 12-layer deep residual model – comprising 4 blocks of 3 convolutional layers (32 output channels, 4 × 4
kernel for the first convolution and 3 × 3 for the second and third, and stride 1) interleaved with ReLU
activations and skip-connections – as proposed by [40], to form the relational-spatial outputs. These tensors
also follow a separate pathway where they are flattened and passed to a 2-layer MLP (512 units per layer,
ReLU activations) to produce what we refer to above as relational-nonspatial. The remaining architecture is
identical to the relational agent.

Hyperparameter                        Value
Relational module
  Number of heads                     [1, 3]
  Number of blocks                    [1, 3, 5]
Entropy loss scaling                  [1e−1, 1e−2, 1e−3]
Adam learning rate                    [1e−4, 1e−5]

Table 4: Swept hyperparameters across mini-games.
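As a rough check on the sweep size, the snippet below enumerates the values in Table 4. It assumes a full grid over the four swept settings, which the text does not state explicitly, and the dictionary keys are illustrative names. Under that assumption the grid has 2 × 3 × 3 × 2 = 36 configurations, and 3 seeds each gives 108 runs, in line with the "approximately 100 experiments" per mini-game mentioned above.

```python
from itertools import product

# Values taken from Table 4; key names are illustrative.
swept = {
    "num_heads": [1, 3],
    "num_blocks": [1, 3, 5],
    "entropy_loss_scaling": [1e-1, 1e-2, 1e-3],
    "adam_learning_rate": [1e-4, 1e-5],
}
num_seeds = 3

# Full Cartesian product of the swept values (an assumption about the sweep).
configs = [dict(zip(swept, values)) for values in product(*swept.values())]
num_runs = len(configs) * num_seeds
```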

An intriguing failing of convolutional neural networks
and the CoordConv solution

Rosanne Liu¹, Joel Lehman¹, Piero Molino¹, Felipe Petroski Such¹, Eric Frank¹, Alex Sergeev², Jason Yosinski¹

¹ Uber AI Labs, San Francisco, CA, USA
² Uber Technologies, Seattle, WA, USA
{rosanne,joel.lehman,piero,felipe.such,mysterefrank,asergeev,yosinski}@uber.com

arXiv:1807.03247v2 [cs.CV] 3 Dec 2018

Abstract

Few ideas have enjoyed as large an impact on deep learning as convolution. For any
problem involving pixels or spatial representations, common intuition holds that
convolutional neural networks may be appropriate. In this paper we show a striking
counterexample to this intuition via the seemingly trivial coordinate transform
problem, which simply requires learning a mapping between coordinates in (x, y)
Cartesian space and coordinates in one-hot pixel space. Although convolutional
networks would seem appropriate for this task, we show that they fail spectacularly.
We demonstrate and carefully analyze the failure first on a toy problem, at which
point a simple fix becomes obvious. We call this solution CoordConv, which
works by giving convolution access to its own input coordinates through the use of
extra coordinate channels. Without sacrificing the computational and parametric
efficiency of ordinary convolution, CoordConv allows networks to learn either
complete translation invariance or varying degrees of translation dependence, as
required by the end task. CoordConv solves the coordinate transform problem with
perfect generalization and 150 times faster with 10–100 times fewer parameters
than convolution. This stark contrast raises the question: to what extent has this
inability of convolution persisted insidiously inside other tasks, subtly hampering
performance from within? A complete answer to this question will require further
investigation, but we show preliminary evidence that swapping convolution for
CoordConv can improve models on a diverse set of tasks. Using CoordConv in
a GAN produced less mode collapse as the transform between high-level spatial
latents and pixels becomes easier to learn. A Faster R-CNN detection model
trained on MNIST detection showed 24% better IOU when using CoordConv, and
in the Reinforcement Learning (RL) domain agents playing Atari games benefit
significantly from the use of CoordConv layers.

1 Introduction
Convolutional neural networks (CNNs) [17] have enjoyed immense success as a key tool for enabling
effective deep learning in a broad array of applications, like modeling natural images [36, 16], images
of human faces [15], audio [33], and enabling agents to play games in domains with synthetic imagery
like Atari [21]. Although straightforward CNNs excel at many tasks, in many other cases progress
has been accelerated through the development of specialized layers that complement the abilities
of CNNs. Detection models like Faster R-CNN [27] make use of layers to compute coordinate
transforms and focus attention, spatial transformer networks [13] make use of differentiable cameras
to transform data from the output of one CNN into a form more amenable to processing with another,

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Figure 1: Toy tasks considered in this paper. The *conv block represents a network comprised of
one or more convolution, deconvolution (convolution transpose), or CoordConv layers. Experiments
compare networks with no CoordConv layers to those with one or more.

and some generative models like DRAW [8] iteratively perceive, focus, and refine a canvas rather
than using a single pass through a CNN to generate an image. These models were all created by
neural network designers that intuited some inability or misguided inductive bias of standard CNNs
and then devised a workaround.
In this work, we expose and analyze a generic inability of CNNs to transform spatial representations
between two different types: from a dense Cartesian representation to a sparse, pixel-based represen-
tation or in the opposite direction. Though such transformations would seem simple for networks
to learn, it turns out to be more difficult than expected, at least when models are comprised of the
commonly used stacks of convolutional layers. While straightforward stacks of convolutional layers
excel at tasks like image classification, they are not quite the right model for coordinate transform.
The main contributions of this paper are as follows:

1. We define a simple toy dataset, Not-so-Clevr, which consists of squares randomly positioned
on a canvas (Section 2).
2. We define the CoordConv operation, which allows convolutional filters to know where they
are in Cartesian space by adding extra, hard-coded input channels that contain coordinates
of the data seen by the convolutional filter. The operation may be implemented via a couple of
extra lines of TensorFlow (Section 3).
3. Throughout the rest of the paper, we examine the coordinate transform problem starting
with the simplest scenario and ending with the most complex. Although results on toy
problems should generally be taken with a degree of skepticism, starting small allows us to
pinpoint the issue, exploring and understanding it in detail. Later sections then show that
the phenomenon observed in the toy domain indeed appears in more real-world settings.
We begin by showing that coordinate transforms are surprisingly difficult even when the
problem is small and supervised. In the Supervised Coordinate Classification task, given a
pixel’s (x, y) coordinates as input, we train a CNN to highlight it as output. The Supervised
Coordinate Regression task entails the inverse: given an input image containing a single
white pixel, output its coordinates. We show that both problems are harder than expected
using convolutional layers but become trivial by using a CoordConv layer (Section 4).
4. The Supervised Rendering task adds complexity to the above by requiring a network to paint
a full image from the Not-so-Clevr dataset given the (x, y) coordinates of the center of a
square in the image. The task is still fully supervised, but as before, the task is difficult to
learn for convolution and trivial for CoordConv (Section 4.3).
5. We show that replacing convolutional layers with CoordConv improves performance in a
variety of tasks. On two-object Sort-of-Clevr [29] images, Generative Adversarial Networks
(GANs) and Variational Autoencoders (VAEs) using CoordConv exhibit less mode collapse,
perhaps because ease of learning coordinate transforms translates to ease of using latents
to span a 2D Cartesian space. Larger GANs on bedroom scenes with CoordConv offer
geometric translation that was never observed in regular GANs. Adding CoordConv to a
Faster R-CNN produces much better object boxes and scores. Finally, agents learning to

Figure 2: The Not-so-Clevr dataset. (a) Example one-hot center images P_i from the dataset. (b) The
pixelwise sum of the entire train and test splits for uniform vs. quadrant splits. (c) and (d) Analogous
depictions of the canvas images I_i from the dataset. Best viewed electronically with zoom.

play Atari games obtain significantly higher scores on some but not all games, and they
never do significantly worse (Section 5).
6. To enable other researchers to reproduce experiments in this paper, and benefit from using
CoordConv as a simple drop-in replacement of the convolution layer in their models, we
release our code at https://github.com/uber-research/coordconv.

With reference to the above numbered contributions, the reader may be interested to know that the
course of this research originally progressed in the 5 → 2 direction as we debugged why progressively
simpler problems continued to elude straightforward modeling. But for ease of presentation, we give
results in the 2 → 5 direction. A progression of the toy problems considered is shown in Figure 1.

2 Not-so-Clevr dataset
We define the Not-so-Clevr dataset and make use of it for the first experiments in this paper. The
dataset is a single-object, grayscale version of Sort-of-CLEVR [29], which itself is a simpler version
of the Clevr dataset of rendered 3D shapes [14]. Note that the series of Clevr datasets have been
typically used for studies regarding relations and visual question answering, but we here use them
for supervised learning and generative models. Not-so-Clevr consists of 9 × 9 squares placed on a
64 × 64 canvas. Square positions are restricted such that the entire square lies within the 64 × 64
grid, so that square centers fall within a slightly smaller possible area of 56 × 56. Enumerating these
possible center positions results in a dataset with a total of 3,136 examples. For each example square
i, the dataset contains three fields:
• $C_i \in \mathbb{R}^2$, its center location in (x, y) Cartesian coordinates,
• $P_i \in \mathbb{R}^{64 \times 64}$, a one-hot representation of its center pixel, and
• $I_i \in \mathbb{R}^{64 \times 64}$, the resulting 64 × 64 image of the square painted on the canvas.
We define two train/test splits of these 3,136 examples: uniform, where all possible center locations
are randomly split 80/20 into train vs. test sets, and quadrant, where three of four quadrants are in the
train set and the fourth quadrant in the test set. Examples from the dataset and both splits are depicted
in Figure 2. To emphasize the simplicity of the data, we note that this dataset may be generated in
only a line or two of Python using a single convolutional layer with filter size 9 × 9 to paint the
squares from a one-hot representation.1
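For readers without TensorFlow at hand, the dataset and its two splits can also be sketched in plain NumPy. This is an illustrative sketch rather than the authors' generation code: the function names, the convention that x indexes columns and y indexes rows, the random seed, and the choice of which quadrant is held out are all assumptions.

```python
import numpy as np

CANVAS, SQUARE = 64, 9
HALF = SQUARE // 2                        # 4 pixels on each side of the center

def make_not_so_clevr():
    """Return centers C [n, 2], one-hot images P and canvas images I [n, 64, 64]."""
    lo, hi = HALF, CANVAS - HALF          # valid centers 4..59, i.e. 56 per axis
    ys, xs = np.meshgrid(np.arange(lo, hi), np.arange(lo, hi), indexing="ij")
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1)
    n = len(centers)
    onehots = np.zeros((n, CANVAS, CANVAS), dtype=np.float32)
    images = np.zeros((n, CANVAS, CANVAS), dtype=np.float32)
    for k, (x, y) in enumerate(centers):
        onehots[k, y, x] = 1.0            # single white pixel at the center
        images[k, y - HALF:y + HALF + 1, x - HALF:x + HALF + 1] = 1.0  # 9 x 9 square
    return centers, onehots, images

def uniform_split(n, rng, train_fraction=0.8):
    """Random 80/20 split over all center positions."""
    perm = rng.permutation(n)
    cut = int(train_fraction * n)
    return perm[:cut], perm[cut:]

def quadrant_split(centers):
    """Hold out one quadrant (here x >= 32 and y >= 32, an arbitrary choice)."""
    held_out = (centers[:, 0] >= CANVAS // 2) & (centers[:, 1] >= CANVAS // 2)
    return np.flatnonzero(~held_out), np.flatnonzero(held_out)

centers, onehots, images = make_not_so_clevr()
uni_train, uni_test = uniform_split(len(centers), np.random.default_rng(0))
quad_train, quad_test = quadrant_split(centers)
```

With 56 valid center positions per axis, the quadrant split holds out 28 × 28 = 784 of the 3,136 examples, so it is a 75/25 split by construction.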

3 The CoordConv layer


The proposed CoordConv layer is a simple extension to the standard convolutional layer. We assume
for the rest of the paper the case of two spatial dimensions, though operators in other dimensions
follow trivially. Convolutional layers are used in a myriad of applications because they often work
well, perhaps due to some combination of three factors: they have relatively few learned parameters,
they are fast to compute on modern GPUs, and they learn a function that is translation invariant (a
translated input produces a translated output).
Footnote 1: For example, ignoring import lines and train/test splits:
onehots = np.pad(np.eye(3136).reshape((3136, 56, 56, 1)), ((0,0), (4,4), (4,4), (0,0)), "constant")
images = tf.nn.conv2d(onehots, np.ones((9, 9, 1, 1)), [1]*4, "SAME")

Figure 3: Comparison of 2D convolutional and CoordConv layers. (left) A standard convolutional
layer maps from a representation block with shape h × w × c to a new representation of shape
h′ × w′ × c′. (right) A CoordConv layer has the same functional signature, but accomplishes the
mapping by first concatenating extra channels to the incoming representation. These channels contain
hard-coded coordinates, the most basic version of which is one channel for the i coordinate and one
for the j coordinate, as shown above. Other derived coordinates may be input as well, like the radius
coordinate used in ImageNet experiments (Section 5).

The CoordConv layer keeps the first two of these properties—few parameters and efficient
computation—but allows the network to learn to keep or to discard the third—translation invariance—
as is needed for the task being learned. It may appear that doing away with translation invariance
will hamper networks’ abilities to learn generalizable functions. However, as we will see in later
sections, allocating a small amount of network capacity to model non-translation invariant aspects of
a problem can enable far more trainable models that also generalize far better.
The CoordConv layer can be implemented as a simple extension of standard convolution in which
extra channels are instantiated and filled with (constant, untrained) coordinate information, after
which they are concatenated channel-wise to the input representation and a standard convolutional
layer is applied. Figure 3 depicts the operation where two coordinates, i and j, are added. Concretely,
the i coordinate channel is an h × w rank-1 matrix with its first row filled with 0’s, its second row with
1’s, its third with 2’s, etc. The j coordinate channel is similar, but with columns filled in with constant
values instead of rows. In all experiments, we apply a final linear scaling of both i and j coordinate
values to make them fall in the range [−1, 1]. For convolution over two dimensions, two (i, j)
coordinates are sufficient to completely specify an input pixel, but if desired, further channels can be
added as well to bias models toward learning particular solutions. In some of the experiments that
follow, we have also used a third channel for an r coordinate, where $r = \sqrt{(i - h/2)^2 + (j - w/2)^2}$.
The full implementation of the CoordConv layer is provided in Section S9. Let’s consider next a few
properties of this operation.
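The description above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation referred to in Section S9: the function names are hypothetical, a 1 × 1 convolution is written as a per-pixel matrix product, and leaving the r channel unscaled is an assumption, since the text only states that i and j are rescaled to [−1, 1].

```python
import numpy as np

def add_coords(x, with_r=False):
    """Append coordinate channels to x of shape [batch, h, w, c] (needs h, w > 1)."""
    b, h, w, _ = x.shape
    # i is constant along each row, j is constant along each column
    i, j = np.meshgrid(np.arange(h, dtype=np.float64),
                       np.arange(w, dtype=np.float64), indexing="ij")
    # linear rescaling of i and j to the range [-1, 1]
    channels = [2.0 * i / (h - 1) - 1.0, 2.0 * j / (w - 1) - 1.0]
    if with_r:
        # r computed from the raw coordinates; not rescaled (an assumption)
        channels.append(np.sqrt((i - h / 2) ** 2 + (j - w / 2) ** 2))
    coords = np.stack(channels, axis=-1)                    # [h, w, d]
    coords = np.broadcast_to(coords, (b,) + coords.shape)   # [b, h, w, d]
    return np.concatenate([x, coords], axis=-1)

def coordconv_1x1(x, weights, with_r=False):
    """1x1 CoordConv: weights has shape [c + d, c_out]."""
    return add_coords(x, with_r) @ weights

x = np.zeros((2, 4, 5, 3))
augmented = add_coords(x)                     # shape [2, 4, 5, 3 + 2]
out = coordconv_1x1(x, np.ones((5, 7)))       # shape [2, 4, 5, 7]
```

The coordinate channels are constants, so they add no trainable state of their own; only the convolution weights that read them are learned.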
Number of parameters. Ignoring bias parameters (which are not changed), a standard convolutional
layer with square kernel size k and with c input channels and c′ output channels will contain
$c c' k^2$ weights, whereas the corresponding CoordConv layer will contain $(c + d) c' k^2$ weights, where
d is the number of coordinate dimensions used (e.g. 2 or 3). The relative increase in parameters is
small to moderate, depending on the original number of input channels.2
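To make the dependence on the number of input channels concrete, the two weight counts can be computed directly. The layer sizes below (32 output channels, 3 × 3 kernels, and the listed input channel counts) are arbitrary illustrative choices, not configurations from the paper.

```python
def conv_weights(c, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return c * c_out * k * k

def coordconv_weights(c, c_out, k, d=2):
    """Weights when d coordinate channels are concatenated to the input."""
    return (c + d) * c_out * k * k

for c in (3, 64, 256):
    base = conv_weights(c, 32, 3)
    extra = coordconv_weights(c, 32, 3) - base
    print(f"c={c}: {base} weights, +{extra} for CoordConv ({extra / base:.2%})")
```

The extra weights depend only on d, the output channels and the kernel size, so the relative overhead shrinks as the input gets wider.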
Translation invariance. CoordConv with weights connected to input coordinates set by initializa-
tion or learning to zero will be translation invariant and thus mathematically equivalent to ordinary
convolution. If weights are nonzero, the function will contain some degree of translation dependence,
the precise form of which will ideally depend on the task being solved. Similar to locally connected
layers with unshared weights, CoordConv allows learned translation dependence, but by contrast
Footnote 2: A CoordConv layer implemented via the channel concatenation discussed entails an increase of $d c' k^2$
weights. However, if k > 1, not all $k^2$ connections from coordinates to each output unit are necessary, as
spatially neighboring coordinates do not provide new information. Thus, if one cares acutely about minimizing
the number of parameters and operations, a k × k conv may be applied to the input data and a 1 × 1 conv to the
coordinates, then the results added. In this paper we have used the simpler, if marginally inefficient, channel
concatenation version that applies a single convolution to both input data and coordinates. However, almost all
experiments use 1 × 1 filters with CoordConv.

it requires far fewer parameters: $(c + d) c' k^2$ vs. $h w c c' k^2$ for spatial input size h × w. Note that
all CoordConv weights, even those to coordinates, are shared across all positions, so translation
dependence comes only from the specification of coordinates; one consequence is that, as with
ordinary convolution but unlike locally connected layers, the operation can be expanded outside the
original spatial domain if the appropriate coordinates are extrapolated.
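The two claims in this paragraph can be checked numerically on a small example. The sketch below uses a 1 × 1 convolution written as a matrix product and a circular shift as the translation; the sizes, random weights and helper names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, w, c, c_out, d, k = 2, 6, 6, 3, 4, 2, 1

def augment(x):
    """Concatenate i and j coordinate channels scaled to [-1, 1]."""
    i, j = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    coords = np.broadcast_to(np.stack([i, j], axis=-1), x.shape[:3] + (d,))
    return np.concatenate([x, coords], axis=-1)

def shift(t):
    """Circular translation by one pixel along the width axis."""
    return np.roll(t, 1, axis=2)

x = rng.normal(size=(b, h, w, c))
w_data = rng.normal(size=(c, c_out))          # weights on the input data
w_coord = rng.normal(size=(d, c_out))         # weights on the coordinate channels

def plain(t):
    return t @ w_data

def coord_zero(t):                            # coordinate weights set to zero
    return augment(t) @ np.concatenate([w_data, np.zeros((d, c_out))])

def coord_learned(t):                         # nonzero coordinate weights
    return augment(t) @ np.concatenate([w_data, w_coord])

same_as_conv = np.allclose(coord_zero(x), plain(x))
conv_commutes = np.allclose(plain(shift(x)), shift(plain(x)))
coord_commutes = np.allclose(coord_learned(shift(x)), shift(coord_learned(x)))

coordconv_params = (c + d) * c_out * k * k    # shared across positions
local_params = h * w * c * c_out * k * k      # locally connected, unshared
```

Here `same_as_conv` and `conv_commutes` come out true while `coord_commutes` is false: with zeroed coordinate weights the layer reduces to ordinary convolution, and with nonzero ones the output depends on position even though every weight is still shared across positions.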
Relations to other work. CoordConv layers are related to many other bodies of work. Composi-
tional Pattern Producing Networks (CPPNs) [31] implement functions from coordinates in arbitrarily
many dimensions to one or more output values. For example, with two input dimensions and N
output values, this can be thought of as painting N separate grayscale pictures. CoordConv can then
be thought of as a conditional CPPN where output values depend not only on coordinates but also
on incoming data. In one CPPN-derived work [11], researchers did train networks that take as input
both coordinates and incoming data for their use case of synthesizing a drum track that could derive
both from a time coordinate and from other instruments (input data) and trained using interactive
evolution. With respect to that work, we may see CoordConv as a simpler, single-layer mechanism
that fits well within the paradigm of training large networks with gradient descent on GPUs. In a
similar vein, research on convolutional sequence to sequence learning [7] has used fixed and learned
position embeddings at the input layer; in that work, positions were represented via an overcomplete
basis that is added to the incoming data rather than being compactly represented and input as separate
channels. In some cases using overcomplete sine and cosine bases or learned encodings for locations
has seemed to work well [34, 24]. Connections can also be made to mechanisms of spatial attention
[13] and to generative models that separately learn what and where to draw [8, 26]. While such works
might appear to provide alternative solutions to the problem explored in this paper, in reality, similar
coordinate transforms are often embedded within such models (e.g. a spatial transformer network
contains a localization network that regresses from an image to a coordinate-based representation
[13]) and might also benefit from CoordConv layers.
Moreover, several previous works have found it necessary or useful to inject geometric information
into networks, for example, in prior networks to enhance spatial smoothness [32], in segmentation
networks [2, 20], and in robotics control through a spatial softmax layer and an expected coordinate
layer that map scenes to object locations [18, 5]. However, in those works it is often seen as a
minor detail in a larger architecture which is tuned to a specific task and experimental project, and
discussions of this necessity are scarce. In contrast, our research (a) examines this necessity in depth
as its central thrust, (b) reduces the difficulty to its minimal form (coordinate transform), leading
to a simple single explanation that unifies previously disconnected observations, and (c) presents
one solution used in various forms by others as a unified layer, easily included anywhere in any
convolutional net. Indeed, the wide range of prior works provides strong evidence of the generality of
the core coordinate transform problem across domains, suggesting the significant value of a work
that systematically explores its impact and collects together these disparate previous references.
Finally, we note that challenges in learning coordinate transformations are not unknown in machine
learning, as learning a Cartesian-to-polar coordinate transform forms the basis of the classic two-
spirals classification problem [4].

4 Supervised Coordinate tasks


4.1 Supervised Coordinate Classification
The first and simplest task we consider is Supervised Coordinate Classification. Illustrated at the top
of Figure 1, given an (x, y) coordinate as input, a network must learn to paint the correct output pixel.
This is simply a multi-class classification problem where each pixel is a class. Why should we study
such a toy problem? If we expect to train generative models that can transform high level latents like
horizontal and vertical position into pixel data, solving this toy task would seem a simple prerequisite.
We later verify that performance on this task does in fact predict performance on larger problems.
In Figure 4 we depict training vs. test accuracy on the task for both uniform and quadrant train/test
splits. For convolutional models³ (6 layers of deconvolution with stride 2, see Section S1 in the
Supplementary Information for architecture details) on uniform splits, we find models that generalize
somewhat, but 100% test accuracy is never achieved, with the best model achieving only 86% test
³ For classification, convolutions and CoordConvs are actually deconvolutional on certain layers when
resolutions must be expanded, but we refer to the models as conv or CoordConv for simplicity.

[Figure 4, right-column annotations. Convolution: convergence to 80% test accuracy takes 4000 seconds. CoordConv: perfect test accuracy takes 10–20 seconds.]

Figure 4: Performance of convolution and CoordConv on Supervised Coordinate Classification.
(left column) Final test vs. train accuracy. On the easier uniform split, convolution never attains
perfect test accuracy, though the largest models memorize the training set. On the quadrant split,
generalization is almost zero. CoordConv attains perfect train and test accuracy on both splits. One
of the main results of this paper is that the translation invariance in ordinary convolution does not
lead to coordinate transform generalization even to neighboring pixels! (right column) Test accuracy
vs. training time of the best uniform-split models from the left plot (any reaching final test accuracy
≥ 0.8). The convolution models never achieve more than about 86% accuracy, and training is slow:
the fastest learning models still take over an hour to converge. CoordConv models learn several
hundred times faster, attaining perfect accuracy in seconds.

accuracy. This is surprising: because of the way the uniform train/test splits were created, all test
pixels are close to multiple train pixels. Thus, we reach a first striking conclusion: learning a smooth
function from (x, y) to one-hot pixel is difficult for convolutional networks, even when trained with
supervision, and even when supervision is provided on all sides. Further, training a convolutional
model to 86% accuracy takes over an hour and requires about 200k parameters (see Section S2 in the
Supplementary Information for details on training). On the quadrant split, convolutional models are
unable to generalize at all. Figure 5 shows sums over training set and test set predictions, showing
visually both the memorization of the convolutional model and its lack of generalization.
In striking contrast, CoordConv models attain perfect performance on both data splits and do so with
only 7.5k parameters and in only 10–20 seconds. The parsimony of parameters further confirms they
are simply more appropriate models for the task of coordinate transform [28, 10, 19].
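To make the task setup concrete, here is a minimal sketch of how the inputs, class labels, and the two train/test splits could be generated. It assumes a 64 × 64 canvas (4096 classes), a random 80/20 partition for the uniform split, and a quadrant split that holds out one whole quadrant; the helper names, the split ratio, and the choice of held-out quadrant are assumptions for illustration.

```python
import random

CANVAS = 64  # assumed canvas side length; 64 * 64 = 4096 classes


def all_coordinates(size=CANVAS):
    """Every (x, y) input on the canvas."""
    return [(x, y) for y in range(size) for x in range(size)]


def pixel_class(x, y, size=CANVAS):
    """Class label: flattened index of the one-hot output pixel."""
    return y * size + x


def uniform_split(coords, test_fraction=0.2, seed=0):
    """Random partition, so test pixels are scattered among train pixels."""
    shuffled = coords[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def quadrant_split(coords, size=CANVAS):
    """Hold out one quadrant (here: x and y both in the upper half)."""
    half = size // 2
    test = [(x, y) for x, y in coords if x >= half and y >= half]
    train = [(x, y) for x, y in coords if not (x >= half and y >= half)]
    return train, test


coords = all_coordinates()
train_u, test_u = uniform_split(coords)
train_q, test_q = quadrant_split(coords)
print(len(train_u), len(test_u))  # 3277 819
print(len(train_q), len(test_q))  # 3072 1024
```

Under the uniform split every test pixel has training pixels nearby, whereas the quadrant split requires extrapolation to a region never seen in training.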

4.2 Supervised Coordinate Regression


Because of the surprising difficulty of learning to transform coordinates from a Cartesian to a pixel-based
representation, we examine whether the inverse transformation from pixel-based to Cartesian is equally
difficult. This is the type of transform that could be employed by a VAE encoder or GAN discriminator
to transform pixel information into higher level latents encoding locations of objects in a scene.
We experimented with various convolutional network structures, and found a 4-layer convolutional
network with fully connected layers (85k parameters, see Section S3 for details) can fit the uniform
training split and generalize well (less than half a pixel error on average), but that same architecture
completely fails on the quadrant split. A smaller fully-convolutional architecture (12k parameters, see
Section S3) can be tuned to achieve limited generalization on the quadrant split (around five pixels
error on average) as shown in Figure 5 (right column), but it performs poorly on the uniform split.
A number of factors may have led to the observed variation of performance, including the use of
max-pooling, batch normalization, and fully-connected layers. We have not fully and separately
measured how much each factor contributes to poor performance on these tasks; rather we report
only that our efforts to find a workable architecture across both splits did not yield any winners. In
contrast, a 900 parameter CoordConv model, where a single CoordConv layer is followed by several
layers of standard convolution, trains quickly and generalizes in both the uniform and quadrant splits.
See Section S3 in Supplementary Information for more details. These results suggest that the inverse
transformation requires similar considerations to solve as the Cartesian-to-pixel transformation.
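One way to see why coordinate channels make the pixel-to-Cartesian direction easy is a hand-built solution: a 1 × 1 convolution over the [pixel, coordinate] channels, a ReLU, and global pooling. The sketch below uses hand-picked weights rather than the trained model reported here; the scaling of coordinates to [-1, 1], the use of max pooling as the global pooling step, and the function names are assumptions.

```python
import numpy as np


def coord_channels(size):
    """Row (i) and column (j) coordinate maps, scaled to [-1, 1]."""
    axis = np.linspace(-1.0, 1.0, size)
    i_map, j_map = np.meshgrid(axis, axis, indexing="ij")
    return i_map, j_map


def regress_position(one_hot):
    """Recover the scaled (i, j) of the single lit pixel in a square image.

    relu(coord + 2 * pixel - 1) is a 1x1 convolution over the
    [pixel, coord] channels followed by ReLU. It is zero wherever
    pixel == 0 and equals coord + 1 at the lit pixel, so global max
    pooling returns coord + 1.
    """
    i_map, j_map = coord_channels(one_hot.shape[0])
    feat_i = np.maximum(i_map + 2.0 * one_hot - 1.0, 0.0)
    feat_j = np.maximum(j_map + 2.0 * one_hot - 1.0, 0.0)
    return feat_i.max() - 1.0, feat_j.max() - 1.0


image = np.zeros((64, 64))
image[10, 50] = 1.0  # one-hot input: row 10, column 50
i_hat, j_hat = regress_position(image)
print(i_hat, j_hat)
```

The same weights work for every pixel location, including locations outside any training set, which is consistent with the generalization behavior reported for CoordConv on both splits.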

[Figure 5 panels. Columns: Supervised Coordinate Classification (Train, Test) and Supervised Coordinate Regression (Train, Test). Rows: ground truth, convolution prediction, CoordConv prediction.]

Figure 5: Comparison of convolutional and CoordConv models on the Supervised Coordinate
Classification and Regression tasks, on a quadrant split. (left column) Results on the seemingly
simple classification task where the network must highlight one pixel given its (x, y) coordinates as
input. Images depict ground truth or predicted probabilities summed across the entire train or test set
and then normalized to make use of the entire black to white image range. Thus, e.g., the top-left
image shows the sum of all train set examples. The conv predictions on the train set cover it well,
although the amount of noise in predictions hints at the difficulty with which this model eventually
attained 99.6% train accuracy by memorization. The conv predictions on the test set are almost
entirely incorrect, with two pixel locations capturing the bulk of the probability for all locations in
the test set. By contrast, the CoordConv model attains 100% accuracy on both the train and test
sets. Models used: conv–6 layers of deconv with strides 2; CoordConv–5 layers of 1×1 conv, first
layer is CoordConv. Details in Section S2. (right column) The regression task poses the inverse
problem: predict real-valued (x, y) coordinates from a one-hot pixel input. As before, the conv
model memorizes poorly and largely fails to generalize, while the CoordConv model fits train and
test set perfectly. Thus we observe the coordinate transform problem to be difficult in both directions.
Models used: conv–9-layer fully-convolutional network with global pooling; CoordConv–5 layers of conv with
global pooling, first layer is CoordConv. Details in Section S3.

4.3 Supervised Rendering


Moving beyond the domain of single pixel coordinate transforms, we compare performance of
convolutional vs. CoordConv networks on the Supervised Rendering task, which requires a network
to produce a 64 × 64 image with a square painted centered at the given (x, y) location. As shown in
Figure 6, we observe the same stark contrast between convolution and CoordConv. Architectures
used for both models can be seen in Section S1 in the Supplementary Information, along with further
plots, details of training, and hyperparameter sweeps given in Section S4.
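The rendering target and the IOU metric used to score it can be sketched as follows. The 9 × 9 square size is an assumption (the text above only specifies a square on a 64 × 64 canvas), as are the function names and the clipping behavior at the canvas border.

```python
import numpy as np


def render_square(x, y, canvas=64, square=9):
    """Binary canvas with a square block centred on column x, row y.

    `square` is assumed odd so the block has a well-defined centre;
    parts of the block that fall off the canvas are clipped.
    """
    half = square // 2
    image = np.zeros((canvas, canvas), dtype=np.uint8)
    image[max(y - half, 0):y + half + 1, max(x - half, 0):x + half + 1] = 1
    return image


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, target).sum() / union)


target = render_square(20, 30)
shifted = render_square(22, 30)  # same square, two pixels to the right
print(target.sum())              # 81
print(iou(target, target))       # 1.0
print(iou(shifted, target))      # 63 / 99
```

Even a two-pixel offset drops the IOU to 63/99 ≈ 0.64, so test IOU values near 1.0 require the square to be placed almost exactly.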

5 Applicability to Image Classification, Object Detection, Generative Modeling, and Reinforcement Learning

Given the starkly contrasting results above, it is natural to ask how much the demonstrated inability
of convolution to perform coordinate transforms infects other tasks. Does the coordinate transform hurdle
persist insidiously inside other tasks, subtly hampering performance from within? Or do networks
skirt the issue by learning around it, perhaps by representing space differently, e.g. via non-Cartesian
representations like grid cells [1, 6, 3]? A complete answer to this question is beyond the scope of this
paper, but encouraging preliminary evidence shows that swapping Conv for CoordConv can improve
a diverse set of models — including ResNet-50, Faster R-CNN, GANs, VAEs, and RL models.

[Figure 6, left plot: Test IOU vs. Train IOU, both axes from 0.0 to 1.0. Legend: Deconv uniform, Deconv quadrant, CoordConv uniform, CoordConv quadrant.]

Figure 6: Results on the Supervised Rendering task. As with the Supervised Coordinate Classification
and Regression tasks, we see the same vast separation in training time and generalization between
convolution models and CoordConv models. (left) Test intersection over union (IOU) vs Train
IOU. We show all attempted models on the uniform and quadrant splits, including some CoordConv
models whose hyperparameter selections led to worse than perfect performance. (right) Test IOU
vs. training time of the best uniform-split models from the left plot (any reaching final test IOU
≥ 0.8). Convolution models never achieve more than about IOU 0.83, and training is slow: the fastest
learning models still take over two hours to converge vs. about a minute for CoordConv models.

ImageNet Classification. As might be expected for tasks requiring straightforward translation
invariance, CoordConv does not help significantly when tested with image classification. Adding a
single extra 1×1 CoordConv layer with 8 output channels improves ResNet-50 [9] Top-5 accuracy by
a meager 0.04% averaged over five runs for each treatment; however, this difference is not statistically
significant. It is at least reassuring that CoordConv doesn’t hurt the performance since it can always
learn to ignore coordinates. This result was obtained using distributed training on 100 GPUs with
Horovod [30]; see Section S5 in Supplementary Information for more details.
Object Detection. In object detection, models look at pixel space and output bounding boxes in
Cartesian space. This creates a natural coordinate transform problem which makes CoordConv
seemingly a natural fit. On a simple problem of detecting MNIST digits scattered on a canvas, we
found the test intersection-over-union (IOU) of a Faster R-CNN network improved by 24% when
using CoordConv. See Section S6 in Supplementary Information for details.
Generative Modeling. Well-trained generative models can generate visually compelling images
[23, 15, 36], but careful inspection can reveal mode collapse: images are of an attractive quality, but
sample diversity is far less than diversity present in the dataset. Mode collapse can occur in many
dimensions, including those having to do with content, style, or position of components of a scene.
We hypothesize that mode collapse of position may be due to the difficulty of learning straightforward
transforms from a high-level latent space containing coordinate information to pixel space and that
using CoordConv could help. First we investigate a simple task of generating colored shapes with,
in particular, all possible geometric locations, using both GANs and VAEs. Then we scale up the
problem to Large-scale Scene Understanding (LSUN) [35] bedroom scenes with DCGAN [25],
through distributed training using Horovod [30].
Using GANs to generate simple colored objects, Figure 7a-d show sampled images and mode
collapse analyses. We observe that a convolutional GAN exhibits collapse of a two-dimensional
distribution to a one-dimensional manifold. The corresponding CoordConv GAN model generates
objects that better cover the 2D Cartesian space while using 7% of the parameters of the conv GAN.
Details of the dataset and training can be seen in Section S7.1 in the Supplementary Information. A
similar story with VAEs is discussed in Section S7.2.
With LSUN, samples are shown in Figure 7e, and more in Section S7.3 in the Supplementary
Information. We observe (1) qualitatively comparable samples when drawing randomly from each
model, and (2) geometric translating behavior during latent space interpolation.
Latent space interpolation⁴ demonstrates that in generating colored objects, motions through latent
space generate coordinated object motion. In LSUN, while with convolution we see frozen objects
fading in and out, with CoordConv, we instead see smooth geometric transformations including
translation and deformation.
⁴ https://www.youtube.com/watch?v=YefMbLqS7Jg

Figure 7: Real images and generated images by GAN and CoordConv GAN. Both models learn the
basic concepts similarly well: two objects per image, one red and one blue, their size is fixed, and
their positions can be random (a). However, plotting the spread of object centers over 1000 samples,
we see that CoordConv GAN samples cover the space significantly better (average entropy: Data red
4.0, blue 4.0, diff 3.5; GAN red 3.13, blue 2.69, diff 2.81; CoordConv GAN red 3.30, blue 2.93, diff
2.62), while GAN samples exhibit mode collapse on where objects can be (b). In terms of relative
locations between the two objects, both models exhibit a certain level of mode collapse, and CoordConv
is worse (c). The averaged image of CoordConv GAN is smoother and closer to that of the data (d). With
LSUN, sampled images are shown (e). All models used in generation are the best out of many runs.

Figure 8: Results using A2C to train on Atari games. Out of 9 games, (a) in 6 CoordConv improves
over convolution, (b) in 2 performs similarly, and (c) on 1 it is slightly worse.

Reinforcement Learning. Adding a CoordConv layer to an actor network within A2C [22] produces
significant improvements on some games, but not all, as shown in Figure 8. We also tried
adding CoordConv to our own implementation of Distributed Prioritized Experience Replay (Ape-X)
[12], but we did not notice any immediate difference. Details of training are included in Section S8.

6 Conclusions and Future Work


We have shown the curious inability of CNNs to model the coordinate transform task, shown a simple
fix in the form of the CoordConv layer, and given results that suggest including these layers can
boost performance in a wide range of applications. Future work will further evaluate the benefits of
CoordConv on large-scale datasets, exploring its robustness to translation perturbations, its impact
on relational reasoning [29], language tasks, and video prediction, and its use with spatial transformer
networks [13] and with cutting-edge generative models [8].

Acknowledgements
The authors gratefully acknowledge Zoubin Ghahramani, Peter Dayan, and Ken Stanley for insightful
discussions. We are also grateful to the entire Opus team and Machine Learning Platform team inside
Uber for providing our computing platform and for technical support.

References
[1] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr
Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al.
Vector-based navigation using grid-like representations in artificial agents. Nature, page 1,
2018.

[2] Clemens-Alexander Brust, Sven Sickert, Marcel Simon, Erik Rodner, and Joachim Denzler.
Convolutional patch networks with spatial prior for road detection and urban scene under-
standing. In International Conference on Computer Vision Theory and Applications (VISAPP),
2015.

[3] C. J. Cueva and X.-X. Wei. Emergence of grid-like representations by training recurrent neural
networks to perform spatial localization. ArXiv e-prints, March 2018.

[4] Scott E Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In
Advances in neural information processing systems, pages 524–532, 1990.

[5] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep
spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on
Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.

[6] Mathias Franzius, Henning Sprekeler, and Laurenz Wiskott. Slowness and sparseness lead to
place, head-direction, and spatial-view cells. PLoS computational biology, 3(8):e166, 2007.

[7] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolu-
tional sequence to sequence learning. CoRR, abs/1705.03122, 2017.

[8] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw:
A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. CoRR, abs/1512.03385, 2015.

[10] Geoffrey E Hinton and Drew Van Camp. Keeping neural networks simple by minimizing
the description length of the weights. In Proceedings of the sixth annual conference on
Computational learning theory, pages 5–13. ACM, 1993.

[11] Amy K Hoover and Kenneth O Stanley. Exploiting functional relationships in musical composi-
tion. Connection Science, 21(2-3):227–251, 2009.

[12] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado
Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint
arXiv:1803.00933, 2018.

[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In
Advances in neural information processing systems, pages 2017–2025, 2015.

[14] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick,
and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary
visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference
on, pages 1988–1997. IEEE, 2017.

[15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. In ICLR, volume abs/1710.10196, 2018.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in Neural Information Processing Systems 25, pages
1106–1114, 2012.
[17] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.
The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep
visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[19] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the Intrinsic
Dimension of Objective Landscapes. In International Conference on Learning Representations,
April 2018.
[20] Yecheng Lyu and Xinming Huang. Road segmentation using cnn with gru. arXiv preprint
arXiv:1804.05164, 2018.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.
Playing Atari with Deep Reinforcement Learning. ArXiv e-prints, December 2013.
[22] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep rein-
forcement learning. In International Conference on Machine Learning, pages 1928–1937,
2016.
[23] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & Play Generative Net-
works: Conditional Iterative Generation of Images in Latent Space. ArXiv e-prints, November
2016.
[24] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander
Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
[25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[26] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee.
Learning what and where to draw. In Advances in Neural Information Processing Systems,
pages 217–225, 2016.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. In Advances in neural information processing
systems, pages 91–99, 2015.
[28] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[29] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter
Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In
Advances in neural information processing systems, pages 4974–4983, 2017.
[30] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow.
ArXiv e-prints, February 2018.
[31] Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of develop-
ment. Genetic programming and evolvable machines, 8(2):131–162, 2007.
[32] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint
arXiv:1711.10925, 2017.
[33] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa-
tion Processing Systems, pages 6000–6010, 2017.

[35] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun:
Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv
preprint arXiv:1506.03365, 2015.
[36] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and
Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative
adversarial networks. In IEEE Int. Conf. Comput. Vision (ICCV), pages 5907–5915, 2017.

Supplementary Information for:
An intriguing failing of convolutional neural networks
and the CoordConv solution

S1 Architectures used for supervised painting tasks

Figure S1 depicts architectures used in each of the two supervised tasks going from coordinates to
images: Supervised Coordinate Classification (Section 4.1), and Supervised Rendering (Section 4.3).
In the case of convolution, or, in this case, transposed convolution (deconvolution), the same archi-
tecture is used for both tasks, as shown in the top row of Figure S1, but we generally found the
Supervised Rendering task requires wider layers (more channels). Top performing deconvolutional
models in Supervised Coordinate Classification have c = 1 or 2, while in Supervised Rendering we
usually need c = 2, 3. In terms of convolutional filter size, filter sizes of 2 and 4 seem to outperform
3 in Coordinate Classification, while in Rendering the difference is less distinctive.
Note that the CoordConv model only replaces the first layer with CoordConv (shown in green in
Figure S1).

Figure S1: Deconvolutional and CoordConv architectures used in each of the two supervised tasks.
“fs” stands for filter size, and “c” for channel size. We use a grid search on different ranges of them
as displayed underneath each model, while allowing deconvolutional models a wider range in both.
Green indicates a CoordConv layer.

[Figure S2 plot: test accuracy (0.0 to 1.0) vs. model size (log scale, 10^3 to 10^6). Legend: Deconv uniform, Deconv quadrant, CoordConv uniform, CoordConv quadrant.]

Figure S2: Model size vs. test accuracy for the Supervised Coordinate Classification subtask on
the uniform split and quadrant split. Deconv models (blue) of many sizes achieve 80% or a little
higher — but never perfect — test accuracy on the uniform split. On the quadrant split, while many
models perform slightly better than chance (1/4096 = .000244) no model generalizes significantly.
The CoordConv model achieves perfect accuracy on both splits.

Because of the usage of different filter sizes and channel sizes, we end up training models with a
range of sizes. Each is combined with further grid searches on hyperparameters including the learning
rate, weight decay, and minibatch sizes. Therefore at the same size we end up with multiple models
with a spread of performances, as shown in Figure S2 for the Supervised Coordinate Classification
task. We repeat exactly the same experimental settings on both uniform and quadrant splits, which
results in the same number of experiments. This is not obvious in Figure S2 because most quadrant-split
runs perform poorly (at the bottom of the figure).
As can be seen, it seems unlikely that even larger models would perform better. They all basically
struggle to get to a good test accuracy. This (1) confirms that performance is not simply being
limited by model size, as well as (2) shows that working CoordConv models are one to two orders of
magnitude smaller (7553 as opposed to 50k-1.6M parameters) than the best convolutional models.
The model size vs. test performance plot on Supervised Rendering is similar (not shown), except that
the CoordConv model in that case has a slightly larger number of parameters: 9490. CoordConv achieves
perfect test IOU there while deconvolutional models struggle at sizes 183k to 1.6M.

S2 Further Supervised Coordinate Classification details

For deconvolutional models, we use the model structure as depicted in the top row in Figure S1, while
varying the choice of filter size ({2, 3, 4}) and channel size multipliers ({1,2,3}), and each combined
with a hyperparameter sweep of learning rate {0.001, 0.002, 0.005, 0.01, 0.02, 0.05}, and weight
decay {0.001, 0.01}. Models are trained using a softmax output with cross entropy loss and the Adam
optimizer. We train for 1000 epochs with minibatch sizes of 16 and 32. The learning rate is dropped to
10% of its value every 200 epochs, four times in total.
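The sweep and the step schedule described in this paragraph can be sketched as below. The sketch assumes the sweep is the full Cartesian product of the listed values and that the drops occur at epochs 200, 400, 600, and 800; the function and key names are hypothetical.

```python
from itertools import product

FILTER_SIZES = [2, 3, 4]
CHANNEL_MULTIPLIERS = [1, 2, 3]
LEARNING_RATES = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05]
WEIGHT_DECAYS = [0.001, 0.01]
BATCH_SIZES = [16, 32]


def sweep_configs():
    """All combinations of the listed settings (assumed full grid)."""
    keys = ("filter_size", "channel_multiplier", "learning_rate",
            "weight_decay", "batch_size")
    grids = (FILTER_SIZES, CHANNEL_MULTIPLIERS, LEARNING_RATES,
             WEIGHT_DECAYS, BATCH_SIZES)
    return [dict(zip(keys, values)) for values in product(*grids)]


def learning_rate_at(epoch, base_lr, drop_every=200, factor=0.1, max_drops=4):
    """Step schedule: multiply by `factor` every `drop_every` epochs, at most `max_drops` times."""
    drops = min(epoch // drop_every, max_drops)
    return base_lr * factor ** drops


configs = sweep_configs()
print(len(configs))  # 216
for epoch in (0, 199, 200, 999):
    print(epoch, learning_rate_at(epoch, base_lr=0.01))
```

Under the full-grid assumption this gives 216 deconvolutional runs per split, compared with the three learning rates tried for CoordConv.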
For CoordConv models, because they converge so quickly and easily, we did not have to try many settings:
only 3 learning rates {0.01, 0.001, 0.005}, and they all learned perfectly well. There is also no need
for learning rate schedules, as training converges in 10 seconds.
Figure S3 demonstrates how accurate and smooth the learned probability mass is with CoordConv,
and how much less so with Deconv. We first show the overall 64 × 64 map of logits, one for a training
example and one for a test example right next to it. Then we zoom in on a smaller
region to examine the details. We can see that convolution, even though designed to act in a
translation-invariant way, shows artifacts indicating that it fails to do so.

Figure S3: Comparison of behaviors between Deconv model and CoordConv model on the Supervised
Coordinate Classification task. We select five horizontally neighboring pixels, containing samples in
both train and test splits, and zoom in on a 5 × 9 section of the 64 × 64 canvas so the detail of the
logits and predicted probabilities may be seen. The full 64 × 64 map of logits of the first two samples
(first in train, second in test) are also shown. The deconvolutional model outputs probabilities in a
decidedly non-translation-invariant manner.

S3 Further Supervised Coordinate Regression details


Exact architectures applied in the Supervised Coordinate Regression task are described in Table
S1. For the uniform split, the best-generalizing convolution architecture consisted of a stack of
alternating convolution and max-pooling layers, followed by a fully-connected layer and an output
layer. This architecture was fairly robust to changes in hyperparameters. In contrast, for the quadrant
split, the best-generalizing network consisted of strided convolutions feeding into a global-pooling
output layer, and good performance was delicate. In particular, training and generalization were
sensitive to the number of batch normalization layers (2), weight decay strength (5e-4), and optimizer
(Adam, learning rate 5e-4). A single CoordConv architecture generalized perfectly with the same
hyperparameters over both splits, and consisted of a single CoordConv layer followed by additional
layers of convolution, feeding into a global pooling output layer.

Table S1: Model Architectures for Supervised Coordinate Regression. FC: Fully-connected, MP:
Max Pooling, GP: Global Pooling, BN: Batch normalization, s2: stride 2.

Split      Model      Architecture
Uniform    Conv       3×3, 16 - MP 2×2 - 3×3, 16 - MP 2×2 - 3×3, 16 - MP 2×2 - 3×3, 16 - FC 64 - FC 2
Quadrant   Conv       5×5 (s2), 16 - 1×1, 16 - BN - 3×3, 16 - 3×3 (s2), 16 - 3×3 (s2), 16 - BN - 3×3 (s2), 16 - 1×1, 16 - 3×3 (s2), 16 - 3×3, 2 - GP
Both       CoordConv  1×1, 8 - 1×1, 8 - 1×1, 8 - 3×3, 8 - 3×3, 2 - GP
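As a rough consistency check, the parameter count of the CoordConv column of Table S1 can be tallied layer by layer. The sketch assumes the first layer sees one data channel plus two coordinate channels, that every layer has a bias, and that there are no other parameters; these are assumptions rather than details stated in the table.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a k x k convolution from c_in to c_out channels."""
    return c_in * c_out * k * k + (c_out if bias else 0)


# (filter size, output channels) for the CoordConv column of Table S1
COORDCONV_LAYERS = [(1, 8), (1, 8), (1, 8), (3, 8), (3, 2)]


def total_params(layers, c_in):
    """Sum parameters over a chain of convolutions starting from c_in channels."""
    total = 0
    for k, c_out in layers:
        total += conv_params(c_in, c_out, k)
        c_in = c_out
    return total


# Assumed input: 1 one-hot image channel + 2 coordinate channels.
print(total_params(COORDCONV_LAYERS, c_in=1 + 2))  # 906
```

Under these assumptions the total is 906, in line with the roughly 900 parameters quoted for the CoordConv regression model in Section 4.2.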

S4 Further Supervised Rendering details


Both the architectural and experimental settings are similar to those in Section S2, except that the loss is
cross entropy on a pixelwise sigmoid output. We also tried mean squared error loss, but the
performance was even weaker. We performed heavy hyperparameter sweeping and deliberate learning
rate annealing for Deconv models (as described in Section S2), while for CoordConv models it is fairly
easy to find a good setting. All models trained with learning rates {0.001, 0.005}, weight decay {0,
0.001}, and filter size {1, 2} turned out to perform well after 1–2 minutes of training. For the best model
obtained, Figure S4 and Figure S5 show the learned logits and pixelwise probability distributions for
three samples each, in the uniform and quadrant cases, respectively. We can see that the CoordConv
model learns a much smoother and more precise distribution. All samples are test samples.

Figure S4: Output comparison between Deconv and CoordConv models on three test samples. Models
are trained on a uniform split. Logits are model’s direct output; pixelwise probability (pw-prob)
is logits after Sigmoid. Conv outputs (middle columns) are roughly right. CoordConv
outputs (right columns) are precisely correct and their logit maps are smooth.

Figure S5: Output comparison between Deconv and CoordConv models on three test samples. Models
are trained on a quadrant split. Logits are model’s direct output; pixelwise probability (pw-prob)
is logits after Sigmoid. Conv outputs (middle columns) mostly fail. Even with such a difficult
generalization problem, CoordConv outputs (right columns) are precisely correct and their logit maps
are smooth.

S5 Further ImageNet classification details


We evaluate the potential of CoordConv in image classification with ImageNet experiments. We take
ResNet-50 and run the baseline on a distributed framework using 100 GPUs, with the open-source
framework Horovod. For CoordConv variants, we add an extra CoordConv layer only at the beginning,
which takes a 6-channel tensor containing image RGB, i, j coordinates and pixel distance to center r,
and outputs 8 channels with 1×1 convolution. The increase in parameters is negligible. Its output then feeds
into the rest of ResNet-50.
Each model is run 5 times with the same settings to account for experimental variance. Table S2 lists
the test result from each run at the end of 90 epochs. The CoordConv model obtains a better average result
on two of the three measures; however, a one-sided t-test shows that the improvement in Top-5 accuracy
is not quite statistically significant, with p = .11.

Of all vision tasks, we might expect image classification to show the least performance change when
using CoordConv instead of convolution, as classification is more about what is in the image than
where it is. The small improvement observed is consistent with this expectation.

Table S2: ImageNet classification result comparison between a baseline ResNet-50 and CoordConv
ResNet-50. For each model five experiments are run, listed in five separate rows below.

                      Test loss   Top-1 Accuracy   Top-5 Accuracy

Baseline ResNet-50    1.43005     0.75722          0.92622
                      1.42385     0.75844          0.9272
                      1.42634     0.75782          0.92754
                      1.42166     0.75692          0.92756
                      1.42671     0.75724          0.92708
Average               1.425722    0.757528         0.92712

CoordConv ResNet-50   1.42335     0.75732          0.92802
                      1.42492     0.75836          0.92754
                      1.42478     0.75774          0.92818
                      1.42882     0.75702          0.92694
                      1.42438     0.75668          0.92714
Average               1.42525     0.757424         0.927564
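
The Top-5 comparison can be checked directly from the five runs per model in Table S2. The sketch below recomputes the two averages and a t statistic. It assumes Welch's unequal-variance form, since the text does not say which variant of the t-test was used; `welch_t` is a hypothetical helper name. The statistic comes out at roughly 1.3, which is in line with a one-sided p-value slightly above 0.1.

```python
from math import sqrt
from statistics import mean, variance

# Top-5 accuracies of the five runs per model, copied from Table S2.
baseline = [0.92622, 0.9272, 0.92754, 0.92756, 0.92708]
coordconv = [0.92802, 0.92754, 0.92818, 0.92694, 0.92714]


def welch_t(a, b):
    """Welch's t statistic for the hypothesis mean(b) > mean(a).

    Assumption: unequal variances (Welch); the paper only says "one-sided t-test".
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


t_stat = welch_t(baseline, coordconv)
print(f"baseline mean  = {mean(baseline):.6f}")
print(f"CoordConv mean = {mean(coordconv):.6f}")
print(f"t statistic    = {t_stat:.3f}")
```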

S6 Further object detection details

The object detection experiments are on a dataset containing randomly rescaled and placed MNIST
digits on a 64 × 64 canvas. To make it more akin to natural images, we generate a much larger canvas
and then center crop it to be 64 × 64, so that digits can be partially outside of the canvas. We kept
images that contain 5 digit objects whose centers are within the canvas. In the end we use 9000
images as the training set and 1000 as the test set.
A schematic of the model architecture is illustrated in Figure S6. We use A = 9 anchors,
with sizes (15, 15), (20, 20), (25, 25), (15, 20), (20, 25), (20, 15), (25, 20), (15, 25), (25, 15). In
box sampling (training mode), p_size and n_size are 6. In box non-maximum suppression (NMS)
(test mode), the IOU threshold is 0.8 and the maximum number of proposal boxes is 10. After the boxes
are proposed and shifted, we do not have a downstream classification task, but just calculate the loss
from the boxes. The training loss includes box loss and score loss. As an evaluation metric we also
calculate IOUs between proposed boxes and ground truth boxes. Table S3 lists those metrics obtained
on the test dataset by both Conv and CoordConv models. We found that every metric is improved by
CoordConv, and the average test IOU improved by about 24 percent.
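
The IOU metric and the test-mode NMS step described above can be sketched as follows. This is a generic greedy NMS, not the authors' implementation. The corner-based box format `(x1, y1, x2, y2)`, the function names, and the example boxes are assumptions for illustration; only the IOU threshold of 0.8 and the cap of 10 boxes come from the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.8, max_boxes=10):
    """Greedy NMS: visit boxes by descending score and keep a box only if its
    IOU with every already-kept box is at most the threshold."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    kept = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in kept):
            kept.append(k)
        if len(kept) == max_boxes:
            break
    return kept


# Hypothetical proposals: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 30, 30), (11, 10, 31, 30), (40, 40, 55, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the surviving boxes
```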

Figure S6: Faster R-CNN architecture used for object detection on scattered MNIST digits. Green
indicates where coordinates are added. Note that the input image is used for demonstration purpose.
The real dataset contains 5 digits on a canvas and allows overlapping. (Left) train mode with box
sampler. (Right) test mode with box NMS.

Table S3: MNIST digits detection result comparison between a Faster R-CNN model with regular
convolution vs. with CoordConv. Metrics are all on test set. Train IOU: average IOU between
sampled positive boxes (train mode) and ground truth; Test IOU-average: average IOU between 10
selected boxes (test mode) and ground truth; Test IOU-select: average IOU between the best scored
box and its closest ground truth.

                                    Conv     CoordConv   % Improvement

Box loss                            0.1003   0.0854      17.44
Score loss                          0.5270   0.2526      108.63
Total loss (sum of the two above)   0.6271   0.3383      85.37
Train IOU                           0.6388   0.6612      3.38
Test IOU-average                    0.1508   0.1868      23.87
Test IOU-select                     0.4965   0.6359      28.08

S7 Further generative modeling details


S7.1 GANs on colored shapes

Data. The dataset used to train the generative models is 50k red-and-blue-object images of size
64 × 64. We follow the same mechanism as Sort-of-Clevr, in that objects appear at random positions
on a white background without overlapping, only limiting the number of objects to be 2 per image.
The objects are always one red and one blue, of a randomly chosen shape out of {circle, square}.
Examples of images from this dataset can be seen in the top row, leftmost column in Figure 7, at the
intersection of "Real images" and "Random samples".

Architecture and training details. The z dimension to both regular GAN and CoordConv GAN
is 256. In GAN, the generator uses 4 layers of deconvolution with strides of 2 to project z to a
64 × 64 image shape. The parameter size of the generator is 6,413,315. In CoordConv GAN, we
add coordinate channels only at the beginning, making the first layer CoordConv, and then continue
with normal Conv. The generator in this case uses mostly (1,1) convolutions and has only 444,931
parameters. The same discriminator is used for both models. In the case where we also make the
discriminator CoordConv-like, its first Conv layer is replaced by CoordConv, and the parameter
size increases from 4,316,545 to 4,319,745. The details of both architectures can be seen in Table S4.
We trained two CoordConv GAN versions where CoordConv applies: 1) only in generator, and 2)
in both generator and discriminator. They end up performing similarly well. The demonstrated
examples in all figures are from one in the latter case.
The change needed to make a generator whose first layer is fully-connected CoordConv is trivial.
Instead of taking image tensors which already have Cartesian dimensions, the CoordConv generator
first tiles the z vector into a full 64 × 64 space, and then concatenates it with coordinates in that space.
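
The tiling step can be illustrated with plain Python lists. This is a minimal sketch, not the authors' code: `coord_generator_input` is a hypothetical helper, and the scaling of coordinates to [-1, 1] and the channel order (x before y) are assumptions borrowed from the AddCoords layer in Section S9.

```python
import random


def coord_generator_input(z, size=64):
    """Tile z over a size x size grid and append (x, y) coordinates in [-1, 1].

    Returns a nested list indexed [row][col]; each entry is a feature vector
    of length len(z) + 2.
    """
    def scale(k):
        # Map pixel index 0..size-1 linearly onto [-1, 1].
        return 2.0 * k / (size - 1) - 1.0

    return [[list(z) + [scale(col), scale(row)] for col in range(size)]
            for row in range(size)]


# z is drawn uniformly from [-1, 1] with dimension 256, as stated in the text.
z = [random.uniform(-1.0, 1.0) for _ in range(256)]
grid = coord_generator_input(z)
print(len(grid), len(grid[0]), len(grid[0][0]))  # height, width, channels
```

Every pixel sees the same z, so the coordinate channels are the only position-dependent input to the 1×1 convolutions that follow.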
To train each model we use a fixed learning rate of 0.0001 for the discriminator and 0.0005 for the
generator. In each iteration the discriminator is trained once, followed by the generator trained twice. The
random noise vector z is drawn from a uniform distribution over [−1, 1]. We train each model
for 50 epochs and save the model at the end of every epoch. We repeat the training with the same
hyperparameters 5 to 10 times for each, and pick the best model for each to show a fair comparison
in all figures.

Table S4: Model Architectures for GAN and CoordConv GAN used in the colored shape generation.
In the case of CoordConv GAN, only the first layer is changed from regular Conv to CoordConv. FC:
fully connected layer; s2: stride 2.

GAN generator: FC 8192 (reshape 4×4×512) - 5×5, 256 (s2) - 5×5, 128 (s2) - 5×5, 64 (s2) - 5×5, 3 (s2) - Tanh
CoordConv GAN generator: 1×1, 512 - 1×1, 256 - 1×1, 256 - 1×1, 128 - 1×1, 64 - 3×3, 64 - 3×3, 64 - 1×1, 3
Discriminator (both models): 5×5, 64 (s2) - 5×5, 128 (s2) - 5×5, 256 (s2) - 5×5, 512 (s2) - 1
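
The parameter counts quoted in the text can be reproduced from Table S4 with a few lines of arithmetic. The sketch assumes that every convolution has a bias term and that the first CoordConv layer sees 258 input channels (the 256-dimensional tiled z plus two coordinate channels). Neither assumption is stated explicitly in the text, but together they recover the reported numbers.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of parameters in a kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# CoordConv GAN generator from Table S4 as (kernel size, output channels).
layers = [(1, 512), (1, 256), (1, 256), (1, 128), (1, 64), (3, 64), (3, 64), (1, 3)]

c_in = 256 + 2  # assumed input: tiled z plus the x and y coordinate channels
total = 0
for kernel, c_out in layers:
    total += conv_params(kernel, c_in, c_out)
    c_in = c_out
print(total)  # 444931, matching the generator size given in the text

# Two extra coordinate channels feeding the discriminator's first 5x5, 64 layer.
extra = conv_params(5, 2, 64, bias=False)
print(extra)  # 3200 = 4,319,745 - 4,316,545
```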

Latent interpolation. Latent interpolation is conducted by randomly choosing two noise vectors,
each from a uniform distribution, and linearly interpolating between them with an α factor that indicates
how close the result is to the first vector. Figure S7 and Figure S8 each show, on regular GAN and CoordConv
GAN, respectively, five random samples of pairs to conduct interpolation with. In addition to
Figure S8, Figure S9 shows deliberately picked examples that exhibit a special moving effect that has
only been seen in CoordConv GAN.
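
A minimal sketch of the interpolation follows, assuming the convention that α = 1 returns the first vector and α = 0 the second. The number of interpolation steps is a hypothetical choice; the text does not state it.

```python
import random


def interpolate(z1, z2, alpha):
    """Linear interpolation: alpha = 1 gives z1, alpha = 0 gives z2."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z1, z2)]


rng = random.Random(0)
z1 = [rng.uniform(-1.0, 1.0) for _ in range(256)]
z2 = [rng.uniform(-1.0, 1.0) for _ in range(256)]

steps = 8  # hypothetical number of frames between the two endpoints
path = [interpolate(z1, z2, 1.0 - k / (steps - 1)) for k in range(steps)]
```

Each vector in `path` would be fed to the generator to render one frame of the interpolation.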

Figure S7: Regular GAN samples with a series of interpolation between 5 random pairs of z. Position
and shape transitions are also observed, but they differ from those of the CoordConv GAN.

Figure S8: CoordConv GAN samples with a series of interpolation between 5 random pairs of z. Top
row: at the position in the manifold, the model has learned a smooth circular motion. The rest of
the rows: the circular relation between two objects is still observed, while some object shapes also
undergo a smooth transition.

Figure S9: A special effect only observed in CoordConv GAN latent space interpolation: two objects
stay constant in relative positions to each other but move together in space. They even move out of
the scene, which is never present in the real data: the model has learned to extrapolate. These 3 examples are
picked from many random drawings of z pairs, as opposed to Figure S8 and Figure S7, where the first 5
random drawings are shown.

Measure of entropy. In Figure 7, we reduce generated red and blue objects to their centers and plot
the coverage of space in column (b) and relative locations in (c). To make the comparison quantitative,
we can further calculate the entropy in each case, reducing each figure in (b) and (c) to an entropy
value shown as a bar in Figure S10. Confidence intervals for each bar are also shown, obtained by repeating the
experiment 10 times. We can see that CoordConv (red) is closer to data (green) in objects’ coverage
of space, but has more of a mode collapse in objects’ relative position.
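
One way to turn a set of object centers into a single entropy value is to bin them on a grid and take the Shannon entropy of the resulting histogram. The text does not say which estimator was used, so the sketch below is only an assumption: `histogram_entropy`, the bin count, and the use of natural logarithms are all hypothetical choices.

```python
from collections import Counter
from math import log


def histogram_entropy(points, size=64, bins=8):
    """Shannon entropy (in nats) of 2-D points binned on a bins x bins grid."""
    cell = size / bins
    counts = Counter(
        (min(int(x // cell), bins - 1), min(int(y // cell), bins - 1))
        for x, y in points
    )
    n = sum(counts.values())
    return -sum(c / n * log(c / n) for c in counts.values())


# Centers collapsed onto one spot give zero entropy; spread-out centers give more.
collapsed = [(10, 10)] * 20
spread = [(1, 1), (20, 1), (1, 20), (40, 40)]
print(histogram_entropy(collapsed), histogram_entropy(spread))
```

Under this estimator, a generator with mode collapse in object positions yields a lower value than the data distribution.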

S7.2 VAEs on colored shapes

We train both convolutional and CoordConv VAEs on the same dataset of 50k 64 x 64 images of
blue and red non-overlapping squares and circles, as described in Section S7.1. Convolutional VAEs
exhibit many of the same problems that we observed in GANs, and adding CoordConv confers many
of the same benefits.
A VAE is composed of an encoder that maps data to a latent and a decoder that maps the latent back to
data. With minor exceptions our VAE’s encoder architecture is the same as our GAN’s discriminator
and its decoder is the same as our GAN’s generator. The important difference is of course that the
output shape of the encoder is the size of the latent (in this case 8), not two as in a discriminator.

[Figure S10 plot: bar chart of entropy (vertical axis, 0 to 4) for conv, coordconv, and data, grouped by red center, blue center, and red-blue difference.]

Figure S10: Entropy values and confidence intervals of the sampled results in Figure 7, column (b)
and (c).

Architectures are shown in Table S5. The decoder architectures of the convolutional control and
CoordConv experiments are similar aside from kernel size: the CoordConv decoder uses 1x1 kernels
while the convolutional decoder uses 5x5 kernels.
Due to the pixel sparsity of images in the dataset, we found it important to weight reconstruction
loss more heavily than latent loss by a factor of 50. Doing so didn’t interfere with the quality of the
encoding. We used Adam with a learning rate of 0.005 and no weight decay.
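
The loss weighting can be sketched as below. Only the factor of 50 comes from the text. The specific terms are assumptions: a pixelwise Bernoulli cross entropy for the reconstruction loss (chosen because the decoders in Table S5 end in a sigmoid) and the usual KL divergence to a standard normal prior for the latent loss.

```python
from math import exp, log

RECON_WEIGHT = 50.0  # reconstruction-to-latent weighting stated in the text


def bernoulli_recon_loss(pixels, probs, eps=1e-7):
    """Summed pixelwise cross entropy between targets in [0, 1] and sigmoid outputs."""
    return -sum(t * log(p + eps) + (1.0 - t) * log(1.0 - p + eps)
                for t, p in zip(pixels, probs))


def gaussian_kl(mu, logvar):
    """KL divergence from N(mu, exp(logvar)) to the standard normal prior."""
    return -0.5 * sum(1.0 + lv - m * m - exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(pixels, probs, mu, logvar):
    """Weighted sum: 50 x reconstruction loss + latent (KL) loss."""
    return RECON_WEIGHT * bernoulli_recon_loss(pixels, probs) + gaussian_kl(mu, logvar)


# Tiny hypothetical example: two pixels and a 2-dimensional latent.
print(vae_loss([1.0, 0.0], [0.9, 0.2], mu=[0.1, -0.3], logvar=[0.0, -0.5]))
```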

Table S5: Model Architectures: Convolutional VAE and CoordConv VAE

VAE decoder: FC 8192 (reshape 4×4×512) - 5×5, 256 (s2) - 5×5, 128 (s2) - 5×5, 64 (s2) - 5×5, 3 (s2) - Sigmoid
CoordConv VAE decoder: 1×1, 512 - 1×1, 256 - 1×1, 256 - 1×1, 128 - 1×1, 64 - 1×1, 3 - Sigmoid
Encoder (both models): 5×5, 64 (s2) - 5×5, 128 (s2) - 5×5, 256 (s2) - 5×5, 512 (s2) - Flatten - FC, 10

Figure S11: Latent space interpolations from a VAE without CoordConv. The red and blue shapes
are mostly stationary. When they do move they do so by disappearing and appearing elsewhere in
pixel space. Smooth changes in the latent don’t translate to smooth geometric changes in pixel space.
The latents we interpolated between were sampled randomly from a uniform distribution.

Figure S12: Latent space interpolations from a VAE with CoordConv in the encoder and decoder.
The red and blue shapes span pixel space more fully and smooth changes in latent space map to
smooth changes in pixel space. Like the CoordConv GAN, the CoordConv VAE is able to extrapolate
beyond the borders of the frame it was trained on. The latents we interpolated between were sampled
randomly from a uniform distribution.

S7.3 GANs on LSUN

The dataset used to train the generative models is LSUN bedroom, composed of 3,033,042 images of
size 64 × 64.

The architectures adopted (see Table S6) are similar to the ones adopted for generating the colored
shape results in Section S7.1, with a few notable differences:

• We use CoordConv layers instead of regular Conv layers not only in the first layer of the
discriminator, but in each layer. z is of dimension 100.

• The GAN generator includes a layer mapping from z to a 4x4x1024 tensor and the other
layers have double the number of channels.

• CoordConv GAN generator has more channels for each layer.

Table S6: Model Architectures for GAN and CoordConv GAN for LSUN. FC: fully connected layer;
s2: stride 2.

GAN generator: FC 16384 (reshape 4×4×1024) - 5×5, 512 (s2) - 5×5, 256 (s2) - 5×5, 128 (s2) - 5×5, 3 (s2) - Tanh
CoordConv GAN generator: 1×1, 1024 - 1×1, 512 - 1×1, 256 - 1×1, 256 - 1×1, 128 - 3×3, 128 - 3×3, 64 - 1×1, 3
Discriminator (both models): 5×5, 64 (s2) - 5×5, 128 (s2) - 5×5, 256 (s2) - 5×5, 512 (s2) - 1

Figure S13: Samples from the regular GAN (left) and the CoordConv GAN (right).

Samples from both models are provided in Figure S13. One peculiar property of the CoordConv
GAN model compared with the regular GAN is its geometric interpolation. As shown in
Figure S14, in regular GAN interpolations objects appear and disappear, while in the CoordConv
GAN interpolations in Figure S15 objects move around, translating, enlarging, squashing and otherwise
undergoing geometric transformations.

Figure S14: Samples of regular GAN trained on LSUN with a series of interpolation between 5
random pairs of z.

Figure S15: Samples of CoordConv GAN trained on LSUN with a series of interpolation between 5
random pairs of z.

The regular GAN has been trained for 11000 steps of batch size 128, while the CoordConv GAN has
been trained for 22000 steps of batch size 64 (because the available memory on the GPUs did not allow
for 128). Both models have been trained using Horovod to distribute the training on 16 GPUs.

S8 Further reinforcement learning details
We used the OpenAI baselines implementation (see footnote 5) and default parameters in all experiments. Table S7
shows the average scores obtained at the end of the game over 10 runs of each.

Table S7: All games with final scores and p-values.

Game Conv CoordConv p-value


Alien 1462.5 2005.0 0.0821
Bank Heist 932.5 1330.0 0.1736
Ms. Pacman 2557.5 3945.0 0.0065
Robotank 2.75 3.5 0.2899
Centipede 3359.5 3424.5 0.8703
Asterix 16250.0 35300.0 0.0003
Asteroids 2082.5 1912.5 0.1124
Amidar 1092.75 1137.5 0.2265
Seaquest 1780.0 1780.0 0.4057

S9 The CoordConv layer implementation

from tensorflow.python.layers import base
import tensorflow as tf


class AddCoords(base.Layer):
    """Add coords to a tensor"""

    def __init__(self, x_dim=64, y_dim=64, with_r=False):
        super(AddCoords, self).__init__()
        self.x_dim = x_dim
        self.y_dim = y_dim
        self.with_r = with_r

    def call(self, input_tensor):
        """
        input_tensor: (batch, x_dim, y_dim, c)
        """
        batch_size_tensor = tf.shape(input_tensor)[0]

        # xx_channel[b, i, j] = j: outer product of a column of ones and a range.
        xx_ones = tf.ones([batch_size_tensor, self.x_dim],
                          dtype=tf.int32)
        xx_ones = tf.expand_dims(xx_ones, -1)
        xx_range = tf.tile(tf.expand_dims(tf.range(self.y_dim), 0),
                           [batch_size_tensor, 1])
        xx_range = tf.expand_dims(xx_range, 1)
        xx_channel = tf.matmul(xx_ones, xx_range)
        xx_channel = tf.expand_dims(xx_channel, -1)

        # yy_channel[b, i, j] = i: outer product of a range and a row of ones.
        yy_ones = tf.ones([batch_size_tensor, self.y_dim],
                          dtype=tf.int32)
        yy_ones = tf.expand_dims(yy_ones, 1)
        yy_range = tf.tile(tf.expand_dims(tf.range(self.x_dim), 0),
                           [batch_size_tensor, 1])
        yy_range = tf.expand_dims(yy_range, -1)
        yy_channel = tf.matmul(yy_range, yy_ones)
        yy_channel = tf.expand_dims(yy_channel, -1)

        # Rescale both channels to [-1, 1].
        # Note: xx_channel counts up to y_dim - 1 and yy_channel up to x_dim - 1,
        # so the divisors below give exactly [-1, 1] only when x_dim == y_dim.
        xx_channel = tf.cast(xx_channel, 'float32') / (self.x_dim - 1)
        yy_channel = tf.cast(yy_channel, 'float32') / (self.y_dim - 1)
        xx_channel = xx_channel*2 - 1
        yy_channel = yy_channel*2 - 1

        ret = tf.concat([input_tensor,
                         xx_channel,
                         yy_channel], axis=-1)

        if self.with_r:
            # Optional third channel: distance of each pixel from the center.
            rr = tf.sqrt(tf.square(xx_channel)
                         + tf.square(yy_channel))
            ret = tf.concat([ret, rr], axis=-1)

        return ret


class CoordConv(base.Layer):
    """CoordConv layer as in the paper."""

    def __init__(self, x_dim, y_dim, with_r, *args, **kwargs):
        super(CoordConv, self).__init__()
        self.addcoords = AddCoords(x_dim=x_dim,
                                   y_dim=y_dim,
                                   with_r=with_r)
        # Remaining arguments are passed through to the regular convolution.
        self.conv = tf.layers.Conv2D(*args, **kwargs)

    def call(self, input_tensor):
        ret = self.addcoords(input_tensor)
        ret = self.conv(ret)
        return ret

5 https://github.com/openai/baselines/

ICML 2018 AutoML Workshop

Backprop Evolution

Maximilian Alber ∗ † maximilian.alber@tu-berlin.de


TU Berlin
Irwan Bello ∗ ibello@google.com
Barret Zoph barretzoph@google.com
Pieter-Jan Kindermans ‡ pikinder@google.com
Prajit Ramachandran ‡ prajit@google.com
arXiv:1808.02822v1 [cs.NE] 8 Aug 2018

Quoc Le qvl@google.com
Google Brain

Abstract
The back-propagation algorithm is the cornerstone of deep learning. Despite its importance,
few variations of the algorithm have been attempted. This work presents an approach to
discover new variations of the back-propagation equation. We use a domain specific language
to describe update equations as a list of primitive functions. An evolution-based
method is used to discover new propagation rules that maximize the generalization performance
after a few epochs of training. We find several update equations that can train
faster with short training times than standard back-propagation, and perform similarly to
standard back-propagation at convergence.
Keywords: Back-propagation, neural networks, automl, meta-learning.

1. Introduction
The back-propagation algorithm is one of the most important algorithms in machine learning
(Linnainmaa (1970); Werbos (1974); Rumelhart et al. (1986)). A few attempts have been
made to change the back-propagation equation with some degrees of success (e.g., Bengio
et al. (1994); Lillicrap et al. (2014); Lee et al. (2015); Nøkland (2016); Liao et al. (2016)).
Despite these attempts, modifications of back-propagation equations have not been widely
used as these algorithms rarely improve practical applications, and sometimes hurt them.
Inspired by the recent successes of automated search methods for machine learning (Zoph
and Le, 2017; Zoph et al., 2018; Bello et al., 2017; Brock et al., 2017; Real et al., 2018; Bender
et al., 2018a), we propose a method for automatically generating back-propagation equa-
tions. To that end, we introduce a domain specific language to describe such mathematical
formulas as a list of primitive functions and use an evolution-based method to discover
new propagation rules. The search is conditioned to maximize the generalization after a
few epochs of training. We find several variations of the equation that can work as well
as the standard back-propagation equation. Furthermore, several variations can achieve
improved accuracies with short training times. This can be used to improve algorithms
like Hyperband (Li et al., 2017) which make accuracy-based decisions over the course of
training.

∗ Contributed equally.
† Work was done as an intern at Google Brain.
‡ Work was done as a member of the Google AI Residency program (g.co/airesidency).

© 2018 M. Alber, I. Bello, B. Zoph, P.-J. Kindermans, P. Ramachandran & Q. Le.

2. Backward propagation

Figure 1: Neural networks can be seen as computational graphs. The forward graph is
defined by the network designer, while the back-propagation algorithm implicitly defines a
computational graph for the parameter updates. Our main contribution is exploring the use
of evolution to find a computational graph for the parameter updates that is more effective
than standard back-propagation.

The simplest neural network can be defined as a sequence of matrix multiplications and
nonlinearities:

$$h_i^p = W_i^T h_{i-1}, \qquad h_i = \sigma(h_i^p),$$

where $h_0 = x$ is the input to the network, $i$ indexes the layer and $W_i$ is the weight matrix
of the $i$-th layer. To optimize the neural network, we compute the partial derivatives of
the loss $J(f(x), y)$ w.r.t. the weight matrices, $\frac{\partial J(f(x), y)}{\partial W_i}$. This quantity can be computed by
making use of the chain rule in the back-propagation algorithm. To compute the partial
derivative with respect to the hidden activations, $b_i^p = \frac{\partial J(f(x), y)}{\partial h_i^p}$, a sequence of operations is
applied to the derivatives:

$$b_L = \frac{\partial J(f(x), y)}{\partial h_L}, \qquad b_{i+1}^p = b_{i+1} \frac{\partial h_{i+1}}{\partial h_{i+1}^p}, \qquad b_i = b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i}. \tag{1}$$

Once $b_i^p$ is computed, the weights update can be computed as $\Delta W_i = b_i^p \frac{\partial h_i^p}{\partial W_i}$.
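
To make Equation 1 concrete, the sketch below applies it to a chain of scalar layers, where every weight matrix reduces to a single number and σ is the logistic sigmoid. The squared-error loss, the scalar setting, and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(weights, x):
    """Return pre-activations hp and activations h, with h[0] = x."""
    h, hp = [x], []
    for w in weights:
        hp.append(w * h[-1])          # h^p_i = W_i h_{i-1}
        h.append(sigmoid(hp[-1]))     # h_i = sigma(h^p_i)
    return hp, h


def backward(weights, hp, h, y):
    """Apply Equation 1 for the assumed loss J = 0.5 * (h_L - y)^2."""
    b = h[-1] - y                     # b_L = dJ/dh_L
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        s = sigmoid(hp[i])
        bp = b * s * (1.0 - s)        # b^p = b * dh/dh^p
        grads[i] = bp * h[i]          # Delta W = b^p * dh^p/dW
        b = bp * weights[i]           # b for the layer below = b^p * dh^p/dh
    return grads


weights = [0.5, -1.2, 0.8]
hp, h = forward(weights, x=0.7)
grads = backward(weights, hp, h, y=0.3)
print(grads)
```

A search over update rules, as proposed in this paper, replaces the line that computes `bp` with a different formula while leaving the forward pass untouched.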
As shown in Figure 1, the neural network can be represented as a forward and back-
ward computational graph. Given a forward computational graph defined by a network
designer, the back-propagation algorithm defines a backward computational graph that is
used to update the parameters. However, it may be possible to find an improved backward
computational graph that results in better generalization.
Recently, automated search methods for machine learning have achieved strong results
on a variety of tasks (e.g., Zoph and Le (2017); Baker et al. (2017); Real et al. (2017); Bello
et al. (2017); Brock et al. (2017); Ramachandran et al. (2018); Zoph et al. (2018); Bender
et al. (2018b); Real et al. (2018)). With the exception of Bello et al. (2017), these methods
involve modifying the forward computational graph, relying on back-propagation to define
the appropriate backward graph. In this work, we instead concern ourselves with modifying
the backward computational graph and use a search method to find better formulas for $b_i^p$,
yielding new training rules.

3. Method
In order to discover improved update rules, we employ an evolution algorithm to search over
the space of possible update equations. At each iteration, an evolution controller sends a
batch of mutated update equations to a pool of workers for evaluation. Each worker trains
a fixed neural network architecture using its received mutated equation and reports the
obtained validation accuracy back to our controller.

3.1. Search Space


We use a domain-specific language (DSL) inspired by Bello et al. (2017) to describe the
equations used to compute $b_i^p$. The DSL expresses each $b_i^p$ equation as $f(u_1(op_1), u_2(op_2))$
where $op_1$, $op_2$ are possible operands, $u_1(\cdot)$ and $u_2(\cdot)$ are unary functions, and $f(\cdot, \cdot)$ is a
binary function. The sets of unary functions and binary functions are manually specified
but individual choices of functions and operands are selected by the controller. Examples
of each component are as follows:

• Operands: $W_i$ (weight matrix of the current layer), $R_i$ (Gaussian matrix), $R_{Li}$
(Gaussian random matrix mapping from $b_L^p$ to $b_i^p$), $h_i^p$, $h_i$, $h_{i+1}^p$ (hidden activations of
the forward propagation), $b_L^p$, $b_{i+1}^p$ (backward propagated values).

• Unary functions $u(x)$: $x$ (identity), $x^t$ (transpose), $1/x$, $x^2$, $\mathrm{sgn}(x)x^2$, $x^3$, $ax$, $x + b$,
$\mathrm{drop}_d(x)$ (dropout with drop probability $d \in \{0.01, 0.1, 0.3\}$), $\mathrm{clip}_c(x)$ (clip values
to the range $[-c, c]$ with $c \in \{0.01, 0.1, 0.5, 1.0\}$), $x/\|\cdot\|_{fro}$, $x/\|\cdot\|_1$, $x/\|\cdot\|_{-\mathrm{inf}}$, $x/\|\cdot\|_{\mathrm{inf}}$
(normalizing term by matrix norm).

• Binary functions $f(x, y)$: $x + y$, $x - y$, $x \odot y$, $x/y$ (element-wise addition, subtraction,
multiplication, division), $x \cdot y$ (matrix multiplication) and $x$ (keep left), $\min(x, y)$,
$\max(x, y)$ (minimum and maximum of $x$ and $y$).

where $i$ indexes the current layer. The full set of components used in our experiments
is specified in Appendix A. The resulting quantity $f(u_1(op_1), u_2(op_2))$ is then either used
as $b_i^p$ in Equation 1 or used recursively as $op_1$ in subsequent parts of the equation. In
our experiments, we explore equations composed of between 1 and 3 binary operations.
This DSL is simple but can represent complex equations such as normal back-propagation,
feedback alignment (Lillicrap et al., 2014), and direct feedback alignment (Nøkland, 2016).
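
The structure $f(u_1(op_1), u_2(op_2))$ can be sketched as a small evaluator. This is an illustrative reduction, not the authors' DSL: it works on flat vectors instead of matrices, includes only a handful of the primitives listed above, and uses hypothetical names (`UNARY`, `BINARY`, `evaluate`) and hypothetical operand values.

```python
from math import sqrt

# A few unary and binary primitives from the lists above, applied to flat vectors.
UNARY = {
    "identity": lambda x: list(x),
    "square": lambda x: [v * v for v in x],
    "clip_1.0": lambda x: [max(-1.0, min(1.0, v)) for v in x],
    "norm_fro": lambda x: [v / sqrt(sum(u * u for u in x)) for v in x],
}
BINARY = {
    "add": lambda x, y: [a + b for a, b in zip(x, y)],
    "mul": lambda x, y: [a * b for a, b in zip(x, y)],
    "min": lambda x, y: [min(a, b) for a, b in zip(x, y)],
    "keep_left": lambda x, y: list(x),
}


def evaluate(equation, operands):
    """Evaluate f(u1(op1), u2(op2)); an operand may itself be a nested equation."""
    f, (u1, op1), (u2, op2) = equation

    def resolve(op):
        return evaluate(op, operands) if isinstance(op, tuple) else operands[op]

    return BINARY[f](UNARY[u1](resolve(op1)), UNARY[u2](resolve(op2)))


operands = {"g": [3.0, -4.0], "h": [0.5, 2.0]}  # hypothetical operand values
equation = ("min", ("norm_fro", "g"), ("clip_1.0", "h"))
print(evaluate(equation, operands))
```

Equations with two or three binary operations are expressed by placing a nested tuple where an operand name would otherwise go.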


3.2. Evolution algorithm


The evolutionary controller maintains a population of discovered equations. At each iteration,
the controller does one of the following: 1) with probability p, the controller chooses
an equation uniformly at random among the N most competitive equations found so far
during the search; 2) with probability 1 − p, it chooses an equation uniformly at random
from the rest of the population. The controller subsequently applies k mutations to the
selected equation, where k is drawn from a categorical distribution. Each of these k mutations
consists of selecting one of the equation components (e.g., an operand, a
unary function, or a binary function) uniformly at random and swapping it with another component
of the same category chosen uniformly at random. Certain mutations lead to mathematically
infeasible equations (e.g., a shape mismatch when applying a binary operation to two
matrices). When this is the case, the controller restarts the mutation process until it succeeds.
N, p and the categorical distribution from which k is drawn are hyperparameters of
the algorithm.
To create an initial population, we simply sample N equations at random from the
search space. Additionally, in some of our experiments, we instead start with a small
population of pre-defined equations (typically the normal back-propagation equation or its
feedback alignment variants). The ability to start from existing equations is an advantage
of evolution over reinforcement learning based approaches (Zoph and Le (2017); Zoph et al.
(2018); Bello et al. (2017); Ramachandran et al. (2018)).
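
The selection and mutation steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: equations have a single binary operation, the component lists are a heavily reduced hypothetical search space, and the feasibility check is a placeholder that accepts everything. None of the names come from the paper.

```python
import random

COMPONENTS = {  # hypothetical, heavily reduced search space
    "operand": ["g", "h", "W"],
    "unary": ["identity", "square", "clip_1.0", "norm_fro"],
    "binary": ["add", "mul", "min", "keep_left"],
}
SLOTS = {"f": "binary", "u1": "unary", "u2": "unary", "op1": "operand", "op2": "operand"}


def select_parent(population, n_top, p, rng):
    """population: list of (equation, fitness) pairs.

    With probability p pick uniformly among the n_top fittest equations,
    otherwise pick uniformly from the rest of the population.
    """
    ranked = sorted(population, key=lambda item: item[1], reverse=True)
    top, rest = ranked[:n_top], ranked[n_top:]
    pool = top if (rng.random() < p or not rest) else rest
    return rng.choice(pool)[0]


def mutate(equation, k_weights, rng, is_feasible=lambda eq: True):
    """Apply k component swaps, with k drawn from a categorical distribution
    over 1..len(k_weights); restart until the result is feasible."""
    while True:
        child = dict(equation)
        k = rng.choices(range(1, len(k_weights) + 1), weights=k_weights)[0]
        for _ in range(k):
            slot = rng.choice(list(SLOTS))
            others = [c for c in COMPONENTS[SLOTS[slot]] if c != child[slot]]
            child[slot] = rng.choice(others)
        if is_feasible(child):
            return child


rng = random.Random(0)
population = [
    ({"f": "add", "u1": "identity", "u2": "square", "op1": "g", "op2": "h"}, 0.83),
    ({"f": "min", "u1": "norm_fro", "u2": "clip_1.0", "op1": "g", "op2": "h"}, 0.84),
]
parent = select_parent(population, n_top=1, p=0.7, rng=rng)
child = mutate(parent, k_weights=[0.6, 0.3, 0.1], rng=rng)
print(parent, child)
```

In the full method, each child would be sent to a worker, trained, and added to the population together with its validation accuracy.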

4. Experiments
The choice of model used to evaluate each proposed update equation is an important setting
in our method. Larger, deeper networks are more realistic but take longer to train, whereas
small models are faster to train but may lead to discovering update equations that do not
generalize. We strike a balance between the two criteria by using Wide ResNets (WRN)
(Zagoruyko and Komodakis, 2016) with 16 layers and a width multiplier of 2 (WRN 16-2)
to train on the CIFAR-10 dataset.
We experiment with different search setups, such as varying the subsets of the available
operands and operations and using different optimizers. The evolution controller searches
for the update equations that obtain the best accuracy on the validation set. We then
collect the 100 best discovered equations, which were determined by the mean accuracy of
5 reruns. Finally, the best equations are used to train models on the full training set, and
we report the test accuracy. The experiment setup is further detailed in Appendix B.

4.1. Baseline search and generalization


In the first search we conduct, the controller proposes update equations to train WRN 16-2
networks for 20 epochs with SGD with or without momentum. The top 100 update equations
according to validation accuracy are collected and then tested on different scenarios:

(A1) WRN 16-2 for 20 epochs, replicating the search settings.

(A2) WRN 28-10 for 20 epochs, testing generalization to larger models (WRN 28-10 has
10 times more parameters than WRN 16-2).

(A3) WRN 16-2 for 100 epochs, testing generalization to longer training regimes.

(A1) Search validation
  SGD
    baseline $g_i^p$    77.11 ± 3.53
    $\min(g_i^p/\|\cdot\|_{fro},\ \mathrm{clip}_{1.0}(h_i))$    84.48 ± 0.45
    $(g_i^p/\|\cdot\|_2^{row}) + \hat{m}(\frac{\partial h_i}{\partial h_i^p})/\|\cdot\|_0^{elem}$    84.41 ± 1.37
    $g_i^p/\|\cdot\|_{fro}$    84.15 ± 0.66
    $g_i^p/\|\cdot\|_2^{elem}$    83.16 ± 0.90
  SGD and Momentum
    baseline $g_i^p$    83.00 ± 0.63
    $(g_i^p/\|\cdot\|_2^{elem})/(2 + \hat{m}((b_L^p \cdot R_{Li}) \odot \frac{\partial h_i}{\partial h_i^p}))$    85.43 ± 1.59
    $(g_i^p/\|\cdot\|_{fro}) + 0.5 g_i^p$    85.36 ± 1.41
    $g_i^p/\|\cdot\|_{fro}$    84.23 ± 0.88
    $g_i^p/\|\cdot\|_2^{elem}$    83.83 ± 1.27

(A2) Generalize to WRN 28-10
  SGD
    baseline $g_i^p$    73.10 ± 1.41
    $(g_i^p/\|\cdot\|_{inf}^{elem}) \odot (\hat{s}(\frac{\partial h_i}{\partial h_i^p})/\|\cdot\|_1)$    88.22 ± 0.55
    $\mathrm{clip}_{0.01}(0.01 + h_i^p - (h_i^p)_+) \odot (g_i^p/\|\cdot\|_{inf}^{elem})$    87.28 ± 0.29
    $g_i^p/\|\cdot\|_{fro}$    87.17 ± 0.87
    $g_i^p/\|\cdot\|_2^{elem}$    85.30 ± 1.04
  SGD and Momentum
    baseline $g_i^p$    79.53 ± 2.89
    $\max((b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i} - 10.0),\ g_i^p/\|\cdot\|_2^{elem})$    89.43 ± 0.99
    $(b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i} - 0.01) + (g_i^p/\|\cdot\|_2^{elem})$    89.26 ± 0.67
    $g_i^p/\|\cdot\|_{fro}$    89.63 ± 0.32
    $g_i^p/\|\cdot\|_2^{elem}$    89.05 ± 0.88

(A3) Generalize to longer training
  SGD
    baseline $g_i^p$    92.38 ± 0.10
    $(g_i^p/\|\cdot\|_2^{elem}) \odot \mathrm{sgn}(\mathrm{bn}(\frac{\partial h_i}{\partial h_i^p}))$    92.97 ± 0.18
    $(g_i^p/\|\cdot\|_2^{elem}) - (\hat{m}(b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i})/\|\cdot\|_{inf})$    92.90 ± 0.13
    $g_i^p/\|\cdot\|_{fro}$    92.85 ± 0.14
    $g_i^p/\|\cdot\|_2^{elem}$    92.78 ± 0.13
  SGD and Momentum
    baseline $g_i^p$    93.75 ± 0.15
    $\mathrm{drop}_{0.01}(g_i^p) - (\mathrm{bn}(b_L^p \cdot R_{Li})/\|\cdot\|_0^{elem})$    93.72 ± 0.20
    $(1 + \mathrm{gnoise}_{0.01}) g_i^p + (g_i^p$ 1b $/\|\cdot\|_1^{elem})$    93.66 ± 0.12
    $g_i^p/\|\cdot\|_{fro}$    93.41 ± 0.18
    $g_i^p/\|\cdot\|_2^{elem}$    93.35 ± 0.15

(B1) Searching with longer training
  SGD
    baseline $g_i^p$    87.13 ± 0.25
    $(g_i^p/\|\cdot\|_1) + (b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i}/\|\cdot\|_2^{elem})$    87.94 ± 0.22
    $(g_i^p/\|\cdot\|_1) \odot \mathrm{clip}_{0.1}(\frac{\partial h_i}{\partial h_i^p})$    87.88 ± 0.39
    $(g_i^p/\|\cdot\|_2^{elem}) \odot \mathrm{sgn}(\mathrm{bn}(\frac{\partial h_i}{\partial h_i^p}))$    87.82 ± 0.19
    $(0.5 g_i^p)/(\hat{s}(\frac{\partial h_i}{\partial h_i^p}) + 0.1)$    87.72 ± 0.25
  SGD and Momentum
    baseline $g_i^p$    88.94 ± 0.11
    $2 g_i^p + (\mathrm{bn}(b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i})/\|\cdot\|_1^{elem})$    89.04 ± 0.25
    $(\frac{\partial h_i}{\partial h_i^p} - 0.5) \odot 2.0 g_i^p$    88.95 ± 0.16
    $(\hat{s}(b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i})/\|\cdot\|_{inf}^{elem}) * 2.0 g_i^p$    88.94 ± 0.20
    $(\frac{\partial h_i}{\partial h_i^p})_+ \odot \mathrm{clip}_{1.0}(b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i})$    88.93 ± 0.14

Table 1: Results of the experiments. For A1-3 we show the two best performing equations
on each setup and two equations that consistently perform well across all setups. For B1
we show the four best performing equations. All results are the average test accuracy over
5 repetitions. Baseline is gradient back-propagation. Numbers that are at least 0.1% better
are in bold. A description of the operands and operations can be found in Appendix A. We
denote $b_{i+1}^p \frac{\partial h_{i+1}^p}{\partial h_i^p}$ with $g_i^p$.

The results are listed in Table 1. When evaluated on the search setting (A1), it is clear
that the search finds update equations that outperform back-propagation for both SGD and
SGD with momentum, demonstrating that the evolutionary controller can successfully find
update equations that achieve better accuracies. The update equations also successfully
generalize to the larger WRN 28-10 model (A2), outperforming the baseline by up to 15%
for SGD and 10% for SGD with momentum. This result suggests that the usage of smaller
models during searches is appropriate because the update equations still generalize to larger
models.
However, when the model is trained for longer (A3), standard back-propagation and
the discovered update equations perform similarly. The discovered update equations can
then be seen as equations that speed up training initially but do not affect final model
performance, which can be practical in settings where decisions are made during the early
stages of training to continue the experiment, such as some hyperparameter search schemes
(Li et al., 2017).

4.2. Searching for longer training times


The previous search experiment finds update equations that work well at the beginning
of training but do not outperform back-propagation at convergence. The latter result is
potentially due to the mismatch between the search and the testing regimes, since the
search used 20 epochs to train child models whereas the test regime uses 100 epochs.
A natural followup is to match the two regimes. In the second search experiment, we
train each child model for 100 epochs. To compensate for the increase in experiment time
due to training for more epochs, a smaller network (WRN 10-1) is used as the child model.
The use of smaller models is acceptable since update equations tend to generalize to larger,
more realistic models (see (A2)).
The results are shown in Table 1 section (B1) and are similar to (A3), i.e., we are able
to find update rules that perform moderately better for SGD, but the results for SGD
with momentum are comparable to the baseline. The similarity between the results of (A3)
and (B1) suggests that the training time discrepancy may not be the main source of error.
Furthermore, SGD with momentum is fairly insensitive to the choice of update equation.
Future work can analyze why adding momentum increases robustness.

5. Conclusion and future work


In this work, we presented a method to automatically find equations that can replace
standard back-propagation. We use an evolutionary controller that operates in a space of
equation components and tries to maximize the generalization of trained networks. The
results of our exploratory study show that for specific scenarios, there are equations that
yield better generalization performance than this baseline, but more work is required to
find an equation that performs better in general scenarios. It is left to future work to distill
patterns from equations found by the search, and research under which conditions and why
they yield better performance.
Acknowledgements: We would like to thank Gabriel Bender for his technical advice
throughout this work and Simon Kornblith for his valuable feedback on the manuscript.

References
Martı́n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A
system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network
architectures using reinforcement learning. In International Conference on Learning Rep-
resentations, 2017.

Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with
reinforcement learning. In International Conference on Machine Learning, 2017.

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Un-
derstanding and simplifying one-shot architecture search. In Jennifer Dy and Andreas
Krause, editors, Proceedings of the 35th International Conference on Machine Learn-
ing, volume 80 of Proceedings of Machine Learning Research, pages 549–558, Stock-
holmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR. URL
http://proceedings.mlr.press/v80/bender18a.html.

Gabriel Bender, Pieter-jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc V Le. De-
mystifying one-shot architecture search. In International Conference on Machine Learn-
ing, 2018b.

Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier. Use of genetic programming for the
search of a new learning rule for neural networks. In Evolutionary Computation, 1994.
IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE
Conference on, pages 324–327. IEEE, 1994.

Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. SMASH: one-shot model
architecture search through hypernetworks. In International Conference on Learning
Representations, 2017.

François Chollet et al. Keras, 2015.

Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target prop-
agation. In Joint european conference on machine learning and knowledge discovery in
databases, pages 498–515. Springer, 2015.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hy-
perband: A novel bandit-based approach to hyperparameter optimization. International
Conference on Learning Representations, 2017.

Qianli Liao, Joel Z Leibo, and Tomaso A Poggio. How important is weight symmetry in
backpropagation? In AAAI, 2016.

Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Ran-
dom feedback weights support learning in deep neural networks. arXiv preprint
arXiv:1411.0247, 2014.

Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as
a Taylor expansion of the local rounding errors. Master’s thesis, 1970.

Ilya Loshchilov and Frank Hutter. Sgdr: stochastic gradient descent with restarts. arXiv
preprint arXiv:1608.03983, 2016.

Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In
Advances in Neural Information Processing Systems, pages 1037–1045, 2016.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In
International Conference on Learning Representations (Workshop), 2018.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie
Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In Interna-
tional Conference on Machine Learning, 2017.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for
image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations
by back-propagating errors. Nature, 323(6088):533, 1986.

Paul Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, 1974.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.

Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In
International Conference on Learning Representations, 2017.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable
architectures for scalable image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018.

Appendix A. Search space


The operands and operations to populate our search space are the following (i indexes the
current layer):

• Operands:

– $W_i$, $\mathrm{sgn}(W_i)$ (weight matrix of the current layer and sign of it),
– $R_i$, $S_i$ (Gaussian and Bernoulli random matrices, same shape as $W_i$),
– $R_{Li}$ (Gaussian random matrix mapping from $b^p_L$ to $b^p_i$),
– $h^p_i$, $h_i$, $h^p_{i+1}$ (hidden activations of the forward propagation),
– $b^p_L$, $b^p_{i+1}$ (backward propagated values),
– $b^p_{i+1} \frac{\partial h^p_{i+1}}{\partial h_i}$, $b^p_{i+1} \frac{\partial h^p_{i+1}}{\partial h^p_i}$ (backward propagated values according to gradient backward propagation),
– $b^p_{i+1} \cdot R_i$, $(b^p_{i+1} \cdot R_i) \frac{\partial h_i}{\partial h^p_i}$ (backward propagated values according to feedback alignment),
– $b^p_L \cdot R_{Li}$, $(b^p_L \cdot R_{Li}) \frac{\partial h_i}{\partial h^p_i}$ (backward propagated values according to direct feedback alignment).

• Unary functions $u(x)$:

– $x$ (identity), $x^t$ (transpose), $1/x$, $|x|$, $-x$ (negation), $\mathbb{1}_{x>0}$ (1 if $x$ greater than 0), $x^+$ (ReLU), $\mathrm{sgn}(x)$ (sign), $\sqrt{|x|}$, $\mathrm{sgn}(x)\sqrt{|x|}$, $x^2$, $\mathrm{sgn}(x)x^2$, $x^3$, $ax$, $x + b$,
– $x + \mathrm{gnoise}(g)$, $x \odot (1 + \mathrm{gnoise}(g))$ (add or multiply with Gaussian noise of scale $g \in (0.01, 0.1, 0.5, 1.0)$),
– $\mathrm{drop}_d(x)$ (dropout with drop probability $d \in (0.01, 0.1, 0.3)$), $\mathrm{clip}_c(x)$ (clip values in range $[-c, c]$ with $c \in (0.01, 0.1, 0.5, 1.0)$),
– $x/\|\cdot\|^{elem}_0$, $x/\|\cdot\|^{elem}_1$, $x/\|\cdot\|^{elem}_2$, $x/\|\cdot\|^{elem}_{-\inf}$, $x/\|\cdot\|^{elem}_{\inf}$ (normalizing term by vector norm on flattened matrix),
– $x/\|\cdot\|^{col}_0$, $x/\|\cdot\|^{col}_1$, $x/\|\cdot\|^{col}_2$, $x/\|\cdot\|^{col}_{-\inf}$, $x/\|\cdot\|^{col}_{\inf}$, $x/\|\cdot\|^{row}_0$, $x/\|\cdot\|^{row}_1$, $x/\|\cdot\|^{row}_2$, $x/\|\cdot\|^{row}_{-\inf}$, $x/\|\cdot\|^{row}_{\inf}$ (normalizing term by vector norm along columns or rows of matrix),
– $x/\|\cdot\|_{fro}$, $x/\|\cdot\|_1$, $x/\|\cdot\|_{-\inf}$, $x/\|\cdot\|_{\inf}$ (normalizing term by matrix norm),
– $(x - \hat{m}(x))/\sqrt{\hat{s}(x^2)}$ (normalizing with running averages with factor $r = 0.9$).

• Binary functions $f(x, y)$:

– $x + y$, $x - y$, $x \odot y$, $x / y$ (element-wise addition, subtraction, multiplication, division),
– $x \cdot y$ (matrix multiplication),
– $x$ (keep left),
– $\min(x, y)$, $\max(x, y)$ (minimum and maximum of $x$ and $y$).
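
To make the lists above more concrete, the following is a minimal NumPy sketch of a few of the unary and binary functions. It is only an illustration: the function names, the dictionary layout and the small `eps` guard against division by zero are assumptions made here, not the authors' implementation.

```python
import numpy as np

def signed_sqrt(x):
    """sgn(x) * sqrt(|x|), applied element-wise."""
    return np.sign(x) * np.sqrt(np.abs(x))

def clip(x, c):
    """Clip values to the range [-c, c]."""
    return np.clip(x, -c, c)

def normalize_fro(x, eps=1e-12):
    """Divide a matrix by its Frobenius norm (eps is an assumed safety term)."""
    return x / (np.linalg.norm(x, ord="fro") + eps)

# Hypothetical lookup tables from which a candidate equation could be assembled.
UNARY = {
    "identity": lambda x: x,
    "transpose": lambda x: x.T,
    "negation": lambda x: -x,
    "relu": lambda x: np.maximum(x, 0.0),
    "sign": np.sign,
    "signed_sqrt": signed_sqrt,
    "normalize_fro": normalize_fro,
}

BINARY = {
    "add": np.add,
    "subtract": np.subtract,
    "multiply": np.multiply,   # element-wise product
    "divide": np.divide,       # element-wise division
    "matmul": np.matmul,       # matrix multiplication
    "keep_left": lambda x, y: x,
    "min": np.minimum,
    "max": np.maximum,
}
```

A candidate expression is then a composition of table entries, for example `UNARY["signed_sqrt"](BINARY["matmul"](x, y))` for two compatible matrices `x` and `y`.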

Additionally, we add for each operand running averages for the mean and standard
deviation as well as a normalized version of it, i.e., subtract the mean and divide by the
standard deviation.
So far we described our setup for dense layers. Many state-of-the-art neural networks are
additionally powered by convolutional layers, therefore we chose to use convolutional neural
networks. Conceptually, dense and convolutional layers are very similar, e.g., convolutional
layers can be mimicked by dense layers by extracting the relevant image patches and
performing a matrix dot product. For performance reasons we do not use this technique, but
rather map the matrix multiplication operations to corresponding convolutional operations.
In this case we keep the (native) 4-dimensional tensors used in convolutional layers and,
when required, reshape them to matrices by joining all axes but the sample axis, i.e., join
width, height, and filter axes.
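
The reshaping described above can be sketched as follows. This is an illustrative example only; it assumes a samples-first layout (samples, height, width, filters), and the helper names `to_matrix` and `to_tensor` are hypothetical.

```python
import numpy as np

def to_matrix(t):
    """Join all axes but the sample axis: (samples, h, w, f) -> (samples, h*w*f)."""
    return t.reshape(t.shape[0], -1)

def to_tensor(m, shape):
    """Inverse of to_matrix: restore the original 4-dimensional shape."""
    return m.reshape(shape)

activations = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)
as_matrix = to_matrix(activations)            # shape (2, 48)
restored = to_tensor(as_matrix, activations.shape)
```

Operations that are only defined for matrices can then be applied to `as_matrix`, and the result reshaped back when the next operation expects a 4-dimensional tensor.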

Appendix B. Experimental details


The Wide ResNets used in the experiments are not just composed of dense and convolutional
layers, but have operations like average pooling. For parts of the backward computational
graph that are not covered by the search space, we use the standard gradient equations to
propagate backwards.
Throughout the paper we use the CIFAR-10 dataset and use the preprocessing
described in Loshchilov and Hutter (2016). The hyperparameter setup is also based on that
in Loshchilov and Hutter (2016), and we only modify the learning rate and the optimizer.
For experiments with more than 50 epochs, we use cosine decay with warm starts, where
the warm-up phase lasts for 10% of the training steps and the cosine decay reduces the
learning rate over the remaining time.
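
A schedule of this kind can be sketched as below. The text only states the 10% warm-up fraction and the cosine decay over the remaining steps; the linear shape of the warm-up, the decay to zero and the function name `learning_rate` are assumptions made for illustration.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_fraction=0.1):
    """Warm-up followed by cosine decay (assumed: linear warm-up, decay to 0)."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        # Linear ramp that reaches base_lr at the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(s, total_steps=1000, base_lr=0.1) for s in range(1000)]
```

With these assumptions the learning rate rises during the first 100 of 1000 steps, peaks at `base_lr`, and then decreases monotonically.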
The evolution process is parameterized as follows. With probability p = 0.7 we choose an
equation out of the N = 1000 most competitive equations. For all searches we always modify
one operand or one operation. We use Keras (Chollet et al., 2015) and Tensorflow (Abadi
et al., 2016) to implement our setup. We typically use 500 workers which run on CPUs for
search experiments, which leads to the search converging within 2-3 days.
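
One step of such an evolution process might look like the sketch below. Several parts are assumptions made for illustration: equations are represented as lists of `(kind, symbol)` tokens, the parent is drawn uniformly from the rest of the population with the remaining probability 1 − p (this appendix does not say what happens in that case), and the names `select_parent` and `mutate` are hypothetical. The sketch also ignores arity, so swapping a unary for a binary operation is not handled.

```python
import random

def select_parent(population, p=0.7, n_top=1000, rng=random):
    """population: list of (equation, fitness) pairs; higher fitness is better."""
    ranked = sorted(population, key=lambda pair: pair[1], reverse=True)
    top, rest = ranked[:n_top], ranked[n_top:]
    if not rest or rng.random() < p:
        return rng.choice(top)[0]
    # Assumption: otherwise sample uniformly from the remaining equations.
    return rng.choice(rest)[0]

def mutate(equation, operands, operations, rng=random):
    """Return a copy of the equation with exactly one operand or operation changed."""
    child = list(equation)
    idx = rng.randrange(len(child))
    kind, symbol = child[idx]
    pool = operands if kind == "operand" else operations
    alternatives = [s for s in pool if s != symbol]
    child[idx] = (kind, rng.choice(alternatives))
    return child
```

A worker would call `select_parent`, apply `mutate`, train a child model with the resulting equation, and add the `(equation, validation accuracy)` pair back to the population.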
We experiment with different search setups, i.e., we found it useful to use different subsets
of the available operands and operations, and to select different learning rates as well as
optimizers. We use either WRNs with depth 16 and width 2 or depth 10 and width 1 and
use four different learning rates. Additionally, during searches we use early stopping on the
validation set to reduce the overhead of poorly performing equations. Either SGD or SGD
with momentum (with the momentum coefficient set to 0.9) is used as the optimizer. The
controller tries to maximize the accuracy on the validation set. The top 100 equations are
trained on the entire training set, and the test set accuracy is reported in Table 1.

Recent Advances in Object Detection in the Age of Deep
Convolutional Neural Networks
arXiv:1809.03193v2 [cs.CV] 20 Aug 2019

Shivang Agarwal(∗,1), Jean Ogier du Terrail(∗,1,2), Frédéric Jurie(1)

(∗) equal contribution
(1) Normandie Univ, UNICAEN, ENSICAEN, CNRS
(2) Safran Electronics and Defense

August 21, 2019

Abstract

Object detection, the computer vision task dealing with detecting instances of objects of a certain class
(e.g., ’car’, ’plane’, etc.) in images, attracted a lot of attention from the community during the last six
years. This strong interest can be explained not only by the importance this task has for many applications
but also by the phenomenal advances in this area since the arrival of deep convolutional neural networks
(DCNNs). This article reviews the recent literature on object detection with deep CNN, in a comprehensive
way. This study covers not only the design decisions made in modern deep (CNN) object detectors, but also
provides an in-depth perspective on the set of challenges currently faced by the computer vision community,
as well as some complementary and new directions on how to overcome them. In its last part it goes on to
show how object detection can be extended to other modalities and conducted under different constraints.
This survey also reviews in its appendix the public datasets and associated state-of-the-art algorithms.

Contents

1 Introduction
1.1 What is object detection in images? How to evaluate detector performance?
1.1.1 Problem definition
1.1.2 Performance evaluation
1.1.3 Other detection tasks
1.2 From Hand-crafted to Data Driven Detectors
1.3 Overview of Recent Detectors

2 On the Design of Modern Deep Detectors
2.1 Architecture of the Networks
2.1.1 Backbone Networks
2.1.2 Single Stage Detectors
2.1.3 Double Stage Detectors
2.1.4 Cascades
2.1.5 Parts-Based Models
2.2 Model Training
2.2.1 Losses
2.2.2 Hyper-Parameters
2.2.3 Pre-Training
2.2.4 Data Augmentation
2.3 Inference
2.4 Concluding Remarks

3 Going Forward in Object Detection
3.1 Major Challenges
3.1.1 Scale Variance
3.1.2 Rotational Variance
3.1.3 Domain Adaptation
3.1.4 Object Localization
3.1.5 Occlusions
3.1.6 Detecting Small Objects
3.2 Complementary New Ideas in Object Detection
3.2.1 Graph Networks
3.2.2 Adversarial Trainings
3.2.3 Use of Contextual Information
3.3 Concluding Remarks

4 Extending Object Detection
4.1 Detecting Objects in Other Modalities
4.1.1 Object Detection in Videos
4.1.2 Object Detection in 3D Point Clouds
4.2 Detecting Objects Under Constraints
4.2.1 Weakly Supervised Detection
4.2.2 Few-shot Detection
4.2.3 Zero-shot Detection
4.2.4 Fast and Low Power Detection
4.3 Towards Versatile Object Detectors
4.3.1 Interpretability and Robustness
4.3.2 Universal Detector, Lifelong Learning
4.4 Concluding Remarks

5 Conclusions

Appendix A Datasets and Results
A.1 Classical Datasets with Common Objects
A.1.1 Pascal-VOC
A.1.2 MS COCO
A.1.3 ImageNet Detection Task
A.1.4 VisualGenome
A.1.5 OpenImages
A.2 Specialized datasets
A.2.1 Aerial Imagery
A.2.2 Text Detection in Images
A.2.3 Face Detection
A.2.4 Pedestrian Detection
A.2.5 Logo Detection
A.2.6 Traffic Signs Detection
A.2.7 Other Datasets
A.3 3D Datasets
A.4 Video Datasets
A.5 Concluding Remarks

1 Introduction

The task of automatically recognizing and locating objects in images and videos is important in order to make computers able to understand or interact with their surroundings. For humans, it is one of the primary tasks, in the paradigm of visual intelligence, in order to survive, work and communicate. If one wants machines to work for us or with us, they will need to make sense of their environment as well as humans or in some cases even better than humans. Solving the problem of object detection with all the challenges it presents has been identified as a major precursor to solving the problem of semantic understanding of the surrounding environment.

A large number of academics as well as industry researchers have already shown their interest in it by focusing on applications, such as autonomous driving, surveillance, relief and rescue operations, deploying robots in factories, pedestrian and face detection, brand recognition, visual effects in images, digitizing texts, understanding aerial images, etc. which have object detection as a major challenge at their core.

The Semantic Gap, defined by Smeulders et al. [349] as the lack of coincidence between the information one can extract from some visual data and its interpretation by a user in a given situation, is one of the main challenges object detection must deal with. There is indeed a difference of nature between raw pixel intensities contained in images and semantic information depicting objects.

Object detection is a natural extension of the classification problem. The added challenge is to correctly detect the presence and accurately locate the object instance(s) in the image (Figure 1). It is (usually) a supervised learning problem in which, given a set of training images, one has to design an algorithm which can accurately locate and correctly classify as many object instances as possible in a rectangle box while avoiding false detections of background or multiple detections of the same instance. The images can have object instances from the same classes, different classes or no instances at all. The object categories in the training and testing sets are then supposed to be statistically similar. The instance can occupy very few pixels, 0.01% to 0.25%, as well as the majority of the pixels, 80% to 90%, in an image. Apart from the variation in size, the variation can be in lighting, rotation, appearance, background, etc. There may not be enough data to accurately cover all the variations well enough. Small objects, particularly, give low performance at being detected because the available information to detect them is present but compressed and hard to decode without some prior knowledge or context. Some object instances can also be occluded.

Figure 1: Visualization of sample examples from different kinds of datasets for the detection task. (a) generic object detection [88], (b) text detection [112], (c) pedestrian detection [399], (d) traffic-sign detection [490], (e) face detection [432] and (f) objects in aerial images detection [421].

An additional difficulty is that real world applications like video object detection demand this problem to be solved in real time. With the current state of the art detectors that is often not the case. The fastest detectors are usually worse than the best performing ones (e.g. heavy ensembles).

We present this review to connect the dots between various deep learning and data driven techniques proposed in recent years, as they have brought about huge improvements in the performance, even though the recently introduced object detection datasets are much more challenging. We intend to study what makes them work and what their shortcomings are. We discuss the seminal works in the field and the incremental works which are more application oriented. We also examine how they try to overcome each of the challenges. The earlier methods which were based on hand-crafted features are outside the scope of this review. The problems that are related to object detection such as semantic segmentation are also outside the scope of this review, except when used to bring contextual information to detectors. Salient object detection, being related to semantic segmentation, will also not be treated in this survey.

Several surveys related to object detection have been written in the past, addressing specific tasks such as pedestrian detection [84], moving objects in surveillance systems [161], object detection in remote sensing images [53], face detection [126, 453], facial landmark detection [420], to cite only some illustrative examples. In contrast with this article, the aforementioned surveys do not cover the latest advances obtained with deep neural networks. Recently four non peer reviewed surveys appeared on arXiv that also treat the subject of Object Detection using Deep Learning methods. This article shares the same motivations as [470] and [35], but covers the topic more comprehensively and extensively than these two surveys, which only cover the backbones and flagship articles associated with modern object detection. This work investigates more thoroughly papers that one would not necessarily call mainstream, like boosting methods or true cascades, and studies related topics like weakly supervised learning and approaches that carry promises but that have yet to become widely used by the community (graph networks and generative methods). Concurrently to this article, the paper by [220] goes into many details about the modern object detectors. We wanted this survey to be more than just an inventory of existing methods but to provide the reader with a complete tool-set to be able to understand fully how the state of the art came to be and what are the potential leads to advance it further, by studying surrounding topics such as interpretability, lifelong detectors, few-shot learning or domain adaptation (in addition to delving into non mainstream methods already mentioned).

The following subsections give an overview of the problem, some of the seminal works in the field (hand-crafted as well as data driven) and describe the task and evaluation methodology. Section 2 goes into the detail of the design of the current state-of-the-art models. Section 3 presents recent methodological advances as well as the main challenges modern detectors have to face. Section 4 shows how to extend the presented detectors to different detection tasks (video, 3D) or perform under different constraints (energy efficiency, training data, etc.). Finally, Section 5 concludes the review. We also list a wide variety of datasets and the associated state of the art performances in the Appendix.

1.1 What is object detection in images? How to evaluate detector performance?

1.1.1 Problem definition

Object detection is one of the various tasks related to the inference of high-level information from images. Even if there is no universally accepted definition of it in the literature, it is usually defined as the task of locating all the instances of a given category (e.g. 'car' instances in the case of car detection) while avoiding raising alarms when/where no instances are present. The localization can be provided as the center of the object on the image, as a bounding box containing the object, or even as the list of the pixels belonging to the object. In some rare cases, only the presence/absence of at least one instance of the category is sought, without any localization.

Object detection is always defined with respect to a dataset containing images associated with a list containing some descriptions (position, scale, etc.) of the objects each image contains. Let's denote by $I$ an image and $O(I)$ the set of $N^*(I)$ object descriptions, with:

$$O(I) = \{(Y_1^*, Z_1^*), \ldots, (Y_i^*, Z_i^*), \ldots, (Y_{N^*(I)}^*, Z_{N^*(I)}^*)\}$$

where $Y_i^* \in \mathcal{Y}$ represents the category of the $i$-th object and $Z_i^* \in \mathcal{Z}$ a representation of its location/scale/geometry in the image. $\mathcal{Y}$ is the set of possible categories, which can be hierarchical or not. $\mathcal{Z}$ is the space of possible locations/scales/geometries of objects in images. It can be the position of the center of the object $(x_c, y_c) \in \mathbb{R}^2$, a bounding box $(x_{min}, y_{min}, x_{max}, y_{max}) \in \mathbb{R}^4$ encompassing the object, a mask, etc.

Using these notations, object detection can be defined as the function associating an image with a set of detections

$$D(I, \lambda) = \{(Y_1, Z_1), \ldots, (Y_i, Z_i), \ldots, (Y_{N(I,\lambda)}, Z_{N(I,\lambda)})\}.$$

The operating point $\lambda$ fixes a tradeoff between false alarms and missed detections.

Object detection is related to but different from object segmentation, which aims to group pixels from the same object into a single region, or semantic segmentation, which is similar to object segmentation except that the classes may also refer to varied backgrounds or 'stuff' (e.g. 'sky', 'grass', 'water' etc., categories). It is also different from Object Recognition, which is usually defined as recognizing (i.e. giving the name of the category of) an object contained in an image or a bounding box, assuming there is only one object in the image. For some authors Object Recognition involves detecting all the objects in an image. Instance object detection is more restricted than object detection as the detector is focused on a single object (e.g. a particular car model) and not any object of a given category. In the case of videos, the object detection task is to detect the objects in each frame of the video.

1.1.2 Performance evaluation

Evaluating the function of detection for a given image $I$ is done by comparing the actual list of object locations $O(I)$ (the so-called ground truth) of a given category with the detections $D(I, \lambda)$ provided by the detector. Such a comparison is possible only once the two following definitions are given:

1. A geometric compatibility function

$$G : (Z, Z^*) \in \mathcal{Z}^2 \to \{0, 1\}$$

defining the conditions that must be met for considering two locations as equivalent.

2. An association matrix $A \in \{0, 1\}^{N^*(I) \times N(I,\lambda)}$ defining a bipartite graph between the detected objects $\{Z_1, \cdots, Z_{N(I,\lambda)}\}$ and the ground truth objects $\{Z_1^*, \cdots, Z_{N^*(I)}^*\}$, with:

$$\sum_{j=1}^{N(I,\lambda)} A(i, j) \leq 1$$

$$\sum_{i=1}^{N^*(I)} A(i, j) \leq 1$$

$$G(Z_i^*, Z_j) = 0 \implies A(i, j) = 0$$

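
The two definitions above can be illustrated with a small sketch for bounding boxes $(x_{min}, y_{min}, x_{max}, y_{max})$. Here $G$ is assumed to be an intersection-over-union test with a threshold of 0.5, which is one common choice rather than the only one, and the association matrix is built greedily in the order the detections are given. The greedy pass always satisfies the three constraints but is not guaranteed to find the association with the largest number of matches. The function names are hypothetical.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def compatible(z_true, z_det, threshold=0.5):
    """Assumed G(Z*, Z): 1 if the overlap exceeds the threshold, else 0."""
    return int(iou(z_true, z_det) > threshold)

def associate(ground_truth, detections, threshold=0.5):
    """Greedy association matrix; A[i, j] = 1 pairs ground truth i with detection j."""
    A = np.zeros((len(ground_truth), len(detections)), dtype=int)
    for j, det in enumerate(detections):
        for i, gt in enumerate(ground_truth):
            # Each ground truth and each detection is used at most once.
            if A[i].sum() == 0 and compatible(gt, det, threshold):
                A[i, j] = 1
                break
    return A

ground_truth = [(0, 0, 10, 10)]
detections = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
A = associate(ground_truth, detections)
```

In this example the first detection is matched, the second is a duplicate of the same object and stays unmatched, and the third does not overlap any ground truth box.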
[Figure 2 legend: Ground Truth; True Positive; FP: Localization; FP: Double detection; FP: Misclassification; FP: Background]

Figure 2: An illustration of predicted boxes being marked as True Positive (TP) or False Positive (FP). The blue boxes are ground-truths for the class "dog". A predicted box is marked as TP if the predicted class is correct and the overlap with the ground truth is greater than a threshold. It is marked as FP if it has overlap less than that threshold or the same object instance is detected again or it is misclassified or a background is predicted as an object instance. The left dog is marked as False Negative (FN). Best viewed in color.

With such definitions, the number of correct detections is given by

$$TP(I, \lambda) = \sum_{i,j} A(i, j)$$

If several association matrices $A$ satisfy the previous constraints, the one maximizing the number of correct detections is chosen. An illustration of TP and False Positives (FP) in an image is shown in Figure 2.

Such a definition can be viewed as the size of the maximal matching in a bipartite graph. Nodes are locations (ground truth, on the one hand, detections on the other hand). Edges are based on the acceptance criterion $G$ and the constraints stating that ground truth objects and detected objects can be associated only once each.

It is possible to average the correct detections at the level of a test set $T$ through the two following ratios:

$$\mathrm{Precision}(\lambda) = \frac{\sum_{I \in T} TP(I, \lambda)}{\sum_{I \in T} N(I, \lambda)}$$

$$\mathrm{Recall}(\lambda) = \frac{\sum_{I \in T} TP(I, \lambda)}{\sum_{I \in T} N^*(I)}$$

The Precision/Recall curve is obtained by varying the operating point $\lambda$. The Average Precision (AP) can be computed by averaging the Precision over several Recall values (typically 11 equally spaced values).

The definition of $G$ can vary from one dataset to another. However, only a few definitions reflect most of the current research. One of the most common comes from the Pascal VOC challenge [88]. It assumes ground truths are defined by non-rotated rectangular bounding boxes containing object instances, associated with class labels. The diversity of the methods to be evaluated prevents the use of ROC (Receiver Operating Characteristic) or DET (Detection Error Trade-off), commonly used for face detection, as it would assume all the methods use the same window extraction scheme (such as the sliding window mechanism), which is not always the case. In the Pascal VOC challenge, object detection is evaluated by one separate AP score per category. For a given category, the Precision/Recall curve is computed from the ranked outputs (bounding boxes) of the method to be evaluated. Recall is the proportion of all positive examples ranked above a given rank, while precision is the proportion of boxes above that rank that are positive. The AP summarizes the Precision/Recall curve and is defined as the mean (interpolated) precision over the set of eleven equally spaced recall levels. Output bounding boxes are judged as true positives (correct detections) if the overlap ratio (intersection over union or IOU) exceeds 0.50. Detection outputs are assigned to ground truth in the order given by decreasing confidence scores. Duplicated detections of the same object are considered as false detections. The performance over the whole dataset is computed by averaging the APs across all the categories.

The recent and popular MSCOCO challenge [214] relies on the same principles. The main difference is that the overall performance (mAP) is obtained by averaging the AP obtained with 10 different IOU thresholds between 0.50 and 0.95. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) also has a detection task in which algorithms have to produce triplets of class labels,
bounding boxes and confidence scores. Each image has mostly one dominant object in it. Missing object detections are penalized in the same way as a duplicate detection, and the winner of the detection challenge is the one who achieves first place AP on most of the object categories. The challenge also has the Object Localization task, with a slightly different definition. The motivation is not to penalize algorithms if one of the detected objects is actually present while not included in the ground-truth annotations, which is not rare due to the size of the dataset and the number of categories (1000). Algorithms are expected to produce 5 class labels (in decreasing order of confidence) and 5 bounding boxes (one for each class label). The error of an algorithm on an image is 0 if one of the 5 bounding boxes is a true positive (correct class label and correct localization according to IOU), 1 otherwise. The error is averaged over all the images of the test set.

Some recent datasets, like DOTA [421], proposed two tasks named detection on horizontal bounding boxes and detection on oriented bounding boxes, corresponding to two different kinds of ground truths (with or without target orientations), no matter how the methods were trained. In some other datasets, the scale of the detection is not important and a detection is counted as a True Positive if its coordinates are close enough to the center of the object. This is the case for the VeDAI dataset [302]. In the particular case of object detection in 3D point clouds, such as in the KITTI object detection benchmark [98], the criterion is similar to Pascal VOC, except that the boxes are in 3D and the overlap is measured in terms of volume intersection.

Object detection in videos: Regarding the detection of objects in videos, the most common practice is to evaluate the performance by considering each frame of the video as an independent image and averaging the performance over all the frames, as done in the ImageNet VID challenge [319].

It is also possible to move away from the 2D case and to evaluate video-mAP based on tubelet IoU, where a tubelet is detected if and only if the mean per-frame IoU for every frame in the video is greater than a threshold, σ, and the tube label is correctly predicted. We take this definition directly from [107], where it was used to compute mAP and ROC curves at the video level.

1.1.3 Other detection tasks

This survey only covers the methodologies for performance evaluation found in the recent literature. Besides these common evaluation measures, there are many more specific ones, as object detection can be combined with other complex tasks, e.g., 3D orientation and layout inference in [423]. The reader can refer to the review by Mariano et al. [232] to explore this topic. It is also worth mentioning the very recent work of Oksuz et al. [260], which proposes a novel metric providing richer and more discriminative information than AP, especially with respect to the localization error.

We have decided to orient this survey mainly towards bounding box tasks, even if there is a tendency to move away from this task considering the performance of modern deep learning methods, which already approach human accuracy on some datasets. The reasons for this choice are numerous. First of all, historically speaking, bounding boxes were one of the first object detection tasks and thus there is already an immense body of literature on this topic. Secondly, not all the datasets provide annotations down to the level of pixels. In aerial imagery, for instance, most of the datasets only provide bounding boxes. It is also the case for some pedestrian detection datasets. Instance segmentation level annotations are still costly for the moment, even with the recent development of annotator-friendly algorithms (e.g. [32, 230]) that offer pixel-level annotations at the expense of a few user clicks. Maybe in the future all datasets will contain annotations down to the level of pixels, but it is not yet the case. Even when one has pixel-level annotations for tasks like instance segmentation, which is becoming the standard, bounding boxes are needed from the detector to distinguish between two instances of the same class, which explains why most modern instance segmentation pipelines like [118] have a bounding box branch. Therefore, metrics evaluating the bounding boxes from the models are still relevant in that case. One could also argue that bounding boxes are more robust annotations because they are less sensitive to annotator noise, but this is debatable. For all of these reasons the rest of this survey will tackle mainly bounding boxes and associated tasks.

1.2 From Hand-crafted to Data Driven Detectors

While the first object detectors initially relied on mechanisms to align a 2D/3D model of the object on the image using simple features, such as edges [217], key-points [224] or templates [278], the arrival of Machine Learning (ML) was the first revolution to shake up the area. Among the most popular ML algorithms used for object detection were boosting (e.g., [326]) and Support Vector Machines (e.g., [64]). This first wave of ML-based detectors was entirely based on hand-crafted (engineered) visual features processed by classifiers or regressors. These hand-crafted features were as diverse as Haar Wavelets [398], edgelets [418], shapelets [320], histograms of oriented gradient [64], bags-of-visual-words [187], integral histograms [287], color histograms [399], covariance descriptors [388], local binary patterns [408], or their combinations [85]. One of the most popular detectors before the DCNN revolution was the Deformable Part Model of Felzenszwalb et al. [90] and its variants, e.g. [322].

This very rich literature on visual descriptors has been wiped out in less than five years by Deep Convolutional Neural Networks, which are a class of deep, feed-forward artificial neural networks. DCNNs are inspired by the connectivity patterns between neurons of the human visual cortex and use no pre-processing, as the network itself learns the filters previously hand-engineered by traditional algorithms, making them independent from prior knowledge and human effort. They are said to be end-to-end trainable and rely solely on the training data. This leads to their major disadvantage of requiring copious amounts of data. The first use of ConvNets for detection and localization goes back to the early 1990s for faces [392], hands [258] and multi-character strings [237]. Then, in the 2000s, they were used in text [65], face [96, 263] and pedestrian [328] detection.

However, the merits of DCNNs for object detection were recognized by the community only after the seminal work of Krizhevsky et al. [181] and Sermanet et al. [327] on the challenging ImageNet dataset. Krizhevsky et al. [181] were the first to demonstrate localization through DCNN in the ILSVRC 2012 localization and detection tasks. Just one year later Sermanet et al. [327] were able to describe how a DCNN can be used to locate and detect object instances. They won the ILSVRC 2013 localization and detection competition and also showed that combining the classification, localization and detection tasks can simultaneously boost the performance of all tasks.

The first DCNN-based object detectors applied a fine-tuned classifier on each possible location of the image in a sliding window manner [262], or on some specific regions of interest [105], through a region proposal mechanism. Girshick et al. [105] treated each region proposal as a separate classification and localization task. Therefore, given an arbitrary region proposal, they deformed it to a warped region of fixed dimensions. A DCNN was used to extract a fixed-length feature vector from each proposal, and category-specific linear SVMs were then used to classify them. Since it was a region based CNN they called it R-CNN. Another important contribution was to show the usability of transfer learning in DCNNs. Since data is scarce, supervised pre-training on an auxiliary task can lead to a significant boost to the performance of domain-specific fine-tuning. Sermanet et al. [327], Girshick et al. [105] and Oquab et al. [262] were among the first authors to show that DCNNs can lead to dramatically higher object detection performance on the ImageNet detection challenge [66] and PASCAL VOC [88] respectively, as compared to previous state-of-the-art systems based on HOG [64] or SIFT [225].
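The warping step described above for R-CNN, where an arbitrary region proposal is deformed to fixed dimensions before feature extraction, can be sketched as follows. This is only an illustration under stated assumptions: the function name `warp_region`, the `(x0, y0, x1, y1)` box convention and the nearest-neighbour sampling are hypothetical choices made here, not details taken from [105].

```python
def warp_region(image, box, out_h, out_w):
    """Crop box = (x0, y0, x1, y1) from a 2D image (list of rows) and
    resize it to out_h x out_w with nearest-neighbour sampling.

    The box convention (x1, y1 exclusive) and the sampling scheme are
    assumptions made for this sketch.
    """
    x0, y0, x1, y1 = box
    src_h, src_w = y1 - y0, x1 - x0
    if src_h <= 0 or src_w <= 0:
        raise ValueError("box must have positive width and height")
    warped = []
    for r in range(out_h):
        # Map each output row back to a source row inside the box.
        src_r = y0 + min(src_h - 1, (r * src_h) // out_h)
        row = []
        for c in range(out_w):
            src_c = x0 + min(src_w - 1, (c * src_w) // out_w)
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped


# A 4 x 6 toy image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
# A 2 x 4 proposal is stretched to a fixed 4 x 4 input, ignoring aspect ratio.
patch = warp_region(image, (1, 0, 5, 2), 4, 4)
```

Whatever the shape of the proposal, the output always has the same size, which is what lets a network with fully connected layers produce a fixed-length feature vector for every proposal.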

Since most prevalent DCNNs had to use a fixed-size input, because of the fully connected layers at the end of the network, they had to either warp or crop the image to make it fit that size. He et al. [116] came up with the idea of aggregating feature maps of the final convolutional layer. Thus, the fully connected layer at the end of the network gets a fixed-size input even if the input images in the dataset are of varying sizes and aspect ratios. This helped reduce overfitting, increased robustness and improved the generalizability of the existing models. Compared to R-CNN, which used one forward pass per proposal to generate the feature map, the methodology proposed by [116] made it possible to share computation among all the proposals: a single forward pass is done for the whole image, and the regions are then selected from the final feature map according to the proposals. This naturally increased the speed of the network by over one hundred times.

All the previous approaches train the network in multistage pipelines that are complex, slow and inelegant. They include extracting features through CNNs, classifying through SVMs and finally fitting bounding box regressors. Since each task is handled separately, convolutional layers cannot take advantage of end-to-end learning and bounding box regression. Girshick [104] helped alleviate this problem by streamlining all the tasks in a single model using a multitask loss. As we will explain later, this not only improved the accuracy but also made the network run faster at test time.

1.3 Overview of Recent Detectors

With the foundations of DCNN-based object detection laid out, the field was able to mature and move further away from classical methods. The fully-convolutional paradigm glimpsed in [327] gained more traction every day in the community.

When Ren et al. [309] successfully replaced the only component of Fast R-CNN that still relied on non-learned heuristics by inventing RPN (Region Proposal Networks), it put the last nail in the coffin of traditional object detection and started the age of completely end-to-end architectures. Specifically, the anchor mechanism, developed for the RPN, was here to stay. This grid of fixed a-priori boxes (or anchors), not necessarily corresponding to the receptive field of the feature map pixel they lie on, created a framework for fully-convolutional classification and regression and is used nowadays by most pipelines like [221] or [216], to cite a few.

These conceptual changes make the detection pipelines far more elegant and efficient than their counterparts when dealing with big training sets. However, this comes at a cost. The resulting detectors become complete black boxes and, because they are more prone to overfitting, they require more data than ever.

Faster R-CNN [309] and its other double-stage variants are now the go-to methods for object detection and will be thoroughly explored in Sec. 2.1.3. Although this line of work is now prominent, other choices were explored, all based on fully-convolutional architectures.

Single-stage algorithms that were completely abandoned since Viola et al. [398] have now become reasonable alternatives thanks to the discriminative power of CNN features. Redmon et al. [308] first showed that the simplest architectural design could bring unfathomable speed with acceptable performance. Liu et al. [221] sophisticated the pipeline by using anchors at different layers while making it faster and more accurate than Redmon et al. [308]. These two seminal works gave birth to a considerable amount of literature on single-stage methods that we will cover in Sec. 2.1.2. Boosting and Deformable part-based models, which were once the norm, have yet to make their comeback into the mainstream. However, some recent popular works such as Dai et al. [63] used closely related ideas, and thus these approaches will also be discussed in sections 2.1.4 and 2.1.5 of the survey.

The fully-convolutional nature of these new dominant architectures allows all kinds of implementation tricks during training and at inference time that will be discussed at the end of the next section. However, it makes the subtle design choices of the different architectures something of a dark art to newcomers.
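The anchor mechanism mentioned in this overview, a grid of fixed a-priori boxes tied to feature map locations, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the function name `make_anchors`, the choice of placing anchors at cell centres, the area-preserving aspect-ratio parameterization and the example scales, ratios and stride are not taken from [309].

```python
import math


def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return a list of (x0, y0, x1, y1) anchors in image coordinates.

    One anchor per (scale, ratio) pair is placed at the centre of each
    feature-map cell. ratio is width / height and the area is kept at
    scale * scale (an assumed, commonly used parameterization).
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of cell (i, j) projected back onto the image.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)
                    h = s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


# A 2 x 3 feature map with stride 16 and k = 2 scales * 3 ratios = 6 anchors
# per location (illustrative values only).
anchors = make_anchors(2, 3, 16, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
```

Note that a 32-pixel anchor centred on a stride-16 cell extends beyond that cell, which reflects the remark above that anchors do not necessarily correspond to the receptive field of the feature map pixel they lie on.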

The goal of the rest of the survey is to provide a complete view of this new landscape while giving the keys to understand the underlying principles that guide interesting new architectural ideas. Before diving into the subject, the survey starts by reminding the readers about the object detection task and the metrics associated with it.

After introducing the topic and touching upon some general information, the next section will get right into the heart of object detection by presenting the designs of recent deep learning based object detectors.

2 On the Design of Modern Deep Detectors

Here we analyze, investigate and dissect the current state-of-the-art models and the intuition behind their approaches. We can divide the whole detection pipeline into three major parts. The first part focuses on the arrangement of convolutional layers to get proposals (if required) and box predictions. The second part is about setting various training hyper-parameters, deciding upon the losses, etc. to make the model converge faster. The third part focuses on various approaches to refine the predictions from the converged model(s) at test time and therefore get better detection performance. The first part has received the attention of most researchers, the second and third parts much less. To give a clear overview of all the major components and the popular options available in them, we present a map of the object detection pipeline in Figure 3.

Most of the ideas from the following sub-sections have achieved top accuracy on the challenging MS COCO [214] object detection challenge and PASCAL VOC [88] detection challenge or on some other very challenging datasets.

2.1 Architecture of the Networks

The architecture of the DCNN object detectors follows a Lego-like construction pattern based on chaining different building blocks. The first part of this Section will focus on what researchers call the backbone of the DCNN, meaning the feature extractor from which the detector draws its discriminative power. We will then tackle diverse arrangements of increasing complexity found in DCNN detectors: from single-stage to multiple-stage methods. Finally, we will talk about the Deformable Part Models and their place in the deep learning landscape.

2.1.1 Backbone Networks

Many deep neural networks originally designed for classification tasks have been adopted for the detection task as well, and many modifications have been made to them to cope with the additional difficulties encountered. The following discussion is about these networks and the modifications in question.

Backbones: Backbone networks play a major role in object detection models. Huang et al. [140] partially confirmed the common observation that, as the classification performance of the backbone increases on the ImageNet classification task [319], so does the performance of object detectors based on those backbones. It is the case at least for popular double-stage detectors like Faster-RCNN [309] and R-FCN [62], although for SSD [221] the object detection performance remains around the same (see the following Sections for details about these 3 architectures).

However, as the size of the network increases, inference and training become slower and require more data. The most popular architectures in increasing order of inference time are MobileNet [134], VGG [343], Inception [151, 364, 365], ResNet [117], Inception-ResNet [366], etc. All of the above architectures were first borrowed from the classification problem with little or no modification.

[Figure 3 diagram. Blocks shown: Dataset; Network (MobileNet | VGG | Inception | ResNet | Xception | Wide-ResNet | ResNeXt | SqueezeNet | ShuffleNet | DenseNet); Single-Stage Framework; Double-Stage Framework; Predictions (Classification, Bounding Box); Losses; Inference; Performance. Options listed around the blocks: Train-time and Test-time choices (Initialization: He/Xavier/Random, pre-training on other datasets; Multiple scales; Fusion of layers; Data Augmentation: scale, resize, rotation, flipping, elastic distortions and crop; contrast, color, hue, brightness, saturation and sharpness; Smart Augmentation; GANs). Losses: Classification, Regression, Class Imbalance and Supplementary Losses (Cross Entropy, L1 Loss, L2 Loss, Unit-Box, Encoding, In-Out, Border, OHEM, Focal Loss, Segmentation Loss, Repulsion Loss). Inference: Greedy NMS, Soft NMS, Learning Based Methods, Clustering (Mean Shift, Agglomerative, Affinity Propagation, Heuristic Variants). Performance: Mean Average Precision (mAP) @0.5 IOU, mAP @0.5:0.95 IOU, DET Curve, ROC Curve.]

Figure 3: A map of object detection pipeline along with various options available in different parts. Images are
taken from a Dataset and then fed to a Network. They use one of the Single-Stage (see Figure 6) or Double-
Stage Framework (see Figure 8) to make Predictions of the probabilities of each class and their bounding
boxes at each spatial location of the feature map. During training these predictions are fed to Losses and
during testing these are fed to Inference for pruning. Based on the final predictions, a Performance is
evaluated against the ground-truths. All the ideas are referenced in the text. Best viewed in color.


Figure 4: An illustration of how the backbones can be modified to give predictions at multiple scales and
through fusion of features. (a) An unmodified backbone. (b) Predictions obtained from different scales of
image. (c) Feature maps added to the backbone to get predictions at different scales. (d) A top down
network added in parallel to backbone. (e) Top down network along with predictions at different scales.

Some other backbones used in object detectors, which were not included in the analysis of [140] but have given state-of-the-art performance on ImageNet [66] or COCO [214] detection tasks, are Xception [57], DarkNet [306], Hourglass [256], Wide-Residual Net [193, 445], ResNeXt [426], DenseNet [139], Dual Path Networks [50] and Squeeze-and-Excitation Net [136]. The recent DetNet [208] is a backbone network designed specifically for high-performance detection. It avoids the large down-sampling factors present in classification networks. Dilated Residual Networks [439] also worked with similar motivations to extract features with fewer strides. SqueezeNet [148] and ShuffleNet [461] chose instead to focus on speed. More information on networks focusing on speed can be found in Section 4.2.4.

Adapting the mentioned backbones to the inherent multi-scale nature of object detection is a challenge; in the following paragraphs we give examples of commonly used strategies.

Multi-scale detections: Papers [28, 200, 431] made independent predictions on multiple feature maps to take into account objects of different scales. The lower layers with finer resolution have generally been found better for detecting small objects than the coarser top layers. Similarly, coarser layers are better for bigger objects. Liu et al. [221] were the first to use multiple feature maps for detecting objects. Their method has been widely adopted by the community. Since the final feature maps of the networks may not be coarse enough to detect sizable objects in large images, additional layers are also usually added. These layers have a wider receptive field.

Fusion of layers: In object detection, it is also helpful to make use of the context pixels of the object [102, 446, 451]. One interesting argument in favor of fusing different layers is that it integrates information from different feature maps with different receptive fields, and can thus take help of surrounding
local context to disambiguate some of the object instances.

Some papers [51, 92, 156, 192, 471] have experimented with fusing different feature layers of these backbones so that the finer layers can make use of the context learned in the coarser layers. Lin et al. [215, 216], Shrivastava et al. [338] and Woo et al. [415] went one step further and proposed a whole additional top-down network in addition to the standard bottom-up network, connected through lateral connections. The bottom-up network used can be any one of the backbones mentioned above. While Shrivastava et al. [338] used only the finest layer of the top-down architecture for detection, Feature Pyramid Network (FPN) [215] and RetinaNet [216] used all the layers of the top-down architecture for detection. FPN used the feature maps thus generated in a two-stage detector fashion while RetinaNet used them in a single-stage detector fashion (see Section 2.1.3 and Section 2.1.2 for more details). FPN [215] has been a part of the top entries in the MS COCO 2017 challenge. An illustration of multiple scales and fusion of layers is shown in Figure 4.

Now that we have seen how to best use the feature maps of the object detector backbones, we can explore the architectural details of the different major players in DCNN object detection, starting with the most immediate methods: single-stage detectors.

2.1.2 Single Stage Detectors

The two most popular approaches in the single-stage detection category are YOLO [308] and SSD [221]. In this Section we will go through their basic functioning, some upsides and downsides of using these two approaches and further improvements proposed on them.

YOLO: Redmon et al. [308] presented for the first time a single-stage method for object detection where raw image pixels are converted to bounding box coordinates and class probabilities and which can be optimized end-to-end directly. This made it possible to predict boxes directly in a single feed-forward pass without reusing any component of the neural network or generating proposals of any kind, thus speeding up the detector.

They started by dividing the image into an S × S grid and assuming B bounding boxes per grid cell. Each cell containing the center of an object instance is responsible for the detection of that object. Each bounding box predicts 4 coordinates, objectness and class probabilities. This reframed object detection as a regression problem. To have a receptive field that covers the whole image they included a fully connected layer in their design towards the end of the network.

SSD: Liu et al. [221], inspired by the Faster-RCNN architecture, used reference boxes of various sizes and aspect ratios to predict object instances (Figure 5), but they completely got rid of the region proposal stage (discussed in the following Section). They were able to do this by making the whole network work as a regressor as well as a classifier. During training, thousands of default boxes corresponding to different anchors on different feature maps learned to discriminate between objects and background. They also learned to directly localize and predict class probabilities for the object instances. This was achieved with the help of a multitask loss. Since a lot of boxes try to localize the same objects at inference time, a post-processing step like Greedy NMS is generally required to suppress duplicate detections.

In order to accommodate objects of all sizes they added additional convolutional layers to the backbone and used them, instead of a single feature map, to improve the performance. This method was later applied to approaches related to two-stage detectors too [215].

Pros and Cons: Oftentimes single-stage detectors do not give as good performance as the double-stage ones, but they are a lot faster [140], although some double-stage detectors can be faster than single-stage ones due to architectural tricks, and modern single-stage detectors outperform the older multi-stage pipelines.
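The Greedy NMS post-processing step mentioned above for SSD can be sketched as follows. This is a minimal, class-agnostic illustration under assumptions: the function names, the `(x0, y0, x1, y1)` box format and the 0.5 IoU threshold are choices made here, and in practice the suppression is typically applied per class.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in decreasing score order.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections of one object, plus a separate one.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is treated as a duplicate and only the first and third boxes survive.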

[Figure 5 diagram labels: k anchor boxes; 256-d; classification, (C+1)k scores; regression, 4k coordinates.]

Figure 5: The workings of anchors. k anchors are declared at each spatial location of the final feature map(s). A classification score for each class (including background) is predicted for each anchor. Regression coordinates are predicted only for anchors having an overlap greater than a pre-decided threshold with the ground-truth. For the special case of predicting objectness, C is set to one. This idea was introduced in [309].

The advantages of the YOLO strategy are that it is extremely fast, with 45 to 150 frames per second, that it sees the entire image as opposed to region proposal based strategies, which is helpful for encoding contextual information, and that it learns generalizable representations of objects. But it also has some obvious disadvantages. Since each grid cell has only two bounding boxes, it can only predict at most two objects in a grid cell. This is a particularly inefficient strategy for small objects. It struggles to precisely localize some objects as compared to two-stage detectors. Another drawback of YOLO is that it uses a coarse feature map at a single scale only.

To address these issues, SSD used a dense set of boxes and considered predictions from various feature maps instead of one. It improved upon the performance of YOLO. But since it has to sample from this dense set of detections at test time, it gives lower performance on the MS COCO dataset as compared to two-stage detectors. The two-stage object detectors get a sparse set of proposals on which they have to perform predictions.

Further improvements: Redmon and Farhadi [306] and Redmon and Farhadi [307] suggested a lot of small changes in versions 2 and 3 of the YOLO method. Changes like applying batch normalization, using higher resolution input images, removing the fully connected layer and making it fully convolutional, clustering box dimensions, location prediction and multi-scale training helped to improve performance, while a custom network (DarkNet) helped to improve speed.

Many further developments have been proposed on top of the Single Shot MultiBox Detector. The major advancements over the years are illustrated in Figure 6. Deconvolutional Single Shot Detector (DSSD) [92] used a deconvolutional module to increase the resolution of the top layers and combined each layer with the previous one through element-wise products instead of an element-wise sum. Rainbow SSD [156] proposed to concatenate features of shallow layers to top layers by max-pooling as well as features of top layers to shallow layers through a deconvolution operation. The final fused information increased from a few hundred to 2,816 channels per feature map. RUN [192] proposed a 3-way residual block to combine adjacent layers before the final prediction. Cao et al. [29] used concatenation modules and element-sum modules to add contextual information in a slightly different manner. Zheng et al. [471] slightly tweaked DSSD by fusing a smaller number of layers and adding extra ConvNets to improve speed as well as performance.

They all improved upon the performance of conventional SSD and they lie within a small range among themselves on the Pascal VOC 2012 test set [88], but they added a considerable amount of computational cost, thus making it a little slower. WeaveNet [51] aimed at reducing computational costs by gradually sharing the information from adjacent scales in an iterative manner. They hypothesized that by weaving the information iteratively, sufficient multi-scale context information can be transferred and integrated into the current scale.

Recently three strong candidates have emerged
[Figure 6 panels: YOLO (Jun 2015), SSD (Dec 2015), DSSD (Jan 2017), RetinaNet (Aug 2017), RefineDet (Nov 2017), CornerNet (Aug 2018). Labels inside the panels: fc, cls, reg, obj, Refined Anchors, HeatMaps, Embeddings, Top-left corners, Bottom-right corners, Matching corresponding corners.]

Figure 6: Evolution of single stage detectors over the years. Major advancements in chronological order are
YOLO [308], SSD [221], DSSD [92], RetinaNet [216], RefineDet [460] and CornerNet [189]. Cuboid boxes,
solid rectangular box, dashed rectangular boxes and arrows represent convolutional layer, fully connected
layer, predictions and flow of features respectively. obj, cls and reg stand for objectness, classification and
regression losses. Best viewed in color.

for replacing the undying YOLO and SSD variants:

• RetinaNet [216] borrowed the FPN structure but in a single-stage setting. It is similar in spirit to SSD but it deserves its own paragraph given its growing popularity based on its speed and performance. The main new advance of this pipeline is the focal loss, which we will discuss in Section 2.2.1.

• RefineDet [460] tried to combine the advantages of double-stage methods and single-stage methods by incorporating two new modules in the classic single-stage architecture. The first one, the ARM (Anchor Refinement Module), is used in the fashion of multiple-stage detectors to reduce the search space and also to iteratively refine the localization of the detections. The ODM (Object Detection Module) takes the output of the ARM to output fine-grained classification and further improve the localization.

• CornerNet [189] offered a new approach for object detection by predicting bounding boxes as paired top-left and bottom-right keypoints. They also demonstrated that one can get rid of the prominent anchors step while gaining accuracy and precision. They used fully convolutional networks to produce independent score heat maps for both corners for each class, in addition to learning an embedding for each corner. The embedding similarities were then used to group them into multiple bounding boxes. It beat its two (less original) competing rivals on COCO.

However, most methods used in competitions until now are predominantly double-stage methods because their structure is better suited for fine-grained classification. This is what we are going to see in the next Section.

2.1.3 Double Stage Detectors

The process of detecting objects can be split into two parts: proposing regions, and classifying and regressing bounding boxes. The purpose of the proposal generator is to present the classifier with class-agnostic rectangular boxes which try to locate the ground-truth instances. The classifier then tries to assign a class to each of the proposals and further fine-tune the coordinates of the boxes.

Region proposal: Hosang et al. [131] presented an in-depth review of ten "non-data driven" object proposal methods including Objectness [2, 3], CPMC [30, 31], Endres and Hoiem [81, 82], Selective Search [391, 393], Rahtu et al. [293], Randomized Prim [229], Bing [56], MCG [286], Rantalankila et al. [298], Humayun et al. [145] and EdgeBoxes [491] and evaluated their effect on the detector's performance. Also, Xiao et al. [425] developed a novel distance metric for grouping two super-pixels in high-complexity scenarios. Out of all these approaches Selective Search and EdgeBoxes gave the best recall and speed. The former is an order of magnitude slower than Fast R-CNN while the latter, which is not as efficient, took as much time as a detector. The bottleneck lay in the region proposal part of the pipeline.

Deep learning based approaches [86, 363] had also been used to propose regions, but they were not end-to-end trainable for detection and required input images to be of fixed size. In order to address strong localization bias, [46] proposed a box-refinement method based on the super-pixel tightness distribution. DeepMask [282] and SharpMask [284] proposed segmentation based object proposals with very deep networks. [162] estimated the objectness of image patches by comparing them with exemplar regions from prior data and finding the ones that are most similar to them.

The next obvious question became apparent. How can deep learning methods be streamlined into existing approaches to give an elegant, simple, end-to-end trainable and fully convolutional model? In the discussion that follows we will discuss two widely adopted approaches in two-stage detectors, the pros and cons of using such approaches and further improvements made on them.

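The recall used above to compare proposal methods can be illustrated as the fraction of ground-truth boxes that are covered by at least one proposal. The following is a minimal sketch, not the evaluation protocol of [131]: the `(x1, y1, x2, y2)` box format, the function names `iou` and `proposal_recall`, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(gt_boxes, proposals, threshold=0.5):
    """Fraction of ground-truth boxes matched by some proposal with IoU >= threshold.

    The threshold value is an assumption; stricter values reward tighter proposals.
    """
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes
        if any(iou(g, p) >= threshold for p in proposals)
    )
    return hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    props = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(proposal_recall(gt, props))
```

A proposal generator is useful to the second stage only if this recall stays high with a small number of boxes, since every missed ground-truth instance can no longer be recovered by the classifier.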
R-CNN and Fast R-CNN: The first modern ubiquitous double-staged deep learning detection
method is certainly R-CNN [105]. Although it has now been abandoned due to faster
alternatives, it is worth mentioning to better understand the next paragraphs. Closer to the
traditional non deep-learning methods, the first stage of the method is the detection of
objects in pictures to reduce the number of false positives of the subsequent stage. It is
done using a hierarchical pixel grouping algorithm widely popular in the 2000s called
selective search [393]. Once the search space has been properly narrowed, all regions above
a certain score are warped to a fixed size so that a classifier can be applied on top of
them. Further fine-tuning on the last layers of the classifier is necessary on the classes
of the dataset used (they replace the last layer so that it has the right number of classes)
and an SVM is used on top of the fixed fine-tuned features to further refine the
localization of the detection. This method was the precursor of all the modern deep learning
double-staged methods, in spite of the fact that this first iteration of the method was far
from the elegant paradigm used nowadays.

Fast R-CNN, by Girshick [104], the same author, is built on top of this previous work. The
author started to refine R-CNN by being one of the first researchers, with He et al. [116],
to come up with his own deep-learning detection building block. This differentiable
mechanism, called RoI-pooling (Region of Interest Pooling), was used for resizing fixed
regions (also extracted with selective search) coming not from the image directly but from
the features computed on the full image, which kept the spatial layout of the original
image. Not only did that bring a speed-up to the slow R-CNN method (x200 in inference) but
it also came with a net gain in performance (around 6 points in mAP).

Faster-RCNN: The seminal Faster-RCNN paper [309] showed that the same backbone architecture
used in Fast R-CNN for classification can be used to generate proposals as well. They
proposed an efficient, fully convolutional, data-driven approach for proposing regions
called Region Proposal Network (RPN). RPN learned the "objectness" of all instances and
accumulated the proposals to be used by the detector part of the backbone. The detector
further classified and refined bounding boxes around those proposals. RPN and detector can
be trained separately as well as in a combined manner. When sharing convolutional layers
with the detector they result in very little extra cost for region proposals. Since it has
two parts for generating proposals and detection, it comes under the category of two-stage
detectors.

Faster-RCNN used thousands of reference boxes, commonly known as anchors. Anchors formed a
grid of boxes that act as starting points for regressing bounding boxes. These anchors were
then trained end-to-end to regress to the ground truth and an objectness score was
calculated per anchor. The density, size and aspect ratio of anchors are decided according
to the general range of size of object instances expected in the dataset and the receptive
field of the associated neuron in the feature map.

RoI Pooling, introduced in [104], warped the proposals generated by the RPN to fixed size
vectors for feeding to the detection sub-network as its inputs. The quantization and
rounding operation defining the pooling cells introduced misalignments and actually hurt
localization.

R-FCN: To avoid running the costly RoI-wise subnetwork in Faster-RCNN hundreds of times,
i.e. once per proposal, Dai et al. [62] got rid of it and shared the convolutional network
end to end. To achieve this they proposed the idea of position sensitive feature maps. In
this approach each feature map was responsible for outputting a score for a specific part,
like top-left, center, bottom right, etc., of the target class. The parts were identified
with RoI-Pooling cells which were distributed alongside each part-specific feature map.
Final scores were obtained by average voting every part of the RoI from the respective
filter. This implementation trick introduced some more translational variance to structures
that were essentially translation-invariant by construction. Translational variance


Figure 7: Graphical explanation of RoIPooling, RoIWarping and RoIAlign (actual dimensions of pooled
feature map differ). The red box is the predicted output of Region Proposal Network (RPN) and the dashed
blue grid is the feature map from which proposals are extracted. (a) RoI Pooling first aligns the proposal to
the feature map (first quantization) and then max-pools or average-pools the features (second quantization).
Note that some information is lost because the quantized proposal is not an integral multiple of the final
map’s dimensions. (b) RoI Warping retains the first quantization but deals with second one through bilinear
interpolation, calculated from four nearest features, through sampling N points (black dots) for each cell
of final map. (c) RoI Align removed both the quantizations by directly sampling N points on the original
proposal. N is set to four generally. Best viewed in color.
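
The difference between the two quantizations of panel (a) and the direct sampling of panel (c) can be sketched on a single-channel feature map. This is an illustrative pure-Python sketch under stated assumptions, not the implementation of [104] or [118]: the function names `bilinear`, `roi_align` and `roi_pool`, the `(y1, x1, y2, x2)` box format in feature-map coordinates, the 2 × 2 output size and the 2 × 2 sampling pattern (N = 4 points per cell) are all hypothetical choices.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0][x0] + dx * feat[y0][x1]
    bottom = (1 - dx) * feat[y1][x0] + dx * feat[y1][x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feat, box, out_size=2, samples=2):
    """RoI Align sketch: no rounding; average samples*samples points per output cell."""
    y1, x1, y2, x2 = box
    cell_h = (y2 - y1) / out_size
    cell_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = [
                bilinear(feat,
                         y1 + (i + (a + 0.5) / samples) * cell_h,
                         x1 + (j + (b + 0.5) / samples) * cell_w)
                for a in range(samples) for b in range(samples)
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def roi_pool(feat, box, out_size=2):
    """RoI Pooling sketch: round the box, split it into integer bins, max-pool each bin."""
    y1, x1, y2, x2 = (int(round(v)) for v in box)  # first quantization
    out = []
    for i in range(out_size):
        ya = y1 + (i * (y2 - y1)) // out_size      # second quantization
        yb = max(y1 + ((i + 1) * (y2 - y1)) // out_size, ya + 1)
        row = []
        for j in range(out_size):
            xa = x1 + (j * (x2 - x1)) // out_size
            xb = max(x1 + ((j + 1) * (x2 - x1)) // out_size, xa + 1)
            row.append(max(feat[r][c] for r in range(ya, yb) for c in range(xa, xb)))
        out.append(row)
    return out


if __name__ == "__main__":
    # Feature map whose value equals the column index, so sub-cell shifts are visible.
    feat = [[float(c) for c in range(8)] for _ in range(8)]
    box = (1.0, 1.4, 5.0, 5.4)
    print(roi_align(feat, box))
    print(roi_pool(feat, box))
```

On this ramp-shaped map, the pooled output is identical for any horizontal shift of the box that rounds to the same integers, whereas the aligned output moves with the fractional shift, which is the misalignment the caption describes.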

in object detection can be beneficial for learning localization representations. Although
this pipeline seems to be more precise, it is not always better performance-wise than its
Faster R-CNN counterpart.

From an engineering point of view, this method of Position sensitive RoI-Pooling (PS
Pooling) also prevented the loss of information at the RoI Pooling stage in Faster-RCNN. It
improved the overall inference speed of two-stage detectors but performed slightly worse.

Pros and Cons: RPNs are generally configured to generate nearly 300 proposals to get
state-of-the-art performance. Since each proposal passes through a head of convolutional
layers and fully connected layers to classify the objects and fine-tune the bounding boxes,
the overall speed decreases. Although they are slow and not suited to real-time
applications, the ideas based on these approaches give one of the best performances in the
challenging COCO detection task. Another drawback is that Ren et al. [309] and Dai et al.
[62] used coarse feature maps at a single scale only. This is not sufficient when objects of
diverse sizes are present in the dataset.

Further improvements: Many improvements have been suggested on the above methodologies
concerning speed, performance and computational efficiency.

DeepBox [184] proposed a lightweight generic objectness system by capturing semantic
properties. It helped in reducing the burden of localization on the detector as the number
of classes increased. Light-head R-CNN [206] proposed a smaller detection head and thin
feature maps to speed up two-stage detectors. Singh et al. [346] brought R-FCN to 30 fps by
sharing position sensitive feature maps across classes. Using slight architectural changes,
they were also able to bring the number of classes predicted by R-FCN to 3000

[Figure 8 diagram panels: R-CNN (Nov 2013), SPPNet (Jun 2014), Fast-RCNN (Apr 2015), Faster RCNN (Jun 2015), RFCN (May 2016), FPN (Dec 2016), Deformable CNN (Mar 2017), Mask RCNN (Mar 2017).]

Figure 8: Evolution of double stage detectors over the years. Major advancements in chronological order are
R-CNN [105], SPPNet [116], Fast-RCNN [104], Faster RCNN [309], RFCN [62], FPN [215], Mask RCNN
[118], Deformable CNN [63] (only the modification is shown and not the entire network). The main idea is
marked in dashed blue rectangle wherever possible. Other information is same as in Figure 6.

without losing too much speed.

Several improvements have been made to RoI-Pooling. The spatial transformer of [154] used a
differentiable re-sampling grid using bilinear interpolation and can be used in any
detection pipeline. Chen et al. [37] used this for face detection, where faces were warped
to fit canonical poses. Dai et al. [61] proposed another type of pooling called RoI Warping
based on bilinear interpolation. Ma et al. [228] were the first to introduce a rotated
RoI-Pooling working with oriented regions (more on oriented RoI-Pooling can be found in
Section 3.1.2). Mask R-CNN [118] proposed RoI Align to address the problem of misalignment
in RoI Pooling; it used bilinear interpolation to calculate the value of four regularly
sampled locations on each cell. It fixed, to some extent, the alignment between the computed
features and the regions they were extracted from. It brought consistent improvements to all
Faster R-CNN baselines on COCO. A comparison is shown in Figure 7. Recently, Jiang et al.
[158] introduced a Precise RoI Pooling based on interpolating not just 4 spatial locations
but a dense region, which allowed full differentiability with no misalignments.

Li et al. [196] and Yu et al. [442] also used contextual information and aspect ratios while
StuffNet [24] trained for segmenting amorphous categories such as ground and water for the
same purpose. Chen and Gupta [48] made use of memory to take advantage of context in
detecting objects. Li et al. [207] incorporated a Global Context Module (GCM) to utilize
contextual information and Row-Column Max Pooling (RCM Pooling) to better extract scores
from the final feature map as compared to the R-FCN method.

Deformable R-FCN [63] brought flexibility to the fixed geometric transformations at the
Position sensitive RoI-Pooling stage of R-FCN by learning additional offsets for each
spatial sampling location using a different network branch, in addition to other tricks
discussed in Section 2.1.5. Lin et al. [215] proposed to use a network with multiple final
feature maps with different coarseness to adapt to objects of various sizes. Zagoruyko et
al. [446] used skip connections with the same motivation. Mask-RCNN [118], in addition to
RoI-align, added a branch in parallel to the classification and bounding box regression for
optimizing the segmentation loss. Additional training for segmentation led to an improvement
in the performance of the object detection task as well. The advancements of two-stage
detectors over the years are illustrated in Figure 8.

The double-staged methods have now by far attained supremacy among the best performing
object detection DCNNs. However, for certain applications two-stage methods are not enough
to get rid of all the false positives.

2.1.4 Cascades

Traditional one-class object detection pipelines resorted to boosting-like approaches for
improving the performance, where uncorrelated weak classifiers (better than random chance
but not too correlated with the true predictions) are combined to form a strong classifier.
With modern CNNs, as the classifiers are quite strong, the attractiveness of those methods
has plummeted. However, for some specific problems where there are still too many false
positives, researchers still find them useful. Furthermore, if the weak CNNs used are very
shallow, a cascade can also sometimes increase the overall speed of the method.

One of the first ideas that were developed was to cascade multiple CNNs. Li et al. [198] and
Yang and Nevatia [434] both used a three-staged approach by chaining three CNNs for face
detection. The former approach scanned the image using a 12 × 12 patch CNN to reject 90% of
the non-face regions in a coarse manner. The remaining detections were offset by a second
CNN and given as input to a 24 × 24 CNN that continued rejecting false positives and
refining regressions. The final candidates were then passed on to a 48 × 48 classification
network which output the final score. The latter approach created separate score maps for
different resolutions using the same FCN on different scales of the test image (image
pyramid). These score maps were then up-sampled to the same resolution and added to create a
final score map, which

was then used to select proposals. Proposals were then passed to the second stage where two
different verification CNNs, trained on hard examples, eradicated the remaining false
positives: the first one a four-layer FCN trained from scratch and the second one an AlexNet
[181] pre-trained on ImageNet.

All the approaches mentioned in the last paragraph are ad hoc: the CNNs are independent of
each other and there is no overall design. Therefore, they could benefit from integrating
the elegant zooming module that is RoI-Pooling. RoI-Pooling can act like a glue to pass the
detections from one network to the other, while doing the down-sampling operation locally.
Dai et al. [61] used a Mask R-CNN like structure that first proposed bounding boxes, then
predicted a mask and used a third stage to perform fine-grained discrimination on masked
regions that are RoI-Pooled a second time.

Ouyang et al. [268] and Wang et al. [404] optimized in an end-to-end manner a Faster R-CNN
with multiple stages of RoI-Pooling. Each stage accepted only the highest scored proposals
from the previous stage and added more context and/or localized the detection better. Then
additional information about context was used to do fine-grained discrimination between hard
negatives and true positives in [268], for example. On the contrary, Zhang et al. [455]
showed that, for pedestrian detection, RoI-Pooling on too coarse a feature map actually
hurts the result. This problem has been alleviated by the use of feature pyramid networks
with higher resolution feature maps. Therefore, they used the RPN proposals of a Faster
R-CNN in a boosting pipeline involving a forest (Tang et al. [373] acted similarly for small
vehicle detection).

Yang et al. [431], aware of the problem raised by Zhang et al. [455], used RoI-Pooling on
multiple scaled feature maps of all the layers of the network. The classification function
on each layer was learned using the weak classifiers of AdaBoost and then approximated using
a fully connected neural network. While all the mentioned pipelines are hard cascades where
the different classifiers are independent, it is sometimes possible to use a soft cascade
where the final score is a linear weighted combination of the scores given by the different
weak classifiers, like in Angelova et al. [6]. They used 200 stages (instead of 2000 stages
in their baseline with AdaBoost [16]) to keep recall high enough while improving precision.
To save computations that would be otherwise unmanageable, they terminated the computation
of the weighted sum whenever the score for a certain number of classifiers fell under a
specified threshold (there are, therefore, as many thresholds to learn as there are
classifiers). These thresholds are then really important because they control the trade-off
between speed, recall and precision.

All the previous works in this Section involved a small fixed number of localization
refinement steps, which might cause proposals to be not perfectly aligned with the ground
truth, which in turn might impact the accuracy. That is why many works proposed iterative
bounding box regression (a while loop on localization refinement until a condition is
reached). Najibi et al. [254] and Rajaram et al. [295] started with a regularly spaced grid
of sparse pyramid boxes (only 200 non-overlapping in Najibi et al. [254], whereas Rajaram et
al. [295] used all Faster R-CNN anchors on the grid) that were iteratively pushed towards
the ground truth according to the feature representation obtained from RoI-Pooling the
current region. An interesting finding was that, even if the goal was to use as many
refinement steps as necessary, if the seed boxes or anchors span the space appropriately,
regressing the boxes only twice can in fact be sufficient [254]. Approaches proposed by
Gidaris and Komodakis [101] and Li et al. [199] can also be viewed, internally, as iterative
regression based methods proposing regions for detectors, such as Fast R-CNN.

Recently, Cai and Vasconcelos [27] noticed that when increasing the IoU threshold for a
window to be considered positive in the training (to get better quality hypotheses for the
next stages), one loses a lot of positive windows. Thus one has to keep using the low 0.5
threshold to prevent over-fitting and thus one gets bad quality hypotheses in the next
stages. This is true for all the works mentioned in this section that are based on Faster
R-CNN (e.g. [268, 404]). To combat this effect, they slowly increase the IoU threshold over
the stages to get different sets of detectors, using the latest stage proposals as input
distribution for the next one. With only 3 to 4 stages they consistently improve the quality
of a wide range of detectors with an average of 3 points gained w.r.t. the non-cascaded
version. This algorithmic advance is used in most of the winning entries of the 2018 COCO
challenge (used at least by the first three teams).

Orthogonal to this approach, Jiang et al. [158] frame the regression of the multi-stage
cascade as an optimization problem, thus introducing a proxy for a smooth measure of
confidence of the bounding box localization. This article, among others, will be discussed
in more detail in Section 2.2.1.

Boosting and multistage (> 2) methods, as we have seen previously, exhibit very different
possible combinations of DCNNs. But we thought it would be interesting to still have a
Section for a special kind of method that was hinted at in the previous Sections, namely the
part-based models, if not for their performance at least for their historical importance.

2.1.5 Parts-Based Models

Before the reign of CNN methods, the algorithms based on the Deformable Parts-based Model
(DPM) and HoG features used to win all the object detection competitions. In this algorithm
latent (not supervised) object parts were discovered for each class and optimized by
minimizing the deformations of the full objects (connections were modeled by spring forces).
The whole thing was built on a HoG image pyramid.

When Region based DCNNs started to beat the former champion, researchers began to wonder if
it was only a matter of using better features. If this was the case then the region based
approach would not necessarily be a more powerful algorithm. The DPM was flexible enough to
integrate the newer, more discriminative CNN features. Therefore, some research works
focused on this direction.

In 2014, Savalle and Tsogkas [325] tried to get the best of both worlds: they replaced the
HoG feature pyramids used in the DPM with the CNN layers. Surprisingly, the performance they
obtained, even if far superior to the DPM+HoG baseline, was considerably worse than the
R-CNN method. The authors suspected the reason for it was the fixed size aspect ratios used
in the DPM together with the training strategy. Girshick et al. [106] put more thought into
how to mix CNN and DPM by coming up with the distance transform pooling, thus bringing the
new DPM (DeepPyramidDPM) to the level of R-CNN (even slightly better). Ranjan et al. [297]
built on it and introduced a normalization layer that forced each scale-specific feature map
to have the same activation intensities. They also implemented a new procedure of sampling
optimal targets by using the closest root filter in the pyramid in terms of dimensions. This
allowed them to further mimic the HOG-DPM strengths. Simultaneously, Wan et al. [401] also
improved the DeepPyramidDPM but fell short compared to the newest, fine-tuned version of
R-CNN (R-CNN FT). Therefore, in 2015 it seemed that the DPM based approaches had hit a dead
end and that the community should focus on R-CNN type methods.

However, the flexibility of the RoI-Pooling of Fast R-CNN was going to help make the two
approaches come together. Ouyang et al. [267] combined Fast R-CNN, to get rid of most
backgrounds, and a DeepID-Net, which introduced a max-pooling penalized by the deformation
of the parts, called def-pooling. The combination improved over the state-of-the-art. As we
mentioned in Section 2.1.3, Dai et al. [63] built on R-FCN and added deformations in the
Position Sensitive RoI-Pooling: an offset is learned from the classical Position Sensitive
pooled tensor with a fully connected network for each cell of the RoI-Pooling, thus creating
"parts"-like features. This trick of moving RoI cells around is also present in [247],
although slightly different because it is closer to the original DPM. Dai et al. [63] even
added offsets to convolutional filter cells on Conv-5, which became doable thanks to
bilinear interpolation. It, thus, became a truly deformable fully convolutional

network. However, Mordan et al. [247] got better performance on VOC without it. Several
works used deformable R-FCN, like [429] for aerial imagery, which used a different training
strategy. However, even if it is still present in famous competitions like COCO, it is less
used than its counterparts with fixed RoI-Pooling. It might come back though, thanks to
recent best performing models like [345] that used [63] as their baseline and selectively
back-propagated gradients according to the object size.

2.2 Model Training

The next important aspect of the detection model's design is the losses being used to
converge the huge number of weights and the hyper-parameters that must be conducive to this
convergence. Optimizing for a wrongly crafted loss may actually lead the model to diverge
instead. Choosing incorrect hyper-parameters, on the one hand, can stagnate the model, trap
it in a local optimum or, on the other hand, over-fit the training data (causing poor
generalization). Since DCNNs are mostly trained with mini-batch SGD (see for instance
[190]), we focus the following discussion on losses and on the optimization tricks necessary
to attain convergence. We also review the contribution of pre-training on some other dataset
and data augmentation techniques, which bring about an excellent initialization point and
good generalization respectively.

2.2.1 Losses

Multi-variate cross entropy loss, or log loss, is generally used throughout the literature
to classify images or regions in the context of detectors. However, detecting objects in
large images comes with its own set of specific challenges: regressing bounding boxes to get
precise localization, which is a hard problem that is not present at all in classification,
and an imbalance between target object regions and background regions.

A binary cross entropy loss is formulated as shown in Eq. 1. It is used for learning the
combined objectness. All instances, y, are marked as positive labels with a value one. This
equation constrains the network to output the predicted confidence score, p, to be 1 if it
thinks there is an object and 0 otherwise.

\[
CE(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
\tag{1}
\]

A multi-variate version of the log loss is used for classification (Eq. 2). $p_{o,c}$
predicts the probability of observation $o$ being class $c$ where $c \in \{1, .., C\}$.
$y_{o,c}$ is 1 if observation $o$ belongs to class $c$ and 0 otherwise. $c = 0$ accounts for
the special case of the background class.

\[
MCE(p, y) = -\sum_{c=0}^{C} y_{o,c} \log(p_{o,c})
\tag{2}
\]

Fast-RCNN [104] used a multitask loss (Eq. 3) which is the de-facto equation used for
classifying as well as regressing. The losses are summed over all the region proposals or
default reference boxes, $i$. The ground-truth label, $p_i^*$, is 1 if the proposal box is
positive, otherwise 0. Regression is learned only for positive proposal boxes, since the
regression term is multiplied by $p_i^*$.

\[
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
+ \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)
\tag{3}
\]

where $t_i$ is a vector representing the 4 coordinates of the predicted bounding box and
similarly $t_i^*$ represents the 4 coordinates of the ground truth. Eq. 4 presents the
equation for the exact parameterized coordinates. $\{x_a, y_a, w_a, h_a\}$ are the center x
and y coordinates, width and height of the default anchor box respectively. Similarly
$\{x^*, y^*, w^*, h^*\}$ are the ground truths and $\{x, y, w, h\}$ are the coordinates to
be predicted. The two terms are normalized by mini-batch size, $N_{cls}$, and number of
proposals/default reference boxes, $N_{reg}$, and weighted by

a balancing parameter λ.

\[
\begin{aligned}
t_x &= \frac{x - x_a}{w_a}, & t_y &= \frac{y - y_a}{h_a}, \\
t_w &= \log\frac{w}{w_a}, & t_h &= \log\frac{h}{h_a}, \\
t_x^* &= \frac{x^* - x_a}{w_a}, & t_y^* &= \frac{y^* - y_a}{h_a}, \\
t_w^* &= \log\frac{w^*}{w_a}, & t_h^* &= \log\frac{h^*}{h_a}
\end{aligned}
\tag{4}
\]

$L_{reg}$ is a smooth L1 loss defined by Eq. 5. In its place some papers also use L2 losses.

\[
l_{reg}(t, t^*) =
\begin{cases}
0.5\,(t - t^*)^2 & \text{if } |t - t^*| < 1 \\
|t - t^*| - 0.5 & \text{otherwise}
\end{cases}
\tag{5}
\]

Losses for regressing bounding boxes: Since accurate localization is a major issue, papers
have suggested more sophisticated localization losses. [103] came up with a binary logistic
type regression loss. After dividing the image patch into M columns and M rows, they
computed the probability of each row and column being inside or outside the predicted
observation box (in-out loss) (Eq. 6).

\[
\begin{aligned}
L_{in\text{-}out} &= \sum_{a \in \{x, y\}} \sum_{i=1}^{M}
T_{a,i} \log(p_{a,i}) + (1 - T_{a,i}) \log(1 - p_{a,i}) \\
\forall i \in \{1, ..., M\}, \quad T_{x,i} &=
\begin{cases}
1, & \text{if } B_l^{gt} \le i \le B_r^{gt} \\
0, & \text{otherwise}
\end{cases} \\
\forall i \in \{1, ..., M\}, \quad T_{y,i} &=
\begin{cases}
1, & \text{if } B_t^{gt} \le i \le B_b^{gt} \\
0, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{6}
\]

where $\{B_l^{gt}, B_r^{gt}, B_t^{gt}, B_b^{gt}\}$ are the left, right, top and bottom edges
of the bounding box respectively. $T_x$ and $T_y$ are the binary positive or negative values
for columns and rows respectively. $p$ is the probability associated with each of them.

In addition, they also compute the confidence for each column and row being the exact
boundary of the predicted observation or not (Eq. 7).

\[
\begin{aligned}
L_{border} &= \sum_{s \in \{l, r, t, b\}} \sum_{i=1}^{M}
\lambda^{+} T_{s,i} \log(p_{s,i}) + \lambda^{-} (1 - T_{s,i}) \log(1 - p_{s,i}) \\
\forall i \in \{1, ..., M\}, \quad T_{s,i} &=
\begin{cases}
1, & \text{if } i = B_s^{gt} \\
0, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{7}
\]

where $\lambda^{-} = 0.5 \frac{M}{M-1}$ and $\lambda^{+} = (M - 1)\lambda^{-}$. The notations
can be inferred from Eq. 6. A second paper on the same topic [101] applied the regression
losses iteratively at the region proposal stage in a class agnostic manner. They used final
convolutional features and predictions from the last iteration to further refine the
proposals.

It was also found to be beneficial to optimize the loss directly over Intersection over
Union (IoU), which is the standard practice to evaluate a bounding box or segmentation
algorithm. Yu et al. [441] presented Eq. 8 for the regression loss.

\[
L_{unit\text{-}box} = -\ln(IoU(gt, pred))
\tag{8}
\]

The terms are self-explanatory. Jiang et al. [158] also learned to predict the IoU between
the predicted box and the ground truth. They made a case for using localization confidence
instead of classification confidence to suppress boxes at the NMS stage. It gave higher
recall on the MS COCO dataset. This loss is however very unstable and has a number of
regions where the IoU has zero gradient and thus it is undefined. Tychsen-Smith and
Petersson [390] adapt this loss to make it more stable by adding hard bounds, which prevent
the function from diverging. They also factorize the score function by adding a fitness term
representing the IoU of the box w.r.t. the ground truth.

Losses for class imbalance: Since in recent detectors there are a lot of anchors which most
of the time cover background, there is a class imbalance between positive and negative
anchors. An alternative is Online Hard Example Mining (OHEM). Shrivastava et al. [337]
proposed to select only

the worst performing examples (so-called hard examples) for calculating gradients. Fixing
the ratio between positive and negative instances, generally 1:3, can also partly solve this
imbalance. Lin et al. [216] proposed a tweak to the cross entropy loss, called focal loss,
which took into account all the anchors but penalized easy examples less and hard examples
more. Focal loss (Eq. 9) was found to increase the performance by 3.2 mAP points on MS COCO,
in comparison to OHEM, on a ResNet-50-FPN backbone and 600 pixel image scale.

\[
FL(p, y) =
\begin{cases}
-\alpha_t (1 - p)^{\gamma} \log(p) & \text{if } y = 1 \\
-\alpha_t\, p^{\gamma} \log(1 - p) & \text{otherwise}
\end{cases}
\tag{9}
\]

One can also adopt simpler strategies like rebalancing the cross-entropy by putting more
weight on the minority class [259].

Supplementary losses: In addition to classification and regression losses, some papers also
optimized extra losses in parallel. Dai et al. [61] proposed a three-stage cascade for
differentiating instances, estimating masks and categorizing objects. Because of this they
achieved competitive performance on the object detection task too. They further experimented
with a five-stage cascade. UberNet [173] trained on as many as six other tasks in parallel
with object detection. He et al. [118] have shown that using an additional segmentation loss
by adding an extra branch to the Faster R-CNN detection sub-network can also improve
detection performance. Li et al. [203] introduced position-sensitive inside/outside score
maps to train for detection and segmentation simultaneously. Wang et al. [409] proposed an
additional repulsion loss between predicted bounding boxes in order to have one final
prediction per ground truth. Generally, it can be observed that supplementary tasks,
instance segmentation in particular, aid the object detection task.

2.2.2 Hyper-Parameters

The detection problem is a highly non-convex problem in hundreds of thousands of dimensions.
Even for classification, where the structure is much simpler, no general strategy has
emerged yet on how to use mini-batch gradient descent correctly. Different popular versions
of mini-batch Stochastic Gradient Descent (SGD) [318] have been proposed based on a
combination of momentum, to accelerate convergence, and the history of the past gradients,
to dampen the oscillations when reaching a minimum: AdaDelta [447], RMSProp [378] and the
unavoidable ADAM [171, 304] are only the most well-known. However, in the object detection
literature, authors use either plain SGD or ADAM, without putting too much thought into it.
The most important hyper-parameters remain the learning rate and the batch size.

Learning rate: There is no concrete way to decide the learning rate policy over the period
of the training. It depends on a myriad of factors like optimizer, number of training
examples, model, batch size, etc. We cannot quantify the effect of each factor; therefore,
the current way to determine the policy is by trial and error. What works for one setting
may or may not work for other settings. If the policy is incorrect then the model might fail
to converge at all. Nevertheless, some papers have studied it and have established general
guidelines that have been found to work better than others. Training with a large learning
rate might never converge while a small learning rate gives sub-optimal results. Since the
change in weights is dramatic in the initial stage of training, Goyal et al. [111] have
proposed a Linear Gradual Warmup strategy in which the learning rate is increased every
iteration during this period. Then, starting from a point (e.g. $10^{-3}$), the policy was
to decrease the learning rate over many epochs. Krizhevsky [180] and Goyal et al. [111] also
used a Linear Scaling Rule which linearly scaled the learning rate according to the
mini-batch size.

Batch size: The object detection literature doesn't generally focus on the effects of using
a bigger or smaller batch size during training. Training modern detectors requires working
on full images and therefore on large tensors which can be trou-

26
blesome to store on the GPU RAM. It has forced the community to use small batches, of 1 to 16 images, for training (16 in RetinaNet [216] and Mask R-CNN [118] with the latest GPUs).

One obvious advantage of increasing the batch size is that it reduces the training time, but since the memory constraint restricts the number of images, more GPUs have to be employed. However, using extra-large batches has been shown to potentially lead to big improvements in performance or speed. For instance, batch normalization [152] needs many images to provide meaningful statistics. Batch size effects were originally studied by Goyal et al. [111] on the ImageNet dataset. They were able to show that by increasing the batch size from 256 to 8192, training time can be reduced from 29 hours to just 1 hour while maintaining the same accuracy. Further, You et al. [437] and Akiba et al. [1] brought down the training time to below 15 minutes by increasing the batch size to 32k.

Very recently, MegDet [274], inspired by [111], has shown that averaging gradients on many GPUs to get an equivalent batch size of 256 and adjusting the learning rates could lead to some performance gains. It is hard to say now which strategy will eventually win in the long term but they have shown that it is worth exploring.

2.2.3 Pre-Training

Transfer learning was first shown to be useful in a supervised learning approach by Girshick et al. [105]. The idea is to fine-tune from a model already trained on a dataset that is similar to the target dataset. This is usually a better starting point for training than randomly initialized weights; for example, a model pre-trained on ImageNet can be used for training on MS COCO. And since the COCO dataset’s classes are a superset of PASCAL VOC’s classes, most of the state-of-the-art approaches pre-train on COCO before training on PASCAL VOC. If the dataset at hand is completely unrelated to the dataset used for pre-training, pre-training might not be useful, for example when a model pre-trained on ImageNet is used for detecting cars in aerial images.

Singh and Davis [345] made a compelling case for minimizing the difference in scales of object instances between the classification dataset used for pre-training and the detection dataset, in order to minimize domain shift while fine-tuning. They asked: should we pre-train CNNs on a low-resolution classification dataset, or restrict the scale of object instances to a range by training on an image pyramid? By minimizing scale variance they were able to get better results than the methods that employed a scale-invariant detector. The problem with the second approach is that some instances are so small that, in order to bring them into the scale range, the images have to be upscaled so much that they might not fit in memory, or they will not be used for training at all. Using a pyramid of images and using each for inference is also slower than methods that use the input image exactly once.

Section 3.1.3 covers pre-training and related aspects such as fine-tuning in greater detail. To the best of our knowledge, only two articles tried to match the performance of ImageNet pre-training by training detectors from scratch: the first, [331], used deep supervision (dense access to gradients for all layers), and very recently [332] adaptively recalibrated supervision intensities based on input object sizes.

2.2.4 Data Augmentation

The aim of augmenting the train set images is to create diversity, avoid overfitting, increase the amount of data, improve generalizability and overcome different kinds of variances. This can easily be achieved without any extra annotation effort by manually designing many augmentation strategies. The general practices that are followed include, but are not limited to, scale, resize, translation, rotation, vertical and horizontal flipping, elastic distortions, random cropping and contrast, color, hue, brightness, saturation and sharpness adjustments. Two recent and promising but not widely adopted techniques are Cutout [70] and Sample Pairing [149]. Taylor and Nitschke [376] benchmarked various popular data augmentation schemes to find the most appropriate ones, and found that cropping was the most influential in their case.
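For detection, unlike classification, a geometric augmentation has to be applied to the bounding boxes as well as to the pixels. The sketch below illustrates this for a horizontal flip. The function name `hflip_with_boxes` and the `[x1, y1, x2, y2]` pixel layout of the boxes are assumptions made for this illustration, not conventions taken from the works cited above.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Flip an H x W (x C) image left-right and mirror its boxes.

    boxes: array of shape (N, 4) in [x1, y1, x2, y2] pixel coordinates
    (an assumed layout), with x1 <= x2.
    """
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=float)
    out = boxes.copy()
    # Mirror the x-coordinates; x1 and x2 swap roles so x1 <= x2 still holds.
    out[:, 0] = width - boxes[:, 2]
    out[:, 2] = width - boxes[:, 0]
    return flipped, out

# Toy usage: a 2 x 10 "image" and one box spanning x in [1, 4].
image = np.arange(2 * 10).reshape(2, 10)
boxes = np.array([[1.0, 0.0, 4.0, 2.0]])
flipped, new_boxes = hflip_with_boxes(image, boxes)
```

The y-coordinates are untouched by a horizontal flip; a vertical flip would mirror `y1` and `y2` against the image height in the same way.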
Although there are many techniques available and each one of them is easy to implement, it is difficult to know in advance, without expert knowledge, which techniques help performance on a target dataset. For example, vertical flipping is not helpful for a traffic signs dataset because one is not likely to encounter inverted signs in the test set. It is not trivial to select the approaches for each target dataset and test all of them before deploying a model. Therefore, Cubuk et al. [60] proposed a search algorithm based on reinforcement learning to find the best augmentation policy. Their approach tried to find the most suitable augmentation operations along with their magnitude and probability of being applied. Smart Augmentation [194] worked by creating a network that learned how to automatically generate augmented data during the training process of a target network in a way that reduced the loss. Tran et al. [382] proposed a Bayesian approach, where new annotated training points are treated as missing variables and generated based on the distribution learned from the training set. Devries and Taylor [69] applied simple transformations such as adding noise, interpolating, or extrapolating between data points. They performed the transformation not in input space but in a learned feature space. All the above approaches have been implemented for classification only, but they might be beneficial for the detection task as well and it would be interesting to test them.
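The notion of a policy made of operations, each with a probability and a magnitude, can be sketched as follows. This only illustrates how such a policy might be applied once it has been found; it is not the search procedure of Cubuk et al. [60]. The operation names, the numeric values and the `apply_policy` helper are hypothetical, and images are assumed to be float arrays in [0, 1].

```python
import numpy as np

def adjust_brightness(image, magnitude):
    # Scale pixel intensities; images are assumed to be floats in [0, 1].
    return np.clip(image * (1.0 + magnitude), 0.0, 1.0)

def add_noise(image, magnitude, rng):
    # Additive Gaussian noise with standard deviation `magnitude`.
    return np.clip(image + rng.normal(0.0, magnitude, image.shape), 0.0, 1.0)

def apply_policy(image, policy, rng):
    """Apply each (operation, probability, magnitude) triple in turn."""
    for name, prob, magnitude in policy:
        if rng.random() >= prob:
            continue  # this operation is skipped for this sample
        if name == "brightness":
            image = adjust_brightness(image, magnitude)
        elif name == "noise":
            image = add_noise(image, magnitude, rng)
        else:
            raise ValueError(f"unknown operation: {name}")
    return image

rng = np.random.default_rng(0)
policy = [("brightness", 0.8, 0.2), ("noise", 0.5, 0.05)]  # hypothetical values
image = np.full((4, 4), 0.5)
augmented = apply_policy(image, policy, rng)
```

A learned policy would replace the hand-written `policy` list; the application loop itself stays the same.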
Generative adversarial networks (GANs) have also been used to generate the augmented data directly for classification without searching for the best policies explicitly [7, 251, 280, 348]. Ratner et al. [300] used GANs to describe data augmentation strategies. GAN approaches may not be as effective for detection scenarios yet because generating an image with many object instances placed in a relevant background is much more difficult than generating an image with just one dominant object. This is also an interesting problem which might be addressed in the near future and is explored in Section 3.2.2.

Figure 9: Different kinds of data augmentation techniques used to improve the generalization of the network. The modification applied to each image is named below it. Best viewed in color. [Panels: Original, Resize, Scale, Vertical Flip, Horizontal Flip, Shear, Rotation, Elastic Distortions, Lighting, Greyscale, Noise, Sample Pairing, Cut-Outs.]
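Two of the less common panels of Figure 9, Cut-Outs and Sample Pairing, are simple enough to sketch in a few lines. The code below is a minimal illustration under our own assumptions (a zero-filled square patch for Cutout, a plain pixel-wise average for Sample Pairing); the function names and the patch size are hypothetical and the details may differ from [70] and [149].

```python
import numpy as np

def cutout(image, size, rng):
    """Zero out a square patch of side about `size` centered at a random pixel."""
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    # The patch may be clipped by the image border.
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = 0
    return out

def sample_pairing(image_a, image_b):
    """Average two images pixel-wise; the label of image_a is assumed to be kept."""
    return 0.5 * (image_a + image_b)

rng = np.random.default_rng(0)
image = np.ones((8, 8))
masked = cutout(image, size=4, rng=rng)
mixed = sample_pairing(np.zeros((2, 2)), np.ones((2, 2)))
```

Neither operation moves objects, so bounding box annotations can be left unchanged, although a patch that hides most of an object may make its box hard to learn from.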
[Figure 10 panels, left to right: Predictions of the detector; (a) After NMS; (b) After Soft-NMS. Each box is labeled with its confidence score.]

Figure 10: An illustration of the inference stage. In this example, bounding boxes around horses (blue) and
persons (pink) are obtained from the detector (along with the confidence scores mentioned on top of each
box). (a) NMS chooses the most confident box and suppresses all other boxes having an IoU greater than a
threshold. Note, it sometimes leads to suppression of boxes around other occluded objects. (b) Soft-NMS
deals with this situation by reducing the confidence scores of boxes instead of completely suppressing them.
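
The two schemes contrasted in Figure 10 can be sketched as follows. This is a minimal illustration, not a reference implementation: boxes are assumed to be `[x1, y1, x2, y2]` arrays, the function names are ours, and the Gaussian decay with its `sigma` and `score_threshold` defaults is one possible way of lowering scores that we assume here for Soft-NMS.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # hard suppression
    return keep

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (kept indices, rescored confidences) using a Gaussian decay."""
    scores = scores.astype(float).copy()
    remaining = np.arange(len(scores))
    keep = []
    while remaining.size > 0:
        top = np.argmax(scores[remaining])
        best = remaining[top]
        keep.append(int(best))
        remaining = np.delete(remaining, top)
        if remaining.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[remaining])
        # Lower the scores of overlapping boxes instead of discarding them.
        scores[remaining] *= np.exp(-(overlaps ** 2) / sigma)
        remaining = remaining[scores[remaining] > score_threshold]
    return keep, scores[keep]

# Two heavily overlapping boxes and one isolated box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept_hard = greedy_nms(boxes, scores)
kept_soft, soft_scores = soft_nms(boxes, scores)
```

With these toy inputs the second box (IoU of about 0.68 with the first) is dropped by the hard threshold, whereas the soft variant keeps it with a reduced score, so it can still be ranked below the isolated third box.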

2.3 Inference

The behavior of modern detectors is to pick up pixels of target objects, propose as many windows as possible surrounding those pixels and estimate confidence scores for each of the windows. They do not aim to suggest exactly one box per object. Since all the reference boxes act independently during test time and similar input pixels are picked up by many neighboring anchors, each positive prediction in the prediction set highly overlaps with other boxes. If the best ones out of these are not selected, it will lead to many double detections and thus false positives. The ideal result would be to predict exactly one box per ground-truth object that has high overlap with it. To get close to this ideal state, some sort of post-processing needs to be done.

Greedy Non-maximum suppression (NMS) [64] is the most frequent technique used in inference modules to suppress double detections through hard thresholding. In this approach, the prediction box with the highest confidence is chosen and all the boxes having an Intersection over Union (IoU) with it higher than a threshold, $N_t$, are suppressed or rescored to zero. This step is repeated iteratively until all the boxes are covered. Because of its nature there is no single threshold that works best for all datasets. Datasets with just one object per image will trivially apply NMS by choosing only the highest-ranking box. Generally, datasets with sparse and fewer objects per image (2 to 3 objects) require a lower threshold, while datasets with cramped and higher numbers of objects per image (7 and above) give better results with a higher threshold. The problems with this naive and hand-crafted approach are that it may completely suppress nearby or occluded true positive detections, that it chooses the top-scoring box, which might not be the best localized one, and that it is unable to suppress false positives with insufficient overlap.

To improve upon it, many approaches have been proposed but most of them work for a very special case such as pedestrians in highly occluded scenarios. We discuss the various directions they take and the approaches that work better than greedy NMS in the general scenario. Most of the following discussion is based on [132] and [21], who, in their papers, provided us with an in-depth view of the alternate strategies being used.

Many clustering approaches for predicted boxes have been proposed. A few of them are mean shift clustering [64, 413], agglomerative clustering [22], affinity propagation clustering [250], heuristic variants [327], etc. Rothe et al. [315] presented a learn-
ing based method which “passes messages between windows” or clustered the final detections to finally select exemplars from each cluster. Mrowca et al. [250] deployed a multi-class version of this method. Clustering formulations with globally optimal solutions have been proposed in [371]. All of them worked for special cases but are generally less consistent than greedy NMS.

Some papers learn NMS in a convolutional network. Henderson and Ferrari [121] and Wan et al. [401] tried to incorporate the NMS procedure at training time. Stewart et al. [357] generated a sparse set of detections by training an LSTM. Hosang et al. [130] trained the network to find different optimal cutoff thresholds ($N_t$) locally. Hosang et al. [132] went one step further and got rid of the NMS step completely by taking into account double detections in the loss and jointly processing neighboring detections. The former inclined the network to predict one detection per object and the latter provided the information whether an object was detected multiple times. Their approach worked better than greedy NMS and they obtained a performance gain of 0.8 mAP on the COCO dataset.

Most recently, greedy NMS was improved upon by Bodla et al. [21]. Instead of setting the score of neighboring detections to zero, they decreased the detection confidence by a penalty that increases with the overlap. It improved the performance by 1.1 mAP on the COCO dataset. No extra training was required and, since it is hand-crafted, it can be easily integrated in an object detection pipeline. It is used in current top entries for the MS COCO object detection challenge.

Jiang et al. [158] performed NMS based on separately predicted localization confidence instead of the usually accepted classification confidence. Other papers rescored detections locally [38, 386] or globally [397]. Some others detected objects in pairs in order to handle occlusions [266, 321, 370]. Rodriguez et al. [312] made use of the crowd density. Quadratic unconstrained binary optimization (QUBO) [317] used detection scores as a unary potential and overlap between detections as a pairwise potential to obtain the optimal subset of detection boxes. Niepert et al. [257] saw overlapping windows as edges in a graph.

As a bonus, in the end, we also shed some light on the inference “tricks” that are generally known to the experts participating in the competitions. The tricks that are used to further improve the evaluation metrics are: doing multi-scale inference on an image pyramid (see Section 3.1.1 for training); doing inference on the original image and on its horizontal flip (or on different rotated versions of the image if the application domain does not have a fixed direction) and aggregating results with NMS; doing bounding box voting as in [102] using the score of each box as its weight; using heavy backbones, as observed in the backbone section; finally, averaging the predictions of different models in ensembles. For the last trick it is often better not to use the top-N best performing models but to prefer uncorrelated models so that they can correct each other’s weaknesses. Ensembles of models often outperform single models by a large margin and one can average as many as a dozen models to outrank competitors. Furthermore, with DCNNs one generally does not need to put too much thought into normalizing the models as each one gives bounded probabilities (because of the softmax operator in the last layer).

2.4 Concluding Remarks

This concludes a general overview of the landscape of mainstream object detection halfway through 2018. Although the methods presented are all different, it has been shown that in fact most papers have converged towards the same crucial design choices. All pipelines are now fully convolutional, which brings structure (regularization), simplicity, speed and elegance to the detectors. The anchor mechanism of Ren et al. [309] has now also been widely adopted and has not really been challenged yet, although iteratively regressing a set of seed boxes shows some promise [101, 254]. The need to use multi-scale information from different layers of the CNN is now apparent [174, 215, 216]. The RoI-Pooling module and its cousins can also be cited as one of the main architectural advances of recent years but might not ultimately be used
by future works.

With that said, most of the research being done now in mainstream object recognition consists of inventing new ways of passing the information through the different layers or coming up with different kinds of losses or parametrization [103, 441]. There is a small paradox in the fact that even if hand-crafted features are now absent from most modern detectors, more and more research is being done on how to better hand-craft the CNN architectures and modules.

3 Going Forward in Object Detection

While we demonstrated that object detection has already been turned upside-down by CNN architectures and that nowadays most methods revolve around the same architectural ideas, the field has not yet reached a status quo, far from it. Completely new ideas and paradigms are being developed and explored as we write this survey, shaping the future of object detection. This section lists the major challenges that remain mostly unsolved and the attempts to get around them using such ideas. To get an idea of the number of papers being published targeting each challenge, we ran a corresponding query on the advanced search of Google Scholar. The exact query is mentioned below each figure (see Figures 12, 13, 14, 15 and 16). We report these numbers from 2011 to 2018. We note that this method doesn’t give the exact number of papers targeting each challenge but still gives us a rough idea of the interest of the community in each challenge. We couldn’t use this for the localization challenge because almost all object detection papers mention localization even if they do not aim to solve it.

3.1 Major Challenges

There are some walls that the current models cannot overcome without heavy structural changes; we list these challenges in Figure 11. Often, when we hear that object recognition is solved, we argue that the existence of these walls is solid proof that it is not. Although we have advanced the field, we cannot rely indefinitely on the current DCNNs. This section shows how the recent literature addressed these topics.

Figure 11: An illustration of challenges in object detection. To detect all instances of the class “fork” (yellow bounding boxes) from the COCO dataset [214], a detector should be able to handle small objects (lower middle picture) as well as big objects (third column photograph). It needs to be scale invariant as well as rotation invariant (all forks have different orientations in the pictures). It should also manage occlusions as in the left-hand side photograph. After being trained on the pictures in the first three columns, detection algorithms are expected to generalize to the “cartoon” image on the right (domain adaptation).
3.1.1 Scale Variance

In the past three years a lot of approaches have been proposed to deal with the challenge of scale variance. On the one hand, object instances in the image may fill only 0.01% to 0.25% of the pixels, and, on the other hand, an instance may fill 80% to 90% of the whole image. It is tough to make a single feature map predict all the objects, with this huge variance, because of the limited receptive field that its neurons have. Small objects in particular (discussed in Section 3.1.6) are difficult to classify and localize. In this section we will discuss the four main approaches that are used to tackle the challenge of scale variance.

Figure 12: Number of papers published each year for the challenge of scale variance. Query used in Google Scholar: ("scale variance" OR "scale invariance" OR "scale invariant") AND "object detection". [Bar chart titled "Scale Variance"; x-axis: years 2011 to 2018; y-axis ticks from 0 to 4000.]

First, image pyramids can be built [90, 104, 116, 327]. This helps enlarge small objects and shrink the large objects. Although the variance is reduced to an extent, each image has to be forwarded multiple times, making it computationally expensive and slower than the approaches discussed below. This approach is different from data augmentation techniques [60] where an image is randomly cropped, zoomed in or out, rotated etc. and used exactly once for inference. Ren et al. [310] extracted feature maps from a frozen network at different image scales and merged them using maxout [110]. Singh and Davis [345] selectively back-propagated the gradients of object instances only if they fall in a predetermined size range. This way, small objects must be scaled up to be considered for training. They named their technique Scale Normalization for Image Pyramids (SNIP). Singh et al. [347] optimized this approach by processing only context regions around ground-truth instances, referred to as chips.

Second, a set of default reference boxes, with varied sizes and aspect ratios that cover the whole image uniformly, were used. Ren et al. [309] proposed a set of reference boxes at each sliding window location which are trained to regress and classify. If an anchor box has a significant overlap with the ground truth it is treated as positive; otherwise, it is ignored or treated as negative. Due to the huge density of anchors most of them are negative. This leads to an imbalance between the positive and negative examples. To overcome it, OHEM [337] or Focal Loss [216] are generally applied at training time. One more downside of anchors is that their design has to be adapted according to the object sizes in the dataset. If large anchors are used with too many small objects, or vice versa, then they won’t be able to train as efficiently. Default reference boxes are an important design feature in two-stage [62] as well as single-stage methods [221, 306]. Most of the top winning entries [63, 118, 215, 216] use them in their models. Bodla et al. [21] helped by improving the suppression technique of double detections, generated from the dense set of reference boxes, at inference time.

Third, multiple convolutional layers were used for bounding box predictions. Since a single feature map was not enough to predict objects of varied sizes, SSD [221] added more feature maps to the original classification backbones. Cai et al. [28] proposed regions as well as performed detections on multiple scales in a two-stage detector. Najibi et al. [255] used this method to achieve state-of-the-art results on a face dataset [433] and Li et al. [200] on a pedestrian dataset [87]. Yang et al. [431] used all the layers to reject easy negatives and then performed scale-dependent pooling on the remaining proposals. Shallower or finer layers are deemed
to be better for detecting small objects while top or coarser layers are better at detecting bigger objects. In the original design, all the layers predict the boxes independently and no information from other layers is combined or merged. Many papers then tried to fuse different layers [51, 192] or added an additional top-down network [338, 415]. They have already been discussed in Section 2.1.1.

Fourth, dilated convolutions (a.k.a. atrous convolutions) [438] were deployed to space out the filter’s sampling positions. This helped increase the receptive field size and, thus, incorporate larger context without additional computations. Obviously, smaller receptive fields are also needed if the objects are small, and thus only a clever combination of larger receptive fields obtained with atrous convolutions and smaller ones, as in ASPP [42] (Atrous Spatial Pyramid Pooling), can lead to successful scale invariance in detection. It has been successfully applied in the context of object detection [62] and semantic segmentation [42]. Dai et al. [63] presented a generalized version of it by additionally learning the deformation offsets.

3.1.2 Rotational Variance

In the real world object instances are not necessarily present in an upright manner but can be found at an angle or even inverted. While it is hard to define rotation for flexible objects like a cat (a pose definition would be more appropriate), it is much easier to define it for texts or objects in aerial images which have an expected rigid shape. It is well known that CNNs as they are now do not have the ability to deal with the rotational variance of the data. More often than not, this problem is circumvented by using data augmentation: showing the network slightly rotated versions of each patch. When training on full images with multiple annotations it becomes less practical. Furthermore, as for occlusions, this might work but it is disappointing as one could imagine incorporating rotational invariance into the structure of the network.

Figure 13: Number of papers published each year for the challenge of rotational variance. Query used in Google Scholar: ("rotational variance" OR "rotational invariance" OR "rotational invariant") AND "object detection". [Bar chart titled "Rotational Variance"; x-axis: years 2011 to 2018; y-axis ticks from 0 to 250.]

Building rotational invariance can be simply done by using oriented bounding boxes in the region proposal step of modern detectors. Jiang et al. [159] used Faster R-CNN features to predict oriented bounding boxes; their straightened versions were then passed on to the classifier. More elegantly, a few works like [26, 119, 228] proposed to construct different kinds of RoI-pooling modules for oriented bounding boxes. Ma et al. [228] transformed the RoI-Pooling layer of Faster R-CNN by rotating the region inside the detector to make it fit the usual horizontal grid, which brought an astonishing increase in performance from the 38.7% of regular Faster R-CNN to 71.8% with additional tricks on MSRA. Similarly, He et al. [119] used a rotated version of the recently introduced RoI-Align to pool oriented proposals to get more discriminative features (better aligned with the text direction) that are used in the text recognition parts. Busta et al. [26] also used rotated pooling by bilinear interpolation to extract oriented features to recognize text after having modified YOLO to be able to predict rotated bounding boxes. Shi et al. [333] detected oriented bounding boxes (called segments) in the same way with a similar architecture, but their method differs from [26, 119, 228] because it also learned to merge the oriented segments appropriately if they cover the same word or sentence, which allowed greater flexibility.

Liu and Jin [222] needed slightly more complicated anchors, namely quadrangle anchors, and regressed
compact text zones in a single-stage architecture similar to Faster R-CNN’s RPN. This system, being more flexible than the previous ones, necessitated more parameters. They used Monte-Carlo simulations to compute overlaps between quadrangles. Liao et al. [212] directly rotated convolution filters inside the SSD framework, which effectively rendered the network rotation-invariant for a finite set of rotations (which is generalized in the recent [410] for segmentation). However, in the case of text detection even oriented bounding boxes can be insufficient to cover text whose layout has too much curvature, and one often sees the same failure cases in different articles (circle-shaped texts for instance).

A different kind of approach for rotation invariance was taken by the two following works of Cheng et al. [52] and Laptev et al. [188], which made use of metric learning. The former proposed an original approach of using metric learning to force features of an image and its rotated versions to be close to each other, hence somewhat invariant to rotations. In a somewhat related approach, the latter found a canonical pose for different rotated versions of an image and used a differentiable transformation to make every example canonical and to pool the same features.

The difficulty of predicting oriented bounding boxes is alleviated if one resorts to semantic segmentation as in [466]. They learned to output a semantic segmentation, and oriented bounding boxes were then found based on the output score map. However, it shares the same downsides as other approaches [26, 119, 212, 222, 228] for text detection because in the end one still has to fit oriented rectangles to evaluate the performance.

Applications other than text detection also require rotation invariance. In the domain of aerial imagery, the recently released DOTA [421] is one of the first datasets of its kind expecting oriented bounding boxes for predictions. One can anticipate an avalanche of papers trying to use text detection techniques like [372], where the SSD framework is used to regress bounding box angles, or the former metric learning technique from Cheng et al. [52] and Cheng et al. [54]. For face detection, papers like [335] relied on oriented proposals too. The diversity of the methods shows that no real standard has emerged yet. Even the most sophisticated detection pipelines are only rotation invariant to a certain extent.

The detectors presented in this section do not yet have the same popularity as the vertical ones because all the main datasets like COCO do not present rotated images. One could define a rotated-COCO or rotated-VOC to evaluate the benefit these pipelines could bring over their vertical versions but it is obviously difficult and would not be accepted as is by the community without a strong, well-thought-out evaluation protocol.

3.1.3 Domain Adaptation

Figure 14: Number of papers published each year for the challenge of domain adaptation. Query used in Google Scholar: ("domain adaptation" OR "adapting domains") AND "object detection". [Bar chart titled "Domain Adaptation"; x-axis: years 2011 to 2018; y-axis ticks from 0 to 1000.]

It is often necessary to repurpose a detector trained on domain A to function on domain B. In most cases this is because the dataset in domain A has lots of training examples and the categories in it are generic whereas the dataset in domain B has fewer training examples and objects that are very specific or distinct from A. There are surprisingly very few recent articles that tackled explicit domain adaptation in the context of object detection ([361, 368, 428] did it for HOG-based features), even though the literature for domain adaptation
for classification is dense, as shown by the recent survey of Csurka [59]. For instance, when one trains a Faster R-CNN on COCO and wants to test it off-the-shelf on the car images of KITTI (Geiger et al. [98]) ('car' is one of the 80 classes of COCO), one gets only 56.1% AP, compared with 83.7% when using more similar images, because of the differences between the domains (see [383]).

Most works adapt the features learned in another domain (mostly classification) by simply fine-tuning the weights on the task at hand. Since [105], virtually every state-of-the-art detector has been pre-trained on ImageNet or on an even bigger dataset. This is the case even for relatively large object detection datasets like COCO. There is no fundamental reason for it to be a requirement. The objects of the target domains have to be similar to, and of the same scales as, the objects on which the network was pre-trained, as pointed out by Singh and Davis [345], who detected small cars in aerial imagery by first pre-training on ImageNet. The seminal work of Hoffman et al. [127], already mentioned in the weakly supervised Section 4.2.1, showed how to transfer a good classifier trained on large scale image datasets to a good detector trained on few images by fine-tuning the first layers of a convnet trained on classification and adapting the final layer using nearest neighbor classes. Hinterstoisser et al. [125] demonstrated another example of transfer learning where they froze the first layers of detectors trained on synthetic data and fine-tuned only the last layers on the target task.

We discuss below all the articles we found that go further than simple transfer learning for domain adaptation for object detection. Raj et al. [294] aligned feature subspaces from different domains for each class using Principal Component Analysis (PCA). Chen et al. [49] used H-divergence theory and adversarial training to bridge the distribution mismatches. All the mentioned articles worked on adapting the features. Thanks to GANs, some works now try to adapt the image directly, as in [150], which used CycleGAN from [480] to convert images directly from one domain to the other. The object detection community needs to evolve if we want to move beyond transfer learning.

One of the end goals of domain adaptation would be to learn a model on synthetic data, which is available (almost) for free, and to have it perform well on real images. Pepik et al. [279] were, to the best of our knowledge, the first to point out that, even though CNNs are texture sensitive, wire-framed and CAD models used in addition to real data can improve the performance of detectors. Peng et al. [277] augmented PASCAL-VOC data with 3D CAD models of the objects found in PASCAL-VOC (planes, horses, potted plants, etc.), rendered them in backgrounds where they are likely to be found, and improved overall detection performance. Following this line, several authors introduced synthetic data for various tasks such as i) persons: Varol et al. [395]; ii) furniture: Massa et al. [236] rendered CAD furniture on real backgrounds, using grayscale images to avoid color artifacts, and improved the detection performance on the IKEA dataset; iii) text: Gupta et al. [112] created an oriented text detection benchmark by superimposing synthetic text on existing scenes while respecting geometric and uniformity constraints and showed better results on ICDAR; iv) logos: Su et al. [359] did the same without any constraints by superimposing transparent logos on existing images.

Georgakis et al. [99] synthesized new instances of 3D CAD models by copy-pasting rendered objects on surface normals, very close to [296], which used Blender to put instances of objects inside a refrigerator. Later, Dwibedi et al. [79] showed promise with the same approach but without respecting any global consistency. For them, only local consistency is important for modern object detectors. Similar to [99], they used different kinds of blending to make the detector robust to the pasting artifacts (more details can be found in [78]). More recently, Dvornik et al. [77] extended [79] by first finding locations in images with a high likelihood of object presence before pasting objects. Another recent approach [383] found that domain randomization when creating synthetic data is vital to train detectors: training on Virtual KITTI (Gaidon et al. [94]), a dataset that was built to be close to KITTI (in terms of aspects, textures, vehicles and bounding

boxes statistics), is not sufficient to be state-of-the-art on KITTI. One can gain almost one point of AP when building one's own version of Virtual KITTI by introducing more randomness than was present in the original, in the form of random textures and backgrounds, random camera angles and random flying distractor objects. Randomness was apparently absent from KITTI but is beneficial for the detector to gain generalization capabilities.

Several authors have shown interest in proposing tools for generating artificial images at a large scale. Qiu and Yuille [291] created the open-source plug-in UnrealCV for a popular game engine, Unreal Engine 4, and showed applications to deep network algorithms. Tian et al. [377] used the graphical model CityEngine to generate a synthetic city according to the layout of existing cities and added cars, trucks and buses to it using a game engine (Unity3D). The detectors trained on KITTI and this dataset are again better than those trained on KITTI alone. Alhaija et al. [4] pushed Blender to its limits to generate almost real-looking 3D CAD cars with environment maps and pasted them inside different 2D/3D environments including KITTI, VirtualKITTI (and even Flickr). It is worth noting that some datasets included real images to better simulate the scene viewed by a robot in active vision settings, as in [5].

Another strategy is to render simple artificial images and increase the realism of the images in a second iteration, using Generative Adversarial Networks [339]. RenderGAN was used to directly generate realistic training images [348]. We refer the reader to the section on GANs (Section 3.2.2) for more information on the use of GANs for style transfer.

We have seen that, for the time being, synthetic datasets can augment existing ones but not totally replace them for object detection: the domain shift between synthetic data and the target distribution is still too large to rely on synthetic data only.

3.1.4 Object Localization

Accurate localization remains one of the two biggest sources of error [129] in fully supervised object detection. It mainly originates from small objects and the more stringent evaluation protocols applied in the latest datasets. The predicted boxes are required to have an IoU of up to 0.95 with the ground-truth boxes. Generally, localization is handled by using smooth L1 or L2 losses along with the classification loss. Some papers proposed a more detailed methodology to overcome this issue. Also, annotating bounding boxes for each and every object is expensive. We will also look into some methods that localize objects using only weakly annotated images.

Kong et al. [174] overcame the poor localization caused by the coarseness of the feature maps by aggregating hierarchical feature maps and then compressing them into a uniform space. It provided an efficient combination framework for deep but semantic, intermediate but complementary, and shallow but high-resolution CNN features. Chen et al. [46] proposed multi-thresholding straddling expansion (MTSE) to reduce localization bias and refine boxes during proposal time, based on super-pixel tightness as opposed to objectness-based models. Zhang et al. [465] addressed the localization problem by using a search algorithm based on Bayesian optimization that sequentially proposed candidate regions for an object bounding box. Hosang et al. [132] tried to integrate NMS in the convolutional network, which in the end improved localization.

Many papers [101, 158] also try to adapt the loss function to address the localization problem. Gidaris and Komodakis [103] proposed to assign conditional probabilities to each row and column of a sample region, using a convolutional neural network adapted for this task. These probabilities allow more accurate inference of the object bounding box under a simple probabilistic framework. Since Intersection over Union (IoU) is used in the evaluation strategies of many detection challenges, Yu et al. [441] and Jiang et al. [158] optimized over IoU directly. The loss-based papers have been discussed in Section 2.2.1 in detail.

There is also an interesting question raised by some papers: do we really need to optimize for localization? Oquab et al. [262] used weakly annotated images to predict approximate locations of the object. Their approach performed comparably to the fully supervised counterparts. Zhou et al. [474] were able to get localizable deep representations that exposed the implicit attention of CNNs on an image with the help of global average pooling layers. In comparison to the earlier approach, their localization is not limited to localizing a point lying inside an object but determines the full extent of the object. [12, 448, 472] have also tried to predict localizations by masking different patches of the image during test time. More weakly supervised methods are discussed in Section 4.2.1.

3.1.5 Occlusions

Occlusions lead to partially missing information from object instances. Objects may be occluded by the background or by other object instances. Less information naturally leads to harder examples and inaccurate localizations. Occlusions happen all the time in real-life images. However, since deep learning is based on convolutional filters and occlusions by definition introduce parasitic patterns, most modern methods are not robust to them by construction.

Figure 15: Number of papers published each year for challenge of occlusion (plot titled "Occlusions", years 2011 to 2018, vertical axis from 0 to 8000). Query used in Google Scholar: (occlusions OR occlusion OR occluded) AND "object detection".

Training with occluded objects helps for sure [244], but it is often not doable because of a lack of data and, furthermore, it cannot be bulletproof. Wu et al. [419] managed to learn an And-Or model for cars by dynamic programming, where the And stood for the decomposition of the objects into parts and the Or for all different configurations of parts (including occluded configurations). The learning was only possible thanks to the heavy use of synthetic data to model every possible type of occlusion. Another way to generate examples of occlusions is to directly learn to mask the proposals of Fast R-CNN [407].

For dense pedestrian crowds, deformable models and parts can help improve detection accuracy (see Section 2.1.5): if some parts are masked, some others will not be, and therefore the average score is diminished but not made zero, as in [106, 265, 325]. Parts are also useful for occlusion handling in face detection, where different CNNs can be trained on different facial parts [432]. This survey has already covered Deformable RoI-Pooling (RoI-Pooling with parts) [247]. Another way of re-introducing parts in modern pipelines is the deformable kernels of [63]. They presented a way to alleviate the occlusion problems by giving more flexibility to the usually fixed geometric structures.

Building special kinds of regression losses for bounding boxes that acknowledge the proximity of each detection (which is reminiscent of the springs in the old part-based models) was done in [409]. In addition to the attraction term in the traditional regression loss that pushes predictions towards their assigned ground truth, they added a repulsion term that pushes predictions away from each other.

Traditional non-maximum suppression causes a lot of problems with occlusions because overlapping boxes are suppressed. Hence, if one object is in front of another, only one is detected. To address this, Hosang et al. [132] proposed to learn non-maximum suppression, making it continuous (and differentiable), and Bodla et al. [21] used a soft version that only degraded the score of the overlapping objects (more details can be found about various
other types of NMS in Section 2.3).

Other approaches used clues and context to help infer the presence of occluded objects. Zhang et al. [457] used super-pixel labeling to help detect occluded objects. They hypothesized that if some pixels are visible then the object is there. This is also the approach of the recent [118], but it needs pixel-level annotations. In videos, temporal coherence can be used [436]: heavily occluded objects are not occluded in every frame and can be tracked to help detection.

But for now all the solutions seem to be far from the human ability to mentally inpaint and infer missing parts. Using GANs for this purpose might be an interesting research direction.

3.1.6 Detecting Small Objects

Detecting small objects is harder than detecting medium-sized and large objects because of the smaller amount of information associated with them, easier confusion with the background, higher precision requirements for localization, large image size, etc. In the COCO evaluation metrics, objects occupying areas less than or equal to 32 × 32 pixels come under this category, and this size threshold is generally accepted within the community for datasets related to common objects. Datasets related to aerial images [421], traffic signs [490], faces [253], pedestrians [84] or logos [358] generally abound with small object instances.

Figure 16: Number of papers published each year for challenge of small objects (plot titled "Small Objects", years 2011 to 2018, vertical axis from 0 to 2000). Query used in Google Scholar: ("small objects" OR "small object") AND "object detection".

In the case of objects like logos or traffic signs, the objects to be detected have an expected shape, size and aspect ratio, and this information can be embedded to bias the deep learning model. This strategy is much harder and not feasible for common objects as they are a lot more diverse. As an illustration, the winner of the COCO challenge 2017 [274], which used many of the latest techniques and an ensemble of four detectors, reported a performance of 34.5% mAP on small objects and 64.9% mAP on large objects. The following entries reported an even greater dip for smaller objects than for larger ones. Pham et al. [281] have presented an evaluation, focusing on real-time small object detection, of three state-of-the-art models, YOLO, SSD and Faster R-CNN, with the related trade-off between accuracy, execution time and resource constraints.

There are different ways to tackle this problem, such as: i) up-scaling the images, ii) shallow networks, iii) contextual information, iv) super-resolution. These four directions are discussed in the following.

The first, and most trivial, direction consists in up-scaling the image before detection. But naive upscaling is not efficient as the large images become too large to fit into a GPU for training. Gao et al. [95] first down-sampled the image and then used reinforcement learning to train attention-based models to dynamically search for the interesting regions in the image. The selected regions are then studied at higher resolution and can be used to predict smaller objects. This avoided the need to analyze each pixel of the image with equal attention and saved some computational costs. Some papers [62, 63, 345] used image pyramids during training time in the context of object detection while [310] used them during inference time.

The second direction is to use shallow networks. Small objects are easier to predict by detectors which have a smaller receptive field. The deeper networks with their large receptive field tend to lose

some information about the small objects in their coarser layers. Sommer et al. [259, 351] proposed very shallow networks with fewer than five convolutional layers and three fully connected layers for the purpose of detecting objects in aerial imagery. Such detectors are useful when the expected instances are all small. But if the expected instances are of diverse sizes, it is more beneficial to use finer feature maps of very deep networks for small objects and coarser feature maps for larger objects. We have already discussed this approach in Section 3.1.1. Please refer to Section 4.2.4 for more low-power and shallow detectors.

The third direction is to make use of context surrounding the small object instances. Gidaris and Komodakis [102] and Zhu et al. [489] used context to improve the performance, but Chen et al. [36] used context specifically for improving the performance for small objects. They augmented the R-CNN with the context patch in parallel to the proposal patch generated from the region proposal network. Zagoruyko et al. [446] combined their approach of making the information flow through multiple paths with DeepMask object proposals [282, 284] to gain a massive improvement in the performance for small objects. Context can also be used by fusing coarser layers of the network with finer layers [215, 216, 338]. Context-related literature is covered in Section 3.2.3 in detail.

Finally, the last direction is to use Generative Adversarial Networks to selectively increase the resolution of small objects, as proposed by Li et al. [201]. Its generator learned to enhance the poor representations of the small objects to super-resolved ones that are similar enough to real large objects to fool a competing discriminator. Table 1 summarizes the preceding subsection by grouping the articles by main idea and target challenge. We find it very useful to see which ideas have been thoroughly investigated by the literature and which are under-explored.

Figure 17: Good detections of persons are marked in green and bad detections in red. Helpful context in blue (the presence of mirror frames) can help lower the score of a box. The relationships (in fuchsia) between bounding boxes can also help: a person cannot be present twice in a picture. Enhancing parts of the picture using SR (Figure 18) is yet another way to better make a decision. All those "reasoning" modules are not included in the mainstream detectors.

3.2 Complementary New Ideas in Object Detection

In this subsection we review ideas which haven't quite matured yet but we feel could bring major breakthroughs in the near future. If we want the field to advance, we should embrace new grand ideas like these, even if that means completely rethinking all the architectural ideas evoked in Section 2.

3.2.1 Graph Networks

The dramatic failings of state-of-the-art detectors on perturbed versions of the COCO validation sets, spotted by Rosenfeld et al. [314], are raising questions for better understanding of compositionality, context and relationships in detectors.

Battaglia et al. [11] recently wrote a position article arguing about the need to introduce more representational power into Deep Learning using graph networks. It means finding new ways to enforce the learning of graph structures of connected

Article references | Main idea | Challenge(s) addressed
[90, 104, 116, 327] | Image Pyramids | Scale Variance, Small Objects
[28, 72, 87, 174, 200, 215, 221, 255, 310, 332, 338, 345, 347, 415, 433, 446] | Features Fusion | Scale Variance, Small Objects
[345, 347] | Selective Backpropagation (SN) | Scale Variance, Small Objects, Object Localization
[21, 390] | Better NMS | Small Objects, Occlusions, Object Localization
[216, 337] | Hard Examples Mining (Explicit and Implicit) | Small Objects, Occlusions
[431] | Scale Dependent Pooling | Scale Variance, Small Objects
[26, 119, 159, 228] | Oriented Bounding boxes | Rotational Variance
[26, 119, 228] | Oriented Pooling | Rotational Variance
[222, 333] | Flexible anchors (segments, quadrangles) | Rotational Variance
[212] | Rotating Filters | Rotational Variance
[52, 188] | Rotation Invariant Features | Rotational Variance
[118, 466] | Auxiliary Task (semantic segmentation) | Rotational Variance, Occlusions
[49, 294] | Aligning Feature distribution | Domain Adaptation
[150, 339] | Image Transformations (GANs) | Domain Adaptation
[79, 99, 277, 279, 296] | Data Augmentation using Synthetic Datasets | Domain Adaptation
[383] | Domain Randomization | Domain Adaptation
[46, 457] | Super-Pixels | Object Localization, Occlusions
[465] | Sequential Search | Object Localization
[101, 103, 158, 216, 441] | Loss Function Modifications | Small Objects, Object Localization
[63, 106, 247, 265, 325, 432] | Part Based Models | Occlusions
[63, 247] | Deformable CNN Modules | Occlusions
[436] | Tracking (in videos) | Occlusions
[95] | Dynamic Zooming | Small Objects
[259, 351] | Shallow Networks | Small Objects
[36, 102, 489] | Use of Contextual Information | Small Objects
[201] | Features Super Resolution | Small Objects

Table 1: Summary of the main ideas found in the literature to account for the limitations of the current deep learning architectures. For each idea we list the papers that implement it and the challenges they (sometimes only partially) address.
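
As an illustration of the "Better NMS" row of Table 1, the sketch below contrasts traditional (hard) non-maximum suppression with a soft variant that only lowers the score of overlapping boxes, in the spirit of what is described for Bodla et al. [21]. It is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian decay form, the values of `sigma`, `iou_thresh` and `score_thresh`, and the function names `iou`, `hard_nms` and `soft_nms` are all chosen here for illustration.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def hard_nms(boxes, scores, iou_thresh=0.5):
    """Traditional NMS: drop any box overlapping a higher-scored kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft variant (assumed Gaussian decay): overlapping boxes are kept
    but their scores are reduced according to their IoU with the winner."""
    scores = list(scores)              # work on a copy
    remaining = list(range(len(boxes)))
    kept = []                          # (index, final score), in selection order
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        kept.append((best, scores[best]))
        remaining.remove(best)
        for i in remaining:
            scores[i] *= math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        # discard boxes whose score has decayed to (almost) nothing
        remaining = [i for i in remaining if scores[i] > score_thresh]
    return kept

# Two heavily overlapping boxes (e.g. one object in front of another) and a distant one.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(hard_nms(boxes, scores))   # the second box is suppressed entirely
print(soft_nms(boxes, scores))   # the second box survives with a reduced score
```

In this toy case the hard version discards the second box, whereas the soft version keeps it with a degraded score, which is the behaviour that matters when one object partially occludes another.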

entities instead of outputting independent predictions. Convolutions are too local and translation equivariant to reflect the intricate structure of objects in their context.

One embodiment of this idea in the realm of detection can be found in the work of Wang et al. [406], where long-distance dependencies were introduced in deep-learning architectures. These combined local and non-local interactions are reminiscent of the CRF [185], which sparked a renewed interest for graphical models in 2001. Dot products between features determine their influences on each other: the closer they are in the feature space, the stronger their interactions will be (using a Gaussian kernel for instance). This seems to go against the very principles of DCNNs, which are, by nature, local. However, this kind of layer can be integrated seamlessly in any DCNN to its benefit; it is very similar to self-attention [55]. It is not clear yet if these new networks will replace their local counterparts in the long term but they are definitely suitable candidates.

Graph structures also emerge when one needs to incorporate priors (or inductive biases) on the spatial relationships of the objects to detect (relational reasoning) [135]. The relation module uses attention to learn object dependencies, also using dot products of features. Similarly, Wang et al. [406] incorporated geometrical features to further disambiguate relationships between objects. One of the advantages of this pipeline is the last relation module, which is used to remove duplicates similarly to the usual NMS step but adaptively. We mention this article in particular because, although relationships between detected objects have been used in the literature before, it was the first attempt to have them as a differentiable module inside a CNN architecture.

3.2.2 Adversarial Trainings

No one in the computer vision community was spared by the amazing successes of the Generative Adversarial Networks [109]. By pitting a con-artist (a CNN) against a judge (another CNN) one can learn to generate images from a target distribution up to an impressive degree of realism. This new tool keeps the flexibility of the regular CNN architectures as it is implemented using the same bricks and, therefore, it can be added in any detection pipeline.

Even if [407] does not belong to the GAN family per se, the adversarial training it uses (dropping pixels in examples to make them harder to classify and hence render the network robust to occlusions) obviously drew its inspiration from GANs. Ouyang et al. [269] went a step further and used the GAN formalism to learn to generate pedestrians from white noise in large images and showed how those created examples were beneficial for the training of object detectors. There are numerous recent papers, e.g., [23, 276], proposing approaches for converting synthetic data towards more realistic images for classification. Inoue et al. [150] used the latest CycleGAN [480] to convert real images to cartoons and by doing so gained free annotations to train detectors on weakly labeled images; it became the first work to use GANs to create full images for detectors. As stated in the introduction, GANs can also be used not in a standalone manner but directly embedded inside a detector: Li et al. [201] operated at the feature level by adapting the features of small objects to match features obtained with well resolved objects. Bai et al. [9] trained a generator directly for super-resolution of small object patches using the traditional GAN loss in addition to classification losses and a per-pixel MSE loss. Integrating the module in modern pipelines brought improvement to the original mAP on COCO; this very simple pipeline is summarized in Figure 18. These two articles addressed the detection of small objects, which was tackled in more detail in Section 3.1.6.

Figure 18: Small object patches from Regions of Interest are enhanced to better help the classifier make a decision in SOD-MTGAN [9].

Shen et al. [330] used GANs to completely replace the Multiple Instance Learning paradigm (see Section 4.2.1), using the GAN framework to generate candidate boxes following the real distribution of the training image boxes, and built a state-of-the-art detector that is faster than all the others by two orders of magnitude.

Thus, this extraordinary breakthrough is starting to produce interesting results in object detection and its importance is growing. Considering the latest results in the generation of synthetic data using GANs, for instance the high resolution examples of [166] or the infinite image generators, BiCycleGAN from Zhu et al. [480] and MUNIT from Huang et al. [142], it seems the tsunami that started in 2014 will only get bigger in the years to come.

3.2.3 Use of Contextual Information

We will see in this section that the word context can mean a lot of different things but taking it into account gives rise to many new methods in object detection. Most of them (like spatial relationships or using stuff to find things) are often overlooked in competitions, arguably for bad reasons (too complex to implement in the time frame of the challenge).

Methods have evolved a lot since Heitz and Koller [120] used clustering of stuff/backgrounds to help detect objects in aerial imagery. Now, thanks to the CNN architectures, it is possible to do detection of things and stuff segmentation in parallel, both tasks helping the other [24].

Of course, this finding is not surprising. Certain objects are more likely to appear in certain stuff or environments (or context): thanks to our knowledge of the world, we find it weird to have a flying train. Katti et al. [167] showed that adding this human knowledge helps existing pipelines. The environments of the visual objects also comprise other objects that they are present with, which advocates for learning spatial relationships between objects. Mrowca et al. [250] and Gupta et al. [113] independently used spatial relationships between proposals and classes (using the WordNet hierarchy) to post-process detections. This is also the case in [492], where RNNs were used to model those relationships at different scales, and in [48], where an external memory module kept track of the likelihood of objects being together. Hu et al. [135], which we mentioned in Section 3.2.1, went even further with a trainable relation module inside the structure of the network. In a different but not unrelated manner, Gonzalez-Garcia et al. [108] improved the detection of parts of objects by associating parts with their root objects.

All multi-scale architectures use different sized context, as we saw in Section 2.1.1. Zeng et al. [450] used features from different sized regions (different contexts) in different layers of the CNN with message-passing in between features related to different contexts. Kong et al. [175] used skip connections and concatenation directly in the CNN architecture to extract multi-level and multi-scale information.

Sometimes, even the simplest local context surrounding a region of interest can help (see, for instance, the methods presented in Section 2.1.4, where the amount of context varies in between the classifiers). Extracted proposals can include variable amounts of pixels (context means size of the proposal) to help the classifiers such as in [268] or in [36, 101, 103]. Li et al. [196] included global image context in addition to regional context. Some approaches went as far as integrating all the image context: it was done for the first time in YOLO [308] with the addition of a fully connected layer on the last feature map. Wang et al. [406] modified the convolutional operator to put weights on every part of the image, helping the network use context outside the object to infer their existence. This use of global context is also found with the Global Context Module of the recent detection pipeline from Megvii [275]. Li et al. [207] proposed a fully connected layer on all the feature maps (similar to

Article references | Ideas | Type of Context
[24, 167] | Segmenting Stuff/Using background cues | Background context (Environment)
[48, 113, 135, 250, 442, 492] | Likelihood of Objects being together/Inferring Relationships/Memory Modules | Other Objects Context
[108] | Finding Root to find parts | Parts-Objects Context
[175, 450] | Using different feature scales | Multi-scale Context
[36, 101, 103, 196, 268] | Adding variable sized context/Adding Borders of RoI | Surrounding Pixels
[207, 308, 406] | Adding connections to all pixels | Full Image Context

Table 2: Summary of the approaches taken to exploit different types of context.
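
To make the "Full Image Context" row of Table 2 more concrete, the following sketch shows a dot-product, non-local style aggregation in which every spatial position is updated with a weighted sum of the features at all positions, the weights coming from a softmax over feature dot products. It is only a simplified illustration under stated assumptions and not the layer of Wang et al. [406]: the learned projections of a real non-local or self-attention block are replaced by the identity, and the names `dot` and `non_local` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def non_local(features):
    """features: list of N feature vectors, one per spatial position.

    Returns (outputs, weights). weights[i][j] is the influence of position j
    on position i; outputs[i] is the input feature plus the weighted sum of
    all features (a residual connection, as is common for such layers).
    """
    outputs, weights = [], []
    for xi in features:
        sims = [dot(xi, xj) for xj in features]
        top = max(sims)                           # subtract max for numerical stability
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        w = [e / total for e in exps]             # softmax over all positions
        agg = [sum(wj * xj[d] for wj, xj in zip(w, features))
               for d in range(len(xi))]
        outputs.append([a + b for a, b in zip(xi, agg)])
        weights.append(w)
    return outputs, weights

# Three positions: the first two share the same feature, the third differs.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, weights = non_local(features)
print(weights[0])   # position 0 attends more to position 1 than to position 2
```

Positions with similar features receive larger weights regardless of how far apart they are in the image, which is what lets such a layer use context outside the local receptive field.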

Redmon et al. [308]) with dilated kernels.

Other kinds of context can also be put to work. Yu et al. [442] used latent variables to decide which context cues to use to predict the bounding boxes. It is not clear yet which method is the best to take context into account. Another question is: do we want to? Even if the presence of an object in a context is unlikely, do we actually want to blind our detectors to unlikely situations? All the types of context that can be leveraged have been summarized in Table 2.

3.3 Concluding Remarks

This section finished the tour of all the principal CNN-based approaches, past, present and future, that treat general object detection in the traditional settings. It has allowed us to peer through the armor of the CNN detectors and see them for what they are: impressive machines with amazing generalization capabilities but still powerless in a variety of cases in which a trained human would have no problem (domain adaptation, occlusions, rotations, small objects). For an example of a difficult test case, even for so-called robust detectors, see Figure 17. Potential ideas to go past these obstacles have also been mentioned; among them, the use of adversarial training and context are the most prominent. The following section will go into more specific set-ups, less traditional problems or environments that will frame the detector abilities even further.

4 Extending Object Detection

Object detection may still feel like a narrow problem: one has a big training set of 2D images, huge resources (GPUs, TPUs, etc.) and wants to output 2D bounding boxes on a similar set of 2D images. However, these basic assumptions often do not hold in practical scenarios. Firstly, there exist many other modalities where one can perform object detection. These require conceptual changes in architectures to perform equally well. Secondly, sometimes one might be constrained to learn from exceedingly few fully annotated images; training a regular detector is then either irrelevant or not an optimal choice because of overfitting. Also, detectors are not built to be run in research labs alone but to be integrated into industrial products, which often come with an upper bound on energy consumption and speed requirements to satisfy the customer. The aim of the following discussion is to learn more about the research work done to extend deep learning based object detection into new modalities and under tough constraints. It ends with reflections on what other interesting functionalities a strong detector in the future might possess.

4.1 Detecting Objects in Other Modalities

There are several modalities other than 2D images that can be interesting: videos, 3D point clouds, medical imaging, hyper-spectral imagery, etc. We will be discussing in this survey the first two. We did not treat for instance the volumetric images from the medical domain (MRI, etc.) or hyper-

spectral imagery, which are outside of the scope of this article and would deserve their own survey.

Figure 19: Detecting objects in other modalities. Left: videos. Right: 3D point clouds.

4.1.1 Object Detection in Videos

The upside of detecting objects in videos is that it provides additional temporal information, but it also has unique challenges associated with it: motion blur, appearance changes, video defocus, pose variations, computational efficiency, etc. It is a recent research domain due to the lack of large scale public datasets. One of the first video datasets is the ImageNet VID [319], proposed in 2015. This dataset as well as the recent datasets for object detection in video are mentioned in Section A.4.

One of the simplest ways to use temporal information for detecting objects is the detection-by-tracking paradigm. As an example, Ray et al. [301] proposed a spatio-temporal detector of motion blobs, associated into tracks by a tracking algorithm. Each track is then interpreted as a moving object. Despite its simplicity, this type of algorithm is marginal in the literature as it is only interesting when the appearances of the objects are not available.

The most widely used approaches in the literature are those relying on tubelets. Tubelets were introduced in the T-CNN approach of Kang et al. [163, 164]. T-CNN relied on four steps. First, still-image object detection (with Faster R-CNN like detectors) was performed. Second, multi-context suppression removed detection hypotheses having the lowest scores: highly ranked detection scores were treated as high-confidence classes and the rest were suppressed. Third, motion-guided propagation transferred detection results to adjacent frames to reduce false negatives. Fourth, temporal tubelet rescoring used a tracking algorithm to obtain sequences of bounding boxes, classified into positive and negative samples. Positive samples were mapped to a higher range, thus increasing the score margins. T-CNN has several follow-ups. The first was Seq-NMS [115], which constructed sequences along nearby high-confidence bounding boxes from consecutive frames, rescoring them to the average confidence. Other boxes close to this sequence were suppressed. Another one was MC-MOT [191], in which a post-processing stage, in the form of a multi-object tracker, was introduced, relying on hand-crafted rules (e.g., detector confidences, color/motion cues, changing point detection and forward-backward validation) to determine whether bounding boxes belonged to the tracked objects, and to further refine the tracking results. Tripathi et al. [385] exploited temporal information by training a recurrent neural network that took as input sequences with predicted bounding boxes, and optimized an objective enforcing consistency across frames.

The most advanced pipeline for object detection in videos is certainly the approach of Feichtenhofer et al. [89], borrowing ideas from tubelets as well as from feature aggregation. The approach relies on a multitask objective loss for frame-based object detection and across-frame track regression, correlating features that represent object co-occurrences across time and linking the frame-level detections based on across-frame tracklets to produce the detections.

The literature on object detection in videos also addressed the question of computing time, since applying a detector on each frame can be time consuming. In general, it is non-trivial to transfer the state-of-the-art object detection networks to videos, as per-frame evaluation is slow. Deep feature flow [485, 486] ran the convolutional sub-network only on sparse key frames and propagated deep feature maps to other frames via a flow field.

Article references | Highlight | Type of Detections
[191, 301] | Context cues / Motion blobs | Basic Tracking
[89, 115, 163, 164] | Motion Propagation / Tracking / Seq-NMS / Feature aggregation | Tubelets
[385] | Enforcing Consistency | RNNs
[486, 487] | Sparse Key frame aggregation / Fast Computation of flow | Flow Field
[41, 329] | Motion-based Inference | Adaptive Computation

Table 3: Summary of the video object detection methods.
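
The Seq-NMS idea described in Section 4.1.1 (linking high-confidence boxes from consecutive frames into a sequence and rescoring them to the average confidence) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of [115]: the function names, the data layout and the IoU linking threshold of 0.5 are hypothetical, and the suppression of boxes close to the selected sequence is omitted.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, float]             # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_sequence(frames: List[List[Detection]], iou_thresh: float = 0.5):
    """Return the highest-scoring linked sequence and its average confidence.

    Boxes of consecutive frames are linked when their IoU reaches iou_thresh
    (an assumed value). The result is a list of (frame_index, box_index) pairs
    and the average score every box of the sequence would be rescored to.
    """
    # best[t][i]: best cumulative score of a sequence ending at box i of frame t
    best = [[score for _, score in frame] for frame in frames]
    back: List[List[Optional[int]]] = [[None] * len(frame) for frame in frames]
    for t in range(1, len(frames)):
        for i, (box, score) in enumerate(frames[t]):
            for j, (prev_box, _) in enumerate(frames[t - 1]):
                cand = best[t - 1][j] + score
                if iou(prev_box, box) >= iou_thresh and cand > best[t][i]:
                    best[t][i], back[t][i] = cand, j
    ends = [(t, i) for t, frame in enumerate(frames) for i in range(len(frame))]
    if not ends:
        return [], 0.0
    # Start from the end point with the largest cumulative score and backtrack.
    t, i = max(ends, key=lambda e: best[e[0]][e[1]])
    path = []
    while i is not None:
        path.append((t, i))
        t, i = t - 1, back[t][i]
    path.reverse()
    avg = sum(frames[t][i][1] for t, i in path) / len(path)
    return path, avg
```

With three frames containing one slowly moving box, the weak middle detection is linked to its stronger neighbours and would be rescored to the sequence average, which is the effect the rescoring step is meant to achieve.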

Deep feature flow led to a significant speedup, as flow computation is relatively fast. The impression network [123] iteratively absorbed sparsely extracted frame features; impression features were propagated all the way down the video, which helped enhance the features of low-quality frames. In the same way, the light flow of [487] is a very small network designed to aggregate features on key frames. For non-key frames, sparse feature propagation was performed, reaching a speed of 25.6 fps. Fast YOLO [329] came up with an optimized architecture that has 2.8X fewer parameters with just a 2% IOU drop, by applying a motion-adaptive inference method. Finally, [41] proposed to reallocate computational resources over a scale-time space: expensive detection was done sparsely and propagated across both scales and time, while cheaper networks did the temporal propagation over a scale-time lattice.

An interesting question is "What can we expect from using temporal information?" The improvement of the mAP due to the direct use of temporal information can vary from +2.9% [484] to +5.6% [89]. Table 3 gives a recap of this sub-subsection.

4.1.2 Object Detection in 3D Point Clouds

This section addresses the literature about object detection in 3D data, whether it is true 3D point clouds or 2D images augmented with depth data (RGBD images). These problems raise novel challenges, especially in the case of 3D point clouds, for which the nature of the data is totally different (both in terms of structure and contained information). We can distinguish four main types of approaches depending on i) the use of 2D images and geometry, ii) the detections made in raw 3D point clouds, iii) the detections made in a 3D voxel grid, iv) the detections made in 2D after projecting the point cloud on a 2D plane. Most of the presented methods are evaluated on the KITTI benchmark [98]. Section A.3 introduces the datasets used for 3D object detection and quantitatively compares the best methods on these datasets.

The methods belonging to the first category, monocular, start by processing RGB images and then add shape and geometric priors or occlusion patterns to infer 3D bounding boxes, as proposed by Chen et al. [44], Mousavian et al. [249] and Xiang et al. [424]. Deng and Latecki [68] revisited amodal 3D detection by directly relating 2.5D visual appearance to 3D objects and proposed a 3D object detection system that simultaneously predicted 3D locations and orientations of objects in indoor scenes. Li et al. [197] represented the data in a 2D point map and used a single 2D end-to-end fully convolutional network to detect objects, predicting full 3D bounding boxes even while using a 2D convolutional network. Deep MANTA [34] is a robust convolutional network introduced for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation, from 2D images.

Among the methods using 3D point clouds directly, we can mention the series of papers relying on PointNet [288] and PointNet++ [290] networks, which are capable of dealing with the irregular format of point clouds without having to transform them into 3D voxel grids. F-PointNet [289] is a 3D detector operating on raw point clouds (RGB-D scans). It leveraged a mature 2D object detector to propose 2D object regions in RGB images and then collected all points within the frustum to form a frustum point cloud.

Voxel based methods such as VoxelNet [478] represented the irregular format of point clouds by fixed size 3D voxel grids on which standard 3D convolution can be applied. Li [195] discretized the point cloud on square grids, and represented the discretized data by a 4D array of fixed dimensions. Vote3Deep [83] examined the trade-off between accuracy and speed for different architectures applied on a voxelized representation of the input data.

Regarding approaches based on bird's eye view, MV3D [47] projected the LiDAR point cloud to a bird's eye view on which a 2D region proposal network is applied, allowing the generation of 3D bounding box proposals. In a similar way, LMNet [240] addressed the question of real-time object detection using 3D LiDAR by projecting the point cloud onto 5 different frontal planes. More recently, BirdNet [15] proposed an original cell encoding mechanism for bird's eye view, which is invariant to distance and to differences in LiDAR device resolution, as well as a detector taking this representation as input. One of the fastest methods (50 fps) is ComplexYOLO [342], which expanded YOLOv2 by a specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space, after building a bird's eye view of the data.

Some recent methods, such as [182], combined different sources of information (e.g., bird's eye view, RGB images, 3D voxels, etc.) and proposed an architecture performing multimodal feature fusion on high resolution feature maps. Ku et al. [182] is one of the top performing methods on the KITTI benchmark [98]. Finally, it is worth mentioning the super-pixel based method by Srivastava et al. [356], which allowed the discovery of novel objects in 3D point clouds. In the same fashion as in the previous section, we display a recap in Table 4.

Article references | Implementation | Operates on
[34, 44, 68, 197, 249, 424] | 3D priors | 2D images
[288–290, 356] | PointNetworks / Graph Convolutions / SuperPixels | PointClouds
[83, 195, 478] | 3/4D convolutions | Voxels
[15, 47, 240, 342] | Plane choices / Discretization / Counting | Projections (Bird’s eye)
[182] | Feature Fusion | Multi-modal

Table 4: Summary of the 3D object detection approaches.

4.2 Detecting Objects Under Constraints

In object detection, challenges arise not only because of the naturally expected problems (scale, rotation, localization, occlusions, etc.) but also due to the ones that are created artificially. The first motivation for the following discussion is to know and understand the research works that deal with the inadequacy of annotations in certain datasets. This inadequacy could be due to weak (image-level) labels, scarce bounding box annotations or no annotations at all for certain classes. The second motivation is to discuss the approaches dealing with the hardware and application constraints that real-world detectors might encounter.

4.2.1 Weakly Supervised Detection

Research teams want to include as many images as possible in their proposed datasets. Due to budget constraints or for some other reasons, they sometimes choose not to annotate precise bounding boxes around objects and include only image level annotations or captions. The object detection community has proven that it is still possible, with enough weakly annotated data, to train good object detectors.

The most obvious way to address Weakly Supervised Object Detection (WSOD) is to use the Multiple Instance Learning (MIL) framework [233]. The image is considered as being a bag of regions extracted by conventional object proposals: at least one of these candidate regions is positive if the image has the appropriate weak label; if not, no region is positive. The classical formulation of the problem at hand (before CNNs) then becomes a latent-SVM on the regions' features where the latent part is the assignment of each proposal (that is weakly constrained by the image label). This problem, being highly non-convex, is heavily dependent on the quality of the initialization.

Song et al. [353, 354] thus focused on the initialization of the boxes by starting from selective-search proposals. They used, for each proposal, its K-nearest neighbors in other images to construct a bipartite graph. The boxes were then pruned by taking only the patches that occur in most positive images (covering) while not belonging to the set of neighbors of regions found in negative images. They also applied Nesterov smoothing on the SVM objective to make the optimization easier. Of course, if proposals do not span enough of the image, some objects will not be detected and thus the performance will be bad, as there is no re-localization. The work of Sun et al. [362] also belongs to this category. Bilen et al. [19] added regularization to the smoothed optimization problem of Song et al. [353] using prior knowledge, but followed the same general directions. In another related research direction, Wang et al. [402] learned to cluster the regions extracted with selective search into K categories using unsupervised learning (pLSA) and then learned category selection using bag of words to determine the most discriminative clusters per class.

However, it is not always a requirement to explicitly solve the latent-SVM problem. Thanks to the fully convolutional structure of most CNNs, it is sometimes possible to get a rough idea of where an object might be while training for classification. For example, the arg-max of the produced spatial heat maps before global max-pooling is often located inside a bounding box, as shown in [261, 262]. It is also possible to learn to detect objects without using any ground truth bounding boxes for training by masking regions of the image and seeing how the global classification score is impacted, as proposed by Bazzani et al. [12].

This free localization information can be improved through the use of different pooling strategies, for instance producing a spatial heat map and using global average pooling instead of global max pooling to train in classification. This strategy was used in [474], where the heat maps per class were thresholded to obtain bounding boxes. In this line of work, Pinheiro and Collobert [283] went a step further by producing pixel-level label segmentation maps using Log-Sum-Exp pooling in conjunction with some image and smoothing priors. Other pooling strategies involved aggregating minimum and maximum evidence to get a more precise idea of where the object is and is not, e.g., as in the line developed in Durand et al. [74, 75, 76]. Bilen and Vedaldi [18] used the spatial pyramid pooling module to take MIL to the modern age by incorporating it into a Fast R-CNN like architecture with a two-stream proposal classification part: one stream with classification scores and the other with relative rankings of proposals, merged together using Hadamard products, thus producing region-level label predictions like in classic detection settings. They then aggregated all labels per image by taking the sum. They trained it end-to-end using image level labels thanks to their aggregation module, while adding a spatial-regularization constraint on the features obtained by the SPP module.

Another idea, which can be combined with MIL, is to draw the supervision from elsewhere. Tracked object proposals were used by Kumar Singh et al. [183] to extract pseudo-groundtruth to train detectors. This idea was further explored by Chen et al. [40], where the keywords extracted from the subtitles of documentaries made it possible to further ground and cluster the generated annotations. In a similar way, Yuan et al. [443] used action description supervision via LSTMs. Cheap supervision can also be gained by involving user feedback [270], where the users iteratively improved the pseudo-ground truth by saying if the objects were missed or partly included in the detections. Click supervision by users, far less demanding than full annotations, also improved the performance of detectors [271]. [316] used active learning to select the right images to annotate and thus get the same performance by using far fewer images. One can also leverage strong annotations for other classes to improve the performance of weakly supervised classes. This was done in [374] by using the powerful LSDA framework [127]. This was also the case in [128, 146, 311].

This year, a lot of interesting new works continued to develop the MIL+CNN framework using diverse approaches [97, 369, 400, 462–464]. These articles will not be treated in detail because the focus of this survey is object detection in general and not WSOD.

As of this writing, the state-of-the-art mAP on VOC2007 in WSOD is 47.6% [463]. The gap is being reduced at an exhilarating pace but we are still far from the 83.1% state-of-the-art with full annotations [247] (without COCO pre-training). We present a recap in Table 5.

Article references | Implementation | Paradigm
[19, 353, 354, 362, 402] | Optimization Tricks (smoothing, EM, etc.) | Full MIL
[12] | Monitor score change | Masking
[18, 261, 262, 283, 474] | Global Pooling / GAPooling / LogSumExp Pooling | Refining Pooling
[74–76] | top-k max/min | Contradictory Evidence Pooling
[40, 128, 146, 183, 270, 311, 374, 443] | Subtitles / Motion Cues / User clicks / Strong Annotations | Auxiliary Supervision

Table 5: Summary of the weakly supervised approaches.

4.2.2 Few-shot Detection

The cost of annotating thousands of boxes over hundreds of classes is too high. Although some large scale datasets have been created, it is not practical to do it for every single target domain. Collecting and annotating training examples in the case of video is even costlier than for still images, making few-shot detection more interesting. For this purpose, researchers have come up with ways to train the detectors with as few as three to five bounding boxes per target class and get lower but competitive performance compared to the fully supervised approach on a large scale dataset. Few-shot learning usually relies on semi-supervised learning mechanisms.

Dong et al. [73] took up an iterative approach to simultaneously train the model and generate new samples which are used in the following iterations for training. They observed that as the model becomes more discriminative it is able to sample harder instances as well as a larger number of instances. Iterating between multiple kinds of detectors was found to outperform the single detector approach. One interesting aspect of the paper is that their approach, with only three to four annotations per class, gives results comparable to weakly annotated approaches with image level annotations on the whole PASCAL VOC dataset. A similar approach was used by Keren et al. [168], who proposed a model which can be trained with as few as one single exemplar of an unseen class and a larger target example that may or may not contain an instance of the same class as the exemplar (weakly supervised learning). This model was able to simultaneously identify and localize instances of classes unseen at training time.

Another way to deal with few-shot detection is to fine-tune a detector trained on a source domain to a target domain for which only few samples are available. This is what Chen et al. [39] did, by introducing a novel regularization method involving background depression and the transfer of knowledge from the source domain to the target domain to enhance the fine-tuned detector.

For videos, Misra et al. [243] proposed a semi-supervised framework in which some initial labeled boxes made it possible to iteratively learn and label hundreds of thousands of object instances automatically. Criteria for reliable object detection and tracking constrained the semi-supervised learning process and minimized semantic drift.
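
As a side illustration of the pooling strategies mentioned for weakly supervised detection in Section 4.2.1, the following Python sketch compares global max pooling, global average pooling and a Log-Sum-Exp pooling on a flattened heat map. The Log-Sum-Exp form used here, (1/r) log((1/N) Σ exp(r x_i)), is a common smooth approximation of the maximum; the exact formula and the value of r used in [283] are not given in this survey, so both are assumptions, as are the function names.

```python
import math
from typing import Sequence


def global_max_pool(scores: Sequence[float]) -> float:
    """Image-level score driven by the single strongest location."""
    return max(scores)


def global_avg_pool(scores: Sequence[float]) -> float:
    """Image-level score to which every location contributes equally."""
    return sum(scores) / len(scores)


def lse_pool(scores: Sequence[float], r: float = 5.0) -> float:
    """Log-Sum-Exp pooling: (1/r) * log(mean(exp(r * x))).

    Large r behaves like max pooling, r close to 0 like average pooling.
    The maximum is subtracted before exponentiation for numerical stability.
    """
    m = max(scores)
    mean_exp = sum(math.exp(r * (s - m)) for s in scores) / len(scores)
    return m + math.log(mean_exp) / r


heat_map = [0.1, 0.2, 0.9]  # hypothetical per-location class scores
print(global_avg_pool(heat_map), lse_pool(heat_map), global_max_pool(heat_map))
```

For any positive r the Log-Sum-Exp value lies between the average and the maximum, which is why it is sometimes used as a compromise between the two: strong locations dominate, but weaker ones still receive gradient during training.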

4.2.3 Zero-shot Detection

Zero-shot detection is useful for a system where a large number of classes are to be detected. It is hard to annotate a large number of classes, as the cost of annotation gets higher with more classes. This is a unique type of problem in the object detection domain, as the aim is to classify and localize new categories, without any training examples, during test time, with the constraint that the new categories are semantically related to the objects in the training classes. Therefore, in practice the semantic attributes are available for the unseen classes. The challenges that come with this problem are the following. First, zero-shot learning techniques are restricted to recognizing a single dominant object and not all the object instances present in the image. Second, the background class during fully supervised training may contain objects from unseen classes. The detector will be trained to discriminatively treat these classes as background.

While there is a comparably large literature on zero-shot classification, well covered in the survey [93], zero-shot detection has only a few papers to the best of our knowledge. Zhu et al. [482] proposed a method where semantic features are utilized during training but which is agnostic to semantic information during test time. This means they incorporated semantic attribute information in addition to seen classes during training and generated proposals only, but no identification label, for seen and unseen objects at test time. Rahman et al. [292] proposed a multitask loss that combines max-margin, useful for separating individual classes, and semantic clustering, useful for reducing noise in semantic vectors by positioning similar classes together and dissimilar classes far apart. They used ILSVRC [66], which contains an average of only three objects per image. They also proposed another method for a more general case when unseen classes are not predefined during training. Bansal et al. [10] proposed two background-aware approaches for this task: statically assigning the background image regions into a single background class embedding, and latent assignment based alternating algorithms which associated background to different classes belonging to a large open vocabulary. They used MSCOCO [214] and VisualGenome [179], which contain an average of 7.7 and 35 objects per image respectively. They also set the number of unseen classes to be higher, making their task more complex than in the previous two papers. Since it is quite a new problem, there is no well-defined experimental protocol for this approach. The protocols vary in the number and nature of unseen classes, the use of semantic attribute information of unseen classes during training, the complexity of the visual scene, etc.

4.2.4 Fast and Low Power Detection

There is generally a trade-off between performance and speed (we refer to the comprehensive study of [140] for instance). When one needs real-time detectors, like for video object detection, one loses some precision. However, researchers have been constantly working on improving the precision of fast methods and making precise methods faster. Furthermore, not every setup can have powerful GPUs, so for most industrial applications the detectors have to run on CPUs or on different low power embedded devices like the Raspberry Pi.

Most real-time methods are single stage because they need to perform inference in a quasi fully convolutional manner. The most iconic methods have already been discussed in detail in the rest of the paper [216, 221, 306–308]. Zhou et al. [475] designed a scale transfer module to replace the feature pyramid and thus got a detection network more accurate and faster than YOLOv2. Iandola et al. [147] provided a framework to efficiently compute multi-scale features. Redmon and Angelova [305] used a YOLO-like architecture to provide oriented bounding boxes symbolizing grasps in real time. Shafiee et al. [329] built a faster version of YOLOv2 that runs on embedded devices other than GPUs. Li and Zhou [210] managed to speed up the SSD detector, bringing it to almost 70 fps, using a more lightweight architecture.

In single stage methods most of the computations are found in the backbone networks, so researchers started to design new backbones for detection in order to have fewer operations, like PVANet [170], which built a deep and thin network with fewer channels than its classification counterparts, or SqueezeDet [416], which is similar to YOLO but with more anchors and fewer parameters. Iandola et al. [148] built an AlexNet backbone with 50 times fewer parameters. Howard et al. [134] used depth-wise separable convolutions and point-wise convolutions to build an efficient backbone called MobileNets for image classification and detection. Sandler et al. [324] improved upon it by adding residual connections and removing non-linearities. Very recently, Tan et al. [367] used architecture search to come up with an even more efficient network (1.5 times faster than Sandler et al. [324] and with lower latency). ShuffleNet [461] attained impressive performance on ARM devices, which can sustain only a limited number of computations (40 MFLOPs). Their backbone is 13 times faster than AlexNet.

Finally, Wang et al. [405] proposed PeleeNet, a light network that is 66% of the model size of MobileNet, achieving 76.4% mAP on PASCAL VOC2007 and 22.4% mAP on MS COCO at a speed of 17.1 fps on iPhone 6s and 23.6 fps on iPhone 8. [205] is also very efficient, achieving 72.1% mAP on PASCAL VOC2007 with 0.95M parameters and 1.06B FLOPs.

Fast double-staged methods exist, although the NMS part generally becomes the bottleneck. Among them one can mention, for the second time, Singh et al. [346], which is one of the double-staged methods that researchers have brought to 30 fps by using superclass (sets of similar classes) specific detection. Using a mask obtained by a fast and coarse face detection method, the authors of [37] greatly reduced the computational complexity of their double stage detector at test time by only computing convolutions on non-masked regions. Singh et al. [346] sped up R-FCN by using detection heads that are specific to super classes (sets of similar classes), thus decoupling detection from classification. SNIPER [347] can train on 512x512 images using an adaptive sampling of the regions of interest. Its training can therefore use a larger batch size and be much faster, but it needed 30% more pixels than the original images at inference time, making it slower.

There has also been a lot of work on pruning and/or quantizing the weights of CNNs for image classification [114, 138, 141, 143, 144, 218, 273, 299, 454, 476], but much less in detection so far, although one can find some detection articles that used pruning. Girshick [104] used SVD on the weights of the fully connected layers in Fast R-CNN. Masana et al. [234] pruned near-zero weights in detection networks and extended the compression to be domain-adaptive in Masana et al. [235].

To help the reader better grasp the different accuracy vs speed trade-offs present in the modern methods, we display some of the leading methods on PASCAL-VOC 2007 [88] with their inference speed on one image (batch size of 1) in Figure 20.

It is not only necessary to respect available material constraints (data and machines); detectors have to be reliable too. They must be robust to perturbations, and they can make mistakes, but the mistakes also need to be interpretable, which is a challenge in itself with the millions of weights and the architectural complexity of modern pipelines. It is a good sign to outperform all other methods on a benchmark; it is something else to perform accurately in the wild. That is why we dedicate the following sections to the exploration of such challenges.

4.3 Towards Versatile Object Detectors

So far in this survey, detectors were tested on limited, well-defined benchmarks. Such benchmarks are mandatory to assess their performance. However, in the end we are really interested in their behavior in the wild, where no annotations are present. Detectors have to be robust to unusual situations and one would wish for detectors to be able to evolve by themselves. This section will review the state of deep learning methods w.r.t. these expectations.

Figure 20: Performance on VOC07 with respect to Inference speed on a TitanX GPU. The vertical line
represents the limit of Real-Time Speed (indistinguishable from continuous motion for the human eye). We
also added in light gray some relevant work measured on similar devices (K40, TITAN Xp, Jetson TX2).
Only RefineDet [460], DES [467] and STDN [475] are simultaneously real-time and above 80% in mAP
although for some of them (DES, STDN) better hardware (TITAN Xp) must have helped.
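
To give a rough feeling for why several of the light backbones cited in Section 4.2.4 are cheaper, the Python sketch below counts the weights of a standard convolution against those of a depth-wise separable one (a depth-wise convolution followed by a point-wise 1x1 convolution), the building block used in MobileNets [134]. The layer sizes are hypothetical and biases are ignored; the counts are standard arithmetic, not figures taken from the cited papers.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a kernel x kernel convolution mapping c_in to c_out channels."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise (one kernel x kernel filter per input channel)
    plus a point-wise (1x1, c_in to c_out) convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 input and 256 output channels.
k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, sep / std)  # the ratio equals 1/c_out + 1/k**2
```

For a 3x3 kernel the ratio is close to 1/9 as soon as the number of output channels is large, so most of the saving comes from replacing the dense spatial filtering by per-channel filtering.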

Figure 21: On the left side we display an example of guided backpropagation to visualize the patterns that make the neurons fire, from [355], and on the right side we show the gradient mask approach to find important zones for a classifier on an image, from [91], which can lead to bad surprises (the network uses the spoon as a proxy for the presence of coffee).

4.3.1 Interpretability and Robustness

With the recent craze about self-driving cars, it has become a top priority to build detectors that can be trusted with our lives. Hence, detectors should be robust to physical adversarial attacks [43, 226] and weather conditions, which was the reason for building KITTI [98] and DETRAC [411] back then and has now led to the creation of two amazingly huge car detection datasets: ApolloScape from Baidu [440] and BDD100K from Berkeley [440]. The driving conditions of the real world are very complex: changing environments, reflections, different traffic signs and rules for different countries. So far, this open problem is largely unsolved, even if some industry players seem to be confident enough to leave self-driving cars without safety nets in specific cities. It will surely involve at some point the heavy use of synthetic data, otherwise it would take a lifetime to gather the data necessary to be confident enough. To finish on a positive note, detectors in self-driving cars can benefit from multi-sensory inputs such as LiDAR point clouds [124], other lasers and multiple cameras, which can help disambiguate certain difficult situations (reflections on the cars in front of it for instance).

But most of all, the detectors should incorporate a certain level of interpretability so that if a dramatic failure happens it can be understood and fixed. It is also needed for legal matters. Very few works have done so because it requires delving into the feature maps of the backbone network. A few works proposed different approaches for classification only, but no consensus has been reached yet. Among the popular methods one can cite the gradient map in the image space of Simonyan et al. [344], the occlusion analysis of Zeiler and Fergus [449], the guided backpropagation of Springenberg et al. [355] and, recently, the perturbation approach of Fong and Vedaldi [91]. Figure 21 shows the insights gained by using two of the mentioned methods on a classifier.

No method exists yet for object detectors to the best of our knowledge. It would be a very interesting research direction for future works.

4.3.2 Universal Detector, Lifelong Learning

Having object detectors able to iteratively, and without any supervision, learn to detect novel object classes and improve their performance would

be one of the Holy Grails of computer vision. This NNs entirely. However, the Achilles heel of deep-
can have the form of lifelong learning, where the goal learning methods is their interpretability and trust-
is to sequentially retain learned knowledge and to worthiness. The object detection community seems
selectively transfer the knowledge when learning focused on improving the performances on static
a new task, as defined in [341]. Or never ending benchmarks instead of finding ways to better un-
learning [245], where the system has sufficient self- derstand the behavior of DCNNs. It is under-
reflection to avoid plateaus in performances and standable but it shows that Deep Learning has
can decide how to progress by itself. However, one not yet reached full maturity. Eventually, one
of the biggest issues with current detectors is they can hope that the performances of new detectors
suffer from catastrophic forgetting, as say Castro will plateau and when it does, researchers will be
et al. [33]. It means their performance decreases forced to come back to the basics and focus instead
when new classes are added incrementally. Some on interpretability and robustness before the next
authors tried to face this challenge. For exam- paradigm washes off deep-learning entirely.
ple, the knowledge distillation loss introduced by
Li and Hoiem [209] allows to forget old data while
using previous models to constraint updated ones
during learning. In the domain of object detec-
tion, the only recent contribution we are aware of
is the incremental learning approach of Shmelkov
et al. [336], relying on a distillation mechanism. 5 Conclusions
Lifelong learning and never ending learning are do-
mains where a lot still have to be discovered or
developed. Object detection in images, a key topic attracting
a substantial part of the computer vision commu-
nity, has been revolutionized by the recent arrival of
4.4 Concluding Remarks
convolutional neural networks, which swept all the
It seems that deep learning in its current form is methods previously dominating the field. This ar-
not yet fully ready to be applied to other modal- ticle provides a comprehensive survey of what hap-
ities than 2D images: in videos, temporal consis- pened in the domain since 2012. It shows that, even
tency is hard to take into account with DCNNs be- if top-performing methods concentrate around two
cause 3D convolutions are expensive, tubelets and main alternatives – single stage methods such as
tracklets are interesting ideas but lack the elegance SSD or YOLO, or two stages methods in the foot-
of DCNNs on still images. For point clouds the steps of Faster RCNN – the domain is still very
picture is even worse. The voxelisation of point active. Graph networks, GANs, context, small ob-
clouds does not deal with their inherent sparsity jects, domain adaptation, occlusions, etc. are the
and create memory issues and even the simplicity directions that are actively studied in the context
and originality of the PointNet articles Qi et al. of object detection. Extension of object detection
[288, 290] that leaves the point clouds untouched to other modalities, such as videos or 3D point
has not matured enough yet to be widely adopted clouds, as well as constraints, such as weak super-
by the community. Hopefully, dealing with other vision is also very active and has been addressed.
constraints like weak supervision or few training The appendix of this survey also provides a very
images is starting to produce worthy results with- complete list of the public datasets available to the
out too much change to the original DCNN archi- community and highlights top performing methods
tectures [76, 97, 369, 400, 462–464]. It seems to be on these datasets. We believe this article will be
only a matter of refining cost functions and coming- useful to better understand the recent progress and
up with more building blocks than reinventing DC- the bigger picture of this constantly moving field.

53
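As an illustration of the knowledge distillation idea mentioned in the lifelong-learning discussion above, the following is a minimal sketch and not the exact formulation of [209] or [336]. It assumes a classification-style setting in which a frozen copy of the old model provides soft targets for the previously learned classes. The function names (`softmax`, `distillation_loss`, `incremental_loss`), the temperature of 2.0, the weighting factor and the example logits are all hypothetical choices made for this sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between the softened outputs of the frozen old model
    (used as targets) and those of the updated model, on the old classes."""
    targets = softmax(old_logits, temperature)
    preds = softmax(new_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, preds))


def incremental_loss(new_task_loss, old_logits, new_logits, weight=1.0):
    """Loss on the new classes plus a weighted distillation penalty
    (the weight is an assumed hyperparameter)."""
    return new_task_loss + weight * distillation_loss(old_logits, new_logits)


# Hypothetical logits for three previously learned classes on one sample.
old = [2.0, 0.5, -1.0]
unchanged = [2.0, 0.5, -1.0]  # updated model still agrees with the old one
drifted = [-1.0, 0.5, 2.0]    # updated model has reversed the old ranking

# The drifted model incurs the larger distillation penalty.
print(distillation_loss(old, unchanged) < distillation_loss(old, drifted))
```

The penalty depends only on the outputs of the old model, not on the data it was trained on, which is why such a term can constrain the updated model without keeping the old training set.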
References

[1] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: training resnet-50 on imagenet in 15 minutes. CoRR, abs/1711.04325, 2017. URL http://arxiv.org/abs/1711.04325.

[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 73–80, 2010.

[3] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.

[4] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars M. Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV), 126(9):961–972, 2018.

[5] Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, and Alexander C. Berg. A dataset for developing and benchmarking active vision. IEEE International Conference on Robotics and Automation (ICRA), cs.CV, 2017.

[6] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit S Ogale, and Dave Ferguson. Real-time pedestrian detection with deep network cascades. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, volume 2, page 4, 2015.

[7] Antreas Antoniou, Amos J. Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. CoRR, abs/1711.04340, 2017. URL http://arxiv.org/abs/1711.04340.

[8] Seung-Hwan Bae, Youngwan Lee, Youngjoo Jo, Yuseok Bae, and Joong-won Hwang. Rank of experts: Detection network ensemble. CoRR, abs/1712.00185, 2017. URL http://arxiv.org/abs/1712.00185.

[9] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, page 16, 2018.

[10] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. CoRR, abs/1804.04340, 2018. URL http://arxiv.org/abs/1804.04340.

[11] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.

[12] Loris Bazzani, Alessandro Bergamo, Dragomir Anguelov, and Lorenzo Torresani. Self-taught object localization with deep networks. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–9, 2016. URL https://doi.org/10.1109/WACV.2016.7477688.

[13] Karsten Behrendt and Libor Novak. A Deep Learning Approach to Traffic Lights: Detection, Tracking, and Classification. In
Robotics and Automation (ICRA), 2017 IEEE International Conference On, 2017.

[14] Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[15] Jorge Beltrán, Carlos Guindel, Francisco Miguel Moreno, Daniel Cruzado, Fernando García, and Arturo de la Escalera. Birdnet: a 3d object detection framework from lidar information. CoRR, abs/1805.01195, 2018. URL http://arxiv.org/abs/1805.01195.

[16] Rodrigo Benenson, Markus Mathias, Radu Timofte, and Luc Van Gool. Pedestrian detection at 100 frames per second. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 2903–2910, 2012.

[17] Simone Bianco, Marco Buzzelli, Davide Mazzini, and Raimondo Schettini. Deep Learning for Logo Recognition. Neurocomputing, 245:23–30, July 2017.

[18] Hakan Bilen and Andrea Vedaldi. Weakly Supervised Deep Detection Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[19] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, June 2015.

[20] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z. Li. Fine-grained evaluation on face detection in the wild. In Automatic Face and Gesture Recognition (FG), pages 1–7, 2015.

[21] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms - improving object detection with one line of code. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5562–5570, 2017.

[22] Lubomir Bourdev, Subhransu Maji, Thomas Brox, and Jitendra Malik. Detecting people using mutually consistent poselet activations. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 168–181, 2010.

[23] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 95–104, 2017.

[24] Samarth Brahmbhatt, Henrik I. Christensen, and James Hays. StuffNet - Using 'Stuff' to Improve Object Detection. In IEEE Winter Conf. on Applications of Computer Vision (WACV), 2017.

[25] Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M. Gavrila. The eurocity persons dataset: A novel benchmark for object detection. CoRR, abs/1805.07193, 2018. URL http://arxiv.org/abs/1805.07193.

[26] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2223–2231. IEEE Computer Society, 2017.

[27] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pages 6154–6162, 2018. doi: 10.1109/CVPR.2018.00644.

[28] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 354–370, 2016.

[29] Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, and Jinjian Wu. Feature-fused SSD: fast detection for small objects. CoRR, abs/1709.05054, 2017. URL http://arxiv.org/abs/1709.05054.

[30] Joao Carreira and Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3241–3248, 2010.

[31] Joao Carreira and Cristian Sminchisescu. Cpmc: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1312–1328, 2011.

[32] Lluís Castrejón, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4485–4493, 2017. doi: 10.1109/CVPR.2017.477.

[33] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-End Incremental Learning. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, 2018.

[34] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1827–1836, 2017.

[35] Karanbir Singh Chahal and Kuntal Dey. A survey of modern object detection literature using deep learning. CoRR, 2018.

[36] Chenyi Chen, Ming-Yu Liu, Oncel Tuzel, and Jianxiong Xiao. R-CNN for Small Object Detection. Computer Vision - ACCV 2016 - 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, 10115:214–230, 2016.

[37] D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, 2016.

[38] Guang Chen, Yuanyuan Ding, Jing Xiao, and Tony X Han. Detection evolution with multi-order contextual co-occurrence. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1798–1805, 2013.

[39] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16778.

[40] Kai Chen, Hang Song, Chen Change Loy, and Dahua Lin. Discover and Learn New Objects from Documentaries. In 2017 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1111–1120, July 2017.

[41] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. CoRR, abs/1804.05472, 2018. URL http://arxiv.org/abs/1804.05472.

[42] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018. URL https://doi.org/10.1109/TPAMI.2017.2699184.

[43] Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Chau. Robust physical adversarial attack on faster R-CNN object detector. CoRR, abs/1804.05810, 2018. URL http://arxiv.org/abs/1804.05810.

[44] X. Chen, K. Kundu, Z. Zhang, H. Ma, and S. Fidler. Monocular 3d object detection for autonomous driving. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[45] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G. Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 424–432, 2015. URL http://papers.nips.cc/paper/5644-3d-object-proposals-for-accurate-object-class-detection.

[46] Xiaozhi Chen, Huimin Ma, Xiang Wang, and Zhichen Zhao. Improving object proposals with multi-thresholding straddling expansion. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[47] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6526–6534. IEEE Computer Society, 2017.

[48] Xinlei Chen and Abhinav Gupta. Spatial Memory for Context Reasoning in Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.

[49] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster R-CNN for object detection in the wild. CoRR, abs/1803.03243, 2018. URL http://arxiv.org/abs/1803.03243.

[50] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4467–4475, 2017.

[51] Yunpeng Chen, Jianshu Li, Bin Zhou, Jiashi Feng, and Shuicheng Yan. Weaving multi-scale context for single shot detector. CoRR, abs/1712.03149, 2017. URL http://arxiv.org/abs/1712.03149.

[52] G. Cheng, P. Zhou, and J. Han. RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object
Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[53] Gong Cheng and Junwei Han. A Survey on Object Detection in Optical Remote Sensing Images. ISPRS Journal of Photogrammetry and Remote Sensing, 117:11–28, 2016.

[54] Gong Cheng, Peicheng Zhou, and Junwei Han. Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405–7415, 2016.

[55] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 551–561, 2016.

[56] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 3286–3293, 2014.

[57] François Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1800–1807, 2017.

[58] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[59] Gabriela Csurka. A comprehensive survey on domain adaptation for visual applications. In Gabriela Csurka, editor, Domain Adaptation in Computer Vision Applications, Advances in Computer Vision and Pattern Recognition, pages 1–35. Springer, 2017. URL https://doi.org/10.1007/978-3-319-58347-1_1.

[60] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018. URL http://arxiv.org/abs/1805.09501.

[61] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3150–3158, 2016.

[62] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 379–387, 2016.

[63] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 764–773. IEEE Computer Society, 2017.

[64] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, volume 1, pages 886–893, 2005.

[65] Manolis Delakis and Christophe Garcia. Text detection with convolutional neural networks. In International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAP), pages 290–294, 2008.

[66] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255, 2009.

[67] Zhipeng Deng, Hao Sun, Shilin Zhou, Juanping Zhao, and Huanxin Zou. Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10:3652–3664, 2017.

[68] Zhuo Deng and Longin Jan Latecki. Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes from 2D Ones in RGB-Depth Images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 398–406, 2017.

[69] Terrance Devries and Graham W. Taylor. Dataset augmentation in feature space. CoRR, abs/1702.05538, 2017. URL http://arxiv.org/abs/1702.05538.

[70] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.

[71] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.

[72] Piotr Dollár, Ron Appel, Serge J. Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.

[73] Xuanyi Dong, Liang Zheng, Fan Ma, Yi Yang, and Deyu Meng. Few-shot object detection. CoRR, abs/1706.08249, 2017. URL http://arxiv.org/abs/1706.08249.

[74] Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.

[75] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[76] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.

[77] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling Visual Context is Key to Augmenting Object Detection Datasets. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, page 18, 2018.

[78] D. Dwibedi. Synthesizing scenes for instance detection. Master’s thesis, Carnegie Mellon University, 2017.

[79] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1310–1319. IEEE Computer Society, 2017.

[80] Christian Eggert, Dan Zecha, Stephan Brehm, and Rainer Lienhart. Improving small object proposals for company logo detection. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 167–174, 2017.

[81] Ian Endres and Derek Hoiem. Category independent object proposals. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 575–588, 2010.

[82] Ian Endres and Derek Hoiem. Category-independent object proposals with diverse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):222–234, 2014.

[83] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. In IEEE International Conference on Robotics and Automation (ICRA), 2017.

[84] Markus Enzweiler and Dariu M Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2179–2195, 2008.

[85] Markus Enzweiler and Dariu M. Gavrila. A multilevel mixture-of-experts framework for pedestrian classification. IEEE Transactions on Image Processing, 20(10):2967–2979, 2011.

[86] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable Object Detection Using Deep Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014.

[87] Andreas Ess, Bastian Leibe, and Luc Van Gool. Depth and appearance for mobile scene analysis. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007.

[88] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.

[89] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3038–3046, 2017.

[90] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

[91] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3449–3457. IEEE Computer Society, 2017.

[92] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017. URL http://arxiv.org/abs/1701.06659.

[93] Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125, 2018.

[94] A Gaidon, Q Wang, Y Cabon, and E Vig. Virtual worlds as proxy for multi-object tracking analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[95] Mingfei Gao, Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. Dynamic zoom-in network for fast object detection in large images. CoRR, abs/1711.05187, 2017. URL http://arxiv.org/abs/1711.05187.

[96] Christophe Garcia and Manolis Delakis. A neural architecture for fast and robust face detection. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 44–47, 2002.

[97] Weifeng Ge, Sibei Yang, and Yizhou Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[98] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 3354–3361, 2012.

[99] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In Nancy M. Amato, Siddhartha S. Srinivasa, Nora Ayanian, and Scott Kuindersma, editors, Robotics: Science and Systems XIII, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, July 12-16, 2017, 2017. URL http://www.roboticsproceedings.org/rss13/p43.html.

[100] David Gerónimo, Angel Domingo Sappa, Antonio López, and Daniel Ponsa. Adaptive image sampling and windows classification for on-board pedestrian detection. In Proceedings of the 5th International Conference on Computer Vision Systems (ICVS 2007), 2007.

[101] Spyridon Gidaris and Nikos Komodakis. Attend Refine Repeat - Active Box Proposal Generation via In-Out Localization. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.

[102] Spyros Gidaris and Nikos Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1134–1142, 2015.

[103] Spyros Gidaris and Nikos Komodakis. LocNet: Improving Localization Accuracy for Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.

[104] Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1440–1448, 2015.

[105] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587, 2014.

[106] Ross B. Girshick, Forrest N. Iandola, Trevor Darrell, and Jitendra Malik. Deformable part models are convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[107] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 759–768, 2015. doi: 10.1109/CVPR.2015.7298676.

[108] Abel Gonzalez-Garcia, Davide Modolo, and Vittorio Ferrari. Objects as context for part detection. CoRR, abs/1703.09529, 2017. URL http://arxiv.org/abs/1703.09529.

[109] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[110] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1319–1327, 2013. URL http://jmlr.org/proceedings/papers/v28/goodfellow13.html.

[111] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

[112] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic Data for Text Localisation in Natural Images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2315–2324, June 2016.

[113] Saurabh Gupta, Bharath Hariharan, and Jitendra Malik. Exploring person context and local scene context for object detection. CoRR, abs/1511.08177, 2015. URL http://arxiv.org/abs/1511.08177.

[114] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015. URL http://arxiv.org/abs/1510.00149.

[115] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S. Huang. Seq-nms for video object detection. CoRR, abs/1602.08465, 2016. URL http://arxiv.org/abs/1602.08465.

[116] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.

[117] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.

[118] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988, 2017.

[119] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.

[120] Geremy Heitz and Daphne Koller. Learning Spatial Context - Using Stuff to Find Things. In Computer Vision - ECCV 2008, 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Berlin, Heidelberg, 2008.

[121] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In Computer Vision - ACCV 2016 - 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, pages 198–213, 2016.

[122] João F. Henriques and Andrea Vedaldi. Warped Convolutions - Efficient Invariance to Spatial Transformations. International Conference on Machine Learning (ICML), 2017.

[123] Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. CoRR, abs/1712.05896, 2017. URL http://arxiv.org/abs/1712.05896.

[126] Erik Hjelmås and Boon Kee Low. Face Detection: A Survey. Computer Vision and Image Understanding (CVIU), 83(3):236–274, September 2001.

[127] Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. Lsda: Large scale detection through adaptation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3536–3544, 2014.

[128] Judy Hoffman, Deepak Pathak, Trevor Darrell, and Kate Saenko. Detector discovery in the wild: Joint multiple instance and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 2883–2891, 2015.

[129] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, pages 340–353, 2012.

[130] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. A convnet for non-maximum suppression. In German Conference on Pattern Recognition, pages 192–204, 2016.

[124] Michael Himmelsbach, Andre Mueller,
Thorsten Lüttel, and Hans-Joachim [131] Jan Hendrik Hosang, Rodrigo Benenson, and
Wünsche. Lidar-based 3d object per- Bernt Schiele. How good are detection pro-
ception. In Proceedings of 1st international posals, really? In British Machine Vision
workshop on cognition for technical systems, Conference, BMVC 2014, Nottingham, UK,
volume 1, 2008. September 1-5, 2014, 2014.

[125] Stefan Hinterstoisser, Vincent Lepetit, Paul [132] Jan Hendrik Hosang, Rodrigo Benenson, and
Wohlhart, and Kurt Konolige. On pre- Bernt Schiele. Learning non-maximum sup-
trained image features and synthetic images pression. In 2017 IEEE Conference on Com-
for deep learning. CoRR, abs/1710.10710, puter Vision and Pattern Recognition, CVPR
2017. URL http://arxiv.org/abs/1710. 2017, Honolulu, HI, USA, July 21-26, 2017,
10710. pages 6469–6477, 2017.

[133] Sebastian Houben, Johannes Stallkamp, Jan [140] Jonathan Huang, Vivek Rathod, Chen
Salmen, Marc Schlipsing, and Christian Igel. Sun, Menglong Zhu, Anoop Korattikara,
Detection of traffic signs in real-world images: Alireza Fathi, Ian Fischer, Zbigniew Wo-
The German Traffic Sign Detection Bench- jna, Yang Song, Sergio Guadarrama, et al.
mark. In International Joint Conference on Speed/accuracy trade-offs for modern con-
Neural Networks, number 1288, 2013. volutional object detectors. In 2017 IEEE
Conference on Computer Vision and Pat-
[134] Andrew G. Howard, Menglong Zhu, tern Recognition, CVPR 2017, Honolulu, HI,
Bo Chen, Dmitry Kalenichenko, Wei- USA, July 21-26, 2017, 2017.
jun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. Mo- [141] Qiangui Huang, Shaohua Kevin Zhou, Suya
bilenets: Efficient convolutional neural You, and Ulrich Neumann. Learning to prune
networks for mobile vision applications. filters in convolutional neural networks. In
CoRR, abs/1704.04861, 2017. URL 2018 IEEE Winter Conference on Applica-
http://arxiv.org/abs/1704.04861. tions of Computer Vision, WACV 2018, Lake
Tahoe, NV, USA, March 12-15, 2018, pages
[135] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng
709–718. IEEE Computer Society, 2018.
Dai, and Yichen Wei. Relation networks for
object detection. In Computer Vision and [142] Xun Huang, Ming-Yu Liu, Serge J. Be-
Pattern Recognition (CVPR), 2018 IEEE longie, and Jan Kautz. Multimodal unsu-
Conference on, June 2018. pervised image-to-image translation. CoRR,
[136] Jie Hu, Li Shen, and Gang Sun. Squeeze-and- abs/1804.04732, 2018. URL http://arxiv.
excitation networks. CoRR, abs/1709.01507, org/abs/1804.04732.
2017. URL http://arxiv.org/abs/1709. [143] Itay Hubara, Matthieu Courbariaux, Daniel
01507. Soudry, Ran El-Yaniv, and Yoshua Ben-
[137] Peiyun Hu and Deva Ramanan. Finding tiny gio. Binarized neural networks. In Advances
faces. In 2017 IEEE Conference on Com- in Neural Information Processing Systems
puter Vision and Pattern Recognition, CVPR 29: Annual Conference on Neural Informa-
2017, Honolulu, HI, USA, July 21-26, 2017, tion Processing Systems 2016, December 5-
pages 1522–1530. IEEE Computer Society, 10, 2016, Barcelona, Spain, pages 4107–4115,
2017. 2016.

[138] Gao Huang, Shichen Liu, Laurens van der [144] Itay Hubara, Matthieu Courbariaux, Daniel
Maaten, and Kilian Q. Weinberger. Con- Soudry, Ran El-Yaniv, and Yoshua Bengio.
densenet: An efficient densenet using learned Quantized neural networks: Training neural
group convolutions. CoRR, abs/1711.09224, networks with low precision weights and ac-
2017. URL http://arxiv.org/abs/1711. tivations. The Journal of Machine Learning
09224. Research, 18(1):6869–6898, 2017.

[139] Gao Huang, Zhuang Liu, Laurens Van [145] Ahmad Humayun, Fuxin Li, and James M
Der Maaten, and Kilian Q Weinberger. Rehg. Rigor: Reusing inference in graph
Densely connected convolutional networks. cuts for generating object regions. In 2014
In 2017 IEEE Conference on Computer Vi- IEEE Conference on Computer Vision and
sion and Pattern Recognition, CVPR 2017, Pattern Recognition, CVPR 2014, Columbus,
Honolulu, HI, USA, July 21-26, 2017, vol- OH, USA, June 23-28, 2014, pages 336–343,
ume 1, page 3, 2017. 2014.

[146] Brody Huval, Adam Coates, and Andrew Y. Conference on Machine Learning, ICML
Ng. Deep learning for class-generic object 2015, Lille, France, 6-11 July 2015, pages
detection. CoRR, abs/1312.6885, 2013. URL 448–456, 2015. URL http://jmlr.org/
http://arxiv.org/abs/1312.6885. proceedings/papers/v37/ioffe15.html.
[147] Forrest N. Iandola, Matthew W. Moskewicz, [153] Max Jaderberg, Karen Simonyan, Andrea
Sergey Karayev, Ross B. Girshick, Trevor Vedaldi, and Andrew Zisserman. Syn-
Darrell, and Kurt Keutzer. Densenet: Im- thetic data and artificial neural networks
plementing efficient convnet descriptor pyra- for natural scene text recognition. CoRR,
mids. CoRR, abs/1404.1869, 2014. URL abs/1406.2227, 2014. URL http://arxiv.
http://arxiv.org/abs/1404.1869. org/abs/1406.2227.

[148] Forrest N Iandola, Song Han, Matthew W [154] Max Jaderberg, Karen Simonyan, and An-
Moskewicz, Khalid Ashraf, William J Dally, drew Zisserman. Spatial transformer net-
and Kurt Keutzer. Squeezenet: Alexnet-level works. In Advances in Neural Information
accuracy with 50x fewer parameters and <0.5 Processing Systems 28: Annual Conference
MB model size. CoRR, abs/1602.07360v3, on Neural Information Processing Systems
2016. URL http://arxiv.org/abs/1602. 2015, December 7-12, 2015, Montreal, Que-
07360v3. bec, Canada, 2015.

[149] Hiroshi Inoue. Data augmentation by pair- [155] Vidit Jain and Erik Learned-Miller. FDDB:
ing samples for images classification. CoRR, A Benchmark for Face Detection in Uncon-
abs/1801.02929, 2018. URL http://arxiv. strained Settings. UM-CS-2010-009, Univer-
org/abs/1801.02929. sity of Massachusetts Amherst, 2010.

[150] Naoto Inoue, Ryosuke Furuta, Toshihiko Ya- [156] Jisoo Jeong, Hyojin Park, and Nojun Kwak.
masaki, and Kiyoharu Aizawa. Cross-domain Enhancement of SSD by concatenating fea-
weakly-supervised object detection through ture maps for object detection. CoRR,
progressive domain adaptation. CoRR, abs/1705.09587, 2017. URL http://arxiv.
abs/1803.11365, 2018. URL http://arxiv. org/abs/1705.09587.
org/abs/1803.11365. [157] Saurav Jha, Nikhil Agarwal, and Suneeta
Agarwal. Towards improved cartoon face
[151] Sergey Ioffe. Batch renormalization: Towards
detection and recognition systems. CoRR,
reducing minibatch dependence in batch-
abs/1804.01753, 2018. URL http://arxiv.
normalized models. In Advances in Neural
org/abs/1804.01753.
Information Processing Systems 30: Annual
Conference on Neural Information Process- [158] Borui Jiang, Ruixuan Luo, Jiayuan Mao,
ing Systems 2017, 4-9 December 2017, Long Tete Xiao, and Yuning Jiang. Acquisition
Beach, CA, USA, pages 1942–1950, 2017. of localization confidence for accurate ob-
URL http://papers.nips.cc/paper/ ject detection. CoRR, abs/1807.11590, 2018.
6790-batch-renormalization-towards- URL http://arxiv.org/abs/1807.11590.
reducing-minibatch-dependence-in-
batch-normalized-models. [159] Yingying Jiang, Xiangyu Zhu, Xiaobing
Wang, Shuli Yang, Wei Li, Hua Wang, Pei
[152] Sergey Ioffe and Christian Szegedy. Batch Fu, and Zhenbo Luo. R2CNN: rotational re-
normalization: Accelerating deep network gion CNN for orientation robust scene text
training by reducing internal covariate shift. detection. CoRR, abs/1706.09579, 2017.
In Proceedings of the 32nd International URL http://arxiv.org/abs/1706.09579.

[160] Alexis Joly and Olivier Buisson. Logo re- [166] Tero Karras, Timo Aila, Samuli Laine,
trieval with a contrario visual query expan- and Jaakko Lehtinen. Progressive grow-
sion. In Wen Gao, Yong Rui, Alan Hanjalic, ing of GANs for improved quality, stabil-
Changsheng Xu, Eckehard G. Steinbach, Ab- ity, and variation. In International Con-
dulmotaleb El-Saddik, and Michelle X. Zhou, ference on Learning Representations, 2018.
editors, Proceedings of the 17th International URL https://openreview.net/forum?id=
Conference on Multimedia 2009, Vancouver, Hk99zCeAb.
British Columbia, Canada, October 19-24,
2009, pages 581–584. ACM, 2009. [167] Harish Katti, Marius V. Peelen, and S. P.
Arun. Object detection can be improved us-
[161] Kinjal A Joshi and Darshak G Thakore. ing human-derived contextual expectations.
A Survey on Moving Object Detection and CoRR, abs/1611.07218, 2016. URL http:
Tracking in Video Surveillance System. In- //arxiv.org/abs/1611.07218.
ternational Journal of Soft Computing and
[168] Gil Keren, Maximilian Schmitt, Thomas
Engineering (IJSCE), 2(3):5, 2012.
Kehrenberg, and Björn W. Schuller. Weakly
[162] Hongwen Kang, Martial Hebert, Alexei A supervised one-shot detection with attention
Efros, and Takeo Kanade. Data-driven ob- siamese networks. CoRR, abs/1801.03329,
jectness. IEEE Transactions on Pattern 2018. URL http://arxiv.org/abs/1801.
Analysis and Machine Intelligence, (1):189– 03329.
195, 2015. [169] Aditya Khosla, Tinghui Zhou, Tomasz Mal-
[163] Kai Kang, Wanli Ouyang, Hongsheng Li, and isiewicz, Alexei A Efros, and Antonio Tor-
Xiaogang Wang. Object detection from video ralba. Undoing the damage of dataset bias.
tubelets with convolutional neural networks. In Computer Vision - ECCV 2012 - 12th Eu-
In 2016 IEEE Conference on Computer Vi- ropean Conference on Computer Vision, Flo-
sion and Pattern Recognition, CVPR 2016, rence, Italy, October 7-13, 2012, pages 158–
Las Vegas, NV, USA, June 27-30, 2016, pages 171, 2012.
817–825, 2016. [170] Kye-Hyeon Kim, Yeongjae Cheon, Sanghoon
Hong, Byung-Seok Roh, and Minje Park.
[164] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu
PVANET: deep but lightweight neural net-
Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe
works for real-time object detection. CoRR,
Wang, Ruohui Wang, Xiaogang Wang, and
abs/1608.08021, 2016. URL http://arxiv.
Wanli Ouyang. T-CNN: Tubelets with Con-
org/abs/1608.08021.
volutional Neural Networks for Object De-
tection from Videos. IEEE Transactions on [171] Diederik P. Kingma and Jimmy Ba. Adam:
Circuits and Systems for Video Technology, A method for stochastic optimization. CoRR,
pages 1–1, 2017. abs/1412.6980, 2014. URL http://arxiv.
org/abs/1412.6980.
[165] D. Karatzas, L. Gomez-Bigorda, A. Nico-
laou, S. Ghosh, A. Bagdanov, M. Iwamura, [172] Brendan F Klare, Ben Klein, Emma
J. Matas, L. Neumann, V. R. Chandrasekhar, Taborsky, Austin Blanton, Jordan Cheney,
S. Lu, F. Shafait, S. Uchida, and E. Valveny. Kristen Allen, Patrick Grother, Alan Mah,
Icdar 2015 competition on robust reading. In and Anil K Jain. Pushing the frontiers of
2015 13th International Conference on Doc- unconstrained face detection and recognition:
ument Analysis and Recognition (ICDAR), Iarpa janus benchmark a. In IEEE Confer-
pages 1156–1160, Aug 2015. ence on Computer Vision and Pattern Recog-

nition, CVPR 2015, Boston, MA, USA, June Belongie, Victor Gomes, Abhinav Gupta,
7-12, 2015, pages 1931–1939, 2015. Chen Sun, Gal Chechik, David Cai, Zheyun
Feng, Dhyanesh Narayanan, and Kevin
[173] Iasonas Kokkinos. Ubernet: Training a uni- Murphy. Openimages: A public dataset
versal convolutional neural network for low- for large-scale multi-label and multi-class
, mid-, and high-level vision using diverse image classification. Dataset available from
datasets and limited memory. In 2017 IEEE https://storage.googleapis.com/openimages/web/index.html,
Conference on Computer Vision and Pat- 2017.
tern Recognition, CVPR 2017, Honolulu, HI,
USA, July 21-26, 2017, pages 5454–5463. [179] Ranjay Krishna, Yuke Zhu, Oliver Groth,
IEEE Computer Society, 2017. Justin Johnson, Kenji Hata, Joshua Kravitz,
[174] Tao Kong, Anbang Yao, Yurong Chen, and Stephanie Chen, Yannis Kalantidis, Li-Jia
Fuchun Sun. HyperNet: Towards Accurate Li, David A. Shamma, Michael S. Bernstein,
Region Proposal Generation and Joint Ob- and Li Fei-Fei. Visual genome: Connect-
ject Detection. In 2016 IEEE Conference on ing language and vision using crowdsourced
Computer Vision and Pattern Recognition, dense image annotations. International Jour-
CVPR 2016, Las Vegas, NV, USA, June 27- nal of Computer Vision (IJCV), 123(1):32–
30, 2016, April 2016. 73, 2017.

[175] Tao Kong, Fuchun Sun, Anbang Yao, Huap- [180] Alex Krizhevsky. One weird trick for
ing Liu, Ming Lu, and Yurong Chen. RON: parallelizing convolutional neural networks.
reverse connection with objectness prior net- CoRR, abs/1404.5997, 2014. URL http:
works for object detection. In 2017 IEEE //arxiv.org/abs/1404.5997.
Conference on Computer Vision and Pat-
tern Recognition, CVPR 2017, Honolulu, HI, [181] Alex Krizhevsky, Ilya Sutskever, and Ge-
USA, July 21-26, 2017, pages 5244–5252. offrey E. Hinton. Imagenet classification
IEEE Computer Society, 2017. with deep convolutional neural networks. In
Peter L. Bartlett, Fernando C. N. Pereira,
[176] Tao Kong, Fuchun Sun, Wen-bing Huang, Christopher J. C. Burges, Léon Bottou,
and Huaping Liu. Deep feature pyramid re- and Kilian Q. Weinberger, editors, Ad-
configuration for object detection. CoRR, vances in Neural Information Processing
abs/1808.07993, 2018. URL http://arxiv. Systems 25: 26th Annual Conference on
org/abs/1808.07993. Neural Information Processing Systems
[177] Martin Kostinger, Paul Wohlhart, Peter M. 2012. Proceedings of a meeting held De-
Roth, and Horst Bischof. Annotated Fa- cember 3-6, 2012, Lake Tahoe, Nevada,
cial Landmarks in the Wild: A large-scale, United States, pages 1106–1114, 2012. URL
real-world database for facial landmark local- http://papers.nips.cc/paper/4824-
ization. In First IEEE International Work- imagenet-classification-with-deep-
shop on Benchmarking Facial Image Analysis convolutional-neural-networks.
Technologies, pages 2144–2151, 2011.
[182] Jason Ku, Melissa Mozifian, Jungwook Lee,
[178] Ivan Krasin, Tom Duerig, Neil Alldrin, Ali Harakeh, and Steven Lake Waslander.
Vittorio Ferrari, Sami Abu-El-Haija, Alina Joint 3d proposal generation and object
Kuznetsova, Hassan Rom, Jasper Uijlings, detection from view aggregation. CoRR,
Stefan Popov, Shahab Kamali, Matteo Mal- abs/1712.02294, 2017. URL http://arxiv.
loci, Jordi Pont-Tuset, Andreas Veit, Serge org/abs/1712.02294.

[183] Krishna Kumar Singh, Fanyi Xiao, and Yong 2016, Las Vegas, NV, USA, June 27-30,
Jae Lee. Track and transfer: Watching videos 2016, pages 289–297. IEEE Computer Soci-
to simulate strong human supervision for ety, 2016.
weakly-supervised object detection. In 2016
IEEE Conference on Computer Vision and [189] Hei Law and Jia Deng. Cornernet: Detecting
Pattern Recognition, CVPR 2016, Las Ve- objects as paired keypoints. In Computer Vi-
gas, NV, USA, June 27-30, 2016, pages 3548– sion - ECCV 2018 - 15th European Confer-
3556, 2016. ence, Munich, Germany, September 8 - 14,
2018, 2018.
[184] Weicheng Kuo, Bharath Hariharan, and Ji-
tendra Malik. Deepbox: Learning objectness [190] Yann LeCun, Léon Bottou, Genevieve B.
with convolutional networks. In IEEE In- Orr, and Klaus-Robert Müller. Effi-
ternational Conference on Computer Vision, cient backprop. In Grégoire Montavon,
ICCV 2015, Santiago, Chile, December 7-13, Genevieve B. Orr, and Klaus-Robert Müller,
2015, pages 2479–2487, 2015. editors, Neural Networks: Tricks of the Trade
- Second Edition, volume 7700 of Lecture
[185] John D. Lafferty, Andrew McCallum, and Notes in Computer Science, pages 9–48.
Fernando C. N. Pereira. Conditional ran- Springer, 2012. URL https://doi.org/10.
dom fields: Probabilistic models for segment- 1007/978-3-642-35289-8_3.
ing and labeling sequence data. In Pro-
ceedings of the Eighteenth International Con- [191] Byungjae Lee, Enkhbayar Erdenee, Song-
ference on Machine Learning, ICML ’01, Guo Jin, Mi Young Nam, Young Giu Jung,
pages 282–289, San Francisco, CA, USA, and Phill-Kyu Rhee. Multi-class multi-object
2001. URL http://dl.acm.org/citation. tracking using changing point detection. In
cfm?id=645530.655813. Gang Hua and Hervé Jégou, editors, Com-
puter Vision - ECCV 2016 - 14th European
[186] Darius Lam, Richard Kuzma, Kevin McGee, Conference, Amsterdam, The Netherlands,
Samuel Dooley, Michael Laielli, Matthew October 11-14, 2016, volume 9914 of Lec-
Klaric, Yaroslav Bulatov, and Brendan Mc- ture Notes in Computer Science, pages 68–
Cord. xview: Objects in context in overhead 83, 2016. URL https://doi.org/10.1007/
imagery. CoRR, abs/1802.07856, 2018. URL 978-3-319-48881-3_6.
http://arxiv.org/abs/1802.07856.
[192] Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong,
[187] Christoph H. Lampert, Matthew B. and Nojun Kwak. Residual features and uni-
Blaschko, and Thomas Hofmann. Be- fied prediction network for single stage de-
yond sliding windows: Object localization tection. CoRR, abs/1707.05031, 2017. URL
by efficient subwindow search. In 2008 IEEE http://arxiv.org/abs/1707.05031.
Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR [193] Youngwan Lee, Huieun Kim, Eunsoo Park,
2008), 24-26 June 2008, Anchorage, Alaska, Xuenan Cui, and Hakil Kim. Wide-residual-
USA, 2008. inception networks for real-time object detec-
tion. In Intelligent Vehicles Symposium (IV),
[188] Dmitry Laptev, Nikolay Savinov, Joachim M. 2017 IEEE, pages 758–764, 2017.
Buhmann, and Marc Pollefeys. TI-
POOLING: transformation-invariant pooling [194] Joseph Lemley, Shabab Bazrafkan, and Peter
for feature learning in convolutional neural Corcoran. Smart augmentation learning an
networks. In 2016 IEEE Conference on Com- optimal data augmentation strategy. IEEE
puter Vision and Pattern Recognition, CVPR Access, 5:5858–5869, 2017.

[195] Bo Li. 3D Fully Convolutional Network for Conference on Computer Vision and Pat-
Vehicle Detection in Point Cloud. In IROS, tern Recognition, CVPR 2017, Honolulu, HI,
2017. USA, July 21-26, 2017, pages 1951–1959.
IEEE Computer Society, 2017.
[196] Bo Li, Tianfu Wu, Shuai Shao, Lun Zhang,
and Rufeng Chu. Object detection via end- [202] Xiaofei Li, Fabian Flohr, Yue Yang, Hui
to-end integration of aspect ratio and con- Xiong, Markus Braun, Shuyue Pan, Keqiang
text aware part-based models and fully con- Li, and Dariu M Gavrila. A new benchmark
volutional networks. CoRR, abs/1612.00534, for vision-based cyclist detection. In Intelli-
2016. URL http://arxiv.org/abs/1612. gent Vehicles Symposium (IV), 2016 IEEE,
00534. pages 1028–1033, 2016.

[197] Bo Li, Tianlei Zhang, and Tian Xia. [203] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang
Vehicle detection from 3d lidar using Ji, and Yichen Wei. Fully convolutional
fully convolutional network. In David instance-aware semantic segmentation. In
Hsu, Nancy M. Amato, Spring Berman, 2017 IEEE Conference on Computer Vision
and Sam Ade Jacobs, editors, Robotics: and Pattern Recognition, CVPR 2017, Hon-
Science and Systems XII, University of olulu, HI, USA, July 21-26, 2017, pages
Michigan, Ann Arbor, Michigan, USA, 4438–4446, 2017. doi: 10.1109/CVPR.
June 18 - June 22, 2016, 2016. URL 2017.472. URL https://doi.org/10.1109/
http://www.roboticsproceedings.org/ CVPR.2017.472.
rss12/p42.html.
[204] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun
[198] Haoxiang Li, Zhe Lin, Xiaohui Shen, Wang, and Xiaogang Wang. Scene graph gen-
Jonathan Brandt, and Gang Hua. A convolu- eration from objects, phrases and caption re-
tional neural network cascade for face detec- gions. CoRR, abs/1707.09700, 2017. URL
tion. In IEEE Conference on Computer Vi- http://arxiv.org/abs/1707.09700.
sion and Pattern Recognition, CVPR 2015, [205] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo
Boston, MA, USA, June 7-12, 2015, pages Li. Tiny-DSOD: Lightweight Object Detec-
5325–5334, 2015. tion for Resource-Restricted Usages. In Pro-
ceedings of the British Machine Vision Con-
[199] Hongyang Li, Yu Liu, Wanli Ouyang, and
ference 2018, BMVC 2018, Newcastle, UK,
Xiaogang Wang. Zoom out-and-in network
September 3-6, 2018, July 2018.
with recursive training for object proposal.
CoRR, abs/1702.05711, 2017. URL http: [206] Zeming Li, Chao Peng, Gang Yu, Xiangyu
//arxiv.org/abs/1702.05711. Zhang, Yangdong Deng, and Jian Sun. Light-
head R-CNN: in defense of two-stage object
[200] Jianan Li, Xiaodan Liang, ShengMei Shen, detector. CoRR, abs/1711.07264, 2017. URL
Tingfa Xu, Jiashi Feng, and Shuicheng Yan. http://arxiv.org/abs/1711.07264.
Scale-aware fast r-cnn for pedestrian detec-
tion. IEEE Transactions on Multimedia, [207] Zeming Li, Yilun Chen, Gang Yu, and Yang-
2017. dong Deng. R-FCN++: Towards Accurate
Region-Based Fully Convolutional Networks
[201] Jianan Li, Xiaodan Liang, Yunchao Wei, for Object Detection. In AAAI, page 8, 2018.
Tingfa Xu, Jiashi Feng, and Shuicheng Yan.
Perceptual generative adversarial networks [208] Zeming Li, Chao Peng, Gang Yu, Xiangyu
for small object detection. In 2017 IEEE Zhang, Yangdong Deng, and Jian Sun. Det-

net: A backbone network for object detec- [216] Tsung-Yi Lin, Priya Goyal, Ross B. Gir-
tion. CoRR, abs/1804.06215, 2018. URL shick, Kaiming He, and Piotr Dollár. Fo-
http://arxiv.org/abs/1804.06215. cal loss for dense object detection. In IEEE
International Conference on Computer Vi-
[209] Zhizhong Li and Derek Hoiem. Learning
sion, ICCV 2017, Venice, Italy, October 22-
without Forgetting. IEEE Transactions on
29, 2017, pages 2999–3007. IEEE Computer
Pattern Analysis and Machine Intelligence,
Society, 2017.
(to appear), 2018.
[217] Zhe Lin, Larry S. Davis, David S. Doer-
[210] Zuoxin Li and Fuqiang Zhou. FSSD: feature mann, and Daniel DeMenthon. Hierarchi-
fusion single shot multibox detector. CoRR, cal part-template matching for human detec-
abs/1712.00960, 2017. URL http://arxiv. tion and segmentation. In IEEE 11th In-
org/abs/1712.00960. ternational Conference on Computer Vision,
[211] Minghui Liao, Baoguang Shi, and Xiang Bai. ICCV 2007, Rio de Janeiro, Brazil, October
Textboxes++: A single-shot oriented scene 14-20, 2007, pages 1–8, 2007.
text detector. CoRR, abs/1801.02765, 2018. [218] Zhouhan Lin, Matthieu Courbariaux, Roland
URL http://arxiv.org/abs/1801.02765. Memisevic, and Yoshua Bengio. Neural
[212] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui- networks with few multiplications. CoRR,
Song Xia, and Xiang Bai. Rotation-sensitive abs/1510.03009, 2015. URL http://arxiv.
regression for oriented scene text detection. org/abs/1510.03009.
CoRR, abs/1803.05265, 2018. URL http:// [219] Kang Liu and Gellert Mattyus. Fast multi-
arxiv.org/abs/1803.05265. class vehicle detection on aerial images. IEEE
[213] Yuan Liao, Xiaoqing Lu, Chengcui Zhang, Geoscience and Remote Sensing Letters, 12
Yongtao Wang, and Zhi Tang. Mutual En- (9):1938–1942, 2015.
hancement for Detection of Multiple Logos in [220] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul
Sports Videos. In IEEE International Con- Fieguth, Jie Chen, Xinwang Liu, and Matti
ference on Computer Vision, ICCV 2017, Pietikäinen. Deep learning for generic object
Venice, Italy, October 22-29, 2017, pages detection: A survey. CoRR, abs/1809.02165,
4856–4865, October 2017. 2018. URL https://arxiv.org/abs/1809.
02165.
[214] Tsung-Yi Lin, Michael Maire, Serge Be-
longie, James Hays, Pietro Perona, Deva Ra- [221] Wei Liu, Dragomir Anguelov, Dumitru Er-
manan, Piotr Dollár, and C Lawrence Zit- han, Christian Szegedy, Scott Reed, Cheng-
nick. Microsoft coco: Common objects in Yang Fu, and Alexander C Berg. Ssd: Sin-
context. In Computer Vision - ECCV 2014 - gle shot multibox detector. In Computer Vi-
13th European Conference, Zurich, Switzer- sion - ECCV 2016 - 14th European Confer-
land, September 6-12, 2014, pages 740–755, ence, Amsterdam, The Netherlands, October
2014. 11-14, 2016, pages 21–37, 2016.
[215] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, [222] Yuliang Liu and Lianwen Jin. Deep matching
Kaiming He, Bharath Hariharan, and Serge prior network: Toward tighter multi-oriented
Belongie. Feature pyramid networks for ob- text detection. In 2017 IEEE Conference on
ject detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition,
Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-
CVPR 2017, Honolulu, HI, USA, July 21-26, 26, 2017, pages 3454–3461. IEEE Computer
2017, volume 1, page 4, 2017. Society, 2017.

[223] Zikun Liu, Liu Yuan, Lubin Weng, and Yip- Society, 2018. doi: 10.1109/CVPR.2018.
ing Yang. A high resolution optical satellite 00071.
image dataset for ship recognition and some
new baselines. In ICPRAM, pages 324–331, [231] Jiayuan Mao, Tete Xiao, Yuning Jiang, and
2017. Zhimin Cao. What can help pedestrian de-
tection? In 2017 IEEE Conference on
[224] David G Lowe. Object recognition from local Computer Vision and Pattern Recognition
scale-invariant features. In Computer vision, (CVPR), pages 6034–6043, 2017.
1999. The proceedings of the seventh IEEE
international conference on, volume 2, pages [232] V.Y. Mariano, Junghye Min, Jin-Hyeong
1150–1157, 1999. Park, R. Kasturi, D. Mihalcik, Huiping Li,
D. Doermann, and T. Drayer. Performance
[225] David G Lowe. Distinctive image features
evaluation of object detection algorithms. In
from scale-invariant keypoints. International
International Conference on Pattern Recog-
Journal of Computer Vision (IJCV), 60(2):
nition (ICPR), volume 3, pages 965–969,
91–110, 2004.
2002.
[226] Jiajun Lu, Hussein Sibai, Evan Fabry, and
David A. Forsyth. Standard detectors aren’t [233] Oded Maron and Tomás Lozano-Pérez. A
(currently) fooled by physical adversarial framework for multiple-instance learning. In
stop signs. CoRR, abs/1710.03337, 2017. Michael I. Jordan, Michael J. Kearns, and
URL http://arxiv.org/abs/1710.03337. Sara A. Solla, editors, Advances in Neural
Information Processing Systems 10, [NIPS
[227] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, Conference, Denver, Colorado, USA, 1997],
S. Wong, and R. Young. Icdar 2003 robust pages 570–576. The MIT Press, 1997. URL
reading competitions. In Seventh Interna- http://papers.nips.cc/paper/1346-
tional Conference on Document Analysis and a-framework-for-multiple-instance-
Recognition, 2003. Proceedings., pages 682– learning.
687, Aug 2003.
[234] Marc Masana, Joost van de Weijer, and
[228] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang,
Andrew D. Bagdanov. On-the-fly net-
Hong Wang, Yingbin Zheng, and Xiangyang
work pruning for object detection. CoRR,
Xue. Arbitrary-Oriented Scene Text Detec-
abs/1605.03477, 2016. URL http://arxiv.
tion via Rotation Proposals. IEEE Transac-
org/abs/1605.03477.
tions on Multimedia, pages 1–1, 2018.
[229] Santiago Manen, Matthieu Guillaumin, and [235] Marc Masana, Joost van de Weijer, Luis Her-
Luc Van Gool. Prime object proposals with ranz, Andrew D. Bagdanov, and Jose M.
randomized prim’s algorithm. In IEEE In- Álvarez. Domain-adaptive deep network
ternational Conference on Computer Vision, compression. In IEEE International Con-
ICCV 2013, Sydney, Australia, December 1- ference on Computer Vision, ICCV 2017,
8, 2013, pages 2536–2543, 2013. Venice, Italy, October 22-29, 2017, pages
4299–4307. IEEE Computer Society, 2017.
[230] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi
Pont-Tuset, and Luc Van Gool. Deep ex- [236] Francisco Massa, Bryan C. Russell, and
treme cut: From extreme points to object Mathieu Aubry. Deep Exemplar 2D-3D De-
segmentation. In Computer Vision and Pat- tection by Adapting from Real to Rendered
tern Recognition (CVPR), 2018 IEEE Con- Views. 2016 IEEE Conference on Com-
ference on, pages 616–625. IEEE Computer puter Vision and Pattern Recognition, CVPR

2016, Las Vegas, NV, USA, June 27-30, 2016, USA, June 7-12, 2015, pages 3593–3602,
pages 6024–6033, 2016. June 2015.
[237] Ofer Matan, Henry S. Baird, Jane Bromley, [244] Chaitanya Mitash, Kun Wang, Kostas E
Christopher J. C. Burges, John S. Denker, Bekris, and Abdeslam Boularias. Physics-
Lawrence D. Jackel, Yann Le Cun, Ed- aware Self-supervised Training of CNNs for
win P. D. Pednault, William D Satterfield, Object Detection. In IEEE International
Charles E. Stenard, et al. Reading handwrit- Conference on Robotics and Automation
ten digits: A zip code recognition system. (ICRA), 2017.
IEEE Computer, 25(7):59–63, 1992.
[245] T M Mitchell. Never-Ending Learning. Com-
[238] Brianna Maze, Jocelyn Adams, James A mun. ACM, 61(5):103–115, 2018.
Duncan, Nathan Kalka, Tim Miller, Charles
Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, and Patrick Grother. IARPA Janus Benchmark – C: Face Dataset and Protocol. In ICB, page 8, 2018.
[239] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. Scenenet RGB-D: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2697–2706. IEEE Computer Society, 2017.
[240] Kazuki Minemura, Hengfui Liau, Abraham Monrroy, and Shinpei Kato. Lmnet: Real-time multiclass object detection on CPU using 3d lidar. CoRR, abs/1805.04902, 2018. URL http://arxiv.org/abs/1805.04902.
[241] A. Mishra, S. Nandan Rai, A. Mishra, and C. V. Jawahar. IIIT-CFW: A Benchmark Database of Cartoon Faces in the Wild. In VASE ECCVW, 2016.
[242] Anand Mishra, Karteek Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012, 2012.
[243] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA,
[246] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-Based Traffic Sign Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and Survey. IEEE Transactions on Intelligent Transportation Systems, 13:1484–1497, November 2012.
[247] Taylor Mordan, Nicolas Thome, Matthieu Cord, and Gilles Henaff. Deformable Part-based Fully Convolutional Network for Object Detection. In Proceedings of the British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017, 2017.
[248] Taylor Mordan, Nicolas Thome, Gilles Henaff, and Matthieu Cord. End-to-End Learning of Latent Deformable Part-Based Representations for Object Detection. International Journal of Computer Vision, 2018. doi: 10.1007/s11263-018-1109-z.
[249] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3d bounding box estimation using deep learning and geometry. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5632–5640. IEEE Computer Society, 2017.
[250] Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, and Trevor Darrell. Spatial semantic regularisation for large scale object detection. In IEEE

International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2003–2011, 2015.
[251] Seongkyu Mun, Sangwook Park, David K Han, and Hanseok Ko. Generative adversarial network based acoustic scene training set augmentation and selection using svm hyperplane. Proc. DCASE, pages 93–97, 2017.
[252] T Nathan Mundhenk, Goran Konjevod, Wesam A Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 785–800, 2016.
[253] Hajime Nada, Vishwanath A. Sindagi, He Zhang, and Vishal M. Patel. Pushing the limits of unconstrained face detection: a challenge dataset and baseline results. CoRR, abs/1804.10275, 2018. URL http://arxiv.org/abs/1804.10275.
[254] Mahyar Najibi, Mohammad Rastegari, and Larry S. Davis. G-CNN: An Iterative Grid Based Object Detector. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
[255] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. SSH: Single Stage Headless Face Detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
[256] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 483–499, 2016.
[257] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
[258] Steven J Nowlan and John C Platt. A convolutional neural network hand tracker. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 901–908, 1995.
[259] Jean Ogier Du Terrail and Frédéric Jurie. On the Use of Deep Neural Networks for the Detection of Small Vehicles in Ortho-Images. In IEEE International Conference on Image Processing, Beijing, China, September 2017. URL https://hal.archives-ouvertes.fr/hal-01527906.
[260] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Localization Recall Precision (LRP): A New Performance Metric for Object Detection. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, July 2018.
[261] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014.
[262] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? - weakly-supervised learning with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 685–694, 2015.
[263] Margarita Osadchy, Yann Le Cun, and Matthew L Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8(May):1197–1215, 2007.

[264] W. Ouyang, X. Wang, and C. Zhang. Factors in finetuning deep model for object detection with long-tail distribution. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
[265] Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013.
[266] Wanli Ouyang and Xiaogang Wang. Single-pedestrian detection aided by multi-pedestrian detection. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3198–3205, 2013.
[267] Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, and Xiaoou Tang. DeepID-Net: Deformable deep convolutional neural networks for object detection. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015.
[268] Wanli Ouyang, Ku Wang, Xin Zhu, and Xiaogang Wang. Learning chained deep features and classifiers for cascade in object detection. CoRR, abs/1702.07054, 2017. URL http://arxiv.org/abs/1702.07054.
[269] Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou. Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. CoRR, abs/1804.02047, 2018. URL http://arxiv.org/abs/1804.02047.
[270] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. We don’t need no bounding-boxes: Training object class detectors using only human verification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, February 2016.
[271] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. Training object class detectors with click supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 180–189. IEEE Computer Society, 2017.
[272] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. International Journal of Computer Vision (IJCV), 38(1):15–33, 2000.
[273] Bo Peng, Wenming Tan, Zheyang Li, Shun Zhang, Di Xie, and Shiliang Pu. Extreme network compression via filter group approximation. CoRR, abs/1807.11254, 2018. URL http://arxiv.org/abs/1807.11254.
[274] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. CoRR, abs/1711.07240, 2017. URL http://arxiv.org/abs/1711.07240.
[275] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters -- improve semantic segmentation by global convolutional network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1743–1751, 2017.
[276] Xingchao Peng and Kate Saenko. Synthetic to real adaptation with generative correlation alignment networks. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1982–1991. IEEE Computer Society, 2018.

[277] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning Deep Object Detectors from 3D Models. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
[278] Alex Pentland, Baback Moghaddam, and Thad Starner. View-based and modular eigenspaces for face recognition. In Conference on Computer Vision and Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA, pages 84–91, 1994.
[279] Bojan Pepik, Rodrigo Benenson, Tobias Ritschel, and Bernt Schiele. What is holding back convnets for detection? In German Conference on Pattern Recognition, pages 517–528, 2015.
[280] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621, 2017. URL http://arxiv.org/abs/1712.04621.
[281] Phuoc Pham, Duy Nguyen, Tien Do, Thanh Duc Ngo, and Duy-Dinh Le. Evaluation of Deep Models for Real-Time Small Object Detection. ICONIP, 10636:516–526, 2017.
[282] Pedro H. O. Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1990–1998, 2015. URL http://papers.nips.cc/paper/5852-learning-to-segment-object-candidates.
[283] Pedro O. Pinheiro and Ronan Collobert. From Image-level to Pixel-level Labeling with Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
[284] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 75–91, 2016.
[285] Alex D. Pon, Oles Andrienko, Ali Harakeh, and Steven L. Waslander. A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection. In IEEE Conference on Computer and Robot Vision, June 2018.
[286] Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
[287] Fatih Murat Porikli. Integral histogram: A fast way to extract histograms in cartesian spaces. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pages 829–836, 2005.
[288] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, July 2017.
[289] Charles Ruizhongtai Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. CoRR, abs/1711.08488, 2017. URL http://arxiv.org/abs/1711.08488.

[290] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5105–5114, 2017. URL http://papers.nips.cc/paper/7095-pointnet-deep-hierarchical-feature-learning-on-point-sets-in-a-metric-space.
[291] Weichao Qiu and Alan L. Yuille. Unrealcv: Connecting computer vision to unreal engine. In Gang Hua and Hervé Jégou, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, volume 9915 of Lecture Notes in Computer Science, pages 909–916, 2016. URL https://doi.org/10.1007/978-3-319-49409-8_75.
[292] Shafin Rahman, Salman Hameed Khan, and Fatih Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. CoRR, abs/1803.06049, 2018. URL http://arxiv.org/abs/1803.06049.
[293] Esa Rahtu, Juho Kannala, and Matthew Blaschko. Learning a category independent object detection cascade. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 1052–1059, 2011.
[294] Anant Raj, Vinay P. Namboodiri, and Tinne Tuytelaars. Subspace Alignment Based Domain Adaptation for RCNN Detector. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 166.1–166.11, Swansea, 2015.
[295] Rakesh N. Rajaram, Eshed Ohn-Bar, and Mohan M. Trivedi. RefineNet: Iterative refinement for accurate object localization. In IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 1528–1533, November 2016.
[296] Param S. Rajpura, Ravi S. Hegde, and Hristo Bojinov. Object detection using deep cnns trained on synthetic images. CoRR, abs/1706.06782, 2017. URL http://arxiv.org/abs/1706.06782.
[297] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. A deep pyramid deformable part model for face detection. In IEEE 7th International Conference on Biometrics Theory, Applications and Systems, BTAS 2015, Arlington, VA, USA, September 8-11, 2015, pages 1–8. IEEE, 2015.
[298] Pekka Rantalankila, Juho Kannala, and Esa Rahtu. Generating object segmentation proposals using global and local search. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2417–2424, 2014.
[299] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 525–542, 2016.
[300] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3236–3246, 2017.

[301] Kumar S. Ray, Vijayan K. Asari, and Soma Chakraborty. Object detection by spatio-temporal analysis and tracking of the detected objects in a video with variable background. CoRR, abs/1705.02949, 2017. URL http://arxiv.org/abs/1705.02949.
[302] Sébastien Razakarivony and Frédéric Jurie. Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation, 34:187–203, 2016.
[303] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 7464–7473. IEEE Computer Society, 2017.
[304] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), 2018.
[305] Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), 2015.
[306] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6517–6525. IEEE Computer Society, 2017.
[307] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018. URL http://arxiv.org/abs/1804.02767.
[308] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788, 2016.
[309] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
[310] Shaoqing Ren, Kaiming He, Ross B. Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1476–1481, 2017. URL https://doi.org/10.1109/TPAMI.2016.2601099.
[311] M. Rochan and Yang Wang. Weakly supervised localization of novel objects using appearance transfer. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
[312] Mikel Rodriguez, Ivan Laptev, Josef Sivic, and Jean-Yves Audibert. Density-aware person detection and tracking in crowds. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2423–2430, 2011.
[313] Stefan Romberg, Lluis Garcia Pueyo, Rainer Lienhart, and Roelof Van Zwol. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 25, 2011.
[314] Amir Rosenfeld, Richard Zemel, and John K. Tsotsos. The elephant in the room. CoRR, abs/1808.03305, 2018. URL http://arxiv.org/abs/1808.03305.

[315] Rasmus Rothe, Matthieu Guillaumin, and Luc Van Gool. Non-maximum suppression for object detection by passing messages between windows. In Computer Vision - ACCV 2014 - 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, pages 290–306, 2014.
[316] Soumya Roy, Vinay P. Namboodiri, and Arijit Biswas. Active learning with version spaces for object detection. CoRR, abs/1611.07285, 2016. URL http://arxiv.org/abs/1611.07285.
[317] Sitapa Rujikietgumjorn and Robert T Collins. Optimized pedestrian detection for multiple and occluded people. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3690–3697, 2013.
[318] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[319] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[320] Payam Sabzmeydani and Greg Mori. Detecting pedestrians by learning shapelet features. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA, 2007.
[321] Mohammad Amin Sadeghi and Ali Farhadi. Recognition using visual phrases. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1745–1752, 2011.
[322] Mohammad Amin Sadeghi and David A. Forsyth. 30hz object detection with DPM V5. In David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, volume 8689 of Lecture Notes in Computer Science, pages 65–79. Springer, 2014. URL https://doi.org/10.1007/978-3-319-10590-1_5.
[323] Wesam A. Sakla, Goran Konjevod, and T. Nathan Mundhenk. Deep multi-modal vehicle detection in aerial ISR imagery. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, pages 916–923. IEEE, 2017.
[324] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pages 4510–4520, 2018.
[325] P. A. Savalle and S. Tsogkas. Deformable part models with cnn features. In SAICSIT Conf., 2014.
[326] Henry Schneiderman and Takeo Kanade. Object detection using the statistics of parts. International Journal of Computer Vision (IJCV), 56(3):151–177, 2004.
[327] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. URL http://arxiv.org/abs/1312.6229.
[328] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised

multi-stage feature learning. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3626–3633, 2013.
[329] Mohammad Javad Shafiee, Brendan Chywl, Francis Li, and Alexander Wong. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. CoRR, abs/1709.05943, 2017. URL http://arxiv.org/abs/1709.05943.
[330] Yunhan Shen, Rongrong Ji, Shengchuan Zhang, Wangmeng Zuo, and Yan Wang. Generative adversarial learning towards fast weakly supervised detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
[331] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, volume 3, page 7, 2017.
[332] Zhiqiang Shen, Honghui Shi, Rogério Schmidt Feris, Liangliang Cao, Shuicheng Yan, Ding Liu, Xinchao Wang, Xiangyang Xue, and Thomas S. Huang. Learning object detectors from scratch with gated recurrent feature pyramids. CoRR, abs/1712.00886, 2017. URL http://arxiv.org/abs/1712.00886.
[333] Baoguang Shi, Xiang Bai, and Serge J. Belongie. Detecting oriented text in natural images by linking segments. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3482–3490. IEEE Computer Society, 2017.
[334] Baoguang Shi, Cong Yao, Minghui Liao, Mingkun Yang, Pei Xu, Linyan Cui, Serge J. Belongie, Shijian Lu, and Xiang Bai. ICDAR2017 competition on reading chinese text in the wild (RCTW-17). CoRR, abs/1708.09585, 2017. URL http://arxiv.org/abs/1708.09585.
[335] Xuepeng Shi, Shiguang Shan, Meina Kan, Shuzhe Wu, and Xilin Chen. Real-time rotation-invariant face detection with progressive calibration networks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
[336] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3420–3429, 2017.
[337] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 761–769, 2016.
[338] Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, and Abhinav Gupta. Beyond skip connections: Top-down modulation for object detection. CoRR, abs/1612.06851, 2016. URL http://arxiv.org/abs/1612.06851.
[339] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2242–2251, 2017.
[340] Shai Silberstein, Dan Levi, Victoria Kogan, and Ran Gazit. Vision-based pedestrian detection for rear-view cameras. In Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pages 853–860, 2014.

[341] Daniel L Silver, Qiang Yang, and Lianghao Li. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In 2013 AAAI Spring Symposium, page 7, 2013.
[342] Martin Simon, Stefan Milz, Karl Amende, and Horst-Michael Gross. Complex-yolo: Real-time 3d object detection on point clouds. CoRR, abs/1803.06199, 2018. URL http://arxiv.org/abs/1803.06199.
[343] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
[344] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013. URL http://arxiv.org/abs/1312.6034.
[345] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection-snip. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2018.
[346] Bharat Singh, Hengduo Li, Abhishek Sharma, and Larry S. Davis. R-FCN-3000 at 30fps: Decoupling detection and classification. CoRR, abs/1712.01802, 2017. URL http://arxiv.org/abs/1712.01802.
[347] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: efficient multi-scale training. CoRR, abs/1805.09300, 2018. URL http://arxiv.org/abs/1805.09300.
[348] Leon Sixt, Benjamin Wild, and Tim Landgraf. Rendergan: Generating realistic labeled data. Front. Robotics and AI, 2018, 2018.
[349] Arnold W M Smeulders, Amarnath Gupta, and Ramesh Jain. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):32, 2000.
[350] Lars W. Sommer, Tobias Schuchert, Jurgen Beyerer, Firooz A. Sadjadi, and Abhijit Mahalanobis. Deep learning based multi-category object detection in aerial images. In SPIE Defense+ Security, May 2017.
[351] Lars Wilko Sommer, Tobias Schuchert, and Jürgen Beyerer. Fast deep vehicle detection in aerial images. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, pages 311–319. IEEE, 2017.
[352] Lars Wilko Sommer, Arne Schumann, Tobias Schuchert, and Jürgen Beyerer. Multi feature deconvolutional faster R-CNN for precise vehicle detection in aerial imagery. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 635–642. IEEE Computer Society, 2018.
[353] Hyun Oh Song, Ross B. Girshick, Stefanie Jegelka, Julien Mairal, Zaïd Harchaoui, and Trevor Darrell. On learning to localize objects with minimal supervision. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1611–1619. JMLR.org, 2014. URL http://jmlr.org/proceedings/papers/v32/songb14.html.
[354] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1637–1645, 2014.
[355] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014. URL http://arxiv.org/abs/1412.6806.
[356] Siddharth Srivastava, Gaurav Sharma, and Brejesh Lall. Large scale novel object discovery in 3d. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 179–188. IEEE Computer Society, 2018.
[357] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2325–2333, 2016.
[358] Hang Su, Shaogang Gong, and Xiatian Zhu. WebLogo-2M: Scalable Logo Detection by Deep Learning from the Web. In ICCB Workshops, pages 270–279, October 2017.
[359] Hang Su, Xiatian Zhu, and Shaogang Gong. Deep Learning Logo Detection with Data Expansion by Synthesising Context. IEEE Winter Conf. on Applications of Computer Vision (WACV), pages 530–539, 2017.
[360] Hang Su, Xiatian Zhu, and Shaogang Gong. Open Logo Detection Challenge. In Proceedings of the British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2018.
[361] Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, volume 1, page 3, 2014.
[362] Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, and Lubomir Bourdev. ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
[363] Christian Szegedy, Scott E. Reed, Dumitru Erhan, and Dragomir Anguelov. Scalable, high-quality object detection. CoRR, abs/1412.1441, 2014. URL http://arxiv.org/abs/1412.1441.
[364] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1–9, 2015.
[365] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society, 2016.
[366] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[367] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018. URL http://arxiv.org/abs/1807.11626.
[368] Kevin D. Tang, Vignesh Ramanathan, Fei-Fei Li, and Daphne Koller. Shifting Weights: Adapting Object Detectors from Image to Video. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012.

[369] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple instance detection network with online instance classifier refinement. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
[370] Siyu Tang, Mykhaylo Andriluka, and Bernt Schiele. Detection and tracking of occluded people. International Journal of Computer Vision (IJCV), 110(1):58–69, 2014.
[371] Siyu Tang, Bjoern Andres, Miykhaylo Andriluka, and Bernt Schiele. Subgraph decomposition for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5033–5041, 2015.
[372] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Lin Lei, and Huanxin Zou. Arbitrary-Oriented Vehicle Detection in Aerial Imagery with Single Convolutional Neural Networks. Remote Sensing, 9:1170–17, November 2017.
[373] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Huanxin Zou, and Lin Lei. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors, 17:336–17, February 2017.
[374] Y. Tang, J. K. Wang, B. Gao, and E. Dellandréa. Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
[375] Franklin Tanner, Brian Colder, Craig Pullen, David Heagy, Michael Eppolito, Veronica Carlan, Carsten Oertel, and Phil Sallee. Overhead imagery research data set -- an annotated data library & tools to aid in the development of computer vision algorithms. In 2009 IEEE Applied Imagery Pattern Recognition Workshop (AIPR 2009), pages 1–8, 2009.
[376] Luke Taylor and Geoff Nitschke. Improving deep learning using generic data augmentation. CoRR, abs/1708.06020, 2017. URL http://arxiv.org/abs/1708.06020.
[377] Yonglin Tian, Xuan Li, Kunfeng Wang, and Fei-Yue Wang. Training and testing object detectors with virtual images. CoRR, abs/1712.08470, 2017. URL http://arxiv.org/abs/1712.08470.
[378] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[379] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3d localisation. Machine vision and applications, 25(3):633–647, 2014.
[380] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at dataset bias. In Gabriela Csurka, editor, Domain Adaptation in Computer Vision Applications, Advances in Computer Vision and Pattern Recognition, pages 37–55. Springer, 2017. URL https://doi.org/10.1007/978-3-319-58347-1_2.
[381] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1521–1528, 2011.
[382] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems

2017, 4-9 December 2017, Long Beach, CA, USA, pages 2797–2806, 2017.

[383] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[384] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling things: A synthetic dataset for 3d object detection and pose estimation. CoRR, abs/1804.06534, 2018. URL http://arxiv.org/abs/1804.06534.

[385] Subarna Tripathi, Zachary C. Lipton, Serge J. Belongie, and Truong Q. Nguyen. Context matters: Refining object detection in video with recurrent neural networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. URL http://www.bmva.org/bmvc/2016/papers/paper044/index.html.

[386] Zhuowen Tu and Xiang Bai. Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1744–1757, 2010.

[387] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai, and Cong Yao. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090, 2012.

[388] Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian detection via classification on riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713–1727, 2008.

[389] Andras Tüzkö, Christian Herrmann, Daniel Manger, and Jürgen Beyerer. Open set logo detection and retrieval. In Francisco H. Imai, Alain Trémeau, and José Braz, editors, Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 5: VISAPP, Funchal, Madeira, Portugal, January 27-29, 2018, pages 284–292. SciTePress, 2018.

[390] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded iou loss. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pages 6877–6885, 2018. doi: 10.1109/CVPR.2018.00719.

[391] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.

[392] Régis Vaillant, Christophe Monrocq, and Yann Le Cun. Original approach for the localisation of objects in images. IEE Proceedings-Vision, Image and Signal Processing, 141(4):245–250, 1994.

[393] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 1879–1886, 2011.

[394] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist Species Classification and Detection Dataset. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.

[395] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan
Laptev, and Cordelia Schmid. Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4627–4635. IEEE Computer Society, 2017.

[396] Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge J. Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. CoRR, abs/1601.07140, 2016. URL http://arxiv.org/abs/1601.07140.

[397] Alexander Vezhnevets and Vittorio Ferrari. Object localization in imagenet by looking out of the window. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 27.1–27.12. BMVA Press, 2015.

[398] Paul A. Viola, Michael J. Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision (IJCV), 63(2):153–161, 2005.

[399] Stefan Walk, Nikodem Majer, Konrad Schindler, and Bernt Schiele. New features and insights for pedestrian detection. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 1030–1037, 2010.

[400] Fang Wan, Pengxu Wei, Jianbin Jiao, Zhenjun Han, and Qixiang Ye. Min-entropy latent model for weakly supervised object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[401] Li Wan, David Eigen, and Rob Fergus. End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 851–859. IEEE Computer Society, 2015.

[402] Chong Wang, Weiqiang Ren, Kaiqi Huang, and Tieniu Tan. Weakly Supervised Object Localization with Latent Category Learning. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 2014.

[403] Kai Wang and Serge Belongie. Word spotting in the wild. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 591–604, 2010.

[404] Li Wang, Yao Lu, Hong Wang, Yingbin Zheng, Hao Ye, and Xiangyang Xue. Evolving boxes for fast vehicle detection. ICME, pages 1135–1140, 2017.

[405] Robert J. Wang, Xiang Li, Shuang Ao, and Charles X. Ling. Pelee: A Real-Time Object Detection System on Mobile Devices. In International Conference on Learning Representations (ICLR), 2018.

[406] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CoRR, abs/1711.07971, 2017. URL http://arxiv.org/abs/1711.07971.

[407] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3039–3048. IEEE Computer Society, 2017.

[408] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pages 32–39, 2009.

[409] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion Loss: Detecting Pedestrians in a Crowd. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.

[410] Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[411] Longyin Wen, Dawei Du, Zhaowei Cai, Zhen Lei, Ming-Ching Chang, Honggang Qi, Jongwoo Lim, Ming-Hsuan Yang, and Siwei Lyu. DETRAC: A new benchmark and protocol for multi-object tracking. CoRR, abs/1511.04136, 2015. URL http://arxiv.org/abs/1511.04136.

[412] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. Iarpa janus benchmark-b face dataset. In CVPR Workshop on Biometrics, 2017.

[413] Christian Wojek, Gyuri Dorkó, André Schulz, and Bernt Schiele. Sliding-windows for rapid object class localization: A parallel technique. In Joint Pattern Recognition Symposium, pages 71–81, 2008.

[414] Christian Wojek, Stefan Walk, and Bernt Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 794–801. IEEE Computer Society, 2009.

[415] Sanghyun Woo, Soonmin Hwang, and In So Kweon. Stairnet: Top-down semantic aggregation for accurate one shot detection. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1093–1102. IEEE Computer Society, 2018.

[416] Bichen Wu, Forrest N. Iandola, Peter H. Jin, and Kurt Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, July 21-26, 2017, pages 446–454. IEEE Computer Society, 2017.

[417] Bo Wu and Ram Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007.

[418] Bo Wu and Ramakant Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In 10th IEEE International Conference on Computer Vision (ICCV 2005), 17-20 October 2005, Beijing, China, pages 90–97, 2005.

[419] Tianfu Wu, Bo Li, and Song-Chun Zhu. Learning and-or model to represent context and occlusion for car detection and viewpoint estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1829–1843, 2016.

[420] Yue Wu and Qiang Ji. Facial Landmark Detection: A Literature Survey. International Journal of Computer Vision (IJCV), To appear, May 2018.

[421] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge J. Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. CoRR, abs/1711.10398, 2017. URL http://arxiv.org/abs/1711.10398.

[422] Wei Xiang, Dong-Qing Zhang, Heather Yu, and Vassilis Athitsos. Context-aware single-shot detector. pages 1784–1793, 2018. doi: 10.1109/WACV.2018.00198.

[423] Yu Xiang and S. Savarese. Estimating the aspect layout of object categories. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012.

[424] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3d voxel patterns for object category recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1903–1911. IEEE Computer Society, 2015.

[425] Yao Xiao, Cewu Lu, E. Tsougenis, Yongyi Lu, and Chi-Keung Tang. Complexity-adaptive distance metric for object proposals generation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[426] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5987–5995, 2017.

[427] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, and Rama Chellappa. Deep regionlets for object detection. CoRR, abs/1712.02408, 2017. URL http://arxiv.org/abs/1712.02408.

[428] Jiaolong Xu, Sebastian Ramos, David Vázquez, and Antonio M López. Domain adaptation of deformable part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2367–2380, 2014.

[429] Zhaozhuo Xu, Xin Xu, Lei Wang, Rui Yang, and Fangling Pu. Deformable ConvNet with Aspect Ratio Constrained NMS for Object Detection in Remote Sensing Imagery. Remote Sensing, 9:1312–19, December 2017.

[430] Junjie Yan, Xuzong Zhang, Zhen Lei, and Stan Z. Li. Face detection by structural models. Image and Vision Computing, 32(10):790–799, October 2014.

[431] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2129–2137, 2016.

[432] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 3676–3684. IEEE Computer Society, 2015.

[433] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5525–5533, 2016.

[434] Zhenheng Yang and Ramakant Nevatia. A multi-scale cascade fully convolutional network face detector. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pages 633–638. IEEE, 2016.

[435] Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. CoRR, abs/1606.09002, 2016. URL http://arxiv.org/abs/1606.09002.

[436] Ryota Yoshihashi, Tu Tuan Trinh, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Learning multi-frame visual representation for joint detection and tracking of small objects. CoRR, abs/1709.04666, 2017. URL http://arxiv.org/abs/1709.04666.

[437] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018, pages 1:1–1:10. ACM, 2018.

[438] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015. URL http://arxiv.org/abs/1511.07122.

[439] Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 636–644. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.75.

[440] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018. URL http://arxiv.org/abs/1805.04687.

[441] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas S. Huang. Unitbox: An advanced object detection network. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, pages 516–520, 2016.

[442] Ruichi Yu, Xi Chen, Vlad I. Morariu, and Larry S. Davis. The Role of Context Selection in Object Detection. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, September 2016.

[443] Yuan Yuan, Xiaodan Liang, Xiaolong Wang, Dit-Yan Yeung, and Abhinav Gupta. Temporal dynamic graph lstm for action-driven video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, Oct 2017.

[444] Mehmet Kerim Yucel, Yunus Can Bilge, Oguzhan Oguz, Nazli Ikizler-Cinbis, Pinar Duygulu, and Ramazan Gokberk Cinbis. Wildest faces: Face detection and recognition in violent settings. CoRR, abs/1805.07566, 2018. URL http://arxiv.org/abs/1805.07566.

[445] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press. URL http://www.bmva.org/bmvc/2016/papers/paper087/index.html.

[446] Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro Oliveira Pinheiro, Sam Gross, Soumith Chintala, and Piotr Dollár. A multipath network for object detection. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016. URL http://www.bmva.org/bmvc/2016/papers/paper015/index.html.

[447] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701.

[448] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV
2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, pages 818–833, 2014. URL https://doi.org/10.1007/978-3-319-10590-1_53.

[449] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, pages 818–833, 2014.

[450] Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, and Xiaogang Wang. Gated Bi-directional CNN for Object Detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, October 2016.

[451] Xingyu Zeng, Wanli Ouyang, Junjie Yan, Hongsheng Li, Tong Xiao, Kun Wang, Yu Liu, Yucong Zhou, Bin Yang, Zhe Wang, et al. Crafting gbd-net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[452] Yao Zhai, Jingjing Fu, Yan Lu, and Houqiang Li. Feature selective networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[453] Cha Zhang and Zhengyou Zhang. A survey of recent advances in face detection. Technical report, Tech. rep., Microsoft Research, 2010.

[454] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. CoRR, abs/1807.10029, 2018. URL http://arxiv.org/abs/1807.10029.

[455] Liliang Zhang, Liang Lin, Xiaodan Liang, and Kaiming He. Is faster R-CNN doing well for pedestrian detection? In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, volume 9906 of Lecture Notes in Computer Science, pages 443–457. Springer, 2016. URL https://doi.org/10.1007/978-3-319-46475-6_28.

[456] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4457–4465. IEEE Computer Society, 2017.

[457] Shanshan Zhang, Jian Yang, and Bernt Schiele. Occluded Pedestrian Detection Through Guided Attention in CNNs. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, page 9, 2018.

[458] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S$^3$FD: Single Shot Scale-invariant Face Detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.

[459] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: detecting pedestrians in a crowd. CoRR, abs/1807.08407, 2018. URL http://arxiv.org/abs/1807.08407.

[460] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.

[461] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017. URL http://arxiv.org/abs/1707.01083.

[462] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S. Huang. Adversarial complementary learning for weakly supervised object localization. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[463] Xiaopeng Zhang, Jiashi Feng, Hongkai Xiong, and Qi Tian. Zigzag learning for weakly supervised object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[464] Yongqiang Zhang, Yancheng Bai, Mingli Ding, Yongqiang Li, and Bernard Ghanem. W2f: A weakly-supervised to fully-supervised framework for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[465] Yuting Zhang, Kihyuk Sohn, R. Villegas, Gang Pan, and Honglak Lee. Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[466] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4159–4167. IEEE Computer Society, 2016.

[467] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L. Yuille. Single-shot object detection with enriched semantics. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[468] Fan Zhao, Yao Yang, Hai-yan Zhang, Lin-lin Yang, and Lin Zhang. Sign text detection in street view images using an integrated feature. Multimedia Tools and Applications, April 2018.

[469] Xiangyun Zhao, Shuang Liang, and Yichen Wei. Pseudo mask augmented object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.

[470] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. CoRR, abs/1807.05511, 2018. URL http://arxiv.org/abs/1807.05511.

[471] Liwen Zheng, Canmiao Fu, and Yong Zhao. Extend the shallow part of single shot multibox detector via convolutional neural network. CoRR, abs/1801.05918, 2018. URL http://arxiv.org/abs/1801.05918.

[472] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[473] Bolei Zhou, Àgata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014.

[474] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2921–2929. IEEE Computer Society, 2016.

[475] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and Yi Xu. Scale-Transferrable Object Detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, page 10, 2018.

[476] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160.

[477] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: An efficient and accurate scene text detector. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, July 2017.

[478] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. CoRR, abs/1711.06396, 2017. URL http://arxiv.org/abs/1711.06396.

[479] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 3735–3739, 2015.

[480] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2242–2251. IEEE Computer Society, 2017.

[481] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge. CoRR, abs/1804.07437, 2018. URL http://arxiv.org/abs/1804.07437.

[482] Pengkai Zhu, Hanxiao Wang, Tolga Bolukbasi, and Venkatesh Saligrama. Zero-shot detection. CoRR, abs/1803.07113, 2018. URL http://arxiv.org/abs/1803.07113.

[483] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 2879–2886. IEEE Computer Society, 2012.

[484] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 408–417. IEEE Computer Society, 2017.

[485] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 408–417, 2017. doi: 10.1109/ICCV.2017.52.

[486] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, volume 2, page 7, 2017.

[487] Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, and Lu Yuan. Towards high performance video object detection for mobiles. CoRR, abs/1804.05830, 2018. URL http://arxiv.org/abs/1804.05830.

[488] Yousong Zhu, Chaoyang Zhao, Jinqiao Wang, Xu Zhao, Yi Wu, and Hanqing Lu. Couplenet: Coupling global structure with local parts for object detection. In IEEE
International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4126–4134, 2017. doi: 10.1109/ICCV.2017.444.

[489] Yukun Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[490] Zhe Zhu, Dun Liang, Songhai Zhang, Xiaolei Huang, Baoli Li, and Shimin Hu. Traffic-sign detection and classification in the wild. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2110–2118, 2016.

[491] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 2014.

[492] Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen. Learning Contextual Dependence With Convolutional Hierarchical Recurrent Neural Networks. IEEE Transactions on Image Processing, 2016.

A Datasets and Results

Now that the most influential ideas, concepts and literature in object detection have been reviewed, the rest of the article covers the datasets used to train and evaluate these detectors.

Public datasets play an essential role: they not only make it possible to measure and compare the performance of object detectors but also provide the examples from which object models are learned. In the area of deep learning these resources matter even more, as it has been clearly shown that deep convolutional neural networks are designed to benefit and learn from massive amounts of data [473]. This section discusses the main datasets used in the recent literature on object detection and presents state-of-the-art methods for each dataset.

A.1 Classical Datasets with Common Objects

We start by presenting the datasets containing everyday objects photographed with consumer cameras. This category contains the most important datasets for the domain, attracting the largest part of the community. We discuss the datasets devoted to specific detection tasks (e.g., face detection, pedestrian detection) in a second section.

A.1.1 Pascal-VOC

Pascal-VOC [88] is the most iconic object detection dataset. It has changed over the years, but the format everyone is familiar with is the one that emerged in 2007 with 20 classes (Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor). It is now used as a test bed for most new algorithms. As it is quite small, there have been claims that the community is starting to overfit on the test set, and MS COCO (see next section) is therefore preferred nowadays to demonstrate the quality of a new algorithm. The 0.5 IoU based metric this dataset introduced has now become the de facto standard for every single detection problem. Overall, this dataset's impact on the development of innovative methods in object detection cannot be overstated. It is quite hard to find all the relevant literature, but we have tried to be as thorough as possible in terms of best performing methods. The URL of the dataset is http://host.robots.ox.ac.uk/pascal/VOC/.

Two versions of Pascal-VOC are commonly used in the literature, namely VOC2007 and VOC2012:

• VOC07, with 9,963 images containing 24,640 annotated objects, is small. For this reason, papers using VOC07 often train on the union of the VOC07 and VOC12 trainval sets (VOC07+12). The Average Precision (AP) averaged across the 20 classes is saturating at around 80 points @0.5 IoU. Some methods score a few points higher, but it seems one cannot go much beyond 85 points (without pre-training on MS COCO). Using MS COCO data in addition, one can get up to 86.3 AP (see [207]). We chose to display only methods with mAP over 80 points in Table 6. We do not distinguish between methods that use multiple inference tricks and methods that report results as is. For each method, however, we report the highest published result we could find.

• VOC12 is a little harder than its 2007 counterpart, and results have only just passed the 80-point mark. Because it is harder, most of the literature this time uses the union of the whole VOC2007 data (trainval+test) and VOC2012 trainval; this is referred to as 07++12. Again, better results are obtained with pre-training on COCO data (83.8 points in [117]). Results above 75 points are presented in Table 7.

On both splits, all backbones used by the leaders of the board are heavy backbones with more than 100 layers, except for [467], which gets close to the state of the art using only VGG-16.

Method   Backbone       mAP
[247]    ResNeXt-101    83.1
[427]    ResNet-101     83.1
[452]    ResNet-101     82.9
[63]     ResNet-101     82.6
[176]    ResNet-101     82.4
[207]    ResNet-101     82.1
[467]    VGG-16         81.7
[92]     ResNet-101     81.5
[475]    DenseNet-169   80.9
[469]    ResNet-101     80.7
[62]     ResNet-101     80.5

Table 6: State-of-the-art methods on VOC07 test set (Using VOC07+12).

Method   Backbone       mAP
[427]    ResNet-101     81.2
[176]    ResNet-101     81.1
[247]    ResNeXt-101    80.9
[207]    ResNet-101     80.6
[452]    ResNet-101     80.5
[467]    VGG-16         80.3
[92]     ResNet-101     80.0
[221]    ResNet-101     78.5
[62]     ResNet-101     77.6

Table 7: State-of-the-art methods on VOC12 test set (Using VOC07++12).

A.1.2 MS COCO

MS COCO [214] is the most challenging object detection dataset available today. It consists of 118,000 training images, 5,000 validation images and 41,000 testing images. Its authors have also released 120K unlabeled images that follow the same class distribution as the labeled images; these may be useful for semi-supervised learning on COCO. The MS COCO challenge has been ongoing since 2015. There are 80 object categories, four times as many as in Pascal-VOC. MS COCO is a fine replacement for Pascal-VOC, which has arguably started to age a little. Like ImageNet in its time, MS COCO has become the de facto standard for the object detection community, and any method reaching the state of the art on it is assured to gain much traction and visibility. The AP is calculated as for Pascal-VOC but averaged over multiple IoU thresholds from
0.5 to 0.95.
Most available alternatives stemmed from Faster R-CNN [309], which in its first iteration won the first challenge with 37.3 mAP with a ResNet101 backbone. In the second iteration of the challenge the mAP went up to 41.5 with an ensemble of Faster R-CNN [309] that used a different implementation of RoI-Pooling. This may have inspired the RoI-Align of Mask R-CNN [118]. Tao Kong claimed that a single Faster R-CNN with HyperNet features [174] can reach 42.0 mAP. The best published single model method [274] nowadays is around 50.5 (52.5 with an ensemble) and relied on different techniques already mentioned in this survey. Among them one can mention FPN [215], large batch training [274] and GCN [275]. Ensembling Mask R-CNNs [118] gave around the same performance as [274] at around 50.3 mAP. Deformable R-FCN [63] is not lagging too far behind with 48.5 mAP single model performance (50.4 mAP with an ensemble) using Soft NMS [21] and the "mandatory" FPN [215]. Other entries were based mostly on Mask R-CNN [118]. We display the current leaderboard (http://cocodataset.org/#detection-leaderboard), together with the main ideas present in the winning entries of all the past challenges, in Figure 22. The URL of the dataset is http://cocodataset.org.

A.1.3 ImageNet Detection Task

ImageNet is a dataset organized according to the nouns of the WordNet hierarchy. Each node of the hierarchy is depicted by hundreds and thousands of images, with an average of over 5,000 images per node. Since 2010, the Large Scale Visual Recognition Challenge has been organized each year and contains a detection challenge using ImageNet images. The detection task, in which each object instance has to be detected, has 200 categories. There is also a classification and localization task, with 1,000 categories, in which algorithms have to produce only 5 labels (and 5 bounding boxes), so that the detection of objects that are present but not included in the ground truth is not penalized. In the 2017 contest, the top detector was proposed by a team from Nanjing University of Information Science and Imperial College London. It ranked first on 85 categories with an overall AP of 73.13. As far as we know, there is no paper describing the approach precisely (but some slides are available at the workshop page). The 2nd ranked method was from Bae et al. [8], who observed that modern convolutional detectors behave differently for each object class. The authors consequently built an ensemble detector by finding the best detector for each object class. They obtained an AP of 59.30 points and won 10 categories. ImageNet is available at http://image-net.org.

A.1.4 VisualGenome

VisualGenome [179] is a very peculiar dataset focusing on object relationships. It contains over 100,000 images. Each image has bounding boxes but also complete scene graphs. Over 17,000 categories of objects are present. The most represented ones by far are man and woman, followed by trees and sky. On average there are 21 objects per image. It is unclear if it qualifies as an object detection dataset, as the paper does not include clear object detection metrics or evaluation; its focus is on scene graphs and visual relationships. However, it is undoubtedly an enormous source of strongly supervised images to train object detectors. The Visual Genome Dataset has a huge number of classes, most of them being small and hard to detect. The mAP reported in the literature is therefore much smaller than on previous datasets. One of the best performing approaches is that of Li et al. [204], which reached 7.43 mAP by linking object detection, scene graph generation and region captioning. Faster R-CNN [104] has a mAP of 6.72 points on this dataset. The URL of the dataset is https://visualgenome.org.

A.1.5 OpenImages

The challenge OpenImagesV4 [178] that will be organized for the first time at ECCV2018 offers the largest to date common objects detection dataset with up to 500 classes (including the familiar ones

Figure 22: This plot displays the performance advances in the bounding boxes detection COCO challenge
over the years. For each year we present the main ideas behind the three best performing entries in terms
of mmAP. In 2015 the main frameworks were Fast R-CNN [104], DeepMask [282] and Faster R-CNN [309]
supported by the new Deep ResNets [117]. In 2016, the same pipelines won the competition with the
addition of AttractioNet [101] and LocNet [103] for better proposals and localization accuracy. In 2017
Mask R-CNN [118], FPN [215] and MegDet [274] proved that more complex ideas could allow to go over
the 50% mark. In 2018 the same pipelines, as in 2017 (namely Mask R-CNN), were enriched with the
multi-stages of Cascade R-CNN [27], a new RPN and backbones that were for the first time specifically
designed for the detection task. The last entry of 2018 reached 53% mmAP and we can extrapolate the two
first entries to be around 55% bbox mmAP based on their ranking for instance segmentation.

from Pascal-VOC) on 1,743,000 images and more than 12,000,000 bounding boxes with an average of 7 objects per image for training, and 125,436 images for tests (41,620 for validation). The object detection metric is the AP@0.5IoU averaged across classes, taking into account the hierarchical structure of the classes, with some technical subtleties on how to deal with groups of objects closely packed together. This is the first detection dataset to have so many classes and images and it will surely require some new breakthrough to get it right. At the time of writing there are no published or unpublished results on it, although an Inception ResNet Faster R-CNN baseline reported on their site reaches 37 mAP. The URL of the project is https://storage.googleapis.com/openimages/web/index.html.
For industrial applications, more often than not, the objects to detect do not come from the categories present in VOC or MS-COCO. Furthermore, they do not share the same variances; rotation variance, for instance, is a property of several application domains but is not present in any classical common object dataset. That is why, pushed by industry needs, several other object detection domains have appeared, all with their respective literature. The most famous of them are listed in the following sections.

A.2 Specialized datasets

To find interesting domains one has to find interesting products or applications that drive them. The industry has given birth to many sub-fields in object detection: they wanted to have self-driving cars so we built pedestrian detection and traffic signs detection datasets; they wanted to monitor traffic so we had to have aerial imagery datasets; they wanted to be able to read text for blind persons or automatic translations of foreign languages so we constructed text detection datasets; some people wanted to do personalized advertising (arguably not a good idea) so we engineered logo datasets. They all have their place in this specialized dataset section.

A.2.1 Aerial Imagery

The detection of small vehicles in aerial imagery is an old problem that has gained much attention in recent times. However, it was only in the last years that large datasets have been made publicly available, making the topic even more popular. The following paragraphs take inventory of these datasets and of the best performing methods.
Google Earth [120] comprises 30 images of the city of Brussels with 1,319 small cars and vertical bounding boxes. Its variability is not enormous but it is still widely used in the literature. There are 5 folds. The best CNN result is [52] with 94.6 AP. It was later augmented with angle annotations by Henriques and Vedaldi [122]. The data can be found on Geremy Heitz's webpage (http://ai.stanford.edu/~gaheitz/Research/TAS/).
OIRDS [375], with only 180 vehicles, is not used very much by the community.
DLR 3k Munich Dataset [219] is one of the most used datasets in the small vehicle detection literature with 20 extra large images. There are 10 training images with up to 3,500 cars and 70 trucks and 10 test images with 5,800 cars and 90 trucks. Other classes are also available, like car or truck trails and dashed lines. The state-of-the-art seems to belong to [373] at 83% of F1 on both cars and trucks and [372] at 82%, which provide oriented boxes. Some relevant articles that compare on this dataset are [67, 350, 351]. The data can be downloaded by asking the provided contact on https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-5431/9230_read-42467/.
VeDAI [302] is a dataset for vehicle detection in aerial images. The vehicles contained in the database, in addition to being small, exhibit different variability such as multiple orientations, lighting/shadowing changes, occlusions, etc. Furthermore, each image is available in several spectral bands and resolutions. They provide the same images in 2 resolutions, 512x512 and 1024x1024. There are a total of 9 classes and 1,200 images with an average of 5.5 instances per image. It is one of the few datasets to have 10 folds and the metric is based on an ellipse based distance between the center of the ground truth and the centers of the detections. The state-of-the-art is currently held by [259]. However, many recent articles used their own metrics, which makes them difficult to compare [323, 351, 352, 372, 373]. VeDAI is available at https://downloads.greyc.fr/vedai/.
COWC [252], introduced in ECCV2016, is a very large dataset with regions from all over the world and more than 32,000 cars. It also contains almost 60,000 hand-picked hard negative patches, which is a blessing when training detectors that do not include hard-example mining strategies. Unfortunately, no test data annotations are available so detection methods cannot yet be properly tested on it. COWC is available at https://gdo152.llnl.gov/cowc/.
DOTA [421], released this year at CVPR, is the first mainstream dataset to change its metric to incorporate rotated bounding boxes similar to the text detection datasets. The images are of very different resolutions and zoom factors. There are 2,800 images with almost 200,000 instances and 15 categories. This dataset will surely become one of the important ones in the near future. The leader board https://captain-whu.github.io/DOTA/results.html shows that Mask R-CNN structures are the best at this task for the moment, with the winner culminating at 76.2 oriented mAP, but there is no other published method apart from [421] yet. UCAS-AOD [479], NWPU VHR10 [54] and HRSC2016 [223] all provide oriented annotations as well, but they are hard to find and very few articles actually use them. DOTA is available at https://captain-whu.github.io/DOTA/dataset.html
xView [186] is a very large scale dataset gathered by the Pentagon, containing 60 classes and 1 million instances. It is split in three parts: train, val and test. xView is available at http://xviewdataset.org. The first challenge will end in August 2018; no results are available yet.
VisDrone [481] is the most recent dataset including aerial images. Images were captured by different drones flying over 14 different cities separated by thousands of kilometers in China, in different scenarios under various weather and lighting conditions. The dataset consists of 263 video sequences formed by 179,264 frames and 10,209 static images and contains different objects such as pedestrians, vehicles, bicycles, etc. and densities (sparse and crowded scenes). Frames are manually annotated with more than 2.5 million bounding boxes and some attributes, e.g. scene visibility, object class and occlusion, are provided. VisDrone is very recent and no results are available yet. VisDrone is available at http://www.aiskyeye.com.

A.2.2 Text Detection in Images

Text detection in images or videos is a common way to extract content from images and opens the door to image retrieval or automatic text translation applications. We inventory, in the following, the main datasets as well as the best practices to address this problem.
ICDAR 2003 [227] was one of the first public datasets for text detection. The dataset contains 509 scene images and the scene text is mostly centered and iconic. Delakis and Garcia [65] were among the first to use CNNs on this dataset.
Street View Text (SVT) [403]. Taken from Google StreetView, it is a dataset filled mostly with business names from outdoor streets. There are 350 images and 725 instances. One of the best performing methods on SVT is [468] with an F-measure of 83%. SVT can be downloaded from http://tc11.cvc.uab.es/datasets/SVT_1.
MSRA-TD500 [387] contains 500 natural images, which are taken from indoor (office and mall) and outdoor (street) scenes. The resolutions of the images vary from 1296 × 864 to 1920 × 1280. There are Chinese, English and mixed texts. The training set contains 300 images randomly selected from the original dataset and the remaining 200 images constitute the test set. The best performing method on MSRA-TD500 is [212] with an F-measure of 79%. Shi et al. [333], Yao et al. [435], Ma et al. [228] and Zhang et al. [466] also performed very well (F-measures of 77%, 76%, 75% and 75% respectively). The dataset is available at http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500).
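The F-measures quoted for the text detection benchmarks above are, in the usual convention, the harmonic mean of precision and recall. The following is a minimal sketch of that computation only; the function name `f_measure` and the example precision and recall values are illustrative assumptions, not figures from the cited papers, and each benchmark defines its own matching protocol for deciding what counts as a correct detection.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1).

    Returns 0.0 when both inputs are 0 to avoid a division by zero.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical detector with 80% precision and 75% recall.
example_score = f_measure(0.80, 0.75)
print(f"F-measure: {example_score:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high F-measure by trading away recall for precision or the reverse.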

IIIT 5k-word [242] has 1,120 images and 5,000 words from both street scene texts and born-digital images. 380 images are used to train and the remaining ones to test. Each text also has a category label, easy or hard. [212] is state-of-the-art, as for MSRA-TD500. IIIT 5k-word is available at http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html.
Synth90K [153] is a completely generated grayscale text dataset with multiple fonts and vocabulary well blended into scenes, with 9 million images from a 90,000 vocabulary. It can be found on the VGG page at http://www.robots.ox.ac.uk/~vgg/data/text/
ICDAR 2015 [165] is another popular iteration of the ICDAR challenge, following ICDAR 2013. Busta et al. [26] got a state-of-the-art F-measure of 87%, in comparison to the 83.8% of Liao et al. [212] and the 82.54% of Jiang et al. [159]. TextBoxes++ [211] reached 81.7% and Shi et al. [333] is at 75%.
COCO Text [396], based on MS COCO, is the biggest dataset for text detection. It has 63,000 images with 173,000 annotations. [212] and [477] are the only published results so far that differ from the baselines implemented in the dataset paper [396]. So there must still be room for improvement. The very recent [211] outperformed [477]. COCO Text is available at https://bgshih.github.io/cocotext/.
RCTW-17 (ICDAR 2017) [334] is the latest ICDAR database. It is a large line-based dataset with mostly Chinese text. Liao et al. [212] achieved SOTA on this one too with an F-measure of 67.0%. The dataset is available at http://www.icdar2017chinese.site/dataset/.

A.2.3 Face Detection

Face detection is one of the most widely addressed detection tasks. Even if the detection of frontal faces in high resolution images is an almost solved problem, there is room for improvement when the conditions are harder (non-frontal images, small faces, etc.). These harder conditions are reflected by the following recent datasets. The main characteristics of the different face datasets are summarized in Table 8.
Face Detection Data Set and Benchmark (FDDB) [155] is built from Yahoo! News images, with 2845 images and a total of 5171 faces; it has a wide range of difficulties such as occlusions, strong pose changes, low resolution and out-of-focus faces, with both grayscale and color images. Zhang et al. [458] obtained an AUR of 98.3% on this dataset and is currently state-of-the-art for this dataset. Najibi et al. [255] obtained 98.1%. The dataset can be downloaded at http://vis-www.cs.umass.edu/fddb/index.html.
Annotated Facial Landmarks in the Wild (AFLW) [177] is made from a collection of images collected on Flickr, with a large variety in face appearance (pose, expression, ethnicity, age, gender) and environmental conditions. It is not aimed at face detection only, but is more oriented towards landmark detection and face alignment. In total 25,993 faces in 21,997 real-world images are annotated. Annotations come with rich facial landmark information (21 landmarks per face). The dataset can be downloaded from https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/.
Annotated Face in-the-Wild (AFW) [483] is a dataset containing faces in real conditions, with their associated annotations (bounding box, facial landmarks and pose angle labels). Each image contains multiple, non-frontal faces. The dataset contains 205 images with 468 faces. Zhang et al. [458] obtained an AP of 99.85% on this dataset and is currently state-of-the-art for this dataset.
PASCAL Faces [430] contains images selected from PASCAL VOC [88] in which the faces have been annotated. [458] obtained an AP of 98.49% on this dataset, and is currently state-of-the-art for this dataset.
Multi-Attribute Labeled Faces (MALF) [20] incorporates richer semantic annotations such as pose, gender and occlusion information as well as expression information. It contains 5,250 images collected from the Internet and approximately 12,000 labeled faces. The dataset and up-to-date results of the evaluation can be found at http://www.cbsr.ia.ac.cn/faceevaluation/.
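The image and face counts quoted in the paragraphs above give a quick feel for how crowded each dataset is: dividing the number of faces by the number of images yields the average number of faces per image. The sketch below does only that arithmetic, using the counts stated in the text (the MALF face count is the approximate figure of 12,000); the dictionary and variable names are illustrative assumptions.

```python
# (images, faces) as quoted in the text above; MALF's face count is approximate.
face_datasets = {
    "FDDB": (2845, 5171),
    "AFLW": (21997, 25993),
    "AFW": (205, 468),
    "MALF": (5250, 12000),
}

# Average number of faces per image for each dataset.
faces_per_image = {
    name: faces / images for name, (images, faces) in face_datasets.items()
}

# List the datasets from the most to the least crowded.
for name, density in sorted(faces_per_image.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {density:.2f} faces per image")
```

A higher ratio suggests more faces to find in a typical image, which is one rough indicator of how demanding a dataset is.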

Wider Face [433] is one of the largest datasets for face detection. Each annotation includes information such as scale, occlusion, pose, overall difficulty and events, which makes in-depth analyses possible. This dataset is very challenging, especially for the 'hard set'. Najibi et al. [255] obtained an AP of 93.1% (easy), 92.1% (medium) and 84.5% (hard) on this dataset and is currently state-of-the-art for this dataset. Zhang et al. [458] are also very good with AP of 92.8% (easy), 91.3% (medium) and 84.0% (hard). Datasets and results can be downloaded at http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/.
IARPA Janus Benchmark A (IJB-A) [172] contains images and videos from 500 subjects captured in 'in the wild' environments, and contains annotations for both recognition and detection tasks. All labeled faces are localized with bounding boxes as well as with landmarks (center of the two eyes, base of the nose). IJB-B [412] extended this dataset with 1,845 subjects, for 21,798 still images and 55,026 frames from 7,011 videos. IJB-C [238], which is the new extended version of the IARPA Janus Benchmark A and B, adds 1,661 new subjects to the 1,870 subjects released in IJB-B. The NIST Face Challenges are at https://www.nist.gov/programs-projects/face-challenges.
Un-constrained Face Detection Dataset (UFDD) [253] was built after noting that in many challenges large variations in scale, pose and appearance are successfully addressed, but that there is a gap between the performance of state-of-the-art detectors and real-world requirements, not captured by existing methods or datasets. UFDD aimed at identifying the next set of challenges and collecting a new dataset of face images that involve variations such as weather-based degradations, motion blur and focus blur. The authors also provide an in-depth analysis of the results and failure cases of these methods. This dataset is very recent and has not been used specifically yet. However, Nada et al. [253] reported the performances (in terms of AP) of Faster-RCNN [309] (52.1%), SSH [255] (69.5%), S3FD [458] (72.5%) and HR-ER [137] (74.2%). Dataset and results can be downloaded at http://www.ufdd.info/.

Dataset               #Images   #Faces    Source              Type
FDDB [155]            2,845     5,171     Yahoo! News         Images
AFLW [177]            21,997    25,993    Flickr              Images
AFW [483]             205       473       Flickr              Images
PASCAL Faces [430]    851       1,335     Pascal-VOC          Images
MALF [20]             5,250     11,931    Flickr, Baidu Inc.  Images
IJB-A [172]           24,327    67,183    Google, Bing, etc.  Images/Videos
IIIT-CFW [241]        8,927     8,928     Google              Images
Wider Face [433]      32,203    393,703   Google, Bing        Images
IJB-B [412]           76,824    125,474   Freebase            Images/Videos
IJB-C [238]           148,876   540,630   Freebase            Images/Videos
Wildest Faces [444]   67,889    109,771   YouTube             Videos
UFDD [253]            6,424     10,895    Google, Bing, etc.  Images

Table 8: Datasets for face detection.

IIIT-Cartoon Faces in the Wild (IIIT-CFW) [241] contains 8,927 annotated images of cartoon faces belonging to 100 famous personalities, harvested from Google image search, with annotations including attributes such as age group, view, expression, pose, etc. The benchmark includes 7 challenges: Cartoon face recognition, Cartoon face verification, Cartoon gender identification, photo2cartoon and cartoon2photo, face detection, pose estimation and landmark detection, relative attributes in Cartoon and attribute-based cartoon search. Jha et al. [157] have published SOTA detection results using a Haar features-based detector, with an F-measure of 84%. The dataset can be downloaded from http://cvit.iiit.ac.in/research/projects/cvit-projects/cartoonfaces
Wildest Faces [444] is a dataset where the emphasis is put on violent scenes in unconstrained scenarios. It contains images of diverse quality, resolution and motion blur. It includes 68K images (i.e. video frames) and 2186 shots of 64 fighting celebrities. All of the video frames are manually annotated to foster research on both detection and recognition. The dataset is not released at the time this survey is written.

A.2.4 Pedestrian Detection

Pedestrian detection is one of the specific tasks abundantly studied in the literature, especially since research on autonomous vehicles has intensified.
MIT [272] is one of the first pedestrian datasets. It's puny in size (509 training and 200 testing images). The images were extracted

from the LabelMe database. You can find it at http://cbcl.mit.edu/software-datasets/PedestrianData.html

Figure 23: Number of images vs number of faces in each dataset (Table 8) on a log scale. The size of the bubble indicates average number of faces per image which can be used as an estimate of complexity of the dataset.

INRIA [64] is currently one of the most popular static pedestrian detection datasets, introduced in the seminal HOG paper [64]. It uses the Caltech metric. Zhang et al. [459] gained state-of-the-art with 6.4% log average miss rate. The method in second position is [455] with 6.9%, using the RPN from Faster R-CNN and boosted forests on extracted features. The others are not CNN methods (the third one using pooling with HOG, LBP and covariance matrices). It can be found at http://pascal.inrialpes.fr/data/human/. Similarly, the PASCAL Persons dataset is a subset of the aforementioned Pascal-VOC dataset.
CVC-ADAS [100] is a collection of datasets including videos acquired on board, virtual-world pedestrians and real pedestrians. It can be found at http://adas.cvc.uab.es/site/.
USC [417] is an old small pedestrian dataset taken largely from surveillance videos. It is still downloadable at http://iris.usc.edu/Vision-Users/OldUsers/bowu/DatasetWebpage/dataset.html
ETH [87] was captured from a stroller. There are 490 training frames with 1578 annotations. There are three test sets. The first test set has 999 frames with 5193 annotations, the second one 450 and 2359 and the third one 354 and 1828 respectively. The stereo cues are available. It is a difficult dataset where the state-of-the-art from Zhang et al. [459] trained on CityPersons still remains at 24.5% log average miss rate. The boosted forest of Zhang et al. [455] gets 30.2% only. It is available at https://data.vision.ee.ethz.ch/cvl/aess/iccv2007/
Daimler DB [84] is an old dataset captured in an urban setting; it builds on the DaimlerChrysler datasets and has only grayscale images. It has been recently extended with Cyclist annotations into the Tsinghua Daimler Cyclist (TDC) dataset [202] with color images. The dataset is available at http://www.gavrila.net/Datasets/datasets.html.
TUD-Brussels [414] is from the TU Darmstadt University and contains image pairs recorded in a crowded urban setting with an on-board camera from a car. There are 1092 image pairs with 1776 annotations in the training set. The test set contains 508 image pairs with 1326 pedestrians. The evaluation is measured from the recall at 90% precision, somewhat reminiscent of the KITTI dataset. TUD-Brussels is available at https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/people-detection-pose-estimation-and-tracking/multi-cue-onboard-pedestrian-detection/.
Caltech USA [71] contains images captured in the Greater Los Angeles area by an independent driver to simulate real-life conditions without any bias. 192,000 pedestrian instances are available for training and 155,000 for testing. The evaluation uses the Pascal-VOC criterion at 0.5 IoU. The performance measure is the log average miss rate since, application-wise, one cannot have too many False Positives Per Image (FPPI). It is computed by averaging miss rates at 9 FPPIs from 10^-2 to 1 uniformly in log scale. State-of-the-art algorithms are at around 4% log average miss rate. Wang et al. [409] got 4.0% by using a novel bounding box regression loss. Following it, we have Zhang et al. [459] at 4.1% using

a novel RoI-Pooling of parts helping with occlusions and pre-training on CityPersons. Mao et al. [231] is lagging behind with 5.5%, using a Faster R-CNN with additional aggregated features. There also exists a CalTech Japan dataset. The benchmark is hosted at http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/.
KITTI [98] is one of the most famous datasets in Computer Vision, taken over the city of Karlsruhe in Germany. There are 100,000 instances of pedestrians, with around 6000 identities and one person on average per image. The preferred metric is the AP (Average Precision) on the moderate set (persons who are less than 25 pixels tall are left out of the ranking). Li et al. [200] got 65.01 AP on moderate by using an adapted version of Fast R-CNN with different heads to deal with different scales. The state-of-the-art of Chen et al. [45] had to rely on stereo information to get good object proposals and 67.47 AP. All KITTI related datasets are found at http://www.cvlibs.net/datasets/kitti/index.php.
GM-ATCI [340] is a dataset captured from a fisheye-lens camera that uses the CalTech evaluation system. We could not find any CNN detection results on it, possibly because the state-of-the-art using multiple cues is already pretty good with 3.5% log average miss rate. The sequences can be downloaded here: https://sites.google.com/site/rearviewpeds1/
CityPersons [456] is a relatively new dataset that builds upon CityScapes [58], a semantic segmentation dataset recorded in 27 different cities in Germany. There are 19,744 persons in the training set and around 11,000 in the test set. There are way more identities present than in CalTech even though there are fewer instances (about 1,300 identities in CalTech versus 19,000 in CityPersons). Therefore, it is more diverse and thus more challenging. The metric is the same as CalTech with some subsets like the Reasonable: the pedestrians that are more than 50 pixels tall and less than 35% occluded. Again Zhang et al. [459] and Wang et al. [409] take the lead with 11.32% and 11.48% respectively on the reasonable set, compared to the adapted Faster R-CNN baseline that stands at 12.97% log average miss rate. The dataset is available at https://bitbucket.org/shanshanzhang/citypersons.
EuroCity [25] is the largest pedestrian detection dataset ever released, with 238,300 instances in 47,300 images. Images are taken over 31 cities in 12 different European countries. The metric is the same as CalTech. Three baselines were tested (Faster R-CNN, R-FCN and YOLOv3). Faster R-CNN dominated on the reasonable set with 8.1%, followed by YOLOv3 with 8.5% and R-FCN lagging behind with 12.1%. On other subsets with heavily occluded or small pedestrians the ranking is not the same. We refer the reader to the dataset paper [25].

A.2.5 Logo Detection

Logo detection attracted a lot of attention in the past, due to the specificity of the task. At the time of writing this survey, there are fewer papers on this topic and most of the logo detection pipelines are direct applications of Faster RCNN [309].
BelgaLogos [160] images come from the BELGA press agency. The dataset is composed of 10,000 images covering all aspects of life and current affairs: politics and economics, finance and social affairs, sports, culture and personalities. All images are in JPEG format and have been re-sized with a maximum value of height and width equal to 800 pixels, preserving aspect ratio. There are 26 different logos. Only a few images are annotated with bounding boxes. The dataset can be downloaded at https://www-sop.inria.fr/members/Alexis.Joly/BelgaLogos/BelgaLogos.html.
FlickrLogos [80, 313] consists of real-world images collected from Flickr, depicting company logos in various situations. The dataset comes in two versions: the original FlickrLogos-32 dataset and the FlickrLogos-47 [80] dataset. In FlickrLogos-32 the annotations for object detection were often incomplete, since only the most prominent logo instances were labeled. FlickrLogos-47 uses the same image corpus as FlickrLogos-32 but new classes were introduced (logo and text as separate classes) and missing object instances have been annotated. FlickrLogos-47 contains 833 training and 1402 testing images. The dataset can be downloaded at http://www.multimedia-computing.de/flickrlogos/.
Logo32plus [17] is an extension of the train set of FlickrLogos-32 [80]. It has the same classes of objects but many more training instances (12,312 instances). The dataset can be downloaded at http://www.ivl.disco.unimib.it/activities/logorecognition.

Dataset                  #Classes   #Images
BelgaLogos [160]         26         10,000
FlickrLogos-32 [313]     32         8,240
FlickrLogos-47 [80]      47         8,240
Logo32plus [17]          32         7,830
WebLogo-2M [358]         194        2,190,757
SportsLogo [213]         20         1,978
Logos in the Wild [389]  871        11,054
OpenLogos [360]          309        27,189

Table 9: Datasets for logo detection.

WebLogo-2M [358] is very large, but annotated at image level only and does not contain bounding boxes. It contains 194 logo classes and over 2
million logo images. Labels are noisy as the an- classes have no labeled training data. It contrasts
notations are automatically generated. Therefore, with previous logo datasets which assumed all the
this dataset is designed for large-scale logo detec- logo classes are annotated. The OpenLogo chal-
tion model learning from noisy training data. For lenge contains 27,189 images from 309 logo classes,
performance evaluation, the dataset includes 6,569 built by aggregating/refining 7 existing datasets
test images with manually labeled logo bounding and establishing an open logo detection evalua-
boxes for all the 194 logo classes. The dataset can tion protocol. The dataset can be downloaded at
be downloaded at http://www.eecs.qmul.ac.uk/ https://qmul-openlogo.github.io.
%7Ehs308/WebLogo-2M.html/.
SportsLogo [213], in the absence of public video A.2.6 Traffic Signs Detection
logo dataset, was collected on a set of tennis videos
containing 20 different tennis video clips with cam- This section reviews the 4 main datasets and
era motions (blurring) and occlusion. The logos benchmarks for evaluating traffic sign detectors
can appear on the background as well as on play- [133, 246, 379, 490], as well as the Bosch Small
ers and staffs clothes. 20 logos are annotated, with Traffic Lights [13]. The most challenging one is the
about 100 images for each logo. Tsinghua Tencent 100k (TTK100) [490], on which
Logos in the Wild [389] contains images collected Faster RCNN like detectors detectors such as [285]
from the web with logo annotations provided in have an overall precision/recall of 44%/68%, which
Pascal-VOC style. It contains large varieties of shows the difficulty of the dataset.
brands in-the-wild. The latest version (v2.0) of LISA Traffic Sign Dataset [246] was among the
the dataset consists of 11,054 images with 32,850 first datasets for traffic sign detection. It contains
annotated logo bounding boxes of 871 brands. It 47 US signs and 7,855 annotations on 6,610 video
contains from 4 to 608 images per searched brand, frames. Sign sizes vary from 6x6 to 167x168 pixels.
and 238 brands occur at least 10 times. It has up Each sign is annotated with sign type, position,
to 118 logos in one image. Only the links to the im- size, occluded (yes/no), on side road (yes/no). The
ages are released, which is problematic as numer- URL for this dataset is http://cvrr.ucsd.edu/
ous images have already disappeared, making exact LISA/lisa-traffic-sign-dataset.html
comparisons impossible. The dataset can be down- The German Traffic Sign Detection Benchmark
loaded from https://www.iosb.fraunhofer.de/ (GTSDB) [133] is one of the most popular traf-
servlet/is/78045/. fic signs detection benchmarks. It introduced a
Open Logo Detection Challenge [360]. This dataset with evaluation metrics, baseline results,
dataset assumes that only on a small proportion and a web interface for comparing approaches. The
of logo classes are annotated whilst the remaining dataset provides a total of 900 images with 1,206

101
traffic signs. The traffic sign sizes vary between 16 and 128 pixels w.r.t. the longest edge. The
image resolution is 1360 × 800; images capture different scenarios (urban, rural, highway) during
the daytime and dusk featuring various weather conditions. It can be found at
http://benchmark.ini.rub.de/?section=gtsdb&subsection=news.

Belgian TSD [379] consists of 7,356 still images for training, with a total of 11,219 annotations,
corresponding to 2,459 traffic signs visible at less than 50 meters in at least one view. The test
set contains 4 sequences, captured by 8 roof-mounted cameras on the van, with a total of 121,632
frames and 269 different traffic signs for evaluating the detectors. For each sign, the type and 3D
location is given. The dataset can be downloaded at https://btsd.ethz.ch/shareddata/.

Tsinghua Tencent 100k (TTK100) [490] provides 2048 × 2048 images for traffic signs detection and
classification, with various illumination and weather conditions. It’s the largest dataset for
traffic signs detection, with 100,000 images out of which 16,787 contain traffic signs instances,
for a total of 30,000 traffic instances. There are a total of 128 classes. Each instance is
annotated with class label, bounding box and pixel mask. It has small objects in abundance and huge
scale variations. Some signs which are naturally rare, e.g. signs warning the driver to be cautious
on mountain roads, have quite a low number of instances. There are 45 classes with at least 100
instances present. The dataset can be obtained at http://cg.cs.tsinghua.edu.cn/traffic%2Dsign/.

Bosch Small Traffic Lights [13] is made for benchmarking traffic light detectors. It contains
13,427 images of size 1280 × 720 pixels with around 24,000 annotated traffic lights, annotated with
bounding boxes and states (active light). The best performing algorithm is [285], which obtained a
mAP of 53 on this dataset. Bosch Small Traffic Lights can be downloaded at
https://hci.iwr.uni-heidelberg.de/node/6132.

A.2.7 Other Datasets

Some datasets do not fit in any of the previously mentioned categories but deserve to be mentioned
because of the interest the community has for them.

iNaturalist Species Classification and Detection Dataset [394] contains 859,000 images from over
5,000 different species of plants and animals. The goal of this dataset is to encourage the
development of algorithms for ’in the wild’ data featuring large numbers of imbalanced,
fine-grained categories. The dataset can be downloaded at
https://github.com/visipedia/inat_comp/tree/master/2017.

Below we give all known datasets that can be used to tackle object detection with the different
modalities that we presented in Sec. 4.1.

A.3 3D Datasets

KITTI object detection benchmark [98] is the most widely used dataset for evaluating detection in
3D point clouds. It contains 3 main categories (namely 2D, 3D and birds-eye-view objects), 3 object
categories (cars, pedestrians and cyclists), and 3 difficulty levels (easy, moderate and hard,
considering the object size, distance, occlusion and truncation). The dataset is public and
contains 7,481 images for training and 7,518 for testing, comprising a total of 80,256 labeled
objects. The 3D point clouds are acquired with a Velodyne laser scanner. 3D object detection
performance is evaluated using the PASCAL criteria also used for 2D object detection. For cars a 3D
bounding box overlap of 70% is required, while for pedestrians and cyclists a 3D bounding box
overlap of 50% is required. For evaluation, precision-recall curves are computed and the methods
are ranked according to average precision. The algorithms can use the following sources of
information: i) Stereo: Method uses left and right (stereo) images; ii) Flow: Method uses optical
flow (2 temporally adjacent images); iii) Multiview: Method uses more than 2 temporally adjacent
images; iv) Laser Points: Method uses point clouds from Velodyne laser scanner; v) Additional
training data: Use of additional data sources for training. The datasets and performance of SOTA
detectors can be downloaded at http://www.cvlibs.net/datasets/kitti/, and the leader board is at
http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. One of the leading methods
is [342] which is at an mAP of 67.72/64.00/63.01 (Easy/Mod./Hard) for the car category, at 50 fps.
Slower (10 fps) but more accurate, [182] has a performance of 81.94/71.88/66.38 on cars. Chen et
al. [47], Zhou and Tuzel [478] and Qi et al. [289] also gave very good results.

Active Vision Dataset (AVD) [5] contains 30,000+ RGBD images, 30+ frequently occurring instances,
15 scenes, and 70,000+ 2D bounding boxes. This dataset focused on simulating robotic vision tasks
in everyday indoor environments using real imagery. The dataset can be downloaded at
http://cs.unc.edu/~ammirato/active_vision_dataset_website/.

SceneNet RGB-D [239] is a synthetic dataset designed for scene understanding problems such as
semantic segmentation, instance segmentation, and object detection. It provides camera poses and
depth data and permits the creation of any scene configuration. 5M rendered RGB-D images from 16K
randomly generated 3D trajectories in synthetic layouts are also provided. The dataset can be
downloaded at http://robotvault.bitbucket.io/scenenet-rgbd.html.

Falling Things [384] introduced a novel synthetic dataset for 3D object detection and pose
estimation, the Falling Things dataset. The dataset contains 60k annotated photos of 21 household
objects taken from the YCB dataset. For each image, the 3D poses, per-pixel class segmentation, and
2D/3D bounding box coordinates for all objects are given. To facilitate testing different input
modalities, mono and stereo RGB images are provided, along with registered dense depth images. The
dataset can be downloaded at http://research.nvidia.com/publication/2018-06_Falling-Things.

A.4 Video Datasets

The two most popular datasets for video object detection are the YouTube-BoundingBoxes [303] and
the ImageNet VID challenge [319]. Both are reviewed in this section.

YouTube-BoundingBoxes [303] is a data set of video URLs with single object bounding box
annotations. All video sequences are annotated with classifications and bounding boxes, at 1 frame
per second. There is a total of about 380,000 video segments of 15-20 seconds, from 240,000
publicly available YouTube videos, featuring objects in natural settings, without editing or
post-processing. Real et al. [303] reported a mAP of 59 on this dataset. This dataset can be
downloaded at https://research.google.com/youtube-bb/.

ImageNet VID challenge [319] was a part of the ILSVRC 2015 challenge. It has a training set of
3,862 fully annotated video sequences having a length from 6 frames to 5,492 frames per video. The
validation set contains 555 fully annotated videos, ranging from 11 frames to 2898 frames per
video. Finally, the test set contains 937 video sequences and the ground-truth annotations are not
publicly available. One of the best performing methods on ImageNet VID is [89] with a mAP of 79.8,
by combining detection and tracking. Zhu et al. [484] reached 76.3 points with a flow based
approach. This dataset can be downloaded at http://image-net.org/challenges/LSVRC.

VisDrone [481] contains video clips acquired by drones. This dataset is presented in Section 5.2.1.

A.5 Concluding Remarks

This appendix gave a large overview of the datasets introduced by the community for developing and
evaluating object detectors in images, videos or 3D point clouds. Each object detection dataset
presents a very biased view of the world, as shown in [169, 380, 381], representative of the user’s
needs when they built it. The bias is not only in the images they chose (specific views of objects,
objects imbalance [264], objects categories) but also in the metric they created and the evaluation
protocol they devised. The community is trying its best to build more and more datasets with less
and less bias and as a result it has become quite hard to find its way in this jungle of datasets,
especially when one needs: older datasets that have fallen out of fashion or even exhaustive lists
of state-of-the-art algorithms performances on modern ones. Through this survey we have partially
addressed this need of a common source for information on datasets.

Neural Approaches to Conversational AI
Question Answering, Task-Oriented Dialogues and Social Chatbots

Jianfeng Gao Michel Galley Lihong Li


Microsoft Research Microsoft Research Google Brain
jfgao@microsoft.com mgalley@microsoft.com lihong@google.com
arXiv:1809.08267v3 [cs.CL] 10 Sep 2019

Abstract
The present paper surveys neural approaches to conversational AI that have been
developed in the last few years. We group conversational systems into three cat-
egories: (1) question answering agents, (2) task-oriented dialogue agents, and
(3) chatbots. For each category, we present a review of state-of-the-art neural
approaches, draw the connection between them and traditional approaches, and
discuss the progress that has been made and challenges still being faced, using
specific systems and models as case studies.1

1
We are grateful to the anonymous reviewers, Chris Brockett, Asli Celikyilmaz, Yu Cheng, Bill Dolan,
Pascale Fung, Zhe Gan, Sungjin Lee, Jinchao Li, Xiujun Li, Bing Liu, Andrea Madotto, Rangan Majumder,
Alexandros Papangelis, Olivier Pietquin, Chris Quirk, Alan Ritter, Paul Smolensky, Alessandro Sordoni, Yang
Song, Hisami Suzuki, Wei Wei, Tal Weiss, Kun Yuan, and Yizhe Zhang for their helpful comments and sug-
gestions on earlier versions of this paper.
Contents

1 Introduction
1.1 Who Should Read this Paper?
1.2 Dialogue: What Kinds of Problems?
1.3 A Unified View: Dialogue as Optimal Decision Making
1.4 The Transition of NLP to Neural Approaches

2 Machine Learning Background
2.1 Machine Learning Basics
2.2 Deep Learning
2.2.1 Foundations
2.2.2 Two Examples
2.3 Reinforcement Learning
2.3.1 Foundations
2.3.2 Basic Algorithms
2.3.3 Exploration

3 Question Answering and Machine Reading Comprehension
3.1 Knowledge Base
3.2 Semantic Parsing for KB-QA
3.3 Embedding-based Methods
3.4 Multi-Step Reasoning on KB
3.4.1 Symbolic Methods
3.4.2 Neural Methods
3.4.3 Reinforcement Learning based Methods
3.5 Conversational KB-QA Agents
3.6 Machine Reading for Text-QA
3.7 Neural MRC Models
3.7.1 Encoding
3.7.2 Reasoning
3.7.3 Training
3.8 Conversational Text-QA Agents

4 Task-oriented Dialogue Systems
4.1 Overview
4.2 Evaluation and User Simulation
4.2.1 Evaluation Metrics
4.2.2 Simulation-Based Evaluation
4.2.3 Human-based Evaluation
4.2.4 Other Evaluation Techniques
4.3 Natural Language Understanding and Dialogue State Tracking
4.3.1 Natural Language Understanding
4.3.2 Dialogue State Tracking
4.4 Dialogue Policy Learning
4.4.1 Deep RL for Policy Optimization
4.4.2 Efficient Exploration and Domain Extension
4.4.3 Composite-task Dialogues
4.4.4 Multi-domain Dialogues
4.4.5 Integration of Planning and Learning
4.4.6 Reward Function Learning
4.5 Natural Language Generation
4.6 End-to-end Learning
4.7 Further Remarks

5 Fully Data-Driven Conversation Models and Social Bots
5.1 End-to-End Conversation Models
5.1.1 The LSTM Model
5.1.2 The HRED Model
5.1.3 Attention Models
5.1.4 Pointer-Network Models
5.2 Challenges and Remedies
5.2.1 Response Blandness
5.2.2 Speaker Consistency
5.2.3 Word Repetitions
5.2.4 Further Challenges
5.3 Grounded Conversation Models
5.4 Beyond Supervised Learning
5.5 Data
5.6 Evaluation
5.7 Open Benchmarks

6 Conversational AI in Industry
6.1 Question Answering Systems
6.1.1 Bing QA
6.1.2 Satori QA
6.1.3 Customer Support Agents
6.2 Task-oriented Dialogue Systems (Virtual Assistants)
6.3 Chatbots

7 Conclusions and Research Trends

Chapter 1

Introduction

Developing an intelligent dialogue system1 that not only emulates human conversation, but also
answers questions on topics ranging from latest news about a movie star to Einstein’s theory of rel-
ativity, and fulfills complex tasks such as travel planning, has been one of the longest running goals
in AI. The goal has remained elusive until recently. We are now observing promising results both in
academia and industry, as large amounts of conversational data become available for training, and the
breakthroughs in deep learning (DL) and reinforcement learning (RL) are applied to conversational
AI.
Conversational AI is fundamental to natural user interfaces. It is a rapidly growing field, attracting
many researchers in the Natural Language Processing (NLP), Information Retrieval (IR) and Ma-
chine Learning (ML) communities. For example, SIGIR 2018 has created a new track of Artificial
Intelligence, Semantics, and Dialog to bridge research in AI and IR, especially targeting Question
Answering (QA), deep semantics and dialogue with intelligent agents.
Recent years have seen the rise of a small industry of tutorials and survey papers on deep learning
and dialogue systems. Yih et al. (2015b, 2016); Gao (2017) reviewed deep learning approaches for
a wide range of IR and NLP tasks, including dialogues. Chen et al. (2017e) presented a tutorial on
dialogues, with a focus on task-oriented agents. Serban et al. (2015; 2018) surveyed public dialogue
datasets that can be used to develop conversational agents. Chen et al. (2017b) reviewed popular
deep neural network models for dialogues, focusing on supervised learning approaches. The present
work substantially expands the scope of Chen et al. (2017b); Serban et al. (2015) by going beyond
data and supervised learning to provide what we believe is the first survey of neural approaches to
conversational AI, targeting NLP and IR audiences.2 Its contributions are:

• We provide a comprehensive survey of the neural approaches to conversational AI that
have been developed in the last few years, covering QA, task-oriented and social bots with
a unified view of optimal decision making.

• We draw connections between modern neural approaches and traditional approaches, al-
lowing us to better understand why and how the research has evolved and to shed light on
how we can move forward.

• We present state-of-the-art approaches to training dialogue agents using both supervised
and reinforcement learning.

• We sketch out the landscape of conversational systems developed in the research commu-
nity and released in industry, demonstrating via case studies the progress that has been
made and the challenges that we are still facing.

1
“Dialogue systems” and “conversational AI” are often used interchangeably in the scientific literature. The
difference is reflective of different traditions. The former term is more general in that a dialogue system might
be purely rule-based rather than AI-based.
2
One important topic of conversational AI that we do not cover is Spoken Language Understanding (SLU).
SLU systems are designed to extract the meaning from speech utterances and their applications are vast, ranging
from voice search in mobile devices to meeting summarization. The present work does encompass many
Spoken Dialogue Systems – for example Young et al. (2013) – but does not focus on components related to
speech. We refer readers to Tur and De Mori (2011) for a survey of SLU.

1.1 Who Should Read this Paper?


This paper is based on tutorials given at the SIGIR and ACL conferences in 2018 (Gao et al.,
2018a,b), with the IR and NLP communities as the primary target audience. However, audiences
with other backgrounds (such as machine learning) will also find it an accessible introduction to
conversational AI with numerous pointers, especially to recently developed neural approaches.
We hope that this paper will prove a valuable resource for students, researchers, and software de-
velopers. It provides a unified view, as well as a detailed presentation of the important ideas and
insights needed to understand and create modern dialogue agents that will be instrumental to making
world knowledge and services accessible to millions of users in ways that seem natural and intuitive.
This survey is structured as follows:

• The rest of this chapter introduces dialogue tasks and presents a unified view in which
open-domain dialogue is formulated as an optimal decision making process.
• Chapter 2 introduces basic mathematical tools and machine learning concepts, and reviews
recent progress in the deep learning and reinforcement learning techniques that are funda-
mental to developing neural dialogue agents.
• Chapter 3 describes question answering (QA) agents, focusing on neural models for
knowledge-base QA and machine reading comprehension (MRC).
• Chapter 4 describes task-oriented dialogue agents, focusing on applying deep reinforce-
ment learning to dialogue management.
• Chapter 5 describes social chatbots, focusing on fully data-driven neural approaches to
end-to-end generation of conversational responses.
• Chapter 6 gives a brief review of several conversational systems in industry.
• Chapter 7 concludes the paper with a discussion of research trends.

1.2 Dialogue: What Kinds of Problems?


Fig. 1.1 shows a human-agent dialogue during the process of making a business decision. The
example illustrates the kinds of problems a dialogue system is expected to solve:

• question answering: the agent needs to provide concise, direct answers to user queries
based on rich knowledge drawn from various data sources including text collections such
as Web documents and pre-compiled knowledge bases such as sales and marketing datasets,
as the example shown in Turns 3 to 5 in Fig. 1.1.
• task completion: the agent needs to accomplish user tasks ranging from restaurant reser-
vation to meeting scheduling (e.g., Turns 6 to 7 in Fig. 1.1), and to business trip planning.
• social chat: the agent needs to converse seamlessly and appropriately with users — like a
human as in the Turing test — and provide useful recommendations (e.g., Turns 1 to 2 in
Fig. 1.1).

One may envision that the above dialogue can be collectively accomplished by a set of agents, also
known as bots, each of which is designed for solving a particular type of task, e.g., QA bots, task-
completion bots, social chatbots. These bots can be grouped into two categories, task-oriented and
chitchat, depending on whether the dialogue is conducted to assist users to achieve specific tasks,
e.g., obtain an answer to a query or have a meeting scheduled.
Most of the popular personal assistants in today’s market, such as Amazon Alexa, Apple Siri, Google
Home, and Microsoft Cortana, are task-oriented bots. These can only handle relatively simple tasks,
such as reporting weather and requesting songs. An example of a chitchat dialogue bot is Microsoft
XiaoIce. Building a dialogue agent to fulfill complex tasks as in Fig. 1.1 remains one of the most
fundamental challenges for the IR and NLP communities, and AI in general.

Figure 1.1: A human-agent dialogue during the process of making a business decision. (usr: user,
agt: agent) The dialogue consists of multiple segments of different types. Turns 1 and 2 are a social
chat segment. Turns 3 to 5 are a QA segment. Turns 6 and 7 are a task-completion segment.

Figure 1.2: Two architectures of dialogue systems for (Top) traditional task-oriented dialogue and
(Bottom) fully data-driven dialogue.

A typical task-oriented dialogue agent is composed of four modules, as illustrated in Fig. 1.2 (Top):
(1) a Natural Language Understanding (NLU) module for identifying user intents and extracting
associated information; (2) a state tracker for tracking the dialogue state that captures all essential
information in the conversation so far; (3) a dialogue policy that selects the next action based on the
current state; and (4) a Natural Language Generation (NLG) module for converting agent actions to
natural language responses. In recent years there has been a trend towards developing fully data-
driven systems by unifying these modules using a deep neural network that maps the user input to
the agent output directly, as shown in Fig. 1.2 (Bottom).
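
The four-module pipeline described above can be sketched in a few lines of Python. This is only an illustrative toy, not the architecture of any system cited here: the `DialogueState` class, the keyword-based `nlu`, the rule-based `policy`, the template-based `nlg`, and the `book_restaurant` intent are all hypothetical stand-ins for the learned components a real agent would use.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DialogueState:
    """Hypothetical dialogue state: the user's intent plus filled slots."""
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def nlu(utterance: str) -> dict:
    """(1) NLU: keyword rules stand in for a learned intent/slot model."""
    text = utterance.lower()
    frame = {"intent": None, "slots": {}}
    if "book" in text or "reserve" in text:
        frame["intent"] = "book_restaurant"
    for city in ("seattle", "paris"):
        if city in text:
            frame["slots"]["city"] = city
    return frame


def track_state(state: DialogueState, frame: dict) -> DialogueState:
    """(2) State tracker: fold the new frame into the running state."""
    if frame["intent"] is not None:
        state.intent = frame["intent"]
    state.slots.update(frame["slots"])
    return state


def policy(state: DialogueState) -> dict:
    """(3) Dialogue policy: pick the next agent action from the state."""
    if state.intent is None:
        return {"act": "ask_intent"}
    if "city" not in state.slots:
        return {"act": "request", "slot": "city"}
    return {"act": "confirm", "city": state.slots["city"]}


def nlg(action: dict) -> str:
    """(4) NLG: templates stand in for a learned generator."""
    if action["act"] == "ask_intent":
        return "How can I help you?"
    if action["act"] == "request":
        return f"Which {action['slot']}?"
    return f"Booking a table in {action['city'].title()}."


def respond(utterance: str, state: DialogueState) -> Tuple[str, DialogueState]:
    """Run one turn through the four modules in order."""
    state = track_state(state, nlu(utterance))
    return nlg(policy(state)), state
```

Calling `respond("I want to book a table", DialogueState())` yields a request for the missing city slot; a fully data-driven system would replace the whole of `respond` with a single learned mapping from user input to agent output.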

Most task-oriented bots are implemented using a modular system, where the bot often has access
to an external database on which to inquire about information to accomplish the task (Young et al.,
2013; Tur and De Mori, 2011). Social chatbots, on the other hand, are often implemented using a
unitary (non-modular) system. Since the primary goal of social chatbots is to be AI companions to
humans with an emotional connection rather than completing specific tasks, they are often developed
to mimic human conversations by training DNN-based response generation models on large amounts
of human-human conversational data (Ritter et al., 2011; Sordoni et al., 2015b; Vinyals and Le, 2015;
Shang et al., 2015). Only recently have researchers begun to explore how to ground the chitchat in
world knowledge (Ghazvininejad et al., 2018) and images (Mostafazadeh et al., 2017) so as to make
the conversation more contentful and interesting.

1.3 A Unified View: Dialogue as Optimal Decision Making

The example dialogue in Fig. 1.1 can be formulated as a decision making process. It has a natural
hierarchy: a top-level process selects what agent to activate for a particular subtask (e.g., answering
a question, scheduling a meeting, providing a recommendation or just having a casual chat), and
a low-level process, controlled by the selected agent, chooses primitive actions to complete the
subtask.
Such hierarchical decision making processes can be cast in the mathematical framework of options
over Markov Decision Processes (MDPs) (Sutton et al., 1999b), where options generalize primitive
actions to higher-level actions. In a traditional MDP setting, an agent chooses a primitive action at
each time step. With options, the agent can choose a “multi-step” action which for example could
be a sequence of primitive actions for completing a subtask.
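
As a rough illustration of how an option behaves like a "multi-step" action, the sketch below runs an option's internal policy until its termination condition fires. It simplifies the options framework considerably: termination is deterministic, the initiation set is omitted, and the integer state, the `step` environment, and the `Option` class are assumptions made purely for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = int  # toy state space, assumed for illustration


@dataclass
class Option:
    """A temporally extended action: an internal policy plus a stop test."""
    name: str
    policy: Callable[[State], int]       # maps state -> primitive action
    terminates: Callable[[State], bool]  # True when the option should stop


def step(state: State, action: int) -> State:
    """Toy deterministic environment: the action is added to the state."""
    return state + action


def run_option(option: Option, state: State,
               max_steps: int = 100) -> Tuple[State, List[int]]:
    """Execute primitive actions until the option terminates."""
    trace: List[int] = []
    for _ in range(max_steps):
        if option.terminates(state):
            break
        action = option.policy(state)
        trace.append(action)
        state = step(state, action)
    return state, trace
```

From the point of view of a top-level process, `run_option` is a single decision, even though it may issue several primitive actions before returning control.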
If we view each option as an action, both top- and low-level processes can be naturally captured by
the reinforcement learning framework. The dialogue agent navigates in an MDP, interacting with its
environment over a sequence of discrete steps. At each step, the agent observes the current state, and
chooses an action according to a policy. The agent then receives a reward and observes a new state,
continuing the cycle until the episode terminates. The goal of dialogue learning is to find optimal
policies to maximize expected rewards. Table 1.1 formulates a sample of dialogue agents using
this unified view of RL, where the state-action spaces characterize the complexity of the problems,
and the rewards are the objective functions to be optimized.
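
The observe-act-reward cycle can be made concrete with a minimal sketch. The `SlotFillingEnv` environment, its "ask"/"chat" actions, and its reward values (a cost of 1 per turn, a bonus of 10 on task success) are invented for illustration and do not come from the systems surveyed here.

```python
from typing import Callable, Tuple


class SlotFillingEnv:
    """Toy episodic environment: the state is the number of slots filled."""

    def __init__(self, n_slots: int = 3, max_turns: int = 10) -> None:
        self.n_slots = n_slots
        self.max_turns = max_turns
        self.filled = 0
        self.turns = 0

    def reset(self) -> int:
        self.filled = 0
        self.turns = 0
        return self.filled

    def step(self, action: str) -> Tuple[int, float, bool]:
        """Apply one agent action; return (new_state, reward, done)."""
        self.turns += 1
        if action == "ask":  # asking fills one more slot; "chat" fills none
            self.filled += 1
        success = self.filled == self.n_slots
        done = success or self.turns >= self.max_turns
        reward = -1.0 + (10.0 if success else 0.0)  # per-turn cost + bonus
        return self.filled, reward, done


def run_episode(env: SlotFillingEnv, policy: Callable[[int], str]) -> float:
    """Roll out one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)                  # choose action from state
        state, reward, done = env.step(action)  # observe reward, new state
        total += reward
    return total
```

A policy that always asks finishes in three turns, while one that only chats runs until the turn limit and accumulates only the per-turn cost; a learning algorithm would search for the policy with the highest expected return.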
The unified view of hierarchical MDPs has already been applied to guide the development of some
large-scale open-domain dialogue systems. Recent examples include Sounding Board 3 , a social
chatbot that won the 2017 Amazon Alexa Prize, and Microsoft XiaoIce 4 , arguably the most popular
social chatbot that has attracted more than 660 million users worldwide since its release in 2014.
Both systems use a hierarchical dialogue manager: a master (top-level) that manages the overall
conversation process, and a collection of skills (low-level) that handle different types of conversation
segments (subtasks).
The reward functions in Table 1.1, which seem contradictory in CPS (e.g., we need to minimize
CPS for efficient task completion but maximize CPS for improving user engagement), suggest that
we have to balance the long-term and short-term gains when developing a dialogue system. For
example, XiaoIce is a social chatbot optimized for user engagement, but is also equipped with more
than 230 skills, most of which are QA and task-oriented. XiaoIce is optimized for expected CPS
which corresponds to long-term, rather than short-term, engagement. Although incorporating many
task-oriented and QA skills can reduce CPS in the short term since these skills help users accomplish
tasks more efficiently by minimizing CPS, these new skills establish XiaoIce as an efficient and
trustworthy personal assistant, thus strengthening the emotional bond with human users in the long
run.
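
To make the CPS metric concrete, the sketch below computes it from logged sessions. The log format (a list of speaker labels per session, using the "usr"/"agt" abbreviations of Fig. 1.1) and the convention that one conversation-turn is a user utterance directly followed by an agent utterance are assumptions made for this example; real systems may count turns differently.

```python
from statistics import mean
from typing import List


def count_turns(speakers: List[str]) -> int:
    """Count exchanges: a "usr" utterance directly followed by "agt"."""
    return sum(
        1 for a, b in zip(speakers, speakers[1:]) if a == "usr" and b == "agt"
    )


def cps(sessions: List[List[str]]) -> float:
    """Average number of conversation-turns per session (needs >= 1 session)."""
    return float(mean(count_turns(s) for s in sessions))
```

Under this convention a task-oriented skill that resolves a request in two exchanges lowers the per-session count, whereas a long social chat raises it, which is the tension discussed above.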
Although RL provides a unified ML framework for building dialogue agents, applying RL requires
training the agents by interacting with real users, which can be expensive in many domains. Hence,
in practice, we often use a hybrid approach that combines the strengths of different ML methods.
For example, we might use imitation and/or supervised learning methods (if there is a large amount
of human-human conversational corpus) to obtain a reasonably good agent before applying RL to
3 https://sounding-board.github.io/
4 https://www.msxiaobing.com/

Table 1.1: Reinforcement Learning for Dialogue. CPS stands for Conversation-turns Per Session,
and is defined as the average number of conversation-turns between the bot and the user in a conver-
sational session.
dialogue       | state                                   | action                              | reward
QA             | understanding of user query intent      | clarification questions or answers  | relevance of answer, (min) CPS
task-oriented  | understanding of user goal              | dialogue-act and slot/value         | task success rate, (min) CPS
chitchat       | conversation history and user intent    | responses                           | user engagement, measured in CPS
top-level bot  | understanding of user top-level intent  | options                             | user engagement, measured in CPS

Figure 1.3: Traditional NLP Component Stack. Figure credit: Bird et al. (2009).

continue improving it. In this paper, we survey these ML approaches and their use in training
dialogue systems.

1.4 The Transition of NLP to Neural Approaches

Neural approaches are now transforming the fields of NLP and IR, where symbolic approaches have
dominated for decades.
NLP applications differ from other data processing systems in their use of language knowledge of
various levels, including phonology, morphology, syntax, semantics and discourse (Jurafsky and
Martin, 2009). Historically, much of the NLP field has organized itself around the architecture
of Fig. 1.3, with researchers aligning their work with one component task, such as morphological
analysis or parsing. These tasks can be viewed as resolving (or realizing) natural language ambiguity
(or diversity) at different levels by mapping (or generating) a natural language sentence to (or from)
a series of human-defined, unambiguous, symbolic representations, such as Part-Of-Speech (POS)
tags, context-free grammars, and first-order predicate calculus. With the rise of data-driven and statistical
approaches, these components have remained and have been adapted as a rich source of engineered
features to be fed into a variety of machine learning models (Manning et al., 2014).
Neural approaches do not rely on any human-defined symbolic representations but learn in a task-
specific neural space where task-specific knowledge is implicitly represented as semantic concepts
using low-dimensional continuous vectors. As Fig. 1.4 illustrates, neural methods in NLP tasks (e.g.,
machine reading comprehension and dialogue) often consist of three steps: (1) encoding symbolic

Figure 1.4: Symbolic and Neural Computation.

user input and knowledge into their neural semantic representations, where semantically related or
similar concepts are represented as vectors that are close to each other; (2) reasoning in the neural
space to generate a system response based on input and system state; and (3) decoding the system
response into a natural language output in a symbolic space. Encoding, reasoning and decoding are
implemented using neural networks of different architectures, all of which may be stacked into a
deep neural network trained in an end-to-end fashion via backpropagation.
End-to-end training results in tighter coupling between the end application and the neural network
architecture, lessening the need for traditional NLP component boundaries like morphological anal-
ysis and parsing. This drastically flattens the technology stack of Fig. 1.3, and substantially reduces
the need for feature engineering. Instead, the focus has shifted to carefully tailoring the increasingly
complex architecture of neural networks to the end application.
Although neural approaches have already been widely adopted in many AI tasks, including image
processing, speech recognition and machine translation (e.g., Goodfellow et al., 2016), their impact
on conversational AI has come somewhat more slowly. Only recently have we begun to observe
neural approaches establish state-of-the-art results on an array of conversation benchmarks for both
component tasks and end applications and, in the process, sweep aside the traditional component-
based boundaries that have defined research areas for decades. This symbolic-to-neural shift is
also reshaping the conversational AI landscape by opening up new tasks and user experiences that
were not possible with older techniques. One reason for this is that neural approaches provide a
consistent representation for many modalities, capturing linguistic and non-linguistic (e.g., image
and video (Mostafazadeh et al., 2017)) features in the same modeling framework.
There is also work on hybrid methods that combine the strengths of both neural and symbolic
approaches (e.g., Mou et al., 2016; Liang et al., 2016). As summarized in Fig. 1.4, neural approaches
can be trained in an end-to-end fashion and are robust to paraphrase alternations, but are weak in
terms of execution efficiency and explicit interpretability. Symbolic approaches, on the other hand,
are difficult to train and sensitive to paraphrase alternations, but are more interpretable and efficient
in execution.

Chapter 2

Machine Learning Background

This chapter presents a brief review of the deep learning and reinforcement learning technologies
that are most relevant to conversational AI in later chapters.

2.1 Machine Learning Basics


Mitchell (1997) defines machine learning broadly to include any computer program that improves
its performance at some task T , measured by P , through experiences E.
Dialogue, as summarized in Table 1.1, is a well-defined learning problem with T , P , and E specified
as follows:

• T : perform conversations with a user to fulfill the user’s goal.


• P : cumulative reward defined in Table 1.1.
• E: a set of dialogues, each of which is a sequence of user-agent interactions.

As a simple example, a single-turn QA dialogue agent might improve its performance as measured
by accuracy or relevance of its generated answers at the QA task, through experiences of human-
labeled question-answer pairs.
A common recipe for building an ML agent using supervised learning (SL) consists of a dataset, a
model, a cost function (a.k.a. loss function) and an optimization procedure.

• The dataset consists of (x, y ∗ ) pairs, where for each input x, there is a ground-truth output
y ∗ . In QA, x consists of an input question and the documents from which an answer is
generated, and y ∗ is the desired answer provided by a knowledgeable external supervisor.
• The model is typically of the form y = f (x; θ), where f is a function (e.g., a neural
network) parameterized by θ that maps input x to output y.
• The cost function is of the form L(y ∗ , f (x; θ)). L(.) is often designed as a smooth function
of error, and is differentiable w.r.t. θ. A commonly used cost function that meets these
criteria is the mean squared error (MSE), defined as
$$ \frac{1}{M} \sum_{i=1}^{M} \left( y_i^* - f(x_i; \theta) \right)^2 . $$
• The optimization can be viewed as a search algorithm to identify the best θ that minimizes
L(.). Given that L is differentiable, the most widely used optimization procedure for deep
learning is mini-batch Stochastic Gradient Descent (SGD), which updates θ after each batch
as
$$ \theta \leftarrow \theta - \frac{\alpha}{N} \sum_{i=1}^{N} \nabla_\theta L(y_i^*, f(x_i; \theta)) , \tag{2.1} $$
where N is the batch size and α the learning rate.
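The four ingredients above can be put together in a few lines. The following is a minimal sketch, assuming a one-dimensional linear model and a synthetic, noise-free dataset; the names `predict`, `sgd_step`, `train` and `toy`, as well as the hyperparameter values, are illustrative choices and do not come from the survey.

```python
import random

def predict(x, theta):
    """Model f(x; theta) = theta[0] + theta[1] * x (an illustrative linear model)."""
    return theta[0] + theta[1] * x

def mse(data, theta):
    """Cost function: mean squared error over (x, y_star) pairs."""
    return sum((y - predict(x, theta)) ** 2 for x, y in data) / len(data)

def sgd_step(batch, theta, alpha):
    """One mini-batch update in the form of Eqn. 2.1 with squared-error loss."""
    n = len(batch)
    grad = [0.0, 0.0]
    for x, y in batch:
        err = predict(x, theta) - y      # derivative of (y - f)^2 w.r.t. f is 2 * (f - y)
        grad[0] += 2.0 * err             # df/dtheta[0] = 1
        grad[1] += 2.0 * err * x         # df/dtheta[1] = x
    return [theta[0] - alpha * grad[0] / n, theta[1] - alpha * grad[1] / n]

def train(data, epochs=200, batch_size=4, alpha=0.05, seed=0):
    """Optimization: shuffle the data each epoch and update theta after every batch."""
    rng = random.Random(seed)
    data = list(data)
    theta = [0.0, 0.0]
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            theta = sgd_step(data[i:i + batch_size], theta, alpha)
    return theta

# Dataset: (x, y_star) pairs generated from y = 1 + 2x, for illustration only.
toy = [(k / 10.0, 1.0 + 2.0 * k / 10.0) for k in range(-10, 11)]
theta = train(toy)
```

On this toy problem `theta` approaches `[1.0, 2.0]`; with a neural network in place of `predict`, only the gradient computation changes.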

Common Supervised Learning Metrics. Once a model is trained, it can be tested on a hold-out
dataset to have an estimate of its generalization performance. Suppose the model is f (·; θ), and the
hold-out set contains N data points: $D = \{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_N, y_N^*)\}$.
The first metric is the aforementioned mean squared error, which is appropriate for regression problems
(i.e., $y_i^*$ is real-valued):
$$ \mathrm{MSE}(f) := \frac{1}{N} \sum_{i=1}^{N} \left( y_i^* - f(x_i; \theta) \right)^2 . $$

For classification problems, yi∗ takes values from a finite set viewed as categories. For simplicity,
assume yi∗ ∈ {+1, −1} here, so that an example (xi , yi∗ ) is called positive (or negative) if yi∗ is +1
(or −1). The following metrics are often used:

• Accuracy: the fraction of examples for which f predicts correctly:


$$ \mathrm{ACCURACY}(f) := \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left( f(x_i; \theta) = y_i^* \right) , $$

where 1(E) is 1 if expression E is true and 0 otherwise.


• Precision: the fraction of correct predictions among examples that are predicted by f to be
positive:
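$$ \mathrm{PRECISION}(f) := \frac{\sum_{i=1}^{N} \mathbf{1}\left( f(x_i; \theta) = y_i^* \text{ AND } y_i^* = +1 \right)}{\sum_{i=1}^{N} \mathbf{1}\left( f(x_i; \theta) = +1 \right)} . $$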

• Recall: the fraction of positive examples that are correctly predicted by f :


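$$ \mathrm{RECALL}(f) := \frac{\sum_{i=1}^{N} \mathbf{1}\left( f(x_i; \theta) = y_i^* \text{ AND } y_i^* = +1 \right)}{\sum_{i=1}^{N} \mathbf{1}\left( y_i^* = +1 \right)} . $$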

• F1 Score: the harmonic mean of precision and recall:


$$ \mathrm{F1}(f) := \frac{2 \times \mathrm{PRECISION}(f) \times \mathrm{RECALL}(f)}{\mathrm{PRECISION}(f) + \mathrm{RECALL}(f)} . $$

Other metrics are also widely used, especially for complex tasks beyond binary classification, such
as the BLEU score (Papineni et al., 2002).
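The binary classification metrics above translate directly into code. The sketch below assumes labels and predictions are given as lists of +1 and −1; the function names, the toy data, and the convention of returning 0.0 when a denominator is zero are our own assumptions.

```python
def accuracy(preds, labels):
    """Fraction of examples predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def true_positives(preds, labels):
    """Number of positive examples that are predicted positive."""
    return sum(p == 1 and y == 1 for p, y in zip(preds, labels))

def precision(preds, labels):
    """Correct positive predictions divided by all positive predictions."""
    predicted_pos = sum(p == 1 for p in preds)
    return true_positives(preds, labels) / predicted_pos if predicted_pos else 0.0

def recall(preds, labels):
    """Correct positive predictions divided by all positive examples."""
    actual_pos = sum(y == 1 for y in labels)
    return true_positives(preds, labels) / actual_pos if actual_pos else 0.0

def f1(preds, labels):
    """Harmonic mean of precision and recall."""
    p, r = precision(preds, labels), recall(preds, labels)
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy hold-out set: 3 positives and 5 negatives (illustrative values only).
labels = [+1, +1, +1, -1, -1, -1, -1, -1]
preds = [+1, +1, -1, +1, -1, -1, -1, -1]
```

Here accuracy is 6/8, while precision, recall and F1 are all 2/3, which shows how accuracy can look better than the positive-class metrics when negatives dominate.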

Reinforcement Learning. The above SL recipe applies to prediction tasks on a fixed dataset.
However, in interactive problems such as dialogues,¹ it can be challenging to obtain examples of
desired behaviors that are both correct and representative of all the states in which the agent has
to act. In unexplored territories, the agent has to learn how to act by interacting with an unknown
environment on its own. This learning paradigm is known as reinforcement learning (RL), where
there is a feedback loop between the agent and the external environment. In other words, while SL
learns from previous experiences provided by a knowledgeable external supervisor, RL learns by
experiencing on its own. RL differs from SL in several important respects (Sutton and Barto, 2018;
Mitchell, 1997):

• Exploration-exploitation tradeoff. In RL, the agent needs to collect reward signals from
the environment. This raises the question of which experimentation strategy results in more
effective learning. The agent has to exploit what it already knows in order to obtain high
rewards, while having to explore unknown states and actions in order to make better action
selections in the future.
1 As shown in Table 1.1, dialogue learning is formulated as RL where the agent learns a policy π that in each
dialogue turn chooses an appropriate action a from the set A, based on dialogue state s, so as to achieve the
greatest cumulative reward.

• Delayed reward and temporal credit assignment. In RL, training information is not
available in the form of (x, y ∗ ) as in SL. Instead, the environment provides only delayed
rewards as the agent executes a sequence of actions. For example, we do not know whether
a dialogue succeeds in completing a task until the end of the session. The agent, therefore,
has to determine which of the actions in its sequence are to be credited with producing the
eventual reward, a problem known as temporal credit assignment.
• Partially observed states. In many RL problems, the observation perceived from the en-
vironment at each step, e.g., user input in each dialogue turn, provides only partial infor-
mation about the entire state of the environment based on which the agent selects the next
action. Neural approaches learn a deep neural network to represent the state by encoding
all information observed at the current and past steps, e.g., all the previous dialogue turns
and the retrieval results from external databases.

A central challenge in both SL and RL is generalization, the ability to perform well on unseen
inputs. Many learning theories and algorithms have been proposed to address the challenge with
some success by, e.g., seeking a good tradeoff between the amount of available training data and
the model capacity to avoid underfitting and overfitting. Compared to previous techniques, neural
approaches provide a potentially more effective solution by leveraging the representation learning
power of deep neural networks, as we will review in the next section.

2.2 Deep Learning


Deep learning (DL) involves training neural networks, which in their original form consisted of
a single layer (i.e., the perceptron) (Rosenblatt, 1957). The perceptron is incapable of learning
even simple functions such as the logical XOR, so subsequent work explored the use of “deep”
architectures, which added hidden layers between input and output (Rosenblatt, 1962; Minsky and
Papert, 1969), a form of neural network that is commonly called the multi-layer perceptron (MLP),
or deep neural networks (DNNs). This section introduces some commonly used DNNs for NLP and
IR. Interested readers are referred to Goodfellow et al. (2016) for a comprehensive discussion.

2.2.1 Foundations

Consider a text classification problem: labeling a text string (e.g., a document or a query) by a do-
main name such as “sport” and “politics”. As illustrated in Fig. 2.1 (Left), a classical ML algorithm
first maps a text string to a vector representation x using a set of hand-engineered features (e.g.,
word and character n-grams, entities, and phrases etc.), then learns a linear classifier with a softmax
layer to compute the distribution of the domain labels y = f (x; W), where W is a matrix learned
from training data using SGD to minimize the misclassification error. The design effort is focused
mainly on feature engineering.
Instead of using hand-designed features for x, DL approaches jointly optimize the feature repre-
sentation and classification using a DNN, as exemplified in Fig. 2.1 (Right). We see that the DNN
consists of two halves. The top half can be viewed as a linear classifier, similar to that in the clas-
sical ML model in Fig. 2.1 (Left), except that its input vector h is not based on hand-engineered
features but is learned using the bottom half of the DNN, which can be viewed as a feature gener-
ator optimized jointly with the classifier in an end-to-end fashion. Unlike classical ML, the effort
of designing a DL classifier is mainly on optimizing DNN architectures for effective representation
learning.
For NLP tasks, depending on the type of linguistic structures that we hope to capture in the text, we
may apply different types of neural network (NN) layer structures, such as convolutional layers for
local word dependencies and recurrent layers for global word sequences. These layers can be com-
bined and stacked to form a deep architecture to capture different semantic and context information
at different abstract levels. Several widely used NN layers are described below:

Word Embedding Layers. In a symbolic space each word is represented as a one-hot vector
whose dimensionality n is the size of a pre-defined vocabulary. The vocabulary is often large; e.g.,
n > 100K. We apply a word embedding model, which is parameterized by a linear projection
matrix $W_e \in \mathbb{R}^{n \times m}$, to map each one-hot vector to an m-dimensional real-valued vector (m ≪ n)

Figure 2.1: Flowcharts of classic machine learning (Left) and deep learning (Right). A convolutional
neural network is used as an example for deep learning.

in a neural space where the embedding vectors of the words that are more semantically similar are
closer to each other.

Fully Connected Layers. They perform linear projections as $W^\top x$.² We can stack multiple fully
connected layers to form a deep feed-forward NN (FFNN) by introducing a nonlinear activation
function g after each linear projection. If we view a text as a Bag-Of-Words (BOW) and let x be the
sum of the embedding vectors of all words in the text, a deep FFNN can extract highly nonlinear
features to represent hidden semantic topics of the text at different layers, e.g., $h^{(1)} = g(W^{(1)\top} x)$
at the first layer, and $h^{(2)} = g(W^{(2)\top} h^{(1)})$ at the second layer, and so on, where the W's are trainable
matrices.
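To make the BOW input and the stacked projections concrete, the following sketch computes x as the sum of word embeddings and then applies two fully connected layers. It assumes tanh as the activation g and uses tiny hand-picked embeddings and weights; `embeddings`, `W1` and `W2` are hypothetical values chosen only so the result can be checked by hand.

```python
import math

def matvec_t(W, x):
    """Compute W^T x, where W is stored as len(x) rows of k columns each."""
    k = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(k)]

def layer(W, x):
    """One fully connected layer: g(W^T x) with g = tanh (bias omitted)."""
    return [math.tanh(v) for v in matvec_t(W, x)]

def bow_vector(tokens, embeddings):
    """Sum the embedding vectors of all words in the text."""
    dim = len(next(iter(embeddings.values())))
    x = [0.0] * dim
    for tok in tokens:
        for d, val in enumerate(embeddings[tok]):
            x[d] += val
    return x

# Hypothetical 2-dimensional word embeddings and layer weights.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}
W1 = [[0.5, -0.5, 1.0],
      [1.0, 0.5, 0.0]]           # maps 2 inputs to 3 hidden units
W2 = [[1.0], [0.0], [-1.0]]      # maps 3 hidden units to 1 output unit

x = bow_vector(["good", "movie"], embeddings)
h1 = layer(W1, x)                # first-layer features
h2 = layer(W2, h1)               # second-layer features
```

Because x is a sum, the representation ignores word order, which is the limitation that the convolutional and recurrent layers address.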

Convolutional-Pooling Layers. An example of convolutional neural networks (CNNs) is shown
in Fig. 2.1 (Right). A convolutional layer forms a local feature vector, denoted $u_i$, of word $w_i$ in two
steps. It first generates a contextual vector $c_i$ by concatenating the word embedding vectors of $w_i$ and
its surrounding words defined by a fixed-length window. It then performs a projection to obtain
$u_i = g(W_c^\top c_i)$, where $W_c$ is a trainable matrix and g is an activation function. Then, a pooling layer
combines the outputs $u_i$, i = 1...L into a single global feature vector h. For example, in Fig. 2.1,
the max pooling operation is applied over each “time” i of the sequence of the vectors computed
by the convolutional layer to obtain h, where each element is computed as $h_j = \max_{1 \le i \le L} u_{i,j}$.
Another popular pooling function is average pooling.
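The two convolution steps and the max pooling step can be sketched as follows. This is an illustration under our own assumptions: an odd window size, zero vectors as padding at the sequence boundaries, and tanh as the activation g; the function name `conv_max_pool` and the toy inputs are hypothetical.

```python
import math

def conv_max_pool(embs, Wc, window=3):
    """Return the local features u_1..u_L and the max-pooled vector h.

    embs: list of L word embedding vectors, each of dimension m.
    Wc:   (window * m) rows by k columns, so that u_i = tanh(Wc^T c_i).
    """
    m = len(embs[0])
    pad = [[0.0] * m for _ in range(window // 2)]   # zero padding (an assumption)
    padded = pad + embs + pad
    k = len(Wc[0])
    us = []
    for i in range(len(embs)):
        # Step 1: contextual vector c_i, the concatenation of the window around word i.
        c = [v for vec in padded[i:i + window] for v in vec]
        # Step 2: projection followed by the activation.
        u = [math.tanh(sum(Wc[r][j] * c[r] for r in range(len(c)))) for j in range(k)]
        us.append(u)
    # Max pooling over positions: h_j = max_i u_{i,j}.
    h = [max(u[j] for u in us) for j in range(k)]
    return us, h

# Toy sequence of three 1-dimensional word embeddings.
embs = [[1.0], [2.0], [-1.0]]
us, h = conv_max_pool(embs, [[1.0], [1.0], [1.0]])
```

With the all-ones filter each u_i is the tanh of the window sum, so h picks out the position whose neighborhood sums highest. Replacing `max` with a mean would give average pooling.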

Recurrent Layers. An example of recurrent neural networks (RNNs) is shown in Fig. 2.2. RNNs
are commonly used for sentence embedding where we view a text as a sequence of words rather
than a BOW. They map the text to a dense and low-dimensional semantic vector by sequentially
and recurrently processing each word, and mapping the subsequence up to the current word into a
low-dimensional vector as $h_i = \mathrm{RNN}(x_i, h_{i-1}) := g(W_{ih}^\top x_i + W_r^\top h_{i-1})$, where $x_i$ is the word
embedding of the i-th word in the text, $W_{ih}$ and $W_r$ are trainable matrices, and $h_i$ is the semantic
representation of the word sequence up to the i-th word.
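The recurrence can be unrolled over a word sequence as in the sketch below. It assumes tanh as the activation g and a zero vector as the initial hidden state, neither of which is fixed by the text; `rnn_step` and `encode` are hypothetical helper names.

```python
import math

def rnn_step(x, h_prev, W_ih, W_r):
    """One recurrent step: h_i = tanh(W_ih^T x_i + W_r^T h_{i-1}).

    W_ih has len(x) rows and k columns; W_r has k rows and k columns.
    """
    k = len(h_prev)
    return [
        math.tanh(
            sum(W_ih[i][j] * x[i] for i in range(len(x)))
            + sum(W_r[i][j] * h_prev[i] for i in range(k))
        )
        for j in range(k)
    ]

def encode(xs, W_ih, W_r, hidden_size):
    """Process the word embeddings in order and return the final hidden state."""
    h = [0.0] * hidden_size          # initial state h_0 = 0 (an assumption)
    for x in xs:
        h = rnn_step(x, h, W_ih, W_r)
    return h

# Toy example with 1-dimensional embeddings and a single hidden unit.
W_ih = [[1.0]]
W_r = [[0.5]]
h_final = encode([[1.0], [0.0]], W_ih, W_r, hidden_size=1)
```

Unlike the BOW sum, the result depends on word order: feeding the same two embeddings in reverse order yields a different final state.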

2 We often omit the bias terms to simplify notation in this paper.

Figure 2.2: An example of recurrent neural networks.

Figure 2.3: The architecture of DSSM.

2.2.2 Two Examples

This section gives a brief description of two examples of DNN models, designed for the ranking and
text generation tasks, respectively. They are composed of the NN layers described in the last section.

DSSM for Ranking. In a ranking task, given an input query x, we want to rank all its candidate
answers y ∈ Y, based on a similarity scoring function sim(x, y). The task is fundamental to many
IR and NLP applications, such as query-document ranking, answer selection in QA, and dialogue
response selection.
DSSM stands for Deep Structured Semantic Models (Huang et al., 2013; Shen et al., 2014), or more
generally, Deep Semantic Similarity Model (Gao et al., 2014b). DSSM is a deep learning model for
measuring the semantic similarity of a pair of inputs (x, y) 3 . As illustrated in Fig. 2.3, a DSSM
consists of a pair of DNNs, f1 and f2 , which map inputs x and y into corresponding vectors in a
common low-dimensional semantic space. Then the similarity of x and y is measured by the cosine
similarity of the two vectors. f1 and f2 can be of different architectures depending on x and y. For
3 DSSM can be applied to a wide range of tasks depending on the definition of (x, y). For example, (x, y)
is a query-document pair for Web search ranking (Huang et al., 2013; Shen et al., 2014), a document pair
in recommendation (Gao et al., 2014b), a question-answer pair in QA (Yih et al., 2015a), a sentence pair of
different languages in machine translation (Gao et al., 2014a), and an image-text pair in image captioning (Fang
et al., 2015) and so on.

Figure 2.4: The architecture of seq2seq.

example, to compute the similarity of an image-text pair, f1 can be a deep convolutional NN and f2
an RNN.
Let θ be the parameters of f1 and f2 . θ is learned to identify the most effective feature representations
of x and y, optimized directly for end tasks. In other words, we learn a hidden semantic space,
parameterized by θ, where the semantics of distance between vectors in the space is defined by the
task or, more specifically, the training data of the task. For example, in Web document ranking, the
distance measures the query-document relevance, and θ is optimized using a pair-wise rank loss.
Consider a query x and two candidate documents $y^+$ and $y^-$, where $y^+$ is more relevant than $y^-$ to
x. Let $\mathrm{sim}_\theta(x, y)$ be the similarity of x and y in the semantic space parameterized by θ as
$$ \mathrm{sim}_\theta(x, y) = \cos(f_1(x), f_2(y)) . $$
We want to maximize $\Delta = \mathrm{sim}_\theta(x, y^+) - \mathrm{sim}_\theta(x, y^-)$. We do so by minimizing a smooth loss
function
$$ L(\Delta; \theta) = \log\left( 1 + \exp(-\gamma \Delta) \right) , \tag{2.2} $$
where γ is a scaling factor, using SGD of Eqn. 2.1.
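The pair-wise rank loss of Eqn. 2.2 is easy to evaluate once the two networks have produced their vectors. The sketch below takes those vectors as given inputs rather than computing them with f1 and f2; the value γ = 10 and the names `cosine` and `pairwise_rank_loss` are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the semantic space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_rank_loss(q_vec, pos_vec, neg_vec, gamma=10.0):
    """Loss of Eqn. 2.2: log(1 + exp(-gamma * delta)).

    delta is the similarity margin between the more relevant document
    (pos_vec) and the less relevant one (neg_vec) for query q_vec.
    """
    delta = cosine(q_vec, pos_vec) - cosine(q_vec, neg_vec)
    return math.log1p(math.exp(-gamma * delta))

# Hypothetical semantic vectors, standing in for f1(x), f2(y+), f2(y-).
q = [1.0, 0.0]
doc_pos = [1.0, 0.0]
doc_neg = [0.0, 1.0]

loss_correct = pairwise_rank_loss(q, doc_pos, doc_neg)   # margin delta = +1
loss_swapped = pairwise_rank_loss(q, doc_neg, doc_pos)   # margin delta = -1
```

The loss is close to zero when the relevant document is ranked well above the irrelevant one and grows roughly linearly in γ|∆| when the order is wrong, which is what drives the gradient in the right direction.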

Seq2Seq for Text Generation. In a text generation task, given an input text x, we want to generate
an output text y. This task is fundamental to applications such as machine translation and dialogue
response generation.
Seq2seq stands for the sequence-to-sequence architecture (Sutskever et al., 2014), which is also
known as the encoder-decoder architecture (Cho et al., 2014b). Seq2Seq is typically implemented
based on sequence models such as RNNs or gated RNNs. Gated RNNs, such as Long Short-Term
Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the networks based on Gated Recurrent
Unit (GRU) (Cho et al., 2014b), are the extensions of RNN in Fig. 2.2, and are often more effective
in capturing long-term dependencies due to the use of gated cells that have paths through time that
have derivatives neither vanishing nor exploding. We will illustrate in detail how LSTM is applied
to end-to-end conversation models in Sec. 5.1.
Seq2seq defines the probability of generating y conditioned on x as P(y|x).⁴ As illustrated in
Fig. 2.4, a seq2seq model consists of (1) an input RNN or encoder f1 that encodes input sequence x
into context vector c, usually as a simple function of its final hidden state; and (2) an output RNN or
decoder f2 that generates output sequence y conditioned on c. x and y can be of different lengths.
The two RNNs, parameterized by θ, are trained jointly to minimize the loss function over all the
pairs of (x, y) in training data
$$ L(\theta) = \frac{1}{M} \sum_{i=1}^{M} -\log P_\theta(y_i | x_i) . \tag{2.3} $$
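Reading Eqn. 2.3 as the average negative log-likelihood of the training pairs, the loss can be computed from the probabilities the decoder assigns to the reference tokens. The sketch below assumes the usual factorization of P(y|x) into a product of per-token probabilities, which the text does not spell out; the probability values are made up for illustration.

```python
import math

def sequence_nll(token_probs):
    """-log P(y|x), assuming P(y|x) is the product of the per-token
    probabilities that the decoder assigns to the reference tokens."""
    return -sum(math.log(p) for p in token_probs)

def corpus_loss(batch):
    """Average negative log-likelihood over M training pairs."""
    return sum(sequence_nll(token_probs) for token_probs in batch) / len(batch)

# Hypothetical decoder outputs for M = 2 training pairs:
# the first reference has two tokens, the second has one.
batch = [[0.5, 0.5], [1.0]]
loss = corpus_loss(batch)
```

Here the first pair contributes 2 log 2 and the second contributes 0, so the loss equals log 2. Working with sums of logs rather than products of probabilities also avoids numerical underflow on long sequences.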

2.3 Reinforcement Learning


This section reviews reinforcement learning to facilitate discussions in later chapters. For a com-
prehensive treatment of this topic, interested readers are referred to existing textbooks and reviews,
such as Sutton and Barto (2018); Kaelbling et al. (1996); Bertsekas and Tsitsiklis (1996); Szepesvári
(2010); Wiering and van Otterlo (2012); Li (2018).
4 Similar to DSSM, seq2seq can be applied to a variety of generation tasks depending on the definition of
(x, y). For example, (x, y) is a sentence pair of different languages in machine translation (Sutskever et al.,
2014; Cho et al., 2014b), an image-text pair in image captioning (Vinyals et al., 2015b) (where f1 is a CNN),
and a message-response pair in dialogue (Vinyals and Le, 2015; Li et al., 2016a).

Figure 2.5: Interaction between an RL agent and the external environment.

2.3.1 Foundations

Reinforcement learning (RL) is a learning paradigm where an intelligent agent learns to make op-
timal decisions by interacting with an initially unknown environment (Sutton and Barto, 2018).
Compared to supervised learning, a distinctive challenge in RL is to learn without a teacher (that
is, without supervisory labels). As we will see, this will lead to algorithmic considerations that are
often unique to RL.
As illustrated in Fig. 2.5, the agent-environment interaction is often modeled as a discrete-time
Markov decision process, or MDP (Puterman, 1994), described by a five-tuple M = ⟨S, A, P, R, γ⟩:
• S is a possibly infinite set of states the environment can be in;
• A is a possibly infinite set of actions the agent can take in a state;
• P(s′|s, a) gives the transition probability of the environment landing in a new state s′ after
action a is taken in state s;
• R(s, a) is the average reward immediately received by the agent after taking action a in
state s; and
• γ ∈ (0, 1] is a discount factor.
The interaction can be recorded as a trajectory $(s_1, a_1, r_1, \ldots)$, generated as follows: at step t =
1, 2, . . .,
• the agent observes the environment’s current state $s_t \in S$, and takes an action $a_t \in A$;
• the environment transitions to a next state $s_{t+1}$, distributed according to the transition
probabilities $P(\cdot | s_t, a_t)$;
• associated with the transition is an immediate reward $r_t \in \mathbb{R}$, whose average is $R(s_t, a_t)$.
Omitting the subscript, each step results in a tuple (s, a, r, s′) that is called a transition. The goal
of an RL agent is to maximize the long-term reward by taking optimal actions (to be defined soon).
Its action-selection policy, denoted by π, can be deterministic or stochastic. In either case, we use
a ∼ π(s) to denote selection of an action by following π in state s. Given a policy π, the value of a
state s is the average discounted long-term reward from that state:
$$ V^\pi(s) := \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_1 = s, a_i \sim \pi(s_i), \forall i \ge 1 \right] . $$
We are interested in optimizing the policy so that $V^\pi$ is maximized for all states. Denote by $\pi^*$ an
optimal policy, and $V^*$ its corresponding value function (also known as the optimal value function).
In many cases, it is more convenient to use another form of value function called the Q-function:
$$ Q^\pi(s, a) := \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_1 = s, a_1 = a, a_i \sim \pi(s_i), \forall i > 1 \right] , $$
which measures the average discounted long-term reward by first selecting a in state s and then
following policy π thereafter. The optimal Q-function, corresponding to an optimal policy, is denoted
by $Q^*$.
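The quantity inside the expectation in both value functions is the discounted return of one trajectory. As a small sketch, the code below computes that return from a list of rewards and averages it over sampled trajectories that start in the same state, which gives a simple Monte Carlo estimate of the state value; the function names and reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):      # accumulate from the last reward backwards
        g = r + gamma * g
    return g

def monte_carlo_value(reward_sequences, gamma):
    """Estimate the value of a state by averaging the discounted returns of
    trajectories that all start in that state and follow the same policy."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# A reward of 1 received at the third step is worth gamma^2 at the start.
delayed = discounted_return([0.0, 0.0, 1.0], gamma=0.9)

# Two hypothetical trajectories from the same start state.
v_hat = monte_carlo_value([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], gamma=0.5)
```

With γ < 1 a reward obtained later counts for less, which is why the delayed reward above is worth 0.81 rather than 1 at the first step.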

2.3.2 Basic Algorithms

We now describe two popular classes of algorithms, exemplified by Q-learning and policy gradient,
respectively.

Q-learning. The first family is based on the observation that an optimal policy can be immediately
retrieved if the optimal Q-function is available. Specifically, the optimal policy can be determined
by
$$ \pi^*(s) = \arg\max_a Q^*(s, a) . $$

Therefore, a large family of RL algorithms focuses on learning Q∗ (s, a), and are collectively called
value function-based methods.
In practice, it is expensive to represent Q(s, a) by a table, one entry for each distinct (s, a), when
the problem at hand is large. For instance, the number of states in the game of Go is larger than
2 × 10^170 (Tromp and Farnebäck, 2006). Hence, we often use compact forms to represent Q. In
particular, we assume the Q-function has a predefined parametric form, parameterized by some
vector $\theta \in \mathbb{R}^d$. An example is linear approximation:

$$ Q(s, a; \theta) = \phi(s, a)^\top \theta , $$

where φ(s, a) is a d-dimensional hand-coded feature vector for state-action pair (s, a), and θ is the
corresponding coefficient vector to be learned from data. In general, Q(s, a; θ) may take different
parametric forms. For example, in the case of Deep Q-Network (DQN), Q(s, a; θ) takes the form of
deep neural networks, such as multi-layer perceptrons and convolutional networks (Tesauro, 1995;
Mnih et al., 2015), recurrent networks (Hausknecht and Stone, 2015; Li et al., 2015), etc. More
examples will be seen in later chapters. Furthermore, it is possible to represent the Q-function in
a non-parametric way, using decision trees (Ernst et al., 2005) or Gaussian processes (Engel et al.,
2005), which is outside of the scope of this introductory section.
To learn the Q-function, we modify the parameter θ using the following update rule, after observing
a state transition (s, a, r, s′):
$$ \theta \leftarrow \theta + \alpha \underbrace{\left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)}_{\text{“temporal difference”}} \nabla_\theta Q(s, a; \theta) . \tag{2.4} $$

The above update is known as Q-learning (Watkins, 1989), which applies a small change to θ,
controlled by the step-size parameter α and computed from the temporal difference (Sutton, 1988).
While popular, in practice, Q-learning can be quite unstable and requires many samples before
reaching a good approximation of Q∗ . Two modifications are often helpful. The first is experience
replay (Lin, 1992), popularized by Mnih et al. (2015). Instead of using an observed transition to
update θ just once using Eqn. 2.4, one may store it in a replay buffer, and periodically sample
transitions from it to perform Q-learning updates. This way, every transition can be used multiple
times, thus increasing sample efficiency. Furthermore, it helps stabilize learning by preventing the
data distribution from changing too quickly over time when updating parameter θ.
The second is a two-network implementation (Mnih et al., 2015), an instance of the more general
fitted value iteration algorithm (Munos and Szepesvári, 2008). Here, the learner maintains an extra
copy of the Q-function, called the target network, parameterized by $\theta_{\text{target}}$. During learning, $\theta_{\text{target}}$
is fixed and is used to compute the temporal difference to update θ. Specifically, Eqn. 2.4 now becomes:
$$ \theta \leftarrow \theta + \alpha \underbrace{\left( r + \gamma \max_{a'} Q(s', a'; \theta_{\text{target}}) - Q(s, a; \theta) \right)}_{\text{temporal difference with a target network}} \nabla_\theta Q(s, a; \theta) . \tag{2.5} $$

Periodically, $\theta_{\text{target}}$ is updated to be θ, and the process continues.
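The update with a replay buffer and a target network can be sketched on a toy problem. Everything specific below is an assumption made for illustration: a three-state deterministic chain whose last state is terminal, one-hot features (so the linear Q-function is effectively a table), a buffer that simply holds every transition of the chain, and the convention that no bootstrapping is done from a terminal state.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS, TERMINAL = 3, 2, 2

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reaching
    the terminal state yields reward 1, every other move yields 0."""
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == TERMINAL else 0.0), s_next

def phi(s, a):
    """One-hot feature vector, so that Q(s, a; theta) = phi(s, a)^T theta."""
    v = [0.0] * (N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v

def q_value(s, a, theta):
    return sum(f * w for f, w in zip(phi(s, a), theta))

def train(steps=5000, alpha=0.5, sync_every=50, seed=0):
    rng = random.Random(seed)
    # Replay buffer: here it holds each (s, a, r, s') of the toy chain once.
    buffer = []
    for s in range(TERMINAL):
        for a in range(N_ACTIONS):
            r, s_next = step(s, a)
            buffer.append((s, a, r, s_next))
    theta = [0.0] * (N_STATES * N_ACTIONS)
    theta_target = list(theta)
    for t in range(1, steps + 1):
        s, a, r, s_next = rng.choice(buffer)          # sample a stored transition
        if s_next == TERMINAL:
            target = r                                # no bootstrapping at the end
        else:
            target = r + GAMMA * max(
                q_value(s_next, b, theta_target) for b in range(N_ACTIONS))
        td = target - q_value(s, a, theta)            # temporal difference (Eqn. 2.5)
        grad = phi(s, a)                              # gradient of a linear Q
        theta = [w + alpha * td * g for w, g in zip(theta, grad)]
        if t % sync_every == 0:
            theta_target = list(theta)                # periodic target update
    return theta

theta = train()
greedy = {s: max(range(N_ACTIONS), key=lambda a: q_value(s, a, theta))
          for s in range(TERMINAL)}
```

On this chain the learned values approach Q(1, right) = 1 and Q(0, right) = 0.9, and the greedy policy moves right in both non-terminal states. In a DQN the one-hot `phi` and the explicit gradient would be replaced by a neural network and automatic differentiation.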


There have been a number of recent improvements to the basic Q-learning described above, such as
dueling Q-network (Wang et al., 2016), double Q-learning (van Hasselt et al., 2016), and a provably
convergent SBEED algorithm (Dai et al., 2018b).

Policy Gradient. The other family of algorithms tries to optimize the policy directly, without
having to learn the Q-function. Here, the policy itself is directly parameterized by $\theta \in \mathbb{R}^d$, and
π(s; θ) is often a distribution over actions. Given any θ, the policy is naturally evaluated by the
average long-term reward it gets in a trajectory of length H, $\tau = (s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$:⁵
$$ J(\theta) := \mathbb{E}\left[ \sum_{t=1}^{H} \gamma^{t-1} r_t \,\middle|\, a_t \sim \pi(s_t; \theta) \right] . $$

If it is possible to estimate the gradient $\nabla_\theta J$ from sampled trajectories, one can do stochastic gradient
ascent⁶ to maximize J:
$$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) , \tag{2.6} $$
where α is again a step-size parameter.
One such algorithm, known as REINFORCE (Williams, 1992), estimates the gradient as follows.
Let τ be a length-H trajectory generated by π(·; θ); that is, $a_t \sim \pi(s_t; \theta)$ for every t. Then, a
stochastic gradient based on this single trajectory is given by
$$ \nabla_\theta J(\theta) = \sum_{t=1}^{H-1} \gamma^{t-1} \left( \nabla_\theta \log \pi(a_t | s_t; \theta) \sum_{h=t}^{H} \gamma^{h-t} r_h \right) . \tag{2.7} $$

REINFORCE may suffer high variance in practice, as its gradient estimate depends directly on the
sum of rewards along the entire trajectory. Its variance may be reduced by the use of an estimated
value function of the current policy, often referred to as the critic in actor-critic algorithms (Sutton
et al., 1999a; Konda and Tsitsiklis, 1999):
$$ \nabla_\theta J(\theta) = \sum_{t=1}^{H-1} \gamma^{t-1} \left( \nabla_\theta \log \pi(a_t | s_t; \theta) \, \hat{Q}(s_t, a_t, h) \right) , \tag{2.8} $$

where $\hat{Q}(s, a, h)$ is an estimated value function for the current policy π(s; θ) that is used to approximate
$\sum_{h=t}^{H} \gamma^{h-t} r_h$ in Eqn. 2.7. $\hat{Q}(s, a, h)$ may be learned by standard temporal difference
methods (similar to Q-learning), but many variants exist. Moreover, there has been much work on
methods to compute the gradient ∇θ J more effectively than Eqn. 2.8. Interested readers can refer to
a few related works and the references therein for further details (Kakade, 2001; Peters et al., 2005;
Schulman et al., 2015a,b; Mnih et al., 2016; Gu et al., 2017; Dai et al., 2018a; Liu et al., 2018b).
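As a rough illustration of the single-trajectory estimate, the sketch below computes a REINFORCE-style gradient for a softmax policy with one parameter per action. Several things here are our own assumptions: the policy ignores the state (a single-state problem), the sum runs over every step of the trajectory (the common textbook form), and the function names and numbers are made up.

```python
import math

def softmax(theta):
    """Action probabilities of a softmax policy with one logit per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(theta, actions, rewards, gamma):
    """Single-trajectory gradient estimate: each step contributes
    gamma^(t-1) * grad log pi(a_t) * (discounted reward from step t onward)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    reward_to_go = 0.0
    # Walk backwards so that reward_to_go = sum over h >= t of gamma^(h-t) * r_h.
    for t in range(len(actions) - 1, -1, -1):
        reward_to_go = rewards[t] + gamma * reward_to_go
        weight = (gamma ** t) * reward_to_go      # t is 0-based, so gamma^t = gamma^(t-1) in 1-based terms
        for a in range(len(theta)):
            indicator = 1.0 if a == actions[t] else 0.0
            # For a softmax policy, d log pi(a_t) / d theta_a = 1[a == a_t] - pi(a).
            grad[a] += weight * (indicator - probs[a])
    return grad

# One step, uniform policy: taking action 0 and receiving reward 1
# pushes probability toward action 0.
g_one = reinforce_gradient([0.0, 0.0], actions=[0], rewards=[1.0], gamma=1.0)

# Gradient ascent step as in Eqn. 2.6 with step size 0.1.
theta_new = [t + 0.1 * g for t, g in zip([0.0, 0.0], g_one)]
```

Because the weight on each step is a sum of sampled rewards, estimates from different trajectories can differ a lot, which is the variance problem that the critic in Eqn. 2.8 is meant to reduce.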

2.3.3 Exploration

So far we have described basic algorithms for updating either the value function or the policy, when
transitions are given as input. Typically, an RL agent also has to determine how to select actions
to collect desired transitions for learning. Always selecting the action that seems best
(“exploitation”) is problematic: unless the agent sometimes selects a novel action, that is, one that is
underrepresented or even absent in the data collected so far (“exploration”), it risks never seeing
outcomes that are potentially better. Balancing exploration and exploitation efficiently is one of the
unique challenges in reinforcement learning.
A basic exploration strategy is known as ε-greedy. The idea is to choose the action that looks best
with high probability (for exploitation), and a random action with small probability (for exploration).
In the case of DQN, suppose θ is the current parameter of the Q-function, then the action-selection
rule for state $s_t$ is given as follows:
$$ a_t = \begin{cases} \arg\max_a Q(s_t, a; \theta) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon . \end{cases} $$
In many problems this simple approach is effective (although not necessarily optimal). A further
discussion is found in Sec. 4.4.2.
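The ε-greedy rule takes only a few lines. In the sketch below the Q-values for the current state are passed in as a plain list, and the random action is drawn uniformly from all actions (including the greedy one); both choices, along with the function name and the numbers, are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon,
    otherwise the action with the highest estimated Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Hypothetical Q-value estimates for three actions in the current state.
q_values = [0.1, 0.9, 0.3]
rng = random.Random(0)
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10000)]
greedy_fraction = picks.count(1) / len(picks)
```

With ε = 0.1 and three actions, the greedy action is selected with probability 0.9 + 0.1/3, so `greedy_fraction` comes out a little above 0.9. In practice ε is often decreased over time so that the agent explores more early on.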

5 We describe policy gradient in the simpler bounded-length trajectory case, although it can be extended to
problems when the trajectory length is unbounded (Baxter and Bartlett, 2001; Baxter et al., 2001).
6 Stochastic gradient ascent is simply stochastic gradient descent on the negated objective function.

Chapter 3

Question Answering and Machine Reading Comprehension

Recent years have witnessed an increasing demand for conversational Question Answering (QA)
agents that allow users to query a large-scale Knowledge Base (KB) or a document collection in
natural language. The former are known as KB-QA agents and the latter as text-QA agents. KB-QA
agents are more flexible and user-friendly than traditional SQL-like systems in that users can query
a KB interactively without composing complicated SQL-like queries. Text-QA agents are much
easier to use on mobile devices than traditional search engines, such as Bing and Google, in that they
provide concise, direct answers to user queries, as opposed to a ranked list of relevant documents.
It is worth noting that multi-turn, conversational QA is an emerging research topic, and is not as
well-studied as single-turn QA. Many papers reviewed in this chapter are focused on the latter.
However, single-turn QA is an indispensable building block for all sorts of dialogues (e.g., chitchat
and task-oriented), deserving our full attention if we are to develop real-world dialogue systems.
In this chapter, we start with a review of KB and symbolic approaches to KB-QA based on semantic
parsing. We show that a symbolic system is hard to scale because the keyword-matching-based,
query-to-answer inference used by the system is inefficient for a very large KB, and is not robust
to paraphrasing. To address these issues, neural approaches are developed to represent queries and
KB using continuous semantic vectors so that the inference can be performed at the semantic level
in a compact neural space. We also describe the typical architecture of multi-turn, conversational
KB-QA agents, using a movie-on-demand agent as an example, and review several conversational
KB-QA datasets developed recently.
We then discuss neural text-QA agents. The heart of these systems is a neural Machine Reading
Comprehension (MRC) model that generates an answer to an input question based on a (set of) pas-
sage(s). After reviewing popular MRC datasets and TREC text-QA open benchmarks, we describe
the technologies developed for state-of-the-art MRC models along two dimensions: (1) the methods
of encoding questions and passages as vectors in a neural space, and (2) the methods of performing
reasoning in the neural space to generate the answer. We also describe the architecture of multi-turn,
conversational text-QA agents, and the way MRC tasks and models are extended to conversational
QA.

3.1 Knowledge Base

Large-scale Knowledge Bases (KBs) such as DBPedia (Auer et al., 2007), Freebase (Bollacker et al.,
2008) and Yago (Suchanek et al., 2007), which organize the world’s facts and store them in structured
databases, have become important resources for supporting open-domain QA.
A typical KB consists of a collection of subject-predicate-object triples (s, r, t) where s, t ∈ E are
entities and r ∈ R is a predicate or relation. A KB in this form is often called a Knowledge Graph

Figure 3.1: An example of semantic parsing for KB-QA. (Left) A subgraph of Freebase related to
the TV show Family Guy. (Right) A question, its logical form in λ-calculus and query graph, and
the answer. Figures adapted from Yih et al. (2015a).

(KG) due to its graphical representation, i.e., the entities are nodes and the relations the directed
edges that link the nodes.
Fig. 3.1 (Left) shows a small subgraph of Freebase related to the TV show Family Guy. Nodes
include some names, dates and special Compound Value Type (CVT) entities.¹ A directed edge
describes the relation between two entities, labeled by a predicate.

3.2 Semantic Parsing for KB-QA

Most state-of-the-art symbolic approaches to KB-QA are based on semantic parsing, where a ques-
tion is mapped to its formal meaning representation (e.g., logical form) and then translated to a KB
query. The answers to the question can then be obtained by finding a set of paths in the KB that
match the query and retrieving the end nodes of these paths (Richardson et al., 1998; Berant et al.,
2013; Yao and Van Durme, 2014; Bao et al., 2014; Yih et al., 2015b).
We take the example used in Yih et al. (2015a) to illustrate the QA process. Fig. 3.1 (Right) shows
the logical form in λ-calculus and its equivalent graph representation, known as query graph, of
the question “Who first voiced Meg on Family Guy?”. Note that the query graph is grounded in
Freebase. The two entities, MegGriffin and FamilyGuy, are represented by two rounded rectangle
nodes. The circle node y means that there should exist an entity describing some casting relations
like the character, actor and the time she started the role. y is grounded in a CVT entity in this case.
The shaded circle node x is also called the answer node, and is used to map entities retrieved by
the query. The diamond node arg min constrains the answer to be the earliest actor for
this role. Running the query graph without the aggregation function against the graph as in Fig. 3.1
(Left) will match both LaceyChabert and MilaKunis. But only LaceyChabert is the correct
answer as she started this role earlier (by checking the from property of the grounded CVT node).
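
As a rough sketch of how the query graph behaves, the snippet below stores the two CVT (casting) nodes as dictionaries and runs the query with and without the arg min aggregation. The field names, the helper `run_query`, and especially the `from` values are assumptions: the `from` values are illustrative ordinals that only encode "LaceyChabert started earlier", not the actual Freebase dates.

```python
# Two CVT nodes describing casting relations; "from" holds illustrative
# ordinals (1 = earlier, 2 = later), not real dates.
cvt_nodes = [
    {"series": "FamilyGuy", "character": "MegGriffin",
     "actor": "LaceyChabert", "from": 1},
    {"series": "FamilyGuy", "character": "MegGriffin",
     "actor": "MilaKunis", "from": 2},
]

def run_query(nodes, series, character, earliest_only):
    """Match CVT nodes for (series, character); optionally apply arg min on 'from'."""
    matches = [n for n in nodes
               if n["series"] == series and n["character"] == character]
    if earliest_only and matches:
        matches = [min(matches, key=lambda n: n["from"])]
    return [n["actor"] for n in matches]

without_aggregation = run_query(cvt_nodes, "FamilyGuy", "MegGriffin", False)
with_aggregation = run_query(cvt_nodes, "FamilyGuy", "MegGriffin", True)
```

Without the aggregation both actors match; with it only the actor of the earliest CVT node is returned.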
Applying a symbolic KB-QA system to a very large KB is challenging for two reasons:

• Paraphrasing in natural language: This leads to a wide variety of semantically equivalent
ways of stating the same question, and in the KB-QA setting, this may cause mismatches
between the natural language questions and the label names (e.g., predicates) of the nodes
and edges used in the KB. As in the example of Fig. 3.1, we need to measure how likely
the predicate used in the question matches that in Freebase, such as “Who first voiced Meg
on Family Guy?” vs. cast-actor. Yih et al. (2015a) proposed to use a learned DSSM
described in Sec. 2.2.2, which is conceptually an embedding-based method we will review
in Sec. 3.3.
• Search complexity: Searching all possible multi-step (compositional) relation paths that
match complex queries is prohibitively expensive because the number of candidate paths
grows exponentially with the path length. We will review symbolic and neural approaches
to multi-step reasoning in Sec. 3.4.

¹ CVT is not a real-world entity, but is used to collect multiple fields of an event or a special relationship.

3.3 Embedding-based Methods


To address the paraphrasing problem, embedding-based methods map entities and relations in a KB
to continuous vectors in a neural space; see, e.g., Bordes et al. (2013); Socher et al. (2013); Yang
et al. (2015); Yih et al. (2015b). This space can be viewed as a hidden semantic space where various
expressions with the same semantic meaning map to the same continuous vector.
Most KB embedding models are developed for the Knowledge Base Completion (KBC) task: pre-
dicting the existence of a triple (s, r, t) that is not seen in the KB. This is a simpler task than KB-QA
since it only needs to predict whether a fact is true or not, and thus does not suffer from the search
complexity problem.
The bilinear model is one of the basic KB embedding models (Yang et al., 2015; Nguyen, 2017). It
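learns a vector $\mathbf{x}_e \in \mathbb{R}^d$ for each entity $e \in \mathcal{E}$ and a matrix $\mathbf{W}_r \in \mathbb{R}^{d \times d}$ for each relation $r \in \mathcal{R}$.
The model scores how likely a triple (s, r, t) holds using

$$\text{score}(s, r, t; \theta) = \mathbf{x}_s^\top \mathbf{W}_r \mathbf{x}_t. \tag{3.1}$$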
The model parameters θ (i.e., the embedding vectors and matrices) are trained on pair-wise training
samples in a similar way to that of the DSSM described in Sec. 2.2.2. For each positive triple
(s, r, t) in the KB, denoted by x+ , we construct a set of negative triples x− by corrupting s, t, or
r. The training objective is to minimize the pair-wise rank loss of Eqn. 2.2, or more commonly the
margin-based loss defined as

$$L(\theta) = \sum_{(x^+, x^-) \in \mathcal{D}} \left[ \gamma + \text{score}(x^-; \theta) - \text{score}(x^+; \theta) \right]_+ ,$$

where [x]+ := max(0, x), γ is the margin hyperparameter, and D the training set of triples.
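
A small numeric sketch may help make the bilinear score of Eqn. 3.1 and the margin-based loss concrete. It uses plain Python lists instead of a tensor library, and the embeddings and the matrix `W_r` are made-up toy values, not learned parameters.

```python
def bilinear_score(x_s, W_r, x_t):
    """Compute x_s^T W_r x_t for list-based vectors and a list-of-rows matrix."""
    return sum(x_s[i] * W_r[i][j] * x_t[j]
               for i in range(len(x_s)) for j in range(len(x_t)))

def margin_loss(score_pairs, gamma=1.0):
    """Sum of [gamma + score(x-) - score(x+)]_+ over (positive, negative) score pairs."""
    return sum(max(0.0, gamma + neg - pos) for pos, neg in score_pairs)

# Toy 2-d embeddings (assumed values): the positive triple uses tail x_t,
# the negative triple is built by corrupting the tail entity.
x_s, x_t, x_corrupt = [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]
W_r = [[0.0, 2.0],
       [0.0, 0.0]]

pos = bilinear_score(x_s, W_r, x_t)
neg = bilinear_score(x_s, W_r, x_corrupt)
loss_small_margin = margin_loss([(pos, neg)], gamma=1.0)
loss_large_margin = margin_loss([(pos, neg)], gamma=3.0)
```

The loss is zero once the positive triple outscores the corrupted one by at least the margin γ, and grows linearly otherwise.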
These basic KB models have been extended to answer multi-step relation queries, also known as path
queries, e.g., “Where did Tad Lincoln’s parents live?” (Toutanova et al., 2016; Guu et al., 2015;
Neelakantan et al., 2015). A path query consists of an initial anchor entity s (e.g., TadLincoln),
followed by a sequence of relations to be traversed (r1 , ..., rk ) (e.g., (parents, location)). We
can use vector space compositions to combine the embeddings of individual relations ri into an
embedding of the path (r1 , ..., rk ). The natural composition of the bilinear model of Eqn. 3.1 is
matrix multiplication. Thus, to answer how likely a path query (q, t) holds, where q = (s, r1 , ..., rk ),
we would compute
$$\text{score}(q, t) = \mathbf{x}_s^\top \mathbf{W}_{r_1} \cdots \mathbf{W}_{r_k} \mathbf{x}_t. \tag{3.2}$$

These KB embedding methods are shown to have good generalization performance in terms of
validating unseen facts (e.g., triples and path queries) given an existing KB. Interested readers are
referred to Nguyen (2017) for a detailed survey of embedding models for KBC.
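
The path score of Eqn. 3.2 can be evaluated without forming the full matrix product, by pushing the source vector through one relation matrix at a time. The sketch below assumes toy 2-d embeddings and hypothetical relation matrices chosen so the result is easy to check by hand.

```python
def vec_mat(v, W):
    """Compute the row vector v^T W for a list v and a list-of-rows matrix W."""
    return [sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(W[0]))]

def path_score(x_s, relation_matrices, x_t):
    """Compute x_s^T W_{r_1} ... W_{r_k} x_t, left to right."""
    v = list(x_s)
    for W in relation_matrices:
        v = vec_mat(v, W)
    return sum(a * b for a, b in zip(v, x_t))

# Toy example (assumed values): W_1 maps axis 0 to axis 1, W_2 maps it back.
x_s = [1.0, 0.0]
W_1 = [[0.0, 1.0], [0.0, 0.0]]
W_2 = [[0.0, 0.0], [1.0, 0.0]]

score_match = path_score(x_s, [W_1, W_2], [1.0, 0.0])
score_mismatch = path_score(x_s, [W_1, W_2], [0.0, 1.0])
```

Each step costs one vector-matrix product, so the cost is linear in the path length k.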

3.4 Multi-Step Reasoning on KB


Knowledge Base Reasoning (KBR) is a subtask of KB-QA. As described in Sec. 3.2, KB-QA is
performed in two steps: (1) semantic parsing translates a question into a KB query, then (2) KBR
traverses the query-matched paths in a KB to find the answers.
To reason over a KB, for each relation r ∈ R, we are interested in learning a set of first-order logical
rules in the form of relational paths, π = (r1 , ..., rk ). For the KBR example in Fig. 3.2, given
the question “What is the citizenship of Obama?”, its translated KB query in the form of subject-
predicate-object triple is (Obama, citizenship, ?). Unless the triple (Obama, citizenship,

Figure 3.2: An example of knowledge base reasoning (KBR). We want to identify the answer node
USA for a KB query (Obama, citizenship, ?). Figure adapted from Shen et al. (2018).

Table 3.1: A sample of relational paths learned by PRA. For each relation, its top-2 PRA paths are
presented, adapted from Lao et al. (2011).
ID  PRA Path
    # Comment

athlete-plays-for-team
 1  (athlete-plays-in-league, league-players, athlete-plays-for-team)
    # teams with many players in the athlete’s league
 2  (athlete-plays-in-league, league-teams, team-against-team)
    # teams that play against many teams in the athlete’s league

stadium-located-in-city
 1  (stadium-home-team, team-home-stadium, stadium-located-in-city)
    # city of the stadium with the same team
 2  (latitude-longitude, latitude-longitude-of, stadium-located-in-city)
    # city of the stadium with the same location

team-home-stadium
 1  (team-plays-in-city, city-stadium)
    # stadium located in the same city with the query team
 2  (team-member, athlete-plays-for-team, team-home-stadium)
    # home stadium of teams which share players with the query

team-plays-in-league
 1  (team-plays-sport, players, athlete-players-in-league)
    # the league that the query team’s members belong to
 2  (team-plays-against-team, team-players-in-league)
    # the league that query team’s competing team belongs to

USA) is explicitly stored in the KB,² a multi-step reasoning procedure is needed to deduce the answer
from the paths that contain relevant triples, such as (Obama, born-in, Hawaii) and (Hawaii,
part-of, USA), using the learned relational paths such as (born-in, part-of).
Below, we describe three categories of multi-step KBR methods. They differ in whether reasoning
is performed in a discrete symbolic space or a continuous neural space.
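
The deduction for the query (Obama, citizenship, ?) can be sketched as a walk over stored triples that follows the relational path (born-in, part-of). The helpers `build_index` and `follow_path` are hypothetical names introduced for this illustration; the two triples are the ones mentioned in the text.

```python
from collections import defaultdict

def build_index(triples):
    """Index triples (s, r, t) by (subject, relation) for fast lookup."""
    index = defaultdict(set)
    for s, r, t in triples:
        index[(s, r)].add(t)
    return index

def follow_path(index, source, path):
    """Return all entities reachable from source by traversing the relations in order."""
    frontier = {source}
    for relation in path:
        frontier = {t for e in frontier for t in index.get((e, relation), ())}
    return frontier

kb = [("Obama", "born-in", "Hawaii"), ("Hawaii", "part-of", "USA")]
index = build_index(kb)

direct = follow_path(index, "Obama", ("citizenship",))        # triple not stored
deduced = follow_path(index, "Obama", ("born-in", "part-of"))  # multi-step path
```

The direct lookup returns nothing because the citizenship triple is missing, while the two-step path reaches USA.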

3.4.1 Symbolic Methods

The Path Ranking Algorithm (PRA) (Lao and Cohen, 2010; Lao et al., 2011) is one of the primary
symbolic approaches to learning relational paths in large KBs. PRA uses random walks with restarts
to perform multiple bounded depth-first searches to find relational paths. Table 3.1 shows a sample of
² As pointed out by Nguyen (2017), even very large KBs, such as Freebase and DBpedia, which contain
billions of fact triples about the world, are still far from complete.

Figure 3.3: An overview of the neural methods for KBR (Shen et al., 2017a; Yang et al., 2017a).
The KB is embedded in neural space as matrix M that is learned to store compactly the connections
between related triples (e.g., the relations that are semantically similar are stored as a cluster). The
controller is designed to adaptively produce lookup sequences in M and decide when to stop, and
the encoder and decoder are responsible for the mapping between the symbolic and neural spaces.

relational paths learned by PRA. A relational path is a sequence π = (r1 , ..., rk ). An instance of the
relational path is a sequence of nodes e1 , ..., ek+1 such that (ei , ri , ei+1 ) is a valid triple.
During KBR, given a query q = (s, r, ?), PRA selects the set of relational paths for r, denoted by
Br = {π1 , π2 , ...}, then traverses the KB according to the query and Br , and scores each candidate
answer t using a linear model
$$\text{score}(q, t) = \sum_{\pi \in B_r} \lambda_\pi P(t \mid s, \pi), \tag{3.3}$$

where λπ ’s are the learned weights, and P (t|s, π) is the probability of reaching t from s by a random
walk that instantiates the relational path π, also known as a path constrained random walk.
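
The scoring rule of Eqn. 3.3 can be sketched as follows. The sketch assumes that a path-constrained random walk chooses uniformly among the outgoing edges labeled with the required relation; the entities (TeamA, CityX, Stadium1, Stadium2) and the path weight are hypothetical, and only the relation names are taken from Table 3.1.

```python
from collections import defaultdict

def path_constrained_walk(edges, source, path):
    """Distribution over end nodes of a random walk that must follow `path`.

    Assumes a uniform choice among edges with the required relation;
    probability mass that hits a dead end is dropped.
    """
    dist = {source: 1.0}
    for relation in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            targets = edges.get((node, relation), [])
            for target in targets:
                nxt[target] += prob / len(targets)
        dist = dict(nxt)
    return dist

def pra_score(edges, source, target, weighted_paths):
    """Linear model: sum over paths of weight * P(target | source, path)."""
    return sum(weight * path_constrained_walk(edges, source, path).get(target, 0.0)
               for path, weight in weighted_paths)

edges = {
    ("TeamA", "team-plays-in-city"): ["CityX"],
    ("CityX", "city-stadium"): ["Stadium1", "Stadium2"],
}
weighted_paths = [(("team-plays-in-city", "city-stadium"), 2.0)]  # assumed weight
score = pra_score(edges, "TeamA", "Stadium1", weighted_paths)
```

Here the walk reaches each of the two stadiums with probability 0.5, so the weighted score for Stadium1 is 2.0 × 0.5.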
Because PRA operates in a fully discrete space, it does not take into account semantic similarities
among relations. As a result, PRA can easily produce millions of categorically distinct paths even for
a small path length, which not only hurts generalization but makes reasoning prohibitively expensive.
To reduce the number of relational paths that need to be considered in KBR, Lao et al. (2011) used
heuristics (e.g., requiring that a path be included in PRA only if it retrieves at least one target entity
in the training data) and added an L1 regularization term in the loss function for training the linear
model of Eqn. 3.3. Gardner et al. (2014) proposed a modification to PRA that leverages the KB
embedding methods, as described in Sec. 3.3, to collapse and cluster PRA paths according to their
relation embeddings.

3.4.2 Neural Methods

Implicit ReasoNet (IRN) (Shen et al., 2016, 2017a) and Neural Logic Programming (Neural LP)
(Yang et al., 2017a) are proposed to perform multi-step KBR in a neural space and achieve state-
of-the-art results on popular benchmarks. The overall architecture of these methods is shown in
Fig. 3.3, which can be viewed as an instance of the neural approaches illustrated in Fig. 1.4 (Right).
In what follows, we use IRN as an example to illustrate how these neural methods work. IRN
consists of four modules: encoder, decoder, shared memory, and controller, as in Fig. 3.3.

Encoder and Decoder These two modules are task-dependent. Given an input query (s, r, ?),
the encoder maps s and r, respectively, into their embedding vectors and then concatenates the
two vectors to form the initial hidden state vector s0 of the controller. The use of vectors rather than
matrices for relation representations is inspired by the bilinear-diag model (Yang et al., 2015), which
restricts the relation representations to the class of diagonal matrices.
The decoder outputs a prediction vector $\mathbf{o} = \tanh(\mathbf{W}_o^\top \mathbf{s}_t + \mathbf{b}_o)$, a nonlinear projection of the state $\mathbf{s}_t$
at time t, where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the weight matrix and bias vector, respectively. In KBR, we can
map the answer vector $\mathbf{o}$ to its answer node (entity) $o$ in the symbolic space based on L1 distance as
$o = \arg\min_{e \in \mathcal{E}} \|\mathbf{o} - \mathbf{x}_e\|_1$, where $\mathbf{x}_e$ is the embedding vector of entity e.
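
The final decoding step, mapping the answer vector to the nearest entity under L1 distance, can be sketched as below. The 2-d entity embeddings and the prediction vector are toy values assumed for the example, and `nearest_entity` is a hypothetical helper name.

```python
def l1_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_entity(o, entity_embeddings):
    """Return the entity whose embedding is closest to vector o in L1 distance."""
    return min(entity_embeddings,
               key=lambda e: l1_distance(o, entity_embeddings[e]))

# Assumed toy embeddings x_e and a decoder output o.
entity_embeddings = {"USA": [1.0, 0.0], "Hawaii": [0.0, 1.0]}
o = [0.9, 0.2]
answer = nearest_entity(o, entity_embeddings)
```

For a KB with many entities, this linear scan would typically be replaced by a batched or approximate nearest-neighbor search.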

Shared Memory The shared memory M is differentiable, and consists of a list of vectors
$\{\mathbf{m}_i\}_{1 \le i \le |\mathcal{M}|}$ that are randomly initialized and updated through back-propagation in training. M
stores a compact version of KB optimized for the KBR task. That is, each vector represents a
concept (a cluster of relations or entities) and the distance between vectors represents the semantic
relatedness of these concepts. For example, the system may fail to answer the question (Obama,
citizenship, ?) even if it finds the relevant facts in M, such as (Obama, born-in, Hawaii)
and (Hawaii, part-of, USA), because it does not know that born-in and citizenship are
semantically related relations. In order to correct the error, M needs to be updated using the gradient
to encode the piece of new information by moving the two relation vectors closer to each other in
the neural space.

Controller The controller is implemented as an RNN. Given initial state s0 , it uses attention to
iteratively look up and fetch information from M to update the state st at time t according to Eqn. 3.4,
until it decides to terminate the reasoning process and calls the decoder to generate the output.
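$$\begin{aligned}
a_{t,i} &= \frac{\exp\left(\lambda \cos(\mathbf{W}_1^\top \mathbf{m}_i, \mathbf{W}_2^\top \mathbf{s}_t)\right)}{\sum_k \exp\left(\lambda \cos(\mathbf{W}_1^\top \mathbf{m}_k, \mathbf{W}_2^\top \mathbf{s}_t)\right)}, \\
\mathbf{x}_t &= \sum_{i}^{|\mathcal{M}|} a_{t,i} \mathbf{m}_i, \\
\mathbf{s}_{t+1} &= g(\mathbf{W}_3^\top \mathbf{s}_t + \mathbf{W}_4^\top \mathbf{x}_t),
\end{aligned} \tag{3.4}$$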
where W’s are learned projection matrices, λ a scaling factor and g a nonlinear activation function.
The reasoning process of IRN can be viewed as a Markov Decision Process (MDP), as illustrated in
Sec. 2.3.1. The number of steps in the information lookup and fetching sequence of Eqn. 3.4 is not given
by training data, but is decided by the controller on the fly. More complex queries need more steps.
Thus, IRN learns a stochastic policy to get a distribution over termination and prediction actions
by the REINFORCE algorithm (Williams, 1992), which is described in Sec. 2.3.2 and Eqn. 2.7.
Since all the modules of IRN are differentiable, IRN is an end-to-end differentiable neural model
whose parameters, including the embedded KB matrix M, can be jointly optimized using SGD on
the training samples derived from a KB, as shown in Fig. 3.3.
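
One lookup step of the controller (Eqn. 3.4) can be sketched in plain Python. This is an illustration under stated assumptions, not the IRN implementation: g is taken to be tanh, the projection matrices are identity matrices, the memory holds two toy vectors, and the termination decision is omitted.

```python
import math

def matvec_t(W, v):
    """Compute W^T v for a list-of-rows matrix W and a list v."""
    return [sum(W[i][j] * v[i] for i in range(len(v)))
            for j in range(len(W[0]))]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def controller_step(memory, s_t, W1, W2, W3, W4, lam=10.0):
    """One attention lookup over memory followed by a state update."""
    query = matvec_t(W2, s_t)
    logits = [lam * cosine(matvec_t(W1, m), query) for m in memory]
    top = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    attention = [e / total for e in exps]  # a_{t,i}
    x_t = [sum(a * m[d] for a, m in zip(attention, memory))
           for d in range(len(memory[0]))]
    pre = [p + q for p, q in zip(matvec_t(W3, s_t), matvec_t(W4, x_t))]
    s_next = [math.tanh(v) for v in pre]   # g assumed to be tanh
    return attention, x_t, s_next

identity = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0]]
attention, x_t, s_next = controller_step(
    memory, [1.0, 0.0], identity, identity, identity, identity)
```

Because the state is aligned with the first memory vector, nearly all attention mass falls on it; a larger scaling factor λ makes the lookup sharper.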
As outlined in Fig. 1.4, neural methods operate in a continuous neural space, and do not suffer from
the problems associated with symbolic methods. They are robust to paraphrase alternations because
knowledge is implicitly represented by semantic classes via continuous vectors and matrices. They
also do not suffer from the search complexity issue even with complex queries (e.g., path queries)
and a very large KB because they reason over a compact representation of a KB (e.g., the matrix M
in the shared memory in IRN) rather than the KB itself.
One of the major limitations of these methods is the lack of interpretability. Unlike PRA, which
traverses the paths in the graph explicitly as in Eqn. 3.3, IRN does not explicitly follow any path in
the KB during reasoning but performs lookup operations over the shared memory iteratively using
the RNN controller with attention, each time using the revised internal state s as a query for lookup.
It remains challenging to recover the symbolic representations of queries and paths (or first-order
logical rules) from the neural controller. See (Shen et al., 2017a; Yang et al., 2017a) for some
interesting preliminary results of interpretation of neural methods.

3.4.3 Reinforcement Learning based Methods

DeepPath (Xiong et al., 2017), MINERVA (Das et al., 2017b) and M-Walk (Shen et al., 2018) are
among the state-of-the-art methods that use RL for multi-step reasoning over a KB. They use a
policy-based agent with continuous states based on KB embeddings to traverse the knowledge graph
to identify the answer node (entity) for an input query. The RL-based methods are as robust as the
neural methods due to the use of continuous vectors for state representation, and are as interpretable
as symbolic methods because the agents explicitly traverse the paths in the graph.
We formulate KBR as an MDP defined by the tuple (S, A, R, P), where S is the continuous state
space, A the set of available actions, P the state transition probability matrix, and R the reward
function. Below, we follow M-Walk to describe these components in detail. We denote a KB as
graph G(E, R) which consists of a collection of entity nodes E and the relation edges R that link the

nodes. We denote a KB query as q = (e0 , r, ?), where e0 and r are the given source node and
relation, respectively, and ? the answer node to be identified.

States Let st denote the state at time t, which encodes information of all traversed nodes up to t,
all the previous selected actions and the initial query q. st can be defined recursively as follows:
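$$\begin{aligned}
s_0 &:= \{q, \mathcal{R}_{e_0}, \mathcal{E}_{e_0}\}, \\
s_t &= s_{t-1} \cup \{a_{t-1}, e_t, \mathcal{R}_{e_t}, \mathcal{E}_{e_t}\},
\end{aligned} \tag{3.5}$$
where $a_t \in \mathcal{A}$ is the action selected by the agent at time t, $e_t$ is the currently visited node, $\mathcal{R}_{e_t} \subseteq \mathcal{R}$
is the set of all the edges connected to $e_t$, and $\mathcal{E}_{e_t} \subseteq \mathcal{E}$ is the set of all the nodes connected to $e_t$. Note
that in RL-based methods, $s_t$ is represented as a continuous vector using, e.g., an RNN in M-Walk
and MINERVA or an MLP in DeepPath.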

Actions Based on $s_t$, the agent selects one of the following actions: (1) choosing an edge in $\mathcal{R}_{e_t}$
and moving to the next node $e_{t+1} \in \mathcal{E}$, or (2) terminating the reasoning process and outputting the
current node $e_t$ as a prediction of the answer node $e_T$.

Transitions The transitions are deterministic. As shown in Fig. 3.2, once action at is selected, the
next node et+1 and its associated Eet+1 and Ret+1 are known.

Rewards We only have the terminal reward of +1 if eT is the correct answer, and 0 otherwise.

Policy Network The policy πθ (a|s) denotes the probability of selecting action a given state s,
and is implemented as a neural network parameterized by θ. The policy network is optimized to
maximize E[Vθ (s0 )], which is the long-term reward of starting from s0 and following the policy πθ
afterwards. In KBR, the policy network can be trained using RL, such as the REINFORCE method,
from the training samples in the form of triples (es , r, et ) extracted from a KB. To address the reward
sparsity issue (i.e., the reward is only available at the end of a path), Shen et al. (2018) proposed
to use Monte Carlo Tree Search to generate a set of simulated paths with more positive terminal
rewards by exploiting the fact that all the transitions are deterministic for a given knowledge graph.
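
The MDP components above (states, actions, deterministic transitions, terminal reward) can be sketched as a small environment class. This is a simplified illustration, not the M-Walk implementation: the class name `KBWalkEnv` and the `STOP` marker are hypothetical, and the state is kept as a symbolic history list rather than the continuous vector an RNN or MLP would produce.

```python
STOP = ("STOP", None)  # hypothetical marker for the terminate action

class KBWalkEnv:
    """Deterministic graph-walk MDP for a KB query (e0, r, ?)."""

    def __init__(self, triples, source, query_relation, answer):
        self.out_edges = {}
        for s, r, t in triples:
            self.out_edges.setdefault(s, []).append((r, t))
        self.query = (source, query_relation)
        self.answer = answer
        self.node = source
        self.history = [source]  # symbolic stand-in for the state s_t
        self.done = False

    def actions(self):
        """Outgoing edges of the current node, plus the terminate action."""
        return self.out_edges.get(self.node, []) + [STOP]

    def step(self, action):
        """Apply an action; the reward is nonzero only when stopping on the answer."""
        if self.done:
            raise RuntimeError("episode already terminated")
        if action == STOP:
            self.done = True
            return 1.0 if self.node == self.answer else 0.0
        if action not in self.out_edges.get(self.node, []):
            raise ValueError("edge not available at the current node")
        relation, next_node = action
        self.node = next_node            # deterministic transition
        self.history += [relation, next_node]
        return 0.0

kb = [("Obama", "born-in", "Hawaii"), ("Hawaii", "part-of", "USA")]
env = KBWalkEnv(kb, "Obama", "citizenship", "USA")
rewards = [env.step(("born-in", "Hawaii")),
           env.step(("part-of", "USA")),
           env.step(STOP)]
```

Only the last step yields a reward, which illustrates the reward sparsity that the text says Monte Carlo Tree Search is used to mitigate.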

3.5 Conversational KB-QA Agents


All of the KB-QA methods we have described so far are based on single-turn agents which assume
that users can compose in one shot a complicated, compositional natural language query that can
uniquely identify the answer in the KB.
However, in many cases, it is unreasonable to assume that users can construct compositional queries
without prior knowledge of the structure of the KB to be queried. Thus, conversational KB-QA
agents are more desirable because they allow users to query a KB interactively without composing
complicated queries.
A conversational KB-QA agent is useful for many interactive KB-QA tasks such as movie-on-
demand, where a user attempts to find a movie based on certain attributes of that movie, as illustrated
by the example in Fig. 3.4, where the movie DB can be viewed as an entity-centric KB consisting
of entity-attribute-value triples.
In addition to the core KB-QA engine which typically consists of a semantic parser and a KBR
engine, a conversational KB-QA agent is also equipped with a Dialogue Manager (DM) which
tracks the dialogue state and decides what question to ask to effectively help users navigate the KB
in search of an entity (movie). The high-level architecture of the conversational agent for movie-
on-demand is illustrated in Fig. 3.5. At each turn, the agent receives a natural language utterance ut
as input, and selects an action at ∈ A as output. The action space A consists of a set of questions,
each for requesting the value of an attribute, and an action of informing the user with an ordered
list of retrieved entities. The agent is a typical task-oriented dialogue system of Fig. 1.2 (Top),
consisting of (1) a belief tracker module for resolving coreferences and ellipsis in user utterances
using conversation context, identifying user intents, extracting associated attributes, and tracking the
dialogue state; (2) an interface with the KB to query for relevant results (i.e., the Soft-KB Lookup
component, which can be implemented using the KB-QA models described in the previous sections,

Figure 3.4: An interaction between a user and a multi-turn KB-QA agent for the movie-on-demand
task. Figure credit: Dhingra et al. (2017).

Figure 3.5: An overview of a conversational KB-QA agent. Figure credit: Dhingra et al. (2017).

except that we need to form the query based on the dialogue history captured by the belief tracker,
not just the current user utterance, as described in Suhr et al. (2018)); (3) a beliefs summary module
to summarize the state into a vector; and (4) a dialogue policy which selects the next action based on
the dialogue state. The policy can be either programmed (Wu et al., 2015) or trained on dialogues
(Wen et al., 2017; Dhingra et al., 2017).
Wu et al. (2015) presented an Entropy Minimization Dialogue Management (EMDM) strategy. The
agent always asks for the value of the attribute with maximum entropy over the remaining entries
in the KB. EMDM is proved optimal in the absence of language understanding errors. However, it
does not take into account the fact that some questions are easy for users to answer, whereas others
are not. For example, in the movie-on-demand task, the agent could ask users to provide the movie
release ID which is unique to each movie but is often unknown to regular users.
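
The entropy-based choice of the next question can be sketched as follows. The sketch assumes a uniform distribution over the remaining KB entries, and the movie rows and attribute names are made up for illustration; it is not the EMDM implementation of Wu et al. (2015).

```python
import math
from collections import Counter

def attribute_entropy(rows, attribute):
    """Shannon entropy (in bits) of an attribute's values over the remaining rows."""
    counts = Counter(row[attribute] for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def next_attribute_to_ask(rows, attributes):
    """Pick the attribute whose value distribution has maximum entropy."""
    return max(attributes, key=lambda a: attribute_entropy(rows, a))

# Hypothetical remaining entries in a movie KB.
rows = [
    {"genre": "comedy", "year": 1999, "release_id": "id-1"},
    {"genre": "comedy", "year": 1999, "release_id": "id-2"},
    {"genre": "drama",  "year": 1999, "release_id": "id-3"},
    {"genre": "drama",  "year": 1999, "release_id": "id-4"},
]
choice = next_attribute_to_ask(rows, ["genre", "year", "release_id"])
```

On these rows the rule selects `release_id`, the attribute that splits the entries best but that a regular user is least likely to know, which is the weakness described above.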
Dhingra et al. (2017) proposed KB-InfoBot – a fully neural end-to-end multi-turn dialogue agent for
the movie-on-demand task. The agent is trained entirely from user feedback. It does not suffer from
the problem of EMDM, and always asks users easy-to-answer questions to help search in the KB.
Like all KB-QA agents, KB-InfoBot needs to interact with an external KB to retrieve real-world
knowledge. This is traditionally achieved by issuing a symbolic query to the KB to retrieve entries
based on their attributes. However, such symbolic operations break the differentiability of the system
and prevent end-to-end training of the dialogue agent. KB-InfoBot addresses this limitation by
replacing symbolic queries with an induced posterior distribution over the KB that indicates which
entries the user is interested in. The induction can be achieved using the neural KB-QA methods
described in the previous sections. Experiments show that integrating the induction process with RL
leads to higher task success rate and reward both in simulations and against real users.³
Recently, several datasets have been developed for building conversational KB-QA agents. Iyyer
et al. (2017) collected a Sequential Question Answering (SQA) dataset via crowd sourcing by lever-
aging WikiTableQuestions (WTQ (Pasupat and Liang, 2015)), which contains highly compositional
questions associated with HTML tables from Wikipedia. As shown in the example in Fig. 3.6 (Left),
³ It remains to be verified whether the method can deal with large-scale KBs with millions of entities.

Figure 3.6: The examples from two conversational KB-QA datasets. (Left) An example question
sequence created from a compositional question intent in the SQA dataset. Figure credit: Iyyer et al.
(2017). (Right) An example dialogue from the CSQA dataset. Figure credit: Saha et al. (2018).

each crowd sourcing task contains a long, complex question originally from WTQ as the question
intent. The workers are asked to compose a sequence of simpler but inter-related questions that lead
to the final intent. The answers to the simple questions are subsets of the cells in the table.
Saha et al. (2018) presented a dataset consisting of 200K QA dialogues for the task of Complex Se-
quence Question Answering (CSQA). CSQA combines two sub-tasks: (1) answering factoid ques-
tions through complex reasoning over a large-scale KB, and (2) learning to converse through a
sequence of coherent QA pairs. As the example in Fig. 3.6 (Right) shows, CSQA calls for a con-
versational KB-QA agent that combines many technologies described in this chapter, including (1)
parsing complex natural language queries (Sec. 3.2), (2) using conversation context to resolve
coreferences and ellipsis in user utterances, like the belief tracker in Fig. 3.5, (3) asking clarification
questions for ambiguous queries, like the dialogue manager in Fig. 3.5, and (4) retrieving relevant
paths in the KB to answer questions (Sec. 3.4).

3.6 Machine Reading for Text-QA

Machine Reading Comprehension (MRC) is a challenging task: the goal is to have machines read a
(set of) text passage(s) and then answer any question about the passage(s). The MRC model is the
core component of text-QA agents.
The recent rapid progress in MRC is largely due to the availability of a multitude of large-scale
datasets that the research community has created over various text sources such as Wikipedia
(WikiReading (Hewlett et al., 2016), SQuAD (Rajpurkar et al., 2016), WikiHop (Welbl et al.,
2017), DRCD (Shao et al., 2018)), news and other articles (CNN/Daily Mail (Hermann et al., 2015),
NewsQA (Trischler et al., 2016), RACE (Lai et al., 2017), ReCoRD (Zhang et al., 2018d)), fictional
stories (MCTest (Richardson et al., 2013), CBT (Hill et al., 2015), NarrativeQA (Kočisky et al.,
2017)), science questions (ARC (Clark et al., 2018)), and general Web documents (MS MARCO
(Nguyen et al., 2016), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), DuReader (He
et al., 2017b)).

SQuAD. This is the MRC dataset released by the Stanford NLP group. It consists of 100K ques-
tions posed by crowdworkers on a set of Wikipedia articles. As shown in the example in Fig. 3.7
(Left), the MRC task defined on SQuAD involves a question and a passage, and aims to find an
answer span in the passage. For example, in order to answer the question “what causes precipi-
tation to fall?”, one might first locate the relevant part of the passage “precipitation ... falls under
gravity”, then reason that “under” refers to a cause (not location), and thus determine the correct
answer: “gravity”. Although the questions with span-based answers are more constrained than the
real-world questions users submit to Web search engines such as Google and Bing, SQuAD provides
a rich diversity of question and answer types and became one of the most widely used MRC datasets
in the research community.

Figure 3.7: The examples from two MRC datasets. (Left) Question-answer pairs for a sample
passage in the SQuAD dataset, adapted from Rajpurkar et al. (2016). Each of the answers is a text
span in the passage. (Right) A question-answer pair for a set of passages in the MS MARCO dataset,
adapted from Nguyen et al. (2016). The answer, if there is one, is human generated.

MS MARCO. This is a large scale real-world MRC dataset, released by Microsoft, aiming to ad-
dress the limitations of other academic datasets. For example, MS MARCO differs from SQuAD
in that (1) SQuAD consists of the questions posed by crowdworkers while MS MARCO is sampled
from the real user queries; (2) SQuAD uses a small set of high quality Wikipedia articles while
MS MARCO is sampled from a large number of Web documents; (3) MS MARCO includes some
unanswerable queries;⁴ and (4) SQuAD requires identifying an answer span in a passage while MS
MARCO requires generating an answer (if there is one) from multiple passages that may or may not
be relevant to the given question. As a result, MS MARCO is far more challenging, and requires
more sophisticated reading comprehension skills. As shown in the example in Fig. 3.7 (Right),
given the question “will I qualify for OSAP if I’m new in Canada”, one might first locate the rele-
vant passage that includes: “you must be a 1 Canadian citizen; 2 permanent resident; or 3 protected
person...” and reason that being new to the country is usually the opposite of being a citizen, perma-
nent resident etc., thus determine the correct answer: “no, you won’t qualify”.
In addition, TREC⁵ also provides a series of text-QA benchmarks:

The automated QA track. This was one of the most popular tracks in TREC for many years, up
to 2007 (Dang et al., 2007; Agichtein et al., 2015). It focused on the task of providing
automatic answers for human questions. The track primarily dealt with factual questions, and the
answers provided by participants were extracted from a corpus of News articles. While the task
evolved to model increasingly realistic information needs, addressing question series, list questions,
and even interactive feedback, a major limitation remained: the questions did not directly come from
real users, in real time.

The LiveQA track. This track started in 2015 (Agichtein et al., 2015), focusing on answering user
questions in real time. Real user questions, i.e., fresh questions submitted on the Yahoo Answers
(YA) site that have not yet been answered, were sent to the participant systems, which provided
an answer in real time. Returned answers were judged by TREC editors on a 4-level Likert scale.
LiveQA revived this popular QA track, which had been frozen for several years, attracting significant
attention from the QA research community.

Figure 3.8: Two examples of state of the art neural MRC models. (Left) The Stochastic Answer Net
(SAN) model. Figure credit: Liu et al. (2018d). (Right) The BiDirectional Attention Flow (BiDAF)
model. Figure credit: Seo et al. (2016).

3.7 Neural MRC Models

The description in this section is based on the state of the art models developed on SQuAD, where
given a question $Q = (q_1, \ldots, q_I)$ and a passage $P = (p_1, \ldots, p_J)$, we need to locate an answer span
$A = (a_{\text{start}}, a_{\text{end}})$ in $P$.
In spite of the variety of model structures and attention types (Chen et al., 2016a; Xiong et al., 2016;
Seo et al., 2016; Shen et al., 2017c; Wang et al., 2017b), a typical neural MRC model performs read-
ing comprehension in three steps, as outlined in Fig. 1.4: (1) encoding the symbolic representation
of the questions and passages into a set of vectors in a neural space; (2) reasoning in the neural
space to identify the answer vector (e.g., in SQuAD, this is equivalent to ranking and re-ranking the
embedded vectors of all possible text spans in P ); and (3) decoding the answer vector into a natural
language output in the symbolic space (e.g., this is equivalent to mapping the answer vector to its
text span in P ). Since the decoding module is straightforward for SQuAD models, we will focus on
encoding and reasoning below.
Fig. 3.8 illustrates two examples of neural MRC models. BiDAF (Seo et al., 2016) is among the most
widely used state of the art MRC baseline models in the research community, and SAN (Liu et al.,
2018d) is the best documented MRC model on the SQuAD1.1 leaderboard6 as of Dec. 19, 2017.

3.7.1 Encoding

Most MRC models encode questions and passages through three layers: a lexicon embedding layer,
a contextual embedding layer, and an attention layer, as reviewed below.

Lexicon Embedding Layer. This extracts information from Q and P at the word level and nor-
malizes for lexical variants. It typically maps each word to a vector space using a pre-trained word
embedding model, such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), such
that semantically similar words are mapped to the vectors that are close to each other in the neural
space (also see Sec. 2.2.1). Word embedding can be enhanced by concatenating each word embed-
ding vector with other linguistic embeddings such as those derived from characters, Part-Of-Speech
(POS) tags, and named entities. Given Q and P, the word embeddings for the tokens in Q form a
matrix $E^q \in \mathbb{R}^{d \times I}$ and those for the tokens in P form a matrix $E^p \in \mathbb{R}^{d \times J}$, where d is the dimension of word embeddings.
4
SQuAD v2 (Rajpurkar et al., 2018) also includes unanswerable queries.
5
https://trec.nist.gov/data/qamain.html
6
https://rajpurkar.github.io/SQuAD-explorer/

Contextual Embedding Layer. This utilizes contextual cues from surrounding words to refine
the embedding of the words. As a result, the same word might map to different vectors in a neural
space depending on its context, such as “bank of a river” vs. “bank of America”. This is typically
achieved by using a Bi-directional Long Short-Term Memory (BiLSTM) network,7 an extension of
the RNN of Fig. 2.2. As shown in Fig. 3.8, we run two LSTMs, one in each direction, and
concatenate their outputs. Hence, we obtain a matrix $H^q \in \mathbb{R}^{2d \times I}$ as a contextually
aware representation of Q and a matrix $H^p \in \mathbb{R}^{2d \times J}$ as a contextually aware representation of P.
ELMo (Peters et al., 2018) is one of the state of the art contextual embedding models. It is based on
deep BiLSTM. Instead of using only the output layer representations of BiLSTM, ELMo combines
the intermediate layer representations in the BiLSTM, where the combination weights are optimized
on task-specific training data.
BERT (Devlin et al., 2018) differs from ELMo and BiLSTM in that it is designed to pre-train deep
bidirectional representations by jointly conditioning on both left and right context in all layers. The
pre-trained BERT representations can be fine-tuned with just one additional output layer to create
state of the art models for a wide range of NLP tasks, including MRC.
Since an RNN/LSTM is hard to train efficiently using parallel computing, Yu et al. (2018) present a
new contextual embedding model which does not require an RNN: Its encoder consists exclusively
of convolution and self-attention, where convolution models local interactions and self-attention
models global interactions. Such a model can be trained an order of magnitude faster than an RNN-
based model on GPU clusters.

Attention Layer. This couples the question and passage vectors and produces a set of query-
aware feature vectors for each word in the passage, and generates the working memory M over
which reasoning is performed. This is achieved by summarizing information from both Hq and Hp
via the attention process8 that consists of the following steps:

1. Compute an attention score, which signifies which query words are most relevant to each
passage word: $s_{ij} = \mathrm{sim}_{\theta_s}(h^q_i, h^p_j) \in \mathbb{R}$ for each $h^q_i$ in $H^q$, where $\mathrm{sim}_{\theta_s}$ is the similarity
function, e.g., a bilinear model, parameterized by $\theta_s$.
2. Compute the normalized attention weights through softmax: $\alpha_{ij} = \exp(s_{ij}) / \sum_k \exp(s_{kj})$.
3. Summarize information for each passage word via $\hat{h}^p_j = \sum_i \alpha_{ij} h^q_i$. Thus, we obtain a
matrix $\hat{H}^p \in \mathbb{R}^{2d \times J}$ as the question-aware representation of P.
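
The three attention steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the similarity function is taken to be a bilinear form with a hypothetical parameter matrix W_s, the inputs are random stand-ins for the BiLSTM outputs, and the function and variable names are not from any of the cited models.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def question_aware_passage(H_q, H_p, W_s):
    """Attention steps 1-3.

    H_q: (2d, I) contextual question vectors, one column per question word.
    H_p: (2d, J) contextual passage vectors, one column per passage word.
    W_s: (2d, 2d) bilinear similarity parameters (assumed form of sim).
    Returns alpha with shape (I, J) and H_hat_p with shape (2d, J).
    """
    S = H_q.T @ W_s @ H_p        # step 1: s_ij for every (question, passage) word pair
    alpha = softmax(S, axis=0)   # step 2: normalize over question words i
    H_hat_p = H_q @ alpha        # step 3: column j is sum_i alpha_ij * h_i^q
    return alpha, H_hat_p

# Random stand-ins for the encoder outputs (2d = 4, I = 3, J = 5).
rng = np.random.default_rng(0)
d2, I, J = 4, 3, 5
H_q = rng.normal(size=(d2, I))
H_p = rng.normal(size=(d2, J))
W_s = rng.normal(size=(d2, d2))
alpha, H_hat_p = question_aware_passage(H_q, H_p, W_s)
```

Each column of alpha is a distribution over question words, so every passage word receives its own weighted summary of the question.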

Next, we form the working memory M in the neural space as $M = f_\theta(\hat{H}^p, H^p)$, where $f_\theta$ is a
function that fuses its input matrices, parameterized by $\theta$. $f_\theta$ can be an arbitrary trainable neural
network. For example, the fusion function in SAN includes a concatenation layer, a self-attention
layer and a BiLSTM layer. BiDAF computes attention in two directions: from passage to question
($\hat{H}^q$) as well as from question to passage ($\hat{H}^p$). The fusion function in BiDAF includes a layer that
concatenates three matrices $H^p$, $\hat{H}^p$ and $\hat{H}^q$, and a two-layer BiLSTM to encode for each word its
contextual information with respect to the entire passage and the query.

3.7.2 Reasoning

MRC models can be grouped into different categories based on how they perform reasoning to
generate the answer. Here, we distinguish single-step models from multi-step models.

7
Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks (RNNs). The
units of an LSTM are used as building units for the layers of an RNN. LSTMs enable RNNs to remember their
inputs over a long period of time because LSTMs contain their information in a gated cell, where gated means
that the cell decides whether to store or delete information based on the importance it assigns to the information.
The use of BiLSTM for contextual embedding is suggested by Melamud et al. (2016); McCann et al. (2017).
8
Interested readers may refer to Table 1 in Huang et al. (2017) for a summarized view on the attention
process used in several state of the art MRC models.

Figure 3.9: (Top) A human reader can easily answer the question by reading the passage only once.
(Bottom) A human reader may have to read the passage multiple times to answer the question.

Single-Step Reasoning. A single-step reasoning model matches the question and document only
once and produces the final answers. We use the single-step version of SAN9 in Fig. 3.8 (Left) as an
example to describe the reasoning process. We need to find the answer span (i.e., the start and end
points) over the working memory M. First, a summarized question vector is formed as
$$ h^q = \sum_i \beta_i h^q_i, \tag{3.6} $$

where $\beta_i = \exp(w^\top h^q_i) / \sum_k \exp(w^\top h^q_k)$, and $w$ is a trainable vector. Then, a bilinear function is
used to obtain the probability distribution of the start index over the entire passage by

$$ p^{(\text{start})} = \mathrm{softmax}\big({h^q}^\top W^{(\text{start})} M\big), \tag{3.7} $$


where W(start) is a weight matrix. Another bilinear function is used to obtain the probability
distribution of the end index, incorporating the information of the span start obtained by Eqn. 3.7,
as

$$ p^{(\text{end})} = \mathrm{softmax}\Big(\big[h^q; \textstyle\sum_j p_j^{(\text{start})} m_j\big]^\top W^{(\text{end})} M\Big), \tag{3.8} $$

where the semicolon ; indicates the vector or matrix concatenation operator, $p_j^{(\text{start})}$ is the
probability of the j-th word in the passage being the start of the answer span, $W^{(\text{end})}$ is a weight
matrix, and $m_j$ is the j-th vector of M.
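
As a rough illustration of Eqns. 3.6 to 3.8, the following NumPy sketch computes the start and end distributions for one question-passage pair. The memory dimension m, the parameter shapes, and all names are assumptions made for the example; the parameters are random rather than trained.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def single_step_span(H_q, M, w, W_start, W_end):
    """Single-step span prediction.

    H_q: (2d, I) question vectors; M: (m, J) working memory, column j is m_j.
    w: (2d,) trainable vector; W_start: (2d, m); W_end: (2d + m, m).
    Returns p_start and p_end, each a distribution over the J passage words.
    """
    beta = softmax(w @ H_q)                  # weights beta_i of Eqn. 3.6
    h_q = H_q @ beta                         # summarized question vector
    p_start = softmax(h_q @ W_start @ M)     # Eqn. 3.7
    start_summary = M @ p_start              # sum_j p_j^(start) m_j
    end_query = np.concatenate([h_q, start_summary])
    p_end = softmax(end_query @ W_end @ M)   # Eqn. 3.8
    return p_start, p_end

# Random stand-ins (2d = 4, m = 6, I = 3, J = 5).
rng = np.random.default_rng(0)
d2, m, I, J = 4, 6, 3, 5
p_start, p_end = single_step_span(
    H_q=rng.normal(size=(d2, I)),
    M=rng.normal(size=(m, J)),
    w=rng.normal(size=d2),
    W_start=rng.normal(size=(d2, m)),
    W_end=rng.normal(size=(d2 + m, m)),
)
```

The end distribution is conditioned on the start prediction only through the expected memory vector under p_start, which keeps the whole computation differentiable.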
Single-step reasoning is simple yet efficient, and the model parameters can be trained using the
classical back-propagation algorithm; thus, it is adopted by most of the systems (Chen et al., 2016b;
Seo et al., 2016; Wang et al., 2017b; Liu et al., 2017; Chen et al., 2017a; Weissenborn et al., 2017;
Hu et al., 2017). However, since humans often solve question answering tasks by re-reading and
re-digesting the document multiple times before reaching the final answer (this may be based on the
complexity of the questions and documents, as illustrated by the examples in Fig. 3.9), it is natural
to devise an iterative way to find answers as multi-step reasoning.

Multi-Step Reasoning. Multi-step reasoning models were pioneered by Hill et al. (2015); Dhingra
et al. (2016); Sordoni et al. (2016); Kumar et al. (2016), who used a pre-determined fixed number
of reasoning steps. Shen et al. (2017b,c) showed that multi-step reasoning outperforms single-step
ones and dynamic multi-step reasoning further outperforms the fixed multi-step ones on two distinct
MRC datasets (SQuAD and MS MARCO). But the dynamic multi-step reasoning models have to be
trained using RL methods, e.g., policy gradient, which are tricky to implement due to the instability
9
This is a special version of SAN where the maximum number of reasoning steps T = 1. SAN in Fig. 3.8
(Left) uses T = 3.

issue. SAN combines the strengths of both types of multi-step reasoning models. As shown in
Fig. 3.8 (Left), SAN (Liu et al., 2018d) uses a fixed number of reasoning steps, and generates a
prediction at each step. During decoding, the answer is based on the average of predictions in
all steps. During training, however, SAN drops predictions via stochastic dropout, and generates
the final result based on the average of the remaining predictions. Albeit simple, this technique
significantly improves the robustness and overall accuracy of the model. Furthermore, SAN can be
trained using back-propagation which is simple and efficient.
Taking SAN as an example, the multi-step reasoning module computes over T memory steps and
outputs the answer span. It is based on an RNN, similar to IRN in Fig. 3.5. It maintains a state vector,
which is updated on each step. At the beginning, the initial state s0 is the summarized question vector
computed by Eqn. 3.6. At time step $t \in \{1, 2, \ldots, T\}$, the state is defined by $s_t = \mathrm{RNN}(s_{t-1}, x_t)$,
where $x_t$ contains retrieved information from memory using the previous state vector as a query via
the attention process over M: $x_t = \sum_j \gamma_j m_j$ and $\gamma = \mathrm{softmax}(s_{t-1}^\top W^{(\text{att})} M)$, where $W^{(\text{att})}$ is a
trainable weight matrix. Finally, a bilinear function is used to find the start and end points of answer
spans at each reasoning step t, similar to Eqn. 3.7 and 3.8:
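$$ p_t^{(\text{start})} = \mathrm{softmax}\big(s_t^\top W^{(\text{start})} M\big), \tag{3.9} $$

$$ p_t^{(\text{end})} = \mathrm{softmax}\Big(\big[s_t; \textstyle\sum_j p_{t,j}^{(\text{start})} m_j\big]^\top W^{(\text{end})} M\Big), \tag{3.10} $$

where $p_{t,j}^{(\text{start})}$ is the j-th value of the vector $p_t^{(\text{start})}$, indicating the probability of the j-th passage
word being the start of the answer span at reasoning step t.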
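
The multi-step procedure, together with the averaging and stochastic prediction dropout described above, might be sketched as follows. Several simplifications are assumptions of this example rather than details of SAN: the state update is a plain tanh RNN cell with hypothetical parameters U and V, only the start distribution of Eqn. 3.9 is computed (the end distribution of Eqn. 3.10 would follow the same pattern), and at least one step prediction is always kept when dropping.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_step_start(s0, M, W_att, W_start, U, V, T=3, drop_prob=0.0, rng=None):
    """Average the per-step start distributions over T reasoning steps.

    s0: (k,) initial state; M: (m, J) working memory.
    W_att, W_start: (k, m); U: (k, k) and V: (k, m) define a toy tanh RNN cell.
    With rng set and drop_prob > 0, whole step predictions are dropped at
    random before averaging (training mode); otherwise all steps are averaged.
    """
    s = s0
    preds = []
    for _ in range(T):
        gamma = softmax(s @ W_att @ M)           # attention with the previous state as query
        x = M @ gamma                            # x_t = sum_j gamma_j m_j
        s = np.tanh(U @ s + V @ x)               # s_t = RNN(s_{t-1}, x_t), simplified cell
        preds.append(softmax(s @ W_start @ M))   # Eqn. 3.9
    preds = np.stack(preds)                      # shape (T, J)
    if rng is not None and drop_prob > 0.0:
        keep = rng.random(T) >= drop_prob
        if not keep.any():                       # assumption: never drop every step
            keep[rng.integers(T)] = True
        preds = preds[keep]
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
k, m, J = 4, 6, 5
s0 = rng.normal(size=k)
M = rng.normal(size=(m, J))
params = dict(
    W_att=rng.normal(size=(k, m)),
    W_start=rng.normal(size=(k, m)),
    U=rng.normal(size=(k, k)),
    V=rng.normal(size=(k, m)),
)
p_infer = multi_step_start(s0, M, **params, T=3)
p_train = multi_step_start(s0, M, **params, T=3, drop_prob=0.4, rng=rng)
```

Because each step yields a proper distribution, the average over any non-empty subset of steps is again a distribution over passage positions.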

3.7.3 Training

A neural MRC model can be viewed as a deep neural network that includes all component modules
(e.g., the embedding layers and reasoning engines) which by themselves are also neural networks.
Thus, it can be optimized on training data in an end-to-end fashion via back-propagation and SGD,
as outlined in Fig. 1.4. For SQuAD models, we optimize model parameters θ by minimizing the loss
function defined as the sum of the negative log probabilities of the ground truth answer span start
and end points by the predicted distributions, averaged over all training samples:

$$ L(\theta) = -\frac{1}{|D|} \sum_{i=1}^{|D|} \Big( \log p^{(\text{start})}_{y_i^{(\text{start})}} + \log p^{(\text{end})}_{y_i^{(\text{end})}} \Big), \tag{3.11} $$

where D is the training set, $y_i^{(\text{start})}$ and $y_i^{(\text{end})}$ are the true start and end of the answer span of the
i-th training sample, respectively, and $p_k$ is the k-th value of the vector p.
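
A minimal NumPy sketch of the loss in Eqn. 3.11 is given below. It assumes, for simplicity, that all passages in the batch have the same length J so that the predicted distributions can be stacked into arrays; the function name and the toy numbers are illustrative only.

```python
import numpy as np

def span_nll(p_start, p_end, y_start, y_end):
    """Average negative log-likelihood of the true span boundaries.

    p_start, p_end: (N, J) predicted distributions, one row per training sample.
    y_start, y_end: (N,) integer indices of the true start and end words.
    """
    rows = np.arange(len(y_start))
    log_lik = np.log(p_start[rows, y_start]) + np.log(p_end[rows, y_end])
    return -log_lik.mean()

# Two toy samples over a three-word passage.
p_start = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
p_end = np.array([[0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])
loss = span_nll(p_start, p_end, y_start=np.array([0, 1]), y_end=np.array([1, 2]))
```

The loss shrinks toward zero as the model places more probability mass on the annotated start and end positions.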

3.8 Conversational Text-QA Agents


While all the neural MRC models described in Sec. 3.7 assume a single-turn QA setting, in real-
ity, humans often ask questions in a conversational context (Ren et al., 2018a). For example, a user
might ask the question “when was California founded?”, and then depending on the received answer,
follow up by “who is its governor?” and “what is the population?”, where both refer to “Califor-
nia” mentioned in the first question. This incremental aspect, although making human conversations
succinct, presents new challenges that most state-of-the-art single-turn MRC models do not address
directly, such as referring back to conversational history using coreference and pragmatic reason-
ing10 (Reddy et al., 2018).
10
Pragmatic reasoning is defined as “the process of finding the intended meaning(s) of the given, and it is
suggested that this amounts to the process of inferring the appropriate context(s) in which to interpret the given”
(Bell, 1999). The analysis by Jia and Liang (2017); Chen et al. (2016a) revealed that state of the art neural MRC
models, e.g., developed on SQuAD, mostly excel at matching questions to local context via lexical matching
and paraphrasing, but struggle with questions that require reasoning.

Figure 3.10: The examples from two conversational QA datasets. (Left) A QA dialogue example
in the QuAC dataset. The student, who does not see the passage (section text), asks questions. The
teacher provides answers in the form of text spans and dialogue acts. These acts include (1) whether
the student should ↪, could ↪̄, or should not ↪̸ ask a follow-up; (2) affirmation (Yes / No), and,
when appropriate, (3) No answer. Figure credit: Choi et al. (2018). (Right) A QA dialogue example
in the CoQA dataset. Each dialogue turn contains a question (Qi), an answer (Ai) and a rationale
(Ri) that supports the answer. Figure credit: Reddy et al. (2018).

A conversational text-QA agent uses a similar architecture to Fig. 3.5, except that the Soft-KB
Lookup module is replaced by a text-QA module which consists of a search engine (e.g., Google
or Bing) that retrieves relevant passages for a given question, and an MRC model that generates the
answer from the retrieved passages. The MRC model needs to be extended to address the aforementioned
challenges in the conversation setting, and is henceforth referred to as a conversational MRC
model.
Recently, several datasets have been developed for building conversational MRC models. Among
them are CoQA (Conversational Question Answering (Reddy et al., 2018)) and QuAC (Question
Answering in Context (Choi et al., 2018)), as shown in Fig. 3.10. The task of conversational MRC is
defined as follows. Given a passage P , the conversation history in the form of question-answer pairs
{Q1 , A1 , Q2 , A2 , ..., Qi−1 , Ai−1 } and a question Qi , the MRC model needs to predict the answer
Ai .
A conversational MRC model extends the models described in Sec. 3.7 in two aspects. First, the
encoding module is extended to encode not only P and Qi but also the conversation history. Second,
the reasoning module is extended to be able to generate an answer (via pragmatic reasoning) that
might not overlap P . For example, Reddy et al. (2018) proposed a reasoning module that combines
the text-span MRC model of DrQA (Chen et al., 2017a) and the generative model of PGNet (See
et al., 2017). To generate a free-form answer, DrQA first points to the answer evidence in text (e.g.,
R5 in Fig. 3.10 (Right)), and PGNet generates the answer (e.g., A5) based on the evidence.

Chapter 4

Task-oriented Dialogue Systems

This chapter focuses on task-oriented dialogue systems that assist users in solving a task. Differ-
ent from applications where the user seeks an answer or certain information (previous chapter),
dialogues covered here are often for completing a task, such as making a hotel reservation or book-
ing movie tickets. Furthermore, compared to chatbots (next chapter), these dialogues often have a
specific goal to achieve, and are typically domain dependent.
While task-oriented dialogue systems have been studied for decades, they have been quickly gaining
increasing interest in recent years, both in the research community and in industry. This chapter
focuses on the foundation and algorithmic aspects, while industrial applications are discussed in
Chapter 6. Furthermore, we restrict ourselves to dialogues where user input is in the form of raw
text, not spoken language, but many of the techniques and discussions in this chapter can be adapted
to spoken dialogue systems.
The chapter is organized as follows. It starts with an overview of basic concepts, terminology, and
a typical architecture for task-oriented dialogue systems. Second, it reviews several representative
approaches to dialogue system evaluation. This part is critical in the development cycle of dialogue
systems, but is largely orthogonal to the concrete techniques used to build them. The next three
sections focus on each of three main components in a typical dialogue system, with an emphasis on
recent, neural approaches. Finally, we review several recent works on end-to-end dialogue systems,
which are enabled by recent advances at the intersection of deep learning and reinforcement learning,
with further discussions and pointers provided in the last section.

4.1 Overview
We start with a brief overview of task-oriented dialogue systems, focusing on aspects that facilitate
later discussions. For more information and historical developments, readers are referred to the
textbook of Jurafsky and Martin (2018, Chapters 25 & 26).

Slot-filling Dialogues

Throughout the chapter, we focus on a relatively simple yet important class of dialogues that involve
filling in values for a predefined set of slots before a task can be successfully completed. These
dialogues, known as slot-filling or form-filling, find a number of uses in practice. Table 4.1 gives an
example conversation between a user and a dialogue system. To successfully complete a transaction
to book tickets, the system must collect necessary information by conversing with the user.

Table 4.1: An example movie-booking dialogue, adapted from Li et al. (2016d).

System   Hello! How may I assist you?
User     Can I get tickets for Zoolander 2 tomorrow?
System   Sure! Which city?
User     I want to watch it in Seattle.
System   How many tickets do you need?
User     Two, please. And I’d like to watch at Regal Meridian 16.
System   9:25 pm is available at Regal Meridian 16. Does it work for you?
User     Well, yes.
System   Okay, I was able to book 2 tickets for Zoolander 2 tomorrow at Regal Meridian 16 theater
         in Seattle at 9:25 pm.
User     Thank you.
System   Thank you. Good bye!

Slot-filling dialogue may be applied to many other domains, such as movie (as shown in the example
above), restaurant, air ticket booking, etc. For each domain, a set of slots is defined by domain
experts and is application specific. For example, in the movie domain, slots like movie_name,
theater_name, time, date, ticket_price, num_tickets, etc. are necessary.
Finally, a slot is called informable if the value for this slot can be used to constrain the conversation,
such as phone_number; a slot is called requestable if the speaker can ask for its value, such
as ticket_price. Note that a slot can be both informable and requestable, an example being
movie_name.

Dialogue Acts

The interaction between a dialogue agent and a user, as shown in the previous example, mirrors
the interaction between an RL agent and the environment (Fig. 2.5), where a user utterance is the
observation, and the system utterance is the action selected by the dialogue agent. The dialogue acts
theory gives a formal foundation for this intuition (Core and Allen, 1997; Traum, 1999).
In this framework, the utterances of a user or agent are considered actions that can change the
(mental) state of both the user and the system, thus the state of the conversation. These actions can
be used to suggest, inform, request certain information, among others. A simple example dialogue
act is greet, which corresponds to natural language sentences like “Hello! How may I assist you?”.
It allows the system to greet the user and start a conversation. Some dialogue acts may have slots or
slot-value pairs as arguments. For example, the following question in the movie-booking example
above:
    “How many tickets do you need?”
is to request information about a certain slot:
    request(num_tickets),
while the following sentence
    “I want to watch it in Seattle.”
is to inform the city name:
    inform(city="seattle").
In general, dialogue acts are domain specific. Therefore, the set of dialogue acts in a movie domain,
for instance, will be different from that in the restaurant domain (Schatzmann and Young, 2009).
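
To make the notation concrete, a dialogue act can be represented as a small immutable record holding an intent plus its slot arguments. The class and field names below are hypothetical and chosen only for this sketch; they do not come from any particular dialogue toolkit.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DialogueAct:
    """An intent with optional requested slots and informed slot-value pairs."""
    intent: str
    request_slots: Tuple[str, ...] = ()
    inform_slots: Tuple[Tuple[str, str], ...] = ()

    def __str__(self) -> str:
        # Render in the request(slot) / inform(slot="value") style used in the text.
        args = list(self.request_slots)
        args += [f'{slot}="{value}"' for slot, value in self.inform_slots]
        return f"{self.intent}({', '.join(args)})"

greet = DialogueAct("greet")
ask_count = DialogueAct("request", request_slots=("num_tickets",))
tell_city = DialogueAct("inform", inform_slots=(("city", "seattle"),))
```

Here str(ask_count) renders as request(num_tickets) and str(tell_city) as inform(city="seattle"), matching the two examples above.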

Dialogue as Optimal Decision Making

Equipped with dialogue acts, we are ready to model multi-turn conversations between a dialogue
agent and a user as an RL problem. Here, the dialogue system is the RL agent, and the user is the
environment. At every turn of the dialogue,

• the agent keeps track of the dialogue state, based on information revealed so far in the
conversation, and then takes an action; the action may be a response to the user in the form
of dialogue acts, or an internal operation such as a database lookup or an API call;
• the user responds with the next utterance, which will be used by the agent to update its
internal dialogue state in the next turn;
• an immediate reward is computed to measure the quality and/or cost for this turn of con-
versation.

This process is precisely the agent-environment interaction discussed in Sec. 2.3. We now discuss
how a reward function is determined.

Figure 4.1: An architecture for multi-turn task-oriented dialogues. It consists of the following
modules: NLU (Natural Language Understanding), DM (Dialogue Manager), and NLG (Natural
Language Generation). DM contains two sub-modules, DST (Dialogue State Tracker) and POL
(Dialogue Policy). The dialogue system, indicated by the dashed rectangle, may have access to an
external database (DB).

An appropriate reward function should capture desired features of a dialogue system. In task-
oriented dialogues, we would like the system to succeed in helping the user in as few turns as
possible. Therefore, it is natural to give a high reward (say +20) at the end of the conversation if the
task is successfully solved, or a low reward (say −20) otherwise. Furthermore, we may give a small
penalty (say, −1 reward) to every intermediate turn of the conversation, so that the agent is encour-
aged to make the dialogue as short as possible. The above is of course just a simplistic illustration
of how to set a reward function for task-oriented dialogues, but in practice more sophisticated re-
ward functions may be used, such as those that measure diversity and coherence of the conversation.
Further discussion of the reward function can be found in Sections 4.2.1, 4.4.6 and 5.4.
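
The simple reward scheme described above can be written down directly. In this sketch the values +20, −20 and −1 are the illustrative numbers from the text, and it is assumed that the final turn receives only the success or failure reward (no additional per-turn penalty); the function names are hypothetical.

```python
def turn_reward(dialogue_over: bool, task_success: bool,
                success_reward: float = 20.0,
                failure_reward: float = -20.0,
                turn_penalty: float = -1.0) -> float:
    """Reward for one turn: terminal bonus or penalty, else a small per-turn cost."""
    if dialogue_over:
        return success_reward if task_success else failure_reward
    return turn_penalty

def episode_return(num_turns: int, task_success: bool) -> float:
    """Undiscounted sum of rewards over a dialogue with num_turns turns."""
    return sum(
        turn_reward(dialogue_over=(t == num_turns - 1), task_success=task_success)
        for t in range(num_turns)
    )

short_success = episode_return(num_turns=6, task_success=True)    # 5 * (-1) + 20
long_success = episode_return(num_turns=12, task_success=True)    # 11 * (-1) + 20
short_failure = episode_return(num_turns=6, task_success=False)   # 5 * (-1) - 20
```

Under this scheme a successful six-turn dialogue scores higher than a successful twelve-turn one, which is the intended pressure toward shorter conversations.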
To build a system, the pipeline architecture depicted in Fig. 4.1 is often used in practice. It consists
of the following modules.

• Natural Language Understanding (NLU): This module takes the user’s raw utterance as
input and converts it to the semantic form of dialogue acts.
• Dialogue Manager (DM): This module is the central controller of the dialogue system. It
often has a Dialogue State Tracking (DST) sub-module that is responsible for keeping track
of the current dialogue state. The other sub-module, the policy, relies on the internal state
provided by DST to select an action. Note that an action can be a response to the user, or
some operation on backend databases (e.g., looking up certain information).
• Natural Language Generation (NLG): If the policy chooses to respond to the user, this
module will convert this action, often a dialogue act, into a natural language form.
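
The data flow through the modules listed above can be illustrated with a toy, rule-based pipeline. Everything in this sketch is a stand-in chosen for the example: the keyword-matching NLU, the list of required slots, the hand-written policy and the response templates are hypothetical, whereas real systems typically use learned models for each module.

```python
from typing import Dict, Tuple

Act = Tuple[str, Dict[str, str]]  # (intent, slot-value arguments)

REQUIRED_SLOTS = ("movie_name", "city", "num_tickets")  # assumed for this toy domain

def nlu(utterance: str) -> Act:
    """Toy NLU: keyword matching stands in for a learned understanding model."""
    if "seattle" in utterance.lower():
        return ("inform", {"city": "seattle"})
    return ("unknown", {})

def dst(state: Dict[str, str], user_act: Act) -> Dict[str, str]:
    """State tracker: merge newly informed slot values into a copy of the state."""
    intent, slots = user_act
    new_state = dict(state)
    if intent == "inform":
        new_state.update(slots)
    return new_state

def policy(state: Dict[str, str]) -> Act:
    """Policy: request the first missing slot, or book once all slots are filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("request", {slot: "UNK"})
    return ("book", dict(state))

def nlg(system_act: Act) -> str:
    """Template-based generation from a system dialogue act."""
    intent, slots = system_act
    if intent == "request":
        return f"Please tell me the {next(iter(slots))}."
    return "Booking your tickets now."

def respond(state: Dict[str, str], utterance: str) -> Tuple[Dict[str, str], str]:
    """One turn: NLU -> DST -> policy -> NLG."""
    state = dst(state, nlu(utterance))
    return state, nlg(policy(state))

state, reply = respond({"movie_name": "zoolander 2"}, "I want to watch it in Seattle.")
```

After this turn the tracked state contains the city, and the policy moves on to request the remaining num_tickets slot.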

Dialogue Manager

There is a huge literature on building (spoken) dialogue managers. A comprehensive survey is out
of the scope of this chapter. Interested readers are referred to some of the earlier examples (Cole,
1999; Larsson and Traum, 2000; Rich et al., 2001; Allen et al., 2001; Bos et al., 2003; Bohus and
Rudnicky, 2009), as well as excellent surveys like McTear (2002), Paek and Pieraccini (2008), and
Young et al. (2013) for more information. Here, we review a small subset of traditional approaches
from the decision-theoretic view we take in this paper.
Levin et al. (2000) viewed conversation as a decision making problem. Walker (2000) and Singh
et al. (2002) are two early applications of reinforcement learning to manage dialogue systems. While
promising, these approaches assumed that the dialogue state can only take finitely many possible
values, and is fully observable (that is, the DST is perfect). Both assumptions are often violated in
real-world applications, given ambiguity in user utterance and unavoidable errors in NLU.
To handle uncertainty inherent in dialogue systems, Roy et al. (2000) and Williams and Young
(2007) proposed to use Partially Observable Markov Decision Process (POMDP) as a principled
mathematical framework for modeling and optimizing dialogue systems. The idea is to take user
utterances as observations to maintain a posterior distribution of the unobserved dialogue state; the

distribution is sometimes referred to as the “belief state.” Since exact optimization in POMDPs
is computationally intractable, authors have studied approximation techniques (Roy et al., 2000;
Williams and Young, 2007; Young et al., 2010; Li et al., 2009; Gašić and Young, 2014) and alterna-
tive representations such as the information states framework (Larsson and Traum, 2000; Daubigney
et al., 2012). Still, compared to the neural approaches covered in later sections, these methods often
require more domain knowledge to engineer features and design states.
Another important limitation of traditional approaches is that each module in Fig. 4.1 is often op-
timized separately. Consequently, when the system does not perform well, it can be challenging to
solve the “credit assignment” problem, namely, to identify which component in the system causes
undesired system response and needs to be improved. Indeed, as argued by McTear (2002), “[t]he
key to a successful dialogue system is the integration of these components into a working system.”
The recent marriage of differentiable neural models and reinforcement learning allows a dialogue
system to be optimized in an end-to-end fashion, potentially leading to higher conversation quality;
see Sec. 4.6 for further discussions and recent works on this topic.

4.2 Evaluation and User Simulation


Evaluation has been an important research topic for dialogue systems. Different approaches have
been used, including corpus-based approaches, user simulation, lab user study, actual user study,
etc. We will discuss pros and cons of these various methods, and in practice trade-offs are made to
find the best option or a combination of them.

4.2.1 Evaluation Metrics

While individual components in a dialogue system can often be optimized against more well-defined
metrics such as accuracy, precision, recall, F1 and BLEU scores, evaluating a whole dialogue sys-
tem requires a more holistic view and is more challenging (Walker et al., 1997, 1998, 2000; Paek,
2001; Hartikainen et al., 2004). In the reinforcement-learning framework, it implies that the reward
function has to take multiple aspects of dialogue quality into consideration. In practice, the reward
function is often a weighted linear combination of a subset of the following metrics.
The first class of metrics measures task completion success. The most common choice is perhaps
task success rate—the fraction of dialogues that successfully solve the user’s problem (buying the
right movie tickets, finding proper restaurants, etc.). Effectively, the reward corresponding to this
metric is 0 for every turn, except for the last turn where it is +1 for a successful dialogue and −1
otherwise. Many examples are found in the literature (Walker et al., 1997; Williams, 2006; Peng
et al., 2017). Other variants have also been used, such as those to measure partial success (Singh
et al., 2002; Young et al., 2016).
The second class measures cost incurred in a dialogue, such as time elapsed. A simple yet useful
example is the number of turns, which reflects the intuition that a more succinct dialogue is preferred
with everything else being equal. The reward is simply −1 per turn, although more complicated
choices exist (Walker et al., 1997).
In addition, other aspects of dialogue quality may also be encoded into the reward function, although
this is a relatively under-investigated direction. In the context of chatbots (Chapter 5), coherence,
diversity and personal styles have been used to produce more human-like dialogues (Li et al., 2016a,b).
They can be useful for task-oriented dialogues as well. In Sec. 4.4.6, we will review a few recent
works that aim to learn reward functions automatically from data.

4.2.2 Simulation-Based Evaluation

Typically, an RL algorithm needs to interact with a user to learn (Sec. 2.3). But running RL on either
recruited users or actual users can be expensive. A natural way to get around this challenge is to
build a simulated user, with which an RL algorithm can interact at virtually no cost. Essentially, a
simulated user tries to mimic what a real user does in a conversation: it keeps track of the dialogue
state, and converses with an RL dialogue system.

{
  request_slots:
  {
    ticket: UNK
    theater: UNK
    start_time: UNK
  },
  inform_slots:
  {
    number_of_people: 3
    date: tomorrow
    movie_name: batman vs. superman
  }
}

Figure 4.2: An example user goal in the movie-ticket-booking domain

Substantial research has gone into building realistic user simulators (Schatzmann et al., 2005a;
Georgila et al., 2006; Pietquin and Dutoit, 2006; Pietquin and Hastie, 2013). There are many different
dimensions along which to categorize a user simulator, such as deterministic vs. stochastic, content-based vs.
collaboration-based, static vs. non-static user goals during the conversations, among others. Here,
we highlight two dimensions, and refer interested readers to Schatzmann et al. (2006) for further details
on creating and evaluating user simulators:

• Along the granularity dimension, the user simulator can operate either at the dialogue-act
level (also known as intention level), or at the utterance level (Jung et al., 2009).
• Along the methodology dimension, the user simulator can be implemented using a rule-
based approach, or a model-based approach with the model learned from a real conversa-
tional corpus.

Agenda-Based Simulation. As an example, we describe a popular hidden agenda-based user simulator
developed by Schatzmann and Young (2009), as instantiated in Li et al. (2016d) and Ultes
et al. (2017c). Each dialogue simulation starts with a randomly generated user goal that is unknown
to the dialogue manager. In general the user goal consists of two parts: the inform-slots contain
a number of slot-value pairs that serve as constraints the user wants to impose on the dialogue; the
request-slots are slots whose values are initially unknown to the user and will be filled out during
the conversation. Fig. 4.2 shows an example user goal in a movie domain, in which the user is trying
to buy 3 tickets for tomorrow for the movie batman vs. superman.
Furthermore, to make the user goal more realistic, domain-specific constraints are added, so that
certain slots are required to appear in the user goal. For instance, it makes sense to require a user to
know the number of tickets she wants in the movie domain.
During the course of a dialogue, the simulated user maintains a stack data structure known as user
agenda. Each entry in the agenda corresponds to a pending intention the user aims to achieve, and
their priorities are implicitly determined by the first-in-last-out operations of the agenda stack. In
other words, the agenda provides a convenient way of encoding the history of conversation and the
“state-of-mind” of the user. Simulation of a user boils down to how to maintain the agenda after
each turn of the dialogue, when more information is revealed. Machine learning or expert-defined
rules can be used to set parameters in the stack-update process.
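
The agenda mechanism described above can be illustrated with a minimal sketch. It is not the simulator of Schatzmann and Young (2009) or Li et al. (2016d); the class `AgendaUser`, its two hand-written update rules, and the tuple format `(act, slot, value)` are assumptions made only for illustration. The user goal mirrors Fig. 4.2.

```python
# Hypothetical user goal, mirroring Fig. 4.2.
user_goal = {
    "request_slots": {"ticket": "UNK", "theater": "UNK", "start_time": "UNK"},
    "inform_slots": {
        "number_of_people": "3",
        "date": "tomorrow",
        "movie_name": "batman vs. superman",
    },
}


class AgendaUser:
    """Toy agenda-based simulated user operating at the dialogue-act level."""

    def __init__(self, goal):
        self.goal = goal
        self.filled = {}  # values the system has provided so far
        # The agenda is a stack: the last item pushed is handled first.
        self.agenda = [("bye", None, None)]
        for slot in goal["request_slots"]:
            self.agenda.append(("request", slot, None))
        for slot, value in goal["inform_slots"].items():
            self.agenda.append(("inform", slot, value))

    def update(self, system_act):
        """Revise the agenda after observing a system act (act, slot, value)."""
        act, slot, value = system_act
        if act == "request" and slot in self.goal["inform_slots"]:
            # Answer the question next: move that inform act to the top.
            item = ("inform", slot, self.goal["inform_slots"][slot])
            if item in self.agenda:
                self.agenda.remove(item)
            self.agenda.append(item)
        elif act == "inform" and slot in self.goal["request_slots"]:
            # The system supplied a requested value: drop the pending request.
            self.filled[slot] = value
            pending = ("request", slot, None)
            if pending in self.agenda:
                self.agenda.remove(pending)

    def next_act(self):
        """Pop the highest-priority pending intention."""
        return self.agenda.pop() if self.agenda else ("bye", None, None)


user = AgendaUser(user_goal)
first = user.next_act()                      # top of the stack
user.update(("request", "date", None))       # system asks for the date
second = user.next_act()                     # the answer was moved to the top
user.update(("inform", "theater", "amc"))    # system fills a requested slot
```

In a learned variant, the hand-written rules inside `update` would be replaced or parameterized by a model, which is where machine learning or expert-defined rules enter the stack-update process.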

Model-based Simulation. Another approach to building user simulators is entirely based on
data (Eckert et al., 1997; Levin et al., 2000; Chandramohan et al., 2011). Here, we describe a
recent example due to El Asri et al. (2016). Similar to the agenda-based approach, the simulator
also starts an episode with a randomly generated user goal and constraints. These are fixed during a
conversation.
In each turn, the user model takes as input a sequence of contexts collected so far in the conversation,
and outputs the next action. Specifically, the context at a turn of conversation consists of:

• the most recent machine action,
• inconsistency between machine information and user goal,
• constraint status, and
• request status.

With these contexts, an LSTM or another sequence-to-sequence model is used to output the next
user utterance. The model can be learned from human-human dialogue corpora. In practice, combining
rule-based and model-based techniques often works well for creating user simulators.

Further Remarks on User Simulation. While there has been much work on user simulation,
building a human-like simulator remains challenging. In fact, even user simulator evaluation itself
continues to be an ongoing research topic (Williams, 2008; Ai and Litman, 2008; Pietquin and
Hastie, 2013). In practice, it is often observed that dialogue policies that are overfitted to a particular
user simulator may not work well when serving another user simulator or real humans (Schatzmann
et al., 2005b; Dhingra et al., 2017). The gap between a user simulator and humans is the major
limitation of user simulation-based dialogue policy optimization.
Some user simulators are publicly available for research purposes. In addition to the aforementioned
agenda-based simulators by Li et al. (2016d) and Ultes et al. (2017c), a large corpus with an
evaluation environment, called AirDialogue (in the flight booking domain), was recently made
available (Wei et al., 2018). At the IEEE workshop on Spoken Language Technology in 2018, Microsoft
organized a dialogue challenge¹ of building end-to-end task-oriented dialogue systems by providing an
experiment platform with built-in user simulators in several domains (Li et al., 2018).

4.2.3 Human-based Evaluation

Due to the discrepancy between simulated users and human users, it is often necessary to test a
dialogue system on human users to reliably evaluate its quality. There are roughly two types of
human users.
The first is human subjects recruited in a lab study, possibly through crowd-sourcing platforms.
Typically, the participants are asked to test-use a dialogue system to solve a given task (depending on
the domain of the dialogues), so that a collection of dialogues is obtained. Metrics of interest such
as task-completion rate and average turns per dialogue can be measured, as done with a simulated
user. In other cases, a fraction of these subjects are asked to test-use a baseline dialogue system, so
that the two can be compared against various metrics.
Many published studies involving human subjects are of the first type (Walker, 2000; Singh et al.,
2002; Ai et al., 2007; Rieser and Lemon, 2011; Gašić et al., 2013; Wen et al., 2015; Young et al.,
2016; Peng et al., 2017; Lipton et al., 2018). While this approach has benefits over simulation-based
evaluation, it is rather expensive and time-consuming to get a large number of subjects that can
participate for a long time. Consequently, it has the following limitations:

• The small number of subjects prevents detection of statistically significant yet numerically
small differences in metrics, often leading to inconclusive results.
• Only a very small number of dialogue systems may be compared.
• It is often impractical to run an RL agent that learns by interacting with these users, except
in relatively simple dialogue applications.
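
The first limitation can be made concrete with a small calculation. The sketch below applies a standard pooled two-proportion z-test to task-completion rates; the subject counts and success counts are hypothetical numbers chosen for illustration, not results from any cited study.

```python
import math


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical: 80% vs. 75% task completion, same gap at two study sizes.
p_small = two_proportion_p_value(80, 100, 75, 100)        # 100 subjects per system
p_large = two_proportion_p_value(1600, 2000, 1500, 2000)  # 2000 subjects per system
```

With 100 subjects per system, the five-point gap is not significant at the 0.05 level, whereas the same gap with 2000 subjects per system is. Lab studies rarely reach the larger scale, which is why small improvements often remain inconclusive.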

The other type of humans for dialogue system evaluation is actual users (e.g., Black et al. (2011)).
They are similar to the first type of users, except that they come with their actual tasks to be solved
by conversing with the system. Consequently, metrics evaluated on them are even more reliable
than those computed on recruited human subjects with artificially generated tasks. Furthermore,
the number of actual users can be much larger, thus resulting in greater flexibility in evaluation. In
this process, many online and offline evaluation techniques such as A/B-testing and counterfactual
estimation can be used (Hofmann et al., 2016). The major downside of experimenting with actual
users is the risk of negative user experience and disruption of normal services.

¹ https://github.com/xiul-msr/e2e_dialog_challenge

4.2.4 Other Evaluation Techniques

Recently, researchers have started to investigate a different approach to evaluation that is inspired by
the self-play technique in RL (Tesauro, 1995; Mnih et al., 2015). This technique is typically used in a
two-player game (such as the game of Go), where both players are controlled by the same RL agent,
possibly initialized differently. By playing the agent against itself, a large number of trajectories can
be generated at relatively low cost, from which the RL agent can learn a good policy.
Self-play must be adapted to be used for dialogue management, as the two parties involved in a
conversation often play asymmetric roles (unlike in games such as Go). Shah et al. (2018) described
such a dialogue self-play procedure, which can generate conversations between a simulated user and
the system agent. Promising results have been observed in negotiation dialogues (Lewis et al., 2017)
and task-oriented dialogues (Liu and Lane, 2017; Shah et al., 2018; Wei et al., 2018). Self-play
thus offers a way to avoid both the cost of evaluating with human users and the risk of overfitting
to unrealistic simulated users.
In practice, it is reasonable to have a hybrid approach to evaluation. One possibility is to start
with simulated users, then validate or fine-tune the dialogue policy on human users (cf. Shah et al.,
2018). Furthermore, there are more systematic approaches to using both sources of users for policy
learning (see Sec. 4.4.5).

4.3 Natural Language Understanding and Dialogue State Tracking


NLU and DST are two closely related components essential to a dialogue system. They can have
a significant impact on the overall system’s performance (see, e.g., Li et al. (2017e)). This section
reviews some of the classic and state-of-the-art approaches.

4.3.1 Natural Language Understanding

The NLU module takes a user utterance as input, and performs three tasks: domain detection, intent
determination, and slot tagging. An example output for the three tasks is given in Fig. 4.3. Typically,
a pipeline approach is taken, so that the three tasks are solved one after another. Accuracy and F1
score are two of the most common metrics used to evaluate a model’s prediction quality. NLU is a
pre-processing step for later modules in the dialogue system, whose quality has a significant impact
on the system’s overall quality (Li et al., 2017d).
Among them, the first two tasks are often framed as a classification problem, which infers the do-
main or intent (from a predefined set of candidates) based on the current user utterance (Schapire
and Singer, 2000; Yaman et al., 2008; Sarikaya et al., 2014). Neural approaches to multi-class clas-
sification have been used in the recent literature and outperformed traditional statistical methods.
Ravuri and Stolcke (2015; 2016) studied the use of standard recurrent neural networks, and found
them to be more effective. For short sentences where information has to be inferred from the con-
text, Lee and Dernoncourt (2016) proposed to use recurrent and convolutional neural networks that
also consider texts prior to the current utterance. Better results were shown on several benchmarks.
The more challenging task of slot tagging is often treated as sequence classification, where the
classifier predicts semantic class labels for subsequences of the input utterance (Wang et al., 2005;
Mesnil et al., 2013). Fig. 4.3 shows an ATIS (Airline Travel Information System) utterance example
in the Inside-Outside-Beginning (IOB) format (Ramshaw and Marcus, 1995), where for each word
the model predicts a semantic tag.
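
To make the IOB convention concrete, the following sketch converts a tagged utterance into slot-value pairs. The utterance and tag names (`dest`, `date`) are hypothetical and are not taken from Fig. 4.3 or the ATIS annotation scheme.

```python
def iob_to_slots(words, tags):
    """Collect (slot_name, value) pairs from IOB-tagged words."""
    slots, current_name, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that is still open.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # The word continues the currently open span.
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


words = "find flights to new york tomorrow".split()
tags = ["O", "O", "O", "B-dest", "I-dest", "B-date"]
slots = iob_to_slots(words, tags)
```

Here `slots` contains one pair for the two-word destination and one for the date, which shows how a per-word tag sequence encodes multi-word slot values.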
Yao et al. (2013) and Mesnil et al. (2015) applied recurrent neural networks to slot tagging, where
inputs are one-hot encodings of the words in the utterance, and obtained higher accuracy than
statistical baselines such as conditional random fields and support vector machines. Moreover, it
was also shown that a priori word information can be effectively incorporated into basic recurrent
models to yield further accuracy gains.
As an example, this section describes the use of bidirectional LSTM (Graves and Schmidhuber,
2005), or bLSTM for short, in NLU tasks, following Hakkani-Tür et al. (2016) who also discussed
other models for the same tasks. The model, as shown in Fig. 4.4, uses two sets of LSTM cells
applied to the input sequence (the forward) and the reversed input sequence (the backward). The
concatenated hidden layers of the forward and backward LSTMs are used as input to another neural
network to compute the output sequence. Mathematically, upon the $t$-th input token, $w_t$, operations
of the forward part of bLSTM are defined by the following set of equations:

$$
\begin{aligned}
i_t &= g(W_{wi} w_t + W_{hi} h_{t-1}) \\
f_t &= g(W_{wf} w_t + W_{hf} h_{t-1}) \\
o_t &= g(W_{wo} w_t + W_{ho} h_{t-1}) \\
\hat{c}_t &= \tanh(W_{wc} w_t + W_{hc} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t \\
h_t &= o_t \odot \tanh(c_t) ,
\end{aligned}
$$

where $h_{t-1}$ is the hidden layer, $W_{\star}$ the trainable parameters, $g(\cdot)$ the sigmoid function,
and $\odot$ the element-wise product. As in standard LSTMs, $i_t$, $f_t$ and $o_t$ are the input, forget,
and output gates, respectively. The backward part is similar, with the input reversed.

Figure 4.3: An example output of NLU, where the utterance (W) is used to predict domain (D),
intent (I), and the slot tagging (S). The IOB representation is used. Figure credit: Hakkani-Tür et al.
(2016).

Figure 4.4: A bLSTM model for joint optimization in NLU. Picture credit: Hakkani-Tür et al.
(2016).

To predict the slot tags as shown in Fig. 4.3, the input $w_t$ is often a one-hot vector or a word
embedding vector. The output upon input $w_t$ is predicted according to the following distribution $p_t$:

$$
p_t = \mathrm{softmax}\left(W_{hy}^{(f)} h_t^{(f)} + W_{hy}^{(b)} h_t^{(b)}\right) ,
$$

where the superscripts, $(f)$ and $(b)$, denote forward and backward parts of the bLSTM,
respectively. For tasks like domain and intent classification, the output is predicted at the end of
the input sequence, and simpler architectures may be used (Ravuri and Stolcke, 2015, 2016).
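
The forward recurrence and the softmax output layer can be sketched in a few lines of NumPy. This is an untrained, bias-free toy that follows the LSTM and output equations of this section; the helper names, dimensions and random initialization are assumptions for illustration, and it is not the implementation of Hakkani-Tür et al. (2016).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def init_lstm(rng, input_dim, hidden_dim):
    """Random (W_w, W_h) pairs for gates i, f, o and the candidate cell c."""
    return {
        gate: (rng.normal(0, 0.1, (hidden_dim, input_dim)),
               rng.normal(0, 0.1, (hidden_dim, hidden_dim)))
        for gate in "ifoc"
    }


def lstm_states(params, inputs):
    """Run the bias-free LSTM recurrence and return h_t for every t."""
    hidden_dim = params["i"][1].shape[0]
    h, c, states = np.zeros(hidden_dim), np.zeros(hidden_dim), []
    for w_t in inputs:
        pre = {k: W_w @ w_t + W_h @ h for k, (W_w, W_h) in params.items()}
        i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        c = f_t * c + i_t * np.tanh(pre["c"])   # element-wise products
        h = o_t * np.tanh(c)
        states.append(h)
    return states


def blstm_tag_distributions(fwd, bwd, W_f, W_b, inputs):
    """Per-token tag distributions p_t from forward and backward states."""
    h_f = lstm_states(fwd, inputs)
    h_b = lstm_states(bwd, inputs[::-1])[::-1]  # re-align to the original order
    return [softmax(W_f @ hf + W_b @ hb) for hf, hb in zip(h_f, h_b)]


rng = np.random.default_rng(0)
T, input_dim, hidden_dim, n_tags = 6, 8, 5, 4   # hypothetical sizes
inputs = rng.normal(size=(T, input_dim))         # stand-in word embeddings
fwd = init_lstm(rng, input_dim, hidden_dim)
bwd = init_lstm(rng, input_dim, hidden_dim)
W_f = rng.normal(size=(n_tags, hidden_dim))
W_b = rng.normal(size=(n_tags, hidden_dim))
p = blstm_tag_distributions(fwd, bwd, W_f, W_b, inputs)
```

Summing the two projected hidden states is equivalent to applying a single output matrix to their concatenation, which is why the text can describe the output layer either way.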
In many situations, the present utterance alone can be ambiguous or lack all necessary information.
Contexts that include information from previous utterances are expected to help improve model
accuracy. Hori et al. (2015) treated conversation history as a long sequence of words, with
alternating roles (words from user vs. words from system), and proposed a variant of LSTM with
role-dependent layers. Chen et al. (2016b) built on memory networks that learn which part of contextual
information should be attended to, when making slot-tagging predictions. Both models achieved
higher accuracy than context-free models.
Although the three NLU tasks are often studied separately, there are benefits to jointly solving them
(similar to multi-task learning), and over multiple domains, so that it may require fewer labeled
data when creating NLU models for a new domain (Hakkani-Tür et al., 2016; Liu and Lane, 2016).
Another line of interesting work that can lead to substantial reduction of labeling cost in new
domains is zero-shot learning, where slots from different domains are represented in a shared latent
semantic space through embedding of the slots’ (text) descriptions (Bapna et al., 2017; Lee and Jha,
2019). Interested readers are referred to recent tutorials, such as Chen and Gao (2017) and Chen
et al. (2017e), for more details.

Figure 4.5: Neural Belief Tracker. Figure credit: Mrkšić et al. (2017).

4.3.2 Dialogue State Tracking

In slot-filling problems, a dialogue state contains all information about what the user is looking for at
the current turn of the conversation. This state is what the dialogue policy takes as input for deciding
what action to take next (Fig. 4.1).
For example, in the restaurant domain, where a user tries to make a reservation, the dialogue state
may consist of the following components (Henderson, 2015):
• The goal constraint for every informable slot, in the form of a value assignment to that slot.
The value can be “don’t care” (if the user has no preference) or “none” (if the user has
not yet specified the value).
• The subset of requested slots that the user has asked the system to inform.
• The current dialogue search method, taking values by constraint, by alternative
and finished. It encodes how the user is trying to interact with the dialogue system.
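
The three components listed above map naturally onto a small data structure. The sketch below is one possible representation, not the format used by Henderson (2015) or by any DSTC release; the class name, the string constants and the slot names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

DONT_CARE, NONE = "dontcare", "none"   # hypothetical special values
SEARCH_METHODS = {"by constraint", "by alternative", "finished"}


@dataclass
class DialogueState:
    goal_constraints: Dict[str, str]                       # one value per informable slot
    requested_slots: Set[str] = field(default_factory=set)
    search_method: str = "by constraint"

    def update(self, informed=None, requested=(), method=None):
        """Fold the information extracted from one user turn into the state."""
        for slot, value in (informed or {}).items():
            if slot not in self.goal_constraints:
                raise KeyError(f"unknown informable slot: {slot}")
            self.goal_constraints[slot] = value
        self.requested_slots |= set(requested)
        if method is not None:
            if method not in SEARCH_METHODS:
                raise ValueError(f"unknown search method: {method}")
            self.search_method = method


def new_state(informable_slots):
    """Initial state: no slot has been specified by the user yet."""
    return DialogueState({slot: NONE for slot in informable_slots})


state = new_state(["food", "area", "price_range"])
state.update(informed={"food": "japanese", "area": DONT_CARE}, requested=["phone"])
```

A statistical tracker would maintain a distribution over such states rather than a single hard assignment; the structure of each hypothesis stays the same.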
Many alternatives have also been used in the literature, such as a compact, binary representation
recently proposed by Kotti et al. (2018), and the StateNet tracker of Ren et al. (2018b) that is more
scalable with the domain size (number of slots and number of slot values).
In the past, dialogue state trackers were either created by experts, or obtained from data by statistical learning
algorithms like conditional random fields (Henderson, 2015). More recently, neural approaches
have started to gain popularity, with applications of deep neural networks (Henderson et al., 2013)
and recurrent networks (Mrkšić et al., 2015) as some of the early examples.
A more recent DST model is the Neural Belief Tracker proposed by Mrkšić et al. (2017), shown in
Fig. 4.5. The model takes three items as input. The first two are the last system and user utterances,
each of which is first mapped to an internal, vector representation. The authors studied two models
for representation learning, based on multi-layer perceptrons and convolutional neural networks,
both of which take advantage of pre-trained collections of word vectors and output an embedding for
the input utterance. The third input is any slot-value pair that is being tracked by DST. Then, the three
embeddings may interact among themselves for context modeling, to provide further contextual
information from the flow of conversation, and semantic decoding, to decide if the user explicitly
expressed an intent matching the input slot-value pair. Finally, the context modeling and semantic
decoding vectors go through a softmax layer to produce a final prediction. The same process is
repeated for all possible candidate slot-value pairs.
A different representation of dialogue states, called belief spans, is explored by Lei et al. (2018)
in the Sequicity framework. A belief span consists of two fields: one for informable slots and the
other for requestable slots. Each field collects values that have been found for the respective slots
in the conversation so far. One of the main benefits of belief spans and Sequicity is that they
facilitate the use of neural sequence-to-sequence models to learn dialogue systems, which take the
belief spans as input and output system responses. This greatly simplifies system design and
optimization, compared to more traditional, pipeline approaches (cf. Sec. 4.6).

Dialogue State Tracking Challenge (DSTC) is a series of challenges that provide common
testbeds and evaluation measures for dialogue state tracking. Starting from Williams et al. (2013),
it has successfully attracted many research teams to focus on a wide range of technical problems in
DST (Williams et al., 2014; Henderson et al., 2014b,a; Kim et al., 2016a,b; Hori et al., 2017).
Corpora used by DSTC over the years have covered human-computer and human-human conversations,
different domains such as restaurant and tourist, and cross-language learning. More information may
be found on the DSTC website.²

4.4 Dialogue Policy Learning


In this section, we will focus on dialogue policy optimization based on reinforcement learning.

4.4.1 Deep RL for Policy Optimization

The dialogue policy may be optimized by many standard reinforcement learning algorithms. There
are two ways to use RL: online and batch. The online approach requires the learner to interact with
users to improve its policy; the batch approach assumes a fixed set of transitions, and optimizes the
policy based on the data only, without interacting with users (see, e.g., Li et al. (2009); Pietquin et al.
(2011)). In this chapter, we discuss the online setting which often has batch learning as an internal
step. Many covered topics can be useful in the batch setting. Here, we use the DQN as an example,
following Lipton et al. (2018), to illustrate the basic workflow.

Model: Architecture, Training and Inference. The DQN’s input is an encoding of the current
dialogue state. One option is to encode it as a feature vector, consisting of: (1) one-hot repre-
sentations of the dialogue act and slot corresponding to the last user action; (2) the same one-hot
representations of the dialogue act and slot corresponding to the last system action; (3) a bag of slots
corresponding to all previously filled slots in the conversation so far; (4) the current turn count; and
(5) the number of results from the knowledge base that match the already filled-in constraints for
informed slots. Denote this input vector by s.
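
A minimal sketch of this encoding is given below. The act and slot vocabularies, the normalization of the turn count by `MAX_TURNS`, and the function names are assumptions made for illustration; Lipton et al. (2018) may differ in such details.

```python
import numpy as np

ACTS = ["inform", "request", "confirm", "bye"]            # hypothetical act set
SLOTS = ["movie_name", "date", "theater", "start_time"]   # hypothetical slot set
MAX_TURNS = 20                                            # hypothetical turn limit


def one_hot(items, vocabulary):
    """Binary indicator vector; several items give a bag-of-items vector."""
    vec = np.zeros(len(vocabulary))
    for item in items:
        vec[vocabulary.index(item)] = 1.0
    return vec


def encode_state(user_act, user_slot, sys_act, sys_slot,
                 filled_slots, turn, kb_matches):
    return np.concatenate([
        one_hot([user_act], ACTS), one_hot([user_slot], SLOTS),  # (1) last user action
        one_hot([sys_act], ACTS), one_hot([sys_slot], SLOTS),    # (2) last system action
        one_hot(filled_slots, SLOTS),                            # (3) bag of filled slots
        [turn / MAX_TURNS],                                      # (4) turn count, scaled
        [float(kb_matches)],                                     # (5) matching KB results
    ])


s = encode_state("inform", "date", "request", "date",
                 filled_slots=["movie_name", "date"], turn=3, kb_matches=12)
```

The resulting vector has a fixed length regardless of how long the conversation has been, which is what allows a feed-forward network to consume it.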
DQN outputs a real-valued vector, whose entries correspond to all possible (dialogue-act, slot) pairs
that can be chosen by the dialogue system. Available prior knowledge can be used to reduce the
number of outputs, if some (dialogue-act, slot) pairs do not make sense for a system, such as
request(price). Denote this output vector by q.
The model may have $L \geq 1$ layers (that is, $L - 1$ hidden layers followed by a linear output
layer), parameterized by matrices $\{W_1, W_2, \ldots, W_L\}$, so that

$$
\begin{aligned}
h_0 &= s \\
h_l &= g(W_l h_{l-1}) , \quad l = 1, 2, \ldots, L - 1 \\
q &= W_L h_{L-1} ,
\end{aligned}
$$

where $g(\cdot)$ is an activation function such as ReLU or sigmoid. Note that the last layer does not
need an activation function, and the output $q$ is to approximate $Q(s, \cdot)$, the Q-values in state $s$.

² https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge

To learn parameters in the network, one can use an off-the-shelf reinforcement-learning algorithm
(e.g., Eqn. 2.4 or 2.5 with experience replay); see Sec. 2.3 for the exact update rules and improved
algorithms. Once these parameters are learned, the network induces a greedy action-selection policy
as follows: for a current dialogue state s, use a forward pass on the network to compute q, the
Q-values for all actions. One can pick the action, which is a (dialogue act, slot) pair, that
corresponds to the entry in q with the largest value. Due to the need for exploration, the above
greedy action selection may not be desired; see Sec. 4.4.2 for a discussion on this subject.
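
The forward pass and the greedy policy can be sketched as follows. The weights are random rather than learned, the action list is hypothetical, and the optional epsilon-greedy branch is a simple stand-in for exploration that is not the strategy discussed in Sec. 4.4.2.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def q_values(weights, s):
    """Forward pass: ReLU on all layers except the last, which is linear."""
    h = s
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h


def select_action(weights, s, actions, epsilon=0.0, rng=None):
    """Greedy by default; with probability epsilon pick a random action."""
    if rng is not None and rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    return actions[int(np.argmax(q_values(weights, s)))]


rng = np.random.default_rng(0)
state_dim, hidden_dim = 22, 16                      # hypothetical sizes
actions = [("request", "date"), ("inform", "theater"), ("confirm", "movie_name")]
weights = [rng.normal(size=(hidden_dim, state_dim)),
           rng.normal(size=(len(actions), hidden_dim))]
s = rng.normal(size=state_dim)                      # stand-in dialogue state

q = q_values(weights, s)
greedy = select_action(weights, s, actions)
explored = select_action(weights, s, actions, epsilon=1.0, rng=rng)
```

Restricting `actions` to the (dialogue-act, slot) pairs that make sense for the system is where prior knowledge reduces the size of the output layer.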

Warm-start Policy. Learning a good policy from scratch often requires a lot of data, but the process
can be significantly sped up by restricting the policy search using expert-generated dialogues (Hen-
derson et al., 2008) or teacher advice (Chen et al., 2017d), or by initializing the policy to be a
reasonable one before switching to online interaction with (simulated) users.
One approach is to use imitation learning (also known as behavioral cloning) to mimic an expert-
provided policy. A popular option is to use supervised learning to directly learn the expert’s action
in a state; see Su et al. (2016b); Dhingra et al. (2017); Williams et al. (2017); Liu and Lane (2017)
for a few recent examples. Li et al. (2014) turned imitation learning into an induced reinforcement
learning problem, and then applied an off-the-shelf RL algorithm to learn the expert’s policy.
Finally, Lipton et al. (2018) proposed a simple yet effective alternative known as Replay Buffer
Spiking (RBS) that is particularly suited to DQN. The idea is to pre-fill the experience replay buffer
of DQN with a small number of dialogues generated by running a naïve yet occasionally successful,
rule-based agent. This technique is shown to be essential for DQN in simulated studies.
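
The idea of RBS fits in a few lines. In the sketch below, `toy_rule_based_dialogue` is a hypothetical stand-in for running a rule-based agent against a user simulator; its rewards, success probability and transition format `(state, action, reward, next_state, done)` are assumptions and not the settings of Lipton et al. (2018).

```python
import random
from collections import deque


def spike_replay_buffer(buffer, run_rule_based_dialogue, num_dialogues):
    """Pre-fill the replay buffer with transitions from a rule-based agent."""
    for _ in range(num_dialogues):
        buffer.extend(run_rule_based_dialogue())
    return buffer


def toy_rule_based_dialogue(rng, max_turns=5, success_prob=0.3):
    """Stand-in dialogue: a fixed-length exchange that occasionally succeeds."""
    transitions = []
    for turn in range(max_turns):
        done = turn == max_turns - 1
        success = done and rng.random() < success_prob
        # Small per-turn penalty; success or failure reward on the final turn.
        reward = (10.0 if success else -5.0) if done else -1.0
        transitions.append((turn, "request", reward, turn + 1, done))
    return transitions


rng = random.Random(0)
buffer = deque(maxlen=10_000)
spike_replay_buffer(buffer, lambda: toy_rule_based_dialogue(rng), num_dialogues=100)
```

Because some of the pre-filled dialogues end in success, the first minibatches sampled by DQN already contain positive rewards, which a randomly initialized policy would rarely reach on its own.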

Other Approaches. In the above example, a standard multi-layer perceptron is used in the DQN
to approximate the Q-function. It may be replaced by other models, such as a Bayesian version de-
scribed in the next subsection for efficient exploration, and recurrent networks (Zhao and Eskénazi,
2016; Williams et al., 2017) that can more easily capture information from conversational histories
than expert-designed dialogue states. In another recent example, Chen et al. (2018) used graph neu-
ral networks to model the Q-function, with nodes in the graph corresponding to slots of the domain.
The nodes may share some of the parameters across multiple slots, therefore increasing learning
speed.
Furthermore, one may replace the above value function-based methods by others like policy gradient
(Sec. 2.3.2), as done by Fatemi et al. (2016); Dhingra et al. (2017); Strub et al. (2017); Williams et al.
(2017); Liu et al. (2018a).

4.4.2 Efficient Exploration and Domain Extension

Without a teacher, an RL agent learns from data collected by interacting with an initially unknown
environment. In general, the agent has to try new actions in novel states, in order to discover poten-
tially better policies. Hence, it has to strike a good trade-off between exploitation (choosing good
actions to maximize reward, based on information collected thus far) and exploration (choosing
novel actions to discover potentially better alternatives), leading to the need for efficient explo-
ration (Sutton and Barto, 2018). In the context of dialogue policy learning, the implication is that
the policy learner actively tries new ways to converse with a user, in the hope of discovering a better
policy in the long run (e.g., Daubigney et al., 2011).
While exploration in finite-state RL is relatively well-understood (Strehl et al., 2009; Jaksch et al.,
2010; Osband and Roy, 2017; Dann et al., 2017), exploration when parametric models like neural
networks are used is an active research topic (Bellemare et al., 2016; Osband et al., 2016; Houthooft
et al., 2016; Jiang et al., 2017). Here, a general-purpose exploration strategy is described, which is
particularly suited for dialogue systems that may evolve over time.
After a task-oriented dialogue system is deployed to serve users, there may be a need over time
to add more intents and/or slots to make the system more versatile. This problem, referred to as
domain extension (Gašić et al., 2014), makes exploration even more challenging: the agent needs to
explicitly quantify the uncertainty in its parameters for intents/slots, so as to explore new ones more
aggressively while avoiding exploring those that have already been learned. Lipton et al. (2018)
approached the problem using a Bayesian-by-Backprop variant of DQN.
Their model, called BBQ, is identical to DQN, except that it maintains a posterior distribution $q$ over
the network weights $w = (w_1, w_2, \ldots, w_d)$. For computational convenience, $q$ is a multivariate
Gaussian distribution with diagonal covariance, parameterized by $\theta = \{(\mu_i, \rho_i)\}_{i=1}^{d}$,
where weight $w_i$ has a Gaussian posterior distribution $\mathcal{N}(\mu_i, \sigma_i^2)$ with
$\sigma_i = \log(1 + \exp(\rho_i))$. The posterior information leads to a natural exploration strategy,
inspired by Thompson Sampling (Thompson, 1933; Chapelle and Li, 2012; Russo et al., 2018). When
selecting actions, the agent simply draws a random weight $\tilde{w} \sim q$, and then selects the
action with the highest value output by the network with weights $\tilde{w}$. Experiments show that
BBQ explores more efficiently than state-of-the-art baselines for dialogue domain extension.
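
The sampling step can be sketched as follows. The softplus parameterization of the standard deviation follows the text; the linear Q-function `linear_q` is a hypothetical stand-in for the network, used only to keep the example short.

```python
import numpy as np


def softplus(rho):
    return np.log1p(np.exp(rho))


def sample_weights(mu, rho, rng):
    """Draw w ~ N(mu, diag(sigma^2)) with sigma = log(1 + exp(rho))."""
    return mu + softplus(rho) * rng.standard_normal(mu.shape)


def thompson_action(mu, rho, s, q_fn, rng):
    """Sample one weight vector, then act greedily under that sample."""
    w = sample_weights(mu, rho, rng)
    return int(np.argmax(q_fn(w, s)))


def linear_q(w, s, n_actions=3):
    """Hypothetical Q-function: one weight row per action."""
    return w.reshape(n_actions, -1) @ s


rng = np.random.default_rng(0)
s = np.ones(4)

# Wide posterior: uncertainty about all actions, so all of them get tried.
mu_unsure, rho_wide = np.zeros(12), np.zeros(12)
tried = {thompson_action(mu_unsure, rho_wide, s, linear_q, rng) for _ in range(200)}

# Narrow posterior favoring action 1: sampling almost never changes the choice.
mu_sure, rho_narrow = np.zeros(12), np.full(12, -20.0)
mu_sure[4:8] = 1.0
settled = {thompson_action(mu_sure, rho_narrow, s, linear_q, rng) for _ in range(200)}
```

Exploration thus fades automatically as the posterior concentrates, and stays high for newly added intents or slots whose weights are still uncertain.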
The BBQ model is updated as follows. Given observed transitions $T = \{(s, a, r, s')\}$, one uses the
target network (see Sec. 2.3) to compute the target values for each $(s, a)$ in $T$, resulting in the set
$D = \{(x, y)\}$, where $x = (s, a)$ and $y$ may be computed as in DQN. Then, parameter $\theta$ is updated
to represent the posterior distribution of weights. Since the exact posterior is not Gaussian any more,
and thus not representable by BBQ, it is approximated as follows: $\theta$ is chosen by minimizing the
variational free energy (Hinton and Van Camp, 1993), the KL-divergence between the variational
approximation $q(w|\theta)$ and the posterior $p(w|D)$:

$$
\begin{aligned}
\theta^* &= \operatorname{argmin}_{\theta} \, \mathrm{KL}\left[ q(w|\theta) \,\|\, p(w|D) \right] \\
&= \operatorname{argmin}_{\theta} \left\{ \mathrm{KL}\left[ q(w|\theta) \,\|\, p(w) \right] - \mathbb{E}_{q(w|\theta)}\left[ \log p(D|w) \right] \right\} .
\end{aligned}
$$

In other words, the new parameter $\theta$ is chosen so that the new Gaussian distribution is closest to
the posterior measured by KL-divergence.
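
The objective can be evaluated numerically as sketched below. The sketch assumes a Gaussian prior $p(w)$, so that the KL term has a closed form, and estimates the expected log-likelihood by Monte Carlo sampling; the toy regression data and the unit-variance Gaussian likelihood are assumptions for illustration and do not reproduce the BBQ training setup.

```python
import numpy as np


def softplus(rho):
    return np.log1p(np.exp(rho))


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)], summed over dimensions."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
                  - 0.5)


def free_energy(mu, rho, prior_mu, prior_sigma, log_likelihood, rng, n_samples=32):
    """KL[q(w|theta) || p(w)] minus a Monte Carlo estimate of E_q[log p(D|w)]."""
    sigma = softplus(rho)
    kl = kl_diag_gaussians(mu, sigma, prior_mu, prior_sigma)
    draws = mu + sigma * rng.standard_normal((n_samples, mu.size))
    expected_ll = np.mean([log_likelihood(w) for w in draws])
    return kl - expected_ll


# Toy data D = {(x, y)}: y is linear in x with a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
w_true = np.array([1.0, -2.0])
y = x @ w_true + 0.1 * rng.normal(size=50)


def log_likelihood(w):
    # Unit-variance Gaussian likelihood, additive constants dropped.
    return -0.5 * np.sum((y - x @ w) ** 2)


rho = np.full(2, -3.0)
f_good = free_energy(w_true, rho, 0.0, 1.0, log_likelihood, rng)
f_bad = free_energy(np.zeros(2), rho, 0.0, 1.0, log_likelihood, rng)
```

A posterior centered on weights that explain the targets attains a lower free energy than one centered on the prior mean, even though it pays a larger KL penalty; gradient-based training moves $\theta$ in that direction.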

4.4.3 Composite-task Dialogues

In many real-world problems, a task may consist of a set of subtasks that need to be solved collec-
tively. Similarly, dialogues can often be decomposed into a sequence of related sub-dialogues, each
of which focuses on a subtopic (Litman and Allen, 1987). Consider for example a travel planning
dialogue system, which needs to book flights, hotels and car rental in a collective way so as to satisfy
certain cross-subtask constraints known as slot constraints (Peng et al., 2017). Slot constraints are
application specific. In a travel planning problem, one natural constraint is that the outbound flight’s
arrival time should be earlier than the hotel check-in time.
Complex tasks with slot constraints are referred to as composite tasks by Peng et al. (2017). Opti-
mizing the dialogue policy for a composite task is challenging for two reasons. First, the policy has
to handle many slots, as each subtask often corresponds to a domain with its own set of slots, and
the set of slots of a composite task consists of slots from all subtasks. Furthermore, because of slot
constraints, these subtasks cannot be solved independently. Therefore, the state space considered by
a composite task is much larger. Second, a composite-task dialogue often requires many more turns
to complete. Typical reward functions give a success-or-not reward only at the end of the whole
dialogue. As a result, this reward signal is very sparse and considerably delayed, making policy
optimization much harder.
Cuayáhuitl et al. (2010) proposed to use hierarchical reinforcement learning to optimize a composite
task’s dialogue policy, with tabular versions of the MAXQ (Dietterich, 2000) and Hierarchical Ab-
stract Machine (Parr and Russell, 1998) approaches. While promising, their solutions assume finite
states, so do not apply directly to larger-scale conversational problems.
More recently, Peng et al. (2017) tackled the composite-task dialogue policy learning problem under
the more general options framework (Sutton et al., 1999b), where the task hierarchy has two levels.
As illustrated in Fig. 4.6, a top-level policy $\pi_g$ selects which subtask $g$ to solve, and a low-level
policy $\pi_{a,g}$ solves the subtask specified by $\pi_g$. Assuming predefined subtasks, they extended
the DQN model to this two-level hierarchy, which resulted in substantially faster learning and
superior policies. A similar approach is taken by Budzianowski et al. (2017), who used Gaussian
process RL instead of deep RL for policy learning.
A major assumption in options/subgoal-based hierarchical reinforcement learning is the need for
reasonable options and subgoals. Tang et al. (2018) considered the problem of discovering subgoals
from dialogue demonstrations. Inspired by a sequence segmentation approach that is successfully
applied to machine translation (Wang et al., 2017a), the authors developed the Subgoal Discovery
Network (SDN), which learns to identify “bottleneck” states in successful dialogues. It is shown that
the hierarchical DQN optimized with subgoals discovered by SDN is competitive with one that uses
expert-designed subgoals.
Finally, another interesting attempt is made by Casanueva et al. (2018) based on Feudal
Reinforcement Learning (FRL) (Dayan and Hinton, 1993). In contrast to the above methods that
decompose a task into temporally separated subtasks, FRL decomposes a complex decision spatially.
In each turn of a dialogue, the feudal policy first decides between information-gathering actions
and information-providing actions, then chooses a primitive action that falls in the corresponding
high-level category.

Figure 4.6: A two-level hierarchical dialogue policy. Figure credit: Peng et al. (2017).

Table 4.2: An example of multi-domain dialogue, adapted from Cuayáhuitl et al. (2016). The first
column specifies which domain is triggered in the system, based on user utterances received so far.

Domain      Agent   Utterance
meta        system  “Hi! How can I help you?”
            user    “I’m looking for a hotel in Seattle on January 2nd for 2 nights.”
hotel       system  “A hotel for 2 nights in Seattle on January 2nd?”
            user    “Yes.”
            system  “I found Hilton Seattle.”
meta        system  “Anything else I can help with?”
            user    “I’m looking for cheap Japanese food in the downtown.”
restaurant  system  “Did you say cheap Japanese food?”
            user    “Yes.”
            system  “I found the following results.”
...

4.4.4 Multi-domain Dialogues

A multi-domain dialogue system can hold a conversation with a user that may involve more than
one domain (Komatani et al., 2006; Hakkani-Tür et al., 2012; Wang et al., 2014). Table 4.2 shows
an example, where the dialogue covers both the hotel and restaurant domains, in addition to a
special meta domain for sub-dialogues that contain domain-independent system and user responses.
Different from composite tasks, sub-dialogues corresponding to different domains in a conversation
are separate tasks, without cross-task slot constraints. Similar to composite-task systems, a
multi-domain dialogue system needs to keep track of a much larger dialogue state space that has slots
from all domains, so directly applying RL can be inefficient. This raises the need to learn reusable
policies whose parameters can be shared across multiple domains as long as they are related.
Gašić et al. (2015) proposed to use a Bayesian Committee Machine (BCM) for efficient multi-domain
policy learning. During training time, a number of policies are trained on different, potentially
small, datasets. The authors used Gaussian process RL algorithms to optimize those
policies, although they can be replaced by deep learning alternatives. During test time, in each turn
of a dialogue, these policies recommend an action, and all recommendations are aggregated into a
final action to be taken by the BCM policy.
Cuayáhuitl et al. (2016) developed another related technique known as NDQN (Network of DQNs),
where each DQN is trained for a specialized skill to converse in a particular sub-dialogue. A
meta-policy controls how to switch between these DQNs, and can also be optimized using (deep)
reinforcement learning.

Figure 4.7: Three strategies for optimizing dialogue policies based on reinforcement learning.
Figure credit: Peng et al. (2018).

More recently, Papangelis et al. (2018a) studied another approach in which policies optimized for
different domains can be shared through a set of features that describe a domain. This approach was
shown to handle unseen domains, and thus reduces the need for domain knowledge to design the
ontology.

4.4.5 Integration of Planning and Learning

As mentioned in Sec. 4.2, optimizing the policy of a task-oriented dialogue against humans is costly,
since it requires many interactions between the dialogue system and humans (left panel of Fig. 4.7).
Simulated users provide an inexpensive alternative for RL-based policy optimization (middle panel
of Fig. 4.7), but may not be a sufficiently faithful approximation of human users.
Here, we are concerned with the use of a user model to generate more data to improve sample
complexity in optimizing a dialogue system. Inspired by the Dyna-Q framework (Sutton, 1990),
Peng et al. (2018) proposed Deep Dyna-Q (DDQ) to handle large-scale problems with deep learning
models, as shown by the right panel of Fig. 4.7. Intuitively, DDQ allows interactions with both
human users and simulated users. Training of DDQ consists of three parts:

• direct reinforcement learning: the dialogue system interacts with a real user, collects real
dialogues and improves the policy by either imitation learning or reinforcement learning;
• world model learning: the world model (user simulator) is refined using real dialogues
collected by direct reinforcement learning;
• planning: the dialogue policy is improved against simulated users by reinforcement learn-
ing.

Human-in-the-loop experiments show that DDQ is able to efficiently improve the dialogue policy
by interacting with real users, which is important for deploying dialogue systems in practice.
One challenge with DDQ is to balance samples from real users (direct reinforcement learning) and
simulated users (planning). Peng et al. (2018) used a heuristic that reduces planning steps in later
stages of DDQ, when more real user interactions are available. In contrast, Su et al. (2018b) proposed
the Discriminative Deep Dyna-Q (D3Q) that is inspired by generative adversarial networks. Specif-
ically, it incorporates a discriminator which is trained to differentiate experiences of simulated users
from those of real users. During the planning step, a simulated experience is used for policy training
only when it appears to be a real-user experience according to the discriminator.
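The three-part training loop described above can be sketched with classic tabular Dyna-Q (Sutton, 1990). This is only a toy illustration, not the DDQ implementation: DDQ uses deep networks for both the policy and the world model, whereas here the Q-function is a table and the "world model" simply memorizes observed transitions. The environment `chain_step` (a five-state chain standing in for a dialogue with a real user) and all hyperparameters are hypothetical.

```python
import random

def chain_step(state, action):
    """Hypothetical 'real user' environment: a 5-state chain.
    Action 1 moves right, action 0 moves left; reaching state 4 ends the episode."""
    nxt = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    done = nxt == 4
    return (1.0 if done else 0.0), nxt, done

def dyna_q(env_step, n_states, n_actions, start_state, episodes=200,
           planning_steps=5, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (state, action) -> (reward, next_state, done)

    def update(s, a, r, s2, done):
        # One-step Q-learning update.
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        s, done, steps = start_state, False, 0
        while not done and steps < 100:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(q[s])
                a = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            # (1) Direct reinforcement learning from real experience.
            r, s2, done = env_step(s, a)
            update(s, a, r, s2, done)
            # (2) World model learning: remember what the environment did.
            model[(s, a)] = (r, s2, done)
            # (3) Planning: extra updates from simulated experience.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
            steps += 1
    return q

q_table = dyna_q(chain_step, n_states=5, n_actions=2, start_state=0)
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
```

The `planning_steps` argument is the knob that the balancing heuristic discussed above would reduce over time; a D3Q-style variant would additionally filter the simulated transitions before calling `update`.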

4.4.6 Reward Function Learning

The dialogue policy is often optimized to maximize long-term reward when interacting with users.
The reward function is therefore critical to creating high-quality dialogue systems. One possibility is
to have users provide feedback during or at the end of a conversation to rate the quality, but feedback
like this is intrusive and costly. Often, easier-to-measure quantities such as time-elapsed are used to
compute a reward function. Unfortunately, in practice, designing an appropriate reward function is
not always obvious, and substantial domain knowledge is needed (Sec. 4.1). This inspires the use of

machine learning to find a good reward function from data (Walker et al., 2000; Rieser and Lemon,
2008; Rieser et al., 2010; El Asri et al., 2012) which can better correlate with user satisfaction (Rieser
and Lemon, 2011), or is more consistent with expert demonstrations (Li et al., 2014).
Su et al. (2015) proposed to rate dialogue success with two neural network models, a recurrent and
a convolutional network. Their approach is found to result in competitive dialogue policies, when
compared to a baseline that uses prior knowledge of user goals. However, these models assume
the availability of labeled data in the form of (dialogue, success-or-not) pairs, in which the success-
or-not feedback provided by users can be expensive to obtain. To reduce the labeling cost, Su
et al. (2016a; 2018a) investigated an active learning approach based on Gaussian processes, which
aims to learn the reward function and policy at the same time while interacting with human users.
Ultes et al. (2017a) argued that dialogue success only measures one aspect of the dialogue policy’s
quality. Focusing on information-seeking tasks, the authors proposed a new reward estimator based
on interaction quality that balances multiple aspects of the dialogue policy. Later on, Ultes et al.
(2017b) used multi-objective RL to automatically learn how to linearly combine multiple metrics of
interest in the definition of reward function.
Finally, inspired by adversarial training in deep learning, Liu and Lane (2018) proposed to view
the reward function as a discriminator that distinguishes dialogues generated by humans from those
by the dialogue policy. Therefore, there are two learning processes in their approach: the reward
function as a discriminator, and the dialogue policy optimized to maximize the reward function.
The authors showed that such an adversarially learned reward function can lead to better dialogue
policies than with hand-designed reward functions.

4.5 Natural Language Generation

Natural Language Generation (NLG) is responsible for converting a communication goal, selected
by the dialogue manager, into a natural language form. It is an important component that affects
naturalness of a dialogue system, and thus the user experience.
There exist many approaches to language generation. The most common in practice are perhaps
template- or rule-based ones, where domain experts design a set of templates or rules, and hand-craft
heuristics to select a proper candidate to generate sentences. Even though machine learning can be
used to train certain parts of these systems (Langkilde and Knight, 1998; Stent et al., 2004; Walker
et al., 2007), the cost to write and maintain templates and rules leads to challenges in adapting to
new domains or different user populations. Furthermore, the quality of these NLG systems is limited
by the quality of hand-crafted templates and rules.
These challenges motivate the study of more data-driven approaches, known as corpus-based meth-
ods, that aim to optimize a generation module from corpora (Oh and Rudnicky, 2002; Angeli et al.,
2010; Kondadadi et al., 2013; Mairesse and Young, 2014). Most such methods are based on supervised
learning, while Rieser and Lemon (2010) take a decision-theoretic view and use reinforcement
learning to make a trade-off between sentence length and information revealed.3
In recent years, there has been growing interest in neural approaches to language generation. An elegant
model, known as Semantically Controlled LSTM (SC-LSTM) (Wen et al., 2015), is a variant of
LSTM (Hochreiter and Schmidhuber, 1997), with an extra component that gives a semantic control
on the language generation results. As shown in Fig. 4.8, a basic SC-LSTM cell has two parts: a
typical LSTM cell (upper part in the figure) and a sentence planning cell (lower part) for semantic
control.

3 Some authors (Stone et al., 2003; Koller and Stone, 2007) have taken a similar, decision-theoretic point of
view for NLG. They formulate NLG as a planning problem, as opposed to the data-driven or corpus-based methods
being discussed here.

Figure 4.8: A Semantic Controlled LSTM (SC-LSTM) Cell. Picture credit: Wen et al. (2015).

More precisely, the operations upon receiving the $t$-th input token, denoted $w_t$, are defined by the
following set of equations:
$$
\begin{aligned}
i_t &= g(W_{wi} w_t + W_{hi} h_{t-1}) \\
f_t &= g(W_{wf} w_t + W_{hf} h_{t-1}) \\
o_t &= g(W_{wo} w_t + W_{ho} h_{t-1}) \\
r_t &= g(W_{wr} w_t + \alpha W_{hr} h_{t-1}) \\
d_t &= r_t \odot d_{t-1} \\
\hat{c}_t &= \tanh(W_{wc} w_t + W_{hc} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t + \tanh(W_{dc} d_t) \\
h_t &= o_t \odot \tanh(c_t) ,
\end{aligned}
$$
where $h_{t-1}$ is the hidden layer, $W_{*}$ the trainable parameters, $g(\cdot)$ the sigmoid function, and
$\odot$ the element-wise product. As in a standard LSTM cell, $i_t$, $f_t$ and $o_t$ are the input, forget,
and output gates, respectively. The extra component introduced to SC-LSTM is the reading gate $r_t$, which
is used to compute a sequence of dialogue acts $\{d_t\}$ starting from the original dialogue act $d_0$.
This sequence ensures that the generated utterance represents the intended meaning, and the reading
gate controls what information is retained for future steps. It is in this sense that the gate $r_t$ plays
the role of sentence planning (Wen et al., 2015). Finally, given the hidden layer $h_t$, the output
distribution is given by a softmax function:
$$
w_{t+1} \sim \mathrm{softmax}(W_{ho} h_t) .
$$
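To make the cell equations concrete, the following is a minimal NumPy sketch of a single SC-LSTM step. It is an illustration under stated assumptions rather than the implementation of Wen et al. (2015): the helper names (`init_sc_lstm`, `sc_lstm_step`), the random initialization, the dimensions and the value of `alpha` are all hypothetical, and the output softmax layer is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_sc_lstm(n_in, n_h, n_d, rng):
    """Randomly initialise the weight matrices (hypothetical helper)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    p = {}
    for gate in ("i", "f", "o", "c"):
        p["W_w" + gate] = mat(n_h, n_in)   # input-to-hidden weights
        p["W_h" + gate] = mat(n_h, n_h)    # hidden-to-hidden weights
    p["W_wr"] = mat(n_d, n_in)             # reading gate, from the input token
    p["W_hr"] = mat(n_d, n_h)              # reading gate, from the hidden state
    p["W_dc"] = mat(n_h, n_d)              # dialogue act into the cell state
    return p

def sc_lstm_step(w_t, h_prev, c_prev, d_prev, p, alpha=0.5):
    """One step of the SC-LSTM cell, following the equations term by term."""
    i_t = sigmoid(p["W_wi"] @ w_t + p["W_hi"] @ h_prev)
    f_t = sigmoid(p["W_wf"] @ w_t + p["W_hf"] @ h_prev)
    o_t = sigmoid(p["W_wo"] @ w_t + p["W_ho"] @ h_prev)
    r_t = sigmoid(p["W_wr"] @ w_t + alpha * (p["W_hr"] @ h_prev))
    d_t = r_t * d_prev                     # reading gate shrinks the dialogue act
    c_hat = np.tanh(p["W_wc"] @ w_t + p["W_hc"] @ h_prev)
    c_t = f_t * c_prev + i_t * c_hat + np.tanh(p["W_dc"] @ d_t)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t, d_t

rng = np.random.default_rng(0)
n_in, n_h, n_d = 6, 8, 4
params = init_sc_lstm(n_in, n_h, n_d, rng)
d0 = np.array([1.0, 0.0, 1.0, 1.0])        # hypothetical dialogue-act vector
h, c, d = np.zeros(n_h), np.zeros(n_h), d0.copy()
for _ in range(3):                          # feed three random "token" vectors
    h, c, d = sc_lstm_step(rng.normal(size=n_in), h, c, d, params)
```

Because every entry of the reading gate lies strictly between 0 and 1, each entry of the dialogue-act vector can only decay over time, which is how the cell keeps track of the information that still has to be expressed.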
Wen et al. (2015) proposed several improvements to the basic SC-LSTM architecture. One was to
make the model deeper by stacking multiple LSTM cells on top of the structure in Fig. 4.8. Another
was utterance reranking: they trained another instance of SC-LSTM on the reversed input sequence,
similar to bidirectional recurrent networks, and then combined both instances to finalize reranking.
The basic approach outlined above may be extended in several ways. For example, Wen et al. (2016)
investigated the use of multi-domain learning to reduce the amount of data to train a neural language
generator, and Su et al. (2018c) proposed a hierarchical approach that leverages linguistic patterns to
further improve generation results. Language generation remains an active research area. The next
chapter will cover more recent works for chitchat conversations, in which many techniques can also
be useful in task-oriented dialogue systems.

4.6 End-to-end Learning

Traditionally, components in most dialogue systems are optimized separately. This modularized
approach provides the flexibility that allows each module to be created in a relatively independent
way. However, it often leads to a more complex system design, and improvements in individual
modules do not necessarily translate into improvement of the whole dialogue system. Lemon (2011)
argued for, and empirically demonstrated, the benefit of jointly optimizing dialogue management
and natural language generation, within a reinforcement-learning framework. More recently, with
the increasing popularity of neural models, there has been growing interest in jointly optimizing
multiple components, or even end-to-end learning of a dialogue system.
One benefit of neural models is that they are often differentiable and can be optimized by gradient-
based methods like back-propagation (Goodfellow et al., 2016). In addition to language under-
standing, state tracking and policy learning that have been covered in previous sections, speech
recognition & synthesis (for spoken dialogue systems) may be learned by neural models and back-
propagation to achieve state-of-the-art performance (Hinton et al., 2012; van den Oord et al., 2016;
Wen et al., 2015). In the extreme, if all components in a task-oriented dialogue system (Fig. 4.1)
are differentiable, the whole system becomes a larger differentiable system that can be optimized by
back-propagation against metrics that quantify overall quality of the whole system. This is an ad-
vantage compared to traditional approaches that optimize individual components separately. There
are two general classes of approaches to building an end-to-end dialogue system:

Supervised Learning. The first is based on supervised learning, where desired system responses
are first collected and then used to train multiple components of a dialogue system in order to maxi-
mize prediction accuracy (Bordes et al., 2017; Wen et al., 2017; Yang et al., 2017b; Eric et al., 2017;
Madotto et al., 2018; Wu et al., 2018).
Wen et al. (2017) introduced a modular neural dialogue system, where most modules are represented
by a neural network. However, their approach relies on non-differentiable knowledge-base lookup
operators, so training of the components is done separately in a supervised manner. This challenge
is addressed by Dhingra et al. (2017) who proposed “soft” knowledge-base lookups; see Sec. 3.5 for
more details.
Bordes et al. (2017) treated dialogue system learning as the problem of learning a mapping from
dialogue histories to system responses. They showed that memory networks and supervised embedding
models outperform standard baselines on a number of simulated dialogue tasks. A similar approach
was taken by Madotto et al. (2018) in their Mem2Seq model. This model uses mechanisms from
pointer networks (Vinyals et al., 2015a) so as to incorporate external information from knowledge
bases.
Finally, Eric et al. (2017) proposed an end-to-end trainable Key-Value Retrieval Network, which is
equipped with an attention-based key-value retrieval mechanism over entries of a KB, and can learn
to extract relevant information from the KB.

Reinforcement Learning. While supervised learning can produce promising results, it requires
training data that may be expensive to obtain. Furthermore, this approach does not allow a dialogue
system to explore different policies that can potentially be better than expert policies that produce
responses for supervised training. This inspires another line of work that uses reinforcement learning
to optimize end-to-end dialogue systems (Zhao and Eskénazi, 2016; Williams and Zweig, 2016;
Dhingra et al., 2017; Li et al., 2017d; Braunschweiler and Papangelis, 2018; Strub et al., 2017; Liu
and Lane, 2017; Liu et al., 2018a).
Zhao and Eskénazi (2016) proposed a model that takes user utterance as input and outputs a semantic
system action. Their model is a recurrent variant of DQN based on LSTM, which learns to compress
a user utterance sequence to infer an internal state of the dialogue. Compared to classic approaches,
this method is able to jointly optimize the policy as well as language understanding and state tracking
beyond standard supervised learning.
Another approach, taken by Williams et al. (2017), is to use LSTM to avoid the tedious step of state
tracking engineering, and jointly optimize state tracker and the policy. Their model, called Hybrid
Code Networks (HCN), also makes it easy for engineers to incorporate business rules and other

prior knowledge via software and action templates. They show that HCN can be trained end-to-end,
demonstrating much faster learning than several end-to-end techniques.
Strub et al. (2017) applied policy gradient to optimize a visually grounded task-oriented dialogue
in the GuessWhat?! game in an end-to-end fashion. In the game, both the user and the dialogue
system have access to an image. The user chooses an object in the image without revealing it, and
the dialogue system is to locate this object by asking the user a sequence of yes-no questions.
Finally, it is possible to combine supervised and reinforcement learning in an end-to-end trainable
system. Liu et al. (2018a) proposed such a hybrid approach. First, they used supervised learning on
human-human dialogues to pre-train the policy. Second, they used an imitation learning algorithm,
known as DAgger (Ross et al., 2011), to fine-tune the policy with human teachers who can suggest
correct dialogue actions. In the last step, reinforcement learning was used to continue policy learning
with online user feedback.

4.7 Further Remarks


In this chapter, we have surveyed recent neural approaches to task-oriented dialogue systems, focus-
ing on slot-filling problems. This is a new area with many exciting research opportunities. While
it is beyond the scope of this paper to give full coverage of more general dialogue problems and all
research directions, we briefly describe a small sample of them to conclude this chapter.

Beyond Slot-filling Dialogues. Task-oriented dialogues in practice can be much more diverse and
complex than slot-filling ones. Information-seeking or navigation dialogues are another popular
example that has been mentioned in different contexts (e.g., Dhingra et al. (2017), Papangelis et al.
(2018b), and Sec. 3.5). Another direction is to enrich the dialogue context. Rather than text-only or
speech-only ones, our daily dialogues are often multimodal, and involve both verbal and nonverbal
inputs like vision (Bohus et al., 2014; DeVault et al., 2014; de Vries et al., 2017; Zhang et al., 2018a).
Challenges such as how to combine information from multiple modalities to make decisions arise
naturally.
So far, we have looked at dialogues that involve two parties—the user and the dialogue agent, and
the latter is to assist the former. In general, the task can be more complex such as mixed-initiative
dialogues (Horvitz, 1999) and negotiations (Barlier et al., 2015; Lewis et al., 2017). More generally,
there may be multiple parties involved in a conversation, where turn taking becomes more challeng-
ing (Bohus and Horvitz, 2009, 2011). In such scenarios, it is helpful to take a game-theoretic view,
more general than the MDP view as in single-agent decision making.

Weaker Learning Signals. In the literature, a dialogue system can be optimized by supervised,
imitation, or reinforcement learning. Some require expert labels/demonstrations, while some require
a reward signal from a (simulated) user. There are other, weaker forms of learning signals that facilitate
dialogue management at scale. A promising direction is to consider preferential input: instead
of having an absolute judgment (either in the form of label or reward) of the policy quality, one only
requires a preferential input that indicates which one of two dialogues is better. Such comparative
feedback is often easier and cheaper to obtain, and can be more reliable than absolute feedback.

Related Areas. Evaluation remains a major research challenge. Although user simulation can be
useful (Sec. 4.2.2), a more appealing and robust solution is to use real human-human conversation
corpora directly for evaluation. Unfortunately, this problem, known as off-policy evaluation in the
RL literature, is challenging with numerous current research efforts (Precup et al., 2000; Jiang and
Li, 2016; Thomas and Brunskill, 2016; Liu et al., 2018c). Such off-policy techniques can find
important use in evaluating and optimizing dialogue systems.
Another related line of research is deep reinforcement learning applied to text games (Narasimhan
et al., 2015; Côté et al., 2018), which are in many ways similar to a conversation, except that the
scenarios are predefined by the game designer. Recent advances for solving text games, such as
handling natural-language actions (Narasimhan et al., 2015; He et al., 2016; Côté et al., 2018) and
interpretable policies (Chen et al., 2017c) may be useful for task-oriented dialogues as well.

Chapter 5

Fully Data-Driven Conversation Models and Social Bots

Researchers have recently begun to explore fully data-driven and end-to-end (E2E) approaches
to conversational response generation, e.g., within the sequence-to-sequence (seq2seq) frame-
work (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014). These models are trained entirely
from data without resorting to any expert knowledge, which means they do not rely on the four
traditional components of dialogue systems noted in Chapter 4. Such end-to-end models have been
particularly successful with social bot (chitchat) scenarios, as social bots rarely require interaction
with the user’s environment, and the lack of external dependencies such as API calls simplifies end-
to-end training. By contrast, task-completion scenarios typically require such APIs in the form of,
e.g., knowledge base access. The other reason this framework has been successful with chitchat is
that it easily scales to large free-form and open-domain datasets, which means the user can typically
chat on any topic of her liking. While social bots are of significant importance in facilitating smooth
interaction between humans and their devices, more recent work also focuses on scenarios going
beyond chitchat, e.g., recommendation.

5.1 End-to-End Conversation Models

Most of the earliest end-to-end (E2E) conversation models are inspired by statistical machine translation
(SMT) (Koehn et al., 2003; Och and Ney, 2004), including neural machine translation (Kalchbrenner
and Blunsom, 2013; Cho et al., 2014a; Bahdanau et al., 2015). The casting of the conversational
response generation task (i.e., predicting a response $T_i$ based on the previous dialogue turn $T_{i-1}$)
as an SMT problem is a relatively natural one, as one can treat turn $T_{i-1}$ as the “foreign sentence”
and turn $T_i$ as its “translation”. This means one can apply any off-the-shelf SMT algorithm to a
conversational dataset to build a response generation system. This was the idea originally proposed
in one of the first works on fully data-driven conversational AI (Ritter et al., 2011), which applied
a phrase-based translation approach (Koehn et al., 2003) to dialogue datasets extracted from Twit-
ter (Serban et al., 2015). A different E2E approach was proposed in (Jafarpour et al., 2010), but it
relied on IR-based methods rather than machine translation.
While these two papers constituted a paradigm shift, they had several limitations. The most signif-
icant one is their representation of the data as (query, response) pairs, which hinders their ability to
generate responses that are contextually appropriate. This is a serious limitation as dialogue turns in
chitchat are often short (e.g., a few-word utterance such as “really?”), in which case conversational
models critically need longer contexts to produce plausible responses. This limitation motivated
the work of Sordoni et al. (2015b), which proposed an RNN-based approach to conversational re-
sponse generation (similar to Fig. 2.2) to exploit longer context. Together with the contemporaneous
works (Shang et al., 2015; Vinyals and Le, 2015), these papers presented the first neural approaches
to fully E2E conversation modeling. While these three papers have some distinct properties, they
are all based on RNN architectures, which nowadays are often modeled with a Long Short-Term
Memory (LSTM) model (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014).

5.1.1 The LSTM Model

We give an overview of LSTM-based response generation. LSTM is arguably the most popular
seq2seq model, although alternative models like GRU (Cho et al., 2014b) are often as effective.
LSTM is an extension of the RNN model in Fig. 2.2, and is often more effective at exploiting long-
term context.
An LSTM-based response generation system is usually modeled as follows (Vinyals and Le, 2015;
Li et al., 2016a): Given a dialogue history represented as a sequence of words $S = \{s_1, s_2, ..., s_{N_s}\}$
($S$ here stands for source), the LSTM associates each time step $k$ with input, memory, and output
gates, denoted respectively as $i_k$, $f_k$ and $o_k$. $N_s$ is the number of words in the source $S$.1 Then, the
hidden state $h_k$ of the LSTM for each time step $k$ is computed as follows:

\begin{align}
i_k &= \sigma(W_i [h_{k-1}; e_k]) \tag{5.1} \\
f_k &= \sigma(W_f [h_{k-1}; e_k]) \tag{5.2} \\
o_k &= \sigma(W_o [h_{k-1}; e_k]) \tag{5.3} \\
l_k &= \tanh(W_l [h_{k-1}; e_k]) \tag{5.4} \\
c_k &= f_k \circ c_{k-1} + i_k \circ l_k \tag{5.5} \\
h_k^s &= o_k \circ \tanh(c_k) \tag{5.6}
\end{align}

where the matrices $W_i$, $W_f$, $W_o$, $W_l$ belong to $\mathbb{R}^{d \times 2d}$, and $\circ$ denotes the
element-wise product. As it is a response generation task, each conversational context $S$ is paired with
a sequence of output words to predict: $T = \{t_1, t_2, ..., t_{N_t}\}$. Here, $N_t$ is the length of the
response and $t$ represents a word token that is associated with a $d$-dimensional word embedding $e_t$
(distinct from the source).
The LSTM model defines the probability of the next token to predict using the softmax function.
Specifically, let $f(h_{k-1}, e_{y_k})$ be the softmax activation function of $h_{k-1}$ and $e_{y_k}$, where
$h_{k-1}$ is the hidden vector at time $k-1$. Then, the probability of outputting the response $T$ is given by
$$
\begin{aligned}
p(T|S) &= \prod_{k=1}^{N_t} p(t_k \mid s_1, s_2, ..., s_{N_s}, t_1, t_2, ..., t_{k-1}) \\
&= \prod_{k=1}^{N_t} \frac{\exp(f(h_{k-1}, e_{y_k}))}{\sum_{y'} \exp(f(h_{k-1}, e_{y'}))} .
\end{aligned}
$$
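As a rough illustration of Eqs. (5.1)-(5.6) and of the product of softmax terms above, the NumPy sketch below encodes a source sequence and then scores a response token by token. Several details are assumptions made only for this example: the score $f(h_{k-1}, e_y)$ is taken to be a plain dot product, a single LSTM is shared between source and target, and the function names, dimensions and random parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, e_k, W):
    """One LSTM step; each W[...] has shape (d, 2d) as in Eqs. (5.1)-(5.6)."""
    x = np.concatenate([h_prev, e_k])      # [h_{k-1}; e_k]
    i_k = sigmoid(W["i"] @ x)
    f_k = sigmoid(W["f"] @ x)
    o_k = sigmoid(W["o"] @ x)
    l_k = np.tanh(W["l"] @ x)
    c_k = f_k * c_prev + i_k * l_k
    h_k = o_k * np.tanh(c_k)
    return h_k, c_k

def log_prob_response(source_ids, target_ids, E_src, E_tgt, W):
    """log p(T|S): encode the source, then sum per-token log-softmax scores."""
    d = E_src.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    for s in source_ids:
        h, c = lstm_step(h, c, E_src[s], W)
    total = 0.0
    for t in target_ids:
        scores = E_tgt @ h                 # assumed f(h_{k-1}, e_y) = h . e_y, for all y
        scores = scores - scores.max()     # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())
        total += log_probs[t]
        h, c = lstm_step(h, c, E_tgt[t], W)
    return total

rng = np.random.default_rng(0)
d, vocab_size = 8, 6
W = {name: rng.normal(scale=0.1, size=(d, 2 * d)) for name in ("i", "f", "o", "l")}
E_src = rng.normal(size=(vocab_size, d))   # source word embeddings
E_tgt = rng.normal(size=(vocab_size, d))   # target word embeddings (distinct from source)
log_p = log_prob_response([1, 2, 3], [4, 0], E_src, E_tgt, W)
```

Working in log space turns the product over target positions into a sum, which avoids numerical underflow for long responses.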

5.1.2 The HRED Model

While the LSTM model has been shown to be effective in encoding textual contexts up to 500
words (Khandelwal et al., 2018), dialogue histories can often be long and there is sometimes a need
to exploit longer-term context. Hierarchical models were designed to address this limitation by
capturing longer context (Yao et al., 2015; Serban et al., 2016, 2017; Xing et al., 2018). One pop-
ular approach is the Hierarchical Recurrent Encoder-Decoder (HRED) model, originally proposed
in (Sordoni et al., 2015a) for query suggestion and applied to response generation in (Serban et al.,
2016).
The HRED architecture is depicted in Fig. 5.1, where it is compared to the standard RNN archi-
tecture. HRED models dialogue using a two-level hierarchy that combines two RNNs: one at a
word level and one at the dialogue turn level. This architecture models the fact that dialogue history
consists of a sequence of turns, each consisting of a sequence of tokens. This model introduces a
temporal structure that makes the hidden state of the current dialogue turn directly dependent on the
hidden state of the previous dialogue turn, effectively allowing information to flow over longer time
spans, and helping reduce the vanishing gradient problem (Hochreiter, 1991), a problem that limits
RNN’s (including LSTM’s) ability to model very long word sequences. Note that, in this particular
work, RNN hidden states are implemented using GRU (Cho et al., 2014b) instead of LSTM.
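The two-level hierarchy can be sketched as two nested GRU loops: a word-level GRU summarizes each turn, and a turn-level GRU consumes those summaries. The snippet below is a minimal encoder-only illustration under assumed dimensions and random weights; the helper names are hypothetical, the decoder is omitted, and this is not the code of Serban et al. (2016).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random GRU weights acting on the concatenation [h; x] (hypothetical helper)."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {name: rng.normal(scale=0.1, size=shape) for name in ("z", "r", "h")}

def gru_step(h_prev, x, W):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W["z"] @ hx)                                   # update gate
    r = sigmoid(W["r"] @ hx)                                   # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x]))
    return (1.0 - z) * h_prev + z * h_tilde

def hred_encode(turns, embeddings, word_gru, turn_gru, hidden_dim):
    """Return the dialogue-level context vector after reading all turns."""
    context = np.zeros(hidden_dim)
    for turn in turns:
        h = np.zeros(hidden_dim)
        for token_id in turn:                                  # word-level RNN
            h = gru_step(h, embeddings[token_id], word_gru)
        context = gru_step(context, h, turn_gru)               # turn-level RNN
    return context

rng = np.random.default_rng(0)
emb_dim, hidden_dim, vocab_size = 8, 16, 50
embeddings = rng.normal(size=(vocab_size, emb_dim))
word_gru = init_gru(emb_dim, hidden_dim, rng)
turn_gru = init_gru(hidden_dim, hidden_dim, rng)
dialogue = [[3, 14, 15], [9, 2, 6], [5, 35, 8]]                # three turns of three tokens
context = hred_encode(dialogue, embeddings, word_gru, turn_gru, hidden_dim)
```

In this sketch the turn-level state is updated once per turn rather than once per word, so information from early turns passes through far fewer recurrent steps before it reaches the end of the dialogue.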

1 The notation distinguishes $e$ and $h$, where $e_k$ is the embedding vector for an individual word at time step $k$,
and $h_k$ is the vector computed by the LSTM model at time $k$ by combining $e_k$ and $h_{k-1}$. $c_k$ is the cell state
vector at time $k$, and $\sigma$ represents the sigmoid function.

Figure 5.1: (a) Recurrent architecture used by models such as RNN, GRU, LSTM, etc. (b) Two-level
hierarchy representative of HRED. Note: To simplify the notation, the figure represents utterances
of length 3.

5.1.3 Attention Models

The seq2seq framework has been tremendously successful in text generation tasks such as machine
translation, but its encoding of the entire source sequence into a fixed-size vector has certain limita-
tions, especially when dealing with long source sequences. Attention-based models (Bahdanau et al.,
2015; Vaswani et al., 2017) alleviate this limitation by allowing the model to search and condition on
parts of a source sentence that are relevant to predicting the next target word, thus moving away from
a framework that represents the entire source sequence merely as a single fixed-size vector. While
attention models and variants (Bahdanau et al., 2015; Luong et al., 2015, etc.) have contributed to
significant progress in the state-of-the-art in translation (Wu et al., 2016) and are very commonly
used in neural machine translation nowadays, attention models have been somewhat less effective in
E2E dialogue modeling. This can probably be explained by the fact that attention models effectively
attempt to “jointly translate and align” (Bahdanau et al., 2015), which is a desirable goal in machine
translation as each information piece in the source sequence (foreign sentence) typically needs to
be conveyed in the target (translation) exactly once, but this is less true in dialogue data. Indeed,
in dialogue entire spans of the source may not map to anything in the target and vice-versa.2 Some
specific attention models for dialogue have been shown to be useful (Yao et al., 2015; Mei et al.,
2017; Shao et al., 2017), e.g., to avoid word repetitions (which are discussed further in Sec. 5.2).
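The core attention computation can be sketched in a few lines: score every encoder state against the current decoder state, normalize the scores with a softmax, and use the resulting weights to build a context vector. The snippet uses a simple dot-product score, in the style of Luong et al. (2015); Bahdanau et al. (2015) instead score with a small feed-forward network. The function names and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)                      # numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(query, encoder_states):
    """query: (d,) decoder state; encoder_states: (n, d), one row per source position."""
    scores = encoder_states @ query        # one relevance score per source position
    weights = softmax(scores)              # attention distribution over the source
    context = weights @ encoder_states     # weighted average of the encoder states
    return context, weights

encoder_states = np.array([[4.0, 0.0],
                           [0.0, 4.0],
                           [1.0, 1.0]])
query = np.array([1.0, 0.0])
context, weights = dot_product_attention(query, encoder_states)
```

The context vector is recomputed at every decoding step, so the decoder is no longer limited to a single fixed-size summary of the source.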

5.1.4 Pointer-Network Models

Multiple model extensions (Gu et al., 2016; He et al., 2017a) of the seq2seq framework improve
the model’s ability to “copy and paste” words between the conversational context and the response.
Compared to other tasks such as translation, this ability is particularly important in dialogue, as the
response often repeats spans of the input (e.g., “good morning” in response to “good morning”)
or uses rare words such as proper nouns, which the model would have difficulty generating with
a standard RNN. Originally inspired by the Pointer Network model (Vinyals et al., 2015a)—which
produces an output sequence consisting of elements from the input sequence—these models hypoth-
esize target words that are either drawn from a fixed-size vocabulary (akin to a seq2seq model) or
selected from the source sequence (akin to a pointer network) using an attention mechanism. An
instance of this model is CopyNet (Gu et al., 2016), which was shown to significantly improve over
RNNs thanks to its ability to repeat proper nouns and other words of the input.
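A simplified view of the copy mechanism is sketched below: generation scores over a fixed vocabulary and copy scores over source positions are normalized together, and the probability mass of each copied position is added to the corresponding word. This follows the spirit of such models rather than the exact CopyNet formulation; the function name, the vocabulary and the scores are hypothetical.

```python
import numpy as np

def copy_generate_distribution(vocab, gen_scores, source_tokens, copy_scores):
    """Return a word -> probability mapping over the vocabulary plus the source words."""
    scores = np.concatenate([gen_scores, copy_scores])
    scores = scores - scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()      # one softmax over both modes
    dist = {w: float(p) for w, p in zip(vocab, probs[:len(vocab)])}
    for tok, p in zip(source_tokens, probs[len(vocab):]):
        dist[tok] = dist.get(tok, 0.0) + float(p)      # copied mass adds to the word
    return dist

vocab = ["<unk>", "i", "live", "in"]
gen_scores = np.array([0.1, 0.2, 0.3, 0.4])            # scores for vocabulary words
source_tokens = ["i", "live", "in", "Zurich"]          # "Zurich" is out of vocabulary
copy_scores = np.array([0.0, 0.0, 0.0, 2.0])           # scores for source positions
dist = copy_generate_distribution(vocab, gen_scores, source_tokens, copy_scores)
best_word = max(dist, key=dist.get)
```

In this example the proper noun receives probability only through the copy scores, since it has no entry in the fixed vocabulary.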

2 Ritter et al. (2011) also found that an off-the-shelf word aligner (Och and Ney,
2003) produced alignments of poor quality, and an extension of their work with attention models (Ritter 2018,
pc) yielded attention scores that did not correspond to meaningful alignments.

5.2 Challenges and Remedies
The response generation task faces challenges that are rather specific to conversation modeling.
Much of the recent research is aimed at addressing the following issues.

5.2.1 Response Blandness

Utterances generated by neural response generation systems are often bland and deflective. While
this problem has been noted in other tasks such as image captioning (Mao et al., 2015), the problem
is particularly acute in E2E response generation, as commonly used models such as seq2seq tend to
generate uninformative responses such as “I don’t know” or “I’m OK”. Li et al. (2016a) suggested
that this is due to their training objective, which optimizes the likelihood of the training data accord-
ing to p(T |S), where S is the source (dialogue history) and T is the target response. The objective
p(T |S) is asymmetrical in T and S, which causes the trained systems to prefer responses T that
unconditionally enjoy high probability, i.e., irrespective of the context S. For example, such systems
often respond “I don’t know” if S is a question, as the response “I don’t know” is plausible for
almost all questions. Li et al. (2016a) suggested replacing the conditional probability p(T |S) with
mutual information $\frac{p(T,S)}{p(T)p(S)}$ as an objective, since the latter formulation is symmetrical in S and T ,
thus giving no incentive for the learner to bias responses T to be particularly bland and deflective,
unless such a bias emerges from the training data itself. While this argument may be true in general,
optimizing the mutual information objective (also known as Maximum Mutual Information or MMI
(Huang et al., 2001)) can be challenging, so Li et al. (2016a) used that objective at inference time.
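More specifically, given a conversation history S, the goal at inference time is to find the best T
according to:3
$$
\begin{aligned}
\hat{T} &= \operatorname*{argmax}_T \left\{ \log \frac{p(S, T)}{p(S)p(T)} \right\} \\
&= \operatorname*{argmax}_T \left\{ \log p(T|S) - \log p(T) \right\}
\end{aligned}
\tag{5.7}
$$
A hyperparameter λ was introduced to control how much to penalize generic responses, with either of two
formulations:4
$$
\begin{aligned}
\hat{T} &= \operatorname*{argmax}_T \left\{ \log p(T|S) - \lambda \log p(T) \right\} \\
&= \operatorname*{argmax}_T \left\{ (1 - \lambda) \log p(T|S) + \lambda \log p(S|T) - \lambda \log p(S) \right\} \\
&= \operatorname*{argmax}_T \left\{ (1 - \lambda) \log p(T|S) + \lambda \log p(S|T) \right\} .
\end{aligned}
\tag{5.8}
$$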
Thus, this weighted MMI objective function can be viewed as representing a tradeoff between
sources given targets (i.e., p(S|T )) and targets given sources (i.e., p(T |S)), which is also a tradeoff
between response appropriateness and lack of blandness. Note, however, that despite this trade-
off, Li et al. (2016a) have not entirely solved the blandness problem, as this objective is only used
at inference and not training time. This approach first generates N -best lists according to p(T |S)
and rescores them with MMI. Since such N -best lists tend to be overall relatively bland due to the
p(T |S) inference criterion (beam search), MMI rescoring often mitigates rather than completely
eliminates the blandness problem.
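The N-best rescoring step can be sketched as follows, using the first formulation in Eq. (5.8), log p(T|S) − λ log p(T). The candidate responses and their log-probabilities are made-up numbers for illustration; in practice log p(T|S) would come from the seq2seq model and log p(T) from a separate language model.

```python
def mmi_rerank(candidates, lam):
    """Sort N-best candidates by log p(T|S) - lam * log p(T), best first.

    Each candidate is a tuple (response, log_p_t_given_s, log_p_t).
    """
    def score(candidate):
        _, log_p_t_given_s, log_p_t = candidate
        return log_p_t_given_s - lam * log_p_t
    return sorted(candidates, key=score, reverse=True)

# Hypothetical N-best list: the bland response is likely under both models.
n_best = [
    ("I don't know", -2.0, -3.0),
    ("It opens at nine", -2.5, -9.0),
]
best_plain = mmi_rerank(n_best, lam=0.0)[0][0]   # plain likelihood ranking
best_mmi = mmi_rerank(n_best, lam=0.5)[0][0]     # generic responses penalized
```

With λ = 0 the ranking reduces to ordinary likelihood and the bland response wins; with λ = 0.5 the high unconditional probability of the bland response is penalized and the more specific candidate is ranked first. Rescoring can only reorder what beam search has already proposed, which is why it mitigates rather than removes blandness.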
More recently, researchers (Li et al., 2017c; Xu et al., 2017; Zhang et al., 2018e) have used adver-
sarial training and Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), which often
have the effect of reducing blandness. Intuitively, the effect of GAN on blandness can be understood
as follows: adversarial training pits a Generator and a Discriminator against each other (hence the
term “adversarial”) using a minimax objective, and the objective for each of them is to make their
counterpart the least effective. The Generator is the response generation system to be deployed,
while the goal of the Discriminator is to be able to identify whether a given response is generated
by a human (i.e., from the training data) or is the output of the Generator. Then, if the Generator
3 Recall that $\log \frac{p(S,T)}{p(S)\,p(T)} = \log \frac{p(T|S)}{p(T)} = \log p(T|S) - \log p(T)$.
4 The second formulation is derived from: $\log p(T) = \log p(T|S) + \log p(S) - \log p(S|T)$.

always responds “I don’t know” or with other deflective responses, the Discriminator would have
little problem distinguishing them from human responses in most of the cases, as most humans do
not respond with “I don’t know” all the time. Therefore, in order to fool the Discriminator, the Gen-
erator progressively steers away from such predictable responses. More formally, the optimality of
GAN is achieved when the hypothesis distribution matches the oracle distribution, thus encouraging
the generated responses to spread out to reflect the true diversity of real responses. To promote more
diversity, Zhang et al. (2018e) explicitly optimize a variational lower bound on pairwise mutual in-
formation between query and response to encourage generating more informative responses during
training time.
Serban et al. (2017) presented a latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED)
model that also aims to generate less bland and more specific responses. It extends the HRED
model described previously in this chapter, by adding a high-dimensional stochastic latent variable
to the target. This additional latent variable is meant to address the challenge associated with the
shallow generation process. As noted in (Serban et al., 2017), this process is problematic from an
inference standpoint because the generation model is forced to produce a high-level structure—i.e.,
an entire response—on a word-by-word basis. This generation process is made easier in the VHRED
model, as the model exploits a high-dimensional latent variable that determines high-level aspects
of the response (topic, names, verb, etc.), so that the other parts of the model can focus on lower-
level aspects of generation, e.g., ensuring fluency. The VHRED model incidentally helps reduce
blandness, as suggested by sample outputs of (Serban et al., 2017). Indeed, as the content of the
response is conditioned on the latent variable, the generated response is only bland and devoid of
semantic content if the latent variable determines that the response should be as such. More recently,
Zhang et al. (2018b) presented a model that also introduces an additional variable (modeled using a
Gaussian kernel layer), which is added to control the level of specificity of the response, going from
bland to very specific.
While most response generation systems surveyed earlier in this chapter are generation-based (i.e.,
generating new sentences word-by-word), a more conservative solution to mitigating blandness is
to replace generation-based models with retrieval-based models for response generation (Jafarpour
et al., 2010; Lu and Li, 2014; Inaba and Takahashi, 2016; Al-Rfou et al., 2016; Yan et al., 2016), in
which the pool of possible responses is constructed in advance (e.g., pre-existing human responses).
These approaches come at the cost of reduced flexibility: In generation, the set of possible responses
grows exponentially in the number of words, but the set of responses of a retrieval system is fixed,
and as such retrieval systems often do not have any appropriate responses for many conversational
inputs. Despite this limitation, retrieval systems have been widely used in popular commercial
systems, and we survey them in Chapter 6.

5.2.2 Speaker Consistency

It has been shown that the popular seq2seq approach often produces conversations that are inco-
herent (Li et al., 2016b), where the system may for instance contradict what it had just said in the
previous turn (or sometimes even in the same turn). While some of this effect can be attributed to
the limitation of the learning algorithms, Li et al. (2016b) suggested that the main cause of this
inconsistency is probably the training data itself. Indeed, conversational datasets (see Sec. 5.5)
feature multiple speakers, which often have different or conflicting personas and backgrounds. For
example, to the question “how old are you?”, a seq2seq model may give valid responses such as
“23”, “27”, or “40”, all of which are represented in the training data.
This sets apart the response generation task from more traditional NLP tasks: While models for
other tasks such as machine translation are trained on data that is mostly one-to-one semantically,
conversational data is often one-to-many or many-to-many as the above example implies.5 As one-
to-many training instances are akin to noise to any learning algorithm, one needs more expressive
models that exploit a richer input to better account for such diverse responses.
To do this, Li et al. (2016b) proposed a persona-based response generation system, which is an
extension of the LSTM model of Sec. 5.1.1 that uses speaker embeddings in addition to word em-
beddings. Intuitively, these two types of embeddings work similarly: while word embeddings form
5
Conversational data is also many-to-one, for example with multiple semantically-unrelated inputs that map
to “I don’t know.”

[Figure: encoder-decoder diagram. Source: “where do you live EOS”; target: “in england . EOS”, with the speaker embedding of “Rob” fed to each decoder step. Panels: Speaker embeddings (70k), with example user names such as Rob_712; Word embeddings (50k), with example words such as england, london, u.s.]

Figure 5.2: Persona-based response generation system. Figure credit: Li et al. (2016b)

a latent space in which spatial proximity (i.e., low Euclidean distance) means two words are
semantically or functionally close, speaker embeddings also constitute a latent space in which two nearby
speakers tend to converse in the same way, e.g., having similar speaking styles (e.g., British English)
or often talking about the same topic (e.g., sports).
Like word embeddings, speaker embedding parameters are learned jointly with all other parameters
of the model from their one-hot representations. At inference time, one just needs to specify the
one-hot encoding of the desired speaker to produce a response that reflects her speaking style. The
global architecture of the model is displayed in Fig. 5.2, which shows that each target hidden state is
conditioned not only on the previous hidden state and the current word embedding (e.g., “England”),
but also on the speaker embedding (e.g., of “Rob”). This model not only helps generate more
personalized responses, but also alleviates the one-to-many modeling problem mentioned earlier.
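The conditioning described above can be sketched with a toy decoder step. This is an illustration under stated assumptions, not the model of Li et al. (2016b): a plain tanh recurrent cell stands in for the LSTM, all weights are random rather than learned, and the vocabulary, speaker list, and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<eos>", "in", "england", "london", "."]  # toy vocabulary (hypothetical)
SPEAKERS = ["Rob", "D_Gomes25"]                      # toy speaker inventory (hypothetical)
EMB, HID = 8, 16

# In a real system all of these parameters are learned jointly; here they are random.
word_emb = rng.normal(scale=0.1, size=(len(VOCAB), EMB))
speaker_emb = rng.normal(scale=0.1, size=(len(SPEAKERS), EMB))
W_in = rng.normal(scale=0.1, size=(HID, 2 * EMB))   # acts on [word embedding; speaker embedding]
W_hh = rng.normal(scale=0.1, size=(HID, HID))
W_out = rng.normal(scale=0.1, size=(len(VOCAB), HID))


def decoder_step(h_prev, word_id, speaker_id):
    """One step conditioned on the previous state, the current word, and the speaker."""
    x = np.concatenate([word_emb[word_id], speaker_emb[speaker_id]])
    h = np.tanh(W_in @ x + W_hh @ h_prev)   # simplified cell; the original uses an LSTM
    logits = W_out @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()


h0 = np.zeros(HID)
_, p_rob = decoder_step(h0, VOCAB.index("in"), SPEAKERS.index("Rob"))
_, p_other = decoder_step(h0, VOCAB.index("in"), SPEAKERS.index("D_Gomes25"))
```

Indexing a row of `speaker_emb` is equivalent to multiplying the embedding matrix by the speaker's one-hot vector, which is why specifying a speaker ID at inference time is enough to change the next-word distribution (`p_rob` versus `p_other`).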
Other approaches also utilized personalized information. For example, Al-Rfou et al. (2016) pre-
sented a persona-based response generation model, but geared for retrieval using an extremely large
dataset consisting of 2.1 billion responses. Their retrieval model is implemented as a binary classi-
fier (i.e., good response or not) using a deep neural network. The distinctive feature of their model
is a multi-loss objective, which augments a single-loss model p(R|I, A, C) of the response R, input
I, speaker (“author”) A, and context C, by adding auxiliary losses that, e.g., model the probability
of the response given the author p(R|A). This multi-loss model was shown to be quite helpful (Al-
Rfou et al., 2016), as the multiple losses help cope with the fact that certain traits of the author are
often correlated with the context or input, which makes it difficult to learn a good speaker embedding
representation. By adding a loss for p(R|A), the model is able to learn a more distinctive speaker
embedding representation for the author.
More recently, Luan et al. (2017) presented an extension of the speaker embedding model of Li et al.
(2016b), which combines a seq2seq model trained on conversational datasets with an autoencoder
trained on non-conversational data, where the seq2seq and autoencoder are combined in a multi-
task learning setup (Caruana, 1998). The tying of the decoder parameters of both seq2seq and
autoencoder enables Luan et al. (2017) to train a response generation system for a given persona
without actually requiring any conversational data available for that persona. This is an advantage
of their approach, as conversational data for a given user or persona might not always be available.
In (Bhatia et al., 2017), the idea of (Li et al., 2016b) is extended to a social-graph embedding model.
While (Serban et al., 2017) is not a persona-based response generation model per se, their work
shares some similarities with speaker embedding models such as (Li et al., 2016b). Indeed, both
Li et al. (2016b) and Serban et al. (2017) introduced a continuous high-dimensional variable in the
target side of the model in order to bias the response towards information encoded in a vector. In
the case of (Serban et al., 2017), that variable is latent, and is trained by maximizing a variational
lower-bound on the log-likelihood. In the case of (Li et al., 2016b), the variable (i.e., the speaker
embedding) is technically also latent, although it is a direct function of the one-hot representation
of the speaker. The model of Li et al. (2016b) might be a good fit when utterance-level information (e.g., speaker ID
or topic) is available. On the other hand, the strength of (Serban et al., 2017) is that it learns a latent

variable that best “explains” the data, and may learn a representation that is more optimal than the
one based strictly on speaker or topic information.

5.2.3 Word Repetitions

Word or content repetition is a common problem with neural generation tasks other than machine
translation, as has been noted with tasks such as response generation, image captioning, visual story
generation, and general language modeling (Shao et al., 2017; Huang et al., 2018; Holtzman et al.,
2018). While machine translation is a relatively one-to-one task where each piece of information in
the source (e.g., a name) is usually conveyed exactly once in the target, other tasks such as dialogue
or story generation are much less constrained, and a given word or phrase in the source can map
to zero or multiple words or phrases in the target. This effectively makes the response generation
task much more challenging, as generating a given word or phrase doesn’t completely preclude the
need to generate the same word or phrase again. While the attention model (Bahdanau et al.,
2015) helps prevent repetition errors in machine translation as that task is relatively one-to-one,6 the
attention models originally designed for machine translation (Bahdanau et al., 2015; Luong et al.,
2015) often do not help reduce word repetitions in dialogue.
In light of the above limitations, Shao et al. (2017) proposed a new model that adds self-attention to
the decoder, aiming at improving the generation of longer and coherent responses while incidentally
mitigating the word repetition problem. Target-side attention helps the model more easily keep
track of what information has been generated in the output so far,7 so that the model can more easily
discriminate against unwanted word or phrase repetitions.
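As a rough illustration of target-side attention, the sketch below lets the current decoder state attend over the decoder's own previous hidden states. It uses generic dot-product attention with random vectors and is an assumption made for illustration; it does not reproduce the specific architecture of Shao et al. (2017).

```python
import numpy as np


def target_side_attention(prev_states, query):
    """Dot-product attention over the decoder's own previous hidden states."""
    scores = prev_states @ query                 # one score per generated position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over generated positions
    context = weights @ prev_states              # summary of what has been generated so far
    return context, weights


rng = np.random.default_rng(0)
prev_states = rng.normal(size=(4, 8))   # 4 already-generated positions, hidden size 8 (toy values)
query = prev_states[2]                  # current state chosen to resemble position 2
context, weights = target_side_attention(prev_states, query)
```

The context vector gives the decoder direct access to every earlier output position instead of relying on a single fixed-size hidden state, which is the property the paragraph above credits with making repetitions easier to detect.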

5.2.4 Further Challenges

The above issues are significant problems that have only been partially solved and that require fur-
ther investigation. However, a much bigger challenge faced by these E2E systems is response ap-
propriateness. As explained in Chapter 1, one of the most distinctive characteristics of earlier E2E
systems, when compared to traditional dialogue systems, is their lack of grounding. When asked
“what is the weather forecast for tomorrow?”, E2E systems are likely to produce responses such
as “sunny” and “rainy”, without a principled basis for selecting one response or the other, as the
context or input might not even specify a geographical location. Ghazvininejad et al. (2018) argued
that seq2seq and similar models are usually quite good at producing responses that have plausible
overall structure, but often struggle when it comes to generating names and facts that connect to the
real world, due to the lack of grounding. In other words, responses are often pragmatically correct
(e.g., a question would usually be followed by an answer, and an apology by a downplay), but the
semantic content of the response is often inappropriate. Hence, recent research in E2E dialogue has
increasingly focused on designing grounded neural conversation models, which we will survey next.

5.3 Grounded Conversation Models

Unlike task-oriented dialogue systems, most E2E conversation models are not grounded in the real
world, which prevents these systems from effectively conversing about anything that relates to the
user’s environment. This limitation is also inherited from machine translation, which neither models
nor needs grounding. Recent approaches to neural response generation address this problem
by grounding systems in the persona of the speaker or addressee (Li et al., 2016b; Al-Rfou et al.,
2016), textual knowledge sources such as Foursquare (Ghazvininejad et al., 2018), the user’s or
agent’s visual environment (Das et al., 2017a; Mostafazadeh et al., 2017), and affect or emotion of
the user (Huber et al., 2018; Winata et al., 2017; Xu et al., 2018). At a high level, most of these works
have in common the idea of augmenting their context encoder to not only represent the conversation
history, but also some additional input drawn from the user’s environment, such as an image (Das
et al., 2017a; Mostafazadeh et al., 2017) or textual information (Ghazvininejad et al., 2018).

6
Ding et al. (2017) indeed found that word repetition errors, usually few in machine translation, are often
caused by incorrect attention.
7
A seq2seq model can also keep track of what information has been generated so far. However, this becomes
more difficult as contexts and responses become longer, as a seq2seq hidden state is a fixed-size vector.

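[Figure: the conversation history (“Going to Kusakabe tonight”) feeds a dialog encoder; world facts are filtered down to contextually relevant facts (e.g., “Consistently the best omakase”, “Amazing sushi tasting […]”, “They were out of kaisui […]”), which feed a facts encoder; the two encodings are combined (Σ) and passed to a decoder that produces the response (“Try omakase, the best in town”).]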

Figure 5.3: A neural conversation model grounded in “facts” relevant to the current conversation.
Figure credit: Ghazvininejad et al. (2018)

As an illustrative example of such grounded models, we give a brief overview of Ghazvininejad
et al. (2018), whose underlying model is depicted in Fig. 5.3. The model mainly consists of two
encoders and one decoder. The decoder and the dialogue encoder are similar to those of standard
seq2seq models. The additional encoder is called the facts encoder, which infuses into the model
factual information or so-called facts relevant to the conversation history, e.g., restaurant reviews
(e.g., “amazing sushi tasting”) that pertain to a restaurant that happened to be mentioned in the
conversation history (e.g., “Kusakabe”). While the model in this work was trained and evaluated
with Foursquare reviews, this approach makes no specific assumption that the grounding consists
of reviews, or that trigger words are restaurants (in fact, some of the trigger words are, e.g., hotels
and museums). To find facts that are relevant to the conversation, their system uses an IR system
to retrieve text from a very large collection of facts or world facts (e.g., all Foursquare reviews of
several large cities) using search words extracted from the conversation context. While the dialogue
encoder of this model is a standard LSTM, the facts encoder is an instance of the Memory Net-
work of Chen et al. (2016b), which uses an associative memory for modeling the facts relevant to a
particular problem, which in this case is a restaurant mentioned in a conversation.
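The role of the facts encoder can be sketched as a single-hop memory read in which the dialogue encoding acts as the query over the retrieved facts. The sketch below is an assumption-laden illustration rather than the published implementation: the matrices `A` and `C`, the bag-of-words fact representation, the additive combination, and all dimensions are hypothetical choices, and the parameters are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, HID = 50, 16

A = rng.normal(scale=0.1, size=(HID, VOCAB_SIZE))  # maps each fact to a memory key
C = rng.normal(scale=0.1, size=(HID, VOCAB_SIZE))  # maps each fact to a memory value


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def facts_encoder(u, fact_bows):
    """Attend over retrieved facts using the dialogue encoding u as the query."""
    keys = fact_bows @ A.T        # shape (num_facts, HID)
    values = fact_bows @ C.T      # shape (num_facts, HID)
    attn = softmax(keys @ u)      # relevance of each fact to the conversation
    o = attn @ values             # weighted summary of the facts
    return u + o, attn            # combined state that would initialise the decoder


u = rng.normal(size=HID)  # stand-in for the LSTM encoding of the conversation history
facts = rng.integers(0, 2, size=(3, VOCAB_SIZE)).astype(float)  # 3 toy bag-of-words facts
u_hat, attn = facts_encoder(u, facts)
```

Because the facts enter only through `fact_bows`, swapping in newly retrieved facts changes `u_hat`, and hence the response, without retraining any parameters.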
There are two main benefits to this approach and other similar work on grounded conversation
modeling. First, the approach splits the input of the E2E system into two parts: the input from
the user and the input from her environment. This separation is crucial because it addresses the
limitation of earlier E2E (e.g., seq2seq) models which always respond deterministically to the same
query (e.g., to “what’s the weather forecast for tomorrow?”). By splitting input into two sources
(user and environment), the system can effectively generate different responses to the same user
input depending on what has changed in the real world, without having to retrain the entire system.
Second, this approach is much more sample efficient compared to a standard seq2seq approach.
For an ungrounded system to produce a response like the one in Fig. 5.3, the system would require
that every entity any user might conceivably talk about (e.g., “Kusakabe” restaurant) be seen in
the conversational training data, which is an unrealistic and impractical assumption. While the
amount of language modeling data (i.e., non-conversational data) is abundant and can be used to
train grounded conversation systems (e.g., using Wikipedia, Foursquare), the amount of available
conversational data is typically much more limited. Grounded conversational models don’t have
that limitation, and, e.g., the system of Ghazvininejad et al. (2018) can converse about venues that
are not even mentioned in the conversational training data.

5.4 Beyond Supervised Learning


There is often a sharp disconnect between conversational training data (human-to-human) and envi-
sioned online scenarios (human-computer). This makes it difficult to optimize conversation models
towards specific objectives, e.g., maximizing engagement by reducing blandness. Another
limitation of the supervised learning setup is its tendency to optimize for an immediate reward (i.e., one

Figure 5.4: Deep reinforcement learning for response generation, pitching the system to optimize
against a user simulator (both systems are E2E generation systems.) Figure credit: Li et al. (2017b)

response at a time) rather than a long-term reward. This also partially explains why their responses
are often bland and thus fail to promote long-term user engagement. To address these limitations,
some researchers have explored reinforcement learning (RL) for E2E systems (Li et al., 2016c)
which could be augmented with human-in-the-loop architectures (Li et al., 2017a,b). Unlike RL
for task-oriented dialogue, a main challenge that E2E systems are facing is the lack of well-defined
metrics for success (i.e., reward functions), in part because they have to deal with informal genres
such as chitchat, where the user goal is not explicitly specified.
Li et al. (2016c) constitutes the first attempt to use RL in a fully E2E approach to conversational
response generation. Instead of training the system on human-to-human conversations as in the
supervised setup of (Sordoni et al., 2015b; Vinyals and Le, 2015), the system of Li et al. (2016c) is
trained by conversing with a user simulator which mimics human users’ behaviors.
As depicted in Fig. 5.4, human users have to be replaced with a user simulator because it is pro-
hibitively expensive to train an RL system using thousands or tens of thousands of turns of real user
dialogues. In this work, a standard seq2seq model is used as a user simulator. The system is trained
using policy gradient (Sec. 2.3). The objective is to maximize the expected total reward over the
dialogues generated by the user simulator and the agent to be learned. Formally, the objective is
\[
J(\theta) = \mathbb{E}\left[ R(T_1, T_2, \ldots, T_N) \right] \tag{5.9}
\]
where R(·) is the reward function, and Ti’s are dialogue turns. The above objective can be optimized
using gradient descent, by factoring the log probability of the conversation and the aggregated
reward, which is independent of the model parameters:
\[
\begin{aligned}
\nabla J(\theta) &= \nabla \log p(T_1, T_2, \ldots, T_N)\, R(T_1, T_2, \ldots, T_N) \\
                 &\simeq \nabla \log \prod_i p(T_i | T_{i-1})\, R(T_1, T_2, \ldots, T_N)
\end{aligned} \tag{5.10}
\]
where p(Ti |Ti−1 ) is parameterized the same way as the standard seq2seq model of Sec. 5.1.1, except
that the model here is optimized using RL. The above gradient is often approximated using sampling,
and Li et al. (2016c) used a single sampled conversation for each parameter update. While the above
policy gradient setup is relatively common in RL, the main challenge in learning dialogue models
is how to devise an effective reward function. Li et al. (2016c) used a combination of three reward
functions that are designed to mitigate the problems of the supervised seq2seq model, which was
used in their work as initialization parameters. The three reward functions are:

• −p(Dull Response|Ti ): Li et al. (2016c) created a short list of dull responses such as “I
don’t know” selected from the training data. This reward function penalizes those turns Ti
that are likely to lead to any of these dull responses. This is called the ease of answering
reward, as it promotes conversational turns that are not too difficult to respond to, so as
to keep the user engaged in the conversation. For example, the reward function gives a
very low reward to turns whose response is “I don’t know”, as this evasive response indi-
cates that the previous turn was difficult to respond to, which may ultimately terminate the
conversation.

• − log Sigmoid cos(Ti−1 , Ti ): This information flow reward function ensures that consecu-
tive turns Ti−1 and Ti are not very similar to each other (e.g., “how are you?” followed by
“how are you?”), as Li et al. (2016c) assumed that conversations with little new information
are often not engaging and therefore more likely to be terminated.
• log p(Ti−1 |Ti ) + log p(Ti |Ti−1 ): This meaningfulness reward function was mostly intro-
duced to counterbalance the aforementioned two rewards. For example, the two other re-
ward functions prefer the type of conversations that constantly introduce new information
and change topics so frequently that users find them hard to follow. To avoid this, the
meaningfulness reward encourages consecutive turns in a dialogue session to be related to
each other.
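The three reward functions listed above can be sketched as plain Python functions. This is a minimal illustration under assumptions: the inputs (a dull-response probability, turn vectors, and forward and backward log-probabilities) would come from trained models in practice, and the function names and default weights are hypothetical rather than taken from Li et al. (2016c).

```python
import math


def ease_of_answering(p_dull_given_turn):
    """-p(Dull Response | T_i): penalise turns likely to draw a dull reply."""
    return -p_dull_given_turn


def information_flow(prev_vec, cur_vec):
    """-log sigmoid(cos(T_{i-1}, T_i)): penalise near-duplicate consecutive turns."""
    dot = sum(a * b for a, b in zip(prev_vec, cur_vec))
    norm = math.sqrt(sum(a * a for a in prev_vec)) * math.sqrt(sum(b * b for b in cur_vec))
    cos = dot / norm
    return -math.log(1.0 / (1.0 + math.exp(-cos)))


def meaningfulness(logp_prev_given_cur, logp_cur_given_prev):
    """log p(T_{i-1}|T_i) + log p(T_i|T_{i-1}): keep consecutive turns related."""
    return logp_prev_given_cur + logp_cur_given_prev


def turn_reward(p_dull, prev_vec, cur_vec, logp_bwd, logp_fwd, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards; the weights are hypothetical placeholders."""
    w1, w2, w3 = weights
    return (w1 * ease_of_answering(p_dull)
            + w2 * information_flow(prev_vec, cur_vec)
            + w3 * meaningfulness(logp_bwd, logp_fwd))


# A turn that repeats the previous one scores lower on information flow
# than a turn whose vector is orthogonal to it.
repeat = information_flow([1.0, 0.0], [1.0, 0.0])
novel = information_flow([1.0, 0.0], [0.0, 1.0])
```

In the policy gradient of Eq. 5.10, a scalar of this kind, aggregated over the turns of a sampled conversation, plays the role of R and scales the gradient of the conversation's log-probability.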

5.5 Data
Serban et al. (2015) presented a comprehensive survey of existing datasets that are useful beyond the
E2E and social bot research. What distinguishes E2E conversation modeling from other NLP and
dialogue tasks is that data is available in very large quantities, thanks in part to social media (e.g.,
Twitter and Reddit). On the other hand, most of this social media data is neither redistributable
nor available through language resource organizations (such as the Linguistic Data Consortium),
which means there are still no established public datasets (either with Twitter or Reddit) for training
and testing response generation systems. Although these social media companies offer API access
to enable researchers to download social media posts in relatively small quantities and then to re-
construct conversations from them, the strict legal terms of the service specified by these companies
inevitably affect the reproducibility of the research. Most notably, Twitter makes certain tweets
(e.g., retracted tweets or tweets from suspended users) unavailable through the API and requires that
any such previously downloaded tweets be deleted. This makes it difficult to establish any standard
training or test datasets, as these datasets deplete over time.8 Consequently, the authors of most of
the papers cited in this chapter have created their own (subsets of) conversational data for training
and testing, and then evaluated their systems against baselines and competing systems on these fixed
datasets. Dodge et al. (2016) used an existing dataset to define standard training and test sets, but it
is relatively small. Some of the most notable E2E and chitchat datasets include:

• Twitter: Used since the first data-driven response generation systems (Ritter et al., 2011),
Twitter data offers a wealth of conversational data that is practically unbounded, as Twitter
produces new data each day that is more than most system developers can handle.9 While
the data itself is made accessible through the Twitter API as individual tweets, its metadata
easily enables the construction of conversation histories, e.g., between two users. This
dataset forms the basis of the DSTC Task 2 competition in 2017 (Hori and Hori, 2017).
• Reddit: Reddit is a social media source that is also practically unbounded, and represents
about 3.2 billion dialogue turns as of July 2018. It was for example used in Al-Rfou et al.
(2016) to build a large-scale response retrieval system. Reddit data is organized by topics
(i.e., “subreddits”), and its responses don’t have a character limit as opposed to Twitter.
• OpenSubtitles: This dataset consists of subtitles made available on the opensubtitles.org
website. It offers captions of many commercial movies, and contains about 8 billion words
as of 2011 in multiple languages (Tiedemann, 2012).
• Ubuntu: The Ubuntu dataset (Lowe et al., 2015) has also been used extensively for E2E
conversation modeling. It differs from other datasets such as Twitter in that it is less focused
on chitchat but more goal-oriented, as it contains many dialogues that are specific to the
Ubuntu operating system.
• Persona-Chat dataset: This crowdsourced dataset (Zhang et al., 2018c) was developed to
meet the need for conversational data where dialogues exhibit distinct user personas. In
collecting Persona-Chat, every crowdworker was asked to impersonate a given character
8
Anecdotally, the authors of Li et al. (2016a, pc) found that a Twitter dataset from 2013 had lost about 25%
of its tweets by 2015 due to retracted tweets and Twitter account suspensions.
9
For example, the latest official statistics from Twitter, dating back to 2014, state that Twitter users
post on average more than 500 million tweets per day:
https://blog.twitter.com/official/en_us/a/2014/the-2014-yearontwitter.html

described using five facts. Then that worker took part in dialogues while trying to stay in
character. The resulting dataset contains about 160k utterances.

5.6 Evaluation
Evaluation is a long-standing research topic for generation tasks such as machine translation and
summarization. E2E dialogue is no different. While it is common to evaluate response generation
systems using human raters (Ritter et al., 2011; Sordoni et al., 2015b; Shang et al., 2015, etc.), this
type of evaluation is often expensive and researchers often have to resort to automatic metrics for
quantifying day-to-day progress and for performing automatic system optimization. E2E dialogue
research mostly borrowed those metrics from machine translation and summarization, using string
and n-gram matching metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). Pro-
posed more recently, METEOR (Banerjee and Lavie, 2005) aims to improve BLEU by identifying
synonyms and paraphrases between the system output and the human reference, and has also been
used to evaluate dialogue. deltaBLEU (Galley et al., 2015) is an extension of BLEU that exploits
numerical ratings associated with conversational responses.
There has been significant debate as to whether such automatic metrics are actually appropriate for
evaluating conversational response generation systems. For example, Liu et al. (2016) argued that
they are not appropriate by showing that most of these machine translation metrics correlate poorly
with human judgments. However, their correlation analysis was performed at the sentence level, and
decent sentence-level correlation has long been known to be difficult to achieve even for machine
translation (Callison-Burch et al., 2009; Graham et al., 2015), the task for which the underlying
metrics (e.g., BLEU and METEOR) were specifically intended.10 In particular, BLEU (Papineni
et al., 2002) was designed from the outset to be used as a corpus-level rather than sentence-level
metric, since assessments based on n-gram matches are brittle when computed on a single sentence.
Indeed, the empirical study of Koehn (2004) suggested that BLEU is not reliable on test sets con-
sisting of fewer than 600 sentences. Koehn (2004)’s study was on translation, a task that is arguably
simpler than response generation, so the need to move beyond sentence-level correlation is proba-
bly even more critical in dialogue. When measured at a corpus- or system-level, correlations are
typically much higher than at the sentence level (Przybocki et al., 2009), e.g., with Spearman’s ρ
above 0.95 for the best metrics on WMT translation tasks (Graham and Baldwin, 2014).11 In the
case of dialogue, Galley et al. (2015) showed that the correlation of string-based metrics (BLEU and
deltaBLEU) increases significantly when the unit of measurement is larger than a sentence.
Specifically, their Spearman’s ρ coefficient goes up from 0.1 (essentially no correlation) at the sentence level
to nearly 0.5 when measuring correlation on corpora of 100 responses each.
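The gap between sentence-level and corpus-level correlation can be illustrated with a small simulation. Everything below is synthetic and assumed for illustration: 20 hypothetical systems with evenly spaced true quality, 100 responses each, and independent Gaussian noise on both the automatic metric and the human score. It does not reproduce the data or the numbers of Galley et al. (2015).

```python
import random


def ranks(xs):
    """Rank values 1..n (ties broken by position, adequate for continuous data)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


rng = random.Random(0)
NUM_SYSTEMS, PER_SYSTEM, NOISE_SD = 20, 100, 1.0  # synthetic setup (assumed)

sent_metric, sent_human = [], []  # one entry per response
sys_metric, sys_human = [], []    # one entry per system (mean over its responses)
for s in range(NUM_SYSTEMS):
    quality = s / (NUM_SYSTEMS - 1)  # true system quality in [0, 1]
    m = [quality + rng.gauss(0, NOISE_SD) for _ in range(PER_SYSTEM)]
    h = [quality + rng.gauss(0, NOISE_SD) for _ in range(PER_SYSTEM)]
    sent_metric += m
    sent_human += h
    sys_metric.append(sum(m) / PER_SYSTEM)
    sys_human.append(sum(h) / PER_SYSTEM)

rho_sentence = spearman(sent_metric, sent_human)
rho_system = spearman(sys_metric, sys_human)
print(f"sentence-level rho = {rho_sentence:.2f}, system-level rho = {rho_system:.2f}")
```

Averaging over 100 responses shrinks the per-response noise by a factor of ten while leaving the differences between systems intact, so the same metric that looks nearly uncorrelated with human scores on single responses ranks whole systems far more reliably.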
Recently, Lowe et al. (2017) proposed a machine-learned metric for E2E dialogue evaluation. They
presented a variant of the VHRED model (Serban et al., 2017) that takes context, user input, gold and
system responses as input, and produces a qualitative score between 1 and 5. As VHRED is effective
for modeling conversations, Lowe et al. (2017) were able to achieve an impressive Spearman’s ρ
correlation of 0.42 at the sentence level. On the other hand, the fact that this metric is trainable leads to
other potential problems such as overfitting and “gaming of the metric” (Albrecht and Hwa, 2007),12
which might explain why previously proposed machine-learned evaluation metrics (Corston-Oliver
10
For example, in the official report of the WMT shared task, Callison-Burch et al. (2009, Section 6.2) com-
puted the percentage of times popular metrics are consistent with human ranking at the sentence level, but the
results did not bode well for sentence-level studies: “Many metrics failed to reach [a random] baseline (in-
cluding most metrics in the out-of-English direction). This indicates that sentence-level evaluation of machine
translation quality is very difficult.”
11
In one of the largest scale system-level correlation studies to date, Graham and Baldwin (2014) found that
BLEU is relatively competitive against most translation metrics proposed more recently, as they show there “is
currently insufficient evidence for a high proportion of metrics to conclude that they outperform BLEU”. Such
a large scale study remains to be done for dialogue.
12
In discussing the potential pitfalls of machine-learned evaluation metrics, Albrecht and Hwa (2007) argued
for example that it would be “prudent to defend against the potential of a system gaming a subset of the
features.” In the case of deep learning, this gaming would be reminiscent of making non-random perturbations
to an input to drastically change the network’s predictions, as was done, e.g., with images in (Szegedy et al.,
2013) to show how easily deep learning models can be fooled. However, preventing such gaming is difficult
if the machine-learned metric is to become a standard evaluation, and this would presumably require model
parameters to be publicly available.

et al., 2001; Kulesza and Shieber, 2004; Lita et al., 2005; Albrecht and Hwa, 2007; Giménez and
Màrquez, 2008; Pado et al., 2009; Stanojević and Sima’an, 2014, etc.) are not commonly used in
official machine translation benchmarks. The problem of “gameable metrics” is potentially serious,
for example in the frequent cases where automatic evaluation metrics are used directly as training
objectives (Och, 2003; Ranzato et al., 2015) as unintended “gaming” may occur unbeknownst to the
system developer. If a generation system is optimized directly on a trainable metric, then the system
and the metric become akin to an adversarial pair in GANs (Goodfellow et al., 2014), where the
only goal of the generation system (Generator) is to fool the metric (Discriminator). Arguably, such
attempts become easier with trainable metrics as they typically incorporate thousands or millions
of parameters, compared to a relatively parameterless metric like BLEU that is known to be fairly
robust to such exploitation and was shown to be the best metric for direct optimization (Cer et al.,
2010) among other established string-based metrics. To prevent machine-learned metrics from being
gamed, one would need to iteratively train the Generator and Discriminator as in GANs, but most
trainable metrics in the literature do not exploit this iterative process. Adversarial setups proposed
for dialogue and related tasks (Kannan and Vinyals, 2016; Li et al., 2017c; Holtzman et al., 2018)
offer solutions to this problem, but it is also well-known that such setups suffer from instability (Sal-
imans et al., 2016) due to the nature of GANs’ minimax formulation. This fragility is potentially
troublesome as the outcome of an automatic evaluation should ideally be stable (Cer et al., 2010)
and reproducible over time, e.g., to track progress of E2E dialogue research over the years. All of
this suggests that automatic evaluation for E2E dialogue is far from a solved problem.
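
The gaming risk described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "learned metric" is a hypothetical bag-of-words scorer with made-up weights, and the "generator" is a greedy search whose only objective is that score. It is not a metric or system from the literature.

```python
# Hypothetical learned metric: per-token weights that a trained scorer might have picked up.
LEARNED_WEIGHTS = {"great": 0.9, "thanks": 0.7, "sure": 0.4, "the": 0.1}


def learned_metric(response_tokens):
    """Average per-token weight; tokens unknown to the metric score 0."""
    if not response_tokens:
        return 0.0
    total = sum(LEARNED_WEIGHTS.get(tok, 0.0) for tok in response_tokens)
    return total / len(response_tokens)


def greedy_generator(vocabulary, length):
    """A 'generator' optimized directly on the metric: at each step it appends
    the token that maximizes the metric score of the output so far."""
    output = []
    for _ in range(length):
        best = max(vocabulary, key=lambda tok: learned_metric(output + [tok]))
        output.append(best)
    return output


vocab = ["the", "flight", "leaves", "at", "noon", "great", "thanks", "sure"]
gamed = greedy_generator(vocab, 5)
informative = ["the", "flight", "leaves", "at", "noon"]

print(gamed, learned_metric(gamed))
print(informative, learned_metric(informative))
```

In this toy setting the metric prefers the degenerate output that repeats its highest-weighted token over an informative response, which is the kind of unintended exploitation that direct optimization on a trainable metric can produce.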

5.7 Open Benchmarks


Open benchmarks have been the key to achieving progress in many AI tasks such as speech recog-
nition, information retrieval, and machine translation. Although end-to-end conversational AI is a
relatively nascent research problem, some open benchmarks have already been developed:

• Dialog System Technology Challenges (DSTC): In 2017, DSTC proposed for the first
time an “End-to-End Conversation Modeling” track,13 which requires systems to be fully
data-driven using Twitter data. Two of the tasks in the subsequent challenge (DSTC7) focus
on grounded conversation scenarios. One is focused on audio-visual scene-aware dialogue
and the other on response generation grounded in external knowledge (e.g., Foursquare and
Wikipedia), with conversations extracted from Reddit.14
• ConvAI Competition: This is a NIPS competition that has been featured so far at two
conferences. It offers prizes in the form of Amazon Mechanical Turk funding. The com-
petition aims at “training and evaluating models for non-goal-oriented dialogue systems”,
and in 2018 uses the Persona-Chat dataset (Zhang et al., 2018c), among other datasets.
• NTCIR STC: This benchmark focuses on conversation “via short texts”. The first
benchmark focused on retrieval-based methods, and in 2017 was expanded to evaluate
generation-based approaches.
• Alexa Prize: In 2017, Amazon organized an open competition on building “social bots”
that can converse with humans on a range of current events and topics. The competition
enables participants to test their systems with real users (Alexa users), and offers a form of
indirect supervision as users are asked to rate each of their conversations with each of the
Alexa Prize systems. The inaugural prize featured 15 academic teams (Ram et al., 2018).15

13 http://workshop.colips.org/dstc6/call.html
14 http://workshop.colips.org/dstc7/
15 These 15 systems are described in the online proceedings: https://developer.amazon.com/alexaprize/proceedings

Chapter 6

Conversational AI in Industry

This chapter surveys the landscape of conversational systems in industry, including task-oriented
systems (e.g., personal assistants), QA systems, and chatbots.

6.1 Question Answering Systems


Search engine companies, including Google, Microsoft and Baidu, have incorporated multi-turn
QA capabilities into their search engines to make the user experience more conversational, which is
particularly appealing for mobile devices. Since relatively little is publicly known about the internals
of these systems (e.g., Google and Baidu), this section presents a few example commercial QA
systems whose architectures have been at least partially described in public sources, including Bing
QA, Satori QA and customer support agents.

6.1.1 Bing QA

Bing QA is an example of a Web-scale text-QA agent. It is an extension of the Microsoft Bing
Web search engine. Instead of returning ten blue links, Bing QA generates a direct answer to a
user query by reading the passages retrieved by the Bing Web search engine using MRC models, as
illustrated in Fig. 6.1.
The Web QA task that Bing QA deals with is far more challenging than most of the academic
MRC tasks described in Chapter 3. For example, Web QA and SQuAD differ in:

• Scale and quality of the text collection. SQuAD assumes the answer is a text span in a
passage which is a clean text section from a Wikipedia page. Web QA needs to identify
an answer from billions of Web documents, which consist of trillions of noisy passages that
often contain contradictory, wrong, or obsolete information due to the dynamic nature of Web
content.
• Runtime latency. In an academic setting, an MRC model might take seconds to read and
re-read documents to generate an answer, while in the Web QA setting the MRC part (e.g.,
in Bing QA) is required to add no more than 10 milliseconds to the entire serving stack.
• User experience. While SQuAD MRC models provide a text span as an answer, Web QA
needs to provide different user experiences depending on the device on which the answer
is shown, e.g., a voice answer on a mobile device or a rich answer in a Search Engine
Result Page (SERP). Fig. 6.1 (Right) shows an example of the SERP for the question “what
year did Disney buy lucasfilms?”, where Bing QA presents not only the answer as a
highlighted text span, but also various supporting evidence and related Web search results (i.e.,
captions of retrieved documents, passages, audio and video) that are consistent with the
answer.

Figure 6.1: (Left) An overview of the Bing QA architecture. (Right) An example of a search engine
result page for the question “what year did disney buy lucasfilms?”. Example graciously provided by
Rangan Majumder.

As a result, a commercial Web QA agent such as Bing QA often incorporates an MRC module as a
post-web component on top of its Web search engine stack. An overview of the Bing QA agent is
illustrated in Fig. 6.1 (Left). Given the question “what year did Disney buy lucasfilms?”, a set of
candidate documents is retrieved from the Web index via a fast, primary ranker. Then, in the Document
Ranking module, a sophisticated document ranker based on boosted trees (Wu et al., 2010) is used
to assign relevance scores to these documents. The top-ranked relevant documents are presented
in a SERP, with their captions generated by a Query-Focused Captioning module, as shown in
Fig. 6.1 (Right). The Passage Chunking module segments the top documents into a set of candidate
passages, which are further ranked by the Passage Ranking module based on another passage-level
boosted trees ranker (Wu et al., 2010). Finally, the MRC module identifies the answer span “2012”
from the top-ranked passages.
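
The staged flow described above (retrieve and rank documents, chunk them into passages, rank passages, then read an answer span) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Bing's implementation: a term-overlap score stands in for both the primary ranker and the boosted-tree rankers, sentence splitting stands in for Passage Chunking, and a regular expression that picks out a four-digit year stands in for the MRC model. All function names and the three-document "index" are hypothetical.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, text):
    """Toy relevance score: fraction of query terms that occur in the text."""
    query_terms = set(tokenize(query))
    text_terms = set(tokenize(text))
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def rank(query, items, top_k):
    """Stand-in for the document and passage rankers."""
    return sorted(items, key=lambda item: overlap_score(query, item), reverse=True)[:top_k]


def chunk_passages(document):
    """Stand-in for Passage Chunking: one passage per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def read_answer(passage):
    """Stand-in for the MRC module: return the first four-digit year, if any."""
    match = re.search(r"\b(1[89]\d\d|20\d\d)\b", passage)
    return match.group(0) if match else None


def answer(query, web_index, top_docs=2, top_passages=1):
    documents = rank(query, web_index, top_docs)
    passages = [p for doc in documents for p in chunk_passages(doc)]
    for passage in rank(query, passages, top_passages):
        span = read_answer(passage)
        if span is not None:
            return span, passage
    return None, None


web_index = [
    "Lucasfilm was founded by George Lucas. Disney agreed to buy Lucasfilm in 2012.",
    "Disney opened a new theme park. Tickets sell quickly every year.",
    "Stanford is a university in California.",
]
span, evidence = answer("what year did Disney buy lucasfilms?", web_index)
print(span, "|", evidence)
```

The point of the staging is cost: cheap scoring prunes the candidate set so that the expensive reader only sees a handful of passages, which is one way to respect a tight latency budget.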
Although turning Bing QA into a conversational QA agent as described in Sec. 3.8 requires the integration of
additional components such as a dialogue manager, which is a nontrivial ongoing engineering effort,
Bing QA can already deal with conversational queries (e.g., follow-up questions) using a Conversational
Query Understanding (CQU) module (Ren et al., 2018a). As the example in Fig. 6.2 shows, CQU
reformulates a conversational query into a search-engine-friendly query in two steps: (1) determine
whether a query depends upon the context in the same search session (i.e., previous queries and
answers), and (2) if so, rewrite that query to include the necessary context, e.g., replace “its” with
“California” in Q2 and add “Stanford” in Q5 in Fig. 6.2.
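
The two CQU steps can be sketched as follows. This is a rule-based toy, assuming that context dependence can be detected from a small pronoun list and that the entity to carry over is the most recently mentioned one from a known entity list; the module described by Ren et al. (2018a) is not rule-based in this way. The queries, the entity list and all function names are hypothetical.

```python
import re

# Hypothetical rewrite templates for context-dependent pronouns.
PRONOUN_REWRITES = {"its": "{entity}'s", "it": "{entity}"}
PRONOUN_PATTERN = re.compile(r"\b(" + "|".join(PRONOUN_REWRITES) + r")\b", re.IGNORECASE)


def depends_on_context(query):
    """Step 1: decide whether the query refers back to the search session."""
    tokens = re.findall(r"[a-z']+", query.lower())
    return any(tok in PRONOUN_REWRITES for tok in tokens)


def latest_entity(history, known_entities):
    """Find the most recently mentioned known entity in earlier queries."""
    for previous in reversed(history):
        for entity in known_entities:
            if entity.lower() in previous.lower():
                return entity
    return None


def rewrite(query, history, known_entities):
    """Step 2: if the query is context-dependent, substitute the entity."""
    if not depends_on_context(query):
        return query
    entity = latest_entity(history, known_entities)
    if entity is None:
        return query

    def substitute(match):
        return PRONOUN_REWRITES[match.group(0).lower()].format(entity=entity)

    return PRONOUN_PATTERN.sub(substitute, query)


entities = ["California", "Stanford"]
session = ["when was California founded"]
print(rewrite("what is its population", session, entities))
print(rewrite("who founded Stanford", session, entities))
```

Queries that do not depend on context pass through unchanged, so the rewritten session can be sent to an unmodified search stack.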

6.1.2 Satori QA

Satori QA is an example of a KB-QA agent, as described in Sec. 3.1–3.5. Satori is Microsoft’s
knowledge graph, which was seeded by Freebase and is now several orders of magnitude larger than
Freebase. Satori QA is a hybrid system that uses both neural approaches and symbolic approaches.
It generates answers to factual questions.
Similar to Web QA, Satori QA has to deal with issues of scalability, noisy content, speed,
etc. One commonly used design strategy for improving a system’s robustness and runtime efficiency is
to decompose a complex question into a sequence of simpler questions, which can be answered more
easily by a Web-scale KB-QA system, and to compute the final answer by recomposing the sequence
of answers, as exemplified in Fig. 6.3 (Talmor and Berant, 2018).
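
The decompose-answer-recompose strategy can be sketched with a toy example. The sketch assumes a single hypothetical question template ("what is the R2 of the R1 of E?"), a two-entry dictionary standing in for the knowledge base, and recomposition that simply feeds each answer into the next simple question; the learned decomposition of Talmor and Berant (2018) is far more general.

```python
import re

# Toy knowledge base keyed by (entity, relation); contents are illustrative.
TOY_KB = {
    ("lucasfilm", "founder"): "George Lucas",
    ("george lucas", "birthplace"): "Modesto",
}

SIMPLE = re.compile(r"what is the (\w+) of (.+)\?$", re.IGNORECASE)
NESTED = re.compile(r"what is the (\w+) of the (\w+) of (.+)\?$", re.IGNORECASE)


def decompose(question):
    """Split a nested question into simple ones; '{prev}' marks the previous answer."""
    nested = NESTED.match(question)
    if nested is None:
        return [question]
    outer, inner, entity = nested.groups()
    return [f"what is the {inner} of {entity}?", f"what is the {outer} of {{prev}}?"]


def answer_simple(question):
    """Stand-in for a KB-QA agent that answers single-relation questions."""
    match = SIMPLE.match(question)
    if match is None:
        return None
    relation, entity = match.groups()
    return TOY_KB.get((entity.lower(), relation.lower()))


def answer_complex(question):
    """Answer the simple questions in order, feeding each answer into the next."""
    previous = None
    for template in decompose(question):
        sub_question = template if previous is None else template.replace("{prev}", previous)
        previous = answer_simple(sub_question)
        if previous is None:
            return None
    return previous


print(decompose("What is the birthplace of the founder of Lucasfilm?"))
print(answer_complex("What is the birthplace of the founder of Lucasfilm?"))
```

Each simple question touches a single relation, so it can be served by a fast lookup, and a failure at any step is easy to detect and report.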

6.1.3 Customer Support Agents

Several IT companies, including Microsoft and Salesforce, have developed a variety of customer
support agents. These agents are multi-turn conversational KB-QA agents, as described in Sec. 3.5.
Given a user’s description of a problem, e.g., “cannot update the personal information of my
account”, the agent needs to recommend a pre-compiled solution or ask a human agent to help. The
dialogue often consists of multiple turns as the agent asks the user to clarify the problem while
navigating the knowledge base to find the solution. These agents often take both text and voice as
input.

Figure 6.2: An example query session, where some queries are rewritten to include context
information via the CQU module as indicated by the arrows. Examples adapted from Ren et al. (2018a).

Figure 6.3: Given a complex question Q, we decompose it into a sequence of simple questions
Q1, Q2, ..., use a Web-scale KB-QA agent to generate for each Qi an answer Ai, from which we
compute the final answer A. Figure credit: Talmor and Berant (2018).

6.2 Task-oriented Dialogue Systems (Virtual Assistants)

Commercial task-oriented dialogue systems nowadays often reside in smart phones, smart speakers
and personal computers. They can perform a range of tasks or services for a user, and are sometimes
referred to as virtual assistants or intelligent personal assistants. Example services include
providing weather information, setting alarms, and call-center support. In the US, the most
widely used systems include Apple’s Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana,
among others. Users can interact with them naturally through voice, text or images. To activate a
virtual assistant using voice, a wake word might be used, such as “OK Google.”

Figure 6.4: Architecture of Task Completion Platform. Figure credit: Crook et al. (2016).

There are also a number of fast-growing tools available to facilitate the development of virtual
assistants, including Amazon’s Alexa Skills Kit,1 IBM’s Watson Assistant,2 and similar offerings from
Microsoft and Google, among others. A comprehensive survey is outside the scope of this section,
and not all information about such tools is publicly available. Here, we give a high-level
description of a sample of them:

• The Task Completion Platform (TCP) of Microsoft (Crook et al., 2016) is a platform for
creating multi-domain dialogue systems. As shown in Fig. 6.4, TCP follows a structure
similar to that in Fig. 4.1, containing language understanding, state tracking, and a policy.
A useful feature of TCP is a task configuration language, TaskForm, which allows the
definitions of individual tasks to be decoupled from the platform’s overarching dialogue
policy. TCP is used to power many of the multi-turn dialogues supported by the Cortana
personal assistant.

• Another tool from Microsoft is LUIS, a cloud-based API service for natural language
understanding.3 It provides a suite of pre-built domains and intents, as well as a convenient
interface for a non-expert to use machine learning to obtain an NLU model by providing
training examples. Once a developer creates and publishes a LUIS app, the app can be used
as an NLU black-box module by a client dialogue system: the client sends a text utterance to
the app, which returns language understanding results in JSON format, as illustrated
in Fig. 6.5.

• While LUIS focuses on language understanding, the Azure Bot Service4 allows developers
to build, test, deploy, and manage dialogue systems in one place. It can take advantage of a
suite of intelligent services, including LUIS, image captioning, and speech-to-text capabilities,
among others.

• DialogFlow is Google’s development suite for creating dialogue systems on websites, mo-
bile and IoT devices.5 Similar to the above tools, it provides mechanisms to facilitate
development of various modules of a dialogue system, including language understanding
and carrying information over multiple turns. Furthermore, it can deploy a dialogue system
as an action that users can invoke through Google Assistant.
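
The black-box usage pattern described above for an NLU service (the client sends an utterance and receives language understanding results as JSON) can be sketched from the client's side. The payload below is hypothetical: its field names (`intents`, `entities`, `score`, and so on) are assumptions chosen for illustration and are not the documented schema of LUIS or of any other service listed here. No network call is made; the sketch only shows how a client dialogue system might turn such a result into an intent and slots.

```python
import json

# Hypothetical NLU result; the field names are illustrative assumptions,
# not the actual response format of any particular service.
RAW_RESPONSE = """
{
  "query": "book a flight to Seattle tomorrow",
  "intents": [
    {"name": "BookFlight", "score": 0.92},
    {"name": "None", "score": 0.05}
  ],
  "entities": [
    {"type": "destination", "text": "Seattle"},
    {"type": "date", "text": "tomorrow"}
  ]
}
"""


def parse_nlu_result(raw, min_score=0.5):
    """Extract the top-scoring intent (if confident enough) and the slot values."""
    result = json.loads(raw)
    top = max(result["intents"], key=lambda intent: intent["score"])
    intent = top["name"] if top["score"] >= min_score else None
    slots = {entity["type"]: entity["text"] for entity in result["entities"]}
    return intent, slots


intent, slots = parse_nlu_result(RAW_RESPONSE)
print(intent, slots)
```

Keeping the parsing in one small function means the rest of the dialogue system depends only on the (intent, slots) pair, so the NLU provider can be swapped without touching state tracking or policy code.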

1 https://developer.amazon.com/alexa-skills-kit
2 https://ibm.biz/wcsdialog
3 https://www.luis.ai
4 https://docs.microsoft.com/en-us/azure/bot-service/?view=azure-bot-service-3.0
5 https://dialogflow.com

Figure 6.5: Use of LUIS by a client dialogue system. Figure credit:
https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS

6.3 Chatbots
There have been publicly available conversational systems going back many decades (Weizenbaum,
1966; Colby, 1975). Those precursors of today’s chatbot systems relied heavily on hand-crafted
rules, and are very different from the data-driven conversational AI systems discussed in Chapter 5.
Nowadays, publicly available and commercial chatbot systems are often a combination of statistical
methods and hand-crafted components, where statistical methods provide robustness to conversational
systems (e.g., via intent classifiers) while rule-based components are often still used in practice,
e.g., to handle common chitchat queries (e.g., “tell me a joke”). Examples include personal
assistants like Amazon’s Alexa, Google Assistant, Facebook M, and Microsoft’s Cortana, which
in addition to personal assistant skills are able to handle chitchat user inputs. Other commercial
systems such as XiaoIce,6 Replika (Fedorenko et al., 2017), Zo,7 and Ruuh8 focus almost entirely
on chitchat. Since relatively little is publicly known about the internals of the main commercial
systems (Alexa, Google Assistant, etc.), the rest of this section focuses on commercial systems whose
architectures have been at least partially described in some public source.
One of the earliest such systems is XiaoIce, which was initially released in 2014. XiaoIce is designed
as an AI companion with an emotional connection to satisfy the human need for communication,
affection, and social belonging (Zhou et al., 2018). The overall architecture of XiaoIce is shown in
Fig. 6.6. It consists of three layers.

• User experience layer: It connects XiaoIce to popular chat platforms (e.g., WeChat, QQ),
and deals with conversations in two communication modes. The full-duplex module han-
dles voice-stream-based conversations where both a user and XiaoIce can talk to each other
simultaneously. The other module deals with message-based conversations where a user
and XiaoIce have to take turns to talk.
• Conversation engine layer: In each dialogue turn, the dialogue state is first updated using
the state tracker, and either Core Chat (and a topic) or a dialogue skill is selected by the
dialogue policy to generate a response. A unique component of XiaoIce is the empathetic
computing module, designed to understand not only the content of the user input (e.g.,
topic) but also the empathy aspects (e.g., emotion, intent, opinion on topic, and the user’s
background and general interests), to ensure the generation of an empathetic response that
fits XiaoIce’s persona. Another central module, Core Chat, combines neural generation
techniques (Sec. 5.1) and retrieval-based methods (Zhou et al., 2018). As Fig. 6.7 shows,
XiaoIce is capable of generating socially attractive responses (e.g., having a sense of humor,
comforting, etc.), and can determine whether to drive the conversation when, e.g.,
the conversation is somewhat stalled, or whether to perform active listening when the user
herself is engaged.9
• Data layer: It consists of a set of databases that store the collected human conversational
data (in text pairs or text-image pairs), non-conversational data and knowledge graphs used
for Core Chat and skills, and the profiles of XiaoIce and all the registered users for
empathetic computing.

Figure 6.6: XiaoIce system architecture. Figure credit: Zhou et al. (2018)

Figure 6.7: Conversation between a user and XiaoIce. The empathy model provides a context-aware
strategy that can drive the conversation when needed.

6 https://www.msxiaobing.com
7 https://www.zo.ai
8 https://www.facebook.com/Ruuh
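
The per-turn control flow of the conversation engine layer described above (update the dialogue state, let the policy choose between Core Chat and a skill, then generate a response) can be sketched as follows. This is a schematic under strong assumptions, not XiaoIce's implementation: the state tracker is a keyword rule, the single skill and the Core Chat fallback return canned text, and the empathetic computing module is omitted entirely. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    turns: list = field(default_factory=list)
    topic: str = "general"


def track_state(state, user_input):
    """Toy state tracker: record the turn and set the topic with a keyword rule."""
    state.turns.append(user_input)
    if "weather" in user_input.lower():
        state.topic = "weather"
    return state


def weather_skill(state, user_input):
    """Stand-in for a dialogue skill."""
    return "Let me check the forecast for you."


def core_chat(state, user_input):
    """Stand-in for Core Chat, the open-domain fallback."""
    return f"Tell me more about that ({state.topic})."


SKILLS = {"weather": weather_skill}


def policy(state):
    """Toy dialogue policy: pick a skill that matches the topic, else Core Chat."""
    return SKILLS.get(state.topic, core_chat)


def respond(state, user_input):
    state = track_state(state, user_input)
    handler = policy(state)
    return handler(state, user_input)


state = DialogueState()
print(respond(state, "I had a long day"))
print(respond(state, "What's the weather like?"))
```

Separating the policy from the response generators is what lets new skills be registered without changing the turn loop.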

The Replika system (Fedorenko et al., 2017) for chitchat combines neural generation and retrieval-
based methods, and is able to condition responses on images as in (Mostafazadeh et al., 2017). The
neural generation component of Replika is persona-based (Li et al., 2016b), as it is trained to mimic
specific characters. While Replika is a company, the Replika system has been open-sourced10 and
can thus be used as a benchmark for future research.
Alexa Prize systems (Ram et al., 2018) are social chatbots that are exposed to real users, and as
such anyone with an Alexa device is able to interact with these social bots and give them ratings.
This interaction is triggered with the “Alexa, let’s chat” command, which then starts a free-form
conversation about any topic selected by either the user or the system. These systems featured
not only fully data-driven approaches, but also more engineered and modularized approaches. For
example, the winning system of the 2017 competition (Sounding Board11) contained a chitchat
component as well as individual “miniskills” enabling the system to handle distinct tasks (e.g., QA)
and topics (e.g., news, sports). Due to the diversity of systems in the Alexa Prize, it would be
impractical to review these systems in this survey, and instead we refer the interested reader to the
Alexa Prize online proceedings (Ram et al., 2018).

9 https://www.leiphone.com/news/201807/rgyKfVsEUdK1BpXf.html
10 https://github.com/lukalabs/cakechat
11 https://sounding-board.github.io

Chapter 7

Conclusions and Research Trends

Conversational AI is a rapidly growing field. This paper surveys neural approaches that were re-
cently developed. Some of them have already been widely used in commercial systems.

• Dialogue systems for question answering, task completion, chitchat, recommendation,
etc. can be conceptualized using a unified mathematical framework of an optimal decision
process. The neural approaches to AI, developed in the last few years, leverage recent
breakthroughs in RL and DL to significantly improve the performance of dialogue agents
across a wide range of tasks and domains.
• A number of commercial dialogue systems allow users to easily access various services and
information via conversation. Most of these systems use hybrid approaches that combine
the strengths of symbolic methods and neural models.
• There are two types of QA agents. KB-QA agents allow users to query large-scale knowl-
edge bases via conversation without composing complicated SQL-like queries. Text-QA
agents, equipped with neural MRC models, are becoming more popular than traditional
search engines (e.g., Bing and Google) for the query types to which users expect a concise
direct answer.
• Traditional task-oriented systems use handcrafted dialogue manager modules, or shallow
machine-learning models to optimize the modules separately. Recently, researchers have
begun to explore DL and RL to optimize the system in a more holistic way with less domain
knowledge, and to automate the optimization of systems in a changing environment such
that they can efficiently adapt to different tasks, domains and user behaviors.
• Chatbots are important in facilitating smooth and natural interaction between humans and
their electronic devices. More recent work focuses on scenarios beyond chitchat, e.g.,
recommendation. Most state-of-the-art chatbots use fully data-driven and end-to-end gen-
eration of conversational responses within the framework of neural machine translation.

We have discussed some of the main challenges in conversational AI, common to Question Answer-
ing agents, task-oriented dialogue bots and chatbots.

• Towards a unified modeling framework for dialogues: Chapter 1 presents a unified view
where an open-domain dialogue is formulated as an optimal decision process. Although
the view provides a useful design principle, the effectiveness of a unified modeling
framework for system development remains to be proved. Microsoft XiaoIce, initially
designed as a chitchat system based on a retrieval engine, has gradually incorporated many
ML components and skills, including QA, task completion and recommendation, using a
unified modeling framework based on empathetic computing and RL, aiming to maximize
user engagement in the long run, measured by expected conversation-turns per session. We
plan to present the design and development of XiaoIce in a future publication. McCann
et al. (2018) presented a platform effort to develop a unified model to handle various
tasks including QA, dialogue and chitchat.

• Towards fully end-to-end dialogue systems: Recent work combines the benefits of
task-oriented dialogue with more end-to-end capabilities. The grounded models discussed in
Sec. 5.3 represent a step towards more goal-oriented conversations, as the ability to interact
with the user’s environment is a key requirement for most goal-oriented dialogue systems.
Grounded conversation modeling discussed in this paper is still preliminary, and future
challenges include enabling API calls in fully data-driven pipelines.
• Dealing with heterogeneous data: Conversational data is often heterogeneous. For ex-
ample, chitchat data is plentiful but not directly relevant to goal-oriented systems, and
goal-oriented conversational datasets are typically very small. Future research will need
to address the challenge of capitalizing on both, for example in a multi-task setup similar
to Luan et al. (2017). Another research direction is the work of Zhao et al. (2017), which
brought synergies between chitchat and task-oriented data using a “data augmentation”
technique. Their resulting system is not only able to handle chitchat, but is also more robust
on goal-oriented dialogues. Another challenge is to better exploit non-conversational data
(e.g., Wikipedia) as part of the training of conversational systems (Ghazvininejad et al.,
2018).
• Incorporating EQ (or empathy) into dialogue: This is useful for both chatbots and QA
bots. For example, XiaoIce incorporates an EQ module so as to deliver a more understandable
response or recommendation (as in Sec. 3.1 of Shum et al. (2018)). Fung et al. (2016)
embedded an empathy module into a dialogue agent to recognize users’ emotions from
multimodal input and to generate emotion-aware responses.
• Scalable training for task-oriented dialogues: It is important to be able to update a dialogue
agent quickly to handle a changing environment. For example, Lipton et al. (2018) proposed an
efficient exploration method to tackle a domain extension setting, where new slots can be
gradually introduced. Chen et al. (2016b) proposed a zero-shot learning approach for unseen intents
so that a dialogue agent trained on one domain can detect unseen intents in a new domain
without manually labeled data and without retraining.
• Commonsense knowledge is crucial for any dialogue agent. This is challenging because
commonsense knowledge is often not explicitly stored in existing knowledge bases. Some
new datasets have been developed to foster research on commonsense reasoning, such as the
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) (Zhang et al.,
2018d), the Winograd Schema Challenge (WSC) (Morgenstern and Ortiz, 2015) and Choice
Of Plausible Alternatives (COPA) (Roemmele et al., 2011).
• Model interpretability: In some cases, a dialogue agent is required not only to give a
recommendation or an answer, but also to provide explanations. This is very important,
e.g., in business scenarios, where a user cannot make a business decision without
justification. Shen et al. (2018); Xiong et al. (2017); Das et al. (2017b) combine the interpretability
of symbolic approaches and the robustness of neural approaches and develop inference
algorithms on a KB that not only improve the accuracy in answering questions but also
provide explanations of why the answer is generated, i.e., the paths in the KB that lead to the
answer node.
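
The idea of explaining an answer by the KB path that leads to it can be sketched on a toy graph. Note the substitution: the works cited in the interpretability item learn to find such paths with neural and RL-based models, whereas this sketch uses plain breadth-first search purely to show what a path-shaped explanation looks like. The three triples and the function name are hypothetical.

```python
from collections import deque

# Toy knowledge base of (subject, relation, object) triples; contents are illustrative.
TOY_KB = [
    ("Lucasfilm", "acquired_by", "Disney"),
    ("Disney", "headquartered_in", "Burbank"),
    ("Burbank", "located_in", "California"),
]


def find_path(kb, start, goal):
    """Breadth-first search that returns the list of triples linking start to goal."""
    graph = {}
    for subj, rel, obj in kb:
        graph.setdefault(subj, []).append((rel, obj))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, obj in graph.get(node, []):
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(node, rel, obj)]))
    return None


explanation = find_path(TOY_KB, "Lucasfilm", "California")
for subj, rel, obj in explanation:
    print(f"{subj} --{rel}--> {obj}")
```

Returning the triples rather than only the answer node is what makes the result inspectable: a user can check each hop against the KB.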

Bibliography

Agichtein, E., Carmel, D., Pelleg, D., Pinter, Y., and Harman, D. (2015). Overview of the TREC
2015 LiveQA track. In TREC.
Ai, H. and Litman, D. J. (2008). Assessing dialog system user simulation evaluation measures using
human judges. In Proceedings of the 46th Annual Meeting of the Association for Computational
Linguistics (ACL), pages 622–629.
Ai, H., Raux, A., Bohus, D., Eskenazi, M., and Litman, D. (2007). Comparing spoken dialog corpora
collected with recruited subjects versus real users. In Proceedings of the 8th SIGdial Workshop
on Discourse and Dialogue, pages 124–131.
Al-Rfou, R., Pickett, M., Snaider, J., Sung, Y., Strope, B., and Kurzweil, R. (2016). Conversa-
tional contextual cues: The case of personalization and history for response ranking. CoRR,
abs/1606.00372.
Albrecht, J. and Hwa, R. (2007). A re-examination of machine learning approaches for sentence-
level mt evaluation. In Proceedings of the 45th Annual Meeting of the Association of Computa-
tional Linguistics, pages 880–887, Prague, Czech Republic.
Allen, J. F., Byron, D. K., Dzikovska, M. O., Ferguson, G., Galescu, L., and Stent, A. (2001).
Toward conversational human-computer interaction. AI Magazine, 22(4):27–38.
Angeli, G., Liang, P., and Klein, D. (2010). A simple domain-independent probabilistic approach
to generation. Proceedings of the 2010 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 502–512.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). DBpedia: A
nucleus for a web of open data. In The semantic web, pages 722–735. Springer.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to
align and translate. In Proc. of ICLR.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for mt evaluation with improved
correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summarization, pages 65–72.
Bao, J., Duan, N., Zhou, M., and Zhao, T. (2014). Knowledge-based question answering as machine
translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), volume 1, pages 967–976.
Bapna, A., Tür, G., Hakkani-Tür, D., and Heck, L. P. (2017). Towards zero-shot frame semantic
parsing for domain scaling. In Proceedings of the 18th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pages 2476–2480.
Barlier, M., Pérolat, J., Laroche, R., and Pietquin, O. (2015). Human-machine dialogue as a stochas-
tic game. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse
and Dialogue (SIGDIAL), pages 2–11.
Baxter, J. and Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial
Intelligence Research, 15:319–350.
Baxter, J., Bartlett, P., and Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient
estimation. Journal of Artificial Intelligence Research, 15:351–381.
Bell, J. (1999). Pragmatic reasoning: Inferring contexts. In International and Interdisciplinary
Conference on Modeling and Using Context, pages 42–53. Springer.

Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016).
Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information
Processing Systems (NIPS), pages 1471–1479.
Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-
answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Processing, pages 1533–1544.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Bhatia, P., Gavaldà, M., and Einolghozati, A. (2017). soc2seq: Social embedding meets conversation
model. CoRR, abs/1702.05512.
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O’Reilly.
Black, A. W., Burger, S., Conkie, A., Hastie, H. W., Keizer, S., Lemon, O., Merigaud, N., Parent,
G., Schubiner, G., Thomson, B., Williams, J. D., Yu, K., Young, S. J., and Eskénazi, M. (2011).
Spoken dialog challenge 2010: Comparison of live and control test results. In Proceedings of the
12th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages
2–7.
Bohus, D. and Horvitz, E. (2009). Models for multiparty engagement in open-world dialog. In
Proceedings of the 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue
(SIGDIAL), pages 225–234.
Bohus, D. and Horvitz, E. (2011). Multiparty turn taking in situated dialog: Study, lessons, and
directions. In Proceedings of the 12th Annual Meeting of the Special Interest Group on Discourse
and Dialogue (SIGDIAL), pages 98–109.
Bohus, D. and Rudnicky, A. I. (2009). The RavenClaw dialog management framework: Architecture
and systems. Computer Speech & Language, 23(3):332–361.
Bohus, D., Saw, C. W., and Horvitz, E. (2014). Directions robot: In-the-wild experiences and lessons
learned. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent
Systems (AAMAS), pages 637–644.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collabora-
tively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM
SIGMOD international conference on Management of data, pages 1247–1250. ACM.
Bordes, A., Boureau, Y.-L., and Weston, J. (2017). Learning end-to-end goal-oriented dialog. In
Proceedings of the International Conference on Learning Representations (ICLR).
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating
embeddings for modeling multi-relational data. In Advances in neural information processing
systems, pages 2787–2795.
Bos, J., Klein, E., Lemon, O., and Oka, T. (2003). DIPPER: Description and formalisation of an
information-state update dialogue system architecture. In Proceedings of the 4th Annual Meeting
of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 115–124.
Braunschweiler, N. and Papangelis, A. (2018). Comparison of an end-to-end trainable dialogue
system with a modular statistical dialogue system. In Proceedings of the 19th Annual Conference
of the International Speech Communication Association (INTERSPEECH), pages 576–580.
Budzianowski, P., Ultes, S., Su, P.-H., Mrkšić, N., Wen, T.-H., Casanueva, I., Rojas-Barahona,
L. M., and Gašić, M. (2017). Sub-domain modelling for dialogue management with hierarchical
reinforcement learning. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and
Dialogue (SIGDIAL), pages 86–92.
Callison-Burch, C., Koehn, P., Monz, C., and Schroeder, J. (2009). Findings of the 2009 Workshop
on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine
Translation, pages 1–28, Athens, Greece.
Caruana, R. (1998). Multitask learning. In Learning to learn, pages 95–133. Springer.
Casanueva, I., Budzianowski, P., Su, P.-H., Ultes, S., Rojas-Barahona, L. M., Tseng, B.-H., and
Gašić, M. (2018). Feudal reinforcement learning for dialogue management in large domains.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 714–719.

Cer, D., Manning, C. D., and Jurafsky, D. (2010). The best lexical metric for phrase-based statistical
MT system optimization. In Human Language Technologies: The 2010 Annual Conference of
the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages
555–563, Stroudsburg, PA, USA.
Chandramohan, S., Geist, M., Lefèvre, F., and Pietquin, O. (2011). User simulation in dialogue
systems using inverse reinforcement learning. In Proceedings of the 12th Annual Conference of
the International Speech Communication Association (INTERSPEECH), pages 1025–1028.
Chapelle, O. and Li, L. (2012). An empirical evaluation of Thompson sampling. In Advances in
Neural Information Processing Systems 24 (NIPS), pages 2249–2257.
Chen, D., Bolton, J., and Manning, C. D. (2016a). A thorough examination of the CNN/Daily Mail
reading comprehension task. arXiv preprint arXiv:1606.02858.
Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017a). Reading Wikipedia to answer open-domain
questions. arXiv preprint arXiv:1704.00051.
Chen, H., Liu, X., Yin, D., and Tang, J. (2017b). A survey on dialogue systems: Recent advances
and new frontiers. arXiv preprint arXiv:1711.01731.
Chen, J., Wang, C., Xiao, L., He, J., Li, L., and Deng, L. (2017c). Q-LDA: Uncovering latent pat-
terns in text-based sequential decision processes. In Advances in Neural Information Processing
Systems 30, pages 4984–4993.
Chen, L., Tan, B., Long, S., and Yu, K. (2018). Structured dialogue policy with graph neural
networks. In Proceedings of the 27th International Conference on Computational Linguistics
(COLING), pages 1257–1268.
Chen, L., Zhou, X., Chang, C., Yang, R., and Yu, K. (2017d). Agent-aware dropout DQN for
safe and efficient on-line dialogue policy learning. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pages 2454–2464.
Chen, Y.-N., Celikyilmaz, A., and Hakkani-Tür, D. (2017e). Deep learning for dialogue systems.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
(Tutorial Abstracts), pages 8–14.
Chen, Y.-N. and Gao, J. (2017). Open-domain neural dialogue systems. In Proceedings of the Eighth
International Joint Conference on Natural Language Processing (Tutorial Abstracts), pages 6–10.
Chen, Y.-N., Hakkani-Tür, D., Tur, G., Gao, J., and Deng, L. (2016b). End-to-end memory networks
with knowledge carryover for multi-turn spoken language understanding. In Proceedings of the
17th Annual Meeting of the International Speech Communication Association, pages 3245–3249.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014a). On the properties of neural
machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop
on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio,
Y. (2014b). Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734, Doha, Qatar.
Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018).
QuAC: Question answering in context. arXiv preprint arXiv:1808.07036.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018).
Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint
arXiv:1803.05457.
Colby, K. M. (1975). Artificial Paranoia: A Computer Simulation of Paranoid Processes. Elsevier
Science Inc., New York, NY, USA.
Cole, R. A. (1999). Tools for research and education in speech science. In Proceedings of Interna-
tional Conference of Phonetic Sciences, pages 1277–1280.
Core, M. G. and Allen, J. F. (1997). Coding dialogs with the DAMSL annotation scheme. In
Proceedings of AAAI Fall Symposium on Communicative Action in Humans and Machines, pages
28–35.

Corston-Oliver, S., Gamon, M., and Brockett, C. (2001). A machine learning approach to the auto-
matic evaluation of machine translation. In Proceedings of 39th Annual Meeting of the Association
for Computational Linguistics, pages 148–155, Toulouse, France.
Côté, M.-A., Kádár, Á., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M.,
El Asri, L., Adada, M., Tay, W., and Trischler, A. (2018). TextWorld: A learning environment for
text-based games. arXiv preprint arXiv:1806.11532.
Crook, P. A., Marin, A., Agarwal, V., Aggarwal, K., Anastasakos, T., Bikkula, R., Boies, D., Ce-
likyilmaz, A., Chandramohan, S., Feizollahi, Z., Holenstein, R., Jeong, M., Khan, O. Z., Kim,
Y.-B., Krawczyk, E., Liu, X., Panic, D., Radostev, V., Ramesh, N., Robichaud, J.-P., Rochette, A.,
Stromberg, L., and Sarikaya, R. (2016). Task Completion Platform: A self-serve multi-domain
goal oriented dialogue platform. In Proceedings of the 2016 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language Technologies
(HLT-NAACL): Demonstrations Session, pages 47–51.
Cuayáhuitl, H., Renals, S., Lemon, O., and Shimodaira, H. (2010). Evaluation of a hierarchical
reinforcement learning spoken dialogue system. Computer Speech and Language, 24(2):395–
429.
Cuayáhuitl, H., Yu, S., Williamson, A., and Carse, J. (2016). Deep reinforcement learning for
multi-domain dialogue systems. arXiv preprint arXiv:1611.08675.
Dai, B., Shaw, A., He, N., Li, L., and Song, L. (2018a). Boosting the actor with dual critic. In
Proceedings of the Sixth International Conference on Learning Representations (ICLR).
Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. (2018b). SBEED:
Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the
Thirty-Fifth International Conference on Machine Learning (ICML-18), pages 1133–1142.
Dang, H. T., Kelly, D., and Lin, J. J. (2007). Overview of the TREC 2007 question answering track.
In TREC, volume 7, page 63.
Dann, C., Lattimore, T., and Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds
for episodic reinforcement learning. In Advances in Neural Information Processing Systems 30
(NIPS), pages 5717–5727.
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., and Batra, D. (2017a).
Visual Dialog. In CVPR.
Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., Smola, A., and Mc-
Callum, A. (2017b). Go for a walk and arrive at the answer: Reasoning over paths in knowledge
bases using reinforcement learning. arXiv preprint arXiv:1711.05851.
Daubigney, L., Gašić, M., Chandramohan, S., Geist, M., Pietquin, O., and Young, S. J. (2011). Un-
certainty management for on-line optimisation of a POMDP-based large-scale spoken dialogue
system. In Proceedings of the 12th Annual Conference of the International Speech Communica-
tion Association (INTERSPEECH), pages 1301–1304.
Daubigney, L., Geist, M., Chandramohan, S., and Pietquin, O. (2012). A comprehensive rein-
forcement learning framework for dialogue management optimization. IEEE Journal of Selected
Topics in Signal Processing, 6(8):891–902.
Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in Neural Infor-
mation Processing Systems 5 (NIPS), pages 271–278.
de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., and Courville, A. C. (2017).
GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4466–4475.
DeVault, D., Artstein, R., Benn, G., Dey, T., Fast, E., Gainer, A., Georgila, K., Gratch, J., Hartholt,
A., Lhommet, M., Lucas, G. M., Marsella, S., Morbini, F., Nazarian, A., Scherer, S., Stratou,
G., Suri, A., Traum, D. R., Wood, R., Xu, Y., Rizzo, A. A., and Morency, L.-P. (2014). Sim-
Sensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of
the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages
1061–1068.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y.-N., Ahmed, F., and Deng, L. (2017). Towards end-to-
end reinforcement learning of dialogue agents for information access. In ACL (1), pages 484–495.
Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., and Salakhutdinov, R. (2016). Gated-attention readers
for text comprehension. arXiv preprint arXiv:1606.01549.
Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function de-
composition. Journal of Artificial Intelligence Research, 13:227–303.
Ding, Y., Liu, Y., Luan, H., and Sun, M. (2017). Visualizing and understanding neural machine
translation. In Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1150–1159, Vancouver, Canada.
Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A., Szlam, A., and Weston, J. (2016).
Evaluating prerequisite qualities for learning end-to-end dialog systems. In ICLR.
Dunn, M., Sagun, L., Higgins, M., Guney, U., Cirik, V., and Cho, K. (2017). SearchQA: A new
Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
Eckert, W., Levin, E., and Pieraccini, R. (1997). User modeling for spoken dialogue system evalu-
ation. In Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Under-
standing (ASRU), pages 80–87.
El Asri, L., He, J., and Suleman, K. (2016). A sequence-to-sequence model for user simulation
in spoken dialogue systems. In Proceedings of the 17th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pages 1151–1155.
El Asri, L., Laroche, R., and Pietquin, O. (2012). Reward function learning for dialogue manage-
ment. In Proceedings of the Sixth Starting AI Researchers’ Symposium (STAIRS), pages 95–106.
Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In
Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 201–208.
Eric, M., Krishnan, L., Charette, F., and Manning, C. D. (2017). Key-value retrieval networks for
task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and
Dialogue (SIGDIAL), pages 37–49.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning.
Journal of Machine Learning Research, 6:503–556.
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell,
M., Platt, J. C., Zitnick, L., and Zweig, G. (2015). From captions to visual concepts and back.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1473–
1482.
Fatemi, M., El Asri, L., Schulz, H., He, J., and Suleman, K. (2016). Policy networks with two-stage
training for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest
Group on Discourse and Dialogue (SIGDIAL), pages 101–110.
Fedorenko, D. G., Smetanin, N., and Rodichev, A. (2017). Avoiding echo-responses in a retrieval-
based conversation system. CoRR, abs/1712.05626.
Fung, P., Bertero, D., Wan, Y., Dey, A., Chan, R. H. Y., Siddique, F. B., Yang, Y., Wu, C., and Lin,
R. (2016). Towards empathetic human-robot interactions. CoRR, abs/1605.04072.
Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J., and Dolan,
B. (2015). deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse
targets. In ACL-IJCNLP, pages 445–450.
Gao, J. (2017). An introduction to deep learning for natural language processing. In International
Summer School on Deep Learning, Bilbao.
Gao, J., Galley, M., and Li, L. (2018a). Neural approaches to conversational AI. In The 41st In-
ternational ACM SIGIR Conference on Research & Development in Information Retrieval, pages
1371–1374. ACM.
Gao, J., Galley, M., and Li, L. (2018b). Neural approaches to conversational AI. Proc. of ACL 2018,
Tutorial Abstracts, pages 2–7.
Gao, J., He, X., Yih, W.-t., and Deng, L. (2014a). Learning continuous phrase representations
for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), volume 1, pages 699–709.

Gao, J., Pantel, P., Gamon, M., He, X., and Deng, L. (2014b). Modeling interestingness with
deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 2–13.
Gardner, M., Talukdar, P., Krishnamurthy, J., and Mitchell, T. (2014). Incorporating vector space
similarity in random walk inference over knowledge bases. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pages 397–406.
Gašić, M., Breslin, C., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis, P., and
Young, S. J. (2013). On-line policy optimisation of Bayesian spoken dialogue systems via human
interaction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 8367–8371.
Gašić, M., Kim, D., Tsiakoulis, P., Breslin, C., Henderson, M., Szummer, M., Thomson, B., and
Young, S. (2014). Incremental on-line adaptation of POMDP-based dialogue managers to
extended domains. In Proceedings of the 15th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages 140–144.
Gašić, M., Mrkšić, N., Su, P.-H., Vandyke, D., Wen, T.-H., and Young, S. J. (2015). Policy
committee for adaptation in multi-domain spoken dialogue systems. In Proceedings of the 2015 IEEE
Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 806–812.
Gašić, M. and Young, S. J. (2014). Gaussian processes for POMDP-based dialogue manager opti-
mization. IEEE Trans. Audio, Speech & Language Processing, 22(1):28–40.
Georgila, K., Henderson, J., and Lemon, O. (2006). User simulation for spoken dialogue systems:
Learning and evaluation. In Proceedings of the 9th International Conference on Spoken Language
Processing (INTERSPEECH), pages 1065–1068.
Ghazvininejad, M., Brockett, C., Chang, M.-W., Dolan, B., Gao, J., Yih, W.-t., and Galley, M.
(2018). A knowledge-grounded neural conversation model. In AAAI, pages 5110–5117.
Giménez, J. and Màrquez, L. (2008). A smorgasbord of features for automatic MT evaluation. In
Proceedings of the Third Workshop on Statistical Machine Translation, pages 195–198.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and
Bengio, Y. (2014). Generative adversarial nets. In NIPS, pages 2672–2680.
Graham, Y. and Baldwin, T. (2014). Testing for significance of increased correlation with human
judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 172–176, Doha, Qatar.
Graham, Y., Baldwin, T., and Mathur, N. (2015). Accurate evaluation of segment-level machine
translation metrics. In NAACL-HLT.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Networks, 18:602–610.
Gu, J., Lu, Z., Li, H., and Li, V. O. (2016). Incorporating copying mechanism in sequence-to-
sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages 1631–1640.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sample-efficient
policy gradient with an off-policy critic. In Proceedings of the 5th International Conference on
Learning Representations (ICLR).
Guu, K., Miller, J., and Liang, P. (2015). Traversing knowledge graphs in vector space. arXiv
preprint arXiv:1506.01094.
Hakkani-Tür, D., Tür, G., Celikyilmaz, A., Chen, Y.-N., Gao, J., Deng, L., and Wang, Y.-Y. (2016).
Multi-domain joint semantic frame parsing using Bi-directional RNN-LSTM. In Proceedings
of the 17th Annual Conference of the International Speech Communication Association (INTER-
SPEECH), pages 715–719.
Hakkani-Tür, D., Tur, G., Heck, L., Fidler, A., and Celikyilmaz, A. (2012). A discriminative
classification-based approach to information state updates for a multi-domain dialog system. In
Proceedings of the 13th Annual Conference of the International Speech Communication Associa-
tion (INTERSPEECH), pages 330–333.

Hartikainen, M., Salonen, E.-P., and Turunen, M. (2004). Subjective evaluation of spoken dialogue
systems using SERVQUAL method. In Proceedings of the 8th International Conference on Spo-
ken Language Processing (INTERSPEECH), pages 2273–2276.
Hausknecht, M. and Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In
Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents,
pages 29–37.
He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. (2016). Deep reinforcement
learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (ACL), pages 1621–1630.
He, S., Liu, C., Liu, K., and Zhao, J. (2017a). Generating natural answers by incorporating copying
and retrieving mechanisms in sequence-to-sequence learning. In ACL, volume 1, pages 199–208.
He, W., Liu, K., Lyu, Y., Zhao, S., Xiao, X., Liu, Y., Wang, Y., Wu, H., She, Q., Liu, X., Wu, T., and
Wang, H. (2017b). DuReader: a Chinese machine reading comprehension dataset from real-world
applications. arXiv preprint arXiv:1711.05073.
Henderson, J., Lemon, O., and Georgila, K. (2008). Hybrid reinforcement/supervised learning of
dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511.
Henderson, M. (2015). Machine learning for dialog state tracking: A review. In Proceedings of The
First International Workshop on Machine Learning in Spoken Language Processing.
Henderson, M., Thomson, B., and Williams, J. D. (2014a). The 3rd dialog state tracking challenge.
In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 324–329.
Henderson, M., Thomson, B., and Williams, J. D. (2014b). The second dialog state tracking chal-
lenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and
Dialogue (SIGDIAL), pages 263–272.
Henderson, M., Thomson, B., and Young, S. J. (2013). Deep neural network approach for the dialog
state tracking challenge. In Proceedings of the 14th Annual Meeting of the Special Interest Group
on Discourse and Dialogue (SIGDIAL), pages 467–471.
Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom,
P. (2015). Teaching machines to read and comprehend. In Advances in Neural Information
Processing Systems, pages 1693–1701.
Hewlett, D., Lacoste, A., Jones, L., Polosukhin, I., Fandrianto, A., Han, J., Kelcey, M., and
Berthelot, D. (2016). WikiReading: A novel large-scale language understanding task over Wikipedia.
arXiv preprint arXiv:1608.03542.
Hill, F., Bordes, A., Chopra, S., and Weston, J. (2015). The Goldilocks principle: Reading children’s
books with explicit memory representations. arXiv preprint arXiv:1511.02301.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic mod-
eling in speech recognition: The shared views of four research groups. IEEE Signal Processing
Magazine, 29(6):82–97.
Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the
description length of the weights. In Proceedings of the sixth annual conference on Computational
learning theory, pages 5–13. ACM.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut
für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation,
9(8):1735–1780.
Hofmann, K., Li, L., and Radlinski, F. (2016). Online evaluation for information retrieval. Founda-
tions and Trends in Information Retrieval, 10(1):1–117.
Holtzman, A., Buys, J., Forbes, M., Bosselut, A., Golub, D., and Choi, Y. (2018). Learning to write
with cooperative discriminators. In ACL, pages 1638–1649, Melbourne, Australia.
Hori, C. and Hori, T. (2017). End-to-end conversation modeling track in DSTC6. CoRR,
abs/1706.07440.

Hori, C., Hori, T., Watanabe, S., and Hershey, J. R. (2015). Context sensitive spoken language
understanding using role dependent LSTM layers. Technical Report TR2015-134, Mitsubishi
Electric Research Laboratories.
Hori, C., Perez, J., Yoshino, K., and Kim, S. (2017). The sixth dialog state tracking challenge.
http://workshop.colips.org/dstc6.
Horvitz, E. (1999). Principles of mixed-initiative user interfaces. In Proceedings of the Conference
on Human Factors in Computing Systems (CHI), pages 159–166.
Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). VIME:
Variational information maximizing exploration. In Advances in Neural Information Processing
Systems 29 (NIPS), pages 1109–1117.
Hu, M., Peng, Y., and Qiu, X. (2017). Mnemonic reader for machine comprehension. arXiv preprint
arXiv:1705.02798.
Huang, H.-Y., Zhu, C., Shen, Y., and Chen, W. (2017). FusionNet: Fusing via fully-aware attention
with application to machine comprehension. arXiv preprint arXiv:1711.07341.
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured
semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM
International Conference on Information & Knowledge Management, pages 2333–2338.
Huang, Q., Gan, Z., Celikyilmaz, A., Wu, D. O., Wang, J., and He, X. (2018). Hierarchi-
cally structured reinforcement learning for topically coherent visual story generation. CoRR,
abs/1805.08191.
Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken language processing: A guide to theory,
algorithm, and system development. Prentice Hall.
Huber, B., McDuff, D., Brockett, C., Galley, M., and Dolan, B. (2018). Emotional dialogue genera-
tion using image-grounded language models. In CHI, pages 277:1–277:12.
Inaba, M. and Takahashi, K. (2016). Neural utterance ranking model for conversational dialogue
systems. In SIGDIAL, pages 393–403.
Iyyer, M., Yih, W.-t., and Chang, M.-W. (2017). Search-based neural structured learning for se-
quential question answering. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1821–1831.
Jafarpour, S., Burges, C. J., and Ritter, A. (2010). Filter, rank, and transfer the knowledge: Learning
to chat. Advances in Ranking, 10:2329–9290.
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning.
Journal of Machine Learning Research, 11:1563–1600.
Jia, R. and Liang, P. (2017). Adversarial examples for evaluating reading comprehension systems.
arXiv preprint arXiv:1707.07328.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2017). Contextual
decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th Inter-
national Conference on Machine Learning (ICML), pages 1704–1713.
Jiang, N. and Li, L. (2016). Doubly robust off-policy evaluation for reinforcement learning. In
Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 652–661.
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). TriviaQA: A large scale distantly
supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
Jung, S., Lee, C., Kim, K., Jeong, M., and Lee, G. G. (2009). Data-driven user simulation for
automated evaluation of spoken dialog systems. Computer Speech and Language, 23:479–509.
Jurafsky, D. and Martin, J. H. (2009). Speech & language processing. Prentice Hall.
Jurafsky, D. and Martin, J. H. (2018). Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall. Draft
of August 12th, 2018. Website: https://web.stanford.edu/~jurafsky/slp3.
Kaelbling, L. P., Littman, M. L., and Moore, A. P. (1996). Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4:237–285.
Kakade, S. (2001). A natural policy gradient. In Advances in Neural Information Processing Systems
13 (NIPS), pages 1531–1538.

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings
of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–
1709, Seattle, Washington, USA.
Kannan, A. and Vinyals, O. (2016). Adversarial evaluation of dialogue models. In NIPS Workshop
on Adversarial Training.
Khandelwal, U., He, H., Qi, P., and Jurafsky, D. (2018). Sharp nearby, fuzzy far away: How neural
language models use context. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 284–294.
Kim, S., D’Haro, L. F., Banchs, R. E., Williams, J. D., and Henderson, M. (2016a). The fourth dialog
state tracking challenge. In Proceedings of the 7th International Workshop on Spoken Dialogue
Systems (IWSDS), pages 435–449.
Kim, S., D’Haro, L. F., Banchs, R. E., Williams, J. D., Henderson, M., and Yoshino, K. (2016b).
The fifth dialog state tracking challenge. In Proceedings of the 2016 IEEE Spoken Language
Technology Workshop (SLT-16), pages 511–517.
Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E.
(2017). The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of
EMNLP 2004, pages 388–395, Barcelona, Spain.
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings
of the 2003 Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54.
Koller, A. and Stone, M. (2007). Sentence generation as a planning problem. In Proceedings of the
45th Annual Meeting of the Association for Computational Linguistics (ACL-07), pages 336–343.
Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., and Okuno, H. G.
(2006). Multi-domain spoken dialogue system with extensibility and robustness against speech
recognition errors. In Proceedings of the SIGDIAL 2006 Workshop, pages 9–17.
Konda, V. R. and Tsitsiklis, J. N. (1999). Actor-critic algorithms. In Advances in Neural Information
Processing Systems 12 (NIPS), pages 1008–1014.
Kondadadi, R., Howald, B., and Schilder, F. (2013). A statistical NLG framework for aggregated
planning and realization. In Proceedings of the 51st Annual Meeting of the Association for Com-
putational Linguistics (ACL), pages 1406–1415.
Kotti, M., Diakoloukas, V., Papangelis, A., Lagoudakis, M., and Stylianou, Y. (2018). A case study
on the importance of belief state representation for dialogue policy management. In Proceedings
of the 19th Annual Conference of the International Speech Communication Association (INTER-
SPEECH), pages 986–990.
Kulesza, A. and Shieber, S. M. (2004). A learning approach to improving sentence-level MT eval-
uation. In Proceedings of the 10th International Conference on Theoretical and Methodological
Issues in Machine Translation, Baltimore, MD.
Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., Zhong, V., Paulus, R., and
Socher, R. (2016). Ask me anything: Dynamic memory networks for natural language processing.
In International Conference on Machine Learning, pages 1378–1387.
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). RACE: Large-scale reading comprehension
dataset from examinations. arXiv preprint arXiv:1704.04683.
Langkilde, I. and Knight, K. (1998). Generation that exploits corpus-based statistical knowledge.
In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and
17th International Conference on Computational Linguistics (COLING-ACL), pages 704–710.
Lao, N. and Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained
random walks. Machine learning, 81(1):53–67.
Lao, N., Mitchell, T., and Cohen, W. W. (2011). Random walk inference and learning in a large scale
knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, pages 529–539.
Larsson, S. and Traum, D. R. (2000). Information state and dialogue management in the TRINDI
dialogue move engine toolkit. Natural Language Engineering, 6(3–4):323–340.

Lee, J. Y. and Dernoncourt, F. (2016). Sequential short-text classification with recurrent and convo-
lutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT),
pages 515–520.
Lee, S. and Jha, R. (2019). Zero-shot adaptive transfer for conversational language understanding. In
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), to appear.
Lei, W., Jin, X., Ren, Z., He, X., Kan, M.-Y., and Yin, D. (2018). Sequicity: Simplifying task-
oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the
56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1437–1447.
Lemon, O. (2011). Learning what to say and how to say it: Joint optimisation of spoken dialogue
management and natural language generation. Computer Speech & Language, 25(2):210–221.
Levin, E., Pieraccini, R., and Eckert, W. (2000). A stochastic model of human-machine interaction
for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.
Lewis, M., Yarats, D., Dauphin, Y., Parikh, D., and Batra, D. (2017). Deal or no deal? End-to-end
learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods
in Natural Language Processing (EMNLP-17), pages 2443–2453.
Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2016a). A diversity-promoting objective
function for neural conversation models. In NAACL-HLT, pages 110–119.
Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2016b). A persona-based neural conversation
model. In ACL, pages 994–1003.
Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J. (2017a). Dialogue learning with
human-in-the-loop. In ICLR.
Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J. (2017b). Learning through dialogue
interactions by asking questions. In ICLR.
Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. (2016c). Deep reinforcement
learning for dialogue generation. In EMNLP, pages 1192–1202.
Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. (2017c). Adversarial learning for
neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, pages 2157–2169, Copenhagen, Denmark.
Li, L., He, H., and Williams, J. D. (2014). Temporal supervised learning for inferring a dialog policy
from example conversations. In Proceedings of the 2014 IEEE Spoken Language Technology
Workshop (SLT), pages 312–317.
Li, L., Williams, J. D., and Balakrishnan, S. (2009). Reinforcement learning for spoken dialog man-
agement using least-squares policy iteration and fast feature selection. In Proceedings of the 10th
Annual Conference of the International Speech Communication Association (INTERSPEECH),
pages 2475–2478.
Li, X., Chen, Y.-N., Li, L., Gao, J., and Celikyilmaz, A. (2017d). End-to-end task-completion neural
dialogue systems. In Proceedings of the 8th International Joint Conference on Natural Language
Processing (IJCNLP), pages 733–743.
Li, X., Chen, Y.-N., Li, L., Gao, J., and Celikyilmaz, A. (2017e). Investigation of language under-
standing impact for reinforcement learning based dialogue systems. CoRR abs/1703.07055.
Li, X., Li, L., Gao, J., He, X., Chen, J., Deng, L., and He, J. (2015). Recurrent reinforcement
learning: A hybrid approach. arXiv:1509.03044.
Li, X., Lipton, Z. C., Dhingra, B., Li, L., Gao, J., and Chen, Y.-N. (2016d). A user simulator for
task-completion dialogues. CoRR abs/1612.05688.
Li, X., Panda, S., Liu, J., and Gao, J. (2018). Microsoft dialogue challenge: Building end-to-end
task-completion dialogue systems. arXiv preprint arXiv:1807.11125.
Li, Y. (2018). Deep reinforcement learning. arXiv preprint arXiv:1810.06339.
Liang, C., Berant, J., Le, Q., Forbus, K. D., and Lao, N. (2016). Neural symbolic machines: Learning
semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In ACL workshop,
pages 74–81.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and
teaching. Machine Learning, 8(3–4):293–321.
Lipton, Z. C., Gao, J., Li, L., Li, X., Ahmed, F., and Deng, L. (2018). BBQ-networks: Efficient
exploration in deep reinforcement learning for task-oriented dialogue systems. In AAAI, pages
5237–5244.
Lita, L. V., Rogati, M., and Lavie, A. (2005). BLANC: Learning evaluation metrics for MT. In Pro-
ceedings of the Conference on Human Language Technology and Empirical Methods in Natural
Language Processing, HLT ’05, pages 740–747.
Litman, D. J. and Allen, J. F. (1987). A plan recognition model for subdialogues in conversations.
Cognitive Science, 11(2):163–200.
Liu, B. and Lane, I. (2016). Attention-based recurrent neural network models for joint intent detec-
tion and slot filling. In Proceedings of the 17th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages 685–689.
Liu, B. and Lane, I. (2017). Iterative policy learning in end-to-end trainable task-oriented neural di-
alog models. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU), pages 482–489.
Liu, B. and Lane, I. (2018). Adversarial learning of task-oriented neural dialog models. In Proceed-
ings of the 19th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), pages 350–359.
Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., and Heck, L. P. (2018a). Dialogue learning with human
teaching and feedback in end-to-end trainable task-oriented dialogue systems. In Proceedings
of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL-HLT), pages 2060–2069.
Liu, C.-W., Lowe, R., Serban, I., Noseworthy, M., Charlin, L., and Pineau, J. (2016). How not
to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for
dialogue response generation. In EMNLP, pages 2122–2132.
Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., and Liu, Q. (2018b). Action-dependent control variates
for policy optimization via Stein’s identity. In Proceedings of the 6th International Conference
on Learning Representations (ICLR).
Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018c). Breaking the curse of horizon: Infinite-horizon off-
policy estimation. In Advances in Neural Information Processing Systems 31 (NIPS-18), pages
5361–5371.
Liu, R., Wei, W., Mao, W., and Chikina, M. (2017). Phase conductor on multi-layered attentions for
machine comprehension. arXiv preprint arXiv:1710.10504.
Liu, X., Shen, Y., Duh, K., and Gao, J. (2018d). Stochastic answer networks for machine reading
comprehension. In ACL, pages 1694–1704.
Lowe, R., Noseworthy, M., Serban, I. V., Angelard-Gontier, N., Bengio, Y., and Pineau, J. (2017).
Towards an automatic Turing test: Learning to evaluate dialogue responses. In ACL, pages
1116–1126.
Lowe, R., Pow, N., Serban, I., and Pineau, J. (2015). The Ubuntu Dialogue Corpus: A large dataset
for research in unstructured multi-turn dialogue systems. In SIGDIAL, pages 285–294.
Lu, Z. and Li, H. (2014). A deep architecture for matching short texts. In Advances in Neural
Information Processing Systems 27, pages 1368–1375. Curran Associates, Inc.
Luan, Y., Brockett, C., Dolan, B., Gao, J., and Galley, M. (2017). Multi-task learning for speaker-
role adaptation in neural conversation models. In IJCNLP, pages 605–614.
Luong, T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural
machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, pages 1412–1421, Lisbon, Portugal.
Madotto, A., Wu, C.-S., and Fung, P. (2018). Mem2Seq: Effectively incorporating knowledge bases
into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (ACL), pages 1468–1478.
Mairesse, F. and Young, S. (2014). Stochastic language generation in dialogue using factored lan-
guage models. Computational Linguistics, 40(4):763–799.

Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., and McClosky, D. (2014). The
Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. (2015). Deep captioning with
multimodal recurrent neural networks (m-RNN). In ICLR.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2017). Learned in translation: Contextualized
word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.
McCann, B., Keskar, N. S., Xiong, C., and Socher, R. (2018). The natural language decathlon:
Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
McTear, M. F. (2002). Spoken dialogue technology: Enabling the conversational user interface.
ACM Computing Surveys, 34(1):90–169.
Mei, H., Bansal, M., and Walter, M. R. (2017). Coherent dialogue with attention-based language
models. In AAAI, pages 3252–3258.
Melamud, O., Goldberger, J., and Dagan, I. (2016). context2vec: Learning generic context embedding
with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational
Natural Language Learning, pages 51–61.
Mesnil, G., Dauphin, Y., Yao, K., Bengio, Y., Deng, L., Hakkani-Tür, D. Z., He, X., Heck, L. P.,
Tür, G., Yu, D., and Zweig, G. (2015). Using recurrent neural networks for slot filling in spoken
language understanding. IEEE/ACM Transactions on Audio, Speech & Language Processing,
23(3):530–539.
Mesnil, G., He, X., Deng, L., and Bengio, Y. (2013). Investigation of recurrent-neural-network
architectures and learning methods for spoken language understanding. In Proceedings of the 14th
Annual Conference of the International Speech Communication Association (INTERSPEECH),
pages 3771–3775.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations
of words and phrases and their compositionality. In Advances in neural information processing
systems, pages 3111–3119.
Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT
Press.
Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and
Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings
of the 33rd International Conference on Machine Learning (ICML), pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,
Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518:529–533.
Morgenstern, L. and Ortiz, C. L. (2015). The Winograd Schema Challenge: Evaluating progress
in commonsense reasoning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence, AAAI’15, pages 4024–4025.
Mostafazadeh, N., Brockett, C., Dolan, B., Galley, M., Gao, J., Spithourakis, G., and Vanderwende,
L. (2017). Image-grounded conversations: Multimodal context for natural question and response
generation. In IJCNLP, pages 462–472.
Mou, L., Lu, Z., Li, H., and Jin, Z. (2016). Coupling distributed and symbolic execution for natural
language queries. arXiv preprint arXiv:1612.02741.
Mrkšić, N., Séaghdha, D. O., Thomson, B., Gašić, M., Su, P.-H., Vandyke, D., Wen, T.-H., and
Young, S. J. (2015). Multi-domain dialog state tracking using recurrent neural networks. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and
the 7th International Joint Conference on Natural Language Processing of the Asian Federation
of Natural Language Processing (ACL), pages 794–799.
Mrkšić, N., Séaghdha, D. O., Wen, T.-H., Thomson, B., and Young, S. J. (2017). Neural belief
tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (ACL), pages 1777–1788.

Munos, R. and Szepesvári, C. (2008). Finite-time bounds for sampling-based fitted value iteration.
Journal of Machine Learning Research, 9:815–857.
Narasimhan, K., Kulkarni, T. D., and Barzilay, R. (2015). Language understanding for text-based
games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1–11.
Neelakantan, A., Roth, B., and McCallum, A. (2015). Compositional vector space models for
knowledge base completion. arXiv preprint arXiv:1504.06662.
Nguyen, D. Q. (2017). An overview of embedding models of entities and relationships for knowl-
edge base completion. arXiv preprint arXiv:1703.08098.
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. (2016).
MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint
arXiv:1611.09268.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings
of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167,
Sapporo, Japan.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models.
Computational Linguistics, 29(1):19–51.
Och, F. J. and Ney, H. (2004). The alignment template approach to statistical machine translation.
Computational Linguistics, 30(4):417–449.
Oh, A. H. and Rudnicky, A. I. (2002). Stochastic natural language generation for spoken dialog
systems. Computer Speech & Language, 16(3–4):387–407.
Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN.
In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 4026–4034.
Osband, I. and Roy, B. V. (2017). Why is posterior sampling better than optimism for reinforcement
learning? In Proceedings of the 34th International Conference on Machine Learning (ICML),
pages 2701–2710.
Pado, S., Cer, D., Galley, M., Jurafsky, D., and Manning, C. D. (2009). Measuring machine transla-
tion quality as semantic equivalence: A metric based on entailment features. Machine Translation,
pages 181–193.
Paek, T. (2001). Empirical methods for evaluating dialog systems. In Proceedings of the 2nd Annual
Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 1–9.
Paek, T. and Pieraccini, R. (2008). Automating spoken dialogue management design using machine
learning: An industry perspective. Speech Communication, 50(8–9):716–729.
Papangelis, A., Kotti, M., and Stylianou, Y. (2018a). Towards scalable information-seeking multi-
domain dialogue. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 6064–6068.
Papangelis, A., Papadakos, P., Stylianou, Y., and Tzitzikas, Y. (2018b). Spoken dialogue for in-
formation navigation. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and
Dialogue (SIGDIAL), pages 229–234.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation
of machine translation. In ACL, pages 311–318.
Parr, R. and Russell, S. J. (1998). Reinforcement learning with hierarchies of machines. In Advances
in Neural Information Processing Systems 10 (NIPS), pages 1043–1049.
Pasupat, P. and Liang, P. (2015). Compositional semantic parsing on semi-structured tables. arXiv
preprint arXiv:1508.00305.
Peng, B., Li, X., Gao, J., Liu, J., and Wong, K.-F. (2018). Integrating planning for task-completion
dialogue policy learning. CoRR abs/1801.06176.
Peng, B., Li, X., Li, L., Gao, J., Celikyilmaz, A., Lee, S., and Wong, K.-F. (2017). Composite task-
completion dialogue policy learning via hierarchical deep reinforcement learning. In EMNLP,
pages 2231–2240.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 1532–1543.
Peters, J., Vijayakumar, S., and Schaal, S. (2005). Natural actor-critic. In Proceedings of the 16th
European Conference on Machine Learning (ECML), pages 280–291.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018).
Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Pietquin, O. and Dutoit, T. (2006). A probabilistic framework for dialog simulation and optimal
strategy learning. IEEE Transactions on Audio, Speech & Language Processing, 14(2):589–599.
Pietquin, O., Geist, M., Chandramohan, S., and Frezza-Buet, H. (2011). Sample-efficient batch
reinforcement learning for dialogue management optimization. ACM Transactions on Speech and
Language Processing, 7(3):7:1–7:21.
Pietquin, O. and Hastie, H. (2013). A survey on metrics for the evaluation of user simulations. The
Knowledge Engineering Review, 28(1):59–73.
Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation.
In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 759–
766.
Przybocki, M., Peterson, K., Bronsart, S., and Sanders, G. (2009). The NIST 2008 metrics for ma-
chine translation challenge—overview, methodology, metrics, and results. Machine Translation,
23(2):71–103.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley-Interscience, New York.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions
for SQuAD. arXiv preprint arXiv:1806.03822.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine
comprehension of text. arXiv preprint arXiv:1606.05250.
Ram, A., Prasad, R., Khatri, C., Venkatesh, A., Gabriel, R., Liu, Q., Nunn, J., Hedayatnia, B.,
Cheng, M., Nagar, A., King, E., Bland, K., Wartick, A., Pan, Y., Song, H., Jayadevan, S., Hwang,
G., and Pettigrue, A. (2018). Conversational AI: the science behind the Alexa Prize. CoRR,
abs/1801.03604.
Ramshaw, L. A. and Marcus, M. (1995). Text chunking using transformation based learning. In
Third Workshop on Very Large Corpora (VLC at ACL), pages 82–94.
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training with recurrent
neural networks. arXiv preprint arXiv:1511.06732.
Ravuri, S. V. and Stolcke, A. (2015). Recurrent neural network and LSTM models for lexical
utterance classification. In Proceedings of the 16th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages 135–139.
Ravuri, S. V. and Stolcke, A. (2016). A comparative study of recurrent neural network models
for lexical domain classification. In Proceedings of the 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 6075–6079.
Reddy, S., Chen, D., and Manning, C. D. (2018). CoQA: A conversational question answering
challenge. arXiv preprint arXiv:1808.07042.
Ren, G., Ni, X., Malik, M., and Ke, Q. (2018a). Conversational query understanding using sequence
to sequence modeling. In Proceedings of the 2018 World Wide Web Conference on World Wide
Web, pages 1715–1724. International World Wide Web Conferences Steering Committee.
Ren, L., Xie, K., Chen, L., and Yu, K. (2018b). Towards universal dialogue state tracking. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 2780–2786.
Rich, C., Sidner, C. L., and Lesh, N. (2001). COLLAGEN: Applying collaborative discourse theory
to human-computer interaction. AI Magazine, 22(4):15–25.
Richardson, M., Burges, C. J., and Renshaw, E. (2013). MCTest: A challenge dataset for the open-
domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, pages 193–203.

Richardson, S. D., Dolan, W. B., and Vanderwende, L. (1998). MindNet: acquiring and structuring
semantic information from text. In Proceedings of the 36th Annual Meeting of the Association
for Computational Linguistics and 17th International Conference on Computational Linguistics-
Volume 2, pages 1098–1102.
Rieser, V. and Lemon, O. (2008). Learning effective multimodal dialogue strategies from
Wizard-of-Oz data: Bootstrapping and evaluation. In Proceedings of the 46th Annual Meeting of the
Association for Computational Linguistics (ACL), pages 638–646.
Rieser, V. and Lemon, O. (2010). Natural language generation as planning under uncertainty for
spoken dialogue systems. In Empirical Methods in Natural Language Generation: Data-oriented
Methods and Empirical Evaluation, volume 5790 of Lecture Notes in Computer Science, pages
105–120. Springer.
Rieser, V. and Lemon, O. (2011). Learning and evaluation of dialogue strategies for new appli-
cations: Empirical methods for optimization from small data sets. Computational Linguistics,
37(1):153–196.
Rieser, V., Lemon, O., and Liu, X. (2010). Optimising information presentation for spoken dialogue
systems. In Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational
Linguistics (ACL-10), pages 1009–1018.
Ritter, A., Cherry, C., and Dolan, W. (2011). Data-driven response generation in social media. In
EMNLP, pages 583–593.
Roemmele, M., Bejan, C., and Gordon, A. S. (2011). Choice of plausible alternatives: An evaluation
of commonsense causal reasoning. In AAAI Spring Symposium - Technical Report.
Rosenblatt, F. (1957). The perceptron: A perceiving and recognizing automaton. Report 85-460-1,
Project PARA, Cornell Aeronautical Laboratory, Ithaca, New York.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan Books, New York.
Ross, S., Gordon, G. J., and Bagnell, D. (2011). A reduction of imitation learning and structured
prediction to no-regret online learning. In Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics (AISTATS), pages 627–635.
Roy, N., Pineau, J., and Thrun, S. (2000). Spoken dialogue management using probabilistic reason-
ing. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics
(ACL), pages 93–100.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. (2018). A tutorial on Thompson
sampling. Foundations and Trends in Machine Learning, 11(1):1–96.
Saha, A., Pahuja, V., Khapra, M. M., Sankaranarayanan, K., and Chandar, S. (2018). Complex
sequential question answering: Towards learning to converse over linked question answer pairs
with a knowledge graph. arXiv preprint arXiv:1801.10314.
Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Im-
proved techniques for training GANs. CoRR, abs/1606.03498.
Sarikaya, R., Hinton, G. E., and Deoras, A. (2014). Application of deep belief networks for natural
language understanding. IEEE/ACM Transactions on Audio, Speech & Language Processing,
22(4):778–784.
Schapire, R. E. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3):135–168.
Schatzmann, J., Georgila, K., and Young, S. (2005a). Quantitative evaluation of user simulation
techniques for spoken dialogue systems. In Proceedings of the 6th SIGdial Workshop on Dis-
course and Dialogue, pages 45–54.
Schatzmann, J., Stuttle, M. N., Weilhammer, K., and Young, S. (2005b). Effects of the user model
on simulation-based learning of dialogue strategies. In Proceedings of the IEEE Workshop on
Automatic Speech Recognition and Understanding (ASRU), pages 220–225.
Schatzmann, J., Weilhammer, K., Stuttle, M., and Young, S. (2006). A survey of statistical user sim-
ulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge
Engineering Review, 21(2):97–126.
Schatzmann, J. and Young, S. (2009). The hidden agenda user simulation model. IEEE Transactions
on Audio, Speech, and Language Processing, 17(4):733–747.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015a). Trust region policy
optimization. In Proceedings of the Thirty-Second International Conference on Machine Learning
(ICML), pages 1889–1897.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-dimensional contin-
uous control using generalized advantage estimation. arXiv:1506.02438.
See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: Summarization with pointer-
generator networks. arXiv preprint arXiv:1704.04368.
Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. (2016). Bidirectional attention flow for
machine comprehension. arXiv preprint arXiv:1611.01603.
Serban, I. V., Lowe, R., Charlin, L., and Pineau, J. (2015). A survey of available corpora for building
data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
Serban, I. V., Lowe, R., Henderson, P., Charlin, L., and Pineau, J. (2018). A survey of available
corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse,
9(1):1–49.
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J. (2016). Building end-to-end
dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–
3783.
Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. (2017).
A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages
3295–3301.
Shah, P., Hakkani-Tür, D. Z., Liu, B., and Tür, G. (2018). Bootstrapping a neural conversational
agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings
of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL-HLT), pages 41–51.
Shang, L., Lu, Z., and Li, H. (2015). Neural responding machine for short-text conversation. In
ACL-IJCNLP, pages 1577–1586.
Shao, C. C., Liu, T., Lai, Y., Tseng, Y., and Tsai, S. (2018). DRCD: a Chinese machine reading
comprehension dataset. arXiv preprint arXiv:1806.00920.
Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. (2017). Generating high-
quality and informative conversation responses with sequence-to-sequence models. In EMNLP,
pages 2210–2219.
Shen, Y., Chen, J., Huang, P., Guo, Y., and Gao, J. (2018). M-walk: Learning to walk in graph with
Monte Carlo tree search. CoRR, abs/1802.04394.
Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. (2014). A latent semantic model with
convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International
Conference on Information and Knowledge Management, pages 101–110.
ACM.
Shen, Y., Huang, P., Chang, M., and Gao, J. (2016). Implicit ReasoNet: Modeling large-scale
structured relationships with shared memory. CoRR, abs/1611.04642.
Shen, Y., Huang, P.-S., Chang, M.-W., and Gao, J. (2017a). Traversing knowledge graph in vector
space without symbolic space guidance. arXiv preprint arXiv:1611.04642.
Shen, Y., Huang, P.-S., Gao, J., and Chen, W. (2017b). ReasoNet: Learning to stop reading in
machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 1047–1055. ACM.
Shen, Y., Liu, X., Duh, K., and Gao, J. (2017c). An empirical analysis of multiple-turn reasoning
strategies in reading comprehension tasks. arXiv preprint arXiv:1711.03230.
Shum, H., He, X., and Li, D. (2018). From Eliza to XiaoIce: Challenges and opportunities with
social chatbots. CoRR, abs/1801.01957.
Singh, S. P., Litman, D., Kearns, M. J., and Walker, M. (2002). Optimizing dialogue management
with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelli-
gence Research, 16:105–133.

Socher, R., Chen, D., Manning, C. D., and Ng, A. (2013). Reasoning with neural tensor networks
for knowledge base completion. In Advances in neural information processing systems, pages
926–934.
Sordoni, A., Bachman, P., Trischler, A., and Bengio, Y. (2016). Iterative alternating neural attention
for machine reading. arXiv preprint arXiv:1606.02245.
Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., and Nie, J.-Y. (2015a). A
hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings
of the 24th ACM International Conference on Information and Knowledge Management,
CIKM ’15, pages 553–562. ACM.
Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Nie, J.-Y., Gao, J., and Dolan, B.
(2015b). A neural network approach to context-sensitive generation of conversational responses.
In NAACL-HLT, pages 196–205.
Stanojević, M. and Sima’an, K. (2014). Fitting sentence level translation evaluation with many dense
features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 202–206, Doha, Qatar.
Stent, A., Prasad, R., and Walker, M. A. (2004). Trainable sentence planning for complex informa-
tion presentations in spoken dialog systems. In Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL), pages 79–86.
Stone, M., Doran, C., Webber, B. L., Bleam, T., and Palmer, M. (2003). Microplanning with com-
municative intentions: The SPUD system. Computational Intelligence, 19(4):311–381.
Strehl, A. L., Li, L., and Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC
analysis. Journal of Machine Learning Research, 10:2413–2444.
Strub, F., de Vries, H., Mary, J., Piot, B., Courville, A. C., and Pietquin, O. (2017). End-to-end
optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the 26th
International Joint Conference on Artificial Intelligence (IJCAI), pages 2765–2771.
Su, P.-H., Gašić, M., Mrkšić, N., Rojas-Barahona, L. M., Ultes, S., Vandyke, D., Wen, T.-H., and
Young, S. J. (2016a). On-line active reward learning for policy optimisation in spoken dialogue
systems. In Proceedings of the 54th Annual Meeting of the Association for Computational Lin-
guistics (ACL), volume 1, pages 2431–2441.
Su, P.-H., Gašić, M., Mrkšić, N., Rojas-Barahona, L., Ultes, S., Vandyke, D., Wen, T.-H., and Young,
S. (2016b). Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.
Su, P.-H., Gašić, M., and Young, S. (2018a). Reward estimation for dialogue policy optimisation.
Computer Speech & Language, 51:24–43.
Su, P.-H., Vandyke, D., Gašić, M., Kim, D., Mrkšić, N., Wen, T.-H., and Young, S. (2015). Learning
from real users: Rating dialogue success with neural networks for reinforcement learning in spo-
ken dialogue systems. In Proceedings of the 16th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages 2007–2011.
Su, S.-Y., Li, X., Gao, J., Liu, J., and Chen, Y.-N. (2018b). Discriminative deep Dyna-Q: Ro-
bust planning for dialogue policy learning. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 3813–3823.
Su, S.-Y., Lo, K.-L., Yeh, Y. T., and Chen, Y.-N. (2018c). Natural language generation by hier-
archical decoding with linguistic patterns. In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies (NAACL-HLT), pages 61–66.
Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO: a core of semantic knowledge. In
Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM.
Suhr, A., Iyer, S., and Artzi, Y. (2018). Learning to map context-dependent sentences to executable
formal queries. arXiv preprint arXiv:1804.06868.
Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks.
In NIPS, pages 3104–3112.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning,
3(1):9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on
approximating dynamic programming. In Proceedings of the Seventh International Conference on
Machine Learning, pages 216–224.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd
edition.
Sutton, R. S., McAllester, D., Singh, S. P., and Mansour, Y. (1999a). Policy gradient methods for re-
inforcement learning with function approximation. In Advances in Neural Information Processing
Systems 12 (NIPS), pages 1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs: A framework
for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211. An
earlier version appeared as Technical Report 98-74, Department of Computer Science, University
of Massachusetts, Amherst, MA 01003. April, 1998.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R.
(2013). Intriguing properties of neural networks. CoRR, abs/1312.6199.
Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool.
Talmor, A. and Berant, J. (2018). The web as a knowledge-base for answering complex questions.
arXiv preprint arXiv:1803.06643.
Tang, D., Li, X., Gao, J., Wang, C., Li, L., and Jebara, T. (2018). Subgoal discovery for hierarchical
dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 2298–2309.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM,
38(3):58–68.
Thomas, P. S. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement
learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML),
pages 2139–2148.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3–4):285–294.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth
International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218,
Istanbul, Turkey. European Language Resources Association (ELRA).
Toutanova, K., Lin, V., Yih, W.-t., Poon, H., and Quirk, C. (2016). Compositional learning of
embeddings for relation paths in knowledge base and text. In Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1,
pages 1434–1444.
Traum, D. R. (1999). Speech acts for dialogue agents. In Foundations of Rational Agency, pages
169–201. Springer.
Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., and Suleman, K. (2016).
NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
Tromp, J. and Farnebäck, G. (2006). Combinatorics of Go. In Proceedings of the Fifth International
Conference on Computers and Games, number 4630 in Lecture Notes in Computer Science, pages
84–99.
Tur, G. and De Mori, R. (2011). Spoken language understanding: Systems for extracting semantic
information from speech. John Wiley & Sons.
Ultes, S., Budzianowski, P., Casanueva, I., Mrkšić, N., Rojas-Barahona, L. M., Su, P.-H.,
Wen, T.-H., Gašić, M., and Young, S. J. (2017a). Domain-independent user satisfaction reward
estimation for dialogue policy learning. In Proceedings of the 18th Annual Conference of the
International Speech Communication Association (INTERSPEECH), pages 1721–1725.
Ultes, S., Budzianowski, P., Casanueva, I., Mrkšić, N., Rojas-Barahona, L. M., Su, P.-H., Wen,
T.-H., Gašić, M., and Young, S. J. (2017b). Reward-balancing for statistical spoken dialogue
systems using multi-objective reinforcement learning. In Proceedings of the 18th Annual SIGdial
Meeting on Discourse and Dialogue (SIGDIAL), pages 65–70.

Ultes, S., Rojas-Barahona, L. M., Su, P.-H., Vandyke, D., Kim, D., Casanueva, I.,
Budzianowski, P., Mrkšić, N., Wen, T.-H., Gašić, M., and Young, S. J. (2017c). PyDial: A
multi-domain statistical dialogue system toolkit. In Proceedings of the Fifty-fifth Annual Meeting
of the Association for Computational Linguistics (ACL), System Demonstrations, pages 73–78.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner,
N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio.
arXiv:1609.03499.
van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-
learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16),
pages 2094–2100.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polo-
sukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing
Systems 30, pages 5998–6008. Curran Associates, Inc.
Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer networks. In Advances in Neural Infor-
mation Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
Vinyals, O. and Le, Q. (2015). A neural conversational model. In ICML Deep Learning Workshop.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: A neural image caption
generator. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 3156–3164.
Walker, M., Kamm, C., and Litman, D. (2000). Towards developing general models of usability
with PARADISE. Natural Language Engineering, 6(3–4):363–377.
Walker, M. A. (2000). An application of reinforcement learning to dialogue strategy selection in a
spoken dialogue system for email. Journal of Artificial Intelligence Research, 12:387–416.
Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. (1997). PARADISE: A framework for
evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association
for Computational Linguistics (ACL), pages 271–280.
Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. (1998). Evaluating spoken dialogue
agents with PARADISE: Two case studies. Computer Speech & Language, 12(4):317–347.
Walker, M. A., Stent, A., Mairesse, F., and Prasad, R. (2007). Individual and domain adaptation in
sentence planning for dialogue. Journal of Artificial Intelligence Research, 30:413–456.
Wang, C., Wang, Y., Huang, P.-S., Mohamed, A., Zhou, D., and Deng, L. (2017a). Sequence
modeling via segmentations. In Proceedings of the 34th International Conference on Machine
Learning, pages 3674–3683.
Wang, W., Yang, N., Wei, F., Chang, B., and Zhou, M. (2017b). Gated self-matching networks for
reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198.
Wang, Y.-Y., Deng, L., and Acero, A. (2005). Spoken language understanding: An introduction to
the statistical framework. IEEE Signal Processing Magazine, 22(5):16–31.
Wang, Z., Chen, H., Wang, G., Tian, H., Wu, H., and Wang, H. (2014). Policy learning for domain
selection in an extensible multi-domain spoken dialogue system. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 57–67.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). Dueling
network architectures for deep reinforcement learning. In Proceedings of the 33rd International
Conference on Machine Learning (ICML-16), pages 1995–2003.
Watkins, C. J. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, University of
Cambridge, UK.
Wei, W., Le, Q. V., Dai, A. M., and Li, L.-J. (2018). AirDialogue: An environment for goal-oriented
dialogue research. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 3844–3854.
Weissenborn, D., Wiese, G., and Seiffe, L. (2017). FastQA: A simple and efficient neural architec-
ture for question answering. arXiv preprint arXiv:1703.04816.
Weizenbaum, J. (1966). ELIZA: a computer program for the study of natural language communica-
tion between man and machine. Commun. ACM, 9(1):36–45.

Welbl, J., Stenetorp, P., and Riedel, S. (2017). Constructing datasets for multi-hop reading compre-
hension across documents. arXiv preprint arXiv:1710.06481.
Wen, T.-H., Gašić, M., Mrkšić, N., Rojas-Barahona, L. M., Su, P.-H., Vandyke, D., and Young,
S. J. (2016). Multi-domain neural network language generation for spoken dialogue systems.
In Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 120–129.
Wen, T.-H., Gašić, M., Mrkšić, N., Su, P.-H., Vandyke, D., and Young, S. J. (2015). Semantically
conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceed-
ings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 1711–1721.
Wen, T.-H., Vandyke, D., Mrkšić, N., Gašić, M., Rojas-Barahona, L. M., Su, P.-H., Ultes, S., and
Young, S. J. (2017). A network-based end-to-end trainable task-oriented dialogue system. In Pro-
ceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics (EACL), pages 438–449. arXiv preprint arXiv:1604.04562.
Wiering, M. and van Otterlo, M. (2012). Reinforcement Learning: State of the Art. Springer.
Williams, J. D. (2006). Partially Observable Markov Decision Processes for Spoken Dialogue Man-
agement. PhD thesis, Cambridge University, Cambridge, UK.
Williams, J. D. (2008). Evaluating user simulations with the Cramér-von Mises divergence. Speech
Communication, 50(10):829–846.
Williams, J. D., Asadi, K., and Zweig, G. (2017). Hybrid code networks: Practical and efficient
end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages
665–677.
Williams, J. D., Henderson, M., Raux, A., Thomson, B., Black, A. W., and Ramachandran, D.
(2014). The dialog state tracking challenge series. AI Magazine, 35(4):121–124.
Williams, J. D., Raux, A., Ramachandran, D., and Black, A. W. (2013). The dialog state tracking
challenge. In Proceedings of the 14th Annual Meeting of the Special Interest Group on Discourse
and Dialogue (SIGDIAL), pages 404–413.
Williams, J. D. and Young, S. J. (2007). Partially observable Markov decision processes for spoken
dialog systems. Computer Speech and Language, 21(2):393–422.
Williams, J. D. and Zweig, G. (2016). End-to-end LSTM-based dialog control optimized with
supervised and reinforcement learning.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning, 8:229–256.
Winata, G. I., Kampman, O., Yang, Y., Dey, A., and Fung, P. (2017). Nora the empathetic psychol-
ogist. In Proc. Interspeech, pages 3437–3438.
Wu, C.-S., Madotto, A., Winata, G. I., and Fung, P. (2018). End-to-end dynamic query memory
network for entity-value independent task-oriented dialog. In Proceedings of the 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6154–
6158.
Wu, J., Li, M., and Lee, C.-H. (2015). A probabilistic framework for representing dialog systems
and entropy-based dialog management through dynamic stochastic state evolution. IEEE/ACM
Transactions on Audio, Speech and Language Processing (TASLP), 23(11):2026–2035.
Wu, Q., Burges, C. J., Svore, K. M., and Gao, J. (2010). Adapting boosting for information retrieval
measures. Information Retrieval, 13(3):254–270.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao,
Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y.,
Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa,
J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s neural
machine translation system: Bridging the gap between human and machine translation. CoRR,
abs/1609.08144.
Xing, C., Wu, W., Wu, Y., Zhou, M., Huang, Y., and Ma, W. (2018). Hierarchical recurrent attention
network for response generation. In AAAI, pages 5610–5617.

Xiong, C., Zhong, V., and Socher, R. (2016). Dynamic coattention networks for question answering.
arXiv preprint arXiv:1611.01604.
Xiong, W., Hoang, T., and Wang, W. Y. (2017). DeepPath: A reinforcement learning method for
knowledge graph reasoning. arXiv preprint arXiv:1707.06690.
Xu, P., Madotto, A., Wu, C.-S., Park, J. H., and Fung, P. (2018). Emo2vec: Learning generalized
emotion representation by multi-task training. arXiv preprint arXiv:1809.04505.
Xu, Z., Liu, B., Wang, B., Sun, C., Wang, X., Wang, Z., and Qi, C. (2017). Neural response
generation via GAN with an approximate embedding layer. In EMNLP, pages 617–626.
Yaman, S., Deng, L., Yu, D., Wang, Y.-Y., and Acero, A. (2008). An integrative and discriminative
technique for spoken utterance classification. IEEE Transactions on Audio, Speech & Language
Processing, 16(6):1207–1214.
Yan, R., Song, Y., and Wu, H. (2016). Learning to respond with deep neural networks for retrieval-
based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR, pages 55–64, New
York, NY, USA. ACM.
Yang, B., Yih, W.-t., He, X., Gao, J., and Deng, L. (2015). Embedding entities and relations for
learning and inference in knowledge bases. In ICLR.
Yang, F., Yang, Z., and Cohen, W. W. (2017a). Differentiable learning of logical rules for knowledge
base completion. CoRR, abs/1702.08367.
Yang, X., Chen, Y.-N., Hakkani-Tür, D. Z., Crook, P., Li, X., Gao, J., and Deng, L. (2017b). End-
to-end joint learning of natural language understanding and dialogue manager. In Proceedings of
the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 5690–5694.
Yao, K., Zweig, G., Hwang, M.-Y., Shi, Y., and Yu, D. (2013). Recurrent neural networks for lan-
guage understanding. In Proceedings of the 14th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages 2524–2528.
Yao, K., Zweig, G., and Peng, B. (2015). Attention with intention for a neural network conversa-
tion model. In NIPS workshop on Machine Learning for Spoken Language Understanding and
Interaction.
Yao, X. and Van Durme, B. (2014). Information extraction over structured data: Question answering
with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), volume 1, pages 956–966.
Yih, W.-t., Chang, M.-W., He, X., and Gao, J. (2015a). Semantic parsing via staged query graph
generation: Question answering with knowledge base. In ACL, pages 1321–1331.
Yih, W.-t., He, X., and Gao, J. (2015b). Deep learning and continuous representations for natural
language processing. In Proceedings of the 2015 Conference of the North American Chapter of
the Association for Computational Linguistics: Tutorial.
Yih, W.-t., He, X., and Gao, J. (2016). Deep learning and continuous representations for natural
language processing. In IJCAI: Tutorial.
Young, S., Breslin, C., Gašić, M., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis,
P., and Hancock, E. T. (2016). Evaluation of statistical POMDP-based dialogue systems in noisy
environments. In Situated Dialog in Speech-Based Human-Computer Interaction, Signals and
Communication Technology, pages 3–14. Springer.
Young, S., Gašić, M., Thomson, B., and Williams, J. D. (2013). POMDP-based statistical spoken
dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
Young, S. J., Gašić, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., and Yu, K. (2010).
The Hidden Information State model: A practical framework for POMDP-based spoken dialogue
management. Computer Speech & Language, 24(2):150–174.
Yu, A. W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K., Norouzi, M., and Le, Q. V. (2018). QANet:
Combining local convolution with global self-attention for reading comprehension. arXiv preprint
arXiv:1804.09541.

Zhang, J., Zhao, T., and Yu, Z. (2018a). Multimodal hierarchical reinforcement learning policy for
task-oriented visual dialog. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and
Dialogue (SIGDIAL), pages 140–150.
Zhang, R., Guo, J., Fan, Y., Lan, Y., Xu, J., and Cheng, X. (2018b). Learning to control the speci-
ficity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1108–1117, Melbourne, Australia.
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and Weston, J. (2018c). Personalizing
dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213,
Melbourne, Australia.
Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Van Durme, B. (2018d). ReCoRD: Bridging
the gap between human and machine commonsense reading comprehension. arXiv preprint
arXiv:1810.12885.
Zhang, Y., Galley, M., Gao, J., Gan, Z., Li, X., Brockett, C., and Dolan, B. (2018e). Generating
informative and diverse conversational responses via adversarial information maximization. In
NeurIPS, pages 1815–1825.
Zhao, T. and Eskénazi, M. (2016). Towards end-to-end learning for dialog state tracking and man-
agement using deep reinforcement learning. In Proceedings of the 17th Annual Meeting of the
Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 1–10.
Zhao, T., Lu, A., Lee, K., and Eskenazi, M. (2017). Generative encoder-decoder models for task-
oriented spoken dialog systems with chatting capability. In ACL, pages 27–36.
Zhou, L., Gao, J., Li, D., and Shum, H.-Y. (2018). The design and implementation of XiaoIce, an
empathetic social chatbot. arXiv preprint arXiv:1812.08989.

Reversible Recurrent Neural Networks

Matthew MacKay, Paul Vicol, Jimmy Ba, Roger Grosse


University of Toronto
Vector Institute
{mmackay, pvicol, jba, rgrosse}@cs.toronto.edu
arXiv:1810.10999v1 [cs.LG] 25 Oct 2018

Abstract

Recurrent neural networks (RNNs) provide state-of-the-art performance in
processing sequential data but are memory intensive to train, limiting the flexibility
of RNN models which can be trained. Reversible RNNs—RNNs for which the
hidden-to-hidden transition can be reversed—offer a path to reduce the memory
requirements of training, as hidden states need not be stored and instead can be
recomputed during backpropagation. We first show that perfectly reversible RNNs,
which require no storage of the hidden activations, are fundamentally limited be-
cause they cannot forget information from their hidden state. We then provide a
scheme for storing a small number of bits in order to allow perfect reversal with
forgetting. Our method achieves comparable performance to traditional models
while reducing the activation memory cost by a factor of 10–15. We extend our
technique to attention-based sequence-to-sequence models, where it maintains
performance while reducing activation memory cost by a factor of 5–10 in the
encoder, and a factor of 10–15 in the decoder.

1 Introduction
Recurrent neural networks (RNNs) have attained state-of-the-art performance on a variety of tasks,
including speech recognition [1], language modeling [2, 3], and machine translation [4, 5]. However,
RNNs are memory intensive to train. The standard training algorithm is truncated backpropagation
through time (TBPTT) [6, 7]. In this algorithm, the input sequence is divided into subsequences of
smaller length, say T. Each of these subsequences is processed and the gradient is backpropagated.
If H is the size of our model’s hidden state, the memory required for TBPTT is O(TH).
Decreasing the memory requirements of the TBPTT algorithm would allow us to increase the length
T of our truncated sequences, capturing dependencies over longer time scales. Alternatively, we could
increase the size H of our hidden state or use deeper input-to-hidden, hidden-to-hidden, or hidden-to-
output transitions, granting our model greater expressivity. Increasing the depth of these transitions
has been shown to increase performance in polyphonic music prediction, language modeling, and
neural machine translation (NMT) [8, 9, 10].
Reversible recurrent network architectures present an enticing way to reduce the memory requirements
of TBPTT. Reversible architectures enable the reconstruction of the hidden state at the current timestep
given the next hidden state and the current input, which would enable us to perform TBPTT without
storing the hidden states at each timestep. In exchange, we pay an increased computational cost to
reconstruct the hidden states during backpropagation.
We first present reversible analogues of the widely used Gated Recurrent Unit (GRU) [11] and Long
Short-Term Memory (LSTM) [12] architectures. We then show that any perfectly reversible RNN
requiring no storage of hidden activations will fail on a simple one-step prediction task. This task is
trivial to solve even for vanilla RNNs, but perfectly reversible models fail since they need to memorize
the input sequence in order to solve the task. In light of this finding, we extend the memory-efficient
reversal method of Maclaurin et al. [13], storing a handful of bits per unit in order to allow perfect
reversal for architectures which forget information.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

We evaluate the performance of these models on language modeling and neural machine translation
benchmarks. Depending on the task, dataset, and chosen architecture, reversible models (without
attention) achieve 10–15-fold memory savings over traditional models. Reversible models achieve
approximately equivalent performance to traditional LSTM and GRU models on word-level language
modeling on the Penn TreeBank dataset [14] and lag 2–5 perplexity points behind traditional models
on the WikiText-2 dataset [15].
Achieving comparable memory savings with attention-based recurrent sequence-to-sequence models
is difficult, since the encoder hidden states must be kept simultaneously in memory in order to
perform attention. We address this challenge by performing attention over a small subset of the
hidden state, concatenated with the word embedding. With this technique, our reversible models
succeed on neural machine translation tasks, outperforming baseline GRU and LSTM models on the
Multi30K dataset [16] and achieving competitive performance on the IWSLT 2016 [17] benchmark.
Applying our technique reduces memory cost by a factor of 10–15 in the decoder, and a factor of
5–10 in the encoder.¹

2 Background
We begin by describing techniques to construct reversible neural network architectures, which we
then adapt to RNNs. Reversible networks were first motivated by the need for flexible probability
distributions with tractable likelihoods [18, 19, 20]. Each of these architectures defines a mapping
between probability distributions, one of which has a simple, known density. Because this mapping is
reversible with an easily computable Jacobian determinant, maximum likelihood training is efficient.
A recent paper, closely related to our work, showed that reversible network architectures can be
adapted to image classification tasks [21]. Their architecture, called the Reversible Residual Network
or RevNet, is composed of a series of reversible blocks. Each block takes an input $x$ and produces
an output $y$ of the same dimensionality. The input $x$ is separated into two groups, $x = [x_1; x_2]$, and
outputs are produced according to the following coupling rule:
$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1) \tag{1}$$
where $F$ and $G$ are residual functions analogous to those in standard residual networks [22]. The
output $y$ is formed by concatenating $y_1$ and $y_2$: $y = [y_1; y_2]$. Each layer’s activations can be
reconstructed from the next layer’s activations as follows:
$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2) \tag{2}$$
Because of this property, activations from the forward pass need not be stored for use in the
backwards pass. Instead, starting from the last layer, activations of previous layers are reconstructed
during backpropagation.² Because reversible backprop requires an additional computation of the
residual functions to reconstruct activations, it requires 33% more arithmetic operations than ordinary
backprop and is about 50% more expensive in practice. Full details of how to efficiently combine
reversibility with backpropagation may be found in Gomez et al. [21].
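
The coupling rule (1) and its inverse (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the RevNet implementation: the residual functions `F` and `G` below are hypothetical single-layer tanh maps chosen only so the example runs, and the sizes are arbitrary.

```python
import numpy as np


def forward_block(x1, x2, F, G):
    """Coupling rule (1): map the input halves (x1, x2) to (y1, y2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2


def reverse_block(y1, y2, F, G):
    """Inverse rule (2): recover (x1, x2) from the block output alone."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2


rng = np.random.default_rng(0)
Wf = rng.standard_normal((4, 4))
Wg = rng.standard_normal((4, 4))
F = lambda v: np.tanh(v @ Wf)  # hypothetical residual function
G = lambda v: np.tanh(v @ Wg)  # hypothetical residual function

x1 = rng.standard_normal((2, 4))
x2 = rng.standard_normal((2, 4))
y1, y2 = forward_block(x1, x2, F, G)
r1, r2 = reverse_block(y1, y2, F, G)  # matches (x1, x2) up to float rounding
```

Note that `F` and `G` never need to be inverted; only additions are undone, which is why arbitrary residual functions can be used.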

3 Reversible Recurrent Architectures


The techniques used to construct RevNets can be combined with traditional RNN models to produce
reversible RNNs. In this section, we propose reversible analogues of the GRU and the LSTM.
3.1 Reversible GRU
We start by recalling the GRU equations used to compute the hidden state $h^{(t)}$ given the
previous hidden state $h^{(t-1)}$ and the current input $x^{(t)}$ (omitting biases):
$$\begin{aligned}
[z^{(t)}; r^{(t)}] &= \sigma(W[x^{(t)}; h^{(t-1)}]) \\
g^{(t)} &= \tanh(U[x^{(t)}; r^{(t)} \odot h^{(t-1)}]) \\
h^{(t)} &= z^{(t)} \odot h^{(t-1)} + (1 - z^{(t)}) \odot g^{(t)}
\end{aligned} \tag{3}$$
Here, $\odot$ denotes elementwise multiplication. To make this update reversible, we separate the hidden
state $h$ into two groups, $h = [h_1; h_2]$. These groups are updated using the following rules:
$$\begin{aligned}
[z_1^{(t)}; r_1^{(t)}] &= \sigma(W_1[x^{(t)}; h_2^{(t-1)}]) \\
g_1^{(t)} &= \tanh(U_1[x^{(t)}; r_1^{(t)} \odot h_2^{(t-1)}]) \\
h_1^{(t)} &= z_1^{(t)} \odot h_1^{(t-1)} + (1 - z_1^{(t)}) \odot g_1^{(t)}
\end{aligned} \tag{4}$$
$$\begin{aligned}
[z_2^{(t)}; r_2^{(t)}] &= \sigma(W_2[x^{(t)}; h_1^{(t)}]) \\
g_2^{(t)} &= \tanh(U_2[x^{(t)}; r_2^{(t)} \odot h_1^{(t)}]) \\
h_2^{(t)} &= z_2^{(t)} \odot h_2^{(t-1)} + (1 - z_2^{(t)}) \odot g_2^{(t)}
\end{aligned} \tag{5}$$
Note that $h_1^{(t)}$ and not $h_1^{(t-1)}$ is used to compute the update for $h_2^{(t)}$. We term this model the Reversible
Gated Recurrent Unit, or RevGRU. Note that $z_i^{(t)} \neq 0$ for $i = 1, 2$ as it is the output of a sigmoid,
which maps to the open interval (0, 1). This means the RevGRU updates are reversible in exact
arithmetic: given $h^{(t)} = [h_1^{(t)}; h_2^{(t)}]$, we can use $h_1^{(t)}$ and $x^{(t)}$ to find $z_2^{(t)}$, $r_2^{(t)}$, and $g_2^{(t)}$ by redoing
part of our forwards computation. Then we can find $h_2^{(t-1)}$ using:
$$h_2^{(t-1)} = [h_2^{(t)} - (1 - z_2^{(t)}) \odot g_2^{(t)}] \odot 1/z_2^{(t)} \tag{6}$$
$h_1^{(t-1)}$ is reconstructed similarly. We address numerical issues which arise in practice in Section 3.3.

¹ Code will be made available at https://github.com/matthewjmackay/reversible-rnn
² The activations prior to a pooling step must still be saved, since this involves projection to a lower
dimensional space, and hence loss of information.
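
As an illustration of Equations 4 to 6, the following NumPy sketch runs one RevGRU step forward and then reconstructs the previous hidden state from the new one. It is a minimal sketch in double precision, not the authors' released code; the helper names, the parameter dictionary `p`, the weight scale, and the sizes `d` and `n` are all assumptions made for the example.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gates(x, h_other, W, U):
    """Compute update gate z and candidate g from the input and the *other* group."""
    n = U.shape[0]
    zr = sigmoid(W @ np.concatenate([x, h_other]))
    z, r = zr[:n], zr[n:]
    g = np.tanh(U @ np.concatenate([x, r * h_other]))
    return z, g


def revgru_forward(x, h1, h2, p):
    z1, g1 = gates(x, h2, p["W1"], p["U1"])      # Eq. (4): uses h2 from step t-1
    h1_new = z1 * h1 + (1.0 - z1) * g1
    z2, g2 = gates(x, h1_new, p["W2"], p["U2"])  # Eq. (5): uses the updated h1
    h2_new = z2 * h2 + (1.0 - z2) * g2
    return h1_new, h2_new


def revgru_reverse(x, h1_new, h2_new, p):
    z2, g2 = gates(x, h1_new, p["W2"], p["U2"])  # redo part of the forward pass
    h2 = (h2_new - (1.0 - z2) * g2) / z2         # Eq. (6)
    z1, g1 = gates(x, h2, p["W1"], p["U1"])
    h1 = (h1_new - (1.0 - z1) * g1) / z1         # same inversion for group 1
    return h1, h2


rng = np.random.default_rng(0)
d, n = 3, 4  # hypothetical input size and per-group hidden size
p = {
    "W1": 0.5 * rng.standard_normal((2 * n, d + n)),
    "U1": 0.5 * rng.standard_normal((n, d + n)),
    "W2": 0.5 * rng.standard_normal((2 * n, d + n)),
    "U2": 0.5 * rng.standard_normal((n, d + n)),
}
x = rng.standard_normal(d)
h1, h2 = rng.standard_normal(n), rng.standard_normal(n)
h1_new, h2_new = revgru_forward(x, h1, h2, p)
h1_rec, h2_rec = revgru_reverse(x, h1_new, h2_new, p)
```

The groups are reversed in the opposite order to the forward pass: group 2 first (its gates depend only on the known new `h1`), then group 1. In floating point the reconstruction is only approximate, and the division by `z` amplifies rounding error when `z` is small.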

3.2 Reversible LSTM


We next construct a reversible LSTM. The LSTM separates the hidden state into an output state h
and a cell state c. The update equations are:
$$[f^{(t)}, i^{(t)}, o^{(t)}] = \sigma(W[x^{(t)}, h^{(t-1)}]) \tag{7}$$
$$g^{(t)} = \tanh(U[x^{(t)}, h^{(t-1)}]) \tag{8}$$
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot g^{(t)} \tag{9}$$
$$h^{(t)} = o^{(t)} \odot \tanh(c^{(t)}) \tag{10}$$
We cannot straightforwardly apply our reversible techniques, as the update for $h^{(t)}$ is not a non-zero
linear transformation of $h^{(t-1)}$. Despite this, reversibility can be achieved using the equations:
$$[f_1^{(t)}, i_1^{(t)}, o_1^{(t)}, p_1^{(t)}] = \sigma(W_1[x^{(t)}, h_2^{(t-1)}]) \tag{11}$$
$$g_1^{(t)} = \tanh(U_1[x^{(t)}, h_2^{(t-1)}]) \tag{12}$$
$$c_1^{(t)} = f_1^{(t)} \odot c_1^{(t-1)} + i_1^{(t)} \odot g_1^{(t)} \tag{13}$$
$$h_1^{(t)} = p_1^{(t)} \odot h_1^{(t-1)} + o_1^{(t)} \odot \tanh(c_1^{(t)}) \tag{14}$$
We calculate the updates for $c_2$, $h_2$ in an identical fashion to the above equations, using $c_1^{(t)}$ and $h_1^{(t)}$.
We call this model the Reversible LSTM, or RevLSTM.
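
The reverse step for the RevLSTM is not written out above. By analogy with Equation 6, it can be obtained by solving Equations 13 and 14 for the previous states. The following is a sketch of that derivation, not a formula quoted from the paper; it assumes the gates $f_1^{(t)}$ and $p_1^{(t)}$ are non-zero, which holds in exact arithmetic because they are sigmoid outputs.

```latex
% Group 2 is reversed first: its gates depend only on x^{(t)} and h_1^{(t)}, both known.
% With h_2^{(t-1)} recovered, the gates of Eqs. (11)-(12) can be recomputed, and then:
\begin{aligned}
c_1^{(t-1)} &= \left[ c_1^{(t)} - i_1^{(t)} \odot g_1^{(t)} \right] \odot 1/f_1^{(t)} \\
h_1^{(t-1)} &= \left[ h_1^{(t)} - o_1^{(t)} \odot \tanh(c_1^{(t)}) \right] \odot 1/p_1^{(t)}
\end{aligned}
```

The extra gate $p_1^{(t)}$ is what makes the update for $h_1$ a non-zero linear function of $h_1^{(t-1)}$, so it can be undone by a subtraction followed by a division.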

3.3 Reversibility in Finite Precision Arithmetic


We have defined RNNs which are reversible in exact arithmetic. In practice, the hidden states cannot
be perfectly reconstructed due to finite numerical precision. Consider the RevGRU equations 4 and
5. If the hidden state h is stored in fixed point, multiplication of h by z (whose entries are less than
1) destroys information, preventing perfect reconstruction. Multiplying a hidden unit by 1/2, for
example, corresponds to discarding its least-significant bit, whose value cannot be recovered in the
reverse computation. These errors from information loss accumulate exponentially over timesteps,
causing the initial hidden state obtained by reversal to be far from the true initial state. The same
issue also affects the reconstruction of the RevLSTM hidden states. Hence, we find that forgetting is
the main roadblock to constructing perfectly reversible recurrent architectures.
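
The information loss from multiplying by 1/2 can be seen directly on the stored integers. In this small sketch the number of fractional bits `R_H = 4` is a hypothetical value chosen for readability, and the multiplication by 1/2 is modelled as a right shift.

```python
R_H = 4  # hypothetical number of fractional bits, kept small for readability


def halve(h_star):
    """Multiply a fixed-point value by 1/2 via a right shift of its stored integer."""
    return h_star >> 1


a, b = 0b10110, 0b10111  # stored integers 22 and 23, i.e. h = 1.375 and h = 1.4375
halved_a, halved_b = halve(a), halve(b)  # both become 11: the inputs are now indistinguishable

naive_restore = halved_b << 1            # doubling cannot bring back the dropped bit
exact_restore = (halved_b << 1) | (b & 1)  # works only if the dropped bit was saved
```

Two distinct hidden values collapse to the same result, so no reverse computation can tell them apart unless the discarded bit is kept somewhere.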
There are two possible avenues to address this limitation. The first is to remove the forgetting step.
For the RevGRU, this means we compute $z_i^{(t)}$, $r_i^{(t)}$, and $g_i^{(t)}$ as before, and update $h_i^{(t)}$ using:
$$h_i^{(t)} = h_i^{(t-1)} + (1 - z_i^{(t)}) \odot g_i^{(t)} \tag{15}$$

We term this model the No-Forgetting RevGRU or NF-RevGRU. Because the updates of the NF-
RevGRU do not discard information, we need only store one hidden state in memory at a given time
during training. Similar steps can be taken to define an NF-RevLSTM.
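
To see why the additive update of Equation 15 discards nothing, consider a single hidden unit stored as a fixed-point integer. The sketch below assumes `R = 10` fractional bits and a hypothetical quantisation helper `to_fixed`; it also assumes the gate `z` and candidate `g` are recomputed identically in the reverse pass.

```python
R = 10  # hypothetical number of fractional bits


def to_fixed(v, r=R):
    """Quantise a real value to the integer that would be stored in fixed point."""
    return int(round(v * 2**r))


def nf_update(h_star, z, g, r=R):
    """Eq. (15) on stored integers: add the quantised increment (1 - z) * g."""
    return h_star + to_fixed((1.0 - z) * g, r)


def nf_reverse(h_star_new, z, g, r=R):
    """Undo the update by subtracting the same quantised increment."""
    return h_star_new - to_fixed((1.0 - z) * g, r)


h_star = to_fixed(0.75)
h_new = nf_update(h_star, z=0.3, g=0.5)
h_back = nf_reverse(h_new, z=0.3, g=0.5)  # equals h_star exactly
```

Integer addition is exactly invertible by subtraction, so the reconstruction is bit-for-bit rather than approximate.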
The second avenue is to accept some memory usage and store the information forgotten from the
hidden state in the forward pass. We can then achieve perfect reconstruction by restoring this
information to our hidden state in the reverse computation. We discuss how to do so efficiently in
Section 5.

[Figure 1 diagram not reproduced. Left panel: tokens A, B, C over hidden states h(0) to h(3). Right panel: tokens C, B, A over hidden states h(3) back to h(0).]

Figure 1: Unrolling the reverse computation of an exactly reversible model on the repeat task yields a
sequence-to-sequence computation. Left: The repeat task itself, where the model repeats each input token. Right:
Unrolling the reversal. The model effectively uses the final hidden state to reconstruct all input tokens, implying
that the entire input sequence must be stored in the final hidden state.

4 Impossibility of No Forgetting
We have shown reversible RNNs in finite precision can be constructed by ensuring that no information
is discarded. We were unable to find such an architecture that achieved acceptable performance on
tasks such as language modeling.³ This is consistent with prior work which found forgetting to be
crucial to LSTM performance [23, 24]. In this section, we argue that this results from a fundamental
limitation of no-forgetting reversible models: if none of the hidden state can be forgotten, then the
hidden state at any given timestep must contain enough information to reconstruct all previous hidden
states. Thus, any information stored in the hidden state at one timestep must remain present at all
future timesteps to ensure exact reconstruction, overwhelming the storage capacity of the model.
We make this intuition concrete by considering an elementary sequence learning task, the repeat task.
In this task, an RNN is given a sequence of discrete tokens and must simply repeat each token at the
subsequent timestep. This task is trivially solvable by ordinary RNN models with only a handful of
hidden units, since it doesn’t require modeling long-distance dependencies. But consider how an
exactly reversible model would perform the repeat task. Unrolling the reverse computation, as shown
in Figure 1, reveals a sequence-to-sequence computation in which the encoder and decoder weights
are tied. The encoder takes in the tokens and produces a final hidden state. The decoder uses this
final hidden state to produce the input sequence in reverse sequential order.
Notice the relationship to another sequence learning task, the memorization task, used as part of
a curriculum learning strategy by Zaremba and Sutskever [25]. After an RNN observes an entire
sequence of input tokens, it is required to output the input sequence in reverse order. As shown in
Figure 1, the memorization task for an ordinary RNN reduces to the repeat task for an NF-RevRNN.
Hence, if the memorization task requires a hidden representation size that grows with the sequence
length, then so does the repeat task for NF-RevRNNs.
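
The two tasks can be made concrete with a small data-construction sketch. The exact input/target alignment and the padding symbol `PAD` are assumptions for illustration; the text above only specifies that the repeat task echoes each token one step later and that the memorization task emits the sequence in reverse after reading all of it.

```python
PAD = 0  # hypothetical padding / "no output" symbol


def repeat_task(tokens):
    """Target at step t+1 is the input token seen at step t."""
    inputs = list(tokens) + [PAD]
    targets = [PAD] + list(tokens)
    return inputs, targets


def memorization_task(tokens):
    """Read the whole sequence, then emit it in reverse order."""
    n = len(tokens)
    inputs = list(tokens) + [PAD] * n
    targets = [PAD] * n + list(reversed(tokens))
    return inputs, targets


rep_in, rep_out = repeat_task([3, 1, 2])
mem_in, mem_out = memorization_task([3, 1, 2])
```

In the repeat task each target depends only on the previous input, whereas in the memorization task the first emitted token depends on an input seen a full sequence length earlier.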
We confirmed experimentally that NF-RevGRU and NF-RevLSTM networks with limited capacity
were unable to solve the repeat task.⁴ Interestingly, the NF-RevGRU was able to memorize input
sequences using considerably fewer hidden units than the ordinary GRU or LSTM, suggesting it may
be a useful architecture for tasks requiring memorization. Consistent with the results on the repeat
task, the NF-RevGRU and NF-RevLSTM were unable to match the performance of even vanilla
RNNs on word-level language modeling on the Penn TreeBank dataset [14].

5 Reversibility with Forgetting


The impossibility of zero forgetting leads us to explore the second possibility to achieve reversibility:
storing information lost from the hidden state during the forward computation, then restoring it in the
reverse computation. Initially, we investigated discrete forgetting, in which only an integral number
of bits are allowed to be forgotten. This leads to a simple implementation: if n bits are forgotten in
the forwards pass, we can store these n bits on a stack, to be popped off and restored to the hidden
state during reconstruction. However, restricting our model to forget only an integral number of
bits led to a substantial drop in performance compared to baseline models.⁵ For the remainder of
this paper, we turn to fractional forgetting, in which a fractional number of bits are allowed to be
forgotten.

³ We discuss our failed attempts in Appendix A.
⁴ We include full results and details in Appendix B. The argument presented applies to idealized RNNs able
to implement any hidden-to-hidden transition and whose hidden units can store 32 bits each. We chose to use
the LSTM and the NF-RevGRU as approximations to these idealized models since they performed best at their
respective tasks.
⁵ Algorithmic details and experimental results for discrete forgetting are given in Appendix D.

Algorithm 1 Exactly reversible multiplication (Maclaurin et al. [13])
1: Input: Buffer integer $B$, hidden state $h = 2^{-R_H} h^*$, forget value $z = 2^{-R_Z} z^*$ with $0 < z^* < 2^{R_Z}$
2: $B \leftarrow B \times 2^{R_Z}$ {make room for new information on buffer}
3: $B \leftarrow B + (h^* \bmod 2^{R_Z})$ {store lost information in buffer}
4: $h^* \leftarrow h^* \div 2^{R_Z}$ {divide by denominator of z}
5: $h^* \leftarrow h^* \times z^*$ {multiply by numerator of z}
6: $h^* \leftarrow h^* + (B \bmod z^*)$ {add information to hidden state}
7: $B \leftarrow B \div z^*$ {shorten information buffer}
8: return updated buffer $B$, updated value $h = 2^{-R_H} h^*$
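
The stack-based discrete forgetting scheme described at the start of this section (before fractional forgetting is introduced) can be sketched as follows. This is only an illustration of the push/pop idea on a single stored integer, not the Appendix D algorithm; the function names are hypothetical, and forgetting `n` bits is modelled as a right shift by `n`.

```python
def forget_bits(h_star, n, stack):
    """Forward pass: drop the n least-significant bits, saving them on the stack."""
    stack.append(h_star & ((1 << n) - 1))
    return h_star >> n


def restore_bits(h_star, n, stack):
    """Reverse pass: pop the saved bits and put them back in place."""
    return (h_star << n) | stack.pop()


stack = []
h0 = 0b11010110                  # stored integer 214
h1 = forget_bits(h0, 3, stack)   # forget 3 bits -> 26
h2 = forget_bits(h1, 2, stack)   # forget 2 more bits -> 6

r1 = restore_bits(h2, 2, stack)  # undo in reverse order
r0 = restore_bits(r1, 3, stack)
```

Because the reverse pass visits timesteps in the opposite order, a last-in-first-out stack returns the saved bits in exactly the order they are needed.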
To allow forgetting of a fractional number of bits, we use a technique introduced by Maclaurin et al.
[13] to store lost information. To avoid cumbersome notation, we do away with super- and subscripts
and consider a single hidden unit $h$ and its forget value $z$. We represent $h$ and $z$ as fixed-point
numbers (integers with an implied radix point). For clarity, we write $h = 2^{-R_H} h^*$ and $z = 2^{-R_Z} z^*$.
Hence, $h^*$ is the number stored on the computer and multiplication by $2^{-R_H}$ supplies the implied
radix point. In general, $R_H$ and $R_Z$ are distinct. Our goal is to multiply $h$ by $z$, storing as few bits as
necessary to make this operation reversible.
The full process of reversible multiplication is shown in detail in Algorithm 1. The algorithm
maintains an integer information buffer which stores $h^* \bmod 2^{R_Z}$ at each timestep, so integer
division of $h^*$ by $2^{R_Z}$ is reversible. However, this requires enlarging the buffer by $R_Z$ bits at each
timestep. Maclaurin et al. [13] reduced this storage requirement by shifting information from the
buffer back onto the hidden state. Reversibility is preserved if the shifted information is small enough
so that it does not affect the reverse operation (integer division of $h^*$ by $z^*$). We include a full review
of the algorithm of Maclaurin et al. [13] in Appendix C.1.
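The forward steps of Algorithm 1 can be sketched in Python, whose integers have arbitrary precision and so match the unbounded buffer the algorithm assumes. The `reverse_multiply` function is not listed in this section: it is an assumption obtained by undoing the forward steps in the opposite order. The function names and the example values are hypothetical.

```python
R_Z = 10  # fractional bits of the forget value z (the value reported in the text)


def forward_multiply(buffer, h_star, z_star, r_z=R_Z):
    """One forward pass through Algorithm 1 on the integers h_star and z_star."""
    assert 0 < z_star < (1 << r_z)
    buffer = buffer * (1 << r_z)            # make room for new information
    buffer = buffer + h_star % (1 << r_z)   # store the bits lost by the division
    h_star = h_star // (1 << r_z)           # divide by the denominator of z
    h_star = h_star * z_star                # multiply by the numerator of z
    h_star = h_star + buffer % z_star       # shift information onto the hidden state
    buffer = buffer // z_star               # shorten the buffer
    return buffer, h_star


def reverse_multiply(buffer, h_star, z_star, r_z=R_Z):
    """Assumed inverse: undo the forward steps in the opposite order."""
    buffer = buffer * z_star
    buffer = buffer + h_star % z_star       # recover the shifted information
    h_star = h_star // z_star               # undo the multiplication by z_star
    h_star = h_star * (1 << r_z)
    h_star = h_star + buffer % (1 << r_z)   # restore the stored remainder
    buffer = buffer // (1 << r_z)
    return buffer, h_star


# Round trip over several timesteps with hypothetical values.
h0, forget_values = 5_000_000, [700, 513, 999]
buf, h = 0, h0
for z in forget_values:
    buf, h = forward_multiply(buf, h, z)
for z in reversed(forget_values):
    buf, h = reverse_multiply(buf, h, z)
assert (buf, h) == (0, h0)
```

The round trip relies on Python's floor division and non-negative modulo, so it also holds for negative `h_star`; a language with truncating division would need extra care.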
However, this trick introduces a new complication not discussed by Maclaurin et al. [13]: the
information shifted from the buffer could introduce significant noise into the hidden state. Shifting
information requires adding a positive value less than $z^*$ to $h^*$. Because $z^* \in (0, 2^{R_Z})$ ($z$ is the output
of a sigmoid function and $z = 2^{-R_Z} z^*$), $h = 2^{-R_H} h^*$ may be altered by as much as $(2^{R_Z} - 1)/2^{R_H}$.
If $R_Z \geq R_H$, this can alter the hidden state $h$ by 1 or more.^6 This is substantial, as in practice we
observe $|h| \leq 16$. Indeed, we observed severe performance drops for $R_H$ and $R_Z$ close to equal.
The solution is to limit the amount of information moved from the buffer to the hidden state by setting
$R_Z$ smaller than $R_H$. We found $R_H = 23$ and $R_Z = 10$ to work well. The amount of noise added
onto the hidden state is bounded by $2^{R_Z - R_H}$, so with these values, the hidden state is altered by at
most $2^{-13}$. While the precision of our forgetting value $z$ is limited to 10 bits, previous work has
found that neural networks can be trained with precision as low as 10–15 bits and reach the same
performance as high precision networks [26, 27]. We find our situation to be similar.
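As a quick numerical check of the bound discussed above, the following sketch evaluates the largest change to $h$ for the reported setting and for the problematic case where the two precisions are equal. The function name `max_noise` is hypothetical; the formula is the one stated in the text.

```python
def max_noise(r_h, r_z):
    """Largest change to h when a value below 2**r_z is added to h*."""
    return (2 ** r_z - 1) / 2 ** r_h


reported = max_noise(r_h=23, r_z=10)   # setting reported to work well
equal = max_noise(r_h=23, r_z=23)      # precisions equal: noise approaches 1

assert reported < 2 ** -13             # consistent with the stated bound
assert equal > 0.99                    # large relative to |h| <= 16
print(reported, equal)
```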
Memory Savings. To analyze the savings that are theoretically possible using the procedure above,
consider an idealized memory buffer which maintains dynamically resizing storage integers $B_h^i$ for
each hidden unit $h$ in groups $i = 1, 2$ of the RevGRU model. Using the above procedure, at each
timestep the number of bits stored in each $B_h^i$ grows by:
$R_Z - \log_2(z^*_{i,h}) = \log_2\left(2^{R_Z} / z^*_{i,h}\right) = \log_2(1 / z_{i,h})$ (16)
If the entries of $z_{i,h}$ are not close to zero, this compares favorably with the naïve cost of 32 bits
per timestep. The total storage cost of TBPTT for a RevGRU model with hidden state size $H$ on a
sequence of length $T$ will be:^7
$-\sum_{t=1}^{T} \sum_{h=1}^{H} \left[ \log_2(z_{1,h}^{(t)}) + \log_2(z_{2,h}^{(t)}) \right]$ (17)

Thus, in the idealized case, the number of bits stored equals the number of bits forgotten.
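The idealized cost in Equation 17 can be evaluated directly from the forget values. The sketch below is illustrative only: the function name `idealized_bits`, the nested-list layout `z[t][h]`, and the example values are assumptions, not part of the original experiments.

```python
import math


def idealized_bits(z1, z2):
    """Equation 17: total bits stored, with z1[t][h] and z2[t][h] in (0, 1)."""
    total = 0.0
    for z1_t, z2_t in zip(z1, z2):          # loop over timesteps t
        for a, b in zip(z1_t, z2_t):        # loop over hidden units h
            total -= math.log2(a) + math.log2(b)
    return total


# Hypothetical forget values for T = 2 timesteps and H = 2 hidden units.
z1 = [[0.5, 0.25], [0.5, 0.5]]
z2 = [[0.5, 0.5], [0.25, 0.5]]
print(idealized_bits(z1, z2))
```

A forget value of 0.5 contributes one bit and 0.25 contributes two, which is the sense in which the bits stored equal the bits forgotten.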
6
We illustrate this phenomenon with a concrete example in Appendix C.2.
7
For the RevLSTM, we would sum over $p_i^{(t)}$ and $f_i^{(t)}$ terms.

[Diagram omitted: encoder, decoder, and attention module; the decoder input begins with <SOS>.]
Figure 2: Attention mechanism for NMT. The word embeddings, encoder hidden states, and decoder hidden
states are color-coded orange, blue, and green, respectively; the striped regions of the encoder hidden states
represent the slices that are stored in memory for attention. The final vectors used to compute the context vector
are concatenations of the word embeddings and encoder hidden state slices.

5.1 GPU Considerations


For our method to be used as part of a practical training procedure, we must run it on a parallel
architecture such as a GPU. This introduces additional considerations which require modifications to
Algorithm 1: (1) we implement it with ordinary finite-bit integers, hence dealing with overflow, and
(2) for GPU efficiency, we ensure uniform memory access patterns across all hidden units.
Overflow. Consider the storage required for a single hidden unit. Algorithm 1 assumes unboundedly
large integers, and hence would need to be implemented using dynamically resizing integer types,
as was done by Maclaurin et al. [13]. But such data structures would require non-uniform memory
access patterns, limiting their efficiency on GPU architectures. Therefore, we modify the algorithm
to use ordinary finite integers. In particular, instead of a single integer, the buffer is represented
with a sequence of 64-bit integers $(B_0, \ldots, B_D)$. Whenever the last integer in our buffer is about to
overflow upon multiplication by $2^{R_Z}$, as required by line 2 of Algorithm 1, we append a new integer
$B_{D+1}$ to the sequence. Overflow will occur if $B_D > 2^{64 - R_Z}$.
After appending a new integer $B_{D+1}$, we apply Algorithm 1 unmodified, using $B_{D+1}$ in place of $B$.
It is possible that up to $R_Z - 1$ bits of $B_D$ will not be used, incurring an additional penalty on storage
cost. We experimented with several ways of alleviating this penalty but found that none improved
significantly over the storage cost of the initial method.
Vectorization. Vectorization imposes an additional penalty on storage. For efficient computation,
we cannot maintain different size lists as buffers for each hidden unit in a minibatch. Rather, we must
store the buffer as a three-dimensional tensor, with dimensions corresponding to the minibatch size,
the hidden state size, and the length of the buffer list. This means each list of integers being used as a
buffer for a given hidden unit must be the same size. Whenever a buffer being used for any hidden
unit in the minibatch overflows, an extra integer must be added to the buffer list for every hidden unit
in the minibatch. Otherwise, the steps outlined above can still be followed.
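The uniform-growth rule described above can be sketched with nested Python lists standing in for the three-dimensional buffer tensor. This is a simplified illustration, not the revised algorithm of Appendix C.3: the name `grow_if_needed` and the list layout `buffers[batch][unit]` are assumptions, and the check uses `>=` as a slightly conservative variant of the overflow condition stated in the text.

```python
WORD_BITS = 64  # width of each buffer integer


def grow_if_needed(buffers, r_z):
    """Append one integer to every unit's buffer list if any unit would overflow.

    buffers[b][h] is the list of buffer integers for hidden unit h of
    minibatch element b; all lists have the same length.
    """
    threshold = 1 << (WORD_BITS - r_z)
    overflow = any(unit[-1] >= threshold for row in buffers for unit in row)
    if overflow:
        # Every list grows together so the buffer stays a regular 3-D array.
        for row in buffers:
            for unit in row:
                unit.append(0)
    return overflow


# Hypothetical minibatch of 2 examples with 2 hidden units each.
buffers = [[[5], [1 << 60]], [[0], [7]]]
grew = grow_if_needed(buffers, r_z=10)
assert grew and all(len(unit) == 2 for row in buffers for unit in row)
```

Only one unit was close to overflowing, yet all four lists grow, which is the extra storage penalty that vectorization imposes.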
We give the complete, revised algorithm in Appendix C.3. The compromises to address overflow and
vectorization entail additional overhead. We measure the size of this overhead in Section 6.
5.2 Memory Savings with Attention
Most modern architectures for neural machine translation make use of attention mechanisms [4, 5];
in this section, we describe the modifications that must be made to obtain memory savings when
using attention. We denote the source tokens by $x^{(1)}, x^{(2)}, \ldots, x^{(T)}$, and the corresponding word
embeddings by $e^{(1)}, e^{(2)}, \ldots, e^{(T)}$. We also use the following notation to denote vector slices: given
a vector $v \in \mathbb{R}^D$, we let $v[:k] \in \mathbb{R}^k$ denote the vector consisting of the first $k$ dimensions of $v$.
Standard attention-based models for NMT perform attention over the encoder hidden states; this is
problematic from the standpoint of memory savings, because we must retain the hidden states in
memory to use them when computing attention. To remedy this, we explore several alternatives to
storing the full hidden state in memory. In particular, we consider performing attention over: 1) the
embeddings $e^{(t)}$, which capture the semantics of individual words; 2) slices of the encoder hidden

Table 1: Validation perplexities (memory savings) on Penn TreeBank word-level language modeling. Results
shown when forgetting is restricted to 2, 3, and 5 bits per hidden unit per timestep and when there is no restriction.

Reversible model   2 bits        3 bits        5 bits       No limit     Usual model    No limit
1 layer RevGRU     82.2 (13.8)   81.1 (10.8)   81.1 (7.4)   81.5 (6.4)   1 layer GRU    82.2
2 layer RevGRU     83.8 (14.8)   83.8 (12.0)   82.2 (9.4)   82.3 (4.9)   2 layer GRU    81.5
1 layer RevLSTM    79.8 (13.8)   79.4 (10.1)   78.4 (7.4)   78.2 (4.9)   1 layer LSTM   78.0
2 layer RevLSTM    74.7 (14.0)   72.8 (10.0)   72.9 (7.3)   72.9 (4.9)   2 layer LSTM   73.0

states, $h_{enc}^{(t)}[:k]$ (where we consider $k = 20$ or $100$); and 3) the concatenation of embeddings and
hidden state slices, $[e^{(t)}; h_{enc}^{(t)}[:k]]$. Since the embeddings are computed directly from the input
tokens, they do not need to be stored. When we slice the hidden state, only the slices that are attended
to must be stored. We apply our memory-saving buffer technique to the remaining $D - k$ dimensions.
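The third option can be made concrete with a small sketch showing which parts of an encoder state are kept for attention and which are left to the buffer technique. The function name `split_for_attention` and the toy vectors are hypothetical, and plain lists stand in for tensors.

```python
def split_for_attention(e_t, h_enc_t, k):
    """Return the attended vector [e; h_enc[:k]] and the two parts of h_enc."""
    stored_slice = list(h_enc_t[:k])      # first k dimensions: kept in memory
    buffered_part = list(h_enc_t[k:])     # remaining D - k: reconstructed in reverse
    attended = list(e_t) + stored_slice   # concatenation used for attention
    return attended, stored_slice, buffered_part


# Toy example with a 3-dimensional embedding and D = 5, k = 2.
attended, stored, buffered = split_for_attention(
    e_t=[0.1, 0.2, 0.3], h_enc_t=[1.0, 2.0, 3.0, 4.0, 5.0], k=2
)
assert attended == [0.1, 0.2, 0.3, 1.0, 2.0]
assert len(stored) == 2 and len(buffered) == 3
```

Only `stored` adds to activation memory per source token; the embedding can be recomputed from the token itself.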
In our NMT models, we make use of the global attention mechanism introduced by Luong et
al. [28], where each decoder hidden state $h_{dec}^{(t)}$ is modified by incorporating context from the source
annotations: a context vector $c^{(t)}$ is computed as a weighted sum of source annotations (with weights
$\alpha_j^{(t)}$); $h_{dec}^{(t)}$ and $c^{(t)}$ are used to produce an attentional decoder hidden state $\tilde{h}_{dec}^{(t)}$. Figure 2 illustrates
this attention mechanism, where attention is performed over the concatenated embeddings and hidden
state slices. Additional details on attention are provided in Appendix F.
5.3 Additional Considerations
Restricting forgetting. In order to guarantee memory savings, we may restrict the entries of $z_i^{(t)}$
to lie in $(a, 1)$ rather than $(0, 1)$, for some $a > 0$. Setting $a = 0.5$, for example, forces our model to
forget at most one bit from each hidden unit per timestep. This restriction may be accomplished by
applying the linear transformation $x \mapsto (1 - a)x + a$ to $z_i^{(t)}$ after its initial computation.^8
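The restriction can be sketched as follows. The mapping is the linear transformation given above; the helper `lower_bound_for_bits`, which picks $a = 2^{-n}$ for a limit of $n$ bits, is an inference from Equation 16 (a forget value $z$ costs $\log_2(1/z)$ bits) and agrees with the $a = 0.5$ example. Both function names are hypothetical.

```python
import math


def restrict_forgetting(z, a):
    """Map a forget value z in (0, 1) into (a, 1) via x -> (1 - a) x + a."""
    return (1.0 - a) * z + a


def lower_bound_for_bits(max_bits):
    """Assumed choice of a so that at most max_bits bits are forgotten per step."""
    return 2.0 ** -max_bits


a = lower_bound_for_bits(2)                 # limit of 2 bits gives a = 0.25
z_restricted = restrict_forgetting(0.01, a)
bits_forgotten = -math.log2(z_restricted)   # cost from Equation 16
assert a < z_restricted < 1.0 and bits_forgotten < 2.0
```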
Limitations. The main flaw of our method is the increased computational cost. We must reconstruct
hidden states during the backwards pass and manipulate the buffer at each timestep. We find that
each step of reversible backprop takes about 2-3 times as much computation as regular backprop. We
believe this overhead could be reduced through careful engineering. We did not observe a slowdown
in convergence in terms of number of iterations, so we only pay an increased per-iteration cost.
6 Experiments
We evaluated the performance of reversible models on two standard RNN tasks: language modeling
and machine translation. We wished to determine how much memory we could save using the
techniques we have developed, how these savings compare with those possible using an idealized
buffer, and whether these memory savings come at a cost in performance. We also evaluated our
proposed attention mechanism on machine translation tasks.
6.1 Language Modeling Experiments
We evaluated our one- and two-layer reversible models on word-level language modeling on the Penn
Treebank [14] and WikiText-2 [15] corpora. In the interest of a fair comparison, we kept architectural
and regularization hyperparameters the same between all models and datasets. We regularized the
hidden-to-hidden, hidden-to-output, and input-to-hidden connections, as well as the embedding
matrix, using various forms of dropout.^9 We used the hyperparameters from Merity et al. [3]. Details
are provided in Appendix G.1. We include training/validation curves for all models in Appendix I.
6.1.1 Penn TreeBank Experiments
We conducted experiments on Penn TreeBank to understand the performance of our reversible models,
how much restrictions on forgetting affect performance, and what memory savings are achievable.
Performance. With no restriction on the amount forgotten, one- and two-layer RevGRU and
RevLSTM models obtained roughly equivalent validation performance10 compared to their non-
8
For the RevLSTM, we would apply this transformation to $p_i^{(t)}$ and $f_i^{(t)}$.
9
We discuss why dropout does not require additional storage in Appendix E.
10
Test perplexities exhibit similar patterns but are 3–5 perplexity points lower.

Table 2: Validation perplexities on WikiText-2 word-level language modeling. Results shown when forgetting is
restricted to 2, 3, and 5 bits per hidden unit per timestep and when there is no restriction.

Reversible model   2 bits   3 bits   5 bits   No limit   Usual model    No limit
1 layer RevGRU     97.7     97.2     96.3     97.1       1 layer GRU    97.8
2 layer RevGRU     95.2     94.7     95.3     95.0       2 layer GRU    93.6
1 layer RevLSTM    94.8     94.5     94.5     94.1       1 layer LSTM   89.3
2 layer RevLSTM    90.7     87.7     87.0     86.0       2 layer LSTM   82.2

reversible counterparts, as shown in Table 1. To determine how little could be forgotten without
affecting performance, we also experimented with restricting forgetting to at most 2, 3, or 5 bits per
hidden unit per timestep using the method of Section 5.3. Restricting the amount of forgetting to 2, 3,
or 5 bits from each hidden unit did not significantly impact performance.
Performance suffered once forgetting was restricted to at most 1 bit. This caused a 4–5 point increase in
perplexity for the RevGRU. It also made the RevLSTM unstable for this task since its hidden state,
unlike the RevGRU’s, can grow unboundedly if not enough is forgotten. Hence, we do not include
these results.

Memory savings. We tracked the size of the information buffer throughout training and used this
to compare the memory required when using reversibility vs. storing all activations. As shown in
Appendix H, the buffer size remains roughly constant throughout training. Therefore, we show
the average ratio of memory requirements during training in Table 1. Overall, we can achieve a
10–15-fold reduction in memory when forgetting at most 2–3 bits, while maintaining comparable
performance to standard models. Using Equation 17, we also compared the actual memory savings
to the idealized memory savings possible with a perfect buffer. In general, we use about twice the
amount of memory as theoretically possible. Plots of memory savings for all models, both idealized
and actual, are given in Appendix H.

6.1.2 WikiText-2 Experiments


We conducted experiments on the WikiText-2 dataset (WT2) to see how reversible models fare on a
larger, more challenging dataset. We investigated various restrictions, as well as no restriction, on
forgetting and contrasted with baseline models as shown in Table 2. The RevGRU model closely
matched the performance of the baseline GRU model, even with forgetting restricted to 2 bits. The
RevLSTM lagged behind the baseline LSTM by about 5 perplexity points for one- and two-layer
models.

6.2 Neural Machine Translation Experiments


We further evaluated our models on English-to-German neural machine translation (NMT). We used
a unidirectional encoder-decoder model and our novel attention mechanism described in Section
5.2. We experimented on two datasets: Multi30K [16], a dataset of ∼30,000 sentence pairs derived
from Flickr image captions, and IWSLT 2016 [17], a larger dataset of ∼180,000 pairs. Experimental
details are provided in Appendix G.2; training and validation curves are shown in Appendix I.3
(Multi30K) and I.4 (IWSLT); plots of memory savings during training are shown in Appendix H.2.
For Multi30K, we used single-layer RNNs with 300-dimensional hidden states and 300-dimensional
word embeddings for both the encoder and decoder. Our baseline GRU and LSTM models achieved
test BLEU scores of 32.60 and 37.06, respectively. The test BLEU scores and encoder memory
savings achieved by our reversible models are shown in Table 3, for several variants of attention
and restrictions on forgetting. For attention, we use Emb to denote word embeddings, xH for an
x-dimensional slice of the hidden state (300H denotes the whole hidden state), and Emb+xH to
denote the concatenation of the two. Overall, while Emb attention achieved the best memory savings,
Emb+20H achieved the best balance between performance and memory savings. The RevGRU with
Emb+20H attention and forgetting at most 2 bits achieved a test BLEU score of 34.41, outperforming
the standard GRU, while reducing activation memory requirements by 7.1× and 14.8× in the encoder
and decoder, respectively. The RevLSTM with Emb+20H attention and forgetting at most 3 bits
achieved a test BLEU score of 37.23, outperforming the standard LSTM, while reducing activation
memory requirements by 8.9× and 11.1× in the encoder and decoder respectively.

Table 3: Performance on the Multi30K dataset with different restrictions on forgetting. P denotes the test BLEU
scores; M denotes the average memory savings of the encoder during training.
Model     Attention   1 bit          2 bits         3 bits         5 bits         No limit
                      P      M       P      M       P      M       P      M       P      M
RevLSTM   20H         29.18  11.8    30.63  9.5     30.47  8.5     30.02  7.3     29.13  6.1
RevLSTM   100H        27.90  4.9     35.43  4.3     36.03  4.0     35.75  3.7     34.96  3.5
RevLSTM   300H        26.44  1.0     36.10  1.0     37.05  1.0     37.30  1.0     36.80  1.0
RevLSTM   Emb         31.92  20.0    31.98  15.1    31.60  13.9    31.42  10.7    31.45  10.1
RevLSTM   Emb+20H     36.80  12.1    36.78  9.9     37.23  8.9     36.45  8.1     37.30  7.4
RevGRU    20H         26.52  7.2     26.86  7.2     28.26  6.8     27.71  6.5     27.86  5.7
RevGRU    100H        33.28  2.6     32.53  2.6     31.44  2.5     31.60  2.4     31.66  2.3
RevGRU    300H        34.86  1.0     33.49  1.0     33.01  1.0     33.03  1.0     33.08  1.0
RevGRU    Emb         28.51  13.2    28.76  13.2    28.86  12.9    27.93  12.8    28.59  12.9
RevGRU    Emb+20H     34.00  7.2     34.41  7.1     34.39  6.4     34.04  5.9     34.94  5.7

For IWSLT 2016, we used 2-layer RNNs with 600-dimensional hidden states and 600-dimensional
word embeddings for the encoder and decoder. We evaluated reversible models in which the decoder
used Emb+60H attention. The baseline GRU and LSTM models achieved test BLEU scores of 16.07
and 22.35, respectively. The RevGRU achieved a test BLEU score of 20.70, outperforming the GRU,
while saving 7.15× and 12.92× in the encoder and decoder, respectively. The RevLSTM achieved a
score of 22.34, competitive with the LSTM, while saving 8.32× and 6.57× memory in the encoder
and decoder, respectively. Both reversible models were restricted to forget at most 5 bits.

7 Related Work
Several approaches have been taken to reduce the memory requirements of RNNs. Frameworks
that use static computational graphs [29, 30] aim to allocate memory efficiently in the training
algorithms themselves. Checkpointing [31, 32, 33] is a frequently used method. In this strategy,
certain activations are stored as checkpoints throughout training and the remaining activations are
recomputed as needed in the backwards pass. Checkpointing has previously been used to train
recurrent neural networks on sequences of length $T$ by storing the activations every $\lceil \sqrt{T} \rceil$ layers
[31]. Gruslys et al. [33] further developed this strategy by using dynamic programming to determine
which activations to store in order to minimize computation for a given storage budget.
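For comparison with our approach, the basic checkpointing strategy can be sketched as follows. This is a minimal illustration of the square-root rule only, not the dynamic programming method of Gruslys et al. [33]; the function names and the toy transition are hypothetical.

```python
import math


def run_with_checkpoints(step, h0, inputs):
    """Run h_{t+1} = step(h_t, x_t), keeping every ceil(sqrt(T))-th state."""
    num_steps = len(inputs)
    stride = math.isqrt(num_steps - 1) + 1 if num_steps > 0 else 1  # ceil(sqrt(T))
    checkpoints = {0: h0}
    h = h0
    for t, x in enumerate(inputs, start=1):
        h = step(h, x)
        if t % stride == 0:
            checkpoints[t] = h
    return checkpoints, stride


def recompute_state(step, checkpoints, stride, inputs, t):
    """Recover h_t by replaying forward from the nearest earlier checkpoint."""
    start = (t // stride) * stride
    h = checkpoints[start]
    for x in inputs[start:t]:
        h = step(h, x)
    return h


# Toy integer transition standing in for an RNN cell.
toy_step = lambda h, x: (3 * h + x) % 1009
xs = list(range(17))
ckpts, stride = run_with_checkpoints(toy_step, 1, xs)
assert stride == 5 and sorted(ckpts) == [0, 5, 10, 15]
```

Memory falls to roughly the square root of the sequence length, at the price of recomputing up to `stride - 1` steps per recovered state; the reversible approach reconstructs states exactly instead of replaying them.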
Decoupled neural interfaces [34, 35] use auxiliary neural networks trained to produce the gradient of
a layer’s weight matrix given the layer’s activations as input, then use these predictions to train, rather
than the true gradient. This strategy depends on the quality of the gradient approximation produced
by the auxiliary network. Hidden activations must still be stored as in the usual backpropagation
algorithm to train the auxiliary networks, unlike our method.
Unitary recurrent neural networks [36, 37, 38] refine vanilla RNNs by parametrizing their transition
matrix to be unitary. These networks are reversible in exact arithmetic [36]: the conjugate transpose
of the transition matrix is its inverse, so the hidden-to-hidden transition is reversible. In practice, this
method would run into numerical precision issues as floating point errors accumulate over timesteps.
Our method, through storage of lost information, avoids these issues.

8 Conclusion
We have introduced reversible recurrent neural networks as a method to reduce the memory require-
ments of truncated backpropagation through time. We demonstrated the flaws of exactly reversible
RNNs, and developed methods to efficiently store information lost during the hidden-to-hidden
transition, allowing us to reverse the transition during backpropagation. Reversible models can
achieve roughly equivalent performance to standard models while reducing the memory requirements
by a factor of 5–15 during training. We believe reversible models offer a compelling path towards
constructing more flexible and expressive recurrent neural networks.

Acknowledgments
We thank Kyunghyun Cho for experimental advice and discussion. We also thank Aidan Gomez,
Mengye Ren, Gennady Pekhimenko, and David Duvenaud for helpful discussion. MM is supported
by an NSERC CGS-M award, and PV is supported by an NSERC PGS-D award.

References
[1] Alex Graves, Abdel-Rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep
Recurrent Neural Networks. In International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6645–6649. IEEE, 2013.
[2] Gábor Melis, Chris Dyer, and Phil Blunsom. On the State of the Art of Evaluation in Neural
Language Models. arXiv:1707.05589, 2017.
[3] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM
Language Models. arXiv:1708.02182, 2017.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by
Jointly Learning to Align and Translate. arXiv:1409.0473, 2014.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s Neural
Machine Translation System: Bridging the Gap between Human and Machine Translation.
arXiv:1609.08144, 2016.
[6] Paul J Werbos. Backpropagation through Time: What It Does and How to Do It. Proceedings
of the IEEE, 78(10):1550–1560, 1990.
[7] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning Representations by
Back-propagating Errors. Nature, 323(6088):533, 1986.
[8] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to Construct
Deep Recurrent Neural Networks. arXiv:1312.6026, 2013.
[9] Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra
Birch. Deep Architectures for Neural Machine Translation. arXiv:1707.07631, 2017.
[10] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent
Highway Networks. arXiv:1607.03474, 2016.
[11] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-
Decoder for Statistical Machine Translation. arXiv:1406.1078, 2014.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8):1735–1780, 1997.
[13] Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Gradient-based Hyperparameter
Optimization through Reversible Learning. In Proceedings of the 32nd International Conference
on Machine Learning (ICML), July 2015.
[14] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[15] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture
Models. arXiv:1609.07843, 2016.
[16] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. Multi30K: Multilingual
English-German Image Descriptions. arXiv:1605.00459, 2016.
[17] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. The
IWSLT 2016 Evaluation Campaign. Proceedings of the 13th International Workshop on Spoken
Language Translation (IWSLT), 2016.
[18] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked Autoregressive Flow for
Density Estimation. In Advances in Neural Information Processing Systems (NIPS), pages
2335–2344, 2017.
[19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation using Real NVP.
arXiv:1605.08803, 2016.

[20] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural
Information Processing Systems (NIPS), pages 4743–4751, 2016.
[21] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The Reversible Residual
Network: Backpropagation Without Storing Activations. In Advances in Neural Information
Processing Systems (NIPS), pages 2211–2221, 2017.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for
Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 770–778, 2016.
[23] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction
with LSTM. 1999.
[24] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber.
LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems,
28(10):2222–2232, 2017.
[25] Wojciech Zaremba and Ilya Sutskever. Learning to Execute. arXiv:1410.4615, 2014.
[26] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning
with Limited Numerical Precision. In International Conference on Machine Learning (ICML),
pages 1737–1746, 2015.
[27] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training Deep Neural Networks
with Low Precision Multiplications. arXiv:1412.7024, 2014.
[28] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective Approaches to
Attention-Based Neural Machine Translation. arXiv:1508.04025, 2015.
[29] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-Scale
Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.
[30] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau,
Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky,
et al. Theano: A Python Framework for Fast Computation of Mathematical Expressions.
arXiv:1605.02688, 2016.
[31] James Martens and Ilya Sutskever. Training Deep and Recurrent Networks with Hessian-Free
Optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.
[32] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear
Memory Cost. arXiv:1604.06174, 2016.
[33] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-
Efficient Backpropagation through Time. In Advances in Neural Information Processing Systems
(NIPS), pages 4125–4133, 2016.
[34] Max Jaderberg, Wojciech M Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David
Silver, and Koray Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. In
International Conference on Machine Learning (ICML), 2017.

[35] Wojciech Marian Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals,
and Koray Kavukcuoglu. Understanding Synthetic Gradients and Decoupled Neural Interfaces.
arXiv:1703.00522, 2017.
[36] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural
Networks. In International Conference on Machine Learning (ICML), pages 1120–1128, 2016.
[37] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-Capacity
Unitary Recurrent Neural Networks. In Advances in Neural Information Processing Systems
(NIPS), pages 4880–4888, 2016.

[38] Li Jing, Yichen Shen, Tena Dubček, John Peurifoy, Scott Skirlo, Max Tegmark, and Marin
Soljačić. Tunable Efficient Unitary Neural Networks (EUNN) and their Application to RNN.
arXiv:1612.05231, 2016.
[39] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
arXiv:1412.6980, 2014.
[40] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients
Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432, 2013.
[41] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax.
arXiv:1611.01144, 2016.
[42] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous
Relaxation of Discrete Random Variables. arXiv:1611.00712, 2016.
[43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in
PyTorch. 2017.
[44] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. OpenNMT:
Open-Source Toolkit for Neural Machine Translation. arXiv:1701.02810, 2017.
[45] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of Neural
Networks using DropConnect. In International Conference on Machine Learning (ICML),
pages 1058–1066, 2013.
[46] Yarin Gal and Zoubin Ghahramani. A Theoretically Grounded Application of Dropout in
Recurrent Neural Networks. In Advances in Neural Information Processing Systems (NIPS),
pages 1019–1027, 2016.

Appendix
Here, we provide additional details about our models and results. This appendix is structured as
follows:

• We discuss no-forgetting failures in Sec. A.
• We present results for our toy memorization experiment in Sec. B.
• We provide details on reversible multiplication in Sec. C.
• We discuss discrete forgetting in Sec. D.
• We discuss reversibility with dropout in Sec. E.
• We provide details about the attention mechanism we use in Sec. F.
• We provide details on our language modeling (LM) and neural machine translation (NMT)
experiments in Sec. G.
• We plot the memory savings during training for many configurations of our RevGRU and
RevLSTM models on LM and NMT in Sec. H.
• We provide training and validation curves for each model on the Penn TreeBank and
WikiText-2 language modeling tasks, and on the Multi30K and IWSLT 2016 NMT tasks in
Sec. I.

A No-Forgetting Failures
We tried training NF-RevGRU models on the Penn TreeBank dataset. Without regularization, the
training loss (not perplexity) of NF models blows up and remains above 100. This is because the
norm of the hidden state grows very quickly. We tried many techniques to remedy this, including: 1)
penalizing the hidden state norm; 2) using different optimizers; 3) using layer normalization; and 4)
using better initialization. The best-performing model we found reached 110 train perplexity on PTB
without any regularization; in contrast, even heavily regularized baseline models can reach 50 train
perplexity.

B Toy Task Experiment


We trained an LSTM on the memorization task and an NF-RevGRU on the repeat task on sequences
of length 20, 35, and 50. To vary the complexity of the tasks, we experimented with hidden state sizes
of 8, 16 and 32. We trained on randomly generated synthetic sequences consisting of 8 possible input
tokens. To evaluate performance, we generated an evaluation batch of 10,000 randomly generated
sequences and report the average number of tokens correctly predicted over all sequences in this
batch. To ensure exact reversibility of the NF-RevGRU, we used a fixed point representation of the
hidden state, while activations were computed in floating point.
Each token was input to the model as a one-hot vector. For the remember task, we appended another
category to these one-hot vectors indicating whether the end of the input sequence has occurred.
This category was set to 0 before the input sequence terminated and was 1 afterwards. Models were
trained by a standard cross-entropy loss objective.
We used the Adam optimizer [39] with learning rate 0.001. We found that a large batch size of
20,000 was needed to achieve the best performance. We noticed that performance continued to
improve, albeit slowly, over long periods of time, so we trained our models for 1 million batches. We
report the maximum number of tokens predicted correctly over the course of training, as there are
slight fluctuations in evaluation performance during training.
We found a surprisingly large difference in performance between the two tasks, as shown in Table 4.
In particular, the NF-RevGRU was able to correctly predict more tokens than expected, indicating
that it was able to store a surprising amount of information in its hidden state. We suspect that
the NF-RevGRU learns how to compress information more easily than an LSTM. The function
NF-RevGRU must learn for the repeat task is inherently local, in contrast to the function the LSTM
must learn for the remember task, which has long term dependencies.

Algorithm 2 Exactly reversible multiplication (Maclaurin et al. [13])
1: Input: Buffer integer $B$, hidden state $h = 2^{-R_H} h^*$, forget value $z = 2^{-R_Z} z^*$ with $0 < z^* < 2^{R_Z}$
2: $B \leftarrow B \times 2^{R_Z}$ {make room for new information on buffer}
3: $B \leftarrow B + (h^* \bmod 2^{R_Z})$ {store lost information in buffer}
4: $h^* \leftarrow h^* \div 2^{R_Z}$ {divide by denominator of $z$}
5: $h^* \leftarrow h^* \times z^*$ {multiply by numerator of $z$}
6: $h^* \leftarrow h^* + (B \bmod z^*)$ {add information to hidden state}
7: $B \leftarrow B \div z^*$ {shorten information buffer}
8: return updated buffer $B$, updated value $h = 2^{-R_H} h^*$

Table 4: Number of correct predictions made by an exactly reversible model, which cannot forget, on the repeat
task and a traditional model, which can forget, on the memorization task. We expect these models to achieve
equivalent performance given the same hidden state size and sequence length. With random guessing, both
models would be expected to correctly predict Sequence Length/8 tokens. We also include the number of bits
stored per hidden unit, after subtracting out chance accuracy.

                                Repeat (NF-RevGRU)             Memorization (LSTM)
Hidden Units  Sequence Length   Tokens predicted  Bits/unit    Tokens predicted  Bits/unit
8             20                7.9               2.0          7.4               1.8
8             35                13.1              3.3          9.7               2.0
8             50                18.6              4.6          13.0              2.5
16            20                19.9              3.3          13.7              2.1
16            35                25.4              3.9          14.3              1.9
16            50                27.3              3.9          17.2              2.1
32            20                20.0              2.6          20.0              2.6
32            35                35.0              5.6          20.6              2.4
32            50                47.9              6.2          21.5              2.3

C Reversible Multiplication
C.1 Review of Algorithm of Maclaurin et al. [13]

We restate the algorithm of Maclaurin et al. [13] above for convenience. Recall the goal is to multiply
$h = 2^{-R_H} h^*$ by $z = 2^{-R_Z} z^*$, storing as few bits as necessary to make this operation reversible.
This multiplication is accomplished by first dividing $h^*$ by $2^{R_Z}$ then multiplying by $z^*$.
First, observe that integer division of $h^*$ by $2^{R_Z}$ can be made reversible through knowledge of
$h^* \bmod 2^{R_Z}$:
$h^* = (h^* \div 2^{R_Z}) \times 2^{R_Z} + (h^* \bmod 2^{R_Z})$ (18)
Thus, the remainders at each timestep must be stored in order to ensure reversibility. The remainders
could be stored as separate integers, but this would entail 32 bits of storage at each timestep. Instead,
the remainders are stored in a single integer information buffer $B$, which is assumed to dynamically
resize upon overflow. At each timestep, the buffer's size must be enlarged by $R_Z$ bits to make room:
$B \leftarrow B \times 2^{R_Z}$ (19)
Then a new remainder can be added to the buffer:
$B \leftarrow B + (h^* \bmod 2^{R_Z})$ (20)
The storage cost has been reduced from 32 bits to $R_Z$ bits per timestep, but even further savings can be
realized. Upon multiplying $h^*$ by $z^*$, there is an opportunity to add an integer $e \in \{0, 1, \ldots, z^* - 1\}$
to $h^*$ without affecting the reverse process (integer division by $z^*$):
$h^* = (h^* \times z^* + e) \div z^*$ (21)
Maclaurin et al. [13] took advantage of this and moved information from the buffer $B$ to $h^*$ by adding
$B \bmod z^*$ to $h^*$. This allows division of $B$ by $z^*$, since this division can be reversed by knowledge of
the modulus $B \bmod z^*$, which can be recovered from $h^*$ in the reverse process:
$h^* \leftarrow h^* + (B \bmod z^*)$ (22)
$B \leftarrow B \div z^*$ (23)
We give the complete reversal algorithm as Algorithm 3.

Algorithm 3 Reverse process of Maclaurin et al. [13]'s Algorithm
1: Input: Updated buffer integer $B$, updated hidden state $h = 2^{-R_H} h^*$, forget value $z = 2^{-R_Z} z^*$ with $0 < z^* < 2^{R_Z}$
2: $B \leftarrow B \times z^*$
3: $B \leftarrow B + (h^* \bmod z^*)$
4: $h^* \leftarrow h^* \div z^*$
5: $h^* \leftarrow h^* \times 2^{R_Z}$
6: $h^* \leftarrow h^* + (B \bmod 2^{R_Z})$
7: $B \leftarrow B \div 2^{R_Z}$
8: return Original buffer $B$, original hidden state $h = 2^{-R_H} h^*$
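The forward and reverse procedures can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation: the function names are hypothetical, the functions work directly on the integer numerators $h^*$ and $z^*$, and Python's arbitrary-precision integers stand in for the dynamically resizing buffer $B$.

```python
def forward_multiply(h_star, z_star, buf, r_z):
    """Multiply h* by z = z* / 2**r_z, saving the lost bits in buf."""
    assert 0 < z_star < 2 ** r_z
    base = 2 ** r_z
    buf = buf * base                   # make room for r_z new bits
    buf = buf + (h_star % base)        # store the remainder of the division
    h_star = h_star // base            # divide by the denominator of z
    h_star = h_star * z_star           # multiply by the numerator of z
    h_star = h_star + (buf % z_star)   # move information from buffer to h*
    buf = buf // z_star                # shorten the buffer
    return h_star, buf


def reverse_multiply(h_star, z_star, buf, r_z):
    """Undo forward_multiply, recovering the original h* and buffer."""
    base = 2 ** r_z
    buf = buf * z_star                 # re-extend the buffer
    buf = buf + (h_star % z_star)      # recover B mod z* from h*
    h_star = h_star // z_star          # undo the multiplication by z*
    h_star = h_star * base             # undo the division by 2**r_z
    h_star = h_star + (buf % base)     # restore the stored remainder
    buf = buf // base                  # drop the restored bits
    return h_star, buf
```

Because Python's `//` and `%` round toward negative infinity, the identity in Equation 18 also holds for negative $h^*$ in this sketch; a fixed-width integer implementation would need to handle that case explicitly.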

C.2 Noise in Buffer Computations

Suppose we have $R_H = R_Z = 4$, $h^* = 16$, $z^* = 17$ and $B = 1$. We hope to compute the new value
for $h$ of $h = \frac{h^*}{2^{R_H}} \times \frac{z^*}{2^{R_Z}} = \frac{17}{16} = 1.0625$. Executing Algorithm 1 we have:

$B \leftarrow B \times 2^{R_Z} = 16$
$B \leftarrow B + (h^* \bmod 2^{R_Z}) = 16$
$h^* \leftarrow h^* \div 2^{R_Z} = 1$
$h^* \leftarrow h^* \times z^* = 17$
$h^* \leftarrow h^* + (B \bmod 17) = 33$
$B \leftarrow B \div z^* = 0$

At the conclusion of the algorithm, we have that $h = \frac{h^*}{2^{R_Z}} = \frac{33}{16} = 2.0625$. The addition of
information from the buffer onto the hidden state has altered it from its intended value.
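The arithmetic in this example can be checked with a short script. The sketch below simply replays the six steps with exact rational arithmetic; the variable names are illustrative and not taken from the original code.

```python
from fractions import Fraction

R_H = R_Z = 4
h_star, z_star, buf = 16, 17, 1

# Value we would like to obtain: (h* / 2**R_H) * (z* / 2**R_Z)
intended = Fraction(h_star, 2 ** R_H) * Fraction(z_star, 2 ** R_Z)

buf = buf * 2 ** R_Z + (h_star % 2 ** R_Z)   # buffer grows, stores remainder
h_star = (h_star // 2 ** R_Z) * z_star       # divide, then multiply
h_star = h_star + (buf % z_star)             # buffer information added to h*
buf = buf // z_star                          # buffer shortened

actual = Fraction(h_star, 2 ** R_H)
noise = actual - intended                    # error introduced by the buffer
```

Here `noise` is the gap between the value actually stored in the hidden state and the intended product.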

C.3 Vectorized reversible multiplication

We let N denote the current minibatch size. Algorithm 4 shows the vectorized reversible multiplica-
tion.

Algorithm 4 Exactly reversible multiplication with overflow
1: Input: Hidden state $h = 2^{-R_H} h^*$ with dimensions $(N, H)$; forget value $z = 2^{-R_Z} z^*$ with $0 < z^* < 2^{R_Z}$
and dimensions $(N, H)$; current buffer $B$, an integer tensor with dimensions $(N, H)$; past buffers $B_{past}$,
an integer tensor with dimensions $(N, H, D)$
2: if any entry of $B$ is $\geq 2^{64 - R_Z}$ then
3: $B_{past} \leftarrow [B_{past}, B]$ {Append $B$ to end of $B_{past}$}
4: $B \leftarrow$ tensor of zeroes with dimensions $(N, H)$ {Initialize new buffer}
5: end if
6: Execute Algorithm 1 unchanged
7: return updated buffer $B$, updated past buffers $B_{past}$, updated value $h$
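The overflow check of Algorithm 4 can be sketched as follows. This is an illustration under simplifying assumptions: plain Python lists replace the $(N, H)$ integer tensors, a loop replaces the vectorized GPU operations, and the function name and `word_bits` parameter are hypothetical.

```python
def multiply_with_overflow(h, z, buf, past, r_z, word_bits=64):
    """One forward step over lists of numerators, with buffer rollover.

    h, z, buf: lists of ints, one entry per hidden unit.
    past: list of previously filled buffers.
    """
    limit = 2 ** (word_bits - r_z)
    if any(b >= limit for b in buf):
        past = past + [buf]            # append the full buffer to the history
        buf = [0] * len(buf)           # start a fresh buffer of zeroes
    base = 2 ** r_z
    new_h, new_buf = [], []
    for h_i, z_i, b_i in zip(h, z, buf):
        b_i = b_i * base + (h_i % base)    # make room, store remainder
        h_i = (h_i // base) * z_i          # divide, then multiply
        h_i = h_i + (b_i % z_i)            # move buffer information into h*
        b_i = b_i // z_i                   # shorten the buffer
        new_h.append(h_i)
        new_buf.append(b_i)
    return new_h, new_buf, past
```

Checking the threshold before the multiplication by $2^{R_Z}$ keeps every buffer entry below $2^{64}$ after the step, which is why the limit is $2^{64 - R_Z}$ rather than $2^{64}$.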

D Discrete Forgetting
D.1 Description

Here, we consider forgetting a discrete number of bits at each timestep. This is much easier to
implement than fractional forgetting, and it is interesting to explore whether fractional forgetting is
necessary or if discrete forgetting will suffice.

             One layer                                  Two layers
Model        1 bit  2 bits  3 bits  5 bits  No limit    1 bit  2 bits  3 bits  5 bits  No limit
GRU          -      -       -       -       82.2        -      -       -       -       81.5
DF-RevGRU    93.6   94.1    93.9    94.7    -           93.5   92.0    93.1    94.3    -
FF-RevGRU    86.0   82.2    81.1    81.1    81.5        87.0   83.8    83.8    82.2    82.3
LSTM         -      -       -       -       78.0        -      -       -       -       73.0
DF-RevLSTM   85.4   85.1    86.1    86.8    -           78.1   78.3    79.1    78.6    -
FF-RevLSTM   -      79.8    79.4    78.4    78.2        -      74.7    72.8    72.9    72.9

Table 5: Validation perplexities on Penn TreeBank word-level language modeling. Test perplexities exhibit a
similar pattern but are 3–5 perplexity points lower. DF denotes discrete forgetting and FF denotes fractional
forgetting. We show perplexities when forgetting is restricted to 1, 2, 3, and 5 bits per hidden unit and when
there is no limit placed on the amount forgotten.

Recall the RevGRU updates proposed in Equations 4 and 5. If all entries of $z_i$ are non-positive
powers of 2, then multiplication by $z_i$ corresponds exactly to a right-shift of the bits of $h_i$ (see footnote 11). The
shifted off bits can be stored in a stack, to be popped off and restored in the reverse computation. We
enforce this condition by changing the equation computing $z_i$. We first choose the largest negative
power of 2 that $z_i^{(t)}$ could possibly represent, say $F$. $z_1^{(t)}$ is computed using (see footnote 12):
$s_1^{(t)}[i, j] = \mathrm{ReLU}(Q[x^{(t)}, h_2^{(t-1)}])[Hi + j]$ for $1 \leq i \leq H$, $1 \leq j \leq F$
$o_1^{(t)} = \mathrm{Softmax}(\mathrm{SampleOneHot}(s_1^{(t)}))$, $z_1^{(t)} = [1, 0.5, 0.25, \ldots, 2^{-F}] \cdot o_1^{(t)}$ (24)
The equations to calculate $z_2^{(t)}$ are analogous. We use similar equations to compute $f_i^{(t)}$, $p_i^{(t)}$ for
the RevLSTM. To train these models, we must use techniques to estimate gradients of functions of
discrete random variables. We used both the Straight-Through Categorical estimator [40] and the
Straight-Through Gumbel-Softmax estimator [41, 42]. In both these estimators, the forward pass is
discretized but gradients during backpropagation are computed as if a continuous sample were used.
The memory savings this represents over traditional models depend on the maximum number of
bits $F$ allowed to be forgotten. Instead of storing 32 bits per hidden unit per timestep, we
need only store at most $F$ bits. We do so by using a list of integers $B = (B_1, B_2, \ldots, B_D)$ as an
information buffer. To store $n$ bits in $B$, we shift the bits of each $B_i$ left by $n$, then add the $n$ bits to
be stored onto $B_1$. We move the bits shifted off of $B_i$ onto $B_{i+1}$ for $i \in \{1, \ldots, D - 1\}$. If stored
bits are shifted off of $B_D$, we must append another integer to $B$. In practice, we store $F$ bits for each
hidden unit regardless of its corresponding forget value. This stores some extraneous bits but is much
easier to implement when vectorizing over the hidden unit dimension and the batch dimension on the
GPU, as is required for computational efficiency.
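A minimal sketch of discrete forgetting on a single integer hidden unit is shown below. It rests on stated assumptions: the function names are hypothetical, a single arbitrary-precision Python integer stands in for the list $(B_1, \ldots, B_D)$ of fixed-width integers, and $F$ bits are stored per step regardless of how many are actually forgotten, as described above.

```python
def forget_bits(h_star, n_bits, buf, max_bits):
    """Right-shift h_star by n_bits and push the lost bits onto buf."""
    assert 0 <= n_bits <= max_bits
    lost = h_star & ((1 << n_bits) - 1)   # bits about to be shifted off
    buf = (buf << max_bits) | lost        # always reserve max_bits per step
    return h_star >> n_bits, buf


def restore_bits(h_star, n_bits, buf, max_bits):
    """Pop the most recent bits from buf and undo the right-shift."""
    lost = buf & ((1 << max_bits) - 1)    # most recently stored bits
    buf = buf >> max_bits                 # drop them from the buffer
    return (h_star << n_bits) | lost, buf
```

Python's `>>` is an arithmetic shift on signed integers, so negative values round-trip without the extra sign-extension step a fixed-width two's complement implementation would need.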

D.2 Experiments

For discrete forgetting, we found the Straight-Through Gumbel-Softmax gradient estimator to
consistently achieve results 2–3 perplexity better than the Straight-Through Categorical estimator.
Hence, all discrete forgetting models whose results are reported were trained using the Straight-
Through Gumbel-Softmax estimator.

Discrete vs. Fractional Forgetting. We show complete results on Penn TreeBank validation
perplexity in Table 5. Overall, models which use discrete forgetting performed 4–10 perplexity points
worse on the validation set than their fractional forgetting counterparts. It could be the case that
the stochasticity of the samples used in discrete forgetting models already imposes a regularizing
effect, causing discrete models to be too heavily regularized. To check, we also ran experiments
using lower dropout rates and found that discrete forgetting models still lagged behind their fractional
counterparts. We conclude that information must be discarded from the hidden state in fine, not
coarse, quantities.

11 When $h_i$ is negative, we must perform an additional step of appending ones to the bit representation of $h_i$
due to using two's complement representation.
12 Note that the Softmax is computed over rows, so the first dimension of the matrix $Q$ must be $FH$.

E Discussion of Dropout
First, consider dropping out elements of the input. If the same elements are dropped out at each step,
we simply store the single mask used, then apply it to the input at each step of our forward and
reverse computations.
Applying dropout to the hidden state does not entail information loss (and hence additional storage),
since we can interpret dropout as masking out elements of the input/hidden-to-hidden matrices. If the
same dropout masks are used at each timestep, as is commonly done in RNNs, we store the single
weight mask used, then use the dropped-out matrix in the forward and reverse passes. If the same
rows of these matrices are dropped out (as in variational dropout), we need only store a mask the
same size as the hidden state.
If we wish to sample different dropout masks at each timestep, which is not commonly done in RNNs,
we would either need to store the mask used at each timestep, which is memory intensive, or devise
a way to recover the sampled mask in the reverse computation (e.g., using a reversible sampler, or
using a deterministic function to set the random seed at each step).
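The last option, deriving the random seed deterministically from the timestep, can be sketched as follows. This is an illustration only: the function name, the seed-mixing constant, and the use of Python's `random` module are assumptions, not details from the experiments.

```python
import random


def dropout_mask(base_seed, t, size, p):
    """Regenerate the dropout mask for timestep t from (base_seed, t).

    Nothing is stored per timestep: the same call made during the reverse
    computation yields the same mask as in the forward computation.
    """
    rng = random.Random(base_seed * 1_000_003 + t)   # hypothetical seed mixing
    return [0 if rng.random() < p else 1 for _ in range(size)]


# Forward pass visits t = 0..4; the reverse pass visits t = 4..0.
forward_masks = [dropout_mask(42, t, 8, 0.5) for t in range(5)]
reverse_masks = [dropout_mask(42, t, 8, 0.5) for t in reversed(range(5))]
```

The only state that must be kept is `base_seed`, so memory use is independent of the sequence length.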

F Attention Details
In our NMT experiments, we use the global attention mechanism introduced by Luong et al. [28].
We consider attention performed over a set of source-side annotations $\{s^{(1)}, \ldots, s^{(T)}\}$, which can
be either: 1) the encoder hidden states, $s^{(t)} = h_{\mathrm{enc}}^{(t)}$; 2) the source embeddings, $s^{(t)} = e^{(t)}$; or 3) a
concatenation of the embeddings and $k$-dimensional slices of the hidden states, $s^{(t)} = [e^{(t)}; h_{\mathrm{enc}}^{(t)}[:k]]$.
When using global attention, the model first computes the decoder hidden states $\{h_{\mathrm{dec}}^{(1)}, \ldots, h_{\mathrm{dec}}^{(M)}\}$ as
in the standard encoder-decoder paradigm, and then it modifies each $h_{\mathrm{dec}}^{(t)}$ by incorporating context
from the source annotations. A context vector $c^{(t)}$ is computed as a weighted sum of the source
annotations:
$c^{(t)} = \sum_{j=1}^{T} \alpha_j^{(t)} s^{(j)}$ (25)
where the weights $\alpha_j^{(t)}$ are computed by scoring the similarity between the “current” decoder hidden
state $h_{\mathrm{dec}}^{(t)}$ and each of the encoder annotations:
$\alpha_j^{(t)} = \frac{\exp(\mathrm{score}(h_{\mathrm{dec}}^{(t)}, s^{(j)}))}{\sum_{k=1}^{T} \exp(\mathrm{score}(h_{\mathrm{dec}}^{(t)}, s^{(k)}))}$ (26)

As the score function, we use the “general” formulation proposed by Luong et al.:
$\mathrm{score}(h_{\mathrm{dec}}^{(t)}, s^{(j)}) = (h_{\mathrm{dec}}^{(t)})^\top W_a s^{(j)}$ (27)

Then, the original decoder hidden state $h_{\mathrm{dec}}^{(t)}$ is modified via the context $c^{(t)}$, to produce an attentional
hidden state $\tilde{h}_{\mathrm{dec}}^{(t)}$:
$\tilde{h}_{\mathrm{dec}}^{(t)} = \tanh(W_c [c^{(t)}; h_{\mathrm{dec}}^{(t)}])$ (28)

Finally, the attentional hidden state $\tilde{h}_{\mathrm{dec}}^{(t)}$ is passed into the softmax layer to produce the output
distribution:
$p(y^{(t)} \mid y^{(1)}, \ldots, y^{(t-1)}, x) = \mathrm{softmax}(W_s \tilde{h}_{\mathrm{dec}}^{(t)})$ (29)
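Equations 25 to 28 can be traced for a single decoder step with the pure-Python sketch below. It is an illustration only, not the OpenNMT implementation: the function names are hypothetical, vectors are plain lists, and the softmax subtracts the maximum score for numerical stability, which leaves the weights of Equation 26 unchanged.

```python
import math


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_attention(h_dec, annotations, w_a, w_c):
    """Return (alphas, context, h_tilde) for one decoder hidden state."""
    # Eq. 27: "general" score, h_dec^T W_a s_j
    scores = [dot(h_dec, matvec(w_a, s)) for s in annotations]
    # Eq. 26: softmax over source positions
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Eq. 25: context vector as a weighted sum of annotations
    dim = len(annotations[0])
    context = [sum(a * s[d] for a, s in zip(alphas, annotations))
               for d in range(dim)]
    # Eq. 28: attentional hidden state from the concatenation [c; h_dec]
    h_tilde = [math.tanh(x) for x in matvec(w_c, context + h_dec)]
    return alphas, context, h_tilde
```

With a decoder state of size $d$ and annotations of size $k$, `w_a` has shape $d \times k$ and `w_c` has shape $d \times (k + d)$ in this sketch. Shortening the annotations, for example by attending over a 20-dimensional slice, changes only $k$.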

G Experiment Details
All experiments were implemented using PyTorch [43]. Neural machine translation experiments were
implemented using OpenNMT [44].

Table 6: Total number of parameters in each model used for LM.

Model             Total number of parameters
1 layer GRU       9.0M
1 layer RevGRU    8.4M
1 layer LSTM      9.9M
1 layer RevLSTM   9.7M
2 layer GRU       16.2M
2 layer RevGRU    13.6M
2 layer LSTM      19.5M
2 layer RevLSTM   18.4M

G.1 Language Modeling Experiments

We largely followed Merity et al. [3] in setting hyperparameters. All one-layer models used 650
hidden units and all two-layer models used 1150 hidden units in their first layer and 650 in their
second. We kept our embedding size constant at 650 through all experiments.
Notice that with a fixed hidden state size, a reversible architecture will have fewer parameters than
a standard architecture. If the total number of hidden units is $H$, the number of hidden-to-hidden
parameters is $2 \times (H/2)^2 = H^2/2$ in a reversible model, compared to $H^2$ for its non-reversible
counterpart. For the RevLSTM, there are extra hidden-to-hidden parameters due to the $p$ gate needed
for reversibility. Each model also has additional parameters associated with the input-to-hidden
connections and embedding matrix.
We show the total number of parameters in each model, including embeddings, in Table 6.
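As a worked instance of the $H^2/2$ versus $H^2$ comparison, the following sketch counts the entries of one hidden-to-hidden weight matrix. The function name is hypothetical, and the count deliberately ignores gates, biases, input-to-hidden weights, and embeddings, so it does not reproduce the totals in Table 6.

```python
def hidden_to_hidden_params(total_hidden, reversible):
    """Entries in one hidden-to-hidden matrix for H total hidden units."""
    if reversible:
        half = total_hidden // 2
        return 2 * half * half        # two (H/2) x (H/2) blocks
    return total_hidden * total_hidden


standard = hidden_to_hidden_params(650, reversible=False)
reversible = hidden_to_hidden_params(650, reversible=True)
```

For the 650-unit one-layer setting this gives 422,500 entries for the standard matrix and 211,250 for the reversible one.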
We used DropConnect [45] with probability 0.5 to regularize all hidden-to-hidden matrices. We
applied variational dropout [46] on the inputs and outputs of the RNNs. The inputs to the first layer
were dropped out with probability 0.3. The outputs of each layer were dropped out with probability
0.4. As in Gal and Ghahramani [46], we used embedding dropout with probability 0.1. We also
applied weight decay with scalar factor $1.2 \times 10^{-6}$.
We used a learning rate of 20 for all models, clipping the norm of the gradients to be smaller than 0.1.
We decayed the learning rate by a factor of 4 once the nonmonotonic criterion introduced by Merity
et al. [3] was triggered and used the same non-monotone interval of 5 epochs. For discrete forgetting
models, we found that a learning rate decay factor of 2 worked better. Training was stopped once the
learning rate fell below $10^{-2}$.
Like Merity et al. [3], we used variable length backpropagation sequences. The base sequence length
was set to 70 with probability 0.95 and set to 35 otherwise. The actual sequence length used was
then computed by adding random noise from $\mathcal{N}(0, 5)$ to the base sequence length. We rescaled the
learning rate linearly based on the length of the truncated sequences, so for a given minibatch of
length $T$, the learning rate used was $20 \times \frac{T}{70}$.
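The sequence-length sampling and learning-rate rescaling can be sketched as below. Two details are assumptions rather than statements from the text: the 5 in $\mathcal{N}(0, 5)$ is treated as a standard deviation, and a minimum length (`min_len`) is imposed so that a sampled length is never zero or negative. The function names are hypothetical.

```python
import random


def sample_sequence_length(rng, base_long=70, base_short=35,
                           p_long=0.95, std=5.0, min_len=5):
    """Sample a backpropagation sequence length for one minibatch."""
    base = base_long if rng.random() < p_long else base_short
    noisy = int(rng.gauss(base, std))    # base length plus Gaussian noise
    return max(min_len, noisy)           # min_len floor is an assumption


def rescaled_learning_rate(seq_len, base_lr=20.0, base_len=70):
    """Scale the learning rate linearly with the sequence length."""
    return base_lr * seq_len / base_len


rng = random.Random(0)
lengths = [sample_sequence_length(rng) for _ in range(1000)]
```

A minibatch of length 35 therefore trains with half the base learning rate, since `rescaled_learning_rate(35)` is 10.0.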

G.2 Neural Machine Translation Experiments


Multi30K Experiments. The Multi30K dataset [16] contains English-German sentence pairs
derived from captions of Flickr images, and consists of 29,000 training, 1,015 validation, and 1,000
test sentence pairs. The average length of the source (English) sequences is 13 tokens, and the average
length of the target (German) sequences is 12.4 tokens.
We applied variational dropout with probability 0.4 to inputs and outputs. We trained on mini-
batches of size 64 using SGD. The learning rate was initialized to 0.2 for GRU and RevGRU, 0.5 for
RevLSTM, and 1 for the standard LSTM; these values were chosen to optimize the performance
of each model. The learning rate was decayed by a factor of 2 each epoch when the validation loss
failed to improve from the previous epoch. Training halted when the learning rate dropped below
0.001. Table 7 shows the validation BLEU scores of each RevGRU and RevLSTM variant.
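The decay and stopping rule described above can be sketched as a small scheduling function. The function name and the list-of-losses interface are hypothetical; "failed to improve" is interpreted here as the validation loss being greater than or equal to that of the previous epoch.

```python
def learning_rate_schedule(val_losses, lr, decay=2.0, min_lr=0.001):
    """Return the learning rate used in each epoch until training halts."""
    used = []
    previous = None
    for loss in val_losses:
        used.append(lr)
        if previous is not None and loss >= previous:
            lr /= decay                  # no improvement: halve the rate
        previous = loss
        if lr < min_lr:
            break                        # halt once the rate is too small
    return used


rates = learning_rate_schedule([3.0, 2.5, 2.6, 2.4, 2.4], lr=0.2)
```

In this example the rate stays at 0.2 for three epochs and drops to 0.1 after the validation loss rises from 2.5 to 2.6.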

Table 7: BLEU scores on the Multi30K validation set. For the attention type, Emb denotes word embeddings, xH
denotes an x-dimensional slice of the hidden state (300H corresponds to the whole hidden state), and Emb+xH
denotes the concatenation of the two.

Model     Attention   1 bit   2 bit   3 bit   5 bit   No Limit
RevLSTM   20H         28.51   29.72   30.65   29.82   29.11
RevLSTM   100H        28.10   35.52   36.13   34.97   35.14
RevLSTM   300H        26.46   36.73   37.04   37.32   37.27
RevLSTM   Emb         31.27   30.96   31.41   31.31   31.95
RevLSTM   Emb+20H     36.33   36.75   37.54   36.89   36.51
RevGRU    20H         25.96   25.86   27.25   27.13   26.96
RevGRU    100H        32.52   32.86   31.08   31.16   31.87
RevGRU    300H        34.26   34.00   33.02   33.08   32.24
RevGRU    Emb         27.57   27.59   28.03   27.24   28.07
RevGRU    Emb+20H     33.67   34.94   34.36   34.87   35.12

IWSLT-2016 Experiments. For both the encoder and decoder we used unidirectional, two-layer
RNNs with 600-dimensional hidden states and 600-dimensional word embeddings. We applied
variational dropout with probability 0.4 to the inputs and the output of each layer. The learning rates
were initialized to 0.2 for the GRU, RevGRU, and RevLSTM, and 1 for the LSTM. We used the same
learning rate decay and stopping criterion as for the Multi30K experiments.
The RevGRU with attention over the concatenation of embeddings and a 60-dimensional slice of the
hidden state and 5 bit forgetting achieved a BLEU score of 23.65 on the IWSLT validation set; the
RevLSTM with the same attention and forgetting configuration achieved a validation BLEU score
of 26.17. The baseline GRU achieved a validation BLEU score of 18.92, while the baseline LSTM
achieved 26.31.

H Memory Savings

H.1 Language modeling

1 layer RevGRU on Penn TreeBank

Ratio of memory used by storing discarded information in a buffer and using reversibility vs. storing all
activations naïvely. Left: Actual savings obtained by our method. Right: Idealized savings obtained by using a
perfect buffer.

[Figure: two panels, each plotting Memory Ratio against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]

2 layer RevGRU on Penn TreeBank

Ratio of memory used by storing discarded information in a buffer and using reversibility vs. storing all
activations naïvely. Left: Actual savings obtained by our method. Right: Idealized savings obtained by using a
perfect buffer.

[Figure: two panels, each plotting Memory Ratio against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]

1 layer RevLSTM on Penn TreeBank

Ratio of memory used by storing discarded information in a buffer and using reversibility vs. storing all
activations naïvely. Left: Actual savings obtained by our method. Right: Idealized savings obtained by using a
perfect buffer.

[Figure: two panels, each plotting Memory Ratio against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, and no limit on forgetting.]

2 layer RevLSTM on Penn TreeBank

Ratio of memory used by storing discarded information in a buffer and using reversibility vs. storing all
activations naïvely. Left: Actual savings obtained by our method. Right: Idealized savings obtained by using a
perfect buffer.

[Figure: two panels, each plotting the memory ratio (y-axis labeled "Perplexity" in the source figure) against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, and no limit on forgetting.]

H.2 Neural Machine Translation

In this section, we show the memory savings achieved by the encoder and decoder of our reversible
NMT models. The memory savings refer to the ratio of the amount of memory needed to store
discarded information in a buffer for reversibility, compared to storing all activations. Table 8 shows
the memory savings in the decoder for various RevGRU and RevLSTM models on Multi30K.

Table 8: Average memory savings in the decoder for NMT on the Multi30K dataset, during training. For
the attention type, Emb denotes word embeddings, xH denotes an x-dimensional slice of the hidden state, and
Emb+xH denotes the concatenation of the two.

Model     Attention   1 bit   2 bit   3 bit   5 bit   No Limit
RevLSTM   20H         24.0    13.6    10.7    7.9     6.6
RevLSTM   100H        24.1    13.9    10.1    8.0     5.5
RevLSTM   300H        24.7    13.4    10.7    8.3     6.5
RevLSTM   Emb         24.1    13.5    10.5    8.0     6.7
RevLSTM   Emb+20H     24.4    13.7    11.1    7.8     7.8
RevGRU    20H         24.1    13.5    11.1    8.8     7.9
RevGRU    100H        26.0    14.1    12.2    9.5     8.2
RevGRU    300H        26.1    14.8    13.0    10.0    9.8
RevGRU    Emb         25.9    14.1    12.5    9.8     8.3
RevGRU    Emb+20H     25.5    14.8    12.9    11.2    8.9

In sections H.2.1, H.2.2, and H.2.3, we plot the memory savings during training for RevGRU and
RevLSTM models on Multi30K and IWSLT-2016, using various levels of forgetting. In each plot, we
show the actual memory savings achieved by our method, as well as the idealized savings obtained
by using a perfect buffer.

H.2.1 RevGRU on Multi30K

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 3: RevGRU 20H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 4: RevGRU 100H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 5: RevGRU Emb+20H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

H.2.2 RevLSTM on Multi30K

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 6: RevLSTM 20H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 7: RevLSTM 100H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

[Figure: three panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]

Figure 8: RevLSTM Emb+20H. From left to right: 1 bit, 3 bits, and no limit on forgetting.

H.2.3 RevGRU and RevLSTM on IWSLT-2016

Here, we plot the memory savings achieved by our two-layer models on IWSLT-2016, as well as the
ideal memory savings, for both the encoder and decoder.

[Figure: two panels, each plotting Memory Ratio against Iteration, with curves for Encoder Actual, Decoder Actual, Encoder Optimal, and Decoder Optimal.]
Figure 9: Memory savings on IWSLT. Left: RevGRU. Right: RevLSTM. Both models use attention over the
concatenation of the word embeddings and a 60-dimensional slice of the hidden state.

I Training/Validation Curves

I.1 Penn TreeBank

1 layer RevGRU

Training/validation perplexity for a 1-layer RevGRU on Penn TreeBank with various restrictions on forgetting
and a baseline GRU model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, no limit, and the baseline GRU.]

2 layer RevGRU

Training/validation perplexity for a 2-layer RevGRU on Penn TreeBank with various restrictions on forgetting
and a baseline GRU model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, no limit, and the baseline GRU.]

1 layer RevLSTM

Training/validation perplexity for a 1-layer RevLSTM on Penn TreeBank with various restrictions on forgetting
and a baseline LSTM model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, no limit, and the baseline LSTM.]

2 layer RevLSTM

Training/validation perplexity for a 2-layer RevLSTM on Penn TreeBank with various restrictions on forgetting
and a baseline LSTM model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, no limit, and the baseline LSTM.]

I.2 WikiText-2

1 layer RevGRU

Training/validation perplexity for a 1-layer RevGRU on WikiText-2 with various restrictions on forgetting and a
baseline GRU model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, no limit, and the baseline GRU.]

2 layer RevGRU

Training/validation perplexity for a 2-layer RevGRU on WikiText-2 with various restrictions on forgetting and a
baseline GRU model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 1 bit, 2 bits, 3 bits, 5 bits, no limit, and the baseline GRU.]

1 layer RevLSTM

Training/validation perplexity for a 1-layer RevLSTM on WikiText-2 with various restrictions on forgetting and
a baseline LSTM model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, no limit, and the baseline LSTM.]

2 layer RevLSTM

Training/validation perplexity for a 2-layer RevLSTM on WikiText-2 with various restrictions on forgetting and
a baseline LSTM model. Left: Perplexity on the training set. Right: Perplexity on the validation set.

[Figure: two panels, each plotting Perplexity against Number of Batches (x1000) for 2 bits, 3 bits, 5 bits, no limit, and the baseline LSTM.]

I.3 Multi30K NMT

In this section we show the training and validation curves for the RevLSTM and RevGRU NMT
models with various types of attention (20H, 100H, 300H, Emb, and Emb+20H) and restrictions
on forgetting (1, 2, 3, and 5 bits, and no limit on forgetting). For Multi30K, both the encoder and
decoder are single-layer, unidirectional RNNs with 300 hidden units.

I.3.1 RevGRU

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 10: RevGRU 20H (attention over a 20-dimensional slice of the hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 11: RevGRU 100H (attention over a 100-dimensional slice of the hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 12: RevGRU 300H (attention over the whole hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 13: RevGRU Emb (attention over the input word embeddings).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 14: RevGRU Emb+20H (attention over a concatenation of the word embeddings and a 20-dimensional
slice of the hidden state).

I.3.2 RevLSTM

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 15: RevLSTM 20H (attention over a 20-dimensional slice of the hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 16: RevLSTM 100H (attention over a 100-dimensional slice of the hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 17: RevLSTM 300H (attention over the whole hidden state).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 18: RevLSTM Emb (attention over the input word embeddings).

[Figure: two panels plotting Train Perplexity and Val Perplexity against Number of Batches for 1 bit, 2 bits, 3 bits, 5 bits, and no limit on forgetting.]
Figure 19: RevLSTM Emb+20H (attention over a concatenation of the word embeddings and a 20-dimensional
slice of the hidden state).

I.4 IWSLT 2016

[Figure: two panels, each plotting Perplexity against Epoch, with Train and Val curves.]

Figure 20: Training/validation perplexity for a 2-layer, 600-hidden unit encoder-decoder architecture, with
attention over a 60-dimensional slice of the hidden state, and 5 bit forgetting. Left: RevGRU. Right: RevLSTM.
