To reiterate, the loss function lets us quantify the quality of any particular set of weights W.
The goal of optimization is to find W that minimizes the loss function. We will now motivate
and slowly develop an approach to optimizing the loss function. My goal is to eventually
optimize Neural Networks where we can’t easily use any of the tools developed in the
Convex Optimization literature.
Approach:
Gradient descent is one of the most popular algorithms to perform optimization and by far
the most common way to optimize neural networks. At the same time, every state-of-the-art
Deep Learning library contains implementations of various algorithms to optimize
gradient descent, e.g. lasagne, caffe, and keras. These algorithms, however, are often used
as black-box optimizers, as practical explanations of their strengths and weaknesses are
hard to come by. Here I will go after the intuitions regarding the behaviour of different
algorithms for optimizing gradient descent that will help put them to use. Gradient descent
is a way to minimize an objective function J(θ) parameterized by a model’s parameters
by updating the parameters in the opposite direction of the gradient ∇θJ(θ) of the objective
function w.r.t. the parameters. The learning rate η determines the size of the
steps we take to reach a (local) minimum. In other words, we follow the direction of the
slope of the surface created by the objective function downhill until we reach a valley.
Batch (vanilla) gradient descent computes the gradient of the cost function w.r.t. the
parameters θ for the entire training dataset:
θ = θ − η · ∇θJ(θ)
As we need to calculate the gradients for the whole dataset to perform just one update,
batch gradient descent can be very slow and is intractable for datasets that do not fit in
memory. Batch gradient descent also does not allow us to update our model online, i.e.
with new examples on-the-fly. In code, batch gradient descent looks something like this:
for i in range(no_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
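To make this concrete, here is a runnable sketch of the loop above on a toy least-squares problem. The dataset, the choice of mean squared error as the loss, and the generating parameters [1.0, 2.0] are illustrative assumptions, not from the text:

```python
import numpy as np

# Toy dataset: a bias column plus one feature; targets generated from
# the (assumed) true parameters [1.0, 2.0].
X = np.c_[np.ones(20), np.linspace(-1.0, 1.0, 20)]
y = X @ np.array([1.0, 2.0])

def loss_function(params):
    """Mean squared error J(theta) over the whole dataset."""
    return np.mean((X @ params - y) ** 2)

def evaluate_gradient(params):
    """Analytic gradient of J(theta) w.r.t. theta for the entire dataset."""
    return 2.0 / len(y) * X.T @ (X @ params - y)

params = np.zeros(2)
learning_rate = 0.5
no_epochs = 200

# Batch gradient descent: one update per pass over the full dataset.
for _ in range(no_epochs):
    params_grad = evaluate_gradient(params)
    params = params - learning_rate * params_grad
```

Because this loss is convex, the learned `params` end up close to the generating values [1.0, 2.0], illustrating the convergence guarantee discussed below.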
For a pre-defined number of epochs, we first compute the gradient vector params_grad
of the loss function for the whole dataset w.r.t. our parameter vector params. Note that
state-of-the-art deep learning libraries provide automatic differentiation that efficiently
computes the gradient w.r.t. some parameters. If you derive the gradients yourself, then
gradient checking is a good idea. We then update our parameters in the opposite direction
of the gradients, with the learning rate determining how big of an update we perform. Batch
gradient descent is guaranteed to converge to the global minimum for convex error
surfaces and to a local minimum for non-convex surfaces.
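Gradient checking usually means comparing an analytic gradient against a centered finite-difference estimate; here is a minimal sketch, where the helper name and the example loss are illustrative assumptions:

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Centered differences: grad_i ~ (J(p + eps*e_i) - J(p - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        grad[i] = (loss_fn(p_plus) - loss_fn(p_minus)) / (2 * eps)
    return grad

# Example check: J(theta) = sum(theta^2) has analytic gradient 2*theta.
loss_fn = lambda p: np.sum(p ** 2)
params = np.array([0.3, -1.2, 2.0])
analytic = 2 * params
numeric = numerical_gradient(loss_fn, params)
# A small relative error indicates the analytic gradient is correct.
rel_error = np.abs(analytic - numeric).max() / np.abs(analytic).max()
```

Centered differences are preferred over one-sided ones because their error shrinks quadratically in eps rather than linearly.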
In practice we typically compute the gradient over mini-batches of the training data rather
than the full dataset, combining the efficiency of vectorized code with more frequent
updates. The size of the mini-batch is a hyperparameter, but it is not very common to
cross-validate it. It is usually based on memory constraints (if any), or set to some value,
e.g. 32, 64, or 128. We use powers of 2 in practice because many vectorized operation
implementations work faster when their inputs are sized in powers of 2.
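A mini-batch training loop with a batch size of 32 might be sketched as follows; the toy data, the shuffling helper, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices of batch_size."""
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.c_[np.ones(256), rng.normal(size=256)]   # bias column + one feature
y = X @ np.array([1.0, 2.0])                    # targets from assumed true params

params = np.zeros(2)
learning_rate = 0.1
batch_size = 32            # a power of 2, as discussed above
no_epochs = 50

for _ in range(no_epochs):
    for X_b, y_b in iterate_minibatches(X, y, batch_size, rng):
        # Mean-squared-error gradient computed on this mini-batch only.
        params_grad = 2.0 / len(y_b) * X_b.T @ (X_b @ params - y_b)
        params = params - learning_rate * params_grad
```

Each epoch now performs several updates (one per mini-batch) instead of the single update per epoch of batch gradient descent.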
The extreme case of this is a setting where the mini-batch contains only a single example.
This process is called Stochastic Gradient Descent (SGD) (or sometimes on-line
gradient descent). This is relatively less common to see in practice because, due to
vectorized code optimizations, it can be computationally much more efficient to evaluate
the gradient for 100 examples at once than the gradient for one example 100 times. Even
though SGD technically refers to using a single example at a time to evaluate the gradient,
you will hear people use the term SGD even when referring to mini-batch gradient descent
(i.e. mentions of MGD for “Minibatch Gradient Descent”, or BGD for “Batch Gradient
Descent” are rare to see), where it is usually assumed that mini-batches are used.
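Taken to this single-example extreme, the loop becomes one parameter update per training example. A sketch, again on illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]   # bias column + one feature
y = X @ np.array([1.0, 2.0])                    # targets from assumed true params

params = np.zeros(2)
learning_rate = 0.02
no_epochs = 50

for _ in range(no_epochs):
    # SGD proper: shuffle the examples, then update on one at a time.
    for i in rng.permutation(len(y)):
        x_i, y_i = X[i], y[i]
        # Gradient of the squared error on this single example.
        params_grad = 2.0 * (x_i @ params - y_i) * x_i
        params = params - learning_rate * params_grad
```

The per-example gradients are noisy estimates of the full-dataset gradient, which is why the updates jitter around the descent direction rather than following it exactly.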