MOHAMED, QADRI
Fall 2016
Introduction
Traditional machine learning algorithms struggle to extract the most appropriate
features from data and to keep improving accuracy when fed vast amounts of data. For
feature extraction and selection, these algorithms rely on human effort; this changes
when deep networks come into the picture. Neural nets, a type of machine learning
algorithm, have long been known to learn complex patterns in data, but with the downside
of requiring an enormous amount of training time and computation. Because of this
drawback, neural nets have not been used to a great extent in mainstream analytics. As
research in the area has picked up pace over the last decade, neural networks are
gaining ground, this time with the backing of powerful hardware and of the scientific
community. The fundamental issue of slow training has largely been addressed by deep
learning. Deep learning is simply an ordinary neural network, modeled as a graph, that
has been tweaked at various levels to perform optimally. This paper explores the
fundamental idea of neural networks and discusses their evolution into deep belief
networks.
Artificial Neural Networks have been tweaked to address specific problems to optimize
results in several areas. Below are some of the networks and the problems they address:
1. Text
2. Image
3. Object
4. Speech
5. Time series
Deep nets are made up of several building blocks. Some of them are introduced below
and explained in more detail later in the paper.
Vanishing Gradient Problem
Vanishing gradient is a problem associated with neural networks trained with
gradient-based learning methods such as backpropagation. In gradient-based methods, each
of the network's weights receives an update proportional to the gradient of the error
with respect to that weight in each iteration of training. The ReLU function trains
faster than the sigmoid because the latter suffers from the vanishing gradient problem.
Traditional activation functions have gradients in the range (-1, 1), and backpropagation
computes gradients by the chain rule. This has the effect of multiplying n of these small
numbers to compute gradients of the "front" layers in an n-layer network, meaning that
the gradient (error signal) decreases exponentially with n and the front layers train
very slowly.
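The exponential decay described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every layer has a unit weight and the same pre-activation, so the chain rule reduces to a product of identical local derivatives; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, z, n_layers):
    # Product of n local derivatives, as the chain rule multiplies them
    # (assumption: unit weights and the same pre-activation z in every layer).
    g = 1.0
    for _ in range(n_layers):
        g *= local_grad(z)
    return g

for n in (1, 5, 10, 20):
    print(n, chained_gradient(sigmoid_grad, 0.0, n), chained_gradient(relu_grad, 1.0, n))
```

Even in the sigmoid's best case (z = 0), the product shrinks by a factor of four per layer, while the ReLU product stays at 1 for active units.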
Restricted Boltzmann machines
These are two-layer neural networks that require the second layer of the network to
learn the features of the first layer as well as possible, iteratively. The learning is
done using forward and backward passes between the two layers. They are feature
extraction neural nets. RBMs are used in sequence to form deep networks, in the process
easing the vanishing gradient and training problems.
Deep Belief Network
This is a neural net with a resolved vanishing gradient problem. The training method is
different from the regular backpropagation method. A DBN is a deep net built from a
series of RBMs (or other types of units), where layer 1 learns weights from the input
and reconstructs the input from these weights until layer 2 is a good representation
of layer 1. This is done iteratively for each pair of layers in the network. The training
concludes with a final round of overall weight fine-tuning.
Learning Algorithms
The one-neuron neural network is the simplest neural network possible, and its
optimization problem is convex. The system becomes complex, however, when multiple nodes
are added and there are multiple local minima and maxima. Several state-of-the-art
algorithms are available to solve such problems. The fundamental algorithm is gradient
descent; it has been improved upon by changing some of its parameters, such as window
size and learning rate adaptability, to get better algorithms.
[Figure: Functions in 3D visualization]
Gradient descent
Gradient descent is a first order optimization algorithm. To find a local minimum of a
function using gradient descent, one takes steps proportional to the negative of
the gradient (or of the approximate gradient) of the function at the current point. If instead
one takes steps proportional to the positive of the gradient, one approaches a local
maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest
descent, or the method of steepest descent.
Gradient descent should not be confused with
the method of steepest descent for approximating
integrals. Using the Gradient Descent (GD)
optimization algorithm, the weights are updated
incrementally after each epoch (= pass over the
training dataset).
Working of Gradient Descent
For a pre-defined number of epochs, we first compute the gradient vector of the loss
function over the whole dataset w.r.t. our parameter vector. We then update our
parameters in the direction opposite to the gradient, with the learning rate determining
how big an update we perform. Given a suitably small learning rate, batch gradient
descent is guaranteed to converge to the global minimum for convex error surfaces and to
a local minimum for non-convex surfaces.
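The procedure above can be sketched on a toy problem. This is a minimal illustration, assuming a one-dimensional linear model with a mean-squared-error loss; the data, learning rate and epoch count are made-up values, not taken from the paper.

```python
def batch_gradient_descent(xs, ys, lr=0.1, epochs=500):
    # Fit y = w * x + b by minimizing the mean squared error.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the loss computed over the WHOLE dataset.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # One parameter update per epoch, against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(w, b)  # close to 2 and 1
```

Because this loss surface is convex, the single minimum it converges to is the global one.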
Stochastic Gradient Descent
Stochastic gradient descent (SGD) in contrast performs a parameter update
for each training example x(i) and label y(i):
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$$
Batch gradient descent performs redundant computations for large datasets, as it
recomputes gradients for similar examples before each parameter update. SGD does
away with this redundancy by performing one update at a time. It is therefore usually
much faster and can also be used to learn online. SGD performs frequent updates with
a high variance that causes the objective function to fluctuate heavily, as shown in the image.
While batch gradient descent converges to the minimum
of the basin the parameters are placed in, SGD's
fluctuation, on the one hand, enables it to jump to new
and potentially better local minima. On the other hand,
this ultimately complicates convergence to the exact
minimum, as SGD will keep overshooting. However, it has
been shown that when we slowly decrease the learning
rate, SGD shows the same convergence behavior as
batch gradient descent, almost certainly converging to a
local or the global minimum for non-convex and convex
optimization respectively.
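A minimal sketch of SGD with a slowly decreasing learning rate follows. It assumes the same kind of toy linear model as before; the decay schedule `lr0 / (1 + decay * step)` and all numeric values are illustrative choices, not prescribed by the text.

```python
import random

def sgd(xs, ys, lr0=0.05, decay=0.001, epochs=500, seed=0):
    # Fit y = w * x + b with one parameter update per training example.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each epoch
        for x, y in data:
            lr = lr0 / (1.0 + decay * step)  # slowly decreasing learning rate
            err = w * x + b - y
            # Gradient of the squared error of this single example.
            w -= lr * 2.0 * err * x
            b -= lr * 2.0 * err
            step += 1
    return w, b

w, b = sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # close to 2 and 1
```

With a fixed, large learning rate the updates would keep bouncing around the minimum; the decay is what lets the iterates settle.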
Others
Depending on the amount of data, we make a trade-off between the accuracy of the
parameter update and the time it takes to perform an update. To overcome the challenge
of choosing the learning rate for the parameter update, we looked at various methods
that address this problem, such as momentum, Nesterov accelerated gradient, Adagrad,
Adadelta, RMSprop and Adam. RMSprop is an extension of Adagrad that deals with its
radically diminishing learning rates. It is identical to Adadelta, except that Adadelta
uses the RMS of parameter updates in the numerator of the update rule. Adam, finally,
adds bias-correction and momentum to RMSprop. In this respect, RMSprop, Adadelta, and
Adam are very similar algorithms that do well in similar circumstances. Its
bias-correction helps Adam slightly outperform RMSprop towards the end of optimization
as gradients become sparser. Adam might therefore be the best overall choice.
Interestingly, many recent papers use vanilla SGD without momentum and a simple
learning rate annealing schedule. As has been shown, SGD usually manages to find a
minimum, but it might take significantly longer than some of the other optimizers, is
much more reliant on a robust initialization and annealing schedule, and may get stuck
in saddle points rather than local minima. Consequently, if you care about fast
convergence and train a deep or complex neural network, you should choose one of the
adaptive learning rate methods.
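To make the description of Adam concrete, here is a minimal sketch of its commonly published update rule (momentum, an RMSprop-style second-moment estimate, and bias correction), applied to a one-dimensional quadratic. The hyperparameter values and the test function are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g      # momentum: running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g  # RMSprop part: running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(theta)  # close to 3
```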
Regular ANN
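The pre-activation of a neuron is the weighted sum of its inputs plus a bias:
$$a(x) = b + \sum_i w_i x_i = b + w^T x$$
The output is generated by passing this pre-activation through some function, mostly the
sigmoid:
$$h(x) = g(a(x)) = g\left(b + \sum_i w_i x_i\right)$$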
A positive weight means that the input excites the next neuron, whereas a negative weight
means it inhibits the neuron. The weight changes the steepness of the sigmoid. The bias
allows us to shift the activation function (sigmoid). The sigmoid is centered at 0; if we
want to center it at 2 (with a unit weight), we add a bias of -2, in other words we move
the graph two units to the right. This is useful when you want the neuron to fire a given
value of y at a different value of x. Without a bias this is not possible.
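The roles of weight and bias can be checked with a one-input sigmoid neuron. In this minimal sketch (the function name `neuron` is hypothetical), the midpoint of sigmoid(w*x + b) sits at x = -b/w, so with a unit weight a bias of -2 moves the midpoint to x = 2.

```python
import math

def neuron(x, w, b):
    # Sigmoid of the pre-activation w * x + b.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

print(neuron(0.0, 1.0, 0.0))   # midpoint 0.5 at x = 0 without a bias
print(neuron(2.0, 1.0, -2.0))  # midpoint moved to x = 2 by a bias of -2
print(neuron(0.5, 1.0, 0.0), neuron(0.5, 5.0, 0.0))  # a larger weight gives a steeper curve
```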
Each neuron acts as a function, and the combination of an entire hidden layer creates a
pattern of ridges and mounds.
For multiclass problems we need multiple output probabilities (the conditional
probability of belonging to each class). In this case we use a softmax function.
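A minimal sketch of the softmax function follows. Subtracting the maximum score before exponentiating is a common numerical-stability step added here; it does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    m = max(scores)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # one probability per class, summing to 1
```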
Backward propagation
Backpropagation is the second phase of the neural net learning algorithm. It involves
changing the weights such that the input and the output are appropriately mapped. This
is achieved using gradient descent, although on larger data sets algorithms such as
stochastic gradient descent, RMSprop, Adam etc. are used. We need a loss function
that compares the predicted output with the output label. The empirical risk is the
average of the loss function over the training examples, plus a regularizer that
penalizes the value of the parameters theta. Mathematically, the primary idea of
backpropagation is to find the rate of change of the error with respect to the weights.
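$$\Delta = -\nabla_\theta \mathrm{Loss} - \lambda \nabla_\theta \mathrm{Reg}$$
Here \lambda is the weight given to the regularizer. The pre-activation of a node is the
sum-product of all weights from layer l-1 into this node at layer l with the activations
of layer l-1. The activation is the sigmoid of this sum-product.
SGD finds the gradient and goes in the opposite direction, minimizing the loss. The net
update direction is given by \Delta. The new parameter after the iteration, with learning
rate \alpha, is given below:
$$\theta = \theta + \alpha \cdot \Delta$$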
This gradient is computed by taking the partial derivative of the negative log-likelihood
loss function.
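Mathematically, we take the difference between f(x) and y to get the error. This error
is propagated back to all layers as per-layer deltas. For a four-layer network, the
deltas of the hidden layers can be calculated as below:
$$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \odot g'(z^{(3)})$$
$$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \odot g'(z^{(2)})$$
where z is the pre-activation, g'(z) is the derivative of the activation function at z,
and \odot denotes the element-wise product.
The partial derivative of the error with respect to each weight is then given by the
activation of that layer times the deltas of the layer ahead of it, accumulated over the
training examples:
$$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$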
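In the graphic below, the output layer is indexed by j and the hidden layer by i. We
first find the squared error of the output and take its derivative with respect to the
output, to see how the error changes with the output. By the chain rule, the derivative
of the error with respect to a pre-activation is the derivative of the output with
respect to that pre-activation times the derivative of the error with respect to the
output (1). The derivative of the error with respect to a hidden-layer output is the
sum, over the units above it, of the derivative of each pre-activation with respect to
that output times the derivative of the error with respect to the pre-activation (2).
We also need the derivative of the error with respect to the weights. This is equal to
the output of the layer below times the derivative of the error with respect to the
pre-activation (3).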
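$$\frac{\partial E}{\partial z_j} = \frac{dy_j}{dz_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j} \qquad (1)$$
$$\frac{\partial E}{\partial y_i} = \sum_j \frac{dz_j}{dy_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j} \qquad (2)$$
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j} \qquad (3)$$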
Backpropagation I
The figure below explains the same thing. We find \partial E/\partial y, or
\partial l/\partial f(x), as the gradient with respect to f(x). We then find the gradient
with respect to the weights and biases, \partial E/\partial w, through the gradient of
the loss (E) w.r.t. the pre-activation a2(x), using the gradient of f(x) or z as shown
above.
\partial E/\partial w tells how much a change in w affects the error. This gradient,
scaled by the learning rate, is then subtracted from the corresponding weight. The
equation below is for the output layer.
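$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}$$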
Backpropagation II
To go into the hidden layers, we need to find the derivative of the same E total but with
respect to the corresponding weight w1. This can be split as a chain and computed.
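$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$$
$$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}$$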
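The chain-rule computation for an output weight and a hidden weight can be sketched on a small network with two inputs, two sigmoid hidden units and two sigmoid outputs. All weights, biases, inputs and targets below are made-up illustrative values, and the helper names are hypothetical; the analytic gradients are checked against a numerical finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b1, b2):
    # w = [w1..w8]: w1..w4 feed the hidden layer, w5..w8 feed the output layer.
    out_h = [sigmoid(w[0] * x[0] + w[1] * x[1] + b1),
             sigmoid(w[2] * x[0] + w[3] * x[1] + b1)]
    out_o = [sigmoid(w[4] * out_h[0] + w[5] * out_h[1] + b2),
             sigmoid(w[6] * out_h[0] + w[7] * out_h[1] + b2)]
    return out_h, out_o

def total_error(w, x, t, b1, b2):
    _, out_o = forward(w, x, b1, b2)
    return sum(0.5 * (tk - ok) ** 2 for tk, ok in zip(t, out_o))

def backprop(w, x, t, b1, b2):
    out_h, out_o = forward(w, x, b1, b2)
    # dE/dnet_o = dE/dout_o * dout_o/dnet_o
    delta_o = [(ok - tk) * ok * (1.0 - ok) for ok, tk in zip(out_o, t)]
    # dE/dout_h sums the contributions of both output errors.
    dE_dout_h = [delta_o[0] * w[4] + delta_o[1] * w[6],
                 delta_o[0] * w[5] + delta_o[1] * w[7]]
    delta_h = [dE_dout_h[k] * out_h[k] * (1.0 - out_h[k]) for k in range(2)]
    # dE/dw = delta of the receiving unit * output of the sending unit.
    return [delta_h[0] * x[0], delta_h[0] * x[1],
            delta_h[1] * x[0], delta_h[1] * x[1],
            delta_o[0] * out_h[0], delta_o[0] * out_h[1],
            delta_o[1] * out_h[0], delta_o[1] * out_h[1]]

def numeric_grad(w, x, t, b1, b2, k, h=1e-6):
    # Central finite difference of the total error w.r.t. weight k.
    wp, wm = list(w), list(w)
    wp[k] += h
    wm[k] -= h
    return (total_error(wp, x, t, b1, b2) - total_error(wm, x, t, b1, b2)) / (2.0 * h)

w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
x, t, b1, b2 = [0.2, 0.7], [0.1, 0.9], 0.1, -0.1
analytic = backprop(w, x, t, b1, b2)
numeric = [numeric_grad(w, x, t, b1, b2, k) for k in range(8)]
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # should be tiny
```

A weight update would then subtract the learning rate times each entry of `analytic` from the matching weight.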
Backpropagation III
Training Housekeeping
For a complete NN training we need the following things:
1. Loss function (Prediction minus true value)
2. Procedure to compute gradients (Forward and Backward prop)
3. Regularizer (prevents overfitting)
4. Initialization method
Regularization is usually applied only to the weights and not to the biases. L1 (Lasso)
and L2 (Ridge) put a limit on the size of the coefficients (the weights here); this
introduces some bias but at the same time reduces variance. By the bias-variance
trade-off, we can thus reduce some variance in the final model at the cost of a little
bias.
The biases are initialized to 0 and the weights to non-zero, distinct values.
Hyperparameter selection, such as deciding the number of hidden layers, can be done
using grid search.
To find the best point that is neither underfitting nor overfitting, keep running
iterations and noting the test and training error. Then select the optimal point, at
which the test error is lowest; beyond it the training error keeps falling while the
test error rises, which signals overfitting.
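Selecting the stopping point from recorded errors can be sketched as follows. The error values are hypothetical numbers invented for illustration, and `best_epoch` is a hypothetical helper; in practice the errors would be logged during training.

```python
def best_epoch(test_errors):
    # Index of the epoch with the lowest recorded test error.
    return min(range(len(test_errors)), key=lambda e: test_errors[e])

# Hypothetical error curves: training error keeps falling,
# test error falls and then rises again as the model overfits.
train_errors = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
test_errors = [0.95, 0.70, 0.52, 0.45, 0.47, 0.55, 0.66]

best = best_epoch(test_errors)
print(best, train_errors[best], test_errors[best])
```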
Tuning Parameters
The difference between neural nets and deep nets is that, unlike neural nets, deep nets
don't simply employ the forward and back propagation algorithms and fall victim to the
vanishing gradient problem. Instead, deep nets are trained by training subsets of the
network and then combining them; this idea is responsible for the name deep nets.
Different kinds of algorithms are used to train these sub-networks; the ones discussed
in this paper are RBMs, autoencoders and sparse coding.
Restricted Boltzmann machines are feature extraction neural nets. They modify the
weights such that the input at l1 can be reconstructed from l2, and in the process
latent features of the input are learnt at l2. In RBMs there are no connections between
hidden units, or between visible units, of the same layer. These systems are also used
to change the dimension of data from X-dim (input) to H-dim (hidden layer).
In large systems, calculating the partition (normalization) function is a challenge; in
such a setting we use an energy function. Energy functions are just probability
functions without the normalizing denominator. The equations below show how a
probability function is equivalent to a normalized energy function. The energy function
contains the extra bias terms b and c. v is the input vector and h is the hidden vector
from the hidden layer. Low energy means high probability.
$$p(v, h) = \frac{1}{Z} e^{-E(v,h)}$$
$$E(v, h) = -h^T W v - c^T v - b^T h$$
$$Z = \sum_{v,h} e^{-E(v,h)}$$
where E(v,h) is the energy function and Z is the normalizing constant partition function
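For a very small binary RBM the partition function can be computed by brute-force enumeration, which makes the link between energy and probability concrete. This is a minimal sketch assuming the energy form E(v,h) = -h^T W v - c^T v - b^T h, with c as the visible bias and b as the hidden bias; the weights and biases are made-up values.

```python
import math
from itertools import product

def energy(v, h, W, b, c):
    # E(v, h) = -h^T W v - c^T v - b^T h, with one row of W per hidden unit.
    hWv = sum(h[j] * W[j][i] * v[i] for j in range(len(h)) for i in range(len(v)))
    return -hWv - sum(ci * vi for ci, vi in zip(c, v)) - sum(bj * hj for bj, hj in zip(b, h))

def partition(W, b, c):
    # Z sums exp(-E) over every binary configuration; feasible only for tiny models.
    n_h, n_v = len(W), len(W[0])
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in product((0, 1), repeat=n_v)
               for h in product((0, 1), repeat=n_h))

def joint_prob(v, h, W, b, c):
    return math.exp(-energy(v, h, W, b, c)) / partition(W, b, c)

W = [[1.0, -0.5], [0.3, 0.8]]  # 2 hidden units x 2 visible units
b = [0.1, -0.2]                # hidden biases
c = [0.0, 0.2]                 # visible biases

low = ((1, 0), (1, 0))   # configuration with energy -1.1
zero = ((0, 0), (0, 0))  # configuration with energy 0
print(energy(*low, W, b, c), joint_prob(*low, W, b, c))
print(energy(*zero, W, b, c), joint_prob(*zero, W, b, c))
```

The number of terms in Z doubles with every added unit, which is why the exact normalization becomes impractical in large systems.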
An RBM is trained using the Gibbs sampling method, where you find P(h|v), then find
P(v|h), and repeat. At every step the energy is evaluated at h. In principle the desired
results are obtained only after infinitely many steps, or at least a great deal of
training. Using the technique of Contrastive Divergence, however, they can be achieved
within a few iterations.
Training a RBM
The idea here is that we find a representation (at h) of the input in the hidden layer
and then compare its reconstruction with the original; if it is noise and not comparable
with the input, we assign a very low probability to the representation and adjust the
weights and biases accordingly. An RBM always consists of an input layer and one hidden
layer. These RBMs can be connected in series to form what is called a deep belief
network, explained later, such that the input (v) is represented by the first hidden
layer (h), and this representation (h) is taken as the input by another RBM and
transformed. The deep belief network was used because training deep models using
backpropagation was very hard. This was first done in 2005.
Auto Encoders
Autoencoders are similar to RBMs in function, but unlike the two-layered RBM they are
three-layered, as in the figure below. They consist of an encoder and a decoder. The
encoder transforms the data x into a low-dimensional hidden layer h(x), and the hidden
layer is then decoded into a reconstruction of the inputs, x hat. x and x hat are then
compared, and a loss function is minimized to reduce the difference. Note that the
weight matrices in the encoder and the decoder are transposes of each other.
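The encoder, decoder and reconstruction loss can be sketched as a single forward pass with tied weights. This minimal sketch assumes sigmoid units and a squared-error loss; the weights, biases and input are made-up values, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    # h(x) = sigmoid(b + W x); W has one row per hidden unit.
    return [sigmoid(b[j] + sum(W[j][i] * x[i] for i in range(len(x))))
            for j in range(len(W))]

def decode(h, W, c):
    # x_hat = sigmoid(c + W^T h): the decoder reuses the transposed encoder weights.
    return [sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(W[0]))]

def reconstruction_loss(x, x_hat):
    return 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

x = [0.9, 0.1, 0.4]                         # 3 inputs
W = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]    # 2 hidden units: undercomplete
b, c = [0.0, 0.1], [0.0, 0.0, 0.0]

h = encode(x, W, b)
x_hat = decode(h, W, c)
print(h, x_hat, reconstruction_loss(x, x_hat))
```

Training would adjust W, b and c by gradient descent to reduce this loss.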
AutoEncoders
PCA is a method that assumes linear systems, whereas an autoencoder (AE) does not. If no
non-linear function is used in the AE and the number of neurons in the hidden layer is
smaller than in the input layer, then PCA and the AE can yield similar results.
Otherwise the AE may find a different subspace.
An undercomplete autoencoder is one that has fewer hidden units than input units; an
overcomplete autoencoder, on the other hand, has more hidden units than input units.
Overcomplete AEs have a tendency to simply copy the input units into the hidden layer,
because h > x and h is a representation of x. In this case some h units are left at 0.
This is not a good idea. To overcome it, noise can be added to the input so that x
becomes x~, and this x~ becomes the input to the AE, as in the graphic below. Note,
however, that the reconstruction x hat is still compared with x, not with x~ (compare
with the RBM). This allows the AE to find features from the noisy input (x~) and then
compare the reconstruction with the original input (x), at the same time making sure
that the hidden units are not simply copies of the input layer. This is called a
denoising AE.
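The key point, that the network sees the corrupted input but is scored against the clean one, can be sketched as below. Masking noise (zeroing random components) is one possible corruption, assumed here for illustration; `reconstruct` stands for any autoencoder forward pass and is a hypothetical parameter.

```python
import random

def corrupt(x, drop_prob, rng):
    # Masking noise: each input component is zeroed with probability drop_prob.
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def squared_error(a, b):
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def denoising_loss(x, reconstruct, drop_prob=0.3, seed=0):
    rng = random.Random(seed)
    x_tilde = corrupt(x, drop_prob, rng)  # the autoencoder only sees the noisy input
    x_hat = reconstruct(x_tilde)
    return squared_error(x, x_hat)        # but the loss uses the clean input

x = [1.0, 2.0, 3.0]
copy_input = lambda z: list(z)  # a "network" that merely copies its input
print(denoising_loss(x, copy_input, drop_prob=0.0))  # no noise: copying is free
print(denoising_loss(x, copy_input, drop_prob=1.0))  # full noise: copying is penalized
```

A network that only copies its input is penalized whenever a component has been corrupted, so it has to learn structure that lets it fill in the missing parts.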
Denoising autoencoders take a partially corrupted input whilst training to recover the
original undistorted input. This technique has been introduced with a specific approach
to good representation. A good representation is one that
can be obtained robustly from a corrupted input and that will
be useful for recovering the corresponding clean input. This
definition contains the following implicit assumptions:
1. The higher level representations are relatively stable
and robust to the corruption of the input;
2. It is necessary to extract features that are useful for
representation of the input distribution.
Contractive AEs (CAE), on the other hand, allow us to extract features that reflect only
the variations observed in the training set. A CAE adds the Frobenius norm of the
Jacobian of the hidden representation with respect to the input as a penalty term in
its objective.
Sparse Coding
Sparse AutoEncoders
The advantage of having an over-complete basis is that our basis vectors are better able
to capture structures and patterns inherent in the input data. However, with an overcomplete basis, the coefficients are no longer uniquely determined by the input vector x.
Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve
the degeneracy introduced by over-completeness.
Sparsity is another way of allowing for an overcomplete representation, that doesn't learn
a trivial mapping.
The methods described above are used to construct and train a deep belief network.
The resultant network is then run through a round of backprop-type fine-tuning to
optimize the weights. The layer-wise stage is called unsupervised pre-training.
Unsupervised pre-training acts not only as regularization but also as protection against
underfitting.
A greedy layer-wise procedure is used:
1. Train one layer at a time, from first to last, with an unsupervised criterion
2. Fix the parameters of the previous hidden layers
3. Previous layers are viewed as feature extraction
Once all the layers are trained, we connect the output layer and then train the entire
system one last time using backpropagation. The weights, however, do not change much
this time. This process is called fine-tuning. The idea is that we can use a major
portion of the dataset in an unsupervised way to learn the weights, and use only a small
portion of labeled data in a supervised way to complete the fine-tuning phase. The
advantage of this algorithm is evident in scenarios where we want to do supervised
learning but have only a small amount of labeled data at hand, a scenario quite common
in the big data era.
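The layers below the top two are also called a sigmoid belief network (SBN). For a
network with three hidden layers, the conditionals of the SBN part are
$$p(h^{(1)}_j = 1 \mid h^{(2)}) = \mathrm{sigm}\left(b^{(1)} + {W^{(2)}}^T h^{(2)}\right)_j$$
$$p(x_i = 1 \mid h^{(1)}) = \mathrm{sigm}\left(b^{(0)} + {W^{(1)}}^T h^{(1)}\right)_i$$
and the joint distribution is
$$p(x, h^{(1)}, h^{(2)}, h^{(3)}) = p(h^{(2)}, h^{(3)}) \, p(h^{(1)} \mid h^{(2)}) \, p(x \mid h^{(1)})$$
where
$$p(h^{(2)}, h^{(3)}) = \exp\left({h^{(2)}}^T {W^{(3)}}^T h^{(3)} + {b^{(2)}}^T h^{(2)} + {b^{(3)}}^T h^{(3)}\right) / Z$$
$$p(h^{(1)} \mid h^{(2)}) = \prod_j p(h^{(1)}_j \mid h^{(2)})$$
$$p(x \mid h^{(1)}) = \prod_i p(x_i \mid h^{(1)})$$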
The algorithm can further be improved using some tweaks and techniques such as
dropout training and drop connect.
Dropout training
The key idea in this technique is to randomly drop units, marked in black in the figure
(along with their connections), from the neural network during training. This prevents
units from co-adapting too much. During training, dropout samples from an exponential
number of different thinned networks. At test time, it is easy to approximate the effect
of averaging the predictions (in an ensemble way) of all these thinned networks by
simply using a single unthinned network that has smaller weights. This significantly
reduces overfitting and gives major improvements over other regularization methods.
Randomly chosen nodes are set to zero, and they get disconnected.
The outcome is that the other nodes cannot collaborate or rely on any particular node
(shown in black) always being present. This way nodes are forced to extract general
features.
At testing time, after training has finished, we would ideally like to find a sample
average of all possible 2^n dropped-out networks; unfortunately, this is infeasible for
large values of n. However, we can find an approximation by using the full network with
each node's output weighted by a factor of p (the probability of keeping a node), so
that the expected value of the output of any node is the same as in the training stages.
This is the biggest contribution of the dropout method: although it effectively
generates 2^n neural nets, and as such allows for model combination, at test time only a
single network needs to be tested.
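The training-time masking and the test-time scaling can be sketched as follows. This is a minimal illustration assuming `keep_prob` plays the role of p, the probability that a unit is retained; the function names and activation values are hypothetical.

```python
import random

def dropout_train(activations, keep_prob, rng):
    # Training: each unit is kept with probability keep_prob, otherwise set to zero.
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    # Testing: use the full network, with every output scaled by keep_prob.
    return [keep_prob * a for a in activations]

rng = random.Random(0)
acts = [2.0, 4.0]
samples = [dropout_train(acts, 0.5, rng) for _ in range(10000)]
mean_first = sum(s[0] for s in samples) / len(samples)
print(mean_first, dropout_test(acts, 0.5)[0])  # the two values are close
```

Averaging over many sampled thinned networks gives roughly the same output as the single scaled network.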
Dropout Training
DropConnect
It is a generalization of Dropout in which each connection, rather than each output unit,
can be dropped with probability (1-p). Each unit thus receives input from a random
subset of units in the previous layer. DropConnect is similar to Dropout as it introduces
dynamic sparsity within the model, but differs in that the sparsity is on the weights, rather
than the output vectors of a layer. In other words, the fully connected layer with
DropConnect becomes a sparsely connected layer in which the connections are chosen
at random during the training stage.
We would ideally want to geometrically average over all 2^n trained models, but since
that is not possible when n is large, we instead multiply the weights by the expectation
of the masks (i.e. p).
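The difference from Dropout, masking individual weights instead of whole units, can be sketched for a single linear layer. This minimal sketch follows the approximation described in the text (scaling the weights by p at test time); the function names and values are hypothetical, and the non-linearity is left out for brevity.

```python
import random

def dropconnect_train(x, W, keep_prob, rng):
    # Training: every individual connection is kept with probability keep_prob.
    return [sum(w * xi for w, xi in zip(row, x) if rng.random() < keep_prob)
            for row in W]

def dropconnect_test(x, W, keep_prob):
    # Testing: all connections are used, with weights scaled by the mask expectation.
    return [sum(keep_prob * w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, 2.0]
W = [[0.5, -1.0], [2.0, 0.0]]  # one row of weights per output unit
print(dropconnect_train(x, W, 0.5, random.Random(0)))  # a randomly thinned layer
print(dropconnect_test(x, W, 0.5))
```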
Conclusion
The paper describes how neural networks have evolved into deep neural networks or deep
belief networks by overcoming problems such as training time, vanishing gradients,
computational expense etc. DBNs are gaining momentum in the technology sector, with
applications in areas such as speech recognition, text processing, information
retrieval, artificial intelligence etc. Several commentators claim that the future
belongs to deep learning.
Hardware improvements have also played a major role in the wide acceptance of deep
networks. Graphics Processing Units (GPUs), which are better suited than CPUs to
massively parallel computation, are being utilized to train deep networks. In some
cases, the training time has been known to drop to as little as 1/250th of the time
spent training on CPUs of comparable configuration. In addition, distributed computing
systems such as Hadoop/Spark have allowed for parallel computation, improving
performance even further. Several programming interfaces that leverage the hardware
improvements, such as CUDA, TensorFlow, Theano, Deeplearning4j etc., have been developed
to optimize the algorithms on all fronts. Bearing in mind this ongoing enthusiasm and
research in the area, and its results, it would not be hard to say that deep learning is
the future of machine learning.
References
[1] http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
[2] https://www.coursera.org/learn/neural-networks
[3] http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
[4] http://deeplearning.net/tutorial/deeplearning.pdf
[5] http://cilvr.cs.nyu.edu/doku.php?id=deeplearning:slides:start
[6] https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example
[7] https://www.youtube.com/channel/UC9OeZkIwhzfv-_Cb7fCikLQ