
The evolution of Deep Neural Networks

MOHAMED, QADRI
Fall 2016

Introduction

Traditional machine learning algorithms struggle to extract the most appropriate features from data and to improve in accuracy when fed vast amounts of data. For feature extraction and selection, machine learning algorithms rely on human effort; this changes, however, when deep networks come into the picture. Neural nets, one type of machine learning algorithm, have been known to learn complex patterns in data, but with the downside of requiring enormous amounts of training time and computation. Due to this drawback, neural nets have not been used in the mainstream analytics scene to a great extent. As research in the area has picked up pace in the last decade, neural networks are gaining ground, this time with the powerful backing of hardware and scientists. The fundamental issue of slow training has largely been addressed by deep learning. Deep learning is simply an ordinary graphically modeled neural network that has been tweaked at various levels to perform optimally. This paper explores the fundamental idea of neural networks and discusses their evolution into deep belief networks.

Key Elements and types of Deep Networks

Elements of Deep Learning

Artificial neural networks have been tweaked to address specific problems and optimize results in several areas. Below are some data types and the networks commonly used to address them:

1. Text: Recursive/recurrent neural tensor network (RNTN)
2. Image: Convolutional net (CNN)
3. Object: Convolutional net, RNTN
4. Speech: Recursive/recurrent neural tensor network (RNTN)
5. Time series: Recursive/recurrent neural tensor network (RNTN)

Deep nets are made up of several building blocks. Some of them are introduced below
and explained in much detail later in the paper.
Vanishing Gradient Problem
The vanishing gradient is a problem associated with neural networks trained with gradient-based learning methods such as backpropagation. In gradient-based methods, each of the network's weights receives an update proportional to the gradient of the error with respect to that weight in each iteration of training. The ReLU function trains faster than the sigmoid due to the vanishing gradient problem in the latter.

Types of activation functions

Traditional activation functions have gradients in the range (-1, 1) and backpropagation
computes gradients by the chain rule. This has the effect of multiplying n of these small
numbers to compute gradients of the "front" layers in an n-layer network, meaning that
the gradient (error signal) decreases exponentially with n and the front layers train very
slowly.
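To see this effect numerically, here is a minimal NumPy sketch (not from the paper): it compares how much error signal survives n layers when each layer contributes one sigmoid-gradient factor versus one ReLU-gradient factor. The pre-activation value of 1.0 is an arbitrary assumption for illustration.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                     # peaks at 0.25, so it is always < 1

def relu_grad(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)   # exactly 1 for active units

# Error signal surviving after n layers, assuming each layer contributes
# roughly one activation-gradient factor at a typical pre-activation of 1.0
for n in (5, 20, 50):
    print(n, sigmoid_grad(1.0) ** n, relu_grad(1.0) ** n)
# The sigmoid factor decays exponentially with depth; the ReLU factor does not.
```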
Restricted Boltzmann machines
These are two-layer neural networks that require the second layer of the network to learn the features of the first layer as faithfully as possible, iteratively. The learning is done using forward and backward passes. They are feature-extraction neural nets. RBMs are used in sequence to form deep networks, in the process easing the vanishing-gradient and training problems.
Deep Belief Network
This is a neural net in which the vanishing gradient problem has been resolved; its training method is different from the regular backpropagation method. A DBN is a deep net built from a series of RBMs (or other types of units), where layer 1 learns weights from the input and reconstructs the input from these weights until layer 2 is a good representation of layer 1. This is done iteratively for each pair of layers in the network. The training concludes with a final round of overall weight fine-tuning.

Convolutional Net (CNN)


CNNs are useful in image recognition. They use a torch-like structure, a small filter matrix that can be understood as traversing the complete pixel matrix of the image, collecting dot products (the convolution layer) and converting them into activation maps. This entire process can be replicated in series, such that more torches traverse the activation maps.

CNN explained: Image → Torch → Activation Map → Torch → Activation Map

The idea behind scanning the matrix with a torch is to extract features, similar to the bag-of-words concept for images, such that an image feature is extracted with its surrounding pixels in mind. The idea behind running torches on the activation maps is to combine the features extracted in the previous layer and project them as combined features to the next torch layer. A CNN is a sequence of units: convolution layer, ReLU, and pooling. The pooling layer downsamples the image or the activation map, in other words reducing its size. More filters may be applied to the activation maps, yielding many more activation maps. Finally, the outputs are turned into probability values using some kind of softmax function. ReLU trains faster than the sigmoid due to the vanishing gradient problem in the latter.
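The scanning-and-pooling mechanics can be sketched in a few lines of NumPy. This toy example (a random image and a random filter, not a trained network) shows one convolution layer with ReLU followed by max pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: slide the filter over the image, collecting dot products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample the activation map by taking the max of each size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                            # toy grayscale image
kernel = np.random.randn(3, 3)                          # a filter (random here)
activation_map = np.maximum(conv2d(image, kernel), 0)   # convolution + ReLU
pooled = max_pool(activation_map)                       # 6x6 map -> 3x3 map
print(activation_map.shape, pooled.shape)
```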
Recurrent Net (RNN)
RNNs are used whenever there is a time step involved. An RNN is a layer at t0 that produces an output and feeds it back to itself at t1. Each time step acts as a layer, so training a model over 100 time steps means training a 100-layer model. It is therefore important to figure out when previous information can be forgotten and when it should be remembered for future time steps; this is achieved by gating mechanisms such as LSTM and GRU. RNNs work well on sequences of data, such as picture frames or time series. Recursive nets, in contrast, are used for hierarchical data, especially text, and work in the form of a recursive binary tree.
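A minimal sketch of the unrolling described above, assuming a vanilla (ungated) RNN with tanh activation; the toy dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Unroll one vanilla RNN over a sequence; each time step acts as a layer."""
    h = np.zeros(W_hh.shape[0])                  # hidden state carried across time
    outputs = []
    for x in inputs:                             # t = 0, 1, 2, ...
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # new state mixes input and memory
        outputs.append(W_hy @ h + b_y)
    return outputs

# Toy dimensions: 4-dim inputs, 8-dim hidden state, 2-dim outputs
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)) * 0.1
W_hy, b_h, b_y = rng.normal(size=(2, 8)), np.zeros(8), np.zeros(2)
sequence = [rng.normal(size=4) for _ in range(100)]   # 100 time steps ~ 100 layers
print(len(rnn_forward(sequence, W_xh, W_hh, W_hy, b_h, b_y)))
```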
Auto encoders(AE) and Sparse Models
Autoencoders (AE) and sparse models are neural nets used for feature extraction, similar to RBMs, and can be used as building blocks for deep networks. Unlike regular PCA, these methods can capture non-linear structure.

Learning Algorithms
The one-neuron neural network is the simplest neural network possible, and its optimization problem is convex. The system becomes complex, however, when multiple nodes are added, and the error surface then has multiple local minima and maxima. Several state-of-the-art algorithms are available that can be used to solve such problems. The fundamental algorithm is gradient descent; it has been improved upon by changing some of its parameters, such as window (batch) size and learning-rate adaptability, to obtain better algorithms.

Functions in 3D visualization
Gradient descent
Gradient descent is a first order optimization algorithm. To find a local minimum of a
function using gradient descent, one takes steps proportional to the negative of
the gradient (or of the approximate gradient) of the function at the current point. If instead
one takes steps proportional to the positive of the gradient, one approaches a local
maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of steepest descent; it should not be confused with the method of steepest descent for approximating integrals. Using the gradient descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (one pass over the training dataset).
Working of Gradient Descent

Batch Gradient Descent


Vanilla gradient descent, a.k.a. batch gradient descent, computes the gradient of the cost function with respect to the parameters \theta for the entire training dataset:

\theta = \theta - \eta \cdot \nabla_\theta J(\theta)

As we need to calculate the gradients for the whole dataset to perform just one update, batch (regular) gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on the fly.

For a pre-defined number of epochs, we first compute the gradient vector of the loss function for the whole dataset with respect to our parameter vector. We then update our parameters in the opposite direction of the gradients, with the learning rate determining how big of an update we perform. Batch gradient descent is guaranteed to converge to the global
minimum for convex error surfaces and to a local minimum for non-convex surfaces.
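A minimal sketch of batch gradient descent on a toy least-squares problem; the model, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.5, epochs=2000):
    """One parameter update per pass over the ENTIRE dataset (squared-error loss)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient over all examples
        theta -= lr * grad                      # single update per epoch
    return theta

X = np.c_[np.ones(100), np.linspace(0, 1, 100)]   # bias column + one feature
y = 2.0 + 3.0 * X[:, 1]                           # noiseless line y = 2 + 3x
print(batch_gradient_descent(X, y))               # approaches [2, 3]
```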
Stochastic Gradient Descent
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x^{(i)} and label y^{(i)}:

\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})
Batch gradient descent performs redundant computations for large datasets, as it
recomputes gradients for similar examples before each parameter update. SGD does
away with this redundancy by performing one update at a time. It is therefore usually
much faster and can also be used to learn online. SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily, as in the figure below.
While batch gradient descent converges to the minimum
of the basin the parameters are placed in, SGD's
fluctuation, on the one hand, enables it to jump to new
and potentially better local minima. On the other hand,
this ultimately complicates convergence to the exact
minimum, as SGD will keep overshooting. However, it has
been shown that when we slowly decrease the learning
rate, SGD shows the same convergence behavior as
batch gradient descent, almost certainly converging to a
local or the global minimum for non-convex and convex
optimization respectively.

Working of Stochastic Gradient Descent
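The same toy problem solved with per-example updates; the shuffling scheme and learning rate are illustrative choices:

```python
import numpy as np

def sgd(X, y, lr=0.05, epochs=50, seed=0):
    """One parameter update PER EXAMPLE; the visiting order is shuffled every epoch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):        # visit examples in random order
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient for a single example
            theta -= lr * grad                   # noisy but frequent updates
    return theta

X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 2.0 + 3.0 * X[:, 1]
print(sgd(X, y))   # fluctuates around, and converges near, [2, 3]
```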

Others
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. To address the challenge of choosing and adapting the learning rate, we looked at various methods such as momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, and Adam. RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator update rule. Adam, finally, adds bias correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances; Adam's bias correction helps it slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple
learning rate annealing schedule. As has been shown, SGD usually manages to find a
minimum, but it might take significantly longer than with some of the optimizers, is much
more reliant on a robust initialization and annealing schedule, and may get stuck in
saddle points rather than local minima. Consequently, if you care about fast
convergence and train a deep or complex neural network, you should choose one of
the adaptive learning rate methods.
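For concreteness, here is a sketch of one Adam step as described above: momentum on the gradient plus RMSprop-style scaling, with bias correction. The hyperparameter defaults follow commonly cited values for Adam; the quadratic objective is a toy assumption.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on parameters theta given gradient grad at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMSprop term)
    m_hat = m / (1 - beta1**t)                   # bias correction: the moments
    v_hat = v / (1 - beta2**t)                   # start at zero and need rescaling
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * theta                             # gradient of the toy loss theta^2
    theta, m, v = adam_update(theta, grad, m, v, t, lr=0.05)
print(theta)                                     # close to the minimum at 0
```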

Training Regular Neural Networks


Forward Propagation
The training of a neural net comprises two phases: forward propagation and backward propagation. In reference to the figure on the right, forward propagation takes each input and multiplies it with the weight of the corresponding connection leading to the next hidden layer. At the hidden layer, the sum-product of all incoming connections is computed and an activation function (sigmoid, tanh, or ReLU) is run on it such that the output is in the range 0-1. This output of the activation function behaves as the input to the next layer for forward propagation.

Regular ANN

The sum-product of weights and inputs at the l+1 layer is also called the pre-activation. An additional bias is added to it, represented as:

a(x) = b + \sum_i w_i x_i = b + w^T x

The output is generated by passing this pre-activation through some function, mostly the sigmoid:

h(x) = g(a(x)) = g\big(b + \sum_i w_i x_i\big)
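A compact sketch of this forward pass for a toy 3-5-2 network; the layer sizes and random weights are assumptions for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Forward propagation: at each layer, pre-activation a = b + W h,
    then an activation function squashes a into (0, 1)."""
    h = x
    for W, b in zip(weights, biases):
        a = W @ h + b          # pre-activation (sum-product of weights and inputs)
        h = sigmoid(a)         # activation passed on as input to the next layer
    return h

rng = np.random.default_rng(1)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]   # 3 -> 5 -> 2 net
biases = [np.zeros(5), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))
```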

A positive weight means that a connection excites the next neuron, whereas a negative weight means it inhibits the neuron. The weight changes the steepness of the sigmoid, and the bias allows us to shift the activation function (sigmoid). The sigmoid is centered at 0; if we want to center it at 2, we add a bias of 2, in other words moving the graph two units ahead. This is useful when we might want the output y to fire at a different value of x. Without a bias this is not possible.

Visually, a single-neuron neural network creates a separating hyperplane to differentiate the classes, but at times it is not possible to differentiate classes using a single linear hyperplane. In that case we need one neuron to transform the data into another feature space in which the data is linearly separable (like the kernel trick in SVMs), and then another neuron to find the plane. This is the motivation behind multiple layers. The neural network can be comprehended as a series of functions represented by neurons, each representing a simple curve or line; a combination of these neurons can therefore be used to represent complex structures. A neuron can be visualized as on the right. The rift denotes the separating boundary. The weight vector (learnt using backprop) is perpendicular to the rift's orientation, and as the bias increases, the rift moves in the direction opposite to w. Y is the range defined by the activation, in this case class 1 or -1.

A single Neuron Neural Net

Each neuron acts as a function, and in combination an entire hidden layer creates a pattern of ridges and mounds.
For multiclass problems we need multiple output probabilities (the conditional probability of belonging to each class). In this case we use a softmax function.
Backward propagation
Backpropagation is the second phase of the neural-net learning algorithm. It involves changing the weights such that the inputs and outputs are appropriately mapped. This is achieved using gradient descent, although on larger datasets algorithms such as stochastic gradient descent, RMSprop, Adam, etc. are used. We need a loss function that compares the predicted output with the output label. The empirical risk is the average of the loss function, plus a regularizer that penalizes the value of the parameters \theta. Mathematically, the primary idea of backpropagation is to find the rate of change of the error with respect to the weights:

\arg\min_\theta \; \frac{1}{T} \sum_t l\big(f(x^{(t)}; \theta), y^{(t)}\big) + \lambda\,\Omega(\theta)

The pre-activation of a node at layer l is the sum-product of all the weights from layer l-1 into this node; the activation is the sigmoid of this sum-product.

SGD finds the gradient and moves in the opposite direction, minimizing the loss. The net gradient is \Delta = -\nabla_\theta l(f(x^{(t)}; \theta), y^{(t)}) - \lambda\,\nabla_\theta \Omega(\theta), and the new parameter after the iteration is given by:

\theta = \theta + \alpha\,\Delta

This gradient is computed by taking the partial derivative of the negative log-likelihood loss function. Mathematically, we take the difference between f(x) and y to get the error at the output. This error is propagated back to all layers as \delta^{(1)}, \delta^{(2)}, \ldots; the deltas can be calculated as below:

\delta^{(L)} = \nabla_{f(x)}\, l \odot g'\big(z^{(L)}\big)

\delta^{(l)} = \big(W^{(l+1)}\big)^T \delta^{(l+1)} \odot g'\big(z^{(l)}\big)

where z is the pre-activation and g'(z) is the derivative of the activation function. The partial derivative with respect to each weight is then given by the activation of that layer times the delta of the layer ahead of it:

\frac{\partial l}{\partial W^{(l)}_{ij}} = \delta^{(l)}_i\, h^{(l-1)}_j
In the graphic below, the output layer is denoted by j and the hidden layer by i. We first form the squared error of the output, E = \frac{1}{2}\sum_j (t_j - y_j)^2, and take its derivative with respect to the output to see how the error changes with the output (2). From the chain rule, the derivative of the error with respect to the pre-activation z_j is the derivative of the output with respect to the pre-activation times the derivative of the error with respect to the output (1). We also need the derivative of the error with respect to the weights; this equals the activation of the layer below times the derivative of the error with respect to the pre-activation (3).

\frac{\partial E}{\partial y_j} = -(t_j - y_j) \quad (2)

\frac{\partial E}{\partial z_j} = \frac{\partial y_j}{\partial z_j}\,\frac{\partial E}{\partial y_j} = y_j (1 - y_j)\,\frac{\partial E}{\partial y_j} \quad (1)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}}\,\frac{\partial E}{\partial z_j} = y_i\,\frac{\partial E}{\partial z_j} \quad (3)

To descend into the hidden layer, the error is passed back through the weights: \frac{\partial E}{\partial y_i} = \sum_j w_{ij}\,\frac{\partial E}{\partial z_j}.

Backpropagation I

The figure below explains the same thing. We find \partial E/\partial y, i.e. \partial l/\partial f(x), as the gradient at the output f(x). We then find the gradients of the weights and biases, \partial E/\partial w, via the gradient of the loss with respect to the pre-activation, as shown above. \partial E/\partial w tells us how much a change in w affects the error; this gradient, scaled by the learning rate, is then subtracted from the corresponding weight. For the output layer:

\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}

Backpropagation II

To go into the hidden layers, we need the derivative of the same E_{total}, but with respect to the corresponding hidden-layer weight w_1. This can be split as a chain and computed:

\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}

where the first factor collects the error arriving from both output units:

\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}

Backpropagation III
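Putting the chain-rule pieces together, here is a sketch of one forward/backward pass for a one-hidden-layer network with squared error and sigmoid units. The layer sizes, learning rate, and single training example are toy assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One forward/backward pass for a 1-hidden-layer net with squared error."""
    # Forward pass
    a1 = W1 @ x + b1; h1 = sigmoid(a1)          # hidden layer
    a2 = W2 @ h1 + b2; out = sigmoid(a2)        # output layer
    # Backward pass: delta = dE/d(pre-activation), propagated layer by layer
    delta2 = (out - y) * out * (1 - out)        # output delta (eqs. 2 and 1)
    delta1 = (W2.T @ delta2) * h1 * (1 - h1)    # hidden delta
    # dE/dW = (activation below) x (delta of the layer above), eq. 3
    W2 -= lr * np.outer(delta2, h1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return 0.5 * np.sum((out - y) ** 2)

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), np.array([0.0, 1.0])
for step in range(500):
    loss = backprop_step(x, y, W1, b1, W2, b2)
print(loss)   # the squared error shrinks toward 0
```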

Training Housekeeping
For complete NN training we need the following:
1. Loss function (Prediction minus true value)
2. Procedure to compute gradients (Forward and Backward prop)
3. Regularizer (prevents overfitting)
4. Initialization method

Regularization is usually applied only to the weights and not the biases. L1 (lasso) and L2 (ridge) penalties put a limit on the size of the coefficients (here, the weights); this introduces bias but at the same time reduces variance. Following the bias-variance trade-off, we can reduce some variance in the final model at the cost of a little bias.
The biases can be initialized to 0, but the weights must be initialized to distinct non-zero values. Hyperparameter selection, such as deciding the number of hidden layers, can be done using grid search.
To find the best point, one that is neither underfitting nor overfitting, keep running iterations while noting the test and training errors, then select the optimal point at which the test error is lowest.
Tuning Parameters

On the x axis we can put anything that increases the capacity of the NN, e.g. the number of hidden units. Sometimes we do a look-ahead, where we keep running epochs beyond the optimal point to make sure the error doesn't go down again at a later stage.
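A toy simulation of this procedure; the error curves here are fabricated purely for illustration, and only the stopping-with-look-ahead logic is the point:

```python
import numpy as np

rng = np.random.default_rng(7)

def errors(epoch):
    """Simulated curves: training error keeps falling while validation
    error bottoms out and then rises again (overfitting)."""
    train = 1.0 / (1 + epoch)
    val = train + 0.0005 * max(0, epoch - 50) + 0.01 * rng.random()
    return train, val

best_val, best_epoch, patience, since_best = float("inf"), 0, 20, 0
for epoch in range(1000):
    train_err, val_err = errors(epoch)
    if val_err < best_val:                 # new optimal point
        best_val, best_epoch, since_best = val_err, epoch, 0
    else:
        since_best += 1                    # the look-ahead: keep going a while
        if since_best > patience:
            break
print(best_epoch, best_val)               # stops near the validation minimum
```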

Training Deep Networks


The difference between neural nets and deep nets is that, unlike neural nets, deep nets don't simply employ forward and backpropagation and fall victim to the vanishing gradient problem. Instead, deep nets are trained by training subsets of the network and then combining them; this idea is responsible for the name deep nets. Different kinds of algorithms are used to train these sub-networks; the ones discussed in this paper are RBMs, autoencoders, and sparse coding.

Restricted Boltzmann Machines


Restricted Boltzmann machines are feature-extraction neural nets. They modify the weights such that the input at layer 1 can be reconstructed at layer 2, and in the process latent features of the input are learnt at layer 2. In an RBM there are no connections between hidden units, or between visible units, of the same layer. These systems are also used to change the dimension of the data from X dimensions (input) to H dimensions (hidden layer).
In large systems, calculating the partition, or normalization, function is a challenge; in such a setting we use an energy function. Energy functions are just probability functions without the normalizing denominator. The equations below show how a probability function is equivalent to a normalized energy function. The energy function contains extra bias terms b and c; v is the input vector and h is the hidden vector from the hidden layer. Low energy means high probability.
p(v, h) = \frac{1}{Z}\, e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}

E(v, h) = -b^T v - c^T h - h^T W v

where E(v,h) is the energy function and Z is the normalizing constant (partition function).

An RBM is trained using Gibbs sampling, where we find P(h|v), then find P(v|h), and repeat. At every step the energy at h is evaluated. Usually only in the limit of a lot of training would we get the desired results, but using the technique of contrastive divergence this can be achieved within a few iterations.

Training an RBM

The idea here is that we find a representation (at h) of the input in the hidden layer and then compare it with the original; if it is noise and not comparable with the input, we assign a very low probability to the representation and adjust the weights and biases accordingly. An RBM always consists of an input layer and one hidden layer. These RBMs can be connected in series to form what is called a deep belief network, explained later, such that the input (v) is represented by the first hidden layer (h), and this representation (h) is taken as input by another RBM and transformed in turn. The deep belief network was introduced because training deep models using backpropagation was very hard; this was first done in 2006.
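A minimal contrastive-divergence (CD-1) sketch for a binary RBM, following the positive-phase/negative-phase update implied by the energy function above; the layer sizes, learning rate, and single training vector are toy assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b, c, rng, lr=0.1):
    """One CD-1 update: a single Gibbs step v -> h -> v' -> h' stands in
    for running the chain to infinity."""
    ph0 = sigmoid(c + W @ v0)                     # P(h=1 | v)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sample hidden units
    pv1 = sigmoid(b + W.T @ h0)                   # P(v=1 | h): reconstruction
    ph1 = sigmoid(c + W @ pv1)
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))   # positive - negative phase
    b += lr * (v0 - pv1)
    c += lr * (ph0 - ph1)
    return np.mean((v0 - pv1) ** 2)               # reconstruction error

rng = np.random.default_rng(3)
W = rng.normal(scale=0.01, size=(6, 10))          # 10 visible, 6 hidden units
b, c = np.zeros(10), np.zeros(6)
v = (rng.random(10) < 0.5) * 1.0                  # one binary training vector
for _ in range(200):
    err = cd1_step(v, W, b, c, rng)
print(err)                                        # reconstruction error falls
```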

Auto Encoders

Autoencoders are similar to RBMs in function, but unlike the two-layered RBM they are three-layered, as in the figure below. They consist of an encoder and a decoder. The encoder transforms the data x into a low-dimensional hidden layer h(x), and the hidden layer is then reconstructed into the input as x-hat. The x and x-hat are then compared, and a loss function is minimized to reduce the difference. Note that the weight matrices in the encoder and the decoder are transposes of each other.

AutoEncoders

PCA is a method that assumes a linear system, whereas an autoencoder (AE) does not. If no non-linear function is used in the AE and the number of neurons in the hidden layer is fewer than in the input layer, then PCA and the AE can yield similar results; otherwise the AE may find a different subspace.
An undercomplete autoencoder is one that has fewer hidden units than input units; an overcomplete autoencoder, on the other hand, has more hidden units than inputs. Overcomplete AEs have a tendency to simply copy the input units into the hidden layer, because h > x and h is a representation of x; in that case some h units are left at 0, which is not a good idea. To overcome this, noise can be added to the input such that x becomes x~, and this x~ becomes the input to the AE, as in the graphic below. Note, however, that the reconstruction is still compared against the original x, not x~ (compare with the RBM). This allows the AE to find features from the noisy input (x~) and then compare the reconstruction with the original input (x), while making sure the hidden units are not simply copies of the input layer. This is called a denoising AE.

Denoising autoencoders take a partially corrupted input whilst training to recover the
original undistorted input. This technique has been introduced with a specific approach
to good representation. A good representation is one that
can be obtained robustly from a corrupted input and that will
be useful for recovering the corresponding clean input. This
definition contains the following implicit assumptions:
1. The higher level representations are relatively stable
and robust to the corruption of the input;
2. It is necessary to extract features that are useful for
representation of the input distribution.
Contractive AEs (CAEs), on the other hand, allow us to extract those features that only reflect variations observed in the training set. A CAE achieves this by penalizing the Frobenius norm of the Jacobian of the hidden representation.
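A sketch of one denoising-autoencoder update with tied weights, as described above; the layer sizes, corruption level, learning rate, and the tied-weight gradient derivation are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_ae_step(x, W, b, c, rng, lr=0.5, noise=0.3):
    """Corrupt x into x~, encode and decode x~, but compare the reconstruction
    with the ORIGINAL x. Decoder weights are the transpose of encoder weights."""
    x_tilde = x * (rng.random(x.shape) > noise)     # knock out ~30% of the inputs
    h = sigmoid(W @ x_tilde + b)                    # encoder -> hidden code h(x~)
    x_hat = sigmoid(W.T @ h + c)                    # decoder -> reconstruction
    delta = (x_hat - x) * x_hat * (1 - x_hat)       # gradient at the output layer
    dh = (W @ delta) * h * (1 - h)                  # backpropagated into the code
    W -= lr * (np.outer(dh, x_tilde) + np.outer(h, delta))  # tied-weight gradient
    b -= lr * dh
    c -= lr * delta
    return np.mean((x_hat - x) ** 2)

rng = np.random.default_rng(4)
W = rng.normal(scale=0.1, size=(5, 12))             # 12 inputs, 5 hidden units
b, c = np.zeros(5), np.zeros(12)
x = (rng.random(12) < 0.5) * 1.0
for _ in range(500):
    err = denoising_ae_step(x, W, b, c, rng)
print(err)                                          # reconstruction error falls
```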

Sparse Coding

Sparse coding is another method of feature extraction; the difference here is that the extracted features are sparse, that is, the representation has only a few non-zero elements.

Sparse AutoEncoders

The advantage of having an overcomplete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an overcomplete basis, the coefficients are no longer uniquely determined by the input vector x. Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the degeneracy introduced by overcompleteness.
Sparsity is another way of allowing for an overcomplete representation that doesn't learn a trivial mapping.

Deep Belief Network


The methods described above are used to construct and train a deep belief network. Training the stack layer by layer with an unsupervised criterion is called unsupervised pre-training; it acts not only as regularization but also helps prevent underfitting of deep models. The resulting network is then run through a round of backprop-style fine-tuning to optimize the weights.
A greedy layer-wise procedure is used:
1. Train one layer at a time, from first to last, with an unsupervised criterion
2. Fix the parameters of the previous hidden layers
3. Treat the previous layers as feature extraction

Training a Deep Belief Network

Once all the layers are trained, we connect the output layer and then train the entire system one last time using backpropagation; the weights, however, do not change much this time. This process is called fine-tuning. The idea is that we can use a major portion of the dataset in an unsupervised way to learn the weights, and only a small portion of labeled data in a supervised way to complete the fine-tuning phase. The advantage of this algorithm is evident in scenarios where we want to do supervised learning but have only a small amount of labeled data at hand, a scenario quite common in the big data era.
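The greedy procedure can be sketched end to end by stacking the tiny CD-1 trainer from the RBM section; the layer sizes, epoch counts, and toy data are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Tiny CD-1 trainer (see the RBM section); returns weights and hidden bias."""
    W = rng.normal(scale=0.01, size=(n_hidden, data.shape[1]))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(epochs):
        for v in data:
            ph0 = sigmoid(c + W @ v)
            h0 = (rng.random(n_hidden) < ph0) * 1.0
            pv1 = sigmoid(b + W.T @ h0)
            ph1 = sigmoid(c + W @ pv1)
            W += lr * (np.outer(ph0, v) - np.outer(ph1, pv1))
            b += lr * (v - pv1); c += lr * (ph0 - ph1)
    return W, c

rng = np.random.default_rng(8)
data = (rng.random((20, 16)) < 0.5) * 1.0        # toy binary dataset
stack, layer_input = [], data
for n_hidden in (8, 4):                          # greedy: first layer, then next
    W, c = train_rbm(layer_input, n_hidden, rng) # unsupervised criterion
    stack.append((W, c))                         # fix this layer's parameters
    layer_input = sigmoid(layer_input @ W.T + c) # its features feed the next layer
# A supervised output layer would now be attached and the whole stack fine-tuned.
print([w.shape for w, _ in stack])
```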
The non-top layers, taken together, are also called a sigmoid belief network (SBN): each unit is conditioned on the layer above through a sigmoid, for example

p(h^{(1)}_i = 1 \mid h^{(2)}) = \mathrm{sigm}\big(b^{(1)}_i + W^{(2)}_{i\cdot}\, h^{(2)}\big), \qquad p(x_i = 1 \mid h^{(1)}) = \mathrm{sigm}\big(b^{(0)}_i + W^{(1)}_{i\cdot}\, h^{(1)}\big)

A DBN is not a feed-forward network. The full distribution of a DBN with three hidden layers is as follows:

p(x, h^{(1)}, h^{(2)}, h^{(3)}) = p(h^{(2)}, h^{(3)})\; p(h^{(1)} \mid h^{(2)})\; p(x \mid h^{(1)})

where the top two layers p(h^{(2)}, h^{(3)}) form an RBM and the conditionals factorize over units:

p(h^{(1)} \mid h^{(2)}) = \prod_i p(h^{(1)}_i \mid h^{(2)}), \qquad p(x \mid h^{(1)}) = \prod_i p(x_i \mid h^{(1)})

The algorithm can be improved further using tweaks and techniques such as dropout training and DropConnect.

Dropout training
The key idea in this technique is to randomly drop units, marked in black in the figure (along with their connections), from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different thinned networks. At test time, it is easy to approximate the effect of averaging the predictions (in an ensemble way) of all these thinned networks by simply using a single unthinned network with smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. Nodes are randomly set to zero and thereby disconnected; the outcome is that the other nodes cannot collaborate with, or count on the presence of, any particular node, so nodes are forced to extract general features.
At testing time, after training has finished, we would ideally like to find a sample average over all 2^n possible dropped-out networks; unfortunately, this is unfeasible for large values of n. However, we can find an approximation by using the full network with each node's output weighted by its keep probability, so that the expected value of the output of any node is the same as in the training stage. This is the biggest contribution of the dropout method: although it effectively generates 2^n neural nets, and as such allows for model combination, at test time only a single network needs to be evaluated.

Dropout Training
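A sketch of a dropout layer. Note this shows the commonly used "inverted" variant, which rescales the surviving units during training instead of multiplying the weights by the keep probability at test time; the expected output is the same either way.

```python
import numpy as np

def dropout_forward(h, p, rng, train=True):
    """Inverted dropout: during training, zero each unit with probability p and
    rescale the survivors; at test time the full, unthinned network is used."""
    if not train:
        return h                                   # single unthinned network
    mask = (rng.random(h.shape) >= p) / (1 - p)    # drop and rescale
    return h * mask

rng = np.random.default_rng(5)
h = np.ones(10)
print(dropout_forward(h, p=0.5, rng=rng))                 # training: thinned
print(dropout_forward(h, p=0.5, rng=rng, train=False))    # testing: unchanged
```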

DropConnect
It is a generalization of Dropout in which each connection, rather than each output unit,
can be dropped with probability (1-p). Each unit thus receives input from a random
subset of units in the previous layer. DropConnect is similar to Dropout as it introduces
dynamic sparsity within the model, but differs in that the sparsity is on the weights, rather
than the output vectors of a layer. In other words, the fully connected layer with
DropConnect becomes a sparsely connected layer in which the connections are chosen
at random during the training stage.
We would want to geometrically average the outputs of all 2^n trained models, but since that is not possible when n is large, we instead multiply the weights by the expectation of the masks (i.e. p).
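A matching sketch for DropConnect, masking individual weights rather than unit outputs; the dimensions are toy assumptions:

```python
import numpy as np

def dropconnect_forward(x, W, p, rng):
    """DropConnect: drop individual weights with probability 1-p, so each unit
    sees a random subset of the previous layer during training."""
    mask = rng.random(W.shape) < p    # keep each connection with probability p
    return (W * mask) @ x

# At test time, W * p would replace the sampled mask, approximating the
# average over all masked networks as described above.
rng = np.random.default_rng(6)
W, x = rng.normal(size=(4, 6)), rng.normal(size=6)
print(dropconnect_forward(x, W, p=0.5, rng=rng))
```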

Conclusion

This paper has described how neural networks evolved into deep neural networks, or deep belief networks, by overcoming problems such as training time, vanishing gradients, and computational expense. DBNs are gaining momentum in the technology sector, with applications in areas such as speech recognition, text processing, information retrieval, and artificial intelligence. Many observers claim that the future belongs to deep learning.
Hardware improvements have also played a major role in the wide acceptance of deep networks. Graphics processing units (GPUs), which handle the massively parallel arithmetic of training far better than CPUs, are being utilized to train deep networks; in some cases, the training time has been known to fall to as little as 1/250th of the time spent training on CPUs of comparable configuration. In addition, distributed computing systems such as Hadoop and Spark have allowed for parallel computation, improving performance even further. Several programming interfaces leveraging these hardware improvements, such as CUDA, TensorFlow, Theano, and Deeplearning4j, have been developed to optimize the algorithms on all fronts. Bearing in mind the ongoing enthusiasm and research in the area, it wouldn't be hard to say that deep learning is the future of machine learning.

