
Introduction to Neural Networks

Presented By
Sal Saad Al Deen Taher
ID: 20ME128
Neuron
• A typical brain contains something
like 100 billion minuscule cells
called neurons.
• Cell body: Central mass of the cell
• Dendrites: Numerous connections
coming off Cell Body
• Axon: Cell’s Single Output
• A neuron can be compared to a transistor, but transistors are built into serial chains of
logic gates, whereas neurons are inter-connected by approximately 10^14 - 10^15 synapses.
Neuron: Mathematical Model
• Three things are happening here:
• First, each input is multiplied by a weight: xi → xi ∗ wi
• Next, all the weighted inputs are added together with a bias b: (x1∗w1) + (x2∗w2) + (x3∗w3) + b
• Finally, the sum is passed through an activation function: y = f((x1∗w1) + (x2∗w2) + (x3∗w3) + b)
• The activation function is used to turn an unbounded input into an output that has a nice, predictable form.
• This process of passing inputs forward to get an output is known as feedforward or forward propagating.

[Figure: Mathematical model of a 3-input neuron]
A Simple Example:
Assume we have a 3-input neuron that uses the sigmoid activation function and has
the following parameters:
w = [0, 1, 1] and b = 4, where w is the weight vector (i.e. w0 = 0, w1 = 1, w2 = 1)
and b is the bias.
Now, let's give the neuron an input of x = [2, 3, 4]. We'll use the dot product to write
things more concisely:
(w⋅x) + b = ((w0∗x0) + (w1∗x1) + (w2∗x2)) + b
= (0∗2 + 1∗3 + 1∗4) + 4 = 11
Passing the sum through the sigmoid function,
The output, y = f((w⋅x)+b) = f(11) = 0.9999832986


This process of passing inputs forward to get an output is known as feedforward or
forward propagating.
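A minimal NumPy sketch of this single neuron (the Neuron class and helper names are illustrative, not from the slides):

import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weighted sum of the inputs plus the bias, passed through the activation
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

n = Neuron(np.array([0, 1, 1]), 4)         # w = [0, 1, 1], b = 4
print(n.feedforward(np.array([2, 3, 4])))  # ~0.9999832986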
Activation Functions: ReLU is a good default
choice for most problems
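The activation-function plot from this slide is not reproduced here; as a small illustrative sketch, the usual candidates can be written in NumPy as:

import numpy as np

def sigmoid(x):
    # Maps any real input to (0, 1); saturates (flat gradient) for large |x|
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Zero-centered variant of the sigmoid, output in (-1, 1)
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit: cheap to compute and does not saturate for x > 0,
    # which is why it is a good default choice for hidden layers
    return np.maximum(0, x)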
Neural Networks
• The input layer is not counted when naming the network
• Alternatively called “Artificial Neural Network
(ANN)” or “Multi-Layer Perceptrons (MLP)”
• Output-layer neurons most commonly don't have
activation functions; they represent the class scores
• The 2-Layer Neural Net has 4 + 2 = 6 neurons (not
counting the inputs)
• [3 x 4] + [4 x 2] = 20 weights and
• 4 + 2 = 6 biases, for a total of (20+6)=26 learnable
parameters.

[Figure: “2-Layer Neural Net”, or “1-hidden-layer Neural Net”, with “Fully-connected” layers]
Neural Networks
• The second network (right) has 4 + 4 +
1 = 9 neurons
• [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 =
32 weights and 4 + 4 + 1 = 9 biases
• For a total of 41 learnable parameters
• Modern Convolutional Networks
contain on the order of 100 million
parameters and are usually made up
of approximately 10-20 layers (hence
deep learning).
• The number of effective connections
is significantly greater due to
parameter sharing.

[Figure: “3-Layer Neural Net”, or “2-hidden-layer Neural Net”, with “Fully-connected” layers]
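A short sketch (illustrative, not from the slides) that reproduces the parameter counts above for fully-connected networks described by a list of layer sizes:

def count_parameters(layer_sizes):
    # Weights: one [n_in x n_out] matrix per pair of consecutive layers
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Biases: one per neuron, not counting the input layer
    biases = sum(layer_sizes[1:])
    return weights, biases

print(count_parameters([3, 4, 2]))     # (20, 6)  -> 26 learnable parameters
print(count_parameters([3, 4, 4, 1]))  # (32, 9)  -> 41 learnable parameters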
Example of Feedforward:
Let's use the network pictured on the right and
assume all neurons have the same weights w = [0, 1], the same bias b = 0, and the same
sigmoid activation function. Let h1, h2, o1
denote the outputs of the neurons they represent.
What happens if we pass in the input x = [2, 3]?
h1 = h2 = f(w⋅x + b) = f((0∗2) + (1∗3) + 0)
= f(3) = 0.9526
o1 = f(w⋅[h1, h2] + b)
= f((0∗h1) + (1∗h2) + 0)
= f(0.9526) = 0.7216
The output of the neural network for input x = [2, 3] is 0.7216.
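A minimal NumPy sketch of this 2-input, 1-hidden-layer network (assuming, as above, that every neuron shares w = [0, 1] and b = 0; the helper names are illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([0, 1])
b = 0

def neuron(inputs):
    # Same weights, bias, and activation for every neuron in this toy network
    return sigmoid(np.dot(w, inputs) + b)

x = np.array([2, 3])
h1 = neuron(x)
h2 = neuron(x)
o1 = neuron(np.array([h1, h2]))
print(h1, h2, o1)   # ~0.9526, ~0.9526, ~0.7216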
Setting Number Of Layers And Their Sizes

Neural networks with more neurons can express more complicated functions, but the extra capacity also makes overfitting more likely.
The model with 20 hidden neurons fits all the training data but at the cost of segmenting the space
into many disjoint red and green decision regions. The model with 3 hidden neurons only has the
representational power to classify the data in broad strokes.
Setting Number Of Layers And Their Sizes

The effects of regularization strength: each neural network above has 20 hidden neurons, but
increasing the regularization strength makes the final decision regions smoother. You can play
with these examples at
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Data Preprocessing

Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data
is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around
the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines
indicate the extent of the data; they are of unequal length in the middle, but of equal length on the
right.
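A minimal NumPy sketch of this preprocessing pipeline, assuming X is an [N x D] data matrix with one example per row (the toy data here is made up for illustration):

import numpy as np

X = np.random.randn(100, 2) * [3.0, 0.5] + [2.0, -1.0]    # toy 2-D data cloud

X_centered = X - np.mean(X, axis=0)                       # zero-center each dimension
X_normalized = X_centered / np.std(X_centered, axis=0)    # scale each dimension by its std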
Weight Initialization
Zero Initialization
• If all the weights are initialized to 0, the
derivative of the loss function with respect
to every w in the weight matrix is the same.
• All weights therefore have the same value in
subsequent iterations.
• Setting the weights to 0 makes the network no
better than a linear model.
• Let us consider a neural network with only
three hidden layers, using the ReLU activation
function in the hidden layers and sigmoid for
the output layer.
Using this network on the “make_circles” dataset from sklearn.datasets, the result obtained is the following:
for 15000 iterations, loss = 0.6931471805599453, accuracy = 50 %
Weight Initialization
Random Initialization
• Assigning random values to the weights is
better than just assigning 0.
• If the weights are initialized with very high
values, the term np.dot(W, X) + b becomes
significantly larger, and if an activation
function like sigmoid() is applied, the
function maps its value near 1, where the
slope of the gradient changes slowly and
learning takes a lot of time.
• If the weights are initialized with very low
values, the activations get mapped near 0,
where the case is the same as above.
• This problem is often referred to as the
vanishing gradient.
Here the weights are initialized with very large values instead of 0. The neural network is the
same as earlier; using this initialization on the “make_circles” dataset from sklearn.datasets,
the result obtained is the following:
for 15000 iterations, loss = 0.38278397192120406, accuracy = 86 %
Weight Initialization
He Initialization
• He et al. (2015) proposed an activation-aware
initialization of the weights (for ReLU) that
resolves the problems of long training time
and vanishing/exploding gradients.
• The random initialization is simply multiplied
by sqrt(2 / size of the previous layer).
Using this initialization on the “make_circles” dataset from sklearn.datasets, the result
obtained is the following:
for 15000 iterations, loss = 0.07357895962677366, accuracy = 96 %
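The three initialization schemes compared on these slides could be sketched as follows for a layer with n_in inputs and n_out outputs (the factor of 10 for the "large random" case is an arbitrary illustrative choice):

import numpy as np

def zero_init(n_in, n_out):
    # All weights equal: every neuron computes the same thing and stays identical
    return np.zeros((n_out, n_in))

def large_random_init(n_in, n_out):
    # Large random weights push sigmoid/tanh units into their flat, saturated
    # regions, so gradients vanish and learning is slow
    return np.random.randn(n_out, n_in) * 10

def he_init(n_in, n_out):
    # He et al. (2015): scale standard-normal weights by sqrt(2 / n_in),
    # keeping the variance of activations roughly constant for ReLU layers
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)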
Batch Normalization
• Batch normalization is a supervised learning technique that converts the interlayer outputs
of a neural network into a standard format; this is called normalizing.
• It effectively 'resets' the distribution of the output of the previous layer so it can be
processed more efficiently by the subsequent layer.
• Batch normalization reduces the amount by
which the hidden unit values shift around
(covariate shift).
• For example, if a neural network is trained
only on black cats, it won't perform well on
colored cats.
• If an algorithm learned some X to Y mapping,
and the distribution of X changes, then we
might need to retrain the learning algorithm
so that it matches the new distribution of X.
Batch Normalization
• To increase the stability of a neural network, batch
normalization normalizes the output of a previous
activation layer by subtracting the batch mean and
dividing by the batch standard deviation.
• Since this shifting and scaling of the outputs
by randomly initialized parameters can reduce
the accuracy of the weights in the next layer,
stochastic gradient descent is applied to undo
this normalization when the loss is too high.
• Consequently, batch normalization adds two
trainable parameters to each layer: the normalized output is multiplied by a “standard
deviation” parameter (gamma) and a “mean” parameter (beta) is added. In other words,
batch normalization lets SGD do the denormalization by changing only these two weights
for each activation, instead of losing the stability of the network by changing all the
weights.
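A minimal NumPy sketch of the training-time batch normalization forward pass described above (running averages for test time are omitted):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: [N x D] batch of layer outputs; gamma, beta: trainable [D] parameters
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift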
Regularization
• Regularization is the process of adding information in order to solve an ill-posed problem or
to prevent overfitting.
• The regularization parameter (lambda) penalizes all the parameters except the intercept so that
the model generalizes to the data and doesn't overfit.
• Two Types of Regularization-
1. L2 Regularization
A regression model that uses the L2 regularization technique is called Ridge Regression, which
adds the “squared magnitude” of the coefficients as a penalty term to the loss function; this
penalty term is the L2 regularization element.
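For reference, the ridge loss that this slide highlights is usually written as:

Loss = Σ_i (y_i − Σ_j x_ij β_j)^2 + λ Σ_j β_j^2

where the last term, λ Σ_j β_j^2, is the L2 penalty.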
Here, if lambda is zero we get back OLS. However,
if lambda is very large it will add too much weight
and lead to under-fitting. That said, how lambda is
chosen is important.
This technique works very well to avoid over-fitting.
Regularization
2. L1 Regularization
A regression model that uses the L1 regularization technique is called Lasso (Least Absolute
Shrinkage and Selection Operator) Regression, which adds the “absolute value of magnitude” of
the coefficients as a penalty term to the loss function; this penalty term is the L1
regularization element.
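For reference, the lasso loss that this slide highlights is usually written as:

Loss = Σ_i (y_i − Σ_j x_ij β_j)^2 + λ Σ_j |β_j|

where the last term, λ Σ_j |β_j|, is the L1 penalty.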
Again, if lambda is zero we get back OLS, whereas a
very large value will make the coefficients zero and
hence under-fit.
The key difference between these techniques is that Lasso shrinks the less important
features' coefficients to zero, thus removing some features altogether. So this works well for
feature selection when we have a huge number of features.
Dropout

• The key idea is to randomly drop units (along with their connections) from the neural
network during training. This prevents units from co-adapting too much.
• During training, dropout samples from an exponential number of different “thinned”
networks.
• At test time, it is easy to approximate
the effect of averaging the
predictions of all these thinned
networks by simply using a single
unthinned network that has smaller
weights.
• This significantly reduces overfitting and
gives major improvements over other
regularization methods.
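A minimal NumPy sketch of dropout at training time; note this uses the "inverted dropout" variant, which scales the kept activations during training so that no weight rescaling is needed at test time (the slide describes the original formulation, where the unthinned test network uses smaller weights):

import numpy as np

def dropout_forward(h, p_keep=0.5, training=True):
    # h: activations of a layer; p_keep: probability of keeping each unit
    if not training:
        # At test time the full, unthinned network is used as-is
        return h
    # Randomly zero out units; dividing by p_keep keeps the expected activation unchanged
    mask = (np.random.rand(*h.shape) < p_keep) / p_keep
    return h * mask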
References
• https://cs231n.github.io/neural-networks-1/
• https://cs231n.github.io/neural-networks-2/
• https://victorzhou.com/blog/intro-to-neural-networks/
• https://arxiv.org/abs/1502.03167
• http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
