
UNIT-II

Activation functions in Neural Networks

Elements of a Neural Network:-


Input Layer :- This layer accepts the input features. It brings information from the outside world
into the network. No computation is performed at this layer; its nodes simply pass the
information (features) on to the hidden layer.

Hidden Layer :- Nodes of this layer are not exposed to the outside world; they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of computation on
the features entered through the input layer and transfers the result to the output layer.

Output Layer :- This layer brings the information learned by the network to the outside world.

What is an activation function and why use one?

Definition of activation function:- An activation function decides whether a neuron should be
activated or not, based on the weighted sum of its inputs plus a bias. The purpose of the
activation function is to introduce non-linearity into the output of a neuron.

Explanation:-
We know that a neural network has neurons that work according to their weights, biases and
respective activation functions. In a neural network, we update the weights and biases of
the neurons on the basis of the error at the output. This process is known as back-propagation.
Activation functions make back-propagation possible, since their gradients are supplied along
with the error to update the weights and biases.

Why do we need Non-linear activation functions :-


A neural network without an activation function is essentially just a linear regression model.
The activation function applies a non-linear transformation to the input, making the network
capable of learning and performing more complex tasks.

Mathematical proof :-

Suppose we have a Neural net like this :-


Elements of the diagram :-
Hidden layer i.e. layer 1 :-

z(1) = W(1)X + b(1)
a(1) = z(1)

Here,

• z(1) is the vectorized output of layer 1
• W(1) is the vectorized weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
• X is the vectorized input features, i.e. i1 and i2
• b(1) is the vectorized bias assigned to the neurons in the hidden layer, i.e. b1 and b2
• a(1) is the vectorized form of any linear function.

(Note: We are not considering activation function here)

Layer 2 i.e. output layer :-

// Note : Input for layer 2 is output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)

Calculation at Output layer:

// Putting value of z(1) here


z(2) = (W(2) * [W(1)X + b(1)]) + b(2)

z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]

Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b

Final output : z(2) = W*X + b


Which is again a linear function

This observation shows that the result is again a linear function even after adding a hidden layer.
Hence we can conclude that no matter how many hidden layers we attach to a neural net, all the layers
together behave like a single linear layer, because the composition of two linear functions is itself
a linear function. A network built only from linear functions cannot learn anything beyond a linear
mapping. A non-linear activation function lets it learn more complex mappings from the error.
Hence we need an activation function.
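
The derivation above can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated weights (the sizes, the seed and the variable names are chosen only for illustration): it builds the two linear layers, then the single collapsed layer with W = W(2)W(1) and b = W(2)b(1) + b(2), and compares the two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 input features, 2 hidden neurons, 1 output neuron.
X = rng.normal(size=(2, 5))    # 5 example inputs, one per column
W1 = rng.normal(size=(2, 2))   # hidden-layer weights W(1)
b1 = rng.normal(size=(2, 1))   # hidden-layer bias b(1)
W2 = rng.normal(size=(1, 2))   # output-layer weights W(2)
b2 = rng.normal(size=(1, 1))   # output-layer bias b(2)

# Two layers with no activation function: a(1) = z(1), a(2) = z(2).
z1 = W1 @ X + b1
z2 = W2 @ z1 + b2

# The equivalent single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
z_single = W @ X + b

# True when both computations agree up to floating-point rounding.
layers_collapse = np.allclose(z2, z_single)
print(layers_collapse)
```

The two results agree up to rounding, which is the numerical counterpart of the statement that stacking linear layers adds no expressive power.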

VARIANTS OF ACTIVATION FUNCTION :-

1). Linear Function :-

• Equation : A linear function has the equation of a straight line, i.e. y = ax
• No matter how many layers we have, if all are linear in nature, the final activation
of the last layer is nothing but a linear function of the input of the first layer.
• Range : -inf to +inf
• Uses : The linear activation function is used in just one place, i.e. the output layer.
• Issues : If we differentiate a linear function, the result no longer depends on the input "x":
the derivative is the constant a. The gradient is therefore the same for every input, so the
function cannot introduce any non-linear behavior to our algorithm.

For example : Calculating the price of a house is a regression problem. A house price may take any
large or small value, so we can apply a linear activation at the output layer. Even in this case the
neural net must have some non-linear function in the hidden layers.

2). Sigmoid Function :-

• It is a function whose graph is 'S' shaped.
• Equation : A = 1/(1 + e^(-x))
• Nature : Non-linear. Notice that for X values between -2 and 2, the curve is very steep.
This means that small changes in x bring about large changes in the value of Y.
• Value Range : 0 to 1
• Uses : Usually used in the output layer of a binary classifier, where the result is either 0 or 1.
Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted
to be 1 if the value is greater than 0.5 and 0 otherwise.
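
As an illustration of the sigmoid equation and the 0.5 decision rule described above, here is a minimal sketch assuming NumPy. The helper names `sigmoid` and `predict_class` and the sample scores are hypothetical, chosen only for this example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_class(x, threshold=0.5):
    """Return 1 where sigmoid(x) is greater than the threshold, else 0."""
    return (sigmoid(x) > threshold).astype(int)

scores = np.array([-3.0, 0.0, 3.0])   # hypothetical pre-activation values
probs = sigmoid(scores)               # every value lies strictly between 0 and 1
labels = predict_class(scores)        # only the last score maps to class 1
```

Note that a score of exactly 0 gives a sigmoid value of exactly 0.5, which is not greater than 0.5, so it is assigned to class 0 under this rule.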

3). Tanh Function :- An activation that almost always works better than the sigmoid function in
hidden layers is the Tanh function, also known as the hyperbolic tangent function. It is actually a
mathematically scaled and shifted version of the sigmoid function. Both are similar and can be
derived from each other.

Equation :-
f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1
OR
tanh(x) = 2 * sigmoid(2x) - 1

• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in the hidden layers of a neural network. Its values lie between -1 and
1, so the mean of the hidden layer's outputs comes out to be 0 or very close to it. This helps in
centering the data by bringing the mean close to 0, which makes learning for the next layer
much easier.
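
The relation tanh(x) = 2 * sigmoid(2x) - 1 and the zero-centering property can be verified numerically. This is a small sketch assuming NumPy, with an arbitrary symmetric grid of inputs chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)   # symmetric inputs: -4, -3, ..., 3, 4

# Rebuild tanh from the sigmoid and compare with NumPy's own tanh.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
identity_holds = np.allclose(tanh_from_sigmoid, np.tanh(x))

mean_tanh = np.tanh(x).mean()      # centered around 0 for symmetric inputs
mean_sigmoid = sigmoid(x).mean()   # centered around 0.5 for the same inputs
```

For inputs that are symmetric around 0, the tanh outputs average to 0 while the sigmoid outputs average to 0.5, which is the centering effect mentioned under "Uses".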

4). RELU :- Stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of a neural network.

• Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At any time only a few neurons are activated, which makes the
network sparse and therefore efficient and easy to compute.

In simple words, networks using ReLU usually learn much faster than those using the sigmoid or Tanh function.
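
The ReLU equation and the sparsity it produces can be sketched as follows, assuming NumPy. The pre-activation values are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

pre_activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # hypothetical values
activations = relu(pre_activations)

# Fraction of neurons whose output is exactly zero (the "inactive" ones).
fraction_inactive = np.mean(activations == 0.0)
```

Here three of the five values are clipped to zero, so only two neurons pass a signal on. This is the sparsity referred to above.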

5). Softmax Function :- The softmax function is a generalization of the sigmoid function to
multiple classes and is handy when we are trying to handle classification problems.

• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax function
squeezes the output for each class between 0 and 1 and also divides by the sum of
the outputs, so that the outputs add up to 1.
• Output :- The softmax function is ideally used in the output layer of the classifier, where
we are actually trying to obtain the probabilities that define the class of each input.
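
A minimal softmax sketch, assuming NumPy, is given below. Subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the description above; it does not change the result. The logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Exponentiate each score and divide by the sum of the exponentials."""
    shifted = logits - np.max(logits)   # stability trick, an addition to the text
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])        # hypothetical raw scores for 3 classes
probs = softmax(logits)                   # each value in (0, 1), summing to 1
predicted_class = int(np.argmax(probs))   # index of the most probable class
```

The class with the largest raw score also receives the largest probability, so taking the argmax of the softmax output gives the predicted class.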

Weights and bias


This is an example neural network with 2 hidden layers and an input and output layer. Each
synapse has a weight associated with it.

Weights are the coefficients of the equation which you are trying to solve. Negative weights
reduce the value of an output.

When a neural network is trained on the training set, it is initialized with a set of weights. These
weights are then optimized during the training period and the optimum weights are produced.

A neuron first computes the weighted sum of the inputs.


As an instance, if the inputs are:

And the weights are:

Then a weighted sum is computed as:

Subsequently, a bias (constant) is added to the weighted sum

Finally, the computed value is fed into the activation function, which then prepares an output.

Think of the activation function as a mathematical function that can normalize the inputs.

What Is Bias?

Bias is simply a constant value (or a constant vector) that is added to the product of inputs and
weights. Bias is utilised to offset the result.

The bias is used to shift the result of activation function towards the positive or negative side.
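
The sequence described in this section (weighted sum, then bias, then activation) can be sketched for a single neuron as follows. NumPy is assumed, the sigmoid is used as the activation only as an example, and all input, weight and bias values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([1.0, 2.0, 3.0])      # hypothetical inputs
weights = np.array([0.5, -1.0, 0.25])   # hypothetical weights, one per input
bias = 0.75                             # hypothetical bias

weighted_sum = np.dot(inputs, weights)  # 0.5 - 2.0 + 0.75 = -0.75
z = weighted_sum + bias                 # the bias shifts the sum to 0.0
output = sigmoid(z)                     # sigmoid(0) = 0.5
```

Without the bias the neuron would output sigmoid(-0.75), which is below 0.5. The bias shifts the input of the activation function, and with it the output, towards the positive side.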

Introduction to Loss Functions


Just as teachers tell us whether or not we are performing well in our academics, loss
functions do the same for a model. A loss function is a method of evaluating how well our algorithm
models the data. Loss functions are the main source of evaluation in modern machine learning. When you
change your algorithm in order to improve your model, the loss function value will tell you
whether you are making progress or not. Our primary goal should be to reduce the loss
function by optimization.

The loss function is the bread and butter of modern machine learning; it takes your algorithm
from theoretical to practical and transforms neural networks from glorified matrix
multiplication into deep learning.

At its core, a loss function is incredibly simple: it’s a method of evaluating how well your
algorithm models your dataset. If your predictions are totally off, your loss function will output
a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of
your algorithm to try and improve your model, your loss function will tell you if you’re getting
anywhere.

In fact, we can design our own (very) basic loss function to further explain how it works. For
each prediction that we make, our loss function will simply measure the absolute difference
between our prediction and the actual value. In mathematical notation, it might look something
like abs(y_predicted - y). Here’s what some situations might look like if we were trying to
predict how expensive the rent is in some NYC apartments:

Notice how in the loss function we defined, it doesn’t matter if our predictions were too high or
too low. All that matters is how incorrect we were, directionally agnostic. This is not a feature
of all loss functions: in fact, your loss function will vary significantly based on the domain and
unique context of the problem that you’re applying machine learning to. In your project, it may
be much worse to guess too high than to guess too low, and the loss function you select must
reflect that.
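
The basic loss function described above can be written in a few lines. The rent figures below are hypothetical and are not taken from the original example. The second function, `asymmetric_error`, is an invented illustration of a loss that treats overestimates as worse than underestimates, and its `over_penalty` parameter is an assumption made for this sketch.

```python
def absolute_error(y_predicted, y):
    """Directionally agnostic loss: only the size of the miss matters."""
    return abs(y_predicted - y)

def asymmetric_error(y_predicted, y, over_penalty=2.0):
    """Hypothetical variant: guessing too high costs over_penalty times more."""
    diff = y_predicted - y
    return over_penalty * diff if diff > 0 else -diff

# (predicted rent, actual rent) in dollars per month, hypothetical values.
examples = [(2000, 2500), (3000, 2800), (1500, 1500)]

losses = [absolute_error(p, a) for p, a in examples]
mean_loss = sum(losses) / len(losses)

too_high = asymmetric_error(3000, 2800)   # overestimate by 200, penalized twice
too_low = asymmetric_error(2000, 2500)    # underestimate by 500, not penalized extra
```

With the plain absolute error, a miss of 200 counts the same in either direction. With the asymmetric variant, the same overestimate costs twice as much.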

Gradient Descent

Gradient descent is an optimization algorithm used to find the values of the parameters
(coefficients) of a function (f) that minimize a cost function (cost).

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using
linear algebra) and must be searched for by an optimization algorithm.

Gradient Descent is an optimization algorithm used for minimizing the cost function in various
machine learning algorithms. It is basically used for updating the parameters of the learning
model.
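
A minimal sketch of the update rule, in plain Python: the cost function (w - 3)^2, the starting value and the learning rate are all hypothetical choices made for illustration. Each step moves the parameter against the gradient of the cost.

```python
def cost(w):
    return (w - 3.0) ** 2

def cost_gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0                # hypothetical starting value
learning_rate = 0.1    # hypothetical step size

for _ in range(100):
    w -= learning_rate * cost_gradient(w)   # step downhill
```

After the loop, w is very close to 3, the value that minimizes this cost. With this learning rate the distance to the minimum shrinks by a factor of 0.8 at every step.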

Types of gradient Descent:

1. Batch Gradient Descent: This is a type of gradient descent which processes all the
training examples in each iteration of gradient descent. If the number of training
examples is large, batch gradient descent is computationally very expensive and is
therefore not preferred.
Instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.
2. Stochastic Gradient Descent: This is a type of gradient descent which processes 1
training example per iteration. Hence, the parameters are updated after every
iteration, in which only a single example has been processed. Each iteration is therefore
much faster than in batch gradient descent. But when the number of training examples is large,
processing only one example at a time can add overhead for the
system, as the number of iterations will be quite large.
3. Mini Batch gradient descent: This is a type of gradient descent which often works faster than
both batch gradient descent and stochastic gradient descent. Here b examples, where
b < m and m is the total number of training examples, are processed per iteration. So even if
the number of training examples is large, it is
processed in batches of b training examples in one go. Thus, it works for larger training
sets, and with fewer iterations than stochastic gradient descent.
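
The three variants differ only in how many examples contribute to each update, so a single function with a `batch_size` parameter can express all of them. The sketch below assumes NumPy and a toy noiseless dataset y = 2x. The function name, the hyperparameters and the data are hypothetical.

```python
import numpy as np

def gradient_descent(x, y, batch_size, epochs=200, learning_rate=0.05, seed=0):
    """Fit y = w * x by minimizing the mean squared error.

    batch_size = len(x) gives batch gradient descent, batch_size = 1 gives
    stochastic gradient descent, and values in between give mini-batch.
    """
    rng = np.random.default_rng(seed)
    w = 0.0
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = w * x[idx] - y[idx]
            grad = 2.0 * np.mean(error * x[idx])   # d/dw of the batch's mean squared error
            w -= learning_rate * grad
    return w

x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x                                        # noiseless data, true w = 2

w_batch = gradient_descent(x, y, batch_size=len(x))   # 1 update per epoch
w_sgd = gradient_descent(x, y, batch_size=1)          # 20 updates per epoch
w_mini = gradient_descent(x, y, batch_size=5)         # 4 updates per epoch
```

All three runs approach w = 2 on this toy problem. The number of parameter updates per pass over the data is what changes: one for batch, m for stochastic, and m / b for mini-batch.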

Single-Layer Network
The true computing power of neural networks comes from connecting multiple neurons,
though even a single neuron can perform a substantial level of computation. The most common
way of connecting neurons into a network is by layers. The simplest form of layered
network is shown in the figure. The shaded nodes on the left are in the so-called input layer. The
input layer neurons only pass on and distribute the inputs and perform no computation.
Thus, the only true layer of neurons is the one on the right. Each of the inputs
is connected to every artificial neuron in the output layer through a connection weight. Since
every output value is calculated from the same set of input values, each
output varies based on the connection weights. Although the presented network is fully
connected, a true biological neural network may not have all possible connections; a
weight value of zero can be interpreted as "no connection".

Multilayer Network
To achieve a higher level of computational capability, a more complex neural network
structure is required. Figure 2.8 shows the multilayer neural network, which distinguishes itself
from the single-layer network by having one or more hidden layers. In this multilayer structure,
the input nodes pass the information to the units in the first hidden layer, then the outputs
from the first hidden layer are passed to the next layer, and so on.
A multilayer network can also be viewed as a cascade of groups of single-layer networks. The
level of complexity in computing can be seen from the fact that many single-layer networks are
combined into this multilayer network. The designer of an artificial neural network should
consider how many hidden layers are required, depending on the complexity of the desired
computation.
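
The cascading view of a multilayer network can be sketched as a loop over single layers, each feeding its output to the next. NumPy is assumed. The layer sizes, the random weights and the use of tanh after every layer are hypothetical choices for this example.

```python
import numpy as np

def forward(x, layers):
    """Pass x through a list of (W, b) layers, applying tanh after each one."""
    a = x
    for W, b in layers:
        a = np.tanh(W @ a + b)   # output of this layer is input to the next
    return a

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]   # hypothetical: 3 inputs, 4 hidden units, 2 outputs

# One (weights, bias) pair per connection between consecutive layers.
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

x = np.array([0.5, -1.0, 2.0])   # hypothetical input features
output = forward(x, layers)
```

Adding another hidden layer only requires adding one more entry to `layer_sizes`; the forward loop stays the same.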

Backpropagation

Backpropagation is a supervised learning algorithm for training multi-layer perceptrons
(artificial neural networks).

Why Do We Need Backpropagation?

While designing a neural network, in the beginning we initialize the weights with some random
values, or any values for that matter.

Now obviously, we are not superhuman. So the weight values we have selected will not
necessarily be correct, or be the ones that fit our model best.

Okay, fine, we have selected some weight values in the beginning, but our model output is way
different from our actual output, i.e. the error value is huge.

Now, how will you reduce the error?

Basically, we need to somehow tell the model to change its
parameters (weights) such that the error becomes minimum.

Let’s put it another way: we need to train our model.

One way to train our model is called Backpropagation. Consider the diagram below:

Let me summarize the steps for you:

• Calculate the error: how far is your model output from the actual output.
• Minimum error: check whether the error is minimized or not.
• Update the parameters: if the error is huge, update the parameters (weights and
biases). After that, check the error again. Repeat the process until the error becomes
minimum.
• Model is ready to make a prediction: once the error becomes minimum, you can feed
some inputs to your model and it will produce the output.

I am pretty sure that you now know why we need Backpropagation, and what it means to
train a model.

Now is the right time to understand what Backpropagation is.

What is Backpropagation?

The Backpropagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize the
error function are then considered to be a solution to the learning problem.

Let’s understand how it works with an example:


You have a dataset, which has labels.
Consider the table below:

Input   Desired Output
0       0
1       2
2       4

Now the output of your model when the value of ‘W’ is 3:

Input   Desired Output   Model output (W=3)
0       0                0
1       2                3
2       4                6

Notice the difference between the model output and the desired output:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error
0       0                0                    0                0
1       2                3                    1                1
2       4                6                    2                4

Let’s change the value of ‘W’. Notice the error when ‘W’ = 4:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error   Model output (W=4)   Square Error (W=4)
0       0                0                    0                0              0                    0
1       2                3                    1                1              4                    4
2       4                6                    2                4              8                    16

Now if you notice, when we increase the value of ‘W’ the error increases. So, obviously
there is no point in increasing the value of ‘W’ further. But what happens if we decrease the
value of ‘W’? Consider the table below:

Input   Desired Output   Model output (W=3)   Absolute Error   Square Error   Model output (W=2)   Square Error (W=2)
0       0                0                    0                0              0                    0
1       2                3                    1                1              2                    0
2       4                6                    2                4              4                    0

Now, what did we do here?

• We first initialized ‘W’ to some random value and propagated forward.
• Then, we noticed that there was some error. To reduce that error, we propagated
backwards and increased the value of ‘W’.
• After that, we noticed that the error had increased. We came to know that we can’t
increase the ‘W’ value.
• So, we again propagated backwards and decreased the ‘W’ value.
• Now, we noticed that the error had reduced.

So, we are trying to get the value of weight such that the error becomes minimum. Basically, we
need to figure out whether we need to increase or decrease the weight value. Once we know
that, we keep on updating the weight value in that direction until error becomes minimum. You
might reach a point, where if you further update the weight, the error will increase. At that
time you need to stop, and that is your final weight value.

Consider the graph below:

We need to reach the ‘Global Loss Minimum’.

This is nothing but Backpropagation.
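
The search for ‘W’ in the tables above can be automated with gradient descent. The sketch below uses the same three data points and the same starting value W = 3. The learning rate and the number of steps are hypothetical choices. Instead of trying a larger and then a smaller value by hand, the sign of the gradient tells us in which direction to move ‘W’.

```python
inputs = [0.0, 1.0, 2.0]
desired = [0.0, 2.0, 4.0]

def total_square_error(w):
    """Sum of the 'Square Error' column for a given weight."""
    return sum((w * x - d) ** 2 for x, d in zip(inputs, desired))

def error_gradient(w):
    # Derivative of the total square error with respect to w.
    return sum(2.0 * (w * x - d) * x for x, d in zip(inputs, desired))

w = 3.0                # same starting value as in the tables
learning_rate = 0.05   # hypothetical step size

for _ in range(50):
    w -= learning_rate * error_gradient(w)   # positive gradient, so w decreases
```

The total square error is 5 at W = 3, 20 at W = 4 and 0 at W = 2, matching the column sums in the tables. The loop drives W from 3 down to 2, where the error is at its minimum.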

Weight initialization

Weight initialisation is the assignment of initial values, usually random, to the weights of a
neural network before training it.
One thing you should take care of during initialisation is that you cannot set all the weights to zero.
This would create symmetry in the network, which means that the units within each layer
would take the same values at every iteration. You certainly don't want this to happen. Therefore, you
should randomly initialise them to some non-zero values.

In neural networks, there exist weights between every two consecutive layers. The linear
transformation of the values in the previous layer by these weights passes through a non-linear
activation function to produce the values of the next layer. This process happens layer by layer during
forward propagation, and through back propagation the optimum values of these weights can be
found so as to produce accurate outputs for a given input.

For a long time, machine learning engineers simply used randomly initialized weights as the
starting point for this process. Until around 2015, it was not widely appreciated that the initial
values of these weights play such an important role in finding the global minimum of a deep neural
network cost function.

Let's look at three ways to initialize the weights between the layers before we start the forward
and backward propagation to find the optimum weights.

1: zero initialization

2: random initialization

3: he-et-al initialization

Zero Initialization

Zero initialization serves no purpose. The neural net does not perform symmetry-breaking. If
we set all the weights to zero, then all the neurons in a layer perform the same
calculation and give the same output, thereby making the whole deep net useless. If the
weights are zero, the complexity of the whole deep net is the same as that of a single
neuron, and the predictions are no better than random.

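# Assumes numpy is imported as np; layer_size lists the units per layer and l is the layer index.
w = np.zeros((layer_size[l], layer_size[l-1]))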

Random Initialization

This serves the process of symmetry-breaking and gives much better accuracy. In this method,
the weights are initialized very close to zero, but randomly. This helps in breaking symmetry
and every neuron is no longer performing the same computation.

w = np.random.randn(layer_size[l], layer_size[l-1]) * 0.01

He-et-al Initialization
This method of initialization became famous through a paper published in 2015 by He et al., and is
similar to Xavier initialization, with the factor multiplied by two. In this method, the weights are
initialized keeping in mind the size of the previous layer, which helps gradient descent reach a
minimum of the cost function faster and more efficiently. The weights are still random but
differ in range depending on the size of the previous layer of neurons. This provides a
controlled initialization, hence the faster and more efficient gradient descent.

w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(2 / layer_size[l-1])
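
The three one-line snippets in this section can be put side by side in a runnable form. This is a sketch under assumptions: the layer widths in `layer_size` are hypothetical, and NumPy's `default_rng` generator is used in place of `np.random.randn` so that the result is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_size = [784, 256, 128]   # hypothetical number of units in each layer
l = 1                          # initialize the weights feeding layer 1

shape = (layer_size[l], layer_size[l - 1])

# 1: zero initialization (no symmetry breaking)
w_zero = np.zeros(shape)

# 2: random initialization, small values close to zero
w_small = rng.standard_normal(shape) * 0.01

# 3: He-et-al initialization, scaled by the size of the previous layer
he_std_expected = np.sqrt(2 / layer_size[l - 1])
w_he = rng.standard_normal(shape) * he_std_expected
```

The standard deviation of `w_he` depends on the width of the previous layer (about 0.05 for 784 units), whereas `w_small` always has a standard deviation of about 0.01 regardless of layer size.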

Unstable gradient problem


The unstable gradient problem is a fundamental problem in neural networks: the
gradient in a deep neural network tends to either explode or vanish in the early
layers.

The unstable gradient problem is not the vanishing gradient problem or the
exploding gradient problem as such; rather, it stems from the fact that the gradient in the early
layers is the product of terms from all the later layers. More layers make the network
intrinsically more unstable. Balancing all these products of terms is the only way each layer in a
neural network can learn at close to the same speed and avoid vanishing or exploding gradients. A
balanced product of terms occurring by chance becomes more and more unlikely with more layers.
Neural networks therefore have layers that learn at different speeds, unless they are given some
mechanism or underlying reason for balancing the learning speeds.

When the magnitudes of gradients accumulate in this way, unstable networks are more likely to occur,
which is a cause of poor prediction results.
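
The "product of terms" can be made concrete with a deliberately simplified model: one sigmoid neuron per layer, the same weight in every layer, and a pre-activation of 0 everywhere. These are all assumptions made only for illustration. Each layer then contributes one factor of weight times sigmoid derivative to the gradient in the first layer.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # at most 0.25, reached at z = 0

def gradient_scale(weight, num_layers, z=0.0):
    """Product of (weight * sigmoid'(z)) terms across num_layers layers."""
    return (weight * sigmoid_derivative(z)) ** num_layers

# Each factor is 1.0 * 0.25, so the product shrinks quickly with depth.
vanishing = gradient_scale(weight=1.0, num_layers=10)

# Each factor is 8.0 * 0.25 = 2.0, so the product grows quickly with depth.
exploding = gradient_scale(weight=8.0, num_layers=10)
```

With ten layers the first product is below one millionth, while the second is 1024. Only when every factor stays close to 1 does the gradient keep a stable size, which becomes less likely as layers are added.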

Auto-Encoder:
Auto-encoders are an unsupervised learning technique.
An autoencoder is an unsupervised artificial neural network that learns how to efficiently
compress and encode data, and then learns how to reconstruct the data from the reduced
encoded representation to a representation that is as close to the original input as possible.
By design, an autoencoder reduces data dimensions by learning how to ignore the noise in the
data.
Here is an example of the input/output image from the MNIST dataset to an autoencoder.
[Figure: Autoencoder for MNIST]

Autoencoder Components:

Autoencoders consist of 4 main parts:

1- Encoder: In which the model learns how to reduce the input dimensions and compress the
input data into an encoded representation.

2- Bottleneck: The layer that contains the compressed representation of the input data.
This is the lowest-dimensional representation of the input data.

3- Decoder: In which the model learns how to reconstruct the data from the encoded
representation to be as close to the original input as possible.

4- Reconstruction Loss: The method that measures how well the decoder is
performing and how close the output is to the original input.

The training then involves using back propagation in order to minimize the network’s
reconstruction loss.
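
The four parts can be seen together in a very small linear autoencoder. The sketch below assumes NumPy, random toy data, no activation functions, mean squared error as the reconstruction loss, and hand-written gradients. All sizes, names and hyperparameters are hypothetical, and a practical autoencoder would normally be built with a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))            # toy data: 64 samples, 8 features
W_enc = rng.standard_normal((8, 3)) * 0.1   # encoder: 8 features -> 3-unit bottleneck
W_dec = rng.standard_normal((3, 8)) * 0.1   # decoder: 3 units -> 8 features

def reconstruction_loss(X, W_enc, W_dec):
    code = X @ W_enc        # bottleneck representation
    X_hat = code @ W_dec    # reconstruction of the input
    return np.mean((X_hat - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.1
for _ in range(500):
    code = X @ W_enc
    X_hat = code @ W_dec
    d_X_hat = 2.0 * (X_hat - X) / X.size       # gradient of the loss w.r.t. X_hat
    grad_dec = code.T @ d_X_hat                # back-propagate into the decoder
    grad_enc = X.T @ (d_X_hat @ W_dec.T)       # and through it into the encoder
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
```

The reconstruction loss after training is lower than before training. It does not reach zero here, because 8 random features cannot be represented exactly by a 3-unit bottleneck.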

Three examples of applications of autoencoders are given below:

• Data Storage: The encoding process is able to compress large quantities of
data. This, as you can imagine, has big benefits for data storage
at scale.
• Feature detection: The process used to encode the data identifies features of the data
that can be used to identify it. This list of features is used in multiple systems to
understand the data. (Convolutional Neural Networks also do feature detection in
images.)
• Recommendation systems: One application of autoencoders is in recommendation
systems. These are the systems that identify films or TV series you are likely to enjoy on
your favorite streaming services.

Batch normalization

Batch normalization is a technique for improving the speed, performance, and stability of artificial
neural networks. Batch normalization was introduced in a 2015 paper. It is used to normalize the inputs
to a layer by adjusting and scaling the activations.

What is Batch Normalization?


Batch Normalization is a technique, applied while training a neural network, that converts the
interlayer outputs of the network into a standard format, called normalizing. This effectively
'resets' the distribution of the output of the previous layer so that it can be processed more
efficiently by the subsequent layer.

What are the Advantages of Batch Normalization?


This approach leads to faster learning, since normalization ensures there’s no activation
value that’s too high or too low, and it allows each layer to learn more independently of the
others.

Batch normalization also has a mild regularizing effect, which can reduce the need for
dropout and may improve accuracy throughout the network.

How Does Batch Normalization Work?


To enhance the stability of a deep learning network, batch normalization normalizes the output of
the previous activation layer by subtracting the batch mean and then dividing by the batch’s
standard deviation.

Since this shifting and scaling of the outputs can limit what the next layer is able to represent,
the normalized output is followed by a learnable scale and shift, so that stochastic gradient
descent can undo the normalization if that is what minimizes the loss function.

The end result is that batch normalization adds two additional trainable parameters to a layer: the
normalized output is multiplied by a gamma (standard deviation) parameter, and a beta (mean)
parameter is added. This is why batch normalization works together with
gradient descent: the data can be “denormalized” by changing just these two
parameters for each output, instead of changing all the other relevant weights. This leads to
increased stability across the network.
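
The steps above (subtract the batch mean, divide by the batch standard deviation, then scale by gamma and shift by beta) can be sketched as a forward pass. NumPy is assumed, and the small constant `eps` is a common safeguard against division by zero that the text does not mention. Only the training-time computation is shown, and the batch of activations is hypothetical.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)                     # batch mean, one value per feature
    var = x.var(axis=0)                       # batch variance, one value per feature
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Hypothetical batch: 32 samples, 4 features, far from zero mean and unit scale.
activations = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

normalized = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))
rescaled = batch_norm(activations, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))
```

With gamma = 1 and beta = 0 each feature ends up with mean 0 and standard deviation close to 1. With gamma = 2 and beta = 1 the same features have mean 1 and standard deviation close to 2, which shows how the two trainable parameters can "denormalize" the output.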
