
ARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORK

• A computer program designed to recognize patterns and learn “like” the human brain
What is a neural network?

 Biological organisms are capable of learning gradually over time.

 This learning capability reflects the ability of biological neurons to learn through exposure to external stimuli and to generalize.

 This ability of biological organisms to learn from examples suggests possibilities for machine learning.

 “Neural networks, or more specifically, artificial neural networks, are mathematical models inspired by our understanding of biological nervous systems.”
Some Definitions of ANN
• A neural network is a system composed of many simple processing elements operating in parallel, whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.
• According to Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:

– A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
• Knowledge is acquired by the network through a
learning process.
• Interneuron connection strengths known as synaptic
weights are used to store the knowledge.
Every neuron is made up of the SOMA or CELL BODY, where decisions are made based on the information carried into the cell via the DENDRITES.

The output from the soma is carried to other neurons, or to any other type of cell, for excitation or inhibition, by the AXON.

The axon is coupled to other dendrites by SYNAPSES.

Synaptic gap: the gap between the axon terminals and the dendrites of another cell (50-200 Angstrom)
NEURON
• Neuron = nerve cell
• Soma = cell body (where decisions are made)
• Dendrites – carry information into the cell
• Axon – carries output to other neurons
 A neural network is a collection of artificial neurons.
 An artificial neuron is a mathematical model of a biological neuron in its
simplest form.
WHY USE ACTIVATION FUNCTIONS?

Activation functions are needed to introduce nonlinearity into the network.

Without nonlinearity, hidden units would not make nets more powerful than plain perceptrons (which have no hidden units, just input and output units).

Tan sigmoid activation function

tan sigmoid: y = f(net) = 2/(1 + e^(-net)) - 1 = (1 - e^(-net))/(1 + e^(-net))

Find f'(net) w.r.t. net, in terms of y:

y = 2/(1 + e^(-net)) - 1

1 + y = 2/(1 + e^(-net)),  so  1 + e^(-net) = 2/(1 + y)  and  e^(-net) = (1 - y)/(1 + y)

f'(net) = dy/dnet = 2 e^(-net) / (1 + e^(-net))^2
        = 2 [(1 - y)/(1 + y)] / [2/(1 + y)]^2
        = 2 (1 - y)(1 + y) / 4

f'(net) = 0.5 (1 - y^2)


Log sigmoid activation function

log sigmoid: y = f(net) = 1/(1 + exp(-net))

Find f'(net) in terms of y:

f(net) = 1/(1 + e^(-net)),  so  e^(-net) = 1/y - 1

f'(net) = e^(-net) / (1 + e^(-net))^2 = y^2 (1/y - 1) = y (1 - y)
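Both derivative formulas can be checked numerically; the following MATLAB sketch (test point and step size are arbitrary choices) compares a finite-difference estimate with the closed forms derived above:

% Numerically verify f'(net) for the tan- and log-sigmoid (illustrative sketch)
tansig_f = @(net) 2 ./ (1 + exp(-net)) - 1;    % y in (-1, 1)
logsig_f = @(net) 1 ./ (1 + exp(-net));        % y in (0, 1)
net = 0.7; h = 1e-6;                           % arbitrary test point and step

num_tan = (tansig_f(net+h) - tansig_f(net-h)) / (2*h);   % finite difference
y = tansig_f(net);
fprintf('tansig: numeric %.6f  vs  0.5*(1-y^2) = %.6f\n', num_tan, 0.5*(1-y^2));

num_log = (logsig_f(net+h) - logsig_f(net-h)) / (2*h);
y = logsig_f(net);
fprintf('logsig: numeric %.6f  vs  y*(1-y) = %.6f\n', num_log, y*(1-y));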
Gaussian activation function: f(x) = e^(-a x^2)

It is essentially zero everywhere except in a small region around zero.


McCulloch and Pitts' Model
• Proposed in 1943
• Consists of a summing element and a threshold function; the artificial neuron produces a binary output
• The neuron 'fires' when the weighted sum of inputs Σ wi xi reaches or exceeds the threshold T
• Could perform the basic AND, OR, and NOT functions

In the first computational model for artificial neurons, proposed by McCulloch and Pitts, outputs are binary, and the function f is the step function.

Notice the quantity (-w0) is a threshold that the weighted combination of inputs w1 x1 + … + wn xn must surpass in order for the perceptron to output a 1.

• To simplify notation, we imagine an additional constant input x0 = 1, allowing us to write the above inequality as

Σ (i = 0 to n) wi xi > 0
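As a sketch of the McCulloch-Pitts unit, the weights and thresholds below are one illustrative choice (not unique) realizing AND, OR, and NOT:

% McCulloch-Pitts unit: binary output, fires (1) iff the weighted sum reaches T
mcp = @(w, T, x) double(sum(w .* x) >= T);

AND_g = @(x) mcp([1 1], 2, x);    % fires only when both inputs are 1
OR_g  = @(x) mcp([1 1], 1, x);    % fires when at least one input is 1
NOT_g = @(x) mcp(-1, 0, x);       % inhibitory weight: fires only for x = 0

[AND_g([1 1]), OR_g([0 1]), NOT_g(1)]   % expected: 1 1 0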

• Learning a perceptron involves choosing values for the weights w0, w1,…, wn.

Single-layer neural network or perceptron:
We obtain simple neural networks by considering several neurons at a time.
It consists of an input layer (a layer of input nodes) and one output layer consisting of neurons.
This is referred to as a single-layer neural network because the input layer is not a layer of neurons; that is, no computations occur at the input nodes.

The perceptron model can only solve linearly separable problems.

Single-layer neural network

Example 1:

Design a perceptron to implement the logical Boolean function OR, that is, the function g: {0,1} × {0,1} → {0,1} defined by the OR truth table.

In view of the function g, we consider binary-input/binary-output neural networks, with two nodes in the input layer and only one neuron in the output layer.
The problem is to choose w1, w2, and b so that the network reproduces g.

First, to see whether or not this problem is solvable, we look at the domain of the function g, coloring the points white where the function value is 1, and black where the function value is 0.
Of course, there are many such separating lines, and any such line gives a solution. In view of the simplicity of this function g, no sophisticated mathematics is needed. Just observe the inequalities that the four input pairs impose, and choose any numbers for w1, w2, and b satisfying these inequalities.

One solution is w1 = w2 = b = 2.
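The solution can be checked directly; a minimal MATLAB sketch, assuming a step function that outputs 1 when the net input is nonnegative:

% Verify w1 = w2 = b = 2 implements OR: y = step(w1*x1 + w2*x2 - b)
w1 = 2; w2 = 2; b = 2;
step_f = @(net) double(net >= 0);
for x = [0 0; 0 1; 1 0; 1 1]'            % each column is one input pair
    fprintf('g(%d,%d) = %d\n', x(1), x(2), step_f(w1*x(1) + w2*x(2) - b));
end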
Example

The exclusive OR (XOR) is the Boolean function g with the truth table: g(0,0) = 0, g(0,1) = 1, g(1,0) = 1, g(1,1) = 0.

When we display the input-output relationship expressed by g, we see that this problem is not linearly separable.
That is, there is no line that can separate the two subsets

White = {(0,0), (1,1)} and Black = {(0,1), (1,0)}.

Therefore, the function g cannot be implemented using a perceptron, so we need to consider a hidden layer.
This is a multi-layer neural network in which
the input layer (or layer 0) has two nodes x1, x2 (no computation), together with x0 = 1 to implement the bias for neurons A and B;
the hidden layer has two hidden neurons A and B, with x0 = 1 to implement the bias for neuron C;
and the output layer consists of only one neuron C.

For all neurons, the “activation function” f is taken to be the step function.

Consider, for example, x1 = x2 = 1. We expect the network to produce the output y = 0. The output y is computed in several steps. The net input to neuron A is w1A x1 + w2A x2 - bA,

so the output of neuron A is y1 = f(w1A x1 + w2A x2 - bA).

Similarly, the output of neuron B is y2 = f(w1B x1 + w2B x2 - bB).

Hence, the net input to neuron C from neurons A and B is wAC y1 + wBC y2 - bC, and it must satisfy f(wAC y1 + wBC y2 - bC) = 0

to result in the output y = 0 when x1 = x2 = 1.


Make a table with columns:

I/P X1 | I/P X2 | HL I/P (H1, H2) | HL O/P (H1, H2) | I/P O1 | O/P o1
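A MATLAB sketch of filling such a table. The hidden weights below are one illustrative choice (the slide's actual values were in a figure): neuron A computes x1 AND NOT x2, neuron B computes x2 AND NOT x1, and C ORs them:

% Two-layer XOR sketch with a step activation f(net) = 1 iff net >= 0
f = @(net) double(net >= 0);
for x = [0 0; 0 1; 1 0; 1 1]'
    y1 = f( x(1) - x(2) - 0.5);          % hidden neuron A: x1 AND NOT x2
    y2 = f(-x(1) + x(2) - 0.5);          % hidden neuron B: x2 AND NOT x1
    y  = f( y1   + y2   - 0.5);          % output neuron C: A OR B
    fprintf('XOR(%d,%d) = %d\n', x(1), x(2), y);
end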
More layers seem to increase the computational power of the network.
THE MULTILAYER PERCEPTRON
RECURRENT NETWORKS
Supervised learning (contd.)

Supervised learning performs minimization of error.

• Unsupervised learning
– Training samples contain only input patterns
• No desired output is given (teacher-less)
– Learn to form classes/clusters of sample patterns according to similarities among them
• Patterns in a cluster would have similar features
• No prior knowledge as to what features are important for classification, or how many classes there are
Unsupervised Learning
• All similar input patterns are grouped together as clusters.
• If a matching input pattern is not found, a new cluster is formed.

Unsupervised (Self-organizing)
• In unsupervised learning there is no feedback
• The network must discover patterns, regularities, and features in the input data on its own
• While doing so the network may change its parameters
• This process is called self-organization
• Reinforcement: learning motor skills like riding a bike
• You try, and see what makes you fall over and what keeps you upright

When is reinforcement learning used?

• When less information is available about the target output values (critic information)
• Learning based on this critic information is called reinforcement learning, and the feedback sent is called the reinforcement signal
Perceptron learning rule
• One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.

• This process is repeated, iterating through the training examples as many times as needed, until the perceptron classifies all training examples correctly.

• Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule

wi ← wi + Δwi

where Δwi = η (t – o) xi

t is the target output value for the current training example
o is the perceptron output
η is a small constant (e.g., 0.1) called the learning rate
Perceptron training rule (cont.)
• The role of the learning rate is to moderate the degree to which weights are changed at each step. It is usually set to some small value (e.g. 0.1) and is sometimes made to decrease as the number of weight-tuning iterations increases.

• We can prove that the algorithm will converge
– if the training data is linearly separable
– and η is sufficiently small.

• If the data is not linearly separable, convergence is not assured.
Perceptron Learning Algorithm
• Weights are set randomly initially
• For each training example E
– Calculate the observed output from the ANN, o(E)
– If the target output t(E) is different from o(E)
• Then tweak all the weights so that o(E) gets closer to t(E)
• Tweaking is done by perceptron training rule (next slide)
• This routine is done for every example E
• Don’t necessarily stop when all examples used
– Repeat the cycle again (an ‘epoch’)
– Until the ANN produces the correct output
• For all the examples in the training set (or good enough)
Worked Example
• Return to the “bright” and “dark” example
• Use a learning rate of η = 0.1
• Suppose we have set random weights:
• Use this training example, E, to update
weights:

• Here, x1 = -1, x2 = 1, x3 = 1, x4 = -1 as before


• Propagate this information through the network:
S = (-0.5 × 1) + (0.7 × (-1)) + (-0.2 × 1) + (0.1 × 1) + (0.9 × (-1)) = -2.2
• Hence the network outputs o(E) = -1
• But this should have been “bright” = +1
– So t(E) = +1
Calculating the Error Values
• Δw0 = η(t(E) - o(E))x0 = 0.1 × (1 - (-1)) × 1 = 0.1 × 2 = 0.2
• Δw1 = η(t(E) - o(E))x1 = 0.1 × (1 - (-1)) × (-1) = 0.1 × (-2) = -0.2
• Δw2 = η(t(E) - o(E))x2 = 0.1 × (1 - (-1)) × 1 = 0.1 × 2 = 0.2
• Δw3 = η(t(E) - o(E))x3 = 0.1 × (1 - (-1)) × 1 = 0.1 × 2 = 0.2
• Δw4 = η(t(E) - o(E))x4 = 0.1 × (1 - (-1)) × (-1) = 0.1 × (-2) = -0.2
Calculating the New Weights
• w’0 = -0.5 + Δw0 = -0.5 + 0.2 = -0.3

• w’1 = 0.7 + Δw1 = 0.7 + -0.2 = 0.5

• w’2 = -0.2 + Δw2 = -0.2 + 0.2 = 0

• w’3= 0.1 + Δw3 = 0.1 + 0.2 = 0.3

• w’4 = 0.9 + Δw4 = 0.9 - 0.2 = 0.7


• Calculate for the example, E, again:
S = (-0.3 × 1) + (0.5 × (-1)) + (0 × 1) + (0.3 × 1) + (0.7 × (-1)) = -1.2
• Still gets the wrong categorisation
– But the value is closer to zero (from -2.2 to -1.2)
– In a few epochs' time, this example will be correctly categorised
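The whole update can be reproduced in a few lines of MATLAB (a sketch of this specific example, using sign() as the threshold):

% Reproduce the update above: one application of the perceptron training rule
eta = 0.1;
w = [-0.5 0.7 -0.2 0.1 0.9];     % [w0 w1 w2 w3 w4]; w0 multiplies x0 = 1
x = [ 1  -1   1   1  -1];        % x0 = 1 followed by the four inputs
t = 1;                           % target: "bright" = +1

o = sign(w * x');                % S = -2.2, so o = -1
w = w + eta * (t - o) * x;       % perceptron training rule
disp(w)                          % -0.3  0.5  0  0.3  0.7
disp(w * x')                     % new S = -1.2, closer to the target side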
x1 x2 y
1 1 1
1 0 -1
0 1 -1
0 0 -1

• A perceptron for the AND function is defined as follows:
• Binary inputs (0/1)
• Bipolar targets (±1)
• Initial weights and bias equal to 0
• Threshold θ = 0.2

f(net) = +1 if net > θ
f(net) = 0 if -θ ≤ net ≤ θ
f(net) = -1 if net < -θ

Perceptron learning rule

• Presenting the first input yields the following
• Presenting the second input yields the following

The response of the net will now be correct for the first pattern (1,1,1), as net will be 0×1 + 1×1 + 0×1 = 1, so f(net) = 1 = target

Every epoch will have four inputs

• Presenting the third input yields the following

The response of the net will be negative (correct); correct for 3 inputs

• Presenting the fourth input yields the following

• The response for all of the input patterns is negative for the weights derived; but since the response for the input pattern (1,1) is not correct, we need to perform one more epoch
SECOND EPOCH, FIRST TRAINING INPUT

The response of the net will now be correct for (1,1)

SECOND EPOCH, SECOND TRAINING INPUT

The response of the net will now be correct (negative) for (1,0) and (0,0); the response for (0,1) and (1,1) will be 0, which is not correct (the targets are -1 and +1)

SECOND EPOCH, THIRD TRAINING INPUT

THE RESPONSE WILL BE NEGATIVE FOR ALL INPUTS

SECOND EPOCH, FOURTH TRAINING INPUT

THE RESPONSE WILL BE NEGATIVE FOR ALL INPUTS
Finally, the results for the tenth epoch are:

INPUTS    NET   OUTPUT   TARGET   WEIGHT CHANGES   WEIGHTS
(1 1 1)    1      1        1        (0 0 0)        (2 3 -4)
(1 0 1)   -2     -1       -1        (0 0 0)        (2 3 -4)
(0 1 1)   -1     -1       -1        (0 0 0)        (2 3 -4)
(0 0 1)   -4     -1       -1        (0 0 0)        (2 3 -4)
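A MATLAB sketch of the full training run (Fausett-style rule with learning rate 1; the 10-epoch count and final weights match the table above):

% Train the AND perceptron: binary inputs, bipolar targets, theta = 0.2,
% learning rate 1, rule: if f(net) ~= t then w <- w + t*x, b <- b + t
X = [1 1; 1 0; 0 1; 0 0];  T = [1 -1 -1 -1];
w = [0 0]; b = 0; theta = 0.2;
f = @(net) (net > theta) - (net < -theta);   % the +1 / 0 / -1 function above

for epoch = 1:10
    for p = 1:4
        net = X(p,:) * w' + b;
        if f(net) ~= T(p)                    % misclassified: adjust
            w = w + T(p) * X(p,:);
            b = b + T(p);
        end
    end
end
fprintf('w1 = %d, w2 = %d, b = %d\n', w(1), w(2), b)   % expect 2, 3, -4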
Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.

A second training rule, called the delta rule, is designed to overcome this difficulty.

The key idea of the delta rule:

use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.

This rule is important because it provides the basis for the backpropagation algorithm, which can learn networks with many interconnected units.
Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is also known as steepest descent, or the method of steepest descent.
• In order to derive a weight learning rule for linear units, let us specify a measure for the training error of a weight vector, relative to the training examples. The training error can be computed as the following squared error:

E(w) = (1/2) Σ (d in D) (td - od)^2    (2)

where D is the set of training examples, td is the target output for the training example d, and od is the output of the linear unit for the training example d.

Here we characterize E as a function of the weight vector because the linear unit output od depends on this weight vector.
Hypothesis Space
• To understand the gradient descent algorithm, it is helpful to
visualize the entire space of possible weight vectors and their
associated E values, as illustrated in Figure

– Here the axes w0, w1 represent possible values for the two weights of a simple linear unit. The w0, w1 plane represents the entire hypothesis space.
– The vertical axis indicates the error E relative to some fixed set of training
examples. The error surface shown in the figure summarizes the desirability of
every weight vector in the hypothesis space.

• For linear units, this error surface must be parabolic with a single
global minimum. And we desire a weight vector with this minimum.

The error surface

How can we calculate the direction of steepest descent along the error surface? This direction can be found by computing the derivative of E w.r.t. each component of the vector w.
wi(new) = wi(old) - η ∂E/∂wi

If the error gradient ∂E/∂wi is negative:
wi(new) = wi(old) + η |∂E/∂wi|, so wi(new) > wi(old)

If the error gradient ∂E/∂wi is positive:
wi(new) = wi(old) - η |∂E/∂wi|, so wi(new) < wi(old)
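A minimal MATLAB sketch of this update for a single linear unit on made-up 1-D data (all constants illustrative):

% Gradient descent for one linear unit o = w*x on made-up 1-D data;
% for E = 0.5*sum((t - o).^2), the gradient is dE/dw = -sum((t - o).*x)
x = [1 2 3 4];  t = 2.5 * x;      % illustrative targets; true weight is 2.5
w = 0;  eta = 0.01;

for step = 1:200
    o = w * x;                    % outputs for all examples
    grad = -(t - o) * x';         % dE/dw
    w = w - eta * grad;           % step against the gradient
end
fprintf('learned w = %.4f (true value 2.5)\n', w)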
Learning rate

The learning rate determines how far to move in the direction of the gradient of the surface over the weight space defined by an error function.

A small learning rate will lead to slower learning, but a large one may cause a move through weight space that "overshoots" the solution vector.
• Learning rate too small
• ■ Convergence extremely slow
• Learning rate too large
• ■ May not converge
DELTA LEARNING RULE (valid for continuous activation functions)

For node i:  Δwi = η (d1 - o1) f'(neti) xi
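A one-step MATLAB sketch of the delta rule for a log-sigmoid node, reusing f'(net) = y(1 - y) from earlier (input, weights and target are illustrative):

% One delta-rule update for a log-sigmoid node i (illustrative numbers)
eta = 0.5;
x = [1 0.5];  w = [0.2 -0.3];  d = 1;     % inputs, weights, desired output

net_i = w * x';
y = 1 / (1 + exp(-net_i));                % log-sigmoid output o
w = w + eta * (d - y) * y * (1 - y) * x;  % uses f'(net) = y*(1 - y)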
BACK-PROPAGATION LEARNING ALGORITHM

The training involves three stages:

1) Feedforward of the input training pattern

2) Calculation and backpropagation of the associated error

3) Adjustment of weights
• Three Major Phases Using Backpropagation

• First phase: an input vector is presented to the network, which leads via the forward pass to the activation of the network as a whole. This generates a difference (error) between the output of the network and the desired output.
• Second phase: compute the error vector for the output units and propagate this error successively back through the network (error backward pass).
• Third phase: compute the changes for the connection weights.
INSTANTANEOUS BACK PROPAGATION ALGORITHM
See Reference Book: Process Control by Surekha Bhanot
DIFFERENT MODES OF TRAINING

For a given training set, back propagation learning


may thus proceed in one of two basic ways:

1) Pattern or Sequential mode

2) Batch mode
Pattern mode / Sample-by-sample / Online:
ANN weights are updated every time a training sample is presented to the network, i.e. the weight update is based on the training error from that sample.
Batch mode (offline): ANN weights are updated after each epoch, i.e. the weight update is based on the training error from all samples in the training data set.
An epoch is defined as a cycle of ANN training that involves presentation of all samples in the training data set to the neural network for the purpose of learning.
PATTERN MODE OF TRAINING

In the case of pattern mode, the whole sequence of forward and backward computation is performed, resulting in a weight adjustment for each pattern X(1)/D(1) to X(N)/D(N).

Pattern mode of training is online and requires less local storage.
PATTERN MODE OF TRAINING
The steps involved in pattern mode can be summarized as follows:

1) Initialize the weights, set p = 1 and E = 0

2) Submit the pattern p

3) Compute the output of all nodes in all layers (forward pass)

4) Compute the cycle error:

E → E + (1/2) Σ (k = 1 to n) (dk - ok)^2

5) Compute new weights (backward pass)

6) If p < N, set p → p + 1 and repeat steps (2) to (5) for all patterns

7) Compute the RMS error, where N is the number of patterns:

E_RMS = sqrt(2E / N)

8) If the RMS error > Emax, repeat steps (2) through (7), submitting the patterns again after resetting E → 0
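A MATLAB sketch of pattern-mode training for a single linear unit (the data, learning rate, and Emax are illustrative; a full multilayer backward pass is omitted):

% Pattern (online) mode for a single linear unit: update after every sample
X = [0 0; 0 1; 1 0; 1 1];  D = [0; 1; 1; 2];  % illustrative patterns/targets
w = [0 0];  eta = 0.1;  Emax = 1e-4;  N = size(X, 1);

rms = Inf;
while rms > Emax
    E = 0;                                    % steps 1/8: reset cycle error
    for p = 1:N
        o = w * X(p,:)';                      % forward pass for pattern p
        w = w + eta * (D(p) - o) * X(p,:);    % weight update for this pattern
        E = E + 0.5 * (D(p) - o)^2;           % accumulate cycle error
    end
    rms = sqrt(2 * E / N);                    % step 7: RMS error
end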
BATCH MODE OF TRAINING

In batch mode the weight updating is done after the whole set of N patterns is presented.

One complete presentation of the entire training set is called an epoch.

Batch mode is more accurate.
BATCH MODE OF TRAINING
The steps involved in batch mode can be summarized as follows:

1) Initialize the weights

2) Submit the pattern p = 1

3) Calculate and store the error at all output nodes

4) If p < N, then p → p + 1 and go to step (2)

5) Calculate the average error over all output nodes (1 to k):

(Error)average = (1/N) Σ (i = 1 to k) Σ (p = 1 to N) Ei^2(p)

6) If (Error)average is less than Emax, stop;
else calculate new weights (backward pass) and go to step (2)
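The corresponding batch-mode sketch for the same linear unit, accumulating the error over all N samples before each update:

% Batch (offline) mode for the same linear unit: one update per epoch
X = [0 0; 0 1; 1 0; 1 1];  D = [0; 1; 1; 2];
w = [0 0];  eta = 0.1;  Emax = 1e-4;

for epoch = 1:10000
    o = X * w';                         % forward pass for all N patterns
    err = D - o;                        % stored errors at the output node
    if mean(err.^2) < Emax, break; end  % stop once the average error is small
    w = w + eta * err' * X;             % single update from all samples
end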
Data Representation
● In general ANNs work on continuous (real-valued) attributes. Therefore symbolic attributes are encoded into continuous ones.
● Attributes of different types may have different ranges of values, which affects the training process.
● Normalization may be used, which scales each attribute to assume values between 0 and 1:

xi' = (xi - mini) / (maxi - mini)

for each value xi of the i-th attribute, where mini and maxi are the minimum and maximum values of that attribute over the training set.
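A minimal MATLAB sketch of this normalization over a small illustrative training set (column-wise min/max; the element-wise division relies on implicit expansion, available in recent MATLAB/Octave):

% Min-max normalization of each attribute (column) to [0, 1]
X = [2 100; 4 250; 6 400];            % illustrative training set, 2 attributes
mn = min(X);  mx = max(X);            % per-attribute minima and maxima
Xn = (X - mn) ./ (mx - mn);           % implicit expansion over the rows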
MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM

• Single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.

A Differentiable Threshold Unit

• What type of unit should serve as the basis for multilayer networks?
 Perceptron: not differentiable -> can't use gradient descent

 Sigmoid unit: a smoothed, differentiable threshold function


REMARKS ON THE BACKPROPAGATION ALGORITHM

• Convergence and Local Minima


– Gradient descent to some local minimum
• Perhaps not global minimum...

– Heuristics to alleviate the problem of local


minima
• Add momentum
• Train multiple nets with different initial weights
using the same data.
Adding Momentum

• Imagine rolling a ball down a hill
(Figure: without momentum the ball gets stuck in a small dip; with momentum it rolls through)
Momentum in Backpropagation
• For each weight
– Remember what was added in the previous epoch

• In the current epoch


– Add on a small amount of the previous Δ

• The amount is determined by


– The momentum parameter, denoted α
– α is taken to be between 0 and 1
Adding Momentum

Δwi,j(n) = η δj xi,j + α Δwi,j(n-1)

Here Δwi,j(n) is the weight update performed during the n-th iteration through the main loop of the algorithm.
- The n-th iteration update depends on the (n-1)-th iteration
- α: a constant between 0 and 1 called the momentum (typical value 0.6 to 0.9)
Role of the momentum term:
- Keeps the ball rolling through small local minima in the error surface
- Gradually increases the step size of the search in regions where the gradient is unchanging, thereby speeding convergence
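A MATLAB sketch of the momentum update on a toy one-weight error surface E(w) = 0.5*(w - 3)^2 (all constants illustrative):

% Momentum update on the toy error surface E(w) = 0.5*(w - 3)^2
eta = 0.1;  alpha = 0.8;              % alpha typically 0.6 to 0.9
w = 0;  dw_prev = 0;
for n = 1:100
    grad = w - 3;                     % dE/dw for the toy surface
    dw = -eta * grad + alpha * dw_prev;   % current step plus alpha * previous
    w = w + dw;
    dw_prev = dw;
end
fprintf('w after 100 iterations: %.4f (minimum at 3)\n', w)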
Identify under-fit, over-fit, and good fit

Under-fitting

Over-fitting

Good fitting

For banana: (d6) = 1, (d7) = 0

For orange: (d6) = 0, (d7) = 1


A neural network for temperature control

SEE TEXT BOOK p205-209

• We first demonstrate a very simple neural network for a control application, where the neural network is intended to function as an ON/OFF controller

• The objective is to train the NN so that when the measured temperature is higher than desired, the neural controller shuts OFF the furnace, and when the measured temperature is below the desired temperature, the neural controller turns ON the furnace

• The NN is trained to recognize the error between the desired and measured temperature and provide ON/OFF control
Two numbers X and Y are to be compared on the cross space [-2,2] × [-2,2]. The output of the network is a 3-bit pattern corresponding to (top node: X > Y, middle node: X < Y, bottom node: X = Y), taking the value 0 if false and the value 1 if true.

• Use the BPA algorithm for the input pair (X,Y) = (-2, -1) for one epoch, with a neural architecture in which there is a single neuron in a single hidden layer. The learning coefficient is 0.1, initial weights and biases at all hidden and output nodes are 0.5, and the activation function for all layers is logsigmoid. (i) Draw the ANN network with weights, biases and inputs/outputs, (ii) find the error vectors at the output and hidden nodes, and (iii) find the new weights and biases.

Write Matlab code for the same problem to train and test a neural network with three hidden layers having 2, 3, and 4 nodes respectively. Generate patterns with X and Y changing in increments of 1; divide the patterns into training data and test data, picking alternate patterns for each set. Biases at all hidden nodes are 0.1, the learning coefficient is 0.9, the activation function at the output layer is linear and at all other layers is bipolar sigmoid, and the training goal is 10^-8. All other parameters are default values (do not write commands for them).
 2  2  2  2  2  1  1  1  1  1 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2

2  2  1 0 1 2  2  1 0 1 2  2  1 0 1 2  2  1 0 1 2
P
Total  2  1 0 1

 2 2 2 1 1 0 0 0 1 1 2 2 2
p
 2 0 2 1 1 2 0 2 1 1 2 0 2

 2 2 1 1 1 0 0 1 1 1 2 2 
p'   
 1 1 2 0 2 1 1 2 0 2 1 1 

0 0 0 0 0 1 0 0 1 0 1 1 0 
 
t  0 1 1 0 1 0 0 1 0 0 0 0 0 
1 0 0 1 0 0 1 0 0 1 0 0 1
net = newff(minmax(p), [2 3 4 3], {'tansig','tansig','tansig','purelin'}, 'trainrp');

net.trainParam.lr = 0.9;

net.trainParam.goal = 1e-8;

net.b{1} = [0.1; 0.1];

net.b{2} = [0.1; 0.1; 0.1];

net.b{3} = [0.1; 0.1; 0.1; 0.1];

net = train(net, p, t);

A = sim(net, ptest)
Introduction
The main property of a neural network is its ability to learn from its environment, and to improve its performance through learning.

So far we have considered supervised or active learning – learning with an external “teacher” or a supervisor who presents a training set to the network. But another type of learning also exists: unsupervised learning.
 In contrast to supervised learning, unsupervised or
self-organised learning does not require an
external teacher. During the training session, the
neural network receives a number of different
input patterns, discovers significant features in
these patterns and learns how to classify input data
into appropriate categories. Unsupervised
learning tends to follow the neuro-biological
organisation of the brain.
 Unsupervised learning algorithms aim to learn
rapidly and can be used in real-time.

• Pattern recognition
– Patterns: images, personal records, driving habits,
etc.
– Represented as a vector of features (encoded as
integers or real numbers in NN)
– Pattern classification:
• Classify a pattern to one of the given classes
• Form pattern classes
– Pattern associative recall
• Using a pattern to recall a related pattern
• Pattern completion: using a partial pattern to
recall the whole pattern
• Pattern recovery: deals with noise, distortion,
missing information
• Here we introduce a set of rules that allow
unsupervised learning

• These rules give the networks the ability to


learn associations between patterns that
occur frequently

• Once learnt, associations allow networks to


perform useful tasks such as pattern
recognition and recall
Associations

• An association is any link between a


system’s input and output such that
when a pattern A is presented to the
system, it will respond with pattern B
• When two patterns are linked by an
association, the input pattern is called
the stimulus and the output pattern is
called the response
Applications to Behaviorist
Psychology
• Experiment by Ivan Pavlov, in which he trained a
dog to salivate at the sound of a bell, by ringing
the bell whenever the food was presented (an
example of classical conditioning)
• B.F. Skinner performed an experiment which involved training a rat to press a bar in order to obtain a food pellet (an example of instrumental conditioning)
• Hebb, in his influential book The Organization of Behavior (1949), claimed
– Behavior changes are primarily due to changes of synaptic strengths between neurons i and j
– In ANN, the Hebbian learning law can be stated: synaptic strength increases only if the outputs of both units i and j have the same sign (positive or negative)

Δwij = wij(new) - wij(old) = xi yj

or, with a learning rate α:  Δwij = wij(new) - wij(old) = α xi yj

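A minimal MATLAB sketch of one Hebbian update (activations and learning rate are illustrative; here W(i,j) connects input j to output i):

% One unsupervised Hebb update: dW(i,j) = alpha * y(i) * x(j)
alpha = 0.1;                  % learning rate
x = [1; -1; 1];               % illustrative input activations
y = [1; -1];                  % illustrative output activations
W = zeros(2, 3);              % W(i,j) connects input j to output i
W = W + alpha * (y * x');     % grows where x and y agree in sign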

Simple Associative Network
An example of the simplest network implementing an association is the single-input hard-limiting neuron.
A neuron's output 'a' is determined from its input 'p' according to

a = hardlim(wp + b) = hardlim(wp - 0.5)

• We restrict the value of 'p' to be either 0 or 1, indicating whether a stimulus is absent or present. 'a' is limited to the same values by the hard limit transfer function. It indicates the presence or absence of the network's response.
• The presence of an association between the stimulus p = 1 and the response a = 1 is dictated by the value of 'w'. The network will respond to the stimulus only when w > -b.
• In order to demonstrate the operation of associative learning rules without using complex networks, we will use simple networks that have two kinds of inputs.

• One set of inputs represents the unconditioned stimulus. This is analogous to the food presented in Pavlov's experiment.
• Another set of inputs represents the conditioned stimulus. This is analogous to the bell in Pavlov's experiment.

• Initially the dog salivates only when the food is presented. This is an innate characteristic that doesn't have to be learned. However, when the bell is repeatedly paired with the food, the dog is conditioned to salivate at the sound of the bell even when no food is present.
Network for Banana Associator
The network has both an
unconditioned stimulus (banana shape) and a
conditioned stimulus (banana smell).
Let the unconditioned stimulus be represented as p0, and
let the conditioned stimulus be represented as p.
The weight (w0) associated with p0 is fixed, while the
weight (w) associated with p is adjusted according to the relevant learning rule.
The definitions for the unconditioned and conditioned inputs for this network are:

Associate the SHAPE of the banana, not the smell:

assign a value greater than -b to w0, and assign a value less than -b to w.
• The banana associator's input/output function simplifies to

a = hardlim(w0 p0 + w p - 0.5)

The network will respond only if the banana is sighted (p0 = 1), whether or not the banana is smelled (p = 1 or 0).
An association will be made when w > 0.5 (= -b), since then p = 1 will produce the response a = 1, regardless of the value of p0.
The unsupervised Hebb rule can also be written in vector notation as

W(q) = W(q-1) + a(q) p'(q)

As with all unsupervised rules, learning is performed in response to a series of inputs presented in time (the training sequence).

At each iteration, the output 'a' is calculated in response to the input 'p', and then the weights 'W' are updated with the Hebb rule.
The associator would start so that it would
initially respond to the sight and not the
smell of the banana

• The associator would be repeatedly exposed to the banana. While the network's smell sensor would operate reliably, the shape sensor would operate only intermittently (on even time steps).

• The training sequence would contain repetitions of the following two sets of inputs:
The weight w0, representing the weight for the unconditioned stimulus p0, will remain constant, while
'w', associated with the conditioned stimulus 'p', will be updated at each iteration using the unsupervised Hebb rule with a learning rate of 1.
The output for the first iteration (p0(1) = 0, p(1) = 1):

the smell alone did not generate a response. Without a response, the Hebb rule does not alter 'w'.
• In the second iteration, both the banana's smell and shape are detected: p0(2) = 1, p(2) = 1

• Because the smell stimulus and the response have occurred simultaneously, the Hebb rule increases the weight between them
• When the sight detector fails again, in the third iteration, the network responds anyway. It has made a useful association between the smell of a banana and its response: p0(3) = 0, p(3) = 1

• From now on the network is capable of responding to bananas that are detected either by sight or smell, even if the sight sensor suffers intermittent faults.
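The whole story can be simulated in a few lines of MATLAB (a sketch: w0 = 1 is one valid choice greater than -b = 0.5, and the sight sensor is made to work only on the second step):

% Banana associator: w0 = 1 (> -b), w starts at 0, Hebb rule with rate 1
hardlim_f = @(n) double(n >= 0);
w0 = 1;  w = 0;  b = -0.5;
p0 = [0 1 0];  p = [1 1 1];           % sight works only on the second step

for q = 1:3
    a = hardlim_f(w0 * p0(q) + w * p(q) + b);   % network response
    w = w + a * p(q);                           % w grows only when a = 1
    fprintf('step %d: a = %d, w = %d\n', q, a, w);
end
% step 1: smell alone, no response; step 2: association formed;
% step 3: smell alone now triggers the response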
Example

Classical conditioning can be modeled with a Hebbian synapse.
Consider an unconditioned stimulus (an air puff),
an unconditioned response (an eye blink),
a conditioned stimulus (a tone) and a conditioned response (an eye blink).
Under normal conditions, an animal responds to an air puff with an eye blink.
It does not respond to a tone with an eye blink.

If the tone is paired with the air puff several times, then the animal acquires an association between the air puff and the tone, and will now respond to the tone alone with an eye blink.

Consider a "black box" neural approach where one neuron receives input from the air puff and from the tone. The neuron's output represents the eye blink. The neuron is wired in such a way that at first, the air puff but not the tone activates the neuron and produces an output.

If we apply the air-puff input and the tone input together several times, then the neuron is active while the tone input is active, and a Hebbian learning rule will reinforce the strength of the connection between the tone and the neuron.

This means that after a few trials, the tone alone will be able to activate the neuron.
Associative Memory Neural Network
Associative memory neural networks are nets in which the weights are determined in such a way that the net can store a set of P pattern associations.
Each association is a pair of vectors [s(p), t(p)], with p = 1, 2, …, P.
The net will find an appropriate output vector that corresponds to an input vector X that may be either one of the stored patterns s(p) or a new pattern (e.g. one corrupted by noise).

http://www.slideshare.net/zaripices/fundamentalsofneuralnetworkslaurenefausett

Laurene Fausett, Fundamentals of Neural Networks, p. 110-118



Suppose a net is to be trained to store the following mappings from input row vectors s = (s1, s2, s3, s4) to output row vectors t = (t1, t2):

For s1 = (1, 0, 0, 0): t = (1, 0).   For s2 = (1, 1, 0, 0): t = (1, 0).
For s3 = (0, 0, 0, 1): t = (0, 1).   For s4 = (0, 0, 1, 1): t = (0, 1).
Training by Hebb Rule

wij(new) = wij(old) + si tj,  i.e.  Δwij = si tj

The weight matrix to be found is

W = [ w11 w12
      w21 w22
      w31 w32
      w41 w42 ]
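A MATLAB sketch of the Hebb-rule storage for these four pairs; stacking the s(p) as rows of S and the t(p) as rows of T, the stored weight matrix is simply S'*T:

% Hebb-rule storage: W = sum over p of s(p)' * t(p) = S' * T
S = [1 0 0 0; 1 1 0 0; 0 0 0 1; 0 0 1 1];   % rows are s(1)..s(4)
T = [1 0; 1 0; 0 1; 0 1];                   % rows are t(1)..t(4)
W = S' * T                                  % 4x2; gives [2 0; 1 0; 0 1; 0 2]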
AUTOASSOCIATIVE NET [p. 121-125 of Fausett]
For an autoassociative net, the training input and target output are identical.

The process of training is often called STORING the vectors, which may be binary or bipolar.

A stored vector can be retrieved from distorted or partial (noisy) input if the input is sufficiently similar to it.

The performance of the net is judged by its ability to reproduce a stored pattern from noisy input.

Performance in general is better for bipolar vectors.

In general, for autoassociative nets, setting the weights on the diagonal (those which would connect an input pattern component to the corresponding component in the output pattern) to zero gives better results.
Self-Organizing Maps
Self-organizing maps are a special class of artificial neural networks based on competitive unsupervised learning.

During the competitive learning process the neurons are tuned selectively, that is, the training data select the winning neurons.
The Kohonen neural network does not use any sort of activation function.
The Kohonen neural network does not use any sort of bias weight.

The most significant difference between the Kohonen neural network and the feedforward backpropagation neural network is that the Kohonen network is trained in an unsupervised mode.

This means that the Kohonen network is presented with data, but the correct output that corresponds to that data is not specified. Using the Kohonen network, this data can be classified into groups.
 The competitive learning rule defines the change Δwij applied to synaptic weight wij as

Δwij = η (xi - wij) if neuron j wins the competition
Δwij = 0 if neuron j loses the competition

where xi is the input signal and η is the learning rate parameter.
 The overall effect of the competitive learning rule resides in moving the synaptic weight vector Wj of the winning neuron j towards the input pattern X. The matching criterion is equivalent to the minimum Euclidean distance between vectors:

d = ||X - Wj|| = [ Σ (i = 1 to n) (xi - wij)^2 ]^(1/2)
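A minimal MATLAB sketch of one winner-take-all learning step (weights, input, and learning rate are illustrative; the row-wise subtraction uses implicit expansion):

% One winner-take-all learning step
eta = 0.3;
W = [0.2 0.8; 0.9 0.1; 0.5 0.5];         % one weight row per neuron
x = [1 0];                               % illustrative input pattern

d = sqrt(sum((W - x).^2, 2));            % Euclidean distance of each row to x
[~, j] = min(d);                         % winning neuron: minimum distance
W(j,:) = W(j,:) + eta * (x - W(j,:));    % move only the winner toward x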
Example clustering result:
(1 1 0 0) → cluster c2
(1 0 0 0) → cluster c2
(0 0 0 1) → cluster c1
(0 0 1 1) → cluster c1
