
Foundations of Machine Learning

Module 6: Neural Network


Part C: Neural Network and
Backpropagation Algorithm

Sudeshna Sarkar
IIT Kharagpur
Single layer Perceptron
• Single layer perceptrons learn linear decision boundaries.
[Figure: two scatter plots in the (x1, x2) plane, with x: class I (y = 1) and o: class II (y = -1). In the first plot the two classes can be separated by a straight line; in the second, XOR-like plot no single line separates them.]

Boolean OR

  input x1   input x2   output
      0          0         0
      0          1         1
      1          0         1
      1          1         1

A single perceptron computes OR with bias weight w0 = -0.5 and input weights w1 = 1, w2 = 1 on x1, x2.
[Figure: the four input points in the (x1, x2) plane; only (0, 0) lies on the negative side of the decision boundary.]
Boolean AND

  input x1   input x2   output
      0          0         0
      0          1         0
      1          0         0
      1          1         1

A single perceptron computes AND with bias weight w0 = -1.5 and input weights w1 = 1, w2 = 1 on x1, x2 (see the sketch below).
[Figure: the four input points in the (x1, x2) plane; only (1, 1) lies on the positive side of the decision boundary.]
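
To make these weights concrete, here is a minimal Python sketch (my own illustration, not part of the original slides): a step-activation perceptron with bias w0 = -0.5 reproduces the OR truth table, and the same unit with bias w0 = -1.5 reproduces AND.

# A minimal sketch of a single perceptron with a step activation, using the
# weights from the slides: w0 = -0.5 realizes OR and w0 = -1.5 realizes AND,
# with w1 = w2 = 1 in both cases.

def perceptron(x1, x2, w0, w1=1, w2=1):
    """Fire (output 1) when the weighted sum w0 + w1*x1 + w2*x2 exceeds 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "OR:", perceptron(x1, x2, w0=-0.5),
              "AND:", perceptron(x1, x2, w0=-1.5))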
Boolean XOR

  input x1   input x2   output
      0          0         0
      0          1         1
      1          0         1
      1          1         0

[Figure: the four XOR points in the (x1, x2) plane; the positive and negative points cannot be separated by a single straight line, so no single perceptron can compute XOR.]
Boolean XOR

  input x1   input x2   output
      0          0         0
      0          1         1
      1          0         1
      1          1         0

[Figure: a two-layer network computing XOR. Hidden unit h1 computes OR (bias -0.5, weights 1, 1 from x1, x2) and hidden unit h2 computes AND (bias -1.5, weights 1, 1); the output unit has bias -0.5 with weight 1 from the OR unit and weight -1 from the AND unit (checked in the sketch below).]
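
The network above can be checked directly. The sketch below (my own, not from the slides) wires up the OR unit, the AND unit and the output unit with the weights shown and reproduces the XOR truth table.

# A minimal sketch of the two-layer XOR network from the slide: the hidden
# units compute OR and AND, and the output unit fires for "OR but not AND".

def step(net):
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h_or  = step(-0.5 + x1 + x2)               # OR unit:  bias -0.5, weights 1, 1
    h_and = step(-1.5 + x1 + x2)               # AND unit: bias -1.5, weights 1, 1
    return step(-0.5 + 1 * h_or - 1 * h_and)   # output:   bias -0.5, weights 1, -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table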
Representation Capability of NNs
• Single layer nets have limited representation power (the linear separability problem). Multi-layer nets (or nets with non-linear hidden units) may overcome the linear inseparability problem.
• Every Boolean function can be represented by a network with a single hidden layer.
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer.
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers.
Multilayer Network
[Figure: a feedforward network with an input layer, a first hidden layer, a second hidden layer, and an output layer; signals flow from the inputs to the outputs.]
Two-layer back-propagation neural network
[Figure: input signals x1, x2, ..., xn enter the input layer, pass through the hidden layer via weights w_ij, and reach the output layer via weights w_jk, producing outputs y1, y2, ..., y_n2. Error signals propagate backwards from the output layer towards the input layer.]
Derivation
• For one output neuron, the error function is
  $E = \frac{1}{2}(y - o)^2$
• For each unit $j$, the output $o_j$ is defined as
  $o_j = \varphi(net_j) = \varphi\left(\sum_{k=1}^{n} w_{kj}\, o_k\right)$
  The input $net_j$ to a neuron is the weighted sum of the outputs $o_k$ of the previous $n$ neurons.
• Finding the derivative of the error by the chain rule:
  $\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\, \frac{\partial o_j}{\partial net_j}\, \frac{\partial net_j}{\partial w_{ij}}$
• For an inner unit $j$, the error depends on $o_j$ only through the units $l$ that receive input from $j$, so
  $\frac{\partial E}{\partial w_{ij}} = \left(\sum_{l} \frac{\partial E}{\partial o_l}\, \frac{\partial o_l}{\partial net_l}\, w_{jl}\right) \varphi(net_j)\big(1 - \varphi(net_j)\big)\, o_i$
• In both cases this can be written as
  $\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$
  with
  $\delta_j = \frac{\partial E}{\partial o_j}\, \frac{\partial o_j}{\partial net_j} =
  \begin{cases}
    (o_j - y_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output neuron} \\
    \left(\sum_{l} \delta_l\, w_{jl}\right) o_j (1 - o_j) & \text{if } j \text{ is an inner neuron}
  \end{cases}$
• To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:
  $\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
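
As a sanity check on this derivation, the sketch below (my own illustration, not from the slides) compares the analytic gradient $\delta_j\, o_i$ for a single sigmoid output unit against a finite-difference estimate of $\partial E / \partial w$; the weights, inputs and target are arbitrary example values.

# Numerical check of dE/dw_ij = delta_j * o_i for one sigmoid output unit.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def error(w, x, y):
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (y - o) ** 2

w, x, y = [0.3, -0.2], [1.0, 0.5], 1.0          # weights, inputs o_i, target (arbitrary)
o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
delta = (o - y) * o * (1 - o)                   # output-unit delta from the derivation
analytic = [delta * xi for xi in x]             # dE/dw_ij = delta_j * o_i

eps = 1e-6                                      # central finite differences
numeric = [(error([w[0] + eps, w[1]], x, y) - error([w[0] - eps, w[1]], x, y)) / (2 * eps),
           (error([w[0], w[1] + eps], x, y) - error([w[0], w[1] - eps], x, y)) / (2 * eps)]
print(analytic, numeric)                        # the two should agree closely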
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
  • Input the training example to the network and compute the network outputs
  • For each output unit $k$:
    $\delta_k \leftarrow o_k (1 - o_k)(y_k - o_k)$
  • For each hidden unit $h$:
    $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k$
  • Update each network weight $w_{i,j}$:
    $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$
Notation: $x_d$ = input, $y_d$ = target output, $o_d$ = observed unit output, $w_{ij}$ = weight from unit $i$ to unit $j$.
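
This loop maps almost directly onto code. Below is a minimal Python sketch (not taken from the slides) that trains a 2-2-1 sigmoid network on XOR with exactly these update rules; the layer sizes, learning rate, random seed and epoch count are illustrative choices, and, as a later slide notes, such a small network may occasionally settle in a local minimum.

# Minimal backpropagation sketch: 2 inputs, 2 sigmoid hidden units, 1 sigmoid output.
import math, random

random.seed(0)
ETA = 0.5                                             # learning rate eta (illustrative)

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Weights per unit: [bias, w1, w2], initialized to small random numbers.
hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
output = [random.uniform(-0.5, 0.5) for _ in range(3)]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR training set

for epoch in range(10000):
    for (x1, x2), y in data:
        # Forward pass: compute the network outputs.
        h = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for w in hidden]
        o = sigmoid(output[0] + output[1] * h[0] + output[2] * h[1])

        # Output unit: delta_k = o_k (1 - o_k)(y_k - o_k)
        delta_o = o * (1 - o) * (y - o)
        # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_hk * delta_k
        delta_h = [h[j] * (1 - h[j]) * output[j + 1] * delta_o for j in range(2)]

        # Weight updates: w_ij <- w_ij + eta * delta_j * x_ij
        output[0] += ETA * delta_o * 1.0
        output[1] += ETA * delta_o * h[0]
        output[2] += ETA * delta_o * h[1]
        for j in range(2):
            hidden[j][0] += ETA * delta_h[j] * 1.0
            hidden[j][1] += ETA * delta_h[j] * x1
            hidden[j][2] += ETA * delta_h[j] * x2

for (x1, x2), y in data:                                      # inspect the trained net
    h = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for w in hidden]
    o = sigmoid(output[0] + output[1] * h[0] + output[2] * h[1])
    print(x1, x2, "target", y, "output", round(o, 3))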
Backpropagation
• Gradient descent over the entire network weight vector
• Can be generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• May include weight momentum $\alpha$:
  $\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$
• Training may be slow.
• Using the network after training is very fast.
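
As a rough illustration of the momentum rule (my own sketch, not from the slides), the closure below remembers the previous update $\Delta w(n-1)$ and adds a fraction $\alpha$ of it to the new one; the values of eta and alpha are arbitrary.

# Momentum update: Delta_w(n) = eta * delta_j * x_ij + alpha * Delta_w(n-1)
def make_momentum_update(eta=0.5, alpha=0.9):        # illustrative values
    prev_dw = 0.0                                    # remembered Delta_w(n-1)
    def update(w, delta_j, x_ij):
        nonlocal prev_dw
        prev_dw = eta * delta_j * x_ij + alpha * prev_dw
        return w + prev_dw                           # new weight: w + Delta_w(n)
    return update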
Training practices: batch vs. stochastic vs. mini-batch gradient descent
• Batch gradient descent:
  1. Calculate outputs for the entire dataset
  2. Accumulate the errors, back-propagate and update
  – Too slow to converge; gets stuck in local minima
• Stochastic/online gradient descent:
  1. Feed forward a training example
  2. Back-propagate the error and update the parameters
  – Converges to the solution faster; often helps get the system out of local minima
• Mini-batch gradient descent: update the parameters after each small batch of examples (see the sketch below)
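
The schematic sketch below (not from the slides) contrasts the three schedules; forward_backward, apply_update and average are hypothetical helpers standing in for the forward/backward pass on one example, the weight update, and gradient averaging.

# Schematic comparison of the three update schedules (helpers are hypothetical).
def batch_gd(data, epochs):
    for _ in range(epochs):
        grads = [forward_backward(x, y) for x, y in data]    # whole dataset
        apply_update(average(grads))                         # one update per epoch

def stochastic_gd(data, epochs):
    for _ in range(epochs):
        for x, y in data:                                    # one example at a time
            apply_update(forward_backward(x, y))             # update per example

def minibatch_gd(data, epochs, batch_size=32):
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]                   # small batch of examples
            apply_update(average([forward_backward(x, y) for x, y in batch]))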
Learning in epochs
• Train the NN on the entire training set over and over again.
• Each such episode of training is called an "epoch".

Stopping
1. Fixed maximum number of epochs: the most naïve rule
2. Keep track of the training and validation error curves (see the sketch below)
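
A minimal sketch (not from the slides) of stopping rule 2: track the validation error after each epoch and stop when it no longer improves. train_one_epoch and validation_error are hypothetical helpers, and the patience threshold is an illustrative choice.

# Early stopping based on the validation error curve.
def train_with_early_stopping(max_epochs=1000, patience=10):
    best_err, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):              # rule 1: hard cap on epochs
        train_one_epoch()                        # one pass over the training set
        err = validation_error()                 # error on the held-out validation set
        if err < best_err:
            best_err, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # validation error stopped improving
    return best_err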
Overfitting in ANNs
Local Minima
• NNs can get stuck in local minima for small networks.
• For most large networks (many weights), local minima rarely occur.
• It is unlikely that you are at a minimum in every dimension simultaneously.
ANN
• Highly expressive non-linear functions
• Highly parallel network of logistic function units
• Minimizes sum of squared training errors
• Can add a regularization term (sum of squared weights)
• Local minima
• Overfitting
Thank You
