
Machine Learning for Data Mining

Week 5: Neural Networks


Christof Monz
Overview

Perceptrons

Gradient descent search

Multi-layer neural networks

The backpropagation algorithm


Neural Networks

Analogy to biological neural systems, the most robust learning systems we know

Attempt to understand natural biological systems through computational modeling

Massive parallelism allows for computational efficiency

Helps understand the distributed nature of neural representations

Intelligent behavior as an emergent property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms
Neural Network Learning

Learning approach based on modeling adaptation in biological neural systems

Perceptron: initial algorithm for learning simple neural networks (single layer), developed in the 1950s

Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s
Real Neurons

Human Neural Network

Modeling Neural Networks

Perceptrons

A perceptron is a single-layer neural network with one output unit

The output of a perceptron is computed as follows:

o(x_1 ... x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise

Assuming a dummy input x_0 = 1, we can write:

o(x_1 ... x_n) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and -1 otherwise
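The thresholded output above can be sketched in a few lines of Python; the weights and inputs below are hypothetical examples, not taken from the slides.

```python
def perceptron_output(weights, inputs):
    """Compute o(x_1 ... x_n), with weights[0] = w_0 acting on a dummy input x_0 = 1."""
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if activation > 0 else -1

# Example with hypothetical weights w = (w_0, w_1, w_2) = (-0.5, 1.0, 1.0):
print(perceptron_output([-0.5, 1.0, 1.0], [1, 0]))  # -> 1
print(perceptron_output([-0.5, 1.0, 1.0], [0, 0]))  # -> -1
```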
Perceptrons

Learning a perceptron involves choosing the right values for the weights w_0 ... w_n

The set of candidate hypotheses is H = {w | w ∈ R^{n+1}}
Representational Power of Perceptrons

A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g., XOR)
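As an illustration, the sketch below uses one hypothetical weight vector under which a perceptron computes AND; no single weight vector reproduces XOR, since its positive and negative examples are not linearly separable.

```python
def perceptron(w, x):
    """Perceptron output with w[0] = w_0 acting on a dummy input x_0 = 1."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

# Hypothetical weights realizing AND: only (1, 1) clears the threshold.
w_and = [-1.5, 1.0, 1.0]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(w_and, x))  # +1 only for (1, 1)
```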
Perceptron Training Rule

The perceptron training rule can be defined for each weight as:

w_i ← w_i + Δw_i where Δw_i = η (t - o) x_i

where t is the target output, o is the output of the perceptron, and η is the learning rate

This scenario assumes that we know what the target outputs are supposed to be like
Perceptron Training Rule Example

If t = o then (t - o) x_i = 0 and Δw_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input values (i.e. x_i)

Let's assume a learning rate of η = 0.1 and an input value of x_i = 0.8

If t = +1 and o = -1, then Δw_i = 0.1 (1 - (-1)) 0.8 = 0.16

If t = -1 and o = +1, then Δw_i = 0.1 (-1 - 1) 0.8 = -0.16
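The Δw_i arithmetic above can be checked directly; the helper name delta_w is ours, not from the slides.

```python
def delta_w(eta, t, o, x_i):
    """Perceptron training rule update: delta_w_i = eta * (t - o) * x_i."""
    return eta * (t - o) * x_i

# The two cases from the example, with eta = 0.1 and x_i = 0.8:
print(round(delta_w(0.1, +1, -1, 0.8), 2))  # -> 0.16
print(round(delta_w(0.1, -1, +1, 0.8), 2))  # -> -0.16
```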
Perceptron Training Rule

The perceptron training rule converges after a finite number of iterations

The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if |Δw|_{L1} < θ

But only if the training examples are linearly separable
The Delta Rule

The delta rule overcomes the shortcoming of the perceptron training rule of not being guaranteed to converge if the examples are not linearly separable

The delta rule is based on gradient descent search

Let's assume we have an unthresholded perceptron: o(x) = w · x

We can define the training error as:

E(w) = (1/2) Σ_{d∈D} (t_d - o_d)²

where D is the set of training examples
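The training error above can be sketched as follows; the data set D below is made up purely for illustration.

```python
def output(w, x):
    """Unthresholded perceptron: o(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, data):
    """E(w) = 1/2 * sum over (x, t) in D of (t - o(x))^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

# Hypothetical training set of (input vector, target) pairs:
D = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(training_error([0.5, -0.5], D))  # -> 0.25
```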
Error Surface
Gradient Descent

The gradient of E is the vector pointing in the


direction of the steepest increase for any point
on the error surface
E(w) =

E
w
0
,
E
w
1
, . . . ,
E
w
n

Since we are interested in minimizing the error,


we consider negative gradients: E(w)

The training rule for gradient descent is:


w w+w
where w = E(w)
Gradient Descent

The training rule for individual weights is defined as w_i ← w_i + Δw_i where Δw_i = -η ∂E/∂w_i

Instantiating E for the error function we use gives:

∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d - o_d)²

How do we use partial derivatives to actually compute updates to weights at each step?
Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d - o_d)²
        = (1/2) Σ_{d∈D} ∂/∂w_i (t_d - o_d)²
        = (1/2) Σ_{d∈D} 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_{d∈D} (t_d - o_d) ∂/∂w_i (t_d - w · x_d)

∂E/∂w_i = Σ_{d∈D} (t_d - o_d) (-x_id)
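The derivative can be verified numerically: for a linear unit, the analytic gradient -Σ_d (t_d - o_d) x_id should agree with a finite-difference estimate of ∂E/∂w_i. The data and weights below are hypothetical.

```python
def analytic_grad(w, data, i):
    """dE/dw_i = -sum_d (t_d - o_d) * x_id for the linear unit o = w . x."""
    return -sum((t - sum(wj * xj for wj, xj in zip(w, x))) * x[i]
                for x, t in data)

def numeric_grad(w, data, i, eps=1e-6):
    """Central finite-difference estimate of dE/dw_i."""
    def E(w):
        return 0.5 * sum((t - sum(wj * xj for wj, xj in zip(w, x))) ** 2
                         for x, t in data)
    w_hi = list(w); w_hi[i] += eps
    w_lo = list(w); w_lo[i] -= eps
    return (E(w_hi) - E(w_lo)) / (2 * eps)

D = [([1.0, 2.0], 1.0), ([0.5, -1.0], -1.0)]
w = [0.3, -0.2]
print(abs(analytic_grad(w, D, 0) - numeric_grad(w, D, 0)) < 1e-4)  # -> True
```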
Gradient Descent

The delta rule for individual weights can now be written as w_i ← w_i + Δw_i where Δw_i = η Σ_{d∈D} (t_d - o_d) x_id

The gradient descent algorithm
picks initial random weights
computes the outputs
updates each weight by adding Δw_i
repeats until convergence
The Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩

1     Initialize each w_i to some small random value
2     Until the termination condition is met do:
2.1   Initialize each Δw_i to 0
2.2   For each ⟨x, t⟩ ∈ D do
2.2.1   Compute o(x)
2.2.2   For each weight w_i do Δw_i ← Δw_i + η (t - o) x_i
2.3   For each weight w_i do w_i ← w_i + Δw_i
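The numbered steps above can be sketched directly for a linear unit o(x) = w · x; the learning rate, epoch count, and training data are illustrative choices, not prescribed by the slides.

```python
import random

def gradient_descent(data, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # step 1
    for _ in range(epochs):                               # step 2
        delta = [0.0] * n                                 # step 2.1
        for x, t in data:                                 # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, x))      # step 2.2.1
            for i in range(n):                            # step 2.2.2
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):                                # step 2.3
            w[i] += delta[i]
    return w

# Learn t = 2 * x_1 from hypothetical one-dimensional examples:
D = [([1.0], 2.0), ([2.0], 4.0), ([-1.0], -2.0)]
w = gradient_descent(D, 1)
print(round(w[0], 2))  # -> 2.0
```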
The Gradient Descent Algorithm

The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough

If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum

It is a common strategy to gradually decrease the learning rate

This algorithm also works in case the training examples are not linearly separable
Shortcomings of Gradient Descent

Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima

If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and not find the global minimum

Stochastic gradient descent alleviates these difficulties
Stochastic Gradient Descent

Gradient descent updates the weights after summing over all training examples

Stochastic (or incremental) gradient descent updates weights incrementally, after calculating the error for each individual training example

To this end, step 2.3 is deleted and step 2.2.2 modified
Stochastic Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩

1     Initialize each w_i to some small random value
2     Until the termination condition is met do:
2.1   Initialize each Δw_i to 0
2.2   For each ⟨x, t⟩ ∈ D do
2.2.1   Compute o(x)
2.2.2   For each weight w_i do w_i ← w_i + η (t - o) x_i
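The stochastic variant can be sketched the same way: the weights are updated immediately after each example, as in the modified step 2.2.2. Again, η, the epoch count, and the data are illustrative.

```python
import random

def stochastic_gradient_descent(data, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]   # immediate per-example update
    return w

# Same hypothetical examples as before (t = 2 * x_1):
D = [([1.0], 2.0), ([2.0], 4.0), ([-1.0], -2.0)]
w = stochastic_gradient_descent(D, 1)
print(round(w[0], 2))  # -> 2.0
```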
Comparison

In standard gradient descent, summing over multiple examples requires more computation per weight update step

As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent

Stochastic gradient descent can avoid falling into local minima because it uses the different E_d(w) rather than the overall E(w) to guide its search
Multi-Layer Neural Networks

Perceptrons only have two layers: the input layer and the output layer

Perceptrons only have one output unit

Perceptrons are limited in their expressiveness

Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer

Multi-layer neural networks can have several output units
Multi-Layer Neural Networks

The units of the hidden layer function as input units to the next layer

However, multiple layers of linear units still produce only linear functions

The step function used in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search

Solution: the sigmoid function, a non-linear, differentiable threshold function
Sigmoid Unit
The Sigmoid Function

The output is computed as o = σ(w · x), where σ(y) = 1 / (1 + e^{-y}), i.e. o = σ(w · x) = 1 / (1 + e^{-w·x})

Another nice property of the sigmoid function is that its derivative is easily expressed:

dσ(y)/dy = σ(y) (1 - σ(y))
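The sigmoid and its derivative translate directly into code:

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_deriv(y):
    """d sigma(y)/dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0))        # -> 0.5
print(sigmoid_deriv(0.0))  # -> 0.25
```

The derivative peaks at y = 0 and vanishes for large |y|, which is what makes it convenient for gradient-based training.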
Learning Weights with Multiple Layers

Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted

Firstly, there can be multiple output units, and therefore the error function has to be generalized:

E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd - o_kd)²

Secondly, the error feedback has to be fed through multiple layers
Backpropagation Algorithm
For each training example ⟨x, t⟩ do:

1. Input x to the network and compute o_u for every unit u in the network
2. For each output unit k calculate its error δ_k:
   δ_k ← o_k (1 - o_k) (t_k - o_k)
3. For each hidden unit h calculate its error δ_h:
   δ_h ← o_h (1 - o_h) Σ_{k∈outputs} w_kh δ_k
4. Update each network weight w_ji:
   w_ji ← w_ji + Δw_ji where Δw_ji = η δ_j x_ji

Note: x_ji is the input value from unit i to unit j, and w_ji is the weight connecting unit i to unit j
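The four steps can be sketched for a network with one hidden layer of sigmoid units. The layer sizes, initial weights, and η below are illustrative assumptions; the step comments map back to the algorithm.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One backpropagation update; w_hidden[h][i] and w_out[k][h] are weight matrices."""
    # Step 1: forward pass, computing o_u for every unit.
    o_hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w_row, x)))
                for w_row in w_hidden]
    o_out = [sigmoid(sum(wi * oi for wi, oi in zip(w_row, o_hidden)))
             for w_row in w_out]
    # Step 2: output-unit errors delta_k = o_k (1 - o_k)(t_k - o_k).
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(o_out, t)]
    # Step 3: hidden-unit errors delta_h = o_h (1 - o_h) * sum_k w_kh delta_k.
    d_hidden = [o_hidden[h] * (1 - o_hidden[h]) *
                sum(w_out[k][h] * d_out[k] for k in range(len(d_out)))
                for h in range(len(o_hidden))]
    # Step 4: weight updates delta_w_ji = eta * delta_j * x_ji.
    for k, w_row in enumerate(w_out):
        for h in range(len(w_row)):
            w_row[h] += eta * d_out[k] * o_hidden[h]
    for h, w_row in enumerate(w_hidden):
        for i in range(len(w_row)):
            w_row[i] += eta * d_hidden[h] * x[i]
    return o_out
```

Repeated calls on the same example drive the output toward the target, since each call performs one gradient step on that example's error.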
Backpropagation Algorithm

Step 1 propagates the input forward through the network

Steps 2-4 propagate the errors backward through the network

Step 2 is similar to the delta rule in gradient descent (step 2.3)

Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)
Applications of Neural Networks

Text to speech

Fraud detection

Automated vehicles

Game playing

Handwriting recognition
Summary

Perceptrons: simple single-layer neural networks

Perceptron training rule

Gradient descent search

Multi-layer neural networks

Backpropagation algorithm
