
Machine Learning for Data Mining

Week 5: Neural Networks


Christof Monz
Overview

Perceptrons

Gradient descent search

Multi-layer neural networks

The backpropagation algorithm


Neural Networks

Analogy to biological neural systems, the most robust learning systems we know

Attempt to understand natural biological systems through computational modeling

Massive parallelism allows for computational efficiency

Helps understand the distributed nature of neural representations

Intelligent behavior as an emergent property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms
Neural Network Learning

Learning approach based on modeling adaptation in biological neural systems

Perceptron: initial algorithm for learning simple neural networks (single layer), developed in the 1950s

Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s
Real Neurons

Human Neural Network

Modeling Neural Networks

Perceptrons

A perceptron is a single-layer neural network with one output unit

The output of a perceptron is computed as follows:

o(x_1 ... x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise

Assuming a dummy input x_0 = 1, we can write:

o(x_1 ... x_n) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and -1 otherwise
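The thresholded output above can be sketched in a few lines of Python; the weights and inputs below are hypothetical examples, not taken from the slides.

```python
def perceptron_output(weights, inputs):
    """Compute o(x_1 ... x_n), with weights[0] = w_0 acting on a dummy input x_0 = 1."""
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if activation > 0 else -1

# Example with hypothetical weights w = (w_0, w_1, w_2) = (-0.5, 1.0, 1.0):
print(perceptron_output([-0.5, 1.0, 1.0], [1, 0]))  # -> 1
print(perceptron_output([-0.5, 1.0, 1.0], [0, 0]))  # -> -1
```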
Perceptrons

Learning a perceptron involves choosing the right values for the weights w_0 ... w_n

The set of candidate hypotheses is H = {w | w ∈ R^{n+1}}
Representational Power of Perceptrons

A single perceptron can represent many boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g., XOR)
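As an illustration, the sketch below uses one hypothetical weight vector under which a perceptron computes AND; no single weight vector reproduces XOR, since its positive and negative examples are not linearly separable.

```python
def perceptron(w, x):
    """Perceptron output with w[0] = w_0 acting on a dummy input x_0 = 1."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1

# Hypothetical weights realizing AND: only (1, 1) clears the threshold.
w_and = [-1.5, 1.0, 1.0]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(w_and, x))  # +1 only for (1, 1)
```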
Perceptron Training Rule

The perceptron training rule can be defined for each weight as:

w_i ← w_i + Δw_i where Δw_i = η (t - o) x_i

where t is the target output, o is the output of the perceptron, and η is the learning rate

This scenario assumes that we know what the target outputs are supposed to be like
Perceptron Training Rule Example

If t = o then (t - o) x_i = 0 and Δw_i = 0, i.e. the weight w_i remains unchanged, regardless of the learning rate and the input values (i.e. x_i)

Let's assume a learning rate of η = 0.1 and an input value of x_i = 0.8

If t = +1 and o = -1, then Δw_i = 0.1 (1 - (-1)) 0.8 = 0.16

If t = -1 and o = +1, then Δw_i = 0.1 (-1 - 1) 0.8 = -0.16
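The Δw_i arithmetic above can be checked directly; the helper name delta_w is ours, not from the slides.

```python
def delta_w(eta, t, o, x_i):
    """Perceptron training rule update: delta_w_i = eta * (t - o) * x_i."""
    return eta * (t - o) * x_i

# The two cases from the example, with eta = 0.1 and x_i = 0.8:
print(round(delta_w(0.1, +1, -1, 0.8), 2))  # -> 0.16
print(round(delta_w(0.1, -1, +1, 0.8), 2))  # -> -0.16
```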
Perceptron Training Rule

The perceptron training rule converges after a finite number of iterations

The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if |Δw|_{L1} < θ

But only if the training examples are linearly separable
The Delta Rule

The delta rule overcomes the shortcoming of the perceptron training rule of not being guaranteed to converge if the examples are not linearly separable

The delta rule is based on gradient descent search

Let's assume we have an unthresholded perceptron: o(x) = w · x

We can define the training error as:

E(w) = (1/2) Σ_{d∈D} (t_d - o_d)²

where D is the set of training examples
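The training error above can be sketched as follows; the data set D below is made up purely for illustration.

```python
def output(w, x):
    """Unthresholded perceptron: o(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, data):
    """E(w) = 1/2 * sum over (x, t) in D of (t - o(x))^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

# Hypothetical training set of (input vector, target) pairs:
D = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(training_error([0.5, -0.5], D))  # -> 0.25
```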
Error Surface
Gradient Descent

The gradient of E is the vector pointing in the


direction of the steepest increase for any point
on the error surface
E(w) =

E
w
0
,
E
w
1
, . . . ,
E
w
n

Since we are interested in minimizing the error,


we consider negative gradients: E(w)

The training rule for gradient descent is:


w w+w
where w = E(w)
Gradient Descent

The training rule for individual weights is defined as w_i ← w_i + Δw_i where Δw_i = -η ∂E/∂w_i

Instantiating E for the error function we use gives:

∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d - o_d)²

How do we use partial derivatives to actually compute updates to weights at each step?
Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d - o_d)²
        = (1/2) Σ_{d∈D} ∂/∂w_i (t_d - o_d)²
        = (1/2) Σ_{d∈D} 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_{d∈D} (t_d - o_d) ∂/∂w_i (t_d - w · x_d)

∂E/∂w_i = Σ_{d∈D} (t_d - o_d) (-x_id)
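The derivative can be verified numerically: for a linear unit, the analytic gradient -Σ_d (t_d - o_d) x_id should agree with a finite-difference estimate of ∂E/∂w_i. The data and weights below are hypothetical.

```python
def analytic_grad(w, data, i):
    """dE/dw_i = -sum_d (t_d - o_d) * x_id for the linear unit o = w . x."""
    return -sum((t - sum(wj * xj for wj, xj in zip(w, x))) * x[i]
                for x, t in data)

def numeric_grad(w, data, i, eps=1e-6):
    """Central finite-difference estimate of dE/dw_i."""
    def E(w):
        return 0.5 * sum((t - sum(wj * xj for wj, xj in zip(w, x))) ** 2
                         for x, t in data)
    w_hi = list(w); w_hi[i] += eps
    w_lo = list(w); w_lo[i] -= eps
    return (E(w_hi) - E(w_lo)) / (2 * eps)

D = [([1.0, 2.0], 1.0), ([0.5, -1.0], -1.0)]
w = [0.3, -0.2]
print(abs(analytic_grad(w, D, 0) - numeric_grad(w, D, 0)) < 1e-4)  # -> True
```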
Gradient Descent

The delta rule for individual weights can now be written as w_i ← w_i + Δw_i where Δw_i = η Σ_{d∈D} (t_d - o_d) x_id

The gradient descent algorithm
picks initial random weights
computes the outputs
updates each weight by adding Δw_i
repeats until convergence
The Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩

1     Initialize each w_i to some small random value
2     Until the termination condition is met do:
2.1   Initialize each Δw_i to 0
2.2   For each ⟨x, t⟩ ∈ D do
2.2.1   Compute o(x)
2.2.2   For each weight w_i do Δw_i ← Δw_i + η (t - o) x_i
2.3   For each weight w_i do w_i ← w_i + Δw_i
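The numbered steps above can be sketched directly for a linear unit o(x) = w · x; the learning rate, epoch count, and training data are illustrative choices, not prescribed by the slides.

```python
import random

def gradient_descent(data, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # step 1
    for _ in range(epochs):                               # step 2
        delta = [0.0] * n                                 # step 2.1
        for x, t in data:                                 # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, x))      # step 2.2.1
            for i in range(n):                            # step 2.2.2
                delta[i] += eta * (t - o) * x[i]
        for i in range(n):                                # step 2.3
            w[i] += delta[i]
    return w

# Learn t = 2 * x_1 from hypothetical one-dimensional examples:
D = [([1.0], 2.0), ([2.0], 4.0), ([-1.0], -2.0)]
w = gradient_descent(D, 1)
print(round(w[0], 2))  # -> 2.0
```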
The Gradient Descent Algorithm

The gradient descent algorithm will find the global minimum, provided that the learning rate is small enough

If the learning rate is too large, the algorithm runs the risk of overstepping the global minimum

It is a common strategy to gradually decrease the learning rate

This algorithm also works in case the training examples are not linearly separable
Shortcomings of Gradient Descent

Converging to a minimum can be quite slow (i.e. it can take thousands of steps); increasing the learning rate, on the other hand, can lead to overstepping minima

If there are multiple local minima in the error surface, gradient descent can get stuck in one of them and not find the global minimum

Stochastic gradient descent alleviates these difficulties
Stochastic Gradient Descent

Gradient descent updates the weights after summing over all training examples

Stochastic (or incremental) gradient descent updates weights incrementally, after calculating the error for each individual training example

To this end, step 2.3 is deleted and step 2.2.2 modified
Stochastic Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩

1     Initialize each w_i to some small random value
2     Until the termination condition is met do:
2.1   Initialize each Δw_i to 0
2.2   For each ⟨x, t⟩ ∈ D do
2.2.1   Compute o(x)
2.2.2   For each weight w_i do w_i ← w_i + η (t - o) x_i
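The stochastic variant can be sketched the same way: the weights are updated immediately after each example, as in the modified step 2.2.2. Again, η, the epoch count, and the data are illustrative.

```python
import random

def stochastic_gradient_descent(data, n, eta=0.05, epochs=500):
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]   # immediate per-example update
    return w

# Same hypothetical examples as before (t = 2 * x_1):
D = [([1.0], 2.0), ([2.0], 4.0), ([-1.0], -2.0)]
w = stochastic_gradient_descent(D, 1)
print(round(w[0], 2))  # -> 2.0
```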
Comparison

In standard gradient descent, summing over multiple examples requires more computation per weight update step

As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent

Stochastic gradient descent can avoid falling into local minima because it uses the different E_d(w) rather than the overall E(w) to guide its search
Multi-Layer Neural Networks

Perceptrons only have two layers: the input layer and the output layer

Perceptrons only have one output unit

Perceptrons are limited in their expressiveness

Multi-layer neural networks consist of an input layer, a hidden layer, and an output layer

Multi-layer neural networks can have several output units
Multi-Layer Neural Networks

The units of the hidden layer function as input units to the next layer

However, multiple layers of linear units still produce only linear functions

The step function used in perceptrons is another choice, but it is not differentiable, and therefore not suitable for gradient descent search

Solution: the sigmoid function, a non-linear, differentiable threshold function
Sigmoid Unit
The Sigmoid Function

The output is computed as o = σ(w · x), where σ(y) = 1 / (1 + e^{-y}), i.e. o = σ(w · x) = 1 / (1 + e^{-w·x})

Another nice property of the sigmoid function is that its derivative is easily expressed:

dσ(y)/dy = σ(y) (1 - σ(y))
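The sigmoid and its derivative translate directly into code:

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_deriv(y):
    """d sigma(y)/dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0))        # -> 0.5
print(sigmoid_deriv(0.0))  # -> 0.25
```

The derivative peaks at y = 0 and vanishes for large |y|, which is what makes it convenient for gradient-based training.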
Learning Weights with Multiple Layers

Gradient descent search can be used to train multi-layer neural networks, but the algorithm has to be adapted

Firstly, there can be multiple output units, and therefore the error function has to be generalized:

E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd - o_kd)²

Secondly, the error feedback has to be fed through multiple layers
Backpropagation Algorithm
For each training example ⟨x, t⟩ do:

1. Input x to the network and compute o_u for every unit u in the network
2. For each output unit k calculate its error δ_k:
   δ_k ← o_k (1 - o_k) (t_k - o_k)
3. For each hidden unit h calculate its error δ_h:
   δ_h ← o_h (1 - o_h) Σ_{k∈outputs} w_kh δ_k
4. Update each network weight w_ji:
   w_ji ← w_ji + Δw_ji where Δw_ji = η δ_j x_ji

Note: x_ji is the input value from unit i to unit j, and w_ji is the weight connecting unit i to unit j
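The four steps can be sketched for a network with one hidden layer of sigmoid units. The layer sizes, initial weights, and η below are illustrative assumptions; the step comments map back to the algorithm.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One backpropagation update; w_hidden[h][i] and w_out[k][h] are weight matrices."""
    # Step 1: forward pass, computing o_u for every unit.
    o_hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w_row, x)))
                for w_row in w_hidden]
    o_out = [sigmoid(sum(wi * oi for wi, oi in zip(w_row, o_hidden)))
             for w_row in w_out]
    # Step 2: output-unit errors delta_k = o_k (1 - o_k)(t_k - o_k).
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(o_out, t)]
    # Step 3: hidden-unit errors delta_h = o_h (1 - o_h) * sum_k w_kh delta_k.
    d_hidden = [o_hidden[h] * (1 - o_hidden[h]) *
                sum(w_out[k][h] * d_out[k] for k in range(len(d_out)))
                for h in range(len(o_hidden))]
    # Step 4: weight updates delta_w_ji = eta * delta_j * x_ji.
    for k, w_row in enumerate(w_out):
        for h in range(len(w_row)):
            w_row[h] += eta * d_out[k] * o_hidden[h]
    for h, w_row in enumerate(w_hidden):
        for i in range(len(w_row)):
            w_row[i] += eta * d_hidden[h] * x[i]
    return o_out
```

Repeated calls on the same example drive the output toward the target, since each call performs one gradient step on that example's error.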
Backpropagation Algorithm

Step 1 propagates the input forward through the network

Steps 2-4 propagate the errors backward through the network

Step 2 is similar to the delta rule in gradient descent (step 2.3)

Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)
Applications of Neural Networks

Text to speech

Fraud detection

Automated vehicles

Game playing

Handwriting recognition
Summary

Perceptrons: simple single-layer neural networks

Perceptron training rule

Gradient descent search

Multi-layer neural networks

Backpropagation algorithm
