
J. Kubalik, Gerstner Laboratory for Intelligent Decision Making and Control


Artificial Neural Networks: Motivation
Machine learning paradigm motivated by biological learning systems
Human brain:
interconnected network of 10^11 neurons
each connected to 10^4 others
neuron switching time 10^-3 seconds
surprisingly complex decisions, surprisingly quickly (10^-1 s)
information processing abilities follow from this highly parallel processing
robust & fault tolerant
flexible
deals with fuzzy, noisy, or inconsistent information
Schematic Drawing of a Biological Neuron
dendrites - inputs
axon - output
synaptic junctions - excitatory or inhibitory
activation potential, threshold and firing of cell
transmission of signals from receptors (which receive stimuli
from the environment) to effectors (executive units)
Appropriate Problems for ANN
Problems:
instances are represented by many attribute-value pairs
target function output may be discrete- or real- valued
training examples may contain errors
long training times are acceptable
fast evaluation of the learned function is required
generalisation abilities
Implementation issues:
What is the best architecture - number of neurons, layers,
connections between units, ...
How to learn ANN - how many examples, how many learning
cycles, type of training examples, ...
What can the network do - what tasks, how well, how fast, ...
Perceptron - Formal Model of Neuron





x_1, ..., x_n - inputs
w_1, ..., w_n - synaptic weights
w_0 - threshold
ξ = Σ_{i=1}^{n} w_i x_i − w_0 - internal potential of the neuron
o() - activation function: o(ξ) = sgn(ξ)
y - output: y = o(ξ)
Geometric interpretation of the function of Single Perceptron
[Figure: a single perceptron's decision hyperplane separating two classes]
Geometric interpretation of the function
of Single Perceptron
A single perceptron performs classification into 2 classes,
with the decision hyperplane
w·x = 0
only linearly separable sets of examples
Perceptron can represent all primitive boolean functions
AND, OR, NAND, NOR (but not XOR).
Every boolean function can be represented by some
network of perceptrons
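As an illustration of the claim above, here is a minimal Python sketch of a perceptron computing AND and OR. The weights are hand-picked for the demo (they are not unique), and a 0/1 step is used in place of sgn for boolean readability:

```python
# Hand-picked weights (illustrative) realise AND and OR over {0, 1} inputs.

def perceptron(x, w, w0):
    """Threshold unit: fires iff the weighted input sum plus w0 is positive."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if s > 0 else 0

AND = lambda x: perceptron(x, w=[1, 1], w0=-1.5)  # fires only for (1, 1)
OR  = lambda x: perceptron(x, w=[1, 1], w0=-0.5)  # fires unless (0, 0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, AND(x), OR(x))
```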
Perceptron Learning Rule
How to learn the weights for a single perceptron?
1. Begin with random weights
2. Apply the perceptron to each training example (x_k, d_k) and
modify the weights whenever it misclassifies an example:
w_i ← w_i + Δw_i
where Δw_i = η(d_k − y_k) x_i and η is the learning rate
3. If at least one example was misclassified, continue with step 2,
else end.
The convergence of this procedure is assured if the training
examples are linearly separable
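The three steps above can be sketched in Python. The toy data set, step size, and epoch limit are assumptions for the demo; targets use the ±1 convention of sgn:

```python
import random

random.seed(0)  # fixed seed so the demo is reproducible

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron learning rule with targets d in {-1, +1}.
    Each x carries a leading constant 1 so w[0] plays the role of the threshold."""
    n = len(examples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]   # 1. random weights
    for _ in range(max_epochs):
        misclassified = 0
        for x, d in examples:                           # 2. visit each example
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if y != d:                                  # update only on mistakes
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                misclassified += 1
        if misclassified == 0:                          # 3. stop when consistent
            break
    return w

# Linearly separable toy task: logical OR, targets in {-1, +1}.
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
```

Because the examples are linearly separable, the loop is guaranteed to reach a consistent weight vector.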
Gradient Descent and Delta Rule
Gradient descent used to search the space of possible weight vectors
linear unit: y(x) = w·x
training error:
E(w) = 1/2 Σ_{k∈Train} (d_k − y_k)²
The weight vector is altered in order to find the minimum error.
Converges to the best approximation to the target function.
Steepest Descent Direction
The direction is determined by the derivative of E with
respect to each component of w.
Gradient of E:
∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
Training rule for gradient descent:
w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i
Differentiating E from the training-error equation:
∂E/∂w_i = ∂/∂w_i (1/2 Σ_{k∈Train} (d_k − y_k)²) = Σ_{k∈Train} (d_k − y_k)(−x_ki)
Gradient-Descent Algorithm
Initialise each w_i to some small random value
Until the termination condition is met, do
  initialise each Δw_i to zero
  for each (x_k, d_k) in training examples do
    compute the output y_k
    for each weight w_i do
      Δw_i ← Δw_i + η(d_k − y_k) x_i
  for each w_i do
    w_i ← w_i + Δw_i
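The algorithm above can be sketched for a linear unit y = w·x. The toy data set, step size, and epoch count are assumptions for the demo:

```python
def gradient_descent(examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit y = w·x: the updates
    Δw_i = Σ_k η(d_k − y_k) x_i are accumulated over the whole training
    set and applied once per pass."""
    n = len(examples[0][0])
    w = [0.0] * n                                   # small initial weights
    for _ in range(epochs):
        dw = [0.0] * n                              # initialise each Δw_i to zero
        for x, d in examples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                dw[i] += eta * (d - y) * x[i]
        w = [wi + dwi for wi, dwi in zip(w, dw)]    # w_i <- w_i + Δw_i
    return w

# Toy data generated by the target d = 2*x1 - x2 (an assumption for the demo).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = gradient_descent(data)
```

On this consistent data set the learned weights approach (2, −1), the minimum of E.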
Remarks on Gradient-Descent Algorithm
It can be used whenever
search space of weight vectors is continuous
the error can be differentiated with respect to the weights
Difficulties in applying the algorithm are
converging to an optimum can be slow
no guarantee that the global optimum will be found
Delta Rule: w_i ← w_i + η(d_k − y_k) x_i
weights are updated upon examining each training example
less computation time per weight update step
can sometimes avoid falling into local optima
Delta rule converges toward the minimum error weights
regardless of whether the training data are linearly separable
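For contrast with the batch algorithm, a minimal sketch of the incremental delta rule, which updates the weights immediately after examining each example (same assumed toy data):

```python
def delta_rule(examples, eta=0.05, epochs=500):
    """Incremental (stochastic) delta rule: w_i <- w_i + η(d_k − y_k) x_i
    applied after each individual example rather than once per pass."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, d in examples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
    return w

# Same toy data as for batch descent: target d = 2*x1 - x2 (assumed).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = delta_rule(data)
```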
Rosenblatt's Simple Perceptron
Designed for task of pattern recognition
Single layer net of perceptrons







Perceptron learning rule
Limitation - applicable only when the classes are separable by a hyperplane.
Neural Networks
Interconnected net of formal neurons
input (receptors), hidden, output (effectors)
State and configuration of NN
Dynamics:
organisational - topology and its change
recurrent, feed-forward, multi-layer
activation - initialisation of the state and its change
continuous, discrete, sequential, parallel, activation
function
adaptive - initial configuration and learning algorithm
supervised vs. unsupervised learning
Multi-Layer Networks
Feed-forward networks with intermediate "hidden" layer(s)
n-layer network (n hidden layers)


Geometric Interpretation of the Function
of the Multi-Layer Networks




XOR by means of 2-Layer Network




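Although a single perceptron cannot represent XOR, a 2-layer network can. A sketch using the decomposition XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)); the weights are hand-set for illustration:

```python
def step(s):
    """Hard-threshold activation of a perceptron (0/1 variant of sgn)."""
    return 1 if s > 0 else 0

def perceptron(x, w, w0):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + w0)

def xor(x1, x2):
    """First layer computes OR and NAND; the output unit ANDs them."""
    h1 = perceptron((x1, x2), w=(1, 1), w0=-0.5)    # OR
    h2 = perceptron((x1, x2), w=(-1, -1), w0=1.5)   # NAND
    return perceptron((h1, h2), w=(1, 1), w0=-1.5)  # AND
```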
Multi-Layer Networks and
Backpropagation Algorithm
Requires units whose output is
a nonlinear function of its inputs
a differentiable function of its inputs

Activation function (sigmoid):
o(ξ) = 1 / (1 + e^(−ξ))
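A small sketch of the sigmoid and the derivative property that backpropagation relies on:

```python
import math

def sigmoid(xi):
    """Sigmoid activation o(ξ) = 1 / (1 + e^(−ξ)): nonlinear and differentiable."""
    return 1.0 / (1.0 + math.exp(-xi))

def sigmoid_prime(xi):
    """The derivative factors through the output itself: o'(ξ) = o(ξ)(1 − o(ξ))."""
    o = sigmoid(xi)
    return o * (1.0 - o)
```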
Adaptation of Weights in Multi-Layer
Networks
Error function:
E(w) = Σ_{k=1}^{p} E_k(w)
E_k(w) = 1/2 Σ_{j∈Y} (y_j(w, x_k) − d_kj)²
Adaptation step:
w_ji^(t) = w_ji^(t−1) + Δw_ji^(t)
where
Δw_ji^(t) = −ε ∂E/∂w_ji (w^(t−1))
and 0 < ε < 1 is the learning rate
Visualisation of the process
Backpropagation Algorithm
We can write:
∂E/∂w_ji = Σ_{k=1}^{p} ∂E_k/∂w_ji
In order to get the derivative we use the chain rule:
∂E_k/∂w_ji = (∂E_k/∂y_j) (∂y_j/∂ξ_j) (∂ξ_j/∂w_ji)
where the derivative of the potential is:
∂ξ_j/∂w_ji = y_i
and the derivative of the activation function is:
∂y_j/∂ξ_j = y_j (1 − y_j)
Backpropagation Algorithm
Substituting these expressions, we obtain
∂E_k/∂w_ji = (∂E_k/∂y_j) y_j (1 − y_j) y_i
Derivation of Training Rule for Output
vs. Hidden Units
Output unit:
∂E_k/∂y_j = y_j − d_kj
that is, the error of the j-th neuron on the k-th example
Hidden unit:
∂E_k/∂y_j = Σ_{r: j→r} (∂E_k/∂y_r)(∂y_r/∂ξ_r)(∂ξ_r/∂y_j) = Σ_{r: j→r} (∂E_k/∂y_r) y_r (1 − y_r) w_rj
because y_j is used in the calculation of the internal potential of all
units whose inputs include the output of unit j
In this way the errors are propagated backwards from the
output layer to the first hidden layer
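The whole derivation can be checked numerically. The sketch below implements the output-unit and hidden-unit deltas for a tiny 2-2-1 sigmoid network; the weights and the single training example are chosen arbitrarily for the demo, and the resulting gradients agree with finite differences of E_k:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(w, x):
    """2-2-1 sigmoid network; w holds one weight list [w_1, w_2, w_0] per unit."""
    h = [sigmoid(wj[0] * x[0] + wj[1] * x[1] + wj[2]) for wj in w[:2]]
    y = sigmoid(w[2][0] * h[0] + w[2][1] * h[1] + w[2][2])
    return h, y

def backprop_grads(w, x, d):
    """Gradients of E_k = 1/2 (y − d)^2 via the rules derived above:
    output unit:  delta = (y − d) · y(1 − y)
    hidden unit:  delta_j = (delta · w_out,j) · h_j(1 − h_j)"""
    h, y = forward(w, x)
    delta_out = (y - d) * y * (1.0 - y)
    grads = []
    for j in range(2):                                # hidden units
        delta_j = delta_out * w[2][j] * h[j] * (1.0 - h[j])
        grads.append([delta_j * x[0], delta_j * x[1], delta_j])
    grads.append([delta_out * h[0], delta_out * h[1], delta_out])  # output unit
    return grads

# Arbitrary demo weights and a single training example (x, d).
w = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2], [0.6, -0.7, 0.05]]
g = backprop_grads(w, x=(1.0, 0.0), d=1.0)
```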
Visualisation of the Backpropagation in
a Three-Layer Network


Remarks on Multi-Layer Networks and
Backpropagation
Speed of Backpropagation
Convergence to local optimum
the more weights the more dimensions that might
provide "escape routes" from the local optimum
Stochastic gradient descent rather than true gradient
descent - less likely to get stuck in a local optimum
Training multiple networks using the same data but
different initial weight vectors
Remarks on Multi-Layer Networks and
Backpropagation
Some hints to choose the topology:
complexity of the net should reflect the complexity of
the problem
beware of overfitting
heuristics:
1st layer: a few more units than the number of inputs
2nd layer: (N_outputs + N_layer1)/2
try and adjust
Constructive algorithms - Cascade Correlation Algorithm
Overfitting in Multi-Layer Networks