[Figure: photos of Iris versicolor, Iris setosa and Iris virginica. Source: www.mathworks.com/help/toolbox/stats/bqzdnrv-1.html]

Your first workshop task (using J48 in Weka) was to see if there was enough information in the four attributes of petal length and width and sepal length and width to distinguish and classify these similar looking flowers. Notice that colour may not help here. Your next workshop will involve using ANNs to distinguish between these flowers.


Workshops and assignment

Week 3: Workshop in ANNs (attend if you wish)

Week 3: assignment distributed

Week 4: Workshop in ANNs (attend if you wish)

Week 5-9: Workshops in WEKA/ANNs as you progress through the assignment (tailored to your requirements)

Week 10: 1st June 2012 hand-in. 13 pages max.

Lecture 4

Artificial Neural Networks

[Figure: images of the brain. Top left: photo; top right: what is currently known; left: close up showing the brain consisting of layers of interconnected nerve cells (neurons) and tissue.]

All images taken from www.idsia.ch/NNcourse/brain.html


From biology to computing

Neuron = nerve cell (in brain)

Neurons

Flow of information

Biological: http://faculty.washington.edu/chudler/color/pic1an.gif

Artificial: http://research.yale.edu/ysm/images/78.2/articles-neural-neuron.jpg

http://www.frontiersin.org/neuromorphic_engineering/10.3389/fnins.2011.00026/full

Biological computing through action potential spikes:

A: Abstract physiology B: Biochemistry

Physiology and biochemistry lead to spikes. Can spikes be used for computing?

Post synaptic neuron spikes more (or less) depending on pre synaptic behaviour


http://www.socialbehavior.uzh.ch/teaching/ComputationalNeuroeconomicsFS11/Chapter10.pptx

Action potential spiking can take place many times per second, depending on which part of the brain we look at


The Structure of Neurons

A neuron has a cell body, a branching input structure (the dendrIte) and a branching output structure (the axOn)

Axons connect to dendrites via synapses.

Electro-chemical signals are propagated from the dendritic input, through the cell body, and down the axon to other neurons

Classical computing vs. Neural Net

[Figure: classical computing. A CPU exchanges data and instructions with memory.]

http://ilab.usc.edu/classes/2002cs561/notes/session28.ppt

Layers of interconnected neurons (as many layers as you like), with the connections being weighted to reflect strength of incoming signal

[Figure: data passing through Layer 1, Layer 2 and Layer 3]

Feedforward

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

Introduction

What is an (artificial) neural network?

A set of nodes (units, neurons, processing elements)

Each node has input and output

Each node performs a simple computation by its node function

Weighted connections between nodes

Connectivity gives the structure/architecture of the net

What can be computed by a NN is primarily determined by the connections and their weights

A very much simplified version of networks of neurons in animal nervous systems

Neuron is basic computational unit (primitive processor), not a program

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

Introduction

Von Neumann machine:

One or a few high speed (ns) processors with considerable computing power

One or a few shared high speed buses for communication

Sequential memory access by address

Problem-solving knowledge is separated from the computing component

Hard to be adaptive

Human Brain:

Large # (10^11) of low speed processors (ms) with limited computing power

Large # (10^15) of low speed connections

Problem-solving knowledge resides in the connectivity of neurons

Adaptation by changing the connectivity

Easily adapts for learning

Fault tolerant

Example of fault tolerance

Captchas are now frequently used by websites to check that a fault-tolerant human brain (rather than rigorous software) is interacting with the site. If there were a reliable algorithm for recognizing these ‘faulty’ characters, some other method for checking the ‘human-ness’ of the user would need to be found.

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

Introduction

ANN                      Bio NN
Nodes                    Cell body
  input                  signal from other neurons
  output                 firing frequency
  node function          firing mechanism
Connections              Synapses
  connection strength    synaptic strength

Highly parallel, simple local computation (at neuron level) achieves global results as emerging property of the interaction (at network level)

Pattern directed (meaning of individual nodes only in the context of a pattern)

Fault-tolerant/graceful degrading

Learning/adaptation plays important role.

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

History of NN

Pitts & McCulloch (1943)

First mathematical model of biological neurons

All Boolean operations can be implemented by these neuron-like nodes (with different threshold and excitatory/inhibitory connections).

Competitor to Von Neumann model for general purpose computing device

Origin of automata theory.

Hebb (1949)

Hebbian rule of learning: increase the connection strength between neurons i and j whenever both i and j are activated.

Or increase the connection strength between nodes i and j whenever both nodes are simultaneously ON or OFF.

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

History of NN

Early booming (50’s – early 60’s)

Rosenblatt (1958)

Perceptron: network of threshold nodes for pattern classification

[Figure: perceptron with inputs x_1, x_2, ..., x_n]

Perceptron convergence theorem: everything that can be represented by a perceptron can be learned

A neuron only fires if its input signal exceeds a certain amount (the threshold) in a short time period.

Synapses vary in strength

Good connections allow a large signal

Slight connections allow only a weak signal.

Synapses can be either excitatory or inhibitory.

Perceptron

Linear threshold unit (LTU)

Inputs $x_1, x_2, \dots, x_n$ arrive with weights $w_1, w_2, \dots, w_n$. Usually, an extra constant input $x_0 = 1$ (with weight $w_0$) is included to ensure that, even if all the inputs are 0, there is a non-zero value fed into the threshold function.

The unit computes the weighted sum $\sum_{i=0}^{n} w_i x_i$.

Threshold function:

$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$

If the sum of weighted inputs is greater than 0, output 1, else output -1
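The threshold unit described above can be sketched in a few lines of Python. This is only an illustration: the function name `ltu_output` and the example weights are assumptions made here, not part of the lecture.

```python
def ltu_output(weights, inputs):
    """Linear threshold unit: +1 if the weighted sum exceeds 0, else -1.

    weights[0] is the weight w_0 of the extra constant input x_0 = 1,
    which is prepended here so callers pass only x_1..x_n.
    """
    x = [1.0] + list(inputs)                      # x_0 = 1
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else -1


# Hypothetical weights, chosen only for illustration.
example_weights = [-1.5, 1.0, 1.0]                # w_0, w_1, w_2
print(ltu_output(example_weights, [1, 1]))        # weighted sum is 0.5
print(ltu_output(example_weights, [0, 1]))        # weighted sum is -0.5
```

Because of the constant input, a unit with a positive w_0 still receives a non-zero sum when every other input is 0.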

Perceptron Learning Rule

$w_i = w_i + \Delta w_i$ (a weight changes according to some difference)

$\Delta w_i = \eta (t - o) x_i$ (difference is between desired class value and output value)

$t = c(x)$ is the target value (class value of sample)

$o$ is the perceptron output (actual value representing class)

$\eta$ (eta) is a small constant (e.g. 0.1) called learning rate (more later)

If the output is correct ($t = o$) the weights $w_i$ are not changed

If the output is incorrect ($t \neq o$) the weights $w_i$ are changed such that the output of the perceptron for the adjusted weights is closer to $t$.

The algorithm converges to the correct classification after repeated presentations of samples:

if the training data is linearly separable and $\eta$ is sufficiently small
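As a rough sketch, one application of the learning rule might look like this in Python. The helper name `perceptron_update` and the sample values are hypothetical; the update line itself follows the rule stated above.

```python
def perceptron_update(weights, inputs, target, output, eta=0.1):
    """Apply w_i <- w_i + eta * (t - o) * x_i once and return the new weights."""
    return [w + eta * (target - output) * x for w, x in zip(weights, inputs)]


# Hypothetical sample; inputs[0] is the constant input x_0 = 1.
weights = [0.2, -0.4, 0.1]
inputs = [1, 1, 0]

# Correct output (t = o): nothing changes.
print(perceptron_update(weights, inputs, target=1, output=1))

# Incorrect output (t = 1, o = -1): weights move towards the target,
# but only for inputs that were non-zero.
print(perceptron_update(weights, inputs, target=1, output=-1))
```

Note that a weight attached to an input of 0 is left alone, since that input contributed nothing to the wrong output.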

ANN Supervised Learning Method - Basic

1. Samples fed in one at a time to the input units: repeat many times

2. Actual output compared with desired output

3. Actual output stored in a temporary file to allow calculation of error (difference between desired output and actual output); error can be summed

4. Error used to adjust weights (not input or output)

5. Many repetitions of 1-4

Problem: How do we represent the class value if a threshold function only returns 0 and 1?


Training ANN (perceptron) for AND function

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

[Figure: perceptron with an additional input constant -1 (weight W1 = ?), input A (weight W2 = ?) and input B (weight W3 = ?), threshold t = 0.0. Output y is +1 if t exceeded, 0 otherwise.]

• Initialize with random weight values

• Introduce additional input constant (called ‘bias’) to ensure that the threshold function receives some input even if all other inputs are zero

Training Perceptrons

[Figure: perceptron with bias input -1 (W1 = 0.3), input A (W2 = 0.5), input B (W3 = -0.4), threshold t = 0.0 and output y]

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

I1   I2   I3   Summation                                  Output
-1   0    0    (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3       0
-1   0    1    (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7       0
-1   1    0    (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2        1
-1   1    1    (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2       0

Given the current weights, this perceptron does not produce the correct results for two combinations of AND: ‘1 0’ and ‘1 1’.
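The summation table can be checked with a short Python sketch. The names `summation`, `threshold_output` and `and_table` are chosen here for illustration; the weights, the constant input -1 and the threshold t = 0.0 are taken from the slide.

```python
def summation(weights, inputs):
    """Weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def threshold_output(total, t=0.0):
    """Output 1 if the threshold t is exceeded, 0 otherwise."""
    return 1 if total > t else 0


weights = [0.3, 0.5, -0.4]                        # W1, W2, W3 from the slide
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

results = {}
for (a, b), target in and_table.items():
    total = summation(weights, [-1, a, b])        # I1 is the constant -1
    out = threshold_output(total)
    results[(a, b)] = (round(total, 2), out, out == target)
    print(a, b, round(total, 2), out, "ok" if out == target else "wrong")
```

Running this flags the rows for ‘1 0’ and ‘1 1’ as wrong, in line with the table.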


Exercise: Fill the values in the summation table to determine whether this Perceptron correctly performs the AND function.

[Figure: perceptron with bias input -1 (W1 = 0.4), input A (W2 = 0.7), input B (W3 = -0.2), threshold t = 0.0 and output y]

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

I1   I2   I3   Summation   Output
-1   0    0
-1   0    1
-1   1    0
-1   1    1
Solution to Exercise

[Figure: perceptron with bias input -1 (W1 = 0.4), input A (W2 = 0.7), input B (W3 = -0.2), threshold t = 0.0 and output y]

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

I1   I2   I3   Summation                                  Output
-1   0    0    (-1*0.4) + (0*0.7) + (0*-0.2) = -0.4       0
-1   0    1    (-1*0.4) + (0*0.7) + (1*-0.2) = -0.6       0
-1   1    0    (-1*0.4) + (1*0.7) + (0*-0.2) = 0.3        1
-1   1    1    (-1*0.4) + (1*0.7) + (1*-0.2) = 0.1        1

Weight adjustment

With weights W1 = 0.3, W2 = 0.5, W3 = -0.4:

I1   I2   I3   Summation                                  Output
-1   0    0    (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3       0
-1   0    1    (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7       0
-1   1    0    (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2        1
-1   1    1    (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2       0

With weights W1 = 0.4, W2 = 0.7, W3 = -0.2:

I1   I2   I3   Summation                                  Output
-1   0    0    (-1*0.4) + (0*0.7) + (0*-0.2) = -0.4       0
-1   0    1    (-1*0.4) + (0*0.7) + (1*-0.2) = -0.6       0
-1   1    0    (-1*0.4) + (1*0.7) + (0*-0.2) = 0.3        1
-1   1    1    (-1*0.4) + (1*0.7) + (1*-0.2) = 0.1        1

Note that the summation results are in the right direction for producing the correct output for ‘1 1’ but not for ‘1 0’ (given threshold of t=0.0)


Learning algorithm

Epoch : Presentation of the entire training set to the neural network. In the case of the AND function an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1])

Error: The error value is the amount by which the value output by the network differs from the target value. For example, if we required the network to output 0 and it output a 1, then Error = -1

Learning question 1: Is there a set of weights that will produce the correct results for all input values of the AND function?

Learning question 2: If so, how do we find these weights automatically rather than manually adjusting the weights?

Learning algorithm (supervised) - recap

Target Value, T : When we are training a network we not only present it with the input but also with a value that we require the network to produce. For example, if we present the network with [1,1] for the AND function the training value will be 1

Output , O : The output value from the neuron

I_j : Inputs being presented to the neuron

W_j : Weight from input neuron (I_j) to the output neuron

LR : The learning rate. This dictates how quickly the network converges. It is set by experimentation. It is typically 0.1, for reasons to be described later

Feedback Learning Algorithm - recap

Until Convergence (low error or other stopping criterion) do

Present a training pattern

Calculate the error of the output nodes

Adjust the weights connecting the input nodes to the output node so that the next time the training pattern is presented the error of the output is reduced
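The loop above can be sketched in Python for the AND function. This is a minimal illustration, not the lecture's own code: the function names, the `max_epochs` safety limit and the rounding of weights to ten decimals (only to keep them tidy) are assumptions. The starting weights, threshold and learning rate come from the slides.

```python
def output(weights, inputs, t=0.0):
    """Threshold unit: 1 if the weighted sum exceeds t, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > t else 0


def train_perceptron(patterns, weights, lr=0.1, max_epochs=1000):
    """Repeat epochs until every pattern is classified correctly."""
    weights = list(weights)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in patterns:           # present a training pattern
            error = target - output(weights, inputs)
            if error != 0:
                errors += 1
                # Adjust weights so the error is reduced next time.
                weights = [round(w + lr * error * x, 10)
                           for w, x in zip(weights, inputs)]
        if errors == 0:                           # stopping criterion: no errors
            return weights, epoch
    return weights, max_epochs


# Each pattern is ([I1, I2, I3], target), with I1 the constant -1.
and_patterns = [([-1, 0, 0], 0), ([-1, 0, 1], 0),
                ([-1, 1, 0], 0), ([-1, 1, 1], 1)]
final_weights, epochs = train_perceptron(and_patterns, [0.3, 0.5, -0.4])
print(final_weights, epochs)
```

Stopping when an epoch produces no errors is one possible convergence test; a low summed error would serve equally well.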

Automatic Perceptron algorithm

weight change = some small constant × (target output − actual output) × input

if we use error instead of the “target output – actual output”, we have:

weight change = some small constant × error × input

Perceptron feedback rule

weight change = some small constant (learning rate) × error × input (typically, learning rate is 0.1 or smaller)

The error is:

error = (target output – actual output) [will always be -1, 0 or 1 in our example here]

• E.g. ‘1 0’ produces the wrong output ‘1’, as follows:

I1   I2   I3   Summation                                  Output
-1   1    0    (-1*0.4) + (1*0.7) + (0*-0.2) = 0.3        1

Error = (0 − 1) = -1

Weight change for w1 = 0.1 x -1 x -1 = 0.1 (w1=0.4+0.1=0.5)

Weight change for w2 = 0.1 x -1 x 1 = -0.1 (w2=0.7-0.1=0.6)

Weight change for w3 = 0.1 x -1 x 0 = 0 (w3=-0.2, no change)

Update Perceptron weights for ‘1 0’

[Figure: perceptron with bias input -1 (W1 = 0.5), input A (W2 = 0.6), input B (W3 = -0.2), threshold t = 0.0 and output y]

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

The next time ‘1 0’ is presented:

I1   I2   I3   Summation                                  Output
-1   1    0    (-1*0.5) + (1*0.6) + (0*-0.2) = 0.1        1

That is, while the output is still wrong, there has been a reduction in the summation from 0.3 to 0.1 for input ‘1 0’. At least another presentation of ‘1 0’ is required to produce the desired output 0, given the threshold of t=0.0


Next presentation

Error = (0 − 1) = -1

Weight change for w1 = 0.1 x -1 x -1 = 0.1 (w1=0.5+0.1=0.6)

Weight change for w2 = 0.1 x -1 x 1 = -0.1 (w2=0.6-0.1=0.5)

Weight change for w3 = 0.1 x -1 x 0 = 0 (w3=-0.2, no change)

Next presentation:

I1   I2   I3   Summation                                  Output
-1   1    0    (-1*0.6) + (1*0.5) + (0*-0.2) = -0.1       0

Since the output activation for ‘1 0’ is now below 0, the correct output for AND(1,0) = 0 is now produced. We can now move to the next wrongly classified sample ‘1 1’.
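The repeated presentations of ‘1 0’ can be replayed with a small Python sketch. The function name `present_until_correct`, the presentation limit and the rounding to ten decimals are assumptions added for illustration; the weights, learning rate and threshold are those used on the slides.

```python
def present_until_correct(weights, inputs, target, lr=0.1, t=0.0,
                          max_presentations=50):
    """Present one pattern repeatedly, updating weights after each wrong output.

    Returns the final weights and a history of (summation, output) pairs.
    """
    weights = list(weights)
    history = []
    for _ in range(max_presentations):
        total = sum(w * x for w, x in zip(weights, inputs))
        out = 1 if total > t else 0
        history.append((round(total, 2), out))
        error = target - out
        if error == 0:
            break
        # weight change = learning rate x error x input
        weights = [round(w + lr * error * x, 10)
                   for w, x in zip(weights, inputs)]
    return weights, history


weights, history = present_until_correct([0.4, 0.7, -0.2], [-1, 1, 0], target=0)
print(history)    # [(0.3, 1), (0.1, 1), (-0.1, 0)]
print(weights)    # [0.6, 0.5, -0.2]
```

The history shows the summation falling from 0.3 to 0.1 to -0.1, so the third presentation is the first to give the desired output 0.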

One can change weights after processing a single pattern or accumulate weight error values over a batch of patterns before changing the weights. This allows all patterns to be presented to a perceptron’s existing weights before the weights are changed.

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

History of NN

The setback (mid 60’s – late 70’s)

Serious problems with perceptron model (Minsky’s book 1969)

Single layer perceptrons cannot represent (learn) simple functions such as XOR

Multi-layer nets of non-linear units may have greater power but there is no learning rule for such nets

Scaling problem: connection weights may grow infinitely

The first two problems were overcome by later efforts in the 80’s, but the scaling problem persists

Death of Rosenblatt (1971)

Thriving of Von Neumann machine and AI

http://www.cs.umbc.edu/~ypeng/F04NN/lecture-notes/NN-Ch1.ppt

History of NN

Renewed enthusiasm and flourish (80’s – present)

New techniques

Backpropagation learning for multi-layer feed forward nets (with non-linear, differentiable node functions)

Thermodynamic models (Hopfield net, Boltzmann machine, etc.)

Unsupervised learning

Impressive applications (character recognition, speech recognition, text-to-speech transformation, process control, associative memory, etc.)

ANNs now preferred computational method in many applications (e.g. pattern recognition)

Excitatory and Inhibitory Synapses - Recap

We call a synapse/weight:

excitatory if $w_i > 0$, and

inhibitory if $w_i < 0$.

We also associate a threshold $\theta$ with each neuron

A neuron fires (i.e., has value 1 on its output line) if the weighted sum of inputs at time t reaches or passes $\theta$:

output = 1 if and only if $\sum_i w_i x_i \geq \theta$

http://ilab.usc.edu/classes/2002cs561/notes/session28.ppt


Most common ANN architecture: Feed-forward nets

Information flow is unidirectional

Data is presented to Input layer

Passed on to Hidden Layer

Passed on to Output layer

Information is distributed

Information processing is parallel

Your ANN for OCR

feedforward network

train using Back-propagation

2-D pixel matrix converted into linear input

[Figure: feedforward network with an Input Layer, a Hidden Layer and an Output Layer whose units are labelled A, B, C, D, E]


Multi-layer Perceptron (MLP)

Topology

[Figure: fully connected network with Input Layer i, Hidden Layer(s) j and Output Layer k]

Note that output layer can contain more than one unit

E.g. output classes 1/0 can be represented by one unit or by two units (‘1 0’ for class 1 and ‘0 1’ for class 2)

Fully connected, MLP is the most common (and simplest) ANN

Backpropagation Learning Algorithm

Until Convergence (low error or other stopping criteria) do

Present a training pattern

Calculate the error of the output nodes

Calculate the error of the hidden nodes (based on the error of the output nodes which is propagated back to the hidden nodes)

Continue propagating error back until the input layer is reached

Update all weights based on the standard delta rule with the appropriate error function $\delta$

$\Delta w_{ij} = \eta \, \delta_j Z_i$

where $Z$ is the output function of the ANN and $\eta$ is the learning rate

One can change weights after processing a single pattern or accumulate weight error values over a batch of patterns before changing the weights.

Backpropagation algorithm in rules

weight change = some small constant (learning rate) × error × input activation

For an output node, the error is:

error = (target activation − output activation) × output activation × (1 − output activation)

For a hidden node, the error is:

error = weighted sum of to-node errors × hidden activation × (1 − hidden activation)
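These three rules can be written down directly in Python. The sketch below is illustrative only: the function names and the sample activations are hypothetical. The factor activation × (1 − activation) is the derivative of the logistic function, so the sketch assumes logistic units.

```python
def output_error(target, out_act):
    """Error term for an output node (logistic unit assumed)."""
    return (target - out_act) * out_act * (1 - out_act)


def hidden_error(to_node_errors, to_node_weights, hidden_act):
    """Error term for a hidden node: errors of the nodes it feeds,
    weighted by the connecting weights, times the logistic derivative."""
    weighted_sum = sum(e * w for e, w in zip(to_node_errors, to_node_weights))
    return weighted_sum * hidden_act * (1 - hidden_act)


def weight_change(learning_rate, error, input_act):
    """weight change = learning rate x error x input activation."""
    return learning_rate * error * input_act


# Hypothetical values: one hidden node (activation 0.7) feeding one
# output node (activation 0.6, target 1.0) through a weight of 0.5.
d_out = output_error(target=1.0, out_act=0.6)
d_hid = hidden_error([d_out], [0.5], hidden_act=0.7)
dw = weight_change(0.1, d_out, input_act=0.7)
print(d_out, d_hid, dw)
```

The hidden error is much smaller than the output error here because it is scaled both by the connecting weight and by the hidden node's own derivative term.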

Why back-propagation?

Each weight ‘Shares the Blame’ for prediction error with other weights.

Back-propagation algorithm decides how to distribute the blame among all weights and adjust the weights accordingly.

Small portion of blame leads to small adjustment.

Large portion of the blame leads to large adjustment.

The role of η in learning

[Figure: perceptron with inputs x_1, x_2, ..., x_n, weights w_1, w_2, ..., w_n and threshold θ; the input pattern shown is 1 0 1]

Assume three inputs, one output

1 0 1 is the pattern at the input nodes, with 0 the target

w_1 = 0.5, w_2 = 0.2, w_3 = 0.8

1*0.5 + 0*0.2 + 1*0.8 = 1.3 (weighted sum of inputs)

Assume θ = 1.0 (threshold activation function). Then 1.3 > 1.0 and perceptron outputs 1.

But desired output is 0. Then:

w_new = w_old + η(desired − actual) * input

Assume η = 1 (a large η can lead to ‘weight oscillation’):

w_1new = 0.5 + 1*(0−1)*1 = −0.5

w_2new = 0.2 + 1*(0−1)*0 = 0.2

w_3new = 0.8 + 1*(0−1)*1 = −0.2

Assume η = 0.2:

w_1new = 0.5 + 0.2*(0−1)*1 = 0.3

w_2new = 0.2 + 0.2*(0−1)*0 = 0.2

w_3new = 0.8 + 0.2*(0−1)*1 = 0.6

Note how weights that are more to blame get a larger amount of change
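The two cases can be compared side by side with a short Python sketch. The function name `update` and the rounding to ten decimals are assumptions for illustration; the weights, inputs, threshold and both values of η are from the slide.

```python
def update(weights, inputs, desired, actual, eta):
    """w_new = w_old + eta * (desired - actual) * input, for every weight."""
    return [round(w + eta * (desired - actual) * x, 10)
            for w, x in zip(weights, inputs)]


weights = [0.5, 0.2, 0.8]
inputs = [1, 0, 1]
theta = 1.0

net = sum(w * x for w, x in zip(weights, inputs))     # weighted sum, about 1.3
actual = 1 if net > theta else 0                      # perceptron outputs 1

for eta in (1.0, 0.2):
    new_weights = update(weights, inputs, 0, actual, eta)
    new_net = sum(w * x for w, x in zip(new_weights, inputs))
    print(eta, new_weights, round(new_net, 2))
```

With η = 1 the weighted sum jumps from 1.3 to -0.7 in a single step, while η = 0.2 moves it to 0.9, just below the threshold. In both cases w_2 is unchanged because its input was 0.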


Transfer Functions

Transfer function is usually the same for every unit in the same layer

There are various choices for Transfer / Activation functions that determine what is output from a neuron

Tanh: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (output between -1 and 1)

Logistic: $f(x) = \dfrac{e^{x}}{1 + e^{x}}$ (output between 0 and 1, with f(0) = 0.5)

Threshold: $f(x) = 0$ if $x < 0$, $f(x) = 1$ if $x \geq 0$

Choice of transfer function for one output unit will depend on class information:

1. If 0 and 1, use Logistic or Threshold

2. If non-binary, use Tanh (e.g. -1, 0, 1 for tripartite classification)

3. If N classes, a class is represented as (0, 0, 1, 0, ..., 0) at the output layer (i.e. as many output nodes as distinct class values)
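For reference, the three transfer functions and the N-class output encoding can be sketched in Python as follows. The function names are chosen here for illustration, and the threshold function is assumed to switch at 0.

```python
import math


def tanh(x):
    """Tanh: output lies between -1 and 1."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def logistic(x):
    """Logistic: output lies between 0 and 1, with logistic(0) = 0.5."""
    return math.exp(x) / (1 + math.exp(x))


def threshold(x):
    """Threshold: 0 below zero, 1 otherwise (switch point assumed to be 0)."""
    return 1 if x >= 0 else 0


def one_hot(class_index, n_classes):
    """Represent one of N classes as (0, ..., 1, ..., 0) over N output nodes."""
    return [1 if i == class_index else 0 for i in range(n_classes)]


print(tanh(0.0), logistic(0.0), threshold(-0.1), threshold(0.3))
print(one_hot(2, 4))    # third of four classes
```

Tanh and logistic are smooth and differentiable, which is what backpropagation needs; the threshold function is not.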


Why is back propagation important?

Provides a procedure that allows networks to learn weights that can solve any deterministic input-output problem.

Allows networks to learn how to represent information as well as how to use it.

Raises questions about the nature of representations and of what must be specified in order to learn them.

But back propagation pure and simple may be prone to the local minima problem

This is because standard BP always seeks to reduce error through weight adjustment and gradient descent: error is not allowed to increase


Local Minima

Advantages of back propagation

Relatively simple implementation

Standard method and generally works well

Disadvantages of BP

Slow and inefficient

Can get stuck in local minima resulting in sub-optimal solutions

[Figure: error surface with a Local Minimum and the Global Minimum. Learning rate η specifies the step width of gradient descent. Weights are stuck through gradient descent (i.e. error has reached local minimum).]

Gradient descent must be amended to allow learning to leave flat spots


Enhancements To Back Propagation

Momentum

Adds a percentage of the last movement to the current movement

Without momentum, BP will fall back to local minimum

With momentum and elimination of flat spots, BP will find the global minimum

The role of bias in FFBP ANNs

If all inputs are 0 in a FFBP, then the weighted sum passed to the next layer will be 0 irrespective of weights

Bias unit lies in one layer and is connected to all neurons in next layer

The role of the bias units is to ensure that some value is input to the nodes at the next layer even if values are 0 from nodes in the previous layer. Bias units are usually set to output 1.


Weight change and momentum

Backpropagation algorithm often takes a long time to learn

Momentum consists of adding a fraction of the old weight change; the fraction is typically set at about 0.5

The learning rule then looks like:

weight change = some small constant × error × input activation + momentum constant × old weight change

$\Delta w(t) = \eta \, d + \alpha \, \Delta w(t-1)$

$\Delta w$ is the change in weight

$\eta$ is the learning rate

$d$ is the error × input activation

$\alpha$ is the momentum parameter
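A minimal Python sketch of the momentum rule is given below. The function name `momentum_step` and the constant value d = 1.0 used in the loop are hypothetical; the learning rate 0.1 and momentum 0.5 are the typical values mentioned in the lecture.

```python
def momentum_step(weight, d, previous_change, learning_rate=0.1, momentum=0.5):
    """One update: change = learning_rate * d + momentum * previous_change.

    d is the error x input activation for this weight.
    Returns the new weight and the change just applied.
    """
    change = learning_rate * d + momentum * previous_change
    return weight + change, change


# Hypothetical run: the same d is seen three times in a row.
w, prev = 0.0, 0.0
changes = []
for d in (1.0, 1.0, 1.0):
    w, prev = momentum_step(w, d, prev)
    changes.append(prev)
print(changes, w)
```

When successive updates point the same way, the steps grow (0.1, then 0.15, then 0.175 here), which is how momentum speeds up learning and helps carry the weights across flat spots.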

Batch Update

With default BP update you update weights after every pattern

With Batch update you accumulate the changes for each weight, but do not update them until the end of each epoch

Batch update gives a correct direction of the gradient for the entire data set, while default update could do some weight updates in directions quite different from the average gradient of the entire data set
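The difference between the two update schemes can be sketched in Python. To keep the example short it uses the simple perceptron rule on the AND function rather than full backpropagation; the function names are assumptions, and the starting weights are those of the earlier exercise.

```python
def predict(weights, inputs, t=0.0):
    """Threshold unit: 1 if the weighted sum exceeds t, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > t else 0


def online_epoch(weights, patterns, lr=0.1):
    """Default update: change the weights after every pattern."""
    weights = list(weights)
    for inputs, target in patterns:
        error = target - predict(weights, inputs)
        weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return weights


def batch_epoch(weights, patterns, lr=0.1):
    """Batch update: accumulate changes, apply them at the end of the epoch."""
    totals = [0.0] * len(weights)
    for inputs, target in patterns:
        error = target - predict(weights, inputs)   # always the old weights
        totals = [acc + lr * error * x for acc, x in zip(totals, inputs)]
    return [w + acc for w, acc in zip(weights, totals)]


and_patterns = [([-1, 0, 0], 0), ([-1, 0, 1], 0),
                ([-1, 1, 0], 0), ([-1, 1, 1], 1)]
start = [0.4, 0.7, -0.2]
online = [round(w, 10) for w in online_epoch(start, and_patterns)]
batch = [round(w, 10) for w in batch_epoch(start, and_patterns)]
print(online)    # [0.4, 0.7, -0.1]
print(batch)     # [0.5, 0.6, -0.2]
```

The results differ because, in the online version, the change made for ‘1 0’ alters the weights that ‘1 1’ is then tested against; in the batch version every pattern sees the same starting weights.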

When to stop training the Network ?

Ideally when we reach the global minimum of the error surface:

Stop if the decrease in total training error (since last cycle) is small. Usually, sum of squared error (SSE) is used for this purpose (squared so that negative error values are converted to positive for summing)

Stop if the overall changes in the weights (since last cycle) are small.

But the network thus obtained may have poor generalizing power on unseen data i.e. the ANN has over-fitted the data

Over-fitting means that the ANN has memorized the training data so that, when new data is presented, predictions are poor
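The first stopping criterion can be sketched in Python as below. The function names, the tolerance value and the sample outputs are hypothetical and serve only to illustrate the SSE calculation.

```python
def sum_squared_error(targets, outputs):
    """SSE: squaring makes negative and positive errors add up together."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


def should_stop(previous_sse, current_sse, tolerance=1e-4):
    """Stop when the decrease in SSE since the last cycle is small."""
    return (previous_sse - current_sse) < tolerance


# Hypothetical network outputs for the four AND patterns.
targets = [0, 0, 0, 1]
outputs = [0.1, 0.2, 0.1, 0.7]
sse = sum_squared_error(targets, outputs)
print(round(sse, 4))                      # 0.15
print(should_stop(0.5, sse))              # large decrease: keep training
print(should_stop(sse + 0.00001, sse))    # tiny decrease: stop
```

A check of this kind only looks at the training error, so it says nothing about how well the network generalizes to unseen data.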

[Figure: example of non-training. Note how the error graph goes into a flat line]

[Figure: an example of a neural network learning successfully]

Choice of Training Parameters

Learning Parameter and Momentum

What should be the optimal values of these training parameters?

- No clear consensus on any fixed strategy.

- However, effects of wrongly specifying them are well studied.

Too big - Large leaps in weight space; risk of missing global minima.

Too small - Takes long time to converge to global minima

- Once stuck in local minima, difficult to get out of

Trial and error still the only method known

Is Backprop biologically plausible?

Neurons do not send error signals backward across their weights through a chain of neurons, as far as anyone can tell

But does this matter?

Some neurons appear to use error signals, and there are ways to use differences between activation signals to carry error information

Neural network models (summary)

Some of the most popular NN models are:

Perceptron, ADALINE, Multilayer Perceptron (MLP), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Adaptive Resonance Theory (ART), Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN), Bidirectional Associative Memory (BAM), Boltzmann Machine, Elman, Hamming, Support Vector Machines (SVM), Time Delay NN (TDNN), Recurrent Backpropagation, ARTMAP, Counterpropagation, Neocognitron (over 100 different types)

Connectionism versus strong AI

Strong AI (and its use of the PSSH, the physical symbol system hypothesis) is often called ‘computationalism’ – the human mind works purely through formal operations on symbols, like a computer program

Connectionism is used to describe mental phenomena as the emergent processes of interconnected networks of simple units.

There are many forms of connectionism, but the most common forms use neural network models.

Hence, the debate is whether traditional programming or neural networks explain the mind better

That is, does your mind work like a high-level computer program (or programs), or does your mind work like a neural network?

Practical Differences

A traditional program need have no knowledge of the hardware on which it is run

A neural network is totally dependent on the architecture and is therefore ‘hardware dependent’

You can write a traditional program without paying attention to the hardware

To solve a problem on a neural network, you have to experiment with different architectures and parameters (hardware)


Philosophical differences

A computer program contains explicit rules for manipulating symbols, e.g.

If x>y then a=1 else a=2

A neural network represents symbols (‘x’, ‘y’, etc) as a distributed pattern (typically a feature vector), e.g. 001100 for a, 110011 for b

A neural network doesn’t use explicit rules but weights on connections (and threshold functions in units) to perform tasks


Operation

A traditional program takes input, transforms the input through rules and produces output

A neural network uses spreading activation that represents a probability that a neuron generates an action potential spike (fires) that spreads to other units connected to it

Explaining the mind

Computationalism explains the mind as a suite of software that takes incoming symbols, performs mathematical and logical operations on those symbols to produce other symbols, and outputs a desired result that is ‘correct’

Connectionism explains the mind as an n-dimensional vector of numeric activation values over neural units in a network; no symbols exist

Explaining learning

Computationalism explains learning as the application of symbolic formulae to data to extract important features or derive new conclusions

Connectionism explains learning as modifying weights on connections

Where do you stand?

Do you think that your mind works like a computer (computationalism, strong AI) or like a neural network?

Remember, it is the mind we are dealing with, not the brain

Even connectionists accept that neural networks are, at best, an approximation to real brains

If you want to work at the brain level, you need to know about biochemistry

Summary

You have been introduced to the two main views in AI concerning intelligence and how to build intelligence into a machine:

Write increasingly sophisticated software

Construct increasingly sophisticated neural networks

The debate is not just about machine intelligence but about ourselves:

Are we von Neumann computers (like your desktop, Apple iPad)?

Are we neural networks (no real neural network computers have yet been built)?

Exercise: Update Perceptron weights for ‘1 1’

[Figure: perceptron with bias input -1 (W1 = 0.5), input A (W2 = 0.3), input B (W3 = -0.2), threshold t = 0.0 and output y]

For AND

A    B    Output
0    0    0
0    1    0
1    0    0
1    1    1

When ‘1 1’ is presented:

I1   I2   I3   Summation                                  Output
-1   1    1    (-1*0.5) + (1*0.3) + (1*-0.2) = -0.4       0

Using the perceptron learning rule and a learning rate of 0.1, show how the weights are changed for subsequent presentations of ‘1 1’. How many presentations does it take before the perceptron produces the correct output?

Solution to Exercise: ‘1 1’

I1   I2   I3   Summation                                  Output
-1   1    1    (-1*0.5) + (1*0.3) + (1*-0.2) = -0.4       0

First presentation: Error = (target output – actual output) = 1

Weight change = learning rate x error x input

Weight change for w1 = 0.1 x 1 x -1 = -0.1 (w1=0.5-0.1=0.4)

Weight change for w2 = 0.1 x 1 x 1 = 0.1 (w2=0.3+0.1=0.4)

Weight change for w3 = 0.1 x 1 x 1 = 0.1 (w3=-0.2+0.1=-0.1)

I1   I2   I3   Summation                                  Output
-1   1    1    (-1*0.4) + (1*0.4) + (1*-0.1) = -0.1       0

Second presentation: Error = (target output – actual output) = 1

Weight change = learning rate x error x input

Weight change for w1 = 0.1 x 1 x -1 = -0.1 (w1=0.4-0.1=0.3)

Weight change for w2 = 0.1 x 1 x 1 = 0.1 (w2=0.4+0.1=0.5)

Weight change for w3 = 0.1 x 1 x 1 = 0.1 (w3=-0.1+0.1=0)

I1   I2   I3   Summation                                  Output
-1   1    1    (-1*0.3) + (1*0.5) + (1*0) = 0.2           1

Test your weights on ‘0 1’, ‘0 0’ and ‘1 0’

For ‘0 1’:

I1   I2   I3   Summation                                  Output
-1   0    1    (-1*0.3) + (0*0.5) + (1*0) = -0.3          0

That is, the weights work for ‘0 1’.

For ‘0 0’:

I1   I2   I3   Summation                                  Output
-1   0    0    (-1*0.3) + (0*0.5) + (0*0) = -0.3          0

For ‘1 0’:

I1   I2   I3   Summation                                  Output
-1   1    0    (-1*0.3) + (1*0.5) + (0*0) = 0.2           1

That is, the output for ‘1 0’ is 1 rather than the desired 0, so these weights do not yet work for ‘1 0’.
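The final weights can be checked against every AND pattern with a few lines of Python. The variable names are chosen here for illustration; the weights W1 = 0.3, W2 = 0.5, W3 = 0, the constant input -1 and the threshold t = 0.0 come from the solution above.

```python
weights = [0.3, 0.5, 0.0]                         # weights after the second presentation
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

misclassified = []
for (a, b), target in and_table.items():
    total = sum(w * x for w, x in zip(weights, [-1, a, b]))
    out = 1 if total > 0.0 else 0                 # threshold t = 0.0
    if out != target:
        misclassified.append((a, b))
    print(a, b, round(total, 2), out, "ok" if out == target else "wrong")

print(misclassified)    # [(1, 0)]
```

Only ‘1 0’ is misclassified, which suggests that fixing ‘1 1’ has undone the earlier correction for ‘1 0’ and that further presentations would be needed before all four patterns are correct.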

References and further reading

http://en.wikipedia.org/wiki/Connectionism

butler.redlands.edu/cs/ai/AdotSaha/NNtutorial.ppt

http://philosophy.uwaterloo.ca/MindDict/connectionism.html