Neural Networks
Chapter 20.5
Introduction & Basics
Perceptrons
Perceptron Learning and the PLR
Beyond Perceptrons
Two-Layered Feed-Forward Neural Networks
5/19/2013 2001-2004 James D. Skrentny from notes by C. Dyer, et. al.
Introduction
"'Artificial Neural Networks' are massively parallel interconnected networks of simple (usually adaptive) elements and their hierarchical organizations which are intended to interact with the objects of the real world in the same way as biological nervous systems do." Teuvo Kohonen
Known as:
Neural Networks (NNs)
Artificial Neural Networks (ANNs)
Connectionist Models
Parallel Distributed Processing (PDP) Models
NN similar to brain:
knowledge is acquired experientially (learning)
knowledge is stored in connections (weights)
dendrites collect input from other neurons
a single axon sends output to other neurons
neurons are connected at synapses, which have varying strength
this model is greatly simplified
number of neurons: 10^11
number of connections: 10^4 per neuron
neuron death rate: 10^5 per day
neuron birth rate: ~0
connection birth rate: very slow
performance: about 10^2 msec, i.e. about 100 sequential neuron firings, for "many" tasks
Attractions of NN approach:
massively parallel computation (MIMD, optical computing, analog systems)
interesting complex global behavior emerges from a large collection of simple processing elements
pattern recognition (handwriting, facial expressions, etc.)
forecasting (stock prices, power grid demand)
adaptive control (autonomous vehicle control, robot control)
robust computation: can handle noisy and incomplete data, due to a fine-grained, distributed, and continuous knowledge representation
fault tolerant: it is ok to have faulty elements and bad connections, since the network isn't dependent on a fixed set of elements and connections
degrades gracefully: continues to function, at a lower level of performance, when portions of the network are faulty
useful as a psychological model
useful for a wide variety of high-performance applications
Network components:
units: simple neuron-like processing elements (PEs)
links: directed connections from one unit to another
weights: positive or negative real values attached to links; the means of long-term storage, adjusted by learning
activation: the result of a unit's processing; the unit's output
A network is represented as a graph of units organized into layers.
Unit composition:
inputs: from other units, or from sensors of the environment
outputs: to other units, or to effectors of the environment
activation: computed from local information only (the inputs from neighbors and the weights), as a simple function of the linear combination of the inputs
[Figure: a unit combining inputs x1...xn, weighted by w1...wn, to produce an activation and output]
Given n inputs, the unit's activation is defined by:
a = g( (w1 * x1) + (w2 * x2) + ... + (wn * xn) )
where:
wi are the weights
xi are the input values
g() is a simple non-linear function; letting in_i be the weighted sum of the inputs, common choices are:
step: activation flips from 0 to 1 when in_i >= threshold t
sign: activation flips from -1 to +1 when in_i >= 0
sigmoid: activation transitions smoothly from 0 to 1 around in_i = 0, computed as 1/(1+e^-x) where x is in_i
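The three activation choices above can be sketched in a few lines of Python (a minimal illustration; the function names are ours):

```python
import math

def step(in_i, t):
    # flips from 0 to 1 when the weighted sum in_i reaches the threshold t
    return 1 if in_i >= t else 0

def sign(in_i):
    # flips from -1 to +1 when in_i reaches 0
    return 1 if in_i >= 0 else -1

def sigmoid(in_i):
    # smooth transition from 0 to 1 around in_i = 0
    return 1.0 / (1.0 + math.exp(-in_i))

def activation(weights, inputs, g=sigmoid):
    # a = g(w1*x1 + w2*x2 + ... + wn*xn)
    return g(sum(w * x for w, x in zip(weights, inputs)))
```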
Perceptrons:
studied mainly as single-layered nets, since an effective learning algorithm was known
a simple 1-layer network in which the units act independently
composed of linear threshold units (LTUs)
a unit's inputs, xi, are weighted by wi and summed
a step function with threshold t computes the activation level a
[Figure: an LTU summing weighted inputs x1...xn, plus a constant -1 input weighted by the threshold t, through a step function to produce a]
Perceptrons: AND Example

AND Perceptron: inputs x1 and x2 each have weight .5, and a constant -1 input carries the threshold weight .75.
input (1, 1): .5*1 + .5*1 + .75*(-1) = .25, so output = 1
input (0, 0): .5*0 + .5*0 + .75*(-1) = -.75, so output = 0
Across the 4 possible data points, the threshold acts like a separating line in (x1, x2) space.
Perceptrons: OR Example
OR Perceptron: inputs x1 and x2 each have weight .5, and a constant -1 input carries the threshold weight .25.
input (1, 1): .5*1 + .5*1 + .25*(-1) = .75, so output = 1
input (0, 0): .5*0 + .5*0 + .25*(-1) = -.25, so output = 0
Across the 4 possible data points, the threshold again acts like a separating line.
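Both example perceptrons can be traced directly, using the slide's weights and thresholds (a sketch; `ltu` is our name for the linear threshold unit):

```python
def ltu(weights, threshold, inputs):
    # linear threshold unit: output 1 iff the weighted input sum reaches the
    # threshold (the threshold plays the role of the weight on the constant
    # -1 input in the slides)
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def AND(x1, x2):
    return ltu([0.5, 0.5], 0.75, [x1, x2])

def OR(x1, x2):
    return ltu([0.5, 0.5], 0.25, [x1, x2])
```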
Perceptron Learning
Programmer specifies:
supervised learning is used: the correct output is given for each training example
an example is a list of values for the input units
the correct output is a list of desired values for the output units
1. Initialize the weights (usually to small random values)
2. Repeat until all examples are correctly classified, or some other stopping criterion is met:
for each example e in the training set do:
a. compute the perceptron's output O for e
b. look up the desired (teacher's) output T for e
c. update the weights

Unlike some other learning techniques, Perceptrons need to see all of the training examples multiple times. Each pass through all of the training examples is called an epoch.
Determining how to update the weights is a case of the credit assignment problem.
Perceptron Learning Rule: wi = wi + Dwi, where Dwi = a * xi * (T - O)
where:
T is the teacher's (desired) output and O is the perceptron's actual output
xi is the value associated with the ith input unit
a is a constant between 0.0 and 1.0 called the learning rate
Dwi = a * xi * (T - O)
note that the update doesn't depend on wi
correct output (T = O) gives Dwi = a * xi * 0 = 0
zero input (xi = 0) gives Dwi = a * 0 * (T - O) = 0
if T = 1 but O = 0, increase wi so that next time the weighted sum may exceed the threshold, making the output 1
if T = 0 but O = 1, decrease wi so that next time the weighted sum won't exceed the threshold, making the output 0
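The full training loop can be sketched as follows, here learning OR from its 4 examples (a sketch: the helper names, the choice of alpha = 0.1, and the epoch cap are ours; the threshold is learned as the weight on a constant -1 input, as in the slides):

```python
def predict(weights, t, xs):
    # the threshold t acts as the weight on a constant -1 input
    return 1 if sum(w * x for w, x in zip(weights, xs)) - t >= 0 else 0

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=100):
    # Perceptron Learning Rule: Dwi = alpha * xi * (T - O), applied to the
    # input weights and to the threshold (whose "input" is the constant -1)
    weights = [0.0] * n_inputs
    t = 0.0
    for _ in range(epochs):
        misclassified = 0
        for xs, target in examples:
            o = predict(weights, t, xs)
            if o != target:
                misclassified += 1
            for i, x in enumerate(xs):
                weights[i] += alpha * x * (target - o)
            t += alpha * (-1) * (target - o)
        if misclassified == 0:   # one clean epoch: stop
            break
    return weights, t

# learn OR from all 4 possible examples
or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, t = train_perceptron(or_examples, 2)
```

Note that nothing changes on a correctly classified example, since (T - O) is then 0.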
PLR is also called the Delta Rule or the Widrow-Hoff Rule. PLR is a variant of a rule proposed by Rosenblatt in 1960, based on the idea that:
the strength of a connection between two units should be adjusted in proportion to the product of their simultaneous activations
the product is used as a means of measuring the correlation between the values output by the two units
PLR is a "local" learning rule: only local information in the network is needed to update a weight.
PLR performs gradient descent in "weight space": it iteratively adjusts all of the weights so that, for each training example, the error is monotonically non-increasing (i.e., approximately decreases).
The Perceptron Convergence Theorem says that if a set of examples is learnable, then PLR will find the necessary weights.
That is, if a solution exists, PLR's gradient descent is guaranteed to find an optimal solution (i.e., 100% correct classification) for any 1-layer neural network.
A single perceptron's output is determined by the separating hyperplane defined by:
(w1 * x1) + (w2 * x2) + ... + (wn * xn) = t
So, Perceptrons can only learn functions that are linearly separable (in input space).
XOR Perceptron: inputs x1 and x2 each have weight .5, but what threshold weight could go on the constant -1 input? In the 2-D input space there are 4 possible data points, and no straight line separates the positives, (0,1) and (1,0), from the negatives, (0,0) and (1,1). XOR is not linearly separable, so no single perceptron can compute it.
Beyond Perceptrons
Perceptrons as a computing model are too weak because they can only learn linearly-separable functions. To enhance the computational ability, general neural networks have multiple layers of units.
The challenge is to find a learning rule that works for multi-layered networks.
A feed-forward multi-layered network computes a function of the inputs and the weights. Input units sit on the left (or bottom) of the network diagram.
Perceptrons have input units followed by one layer of output units, i.e. no hidden units.
NNs with one hidden layer containing a sufficient number of units can compute functions associated with convex classification regions in input space. NNs with two hidden layers are universal computing devices, although the complexity of the computable function is limited by the number of units.
If there are too few units, the network will be unable to represent the function. If there are too many, the network will memorize examples and is subject to overfitting.
[Figure: a two-layered feed-forward network with inputs I1...I6, a layer of hidden units, and output units whose activations a1 and a2 are the network outputs O1 and O2; weights label the links from input to hidden and from hidden to output]
Two-layered: count the layers of units that compute an activation (the input units don't compute one).
Feed-forward: each unit in a layer connects forward to all of the units in the next layer, with:
no cycles
no links within the same layer
no links to prior layers
no skipping of layers
[Figure: inputs I1...I6 feeding hidden units (Layer 1), which feed output units (Layer 2) with activations a1 = O1 and a2 = O2]
Neural Networks
Chapter 20.5
Two-Layered Feed-Forward Neural Networks
Solving XOR
Learning in Multi-Layered Feed-Forward NNs
Back-Propagation
Computing the Change for Weights
Other Issues & Applications
Conquering XOR
XOR network: inputs I1 and I2 feed two hidden units, each of which connects to a single output unit O.
the top hidden unit acts like an OR perceptron (weights .5 and .5, threshold .25)
the bottom hidden unit acts like an AND perceptron (weights .5 and .5, threshold .75)
the output unit weights the OR unit's output by .5 and the AND unit's output by -.5
Each unit in the hidden layer acts like a perceptron learning a separating line.
The output unit combines these separating lines by intersecting the "half-planes" they define: when OR is 1 and AND is 0, the output O is 1; on the other data points, (0,0) and (1,1), the output is 0.
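The hidden-layer construction can be checked directly (a sketch; `xor_net` is our name, and the output unit's threshold isn't legible in the slide figure, so 0.25 is an assumed value: any threshold in (0, 0.5] behaves the same):

```python
def ltu(weights, threshold, inputs):
    # linear threshold unit, as in the perceptron examples
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def xor_net(i1, i2):
    or_unit  = ltu([0.5, 0.5], 0.25, [i1, i2])   # top hidden unit: OR
    and_unit = ltu([0.5, 0.5], 0.75, [i1, i2])   # bottom hidden unit: AND
    # output fires only when OR is 1 and AND is 0; the 0.25 threshold here
    # is our assumption (not legible in the slide figure)
    return ltu([0.5, -0.5], 0.25, [or_unit, and_unit])
```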
PLR doesn't work in multi-layered feed-forward nets, since the desired values for the hidden units aren't known. We must again solve the Credit Assignment Problem:
determine which weights to credit/blame for the output error in the network
determine which weights in the network should be updated, and how to update them
Back-Propagation:
a method for learning the weights in these networks
generalizes the PLR
(re)discovered by Rumelhart, Hinton, and Williams in 1986

Back-Propagation approach:
a gradient-descent algorithm that minimizes the error on the training data
errors are propagated through the network, starting at the output units and working backwards towards the input units
Back-Propagation Algorithm
1. Initialize the weights in the network (usually to random values, as in the PLA)
2. Repeat until all examples are correctly classified, or some other stopping criterion is met:
for each example e in the training set do:
a. forward pass: Oi = neural_net_output(network, e)
b. Ti = desired output, i.e. the Target or Teacher's output
c. calculate the error (Ti - Oi) at the output units
d. backward pass: update_weights(network, Dwj,i, Dwk,j)
i. compute Dwj,i for all weights from the hidden layer to the output layer
ii. compute Dwk,j for all weights from the inputs to the hidden layer
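One forward pass and one backward pass can be sketched in pure Python for a small sigmoid network (a minimal illustration of the update formulas given later in these slides; bias inputs are omitted for brevity, the starting weights are invented, and all names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_in, w_out, xs):
    # hidden activations a_j = g(sum_k w_in[k][j] * I_k)
    hidden = [sigmoid(sum(w_in[k][j] * xs[k] for k in range(len(xs))))
              for j in range(len(w_in[0]))]
    # outputs O_i = g(sum_j w_out[j][i] * a_j)
    out = [sigmoid(sum(w_out[j][i] * hidden[j] for j in range(len(hidden))))
           for i in range(len(w_out[0]))]
    return hidden, out

def sse(out, ts):
    # squared error of one example: sum_i (T_i - O_i)^2 / 2
    return sum((t - o) ** 2 for t, o in zip(ts, out)) / 2.0

def backprop_step(w_in, w_out, xs, ts, alpha=0.5):
    hidden, out = forward(w_in, w_out, xs)
    # output error terms: (T_i - O_i) * g'(in_i), with g' = O_i * (1 - O_i)
    d_out = [(t - o) * o * (1 - o) for t, o in zip(ts, out)]
    # hidden error terms: error distributed back through the weights,
    # a_j * (1 - a_j) * sum_i w_out[j][i] * d_out[i]
    d_hid = [hidden[j] * (1 - hidden[j]) *
             sum(w_out[j][i] * d_out[i] for i in range(len(out)))
             for j in range(len(hidden))]
    # Dw_j,i = alpha * a_j * d_out[i]   (hidden -> output)
    for j in range(len(hidden)):
        for i in range(len(out)):
            w_out[j][i] += alpha * hidden[j] * d_out[i]
    # Dw_k,j = alpha * I_k * d_hid[j]   (input -> hidden)
    for k in range(len(xs)):
        for j in range(len(hidden)):
            w_in[k][j] += alpha * xs[k] * d_hid[j]

# hypothetical starting weights for a 2-2-1 network
w_in = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.5], [-0.5]]
example, target = [1.0, 1.0], [1.0]
_, before = forward(w_in, w_out, example)
backprop_step(w_in, w_out, example, target)
_, after = forward(w_in, w_out, example)
```

For a small enough learning rate, a single gradient step reduces the error on the example it was computed from.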
Back-propagation performs a gradient descent search in weight space to learn network weights. Given a network with n weights:
each configuration of weights is a vector, W, of length n that defines an instance of the network W can be considered a point in an n-dimensional weight space, where each dimension is associated with one of the connections in the network
Each network defined by the vector W has an associated total error, E, on all of the training data. The sum of squared errors (SSE) is defined as:
E = E1 + E2 + ... + Em
where each Ei is the squared error of the network on the ith training example.
Given n output units in the network:
Ei = ( (T1 - O1)^2 + (T2 - O2)^2 + ... + (Tn - On)^2 ) / 2
where Ti is the target value for the ith output unit and Oi is the network's output value for the ith output unit.
Picture E as a surface over weight space: for two weights w1 and w2, a 2-D surface represents the error for all weight configurations. The goal is to find a lower point on the error surface (a local minimum). Gradient descent follows the direction of steepest descent, i.e. the direction in which E decreases the most.
[Figure: an error surface plotted over the (w1, w2) weight plane]
The gradient is defined as:
∇E = [ ∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn ]
Then change the ith weight by:
Dwi = -a * ∂E/∂wi
Computing the derivatives needed for the gradient direction requires an activation function that is continuous, differentiable, non-decreasing, and easily computed:
can't use the step function, as in LTUs
instead use the sigmoid function 1/(1 + e^-x), where x is in_i, the weighted sum of the inputs
For weights between hidden and output units:
Dwj,i = -a * ∂E/∂wj,i = a * aj * (Ti - Oi) * g'(ini) = a * aj * (Ti - Oi) * Oi * (1 - Oi)
where:
wj,i is the weight on the link from hidden unit j to output unit i
a is the learning rate parameter
aj is the activation (i.e. output) of hidden unit j
Ti is the teacher's output for output unit i
Oi is the actual output of output unit i
g' is the derivative of the sigmoid activation function, g' = g(1 - g)
[Figure: the network, annotated to show that a weight change Dw1,2 is the product of the learning rate, the source hidden unit's activation, and the output unit's error term]
Hidden units don't have teacher-supplied correct output values, so the error at these units is inferred by "back-propagating": the error at an output unit is "distributed" back to each of the hidden units in proportion to the weight of the connection between them, so the total error is spread across all of the hidden units that contributed to it.
Each hidden unit accumulates some error from each of the output units to which it is connected.
For weights between inputs and hidden units:
Dwk,j = -a * ∂E/∂wk,j
      = a * Ik * g'(inj) * Σi( wj,i * (Ti - Oi) * g'(ini) )
      = a * Ik * aj * (1 - aj) * Σi( wj,i * (Ti - Oi) * Oi * (1 - Oi) )
where:
wk,j is the weight on the link from input k to hidden unit j
wj,i is the weight on the link from hidden unit j to output unit i
a is the learning rate parameter
aj is the activation (i.e. output) of hidden unit j
Ti is the teacher's output for output unit i
Oi is the actual output of output unit i
Ik is the kth input value
g' is the derivative of the sigmoid activation function, g' = g(1 - g)
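Plugging hypothetical numbers into both update formulas makes the arithmetic concrete (all values here are invented purely to trace the computation, for a network with a single output unit so the sum has one term):

```python
alpha = 0.1           # learning rate
I_k = 1.0             # input value k
a_j = 0.6             # activation of hidden unit j
w_ji = 0.5            # weight from hidden unit j to output unit i
T_i, O_i = 1.0, 0.8   # teacher and actual outputs for output unit i

# hidden -> output: Dw_j,i = alpha * a_j * (T_i - O_i) * O_i * (1 - O_i)
dw_ji = alpha * a_j * (T_i - O_i) * O_i * (1 - O_i)

# input -> hidden (one output unit, so the sum over i has a single term):
# Dw_k,j = alpha * I_k * a_j * (1 - a_j) * w_ji * (T_i - O_i) * O_i * (1 - O_i)
dw_kj = alpha * I_k * a_j * (1 - a_j) * w_ji * (T_i - O_i) * O_i * (1 - O_i)
```

With these numbers, dw_ji works out to about 0.00192 and dw_kj to about 0.000384; the input-to-hidden change is the hidden-to-output change scaled by the back-propagation factors Ik, (1 - aj), and wj,i.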
[Figure: the network, annotated to show the factors of Dw1,2 for an input-to-hidden weight: the learning rate a, the input I1, the term a2 * (1 - a2) = g'(in2), and the error back-propagated through the weights w2,i]
Other Issues
How many hidden units? Use a tuning set or cross-validation to determine experimentally the number of units that minimizes error.
How many training examples? A rule of thumb, where n is the number of weights in the network and e is the acceptable test-set error (a fraction between 0 and 1): use about n/e training examples, and train until 1 - e/2 of the training set is classified correctly.
e.g. if n = 80 and e = 0.1 (i.e. 10% error on the test set): use a training set of size 800, and train until 95% are classified correctly, which should produce about 90% correct classification on the test set.
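The rule of thumb is a one-liner of arithmetic (a sketch; `training_plan` is our name):

```python
def training_plan(n_weights, e):
    # rule of thumb: about n/e training examples, train until a fraction
    # 1 - e/2 of them are correct, expect test accuracy about 1 - e
    return {
        "train_set_size": round(n_weights / e),
        "train_accuracy_target": 1 - e / 2,
        "expected_test_accuracy": 1 - e,
    }

# the slide's example: 80 weights, 10% acceptable test error
plan = training_plan(80, 0.1)
```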
When should training stop? Train the network until the error rate on a tuning set begins increasing, rather than training until the error (i.e. the SSE) on the training set is minimized.
Applications
Applications: ALVINN
ALVINN (Pomerleau, 1988) learns to control vehicle steering so as to stay in the middle of its lane.
topology: a two-layered feed-forward network trained with back-propagation
topology, input:
a color image is preprocessed to obtain a 30x32 pixel image
each pixel is one byte, an integer from 0 to 255 corresponding to the brightness of the image
the network has 960 input units (30 x 32)
topology, output:
30 output units: output unit 1 means sharp left, output unit 30 means sharp right
target outputs form a Gaussian distribution with a variance of 10, centered on the desired steering direction d: Oi = e^(-(i-d)^2 / 10)
the steering direction is recovered by computing a least-squares best fit of the output units' values to a Gaussian distribution with a variance of 10; the peak of this distribution is taken as the steering direction
the error for learning is: target output - actual output
topology, hidden:
only 4 hidden units, with complete connectivity: 960 input units to 4 hidden units to 30 output units
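The Gaussian target encoding above can be sketched directly (a minimal illustration; `steering_targets` is our name, and we assume, per the slide, units numbered 1 to 30 with d the desired direction):

```python
import math

def steering_targets(d, n_units=30):
    # target output for unit i: O_i = e^(-(i - d)^2 / 10),
    # a Gaussian with variance 10 centered on direction d
    return [math.exp(-((i - d) ** 2) / 10.0) for i in range(1, n_units + 1)]

# targets when the desired steering direction is straight ahead (unit 15)
targets = steering_targets(d=15)
```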
Learning:
trained initially by observing a human driver (takes ~5 minutes starting from random initial weights)
then trains on its own driving (an epoch of training every 2 seconds thereafter)
problems: there aren't negative examples, and the network may overfit the data in recent images (e.g. a straight road) at the expense of past images (e.g. a road with curves)
solutions: generate negative examples by synthesizing views of the road that are incorrect for the current steering, and maintain a buffer of 200 real and synthesized images that keeps some images for many different steering directions
Results:
has driven at speeds up to 70 mph
has driven continuously for distances up to 90 miles
has driven across the continent, during different times of day and in different traffic conditions
can drive on: single-lane roads and highways, multi-lane highways, paved bike paths, and dirt roads
Summary
Advantages
parallel processing architecture
robust with respect to node failure
fine-grained, distributed representation of knowledge
robust with respect to noisy data
incremental algorithm (i.e. learns as it goes)
simple computations
empirically shown to work well for many problem domains
Disadvantages
slow training (i.e. takes many epochs)
poor interpretability (i.e. difficult to extract rules)
ad hoc network topologies (i.e. layouts)
hard to debug, because distributed representations preclude content checking
may converge to a local, rather than global, minimum of error
it may be hard to describe a problem in terms of features with numerical values
it is not known how to model higher-level cognitive mechanisms with the NN model