Spring 2005
Prof. Anthony Kuh POST 205E Dept. of Elec. Eng. University of Hawaii Phone: (808)-956-7527, Fax: (808)-956-3427 Email: kuh@spectra.eng.hawaii.edu
EE645
Preliminaries
Class Meeting Time: MWF 8:30-9:20
Office Hours: MWF 10-11 (or by appointment)
Prerequisites:
Probability: EE342 or equivalent
Linear Algebra
Programming: Matlab or C experience
A. Motivation
Why study neural networks and machine learning?
Biological inspiration (natural computation)
Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
Implementation
Applications
Cognitive (human vs. computer intelligence): humans superior to computers in pattern recognition, associative recall, learning complex tasks; computers superior to humans in arithmetic computations, simple repeatable tasks.
Biological (study of the human brain): 10^10 to 10^11 neurons in the cerebral cortex, with an average of 10^3 interconnections per neuron.
A neuron
[Figure: biological neuron]
Neural Network
Connection of many neurons together forms a neural network.
Neural network properties:
Highly parallel (distributed computing)
Robust and fault tolerant
Flexible (short and long term learning)
Handles variety of information (often random, fuzzy, and inconsistent)
Small, compact, dissipates very little power
B. Single Neuron
(Computational node)
[Figure: single neuron with inputs x, weight vector w, bias w0, and activation g(·)]
s = w^T x + w0: synaptic strength (linearly weighted sum of inputs).
y = g(s): activation or squashing function.
Activation functions
Linear units: g(s) = s
Linear threshold units: g(s) = sgn(s)
Sigmoidal units: g(s) = tanh(Bs), B > 0
Neural networks generally have nonlinear activation functions.
Most popular models: linear threshold units and sigmoidal units. Other types of computational units: receptive units (radial basis functions).
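As a concrete illustration, here is a minimal Python/NumPy sketch of these units and of the single neuron computation s = w^T x + w0, y = g(s). The particular weights and input are arbitrary examples, not values from the notes.

    import numpy as np

    def linear(s):                 # linear unit: g(s) = s
        return s

    def linear_threshold(s):       # linear threshold unit: g(s) = sgn(s)
        return np.sign(s)

    def sigmoidal(s, B=1.0):       # sigmoidal unit: g(s) = tanh(B s), B > 0
        return np.tanh(B * s)

    # Single neuron (computational node): s = w^T x + w0, y = g(s)
    w, w0 = np.array([0.5, -0.3]), 0.1   # example synaptic weights and bias
    x = np.array([1.0, 2.0])             # example input
    s = w @ x + w0                       # synaptic strength
    y = sigmoidal(s)                     # squashed output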
[Figure: network of computational units with inputs and output]
Neural network represented by directed graph: edges represent weights and nodes represent computational units.
Definitions
Feedforward neural network has no loops in directed graph.
Neural networks are often arranged in layers.
Single layer feedforward neural network has one layer of computational nodes.
Multilayer feedforward neural network has two or more layers of computational nodes.
Computational nodes that are not output nodes are called hidden units.
Regression problem
[Figure: linear threshold unit with weight vector w, bias w0, and sgn(·) output]
Linearly separable
Consider a set of points with two labels, + and o. The set of points is linearly separable if a linear threshold function can partition the + points from the o points.
[Figure: a set of linearly separable points]
Perceptron Learning Algorithm (PLA):
1) w(0) arbitrary; k = 0.
2) Pick point (x(k), d(k)).
3) If w(k)^T x(k) d(k) > 0, go to 5).
4) w(k+1) = w(k) + x(k) d(k).
5) k = k+1; check if cycled through data; if not, go to 2). Otherwise stop.
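A minimal Python/NumPy sketch of the PLA as stated above. Folding the threshold weight in by augmenting x with a constant 1, and initializing w(0) to zero, are illustrative assumptions.

    import numpy as np

    def pla(X, d, max_epochs=100):
        """Perceptron learning on rows of X with labels d in {-1, +1}."""
        X = np.hstack([X, np.ones((len(X), 1))])   # augment inputs for threshold weight
        w = np.zeros(X.shape[1])                   # step 1: w(0) arbitrary (zero here)
        for _ in range(max_epochs):
            updated = False
            for x_k, d_k in zip(X, d):             # step 2: pick point (x(k), d(k))
                if (w @ x_k) * d_k <= 0:           # step 3 fails: point misclassified
                    w = w + x_k * d_k              # step 4: w(k+1) = w(k) + x(k) d(k)
                    updated = True
            if not updated:                        # step 5: cycled through data, no errors
                break
        return w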
PLA comments
Perceptron convergence theorem (requires margins)
Sketch of proof
Updating threshold weights
Algorithm is based on the cost function J(w) = -(sum of synaptic strengths of misclassified points)
w(k+1) = w(k) - η(k) ∇J(w(k)) (gradient descent)
[Figure: adaptive linear unit, y = s]
Inputs: x(k)
Outputs: y(k)
Desired outputs: d(k)
Weights: w(k)
Error: e(k) = d(k) - y(k)
Wiener solution
Define P = E[x(k) d(k)] and R = E[x(k) x(k)^T].
J(w) = 0.5 E[(d(k) - y(k))^2]
     = 0.5 E[d(k)^2] - E[x(k) d(k)]^T w + 0.5 w^T E[x(k) x(k)^T] w
     = 0.5 E[d(k)^2] - P^T w + 0.5 w^T R w
Note J(w) is a quadratic function of w. To minimize J(w), find the gradient ∇J(w) and set it to 0.
∇J(w) = -P + R w = 0, so R w = P (Wiener solution). If R is nonsingular, then w = R^-1 P.
Resulting minimum MSE = 0.5 E[d(k)^2] - 0.5 P^T R^-1 P.
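A sketch of computing the Wiener solution from a batch of data, with the expectations R and P approximated by sample averages (an assumption for illustration; the notes define them as true expectations).

    import numpy as np

    def wiener_solution(X, d):
        """Rows of X are inputs x(k); d holds desired outputs d(k)."""
        R = X.T @ X / len(X)              # R ~ E[x(k) x(k)^T]
        P = X.T @ d / len(X)              # P ~ E[x(k) d(k)]
        w = np.linalg.solve(R, P)         # solve R w = P (assumes R nonsingular)
        mmse = 0.5 * np.mean(d**2) - 0.5 * P @ w   # resulting minimum MSE
        return w, mmse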
Iterative algorithms
Steepest descent algorithm (move in direction of negative gradient)
w(k+1) = w(k) - η ∇J(w(k)) = w(k) + η (P - R w(k))
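A sketch of the steepest descent iteration above. The constant step size η is an assumption; for convergence it must satisfy 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of R.

    import numpy as np

    def steepest_descent(R, P, eta=0.01, iters=1000):
        w = np.zeros(len(P))                 # initial weight vector
        for _ in range(iters):
            w = w + eta * (P - R @ w)        # w(k+1) = w(k) + eta (P - R w(k))
        return w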
Blind algorithms (second order statistics):
Minimum output energy methods
Reduced order approximations: PCA, multistage Wiener filter
[Block diagram: Inputs → Signal Processing → Feature Extraction → Neural Network → Outputs]
Capabilities depend directly on the total number of weights and threshold values. A one hidden layer network with a sufficient number of hidden units can represent any Boolean function and approximate arbitrarily well the solutions to pattern recognition problems and well behaved function approximation problems. Sigmoidal units are more powerful than linear threshold units.
B. Error backpropagation
Error backpropagation algorithm: methodical way of implementing LMS algorithm for multilayer neural networks.
Two passes: forward pass (computational pass), backward pass (weight correction pass).
Analog computations based on MSE criterion.
Hidden units usually sigmoidal units.
Initialization: weights take on small random values.
Algorithm may not converge to global minimum.
Algorithm converges slower than for linear networks.
Representation is distributed.
BP Algorithm Comments
δ's are error terms computed from the output layer back to the first layer in the dual network.
Training is usually done online; examples are presented in random or sequential order.
Update rule is local, as weight changes only involve connections adjacent to the weight.
Computational complexity depends on the number of computational units.
Initial weights are randomized to avoid converging to local minima.
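A minimal sketch of one online BP update for a single hidden layer of tanh units and a linear output unit. The layer sizes, step size, and initialization scale are illustrative assumptions.

    import numpy as np

    def bp_step(x, d, W1, w2, eta=0.05):
        """One forward/backward pass; returns the updated weights."""
        # Forward (computational) pass
        h = np.tanh(W1 @ x)                      # hidden unit outputs
        y = w2 @ h                               # linear output unit
        e = d - y                                # error e(k) = d(k) - y(k)
        # Backward (weight correction) pass: delta terms flow output -> input
        delta2 = e                               # output layer error term
        delta1 = (1 - h**2) * (w2 * delta2)      # hidden deltas; tanh'(s) = 1 - h^2
        w2 = w2 + eta * delta2 * h               # output weight correction
        W1 = W1 + eta * np.outer(delta1, x)      # hidden weight correction
        return W1, w2

    # Initialization: weights take on small random values
    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.standard_normal((4, 3))       # 3 inputs, 4 hidden units
    w2 = 0.1 * rng.standard_normal(4)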
BP Architecture
[Figure: forward (computational) network and sensitivity (dual) network]
Modifications to BP Algorithm
Batch procedure
Variable step size
Better approximation of gradient method (momentum term, conjugate gradient)
Newton methods (Hessian)
Alternate cost functions
Regularization
Network construction algorithms
Incorporating time
[Figure: nonseparable X and O points in the input space mapped by Φ: X → feature space, where a linear unit can separate them]
First layer
Fix widths, centers determined from lattice structure
Fix widths, clustering algorithm for centers (a clustering sketch follows below)
Resource allocation network
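A sketch of the clustering option: centers from a plain k-means pass, and Gaussian receptive units with a fixed width. Both k-means and the width value are illustrative assumptions, not a method prescribed by the notes.

    import numpy as np

    def kmeans_centers(X, m, iters=20, seed=0):
        """Pick m centers by k-means on the rows of X."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), m, replace=False)].copy()
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - C[None, :, :])**2).sum(-1), axis=1)
            C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                          for j in range(m)])
        return C

    def rbf_first_layer(X, C, width=1.0):
        """Receptive units: Gaussian radial basis functions with fixed width."""
        d2 = ((X[:, None, :] - C[None, :, :])**2).sum(-1)
        return np.exp(-d2 / (2 * width**2))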
1) Draw the convex hull around each set of points.
2) Find the shortest line segment connecting the two convex hulls.
3) Find the midpoint of this line segment.
4) The optimal hyperplane intersects the line segment at its midpoint, perpendicular to the segment.
Optimal hyperplane
Maximizing margins is equivalent to minimizing the magnitude of the weight vector. Let u and v be margin points satisfying
W^T u + b = 1
W^T v + b = -1
Subtracting gives W^T (u - v) = 2, so W^T (u - v) / ||W|| = 2 / ||W|| = 2m, where m is the margin.
[Figure: optimal hyperplane with weight vector w and margin points u, v]
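A small numeric check of this identity, with a hypothetical hyperplane and margin points chosen to satisfy the two equations above.

    import numpy as np

    W, b = np.array([2.0, 0.0]), 0.0    # assumed weight vector and threshold
    u = np.array([0.5, 1.0])            # satisfies W^T u + b = 1
    v = np.array([-0.5, 1.0])           # satisfies W^T v + b = -1
    assert W @ u + b == 1 and W @ v + b == -1
    print(W @ (u - v))                              # 2
    print(W @ (u - v) / np.linalg.norm(W))          # 2/||W|| = 2m, so m = 0.5 here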
Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the curse of dimensionality. Solution: use kernel methods, where computations are done in the dual observation space.
[Figure: nonseparable X and O points in the input space mapped by Φ: X → feature space, where they become linearly separable]
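A sketch of working in the dual observation space: the kernel stands in for inner products in feature space, so only kernel evaluations are ever computed. The Gaussian kernel and its parameter gamma are illustrative choices.

    import numpy as np

    def gaussian_kernel(X1, X2, gamma=1.0):
        """K[i, j] = exp(-gamma ||x1_i - x2_j||^2), an inner product in feature space."""
        d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
        return np.exp(-gamma * d2)

    def dual_decision(alpha, d, Xsv, b, x):
        """Classifier in dual form: sign(sum_i alpha_i d_i k(x_i, x) + b)."""
        k = gaussian_kernel(Xsv, x[None, :])[:, 0]
        return np.sign((alpha * d) @ k + b)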
Solving QP problem
SVMs require solving large QP problems. However, many of the α's are zero (those points are not support vectors). Break the QP up into subproblems.
Chunking (Vapnik, 1979): numerical solution.
Osuna algorithm (1997): numerical solution.
Platt algorithm (1998): Sequential Minimal Optimization (SMO), analytical solution.
SMO Algorithm
Sequential Minimal Optimization breaks the QP program up into small subproblems that are solved analytically. SMO solves the dual QP SVM problem by examining points that violate the KKT conditions. The algorithm converges and consists of:
1) Search for 2 points that violate the KKT conditions.
2) Solve the QP program for the 2 points (analytically).
3) Calculate the threshold value b.
4) Continue until all points satisfy the KKT conditions.
On numerous benchmarks the time to convergence of SMO varied from O(l) to O(l^2.2). Convergence time depends on the difficulty of the classification problem and the kernel functions used.
SVM Summary
SVMs are based on optimum margin classifiers and are solved using quadratic programming methods.
SVMs are easily extended to problems that are not linearly separable.
SVMs can create nonlinear separating surfaces via kernel functions.
SVMs can be efficiently programmed via the SMO algorithm.
SVMs can be extended to solve regression problems.
VI. Unsupervised Learning
A. Motivation
Given a set of training examples with no teacher or critic, why do we learn?
Feature extraction
Data compression
Signal detection and recovery
Self organization
Information about the data can be found from the inputs alone.
B. Principal Component Analysis (PCA)
Diagram of PCA
[Figure: data scatter with first and second principal component directions]
Oja's rule
Use a normalized Hebbian rule applied to a linear neuron.
[Figure: linear neuron, y = s]
A normalized Hebbian rule is needed; otherwise the weight vector will grow unbounded.
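A sketch of Oja's rule for a single linear neuron; step size, epoch count, and random initialization are assumptions. The weight vector tends toward the first principal component direction.

    import numpy as np

    def oja(X, eta=0.01, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(X.shape[1])
        w = w / np.linalg.norm(w)
        for _ in range(epochs):
            for x in X:
                y = w @ x                       # linear neuron output
                w = w + eta * y * (x - y * w)   # Hebbian term minus normalizing decay
        return w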
Generalized Hebbian Algorithm (GHA)
APEX
Approximate the correlation matrix R with time averages.
Applications of PCA
Matched filter problem: x(k) = s(k) + v(k)
Multiuser communications: CDMA
Image coding (data compression)
[Block diagram: image coding with PCA/GHA followed by a quantizer]
C. Independent Component Analysis (ICA)
Inputs U are assumed independent; the user sees X = AU, where A is an unknown mixing matrix. Goal: find W so that Y = WX is independent.
ICA Solution
Y = DPU, where D is a diagonal matrix and P is a permutation matrix.
Algorithm is unsupervised.
Under what assumptions is learning possible? All components of U except possibly one must be nongaussian.
Establish a criterion to learn from (use higher order statistics): information based criteria, kurtosis function.
Kullback-Leibler divergence: D(f, g) = ∫ f(x) log(f(x)/g(x)) dx
Learning Algorithms
Weights can be learned by approximating the divergence cost function using contrast functions. Iterative gradient estimate algorithms can be used. Faster convergence can be achieved with fixed point algorithms that approximate Newton's method. These algorithms have been shown to converge.
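A sketch of one such fixed point iteration for a single component, using the kurtosis based contrast g(u) = u^3 and assuming the data have already been whitened. This is a FastICA style update, offered as one plausible instance of the algorithms described, not the notes' specific method.

    import numpy as np

    def fixed_point_ica(Z, iters=100, seed=0):
        """Z: whitened data, rows are samples. Returns one unmixing vector."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(Z.shape[1])
        w = w / np.linalg.norm(w)
        for _ in range(iters):
            u = Z @ w
            w_new = (Z * (u**3)[:, None]).mean(0) - 3 * w   # E[z g(w^T z)] - E[g'(w^T z)] w
            w = w_new / np.linalg.norm(w_new)               # renormalize each step
        return w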
Applications of ICA
Array antenna processing
Blind source separation: speech separation, biomedical signals, financial data
D. Competitive Learning
Motivation: Neurons compete with one another with only one winner emerging.
The brain is a topologically ordered computational map; arrays of neurons self organize.
1) Initialize weights.
2) Randomly choose inputs.
3) Pick winner.
4) Update weights associated with winner.
5) Go to 2).
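A sketch of this winner take all loop. Step size, epoch count, and initializing the weights from data points are assumptions; a full Kohonen map would also update the winner's lattice neighbors.

    import numpy as np

    def competitive_learning(X, m, eta=0.1, epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        W = X[rng.choice(len(X), m, replace=False)].copy()   # 1) initialize weights
        for _ in range(epochs):
            for x in X[rng.permutation(len(X))]:             # 2) randomly choose inputs
                j = np.argmin(((W - x)**2).sum(1))           # 3) pick winner (nearest weights)
                W[j] = W[j] + eta * (x - W[j])               # 4) update winner's weights
        return W                                             # 5) loop back to 2)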
Backgammon:
459-24-24-1 network to rate moves
Hand crafted examples; noise helped in training
59% winning percentage against SUN gammontools
Later versions used reinforcement learning
Handwritten digit recognition:
16-768-192-30-10 network to distinguish digits
Preprocessed data; 2 hidden layers act as feature detectors
7291 training examples, 2000 test examples
Error rates: 0.14% on training data, 5% on test data, 1% on test data with 12% rejected
Speech recognition:
KSOFM map followed by a feedforward neural network
40-120 frames mapped onto a 12 by 12 Kohonen map
Each frame composed of a 600 to 1800 element analog vector
Output of the Kohonen map fed to the feedforward network
Reduced search using the KSOFM map
TI 20 word database
98-99% correct on speaker dependent classification
Other topics
Reinforcement learning
Associative networks
Neural dynamics and control
Computational learning theory
Bayesian learning
Neuroscience
Cognitive science
Hardware implementation