- Background
- Artificial neurons, what they can and cannot do
- The multilayer perceptron (MLP)
- Three forms of learning
- The back propagation algorithm
- Radial basis function networks
- Competitive learning (and relatives)
An artificial neuron
[Figure: an artificial neuron with inputs x0 = +1, x1, x2, ..., xn, weights w0, w1, w2, ..., wn, a summation S and a transfer function f]

$$y = f(S), \qquad S = \sum_{i=0}^{n} w_i x_i = w_0 + \sum_{i=1}^{n} w_i x_i$$

(x0 = +1, so w0 acts as a bias term)
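A minimal NumPy sketch of this neuron, assuming a sigmoid as the transfer function f (any choice of f fits the same pattern):

```python
import numpy as np

def neuron(x, w, f=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """One artificial neuron: prepend the fixed bias input x0 = +1,
    form the weighted sum S, and pass it through the transfer function f."""
    x = np.concatenate(([1.0], x))  # x0 = +1, so w[0] acts as the bias w0
    S = np.dot(w, x)                # S = sum_{i=0..n} w_i * x_i
    return f(S)

# Example: n = 2 inputs, weight vector [w0, w1, w2]
y = neuron(np.array([0.5, -1.0]), np.array([0.1, 0.8, -0.3]))
```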
[Figure: the neuron's decision boundary in the (x1, x2) plane: the line where w0 + w1 x1 + w2 x2 = 0]

Only linearly separable classification problems can be solved.

[Figure: decision lines for AND and NOR over the inputs x1, x2, with their truth tables]

Two sigmoids implement fuzzy AND and NOR.
The multilayer perceptron (MLP)

[Figure: a feed-forward network of several layers between inputs and outputs]

Can implement any function, given a sufficiently rich internal structure (number of nodes and layers).
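A minimal sketch of the forward pass through such a network; each layer is a weight matrix whose first column holds the bias weights (the layer sizes below are arbitrary, just for illustration):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mlp_forward(x, weights):
    """Forward pass through an MLP given as a list of weight matrices,
    one per layer; each matrix has a first column for the bias (x0 = +1)."""
    y = x
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))
    return y

# Example: 2 inputs -> 3 hidden nodes -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
print(mlp_forward(np.array([0.2, 0.7]), weights))
```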
Application areas
- Finance: forecasting, fraud detection
- Medicine: image analysis
- Consumer market: household equipment, character recognition, speech recognition
- Industry: adaptive control, signal analysis, data mining
Neural networks are statistical methods
- Model independence
- Adaptivity/flexibility
- Concurrency
- Economical reasons (rapid prototyping)
Three forms of learning
- Supervised (e.g. back propagation): the network maps an input to an output (y); an error function compares y with the desired output (d) and determines the corrective action
- Unsupervised: the network must find structure in unlabeled data on its own
- Reinforcement: an agent acts in an environment and receives a reward signal

Back propagation
The derivative $\partial E / \partial w_{ji}$ measures the weight $w_{ji}$'s contribution to the error $E$. The weight should be moved in proportion to that contribution, but in the other direction:

$$\Delta w_{ji} = -\eta \, \frac{\partial E}{\partial w_{ji}}$$
Assumptions
- The error is the squared error:
$$E = \frac{1}{2} \sum_{j=1}^{n} (d_j - y_j)^2$$
- The transfer function is the sigmoid:
$$y_j = f(S_j) = \frac{1}{1 + e^{-S_j}}$$

This yields the update rule
$$\Delta w_{ji} = \eta \, \delta_j \, x_i$$
where, for output nodes,
$$\delta_j = \underbrace{y_j (1 - y_j)}_{\text{derivative of sigmoid}} \; \underbrace{(d_j - y_j)}_{\text{derivative of error}}$$
and, for hidden nodes,
$$\delta_j = y_j (1 - y_j) \sum_k \delta_k w_{kj}$$
with the sum taken over all nodes $k$ in the next layer (closer to the outputs).
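These formulas translate almost line by line into code. Below is a minimal NumPy sketch that trains a one-hidden-layer MLP on XOR; the network size (3 hidden nodes), learning rate, and epoch count are arbitrary choices for the example:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

eta = 0.5
W1 = rng.normal(scale=0.5, size=(3, 3))  # hidden layer: 3 nodes, bias + 2 inputs
W2 = rng.normal(scale=0.5, size=(1, 4))  # output layer: 1 node, bias + 3 hidden

for epoch in range(20000):
    for x, d in zip(X, D):
        # forward pass
        x1 = np.concatenate(([1.0], x))   # x0 = +1 carries the bias
        h = sigmoid(W1 @ x1)              # hidden activations
        h1 = np.concatenate(([1.0], h))
        y = sigmoid(W2 @ h1)              # network output
        # backward pass
        delta_out = y * (1 - y) * (d - y)                    # output nodes
        delta_hid = h * (1 - h) * (W2[:, 1:].T @ delta_out)  # hidden nodes
        # weight updates: dw_ji = eta * delta_j * x_i
        W2 += eta * np.outer(delta_out, h1)
        W1 += eta * np.outer(delta_hid, x1)
```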
Overtraining
[Figure: typical error curves over time (epochs): the training set error keeps decreasing, while the test or validation set error reaches a minimum and then rises again as overtraining sets in]

Overtraining is more likely to occur
- if we train on too little data
- if the network has too many hidden nodes
- if we train for too long

Cross validation: use a third set, a validation set, to decide when to stop (find the minimum for this set, and retrain for that number of epochs), as in the sketch below.
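A minimal sketch of this stopping rule; train_one_epoch and validation_error are hypothetical callables standing in for whatever training loop and error measure are in use:

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs):
    """Return the epoch at which the validation error was lowest.
    train_one_epoch() runs one epoch of training (hypothetical);
    validation_error() measures the error on the held-out
    validation set (hypothetical)."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
    # As the slide suggests: retrain from scratch for best_epoch epochs.
    return best_epoch
```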
Network size
- The network should be slightly larger than the size necessary to represent the target function
- Unfortunately, the target function is unknown ...
- Need much more training data than the number of weights!
Practical considerations
- What happens if the mapping represented by the data is not a function? For example, what if the same input does not always lead to the same output?
- In what order should data be presented? Sequentially? At random?
- How should data be represented? Compact? Distributed?
- What can be done about missing data?
- Trick of the trade: monotonic functions are easier to learn than non-monotonic functions! (at least for the MLP)
Radial basis function (RBF) networks

Geometric interpretation
The input space is covered with overlapping Gaussians. Each hidden node measures the distance between its weight vector and the input vector (instead of computing a weighted sum).

[Figure: an RBF network with an input layer, a hidden layer of Gaussian nodes, and an output layer]
RBF training
Could use backprop (the transfer function is still differentiable). Better: train the layers separately (see the sketch below):
- Hidden layer: find the position and size of the Gaussians by unsupervised learning (e.g. competitive learning, K-means)
- Output layer: supervised, e.g. the delta rule, LMS, or backprop
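A minimal sketch of the two-stage scheme, assuming the Gaussian centres have already been found (e.g. by K-means) and that all Gaussians share one width; a direct least-squares fit stands in for the iterative delta rule / LMS:

```python
import numpy as np

def train_rbf_output(X, D, centers, width):
    """Fit the linear output layer of an RBF network.
    X: (n, d) inputs; D: (n, m) desired outputs;
    centers: (M, d) Gaussian positions found by unsupervised learning."""
    # hidden layer: a Gaussian of the distance to each center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    H = np.exp(-(dists ** 2) / (2 * width ** 2))
    H = np.hstack([np.ones((len(X), 1)), H])   # bias column
    W, *_ = np.linalg.lstsq(H, D, rcond=None)  # least-squares output weights
    return W
```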
Unsupervised learning
Classifying unlabeled data.

Nearest neighbour classifiers
Classify the unknown sample (vector) x to the class of its closest previously classified neighbour.

[Figure: a new pattern, x, is assigned the class of its nearest labelled neighbour]

Problem 1: the closest neighbour may be an outlier from the wrong class.
Problem 2: we must store lots of samples and compute the distance to each one, for every new sample.

K-means
K-means, for K = 2 (sketched in code below):
1. Make a codebook of two vectors, c1 and c2
2. Sample (at random) two vectors from the data as initial values of c1 and c2
3. Split the data into two subsets, D1 and D2, where D1 is the set of all points with c1 as their closest codebook vector, and vice versa
4. Move c1 towards the mean of D1 and c2 towards the mean of D2
5. Repeat from 3 until convergence (until the codebook vectors stop moving)
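A minimal NumPy sketch of these five steps; here each codebook vector is moved all the way to its subset's mean, the usual batch variant:

```python
import numpy as np

def kmeans2(data, rng=np.random.default_rng(0)):
    """K-means for K = 2, following the steps above.
    data: (n, d) array; returns the two codebook vectors."""
    # steps 1-2: a codebook of two vectors, sampled from the data
    c = data[rng.choice(len(data), size=2, replace=False)].copy()
    while True:
        # step 3: split the data by closest codebook vector
        closest = np.argmin(
            np.linalg.norm(data[:, None, :] - c[None, :, :], axis=2), axis=1)
        # step 4: move each codebook vector to the mean of its subset
        new_c = np.array([data[closest == k].mean(axis=0) for k in range(2)])
        # step 5: repeat until the codebook vectors stop moving
        if np.allclose(new_c, c):
            return new_c
        c = new_c
```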
Voronoi regions
K-means forms so-called Voronoi regions in the input space. The Voronoi region around a codebook vector ci is the region in which ci is the closest codebook vector.
Competitive learning
M linear, threshold-less nodes (computing only weighted sums), N inputs.
1. Present a pattern (sample), x
2. The node with the largest output (node k) is declared the winner
3. The weights of the winner are updated so that it will become even stronger the next time the same pattern is presented; all other weights are left unchanged:

$$\Delta w_{ki} = \eta \,(x_i - w_{ki}), \qquad 1 \le i \le N$$

With normalised weights, this is equivalent to finding the node with the minimum distance between its weight vector and the input vector. Network node = codebook vector. (See the sketch below.)

[Figure: Voronoi regions around 10 codebook vectors]
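A minimal sketch of this loop, using the equivalent minimum-distance form of the winner selection; the learning rate and epoch count are arbitrary choices:

```python
import numpy as np

def competitive_learning(data, M, eta=0.1, epochs=50,
                         rng=np.random.default_rng(0)):
    """Hard competitive learning: only the winner k moves towards each
    input, w_k += eta * (x - w_k). Weights are initialised from the data
    to avoid the poor-initialisation problem described below."""
    W = data[rng.choice(len(data), size=M, replace=False)].copy()
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            k = np.argmin(np.linalg.norm(W - x, axis=1))  # winner
            W[k] += eta * (x - W[k])                      # move the winner
    return W
```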
[Figure: weight vectors initialised in a region W, far from the two data clusters A and B]

Poor initialisation: the weight vectors have been initialised to small random numbers (in the region W), but these are far from the data (clusters A and B). The first node to win will move from W towards A or B and will always win henceforth.
Solution: use the data to initialise the weights (as in K-means), or include the winning frequency in the distance measure, or move more nodes than only the winner.
Dimensionality reduction
SOM
Competitive learning, extended in two ways:
1. The nodes are organised in a two-dimensional grid (in competitive learning, there is no defined order between nodes).
[Figure: a 3x3 grid, making a two-dimensional map of the four-dimensional input space]
2. Not only the winner k, but also its grid neighbours, are moved towards the input:

$$\Delta w_{ji} = \eta \, f(j, k)\,(x_i - w_{ji}), \qquad 1 \le i \le N$$

where f(j, k) is a neighbourhood function in the range [0, 1], with a maximum for the winner (j = k) and decreasing with distance from the winner, e.g. a Gaussian.

Gradually decrease the neighbourhood radius (the width of the Gaussian) and the learning rate (η) over time.

Result: vectors that are close in the high-dimensional input space will activate areas that are close on the grid.
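A minimal NumPy sketch with linearly decaying learning rate and neighbourhood radius (the grid size and decay schedules are arbitrary choices for the example):

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=100, eta0=0.5, radius0=1.5,
              rng=np.random.default_rng(0)):
    """Self-organising map on a 2-D grid: the winner and its grid
    neighbours move towards each input, weighted by a Gaussian f(j, k)
    over grid distance."""
    rows, cols = grid
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = data[rng.choice(len(data), size=rows * cols)].copy()
    for t in range(epochs):
        eta = eta0 * (1 - t / epochs)               # decaying learning rate
        radius = radius0 * (1 - t / epochs) + 0.1   # shrinking neighbourhood
        for x in data[rng.permutation(len(data))]:
            k = np.argmin(np.linalg.norm(W - x, axis=1))   # winner
            f = np.exp(-np.sum((pos - pos[k]) ** 2, axis=1) / (2 * radius ** 2))
            W += eta * f[:, None] * (x - W)   # dw_ji = eta f(j,k) (x_i - w_ji)
    return W.reshape(rows, cols, -1)
```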
[Figure: a SOM trained on wine samples; wines from the same soil type end up in nearby areas of the map]
Note that the network is not told that the difference between the wines is the soil type, nor how many such types (how many classes) there are.
Growing neural gas (GNG)

Node positions
- Start with two nodes
- Each node has a set of neighbours, indicated by edges
- The edges are created and destroyed dynamically during training
- For each sample, the closest node, k, and all its current neighbours are moved towards the input
Node creation
- A new node is created every λth time step, unless the maximum number of nodes has been reached
- The new node is placed halfway between the node with the greatest error and the node among its current neighbours with the greatest error
- The node with the greatest error is the most unstable one
[Figure: the network after a while, having grown to 7 nodes]
Neighbourhood
Neighbourhood edges are created and destroyed as follows (see the sketch below):
- For each sample, let k denote the winner (the node closest to the sample) and r the runner-up (the second closest)
- If an edge exists between k and r, reset its age to 0; otherwise, create such an edge and set its age to 0
- Increment the age of all other edges emanating from node k
- Edges older than $a_{\max}$ are removed, as are any nodes that in this way lose their last remaining edge
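A minimal sketch of this edge bookkeeping for one sample; moving the nodes themselves (and creating new ones) is assumed to happen elsewhere, as described above:

```python
import numpy as np

def update_edges(nodes, edges, x, a_max):
    """One sample's worth of edge bookkeeping. nodes: (M, d) codebook
    vectors; edges: dict mapping a sorted pair (i, j) to its age."""
    k, r = np.argsort(np.linalg.norm(nodes - x, axis=1))[:2]  # winner, runner-up
    edges[tuple(sorted((int(k), int(r))))] = 0   # create or reset the k-r edge
    for e in list(edges):
        if int(k) in e and int(r) not in e:      # other edges at k grow older
            edges[e] += 1
            if edges[e] > a_max:                 # too old: remove the edge
                del edges[e]
    # nodes that lost their last remaining edge are removed as well
    alive = {i for e in edges for i in e}
    return alive, edges
```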
Delaunay triangulation
Connect the codebook vectors in all adjacent Voronoi regions.
Dead units
There is only one way for an edge to get younger: when the two nodes it interconnects are the two closest to the input. If one of the two nodes wins but the other one is not the runner-up, then, and only then, does the edge age. If neither of the two nodes wins, the edge does not age!

[Figure: the input distribution has jumped from the lower left to the upper right corner, leaving dead units behind]
The lab
(in room 1515!)
- Classification of bitmaps, by supervised learning (back propagation), using the SNNS simulator
- An illustration of some unsupervised learning algorithms, using the GNG demo applet
Algorithms in the GNG demo applet:
- LBG/LBG-U (≈ K-means)
- HCL (hard competitive learning)
- Neural gas
- CHL (competitive Hebbian learning)
- Neural gas with CHL
- GNG/GNG-U (growing neural gas)
- SOM (self-organising map)
- Growing grid