
Machine Learning Algorithms: Theory, Applications and Software Tools

Lecture 2 Basics of ANN: MLP


Prof. Mikhail Kanevski Institute of Geomatics and Analysis of Risk, University of Lausanne Mikhail.Kanevski@unil.ch

Contents
- Introduction to artificial neural networks
- Multilayer perceptron
- Case studies


Basics of ANN
Artificial neural networks are analytical systems that address problems whose solutions have not been explicitly formulated. In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions, although they may be extremely complex, have been made explicit.


Basics of ANN
We can program or train neural networks to store, recognise, and associatively retrieve patterns; to filter noise from measurement data; to control ill-defined problems;

in summary:
to estimate sampled functions when we do not know the form of the functions.


Basics of ANN
Unlike statistical estimators, they estimate a function without a mathematical model of how outputs depend on inputs. Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.


Basics of ANN
The major applications of ANN:
- Feature recognition (pattern classification)
- Speech recognition
- Signal processing
- Time-series prediction
- Function approximation and regression, classification
- Data mining
- Intelligent control
- Associative memories
- Optimisation
- And many others


Basics of ANN. Simple biological neuron


Basics of ANN Simple model of the neuron


Examples of transfer functions.

f(x) = 1 / (1 + exp(-x))

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
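These two functions are easy to evaluate numerically; a minimal sketch (NumPy is an assumption, not part of the lecture material):

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) transfer function: maps any real input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent written explicitly; equivalent to np.tanh(x), range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(logistic(x))
print(tanh(x))
```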

Basics of ANN
The main parts of an ANN:
- Neurones (nodes, cells, units, processing elements)
- Network topology (connections between neurones)


Basics of ANN
In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and the connections between them form the network topology.


Multilayer perceptron


Basics of ANN. ANN learning/training


Supervised learning is the most common type of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O(Input(i)), is compared with the desired Output(i). After presenting the entire training subset, the weights are updated. This updating is done in such a way that a measure of the error between the network's outputs and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used.
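As an illustration of this epoch-based scheme, here is a minimal sketch for a single linear neuron trained by batch updates on squared error; the synthetic data, epoch size and learning rate are illustrative assumptions, not the lecture's example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                            # training inputs Input(i)
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # desired outputs Output(i)

w = rng.normal(scale=0.1, size=2)    # small random initial weights
b = 0.0
lr, epoch_size, n_epochs = 0.1, 50, 100

for epoch in range(n_epochs):
    idx = rng.choice(len(X), size=epoch_size, replace=False)  # subset of the training set
    Xs, ts = X[idx], t[idx]
    y = Xs @ w + b                    # network outputs O(Input(i))
    err = y - ts                      # compare with the desired outputs
    # one weight update per pass over the subset (= one epoch)
    w -= lr * Xs.T @ err / epoch_size
    b -= lr * err.mean()

print("learned weights:", w, "bias:", b)
```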


Basics of ANN. ANN supervised learning.


[Diagram: examples are presented both to the neural network and to a teacher; the evaluation of the network's response against the teacher's desired response drives the learning algorithm, which makes modifications to the network]


Basics of ANN Feedforward ANN.


If there are no feedback or lateral connections, we have a feedforward ANN. The most frequently used model is the so-called multilayer perceptron. The term feedforward means that information flows in only one direction, from the input to the output.


ANN Multi-layer Perceptron (MLP)


- Depends only on the data and its inner structure
- Is able to learn from data and generalise
- Good at modelling nonlinearities
- Robust to noise and outliers
[ANN = artificial neurons + connection weights]

Basics of ANN
All knowledge of an ANN is based on the synaptic weights between its units.


The Universality Property


A two-layer feed-forward neural network with step activation functions can implement any Boolean function, provided that the number of hidden neurons H is sufficiently large.


MLP modelling

F1(t, w) = w1^out f(w1 t + b1) + b^out

F2(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + b^out

F3(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + w3^out f(w3 t + b3) + b^out
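F1, F2 and F3 are one-hidden-layer MLPs with 1, 2 and 3 hidden neurons respectively. A generic sketch for H hidden units with a logistic transfer function (all parameter values are illustrative assumptions):

```python
import numpy as np

def mlp_1d(t, w_hidden, b_hidden, w_out, b_out):
    # F_H(t, w) = sum_h w_out[h] * f(w_hidden[h] * t + b_hidden[h]) + b_out
    f = lambda a: 1.0 / (1.0 + np.exp(-a))        # logistic transfer function
    hidden = f(np.outer(t, w_hidden) + b_hidden)  # shape (n_points, H)
    return hidden @ w_out + b_out

t = np.linspace(0, 1, 5)
# H = 3 hidden neurons, arbitrary illustrative weights
print(mlp_1d(t, w_hidden=np.array([4.0, -2.0, 1.0]),
             b_hidden=np.array([0.0, 1.0, -0.5]),
             w_out=np.array([1.5, -0.7, 0.3]),
             b_out=0.2))
```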

Backpropagation training


The error function depends on the network's weights (W):

E_l(W) = (1/n) Σ_{j=0..n-1} [ T_lj - Z_lj^out(W) ]²
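In code, this error is simply the mean squared difference between the targets T_lj and the network outputs Z_lj; a small sketch with placeholder values:

```python
import numpy as np

def mse_error(targets, outputs):
    # E_l(W) = (1/n) * sum_j (T_lj - Z_lj(W))**2
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.mean((targets - outputs) ** 2)

print(mse_error([1.0, 0.0, 1.0], [0.8, 0.1, 0.6]))  # 0.07 for this toy example
```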


MLP training algorithms


Optimisation algorithms used for MLP training:
- Stochastic: annealing, genetic algorithms
- Gradient-based: conjugate gradients (slow 1st-order gradient algorithm), Levenberg-Marquardt (fast 2nd-order gradient algorithm), BFGS quasi-Newton, steepest descent, RProp (resilient propagation), BackProp (back propagation)

Feedforward ANN: Multilayer perceptron. Backprop algorithm


The possibilities and capabilities of multilayer perceptrons stem from the non-linearities used within the nodes. An MLP can learn with a supervised learning rule: the backpropagation algorithm. The backward error propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons. The backpropagation algorithm is a supervised learning algorithm. It is an iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables.


Basics of ANN Backpropagation Algorithm


The backpropagation algorithm follows these steps:
1. Initialize the weights. Usually it is recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters].
2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network.
3. Calculate the actual output of the ANN.


Basics of ANN Backpropagation Algorithm


4. Calculate the error measure and update the weights. Use a recursive algorithm starting at the output neurons (nodes) and working back to the first hidden layer; it is this backward propagation of output errors that inspired the name of this training algorithm. Update the weights W as follows:


We want to know how to modify the weights in order to decrease the error function:

w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)


Basics of ANN Backpropagation Algorithm

w_ij^m(n+1) = w_ij^m(n) + η δ_i^m Z_j^(m-1)

where n is the iteration step, η is the rate of learning (0 < η ≤ 1), Z_j^(m-1) is the output of the j-th neurone in layer (m-1), and the error δ_i for the output layer is defined by the equations on the next slide.


Basics of ANN Backpropagation Algorithm

δ_i^out = Z_i^out (1 - Z_i^out) (T_i - Z_i^out)

δ_i^h = Z_i^h (1 - Z_i^h) Σ_j w_ji^(h+1) δ_j^(h+1)
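A compact sketch of these two delta rules and the corresponding weight update for a network with one hidden layer of logistic units; the dimensions, data and learning rate are illustrative assumptions, not the lecture's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, target, W_h, b_h, W_o, b_o, lr=0.1):
    """One backpropagation update for a 1-hidden-layer MLP with logistic units."""
    # forward pass
    z_h = sigmoid(W_h @ x + b_h)      # hidden outputs Z^h
    z_o = sigmoid(W_o @ z_h + b_o)    # network outputs Z^out

    # output-layer deltas: delta_i = Z_i (1 - Z_i) (T_i - Z_i)
    delta_o = z_o * (1.0 - z_o) * (target - z_o)
    # hidden-layer deltas: delta_i = Z_i (1 - Z_i) * sum_j w_ji * delta_j
    delta_h = z_h * (1.0 - z_h) * (W_o.T @ delta_o)

    # weight updates: w_ij(n+1) = w_ij(n) + lr * delta_i * Z_j
    W_o += lr * np.outer(delta_o, z_h)
    b_o += lr * delta_o
    W_h += lr * np.outer(delta_h, x)
    b_h += lr * delta_h
    return z_o

# toy dimensions: 2 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3)
W_o, b_o = rng.normal(scale=0.1, size=(1, 3)), np.zeros(1)
backprop_step(np.array([0.2, -0.4]), np.array([1.0]), W_h, b_h, W_o, b_o)
```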


Basics of ANN Backpropagation Algorithm


Other error measures (such as maximum absolute error and median squared error) have even greater advantages in many situations. For example, median squared error is useful because unlike the mean the median is a robust statistic - its value is insensitive to occasional large errors in the training data. Unfortunately, practical techniques for implementing these more desirable error measures do not yet exist. Thus, most neural networks today are tied to mean squared error measurements.


Basics of ANN Backpropagation Algorithm


More general error functions can be written that take into account the importance of the samples presented to the network (weighting, declustering, economic criteria, etc.):

E_l(W) = Σ_{j=0..n-1} λ_lj [ T_lj - Z_lj^out(W) ]²

where λ_lj expresses the importance (weight) of sample lj.


Gradient descent

[Figure: error surface J(w) vs. w, showing the direction of the gradient ∇J(w) and the minimum]

Gradient descent

[Figure: gradient descent steps on the error surface J(w) towards the minimum]

In reality, the situation with the error function and the corresponding optimization problem is much more complicated: there are multiple local minima!


Gradient descent
Local minima


Simulated Annealing (SA): Illustration


How important are local minima?


(Duda et al. 2001)

In computational practice, we do not want our network to be caught in a local minimum with high training error, because this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net.


How important are local minima?


(Duda et al. 2001)

In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge toward the global minimum for acceptable performance.

In short
The presence of multiple minima does not necessarily present difficulties in training nets, and a few simple heuristics can often overcome such problems (see next slide)


Practical techniques for improving backpropagation


- Activation function (sigmoid, hyperbolic tangent, ...)
- Scaling inputs
- Training with noise (noise injection)
- Initializing weights (simulated annealing)
- Regularization (weight decay)
- Number of hidden layers
- Learning parameters (rates, momentum, ...)
- Cost function
- ...


Interpretation of network outputs


Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form

E = lim_{N→∞} (1/2N) Σ_{n=1..N} Σ_k { y_k(x^n; w) - t_k^n }²
  = (1/2) Σ_k ∫∫ { y_k(x; w) - t_k }² p(t_k, x) dt_k dx

Interpretation of network outputs


The network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x:

y_k(x; w*) = ⟨ t_k | x ⟩
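A small numerical check of this statement on synthetic data: for a fixed input, the constant output that minimises the sum-of-squares error is (approximately) the average of the targets observed at that input. The data-generating function and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.5
t = np.sin(2 * np.pi * x0) + 0.3 * rng.normal(size=2000)   # many targets observed at the same x0

# candidate constant outputs y; find the one minimising sum((y - t)^2)
candidates = np.linspace(-2, 2, 401)
sse = ((candidates[:, None] - t[None, :]) ** 2).sum(axis=1)
best = candidates[np.argmin(sse)]

print(best, t.mean())   # both approximate the conditional mean <t | x0>
```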


DEMO


MLP and number of layers


The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point. On the other hand, with two hidden layers, the approximation process becomes more manageable.

Two hidden layers! (Haykin)


1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second hidden layer. Specifically, a neuron in the second hidden layer combines the outputs of the neurons in the first hidden layer operating on a particular region of the input space, thereby learning the global features for that region while outputting zero elsewhere.

Data Preprocessing
Machine learning algorithms are data-driven methods. The quality and quantity of the data are essential for training and generalization.
Input data → Pre-processing → MLA → Post-processing → Results

Types of pre-processing:
1. Linear and nonlinear transformations: e.g. input scaling/normalisation, Z-score transform, square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporation of prior knowledge: invariants, hints, ...
4. Feature extraction: linear/nonlinear combinations of input variables
5. Feature selection: deciding which features to use

Dimensionality reduction
Two approaches are available to perform dimensionality reduction:
- Feature extraction: creating a set of new features by combinations of the existing features
- Feature selection: choosing a subset of all the features (the most informative ones)


Feature selection/extraction


Feature selection
Reducing the feature space by throwing out some of the features (covariates)
Also called variable selection

Motivating idea: try to find a simple, parsimonious model (Occam's razor!)


Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006



Dimensionality Reduction
Clearly we lose some information, but this can be helpful due to the curse of dimensionality. We need some way of deciding which dimensions to keep:
1. Random choice
2. Principal components analysis (PCA)
3. Independent components analysis (ICA)
4. Self-organised maps (SOM)
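As an illustration of such feature extraction, a minimal PCA sketch (scikit-learn is an assumption, and the random data stand in for real inputs):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # 500 samples, 10 original features
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=500)     # a nearly redundant feature

pca = PCA(n_components=4)                           # keep 4 linear combinations of the inputs
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)
```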

Data transform
Y = aZ + b
Y = log(Z)
Y = Ind(Z, Zs)

Normalisation (Z-score): Y = (Z - Zm) / σ, where Zm is the mean and σ the standard deviation

Box-Cox nonlinear transform:

Y(λ) = (Z^λ - 1) / λ,  for λ ≠ 0
Y(λ = 0) = ln(Z)
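The Z-score and Box-Cox transforms in code, as a sketch (the positive, skewed sample data are an assumption):

```python
import numpy as np

def zscore(z):
    # Y = (Z - Zm) / sigma, with Zm the mean and sigma the standard deviation
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    # Y(lambda) = (Z**lambda - 1) / lambda for lambda != 0, and ln(Z) for lambda = 0
    if lam == 0:
        return np.log(z)
    return (z ** lam - 1.0) / lam

z = np.random.default_rng(0).lognormal(size=1000)   # positive, skewed data
print(zscore(z)[:3])
print(box_cox(z, 0.5)[:3], box_cox(z, 0.0)[:3])
```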

Model Selection & Model Evaluation


Guillaume d'Occam (1285 - 1349)


Pluralitas non est ponenda sine necessitate

Occam's razor:

The simpler explanation of a phenomenon is more likely to be correct.



Model Assessment and Model Selection: Two separate goals


Model Selection:
Estimating the performance of different models in order to choose the (approximately) best one

Model Assessment:
Having chosen a final model, estimating its prediction error (generalization error) on new data

If we are in a data-rich situation, the best solution is to randomly (?) split the data:

Raw data → Train: 50% (train), Validation: 25% (test), Test: 25% (validation)


Interpretation
- The training set is used to fit the models.
- The validation set is used to estimate the prediction error for model selection (tuning hyperparameters).
- The test set is used for assessment of the generalization error of the final chosen model.
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
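A sketch of such a random split; the 50/25/25 proportions follow the slide, and the data size is a placeholder:

```python
import numpy as np

def split_indices(n, rng, frac_train=0.50, frac_val=0.25):
    # shuffle the sample indices once, then cut into train / validation / test
    idx = rng.permutation(n)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(42)
train_idx, val_idx, test_idx = split_indices(1000, rng)
print(len(train_idx), len(val_idx), len(test_idx))   # 500 250 250
```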

Bias and Variance. Model complexity


[Figure: example fits of different complexity, illustrating (b) overfitting and (c) underfitting]

One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely, but it does not generalise well; that is, it cannot model sufficiently well unseen data from the same task. Solution: balance the statistical bias and the statistical variance when doing neural network learning in order to achieve the smallest average generalization error.


Bias-Variance Dilemma
Assume that

Y = f(X) + ε,  where E(ε) = 0 and Var(ε) = σ_ε²



We can derive an expression for the expected prediction error of a regression at an input point X=x0 using squared-error loss:


Err(x0) = E[ (Y - f̂(x0))² | X = x0 ]
        = σ_ε² + [ E f̂(x0) - f(x0) ]² + E[ f̂(x0) - E f̂(x0) ]²
        = σ_ε² + Bias²( f̂(x0) ) + Var( f̂(x0) )
        = Irreducible Error + Bias² + Variance


The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σ_ε² = 0. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. The last term is the variance, the expected squared deviation of f̂(x0) around its mean.
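A small simulation sketch of this decomposition at a single point x0: repeatedly draw training sets, fit an estimator, and inspect the squared bias and the variance of its predictions. The polynomial-fit estimator, the true function and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # true function
sigma, x0, n, runs, degree = 0.3, 0.5, 30, 2000, 3

preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, n)
    y = f(x) + sigma * rng.normal(size=n)  # noisy training set
    coef = np.polyfit(x, y, degree)        # fit a degree-3 polynomial
    preds[r] = np.polyval(coef, x0)        # prediction at x0

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("irreducible:", sigma**2, "bias^2:", bias2, "variance:", variance)
print("expected error approx.:", sigma**2 + bias2 + variance)
```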


Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001



A neural network is only as good as the training data! Poor training data inevitably leads to an unreliable and unpredictable network. Exploratory Data Analysis and data preprocessing are extremely important!!!


MLP modelling. Case Studies.


Original (10 000 points) Training (900 points)


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.97, Ro = 0.69


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.61, Ro = 0.80


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.67, Ro = 0.79


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.10, Ro = 0.92


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 0.83, Ro = 0.95


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 0.55, Ro = 0.98


MLP modeling
[Figure: training statistics (RMSE and Ro) as a function of MLP architecture: 5, 10, 5-5, 10-10, 15-15, 20-20]
Is the 20-20 model the best?



MLP modeling. Training statistics

MLP      RMSE   Ro
5        1.97   0.69
10       1.61   0.80
5-5      1.67   0.79
10-10    1.10   0.92
15-15    0.83   0.95
20-20    0.55   0.98


MLP modeling. Training & Validation statistics

[Figure: training and validation RMSE and Ro as a function of MLP architecture: 5, 10, 5-5, 10-10, 15-15, 20-20]



MLP modeling. Validation statistics

MLP      RMSE   Ro
5        2.01   0.68
10       1.66   0.80
5-5      1.70   0.79
10-10    1.25   0.89
15-15    1.24   0.89
20-20    1.39   0.88


ANNEX model: Artificial Neural Networks with EXternal drift for environmental data mapping


Traditional application of ANN to spatial predictions


Data are available at measurement points: F(xi, yi), i = 1, ..., N.
Problem: predict F(x, y) at the points without measurements, usually on a regular grid.
ANN solution: x, y are the 2 inputs and F is the output; select the ANN architecture, train it with the available data, and after training use it to predict.

ANNEX is similar to kriging with an external drift model: if there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.

Inputs: x, y, + f_ext(x, y)
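A sketch of the difference between the two settings using scikit-learn's MLPRegressor; the library, the synthetic drift f_ext and the coordinates are assumptions for illustration, not the case-study data below:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(400, 2))                       # measurement coordinates (x, y)
f_ext = 0.02 * xy[:, 0] + rng.normal(scale=0.5, size=400)     # external drift, e.g. DEM altitude
target = 20 - 0.5 * f_ext + rng.normal(scale=0.3, size=400)   # primary variable, e.g. temperature

# plain ANN: inputs are the coordinates only
ann = MLPRegressor(hidden_layer_sizes=(7,), max_iter=5000, random_state=0)
ann.fit(xy, target)

# ANNEX: the external information is simply an additional input
annex = MLPRegressor(hidden_layer_sizes=(7,), max_iter=5000, random_state=0)
annex.fit(np.column_stack([xy, f_ext]), target)

print(ann.score(xy, target), annex.score(np.column_stack([xy, f_ext]), target))
```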

Examples of external information


- Cheap information on a secondary variable
- Physical model of the phenomenon
- Remotely sensed images
- GIS data
- DEM data

Kriging with external drift


Kriging with external drift is the model in which the trend is limited to

E{F(x,y)} = m(x,y) = β0 + β1 f_ext(x,y)    (1)

where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated. In general, kriging with an external drift is a simple and efficient algorithm for incorporating a secondary variable in the estimation of the primary variable.

ANNEX model
What relationship between the primary and the external information should hold in the case of ANNEX?


ANNEX model
What does the related external information bring (how to measure it: correlation between variables?)
- Improved accuracy of prediction?
- Reduced uncertainty of prediction?

An important problem is related to the quality of the additional data: there is a dilemma between introducing new information and introducing new noise.

Case study: Kazakh Priaralie, monitoring network

1 400 000 km2 - 400 monitoring stations



Datasets
GIS DEM model

Average long-term air temperatures in June (°C)



Correlation
Air temperature vs. Altitude


Train and Test datasets


Train Test


ANN and ANNEX models


Model                         Correlation   RMSE   MAE    MRE
2-7-5-1                       0.917         2.57   1.96   -0.02
3-3-1                         0.989         0.96   0.73   -0.01
3-5-1                         0.99          0.9    0.7    -0.007
3-7-1                         0.991         0.85   0.66   -0.004
3-8-1                         0.991         0.84   0.68   -0.001
3-9-1                         0.991         0.88   0.69   -0.01
3-10-1                        0.99          0.92   0.74   -0.01
Kriging with external drift   0.984         1.19   0.91   -0.03


Scatter plots

[Figure: scatter plots for Kriging, Cokriging, Drift Kriging and ANNEX]

Mapping results
[Figure: prediction maps for Kriging, Cokriging, Drift Kriging and ANNEX]

Modelling noisy altitude effect (100 %)

[Figure: altitude model before and after adding 100% noise]

Scatter plots between variables (noisy 100 % altitude)

[Figure: scatter plots of air temperature vs. noisy altitude for the train and test sets]

ANNEX mapping results with noisy altitude

Air temperature (°C)



Noise results
Model                                            Correlation   RMSE   MAE    MRE
Kriging                                          0.874         3.13   2.04   -0.06
Kriging with external drift                      0.984         1.19   0.91   -0.03
3-7-1                                            0.991         0.85   0.66   -0.004
3-8-1                                            0.991         0.84   0.68   -0.001
3-8-1 (100% noise)                               0.839         3.54   2.37   -0.13
3-7-1 (10% noise), Test 1                        0.939         2.32   1.49   -0.003
Kriging with external drift (10% noise), Test 1  0.941         2.23   1.54   -0.06
3-7-1 (10% noise), Test 2                        0.899         2.81   1.52   -0.08
Kriging with external drift (10% noise), Test 2  0.903         2.81   1.59   -0.103

MLP: real case study


Wind fields in Switzerland


Modeling of wind fields with MLP using regularization technique


(pp 168-172 of the book)

Monitoring network: 111 stations in Switzerland (80 for training + 31 for validation)
Mapping of daily:
- Mean speed
- Maximum gust
- Average direction


Modeling of wind fields with MLP and regularization technique


Monitoring network: 111 stations in Switzerland (80 for training + 31 for validation)
Mapping of daily: mean speed, maximum gust, average direction
Input information:
- X, Y geographical coordinates
- DEM (resolution 500 m)
- 23 DEM-based geo-features
- Total: 26 features

Model: MLP 26-20-20-3



Training of the MLP
Model: MLP 26-20-20-3
Training: random initialization, 500 iterations of the RProp algorithm


Results: naïve approach


Results: noise injection regularization


Results: summary
Noise injection regularization

Without regularization (overfitting)


Conclusion
MLP is a universal nonlinear tool for learning from data and modeling data. It is an excellent exploratory tool. Its application demands deep expert knowledge and experience.

