
Machine Learning Algorithms: Theory, Applications and Software Tools

Lecture 2 Basics of ANN: MLP


Prof. Mikhail Kanevski Institute of Geomatics and Analysis of Risk, University of Lausanne Mikhail.Kanevski@unil.ch

Contents
- Introduction to artificial neural networks
- Multilayer perceptron
- Case studies


Basics of ANN
Artificial neural networks are analytical systems that address problems whose solutions have not been explicitly formulated. In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions, although they may be extremely complex, have been made explicit.


Basics of ANN
We can program or train neural networks to store, recognise, and associatively retrieve patterns; to filter noise from measurement data; to control ill-defined problems;

in summary:
to estimate sampled functions when we do not know the form of the functions.


Basics of ANN
Unlike statistical estimators, they estimate a function without a mathematical model of how outputs depend on inputs. Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.


Basics of ANN
The major applications of ANN:
- Feature recognition (pattern classification)
- Speech recognition
- Signal processing
- Time-series prediction
- Function approximation and regression, classification
- Data mining
- Intelligent control
- Associative memories
- Optimisation
- And many others


Basics of ANN. Simple biological neuron


Basics of ANN Simple model of the neuron


Examples of transfer functions.

f(x) = 1 / (1 + exp(-x))

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
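These two functions are easy to evaluate numerically; a minimal sketch (NumPy is an assumption, not part of the lecture material):

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) transfer function: maps any real input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent written explicitly; equivalent to np.tanh(x), range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(logistic(x))
print(tanh(x))
```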

Basics of ANN
The main parts of an ANN:
- Neurones (nodes, cells, units, processing elements)
- Network topology (connections between neurones)


Basics of ANN
In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and the connections between them form the network topology.


Multilayer perceptron


Basics of ANN. ANN learning/training


Supervised learning is the most common type of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O(Input(i)), is compared with the desired Output(i). After presenting the entire training subset, the weights are updated. This updating is done in such a way that a measure of the error between the network's outputs and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used.
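As an illustration of this epoch-based scheme, here is a minimal sketch for a single linear neuron trained by batch updates on squared error; the synthetic data, epoch size and learning rate are illustrative assumptions, not the lecture's example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                            # training inputs Input(i)
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # desired outputs Output(i)

w = rng.normal(scale=0.1, size=2)    # small random initial weights
b = 0.0
lr, epoch_size, n_epochs = 0.1, 50, 100

for epoch in range(n_epochs):
    idx = rng.choice(len(X), size=epoch_size, replace=False)  # subset of the training set
    Xs, ts = X[idx], t[idx]
    y = Xs @ w + b                    # network outputs O(Input(i))
    err = y - ts                      # compare with the desired outputs
    # one weight update per pass over the subset (= one epoch)
    w -= lr * Xs.T @ err / epoch_size
    b -= lr * err.mean()

print("learned weights:", w, "bias:", b)
```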


Basics of ANN. ANN supervised learning.


[Diagram: examples are presented both to the neural network and to a teacher; the evaluation of the network's response against the teacher's desired response drives the learning algorithm, which makes modifications to the network]


Basics of ANN Feedforward ANN.


If there are no feedback or lateral connections, we have a feedforward ANN. The most frequently used model is the so-called multilayer perceptron. The term feedforward means that information flows in only one direction, from the input to the output.


ANN Multi-layer Perceptron (MLP)


- Depends only on the data and its inner structure
- Is able to learn from data and generalise
- Good at modelling nonlinearities
- Robust to noise and outliers
[ANN = artificial neurons + connection weights]

Basics of ANN
All knowledge of an ANN is based on the synaptic weights between its units.


The Universality Property


A two-layer feed-forward neural network with step activation functions can implement any Boolean function, provided that the number of hidden neurons H is sufficiently large.


MLP modelling

F1(t, w) = w1^out f(w1 t + b1) + b^out

F2(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + b^out

F3(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + w3^out f(w3 t + b3) + b^out
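F1, F2 and F3 are one-hidden-layer MLPs with 1, 2 and 3 hidden neurons respectively. A generic sketch for H hidden units with a logistic transfer function (all parameter values are illustrative assumptions):

```python
import numpy as np

def mlp_1d(t, w_hidden, b_hidden, w_out, b_out):
    # F_H(t, w) = sum_h w_out[h] * f(w_hidden[h] * t + b_hidden[h]) + b_out
    f = lambda a: 1.0 / (1.0 + np.exp(-a))        # logistic transfer function
    hidden = f(np.outer(t, w_hidden) + b_hidden)  # shape (n_points, H)
    return hidden @ w_out + b_out

t = np.linspace(0, 1, 5)
# H = 3 hidden neurons, arbitrary illustrative weights
print(mlp_1d(t, w_hidden=np.array([4.0, -2.0, 1.0]),
             b_hidden=np.array([0.0, 1.0, -0.5]),
             w_out=np.array([1.5, -0.7, 0.3]),
             b_out=0.2))
```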

Backpropagation training


The error function depends on the network's weights (W):

E_l(W) = (1/n) Σ_{j=0..n-1} [ T_lj - Z_lj^out(W) ]²
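In code, this error is simply the mean squared difference between the targets T_lj and the network outputs Z_lj; a small sketch with placeholder values:

```python
import numpy as np

def mse_error(targets, outputs):
    # E_l(W) = (1/n) * sum_j (T_lj - Z_lj(W))**2
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.mean((targets - outputs) ** 2)

print(mse_error([1.0, 0.0, 1.0], [0.8, 0.1, 0.6]))  # 0.07 for this toy example
```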


MLP training algorithms


Optimisation algorithms used for MLP training:
- Stochastic: annealing, genetic algorithms
- Gradient-based: conjugate gradients (slow 1st-order gradient algorithm), Levenberg-Marquardt (fast 2nd-order gradient algorithm), BFGS quasi-Newton, steepest descent, RProp (resilient propagation), BackProp (back propagation)

Feedforward ANN: Multilayer perceptron. Backprop algorithm


The possibilities and capabilities of multilayer perceptrons stem from the non-linearities used within the nodes. An MLP can learn with a supervised learning rule: the backpropagation algorithm. The backward error propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons. The backpropagation algorithm is a supervised learning algorithm. It is an iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables.


Basics of ANN Backpropagation Algorithm


The backpropagation algorithm follows these steps:
1. Initialize the weights. Usually it is recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters].
2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network.
3. Calculate the actual output of the ANN.


Basics of ANN Backpropagation Algorithm


4. Calculate the error measure and update the weights. Use a recursive algorithm starting at the output neurons (nodes) and working back to the first hidden layer; it is this backward propagation of output errors that inspired the name of this training algorithm. Update the weights W as follows:


We want to know how to modify the weights in order to decrease the error function:

w_ij(t+1) = w_ij(t) - η ∂E(t)/∂w_ij(t)


Basics of ANN Backpropagation Algorithm

w_ij^m(n+1) = w_ij^m(n) + η δ_i^m Z_j^(m-1)

where n is the iteration step, η is the rate of learning (0 < η ≤ 1), Z_j^(m-1) is the output of the j-th neurone in layer (m-1), and the error δ_i for the output layer is defined by the equations on the next slide.


Basics of ANN Backpropagation Algorithm

δ_i^out = Z_i^out (1 - Z_i^out) (T_i - Z_i^out)

δ_i^h = Z_i^h (1 - Z_i^h) Σ_j w_ji^(h+1) δ_j^(h+1)
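A compact sketch of these two delta rules and the corresponding weight update for a network with one hidden layer of logistic units; the dimensions, data and learning rate are illustrative assumptions, not the lecture's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, target, W_h, b_h, W_o, b_o, lr=0.1):
    """One backpropagation update for a 1-hidden-layer MLP with logistic units."""
    # forward pass
    z_h = sigmoid(W_h @ x + b_h)      # hidden outputs Z^h
    z_o = sigmoid(W_o @ z_h + b_o)    # network outputs Z^out

    # output-layer deltas: delta_i = Z_i (1 - Z_i) (T_i - Z_i)
    delta_o = z_o * (1.0 - z_o) * (target - z_o)
    # hidden-layer deltas: delta_i = Z_i (1 - Z_i) * sum_j w_ji * delta_j
    delta_h = z_h * (1.0 - z_h) * (W_o.T @ delta_o)

    # weight updates: w_ij(n+1) = w_ij(n) + lr * delta_i * Z_j
    W_o += lr * np.outer(delta_o, z_h)
    b_o += lr * delta_o
    W_h += lr * np.outer(delta_h, x)
    b_h += lr * delta_h
    return z_o

# toy dimensions: 2 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3)
W_o, b_o = rng.normal(scale=0.1, size=(1, 3)), np.zeros(1)
backprop_step(np.array([0.2, -0.4]), np.array([1.0]), W_h, b_h, W_o, b_o)
```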


Basics of ANN Backpropagation Algorithm


Other error measures (such as maximum absolute error and median squared error) have even greater advantages in many situations. For example, median squared error is useful because unlike the mean the median is a robust statistic - its value is insensitive to occasional large errors in the training data. Unfortunately, practical techniques for implementing these more desirable error measures do not yet exist. Thus, most neural networks today are tied to mean squared error measurements.


Basics of ANN Backpropagation Algorithm


More general error functions can be written that take into account the importance of the samples presented to the network (weighting, declustering, economic criteria, etc.):

E_l(W) = Σ_{j=0..n-1} λ_lj [ T_lj - Z_lj^out(W) ]²

where λ_lj expresses the importance (weight) of sample lj.


Gradient descent

[Figure: error surface J(w) vs. w, showing the direction of the gradient ∇J(w) and the minimum]

Gradient descent

[Figure: gradient descent steps on the error surface J(w) towards the minimum]

In reality, the situation with the error function and the corresponding optimization problem is much more complicated: there are multiple local minima!


Gradient descent
Local minima


Simulated Annealing (SA): Illustration


How important are local minima?


(Duda et al. 2001)

In computational practice, we do not want our network to be caught in a local minimum with high training error, because this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net.


How important are local minima?


(Duda et al. 2001)

In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge toward the global minimum for acceptable performance.

In short
The presence of multiple minima does not necessarily present difficulties in training nets, and a few simple heuristics can often overcome such problems (see next slide)


Practical techniques for improving backpropagation


- Activation function (sigmoid, hyperbolic tangent, ...)
- Scaling inputs
- Training with noise (noise injection)
- Initializing weights (simulated annealing)
- Regularization (weight decay)
- Number of hidden layers
- Learning parameters (rates, momentum, ...)
- Cost function
- ...


Interpretation of network outputs


Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form

E = lim_{N→∞} (1/2N) Σ_{n=1..N} Σ_k { y_k(x^n; w) - t_k^n }²
  = (1/2) Σ_k ∫∫ { y_k(x; w) - t_k }² p(t_k, x) dt_k dx

Interpretation of network outputs


The network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x:

y_k(x; w*) = ⟨ t_k | x ⟩
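A small numerical check of this statement on synthetic data: for a fixed input, the constant output that minimises the sum-of-squares error is (approximately) the average of the targets observed at that input. The data-generating function and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.5
t = np.sin(2 * np.pi * x0) + 0.3 * rng.normal(size=2000)   # many targets observed at the same x0

# candidate constant outputs y; find the one minimising sum((y - t)^2)
candidates = np.linspace(-2, 2, 401)
sse = ((candidates[:, None] - t[None, :]) ** 2).sum(axis=1)
best = candidates[np.argmin(sse)]

print(best, t.mean())   # both approximate the conditional mean <t | x0>
```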


DEMO


MLP and number of layers


The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point. On the other hand, with two hidden layers, the approximation process becomes more manageable.

Two hidden layers! (Haykin)


1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second hidden layer. Specifically, a neuron in the second hidden layer combines the outputs of the neurons in the first hidden layer operating on a particular region of the input space, thereby learning the global features for that region while outputting zero elsewhere.

Data Preprocessing
Machine learning algorithms are data-driven methods. The quality and quantity of the data are essential for training and generalization.
Input data → Pre-processing → MLA → Post-processing → Results

Types of pre-processing:
1. Linear and nonlinear transformations: e.g. input scaling/normalisation, Z-score transform, square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporation of prior knowledge: invariants, hints, ...
4. Feature extraction: linear/nonlinear combinations of input variables
5. Feature selection: deciding which features to use

Dimensionality reduction
Two approaches are available to perform dimensionality reduction:
- Feature extraction: creating a set of new features by combinations of the existing features
- Feature selection: choosing a subset of all the features (the most informative ones)


Feature selection/extraction


Feature selection
Reducing the feature space by throwing out some of the features (covariates)
Also called variable selection

Motivating idea: try to find a simple, parsimonious model (Occam's razor!)


Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006



Dimensionality Reduction
Clearly we lose some information, but this can be helpful due to the curse of dimensionality. We need some way of deciding which dimensions to keep:
1. Random choice
2. Principal components analysis (PCA)
3. Independent components analysis (ICA)
4. Self-organised maps (SOM)
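As an illustration of such feature extraction, a minimal PCA sketch (scikit-learn is an assumption, and the random data stand in for real inputs):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # 500 samples, 10 original features
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=500)     # a nearly redundant feature

pca = PCA(n_components=4)                           # keep 4 linear combinations of the inputs
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)
```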

Data transform
Y = aZ + b
Y = log(Z)
Y = Ind(Z, Zs)

Normalisation (Z-score): Y = (Z - Zm) / σ, where Zm is the mean and σ the standard deviation

Box-Cox nonlinear transform:

Y(λ) = (Z^λ - 1) / λ,  for λ ≠ 0
Y(λ = 0) = ln(Z)
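The Z-score and Box-Cox transforms in code, as a sketch (the positive, skewed sample data are an assumption):

```python
import numpy as np

def zscore(z):
    # Y = (Z - Zm) / sigma, with Zm the mean and sigma the standard deviation
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    # Y(lambda) = (Z**lambda - 1) / lambda for lambda != 0, and ln(Z) for lambda = 0
    if lam == 0:
        return np.log(z)
    return (z ** lam - 1.0) / lam

z = np.random.default_rng(0).lognormal(size=1000)   # positive, skewed data
print(zscore(z)[:3])
print(box_cox(z, 0.5)[:3], box_cox(z, 0.0)[:3])
```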

Model Selection & Model Evaluation


Guillaume d'Occam (1285 - 1349)


Pluralitas non est ponenda sine necessitate

Occam's razor:

The simpler explanation of a phenomenon is more likely to be correct.



Model Assessment and Model Selection: Two separate goals


Model Selection:
Estimating the performance of different models in order to choose the (approximately) best one

Model Assessment:
Having chosen a final model, estimating its prediction error (generalization error) on new data

If we are in a data-rich situation, the best solution is to randomly (?) split the data:

Raw data → Train: 50% (train), Validation: 25% (test), Test: 25% (validation)


Interpretation
- The training set is used to fit the models.
- The validation set is used to estimate the prediction error for model selection (tuning hyperparameters).
- The test set is used for assessment of the generalization error of the final chosen model.
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
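A sketch of such a random split; the 50/25/25 proportions follow the slide, and the data size is a placeholder:

```python
import numpy as np

def split_indices(n, rng, frac_train=0.50, frac_val=0.25):
    # shuffle the sample indices once, then cut into train / validation / test
    idx = rng.permutation(n)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(42)
train_idx, val_idx, test_idx = split_indices(1000, rng)
print(len(train_idx), len(val_idx), len(test_idx))   # 500 250 250
```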

Bias and Variance. Model complexity


[Figure: example fits of different complexity, illustrating (b) overfitting and (c) underfitting]

One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely, but it does not generalise well; that is, it cannot model sufficiently well unseen data from the same task. Solution: balance the statistical bias and the statistical variance when doing neural network learning in order to achieve the smallest average generalization error.


Bias-Variance Dilemma
Assume that

Y = f(X) + ε,  where E(ε) = 0 and Var(ε) = σ_ε²



We can derive an expression for the expected prediction error of a regression at an input point X=x0 using squared-error loss:


Err(x0) = E[ (Y - f̂(x0))² | X = x0 ]
        = σ_ε² + [ E f̂(x0) - f(x0) ]² + E[ f̂(x0) - E f̂(x0) ]²
        = σ_ε² + Bias²( f̂(x0) ) + Var( f̂(x0) )
        = Irreducible Error + Bias² + Variance


The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σ_ε² = 0. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. The last term is the variance, the expected squared deviation of f̂(x0) around its mean.
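A small simulation sketch of this decomposition at a single point x0: repeatedly draw training sets, fit an estimator, and inspect the squared bias and the variance of its predictions. The polynomial-fit estimator, the true function and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # true function
sigma, x0, n, runs, degree = 0.3, 0.5, 30, 2000, 3

preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, n)
    y = f(x) + sigma * rng.normal(size=n)  # noisy training set
    coef = np.polyfit(x, y, degree)        # fit a degree-3 polynomial
    preds[r] = np.polyval(coef, x0)        # prediction at x0

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("irreducible:", sigma**2, "bias^2:", bias2, "variance:", variance)
print("expected error approx.:", sigma**2 + bias2 + variance)
```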


Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001



A neural network is only as good as the training data! Poor training data inevitably leads to an unreliable and unpredictable network. Exploratory Data Analysis and data preprocessing are extremely important!!!


MLP modelling. Case Studies.


Original (10 000 points) Training (900 points)


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.97, Ro = 0.69


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.61, Ro = 0.80


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.67, Ro = 0.79


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 1.10, Ro = 0.92


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 0.83, Ro = 0.95


MLP modeling
Original MLP prediction

Which result do you prefer?

Train: RMSE = 0.55, Ro = 0.98


MLP modeling
[Figure: training statistics (RMSE and Ro) as a function of MLP architecture: 5, 10, 5-5, 10-10, 15-15, 20-20]
Is the 20-20 model the best?



MLP modeling. Training statistics

MLP      RMSE   Ro
5        1.97   0.69
10       1.61   0.80
5-5      1.67   0.79
10-10    1.10   0.92
15-15    0.83   0.95
20-20    0.55   0.98


MLP modeling. Training & Validation statistics

[Figure: training and validation RMSE and Ro as a function of MLP architecture: 5, 10, 5-5, 10-10, 15-15, 20-20]



MLP modeling. Validation statistics

MLP      RMSE   Ro
5        2.01   0.68
10       1.66   0.80
5-5      1.70   0.79
10-10    1.25   0.89
15-15    1.24   0.89
20-20    1.39   0.88


ANNEX model: Artificial Neural Networks with EXternal drift for environmental data mapping


Traditional application of ANN to spatial predictions


Data are available at measurement points: F(xi, yi), i = 1, ..., N.
Problem: predict F(x, y) at the points without measurements, usually on a regular grid.
ANN solution: x, y are the 2 inputs and F is the output; select the ANN architecture, train it with the available data, and after training use it to predict.

ANNEX is similar to kriging with an external drift model: if there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.

Inputs: x, y, + f_ext(x, y)
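A sketch of the difference between the two settings using scikit-learn's MLPRegressor; the library, the synthetic drift f_ext and the coordinates are assumptions for illustration, not the case-study data below:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(400, 2))                       # measurement coordinates (x, y)
f_ext = 0.02 * xy[:, 0] + rng.normal(scale=0.5, size=400)     # external drift, e.g. DEM altitude
target = 20 - 0.5 * f_ext + rng.normal(scale=0.3, size=400)   # primary variable, e.g. temperature

# plain ANN: inputs are the coordinates only
ann = MLPRegressor(hidden_layer_sizes=(7,), max_iter=5000, random_state=0)
ann.fit(xy, target)

# ANNEX: the external information is simply an additional input
annex = MLPRegressor(hidden_layer_sizes=(7,), max_iter=5000, random_state=0)
annex.fit(np.column_stack([xy, f_ext]), target)

print(ann.score(xy, target), annex.score(np.column_stack([xy, f_ext]), target))
```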

Examples of external information


- Cheap information on a secondary variable
- Physical model of the phenomenon
- Remotely sensed images
- GIS data
- DEM data

Kriging with external drift


Kriging with external drift is the model in which the trend is limited to

E{F(x,y)} = m(x,y) = β0 + β1 f_ext(x,y)    (1)

where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated. In general, kriging with an external drift is a simple and efficient algorithm for incorporating a secondary variable in the estimation of the primary variable.

ANNEX model
What relationship between the primary and the external information should hold in the case of ANNEX?


ANNEX model
What does the related external information bring (how to measure it: correlation between variables?)
- Improved accuracy of prediction?
- Reduced uncertainty of prediction?

An important problem is related to the quality of the additional data: there is a dilemma between introducing new information and introducing new noise.

Case study: Kazakh Priaralie, monitoring network

1 400 000 km2 - 400 monitoring stations



Datasets
GIS DEM model

Average long-term air temperatures in June (°C)



Correlation
Air temperature vs. Altitude


Train and Test datasets


Train Test


ANN and ANNEX models


Model                         Correlation   RMSE   MAE    MRE
2-7-5-1                       0.917         2.57   1.96   -0.02
3-3-1                         0.989         0.96   0.73   -0.01
3-5-1                         0.99          0.9    0.7    -0.007
3-7-1                         0.991         0.85   0.66   -0.004
3-8-1                         0.991         0.84   0.68   -0.001
3-9-1                         0.991         0.88   0.69   -0.01
3-10-1                        0.99          0.92   0.74   -0.01
Kriging with external drift   0.984         1.19   0.91   -0.03


Scatter plots

[Figure: scatter plots for Kriging, Cokriging, Drift Kriging and ANNEX]

Mapping results
[Figure: prediction maps for Kriging, Cokriging, Drift Kriging and ANNEX]

Modelling noisy altitude effect (100 %)

[Figure: altitude model before and after adding 100% noise]

Scatter plots between variables (noisy 100 % altitude)

[Figure: scatter plots of air temperature vs. noisy altitude for the train and test sets]

ANNEX mapping results with noisy altitude

Air temperature (°C)



Noise results
Model                                            Correlation   RMSE   MAE    MRE
Kriging                                          0.874         3.13   2.04   -0.06
Kriging with external drift                      0.984         1.19   0.91   -0.03
3-7-1                                            0.991         0.85   0.66   -0.004
3-8-1                                            0.991         0.84   0.68   -0.001
3-8-1 (100% noise)                               0.839         3.54   2.37   -0.13
3-7-1 (10% noise), Test 1                        0.939         2.32   1.49   -0.003
Kriging with external drift (10% noise), Test 1  0.941         2.23   1.54   -0.06
3-7-1 (10% noise), Test 2                        0.899         2.81   1.52   -0.08
Kriging with external drift (10% noise), Test 2  0.903         2.81   1.59   -0.103

MLP: real case study


Wind fields in Switzerland


Modeling of wind fields with MLP using regularization technique


(pp 168-172 of the book)

Monitoring network: 111 stations in Switzerland (80 for training + 31 for validation)
Mapping of daily:
- Mean speed
- Maximum gust
- Average direction


Modeling of wind fields with MLP and regularization technique


Monitoring network: 111 stations in Switzerland (80 for training + 31 for validation)
Mapping of daily: mean speed, maximum gust, average direction
Input information:
- X, Y geographical coordinates
- DEM (resolution 500 m)
- 23 DEM-based geo-features
- Total: 26 features

Model: MLP 26-20-20-3



Training of the MLP
Model: MLP 26-20-20-3
Training: random initialization, 500 iterations of the RProp algorithm


Results: naïve approach


Results: noise injection regularization


Results: summary
Noise injection regularization

Without regularization (overfitting)


Conclusion
MLP is a universal nonlinear tool for learning from data and modeling data. It is an excellent exploratory tool. Its application demands deep expert knowledge and experience.

