Contents
Introduction to artificial neural networks
Multilayer perceptron
Case studies
Prof. M. Kanevski
Basics of ANN
Artificial neural networks are analytical systems that address problems whose solutions have not been explicitly formulated. In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions, although they may be extremely complex, have been made explicit.
We can program or train neural networks to store, recognise, and associatively retrieve patterns; to filter noise from measurement data; to control ill-defined problems; in summary: to estimate sampled functions when we do not know the form of the functions.
Unlike statistical estimators, they estimate a function without a mathematical model of how outputs depend on inputs. Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.
The major applications of ANN:
- Feature recognition (pattern classification)
- Speech recognition
- Signal processing
- Time-series prediction
- Function approximation and regression, classification
- Data mining
- Intelligent control
- Associative memories
- Optimisation
- And many others
Typical neuron activation functions are the logistic sigmoid and the hyperbolic tangent:

$$f(x) = \frac{1}{1 + \exp(-x)}$$

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
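As a minimal illustration (our own sketch, not from the slides), here are these two activations in NumPy:

```python
import numpy as np

def sigmoid(x):
    # Logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes any real input into (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(sigmoid(0.0), tanh(0.0))  # 0.5 0.0
```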
The main parts of an ANN:
- Neurones (nodes, cells, units, processing elements)
- Network topology (connections between neurones)
In general, artificial neural networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and the pattern of connections define the network topology.
Multilayer perceptron
[Figure: supervised learning scheme, comparing the teacher's target with the network's response.]
Basics of ANN
All knowledge of an ANN is based on the weights of the connections between its neurones.
MLP modelling
Backpropagation training
For training pattern $l$, the network error is the mean squared difference between the targets $T_{lj}$ and the network outputs $Z^{out}_{lj}$:

$$E^{out}_{l}(W) = \frac{1}{n} \sum_{j=0}^{n-1} \left[ T_{lj} - Z^{out}_{lj}(W) \right]^2$$
We want to know how to modify the weights in order to decrease the error function.
$$w^{m}_{ij}(n+1) = w^{m}_{ij}(n) + \eta\, \delta^{m}_{i}\, Z^{(m-1)}_{j}$$

where $n$ is the iteration step, $\eta$ is the rate of learning ($0 < \eta \le 1$), $\delta^{m}_{i}$ is the local error of neuron $i$ in layer $m$, and $Z^{(m-1)}_{j}$ is the output of neuron $j$ in layer $m-1$.
For the output layer the local errors are

$$\delta^{out}_{i} = Z^{out}_{i}\,(1 - Z^{out}_{i})\,(T_{i} - Z^{out}_{i})$$

and for a hidden layer $h$ they are backpropagated from the layer above:

$$\delta^{h}_{j} = Z^{h}_{j}\,(1 - Z^{h}_{j}) \sum_{i} w^{(h+1)}_{ij}\, \delta^{(h+1)}_{i}$$

(the factor $Z(1-Z)$ is the derivative of the logistic activation).
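To make the delta and update rules concrete, here is a hedged NumPy sketch of one backpropagation step for a single pattern in a one-hidden-layer network; all sizes, names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 3, 1
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output weights
eta = 0.1                                        # learning rate

x = np.array([0.5, -0.2])                        # input pattern Z^(0)
T = np.array([1.0])                              # target output

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Forward pass
Z_h = sigmoid(W1 @ x)        # hidden activations Z^h
Z_out = sigmoid(W2 @ Z_h)    # output activations Z^out

# Backward pass: local errors (deltas) from the formulas above
delta_out = Z_out * (1 - Z_out) * (T - Z_out)
delta_h = Z_h * (1 - Z_h) * (W2.T @ delta_out)

# Weight updates: w_ij <- w_ij + eta * delta_i * Z_j
W2 += eta * np.outer(delta_out, Z_h)
W1 += eta * np.outer(delta_h, x)
```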
$$E_l(W) = \sum_{j=0}^{n-1} \left\{ T_{lj} - Z^{out}_{lj}(W) \right\}^2$$
Gradient descent
[Figure: error surface J(w) as a function of a weight w; gradient descent steps move toward the minimum.]
In reality, the situation with the error function and the corresponding optimization problem is much more complicated: there can be multiple local minima!
[Figure: an error surface with several local minima in addition to the global minimum.]
[Figure: simulated annealing (SA) illustration.]
In computational practice, we do not want our network to be caught in a local minimum with high training error, because this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters of the network, as in the sketch below.
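A hedged sketch of this reinitialize-and-retrain heuristic; `train_mlp` is a hypothetical function that trains one network from a fresh random initialization and returns its weights and training error:

```python
import numpy as np

def train_with_restarts(train_mlp, n_restarts=5, error_threshold=0.05):
    # Keep the best of several randomly initialized training runs.
    best_weights, best_error = None, np.inf
    for seed in range(n_restarts):
        weights, error = train_mlp(seed)       # fresh random initialization
        if error < best_error:
            best_weights, best_error = weights, error
        if best_error < error_threshold:       # low enough: stop restarting
            break
    return best_weights, best_error
```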
In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge to the global minimum to achieve acceptable performance.
In short
The presence of multiple minima does not necessarily present difficulties in training networks, and a few simple heuristics, such as the restart strategy sketched above, can often overcome such problems.
In the limit of an infinite data set, the sum-of-squares error becomes an integral over the joint density of inputs and targets:

$$E = \lim_{N \to \infty} \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} \left\{ y_k(x^n; w) - t_k^n \right\}^2 = \frac{1}{2} \sum_{k} \iint \left\{ y_k(x; w) - t_k \right\}^2 p(t_k, x)\, dt_k\, dx$$
At the minimum $w^*$ of this error, the network output approximates the conditional average of the target data:

$$y_k(x; w^*) = \langle t_k \mid x \rangle$$
DEMO
Data Preprocessing
Machine learning algorithms are data-driven methods. The quality and quantity of the data are essential for training and generalization.
Input data
Types of pre-processing:
1. Linear and nonlinear transformations, e.g. input scaling/normalisation, Z-score transform, square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporation of prior knowledge: invariants, hints
4. Feature extraction: linear/nonlinear combinations of input variables
5. Feature selection: deciding which features to use
Dimensionality reduction
Two approaches are available to perform dimensionality reduction:
- Feature extraction: creating a set of new features by combining the existing features
- Feature selection: choosing a subset of all the features (the most informative ones)
Feature selection
Reducing the feature space by discarding some of the features (covariates). This is also called variable selection; a minimal sketch follows.
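One simple variable-selection strategy (our own illustration, not from the slides) is to rank the features by absolute correlation with the target and keep the top k:

```python
import numpy as np

def select_by_correlation(X, y, k):
    # X: (n_samples, n_features), y: (n_samples,)
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
    keep = np.argsort(corr)[::-1][:k]    # indices of the k strongest features
    return X[:, keep], keep
```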
Dimensionality Reduction
We clearly lose some information, but this can be helpful because of the curse of dimensionality. We need some way of deciding which dimensions to keep:
1. Random choice
2. Principal component analysis (PCA)
3. Independent component analysis (ICA)
4. Self-organised maps (SOM)
A PCA-based sketch is given below.
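A minimal PCA sketch via the eigendecomposition of the covariance matrix, keeping the k leading components (data sizes are illustrative):

```python
import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k leading directions
    return Xc @ top                         # projected data, shape (n, k)

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca_reduce(X, k=2).shape)  # (100, 2)
```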
Data transform
Linear: Y = aZ + b
Logarithmic: Y = Log(Z)
Indicator: Y = Ind(Z, Zs)
Normalisation (Z-score): Y = (Z − Zm)/σ
Box-Cox:

$$Y(\lambda) = \frac{Z^{\lambda} - 1}{\lambda}, \quad \lambda > 0$$

$$Y(\lambda = 0) = \ln(Z)$$
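A minimal sketch of the Z-score and Box-Cox transforms as defined above (test values are ours):

```python
import numpy as np

def zscore(z):
    # Subtract the mean Zm and divide by the standard deviation
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    # Box-Cox family: (Z^lambda - 1)/lambda, with ln(Z) in the limit lambda = 0
    if lam == 0:
        return np.log(z)
    return (z ** lam - 1.0) / lam

z = np.array([1.0, 2.0, 5.0, 10.0])
print(zscore(z))
print(box_cox(z, 0.5))
```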
Occam's razor: among models that describe the data equally well, prefer the simplest one.
Model Selection:
Estimating the performance of different models in order to choose the (approximately) best one.
Model Assessment:
Having chosen a final model, estimating its prediction error (generalization error) on new data
If we are in a data-rich situation, the best solution is to split the raw data randomly (?) into three sets:
- Train: 50%
- Validation: 25%
- Test: 25%
Interpretation
- The training set is used to fit the models.
- The validation set is used to estimate prediction error for model selection (tuning hyperparameters).
- The test set is used for assessment of the generalization error of the final chosen model (see the sketch below).
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer, 2001.
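A minimal sketch of such a 50/25/25 random split, assuming X (inputs) and y (targets) are NumPy arrays of equal length:

```python
import numpy as np

def split_data(X, y, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = len(X) // 2, len(X) // 4
    train = idx[:n_train]                  # 50% for fitting the models
    val = idx[n_train:n_train + n_val]     # 25% for model selection
    test = idx[n_train + n_val:]           # 25% for final assessment
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```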
[Figure: two fits to the same data; panel (b) illustrates overfitting.]
One of the most serious problems that arises in learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely but does not generalise well; that is, it cannot model sufficiently well unseen data from the same task.
Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error.
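One common way to strike this balance in practice is early stopping on a validation set. A hedged sketch; `train_one_epoch` and `validation_error` are hypothetical callbacks supplied by the training code:

```python
def train_early_stopping(train_one_epoch, validation_error,
                         max_epochs=1000, patience=20):
    # Stop when the validation error has not improved for `patience` epochs.
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_err
```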
Bias-Variance Dilemma
Assume that $Y = f(X) + \varepsilon$, where $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_{\varepsilon}^{2}$.
We can derive an expression for the expected prediction error of a regression at an input point X=x0 using squared-error loss:
$$\mathrm{Err}(x_0) = E\left[ (Y - \hat{f}(x_0))^2 \mid X = x_0 \right] = \sigma_{\varepsilon}^{2} + \left[ E\hat{f}(x_0) - f(x_0) \right]^2 + E\left[ \hat{f}(x_0) - E\hat{f}(x_0) \right]^2$$

$$= \sigma_{\varepsilon}^{2} + \mathrm{Bias}^2\!\left(\hat{f}(x_0)\right) + \mathrm{Var}\!\left(\hat{f}(x_0)\right) = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
The first term is the variance of the target around its true mean $f(x_0)$; it cannot be avoided no matter how well we estimate $f(x_0)$, unless $\sigma_{\varepsilon}^{2} = 0$. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.
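The decomposition can be checked numerically. In this small simulation sketch (the true function, noise level, and polynomial degree are our own illustrative choices), many models are fit to independent noisy samples, and bias² and variance of the prediction at a point x0 are estimated:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                     # true regression function (illustrative)
sigma = 0.3                    # noise standard deviation
x0, degree, n_fits = 1.0, 3, 500

preds = []
for _ in range(n_fits):
    x = rng.uniform(0.0, 2.0 * np.pi, 30)
    y = f(x) + rng.normal(scale=sigma, size=30)   # Y = f(X) + eps
    coeffs = np.polyfit(x, y, degree)             # fit one model
    preds.append(np.polyval(coeffs, x0))          # its prediction at x0

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(f"irreducible: {sigma**2:.4f}  bias^2: {bias2:.4f}  variance: {variance:.4f}")
```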
A neural network is only as good as its training data! Poor training data inevitably lead to an unreliable and unpredictable network. Exploratory data analysis and data preprocessing are therefore extremely important!
MLP modeling
[Figure: original data vs. MLP (5) prediction. Train RMSE = 1.97, Ro = 0.69]
[Figure: original data vs. MLP (10) prediction. Train RMSE = 1.61, Ro = 0.80]
[Figure: original data vs. MLP (5-5) prediction. Train RMSE = 1.67, Ro = 0.79]
[Figure: original data vs. MLP (10-10) prediction. Train RMSE = 1.10, Ro = 0.92]
[Figure: original data vs. MLP (15-15) prediction. Train RMSE = 0.83, Ro = 0.95]
[Figure: original data vs. MLP (20-20) prediction. Train RMSE = 0.55, Ro = 0.98]
[Figure: training statistics chart of RMSE and Ro vs. MLP structure.]
Training statistics

MLP structure:  5     10    5-5   10-10  15-15  20-20
RMSE:           1.97  1.61  1.67  1.10   0.83   0.55
Ro:             0.69  0.80  0.79  0.92   0.95   0.98
[Figure: chart of RMSE and Ro vs. MLP structure (5, 10, 5-5, 10-10, 15-15, 20-20).]
[Table, partially recovered: Ro values 0.68, 0.80, 0.79, 0.89, 0.89, 0.88; RMSE values 1.24 and 1.39 also visible.]
ANNEX model: Artificial Neural Networks with EXternal drift for environmental data mapping
ANNEX is similar to the Kriging with External Drift model: if there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.
Inputs: x, y, + f_ext(x, y)
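A minimal sketch of assembling the ANNEX input matrix: coordinates plus the external drift evaluated at the same points. `f_ext` here is a hypothetical stand-in for, e.g., altitude from a DEM:

```python
import numpy as np

def annex_inputs(x, y, f_ext):
    # Stack coordinates and the external drift into an (n, 3) input matrix
    return np.column_stack([x, y, f_ext(x, y)])

x = np.array([0.1, 0.5, 0.9])
y = np.array([0.2, 0.4, 0.8])
f_ext = lambda x, y: 100.0 * (x + y)    # hypothetical drift surface
print(annex_inputs(x, y, f_ext).shape)  # (3, 3)
```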
$$E\{F(x, y)\} = m(x, y) = \beta_0 + \beta_1 f_{ext}(x, y) \qquad (1)$$

where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x, y) being estimated. In general, kriging with an external drift is a simple and efficient algorithm for incorporating a secondary variable into the estimation of the primary variable.
ANNEX model
What relationship should there be between the primary and the external information in the case of ANNEX?
What does "externally related" mean, and how can it be measured: by the correlation between the variables?
Datasets
GIS digital elevation model (DEM)
[Figure: correlation between air temperature and altitude.]
Scatter plots
[Figure: scatter plots for Kriging, Cokriging, Drift Kriging, and ANNEX.]
Mapping results
[Figure: prediction maps for Kriging, Cokriging, Drift Kriging, and ANNEX.]
[Figure: before and after comparison.]
[Figure: train vs. test comparison.]
Noise results

Model                                       Correlation  RMSE  MAE   MRE
Kriging                                     0.874        3.13  2.04  -0.06
Kriging external drift                      0.984        1.19  0.91  -0.03
3-7-1                                       0.991        0.85  0.66  -0.004
3-8-1                                       0.991        0.84  0.68  -0.001
3-8-1 (100% noise)                          0.839        3.54  2.37  -0.13
3-7-1 (10% noise), Test 1                   0.939        2.32  1.49  -0.003
Kriging external drift (10% noise), Test 1  0.941        2.23  1.54  -0.06
3-7-1 (10% noise), Test 2                   0.899        2.81  1.52  -0.08
Kriging external drift (10% noise), Test 2  0.903        2.81  1.59  -0.103
Monitoring network: 111 stations in Switzerland (80 for training + 31 for validation).
Mapping of daily:
- Mean speed
- Maximum gust
- Average direction
Training of the MLP. Model: MLP 26-20-20-3. Training: random initialization, 500 iterations of the RPROP algorithm (see the sketch below).
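A hedged sketch of this 26-20-20-3 architecture using scikit-learn (not the original software): 26 inputs, two hidden layers of 20 units, 3 outputs. scikit-learn does not implement RPROP, so 'adam' is used here as a stand-in optimizer:

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(20, 20),  # two hidden layers of 20
                   activation="logistic",
                   solver="adam",                # stand-in; RPROP unavailable
                   max_iter=500,
                   random_state=0)
# mlp.fit(X_train, Y_train)  # X_train: (n, 26), Y_train: (n, 3)
```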
Results: summary
Noise injection regularization
Conclusion
MLP is a universal nonlinear tool for learning from data and modeling data. It is an excellent exploratory tool. Its application demands deep expert knowledge and experience.