
Multilayer Perceptron

1
Practically, a single hidden layer allows only global interaction among its neurons, so improving the approximation in one part of the input space often comes at the expense of another… so… 2 hidden layers?

Cross Validation

• Available samples are randomly partitioned into a training set and a test set
• The training set is further partitioned into an estimation subset and a validation subset

Used to decide when to stop training and to tune hyper-parameters


2
Model Selection

Candidate models share a similar core architecture and differ in their free parameters.

Generalization error: the probability of misclassification taken over all inputs in the universe, estimated given the available training samples.

3
Choose just a large enough number of free parameters for proper fitting, while keeping model complexity low enough to avoid overfitting.

[Figure: the training-set examples are partitioned into an estimation set and a validation set.]

4
Use of cross-validation results in this choice: select the model with the lowest classification error when tested on the validation set.

80% (estimation) – 20% (validation) is the usual rule of thumb.


Which variant?
• Exhaustive cross-validation
• Non-exhaustive cross-validation (e.g., K-fold)
5
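A minimal NumPy sketch of the 80/20 estimation–validation split; the data arrays and sizes here are illustrative, not from the slides:

```python
import numpy as np

def estimation_validation_split(X, y, val_fraction=0.2, seed=0):
    """Randomly split the training set into an estimation subset
    and a validation subset (default 80% / 20%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random partition of the samples
    n_val = int(len(X) * val_fraction)     # size of the validation subset
    val_idx, est_idx = idx[:n_val], idx[n_val:]
    return X[est_idx], y[est_idx], X[val_idx], y[val_idx]

# Illustrative data: 100 samples with 5 features each
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)
X_est, y_est, X_val, y_val = estimation_validation_split(X, y)
print(X_est.shape, X_val.shape)            # (80, 5) (20, 5)
```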
Early Stopping of Training

A couple of points (empirical observations):
• As training progresses, the network learns mapping functions of increasing complexity, from fairly simple ones onward.
• The training error starts high, falls rapidly, and then slowly approaches a (local) minimum.

6
Training (on the estimation data) can be paused at regular intervals (epochs) and the network tested on the validation data; training is stopped early once the validation error stops improving.

In practice the validation-error curve is usually not as smooth as the idealized picture and contains multiple local minima.

7
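A hedged sketch of the early-stopping schedule just described; the `train_one_epoch` and `validation_error` callables are assumed stand-ins for whatever MLP training and evaluation routines are actually used:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, check_every=5, patience=4):
    """Train on the estimation data, probe the validation error every
    `check_every` epochs, and stop once it has failed to improve
    `patience` checks in a row."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    bad_checks = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)                     # one pass over the estimation data
        if epoch % check_every == 0:
            err = validation_error(model)          # error on the validation subset
            if err < best_err:
                best_err, best_model, bad_checks = err, copy.deepcopy(model), 0
            else:
                bad_checks += 1                    # validation error did not improve
                if bad_checks >= patience:         # early-stopping point
                    break
    return best_model, best_err
```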
Variants of Cross-Validation

Hold-out validation (described earlier): the samples are randomly partitioned once into estimation and validation sets.

Multifold (K-fold) cross-validation: the samples are split into K folds (here K = 4, with the blue block marking the validation fold); each fold serves once as the validation set, and the validation error is averaged over the K trials.

Leave-one-out cross-validation is used when samples are scarce.

8
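A minimal sketch of K-fold cross-validation with K = 4 as in the figure; `fit` and `error` are assumed callables for training and scoring a candidate model:

```python
import numpy as np

def k_fold_validation_error(X, y, fit, error, K=4, seed=0):
    """Randomly partition the samples into K folds; each fold plays the
    validation set once while the other K-1 folds form the estimation
    set. Returns the validation error averaged over the K trials."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val_idx = folds[k]                                     # fold k = validation
        est_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[est_idx], y[est_idx])                    # train on estimation folds
        errors.append(error(model, X[val_idx], y[val_idx]))    # test on held-out fold
    return float(np.mean(errors))                              # average over the K trials
```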
Complexity Regularization and Pruning
The cost is augmented with a penalty term expressed only in terms of the weights (weight decay). Weights that contribute little to reducing the error are driven toward zero; eliminating them reduces complexity.

9
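A sketch, under the weight-decay form above, of the regularized cost and the corresponding gradient step; `E_av` and `grad_E_av` are assumed stand-ins for the average error term and its gradient:

```python
import numpy as np

def regularized_cost(w, E_av, lam=1e-3):
    """Complexity-regularized cost: standard error term plus a penalty
    expressed only in terms of the weights (weight decay)."""
    return E_av(w) + lam * np.sum(w ** 2)

def regularized_gradient_step(w, grad_E_av, eta=0.1, lam=1e-3):
    """Gradient step on the regularized cost: the extra 2*lam*w term
    continually shrinks the weights, so weights not needed to reduce
    the error decay toward zero and can later be eliminated."""
    return w - eta * (grad_E_av(w) + 2.0 * lam * w)
```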
Hessian-based network pruning

Expand the average error in a Taylor series around the trained weights, using the gradient and the Hessian with respect to $\mathbf{w}$, to capture the local effect of a weight perturbation.

Which parameters, when pruned out, will cost the least increase in the average error?

Assumption: the error surface is nearly quadratic.

10
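For reference, the second-order Taylor expansion underlying Hessian-based pruning can be written out as follows (a standard form; g and H denote the gradient and Hessian of the average error at the trained weights):

```latex
\Delta E_{\mathrm{av}} \;\approx\; \mathbf{g}^{\mathsf{T}}\,\Delta\mathbf{w}
  \;+\; \tfrac{1}{2}\,\Delta\mathbf{w}^{\mathsf{T}}\mathbf{H}\,\Delta\mathbf{w},
\qquad
\mathbf{g} = \frac{\partial E_{\mathrm{av}}}{\partial \mathbf{w}}, \quad
\mathbf{H} = \frac{\partial^{2} E_{\mathrm{av}}}{\partial \mathbf{w}\,\partial \mathbf{w}^{\mathsf{T}}}
```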
The first-order (gradient) term is ignored, assuming training has converged to a point where the gradient is essentially flat.

Optimal Brain Surgeon (OBS)

Set one of the weights to zero with the least increase in the average error. Forcing weight $w_i$ to zero through a perturbation $\Delta\mathbf{w}$ is expressed by the constraint $\mathbf{1}_i^{\mathsf{T}}\Delta\mathbf{w} + w_i = 0$.
11
Lagrangian: $S = \tfrac{1}{2}\,\Delta\mathbf{w}^{\mathsf{T}}\mathbf{H}\,\Delta\mathbf{w} - \lambda\,(\mathbf{1}_i^{\mathsf{T}}\Delta\mathbf{w} + w_i)$.

Setting the derivative with respect to $\Delta\mathbf{w}$ to zero gives $\Delta\mathbf{w} = \lambda\,\mathbf{H}^{-1}\mathbf{1}_i$; substituting this into the constraint yields $\lambda = -\dfrac{w_i}{[\mathbf{H}^{-1}]_{i,i}}$.

12
Substituting back, the optimal change of the full weight vector with respect to weight $i$ is
$\Delta\mathbf{w} = -\dfrac{w_i}{[\mathbf{H}^{-1}]_{i,i}}\,\mathbf{H}^{-1}\mathbf{1}_i$,
leading to the optimum value of the Lagrangian, the saliency of weight $i$ (its increase in error):
$S_i = \dfrac{w_i^2}{2\,[\mathbf{H}^{-1}]_{i,i}}$.
Small weights and large $i$-th diagonal entries of the inverse Hessian give small saliency.

13
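A hedged NumPy sketch of one OBS pruning step built directly on the formulas above; how the inverse Hessian is obtained is left outside the sketch:

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """One Optimal Brain Surgeon step: find the weight with the smallest
    saliency S_i = w_i^2 / (2 [H^-1]_{i,i}), set it to zero, and adjust
    all remaining weights by dw = -(w_i / [H^-1]_{i,i}) * H^-1 1_i."""
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)          # predicted increase in average error
    i = int(np.argmin(saliency))              # weight whose removal costs least
    dw = -(w[i] / diag[i]) * H_inv[:, i]      # optimal change of the whole weight vector
    w_new = w + dw
    w_new[i] = 0.0                            # enforce the pruning constraint exactly
    return w_new, i, float(saliency[i])
```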
Some other aspects

• Connectionist approach – the locality constraint (locally connected neurons) allows parallelization
• Replicator mapping: encoder – decoder

14
• Function approximation: the MLP is a universal approximator
• The computational complexity of the backpropagation algorithm is O(W), where W is the number of weights
• Neural networks are locally robust (when initialized near the global minimum): an H∞-optimal filter
• The presence of local minima, and scaling with respect to problem size and complexity, are major issues

15
Supervised Learning is Numerical Optimization

Ensemble averaging of the error over the training samples (via an ergodicity assumption) leads to batch learning.

16
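Written out in a standard form (assuming N training examples and output neurons indexed by the set C), the batch cost is the error averaged over the whole training sample:

```latex
E_{\mathrm{av}}(\mathbf{w}) \;=\; \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_{j}^{2}(n)
```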
Steepest descent

Based on a linear (first-order) approximation of the cost function in a local area around w(n). Momentum is a crude attempt at second-order behaviour.

Quadratic approximation (Newton's method)

Problems:
• Computing the inverse Hessian is expensive
• The Hessian may be singular (requiring a pseudo-inverse) or rank deficient (ill-conditioned)
• For a non-quadratic cost, there is no guarantee of convergence
17
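A minimal sketch contrasting the two update rules; the gradient and Hessian callables and the learning rate are illustrative assumptions:

```python
import numpy as np

def steepest_descent_step(w, grad, eta=0.1):
    """First-order update: move against the local gradient."""
    return w - eta * grad(w)

def newton_step(w, grad, hessian):
    """Second-order update: solve H * dw = -g rather than forming H^-1
    explicitly; misbehaves when H is singular or ill-conditioned."""
    g = grad(w)
    H = hessian(w)
    dw = np.linalg.solve(H, -g)   # raises LinAlgError if H is singular
    return w + dw
```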
If the cost is a sum of squares, the Gauss–Newton method applies; otherwise, quasi-Newton methods are used.

The conjugate-gradient method solves a set of linear equations $\mathbf{A}\mathbf{x} = \mathbf{b}$, equivalently minimizing the quadratic form $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^{\mathsf{T}}\mathbf{A}\mathbf{x} - \mathbf{b}^{\mathsf{T}}\mathbf{x}$, and it does so without explicitly calculating $\mathbf{A}$.

The mathematics is covered in the numerical-optimization literature; here we go straight to the algorithm…

18
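A minimal sketch of the linear conjugate-gradient method for minimizing f(x) = ½xᵀAx − bᵀx (equivalently solving Ax = b); the `matvec` callable supplies products A·p, so A itself never has to be formed or inverted explicitly:

```python
import numpy as np

def conjugate_gradient(matvec, b, x0=None, eps=1e-6, max_iter=None):
    """Minimize f(x) = 0.5*x^T A x - b^T x using only matrix-vector
    products A @ p, never A itself or its inverse."""
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - matvec(x)                 # residual = negative gradient of f
    p = r.copy()                      # first search direction
    r0_norm = np.linalg.norm(r)
    for _ in range(max_iter or len(b)):
        Ap = matvec(p)
        alpha = (r @ r) / (p @ Ap)    # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) <= eps * r0_norm:   # stopping criterion
            r = r_new
            break
        beta = (r_new @ r_new) / (r @ r)             # direction-update coefficient
        p = r_new + beta * p          # new direction, A-conjugate to the old ones
        r = r_new
    return x

# Illustrative use: A must be symmetric positive definite
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(x)   # approx. [0.0909, 0.6364]
```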
After the algorithm has proceeded for a while, the gradient falls off.

Suggested stopping criterion: terminate when the residual norm has dropped to a small fraction of its initial value, e.g. $\lVert \mathbf{r}(n) \rVert \le \varepsilon\,\lVert \mathbf{r}(0) \rVert$.
19
