
Multilayer Perceptron

Learning from hints
In the form of prior information (invariance, symmetry) about the input–output map to be learned.

Learning Rate

The last layers usually have larger (local) gradients than the earlier layers, so their learning rate is taken smaller.
η_opt = (∂²ε(n) / ∂ω²(n))⁻¹   — the inverse of the second derivative of the error: the Hessian!

η < η_opt: a few iterations, smoothly down the slope
η > η_opt: a few jumps towards the minimum
η >> η_opt: jumping out of the minimum
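A minimal numerical sketch of these regimes, assuming a one-dimensional quadratic cost ε(ω) = ½λ(ω − ω*)², for which the second derivative is λ and hence η_opt = 1/λ; the constants are illustrative only:

```python
# 1-D quadratic: eps(w) = 0.5 * lam * (w - w_star) ** 2, second derivative = lam
lam, w_star = 4.0, 0.0
eta_opt = 1.0 / lam                        # inverse of the second derivative

def descend(eta, w=1.0, steps=10):
    for _ in range(steps):
        w = w - eta * lam * (w - w_star)   # w <- w - eta * d(eps)/dw
    return w

for label, eta in [("<  opt", 0.5 * eta_opt), ("=  opt", eta_opt),
                   (">  opt", 1.5 * eta_opt), (">> opt", 2.5 * eta_opt)]:
    print(f"eta {label}: |w - w*| after 10 steps = {abs(descend(eta) - w_star):.2e}")
```

On this quadratic, rates below η_opt descend smoothly, rates between η_opt and 2η_opt overshoot but still converge, and rates above 2η_opt jump out of the minimum.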

Sometimes, the learning rate of a neuron is taken inversely proportional to the square root of its number of synaptic connections (a positive quantity).
Manufacturing Training Data (a sketch follows this list):
• Corrupting with noise (adding, multiplying, convolving)
• Varying orientation (2-D)
• Varying scale (size)
• Varying the average of an attribute (brightness)
• Adding high frequency (varying sharpness)
• Enveloping with a low-frequency variation
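A minimal numpy sketch of these manufacturing steps for a small 2-D patch; the noise levels, scale factor, and envelope shape are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Return a few manufactured variants of a 2-D patch (e.g., a small image)."""
    n = patch.shape[0]
    variants = []
    variants.append(patch + rng.normal(0.0, 0.05, patch.shape))    # additive noise
    variants.append(patch * rng.normal(1.0, 0.05, patch.shape))    # multiplicative noise
    variants.append(np.rot90(patch, k=int(rng.integers(1, 4))))    # vary orientation (2-D)
    variants.append(np.kron(patch, np.ones((2, 2))))               # vary scale (size)
    variants.append(patch + 0.1)                                   # shift the average (brightness)
    checker = 0.05 * (-1.0) ** np.add.outer(np.arange(n), np.arange(n))
    variants.append(patch + checker)                               # add a high-frequency pattern
    envelope = 1.0 + 0.2 * np.sin(np.linspace(0.0, np.pi, n))
    variants.append(patch * envelope[None, :])                     # low-frequency envelope
    return variants

patch = rng.random((8, 8))
for i, v in enumerate(augment(patch)):
    print(i, v.shape, round(float(v.mean()), 3))
```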

Number of Hidden Units and Layers


Units: in general related to the dimensionality of the input vectors and to the numbers of input/output neurons; a rule of thumb is n/10.

Layers: more layers can increase accuracy immensely, but more is costly and more creates the vanishing-gradient problem… more is DEEP.
*Also, from initialization: it is desirable for the uniform distribution, from which the synaptic weights are selected, to have a mean of zero and a variance equal to the reciprocal of the number of synaptic connections of a neuron.
(So the number of units matters here too.)
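A small sketch of that initialization rule: since a uniform distribution on [−a, a] has variance a²/3, choosing a = sqrt(3/fan_in) gives zero mean and variance 1/fan_in (the function name and seed are arbitrary):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=None):
    """Zero-mean uniform weights with variance 1/fan_in.
    Var(U[-a, a]) = a**2 / 3, so a = sqrt(3 / fan_in)."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.sqrt(3.0 / fan_in)
    return rng.uniform(-a, a, size=(fan_out, fan_in))

W = init_weights(fan_in=100, fan_out=30)
print(round(float(W.mean()), 4), round(float(W.var()), 4), "target variance:", 1.0 / 100)
```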
Premature (early) stopping of learning: avoids overfitting (cf. the RKHS discussion) via the bias–variance trade-off, using validation data held out from within the training set (a sketch follows).
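A minimal sketch of early stopping under these ideas, assuming a validation split held out from the training set and a hypothetical patience parameter; a linear least-squares model stands in for the network to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)   # noisy targets

# Validation data held out from within the training set
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(5)
best_w, best_val, patience, wait = w.copy(), np.inf, 10, 0
for epoch in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # squared-error gradient
    w -= 0.05 * grad
    val_mse = float(np.mean((X_val @ w - y_val) ** 2))
    if val_mse < best_val:                           # keep the best weights seen so far
        best_val, best_w, wait = val_mse, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:                         # stop before overfitting sets in
            print("stopped early at epoch", epoch)
            break

print("best validation MSE:", round(best_val, 4))
```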

Weight decay: a constant decay of each weight by a term −ε·w_old.

A kind of regularization*: a squared-weight term is added to the cost.
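In symbols (a standard way of writing this kind of regularizer; λ below is an assumed name for the decay coefficient):

```latex
E_{\text{total}}(\mathbf{w}) \;=\; E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^2
\qquad\Longrightarrow\qquad
w_{\text{new}} \;=\; w_{\text{old}} \;-\; \eta\,\frac{\partial E}{\partial w}\bigg|_{w_{\text{old}}} \;-\; \eta\lambda\, w_{\text{old}}
```

The extra −ηλ·w_old term is exactly the constant weight decay mentioned above.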
Criterion function: alternatives to the squared error
• Cross-entropy (target and output vectors treated as probability vectors)
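A small sketch contrasting the two criteria on one example, treating the target and output as probability vectors (the numbers are arbitrary):

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])   # network output, treated as a probability vector
d = np.array([1.0, 0.0, 0.0])   # desired response, likewise

squared_error = 0.5 * float(np.sum((d - y) ** 2))
cross_entropy = -float(np.sum(d * np.log(y + 1e-12)))   # epsilon guards against log(0)
print("squared error :", round(squared_error, 4))
print("cross entropy :", round(cross_entropy, 4))
```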

Back Propagation and Differentiation
A specific technique for implementing gradient descent in weight space.

The same back-propagation machinery also yields an element of the network's Jacobian (the case with no nonlinearity applied at the output).

The cost function for one example x involves all W weights; over 'N' examples, the Jacobian J (one row of error derivatives per example) is N × W.

Rank of J < min(N, W): J is rank-deficient, the neural network is under-learning (extracting less information from the training data than it could), but the problem of overfitting remains!

Typically, training times are long!
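A rough numerical check of this, assuming J is taken as the N × W matrix whose n-th row is the gradient of the n-th example's squared error with respect to all W weights, computed here for a tiny 3-4-1 tanh network:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hid = 20, 3, 4                        # N examples, a tiny 3-4-1 tanh network
X = rng.normal(size=(N, d_in))
t = rng.normal(size=N)
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=d_hid)
W_total = W1.size + W2.size                      # W = total number of weights

# Row n of J = gradient of the n-th example's squared error w.r.t. all W weights
J = np.zeros((N, W_total))
for n in range(N):
    h = np.tanh(W1 @ X[n])                       # hidden activations
    e = W2 @ h - t[n]                            # output error for this example
    grad_W2 = e * h                              # dE_n / dW2
    grad_W1 = np.outer(e * W2 * (1.0 - h**2), X[n])   # dE_n / dW1 via the chain rule
    J[n] = np.concatenate([grad_W1.ravel(), grad_W2])

print("J shape:", J.shape,
      "rank(J):", np.linalg.matrix_rank(J),
      "min(N, W):", min(N, W_total))
```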

Hessian and Online Learning

Uses: pruning of insignificant synaptic weights; second-order optimization as an alternative to BackProp.

The eigenvalues of the Hessian matrix (M × M, positive semi-definite): the largest eigenvalue and the smallest nonzero eigenvalue (their ratio, the eigenvalue spread) govern the dynamics of learning.
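A tiny illustration using the one case where the Hessian is available in closed form: for the squared-error cost of a single linear unit, H = XᵀX/N, which is positive semi-definite (the duplicated input column below is an assumption made just to force a zero eigenvalue):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[:, 5] = X[:, 0]                        # duplicated input column -> a zero eigenvalue

# For the squared-error cost of one linear unit, the Hessian is H = X^T X / N  (M x M)
H = X.T @ X / len(X)
eig = np.linalg.eigvalsh(H)              # eigenvalues of the symmetric Hessian, ascending

nonzero = eig[eig > 1e-10]
print("positive semi-definite:", bool(np.all(eig > -1e-10)))
print("largest eigenvalue:   ", round(float(nonzero.max()), 4))
print("smallest nonzero eig.:", round(float(nonzero.min()), 4))
print("eigenvalue spread:    ", round(float(nonzero.max() / nonzero.min()), 2))
```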

Utility (example): the means of the inputs (preprocess the inputs to have zero mean).

For the hidden layers… use an odd-symmetric activation (hyperbolic tangent) for faster convergence!
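A small sketch of both points, assuming nothing beyond numpy: the inputs are shifted to zero mean, and tanh serves as the odd-symmetric hidden activation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=1.0, size=(100, 4))   # inputs with a nonzero mean
X_centered = X - X.mean(axis=0)                     # remove the means of the inputs

print(X_centered.mean(axis=0).round(6))             # ~ zero-mean inputs
print(np.tanh(-2.0), np.tanh(2.0))                  # odd symmetry: tanh(-z) = -tanh(z)
```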

On Convergence
We have seen the asymptotic convergence of LMS to a local minimum.

Such an analysis is more complicated here; instead, some comments on the learning curve.

The learning curve contains:
• A minimal loss (approach towards the global/local minimum)
• An additional loss (fluctuations in the weight evolution)
• A time-dependent term

Stability of learning holds if the learning rate is set appropriately; the speed of learning likewise depends on where the learning rate sits.

Optimal annealing of the learning rate: optimally annealed on-line learning operates as fast as batch learning in an asymptotic sense (a sketch of one annealing schedule follows).
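A sketch of one common annealing schedule (this particular search-then-converge form, η(t) = η₀/(1 + t/τ), is an illustrative assumption rather than the schedule derived on the slides):

```python
eta0, tau = 0.1, 100.0

def eta(t):
    """Learning rate ~ constant for t << tau, decaying roughly as 1/t for t >> tau."""
    return eta0 / (1.0 + t / tau)

for t in (0, 10, 100, 1000, 10000):
    print(t, round(eta(t), 5))
```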

The instantaneous cost function, the expected risk, and the corresponding batch (empirical) average.
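In the usual notation (a standard formulation assumed here, matching the squared-error criterion used earlier; d_n is the desired response and y the network output):

```latex
\mathcal{E}(\mathbf{w}; \mathbf{x}_n) \;=\; \tfrac{1}{2}\,\bigl\lVert \mathbf{d}_n - \mathbf{y}(\mathbf{x}_n; \mathbf{w}) \bigr\rVert^2
  \qquad \text{(instantaneous cost)}

J(\mathbf{w}) \;=\; \mathbb{E}_{\mathbf{x}}\!\bigl[\, \mathcal{E}(\mathbf{w}; \mathbf{x}) \,\bigr]
  \qquad \text{(expected risk)}

J_N(\mathbf{w}) \;=\; \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(\mathbf{w}; \mathbf{x}_n)
  \qquad \text{(batch / empirical average)}
```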

Evaluating the above over infinitesimal intervals, with expected values of the gradient; this has been shown (using that result).
Ensemble averaging; H is the Hessian (time-averaged, in the batch sense).

Solution: we want this (positive) term to go to zero as t → ∞.

→ 0 as t → ∞: the implementation is fine, even with "stability"!

Adaptive control of the learning rate* (simple annealing is not suitable under nonstationarity).

Generalized from the gradient g, with positive definiteness kept intact.

Under the stated assumptions, solving gives the learning-rate rule; but it is not optimally annealed.


Generalization

Smoothness of the learned input–output mapping matters, along with:
• Size of the training data
• Network architecture, hyperparameters
• Complexity of the problem at hand

Rule of thumb: the training size should be of the order of the total number of free parameters divided by the fraction of errors allowed on test data, i.e. N = O(W/ε).
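As a quick illustrative instance of this rule (the numbers here are assumptions for the sake of example): a network with W = 1,000 free parameters and a tolerated test-error fraction ε = 0.01 calls for on the order of N ≈ W/ε = 100,000 training examples.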
