
Computational Intelligence

Lecture 2: Supervised Learning Algorithms


George Magoulas
gmagoulas@dcs.bbk.ac.uk

Contents
The feedforward neural network model
Supervised learning
Formulation as a minimisation problem
Gradient-based algorithms for supervised learning
Hybrid algorithms for supervised learning
Summary

Feedforward Neural Network


f(x) = 1 / (1 + exp(−ax))

Learning & Generalisation!
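The logistic activation above can be written directly; a minimal sketch (the slope parameter a comes from the formula on the slide, the default value is an illustrative choice):

```python
import math

def logistic(x, a=1.0):
    """Logistic (sigmoid) activation: f(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + math.exp(-a * x))
```

Larger a makes the transition steeper; logistic(0) is always 0.5.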

Supervised Learning
Supervised learning
The target outputs are known; the network is being forced into producing the correct outputs, and the weights w_ij are updated to reduce the error.

[Figure: the network output y is compared with the desired output d_j supplied by an external signal (teacher); the error is E = y − d.]

Supervised Learning Algorithm

Supervised learning
The response that the network is required to learn is presented to the network during training. The desired response of the network acts as an explicit teacher signal. Examples: the perceptron rule; the gradient descent-backpropagation rule. During training, the difference between the current output of the network and the desired output is used to update the weights until the desired performance is obtained.
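As an illustration of the perceptron rule mentioned above, here is a minimal sketch (the AND problem, learning rate, and epoch count are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron_train(X, d, epochs=20, eta=1.0):
    """Perceptron rule: w <- w + eta * (d - y) * x, with a hard-threshold output."""
    w = np.zeros(X.shape[1] + 1)                      # last entry is the bias
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append constant input for bias
    for _ in range(epochs):
        for x, t in zip(Xb, d):
            y = 1 if w @ x >= 0 else 0                # hard-threshold output
            w += eta * (t - y) * x                    # update only on a mistake
    return w
```

After training on the AND truth table, the learned weights classify all four patterns correctly (the perceptron converges because AND is linearly separable).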

Methods for Supervised Training


Gradient-based techniques: use only first-order, weight-specific information (e.g. the partial derivatives of the error with respect to the weights) to adapt the weight parameters.
Second-order methods: use second-derivative information to accelerate the learning process.
Global search methods: explore the entire search space to locate optimal solutions.

Formulation as a minimisation of the error problem


min_w E(w)

General weight update rule:

w_{k+1} = w_k + η_k d_k,

where d_k is the search direction and η_k is the stepsize.

Formulation as a minimisation of the error problem


Gradient vector of function F(x):

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2, …, ∂F(x)/∂x_n]^T

First derivative of F(x) with respect to x_i (ith element of the gradient vector): ∂F(x)/∂x_i

Optimality condition:

∇F(x)|_{x=x*} = 0

Example
F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2,  x* = [0.5, 0]^T,  p = [−1, 1]^T

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2]^T = [2x_1 + 2x_2, 2x_1 + 4x_2]^T,  so  ∇F(x*) = [1, 1]^T

Directional derivative along p:

p^T ∇F(x*) / ||p|| = [−1 1] [1, 1]^T / √2 = 0 / √2 = 0
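A quick numerical check of this example (a sketch; the function, point, and direction are those on the slide):

```python
import numpy as np

def grad_F(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2."""
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])

x_star = np.array([0.5, 0.0])
p = np.array([-1.0, 1.0])
g = grad_F(x_star)                       # gradient at x*: [1, 1]
dir_deriv = (p @ g) / np.linalg.norm(p)  # zero: F has zero slope along p at x*
```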

Formulation as a minimisation of the error problem

Hessian matrix of function F(x):

∇²F(x) = [ ∂²F/∂x_1²      ∂²F/∂x_1∂x_2   …   ∂²F/∂x_1∂x_n ]
         [ ∂²F/∂x_2∂x_1   ∂²F/∂x_2²      …   ∂²F/∂x_2∂x_n ]
         [ …                                               ]
         [ ∂²F/∂x_n∂x_1   ∂²F/∂x_n∂x_2   …   ∂²F/∂x_n²    ]

Second derivative (curvature) of F(x) with respect to x_i ((i,i) element of the Hessian): ∂²F(x)/∂x_i²

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1

∇F(x) = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T = 0  ⇒  x* = [−1, 0.5]^T

∇²F(x) = [2 2; 2 4]  (not a function of x in this case)

To test the definiteness, check the eigenvalues of the Hessian. If the eigenvalues are all greater than zero, the Hessian is positive definite. From the determinant:

det(∇²F(x) − λI) = det([2−λ 2; 2 4−λ]) = λ² − 6λ + 4 = (λ − 0.76)(λ − 5.24)  ⇒  λ = 0.76, 5.24

Both eigenvalues are positive, therefore x* is a strong minimum.
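The eigenvalue test can be verified numerically (a sketch using the Hessian of this example):

```python
import numpy as np

# Hessian of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1 (constant, independent of x)
H = np.array([[2.0, 2.0],
              [2.0, 4.0]])
eigvals = np.linalg.eigvalsh(H)          # roots of lam^2 - 6*lam + 4 = 0
positive_definite = bool(np.all(eigvals > 0))
```

The eigenvalues come out as 3 ± √5 ≈ 0.76 and 5.24, so the Hessian is positive definite and the stationary point is a strong minimum.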

Steepest Descent
Gradient descent: choose the next x so that

F(x_{k+1}) < F(x_k)

Maximise the decrease by choosing

x_{k+1} = x_k − α_k g_k,  where  g_k ≡ ∇F(x)|_{x=x_k}

Steepest Descent
F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1,  x_0 = [0.5, 0.5]^T,  α = 0.1  (α = learning rate)

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2]^T = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T

g_0 = ∇F(x)|_{x=x_0} = [3, 3]^T

x_1 = x_0 − α g_0 = [0.5, 0.5]^T − 0.1 [3, 3]^T = [0.2, 0.2]^T

x_2 = x_1 − α g_1 = [0.2, 0.2]^T − 0.1 [1.8, 1.2]^T = [0.02, 0.08]^T

[Contour plots: trajectories of steepest descent on F for learning rates α = 0.37 and α = 0.39.]
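The two iterations of this example can be reproduced directly (a sketch of plain steepest descent on the slide's function):

```python
import numpy as np

def grad_F(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1."""
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

alpha = 0.1                         # learning rate from the example
x = np.array([0.5, 0.5])            # x_0
trajectory = [x.copy()]
for _ in range(2):
    x = x - alpha * grad_F(x)       # x_{k+1} = x_k - alpha * g_k
    trajectory.append(x.copy())
```

The trajectory is [0.5, 0.5] → [0.2, 0.2] → [0.02, 0.08], matching the hand calculation.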

Feedforward Neural Network

Backpropagation (BP)

w_{k+1} = w_k − η ∇E(w_k),

where η is the learning rate.

Chain Rule

d f(n(w)) / dw = (d f(n) / dn) · (d n(w) / dw)

Example

f(n) = cos(n),  n = e^{2w},  so  f(n(w)) = cos(e^{2w})

d f(n(w)) / dw = (d f(n)/dn)(d n(w)/dw) = (−sin(n))(2e^{2w}) = −2e^{2w} sin(e^{2w})
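The chain-rule result can be checked against a finite-difference estimate (a sketch; the step size h is an illustrative choice):

```python
import math

def f_of_w(w):
    """f(n(w)) = cos(e^{2w})."""
    return math.cos(math.exp(2 * w))

def df_dw(w):
    """Chain rule: (-sin(n)) * (2*e^{2w}) with n = e^{2w}."""
    n = math.exp(2 * w)
    return -math.sin(n) * 2 * n

def finite_diff(func, w, h=1e-6):
    """Central-difference approximation of d func / dw."""
    return (func(w + h) - func(w - h)) / (2 * h)
```

The analytic derivative and the numerical estimate agree to several decimal places at any test point.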

Application to Gradient Calculation


∂F/∂w_{i,j}^m = (∂F/∂n_i^m) (∂n_i^m/∂w_{i,j}^m)

∂F/∂b_i^m = (∂F/∂n_i^m) (∂n_i^m/∂b_i^m)

Gradient Calculation
n_i^m = Σ_{j=1}^{S^{m−1}} w_{i,j}^m a_j^{m−1} + b_i^m

∂n_i^m/∂w_{i,j}^m = a_j^{m−1}

∂n_i^m/∂b_i^m = 1

Sensitivity

s_i^m ≡ ∂F/∂n_i^m

Gradient

∂F/∂w_{i,j}^m = s_i^m a_j^{m−1}

∂F/∂b_i^m = s_i^m

Steepest Descent
w_{i,j}^m(k+1) = w_{i,j}^m(k) − α s_i^m a_j^{m−1}

b_i^m(k+1) = b_i^m(k) − α s_i^m

In matrix form:

W^m(k+1) = W^m(k) − α s^m (a^{m−1})^T

b^m(k+1) = b^m(k) − α s^m,

where  s^m ≡ ∂F/∂n^m = [∂F/∂n_1^m, ∂F/∂n_2^m, …, ∂F/∂n_{S^m}^m]^T.

Next Step: Compute the Sensitivities (Backpropagation)

Backpropagation

Jacobian matrix of the net inputs of layer m+1 with respect to those of layer m:

∂n^{m+1}/∂n^m = [ ∂n_i^{m+1}/∂n_j^m ],  i = 1, …, S^{m+1},  j = 1, …, S^m

Since  n_i^{m+1} = Σ_l w_{i,l}^{m+1} a_l^m + b_i^{m+1},

∂n_i^{m+1}/∂n_j^m = w_{i,j}^{m+1} ∂a_j^m/∂n_j^m = w_{i,j}^{m+1} ḟ^m(n_j^m),

where  ḟ^m(n_j^m) = ∂f^m(n_j^m)/∂n_j^m.

In matrix form:

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m),

where  Ḟ^m(n^m) = diag( ḟ^m(n_1^m), ḟ^m(n_2^m), …, ḟ^m(n_{S^m}^m) ).

Backpropagation

Sensitivity recurrence (chain rule in matrix form):

s^m = ∂F/∂n^m = (∂n^{m+1}/∂n^m)^T ∂F/∂n^{m+1} = Ḟ^m(n^m) (W^{m+1})^T s^{m+1}

The sensitivities are computed by starting at the last layer, and then propagating backwards through the network to the first layer:

s^M → s^{M−1} → … → s^2 → s^1

Output layer calculation

s_i^M = ∂F/∂n_i^M = ∂[(t − a)^T (t − a)]/∂n_i^M = ∂ Σ_j (t_j − a_j)² / ∂n_i^M = −2(t_i − a_i) ∂a_i/∂n_i^M

∂a_i/∂n_i^M = ∂f^M(n_i^M)/∂n_i^M = ḟ^M(n_i^M)

s_i^M = −2(t_i − a_i) ḟ^M(n_i^M)

In matrix form:

s^M = −2 Ḟ^M(n^M)(t − a)
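Putting the forward pass, the sensitivity recursion, and the steepest-descent updates together, here is a minimal sketch for a two-layer network (the 1-2-1 architecture, logistic hidden layer, linear output layer, learning rate, and single training pair are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(n):
    return 1.0 / (1.0 + np.exp(-n))

# 1-2-1 network: logistic hidden layer, linear output layer (illustrative sizes)
W1, b1 = rng.standard_normal((2, 1)), np.zeros((2, 1))
W2, b2 = rng.standard_normal((1, 2)), np.zeros((1, 1))
alpha = 0.1
p, t = np.array([[1.0]]), np.array([[0.5]])   # one training pair, as a smoke test

errors = []
for _ in range(200):
    # forward pass
    a1 = logistic(W1 @ p + b1)
    a2 = W2 @ a1 + b2                      # linear output: f(n) = n
    # backward pass: sensitivities
    s2 = -2 * (t - a2)                     # s^M = -2 Fdot^M (t - a); Fdot = I here
    s1 = (a1 * (1 - a1)) * (W2.T @ s2)     # s^m = Fdot^m (W^{m+1})^T s^{m+1}
    # steepest-descent updates
    W2 -= alpha * s2 @ a1.T;  b2 -= alpha * s2
    W1 -= alpha * s1 @ p.T;   b1 -= alpha * s1
    errors.append(float((t - a2) ** 2))
```

The squared error decreases monotonically towards zero on this single training pair, confirming the sign conventions of the sensitivity equations.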

Gradient-based algorithms for supervised learning


BP with adaptive learning rate

Local estimation of the Lipschitz constant:

Λ_k = ||∇E(w_k) − ∇E(w_{k−1})|| / ||w_k − w_{k−1}||

Update the weights:

w_{k+1} = w_k − (1 / (2Λ_k)) ∇E(w_k)

In steep regions of the error surface, Λ_k is large, and a small value for the learning rate is used in order to guarantee convergence. On the other hand, when the error surface has flat regions, Λ_k is small, and a large learning rate is used to accelerate convergence.
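A sketch of this scheme, using a simple quadratic E(w) = w_1² + w_2² as a stand-in for the error function (two distinct starting points are needed to form the first Lipschitz estimate):

```python
import numpy as np

def adaptive_lr_descent(grad, w_prev, w, steps=50):
    """Gradient descent with learning rate 1/(2*Lambda_k), where Lambda_k is a
    local estimate of the Lipschitz constant of the gradient."""
    w_prev, w = np.asarray(w_prev, float), np.asarray(w, float)
    for _ in range(steps):
        g, g_prev = grad(w), grad(w_prev)
        lam = np.linalg.norm(g - g_prev) / np.linalg.norm(w - w_prev)
        w_prev, w = w, w - (1.0 / (2.0 * lam)) * g
    return w

# stand-in error function E(w) = w1^2 + w2^2, with gradient 2w
w_min = adaptive_lr_descent(lambda w: 2 * w, [1.0, 1.0], [0.8, 0.6])
```

For this quadratic the Lipschitz estimate is exact (Λ_k = 2), so each step halves the distance to the minimiser at the origin.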

Gradient-based algorithms for supervised learning


BP with an adaptive learning rate for each weight

Local estimation of the Lipschitz constant along the ith weight direction:

Λ_k^i = |∂_i E(w_k) − ∂_i E(w_{k−1})| / |w_k^i − w_{k−1}^i|

Update the weights:

w_{k+1} = w_k − η_k diag{1/Λ_k^1, 1/Λ_k^2, …, 1/Λ_k^n} ∇E(w_k)

η_k is usually set to 1.

Gradient-based algorithms for supervised learning


The one-step Jacobi-Newton BP method

Update the weights:

w_{k+1}^i = w_k^i − η_k^i ∂_i E(w_k) / ∂²_{ii} E(w_k),

where η_k^i is usually set to 1.

Gradient-based algorithms for supervised learning


Rprop (1993): helps to eliminate the harmful influence of the magnitude of the derivatives on the weight updates. Basic idea: the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update.

The resilient propagation update rule:

w_{k+1} = w_k − diag{η_k^1, …, η_k^i, …, η_k^n} sign(∇E(w_k))

with the stepsize of each weight adapted from the sign of successive partial derivatives g_m:

if g_m(w_{k−1}) g_m(w_k) > 0  then  η_m^k = min(η_m^{k−1} η⁺, η_max)
if g_m(w_{k−1}) g_m(w_k) < 0  then  η_m^k = max(η_m^{k−1} η⁻, η_min)
if g_m(w_{k−1}) g_m(w_k) = 0  then  η_m^k = η_m^{k−1}

Rprop

for each weight w_k do {
    if g_k · g_{k−1} > 0 then { η_k = min(η_{k−1} η⁺, η_max);  Δw_k = −sign(g_k) η_k; }
    elseif g_k · g_{k−1} < 0 then { η_k = max(η_{k−1} η⁻, η_min);  Δw_k = −Δw_{k−1};  g_k = 0; }
    elseif g_k · g_{k−1} = 0 then { η_k = η_{k−1};  Δw_k = −sign(g_k) η_k; }
    w_{k+1} = w_k + Δw_k;
}
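A sketch of one Rprop iteration with sign-change backtracking (η⁺ = 1.2, η⁻ = 0.5 and the stepsize bounds are values commonly used with Rprop; the driver minimising f(w) = w² is purely illustrative):

```python
import numpy as np

def rprop_step(w, g, g_prev, step, dw_prev,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One Rprop iteration: only sign(g) sets the update direction; the
    per-weight stepsize grows on a repeated sign and shrinks on a sign change."""
    step, g = step.copy(), g.copy()
    same = g * g_prev > 0
    flip = g * g_prev < 0
    step[same] = np.minimum(step[same] * eta_plus, step_max)
    step[flip] = np.maximum(step[flip] * eta_minus, step_min)
    dw = -np.sign(g) * step
    dw[flip] = -dw_prev[flip]        # backtrack: undo the previous update
    g[flip] = 0.0                    # forces the '= 0' branch next iteration
    return w + dw, g, step, dw

# illustrative driver: minimise f(w) = w^2, gradient g(w) = 2w
w = np.array([5.0])
g_prev, step, dw_prev = np.zeros(1), np.full(1, 0.1), np.zeros(1)
for _ in range(100):
    w, g_prev, step, dw_prev = rprop_step(w, 2 * w, g_prev, step, dw_prev)
```

The iterate first accelerates towards the minimum (stepsize grows geometrically), then oscillates around it with shrinking steps after each sign change.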

Gradient-based algorithms for supervised learning


Conjugate Gradient

Choose the initial search direction as the negative of the gradient:

p_0 = −g_0

Select the learning rate α_k to minimise F along the line (exact for quadratic functions):

α_k = −( ∇F(x)^T|_{x=x_k} p_k ) / ( p_k^T ∇²F(x)|_{x=x_k} p_k ) = −(g_k^T p_k) / (p_k^T A_k p_k)

Choose subsequent search directions to be conjugate:

p_k = −g_k + β_k p_{k−1},

where

β_k = (Δg_{k−1}^T g_k) / (Δg_{k−1}^T p_{k−1})   (Hestenes-Stiefel)

or

β_k = (g_k^T g_k) / (g_{k−1}^T g_{k−1})   (Fletcher-Reeves)

or

β_k = (Δg_{k−1}^T g_k) / (g_{k−1}^T g_{k−1})   (Polak-Ribiere),

i.e.  β_k^{PR} = ∇E(w_k)^T [∇E(w_k) − ∇E(w_{k−1})] / ||∇E(w_{k−1})||²

If the algorithm has not converged, return to the line-minimisation step. A quadratic function will be minimised in at most n steps.
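A sketch of the procedure for a quadratic F(x) = ½ xᵀAx − bᵀx, with exact line search and the Fletcher-Reeves β (the matrix A and starting point below come from the earlier Hessian example; b = [−1, 0]ᵀ makes F the same function):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    """Minimise F(x) = 0.5*x^T A x - b^T x (A symmetric positive definite)
    using Fletcher-Reeves conjugate gradient with exact line search."""
    x = np.asarray(x0, float)
    g = A @ x - b                          # gradient of F
    p = -g                                 # p_0 = -g_0
    for _ in range(n_steps):
        alpha = -(g @ p) / (p @ A @ p)     # exact minimiser along the line
        x = x + alpha * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[2.0, 2.0], [2.0, 4.0]])     # Hessian of the earlier example
b = np.array([-1.0, 0.0])
x_min = conjugate_gradient(A, b, [0.5, 0.5], n_steps=2)   # n = 2 steps suffice
```

As the theory promises, the two-dimensional quadratic is minimised exactly in n = 2 steps, landing on [−1, 0.5]ᵀ.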

Conjugate Gradient vs Steepest Descent

[Contour plots comparing the trajectories of conjugate gradient and steepest descent on a quadratic function.]

Gradient-based algorithms for supervised learning


Newton method:

d_k = −B_k^{−1} ∇E(w_k),

where B_k is a symmetric nonsingular matrix (i.e. it has an inverse) approximating the Hessian ∇²E(w_k); e.g. the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

B(w_{k+1}) = B(w_k) + (y_k y_k^T)/(y_k^T s_k) − (B(w_k) s_k s_k^T B(w_k))/(s_k^T B(w_k) s_k),

where  y_k = ∇E(w_{k+1}) − ∇E(w_k)  and  s_k = w_{k+1} − w_k.
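A sketch of a single BFGS update; for a quadratic error, y_k = ∇²E · s_k exactly, and the updated matrix satisfies the secant condition B_{k+1} s_k = y_k:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B, where
    s = w_{k+1} - w_k and y = grad E(w_{k+1}) - grad E(w_k)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# quadratic example: true Hessian H, so y = H @ s exactly
H = np.array([[2.0, 2.0], [2.0, 4.0]])
s = np.array([1.0, 0.3])
y = H @ s
B1 = bfgs_update(np.eye(2), s, y)
```

The update preserves symmetry and forces the new approximation to reproduce the observed gradient change along s.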

Monotone convergence

Convergence condition: the search direction d_k must be a descent direction:

d_k^T ∇E(w_k) < 0

Monotone condition:

f(w_k + α_k d_k) ≤ f(w_k) + α_k g_k^T d_k

It can be shown that if the learning rate satisfies the monotone condition, then any algorithm of the form w_{k+1} = w_k + α_k d_k converges to a local minimiser.
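A backtracking sketch that enforces a sufficient-decrease version of the monotone condition (the factor σ ∈ (0,1) is a standard addition not shown on the slide; ρ is the backtracking factor, and the quadratic test function is a stand-in):

```python
import numpy as np

def backtracking_line_search(f, grad_f, w, d, alpha0=1.0, rho=0.5, sigma=1e-4):
    """Shrink alpha until f(w + alpha*d) <= f(w) + sigma*alpha*g^T d.
    d must be a descent direction: d^T grad_f(w) < 0."""
    g_dot_d = grad_f(w) @ d
    assert g_dot_d < 0, "d is not a descent direction"
    alpha = alpha0
    while f(w + alpha * d) > f(w) + sigma * alpha * g_dot_d:
        alpha *= rho                # backtrack
    return alpha

f = lambda w: w @ w                 # stand-in error function
grad_f = lambda w: 2 * w
w0 = np.array([1.0, 1.0])
alpha = backtracking_line_search(f, grad_f, w0, -grad_f(w0))
```

Starting from α = 1 (which overshoots on this quadratic), one backtracking step yields an accepted α that strictly decreases f.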

Hybrid algorithms for supervised learning


Noise diffusion:

w_{k+1} = w_k − η ∇E(w_k) + c(k) N(k),

where η is the stepsize and c(k) is the noise magnitude, which is reduced with k; e.g. c(k) = c_0 e^{−λk}, where c_0 ≥ 0 controls the magnitude and λ > 0 is the damping factor.

N = (N_1, N_2, …, N_q) are independent noise sources.
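A sketch of the noise-diffusion update on a stand-in quadratic error (η, c_0, λ, and the Gaussian noise source are illustrative choices):

```python
import numpy as np

def noisy_gradient_descent(grad, w0, eta=0.1, c0=0.5, lam=0.05, steps=200, seed=0):
    """Gradient descent with additive noise of exponentially decaying
    magnitude c(k) = c0 * exp(-lam * k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, float)
    for k in range(steps):
        noise = c0 * np.exp(-lam * k) * rng.standard_normal(w.shape)
        w = w - eta * grad(w) + noise
    return w

# stand-in error E(w) = w1^2 + w2^2; early noise lets the iterate explore,
# and the decaying magnitude lets it settle into the minimum
w_end = noisy_gradient_descent(lambda w: 2 * w, [2.0, -2.0])
```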

Hybrid algorithms for supervised learning


Langevin noise:

w_{k+1} = w_k − η ∇E(w_k) + c(k)^T N(k),

where c(k) is a vector with components that define a different noise magnitude for each parameter.

Hybrid algorithms for supervised learning


The Langevin noise technique (power law): it uses the derivative, i.e.

x_{k+1} = x_k − η ∇f(x_k) + c ξ k^{−T},

where c is a constant that controls the noise intensity, ξ is drawn from the interval [−0.5, +0.5], and T is the noise reduction rate.
Hybrid algorithms for supervised learning


SARPROP: combines gradient descent with simulated annealing, injecting noise when there is a change in the sign of the gradient (the second Rprop condition).

Simulated annealing (Kirkpatrick et al. 1983; Corana et al. 1987):

x_{k+1} = x_k + Δx,

where Δx is random noise from a uniform distribution. The effectiveness depends on the parameter T, called the temperature, which controls the noise reduction rate:

T(k) = T_0 / (1 + ln k),  k = 1, 2, …

Hybrid algorithms for supervised learning


The Metropolis move (Metropolis et al. 1953):

P(x_k → x_{k+1}) = e^{−Δf/T}  if  Δf = f(x_{k+1}) − f(x_k) > 0,  and 1 otherwise.

The new point x_{k+1} is either accepted or not, depending on the sign of Δf = f(x_{k+1}) − f(x_k). If the sign is negative, then the new point is accepted with probability 1. Otherwise, acceptance depends on the probability value exceeding a threshold value θ:

accept if  P(x_k → x_{k+1}) = exp(−Δf / T) > θ,  θ ∈ (0, 1)
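A sketch of the acceptance rule (the uniform random draw plays the role of the threshold θ):

```python
import math
import random

def metropolis_accept(f_old, f_new, T, rng=random):
    """Metropolis rule: always accept a decrease in f; accept an increase
    with probability exp(-(f_new - f_old) / T)."""
    df = f_new - f_old
    if df < 0:
        return True                            # downhill: accept with probability 1
    return rng.random() < math.exp(-df / T)    # uphill: accept if probability > threshold
```

At high temperature T almost every move is accepted; as T → 0 the rule reduces to pure descent.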

Summary
Formulated supervised learning as a minimisation problem
Presented first-order and second-order algorithms for supervised learning
Presented hybrid approaches that equip gradient-based algorithms with noise injection schemes

Next lecture
Supervised learning part 2: nonmonotone learning
