
Computational Intelligence

Lecture 2: Supervised Learning Algorithms


George Magoulas
gmagoulas@dcs.bbk.ac.uk

Contents
The feedforward neural network model
Supervised learning
Formulation as a minimisation problem
Gradient-based algorithms for supervised learning
Hybrid algorithms for supervised learning
Summary

Feedforward Neural Network


f(x) = 1 / (1 + exp(−ax))

Learning & Generalisation!
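The logistic activation above can be written directly; a minimal sketch (the slope parameter a comes from the formula on the slide, the default value is an illustrative choice):

```python
import math

def logistic(x, a=1.0):
    """Logistic (sigmoid) activation: f(x) = 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + math.exp(-a * x))
```

Larger a makes the transition steeper; logistic(0) is always 0.5.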

Supervised Learning
Supervised learning
The target outputs are known; the network is being forced into producing the correct outputs, and the weights w_ij are updated to reduce the error.

[Figure: the network output y is compared with the desired output d_j supplied by an external signal (teacher); the error is E = y − d.]

Supervised Learning Algorithm

Supervised learning
The response that the network is required to learn is presented to the network during training. The desired response of the network acts as an explicit teacher signal. Examples: the perceptron rule; the gradient descent-backpropagation rule. During training, the difference between the current output of the network and the desired output is used to update the weights until the desired performance is obtained.
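As an illustration of the perceptron rule mentioned above, here is a minimal sketch (the AND problem, learning rate, and epoch count are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron_train(X, d, epochs=20, eta=1.0):
    """Perceptron rule: w <- w + eta * (d - y) * x, with a hard-threshold output."""
    w = np.zeros(X.shape[1] + 1)                      # last entry is the bias
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append constant input for bias
    for _ in range(epochs):
        for x, t in zip(Xb, d):
            y = 1 if w @ x >= 0 else 0                # hard-threshold output
            w += eta * (t - y) * x                    # update only on a mistake
    return w
```

After training on the AND truth table, the learned weights classify all four patterns correctly (the perceptron converges because AND is linearly separable).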

Methods for Supervised Training


Gradient-based techniques: use only first-order, weight-specific information (e.g. the partial derivatives of the error with respect to the weights) to adapt the weight parameters.
Second-order methods: use second-derivative information to accelerate the learning process.
Global search methods: explore the entire search space to locate optimal solutions.

Formulation as a minimisation of the error problem


min_w E(w)

General weight update rule:

w_{k+1} = w_k + η_k d_k,

where d_k is the search direction and η_k is the stepsize.

Formulation as a minimisation of the error problem


Gradient vector of function F(x):

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2, …, ∂F(x)/∂x_n]^T

First derivative of F(x) with respect to x_i (ith element of the gradient vector): ∂F(x)/∂x_i

Optimality condition:

∇F(x)|_{x=x*} = 0

Example
F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2,  x* = [0.5, 0]^T,  p = [−1, 1]^T

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2]^T = [2x_1 + 2x_2, 2x_1 + 4x_2]^T,  so  ∇F(x*) = [1, 1]^T

Directional derivative along p:

p^T ∇F(x*) / ||p|| = [−1 1] [1, 1]^T / √2 = 0 / √2 = 0
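A quick numerical check of this example (a sketch; the function, point, and direction are those on the slide):

```python
import numpy as np

def grad_F(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2."""
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])

x_star = np.array([0.5, 0.0])
p = np.array([-1.0, 1.0])
g = grad_F(x_star)                       # gradient at x*: [1, 1]
dir_deriv = (p @ g) / np.linalg.norm(p)  # zero: F has zero slope along p at x*
```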

Formulation as a minimisation of the error problem

Hessian matrix of function F(x):

∇²F(x) = [ ∂²F/∂x_1²      ∂²F/∂x_1∂x_2   …   ∂²F/∂x_1∂x_n ]
         [ ∂²F/∂x_2∂x_1   ∂²F/∂x_2²      …   ∂²F/∂x_2∂x_n ]
         [ …                                               ]
         [ ∂²F/∂x_n∂x_1   ∂²F/∂x_n∂x_2   …   ∂²F/∂x_n²    ]

Second derivative (curvature) of F(x) with respect to x_i ((i,i) element of the Hessian): ∂²F(x)/∂x_i²

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1

∇F(x) = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T = 0  ⇒  x* = [−1, 0.5]^T

∇²F(x) = [2 2; 2 4]  (not a function of x in this case)

To test the definiteness, check the eigenvalues of the Hessian. If the eigenvalues are all greater than zero, the Hessian is positive definite. From the determinant:

det(∇²F(x) − λI) = det([2−λ 2; 2 4−λ]) = λ² − 6λ + 4 = (λ − 0.76)(λ − 5.24)  ⇒  λ = 0.76, 5.24

Both eigenvalues are positive, therefore x* is a strong minimum.
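The eigenvalue test can be verified numerically (a sketch using the Hessian of this example):

```python
import numpy as np

# Hessian of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1 (constant, independent of x)
H = np.array([[2.0, 2.0],
              [2.0, 4.0]])
eigvals = np.linalg.eigvalsh(H)          # roots of lam^2 - 6*lam + 4 = 0
positive_definite = bool(np.all(eigvals > 0))
```

The eigenvalues come out as 3 ± √5 ≈ 0.76 and 5.24, so the Hessian is positive definite and the stationary point is a strong minimum.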

Steepest Descent
Gradient descent: choose the next x so that

F(x_{k+1}) < F(x_k)

Maximise the decrease by choosing

x_{k+1} = x_k − α_k g_k,  where  g_k ≡ ∇F(x)|_{x=x_k}

Steepest Descent
F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1,  x_0 = [0.5, 0.5]^T,  α = 0.1  (α = learning rate)

∇F(x) = [∂F(x)/∂x_1, ∂F(x)/∂x_2]^T = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T

g_0 = ∇F(x)|_{x=x_0} = [3, 3]^T

x_1 = x_0 − α g_0 = [0.5, 0.5]^T − 0.1 [3, 3]^T = [0.2, 0.2]^T

x_2 = x_1 − α g_1 = [0.2, 0.2]^T − 0.1 [1.8, 1.2]^T = [0.02, 0.08]^T

[Contour plots: trajectories of steepest descent on F for learning rates α = 0.37 and α = 0.39.]
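The two iterations of this example can be reproduced directly (a sketch of plain steepest descent on the slide's function):

```python
import numpy as np

def grad_F(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1."""
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

alpha = 0.1                         # learning rate from the example
x = np.array([0.5, 0.5])            # x_0
trajectory = [x.copy()]
for _ in range(2):
    x = x - alpha * grad_F(x)       # x_{k+1} = x_k - alpha * g_k
    trajectory.append(x.copy())
```

The trajectory is [0.5, 0.5] → [0.2, 0.2] → [0.02, 0.08], matching the hand calculation.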

Feedforward Neural Network

Backpropagation (BP)

w_{k+1} = w_k − η ∇E(w_k),

where η is the learning rate.

Chain Rule

d f(n(w)) / dw = (d f(n) / dn) · (d n(w) / dw)

Example

f(n) = cos(n),  n = e^{2w},  so  f(n(w)) = cos(e^{2w})

d f(n(w)) / dw = (d f(n)/dn)(d n(w)/dw) = (−sin(n))(2e^{2w}) = −2e^{2w} sin(e^{2w})
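The chain-rule result can be checked against a finite-difference estimate (a sketch; the step size h is an illustrative choice):

```python
import math

def f_of_w(w):
    """f(n(w)) = cos(e^{2w})."""
    return math.cos(math.exp(2 * w))

def df_dw(w):
    """Chain rule: (-sin(n)) * (2*e^{2w}) with n = e^{2w}."""
    n = math.exp(2 * w)
    return -math.sin(n) * 2 * n

def finite_diff(func, w, h=1e-6):
    """Central-difference approximation of d func / dw."""
    return (func(w + h) - func(w - h)) / (2 * h)
```

The analytic derivative and the numerical estimate agree to several decimal places at any test point.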

Application to Gradient Calculation


∂F/∂w_{i,j}^m = (∂F/∂n_i^m) (∂n_i^m/∂w_{i,j}^m)

∂F/∂b_i^m = (∂F/∂n_i^m) (∂n_i^m/∂b_i^m)

Gradient Calculation
n_i^m = Σ_{j=1}^{S^{m−1}} w_{i,j}^m a_j^{m−1} + b_i^m

∂n_i^m/∂w_{i,j}^m = a_j^{m−1}

∂n_i^m/∂b_i^m = 1

Sensitivity

s_i^m ≡ ∂F/∂n_i^m

Gradient

∂F/∂w_{i,j}^m = s_i^m a_j^{m−1}

∂F/∂b_i^m = s_i^m

Steepest Descent
w_{i,j}^m(k+1) = w_{i,j}^m(k) − α s_i^m a_j^{m−1}

b_i^m(k+1) = b_i^m(k) − α s_i^m

In matrix form:

W^m(k+1) = W^m(k) − α s^m (a^{m−1})^T

b^m(k+1) = b^m(k) − α s^m,

where  s^m ≡ ∂F/∂n^m = [∂F/∂n_1^m, ∂F/∂n_2^m, …, ∂F/∂n_{S^m}^m]^T.

Next Step: Compute the Sensitivities (Backpropagation)

Backpropagation

Jacobian matrix of the net inputs of layer m+1 with respect to those of layer m:

∂n^{m+1}/∂n^m = [ ∂n_i^{m+1}/∂n_j^m ],  i = 1, …, S^{m+1},  j = 1, …, S^m

Since  n_i^{m+1} = Σ_l w_{i,l}^{m+1} a_l^m + b_i^{m+1},

∂n_i^{m+1}/∂n_j^m = w_{i,j}^{m+1} ∂a_j^m/∂n_j^m = w_{i,j}^{m+1} ḟ^m(n_j^m),

where  ḟ^m(n_j^m) = ∂f^m(n_j^m)/∂n_j^m.

In matrix form:

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m),

where  Ḟ^m(n^m) = diag( ḟ^m(n_1^m), ḟ^m(n_2^m), …, ḟ^m(n_{S^m}^m) ).

Backpropagation

Sensitivity recurrence (chain rule in matrix form):

s^m = ∂F/∂n^m = (∂n^{m+1}/∂n^m)^T ∂F/∂n^{m+1} = Ḟ^m(n^m) (W^{m+1})^T s^{m+1}

The sensitivities are computed by starting at the last layer, and then propagating backwards through the network to the first layer:

s^M → s^{M−1} → … → s^2 → s^1

Output layer calculation

s_i^M = ∂F/∂n_i^M = ∂[(t − a)^T (t − a)]/∂n_i^M = ∂ Σ_j (t_j − a_j)² / ∂n_i^M = −2(t_i − a_i) ∂a_i/∂n_i^M

∂a_i/∂n_i^M = ∂f^M(n_i^M)/∂n_i^M = ḟ^M(n_i^M)

s_i^M = −2(t_i − a_i) ḟ^M(n_i^M)

In matrix form:

s^M = −2 Ḟ^M(n^M)(t − a)
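Putting the forward pass, the sensitivity recursion, and the steepest-descent updates together, here is a minimal sketch for a two-layer network (the 1-2-1 architecture, logistic hidden layer, linear output layer, learning rate, and single training pair are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(n):
    return 1.0 / (1.0 + np.exp(-n))

# 1-2-1 network: logistic hidden layer, linear output layer (illustrative sizes)
W1, b1 = rng.standard_normal((2, 1)), np.zeros((2, 1))
W2, b2 = rng.standard_normal((1, 2)), np.zeros((1, 1))
alpha = 0.1
p, t = np.array([[1.0]]), np.array([[0.5]])   # one training pair, as a smoke test

errors = []
for _ in range(200):
    # forward pass
    a1 = logistic(W1 @ p + b1)
    a2 = W2 @ a1 + b2                      # linear output: f(n) = n
    # backward pass: sensitivities
    s2 = -2 * (t - a2)                     # s^M = -2 Fdot^M (t - a); Fdot = I here
    s1 = (a1 * (1 - a1)) * (W2.T @ s2)     # s^m = Fdot^m (W^{m+1})^T s^{m+1}
    # steepest-descent updates
    W2 -= alpha * s2 @ a1.T;  b2 -= alpha * s2
    W1 -= alpha * s1 @ p.T;   b1 -= alpha * s1
    errors.append(float((t - a2) ** 2))
```

The squared error decreases monotonically towards zero on this single training pair, confirming the sign conventions of the sensitivity equations.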

Gradient-based algorithms for supervised learning


BP with adaptive learning rate

Local estimation of the Lipschitz constant:

Λ_k = ||∇E(w_k) − ∇E(w_{k−1})|| / ||w_k − w_{k−1}||

Update the weights:

w_{k+1} = w_k − (1 / (2Λ_k)) ∇E(w_k)

In steep regions of the error surface, Λ_k is large, and a small value for the learning rate is used in order to guarantee convergence. On the other hand, when the error surface has flat regions, Λ_k is small, and a large learning rate is used to accelerate convergence.
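A sketch of this scheme, using a simple quadratic E(w) = w_1² + w_2² as a stand-in for the error function (two distinct starting points are needed to form the first Lipschitz estimate):

```python
import numpy as np

def adaptive_lr_descent(grad, w_prev, w, steps=50):
    """Gradient descent with learning rate 1/(2*Lambda_k), where Lambda_k is a
    local estimate of the Lipschitz constant of the gradient."""
    w_prev, w = np.asarray(w_prev, float), np.asarray(w, float)
    for _ in range(steps):
        g, g_prev = grad(w), grad(w_prev)
        lam = np.linalg.norm(g - g_prev) / np.linalg.norm(w - w_prev)
        w_prev, w = w, w - (1.0 / (2.0 * lam)) * g
    return w

# stand-in error function E(w) = w1^2 + w2^2, with gradient 2w
w_min = adaptive_lr_descent(lambda w: 2 * w, [1.0, 1.0], [0.8, 0.6])
```

For this quadratic the Lipschitz estimate is exact (Λ_k = 2), so each step halves the distance to the minimiser at the origin.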

Gradient-based algorithms for supervised learning


BP with an adaptive learning rate for each weight

Local estimation of the Lipschitz constant along the ith weight direction:

Λ_k^i = |∂_i E(w_k) − ∂_i E(w_{k−1})| / |w_k^i − w_{k−1}^i|

Update the weights:

w_{k+1} = w_k − η_k diag{1/Λ_k^1, 1/Λ_k^2, …, 1/Λ_k^n} ∇E(w_k)

η_k is usually set to 1.

Gradient-based algorithms for supervised learning


The one-step Jacobi-Newton BP method

Update the weights:

w_{k+1}^i = w_k^i − η_k^i ∂_i E(w_k) / ∂²_{ii} E(w_k),

where η_k^i is usually set to 1.

Gradient-based algorithms for supervised learning


Rprop (1993): helps to eliminate the harmful influence of the magnitude of the derivatives on the weight updates. Basic idea: the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update.

The resilient propagation update rule:

w_{k+1} = w_k − diag{η_k^1, …, η_k^i, …, η_k^n} sign(∇E(w_k))

with the stepsize of each weight adapted from the sign of successive partial derivatives g_m:

if g_m(w_{k−1}) g_m(w_k) > 0  then  η_m^k = min(η_m^{k−1} η⁺, η_max)
if g_m(w_{k−1}) g_m(w_k) < 0  then  η_m^k = max(η_m^{k−1} η⁻, η_min)
if g_m(w_{k−1}) g_m(w_k) = 0  then  η_m^k = η_m^{k−1}

Rprop

for each weight w_k do {
    if g_k · g_{k−1} > 0 then { η_k = min(η_{k−1} η⁺, η_max);  Δw_k = −sign(g_k) η_k; }
    elseif g_k · g_{k−1} < 0 then { η_k = max(η_{k−1} η⁻, η_min);  Δw_k = −Δw_{k−1};  g_k = 0; }
    elseif g_k · g_{k−1} = 0 then { η_k = η_{k−1};  Δw_k = −sign(g_k) η_k; }
    w_{k+1} = w_k + Δw_k;
}
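A sketch of one Rprop iteration with sign-change backtracking (η⁺ = 1.2, η⁻ = 0.5 and the stepsize bounds are values commonly used with Rprop; the driver minimising f(w) = w² is purely illustrative):

```python
import numpy as np

def rprop_step(w, g, g_prev, step, dw_prev,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One Rprop iteration: only sign(g) sets the update direction; the
    per-weight stepsize grows on a repeated sign and shrinks on a sign change."""
    step, g = step.copy(), g.copy()
    same = g * g_prev > 0
    flip = g * g_prev < 0
    step[same] = np.minimum(step[same] * eta_plus, step_max)
    step[flip] = np.maximum(step[flip] * eta_minus, step_min)
    dw = -np.sign(g) * step
    dw[flip] = -dw_prev[flip]        # backtrack: undo the previous update
    g[flip] = 0.0                    # forces the '= 0' branch next iteration
    return w + dw, g, step, dw

# illustrative driver: minimise f(w) = w^2, gradient g(w) = 2w
w = np.array([5.0])
g_prev, step, dw_prev = np.zeros(1), np.full(1, 0.1), np.zeros(1)
for _ in range(100):
    w, g_prev, step, dw_prev = rprop_step(w, 2 * w, g_prev, step, dw_prev)
```

The iterate first accelerates towards the minimum (stepsize grows geometrically), then oscillates around it with shrinking steps after each sign change.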

Gradient-based algorithms for supervised learning


Conjugate Gradient

Choose the initial search direction as the negative of the gradient:

p_0 = −g_0

Select the learning rate α_k to minimise F along the line (exact for quadratic functions):

α_k = −( ∇F(x)^T|_{x=x_k} p_k ) / ( p_k^T ∇²F(x)|_{x=x_k} p_k ) = −(g_k^T p_k) / (p_k^T A_k p_k)

Choose subsequent search directions to be conjugate:

p_k = −g_k + β_k p_{k−1},

where

β_k = (Δg_{k−1}^T g_k) / (Δg_{k−1}^T p_{k−1})   (Hestenes-Stiefel)

or

β_k = (g_k^T g_k) / (g_{k−1}^T g_{k−1})   (Fletcher-Reeves)

or

β_k = (Δg_{k−1}^T g_k) / (g_{k−1}^T g_{k−1})   (Polak-Ribiere),

i.e.  β_k^{PR} = ∇E(w_k)^T [∇E(w_k) − ∇E(w_{k−1})] / ||∇E(w_{k−1})||²

If the algorithm has not converged, return to the line-minimisation step. A quadratic function will be minimised in at most n steps.
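A sketch of the procedure for a quadratic F(x) = ½ xᵀAx − bᵀx, with exact line search and the Fletcher-Reeves β (the matrix A and starting point below come from the earlier Hessian example; b = [−1, 0]ᵀ makes F the same function):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    """Minimise F(x) = 0.5*x^T A x - b^T x (A symmetric positive definite)
    using Fletcher-Reeves conjugate gradient with exact line search."""
    x = np.asarray(x0, float)
    g = A @ x - b                          # gradient of F
    p = -g                                 # p_0 = -g_0
    for _ in range(n_steps):
        alpha = -(g @ p) / (p @ A @ p)     # exact minimiser along the line
        x = x + alpha * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[2.0, 2.0], [2.0, 4.0]])     # Hessian of the earlier example
b = np.array([-1.0, 0.0])
x_min = conjugate_gradient(A, b, [0.5, 0.5], n_steps=2)   # n = 2 steps suffice
```

As the theory promises, the two-dimensional quadratic is minimised exactly in n = 2 steps, landing on [−1, 0.5]ᵀ.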

Conjugate Gradient vs Steepest Descent

[Contour plots comparing the trajectories of conjugate gradient and steepest descent on a quadratic function.]

Gradient-based algorithms for supervised learning


Newton method:

d_k = −B_k^{−1} ∇E(w_k),

where B_k is a symmetric nonsingular matrix (i.e. it has an inverse) approximating the Hessian ∇²E(w_k); e.g. the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

B(w_{k+1}) = B(w_k) + (y_k y_k^T)/(y_k^T s_k) − (B(w_k) s_k s_k^T B(w_k))/(s_k^T B(w_k) s_k),

where  y_k = ∇E(w_{k+1}) − ∇E(w_k)  and  s_k = w_{k+1} − w_k.
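A sketch of a single BFGS update; for a quadratic error, y_k = ∇²E · s_k exactly, and the updated matrix satisfies the secant condition B_{k+1} s_k = y_k:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B, where
    s = w_{k+1} - w_k and y = grad E(w_{k+1}) - grad E(w_k)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# quadratic example: true Hessian H, so y = H @ s exactly
H = np.array([[2.0, 2.0], [2.0, 4.0]])
s = np.array([1.0, 0.3])
y = H @ s
B1 = bfgs_update(np.eye(2), s, y)
```

The update preserves symmetry and forces the new approximation to reproduce the observed gradient change along s.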

Monotone convergence

Convergence condition: the search direction d_k must be a descent direction:

d_k^T ∇E(w_k) < 0

Monotone condition:

f(w_k + α_k d_k) ≤ f(w_k) + α_k g_k^T d_k

It can be shown that if the learning rate satisfies the monotone condition, then any algorithm of the form w_{k+1} = w_k + α_k d_k converges to a local minimiser.
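A backtracking sketch that enforces a sufficient-decrease version of the monotone condition (the factor σ ∈ (0,1) is a standard addition not shown on the slide; ρ is the backtracking factor, and the quadratic test function is a stand-in):

```python
import numpy as np

def backtracking_line_search(f, grad_f, w, d, alpha0=1.0, rho=0.5, sigma=1e-4):
    """Shrink alpha until f(w + alpha*d) <= f(w) + sigma*alpha*g^T d.
    d must be a descent direction: d^T grad_f(w) < 0."""
    g_dot_d = grad_f(w) @ d
    assert g_dot_d < 0, "d is not a descent direction"
    alpha = alpha0
    while f(w + alpha * d) > f(w) + sigma * alpha * g_dot_d:
        alpha *= rho                # backtrack
    return alpha

f = lambda w: w @ w                 # stand-in error function
grad_f = lambda w: 2 * w
w0 = np.array([1.0, 1.0])
alpha = backtracking_line_search(f, grad_f, w0, -grad_f(w0))
```

Starting from α = 1 (which overshoots on this quadratic), one backtracking step yields an accepted α that strictly decreases f.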

Hybrid algorithms for supervised learning


Noise diffusion:

w_{k+1} = w_k − η ∇E(w_k) + c(k) N(k),

where η is the stepsize and c(k) is the noise magnitude, which is reduced with k; e.g. c(k) = c_0 e^{−λk}, where c_0 ≥ 0 controls the magnitude and λ > 0 is the damping factor.

N = (N_1, N_2, …, N_q) are independent noise sources.
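A sketch of the noise-diffusion update on a stand-in quadratic error (η, c_0, λ, and the Gaussian noise source are illustrative choices):

```python
import numpy as np

def noisy_gradient_descent(grad, w0, eta=0.1, c0=0.5, lam=0.05, steps=200, seed=0):
    """Gradient descent with additive noise of exponentially decaying
    magnitude c(k) = c0 * exp(-lam * k)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, float)
    for k in range(steps):
        noise = c0 * np.exp(-lam * k) * rng.standard_normal(w.shape)
        w = w - eta * grad(w) + noise
    return w

# stand-in error E(w) = w1^2 + w2^2; early noise lets the iterate explore,
# and the decaying magnitude lets it settle into the minimum
w_end = noisy_gradient_descent(lambda w: 2 * w, [2.0, -2.0])
```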

Hybrid algorithms for supervised learning


Langevin noise:

w_{k+1} = w_k − η ∇E(w_k) + c(k)^T N(k),

where c(k) is a vector with components that define a different noise magnitude for each parameter.

Hybrid algorithms for supervised learning


The Langevin noise technique (power law): it uses the derivative, i.e.

x_{k+1} = x_k − η ∇f(x_k) + c ξ k^{−T},

where c is a constant that controls the noise intensity, ξ is drawn from the interval [−0.5, +0.5], and T is the noise reduction rate.
Hybrid algorithms for supervised learning


SARPROP: combines gradient descent with simulated annealing, injecting noise when there is a change in the sign of the gradient (the second Rprop condition).

Simulated annealing (Kirkpatrick et al. 1983; Corana et al. 1987):

x_{k+1} = x_k + Δx,

where Δx is random noise from a uniform distribution. The effectiveness depends on the parameter T, called the temperature, which controls the noise reduction rate:

T(k) = T_0 / (1 + ln k),  k = 1, 2, …

Hybrid algorithms for supervised learning


The Metropolis move (Metropolis et al. 1953):

P(x_k → x_{k+1}) = e^{−Δf/T}  if  Δf = f(x_{k+1}) − f(x_k) > 0,  and 1 otherwise.

The new point x_{k+1} is either accepted or not, depending on the sign of Δf = f(x_{k+1}) − f(x_k). If the sign is negative, then the new point is accepted with probability 1. Otherwise, acceptance depends on the probability value exceeding a threshold value θ:

accept if  P(x_k → x_{k+1}) = exp(−Δf / T) > θ,  θ ∈ (0, 1)
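A sketch of the acceptance rule (the uniform random draw plays the role of the threshold θ):

```python
import math
import random

def metropolis_accept(f_old, f_new, T, rng=random):
    """Metropolis rule: always accept a decrease in f; accept an increase
    with probability exp(-(f_new - f_old) / T)."""
    df = f_new - f_old
    if df < 0:
        return True                            # downhill: accept with probability 1
    return rng.random() < math.exp(-df / T)    # uphill: accept if probability > threshold
```

At high temperature T almost every move is accepted; as T → 0 the rule reduces to pure descent.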

Summary
Formulated supervised learning as a minimisation problem
Presented first-order and second-order algorithms for supervised learning
Presented hybrid approaches that equip gradient-based algorithms with noise injection schemes

Next lecture
Supervised learning part 2: nonmonotone learning
