Contents
- The feedforward neural network model
- Supervised learning
- Formulation as a minimisation problem
- Gradient-based algorithms for supervised learning
- Hybrid algorithms for supervised learning
- Summary
Supervised Learning

In supervised learning the correct outputs are known: the network is forced into producing the correct outputs, and the weights are updated to reduce the error.
[Figure: the network output y is compared with the desired output d supplied by an external teacher signal; the error E = y - d drives the updates of the weights w_ij.]
The response that the network is required to learn is presented to the network during training: the desired response acts as an explicit teacher signal. Examples: the perceptron rule; the gradient descent-backpropagation rule. During training, the difference between the current output of the network and the desired output is used to update the weights until the desired performance is obtained.
w_{k+1} = w_k + α_k d_k,

where d_k is the search direction and α_k is the stepsize.
Optimality Condition

The first derivative of F(x) with respect to x_i (the i-th element of the gradient vector) must vanish at the optimum:

∂F(x)/∂x_i = 0  at  x = x*.
Example

F(x) = x1² + 2 x1 x2 + 2 x2², with x* = [0.5, 0]^T and direction p = [1, 1]^T.

The gradient is

∇F(x) = [∂F(x)/∂x1; ∂F(x)/∂x2] = [2 x1 + 2 x2; 2 x1 + 4 x2],

so ∇F(x*) = [1; 1].

In general, the Hessian ∇²F(x) is the matrix of second derivatives, with (i, j) element ∂²F(x)/∂x_i ∂x_j:

∇²F(x) = [ ∂²F/∂x1²      ∂²F/∂x1∂x2   …  ∂²F/∂x1∂xn
           ∂²F/∂x2∂x1    ∂²F/∂x2²     …  ∂²F/∂x2∂xn
           …
           ∂²F/∂xn∂x1    ∂²F/∂xn∂x2   …  ∂²F/∂xn²   ]
Example

F(x) = x1² + 2 x1 x2 + 2 x2² + x1

Setting the gradient to zero,

∇F(x) = [2 x1 + 2 x2 + 1; 2 x1 + 4 x2] = 0,

gives the stationary point x* = [-1; 0.5]. The Hessian is

∇²F(x) = [2 2; 2 4].

To test the definiteness, check the eigenvalues of the Hessian: if the eigenvalues are all greater than zero, the Hessian is positive definite. From the characteristic determinant

|∇²F(x) - λI| = |2-λ  2; 2  4-λ| = λ² - 6λ + 4 = (λ - 0.76)(λ - 5.24),

the eigenvalues are λ = 0.76, 5.24, so the Hessian is positive definite and x* is a strong minimum.
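As a quick check of this example, a short NumPy sketch (assuming NumPy is available; the function and matrices are those of the example above) recovers the stationary point and the eigenvalues:

```python
import numpy as np

# F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1; its Hessian is constant
H = np.array([[2.0, 2.0],
              [2.0, 4.0]])

# gradient = H x + [1, 0]; the stationary point solves H x = -[1, 0]
x_star = np.linalg.solve(H, -np.array([1.0, 0.0]))
eigenvalues = np.linalg.eigvalsh(H)

print(np.round(x_star, 2))       # -> [-1.   0.5]
print(np.round(eigenvalues, 2))  # -> [0.76 5.24]
```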
Steepest Descent

Gradient descent: choose the next point x_{k+1} so that

F(x_{k+1}) < F(x_k),

which is achieved by stepping along the negative gradient:

x_{k+1} = x_k - α g_k,  where g_k ≡ ∇F(x)|_{x = x_k}.
Steepest Descent: Example

F(x) = x1² + 2 x1 x2 + 2 x2² + x1,  x_0 = [0.5; 0.5],  α = 0.1 (α is the learning rate).

∇F(x) = [∂F(x)/∂x1; ∂F(x)/∂x2] = [2 x1 + 2 x2 + 1; 2 x1 + 4 x2]

g_0 = ∇F(x)|_{x = x_0} = [3; 3],  x_1 = x_0 - α g_0 = [0.2; 0.2]
g_1 = ∇F(x)|_{x = x_1} = [1.8; 1.2],  x_2 = x_1 - α g_1 = [0.02; 0.08]

[Figure: contour plot of F over x1, x2 showing the steepest descent trajectory.]
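The two iterations above can be reproduced with a few lines of NumPy (a minimal sketch; the function, starting point, and learning rate are those of the example):

```python
import numpy as np

def grad(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1."""
    x1, x2 = x
    return np.array([2*x1 + 2*x2 + 1, 2*x1 + 4*x2])

alpha = 0.1                      # learning rate
x = np.array([0.5, 0.5])         # starting point x0
for k in range(2):
    x = x - alpha * grad(x)
    print(np.round(x, 2))        # -> [0.2 0.2], then [0.02 0.08]
```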
Backpropagation (BP)

w_{k+1} = w_k - α ∇E(w_k)

Chain Rule:

d f(n(w))/dw = (d f(n)/dn) · (d n(w)/dw)

Example: f(n) = cos(n) and n = e^{2w}, so f(n(w)) = cos(e^{2w}) and

d f(n(w))/dw = -sin(e^{2w}) · 2 e^{2w}.
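The chain-rule example can be verified numerically (a small sketch; the test point w = 0.3 is an arbitrary choice):

```python
import math

def f_of_w(w):
    """Composite function f(n(w)) = cos(e^{2w})."""
    return math.cos(math.exp(2 * w))

def df_dw(w):
    """Chain rule: df/dw = -sin(e^{2w}) * 2 e^{2w}."""
    n = math.exp(2 * w)
    return -math.sin(n) * 2 * n

# compare the analytic derivative with a central finite difference
w, h = 0.3, 1e-6
numeric = (f_of_w(w + h) - f_of_w(w - h)) / (2 * h)
assert abs(numeric - df_dw(w)) < 1e-4
```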
Gradient Calculation

The net input to unit i of layer m is

n_i^m = Σ_{j=1}^{S^{m-1}} w_{i,j}^m a_j^{m-1} + b_i^m,

so

∂n_i^m / ∂w_{i,j}^m = a_j^{m-1}  and  ∂n_i^m / ∂b_i^m = 1.

By the chain rule,

∂F/∂w_{i,j}^m = (∂F/∂n_i^m)(∂n_i^m/∂w_{i,j}^m)  and  ∂F/∂b_i^m = (∂F/∂n_i^m)(∂n_i^m/∂b_i^m).

Sensitivity:

s_i^m ≡ ∂F/∂n_i^m

Gradient:

∂F/∂w_{i,j}^m = s_i^m a_j^{m-1},  ∂F/∂b_i^m = s_i^m
Steepest Descent

w_{i,j}^m(k+1) = w_{i,j}^m(k) - α s_i^m a_j^{m-1}
b_i^m(k+1) = b_i^m(k) - α s_i^m

In matrix form:

W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T
b^m(k+1) = b^m(k) - α s^m

where the sensitivity vector of layer m is

s^m ≡ ∂F/∂n^m = [∂F/∂n_1^m; ∂F/∂n_2^m; …; ∂F/∂n_{S^m}^m].
Backpropagation

The sensitivities are related through the Jacobian matrix ∂n^{m+1}/∂n^m, whose (i, j) element is

∂n_i^{m+1}/∂n_j^m = ∂( Σ_l w_{i,l}^{m+1} a_l^m + b_i^{m+1} )/∂n_j^m = w_{i,j}^{m+1} (∂a_j^m/∂n_j^m) = w_{i,j}^{m+1} ḟ^m(n_j^m),

where ḟ^m(n_j^m) = ∂f^m(n_j^m)/∂n_j^m. In matrix form,

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m),  with  Ḟ^m(n^m) = diag( ḟ^m(n_1^m), ḟ^m(n_2^m), …, ḟ^m(n_{S^m}^m) ).

The recurrence for the sensitivities then follows from the chain rule:

s^m = ∂F/∂n^m = (∂n^{m+1}/∂n^m)^T (∂F/∂n^{m+1}) = Ḟ^m(n^m) (W^{m+1})^T s^{m+1}.
The sensitivities are computed by starting at the last layer, and then propagating backwards through the network to the first layer.
s^M → s^{M-1} → … → s^2 → s^1

The recursion starts at the output layer M. With F = Σ_j (t_j - a_j)²,

s_i^M = ∂F/∂n_i^M = -2 (t_i - a_i) ḟ^M(n_i^M),  i.e.  s^M = -2 Ḟ^M(n^M) (t - a).
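The recursion can be exercised on a tiny 1-2-1 network with a tanh hidden layer and a linear output (an illustrative choice of architecture and data, not prescribed above; assumes NumPy). The gradient obtained from the sensitivities is checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 1)), rng.standard_normal((2, 1))
W2, b2 = rng.standard_normal((1, 2)), rng.standard_normal((1, 1))
p, t = np.array([[0.7]]), np.array([[0.4]])   # input and target (arbitrary)

def forward(W1):
    a1 = np.tanh(W1 @ p + b1)     # tanh hidden layer
    a2 = W2 @ a1 + b2             # linear output layer
    return a1, a2

a1, a2 = forward(W1)

# s^M = -2 F'(n^M)(t - a); the output layer is linear, so f'(n) = 1
s2 = -2.0 * (t - a2)
# s^1 = F'(n^1) (W^2)^T s^2, with tanh'(n) = 1 - a^2
s1 = np.diagflat(1 - a1**2) @ W2.T @ s2
grad_W1 = s1 @ p.T                # dF/dW1 = s^1 (a^0)^T

# finite-difference check on one weight
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
Fp = ((t - forward(Wp)[1])**2).item()
Fm = ((t - forward(Wm)[1])**2).item()
assert abs((Fp - Fm) / (2 * eps) - grad_W1[0, 0]) < 1e-6
```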
Adaptive learning rates

w_{k+1} = w_k - (1/(2 λ_k)) ∇E(w_k)

In steep regions of the error surface λ_k is large, and a small value for the learning rate is used in order to guarantee convergence. On the other hand, when the error surface has flat regions, λ_k is small, and a large learning rate is used to accelerate convergence.

A local estimate can be maintained for each weight:

λ_k^i = |∂_i E(w_k) - ∂_i E(w_{k-1})| / |w_k^i - w_{k-1}^i|,

giving the update

w_{k+1} = w_k - γ_k diag{1/λ_k^1, 1/λ_k^2, …, 1/λ_k^n} ∇E(w_k),

where γ_k is usually set to 1.
Alternatively, each weight can be updated using local second-derivative information:

w_{k+1}^i = w_k^i - η ∂_i E(w_k) / ∂²_{ii} E(w_k),

where η is usually set to 1.
Rprop

For each weight i, with g_k = ∇E(w_k):

if g_k^i · g_{k-1}^i > 0 then { Δ_k^i = min(Δ_{k-1}^i η⁺, Δ_max); Δw_k^i = -sign(g_k^i) Δ_k^i; }
elseif g_k^i · g_{k-1}^i < 0 then { Δ_k^i = max(Δ_{k-1}^i η⁻, Δ_min); Δw_k^i = -Δw_{k-1}^i; g_k^i = 0; }
elseif g_k^i · g_{k-1}^i = 0 then { Δ_k^i = Δ_{k-1}^i; Δw_k^i = -sign(g_k^i) Δ_k^i; }

w_{k+1} = w_k + Δw_k
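A sketch of the simpler variant without weight-backtracking (sometimes called Rprop⁻; the parameter values η⁺ = 1.2 and η⁻ = 0.5 are the usual defaults, and the quadratic test function is an arbitrary choice):

```python
import numpy as np

def rprop_step(w, g, g_prev, delta, eta_plus=1.2, eta_minus=0.5,
               d_max=50.0, d_min=1e-6):
    """One Rprop iteration: per-weight steps adapted from gradient signs."""
    g, delta = g.copy(), delta.copy()
    prod = g * g_prev
    inc, dec = prod > 0, prod < 0
    delta[inc] = np.minimum(delta[inc] * eta_plus, d_max)
    delta[dec] = np.maximum(delta[dec] * eta_minus, d_min)
    g[dec] = 0.0                       # skip the update where the sign flipped
    return w - np.sign(g) * delta, g, delta

grad = lambda w: np.array([2*w[0], 20*w[1]])   # grad of F(w) = w1^2 + 10*w2^2
w, g_prev, delta = np.array([4.0, -3.0]), np.zeros(2), np.full(2, 0.1)
for _ in range(200):
    w, g_prev, delta = rprop_step(w, grad(w), g_prev, delta)
print(w)   # both components end up very close to the minimiser at the origin
```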
Conjugate Gradient

The search direction is d_k = -g_k + β_k d_{k-1}, with g_k = ∇E(w_k) and Δg_{k-1} = g_k - g_{k-1}. Common choices for the scalar β_k are

β_k = Δg_{k-1}^T g_k / Δg_{k-1}^T d_{k-1}   (Hestenes-Stiefel)
or
β_k = g_k^T g_k / g_{k-1}^T g_{k-1}   (Fletcher-Reeves)
or
β_k = Δg_{k-1}^T g_k / g_{k-1}^T g_{k-1}   (Polak-Ribière).
If the algorithm has not converged, return to the second step. A quadratic function will be minimised in n steps.
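The n-step property can be demonstrated on the 2-D quadratic used in the earlier examples (a sketch assuming NumPy; the exact line search α = gᵀg / dᵀAd is valid only for quadratics):

```python
import numpy as np

# minimise F(x) = 0.5 x^T A x - b^T x, i.e. x1^2 + 2*x1*x2 + 2*x2^2 + x1
A = np.array([[2.0, 2.0], [2.0, 4.0]])
b = np.array([-1.0, 0.0])

x = np.zeros(2)
g = A @ x - b                          # gradient of F
d = -g
for k in range(2):                     # n = 2 steps for a 2-D quadratic
    alpha = (g @ g) / (d @ A @ d)      # exact line search along d
    x = x + alpha * d
    g_new = A @ x - b
    beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
    d = -g_new + beta * d
    g = g_new
print(x, np.linalg.norm(g))            # -> [-1.   0.5] 0.0
```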
[Figure: contour plots over x1, x2 comparing the conjugate gradient trajectory with the zig-zag steepest descent trajectory on a quadratic function.]
Quasi-Newton methods use the search direction

d_k = -B_k^{-1} ∇E(w_k),

where B_k is an approximation to the Hessian ∇²E(w_k), updated at each iteration so that it satisfies the secant condition B_{k+1} s_k = y_k, with

y_k = ∇E(w_{k+1}) - ∇E(w_k),  s_k = w_{k+1} - w_k.
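As an illustration, the BFGS update (one common way of building B_k, shown here as a sketch; the points w0, w1 and the quadratic are taken from the earlier steepest descent example) satisfies the secant condition by construction:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# gradient of the quadratic F(w) = w1^2 + 2*w1*w2 + 2*w2^2 + w1
grad = lambda w: np.array([2*w[0] + 2*w[1] + 1, 2*w[0] + 4*w[1]])

w0, w1 = np.array([0.5, 0.5]), np.array([0.2, 0.2])
s = w1 - w0                      # s_k = w_{k+1} - w_k
y = grad(w1) - grad(w0)          # y_k = grad E(w_{k+1}) - grad E(w_k)
B_new = bfgs_update(np.eye(2), s, y)
print(B_new @ s, y)              # the secant condition B_new s = y holds
```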
Monotone convergence

Convergence requires that the search direction d_k is a descent direction:

d_k^T ∇E(w_k) < 0.

Monotone condition:

f(w_k + α_k d_k) ≤ f(w_k) + α_k g_k^T d_k

It can be shown that if the learning rate satisfies the monotone condition then any algorithm of the form w_{k+1} = w_k + α_k d_k converges to a local minimiser.
Noise injection

w_{k+1} = w_k - α ∇E(w_k) + c(k) N(k),

where α > 0 is the stepsize, N(k) = (N_1(k), N_2(k), …, N_q(k))^T is a noise vector, and c(k) is a vector with components that define a different noise magnitude (damping factor) for each parameter.
A scalar version of the scheme is

x_{k+1} = x_k - α ∇f(x_k) + c 2^{-Tk} θ_k,

where c is a constant that controls the noise intensity, θ_k is random noise in the interval [-0.5, +0.5], and T controls the noise reduction rate.
Simulated annealing uses the update

x_{k+1} = x_k + Δx,

where Δx is random noise from a uniform distribution. The effectiveness depends on the parameter T, called the temperature, which controls the noise reduction rate:

T(k) = T_0 / (1 + ln k),  k = 1, 2, …

A candidate point is accepted with probability

P(x_k → x_{k+1}) = e^{-Δf / T},  Δf = f(x_{k+1}) - f(x_k).

If the sign of Δf is negative, the new point is accepted with probability 1. Otherwise, acceptance depends on the probability value and the threshold value.
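A minimal simulated annealing sketch (the one-dimensional test function, the cooling constant T0 = 2, and the step range ±0.5 are illustrative choices):

```python
import math
import random

random.seed(1)
f = lambda x: x**4 - 3*x**2 + x    # two wells; the global minimum is near x = -1.3

x, fx = 4.0, f(4.0)
best_x, best_f = x, fx
T0 = 2.0
for k in range(1, 5001):
    T = T0 / (1 + math.log(k))     # logarithmic cooling schedule from above
    x_new = x + random.uniform(-0.5, 0.5)
    f_new = f(x_new)
    df = f_new - fx
    # accept downhill moves always, uphill moves with probability e^{-df/T}
    if df < 0 or random.random() < math.exp(-df / T):
        x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
print(best_x, best_f)              # the best point found lies in the global well
```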
Summary

- Formulated supervised learning as a minimisation problem
- Presented first-order and second-order algorithms for supervised learning
- Presented hybrid approaches that equip gradient-based algorithms with noise injection schemes