THIRD EDITION
Neural Networks and Learning Machines
Simon Haykin and Yanbo Xue
McMaster University, Canada

CHAPTER 1
Rosenblatt's Perceptron
Problem 1.1
(1) If $\mathbf{w}^T(n)\mathbf{x}(n) > 0$, then $y(n) = +1$.
If also $\mathbf{x}(n)$ belongs to $C_1$, then $d(n) = +1$.
Under these conditions, the error signal is
$$e(n) = d(n) - y(n) = 0$$
and from Eq. (1.22) of the text:
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta e(n)\mathbf{x}(n) = \mathbf{w}(n)$$
This result is the same as line 1 of Eq. (1.5) of the text.

(2) If $\mathbf{w}^T(n)\mathbf{x}(n) < 0$, then $y(n) = -1$.
If also $\mathbf{x}(n)$ belongs to $C_2$, then $d(n) = -1$.
Under these conditions, the error signal $e(n)$ remains zero, and so from Eq. (1.22) we have
$$\mathbf{w}(n+1) = \mathbf{w}(n)$$
This result is the same as line 2 of Eq. (1.5).

(3) If $\mathbf{w}^T(n)\mathbf{x}(n) > 0$ and $\mathbf{x}(n)$ belongs to $C_2$, we have
$$y(n) = +1, \qquad d(n) = -1$$
The error signal $e(n)$ is $-2$, and so Eq. (1.22) yields
$$\mathbf{w}(n+1) = \mathbf{w}(n) - 2\eta\mathbf{x}(n)$$
which has the same form as the first line of Eq. (1.6), except for the scaling factor 2.

(4) Finally, if $\mathbf{w}^T(n)\mathbf{x}(n) < 0$ and $\mathbf{x}(n)$ belongs to $C_1$, then
$$y(n) = -1, \qquad d(n) = +1$$
In this case, the use of Eq. (1.22) yields
$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\eta\mathbf{x}(n)$$
which has the same mathematical form as line 2 of Eq. (1.6), except for the scaling factor 2.
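The four cases of Problem 1.1 can be checked numerically. The sketch below is only an illustration: the weight vector, the two test inputs, the learning rate `eta = 1.0`, and the choice of returning -1 when the induced local field is exactly zero are all assumptions made for the example, not values from the text.

```python
import numpy as np

def hard_limiter(v):
    # Signum-style output; the value at v == 0 is an arbitrary choice here.
    return 1 if v > 0 else -1

def perceptron_step(w, x, d, eta=1.0):
    """One error-correction update w(n+1) = w(n) + eta * e(n) * x(n)."""
    y = hard_limiter(w @ x)
    e = d - y
    return w + eta * e * x, e

w = np.array([0.5, -1.0])          # hypothetical current weight vector
x_pos = np.array([2.0, 0.0])       # w.x = +1 > 0, so y = +1
x_neg = np.array([0.0, 2.0])       # w.x = -2 < 0, so y = -1

# (input, desired response) pairs for cases (1) to (4) of the solution
cases = [(x_pos, +1), (x_neg, -1), (x_pos, -1), (x_neg, +1)]
results = [perceptron_step(w, x, d) for x, d in cases]

for k, (w_new, e) in enumerate(results, start=1):
    print(f"case {k}: e = {e:+d}, w(n+1) = {w_new}")
```

Cases (1) and (2) leave the weights unchanged, while cases (3) and (4) move them by `-2*eta*x` and `+2*eta*x`, respectively.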
Problem 1.2
The output signal is defined by
$$y = \tanh\left(\frac{v}{2}\right) = \tanh\left(\frac{b}{2} + \frac{1}{2}\sum_i w_i x_i\right)$$
Equivalently, we may write
$$b + \sum_i w_i x_i = y' \qquad (1)$$
where
$$y' = 2\tanh^{-1}(y)$$
Equation (1) is the equation of a hyperplane.

Problem 1.3
(a) AND operation: Truth Table 1

Truth Table 1
   Inputs     Output
   x1   x2      y
    1    1      1
    0    1      0
    1    0      0
    0    0      0

This operation may be realized using the perceptron of Fig. 1.

Figure 1: Problem 1.3 (inputs $x_1$, $x_2$ with weights $w_1 = 1$, $w_2 = 1$, a fixed input +1 with bias $b = -1.5$, followed by a hard limiter)

The hard limiter input is
$$v = w_1 x_1 + w_2 x_2 + b = x_1 + x_2 - 1.5$$
If $x_1 = x_2 = 1$, then $v = 0.5$, and $y = 1$
If $x_1 = 0$, and $x_2 = 1$, then $v = -0.5$, and $y = 0$
If $x_1 = 1$, and $x_2 = 0$, then $v = -0.5$, and $y = 0$
If $x_1 = x_2 = 0$, then $v = -1.5$, and $y = 0$
These conditions agree with truth table 1.

OR operation: Truth Table 2

Truth Table 2
   Inputs     Output
   x1   x2      y
    1    1      1
    0    1      1
    1    0      1
    0    0      0

The OR operation may be realized using the perceptron of Fig. 2:

Figure 2: Problem 1.3 (inputs $x_1$, $x_2$ with weights $w_1 = 1$, $w_2 = 1$, a fixed input +1 with bias $b = -0.5$, followed by a hard limiter)

In this case, the hard limiter input is
$$v = x_1 + x_2 - 0.5$$
If $x_1 = x_2 = 1$, then $v = 1.5$, and $y = 1$
If $x_1 = 0$, and $x_2 = 1$, then $v = 0.5$, and $y = 1$
If $x_1 = 1$, and $x_2 = 0$, then $v = 0.5$, and $y = 1$
If $x_1 = x_2 = 0$, then $v = -0.5$, and $y = 0$
These conditions agree with truth table 2.

COMPLEMENT operation: Truth Table 3

Truth Table 3
   Input x    Output y
      1          0
      0          1

The COMPLEMENT operation may be realized as in Figure 3:

Figure 3: Problem 1.3 (single input $x$ with weight $w = -1$, bias $b = 0.5$, followed by a hard limiter)

The hard limiter input is
$$v = wx + b = -x + 0.5$$
If $x = 1$, then $v = -0.5$, and $y = 0$
If $x = 0$, then $v = 0.5$, and $y = 1$
These conditions agree with truth table 3.

(b) EXCLUSIVE OR operation: Truth Table 4

Truth Table 4
   Inputs     Output
   x1   x2      y
    1    1      0
    0    1      1
    1    0      1
    0    0      0

This operation is not linearly separable, and so it cannot be solved by the perceptron.

Problem 1.4
The Gaussian classifier consists of a single unit with a single weight and zero bias, determined in accordance with Eqs. (1.37) and (1.38) of the textbook, respectively, as follows:
$$w = \frac{1}{\sigma^2}(\mu_1 - \mu_2) = -20$$
$$b = \frac{1}{2\sigma^2}(\mu_2^2 - \mu_1^2) = 0$$

Problem 1.5
Using the condition
$$\mathbf{C} = \sigma^2\mathbf{I}$$
in Eqs. (1.37) and (1.38) of the textbook, we get the following formulas for the weight vector and bias of the Bayes classifier:
$$\mathbf{w} = \frac{1}{\sigma^2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$b = \frac{1}{2\sigma^2}\left(\|\boldsymbol{\mu}_2\|^2 - \|\boldsymbol{\mu}_1\|^2\right)$$
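The logic-gate perceptrons of Problem 1.3 are small enough to check by enumeration. The following is a minimal sketch; the helper names `hard_limiter` and `perceptron` are introduced here for illustration, and the hard limiter is assumed to output 1 for a positive input and 0 otherwise, which matches the 0/1 truth tables of that problem.

```python
def hard_limiter(v):
    # Assumed 0/1 threshold unit: fires only for a strictly positive input.
    return 1 if v > 0 else 0

def perceptron(weights, bias):
    """Return a function computing hard_limiter(sum_i w_i x_i + b)."""
    def unit(*inputs):
        v = sum(w * x for w, x in zip(weights, inputs)) + bias
        return hard_limiter(v)
    return unit

AND = perceptron([1, 1], -1.5)   # v = x1 + x2 - 1.5
OR = perceptron([1, 1], -0.5)    # v = x1 + x2 - 0.5
NOT = perceptron([-1], 0.5)      # v = -x + 0.5

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = {p: AND(*p) for p in pairs}
or_table = {p: OR(*p) for p in pairs}
not_table = {x: NOT(x) for x in (0, 1)}

print(and_table)
print(or_table)
print(not_table)
```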
CHAPTER 4
Multilayer Perceptrons
Problem 4.1
Assume that each neuron is represented by a McCulloch-Pitts model. Also assume that
$$x_i = \begin{cases} 1 & \text{if the input bit is 1} \\ 0 & \text{if the input bit is 0} \end{cases}$$

Figure 4: Problem 4.1 (neuron 1 receives $x_1$ and $x_2$ with weights +1 and bias $-1.5$; neuron 2 receives $x_1$ and $x_2$ with weights +1, the output of neuron 1 with weight $-2$, and bias $-0.5$; its output is $y_2$)

The induced local field of neuron 1 is
$$v_1 = x_1 + x_2 - 1.5$$
We may thus construct the following table:

   x1     0      0      1      1
   x2     0      1      0      1
   v1   -1.5   -0.5   -0.5    0.5
   y1     0      0      0      1

The induced local field of neuron 2 is
$$v_2 = x_1 + x_2 - 2y_1 - 0.5$$
Accordingly, we may construct the following table:

   x1     0      0      1      1
   x2     0      1      0      1
   y1     0      0      0      1
   v2   -0.5    0.5    0.5   -0.5
   y2     0      1      1      0

From this table we observe that the overall output $y_2$ is 0 if $x_1$ and $x_2$ are both 0 or both 1, and it is 1 if $x_1$ is 0 and $x_2$ is 1 or vice versa. In other words, the network of Fig. P4.1 operates as an EXCLUSIVE OR gate.
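The two tables of Problem 4.1 can be reproduced directly from the induced local fields. This is a small sketch that assumes a McCulloch-Pitts unit outputs 1 when its induced local field is positive and 0 otherwise; the function names are illustrative.

```python
def mcculloch_pitts(v):
    # Threshold unit: 1 for a positive induced local field, else 0.
    return 1 if v > 0 else 0

def xor_network(x1, x2):
    """Two-neuron network of Problem 4.1; returns (v1, y1, v2, y2)."""
    v1 = x1 + x2 - 1.5
    y1 = mcculloch_pitts(v1)
    v2 = x1 + x2 - 2 * y1 - 0.5
    y2 = mcculloch_pitts(v2)
    return v1, y1, v2, y2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
rows = [xor_network(x1, x2) for x1, x2 in inputs]
outputs = [row[3] for row in rows]

for (x1, x2), (v1, y1, v2, y2) in zip(inputs, rows):
    print(f"x1={x1} x2={x2}  v1={v1:+.1f} y1={y1}  v2={v2:+.1f} y2={y2}")
```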
Problem 4.2
Figure 1 shows the evolutions of the free parameters (synaptic weights and biases) of the neural network as the back-propagation learning process progresses. Each epoch corresponds to 100 iterations. From the figure, we see that the network reaches a steady state after about 25 epochs. Each neuron uses a logistic function for its sigmoid nonlinearity. Also the desired response is defined as
$$d = \begin{cases} 0.9 & \text{for symbol (bit) 1} \\ 0.1 & \text{for symbol (bit) 0} \end{cases}$$
Figure 2 shows the final form of the neural network. Note that we have used biases (the negative of thresholds) for the individual neurons.

Figure 1: Problem 4.2, where one epoch = 100 iterations

Figure 2: Problem 4.2 (network with inputs $x_1$, $x_2$, hidden neurons 1 and 2, and output neuron 3; parameter labels b1 = 1.6, b2 = 5.0, b3 = 2.85, w11 = 4.72, w12 = 4.24, w21 = 3.51, w22 = 3.52, w31 = 6.80, w32 = 6.44)

Problem 4.3
If the momentum constant $\alpha$ is negative, Equation (4.43) of the text becomes
$$\Delta w_{ji}(n) = -\eta\sum_{t=0}^{n}\alpha^{n-t}\frac{\partial E(t)}{\partial w_{ji}(t)} = -\eta\sum_{t=0}^{n}(-1)^{n-t}|\alpha|^{n-t}\frac{\partial E(t)}{\partial w_{ji}(t)}$$
Now we find that if the derivative $\partial E/\partial w_{ji}$ has the same algebraic sign on consecutive iterations of the algorithm, the magnitude of the exponentially weighted sum is reduced. The opposite is true when $\partial E/\partial w_{ji}$ alternates its algebraic sign on consecutive iterations. Thus, the effect of the momentum constant $\alpha$ is the same as before, except that the effects are reversed, compared to the case when $\alpha$ is positive.

Problem 4.4
From Eq. (4.43) of the text we have
$$\Delta w_{ji}(n) = -\eta\sum_{t=1}^{n}\alpha^{n-t}\frac{\partial E(t)}{\partial w_{ji}(t)} \qquad (1)$$
For the case of a single weight, the cost function is defined by
$$E = k_1(w - w_0)^2 + k_2$$
Hence, the application of (1) to this case yields
$$\Delta w(n) = -2k_1\eta\sum_{t=1}^{n}\alpha^{n-t}\left(w(t) - w_0\right)$$
In this case, the partial derivative $\partial E(t)/\partial w(t)$ has the same algebraic sign on consecutive iterations. Hence, with $0 < \alpha < 1$ the exponentially weighted adjustment $\Delta w(n)$ to the weight $w$ at time $n$ grows in magnitude. That is, the weight $w$ is adjusted by a large amount. The inclusion of the momentum constant $\alpha$ in the algorithm for computing the optimum weight $w^* = w_0$ tends to accelerate the downhill descent toward this optimum point.
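The qualitative claims of Problems 4.3 and 4.4 can be illustrated on the single-weight quadratic cost. The sketch below uses the recursive form $\Delta w(n) = \alpha\Delta w(n-1) - \eta\,\partial E/\partial w$, which unrolls to the exponentially weighted sum in (1). The constants `k1 = 1`, `w0 = 3`, `eta = 0.05`, the starting weight, and the three values of `alpha` are assumptions picked for the demonstration.

```python
def grad(w, k1=1.0, w0=3.0):
    # dE/dw for E = k1 * (w - w0)**2 + k2
    return 2.0 * k1 * (w - w0)

def run(alpha, eta=0.05, steps=20, w_init=0.0):
    """Gradient descent with momentum; returns the weight trajectory."""
    w, dw = w_init, 0.0
    history = [w]
    for _ in range(steps):
        dw = alpha * dw - eta * grad(w)   # momentum recursion
        w += dw
        history.append(w)
    return history

plain = run(alpha=0.0)          # no momentum
positive = run(alpha=0.5)       # Problem 4.4: 0 < alpha < 1
negative = run(alpha=-0.5)      # Problem 4.3: negative momentum constant

for name, h in [("alpha=0", plain), ("alpha=+0.5", positive), ("alpha=-0.5", negative)]:
    print(f"{name:11s} first steps {h[1] - h[0]:.3f}, {h[2] - h[1]:.3f}; "
          f"|w - w0| after 20 steps = {abs(h[-1] - 3.0):.4f}")
```

With these settings the second adjustment is larger than the first for positive `alpha` and smaller for negative `alpha`, in line with the sign argument of the two solutions.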
Problem 4.5
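Consider Fig. 4.14 of the text, which has an input layer, two hidden layers, and a single output neuron. We note the following:
$$y_1^{(3)} = F(A_1^{(3)}) = F(\mathbf{w}, \mathbf{x})$$
Hence, the derivative of $F(A_1^{(3)})$ with respect to the synaptic weight $w_{1k}^{(3)}$ connecting neuron $k$ in the second hidden layer to the single output neuron is
$$\frac{\partial F(A_1^{(3)})}{\partial w_{1k}^{(3)}} = \frac{\partial F(A_1^{(3)})}{\partial y_1^{(3)}}\,\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}}\,\frac{\partial v_1^{(3)}}{\partial w_{1k}^{(3)}} \qquad (1)$$
where $v_1^{(3)}$ is the activation potential of the output neuron. Next, we note that
$$\frac{\partial F(A_1^{(3)})}{\partial y_1^{(3)}} = 1$$
$$y_1^{(3)} = \varphi(v_1^{(3)})$$
$$v_1^{(3)} = \sum_k w_{1k}^{(3)} y_k^{(2)} \qquad (2)$$
where $y_k^{(2)}$ is the output of neuron $k$ in layer 2. We may thus proceed further and write
$$\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}} = \varphi'(v_1^{(3)}) = \varphi'(A_1^{(3)}) \qquad (3)$$
$$\frac{\partial v_1^{(3)}}{\partial w_{1k}^{(3)}} = y_k^{(2)} = \varphi(A_k^{(2)}) \qquad (4)$$
Thus, combining (1) to (4):
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial w_{1k}^{(3)}} = \frac{\partial F(A_1^{(3)})}{\partial w_{1k}^{(3)}} = \varphi'(A_1^{(3)})\,\varphi(A_k^{(2)})$$

Consider next the derivative of $F(\mathbf{w}, \mathbf{x})$ with respect to $w_{kj}^{(2)}$, the synaptic weight connecting neuron $j$ in layer 1 (i.e., first hidden layer) to neuron $k$ in layer 2 (i.e., second hidden layer):
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial w_{kj}^{(2)}} = \frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial y_1^{(3)}}\,\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}}\,\frac{\partial v_1^{(3)}}{\partial y_k^{(2)}}\,\frac{\partial y_k^{(2)}}{\partial v_k^{(2)}}\,\frac{\partial v_k^{(2)}}{\partial w_{kj}^{(2)}} \qquad (5)$$
where $y_k^{(2)}$ is the output of neuron $k$ in layer 2, and $v_k^{(2)}$ is the activation potential of that neuron. Next we note that
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial y_1^{(3)}} = 1 \qquad (6)$$
$$\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}} = \varphi'(A_1^{(3)}) \qquad (7)$$
$$v_1^{(3)} = \sum_k w_{1k}^{(3)} y_k^{(2)}$$
$$\frac{\partial v_1^{(3)}}{\partial y_k^{(2)}} = w_{1k}^{(3)} \qquad (8)$$
$$y_k^{(2)} = \varphi(v_k^{(2)})$$
$$\frac{\partial y_k^{(2)}}{\partial v_k^{(2)}} = \varphi'(v_k^{(2)}) = \varphi'(A_k^{(2)}) \qquad (9)$$
$$v_k^{(2)} = \sum_j w_{kj}^{(2)} y_j^{(1)}$$
$$\frac{\partial v_k^{(2)}}{\partial w_{kj}^{(2)}} = y_j^{(1)} = \varphi(v_j^{(1)}) = \varphi(A_j^{(1)}) \qquad (10)$$
Substituting (6) to (10) into (5), we get
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial w_{kj}^{(2)}} = \varphi'(A_1^{(3)})\,w_{1k}^{(3)}\,\varphi'(A_k^{(2)})\,\varphi(A_j^{(1)})$$

Finally, we consider the derivative of $F(\mathbf{w}, \mathbf{x})$ with respect to $w_{ji}^{(1)}$, the synaptic weight connecting source node $i$ in the input layer to neuron $j$ in layer 1. We may thus write
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial w_{ji}^{(1)}} = \frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial y_1^{(3)}}\,\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}}\,\frac{\partial v_1^{(3)}}{\partial y_j^{(1)}}\,\frac{\partial y_j^{(1)}}{\partial v_j^{(1)}}\,\frac{\partial v_j^{(1)}}{\partial w_{ji}^{(1)}} \qquad (11)$$
where $y_j^{(1)}$ is the output of neuron $j$ in layer 1, and $v_j^{(1)}$ is the activation potential of that neuron. Next we note that
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial y_1^{(3)}} = 1 \qquad (12)$$
$$\frac{\partial y_1^{(3)}}{\partial v_1^{(3)}} = \varphi'(A_1^{(3)}) \qquad (13)$$
$$v_1^{(3)} = \sum_k w_{1k}^{(3)} y_k^{(2)}$$
$$\frac{\partial v_1^{(3)}}{\partial y_j^{(1)}} = \sum_k w_{1k}^{(3)}\frac{\partial y_k^{(2)}}{\partial y_j^{(1)}} = \sum_k w_{1k}^{(3)}\frac{\partial y_k^{(2)}}{\partial v_k^{(2)}}\,\frac{\partial v_k^{(2)}}{\partial y_j^{(1)}} = \sum_k w_{1k}^{(3)}\varphi'(A_k^{(2)})\frac{\partial v_k^{(2)}}{\partial y_j^{(1)}} \qquad (14)$$
$$\frac{\partial v_k^{(2)}}{\partial y_j^{(1)}} = w_{kj}^{(2)} \qquad (15)$$
$$y_j^{(1)} = \varphi(v_j^{(1)})$$
$$\frac{\partial y_j^{(1)}}{\partial v_j^{(1)}} = \varphi'(v_j^{(1)}) = \varphi'(A_j^{(1)}) \qquad (16)$$
$$v_j^{(1)} = \sum_i w_{ji}^{(1)} x_i$$
$$\frac{\partial v_j^{(1)}}{\partial w_{ji}^{(1)}} = x_i \qquad (17)$$
Substituting (12) to (17) into (11) yields
$$\frac{\partial F(\mathbf{w}, \mathbf{x})}{\partial w_{ji}^{(1)}} = \varphi'(A_1^{(3)})\left(\sum_k w_{1k}^{(3)}\varphi'(A_k^{(2)})\,w_{kj}^{(2)}\right)\varphi'(A_j^{(1)})\,x_i$$

Problem 4.12
According to the conjugate-gradient method, we have
$$\Delta\mathbf{w}(n) = \eta(n)\mathbf{p}(n)$$
$$= \eta(n)\left[-\mathbf{g}(n) + \beta(n-1)\mathbf{p}(n-1)\right]$$
$$\approx -\eta(n)\mathbf{g}(n) + \beta(n-1)\eta(n-1)\mathbf{p}(n-1) \qquad (1)$$
where, in the second term of the last line in (1), we have used $\eta(n-1)$ in place of $\eta(n)$. Define
$$\Delta\mathbf{w}(n-1) = \eta(n-1)\mathbf{p}(n-1)$$
We may then rewrite (1) as
$$\Delta\mathbf{w}(n) \approx -\eta(n)\mathbf{g}(n) + \beta(n-1)\Delta\mathbf{w}(n-1) \qquad (2)$$
On the other hand, according to the generalized delta rule, we have for neuron $j$:
$$\Delta\mathbf{w}_j(n) = \alpha\Delta\mathbf{w}_j(n-1) + \eta\delta_j(n)\mathbf{y}(n) \qquad (3)$$
Comparing (2) and (3), we observe that they have a similar mathematical form:
- The vector $-\mathbf{g}(n)$ in the conjugate-gradient method plays the role of $\delta_j(n)\mathbf{y}(n)$, where $\delta_j(n)$ is the local gradient of neuron $j$ and $\mathbf{y}(n)$ is the vector of inputs for neuron $j$.
- The time-varying parameter $\beta(n-1)$ in the conjugate-gradient method plays the role of the momentum constant $\alpha$ in the generalized delta rule.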
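The three chain-rule results of Problem 4.5 lend themselves to a finite-difference check. The sketch below is an illustration only: it assumes $\varphi = \tanh$ for every neuron, omits biases (as the sums in the solution do), and uses small random weights; the layer sizes and all function names are hypothetical.

```python
import numpy as np

phi = np.tanh

def dphi(v):
    # Derivative of tanh
    return 1.0 - np.tanh(v) ** 2

def forward(W1, W2, w3, x):
    v1 = W1 @ x;  y1 = phi(v1)      # first hidden layer
    v2 = W2 @ y1; y2 = phi(v2)      # second hidden layer
    v3 = w3 @ y2                    # single output neuron
    return v1, y1, v2, y2, v3, phi(v3)

def analytic_gradients(W1, W2, w3, x):
    v1, y1, v2, y2, v3, _ = forward(W1, W2, w3, x)
    g3 = dphi(v3) * y2                                   # dF/dw_1k^(3)
    g2 = dphi(v3) * np.outer(w3 * dphi(v2), y1)          # dF/dw_kj^(2)
    back = (w3 * dphi(v2)) @ W2                          # sum over k
    g1 = dphi(v3) * np.outer(back * dphi(v1), x)         # dF/dw_ji^(1)
    return g1, g2, g3

def numeric_gradient_W1(W1, W2, w3, x, eps=1e-6):
    g = np.zeros_like(W1)
    for j in range(W1.shape[0]):
        for i in range(W1.shape[1]):
            Wp, Wm = W1.copy(), W1.copy()
            Wp[j, i] += eps
            Wm[j, i] -= eps
            g[j, i] = (forward(Wp, W2, w3, x)[-1]
                       - forward(Wm, W2, w3, x)[-1]) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W1 = 0.5 * rng.standard_normal((4, 3))
W2 = 0.5 * rng.standard_normal((5, 4))
w3 = 0.5 * rng.standard_normal(5)
x = rng.standard_normal(3)

g1, g2, g3 = analytic_gradients(W1, W2, w3, x)
g1_numeric = numeric_gradient_W1(W1, W2, w3, x)
print("max |analytic - numeric| for layer 1:", np.abs(g1 - g1_numeric).max())
```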
Problem 4.13
We start with (4.127) in the text:
$$\beta(n) = -\frac{\mathbf{s}^T(n-1)\mathbf{A}\mathbf{r}(n)}{\mathbf{s}^T(n-1)\mathbf{A}\mathbf{s}(n-1)} \qquad (1)$$
The residual $\mathbf{r}(n)$ is governed by the recursion:
$$\mathbf{r}(n) = \mathbf{r}(n-1) - \eta(n-1)\mathbf{A}\mathbf{s}(n-1)$$
Equivalently, we may write
$$-\eta(n-1)\mathbf{A}\mathbf{s}(n-1) = \mathbf{r}(n) - \mathbf{r}(n-1) \qquad (2)$$
Hence multiplying both sides of (2) by $\mathbf{s}^T(n-1)$, we obtain
$$\eta(n-1)\mathbf{s}^T(n-1)\mathbf{A}\mathbf{s}(n-1) = -\mathbf{s}^T(n-1)\left(\mathbf{r}(n) - \mathbf{r}(n-1)\right) = \mathbf{s}^T(n-1)\mathbf{r}(n-1) \qquad (3)$$
where it is noted that (by definition)
$$\mathbf{s}^T(n-1)\mathbf{r}(n) = 0$$
Moreover, multiplying both sides of (2) by $\mathbf{r}^T(n)$, we obtain
$$-\eta(n-1)\mathbf{r}^T(n)\mathbf{A}\mathbf{s}(n-1) = -\eta(n-1)\mathbf{s}^T(n-1)\mathbf{A}\mathbf{r}(n) = \mathbf{r}^T(n)\left(\mathbf{r}(n) - \mathbf{r}(n-1)\right) \qquad (4)$$
where it is noted that $\mathbf{A}^T = \mathbf{A}$. Dividing (4) by (3) and invoking the use of (1):
$$\beta(n) = \frac{\mathbf{r}^T(n)\left(\mathbf{r}(n) - \mathbf{r}(n-1)\right)}{\mathbf{s}^T(n-1)\mathbf{r}(n-1)} \qquad (5)$$
which is the Hestenes-Stiefel formula.
In the linear form of the conjugate-gradient method, we have
$$\mathbf{s}^T(n-1)\mathbf{r}(n-1) = \mathbf{r}^T(n-1)\mathbf{r}(n-1)$$
in which case (5) is modified to
$$\beta(n) = \frac{\mathbf{r}^T(n)\left(\mathbf{r}(n) - \mathbf{r}(n-1)\right)}{\mathbf{r}^T(n-1)\mathbf{r}(n-1)} \qquad (6)$$
which is the Polak-Ribière formula. Moreover, in the linear case we have
$$\mathbf{r}^T(n)\mathbf{r}(n-1) = 0$$
in which case (6) reduces to the Fletcher-Reeves formula:
$$\beta(n) = \frac{\mathbf{r}^T(n)\mathbf{r}(n)}{\mathbf{r}^T(n-1)\mathbf{r}(n-1)}$$

Problem 4.15
In this problem, we explore the operation of a fully connected multilayer perceptron trained with the back-propagation algorithm. The network has a single hidden layer. It is trained to realize the following one-to-one mappings:
(a) Inversion: $f(x) = 1/x$, $1 < x < 100$
(b) Logarithmic computation: $f(x) = \log_{10} x$, $1 < x < 10$
(c) Exponentiation: $f(x) = e^{-x}$, $1 < x < 10$
(d) Sinusoidal computation: $f(x) = \sin x$, $0 \le x \le \pi/2$

(a) $f(x) = 1/x$ for $1 < x < 100$
The network is trained with:
learning-rate parameter $\eta = 0.3$, and
momentum constant $\alpha = 0.7$.
Ten different network configurations were trained to learn this mapping. Each network was trained identically, that is, with the same $\eta$ and $\alpha$, with bias terms, and with 10,000 passes of the training vectors (with one exception noted below). Once each network was trained, the test dataset was applied to compare the performance and accuracy of each configuration. Table 1 summarizes the results obtained:

Table 1
Number of hidden neurons            Average percentage error at the network output
3                                   4.73%
4                                   4.43
5                                   3.59
7                                   1.49
10                                  1.12
15                                  0.93
20                                  0.85
30                                  0.94
100                                 0.9
30 (trained with 100,000 passes)    0.19

The results of Table 1 indicate that even with a small number of hidden neurons, and with a relatively small number of training passes, the network is able to learn the mapping described in (a) quite well.

(b) $f(x) = \log_{10} x$ for $1 < x < 10$
The results of this second experiment are presented in Table 2:

Table 2
Number of hidden neurons            Average percentage error at the network output
2                                   2.55%
3                                   2.09
4                                   0.46
5                                   0.48
7                                   0.85
10                                  0.42
15                                  0.85
20                                  0.96
30                                  1.26
100                                 1.18
30 (trained with 100,000 passes)    0.41

Here again, we see that the network performs well even with a small number of hidden neurons. Interestingly, in this second experiment the network peaked in accuracy with 10 hidden neurons, after which the accuracy of the network to produce the correct output started to decrease.

(c) $f(x) = e^{-x}$ for $1 < x < 10$
The results of this third experiment (using the logistic function as with experiments (a) and (b)), are summarized in Table 3:

Table 3
Number of hidden neurons            Average percentage error at the network output
2                                   244.0%
3                                   185.17
4                                   134.85
5                                   133.67
7                                   141.65
10                                  158.77
15                                  151.91
20                                  144.79
30                                  137.35
100                                 98.09
30 (trained with 100,000 passes)    103.99

These results are unacceptable since the network is unable to generalize when each neuron is driven to its limits.
The experiment with 30 hidden neurons and 100,000 training passes was repeated, but this time the hyperbolic tangent function was used as the nonlinearity. The result obtained this time was an average percentage error of 3.87% at the network output. This last result shows that the hyperbolic tangent function is a better choice than the logistic function as the sigmoid function for realizing the mapping $f(x) = e^{-x}$.

(d) $f(x) = \sin x$ for $0 \le x \le \pi/2$
Finally, the following results were obtained using the logistic function with 10,000 training passes, except for the last configuration:

Table 4
Number of hidden neurons            Average percentage error at the network output
2                                   1.63%
3                                   1.25
4                                   1.18
5                                   1.11
7                                   1.07
10                                  1.01
15                                  1.01
20                                  0.72
30                                  1.21
100                                 3.19
30 (trained with 100,000 passes)    0.4

The results of Table 4 show that the accuracy of the network peaks around 20 neurons, whereafter the accuracy decreases.
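The chain of formulas in Problem 4.13 can be checked on a small quadratic problem, where the conjugate-gradient method is linear and all expressions for $\beta(n)$ should coincide. The following sketch assumes the cost $\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with a random symmetric positive-definite $\mathbf{A}$, the residual $\mathbf{r} = \mathbf{b} - \mathbf{A}\mathbf{x}$, and an exact line search for $\eta$; none of these specifics come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)      # symmetric positive definite
b = rng.standard_normal(5)

x = np.zeros(5)
r = b - A @ x                      # residual r(0)
s = r.copy()                       # first search direction s(0)

betas = []
for n in range(3):
    eta = (s @ r) / (s @ A @ s)            # exact line search along s
    x = x + eta * s
    r_new = r - eta * (A @ s)              # residual recursion
    definition = -(s @ A @ r_new) / (s @ A @ s)             # Eq. (1)
    hestenes_stiefel = r_new @ (r_new - r) / (s @ r)        # Eq. (5)
    polak_ribiere = r_new @ (r_new - r) / (r @ r)           # Eq. (6)
    fletcher_reeves = (r_new @ r_new) / (r @ r)
    betas.append((definition, hestenes_stiefel, polak_ribiere, fletcher_reeves))
    s = r_new + hestenes_stiefel * s       # next conjugate direction
    r = r_new

for n, row in enumerate(betas, start=1):
    print(n, ["%.6f" % value for value in row])
```

For a nonquadratic cost the three formulas generally differ, which is why they are treated as distinct choices in the nonlinear conjugate-gradient method.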
CHAPTER 5
Kernel Methods and Radial-Basis Function Networks
Problem 5.9
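The expected square error is given by
$$J(F) = \frac{1}{2}\sum_{i=1}^{N}\int_{\mathbb{R}^{m_0}}\left(f(\mathbf{x}_i) - F(\mathbf{x}_i, \boldsymbol{\xi})\right)^2 f_{\boldsymbol{\xi}}(\boldsymbol{\xi})\,d\boldsymbol{\xi}$$
where $f_{\boldsymbol{\xi}}(\boldsymbol{\xi})$ is the probability density function of a noise distribution in the input space $\mathbb{R}^{m_0}$. It is reasonable to assume that the noise vector $\boldsymbol{\xi}$ is additive to the input data vector $\mathbf{x}$. Hence, we may define the cost function $J(F)$ as
$$J(F) = \frac{1}{2}\sum_{i=1}^{N}\int_{\mathbb{R}^{m_0}}\left(f(\mathbf{x}_i) - F(\mathbf{x}_i + \boldsymbol{\xi})\right)^2 f_{\boldsymbol{\xi}}(\boldsymbol{\xi})\,d\boldsymbol{\xi} \qquad (1)$$
where (for convenience of presentation) we have interchanged the order of summation and integration, which is permissible because both operations are linear. Let
$$\mathbf{z} = \mathbf{x}_i + \boldsymbol{\xi} \quad \text{or} \quad \boldsymbol{\xi} = \mathbf{z} - \mathbf{x}_i$$
Hence, we may rewrite (1) in the equivalent form:
$$J(F) = \frac{1}{2}\int_{\mathbb{R}^{m_0}}\sum_{i=1}^{N}\left(f(\mathbf{x}_i) - F(\mathbf{z})\right)^2 f_{\boldsymbol{\xi}}(\mathbf{z} - \mathbf{x}_i)\,d\mathbf{z} \qquad (2)$$
Note that the subscript $\boldsymbol{\xi}$ in $f_{\boldsymbol{\xi}}(\cdot)$ merely refers to the name of the noise distribution and is therefore untouched by the change of variables. Differentiating (2) with respect to $F$, setting the result equal to zero, and finally solving for $F(\mathbf{z})$, we get the optimal estimator
$$\hat{F}(\mathbf{z}) = \frac{\sum_{i=1}^{N} f(\mathbf{x}_i)\,f_{\boldsymbol{\xi}}(\mathbf{z} - \mathbf{x}_i)}{\sum_{i=1}^{N} f_{\boldsymbol{\xi}}(\mathbf{z} - \mathbf{x}_i)}$$
This result bears a close resemblance to the Watson-Nadaraya estimator.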
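The optimal estimator of Problem 5.9 is a weighted average of the sample values $f(\mathbf{x}_i)$, with weights given by the noise density evaluated at $\mathbf{z} - \mathbf{x}_i$. The sketch below assumes an isotropic Gaussian noise density of width `sigma` (its normalizing constant cancels in the ratio); the target function, the sample grid, and the function name `optimal_estimator` are illustrative choices, not part of the solution.

```python
import numpy as np

def optimal_estimator(z, X, f_vals, sigma=0.5):
    """F_hat(z) = sum_i f(x_i) k(z - x_i) / sum_i k(z - x_i),
    with k an (unnormalized) isotropic Gaussian noise density."""
    d2 = np.sum((X - z) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.sum(f_vals * k) / np.sum(k)

X = np.linspace(0.0, 1.0, 11).reshape(-1, 1)      # sample inputs x_i
f_vals = np.sin(2.0 * np.pi * X[:, 0])            # sample values f(x_i)

smooth = optimal_estimator(np.array([0.33]), X, f_vals, sigma=0.1)
sharp = optimal_estimator(X[3], X, f_vals, sigma=0.01)
constant = optimal_estimator(np.array([0.7]), X, np.full(11, 2.5))

print("estimate at z = 0.33 (sigma = 0.1):", smooth)
print("estimate at a sample point with narrow noise:", sharp, "vs", f_vals[3])
```

As the noise width shrinks, the estimate at a sample point approaches the sample value itself; for any width it stays within the range of the sample values because the weights are nonnegative and sum to one.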
CHAPTER 6
Support Vector Machines
Problem 6.1
From Eqs. (6.2) in the text we recall that the optimum weight vector $\mathbf{w}_o$ and optimum bias $b_o$ satisfy the following pair of conditions:
$$\mathbf{w}_o^T\mathbf{x}_i + b_o \ge +1 \quad \text{for } d_i = +1$$
$$\mathbf{w}_o^T\mathbf{x}_i + b_o \le -1 \quad \text{for } d_i = -1$$
where $i = 1, 2, \ldots, N$. Equivalently, we may write
$$\min_{i = 1, 2, \ldots, N}\left|\mathbf{w}^T\mathbf{x}_i + b\right| = 1$$
as the defining condition for the pair $(\mathbf{w}_o, b_o)$.

Problem 6.2
In the context of a support vector machine, we note the following:
1. Misclassification of patterns can only arise if the patterns are nonseparable.
2. If the patterns are nonseparable, it is possible for a pattern to lie inside the margin of separation and yet be on the correct side of the decision boundary. Hence, nonseparability does not necessarily mean misclassification.

Problem 6.3
We start with the primal problem formulated as follows (see Eq. (6.15) of the text):
$$J(\mathbf{w}, b, \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N}\alpha_i d_i\mathbf{w}^T\mathbf{x}_i - b\sum_{i=1}^{N}\alpha_i d_i + \sum_{i=1}^{N}\alpha_i \qquad (1)$$
Recall from (6.12) in the text that
$$\mathbf{w} = \sum_{i=1}^{N}\alpha_i d_i\mathbf{x}_i$$
Premultiplying $\mathbf{w}$ by $\mathbf{w}^T$:
$$\mathbf{w}^T\mathbf{w} = \sum_{i=1}^{N}\alpha_i d_i\mathbf{w}^T\mathbf{x}_i \qquad (2)$$
We may also write
$$\mathbf{w}^T = \sum_{i=1}^{N}\alpha_i d_i\mathbf{x}_i^T$$
Accordingly, we may redefine the inner product $\mathbf{w}^T\mathbf{w}$ as the double summation:
$$\mathbf{w}^T\mathbf{w} = \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i d_i\alpha_j d_j\mathbf{x}_j^T\mathbf{x}_i \qquad (3)$$
Thus substituting (2) and (3) into (1) yields
$$Q(\alpha) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i d_i\alpha_j d_j\mathbf{x}_j^T\mathbf{x}_i + \sum_{i=1}^{N}\alpha_i \qquad (4)$$
subject to the constraint
$$\sum_{i=1}^{N}\alpha_i d_i = 0$$
Recognizing that $\alpha_i \ge 0$ for all $i$, we see that (4) is the formulation of the dual problem.

Problem 6.4
Consider a support vector machine designed for nonseparable patterns. Assuming the use of the leave-one-out method for training the machine, the following situations may arise when the example left out is used as a test example:
1. The example is a support vector.
Result: Correct classification.
2. The example lies inside the margin of separation but on the correct side of the decision boundary.
Result: Correct classification.
3. The example lies inside the margin of separation but on the wrong side of the decision boundary.
Result: Incorrect classification.
Problem 6.5
By definition, a support vector machine is designed to maximize the margin of separation between the examples drawn from different classes. This definition applies to all sources of data, be they noisy or otherwise. It follows therefore that by the very nature of it, the support vector machine is robust to the presence of additive noise in the data used for training and testing, provided that all the data are drawn from the same population.
Problem 6.6
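Since the Gram matrix $\mathbf{K} = \{k(\mathbf{x}_i, \mathbf{x}_j)\}$ is a square matrix, it can be diagonalized using the similarity transformation:
$$\mathbf{K} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$$
where $\boldsymbol{\Lambda}$ is a diagonal matrix consisting of the eigenvalues of $\mathbf{K}$ and $\mathbf{Q}$ is an orthogonal matrix whose columns are the associated eigenvectors. With $\mathbf{K}$ being a positive matrix, $\boldsymbol{\Lambda}$ has nonnegative entries. The inner-product (i.e., Mercer) kernel $k(\mathbf{x}_i, \mathbf{x}_j)$ is the $ij$th element of matrix $\mathbf{K}$. Hence,
$$k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T)_{ij} = \sum_{l=1}^{m_1}(\mathbf{Q})_{il}(\boldsymbol{\Lambda})_{ll}(\mathbf{Q}^T)_{lj} = \sum_{l=1}^{m_1}(\mathbf{Q})_{il}(\boldsymbol{\Lambda})_{ll}(\mathbf{Q})_{jl} \qquad (1)$$
Let $\mathbf{u}_i$ denote the $i$th row of matrix $\mathbf{Q}$. (Note that $\mathbf{u}_i$ is not an eigenvector.) We may then rewrite (1) as the inner product
$$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{u}_i^T\boldsymbol{\Lambda}\mathbf{u}_j = (\boldsymbol{\Lambda}^{1/2}\mathbf{u}_i)^T(\boldsymbol{\Lambda}^{1/2}\mathbf{u}_j) \qquad (2)$$
where $\boldsymbol{\Lambda}^{1/2}$ is the square root of $\boldsymbol{\Lambda}$.
By definition, we have
$$k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\varphi}^T(\mathbf{x}_i)\boldsymbol{\varphi}(\mathbf{x}_j) \qquad (3)$$
Comparing (2) and (3), we deduce that the mapping from the input space to the hidden (feature) space of a support vector machine is described by
$$\boldsymbol{\varphi}: \mathbf{x}_i \rightarrow \boldsymbol{\Lambda}^{1/2}\mathbf{u}_i$$

Problem 6.7
(a) From the solution to Problem 6.6, we have
$$\boldsymbol{\varphi}: \mathbf{x}_i \rightarrow \boldsymbol{\Lambda}^{1/2}\mathbf{u}_i$$
Suppose the input vector $\mathbf{x}_i$ is multiplied by the orthogonal (unitary) matrix $\mathbf{Q}$. We then have a new mapping described by
$$\boldsymbol{\varphi}': \mathbf{Q}\mathbf{x}_i \rightarrow \mathbf{Q}\boldsymbol{\Lambda}^{1/2}\mathbf{u}_i$$
Correspondingly, we may write
$$k(\mathbf{Q}\mathbf{x}_i, \mathbf{Q}\mathbf{x}_j) = (\mathbf{Q}\boldsymbol{\Lambda}^{1/2}\mathbf{u}_i)^T(\mathbf{Q}\boldsymbol{\Lambda}^{1/2}\mathbf{u}_j) = (\boldsymbol{\Lambda}^{1/2}\mathbf{u}_i)^T\mathbf{Q}^T\mathbf{Q}(\boldsymbol{\Lambda}^{1/2}\mathbf{u}_j) \qquad (1)$$
where $\mathbf{u}_i$ is the $i$th row of $\mathbf{Q}$. From the definition of an orthogonal (unitary) matrix:
$$\mathbf{Q}^{-1} = \mathbf{Q}^T$$
or equivalently
$$\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$$
where $\mathbf{I}$ is the identity matrix. Hence, (1) reduces to
$$k(\mathbf{Q}\mathbf{x}_i, \mathbf{Q}\mathbf{x}_j) = (\boldsymbol{\Lambda}^{1/2}\mathbf{u}_i)^T(\boldsymbol{\Lambda}^{1/2}\mathbf{u}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$$
In words, the Mercer kernel exhibits the unitary invariance property.

(b) Consider first the polynomial machine described by
$$k(\mathbf{Q}\mathbf{x}_i, \mathbf{Q}\mathbf{x}_j) = \left((\mathbf{Q}\mathbf{x}_i)^T(\mathbf{Q}\mathbf{x}_j) + 1\right)^p = \left(\mathbf{x}_i^T\mathbf{Q}^T\mathbf{Q}\mathbf{x}_j + 1\right)^p = \left(\mathbf{x}_i^T\mathbf{x}_j + 1\right)^p = k(\mathbf{x}_i, \mathbf{x}_j)$$
Consider next the RBF network described by the Mercer kernel:
$$k(\mathbf{Q}\mathbf{x}_i, \mathbf{Q}\mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{Q}\mathbf{x}_i - \mathbf{Q}\mathbf{x}_j\|^2\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}(\mathbf{Q}\mathbf{x}_i - \mathbf{Q}\mathbf{x}_j)^T(\mathbf{Q}\mathbf{x}_i - \mathbf{Q}\mathbf{x}_j)\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}(\mathbf{x}_i - \mathbf{x}_j)^T\mathbf{Q}^T\mathbf{Q}(\mathbf{x}_i - \mathbf{x}_j)\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}(\mathbf{x}_i - \mathbf{x}_j)^T(\mathbf{x}_i - \mathbf{x}_j)\right), \qquad \mathbf{Q}^T\mathbf{Q} = \mathbf{I}$$
$$= k(\mathbf{x}_i, \mathbf{x}_j)$$
Finally, consider the multilayer perceptron described by
$$k(\mathbf{Q}\mathbf{x}_i, \mathbf{Q}\mathbf{x}_j) = \tanh\left(\beta_0(\mathbf{Q}\mathbf{x}_i)^T(\mathbf{Q}\mathbf{x}_j) + \beta_1\right) = \tanh\left(\beta_0\mathbf{x}_i^T\mathbf{Q}^T\mathbf{Q}\mathbf{x}_j + \beta_1\right) = \tanh\left(\beta_0\mathbf{x}_i^T\mathbf{x}_j + \beta_1\right) = k(\mathbf{x}_i, \mathbf{x}_j)$$
Thus all three types of the support vector machine, namely, the polynomial machine, RBF network, and MLP, satisfy the unitary invariance property in their own individual ways.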
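Both results of Problems 6.6 and 6.7 are easy to confirm numerically. The sketch below is a hedged illustration: the orthogonal matrix is obtained from a QR factorization of a random matrix, the kernel parameters (`p`, `sigma`, `beta0`, `beta1`) are arbitrary assumptions, and the variable `U` plays the role of the eigenvector matrix called $\mathbf{Q}$ in Problem 6.6.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly(a, b, p=3):
    return (a @ b + 1.0) ** p

def rbf(a, b, sigma=1.5):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def mlp(a, b, beta0=0.5, beta1=-1.0):
    return np.tanh(beta0 * (a @ b) + beta1)

# Problem 6.7: k(Q x_i, Q x_j) = k(x_i, x_j) for an orthogonal Q
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
xi, xj = rng.standard_normal(4), rng.standard_normal(4)
invariant = {
    name: bool(np.isclose(k(Q @ xi, Q @ xj), k(xi, xj)))
    for name, k in [("polynomial", poly), ("rbf", rbf), ("mlp", mlp)]
}
print(invariant)

# Problem 6.6: feature vectors Lambda^{1/2} u_i reproduce the Gram matrix
X = rng.standard_normal((6, 4))
K = np.array([[rbf(a, b) for b in X] for a in X])
lam, U = np.linalg.eigh(K)                      # K = U diag(lam) U^T
Phi = U * np.sqrt(np.clip(lam, 0.0, None))      # row i is Lambda^{1/2} u_i
print("max |Phi Phi^T - K| =", np.abs(Phi @ Phi.T - K).max())
```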
Problem 6.17
The truth table for the XOR function, operating on a three-dimensional pattern $\mathbf{x}$, is as follows:

Table 1
        Inputs           Desired response
   x1     x2     x3            y
   +1     +1     +1           +1
   +1     -1     +1           -1
   -1     +1     +1           -1
   +1     +1     -1           -1
   +1     -1     -1           +1
   -1     +1     -1           +1
   -1     -1     -1           -1
   -1     -1     +1           +1

To proceed with the support vector machine for solving this multidimensional XOR problem, let the Mercer kernel be
$$k(\mathbf{x}, \mathbf{x}_i) = (1 + \mathbf{x}^T\mathbf{x}_i)^p$$
The minimum value of power $p$ (denoting a positive integer) needed for this problem is $p = 3$. For $p = 2$, we end up with a zero weight vector, which is clearly unacceptable.
Setting $p = 3$, we thus have
$$k(\mathbf{x}, \mathbf{x}_i) = (1 + \mathbf{x}^T\mathbf{x}_i)^3 = 1 + 3\mathbf{x}^T\mathbf{x}_i + 3(\mathbf{x}^T\mathbf{x}_i)^2 + (\mathbf{x}^T\mathbf{x}_i)^3$$
where
$$\mathbf{x} = [x_1, x_2, x_3]^T$$
and likewise for $\mathbf{x}_i$. Then, proceeding in a manner similar to, but much more cumbersome than, that described for the two-dimensional XOR problem in Section 6.6, we end up with a polynomial machine defined by
$$y = x_1 x_2 x_3$$
This machine satisfies the entries of Table 1.
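The roles of $p = 2$ and $p = 3$ in Problem 6.17 can be illustrated by evaluating the kernel expansion $\sum_i \alpha_i d_i k(\mathbf{x}, \mathbf{x}_i)$ over the eight patterns of Table 1. The sketch below assumes equal multipliers $\alpha_i = 1$ for all patterns and zero bias, a simplification suggested by the symmetry of the problem rather than the worked solution of the dual problem; the function name `decision` is hypothetical.

```python
import itertools
import numpy as np

# The eight corners of the cube {-1, +1}^3 and the desired responses of Table 1
X = np.array(list(itertools.product([1, -1], repeat=3)), dtype=float)
d = X.prod(axis=1)                         # y = x1 * x2 * x3

def decision(x, p, alpha=1.0):
    """Kernel expansion sum_i alpha * d_i * (1 + x^T x_i)**p (assumed equal alphas)."""
    return sum(alpha * di * (1.0 + x @ xi) ** p for xi, di in zip(X, d))

f2 = np.array([decision(x, 2) for x in X])   # quadratic kernel
f3 = np.array([decision(x, 3) for x in X])   # cubic kernel

print("p = 2:", f2)
print("p = 3:", f3)
```

Under this assumption the quadratic kernel gives an identically zero output, whereas the cubic kernel gives an output proportional to $x_1 x_2 x_3$, consistent with the machine stated in the solution.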
CHAPTER 8
Principal-Components Analysis
Problem 8.5
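From Example 8.2 in the text:
$$\lambda_0 = 1 + \sigma^2 \qquad (1)$$
$$\mathbf{q}_0 = \mathbf{s} \qquad (2)$$
The correlation matrix of the input is
$$\mathbf{R} = \mathbf{s}\mathbf{s}^T + \sigma^2\mathbf{I} \qquad (3)$$
where $\mathbf{s}$ is the signal vector and $\sigma^2$ is the variance of an element of the additive noise vector. Hence, using (2) and (3):
$$\lambda_0 = \frac{\mathbf{q}_0^T\mathbf{R}\mathbf{q}_0}{\mathbf{q}_0^T\mathbf{q}_0} = \frac{\mathbf{s}^T(\mathbf{s}\mathbf{s}^T + \sigma^2\mathbf{I})\mathbf{s}}{\mathbf{s}^T\mathbf{s}} = \frac{(\mathbf{s}^T\mathbf{s})(\mathbf{s}^T\mathbf{s}) + \sigma^2(\mathbf{s}^T\mathbf{s})}{\mathbf{s}^T\mathbf{s}} = \mathbf{s}^T\mathbf{s} + \sigma^2 = \|\mathbf{s}\|^2 + \sigma^2 \qquad (4)$$
The vector $\mathbf{s}$ is a signal vector of unit length:
$$\|\mathbf{s}\| = 1$$
Hence, (4) simplifies to
$$\lambda_0 = 1 + \sigma^2$$
which is the desired result given in (1).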
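The result of Problem 8.5 can be confirmed with a direct eigendecomposition. In this sketch the dimension, the noise variance `sigma2 = 0.3`, and the random unit-length signal vector are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(5)
s /= np.linalg.norm(s)                 # unit-length signal vector
sigma2 = 0.3                           # assumed noise variance

R = np.outer(s, s) + sigma2 * np.eye(5)
eigenvalues, eigenvectors = np.linalg.eigh(R)   # ascending eigenvalues
lambda0 = eigenvalues[-1]
q0 = eigenvectors[:, -1]

print("largest eigenvalue:", lambda0, "expected:", 1.0 + sigma2)
print("|q0 . s| =", abs(q0 @ s))
```

The remaining eigenvalues all equal `sigma2`, since every direction orthogonal to `s` sees only the noise term.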
Problem 8.6
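From (8.46) in the text we have
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta y(n)\left[\mathbf{x}(n) - y(n)\mathbf{w}(n)\right] \qquad (1)$$
As $n \rightarrow \infty$, $\mathbf{w}(n) \rightarrow \mathbf{q}_1$, and so we deduce from (1) that
$$\mathbf{x}(n) = y(n)\mathbf{q}_1 \quad \text{for } n \rightarrow \infty \qquad (2)$$
where $\mathbf{q}_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of the correlation matrix $\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)]$, where $E$ is the expectation operator. Multiplying (2) by its own transpose and then taking expectations, we get
$$E[\mathbf{x}(n)\mathbf{x}^T(n)] = E[y^2(n)]\,\mathbf{q}_1\mathbf{q}_1^T$$
Equivalently, we may write
$$\mathbf{R} = \sigma_Y^2\,\mathbf{q}_1\mathbf{q}_1^T \qquad (3)$$
where $\sigma_Y^2$ is the variance of the output $y(n)$. Postmultiplying (3) by $\mathbf{q}_1$:
$$\mathbf{R}\mathbf{q}_1 = \sigma_Y^2\,\mathbf{q}_1\mathbf{q}_1^T\mathbf{q}_1 = \sigma_Y^2\,\mathbf{q}_1 \qquad (4)$$
where it is noted that $\|\mathbf{q}_1\| = 1$ by definition. From (4) we readily see that $\sigma_Y^2 = \lambda_1$, which is the desired result.

Problem 8.7
Writing the learning algorithm for minor components analysis in matrix form:
$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta y(n)\left[\mathbf{x}(n) - y(n)\mathbf{w}(n)\right]$$
Proceeding in a manner similar to that described in Section (8.5) of the textbook, we have the nonlinear differential equation:
$$\frac{d}{dt}\mathbf{w}(t) = \left[\mathbf{w}^T(t)\mathbf{R}\mathbf{w}(t)\right]\mathbf{w}(t) - \mathbf{R}\mathbf{w}(t)$$
Define
$$\mathbf{w}(t) = \sum_{k=1}^{m}\theta_k(t)\mathbf{q}_k \qquad (1)$$
where $\mathbf{q}_k$ is the $k$th eigenvector of correlation matrix $\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)]$ and the coefficient $\theta_k(t)$ is the projection of $\mathbf{w}(t)$ onto $\mathbf{q}_k$. We may then identify two cases as summarized here:

Case I: $1 \le k < m$
For this first case, we define
$$\alpha_k(t) = \frac{\theta_k(t)}{\theta_m(t)} \quad \text{for some fixed } m \qquad (2)$$
Accordingly, we find that
$$\frac{d\alpha_k(t)}{dt} = (\lambda_m - \lambda_k)\,\alpha_k(t) \qquad (3)$$
With the eigenvalues of $\mathbf{R}$ arranged in decreasing order:
$$\lambda_1 > \lambda_2 > \cdots > \lambda_k > \cdots > \lambda_m > 0$$
it follows that $\alpha_k(t) \rightarrow 0$ as $t \rightarrow \infty$.

Case II: $k = m$
For this second case, we find that
$$\frac{d\theta_m(t)}{dt} = \lambda_m\theta_m(t)\left(\theta_m^2(t) - 1\right) \quad \text{for } t \rightarrow \infty \qquad (4)$$
Hence, $\theta_m(t) \rightarrow \pm 1$ as $t \rightarrow \infty$.

Thus, in light of the results derived for cases I and II, we deduce from (1) that:
$\mathbf{w}(t) \rightarrow \mathbf{q}_m$ = eigenvector associated with the smallest eigenvalue $\lambda_m$ as $t \rightarrow \infty$, and
$\sigma_Y^2 = E[y^2(n)] \rightarrow \lambda_m$.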
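The limiting behavior claimed in Problems 8.6 and 8.7 can be explored by integrating the averaged differential equations with a simple Euler scheme. This is a sketch under several assumptions: the 2-by-2 correlation matrix, the step size `h`, and the starting vector are arbitrary, and the minor-components iteration renormalizes the weight vector to unit length after every step. That renormalization is not part of the solution above; it is added here because the Euler-discretized minor-components equation does not by itself keep the norm at one.

```python
import numpy as np

R = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # assumed correlation matrix
lam, Q = np.linalg.eigh(R)                 # ascending eigenvalues
q_min, q_max = Q[:, 0], Q[:, -1]
h = 0.01                                   # Euler step size

# Problem 8.6: averaged form of the maximum eigenfilter, dw/dt = Rw - (w^T R w) w
w = np.array([1.0, 0.0])
for _ in range(5000):
    w = w + h * (R @ w - (w @ R @ w) * w)
w_oja = w

# Problem 8.7: dw/dt = (w^T R w) w - Rw, with renormalization (an added assumption)
w = np.array([1.0, 0.0])
for _ in range(5000):
    w = w + h * ((w @ R @ w) * w - R @ w)
    w = w / np.linalg.norm(w)
w_mca = w

print("principal: |w.q1| =", abs(w_oja @ q_max), " w^T R w =", w_oja @ R @ w_oja)
print("minor:     |w.qm| =", abs(w_mca @ q_min), " w^T R w =", w_mca @ R @ w_mca)
```

The quantity `w @ R @ w` stands in for the output variance $\sigma_Y^2 = E[y^2(n)]$, which approaches the largest eigenvalue in the first run and the smallest in the second.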
Problem 8.8
From (8.87) and (8.88) of the text:
$$\Delta\mathbf{w}_j = \eta y_j\mathbf{x}' - \eta y_j^2\mathbf{w}_j \qquad (1)$$
$$\mathbf{x}' = \mathbf{x} - \sum_{k=0}^{j-1}\mathbf{w}_k y_k \qquad (2)$$
where, for convenience of presentation, we have omitted the dependence on time $n$. Equations (1) and (2) may be represented by the following vector-valued signal flow graph:

[Signal-flow graph: nodes $\mathbf{x}$, $y_0$, $y_1$, ..., $y_{j-1}$, $y_j$, and $\Delta\mathbf{w}_j$; branches labeled $\mathbf{w}_0$, $\mathbf{w}_1$, ..., $\mathbf{w}_{j-1}$, $\mathbf{w}_j$]

Note: The dashed lines indicate inner (dot) products formed by the input vector $\mathbf{x}$ and the pertinent synaptic weight vectors $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_j$ to produce $y_0, y_1, \ldots, y_j$, respectively.

Problem 8.9
Consider a network consisting of a single layer of neurons with feedforward connections. The algorithm for adjusting the matrix of synaptic weights $\mathbf{W}(n)$ of the network is described by the recursive equation (see Eq. (8.91) of the text):
$$\mathbf{W}(n+1) = \mathbf{W}(n) + \eta(n)\left\{\mathbf{y}(n)\mathbf{x}^T(n) - \mathrm{LT}\left[\mathbf{y}(n)\mathbf{y}^T(n)\right]\mathbf{W}(n)\right\} \qquad (1)$$
where $\mathbf{x}(n)$ is the input vector, $\mathbf{y}(n)$ is the output vector; and LT[.] is a matrix operator that sets all the elements above the diagonal of the matrix argument to zero, thereby making it lower triangular.
First, we note that the asymptotic stability theorem discussed in the text does not apply directly to the convergence analysis of stochastic approximation algorithms involving matrices; it is formulated to apply to vectors. However, we may write the elements of the parameter (synaptic weight) matrix $\mathbf{W}(n)$ in (1) as a vector, that is, one column vector stacked up on top of another. We may then interpret the resulting nonlinear update equation in a corresponding way and so proceed to apply the asymptotic stability theorem directly.
To prove the convergence of the learning algorithm described in (1), we may use the method of induction to show that if the first $j$ columns of matrix $\mathbf{W}(n)$ converge to the first $j$ eigenvectors of the correlation matrix $\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}^T(n)]$, then the $(j+1)$th column will converge to the $(j+1)$th eigenvector of $\mathbf{R}$. Here we use the fact that in light of the convergence of the maximum eigenfilter involving a single neuron, the first column of the matrix $\mathbf{W}(n)$ converges with probability 1 to the first eigenvector of $\mathbf{R}$, and so on.

Problem 8.10
The results of a computer experiment on the training of a single-layer feedforward network using the generalized Hebbian algorithm are described by Sanger (1990). The network has 16 output neurons, and 4096 inputs arranged as a 64 x 64 grid of pixels. The training involved presentation of 2000 samples, which are produced by low-pass filtering a white Gaussian noise image and then multiplying with a Gaussian window function. The low-pass filter was a Gaussian function with standard deviation of 2 pixels, and the window had a standard deviation of 8 pixels.
Figure 1, presented on the next page, shows the first 16 receptive field masks learned by the network (Sanger, 1990). In this figure, positive weights are indicated by white and negative weights are indicated by black; the ordering is left-to-right and top-to-bottom.
The results displayed in Fig. 1 are rationalized as follows (Sanger, 1990):
- The first mask is a low-pass filter since the input has most of its energy near dc (zero frequency).
- The second mask cannot be a low-pass filter, so it must be a band-pass filter with a mid-band frequency as small as possible since the input power decreases with increasing frequency.
- Continuing the analysis in the manner described above, the frequency response of successive masks approaches dc as closely as possible, subject (of course) to being orthogonal to previous masks.
- The end result is a sequence of orthogonal masks that respond to progressively higher frequencies.

Figure 1: Problem 8.10 (Reproduced with permission of Biological Cybernetics)
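The update of Problem 8.9 can be written compactly with a lower-triangular mask. The sketch below assumes that the weight vectors are stored as the rows of `W`, so that $\mathbf{y} = \mathbf{W}\mathbf{x}$; the function names, the random correlation matrix, and the learning rate are illustrative. It also checks one property that follows from the equation itself: when the rows of `W` are eigenvectors of $\mathbf{R}$, the expected update vanishes.

```python
import numpy as np

def gha_step(W, x, eta):
    """One generalized Hebbian update; np.tril plays the role of LT[.]."""
    y = W @ x
    return W + eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

def gha_expected_update(W, R):
    """E[y x^T] - LT[E[y y^T]] W, using E[x x^T] = R and y = W x."""
    return W @ R - np.tril(W @ R @ W.T) @ W

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
R = M @ M.T                                     # assumed correlation matrix
lam, Q = np.linalg.eigh(R)                      # ascending eigenvalues
W_star = Q[:, ::-1][:, :2].T                    # two leading eigenvectors as rows

drift_at_eigenvectors = gha_expected_update(W_star, R)
drift_at_random = gha_expected_update(rng.standard_normal((2, 4)), R)
W_next = gha_step(W_star, rng.standard_normal(4), eta=0.01)

print("max |drift| at eigenvectors:", np.abs(drift_at_eigenvectors).max())
print("max |drift| at a random W:  ", np.abs(drift_at_random).max())
```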
CHAPTER 9
Self-Organizing Maps
Problem 9.1
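Expanding the function $g(y_j)$ in a Taylor series around $y_j = 0$, we get
$$g(y_j) = g(0) + g^{(1)}(0)\,y_j + \frac{1}{2!}g^{(2)}(0)\,y_j^2 + \cdots \qquad (1)$$
where
$$g^{(k)}(0) = \left.\frac{\partial^k g(y_j)}{\partial y_j^k}\right|_{y_j = 0} \quad \text{for } k = 1, 2, \ldots$$
Let
$$y_j = \begin{cases} 1, & \text{neuron } j \text{ is on} \\ 0, & \text{neuron } j \text{ is off} \end{cases}$$
Then, we may rewrite (1) as
$$g(y_j) = \begin{cases} g(0) + g^{(1)}(0) + \frac{1}{2!}g^{(2)}(0) + \cdots, & \text{neuron } j \text{ is on} \\ g(0), & \text{neuron } j \text{ is off} \end{cases}$$
Correspondingly, we may write
$$\frac{d\mathbf{w}_j}{dt} = \eta y_j\mathbf{x} - g(y_j)\mathbf{w}_j = \begin{cases} \eta\mathbf{x} - \mathbf{w}_j\left[g(0) + g^{(1)}(0) + \frac{1}{2!}g^{(2)}(0) + \cdots\right], & \text{neuron } j \text{ is on} \\ -g(0)\mathbf{w}_j, & \text{neuron } j \text{ is off} \end{cases}$$
Consequently, a nonzero $g(0)$ has the effect of making $d\mathbf{w}_j/dt$ assume a nonzero value when neuron $j$ is off, which is undesirable. To alleviate this problem, we make $g(0) = 0$.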
Problem 9.2
Assume that $\mathbf{y}(\mathbf{c})$ is a minimum $L_2$ (least-squares) distortion vector quantizer for the code vector $\mathbf{c}$. We may then form the distortion function
$$D_2 = \frac{1}{2}\int f(\mathbf{c})\,\|\mathbf{c}'(\mathbf{y}(\mathbf{c})) - \mathbf{c}\|^2\,d\mathbf{c}$$
This distortion function is similar to that of Eq. (10.20) in the text, except for the use of $\mathbf{c}$ and $\mathbf{c}'$ in place of $\mathbf{x}$ and $\mathbf{x}'$, respectively. We wish to minimize $D_2$ with respect to $\mathbf{y}(\mathbf{c})$ and $\mathbf{c}'(\mathbf{y})$.
Assuming that $\pi(\boldsymbol{\nu})$ is a smooth function of the noise vector $\boldsymbol{\nu}$, we may expand the decoder output $\mathbf{x}'$ in $\boldsymbol{\nu}$ using the Taylor series. In particular, using a second-order approximation, we get (Luttrell, 1989b)
$$\int\pi(\boldsymbol{\nu})\,\|\mathbf{x}'(\mathbf{c}(\mathbf{x}) + \boldsymbol{\nu}) - \mathbf{x}\|^2\,d\boldsymbol{\nu} \approx \left(1 + \frac{D_2}{2}\nabla_k^2\right)\|\mathbf{x}'(\mathbf{c}) - \mathbf{x}\|^2 \qquad (1)$$
where
$$\int\pi(\boldsymbol{\nu})\,d\boldsymbol{\nu} = 1$$
$$\int n_i\,\pi(\boldsymbol{\nu})\,d\boldsymbol{\nu} = 0$$
$$\int n_i n_j\,\pi(\boldsymbol{\nu})\,d\boldsymbol{\nu} = D_2\,\delta_{ij}$$
where $\delta_{ij}$ is a Kronecker delta function. We now make the following observations:
- The first term on the right-hand side of (1) is the conventional distortion term.
- The second term (i.e., curvature term) arises due to the output noise model $\pi(\boldsymbol{\nu})$.

Problem 9.3
Consider the Peano curve shown in part (d) of Fig. 9.9 of the text. This particular self-organizing feature map pertains to a one-dimensional lattice fed with a two-dimensional input. We see that (counting from left to right) neuron 14, say, is quite close to neuron 97. It is therefore possible for a large enough input perturbation to make neuron 14 jump into the neighborhood of neuron 97, or vice versa. If this change were to happen, the topological preserving property of the SOM algorithm would no longer hold.
For a more convincing demonstration, consider a higher-dimensional, namely, three-dimensional input structure mapped onto a two-dimensional lattice of 10-by-10 neurons. The network is trained with an input consisting of 8 Gaussian clouds with unit variance but different centers. The centers are located at the points (0,0,0,...,0), (4,0,0,...,0), (4,4,0,...,0), (0,4,0,...,0), (0,0,4,...,0), (4,0,4,...,0), (4,4,4,...,0), and (0,4,4,...,0). The clouds occupy the 8 corners of a cube as shown in Fig. 1a. The resulting labeled feature map computed by the SOM algorithm is shown in Fig. 1b. Although each of the classes is grouped together in the map, the planar feature map fails to capture the complete topology of the input space. In particular, we observe that class 6 is adjacent to class 2 in the input space, but is not adjacent to it in the feature map.
The conclusion to be drawn here is that although the SOM algorithm does perform clustering on the input space, it may not always completely preserve the topology of the input space.

Figure 1: Problem 9.3
Problem 9.4
Consider for example a two-dimensional lattice using the SOM algorithm to learn a two-dimensional input distribution as illustrated in Fig. 9.8 in the textbook. Suppose that the neuron at the center of the lattice breaks down; this failure may have a dramatic effect on the evolution of the feature map. On the other hand, a small perturbation applied to the input space leaves the map learned by the lattice essentially unchanged.
Problem 9.5
The batch version of the SOM algorithm is defined by
$$w_j = \frac{\sum_i \pi_{j,i}\, x_i}{\sum_i \pi_{j,i}} \quad \text{for some prescribed neuron } j \qquad (1)$$
where $\pi_{j,i}$ is the discretized version of the pdf $\pi(\nu)$ of noise vector $\nu$. From Table 9.1 of the text we recall that $\pi_{j,i}$ plays a role analogous to that of the neighborhood function. Indeed, we can substitute $h_{j,i(x)}$ for $\pi_{j,i}$ in (1). We are interested in rewriting (1) in a form that highlights the role of Voronoi cells. To this end we note that the dependence of the neighborhood function $h_{j,i(x)}$, and therefore $\pi_{j,i}$, on the input pattern x is indirect, with the dependence being through the Voronoi cell in which x lies. Hence, for all input patterns that lie in a particular Voronoi cell the same neighborhood function applies. Let each Voronoi cell be identified by an indicator function $I_{i,k}$ interpreted as follows: $I_{i,k} = 1$ if the input pattern $x_i$ lies in the Voronoi cell corresponding to winning neuron k. Then in light of these considerations we may rewrite (1) in the new form
$$w_j = \frac{\sum_k \sum_i \pi_{j,k}\, I_{i,k}\, x_i}{\sum_k \sum_i \pi_{j,k}\, I_{i,k}} \qquad (2)$$
Now let $m_k$ denote the centroid of the Voronoi cell of neuron k and $N_k$ denote the number of input patterns that lie in that cell. We may then simplify (2) as
$$w_j = \frac{\sum_k \pi_{j,k}\, N_k\, m_k}{\sum_k \pi_{j,k}\, N_k} = \sum_k W_{j,k}\, m_k \qquad (3)$$
where $W_{j,k}$ is a weighting function defined by
$$W_{j,k} = \frac{\pi_{j,k}\, N_k}{\sum_k \pi_{j,k}\, N_k} \qquad (4)$$
with
$$\sum_k W_{j,k} = 1 \quad \text{for all } j$$
Equation (3) bears a close resemblance to the Watson-Nadaraya regression estimator defined in Eq. (5.61) of the textbook. Indeed, in light of this analogy, we may offer the following observations:
• The SOM algorithm is similar to nonparametric regression in a statistical sense.
• Except for the normalizing factor $N_k$, the discretized pdf $\pi_{j,i}$, and therefore the neighborhood function $h_{j,i}$, plays the role of a kernel in the Watson-Nadaraya estimator.
• The width of the neighborhood function plays the role of the span of the kernel.
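The Voronoi-cell form of the batch update in Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `batch_som_update`, the array layout, and the assumption that the neighborhood matrix is strictly positive (e.g., Gaussian) are choices made here, not part of the text.

```python
import numpy as np

def batch_som_update(weights, inputs, neighborhood):
    """One batch-SOM step in the Voronoi-cell form of Eq. (3).

    weights:      (J, d) current weight vectors w_j
    inputs:       (n, d) input patterns x_i
    neighborhood: (J, J) matrix with entry [j, k] = pi_{j,k} (assumed > 0)
    Returns the new weights and the weighting matrix W_{j,k} of Eq. (4).
    """
    # Winning neuron k for every input pattern (defines the Voronoi cells).
    dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
    winners = np.argmin(dists, axis=1)

    n_neurons = weights.shape[0]
    counts = np.bincount(winners, minlength=n_neurons).astype(float)  # N_k
    sums = np.zeros_like(weights, dtype=float)
    np.add.at(sums, winners, inputs)                                  # N_k * m_k

    # Centroids m_k; empty cells get a zero centroid, which is harmless
    # because their weight W_{j,k} is zero (N_k = 0).
    centroids = sums / np.maximum(counts, 1.0)[:, None]

    # W_{j,k} = pi_{j,k} N_k / sum_k pi_{j,k} N_k, Eq. (4).
    mix = neighborhood * counts[None, :]
    W = mix / mix.sum(axis=1, keepdims=True)

    return W @ centroids, W
```

Each row of the returned `W` sums to one, which is the normalization stated after Eq. (4).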
Problem 9.6
In its basic form, Hebb's postulate of learning states that the adjustment $\Delta w_{kj}$ applied to the synaptic weight $w_{kj}$ is defined by
$$\Delta w_{kj} = \eta\, y_k\, x_j$$
where $y_k$ is the output signal produced in response to the input signal $x_j$, and $\eta$ is the learning-rate parameter.
The weight update for the maximum eigenfilter includes the term $\eta y_k x_j$ and, additionally, a stabilizing term defined by $-\eta y_k^2 w_{kj}$. The term $\eta y_k x_j$ provides for synaptic amplification.
In contrast, in the SOM algorithm two modifications are made to Hebb's postulate of learning:
1. The stabilizing term is set equal to $-\eta y_k w_{kj}$.
2. The output $y_k$ of neuron k is set equal to a neighborhood function.
The net result of these two modifications is to make the weight update for the SOM algorithm assume a form similar to that in competitive learning rather than Hebbian learning.
Problem 9.7
In Fig. 1 (shown on the next page), we summarize the density matching results of computer simulation on a one-dimensional lattice consisting of 20 neurons. The network is trained with a triangular input density. Two sets of results are displayed in this figure:
1. The standard SOM (Kohonen) algorithm, shown as the solid line.
2. The conscience algorithm, shown as the dashed line; the line labeled "predict" is its straight-line approximation.
In Fig. 1, we have also included the exact result. Although it appears that both algorithms fail to match the input density exactly, we see that the conscience algorithm comes closer to the exact result than the standard SOM algorithm.
Figure 1: Problem 9.7
Problem 9.11
The results of computer simulation for a one-dimensional lattice with a two-dimensional (triangular) input are shown in Fig. 1 on the next page for an increasing number of iterations. The experiment begins with random weights at zero time, and then the neurons start spreading out.
Two distinct phases in the learning process can be recognized from this figure:
• The neurons become ordered (i.e., the one-dimensional lattice becomes untangled), which happens at about 20 iterations.
• The neurons spread out to match the density of the input distribution, culminating in the steady-state condition attained after 25,000 iterations.
Figure 1: Problem 9.11
CHAPTER 10
Information-Theoretic Learning Models
Problem 10.1
The maximum entropy distribution of the random variable X is a uniform distribution over the range [a, b], as shown by
$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$
Hence,
$$h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x)\, dx = \int_a^b \frac{1}{b-a} \log(b-a)\, dx = \log(b-a)$$
Problem 10.3
Let
$$Y_i = a_i^T X_1$$
$$Z_i = b_i^T X_2$$
where the vectors $X_1$ and $X_2$ have multivariate Gaussian distributions. The correlation coefficient between $Y_i$ and $Z_i$ is defined by
$$\rho_i = \frac{E[Y_i Z_i]}{\sqrt{E[Y_i^2]\, E[Z_i^2]}} = \frac{a_i^T E[X_1 X_2^T]\, b_i}{\{(a_i^T E[X_1 X_1^T]\, a_i)(b_i^T E[X_2 X_2^T]\, b_i)\}^{1/2}} = \frac{a_i^T \Sigma_{12}\, b_i}{\{(a_i^T \Sigma_{11}\, a_i)(b_i^T \Sigma_{22}\, b_i)\}^{1/2}} \qquad (1)$$
where
$$\Sigma_{11} = E[X_1 X_1^T]$$
$$\Sigma_{12} = E[X_1 X_2^T] = \Sigma_{21}^T$$
$$\Sigma_{22} = E[X_2 X_2^T]$$
The mutual information between $Y_i$ and $Z_i$ is defined by
$$I(Y_i; Z_i) = -\tfrac{1}{2} \log(1 - \rho_i^2)$$
Let r denote the rank of the cross-covariance matrix $\Sigma_{12}$. Given the vectors $X_1$ and $X_2$, we may invoke the idea of canonical correlations as summarized here:
• Find the pair of random variables $Y_1 = a_1^T X_1$ and $Z_1 = b_1^T X_2$ that are most highly correlated.
• Extract the pair of random variables $Y_2 = a_2^T X_1$ and $Z_2 = b_2^T X_2$ in such a way that $Y_1$ and $Y_2$ are uncorrelated and so are $Z_1$ and $Z_2$.
• Continue these two steps until at most r pairs of variables $\{(Y_1, Z_1), (Y_2, Z_2), \ldots, (Y_r, Z_r)\}$ have been extracted.
The essence of the canonical correlation described above is to encapsulate the dependence between random vectors $X_1$ and $X_2$ in the sequence $\{(Y_1, Z_1), (Y_2, Z_2), \ldots, (Y_r, Z_r)\}$. The uncorrelatedness of the pairs in this sequence, that is,
$$E[Y_i Y_j] = E[Z_i Z_j] = 0 \quad \text{for all } j \ne i$$
means that the mutual information between the vectors $X_1$ and $X_2$ is the sum of the mutual information measures between the individual elements of the pairs $\{(Y_i, Z_i)\}_{i=1}^{r}$. That is, we may write
$$I(X_1; X_2) = \sum_{i=1}^{r} I(Y_i; Z_i) + \text{constant} = -\tfrac{1}{2} \sum_{i=1}^{r} \log(1 - \rho_i^2) + \text{constant}$$
where $\rho_i$ is defined by (1).
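The canonical-correlation recipe above can be checked numerically with a short sketch. The helper names `canonical_correlations` and `gaussian_mutual_information` are hypothetical, and the sketch assumes zero-mean, jointly Gaussian vectors with positive-definite covariance blocks; under that assumption the sum over pairs agrees with the direct log-determinant expression for the mutual information.

```python
import numpy as np

def canonical_correlations(S11, S12, S22):
    """Canonical correlations rho_i from the covariance blocks of (X1, X2).

    Whitening each block with its Cholesky factor turns the problem into a
    singular value decomposition of K = L1^{-1} S12 L2^{-T}.
    """
    L1 = np.linalg.cholesky(S11)          # S11 = L1 L1^T
    L2 = np.linalg.cholesky(S22)          # S22 = L2 L2^T
    K = np.linalg.solve(L1, S12)          # L1^{-1} S12
    K = np.linalg.solve(L2, K.T).T        # (L1^{-1} S12) L2^{-T}
    rho = np.linalg.svd(K, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

def gaussian_mutual_information(rho):
    """Sum of the pairwise terms -0.5 * log(1 - rho_i^2), in nats."""
    return -0.5 * np.sum(np.log(1.0 - rho ** 2))
```

The singular values come out in decreasing order, so the first entry corresponds to the most highly correlated pair $(Y_1, Z_1)$.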
Problem 10.4
Consider a multilayer perceptron with a single hidden layer. Let $w_{ji}$ denote the synaptic weight of hidden neuron j connected to source node i in the input layer. Let $x_i$ denote the ith component of the input vector x, given example $\alpha$. Then the induced local field of neuron j is
$$v_j = \sum_i w_{ji} x_i \qquad (1)$$
Correspondingly, the output of hidden neuron j for example $\alpha$ is given by
$$y_j = \varphi(v_j) \qquad (2)$$
where $\varphi(\cdot)$ is the logistic function
$$\varphi(v) = \frac{1}{1 + e^{-v}}$$
Consider next the output layer of the network. Let $w_{kj}$ denote the synaptic weight of output neuron k connected to hidden neuron j. The induced local field of output neuron k is
$$v_k = \sum_j w_{kj} y_j \qquad (3)$$
The kth output of the network is therefore
$$y_k = \varphi(v_k) \qquad (4)$$
The output $y_k$ is assigned a probabilistic interpretation by writing
$$p_{k|\alpha} = y_k \qquad (5)$$
Accordingly, we may view $y_k$ as an estimate of the conditional probability that the proposition k is true, given the example $\alpha$ at the input. On this basis, we may interpret
$$1 - y_k = 1 - p_{k|\alpha}$$
as the estimate of the conditional probability that the proposition k is false, given the input example $\alpha$. Correspondingly, let $q_{k|\alpha}$ denote the actual (true) value of the conditional probability that the proposition k is true, given the input example $\alpha$. This means that $1 - q_{k|\alpha}$ is the actual
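value of the conditional probability that the proposition k is false, given the input example $\alpha$. Thus, we may define the Kullback-Leibler divergence for the multilayer perceptron as
$$D_{p\|q} = \sum_\alpha p_\alpha \sum_k \left[ q_{k|\alpha} \log\left(\frac{q_{k|\alpha}}{p_{k|\alpha}}\right) + (1 - q_{k|\alpha}) \log\left(\frac{1 - q_{k|\alpha}}{1 - p_{k|\alpha}}\right) \right]$$
where $p_\alpha$ is the probability of occurrence of example $\alpha$ at the input.
To train the network we perform gradient descent on $D_{p\|q}$ in weight space. For the output-layer weights, the chain rule applied to $D_{p\|q}$ gives
$$\frac{\partial D_{p\|q}}{\partial w_{kj}} = \frac{\partial D_{p\|q}}{\partial p_{k|\alpha}} \frac{\partial p_{k|\alpha}}{\partial y_k} \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_{kj}} = -\sum_\alpha p_\alpha (q_{k|\alpha} - p_{k|\alpha})\, y_j$$
For the hidden-layer weights, differentiating $D_{p\|q}$ with respect to $w_{ji}$ gives
$$\frac{\partial D_{p\|q}}{\partial w_{ji}} = -\sum_\alpha p_\alpha \sum_k \left( \frac{q_{k|\alpha}}{p_{k|\alpha}} - \frac{1 - q_{k|\alpha}}{1 - p_{k|\alpha}} \right) \frac{\partial p_{k|\alpha}}{\partial w_{ji}}$$
where, by the chain rule,
$$\frac{\partial p_{k|\alpha}}{\partial w_{ji}} = \frac{\partial p_{k|\alpha}}{\partial y_k} \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial y_j} \frac{\partial y_j}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}} = \varphi'(v_k)\, w_{kj}\, \varphi'(v_j)\, x_i$$
and
$$\varphi'(v_k) = y_k (1 - y_k) = p_{k|\alpha} (1 - p_{k|\alpha})$$
Hence,
$$\frac{\partial D_{p\|q}}{\partial w_{ji}} = \sum_\alpha p_\alpha\, x_i\, \varphi'\!\left( \sum_i w_{ji} x_i \right) \sum_k (p_{k|\alpha} - q_{k|\alpha})\, w_{kj}$$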
where $\varphi'(\cdot)$ is the derivative of the logistic function with respect to its argument.
Assuming the use of the learning-rate parameter $\eta$ for all weight changes applied to the network, we may use the method of steepest descent to write the following two-step probabilistic algorithm:
1. For output neuron k, compute
$$\Delta w_{kj} = -\eta \frac{\partial D_{p\|q}}{\partial w_{kj}} = \eta \sum_\alpha p_\alpha (q_{k|\alpha} - p_{k|\alpha})\, y_j$$
2. For hidden neuron j, compute
$$\Delta w_{ji} = -\eta \frac{\partial D_{p\|q}}{\partial w_{ji}} = -\eta \sum_\alpha p_\alpha\, x_i\, \varphi'\!\left( \sum_i w_{ji} x_i \right) \sum_k (p_{k|\alpha} - q_{k|\alpha})\, w_{kj}$$
Problem 10.9
We first note that the mutual information between the random variables X and Y is defined by
$$I(X; Y) = h(X) + h(Y) - h(X, Y)$$
To maximize the mutual information I(X;Y) we need to maximize the sum of the differential entropy h(X) and the differential entropy h(Y) and also minimize the joint differential entropy h(X,Y). From the definition of differential entropy, both h(X) and h(Y) attain their maximum value of 0.5 when X and Y occur with probability 1/2. Moreover, h(X,Y) is minimized when the joint probability of X and Y occupies the smallest possible region in the probability space.
Problem 10.10
The outputs $Y_1$ and $Y_2$ of the two neurons in Fig. P10.6 in the text are respectively defined by
$$Y_1 = \left( \sum_{i=1}^{m} w_{1i} x_i \right) + N_1$$
$$Y_2 = \left( \sum_{i=1}^{m} w_{2i} x_i \right) + N_2$$
where the $w_{1i}$ are the synaptic weights of output neuron 1, and the $w_{2i}$ are the synaptic weights of output neuron 2. The mutual information between the output vector $Y = [Y_1, Y_2]^T$ and the input vector $X = [X_1, X_2, \ldots, X_m]^T$ is
$$I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(N) \qquad (1)$$
where h(Y) is the differential entropy of the output vector Y and h(N) is the differential entropy of the noise vector $N = [N_1, N_2]^T$.
Since the noise terms $N_1$ and $N_2$ are Gaussian and uncorrelated, it follows that they are statistically independent. Hence,
$$h(N) = h(N_1, N_2) = h(N_1) + h(N_2) = 1 + \log(2\pi\sigma_N^2) \qquad (2)$$
The differential entropy of the output vector Y is
$$h(Y) = h(Y_1, Y_2) = -\int\!\!\int f_{Y_1,Y_2}(y_1, y_2) \log f_{Y_1,Y_2}(y_1, y_2)\, dy_1\, dy_2$$
where $f_{Y_1,Y_2}(y_1, y_2)$ is the joint pdf of $Y_1$ and $Y_2$. Both $Y_1$ and $Y_2$ are dependent on the same set of input signals, and so they are correlated with each other. Let
$$R = E[YY^T] = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix}$$
where
$$r_{ij} = E[Y_i Y_j], \quad i, j = 1, 2$$
The individual elements of the correlation matrix R are given by
$$r_{11} = \sigma_1^2 + \sigma_N^2$$
$$r_{12} = r_{21} = \sigma_1 \sigma_2 \rho_{12}$$
$$r_{22} = \sigma_2^2 + \sigma_N^2$$
where $\sigma_1^2$ and $\sigma_2^2$ are the respective variances of $Y_1$ and $Y_2$ in the absence of noise, and $\rho_{12}$ is their correlation coefficient, also in the absence of noise. For the general case of an N-dimensional Gaussian distribution, we have
$$f_Y(y) = \frac{1}{(2\pi)^{N/2} (\det R)^{1/2}} \exp\left( -\frac{1}{2} y^T R^{-1} y \right)$$
Correspondingly, the differential entropy of the N-dimensional vector Y is described as
$$h(Y) = \log\left( (2\pi e)^{N/2} (\det(R))^{1/2} \right)$$
where e is the base of the natural logarithm. For the problem at hand, we have N = 2 and so
$$h(Y) = \log\left( 2\pi e\, (\det(R))^{1/2} \right) = 1 + \log\left( 2\pi (\det(R))^{1/2} \right) \qquad (3)$$
Hence, the use of (2) and (3) in (1) yields
$$I(X; Y) = \log\left( \frac{(\det(R))^{1/2}}{\sigma_N^2} \right) \qquad (4)$$
For a fixed noise variance $\sigma_N^2$, the mutual information I(X;Y) is maximized by maximizing the determinant det(R). By definition,
$$\det(R) = r_{11} r_{22} - r_{12} r_{21}$$
That is,
$$\det(R) = \sigma_N^4 + \sigma_N^2 (\sigma_1^2 + \sigma_2^2) + \sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2) \qquad (5)$$
Depending on the value of noise variance $\sigma_N^2$, we may identify two distinct situations:
1. Large noise variance. When $\sigma_N^2$ is large, the third term in (5) may be neglected, obtaining
$$\det(R) \approx \sigma_N^4 + \sigma_N^2 (\sigma_1^2 + \sigma_2^2)$$
In this case, maximizing det(R) requires that we maximize $(\sigma_1^2 + \sigma_2^2)$. This requirement may be satisfied simply by maximizing the variance $\sigma_1^2$ of output $Y_1$ or the variance $\sigma_2^2$ of output $Y_2$, separately. Since the variance of output $Y_i$, i = 1, 2, is equal to $\sigma_i^2$ in the absence of noise and $\sigma_i^2 + \sigma_N^2$ in the presence of noise, it follows from the Infomax principle that the optimum solution for a fixed noise variance is to maximize the variance of either output, $Y_1$ or $Y_2$.
2. Low noise variance. When the noise variance $\sigma_N^2$ is small, the third term $\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)$ in (5) becomes important relative to the other two terms. The mutual information I(X;Y) is then maximized by making an optimal tradeoff between two options: keeping the output variances $\sigma_1^2$ and $\sigma_2^2$ large, and making the outputs $Y_1$ and $Y_2$ of the two neurons uncorrelated.
Based on these observations, we may now make the following two statements:
• A high noise level favors redundancy of response, in which case the two output neurons compute the same linear combination of inputs. Only one such combination yields a response with maximum variance.
• A low noise level favors diversity of response, in which case the two output neurons compute different linear combinations of inputs even though such a choice may result in a reduced output variance.
Problem 10.11
(a) We are given
$$Y_a = S + N_a$$
$$Y_b = S + N_b$$
Hence,
$$\frac{Y_a + Y_b}{2} = S + \frac{1}{2}(N_a + N_b)$$
The mutual information between $\frac{1}{2}(Y_a + Y_b)$ and the signal component S is
$$I\left( \frac{Y_a + Y_b}{2}; S \right) = h\left( \frac{Y_a + Y_b}{2} \right) - h\left( \frac{Y_a + Y_b}{2} \,\Big|\, S \right) \qquad (1)$$
The differential entropy of $\frac{Y_a + Y_b}{2}$ is
$$h\left( \frac{Y_a + Y_b}{2} \right) = \frac{1}{2} \left[ 1 + \log\left( \frac{\pi}{2} \operatorname{var}[Y_a + Y_b] \right) \right] \qquad (2)$$
The conditional differential entropy of $\frac{Y_a + Y_b}{2}$ given S is
$$h\left( \frac{Y_a + Y_b}{2} \,\Big|\, S \right) = h\left( \frac{N_a + N_b}{2} \right) = \frac{1}{2} \left[ 1 + \log\left( \frac{\pi}{2} \operatorname{var}[N_a + N_b] \right) \right] \qquad (3)$$
Hence, the use of (2) and (3) in (1) yields (after the simplification of terms)
$$I\left( \frac{Y_a + Y_b}{2}; S \right) = \frac{1}{2} \log\left( \frac{\operatorname{var}[Y_a + Y_b]}{\operatorname{var}[N_a + N_b]} \right)$$
(b) The signal component S is ordinarily independent of the noise components $N_a$ and $N_b$. Hence with
$$Y_a + Y_b = 2S + N_a + N_b$$
it follows that
$$\operatorname{var}[Y_a + Y_b] = 4 \operatorname{var}[S] + \operatorname{var}[N_a + N_b]$$
The ratio $(\operatorname{var}[Y_a + Y_b]) / (\operatorname{var}[N_a + N_b])$ in the expression for the mutual information $I\left( \frac{Y_a + Y_b}{2}; S \right)$ may therefore be interpreted as a signal-plus-noise to noise ratio.
Problem 10.12
Principal components analysis (PCA) and independent-components analysis (ICA) share a common feature: They both linearly transform an input signal into a fixed set of components. However, they differ from each other in two important respects:
1. PCA performs decorrelation by minimizing second-order moments; higher-order moments are not involved in this computation. On the other hand, ICA achieves statistical independence by using higher-order moments.
2. The output signal vector resulting from PCA has a diagonal covariance matrix. The first principal component defines a direction in the original signal space that captures the maximum possible variance; the second principal component defines another direction in the remaining orthogonal subspace that captures the next maximum possible variance, and so on. On the other hand, ICA does not find the directions of maximum variances but rather interesting directions, where the term "interesting" refers to "deviation from Gaussianity."
Problem 10.13
Independent-components analysis may be used as a preprocessing tool before signal detection and pattern classification. In particular, through a change of coordinates resulting from the use of ICA, the probability density function of multichannel data may be expressed as a product of marginal densities. This change, in turn, permits density estimation with shorter observations.
Problem 10.14
Consider m random variables $X_1, X_2, \ldots, X_m$ that are defined by
$$X_i = \sum_{j=1}^{N} a_{ij} U_j, \quad i = 1, 2, \ldots, m$$
where the $U_j$ are independent random variables. The Darmois theorem states that if the $X_i$ are independent, then the variables $U_j$ for which $a_{ij} \ne 0$ are all Gaussian.
For independent-components analysis to work, at most a single $X_i$ can be Gaussian. If all the $X_i$ are independent to begin with, there is no need for the application of independent-components analysis. This, in turn, means that all the $X_i$ must be Gaussian. For a finite N, this condition can only be satisfied if all the $U_j$ are not only independent but also Gaussian.
Problem 10.15
The use of independent-components analysis results in a set of components that are as statistically independent of each other as possible. In contrast, the use of decorrelation only addresses second-order statistics and there is therefore no guarantee of statistical independence.
Problem 10.16
The Kullback-Leibler divergence between the joint pdf $f_Y(y, w)$ and the factorial pdf $\tilde{f}_Y(y, w)$ is the multifold integral
$$D_{f_Y \| \tilde{f}_Y} = \int f_Y(y, w) \log\left( \frac{f_Y(y, w)}{\tilde{f}_Y(y, w)} \right) dy \qquad (1)$$
Let
$$dy = dy_i\, dy_j\, d\tilde{y}$$
where $\tilde{y}$ excludes $y_i$ and $y_j$. We may then rewrite (1) as
$$D_{f_Y \| \tilde{f}_Y} = \int dy_i \int dy_j \int f_Y(y, w) \log f_Y(y, w)\, d\tilde{y} - \int dy_i \int dy_j \int f_Y(y, w) \log \tilde{f}_Y(y, w)\, d\tilde{y}$$
$$= \int\!\!\int f_{Y_i,Y_j}(y_i, y_j, w) \log f_{Y_i,Y_j}(y_i, y_j, w)\, dy_i\, dy_j - \int\!\!\int f_{Y_i,Y_j}(y_i, y_j, w) \log\left( f_{Y_i}(y_i, w) f_{Y_j}(y_j, w) \right) dy_i\, dy_j$$
$$= \int\!\!\int f_{Y_i,Y_j}(y_i, y_j, w) \log\left( \frac{f_{Y_i,Y_j}(y_i, y_j, w)}{f_{Y_i}(y_i, w) f_{Y_j}(y_j, w)} \right) dy_i\, dy_j$$
$$= I(Y_i; Y_j)$$
That is, the Kullback-Leibler divergence between the joint pdf $f_Y(y, w)$ and the factorial pdf distribution $\tilde{f}_Y(y, w)$ is equal to the mutual information between the components $Y_i$ and $Y_j$ of the output vector Y for any pair (i, j).
Problem 10.18
Define the output matrix
$$Y = \begin{bmatrix} y_1(0) & y_1(1) & \cdots & y_1(N-1) \\ y_2(0) & y_2(1) & \cdots & y_2(N-1) \\ \vdots & \vdots & & \vdots \\ y_m(0) & y_m(1) & \cdots & y_m(N-1) \end{bmatrix} \qquad (1)$$
where m is the dimension of the output vector y(n) and N is the number of samples used in computing the matrix Y. Correspondingly, define the m-by-N matrix of activation functions
$$\Phi(Y) = \begin{bmatrix} \varphi(y_1(0)) & \varphi(y_1(1)) & \cdots & \varphi(y_1(N-1)) \\ \varphi(y_2(0)) & \varphi(y_2(1)) & \cdots & \varphi(y_2(N-1)) \\ \vdots & \vdots & & \vdots \\ \varphi(y_m(0)) & \varphi(y_m(1)) & \cdots & \varphi(y_m(N-1)) \end{bmatrix} \qquad (2)$$
In the batch mode, we define the average weight adjustment (see Eq. (10.100) of the text)
$$\Delta W = \frac{1}{N} \sum_{n=0}^{N-1} \Delta W(n) = \eta \left( I - \frac{1}{N} \sum_{n=0}^{N-1} \varphi(y(n))\, y^T(n) \right) W$$
Equivalently, using the matrix definitions introduced in (1) and (2), we may write
$$\Delta W = \eta \left( I - \frac{1}{N} \Phi(Y)\, Y^T \right) W$$
which is the desired formula.
Problem 10.19
(a) Let q(y) denote a pdf equal to the determinant det(J), with the elements of the Jacobian J being as defined in Eq. (10.115). Then using Eq. (10.116) we may express the entropy of the random vector Z at the output of the nonlinearity in Fig. 10.16 of the text as
$$h(Z) = -D_{f\|q}$$
Invoking the Pythagorean decomposition of the Kullback-Leibler divergence, we write
$$D_{f\|q} = D_{f\|\tilde{f}} + D_{\tilde{f}\|q}$$
Hence, the differential entropy
$$h(Z) = -D_{f\|\tilde{f}} - D_{\tilde{f}\|q} \qquad (1)$$
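(b) If $q(y_i)$ happens to equal the source pdf $f_U(y_i)$ for all i, we then find that $D_{\tilde{f}\|q} = 0$. In such a case, (1) reduces to
$$h(Z) = -D_{f\|\tilde{f}}$$
That is, the entropy h(Z) is equal to the negative of the Kullback-Leibler divergence between the pdf $f_Y(y)$ and the corresponding factorial distribution $\tilde{f}_Y(y)$.
Problem 10.20
(a) From Eq. (10.124) in the text,
$$\Phi = \log|\det(A)| + \log|\det(W)| + \sum_i \log\left( \frac{\partial z_i}{\partial y_i} \right)$$
The matrix A of the linear mixer is fixed. Hence, differentiating $\Phi$ with respect to W:
$$\frac{\partial \Phi}{\partial W} = W^{-T} + \sum_i \frac{\partial}{\partial W} \log\left( \frac{\partial z_i}{\partial y_i} \right) \qquad (1)$$
(b) From Eq. (10.126) of the text,
$$z_i = \frac{1}{1 + e^{-y_i}}$$
Differentiating $z_i$ with respect to $y_i$:
$$\frac{\partial z_i}{\partial y_i} = \frac{e^{-y_i}}{(1 + e^{-y_i})^2} = z_i - z_i^2 \qquad (2)$$
Hence, differentiating $\log\left( \frac{\partial z_i}{\partial y_i} \right)$ with respect to the demixing matrix W, we get
$$\frac{\partial}{\partial W} \log\left( \frac{\partial z_i}{\partial y_i} \right) = \frac{\partial}{\partial W} \log(z_i - z_i^2)$$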
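$$= \frac{\partial z_i}{\partial W} \frac{\partial}{\partial z_i} \log(z_i - z_i^2) = \frac{\partial z_i}{\partial W} \frac{1}{(z_i - z_i^2)} (1 - 2z_i) = \frac{\partial z_i}{\partial y_i} \frac{\partial y_i}{\partial W} \frac{1}{(z_i - z_i^2)} (1 - 2z_i) \qquad (3)$$
But from (2) we have
$$\frac{\partial z_i}{\partial y_i} \left( \frac{1}{z_i - z_i^2} \right) = 1$$
Hence, we may simplify (3) to
$$\frac{\partial}{\partial W} \log\left( \frac{\partial z_i}{\partial y_i} \right) = \frac{\partial y_i}{\partial W} (1 - 2z_i)$$
We may thus rewrite (1) as
$$\frac{\partial \Phi}{\partial W} = W^{-T} + \sum_i \frac{\partial y_i}{\partial W} (1 - 2z_i)$$
Putting this relation in matrix form and recognizing that the demixer output y is equal to Wx, where x is the observation vector, we find that the adjustment applied to W is defined by
$$\Delta W = \eta \frac{\partial \Phi}{\partial W} = \eta \left( W^{-T} + (\mathbf{1} - 2z)\, x^T \right)$$
where $\eta$ is the learning-rate parameter and $\mathbf{1}$ is a vector of ones.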
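The final update can be sanity-checked numerically. The sketch below treats the W-dependent part of the objective, $\log|\det(W)| + \sum_i \log(z_i - z_i^2)$, as the quantity being ascended (the $\log|\det(A)|$ term is constant in W); the function names `infomax_objective` and `infomax_step` and the default learning rate are assumptions made for illustration.

```python
import numpy as np

def infomax_objective(W, x):
    """W-dependent part of the objective for one observation x:
    log|det W| + sum_i log(dz_i/dy_i), with logistic z_i and y = W x."""
    y = W @ x
    z = 1.0 / (1.0 + np.exp(-y))
    return np.linalg.slogdet(W)[1] + np.sum(np.log(z * (1.0 - z)))

def infomax_step(W, x, eta=0.01):
    """One stochastic-gradient step: Delta W = eta * (W^{-T} + (1 - 2z) x^T).
    Returns the updated matrix and the gradient that was used."""
    y = W @ x
    z = 1.0 / (1.0 + np.exp(-y))
    grad = np.linalg.inv(W).T + np.outer(1.0 - 2.0 * z, x)
    return W + eta * grad, grad
```

Comparing `grad` with a finite-difference gradient of `infomax_objective` is a quick way to confirm the sign and the placement of the transpose in the update.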
CHAPTER 11
Stochastic Methods Rooted in Statistical Mechanics
Problem 11.1
By definition, we have
$$p_{ij}^{(n)} = P(X_t = j \mid X_{t-n} = i)$$
where t denotes time and n denotes the number of discrete steps. For n = 1, we have the one-step transition probability
$$p_{ij}^{(1)} = p_{ij} = P(X_t = j \mid X_{t-1} = i)$$
For n = 2 we have the two-step transition probability
$$p_{ij}^{(2)} = \sum_k p_{ik}\, p_{kj}$$
where the sum is taken over all intermediate steps k taken by the system. By induction, it thus follows that
$$p_{ij}^{(n+1)} = \sum_k p_{ik}\, p_{kj}^{(n)}$$
Problem 11.2
For p > 0, the state transition diagram for the random walk process shown in Fig. P11.2 of the text is irreducible. The reason for saying so is that the system has only one class, namely, {0, ±1, ±2, ...}.
Problem 11.3
The state transition diagram of Fig. P11.3 in the text pertains to a Markov chain with two classes: $\{x_1\}$ and $\{x_1, x_2\}$.
Problem 11.4
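The stochastic matrix of the Markov chain in Fig. P11.4 of the text is given by
$$P = \begin{bmatrix} \frac{3}{4} & \frac{1}{4} & 0 \\ 0 & \frac{2}{3} & \frac{1}{3} \\ \frac{1}{4} & \frac{3}{4} & 0 \end{bmatrix}$$
Let $\pi_1$, $\pi_2$, and $\pi_3$ denote the steady-state probabilities of this chain. We may then write (see Eq. (11.27) of the text)
$$\pi_1 = \pi_1 \left( \tfrac{3}{4} \right) + \pi_2 (0) + \pi_3 \left( \tfrac{1}{4} \right)$$
$$\pi_2 = \pi_1 \left( \tfrac{1}{4} \right) + \pi_2 \left( \tfrac{2}{3} \right) + \pi_3 \left( \tfrac{3}{4} \right)$$
$$\pi_3 = \pi_1 (0) + \pi_2 \left( \tfrac{1}{3} \right) + \pi_3 (0)$$
That is,
$$\pi_1 = \pi_3$$
$$\pi_2 = 3\pi_3$$
We also have, by definition,
$$\pi_1 + \pi_2 + \pi_3 = 1$$
Hence,
$$\pi_3 + 3\pi_3 + \pi_3 = 1$$
or equivalently
$$\pi_3 = \frac{1}{5}$$
and so
$$\pi_1 = \frac{1}{5}, \qquad \pi_2 = \frac{3}{5}$$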
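The steady-state probabilities of Problem 11.4 can be cross-checked numerically. The sketch below is one simple way to do it (the helper name `stationary_distribution` is hypothetical): it stacks the balance equations $\pi = \pi P$ with the normalization $\sum_i \pi_i = 1$ and solves the resulting linear system by least squares.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi P together with sum(pi) = 1 for a row-stochastic P."""
    n = P.shape[0]
    # Rows of (P^T - I) encode the balance equations; the last row is the
    # normalization constraint.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Stochastic matrix of Problem 11.4.
P = np.array([[3/4, 1/4, 0.0],
              [0.0, 2/3, 1/3],
              [1/4, 3/4, 0.0]])
pi = stationary_distribution(P)
```

For this matrix the solution agrees with the values derived by hand, $\pi_1 = \pi_3 = 1/5$ and $\pi_2 = 3/5$.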
Problem 11.6
The Metropolis algorithm and the Gibbs sampler are similar in that they both generate a Markov chain with the Gibbs distribution as the equilibrium distribution.
They differ from each other in the following respect: In the Metropolis algorithm, the transition probabilities of the Markov chain are stationary. In contrast, in the Gibbs sampler, they are nonstationary.
Problem 11.7
Simulated annealing algorithm for solving the travelling salesman problem:
1. Set up an annealing schedule for the algorithm.
2. Initialize the algorithm by picking a tour at random.
3. Choose a pair of cities in the tour and then reverse the order in which the cities in between the selected pair are visited. This procedure, illustrated in Figure 1 below, generates new tours in a local manner.
4. Calculate the energy difference due to the reversal of paths applied in step 3.
5. If the energy difference so calculated is negative or zero, accept the new tour. If, on the other hand, it is positive, accept the change in the tour with probability defined in accordance with the Metropolis algorithm.
6. Select another pair of cities, and repeat steps 3 to 5 until the required number of iterations has been completed.
7. Lower the temperature in the annealing schedule, and repeat steps 3 to 6.
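The seven steps above can be sketched as follows. This is a minimal illustration, not a tuned solver: the geometric cooling schedule, the parameter defaults, the use of tour length as the energy, and the names `tour_length` and `anneal_tsp` are all assumptions introduced here.

```python
import math
import random

def tour_length(tour, pts):
    """Energy of a tour: total length of the closed path through pts."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))

def anneal_tsp(pts, t_start=1.0, t_end=1e-3, cooling=0.95,
               iters_per_temp=200, seed=0):
    rng = random.Random(seed)
    tour = list(range(len(pts)))
    rng.shuffle(tour)                        # step 2: random initial tour
    energy = tour_length(tour, pts)
    best, best_energy = tour[:], energy
    temp = t_start                           # step 1: geometric schedule (assumed)
    while temp > t_end:
        for _ in range(iters_per_temp):      # step 6: repeat steps 3 to 5
            i, j = sorted(rng.sample(range(len(tour)), 2))
            # step 3: reverse the segment between the two chosen cities
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            delta = tour_length(cand, pts) - energy      # step 4
            # step 5: Metropolis acceptance rule
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                tour, energy = cand, energy + delta
                if energy < best_energy:
                    best, best_energy = tour[:], energy
        temp *= cooling                      # step 7: lower the temperature
    return best, best_energy
```

Recomputing the full tour length for each candidate keeps the sketch short; a practical implementation would evaluate only the two edges changed by the reversal.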
Problem 11.8
(a) We start with the notion that a neuron j flips from state $x_j$ to $-x_j$ at temperature T with probability
$$P(x_j \to -x_j) = \frac{1}{1 + \exp(\Delta E_j / T)} \qquad (1)$$
Figure 1: Problem 11.7
where $\Delta E_j$ is the energy difference resulting from such a flip. The energy function of the Boltzmann machine is defined by
$$E = -\frac{1}{2} \sum_i \sum_{j,\, i \ne j} w_{ji}\, x_i\, x_j$$
Hence, the energy change produced by neuron j flipping from state $x_j$ to $-x_j$ is
$$\Delta E_j = \left( \text{energy with neuron } j \text{ in state } -x_j \right) - \left( \text{energy with neuron } j \text{ in state } x_j \right)$$
$$= -(-x_j) \sum_i w_{ji} x_i + (x_j) \sum_i w_{ji} x_i = 2 x_j \sum_i w_{ji} x_i = 2 x_j v_j \qquad (2)$$
where $v_j$ is the induced local field of neuron j.
(b) In light of the result in (2), we may rewrite (1) as
$$P(x_j \to -x_j) = \frac{1}{1 + \exp(2 x_j v_j / T)}$$
This means that for an initial state $x_j = -1$, the probability that neuron j is flipped into state +1 is
$$\frac{1}{1 + \exp(-2 v_j / T)} \qquad (3)$$
(c) For an initial state of $x_j = +1$, the probability that neuron j is flipped into state -1 is
$$\frac{1}{1 + \exp(+2 v_j / T)} = 1 - \frac{1}{1 + \exp(-2 v_j / T)} \qquad (4)$$
The flipping probability in (4) and the one in (3) are in perfect agreement with the following probabilistic rule
$$x_j = \begin{cases} +1 & \text{with probability } P(v_j) \\ -1 & \text{with probability } 1 - P(v_j) \end{cases}$$
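where $P(v_j)$ is itself defined by
$$P(v_j) = \frac{1}{1 + \exp(-2 v_j / T)}$$
Problem 11.9
The log-likelihood function L(w) is (see (11.48) of the text)
$$L(w) = \sum_{x_\alpha \in \mathcal{T}} \left( \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log \sum_{x} \exp\left( -\frac{E(x)}{T} \right) \right)$$
Differentiating L(w) with respect to weight $w_{ji}$:
$$\frac{\partial L(w)}{\partial w_{ji}} = -\frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left( \sum_{x_\beta} \frac{\exp(-E(x)/T)}{\sum_{x_\beta} \exp(-E(x)/T)} \frac{\partial E(x)}{\partial w_{ji}} - \sum_{x} \frac{\exp(-E(x)/T)}{\sum_{x} \exp(-E(x)/T)} \frac{\partial E(x)}{\partial w_{ji}} \right)$$
The energy function E(x) is defined by (see (11.39) of the text)
$$E(x) = -\frac{1}{2} \sum_i \sum_{j,\, i \ne j} w_{ji}\, x_i\, x_j$$
Hence,
$$\frac{\partial E(x)}{\partial w_{ji}} = -x_i x_j, \quad i \ne j \qquad (1)$$
We also note the following:
$$P(X_\beta = x_\beta \mid X_\alpha = x_\alpha) = \frac{1}{\sum_{x_\beta} \exp(-E(x)/T)} \exp\left( -\frac{E(x)}{T} \right) \qquad (2)$$
$$P(X = x) = \frac{1}{\sum_{x} \exp(-E(x)/T)} \exp\left( -\frac{E(x)}{T} \right) \qquad (3)$$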
Accordingly, using the formulas of (1) to (3), we may redefine the derivative $\partial L(w) / \partial w_{ji}$ as follows:
$$\frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left( \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_{x} P(X = x)\, x_j x_i \right)$$
which is the desired result.
Problem 11.10
(a) Factoring the transition process from state i to state j into a two-step process, we may express the transition probability $p_{ji}$ as
$$p_{ji} = \tau_{ji}\, q_{ji} \quad \text{for } j \ne i \qquad (1)$$
where $\tau_{ji}$ is the probability that a transition from state j to state i is attempted, and $q_{ji}$ is the conditional probability that the attempt is successful given that it was attempted. When j = i, the property that each row of the stochastic matrix must add to unity implies that
$$p_{ii} = 1 - \sum_{j \ne i} p_{ij} = 1 - \sum_{j \ne i} \tau_{ij}\, q_{ij}$$
(b) We require that the attempt-rate matrix be symmetric:
$$\tau_{ji} = \tau_{ij} \quad \text{for all } i \ne j \qquad (2)$$
and that it satisfies the normalization condition
$$\sum_j \tau_{ji} = 1 \quad \text{for all } i$$
We also require the property of complementary conditional transition probability
$$q_{ji} = 1 - q_{ij} \qquad (3)$$
For a stationary distribution, we have
$$\pi_i = \sum_j \pi_j\, p_{ji} \quad \text{for all } i \qquad (4)$$
Hence, using (1) to (3) in (4):
$$\pi_i = \sum_j \pi_j\, \tau_{ji}\, q_{ji} = \sum_j \pi_j\, \tau_{ji} (1 - q_{ij}) \qquad (5)$$
Next, recognizing that
$$\sum_j p_{ij} = 1 \quad \text{for all } i$$
we may go on to write
$$\pi_i = \pi_i \sum_j p_{ij} = \sum_j \pi_i\, p_{ij} = \sum_j \pi_i\, \tau_{ij}\, q_{ij} \qquad (6)$$
Hence, combining (5) and (6), using the symmetry property of (2), and then rearranging terms:
$$\sum_j \tau_{ji} (\pi_i q_{ij} + \pi_j q_{ij} - \pi_j) = 0 \qquad (7)$$
(c) For $\tau_{ji} \ne 0$ the condition of (7) can only be satisfied if
$$\pi_i q_{ij} + \pi_j q_{ij} - \pi_j = 0$$
which, in turn, means that $q_{ij}$ is defined by
$$q_{ij} = \frac{1}{1 + (\pi_i / \pi_j)} \qquad (8)$$
(d) Make a change of variables:
$$E_i = -T \log \pi_i + T^*$$
where T and $T^*$ are arbitrary constants. We may then express $\pi_i$ in terms of $E_i$ as
$$\pi_i = \frac{1}{Z} \exp\left( -\frac{E_i}{T} \right)$$
where
$$Z = \exp\left( -\frac{T^*}{T} \right)$$
Accordingly, we may reformulate (8) in the new form
$$q_{ij} = \frac{1}{1 + \exp\left( -\frac{1}{T}(E_i - E_j) \right)} = \frac{1}{1 + \exp(-\Delta E / T)} \qquad (9)$$
where $\Delta E = E_i - E_j$. To evaluate the constant Z, we note that
$$\sum_i \pi_i = 1$$
and therefore
$$Z = \sum_i \exp(-E_i / T)$$
(e) The formula of (9) is the only possible distribution for state transitions in the Boltzmann machine; it is recognized as the Gibbs distribution.
Problem 11.11
We start with the Kullback-Leibler divergence
$$D_{p^+ \| p^-} = \sum_\alpha p_\alpha^+ \log\left( \frac{p_\alpha^+}{p_\alpha^-} \right) \qquad (1)$$
The probability distribution $p_\alpha^+$ in the clamped condition is naturally independent of the synaptic weights $w_{ji}$ in the Boltzmann machine, whereas the probability distribution $p_\alpha^-$ is dependent on $w_{ji}$. Hence, differentiating (1) with respect to $w_{ji}$:
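$$\frac{\partial D_{p^+ \| p^-}}{\partial w_{ji}} = -\sum_\alpha \frac{p_\alpha^+}{p_\alpha^-} \frac{\partial p_\alpha^-}{\partial w_{ji}} \qquad (2)$$
To minimize $D_{p^+ \| p^-}$, we use the method of gradient descent:
$$\Delta w_{ji} = -\epsilon \frac{\partial D_{p^+ \| p^-}}{\partial w_{ji}} = \epsilon \sum_\alpha \frac{p_\alpha^+}{p_\alpha^-} \frac{\partial p_\alpha^-}{\partial w_{ji}} \qquad (3)$$
where $\epsilon$ is a positive constant.
Let $p_{\alpha\beta}^-$ denote the joint probability that the visible neurons are in state $\alpha$ and the hidden neurons are in state $\beta$, given that the network is in its free-running condition. We may then write
$$p_\alpha^- = \sum_\beta p_{\alpha\beta}^-$$
Assuming that the network is in thermal equilibrium, we may use the Gibbs distribution
$$p_{\alpha\beta}^- = \frac{1}{Z} \exp\left( -\frac{E_{\alpha\beta}}{T} \right)$$
to write
$$p_\alpha^- = \frac{1}{Z} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \qquad (4)$$
where $E_{\alpha\beta}$ is the energy of the network when the visible neurons are in state $\alpha$ and the hidden neurons are in state $\beta$, and Z is the partition function
$$Z = \sum_\alpha \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right)$$
The energy $E_{\alpha\beta}$ is defined in terms of the synaptic weights by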
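$$E_{\alpha\beta} = -\frac{1}{2} \sum_i \sum_{j,\, i \ne j} w_{ji}\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} \qquad (5)$$
where $x_{i|\alpha\beta}$ is the state of neuron i when the visible neurons are in state $\alpha$ and the hidden neurons are in state $\beta$. Therefore, using (4):
$$\frac{\partial p_\alpha^-}{\partial w_{ji}} = -\frac{1}{ZT} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \frac{\partial E_{\alpha\beta}}{\partial w_{ji}} - \frac{1}{Z^2} \frac{\partial Z}{\partial w_{ji}} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \qquad (6)$$
From (5) we have (remembering that in a Boltzmann machine $w_{ji} = w_{ij}$)
$$\frac{\partial E_{\alpha\beta}}{\partial w_{ji}} = -x_{j|\alpha\beta}\, x_{i|\alpha\beta} \qquad (7)$$
The first term on the right-hand side of (6) is therefore
$$-\frac{1}{ZT} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \frac{\partial E_{\alpha\beta}}{\partial w_{ji}} = \frac{1}{ZT} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) x_{j|\alpha\beta}\, x_{i|\alpha\beta} = \frac{1}{T} \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta}$$
where we have made use of the Gibbs distribution
$$p_{\alpha\beta}^- = \frac{1}{Z} \exp\left( -\frac{E_{\alpha\beta}}{T} \right)$$
as the probability that the visible neurons are in state $\alpha$ and the hidden neurons are in state $\beta$ in the free-running condition. Consider next the second term on the right-hand side of (6). Except for the minus sign, we may express this term as the product of two factors:
$$\frac{1}{Z^2} \frac{\partial Z}{\partial w_{ji}} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) = \left[ \frac{1}{Z} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \right] \left[ \frac{1}{Z} \frac{\partial Z}{\partial w_{ji}} \right] \qquad (8)$$
The first factor in (8) is recognized as the Gibbs distribution defined by
$$p_\alpha^- = \sum_\beta p_{\alpha\beta}^- = \frac{1}{Z} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \qquad (9)$$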
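To evaluate the second factor in (8), we write
$$\frac{1}{Z} \frac{\partial Z}{\partial w_{ji}} = \frac{1}{Z} \frac{\partial}{\partial w_{ji}} \sum_\alpha \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) = -\frac{1}{TZ} \sum_\alpha \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) \frac{\partial E_{\alpha\beta}}{\partial w_{ji}}$$
$$= \frac{1}{TZ} \sum_\alpha \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) x_{j|\alpha\beta}\, x_{i|\alpha\beta} = \frac{1}{T} \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} \qquad (10)$$
Using (9) and (10) in (8):
$$\frac{1}{Z^2} \frac{\partial Z}{\partial w_{ji}} \sum_\beta \exp\left( -\frac{E_{\alpha\beta}}{T} \right) = \frac{p_\alpha^-}{T} \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} \qquad (11)$$
We are now ready to revisit (6) and thus write
$$\frac{\partial p_\alpha^-}{\partial w_{ji}} = \frac{1}{T} \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} - \frac{p_\alpha^-}{T} \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta}$$
We now make the following observations:
1. The sum of probability over the states is unity, that is,
$$\sum_\alpha p_\alpha^+ = 1 \qquad (12)$$
2. The joint probability
$$p_{\alpha\beta}^- = p_{\beta|\alpha}^-\, p_\alpha^- \qquad (13)$$
Similarly
$$p_{\alpha\beta}^+ = p_{\beta|\alpha}^+\, p_\alpha^+ \qquad (14)$$
3. The probability of a hidden state, given some visible state, is naturally the same whether the visible neurons of the network in thermal equilibrium are clamped in that state by the external environment or arrive at that state by free running of the network, as shown by
$$p_{\beta|\alpha}^- = p_{\beta|\alpha}^+ \qquad (15)$$
In light of this relation we may rewrite Eq. (13) as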
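$$p_{\alpha\beta}^- = p_{\beta|\alpha}^+\, p_\alpha^- \qquad (16)$$
Moreover, we may write
$$\left( \frac{p_\alpha^+}{p_\alpha^-} \right) p_{\alpha\beta}^- = p_{\beta|\alpha}^+\, p_\alpha^+ = p_{\alpha\beta}^+ \qquad (17)$$
Accordingly, we may rewrite (3) as follows:
$$\Delta w_{ji} = \frac{\epsilon}{T} \sum_\alpha \frac{p_\alpha^+}{p_\alpha^-} \left( \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} - p_\alpha^- \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} \right)$$
$$= \frac{\epsilon}{T} \left( \sum_\alpha \sum_\beta p_{\alpha\beta}^+\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} - \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta} \right)$$
Define the following terms:
$$\eta = \text{learning-rate parameter} = \frac{\epsilon}{T}$$
$$\rho_{ji}^+ = \langle x_j x_i \rangle^+ = \sum_\alpha \sum_\beta p_{\alpha\beta}^+\, x_{j|\alpha\beta}\, x_{i|\alpha\beta}$$
$$\rho_{ji}^- = \langle x_j x_i \rangle^- = \sum_\alpha \sum_\beta p_{\alpha\beta}^-\, x_{j|\alpha\beta}\, x_{i|\alpha\beta}$$
We may then finally formulate the Boltzmann learning rule as
$$\Delta w_{ji} = \eta (\rho_{ji}^+ - \rho_{ji}^-)$$
Problem 11.12
(a) We start with the relative entropy: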
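$$D_{p^+ \| p^-} = \sum_\alpha \sum_\gamma p_{\alpha\gamma}^+ \log\left( \frac{p_{\alpha\gamma}^+}{p_{\alpha\gamma}^-} \right) \qquad (1)$$
From probability theory, we have
$$p_{\alpha\gamma}^+ = p_{\gamma|\alpha}^+\, p_\alpha^+ \qquad (2)$$
$$p_{\alpha\gamma}^- = p_{\gamma|\alpha}^-\, p_\alpha^- = p_{\gamma|\alpha}^-\, p_\alpha^+ \qquad (3)$$
where, in the last line, we have made use of the fact that the input neurons are always clamped to the environment, which means that
$$p_\alpha^- = p_\alpha^+$$
Substituting (2) and (3) into (1):
$$D_{p^+ \| p^-} = \sum_\alpha p_\alpha^+ \sum_\gamma p_{\gamma|\alpha}^+ \log\left( \frac{p_{\gamma|\alpha}^+}{p_{\gamma|\alpha}^-} \right) \qquad (4)$$
where the state $\alpha$ refers to the input neurons and $\gamma$ refers to the output neurons.
(b) With $p_{\gamma|\alpha}^-$ denoting the conditional probability of finding the output neurons in state $\gamma$, given that the input neurons are in state $\alpha$, we may express the probability distribution of the output states as
$$p_\gamma^- = \sum_\alpha p_{\gamma|\alpha}^-\, p_\alpha^-$$
The conditional $p_{\gamma|\alpha}^-$ is determined by the synaptic weights of the network in accordance with the formula
$$p_{\gamma|\alpha}^- = \frac{1}{Z_{1\alpha}} \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) \qquad (5)$$
where
$$E_{\gamma\beta\alpha} = -\frac{1}{2} \sum_j \sum_i w_{ji} [s_j s_i]_{\gamma\beta\alpha} \qquad (6)$$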
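The parameter $Z_{1\alpha}$ is the partition function:
$$Z_{1\alpha} = \sum_\gamma \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) \qquad (7)$$
The function of the Boltzmann machine is to find the synaptic weights for which the conditional probability $p_{\gamma|\alpha}^-$ approaches the desired value $p_{\gamma|\alpha}^+$.
Applying the gradient method to the relative entropy of (1):
$$\Delta w_{ji} = -\epsilon \frac{\partial D_{p^+ \| p^-}}{\partial w_{ji}} \qquad (8)$$
Using (4) in (8) and recognizing that $p_{\gamma|\alpha}^+$ is determined by the environment (i.e., it is independent of the network), we get
$$\Delta w_{ji} = \epsilon \sum_\alpha p_\alpha^+ \sum_\gamma \frac{p_{\gamma|\alpha}^+}{p_{\gamma|\alpha}^-} \frac{\partial p_{\gamma|\alpha}^-}{\partial w_{ji}} \qquad (9)$$
To evaluate the partial derivative $\partial p_{\gamma|\alpha}^- / \partial w_{ji}$ we use (5) to (7):
$$\frac{\partial p_{\gamma|\alpha}^-}{\partial w_{ji}} = -\frac{1}{Z_{1\alpha}} \left( \frac{1}{T} \right) \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) \frac{\partial E_{\gamma\beta\alpha}}{\partial w_{ji}} - \frac{1}{Z_{1\alpha}^2} \frac{\partial Z_{1\alpha}}{\partial w_{ji}} \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right)$$
$$= \frac{1}{Z_{1\alpha}} \frac{1}{T} \sum_\beta [s_j s_i]_{\gamma\beta\alpha} \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) - \frac{1}{T Z_{1\alpha}^2} \left( \sum_\gamma \sum_\beta [s_j s_i]_{\gamma\beta\alpha} \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) \right) \left( \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) \right) \qquad (10)$$
Next, we recognize the following pair of relations:
$$\frac{1}{Z_{1\alpha}} \sum_\beta \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) = p_{\gamma|\alpha}^- \qquad (11)$$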
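$$\frac{1}{Z_{1\alpha}} \sum_\gamma \sum_\beta [s_j s_i]_{\gamma\beta\alpha} \exp\left( -\frac{E_{\gamma\beta\alpha}}{T} \right) = \langle s_j s_i \rangle_\alpha^- \qquad (12)$$
where the term $\langle s_j s_i \rangle_\alpha^-$ is the averaged correlation of the states $s_j$ and $s_i$ with the input neurons clamped to state $\alpha$ and the network in a free-running condition. Substituting (11) and (12) in (10):
$$\frac{\partial p_{\gamma|\alpha}^-}{\partial w_{ji}} = \frac{1}{T} \left( \sum_\beta [s_j s_i]_{\gamma\beta\alpha}\, p_{\gamma\beta|\alpha}^- - \langle s_j s_i \rangle_\alpha^-\, p_{\gamma|\alpha}^- \right) \qquad (13)$$
where $p_{\gamma\beta|\alpha}^- = \exp(-E_{\gamma\beta\alpha}/T) / Z_{1\alpha}$. Next, substituting (13) into (9):
$$\Delta w_{ji} = \frac{\epsilon}{T} \sum_\alpha p_\alpha^+ \left\{ \sum_\gamma \sum_\beta [s_j s_i]_{\gamma\beta\alpha}\, p_{\gamma\beta|\alpha}^- \left( \frac{p_{\gamma|\alpha}^+}{p_{\gamma|\alpha}^-} \right) - \langle s_j s_i \rangle_\alpha^- \sum_\gamma p_{\gamma|\alpha}^+ \right\} \qquad (14)$$
We now recognize that
$$\sum_\gamma p_{\gamma|\alpha}^+ = 1 \quad \text{for all } \alpha \qquad (15)$$
$$\sum_\gamma \sum_\beta [s_j s_i]_{\gamma\beta\alpha}\, p_{\gamma\beta|\alpha}^- \left( \frac{p_{\gamma|\alpha}^+}{p_{\gamma|\alpha}^-} \right) = \sum_\gamma \sum_\beta p_{\gamma\beta|\alpha}^+ [s_j s_i]_{\gamma\beta\alpha} = \langle s_j s_i \rangle_\alpha^+ \qquad (16)$$
Accordingly, substituting (15) and (16) into (14):
$$\Delta w_{ji} = \frac{\epsilon}{T} \sum_\alpha p_\alpha^+ \left( \langle s_j s_i \rangle_\alpha^+ - \langle s_j s_i \rangle_\alpha^- \right) = \eta \sum_\alpha p_\alpha^+ \left( \rho_{ji|\alpha}^+ - \rho_{ji|\alpha}^- \right)$$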
where $\eta = \epsilon / T$; and $\rho_{ji|\alpha}^+$ and $\rho_{ji|\alpha}^-$ are the averaged correlations in the clamped and free-running conditions, given that the input neurons are in state $\alpha$.
Problem 11.15
Consider the expected distortion (energy)
$$E = \sum_x \sum_j P(x \in C_j)\, d(x, y_j) \qquad (1)$$
where $d(x, y_j)$ is the distortion measure for representing the data point x by the vector $y_j$, and $P(x \in C_j)$ is the probability that x belongs to the cluster of points represented by $y_j$. To determine the association probabilities at a given expected distortion, we maximize the entropy subject to the constraint of (1). For a fixed $Y = \{y_j\}$, we assume that the association probabilities of different data points are independent. We may thus express the entropy as
$$H = -\sum_x \sum_j P(x \in C_j) \log P(x \in C_j) \qquad (2)$$
The probability distribution that maximizes the entropy under the expectation constraint is the Gibbs distribution:
$$P(x \in C_j) = \frac{1}{Z_x} \exp\left( -\frac{1}{T} d(x, y_j) \right) \qquad (3)$$
where
$$Z_x = \sum_j \exp\left( -\frac{1}{T} d(x, y_j) \right)$$
is the partition function. The inverse temperature B = 1/T is the Lagrange multiplier defined by the value of E in (1).
Problem 11.16
(a) The free energy is
$$F = D - TH \qquad (1)$$
where D is the expected distortion, T is the temperature, and H is the conditional entropy. The expected distortion is defined by
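$$D = \sum_x P(X = x) \sum_y P(Y = y \mid X = x)\, d(x, y) \qquad (2)$$
The conditional entropy is defined by
$$H(Y|X) = -\sum_x P(X = x) \sum_y P(Y = y \mid X = x) \log P(Y = y \mid X = x) \qquad (3)$$
The minimizing P(Y = y | X = x) is itself defined by the Gibbs distribution:
$$P(Y = y \mid X = x) = \frac{1}{Z_x} \exp\left( -\frac{d(x, y)}{T} \right) \qquad (4)$$
where
$$Z_x = \sum_y \exp\left( -\frac{d(x, y)}{T} \right) \qquad (5)$$
is the partition function. Substituting (2) to (5) into (1), we get
$$F^* = \sum_x P(X = x) \sum_y \frac{1}{Z_x} \exp\left( -\frac{d(x, y)}{T} \right) d(x, y) + T \sum_x P(X = x) \sum_y \frac{1}{Z_x} \exp\left( -\frac{d(x, y)}{T} \right) \left( -\log Z_x - \frac{1}{T} d(x, y) \right)$$
$$= -T \sum_x P(X = x) \sum_y \frac{1}{Z_x} \exp\left( -\frac{d(x, y)}{T} \right) \log Z_x$$
This result simplifies as follows by virtue of the definition given in (5) for the partition function:
$$F^* = -T \sum_x P(X = x) \log Z_x \qquad (6)$$
(b) Differentiating the minimum free energy $F^*$ of (6) with respect to y:
$$\frac{\partial F^*}{\partial y} = -T \sum_x P(X = x) \frac{1}{Z_x} \frac{\partial Z_x}{\partial y} \qquad (7)$$
Using the definition of $Z_x$ given in (5), we write:
$$\frac{\partial Z_x}{\partial y} = -\frac{1}{T} \exp\left( -\frac{d(x, y)}{T} \right) \frac{\partial d(x, y)}{\partial y} \qquad (8)$$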
Hence, we may rewrite (7) as
$$\frac{\partial F^*}{\partial y} = \sum_x P(X = x) \frac{1}{Z_x} \exp\left( -\frac{d(x, y)}{T} \right) \frac{\partial d(x, y)}{\partial y} = \sum_x P(X = x)\, P(Y = y \mid X = x) \frac{\partial d(x, y)}{\partial y} \qquad (9)$$
where use has been made of (4). Noting that
$$P(X = x, Y = y) = P(Y = y \mid X = x)\, P(X = x)$$
we may then state that the condition for minimizing the Lagrangian with respect to y is
$$\sum_x P(X = x, Y = y) \frac{\partial d(x, y)}{\partial y} = 0 \quad \text{for all } y \qquad (10)$$
Normalizing this result with respect to P(X = x) we get the minimizing condition:
$$\sum_x P(Y = y \mid X = x) \frac{\partial d(x, y)}{\partial y} = 0 \quad \text{for all } y \qquad (11)$$
(c) Consider the squared Euclidean distortion
$$d(x, y) = \|x - y\|^2 = (x - y)^T (x - y)$$
for which we have
$$\frac{\partial d(x, y)}{\partial y} = \frac{\partial}{\partial y} (x - y)^T (x - y) = -2(x - y) \qquad (12)$$
For this particular measure we find it more convenient to normalize (10) with respect to the probability
$$P(Y = y) = \sum_x P(X = x, Y = y)$$
We may then write the minimizing condition with respect to y as
$$\sum_x P(X = x \mid Y = y) \frac{\partial d(x, y)}{\partial y} = 0 \qquad (13)$$
Using (12) in (13) and solving for y, we get the desired minimizing solution
$$y = \frac{\sum_{x} P(X = x \mid Y = y)\, x}{\sum_{x} P(X = x \mid Y = y)} \tag{14}$$
which is recognized as the formula for a centroid.
Problem 11.17
The advantage of deterministic annealing over maximum likelihood is that it does not make any
assumption on the underlying probability distribution of the data.
Problem 11.18
(a) Let
$$\varphi_k(x) = \exp\left(-\frac{1}{2\sigma^2} \|x - t_k\|^2\right), \quad k = 1, 2, \ldots, K$$
where $t_k$ is the center or prototype vector of the kth radial basis function and K is the number of
such functions (i.e., hidden units). Define the normalized radial basis function
$$P_k(x) = \frac{\varphi_k(x)}{\sum_{k} \varphi_k(x)}$$
The average squared cost over the training set is
$$d = \frac{1}{N} \sum_{i=1}^{N} \|y_i - F(x_i)\|^2 \tag{1}$$
where $F(x_i)$ is the output vector of the RBF network in response to the input $x_i$. The Gibbs
distribution for $P(x \in R)$ is
$$P(x \in R) = \frac{1}{Z_x} \exp\left(-\frac{d}{T}\right) \tag{2}$$
where d is defined in (1) and
$$Z_x = \sum_{y_i} \exp\left(-\frac{d}{T}\right) \tag{3}$$
(b) The Lagrangian for minimizing the average misclassification cost is
$$F = d - TH$$
where the average squared cost d is defined in (1), and the entropy H is defined by
$$H = -\sum_{j} p(j \mid x) \log p(j \mid x)$$
where $p(j \mid x)$ is the probability of associating class j at the output of the RBF network with the
input x.
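The free-energy identity and the centroid condition of Problem 11.6 can be checked numerically. The following is a minimal sketch, assuming a squared Euclidean distortion and a small finite data set; the function names (`gibbs_association`, `free_energy_two_ways`, `centroid_update`) and the test data are illustrative assumptions, not part of the original solution.

```python
import numpy as np

def gibbs_association(x, y, T):
    """Return P(Y=y_j | X=x_i), log Z_x and d(x_i, y_j) for squared Euclidean distortion."""
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)   # d[i, j] = ||x_i - y_j||^2
    logits = -d / T
    m = logits.max(axis=1, keepdims=True)                     # shift for numerical stability
    log_Z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))  # log of the partition function
    P = np.exp(logits - log_Z[:, None])                       # Gibbs distribution
    return P, log_Z, d

def free_energy_two_ways(x, y, T, p_x):
    """Compute F = D - T*H directly and via F* = -T * sum_x P(x) log Z_x."""
    P, log_Z, d = gibbs_association(x, y, T)
    D = (p_x[:, None] * P * d).sum()                 # expected distortion
    H = -(p_x[:, None] * P * np.log(P)).sum()        # conditional entropy
    return D - T * H, -T * (p_x * log_Z).sum()

def centroid_update(x, y, T, p_x):
    """Centroid condition: y_j = sum_x P(x | y_j) x, written with the joint P(x, y_j)."""
    P, _, _ = gibbs_association(x, y, T)
    joint = p_x[:, None] * P                         # P(X = x_i, Y = y_j)
    return (joint.T @ x) / joint.sum(axis=0)[:, None]
```

At a very high temperature the association probabilities become nearly uniform, so every code vector is pulled toward the overall data mean; lowering T lets the centroids separate.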
CHAPTER 12
Dynamic Programming
Problem 12.1
As the discount factor $\gamma$ approaches 1, the computation of the cost-to-go function $J^{\mu}(i)$ becomes
longer because of the corresponding increase in the number of time steps involved in the
computation.
Problem 12.2
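(a) Let $\pi$ be an arbitrary policy, and suppose that this policy chooses action $a \in A_i$ at time step 0.
We may then write
$$J^{\pi}(i) = \sum_{a \in A_i} p_a \left( c(i, a) + \gamma \sum_{j=1}^{N} p_{ij}(a) W^{\pi}(j) \right)$$
where $p_a$ is the probability of choosing action a, c(i, a) is the expected cost, $p_{ij}(a)$ is the
probability of transition from state i to state j under action a, and $W^{\pi}(j)$ is the expected cost-to-go from time step 1 onward, given that the state at that time step is j. Since
$$W^{\pi}(j) \geq J(j)$$
it follows that
$$J^{\pi}(i) \geq \sum_{a \in A_i} p_a \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right) \geq \sum_{a \in A_i} p_a \min_{a \in A_i} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right) = \min_{a} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right)$$
Since the policy $\pi$ is arbitrary, we have
$$J(i) \geq \min_{a \in A_i} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right) \tag{1}$$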
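(b) Suppose we next go the other way by choosing $a_0$ with
$$c(i, a_0) + \gamma \sum_{j} p_{ij}(a_0) J(j) = \min_{a} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right) \tag{2}$$
Let $\pi$ be the policy that chooses $a_0$ at time step 0 and, if the next state is j, the process is viewed as
originating in state j following a policy $\pi_j$ such that
$$J^{\pi_j}(j) \leq J(j) + \epsilon$$
where $\epsilon$ is a small positive number. Hence
$$J^{\pi}(i) = c(i, a_0) + \gamma \sum_{j=1}^{N} p_{ij}(a_0) J^{\pi_j}(j) \leq c(i, a_0) + \gamma \sum_{j=1}^{N} p_{ij}(a_0) J(j) + \gamma \epsilon \tag{3}$$
Since $J(i) \leq J^{\pi}(i)$, (3) implies that
$$J(i) \leq c(i, a_0) + \gamma \sum_{j=1}^{N} p_{ij}(a_0) J(j) + \gamma \epsilon$$
Hence, using (2), we have
$$J(i) \leq \min_{a \in A_i} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J(j) \right) + \gamma \epsilon$$
Combining this result with (1), and noting that $\epsilon$ is arbitrarily small, we obtain
$$J^*(i) = \min_{a} \left( c(i, a) + \gamma \sum_{j} p_{ij}(a) J^*(j) \right)$$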
Problem 12.3
$$J^{\mu} = c(\mu) + \gamma P(\mu) J^{\mu} \tag{1}$$
where
$$J^{\mu} = [J^{\mu}(1), J^{\mu}(2), \ldots, J^{\mu}(N)]^T$$
$$c(\mu) = [c(1, \mu), c(2, \mu), \ldots, c(N, \mu)]^T$$
$$P(\mu) = \begin{bmatrix} p_{11}(\mu) & p_{12}(\mu) & \cdots & p_{1N}(\mu) \\ p_{21}(\mu) & p_{22}(\mu) & \cdots & p_{2N}(\mu) \\ \vdots & \vdots & & \vdots \\ p_{N1}(\mu) & p_{N2}(\mu) & \cdots & p_{NN}(\mu) \end{bmatrix}$$
Rearranging terms in (1), we may write
$$(I - \gamma P(\mu)) J^{\mu} = c(\mu)$$
where I is the N-by-N identity matrix. For the solution $J^{\mu}$ to be unique we require that the N-by-N
matrix $(I - \gamma P(\mu))$ have an inverse matrix for all possible values of the discount factor $\gamma$.
Problem 12.4
Consider an admissible policy $\{\mu_0, \mu_1, \ldots\}$, a positive integer K, and cost-to-go function J. Let the
costs of the first K stages be accumulated, and add the terminal cost $\gamma^K J(X_K)$, thereby obtaining
the total expected cost
$$E\left[ \gamma^K J(X_K) + \sum_{n=0}^{K-1} \gamma^n g(X_n, \mu_n(X_n), X_{n+1}) \right]$$
where E is the expectational operator. To minimize the total expected cost, we start with $\gamma^K J(X_K)$
and perform K iterations of the dynamic programming algorithm, as shown by
$$J_n(X_n) = \min_{\mu_n} E\left[ \gamma^n g(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1}) \right] \tag{1}$$
with the initial condition
$$J_K(X) = \gamma^K J(X)$$
Now consider the function $V_n$ defined by
$$V_n(X) = \frac{J_{K-n}(X)}{\gamma^{K-n}} \quad \text{for all } n \text{ and } X \tag{2}$$
For n = K, the function $V_K(X)$ is the optimal K-stage cost $J_0(X)$. Hence, the dynamic programming algorithm
of (1) can be rewritten in terms of the function $V_n(X)$ as follows:
$$V_{n+1}(X_0) = \min_{\mu} E\left[ g(X_0, \mu(X_0), X_1) + \gamma V_n(X_1) \right]$$
with the initial condition
$$V_0(X) = J(X)$$
which has the same mathematical form as that specified in the problem.
Problem 12.5
An important property of dynamic programming is the monotonicity property described by
$$J^{\mu_{n+1}} \leq J^{\mu_n}$$
This property follows from the fact that if the terminal cost $g_K$ for K stages is changed to a
uniformly larger cost $\bar{g}_K$, that is,
$$\bar{g}_K(X_K) \geq g_K(X_K) \quad \text{for all } X_K,$$
then the last-stage cost-to-go function $J_{K-1}(X_{K-1})$ will be uniformly increased. In more general
terms, we may state the following.
Given two cost-to-go functions $J_{K+1}$ and $\bar{J}_{K+1}$ with
$$\bar{J}_{K+1}(X_{K+1}) \geq J_{K+1}(X_{K+1}) \quad \text{for all } X_{K+1},$$
we find that for all $X_K$ and $\mu_K$ the following relation holds
$$E\left[ g_K(X_K, \mu_K, X_{K+1}) + \bar{J}_{K+1}(X_{K+1}) \right] \geq E\left[ g_K(X_K, \mu_K, X_{K+1}) + J_{K+1}(X_{K+1}) \right]$$
This relation merely restates the monotonicity property of the dynamic programming algorithm.
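The policy-evaluation equation $(I - \gamma P(\mu)) J^{\mu} = c(\mu)$ is a linear system, so for a small problem it can be solved directly. The sketch below assumes a hypothetical two-state chain; the transition matrix `P`, cost vector `c`, and the helper name `evaluate_policy` are illustrative assumptions.

```python
import numpy as np

def evaluate_policy(P, c, gamma):
    """Solve (I - gamma * P) J = c for the cost-to-go vector J of a fixed policy."""
    N = P.shape[0]
    return np.linalg.solve(np.eye(N) - gamma * P, c)

# Hypothetical example: rows of P are transition probabilities under the policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
c = np.array([1.0, 2.0])          # hypothetical expected one-step costs
J = evaluate_policy(P, c, gamma=0.9)
```

For a row-stochastic P and a discount factor strictly below 1, the matrix $I - \gamma P$ is invertible, which is why `np.linalg.solve` succeeds here; at $\gamma = 1$ this is no longer guaranteed.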
Problem 12.6
According to (12.24) of the text, the Q-factor for state-action pair (i, a) and stationary policy $\mu$
satisfies the condition
$$Q^{\mu}(i, \mu(i)) = \min_{a} Q^{\mu}(i, a) \quad \text{for all } i$$
This equation emphasizes the fact that the policy $\mu$ is greedy with respect to the cost-to-go
function $J^{\mu}(i)$.
Problem 12.7
Figure 1, shown below, presents an interesting interpretation of the policy iteration algorithm. In
this interpretation, the policy evaluation step is viewed as the work of a critic that evaluates the
performance of the current policy; that is, it calculates an estimate of the cost-to-go function $J^{\mu_n}$.
The policy improvement step is viewed as the work of a controller or actor that accounts for the
latest evaluation made by the critic and acts out the improved policy $\mu_{n+1}$. In short, the critic looks
after policy evaluation and the controller (actor) looks after policy improvement, and the iteration
between them goes on.
[Figure 1: Problem 12.7. Block diagram: the Environment supplies the state i to the Critic and to the Controller (Actor); the Critic passes the cost-to-go $J^{\mu_n}$ to the Controller (Actor), which applies the action $\mu_{n+1}(i)$ to the Environment.]
Problem 12.8
From (12.29) in the text, we find that for each possible state, the value iteration algorithm requires
NM iterations, where N is the number of states and M is the number of admissible actions. Hence,
the total number of iterations for all N states is $N^2 M$.
Problem 12.9
To reformulate the value-iteration algorithm in terms of Q-factors, the only change we need to
make is in step 2 of Table 12.2 in the text. Specifically, we rewrite this step as follows:
For n = 0, 1, 2, ..., compute
$$Q(i, a) = c(i, a) + \gamma \sum_{j=1}^{N} p_{ij}(a) J_n(j)$$
$$J_{n+1}(i) = \min_{a} Q(i, a)$$
Problem 12.10
The policy-iteration algorithm alternates between two steps: policy evaluation, and policy
improvement. In other words, an optimal policy is computed directly in the policy-iteration
algorithm. In contrast, the value-iteration algorithm does not compute a policy directly.
Another point of difference is that in policy iteration the cost-to-go function is recomputed
on each iteration of the algorithm. This burdensome computational difficulty is avoided in the
value-iteration algorithm.
Problem 12.14
From the definition of Q-factor given in (12.24) in the text and Bellman's optimality equation
(12.11), we immediately see that
$$J^*(i) = \min_{a} Q(i, a)$$
where the minimization is performed over all possible actions a.
Problem 12.15
The value-iteration algorithm requires knowledge of the state-transition probabilities. In contrast,
Q-learning operates without this knowledge. But through an interactive process, Q-learning learns
estimates of the transition probabilities in an implicit manner. Recognizing the intimate
relationship between value iteration and Q-learning, we may therefore view Q-learning as an
adaptive version of the value-iteration algorithm.
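Value iteration expressed through Q-factors, as in Problems 12.9 and 12.14, can be sketched in a few lines. The array layout (`c[i, a]` for costs, `P[a, i, j]` for transition probabilities), the function name, and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def value_iteration_q(c, P, gamma, n_iter=500):
    """Value iteration written in terms of Q-factors.

    c[i, a]    : expected cost of action a in state i
    P[a, i, j] : probability of moving from state i to state j under action a
    """
    N, M = c.shape
    J = np.zeros(N)
    for _ in range(n_iter):
        Q = c + gamma * np.einsum("aij,j->ia", P, J)   # Q(i,a) = c(i,a) + gamma * sum_j p_ij(a) J_n(j)
        J = Q.min(axis=1)                               # J_{n+1}(i) = min_a Q(i,a)
    Q = c + gamma * np.einsum("aij,j->ia", P, J)
    return J, Q, Q.argmin(axis=1)                       # cost-to-go, Q-factors, greedy policy

# Hypothetical two-state, two-action problem.
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
J, Q, policy = value_iteration_q(c, P, gamma=0.9)
```

The greedy policy is read off from the Q-factors at the end; the iteration itself never stores a policy, in line with the contrast drawn with policy iteration.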
Problem 12.16
Using Table 12.4 in the text, we may construct the signal-flow graph in Figure 1 for the
Q-learning algorithm:
[Figure 1: Problem 12.16. Signal-flow graph with four blocks: compute optimum action, compute target Q-factor, update Q-factor, and unit delay. Signals: $Q_n(i_n, a, w)$, $a_n$, $Q^{\text{target}}$, and $Q_{n+1}(i_{n+1}, a, w)$.]
Problem 12.17
The whole point of the Q-learning algorithm is that it eliminates the need for knowing the
state-transition probabilities. If knowledge of the state-transition probabilities is available, then the
Q-learning algorithm assumes the same form as the value-iteration algorithm.
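The link between Q-learning and value iteration can be illustrated with a tabular update of the usual form, target = cost + discounted minimum Q-factor of the next state. The sketch below is an assumption-laden illustration, not the table from the text: it uses a hypothetical deterministic environment (`next_state`, `cost`) and a learning rate of 1, in which case sweeping over all state-action pairs is simply an in-place value iteration on the Q-factors.

```python
import numpy as np

def q_learning_step(Q, i, a, cost, j, gamma, eta):
    """One tabular Q-learning update for an observed transition (i, a) -> j."""
    target = cost + gamma * Q[j].min()                  # target Q-factor
    Q[i, a] = (1.0 - eta) * Q[i, a] + eta * target      # move Q(i, a) toward the target
    return Q

def run_deterministic_sweeps(cost, next_state, gamma, n_sweeps=400):
    """Sweep all (state, action) pairs with eta = 1 in a deterministic environment."""
    N, M = cost.shape
    Q = np.zeros((N, M))
    for _ in range(n_sweeps):
        for i in range(N):
            for a in range(M):
                q_learning_step(Q, i, a, cost[i, a], next_state[i, a], gamma, eta=1.0)
    return Q

# Hypothetical deterministic two-state problem: action a always leads to state a.
next_state = np.array([[0, 1], [0, 1]])
cost = np.array([[1.0, 2.0], [0.5, 3.0]])
Q = run_deterministic_sweeps(cost, next_state, gamma=0.9)
```

In a stochastic environment the same `q_learning_step` would be applied to sampled transitions with a small learning rate, so that the averaging over next states happens implicitly.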
CHAPTER 13
Neurodynamics
Problem 13.1
The equilibrium state x(0) is (asymptotically) stable if in a small neighborhood around x(0), there
exists a positive definite function V(x) such that its derivative with respect to time is negative
definite in that region.
Problem 13.3
Consider the system of coupled nonlinear differential equations:
$$\frac{dx_j}{dt} = \varphi_j(W, i, x), \quad j = 1, 2, \ldots, N$$
where W is the weight matrix, i is the bias vector, and x is the state vector with its jth element
denoted by $x_j$.
(a) With the bias vector i treated as input and with fixed initial condition x(0), let $x(\infty)$ denote the
final state vector of the system. Then,
$$0 = \varphi_j(W, i, x(\infty)), \quad j = 1, 2, \ldots, N$$
For a given matrix W and input vector i, the set of initial points x(0) evolves to a fixed point. The
fixed points are functions of W and i. Thus, the system acts as a mapper with i as input and $x(\infty)$
as output, as shown in Fig. 1(a).
(b) With the initial state vector x(0) treated as input, and the bias vector i being fixed, let $x(\infty)$
denote the final state vector of the system. We may then write
$$0 = \varphi_j(W, i\text{: fixed}, x(\infty)), \quad j = 1, 2, \ldots, N$$
Thus with x(0) acting as input and $x(\infty)$ acting as output, the dynamic system behaves like a
pattern associator, as shown in Fig. 1(b).
[Figure 1: Problem 13.3. (a) Mapper with input i and output $x(\infty)$; W and x(0) fixed. (b) Pattern associator with input x(0) and output $x(\infty)$; W and i fixed.]
Problem 13.4
(a) We are given the fundamental memories:
$$\xi_1 = [+1, +1, +1, +1, +1]^T, \quad \xi_2 = [+1, -1, -1, +1, -1]^T, \quad \xi_3 = [-1, +1, -1, +1, +1]^T$$
The weight matrix of the Hopfield network (with N = 5 and p = 3) is therefore
$$W = \frac{1}{N} \sum_{i=1}^{p} \xi_i \xi_i^T - \frac{p}{N} I = \frac{1}{5} \begin{bmatrix} 0 & -1 & +1 & +1 & -1 \\ -1 & 0 & +1 & +1 & +3 \\ +1 & +1 & 0 & -1 & +1 \\ +1 & +1 & -1 & 0 & +1 \\ -1 & +3 & +1 & +1 & 0 \end{bmatrix}$$
(b) According to the alignment condition, we write
$$\operatorname{sgn}(W \xi_i) = \xi_i, \quad i = 1, 2, 3$$
Consider first $\xi_1$, for which we have
$$\operatorname{sgn}(W \xi_1) = \operatorname{sgn}\left( \frac{1}{5} [0, +4, +2, +2, +4]^T \right) = [+1, +1, +1, +1, +1]^T = \xi_1$$
Similarly, for $\xi_2$ and $\xi_3$ we have
$$\operatorname{sgn}(W \xi_2) = \operatorname{sgn}\left( \frac{1}{5} [+2, -4, -2, 0, -4]^T \right) = [+1, -1, -1, +1, -1]^T = \xi_2$$
$$\operatorname{sgn}(W \xi_3) = \operatorname{sgn}\left( \frac{1}{5} [-2, +4, 0, +2, +4]^T \right) = [-1, +1, -1, +1, +1]^T = \xi_3$$
Thus all three fundamental memories satisfy the alignment condition.
Note: Wherever a particular element of the product $W \xi_i$ is zero, the neuron in question is left in
its previous state.
(c) Consider the noisy probe:
$$x = [+1, -1, +1, +1, +1]^T$$
which is the fundamental memory $\xi_1$ with its second element reversed in polarity. We write
$$Wx = \frac{1}{5} \begin{bmatrix} 0 & -1 & +1 & +1 & -1 \\ -1 & 0 & +1 & +1 & +3 \\ +1 & +1 & 0 & -1 & +1 \\ +1 & +1 & -1 & 0 & +1 \\ -1 & +3 & +1 & +1 & 0 \end{bmatrix} \begin{bmatrix} +1 \\ -1 \\ +1 \\ +1 \\ +1 \end{bmatrix} = \frac{1}{5} \begin{bmatrix} +2 \\ +4 \\ 0 \\ 0 \\ -2 \end{bmatrix} \tag{1}$$
Therefore,
$$\operatorname{sgn}(Wx) = [+1, +1, +1, +1, -1]^T$$
Thus, neurons 2 and 5 want to change their states. We therefore have 2 options:
Neuron 2 is chosen for a state change, which yields the result
$$x = [+1, +1, +1, +1, +1]^T$$
This vector is recognized as the fundamental memory $\xi_1$, and the computation is thereby
terminated.
Neuron 5 is chosen to change its state, yielding the vector
$$x = [+1, -1, +1, +1, -1]^T$$
Next, we go on to compute
$$Wx = \frac{1}{5} [+4, -2, -2, -2, -2]^T, \quad \operatorname{sgn}(Wx) = [+1, -1, -1, -1, -1]^T$$
Hence, neurons 3 and 4 want to change their states:
If we permit neuron 3 to change its state from +1 to -1, we get
$$x = [+1, -1, -1, +1, -1]^T$$
which is recognized as the fundamental memory $\xi_2$.
If we permit neuron 4 to change its state from +1 to -1, we get
$$x = [+1, -1, +1, -1, -1]^T$$
which is recognized as the negative of the third fundamental memory $\xi_3$.
In both cases, the new state would satisfy the alignment condition and the computation is then
terminated.
Thus, when the noisy version of $\xi_1$ is applied to the network, with its second element
changed in polarity, one of 2 things can happen with equal likelihood:
1. The original $\xi_1$ is recovered after 1 iteration.
2. The second fundamental memory $\xi_2$ or the negative of the third fundamental memory $\xi_3$ is
recovered after 2 iterations, which, of course, is in error.
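The arithmetic of Problem 13.4 is easy to reproduce. The following is a minimal sketch using the three fundamental memories that appear in the alignment checks; it works with the integer matrix $N W$ (named `W_scaled` here, an illustrative choice) so that zero induced fields are detected exactly, and the helper names are assumptions.

```python
import numpy as np

xi = np.array([[+1, +1, +1, +1, +1],
               [+1, -1, -1, +1, -1],
               [-1, +1, -1, +1, +1]])
p, N = xi.shape
W_scaled = xi.T @ xi - p * np.eye(N, dtype=int)   # equals N * W; same signs as W
W = W_scaled / N                                   # weight matrix itself

def sgn_keep(v, prev):
    """Signum that leaves a neuron in its previous state when its induced field is zero."""
    return np.where(v > 0, 1, np.where(v < 0, -1, prev))

def async_candidates(x):
    """0-based indices of neurons that would flip under an asynchronous update."""
    return np.flatnonzero(sgn_keep(W_scaled @ x, x) != x)

probe = np.array([+1, -1, +1, +1, +1])            # first memory with element 2 reversed
first = async_candidates(probe)                    # neurons that want to change first
second = async_candidates(np.array([+1, -1, +1, +1, -1]))
```

Indices are 0-based, so index 1 corresponds to neuron 2 and index 4 to neuron 5.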
Problem 13.5
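Given the probe vector
$$x = [+1, -1, +1, +1, +1]^T$$
and the weight matrix of (1) in Problem 13.4, we find that
$$Wx = \frac{1}{5} [+2, +4, 0, 0, -2]^T$$
and
$$\operatorname{sgn}(Wx) = [+1, +1, +1, +1, -1]^T$$
According to this result, neurons 2 and 5 have changed their states. In synchronous updating, this
is permitted. Thus, with the new state vector
$$x = [+1, +1, +1, +1, -1]^T$$
on the next iteration, we compute
$$Wx = \frac{1}{5} \begin{bmatrix} 0 & -1 & +1 & +1 & -1 \\ -1 & 0 & +1 & +1 & +3 \\ +1 & +1 & 0 & -1 & +1 \\ +1 & +1 & -1 & 0 & +1 \\ -1 & +3 & +1 & +1 & 0 \end{bmatrix} \begin{bmatrix} +1 \\ +1 \\ +1 \\ +1 \\ -1 \end{bmatrix} = \frac{1}{5} \begin{bmatrix} +2 \\ -2 \\ 0 \\ 0 \\ +4 \end{bmatrix}$$
Hence,
$$\operatorname{sgn}(Wx) = [+1, -1, +1, +1, +1]^T$$
The new state vector is therefore
$$x = [+1, -1, +1, +1, +1]^T$$
which is recognized as the original probe. In this problem, we thus find that the network
experiences a limit cycle of duration 2.
Problem 13.6
(a) The vectors
$$-\xi_1 = [-1, -1, -1, -1, -1]^T, \quad -\xi_2 = [-1, +1, +1, -1, +1]^T, \quad -\xi_3 = [+1, -1, +1, -1, -1]^T$$
are simply the negatives of the three fundamental memories considered in Problem 13.4,
respectively. These 3 vectors are therefore also fundamental memories of the Hopfield network.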
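The limit cycle found under synchronous updating, and the stability of the negated memories, can both be confirmed with a short script. This is a sketch under the same assumptions as the problem (five neurons, three fundamental memories, a neuron keeps its state when its induced field is zero); the names `sync_step` and `W_scaled` are illustrative.

```python
import numpy as np

xi = np.array([[+1, +1, +1, +1, +1],
               [+1, -1, -1, +1, -1],
               [-1, +1, -1, +1, +1]])
p, N = xi.shape
W_scaled = xi.T @ xi - p * np.eye(N, dtype=int)    # N * W, integer valued

def sync_step(x):
    """Update all neurons at once; a zero field leaves the neuron unchanged."""
    v = W_scaled @ x
    return np.where(v > 0, 1, np.where(v < 0, -1, x))

x0 = np.array([+1, -1, +1, +1, +1])                # probe of Problem 13.5
x1 = sync_step(x0)
x2 = sync_step(x1)                                 # returns to the probe: cycle of length 2
negatives_stable = all((sync_step(-v) == -v).all() for v in xi)
```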
1
x
1, +1, +1, +1, +1
T
=
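Problem 13.7
The weight matrix of the network is
$$W = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix}$$
For the state $s_2 = [-1, +1]^T$ we have
$$W s_2 = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} -1 \\ +1 \end{bmatrix} = \begin{bmatrix} -1 \\ +1 \end{bmatrix}$$
which yields
$$\operatorname{sgn}(W s_2) = [-1, +1]^T = s_2$$
For the state $s_4 = [+1, -1]^T$ we have
$$W s_4 = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} +1 \\ -1 \end{bmatrix} = \begin{bmatrix} +1 \\ -1 \end{bmatrix}$$
which yields
$$\operatorname{sgn}(W s_4) = [+1, -1]^T = s_4$$
Thus, both states $s_2$ and $s_4$ satisfy the alignment condition and are therefore stable.
Consider next the state $s_1 = [+1, +1]^T$, for which we write
$$W s_1 = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} +1 \\ +1 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$$
which yields
$$\operatorname{sgn}(W s_1) = [-1, -1]^T \neq s_1$$
Thus, both neurons want to change; suppose we pick neuron 1 to change its state, yielding the
new state vector $[-1, +1]^T$. This is a stable vector as it satisfies the alignment condition. If,
however, we permit neuron 2 to change its state, we get a state vector equal to $s_4$. Similarly, we
may show that the state vector $s_3 = [-1, -1]^T$ is also unstable. The resulting state-transition diagram
of the network is thus as depicted in Fig. 1.
[Figure 1: Problem 13.7. State-transition diagram in the $(x_1, x_2)$ plane with the four states (1, 1), (-1, 1), (-1, -1), and (1, -1).]
The results depicted in Fig. 1 assume the use of asynchronous updating. If, however, we
use synchronous updating, we find that in the case of $s_1$:
$$\operatorname{sgn}(W s_1) = [-1, -1]^T$$
Permitting both neurons to change state, we get the new state vector $[-1, -1]^T$. This is recognized
to be state $s_3$. Now, we find that
$$W s_3 = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} +1 \\ +1 \end{bmatrix}$$
which takes us back to state $s_1$.
Thus, in the synchronous updating case, the states $s_1$ and $s_3$ represent a limit cycle with
length 2.
Returning to the normal operation of the Hopfield network, we note that the energy
function of the network is
$$E = -\frac{1}{2} \sum_{i} \sum_{j,\, j \neq i} w_{ji} s_i s_j = -\frac{1}{2} w_{12} s_1 s_2 - \frac{1}{2} w_{21} s_2 s_1 = -w_{12} s_1 s_2 = s_1 s_2 \tag{1}$$
since $w_{12} = w_{21} = -1$.
Evaluating (1) for all possible states of the network, we get the following table:
State        Energy
[+1, +1]     +1
[-1, +1]     -1
[-1, -1]     +1
[+1, -1]     -1
Thus, states $s_2$ and $s_4$ represent global minima and are therefore stable.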
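The energy table for the two-neuron network can be generated by enumerating all four states. The sketch below assumes the weight matrix with off-diagonal entries of -1 used in the alignment checks; the dictionary `energies` and the function name are illustrative.

```python
import itertools
import numpy as np

W = np.array([[0, -1],
              [-1, 0]])

def energy(s):
    """Hopfield energy E = -(1/2) * s^T W s (the diagonal of W is zero)."""
    s = np.asarray(s)
    return -0.5 * float(s @ W @ s)

# Energy of every state of the two-neuron network.
energies = {s: energy(s) for s in itertools.product([+1, -1], repeat=2)}
minima = [s for s, e in energies.items() if e == min(energies.values())]
```

The two states with opposite signs have the lowest energy, which matches the states that satisfy the alignment condition.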
Problem 13.8
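The energy function of the Hopfield network is
$$E = -\frac{1}{2} \sum_{i} \sum_{j} w_{ji} s_j s_i \tag{1}$$
The overlap $m_v$ is defined by
$$m_v = \frac{1}{N} \sum_{j} s_j \xi_{v,j} \tag{2}$$
and the weight $w_{ji}$ is itself defined by
$$w_{ji} = \frac{1}{N} \sum_{v} \xi_{v,j} \xi_{v,i} \tag{3}$$
Substituting (3) into (1) yields
$$E = -\frac{1}{2N} \sum_{v} \sum_{i} \sum_{j} \xi_{v,j} \xi_{v,i} s_j s_i$$
$$= -\frac{1}{2N} \sum_{v} \left( \sum_{i} s_i \xi_{v,i} \right) \left( \sum_{j} s_j \xi_{v,j} \right)$$
$$= -\frac{1}{2N} \sum_{v} (m_v N)(m_v N)$$
$$= -\frac{N}{2} \sum_{v} m_v^2$$
where, in the third line, we made use of (2).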
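The identity between the energy function and the overlaps can be spot-checked on random data. This is a minimal sketch assuming bipolar patterns and the weight definition with the diagonal terms retained, as in the substitution above; the sizes, seed, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 3                                   # neurons and stored patterns (illustrative sizes)
xi = rng.choice([-1, 1], size=(M, N))          # xi[v, j]: element j of pattern v
W = xi.T @ xi / N                              # w_ji = (1/N) * sum_v xi[v, j] * xi[v, i]
s = rng.choice([-1, 1], size=N)                # an arbitrary network state

E_direct = -0.5 * s @ W @ s                    # E = -(1/2) sum_i sum_j w_ji s_j s_i
m = xi @ s / N                                 # overlaps m_v = (1/N) sum_j s_j xi[v, j]
E_overlap = -0.5 * N * np.sum(m ** 2)          # E = -(N/2) sum_v m_v^2
```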
Problem 13.11
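We start with the function (see (13.48) of the text)
$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} c_{ji} \varphi_i(u_i) \varphi_j(u_j) - \sum_{j=1}^{N} \int_0^{u_j} b_j(\lambda) \varphi_j'(\lambda)\, d\lambda \tag{1}$$
where $\varphi_j'(\cdot)$ is the derivative of the function $\varphi_j(\cdot)$ with respect to its argument. We now
differentiate the function E with respect to time t and note the following relations:
1. $c_{ji} = c_{ij}$
2. $\dfrac{\partial \varphi_j(u_j)}{\partial t} = \dfrac{\partial u_j}{\partial t} \dfrac{\partial \varphi_j(u_j)}{\partial u_j} = \dfrac{\partial u_j}{\partial t} \varphi_j'(u_j)$
3. $\dfrac{\partial}{\partial t} \displaystyle\int_0^{u_j} b_j(\lambda) \varphi_j'(\lambda)\, d\lambda = \dfrac{\partial u_j}{\partial t} \dfrac{\partial}{\partial u_j} \displaystyle\int_0^{u_j} b_j(\lambda) \varphi_j'(\lambda)\, d\lambda = \dfrac{\partial u_j}{\partial t} b_j(u_j) \varphi_j'(u_j)$
Accordingly, we may use (1) to express the derivative $\partial E / \partial t$ as follows:
$$\frac{\partial E}{\partial t} = \sum_{j=1}^{N} \frac{\partial u_j}{\partial t} \left( \sum_{i=1}^{N} c_{ji} \varphi_i(u_i) \varphi_j'(u_j) - b_j(u_j) \varphi_j'(u_j) \right) \tag{2}$$
From Eq. (13.47) in the text, we have
$$\frac{\partial u_j}{\partial t} = a_j(u_j) \left( b_j(u_j) - \sum_{i=1}^{N} c_{ji} \varphi_i(u_i) \right), \quad j = 1, 2, \ldots, N \tag{3}$$
Hence using (3) in (2) and collecting terms, we get the final result
$$\frac{\partial E}{\partial t} = -\sum_{j=1}^{N} a_j(u_j) \varphi_j'(u_j) \left( b_j(u_j) - \sum_{i=1}^{N} c_{ji} \varphi_i(u_i) \right)^2 \tag{4}$$
Provided that the coefficient $a_j(u_j)$ satisfies the nonnegativity condition
$$a_j(u_j) \geq 0 \quad \text{for all } u_j$$
and the function $\varphi_j(u_j)$ satisfies the monotonicity condition
$$\varphi_j'(u_j) \geq 0 \quad \text{for all } u_j,$$
we then immediately see from (4) that
$$\frac{\partial E}{\partial t} \leq 0 \quad \text{for all } t$$
In words, the function E defined in (1) is the Lyapunov function for the coupled system of
nonlinear differential equations (3).
Problem 13.12
From (13.61) of the text:
$$\frac{d}{dt} v_j(t) = -v_j(t) + \sum_{i=1}^{N} c_{ji} \varphi(v_i), \quad j = 1, 2, \ldots, N \tag{1}$$
where
$$c_{ji} = \delta_{ji} + \beta w_{ji}$$
where $\delta_{ji}$ is a Kronecker delta. According to the Cohen-Grossberg theorem of (13.47) in the text,
we have
$$\frac{d}{dt} u_j(t) = a_j(u_j) \left( b_j(u_j) - \sum_{i=1}^{N} c_{ji} \varphi_i(u_i) \right) \tag{2}$$
Comparison of (1) and (2) yields the following correspondences between the Cohen-Grossberg
theorem and the brain-state-in-a-box (BSB) model:
Cohen-Grossberg Theorem      BSB Model
$u_j$                        $v_j$
$a_j(u_j)$                   1
$b_j(u_j)$                   $-v_j$
$c_{ji}$                     $-c_{ji}$
$\varphi_i(u_i)$             $\varphi(v_i)$
Therefore, using these correspondences in (13.48) of the text:
$$E = \frac{1}{2} \sum_{i} \sum_{j} c_{ji} \varphi_i(u_i) \varphi_j(u_j) - \sum_{j} \int_0^{u_j} b_j(\lambda) \varphi_j'(\lambda)\, d\lambda,$$
we get the following Lyapunov function for the BSB model:
$$E = -\frac{1}{2} \sum_{i} \sum_{j} c_{ji} \varphi(v_i) \varphi(v_j) + \sum_{j} \int_0^{v_j} \lambda \varphi'(\lambda)\, d\lambda \tag{3}$$
From (13.55) in the text, we note that
$$\varphi(y_j) = \begin{cases} +1 & \text{if } y_j > 1 \\ y_j & \text{if } -1 \leq y_j \leq 1 \\ -1 & \text{if } y_j < -1 \end{cases}$$
We therefore have
$$\varphi'(y_j) = \begin{cases} 0, & |y_j| > 1 \\ 1, & |y_j| \leq 1 \end{cases}$$
Hence, the second term of (3) is given by
$$\sum_{j} \int_0^{v_j} \lambda \varphi'(\lambda)\, d\lambda = \sum_{j} \int_0^{v_j} \lambda\, d\lambda = \frac{1}{2} \sum_{j} v_j^2 = \frac{1}{2} \sum_{j} x_j^2 \quad \text{inside the linear region} \tag{4}$$
The first term of (3) is given by
$$-\frac{1}{2} \sum_{i} \sum_{j} c_{ji} \varphi(v_i) \varphi(v_j) = -\frac{1}{2} \sum_{i} \sum_{j} (\delta_{ji} + \beta w_{ji}) \varphi(v_i) \varphi(v_j) = -\frac{\beta}{2} \sum_{i} \sum_{j} w_{ji} x_j x_i - \frac{1}{2} \sum_{j} \varphi^2(v_j) = -\frac{\beta}{2} \sum_{i} \sum_{j} w_{ji} x_j x_i - \frac{1}{2} \sum_{j} x_j^2 \tag{5}$$
Finally, substituting (4) and (5) into (3), we obtain
$$E = -\frac{\beta}{2} \sum_{i} \sum_{j} w_{ji} x_j x_i = -\frac{\beta}{2} x^T W x$$
which is the desired result.
Problem 13.13
The activation function $\varphi(v)$ of Fig. P13.13 is a nonmonotonic function of the argument v; that is,
$\varphi'(v)$ assumes both positive and negative values. It therefore violates the monotonicity
condition required by the Cohen-Grossberg theorem; see Eq. (4) of Problem 13.11. This means
that the Cohen-Grossberg theorem is not applicable to an associative memory like a Hopfield
network that uses the activation function of Fig. P13.13.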
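The cancellation that produces the BSB energy in Problem 13.12 can be verified numerically inside the linear region. The sketch below assumes the piecewise-linear activation clipped to [-1, 1] and $c_{ji} = \delta_{ji} + \beta w_{ji}$; the random weight matrix, the value of `beta`, and the function names are illustrative assumptions.

```python
import numpy as np

def bsb_energy_from_terms(W, v, beta):
    """Sum of the two terms of the Lyapunov function, valid for |v_j| <= 1."""
    x = np.clip(v, -1.0, 1.0)                 # x_j = phi(v_j), piecewise-linear activation
    C = np.eye(len(v)) + beta * W             # c_ji = delta_ji + beta * w_ji
    first = -0.5 * x @ C @ x                  # -(1/2) sum_i sum_j c_ji phi(v_i) phi(v_j)
    second = 0.5 * np.sum(v ** 2)             # integral term inside the linear region
    return first + second

def bsb_energy_closed_form(W, v, beta):
    """Closed form E = -(beta/2) x^T W x."""
    x = np.clip(v, -1.0, 1.0)
    return -0.5 * beta * x @ W @ x

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
W = 0.5 * (A + A.T)                           # symmetric weights (illustrative)
v = rng.uniform(-1.0, 1.0, size=4)            # a state inside the linear region
E1 = bsb_energy_from_terms(W, v, beta=0.3)
E2 = bsb_energy_closed_form(W, v, beta=0.3)
```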
CHAPTER 15
Dynamically Driven Recurrent Networks
Problem 15.1
Referring to the simple recurrent neural network of Fig. 15.3, let the vector u(n) denote the input
signal, the vector x(n) denote the signal produced at the output of the hidden layer, and the vector
y(n) denote the output signal of the whole network. Then, treating x(n) as the state of the
network, we may describe the state-space model of the network as follows:
$$x(n+1) = f(x(n), u(n))$$
$$y(n) = g(x(n))$$
where $f(\cdot)$ and $g(\cdot)$ are vector-valued functions of their respective arguments.
Problem 15.2
Referring to the recurrent MLP of Fig. 15.4, we note the following:
$$x_I(n+1) = f_1(x_I(n), u(n)) \tag{1}$$
$$x_{II}(n+1) = f_2(x_{II}(n), x_I(n+1)) \tag{2}$$
$$x_0(n+1) = f_3(x_0(n), x_{II}(n+1)) \tag{3}$$
where $f_1(\cdot)$, $f_2(\cdot)$, and $f_3(\cdot)$ are vector-valued functions of their respective arguments.
Substituting (1) into (2), we write
$$x_{II}(n+1) = f_2(x_{II}(n), f_1(x_I(n), u(n))) \tag{4}$$
Define the state of the system at time n as
$$x(n) = \begin{bmatrix} x_{II}(n+1) \\ x_{II}(n) \\ x_0(n-1) \end{bmatrix} \tag{5}$$
Then, from (4) and (5) we immediately see that
$$x(n+1) = f(x(n), u(n)) \tag{6}$$
where f is a new vector-valued function. Define the output of the system as
$$y(n) = x_0(n) \tag{7}$$
With $x_0(n)$ included in the definition of the state x(n+1) and with x(n) dependent on the input
u(n), we thus have
$$y(n) = g(x(n), u(n)) \tag{8}$$
where $g(\cdot, \cdot)$ is another vector-valued function. Equations (6) and (8) define the state-space model
of the recurrent MLP.
Problem 15.3
It is indeed possible for a dynamic system to be controllable but unobservable, and vice versa.
This statement is justified by virtue of the fact that the conditions for controllability and
observability are entirely different, which means that there are situations where the conditions are
satisfied for one and not for the other.
Problem 15.4
(a) We are given the process equation
$$x(n+1) = \varphi(W_a x(n) + w_b u(n))$$
Hence, iterating forward in time, we write
$$x(n+2) = \varphi(W_a x(n+1) + w_b u(n+1)) = \varphi(W_a \varphi(W_a x(n) + w_b u(n)) + w_b u(n+1))$$
$$x(n+3) = \varphi(W_a x(n+2) + w_b u(n+2)) = \varphi(W_a \varphi(W_a \varphi(W_a x(n) + w_b u(n)) + w_b u(n+1)) + w_b u(n+2))$$
and so on. By induction, we may state that the state x(n+q) is a nested nonlinear function of x(n)
and $u_q(n)$, where
$$u_q(n) = [u(n), u(n+1), \ldots, u(n+q-1)]^T$$
(b) The Jacobian of x(n+q) with respect to $u_q(n)$, at the origin, is
$$J_q(n) = \left. \frac{\partial x(n+q)}{\partial u_q(n)} \right|_{x(n) = 0,\; u(n) = 0}$$
As an illustrative example, consider the case of q = 3. The Jacobian of x(n+3) with respect to
$u_3(n)$ is
$$J_3(n) = \left. \left[ \frac{\partial x(n+3)}{\partial u(n)}, \frac{\partial x(n+3)}{\partial u(n+1)}, \frac{\partial x(n+3)}{\partial u(n+2)} \right] \right|_{x(n) = 0,\; u(n) = 0}$$
From the defining equation of x(n+3), we find that
$$\frac{\partial x(n+3)}{\partial u(n)} = \varphi'(0) W_a \varphi'(0) W_a \varphi'(0) w_b = A A b = A^2 b$$
$$\frac{\partial x(n+3)}{\partial u(n+1)} = \varphi'(0) W_a \varphi'(0) w_b = A b$$
$$\frac{\partial x(n+3)}{\partial u(n+2)} = \varphi'(0) w_b = b$$
where $A = \varphi'(0) W_a$ and $b = \varphi'(0) w_b$. All these partial derivatives have been evaluated at x(n) = 0 and u(n) = 0. The Jacobian $J_3(n)$ is
therefore
$$J_3(n) = [A^2 b, A b, b]$$
We may generalize this result by writing
$$J_q(n) = [A^{q-1} b, A^{q-2} b, \ldots, A b, b]$$
Problem 15.5
We start with the state-space model
$$x(n+1) = \varphi(W_a x(n) + w_b u(n))$$
$$y(n) = c^T x(n) \tag{1}$$
where c is a column vector. We thus write
$$y(n+1) = c^T x(n+1) = c^T \varphi(W_a x(n) + w_b u(n)) \tag{2}$$
$$y(n+2) = c^T x(n+2) = c^T \varphi(W_a \varphi(W_a x(n) + w_b u(n)) + w_b u(n+1)) \tag{3}$$
and so on. By induction, we may therefore state that y(n+q) is a nested nonlinear function of x(n)
and $u_q(n)$, where
$$u_q(n) = [u(n), u(n+1), \ldots, u(n+q-1)]^T$$
Define the q-by-1 vector
$$y_q(n) = [y(n), y(n+1), \ldots, y(n+q-1)]^T$$
The Jacobian of $y_q(n)$ with respect to x(n), evaluated at the origin, is defined by
$$J_q(n) = \left. \frac{\partial y_q^T(n)}{\partial x(n)} \right|_{x(n) = 0,\; u(n) = 0}$$
As an illustrative example, consider the case of q = 3, for which we have
$$J_3(n) = \left. \left[ \frac{\partial y(n)}{\partial x(n)}, \frac{\partial y(n+1)}{\partial x(n)}, \frac{\partial y(n+2)}{\partial x(n)} \right] \right|_{x(n) = 0,\; u(n) = 0}$$
From (1), we readily find that
$$\frac{\partial y(n)}{\partial x(n)} = c$$
From (2), we find that
$$\frac{\partial y(n+1)}{\partial x(n)} = c (\varphi'(0) W_a)^T = c A^T$$
From (3), we finally find that
$$\frac{\partial y(n+2)}{\partial x(n)} = c (\varphi'(0) W_a)^T (\varphi'(0) W_a)^T = c A^T A^T = c (A^T)^2$$
All these partial derivatives have been evaluated at the origin. We thus write
$$J_3(n) = [c, c A^T, c (A^T)^2]$$
By induction, we may now state that the Jacobian $J_q(n)$ for observability is, in general,
$$J_q(n) = [c, c A^T, c (A^T)^2, \ldots, c (A^T)^{q-1}]$$
where c is a column vector and $A = \varphi'(0) W_a$.
Problem 15.6
We are given a nonlinear dynamic system described by
$$x(n+1) = f(x(n), u(n)) \tag{1}$$
Suppose x(n) is N-dimensional and u(n) is m-dimensional. Define a new nonlinear dynamic
system in which the input is of additive form, as shown by
$$x'(n+1) = f'(x'(n)) + u'(n) \tag{2}$$
where
$$x'(n) = \begin{bmatrix} x(n) \\ u(n-1) \end{bmatrix} \tag{3}$$
$$u'(n) = \begin{bmatrix} 0 \\ u(n) \end{bmatrix} \tag{4}$$
and
$$f'(x'(n)) = \begin{bmatrix} f(x(n), u(n)) \\ 0 \end{bmatrix} \tag{5}$$
Both $x'(n)$ and $u'(n)$ are (N + m)-dimensional, and the first N elements of $u'(n)$ are zero. From
these definitions, we readily see that
$$x'(n+1) = \begin{bmatrix} x(n+1) \\ u(n) \end{bmatrix} = \begin{bmatrix} f(x(n), u(n)) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ u(n) \end{bmatrix}$$
which is in perfect agreement with the description of the original nonlinear dynamic system
defined in (1).
Problem 15.7
(a) The state-space model of the local activation feedback system of Fig. P15.7a depends on how
the linear dynamic component is described. For example, we may define the input as
$$z(n) = \begin{bmatrix} x(n-1) \\ B u(n) \end{bmatrix} \tag{1}$$
where B is a (p-1)-by-(p-1) matrix and
$$u(n) = [u(n), u(n-1), \ldots, u(n-p+2)]^T$$
Let w denote the synaptic weight vector of the single neuron in Fig. P15.7a, with $w_1$ being the first
element and $w_0$ denoting the rest. We may then write
$$x(n) = w^T z(n) + b = [w_1, w_0^T] \begin{bmatrix} x(n-1) \\ B u(n) \end{bmatrix} + b = w_1 x(n-1) + B' u'(n) \tag{2}$$
where
$$u'(n) = \begin{bmatrix} u(n) \\ 1 \end{bmatrix}$$
and
$$B' = [w_0^T B, b]$$
The output y(n) is defined by
$$y(n) = \varphi(x(n)) \tag{3}$$
Equations (2) and (3) define the state-space model of Fig. P15.7a, assuming that its linear
dynamic component is described by (1).
(b) Consider next the local output feedback system of Fig. P15.7b. Let the linear dynamic
component of this system be described by (1). The output of the whole system in Fig. P15.7b is
then defined by
$$x(n) = \varphi(w^T z(n) + b) = \varphi\left( [w_1, w_0^T] \begin{bmatrix} x(n-1) \\ B u(n) \end{bmatrix} + b \right) = \varphi(w_1 x(n-1) + B' u'(n)) \tag{4}$$
where $w_1$, $w_0$, $B'$, and $u'(n)$ are all as defined previously. The output y(n) of Fig. P15.7b is
$$y(n) = x(n) \tag{5}$$
Equations (4) and (5) define the state-space model of the local output feedback system of Fig.
P15.7b, assuming that its linear dynamic component is described by (1).
The process (state) equation of the local feedback system of Fig. P15.7a is linear but its
measurement equation is nonlinear, and conversely for the local feedback system of Fig. P15.7b.
These two local feedback systems are controllable and observable, because they both satisfy the
conditions for controllability and observability.
Problem 15.8
We start with the state equation
$$x(n+1) = \varphi(W_a x(n) + w_b u(n))$$
Hence, we write
$$x(n+2) = \varphi(W_a x(n+1) + w_b u(n+1)) = \varphi(W_a \varphi(W_a x(n) + w_b u(n)) + w_b u(n+1))$$
$$x(n+3) = \varphi(W_a x(n+2) + w_b u(n+2)) = \varphi(W_a \varphi(W_a \varphi(W_a x(n) + w_b u(n)) + w_b u(n+1)) + w_b u(n+2))$$
and so on.
By induction, we may now state that x(n+q) is a nested nonlinear function of x(n) and $u_q(n)$, and
thus write
$$x(n+q) = g(x(n), u_q(n))$$
where g is a vector-valued function, and
$$u_q(n) = [u(n), u(n+1), \ldots, u(n+q-1)]^T$$
By definition, the output is correspondingly given by
$$y(n+q) = c^T x(n+q) = c^T g(x(n), u_q(n)) = \Phi(x(n), u_q(n))$$
where $\Phi$ is a new scalar-valued nonlinear function.
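For the state-space model $x(n+1) = \varphi(W_a x(n) + w_b u(n))$, $y(n) = c^T x(n)$ used in Problems 15.4, 15.5, and 15.8, the local controllability and observability matrices can be assembled directly and compared against a finite-difference Jacobian of the unrolled recursion. The sketch below assumes $\varphi = \tanh$ (so $\varphi'(0) = 1$) and hypothetical values for `W_a`, `w_b`, and `c`; the observability columns are formed as $(A^T)^k c$, which is the dimensionally consistent reading of the products written above as $c(A^T)^k$.

```python
import numpy as np

def local_matrices(W_a, w_b, c, q, phi_prime0=1.0):
    """Linearized controllability and observability matrices about the origin."""
    A = phi_prime0 * W_a
    b = phi_prime0 * w_b
    # Controllability: [A^(q-1) b, ..., A b, b]
    M_c = np.column_stack([np.linalg.matrix_power(A, k) @ b for k in range(q - 1, -1, -1)])
    # Observability: [c, A^T c, ..., (A^T)^(q-1) c]
    M_o = np.column_stack([np.linalg.matrix_power(A.T, k) @ c for k in range(q)])
    return M_c, M_o

def unroll(W_a, w_b, inputs, x0):
    """Iterate x(n+1) = tanh(W_a x(n) + w_b u(n)) over a sequence of scalar inputs."""
    x = np.array(x0, dtype=float)
    for u in inputs:
        x = np.tanh(W_a @ x + w_b * u)
    return x

# Hypothetical two-state example.
W_a = np.array([[0.5, -0.2],
                [0.1, 0.3]])
w_b = np.array([1.0, 0.5])
c = np.array([1.0, -1.0])
M_c, M_o = local_matrices(W_a, w_b, c, q=3)

# Central-difference Jacobian of x(n+3) with respect to [u(n), u(n+1), u(n+2)] at the origin.
eps = 1e-6
J_fd = np.column_stack([
    (unroll(W_a, w_b, eps * e, np.zeros(2)) - unroll(W_a, w_b, -eps * e, np.zeros(2))) / (2 * eps)
    for e in np.eye(3)
])
```

A full-rank controllability (observability) matrix for q equal to the state dimension indicates local controllability (observability) about the origin; `np.linalg.matrix_rank` can be used for that check.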
Problem 15.11
Consider a state-space model described by
$$x(n+1) = f(x(n), u(n)) \tag{1}$$
$$y(n) = g(x(n)) \tag{2}$$
Using (1), we may readily write
$$x(n) = f(x(n-1), u(n-1))$$
$$x(n-1) = f(x(n-2), u(n-2))$$
$$x(n-2) = f(x(n-3), u(n-3))$$
and so on. Accordingly, the simple recurrent network of Fig. 15.3 may be unfolded in time as
follows:
[Figure: Problem 15.11. Network unfolded in time: x(n-3) and u(n-3) feed $f(\cdot)$ to give x(n-2); x(n-2) and u(n-2) feed $f(\cdot)$ to give x(n-1); x(n-1) and u(n-1) feed $f(\cdot)$ to give x(n); x(n) feeds $g(\cdot)$ to give y(n).]
Problem 15.12
The local gradient for the hybrid form of the BPTT algorithm is given by
$$\delta_j(l) = \begin{cases} \varphi'(v_j(l))\, e_j(l) & \text{for } l = n \\ \varphi'(v_j(l)) \left( e_j(l) + \sum_{k} w_{kj}(l)\, \delta_k(l+1) \right) & \text{for } n - h' < l < n \\ \varphi'(v_j(l)) \sum_{k} w_{kj}(l)\, \delta_k(l+1) & \text{for } n - h < l \leq n - h' \end{cases}$$
where $h'$ is the number of additional steps taken before performing the next BPTT computation,
with $h' < h$.
Problem 15.13
(a) The nonlinear state dynamics of the real-time recurrent learning algorithm described in
(15.48) and (15.52) of the text may be reformulated in the equivalent form:
$$\frac{\partial y_j(n+1)}{\partial w_{kl}(n)} = \varphi'(v_j(n)) \left( \sum_{i \in A \cup B} w_{ji}(n) \frac{\partial \xi_i(n)}{\partial w_{kl}(n)} + \delta_{kj}\, \xi_l(n) \right) \tag{1}$$
where $\delta_{kj}$ is the Kronecker delta and $y_j(n+1)$ is the output of neuron j at time n+1. For a
teacher-forced recurrent network, we have
$$\xi_i(n) = \begin{cases} u_i(n) & \text{if } i \in A \\ d_i(n) & \text{if } i \in C \\ y_i(n) & \text{if } i \in B - C \end{cases} \tag{2}$$
Hence, substituting (2) into (1), we get
$$\frac{\partial y_j(n+1)}{\partial w_{kl}(n)} = \varphi'(v_j(n)) \left( \sum_{i \in B - C} w_{ji}(n) \frac{\partial y_i(n)}{\partial w_{kl}(n)} + \delta_{kj}\, \xi_l(n) \right) \tag{3}$$
(b) Let
$$\pi_{kl}^{j}(n) = \frac{\partial y_j(n)}{\partial w_{kl}(n)}$$
Provided that the learning-rate parameter $\eta$ is small enough, we may put
$$\pi_{kl}^{j}(n+1) = \frac{\partial y_j(n+1)}{\partial w_{kl}(n+1)} \approx \frac{\partial y_j(n+1)}{\partial w_{kl}(n)}$$
Under this condition, we may rewrite (3) as follows:
$$\pi_{kl}^{j}(n+1) = \varphi'(v_j(n)) \left( \sum_{i \in B - C} w_{ji}(n)\, \pi_{kl}^{i}(n) + \delta_{kj}\, \xi_l(n) \right) \tag{4}$$
This nonlinear state equation is the centerpiece of the RTRL algorithm using teacher forcing.