Solution Manual to accompany
Pattern Classification (2nd ed.)
David G. Stork

Preface
In writing this Solution Manual I have learned a very important lesson. As a
student, I thought that the best way to master a subject was to go to a superb
university and study with an established expert. Later, I realized instead that the
best way was to teach a course on the subject. Yet later, I was convinced that the
best way was to write a detailed and extensive textbook. Now I know that all these
years I have been wrong: in fact the best way to master a subject is to write the
Solution Manual.
In solving the problems for this Manual I have been forced to confront myriad
technical details that might have tripped up the unsuspecting student. Students and
teachers can thank me for simplifying or screening out problems that required pages of
unenlightening calculations. Occasionally I had to go back to the text and delete the
word "easily" from problem references that read "it can easily be shown (Problem ...)."
Throughout, I have tried to choose data or problem conditions that are particularly
instructive. In solving these problems, I have found errors in early drafts of this text
(as well as errors in books by other authors and even in classic refereed papers), and
thus the accompanying text has been improved for the writing of this Manual.
I have tried to make the problem solutions self-contained and self-explanatory. I
have gone to great lengths to ensure that the solutions are correct and clearly presented;
many have been reviewed by students in several classes. Surely there are errors and
typos in this manuscript, but rather than editing and rechecking these solutions over
months or even years, I thought it best to distribute the Manual, however flawed,
as early as possible. I accept responsibility for these inevitable errors, and humbly
ask anyone finding them to contact me directly. (Please, however, do not ask me to
explain a solution or help you solve a problem!) It should be a small matter to change
the Manual for future printings, and you should contact the publisher to check that
you have the most recent version. Notice, too, that this Manual contains a list of
known typos and errata in the text which you might wish to photocopy and distribute
to students.
I have tried to be thorough in order to help students, even to the occasional fault of
verbosity. You will notice that several problems carry the simple instructions "explain
your answer in words" and "graph your results." These were added for students to gain
intuition and a deeper understanding. Graphing per se is hardly an intellectual challenge,
but if the student graphs functions, he or she will develop intuition and remember the
problem and its results better. Furthermore, when the student later sees graphs of
data from dissertation or research work, the link to the homework problem and the
material in the text will be more readily apparent. Note that due to the vagaries
of automatic typesetting, figures may appear on pages after their reference in this
Manual; be sure to consult the full solution to any problem.
I have also included worked examples as well as sample final exams with solutions to
cover material in the text. I distribute a list of important equations (without descriptions)
with the exam so students can focus on understanding and using equations, rather than
memorizing them. I also include on every final exam one problem verbatim from a
homework, taken from the book. I find this motivates students to review their homework
assignments carefully, and allows somewhat more difficult problems to be included.
These will be updated and expanded; thus if you have exam questions you
find particularly appropriate, and would like to share them, please send a copy (with
solutions) to me.
It should be noted, too, that a set of overhead transparency masters of the figures
from the text is available to faculty adopters. I have found these to be invaluable
for lecturing, and I put a set on reserve in the library for students. The files can be
accessed through a standard web browser or an ftp client program at the Wiley STM
ftp area at:
ftp://ftp.wiley.com/public/sci_tech_med/pattern/
or from a link on the Wiley Electrical Engineering software supplements page at:
http://www.wiley.com/products/subject/engineering/electrical/software_supplem_elec_eng.html
I have taught from the text (in various stages of completion) at the University
of California at Berkeley (Extension Division) and in three Departments at Stanford University: Electrical Engineering, Statistics and Computer Science. Numerous
students and colleagues have made suggestions. Especially noteworthy in this regard are Sudeshna Adak, Jian An, SungHyuk Cha, Koichi Ejiri, Rick Guadette,
John Heumann, Travis Kopp, Yaxin Liu, Yunqian Ma, Sayan Mukherjee, Hirobumi
Nishida, Erhan Oztop, Steven Rogers, Charles Roosen, Sergio Bermejo Sanchez, Godfried Toussaint, Namrata Vaswani, Mohammed Yousuf and Yu Zhong. Thanks too
go to Dick Duda who gave several excellent suggestions.
I would greatly appreciate notices of any errors in this Manual or the text itself. I
would be especially grateful for solutions to problems not yet solved. Please send any
such information to me at the below address. I will incorporate them into subsequent
releases of this Manual.
This Manual is for the use of educators and must not be distributed in bulk to
students in any form. Short excerpts may be photocopied and distributed, but only
in conjunction with the use of Pattern Classification (2nd ed.).
I wish you all the best of luck in teaching and research.
David G. Stork
Contents

Preface
1 Introduction
Worked examples
Chapter 1
Introduction
Problem Solutions
There are neither problems nor computer exercises in Chapter 1.
Chapter 2
Bayesian Decision Theory
Problem Solutions
Section 2.1
1. Equation 7 in the text states

P(error|x) = min[P(ω₁|x), P(ω₂|x)].

(a) We assume, without loss of generality, that for a given particular x we have
P(ω₂|x) ≥ P(ω₁|x), and thus P(error|x) = P(ω₁|x). We have, moreover, the
normalization condition P(ω₁|x) = 1 − P(ω₂|x). Together these imply P(ω₂|x) > 1/2,
or 2P(ω₂|x) > 1, and

2P(ω₂|x)P(ω₁|x) > P(ω₁|x) = P(error|x).

This is true at every x, and hence the integrals obey

∫ 2P(ω₂|x)P(ω₁|x) dx ≥ ∫ P(error|x) dx.

In short, 2P(ω₂|x)P(ω₁|x) provides an upper bound for P(error|x).

(b) From part (a), we have that P(ω₂|x) > 1/2, but in the current conditions this is
not guaranteed to exceed 1/α for α < 2. Take as an example α = 4/3 and P(ω₁|x) = 0.4,
and hence P(ω₂|x) = 0.6. In this case, P(error|x) = 0.4. Moreover, we have

αP(ω₁|x)P(ω₂|x) = (4/3) × 0.6 × 0.4 = 0.32 < P(error|x).

Thus αP(ω₁|x)P(ω₂|x) with α < 2 does not provide an upper bound for all values of P(ω₁|x).
(c) Let P(error|x) = P(ω₁|x). In that case, for all x we have

P(ω₁|x)P(ω₂|x) < P(ω₁|x) = P(error|x),

and integrating both sides over x preserves the inequality:

∫ P(ω₁|x)P(ω₂|x) dx < ∫ P(error|x) dx.
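As a quick numerical check (not part of the original solution), the bound in part (a) and the counterexample in part (b) can be verified on a grid of posterior values; the sketch below assumes nothing beyond the two-category setup above.

```python
# Check: 2*P(w1|x)*P(w2|x) bounds min[P(w1|x), P(w2|x)] everywhere,
# while alpha*P(w1|x)*P(w2|x) with alpha < 2 does not.
def error_bound_gap(alpha, n=999):
    """Smallest value of alpha*p*(1-p) - min(p, 1-p) over a grid of p."""
    gaps = []
    for i in range(1, n):
        p = i / n                      # P(w1|x); P(w2|x) = 1 - p
        p_error = min(p, 1.0 - p)      # Eq. 7 in the text
        gaps.append(alpha * p * (1.0 - p) - p_error)
    return min(gaps)

# alpha = 2: the product is an upper bound everywhere (gap >= 0).
assert error_bound_gap(2.0) >= 0.0
# alpha = 4/3: at p = 0.4 the bound fails, exactly as in part (b).
alpha, p = 4.0 / 3.0, 0.4
assert alpha * p * (1 - p) < min(p, 1 - p)   # 0.32 < 0.40
```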
2. We consider densities of the form p(x|ωᵢ) ∝ e^{−|x−aᵢ|/bᵢ}. The normalization
condition

∫_{−∞}^{aᵢ} e^{(x−aᵢ)/bᵢ} dx + ∫_{aᵢ}^{∞} e^{−(x−aᵢ)/bᵢ} dx = 2bᵢ

shows that the properly normalized density is

p(x|ωᵢ) = (1/(2bᵢ)) e^{−|x−aᵢ|/bᵢ}.

For the parameter values given in the problem (consistent with the branches below,
a₁ = 1, b₁ = 2, a₂ = 0, b₂ = 1), the likelihood ratio is piecewise exponential:

p(x|ω₂)/p(x|ω₁) = 2e^{(x+1)/2}    for x ≤ 0,
                  2e^{(1−3x)/2}   for 0 < x ≤ 1,
                  2e^{−(x+1)/2}   for x > 1,

and is continuous at the break points x = 0 and x = 1.
Section 2.3
3. We are to use the standard zero-one classification cost, that is, λ₁₁ = λ₂₂ = 0
and λ₁₂ = λ₂₁ = 1.
(a) To obtain the prior with the minimum risk, we take the derivative with respect
to P(ω₁) and set it to 0, that is,

d R(P(ω₁))/dP(ω₁) = ∫_{R₂} p(x|ω₁) dx − ∫_{R₁} p(x|ω₂) dx = 0,

so the minimum-risk decision regions satisfy ∫_{R₂} p(x|ω₁) dx = ∫_{R₁} p(x|ω₂) dx.
(b) This solution is not always unique, as shown in this simple counterexample. Let
P(ω₁) = P(ω₂) = 0.5 and

p(x|ω₁) = 1 for −0.5 ≤ x ≤ 0.5, and 0 otherwise,
p(x|ω₂) = 1 for 0 ≤ x ≤ 1, and 0 otherwise.

It is easy to verify that the decision regions R₁ = [−0.5, 0.25] and R₁ = [0, 0.5] each
satisfy the equations in part (a); thus the solution is not unique.
4. Consider the minimax criterion for a two-category classification problem.

(a) The total risk is the integral over the two regions Rᵢ of the class-conditional
densities times their costs and priors:

R = ∫_{R₁} [λ₁₁P(ω₁)p(x|ω₁) + λ₁₂P(ω₂)p(x|ω₂)] dx
  + ∫_{R₂} [λ₂₁P(ω₁)p(x|ω₁) + λ₂₂P(ω₂)p(x|ω₂)] dx.

We use P(ω₂) = 1 − P(ω₁) and ∫_{R₂} p(x|ω₂) dx = 1 − ∫_{R₁} p(x|ω₂) dx and find:

R = λ₂₂ + (λ₁₂ − λ₂₂) ∫_{R₁} p(x|ω₂) dx
  + P(ω₁) [ (λ₁₁ − λ₂₂) + (λ₂₁ − λ₁₁) ∫_{R₂} p(x|ω₁) dx − (λ₁₂ − λ₂₂) ∫_{R₁} p(x|ω₂) dx ].

The minimax boundary is chosen so that the bracketed coefficient of P(ω₁) vanishes;
the risk is then independent of the prior.
(b) Consider an arbitrary prior 0 < P(ω₁) < 1, and assume the decision boundary
has been set so as to achieve the minimal (Bayes) error for that prior. If one holds
the same decision boundary, but changes the prior probabilities (i.e., P(ω₁) in
the figure), then the error changes linearly, as given by the formula in part (a).
The true Bayes error, however, must be less than or equal to that (linearly
bounded) value, since one has the freedom to change the decision boundary at
each value of P(ω₁). Moreover, we note that the Bayes error is 0 at P(ω₁) = 0
and at P(ω₁) = 1, since the Bayes decision rule under those conditions is to
always decide ω₂ or ω₁, respectively, and this gives zero error. Thus the curve
of Bayes error rate is concave down for all prior probabilities.
[Figure: Bayes error E(P(ω₁)) versus the prior P(ω₁); the curve is concave down,
zero at P(ω₁) = 0 and P(ω₁) = 1, with its maximum at P*(ω₁) near 0.5.]
(c) According to the general minimax equation in part (a), for our case (i.e., λ₁₁ =
λ₂₂ = 0 and λ₁₂ = λ₂₁ = 1) the decision boundary is chosen to satisfy

∫_{R₂} p(x|ω₁) dx = ∫_{R₁} p(x|ω₂) dx.

We assume that a single decision point suffices, and thus we seek to find x* such
that

∫_{x*}^{∞} N(μ₁, σ₁²) dx = ∫_{−∞}^{x*} N(μ₂, σ₂²) dx,
where the tail integrals can be expressed in terms of the error function,
erf(x) = (2/√π) ∫₀ˣ e^{−t²} dt. For the particular means and variances of this
problem the minimax decision point is

x* = (σ₂μ₁ + σ₁μ₂)/(σ₁ + σ₂) = (1/2 + 0)/(1 + 1/2) = 1/3,

and the corresponding minimax error, computed from the error function, is E = 0.1374.
(f) Note that the distributions have the same form (in particular, the same
variance). Thus, by symmetry the Bayes error for P(ω₁) = P* (for some value
P*) must be the same as for P(ω₂) = P*. Because P(ω₂) = 1 − P(ω₁), we
know that the curve, analogous to the one in part (b), is symmetric around the
point P(ω₁) = 0.5. Because the curve is concave down, it must therefore peak
at P(ω₁) = 0.5, that is, at equal priors. The tangent to the graph of the error
versus P(ω₁) is thus horizontal at P(ω₁) = 0.5. For this case of equal priors,
the Bayes decision point for this problem can be stated simply: it is the point
midway between the means of the two distributions, that is, x* = 5.5.
5. We seek to generalize the notion of minimax criteria to the case where two independent prior probabilities are set.
[Figure: three triangle densities p(x|ωᵢ)P(ωᵢ), i = 1, 2, 3, with decision points
x₁* and x₂*.]

(a) We use the triangle distributions and conventions in the figure. We solve for the
decision points as follows (being sure to keep the signs correct, and assuming
that the decision boundary consists of just two points):
At the first decision point the two scaled densities are equal,

P(ω₁)(δ₁ − (x₁* − μ₁))/δ₁² = P(ω₂)(δ₂ − (μ₂ − x₁*))/δ₂²,

which has solution

x₁* = [P(ω₁)δ₂²(δ₁ + μ₁) − P(ω₂)δ₁²(δ₂ − μ₂)] / [P(ω₁)δ₂² + P(ω₂)δ₁²],

and similarly for the second decision point x₂*. The total error is the sum of the
triangle tail areas lying on the wrong side of the decision points:

E = P(ω₁) (1/(2δ₁²)) [δ₁ + μ₁ − x₁*]²
  + P(ω₂) (1/(2δ₂²)) [δ₂ − μ₂ + x₁*]²
  + P(ω₂) (1/(2δ₂²)) [δ₂ + μ₂ − x₂*]²
  + (1 − P(ω₁) − P(ω₂)) (1/(2δ₃²)) [δ₃ − μ₃ + x₂*]².
To obtain the minimax solution, we take the two partial derivatives with respect to
P(ω₁) and P(ω₂) and set them to zero. The first of the derivative equations,

∂E/∂P(ω₁) = 0,

yields the equation

([δ₁ + μ₁ − x₁*]/δ₁)² = ([δ₃ − μ₃ + x₂*]/δ₃)²,

and the second yields the analogous relation for the ω₂ terms. Combined with the
equal-density conditions at the decision points, each reduces to a quadratic whose
solution may be written

xᵢ* = (bᵢ + √cᵢ)/aᵢ,   i = 1, 2.
After a straightforward, but very tedious, calculation we find explicit closed-form
expressions for a₁, b₁ and c₁ as polynomials in the μᵢ and δᵢ, and an analogous
calculation gives a₂, b₂ and c₂.
(c) For {μᵢ, δᵢ} = {0, 1}, {.5, .5}, {1, 1} for i = 1, 2, 3, respectively, we substitute
into the above equations to find x₁* = 0.2612 and x₂* = 0.7388. It is a simple
matter to confirm that indeed these two decision points suffice for the classification
problem, that is, that no more than two points are needed.
[Figure: two scaled densities p(x|ω₁)P(ω₁) and p(x|ω₂)P(ω₂) with decision point
x*; the shaded area E₁ is the tail of the first density beyond x*.]
6. We let x* denote our decision boundary and assume μ₂ > μ₁, as shown in the figure.

(a) The error for classifying a pattern that is actually in ω₁ as if it were in ω₂ is:

∫_{R₂} p(x|ω₁)P(ω₁) dx = (1/2) ∫_{x*}^{∞} N(μ₁, σ₁²) dx ≤ E₁.

Our problem demands that this error be less than or equal to E₁. Thus the
bound on x* is a function of E₁, and could be obtained from tables of cumulative
normal distributions, or by simple numerical integration.
(b) Likewise, the error for categorizing a pattern that is in ω₂ as if it were in ω₁ is:

E₂ = ∫_{R₁} p(x|ω₂)P(ω₂) dx = (1/2) ∫_{−∞}^{x*} N(μ₂, σ₂²) dx.
(c) The total error is simply the sum of these two contributions:

E = E₁ + E₂ = (1/2) ∫_{x*}^{∞} N(μ₁, σ₁²) dx + (1/2) ∫_{−∞}^{x*} N(μ₂, σ₂²) dx.
(d) For p(x|ω₁) ~ N(−1, 1) and p(x|ω₂) ~ N(1, 1) and E₁ = 0.05, we have (by
simple numerical integration) x* = 0.2815, and thus

E = 0.05 + (1/2) ∫_{−∞}^{0.2815} (1/√(2π)) exp[−(x − 1)²/2] dx ≈ 0.168.
(e) The decision boundary for the (minimum error) Bayes case is clearly at x* = 0,
by symmetry. The Bayes error for this problem is:

E_B = 2 · (1/2) ∫₀^{∞} N(μ₁, σ₁²) dx ≈ 0.159,

which of course is lower than the error for the Neyman-Pearson criterion case.
Note that if the Bayes error were lower than 2 × 0.05 = 0.1 in this problem, we
would use the Bayes decision point for the Neyman-Pearson case, since it too
would ensure that the Neyman-Pearson criteria were obeyed and would give the
lowest total error.
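The decision point in part (d) can be recovered by simple numerical root-finding. The sketch below is illustrative only; it assumes the reconstructed parameter values p(x|ω₁) ~ N(−1, 1), p(x|ω₂) ~ N(1, 1), P(ωᵢ) = 1/2 and E₁ = 0.05, and uses bisection on the monotone tail probability.

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def np_decision_point(mu1, E1, prior=0.5):
    """Find x* with prior * P(x > x* | w1) = E1 for p(x|w1) ~ N(mu1, 1),
    by bisection on the monotonically decreasing tail probability."""
    lo, hi = mu1 - 10.0, mu1 + 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if prior * (1.0 - Phi(mid - mu1)) > E1:
            lo = mid          # tail still too large: move the boundary right
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_star = np_decision_point(-1.0, 0.05)       # assumed mu1 = -1, sigma = 1
E_total = 0.05 + 0.5 * Phi(x_star - 1.0)     # add the class-w2 error (mu2 = +1)
assert abs(x_star - 0.2815) < 1e-3
assert abs(E_total - 0.168) < 1e-3
```

Under these assumptions the bisection reproduces both the decision point 0.2815 and the total error 0.168 quoted above.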
7. We proceed as in Problem 6, with the figure below.

[Figure: two scaled Cauchy densities p(x|ω₁)P(ω₁) and p(x|ω₂)P(ω₂) with decision
point x*; the shaded area E₁ is the tail of the first density beyond x*.]
(a) Here the distributions are Cauchy,

p(x|ωᵢ) = (1/(πb)) · 1/(1 + ((x − aᵢ)/b)²),

and the constrained error is

E₁ = ∫_{x*}^{∞} p(x|ω₁)P(ω₁) dx = (1/2) ∫_{x*}^{∞} (1/(πb)) · 1/(1 + ((x − a₁)/b)²) dx.

A trigonometric substitution (with θ = sin⁻¹(b/√(b² + (x − a₁)²))) evaluates the
integral in closed form; setting the result equal to E₁ and solving for the decision
point gives

x* = a₁ + b √(1/sin²[2πE₁] − 1) = a₁ + b/tan[2πE₁].
(b) Likewise, the error for categorizing a pattern that is in ω₂ as if it were in ω₁ is

E₂ = ∫_{−∞}^{x*} (1/(πb)) · 1/(1 + ((x − a₂)/b)²) P(ω₂) dx
   = (1/2) [ 1/2 + (1/π) tan⁻¹((x* − a₂)/b) ].

(c) For the parameter values given in the problem, numerical evaluation of the total
Bayes error gives

E_B = 0.2489.

This is indeed lower than for the Neyman-Pearson case, as it must be. Note
that if the Bayes error were lower than 2 × 0.1 = 0.2 in this problem, we would
use the Bayes decision point for the Neyman-Pearson case, since it too would
ensure that the Neyman-Pearson criteria were obeyed and would give the lowest
total error.
8. Consider the Cauchy distribution.

(a) We let k denote the integral of p(x|ωᵢ), and check the normalization condition,
that is, whether k = 1:

k = ∫_{−∞}^{∞} p(x|ωᵢ) dx = ∫_{−∞}^{∞} (1/(πb)) · 1/(1 + ((x − aᵢ)/b)²) dx.

With the substitution y = (x − aᵢ)/b this becomes

k = (1/π) ∫_{−∞}^{∞} 1/(1 + y²) dy = (1/π) [tan⁻¹ y]_{−∞}^{∞} = 1,

so the distribution is indeed normalized.
(b) With P(ω₁) = P(ω₂) = 1/2, the posterior probability is

P(ω₁|x) = p(x|ω₁)P(ω₁) / [p(x|ω₁)P(ω₁) + p(x|ω₂)P(ω₂)]
        = (b² + (x − a₂)²) / [(b² + (x − a₁)²) + (b² + (x − a₂)²)],

and therefore

lim_{x→±∞} P(ω₁|x) = 1/2,

since both quadratic terms grow at the same rate. Far from both centers, the two
categories become equally likely.
9. (a) Without loss of generality, we assume that a₂ > a₁, and note that the decision
boundary is at (a₁ + a₂)/2. The probability of error is given by

P(error) = ∫_{−∞}^{(a₁+a₂)/2} P(ω₂)p(x|ω₂) dx + ∫_{(a₁+a₂)/2}^{∞} P(ω₁)p(x|ω₁) dx.

By the symmetry of the two Cauchy densities about the midpoint, the two contributions
are equal, and the trigonometric substitution y = (x − a₂)/b, as in Problem 8, gives

P(error) = (1/π) ∫_{−∞}^{(a₁−a₂)/(2b)} 1/(1 + y²) dy.

The integral is a standard form for tan⁻¹ y and thus our solution is:

P(error) = (1/π) [ tan⁻¹((a₁ − a₂)/(2b)) − tan⁻¹(−∞) ]
         = 1/2 − (1/π) tan⁻¹((a₂ − a₁)/(2b)).
(b) See the figure.

[Figure: P(error) versus |a₂ − a₁|/(2b); the error falls monotonically from its
maximum of 0.5 at zero separation toward 0.]
(c) The maximum value of the probability of error is P_max = 1/2, which occurs
for |a₂ − a₁|/(2b) = 0. This happens either when the two distributions are the
same, because a₁ = a₂, or, even if a₁ ≠ a₂, when b = ∞ and both distributions
are flat.
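The closed-form error of part (a) is easy to check numerically; the following sketch (an illustration, with arbitrarily chosen aᵢ and b) compares it against direct integration of the minimum of the two scaled densities.

```python
import math

def cauchy_error(a1, a2, b):
    """P(error) = 1/2 - (1/pi) * arctan(|a2 - a1| / (2b)), equal priors."""
    return 0.5 - math.atan(abs(a2 - a1) / (2.0 * b)) / math.pi

# Coincident distributions give the maximum error of 1/2 ...
assert cauchy_error(0.0, 0.0, 1.0) == 0.5
# ... and the error falls monotonically as the separation grows.
errs = [cauchy_error(0.0, a2, 1.0) for a2 in (0.0, 1.0, 2.0, 4.0, 8.0)]
assert all(e1 > e2 for e1, e2 in zip(errs, errs[1:]))

def numeric_error(a1, a2, b, lo=-1000.0, hi=1000.0, n=200000):
    """Midpoint-rule integral of min over the two scaled Cauchy densities."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p1 = 1.0 / (math.pi * b * (1.0 + ((x - a1) / b) ** 2))
        p2 = 1.0 / (math.pi * b * (1.0 + ((x - a2) / b) ** 2))
        total += 0.5 * min(p1, p2) * h    # equal priors of 1/2
    return total

assert abs(numeric_error(0.0, 2.0, 1.0) - cauchy_error(0.0, 2.0, 1.0)) < 1e-3
```

The heavy Cauchy tails are why the integration range must be taken very wide; truncation at ±1000 leaves an error of order 10⁻⁴.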
10. The overall probability of error is found by integrating over all x:

P(error) = ∫ P(error|x)p(x) dx.

11. (a) For a randomized decision rule that selects action αᵢ(x) with probability
P(αᵢ(x)|x), the associated risk is

R = ∫ [ Σᵢ₌₁ᵃ R(αᵢ(x)|x) P(αᵢ(x)|x) ] p(x) dx.
i=1
(b) Consider a xed point x and note that the (deterministic) Bayes minimum risk
decision at that point obeys
R(i (x)x) R(max (x)x).
20
R(max x)
a
!
P (i x) p(x)dx
i=1
R(max x)p(x)dx
= RB ,
the Bayes risk. Equality holds if and only if P (max (x)x) = 1.
12. We first note the normalization condition

Σᵢ₌₁ᶜ P(ωᵢ|x) = 1

for all x.

(a) If P(ωᵢ|x) = P(ωⱼ|x) for all i and j, then P(ωᵢ|x) = 1/c and hence
P(ω_max|x) = 1/c. If one of the P(ωᵢ|x) < 1/c, then by our normalization condition
we must have that P(ω_max|x) > 1/c. Thus in all cases P(ω_max|x) ≥ 1/c.

(b) The probability of error is simply 1.0 minus the probability of being correct,
that is,

P(error) = 1 − ∫ P(ω_max|x)p(x) dx.

(c) We simply substitute the bound from part (a) to get

P(error) = 1 − ∫ P(ω_max|x)p(x) dx ≤ 1 − ∫ (1/c) p(x) dx = 1 − 1/c = (c − 1)/c.
This last inequality shows that we should never decide on a category other than the
one that has the maximum posterior probability, as we know from our Bayes analysis.
Consequently, we should either choose ω_max or we should reject, depending upon
which is smaller: λ_s[1 − P(ω_max|x)] or λ_r. We reject if λ_r ≤ λ_s[1 − P(ω_max|x)], that
is, if P(ω_max|x) ≤ 1 − λ_r/λ_s.
14. Consider the classification problem with rejection option.

(a) The minimum-risk decision rule is given by:

Choose ωᵢ if P(ωᵢ|x) ≥ P(ωⱼ|x) for all j, and if P(ωᵢ|x) ≥ 1 − λ_r/λ_s.

This rule is equivalent to

Choose ωᵢ if p(x|ωᵢ)P(ωᵢ) ≥ p(x|ωⱼ)P(ωⱼ) for all j,
and if p(x|ωᵢ)P(ωᵢ) ≥ (1 − λ_r/λ_s) p(x),

where by Bayes formula

P(ωᵢ|x) = p(x|ωᵢ)P(ωᵢ)/p(x).

The rule can therefore be expressed via the discriminant functions

gᵢ(x) = p(x|ωᵢ)P(ωᵢ),   i = 1, …, c,
g_{c+1}(x) = [(λ_s − λ_r)/λ_s] Σⱼ₌₁ᶜ p(x|ωⱼ)P(ωⱼ),

where the last function corresponds to the reject option.
(b) Consider the case p(x|ω₁) ~ N(1, 1), p(x|ω₂) ~ N(−1, 1), P(ω₁) = P(ω₂) = 1/2
and λ_r/λ_s = 1/4. In this case the discriminant functions in part (a) give

g₁(x) = p(x|ω₁)P(ω₁) = (1/(2√(2π))) e^{−(x−1)²/2},
g₂(x) = p(x|ω₂)P(ω₂) = (1/(2√(2π))) e^{−(x+1)²/2},
g₃(x) = (1 − λ_r/λ_s)[p(x|ω₁)P(ω₁) + p(x|ω₂)P(ω₂)]
      = (3/(8√(2π))) [e^{−(x−1)²/2} + e^{−(x+1)²/2}] = (3/4)[g₁(x) + g₂(x)].
[Figure: the discriminant functions g₁(x), g₂(x), and g₃(x) of part (b); the reject
region is the range of x where g₃(x) exceeds both g₁(x) and g₂(x).]
(c) For the second case in the problem, with λ_r/λ_s = 1/2, the discriminant functions
become

g₁(x) = p(x|ω₁)P(ω₁) = (2/3) · (1/√(2π)) e^{−(x−1)²/2},
g₂(x) = p(x|ω₂)P(ω₂) = (1/3) · 2e^{−2x}   for x ≥ 0,
g₃(x) = (1 − λ_r/λ_s)[p(x|ω₁)P(ω₁) + p(x|ω₂)P(ω₂)] = (1/2)[g₁(x) + g₂(x)].

Note from the figure that for this problem we should never reject.
[Figure: the discriminant functions g₁(x), g₂(x), and g₃(x) of part (c); g₃(x) is
nowhere the largest of the three, so the reject option is never exercised.]
Section 2.5

15. We consider the volume of a d-dimensional hypersphere of radius 1.0, and more
generally of radius x, as shown in the figure.

[Figure: a unit (d+1)-dimensional hypersphere built from d-dimensional slabs of
volume V_d x^d and thickness dz at height z, where x² + z² = 1.]
(a) We use Eq. 47 in the text for d odd, that is, V_d = 2^d π^{(d−1)/2} ((d−1)/2)!/d!.
When applied to d = 1 (a line) we have V₁ = 2¹ π⁰ (0!/1!) = 2. Indeed, a line segment
−1 ≤ x ≤ +1 has generalized volume (length) of 2. More generally, a line of
"radius" x has volume of 2x.

(b) We use Eq. 47 in the text for d even, that is, V_d = π^{d/2}/(d/2)!. When
applied to d = 2 (a disk), we have V₂ = π¹/1! = π. Indeed, a disk of radius
1 has generalized volume (area) of π. More generally, a disk of radius x has
volume of πx².

(c) Given the volume of a line in d = 1, we can derive the volume of a disk by
straightforward integration. As shown in the figure, we have

V₂ = 2 ∫₀¹ 2√(1 − z²) dz = π,

and more generally the slab construction gives the recursion

V_{d+1} = 2 ∫₀¹ V_d (1 − z²)^{d/2} dz = V_d √π Γ(d/2 + 1)/Γ(d/2 + 3/2).
24
d/2 k! 22k+1
(2k + 1)!
(d
1)/2 d 1
( 2 )!
(d )!
2d
( d2 + 1)
Vd+1 = Vd
2 ( d2 + 32 )
2d (d1)/2 ( d1
((k + 1) + 12 )
2 )!
=
d!
2
(k + 1)!
d /2
,
(d /2)!
where we have used that for odd dimension d = 2k + 1 for some integer k, and
d = d + 1 is the (even) dimension of the higher space. This conrms Eq. 47 for
even dimension given in the text.
16. We approach the problem analogously to Problem 15, and use the same figure.
The slab construction again gives the recursion

V_{d+1} = 2 ∫₀¹ V_d (1 − z²)^{d/2} dz = V_d √π Γ(d/2 + 1)/Γ(d/2 + 3/2).
Applying the recursion twice shows that

V_{d+1} = V_{d−1} π Γ((d+1)/2)/Γ((d+3)/2) = V_{d−1} · 2π/(d + 1),

that is, the two-step recursion V_d = V_{d−2} · 2π/d.

(c) For d odd, repeated application of the two-step recursion gives

V_d = V_{d−2} (2π)/((d − 2) + 2)
    = V_{d−4} (2π)²/([(d − 4) + 2][(d − 2) + 2])
    = ⋯
    = V₁ (2π)^{(d−1)/2}/(3 · 5 ⋯ (d − 2) · d)      [(d − 1)/2 terms]
    = 2^{(d+1)/2} π^{(d−1)/2}/d!!
    = 2^d π^{(d−1)/2} ((d−1)/2)!/d!.

We have used the fact that V₁ = 2, from part (a), and the notation
d!! = d(d − 2)(d − 4)⋯, read "d double factorial."
(d) Analogously to part (c), for d even we have

V_d = V_{d−2} π/((d − 2)/2 + 1)
    = V_{d−4} π²/([(d − 4)/2 + 1][(d − 2)/2 + 1])
    = ⋯
    = V₂ π^{(d−2)/2}/(2 · 3 ⋯ (d/2))      [(d − 2)/2 terms]
    = π^{d/2}/(d/2)!,

where we have used V₂ = π from part (b). This confirms Eq. 47 for even dimensions.
(e) The central mathematical reason that we express the formulas separately for
even and for odd dimensions comes from the gamma function, which has different
expressions as factorials depending upon whether the argument is integer valued
or half-integer valued. One could express the volume of hyperspheres in all
dimensions by using gamma functions, as V_d = π^{d/2}/Γ(d/2 + 1), but that would
be computationally a bit less elegant.
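As a sanity check of part (e), the single gamma-function expression V_d = π^{d/2}/Γ(d/2 + 1) can be compared numerically against the separate even- and odd-dimension forms of Eq. 47; this sketch is illustrative and not part of the original solution.

```python
import math

def unit_sphere_volume(d):
    """V_d = pi^(d/2) / Gamma(d/2 + 1), valid for every dimension d >= 0."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

def odd_form(d):
    """Eq. 47 for odd d: V_d = 2^d * pi^((d-1)/2) * ((d-1)/2)! / d!"""
    k = (d - 1) // 2
    return 2.0 ** d * math.pi ** k * math.factorial(k) / math.factorial(d)

def even_form(d):
    """Eq. 47 for even d: V_d = pi^(d/2) / (d/2)!"""
    return math.pi ** (d // 2) / math.factorial(d // 2)

assert abs(unit_sphere_volume(1) - 2.0) < 1e-12            # a line segment
assert abs(unit_sphere_volume(2) - math.pi) < 1e-12        # a disk
assert abs(unit_sphere_volume(3) - 4.0 * math.pi / 3.0) < 1e-12
for d in range(1, 20):                                     # both closed forms agree
    closed = odd_form(d) if d % 2 else even_form(d)
    assert abs(unit_sphere_volume(d) - closed) < 1e-9
```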
17. Consider the minimal rectangular bounding box in d dimensions that contains
the hyperellipsoid, as shown in the figure. We can work in a coordinate system
in which the principal axes of the ellipsoid are parallel to the box axes, that is,
the covariance matrix is diagonal, Σ = diag(σ₁², σ₂², …, σ_d²), and hence
Σ⁻¹ = diag(1/σ₁², 1/σ₂², …, 1/σ_d²). The squared Mahalanobis distance from the
origin to the surface of the ellipsoid is given by r²,
r² = xᵗΣ⁻¹x = Σᵢ₌₁ᵈ xᵢ²/σᵢ².

Thus, along each of the principal axes, the distance obeys xᵢ² = σᵢ²r². Because the
distance across the rectangular volume is twice that amount, the volume of the
rectangular bounding box is

V_rect = (2x₁)(2x₂)⋯(2x_d) = 2ᵈ rᵈ ∏ᵢ₌₁ᵈ σᵢ = 2ᵈ rᵈ |Σ|^{1/2}.
We let V be the (unknown) volume of the hyperellipsoid, V_d the volume of the unit
hypersphere in d dimensions, and V_cube the volume of the d-dimensional cube having
length 2 on each side. Then we have the following relation:

V/V_rect = V_d/V_cube.

We note that the volume of the hypercube is V_cube = 2ᵈ, and substitute the above to
find that

V = V_rect V_d/V_cube = rᵈ |Σ|^{1/2} V_d,

where V_d is given by Eq. 47 in the text. Recall that the determinant of a matrix is
unchanged by rotation of axes, and thus in an arbitrary coordinate frame the value can
be written as

V = rᵈ |Σ|^{1/2} V_d.
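The result V = r^d |Σ|^{1/2} V_d can be spot-checked against elementary formulas in low dimensions (the ellipse area πab and the ellipsoid volume (4/3)πabc); the parameter values below are arbitrary illustrations.

```python
import math

def ellipsoid_volume(sigmas, r):
    """V = r^d * |Sigma|^(1/2) * V_d for Sigma = diag(sigma_i^2)."""
    d = len(sigmas)
    v_d = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)  # unit-sphere volume
    det_sqrt = 1.0
    for s in sigmas:
        det_sqrt *= s                  # |Sigma|^(1/2) = product of the sigma_i
    return r ** d * det_sqrt * v_d

# d = 2: the region x1^2/s1^2 + x2^2/s2^2 = r^2 is an ellipse with
# semi-axes r*s1 and r*s2, whose area is pi*(r*s1)*(r*s2).
s1, s2, r = 2.0, 0.5, 3.0
assert abs(ellipsoid_volume([s1, s2], r) - math.pi * (r * s1) * (r * s2)) < 1e-9
# d = 3: compare with the ellipsoid volume (4/3)*pi*a*b*c.
a, b, c = 1.0, 2.0, 3.0
assert abs(ellipsoid_volume([a, b, c], 1.0) - 4.0 / 3.0 * math.pi * a * b * c) < 1e-9
```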
18. Let X₁, …, X_n be a random sample of size n from N(μ₁, σ₁²) and let Y₁, …, Y_m
be a random sample of size m from N(μ₂, σ₂²).

(a) Let Z = (X₁ + ⋯ + X_n) + (Y₁ + ⋯ + Y_m). Our goal is to show that Z is also
normally distributed. We stack the samples into the [(n + m) × 1] vector

X = (X₁, X₂, …, X_n, Y₁, Y₂, …, Y_m)ᵗ,

which is normally distributed since its components are independent normal variables,
and Z is a linear function of X. The mean is

E(Z) = E[(X₁ + ⋯ + X_n) + (Y₁ + ⋯ + Y_m)]
     = E(X₁) + ⋯ + E(X_n) + E(Y₁) + ⋯ + E(Y_m) = nμ₁ + mμ₂,

and, since X₁, …, X_n, Y₁, …, Y_m are independent, the variance is

Var(Z) = Var(X₁) + ⋯ + Var(X_n) + Var(Y₁) + ⋯ + Var(Y_m) = nσ₁² + mσ₂².

(b) For the vector-valued case, with each Xᵢ and Yⱼ d-dimensional, we define

X = (X₁ᵗ, …, X_nᵗ, Y₁ᵗ, …, Y_mᵗ)ᵗ.

Then, clearly, X is an [(nd + md) × 1]-dimensional random variable that is normally
distributed. Consider the linear projection operator A defined by

Aᵗ = (I_{d×d}  I_{d×d}  ⋯  I_{d×d})      [(n + m) blocks].

Then we have

Z = AᵗX = X₁ + ⋯ + X_n + Y₁ + ⋯ + Y_m,

which must therefore be normally distributed, since a linear transformation of a
normally distributed random vector is itself normal. Furthermore, the mean and
covariance of the distribution are E(Z) = nμ₁ + mμ₂ and Var(Z) = nΣ₁ + mΣ₂.
19. We seek the density with maximum entropy subject to the constraints

∫ b_k(x)p(x) dx = a_k   for k = 1, …, q.

We form the functional (taking b₀(x) = 1 and a₀ = 1 so that the k = 0 term enforces
normalization)

H_s = −∫ p(x) ln p(x) dx + Σ_{k=0}^{q} λ_k [ ∫ b_k(x)p(x) dx − a_k ].

Setting the functional derivative with respect to p(x) to zero gives

−ln p(x) − 1 + Σ_{k=0}^{q} λ_k b_k(x) = 0,

and thus the maximum-entropy solution has the form

p(x) = exp[ Σ_{k=0}^{q} λ_k b_k(x) − 1 ].
20. We make use of the result of Problem 19, that is, the maximum-entropy
distribution p(x) having constraints of the form

∫ b_k(x)p(x) dx = a_k   for k = 1, …, q

is

p(x) = exp[ Σ_{k=0}^{q} λ_k b_k(x) − 1 ].

(a) In the case q = 0 (only the normalization constraint over a finite range), the
density is

p(x) = exp(λ₀ − 1),

a constant; that is, the maximum-entropy density is uniform.

(b) Now add the constraint ∫₀^∞ x p(x) dx = μ on x ≥ 0. The normalization condition
gives

∫₀^∞ e^{λ₀ − 1 + λ₁x} dx = −e^{λ₀−1}/λ₁ = 1,

and the mean constraint gives

∫₀^∞ x e^{λ₀ − 1 + λ₁x} dx = e^{λ₀−1}/λ₁² = μ.

Together these imply λ₁ = −1/μ and e^{λ₀−1} = 1/μ, and thus

p(x) = (1/μ)e^{−x/μ} for x ≥ 0, and 0 otherwise,

that is, the exponential distribution.
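A quick numerical check of part (b): the exponential density has differential entropy H = 1 + ln μ, and an alternative density on [0, ∞) with the same mean (here the uniform on [0, 2μ]) has strictly lower entropy. The value of μ below is an arbitrary illustration.

```python
import math

def entropy_numeric(p, lo, hi, n=200000):
    """Differential entropy -integral p ln p by the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        px = p(x)
        if px > 0.0:
            total -= px * math.log(px) * h
    return total

mu = 1.7
expo = lambda x: math.exp(-x / mu) / mu          # the maximum-entropy density
H_expo = entropy_numeric(expo, 0.0, 60.0 * mu)
assert abs(H_expo - (1.0 + math.log(mu))) < 1e-3   # H = 1 + ln(mu)

# A different density on [0, inf) with the same mean has lower entropy,
# e.g. the uniform density on [0, 2*mu] with H = ln(2*mu) < 1 + ln(mu):
assert math.log(2.0 * mu) < 1.0 + math.log(mu)
```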
(c) Here the density has three free parameters, and is of the general form

p(x) = exp[λ₀ − 1 + λ₁x + λ₂x²],

and the constraint equations are

∫ p(x) dx = 1,            (∗)
∫ x p(x) dx = μ,          (∗∗)
∫ x² p(x) dx = μ² + σ²    (∗∗∗)

(so that the variance is σ²). Completing the square in the exponent,

p(x) = exp[λ₀ − 1 − λ₁²/(4λ₂)] exp[λ₂ (x + λ₁/(2λ₂))²],

and since normalizability requires λ₂ < 0 (and erf(∞) = 1, erf(−∞) = −1), the
Gaussian integral for (∗) gives

exp[λ₀ − 1 − λ₁²/(4λ₂)] √(π/(−λ₂)) = 1,

with analogous expressions for (∗∗) and (∗∗∗). Solving the three equations for the
three λ's yields

λ₀ = 1 − μ²/(2σ²) + ln[1/(√(2π)σ)],
λ₁ = μ/σ²,
λ₂ = −1/(2σ²).

We substitute these values back into the general form of the density and find

p(x) = (1/(√(2π)σ)) exp[−(x − μ)²/(2σ²)],

that is, a Gaussian.
21. A Gaussian centered at x = 0 is of the form

p(x) = (1/(√(2π)σ)) exp[−x²/(2σ²)],

and its entropy is

H = −∫ p(x) ln p(x) dx = (1/2) ln[2πeσ²] = ln[√(2πe) σ].

For a uniform distribution on [x_l, x_u],

H = −∫_{x_l}^{x_u} (1/|x_u − x_l|) ln(1/|x_u − x_l|) dx = ln|x_u − x_l|.

Since we are given that the mean of the distribution is 0, we know that x_u = −x_l.
Further, we are told that the variance is σ², that is,

∫_{x_l}^{x_u} x² p(x) dx = x_u²/3 = σ²,

and thus x_u = √3 σ. We put these results together and find for the uniform
distribution H = ln[2√3 σ].

We are told that the variance of the triangle distribution centered on 0 having
half-width w is σ², and this implies

∫_{−w}^{w} x² (w − |x|)/w² dx = w²/6 = σ²,

and thus w = √6 σ. The entropy of the triangle distribution is then

H = −∫_{−w}^{w} [(w − |x|)/w²] ln[(w − |x|)/w²] dx = 1/2 + ln w = ln[√(6e) σ].

In sum, for a common variance σ² the three entropies obey

H_uniform = ln[2√3 σ] < H_triangle = ln[√(6e) σ] < H_Gauss = ln[√(2πe) σ],

so the Gaussian has the maximum entropy of the three, as it must.

[Figure: the zero-mean Gaussian, uniform, and triangle densities of common
variance σ².]
22. Recall that the d-dimensional Gaussian is

p(x) = (1/((2π)^{d/2}|Σ|^{1/2})) exp[−(1/2)(x − μ)ᵗΣ⁻¹(x − μ)].

According to Eq. 37 in the text, the entropy is

H(p(x)) = −∫ p(x) ln p(x) dx
        = ∫ p(x) [ (1/2)(x − μ)ᵗΣ⁻¹(x − μ) + ln((2π)^{d/2}|Σ|^{1/2}) ] dx,

where the second term inside the brackets is independent of x and integrates to
(1/2) ln[(2π)^d|Σ|]. For the first term,

(1/2) Σᵢ₌₁ᵈ Σⱼ₌₁ᵈ E[(xᵢ − μᵢ)(xⱼ − μⱼ)] [Σ⁻¹]ᵢⱼ = (1/2) Σᵢ Σⱼ [Σ]ⱼᵢ[Σ⁻¹]ᵢⱼ
  = (1/2) Σⱼ₌₁ᵈ [ΣΣ⁻¹]ⱼⱼ = (1/2) Σⱼ₌₁ᵈ [I]ⱼⱼ = d/2.

Putting the pieces together,

H(p(x)) = d/2 + (1/2) ln[(2π)^d|Σ|] = (1/2) ln[(2πe)^d|Σ|].
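For d = 1 the result reduces to H = (1/2) ln(2πeσ²), which can be confirmed by direct numerical integration of −p ln p; the sketch below is an illustration, not part of the original solution.

```python
import math

def gaussian_entropy_closed(sigma):
    """H = (1/2) * ln(2*pi*e*sigma^2) for d = 1, where |Sigma| = sigma^2."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def gaussian_entropy_numeric(sigma, n=200000):
    """-integral p ln p over [-12*sigma, 12*sigma] by the midpoint rule."""
    lo, hi = -12.0 * sigma, 12.0 * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p = math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
        total -= p * math.log(p) * h
    return total

for sigma in (0.5, 1.0, 3.0):
    assert abs(gaussian_entropy_numeric(sigma) - gaussian_entropy_closed(sigma)) < 1e-6
```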
23. We are given the three-dimensional Gaussian with

μ = (1, 2, 2)ᵗ,   Σ = ( 1 0 0 ; 0 5 2 ; 0 2 5 ),

and we seek the density at the point xₒ = (0.5, 0, 1)ᵗ,

p(xₒ) = (1/((2π)^{3/2}|Σ|^{1/2})) exp[−(1/2)(xₒ − μ)ᵗΣ⁻¹(xₒ − μ)].

(a) The determinant and inverse of the covariance are

|Σ| = 1 · (5 · 5 − 2 · 2) = 21,
Σ⁻¹ = ( 1 0 0 ; 0 5/21 −2/21 ; 0 −2/21 5/21 ),

and the squared Mahalanobis distance from the mean to xₒ is

(xₒ − μ)ᵗΣ⁻¹(xₒ − μ) = (−0.5, −2, −1) Σ⁻¹ (−0.5, −2, −1)ᵗ
                     = 0.25 + 16/21 + 1/21 = 1.06.

We substitute these values to find that the density at xₒ is:

p(xₒ) = (1/((2π)^{3/2}(21)^{1/2})) exp[−(1/2)(1.06)] = 8.16 × 10⁻³.
(b) Recall from Eq. 44 in the text that A_w = ΦΛ^{−1/2}, where Φ contains the
normalized eigenvectors of Σ and Λ is the diagonal matrix of eigenvalues. The
characteristic equation, |Σ − λI| = 0, in this case is

(1 − λ)[(5 − λ)² − 4] = (1 − λ)(3 − λ)(7 − λ) = 0.

The three eigenvalues λ = 1, 3, 7 can be read immediately from the factors, and
the (diagonal) matrix of eigenvalues is thus

Λ = diag(1, 3, 7).

To find the eigenvectors, we solve Σx = λᵢx for i = 1, 2, 3 and normalize, which gives

φ₁ = (1, 0, 0)ᵗ,   φ₂ = (0, 1/√2, −1/√2)ᵗ,   φ₃ = (0, 1/√2, 1/√2)ᵗ.

Thus our final Φ and A_w matrices are:

Φ = ( 1 0 0 ; 0 1/√2 1/√2 ; 0 −1/√2 1/√2 )

and

A_w = ΦΛ^{−1/2} = ( 1 0 0 ; 0 1/√6 1/√14 ; 0 −1/√6 1/√14 ).

(c) The whitened point is

x_w = A_wᵗ(xₒ − μ) = (−0.5, −1/√6, −3/√14)ᵗ.

(d) From part (a), we have that the squared Mahalanobis distance from xₒ to μ in
the original coordinates is r² = (xₒ − μ)ᵗΣ⁻¹(xₒ − μ) = 1.06. The squared
Euclidean distance from x_w to 0 in the transformed coordinates is
x_wᵗx_w = (0.5)² + 1/6 + 9/14 = 1.06. The two distances are the same, as they
must be: the whitening transform maps Mahalanobis distance in the original
space to Euclidean distance in the transformed space.
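All of the numbers in Problem 23 can be verified with a few lines of code using only the values given above (μ, Σ, and xₒ); the following sketch checks the determinant, the Mahalanobis distance, the density, the eigenvalues, and the invariance of the distance under whitening.

```python
import math

# mean, covariance, and query point from Problem 23
mu = (1.0, 2.0, 2.0)
x0 = (0.5, 0.0, 1.0)
det = 21.0                                        # |Sigma| = 1*(5*5 - 2*2)
Sigma_inv = ((1.0, 0.0, 0.0),
             (0.0, 5.0 / 21.0, -2.0 / 21.0),
             (0.0, -2.0 / 21.0, 5.0 / 21.0))

d = [x0[i] - mu[i] for i in range(3)]             # x0 - mu = (-0.5, -2, -1)
tmp = [sum(Sigma_inv[i][j] * d[j] for j in range(3)) for i in range(3)]
r2 = sum(d[i] * tmp[i] for i in range(3))         # squared Mahalanobis distance
p = math.exp(-0.5 * r2) / ((2.0 * math.pi) ** 1.5 * math.sqrt(det))

assert abs(r2 - (0.25 + 16.0 / 21.0 + 1.0 / 21.0)) < 1e-12   # = 1.0595...
assert abs(p - 8.16e-3) < 5e-5                    # density at x0

# The eigenvalues of Sigma are 1, 3, 7: they are roots of the
# characteristic polynomial (1 - lam) * ((5 - lam)^2 - 4).
for lam in (1.0, 3.0, 7.0):
    assert abs((1.0 - lam) * ((5.0 - lam) ** 2 - 4.0)) < 1e-12

# The whitened squared distance x_w^t x_w = 0.25 + 1/6 + 9/14 equals r^2.
assert abs((0.25 + 1.0 / 6.0 + 9.0 / 14.0) - r2) < 1e-12
```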
(e) Under a linear transformation x → Tᵗx the sample mean and covariance transform
as

μ → Tᵗμ   and   Σ → TᵗΣT,

since Σₖ Tᵗxₖ = Tᵗ Σₖ xₖ and Σₖ Tᵗ(xₖ − μ)(xₖ − μ)ᵗT = Tᵗ[Σₖ (xₖ − μ)(xₖ − μ)ᵗ]T.
We note that |TᵗΣT| = |Σ| for transformations such as translation and rotation (for
which |T| = 1), and thus

p(xₒ|N(μ, Σ)) = p(Tᵗxₒ|N(Tᵗμ, TᵗΣT)).

The volume element is proportional to |T|, and for transformations such as scaling,
the transformed covariance determinant is proportional to |T|², so the transformed
normalization constant contains 1/|T|, which exactly compensates for the change in
volume.
(f) Recall the definition of a whitening transformation given by Eq. 44 in the text:
A_w = ΦΛ^{−1/2}. In this case we have

y = A_wᵗx ~ N(A_wᵗμ, A_wᵗΣA_w),

and this implies that

Var(y) = A_wᵗ E[(x − μ)(x − μ)ᵗ] A_w = A_wᵗΣA_w
       = (ΦΛ^{−1/2})ᵗ (ΦΛΦᵗ) (ΦΛ^{−1/2})
       = Λ^{−1/2} Φᵗ Φ Λ Φᵗ Φ Λ^{−1/2}
       = Λ^{−1/2} Λ Λ^{−1/2} = I,

the identity matrix, where we have used Σ = ΦΛΦᵗ and ΦᵗΦ = I.
24. Recall that the general multivariate normal density in d dimensions is:

p(x) = (1/((2π)^{d/2}|Σ|^{1/2})) exp[−(1/2)(x − μ)ᵗΣ⁻¹(x − μ)].

(a) If Σ = diag(σ₁², …, σ_d²), then

|Σ| = ∏ᵢ₌₁ᵈ σᵢ²   and   Σ⁻¹ = diag(1/σ₁², …, 1/σ_d²),

and the squared Mahalanobis distance reduces to

(x − μ)ᵗΣ⁻¹(x − μ) = Σᵢ₌₁ᵈ ((xᵢ − μᵢ)/σᵢ)².

(b) The contours of constant density are concentric ellipses in d dimensions whose
centers are at (μ₁, …, μ_d)ᵗ = μ, and whose axes in the i-th direction are of
length 2σᵢ√c for the density p(x) held constant at

e^{−c/2} / ((2π)^{d/2} ∏ᵢ₌₁ᵈ σᵢ).

The axes of the ellipses are parallel to the coordinate axes.

[Figure (d = 2): an ellipse centered at (μ₁, μ₂) with axes of length 2σ₁√c and
2σ₂√c parallel to the x₁ and x₂ axes.]
Section 2.6

25. A useful discriminant function for Gaussians is given by Eq. 52 in the text,

gᵢ(x) = −(1/2)(x − μᵢ)ᵗΣ⁻¹(x − μᵢ) + ln P(ωᵢ).

We expand to get

gᵢ(x) = −(1/2)[ xᵗΣ⁻¹x − μᵢᵗΣ⁻¹x − xᵗΣ⁻¹μᵢ + μᵢᵗΣ⁻¹μᵢ ] + ln P(ωᵢ).

The quadratic term xᵗΣ⁻¹x is independent of i, and we drop it; this yields the
equivalent (linear) discriminant function:

gᵢ(x) = μᵢᵗΣ⁻¹x − (1/2)μᵢᵗΣ⁻¹μᵢ + ln P(ωᵢ) = wᵢᵗx + w_{i0},

where

wᵢ = Σ⁻¹μᵢ   and   w_{i0} = −(1/2)μᵢᵗΣ⁻¹μᵢ + ln P(ωᵢ).

Setting gᵢ(x) = gⱼ(x) gives the boundary condition

(μᵢ − μⱼ)ᵗΣ⁻¹x − (1/2)(μᵢᵗΣ⁻¹μᵢ − μⱼᵗΣ⁻¹μⱼ) + ln[P(ωᵢ)/P(ωⱼ)] = 0,

which can be written as wᵗ(x − xₒ) = 0 with

w = Σ⁻¹(μᵢ − μⱼ)

and

xₒ = (1/2)(μᵢ + μⱼ) − [ln(P(ωᵢ)/P(ωⱼ)) / ((μᵢ − μⱼ)ᵗΣ⁻¹(μᵢ − μⱼ))] (μᵢ − μⱼ),

respectively.
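The linear discriminant just derived is easy to exercise numerically. The sketch below uses made-up means with Σ = I and equal priors, and confirms that the boundary g₁ = g₂ passes through the midpoint of the means, as the expression for xₒ predicts when the prior term vanishes.

```python
import math

def linear_discriminant(mu_i, Sigma_inv, prior):
    """w_i = Sigma^-1 mu_i and w_i0 = -1/2 mu_i^t Sigma^-1 mu_i + ln P(w_i)."""
    d = len(mu_i)
    w = [sum(Sigma_inv[r][c] * mu_i[c] for c in range(d)) for r in range(d)]
    w0 = -0.5 * sum(w[r] * mu_i[r] for r in range(d)) + math.log(prior)
    return w, w0

def g(w, w0, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Illustrative example: Sigma = I (so Sigma^-1 = I) and equal priors.
I = ((1.0, 0.0), (0.0, 1.0))
mu1, mu2 = (0.0, 0.0), (2.0, 2.0)
w1, w10 = linear_discriminant(mu1, I, 0.5)
w2, w20 = linear_discriminant(mu2, I, 0.5)

# With equal priors the boundary passes through the midpoint of the means:
midpoint = (1.0, 1.0)
assert abs(g(w1, w10, midpoint) - g(w2, w20, midpoint)) < 1e-12
# Points nearer mu1 are assigned to w1, and vice versa:
assert g(w1, w10, (0.1, 0.2)) > g(w2, w20, (0.1, 0.2))
assert g(w2, w20, (1.9, 1.8)) > g(w1, w10, (1.9, 1.8))
```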
26. The densities and Mahalanobis distances for our two distributions with the same
covariance obey

p(x|ωᵢ) ~ N(μᵢ, Σ),   rᵢ² = (x − μᵢ)ᵗΣ⁻¹(x − μᵢ),

for i = 1, 2.
(a) Our goal is to show that ∇rᵢ² = 2Σ⁻¹(x − μᵢ). Here ∇rᵢ² is the gradient of rᵢ²,
that is, the (column) vector in d dimensions (∂rᵢ²/∂x₁, …, ∂rᵢ²/∂x_d)ᵗ. We expand

rᵢ² = xᵗΣ⁻¹x − 2μᵢᵗΣ⁻¹x + μᵢᵗΣ⁻¹μᵢ,

whose last term is independent of x. Writing σᵏˡ for the elements of Σ⁻¹ and using
its symmetry (σᵏˡ = σˡᵏ), the derivative with respect to the component xⱼ is

∂rᵢ²/∂xⱼ = ∂/∂xⱼ [ Σₖ Σₗ xₖxₗσᵏˡ − 2 Σₖ Σₗ μᵢₖσᵏˡxₗ ]
         = 2 Σₖ₌₁ᵈ xₖσᵏʲ − 2 Σₖ₌₁ᵈ μᵢₖσᵏʲ
         = 2 Σₖ₌₁ᵈ (xₖ − μᵢₖ)σᵏʲ,

which is the j-th component of 2Σ⁻¹(x − μᵢ), as required.
(b) Now consider points x on the line aᵗx = b; in particular, the point on the line
closest to the origin is x = ab/‖a‖². Taking for simplicity μ = 0 and writing
A = Σ⁻¹, the gradient there is

∇r² = 2Ax = 2A(ab/‖a‖²).

We are interested in the direction of this derivative vector, not its magnitude.
Thus we now normalize the vector to find a unit vector:

2A(ab/‖a‖²) / √([2A(ab/‖a‖²)]ᵗ[2A(ab/‖a‖²)]) = Aa/√(aᵗAᵗAa).

Note especially that this normalized vector is independent of b, and thus independent
of the point x on the line. Note too that in the very special case where the Gaussian
is hyperspherical (i.e., Σ = qI, with I the identity matrix and q a scalar), then
Aa = (1/q)Ia = (1/q)a and AᵗA = (1/q²)IᵗI = (1/q²)I. In this case, the normalized
derivative depends only upon a and the scalar q.
(c) We seek now to show that ∇r₁² and ∇r₂² point in opposite directions along the
line from μ₁ to μ₂. For a point x on the segment between μ₁ and μ₂ we have
x − μ₁ = α(μ₂ − μ₁) and x − μ₂ = −(1 − α)(μ₂ − μ₁) for some 0 < α < 1. From
part (a),

∇r₁² = 2αΣ⁻¹(μ₂ − μ₁)   and   ∇r₂² = −2(1 − α)Σ⁻¹(μ₂ − μ₁),

which are antiparallel. Therefore ∇r₁² and ∇r₂² point in opposite directions along
the line from μ₁ to μ₂.
(d) From Eqs. 60 and 62 in the text, we know that the separating hyperplane has
normal vector w = Σ⁻¹(μ₁ − μ₂). We shall show that the vector perpendicular to
surfaces of constant probability is in the same direction, w, at the point of
intersection of the line connecting μ₁ and μ₂ and the discriminating hyperplane.
(This point is denoted xₒ.)

As we saw in part (a), the gradient obeys ∇r₁² = 2Σ⁻¹(xₒ − μ₁); because of the
relationship between the Mahalanobis distance and the probability density in a
Gaussian, this vector is also normal to the surfaces of constant density. Equation 65
in the text states

xₒ = (1/2)(μ₁ + μ₂) − [ln(P(ω₁)/P(ω₂)) / ((μ₁ − μ₂)ᵗΣ⁻¹(μ₁ − μ₂))] (μ₁ − μ₂).

When the prior term vanishes (P(ω₁) = P(ω₂)), xₒ = (1/2)(μ₁ + μ₂), and then

∇r₁² = 2Σ⁻¹(xₒ − μ₁) = Σ⁻¹(μ₂ − μ₁) = −w,

that is, the gradient is parallel (up to sign) to the normal of the hyperplane.
Moreover, on the decision boundary

(μ₁ − μ₂)ᵗΣ⁻¹x = (1/2)(μ₁ − μ₂)ᵗΣ⁻¹(μ₁ + μ₂)
              = (1/2)[ μ₁ᵗΣ⁻¹μ₁ − μ₂ᵗΣ⁻¹μ₂ + μ₁ᵗΣ⁻¹μ₂ − μ₂ᵗΣ⁻¹μ₁ ],

and the last two terms cancel by the symmetry of Σ⁻¹. Consequently the Bayes
decision boundary is the set of points x that satisfy the following equation:

(μ₁ − μ₂)ᵗΣ⁻¹x = (1/2)[ μ₁ᵗΣ⁻¹μ₁ − μ₂ᵗΣ⁻¹μ₂ ].

Expanding r₁² − r₂² shows this to be exactly the condition r₁² = r₂²; from these
results it follows that the Bayes decision boundary (for equal priors) consists of
points of equal Mahalanobis distance to the two means.
27. Our Gaussian distributions are p(xi ) N (i , i ) and prior probabilities are
P (i ). The Bayes decision boundary for this problem is linear, given by Eqs. 6365
in the text wt (x xo ) = 0, where
w
xo
= 1 (1 2 )
1
ln [P (1 )/P (2 )]
=
(1 2 )
(1 2 ).
2
(1 2 )t 1 (1 2 )
The Bayes decision boundary does not pass between the two means, $\mu_1$ and $\mu_2$, if and only if $w^t(\mu_1 - x_o)$ and $w^t(\mu_2 - x_o)$ have the same sign, that is, either
$$w^t(\mu_1 - x_o) > 0 \quad\text{and}\quad w^t(\mu_2 - x_o) > 0$$
or
$$w^t(\mu_1 - x_o) < 0 \quad\text{and}\quad w^t(\mu_2 - x_o) < 0.$$
These conditions can be evaluated directly:
$$w^t(\mu_1 - x_o) = (\mu_1-\mu_2)^t\Sigma^{-1}\left[\mu_1 - \frac{1}{2}(\mu_1+\mu_2)\right] + \ln\frac{P(\omega_1)}{P(\omega_2)} = \frac{1}{2}(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
and
$$w^t(\mu_2 - x_o) = (\mu_1-\mu_2)^t\Sigma^{-1}\left[\mu_2 - \frac{1}{2}(\mu_1+\mu_2)\right] + \ln\frac{P(\omega_1)}{P(\omega_2)} = -\frac{1}{2}(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) + \ln\frac{P(\omega_1)}{P(\omega_2)}.$$
Both quantities are positive when
$$(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) < 2\ln\frac{P(\omega_1)}{P(\omega_2)},$$
and both are negative when
$$(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) < 2\ln\frac{P(\omega_2)}{P(\omega_1)}.$$
In sum, the condition that the Bayes decision boundary does not pass between the two means can be stated as follows:

Case 1: $P(\omega_1) \geq P(\omega_2)$. Condition: $(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) < 2\ln[P(\omega_1)/P(\omega_2)]$, and this ensures $w^t(\mu_1 - x_o) > 0$ and $w^t(\mu_2 - x_o) > 0$.

Case 2: $P(\omega_1) < P(\omega_2)$. Condition: $(\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2) < 2\ln[P(\omega_2)/P(\omega_1)]$, and this ensures $w^t(\mu_1 - x_o) < 0$ and $w^t(\mu_2 - x_o) < 0$.
28. We use Eqs. 42 and 43 in the text for the mean and covariance.

(a) The covariance obeys:
$$\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right] = \int\!\!\int p(x_i,x_j)\,(x_i-\mu_i)(x_j-\mu_j)\,dx_i\,dx_j.$$
Since $x_i$ and $x_j$ are statistically independent, $p(x_i,x_j) = p(x_i)p(x_j)$, and the double integral factors:
$$\sigma_{ij}^2 = \int (x_i-\mu_i)p(x_i)\,dx_i \int (x_j-\mu_j)p(x_j)\,dx_j = 0,$$
since each factor vanishes, $\int (x_i-\mu_i)p(x_i)\,dx_i = \mu_i - \mu_i = 0$, where we used $\int p(x_i)\,dx_i = 1$.
(b) In this case, we can write the joint density as
$$p(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left[-\frac{1}{2}\left(\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right)\right]$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\left[-\frac{1}{2}\frac{(x_1-\mu_1)^2}{\sigma_1^2}\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\left[-\frac{1}{2}\frac{(x_2-\mu_2)^2}{\sigma_2^2}\right] = p(x_1)p(x_2).$$
Although we have derived this for the special case of two dimensions and $\sigma_{12} = 0$, the same method applies in the fully general case of $d$ dimensions and two arbitrary coordinates $i$ and $j$.
(c) Consider the following discrete distribution:
$$x_1 = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2, \end{cases}$$
and a random variable $x_2$ conditioned on $x_1$: if $x_1 = +1$, then
$$x_2 = \begin{cases} +1/2 & \text{with probability } 1/2 \\ -1/2 & \text{with probability } 1/2; \end{cases}$$
if $x_1 = -1$, then $x_2 = 0$ with probability 1.

It is simple to verify that $\mu_1 = E[x_1] = 0$; we use that fact in the following calculation:
$$\text{Cov}(x_1,x_2) = E\left[(x_1-\mu_1)(x_2-\mu_2)\right] = E[x_1 x_2] - \mu_2 E[x_1] - \mu_1 E[x_2] + \mu_1\mu_2 = E[x_1 x_2] - \mu_1\mu_2$$
$$= \frac{1}{2}\,P(x_1{=}+1, x_2{=}+1/2) - \frac{1}{2}\,P(x_1{=}+1, x_2{=}-1/2) + 0\cdot P(x_1{=}-1) = 0.$$
Thus $x_1$ and $x_2$ are uncorrelated. They are not, however, independent; for instance, compare
$$\Pr(x_1{=}+1, x_2{=}+1/2) = \frac{1}{4} \quad\text{with}\quad \Pr(x_1{=}+1)\Pr(x_2{=}+1/2) = \frac{1}{2}\cdot\frac{1}{4} = \frac{1}{8}.$$
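Because the distribution is finite, the uncorrelated-but-dependent claim can be verified by exact enumeration; a minimal sketch:

```python
# Exact enumeration of the joint distribution in part (c):
# x1 = +/-1 each with probability 1/2; given x1 = +1, x2 = +/-1/2 each
# with probability 1/2; given x1 = -1, x2 = 0 surely.
joint = {(+1, +0.5): 0.25, (+1, -0.5): 0.25, (-1, 0.0): 0.5}

E = lambda f: sum(p * f(x1, x2) for (x1, x2), p in joint.items())
mu1, mu2 = E(lambda a, b: a), E(lambda a, b: b)
cov = E(lambda a, b: (a - mu1) * (b - mu2))
assert abs(cov) < 1e-12                      # uncorrelated

# ...but not independent: a joint probability differs from the
# product of the marginals.
p_x1 = sum(p for (a, b), p in joint.items() if a == +1)    # = 1/2
p_x2 = sum(p for (a, b), p in joint.items() if b == +0.5)  # = 1/4
assert joint[(+1, +0.5)] != p_x1 * p_x2                    # 1/4 != 1/8
```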
29. Figure 2.15 in the text shows a decision region that is a mere line in three dimensions. We can understand this unusual case by analogy to a one-dimensional problem that has a decision region consisting of a single point, as follows.
(a) Suppose we have two one-dimensional Gaussians, of possibly different means and variances. The posteriors are also Gaussian, and the priors $P(\omega_i)$ can be adjusted so that the posteriors are equal at a single point, as shown in the figure. That is to say, the Bayes decision rule is to choose $\omega_1$ except at a single point, where we should choose $\omega_2$, as shown in the figure.
[Figure: the posteriors $P(\omega_i|x)$ of the two one-dimensional Gaussians; the decision regions are $R_1$, then $R_2$ (a single point), then $R_1$.]
(b) Likewise, in the three-dimensional case of Fig. 2.15, the priors can be adjusted so that the decision region is a single line segment.
30. Recall that the decision boundary is given by Eq. 66 in the text, i.e., $g_1(x) = g_2(x)$.

(a) The most general case of a hyperquadric can be written as
$$x^t W_1 x + w_1^t x + w_{10} = x^t W_2 x + w_2^t x + w_{20},$$
where the $W_i$ are positive definite $d$-by-$d$ matrices, the $w_i$ arbitrary vectors in $d$ dimensions, and the $w_{i0}$ scalar offsets. The above equation can be written in the form
$$-\frac{1}{2}x^t\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)x + \left(\Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2\right)^t x + (w_{10} - w_{20}) = 0.$$
If we have $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$ and priors $P(\omega_i)$, then we can always choose the $\Sigma_i$ for the distributions to be the $\Sigma_i$ of the classifier, and choose means $\mu_i$ such that $\Sigma_i^{-1}\mu_i = w_i$.

(b) We can still assign these values if $P(\omega_1) \neq P(\omega_2)$.
Section 2.7
31. Recall that from the Bayes rule, to minimize the probability of error, the decision boundary is given according to:

Choose $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise choose $\omega_2$.

(a) In this case, since $P(\omega_1) = P(\omega_2) = 1/2$, we have $P(\omega_i|x) \propto p(x|\omega_i)$; and since the class-conditional distributions $p(x|\omega_i)$ have the same $\sigma$, the decision rule is:

Choose $\omega_1$ if $|x - \mu_1| < |x - \mu_2|$; otherwise choose $\omega_2$.

Without loss of generality we assume $\mu_1 < \mu_2$, and thus
$$P(\text{error}) = \frac{1}{\sqrt{2\pi}}\int_a^\infty e^{-u^2/2}\,du,$$
where $a = (\mu_2-\mu_1)/(2\sigma)$.
(b) We note that a Gaussian decay overcomes an inverse function for sufficiently large argument, that is,
$$\lim_{a\to\infty}\frac{1}{\sqrt{2\pi}\,a}e^{-a^2/2} = 0,$$
and
$$P_e = \frac{1}{\sqrt{2\pi}}\int_a^\infty e^{-u^2/2}\,du \leq \frac{1}{\sqrt{2\pi}}\int_a^\infty \frac{u}{a}\,e^{-u^2/2}\,du = \frac{1}{\sqrt{2\pi}\,a}e^{-a^2/2},$$
where we used $u/a \geq 1$ throughout the range of integration. Hence $P_e \to 0$ as $a = (\mu_2-\mu_1)/(2\sigma) \to \infty$.
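Both the exact tail probability and this bound can be evaluated with the standard library; a small sketch (the values of $a$ are arbitrary):

```python
import math

def p_error(a):
    """Exact tail probability (1/sqrt(2 pi)) * int_a^inf e^{-u^2/2} du."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def bound(a):
    """Upper bound e^{-a^2/2} / (sqrt(2 pi) a)."""
    return math.exp(-a * a / 2.0) / (math.sqrt(2.0 * math.pi) * a)

for a in [0.5, 1.0, 2.0, 4.0]:
    assert p_error(a) <= bound(a)     # the bound holds
assert bound(8.0) < 1e-14             # and it drives the error to zero
```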
32. Expanding the minimum-distance rule $\|x-\mu_1\|^2 < \|x-\mu_2\|^2$ gives
$$x^t x - 2\mu_1^t x + \mu_1^t\mu_1 < x^t x - 2\mu_2^t x + \mu_2^t\mu_2,$$
that is: choose $\omega_1$ if
$$(\mu_2-\mu_1)^t x < \frac{1}{2}\left(\mu_2^t\mu_2 - \mu_1^t\mu_1\right).$$
The projection $Y_i = (\mu_2-\mu_1)^t x$ for $x$ drawn from $\omega_i$ is a scalar Gaussian with
$$\mu_i' = E(Y_i) = (\mu_2-\mu_1)^t\mu_i, \qquad \sigma'^2 = \text{Var}(Y_i) = (\mu_2-\mu_1)^t\sigma^2 I(\mu_2-\mu_1) = \sigma^2\|\mu_2-\mu_1\|^2.$$
Hence, by the result of the previous problem,
$$P_e = \frac{1}{\sqrt{2\pi}}\int_{a'}^\infty e^{-u^2/2}\,du,$$
where
$$a' = \frac{|\mu_2'-\mu_1'|}{2\sigma'} = \frac{|(\mu_2-\mu_1)^t\mu_2 - (\mu_2-\mu_1)^t\mu_1|}{2\sigma\|\mu_2-\mu_1\|} = \frac{(\mu_2-\mu_1)^t(\mu_2-\mu_1)}{2\sigma\|\mu_2-\mu_1\|} = \frac{\|\mu_2-\mu_1\|}{2\sigma}.$$
From part (b) of the previous problem, $\frac{1}{\sqrt{2\pi}\,a'}e^{-a'^2/2} \to 0$ if $a' \to \infty$. For the means given in the problem, $a'$ grows as
$$a' \propto \left[\sum_{i=1}^d \mu_i^2\right]^{1/2} \to \infty \quad\text{as } d\to\infty,$$
so $P_e \to 0$ as $d \to \infty$ unless $\sum_{i=1}^\infty \mu_i^2 < \infty$.
33. Consider a line in $d$ dimensions that is projected down to a line in the lower-dimensional space.

(a) The conditional error in $d$ dimensions is
$$E_d = \int_{\text{line}}\min\left[P(\omega_1)p(x|\omega_1),\,P(\omega_2)p(x|\omega_2)\right]dx = \int_{\Omega_1}P(\omega_1)p(x|\omega_1)\,dx + \int_{\Omega_2}P(\omega_2)p(x|\omega_2)\,dx,$$
where $\Omega_1$ and $\Omega_2$ are the disjoint segments on which the posterior probabilities of each of the distributions are minimum. In the lower-dimensional space we have
$$E_{d-1} = \min\left[\int_{\text{line}}P(\omega_1)p(x|\omega_1)\,dx,\ \int_{\text{line}}P(\omega_2)p(x|\omega_2)\,dx\right].$$
Suppose, without loss of generality, that the first term achieves the minimum; then
$$E_{d-1} = \int_{\text{line}}P(\omega_1)p(x|\omega_1)\,dx = \int_{\Omega_1}P(\omega_1)p(x|\omega_1)\,dx + \int_{\Omega_2}P(\omega_1)p(x|\omega_1)\,dx$$
$$= \int_{\Omega_1}P(\omega_1)p(x|\omega_1)\,dx + \int_{\Omega_2}\left[P(\omega_2)p(x|\omega_2) + f(x)\right]dx = E_d + \int_{\Omega_2}f(x)\,dx \geq E_d,$$
where we have used the fact that $P(\omega_1)p(x|\omega_1) = P(\omega_2)p(x|\omega_2) + f(x)$ in $\Omega_2$ for some non-negative function $f$.

(b) In actual pattern recognition problems we rarely have the true distributions, but instead have estimates. The above derivation does not hold if the distributions used are merely estimates.
Section 2.8
34. From Eq. 75 in the text, for the one-dimensional Gaussians of equal variance given in the problem the log term vanishes, and the Chernoff exponent reduces to
$$k(\beta) = \frac{\beta(1-\beta)}{2}\,\frac{(\mu_1-\mu_2)^2}{\sigma^2},$$
which for the values given in the problem equals $2\beta(1-\beta)$. We take the derivative with respect to $\beta$ and set it to zero to find the value of $\beta$ that leads to the extreme value of this error. This optimizing value is $\beta^* = 0.5$, which gives $k(\beta^*) = 0.5$. The Chernoff bound is thus
$$P_{Cher}(\text{error}) \leq \sqrt{P(\omega_1)P(\omega_2)}\,e^{-1/2}.$$
In this case, the Bhattacharyya bound is the same as the Chernoff bound:
$$P_{Bhat}(\text{error}) \leq \sqrt{P(\omega_1)P(\omega_2)}\,e^{-1/2}.$$
(b) Recall the Bayes error given by Eqs. 5 and 7 in the text:
$$P(\text{error}) = P(x\in R_2|\omega_1)P(\omega_1) + P(x\in R_1|\omega_2)P(\omega_2) = \int_{R_2}p(x|\omega_1)P(\omega_1)\,dx + \int_{R_1}p(x|\omega_2)P(\omega_2)\,dx.$$
The decision point $x_B$ satisfies
$$\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-(x-\mu)^2/(2\sigma^2)\right]P(\omega_2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-(x+\mu)^2/(2\sigma^2)\right]P(\omega_1),$$
and thus
$$\ln P(\omega_2) - \frac{(x-\mu)^2}{2\sigma^2} = \ln P(\omega_1) - \frac{(x+\mu)^2}{2\sigma^2},$$
which gives
$$x_B = \frac{\sigma^2}{2\mu}\ln\frac{P(\omega_1)}{P(\omega_2)}.$$
The Bayes error is then
$$P(\text{error}) = \int_{-\infty}^{x_B}\frac{P(\omega_2)}{\sqrt{2\pi}\,\sigma}\exp\left[-(x-\mu)^2/(2\sigma^2)\right]dx + \int_{x_B}^{\infty}\frac{P(\omega_1)}{\sqrt{2\pi}\,\sigma}\exp\left[-(x+\mu)^2/(2\sigma^2)\right]dx.$$
35. The Bhattacharyya bound uses, from Eq. 76 in the text,
$$k(1/2) = \frac{1}{8}(\mu_2-\mu_1)^t\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\mu_2-\mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}.$$
We make the following denitions:
A
1 + 2
2
1 +
2
=
2 1
2
Add
ut1d
d1
11
ud1
c11
,
where = 2 1 .
A
t
t 1 + 2
(2 1 )
(2 1 ) = ( )
t
u
2
u
c
1
.
= (
1 +
2 )/2 is the average
Note that A = (1 + 2 )/2 is invertible and A
of the appropriate covariance matrices,
1
xA1 + A1 uut A1 A1 u
1 = At u
A
,
=
ut A1
u c
where = c ut A1 u. Thus, we have the following:
1
21
2
1 +
2
1)
2
1)
(
(
2
xA1 + A1 uut A1 A1 u
t
= ( )
ut A1
= t A1 + t A1 uut A1 ut A1 t A1 u + 2
= t A1 + ( t A1 u)2
t A1
A
and A are positive denite)
> 0 (and A
A
1
1 + 2
t
(2 1 )
(2 1 ).
2
as = c ut A1 u =
=
Consequently, we have
21
1
1
2
1 +
1 + 2
t
2
1)
2
1 ) (2 1 )
(
(
(2 1 ),
2
2
where equality holds if and only if = t AA1 u. Now we let
i ui
i =
,
i = 1, 2.
uti ci
i  = i i , where i = ci ut 1 ui . We have
Then 
i i
2 1 + 2
1 +
= A
A =
=
2
2
A u
2
2
= 1 + 2 = 1 + 2 =
A
=
.
ut1 +ut2
c1 +c2
ut c
2
2
2
Thus, u = (u1 + u2 )/2 and c = (c1 + c2 )/2. We substitute these values into the
above equations and nd
1
2
2 
1 +2 
 1 +

2
= ln
ln 3 2

1
1 2 2 
1 2 
1
2
2 
 1 +
2
= ln
+ ln
1 2
1 2 
If $X_i \sim N(\mu_i, \Sigma_i)$ and $\tilde X_i = (X_i, X_{i,d+1}) \sim N(\tilde\mu_i, \tilde\Sigma_i)$ for $i = 1, 2$, then the conditional variance of $X_{i,d+1}$ given $X_i$ is
$$\gamma_i = \text{Var}(X_{i,d+1}|X_i), \qquad i = 1, 2.$$
Now let
$$\tilde Y = (Y, Y_{d+1}) = \frac{\tilde X_1 + \tilde X_2}{2},$$
and thus $\gamma = \text{Var}(Y_{d+1}|Y)$. Assume now that $X_1$ and $X_2$ are independent. In that case we have
$$\gamma = \frac{1}{2}\left[\text{Var}(X_{1,d+1}|X_1) + \text{Var}(X_{2,d+1}|X_2)\right] = \frac{1}{2}(\gamma_1 + \gamma_2).$$
2
ln 3
= ln 3 + ln 2
1 2
2
2
1
1
d+1
d
0 since
1 +2
1 2
2
and thus
1 +2
2
ln 3
2
1
d+1
1 +2
2
ln 3 ,
2
1
= 2 ,
where
= 2 1
1 + 2
=
2
1
2
1 +
2
A
=
=
t
u
2
u
c
= 2,d+1 1,d+1
i
= ci uti 1
i ui .
For example, the condition of equality is satisfied if $\Sigma_1 = \Sigma_2$ and $\bar\mu_1 = \bar\mu_2$.

(d) No, the true error can never increase as we go to a higher dimension, as illustrated in Fig. 3.3 in the text.
$$P_{d+1}(\text{error}) \leq \sqrt{P(\omega_1)P(\omega_2)}\,e^{-k_{d+1}(1/2)} \quad\text{and}\quad P_d(\text{error}) \leq \sqrt{P(\omega_1)P(\omega_2)}\,e^{-k_d(1/2)}.$$
But there is no clear relation between $P_{d+1}(\text{error})$ and the bound $\sqrt{P(\omega_1)P(\omega_2)}\,e^{-k_d(1/2)}$.

36. First note the definition of $k(\beta)$ given by Eq. 75 in the text:
$$k(\beta) = \frac{\beta(1-\beta)}{2}(\mu_2-\mu_1)^t\left[\beta\Sigma_2+(1-\beta)\Sigma_1\right]^{-1}(\mu_2-\mu_1) + \frac{1}{2}\ln\frac{\left|\beta\Sigma_2+(1-\beta)\Sigma_1\right|}{|\Sigma_2|^\beta\,|\Sigma_1|^{1-\beta}}.$$
(2)d/2
(b) Again from Eq. 74 we have
k()
where
t 1
exp /2 t1 1
1 1 (1 )/2 2 2 2
=
1 /2 2 (1)/2
1 t 1
exp 2 {x A x 2xA1 }
dx
(2)d/2
1 1
A = 1
1 + (1 )2
and
1
A1 = 1
1 1 + (1 )2 2 .
1
exp (x )t A1 (x ) dx
2
t 1
(2)d/2 e1/2 A A1/2
= e2
1
A1
since
t
g(x)
e 2 (x ) A (x )
,
(2)d/2 A1/2
where g(x) has the form of a ddimensional Gaussian density. So it follows that
1
k()
1
t 1
t 1
e
= exp {A + 1 1 + (1 )2 2 2 }
2
A1/2
[]
1 /2 2 (1)/2
Problem not yet solved
37. We are given that $P(\omega_1) = P(\omega_2) = 0.5$ and
$$p(x|\omega_1) \sim N(\mathbf{0}, I), \qquad p(x|\omega_2) \sim N(\mathbf{1}, I),$$
where $\mathbf{1}$ is a two-component vector of 1s.

(a) The inverse matrices are simple in this case:
$$\Sigma_1^{-1} = \Sigma_2^{-1} = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}.$$
We substitute these into Eqs. 53-55 in the text and find
$$g_1(x) = w_1^t x + w_{10} = \begin{pmatrix}0\\0\end{pmatrix}^t\begin{pmatrix}x_1\\x_2\end{pmatrix} + 0 + \ln(1/2) = \ln(1/2)$$
and
$$g_2(x) = w_2^t x + w_{20} = (1, 1)\begin{pmatrix}x_1\\x_2\end{pmatrix} - \frac{1}{2}(1, 1)\begin{pmatrix}1\\1\end{pmatrix} + \ln(1/2) = x_1 + x_2 - 1 + \ln(1/2).$$
The Bhattacharyya bound follows from Eq. 76 in the text:
$$k(1/2) = \frac{1}{8}\begin{pmatrix}1\\1\end{pmatrix}^t\left[\frac{I+I}{2}\right]^{-1}\begin{pmatrix}1\\1\end{pmatrix} + \frac{1}{2}\ln\frac{|I|}{\sqrt{|I|\,|I|}} = \frac{1}{8}\cdot 2 + \frac{1}{2}\ln 1 = 1/4.$$
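A quick numeric check of this value of $k(1/2)$ and of the resulting Bhattacharyya bound $\sqrt{P(\omega_1)P(\omega_2)}\,e^{-k(1/2)} \approx 0.3894$:

```python
import math

# Bhattacharyya bound for p(x|w1) ~ N(0, I), p(x|w2) ~ N(1, I) in 2-D,
# equal priors. With Sigma1 = Sigma2 = I the log-determinant term
# vanishes, so k(1/2) = (1/8) ||mu2 - mu1||^2.
mu_diff = (1.0, 1.0)
k_half = sum(m * m for m in mu_diff) / 8.0
assert abs(k_half - 0.25) < 1e-12

bound = math.sqrt(0.5 * 0.5) * math.exp(-k_half)   # sqrt(P1 P2) e^{-k(1/2)}
assert abs(bound - 0.5 * math.exp(-0.25)) < 1e-12  # about 0.3894
```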
1
0
0 1
1 0 1 0
(0 1)+(0 1)
1
2
1
0
1
+ ln 3
2
1
0
1 0 1 0
0 1
0 1
8/5
2/15
2/15
8/15
5/9 4/9
=
.
4/9
5/9
=
ln
+ ln
2/15
8/15 0
2 0
2 0.5 2
2
=
and
g2 (x)
4 2
2
4
1
x + x1 x2 x22 0.66 + ln ,
15 1 15
15
2
t
t
1 x1
5/9 4/9
x1
5/9 4/9
1
x1
+
2 x2
x2
4/9
5/9 x2
4/9
5/9 1
t
1 1
5/9 4/9
1
1 5 4
1
ln
+ ln
4/9
5/9 1
2 1
2 4 5
2
5 2
1
8
5
1
1
1
x + x1 x2 x22 + x1 + x2 1.1 + ln .
18 1 18
18
9
9
9
2
t 2 0.5 5
1
1
0
0.5 2 + 4
8
1
0
2
4
5
2 0.5 5 4
(0.5 2)+(4 5)
1
2
1
0
+ ln 3
1
0
2 0.5 5 4
t
1
1 1
3.5 2.25
1
1 7.1875
=
+ ln
8 1
2 5.8095
2.25 3.5
1
= 0.1499.
0.5 2
4 5
[Figure: the decision regions $R_1$ and $R_2$ in the $(x_1, x_2)$ plane.]
38. (a) Assume without loss of generality that $a \leq b$, or equivalently $b = a + \epsilon$ for $\epsilon \geq 0$. Then $\sqrt{ab} = \sqrt{a(a+\epsilon)} \geq \sqrt{a^2} = a = \min[a,b]$.

(b) Using the above result and the formula for the probability of error given by Eq. 7 in the text, we have:
$$P(\text{error}) = \int\min\left[P(\omega_1)p(x|\omega_1),\,P(\omega_2)p(x|\omega_2)\right]dx \leq \sqrt{P(\omega_1)P(\omega_2)}\int\sqrt{p(x|\omega_1)p(x|\omega_2)}\,dx \leq \frac{1}{2}\,\rho,$$
where $\rho = \int\sqrt{p(x|\omega_1)p(x|\omega_2)}\,dx$, and for the last step we have used the fact that $\sqrt{P(\omega_1)P(\omega_2)} \leq 1/2$, which follows from the normalization condition $P(\omega_1) + P(\omega_2) = 1$.
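The elementary inequality $\min[a,b] \leq \sqrt{ab}$ that drives this bound can be spot-checked directly; a minimal sketch:

```python
# Spot-check min[a, b] <= sqrt(a*b) for nonnegative a, b, the inequality
# behind the Bhattacharyya bound on the Bayes error.
import math

vals = [0.0, 0.1, 0.5, 1.0, 2.0, 10.0]
for a in vals:
    for b in vals:
        assert min(a, b) <= math.sqrt(a * b) + 1e-12
```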
39. We assume the underlying distributions are Gaussian.

(a) Based on the Gaussian assumption, we can calculate $(x^*-\mu_2)/\sigma$ from the hit rate $P_{hit} = P(x > x^*|x\in\omega_2)$. We can also calculate $(x^*-\mu_1)/\sigma$ from the false alarm rate $P_{false} = P(x > x^*|x\in\omega_1)$. Since $\sigma_1 = \sigma_2 = \sigma$, the discriminability is simply
$$d' = \frac{\mu_2-\mu_1}{\sigma} = \frac{x^*-\mu_1}{\sigma} - \frac{x^*-\mu_2}{\sigma}.$$
(b) Recall the error function from Eq. 96 in the Appendix of the text:
$$\text{erf}[u] = \frac{2}{\sqrt{\pi}}\int_0^u e^{-x^2}dx,$$
and thus
$$\frac{x^*-\mu_i}{\sigma} = \text{erf}^{-1}\left[1/2 - P(x > x^*|x\in\omega_i)\right],$$
so $d'$ can be computed from the measured rates as $d' = (x^*-\mu_1)/\sigma - (x^*-\mu_2)/\sigma$.

(c) The error rates are
$$P(\text{error}) = \frac{1}{2}\left[0.3 + (1-0.8)\right] = 0.25$$
and
$$P(\text{error}) = \frac{1}{2}\left[0.4 + (1-0.7)\right] = 0.35.$$
(d) Because of the symmetry property of the ROC curve, the point $(P_{hit}, P_{false})$ and the point $(1-P_{hit}, 1-P_{false})$ lie on the same curve corresponding to some fixed $d'$. For case B, $(0.1, 0.3)$ is also a point on the ROC curve on which $(0.9, 0.7)$ lies. Comparing this point with case A, which goes through $(0.8, 0.3)$, and with the help of Fig. 2.20 in the text, we can see that case A has a higher discriminability $d'$.
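This comparison can be reproduced with the inverse normal CDF from the standard library (used here in place of the text's $\text{erf}^{-1}$ convention; the numbers are the hit/false-alarm pairs quoted above):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf   # inverse standard-normal CDF

def d_prime(hit, false_alarm):
    """d' = (mu2 - mu1)/sigma recovered from equal-variance ROC rates,
    where hit = P(x > x*|w2) and false_alarm = P(x > x*|w1)."""
    return z(hit) - z(false_alarm)

d_A = d_prime(0.8, 0.3)    # case A
d_B = d_prime(0.9, 0.7)    # case B, on the same curve as (0.1, 0.3)
assert d_A > d_B           # case A is more discriminable
```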
40. We are to assume that the two Gaussians underlying the ROC curve have different variances.

(a) From the hit rate $P_{hit} = P(x > x^*|x\in\omega_2)$ we can calculate $(x^*-\mu_2)/\sigma_2$. From the false alarm rate $P_{false} = P(x > x^*|x\in\omega_1)$ we can calculate $(x^*-\mu_1)/\sigma_1$. Let us denote the ratio of the standard deviations as $\sigma_1/\sigma_2 = K$. Then we can write the discriminability in this case as
$$d_a' = \frac{\mu_2-\mu_1}{\sqrt{\sigma_1\sigma_2}} = \sqrt{K}\,\frac{x^*-\mu_1}{\sigma_1} - \frac{1}{\sqrt{K}}\,\frac{x^*-\mu_2}{\sigma_2}.$$
Because we cannot determine $K$ from $(x^*-\mu_2)/\sigma_2$ and $(x^*-\mu_1)/\sigma_1$ alone, we cannot determine $d_a'$ uniquely with only $P_{hit} = P(x > x^*|x\in\omega_2)$ and $P_{false} = P(x > x^*|x\in\omega_1)$.
(b) Suppose we are given the following four experimental rates:
$$P_{hit1} = P(x > x_1^*|\omega_2), \quad P_{hit2} = P(x > x_2^*|\omega_2), \quad P_{false1} = P(x > x_1^*|\omega_1), \quad P_{false2} = P(x > x_2^*|\omega_1).$$
From these we can solve for the four quantities
$$a_i = \frac{x_i^*-\mu_2}{\sigma_2} \quad\text{and}\quad b_i = \frac{x_i^*-\mu_1}{\sigma_1}, \qquad i = 1, 2,$$
for the two criterion points $x_1^*$ and $x_2^*$. Then
$$\frac{b_1-b_2}{a_1-a_2} = \frac{(x_1^*-x_2^*)/\sigma_1}{(x_1^*-x_2^*)/\sigma_2} = \frac{\sigma_2}{\sigma_1},$$
which determines $K$ and hence $d_a'$.
[Figure: the densities $p(x|\omega_i)$ with threshold $x^*$, and the corresponding ROC curve of $P(x < x^*|x\in\omega_1)$ versus $P(x < x^*|x\in\omega_2)$.]
p(xi)
x*
x
2
x 
otherwise.
x
H
= 1
T (2 , )dx = 1
2
1 +
T (1 , )dx =
x
dT
= 1
(2 dT
= 1
2
2F
(x 2 + )2
2 2
(1 + x )2
2 2
2(1 H)
2F )2
(x 1 + )2
2 2
(x 1 + )2
= 1
2 2
=
2(1 F ) 2(1 H)
( 2(1 F ) dT )2
= 1
2
=
(1 + x )2
2 2
(2 + x )2
=
2
2
=
2H 2F ,
=
[Figure: curves of $P(x < x^*|x\in\omega_1)$ versus $P(x < x^*|x\in\omega_2)$ for the triangle densities.]
(c) For $H = 0.4$ and $F = 0.2$, we can rule out case I because $H < 0.5$. We can rule out case II-i because $F < 0.5$. We can rule out case II-ii because $H < 0.5$.

42. The lower bound is
$$b_L(p) = \frac{1}{\beta}\ln\left[\frac{1+e^{-\beta}}{e^{-\beta p}+e^{-\beta(1-p)}}\right]$$
with $\beta > 0$. First, note that $b_L(p)$ is symmetric with respect to the interchange $p \leftrightarrow (1-p)$, and thus it is symmetric with respect to the value $p = 0.5$. We need consider only the range $0 \leq p \leq 1/2$, as the limit properties we prove there will also hold for $1/2 \leq p \leq 1$. In the range $0 \leq p \leq 1/2$, we have $\min[p, 1-p] = p$.
The simplest way to prove that $b_L(p)$ is a lower bound for $p$ in the range $0 \leq p \leq 1/2$ is to note that $b_L(0) = 0$, and that its derivative is always less than the derivative of $\min[p, 1-p]$. Indeed, at $p = 0$ we have
$$b_L(0) = \frac{1}{\beta}\ln\left[\frac{1+e^{-\beta}}{1+e^{-\beta}}\right] = \frac{1}{\beta}\ln[1] = 0.$$
Moreover, the derivative obeys
$$\frac{\partial b_L(p)}{\partial p} = \frac{e^{-\beta p} - e^{-\beta(1-p)}}{e^{-\beta p} + e^{-\beta(1-p)}} < 1 = \frac{\partial}{\partial p}\min[p, 1-p]$$
in this range. Using the results from part (a) we find
$$\lim_{\beta\to\infty}\frac{\partial b_L(p)}{\partial p} = \lim_{\beta\to\infty}\frac{e^{-\beta p} - e^{-\beta(1-p)}}{e^{-\beta p} + e^{-\beta(1-p)}} = 1.$$
We show that this is an upper bound by calculating the difference between this bound and the Bayes limit (which is $\min[p, 1-p] = p$ in the range $0 \leq p \leq 1/2$). Thus we consider $b_U(p) - p$.
(d) We seek to confirm that $b_G(p) = (1/2)\sin[\pi p]$ has the following four properties:

$b_G(p) \geq \min[p, 1-p]$: Indeed, $(1/2)\sin[\pi p] \geq p$ for $0 \leq p \leq 1/2$, with equality holding at the extremes of the interval (that is, at $p = 0$ and $p = 1/2$). By symmetry (see below), the relation holds for the interval $1/2 \leq p \leq 1$.

$b_G(p) = b_G(1-p)$: Indeed, the sine function is symmetric about the point $\pi/2$, that is, $(1/2)\sin[\pi/2+\theta] = (1/2)\sin[\pi/2-\theta]$. Hence by a simple substitution we see that $(1/2)\sin[\pi p] = (1/2)\sin[\pi(1-p)]$.

$b_G(0) = b_G(1) = 0$: Indeed, $(1/2)\sin[\pi\cdot 0] = (1/2)\sin[\pi\cdot 1] = 0$, a special case of the fact $b_G(p) = b_G(1-p)$, as shown immediately above.

$b_G(0.5) = 0.5$: Indeed, $(1/2)\sin[\pi\cdot 0.5] = (1/2)\cdot 1 = 0.5$.
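These four properties can be spot-checked on a grid; a minimal sketch:

```python
import math

b_G = lambda p: 0.5 * math.sin(math.pi * p)

for i in range(101):
    p = i / 100.0
    assert b_G(p) >= min(p, 1.0 - p) - 1e-12   # upper-bounds min[p, 1-p]
    assert abs(b_G(p) - b_G(1.0 - p)) < 1e-12  # symmetric about p = 1/2
assert abs(b_G(0.0)) < 1e-12 and abs(b_G(1.0)) < 1e-12
assert abs(b_G(0.5) - 0.5) < 1e-12
```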
(e) See figure.
[Figure: the lower bounds $b_L(p;\beta)$ and upper bounds $b_U(p;\beta)$ for $\beta = 1, 10, 50$, plotted together with $\min[p, 1-p]$.]
Section 2.9
43. Here the components of the vector $x = (x_1, \ldots, x_d)^t$ are binary-valued (0 or 1), and
$$p_{ij} = \Pr[x_i = 1|\omega_j], \qquad i = 1,\ldots,d;\ j = 1,\ldots,c.$$

(a) Thus $p_{ij}$ is simply the probability we get a 1 in feature $x_i$ given that the category is $\omega_j$. This is the kind of probability structure we find when each category has a set of independent binary features (or even real-valued features, thresholded in the form "$y_i > y_{i0}$?").
(b) The discriminant functions are then
$$g_j(x) = \ln p(x|\omega_j) + \ln P(\omega_j).$$
The components of $x$ are statistically independent for all $x$ in $\omega_j$, so we can write the density as a product:
$$p(x|\omega_j) = p\left((x_1,\ldots,x_d)^t|\omega_j\right) = \prod_{i=1}^d p(x_i|\omega_j) = \prod_{i=1}^d p_{ij}^{x_i}(1-p_{ij})^{1-x_i}.$$
Thus the discriminant functions can be written as
$$g_j(x) = \sum_{i=1}^d x_i\ln\frac{p_{ij}}{1-p_{ij}} + \sum_{i=1}^d \ln(1-p_{ij}) + \ln P(\omega_j).$$
44. The minimum probability of error is achieved by the following decision rule:

Choose $\omega_k$ if $g_k(x) \geq g_j(x)$ for all $j \neq k$,

where here we will use the discriminant function
$$g_j(x) = \ln p(x|\omega_j) + \ln P(\omega_j).$$
The components of $x$ are statistically independent for all $x$ in $\omega_j$, and therefore
$$p(x|\omega_j) = p\left((x_1,\ldots,x_d)^t|\omega_j\right) = \prod_{i=1}^d p(x_i|\omega_j),$$
where
$$p_{ij} \equiv \Pr[x_i = +1|\omega_j], \qquad q_{ij} \equiv \Pr[x_i = 0|\omega_j], \qquad r_{ij} \equiv \Pr[x_i = -1|\omega_j].$$
As in Sect. 2.9.1 in the text, we use exponents to select the proper probability, that is, exponents that have value 1.0 when $x_i$ has the value corresponding to the particular probability and value 0.0 for the other values of $x_i$. For instance, for the $p_{ij}$ term, we seek an exponent that has value 1.0 when $x_i = +1$ but is 0.0 when $x_i = 0$ and when $x_i = -1$. The simplest such exponent is $\frac{1}{2}x_i + \frac{1}{2}x_i^2$. For the $q_{ij}$ term, the simplest exponent is $1 - x_i^2$, and so on. Thus we write the class-conditional probability for a single component $x_i$ as:
$$p(x_i|\omega_j) = p_{ij}^{\frac{1}{2}x_i+\frac{1}{2}x_i^2}\; q_{ij}^{1-x_i^2}\; r_{ij}^{-\frac{1}{2}x_i+\frac{1}{2}x_i^2}, \qquad i = 1,\ldots,d;\ j = 1,\ldots,c,$$
and the discriminant function becomes
$$g_j(x) = \ln p(x|\omega_j) + \ln P(\omega_j) = \sum_{i=1}^d\left[\left(\frac{x_i}{2}+\frac{x_i^2}{2}\right)\ln p_{ij} + (1-x_i^2)\ln q_{ij} + \left(-\frac{x_i}{2}+\frac{x_i^2}{2}\right)\ln r_{ij}\right] + \ln P(\omega_j)$$
$$= \sum_{i=1}^d \frac{x_i^2}{2}\ln\frac{p_{ij}r_{ij}}{q_{ij}^2} + \frac{1}{2}\sum_{i=1}^d x_i\ln\frac{p_{ij}}{r_{ij}} + \sum_{i=1}^d \ln q_{ij} + \ln P(\omega_j).$$
45. In this problem, the priors are equal ($P(\omega_1) = P(\omega_2) = 1/2$) and the feature probabilities are
$$p_{i1} = p > 1/2 \quad\text{and}\quad p_{i2} = 1-p, \qquad i = 1,\ldots,d.$$

(a) The minimum-error decision rule is:

Choose $\omega_1$ if
$$\sum_{i=1}^d x_i\ln\frac{p}{1-p} + \sum_{i=1}^d \ln(1-p) + \ln\frac{1}{2} > \sum_{i=1}^d x_i\ln\frac{1-p}{p} + \sum_{i=1}^d \ln p + \ln\frac{1}{2};$$
otherwise choose $\omega_2$. Since $p > 1/2$, the rule reduces to:

Choose $\omega_1$ if $\displaystyle\sum_{i=1}^d x_i > d/2$.
(b) We denote the minimum probability of error as $P_e(d,p)$. Then we have:
$$P_e(d,p) = P(\text{error}|\omega_1)P(\omega_1) + P(\text{error}|\omega_2)P(\omega_2) = P\left(\sum_{i=1}^d x_i \leq d/2\,\Big|\,\omega_1\right)\cdot\frac{1}{2} + P\left(\sum_{i=1}^d x_i > d/2\,\Big|\,\omega_2\right)\cdot\frac{1}{2}.$$
As $d$ is odd and $\sum_i x_i$ is an integer, we have
$$P_e(d,p) = P\left(\sum_{i=1}^d x_i \leq \frac{d-1}{2}\,\Big|\,\omega_1\right)\cdot\frac{1}{2} + P\left(\sum_{i=1}^d x_i \geq \frac{d+1}{2}\,\Big|\,\omega_2\right)\cdot\frac{1}{2}$$
$$= \frac{1}{2}\sum_{k=0}^{(d-1)/2}\binom{d}{k}p^k(1-p)^{d-k} + \frac{1}{2}\sum_{k=(d+1)/2}^{d}\binom{d}{k}(1-p)^k p^{d-k}.$$
Substituting $k' = d - k$ in the second sum shows the two sums are equal, and thus
$$P_e(d,p) = \sum_{k=0}^{(d-1)/2}\binom{d}{k}p^k(1-p)^{d-k}.$$
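The closed-form error is easy to evaluate; the sketch below checks that it falls as $d$ grows for informative features, and equals $1/2$ when $p = 1/2$:

```python
from math import comb

def p_error(d, p):
    """P_e(d,p) = sum_{k=0}^{(d-1)/2} C(d,k) p^k (1-p)^{d-k}, for odd d."""
    return sum(comb(d, k) * p**k * (1.0 - p)**(d - k)
               for k in range((d - 1)//2 + 1))

# Error shrinks with more independent, informative features...
assert p_error(11, 0.9) < p_error(5, 0.9) < p_error(1, 0.9)
# ...and equals 1/2 when the features are uninformative (p = 1/2).
assert abs(p_error(11, 0.5) - 0.5) < 1e-12
```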
(c) We seek the limit for $p \to 1/2$ of $P_e(d,p)$. Formally, we write
$$\lim_{p\to 1/2}P_e(d,p) = \sum_{k=0}^{(d-1)/2}\binom{d}{k}\lim_{p\to 1/2}p^k(1-p)^{d-k} = \sum_{k=0}^{(d-1)/2}\binom{d}{k}\left(\frac{1}{2}\right)^d = \left(\frac{1}{2}\right)^d\cdot\frac{2^d}{2} = \frac{1}{2},$$
where we used $\sum_{k=0}^{(d-1)/2}\binom{d}{k} = 2^d/2$ for odd $d$. That is, if the features give no information ($p = 1/2$), the error is $1/2$, just as we would expect.
Next we seek the limit of $P_e(d,p)$ as $d \to \infty$ for fixed $p > 1/2$. By the DeMoivre-Laplace (central limit) approximation to the binomial, with $q = 1-p$,
$$\lim_{d\to\infty}P_e(d,p) = \lim_{d\to\infty}P\left(Z \leq \frac{d/2 - dp}{\sqrt{dpq}}\right) = \lim_{d\to\infty}P\left(Z \leq \frac{\sqrt{d}\,(1/2 - p)}{\sqrt{pq}}\right) = 0,$$
where $Z \sim N(0,1)$, since $1/2 - p < 0$.
46. The general minimum-risk discriminant rule is given by Eq. 16 in the text:

Choose $\omega_1$ if $(\lambda_{11}-\lambda_{21})p(x|\omega_1)P(\omega_1) < (\lambda_{22}-\lambda_{12})p(x|\omega_2)P(\omega_2)$; otherwise choose $\omega_2$.

Under the assumption $\lambda_{21} > \lambda_{11}$ and $\lambda_{12} > \lambda_{22}$, we thus have:

Choose $\omega_1$ if
$$\frac{(\lambda_{21}-\lambda_{11})p(x|\omega_1)P(\omega_1)}{(\lambda_{12}-\lambda_{22})p(x|\omega_2)P(\omega_2)} > 1,$$
or

Choose $\omega_1$ if
$$\ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)} + \ln\frac{\lambda_{21}-\lambda_{11}}{\lambda_{12}-\lambda_{22}} > 0.$$

Thus the discriminant function for minimum risk, derived by taking logarithms in Eq. 17 in the text, is:
$$g(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)} + \ln\frac{\lambda_{21}-\lambda_{11}}{\lambda_{12}-\lambda_{22}}.$$
For the independent binary features, the likelihood-ratio term is
$$\ln\frac{p(x|\omega_1)}{p(x|\omega_2)} = \sum_{i=1}^d w_i x_i + \sum_{i=1}^d \ln\frac{1-p_i}{1-q_i},$$
where
$$w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \qquad i = 1,\ldots,d.$$
Therefore the discriminant function can be written as
$$g(x) = \sum_{i=1}^d w_i x_i + \sum_{i=1}^d \ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)} + \ln\frac{\lambda_{21}-\lambda_{11}}{\lambda_{12}-\lambda_{22}} = w^t x + w_0.$$
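A minimal sketch of this discriminant; the feature probabilities, priors, and losses below are illustrative values, not from the problem:

```python
import math

# Minimum-risk linear discriminant for independent binary features.
p = [0.8, 0.7, 0.6]      # p_i = Pr[x_i = 1 | w1]  (illustrative)
q = [0.3, 0.4, 0.5]      # q_i = Pr[x_i = 1 | w2]  (illustrative)
P1, P2 = 0.5, 0.5
lam21, lam11, lam12, lam22 = 1.0, 0.0, 1.0, 0.0   # zero-one loss

w = [math.log(p[i] * (1 - q[i]) / (q[i] * (1 - p[i]))) for i in range(3)]
w0 = (sum(math.log((1 - p[i]) / (1 - q[i])) for i in range(3))
      + math.log(P1 / P2) + math.log((lam21 - lam11) / (lam12 - lam22)))

def g(x):
    """Decide w1 when g(x) > 0."""
    return sum(w[i] * x[i] for i in range(3)) + w0

assert g([1, 1, 1]) > 0    # all features on: looks like w1
assert g([0, 0, 0]) < 0    # all features off: looks like w2
```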
47. Recall that the Poisson distribution for a discrete variable $x = 0, 1, 2, \ldots$ and real parameter $\lambda$ is given by
$$P(x|\lambda) = e^{-\lambda}\frac{\lambda^x}{x!}.$$
(a) The mean is
$$E[x] = \sum_{x=0}^\infty x\,P(x|\lambda) = \sum_{x=0}^\infty x\,e^{-\lambda}\frac{\lambda^x}{x!} = \lambda e^{-\lambda}\sum_{x=1}^\infty\frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda}e^{\lambda} = \lambda.$$

(b) For the variance we first compute
$$E[x^2] = \sum_{x=0}^\infty x(x-1)\,e^{-\lambda}\frac{\lambda^x}{x!} + E[x] = \lambda^2 e^{-\lambda}\sum_{x=2}^\infty\frac{\lambda^{x-2}}{(x-2)!} + \lambda = \lambda^2 e^{-\lambda}\sum_{x'=0}^\infty\frac{\lambda^{x'}}{x'!} + \lambda = \lambda^2 e^{-\lambda}e^{\lambda} + \lambda = \lambda^2 + \lambda,$$
where we made the substitution $x' \leftarrow (x-2)$ in the fourth step. We put the above results together and find
$$\text{Var}[x] = E\left[(x - E[x])^2\right] = E[x^2] - (E[x])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
(c) First recall the notation $\lfloor\lambda\rfloor$, read "floor of $\lambda$," which indicates the greatest integer less than or equal to $\lambda$. For all $x < \lfloor\lambda\rfloor \leq \lambda$, we have $\lambda/x > 1$ and thus
$$\frac{\lambda^x}{x!} = \frac{\lambda}{x}\cdot\frac{\lambda^{x-1}}{(x-1)!} > \frac{\lambda^{x-1}}{(x-1)!};$$
that is, the probability increases as $x$ increases toward $\lfloor\lambda\rfloor$. Conversely, for $x > \lambda$ we have $\lambda/x < 1$ and
$$\frac{\lambda^x}{x!} = \frac{\lambda}{x}\cdot\frac{\lambda^{x-1}}{(x-1)!} < \frac{\lambda^{x-1}}{(x-1)!};$$
that is, the probability decreases as $x$ increases. Thus, the probability is maximized for integers $x \in [\lfloor\lambda\rfloor, \lambda]$. In short, if $\lambda$ is not an integer, then $x = \lfloor\lambda\rfloor$ is the mode; if $\lambda$ is an integer, then both $\lambda$ and $\lambda - 1$ are modes.
(d) We denote the two distributions
$$P(x|\omega_i) = e^{-\lambda_i}\frac{\lambda_i^x}{x!}, \qquad i = 1, 2,$$
with $\lambda_1 < \lambda_2$ and equal priors. The Bayes rule is then: choose $\omega_1$ if
$$\left(\frac{\lambda_1}{\lambda_2}\right)^x e^{\lambda_2-\lambda_1} > 1,$$
or equivalently if
$$x < x^* = \frac{\lambda_2-\lambda_1}{\ln[\lambda_2] - \ln[\lambda_1]};$$
otherwise choose $\omega_2$, as illustrated in the figure (where the actual $x$ values are discrete).

[Figure: the two Poisson distributions $P(x|\omega_i)$ and the decision point $x^*$.]

(e) The conditional error is
$$P(\text{error}|x) = \min\left[e^{-\lambda_1}\frac{\lambda_1^x}{x!},\ e^{-\lambda_2}\frac{\lambda_2^x}{x!}\right],$$
and thus the Bayes error is
$$P_B(\text{error}) = \sum_{x=0}^{\lfloor x^*\rfloor}e^{-\lambda_2}\frac{\lambda_2^x}{x!} + \sum_{x=\lceil x^*\rceil}^{\infty}e^{-\lambda_1}\frac{\lambda_1^x}{x!}.$$
48. The class-conditional densities are Gaussian,
$$p(x|\omega_i) = \frac{1}{2\pi|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu_i)^t\Sigma_i^{-1}(x-\mu_i)\right].$$
(a) By direct calculation using the densities stated in the problem, we find that for $x = (0.3, 0.3)^t$: $p(x|\omega_1)P(\omega_1) = 0.04849$, $p(x|\omega_2)P(\omega_2) = 0.03250$ and $p(x|\omega_3)P(\omega_3) = 0.04437$, and thus the pattern should be classified as category $\omega_1$.

(b) To classify $(\ast, 0.3)^t$, that is, a vector whose first component is missing and whose second component is 0.3, we need to marginalize over the unknown feature. Thus we compute numerically
$$P(\omega_i)\,p\!\left(\begin{pmatrix}\ast\\0.3\end{pmatrix}\Big|\,\omega_i\right) = P(\omega_i)\int p\!\left(\begin{pmatrix}x\\0.3\end{pmatrix}\Big|\,\omega_i\right)dx.$$
We are given that the true features, $x_t$, are corrupted by Gaussian noise to give us the measured bad data, that is,
$$p(x_b|x_t) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x_t-\mu_i)^t\Sigma^{-1}(x_t-\mu_i)\right].$$
We drop the needless subscript $i$ indicating category membership and substitute this Gaussian form into the above. The constant terms, $(2\pi)^{d/2}|\Sigma|^{1/2}$, cancel, and we then find that the probability of the category given the good and the measured bad data is
$$P(\omega|x_g, x_b) = \frac{\int g_\omega(x)\,p(x)\exp\left[-\frac{1}{2}(x_t-\mu)^t\Sigma^{-1}(x_t-\mu)\right]dx_t}{\int p(x)\exp\left[-\frac{1}{2}(x_t-\mu)^t\Sigma^{-1}(x_t-\mu)\right]dx_t}.$$
After integration, this gives us the final answer,
$$P(\omega|x_g, x_b) = \frac{P(\omega)\,p(x_g, x_b|\omega)}{p(x_g, x_b)}.$$
= P (a4 ) = 0.5
P (a2 )
= P (a3 ) = 0
P (b1 )
P (b2 )
P (d1 )
P (d2 )
1.
0 0.4 + 1 0.6
0.6.
$$P(x_1|e) = \frac{0.913\times 0.6}{0.913\times 0.6 + 0.087\times 0.05} = 0.992, \qquad P(x_2|e) = \frac{0.087\times 0.05}{0.913\times 0.6 + 0.087\times 0.05} = 0.008.$$
Thus, given all the evidence $e$ throughout the belief net, the most probable outcome is $x_1$, that is, salmon. The expected error is the probability of finding a sea bass, that is, 0.008.
(b) Here we are given that the fish is thin and of medium lightness, which implies
$$P(e_D|d_1) = 0, \quad P(e_D|d_2) = 1, \quad P(e_C|c_1) = 0, \quad P(e_C|c_2) = 1, \quad P(e_C|c_3) = 0.$$
We then have $P_C(x_1) \propto P(e_C|x_1)P(e_D|x_1) = 0.198$. Likewise we have
$$P_C(x_2) \propto P(e_C|x_2)P(e_D|x_2) = \left[P(e_C|c_1)P(c_1|x_2) + P(e_C|c_2)P(c_2|x_2) + P(e_C|c_3)P(c_3|x_2)\right]$$
$$\times\left[P(e_D|d_1)P(d_1|x_2) + P(e_D|d_2)P(d_2|x_2)\right] = [0 + 1\times 0.1 + 0][0 + 1\times 0.05] = 0.005.$$
We normalize and find
$$P(x_1|e_{C,D}) = \frac{0.198}{0.198 + 0.005} = 0.975, \qquad P(x_2|e_{C,D}) = \frac{0.005}{0.198 + 0.005} = 0.025.$$
Thus, from the evidence of the children nodes, we classify the fish as $x_1$, that is, salmon.

Now we infer the probability $P(a_i|x_1)$. We have $P(a_1) = P(a_2) = P(a_3) = P(a_4) = 1/4$. Then we normalize and find
$$P(a_1|x_1) = 0.375.$$
We also have
$$P(a_2|x_1) = 0.125, \qquad P(a_3|x_1) = 0.167, \qquad P(a_4|x_1) = 0.333.$$
Thus the most probable season is $a_1$, winter. The probability of being correct is
$$P(x_1|e_{C,D})P(a_1|x_1) = 0.975\times 0.375 = 0.366.$$
(c) The fish is thin and of medium lightness, so from part (b) we have
$$P(x_1|e_{C,D}) = 0.975, \qquad P(x_2|e_{C,D}) = 0.025,$$
and we classify the fish as salmon.

For the fish caught in the north Atlantic, we have $P(b_1|e_B) = 1$ and $P(b_2|e_B) = 0$.
$$P(a_1|x_1, b_1) \propto P(x_1|a_1)P(x_1|b_1)P(a_1)P(b_1) = 0.9\times 0.65\times 0.25 = 0.146$$
$$P(a_2|x_1, b_1) \propto P(x_1|a_2)P(x_1|b_1)P(a_2)P(b_1) = 0.3\times 0.65\times 0.25 = 0.049$$
$$P(a_3|x_1, b_1) \propto P(x_1|a_3)P(x_1|b_1)P(a_3)P(b_1) = 0.4\times 0.65\times 0.25 = 0.065$$
$$P(a_4|x_1, b_1) \propto P(x_1|a_4)P(x_1|b_1)P(a_4)P(b_1) = 0.8\times 0.65\times 0.25 = 0.13.$$
So the most probable season is $a_1$, winter. The probability of being correct is
$$P(x_1|e_{C,D})P(a_1|x_1) = 0.975\times 0.375 = 0.366.$$
51. Problem not yet solved
Section 2.12
52. We have the priors $P(\omega_1) = 1/2$ and $P(\omega_2) = P(\omega_3) = 1/4$, and the Gaussian densities $p(x|\omega_1) \sim N(0, 1)$, $p(x|\omega_2) \sim N(0.5, 1)$, and $p(x|\omega_3) \sim N(1, 1)$. We use the general form of a Gaussian, and by straightforward calculation find:

    x      p(x|w1)    p(x|w2)    p(x|w3)
    0.6    0.333225   0.396953   0.368270
    0.1    0.396953   0.368270   0.266085
    0.9    0.266085   0.368270   0.396953
    1.1    0.217852   0.333225   0.396953
We denote $X = (x_1, x_2, x_3, x_4)$ and $\omega = (\omega(1), \omega(2), \omega(3), \omega(4))$. Using the notation in Section 2.12, we have $n = 4$ and $c = 3$, and thus there are $c^n = 3^4 = 81$ possible values of $\omega$, such as
$$(\omega_1,\omega_1,\omega_1,\omega_1),\ (\omega_1,\omega_1,\omega_1,\omega_2),\ (\omega_1,\omega_1,\omega_1,\omega_3),\ (\omega_1,\omega_1,\omega_2,\omega_1),\ \ldots,\ (\omega_3,\omega_3,\omega_3,\omega_3).$$
For each possible value of $\omega$, we calculate $P(\omega)$ and $p(X|\omega)$ using the following, which assume the independence of the $x_i$ and $\omega(i)$:
$$p(X|\omega) = \prod_{i=1}^4 p(x_i|\omega(i)), \qquad P(\omega) = \prod_{i=1}^4 P(\omega(i)).$$
(a) Here we have $X = (0.6, 0.1, 0.9, 1.1)$ and $\omega = (\omega_1, \omega_3, \omega_3, \omega_2)$. Thus
$$p(X|\omega) = p(0.6|\omega_1)p(0.1|\omega_3)p(0.9|\omega_3)p(1.1|\omega_2) = 0.333225\times 0.266085\times 0.396953\times 0.333225 = 0.01173$$
and
$$P(\omega) = P(\omega_1)P(\omega_3)P(\omega_3)P(\omega_2) = \frac{1}{2}\cdot\frac{1}{4}\cdot\frac{1}{4}\cdot\frac{1}{4} = \frac{1}{128} = 0.0078125.$$
We also have
$$p(X) = \sum_{\omega}p(X|\omega)P(\omega) = 0.012083,$$
where the sum of the 81 terms can be computed by a simple program, or somewhat tediously by hand. According to Eq. 103 in the text, we have
$$P(\omega|X) = \frac{p(X|\omega)P(\omega)}{p(X)} = \frac{0.01173\times 0.0078125}{0.012083} = 0.00758.$$
(b) We follow the procedure in part (a), with the values $X = (0.6, 0.1, 0.9, 1.1)$ and $\omega = (\omega_1, \omega_2, \omega_2, \omega_3)$. We have
$$p(X|\omega) = p(0.6|\omega_1)p(0.1|\omega_2)p(0.9|\omega_2)p(1.1|\omega_3) = 0.333225\times 0.368270\times 0.368270\times 0.396953 = 0.01794.$$
Likewise, we have
$$P(\omega) = P(\omega_1)P(\omega_2)P(\omega_2)P(\omega_3) = \frac{1}{2}\cdot\frac{1}{4}\cdot\frac{1}{4}\cdot\frac{1}{4} = \frac{1}{128} = 0.0078125.$$
Thus
$$P(\omega|X) = \frac{0.01794\times 0.0078125}{0.012083} = 0.0116.$$
(c) Here we have $X = (0.6, 0.1, 0.9, 1.1)$ and $\omega = (\omega(1), \omega(2), \omega(3), \omega(4))$ free. According to Eq. 103 in the text, the sequence $\omega$ that maximizes $p(X|\omega)P(\omega)$ has the maximum probability, since $p(X)$ is fixed for a given observed $X$. With the independence assumptions above, we have
$$p(X|\omega)P(\omega) = p(x_1|\omega(1))p(x_2|\omega(2))p(x_3|\omega(3))p(x_4|\omega(4))P(\omega(1))P(\omega(2))P(\omega(3))P(\omega(4)),$$
so each factor can be maximized separately:
$$\max_i\left[p(0.6|\omega_i)P(\omega_i)\right] = 0.333225\times 0.5 = 0.1666125$$
$$\max_i\left[p(0.1|\omega_i)P(\omega_i)\right] = 0.396953\times 0.5 = 0.1984765$$
$$\max_i\left[p(0.9|\omega_i)P(\omega_i)\right] = 0.266085\times 0.5 = 0.133042$$
$$\max_i\left[p(1.1|\omega_i)P(\omega_i)\right] = 0.217852\times 0.5 = 0.108926,$$
each maximum being achieved by $\omega_1$. Thus the sequence $\omega = (\omega_1, \omega_1, \omega_1, \omega_1)$ has the maximum probability of generating the observed $X = (0.6, 0.1, 0.9, 1.1)$. This maximum probability is
$$\frac{0.1666125\times 0.1984765\times 0.133042\times 0.108926}{0.012083} = 0.03966.$$
Computer Exercises
Section 2.5
1. Computer exercise not yet solved
2.
MATLAB program

load samples.mat;                      % the data
[n,m] = size(samples);
for i=1:3
  mu{i} = mean(samples(:,(i-1)*3+1:i*3))';
  sigma{i} = zeros(3);
  for j=1:n
    sigma{i} = sigma{i} + ...          % the ... continues the line
      (samples(j,(i-1)*3+1:i*3)' - mu{i}) ...
      * (samples(j,(i-1)*3+1:i*3)' - mu{i})';
  end
  sigma{i} = sigma{i}./n;
end
s = [1 2 1; 5 3 2; 0 0 0; 1 0 0]';     % test points as columns
for j=1:size(s,2)
  for i=1:3
    d = sqrt((s(:,j)-mu{i})'*inv(sigma{i})*(s(:,j)-mu{i}));
    fprintf('Mahal. dist. for class %d and point %d: %f\n', i, j, d);
  end
end
pw(1,:) = [1/3 0.8];
pw(2,:) = [1/3 0.1];
pw(3,:) = [1/3 0.1];
for p=1:2
  fprintf('\n\n\n\n');
  for j=1:size(s,2)
    class = 0; max_gi = -1000000;
    for i=1:3
      d_i = (s(:,j)-mu{i})'*inv(sigma{i})*(s(:,j)-mu{i});
      g_i = -0.5*d_i - 1.5*log(2*pi) - 0.5*log(det(sigma{i})) + log(pw(i,p));
      if g_i > max_gi,
        max_gi = g_i;
        class = i;
      end
    end
    fprintf('Point %d classified in category %d\n', j, class);
  end
end
Output

Mahal. dist. for class 1 and point 2: 1.641368
Mahal. dist. for class 2 and point 2: 1.850650
Mahal. dist. for class 3 and point 2: 0.682007
Mahal. dist. for class 1 and point 3: 0.516465
Mahal. dist. for class 2 and point 3: 0.282953
Mahal. dist. for class 3 and point 3: 2.362750
Mahal. dist. for class 1 and point 4: 0.513593
Mahal. dist. for class 2 and point 4: 0.476275
Mahal. dist. for class 3 and point 4: 1.541438

Point 1 classified in category 2
Point 2 classified in category 3
Point 3 classified in category 1
Point 4 classified in category 1

Point 1 classified in category 1
Point 2 classified in category 1
Point 3 classified in category 1
Point 4 classified in category 1
Chapter 3
Problem Solutions
Section 3.2
1. Our exponential density is:
$$p(x|\theta) = \begin{cases}\theta e^{-\theta x} & x \geq 0\\ 0 & \text{otherwise.}\end{cases}$$

(a) See figure. Note that $p(x = 2|\theta)$ is not maximized when $\theta = 2$ but instead for a value less than 1.0.

[Figure: the density $p(x|\theta=1)$ versus $x$, and the likelihood $p(x=2|\theta)$ versus $\theta$.]

(b) The log-likelihood is
$$l(\theta) = \sum_{k=1}^n\ln p(x_k|\theta) = \sum_{k=1}^n\left[\ln\theta - \theta x_k\right] = n\ln\theta - \theta\sum_{k=1}^n x_k.$$
We solve $\nabla_\theta l(\theta) = 0$ to find $\hat\theta$:
$$\nabla_\theta l(\theta) = \frac{\partial}{\partial\theta}\left[n\ln\theta - \theta\sum_{k=1}^n x_k\right] = \frac{n}{\theta} - \sum_{k=1}^n x_k = 0,$$
and thus
$$\hat\theta = \frac{n}{\sum_{k=1}^n x_k} = \frac{1}{\frac{1}{n}\sum_{k=1}^n x_k}.$$

(c) The expected value of $x$ is given by the integral $\int_0^\infty x\,p(x|\theta)\,dx = \int_0^\infty x\,\theta e^{-\theta x}\,dx = 1/\theta$; since the sample mean here is 1, we put these results together and see that $\hat\theta = 1$, as shown on the figure in part (a).
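A quick check that the reciprocal of the sample mean maximizes the exponential log-likelihood (the data set below is illustrative):

```python
import math

# Maximum-likelihood fit of p(x|theta) = theta * exp(-theta * x):
# the MLE is the reciprocal of the sample mean.
data = [0.2, 1.3, 0.7, 2.1, 0.7]
theta_hat = 1.0 / (sum(data) / len(data))

def log_lik(theta):
    return sum(math.log(theta) - theta * x for x in data)

# The closed-form MLE beats nearby values of theta.
for theta in [theta_hat * 0.9, theta_hat * 1.1]:
    assert log_lik(theta_hat) > log_lik(theta)
```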
2. Our (normalized) distribution function is
$$p(x|\theta) = \begin{cases}1/\theta & 0 \leq x \leq \theta\\ 0 & \text{otherwise.}\end{cases}$$

(a) We will use the notation of an indicator function $I(\cdot)$, whose value is equal to 1.0 if the logical value of its argument is true, and 0.0 otherwise. We can write the likelihood function using $I(\cdot)$ as
$$p(D|\theta) = \prod_{k=1}^n p(x_k|\theta) = \prod_{k=1}^n\frac{1}{\theta}I(0 \leq x_k \leq \theta) = \frac{1}{\theta^n}\,I\!\left(\theta \geq \max_k x_k\right)I\!\left(\min_k x_k \geq 0\right).$$
[Figure: the likelihood $p(D|\theta)$, which is zero for $\theta < \max_k x_k$ and decays as $1/\theta^n$ thereafter, so it is maximized at $\hat\theta = \max_k x_k$.]
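A quick check that the likelihood is maximized at the largest sample (the data set below is illustrative):

```python
# For the uniform density p(x|theta) = 1/theta on [0, theta], the likelihood
# is theta^{-n} for theta >= max_k x_k and 0 otherwise, so it is maximized
# at theta = max_k x_k.
data = [0.31, 0.74, 0.55, 0.12, 0.68]

def likelihood(theta):
    if theta >= max(data) and min(data) >= 0:
        return theta ** (-len(data))
    return 0.0

theta_hat = max(data)                  # = 0.74
grid = [0.01 * i for i in range(1, 201)]
best = max(grid, key=likelihood)       # grid search agrees with max(data)
assert abs(best - theta_hat) < 0.01
assert likelihood(theta_hat) >= likelihood(best)
```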
3. The likelihood of the indicator variables is
$$P\left(z_{i1},\ldots,z_{in}|P(\omega_i)\right) = \prod_{k=1}^n P\left(z_{ik}|P(\omega_i)\right) = \prod_{k=1}^n\left[P(\omega_i)\right]^{z_{ik}}\left[1-P(\omega_i)\right]^{1-z_{ik}},$$
and thus the log-likelihood is
$$\ln P\left(z_{i1},\ldots,z_{in}|P(\omega_i)\right) = \sum_{k=1}^n\left[z_{ik}\ln P(\omega_i) + (1-z_{ik})\ln\left(1-P(\omega_i)\right)\right].$$
We take the derivative with respect to $P(\omega_i)$, set it to zero, and rearrange:
$$\frac{1}{P(\omega_i)}\sum_{k=1}^n z_{ik} - \frac{1}{1-P(\omega_i)}\sum_{k=1}^n(1-z_{ik}) = 0,$$
which gives
$$\left[1-P(\omega_i)\right]\sum_{k=1}^n z_{ik} = P(\omega_i)\sum_{k=1}^n(1-z_{ik}) = nP(\omega_i) - P(\omega_i)\sum_{k=1}^n z_{ik},$$
and thus
$$\hat P(\omega_i) = \frac{1}{n}\sum_{k=1}^n z_{ik}.$$
That is, the estimate of the probability of category $\omega_i$ is merely the probability of obtaining its indicator value in the training data, just as we would expect.
4. We have $n$ samples $\{x_1,\ldots,x_n\}$ from the discrete distribution
$$P(x|\boldsymbol\theta) = \prod_{i=1}^d\theta_i^{x_i}(1-\theta_i)^{1-x_i}.$$
The likelihood for the full sample is
$$P(x_1,\ldots,x_n|\boldsymbol\theta) = \prod_{k=1}^n\prod_{i=1}^d\theta_i^{x_{ki}}(1-\theta_i)^{1-x_{ki}},$$
and the log-likelihood is
$$l(\boldsymbol\theta) = \sum_{k=1}^n\sum_{i=1}^d\left[x_{ki}\ln\theta_i + (1-x_{ki})\ln(1-\theta_i)\right].$$
To find the maximum of $l(\boldsymbol\theta)$, we set $\nabla_{\boldsymbol\theta}l(\boldsymbol\theta) = 0$ and evaluate component by component ($i = 1,\ldots,d$):
$$\left[\nabla_{\boldsymbol\theta}l(\boldsymbol\theta)\right]_i = \frac{1}{\theta_i}\sum_{k=1}^n x_{ki} - \frac{1}{1-\theta_i}\sum_{k=1}^n(1-x_{ki}) = 0.$$
Solving for $\theta_i$ gives
$$\hat\theta_i = \frac{1}{n}\sum_{k=1}^n x_{ki}.$$
Since this result is valid for all $i = 1,\ldots,d$, we can write this last equation in vector form as
$$\hat{\boldsymbol\theta} = \frac{1}{n}\sum_{k=1}^n x_k.$$
Thus the maximum-likelihood value of $\boldsymbol\theta$ is merely the sample mean, just as we would expect.
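A minimal numeric check that the sample mean maximizes the Bernoulli log-likelihood (the small binary data set is illustrative):

```python
import math

# MLE for the multivariate Bernoulli: theta_hat is the sample mean.
data = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 1)]   # n = 4, d = 3
n, d = len(data), len(data[0])
theta_hat = [sum(x[i] for x in data) / n for i in range(d)]

def log_lik(theta):
    return sum(x[i] * math.log(theta[i]) + (1 - x[i]) * math.log(1 - theta[i])
               for x in data for i in range(d))

# Perturbing any component away from the sample mean lowers the likelihood.
base = log_lik(theta_hat)
for i in range(d):
    for eps in (-0.05, 0.05):
        t = list(theta_hat); t[i] += eps
        assert log_lik(t) < base
```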
5. The probability of finding feature $x_i$ to be 1.0 in category $\omega_1$ is denoted $p$:
$$p(x_i = 1|\omega_1) = 1 - p(x_i = 0|\omega_1) = p_{i1} = p > \frac{1}{2},$$
for $i = 1,\ldots,d$. The class-conditional probability of the full vector is
$$p(x|\omega_1) = \prod_{i=1}^d p^{x_i}(1-p)^{1-x_i},$$
with log-likelihood
$$l(p) = \sum_{i=1}^d\left[x_i\ln p + (1-x_i)\ln(1-p)\right].$$
Setting $\partial l/\partial p = 0$ gives
$$\frac{1}{p}\sum_{i=1}^d x_i - \frac{1}{1-p}\left(d - \sum_{i=1}^d x_i\right) = 0,$$
and thus
$$\hat p = \frac{1}{d}\sum_{i=1}^d x_i.$$
The statistic is $T = \frac{1}{d}\sum_{i=1}^d x_i$. Likewise, the variance of $T$, given that we are considering just one class, $\omega_1$, is
$$\text{Var}(T|\omega_1) = \frac{1}{d^2}\sum_{i=1}^d\text{Var}(x_i|\omega_1) = \frac{1}{d^2}\sum_{i=1}^d\left[1^2\cdot p + 0^2\cdot(1-p) - p\cdot p\right] = \frac{p(1-p)}{d},$$
which vanishes as $d \to \infty$.

[Figure: the densities of $T$ for the two categories become increasingly peaked as $d$ grows ($d = 11$ and $d = 111$ shown), so the error at threshold $T^*$ vanishes.]
The $d$-dimensional Gaussian density is
$$p(x|\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^t\Sigma^{-1}(x-\mu)\right].$$
We choose $x_1, x_2, \ldots, x_n$ as independent observations from $p(x|\mu,\Sigma)$. The joint density is
$$p(x_1, x_2, \ldots, x_n|\mu,\Sigma) = \frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^n(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right],$$
with log-likelihood $l(\mu,\Sigma)$.
Setting $\nabla_\mu l = \Sigma^{-1}\sum_{k=1}^n(x_k - \mu) = 0$ gives
$$\hat\mu = \frac{1}{n}\sum_{k=1}^n x_k,$$
as expected. In order to simplify our calculation of $\hat\Sigma$, we temporarily substitute $A = \Sigma^{-1}$, and thus have
$$l(\mu, A) = -\frac{nd}{2}\ln(2\pi) + \frac{n}{2}\ln|A| - \frac{1}{2}\sum_{k=1}^n(x_k-\mu)^t A(x_k-\mu).$$
We take the derivative with respect to $A$,
$$\nabla_A l = \frac{n}{2}A^{-1} - \frac{1}{2}\sum_{k=1}^n(x_k-\mu)(x_k-\mu)^t,$$
set this to zero, and replace $A^{-1}$ by $\hat\Sigma$ and $\mu$ by $\hat\mu$ to find
$$\hat\Sigma = \frac{1}{n}\sum_{k=1}^n(x_k-\hat\mu)(x_k-\hat\mu)^t.$$
[Figure: the true density $p(x|\omega_2)$ (shown for 30000 samples) and the poor model; the Bayes decision boundaries at $x^* = \pm 3.717$ define the regions $R_1$ and $R_2$.]
(d) For the poor model, which has equal variances for the two Gaussians, the only decision boundary possible is a single point, as was illustrated in part (a). The best we can do with the incorrect model is to adjust the mean so as to match the rightmost of the decision boundaries given in part (c), that is, $x^* = 3.717$. This decision point is midway between the means of the two Gaussians, that is, at 0 and at $\mu$. As such, we need to choose $\mu$ such that $(0 + \mu)/2 = 3.717$. This leads directly to the solution, $\mu = 7.43$.
(e) The maximum-likelihood solution in the incorrect model here, $p(x|\omega_2) \sim N(\mu, 1)$, does not yield the minimum classification error. A different value of the parameter (here, $\mu = 7.43$) approximates the Bayes decision boundary better and gives a lower error than the maximum-likelihood solution in the incorrect model. As such, the faulty prior information about the model, which did not permit the true model to be expressed, could not be overcome with even an infinite amount of training data.
8. Consider a case in which the maximum-likelihood solution gives the worst possible classifier.

(a) In this case, the symmetry operation $x \to -x$ takes $p(x|\omega_1) \to p(x|\omega_2)$, and thus we are assured that the estimated distributions have this same symmetry property. For that reason, we are guaranteed that these distributions have the same value at $x = 0$, and thus $x^* = 0$ is a decision boundary. Since the Gaussian estimates must have the same variance, we are assured that there will be only a single intersection, and hence a single decision point, at $x^* = 0$.

(b) See Figure.

(c) We have the estimate of the mean as
[Figure: the estimated densities $p(x|\omega_i)$, each a mixture with weights $1-k = .8$ and $k = .2$, and the decision point $x^*$.]
(e) Thus for the error to approach 1, we must have $k \to 0$, and this, in turn, requires $X$ to become arbitrarily large, that is, $X \to (1-k)/k \to \infty$. In this way, a tiny spike in density sufficiently far away from the origin is sufficient to switch the estimated mean and give error equal to 100%.

(f) There is no substantive difference in the above discussion if we constrain the variances to have a fixed value, since they will always be equal to each other, that is, $\hat\sigma_1^2 = \hat\sigma_2^2$. The equality of variances preserves the property that the decision point remains at $x^* = 0$; it consequently also ensures that for sufficiently small $k$ and large $X$ the error will be 100%.

(g) All the maximum-likelihood methods demand that the optimal solution exists in the model space. If the optimal solution does not lie in the solution space, then the proofs do not hold. This is actually a very strong restriction. Note that obtaining the error-equal-100% solution is not dependent upon limited data, or getting caught in a local minimum; it arises because of the error in assuming that the optimal solution lies in the solution set. This is model error, as described on page 101 of the text.
9. The standard maximum-likelihood solution is
θ̂ = arg max_θ p(x|θ).
If we reparameterize by a monotonic function τ = f(θ), then p̃(x|τ) = p(x|f⁻¹(τ)), and
τ̂ = arg max_τ p(x|f⁻¹(τ)) = f(arg max_θ p(x|θ)) = f(θ̂);
that is, the maximum-likelihood solution is invariant to such a reparameterization.
10. (a) The estimator is unbiased, since E[M(k)] = μ, where M(k) is the first point in data set k drawn from the given distribution.
(b) While the unusual method for estimating the mean may indeed be unbiased, it
will generally have large variance, and this is an undesirable property. Note that
E[(x_i − μ)²] = σ², and thus the RMS error, σ, is independent of n. This undesirable
behavior is quite different from that of the estimate
x̄ = (1/n) Σ_{i=1}^{n} x_i,
where we see
E[(x̄ − μ)²] = E[((1/n) Σ_{i=1}^{n} (x_i − μ))²] = (1/n²) Σ_{i=1}^{n} E[(x_i − μ)²] = σ²/n.
Thus the RMS error, σ/√n, approaches 0 as 1/√n. Note that there are many
superior methods for estimating the mean, for instance the sample mean. (In
Chapter 9 we shall see other techniques, ones based on resampling, such as
the so-called bootstrap and jackknife methods.)
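The variance comparison above is easy to check by simulation; the following is a minimal sketch (the normal distribution N(3, 2²), the sample size n = 25, and the trial count are illustrative assumptions, not values from the problem):

```python
import random

def rms_error(estimator, n_trials=20000, n=25, mu=3.0, sigma=2.0):
    """Root-mean-square error of a mean estimator, averaged over many data sets."""
    se = 0.0
    for _ in range(n_trials):
        data = [random.gauss(mu, sigma) for _ in range(n)]
        se += (estimator(data) - mu) ** 2
    return (se / n_trials) ** 0.5

first_point = lambda d: d[0]             # the "unusual" estimator: ignore all but one sample
sample_mean = lambda d: sum(d) / len(d)  # the usual estimator

random.seed(0)
# RMS error of the first-point estimator stays near sigma regardless of n,
# while the sample mean's RMS error shrinks as sigma / sqrt(n).
print(rms_error(first_point), rms_error(sample_mean))
```

The printed values should be close to σ = 2 and σ/√25 = 0.4, respectively.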
11. We assume p2(x) ≡ p(x|ω2) ~ N(μ, Σ) but that p1(x) ≡ p(x|ω1) is arbitrary.
The Kullback-Leibler divergence from p1(x) to p2(x) is
D_KL(p1, p2) = ∫ p1(x) ln p1(x) dx + (1/2) ∫ p1(x) [d ln(2π) + ln|Σ| + (x − μ)ᵗ Σ⁻¹ (x − μ)] dx,
where we used the fact that p2 is a Gaussian, that is,
p2(x) = (2π)^{−d/2} |Σ|^{−1/2} exp[−(1/2)(x − μ)ᵗ Σ⁻¹ (x − μ)].
We now seek μ and Σ to minimize this divergence. We set the derivative with respect to μ to zero and find
Σ⁻¹ ∫ p1(x)(x − μ) dx = 0,
and this implies
∫ p1(x)(x − μ) dx = 0.
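The condition above says the KL-minimizing Gaussian mean is the mean of p1. A small numerical check (the particular p1, a skewed two-bump density on a grid, is an arbitrary illustrative choice; σ is held fixed):

```python
import math

def kl_to_gaussian(p1, xs, mu, sigma):
    """Discretized KL divergence from p1 (probabilities on grid xs) to N(mu, sigma^2)."""
    dx = xs[1] - xs[0]
    total = 0.0
    for p, x in zip(p1, xs):
        if p > 0:
            q = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
            total += p * math.log(p / (q * dx))
    return total

# An arbitrary non-Gaussian p1 on a grid
xs = [i * 0.01 for i in range(-500, 1000)]
w = [math.exp(-0.5 * (x - 1.0) ** 2) + 0.3 * math.exp(-8 * (x + 2.0) ** 2) for x in xs]
Z = sum(w)
p1 = [v / Z for v in w]
mean_p1 = sum(p * x for p, x in zip(p1, xs))

# Grid-search mu: the KL-minimizing mu coincides with the mean of p1
best_mu = min((kl_to_gaussian(p1, xs, m / 100, 1.5), m / 100) for m in range(-300, 300))[1]
print(mean_p1, best_mu)
```

The two printed values agree up to the grid resolution, as the derivation predicts.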
Consider a vector a = (a_1, ..., a_n)ᵗ and a matrix A = [A_ij] of order n. Expanding the double sum,
Σ_{i=1}^{n} a_i ( Σ_{j=1}^{n} A_{ij} a_j ) = Σ_{i=1}^{n} Σ_{j=1}^{n} A_{ij} a_j a_i = aᵗ A a.
We note that p(x) ~ N(μ, Σ) where μ is known and x_1, ..., x_n are independent observations from p(x). Therefore the likelihood is
p(x_1, ..., x_n|Σ) = Π_{k=1}^{n} (2π)^{−d/2} |Σ|^{−1/2} exp[−(1/2)(x_k − μ)ᵗ Σ⁻¹ (x_k − μ)]
= (2π)^{−nd/2} |Σ|^{−n/2} exp[−(1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ Σ⁻¹ (x_k − μ)],
where we used
Σ_{k=1}^{n} tr(A_k) = tr(Σ_{k=1}^{n} A_k)  and  |Σ⁻¹| = 1/|Σ|.
We define the sample matrix
Â = (1/n) Σ_{k=1}^{n} (x_k − μ)(x_k − μ)ᵗ,
and let λ_1, ..., λ_d be the eigenvalues of A = Σ⁻¹Â, so that
|Σ⁻¹Â| = |Σ⁻¹| |Â| = |Â|/|Σ| = λ_1 λ_2 ··· λ_d.
The likelihood can then be written
p(x_1, ..., x_n|Σ) = ((λ_1 ··· λ_d)^{n/2} / ((2π)^{nd/2} |Â|^{n/2})) exp[−(n/2)(λ_1 + ··· + λ_d)].
(d) The expression for p(x_1, ..., x_n|Σ) in part (c) depends on Σ only through
λ_1, ..., λ_d, the eigenvalues of A = Σ⁻¹Â. We can write our likelihood, then, as
p(x_1, ..., x_n|Σ) = (1/((2π)^{nd/2} |Â|^{n/2})) Π_{i=1}^{d} λ_i^{n/2} e^{−nλ_i/2}.
For the labeled-sample case, the joint density of the samples and labels is
Π_{k=1}^{n} (2π)^{−d/2} |Σ|^{−1/2} exp[−(1/2)(x_k − μ_{l_k})ᵗ Σ⁻¹ (x_k − μ_{l_k})] P(l_k).
The l_i are independent, and thus the probability density of the l's is a product,
p(l_1, ..., l_n) = Π_{k=1}^{n} p(l_k).
Grouping the samples by their labels,
Σ_{k=1}^{n} (x_k − μ_{l_k})ᵗ Σ⁻¹ (x_k − μ_{l_k}) = Σ_{i=1}^{c} Σ_{k: l_k = i} (x_k − μ_i)ᵗ Σ⁻¹ (x_k − μ_i),
so the likelihood as a function of the μ_i factors as
l(μ_i) ∝ Π_{i=1}^{c} P(ω_i)^{n_i} exp[−(1/2) Σ_{k: l_k = i} (x_k − μ_i)ᵗ Σ⁻¹ (x_k − μ_i)].
Maximizing with respect to μ_i gives the class sample mean
μ̂_i = (1/n_i) Σ_{k: l_k = i} x_k,
where n_i = |{x_k : l_k = i}| is the number of samples labeled ω_i, and the maximum-likelihood covariance is
Σ̂ = (1/n) Σ_{k=1}^{n} (x_k − μ̂_{l_k})(x_k − μ̂_{l_k})ᵗ.
15. Consider the problem of learning the mean of a univariate normal distribution.
(a) From Eqs. 34 and 35 in the text, we have
μ_n = (nσ_o²/(nσ_o² + σ²)) m_n + (σ²/(nσ_o² + σ²)) μ_o
and
σ_n² = σ_o² σ²/(nσ_o² + σ²),
where m_n = (1/n) Σ_{k=1}^{n} x_k is the sample mean of the new observations.
In particular, with the choices
μ_o = (1/n_o) Σ_{k=−n_o+1}^{0} x_k  and  σ_o² = σ²/n_o,
we have
μ_n = (σ²/σ_o²) μ_o/(n + σ²/σ_o²) + (1/(n + σ²/σ_o²)) Σ_{k=1}^{n} x_k
= (1/(n + n_o)) [ n_o μ_o + Σ_{k=1}^{n} x_k ]
= (1/(n + n_o)) Σ_{k=−n_o+1}^{n} x_k.
Likewise, we have
σ_n² = σ² σ_o²/(nσ_o² + σ²) = σ²/(n + σ²/σ_o²) = σ²/(n + n_o).
(b) The result of part (a) can be interpreted as follows: for a suitable choice of
the prior density p(μ) ~ N(μ_o, σ_o²), maximum-likelihood inference on the full
sample of n + n_o observations coincides with Bayesian inference on the second
sample of n observations. Thus, by suitable choice of prior, Bayesian learning
can be interpreted as maximum-likelihood learning, and here the suitable choice
of prior in Bayesian learning is
μ_o = (1/n_o) Σ_{k=−n_o+1}^{0} x_k,   σ_o² = σ²/n_o.
Here μ_o is the sample mean of the first n_o observations and σ_o² is the variance
based on those n_o observations.
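The equivalence in part (b) can be verified numerically; a minimal sketch under the stated choice of prior (the data values, batch sizes, and σ² = 4 are arbitrary illustrative assumptions):

```python
import random

def bayes_posterior_mean(xs_new, mu_o, sigma_o2, sigma2):
    """Posterior mean mu_n for a N(mu, sigma2) likelihood with N(mu_o, sigma_o2) prior."""
    n = len(xs_new)
    m_n = sum(xs_new) / n
    return (n * sigma_o2 * m_n + sigma2 * mu_o) / (n * sigma_o2 + sigma2)

random.seed(1)
sigma2 = 4.0
first = [random.gauss(0.5, 2.0) for _ in range(10)]   # the n_o "prior" observations
second = [random.gauss(0.5, 2.0) for _ in range(30)]  # the n new observations

n_o = len(first)
mu_o = sum(first) / n_o          # prior mean = sample mean of first batch
sigma_o2 = sigma2 / n_o          # prior variance = sigma^2 / n_o

pooled_ml = sum(first + second) / len(first + second)
posterior = bayes_posterior_mean(second, mu_o, sigma_o2, sigma2)
print(pooled_ml, posterior)   # equal up to round-off
```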
16. We assume that A and B are nonsingular matrices of the same order.
(a) Consider Eq. 44 in the text. We write
A(A + B)⁻¹B = [B⁻¹(A + B)A⁻¹]⁻¹ = [B⁻¹ + A⁻¹]⁻¹ = (A⁻¹ + B⁻¹)⁻¹.
We interchange the roles of A and B in this equation to get our desired answer:
B(A + B)⁻¹A = (A⁻¹ + B⁻¹)⁻¹.
(b) Recall Eqs. 41 and 42 in the text:
Σ_n⁻¹ = n Σ⁻¹ + Σ_o⁻¹
Σ_n⁻¹ μ_n = n Σ⁻¹ m_n + Σ_o⁻¹ μ_o.
We have solutions
μ_n = Σ_o (Σ_o + (1/n)Σ)⁻¹ m_n + (1/n) Σ (Σ_o + (1/n)Σ)⁻¹ μ_o
and
Σ_n = Σ_o (Σ_o + (1/n)Σ)⁻¹ (1/n) Σ.
We apply the result of part (a) with A = (1/n)Σ and B = Σ_o to get
(n Σ⁻¹ + Σ_o⁻¹)⁻¹ = (1/n) Σ ((1/n)Σ + Σ_o)⁻¹ Σ_o = Σ_o (Σ_o + (1/n)Σ)⁻¹ (1/n) Σ,
which proves Eqs. 41 and 42 in the text. We also compute the mean as
μ_n = Σ_n (n Σ⁻¹ m_n + Σ_o⁻¹ μ_o)
= Σ_n n Σ⁻¹ m_n + Σ_n Σ_o⁻¹ μ_o
= Σ_o (Σ_o + (1/n)Σ)⁻¹ m_n + (1/n) Σ (Σ_o + (1/n)Σ)⁻¹ μ_o.
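The matrix identity of part (a) can be confirmed numerically; a minimal sketch using random symmetric positive-definite matrices (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """A random symmetric positive-definite matrix (hence nonsingular)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

A, B = random_spd(4), random_spd(4)
inv = np.linalg.inv

lhs = A @ inv(A + B) @ B
rhs = inv(inv(A) + inv(B))
assert np.allclose(lhs, rhs)
# Interchanging A and B gives the same right-hand side:
assert np.allclose(B @ inv(A + B) @ A, rhs)
print("identity verified")
```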
Section 3.5
17. The Bernoulli distribution is written
p(x|θ) = Π_{i=1}^{d} θ_i^{x_i} (1 − θ_i)^{1 − x_i}.
(a) The likelihood of the data set D = {x_1, ..., x_n} is
P(D|θ) = Π_{k=1}^{n} P(x_k|θ)   (the x_k are independent)
= Π_{k=1}^{n} Π_{i=1}^{d} θ_i^{x_{ki}} (1 − θ_i)^{1 − x_{ki}}
= Π_{i=1}^{d} θ_i^{Σ_k x_{ki}} (1 − θ_i)^{Σ_k (1 − x_{ki})}
= Π_{i=1}^{d} θ_i^{s_i} (1 − θ_i)^{n − s_i},
where s_i = Σ_{k=1}^{n} x_{ki}.
(b) We assume an (unnormalized) uniform prior for θ, that is, p(θ) = 1 for 0 ≤
θ_i ≤ 1, i = 1, ..., d, and have by Bayes' Theorem
p(θ|D) = p(D|θ) p(θ) / p(D).
The normalizing factor is
p(D) = ∫ p(D|θ) p(θ) dθ = ∫₀¹ ··· ∫₀¹ Π_{i=1}^{d} θ_i^{s_i} (1 − θ_i)^{n − s_i} dθ_1 dθ_2 ··· dθ_d
= Π_{i=1}^{d} ∫₀¹ θ_i^{s_i} (1 − θ_i)^{n − s_i} dθ_i.
Now s_i = Σ_{k=1}^{n} x_{ki}, and using the identity
∫₀¹ θ^m (1 − θ)^n dθ = m! n! / (m + n + 1)!,
we find
p(D) = Π_{i=1}^{d} s_i!(n − s_i)! / (n + 1)!.
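The Beta-integral identity used for p(D) can be checked directly; a minimal sketch using a simple midpoint-rule quadrature (the test exponents are arbitrary):

```python
from math import factorial

def beta_integral(m, n, steps=200000):
    """Numerically integrate theta^m (1 - theta)^n over [0, 1] (midpoint rule)."""
    h = 1.0 / steps
    return sum(((i + 0.5) * h) ** m * (1 - (i + 0.5) * h) ** n for i in range(steps)) * h

for m, n in [(0, 0), (2, 3), (5, 1)]:
    exact = factorial(m) * factorial(n) / factorial(m + n + 1)
    assert abs(beta_integral(m, n) - exact) < 1e-6
print("Beta integral identity checked")
```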
The posterior is then
p(θ|D) = p(D|θ) p(θ) / p(D) = Π_{i=1}^{d} [(n + 1)! / (s_i!(n − s_i)!)] θ_i^{s_i} (1 − θ_i)^{n − s_i}.
For the case d = 1 and n = 1, this reduces to
p(θ_1|D) = [2! / (s_1!(1 − s_1)!)] θ_1^{s_1} (1 − θ_1)^{1 − s_1}.
Note that s_1 takes the discrete values 0 and 1. Thus the densities are of the
form
s_1 = 0:  p(θ_1|D) = 2(1 − θ_1)
s_1 = 1:  p(θ_1|D) = 2θ_1,
[Figure: the posterior densities p(θ_1|D) = 2(1 − θ_1) for s_1 = 0 and p(θ_1|D) = 2θ_1 for s_1 = 1, plotted over 0 ≤ θ_1 ≤ 1.]
18. Consider how knowledge of an invariance can guide our choice of priors.
(a) We are given that s is the number of times that x = 1 in the
first n tests. Consider the (n + 1)st test. If again x = 1, then there are C(n+1, s+1) permutations
of 0s and 1s in the (n + 1) tests in which the number of 1s is (s + 1). Given the
assumption of invariance of exchangeability (that is, all permutations have the
same chance to appear), the probability of each permutation is
P_instance = 1 / C(n+1, s+1).
Therefore, the probability of x = 1 after n tests is the product of two probabilities: one is the probability of having (s + 1) 1s, and the other is the
probability for a particular instance with (s + 1) 1s, that is,
Pr[x_{n+1} = 1 | D^n] = Pr[x_1 + ··· + x_n = s + 1] · P_instance = p(s + 1) / C(n+1, s+1).
(b) The relevant ratio is
[(s + 1)!(n + 1 − s − 1)!] / [s!(n + 1 − s)!] = [s!(s + 1)(n + 1 − s − 1)!] / [s!(n + 1 − s − 1)!(n + 1 − s)] = (s + 1)/(n + 1 − s).
We can see that for some n, the ratio will be small if s is small, and that the ratio
will be large if s is large. This implies that with n increasing, if x is unlikely
to be 1 in the first n tests (i.e., small s), it would be unlikely to be 1 in the
(n + 1)st test, and thus s remains small. On the contrary, if there are more 1s
than 0s in the first n tests (i.e., large s), it is very likely that in the (n + 1)st
test the x remains 1, and thus s remains large.
(c) If p(θ) ~ U(0, 1) for some n, then p(θ) is a constant, which we call c. Thus
p(s) = ∫₀¹ C(n, s) θ^s (1 − θ)^{n−s} p(θ) dθ
= c C(n, s) ∫₀¹ θ^s (1 − θ)^{n−s} dθ
= c C(n, s) s!(n − s)!/(n + 1)! = c/(n + 1),
which is independent of s.
The log-likelihood of the sample is
l(μ) = ln Π_{k=1}^{n} p(x_k|μ) = Σ_{k=1}^{n} ln p(x_k|μ)
= −(n/2) ln[(2π)^d |Σ|] − (1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ Σ⁻¹ (x_k − μ),
and the prior is
p(μ) = (2π)^{−d/2} |Σ_0|^{−1/2} exp[−(1/2)(μ − m_0)ᵗ Σ_0⁻¹ (μ − m_0)].
The quantity to be maximized is thus the product
exp[ −(n/2) ln[(2π)^d |Σ|] − (1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ Σ⁻¹ (x_k − μ) ] × (2π)^{−d/2} |Σ_0|^{−1/2} exp[−(1/2)(μ − m_0)ᵗ Σ_0⁻¹ (μ − m_0)].
(b) After the linear transform governed by the matrix A, we have
μ′ = E[x′] = E[Ax] = A E[x] = Aμ,
and
Σ′ = E[(x′ − μ′)(x′ − μ′)ᵗ]
= E[(Ax − Aμ)(Ax − Aμ)ᵗ]
= E[A(x − μ)(x − μ)ᵗ Aᵗ]
= A E[(x − μ)(x − μ)ᵗ] Aᵗ
= AΣAᵗ.
The log-likelihood in the transformed coordinates is
l(μ′) = ln Π_{k=1}^{n} p(Ax_k|Aμ) = Σ_{k=1}^{n} ln p(Ax_k|Aμ)
= −(n/2) ln[(2π)^d |AΣAᵗ|] − (1/2) Σ_{k=1}^{n} (Ax_k − Aμ)ᵗ (AΣAᵗ)⁻¹ (Ax_k − Aμ)
= −(n/2) ln[(2π)^d |AΣAᵗ|] − (1/2) Σ_{k=1}^{n} ((x_k − μ)ᵗ Aᵗ)((A⁻¹)ᵗ Σ⁻¹ A⁻¹)(A(x_k − μ))
= −(n/2) ln[(2π)^d |AΣAᵗ|] − (1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ (Aᵗ(A⁻¹)ᵗ) Σ⁻¹ (A⁻¹A)(x_k − μ)
= −(n/2) ln[(2π)^d |AΣAᵗ|] − (1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ Σ⁻¹ (x_k − μ).
We compare this to
exp[ −(1/2) Σ_{k=1}^{n} (x_k − μ)ᵗ Σ⁻¹ (x_k − μ) ] × (2π)^{−d/2} |Σ_0|^{−1/2} exp[−(1/2)(μ − m_0)ᵗ Σ_0⁻¹ (μ − m_0)]
and see that the two equations are the same, up to a constant.
Therefore the estimator gives the appropriate estimate for the transformed mean μ′.
df 1 ()
d
dln()
= p(
)
d
c
=
.
= p(f 1 ())
p( )
1
1
=
2 0
2
1/ .
In the limit of recursive Bayes learning we have
lim_{n→∞} p(θ|D^n) = lim_{n→∞} [ p(x_n|θ) p(θ|D^{n−1}) / ∫ p(x_n|θ) p(θ|D^{n−1}) dθ ] = lim_{n→∞} p(θ|D^{n−1}).
We assume lim_{n→∞} |p(θ|D^n) − p(θ)| = 0 and lim_{n→∞} x_n → x*. Then the above equation implies
p(x*|θ) = lim_{n→∞} p(x_n|θ),
and hence
lim_{n→∞} ∫ p(x_n|θ) p(θ) dθ = ∫ lim_{n→∞} p(x_n|θ) p(θ) dθ = ∫ p(x*|θ) p(θ) dθ = p(x*).
In summary, we have the conditions:
lim_{n→∞} |p(θ|D^n) − p(θ)| = 0
lim_{n→∞} x_n → x*.
For 1 ≤ x ≤ 3 we have
p(x|ω2) = ∫ p(x|ω2, θ) p(θ) dθ = ∫_{x−1}^{2} (1/2)(1/2) dθ = (3 − x)/4,
and for −1 ≤ x ≤ 1,
p(x|ω2) = ∫ p(x|ω2, θ) p(θ) dθ = ∫_{0}^{x+1} (1/2)(1/2) dθ = (x + 1)/4.
Putting these together,
p(x|ω2) = { 0 for x < −1; (x + 1)/4 for −1 ≤ x ≤ 1; (3 − x)/4 for 1 ≤ x ≤ 3; 0 for x > 3 }.
The error is
P(error) = ∫_{−∞}^{x*} p(x|ω2)P(ω2) dx + ∫_{x*}^{∞} p(x|ω1)P(ω1) dx = (24P³ + 8P²)/(3P + 1)².
[Figure: the class densities p(x|ω_i)P(ω_i) and the resulting error region.]
(d) Again, we are considering p(x|ω_1)P(ω_1) and p(x|ω_2, θ)P(ω_2), but this time we
first use a single value of θ as the true one, that is, p(θ) = 1. There are two
cases for different relationships between P and θ.
Case 1: p(x|ω_1)P(ω_1) ≤ p(x|ω_2, θ)P(ω_2), which implies P ≤ (1/2)(1 − P), or 0 ≤ P ≤ 1/3. This case can be further divided into two subcases:
Subcase 1: 0 ≤ θ ≤ 1.
The error is
P_1(error) = P − P/6 = 5P/6.
Subcase 2: 1 < θ ≤ 2.
A similar integration gives P_2(error) = P/6 plus the contribution from the overlap region, and summing the two subcases yields
P_Gibbs(error) = P_1(error) + P_2(error) = (1 − P)²(31P − 7)/(48P²).
[Figure: p(x|ω_i)P(ω_i) and the error region for this case.]
P2 (error)
1P
4
9
1
2
2
2P
d
(1P )/(2P )
P
(2 )2 d
2
=
(5P 1)/(2P )
P
6
1P
2P
3
.
(e) It can be easily confirmed that P_Gibbs(error) < 2P_Bayes(error).
Section 3.6
23. Let s be a sufficient statistic for which p(s|θ, D) = p(s|θ); we assume p(s|θ) ≠ 0.
In that case, we can write Bayes' law
p(θ|s, D) = p(θ, s, D)/p(s, D) = p(s|θ, D) p(θ|D) / p(s|D).
Note that the probability density of the parameter is fully specified by the sufficient
statistic; the data gives no further information, and this implies
p(s|θ, D) = p(s|θ).
Since p(s|θ) ≠ 0, we can write
p(θ|s, D) = p(s|θ) p(θ) / p(s) = p(θ|s),
which shows that the posterior depends on the data only through s.
24. The sufficient statistic is s = (1/n) Σ_{k=1}^{n} x_k, with
[g(s, θ)]^{1/n} = θ e^{−θs}.
Setting the derivative to zero,
∂[g(s, θ)]^{1/n}/∂θ = e^{−θs} − sθe^{−θs} = 0,
gives θ̂ = 1/s.
We next evaluate the second derivative at this value of θ to see if the solution represents
a maximum, a minimum, or possibly an inflection point:
∂²[g(s, θ)]^{1/n}/∂θ² = −se^{−θs} − se^{−θs} + s²θe^{−θs} = e^{−θs}(s²θ − 2s),
which at θ̂ = 1/s equals e^{−1}(s − 2s) = −se^{−1} < 0.
25. Here
[g(s, θ)]^{1/n} = θ^{3/2} e^{−θs},  where s = (1/n) Σ_{k=1}^{n} x_k².
Setting the derivative to zero,
∂[g(s, θ)]^{1/n}/∂θ = (3/2)θ^{1/2}e^{−θs} − sθ^{3/2}e^{−θs} = 0,
gives θ̂ = 3/(2s).
We next evaluate the second derivative at this value of θ to see if the solution represents
a maximum, a minimum, or possibly an inflection point:
∂²[g(s, θ)]^{1/n}/∂θ² = (3/4)θ^{−1/2}e^{−θs} − (3/2)sθ^{1/2}e^{−θs} − (3/2)sθ^{1/2}e^{−θs} + s²θ^{3/2}e^{−θs}
= (3/4)θ^{−1/2}e^{−θs} − 3sθ^{1/2}e^{−θs} + s²θ^{3/2}e^{−θs}.
At θ̂ = 3/(2s) this becomes
[ (3/4)(2s/3)^{1/2} − 3s(3/(2s))^{1/2} + s²(3/(2s))^{3/2} ] e^{−3/2} = −(3/2)^{1/2} s^{1/2} e^{−3/2} < 0,
so the solution is indeed a maximum.
26. Here we have
[g(s, θ)]^{1/n} = Π_{i=1}^{d} θ_i^{s_i},
where
s = (s_1, ..., s_d)ᵗ = (1/n) Σ_{k=1}^{n} x_k.
Our goal is to maximize [g(s, θ)]^{1/n} with respect to θ over the set 0 < θ_i < 1, i =
1, ..., d, subject to the constraint Σ_{i=1}^{d} θ_i = 1. We set up the objective function
l(θ, s; λ) = ln [g(s, θ)]^{1/n} + λ ( Σ_{i=1}^{d} θ_i − 1 ) = Σ_{i=1}^{d} s_i ln θ_i + λ ( Σ_{i=1}^{d} θ_i − 1 ).
The derivative is
∂l/∂θ_i = s_i/θ_i + λ = 0,
so θ_i = −s_i/λ. We impose the normalization Σ_{i=1}^{d} θ_i = 1 to find λ, which leads to our final
solution
θ̂_j = s_j / Σ_{i=1}^{d} s_i   for j = 1, ..., d.
27. Consider the notion that sufficiency is an integral concept.
(a) Using Eq. 51 in the text, we have
p(D|θ) = Π_{k=1}^{n} p(x|θ)
= Π_{k=1}^{n} (1/√(2πσ²)) exp[ −(1/(2σ²))(x_k − μ)² ]
= (2πσ²)^{−n/2} exp[ −(1/(2σ²)) Σ_{k=1}^{n} (x_k² − 2μx_k + μ²) ]
= (2πσ²)^{−n/2} exp[ −(1/(2σ²)) ( Σ_{k=1}^{n} x_k² − 2μ Σ_{k=1}^{n} x_k + nμ² ) ],
which, with s_1 = (1/n)Σ_k x_k and s_2 = (1/n)Σ_k x_k², exhibits the desired factorization.
and
h(D, σ²) = exp[ −(1/(2σ²)) n s_2 ].
In the general case, g(s_1, μ, σ²) and h(D, σ²) are dependent on σ², that is,
there is no way to have p(D|μ, σ²) = g(s_1, μ)h(D). Thus according to Theorem 3.1
in the text, s_1 is not sufficient for μ. However, if σ² is known, then g(s_1, μ, σ²) =
g(s_1, μ) and h(D, σ²) = h(D). Then we indeed have p(D|μ) = g(s_1, μ)h(D), and
thus s_1 is sufficient for μ.
(c) As in part (b), we let
g(s_2, μ, σ²) = (2πσ²)^{−n/2} exp[ −(1/(2σ²))(ns_2 − 2nμs_1 + nμ²) ]
and h(D) = 1. In the general case, g(s_2, μ, σ²) is dependent upon μ, that
is, there is no way to have p(D|μ, σ²) = g(s_2, σ²)h(D). Thus according to
Theorem 3.1 in the text, s_2 is not sufficient for σ². However, if μ is known, then
g(s_2, μ, σ²) = g(s_2, σ²). Then we have p(D|σ²) = g(s_2, σ²)h(D), and thus s_2 is
sufficient for σ².
28. We are to suppose that s is a statistic for which p(θ|s, D) = p(θ|s), that is, that
the posterior does not depend upon the data set D.
(a) Let s be a sufficient statistic for which p(s|θ, D) = p(s|θ); we assume p(s|θ) ≠ 0.
In that case, we can write Bayes' law
p(θ|s, D) = p(θ, s, D)/p(s, D) = p(s|θ, D) p(θ|D) / p(s|D).
Note that the probability density of the parameter is fully specified by the
sufficient statistic; the data gives no further information, and this implies
p(s|θ, D) = p(s|θ).
Since p(s|θ) ≠ 0, we can write
p(θ|s, D) = p(s|θ) p(θ) / p(s) = p(θ|s),
as required.
29. The Cauchy distribution is
p(x) = (1/(πb)) · 1/(1 + ((x − a)/b)²).
(a) We integrate, substituting y = (x − a)/b:
∫_{−∞}^{∞} p(x) dx = (1/(πb)) ∫_{−∞}^{∞} dx/(1 + ((x − a)/b)²)
= (1/π) ∫_{−∞}^{∞} dy/(1 + y²) = (1/π) [tan⁻¹ y]_{−∞}^{∞} = (1/π)(π/2 − (−π/2)) = 1.
(b) The mean is
μ = ∫_{−∞}^{∞} x p(x) dx = (1/(πb)) ∫_{−∞}^{∞} x/(1 + ((x − a)/b)²) dx.
With the substitution y = (x − a)/b this becomes
μ = (1/π) [ b ∫_{−∞}^{∞} y/(1 + y²) dy + a ∫_{−∞}^{∞} 1/(1 + y²) dy ].
Note that f(y) = y/(1 + y²) is an odd function, and thus the first integral
vanishes. We substitute our result from part (a) and find
μ = a (1/π) ∫_{−∞}^{∞} 1/(1 + y²) dy = a.
This conclusion is satisfying, since when x = a the distribution has its maximum; moreover, the distribution falls off symmetrically around this peak.
According to the definition of the standard deviation,
σ² = ∫ x² p(x) dx − μ² = (1/(πb)) ∫ x²/(1 + ((x − a)/b)²) dx − a².
Substituting y = (x − a)/b again,
σ² = (1/π) ∫ (b²y² + 2aby + a²)/(1 + y²) dy − a²
= (1/π) [ ∫ b² dy + 2ab ∫ y/(1 + y²) dy + (a² − b²) ∫ 1/(1 + y²) dy ] − a²
= ∞ + 2ab · 0 + (a² − b²) − a²
= ∞.
This result is again as we would expect, since according to the above calculation,
E[(x − μ)²] = ∞.
(c) Since we know θ = (μ σ)ᵗ = (a ∞)ᵗ, for any defined s = (s_1, ..., s_n) with s_i < ∞,
it is impossible to have p(D|s, (a ∞)ᵗ) = p(D|s). Thus the Cauchy distribution
has no sufficient statistics for the mean and standard deviation.
30. We consider the univariate case. Let μ and σ² denote the mean and variance of
the Gaussian, Cauchy and binomial distributions, and let x̄ = (1/n) Σ_{i=1}^{n} x_i be the sample
mean of the data points x_i, for i = 1, ..., n.
(a) The expected value of the sample variance (with the 1/(n − 1) normalization) is
E[σ̂_n²] = (1/(n − 1)) Σ_{i=1}^{n} ( E[(x_i − μ)²] − 2E[(x_i − μ)(x̄ − μ)] + E[(x̄ − μ)²] ).
Note that
E[(x_i − μ)(x̄ − μ)] = E[ (x_i − μ) (1/n) Σ_{j=1}^{n} (x_j − μ) ]
= (1/n) E[(x_i − μ)²] + (1/n) E[ (x_i − μ) Σ_{k=1; k≠i}^{n} (x_k − μ) ]
= (1/n) σ² + 0 = σ²/n,
since the samples are independent; likewise E[(x̄ − μ)²] = σ²/n. Thus
E[σ̂_n²] = (1/(n − 1)) Σ_{i=1}^{n} ( σ² − 2σ²/n + σ²/n ) = (n/(n − 1)) · ((n − 1)/n) σ² = σ²,
and the estimator is unbiased. Had we instead normalized by 1/n we would obtain
E[σ̂_n²] = ((n − 1)/n) σ²,
which is biased.
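The two normalizations can be compared empirically; a minimal sketch (sample size n = 5, σ² = 4, and trial count are arbitrary illustrative choices):

```python
import random

def var_est(xs, denom):
    """Sample variance of xs with a chosen normalizing denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / denom

random.seed(4)
n, trials = 5, 50000
unbiased = biased = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    unbiased += var_est(xs, n - 1)
    biased += var_est(xs, n)
# E[1/(n-1) version] = sigma^2 = 4;  E[1/n version] = (n-1)/n * sigma^2 = 3.2
print(unbiased / trials, biased / trials)
```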
Section 3.7
31. We assume that a and b are positive constants and n a variable. Recall the
definition on page 633 in the text: f(x) = O(g(x)) if there exist constants c and x_0 such that |f(x)| ≤ c|g(x)| for all x > x_0.
(a) Is a^{n+1} = O(a^n)? Yes. If we let c = a and x_0 = 1, then c · a^n = a^{n+1} ≥ a^{n+1} for
x ≥ 1.
(b) Is a^{bn} = O(a^n)? No (for b > 1). No matter what constant c we choose, we can always find
a value n for which a^{bn} > c · a^n.
(c) Is a^{n+b} = O(a^n)? Yes. If we choose c = a^b and x_0 = 1, then c · a^n = a^{n+b} ≥ a^{n+b}
for x > 1.
(d) Clearly f(n) = O(f(n)). To prove this, we let c = 1 and x_0 = 0. Then of course
f(x) ≤ f(x) for x ≥ 0. (In fact, we have f(n) = f(n).)
32. We seek to evaluate f(x) = Σ_{i=0}^{n−1} a_i x^i at a point x where the n coefficients a_i
are given.
(a) A straightforward Θ(n²) algorithm is:
Algorithm 0 (Basic polynomial evaluation)
1 begin initialize x, a_0, ..., a_{n−1}
2   f ← a_0
3   for i ← 1 to n − 1
4     p ← 1
5     for j ← 1 to i
6       p ← p · x
7     end for
8     f ← f + a_i · p
9   end for
10  return f
11 end
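The straightforward quadratic method, and Horner's rule as the standard Θ(n) alternative, can be sketched as follows (the test polynomial is an arbitrary illustrative choice):

```python
def eval_naive(coeffs, x):
    """Straightforward evaluation of sum_i a_i x^i -- Theta(n^2) multiplications."""
    total = 0.0
    for i, a in enumerate(coeffs):
        p = 1.0
        for _ in range(i):   # compute x^i by repeated multiplication
            p *= x
        total += a * p
    return total

def eval_horner(coeffs, x):
    """Horner's rule -- Theta(n) operations for the same polynomial."""
    total = 0.0
    for a in reversed(coeffs):
        total = total * x + a
    return total

coeffs = [2.0, -1.0, 0.0, 3.0]        # 2 - x + 3x^3
print(eval_naive(coeffs, 2.0), eval_horner(coeffs, 2.0))   # both give 24.0
```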
(a) O(N )
(b) O(N )
(c) O(J(I + K))
34. We assume that the uniprocessor can perform one operation per nanosecond
(10⁻⁹ second). There are 3600 seconds in an hour, 24 × 3600 = 86,400 seconds in
a day, and 31,557,600 seconds in a year. Given one operation per nanosecond, the
total numbers of basic operations in the periods are: 10⁹ in one second; 3.6 × 10¹² in
one hour; 8.64 × 10¹³ in one day; 3.156 × 10¹⁶ in one year. Thus, for the table below,
we find n such that f(n) is equal to these total numbers of operations. For instance,
for the 1 day entry in the n log₂ n row, we solve for n such that n log₂ n = 8.64 × 10¹³.
f(n)      | 1 sec (10⁹)          | 1 hour (3.6×10¹²)       | 1 day (8.64×10¹³)        | 1 year (3.156×10¹⁶)
----------|----------------------|-------------------------|--------------------------|---------------------------
log₂ n    | 2^(10⁹) ≈ 10^(3×10⁸) | 2^(3.6×10¹²) ≈ 10^(10¹²)| 2^(8.64×10¹³) ≈ 10^(10¹³)| 2^(3.156×10¹⁶) ≈ 10^(10¹⁶)
√n        | (10⁹)² = 10¹⁸        | (3.6×10¹²)² ≈ 1.3×10²⁵  | (8.64×10¹³)² ≈ 7.46×10²⁷ | (3.156×10¹⁶)² ≈ 9.96×10³²
n         | 10⁹                  | 3.6×10¹²                | 8.64×10¹³                | 3.156×10¹⁶
n log₂ n *| 3.9×10⁷              | 9.86×10¹⁰               | 2.11×10¹²                | 6.41×10¹⁴
n²        | 3.16×10⁴             | 1.9×10⁶                 | 9.3×10⁶                  | 1.78×10⁸
n³        | 10³                  | 1.53×10⁴                | 4.42×10⁴                 | 3.16×10⁵
2ⁿ        | log₂(10⁹) ≈ 29.9     | log₂(3.6×10¹²) ≈ 41.71  | log₂(8.64×10¹³) ≈ 46.30  | log₂(3.156×10¹⁶) ≈ 54.81
eⁿ        | ln(10⁹) ≈ 20.7       | ln(3.6×10¹²) ≈ 28.91    | ln(8.64×10¹³) ≈ 32.09    | ln(3.156×10¹⁶) ≈ 38.0
n! **     | ≈ 13.1               | ≈ 16.1                  | ≈ 17.8                   | ≈ 19.3
* For entries in this row, we solve numerically for n such that, for instance, n log₂ n =
3.9 × 10⁷, and so on.
** For entries in this row, we use Stirling's approximation n! ≈ (n/e)ⁿ for large n
and solve numerically.
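Entries like these can be generated programmatically; the following is a minimal sketch (it compares logarithms to avoid overflow, and uses the same Stirling approximation for n! as the table):

```python
import math

def max_n(log_f, budget):
    """Largest integer n with f(n) <= budget, comparing logs; doubling then bisection."""
    log_b = math.log(budget)
    lo, hi = 2, 4
    while log_f(hi) <= log_b:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_f(mid) <= log_b:
            lo = mid
        else:
            hi = mid
    return lo

budgets = {"1 sec": 1e9, "1 hour": 3.6e12, "1 day": 8.64e13, "1 year": 3.156e16}
growth = {  # each entry returns ln f(n)
    "n^2": lambda n: 2 * math.log(n),
    "n^3": lambda n: 3 * math.log(n),
    "n log2 n": lambda n: math.log(n * math.log2(n)),
    "n!": lambda n: n * (math.log(n) - 1),   # ln of Stirling's (n/e)^n
}
for name, f in growth.items():
    print(name, {p: max_n(f, b) for p, b in budgets.items()})
```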
35. Recall first the sample mean and sample covariance based on n samples:
m_n = (1/n) Σ_{k=1}^{n} x_k
C_n = (1/(n − 1)) Σ_{k=1}^{n} (x_k − m_n)(x_k − m_n)ᵗ.
(a) For the mean we have
m_{n+1} = (1/(n + 1)) Σ_{k=1}^{n+1} x_k
= (1/(n + 1)) [n m_n + x_{n+1}]
= m_n − (1/(n + 1)) m_n + (1/(n + 1)) x_{n+1}
= m_n + (1/(n + 1)) (x_{n+1} − m_n).
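The recursion above can be sketched directly (the data values are arbitrary):

```python
def running_mean(xs):
    """Recursive sample mean: m_{n+1} = m_n + (x_{n+1} - m_n) / (n + 1)."""
    m = 0.0
    for n, x in enumerate(xs):
        m += (x - m) / (n + 1)
    return m

data = [4.0, -1.5, 2.0, 7.0, 0.5]
assert abs(running_mean(data) - sum(data) / len(data)) < 1e-12
print(running_mean(data))
```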
(b) For the covariance we write
C_{n+1} = (1/n) Σ_{k=1}^{n+1} (x_k − m_{n+1})(x_k − m_{n+1})ᵗ.
Substituting m_{n+1} = m_n + (x_{n+1} − m_n)/(n + 1) and using Σ_{k=1}^{n} (x_k − m_n) = 0, the first n terms give
Σ_{k=1}^{n} (x_k − m_{n+1})(x_k − m_{n+1})ᵗ = Σ_{k=1}^{n} (x_k − m_n)(x_k − m_n)ᵗ + (n/(n + 1)²)(x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ,
while the (n + 1)st term contributes
(x_{n+1} − m_{n+1})(x_{n+1} − m_{n+1})ᵗ = (n/(n + 1))² (x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ.
Collecting terms,
C_{n+1} = (1/n) [ (n − 1)C_n + (n/(n + 1)² + n²/(n + 1)²)(x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ ]
= ((n − 1)/n) C_n + (1/(n + 1)) (x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ.
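The covariance recursion can be checked against the direct definition; a minimal scalar sketch (the data values are arbitrary):

```python
def cov_direct(xs):
    """Sample covariance (scalar case): (1/(n-1)) sum (x_k - m_n)^2."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def cov_recursive(xs):
    """Update C_{n+1} = ((n-1)/n) C_n + (x_{n+1} - m_n)^2 / (n+1)."""
    m, c = xs[0], 0.0
    for n, x in enumerate(xs[1:], start=1):   # n = samples seen so far
        c = (n - 1) / n * c + (x - m) ** 2 / (n + 1)
        m += (x - m) / (n + 1)                # then update the mean
    return c

data = [4.0, -1.5, 2.0, 7.0, 0.5]
assert abs(cov_direct(data) - cov_recursive(data)) < 1e-12
print(cov_recursive(data))
```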
36. Consider the rank-one update of a matrix inverse.
(a) We verify that
(A + xxᵗ)⁻¹ = A⁻¹ − A⁻¹xxᵗA⁻¹/(1 + xᵗA⁻¹x)
by multiplying:
(A + xxᵗ) [ A⁻¹ − A⁻¹xxᵗA⁻¹/(1 + xᵗA⁻¹x) ]
= AA⁻¹ − xxᵗA⁻¹/(1 + xᵗA⁻¹x) + xxᵗA⁻¹ − x(xᵗA⁻¹x)xᵗA⁻¹/(1 + xᵗA⁻¹x)
= I + xxᵗA⁻¹ − x [ (1 + xᵗA⁻¹x)/(1 + xᵗA⁻¹x) ] xᵗA⁻¹    (xᵗA⁻¹x is a scalar)
= I + xxᵗA⁻¹ − xxᵗA⁻¹
= I.
Comparing the left-hand side and the right-hand side of this equation, we see
(A + xxᵗ)⁻¹ = A⁻¹ − A⁻¹xxᵗA⁻¹/(1 + xᵗA⁻¹x).
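The rank-one update formula can be confirmed numerically; a minimal sketch (dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # symmetric positive definite, hence nonsingular
x = rng.standard_normal((d, 1))

Ainv = np.linalg.inv(A)
# Sherman-Morrison rank-one update of the inverse:
update = Ainv - (Ainv @ x @ x.T @ Ainv) / (1.0 + (x.T @ Ainv @ x).item())
assert np.allclose(update, np.linalg.inv(A + x @ x.T))
print("rank-one inverse update verified")
```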
(b) We apply the result in part (a) to the recursion derived in Problem 35 and see
that
C_{n+1} = ((n − 1)/n) C_n + (1/(n + 1)) (x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ
= ((n − 1)/n) [ C_n + (n/(n² − 1)) (x_{n+1} − m_n)(x_{n+1} − m_n)ᵗ ]
= ((n − 1)/n) (A + xxᵗ).
We set A = C_n and x = √(n/(n² − 1)) (x_{n+1} − m_n) and substitute to get
C_{n+1}⁻¹ = (n/(n − 1)) (A + xxᵗ)⁻¹
= (n/(n − 1)) [ A⁻¹ − A⁻¹xxᵗA⁻¹/(1 + xᵗA⁻¹x) ]
= (n/(n − 1)) [ C_n⁻¹ − (n/(n² − 1)) C_n⁻¹(x_{n+1} − m_n)(x_{n+1} − m_n)ᵗC_n⁻¹ / (1 + (n/(n² − 1))(x_{n+1} − m_n)ᵗC_n⁻¹(x_{n+1} − m_n)) ]
= (n/(n − 1)) [ C_n⁻¹ − C_n⁻¹(x_{n+1} − m_n)(x_{n+1} − m_n)ᵗC_n⁻¹ / ((n² − 1)/n + (x_{n+1} − m_n)ᵗC_n⁻¹(x_{n+1} − m_n)) ],
where u = C_n⁻¹(x_{n+1} − m_n) is of O(d²) complexity, given that C_n⁻¹, x_{n+1} and
m_n are known. Hence, clearly C_{n+1}⁻¹ can be computed from C_n⁻¹ in O(d²) operations, as uuᵗ and uᵗ(x_{n+1} − m_n) are computed in O(d²) operations. The complexity
associated with determining C_n⁻¹ from scratch is O(nd²).
11 12 1n
21 22 2n
= .
..
.. .
..
..
.
.
.
n1
nn
n2
1
=1
1
for all 0 < α < 1. Therefore, we must first normalize the data to have unit variance.
Section 3.8
38. Note that in this problem our densities need not be normal.
(a) Here we have the criterion function
J_1(w) = (μ_1 − μ_2)² / (σ_1² + σ_2²).
The projection is y = wᵗx, and thus the projected means are
μ̃_i = (1/n_i) Σ_{y∈Y_i} y = (1/n_i) Σ_{x∈D_i} wᵗx = wᵗμ_i,
and the projected scatters are
σ̃_i² = Σ_{y∈Y_i} (y − μ̃_i)² = wᵗ [ Σ_{x∈D_i} (x − μ_i)(x − μ_i)ᵗ ] w = wᵗΣ_i w.
We define
S_W = Σ_1 + Σ_2  and  S_B = (μ_1 − μ_2)(μ_1 − μ_2)ᵗ,
so that
σ̃_1² + σ̃_2² = wᵗS_W w  and  (μ̃_1 − μ̃_2)² = wᵗS_B w.
Thus we can write
J_1(w) = wᵗS_B w / (wᵗS_W w).
For the same reason that Eq. 103 in the text is maximized, we have that J_1(w)
is maximized at w = S_W⁻¹(μ_1 − μ_2). In sum, J_1(w) is maximized at w =
(Σ_1 + Σ_2)⁻¹(μ_1 − μ_2).
(b) Consider the criterion function
J_2(w) = (μ_1 − μ_2)² / (P(ω_1)σ_1² + P(ω_2)σ_2²).
By the same argument, with S_W replaced by P(ω_1)Σ_1 + P(ω_2)Σ_2, we can write
J_2(w) = wᵗS_B w / (wᵗS_W w).
For the same reason Eq. 103 is maximized, we have that J_2(w) is maximized at
w = (P(ω_1)Σ_1 + P(ω_2)Σ_2)⁻¹ (μ_1 − μ_2).
(c) Equation 96 of the text is more closely related to the criterion function in part (a)
above. In Eq. 96 in the text, we let m̃_i = μ_i and s̃_i² = σ_i², and the statistical
meanings are unchanged. Then we see the exact correspondence between J(w)
and J_1(w).
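The optimality of the w = S_W⁻¹(m_1 − m_2) direction is easy to probe numerically; a minimal sketch (the two Gaussian sample sets are arbitrary illustrative data, not from the problem):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w = S_W^{-1}(m1 - m2): the direction maximizing the Fisher criterion."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m1 - m2)

def criterion(w, X1, X2):
    """J(w) = (m1' - m2')^2 / (s1^2 + s2^2) after projecting onto w."""
    y1, y2 = X1 @ w, X2 @ w
    return (y1.mean() - y2.mean()) ** 2 / (y1.var() * len(y1) + y2.var() * len(y2))

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], [[2, 0.5], [0.5, 1]], size=200)
X2 = rng.multivariate_normal([2, 1], [[2, 0.5], [0.5, 1]], size=200)

w = fisher_direction(X1, X2)
# No random direction should beat the Fisher direction on J:
best_random = max(criterion(rng.standard_normal(2), X1, X2) for _ in range(500))
assert criterion(w, X1, X2) >= best_random - 1e-9
print(w, criterion(w, X1, X2))
```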
39. The expression for the criterion function is
J_1 = (1/(n_1 n_2)) Σ_{y_i∈Y_1} Σ_{y_j∈Y_2} (y_i − y_j)².
We expand:
J_1 = (1/(n_1 n_2)) Σ_{y_i∈Y_1} Σ_{y_j∈Y_2} [(y_i − m_1) − (y_j − m_2) + (m_1 − m_2)]²
= (1/(n_1 n_2)) Σ_{y_i∈Y_1} Σ_{y_j∈Y_2} [ (y_i − m_1)² + (y_j − m_2)² + (m_1 − m_2)²
  − 2(y_i − m_1)(y_j − m_2) + 2(y_i − m_1)(m_1 − m_2) − 2(y_j − m_2)(m_1 − m_2) ].
Each cross term contains a factor Σ_{y_i∈Y_1} (y_i − m_1) or Σ_{y_j∈Y_2} (y_j − m_2), which vanishes, and thus
J_1 = (1/n_1) s_1² + (1/n_2) s_2² + (m_1 − m_2)²,
where
m_1 = (1/n_1) Σ_{y_i∈Y_1} y_i,   m_2 = (1/n_2) Σ_{y_j∈Y_2} y_j,
s_1² = Σ_{y_i∈Y_1} (y_i − m_1)²,   s_2² = Σ_{y_j∈Y_2} (y_j − m_2)².
(b) The prior probability of class one is taken as P(ω_1) = 1/n_1, and likewise the prior probability of class two as P(ω_2) = 1/n_2. Thus the total within-class scatter is
J_2 = P(ω_1)s_1² + P(ω_2)s_2² = (1/n_1)s_1² + (1/n_2)s_2².
(c) We can then write
J_1/J_2 = [ (m_1 − m_2)² + (1/n_1)s_1² + (1/n_2)s_2² ] / [ (1/n_1)s_1² + (1/n_2)s_2² ]
= (m_1 − m_2)² / [ (1/n_1)s_1² + (1/n_2)s_2² ] + 1.
In terms of w, we have
(m_1 − m_2)² = wᵗ(m_1 − m_2)(m_1 − m_2)ᵗw
and
(1/n_1)s_1² + (1/n_2)s_2² = wᵗ [ (1/n_1)S_1 + (1/n_2)S_2 ] w,
so
J_1/J_2 = wᵗ(m_1 − m_2)(m_1 − m_2)ᵗw / ( wᵗ [ (1/n_1)S_1 + (1/n_2)S_2 ] w ) + 1.
For the same reason that Eq. 103 in the text is optimized, we have that J(w) is
maximized at
w = [ (1/n_1)S_1 + (1/n_2)S_2 ]⁻¹ (m_1 − m_2).
We need to guarantee J_2 = 1, and this requires
wᵗ [ (1/n_1)S_1 + (1/n_2)S_2 ] w = 1,
that is, a scale factor α on w with
α² (m_1 − m_2)ᵗ [ (1/n_1)S_1 + (1/n_2)S_2 ]⁻¹ (m_1 − m_2) = 1.
40. Let the e_i be the generalized eigenvectors, S_B e_i = λ_i S_W e_i, normalized so that
e_iᵗ S_W e_j = δ_ij  and  e_iᵗ S_B e_j = λ_i δ_ij,
where δ_ij is the Kronecker symbol. We denote the matrix consisting of eigenvectors as W = [e_1 e_2 ... e_n]. Then we can write the within-class scatter matrix
in the new representation as
S̃_W = Wᵗ S_W W = [ e_iᵗ S_W e_j ] = I.
Likewise we can write the between-class scatter matrix in the new representation as
S̃_B = Wᵗ S_B W = [ e_iᵗ S_B e_j ] = diag(λ_1, λ_2, ..., λ_n).
Now consider a transformed matrix W̃ = WDQᵗ with D diagonal and Q orthogonal. Then
S̃_W = W̃ᵗ S_W W̃ = QDWᵗ S_W WDQᵗ = QD²Qᵗ,  so |S̃_W| = |D|²,
and
S̃_B = W̃ᵗ S_B W̃ = QDWᵗ S_B WDQᵗ,  so |S̃_B| = |D|² λ_1 λ_2 ··· λ_n.
This implies that the criterion function obeys
J = |S̃_B| / |S̃_W| = λ_1 λ_2 ··· λ_n,
and thus J is invariant to this transformation.
41. Our two Gaussian distributions are p(x|ω_i) ~ N(μ_i, Σ) for i = 1, 2. We denote
the samples after projection as D̃_i and the projected distributions as
p(y|ω_i) = (1/(√(2π) s̃)) exp[ −(y − μ̃_i)²/(2s̃²) ].
The ratio of the log-likelihoods is
ln p(D̃_1|ω_1) / ln p(D̃_2|ω_2) = [ Σ_{k=1}^{n} ln p(y_k|ω_1) ] / [ Σ_{k=1}^{n} ln p(y_k|ω_2) ],
where each term has the form
ln p(y_k|ω_i) = c_1 − (y_k − μ̃_i)²/(2s̃²),
with c_1 = −ln(√(2π) s̃). Expanding terms such as (y_k − μ̃_2)² = ((y_k − μ̃_1) + (μ̃_1 − μ̃_2))² and using Σ_{y_k∈D̃_i} (y_k − μ̃_i) = 0, the sums reduce to the within-class scatters plus multiples of (μ̃_1 − μ̃_2)²/s̃², and the ratio becomes
(c + n_2 J(w)) / (c + n_1 J(w)),
where J(w) is the Fisher criterion and c collects the constant terms.
This implies that the Fisher linear discriminant can be derived from the negative of
the loglikelihood ratio.
42. Consider the criterion function J(w) required for the Fisher linear discriminant.
(a) We are given Eqs. 96, 97, and 98 in the text:
J_1(w) = |m̃_1 − m̃_2|² / (s̃_1² + s̃_2²)   (96)
S_i = Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ   (97)
S_W = S_1 + S_2   (98)
where y = wᵗx and m̃_i = (1/n_i) Σ_{y∈Y_i} y = wᵗm_i. From these we can write Eq. 99:
s̃_i² = Σ_{y∈Y_i} (y − m̃_i)² = Σ_{x∈D_i} (wᵗx − wᵗm_i)² = Σ_{x∈D_i} wᵗ(x − m_i)(x − m_i)ᵗ w = wᵗS_i w.
Therefore, the sum of the scatters can be written as
s̃_1² + s̃_2² = wᵗS_W w.   (100)
Similarly,
(m̃_1 − m̃_2)² = (wᵗm_1 − wᵗm_2)² = wᵗ(m_1 − m_2)(m_1 − m_2)ᵗ w = wᵗS_B w,   (101)
and thus the criterion function is
J_1(w) = wᵗS_B w / (wᵗS_W w).   (103)
(b) Part (a) gave us Eq. 103. It is easy to see that the w that optimizes Eq. 103
is not unique. Here we optimize J_1(w) = wᵗS_B w subject to the constraint
that J_2(w) = wᵗS_W w = 1. We use the method of Lagrange undetermined
multipliers and form the functional
g(w, λ) = J_1(w) − λ(J_2(w) − 1).
We set its derivative to zero, that is,
∂g(w, λ)/∂w_i = u_iᵗS_B w + wᵗS_B u_i − λ(u_iᵗS_W w + wᵗS_W u_i) = 2u_iᵗ(S_B w − λS_W w) = 0,
where u_i is the ith unit vector; this leads to the generalized eigenvalue problem S_B w = λ S_W w.
(c) We assume that J(w) reaches the extremum at w and that Δw is a small
deviation. Then we can expand the criterion function around w as
J(w + Δw) = ((w + Δw)ᵗ S_B (w + Δw)) / ((w + Δw)ᵗ S_W (w + Δw))
≈ (wᵗS_B w + 2Δwᵗ S_B w) / (wᵗS_W w + 2Δwᵗ S_W w),
that is, to first order the extremal value remains
J(w) = wᵗS_B w / (wᵗS_W w).
43. For the multiple-class case we define
m_i = (1/n_i) Σ_{x∈D_i} x  and  m = (1/n) Σ_{x∈D} x = (1/n) Σ_{i=1}^{c} n_i m_i,
with within-class scatter
S_W = Σ_{i=1}^{c} Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ.
In order to maximize
J(W) = |Wᵗ S_B W| / |Wᵗ S_W W|,
1 1 = 0.
" #
4 0
2
p
x42
=
lnp
dx42
" #
4 0
dx
p
42
x42
#
"
1 2 p 4 0
x42
4
=
ln
dx42
10
1
x42
1010 dx42
1
1
=
10
4
x42
lnp
2 1 2
4 0
4
p
dx42 .
x42
x42
xu2
1
4
10
lnp
dx42
x42
10 10
xl2
1
10
1
.
xu1 xl1  xu2 xl2 
xu2
1
4
10
lnp
dx42
x42
10 10
1
1
xu2 ln
.
10
xu1 xl1  xu2 xl2 
10
K
1
4
dx42
10 10
x42
10
1
1
(10 xu1 )ln
.
10
xu1 xl1  xu2 xl2 
lnp
xu1
4. Otherwise, K = 0.
Similarly, there are four cases we must consider when calculating L:
1. xl1 < xu1 0 < 10 or 0 < 10 xl1 < xu1 or 6 xl2 or xu2 6: then L = 0.
2. 0 xl1 < xu1 10 and xl2 6 xu2 , then
L =
xu1
x41
1
10
lnp
dx41
6
10 10
xl1
1
1
(xu1 xl1 )ln
.
10
xu1 xl1  xu2 xl2 
10
xu1
x41
1
lnp
dx41
6
10 10
0
1
1
xu1 ln
.
10
xu1 xl1  xu2 xl2 
10
L =
10
xl1
lnp
1
x41
dx41
6
10 10
1
1
(10 xl1 )ln
.
10
xu1 xl1  xu2 xl2 
Therefore Q(; 0 ) =
3
k=1
[Figure: the data points in the x_1–x_2 plane (axes from 0 to 10), with the component values x_1 = 5, x_2 = 6 and x_1 = 2, x_2 = 1 marked.]
lnp(x1 ) + lnp(x2 ) +
" #
2 0
p
x32
2
lnp(x1 ) + lnp(x2 ) +
p
dx32
x32
2
0
p
d32
x32
1/(2e)
=
lnp(x1 ) + lnp(x2 ) + 2e
lnp(x1 ) + lnp(x2 ) + K.
lnp
2 0
2
p
dx32
x32
x32
1
4
2
ln
0
1 21 1
e
1
2
dx32
=
1
2 ln
4
1 21 1
e
1
2
.
2. 2 4:
1
4
4
1 21 1
=
ln
e
1
2
0
1
1 21 1
=
e
4ln
4
1
2
1 21 1
= ln
.
e
1
2
dx32
3. Otherwise K = 0.
Thus we have
Q(; 0 )
lnp(x1 ) + lnp(x2 ) + K
1 1 1
1 31 1
= ln
+ ln
+K
e
e
1
2
1
2
= 1 ln(1 2 ) 31 ln(1 2 ) + K
= 41 2ln(1 2 ) + K,
=
p(x1 )dx1 =
1 1 x1
e
dx1 = 1.
1
1
Q(; ) = 4 2ln2 + 2 (2 + ln2 ) .
4
0
Note that this is a monotonic function and that arg max_{θ_2} Q(θ; θ⁰) = 3,
which leads to max Q(θ; θ⁰) ≈ −8.5212.
2. θ_2 ≥ 4: In this case Q(θ; θ⁰) = −6 − 3 ln θ_2. Note that this is a monotonic
function and that arg max_{θ_2} Q(θ; θ⁰) = 4, which leads to max Q(θ; θ⁰) ≈ −10.1589.
In short, then, θ̂ = (1 3)ᵗ.
(c) See figure.
48. Our data set is D =
[Figure: surface plots of p(x_1, x_2) for the initial estimate and the final estimate θ = (1 3)ᵗ.]
= Ex31 lnp(xg , xb ; ) 0 , Dg
=
(lnp(x1 ) + lnp(x2 ) + lnp(x3 )) p(x31  0 , x32 = 2)dx31
=
lnp(x1 ) + lnp(x2 ) +
lnp
" #
x31 0
p
2
x31
dx31
2
x31 0
p
dx31
2
1/16
=
lnp(x1 ) + lnp(x2 ) + 16
lnp(x1 ) + lnp(x2 ) + K.
lnp
x31
x31 0
p
dx31
2
2
2x31
2e
ln
=
1 1 x31 1
e
1
2
dx31
= ln(1 2 )
2e2x31 dx31 21
1
= ln(1 2 ) 21
4
1
=
1 ln(1 2 ).
2
2. Otherwise K = 0.
Therefore we have
Q(; 0 )
lnp(x1 ) + lnp(x2 ) + K
1 1 1
1 31 1
= ln
+ ln
+K
e
e
1
2
1
2
= 1 ln(1 2 ) 31 ln(1 2 ) + K
= 41 2ln(1 2 ) + K,
according to the cases for K, directly above. Note that
thus
1 1 x
e
dx1 = 1,
1
= 4 2ln2 +
7
= 3ln2 .
2
1
ln2
2
Note that this is a monotonic function, and thus arg max_{θ_2} Q(θ; θ⁰) = 2. Thus
the parameter vector is θ̂ = (1 2)ᵗ.
(c) See figure.
[Figure: surface plots of p(x_1, x_2) for the initial estimate θ⁰ = (2 4)ᵗ and the final estimate θ = (1 2)ᵗ.]
Section 3.10
49. A single revision of the â_ij's and b̂_ij's involves computing (via Eqs. ?? and ??)
â_ij = Σ_{t=1}^{T} γ_ij(t) / Σ_{t=1}^{T} Σ_k γ_ik(t),
and the analogous ratio of γ_ij(t)'s for b̂_ij.
The α_i(t)'s and P(V^T|M) are computed by the Forward Algorithm, which requires O(c²T)
operations. The β_i(t)'s can be computed recursively as follows:
For t = T − 1 down to 1
  For i = 1 to c
    β_i(t) = Σ_j a_ij b_jk v(t + 1) β_j(t + 1)
  End
End
This requires O(c²T) operations.
Similarly, the γ_ij's can be computed with O(c²T) operations given the α_i(t)'s, a_ij's, b_ij's,
β_i(t)'s and P(V^T|M). So, the γ_ij(t)'s are computed with
O(c²T) + O(c²T) + O(c²T) = O(c²T) operations.
Then, given the γ_ij(t)'s, the â_ij's can be computed with O(c²T) operations and the b̂_ij's with
O(c²T) operations. Therefore, a single revision requires O(c²T) operations.
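The backward recursion sketched above can be written out directly; a minimal sketch (the 2-state, 2-symbol model parameters and the observation sequence are illustrative assumptions):

```python
def backward(A, B, obs):
    """beta_i(t) = sum_j a_ij b_j(v_{t+1}) beta_j(t+1): O(c^2 T) operations total."""
    c, T = len(A), len(obs)
    beta = [[0.0] * c for _ in range(T)]
    beta[T - 1] = [1.0] * c                 # boundary condition at t = T
    for t in range(T - 2, -1, -1):
        for i in range(c):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(c))
    return beta

# Illustrative transition matrix A, emission matrix B, and observations:
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]
beta = backward(A, B, obs)
print(beta[0])
```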
50. The standard method for calculating the probability of a sequence in a given
HMM is to use the forward probabilities i (t).
(a) In the forward algorithm, for t = 0, 1, ..., T, we have
α_j(t) = { 0 if t = 0 and j ≠ the initial state;
           1 if t = 0 and j = the initial state;
           [ Σ_{i=1}^{c} α_i(t − 1) a_ij ] b_jk v(t) otherwise }.
Likewise, in the backward algorithm,
β_j(t) = { 0 if t = T and j ≠ the final state;
           1 if t = T and j = the final state;
           Σ_{i=1}^{c} a_ji b_ik v(t + 1) β_i(t + 1) otherwise },
and the probability of the sequence is
P(V^T) = Σ_{i=1}^{c} α_i(T) β_i(T).
At t = 0 the sum reduces to the value for the known initial state; this is the same as line 5 in Algorithm 3. Likewise, at
t = T, the above reduces to P(V^T) = Σ_{i=1}^{c} α_i(T)β_i(T) = α_j(T), where j is the
known final state.
51. For a new HMM with â_i'j' = 0, the corresponding γ_i'j'(t) = 0 for all t. Substituting
γ_i'j'(t) into the re-estimation formula gives â_i'j' = 0 again. Therefore, keeping this substitution throughout
the iterations in the learning algorithm, we see that â_i'j' = 0 remains unchanged.
52. Consider the decoding algorithm (Algorithm 4).
(a) The algorithm is:
Algorithm 0 (Modified decoding)
1 begin initialize Path ← {}, t ← 0
2   for t ← t + 1
3     j ← 0
4     for j ← j + 1
5       δ_j(t) ← min_i [ δ_i(t − 1) − ln a_ij ] − ln[b_jk v(t)]
6     until j = c
7     j' ← arg min_j [δ_j(t)]
8     Append j' to Path
9   until t = T
10  return Path
11 end
(b) Taking the logarithm is an O(c²) computation, since we only need to calculate
ln a_ij for all i, j = 1, 2, ..., c, and ln[b_jk v(t)] for j = 1, 2, ..., c. Then the whole
complexity of this algorithm is O(c²T).
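A log-domain decoder of this kind can be sketched as follows (it maximizes summed log-probabilities, which is equivalent to minimizing the negated logs used above; the model parameters and observation sequence are illustrative assumptions):

```python
import math

def viterbi_log(logpi, logA, logB, obs):
    """Decoding with pre-computed log a_ij and log b_j(v): O(c^2 T) after O(c^2) logs."""
    c = len(logpi)
    delta = [logpi[i] + logB[i][obs[0]] for i in range(c)]
    back = []
    for v in obs[1:]:
        prev = delta[:]
        back.append([max(range(c), key=lambda i: prev[i] + logA[i][j]) for j in range(c)])
        delta = [prev[back[-1][j]] + logA[back[-1][j]][j] + logB[j][v] for j in range(c)]
    # backtrack the most probable state path
    path = [max(range(c), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

lg = lambda p: math.log(p) if p > 0 else float("-inf")
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
path = viterbi_log([lg(p) for p in pi],
                   [[lg(p) for p in r] for r in A],
                   [[lg(p) for p in r] for r in B],
                   [0, 0, 1, 1, 0])
print(path)
```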
Computer Exercises
Section 3.2
1. Computer exercise not yet solved
Section 3.3
2. Computer exercise not yet solved
Section 3.4
3. Computer exercise not yet solved
Section 3.5
4. Computer exercise not yet solved
Section 3.6
5. Computer exercise not yet solved
Section 3.7
6. Computer exercise not yet solved
7. Computer exercise not yet solved
8. Computer exercise not yet solved
Section 3.8
9. Computer exercise not yet solved
10. Computer exercise not yet solved
Section 3.9
11. Computer exercise not yet solved
12. Computer exercise not yet solved
Section 3.10
13. Computer exercise not yet solved
Chapter 4
Nonparametric techniques
Problem Solutions
Section 4.3
1. Our goal is to show that Eqs. 19–22 are sufficient to assure convergence of Eqs. 17
and 18. We first need to show that lim_{n→∞} p̄_n(x) = p(x), and this requires us to show
that
lim_{n→∞} (1/V_n) φ((x − V)/h_n) = δ_n(x − V) → δ(x − V).
Without loss of generality, we set V = 0 and write
(1/V_n) φ(x/h_n) = (1/Π_{i=1}^{d} h_n) Π_{i=1}^{d} φ(x_i/h_n) = δ_n(x),
where we used the fact that V_n = h_n^d. We also see that δ_n(x) → 0 if x ≠ 0, with
normalization
∫ δ_n(x) dx = 1.
Thus we have in the limit n → ∞, δ_n(x) → δ(x) as required.
We substitute our definition of p_n(x) and get
lim_{n→∞} p̄_n(x) = E[p_n(x)] = ∫ δ(x − V) p(V) dV = p(x).
For the variance we write
Var[p_n(x)] ≤ (1/(nV_n)) ∫ (1/V_n) φ²((x − V)/h_n) p(V) dV
≤ (Sup(φ)/(nV_n)) ∫ (1/V_n) φ((x − V)/h_n) p(V) dV
= Sup(φ) p̄_n(x) / (nV_n).
We note that Sup(φ) < ∞ and that in the limit n → ∞ we have p̄_n(x) → p(x) and
nV_n → ∞. We put these results together to conclude that
lim_{n→∞} σ_n²(x) = 0,  for all x.
2. Our normal distribution is p(x) ~ N(μ, σ²) and our Parzen window is φ(x) ~
N(0, 1), or more explicitly,
φ(x) = (1/√(2π)) e^{−x²/2},
and the estimate is
p_n(x) = (1/(nh_n)) Σ_{i=1}^{n} φ((x − x_i)/h_n).
(a) The expected value of the probability at x, based on a window width parameter
h_n, is
p̄_n(x) = E[p_n(x)] = (1/(nh_n)) Σ_{i=1}^{n} E[φ((x − x_i)/h_n)]
= (1/h_n) ∫ φ((x − v)/h_n) p(v) dv
= (1/h_n) ∫ (1/√(2π)) exp[−(1/2)((x − v)/h_n)²] (1/(√(2π)σ)) exp[−(1/2)((v − μ)/σ)²] dv.
Completing the square in v, the integrand is a Gaussian in v whose integral contributes a constant factor, leaving a Gaussian in x; the algebra uses
1/θ² = 1/h_n² + 1/σ², that is, θ² = h_n²σ²/(h_n² + σ²).
The argument of the exponential simplifies as
x²/h_n² + μ²/σ² − θ²(x/h_n² + μ/σ²)² = (x − μ)²/(h_n² + σ²),
and thus
p̄_n(x) = (1/(√(2π) √(h_n² + σ²))) exp[−(1/2)(x − μ)²/(h_n² + σ²)],
that is, p̄_n(x) ~ N(μ, h_n² + σ²).
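The convolution result above can be checked by Monte Carlo; a minimal sketch (μ, σ, h_n, the sample size, and the query point are arbitrary illustrative choices):

```python
import math, random

def parzen_estimate(x, data, h):
    """Parzen estimate with a Gaussian window: (1/(n h)) sum phi((x - x_i)/h)."""
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

random.seed(3)
mu, sigma, h = 1.0, 2.0, 0.5
data = [random.gauss(mu, sigma) for _ in range(200000)]
x = 1.7
# The average estimate converges to N(mu, h^2 + sigma^2), not to N(mu, sigma^2):
print(parzen_estimate(x, data, h), normal_pdf(x, mu, h * h + sigma * sigma))
```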
(b) For the variance we write
Var[p_n(x)] = (1/n) Var[(1/h_n) φ((x − v)/h_n)]
= (1/(n h_n²)) { E[φ²((x − v)/h_n)] − (E[φ((x − v)/h_n)])² },
where in the first step we used the fact that x_1, ..., x_n are independent samples
drawn according to p(x). We thus now need to calculate the expected value of
the square of the kernel function,
E[φ²((x − v)/h_n)] = ∫ φ²((x − v)/h_n) p(v) dv,
and note that φ²(u) = (1/(2√π)) × a Gaussian window of width h_n/√2, so the same
convolution calculation as in part (a) applies with h_n replaced by h_n/√2.
We make the substitution and find
E[φ²((x − v)/h_n)] = (h_n/(2√π)) (1/(√(2π) √(h_n²/2 + σ²))) exp[−(1/2)(x − μ)²/(h_n²/2 + σ²)],
and thus conclude
(1/(n h_n²)) E[φ²((x − v)/h_n)] = (1/(2√π n h_n)) (1/(√(2π) √(h_n²/2 + σ²))) exp[−(1/2)(x − μ)²/(h_n²/2 + σ²)].
For small h_n, h_n²/2 + σ² ≈ σ², and thus the above equation can be approximated as
(1/(n h_n²)) E[φ²((x − v)/h_n)] ≈ (1/(2√π n h_n)) (1/(√(2π)σ)) exp[−(1/2)((x − μ)/σ)²] = p(x)/(2√π n h_n).   (∗)
Similarly, since E[φ((x − v)/h_n)] = h_n p̄_n(x), we have
(1/(n h_n²)) (E[φ((x − v)/h_n)])² = p̄_n²(x)/n,   (∗∗)
which for small h_n is smaller than (∗) by a factor of order h_n. Putting these together,
Var[p_n(x)] ≈ p(x)/(2√π n h_n).
Finally, the bias is
$$p(x)-\bar p_n(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac12\frac{(x-\mu)^2}{\sigma^2}\right] - \frac{1}{\sqrt{2\pi}\sqrt{h_n^2+\sigma^2}}\exp\!\left[-\frac12\frac{(x-\mu)^2}{h_n^2+\sigma^2}\right]$$
$$= \left\{1 - \frac{1}{\sqrt{1+(h_n/\sigma)^2}}\exp\!\left[\frac12\frac{h_n^2(x-\mu)^2}{\sigma^2(h_n^2+\sigma^2)}\right]\right\}p(x),$$
where we used $\frac{1}{\sigma^2}-\frac{1}{h_n^2+\sigma^2} = \frac{h_n^2}{\sigma^2(h_n^2+\sigma^2)}$. For small $h_n$ we expand to lowest order in $h_n^2$:
$$p(x)-\bar p_n(x) \simeq \left\{1-\left(1-\frac12\frac{h_n^2}{\sigma^2}\right)\left(1+\frac12\frac{h_n^2(x-\mu)^2}{\sigma^4}\right)\right\}p(x)$$
$$\simeq \frac12\left(\frac{h_n}{\sigma}\right)^2\left[1-\left(\frac{x-\mu}{\sigma}\right)^2\right]p(x).$$
3. Our density is
$$p(x) = \begin{cases}1/a & 0\le x\le a\\ 0 & \text{otherwise,}\end{cases}$$
and our Parzen window is
$$\varphi(x) = \begin{cases}e^{-x} & x\ge 0\\ 0 & x<0.\end{cases}$$
(a) The expected value of the estimate is
$$\bar p_n(x) = E\!\left[\frac{1}{nh_n}\sum_{i=1}^n\varphi\!\left(\frac{x-x_i}{h_n}\right)\right] = \frac{1}{h_n}\int\varphi\!\left(\frac{x-v}{h_n}\right)p(v)\,dv = \frac{e^{-x/h_n}}{a\,h_n}\int\limits_{\substack{v\le x\\ 0<v<a}} e^{v/h_n}\,dv,$$
that is,
$$\bar p_n(x) = \begin{cases}0 & \text{if } x<0\\[4pt] \dfrac{e^{-x/h_n}}{a h_n}\displaystyle\int_0^x e^{v/h_n}dv = \dfrac1a\left(1-e^{-x/h_n}\right) & \text{if } 0\le x\le a\\[8pt] \dfrac{e^{-x/h_n}}{a h_n}\displaystyle\int_0^a e^{v/h_n}dv = \dfrac1a\left(e^{a/h_n}-1\right)e^{-x/h_n} & \text{if } x\ge a,\end{cases}$$
where in the first step we used the fact that $x_1,\dots,x_n$ are independent samples drawn according to $p(v)$.
(b) For the case $a=1$, we have
$$\bar p_n(x) = \begin{cases}0 & x<0\\ 1-e^{-x/h_n} & 0\le x\le 1\\ \left(e^{1/h_n}-1\right)e^{-x/h_n} & x>1.\end{cases}$$
[Figure: $\bar p_n(x)$ for $h_n = 1/16$, $1/4$ and $1$, plotted over $0\le x\le 2.5$.]
Formally, a bias lower than 1% over 99% of the range $0<x<a$ means that
$$\frac{p(x)-\bar p_n(x)}{p(x)} \le 0.01$$
over 99% of the range $0<x<a$. This, in turn, implies
$$\frac{(1/a)\,e^{-x/h_n}}{1/a} \le 0.01 \ \text{ over 99\% of } 0<x<a, \quad\text{or}\quad h_n \le \frac{0.01\,a}{\ln(100)}.$$
For the case $a=1$, we have $h_n \le 0.01/\ln(100) = 0.0022$, as shown in the figure. Notice that the estimate is within 1% of $p(x) = 1/a = 1.0$ above $x \simeq 0.01$, fulfilling the conditions of the problem.
[Figure: $\bar p_n(x)$ for $h_n = 0.0022$, which rises to within 1% of $p(x) = 1$ by $x \simeq 0.01$.]
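The piecewise mean and the bound on $h_n$ can be verified directly; a minimal sketch (the sampling grid is an arbitrary choice):

```python
import math

def pbar(x, a=1.0, h=0.0022):
    """Closed-form mean of the Parzen estimate derived above
    (uniform density on [0, a], exponential window)."""
    if x < 0:
        return 0.0
    if x <= a:
        return (1.0 - math.exp(-x / h)) / a
    return (math.exp(a / h) - 1.0) * math.exp(-x / h) / a

h = 0.01 / math.log(100.0)          # the bound h_n <= 0.01 a / ln(100), with a = 1
assert abs(h - 0.0022) < 1e-4
# Relative bias |p - pbar|/p stays below 1% for x >= 0.01a, i.e. over 99% of (0, a):
for i in range(1, 100):
    x = 0.01 + i * (1.0 - 0.01) / 100
    assert abs(1.0 - pbar(x, h=h)) <= 0.01
```

At $x = 0.01$ the relative bias equals exactly $e^{-\ln 100} = 0.01$, so the bound is tight.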
4. We have from Algorithm 2 in the text that the discriminant functions of the PNN classifier are given by
$$g_i(x) = \sum_{k=1}^{n_i} a_{ki}\exp\!\left[\frac{w_k^t x - 1}{\sigma^2}\right], \qquad i=1,\dots,c.$$
The conditional risk of deciding $\omega_i$ is
$$\sum_{j=1}^{c}\lambda_{ij}P(\omega_j|x) = \sum_{j=1}^{c}\lambda_{ij}\frac{P(\omega_j)p(x|\omega_j)}{p(x)} \propto \sum_{j=1}^{c}\lambda_{ij}P(\omega_j)p(x|\omega_j).$$
Consequently, the PNN classifier must estimate this quantity. From part (a) we have that $g_j(x) = nP_n(\omega_j)p(x|\omega_j)$. Thus the new discriminant functions are simply
$$\tilde g_i(x) = \frac1n\sum_{j=1}^{c}\lambda_{ij}\,g_j(x),$$
where the $g_j(x)$ are the discriminant functions of the PNN classifier previously defined. An equivalent discriminant function is
$$\tilde g_i(x) = \sum_{j=1}^{c}\lambda_{ij}\,g_j(x).$$
(c) The training algorithm as defined is unaffected, since it only performs a normalization of the training patterns. However, the PNN classification algorithm must be rewritten as:
Algorithm 0 (PNN with costs)
[Algorithm steps omitted.]
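A minimal sketch of a cost-sensitive PNN decision rule consistent with the discussion above; the function name, data layout, and sample prototypes are illustrative assumptions, not from the text:

```python
import math

def pnn_classify(x, protos, costs, sigma=1.0):
    """Cost-sensitive PNN decision sketch.
    protos: list over classes j of lists of (normalized) prototype vectors w.
    costs:  c-by-c matrix with costs[i][j] = lambda_ij.
    Returns the index i minimizing the estimated risk sum_j lambda_ij g_j(x)."""
    c = len(protos)
    # Standard PNN activations g_j(x) = sum_k exp[(w^t x - 1)/sigma^2]
    g = []
    for cls in protos:
        s = 0.0
        for w in cls:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            s += math.exp((dot - 1.0) / (sigma * sigma))
        g.append(s)
    # With costs, decide the category of minimum expected cost
    risks = [sum(costs[i][j] * g[j] for j in range(c)) for i in range(c)]
    return min(range(c), key=lambda i: risks[i])

# Two classes with one unit-length prototype each; zero-one costs
# reduce the rule to the usual arg-max over g_i:
label = pnn_classify((1.0, 0.0), [[(1.0, 0.0)], [(0.0, 1.0)]], [[0, 1], [1, 0]])
assert label == 0
```

With a non-symmetric cost matrix the same activations $g_j$ are reused; only the final combination changes, exactly as the rewritten discriminants indicate.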
Section 4.4
5. Our goal is to show that $p_n(x)$ converges in probability to $p(x)$ given the conditions
$$\lim_{n\to\infty}k_n = \infty \qquad\text{(condition 1)}$$
$$\lim_{n\to\infty}k_n/n = 0 \qquad\text{(condition 2)}$$
As we know from probability theory, if $p_n(x)$ converges in the $r$th mean ($r\ge1$) to $p(x)$, that is, $E[|p_n(x)-p(x)|^r]\to0$ as $n\to\infty$, then $p_n(x)$ also converges in probability to $p(x)$. It is simple to show for $r=2$ that
$$E[(p_n(x)-p(x))^2] = \mathrm{Var}[p_n(x)] + \mathrm{bias}^2[p_n(x)],$$
where
$$\mathrm{Var}[p_n(x)] = E\!\left[(p_n(x)-E[p_n(x)])^2\right] \to 0 \ \text{ as } n\to\infty \qquad\text{(condition 3)}$$
$$\mathrm{bias}[p_n(x)] = E[p_n(x)]-p(x) \to 0 \ \text{ as } n\to\infty \qquad\text{(condition 4)}.$$
Note that the $k$-nearest-neighbor estimate $p_n(x) = \frac{k_n/n}{V_n}$ can be rewritten as a kernel estimate according to Eq. 11 in the text:
$$p_n(x) = \frac1n\sum_{i=1}^n\frac{1}{V_n}\varphi\!\left(\frac{x-x_i}{h_n}\right),$$
where $h_n$ denotes here the radius of a ball which includes the $k_n$ prototypes nearest to $x$, $V_n$ is the volume of the ball, and
$$\varphi\!\left(\frac{x-x_i}{h_n}\right) = \begin{cases}1 & \|x-x_i\|\le h_n\\ 0 & \text{otherwise.}\end{cases}$$
Hence we can apply the results of page 167 in the text. According to Eq. 23 in the text,
$$E[p_n(x)]\to p(x) \text{ as } n\to\infty \quad\text{if and only if}\quad \frac{1}{V_n}\varphi\!\left(\frac{x}{h_n}\right)\to\delta(x) \text{ as } n\to\infty,$$
where $\delta(x)$ is the delta function. Given a test pattern $x$, we will have a very small ball (or more precisely, $V_n\to0$) as $n\to\infty$, since we can get $k_n$ prototypes arbitrarily close to $x$ due to condition 2. If $V_n\to0$ then $h_n\to0$, and therefore
$$\varphi\!\left(\frac{x}{h_n}\right) \to \begin{cases}1 & x=0\\ 0 & \text{otherwise,}\end{cases}$$
so $\frac{1}{V_n}\varphi(x/h_n)$ approaches the delta function and condition 4 holds.
[Figure: the test point $x$ with its $k_n$ nearest prototypes $x_1,\dots,x_{k_n}$ inside the ball of radius $h_n$.]
For the variance, the same reasoning gives
$$\mathrm{Var}[p_n(x)] \le \frac{1}{nV_n}\cdot\frac{1}{V_n}\int\varphi^2\!\left(\frac{x-u}{h_n}\right)p(u)\,du \ \to\ \frac{\int\delta(x-u)p(u)\,du}{\lim\limits_{n\to\infty}nV_n} = \frac{p(x)}{\lim\limits_{n\to\infty}nV_n}.$$
Since $\lim_{n\to\infty}\frac{k_n}{nV_n} = p(x)$, we have $nV_n \simeq k_n/p(x) \to \infty$ by condition 1. Thus $\mathrm{Var}[p_n(x)]\to0$ as $n\to\infty$, establishing condition 3.
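A small numerical sketch of this $k$-nearest-neighbor estimate in one dimension, with $k_n = \sqrt{n}$ (a choice satisfying both conditions); the density, sample size, and seed are illustrative:

```python
import random
import math

def knn_density(x, samples, k):
    """k-NN density estimate p_n(x) = (k/n) / V_n, where V_n is the length
    of the smallest interval around x containing the k nearest samples."""
    dists = sorted(abs(s - x) for s in samples)
    radius = dists[k - 1]
    volume = 2.0 * radius              # a 1-D "ball" of radius h_n
    return (k / len(samples)) / volume

random.seed(0)
n = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
k = int(math.sqrt(n))                  # k_n -> infinity while k_n / n -> 0
est = knn_density(0.0, samples, k)
true = 1.0 / math.sqrt(2 * math.pi)    # N(0,1) density at x = 0
assert abs(est - true) < 0.1
```

Increasing $n$ (and hence $k_n$) tightens both the variance and the bias of the estimate, which is the content of conditions 1 and 2.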
and that
$$p(x|\omega_1) = \begin{cases}1/2 & |x|\le1\\ 0 & \text{otherwise}\end{cases} \qquad p(x|\omega_2) = \begin{cases}1/2 & |x-a|\le1\\ 0 & \text{otherwise.}\end{cases}$$
The error probability obeys
$$\frac{1}{2^n} < P_n(e;k) = \frac{1}{2^n}\sum_{j=0}^{(k-1)/2}\binom{n}{j} \qquad\text{for } k>1,$$
and
$$\frac{1}{2^n}\sum_{j=0}^{(k-1)/2}\binom{n}{j} = \Pr\!\left[B(n,1/2)\le\frac{k-1}{2}\right] = \Pr\!\left[Y_1+\cdots+Y_n\le\frac{k-1}{2}\right],$$
where the $Y_i$ are independent, equally likely binary trials. Furthermore,
$$P_n(e) \le \Pr\!\left[Y_1+\cdots+Y_n\le\frac{a\sqrt n-1}{2}\right] = \Pr[Y_1+\cdots+Y_n\le0] = 0 \quad\text{for } n \text{ sufficiently large},$$
and this guarantees $P_n(e)\to0$ as $n\to\infty$.
Section 4.5
7. Our goal is to show that Voronoi cells induced by the nearest-neighbor algorithm are convex, that is, given any two points in the cell, the line connecting them also lies in the cell. We let $x^*$ be the labeled sample point in the Voronoi cell, and $y$ be any other labeled sample point.
[Figure: points $x(0)$, $x(\lambda)$ and $x(1)$ in the Voronoi cell of $x^*$, with $y$ on the far side of the separating hyperplane.]
A unique hyperplane separates the space into the points that are closer to $x^*$ than to $y$, as shown in the figure. Consider any two points $x(0)$ and $x(1)$ inside the Voronoi cell of $x^*$; these points are surely on the side of the hyperplane nearest $x^*$. Now consider the line connecting those points, parameterized as $x(\lambda) = (1-\lambda)x(0) + \lambda x(1)$ for $0\le\lambda\le1$. Because the half-space defined by the hyperplane is convex, all the points $x(\lambda)$ lie nearer to $x^*$ than to $y$. This is true for every pair of points $x(0)$ and $x(1)$ in the Voronoi cell. Furthermore, the result holds for every other sample point $y_i$. Thus $x(\lambda)$ remains closer to $x^*$ than to any other labeled point. By our definition of convexity we have, then, that the Voronoi cell is convex.
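The convexity claim is easy to illustrate numerically; the sites and test points below are arbitrary:

```python
def nearest(p, sites):
    """Index of the site nearest to p (squared Euclidean distance)."""
    d2 = [(p[0] - s[0]) ** 2 + (p[1] - s[1]) ** 2 for s in sites]
    return d2.index(min(d2))

sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (5.0, 5.0)]
# Two points in the Voronoi cell of site 0:
a, b = (0.5, 0.5), (1.2, -0.3)
assert nearest(a, sites) == 0 and nearest(b, sites) == 0
# Every convex combination of them stays in that same cell:
for t in range(11):
    lam = t / 10.0
    m = ((1 - lam) * a[0] + lam * b[0], (1 - lam) * a[1] + lam * b[1])
    assert nearest(m, sites) == 0
```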
8. It is indeed possible to have the nearest-neighbor error rate $P$ equal to the Bayes error rate $P^*$ for non-trivial distributions.
(a) Consider uniform priors over $c$ categories, that is, $P(\omega_i)=1/c$, and one-dimensional distributions
$$p(x|\omega_i) = \begin{cases}1 & 0\le x\le\frac{cr}{c-1}\\[2pt] 1 & i\le x\le i+1-\frac{cr}{c-1}\\[2pt] 0 & \text{elsewhere,}\end{cases}$$
where we require $\frac{cr}{c-1}\le1$. The evidence is
$$p(x) = \sum_{i=1}^{c}p(x|\omega_i)P(\omega_i) = \begin{cases}1 & 0\le x\le\frac{cr}{c-1}\\[2pt] 1/c & i\le x\le i+1-\frac{cr}{c-1}\\[2pt] 0 & \text{elsewhere.}\end{cases}$$
The posteriors are
$$P(\omega_i|x) = \frac{P(\omega_i)p(x|\omega_i)}{p(x)} = \begin{cases}\frac{1/c}{p(x)} = \frac1c & 0\le x\le\frac{cr}{c-1}\\[2pt] 1 & j\le x\le j+1-\frac{cr}{c-1} \text{ and } i=j\\[2pt] 0 & j\le x\le j+1-\frac{cr}{c-1} \text{ and } i\ne j.\end{cases}$$
[Figure: the densities $p(x|\omega_i)$ for $c=3$ categories and $r=0.1$.]
In the regions where a single category dominates, $1-P(\omega_{\max}|x) = 1-1 = 0$, so the Bayes error is
$$P^* = \int\left[1-P(\omega_{\max}|x)\right]p(x)\,dx = \int_0^{cr/(c-1)}\left(1-\frac1c\right)p(x)\,dx = \left(1-\frac1c\right)\frac{cr}{c-1} = r.$$
(b) The nearest-neighbor error rate is
$$P = \int\left[1-\sum_{i=1}^{c}P^2(\omega_i|x)\right]p(x)\,dx = \int_0^{cr/(c-1)}\left[1-c\left(\frac1c\right)^2\right]p(x)\,dx + \sum_{j=1}^{c}\int_j^{j+1-\frac{cr}{c-1}}[1-1]\,p(x)\,dx$$
$$= \left(1-\frac1c\right)\frac{cr}{c-1} = r.$$
Thus $P = P^* = r$.
9. [Figure: six labeled training points, (5,10), (0,5), (5,5), (2,8), (5,2) and (10,4), drawn from categories $\omega_1$, $\omega_2$ and $\omega_3$.]
Throughout the figures below, we let dark gray represent category $\omega_1$, white represent category $\omega_2$, and light gray represent category $\omega_3$. The data points are labeled by their category numbers, the means are shown as small black dots, and the dashed straight lines separate the means.
[Panels (a)-(d): the resulting decision regions.]
10. The Voronoi diagram of $n$ points in $d$-dimensional space is the same as the convex hull of those same points projected orthogonally onto a paraboloid in $(d+1)$-dimensional space. So the editing algorithm can be solved either with a Voronoi-diagram algorithm in $d$-space or a convex-hull algorithm in $(d+1)$-dimensional space. Now there are scores of algorithms available for both problems, all with different complexities.
A theorem in the book by Preparata and Shamos refers to the complexity of the Voronoi diagram itself, which is of course a lower bound on the complexity of computing it. This complexity was solved by Victor Klee, "On the complexity of d-dimensional Voronoi diagrams," Archiv der Mathematik, vol. 34, 1980, pp. 75-80. The complexity formula given in this problem is the complexity of the convex hull algorithm of Raimund Seidel, "Constructing higher dimensional convex hulls at logarithmic cost per face," Proc. 18th ACM Conf. on the Theory of Computing, 1986, pp. 404-413.
So here $d$ is one bigger than for Voronoi diagrams. If we substitute $d$ in the formula in this problem with $(d+1)$ we get the complexity of Seidel's algorithm for Voronoi diagrams, as discussed in A. Okabe, B. Boots and K. Sugihara, Spatial Tessellations: Concepts and Applications of Voronoi Diagrams, John Wiley, 1992.
11. Consider the curse of dimensionality and its relation to the separation of points randomly selected in a high-dimensional space.
(a) The sampling density goes as $n^{1/d}$, and thus if we need $n_1$ samples in $d=1$ dimension, an equivalent number of samples in $d$ dimensions is $n_1^d$. Thus if we needed 100 points on a line (i.e., $n_1 = 100$), then for $d=20$ we would need $n_{20} = (100)^{20} = 10^{40}$ points, an astronomically large number.
(b) We assume a roughly uniform distribution, and thus the typical inter-point Euclidean (i.e., $L_2$) distance goes as the $d$th root of the volume, that is, $(\text{volume})^{1/d}$.
(c) Consider points uniformly distributed in the unit interval $0\le x\le1$. The length containing fraction $p$ of all the points is of course $p$. In $d$ dimensions, the edge length of a hypercube containing fraction $p$ of the points is $l_d(p) = p^{1/d}$. Thus we have
$$l_5(0.01) = (0.01)^{1/5} = 0.398$$
$$l_5(0.1) = (0.1)^{1/5} = 0.631$$
$$l_{20}(0.01) = (0.01)^{1/20} = 0.794$$
$$l_{20}(0.1) = (0.1)^{1/20} = 0.891.$$
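These edge lengths follow directly from $l_d(p) = p^{1/d}$; a minimal check:

```python
def edge_length(p, d):
    """Edge of the sub-hypercube holding fraction p of uniform points in d dims."""
    return p ** (1.0 / d)

assert abs(edge_length(0.01, 5) - 0.398) < 0.001
assert abs(edge_length(0.1, 5) - 0.631) < 0.001
assert abs(edge_length(0.01, 20) - 0.794) < 0.001
assert abs(edge_length(0.1, 20) - 0.891) < 0.001
```

Even a hypercube holding only 1% of the points in 20 dimensions spans nearly 80% of each axis.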
(d) The $L_\infty$ distance between two points in $d$-dimensional space is given by Eq. 57 in the text, with $k\to\infty$:
$$L_\infty(x,y) = \lim_{k\to\infty}\left(\sum_{i=1}^{d}|x_i-y_i|^k\right)^{1/k} = \max_i|x_i-y_i|.$$
Consider a separation $l^*$ in the range $0\le l^*<0.5$. The figure at the left shows the $(x_i,y_i)$ coordinate pairs plotted in a unit square; there are $d$ points, one corresponding to each axis of the hypercube. The probability that a particular single coordinate $i$ will have $|x_i-y_i|\le l^*$ is equal to 1.0 minus the area in the small triangles, that is,
$$\Pr[|x_i-y_i|\le l^* \text{ for a particular } i] = 1-(1-l^*)^2.$$
The probability that all of the $d$ coordinates will have $|x_i-y_i|\le l^*$ is then
$$\Pr[L_\infty(x,y)\le l^*] = \left(1-(1-l^*)^2\right)^d,$$
and the probability that at least one coordinate has separation greater than $l^*$ is just 1.0 minus the above, that is,
$$\Pr[L_\infty(x,y)>l^*] = 1-\left(1-(1-l^*)^2\right)^d.$$
Now consider the distance of a point $x$ to the faces of the hypercube.
[Figure: left, the pairs $(x_i,y_i)$ in the unit square with the band $|x_i-y_i|\le l^*$; right, the $d$ coordinates of $x$ in the unit interval and the $L_\infty$ distance from $x$ to the nearest hypercube face.]
The figure at the right shows the $d$ coordinates of $x$. The probability that a single particular coordinate value $x_i$ is closer to a corresponding face (0 or 1) than a distance $l^*$ is clearly $2l^*$, since the position of $x_i$ is uniformly distributed in the unit interval. The probability that any of the coordinates is closer to a face than $l^*$ is
$$\Pr[\text{at least one coordinate is closer to a face than } l^*] = 1-(1-2l^*)^d.$$
We put these two results together to find that, for a given $l^*$ and $d$, the probability that $x$ is closer to a wall than $l^*$ while its nearest sample point $y$ is farther than $l^*$ in the $L_\infty$ metric is the product of the two independent probabilities, that is,
$$\left[1-(1-2l^*)^d\right]\left[1-\left(1-(1-l^*)^2\right)^d\right],$$
which approaches 1.0 quickly as a function of $d$, as shown in the graph (for the case $l^*=0.4$). As shown in Fig. 4.19 in the text, the unit "hypersphere" in the $L_\infty$ distance always encloses the unit hypersphere in the $L_2$ or Euclidean metric. Therefore, our discussion above is pessimistic: if $x$ is closer to a face than to the point $y$ according to the $L_\infty$ metric, it is closer still in the $L_2$ metric. Thus our conclusion holds for the Euclidean metric too. Consequently, in high dimensions we generally must extrapolate from sample points rather than interpolate between them.
[Graph: the above probability as a function of $d$ for $l^*=0.4$.]
12. The curse of dimensionality can be overcome if we know the form of the target function.
(a) Our target function is linear,
$$f(x) = \sum_{j=1}^{d}a_jx_j = a^tx,$$
and $y = f(x)+\varepsilon$ with $\varepsilon\sim N(0,\sigma^2)$; that is, we have Gaussian noise. The approximation is $\hat f(x) = \sum_{j=1}^{d}\hat a_jx_j$, where
$$\hat a = \arg\min_{a}\ \sum_{i=1}^{n}\left(y_i-\sum_{j=1}^{d}a_jx_{ij}\right)^2.$$
For the set of sample points $x_i$, we let $y_i$ be the corresponding sample values $y_i = f(x_i)+\varepsilon_i$. Consider the $n$-by-$d$ matrix $X = [x_{ij}]$, where $x_{ij}$ is the $j$th component of $x_i$, that is,
$$X = \begin{pmatrix}x_1^t\\ x_2^t\\ \vdots\\ x_n^t\end{pmatrix},$$
where each $x_i$ is a $d$-dimensional vector. If we let
$$a = \begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_d\end{pmatrix},$$
then we can write
$$Xa = \begin{pmatrix}f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n)\end{pmatrix},$$
and hence $y = Xa+\varepsilon$, where
$$y = \begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} \qquad\text{and}\qquad \varepsilon\sim N(0,\sigma^2 I_n).$$
We approximate $f(x)$ with $\hat f(x) = \sum_{j=1}^{d}\hat a_jx_j$, where the $\hat a_j$ minimize
$$\sum_{i=1}^{n}\left(y_i-\sum_{j=1}^{d}\hat a_jx_{ij}\right)^2 = \|y-X\hat a\|^2.$$
This is the standard linear least-squares problem, and its solution has expected squared estimation error
$$E\!\left[(f(x)-\hat f(x))^2\right] = \frac{d\,\sigma^2}{n}.$$
In short, the squared error increases linearly with the dimension $d$, not exponentially as we might expect from the general curse of dimensionality.
(b) We follow the logic of part (a). Now our target function is $f(x) = \sum_{m=1}^{M}a_mB_m(x)$, where each member of the set of $M$ basis functions $B_m(x)$ is a function of the $d$-component vector $x$. The approximation function is
$$\hat f(x) = \sum_{m=1}^{M}\hat a_mB_m(x),$$
where the $\hat a_m$ minimize
$$\sum_{i=1}^{n}\left(y_i-\sum_{m=1}^{M}\hat a_mB_m(x_i)\right)^2.$$
This is again a linear least-squares problem, now with design-matrix columns
$$\begin{pmatrix}B_m(x_1)\\ B_m(x_2)\\ \vdots\\ B_m(x_n)\end{pmatrix}, \qquad m=1,\dots,M.$$
As in part (a), we have
$$E\!\left[(f(x)-\hat f(x))^2\right] = \frac{M\sigma^2}{n},$$
which is independent of $d$, the dimensionality of the original space.
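A small self-contained least-squares fit illustrating part (a); with no noise the coefficients are recovered exactly (the 2-by-2 normal equations are solved directly; the data are illustrative):

```python
def fit_linear_2d(xs, ys):
    """Solve the 2-D normal equations X^t X a = X^t y for a = (a1, a2)."""
    s11 = sum(x[0] * x[0] for x in xs)
    s12 = sum(x[0] * x[1] for x in xs)
    s22 = sum(x[1] * x[1] for x in xs)
    t1 = sum(x[0] * y for x, y in zip(xs, ys))
    t2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

a = (2.0, -3.0)                            # true coefficients of f(x) = a^t x
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [a[0] * x[0] + a[1] * x[1] for x in xs]   # noiseless targets
ahat = fit_linear_2d(xs, ys)
assert abs(ahat[0] - 2.0) < 1e-9 and abs(ahat[1] + 3.0) < 1e-9
```

With noisy targets the recovered coefficients fluctuate around the truth, and the $d\sigma^2/n$ scaling of the squared error is the average size of that fluctuation.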
13. We assume $P(\omega_1)=P(\omega_2)=0.5$ and the distributions
$$p(x|\omega_1) = 2x \qquad\text{and}\qquad p(x|\omega_2) = 2(1-x) \qquad\text{for } 0\le x\le1,$$
as given in the figure.
[Figure: the two triangular densities on $[0,1]$, crossing at the Bayes point $x^* = 0.5$.]
(a) Clearly, by the symmetry of the problem, the Bayes decision boundary is $x^* = 0.5$. The error is then the area of the dark shading in the figure, that is,
$$P^* = P(\omega_1)\int_0^{0.5}2x\,dx + P(\omega_2)\int_{0.5}^{1}(2-2x)\,dx = \frac14\,0.5+\frac14\,0.5 = 0.25.$$
(b) Suppose a point is randomly selected from $\omega_1$ according to $p(x|\omega_1)$ and another point from $\omega_2$ according to $p(x|\omega_2)$; a test point is assigned the label of the nearer of the two, so the decision boundary falls at the midpoint $(x_1+x_2)/2$. We have that the error is
$$P_1(e) = \int_0^1dx_1\,p(x_1|\omega_1)\int_0^1dx_2\,p(x_2|\omega_2)\left[\ \int\limits_{x\text{ nearer }x_2}\!\!dx\,P(\omega_1)p(x|\omega_1)\ +\ \int\limits_{x\text{ nearer }x_1}\!\!dx\,P(\omega_2)p(x|\omega_2)\right],$$
where for $x_2<x_1$ the inner regions are $x<(x_1+x_2)/2$ and $x>(x_1+x_2)/2$, respectively.
(c) From part (d), below, we have for the special case $n=2$,
$$P_2(e) = \frac13+\frac{1}{(2+1)(2+3)}+\frac{1}{2(2+2)(2+3)} = \frac{51}{120} = 0.425.$$
(d) By symmetry, we may assume that the test point belongs to category $\omega_2$. Then the chance of error is the chance that the nearest sample point to the test point is in $\omega_1$. Thus the probability of error is
$$P_n(e) = \int_0^1p(x|\omega_2)\sum_{i=1}^{n}\int_0^1P(\omega_1|y_i)\Pr[|y_j-x|>|y_i-x|,\ \forall j\ne i]\,dy_i\,dx.$$
By symmetry the summands are the same for all $i$, and since the $y_j$ are independent we can write
$$P_n(e) = \int_0^1p(x|\omega_2)\,n\int_0^1P(\omega_1|y_1)\bigl(\Pr[|y_2-x|>|y_1-x|]\bigr)^{n-1}dy_1\,dx.$$
To evaluate $\Pr[|y_2-x|>|y_1-x|]$, we divide the range of $y_1$ into six cases. For $x\in[0,1/2]$:
$$\Pr[|y_2-x|>|y_1-x|] = \begin{cases}1+2y_1-2x & 0<y_1<x \quad\text{(Case 1.1)}\\ 1+2x-2y_1 & x<y_1<2x \quad\text{(Case 1.2)}\\ 1-y_1 & 2x<y_1<1 \quad\text{(Case 1.3)}\end{cases}$$
For $x\in[1/2,1]$:
$$\Pr[|y_2-x|>|y_1-x|] = \begin{cases}y_1 & 0<y_1<2x-1 \quad\text{(Case 2.1)}\\ 1+2y_1-2x & 2x-1<y_1<x \quad\text{(Case 2.2)}\\ 1+2x-2y_1 & x<y_1<1 \quad\text{(Case 2.3)}\end{cases}$$
Our density and posterior are $p(x|\omega_2) = 2(1-x)$ and $P(\omega_1|y) = y$ for $x\in[0,1]$ and $y\in[0,1]$. Substituting these forms into the above integral and breaking it into the six cases, we find
$$P_n(e) = \int_0^{1/2}2n(1-x)\!\left[\int_0^x\! y_1(1+2y_1-2x)^{n-1}dy_1+\int_x^{2x}\! y_1(1+2x-2y_1)^{n-1}dy_1+\int_{2x}^1\! y_1(1-y_1)^{n-1}dy_1\right]\!dx$$
$$+\int_{1/2}^{1}2n(1-x)\!\left[\int_0^{2x-1}\! y_1^n\,dy_1+\int_{2x-1}^{x}\! y_1(1+2y_1-2x)^{n-1}dy_1+\int_x^1\! y_1(1+2x-2y_1)^{n-1}dy_1\right]\!dx.$$
There are two integrals we must do twice with different bounds. The first is
$$\int_a^b y_1(1+2y_1-2x)^{n-1}dy_1.$$
We define $u(y_1) = 1+2y_1-2x$, so that $y_1 = (u+2x-1)/2$ and $dy_1 = du/2$. Then
$$\int_a^b y_1(1+2y_1-2x)^{n-1}dy_1 = \frac14\int_{u(a)}^{u(b)}(u+2x-1)u^{n-1}du = \frac14\left[\frac{2x-1}{n}u^n+\frac{1}{n+1}u^{n+1}\right]_{u(a)}^{u(b)}.$$
Likewise, with $u(y_1) = 1+2x-2y_1$ (so $y_1 = (1+2x-u)/2$ and $dy_1 = -du/2$),
$$\int_a^b y_1(1+2x-2y_1)^{n-1}dy_1 = \frac14\left[\frac{2x+1}{n}u^n-\frac{1}{n+1}u^{n+1}\right]_{u(b)}^{u(a)}.$$
We use these general forms to evaluate the three Case 1 integrals:
$$\int_0^x y_1(1+2y_1-2x)^{n-1}dy_1 = \frac14\left[\frac{2x-1}{n}u^n+\frac{1}{n+1}u^{n+1}\right]_{1-2x=u(0)}^{1=u(x)},$$
$$\int_x^{2x}y_1(1+2x-2y_1)^{n-1}dy_1 = \frac14\left[\frac{2x+1}{n}u^n-\frac{1}{n+1}u^{n+1}\right]_{1-2x=u(2x)}^{1=u(x)},$$
$$\int_{2x}^1y_1(1-y_1)^{n-1}dy_1 = \int_0^{1-2x}(1-u)u^{n-1}du = \frac1n(1-2x)^n-\frac{1}{n+1}(1-2x)^{n+1}.$$
We add these three integrals together and find, after a straightforward calculation, that the sum is
$$\frac{x}{n}+\frac{1}{2n}(1-2x)^n+\left(\frac{1}{2n}-\frac{1}{n+1}\right)(1-2x)^{n+1}. \qquad (*)$$
We next turn to the three Case 2 integrals:
$$\int_0^{2x-1}y_1^n\,dy_1 = \frac{1}{n+1}(2x-1)^{n+1},$$
$$\int_{2x-1}^{x}y_1(1+2y_1-2x)^{n-1}dy_1 = \frac14\left[\frac{2x-1}{n}u^n+\frac{1}{n+1}u^{n+1}\right]_{2x-1=u(2x-1)}^{1=u(x)},$$
$$\int_x^1y_1(1+2x-2y_1)^{n-1}dy_1 = \frac14\left[\frac{2x+1}{n}u^n-\frac{1}{n+1}u^{n+1}\right]_{2x-1=u(1)}^{1=u(x)}.$$
Their sum is
$$\frac{x}{n}-\frac{1}{2n}(2x-1)^n-\left(\frac{1}{2n}-\frac{1}{n+1}\right)(2x-1)^{n+1}. \qquad (**)$$
Using $(*)$ with the substitution $u=1-2x$ (so $x=(1-u)/2$, $1-x=(1+u)/2$, $dx=-du/2$), the Case 1 contribution is
$$\int_0^{1/2}2n(1-x)\left[\frac{x}{n}+\frac{1}{2n}(1-2x)^n+\left(\frac{1}{2n}-\frac{1}{n+1}\right)(1-2x)^{n+1}\right]dx = n\int_0^1\frac{1+u}{2}\left[\frac{1-u}{2n}+\frac{u^n}{2n}+\left(\frac{1}{2n}-\frac{1}{n+1}\right)u^{n+1}\right]du.$$
Similarly, using $(**)$ with $u=2x-1$ (so $x=(1+u)/2$, $1-x=(1-u)/2$, $dx=du/2$), the Case 2 contribution is
$$n\int_0^1\frac{1-u}{2}\left[\frac{1+u}{2n}-\frac{u^n}{2n}-\left(\frac{1}{2n}-\frac{1}{n+1}\right)u^{n+1}\right]du.$$
Now we are ready to put all these results together:
$$P_n(e) = n\int_0^1\left[\frac{1-u^2}{2n}+\frac{1}{2n}u^{n+1}+\left(\frac{1}{2n}-\frac{1}{n+1}\right)u^{n+2}\right]du$$
$$= n\left[\frac{1}{2n}\left(1-\frac13\right)+\frac{1}{2n(n+2)}+\left(\frac{1}{2n}-\frac{1}{n+1}\right)\frac{1}{n+3}\right]$$
$$= \frac13+\frac{3n+5}{2(n+1)(n+2)(n+3)} = \frac13+\frac{1}{(n+1)(n+3)}+\frac{1}{2(n+2)(n+3)}.$$
We may check this result for the case $n=1$, where there is only one sample. Of course, the error in that case is $P_1(e) = 1/2$, since the true label on the test point may either match or mismatch that of the single sample point with equal probability. The formula above confirms this:
$$P_1(e) = \frac13+\frac{1}{(1+1)(1+3)}+\frac{1}{2(1+2)(1+3)} = \frac13+\frac18+\frac{1}{24} = \frac12.$$
As $n\to\infty$ the error rate approaches
$$\lim_{n\to\infty}P_n(e) = \frac13,$$
which is larger than the Bayes error, as indeed it must be. In fact, this solution also illustrates the bounds of Eq. 52 in the text:
$$P^*\ \le\ P\ \le\ P^*(2-2P^*), \qquad \frac14\ \le\ \frac13\ \le\ \frac38.$$
14. We assume $P(\omega_1)=P(\omega_2)=0.5$ and the distributions
$$p(x|\omega_1) = \begin{cases}3/2 & 0\le x\le2/3\\ 0 & \text{otherwise}\end{cases} \qquad p(x|\omega_2) = \begin{cases}3/2 & 1/3\le x\le1\\ 0 & \text{otherwise,}\end{cases}$$
as given in the figure.
[Figure: the two uniform densities, overlapping on $1/3\le x\le2/3$.]
(a) This is a somewhat unusual problem in that the Bayes decision point can be any point $1/3\le x^*\le2/3$. For simplicity, we can take $x^* = 1/3$. The Bayes error is then simply
$$P^* = \int_{1/3}^{2/3}P(\omega_1)p(x|\omega_1)\,dx = 0.5\,(1/3)(3/2) = 0.25.$$
(b) The shaded area in the figure shows the possible (and equally likely) values of a point $x_1$ chosen from $p(x|\omega_1)$ and a point $x_2$ chosen from $p(x|\omega_2)$.
[Figure: the $(x_1,x_2)$ region $0\le x_1\le2/3$, $1/3\le x_2\le1$, with the sub-region $x_2<x_1$ shaded.]
Of this region, the fraction $7/8$ has the samples in their natural order ($x_1<x_2$) and the remaining $1/8$ has them reversed, giving
$$P = \frac78\,0.25+\frac18\,0.75 = \frac{5}{16} = 0.3125.$$
Note that the 0.25 conditional rate is the same as the Bayes rate. This problem is closely related to the zero-information case, where the posterior probabilities of the two categories are equal over a range of $x$. If the problem specified that the distributions were equal throughout the full range of $x$, then the Bayes error and the $P$ errors would both equal 0.5.
15. A faster version of Algorithm 3 in the text deletes prototypes as follows:
Algorithm 0 (Faster nearest-neighbor editing)
[Algorithm steps omitted.]
If each point $x_j$ has $k$ Voronoi neighbors on average, then the probability that $i$ out of these $k$ neighbors are not from the same class as $x_j$ is given by the binomial law:
$$P(i) = \binom{k}{i}\left(1-\frac1c\right)^{i}\left(\frac1c\right)^{k-i},$$
where we have assumed that all the classes have the same prior probability. Then the expected number of neighbors of any point $x_j$ belonging to a different class is
$$E[i] = k\left(1-\frac1c\right).$$
Since each time we find a prototype to delete we will remove $k(1-1/c)$ more prototypes on average, we will be able to speed up the search by a factor of $k(1-1/c)$.
16. Consider Algorithm 3 in the text.
(a) In the figure, the training points (black for $\omega_1$, white for $\omega_2$) are constrained to the intersections of a two-dimensional grid. Note that prototype $f$ does not contribute to the class boundary, owing to the presence of points $e$ and $d$. Hence $f$ should be removed from the set of prototypes by the editing algorithm (Algorithm 3 in the text). However, this algorithm detects that $f$ has a prototype of the other class among its Voronoi neighbors, and therefore retains it.
[Figure: the grid of prototypes with decision regions $R_1$ and $R_2$, and the prototype $f$ near the boundary.]
(b) A sequential editing algorithm, in which each point is considered in turn and retained or rejected before the next point is considered, is as follows:
Algorithm 0 (Sequential editing)
[Algorithm steps omitted.]
The result depends upon the order of presentation. Suppose the first point presented to the editing algorithm is $d$; then it will be removed, since its nearest neighbor is $f$ and so $d$ can be correctly classified without it. Then the other points are kept, except $e$, which can be removed since $f$ is also its nearest neighbor. Suppose now that the first point to be considered is $f$. Then $f$ will be removed, since $d$ is its nearest neighbor. However, point $e$ will not be deleted once $f$ has been removed, due to $c$, which will be its nearest neighbor. According to the above considerations, the algorithm will return points $a$, $b$, $c$, $f$ for the first ordering, and will return points $a$, $b$, $c$, $d$, $e$ for the second ordering.
17. We have that $P(\omega_i)=1/c$ and $p(x|\omega_i)=p(x)$ for $i=1,\dots,c$. Thus the probability density of finding $x$ is
$$p(x) = \sum_{i=1}^{c}p(x|\omega_i)P(\omega_i) = \sum_{i=1}^{c}p(x)\frac1c = p(x),$$
and accordingly
$$P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)} = \frac{p(x)\,(1/c)}{p(x)} = \frac1c. \qquad (*)$$
The asymptotic nearest-neighbor error rate is therefore
$$\lim_{n\to\infty}P_n(e) = \int\left[1-\sum_{i=1}^{c}P^2(\omega_i|x)\right]p(x)\,dx = \int\left[1-\sum_{i=1}^{c}\frac{1}{c^2}\right]p(x)\,dx = 1-\frac1c.$$
On the other hand, the Bayes error is
$$P^* = \int P(\text{error}|x)\,p(x)\,dx = \sum_{i=1}^{c}\ \int\limits_{\mathcal R_i}\left[1-P(\omega_i|x)\right]p(x)\,dx, \qquad (**)$$
where $\mathcal R_i$ denotes the Bayes decision region for class $\omega_i$. We now substitute $(*)$ into $(**)$ and find
$$P^* = \sum_{i=1}^{c}\ \int\limits_{\mathcal R_i}\left(1-\frac1c\right)p(x)\,dx = \left(1-\frac1c\right)\int p(x)\,dx = 1-\frac1c.$$
Thus we have $P = P^*$, which agrees with the upper bound $P^*\!\left(2-\frac{c}{c-1}P^*\right) = 1-1/c$ in this no-information case.
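The closing identity holds for any number of classes; a quick check with exact arithmetic (purely illustrative):

```python
from fractions import Fraction

def nn_bound(p_star, c):
    """Upper bound P*(2 - c/(c-1) P*) on the nearest-neighbor error rate."""
    return p_star * (2 - Fraction(c, c - 1) * p_star)

for c in range(2, 20):
    p_star = 1 - Fraction(1, c)             # Bayes rate in the no-information case
    assert nn_bound(p_star, c) == p_star    # the bound is met with equality
```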
18. The probability of error of a classifier, and of the $k$-nearest-neighbor classifier in particular, is
$$P(e) = \int P(e|x)\,p(x)\,dx.$$
The probability of error given a pattern $x$ which belongs to class $\omega_1$ can be computed as the probability that the number of nearest neighbors which belong to $\omega_1$ (which we denote by $a$) is at most $(k-1)/2$, for $k$ odd. Thus we have
$$P(e|x,\omega_1) = \Pr[a\le(k-1)/2] = \sum_{i=0}^{(k-1)/2}\Pr[a=i] = \sum_{i=0}^{(k-1)/2}\binom{k}{i}P(\omega_1|x)^{i}\,P(\omega_2|x)^{k-i}.$$
By a simple interchange $\omega_1\leftrightarrow\omega_2$, we find
$$P(e|x,\omega_2) = \sum_{i=0}^{(k-1)/2}\binom{k}{i}P(\omega_2|x)^{i}\,P(\omega_1|x)^{k-i}.$$
Combining these, weighted by the posteriors, gives
$$P(e|x) = \sum_{i=0}^{(k-1)/2}\binom{k}{i}\left\{P(\omega_1|x)^{i+1}\left[1-P(\omega_1|x)\right]^{k-i}+P(\omega_1|x)^{k-i}\left[1-P(\omega_1|x)\right]^{i+1}\right\}.$$
Since this expression is symmetric in $P(\omega_1|x)\leftrightarrow1-P(\omega_1|x)$, it can be written in terms of the conditional Bayes error $P^*(e|x) = \min[P(\omega_1|x),P(\omega_2|x)]$:
$$P(e|x) = \sum_{i=0}^{(k-1)/2}\binom{k}{i}\left\{P^*(e|x)^{i+1}\left[1-P^*(e|x)\right]^{k-i}+P^*(e|x)^{k-i}\left[1-P^*(e|x)\right]^{i+1}\right\} = f_k[P^*(e|x)],$$
where we have defined $f_k$ to show the dependency upon $k$. This is Eq. 54 in the text.
Now we denote by $C_k[\cdot]$ the smallest concave function greater than $f_k[\cdot]$, that is,
$$f_k[P^*(e|x)] \le C_k[P^*(e|x)].$$
Since this is true at any $x$, we can integrate and then apply Jensen's inequality, which states that under very broad conditions $E_x[u(x)]\le u[E(x)]$ for concave $u$. We find
$$P(e) \le \int C_k[P^*(e|x)]\,p(x)\,dx \le C_k\!\left[\int P^*(e|x)\,p(x)\,dx\right] = C_k[P^*],$$
where again $P^*$ is the Bayes error. In short, we see $P(e)\le C_k[P^*]$, and from above $f_k[P^*]\le C_k[P^*]$.
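The function $f_k$ is easy to evaluate directly; for $k=1$ it reduces to the familiar asymptotic nearest-neighbor rate $2p(1-p)$, and larger $k$ pulls it toward the Bayes rate. A minimal sketch:

```python
from math import comb

def f_k(p, k):
    """f_k(p) = sum_{i=0}^{(k-1)/2} C(k,i) [p^(i+1)(1-p)^(k-i) + p^(k-i)(1-p)^(i+1)]."""
    q = 1.0 - p
    return sum(comb(k, i) * (p ** (i + 1) * q ** (k - i) + p ** (k - i) * q ** (i + 1))
               for i in range((k - 1) // 2 + 1))

p = 0.2
assert abs(f_k(p, 1) - 2 * p * (1 - p)) < 1e-12     # k = 1 gives 2p(1-p)
assert f_k(p, 1) >= f_k(p, 3) >= f_k(p, 5) >= p     # decreasing toward the Bayes rate
```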
Section 4.6
19. We must show that the use of the distance measure
$$D(a,b) = \sqrt{\sum_{k=1}^{d}\alpha_k(a_k-b_k)^2}$$
for $\alpha_k>0$ generates a metric space. This happens if and only if $D(a,b)$ is a metric. Thus we first check whether $D$ obeys the properties of a metric. First, $D$ must be non-negative, that is, $D(a,b)\ge0$, which indeed holds because the sum of weighted squares is non-negative. Next we consider reflexivity, that is, $D(a,b)=0$ if and only if $a=b$. Clearly, $D=0$ if and only if each term in the sum vanishes, and thus indeed $a=b$; conversely, if $a=b$, then $D=0$. Next we consider symmetry, that is, $D(a,b)=D(b,a)$ for all $a$ and $b$. We can alter the order of the elements in the squared sum without affecting its value, that is, $(a_k-b_k)^2 = (b_k-a_k)^2$; thus symmetry holds. The last property is the triangle inequality, that is,
$$D(a,b)+D(b,c) \ge D(a,c)$$
for all $a$, $b$ and $c$. Note that $D$ can be written
$$D(a,b) = \|A(a-b)\|_2, \qquad A = \mathrm{diag}\!\left[\sqrt{\alpha_1},\dots,\sqrt{\alpha_d}\right].$$
We let $x = a-c$ and $y = c-b$, and then the triangle-inequality test can be written
$$\|Ax\|_2+\|Ay\|_2 \ge \|A(x+y)\|_2.$$
We denote $x' = Ax$ and $y' = Ay$, the feature vectors computed from the input space using the matrix $A$. Then the above inequality reads
$$\|x'\|_2+\|y'\|_2 \ge \|x'+y'\|_2.$$
We square the left-hand side and see
$$\left(\|x'\|_2+\|y'\|_2\right)^2 = \|x'\|_2^2+\|y'\|_2^2+2\|x'\|_2\|y'\|_2 \ \ge\ \|x'\|_2^2+\|y'\|_2^2+2\sum_{i=1}^{d}x_i'y_i' = \|x'+y'\|_2^2.$$
The above inequality is indeed fulfilled, since the Cauchy-Schwarz inequality states that
$$\|x'\|_2\,\|y'\|_2 \ge \sum_{i=1}^{d}x_i'y_i'.$$
If we work with metric spaces in the nearest-neighbor method, we can ensure the existence of best approximations in this space. In other words, there will always be a stored prototype $p$ of minimum distance from an input pattern $x^*$. Let
$$\delta = \min_{x\in S}D(p,x),$$
where $S$ is a compact subset of the metric space. (A subset $S$ is said to be compact if every sequence of points in $S$ has a subsequence that converges to a point in $S$.) Suppose we define a sequence of points $\{x_1,x_2,\dots,x_n\}$ in $S$ with the property
$$D(p,x_n)\to\delta \ \text{ as } n\to\infty,$$
where $x_n\to x^*$ using the compactness of $S$. By the triangle inequality, we have that
$$D(p,x_n)+D(x_n,x^*) \ge D(p,x^*).$$
The right-hand side of the inequality does not depend on $n$, and the left side approaches $\delta$ as $n\to\infty$. Thus we have $D(p,x^*)\le\delta$. Nevertheless, we also have $D(p,x^*)\ge\delta$ because $x^*$ belongs to $S$. Hence $D(p,x^*)=\delta$.
The import of this property for nearest-neighbor classifiers is that, given enough prototypes, we can find a prototype $p$ very close to an input pattern $x^*$. Then a good approximation of the posterior class probability of the input pattern can be achieved with that of the stored prototype; that is, $p\approx x^*$ implies
$$P(\omega_i|p) \simeq P(\omega_i|x^*).$$
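The four metric properties can be checked numerically on a few fixed vectors (the weights and points below are illustrative):

```python
import math

def weighted_dist(a, b, alpha):
    """D(a, b) = sqrt(sum_k alpha_k (a_k - b_k)^2) with all alpha_k > 0."""
    return math.sqrt(sum(w * (u - v) ** 2 for w, u, v in zip(alpha, a, b)))

alpha = (2.0, 0.5, 3.0)
pts = [(0.0, 1.0, 2.0), (1.0, -1.0, 0.5), (-2.0, 0.0, 1.0)]
for a in pts:
    for b in pts:
        d = weighted_dist(a, b, alpha)
        assert d >= 0.0                             # non-negativity
        assert (d == 0.0) == (a == b)               # reflexivity
        assert d == weighted_dist(b, a, alpha)      # symmetry
        for c in pts:                               # triangle inequality
            assert (weighted_dist(a, b, alpha) + weighted_dist(b, c, alpha)
                    >= weighted_dist(a, c, alpha) - 1e-12)
```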
20. We must prove that
$$L_k(a,b) = \left(\sum_{i=1}^{d}|a_i-b_i|^k\right)^{1/k} = \|a-b\|_k$$
obeys the triangle inequality; with $x = a-c$ and $y = c-b$, this is Minkowski's inequality,
$$\|x\|_k+\|y\|_k \ \ge\ \|x+y\|_k. \qquad (*)$$
Since $|x_i+y_i|\le|x_i|+|y_i|$, it suffices to compare the $k$th powers of the two sides. On the one hand, by the binomial theorem,
$$\left(\|x\|_k+\|y\|_k\right)^k = \sum_{i=1}^{d}|x_i|^k+\sum_{i=1}^{d}|y_i|^k+\sum_{j=1}^{k-1}\binom{k}{j}\|x\|_k^{k-j}\|y\|_k^{j}, \qquad (**)$$
while on the other hand
$$\sum_{i=1}^{d}\left(|x_i|+|y_i|\right)^k = \sum_{i=1}^{d}|x_i|^k+\sum_{i=1}^{d}|y_i|^k+\sum_{j=1}^{k-1}\binom{k}{j}\sum_{i=1}^{d}|x_i|^{k-j}|y_i|^{j}. \qquad (***)$$
Thus it is enough to show, for each cross term, that
$$\|x\|_k^{k-j}\,\|y\|_k^{j} \ \ge\ \sum_{i=1}^{d}|x_i|^{k-j}|y_i|^{j}, \qquad j=1,\dots,k-1,$$
which is Hölder's inequality. (Note also the elementary facts that $\left(\sum_i a_i\right)^q \ge \sum_i a_i^q$ for $a_i>0$ and $q\ge1$, and that $\left(\sum_i a_i\right)\left(\sum_i b_i\right) \ge \sum_i a_ib_i$ for $a_i,b_i>0$.) Comparing $(**)$ with $(***)$ term by term establishes $(*)$, and hence $L_k$ obeys the triangle inequality.

Each distance calculation requires operations of the form $\sum_i(x_i-\bar x_i)^2$; at $10^{-9}$ seconds per operation the total comes to $0.315\,k^2$ seconds. Suppose that $k=64$. Then a test sample will be classified in 21 minutes and 30 seconds.
26. Explore the effect of $r$ on the accuracy of nearest-neighbor search based on partial distance.
(a) We make some simplifications, specifically that points are uniformly distributed in a $d$-dimensional unit hypercube. The closest point to a given test point is, on average, enclosed in a small hypercube whose volume is $1/n$ times the volume of the full space. The length of a side of such a hypercube is thus $1/n^{1/d}$. Consider a single dimension: the probability that any point falls within the projection of the small hypercube is $1/n^{1/d}$. In $r\le d$ dimensions, the probability of falling inside the projection of the small hypercube onto the $r$-dimensional subspace is $1/n^{r/d}$, i.e., the product of $r$ independent events, each having probability $1/n^{1/d}$. So for each point the probability that it will not fall inside the volume in the $r$-dimensional space is $1-1/n^{r/d}$. Since we have $n$ independent such points, the probability that the partial distance in $r$ dimensions will give us the true nearest neighbor is
$$\Pr[\text{nearest neighbor in } d \text{ dim. is found in } r<d \text{ dim.}] = \left(1-\frac{1}{n^{r/d}}\right)^{n}.$$
[Graph: this probability for $n = 5$, 10 and 100.]
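This probability is easy to tabulate; a minimal sketch:

```python
def p_true_nn(n, r, d):
    """Probability (under the uniform model above) that a partial distance
    using only r of d coordinates still identifies the true nearest neighbor."""
    return (1.0 - n ** (-r / d)) ** n

# Using all coordinates (r = d) the probability is (1 - 1/n)^n, about e^-1:
assert abs(p_true_nn(1000, 10, 10) - (1 - 1 / 1000) ** 1000) < 1e-12
# Using fewer coordinates can only lower it:
assert p_true_nn(100, 2, 10) <= p_true_nn(100, 5, 10) <= p_true_nn(100, 10, 10)
```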
(b) The Bayes decision boundary in one dimension is clearly $x_1 = 0.5$. In two dimensions it is the line $x_2 = 1-x_1$. In three dimensions the boundary occurs when
$$x_1x_2x_3 = (1-x_1)(1-x_2)(1-x_3),$$
which implies
$$x_3 = \frac{(x_1-1)(x_2-1)}{1-x_1-x_2+2x_1x_2}.$$
[Figure: the decision regions $\mathcal R_1$ and $\mathcal R_2$ for $d = 1$, 2 and 3.]
The corresponding Bayes errors are
$$E_B = \frac12\int_0^{0.5}(1-x)\,dx = 0.1875 \qquad (d=1),$$
$$E_B = \frac12\int_{x_1=0}^{1}dx_1\int_0^{1-x_1}dx_2\,(1-x_1)(1-x_2) = 0.104167 \qquad (d=2),$$
$$E_B = \frac12\iint dx_1\,dx_2\int_0^{x_3^*(x_1,x_2)}dx_3\,(1-x_1)(1-x_2)(1-x_3) = 0.0551969 \qquad (d=3),$$
where $x_3^*(x_1,x_2)$ is the position of the decision boundary, found by solving $x_1x_2x_3 = (1-x_1)(1-x_2)(1-x_3)$ for $x_3$, that is,
$$x_3^*(x_1,x_2) = \frac{(x_1-1)(x_2-1)}{1-x_1-x_2+2x_1x_2}.$$
In $d$ dimensions the boundary is given by
$$\prod_{i=1}^{d}x_i = \prod_{i=1}^{d}(1-x_i),$$
or
$$x_d\prod_{i=1}^{d-1}x_i = (1-x_d)\prod_{i=1}^{d-1}(1-x_i),$$
and thus
$$x_d^*(x_1,\dots,x_{d-1}) = \frac{\prod\limits_{i=1}^{d-1}(1-x_i)}{\prod\limits_{j=1}^{d-1}(1-x_j)+\prod\limits_{k=1}^{d-1}x_k}.$$
The Bayes error is then the $d$-fold integral
$$E_B = \frac12\int_0^1dx_1\int_0^1dx_2\cdots\int_0^{x_d^*(x_1,\dots,x_{d-1})}dx_d\,\prod_{i=1}^{d}(1-x_i).$$
We first compute the integral over $x_d$, since the other variables enter only through $x_d^*$. This integral $I$ is
$$I = \int_0^{x_d^*}(1-x_d)\,dx_d = \left[x_d-\frac12x_d^2\right]_0^{x_d^*} = x_d^*-\frac12\left(x_d^*\right)^2.$$
Substituting $x_d^*$ from above, we have
$$I = \frac{\displaystyle\prod_{i=1}^{d-1}(1-x_i)\left[\prod_{j=1}^{d-1}(1-x_j)+\prod_{k=1}^{d-1}x_k\right]-\frac12\left[\prod_{i=1}^{d-1}(1-x_i)\right]^2}{\left[\displaystyle\prod_{j=1}^{d-1}(1-x_j)+\prod_{k=1}^{d-1}x_k\right]^2}.$$
27. The Tanimoto metric is defined by
$$D_{Tanimoto}(A,B) = \frac{n_a+n_b-2n_{ab}}{n_a+n_b-n_{ab}},$$
where $n_a$ and $n_b$ are the sizes of the two sets, and $n_{ab}$ is the number of elements common to both sets.
(a) We must check whether the four properties of a metric are always fulfilled. The first property is non-negativity, that is,
$$D_{Tanimoto}(A,B) = \frac{n_a+n_b-2n_{ab}}{n_a+n_b-n_{ab}} \ge 0.$$
Since the total number of elements in $A$ and $B$ is always greater than the number of their common elements, we can write
$$n_a+n_b-n_{ab} > 0.$$
Furthermore, the term $n_a+n_b-2n_{ab}$ gives the number of differing elements in $A$ and $B$, and thus
$$n_a+n_b-2n_{ab} \ge 0.$$
Consequently, $D_{Tanimoto}(A,B)\ge0$.
The second property, reflexivity, states that
$$D_{Tanimoto}(A,B) = 0 \ \text{ if and only if } \ A = B.$$
From the definition above, $D_{Tanimoto}(A,B)$ will be 0 if and only if $n_a+n_b-2n_{ab} = 0$. This numerator is the number of differing elements in $A$ and $B$, so it vanishes only when $A$ and $B$ have no differing elements, that is, when $A=B$. Conversely, if $A=B$, then $n_a+n_b-2n_{ab} = 0$, and hence $D_{Tanimoto}(A,B) = 0$.
The third property is symmetry, or
$$D_{Tanimoto}(A,B) = D_{Tanimoto}(B,A)$$
for all $A$ and $B$. Since the terms $n_a$ and $n_b$ appear in the numerator and denominator in the same way, only the term $n_{ab}$ could affect the fulfillment of this property. However, $n_{ab}$ and $n_{ba}$ give the same measure: the number of elements common to both $A$ and $B$. Thus the Tanimoto metric is indeed symmetric.
The final property is the triangle inequality:
$$D_{Tanimoto}(A,B)+D_{Tanimoto}(B,C) \ge D_{Tanimoto}(A,C).$$
We substitute the definition of the Tanimoto metric and must show
$$\frac{n_a+n_b-2n_{ab}}{n_a+n_b-n_{ab}}+\frac{n_b+n_c-2n_{bc}}{n_b+n_c-n_{bc}} \ \ge\ \frac{n_a+n_c-2n_{ac}}{n_a+n_c-n_{ac}}.$$
After some simple manipulation, the inequality can be expressed as
$$3BDE-2ADE-2BCE+ACE-2BDF+ADF+BCF \ \ge\ 0,$$
where
$$A = n_a+n_b, \quad B = n_{ab}, \quad C = n_b+n_c, \quad D = n_{bc}, \quad E = n_a+n_c, \quad F = n_{ac}.$$
We factor to find that the inequality holds if
$$D = n_{bc} = 0 \qquad\text{or}\qquad C(A-2B)+D(3B-2A) \ge 0.$$
This last condition can be rewritten, after some rearrangement, as
$$D_{Tanimoto}(A,B) \ \ge\ \frac{n_{bc}}{n_b+n_c-n_{bc}}.$$
On the other hand, if we take other common factors in the above expression, we have
$$B\left[E(3D-2C)+F(C-2D)\right]+A\left[E(C-2D)+DF\right] \ \ge\ 0.$$
The term $C-2D = n_b+n_c-2n_{bc}$ is the number of differing elements in the sets $B$ and $C$, so it is greater than or equal to zero. Accordingly, the triangle inequality can also be fulfilled if some of the additional conditions below are met:
$$3D-2C = 3n_{bc}-2(n_b+n_c) \ge 0$$
$$E = n_a+n_c = 0$$
$$E(3D-2C)+F(C-2D) \ge 0.$$
After some rearrangement, this last inequality can be written as
$$D_{Tanimoto}(B,C) \ \ge\ \frac{n_b+n_c}{n_{bc}-(n_b+n_c)}.$$
Since the denominator of the right-hand side is always negative, the inequality is met. Thus we can conclude that the Tanimoto metric also obeys the triangle inequality.
(b) Since the Tanimoto metric is symmetric, we consider here only 15 of the 30 possible ordered pairings.

    A          B            n_a   n_b   n_ab   D_Tanimoto(A,B)   rank
    pattern    pat           7     3     3          0.57           2
    pattern    pots          7     4     2          0.77           7
    pattern    stop          7     4     2          0.77           7
    pattern    taxonomy      7     8     3          0.75           6
    pattern    elementary    7    10     5          0.58           3
    pat        pots          3     4     2          0.6            4
    pat        stop          3     4     2          0.6            4
    pat        taxonomy      3     8     2          0.77           7
    pat        elementary    3    10     2          0.81           9
    pots       stop          4     4     4          0              1
    pots       taxonomy      4     8     2          0.8            8
    pots       elementary    4    10     1          0.92           9
    stop       taxonomy      4     8     2          0.8            8
    stop       elementary    4    10     1          0.92           9
    taxonomy   elementary    8    10     5          0.61           5
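The table can be regenerated mechanically; here, following the counts above, $n_a$ is taken as the word length and $n_{ab}$ as the number of distinct shared letters (a convention inferred from the entries, not stated in the text):

```python
def tanimoto(a, b):
    """Tanimoto distance between two words: n_a, n_b are word lengths,
    n_ab the number of distinct letters common to both."""
    na, nb = len(a), len(b)
    nab = len(set(a) & set(b))
    return (na + nb - 2 * nab) / (na + nb - nab)

assert round(tanimoto("pattern", "pat"), 2) == 0.57
assert tanimoto("pots", "stop") == 0.0           # anagrams are at distance zero
assert round(tanimoto("pots", "elementary"), 2) == 0.92
assert tanimoto("pattern", "pat") == tanimoto("pat", "pattern")   # symmetry
```

Note that anagrams such as "pots" and "stop" come out at distance zero even though the words differ, which is why the measure is applied to sets rather than to ordered strings.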
(c) Yes, the Tanimoto metric fulfills the triangle inequality for all of the sets given above.
Section 4.7
28. As shown in Fig. 23 in the text, there is a big difference between the categories of a feature and the (true) categories or classes for the patterns. In this problem there is a fuzzy feature that we can call "temperature" and a class named "hot." Given a pattern, e.g., a cup, we can observe a degree of membership of this object in the fuzzy feature "temperature," for instance temperature = warm. However, temperature = warm does not necessarily imply that the membership in the "hot" class is less than 1.0, since we could argue that the fuzzy feature "temperature" is not "very hot" but merely "warm." We could reasonably suppose that if the "hot" class has anything to do with temperature, then its membership function would take the fuzzy feature "temperature" into account. Accordingly, values of "temperature" such as "warm" or "hot" might be active to some degree for the class "hot," though the exact dependence between them is unknown to us without more information about the problem.
29. Consider fuzzy classification.
(a) We first fit the designer's subjective feelings into fuzzy features, as suggested by the table below.

    Feature     designer's feelings   fuzzy feature
    lightness   medium-light          light
                medium-dark           dark
                dark                  dark
    length      short                 short
                long                  large
If we use Min[a, b] as the conjunction rule between fuzzy sets a and b, then the discriminant functions can be written as:

d1 = Min[c(length, 13, 5), c(lightness, 70, 30)]
d2 = Min[c^(length, 5, 5), c^(lightness, 30, 30)]
d3 = Min[c(length, 13, 5), c^(lightness, 30, 30)],

where

c(x, mu, delta)  = 1 if x > mu;  1 + (x - mu)/delta if mu - delta <= x <= mu;  0 otherwise
c^(x, mu, delta) = 0 if x > mu + delta;  1 + (mu - x)/delta if mu <= x <= mu + delta;  1 otherwise.

(b) If every membership function is rescaled by the same constant alpha > 0, the discriminant functions become

d1' = Min[alpha c(length, 13, 5), alpha c(lightness, 70, 30)]
d2' = Min[alpha c^(length, 5, 5), alpha c^(lightness, 30, 30)]
d3' = Min[alpha c(length, 13, 5), alpha c^(lightness, 30, 30)].

Hence, the classification borders would be unaffected, since all the discriminant functions would be rescaled by the same constant alpha, and we would have

arg max(d1, d2, d3) = arg max(d1', d2', d3').
(c) Given the pattern x = (7.5, 60)^t and the above equations, the values of the discriminant functions can be computed as follows:

d1 = Min[c(7.5, 13, 5), c(60, 70, 30)]   = Min[0, 0.66] = 0
d2 = Min[c^(7.5, 5, 5), c^(60, 30, 30)]  = Min[0.5, 0]  = 0
d3 = Min[c(7.5, 13, 5), c^(60, 30, 30)]  = Min[0, 0]    = 0.

Since all the discriminant functions are equal to zero, the classifier cannot assign the input pattern to any of the existing classes.
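These membership functions and the Min conjunction are easy to check numerically. The Python sketch below assumes the ramp definitions of c and c^ given above:

```python
def c(x, mu, delta):
    # increasing ramp: 0 below mu - delta, 1 above mu
    if x > mu:
        return 1.0
    if x >= mu - delta:
        return 1.0 + (x - mu) / delta
    return 0.0

def c_hat(x, mu, delta):
    # decreasing ramp: 1 below mu, 0 above mu + delta
    if x > mu + delta:
        return 0.0
    if x >= mu:
        return 1.0 + (mu - x) / delta
    return 1.0

def discriminants(length, lightness):
    d1 = min(c(length, 13, 5), c(lightness, 70, 30))
    d2 = min(c_hat(length, 5, 5), c_hat(lightness, 30, 30))
    d3 = min(c(length, 13, 5), c_hat(lightness, 30, 30))
    return d1, d2, d3

print(discriminants(7.5, 60))  # (0.0, 0.0, 0.0): no class can claim the pattern
```

All three discriminants vanish at (7.5, 60), reproducing the dead zone found in part (c).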
(d) In this problem we deal with a handcrafted classifier. The designer has selected lightness and length as features that characterize the problem and has defined several membership functions for them without any theoretical or empirical justification. Moreover, the conjunction rule that fuses the membership functions of the features to define the true discriminant functions is also imposed without any principle. Consequently, we cannot know where the sources of error come from.
30. Problem not yet solved
Section 4.8
31. If all the radii have been reduced to a value less than lambda_m, Algorithm 4 in the text can be rewritten as:

Algorithm 0 (Modified RCE)
1  begin initialize j <- 0, n <- number of patterns, epsilon <- small parameter
2    do j <- j + 1
3      w_j <- x_j
4      x^ <- arg min_{x not in omega_i} D(x, x_j)
5      lambda_j <- D(x^, x_j) - epsilon
6      if x_j in omega_k then a_jk <- 1 for k = 1, . . . , c
7    until j = n
8  end
According to Algorithm 5 in the text, the decision boundaries of the RCE classifier depend on lambda_j: an input pattern x is assigned to a class if and only if all of its nearest prototypes belong to that same class, where these prototypes fulfill the condition D(x, w_j) <= lambda_j. Since lambda_j = D(x^, w_j), where x^ is the nearest training sample to w_j belonging to another class, we will have different class boundaries when we use different training points.
Section 4.9
32. We seek a Taylor series expansion for

p_n(x) = (1/(n h_n)) Sum_{i=1}^n phi((x - x_i)/h_n).

(a) If the Parzen window is of the form phi(x) ~ N(0, 1), or

phi(x) = (1/sqrt(2 pi)) e^{-x^2/2},

then

p_n(x) = (1/(n h_n)) Sum_{i=1}^n (1/sqrt(2 pi)) exp[-(1/2)((x - x_i)/h_n)^2]
       = (1/(sqrt(2 pi) h_n)) (1/n) Sum_{i=1}^n exp[-(1/2)(x^2/h_n^2 + x_i^2/h_n^2 - 2 x x_i/h_n^2)]
       = (e^{-x^2/(2 h_n^2)}/(sqrt(2 pi) h_n)) (1/n) Sum_{i=1}^n exp[-x_i^2/(2 h_n^2)] exp[x x_i/h_n^2].

We express this result in terms of the normalized variables u = x/h_n and u_i = x_i/h_n as

p_n(x) = (e^{-u^2/2}/(sqrt(2 pi) h_n)) (1/n) Sum_{i=1}^n e^{-u_i^2/2} e^{u u_i}.

Expanding the last exponential as e^{u u_i} = Sum_{j=0}^infty u^j u_i^j / j!, we find

p_n(x) = (e^{-u^2/2}/(sqrt(2 pi) h_n)) (1/n) Sum_{i=1}^n Sum_{j=0}^infty (u^j u_i^j / j!) e^{-u_i^2/2}
       = (e^{-u^2/2}/(sqrt(2 pi) h_n)) Sum_{j=0}^infty b_j u^j,

where

b_j = (1/j!) (1/n) Sum_{i=1}^n u_i^j e^{-u_i^2/2}.
(b) If the n samples are extremely tightly clustered about u = u_o (i.e., u_i ~ u_o) for i = 1, . . . , n, then the two-term approximation is

p_n2(x) = (e^{-u^2/2}/(sqrt(2 pi) h_n)) (b_o + b_1 u),

where

b_o = (1/n) Sum_{i=1}^n e^{-u_i^2/2} ~ e^{-u_o^2/2}

and

b_1 = (1/n) Sum_{i=1}^n u_i e^{-u_i^2/2} ~ u_o e^{-u_o^2/2}.

The peaks of p_n2(x) occur where the derivative vanishes:

d p_n2(x)/du = (e^{-u^2/2}/(sqrt(2 pi) h_n)) [-u(b_o + b_1 u) + b_1] = 0,

that is, where

-u(e^{-u_o^2/2} + u_o e^{-u_o^2/2} u) + u_o e^{-u_o^2/2} = 0,

or, dividing through by -u_o e^{-u_o^2/2},

u^2 + u/u_o - 1 = 0.

Thus the peaks of p_n2(x) are the solutions of u^2 + u/u_o - 1 = 0.
(c) We have from part (b) that the peak u is a solution of u^2 + u/u_o - 1 = 0, and thus

u = (-1/u_o +/- sqrt(1/u_o^2 + 4))/2.

For small u_o we use sqrt(1/u_o^2 + 4) = (1/u_o) sqrt(1 + 4 u_o^2) ~ (1/u_o)(1 + 2 u_o^2) to find

u ~ (-1/u_o + (1/u_o)(1 + 2 u_o^2))/2 = u_o.

For large u_o we use sqrt(1/u_o^2 + 4) = 2 sqrt(1/(4 u_o^2) + 1) ~ 2(1 + (1/2) * 1/(4 u_o^2)) to find

u ~ (-1/u_o + 2 + 1/(4 u_o^2))/2 ~ 1.

(d) Here

b_o ~ e^{-u_o^2/2}  and  b_1 ~ u_o e^{-u_o^2/2},

as is shown in the figure. (The curves have been rescaled so that their structure is apparent.) Indeed, as our calculation above showed, for small u_o, p_n2(u) peaks near u = 0, and for large u_o, it peaks near u = 1.
[[FIGURE: rescaled p_n2(u) versus u for u_o = 1 and for small u_o (e.g., u_o = 0.01)]]
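The peak positions claimed in part (c) can be confirmed numerically. The Python sketch below simply solves u^2 + u/u_o - 1 = 0 for representative values of u_o:

```python
import math

def peak(uo):
    # positive root of u**2 + u/uo - 1 = 0
    return (-1/uo + math.sqrt(1/uo**2 + 4)) / 2

# For small uo the peak is near uo (i.e., near 0)...
print(peak(0.01))
# ...and for large uo the peak approaches 1
print(peak(100.0))
```

The two printed roots sit close to 0.01 and 1 respectively, matching the small- and large-u_o limits derived above.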
Computer Exercises
Section 4.2
1. Computer exercise not yet solved
Section 4.3
2. Computer exercise not yet solved
Section 4.4
3. Computer exercise not yet solved
Section 4.5
4. xxx
MATLAB program

x = [0.58 0.27 0.0055 0.53 0.47 0.69 0.55 0.61 0.49 0.054];
mu = mean(x);
delta = std(x);
mu_arr = linspace(min(x), max(x), 100);
delta_arr = linspace(2*(min(x)-mu), 8*(max(x)-mu), 100);
[M, D] = meshgrid(mu_arr, delta_arr);
p_D_theta = zeros(size(M));
for i=1:size(M,2)
    for j=1:size(M,2)
        x_minus_mu = abs(x - M(i,j));
        if max(x_minus_mu) > D(i,j)
            p_D_theta(i,j) = 0;
        else
            a = (D(i,j) - x_minus_mu)/D(i,j)^2;
            p_D_theta(i,j) = prod(a);
        end
    end
end
p_D_theta = p_D_theta./(sum(p_D_theta(:))/...
    ((mu_arr(2) - mu_arr(1))*(delta_arr(2)-delta_arr(1))));
p_theta_D = p_D_theta;
ind = find(p_D_theta>0);
x_vect = linspace(min(x)-10*delta, max(x)+10*delta, 100);
post = zeros(size(x_vect));
for i=1:length(x_vect)
    for k=1:length(ind)
        if abs(x_vect(i) - M(ind(k))) < D(ind(k))
            p_x_theta = (D(ind(k)) - abs(x_vect(i) - M(ind(k))))./D(ind(k)).^2;
        else
            p_x_theta = 0;
        end
        post(i) = post(i) + p_x_theta*p_theta_D(ind(k));
    end
end
post = post./sum(post);
figure;
plot(x_vect, post, 'b');
hold on;
plot([mu mu], [min(post) max(post)], 'r');
plot(x, zeros(size(x)), '*m');
grid;
hold off;
[[FIGURE TO BE INSERTED]]
Chapter 5
Problem Solutions
Section 5.2
1. Consider the problem of linear discriminants with unimodal and multimodal distributions.
(a) The figure shows two bimodal distributions that obey reflective symmetry about a vertical axis. Thus their densities have the same value along this vertical line, and as a result this is the Bayesian decision boundary, giving minimum classification error.
[[FIGURE: two bimodal densities with decision regions R1 and R2 separated by the vertical axis of symmetry]]
(b) Of course, if two unimodal distributions overlap significantly, the Bayes error will be high. As shown in Figs. 2.14 and 2.5 in Chapter 2, the (unimodal) Gaussian case generally has quadratic (rather than linear) Bayes decision boundaries. Moreover, the two unimodal distributions shown in the figure below would be better classified using a curved boundary than the straight one shown.
(c) Suppose we have Gaussians of different variances, sigma_1^2 != sigma_2^2, and for definiteness (and without loss of generality) sigma_2 > sigma_1. Then at large distances from the means, the density P(omega_2|x) > P(omega_1|x). Moreover, the position of the mean of omega_1, that is, mu_1, will always be categorized as omega_1. No (single) straight line can separate the position of mu_1 from all distant points; thus, the optimal decision boundary cannot be a straight line. In fact, for the case shown, the optimal boundary is a circle.

[[FIGURE: decision region R1 is the interior of a circle about mu_1; region R2 lies outside]]
and

max_j g_j(x(0)) = g_i(x(0)) = w_i^t x(0) + w_i0.

We next determine the category of a point between x(0) and x(1) by finding the maximum discriminant function:

max_j [w_j^t (lambda x(1) + (1 - lambda) x(0)) + w_j0]
  = lambda max_j [w_j^t x(1) + w_j0] + (1 - lambda) max_j [w_j^t x(0) + w_j0]
  = lambda [w_i^t x(1) + w_i0] + (1 - lambda) [w_i^t x(0) + w_i0]
  = w_i^t [lambda x(1) + (1 - lambda) x(0)] + w_i0.

This shows that the point lambda x(1) + (1 - lambda) x(0) is in omega_i. Thus any point between x(0) and x(1) is also in category omega_i; and since this is true for any two points in omega_i, the region is convex. Furthermore, it is true for any category omega_j: all that is required in the proof is that x(1) and x(0) be in the category region of interest, omega_j. Thus every decision region is convex.
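The convexity argument can be illustrated concretely. The Python sketch below uses a made-up three-class linear machine (the weights are hypothetical, chosen only for the demonstration):

```python
# Hypothetical linear machine: g_i(x) = w_i . x + w_i0
W = [((2.0, 0.0), 0.0),    # class 0
     ((-1.0, 1.0), 0.5),   # class 1
     ((0.0, -1.0), -0.2)]  # class 2

def classify(x):
    scores = [w[0]*x[0] + w[1]*x[1] + w0 for (w, w0) in W]
    return max(range(len(W)), key=lambda i: scores[i])

x0, x1 = (3.0, 0.5), (4.0, -2.0)
assert classify(x0) == classify(x1)          # both points in the same region
# every convex combination stays in that region
same = all(classify((l*x1[0] + (1-l)*x0[0], l*x1[1] + (1-l)*x0[1]))
           == classify(x0) for l in [i/100 for i in range(101)])
print(same)  # True
```

Scanning lambda from 0 to 1 never leaves the region, as the proof requires.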
3. A counterexample will serve as proof that the decision regions need not be convex. In the example below, there are c = 5 classes and c(c-1)/2 = 10 hyperplanes separating the pairs of classes; here H_ij means the hyperplane separating class omega_i from class omega_j. To show the nonconvexity of the solution region for class omega_1, we take points A and B, as shown in the figure.
[[FIGURE: five class regions with the pairwise hyperplanes H12, H13, H14, H15, H23, H24, H25, H34, H35, H45; points A and B lie in the omega_1 region and point C lies between them]]
The table below summarizes the voting results. Each row lists a hyperplane H_ij, used to distinguish category omega_i from omega_j. Each entry under a point states the category decision according to that hyperplane. For instance, hyperplane H12 would place A in omega_2, point B in omega_1, and point C in omega_2. The bottom row shows the final voting result of the full classifier system. Thus point A gets three omega_1 votes, only one omega_2 vote, two omega_3 votes, and so forth. Points A and B are classified as category omega_1, while point C (on the line between A and B) is category omega_2. Thus the solution region for omega_1 is not convex.
    hyperplane     A    B    C
    H12            2    1    2
    H13            1    1    3
    H14            1    1    1
    H15            1    1    1
    H23            3    2    2
    H24            2    2    2
    H25            5    2    2
    H34            3    4    3
    H35            5    3    5
    H45            4    4    4
    Vote result:   1    1    2
4. To find the point on the hyperplane g(x) = w^t x + w_0 = 0 nearest to x_a, we minimize the objective

f(x, lambda) = ||x - x_a||^2 + 2 lambda [w^t x + w_0]
             = (x - x_a)^t (x - x_a) + 2 lambda (w^t x + w_0)
             = x^t x - 2 x^t x_a + x_a^t x_a + 2 lambda (x^t w + w_0).

Setting the derivatives to zero gives

x - x_a + lambda w = 0
w^t x + w_0 = 0,

and substituting x = x_a - lambda w into the constraint yields

w^t (x_a - lambda w) + w_0 = w^t x_a + w_0 - lambda w^t w = 0,

so that

lambda = (w^t x_a + w_0)/(w^t w)    (for w != 0).

The minimum distance from the hyperplane to x_a is therefore

||x - x_a|| = ||lambda w|| = |g(x_a)| / ||w||,

since lambda w = g(x_a) w / ||w||^2.

(b) From part (a) we have that the minimum distance from the hyperplane to the point x_a is attained uniquely at the point x = x_a - lambda w on the hyperplane. Thus the projection of x_a onto the hyperplane is

x_o = x_a - lambda w = x_a - (g(x_a)/||w||^2) w,

where

lambda = (w^t x_a + w_0)/(w^t w) = g(x_a)/||w||^2.
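Parts (a) and (b) in code form, as a small Python sketch with an arbitrary (made-up) plane and point:

```python
import math

def project(w, w0, xa):
    # distance |g(xa)|/||w|| and projection xo = xa - g(xa) w / ||w||^2
    g = sum(wi*xi for wi, xi in zip(w, xa)) + w0
    w2 = sum(wi*wi for wi in w)
    dist = abs(g) / math.sqrt(w2)
    xo = [xi - g*wi/w2 for wi, xi in zip(w, xa)]
    return dist, xo

w, w0 = [3.0, 4.0], -5.0           # plane 3*x1 + 4*x2 = 5
dist, xo = project(w, w0, [4.0, 6.0])
print(round(dist, 2))                            # 6.2
print(sum(wi*xi for wi, xi in zip(w, xo)) + w0)  # ~0: xo lies on the plane
```

Here g(x_a) = 31 and ||w|| = 5, so the distance is 31/5 = 6.2, and the projected point satisfies the plane equation to within roundoff.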
5. Consider the three-category linear machine with discriminant functions g_i(x) = w_i^t x + w_i0, for i = 1, 2, 3.

(a) The decision boundaries are obtained when w_i^t x = w_j^t x, or equivalently (w_i - w_j)^t x = 0. Thus, the relevant decision boundaries are the lines through the origin that are orthogonal to the vectors w_1 - w_2, w_1 - w_3 and w_2 - w_3, as illustrated by the lines marked omega_i - omega_j in the figure.
[[FIGURE: weight vectors w1, w2, w3 in the (x1, x2) plane and the three decision boundaries through the origin, each orthogonal to w_i - w_j]]
(b) If a constant vector c is added to the weight vectors w_1, w_2 and w_3, then the decision boundaries are given by [(w_i + c) - (w_j + c)]^t x = (w_i - w_j)^t x = 0, just as in part (a); thus the decision boundaries do not change. Of course, the triangle formed by joining the heads of w_1, w_2, and w_3 changes, as shown in the figure.
[[FIGURE: the same decision boundaries with the shifted weight vectors w1 + c, w2 + c, w3 + c]]
6. We first show that totally linearly separable samples must be linearly separable. For samples that are totally linearly separable, for each i there exists a hyperplane g_i(x) = 0 that separates the omega_i samples from the rest, that is,

g_i(x) >= 0  for x in omega_i
g_i(x) < 0   for x not in omega_i.

The set of discriminant functions g_i(x) forms a linear machine that classifies correctly; the samples are linearly separable and each category can be separated from each of the others.
We shall show by example that, conversely, linearly separable samples need not be totally linearly separable. Consider three categories in two dimensions specified by

omega_1 = {x : x1 < -1}
omega_2 = {x : -1 <= x1 < 1}
omega_3 = {x : x1 >= 1}.

Clearly, these categories are linearly separable by vertical lines, but not totally linearly separable since there can be no linear function g_2(x) that separates the category omega_2 samples from the rest.
7. Consider the four categories in two dimensions corresponding to the quadrants of the Euclidean plane:

omega_1 = {x : x1 > 0, x2 > 0}
omega_2 = {x : x1 > 0, x2 < 0}
omega_3 = {x : x1 < 0, x2 < 0}
omega_4 = {x : x1 < 0, x2 > 0}.

Clearly, these are pairwise linearly separable but not totally linearly separable.
8. We employ proof by contradiction. Suppose there were two distinct minimum-length solution vectors a_1 and a_2 with a_1^t y_i >= b and a_2^t y_i >= b for all i. Then necessarily we would have ||a_1|| = ||a_2|| (otherwise the longer of the two vectors would not be a minimum-length solution). Next consider the average vector a_o = (1/2)(a_1 + a_2). We note that

a_o^t y_i = (1/2)(a_1 + a_2)^t y_i = (1/2) a_1^t y_i + (1/2) a_2^t y_i >= b,

so a_o is also a solution vector. Now observe that

2||a_1||^2 + 2||a_2||^2 - ||a_1 + a_2||^2 = ||a_1||^2 + ||a_2||^2 - 2 a_1^t a_2 = ||a_1 - a_2||^2,

so that 4||a_o||^2 = ||a_1 + a_2||^2 < 2||a_1||^2 + 2||a_2||^2 = 4||a_1||^2 unless ||a_1 - a_2||^2 = 0, that is, unless a_1 = a_2. A strictly shorter solution vector would contradict the minimality of a_1 and a_2; hence the minimum-length solution vector is unique.
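A quick numeric illustration of the averaging argument, as a Python sketch (the sample vectors, margin, and the two equal-length "solutions" are made up):

```python
import math

def norm(v):
    return math.sqrt(sum(x*x for x in v))

ys = [[1.0, 0.2], [0.8, 1.0], [1.0, -0.1]]   # sample vectors
b = 0.5
a1, a2 = [2.0, 1.0], [1.0, 2.0]              # two distinct, equal-length solutions

assert all(sum(a*y for a, y in zip(a1, yi)) >= b for yi in ys)
assert all(sum(a*y for a, y in zip(a2, yi)) >= b for yi in ys)
assert abs(norm(a1) - norm(a2)) < 1e-12

ao = [(u + v) / 2 for u, v in zip(a1, a2)]   # the average is also a solution...
assert all(sum(a*y for a, y in zip(ao, yi)) >= b for yi in ys)
print(norm(ao) < norm(a1))                   # ...and strictly shorter: True
```

The strictly shorter average is exactly the contradiction used in the proof.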
and suppose the two sets S1 and S2 are linearly separable, so that w^t x_i + w_0 > 0 for every x_i in S1 and w^t y_j + w_0 < 0 for every y_j in S2. Suppose x lies in the convex hull of S1, that is,

x = Sum_{i=1}^n alpha_i x_i,

where the alpha_i's are nonnegative and sum to 1. The discriminant function evaluated at x is

g(x) = w^t x + w_0
     = w^t (Sum_{i=1}^n alpha_i x_i) + w_0
     = Sum_{i=1}^n alpha_i (w^t x_i + w_0)
     > 0,

where we used the fact that w^t x_i + w_0 > 0 for 1 <= i <= n and Sum_{i=1}^n alpha_i = 1.
Now let us assume that our point x is also in the convex hull of S2, or

x = Sum_{j=1}^m beta_j y_j,

where the beta_j's are nonnegative and sum to 1. We follow the approach immediately above and find

g(x) = w^t x + w_0
     = w^t (Sum_{j=1}^m beta_j y_j) + w_0
     = Sum_{j=1}^m beta_j (w^t y_j + w_0)
     < 0,

where the last step comes from the realization that g(y_j) = w^t y_j + w_0 < 0 for each y_j, since they are each in S2. Thus, we have a contradiction: g(x) > 0 and g(x) < 0, and hence clearly the intersection is empty. In short, two sets of vectors are either linearly separable or their convex hulls intersect, but not both.
10. Consider a piecewise linear machine.

(a) The discriminant functions have the form

g_i(x) = max_{j=1,...,n_i} g_ij(x),

where the component functions g_kj(x) = w_kj^t x + w_kj0 are linear. Our decision rule is to classify x in omega_i if

max_k max_{j=1,...,n_k} g_kj(x) = g_ij(x)  for some j, 1 <= j <= n_i.

Therefore, the piecewise linear machine can be viewed as a linear machine for classifying subclasses of patterns: x is assigned to the category whose component linear function attains the overall maximum.
(b) Consider the two categories in one dimension

omega_1 = {x : |x| > 1}
omega_2 = {x : |x| < 1},

which clearly are not linearly separable. However, a classifier that is a piecewise linear machine with the following discriminant functions

g_11(x) = -1 - x
g_12(x) = -1 + x
g_1(x)  = max_{j=1,2} g_1j(x) = |x| - 1
g_2(x)  = 0

classifies them correctly: g_1(x) > g_2(x) exactly when |x| > 1.

[[FIGURE: g_11(x), g_12(x) and g_2(x) versus x; g_1 exceeds g_2 only for |x| > 1]]
11. Let x = (x1, . . . , xd)^t be a binary vector and let b denote the number of nonzero components of x. Define the two categories

omega_1 = {x : b is odd}
omega_2 = {x : b is even}.

(a) Our goal is to show that omega_1 and omega_2 are not linearly separable if d > 1. We prove this by contradiction. Suppose that there exists a linear discriminant function

g(x) = w^t x + w_0

such that g(x) >= 0 for x in omega_1 and g(x) < 0 for x in omega_2.
Consider the unique point having b = 0, i.e., x = 0 = (0, 0, . . . , 0)^t. This pattern is in omega_2 and hence clearly g(0) = w^t 0 + w_0 = w_0 < 0. Next consider patterns in omega_1 having b = 1. Such patterns are of the general form

x = (0, . . . , 0, 1, 0, . . . , 0)^t,

and thus

g(x) = w_i + w_0 > 0

for any i = 1, . . . , d. Next consider patterns with b = 2, for instance

x = (0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . . , 0)^t.

Because such patterns are in omega_2 we have

g(x) = w_i + w_j + w_0 < 0    for i != j.
We summarize our results up to here as three equations:

w_0 < 0
w_i + w_0 > 0
w_i + w_j + w_0 < 0.

Adding the second equation for two different indices i and j gives

(w_i + w_0) + (w_j + w_0) > 0,

and since w_0 < 0 this implies

w_i + w_j + w_0 > w_i + w_j + 2 w_0 > 0,

which contradicts the third equation. Thus our premise is false, and we have proven that omega_1, omega_2 are not linearly separable if d > 1.
(b) Consider the discriminant functions

g_1(x) = max_j g_1j(x)  and  g_2(x) = max_j g_2j(x),

where the components of the weight vectors and the biases are

w_1j = j + 1/2,   w_1j0 = -j^2 - j - 1/4
w_2j = j,         w_2j0 = -j^2,

so that in terms of the number b of nonzero components of the binary x,

g_1j(x) = (j + 1/2) b - j^2 - j - 1/4
g_2j(x) = j b - j^2.
We now verify that these discriminant functions indeed solve the problem. Suppose x is in omega_1; then the number of nonzero components of x is b = 2m + 1, 0 <= m <= (d - 1)/2, for m an integer. The component discriminant functions are

g_1j(x) = (j + 1/2)(2m + 1) - j^2 - j - 1/4 = j(2m - j) + m + 1/4
g_2j(x) = j(2m + 1) - j^2 = j(2m + 1 - j).

Maximizing over j gives

max_j j(2m - j) = m^2           (at j = m)
max_j j(2m + 1 - j) = m(m + 1)  (at j = m or j = m + 1).

Thus if x is in omega_1 and the number of nonzero components of x is b = 2m + 1 (i.e., odd), we have

g_1(x) = m(m + 1) + 1/4 > m(m + 1) = g_2(x),
that is, g_1(x) > g_2(x), as desired. Conversely, if x is in omega_2, then the number of nonzero components of x is b = 2m, 0 <= m <= d/2, where m is an integer. Now the component discriminant functions are

g_1j(x) = (j + 1/2)(2m) - j^2 - j - 1/4 = j(2m - j - 1) + m - 1/4
g_2j(x) = j(2m) - j^2 = j(2m - j),

and maximizing over j gives

max_j g_1j(x) = m(m - 1) + m - 1/4 = m^2 - 1/4
max_j g_2j(x) = m^2,

or g_1(x) < g_2(x), and hence our piecewise linear machine solves the problem.
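These formulas can be checked by brute force. The Python sketch below evaluates both families of component functions over j = 0, . . . , d, which suffices for the maxima:

```python
def g1(b, d):
    # g_1j(b) = (j + 1/2) b - j^2 - j - 1/4, maximized over j
    return max((j + 0.5) * b - j*j - j - 0.25 for j in range(d + 1))

def g2(b, d):
    # g_2j(b) = j b - j^2, maximized over j
    return max(j * b - j*j for j in range(d + 1))

d = 10
ok = all((g1(b, d) > g2(b, d)) == (b % 2 == 1) for b in range(d + 1))
print(ok)  # True: g1 wins exactly when b is odd
```

For every count b of nonzero components, g_1 exceeds g_2 exactly when b is odd, confirming the piecewise linear machine solves the parity problem.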
Section 5.3

12. Consider the quadratic discriminant function given by Eq. 4 in the text:

g_1(x) - g_2(x) = g(x) = w_0 + Sum_{i=1}^d w_i x_i + Sum_{i=1}^d Sum_{j=1}^d w_ij x_i x_j,

which in vector and matrix form reads

g(x) = w_0 + w^t x + x^t W x.

Because W is symmetric (i.e., w_ij = w_ji) and taking the transpose of a scalar does not change its value, we can complete the square and write

g(x) = w~_0 + (x - m)^t W (x - m),    (*)

where m is chosen so that the coefficient of x vanishes:

0 = w^t + m^t W + (Wm)^t = w^t + 2 m^t W,

so that

m = -(1/2) W^{-1} w  and  w~_0 = w_0 - (1/4) w^t W^{-1} w.
The decision boundary g(x) = 0 is then given, in terms of x~ = x - m, by

x~^t W x~ = (1/4)(w^t W^{-1} w - 4 w_0),

or equivalently

x~^t W~ x~ = 1/4,  where  W~ = W / (w^t W^{-1} w - 4 w_0).

In the present case the matrix and vector are

        ( 1  2  0 )          ( 5 )
W   =   ( 2  5  1 ) ,   w =  ( 2 ) .
        ( 0  1  3 )          ( 3 )

Thus we have

             ( 14  -6   2 )
W^{-1} = 1/2 ( -6   3  -1 ) ,
             (  2  -1   1 )

and, with the given w_0, w^t W^{-1} w - 4 w_0 = 82.75, so that

                 ( 0.0121  0.0242  0      )
W~ = W/82.75  =  ( 0.0242  0.0604  0.0121 ) .
                 ( 0       0.0121  0.0363 )
(e) Our matrix and vector in this case are

        ( 1  2  3 )          ( 2 )
W   =   ( 2  0  4 ) ,   w =  ( 1 ) .
        ( 3  4  5 )          ( 3 )

Thus we have

          ( 0.3077  0.4231  0.1538 )
W^{-1} =  ( 0.4231  0.2692  0.0385 ) ,
          ( 0.1538  0.0385  0.0769 )

and, with the given w_0, w^t W^{-1} w - 4 w_0 = 2.2692, so that

                  ( 0.4407  0.8814  1.3220 )
W~ = W/2.2692  =  ( 0.8814  0       1.7627 ) .
                  ( 1.3220  1.7627  2.2034 )

The eigenvalues are {lambda_1, lambda_2, lambda_3} = {-0.6091, 2.1869, 3.3405}. Because of the negative eigenvalue, the decision boundary is a hyperhyperboloid.
Section 5.4
13. We use a second-order Taylor series expansion of the criterion function at the point a(k):

J(a) ~ J(a(k)) + grad J^t(a(k)) (a - a(k)) + (1/2)(a - a(k))^t H(a(k)) (a - a(k)),

where grad J(a(k)) is the gradient and H(a(k)) the Hessian matrix evaluated at the current point. The update rule is

a(k+1) = a(k) - eta(k) grad J(a(k)).

We use the expansion to write the criterion function as

J(a(k+1)) ~ J(a(k)) - eta(k) ||grad J(a(k))||^2 + (1/2) eta^2(k) grad J^t(a(k)) H(a(k)) grad J(a(k)).

We minimize this with respect to eta(k) and find

0 = -||grad J(a(k))||^2 + eta(k) grad J^t(a(k)) H(a(k)) grad J(a(k)),

which has solution

eta(k) = ||grad J(a(k))||^2 / (grad J^t(a(k)) H(a(k)) grad J(a(k))).

We must, however, verify that this is indeed a minimum, and that eta(k) > 0. We calculate the second derivative

d^2 J / d eta^2 = (grad J)^t H (grad J),

which is positive when H is positive definite, so the stationary point is a minimum and eta(k) > 0.
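For a quadratic criterion this optimal step can be computed exactly. The Python sketch below uses a made-up quadratic J(a) = (1/2) a^t H a - c^t a, whose gradient is H a - c:

```python
# J(a) = 0.5 a^t H a - c^t a, gradient g = H a - c
H = [[6.0, 1.0], [1.0, 35.0]]
c = [1.0, 2.0]

def matvec(M, v):
    return [sum(m*x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

a = [10.0, -10.0]
g = [x - y for x, y in zip(matvec(H, a), c)]
eta = dot(g, g) / dot(g, matvec(H, g))   # optimal step along -g

a_new = [x - eta*gi for x, gi in zip(a, g)]
g_new = [x - y for x, y in zip(matvec(H, a_new), c)]
print(abs(dot(g_new, g)) < 1e-6)  # True: new gradient orthogonal to the step
```

With an exact line search the new gradient is orthogonal to the previous one, the defining property of steepest descent with the optimal eta(k).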
Augmenting the samples with an extra dimension, and inverting the sign of the samples from omega_2, we collect the six sample vectors as the rows of Y, with margin vector b = (b, b, b, b, b, b)^t, with unknown weights a = (a_1, a_2, a_3)^t. Then the sum-of-squares error criterion is

J_s(a) = (1/2) Sum_{i=1}^n (a^t y_i - b)^2 = (Ya - b)^t (Ya - b) / 2.

(a) The gradient is

grad J_s(a) = Y^t (Ya - b) = ( 6a_1 - a_2 + 6a_3 ;  -a_1 + 35a_2 + 36a_3 + 3b ;  6a_1 + 36a_2 + 144a_3 - 16b ),

and the Hessian is

             (  6   -1    6 )
H = Y^t Y =  ( -1   35   36 ) .
             (  6   36  144 )

Notice that for this criterion, the Hessian is not a function of a, since we assume a quadratic form throughout.

(b) The optimal step size varies with a, that is,

eta = ||grad J_s||^2 / (grad J_s^t H grad J_s).

The range of optimal step sizes, however, will vary from the inverse of the largest eigenvalue of H to the inverse of the smallest eigenvalue of H. By solving the characteristic equation or using singular value decomposition, the eigenvalues of H are found to be 5.417, 24.57, and 155.0, so 0.006452 <= eta <= 0.1846.
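The eigenvalue computation can be verified without any linear-algebra library, by bisection on the characteristic polynomial (a Python sketch; the Hessian entries are those quoted above):

```python
H = [[6.0, -1.0, 6.0], [-1.0, 35.0, 36.0], [6.0, 36.0, 144.0]]

def char_poly(lam):
    # det(H - lam*I) for the 3x3 Hessian above
    m = [[H[i][j] - (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
            - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
            + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

def bisect(lo, hi, f):
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

eigs = [bisect(lo, hi, char_poly) for lo, hi in [(5, 6), (24, 25), (154, 156)]]
print([round(e, 2) for e in eigs])   # close to [5.42, 24.57, 155.0]
print(round(1/eigs[2], 6), round(1/eigs[0], 4))  # the step-size range endpoints
```

The reciprocals of the extreme eigenvalues reproduce the quoted range 0.006452 to 0.1846.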
Section 5.5
15. We consider the Perceptron procedure (Algorithm 5.3 and Theorem 5.1 in the text).

(a) Equation 20 in the text gives the update rule

a(1)      arbitrary
a(k+1) =  a(k) + y^k,   k >= 1.

Let a^ be a solution vector and define

beta^2 = max_i ||y_i||^2  and  gamma = min_i a^^t y_i > 0.

Then, following the proof of Theorem 5.1, for a scale factor alpha we have

||a(k+1) - alpha a^||^2 <= ||a(k) - alpha a^||^2 + beta^2 - 2 alpha gamma = ||a(k) - alpha a^||^2 - (2 alpha gamma - beta^2).

Because alpha > beta^2/gamma, we have the following inequality:

2 alpha gamma - beta^2 > 2 beta^2 - beta^2 = beta^2 > 0.

Thus ||a(k) - alpha a^||^2 decreases (strictly) by an amount at least (2 alpha gamma - beta^2) at each iteration, and this implies

||a(k+1) - alpha a^||^2 <= ||a(1) - alpha a^||^2 - k(2 alpha gamma - beta^2).

We denote the maximum number of corrections by k_o; since

0 <= ||a(k) - alpha a^||^2 <= ||a(1) - alpha a^||^2 - k(2 alpha gamma - beta^2),

we have

k_o = ||a(1) - alpha a^||^2 / (2 alpha gamma - beta^2).

(b) If we start out with a zero weight vector, that is, a(1) = 0, we have

k_o = ||a(1) - alpha a^||^2 / (2 alpha gamma - beta^2) = alpha^2 ||a^||^2 / (2 alpha gamma - beta^2).
We now minimize k_o(alpha) with respect to alpha. Setting the derivative to zero,

d k_o / d alpha = [2 alpha ||a^||^2 (2 alpha gamma - beta^2) - 2 gamma alpha^2 ||a^||^2] / (2 alpha gamma - beta^2)^2
               = 2 alpha ||a^||^2 (alpha gamma - beta^2) / (2 alpha gamma - beta^2)^2 = 0,

and this implies

alpha gamma - beta^2 = 0,  that is,  alpha = beta^2/gamma.

Substituting this value back yields

k_o = (beta^4/gamma^2) ||a^||^2 / (2 beta^2 - beta^2) = beta^2 ||a^||^2 / gamma^2.
(c) Now consider a variable increment eta_k with 0 < eta_a <= eta_k <= eta_b < infinity for k >= 1, with corrections whenever a^t(k) y^k <= b. For a scale factor alpha > 0 we find

||a(k+1) - alpha a^||^2 <= ||a(k) - alpha a^||^2 + eta_b^2 beta^2 + 2 eta_b b - 2 eta_a alpha gamma.

We choose the scale factor to be

alpha = (eta_b^2 beta^2 + 2 eta_b b) / (eta_a gamma)

and find

||a(k+1) - alpha a^||^2 <= ||a(k) - alpha a^||^2 - (eta_b^2 beta^2 + 2 eta_b b),

where eta_b^2 beta^2 + 2 eta_b b > 0 for b > 0. This means the difference between the weight vector at iteration k + 1 and alpha a^ obeys:

||a(k+1) - alpha a^||^2 <= ||a(1) - alpha a^||^2 - k(eta_b^2 beta^2 + 2 eta_b b).

Since a squared distance cannot become negative, it follows that the procedure must converge after at most k_o iterations, where

k_o = ||a(1) - alpha a^||^2 / (eta_b^2 beta^2 + 2 eta_b b).

In the case b < 0 there is no guarantee that alpha > 0 and thus there is no guarantee of convergence of the algorithm.
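The bound of part (b) is easy to exercise empirically. The Python sketch below runs the fixed-increment perceptron on a small made-up linearly separable set and compares the number of corrections against beta^2 ||a^||^2 / gamma^2, using the learned vector as the solution a^:

```python
# Fixed-increment perceptron on augmented, sign-normalized samples y
ys = [(1.0, 2.0, 1.0), (1.0, 1.5, 2.0),      # from class 1
      (-1.0, 1.0, -1.5), (-1.0, -2.0, 0.5)]  # from class 2 (negated)

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

a = [0.0, 0.0, 0.0]
corrections = 0
changed = True
while changed:
    changed = False
    for y in ys:
        if dot(a, y) <= 0:               # misclassified: correct
            a = [ai + yi for ai, yi in zip(a, y)]
            corrections += 1
            changed = True

assert all(dot(a, y) > 0 for y in ys)     # converged to a separating vector

a_hat = a                                  # use the found solution as a-hat
beta2 = max(dot(y, y) for y in ys)
gamma = min(dot(a_hat, y) for y in ys)
print(corrections <= beta2 * dot(a_hat, a_hat) / gamma**2)  # True
```

The observed correction count stays below the theoretical ceiling, as the theorem guarantees for any solution vector.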
16. Our algorithm is given by Eq. 27 in the text:

a(1)      arbitrary
a(k+1) =  a(k) + eta_k y^k,

where in this case a^t(k) y^k <= b for all k. The eta_k's are bounded by 0 < eta_a <= eta_k <= eta_b < infinity for k >= 1. We shall modify the proof in the text to prove convergence given the above conditions.

Let a^ be a solution vector. Then a^^t y_i > b for all i. For a scale factor alpha we have

a(k+1) - alpha a^ = (a(k) - alpha a^) + eta_k y^k,

and thus

||a(k+1) - alpha a^||^2 = ||(a(k) - alpha a^) + eta_k y^k||^2
  = ||a(k) - alpha a^||^2 + eta_k^2 ||y^k||^2 + 2 eta_k (a(k) - alpha a^)^t y^k
  = ||a(k) - alpha a^||^2 + eta_k^2 ||y^k||^2 + 2 eta_k a^t(k) y^k - 2 eta_k alpha a^^t y^k
  <= ||a(k) - alpha a^||^2 + eta_k^2 ||y^k||^2 + 2 eta_k b - 2 eta_k alpha a^^t y^k.

Defining

beta^2 = max_i ||y_i||^2  and  gamma = min_i [a^^t y_i] > b,   i = 1, . . . , n,

the argument of Problem 15 part (c) now applies and establishes convergence.
17. We seek a weight vector a such that

a^t y_i > 0  if y_i in omega_1
a^t y_i < 0  if y_i in omega_2.

We multiply y_i by -1 if y_i in omega_2, and thus there exists a vector w^ such that w^^t y_i > 0 for all i. We use the logic of the Perceptron convergence proof to find such a w^, which requires at most k_o corrections for convergence, where

k_o = beta^2 ||w^||^2 / gamma^2,   beta^2 = max_i ||y_i||^2,   gamma = min_i w^^t y_i > 0.

The remainder follows from the Perceptron convergence theorem (Theorem 5.1) in the text.
18. Our criterion function is

J_q(a) = Sum_{y in Y(a)} (a^t y - b)^2,

where Y(a) is the set of samples for which a^t y <= b. Suppose that Y(a(k)) = {y^1} contains a single sample. Then

J_q(a(k)) = (Sum_{j=1}^d a_kj y^1_j)^2 + b^2 - 2 b Sum_{j=1}^d a_kj y^1_j.

The partial derivatives are

d J_q(a(k)) / d a_ki = 2 (Sum_{j=1}^d a_kj y^1_j) y^1_i - 2 b y^1_i = 2 (a^t y^1 - b) y^1_i

for i = 1, . . . , d. Thus we have

grad J_q(a(k)) = 2 (a^t y^1 - b) y^1,

and the diagonal elements of the Hessian are

D_ii = d^2 J_q(a(k)) / d a_ki d a_ki = 2 y^1_i y^1_i.

The update rule is

a(1)      arbitrary
a(k+1) =  a(k) + eta_k Sum_{y in Y(a(k))} ((b - a^t(k) y)/||y||^2) y,

and since Y(a(k)) = {y^1}, this implies

a(k+1) = a(k) + eta_k ((b - a^t(k) y^1)/||y^1||^2) y^1.

We also have

a^t(k+1) y^1 - b = (1 - eta_k)(a^t(k) y^1 - b).

If we choose eta_k = 1, then a(k+1) is moved exactly to the hyperplane a^t y^1 = b. Thus, if we optimally choose eta_k = 1, we have Eq. 35 in the text:

a(k+1) = a(k) + ((b - a^t(k) y^1)/||y^1||^2) y^1.
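The single-sample relaxation step with eta = 1 indeed lands exactly on the hyperplane, as a small Python sketch confirms (the vectors and margin are made up):

```python
def relax_step(a, y, b, eta=1.0):
    # a(k+1) = a(k) + eta * (b - a.y)/||y||^2 * y
    ay = sum(ai*yi for ai, yi in zip(a, y))
    y2 = sum(yi*yi for yi in y)
    return [ai + eta * (b - ay) / y2 * yi for ai, yi in zip(a, y)]

a, y, b = [0.0, 1.0, -2.0], [1.0, 2.0, 2.0], 3.0
a_new = relax_step(a, y, b)           # one step with eta = 1
# after the step, a_new . y equals b (up to roundoff)
print(sum(ai*yi for ai, yi in zip(a_new, y)))
```

One step moves a^t y from -2 to exactly b = 3, matching the identity a^t(k+1)y - b = (1 - eta)(a^t(k)y - b) at eta = 1.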
19. We begin by following the central equation and proof of Perceptron convergence given in the text:

||a(k+1) - lambda a^||^2 = ||a(k) - lambda a^||^2 + 2 eta(k)(a(k) - lambda a^)^t y^k + eta^2(k) ||y^k||^2,

where y^k is the kth training presentation. An update occurs when y^k is misclassified, that is, when a^t(k) y^k <= b. We define beta^2 = max_i ||y_i||^2 and gamma = min_i [a^^t y_i] > b > 0. Then we can write

||a(k+1) - lambda a^||^2 <= ||a(k) - lambda a^||^2 + 2 eta(k)(b - lambda gamma) + eta^2(k) beta^2,

where lambda is a positive free parameter. Rather than keeping lambda fixed, we let it grow with the corrections, defining after the mth correction

lambda(m) = (1/gamma) [ b + (beta^2/2) (Sum_{k=1}^m eta^2(k)) / (Sum_{k=1}^m eta(k)) ].

Summing the inequality over the first m corrections, this choice makes the linear and quadratic terms cancel:

2 (b - lambda(m) gamma) Sum_{k=1}^m eta(k) + beta^2 Sum_{k=1}^m eta^2(k) = 0,

so that

||a(m+1) - lambda(m) a^||^2 <= ||a(0) - lambda(m) a^||^2.    (*)

We now use the hypotheses

lim_{m -> infinity} Sum_{k=1}^m eta(k) = infinity  and  lim_{m -> infinity} (Sum_{k=1}^m eta^2(k)) / (Sum_{k=1}^m eta(k))^2 = 0,

and we note that

(Sum_{k=1}^m eta(k))^2 = Sum_{k=1}^m eta^2(k) + 2 Sum_{k<l} eta(k) eta(l),

which implies

lim_{m -> infinity} 2 Sum_{k<l} eta(k) eta(l) / (Sum_{k=1}^m eta(k))^2 = 1.    (**)

But we know from (**) that the normalized cross-term goes to 1 and the coefficient of lambda^2 goes to 0, and all the eta(k) terms (and their sum) are positive. Moreover, the corrections would never cease as long as there are incorrectly classified samples, since Sum_{k=1}^m eta(k) -> infinity. But the distance term on the left of (*) cannot be negative, and thus we conclude that the corrections must cease. This implies that all the samples will be classified correctly.
Section 5.6
20. As shown in the figure, the initial weight vector is marked 0, and the successive updates are marked 1, 2, . . . , 12, at which point the updates cease. The sequence of presentations is y1, y2, y3, y1, y2, y3, y1, y3, y1, y3, y2, and y1.
Section 5.7
21. From Eq. 54 in the text we have the weight vector

w = alpha n S_w^{-1} (m_1 - m_2),

where w satisfies

[ (1/n) S_w + (n_1 n_2 / n^2)(m_1 - m_2)(m_1 - m_2)^t ] w = m_1 - m_2.
[[FIGURE: the sequence of weight vectors 0, 1, . . . , 12 in the (a1, a2) plane for Problem 20, with the solution region bounded by the hyperplanes a^t y_i = b at distances b/||y_1||, b/||y_2||, b/||y_3||]]
Substituting the first equation into the second, the scalar factor alpha obeys

alpha [ 1 + (n_1 n_2 / n^2)(m_1 - m_2)^t S_w^{-1} (m_1 - m_2) ] = 1.

This equation is valid for all vectors m_1 and m_2, and therefore we have

alpha = [ 1 + (n_1 n_2 / n^2)(m_1 - m_2)^t S_w^{-1} (m_1 - m_2) ]^{-1}.
22. We define the discriminant function

g_0(x) = (lambda_21 - lambda_11) P(omega_1|x) - (lambda_12 - lambda_22) P(omega_2|x).

Our goal is to show that the vector a that minimizes

J_s'(a) = Sum_{y in Y1} (a^t y - (lambda_21 - lambda_11))^2 + Sum_{y in Y2} (a^t y + (lambda_12 - lambda_22))^2

provides a minimum mean-square-error approximation to g_0. We write

J_s'(a) = n [ (n_1/n)(1/n_1) Sum_{y in Y1} (a^t y - (lambda_21 - lambda_11))^2 + (n_2/n)(1/n_2) Sum_{y in Y2} (a^t y + (lambda_12 - lambda_22))^2 ].

By the law of large numbers, as n -> infinity we have with probability 1

(1/n) J_s'(a) -> J'(a) = P(omega_1) E_1[(a^t y - (lambda_21 - lambda_11))^2] + P(omega_2) E_2[(a^t y + (lambda_12 - lambda_22))^2].

Since the term independent of a plays no role in the minimization, minimizing J'(a) with respect to a is equivalent to minimizing

epsilon^2 = Integral [a^t y - g_0(x)]^2 p(x) dx.

In summary, the vector a that minimizes J_s'(a) provides asymptotically a minimum mean-square error approximation to the Bayes discriminant function g_0(x).
23. Given Eqs. 66 and 67 in the text,

a(k+1) = a(k) + eta(k)[z_k - a^t(k) y^k] y^k
a^ = E[y y^t]^{-1} E[z y],

we need to show

lim_{k -> infinity} E[||a(k) - a^||^2] = 0

when Sum_k eta(k) = infinity and Sum_k eta^2(k) < infinity. Using Eq. 66 we expand ||a(k+1) - a^||^2 and take expectations:

E[||a(k+1) - a^||^2] = E[||a(k) - a^||^2] + 2 eta(k) E[(a(k) - a^)^t (z_k - a^t(k) y^k) y^k] + eta^2(k) E[||(z_k - a^t(k) y^k) y^k||^2]
                    = Gamma_1 + 2 eta(k) Gamma_3 + eta^2(k) Gamma_2.

Note that Gamma_1 >= 0 and Gamma_2 >= 0. If we can show that Gamma_3 < 0 for all k > M for some finite M, then we will have proved our needed result. This is because the update rule must then reduce E[||a(k+1) - a^||^2] for k > R for some finite R, since Sum eta^2(k) is finite whereas Sum eta(k) is infinite. We know that E[||a(k+1) - a^||^2] >= 0, and thus

lim_{k -> infinity} E[||a(k+1) - a^||^2]

must vanish.

Now we seek to show that indeed Gamma_3 < 0. We find

Gamma_3 = E[(a(k) - a^)^t (z_k - a^t(k) y^k) y^k]
        = E[z_k a^t(k) y^k] - E[(a^t(k) y^k)^2] - E[z_k a^^t y^k] + E[a^^t y^k a^t(k) y^k]
        = -E[(a^t(k) y^k)^2 - 2 a^t(k) y^k a^^t y^k + (a^^t y^k)^2]
        = -E[(a^t(k) y^k - a^^t y^k)^2] <= 0.

In the above, we used the fact that y^1, y^2, . . . , y^k determine a(k), so we can consider a^t(k) as a nonrandom quantity. Together with Eq. 67 this enables us to write E[z_k a^t(k) y^k] as E[a^t(k) y^k y^{k t} a^] = E[a^t(k) y^k a^^t y^k], and similarly E[z_k a^^t y^k] = E[(a^^t y^k)^2]. Thus we have proved our claim that Gamma_3 <= 0, and hence the full proof is complete.
24. Our criterion function here is

J_m(a) = E[(a^t y - z)^2].

(a) We expand this criterion function as

J_m(a) = E[(a^t y - z)^2]
       = E[{(a^t y - g_o(x)) - (z - g_o(x))}^2]
       = E[(a^t y - g_o(x))^2] - 2 E[(a^t y - g_o(x))(z - g_o(x))] + E[(z - g_o(x))^2].

The cross term vanishes, since E[z|x] = g_o(x) implies E[(a^t y - g_o(x))(z - g_o(x))] = E_x[(a^t y - g_o(x)) E[(z - g_o(x))|x]] = 0, and the last term does not depend on a. Thus the vector a that minimizes J_m also minimizes E[(a^t y - g_o(x))^2].
25. We are given that

eta_{k+1}^{-1} = eta_k^{-1} + y_k^2.

Unrolling the recursion gives

eta_{k+1}^{-1} = eta_{k-1}^{-1} + y_{k-1}^2 + y_k^2 = . . . = eta_1^{-1} + Sum_{i=1}^k y_i^2,

and thus (taking eta_1 = 1)

eta_k = 1 / (1 + Sum_{i=1}^{k-1} y_i^2).

If 0 < a <= y_i^2 <= b < infinity for all i, this implies

1/(1 + b(k-1)) <= eta_k <= 1/(1 + a(k-1)).

Then

Sum_k eta_k >= Sum_k 1/(1 + b(k-1)) = infinity,

as Sum_k 1/k = infinity and b > 0. Moreover, we have

Sum_k eta_k^2 <= Sum_k 1/[1 + a(k-1)]^2 < infinity,

as Sum_k 1/k^2 < infinity and a > 0. Thus these learning rates satisfy the convergence conditions Sum eta_k = infinity and Sum eta_k^2 < infinity.
26. We define the matrix and vector

Y = ( y_1^t ; y_2^t ; . . . ; y_n^t )   and   b = ( b_1, b_2, . . . , b_n )^t,

and use the many-sample update rule

a(1)      arbitrary
a(k+1) =  a(k) + eta(k) Sum_{i=1}^n y_i (b_i - a^t(k) y_i).

Note that the condition we are looking for on the limiting vector a^ now reads

Sum_{i=1}^n y_i (y_i^t a^ - b_i) = 0,

or equivalently

Sum_{i=1}^n y_i (a^^t y_i) = Sum_{i=1}^n y_i b_i.
For eta(k) = eta(1)/k we then have

||a(k+1) - a^||^2 = ||a(k) - a^||^2 + (eta(1)/k)^2 ||Sum_{i=1}^n y_i (b_i - a^t(k) y_i)||^2 + 2 (eta(1)/k)(a(k) - a^)^t Sum_{i=1}^n y_i (b_i - a^t(k) y_i)
                = ||a(k) - a^||^2 + (eta^2(1)/k^2) C_k + (2 eta(1)/k) D_k,

where

C_k = ||Sum_{i=1}^n y_i (b_i - a^t(k) y_i)||^2   and   D_k = (a(k) - a^)^t Sum_{i=1}^n y_i (b_i - a^t(k) y_i).

We substitute Sum_i y_i b_i = Sum_i y_i (a^^t y_i) into D_k, writing b_i - a^t(k) y_i = (b_i - a^^t y_i) - (a^t(k) y_i - a^^t y_i); the first part contributes (a(k) - a^)^t Sum_i y_i (b_i - a^^t y_i) = 0, and thus

D_k = - Sum_{i=1}^n (a^t(k) y_i - a^^t y_i)^2 <= 0.

Summing from k = 1 to m gives

||a(m+1) - a^||^2 = ||a(1) - a^||^2 + eta^2(1) Sum_{k=1}^m C_k/k^2 + 2 eta(1) Sum_{k=1}^m D_k/k.

The C_k are bounded, so

Sum_{k=1}^infinity C_k / k^2 < infinity,

whereas Sum_k 1/k diverges. Since the left-hand side is nonnegative, D_k cannot remain bounded away from zero (except on finitely many occasions); otherwise the right-hand side of the equation would become negative while the left-hand side stays nonnegative. Thus D_k must be arbitrarily close to zero for all k > N for some choice of N. This tells us that

lim_{k -> infinity} a^t(k) y_i = a^^t y_i   for i = 1, 2, . . . , n.

To see this, we start with an arbitrarily small positive epsilon. Then we know |D_k| < epsilon for all k > N for some integer N. This implies

|D_k| = Sum_{i=1}^n (a^t(k) y_i - a^^t y_i)^2 < epsilon,

and in particular |a^t(k) y_i - a^^t y_i| < sqrt(epsilon) for each i. The correction term then obeys

v(k) = Sum_{i=1}^n y_i (b_i - [a^^t y_i + delta_i])
     = Sum_{i=1}^n y_i (b_i - a^^t y_i) - Sum_{i=1}^n y_i delta_i
     = - Sum_{i=1}^n y_i delta_i,

where delta_i = a^t(k) y_i - a^^t y_i and we used the condition on a^ above. Since epsilon was arbitrarily close to zero and |delta_i| <= sqrt(epsilon), we have

lim_{k -> infinity} v(k) = 0,

so the corrections vanish and a(k) converges to the limiting vector a^.
27. The sample points are shown in the figure.

[[FIGURE: the six sample points from omega_1 and omega_2 in the (x1, x2) plane]]
(a) By inspecting the plot of the points, we see that omega_1 and omega_2 are not linearly separable, that is, there is no line that can separate all the omega_1 points from all the omega_2 points.

(b) Equation 85 in the text shows how much the value ||e||^2 is reduced from step k to step k + 1. For faster convergence we want the maximum possible reduction. We can find an expression for the learning rate eta by maximizing the right-hand side of Eq. 85,

||e(k)||^2 - ||e(k+1)||^2 = eta (1 - eta) ||e+(k)||^2 + eta^2 e+^t(k) Y Y^t e+(k).    (*)

We take the derivative with respect to eta of the right-hand side, set it to zero and solve:

||e+(k)||^2 - 2 eta ||e+(k)||^2 + 2 eta e+^t(k) Y Y^t e+(k) = 0,

which gives the optimal learning rate

eta_opt(k) = ||e+(k)||^2 / ( 2 [ ||e+(k)||^2 - e+^t(k) Y Y^t e+(k) ] ).

The maximum eigenvalue of Y^t Y is lambda_max = 66.4787. Since all values of eta between 0 and 2/lambda_max ensure convergence, to take the largest step in each iteration we should choose eta as large as possible. That is, since 0 < eta < 2/66.4787, we can choose eta_opt = 0.0301 - epsilon, where epsilon is an arbitrarily small positive constant. However, (*) is not suitable for fixed-learning-rate Ho-Kashyap algorithms. To get a fixed optimal eta, we proceed as discussed on pages 254-255 in the text: If
we write the samples in augmented, sign-normalized form as the rows of Y, then

           (  6   -6    4 )
Y^t Y  =   ( -6   44   10 ) .
           (  4   10   62 )

The eigenvalues of Y^t Y are {4.5331, 40.9882, 66.4787}.
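The quoted eigenvalues, and the resulting bound on the fixed learning rate, can be checked by bisection on the characteristic polynomial (a Python sketch using the Y^t Y entries above):

```python
H = [[6.0, -6.0, 4.0], [-6.0, 44.0, 10.0], [4.0, 10.0, 62.0]]  # Y^t Y

def char_poly(lam):
    # det(H - lam*I) for the 3x3 matrix above
    m = [[H[i][j] - (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
    return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
            - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
            + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))

def bisect(lo, hi, f):
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

eigs = [bisect(lo, hi, char_poly) for lo, hi in [(4, 5), (40, 42), (66, 67)]]
print([round(e, 4) for e in eigs])   # close to [4.5331, 40.9882, 66.4787]
print(round(2 / eigs[2], 4))         # close to 0.0301: upper limit for eta
```

The largest eigenvalue reproduces the convergence limit 2/66.4787 = 0.0301 used in part (b).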
Section 5.10

28. The linear programming problem on pages 256-257 in the text involved finding

min {t : t >= 0, a^t y_i + t > b_i for all i}.

Our goal here is to show that the resulting weight vector a minimizes the criterion function

J_t(a) = max_{i : a^t y_i <= b_i} (b_i - a^t y_i).

There are two cases to consider: linearly separable and not linearly separable.

Case 1: Suppose the samples are linearly separable. Then there exists a vector, say a_o, such that

a_o^t y_i >= b_i   for all i.

Then clearly a_o^t y_i + t > b_i for any t > 0 and for all i. Thus we have

0 <= min_{t,a} {t : t >= 0, a^t y_i + t > b_i} <= min_t {t : t >= 0, a_o^t y_i + t > b_i} = 0.

Therefore, we have

min_{t,a} {t : t >= 0, a^t y_i + t > b_i} = 0,

and the resulting weight vector is a_o. The fact that J_t(a) >= 0 for all a and J_t(a_o) = 0 implies

arg min_a J_t(a) = a_o.

This proves that the a that minimizes J_t(a) is the same as the one found by solving the modified variable problem.

Case 2: If the samples are not linearly separable, then there is no vector a_o such that a_o^t y_i >= b_i for all i. This means the minimal t is positive, and

min_{t,a} {t : t >= 0, a^t y_i + t > b_i} = min_{t,a} {t : t >= 0, t > b_i - a^t y_i for all i}
  = min_{t,a} {t : t > max_{i : a^t y_i <= b_i} (b_i - a^t y_i)}
  = min_a max_{i : a^t y_i <= b_i} (b_i - a^t y_i)
  = min_a J_t(a).
Section 5.11

29. Given n patterns x^k in d-dimensional space, we associate with each pattern a value z_k, where

z_k = 1   if x^k in omega_1
z_k = -1  if x^k in omega_2.

We also define a mapping phi: R^d -> R^{d+1} as

phi(x) = (x, z_k),

where

x^k = arg min_{x^k} ||x^k - x||,

that is, phi appends to x the class label of the nearest training pattern.
1 :
(1, 2, 2, 2, 1, 1)t , (1, 2, 2, 2, 1, 1)t
2 :
(1, 2, 2, 2, 1, 1)t , (1, 2, 2, 2, 1, 1)t
[[FIGURE: the mapped patterns and the margins g = +1 and g = -1 in several transformed feature spaces, with axes such as z1 = 2 - x1, z2 = x1^2, z2 = 2 - x2, z2 = x2^2, and z1 = 2 - x1 x2]]
Note that the algorithm picks the worst classified patterns from each class and adjusts a such that the hyperplane moves toward the center of the worst patterns and rotates so that the angle between the hyperplane and the vector connecting the worst points increases. Once the hyperplane separates the classes, all the updates will involve support vectors, since the if statements can only pick the vectors with the smallest a^t y_i.
32. Consider Support Vector Machines for classification.

(a) We are given the following six points in two categories:

omega_1: x_1 = (1, 1)^t, x_2 = (2, 2)^t, x_3 = (2, 0)^t
omega_2: x_4 = (0, 0)^t, x_5 = (1, 0)^t, x_6 = (0, 1)^t.

[[FIGURE: the six points, the maximum-margin hyperplane y1 + y2 = 3/2, and margin sqrt(2)/4]]
The dual functional to be maximized is

  L(α) = Σ_{k=1}^n α_k − (1/2) Σ_{k,j} α_k α_j z_k z_j y_j^t y_k

subject to the constraint

  Σ_{k=1}^n z_k α_k = 0.

Setting the derivatives ∂L/∂α_i = 0 subject to this constraint yields a set of linear equations in the α_k. Unfortunately, this is an inconsistent set of equations. Therefore, the maximum must be achieved on the boundary (where some α_i vanish). We try each α_i = 0 in turn and solve

  ∂L(0, α_2, α_3, α_4, α_5)/∂α_i = 0

and so on for the remaining multipliers.
Solving these reduced systems, the nonzero multipliers give, through a = Σ_k α_k z_k y_k, the unaugmented weight vector a_r = (2, 2)^t. The bias follows from a support vector: a^t y_k z_k = 1 must hold for each support vector, and picking y_1 = (1, 1, 1)^t gives

  a^t y_1 = a_0 + 4 = 1.

This, then, means a_0 = −3, and the full weight vector is a = (−3, 2, 2)^t.
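As a quick numerical check of this solution, the sketch below (Python; array names are mine, and only the ω_1 points given in the problem statement are used) evaluates the augmented discriminant g(x) = a^t (1, x_1, x_2)^t with a = (−3, 2, 2)^t. Every ω_1 point should satisfy g ≥ 1, with equality for support vectors.

```python
import numpy as np

# Weight vector a = (a0, a1, a2)^t from the solution above.
a = np.array([-3.0, 2.0, 2.0])

# The omega_1 points given in the problem statement.
omega1 = [(1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]

# Evaluate the augmented discriminant g(x) = a^t (1, x1, x2)^t.
margins = [float(a @ np.array([1.0, x1, x2])) for (x1, x2) in omega1]
print(margins)  # support vectors x_1 and x_3 lie exactly on the margin g = 1
```

The values 1 and 5 confirm that x_1 and x_3 are support vectors while x_2 lies well inside its half-space.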
33. Consider the Kuhn-Tucker theorem and the conversion of a constrained optimization problem for support vector machines to an unconstrained one.

(a) Here the relevant functional is given by Eq. 108 in the text, that is,

  L(a, α) = (1/2)||a||^2 − Σ_{k=1}^n α_k [z_k a^t y_k − 1].

We seek to maximize L with respect to α to guarantee that all the patterns are correctly classified, that is z_k a^t y_k ≥ 1, and we want to minimize L with respect to the (unaugmented) a. This will give us the optimal hyperplane. This solution corresponds to a saddle point in α-a space, as shown in the figure.
[Figure: the surface L(a, α) with its saddle point.]
(b) We write the augmented vector a as (a_0 a_r^t)^t, where a_0 is the augmented bias. Then we have

  L(a_r, α, a_0) = (1/2)||a_r||^2 − Σ_{k=1}^n α_k [z_k a_r^t y_k + z_k a_0 − 1].

At the saddle point, ∂L/∂a_0 = 0 and ∂L/∂a_r = 0. The first of these derivatives vanishing implies

  Σ_{k=1}^n α_k z_k = 0,

and the second implies

  a_r = Σ_{k=1}^n α_k z_k y_k.

Since Σ_{k=1}^n α_k z_k = 0, the same form holds for the full augmented vector:

  a = Σ_{k=1}^n α_k z_k y_k.
Substituting a_r back into L eliminates the weight vector:

  L = (1/2) || Σ_{k=1}^n α_k z_k y_k ||^2 − Σ_{k=1}^n α_k [ z_k ( Σ_{l=1}^n α_l z_l y_l )^t y_k − 1 ]
    = (1/2) Σ_{k,l} α_k α_l z_k z_l y_k^t y_l − Σ_{k,l} α_k α_l z_k z_l y_k^t y_l + Σ_{k=1}^n α_k.

Thus we have

  L = Σ_{k=1}^n α_k − (1/2) Σ_{k,l} α_k α_l z_k z_l y_k^t y_l.
y1 = (1 2 5 2 5 2 1 25)t , y2 = (1 2 2 4 2 8 2 4 16)t , z1 = z2 = 1
y3 = (1 2 3 2 6 2 4 9)t , y4 = (1 2 2 5 2 5 2 1 25)t , z3 = z4 = +1
We seek the optimal hyperplane, and thus want to maximize the functional given by Eq. 109 in the text,

  L(α) = α_1 + α_2 + α_3 + α_4 − (1/2) Σ_{k,l} α_l α_k z_k z_l y_k^t y_l,

subject to the constraint Σ_{k=1}^4 z_k α_k = 0.
Note that this process cannot determine the bias term a_0 directly; we can use a support vector for this in the following way: We note that a^t y_k z_k = 1 must hold for each support vector. We pick y_1 and then

  (a_0, 0.0194, 0.0496, −0.145, 0.0177, −0.1413) y_1 z_1 = 1,

which gives a_0 = 3.1614, and thus the full weight vector is a = (3.1614, 0.0194, 0.0496, −0.145, 0.0177, −0.1413)^t.

Now we plot the discriminant function in x_1-x_2 space:

  g(x_1, x_2) = a^t (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2)^t
             = 0.0272 x_1 + 0.0699 x_2 − 0.2054 x_1 x_2 + 0.1776 x_1^2 − 0.1415 x_2^2 + 3.17.

The figure shows the hyperbola corresponding to g(x_1, x_2) = 0 as well as the margins, g(x_1, x_2) = ±1, along with the three support vectors y_2, y_3 and y_4.
[Figure: the hyperbola g(x_1, x_2) = 0 in the x_1-x_2 plane, with the margins g(x_1, x_2) = ±1 and the points y_1, y_2, y_3, y_4.]
Section 5.12

35. Consider the LMS algorithm and Eq. 61 in the text.

(a) The LMS algorithm minimizes J_s(a) = ||Ya − b||^2, and here we are given the problem in block form,

  J_s(a) = || ( 1_n  Y_1 ; −1_n  −Y_1 ) (a_0 ; a_r) − (b_1 ; b_2) ||^2.

We assume we have 2n data points. Moreover, we let b_i = (b_{1i}, b_{2i}, ..., b_{ni})^t for i = 1, 2 be arbitrary margin vectors of size n each; Y_1 is an n-by-2 matrix containing the data points as its rows, and a = (a_0, a_1, a_2)^t = (a_0, a_r^t)^t.
Then we have

  J_s(a) = ||a_0 1_n + Y_1 a_r − b_1||^2 + ||a_0 1_n + Y_1 a_r + b_2||^2
         = Σ_{i=1}^n (a_0 + a_r^t y_i − b_{i1})^2 + Σ_{i=1}^n (a_0 + a_r^t y_i + b_{i2})^2.
[Figure: the two hyperplanes in the x_1-x_2 plane, with the lines x_2 = √2 x_1 and x_2 = −√2 x_1 and the vectors y_1 and y_2.]
Some of the advantages are that the algorithm is simple and admits a simple parallel implementation. Some of the disadvantages are that there may be numerical instability and that the search is not particularly efficient.
37. Equation 122 in the text gives the Bayes discriminant function as

  g_{0i} = Σ_{j=1}^c λ_{ij} P(ω_j | x).

The definition of λ_{ij} from Eq. 121 in the text ensures that for a given i only one λ_{ij} will be nonzero. Thus we have g_{0i} = P(ω_i | x). We apply Bayes' rule to find

  p(x) g_{0i}(x) = p(x | ω_i) P(ω_i).   (∗)
On the other hand, analogous to Eq. 58 in the text, the criterion function J_{s1i} can be written as

  J_{s1i}(a) = Σ_{y∈Y_i} (a_i^t y − 1)^2 + Σ_{y∉Y_i} (a_i^t y)^2
            = n [ (n_1/n)(1/n_1) Σ_{y∈Y_i} (a_i^t y − 1)^2 + (n_2/n)(1/n_2) Σ_{y∉Y_i} (a_i^t y)^2 ].

By the law of large numbers, as n → ∞, we have that (1/n) J_{s1i}(a_i) approaches J_i(a_i) with probability 1, where J_i(a_i) is given by

  J_i(a_i) = P(ω_i) E_1[(a_i^t y − 1)^2] + P(NOT ω_i) E_2[(a_i^t y)^2],

where

  E_1[(a_i^t y − 1)^2] = ∫ (a_i^t y − 1)^2 p(x, ω_i) dx,
  E_2[(a_i^t y)^2]     = ∫ (a_i^t y)^2 p(x, NOT ω_i) dx.

Thus we have

  J_i(a_i) = ∫ (a_i^t y − 1)^2 p(x, ω_i) dx + ∫ (a_i^t y)^2 p(x, NOT ω_i) dx.
Expanding and using (∗), this becomes

  J_i(a_i) = ∫ [a_i^t y − g_{0i}(x)]^2 p(x) dx + ∫ [g_{0i}(x) − g_{0i}^2(x)] p(x) dx,

where the second term in the sum is independent of the weight vector a_i. Hence the a_i that minimizes J_{s1i} also minimizes the mean-squared approximation error ε_i^2 = ∫ [a_i^t y − g_{0i}(x)]^2 p(x) dx. Thus the MSE solution A = Y†B (where A = [a_1 a_2 · · · a_c], Y = [Y_1 Y_2 · · · Y_c]^t, and B is defined by Eq. 119 in the text) for the multiclass case yields discriminant functions a_i^t y that provide a minimum-mean-squared-error approximation to the Bayes discriminant functions g_{0i}.
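A minimal numerical sketch of this result (Python; the Gaussian class-conditionals, sample sizes, and variable names are illustrative assumptions, not from the text): solving the MSE problem by least squares with targets 1 for ω_1 and 0 otherwise yields a linear function a^t y that tracks the posterior P(ω_1 | x), so it is larger on the ω_1 side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D Gaussian classes (illustrative): omega_1 ~ N(-1, 1), omega_2 ~ N(+1, 1).
x1 = rng.normal(-1.0, 1.0, 500)
x2 = rng.normal(+1.0, 1.0, 500)

# Augmented feature vectors y = (1, x)^t; MSE targets 1 for omega_1, 0 for omega_2.
Y = np.column_stack([np.ones(1000), np.concatenate([x1, x2])])
b = np.concatenate([np.ones(500), np.zeros(500)])

# MSE solution a = Y_dagger b via least squares.
a, *_ = np.linalg.lstsq(Y, b, rcond=None)

# a^t y approximates P(omega_1 | x): larger where omega_1 is more probable.
g = lambda x: a @ np.array([1.0, x])
print(g(-2.0), g(2.0))
```

The fitted slope is negative, so the approximate posterior decreases toward the ω_2 side, as it should.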
38. We are given the multicategory classification problem with sets Y_1, Y_2, ..., Y_c to be classified as ω_1, ω_2, ..., ω_c, respectively. The aim is to find a_1, a_2, ..., a_c such that if y ∈ Y_i, then a_i^t y ≥ a_j^t y for all j ≠ i. We transform this problem into a binary classification problem using Kesler's construction. We define G = G_1 ∪ G_2 ∪ · · · ∪ G_c, where

  G_i = {η_ij : y ∈ Y_i, j ≠ i}

and η_ij is as given in Eq. 115 in the text. Moreover we define

  α = (a_1^t a_2^t · · · a_c^t)^t.
Perceptron case: We wish to show what the fixed-increment single-sample Perceptron algorithm given in Eq. 20 in the text does to our transformed problem. We rewrite Eq. 20 as

  α(1) arbitrary
  α(k+1) = α(k) + g^k,

where g^k ∈ G is a misclassified sample, that is, α^t(k) g^k ≤ 0. Decomposing α into its subcomponents a_s, this is equivalent to the two-category rule

  a_i(k+1) = a_i(k) + y^k
  a_j(k+1) = a_j(k) − y^k.

Relaxation case: Here the update rule is

  α(1) arbitrary
  α(k+1) = α(k) + η [ (b − α^t(k) g^k) / ||g^k||^2 ] g^k,

where b is the margin. An update takes place when the sample g^k is incorrectly classified, that is, when α^t(k) g^k < b. We use the same definition of g as in the Perceptron case above, and can then write

  α^t(k) η_ij < b  ⇔  a_i^t(k) y − a_j^t(k) y < b.

In the update rule, α can be decomposed into its subcomponents a_s, yielding the following equivalence (note ||η_ij||^2 = 2||y||^2):

  Multicategory: α(k+1) = α(k) + η [ (b − α^t(k) g^k) / ||g^k||^2 ] g^k
  Two-category:  a_i(k+1) = a_i(k) + η [ (b − (a_i^t(k) − a_j^t(k)) y) / (2||y||^2) ] y
                 a_j(k+1) = a_j(k) − η [ (b − (a_i^t(k) − a_j^t(k)) y) / (2||y||^2) ] y.
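The construction can be sketched numerically (Python; the dimensions, values, and function name are illustrative): each η_ij places y in block i and −y in block j of a (c·d)-dimensional vector, so that α^t η_ij = a_i^t y − a_j^t y.

```python
import numpy as np

def kesler(y, i, j, c):
    """Build eta_ij: y in block i, -y in block j, zeros elsewhere."""
    d = len(y)
    eta = np.zeros(c * d)
    eta[i*d:(i+1)*d] = y
    eta[j*d:(j+1)*d] = -y
    return eta

c, d = 3, 4
rng = np.random.default_rng(1)
a = [rng.normal(size=d) for _ in range(c)]   # a_1, ..., a_c
alpha = np.concatenate(a)                    # alpha = (a_1^t ... a_c^t)^t
y = rng.normal(size=d)

eta = kesler(y, 0, 2, c)
print(np.isclose(alpha @ eta, a[0] @ y - a[2] @ y))  # True
```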
Computer Exercises

Section 5.4
1. Computer exercise not yet solved

Section 5.5
2. Computer exercise not yet solved
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved

Section 5.6
7. Computer exercise not yet solved

Section 5.8
8. Computer exercise not yet solved

Section 5.9
9. Computer exercise not yet solved

Section 5.10
10. Computer exercise not yet solved

Section 5.11
11. Computer exercise not yet solved

Section 5.12
12. Computer exercise not yet solved
Chapter 6
Problem Solutions
Section 6.2
1. Consider a three-layer network with linear units throughout, having input vector x, vector at the hidden units y, and output vector z. For such a linear system we have y = W_1 x and z = W_2 y for two matrices W_1 and W_2. Thus we can write the output as

  z = W_2 y = W_2 W_1 x = W_3 x

for some matrix W_3 = W_2 W_1. But this equation is the same as that of a two-layer network having connection matrix W_3. Thus a three-layer network with linear units throughout can be implemented by a two-layer network with appropriately chosen connections.

Clearly, a nonlinearly separable problem cannot be solved by a three-layer neural network with linear hidden units. To see this, suppose a nonlinearly separable problem could be solved by a three-layer neural network with linear hidden units. Then, equivalently, it could be solved by a two-layer neural network. Then clearly the problem would be linearly separable. But, by assumption, the problem is only nonlinearly separable. Hence there is a contradiction and the above conclusion holds true.
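The collapse of a linear three-layer network into a two-layer one can be checked directly (Python; the layer sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, nH, c = 5, 3, 2
W1 = rng.normal(size=(nH, d))   # input-to-hidden weights
W2 = rng.normal(size=(c, nH))   # hidden-to-output weights
x = rng.normal(size=d)

z_three_layer = W2 @ (W1 @ x)   # linear hidden units: y = W1 x, z = W2 y
W3 = W2 @ W1                    # equivalent two-layer connection matrix
z_two_layer = W3 @ x

print(np.allclose(z_three_layer, z_two_layer))  # True
```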
2. Fourier's theorem shows that a three-layer neural network with sigmoidal hidden units can act as a universal approximator. Consider a two-dimensional input and a single output z(x_1, x_2) = z(x). Fourier's theorem states

  z(x) ≅ Σ_{f_1} Σ_{f_2} A_{f_1 f_2} cos(f_1 x_1) cos(f_2 x_2).

(a) Fourier's theorem, as stated above, can be rewritten with the trigonometric identity

  cos(α) cos(β) = (1/2) cos(α + β) + (1/2) cos(α − β)

to give

  z(x_1, x_2) ≅ Σ_{f_1} Σ_{f_2} (A_{f_1 f_2}/2) [cos(f_1 x_1 + f_2 x_2) + cos(f_1 x_1 − f_2 x_2)].
(b) We want to show that cos(x), or indeed any continuous function, can be approximated by the following linear combination:

  f(x) ≅ f(x_0) + Σ_{i=0}^n [f(x_{i+1}) − f(x_i)] (1 + Sgn[x − x_i])/2.

The Fourier series of a function f(x) at a point x_0 converges to f(x_0) if f(x) is of bounded variation in some interval (x_0 − h, x_0 + h) centered on x_0. A function of bounded variation is defined as follows: given a partition of the interval, a = x_0 < x_1 < x_2 < ... < x_{n−1} < x_n = b, form the sum

  Σ_{k=1}^n |f(x_k) − f(x_{k−1})|.

The least upper bound of these sums is called the total variation. For a point f(x) in the neighborhood of f(x_0), we can rewrite the approximation as

  Σ_{i=1}^n [f(x_{i+1}) − f(x_i)] (1 + Sgn[x − x_i])/2,

which sets the interval to be (x, x + 2h). Note the function has to be either continuous at f(x_0) or have a discontinuity of the first kind.

[Figure: a function f(x) and its staircase approximation.]
(c) As the effective width of the sigmoid vanishes, i.e., as σ → 0, the sigmoids become step functions. Then the functions cos(f_1 x_1 + f_2 x_2) and cos(f_1 x_1 − f_2 x_2) can be approximated as

  cos(f_1 x_1 + f_2 x_2) ≅ cos(f_1 x_{10} + f_2 x_{20})
    + Σ_{i=0}^n [cos(x_{1,i+1} f_1 + x_{2,i+1} f_2) − cos(x_{1i} f_1 + x_{2i} f_2)] (1 + Sgn[x − x_i])/2,

and similarly for cos(f_1 x_1 − f_2 x_2).

(d) The construction does not necessarily guarantee that the derivative is approximated, since there might be discontinuities of the first kind, that is, a non-continuous first derivative. Nevertheless, the one-sided limits f(x_0 + 0) and f(x_0 − 0) will exist.
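The staircase approximation of part (b) can be sketched as follows (Python; the partition size is an illustrative choice, and the step indicator is implemented as (x > x_i)). The maximum error is bounded by the largest single jump, which shrinks as the partition is refined.

```python
import numpy as np

f = np.cos
x_knots = np.linspace(0.0, np.pi, 1001)   # partition x_0 < x_1 < ... < x_n

def staircase(x):
    # f(x_0) plus the accumulated jumps [f(x_{i+1}) - f(x_i)] for all x_i below x
    jumps = np.diff(f(x_knots))
    return f(x_knots[0]) + np.sum(jumps * (x > x_knots[:-1]))

xs = np.linspace(0.0, np.pi, 313)
err = max(abs(staircase(x) - f(x)) for x in xs)
print(err)  # small: bounded by the largest single jump of cos on the partition
```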
Section 6.3

3. Consider a d-n_H-c network trained with n patterns for m_e epochs.

(a) Consider the space complexity of this problem. The total number of adjustable weights is d n_H + n_H c. The amount of storage for the n patterns is nd.

(b) In stochastic mode we choose a pattern randomly and compute

  w(t+1) = w(t) + Δw(t)

until the stopping criterion is met. Each iteration involves computing Δw(t) and then adding it to w(t). From Eq. 17 in the text, for hidden-to-output unit weights we have

  Δw_jk = η (t_k − z_k) f'(net_k) y_j,

where net_k = Σ_{j=1}^{n_H} w_jk y_j, and the sum Σ_{k=1}^c w_kj δ_k needed for each hidden unit is computed in c operations.

For the hidden units, the chain rule gives

  δ_j = Σ_k (∂E/∂net_k)(∂net_k/∂o_j),

where

  ∂net_k/∂o_j = w_jk.

Thus we have

  δ_j = f'(net_j) Σ_{k=1}^c w_jk [−∂E/∂net_k] = f'(net_j) Σ_{k=1}^c w_jk δ_k.
5. From Eq. 21 in the text, the backpropagation rule for training input-to-hidden weights is given by

  Δw_ji = η x_i f'(net_j) Σ_k w_kj δ_k = η x_i δ_j.

For a fixed hidden unit j, the Δw_ji are determined by x_i. The larger the magnitude of x_i, the larger the weight change. If x_i = 0, then there is no input from the ith unit and hence no change in w_ji. On the other hand, for a fixed input unit i, Δw_ji ∝ δ_j, the sensitivity of unit j. The change in weight w_ji is determined by how the overall error changes with the activation of unit j. For a large magnitude of δ_j (i.e., large sensitivity), Δw_ji is large. Also note that the sensitivities of the subsequent layers propagate to layer j through a weighted sum of sensitivities.
6. There is no reason to expect that the backpropagation rules should be inversely related to f'(net). Note, the backpropagation rules are defined to provide an algorithm that minimizes

  J = (1/2) Σ_{k=1}^c (t_k − z_k)^2,

where z_k = f(net_k). Now, the weights are chosen to minimize J. Clearly, a large change in weight should occur only to significantly decrease J; to ensure convergence of the algorithm, small changes in weight should correspond to small changes in J. But J can be significantly decreased only by a large change in the output z_k. So, large changes in weight should occur where the output varies the most and small changes where the output varies the least. Thus a change in weight should not be inversely related to f'(net).
7. In a three-layer neural network with bias, the output can be written

  z_k = f( Σ_j w_kj f( Σ_i w_ji x_i + w_j0 ) + w_k0 ).

This is equivalent to the bias-free form

  z_k = f( Σ_j w_kj f( Σ_i w_ji x_i ) )

obtained by increasing the number of input units by 1 and the number of hidden units by 1: set x_0 = 1, give the extra hidden unit no inputs, and adjust its transfer function f_h so that its output is constant, o_bias = f_h(0) = 1, with hidden-to-output weight equal to the bias w_k0.
8. We denote the number of input units as d, the number of hidden units as n_H, and the number of category (output) units as c. There is a single bias unit.

(a) The number of weights is d n_H + (n_H + 1)c, where the first term is the number of input-to-hidden weights and the second term the number of hidden-to-output weights, including the bias unit.

(b) The output at output node k is given by Eq. 7 in the text:

  z_k = f( Σ_{j=1}^{n_H} w_kj f( Σ_{i=1}^d w_ji x_i + w_j0 ) + w_k0 ).

(c) For an anti-symmetric transfer function, f(−net) = −f(net), and thus

  f( Σ_{j=1}^d (−w_ji) x_i ) = −f( Σ_{j=1}^d w_ji x_i ),

so flipping the signs of the input-to-hidden weights of hidden unit j together with the sign of its hidden-to-output weights leaves the output unchanged:

  (−w_kj) f( Σ_{j=1}^d (−w_ji) x_i ) = (w_kj) f( Σ_{j=1}^d w_ji x_i ).

The learning algorithm is:

  begin initialize n_H, w
    do
      x ← next input pattern
      w_ji ← w_ji + η δ_j x_i ;  w_kj ← w_kj + η δ_k y_j
    until no more patterns available
    return w
  end
10. We express the derivative of a sigmoid in terms of the sigmoid itself, for positive constants a and b, for the following cases.

(a) For f(net) = 1/(1 + e^{−a net}), the derivative is

  df(net)/d net = a e^{−a net}/(1 + e^{−a net})^2 = a f(net)(1 − f(net)).
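A finite-difference check of this identity (Python; the constant a and the evaluation point are arbitrary illustrative choices):

```python
import math

a = 1.7
f = lambda net: 1.0 / (1.0 + math.exp(-a * net))

net, h = 0.3, 1e-6
numeric = (f(net + h) - f(net - h)) / (2 * h)   # central difference
analytic = a * f(net) * (1.0 - f(net))          # f'(net) = a f (1 - f)
print(abs(numeric - analytic))  # ~0
```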
11. We use the following notation: The activations at the first (input), second, third, and fourth (output) layers are x_i, y_j, v_l, and z_k, respectively, and the indexes are clear from usage. The number of units in the first hidden layer is n_{H1} and the number in the second hidden layer is n_{H2}.

[Figure: a four-layer network with inputs x_1, ..., x_d, first-hidden-layer units y_1, ..., y_{j_max}, second-hidden-layer units v_1, ..., v_{l_max}, and outputs z_1, ..., z_c.]

Algorithm 0 (Four-layer backpropagation)
Section 6.4
12. Suppose the input-to-hidden weights are all set equal to the same value, say w_o; then w_ji = w_o and

  net_j = Σ_{i=1}^d w_ji x_i = w_o Σ_i x_i.

This means that o_j = f(net_j) is the same constant for every hidden unit, say y_o. Clearly, whatever the topology of the original network, setting the w_ji to a constant is equivalent to changing the topology so that there is effectively only a single input unit, whose input to the next layer is x_o. As a result of this loss of one layer and of the number of effective units in the next layer, the network will not train well.
13. If the labels on the two hidden units are exchanged, the shape of the error surface is unaffected. Consider a d-n_H-c three-layer network. From Problem 8 part (c), we know that there are n_H! 2^{n_H} equivalent relabelings of the hidden units that do not affect the network output. One can also flip the weights for each of these configurations; thus there should be n_H! 2^{n_H + 1} possible relabelings and weight changes that leave the error surface unaffected.

14. Consider a simple 2-1 network with bias. Suppose the training data come from two Gaussians, p(x|ω_1) ~ N(−0.5, 1) and p(x|ω_2) ~ N(0.5, 1). The teaching values are ±1.
(a) The error is a sum over the n patterns as a function of the transfer function f(·),

  J(w) = Σ_{k=1}^n (t_k − z_k)^2,

and the entries of the Hessian,

  ∂^2 J(w)/∂w_i ∂w_j = Σ_{k=1}^n ∂^2 (t_k − z_k)^2/∂w_i ∂w_j,

are likewise sums over the n patterns.
  ln(·)/ln(1 − ·).
16. Assume the criterion function J(w) is well described to second order by a Hessian matrix H.

(a) Stepping along the gradient gives at step T

  θ_i^T = (1 − η λ_i)^T θ_i^0.

To reach a local minimum, we need |1 − η λ_i| < 1, and this implies η < 2/λ_max.

(b) Consider (1 − (2/λ_max) λ_i). More steps will be required along the direction corresponding to the smallest eigenvalue λ_i, and thus the convergence rate is governed by

  1 − 2λ_min/λ_max.

Standardization helps reduce the learning time by making the ratio λ_max/λ_min ≅ 1.

(c) Standardization is, in essence, a whitening transform.
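The η < 2/λ_max condition can be demonstrated on a quadratic criterion (Python; the eigenvalues and starting point are illustrative choices):

```python
import numpy as np

lam = np.array([0.5, 4.0])          # eigenvalues of H; lambda_max = 4
theta0 = np.array([1.0, 1.0])

def run(eta, T=100):
    # theta_i^T = (1 - eta * lambda_i)^T * theta_i^0, componentwise
    return (1.0 - eta * lam) ** T * theta0

small = np.abs(run(0.4))   # eta < 2/lambda_max = 0.5: all modes converge
large = np.abs(run(0.6))   # eta > 2/lambda_max: the lambda_max mode diverges
print(small.max(), large.max())
```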
Section 6.6
17. From Eq. 25 in the text, we have

  J(w) = n [ (n_k/n)(1/n_k) Σ_{x∈ω_k} [g_k(x, w) − 1]^2 + ((n − n_k)/n)(1/(n − n_k)) Σ_{x∉ω_k} [g_k(x, w)]^2 ].

By the law of large numbers, as n → ∞ the first bracketed average approaches

  E([g_k(x, w) − 1]^2 | x ∈ ω_k) = ∫ [g_k(x, w) − 1]^2 p(x|ω_k) dx.

Likewise, the second average,

  (1/(n − n_k)) Σ_{x∉ω_k} [g_k(x, w)]^2,

approaches E([g_k(x, w)]^2 | x ∉ ω_k). We set ∂J(w)/∂w = 0 and find

  ∫ [∂g_k(x, w)/∂w]|_{w = w*(p(x))} p(x) dx = ∫ [∂g_k(x, w)/∂w]|_{w = w*(p(x))} p(x, ω_k) dx.

Thus we have

  g_k(x, w*(p(x))) p(x) = p(x, ω_k)

with probability 1 on the set of x : p(x, ω_k) > 0. This implies

  g_k(x, w*) = p(x, ω_k)/p(x) = P(ω_k | x).
This is true only when the above assumption is met. If the assumption is not met, the gradient descent procedure yields the closest projection to the posterior probability in the class of functions spanned by the network.
20. Recall the equation

  p(y|ω_k) = e^{A(w̃_k) + B(y, φ) + w̃_k^t y}.

(a) By Bayes' rule, the posterior is

  P(ω_k|y) = p(y|ω_k) P(ω_k)/p(y).

(b) We interpret A(·), w̃_k and φ as follows. Substituting the exponential form into Bayes' rule,

  P(ω_k|y) = exp[A(w̃_k) + B(y, φ) + w̃_k^t y] P(ω_k) / Σ_{m=1}^c exp[A(w̃_m) + B(y, φ) + w̃_m^t y] P(ω_m)
          = e^{net_k} / Σ_{m=1}^c e^{net_m},

where net_k = A(w̃_k) + w̃_k^t y + ln P(ω_k); the common factor e^{B(y,φ)} cancels.

21. For a three-layer network with softmax outputs,

  net_j = Σ_{i=1}^d w_ji x_i,  y_j = f(net_j),  net_k = Σ_{j=1}^{n_H} w_kj y_j,
  z_k = e^{net_k} / Σ_{m=1}^c e^{net_m},

so that

  ∂z_h/∂net_h = z_h(1 − z_h).
(a) For the sum-squared-error criterion

  J = (1/2) Σ_{k=1}^c (t_k − z_k)^2,

to derive the learning rule we have to compute ∂J/∂w_kj and ∂J/∂w_ji. We start with the former:

  ∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj).

We next compute

  ∂J/∂net_k = Σ_{s=1}^c (∂J/∂z_s)(∂z_s/∂net_k).

We also have

  ∂z_s/∂net_k = −e^{net_s} e^{net_k} / ( Σ_{m=1}^c e^{net_m} )^2 = −z_s z_k,  if s ≠ k,

and

  ∂z_s/∂net_k = [ e^{net_s} Σ_{m=1}^c e^{net_m} − e^{net_s} e^{net_s} ] / ( Σ_{m=1}^c e^{net_m} )^2 = z_k − z_k^2,  if s = k,

and finally

  ∂J/∂z_s = −(t_s − z_s).

Putting these together we get

  ∂J/∂net_k = Σ_{s≠k} (t_s − z_s) z_s z_k − (t_k − z_k)(z_k − z_k^2).
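The gradient just derived can be checked against finite differences; the sketch below (Python; the dimensions, target vector, and function names are illustrative assumptions) compares the closed form with a numerical derivative of J.

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())   # stable softmax
    return e / e.sum()

def J(net, t):
    z = softmax(net)
    return 0.5 * np.sum((t - z) ** 2)

rng = np.random.default_rng(0)
net = rng.normal(size=4)
t = np.array([0.0, 1.0, 0.0, 0.0])
z = softmax(net)

# dJ/dnet_k = sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k)(z_k - z_k^2)
analytic = np.array([
    sum((t[s] - z[s]) * z[s] * z[k] for s in range(4) if s != k)
    - (t[k] - z[k]) * (z[k] - z[k] ** 2)
    for k in range(4)
])

h = 1e-6
numeric = np.array([
    (J(net + h * np.eye(4)[k], t) - J(net - h * np.eye(4)[k], t)) / (2 * h)
    for k in range(4)
])
print(np.abs(analytic - numeric).max())  # ~0
```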
For the input-to-hidden weights we need

  ∂J/∂y_j = Σ_{s=1}^c (∂J/∂z_s)(∂z_s/∂y_j)
         = −Σ_{s=1}^c (t_s − z_s) Σ_{r=1}^c (∂z_s/∂net_r)(∂net_r/∂y_j)
         = −Σ_{s=1}^c (t_s − z_s) [ (z_s − z_s^2) w_sj − Σ_{r≠s} z_s z_r w_rj ]
         = −Σ_{s=1}^c (t_s − z_s)(z_s − z_s^2) w_sj + Σ_{s=1}^c Σ_{r≠s} z_s z_r w_rj (t_s − z_s),

and thus

  ∂J/∂w_ji = −x_i f'(net_j) Σ_{s=1}^c (t_s − z_s)(z_s − z_s^2) w_sj
           + x_i f'(net_j) Σ_{s=1}^c (t_s − z_s) Σ_{r≠s} w_rj z_s z_r.
The update rules are then Δw_kj = −η ∂J/∂w_kj and Δw_ji = −η ∂J/∂w_ji.

(b) For the cross-entropy criterion

  J_ce = Σ_{k=1}^c t_k ln(t_k/z_k),

the learning rule derivation is the same as in part (a) except that we need to replace ∂J/∂z with ∂J_ce/∂z. Note that for the cross-entropy criterion we have ∂J_ce/∂z_k = −t_k/z_k. Then, following the steps in part (a), we have

  ∂J_ce/∂w_kj = y_j Σ_{s≠k} (t_s/z_s) z_s z_k − y_j (t_k/z_k)(z_k − z_k^2)
             = y_j Σ_{s≠k} t_s z_k − y_j t_k (1 − z_k).

Similarly, we have

  ∂J_ce/∂w_ji = x_i f'(net_j) Σ_{s=1}^c (t_s/z_s) Σ_{r≠s} w_rj z_s z_r
              − x_i f'(net_j) Σ_{s=1}^c (t_s/z_s) w_sj (z_s − z_s^2).

Thus we find

  ∂J_ce/∂w_ji = x_i f'(net_j) Σ_{s=1}^c t_s Σ_{r≠s} w_rj z_r − x_i f'(net_j) Σ_{s=1}^c t_s w_sj (1 − z_s).
When ∂J_ce/∂w_ji = 0 and ∂J_ce/∂w_kj = 0, the criterion Σ_k ε_k^2 is a minimum; this implies that every term in the above equation is minimized.

Section 6.7

23. Consider the weight update rules given by Eqs. 12 and 23 in the text.
(a) The weight updates have the factor η f'(net). Take the sigmoid f_b(h) = 2/(1 + e^{−bh}) − 1 = tanh(bh/2). Then

  f_b'(h) = 2b e^{−bh}/(1 + e^{−bh})^2 = 2b/(e^{bh} + e^{−bh} + 2) = 2bD,

where D = 1/(e^{bh} + e^{−bh} + 2). Clearly, 0 < D ≤ 0.25 for all b and h. If we assume D is constant, then scaling the learning rate to η/b keeps the product (η/b) f_b'(h) = 2ηD unchanged, which tells us the increments in the weight values will be the same, preserving the convergence time. The assumption will be approximately true as long as |bh| is very small or kept constant in spite of the change in b.

(b) If the input data is scaled by 1/α, then scaling b by α gives exactly the same increment of the weights at each step, that is,

  f_{αb}(h/α) = f_b(h).

Therefore the convergence time will be kept the same. (However, if the network is a multilayer one, the input scaling should be applied to the hidden units' outputs as well.)
24. An additive model has outputs of the form

  z_k = f( Σ_{j=1}^d f_{jk}(x_j) + w_{0k} ).

The functions f_{jk} act on single components, namely the x_j. This actually restricts the additive models. To have the full power of three-layer neural networks we must assume that each f_{jk} is a multivariate function of the inputs. In this case the model becomes

  z_k = f( Σ_{j=1}^d f_{jk}(x_1, x_2, ..., x_d) + w_{0k} ).

Now take

  f_{jk} = w_kj g( Σ_i w_ji x_i + w_{j0} )

and w_{0k} = w_{k0}, where g, w_kj, w_ji, w_{j0}, w_{k0} are defined as in Eq. 32 in the text. We substitute f_{jk} into the above equation and at once arrive at the three-layer neural network function shown in Eq. 7 in the text.
25. Let p_x(x) and p_y(y) be probability density functions of x and y, respectively.

(a) From the definition of entropy given by Eq. 37 in Chapter 1 of the text (or Eq. 118 in Section A.7.1 in the Appendix), and the one-dimensional Gaussian

  p_x(x) = (1/(√(2π) σ)) exp(−x^2/(2σ^2)),

we have the entropy

  H(p_x(x)) = −∫ p_x(x) log p_x(x) dx = 1/2 + log √(2π) σ ≅ 1.447782 + log σ bits.

(b) From Sect. 6.8.2 in the text, if b = 2/(3σ), then we have f'(x) ≅ 1 for −σ < x < σ.

(c) Suppose y = f(x). If there exists a unique inverse mapping x = f^{−1}(y), then p_y(y) dy = p_x(x) dx. Thus we have

  H(p_y(y)) = −∫ p_y(y) log p_y(y) dy
           = −∫ p_x(x) log p_x(x) dx + ∫ p_x(x) log f'(x) dx
           = H(p_x(x)) + ∫ (1/(√(2π) σ)) exp(−x^2/(2σ^2)) log f'(x) dx,

where for the logistic nonlinearity the last integrand involves log[1 + exp(−bx)].
(d) In the limit the output becomes

  y = f(x) = { a  if x = 0,
             { 0  otherwise,

so

  Pr[Y = y] = { 1  if y = 0,
             { 0  otherwise.

For the sigmoid

  f(net) = 2a/(1 + e^{−b net}) − a,

the derivative is

  f'(net) = 2ab e^{−b net}/(1 + e^{−b net})^2
         = b [2a/(1 + e^{−b net})] [1 − 1/(1 + e^{−b net})]
         = (b/2a)[a^2 − (f(net))^2],

and the second derivative is

  f''(net) = −(b/a) f(net) f'(net).

At net = 0, we have

  f(0) = 2a/(1 + e^0) − a = 2a/2 − a = 0,
  f'(0) = (b/2a)(a^2 − (f(0))^2) = (b/2a) a^2 = ab/2 ≠ 0,
  f''(0) = −(b/a) f(0) f'(0) = 0.

At net = ∞, we have

  f(∞) = 2a/(1 + e^{−b·∞}) − a = 2a − a = a,
  f'(∞) = (b/2a)(a^2 − a^2) = 0,
  f''(∞) = −(b/a) f(∞) f'(∞) = 0,

and similarly f(−∞) = −a, f'(−∞) = 0, f''(−∞) = 0.
n3 d
,
5
28. The Minkowski error per pattern is

  J = Σ_{k=1}^c |z_k − t_k|^R = Σ_{k=1}^c |t_k − z_k|^R.

We proceed as on page 290 of the text for the usual sum-squared-error function:

  ∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj),

where z_k = f( Σ_{j=1}^{n_H} w_kj y_j ). We also have

  ∂J/∂net_k = −R |t_k − z_k|^{R−1} f'(net_k) Sgn(t_k − z_k),

where the signum function can be written Sgn(x) = x/|x|. We also have

  ∂net_k/∂w_kj = y_j.
Thus we have

  ∂J/∂w_kj = −R |t_k − z_k|^{R−1} f'(net_k) y_j Sgn(t_k − z_k),

and so the update rule for w_kj is

  Δw_kj = η |t_k − z_k|^{R−1} f'(net_k) y_j Sgn(t_k − z_k).

Now we compute ∂J/∂w_ji by the chain rule:

  ∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji).

The first term on the right-hand side is

  ∂J/∂y_j = Σ_{k=1}^c (∂J/∂z_k)(∂z_k/∂y_j) = −Σ_{k=1}^c R |t_k − z_k|^{R−1} Sgn(t_k − z_k) f'(net_k) w_kj.

The second term is simply ∂y_j/∂net_j = f'(net_j). The third term is ∂net_j/∂w_ji = x_i. We put all this together and find

  ∂J/∂w_ji = −Σ_{k=1}^c R |t_k − z_k|^{R−1} Sgn(t_k − z_k) f'(net_k) w_kj f'(net_j) x_i.

For the Euclidean distance measure, we let R = 2 and obtain Eqs. 17 and 21 in the text, if we note the identity

  Sgn(t_k − z_k) |t_k − z_k| = t_k − z_k.
29. Consider a d-n_H-c three-layer neural network whose input units are linear and output units are sigmoidal, but where each hidden unit implements a particular polynomial function, trained on a sum-squared-error criterion. Here the output of hidden unit j is given by

  o_j = w_ji x_i + w_jm x_m + q_j x_i x_m

for two prespecified inputs, i and m ≠ i.

(a) We write the network function as

  z_k = f( Σ_{j=1}^{n_H} w_kj o_j(x) ).
The error on a pattern is

  J = (1/2) Σ_{k=1}^c (t_k − z_k)^2,

and by the chain rule

  ∂J/∂w_ji = (∂J/∂o_j)(∂o_j/∂w_ji) = Σ_{k=1}^c (∂J/∂z_k)(∂z_k/∂o_j)(∂o_j/∂w_ji),

with

  ∂o_j/∂w_ji = { x_{i_j} + q_j x_{m_j}   if i = i_j,
              { x_{m_j} + q_j x_{i_j}   if i = m_j,
              { 0                       otherwise.

Now we can write ∂J/∂w_ji for the weights connecting inputs to the hidden layer. Remember that each hidden unit j has three weights or parameters for the two specified input units i_j and m_j, namely w_{j i_j}, w_{j m_j}, and q_j. Thus we have

  ∂J/∂w_{j i_j} = −Σ_{k=1}^c (t_k − z_k) f'(net_k) w_kj (x_{i_j} + q_j x_{m_j}),
  ∂J/∂w_{j m_j} = −Σ_{k=1}^c (t_k − z_k) f'(net_k) w_kj (x_{m_j} + q_j x_{i_j}),
  ∂J/∂w_{jr} = 0  for r ≠ i_j and r ≠ m_j.
Thus we have

  ∂J/∂q_j = −Σ_{k=1}^c (t_k − z_k) f'(net_k) w_kj x_{i_j} x_{m_j}.
(b) We observe that y_j = o_j; from part (a) we see that the hidden-to-output weight update rule is the same as the standard backpropagation rule.

(c) The most obvious weakness is that the receptive fields of the network are a mere two units. Another weakness is that the hidden layer outputs are no longer bounded, and hence create problems in convergence and numerical stability. A key fundamental weakness is that the network is no longer a universal approximator/classifier, because the function space F that the hidden units span is merely a subset of all polynomials of degree two and less. Consider a classification task. In essence the network does a linear classification followed by a sigmoidal nonlinearity at the output. In order for the network to be able to perform the task, the hidden unit outputs must be linearly separable in the F space. However, this cannot be guaranteed; in general we need a much larger function space to ensure linear separability.

The primary advantage of this scheme is due to the task domain. If the task domain is known to be solvable with such a network, the convergence will be faster and the computational load will be much less than for a standard backpropagation learning network, since the number of input-to-hidden unit weights is 3n_H compared to d n_H for standard backpropagation.
30. The solution to Problem 28 gives the backpropagation learning rule for the Minkowski error. In this problem we derive the learning rule for the Manhattan metric directly. The error per pattern is given by

  J = Σ_{k=1}^c |t_k − z_k|.

We proceed as on page 290 of the text for the sum-squared-error criterion function:

  ∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj),
  ∂J/∂net_k = −f'(net_k) Sgn(t_k − z_k).

For the input-to-hidden weights,

  ∂J/∂y_j = Σ_{k=1}^c (∂J/∂z_k)(∂z_k/∂y_j) = −Σ_{k=1}^c Sgn(t_k − z_k) f'(net_k) w_kj,

with

  ∂y_j/∂net_j = f'(net_j),  ∂net_j/∂w_ji = x_i.

The above learning rules define the backpropagation learning rule with the Manhattan metric. Note that the Manhattan metric uses the direction (signum) of t_k − z_k instead of the actual t_k − z_k as in the sum-squared-error criterion.
Section 6.9
31. We are given the criterion function

  J = (1/2) Σ_{m=1}^n (t_m − z_m)^2.

Its Hessian has entries

  ∂^2 J/∂w_ji ∂w_lk = Σ_{m=1}^n (∂z_m/∂w_ji)(∂z_m/∂w_lk) + Σ_{m=1}^n (z_m − t_m) ∂^2 z_m/∂w_ji ∂w_lk.

We drop the last term in the outer-product approximation, as given in Eq. 45 in the text. Let W_j be any hidden-to-output weight in a single-output network. Then we have ∂z/∂W_j = f'(net) y_j. Now we let w_ji be any input-to-hidden weight. Then since

  z = f( Σ_j W_j f( Σ_i w_ji x_i ) ),

the derivative is

  ∂z/∂w_ji = f'(net) W_j f'(net_j) x_i.

Thus the derivatives with respect to the hidden-to-output weights are

  X_v^t = (f'(net) y_1, ..., f'(net) y_{n_H}),

as in Eq. 47 in the text. Further, the derivatives with respect to the input-to-hidden weights have components

  [X_u]_{ji} = f'(net) W_j f'(net_j) x_i,

as in Eq. 48 in the text.

32. For the cross-entropy criterion, the Hessian entries are

  ∂^2 J_CE/∂w_ji ∂w_lk = ∂/∂w_lk [ −Σ_k (t_k/z_k) ∂z_k/∂w_ji ]
                       = Σ_k [ (t_k/z_k^2)(∂z_k/∂w_ji)(∂z_k/∂w_lk) − (t_k/z_k) ∂^2 z_k/∂w_ji ∂w_lk ].

We arrive at the outer-product approximation by dropping the second-order terms. We compare the above formula to Eq. 44 in the text and see that the Hessian matrix H_CE differs from the Hessian matrix in the traditional sum-squared-error case in Eq. 45 by a scale factor of t_k/z_k^2. Thus, for a d-n_H-1 network, the Hessian matrix can be approximated by

  H_CE = (1/n) Σ_{m=1}^n (t_1/z_1^2) X^t[m] X[m],

where X[m] is defined by Eqs. 46, 47 and 48 in the text, as well as in Problem 31.
33. We assume that the current weight value is w(n−1) and we are doing a line search.

(a) The next w according to the line search will be found via

  w(n) = w(n−1) + λ ∇J(w(n−1))

by minimizing J(w) using only λ. Since we are given that the Hessian matrix is proportional to the identity, H = kI, the Taylor expansion of J(w) up to second order is quadratic in λ. We find the result of the line search in J by solving ∂J/∂λ = 0 for λ, that is,

  ∂J/∂λ = ||∇J(w(n−1))||^2 (1 + λk) = 0,

which has solution λ = −1/k. Thus the next w after the line search will be

  w(n) = w(n−1) − (1/k) ∇J(w(n−1)).

We can also rearrange terms and find

  w(n) − w(n−1) = Δw(n−1) = −(1/k) ∇J(w(n−1)).

Using the Taylor expansion of ∇J(w(n)) around w(n−1) we can write

  ∇J(w(n)) = ∇J(w(n−1)) + k(w(n) − w(n−1)),

then substitute w(n) − w(n−1) from above into the right-hand side and find

  ∇J(w(n)) = ∇J(w(n−1)) + k(−(1/k) ∇J(w(n−1))) = 0.

Indeed, Eqs. 56 and 57 in the text give β_n = 0, given that ∇J(w_n) = 0, as shown. Equation 56 in the text is trivially satisfied since its numerator is ||∇J(w(n))||^2, and thus the Fletcher-Reeves equation gives β_n = 0. Equation 57 also vanishes because its numerator is the inner product of two vectors, one of which is ∇J(w(n)) = 0, and thus the Polak-Ribiere equation gives β_n = 0.

(b) The above proves that application of a line search starting at w(n−1) takes us to the minimum of J, and hence no further weight update is necessary.
34. Problem not yet solved
35. The Quickprop algorithm assumes the weights are independent and the error surface is quadratic. With these assumptions, ∂J/∂w is linear. The graph shows ∂J/∂w versus w and a possible error-minimization step. For finding a minimum in the criterion function, we search for ∂J/∂w = 0, that is, the value w* where the line crosses the w axis.

[Figure: ∂J/∂w versus w, with the iterates w_{n−1} and w_n and the zero crossing w* reached by the error-minimization step.]

We can easily compute the intercept in this one-dimensional illustration. The equation of the line passing through (x_0, y_0) and (x_1, y_1) is (x − x_1)/(x_1 − x_0) = (y − y_1)/(y_1 − y_0). The x-axis crossing is found by setting y = 0 and solving for x, that is,

  x − x_1 = (x_1 − x_0)(−y_1)/(y_1 − y_0) = (x_1 − x_0) y_1/(y_0 − y_1).

Now we plug in the points shown on the graph, with x_0 = w_{n−1}, y_0 = ∂J/∂w|_{n−1}, x_1 = w_n, y_1 = ∂J/∂w|_n, and note Δw(n) = (x_1 − x_0) and Δw(n+1) = (x − x_1). The final result is

  Δw(n+1) = [ (∂J/∂w|_n) / (∂J/∂w|_{n−1} − ∂J/∂w|_n) ] Δw(n).
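For a quadratic criterion the secant step lands exactly on w*; a one-dimensional sketch (Python; the constants k, w*, and the two starting iterates are illustrative choices):

```python
# Quadratic criterion J(w) = (k/2)(w - w_star)^2, so dJ/dw = k (w - w_star) is linear.
k, w_star = 3.0, 0.7
dJ = lambda w: k * (w - w_star)

w_prev, w_cur = 2.0, 1.5                   # two previous iterates
dw = w_cur - w_prev                        # Delta w(n)

# Quickprop step: Delta w(n+1) = dJ(w_n) / (dJ(w_{n-1}) - dJ(w_n)) * Delta w(n)
dw_next = dJ(w_cur) / (dJ(w_prev) - dJ(w_cur)) * dw
w_next = w_cur + dw_next
print(w_next)  # lands at the minimum w_star (up to roundoff)
```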
Section 6.10

36. Consider the matched filter described by Eq. 66 in the text.

(a) The trial weight function w(t) = w*(t) + h(t) must obey the constraint of Eq. 63:

  ∫ w^2(t) dt = ∫ w*^2(t) dt.

We expand to find

  ∫ w*^2(t) dt + ∫ h^2(t) dt + 2 ∫ w*(t) h(t) dt = ∫ w^2(t) dt,

and thus

  ∫ h(t) w*(t) dt = −(1/2) ∫ h^2(t) dt ≤ 0.

(b) We have

  ∫ x(t) w(t) dt = ∫ x(t) w*(t) dt + ∫ x(t) h(t) dt.

From Eq. 66 in the text we know that w*(t) = λ/2 x(t), and thus x(t) = 2w*(t)/λ. We plug this form of x(t) into the second term on the right-hand side and find

  ∫ x(t) w(t) dt = z* + (2/λ) ∫ w*(t) h(t) dt = z* − (1/λ) ∫ h^2(t) dt,

using the result from part (a). Since

  z* = ∫ w*(t) x(t) dt = (λ/2) ∫ x^2(t) dt,

a positive maximum output requires λ > 0, and therefore

  ∫ x(t) w(t) dt = z* − (1/λ) ∫ h^2(t) dt ≤ z*,

with equality if and only if

  ∫ h^2(t) dt = 0,

which occurs if and only if h(t) = 0 for all t. That is, w*(t) ensures the maximum output. Assume w_1(t) also assures maximum output z*, that is,

  ∫ x(t) w*(t) dt = z*  and  ∫ x(t) w_1(t) dt = z*.

Now we can let h(t) = w_1(t) − w*(t). We have just seen that, for the trial weight function w_1(t) = w*(t) + h(t), maximality requires h(t) = 0 for all t. This tells us that w_1(t) − w*(t) = 0 for all t, or equivalently w_1(t) = w*(t) for all t.
37. Consider the OBS and OBD algorithms.

(a) We have from Eq. 49 in the text

  δJ = (∂J/∂w)^t δw + (1/2) δw^t H δw + O(||δw||^3) ≅ (1/2) δw^t H δw,

since at a local minimum the first-order term vanishes. Pruning weight q imposes the constraint δw^t u_q + w_q = 0, where u_q is the unit vector along the qth axis, so we form

  L = (1/2) δw^t H δw − λ(δw^t u_q + w_q).

Setting ∂L/∂δw = 0 gives δw = λ H^{−1} u_q, and the constraint then yields

  λ = −w_q/(u_q^t H^{−1} u_q) = −w_q/[H^{−1}]_{qq}.

So, we have

  δw = λ H^{−1} u_q = −(w_q/[H^{−1}]_{qq}) H^{−1} u_q.

(b) Substituting back, the saliency of weight q is

  L_q = (1/2) [ (w_q/[H^{−1}]_{qq}) H^{−1} u_q ]^t H [ (w_q/[H^{−1}]_{qq}) H^{−1} u_q ]
      = (1/2) (w_q^2/([H^{−1}]_{qq})^2) u_q^t H^{−1} H H^{−1} u_q
      = (1/2) (w_q^2/([H^{−1}]_{qq})^2) [H^{−1}]_{qq}
      = (1/2) w_q^2/[H^{−1}]_{qq}.

(c) For OBD the Hessian is assumed diagonal, so [H^{−1}]_{qq} = 1/H_{qq}, and the saliency becomes

  L_q = (1/2) w_q^2/[H^{−1}]_{qq} = (1/2) w_q^2 H_{qq}.
2
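A numerical sketch of the OBS saliency (Python; H and w are illustrative values): the closed form ½ w_q²/[H⁻¹]_qq should equal the error increase of the best weight adjustment that zeroes w_q, found here by solving the constrained quadratic directly.

```python
import numpy as np

H = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])          # a positive-definite Hessian (illustrative)
w = np.array([0.8, -0.3, 1.1])
Hinv = np.linalg.inv(H)
q = 2

# Closed-form saliency L_q = (1/2) w_q^2 / [H^{-1}]_qq
L_q = 0.5 * w[q] ** 2 / Hinv[q, q]

# Brute force: minimize (1/2) dw^t H dw subject to dw_q = -w_q.
# Eliminating dw_q leaves H_rr dw_r = -H_rq dw_q for the free components r.
dw_q = -w[q]
rest = [i for i in range(3) if i != q]
dw_r = np.linalg.solve(H[np.ix_(rest, rest)], -H[np.ix_(rest, [q])].ravel() * dw_q)
dw = np.zeros(3); dw[rest] = dw_r; dw[q] = dw_q
delta_J = 0.5 * dw @ H @ dw
print(L_q, delta_J)   # equal (up to roundoff)
```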
38. The three-layer radial basis function network is characterized by Eq. 61 in the text:

  z_k = f( Σ_{j=0}^{n_H} w_kj φ_j(x) ),

and we train by minimizing

  J = (1/2) Σ_{k=1}^c (t_k − z_k)^2,

where

  ∂J/∂net_k = (∂J/∂z_k)(∂z_k/∂net_k) = −(t_k − z_k) f'(net_k).

We put these together and find

  ∂J/∂w_kj = −(t_k − z_k) f'(net_k) φ_j(x).

Now we turn to the update rules for μ_j and b_j. Here we take a spherical Gaussian,

  φ_j(x) = e^{−b_j ||x − μ_j||^2}.

For the centers,

  ∂J/∂μ_ij = −Σ_{k=1}^c (t_k − z_k) f'(net_k) w_kj ∂φ_j/∂μ_ij,

with

  ∂φ_j/∂μ_ij = e^{−b_j ||x − μ_j||^2} · 2b_j (x_i − μ_ij) = 2b_j (x_i − μ_ij) φ_j(x).

Finally, we find ∂J/∂b_j so that the size of the spherical Gaussians can adapt:

  ∂J/∂b_j = (∂J/∂y_j)(∂y_j/∂b_j).

While the first term is given above, the second can be calculated directly:

  ∂y_j/∂b_j = e^{−b_j ||x − μ_j||^2} (−||x − μ_j||^2) = −φ_j(x) ||x − μ_j||^2.

Thus the final derivative is

  ∂J/∂b_j = φ_j(x) ||x − μ_j||^2 Σ_{k=1}^c (t_k − z_k) f'(net_k) w_kj.

The updates are then Δw_kj = −η ∂J/∂w_kj, Δμ_ij = −η ∂J/∂μ_ij, and Δb_j = −η ∂J/∂b_j.
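The ∂y_j/∂b_j factor for a spherical Gaussian unit can be checked by finite differences (Python; the point, center, and width are illustrative values):

```python
import numpy as np

x = np.array([0.4, -0.2])
mu = np.array([1.0, 0.5])
r2 = np.sum((x - mu) ** 2)                 # ||x - mu||^2

phi = lambda b: np.exp(-b * r2)            # spherical Gaussian unit

b, h = 0.9, 1e-6
numeric = (phi(b + h) - phi(b - h)) / (2 * h)
analytic = -r2 * phi(b)                    # d phi / d b = -||x - mu||^2 phi
print(abs(numeric - analytic))  # ~0
```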
Consider the quadratic form x^t K x.

(a) We write

  x^t K x = Σ_{j=1}^d Σ_{i=1}^d x_i K_ij x_j = Σ_{i=1}^d x_i^2 K_ii + Σ_{i≠j} x_i x_j K_ij.

Then the derivative with respect to the component x_r is

  d/dx_r [ Σ_{i=1}^d x_i^2 K_ii + Σ_{i≠j} x_i x_j K_ij ]
    = 2 x_r K_rr + Σ_{j≠r} x_j K_rj + Σ_{i≠r} x_i K_ir
    = Σ_{j=1}^d x_j K_rj + Σ_{i=1}^d x_i K_ir.

This implies

  d/dx (x^t K x) = Kx + K^t x = (K + K^t) x.

(b) So, when K is symmetric, we have

  d/dx (x^t K x) = (K + K^t) x = 2Kx.
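A finite-difference check of this identity (Python; K and x are arbitrary illustrative values, with K deliberately non-symmetric):

```python
import numpy as np

rng = np.random.default_rng(2)
K = rng.normal(size=(4, 4))                # general (non-symmetric) matrix
x = rng.normal(size=4)

analytic = (K + K.T) @ x                   # d/dx (x^t K x) = (K + K^t) x

h = 1e-6
numeric = np.array([
    ((x + h*np.eye(4)[r]) @ K @ (x + h*np.eye(4)[r])
     - (x - h*np.eye(4)[r]) @ K @ (x - h*np.eye(4)[r])) / (2 * h)
    for r in range(4)
])
print(np.abs(analytic - numeric).max())  # ~0
```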
41. Consider weight decay as regularization.

(a) With the penalized criterion E = E_pat + γ Σ_ij w_ij^2, gradient descent gives

  ∂E/∂w_ij = ∂E_pat/∂w_ij + 2γ w_ij,

and the decay term alone updates

  w_ij^new = w_ij^old − 2γ w_ij^old = w_ij^old (1 − 2γ).

(b) In this case we have w_ij^new = w_ij^old (1 − ε), where ε is a weight decay constant. Comparing with part (a),

  w_ij^new = w_ij^old (1 − 2γ)  ⇒  γ = ε/2.

(c) Now take

  E = E_pat + γ Σ_ij w_ij^2/(1 + w_ij^2).

Then

  ∂E/∂w_ij = ∂E_pat/∂w_ij + γ ∂/∂w_ij [ w_ij^2/(1 + w_ij^2) ]
           = ∂E_pat/∂w_ij + γ [ 2w_ij/(1 + w_ij^2) − 2w_ij^3/(1 + w_ij^2)^2 ]
           = ∂E_pat/∂w_ij + 2γ w_ij/(1 + w_ij^2)^2.

The decay step is then

  w_ij^new = w_ij^old − 2γ w_ij^old/(1 + (w_ij^old)^2)^2 = w_ij^old (1 − ξ_ij),

where

  ξ_ij = 2γ/(1 + w_ij^2)^2.

This implies

  γ = ξ_ij (1 + w_ij^2)^2 / 2.

(d) Now consider a network with a wide range of magnitudes for the weights. The constant weight decay method w_ij^new = w_ij^old (1 − ε) is equivalent to regularization via a quadratic penalty on Σ_ij w_ij^2. Thus, large-magnitude weights are penalized and will be shrunk preferentially. In Bayesian terms, take the prior

  p(w) ∝ exp( −γ Σ_ij w_ij^2 ),

where γ > 0 is some parameter. Clearly this prior favors small weights; that is, p(w) is large if Σ_ij w_ij^2 is small. The posterior density of w given the data x is

  p(w|x) = p(x, w)/p(x) ∝ exp( −E_pat − γ Σ_ij w_ij^2 ).

So, maximizing the posterior density is equivalent to minimizing

  E = E_pat + γ Σ_ij w_ij^2.
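The equivalence in parts (a) and (b) can be sketched numerically (Python; the weight values, γ, and the explicit learning rate η are illustrative assumptions): one gradient step on the penalty γ Σ w² alone shrinks each weight by the factor (1 − 2ηγ), which is exactly constant weight decay with ε = 2ηγ.

```python
import numpy as np

w = np.array([1.5, -0.4, 0.02])   # illustrative weights
gamma, eta = 0.05, 0.1            # illustrative penalty strength and learning rate

# Gradient of the penalty term gamma * sum(w^2) is 2 * gamma * w.
w_grad_step = w - eta * 2.0 * gamma * w

# Constant weight decay with epsilon = 2 * eta * gamma.
epsilon = 2.0 * eta * gamma
w_decay = w * (1.0 - epsilon)

print(np.allclose(w_grad_step, w_decay))  # True
```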
PROBLEM SOLUTIONS
249
Given the effective criterion function

J_ef = J + (γ/2) w^t w,

we can compute the derivative ∂J_ef/∂w_i , and thus determine the learning rule. This derivative is

∂J_ef/∂w_i = ∂J/∂w_i + γ w_i,

so a gradient descent step is

w_i(k+1) = w_i(k) − η [∂J/∂w_i + γ w_i(k)] = (1 − ηγ) w_i(k) − η ∂J/∂w_i.

The above corresponds to the description of weight decay, that is, the weights are updated by the standard learning rule and then shrunk by a constant factor, as according to Eq. 38 in the text.
43. Equation 70 in the text computes the inverse of (H + αI) when H is initialized to αI. When the value of α is chosen small, this serves as a good approximation:

[H + αI]^{−1} ≈ H^{−1}.

We write the Taylor expansion of the error function around w* up to second order; the approximation will be exact if the error function is quadratic:

J(w) ≈ J(w*) + (∂J/∂w)|_{w*}^t (w − w*) + ½ (w − w*)^t H (w − w*).

For the regularized criterion the same expansion gives

J_γ(w) ≈ J(w*) + (∂J/∂w)|_{w*}^t (w − w*) + ½ (w − w*)^t H (w − w*) + (γ/2)(w − w*)^t I (w − w*).

Thus we have

J_γ(w) = J(w) + (γ/2) w^t w − γ w^t w* + constant.
We want to find which component of w to set to zero that will lead to the smallest increase in the training error. If we choose the qth component, then δw must satisfy u_q^t δw + w_q = 0, where u_q is the unit vector parallel to the qth axis. Since we want to minimize δJ = ½ δw^t H δw subject to this constraint, we write the Lagrangian

L(δw, λ) = ½ δw^t H δw + λ (u_q^t δw + w_q),

and set its derivatives to zero:

∂L/∂λ  = u_q^t δw + w_q = 0,
∂L/∂δw = H δw + λ u_q = 0.

The second equation gives δw = −λ H^{−1} u_q; substituting into the constraint yields λ [H^{−1}]_qq = w_q, that is, λ = w_q/[H^{−1}]_qq. Therefore

δw = −(w_q/[H^{−1}]_qq) H^{−1} u_q.

For the second part of Eq. 69, we just need to substitute the expression for δw we just found into the expression for δJ given in Eq. 68 in the text:

L_q = ½ (w_q/[H^{−1}]_qq) u_q^t H^{−1} H H^{−1} u_q (w_q/[H^{−1}]_qq)
    = ½ (w_q²/[H^{−1}]_qq²) u_q^t H^{−1} u_q          (since H H^{−1} = I)
    = ½ w_q²/[H^{−1}]_qq.
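The saliency L_q = w_q²/(2 [H^{−1}]_qq) is easy to evaluate directly. The following minimal sketch does so for a made-up 2 × 2 Hessian and weight vector (all numbers are illustrative, not from the text):

```python
# Optimal Brain Surgeon saliencies for a made-up 2x2 Hessian and weight vector
H = [[2.0, 0.5], [0.5, 1.0]]
w = [1.0, -0.5]

# inverse of the 2x2 Hessian
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
Hinv = [[ H[1][1] / det, -H[0][1] / det],
        [-H[1][0] / det,  H[0][0] / det]]

# L_q = w_q^2 / (2 [H^-1]_qq): increase in error if weight q is pruned
saliency = [w[q] ** 2 / (2 * Hinv[q][q]) for q in range(2)]
prune = min(range(2), key=lambda q: saliency[q])
print(prune, [round(s, 4) for s in saliency])   # → 1 [0.875, 0.1094]
```

The weight with the smallest saliency (here the second one) is the one OBS would prune first.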
45. Consider a simple 2–1 network with bias. First let w = (w1, w2)^t be the weights, x = (x1, x2)^t the inputs, and w0 the bias.

(a) Here the error is

E = Σ_{p=1}^{n} E_p = Σ_{p=1}^{n} (t_p − z_p)²,

where p indexes the patterns, t_p is the teaching or desired output, and z_p the actual output. We can rewrite this error as

E = Σ_{x∈D1} [f(w1 x1 + w2 x2 + w0) − 1]² + Σ_{x∈D2} [f(w1 x1 + w2 x2 + w0) + 1]².

To compute the Hessian entries ∂²E/∂w_j ∂w_i, we proceed as follows. The first derivative is

∂E/∂w_i = Σ_{x∈D1} 2 [f(w1 x1 + w2 x2 + w0) − 1] f′(w1 x1 + w2 x2 + w0) x_i
        + Σ_{x∈D2} 2 [f(w1 x1 + w2 x2 + w0) + 1] f′(w1 x1 + w2 x2 + w0) x_i,

and, keeping the terms quadratic in f′, the second derivative is

∂²E/∂w_j ∂w_i = 2 Σ_{x∈D1} {f′(w1 x1 + w2 x2 + w0)}² x_j x_i + 2 Σ_{x∈D2} {f′(w1 x1 + w2 x2 + w0)}² x_j x_i.

Here the net activation is

net = w1 x1 + w2 x2 + w0 = (w0 w1 w2)(1 x1 x2)^t = w̃^t x̃.

Then we have

∂²E/∂w_j ∂w_i = 2 Σ_x u(x) x_i x_j,

where u(x) = {f′(net)}², keeping, as above, only the terms quadratic in f′. So, we have

H = ( Σ_x u(x) x1²     Σ_x u(x) x1 x2 )
    ( Σ_x u(x) x1 x2   Σ_x u(x) x2²   )
  = ( H11  H12 )
    ( H12  H22 ).

The eigenvalues of H satisfy det(H − λI) = 0, that is,

(H11 − λ)(H22 − λ) − H12² = 0,

or

λ² − λ(H11 + H22) + H11 H22 − H12² = 0.

When H12 = 0 this factors as (H11 − λ)(H22 − λ) = 0, and therefore in this case the minimum and maximum eigenvalues are

λ_min = min(H11, H22),   λ_max = max(H11, H22),

so the eigenvalue ratio is

λ_max/λ_min = max(H11, H22)/min(H11, H22),

that is, H11/H22 or H22/H11.
Computer Exercises

Section 6.2
1. Computer exercise not yet solved

Section 6.3
2. Computer exercise not yet solved
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved
7. Computer exercise not yet solved

Section 6.4
8. Computer exercise not yet solved

Section 6.5
9. Computer exercise not yet solved

Section 6.6
10. Computer exercise not yet solved

Section 6.7
11. Computer exercise not yet solved

Section 6.8
12. Computer exercise not yet solved
Chapter 7
Stochastic Methods
Problem Solutions
Section 7.1
1. First, the total number of typed characters in the play is approximately

m = 50 pages × 80 lines/page × 40 characters/line = 160000 characters.

We assume that each character has an equal chance of being typed. Thus, the probability of typing a specific character is r = 1/30.

We assume that the typing of each character is an independent event, and thus the probability of typing any particular string of length m is r^m. Therefore, a rough estimate of the length of the total string that must be typed before one copy of Hamlet appears is 1/r^m = 30^160000 ≈ 2.52 × 10^236339. One year is

365.25 days × 24 hours/day × 60 minutes/hour × 60 seconds/minute = 31557600 seconds.

We are given that the monkey types two characters per second. Hence the expected time needed to type Hamlet under these circumstances is

2.52 × 10^236339 characters × (1 second/2 characters) × (1 year/31557600 seconds) ≈ 4 × 10^236331 years,

which is much larger than 10^10 years, the age of the universe.
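The arithmetic above can be checked on a log10 scale (the variable names below are ours):

```python
from math import log10

m = 50 * 80 * 40                    # pages x lines/page x characters/line
assert m == 160000

log_strings = m * log10(30)         # log10 of 30^160000
# 30^160000 = 10^236339.4 ~ 2.52 x 10^236339
assert abs(log_strings - 236339.4) < 0.1

seconds_per_year = 365.25 * 24 * 60 * 60
assert seconds_per_year == 31557600

# log10 of the expected number of years at 2 characters/second
log_years = log_strings - log10(2) - log10(seconds_per_year)
assert 236331 < log_years < 236332  # i.e. about 4 x 10^236331 years
print(round(log_strings, 1), round(log_years, 1))
```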
Section 7.2
2. Assume we have an optimization problem with a nonsymmetric connection matrix W, where w_ij ≠ w_ji. The energy can be written as

E = −½ Σ_{i,j=1}^{N} w_ij s_i s_j = ½ [ −½ Σ_{i,j=1}^{N} w_ij s_i s_j − ½ Σ_{i,j=1}^{N} w_ji s_i s_j ]
  = −½ Σ_{i,j=1}^{N} [(w_ij + w_ji)/2] s_i s_j,

where the second step follows by interchanging the summation indices i and j. Thus the network behaves as if it had the symmetric connection matrix with entries

(w_ij + w_ji)/2.
At 10^−8 seconds per configuration, the search requires about 1.27 × 10^22 seconds, which is ten thousand times larger than 10^10 years, the age of the universe. For N = 1000, a similar argument gives the time as 3.4 × 10^285 years.
5. We are to suppose that it takes 10^−10 seconds to perform a single multiply-accumulate.

(a) Suppose the network is fully connected. Since the connection matrix is symmetric and w_ii = 0, we can compute E as −Σ_{i>j} w_ij s_i s_j, and thus the number of multiply-accumulate operations needed for calculating E for a single configuration is N(N − 1)/2. We also know that the total number of configurations is 2^N for a network of N binary units. Therefore, the total time t(N) for an exhaustive search on a network of size N (in seconds) can be expressed as

t(N) = 10^−10 seconds/operation × N(N − 1)/2 operations/configuration × 2^N configurations
     = 10^−10 N(N − 1) 2^{N−1} seconds.
[Figure: log t versus log N for t(N) = 10^−10 N(N − 1) 2^{N−1}.]
A day is 86400 seconds, within which an exhaustive search can solve a network of size 20 (because 5.69 × 10^4 ≈ t(20) < 86400 < t(21) ≈ 6.22 × 10^5). A year is 31557600 ≈ 3.16 × 10^7 seconds, within which an exhaustive search can solve a network of size 22 (because 7.00 × 10^6 ≈ t(22) < 3.16 × 10^7 < t(23) ≈ 8.10 × 10^7). A century is 3155760000 ≈ 3.16 × 10^9 seconds, within which an exhaustive search can solve a network of size 24 (because 9.65 × 10^8 ≈ t(24) < 3.16 × 10^9 < t(25) ≈ 1.18 × 10^10).
6. Notice that for any configuration γ, the energy E_γ does not change with T. It is sufficient to show that for any pair of configurations γ and γ′, the ratio of the probabilities P(γ) and P(γ′) goes to 1 as T goes to infinity:

lim_{T→∞} P(γ)/P(γ′) = lim_{T→∞} e^{−E_γ/T}/e^{−E_γ′/T} = lim_{T→∞} e^{−(E_γ − E_γ′)/T}.

Since the energy of a configuration does not change with T, E_γ and E_γ′ are constants, so we have

lim_{T→∞} (E_γ − E_γ′)/T = 0   and hence   lim_{T→∞} e^{−(E_γ − E_γ′)/T} = e^0 = 1,

which means the probabilities of being in different configurations are the same if T is sufficiently high.
7. Let k_uN denote the number of magnets pointing up in the subsystem of N magnets and k_dN denote the number pointing down.

(a) The number of configurations is given by the number of ways of selecting k_uN magnets out of N and letting them point up, that is,

K(N, E_N) = (N choose k_uN).

We immediately have the following set of equations for k_uN and k_dN:

k_uN + k_dN = N
k_uN − k_dN = E_N.

For the subsystem of M magnets the same equations give

k_uM = ½ (M + E_M).

(b) The probabilities of the two states are

P(s = +1) = e^{E_0/T}/Z(T)   and   P(s = −1) = e^{−E_0/T}/Z(T),

respectively. We use the partition function obtained in part (a) to calculate the expected value of the state of the magnet, that is,

E[s] = P(s = +1) − P(s = −1) = (e^{E_0/T} − e^{−E_0/T})/(e^{E_0/T} + e^{−E_0/T}) = tanh(E_0/T).
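The identity E[s] = tanh(E_0/T) can be checked directly against the two-state Boltzmann probabilities; the sample values of E_0 and T below are arbitrary:

```python
from math import exp, tanh

def expected_state(E0, T):
    # two states: s = +1 with energy -E0, s = -1 with energy +E0
    Z = exp(E0 / T) + exp(-E0 / T)
    return (exp(E0 / T) - exp(-E0 / T)) / Z

for E0, T in [(1.0, 0.5), (1.0, 2.0), (0.3, 1.0)]:
    assert abs(expected_state(E0, T) - tanh(E0 / T)) < 1e-12
print("E[s] = tanh(E0/T) verified")
```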
(c) We want to calculate the expected value of the state, E[s]. We assume the other N − 1 magnets produce an average magnetic field. In other words, we consider the magnet in question, i, as if it were in an external magnetic field of the same strength as the average field. Therefore, we can calculate the probabilities in a way similar to that in part (b). First, we need to calculate the energy associated with the states of i. Consider the states of the other N − 1 magnets to be fixed. Let γ_{i+} and γ_{i−} denote the configurations in which s_i = +1 and s_i = −1, respectively, while the other N − 1 magnets take the fixed state values. The energy for configuration γ_{i+} is

E_{i+} = −½ Σ_{j,k=1}^{N} w_jk s_j s_k = −½ Σ_{k,j≠i} w_kj s_j s_k − Σ_{j=1}^{N} w_ij s_j,

since s_i = +1. Recall that the weight matrix is symmetric, that is, w_ki = w_ik, and thus we can write

Σ_{k=1}^{N} w_ki s_k = Σ_{k=1}^{N} w_ik s_k = Σ_{j=1}^{N} w_ij s_j.

Writing l_i = Σ_j w_ij s_j and C = −½ Σ_{j,k≠i} w_jk s_j s_k, we have

E_{i+} = −l_i + C,

and, in exactly the same way for s_i = −1,

E_{i−} = +l_i + C.

Next, we can apply the results from parts (a) and (b). The partition function is thus

Z(T) = e^{−E_{i+}/T} + e^{−E_{i−}/T}.

Therefore, according to Eq. 2 in the text, the expected value of the state of magnet i in the average field is

E[s_i] = (e^{−E_{i+}/T} − e^{−E_{i−}/T})/Z(T) = tanh(l_i/T).
9. The two input nodes are indexed as 1 and 2, the outputs as 3 (and 4), and the hidden unit as 5, as shown in the figure. Since the input units are supposed to have fixed state values, we can ignore the connection between the two input nodes. For this assignment, we use a version of Boltzmann networks with biases; in other words, there is a node with constant value +1 in the network.

Next, we are going to determine the set of connection weights between pairs of units. In what follows, we use a vector of state values of all units to indicate the corresponding configuration.
(a) For the exclusive-OR problem, the input-output relationship is as follows:

Input 1   Input 2   Output (3)
  +1        +1         −1
  +1        −1         +1
  −1        +1         +1
  −1        −1         −1

Since the Boltzmann network always converges to a minimum energy configuration, it solves the exclusive-OR problem if and only if the following inequalities hold:

E{+1,+1,−1} < E{+1,+1,+1}
E{+1,−1,+1} < E{+1,−1,−1}
E{−1,+1,+1} < E{−1,+1,−1}
E{−1,−1,−1} < E{−1,−1,+1},

or equivalently,

w13 + w23 + w03 < 0          (1)
w23 < w13 + w03              (2)
w13 < w23 + w03              (3)
w03 < w13 + w23              (4)

From (1) + (4), we have w03 < 0, and from (2) + (3), we have 0 < w03, which is a contradiction. Therefore, a network of the specified form cannot solve the exclusive-OR problem.
(b) For the two-output network, the input-output relationship is as follows:

Input 1   Input 2   Output 3   Output 4
  +1        +1        −1         +1
  +1        −1        +1         −1
  −1        +1        +1         −1
  −1        −1        −1         +1

The network solves the problem if and only if, for each input pair, the configuration with the correct outputs has lower energy than the three configurations with incorrect outputs:

E{+1,+1,−1,+1} < E{+1,+1,−1,−1},  E{+1,+1,+1,−1},  E{+1,+1,+1,+1}
E{+1,−1,+1,−1} < E{+1,−1,−1,−1},  E{+1,−1,−1,+1},  E{+1,−1,+1,+1}
E{−1,+1,+1,−1} < E{−1,+1,−1,−1},  E{−1,+1,−1,+1},  E{−1,+1,+1,+1}
E{−1,−1,−1,+1} < E{−1,−1,−1,−1},  E{−1,−1,+1,−1},  E{−1,−1,+1,+1},

or equivalently, twelve linear inequalities (1)–(12) in the weights w13, w23, w14, w24, w34, w03 and w04. From (2) + (11), we have the constraint w03 < w04, and from (5) + (8), we have w04 < w03. These two inequalities form a contradiction, and therefore the network cannot solve the exclusive-OR problem.
(c) We consider the assignments as follows:

Input 1   Input 2   Output 3   Output 4   Hidden 5
  +1        +1        −1         +1         +1
  +1        −1        +1         −1         −1
  −1        +1        +1         −1         −1
  −1        −1        −1         +1         −1

For the network to solve the exclusive-OR problem, the target configuration for each input pair must have lower energy than every other configuration of the output and hidden units; this gives a set of 28 inequalities of the form E{s1, s2, s3, s4, s5} < E{s1, s2, s3′, s4′, s5′}.

Consider the set of weights w05 = w35 = 1, w15 = w25 = 1, w03 = 1/2, w13 = w23 = w45 = 1/2, w14 = w24 = 1/4, w04 = 1/4, w34 = 0. With these values, each of the 28 inequalities, numbered (1)–(28), reduces to a numerical inequality between simple fractions.
Since all these inequalities hold, the network can indeed implement the exclusiveOR problem.
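The energy-minimization argument can also be tested by brute force. The weights below are a sign assignment we chose ourselves for illustration (they are not the values quoted in the text); with them, minimizing the energy over the free hidden and output units reproduces the exclusive-OR truth table:

```python
from itertools import product

# Units: 0 = bias (clamped +1), 1,2 = inputs, 3 = hidden, 4 = output.
# These weights are an illustrative choice, not the ones from the text.
w = {(0, 3): -1.0, (1, 3): 1.0, (2, 3): 1.0,
     (0, 4): -0.5, (1, 4): 0.5, (2, 4): 0.5,
     (3, 4): -1.0}

def energy(s):
    # E = - sum over connected pairs of w_ij s_i s_j
    return -sum(wij * s[i] * s[j] for (i, j), wij in w.items())

def settle(x1, x2):
    # clamp bias and inputs; minimize energy over hidden and output states
    best = min(product([-1, 1], repeat=2),
               key=lambda hz: energy((1, x1, x2) + hz))
    return best[1]                      # state of the output unit

for x1, x2 in product([-1, 1], repeat=2):
    assert settle(x1, x2) == -x1 * x2   # XOR in the +/-1 encoding
print("energy minima implement XOR")
```

In the ±1 encoding, XOR(x1, x2) = −x1·x2, which is what the assertion checks.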
10. As with the solution to Problem 9, we consider networks with biases (that is, a constant unit). The two input nodes are indexed as 1 and 2, the output is 4, and the hidden unit is 3, as shown in the figure. The assignments are:

Input 1   Input 2   Hidden (3)   Output (4)
  +1        +1         +1           −1
  +1        −1         −1           +1
  −1        +1         −1           +1
  −1        −1         −1           −1

For the network to solve the exclusive-OR problem, the target configuration for each input pair must have lower energy than the three configurations in which the hidden and output units take other values.

Consider the set of weights w03 = w34 = 1, w13 = w23 = 1, w04 = 1/2, and w14 = w24 = 1/2. Evaluating the energies E{s1, s2, s3, s4} of the sixteen configurations with these weights shows that every one of the required inequalities is satisfied.
So, the set of inequalities holds, and the network can indeed implement the solution to the exclusive-OR problem.
Section 7.3
11. We use the definition of the Kullback-Leibler divergence from Eq. 12 in the text:

D_KL(Q(α^o|α^i), P(α^o|α^i)) = Σ_i Q(α^i) Σ_o Q(α^o|α^i) log [Q(α^o|α^i)/P(α^o|α^i)].

By the definition of conditional probability,

Q(α^o|α^i) = Q(α^o, α^i)/Q(α^i)   and   P(α^o|α^i) = P(α^o, α^i)/P(α^i).

Substituting these yields

D_KL(Q(α^o|α^i), P(α^o|α^i))
  = Σ_i Σ_o Q(α^o, α^i) log [Q(α^o, α^i) P(α^i)/(P(α^o, α^i) Q(α^i))]
  = Σ_{i,o} Q(α^o, α^i) log [Q(α^o, α^i)/P(α^o, α^i)] − Σ_i Q(α^i) log [Q(α^i)/P(α^i)]
  = D_KL(Q(α^o, α^i), P(α^o, α^i)) − D_KL(Q(α^i), P(α^i)),

where we used Σ_o Q(α^o, α^i) = Q(α^i). Differentiating,

∂D_KL(Q(α^o|α^i), P(α^o|α^i))/∂w_ij = ∂D_KL(Q(α^o, α^i), P(α^o, α^i))/∂w_ij − ∂D_KL(Q(α^i), P(α^i))/∂w_ij.

Then we can use Eqs. 8, 9 and 10 in the text to get the learning rule:

Δw_ij = (η/T) [ (E_Q[s_i s_j]_{α^i α^o clamped} − E[s_i s_j]_{free}) − (E_Q[s_i s_j]_{α^i clamped} − E[s_i s_j]_{free}) ]
      = (η/T) [ E_Q[s_i s_j]_{α^i α^o clamped} − E_Q[s_i s_j]_{α^i clamped} ],

where the clamped expectations are estimated from the K training patterns as (1/K) Σ_{k=1}^{K} s_i(x^k) s_j(x^k).
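The decomposition D_KL(Q(α^o|α^i), P(α^o|α^i)) = D_KL(Q(α^o, α^i), P(α^o, α^i)) − D_KL(Q(α^i), P(α^i)) can be verified numerically on random joint distributions; the sizes and seed below are arbitrary choices:

```python
import random, math
random.seed(0)

ni, no = 3, 4   # number of input and output configurations

def rand_joint():
    # a random joint distribution over (alpha_i, alpha_o)
    m = [[random.random() for _ in range(no)] for _ in range(ni)]
    s = sum(map(sum, m))
    return [[x / s for x in row] for row in m]

Q, P = rand_joint(), rand_joint()
Qi = [sum(row) for row in Q]   # marginal Q(alpha_i)
Pi = [sum(row) for row in P]

def kl(q, p):
    return sum(a * math.log(a / b) for a, b in zip(q, p))

# left side: conditional KL divergence, averaged over Q(alpha_i)
lhs = sum(Qi[i] * kl([Q[i][o] / Qi[i] for o in range(no)],
                     [P[i][o] / Pi[i] for o in range(no)])
          for i in range(ni))

# right side: joint KL minus input-marginal KL
flatQ = [x for row in Q for x in row]
flatP = [x for row in P for x in row]
rhs = kl(flatQ, flatP) - kl(Qi, Pi)

assert abs(lhs - rhs) < 1e-9
print("decomposition verified")
```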
12. [Connection matrix W: symmetric with zero diagonal, computed from the stored patterns.]
(a) First, we determine the energy change due to the perturbation. Let x^{kl} denote the pattern obtained by perturbing unit l in the pattern x^k. According to Eq. 1, the energy change is

ΔE(x^k, l) = E_{x^{kl}} − E_{x^k}
  = −½ Σ_{ij} w_ij s_i(x^{kl}) s_j(x^{kl}) + ½ Σ_{ij} w_ij s_i(x^k) s_j(x^k)
  = −½ Σ_{ij} w_ij [s_i(x^{kl}) s_j(x^{kl}) − s_i(x^k) s_j(x^k)].

Since each unit takes only the two values ±1, we have s_l(x^{kl}) = −s_l(x^k). We also have s_i(x^{kl}) = s_i(x^k) for i ≠ l, since we only perturb one unit at a time. So, we can cancel most of the terms in the above summation, leaving

ΔE(x^k, l) = 2 s_l(x^k) Σ_{i≠l} w_il s_i(x^k).
Evaluating this expression for the stored patterns gives energy changes of magnitude 2/3 and 14/3.
Note also that negating all the units leaves the energy unchanged:

E_{−x} = −½ Σ_{ij} w_ij s_i(−x) s_j(−x) = −½ Σ_{ij} w_ij (−s_i(x))(−s_j(x)) = −½ Σ_{ij} w_ij s_i(x) s_j(x) = E_x.
[Connection matrix W for this case: symmetric with zero diagonal, computed from the stored patterns.]
We thus need merely to check whether the patterns give local minima in energy. Evaluating ΔE(x^k, l) for l = 1, . . . , 8 and for each of the patterns x^1, x^2 and x^3 yields entries of magnitude 2/3, 14/3, 22/3 and 26/3.
The conditional probability of interest is

P(s_k = s_k′ | α^d) = P(α^d, s_k = s_k′)/P(α^d) = Σ_β P(α^d, s_k = s_k′, β)/P(α^d).

Since the pattern {α^d, s_k = s_k′, β*} gives the minimal energy reached by annealing, supposing there is only one global minimum corresponding to the clamped α^d, we have for any β ≠ β* that

E{α^d, s_k = s_k′, β*} < E{α^d, s_k = s_k′, β},

and, since the probabilities are proportional to e^{−E/T},

lim_{T→0} P(α^d, s_k = s_k′, β ≠ β*)/P(α^d, s_k = s_k′, β*) = 0.

So, for any 0 < ε < 1/2, there exists a temperature T0 such that if T < T0,

P(α^d, s_k = s_k′, β* | α^d) > 1 − ε.

Therefore, under the same conditions we have the two inequalities

Σ_{β≠β*} P(α^d, s_k = s_k′, β | α^d) < ε   and   Σ_β P(α^d, s_k = −s_k′, β | α^d) < ε.

Therefore the ratio of the probability of s_k = s_k′ to that for s_k = −s_k′, all given the deficient information α^d, is

P(s_k = s_k′ | α^d)/P(s_k = −s_k′ | α^d) > (1 − ε)/ε > 1.
The variance of the net activation is

Var[l_i] = Var[Σ_{j=1}^{N} w_ij s_j] = Σ_{j=1}^{N} Var[w_ij s_j] = N Var[w_ij s_j].

Since P(s_j = +1) = P(s_j = −1) = 0.5, the expected value of s_j is E[s_j] = 0, and s_j² = 1. Therefore, we have

Var[w_ij s_j] = E[w_ij²],

and thus

Var[l_i] = N E[w_ij²].

We seek to have the variance of the response be 1, that is, Var[l_i] = 1, which implies E[w_ij²] = 1/N. If we randomly initialize w_ij according to a uniform distribution on [−h, h], we have

E[w_ij²] = ∫_{−h}^{h} (1/2h) t² dt = (1/2h)(t³/3)|_{−h}^{h} = h²/3.

Under these conditions, then,

h²/3 = 1/N,   that is,   h = √(3/N).

So, the weights are initialized uniformly in the range −√(3/N) < w_ij < +√(3/N).
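The choice h = √(3/N) can be verified by simulation; the values of N and the number of trials below are arbitrary:

```python
import math, random
random.seed(1)

N = 100
h = math.sqrt(3.0 / N)            # proposed initialization range
trials = 20000

samples = []
for _ in range(trials):
    w = [random.uniform(-h, h) for _ in range(N)]
    s = [random.choice((-1, 1)) for _ in range(N)]
    samples.append(sum(wi * si for wi, si in zip(w, s)))   # net activation l_i

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
assert abs(var - 1.0) < 0.06      # Var[l_i] should be close to 1
print("empirical variance:", round(var, 2))
```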
18. We consider the problem of setting the learning rate for a Boltzmann network having N units.

(a) In the solution to Problem 11, we showed that the Kullback-Leibler divergence obeys

D_KL(Q(α^o|α^i), P(α^o|α^i)) = D_KL(Q(α^i, α^o), P(α^i, α^o)) − D_KL(Q(α^i), P(α^i)).

As such, we need merely derive the result for D_KL. From Eq. 7 in the text, we obtain

D_KL(Q(α), P(α)) = Σ_α Q(α) log [Q(α)/P(α)] = Σ_α Q(α) log Q(α) − Σ_α Q(α) log P(α).

Therefore, we have

∂D_KL/∂w_uv = −Σ_α [Q(α)/P(α)] ∂P(α)/∂w_uv,

and the second derivative is

∂²D_KL/∂w_ij ∂w_uv = Σ_α [Q(α)/P(α)²] [∂P(α)/∂w_ij][∂P(α)/∂w_uv] − Σ_α [Q(α)/P(α)] ∂²P(α)/∂w_ij ∂w_uv.

(b) If the network has sufficient parameters then, as the number of training examples goes to infinity, the weights converge to the optimal ones, w*, where P(α) = Q(α). In this limit, the second term in the above expression can be omitted, because

Σ_α [Q(α)/P(α)] ∂²P(α)/∂w_ij ∂w_uv = Σ_α ∂²P(α)/∂w_ij ∂w_uv = ∂²[Σ_α P(α)]/∂w_ij ∂w_uv = ∂²1/∂w_ij ∂w_uv = 0.

For a Boltzmann network at temperature T,

[1/P(α)] ∂P(α)/∂w_ij = (1/T)[s_i(α) s_j(α) − E[s_i s_j]],

so the surviving term becomes

Σ_α [Q(α)/P(α)²] [∂P/∂w_ij][∂P/∂w_uv]
  = (1/T²) Σ_α Q(α) (s_i(α) s_j(α) − E[s_i s_j])(s_u(α) s_v(α) − E[s_u s_v])
  = (1/T²) (E_Q[s_i s_j s_u s_v] − E_Q[s_i s_j] E[s_u s_v] − E[s_i s_j] E_Q[s_u s_v] + E[s_i s_j] E[s_u s_v]),

and, with Q = P at the optimum,

  = (1/T²) (E[s_i s_j s_u s_v] − E[s_i s_j] E[s_u s_v]).

Thus the Hessian of D_KL is H = (1/T²) E[(s′ − E[s′])(s′ − E[s′])^t], where s′ collects the pair products s_i s_j.

(c) The curvature along a unit-length weight vector w is

w^t H w = (1/T²) w^t E[(s′ − E[s′])(s′ − E[s′])^t] w
        = (1/T²) E[(w^t s′ − w^t E[s′])²]
        = (1/T²) (E[(w^t s′)²] − (w^t E[s′])²)
        ≤ (1/T²) E[(w^t s′)²]
        = (1/T²) E[(Σ_{ij} w_ij s_i s_j)²]
        ≤ (1/T²) (Σ_{ij} |w_ij|)².

Using the inequality (a_1 + a_2 + ··· + a_n)/n ≤ √[(a_1² + ··· + a_n²)/n] with n = N(N − 1) terms and Σ_{ij} w_ij² = 1, we get

(Σ_{ij} |w_ij|)² ≤ N(N − 1) Σ_{ij} w_ij² = N(N − 1),

and hence

w^t H w ≤ N(N − 1)/T².

(d) The optimal learning rate, η, is the inverse of the curvature, and hence

η = 1/(w^t H w) ≥ T²/(N(N − 1)).
Section 7.4

19. It is sufficient to show that A_ij and B_jk do not have a property which can be translated into the constraint that the probabilities sum to 1. For the transition and emission probabilities a_ij and b_jk in an HMM, we have

Σ_j a_ij = 1,   Σ_k b_jk = 1.

In the Boltzmann chain the corresponding quantities are a_ij = e^{A_ij/T} and b_jk = e^{B_jk/T}. Therefore, if there were an HMM equivalent for every Boltzmann chain, A_ij and B_jk would have to satisfy the constraints

Σ_j e^{A_ij/T} = 1,   Σ_k e^{B_jk/T} = 1,

which do not hold for arbitrary weights.
20. Since at each time step exactly one hidden unit and one visible unit are on, the number of configurations for each time step is ck, and the total number of legal paths is (ck)^{Tf}.
21. With a known initial (t = 1) hidden unit, the Boltzmann chain is annealed with all visible units and the known initial hidden unit clamped to +1, and the other initial hidden units clamped to 0. If the initial hidden unit is unknown, we can hypothetically add an always-on hidden unit at t = 0 and connect it to all hidden units at t = 1, with connection weights C_i. Since in a Boltzmann chain exactly one hidden unit and one visible unit can be on, it follows that exactly one initial (t = 1) hidden unit will be on, which is consistent with the semantics of Hidden Markov Models. The generalized energy then is

E^V = E[ω^{Tf}, V^{Tf}] = −C_i − Σ_{t=1}^{Tf−1} A_ij − Σ_{t=1}^{Tf} B_jk,

where C_i indicates the connection weight between the constant unit and the initial hidden unit which is on.
22. In a Boltzmann zipper, the hidden units are clustered into cells. For example, in the zipper shown in Fig. 7.12 in the text, the fast hidden units at the first two time steps of the fast chain and the slow hidden units at the first time step of the slow chain are interconnected through E, forming a cell consisting of three groups of hidden units. Since we require that exactly one hidden unit be on in each chain, three units in a cell are on after annealing, one in each group, and they are all interconnected. Therefore, the connection matrix E cannot be simply related to transition probabilities, for two reasons. First, since each chain represents behavior at a different time scale, there is no time ordering between the on-unit in the slow chain and the two on-units in the fast chain in the same cell (otherwise the slow chain could be squeezed into the fast chain, forming a single chain). However, it is necessary to have explicit time ordering for a state-transition interpretation to make sense. Second, even if we could impose a time ordering on the on-units, the first on-unit in the ordering is connected to two other on-units, which cannot have a state-transition interpretation, where only a single successor state is allowed.

As an extreme case, an all-zero E means the two chains are not correlated. This is completely meaningful, and E is not normalized. The weights in E can also be negative, to indicate a negative or suppressive correlation.
Section 7.5
23. We address the problem of a population of size L of N-bit chromosomes.

(a) The total number of different N-bit chromosomes is 2^N. So the number of different populations of size L is the number of different ways of selecting L chromosomes out of 2^N possible choices, allowing repetition. Since the number of combinations of selecting n items out of m types (n < m) with repetition is (m+n−1 choose n), the number of different populations of size L is

(2^N + L − 1 choose L) = (L + 2^N − 1 choose 2^N − 1).
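The count can be checked against direct enumeration for small N and L; the values below are arbitrary:

```python
from math import comb
from itertools import combinations_with_replacement

N, L = 3, 4                         # 2**N = 8 distinct chromosomes
formula = comb(2**N + L - 1, L)     # multisets of size L drawn from 2**N types
brute = sum(1 for _ in combinations_with_replacement(range(2**N), L))
assert formula == brute
print(formula)   # → 330
```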
This count can also be obtained from a recurrence. Let B(n, k) denote the number of ways of writing n as an ordered sum of k positive integers. Then

B(n, 1) = 1,
B(k, k) = 1,
B(n, k) = Σ_{i=1}^{n−k+1} B(n − i, k − 1).

The first two equations are obvious. The third one is due to the fact that the last element can take values ranging from 1 to n − (k − 1), and thus

B(n, k) = Σ_{i=1}^{n−k+1} B(n − i, k − 1)
        = B(n − 1, k − 1) + Σ_{i=2}^{n−k+1} B(n − i, k − 1)
        = B(n − 1, k − 1) + Σ_{j=1}^{n−k} B(n − 1 − j, k − 1)
        = B(n − 1, k − 1) + B(n − 1, k),

which is Pascal's recurrence, so B(n, k) = (n−1 choose k−1). Summing over the number k of distinct chromosomes that appear in a population of n, by Vandermonde's identity,

Σ_{k=1}^{n} (m choose k)(n−1 choose n−k) = (m+n−1 choose n),

in agreement with the result above.
(b) Assume each chromosome has a distinct fitness score. Since the chromosomes with the L_s highest fitness scores are selected as parents, the number of all possible sets of parents is equivalent to the number of ways of selecting L_s chromosomes from all possible chromosomes, excluding those with the lowest L − L_s fitness scores. Therefore, the number is

(L_s + (2^N − (L − L_s)) − 1 choose L_s) = (2^N − L + 2L_s − 1 choose 2^N − L + L_s − 1).

(c) If L_s = L, the above expression is simply

(2^N + L − 1 choose 2^N − 1),

the same as the result in part (a).

(d) If L_s = 1, the expression is

(2^N − L + 1 choose 2^N − L) = 2^N − L + 1.
Section 7.6
24. See figure. The four panels a)–d) show candidate program trees built from operators such as SQRT, SIN and TAN, constants such as 3.4, and the variables X0, X4, X5 and X8.
Computer Exercises
Section 7.2
1. Computer exercise not yet solved
2. Computer exercise not yet solved
3. Computer exercise not yet solved
Section 7.3
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved
Section 7.4
7. Computer exercise not yet solved
Section 7.5
8. Computer exercise not yet solved
Section 7.6
9. Computer exercise not yet solved
Chapter 8
Nonmetric Methods
Problem Solutions
Section 8.2
1. Assume a particular query occurs more than once along some path through a given tree. The branch taken through the uppermost (closest to the root) occurrence of this query must also be taken in all subsequent occurrences along this path. As a result, the branch(es) not taken in the uppermost instance of the query may be pruned from lower occurrences of that query along the path without altering the classification of any pattern. After such pruning, the lower occurrences of the query are left with only a single child node. At this point, these queries serve no purpose and may be deleted by directly connecting their parent node to their child.

By repeatedly applying the above procedure to every possible path through a tree and to every query which is duplicated along a path, we can convert any given tree to an equivalent tree in which each path consists of distinct queries.
Section 8.3
2. Consider a nonbinary tree, where the number of branches at nodes can vary
throughout the tree.
(a) Suppose our tree has root node R, with (local) branching ratio B. Each of the
B children is itself a root of a subtree having arbitrary depth. If we can prove
that we can replace the root node with binary nodes, then by induction we can
do it for its children the roots of the subtrees. In this way, an arbitrary tree
can be expressed as a binary tree.
We represent a tree with a root node R and B children nodes as {R, (a_1, a_2, . . . , a_B)}.

B = 2: This is already a binary tree.

B = 3: The tree {R, (a_1, a_2, a_3)} can be represented as a binary tree {R, (a_1, R_2)}, where the subtree is {R_2, (a_2, a_3)}.

B = k ≥ 3: We can keep replacing each root of the subtree as follows: {R, (a_1, a_2, . . . , a_k)} becomes {R, (a_1, R_2)}, with subtrees {R_2, (a_2, R_3)}, . . . , {R_{k−1}, (a_{k−1}, a_k)}. By induction, then, any tree can be replaced by an equivalent binary tree.

While the above shows that any node with B ≥ 2 can be replaced by nodes with binary decisions, we can apply this to all nodes in the expanded tree, and thereby make the entire tree binary.
(b) The number of levels depends on the number of classes. If the number of classes is 2, then the functionally equivalent tree is of course 2 levels in depth. If the number of categories is c, we could create a tree with c levels, though this is not needed. Instead we can split the root node, sending c/2 categories to the left and c/2 categories to the right. Likewise, each of these can send c/4 to the left and c/4 to the right. Thus a tight upper bound is log₂ c levels.

(c) The lower bound is 3 and the upper bound is 2B − 1.
3. Problem not yet solved
4. Problem not yet solved
5. We use the entropy impurity given in Eq. 1,

i(N) = −Σ_i P(ω_i) log₂ P(ω_i) = H(ω).

(a) After splitting on a binary feature F ∈ {R, L}, the weighted impurity at the two child nodes is

P(L) i(L) + P(R) i(R) = −P(L) Σ_i P(ω_i|L) log₂ P(ω_i|L) − P(R) Σ_i P(ω_i|R) log₂ P(ω_i|R)
  = −Σ_i P(ω_i, L) log₂ [P(ω_i, L)/P(L)] − Σ_i P(ω_i, R) log₂ [P(ω_i, R)/P(R)]
  = −Σ_{i,F} P(ω_i, F) log₂ P(ω_i|F),

which is the conditional entropy H(ω|F).
(b) Note that a split can produce a child node with higher impurity than its parent. For example, at the (x2 < 0.61) node in the upper tree of Example 1, the impurity is 0.65, while that at the left child is 0.0 and that at the right is 1.0. The left branch is taken 2/3 of the time, however, so the weighted impurity at the children is (2/3) · 0 + (1/3) · 1 = 0.33. Similarly, at the (x1 < 0.6) node of the lower tree, the right child has higher impurity (0.92) than the parent (0.76), but the weighted average at the children is (2/3) · 0 + (1/3) · 0.92 ≈ 0.31. In each case, the reduction in impurity is between 0 and 1 bit, as required.

(c) For B-way branches, we have 0 ≤ Δi(N) ≤ log₂(B) bits.
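The identity of part (a) — that the weighted impurity of the children equals the conditional entropy H(ω|F) — can be checked numerically; the joint probabilities below are made up for illustration:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# joint distribution P(omega_i, F) over 2 classes x {L, R}; made-up numbers
joint = {('w1', 'L'): 0.4, ('w2', 'L'): 0.1,
         ('w1', 'R'): 0.1, ('w2', 'R'): 0.4}
PL = joint[('w1', 'L')] + joint[('w2', 'L')]
PR = 1 - PL

# weighted impurity at the two children
iL = H([joint[('w1', 'L')] / PL, joint[('w2', 'L')] / PL])
iR = H([joint[('w1', 'R')] / PR, joint[('w2', 'R')] / PR])
weighted = PL * iL + PR * iR

# conditional entropy H(omega|F), computed directly from the joint
cond = -sum(v * log2(v / (PL if f == 'L' else PR))
            for (w, f), v in joint.items())

assert abs(weighted - cond) < 1e-12
print(round(weighted, 4))
```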
6. Problem not yet solved
7. Problem not yet solved
8. There are four attributes, {a1, a2, a3, a4}, to be used in our decision tree.

(a) To select the query at the root node, we investigate queries on each of the four attributes. The following shows the number of patterns sent to the left and right for each query, and the entropy at the resulting children nodes:

query   sent left   left entropy   sent right   right entropy
a1      2ω1, 2ω2    1              2ω1, 2ω2     1
a2      2ω1, 2ω2    1              2ω1, 2ω2     1
a3      0ω1, 2ω2    0              4ω1, 2ω2     0.9183
a4      2ω1, 3ω2    0.9710         2ω1, 1ω2     0.9183

The query on a3 yields the purest children and is placed at the root; continuing in this way gives the tree shown in the figure, with a1, a2 and a4 at the lower nodes. The two categories correspond to the logical functions

(a3 AND NOT a1) OR (a3 AND a1 AND NOT a2 AND NOT a4) OR (a3 AND a1 AND a2 AND a4)
  = a3 AND (NOT a1 OR (a1 AND ((NOT a2 AND NOT a4) OR (a2 AND a4))))

and its complement

NOT a3 OR (a3 AND a1 AND NOT a2 AND a4) OR (a3 AND a1 AND a2 AND NOT a4)
  = NOT a3 OR ((a3 AND a1) AND ((NOT a2 AND a4) OR (a2 AND NOT a4))).
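The entropy values in the table of part (a) can be recomputed from the class counts; this is a minimal sketch, and the helper name is ours:

```python
from math import log2

def entropy(n1, n2):
    # entropy in bits of a node with n1 patterns of omega_1 and n2 of omega_2
    n = n1 + n2
    s = sum(c / n * log2(c / n) for c in (n1, n2) if c)
    return round(-s + 0.0, 4)

# class counts (omega_1, omega_2) sent left/right by each candidate query
splits = {'a1': ((2, 2), (2, 2)), 'a2': ((2, 2), (2, 2)),
          'a3': ((0, 2), (4, 2)), 'a4': ((2, 3), (2, 1))}
for q, (left, right) in splits.items():
    print(q, entropy(*left), entropy(*right))
```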
12. Problem not yet solved
13. Problem not yet solved
14. Problem not yet solved
15. Problem not yet solved
16.
(a) To select the query at the root node, we investigate queries on each of the four attributes. The following shows the number of patterns sent to the left and right for each query, and the entropy at the resulting children nodes:

query   sent left   left entropy   sent right   right entropy
a1      2ω1, 1ω2    0.9183         2ω1, 3ω2     0.9710
a2      3ω1, 0ω2    0              1ω1, 4ω2     0.7219
a3      2ω1, 1ω2    0.9183         2ω1, 3ω2     0.9710
a4      3ω1, 2ω2    0.9710         1ω1, 2ω2     0.9183

The query on a2 yields the purest children and is placed at the root; the resulting tree (see figure) uses a1, a4 and a3 at the lower nodes.
(b) To select the query at the root node, we investigate queries on each of the four attributes. The following shows the number of patterns sent to the left and right for each query, and the entropy at the resulting children nodes:

query   sent left   left entropy   sent right   right entropy
a1      4ω1, 1ω2    0.7219         4ω1, 3ω2     0.9852
a2      6ω1, 0ω2    0              2ω1, 4ω2     0.9183
a3      4ω1, 1ω2    0.7219         4ω1, 3ω2     0.9852
a4      6ω1, 2ω2    0.8113         2ω1, 2ω2     1

Again the query on a2 yields the purest children and is placed at the root, with a3 at the next level (see figure).
(a) Consider the following string (and shift positions):

    c  a  c  c  a  c  b  a  c
 1  2  3  4  5  6  7  8  9 10

The last-occurrence function gives F(a) = 9, F(b) = 8, F(c) = 10, and 0 otherwise. Likewise, the good-suffix function gives G(c) = 7, G(ac) = 6, G(bac) = 0, and 0 otherwise.

(b) Consider the following string (and shift positions):

 a  b  a  b  a  b  c  b  c  b  a  a  a  b  c  b  a  a
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

The last-occurrence function gives F(a) = 18, F(b) = 16, F(c) = 15, and 0 otherwise. Likewise, the good-suffix function gives G(a) = 17, G(aa) = 12, G(baa) = 10, G(cbaa) = 9, G(bcbaa) = 8, G(abcbaa) = 0, and 0 otherwise.

(c) Consider the following string (and shift positions):

 c  c  c  a  a  a  b  a  b  a  c  c  c
 1  2  3  4  5  6  7  8  9 10 11 12 13

The last-occurrence function gives F(a) = 10, F(b) = 9, F(c) = 13, and 0 otherwise. Likewise, G(c) = 12, G(cc) = 11, G(ccc) = 1, G(accc) = 0, and 0 otherwise.

(d) Consider the following string (and shift positions):

 a  b  b  a  b  b  a  b  b  c  b  b  a  b  b  c  b  b  a
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

The last-occurrence function gives F(a) = 19, F(b) = 18, F(c) = 16, and 0 otherwise. Likewise, G(a) = 13, G(ba) = 12, G(bba) = 11, G(cbba) = 10, G(bcbba) = 9, G(bbcbba) = 8, G(abbcbba) = 7, G(babbcbba) = 6, G(bbabbcbba) = 5, and 0 otherwise.
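The last-occurrence function F used throughout these parts can be sketched as follows; we verify it on the string of part (b):

```python
def last_occurrence(text, alphabet="abc"):
    # F(x): 1-based index of the rightmost occurrence of x in text, 0 if absent
    return {ch: max((i + 1 for i, c in enumerate(text) if c == ch), default=0)
            for ch in alphabet}

s = "abababcbcbaaabcbaa"        # the string of part (b)
F = last_occurrence(s)
assert F == {'a': 18, 'b': 16, 'c': 15}
print(F)
```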
20. We use the information from Fig. 8.8 in the text.

(a) The pattern "estimates" (of length 9) is sought in the text "probabilities for estimates". The number of comparisons at each shift is 1, except for one shift requiring 3 comparisons and the final, matching shift requiring 9.

For the Boyer-Moore algorithm, the first shift is 9 − F(i) = 5, the second shift is 9 − F(' ') = 9, and the last shift is 9 − F(m) = 4. The total number of character comparisons is 12, as shown at the lower right of the figure.
21. Here the test string is "abcca".

(a) Naive string matching progresses left to right. Each letter comparison is indicated by an underline in the figure, and the number of comparisons at each shift is shown at the right (5, 1, 1, 1, 1, 1, 3, 1, 2, 1, 3). The naive string-matching algorithm thus requires 20 letter comparisons in this case, while the Boyer-Moore algorithm requires 12.
(b) By an analysis similar to that in part (a), we find that the naive string-matching algorithm requires 16 letter comparisons in this case. The Boyer-Moore string-matching algorithm requires 5 letter comparisons (per-shift counts 1, 2, 2).
(c) By an analysis similar to that in part (a), we find that the naive string-matching algorithm requires 19 letter comparisons in this case. The Boyer-Moore string-matching algorithm requires 7 letter comparisons (per-shift counts 1, 3, 1, 1, 1).
(d) By an analysis similar to that in part (a), we find that the naive string-matching algorithm requires 18 letter comparisons in this case. The Boyer-Moore string-matching algorithm requires 4 letter comparisons (per-shift counts 1, 1, 2).
(e) By an analysis similar to that in part (a), we find that the naive string-matching algorithm requires 14 letter comparisons when searching the text "bbccacbccabbcca". The Boyer-Moore string-matching algorithm requires 18 letter comparisons (per-shift counts 5, 1, 1, 5, 1, 1, 5). Note that in this unusual case, the naive string-matching algorithm is more efficient than the Boyer-Moore algorithm.
284
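The comparison counts above can be checked mechanically. The sketch below is an addition for verification, not part of the original solution: it counts letter comparisons for the naive algorithm and for a simplified Boyer-Moore that uses only the bad-character (last-occurrence) heuristic, so its counts can differ slightly from the text's, which also applies the good-suffix heuristic. The function names are assumptions.

```python
def naive_comparisons(pattern, text):
    """Count letter comparisons made by naive left-to-right matching."""
    m, n, count = len(pattern), len(text), 0
    for s in range(n - m + 1):            # every possible shift
        for j in range(m):
            count += 1                     # one letter comparison
            if text[s + j] != pattern[j]:
                break
    return count

def bad_character_comparisons(pattern, text):
    """Count comparisons for Boyer-Moore with the bad-character rule only."""
    m, n, count = len(pattern), len(text), 0
    last = {c: i for i, c in enumerate(pattern)}  # rightmost occurrence
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:                      # compare right to left
            count += 1
            if text[s + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            s += 1                         # full match; slide by one
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return count

# Part (e): the text "bbccacbccabbcca", where naive matching needs 14 comparisons.
print(naive_comparisons("abcca", "bbccacbccabbcca"))  # 14
```

On the part (e) strings the naive count of 14 is reproduced exactly; the bad-character-only Boyer-Moore makes 19 comparisons here, close to the 18 reported for the full algorithm.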
22. Consider the computation of the last-occurrence function F(x) used by the Boyer-Moore algorithm.
(a) The time complexity in this serial implementation is based on the sum of the d computations for initializing F, and the m calculations of F, that is, O(d + m).
(b) The space complexity is O(m), since we need to store the final m values.
(c) For "bonbon" the number of comparisons is 26 + 6 = 32. For "marmalade" the number of comparisons is 26 + 9 = 35. For "abcdabdabcaabcda" the number of comparisons is 26 + 16 = 42.
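The O(d + m) bookkeeping of part (a) can be sketched as follows. The 1-indexed convention, with F = 0 for characters absent from the pattern, matches the usage 9 − F(i) = 5 for the pattern "estimates" in Problem 20; the function name is an assumption.

```python
def last_occurrence(pattern, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Build the last-occurrence function F used by Boyer-Moore.

    F(c) is the 1-indexed position of the rightmost occurrence of c in
    the pattern, or 0 if c never occurs.  Initializing all d alphabet
    entries and then scanning the m pattern letters costs O(d + m).
    """
    F = {c: 0 for c in alphabet}          # d initializations
    for i, c in enumerate(pattern, start=1):
        F[c] = i                          # m updates; rightmost wins
    return F

F = last_occurrence("estimates", alphabet="abcdefghijklmnopqrstuvwxyz ")
print(F["i"], F["m"], F[" "])  # 4 5 0
```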
23. Here the class is determined by the minimum edit distance to any pattern in a category.
(a) By straightforward edit distance calculation, we find that the edit distance d from the test pattern x = abacc to each of the patterns in the categories is as shown in the table.

    ω1        d     ω2        d     ω3        d
    aabbc     3     bccba     4     caaaa     4
    ababcc    1     bbbca     3     cbcaab    4
    babbcc    2     cbbaaaa   5     baaca     3

The minimum distance, 1, occurs for the second pattern in ω1, and thus the test pattern should be assigned to ω1.
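These distances follow from the standard dynamic-programming recurrence for edit distance, with unit costs for insertion, deletion, and substitution; the sketch below is added for verification only.

```python
def edit_distance(s, t):
    """Levenshtein distance via the usual DP table."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                        # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                        # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[m][n]

# Distances from x = "abacc" to the omega_1 patterns in part (a):
print([edit_distance("abacc", p) for p in ("aabbc", "ababcc", "babbcc")])  # [3, 1, 2]
```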
(b) As in part (a), the table shows the edit distance d from the test x = abca to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     3     bccba     3     caaaa     3
    ababcc    3     bbbca     2     cbcaab    3
    babbcc    3     cbbaaaa   5     baaca     2

The minimum distance, 2, occurs for patterns in two categories. Hence there is a tie between categories ω2 and ω3.
(c) As in part (a), the table shows the edit distance d from the test x = ccbba to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     3     bccba     2     caaaa     3
    ababcc    5     bbbca     3     cbcaab    4
    babbcc    4     cbbaaaa   4     baaca     4

Here the minimum distance, 2, occurs for the first pattern in ω2, and thus the test pattern should be assigned to category ω2.
(d) As in part (a), the table shows the edit distance d from the test x = bbaaac to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     4     bccba     4     caaaa     3
    ababcc    3     bbbca     3     cbcaab    3
    babbcc    4     cbbaaaa   2     baaca     3

Here the minimum distance, 2, occurs for the third pattern in ω2, and thus the test pattern should be assigned to category ω2.
24. Here the class is determined by the minimum edit distance to any pattern in a category.
(a) By straightforward edit distance calculation, we find that the edit distance d from the test pattern x = ccab to each of the patterns in the categories is as shown in the table.

    ω1        d     ω2        d     ω3        d
    aabbc     4     bccba     3     caaaa     3
    ababcc    4     bbbca     4     cbcaab    2
    babbcc    5     cbbaaaa   5     baaca     4

Here the minimum distance, 2, occurs for the second pattern in ω3, and thus the test pattern should be assigned to ω3.
(b) As in part (a), the table shows the edit distance d from the test x = abdca to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     3     bccba     3     caaaa     4
    ababcc    3     bbbca     2     cbcaab    4
    babbcc    3     cbbaaaa   5     baaca     3

Here the minimum distance, 2, occurs for the second pattern in ω2, and thus we assign the test pattern to ω2.
(c) As in part (a), the table shows the edit distance d from the test x = abc to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     2     bccba     4     caaaa     4
    ababcc    3     bbbca     3     cbcaab    4
    babbcc    3     cbbaaaa   6     baaca     3

Here the minimum distance, 2, occurs for the first pattern in ω1, and thus we assign the test pattern to ω1.
(d) As in part (a), the table shows the edit distance d from the test x = bacaca to each of the nine patterns.

    ω1        d     ω2        d     ω3        d
    aabbc     4     bccba     3     caaaa     3
    ababcc    4     bbbca     3     cbcaab    4
    babbcc    3     cbbaaaa   4     baaca     1

Here the minimum distance, 1, occurs for the third pattern in ω3, and thus we assign the test pattern to ω3.
25. Here the class is determined by the minimum cost-based edit distance dc to any pattern in a category, where we assume an interchange is twice as costly as an insertion or a deletion.
(a) By straightforward cost-based edit distance calculation, we find that the edit distance dc from the test pattern x = abacc to each of the patterns in the categories is as shown in the table.

    ω1        dc    ω2        dc    ω3        dc
    aabbc     4     bccba     4     caaaa     6
    ababcc    1     bbbca     6     cbcaab    7
    babbcc    3     cbbaaaa   8     baaca     4

The final categorization corresponds to the minimum of these cost-based distances, i.e., the distance dc = 1 to the second pattern in ω1. Thus we assign the test pattern to ω1.
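A sketch of the cost-based computation, with substitution cost 2 and insertion/deletion cost 1 as the problem assumes; this is added here for verification and is not part of the original solution.

```python
def cost_edit_distance(s, t, ins=1, dele=1, sub=2):
    """Edit distance where an interchange costs twice an insert or delete."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * dele
    for j in range(n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if s[i - 1] == t[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,   # deletion
                          D[i][j - 1] + ins,    # insertion
                          D[i - 1][j - 1] + c)  # interchange (cost 2)
    return D[m][n]

# Cost-based distances from x = "abacc" to the omega_1 patterns:
print([cost_edit_distance("abacc", p) for p in ("aabbc", "ababcc", "babbcc")])  # [4, 1, 3]
```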
(b) As in part (a), the table shows the cost-based edit distance dc from the test x = abca to each of the nine patterns.

    ω1        dc    ω2        dc    ω3        dc
    aabbc     3     bccba     3     caaaa     5
    ababcc    4     bbbca     3     cbcaab    4
    babbcc    4     cbbaaaa   7     baaca     3

The minimum cost-based distance is 3, which occurs for four patterns. There is a three-way tie between the categories ω1, ω2 and ω3. (In the event of a tie, in practice we often consider the next closest pattern; in this case this would lead us to classify the test pattern as ω2.)
(c) As in part (a), the table shows the cost-based edit distance dc from the test x = ccbba to each of the nine patterns.

    ω1        dc    ω2        dc    ω3        dc
    aabbc     6     bccba     2     caaaa     6
    ababcc    7     bbbca     4     cbcaab    5
    babbcc    7     cbbaaaa   4     baaca     6

Here the minimum cost-based distance, 2, occurs for the first pattern in ω2, and thus we assign the test pattern to ω2.
(d) As in part (a), the table shows the cost-based edit distance dc from the test x = bbaaac to each of the nine patterns.

    ω1        dc    ω2        dc    ω3        dc
    aabbc     4     bccba     4     caaaa     3
    ababcc    3     bbbca     3     cbcaab    3
    babbcc    4     cbbaaaa   2     baaca     3

Here the minimum cost-based distance, 2, occurs for the third pattern in ω2, and thus we assign the test pattern to ω2.
26. The pairwise cost-based edit distances among the three strings are dc(x1, x2) = 2, dc(x1, x3) = 2, and dc(x2, x3) = 5. Note that dc(x2, x3) > dc(x1, x2) + dc(x1, x3), violating the triangle inequality.
27. Problem not yet solved
28. Problem not yet solved
29. Problem not yet solved
Section 8.6
30. Recall that Lisp expressions are of the general form (operation operand1 operand2). (More generally, for some operations, such as multiplication, *, and addition, +, there can be an arbitrary number of operands.)
(a) Here the alphabet is A = {0, 1, 2, ..., 9, plus, difference, times, quotient, (, )}, the set of intermediate symbols is I = {A, B}, the starting symbol is S = S, and the productions are

    p1: S → (ABB)
    p2: A → plus | difference | times | quotient
    p3: B → 0 | 1 | ... | 9
    p4: B → (ABB)

(b) An advanced Lisp parser could eliminate excess parentheses, but for the grammar above, three of the five expressions are ungrammatical:
(times (plus (difference 5 9)(times 3 8))(quotient 2 6)) can be expressed in this grammar (see figure).
(7 difference 2) cannot be expressed in this grammar.
[Figure residue omitted: derivation trees for the grammatical Lisp expressions.]
31. Here the productions are

    p1: S → Ab
    p2: A → Aa
    p3: A → a

[Figure residue omitted: derivation trees for this grammar.]
32. Here the language G has alphabet A = {a, b, c}, intermediate symbols I = {A, B}, starting symbol S = S, and the productions are

    p1: S → cAb
    p2: A → aBa
    p3: B → aBa
    p4: B → cb

(a) We consider the rules in turn:
    S → cAb: type 3 (of the form α → zβ)
    A → aBa: type 3 (of the form α → zβ)
    B → aBa: type 3 (of the form α → zβ)
    B → cb: type 2 (of the form I → x) and type 3 (of the form α → z)
Thus G is a type 3, or regular, grammar.
(b) We now prove by induction that the language generated by G is L(G) = {ca^n cb a^n b | n ≥ 1}. For n = 1, the string cacbab can be generated by G: S → cAb → caBab → cacbab. Now suppose that ca^n cb a^n b is generated by G, and write ca^(n+1) cb a^(n+1) b = c a (a^n cb a^n) a b. Given the above, we know that S → cAb, where A yields a^n cb a^n under G. When we parse c a (a^n cb a^n) a b using G, we find S → cAb → caBab, and thus B plays the same role as A. Thus ca^(n+1) cb a^(n+1) b is generated by G.
(c) See gure.
S
cb
cb
p1 S A
p2 A ab . . . z
P=
p3 A aAabAb . . . zAz
See gure.
290
(b) The grammar is of type 3 because every rewrite rule is of the form z or
z.
(c) Here the grammar G has the English alphabet and the null character, A =
{a, b, . . . , z, }, the single internal symbol I = A and starting symbol S = S.
The null symbol is needed for generating palindromes having an even number
of letters. The productions are of the form:
p1 S aAbA . . . zA
p2 A ab . . . z
P=
p3 A aAabAb . . . zAz
See gure.
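A recursive membership check for palindrome productions of the form A → x | ε | xAx, following the reconstruction above; accepting the empty string handles the even-length case of part (c). This sketch is an addition, not part of the original solution.

```python
def derivable(s):
    """Can s be derived from A with rules A -> x, A -> epsilon, A -> xAx?"""
    if len(s) <= 1:
        return True                        # A -> epsilon or A -> x
    # A -> xAx requires matching first and last letters.
    return s[0] == s[-1] and derivable(s[1:-1])

print(derivable("noon"), derivable("level"), derivable("ab"))  # True True False
```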
34. (a) [The production list p1, ..., p6 is not recoverable here; see the figure.] There are 9 possible derivations for digit1, 10 possible derivations for teens, and 8 possible derivations for tys. Thus there are (10 + 8 + 8·9 + 9) = 99 possible derivations for digits2, and (9·99 + 9 + 99) = 999 possible derivations for digits3. So there are 999 possible derivations for digits6 under these restrictive conditions.
(b) Here we use the production

    p3: digits6 → digits3 thousand digits3 | digits3 thousand | digits3.

We know from part (a) that there are 999 derivations for digits3. Thus there are 999·999 + 999 + 999 = 999,999 possible derivations for digits6; indeed, one for each number.
(c) The grammar does not allow any number to be pronounced in more than one way: there are only 999,999 numbers and only 999,999 possible pronunciations, and these are in one-to-one correspondence. In English, however, there is usually more than one way to pronounce a single number. For instance, 2000 can be pronounced "two thousand" or "twenty hundred"; the grammar above does not allow this second pronunciation.
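The counting in parts (a) and (b) is plain arithmetic and can be checked directly; the variable names below mirror the grammar's nonterminals.

```python
# Number of derivations for each nonterminal of the number grammar.
digit1 = 9            # one ... nine
teens = 10            # ten ... nineteen
tys = 8               # twenty ... ninety

# digits2: teens | tys | tys digit1 | digit1
digits2 = teens + tys + tys * digit1 + digit1
# digits3: digit1 hundred digits2 | digit1 hundred | digits2
digits3 = digit1 * digits2 + digit1 + digits2
# digits6: digits3 thousand digits3 | digits3 thousand | digits3
digits6 = digits3 * digits3 + digits3 + digits3

print(digits2, digits3, digits6)  # 99 999 999999
```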
35. Problem not yet solved
36. Consider grammars and the Chomsky normal form.
(a) Here, the first few rewrite rules in G are

    S → bA     not CNF
    S → aB     not CNF
    A → bAA    not CNF
    A → aS     not CNF
    A → a      CNF
    ...

We need only one rule to violate the conditions of being in Chomsky normal form for the entire grammar to be not in CNF. Thus this grammar is not in CNF.
(b) Here the rewrite rules of G are

    rewrite rule      CNF?
    S → Cb A          yes (A → BC)
    S → Ca B          yes (A → BC)
    A → Ca S          yes (A → BC)
    A → Cb D1         yes (A → BC)
    A → a             yes (A → z)
    B → Cb S          yes (A → BC)
    B → Ca D2         yes (A → BC)
    B → b             yes (A → z)
    D1 → AA           yes (A → BC)
    D2 → BB           yes (A → BC)
    Ca → a            yes (A → z)
    Cb → b            yes (A → z)

Thus, indeed, all the rewrite rules are of the form A → BC or A → z, and the grammar G is in CNF.
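The rule-by-rule check of the table can be mechanized. In this sketch, a rule is a pair (left-hand side, list of right-hand symbols), and the encoding of nonterminals as tokens in a set is an assumption.

```python
def is_cnf_rule(lhs, rhs, nonterminals):
    """A -> BC with B, C nonterminal, or A -> z with z a single terminal."""
    if lhs not in nonterminals:
        return False
    if len(rhs) == 2 and all(s in nonterminals for s in rhs):
        return True                        # A -> BC
    return len(rhs) == 1 and rhs[0] not in nonterminals  # A -> z

N = {"S", "A", "B", "D1", "D2", "Ca", "Cb"}
rules = [("S", ["Cb", "A"]), ("S", ["Ca", "B"]), ("A", ["Ca", "S"]),
         ("A", ["Cb", "D1"]), ("A", ["a"]), ("B", ["Cb", "S"]),
         ("B", ["Ca", "D2"]), ("B", ["b"]), ("D1", ["A", "A"]),
         ("D2", ["B", "B"]), ("Ca", ["a"]), ("Cb", ["b"])]
print(all(is_cnf_rule(l, r, N) for l, r in rules))  # True
print(is_cnf_rule("S", ["b", "A"], N))              # False: S -> bA
```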
(c) Theorem: Any context-free language without the null symbol can be generated by a grammar G1 in which the rewrite rules are of the form A → BC or A → z.
[Figure residue omitted: derivation trees for the numbers 342,619 and 900,000 under the number-pronunciation grammar.]
Section 8.7
40. Here D1 = {ab, abb, abbb} and D2 = {ba, aba, babb} are positive examples from
grammars G1 and G2 , respectively.
(a) Some candidate rewrite rules for G1 and G2 are

    P1:  p1: S → A        P2:  p1: S → A
         p2: A → ab            p2: A → ba
         p3: A → Ab            p3: A → aA
                               p4: A → Ab
(b) Here we infer G1, being sure not to include rules that would yield strings in D2.

    i   x_i^+    P after x_i^+                        P produces D2?
    1   ab       S → A; A → ab                        No
    2   abb      S → A; A → ab; A → Ab                No
    3   abbb     S → A; A → ab; A → Ab                No

(c) Here we infer G2, being sure not to include rules that would yield strings in D1.

    i   x_i^+    P after x_i^+                        P produces D1?
    1   ba       S → A; A → ba                        No
    2   aba      S → A; A → ba; A → aA                No
    3   babb     S → A; A → ba; A → aA; A → Ab        No
(d) Two of the strings are ambiguous, and one each is in L(G1) and L(G2):
    bba is ambiguous
    abab ∈ L(G2): S → A → aA → aAb → abab
    bbb is ambiguous
    abbbb ∈ L(G1): S → A → Ab → Abb → Abbb → abbbb.
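The inferred grammars generate simple regular languages: L(G1) = {ab^n : n ≥ 1} from S → A, A → ab | Ab, and L(G2) = {a^i b a b^j : i, j ≥ 0} from S → A, A → ba | aA | Ab. That characterization, and the regular-expression check below, are additions for verification.

```python
import re

def in_G1(s):
    """L(G1) from rules S->A, A->ab, A->Ab, i.e. ab, abb, abbb, ..."""
    return re.fullmatch(r"ab+", s) is not None

def in_G2(s):
    """L(G2) from rules S->A, A->ba, A->aA, A->Ab, i.e. a^i b a b^j."""
    return re.fullmatch(r"a*bab*", s) is not None

print(in_G2("abab"), in_G1("abbbb"))   # True True
print(in_G1("bba"), in_G2("bba"))      # False False
```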
Section 8.8
41. Problem not yet solved
Computer Exercises
Section 8.3
1. Computer exercise not yet solved
2. Computer exercise not yet solved
Section 8.4
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
Section 8.5
6. Computer exercise not yet solved
7. Computer exercise not yet solved
8. Computer exercise not yet solved
Section 8.6
9. Computer exercise not yet solved
10. Computer exercise not yet solved
Section 8.7
11. Computer exercise not yet solved
Chapter 9
Algorithm-Independent Machine Learning
Problem Solutions
Section 9.2
1. Problem not yet solved
2. Problem not yet solved
3. Problem not yet solved
4. Problem not yet solved
5. Problem not yet solved
6. Problem not yet solved
7. Problem not yet solved
8. Problem not yet solved
9. Problem not yet solved
(a) By the binomial theorem,

    2^n = (1 + 1)^n = Σ_{r=0}^{n} C(n, r).
(b) We define

    K(n) = Σ_{r=0}^{n} C(n, r).

Then, using the identity C(n+1, r) = C(n, r) + C(n, r−1),

    K(n+1) = Σ_{r=0}^{n+1} C(n+1, r)
           = C(n+1, 0) + Σ_{r=1}^{n} [ C(n, r) + C(n, r−1) ] + C(n+1, n+1)
           = Σ_{r=0}^{n} C(n, r) + Σ_{r=0}^{n} C(n, r)
           = 2 K(n) = 2^(n+1).

The total number of predicates is

    Σ_{r=0}^{d} C(d, r) = 2^d,

where d is the number of distinct possible patterns. The total number of predicates shared by two patterns remains

    Σ_{r=2}^{d} C(d−2, r−2) = 2^(d−2).
In short, there are no changes to the derivation of the Ugly Duckling Theorem.
(b) Since there are only three different types of cars seen, we have three different patterns, x1, x2 and x3, with features such as

    x1 = NOT[f1 OR f2 OR f3 OR f4 OR f5] AND f6,

and similarly for x2 and x3. The result is that the three patterns differ from one another by only one feature, and hence cars A and B are equally similar as cars B and C.
12. Problem not yet solved
13. Consider the Kolmogorov complexity of different sequences.
(a) 010110111011110... This sequence is made up of m subsequences, the l-th consisting of a 0 followed by l 1s, for l = 1, ..., m. Therefore the complexity is O(log2 m).
(b) 000...0100...000, a sequence made up of m 0s, a single 1, and (n − m − 1) 0s. The complexity is just that needed to specify m, i.e., O(log2 m).
(c) e = 10.10110111111000010... (base 2). The complexity of this constant is O(1).
(d) 2e = 101.0110111111000010... (base 2). The complexity of this constant is O(1).
(e) The binary digits of π, but where every 100th digit is changed to the numeral 1. The constant π has a complexity of O(1); it is a simple matter to wrap a program around this to change each 100th digit. This does not change the complexity, and the answer is O(1).
(f) As in part (e), but now we must specify n, which has complexity O(log2 n).
14. Problem not yet solved
15. For two binary strings x1 and x2, the Kolmogorov complexity of the pair can be at worst K(x1, x2) = K(x1) + K(x2), since we can simply concatenate two programs, one for computing x1 and one for x2. If the two strings share information, the Kolmogorov complexity will be less than K(x1) + K(x2), since some information from one of the strings can be used in the generation of the other string.
16. Problem not yet solved
17. The definition "the least number that cannot be defined in less than twenty words" is itself a definition of less than twenty words. The Kolmogorov complexity of a string is defined as the length of the shortest program that describes it. From the above paradoxical statement, we can see that it is possible that we are not clever enough to determine the shortest program length for a string, and thus we will not be able to determine the complexity easily.
Section 9.3
18. The mean-square error is

    E_D[(g(x;D) − F(x))^2] = E_D[g^2(x;D) − 2 g(x;D) F(x) + F^2(x)]
                           = E_D[g^2(x;D)] − 2 F(x) E_D[g(x;D)] + F^2(x).

Note, however, that

    E_D[(g(x;D) − E_D[g(x;D)])^2] = E_D[g^2(x;D) − 2 g(x;D) E_D[g(x;D)] + (E_D[g(x;D)])^2]
                                  = E_D[g^2(x;D)] − 2 E_D[g(x;D)] E_D[g(x;D)] + (E_D[g(x;D)])^2
                                  = E_D[g^2(x;D)] − (E_D[g(x;D)])^2.

We put these two results together and find

    E_D[g^2(x;D)] = E_D[(g(x;D) − E_D[g(x;D)])^2] + (E_D[g(x;D)])^2.

We now apply this result to the function g(x;D) − F(x) and obtain

    E_D[(g(x;D) − F(x))^2] = (E_D[g(x;D)] − F(x))^2  [squared bias]
                           + E_D[(g(x;D) − E_D[g(x;D)])^2]  [variance].

Since the estimate can be more or less than the function F(x), the bias can be negative. The variance cannot be negative, as it is the expected value of a squared quantity.
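Because the decomposition is an algebraic identity, it holds exactly for any finite collection of estimates; the following check, with arbitrary illustrative numbers, is an addition for verification.

```python
def bias_variance(gs, F):
    """Decompose the mean-square error of estimates gs about target F."""
    n = len(gs)
    mean_g = sum(gs) / n
    mse = sum((g - F) ** 2 for g in gs) / n
    bias2 = (mean_g - F) ** 2
    var = sum((g - mean_g) ** 2 for g in gs) / n
    return mse, bias2, var

mse, bias2, var = bias_variance([0.9, 1.1, 1.4, 1.0], F=1.2)
print(abs(mse - (bias2 + var)) < 1e-12)  # True: MSE = bias^2 + variance
```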
19. For a given data set D, if g(x;D) agrees with the Bayes classifier, the expected error rate will be Min[F(x), 1 − F(x)]; otherwise it will be Max[F(x), 1 − F(x)]. Thus we have

    Pr[g(x;D) ≠ y] = Min[F(x), 1 − F(x)] Pr[g(x;D) = y_B] + Max[F(x), 1 − F(x)] Pr[g(x;D) ≠ y_B].

Thus we conclude

    Pr[g(x;D) ≠ y] = Min[F(x), 1 − F(x)] + |2F(x) − 1| Pr[g(x;D) ≠ y_B].

20. If we make the convenient assumption that p(g(x;D)) is a Gaussian, that is,

    p(g(x;D)) = (1/(√(2π) σ)) exp[ −(g − μ)^2/(2σ^2) ],

where μ = E_D[g(x;D)] and σ^2 = Var[g(x;D)], then from Eq. 19 in the text, for F(x) < 1/2 we have

    Pr[g(x;D) ≠ y_B] = ∫_{1/2}^{∞} p(g(x;D)) dg
                     = ∫_{1/2}^{∞} (1/(√(2π) σ)) exp[ −(g − μ)^2/(2σ^2) ] dg
                     = (1/√(2π)) ∫_{(1/2 − μ)/σ}^{∞} exp[ −u^2/2 ] du,

where u = (g − μ)/σ and du = dg/σ. For the other case, that is, F(x) ≥ 1/2, we have

    Pr[g(x;D) ≠ y_B] = ∫_{−∞}^{1/2} p(g(x;D)) dg = (1/√(2π)) ∫_{(μ − 1/2)/σ}^{∞} exp[ −u^2/2 ] du.

In both cases,

    Pr[g(x;D) ≠ y_B] = Φ(t),   where   Φ(t) = (1/√(2π)) ∫_{t}^{∞} exp[ −u^2/2 ] du = (1/2)[ 1 − erf(t/√2) ]

and

    t = sgn[F(x) − 1/2] (E_D[g(x;D)] − 1/2)/√(Var[g(x;D)])
      = sgn[F(x) − 1/2] × (boundary bias)/√(variance).
Consider the jackknife estimate of the mean. The leave-one-out means are

    μ_(i) = (1/(n−1)) Σ_{j≠i} x_j = (n x̄ − x_i)/(n − 1),

and their average is

    (1/n) Σ_{i=1}^{n} μ_(i) = (1/(n(n−1))) Σ_{i=1}^{n} Σ_{j≠i} x_j
                            = (1/(n(n−1))) Σ_{i=1}^{n} (n x̄ − x_i) = x̄.

The jackknife estimate of the variance of the mean is

    Var[μ̂] = ((n−1)/n) Σ_{i=1}^{n} (μ_(i) − x̄)^2
            = ((n−1)/n) Σ_{i=1}^{n} ( (x̄ − x_i)/(n−1) )^2
            = (1/(n(n−1))) Σ_{i=1}^{n} (x_i − x̄)^2.
The estimate of the mean,

    μ̂ = (1/n) Σ_{i=1}^{n} μ_(i),

requires 2n summations, and thus has a complexity O(n).
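The algebra above can be confirmed on any sample; the data below are arbitrary, and the function name is an assumption.

```python
def jackknife_variance(xs):
    """Jackknife estimate of the variance of the sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    loo_means = [(n * xbar - x) / (n - 1) for x in xs]  # mu_(i)
    return (n - 1) / n * sum((m - xbar) ** 2 for m in loo_means)

xs = [2.0, 4.0, 4.0, 7.0, 8.0]
n = len(xs)
xbar = sum(xs) / n
closed_form = sum((x - xbar) ** 2 for x in xs) / (n * (n - 1))
print(abs(jackknife_variance(xs) - closed_form) < 1e-12)  # True
```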
(d) The bootstrap estimate of the mean is

    μ̂^(·) = (1/B) Σ_{b=1}^{B} μ̂^(b),

and the bootstrap estimate of the variance of the mean is

    Var_boot[μ̂] = (1/B) Σ_{b=1}^{B} ( μ̂^(b) − μ̂^(·) )^2.
    1 begin initialize x1, x2, ε
    2   while (true)
    3     x ← (x1 + x2)/2
    4     if |g1(x) − g2(x)| < ε then return x
    5     if g1(x) < g2(x) then x2 ← x else x1 ← x
    6 end

Another version uses the discriminant functions g1(x) and g2(x) to heuristically guide the search for the boundary. As the two points under consideration, x1 and x2, become closer together, the true discriminant functions can be approximated linearly. In the following, x3 is the point along the line from x1 to x2 where g1 and g2 cross in the linear approximation.
Algorithm 0 (Binary search)

    1 begin initialize x1, x2, ε
    2   while (true)
    3     g(x1) ← g1(x1) − g2(x1)
    4     g(x2) ← g1(x2) − g2(x2)
    5     x3 ← x1 + (x2 − x1) g(x1)/[g(x1) − g(x2)]
    6     if g(x3) ≥ 0 then x1 ← x3 else x2 ← x3
    7     if |g(x3)| < ε then return x3
    8 end
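A runnable sketch of the first (bisection) version, with two illustrative linear discriminants; the stopping tolerance and all names are assumptions, not part of the original solution.

```python
def boundary_bisect(g1, g2, x1, x2, eps=1e-10):
    """Bisect for the decision boundary, assuming g1 - g2 changes sign on [x1, x2]."""
    while True:
        x = (x1 + x2) / 2.0
        if abs(g1(x) - g2(x)) < eps:
            return x
        # Keep the sub-interval on which g1 - g2 still changes sign.
        if (g1(x1) - g2(x1)) * (g1(x) - g2(x)) > 0:
            x1 = x
        else:
            x2 = x

# Two illustrative discriminants crossing at x = 0.5.
g1 = lambda x: x
g2 = lambda x: 1.0 - x
print(round(boundary_bisect(g1, g2, 0.0, 1.0), 6))  # 0.5
```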
(c) It is necessary to integrate over a region Δx about a given data point to convert from probability density to probability. The probability of the data given a particular model is then the product of the probabilities of the data points given the model (uniform θ_U or Gaussian θ_G), that is,

    P(D|θ_U) ≈ 12.1427 (Δx)^7,
    P(D|θ_G) ≈ 1.7626 (Δx)^7.
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Problem not yet solved
Section 9.7
44. Problem not yet solved
45. Problem not yet solved
Computer Exercises
Section 9.2
1. Computer exercise not yet solved
Section 9.3
2. Computer exercise not yet solved
Section 9.4
3. Computer exercise not yet solved
Section 9.5
4. Computer exercise not yet solved
5. Computer exercise not yet solved
6. Computer exercise not yet solved
Section 9.6
7. Computer exercise not yet solved
8. Computer exercise not yet solved
9. Computer exercise not yet solved
Section 9.7
10. Computer exercise not yet solved
Chapter 10
Unsupervised Learning and Clustering
Section 10.2
1. We are given that x can assume values 0, 1, ..., m and that the priors P(ω_j) are known.
(a) The likelihood is given by

    P(x|θ) = Σ_{j=1}^{c} C(m, x) θ_j^x (1 − θ_j)^(m−x) P(ω_j),   x = 0, ..., m,

where Σ_{j=1}^{c} P(ω_j) = 1.
Next consider the two-component mixture density

    p(x) = (P(ω1)/(√(2π) σ1)) e^(−x^2/(2σ1^2)) + ((1 − P(ω1))/(√(2π) σ2)) e^(−x^2/(2σ2^2)).

(a) When σ1 = σ2, P(ω1) can take any value in the range [0, 1], leaving the same mixture density. Thus the density is completely unidentifiable.
(b) If P(ω1) is fixed (and known) but not P(ω1) = 0, 0.5, or 1.0, then the model is identifiable. If P(ω1) = 0, we cannot recover parameters for the first distribution; if P(ω1) = 1, we cannot recover parameters for the second distribution; and if P(ω1) = 0.5, the parameters of the two distributions are interchangeable.
(c) If σ1 = σ2, then P(ω1) cannot be identified because P(ω1) and P(ω2) are interchangeable. If σ1 ≠ σ2, then P(ω1) can be determined uniquely.
Section 10.3
4. We are given that x is a binary vector and that P(x) is a mixture of c multivariate Bernoulli distributions:

    P(x|θ) = Σ_{i=1}^{c} P(x|ω_i, θ_i) P(ω_i),

where

    P(x|ω_i, θ_i) = Π_{j=1}^{d} θ_ij^(x_j) (1 − θ_ij)^(1−x_j).

(a) Differentiating the log of the component density with respect to θ_ij,

    ∂ ln P(x|ω_i, θ_i)/∂θ_ij = x_j/θ_ij − (1 − x_j)/(1 − θ_ij)
                             = [ x_j (1 − θ_ij) − θ_ij (1 − x_j) ] / [ θ_ij (1 − θ_ij) ]
                             = (x_j − θ_ij) / [ θ_ij (1 − θ_ij) ],

or, in vector form, with the division taken component by component,

    ∇_{θ_i} ln P(x|ω_i, θ_i) = (x − θ_i) / [ θ_i (1 − θ_i) ].

(b) Equation 7 in the text shows that the maximum-likelihood estimate θ̂_i must satisfy

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) ∇_{θ̂_i} ln P(x_k|ω_i, θ̂_i) = 0,

that is, using the result of part (a),

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) (x_k − θ̂_i) / [ θ̂_i (1 − θ̂_i) ] = 0.

We assume each component of θ̂_i lies in (0, 1), and thus we have

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i)(x_k − θ̂_i) = 0,

which yields

    θ̂_i = Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) x_k / Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i).
Consider next a mixture in which each component is a spherical Gaussian, p(x|ω_i, θ_i) ~ N(μ_i, σ_i^2 I):

    p(x|ω_i, θ_i) = (1/((2π)^(d/2) σ_i^d)) exp[ −(1/(2σ_i^2)) (x − μ_i)^t (x − μ_i) ].

The log of the component density is

    ln p(x|ω_i, θ_i) = −(d/2) ln(2π) − (d/2) ln σ_i^2 − (1/(2σ_i^2)) (x − μ_i)^t (x − μ_i),

and thus

    ∂ ln p(x|ω_i, θ_i)/∂σ_i^2 = −d/(2σ_i^2) + (1/(2σ_i^4)) ||x − μ_i||^2
                              = −(1/(2σ_i^4)) ( d σ_i^2 − ||x − μ_i||^2 ).

The maximum-likelihood estimate must satisfy

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) ∂ ln p(x_k|ω_i, θ̂_i)/∂σ̂_i^2 = 0.

We substitute, rearrange, and find

    d σ̂_i^2 Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) = Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) ||x_k − μ̂_i||^2.

The solution is

    σ̂_i^2 = (1/d) Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i) ||x_k − μ̂_i||^2 / Σ_{k=1}^{n} P(ω_i|x_k, θ̂_i).
The mixture density is

    p(x|θ) = Σ_{j=1}^{c} p(x|ω_j, θ) P(ω_j),

and the log-likelihood is

    l = Σ_{k=1}^{n} ln p(x_k|θ).

The gradient is

    ∇_θ l = Σ_{k=1}^{n} (1/p(x_k|θ)) ∇_θ p(x_k|θ)
          = Σ_{k=1}^{n} (1/p(x_k|θ)) Σ_{j=1}^{c} P(ω_j) ∇_θ p(x_k|ω_j, θ)
          = Σ_{k=1}^{n} Σ_{j=1}^{c} [ p(x_k|ω_j, θ) P(ω_j)/p(x_k|θ) ] ∇_θ ln p(x_k|ω_j, θ)
          = Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ) ∇_θ ln p(x_k|ω_j, θ),

where, by Bayes' rule,

    P(ω_j|x_k, θ) = p(x_k|ω_j, θ) P(ω_j) / p(x_k|θ).
8. We are given that θ1 and θ2 are statistically independent, that is, p(θ1, θ2) = p(θ1)p(θ2).
(a) We use this assumption to derive

    p(θ1, θ2|x1) = p(θ1, θ2, x1)/p(x1) = p(x1|θ1, θ2) p(θ1, θ2)/p(x1)
                 = p(x1|θ1, θ2) p(θ1) p(θ2)/p(x1)
                 = [ p(x1|ω1, θ1) P(ω1) + p(x1|ω2, θ2) P(ω2) ] p(θ1)p(θ2)/p(x1).

Therefore, p(θ1, θ2|x1) can be factored as p(θ1|x1) p(θ2|x1) if and only if

    p(x1|ω1, θ1) P(ω1) + p(x1|ω2, θ2) P(ω2)

can be written as a product of two terms, one involving only θ1 and the other involving only θ2. Now, for any choice of P(ω1), we have P(ω2) = 1 − P(ω1). This then implies that the above sum can be factored in terms of θ1 and θ2 if and only if p(x1|ω_i, θ_i) is constant in θ_i. This, in turn, implies that p(θ1, θ2|x1) cannot be factored if

    ∂ ln p(x1|ω_i, θ_i)/∂θ_i ≠ 0

for i = 1, 2.
(b) Posteriors of parameters need not be independent even if the priors are assumed independent (unless, of course, the mixture component densities are parameter-free). To see this, we first note 0 < P(ω_j) < 1 and n_j/n → P(ω_j) as n → ∞, and in that case we have

    max_{μ1,...,μc} (1/n) ln p(x1, ..., xn|μ1, ..., μc)
        → Σ_{j=1}^{c} P(ω_j) ln P(ω_j) − (1/2) ln(2πσ^2) − 1/2
        = Σ_{j=1}^{c} P(ω_j) ln P(ω_j) − (1/2) ln(2πσ^2 e).

The log-likelihood of the data set is

    ln p(D|θ) = ln Π_{k=1}^{n} p(x_k|θ) = Σ_{k=1}^{n} ln p(x_k|θ),

where

    p(x_k|θ) = Σ_{j=1}^{c} p(x_k|ω_j, θ_j) P(ω_j)   and   Σ_{j=1}^{c} P(ω_j) = 1.
We use the method of Lagrange undetermined multipliers, and define the objective function (to be maximized) as

    f(θ, P(ω1), ..., P(ωc)) = l + λ [ Σ_{i=1}^{c} P(ω_i) − 1 ].

This form of objective function guarantees that the normalization constraint is obeyed. The derivative with respect to θ_i is

    ∂f/∂θ_i = Σ_{k=1}^{n} ∂ ln p(x_k|θ)/∂θ_i
            = Σ_{k=1}^{n} (1/p(x_k|θ)) ∂/∂θ_i Σ_{j=1}^{c} p(x_k|ω_j, θ_j) P(ω_j)
            = Σ_{k=1}^{n} (1/p(x_k|θ)) p(x_k|ω_i, θ_i) P(ω_i) ∂ ln p(x_k|ω_i, θ_i)/∂θ_i
            = Σ_{k=1}^{n} P(ω_i|x_k, θ) ∂ ln p(x_k|ω_i, θ_i)/∂θ_i.

At the maximum-likelihood solution θ̂, then,

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂) ∂/∂θ̂_i ln p(x_k|ω_i, θ̂_i) = 0,

where

    P(ω_i|x_k, θ̂) = p(x_k|ω_i, θ̂_i) P̂(ω_i) / Σ_{j=1}^{c} p(x_k|ω_j, θ̂_j) P̂(ω_j).

Likewise, the derivative with respect to P(ω_i) is

    ∂f/∂P(ω_i) = Σ_{k=1}^{n} (1/p(x_k|θ)) p(x_k|ω_i, θ_i) + λ.

We evaluate this at θ = θ̂ and P(ω_i) = P̂(ω_i), set it to zero, and find

    Σ_{k=1}^{n} P(ω_i|x_k, θ̂)/P̂(ω_i) + λ = 0.

This implies

    P̂(ω_i) = −(1/λ) Σ_{k=1}^{n} P(ω_i|x_k, θ̂),

and since Σ_{i=1}^{c} P̂(ω_i) = 1 we have

    −(1/λ) Σ_{i=1}^{c} Σ_{k=1}^{n} P(ω_i|x_k, θ̂) = 1,

or λ = −n, whereupon

    P̂(ω_i) = (1/n) Σ_{k=1}^{n} P(ω_i|x_k, θ̂).
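The two coupled equations for P̂(ω_i) and μ̂_i are exactly the iterative updates of EM. The following minimal one-dimensional sketch, for two Gaussian components with known equal variances, is an illustration added here; all data values and initial parameters are assumptions.

```python
import math

def em_step(xs, means, priors, var=1.0):
    """One EM update of the means and priors of a 2-component 1-D mixture."""
    posts = []
    for x in xs:
        likes = [p * math.exp(-(x - m) ** 2 / (2 * var))
                 for m, p in zip(means, priors)]
        z = sum(likes)
        posts.append([l / z for l in likes])   # P(omega_i | x_k)
    new_priors = [sum(p[i] for p in posts) / len(xs) for i in range(2)]
    new_means = [sum(p[i] * x for p, x in zip(posts, xs)) /
                 sum(p[i] for p in posts) for i in range(2)]
    return new_means, new_priors

xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.2]
means, priors = [-1.0, 1.0], [0.5, 0.5]
for _ in range(20):
    means, priors = em_step(xs, means, priors)
print([round(m, 2) for m in means])  # converges near the two cluster means
```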
Section 10.4
11. Our Gaussian densities are of the form p(x|ω_i, θ_i) ~ N(μ_i, Σ), and we use the following terminology:

    σ_pq  = pq-th element of Σ,
    σ^pq  = pq-th element of Σ^(-1).

The component density is

    p(x_k|ω_i, θ_i) = (1/((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2)(x_k − μ_i)^t Σ^(-1) (x_k − μ_i) ],

so the log-likelihood is

    ln p(x_k|ω_i, θ_i) = −(d/2) ln(2π) + (1/2) ln |Σ^(-1)| − (1/2)(x_k − μ_i)^t Σ^(-1) (x_k − μ_i).   (*)

Expanding |Σ^(-1)| in its minors and using the symmetry σ^pq = σ^qp, we find

    ∂ ln |Σ^(-1)|/∂σ^pq = 2σ_pq if p ≠ q,   and   σ_pq if p = q,

while for the squared Mahalanobis distance

    ∂/∂σ^pq (x_k − μ_i)^t Σ^(-1) (x_k − μ_i) = 2(x_p(k) − μ_p(i))(x_q(k) − μ_q(i)) if p ≠ q,
                                               (x_p(k) − μ_p(i))^2 if p = q.

Combining these results,

    ∂ ln p(x_k|ω_i, θ_i)/∂σ^pq = (1 − δ_pq/2)[ σ_pq − (x_p(k) − μ_p(i))(x_q(k) − μ_q(i)) ],   (**)

where δ_pq is the Kronecker delta.
The derivative of the full log-likelihood is then

    ∂l/∂σ^pq = Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ_j) (1 − δ_pq/2)[ σ_pq − (x_p(k) − μ_p(j))(x_q(k) − μ_q(j)) ].

Setting this to zero at the maximum-likelihood solution gives

    σ̂_pq = Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j)(x_p(k) − μ̂_p(j))(x_q(k) − μ̂_q(j)) / Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j).

In vector and matrix form, this result can be written just a bit more compactly as

    Σ̂ = Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j)(x_k − μ̂_j)(x_k − μ̂_j)^t / Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j),

where the denominator equals n, since Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j) = 1 for every k. Expanding the numerator for a single sample,

    Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j)(x_k − μ̂_j)(x_k − μ̂_j)^t
        = x_k x_k^t − x_k Σ_{j} P(ω_j|x_k, θ̂_j) μ̂_j^t − Σ_{j} P(ω_j|x_k, θ̂_j) μ̂_j x_k^t
          + Σ_{j} P(ω_j|x_k, θ̂_j) μ̂_j μ̂_j^t.

Summing over k and using

    Σ_{k=1}^{n} P(ω_j|x_k, θ̂_j) = n P̂(ω_j)   and   Σ_{k=1}^{n} P(ω_j|x_k, θ̂_j) x_k = n P̂(ω_j) μ̂_j,

each of the two cross terms contributes −n Σ_{j} P̂(ω_j) μ̂_j μ̂_j^t, while the last term contributes +n Σ_{j} P̂(ω_j) μ̂_j μ̂_j^t. We rearrange this result, substitute it above, and find that the maximum-likelihood estimate of the covariance matrix is

    Σ̂ = (1/n) Σ_{k=1}^{n} x_k x_k^t − Σ_{j=1}^{c} P̂(ω_j) μ̂_j μ̂_j^t.
12. We are told that the distributions are normal with parameters given by p(x|ω1) ~ N(0, 1) and p(x|ω2) ~ N(0, 1/2). Thus the evidence is

    p(x) = (P(ω1)/√(2π)) e^(−x^2/2) + ((1 − P(ω1))/√π) e^(−x^2).

(a) If one sample x1 is observed, the likelihood function is then

    l = p(x1) = (P(ω1)/√(2π)) e^(−x1^2/2) + ((1 − P(ω1))/√π) e^(−x1^2)
      = [ e^(−x1^2/2)/√(2π) − e^(−x1^2)/√π ] P(ω1) + e^(−x1^2)/√π.
The likelihood is linear in P(ω1); if its slope

    e^(−x1^2/2)/√(2π) − e^(−x1^2)/√π > 0,

then l is maximized at P̂(ω1) = 1, and otherwise at P̂(ω1) = 0.
Now consider c Gaussian components with common known variance σ^2. The per-sample log-likelihood is

    (1/n) Σ_{k=1}^{n} ln p(x_k|μ1, ..., μc)
        = (1/n) Σ_{j=1}^{c} Σ_{k: x_k ∈ ω_j} ln [ (P(ω_j)/(√(2π) σ)) e^(−(x_k − μ_j)^2/(2σ^2)) ]
        = (1/n) Σ_{j=1}^{c} Σ_{k: x_k ∈ ω_j} [ ln P(ω_j) − (1/2) ln(2πσ^2) − (x_k − μ_j)^2/(2σ^2) ]
        = (1/n) Σ_{j=1}^{c} n_j ln P(ω_j) − (1/2) ln(2πσ^2)
          − (1/(2σ^2)) (1/n) Σ_{j=1}^{c} Σ_{k: x_k ∈ ω_j} (x_k − μ_j)^2,

where n_j is the number of samples assigned to ω_j. Thus

    max_{μ1,...,μc} (1/n) ln p(x1, ..., xn|μ1, ..., μc)
        = (1/n) Σ_{j=1}^{c} n_j ln P(ω_j) − (1/2) ln(2πσ^2)
          − (1/(2σ^2)) (1/n) Σ_{j=1}^{c} Σ_{k: x_k ∈ ω_j} (x_k − x̄_j)^2,

since for each j the minimum of Σ_{k: x_k ∈ ω_j} (x_k − μ_j)^2 occurs at

    μ_j = ( Σ_{k: x_k ∈ ω_j} x_k ) / n_j = x̄_j.

More generally, with x̄ = (1/n) Σ_{k=1}^{n} x_k, we have

    (1/n) Σ_{k=1}^{n} (x_k − x)^t Σ^(-1) (x_k − x)
        = (1/n) Σ_{k=1}^{n} (x_k − x̄ + x̄ − x)^t Σ^(-1) (x_k − x̄ + x̄ − x)
        = (1/n) Σ_{k=1}^{n} (x_k − x̄)^t Σ^(-1) (x_k − x̄) + (x̄ − x)^t Σ^(-1) (x̄ − x),

where we used

    Σ_{k=1}^{n} (x_k − x̄) = Σ_{k=1}^{n} x_k − n x̄ = n x̄ − n x̄ = 0.

Since Σ^(-1) is positive definite, the expression is minimized at x = x̄, that is, at

    x = x̄ = (1/n) Σ_{k=1}^{n} x_k.
k=1
1/2
1
1
i 
(xk i )t 1
i (xk i ).
d/2
2
(2)
PROBLEM SOLUTIONS
319
d
!
p=1
d !
d
!
p =1
q =1
p=q
The derivative of the likelihood with respect to the pth coordinate of the ith
mean is
ln p(xk i , i )
p (i)
d
$ 1 !
1
2
d
!
d
!
p =1
q =1
p=q
%
(xp (k) p (i)) p q (i)(xq q (i))
1
= 2(xp (k) p (i))(1) pp (i)
2
d
1 !
d
1 !
(xp (k) p (i)) pq (i)(1)
2 p =1
p =p
d
!
q=1
i must satisfy
From Eq. xxx in the text, we know that P (i ) and
1!
P (i ) =
P (i xk , i )
n
n
k=1
and
n
!
k=1
i) = 0
P (i xk , i ) i ln p(xk i ,
or
n
!
k=1
ln p(xk i , i )
P (i xk , i )
= 0.
p (i)
i = i
P (i xk , i )
d
!
q=1
(xq (k)
q (i))
pq (i) = 0,
320CHAPTER
which implies
n
!
i )1 (xk i ) = 0.
P (i xk ,
i
k=1
i and nd
We solve for
n
.
n
k=1 P (i xk , i )
Furthermore, we have
n
!
k=1
i ) ln p(xk i , i )
P (i xk ,
= 0,
pq
i = i
pq
P (i xk , i ) 1
[
pq (i) (xp (k)
p (i))(xq (k)
q (i))] = 0.
2
We solve for
pq (i) and nd
n
i )[(xp (k)
p (i))(xq (k)
q (i))]
P (i xk ,
pq (i) = k=1
n
k=1 P (i xk , i )
and thus
i =
n
k=1
i )(xk
i )(xk
i )t
P (i xk ,
.
n
i)
P (i xk ,
k=1
1
d
(2i2 ) 2
1
(xk i )t (xk i ),
2i2
1
(xk i ).
i2
and hence
n
!
k=1
i ) 1 (xk
i ) = 0.
P (i xk ,
i2
n
k=1 P (i xk , i )
PROBLEM SOLUTIONS
321
We have, moreover,
n
!
k=1
ln p(xk i , i )
=
P (i xk , i )
i2
i = i
and thus
n
!
1
i) d +
P (i xk ,
2
4
2i
2i xk i 2
k=1
= 0
i2
1
d
n
i )xk
i 2
P (i xk ,
.
n
k=1 P (i xk , i )
k=1
.
n
k=1 P (i xk , i )
The log-likelihood function is

    l = Σ_{k=1}^{n} ln p(x_k|θ) = Σ_{k=1}^{n} ln Σ_{j=1}^{c} p(x_k|ω_j, θ_j) P(ω_j)
      = Σ_{k=1}^{n} ln Σ_{j=1}^{c} P(ω_j) (|Σ^(-1)|^(1/2)/(2π)^(d/2)) exp[ −(1/2)(x_k − μ_j)^t Σ^(-1) (x_k − μ_j) ],

and hence the derivative is

    ∂l/∂σ^pq = Σ_{k=1}^{n} (1/p(x_k|θ)) ∂p(x_k|θ)/∂σ^pq
             = Σ_{k=1}^{n} (1/p(x_k|θ)) Σ_{j=1}^{c} P(ω_j) p(x_k|ω_j, θ_j) ∂ ln p(x_k|ω_j, θ_j)/∂σ^pq
             = Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ_j) (1 − δ_pq/2)[ σ_pq − (x_p(k) − μ_p(j))(x_q(k) − μ_q(j)) ].

Setting ∂l/∂σ^pq = 0 and using Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j) = 1 for each k, we have, therefore,

    σ̂_pq = (1/n) Σ_{k=1}^{n} Σ_{j=1}^{c} P(ω_j|x_k, θ̂_j)(x_p(k) − μ̂_p(j))(x_q(k) − μ̂_q(j)).
Section 10.5
18. We shall make use of the following figure, where the largest circle represents all possible assignments of n points to c clusters, and the smaller circle labeled C1 represents assignments of the n points that do not have points in cluster C1, and likewise for C2 up through Cc (the case c = 4 is shown). The gray region represents those assignments that are valid, that is, assignments in which none of the Ci are empty. Our goal is then to compute the number of such valid assignments, that is, the cardinality of the gray region.
(a) First we find the cardinality of the set of invalid assignments, that is, the number of assignments in the white region. By inclusion-exclusion,

    |C1 ∪ ... ∪ Cc| = Σ_i |Ci| − Σ_{i≠j} |Ci ∩ Cj| + Σ_{i≠j≠k} |Ci ∩ Cj ∩ Ck| − ...   (c terms)
                    = C(c,1) N_(c−1) − C(c,2) N_(c−2) + C(c,3) N_(c−3) − ...
                    = Σ_{i=1}^{c−1} (−1)^(c−i+1) C(c,i) N_i,

where N_i = i^n is the number of assignments of the n points to i given clusters. The number of valid assignments is therefore

    N_c − Σ_{i=1}^{c−1} (−1)^(c−i+1) C(c,i) N_i = Σ_{i=1}^{c} (−1)^(c−i) C(c,i) i^n,

and, if the cluster labels are irrelevant, we divide by c!.
[Figure residue omitted: the circles C1, ..., C4 inside the set S of all assignments; the gray region contains the valid assignments, those that have points in every Ci.]
(b) The number of distinct clusterings of 100 points into 5 clusters is therefore

    (1/5!)[ 1·5^100 − 5·4^100 + 10·3^100 − 10·2^100 + 5·1^100 ]
    = 65738408701461898606895733752711432902699495364788241645840659777500
    ≈ 6.57 × 10^67.
(c) As given in part (a), the number of distinct clusterings of 1000 points into 10 clusters is

    (1/10!) Σ_{i=1}^{10} (−1)^(10−i) C(10,i) i^1000.

Clearly, the term that dominates this is the case i = 10, since 10^1000 is larger than any other term in the sum. We use Stirling's approximation, x! ≈ x^x e^(−x) √(2πx), which is valid for large x. Thus we approximate

    (1/10!) 10^1000 = (10^10/10!) 10^990 ≈ (e^10/√(2π·10)) 10^990 ≈ 2778 × 10^990 = 2.778 × 10^993.
Incidentally, this result approximates quite well the exact number of clusterings,
275573192239858906525573192239858906525573191758192446064084102276
241562179050768703410910649475589239352323968422047450227142883275
301297357171703073190953229159974914411337998286379384317700201827
699064518600319141893664588014724019101677001433584368575734051087
113663733326712187988280471413114695165527579301182484298326671957
439022946882788124670680940407335115534788347130729348996760498628
758235529529402633330738877542418150010768061269819155260960497017
554445992771571146217248438955456445157253665923558069493011112420
464236085383593253820076193214110122361976285829108747907328896777
952841254396283166368113062488965007972641703840376938569647263827
074806164568508170940144906053094712298457248498509382914917294439
593494891086897486875200023401927187173021322921939534128573207161
833932493327072510378264000407671043730179941023044697448254863059
191921920705755612206303581672239643950713829187681748502794419033
667715959241433673878199302318614319865690114473540589262344989397
5880
2.756 10993 .
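These counts are Stirling numbers of the second kind and can be reproduced exactly with integer arithmetic from the inclusion-exclusion formula of part (a); the helper name below is an assumption.

```python
from math import comb, factorial

def stirling2(n, c):
    """Number of ways to partition n labeled points into c nonempty clusters."""
    total = sum((-1) ** (c - i) * comb(c, i) * i ** n for i in range(1, c + 1))
    return total // factorial(c)

print(stirling2(4, 2))      # 7
s = stirling2(100, 5)
print(f"{float(s):.2e}")    # about 6.57e+67, matching part (b)
```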
Section 10.6
19. We show that the ranking of distances between samples is invariant to any monotonic transformation of the dissimilarity values.

(a) We define the value v_k at level k to be min δ(D_i, D_j), where δ(D_i, D_j) is the dissimilarity between the pair of clusters D_i and D_j. Recall too that by definition in hierarchical clustering, if two clusters are joined at some level (say k), they remain joined for all subsequent levels. (Below, we assume that only two clusters are merged at any level.)

Suppose at level k that two clusters, D′ and D″, are merged into D = D′ ∪ D″ at level k+1. Then for δ = δ_min we have

v_{k+1} = min over pairs D_i, D_j distinct at level k+1 of δ_min(D_i, D_j)
        = min[ min over D_i, D_j ≠ D distinct at level k of δ_min(D_i, D_j),  min over D_i ≠ D′, D″ of δ_min(D_i, D′ ∪ D″) ].

But for the merged cluster

δ_min(D_i, D′ ∪ D″) = min over x ∈ D_i, x′ ∈ D′ ∪ D″ of δ(x, x′) = min[δ_min(D_i, D′), δ_min(D_i, D″)],

so every dissimilarity entering the minimum at level k+1 also entered the minimum at level k. Since the minimum at level k+1 is taken over a subset of the values considered at level k, and the pair achieving v_k is the pair that was merged, we conclude v_{k+1} ≥ v_k for all k. A similar argument goes through for δ = δ_max.

(b) From part (a) we have v_{k+1} ≥ v_k for all k, and thus

0 = v_1 ≤ v_2 ≤ v_3 ≤ · · · ≤ v_n,

so the similarity values retain their ordering. Any monotonic transformation of the dissimilarities preserves this ordering of the v_k, and hence the clustering hierarchy.
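The invariance can be checked numerically with a tiny single-linkage agglomeration (our own sketch, not an algorithm from the text): the level values v_k come out nondecreasing, and a monotonic transform of the dissimilarity (here squaring) leaves the merge sequence unchanged.

```python
import itertools

def single_linkage_merges(points, diss):
    """Agglomerate by minimum dissimilarity; return merge sequence and levels v_k."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges, levels = [], []
    d_min = lambda a, b: min(diss(points[i], points[j]) for i in a for j in b)
    while len(clusters) > 1:
        pair = min(itertools.combinations(clusters, 2), key=lambda p: d_min(*p))
        levels.append(d_min(*pair))
        clusters = [c for c in clusters if c not in pair] + [pair[0] | pair[1]]
        merges.append(tuple(sorted(map(sorted, map(list, pair)))))
    return merges, levels

pts = [0.0, 0.9, 2.0, 4.5, 5.0]
m1, v1 = single_linkage_merges(pts, lambda x, y: abs(x - y))
m2, v2 = single_linkage_merges(pts, lambda x, y: abs(x - y) ** 2)  # monotone transform

print(m1 == m2)                                 # same dendrogram
print(all(a <= b for a, b in zip(v1, v1[1:])))  # v_k nondecreasing
```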
Section 10.7
20. It is sufficient to derive the equation for one single cluster, which allows us to drop the cluster index. With m = (1/n) Σ_{x∈D} x, expand the pairwise squared distances:

Σ_{x∈D} Σ_{x′∈D} ||x − x′||²
  = Σ_{x∈D} Σ_{x′∈D} [ ||x − m||² + ||x′ − m||² − 2(x − m)ᵗ(x′ − m) ]
  = n Σ_{x′∈D} ||x′ − m||² + n Σ_{x∈D} ||x − m||² − 2 ( Σ_{x∈D}(x − m) )ᵗ ( Σ_{x′∈D}(x′ − m) ),

and the last term vanishes because Σ_{x∈D}(x − m) = nm − nm = 0. Therefore

Σ_{x∈D} Σ_{x′∈D} ||x − x′||² = 2n Σ_{x∈D} ||x − m||²,

and dividing by 2n,

Σ_{x∈D} ||x − m||² = (1/(2n)) Σ_{x∈D} Σ_{x′∈D} ||x − x′||² = (n/2) s̄,

where s̄ = (1/n²) Σ_{x∈D} Σ_{x′∈D} ||x − x′||² is the average squared distance between points in the cluster.
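The identity is easy to confirm numerically (a sketch with made-up data):

```python
import random

random.seed(0)
d, n = 3, 50
D = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
m = [sum(x[i] for x in D) / n for i in range(d)]

sq = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
lhs = sum(sq(x, m) for x in D)                         # sum of ||x - m||^2
rhs = sum(sq(x, xp) for x in D for xp in D) / (2 * n)  # (1/2n) double sum

print(abs(lhs - rhs) < 1e-9)  # True
```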
21. Suppose that the partition minimizing J_e has an empty subset. Then n ≥ c and there must exist at least one subset in the partition containing two or more samples; we call that subset D_j. We write it as

D_j = D_{j1} ∪ D_{j2},

where D_{j1} and D_{j2} are disjoint and nonempty. Then the criterion

J_e = Σ_i Σ_{x∈D_i} ||x − m_i||²

can be written as

J_e = A + Σ_{x∈D_j} ||x − m_j||²,

where A is the contribution of the other subsets. Because each subcluster's own mean minimizes its squared error,

Σ_{x∈D_j} ||x − m_j||² > Σ_{x∈D_{j1}} ||x − m_{j1}||² + Σ_{x∈D_{j2}} ||x − m_{j2}||².

Thus, replacing D_j by D_{j1} and D_{j2} and removing the empty subset yields a partition into c disjoint subsets that reduces J_e. But by assumption the partition chosen minimized J_e, and hence we have a contradiction. Thus there can be no empty subsets in a partition that minimizes J_e (if n ≥ c). In short, we should partition into as many subsets as are allowed, since fewer subsets means larger dispersion within subsets.
22. The figure shows our data and terminology: k samples at x = −2, k samples at x = 0, and one sample at x = a > 0. Partition 1 groups the 2k samples at x = −2 and x = 0 together as D_11 and puts the sample at a alone in D_12; Partition 2 groups the k samples at x = −2 as D_21 and puts the k samples at x = 0 together with the sample at a in D_22.

(a) In Partition 1 the means are

m_11 = (−2k + 0·k)/(2k) = −1,   m_12 = a,

and thus the criterion value is

J_e1 = Σ_{i=1}^k (−2 + 1)² + Σ_{i=1}^k (0 + 1)² + (a − a)² = k + k + 0 = 2k.

In Partition 2, we have

m_21 = −2,   m_22 = (k·0 + a)/(k + 1) = a/(k + 1),

and thus

J_e2 = Σ_{i=1}^k (−2 + 2)² + Σ_{i=1}^k (0 − a/(k+1))² + (a − a/(k+1))²
     = 0 + a²k/(k+1)² + a²k²/(k+1)² = a²k(k+1)/(k+1)² = a²k/(k+1).

Thus if J_e2 < J_e1, that is, if a²k/(k+1) < 2k or equivalently a² < 2(k+1), then the partition that minimizes J_e is Partition 2, which groups the k samples at x = 0 with the one sample at x = a.

(b) If J_e1 < J_e2, i.e., 2k < a²k/(k+1) or equivalently a² > 2(k+1), then the partition that minimizes J_e is Partition 1, which groups the k samples at x = −2 with the k samples at x = 0.
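A direct numerical check of the two criterion values (our own sketch; the particular k and a are arbitrary, with a² < 2(k+1) so Partition 2 should win):

```python
k, a = 5, 3.0

def Je(clusters):
    """Sum-of-squared-error criterion over a list of 1-d clusters."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((x - m) ** 2 for x in c)
    return total

Je1 = Je([[-2.0] * k + [0.0] * k, [a]])   # Partition 1
Je2 = Je([[-2.0] * k, [0.0] * k + [a]])   # Partition 2

print(Je1, 2 * k)                  # both 10.0
print(Je2, a**2 * k / (k + 1))     # both 7.5 -> Partition 2 preferred
```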
23. Our sum-of-squared-error (scatter) criterion is tr[S_W]. We thus need to calculate S_W, using

S_W = Σ_{i=1}^c Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ
    = Σ_{i=1}^c Σ_{x∈D_i} [x xᵗ − m_i xᵗ − x m_iᵗ + m_i m_iᵗ]
    = Σ_{k=1}^4 x_k x_kᵗ − Σ_{i=1}^c n_i m_i m_iᵗ,

where m_i = (1/n_i) Σ_{x∈D_i} x. For the four sample points x_1 = (4, 5)ᵗ, x_2 = (1, 4)ᵗ, x_3 = (0, 1)ᵗ and x_4 = (5, 0)ᵗ we have

Σ_k x_k x_kᵗ = [16 20; 20 25] + [1 4; 4 16] + [0 0; 0 1] + [25 0; 0 0] = [42 24; 24 42].

Partition 1: D_1 = {x_1, x_2}, D_2 = {x_3, x_4}. The means are

m_1 = ½[(4, 5)ᵗ + (1, 4)ᵗ] = (5/2, 9/2)ᵗ,   m_2 = ½[(0, 1)ᵗ + (5, 0)ᵗ] = (5/2, 1/2)ᵗ,

so

2 m_1 m_1ᵗ + 2 m_2 m_2ᵗ = 2[25/4 45/4; 45/4 81/4] + 2[25/4 5/4; 5/4 1/4] = [25 25; 25 41],

and

S_W = [42 24; 24 42] − [25 25; 25 41] = [17 −1; −1 1].

Partition 2: D_1 = {x_1, x_4}, D_2 = {x_2, x_3}, with

m_1 = (9/2, 5/2)ᵗ,   m_2 = (1/2, 5/2)ᵗ,

so

2 m_1 m_1ᵗ + 2 m_2 m_2ᵗ = 2[81/4 45/4; 45/4 25/4] + 2[1/4 5/4; 5/4 25/4] = [41 25; 25 25],

and

S_W = [42 24; 24 42] − [41 25; 25 25] = [1 −1; −1 17].

Partition 3: D_1 = {x_1, x_2, x_3}, D_2 = {x_4}, with

m_1 = ⅓[(4, 5)ᵗ + (1, 4)ᵗ + (0, 1)ᵗ] = (5/3, 10/3)ᵗ,   m_2 = (5, 0)ᵗ,

so

3 m_1 m_1ᵗ + m_2 m_2ᵗ = [25/3 50/3; 50/3 100/3] + [25 0; 0 0],

and

S_W = [42 24; 24 42] − [100/3 50/3; 50/3 100/3] = [26/3 22/3; 22/3 26/3].

Summarizing:

             Partition 1   Partition 2   Partition 3
tr[S_W]          18            18         52/3 ≈ 17.33
|S_W|            16            16         192/9 ≈ 21.33

Thus for the tr[S_W] criterion Partition 3 is favored; for the |S_W| criterion Partitions 1 and 2 are equal, and are to be favored over Partition 3.
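The table above can be reproduced with a few lines of code (a sketch of ours; clusters are given as index lists into the four points):

```python
pts = [(4, 5), (1, 4), (0, 1), (5, 0)]

def scatter(cluster):
    """Entries a, b, c of the 2x2 scatter matrix [[a, b], [b, c]] of one cluster."""
    n = len(cluster)
    mx = sum(p[0] for p in cluster) / n
    my = sum(p[1] for p in cluster) / n
    a = sum((p[0] - mx) ** 2 for p in cluster)
    b = sum((p[0] - mx) * (p[1] - my) for p in cluster)
    c = sum((p[1] - my) ** 2 for p in cluster)
    return a, b, c

def SW(partition):
    a = b = c = 0.0
    for idxs in partition:
        da, db, dc = scatter([pts[i] for i in idxs])
        a, b, c = a + da, b + db, c + dc
    return a + c, a * c - b * b   # trace, determinant

results = []
for part in ([[0, 1], [2, 3]], [[0, 3], [1, 2]], [[0, 1, 2], [3]]):
    tr, det = SW(part)
    results.append((round(tr, 2), round(det, 2)))
print(results)  # [(18.0, 16.0), (18.0, 16.0), (17.33, 21.33)]
```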
24. Problem not yet solved
25. Consider a nonsingular transformation of the feature space: y = Ax, where A is a d-by-d nonsingular matrix.

(a) If we let D̃_i = {Ax : x ∈ D_i} denote the data set transformed to the new space, then the transformed means obey m_i^y = A m_i and m^y = A m, and the scatter matrix in the transformed domain can be written as

S_W^y = Σ_{i=1}^c Σ_{y∈D̃_i} (y − m_i^y)(y − m_i^y)ᵗ
      = A [ Σ_{i=1}^c Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ ] Aᵗ = A S_W Aᵗ,

and likewise

S_B^y = Σ_{i=1}^c n_i (m_i^y − m^y)(m_i^y − m^y)ᵗ = A S_B Aᵗ,   S_T^y = A S_T Aᵗ.

(b) Thus

(S_W^y)⁻¹ S_B^y = (A S_W Aᵗ)⁻¹ (A S_B Aᵗ) = (Aᵗ)⁻¹ S_W⁻¹ A⁻¹ A S_B Aᵗ = (Aᵗ)⁻¹ S_W⁻¹ S_B Aᵗ.

Hence if S_W⁻¹ S_B z_i = λ_i z_i, then with u_i = (Aᵗ)⁻¹ z_i,

(S_W^y)⁻¹ S_B^y u_i = λ_i u_i:

the eigenvalues λ_i are invariant to the transformation, and the eigenvectors transform as u_i = (Aᵗ)⁻¹ z_i.

(c) Because

(S_B + S_W)⁻¹ S_W = [S_W⁻¹ (S_B + S_W)]⁻¹ = [I + S_W⁻¹ S_B]⁻¹,

the eigenvalues of S_T⁻¹ S_W are 1/(1 + λ_i), and therefore

tr[(S_T^y)⁻¹ S_W^y] = Σ_{i=1}^d 1/(1 + λ_i)   and   |S_W^y|/|S_T^y| = |(S_T^y)⁻¹ S_W^y| = Π_{i=1}^d 1/(1 + λ_i).

Here we use the fact that tr[S] equals tr[B⁻¹SB], because they have the same eigenvalues so long as B is nonsingular: if Sx = λ_i x, then

B⁻¹SB(B⁻¹x) = λ_i (B⁻¹x).

Putting this together, both criteria depend only on the invariant eigenvalues λ_i, and hence are themselves invariant to nonsingular linear transformations of the data.

(d) The typical value of the criterion Π_i λ_i is zero or close to zero. This is because S_B is often singular, even if the samples are not drawn from a subspace. Even when S_B is not singular, some λ_i is likely to be very small, and this makes the product small. Hence the criterion is not always useful.
27. Equation 68 in the text defines the criterion J_d = |S_W| = |Σ_{i=1}^c S_i|, where

S_i = Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ

is the scatter matrix for category ω_i, defined in Eq. 61 in the text. We let T be a nonsingular matrix and consider the change of variables x′ = Tx.

(a) From the conditions stated, we have

m_i = (1/n_i) Σ_{x∈D_i} x,

where n_i is the number of points in category ω_i. Thus the mean of the transformed data is

m_i′ = (1/n_i) Σ_{x∈D_i} Tx = T m_i,

and the scatter matrix of the transformed data is

S_i′ = Σ_{x∈D_i} (Tx − Tm_i)(Tx − Tm_i)ᵗ = T [ Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ ] Tᵗ = T S_i Tᵗ.

(b) From the conditions stated by the problem, the criterion function of the transformed data must obey

J_d′ = |S_W′| = |Σ_{i=1}^c S_i′| = |Σ_{i=1}^c T S_i Tᵗ| = |T (Σ_{i=1}^c S_i) Tᵗ| = |T| |Tᵗ| |Σ_{i=1}^c S_i| = |T|² J_d.

Therefore J_d′ differs from J_d only by an overall nonnegative scale factor |T|².

(c) Since J_d′ differs from J_d only by a scale factor of |T|² (which does not depend on the partitioning into clusters), J_d and J_d′ will rank partitions in the same order. Hence the optimal clustering based on J_d is also the optimal clustering based on J_d′. Optimal clustering is thus invariant to nonsingular linear transformations of the data.
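A numerical sketch of the |T|² scaling for 2-by-2 scatter matrices (our own construction; any nonsingular T works):

```python
import random

random.seed(1)

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def scatter_sum(clusters):
    """S_W = sum of per-cluster scatter matrices (2-d data)."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for C in clusters:
        n = len(C)
        m = [sum(x[i] for x in C) / n for i in (0, 1)]
        for x in C:
            for i in (0, 1):
                for j in (0, 1):
                    S[i][j] += (x[i] - m[i]) * (x[j] - m[j])
    return S

def transform(T, x):
    return [T[0][0]*x[0] + T[0][1]*x[1], T[1][0]*x[0] + T[1][1]*x[1]]

clusters = [[[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(6)]
            for _ in range(3)]
T = [[2.0, 1.0], [0.5, 1.5]]   # nonsingular, |T| = 2.5

Jd = det2(scatter_sum(clusters))
Jd_t = det2(scatter_sum([[transform(T, x) for x in C] for C in clusters]))
print(abs(Jd_t - det2(T) ** 2 * Jd) < 1e-6)  # True: J_d' = |T|^2 J_d
```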
28. Consider again a nonsingular transformation y = Ax. As in Problem 25, the scatter matrices transform as

S_W^y = Σ_{i=1}^c Σ_{y∈D̃_i} (y − m_i^y)(y − m_i^y)ᵗ = A S_W^x Aᵗ,

and in a similar way

S_B^y = A S_B^x Aᵗ,   S_t^y = A S_t^x Aᵗ.

If λ is an eigenvalue of (S_W^x)⁻¹ S_B^x with corresponding eigenvector x, that is, (S_W^x)⁻¹ S_B^x x = λx, then we have

(S_W^x)⁻¹ S_B^x (Aᵗ (Aᵗ)⁻¹) x = λx,
            └────I────┘

which is equivalent to

(S_W^x)⁻¹ S_B^x Aᵗ ((Aᵗ)⁻¹ x) = λx.

We multiply both sides on the left by (Aᵗ)⁻¹ and find

(Aᵗ)⁻¹ (S_W^x)⁻¹ S_B^x Aᵗ ((Aᵗ)⁻¹ x) = λ (Aᵗ)⁻¹ x,

which yields

(S_W^y)⁻¹ S_B^y ((Aᵗ)⁻¹ x) = λ ((Aᵗ)⁻¹ x).

Thus we see that λ is an eigenvalue of (S_W^y)⁻¹ S_B^y with corresponding eigenvector (Aᵗ)⁻¹ x.
29. We consider the problems that might arise when using the determinant criterion for clustering.

(a) Here the within-cluster matrix for category i is

S_W^i = Σ_{x∈D_i} (x − m_i)(x − m_i)ᵗ,

a sum of n_i outer products whose vectors obey Σ_{x∈D_i} (x − m_i) = 0; hence rank[S_W^i] ≤ n_i − 1, and

rank[S_W] ≤ Σ_{i=1}^c (n_i − 1) = n − c.

(b) Likewise,

S_B = Σ_{i=1}^c n_i (m_i − m)(m_i − m)ᵗ

is the sum of c matrices of rank at most 1, and thus rank[S_B] ≤ c. Moreover the vectors are constrained by

Σ_{i=1}^c n_i (m_i − m) = 0,

so in fact rank[S_B] ≤ c − 1.
The trace-based criterion

J_T = Σ_{i=1}^c Σ_{x∈D_i} (x − m_i)ᵗ S_T⁻¹ (x − m_i)

is invariant to nonsingular linear transformations y = Ax. In the transformed space

J_T^y = Σ_{i=1}^c Σ_{y∈D̃_i} (y − m_i^y)ᵗ (S_T^y)⁻¹ (y − m_i^y),

where

S_W^y = A S_W Aᵗ,   S_B^y = A S_B Aᵗ,   (S_T^y)⁻¹ = (A S_T Aᵗ)⁻¹ = (Aᵗ)⁻¹ S_T⁻¹ A⁻¹.

Therefore, we have

J_T^y = Σ_{i=1}^c Σ_{x∈D_i} (A(x − m_i))ᵗ (Aᵗ)⁻¹ S_T⁻¹ A⁻¹ (A(x − m_i))
      = Σ_{i=1}^c Σ_{x∈D_i} (x − m_i)ᵗ Aᵗ (Aᵗ)⁻¹ S_T⁻¹ (x − m_i)
      = Σ_{i=1}^c Σ_{x∈D_i} (x − m_i)ᵗ S_T⁻¹ (x − m_i) = J_T.
Now consider transferring a single sample x̂ from cluster D_i to cluster D_j. The criterion is

J_T = Σ_{k=1}^c Σ_{x∈D_k} (x − m_k)ᵗ S_T⁻¹ (x − m_k),

where after the transfer the clusters are

D_k* = D_k if k ≠ i, j,   D_i* = D_i − {x̂},   D_j* = D_j + {x̂}.

We note the following values of the means after transfer of the point:

m_k* = m_k   if k ≠ i, j,
m_i* = (n_i m_i − x̂)/(n_i − 1) = ((n_i − 1)m_i − (x̂ − m_i))/(n_i − 1) = m_i − (x̂ − m_i)/(n_i − 1),
m_j* = (n_j m_j + x̂)/(n_j + 1) = ((n_j + 1)m_j + (x̂ − m_j))/(n_j + 1) = m_j + (x̂ − m_j)/(n_j + 1).

The new criterion value is

J_T* = Σ_{k≠i,j} Σ_{x∈D_k} (x − m_k)ᵗ S_T⁻¹ (x − m_k)
     + Σ_{x∈D_i*} (x − m_i*)ᵗ S_T⁻¹ (x − m_i*)
     + Σ_{x∈D_j*} (x − m_j*)ᵗ S_T⁻¹ (x − m_j*).   (∗)

For the donor cluster, write δ = (x̂ − m_i)/(n_i − 1), so that x − m_i* = (x − m_i) + δ and Σ_{x∈D_i*} (x − m_i) = −(x̂ − m_i). Expanding the quadratic form,

Σ_{x∈D_i*} (x − m_i*)ᵗ S_T⁻¹ (x − m_i*)
  = Σ_{x∈D_i*} (x − m_i)ᵗ S_T⁻¹ (x − m_i) + 2δᵗ S_T⁻¹ Σ_{x∈D_i*} (x − m_i) + (n_i − 1) δᵗ S_T⁻¹ δ
  = Σ_{x∈D_i} (x − m_i)ᵗ S_T⁻¹ (x − m_i) − (x̂ − m_i)ᵗ S_T⁻¹ (x̂ − m_i) [1 + 2/(n_i − 1) − 1/(n_i − 1)]
  = Σ_{x∈D_i} (x − m_i)ᵗ S_T⁻¹ (x − m_i) − (n_i/(n_i − 1)) (x̂ − m_i)ᵗ S_T⁻¹ (x̂ − m_i).

Likewise for the recipient cluster, with ε = (x̂ − m_j)/(n_j + 1) and Σ_{x∈D_j*} (x − m_j) = x̂ − m_j,

Σ_{x∈D_j*} (x − m_j*)ᵗ S_T⁻¹ (x − m_j*)
  = Σ_{x∈D_j} (x − m_j)ᵗ S_T⁻¹ (x − m_j) + (n_j/(n_j + 1)) (x̂ − m_j)ᵗ S_T⁻¹ (x̂ − m_j).

Substituting these into (∗) gives

J_T* = J_T + (n_j/(n_j + 1)) (x̂ − m_j)ᵗ S_T⁻¹ (x̂ − m_j) − (n_i/(n_i − 1)) (x̂ − m_i)ᵗ S_T⁻¹ (x̂ − m_i).

Thus the transfer reduces J_T exactly when

(n_j/(n_j + 1)) (x̂ − m_j)ᵗ S_T⁻¹ (x̂ − m_j) < (n_i/(n_i − 1)) (x̂ − m_i)ᵗ S_T⁻¹ (x̂ − m_i).
(c) If we let D denote the data set and n the number of points, the algorithm follows directly from the transfer test just derived:

Algorithm 0 (Minimize J_T)
1  begin initialize D, c
2    compute c means m_1, . . . , m_c
3    compute J_T
4    do select a candidate sample x̂ in some cluster D_i with n_i > 1
5      compute the change in J_T from transferring x̂ to each other cluster D_j
6      if some transfer decreases J_T then transfer x̂ to the D_j giving the greatest decrease
7        recompute J_T, m_i, m_j
8    until no transfer of a single sample decreases J_T
9    return m_1, . . . , m_c
10 end
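A runnable sketch of this transfer-based descent (our own implementation; for simplicity we take S_T = I, so the criterion reduces to the sum-of-squared-error J_e):

```python
import random

random.seed(2)

def crit(clusters):
    """Sum-of-squared-error criterion over all clusters (S_T = I case)."""
    total = 0.0
    for C in clusters:
        m = [sum(x[k] for x in C) / len(C) for k in range(len(C[0]))]
        total += sum(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for x in C)
    return total

def transfer_descent(clusters):
    """Transfer single samples between clusters while doing so lowers the criterion."""
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for x in list(clusters[i]):
                if len(clusters[i]) <= 1:
                    break
                for j in range(len(clusters)):
                    if j == i:
                        continue
                    before = crit(clusters)
                    clusters[i].remove(x)
                    clusters[j].append(x)
                    if crit(clusters) < before - 1e-12:
                        changed = True
                        break            # keep the move, go on to the next sample
                    clusters[j].remove(x)  # revert the trial move
                    clusters[i].append(x)

ga = [[random.gauss(0, 0.3), random.gauss(0, 0.3)] for _ in range(10)]
gb = [[random.gauss(5, 0.3), random.gauss(5, 0.3)] for _ in range(10)]
clusters = [ga[:5] + gb[:5], ga[5:] + gb[5:]]   # deliberately scrambled start

start = crit(clusters)
transfer_descent(clusters)
end = crit(clusters)
print(end < start)  # True: the descent strictly reduced the criterion
```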
Now consider the same transfer of x̂ from D_i to D_j under the sum-of-squared-error criterion J_e = tr[S_W]. Since tr[S_T] = tr[S_W] + tr[S_B] and S_T does not depend on the partition, we have

J_e = tr[S_T] − tr[S_B] = tr[S_T] − Σ_{k=1}^c n_k ||m_k − m||².   (∗)

After the transfer only the terms k = i, j change; the means obey

m_i* = m_i − (x̂ − m_i)/(n_i − 1),   m_j* = m_j + (x̂ − m_j)/(n_j + 1),

and the cluster sizes become n_i − 1 and n_j + 1. The altered contribution to the sum in (∗) is

(n_i − 1) || m_i − (x̂ − m_i)/(n_i − 1) − m ||² + (n_j + 1) || m_j + (x̂ − m_j)/(n_j + 1) − m ||².

Expanding the squares,

= (n_i − 1)||m_i − m||² − 2(x̂ − m_i)ᵗ(m_i − m) + ||x̂ − m_i||²/(n_i − 1)
+ (n_j + 1)||m_j − m||² + 2(x̂ − m_j)ᵗ(m_j − m) + ||x̂ − m_j||²/(n_j + 1).

Using the identity ||m_i − m||² + 2(x̂ − m_i)ᵗ(m_i − m) = ||x̂ − m||² − ||x̂ − m_i||² (and its analog for j; the ||x̂ − m||² terms cancel), this becomes

= n_i ||m_i − m||² + n_j ||m_j − m||² + (n_i/(n_i − 1)) ||x̂ − m_i||² − (n_j/(n_j + 1)) ||x̂ − m_j||².

Substituting back into (∗),

J_e* = J_e + (n_j/(n_j + 1)) ||x̂ − m_j||² − (n_i/(n_i − 1)) ||x̂ − m_i||².
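This update formula is easy to verify against a direct recomputation (a sketch with made-up clusters):

```python
import random

random.seed(3)
d = 2
Di = [[random.gauss(0, 1) for _ in range(d)] for _ in range(8)]
Dj = [[random.gauss(4, 1) for _ in range(d)] for _ in range(5)]

def sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(C):
    return [sum(x[k] for x in C) / len(C) for k in range(d)]

def Je(*clusters):
    return sum(sum(sq(x, mean(C)) for x in C) for C in clusters)

xhat = Di[0]
ni, nj = len(Di), len(Dj)
mi, mj = mean(Di), mean(Dj)

predicted = Je(Di, Dj) + nj/(nj + 1) * sq(xhat, mj) - ni/(ni - 1) * sq(xhat, mi)
actual = Je(Di[1:], Dj + [xhat])   # recompute after actually moving xhat
print(abs(predicted - actual) < 1e-9)  # True
```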
Section 10.9
32. Our similarity measure is given by Eq. 50 in the text:

s(x, x′) = xᵗx′ / (||x|| ||x′||).

(a) We have that x and x′ are d-dimensional vectors with x_i = 1 if x possesses the ith feature and x_i = −1 otherwise. The Euclidean length of such vectors obeys

||x|| = ||x′|| = √( Σ_{i=1}^d x_i² ) = √( Σ_{i=1}^d 1 ) = √d,

and thus

s(x, x′) = (1/d) Σ_{i=1}^d x_i x_i′
         = (1/d) [number of common features − number of features not common]
         = (1/d) [number of common features − (d − number of common features)]
         = (2/d) (number of common features) − 1.

The corresponding squared Euclidean distance obeys

||x − x′||² = (x − x′)ᵗ(x − x′) = xᵗx + x′ᵗx′ − 2xᵗx′ = ||x||² + ||x′||² − 2 s(x, x′) ||x|| ||x′|| = d + d − 2 s(x, x′) √d √d = 2d(1 − s(x, x′)).

Regarded as a dissimilarity δ(x, x′) = ||x − x′||², clearly we have

δ(x, x′) ≥ 0                 (nonnegativity)
δ(x, x′) = 0 ⟺ x = x′        (uniqueness)
δ(x, x′) = δ(x′, x)          (symmetry).

Now consider the triangle inequality for the particular case d = 1, and x = 0, x′ = 1/2 and x″ = 1:

δ(x, x″) = (0 − 1)² = 1,   δ(x, x′) = (0 − 1/2)² = 1/4,   δ(x′, x″) = (1/2 − 1)² = 1/4.

Thus we have

δ(x, x′) + δ(x′, x″) = 1/2 < 1 = δ(x, x″),

and thus δ(x, x″) is not less than or equal to δ(x, x′) + δ(x′, x″). In short, the squared Euclidean distance does not obey the triangle inequality and is not a metric. Hence it cannot be an ultrametric either.
(b) Euclidean distance:

δ(x, x′) = ||x − x′|| = √( Σ_{i=1}^d (x_i − x_i′)² ).

Nonnegativity, uniqueness and symmetry follow as before. For the triangle inequality we first note that, by the Cauchy-Schwarz inequality,

||x + x′||² = (x + x′)ᵗ(x + x′) = ||x||² + ||x′||² + 2xᵗx′ ≤ ||x||² + ||x′||² + 2||x|| ||x′|| = (||x|| + ||x′||)²,

and thus

||x + x′|| ≤ ||x|| + ||x′||.

We put this together to find

δ(x, x″) = ||x − x″|| = ||(x − x′) + (x′ − x″)|| ≤ ||x − x′|| + ||x′ − x″|| = δ(x, x′) + δ(x′, x″),

and thus the Euclidean distance is indeed a metric. We use the same sample points as in the example of part (a) to test whether it is an ultrametric:

max[δ(x, x′), δ(x′, x″)] = max[1/2, 1/2] = 1/2 < 1 = δ(x, x″).

Thus the Euclidean metric is not an ultrametric.
(c) Minkowski metric:

δ(x, x′) = ( Σ_{i=1}^d |x_i − x_i′|^q )^{1/q}.

Nonnegativity, uniqueness and symmetry are immediate; the substantive property is the triangle inequality, which rests on the Hölder inequality

Σ_{i=1}^d |x_i y_i| ≤ ( Σ_{i=1}^d |x_i|^p )^{1/p} ( Σ_{i=1}^d |y_i|^q )^{1/q},   where 1/p + 1/q = 1,

with equality if and only if

|x_j| / ( Σ_i |x_i|^p )^{1/p} = |y_j| / ( Σ_i |y_i|^q )^{1/q}

for all j. The limiting case of p = 1 and q = ∞ can be easily verified directly. We thus turn to the case 1 < p, q < ∞. Consider two real, nonnegative numbers a and b, and 0 ≤ λ ≤ 1; we first show

a^λ b^{1−λ} ≤ λa + (1 − λ)b,

with equality if and only if a = b. To show this, we consider the function

f(t) = λt − t^λ + 1 − λ

for t ≥ 0. Then f′(t) = λ(1 − t^{λ−1}), which vanishes at t = 1, and f(t) ≥ f(1) = 0 with equality only for t = 1. Thus we have

t^λ ≤ λt + 1 − λ.

We let t = a/b, multiply through by b, and our intermediate inequality is thus proven. Now set λ = 1/p (so 1 − λ = 1/q) and

a = |x_j|^p / Σ_i |x_i|^p,   b = |y_j|^q / Σ_i |y_i|^q.

The intermediate inequality then gives

|x_j| |y_j| / [ ( Σ_i |x_i|^p )^{1/p} ( Σ_i |y_i|^q )^{1/q} ] ≤ (1/p) |x_j|^p / Σ_i |x_i|^p + (1/q) |y_j|^q / Σ_i |y_i|^q.

Summing over j = 1, . . . , d, the right-hand side becomes 1/p + 1/q = 1, and the Hölder inequality is thereby proven.

We now use the Hölder inequality to prove that the triangle inequality holds for the Minkowski metric, a result known as Minkowski's inequality. We write

Σ_{i=1}^d |x_i + x_i′|^p ≤ Σ_{i=1}^d |x_i + x_i′|^{p−1} |x_i| + Σ_{i=1}^d |x_i + x_i′|^{p−1} |x_i′|.

We apply Hölder's inequality to each summation on the right-hand side, noting (p − 1)q = p, and find

Σ_{i=1}^d |x_i + x_i′|^p ≤ ( Σ_i |x_i + x_i′|^{(p−1)q} )^{1/q} [ ( Σ_i |x_i|^p )^{1/p} + ( Σ_i |x_i′|^p )^{1/p} ].

We divide both sides by ( Σ_i |x_i + x_i′|^p )^{1/q} and, since 1 − 1/q = 1/p, thereby obtain

( Σ_{i=1}^d |x_i + x_i′|^p )^{1/p} ≤ ( Σ_{i=1}^d |x_i|^p )^{1/p} + ( Σ_{i=1}^d |x_i′|^p )^{1/p},

which is the triangle inequality for the Minkowski metric.
(d) Cosine:

s(x, x′) = xᵗx′ / (||x|| ||x′||).

We investigate the condition of uniqueness in the particular case d = 2 and x = (1, 1)ᵗ, x′ = (1, −1)ᵗ. For these points we have

xᵗx′ = 1 − 1 = 0,

but note that x ≠ x′ here. Therefore s(x, x′) does not possess the uniqueness property, and thus is not a metric and cannot be an ultrametric either.

(e) Dot product:

s(x, x′) = xᵗx′.

We use the same counterexample as in part (d) to see that the dot product is neither a metric nor an ultrametric.

(f) One-sided tangent distance:

s(x, x′) = min over a of ||x + Ta − x′||²,

where T is the matrix of tangent vectors computed at x. There is no reason why the symmetry property will be obeyed in general, that is, s(x, x′) ≠ s(x′, x), since the tangent vectors are computed at x but not at x′. Hence the one-sided tangent distance is not a metric, and cannot be an ultrametric either.
Let the clusters D_i and D_j be merged into D_k = D_i ∪ D_j, and let D_h be any other cluster. We verify the update equation

d_hk = α_i d_hi + α_j d_hj + β d_ij + γ |d_hi − d_hj|

for the standard cluster-distance measures.

For d_min(D_h, D_k) = min over x∈D_h, x′∈D_k of ||x − x′||:

d_hk = min over x∈D_h, x′∈D_i∪D_j of ||x − x′|| = min[d_hi, d_hj] = ½ d_hi + ½ d_hj − ½ |d_hi − d_hj|,

that is, α_i = α_j = 1/2, β = 0, γ = −1/2.

For d_max(D_h, D_k) = max over x∈D_h, x′∈D_k of ||x − x′||:

d_hk = max[d_hi, d_hj] = ½ d_hi + ½ d_hj + ½ |d_hi − d_hj|,

that is, α_i = α_j = 1/2, β = 0, γ = +1/2.

For d_avg(D_h, D_k) = (1/(n_h n_k)) Σ_{x∈D_h} Σ_{x′∈D_k} ||x − x′||, with n_k = n_i + n_j:

d_hk = (1/(n_h(n_i + n_j))) Σ_{x∈D_h} Σ_{x′∈D_i∪D_j} ||x − x′||
     = (1/(n_h(n_i + n_j))) [ Σ_{x∈D_h} Σ_{x′∈D_i} ||x − x′|| + Σ_{x∈D_h} Σ_{x′∈D_j} ||x − x′|| ]
     = (1/(n_h(n_i + n_j))) [n_h n_i d_hi + n_h n_j d_hj]
     = (n_i/(n_i + n_j)) d_hi + (n_j/(n_i + n_j)) d_hj,

that is, α_i = n_i/(n_i + n_j), α_j = n_j/(n_i + n_j), β = γ = 0.
For d_mean(D_h, D_k) = ||m_h − m_k||², we first note that the mean of the merged cluster is

m_k = (1/(n_i + n_j)) Σ_{x∈D_i∪D_j} x = (n_i m_i + n_j m_j)/(n_i + n_j).

Writing α_i = n_i/(n_i + n_j) and α_j = n_j/(n_i + n_j) (so α_i + α_j = 1),

d_hk = || m_h − α_i m_i − α_j m_j ||²
     = || α_i (m_h − m_i) + α_j (m_h − m_j) ||²
     = α_i² ||m_h − m_i||² + α_j² ||m_h − m_j||² + 2 α_i α_j (m_h − m_i)ᵗ(m_h − m_j).

Using 2aᵗb = ||a||² + ||b||² − ||a − b||² with a = m_h − m_i and b = m_h − m_j (so a − b = m_j − m_i),

2 (m_h − m_i)ᵗ(m_h − m_j) = ||m_h − m_i||² + ||m_h − m_j||² − ||m_i − m_j||²,

and hence, since α_i² + α_i α_j = α_i and α_j² + α_i α_j = α_j,

d_hk = α_i ||m_h − m_i||² + α_j ||m_h − m_j||² − α_i α_j ||m_i − m_j||²
     = α_i d_hi + α_j d_hj + β d_ij + γ |d_hi − d_hj|,

where

α_i = n_i/(n_i + n_j),
α_j = n_j/(n_i + n_j),
β   = −α_i α_j = −n_i n_j/(n_i + n_j)²,
γ   = 0.
For the sum-of-squared-error criterion, note first that per cluster Σ_{x∈D_{i′}} ||x − m_{i′}||² = Σ_{x∈D_{i′}} xᵗx − n_{i′} m_{i′}ᵗ m_{i′}, so

J_e = Σ_{i′=1}^c Σ_{x∈D_{i′}} ||x − m_{i′}||² = Σ_x xᵗx − Σ_{i′=1}^c n_{i′} m_{i′}ᵗ m_{i′}.

Suppose clusters D_i and D_j are merged into D_k, with n_k = n_i + n_j and

m_k = (1/(n_i + n_j)) ( Σ_{x∈D_i} x + Σ_{x∈D_j} x ) = (n_i m_i + n_j m_j)/(n_i + n_j).

Only the mean terms for i and j change, so the increase in the criterion is

J_e* − J_e = n_i m_iᵗm_i + n_j m_jᵗm_j − n_k m_kᵗm_k,

where

n_k m_kᵗm_k = (n_i + n_j) || (n_i m_i + n_j m_j)/(n_i + n_j) ||²
            = ( n_i² m_iᵗm_i + n_j² m_jᵗm_j + 2 n_i n_j m_iᵗm_j ) / (n_i + n_j).

Therefore

J_e* − J_e = ( n_i n_j m_iᵗm_i + n_i n_j m_jᵗm_j − 2 n_i n_j m_iᵗm_j ) / (n_i + n_j)
           = (n_i n_j/(n_i + n_j)) ||m_i − m_j||².

Thus merging always increases J_e, and the pair of clusters to merge at each step is the one minimizing n_i n_j ||m_i − m_j||²/(n_i + n_j).
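The merge cost can be checked directly against recomputed scatters (a sketch with made-up clusters):

```python
import random

random.seed(4)
Di = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(7)]
Dj = [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(4)]

def mean(C):
    return [sum(x[k] for x in C) / len(C) for k in (0, 1)]

def scatter(C):
    m = mean(C)
    return sum((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2 for x in C)

ni, nj = len(Di), len(Dj)
mi, mj = mean(Di), mean(Dj)
dist2 = (mi[0] - mj[0]) ** 2 + (mi[1] - mj[1]) ** 2

increase = scatter(Di + Dj) - (scatter(Di) + scatter(Dj))
print(abs(increase - ni * nj / (ni + nj) * dist2) < 1e-9)  # True
```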
Delete e′ from T′. The result is a tree with lower sum of weights, so T′ was not a minimal spanning tree. Thus, by contradiction, we have proven that T was optimal.
Section 10.10
40. Problem not yet solved
41. As given in Problem 35, the change in J_e due to splitting one cluster into two is

J_e(2) = J_e(1) − (n_1 n_2/(n_1 + n_2)) ||m_1 − m_2||².

Suppose the n samples come from a d-dimensional Gaussian with covariance σ²I. For a single cluster,

J_e(1) = Σ_{i=1}^d Σ_{x∈D} (x_i − m_i)²,

and since each coordinate contributes a sum with expectation (n − 1)σ², we have

E[J_e(1)] = (n − 1) d σ² ≃ n d σ²   for n large.

Now consider the suboptimal split by a hyperplane through the mean, perpendicular to the first axis:

D_1 = {x : x_1 > μ_1},   D_2 = {x : x_1 < μ_1}.

Thus we have

E[ (n_1 n_2/(n_1 + n_2)) ||m_1 − m_2||² ] = E[ (n_1 n_2/(n_1 + n_2)) Σ_{i=1}^d (m_{1i} − m_{2i})² ] ≃ (n/4) E[(m_{11} − m_{21})²],

since n_1 ≈ n_2 ≈ n/2 and the two means differ, in expectation, only along the first coordinate. Without loss of generality we let μ_1 = 0. Then p(x_1) ~ N(0, σ²) and thus

E(x_1 | x_1 > 0) = √(2/π) σ,   E(x_1 | x_1 < 0) = −√(2/π) σ.

Therefore m_{11} − m_{21} ≃ 2√(2/π) σ for n large, and

E[J_e(2)] ≃ n d σ² − [(n/2)(n/2)/((n/2) + (n/2))] (2√(2/π) σ)²
         = n d σ² − (n/4)(8/π) σ²
         = n (d − 2/π) σ².

To estimate the fluctuations, we write

J_e(1) = J_e(2) + [J_e(1) − J_e(2)]

and take the variance of both sides, where the difference term is

J_e(1) − J_e(2) = (n_1 n_2/(n_1 + n_2)) ||m_1 − m_2||².
begin initialize i ← 0, T ← {}
  do i ← i + 1
    c_i ← x_i   (each point starts as its own cluster)
  until i = n
  compute all e_ij ← ||x_i − x_j|| and sort them in nondecreasing order
  E ← ordered set of the e_ij
  do for each e_ij in E in nondecreasing order
    if x_i and x_j belong to disjoint clusters
      then append e_ij to T; append c_j to c_i
  until all e_ij considered
  return T
end
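The procedure above is essentially Kruskal's minimum-spanning-tree construction. A compact sketch of ours, using a union-find structure for the "disjoint clusters" test:

```python
import itertools
import math

def mst_edges(points):
    """Kruskal's algorithm: grow the MST from edges sorted by length."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    dist = lambda a, b: math.dist(points[a], points[b])
    edges = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda e: dist(*e))
    T = []
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # i and j lie in disjoint clusters: accept the edge
            parent[ri] = rj
            T.append((i, j))
    return T

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
tree = mst_edges(pts)
print(len(tree))  # 4 edges for 5 points
```

Cutting the longest edges of the resulting tree yields the single-linkage clusters.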
(a) The network is trained on four input-output pairs implementing an XOR-like labeling, with inputs (0, 0)ᵗ, (1/2, 1/√3)ᵗ, (1, 1/2)ᵗ and (1/2, −1/√3)ᵗ mapped to cluster outputs 1, 0, 0 and 1, respectively. Associate the weight vector w^0 with cluster 0 and w^1 with cluster 1, and the subscripts with the different dimensions. Presenting the four training pairs imposes four conditions on the weights, (∗) through (∗∗∗∗). From (∗∗∗∗) we see

(w_1^0 − w_1^1) + (w_2^0 − w_2^1) < (w_3^1 − w_3^0).

We combine this with (∗), (∗∗) and (∗∗∗) and find

0 < 2(w_3^1 − w_3^0) ≤ (w_1^0 − w_1^1) + (w_2^0 − w_2^1) < (w_3^1 − w_3^0).

This is a contradiction. Thus there is no set of weights that the ART network could converge to that would implement the XOR solution.
(b) Suppose we are given a weight vector w for cluster C1, the vigilance parameter ρ, and two inputs x1 and x2 such that x1 passes the vigilance test for w but x2 does not, while x2 does pass the test once w has been adapted toward x1. Since the top-down feedback tries to push the input vector to be closer to w, we can use the angle between w and x1 in this case to make our point. If we present x1 before x2, then x1 will be classified as belonging to cluster C1. The weight vector w is then slightly adjusted, and now x2 is also close enough to w to be classified as belonging to cluster C1; thus we will not create another cluster. In contrast, if we present x2 before x1, the angle between w and x2 exceeds what the vigilance ρ allows, and thus we will introduce a new cluster C2 containing x2. Due to the fixed value of ρ but weights that change in dependence on the input, the number of clusters depends on the order in which the samples are presented.

(c) In a stationary environment the ART network is able to classify inputs robustly because the feedback connections will drive the network into a stable state even when the input is corrupted by noise. In a nonstationary environment the feedback mechanism will delay the adaptation to the changing input vectors, because the feedback connections will interpret the changing inputs as noisy versions of the original input and will try to force the input to stay the same. Without the feedback, the adaptation of the weights would account for changing sample configurations more quickly.
Section 10.13
44. Let e be a vector of unit length, x the input vector, and a = xᵗe the projection of x onto e.

(a) We can write the variance as

σ² = E_x[a²] = E_x[(xᵗe)ᵗ(xᵗe)] = E_x[eᵗ x xᵗ e] = eᵗ E_x[x xᵗ] e = eᵗ Σ e,

since e is fixed and thus can be taken out of the expectation operator.

(b) We use the Taylor expansion:

σ²(e + δe) = (e + δe)ᵗ Σ (e + δe) = eᵗΣe + 2 δeᵗ Σ e + O(||δe||²).

(c) Expanding x in the complete orthonormal basis {e_i} of eigenvectors of Σ,

x = Σ_{i=1}^d (e_iᵗ x) e_i.

We define the error E_k(x) as the sum-squared error between the vector x and its projection onto a k-dimensional subspace, that is,

E_k(x) = ( x − Σ_{i=1}^k (e_iᵗx) e_i )ᵗ ( x − Σ_{i=1}^k (e_iᵗx) e_i )
       = ( Σ_{i=k+1}^d (e_iᵗx) e_i )ᵗ ( Σ_{i=k+1}^d (e_iᵗx) e_i )
       = Σ_{i=k+1}^d (e_iᵗx)²,

by orthonormality. Hence

E_x[E_k(x)] = Σ_{i=k+1}^d E_x[(xᵗe_i)ᵗ(xᵗe_i)] = Σ_{i=k+1}^d e_iᵗ Σ e_i = Σ_{i=k+1}^d λ_i e_iᵗe_i = Σ_{i=k+1}^d λ_i.

This then gives us the expected error. Thus to minimize the squared-error criterion, the k-dimensional subspace should be spanned by the k largest eigenvectors, since the error is the sum of the remaining d − k eigenvalues of the covariance matrix.
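The result is easy to confirm numerically with NumPy (a sketch of ours; note `eigh` returns eigenvalues in ascending order, so the discarded ones are the first d − k):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))                # induces an anisotropic covariance
X = rng.normal(size=(5000, 4)) @ A.T

Sigma = X.T @ X / len(X)                   # empirical second-moment matrix
vals, vecs = np.linalg.eigh(Sigma)         # ascending eigenvalues

k = 2
top = vecs[:, -k:]                         # k largest eigenvectors
proj = X @ top @ top.T                     # projection onto their span
mean_err = np.mean(np.sum((X - proj) ** 2, axis=1))

print(np.isclose(mean_err, vals[:-k].sum()))  # True: sum of discarded eigenvalues
```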
45. Problem not yet solved
46. Problem not yet solved
47. Problem not yet solved
48. Problem not yet solved
49. Problem not yet solved
Section 10.14
50. Consider the three points x_1 = (1, 0)ᵗ, x_2 = (0, 0)ᵗ, and x_3 = (0, 1)ᵗ, with distances

δ_12 = ||x_1 − x_2|| = √((1−0)² + (0−0)²) = 1
δ_13 = ||x_1 − x_3|| = √((1−0)² + (0−1)²) = √2
δ_23 = ||x_2 − x_3|| = √((0−0)² + (0−1)²) = 1.

We seek one-dimensional image points y_1, y_2, y_3. We assume for definiteness, and without loss of generality, that the transformed points obey 0 = y_1 < y_2 < y_3, and parameterize

y_2 = Δ_1² > 0,   y_3 = Δ_1² + Δ_2² > Δ_1² = y_2,

so that the image distances are

d_12 = y_2 − y_1 = Δ_1²,   d_13 = y_3 − y_1 = Δ_1² + Δ_2²,   d_23 = y_3 − y_2 = Δ_2².

For the criterion J_ee ∝ Σ_{i<j} (d_ij − δ_ij)² (the normalization Σ_{i<j} δ_ij² = 4 does not affect the minimizer), we minimize

(Δ_1² − 1)² + (Δ_1² + Δ_2² − √2)² + (Δ_2² − 1)²,

and thus the derivatives obey

∂J_ee/∂Δ_1 ∝ [2(Δ_1² − 1) + 2(Δ_1² + Δ_2² − √2)] 2Δ_1 = 0
∂J_ee/∂Δ_2 ∝ [2(Δ_2² − 1) + 2(Δ_1² + Δ_2² − √2)] 2Δ_2 = 0.

Because Δ_1, Δ_2 ≠ 0, we have

Δ_1² − 1 = −(Δ_1² + Δ_2² − √2) = Δ_2² − 1,

so Δ_1² = Δ_2² and 3Δ_1² = 1 + √2, that is,

Δ_1² = (1 + √2)/3.

Thus the image points are

y_1 = 0,   y_2 = Δ_1² = (1 + √2)/3 ≈ 0.805,   y_3 = Δ_1² + Δ_2² = 2Δ_1² = 2y_2.

For the criterion J_ff = Σ_{i<j} ((d_ij − δ_ij)/δ_ij)², we instead minimize

J_ff = (Δ_1² − 1)² + ½(Δ_1² + Δ_2² − √2)² + (Δ_2² − 1)²,

and thus the derivatives

∂J_ff/∂Δ_1 = [2(Δ_1² − 1) + (Δ_1² + Δ_2² − √2)] 2Δ_1
∂J_ff/∂Δ_2 = [2(Δ_2² − 1) + (Δ_1² + Δ_2² − √2)] 2Δ_2.

Because Δ_1, Δ_2 ≠ 0, setting these to zero again gives Δ_1² = Δ_2², and

Δ_1² − 1 + ½(2Δ_1² − √2) = 0   ⟹   2Δ_1² = 1 + √2/2   ⟹   Δ_1² = (2 + √2)/4.

Thus in this case

y_1 = 0,   y_2 = Δ_1² = (2 + √2)/4 ≈ 0.854,   y_3 = Δ_1² + Δ_2² = 2Δ_1² = 2y_2.
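Both closed-form minimizers can be confirmed by a brute-force search over the symmetric configuration y = (0, t, 2t) that the stationarity conditions single out (a sketch of ours; grid resolution is arbitrary):

```python
SQRT2 = 2 ** 0.5

def Jee_num(t):
    """Numerator of J_ee for y = (0, t, 2t); the denominator is constant."""
    return (t - 1) ** 2 + (2 * t - SQRT2) ** 2 + (t - 1) ** 2

def Jff(t):
    return (t - 1) ** 2 + ((2 * t - SQRT2) / SQRT2) ** 2 + (t - 1) ** 2

def argmin(f):
    ts = [i / 100000 for i in range(1, 200000)]
    return min(ts, key=f)

t_ee = argmin(Jee_num)
t_ff = argmin(Jff)
print(abs(t_ee - (1 + SQRT2) / 3) < 1e-4)   # True
print(abs(t_ff - (2 + SQRT2) / 4) < 1e-4)   # True
```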
Computer Exercises
Section 10.4
1. Computer exercise not yet solved
2. Computer exercise not yet solved
3. Computer exercise not yet solved
4. Computer exercise not yet solved
5. Computer exercise not yet solved
Section 10.5
6. Computer exercise not yet solved
7. Computer exercise not yet solved
Section 10.6
8. Computer exercise not yet solved
Section 10.7
9. Computer exercise not yet solved
Section 10.8
10. Computer exercise not yet solved
Section 10.9
11. Computer exercise not yet solved
12. Computer exercise not yet solved
Section 10.11
13. Computer exercise not yet solved
Section 10.12
14. Computer exercise not yet solved
Section 10.13
15. Computer exercise not yet solved
16. Computer exercise not yet solved
Section 10.14
17. Computer exercise not yet solved
p(x|ω_i) = (2/w_i)(1 − x/w_i) for 0 ≤ x ≤ w_i, and 0 otherwise,

where the w_i for i = 1, 2 are positive but unknown parameters.

(a) Confirm that the distributions are normalized.
(b) The following data were collected: D1 = {2, 5} and D2 = {3, 9} for ω1 and ω2, respectively. Find the maximum-likelihood values ŵ1 and ŵ2.
(c) Given your answer to part (b), determine the decision boundary x* for minimum classification error. Be sure to state which category corresponds to values higher than x*, and which to values lower than x*.
(d) What is the expected error of your classifier in part (b)?
3. (10 points total) Consider a two-dimensional, three-category pattern classification problem, with equal priors P(ω1) = P(ω2) = P(ω3) = 1/3. We define the disk distribution D(μ, r) to be uniform inside a circular disk centered on μ having radius r, and 0 elsewhere. Suppose we model each distribution as a disk D(μ_i, r_i) and after training our classifier find the following parameters:

ω1: μ1 = (3, 2)ᵗ, r1 = 2
ω2: μ2 = (4, 1)ᵗ, r2 = 1
ω3: μ3 = (5, 4)ᵗ, r3 = 3

(a) (1 pt) Use this information to classify the point x = (6, 2)ᵗ with minimum probability of error.
(b) (1 pt) Use this information to classify the point x = (3, 3)ᵗ.
(c) (8 pts) Use this information to classify the point x = (∗, 0.5)ᵗ, where ∗ denotes a missing feature.
4. (10 points total) It is easy to see that the nearest-neighbor error rate P can equal the Bayes rate P* if P* = 0 (the best possibility) or if P* = (c − 1)/c (the worst possibility). One might ask whether or not there are problems for which P = P* where P* is between these extremes.

(a) Show that the Bayes rate for the one-dimensional case where P(ω_i) = 1/c and

p(x|ω_i) = 1 for 0 ≤ x ≤ cr/(c−1),  1 for i ≤ x ≤ i + 1 − cr/(c−1),  and 0 elsewhere

is P* = r.

(b) Show that for this case the nearest-neighbor rate is P = P*.
5. (10 points total) Consider a d-n_H-c three-layer backpropagation network, trained on the traditional sum-squared-error criterion, where the inputs have been standardized to have mean zero and unit variance in each feature. All nonlinear units are sigmoidal, with transfer function

f(net) = a tanh(b·net) = 2a/(1 + e^{−2b·net}) − a,

where a = 1.716 and b = 2/3, and net is the appropriate net activation at a unit, as described in the text.

(a) State why when initializing weights in the network we never set them all to have value zero.
(b) Instead, to speed learning we initialize weights in a range −w̃ ≤ w ≤ +w̃. For the input-to-hidden weights, what should be this range? That is, what is a good value of w̃?
(c) Explain what your answer to part (b) achieves. That is, explain as specifically as possible the motivation that leads to your answer in part (b).
6. (15 points total) Consider training a tree-based classifier with the following eight points of the form (x_1, x_2)ᵗ in two categories, among them the four ω2 points

(3, 5)ᵗ, (4, 7)ᵗ, (5, 6)ᵗ, (8, 1)ᵗ.

J_ee = Σ_{i<j} (d_ij − δ_ij)² / Σ_{i<j} δ_ij²
Important formulas

P(ω_i|x) = p(x|ω_i)P(ω_i) / Σ_{k=1}^c P(ω_k)p(x|ω_k)

p(x) = (1/((2π)^{d/2}|Σ|^{1/2})) exp[ −½ (x − μ)ᵗ Σ⁻¹ (x − μ) ]

Σ = E[(x − μ)(x − μ)ᵗ] = ∫ (x − μ)(x − μ)ᵗ p(x) dx,   Σ̂ = (1/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)ᵗ

P(ω_i|x_g) = ∫ P(ω_i|x) p(x) dx_b / ∫ p(x) dx_b

l(θ) = ln p(D|θ)

p(x|D) = ∫ p(x|θ) p(θ|D) dθ

Q(θ; θ^i) = E_{D_b}[ ln p(D_g, D_b; θ) | D_g; θ^i ]

lim_{n→∞} P_n(e|x) = 1 − Σ_{i=1}^c P²(ω_i|x)

P* ≤ P ≤ P* (2 − (c/(c−1)) P*)

J_p(a) = Σ_{y∈Y} (−aᵗy)   (Y the set of misclassified samples)

Δw_kj = η δ_k y_j,   Δw_ji = η δ_j x_i

f(net) = 2a/(1 + e^{−2b·net}) − a

i(N) = −Σ_j P(ω_j) log2 P(ω_j)

E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²]

θ̂_(i) = θ̂(x_1, x_2, . . . , x_{i−1}, x_{i+1}, . . . , x_n),   θ̂_(·) = (1/n) Σ_{i=1}^n θ̂_(i)
EXAM 1 SOLUTION

1. We expand and cancel the quadratic term xᵗΣ⁻¹x, which is the same for both categories (and hence can be eliminated), regroup, and find

g_i(x) = w_iᵗx + w_{i0},

where

w_i = Σ⁻¹μ_i

and

w_{i0} = −½ μ_iᵗ Σ⁻¹ μ_i + ln P(ω_i),

where we used the symmetry of Σ⁻¹ to rewrite terms such as xᵗΣ⁻¹μ_i by μ_iᵗΣ⁻¹x and [Σ⁻¹]ᵗ by Σ⁻¹.

We now seek the decision boundary, that is, where g_1(x) − g_2(x) = 0, and this implies the x-dependent term involves merely the difference between the two weights w_i; that is, our equation is wᵗ(x − x_0) = 0, where

w = Σ⁻¹(μ_1 − μ_2).

Now we seek x_0. The x-independent terms require

wᵗx_0 = ½ μ_1ᵗΣ⁻¹μ_1 − ½ μ_2ᵗΣ⁻¹μ_2 − ln[P(ω_1)/P(ω_2)].

One verifies directly that

x_0 = ½(μ_1 + μ_2) − ( ln[P(ω_1)/P(ω_2)] / ((μ_1 − μ_2)ᵗΣ⁻¹(μ_1 − μ_2)) ) (μ_1 − μ_2)

satisfies this: the first term gives ½(μ_1 − μ_2)ᵗΣ⁻¹(μ_1 + μ_2) = ½μ_1ᵗΣ⁻¹μ_1 − ½μ_2ᵗΣ⁻¹μ_2, and the second gives −ln[P(ω_1)/P(ω_2)].

(Note, this is effectively the derivation of Eqs. 59-65 on pages 39-40 in the text.)
2. (20 points total) Consider a one-dimensional two-category classification problem with equal priors, P(ω1) = P(ω2) = 1/2, where the densities have the form

p(x|ω_i) = (2/w_i)(1 − x/w_i) for 0 ≤ x ≤ w_i, and 0 otherwise,

where the w_i for i = 1, 2 are positive but unknown parameters.

(a) Confirm that the distributions are normalized.
(b) The following data were collected: D1 = {2, 5} and D2 = {3, 9} for ω1 and ω2, respectively. Find the maximum-likelihood values ŵ1 and ŵ2.
(c) Given your answer to part (b), determine the decision boundary x* for minimum classification error. Be sure to state which category corresponds to values higher than x*, and which to values lower than x*.
(d) What is the expected error of your classifier in part (b)?
Solution The figure, showing p(x|ω1) with ŵ1 = 8, p(x|ω2) with ŵ2 ≈ 14.2, and the decision point x*, should be consulted.

(a) The distributions are normalized:

∫_0^w (2/w)(1 − x/w) dx = [2x/w − x²/w²]_0^w = (2 − 1) − (0 − 0) = 1.

(b) The likelihood of the two points x_1, x_2 in a category is

p(D|w) = p(x_1|w) p(x_2|w) = (2/w)(1 − x_1/w) · (2/w)(1 − x_2/w) = (4/w²)[1 − (x_1 + x_2)/w + x_1x_2/w²].

Setting the derivative with respect to w to zero,

∂p(D|w)/∂w = 4[−2w⁻³ + 3(x_1 + x_2)w⁻⁴ − 4x_1x_2 w⁻⁵] = −4w⁻⁵[2w² − 3(x_1 + x_2)w + 4x_1x_2] = 0,

and thus

w = [3(x_1 + x_2) ± √(9(x_1 + x_2)² − 32 x_1 x_2)]/4.

For ω1 we substitute x_1 = 2 and x_2 = 5 and find

ŵ1 = (21 ± √(9·49 − 320))/4 = (21 ± √121)/4 = 8 or 2.5.

Clearly, the solution 2.5 is invalid, since it is smaller than one of the points in D1; thus ŵ1 = 8.

Likewise, for ω2 we substitute x_1 = 3 and x_2 = 9 and find

ŵ2 = (36 ± √(9·144 − 864))/4 = (36 ± √432)/4 ≈ 14.2 or 3.8.

Clearly, the solution 3.8 is invalid, since it is smaller than one of the points in D2; thus ŵ2 ≈ 14.2.
(c) We seek the value of x* such that the posteriors are equal, that is, where

(2/ŵ1)(1 − x/ŵ1) = (2/ŵ2)(1 − x/ŵ2),

that is,

(2/8)(1 − x/8) = (2/14.2)(1 − x/14.2),

which has solution

x* = 6.2 / (14.2/8 − 8/14.2) = (8 · 14.2)/(8 + 14.2) ≈ 5.1,

with R1 being points less than x* and R2 being points higher than x*.

(d) The probability of error in this optimal case is

P(e) = ∫ min[P(ω1)p(x|ω1), P(ω2)p(x|ω2)] dx
     = ½ ∫_0^{5.1} (2/14.2)(1 − x/14.2) dx + ½ ∫_{5.1}^{8} (2/8)(1 − x/8) dx
     = (1/14.2)[x − x²/(2·14.2)]_0^{5.1} + (1/8)[x − x²/(2·8)]_{5.1}^{8}
     ≈ 0.360.
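The maximum-likelihood fits and the error integral can all be checked by brute force (a sketch of ours; the grid resolution and integration step are arbitrary choices):

```python
def p(x, w):
    """Triangular density (2/w)(1 - x/w) on [0, w]."""
    return (2 / w) * (1 - x / w) if 0 <= x <= w else 0.0

def likelihood(w, data):
    prod = 1.0
    for x in data:
        prod *= p(x, w)
    return prod

def ml_w(data):
    ws = [i / 1000 for i in range(1, 30001)]
    return max(ws, key=lambda w: likelihood(w, data))

w1 = ml_w([2, 5])    # -> 8.0
w2 = ml_w([3, 9])    # -> about 14.196

# Bayes error with equal priors: integrate the min of the weighted densities.
N = 200000
hi = max(w1, w2)
xs = [hi * (i + 0.5) / N for i in range(N)]
err = sum(min(0.5 * p(x, w1), 0.5 * p(x, w2)) for x in xs) * hi / N
print(round(err, 3))  # -> 0.36
```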
3. (10 points total) Consider a two-dimensional, three-category pattern classification problem, with equal priors P(ω1) = P(ω2) = P(ω3) = 1/3. We define the disk distribution D(μ, r) to be uniform inside a circular disc centered on μ having radius r, and 0 elsewhere. Suppose we model each distribution as such a disk D(μ_i, r_i) and after training our classifier find the following parameters:

ω1: μ1 = (3, 2)ᵗ, r1 = 2
ω2: μ2 = (4, 1)ᵗ, r2 = 1
ω3: μ3 = (5, 4)ᵗ, r3 = 3

(a) (1 pt) Use this information to classify the point x = (6, 2)ᵗ with minimum probability of error.
(b) (1 pt) Use this information to classify the point x = (3, 3)ᵗ.
(c) (8 pts) Use this information to classify the point x = (∗, 0.5)ᵗ, where ∗ denotes a missing feature.

Solution The figure shows the three class-conditional densities p(x|ω1), p(x|ω2) and p(x|ω3) in the x1-x2 plane.

(a) The point (6, 2)ᵗ lies inside only the ω3 disk, and thus should be classified as ω3.

(b) As can be seen from the figure, (3, 3)ᵗ is in the nonzero range of both ω1 and ω3. Because the density p(x|ω1) > p(x|ω3) at that position (the smaller disk has the higher uniform density), however, (3, 3)ᵗ should be classified as ω1.

(c) Clearly, the deficient point cannot be in ω3. We must marginalize over the bad feature, x1, to find the class-conditional densities given the good feature, x2 = 0.5. But notice that the line x2 = 0.5 goes through a greater percentage of the diameter of the ω2 disk than of the ω1 disk. In short, if we marginalize p(x|ω2) over x1 with x2 = 0.5, we get a larger value than if we perform the same marginalization for p(x|ω1), as graphed in the figure. Thus we should classify (∗, 0.5)ᵗ as ω2.
4. (10 points total) It is easy to see that the nearest-neighbor error rate P can equal the Bayes rate P* if P* = 0 (the best possibility) or if P* = (c − 1)/c (the worst possibility). One might ask whether or not there are problems for which P = P* where P* is between these extremes.

Solution It is indeed possible to have the nearest-neighbor error rate P equal to the Bayes error rate P* for nontrivial distributions.

(a) Consider uniform priors over c categories, that is, P(ω_i) = 1/c, and the one-dimensional distributions given in the problem statement,

p(x|ω_i) = 1 for 0 ≤ x ≤ cr/(c−1),  1 for i ≤ x ≤ i + 1 − cr/(c−1),  0 elsewhere.

The evidence is

p(x) = Σ_{i=1}^c p(x|ω_i)P(ω_i) = 1 for 0 ≤ x ≤ cr/(c−1),  1/c for i ≤ x ≤ i + 1 − cr/(c−1) (i = 1, . . . , c),  0 elsewhere.

Note that this automatically imposes the restriction

0 ≤ cr/(c−1) ≤ 1.

Because the P(ω_i)'s are constant, we have P(ω_i|x) ∝ p(x|ω_i), and thus

P(ω_i|x) = 1/c for 0 ≤ x ≤ cr/(c−1),  1 for i ≤ x ≤ i + 1 − cr/(c−1),  0 on the intervals belonging to ω_j with j ≠ i,

as shown in the figure for c = 3 and r = 0.1. The conditional Bayesian error is

P*(e|x) = 1 − P(ω_max|x) = 1 − 1/c if 0 ≤ x ≤ cr/(c−1),  1 − 1 = 0 if i ≤ x ≤ i + 1 − cr/(c−1),

and thus the Bayes error is

P* = ∫ P*(e|x) p(x) dx = ∫_0^{cr/(c−1)} (1 − 1/c) · 1 dx = ((c−1)/c)(cr/(c−1)) = r.

(b) The asymptotic nearest-neighbor error rate is

P = ∫ [1 − Σ_{i=1}^c P²(ω_i|x)] p(x) dx
  = ∫_0^{cr/(c−1)} [1 − c(1/c)²] · 1 dx + Σ_{j=1}^c ∫_j^{j+1−cr/(c−1)} [1 − 1] p(x) dx
  = (1 − 1/c)(cr/(c−1)) = r.

Thus P = P* = r.
5. (10 points total) Consider a d-n_H-c three-layer backpropagation network, trained on the traditional sum-squared-error criterion, where the inputs have been standardized to have mean zero and unit variance in each feature. All nonlinear units are sigmoidal, with transfer function

f(net) = a tanh(b·net) = 2a/(1 + e^{−2b·net}) − a,

where a = 1.716 and b = 2/3, and net is the appropriate net activation at a unit, as described in the text.

(a) State why when initializing weights in the network we never set them all to have value zero.
(b) Instead, to speed learning we initialize weights in a range −w̃ ≤ w ≤ +w̃. For the input-to-hidden weights, what should be this range? That is, what is a good value of w̃?
(c) Explain what your answer to part (b) achieves. That is, explain as specifically as possible the motivation that leads to your answer in part (b).

Solution

(a) If all the weights are zero, the net activation net_j in every hidden unit is zero, and thus so is its output y_j. The weight update for the hidden-to-output weights would then be

Δw_kj = η(t_k − z_k) f′(net_k) y_j = 0,   since y_j = 0.

Likewise, the weight update for the input-to-hidden weights would be zero,

Δw_ji = η [ Σ_{k=1}^c w_kj δ_k ] f′(net_j) x_i = 0,   since w_kj = 0.

In short, if all the weights were zero, learning could not progress. Moreover, as described in the text, if all the weights are the same for a given pair of layers (even if these weights are not zero), then learning cannot progress.

(b) Suppose all the weights had value 1.0. In calculating its net activation, each hidden unit is summing d random variables with variance 1.0; the variance of the value of net would then be d. We want, however, the net activation to be in roughly the linear range of f(net), that is, −1 < net < +1. Thus we want our input-to-hidden weights to be in the range −1/√d < w < +1/√d, that is, w̃ = 1/√d.

(c) We want to use the linear range of the transfer function f(net), so as to implement the simple linear model first. If the problem is complex, the training will change the weights and thus express the nonlinearities, if necessary.
6. (15 points total) Consider training a tree-based classifier with the following
eight points of the form x = (x_1, x_2)^t in two categories, four per category;
the \omega_2 points are (3, 5)^t, (4, 7)^t, (5, 6)^t, and (8, 1)^t.

[Figure: the scatter of the eight points in the (x_1, x_2) plane, and the decision
tree with root query "Is x_2 < 4.5?" whose impure branch is split by
"Is x_1 < 7.5?"]
Solution
(a) The entropy impurity at the root node is

i(root) = -\sum_{j=1}^{2} P(\omega_j)\log_2 P(\omega_j) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1 \text{ bit}.
(b) As is clear from the figure, the query "Is x_2 < 4.5?" is the optimal split at the
root.
(c) The left node contains four \omega_1 points and one \omega_2 point. Its impurity is
thus

i(L) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} = 0.258 + 0.464 = 0.722 \text{ bits}.

The right node is pure, containing three \omega_2 points. Its impurity is

i(R) = -1\log_2 1 - 0\log_2 0 = 0 \text{ bits}.
(d) The impurity at the level beneath the root is just the sum of the impurities
of the two nodes computed in part (c), each weighted by the probability that a
pattern at the root goes to that node. Thus the impurity at that level is

\frac{5}{8}(0.722) + \frac{3}{8}(0) = 0.451 \text{ bits}.

The split at the root thus reduced the information impurity by 1.0 - 0.451 =
0.549 bits.
(e) As can be seen in the figure, the query at the impure node should be
"Is x_1 < 7.5?"
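The impurity bookkeeping in parts (a)-(d) can be reproduced in a few lines. This is a minimal sketch (function name illustrative), using the class counts from the solution: (4, 4) at the root, (4, 1) at the left child, and (0, 3) at the right child:

```python
from math import log2

def impurity(counts):
    """Entropy impurity -sum_j P_j log2 P_j for a node with given class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

root = impurity([4, 4])    # 1 bit at the root
left = impurity([4, 1])    # ~0.722 bits at the impure left node
right = impurity([0, 3])   # 0 bits at the pure right node
drop = root - (5 / 8) * left - (3 / 8) * right
assert abs(root - 1.0) < 1e-9
assert abs(left - 0.722) < 1e-3
assert abs(drop - 0.549) < 1e-3   # the impurity reduction from the root split
```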
7. (10 points total) We define the 20% trimmed mean of a sample to be the
mean of the data with the top 20% of the data and the bottom 20% of the data
removed. Consider the following six points, D = {0, 4, 5, 9, 14, 15}.
(a) Calculate the jackknife estimate of the 20% trimmed mean of D.
(b) State the formula for the variance of the jackknife estimate.
(c) Calculate the variance of the jackknife estimate of the 20% trimmed mean
of D.
Solution The jackknife estimate of a statistic is merely the average of the leave-one-out estimates. In this case there are six points, and thus six leave-one-out estimates.
(a) We delete one point at a time, take the remaining data set, remove the top
20% and bottom 20% (that is, the highest point and the lowest point), and
calculate the mean. We then average these trimmed means. As shown in
the table, the jackknife estimate of the 20% trimmed mean is 7.89.
deleted point   trimmed set    trimmed mean
0               {5, 9, 14}     9.33
4               {5, 9, 14}     9.33
5               {4, 9, 14}     9.00
9               {4, 5, 14}     7.67
14              {4, 5, 9}      6.00
15              {4, 5, 9}      6.00
\mathrm{Var}_{jack}[\hat\theta] = \frac{n-1}{n}\sum_{i=1}^{n}\big(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\big)^2
= \frac{5}{6}\big[(9.33 - 7.89)^2 + (9.33 - 7.89)^2 + (9.00 - 7.89)^2 + (7.67 - 7.89)^2 + (6.00 - 7.89)^2 + (6.00 - 7.89)^2\big]
= 0.833\,[2.09 + 2.09 + 1.23 + 0.05 + 3.57 + 3.57]
\approx 10.49.
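The jackknife computation above is easy to script. This is a minimal sketch (helper names illustrative); note that the trim step simply drops the single highest and lowest values, which is what removing the top and bottom 20% means for these five-point leave-one-out samples:

```python
def trimmed_mean(data):
    """Mean after dropping the single highest and lowest values."""
    s = sorted(data)[1:-1]
    return sum(s) / len(s)

def jackknife(data, stat):
    """Leave-one-out estimates, their average, and the jackknife variance."""
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    est = sum(loo) / n
    var = (n - 1) / n * sum((t - est) ** 2 for t in loo)
    return est, var

D = [0, 4, 5, 9, 14, 15]
est, var = jackknife(D, trimmed_mean)
assert abs(est - 7.889) < 0.01   # the jackknife estimate of the trimmed mean
assert abs(var - 10.49) < 0.01   # its jackknife variance
```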
J_{ee} = \frac{\sum_{i<j}(d_{ij} - \delta_{ij})^2}{\sum_{i<j}\delta_{ij}^2}.
Solution
(a) We can break J_{ee} into terms that depend upon y_k, and those that do not:

J_{ee} = \frac{1}{\sum_{i<j}\delta_{ij}^2}\Bigg[\underbrace{\sum_{\substack{i<j\\ i\neq k,\,j\neq k}}(d_{ij} - \delta_{ij})^2}_{\text{does not depend on } y_k} + \underbrace{\sum_{i\neq k}(d_{ik} - \delta_{ik})^2}_{\text{depends on } y_k}\Bigg].

Thus when we take the derivative, we need only consider the second term.
Thus we have

\nabla_{y_k} J_{ee} = \frac{1}{\sum_{i<j}\delta_{ij}^2}\sum_{i\neq k}\nabla_{y_k}(d_{ik} - \delta_{ik})^2 = \frac{2}{\sum_{i<j}\delta_{ij}^2}\sum_{i\neq k}(d_{ik} - \delta_{ik})\,\nabla_{y_k} d_{ik}.

We note that

d_{ik} = \sqrt{(y_k - y_i)^t(y_k - y_i)},

and thus

\nabla_{y_k} d_{ik} = \frac{1}{2}\,\frac{2(y_k - y_i)}{d_{ik}} = \frac{y_k - y_i}{d_{ik}}.

We put all these results together and find

\nabla_{y_k} J_{ee} = \frac{2}{\sum_{i<j}\delta_{ij}^2}\sum_{i\neq k}(d_{ki} - \delta_{ki})\,\frac{y_k - y_i}{d_{ki}}.
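The gradient just derived can be verified against central finite differences. This is a minimal sketch in plain Python; the configuration Y and target distances delta are made up for the check, not taken from the exam:

```python
from math import sqrt

def pair_dist(a, b):
    return sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def J_ee(Y, delta):
    n = len(Y)
    num = sum((pair_dist(Y[i], Y[j]) - delta[i][j]) ** 2
              for i in range(n) for j in range(i + 1, n))
    den = sum(delta[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    return num / den

def grad_J_ee(Y, delta, k):
    """Analytic gradient of J_ee with respect to the configuration point y_k."""
    n = len(Y)
    den = sum(delta[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    g = [0.0] * len(Y[k])
    for i in range(n):
        if i == k:
            continue
        d_ki = pair_dist(Y[k], Y[i])
        coef = 2.0 * (d_ki - delta[k][i]) / (den * d_ki)
        for a in range(len(g)):
            g[a] += coef * (Y[k][a] - Y[i][a])
    return g

Y = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [0.5, 1.5]]
delta = [[0.0] * 4 for _ in range(4)]
for (i, j), v in {(0, 1): 1.0, (0, 2): 2.0, (0, 3): 1.2,
                  (1, 2): 1.5, (1, 3): 0.9, (2, 3): 1.1}.items():
    delta[i][j] = delta[j][i] = v

g = grad_J_ee(Y, delta, 0)
eps = 1e-6
for a in range(2):
    Yp = [p[:] for p in Y]; Yp[0][a] += eps
    Ym = [p[:] for p in Y]; Ym[0][a] -= eps
    fd = (J_ee(Yp, delta) - J_ee(Ym, delta)) / (2 * eps)
    assert abs(fd - g[a]) < 1e-4   # analytic and numeric gradients agree
```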
EXAM 2
p_{ij} \equiv \Pr[x_i = 1|\omega_j], \qquad q_{ij} \equiv \Pr[x_i = 0|\omega_j], \qquad r_{ij} \equiv \Pr[x_i = -1|\omega_j].

p(x|\theta_i) = \begin{cases} 0 & x < 0 \\ \theta_i e^{-\theta_i x} & x \ge 0, \end{cases}
4. (10 points total) The task is to use Bayesian methods to estimate a one-dimensional
probability density. The fundamental density function is a normalized triangle
distribution T(\mu, 1), with center at \mu and half-width equal to 1, defined by

p(x|\mu) \sim T(\mu, 1) = \begin{cases} 1 - |x - \mu| & |x - \mu| \le 1 \\ 0 & \text{otherwise}, \end{cases}

as shown in the left figure. The prior information on the parameter \mu is that
it is equally likely to come from any of the three discrete values \mu = -1, 0 or 1.
Stated mathematically, the prior consists of three delta functions, i.e.,

p(\mu) = \frac{1}{3}\big[\delta(\mu - 1) + \delta(\mu) + \delta(\mu + 1)\big],

as shown in the figure at the right. (Recall that the delta function has negligible
width and unit integral.)

[Figure: the triangle density T(\mu, 1), of peak height 1 (left), and the three-delta
prior p(\mu) at \mu = -1, 0, +1 (right).]
(a) (2 pt) Plot the estimated density before any data are collected (which
we denote by D^0 = \{\}). That is, plot p(x|D^0). Here and below, be sure to
label and mark your axes and ensure normalization of your final estimated
density.
(b) (4 pts) The single point x = 0.25 was sampled, and thus D^1 = \{0.25\}.
Plot the estimated density p(x|D^1).
(c) (4 pts) Next the point x = 0.75 was sampled, and thus the data set is
D^2 = \{0.25, 0.75\}. Plot the estimated density p(x|D^2).
5. (5 points total) Construct a cluster dendrogram for the one-dimensional data
D = \{2, 3, 5, 10, 13\} using the distance measure d_{max}(D_i, D_j).
6. (10 points total) Consider the use of traditional boosting for building a classifier
for a two-category problem with n training points.
(a) (8 pts) Write pseudocode for traditional boosting, leading to three component classifiers.
(b) (2 pts) How are the resulting three component classifiers used to classify
a test pattern?
7. (5 points total) Consider a standard three-layer neural net as shown. Suppose
the network is to be trained using the novel criterion function

J = \frac{1}{4}\sum_{k=1}^{c}(t_k - z_k)^4.

[Figure: standard three-layer network with inputs x_1, \dots, x_d, hidden units
y_1, \dots, y_{n_H}, outputs z_1, \dots, z_c, targets t_1, \dots, t_c, and weights w_{ji}
(input-to-hidden) and w_{kj} (hidden-to-output).]
8. (10 points total) It is World Cup Soccer season and a researcher has developed
a Bayes belief net that expresses the dependencies among the age of a person
(A), his or her nationality (B), whether he or she likes soccer (C), and how
much he or she watches sports TV during the World Cup season (D). Use the
conditional probability tables to answer the following.
A (age): a_1 = 0-30 years, a_2 = 31-40 years, a_3 = 41-100 years
    P(a_1) = 0.3,  P(a_2) = 0.6,  P(a_3) = 0.1

B (nationality): b_1 = American, b_2 = non-American
    P(b_1) = 0.2,  P(b_2) = 0.8

C (soccer): c_1 = likes soccer, c_2 = doesn't like soccer

              P(c_1|a_i, b_j)   P(c_2|a_i, b_j)
    a_1, b_1       0.5               0.5
    a_1, b_2       0.7               0.3
    a_2, b_1       0.6               0.4
    a_2, b_2       0.8               0.2
    a_3, b_1       0.4               0.6
    a_3, b_2       0.1               0.9

D (watches sports TV): d_1 = a lot, d_2 = some, d_3 = none

           P(d_1|c_k)   P(d_2|c_k)   P(d_3|c_k)
    c_1       0.7          0.2          0.1
    c_2       0.5          0.3          0.2
9. (15 points total) This problem concerns the construction of a binary decision
tree for three categories from the following two-dimensional data:

\omega_1: (1, 1)^t, (2, 8)^t, (3, 5)^t
\omega_2: (4, 7)^t, (6, 2)^t, (7, 6)^t
\omega_3: (5, 10)^t, (7, 4)^t, (7, 9)^t
(a) (3 pts) What is the information impurity at the root node, i.e., before any
splitting?
(b) (4 pts) The query at the root node is: "Is x_1 > 3.5?" What is the
information impurity at the next level?
(c) (6 pts) Continue to grow your tree fully. Show the final tree and all queries.
(d) (2 pts) Use your tree to classify the point x = (7, 2)^t.
10. (10 points total) Short answer (1 pt each).
(a) Explain, using a diagram and a few sentences, the technique of learning
with hints and why it can improve a neural net pattern classifier.
(b) Use the Boltzmann factor to explain why at a sufficiently high temperature T, all configurations in a Boltzmann network are equally probable.
(c) Explain with a simple figure the crossover operation in genetic algorithms.
(d) What is a surrogate split, and when is one used?
(e) If the cost for any fundamental string operation is 1.0, state the edit distance between "bookkeeper" and "beekeepers".
(f) Suppose the Bayes error rate for a c = 3 category classification problem is
5%. What are the bounds on the error rate of a nearest-neighbor classifier
trained with an infinitely large training set?
(g) What do we mean by the language induced by grammar G?
(h) In hidden Markov models, what does the term a_{ij} refer to? What does b_{jk}
refer to?
Important formulas

P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{\sum_{k=1}^{c} p(x|\omega_k)P(\omega_k)}

p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\Big[-\frac{1}{2}(x - \mu)^t\Sigma^{-1}(x - \mu)\Big]

\Sigma = E[(x - \mu)(x - \mu)^t] = \int (x - \mu)(x - \mu)^t\,p(x)\,dx, \qquad \hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)(x_i - \hat\mu)^t

P(\omega_i|x_g) = \frac{\int g_i(x)\,p(x)\,dx_b}{\int p(x)\,dx_b}

l(\theta) = \ln p(D|\theta)

p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta

Q(\theta; \theta^i) = E_{D_b}\big[\ln p(D_g, D_b; \theta)\,\big|\,D_g; \theta^i\big]

d' = \frac{|\mu_2 - \mu_1|}{\sigma}

\lim_{n\to\infty} P_n(e|x) = 1 - \sum_{i=1}^{c} P^2(\omega_i|x)

P^* \le P \le P^*\Big(2 - \frac{c}{c-1}P^*\Big)

J_p(a) = \sum_{y\in\mathcal{Y}}(-a^t y), \qquad a(k+1) = a(k) + \eta(k)\sum_{y\in\mathcal{Y}_k} y

\Delta w_{kj} = \eta(t_k - z_k)f'(net_k)\,y_j, \qquad \Delta w_{ji} = \eta\Big[\sum_{k=1}^{c} w_{kj}\delta_k\Big]f'(net_j)\,x_i

f(net) = a\tanh[b\,net] = a\Big[\frac{e^{b\,net} - e^{-b\,net}}{e^{b\,net} + e^{-b\,net}}\Big]

P(\gamma) = \frac{e^{-E_\gamma/T}}{Z}, \qquad Z = \sum_{\gamma'} e^{-E_{\gamma'}/T}

\Delta w_{ij} = \frac{\eta}{T}\Big[\underbrace{E_Q[s_i s_j]_{\alpha^i\alpha^o\ \text{clamped}}}_{\text{learning}} - \underbrace{E[s_i s_j]_{\alpha^i\ \text{clamped}}}_{\text{unlearning}}\Big]

i(N) = -\sum_{j=1}^{c} P(\omega_j)\log_2 P(\omega_j)

E_D\big[(g(x; D) - F(x))^2\big] = \big(E_D[g(x; D) - F(x)]\big)^2 + E_D\big[(g(x; D) - E_D[g(x; D)])^2\big]

\hat\theta_{(i)} = \hat\theta(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n), \qquad \hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat\theta_{(i)}

S_i = \sum_{x\in D_i}(x - m_i)(x - m_i)^t, \qquad S_W = \sum_{i=1}^{c} S_i

S_B = \sum_{i=1}^{c} n_i(m_i - m)(m_i - m)^t, \qquad S_T = \sum_{x\in D}(x - m)(x - m)^t

J_e = \mathrm{tr}[S_W], \qquad J_d = |S_W|

d_{min}(D_i, D_j) = \min_{x\in D_i,\,x'\in D_j}\|x - x'\|, \qquad d_{max}(D_i, D_j) = \max_{x\in D_i,\,x'\in D_j}\|x - x'\|

d_{avg}(D_i, D_j) = \frac{1}{n_i n_j}\sum_{x\in D_i}\sum_{x'\in D_j}\|x - x'\|, \qquad d_{mean}(D_i, D_j) = \|m_i - m_j\|

J_{ee} = \frac{\sum_{i<j}(d_{ij} - \delta_{ij})^2}{\sum_{i<j}\delta_{ij}^2}, \qquad J_{ff} = \sum_{i<j}\Big(\frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\Big)^2, \qquad J_{ef} = \frac{1}{\sum_{i<j}\delta_{ij}}\sum_{i<j}\frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}}
EXAM 2 Solutions
1. (10 points total) Let the components of the vector x = (x_1, \dots, x_d)^t be ternary
valued (1, 0, or -1), with

p_{ij} \equiv \Pr[x_i = 1|\omega_j], \qquad q_{ij} \equiv \Pr[x_i = 0|\omega_j], \qquad r_{ij} \equiv \Pr[x_i = -1|\omega_j],

and with the components conditionally independent, so that

p(x|\omega_j) = p((x_1, \dots, x_d)^t|\omega_j) = \prod_{i=1}^{d} p(x_i|\omega_j).

As in Sect. 2.9.1 in the text, we use exponents to select the proper probability,
that is, exponents that have value 1.0 when x_i has the value corresponding to
the particular probability and value 0.0 for the other values of x_i. For instance,
for the p_{ij} term, we seek an exponent that has value 1.0 when x_i = +1 but is
0.0 when x_i = 0 and when x_i = -1. The simplest such exponent is \frac{1}{2}x_i + \frac{1}{2}x_i^2.
For the q_{ij} term, the simplest exponent is 1 - x_i^2, and so on. Thus we write the
class-conditional probability for a single component x_i as

p(x_i|\omega_j) = p_{ij}^{\frac{1}{2}x_i + \frac{1}{2}x_i^2}\; q_{ij}^{1 - x_i^2}\; r_{ij}^{-\frac{1}{2}x_i + \frac{1}{2}x_i^2}, \qquad i = 1, \dots, d,\quad j = 1, \dots, c,

and thus the discriminant function is

\ln p(x|\omega_j) + \ln P(\omega_j) = \sum_{i=1}^{d}\Big[\Big(\frac{1}{2}x_i + \frac{1}{2}x_i^2\Big)\ln p_{ij} + (1 - x_i^2)\ln q_{ij} + \Big(-\frac{1}{2}x_i + \frac{1}{2}x_i^2\Big)\ln r_{ij}\Big] + \ln P(\omega_j)
= \frac{1}{2}\sum_{i=1}^{d} x_i^2\,\ln\frac{p_{ij}r_{ij}}{q_{ij}^2} + \frac{1}{2}\sum_{i=1}^{d} x_i\,\ln\frac{p_{ij}}{r_{ij}} + \sum_{i=1}^{d}\ln q_{ij} + \ln P(\omega_j).
\int_0^\infty \theta e^{-\theta x}\,dx = -e^{-\theta x}\Big|_0^\infty = 0 - (-1) = 1.
(b) (8 pts) The following data were collected: D_1 = \{1, 5\} and D_2 = \{3, 9\} for
\omega_1 and \omega_2, respectively. Find the maximum-likelihood values \hat\theta_1 and \hat\theta_2.
Solution: We temporarily drop the subscript on \theta, denote the two training
points as x_1 and x_2, and write the likelihood p(D|\theta) for either category as

p(D|\theta) = p(x_1|\theta)\,p(x_2|\theta) = \theta e^{-\theta x_1}\,\theta e^{-\theta x_2} = \theta^2 e^{-\theta(x_1 + x_2)}.

Setting the derivative of the log-likelihood to zero,

\frac{d}{d\theta}\big[2\ln\theta - \theta(x_1 + x_2)\big] = \frac{2}{\theta} - (x_1 + x_2) = 0,

and thus \hat\theta = 2/(x_1 + x_2). For our data we have, then, \hat\theta_1 = 2/(1 + 5) = 1/3,
and \hat\theta_2 = 2/(3 + 9) = 1/6.
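As a quick numeric check (the helper name is illustrative): for an exponential density the maximum-likelihood estimate generalizes to \hat\theta = n/\sum_k x_k, which reduces to 2/(x_1 + x_2) here:

```python
def ml_theta(data):
    """Maximum-likelihood estimate for an exponential density: n / sum(x)."""
    return len(data) / sum(data)

assert abs(ml_theta([1, 5]) - 1 / 3) < 1e-12   # theta_hat_1
assert abs(ml_theta([3, 9]) - 1 / 6) < 1e-12   # theta_hat_2
```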
(c) (3 pts) Given your answer to part (b), determine the decision boundary
x^* for minimum classification error. Be sure to state which category applies to
values higher than x^*, and which to values lower than x^*.
Solution: The decision boundary is at the point x^* where the two posteriors are equal, that is, where

P(\omega_1)\hat\theta_1 e^{-\hat\theta_1 x} = P(\omega_2)\hat\theta_2 e^{-\hat\theta_2 x}.

With equal priors this gives \frac{1}{3}e^{-x/3} = \frac{1}{6}e^{-x/6}, i.e., x^* = 6\ln 2 \approx 4.16.

[Figure: the two scaled densities, crossing at x^*, with region R_1 (category \omega_1)
to the left of x^* and R_2 (category \omega_2) to the right.]
(d) (3 pts) What is the expected error of your classifier in part (c)?
Solution: The probability of error (in this case, the Bayes error rate) is

P(\text{error}) = \int_0^\infty \min\Big[P(\omega_1)\tfrac{1}{3}e^{-x/3},\ P(\omega_2)\tfrac{1}{6}e^{-x/6}\Big]\,dx
= 0.5\Big[\int_0^{x^*}\tfrac{1}{6}e^{-x/6}\,dx + \int_{x^*}^{\infty}\tfrac{1}{3}e^{-x/3}\,dx\Big]
= 0.5\Big[\big(1 - e^{-x^*/6}\big) + e^{-x^*/3}\Big] = 0.5\big[(1 - \tfrac{1}{2}) + \tfrac{1}{4}\big] = 0.375.
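The 0.375 figure can be confirmed numerically; a minimal check, deriving the crossover point x^* = 6 ln 2 from equal priors and the two fitted densities:

```python
from math import exp, log

# Boundary: 0.5*(1/3)e^{-x/3} = 0.5*(1/6)e^{-x/6}  =>  e^{x/6} = 2.
x_star = 6 * log(2)
# Error = 0.5 * [omega_2 mass below x* + omega_1 mass above x*].
error = 0.5 * ((1 - exp(-x_star / 6)) + exp(-x_star / 3))
assert abs(error - 0.375) < 1e-12
```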
3. (10 points total) Consider the application of the k-means clustering algorithm
to the one-dimensional data set D = \{0, 1, 5, 8, 14, 16\} for c = 3 clusters.
(a) (3 pt) Start with the three cluster means: m_1(0) = 2, m_2(0) = 6 and
m_3(0) = 9. What are the values of the means at the next iteration?
Solution: The top of the figure shows the initial state (i.e., at iteration
t = 0). The dashed lines indicate the midpoints between adjacent means,
and thus the cluster boundaries. The two points x = 0 and x = 1 are in
the first cluster, and thus the mean for cluster 1 at the next iteration
is m_1(1) = (0 + 1)/2 = 0.5, as shown. Likewise, initially cluster 2 contains
the single point x = 5, and thus the mean for cluster 2 at the next iteration
is m_2(1) = 5/1 = 5. In the same manner, the mean for cluster 3 at the
next iteration is m_3(1) = (8 + 14 + 16)/3 = 12.67.
[Figure: the means m_1, m_2, m_3 and the cluster boundaries on the line
0 \le x \le 20 at iterations t = 0, 1, 2, with x = 3 falling in cluster 1 and
x = 11 in cluster 3.]
(b) (5 pt) What are the final cluster means, after convergence of the algorithm?
Solution: After one more step of the algorithm, as above, we find m_1(2) =
0.5, m_2(2) = 6.5 and m_3(2) = 15.
(c) (2 pt) For your final clusterer, to which cluster does the point x = 3
belong? To which cluster does x = 11 belong?
Solution: As shown in the bottom figure, x = 3 is in cluster 1, and x = 11
is in cluster 3.
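The iterations in parts (a) and (b) can be replayed with a few lines of plain Python; a minimal sketch (function name illustrative) of 1-D k-means:

```python
def kmeans_1d(data, means):
    """Assign points to the nearest mean, recompute means, repeat to convergence."""
    while True:
        clusters = [[] for _ in means]
        for x in data:
            j = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[j].append(x)
        new = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new == means:
            return means
        means = new

D = [0, 1, 5, 8, 14, 16]
final = kmeans_1d(D, [2.0, 6.0, 9.0])
assert final == [0.5, 6.5, 15.0]   # the converged means of the solution
```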
4. (10 points total) The task is to use Bayesian methods to estimate a one-dimensional
probability density. The fundamental density function is a normalized triangle
distribution T(\mu, 1), with center at \mu and half-width equal to 1, defined by

p(x|\mu) \sim T(\mu, 1) = \begin{cases} 1 - |x - \mu| & |x - \mu| \le 1 \\ 0 & \text{otherwise}, \end{cases}

as shown in the left figure. The prior information on the parameter \mu is that
it is equally likely to come from any of the three discrete values \mu = -1, 0 or 1.
Stated mathematically, the prior consists of three delta functions, i.e.,

p(\mu) = \frac{1}{3}\big[\delta(\mu - 1) + \delta(\mu) + \delta(\mu + 1)\big],

as shown in the figure at the right. (Recall that the delta function has negligible
width and unit integral.)
[Figure: the triangle density T(\mu, 1), of peak height 1 (left), and the three-delta
prior p(\mu) at \mu = -1, 0, +1 (right).]
(a) (2 pt) Plot the estimated density before any data are collected (which
we denote by D^0 = \{\}). That is, plot p(x|D^0). Here and below, be sure to
label and mark your axes and ensure normalization of your final estimated
density.
Solution: In the absence of data, the estimated density is merely the sum
of three triangle densities (each of amplitude 1/3 to ensure normalization), as
shown in the top figure.
[Figure: three panels showing (a) p(x|D^0), (b) p(x|D^1), and (c) p(x|D^2)
over -2 \le x \le 2.]
(b) (4 pts) The single point x = 0.25 was sampled, and thus D^1 = \{0.25\}.
Plot the estimated density p(x|D^1).
Solution: Bayesian density estimation gives

p(x|D) = \int p(x|\mu)\,p(\mu|D)\,d\mu \propto \int p(x|\mu)\,p(D|\mu)\,p(\mu)\,d\mu,

where here we do not worry about constants of proportionality, as we shall
normalize densities at the end. Because the prior consists of three delta
functions, our final estimated density consists of the sum of three triangle
distributions, centered on x = -1, 0 and 1; the only challenge is to
[Figure: the dendrogram for D = \{2, 3, 5, 10, 13\} under d_{max}(D_i, D_j), with
the distance scale running to 15.]
6. (10 points total) Consider the use of traditional boosting for building a classifier
for a two-category problem with n training points.
(a) (8 pts) Write pseudocode for traditional boosting, leading to three component classifiers.
Solution:

Algorithm (Boosting)
    begin initialize D = \{x^1, \dots, x^n\}
        Train classifier C_1 on D_1, i.e., n/3 patterns chosen from D
        Select D_2, i.e., roughly n/3 patterns that are most informative given C_1
        Train classifier C_2 on D_2
        Train classifier C_3 on D_3, i.e., all remaining patterns
    end

(b) (2 pts) How are the resulting three component classifiers used to classify
a test pattern?
Solution: Majority vote.
7. (5 points total) Consider a standard three-layer neural net as shown. Suppose
the network is to be trained using the novel criterion function

J = \frac{1}{4}\sum_{k=1}^{c}(t_k - z_k)^4.

[Figure: standard three-layer network with inputs x_1, \dots, x_d, hidden units
y_1, \dots, y_{n_H}, outputs z_1, \dots, z_c, targets t_1, \dots, t_c, and weights w_{ji}
and w_{kj}.]
Solution: The derivation is nearly the same as the one in the text. The chain
rule for differentiation gives us

\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}},

and because

J = \frac{1}{4}\sum_{k=1}^{c}(t_k - z_k)^4,

we have

\frac{\partial J}{\partial net_k} = \frac{\partial J}{\partial z_k}\,\frac{\partial z_k}{\partial net_k} = -(t_k - z_k)^3 f'(net_k).
[The belief net and its conditional probability tables are as given in the problem
statement above.]
P(a_1, b_2, c_1, d_1) = P(a_1)\,P(b_2)\,P(c_1|a_1, b_2)\,P(d_1|c_1).

P(c_1|a_2, b_1, d_2) = \frac{P(a_2, b_1, c_1, d_2)}{P(a_2, b_1, d_2)} \propto P(a_2, b_1, c_1, d_2)
= P(a_2)\,P(b_1)\,P(c_1|a_2, b_1)\,P(d_2|c_1)
= 0.6 \times 0.2 \times 0.6 \times 0.2 = 0.0144.

Likewise we have

P(c_2|a_2, b_1, d_2) \propto P(a_2, b_1, c_2, d_2)
= P(a_2)\,P(b_1)\,P(c_2|a_2, b_1)\,P(d_2|c_2)
= 0.6 \times 0.2 \times 0.4 \times 0.3 = 0.0144.

Normalizing,

P(c_1|a_2, b_1, d_2) = \frac{0.0144}{0.0144 + 0.0144} = 0.5.
(c) (4 pts) Suppose we find someone over 40 years of age who never watches
sports TV. What is the probability that this person likes soccer?
Solution: Again, there are several ways to calculate this probability, but
this time we proceed as:

P(c_1|a_3, d_3) \propto \sum_b P(a_3, b, c_1, d_3) = P(a_3)\,P(d_3|c_1)\sum_b P(b)\,P(c_1|a_3, b) = 0.1 \times 0.1 \times (0.2 \times 0.4 + 0.8 \times 0.1) = 0.0016,

P(c_2|a_3, d_3) \propto \sum_b P(a_3, b, c_2, d_3) = 0.1 \times 0.2 \times (0.2 \times 0.6 + 0.8 \times 0.9) = 0.0168,

and thus

P(c_1|a_3, d_3) = \frac{0.0016}{0.0016 + 0.0168} = 0.087.
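The two posterior computations above can be checked by brute-force enumeration over the joint distribution; a minimal sketch (the dictionary layout is illustrative) using the CPT values from the tables:

```python
from itertools import product

P_a = {1: 0.3, 2: 0.6, 3: 0.1}
P_b = {1: 0.2, 2: 0.8}
P_c1 = {(1, 1): 0.5, (1, 2): 0.7, (2, 1): 0.6,
        (2, 2): 0.8, (3, 1): 0.4, (3, 2): 0.1}
P_d = {1: {1: 0.7, 2: 0.2, 3: 0.1},   # P(d | c1)
       2: {1: 0.5, 2: 0.3, 3: 0.2}}   # P(d | c2)

def joint(a, b, c, d):
    pc = P_c1[(a, b)] if c == 1 else 1 - P_c1[(a, b)]
    return P_a[a] * P_b[b] * pc * P_d[c][d]

def posterior_c1(evidence):
    """P(c1 | evidence), summing the joint over all unobserved variables."""
    num = den = 0.0
    for a, b, c, d in product((1, 2, 3), (1, 2), (1, 2), (1, 2, 3)):
        state = {"a": a, "b": b, "c": c, "d": d}
        if any(state[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c, d)
        den += p
        num += p if c == 1 else 0.0
    return num / den

assert abs(posterior_c1({"a": 2, "b": 1, "d": 2}) - 0.5) < 1e-9
assert abs(posterior_c1({"a": 3, "d": 3}) - 0.087) < 1e-3
```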
9. (15 points total) This problem concerns the construction of a binary decision
tree for three categories from the following two-dimensional data:

\omega_1: (1, 1)^t, (2, 8)^t, (3, 5)^t
\omega_2: (4, 7)^t, (6, 2)^t, (7, 6)^t
\omega_3: (5, 10)^t, (7, 4)^t, (7, 9)^t
(a) (3 pts) What is the information impurity at the root node, i.e., before any
splitting?
Solution: The information impurity at the root (before splitting) is

i(N) = -\sum_{j=1}^{c} P(\omega_j)\log_2 P(\omega_j) = -3 \times \frac{1}{3}\log_2\frac{1}{3} = \log_2 3 = 1.585 \text{ bits}.
(b) (4 pts) The query at the root node is: Is x1 > 3.5? What is the
information impurity at the next level?
Solution: The impurity at the right node is 0, as all its points are in a
single category, \omega_1. The impurity at the left node, N_L, is

i(N_L) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = \log_2 2 = 1.0 \text{ bit}.

The impurity at this level is the weighted sum of the impurities at the
nodes, i.e.,

P_L \times 1.0 + (1 - P_L) \times 0.0 = \frac{6}{9} \times 1.0 + \frac{3}{9} \times 0.0 = 0.66 \text{ bits}.
[Figure: the fully grown tree, drawn over the scatter of the points: "Is x_1 > 3.5?"
(no \to \omega_1); yes \to "Is x_2 > 8?" (yes \to \omega_3); no \to "Is x_1 > 6.5?"
(no \to \omega_2); yes \to "Is x_2 > 5?" (yes \to \omega_2, no \to \omega_3). For part (d),
the point (7, 2)^t descends to the leaf \omega_3.]
[Figure: a network for learning with hints — the standard input, hidden, and
category-output units are augmented during training with extra output units
h_1 and h_2 trained on the hint tasks.]
Solution: The probability of a configuration \gamma is

P(\gamma) = \frac{e^{-E_\gamma/T}}{Z},

where the numerator is the Boltzmann factor and Z is the partition function, basically a normalization. If T \gg E_\gamma, then the exponent is nearly
0 regardless of \gamma, and the Boltzmann factor is nearly 1.0 for every state.
Thus the probability of each configuration is roughly the same.
(c) Explain with a simple figure the crossover operation in genetic algorithms.
Solution: In genetic algorithms, we represent classifiers by a string of bits.
The crossover operation employs a randomly chosen place along the string,
then takes the left part from sequence A and splices it to the right side of
sequence B, and vice versa, much in analogy with sexual reproduction in
biology. This operation is fundamentally different from random variation
and occasionally leads to particularly good or fit classifiers.
A
B
1010100001010011101010010100010101111101011100
parents
1110001101100001010011110101000000100100101010
1010100001010001010011110101000000100100101010
offspring
1110001101100011101010010100010101111101011100
(e) Solution: The edit distance is 4:

    string          operation              cost so far
    bookkeeper      source string          0
    beokkeeper      substitute e for o     1
    beekkeeper      substitute e for o     2
    beekeeper       delete k               3
    beekeepers      insert s               4
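The table above traces one editing path; the standard dynamic program confirms no cheaper path exists. A minimal sketch (function name illustrative) of the Levenshtein recurrence:

```python
def edit_distance(s, t):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute (free on match)
        prev = cur
    return prev[-1]

assert edit_distance("bookkeeper", "beekeepers") == 4
```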
(f) Suppose the Bayes error rate for a c = 3 category classification problem is
5%. What are the bounds on the error rate of a nearest-neighbor classifier
trained with an infinitely large training set?
Solution: In the limit of an infinitely large training set, the nearest-neighbor classifier has a probability of error P bounded as

P^* \le P \le P^*\Big(2 - \frac{c}{c-1}P^*\Big),

where P^* is the Bayes error rate. For the case P^* = 0.05 and c = 3, these
limits are then

0.05 \le P \le 0.05\Big(2 - \frac{3}{2}\times 0.05\Big) = 0.09625.
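A one-line check of the bound arithmetic (the function name is illustrative):

```python
def nn_error_bounds(p_star, c):
    """Asymptotic nearest-neighbor bounds: P* <= P <= P*(2 - c/(c-1) * P*)."""
    return p_star, p_star * (2 - c / (c - 1) * p_star)

lo, hi = nn_error_bounds(0.05, 3)
assert lo == 0.05
assert abs(hi - 0.09625) < 1e-12
```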
(h) In hidden Markov models, what does the term a_{ij} refer to? What does b_{jk}
refer to?
Solution: The a_{ij} = P(\omega_j(t+1)|\omega_i(t)) denote the transition probabilities
from hidden state i to hidden state j in each time step. The b_{jk} =
P(v_k(t)|\omega_j(t)) denote the emission probabilities that hidden state j will
emit visible symbol v_k in each time step.
[Figure: the six-node belief net, with arcs A \to B; B, C \to D; D \to E; and
D \to F.]

    P(a_1) = 0.5,  P(a_2) = 0.3,  P(a_3) = 0.2

           P(b_1|a_i)   P(b_2|a_i)
    a_1       0.4          0.6
    a_2       0.3          0.7
    a_3       0.5          0.5

    P(c_1) = 0.2,  P(c_2) = 0.4,  P(c_3) = 0.4

              P(d_1|b_i, c_j)   P(d_2|b_i, c_j)
    b_1, c_1       0.3               0.7
    b_1, c_2       0.5               0.5
    b_1, c_3       0.9               0.1
    b_2, c_1       1.0               0.0
    b_2, c_2       0.4               0.6
    b_2, c_3       0.7               0.3

           P(e_1|d_i)   P(e_2|d_i)
    d_1       0.1          0.9
    d_2       0.8          0.2

           P(f_1|d_i)   P(f_2|d_i)   P(f_3|d_i)
    d_1       0.1          0.5          0.4
    d_2       0.8          0.0          0.2
EXAM 3
2. (15 points total) Consider a one-dimensional two-category classification problem with unequal priors, P(\omega_1) = 0.7 and P(\omega_2) = 0.3, where the densities have
the form of a half Gaussian centered at 0, i.e.,

p(x|\omega_i) = \begin{cases} 0 & x < 0 \\ \theta_i e^{-x^2/(2\sigma_i^2)} & x \ge 0, \end{cases}
(a) (3 pt) Start with the three cluster means: m1 (0) = 7
, m2 (0) = 74 ,
4
2
and m3 (0) = 5
. What are the values of the means at the next iteration?
(b) (5 pt) What are the final cluster means, after convergence of the algorithm?
(c) (2 pt) For your final clusterer, to which cluster does the point x =
belong? To which cluster does x = 3
4 belong?
3
3
4. (15 points total) The task is to use Bayesian methods to estimate a one-dimensional
probability density. The fundamental density function is a normalized half-circle distribution HC(\mu, 1), with center at \mu and half-width equal to 1, defined
by

p(x|\mu) \sim HC(\mu, 1) = \begin{cases} \frac{2}{\pi}\sqrt{1 - (x - \mu)^2} & |x - \mu| \le 1 \\ 0 & \text{otherwise}, \end{cases}

as shown in the left figure. The prior information on the parameter \mu is that
it is equally likely to come from either of the two discrete values \mu = -0.5 or
+0.5. Stated mathematically, the prior consists of two delta functions, i.e.,

p(\mu) = \frac{1}{2}\big[\delta(\mu - 0.5) + \delta(\mu + 0.5)\big],

as shown in the figure at the right. (Recall that the delta function has negligible
width and unit integral.)

[Figure: the half-circle density HC(\mu, 1), of peak height 2/\pi (left), and the
two-delta prior p(\mu) at \mu = -0.5, +0.5 (right).]
(a) (3 pt) Plot (sketch) the estimated density before any data are collected
(which we denote by D^0 = \{\}). That is, plot p(x|D^0). Here and below, be
sure to label and mark your axes and ensure normalization of your final
estimated density.
(b) (4 pts) The single point x = 0.25 was sampled, and thus D^1 = \{0.25\}.
Plot the density p(x|D^1) estimated by Bayesian methods.
(c) (5 pts) Next the point x = 0.25 was sampled, and thus the data set is
D^2 = \{0.25, 0.25\}. Plot the estimated density p(x|D^2).
(d) (3 pts) Suppose a very large number of points were selected and they were
all 0.25, i.e., D = \{0.25, 0.25, \dots, 0.25\}. Plot the estimated density p(x|D).
(You don't need to do explicit calculations for this part.)
5. (5 points total) Construct a cluster dendrogram for the one-dimensional data
D = \{5, 6, 9, 11, 18, 22\} using the distance measure d_{avg}(D_i, D_j).
6. (5 points total) Consider learning a grammar from sentences.
(a) (8 pts) Write pseudocode for simple grammatical inference. Define your
terms.
(b) (2 pts) Define D^+ and D^- and explain why your algorithm needs both.
7. Consider a standard three-layer neural net as shown, trained using the
criterion function

J = \frac{1}{6}\sum_{k=1}^{c}(t_k - z_k)^6.

[Figure: standard three-layer network with inputs x_1, \dots, x_d, hidden units
y_1, \dots, y_{n_H}, outputs z_1, \dots, z_c, targets t_1, \dots, t_c, and weights w_{ji}
and w_{kj}.]
8. (5 points total) Prove that the single best representative pattern x_0 for a data
set D = \{x_1, \dots, x_n\} in the sum-squared-error criterion

J_0(x_0) = \sum_{k=1}^{n}\|x_0 - x_k\|^2

is the sample mean,

x_0 = \frac{1}{n}\sum_{k=1}^{n} x_k.
9. (15 points total) This problem concerns the construction of a binary decision
tree for two categories from the following two-dimensional data, using queries of
the form "Is x_i > \theta?" for i = 1, 2 and the information impurity.

\omega_1: (1, 5)^t, (2, 9)^t, (4, 10)^t, (5, 7)^t, (8, 6)^t
\omega_2: (3, 8)^t, (6, 4)^t, (7, 2)^t, (9, 3)^t

(a) (2 pts) What is the information impurity at the root node, i.e., before any
splitting?
(b) (3 pts) What should be the query at the root node?
(c) (3 pts) How much has the impurity been reduced by the query at the root?
(d) (3 pts) Continue constructing your tree fully. (Whenever two candidate
queries lead to the same reduction in impurity, prefer the query that uses
the x_1 feature.) Use your tree to classify x = (6, 6)^t and x = (3, 4)^t.
(e) (2 pts) Suppose your tree is to be able to classify deficient patterns. What
should be the first (and only) surrogate split at the root node?
EXAM 3 Solutions
1. (15 points total) A Bayes belief net consisting of six nodes and its associated
conditional probability tables are shown below.

[Figure: the six-node belief net, with arcs A \to B; B, C \to D; D \to E; and
D \to F.]

    P(a_1) = 0.5,  P(a_2) = 0.3,  P(a_3) = 0.2

           P(b_1|a_i)   P(b_2|a_i)
    a_1       0.4          0.6
    a_2       0.3          0.7
    a_3       0.5          0.5

    P(c_1) = 0.2,  P(c_2) = 0.4,  P(c_3) = 0.4

              P(d_1|b_i, c_j)   P(d_2|b_i, c_j)
    b_1, c_1       0.3               0.7
    b_1, c_2       0.5               0.5
    b_1, c_3       0.9               0.1
    b_2, c_1       1.0               0.0
    b_2, c_2       0.4               0.6
    b_2, c_3       0.7               0.3

           P(e_1|d_i)   P(e_2|d_i)
    d_1       0.1          0.9
    d_2       0.8          0.2

           P(f_1|d_i)   P(f_2|d_i)   P(f_3|d_i)
    d_1       0.1          0.5          0.4
    d_2       0.8          0.0          0.2
P(a_3, b_2, c_3, d_1, e_2, f_1) = P(a_3)\,P(b_2|a_3)\,P(c_3)\,P(d_1|b_2, c_3)\,P(e_2|d_1)\,P(f_1|d_1)
= 0.2 \times 0.5 \times 0.4 \times 0.7 \times 0.9 \times 0.1 = 0.0025.

P(a_2, b_2, c_2, d_2, e_1, f_2) = P(a_2)\,P(b_2|a_2)\,P(c_2)\,P(d_2|b_2, c_2)\,P(e_1|d_2)\,P(f_2|d_2)
= 0.3 \times 0.7 \times 0.4 \times 0.6 \times 0.8 \times 0.0 = 0.
(c) (5 pt) Suppose we know the net is in the following (partial) state of evidence
e: a_3, b_1, c_2. What is the probability P(f_1|e)? What is the probability
P(e_2|e)?
Solution
P(f_1|e)
P(e_2|e)
(d) (6 pt) Suppose we know the net is in the following (partial) state of evidence
e: f_1, e_2, a_2. What is the probability P(d_1|e)? What is the probability
P(c_2|e)?
Solution

P(d_1|e) = \frac{\sum_{b,c} P(b, c, d_1, a_2, e_2, f_1)}{\sum_{b,c,d} P(b, c, d, a_2, e_2, f_1)}
= \frac{P(e_2|d_1)P(f_1|d_1)\sum_{b,c} P(b|a_2)P(c)P(d_1|b, c)}{\sum_d P(e_2|d)P(f_1|d)\sum_{b,c} P(b|a_2)P(c)P(d|b, c)}
= \frac{0.9 \times 0.1\,[0.3(0.2 \times 0.3 + 0.4 \times 0.5 + 0.4 \times 0.9) + 0.7(0.2 \times 1.0 + 0.4 \times 0.4 + 0.4 \times 0.7)]}{0.05706 + 0.2 \times 0.8\,[0.3(0.2 \times 0.7 + 0.4 \times 0.5 + 0.4 \times 0.1) + 0.7(0.2 \times 0.0 + 0.4 \times 0.6 + 0.4 \times 0.3)]}
= \frac{0.05706}{0.05706 + 0.05856} \approx 0.4935,

where the common factor P(a_2) cancels. Similarly,

P(c_2|e) = \frac{\sum_{b,d} P(b, d, c_2, a_2, e_2, f_1)}{\sum_{b,c,d} P(b, c, d, a_2, e_2, f_1)}
= \frac{P(c_2)\sum_b P(b|a_2)\sum_d P(d|b, c_2)P(e_2|d)P(f_1|d)}{\sum_c P(c)\sum_b P(b|a_2)\sum_d P(d|b, c)P(e_2|d)P(f_1|d)}.

The numerator is

0.4\,[0.3(0.5 \times 0.9 \times 0.1 + 0.5 \times 0.2 \times 0.8) + 0.7(0.4 \times 0.9 \times 0.1 + 0.6 \times 0.2 \times 0.8)] = 0.05196,

while the c_1 and c_3 terms in the denominator are

0.2\,[0.3(0.3 \times 0.9 \times 0.1 + 0.7 \times 0.2 \times 0.8) + 0.7(1.0 \times 0.9 \times 0.1 + 0.0 \times 0.2 \times 0.8)] = 0.02094

and

0.4\,[0.3(0.9 \times 0.9 \times 0.1 + 0.1 \times 0.2 \times 0.8) + 0.7(0.7 \times 0.9 \times 0.1 + 0.3 \times 0.2 \times 0.8)] = 0.04272,

and thus

P(c_2|e) = \frac{0.05196}{0.05196 + 0.02094 + 0.04272} \approx 0.4494.
2. (15 points total) Consider a one-dimensional two-category classification problem with unequal priors, P(\omega_1) = 0.7 and P(\omega_2) = 0.3, where the densities have
the form of a half Gaussian centered at 0, i.e.,

p(x|\omega_i) = \begin{cases} 0 & x < 0 \\ \theta_i e^{-x^2/(2\sigma_i^2)} & x \ge 0, \end{cases}

where the \sigma_i, for i = 1, 2, are positive but unknown parameters.
(a) (1 pt) Find the normalization constant \theta_i as a function of \sigma_i.
Solution
Because the normalization on a full one-dimensional Gaussian (as given
in the equations on the exam) is \frac{1}{\sqrt{2\pi}\sigma}, the normalization on a half
Gaussian must be twice as large, i.e.,

\theta_i = \sqrt{\frac{2}{\pi}}\,\frac{1}{\sigma_i}.
(b) (8 pts) The following data were collected: D_1 = \{1, 4\} and D_2 = \{2, 8\} for
\omega_1 and \omega_2, respectively. Find the maximum-likelihood values \hat\sigma_1 and \hat\sigma_2.
Solution
We drop the subscripts, denote the two training points x_1 and x_2, and
compute the likelihood:

p(D|\sigma) = \theta^2 e^{-(x_1^2 + x_2^2)/(2\sigma^2)} = \frac{2}{\pi\sigma^2}\,e^{-(x_1^2 + x_2^2)/(2\sigma^2)}.

Clearly, because of the exponential we will want to work with the log-likelihood,

l(\sigma) = \ln(2/\pi) - 2\ln\sigma - (x_1^2 + x_2^2)/(2\sigma^2).

To find the maximum-likelihood solution, we first take the derivative

\frac{dl}{d\sigma} = -\frac{2}{\sigma} + \frac{x_1^2 + x_2^2}{\sigma^3},

and set it to zero. Clearly we must ignore the solution \sigma \to \infty. The solution
we seek is

\hat\sigma = \sqrt{\frac{x_1^2 + x_2^2}{2}},

which for the training data given yields

\hat\sigma_1 = \sqrt{\frac{1^2 + 4^2}{2}} = \sqrt{\frac{17}{2}} \approx 2.92, \qquad \hat\sigma_2 = \sqrt{\frac{2^2 + 8^2}{2}} = \sqrt{34} \approx 5.83.
(c) (3 pts) Given your answer to part (b), determine the decision boundary
for minimum classification error. Be sure to state which category label
applies to each range in x values.
Solution
We set the posteriors equal, yielding

P(\omega_1)\,p(x^*|\omega_1) = P(\omega_2)\,p(x^*|\omega_2),

which gives the following equalities:

0.7\sqrt{\frac{2}{\pi}}\frac{1}{\sigma_1}e^{-x^2/(2\sigma_1^2)} = 0.3\sqrt{\frac{2}{\pi}}\frac{1}{\sigma_2}e^{-x^2/(2\sigma_2^2)}

\frac{x^2}{2}\Big(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\Big) = \ln\Big(\frac{0.7}{0.3}\,\frac{\sigma_2}{\sigma_1}\Big)

or

x^* = \sqrt{\frac{2\ln\big(\frac{0.7}{0.3}\frac{\sigma_2}{\sigma_1}\big)}{\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}}} \approx 5.91,

with \omega_1 for x < x^* and \omega_2 for x > x^*.

[Figure: the two posteriors, crossing at x^* \approx 5.91.]

(d) The probability of error is

P(\text{error}) = \int_0^{x^*} P(\omega_2)\sqrt{\frac{2}{\pi}}\frac{1}{\sigma_2}e^{-x^2/(2\sigma_2^2)}\,dx + \int_{x^*}^{\infty} P(\omega_1)\sqrt{\frac{2}{\pi}}\frac{1}{\sigma_1}e^{-x^2/(2\sigma_1^2)}\,dx
= 0.3\,\mathrm{erf}\big[x^*/(\sqrt{2}\sigma_2)\big] + 0.7\big(1 - \mathrm{erf}\big[x^*/(\sqrt{2}\sigma_1)\big]\big) \approx 0.237.
[Figure: the two-dimensional data points and the cluster means m_1, m_2, m_3 at
successive iterations.]
(a) (3 pt) Start with the three cluster means: m1 (0) = 7
, m2 (0) = 74 ,
4
2
. What are the values of the means at the next iteration?
and m3 (0) = 5
Solution
The points nearest each mean are as follows:
3 2
m1 : 5
3 , 1 , 6
6
m2 :
1
1 4
m3 :
7 , 3 .
We compute the mean for each of these sets to find the new means, i.e.,
3.33
6
1.5
m1 (1) =
,
m2 (1) =
,
m3 (1) =
.
3
1
5
(b) (5 pt) What are the final cluster means, after convergence of the algorithm?
Solution
On the next iteration, m1 does not change, but the others do:
3.33
5
1
m1 (2) =
m2 (2) =
m3 (2) =
,
3
2
7
which is the final state.
(c) (2 pt) For your final clusterer, to which cluster does the point x =
belong? To which cluster does x = 3
4 belong?
Solution
The point 33 is in cluster 2; the point 3
4 is in cluster 3.
3
3
4. (15 points total) The task is to use Bayesian methods to estimate a one-dimensional
probability density. The fundamental density function is a normalized half-circle distribution HC(\mu, 1), with center at \mu and half-width equal to 1, defined
by

p(x|\mu) \sim HC(\mu, 1) = \begin{cases} \frac{2}{\pi}\sqrt{1 - (x - \mu)^2} & |x - \mu| \le 1 \\ 0 & \text{otherwise}, \end{cases}

as shown in the left figure. The prior information on the parameter \mu is that
it is equally likely to come from either of the two discrete values \mu = -0.5 or
+0.5. Stated mathematically, the prior consists of two delta functions, i.e.,

p(\mu) = \frac{1}{2}\big[\delta(\mu - 0.5) + \delta(\mu + 0.5)\big],

as shown in the figure at the right. (Recall that the delta function has negligible
width and unit integral.)

[Figure: the half-circle density HC(\mu, 1), of peak height 2/\pi (left), and the
two-delta prior p(\mu) at \mu = -0.5, +0.5 (right).]
(a) (3 pt) Plot (sketch) the estimated density before any data are collected
(which we denote by D^0 = \{\}). That is, plot p(x|D^0). Here and below, be
sure to label and mark your axes and ensure normalization of your final
estimated density.
Solution In the absence of training data, our distribution is based merely
on the prior values of the parameter. Written out in full, we have

p(x|D^0) = \int p(x|\mu)\,p(\mu)\,d\mu = \int HC(\mu, 1)\,\tfrac{1}{2}\big[\delta(\mu - 0.5) + \delta(\mu + 0.5)\big]\,d\mu = \tfrac{1}{2}HC(0.5, 1) + \tfrac{1}{2}HC(-0.5, 1),

which is just the sum of the two half-circle distributions, as shown in the figure.

[Figure: p(x|D^0), the equal-weight sum of HC(-0.5, 1) and HC(0.5, 1), over
-2 \le x \le 2.]
p(x|D^1) = \frac{\theta}{1 + \theta}HC(0.5, 1) + \frac{1}{1 + \theta}HC(-0.5, 1),

where \theta \equiv p(x = 0.25|\mu = 0.5)/p(x = 0.25|\mu = -0.5) > 1 is the likelihood
ratio of the sampled point under the two candidate centers.

[Figure: p(x|D^1), weighted toward HC(0.5, 1).]
(c) (5 pts) Next the point x = 0.25 was sampled, and thus the data set is
D^2 = \{0.25, 0.25\}. Plot the estimated density p(x|D^2).
Solution
We use recursive Bayesian learning, in particular p(\mu|D^2) \propto p(x_2|\mu)\,p(\mu|D^1),
where again we defer the normalization. As before, we now have \theta^2/(1 + \theta^2)
for the HC(0.5, 1) component and 1/(1 + \theta^2) for the HC(-0.5, 1) component, i.e.,

p(x|D^2) = \frac{\theta^2}{1 + \theta^2}HC(0.5, 1) + \frac{1}{1 + \theta^2}HC(-0.5, 1).

[Figure: p(x|D^2), weighted yet more strongly toward HC(0.5, 1).]
(d) (3 pts) Suppose a very large number of points were selected and they were all 0.25,
i.e., D = \{0.25, 0.25, \dots, 0.25\}. Plot the estimated density p(x|D). (You don't need
to do explicit calculations for this part.)
Solution
Clearly, each subsequent point x = 0.25 introduces another factor of \theta into the ratio
of the two components. In the limit of large n, we have

\lim_{n\to\infty}\frac{\theta^n}{1 + \theta^n} = 1, \qquad \lim_{n\to\infty}\frac{1}{1 + \theta^n} = 0,

and thus the density consists solely of the component HC(0.5, 1). Incidentally, this
would be the appropriate maximum-likelihood solution too.

[Figure: p(x|D) = HC(0.5, 1), of peak height 2/\pi \approx 0.64.]
5. (5 points total) Construct a cluster dendrogram for the one-dimensional data D = \{5, 6, 9, 11, 18, 22\}
using the distance measure d_{avg}(D_i, D_j).
Solution
See figure. Throughout, we use

d_{avg}(D_i, D_j) = \frac{1}{n_i n_j}\sum_{x\in D_i}\sum_{x'\in D_j}\|x - x'\|.

[Figure: the dendrogram; merges occur at distances 1, 2, 4, 4.5, and 12.25, with
the distance scale running from 0 to 20.]
7. Consider a standard three-layer neural net as shown, trained using the
criterion function

J = \frac{1}{6}\sum_{k=1}^{c}(t_k - z_k)^6.

[Figure: standard three-layer network with inputs x_1, \dots, x_d, hidden units
y_1, \dots, y_{n_H}, outputs z_1, \dots, z_c, targets t_1, \dots, t_c, and weights w_{ji}
and w_{kj}.]
Solution
We have

J = \frac{1}{6}\sum_{k=1}^{c}(t_k - z_k)^6,

and thus by the chain rule

\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial z_k}\,\frac{\partial z_k}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}} = -(t_k - z_k)^5 f'(net_k)\,y_j.
8. (5 points total) Prove that the single best representative pattern x_0 for a data set D =
\{x_1, \dots, x_n\} in the sum-squared-error criterion

J_0(x_0) = \sum_{k=1}^{n}\|x_0 - x_k\|^2

is the sample mean x_0 = \frac{1}{n}\sum_{k=1}^{n}x_k.
Solution
We take the gradient of the criterion with respect to x_0 and set it to zero:

0 = \nabla_{x_0}\sum_{k=1}^{n}\|x_0 - x_k\|^2 = 2\sum_{k=1}^{n}(x_0 - x_k) = 2\Big[n\,x_0 - \sum_{k=1}^{n}x_k\Big],

and thus

x_0 = \frac{1}{n}\sum_{k=1}^{n}x_k.
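The result can be sanity-checked numerically on a small one-dimensional data set (the values are arbitrary): since the criterion is convex, J_0 at the sample mean beats every perturbed candidate.

```python
def J0(x0, data):
    """Sum-squared-error of a single representative point x0."""
    return sum((x0 - x) ** 2 for x in data)

data = [0.0, 4.0, 5.0, 9.0, 14.0, 15.0]
mean = sum(data) / len(data)
# J0(mean + d) = J0(mean) + n*d^2, so the mean is strictly best for d != 0.
assert all(J0(mean, data) < J0(mean + d, data) for d in (-1.0, -0.1, 0.1, 1.0))
```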
9. (15 points total) This problem concerns the construction of a binary decision tree for two
categories from the following two-dimensional data, using queries of the
form "Is x_i > \theta?" for i = 1, 2 and the information impurity.

\omega_1: (1, 5)^t, (2, 9)^t, (4, 10)^t, (5, 7)^t, (8, 6)^t
\omega_2: (3, 8)^t, (6, 4)^t, (7, 2)^t, (9, 3)^t
(a) (2 pts) What is the information impurity at the root node, i.e., before
any splitting?
Solution

i(N) = -\sum_{j=1}^{2} P(\omega_j)\log_2 P(\omega_j) = -\frac{5}{9}\log_2\frac{5}{9} - \frac{4}{9}\log_2\frac{4}{9} \approx 0.9911 \text{ bits}.
(c) (3 pts) How much has the impurity been reduced by the query at the root?
Solution
Trivially, i(N_R) = 0. For the left node,

i(N_L) = -(1/6) log2(1/6) - (5/6) log2(5/6) ≈ 0.65.

The reduction in entropy impurity is

Δi(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R) ≈ 0.99 - (6/9)(0.65) - 0 ≈ 0.56.
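The impurity figures in parts (a) and (c) can be reproduced directly (a sketch; the class counts follow the solution above and the helper name is illustrative):

```python
import math

def impurity(counts):
    """Entropy (information) impurity for a node with the given class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

root = impurity([5, 4])                  # all nine points: 5 in w1, 4 in w2
left = impurity([5, 1])                  # node N_L (x2 > 4.5)
right = impurity([0, 3])                 # node N_R, pure -> impurity 0
drop = root - 6/9 * left - 3/9 * right   # impurity reduction at the root

print(round(root, 4), round(left, 2), round(drop, 2))  # → 0.9911 0.65 0.56
```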
(d) (3 pts) Continue constructing your tree fully. (Whenever two candidate queries lead to the same reduction in impurity, prefer the query that uses the x1 feature.) Use your tree to classify x = (6, 6)^t and x = (3, 4)^t.
Solution
We need to decide whether the second splitting criterion is x1 = 2.5 or x1 = 3.5. The entropy reduction for x1 = 2.5 is

Δi(N) = [-(1/6) log2(1/6) - (5/6) log2(5/6)] - (4/6)[-(1/4) log2(1/4) - (3/4) log2(3/4)] ≈ 0.11.

The entropy reduction for x1 = 3.5 is

Δi(N) = [-(1/6) log2(1/6) - (5/6) log2(5/6)] - (3/6)[-(1/3) log2(1/3) - (2/3) log2(2/3)] ≈ 0.19.

So the second split should be x1 = 3.5. The last split is then x1 = 2.5. Thus the tree is

[Figure: the tree. Root: "Is x2 > 4.5?" (no: ω2); then "Is x1 > 3.5?" (yes: ω1); then "Is x1 > 2.5?" (yes: ω2; no: ω1).]
(e) (2 pts) Suppose your tree is to be able to classify deficient patterns. What should be the first (and only) surrogate split at the root node?
Solution
x = (6, 6)^t → ω1; x = (3, 4)^t → ω2. The surrogate split tries to achieve the same partitioning of the samples as the x2 > 4.5 primary split. By inspection, it is x1 < 5.5.
P(γ)/P(γ') = lim_{T -> inf} e^{-(E_γ - E_γ')/T} = 1.
(c) Use an equation and a few sentences to explain the minimum description length (MDL) principle.
Solution
The MDL principle states that one should seek a model h that minimizes the sum of the model's algorithmic complexity and the description of the training data with respect to that model. That is,

h* = arg min_h K(h, D) = arg min_h [K(h) + K(D using h)].
(d) Use an equation and a few sentences to explain what is the discriminability in signal detection theory.
Solution
For two normally distributed classes in one dimension (with the same variance), discriminability is a measure of the ease of discriminating the two classes. It is defined as d' = |μ2 - μ1|/σ.
(e) If the cost for any fundamental string operation is 1.0, state the edit distance between "streets" and "scrams".
Solution
The cost is 4: streets → screets → scraets → scramts → scrams.
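The same answer follows from the standard dynamic-programming edit-distance recurrence (a sketch with unit costs, matching the problem's cost of 1.0 per operation):

```python
def edit_distance(a, b):
    """Levenshtein distance: unit cost for each insert, delete, substitute."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + sub) # substitute or match
    return d[m][n]

print(edit_distance("streets", "scrams"))  # → 4
```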
(f) Suppose the Bayes error rate for a c = 5 category classification problem is 1%. What are the upper and lower bounds on the error rate of a nearest-neighbor classifier trained with an infinitely large training set?
Solution
Using the equation given with the exam, P* ≤ P ≤ P*(2 - (c/(c - 1)) P*), we get

0.01 ≤ P ≤ 0.01 (2 - (5/4)(0.01)) = 0.019875 ≈ 0.02.
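These bounds are easy to evaluate in code (a sketch; the function name is illustrative):

```python
def nn_error_bounds(bayes_error, c):
    """Bounds on the asymptotic nearest-neighbor error rate:
    P* <= P <= P*(2 - c/(c-1) * P*)."""
    return bayes_error, bayes_error * (2 - c / (c - 1) * bayes_error)

low, high = nn_error_bounds(0.01, 5)
print(low, round(high, 6))  # → 0.01 0.019875
```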
(g) Use a formula and a sentence to explain learning with momentum in backpropagation.
Solution
Momentum in neural net learning is a heuristic that weight changes should tend to keep moving in the same direction. Let Δw(m) = w(m) - w(m - 1)
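The formula is truncated by the page break; a common form of the rule blends the fresh backpropagation step with the previous weight change Δw(m). A sketch on a toy one-dimensional objective (illustrative code, not the text's algorithm verbatim):

```python
def minimize_with_momentum(grad, w, lr=0.1, alpha=0.9, steps=300):
    """Gradient descent with momentum: each step is a blend of the fresh
    gradient step and the previous weight change delta_w(m) = w(m) - w(m-1)."""
    prev_delta = 0.0
    for _ in range(steps):
        delta = (1 - alpha) * (-lr * grad(w)) + alpha * prev_delta
        w += delta
        prev_delta = delta
    return w

# Toy objective J(w) = (w - 3)^2 with gradient 2(w - 3); momentum keeps the
# updates moving in a consistent direction toward the minimum at w = 3.
w_final = minimize_with_momentum(lambda w: 2 * (w - 3), w=0.0)
print(round(w_final, 3))  # → 3.0
```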
Worked examples
Below are worked examples, organized by book section, that may be of use to
students.
[to be written]
Below are errata and minor alterations that improve the style or clarify the book.
To see which printing of the book you have, look on the third page of the text itself,
which faces the dedication page. At the bottom you will see:
Printed in the United States of America
10 9 8 7 6 5 4 3 2
The last number at the right gives the number of the printing; thus, as illustrated
here, 2 means that this is the second printing.
Front matter
page x line +6: Change 4.8 Reduced Coulomb to *4.8 Reduced Coulomb
page xv line 13: Change A.4.7 The Law of Total Probability and Bayes Rule to
A.4.7 The Law of Total Probability and Bayes Rule
page xviii Take the last sentence under Examples, In addition, in response to
popular demand, a Solutions Manual has been prepared to help instructors
who adopt this book for courses. and move it to be the final sentence under
Problems, lower on the same page.
page xviii lines 10 11: Change and they are generally to and they are typically
page xix line 15: Change Ricoh Silicon Valley to Ricoh Innovations
Chapter 1
page 1 line +4: Change data and taking to data and making
page 4 the only equation on the page: Change x = xx12 to x = xx12
page 11 line +21: Change us for practical, rather than to us for practical rather
than
page 14 line 4 in the caption to Figure 1.8: Change of the data impact both to of the data affect both
page 19 line +15: Change is achieved in humans to is performed by humans
Chapter 2
page 21 line 6 in the footnote: Change should be written as pX (x) to should
be written as px (x)
page 21 line 4 in the footnote: Change clear that pX () and pY () to clear that
px () and py ()
page 22 second line after Eq. 3: Change probability (or posterior) probability to
probability (or posterior)
page 23 second line after Eq. 7: Change By using Eq. 1, we can to By using
Eq. 1 we can
page 26 first line after Eq. 17: Change and ω2 otherwise. to and otherwise decide ω2.
page 28 Equation 23: Change (11 22 )(21 11 ) to (11 22 )+(21 11 )
page 28 second line after Eq. 24: Change decision boundary gives to decision
boundary then gives
page 32 second line after Eq. 33: Change expected values by these to expected
values by these
page 36 first equation after Eq. 49: Change Let us examine the discriminant to
Let us examine this discriminant
page 41 Figure 2.13, caption, line +2: Change unequal variance. to unequal variance, as shown in this case with P(ω1) = P(ω2).
page 47 Equation 73: Change for0 to for 0 (i.e., add space)
page 47 line  10: Change substituting the results in Eq. 73 to substituting this
into Eq. 73
page 47 line 2: Change This result is the so-called to This gives the so-called
page 48 Example 2, line +3: Change 4.11, to 4.06,
page 48 Example 2, line +4: Change 0.016382. to 0.0087.
page 75 Problem 39, part (d): Replace the two equation lines with Case A: P(x > x* | x ∈ ω1) = 0.8, P(x > x* | x ∈ ω2) = 0.3 or
Case B: P(x > x* | x ∈ ω1), P(x > x* | x ∈ ω2) = 0.7.
page 76 Problem 41, first line after the equation: Change (μ2 - μ1)/σi to (μ2 - μ1)/σ
page 76 Problem 41, part (b): Change dT = 1.0. to dT = 1.0 and 2.0.
page 76 Problem 41, part (c): Change P(x > x* | x ∈ ω1) = .2. to P(x > x* | x ∈ ω1) = .7.
page 76 Problem 76, part (e): Change measure P(x > x* | x ∈ ω2) = .9 and P(x > x* | x ∈ ω1) = .3. to measure P(x > x* | x ∈ ω2) = .3 and P(x > x* | x ∈ ω1) = .9.
page 81 Computer exercise 6, part (b), line +1: Change Consider to Consider
the normal distributions
page 81 Computer exercise 6, part (b), equation: Change andp(x|ω2) to and p(x|ω2) (i.e., add space)
page 81 Computer exercise 6, part (b), equation: Move with P(ω1) = P(ω2) = 1/2. out of the centered equation, and into the following line of text.
page 83 rst column, entry for [21], lines +3 4: Change Silverman edition, 1963.
to Silverman, 1963.
Chapter 3
page 88 Equation 9: Change to
page 91 Ninth line after Eq. 22: Change shall consider), the samples to shall
consider) the samples
page 99 Caption to first figure, change starts our as a flat to starts out as a flat
page 100 line 5: Change are equivalent to to are more similar to
page 100 line 5 4: Change If there are much data to If there is much data
page 102 line 5: Change (Computer exercise 22) to (Problem 22)
page 103 line 2: Change choice of an prior to choice of a prior
page 104 line +6: Change if to only if
page 104 line +16: Change only if to if
page 104 Equation 62: Make the usage and style of the summation sign (Σ) uniform in this equation. Specifically, in two places put the arguments beneath the summation sign.
page 105 first line after Eq 63: Change to this kind of scaling. to to such scaling.
page 111 lines +9 10: Change constants c0 and x0 such that f (x) c0 h(x)
for all to constants c and x0 such that f (x) ch(x) for all
page 111 line +14: Change proper choice of c0 and x0 . to proper choice of c and
x0 .
page 116 Equation 86: Change et e to (et e 1)
page 125 Algorithm 1, line 1: Change i = 0 to i ← 0
page 126 first line of the equation at the middle of the page: Change
Q(θ; θ0) = E_{xb}[ln p(xg, xb; θ | θ0; Dg)]
to
Q(θ; θ0) = E_{xb}[ln p(xg, xb; θ) | θ0; Dg]
page 128 line +6, (second line after the Example): Change the EM algorithm, and
they to the EM algorithm as they
page 129 second line above Section 3.10.3: Change while the ωi are unobservable to while the ωj are unobservable
page 132 line +3: Change bkj , and thus to bjk , and thus
page 132 Equation 136: Replace by:

αj(t) = 0                                      if t = 0 and j ≠ initial state
      = 1                                      if t = 0 and j = initial state
      = [ Sum_i αi(t - 1) aij ] bjk v(t)       otherwise.
page 132 third line after Eq. 136: Change Consequently, αi(t) represents to Consequently, αj(t) represents
page 132 fourth line after Eq. 136: Change hidden state ωi to hidden state ωj
page 132 Algorithm 2, line 1: Delete α(1),
page 132 Algorithm 2, line 1: Change t = 0 to t ← 0
page 132 Algorithm 2, line 1: Change α(0) = 1 to αj(0)
page 132 Algorithm 2, line 3: Replace entire line by αj(t) ← bjk v(t) Sum_{i=1}^{c} αi(t - 1) aij
page 132 Algorithm 3: Somehow the line numbering became incorrect. Change the line numbers to be sequential, 1, 2, . . . 6.
page 132 Algorithm 3, line 1: Change β(t), t = T to βj(T), t ← T
page 132 Algorithm 3, old line 4, renumbered to be line 3: Replace entire line by βi(t) ← Sum_{j=1}^{c} βj(t + 1) aij bjk v(t + 1)
page 132 Algorithm 3, old line 7, renumbered line 5: Change P(V^T) to P(V^T) (i.e., make the V bold)
page 133 Figure 3.10, change the label on the horizontal arrow from a12 to a22.
page 133 Figure 3.10, caption, line +5: Change was in state ωj (t = 2) to was in state ωi (t = 2)
page 133 Figure 3.10, caption, line +6: Change is αj(2) for j = 1, 2 to is αi(2) for i = 1, 2
page 133 Figure 3.10, caption, line 1: Change α2(3) = b2k Sum_{j=1}^{c} αj(2) aj2 to α2(3) = b2k Sum_{i=1}^{c} αi(2) ai2
page 133 line 6: Change V5 = {v3 , v1 , v3 , v2 , v0 } to V4 = {v1 , v3 , v2 , v0 }
page 133 line 44: Change is shown above, to is shown at the top of the figure
page 134 Figure in Example 3, caption, line +4: Change αi(t) the probability to αj(t) the probability
page 134 Figure in Example 3, caption, line +6: Change and αi(0) = 0 for i ≠ 1. to and αj(0) = 0 for j ≠ 1.
page 134 Figure in Example 3, caption, line +6: Change calculation of αi(1). to calculation of αj(1).
page 134 Figure in Example 3, caption, line 6: Change calculation of αi(1) to calculation of αj(1)
page 134 Figure in Example 3, caption, line 4: Change contribution to αi(1). to contribution to αj(1).
page 135 Algorithm 4: somehow the line numbering became incorrect. Change the line numbering to be sequential, 1, 2, 3, . . ., 11. In the old line 4 (now renumbered 3): Change k = 0, α0 = 0 to j ← 1
page 135 Algorithm 4 old line 5 (now renumbered 4): Change k ← k + 1 to j ← j + 1
page 135 Algorithm 4 old line 7 (now renumbered 5): Change αk(t) to αj(t)
page 135 Algorithm 4, old line 8 (now renumbered 6): Change k = c to j = c
page 135 Algorithm 4, old line 11 (now renumbered 8): Change AppendTo Path ωj to Append ωj to Path
page 135 line 5: Change The red line to The black line
page 135 line 4: Change value of i at each step to value of j at each step
page 137 Equation 138: Replace equation by

βi(t) = 0                                       if ωi(t) ≠ ω0 and t = T
      = 1                                       if ωi(t) = ω0 and t = T
      = Sum_j βj(t + 1) aij bjk v(t + 1)        otherwise.

page 137 seventh line after Eq. 138: Change the expression for βi(T - 1) to βi(T - 1) = Sum_j aij bjk v(T) βj(T).
page 137 fourth line before Eq. 139: Change probabilities aij and bij to probabilities aij and bjk
page 137 Equation 139: Change bij to bjk
page 138 line +3: Change whereas at step t it is to whereas the total expected
number of any transitions from i is
page 138 rst line after Eq. 140: Change bij to bjk
page 138 Equation 141: Replace equation by:

b̂jk = [ Sum_{t=1, v(t)=vk}^{T} Sum_l γjl(t) ] / [ Sum_{t=1}^{T} Sum_l γjl(t) ]
page 154 Problem 46, line +1 (after the equation): Change missing feature values. to missing feature values and ε is a very small positive constant that can be neglected when normalizing the density within the above bounds.
page 154 Problem 46, part (b): Add You may make some simplifying assumptions.
page 154 Problem 47, rst equation: Change e1 x1 to ex1 /1
page 156 Computer exercise 2, after the equation: Change calculate a density to
calculate the density
page 156 Computer exercise 2. After x2 feature of category 2 . add Assume your
priors on the parameters are uniform throughout the range of the data.
page 157 Computer exercise 4, line 3: Change apply it to the x1 x2 components
to apply it to the x1 x2 components (i.e., eliminate spaces and note that the
dash is not subtraction sign, but an ndash)
page 157 Computer exercise 6, part (a), line +2: Change in the Table above. to
in the table above.
page 159 Computer exercise 13 table, sample 4 under 1 : Change AD to ADB
Chapter 4
page 172 Figure 4.9, caption, line 1: Change where I is the d × d identity matrix. to where I is the d-by-d identity matrix.
page 173 Algorithm 1, line 1: Change j = 0 to j ← 0
page 173 Algorithm 2, line 1: Change test pattern, to test pattern
page 178 line +3: Change Becuase to Because
page 179 Equation 41: In the summation over x ∈ S, place the limits underneath the summation sign.
page 188 Third margin note: Change tanimoto metric to Tanimoto metric (i.e., capitalize the T in Tanimoto)
page 189 line +5: Change the relative shift is a mere to the relative shift s is a
mere
page 192 Figure 4.23, caption, lines 4 2: Eliminate the second-to-last sentence.
page 193 Equation 61: Replace the full equation with x (x) y (y)
page 194 line +19: Change beief about memberships to belief about memberships
page 195 line +9: Change in the title 4.8 REDUCED COULOMB to *4.8
REDUCED COULOMB
page 196 Algorithm 4, line 1: Change j = 0, n = # patterns, ε = small parameter, λm = max radius to j ← 0, n ← # patterns, ε ← small parameter, λm ← max radius
page 197 Algorithm 5, line 1: Change j = 0, k = 0, x = test pattern, Dt = {} to j ← 0, k ← 0, x ← test pattern, Dt ← {}
page 199 line 7: Change prior knowledge. to prior beliefs.
page 202 Problem 6c, first line: Change increases to increase
page 203 Problem 11, part (d), line +1: Change close to an edge to close to a
face
page 203 Problem 11, part (d) line +4: Change closer to an edge to closer to a
face
page 203 Problem 11, part (d), line +5: Change even though it is easier to calculate
here to and happens to be easier to calculate
page 204 Problem 13 line +1: Change from the distributions to with priors P(ω1) = P(ω2) = 0.5 and the distributions
page 205 Problem 21, part (a), line +1: Change As given in the text, take to
Follow the treatment in the text and take
page 206 Problem 23 part (d) line +2: Change space, then the b to space, then
in the b
page 206 Problem 26, line +4: Change and nd its nearest to and seek its
nearest
page 207 Problem 26 part (d), first line: change Calculate the probability to Estimate through simulation the probability
page 207 Problem 27, part (a) equation: Replace current equation with DTanimoto(S1, S2) = (n1 + n2 - 2n12)/(n1 + n2 - n12),
page 207 Problem 29, first equation: Change C(x; μi) to C(x; μ1) in all three places
Chapter 5
page 218 line +6: Change the problem to c 1 to the problem to c
page 218 Equation 2: Change wt xi to wit x
page 219 second line after the second (unnumbered) equation: Change is given by (gi - gj)/||wi - wj|| to is given by (gi(x) - gj(x))/||wi - wj||
page 220 sentence before Eq. 5, Change this in turn suggest to this in turn
suggests
pages 221222 (across the page break): Change multilayer to multilayer (i.e.,
hyphenate as multilayer)
page 225 Algorithm 1, line 1: Change k = 0 to k ← 0
page 228 Algorithm 3, line 1: Change k = 0 to k ← 0
page 229 line +5: Change We shall begin our examination to We begin our
examination
page 229 line 3: Change Thus we shall denote to Thus we denote
page 230 Algorithm 4, line 1: Change k = 0 to k ← 0
page 230 fourth line before Theorem 5.1: Change correction is clearly moving
to correction is hence moving
page 230 line 2: Change From Eq. 20, to From Eq. 20 we have
page 232 rst line after Eq. 24: Change Because the squared distance to Because
this squared distance
page 233 Algorithm 5, line 1: Change k = 0 to k ← 0
page 233 Algorithm 6, line 1: Change k = 0 to k ← 0
page 234 line +22 (counting from the end of the Algorithm): Change that it will have little effect at all. to that it will have little if any effect.
page 235 lines +3 4 (counting from the end of the Algorithm): Change and this
means the gap, determined by these two vectors, can never increase in size
for separable data. to and this means that for separable data the gap,
determined by these two vectors, can never increase in size.
page 235 second and third lines after the section title 5.6 RELAXATION PROCEDURES: Change in so-called relaxation procedures to include to in so-called relaxation procedures, to include
page 235 first and second lines after Eq. 32: Change misclassified by a, as do Jp, Jq focus attention to misclassified by a. Both Jp and Jq focus attention
page 235 second line after Eq. 32: Change Its chief difference is that its gradient to The chief difference is that the gradient of Jq is
page 236 Algorithm 8, line 1: Change k = 0 to k ← 0
page 236 Algorithm 8, line 4: Change j = 0 to j ← 0
page 236 Algorithm 9, line 1: Change k = 0 to k ← 0
page 238 line 5: Change procedures, because to procedures because
page 242 line 2: Change We begin by writing Eq. 47 to We begin by writing
Eq. 45
page 246 lines +1 2: Don't split the equation by the line break
page 246 first (unnumbered) equation, second line: Change (Yak - b). to (Ya(k) - b).
page 246 margin note: Change lms rule to LMS rule (i.e., capitalize LMS)
page 246 Equation 61, second line: Change (bk - a(k)t yk) to (b(k) - at(k)yk)
page 246 Algorithm 10, line 1: Change k = 0 to k ← 0
page 248 second line after Eq. 66: Change obtain the MSE optimal to obtain
the MSEoptimal
page 250 Equation 79: Coalign vertically the > in the top equation with the =
in the bottom equation
page 250 Equation 79: Change a(k) to b(k)
page 251 Algorithm 11, line 5: Change a to b
page 252 top equation: Change = 0 = to = 0 = (i.e., debold the 0)
page 253 line 7: Change requires that e+ (k) = 0 for to requires that e+ (k) = 0
for
page 254 line after Eq. 90: Change constant, positive-definite to constant, symmetric, positive-definite
page 255 First (unnumbered) equation on page: Change the rst term in parentheses
on the righthand side from 2 YRYt YRY to 2 YRYt YRYt
page 255 Eq. 92: Last term on the righthandside, change 2 RYt R to 2 RYt YR
page 257 Figure 5.18, caption, line +2: Change form Au to form Au =
page 262 fourth line after Eq. 105: Change with the largest margin to with the
largest margin (i.e., italicize largest)
page 263 fourth line after Eq. 107: Change equation in Chapter 9, to topic in
Chapter 9,
page 266 Equation 114: Change ati (k)yk aj (k)t yk . to ati (k)yk atj (k)yk .
page 270 twelfth line under BIBLIOGRAPHICAL AND HISTORICAL REMARKS: Change errorfree case [7] and [11] and to errorfree case [7,11]
and
page 270 line 13: Change support vector machines to Support Vector Machines
page 270 line 8: Change support vector machines to Support Vector Machines
page 271 Problem 2, line +3: Change if 0 1. to for 0 1.
page 272 Problem 8, line +2: Change if at yi 0 to if at yi b
page 274 Problem 22, second term on the righthand side change (at y (12
22 ))2 to (at y + (12 22 ))2
page 275 Problem 27, last line: Change by Eq. 85. to by Eq. 95.
page 277 lines +2 3: Change satisfies zk at yk = 0 to satisfies zk at yk = 1
page 277 Problem 38, line 2: Change procedures Perceptron to procedures.
Generalize the Perceptron
page 278 Problem 1, part (a): change data in in order to data in order
page 278 Computer exercise 2, part (a): Change Starting with a = 0, to Starting
with a = 0, (i.e., make bold the 0)
page 278 First heading after the table: Change Section 5.4 to Section 5.5
page 278 Computer Exercise 1: Change (Algorithm 1) and Newtons algorithm
(Algorithm 2) applied to (Algorithm 1) and the Perceptron criterion (Eq. 16)
page 278 Second heading after the table: Delete Section 5.5
page 279 line +1: Change length is greater than the pocket to length is greater
than with the pocket
page 279 Computer exercise 4, part (a), line +2: Change and 1 = 0, to and
1 = 0, (i.e., make bold the 0)
Chapter 6
page 286 Equation 5: Change zk = f (netk ). to zk = f (netk ),
page 287 line +2: Change all identical. to all the same.
page 288 line 3: Change depend on the to depends on the
page 291 Two lines before Eq. 3: Change hidden-to-output weights, wij to hidden-to-output weights, wkj
page 292 Equation 19, rst line (inside brackets): Change 1/2 to 12 (i.e., typeset
as a full fraction)
page 292 After Eq. 19, line +3: Change activiation to activation
β(m) = [∇J^t(w(m)) ∇J(w(m))] / [∇J^t(w(m - 1)) ∇J(w(m - 1))]   (56)(57)

page 323 Fourth (unnumbered) equation on the page: Replace the left portion by
β1 = [∇J^t(w(1)) ∇J(w(1))] / [∇J^t(w(0)) ∇J(w(0))]
page 323 Caption to bottom gure, line +2: Change shown in the contour plot,
to shown in the density plot,
page 325 last line of the section Special Bases: Add This is very closely related to model-dependent maximum-likelihood techniques we saw in Chapter 3.
page 326 first line after Eq. 63: Change of the filter in analogy to of the filter, in analogy
in analogy
page 326 line 5: Add a red margin note time delay neural network
page 330 three lines above Eq. 67: Change write the error as the sum to write
the new error as the sum
page 332 Figure 6.28, caption, line +1: Change a function of weights, J(w) to a
function of weights, J(w),
page 337 Problem 8, part (b) line +2: Change if the sign if flipped to if the sign is flipped
page 337 Problem 14, part (c), line +2: Change the 2 × 2 identity to the 2-by-2 identity
page 339 Problem 22, line +4: Add Are the discriminant functions independent?
page 341 Problem 31, lines +1 2: Change for a sum squared error criterion to for a sum-squared-error criterion
page 344 Computer exercise 2, line +2: Change backpropagation to (Algorithm
1) to backpropagation (Algorithm 1)
page 345 Computer exercise 7, line +2: Change on a random problem. to on a
twodimensional twocategory problem with 2k patterns chosen randomly from
the unit square. Estimate k such that the expected error is 25% . Discuss your
results.
page 346 Computer exercise 10, part (c), line +1: Change Use your network to
Use your trained network
page 347 Column 2, entry for [14], line +5: Change volume 3. Morgan Kaufmann
to volume 3, pages 853859. Morgan Kaufmann
page 348 column 2, entry for [43], line +4: Change 2000. to 2001.
Chapter 7
page 351 fourteenth line after Eq. 1: Change of the magnets with the most stable
conguration to of the magnets that is the most stable
page 351 footnote, line 1: Change in a range of problem domains. to in many
problem domains.
page 352 Figure 7.1, caption, line +7: Change While our convention to While
for neural nets our convention
page 352 Figure 7.1, caption, line +8: Change Boltzmann networks is to Boltzmann networks here is
page 352 Caption to Figure 7.1, line 1: Change 0 210 to 0 < 210
page 353 Figure 7.2, caption, line +3: Change or temperature T to avoid to
or temperature T , to avoid
page 357 Figure 7.4, caption, line +3: Change values eE /T . to values eE /T .
page 360 rst line in Section 7.3: Change will use modify the to will use the
page 360 second line in Section 7.3: Change to specically identify to and specifically identify
page 360 second line in Section 7.3: Change and other units as outputs to and
others as outputs
page 361 fourth line after Eq. 6: Change possible hidden states. to possible hidden states consistent with .
page 364 at the end of the body of the text: insert One benefit of such stochastic learning is that if the final error seems to be unacceptably high, we can merely increase the temperature and anneal; we do not need to reinitialize the weights and restart the full anneal.
page 365 seventh line after the subsection Pattern Completion: Change components of a partial pattern to components of the partial pattern
page 367 line +1: Change Recall, at the end to As mentioned, at the end
page 373 fourteenth line in Section 7.5: Change repeated for subsequent to repeated for the subsequent
page 373 fifth line above the subsection Genetic Algorithms: Change In both cases, a key to In both cases a key
page 374 line +2: Change used in the algorithm. Below to used in the algorithm;
below
page 374 line +3: Change Pco and Pmut , respectively, but rst we present the
general algorithm: to Pco and Pmut .
page 379 third line in the subsection Representation: Change Here the syntactic
to Whereas the syntactic
page 381 third line under BIBLIOGRAPHICAL AND HISTORICAL REMARKS: Change branch-and-bound, A* to branch-and-bound and A*
page 382 line +28: Change been fabricated as described to been fabricated, as
described
page 383 Problem 3, part (b), line +1: Change The gure shows to That gure
shows
page 384 Problem 7, part (c), line +1: Change magnets, total to magnets, the
total
Chapter 8
page 403 line above Section 8.3.4 Pruning: Change splitting is stopped. to
splitting should be stopped.
page 405 Table at top, x2 entry in fth row under 1 , change .48 to .44
page 405 caption to gure, line 2: Change marked were instead slightly lower
(marked ), to marked were instead slightly lower (marked ), (i.e., change
the color of the special symbols to red)
page 414 line +16: Change "/"/"/ GACTG to "/"/"/GACTG (i.e., eliminate space)
page 416 Algorithm 2, line 2: Change F(x) to F
page 416 Algorithm 2, line 3: Change G(x) to G
page 416 Algorithm 2, line 11: Change G(0) to 1
page 416 line 1: Change F(x) to F
page 417 lines 9 4, Replace last full paragraph by Consider target string x. Each location j (for j < m) defines a suffix of x, that is, x[j + 1, . . . , m]. The good-suffix function G(j) returns the starting location of the rightmost instance of another occurrence of that suffix (if one exists). In the example in Fig. 8.8, x = "estimates" and thus j = 8 defines the suffix "s". The rightmost occurrence of another "s" is 2; therefore G(8) = 2. Similarly j = 7 defines the suffix "es". The rightmost occurrence of another "es" is 2; therefore G(7) = 1. No other suffix appears repeatedly within x, and thus G is undefined for j < 7.
page 418 fifth line before Algorithm 3: Change consider interchanges. to consider the interchange operation.
page 418 fourth line before Algorithm 3: Change be an m × n matrix to be an m-by-n matrix
page 422 line 3: Change specify how to transform to specifies how to transform
page 426 line +1: Change The grammar takes digit6 and to This grammar takes
digits6 and
page 428 line 1: Change subsequent characters. to subsequent characters, or instead starting at the last (right) character in the sentence.
page 429 caption to Figure 8.16, add after the last line: Such finite-state machines are sometimes favored because of their clear interpretation and learning methods based on addition of nodes and links. In Section 8.7, though, we shall see general methods for grammatical learning that apply to a broader range of grammatical models.
page 438 Problem 5, line +2: Change Eqs. 1 and 5. to Eqs. 1 and 5 for the case
of an arbitrary number of categories.
page 438 Problem 6 after the first set of equations: Replace the i(λ) equation by
i(λ) = i(λ P^a(ω1) + (1 - λ) P^b(ω1), ..., λ P^a(ωc) + (1 - λ) P^b(ωc))
page 438 Problem 6 last line before part (a): Replace line by then we have i ≥ λ ia + (1 - λ) ib.
page 445 Problem 36, part (d), line +1: Change Give a derivation to Attempt
a derivation
page 445 Problem 40, part (d), line +2: Change either grammar as to either
grammar, or can be parsed in both grammars, as
page 446 Table, sample 12: Change D to E
page 447 Computer exercise 3, part b): Change {C, D, J, L, M} to {C, E, J, L,
M}
Chapter 9
page 455 line 9 10: Change a zeroone loss function, or, more generally, the
cost for a general loss function L(, ) to a zeroone or other loss function.
page 460 Table for rank r = 3, third row: Change x1 OR x3 OR x3 to x1 OR x3 OR x4
page 462 line +12: Change upon a specification method L, by upon a specification method,
page 462 line +13: Change transmitted as y, denoted L(y) = x. to transmitted as y and decoded given some fixed method L, denoted L(y) = x.
page 462 line +15 16: Change denoted min L(y) = x; this minimal...[[to end of paragraph]] to denoted
min_{y : L(y)=x} |y|.
page 462 line +17 18: Change by analogy to entropy, where instead of a specification method L we consider to by analogy to communication, where instead of a fixed decoding method L, we consider
page 462 line +25: Change and so on. A universal description would to and so
on. Such a description would
page 462 line +26: Change dierent binary strings. to dierent binary strings
x1 and x2 .
page 462 third and second line above Eq. 7: Change the shortest program y (where the length to the shortest program string y (where y's length
page 462 Equation 7: Replace the entire equation by:
K(x) = min_{y : U(y)=x} |y|,
page 463 line 10: Change No Free Lunch Theorems. to No Free Lunch Theorem.
page 472 Equation 23: Change 1/(n - 1) to 1/(n(n - 1)), and change Sum_{j≠i} to Sum_{j≠i}^{n} (i.e., put the upper limit n on the summation sign)
page 472 line 4: Change jackknife estimate of the variance to variance of the
jackknife estimate
page 479 line +1: Change in line 4 to in line 5
page 485 line 10: Change good estimates, because to good estimates because
page 488 Caption to Fig. 9.13, line +6: Change approximated as a k-dimensional to approximated as a p-dimensional
page 488 line +10 (i.e., second line of the second paragraph of text): Change is k-dimensional and the to is p-dimensional and the
page 496 Equation 54: Change P (rx, 0 ) to P (rx, 00 )
page 497 Equation 58: Change r to r in two places
page 501 line 15: Change and learning algorithm was rst described to and
learning algorithm were rst described
page 502 Problem 6, last line: Change this way to this sense
page 504 Problem 20, line 2: Change denoted p(g(x; D)) is a to denoted
p(g(x; D)), is a
page 508 Problem 45, line +1: Change mixture of experts classifier to mixture-of-experts classifier
page 512 line 3: Add Now what is the training error measured using A and B ?
Chapter 10
page 524 The second-to-last equation: Increase the size of the final bracket, to match its mate
page 525 line following Eq. 23: Change results in Eq. 12 to results into Eq. 12
page 529 line +5: Change P(wj) to P(ωj)
page 529 Algorithm 2: Change (Fuzzy kMeans Clustering) to (Fuzzy kMeans Clustering)
page 534 line +2: Change overlap, thus to overlap; thus
page 535 First equation: move the second, third, and fourth lines so as to coalign
vertically the corresponding terms
page 536 line +19: Change classification analog of Chapter 3 to classification case in Chapter 3
page 537 line 20: Change distributed, these statistics to distributed these statistics
page 571 line +1 +2: Change microphones to microphones (i.e., rehyphenate)
page 578 Figure 10.31, caption, line +2: Change space that leads maximally to
space that maximally
page 579 Figure 10.32, caption, line 1: Change to this center region to to this
central region
page 580 line +15: Change cluster centers being used to to cluster centers being
used to (i.e., italicize cluster centers)
page 580 line +16: Change with combined features being to with combined features being (i.e., italicize combined features)
page 582 four lines above BIBLIOGRAPHICAL and HISTORICAL REMARKS:
Change between points that, too, seeks to preserve neighborhoods to between
points that preserves neighborhoods
page 583 lines +9 10: Change The classificatory foundations of biology, cladistics (from the Greek klados, branch) provide useful to Cladistics, the classificatory foundation of biology (from the Greek klados, branch), provides useful
page 583 line +18: Change analysis, and explained the very close to analysis,
and in reference [36] explained the very close
page 583 line +19: Change information maximization in reference [36]. to information maximization.
page 584 Problem 2, equation: Change (1x1 )/(2wi ) to (wi xi )/wi2
page 584 Problem 4 a, righthand side: Change xi to xj
page 585 Problem 6, line +1: Change Consider a c component to Consider a c-component
page 587 Problem 13, line +3: Change "that for any observed x, all but one" to "that for any observed x all but one"
page 589 Problem 24 part (b): Change "the trace criterion" to "the determinant criterion"
page 591 Problem 41, line +1: Change "null hypothesis in associated" to "null hypothesis associated"
page 592 Problem 44, part (c), line +3: Change "(e)t (e)t e = 0" to "(e)t e (e)t e = 0"
page 592 Problem 46, line +1: Change "principal componet analysis" to "principal component analysis"
page 593 Problem 48, line +3: Change "linearity is given by" to "linearity is the one given by"
page 596 Problem 11, lines +3-4: Change "to the date in the table above using the distance measure indicated" to "to the data in the table above using the distance measures indicated"
page 597 Problem 13, line +1: Change "a basic competitive learning" to "a basic Competitive Learning"
page 597 Problem 13, lines +4-5: Change "hypersphere" to "hyper-sphere" (i.e., re-hyphenate "hypersphere")
Appendix
page 608 Section A.2.5 line +5: Change "In this case the absolute value of the determinant" to "In this case the determinant"
page 609 first line after Eq. 26: Change "the i, j cofactor or" to "the i,j cofactor or" (i.e., eliminate the space in "i, j")
page 609 second line after Eq. 26: Change "is the (d − 1) × (d − 1) matrix" to "is the (d − 1)-by-(d − 1) matrix"
page 609 fourth line after Eq. 26: Change "whose i, j entry is the j, i cofactor" to "whose i,j entry is the j,i cofactor" (i.e., eliminate the spaces in "i, j" and "j, i")
page 609 line -2: Add "The inverse of the product of two square matrices obeys [MN]^{-1} = N^{-1}M^{-1}, as can be verified by multiplying on the right or the left by MN."
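The identity added by that erratum is easy to spot-check numerically. The following is a small NumPy sketch (not part of the text; the random matrices are almost surely invertible):

```python
import numpy as np

# Spot-check [MN]^{-1} = N^{-1} M^{-1} on random 3x3 matrices.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
N = rng.standard_normal((3, 3))

lhs = np.linalg.inv(M @ N)
rhs = np.linalg.inv(N) @ np.linalg.inv(M)
assert np.allclose(lhs, rhs)

# Verification as suggested: multiplying on the left by MN yields the identity.
assert np.allclose((M @ N) @ rhs, np.eye(3))
```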
page 615 line -12 (i.e., just before Section A.4.7): Change "and n01/n + n11/n is approximately" to "and (n01 + n11)/n is approximately"
page 629 line +9: Change "from a distribution" to "from a standardized Gaussian distribution"
Index
page 644 column 1, line -14: Insert "Gini impurity, 399, 401"
page 653 column 1, line -3: Change "multivariate" to "multivariate" (i.e., italicize)
Chapter 3
page 99 Caption to first figure: Change "starts our as a flat" to "starts out as a flat"
page 133 Figure 3.10: Change the label on the horizontal arrow from a12 to a22
page 143 Problem 11, second and third lines after first equation: Change "p2(x) by a normal p1(x) ~ N(μ, Σ)" to "p1(x) by a normal p2(x) ~ N(μ, Σ)"
page 143 Problem 11, second equations: Change E2 to E1 in two places
page 143 Problem 11, last line: Change "over the density p2(x)" to "over the density p1(x)"
page 149 Problem 31, line 1: Change "suppose a and b are positive constants and" to "suppose a and b are constants greater than 1 and"
page 151 Problem 38 part (b), bottom equation on page: Change "(μ1 + μ2)^2" to "(μ1 − μ2)^2"
page 156 Computer exercise 2, after the equation: Change "calculate a density" to "calculate the density"
page 156 Computer exercise 2: After "x2 feature of category ω2." add "Assume your priors on the parameters are uniform throughout the range of the data."
page 159 Computer exercise 13 table, sample 4 under ω1: Change AD to ADB
Chapter 4
page 178 line +3: Change "Becuase" to "Because"
page 202 Problem 6 part (c), first line: Change "increases" to "increase"
page 207 Problem 26 part (d), first line: Change "Calculate the probability" to "Estimate through simulation the probability"
page 207 Problem 26: delete part (e)
Chapter 5
page 220 sentence before Eq. 5: Change "this in turn suggest" to "this in turn suggests"
page 250 Equation 79: Change a(k) to b(k)
page 251 Algorithm 11, line 5: Change a to b
page 254 line after Eq. 90: Change "constant, positive-definite" to "constant, symmetric, positive-definite"
page 255 First (unnumbered) equation on page: Change the first term in parentheses on the right-hand side from "2 YRY^t YRY" to "2 YRY^t YRY^t"
page 255 Eq. 92: In the last term on the right-hand side, change "2 RY^t R" to "2 RY^t YR"
page 274 Problem 22: In the second term on the right-hand side, change "(a^t y − (λ12 − λ22))^2" to "(a^t y + (λ12 − λ22))^2"
page 275 Problem 27, last line: Change "by Eq. 85." to "by Eq. 95."
page 278 Problem 1, part (a), line +1: Change "data in in order" to "data in order"
page 278 Problem 1, part (a), line +2: Change "use η(k) = 0.1." to "use η(k) = 0.01."
page 278 First heading after the table: Change "Section 5.4" to "Section 5.5"
page 278 Computer Exercise 1: Change "(Algorithm 1) and Newton's algorithm (Algorithm 2) applied" to "(Algorithm 1) and the Perceptron criterion (Eq. 16) applied"
page 278 Second heading after the table: Delete "Section 5.5"
page 279 line +1: Change "length is greater than the pocket" to "length is greater than with the pocket"
Chapter 6
page 291 Two lines before Eq. 3: Change "hidden-to-output weights, wij" to "hidden-to-output weights, wkj"
page 292 After Eq. 19, line +3: Change "activiation" to "activation"
page 293 Figure 6.5: Change wij to wji
page 294 Algorithm 2, line 3: Change "wkj" to "wkj ← 0"
page 295 Second paragraph in Section 6.3.3: Change "indpendently selected" to "independently selected"
page 302 line +34: Change "weights merely leads" to "weights merely lead"
page 305 line -10: Change "ratio of such priors." to "ratio of such priors, though this need not ensure minimal error."
page 314 line +6: Change "weight changes are response" to "weight changes are a response"
page 330 lines -2 to -1: Change "While it is natural" to "It is natural"
page 337 Problem 8, part (b), line +2: Change "if the sign if flipped" to "if the sign is flipped"
page 346 Problem 10, part (c), line +3: Change "x5 = (0, 0, 0)^t" to "x5 = (0, 0, 1)^t"
Chapter 7
page 352 Caption to Figure 7.1, line -1: Change "0 2^10" to "0 < 2^10"
Chapter 8
page 405 Table at top, x2 entry in fifth row under ω1: Change .48 to .44
page 409 line -7: Change "queries involves" to "queries involve"
page 416 Algorithm 2, line 2: Change F(x) to F
page 416 Algorithm 2, line 3: Change G(x) to G
page 416 Algorithm 2, line 11: Change G(0) to 1
page 416 line -1: Change F(x) to F
page 417 lines -9 to -4: Replace last full paragraph by "Consider target string x. Each location j (for j < m) defines a suffix of x, that is, x[j + 1, . . . , m]. The good-suffix function G(j) returns the starting location of the rightmost instance of another occurrence of that suffix (if one exists). In the example in Fig. 8.8, x = estimates and thus j = 8 defines the suffix s. The rightmost occurrence of another s is 2; therefore G(8) = 2. Similarly, j = 7 defines the suffix es. The rightmost other occurrence of es begins at location 1; therefore G(7) = 1. No other suffix appears repeatedly within x, and thus G is undefined for j < 7."
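The replacement paragraph above can be checked mechanically. Here is a brute-force sketch (not from the text; the function name and 1-indexed convention are my own) that builds G exactly as the paragraph describes:

```python
def good_suffix(x):
    """Good-suffix table: for each location j (1-indexed, j < m), G[j] is the
    1-indexed starting location of the rightmost other occurrence of the
    suffix x[j+1 .. m], if one exists; otherwise j is absent from G."""
    m = len(x)
    G = {}
    for j in range(1, m):               # locations j = 1 .. m-1
        suffix = x[j:]                  # x[j+1 .. m] in 1-indexed notation
        # scan right-to-left for another occurrence starting before location j+1
        for start in range(m - len(suffix) - 1, -1, -1):
            if x[start:start + len(suffix)] == suffix:
                G[j] = start + 1        # report a 1-indexed start
                break
    return G

G = good_suffix("estimates")
# reproduces the example: G == {7: 1, 8: 2}, and G is undefined for j < 7
```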
page 438 Problem 5, line +2: Change "Eqs. 1 and 5." to "Eqs. 1 and 5 for the case of an arbitrary number of categories."
page 438 Problem 6, after the first set of equations: Replace the i(α) equation by "i(α) = i(αP^a(ω1) + (1 − α)P^b(ω1), ..., αP^a(ωc) + (1 − α)P^b(ωc))"
page 438 Problem 6, last line before part (a): Replace the line by "then we have i ≤ αia + (1 − α)ib."
page 446 Table, sample 12: Change D to E
page 447 Computer exercise 3, part (b): Change {C, D, J, L, M} to {C, E, J, L, M}
Chapter 9
page 460 Table for rank r = 3, third row: Change "x1 OR x3 OR x3" to "x1 OR x3 OR x4"
page 468 Five lines after Eq. 12: Change "regardless the amount" to "regardless of the amount"
page 474 Caption to Figure: Change "and jackknife estimate" to "and whose jackknife estimate"
page 497 Equation 58: Change r to r in two places
page 505: Take the heading "Section 9.4" and move it to the top of the page, i.e., between Problems 20 and 21.
page 508 Problem 45, line +2: Change "N(μ, Σ)" to "N(μr, Σr)"
Chapter 10
page 526 line -1: Change "In the absense" to "In the absence"
page 541 line +9: Change "this is a symmetric functions" to "this is a symmetric function"
page 549 Eq. 76: Change m to mi on the right-hand side.
page 573 Eq. 107: Put the lower limit underneath the summation sign, that is, change the inline subscript i<j so that i<j appears beneath the summation sign.
page 558 Last full paragraph, line +1: Change "This result agrees with out statement" to "This result agrees with our statement"
page 584 Problem 4 part (a), right-hand side: Change "xi" to "xj"
page 585 Problem 4 part (b): Change i to j in two places, only on the right-hand side of the equation.
page 589 Problem 24 part (b): Change "the trace criterion" to "the determinant criterion"
Appendix
page 608 Section A.2.5 line +5: Change "In this case the absolute value of the determinant" to "In this case the determinant"
Fifth printing
Chapter 2
page 73 Problem 34, lines +6-7: Change "assume the distributions" to "assume P(ω1) = P(ω2) = 0.5 and the distributions"
page 73 Problem 34, part (c), line +4: Change "Bayes error is 0.5." to "Bayes error is 0.25."
Chapter 3
page 143 Problem 11, second and third lines after first equation: Change "p2(x) by a normal p1(x) ~ N(μ, Σ)" to "p1(x) by a normal p2(x) ~ N(μ, Σ)"
page 143 Problem 11, second equations: Change E2 to E1 in two places
page 143 Problem 11, last line: Change "over the density p2(x)" to "over the density p1(x)"
page 151 Problem 38 part (b), bottom equation on page: Change "(μ1 + μ2)^2" to "(μ1 − μ2)^2"
page 159 Computer exercise 13 table, sample 4 under ω1: Change AD to ADB
Chapter 5
page 250 Equation 79: Change a(k) to b(k)
page 251 Algorithm 11, line 5: Change a to b
page 275 Problem 27, last line: Change "by Eq. 85." to "by Eq. 95."
Chapter 6
page 294 Algorithm 2, line 3: Change "wkj" to "wkj ← 0"
page 305 line -10: Change "ratio of such priors." to "ratio of such priors, though this need not ensure minimal error."
page 337 Problem 8, part (b), line +2: Change "if the sign if flipped" to "if the sign is flipped"
Chapter 8
page 438 Problem 6, after the first set of equations: Replace the i(α) equation by "i(α) = i(αP^a(ω1) + (1 − α)P^b(ω1), ..., αP^a(ωc) + (1 − α)P^b(ωc))"
page 438 Problem 6, last line before part (a): Replace the line by "then we have i ≤ αia + (1 − α)ib."
Chapter 9
page 460 Table for rank r = 3, third row: Change "x1 OR x3 OR x3" to "x1 OR x3 OR x4"
page 508 Problem 45, line +2: Change "N(μ, Σ)" to "N(μr, Σr)"
Chapter 10
page 553 lines +13-14: Change "each of which is an O(d^2) calculation" to "each of which is an O(d) calculation"
page 553 lines +17-18: Change "the complexity is O(n(n − 1)(d^2 + 1)) = O(n^2 d^2)" to "the complexity is O(n(n − 1)(d + 1)) = O(n^2 d)"
page 553 line +21: Change "complexity is thus O(cn^2 d^2)" to "complexity is thus O(cn^2 d)"