ENGR 691/692 Section 66 (Fall 06): Machine Learning


Homework 1: Bayesian Decision Theory (solutions)

Assigned: August 30
Due: September 13

Problem 1: (22 pts) Let the conditional densities for a two-category one-dimensional problem be given by the following Cauchy distribution:
$$p(x|\omega_i) = \frac{1}{\pi b}\cdot\frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}\,, \quad i = 1, 2.$$
1. (6 pts) By explicit integration, check that the distributions are indeed normalized.
2. (9 pts) Assuming $P(\omega_1) = P(\omega_2)$, show that $P(\omega_1|x) = P(\omega_2|x)$ if $x = \frac{a_1 + a_2}{2}$, that is, the minimum error decision boundary is a point midway between the peaks of the two distributions, regardless of $b$.

3. (7 pts) Show that the minimum probability of error is given by
$$P(\text{error}) = \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left|\frac{a_1 - a_2}{2b}\right|.$$
Answer:


1. We have
$$\int_{-\infty}^{\infty} p(x|\omega_i)\,dx = \frac{1}{\pi b}\int_{-\infty}^{\infty} \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}\,dx\,.$$
We substitute $y = \frac{x - a_i}{b}$ into the above and get
$$\int_{-\infty}^{\infty} p(x|\omega_i)\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{1}{1 + y^2}\,dy = \frac{1}{\pi}\Big[\tan^{-1}(y)\Big]_{-\infty}^{\infty} = \frac{1}{\pi}\left(\frac{\pi}{2} + \frac{\pi}{2}\right) = 1\,.$$

2. By setting $p(x|\omega_1)P(\omega_1) = p(x|\omega_2)P(\omega_2)$, we have
$$\frac{1}{2}\cdot\frac{1}{\pi b}\cdot\frac{1}{1 + \left(\frac{x - a_1}{b}\right)^2} = \frac{1}{2}\cdot\frac{1}{\pi b}\cdot\frac{1}{1 + \left(\frac{x - a_2}{b}\right)^2}\,,$$
or, equivalently,
$$x - a_1 = \pm(x - a_2)\,.$$
For $a_1 \neq a_2$, this implies that $x = \frac{a_1 + a_2}{2}$.

3. Without loss of generality, we assume $a_2 > a_1$. The probability of error is defined as
$$P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error}|x)\,p(x)\,dx\,.$$
Note that the decision boundary is at $\frac{a_1 + a_2}{2}$, hence
$$P(\text{error}|x) = \begin{cases} P(\omega_2|x) = \dfrac{p(x|\omega_2)P(\omega_2)}{p(x)} & \text{if } x \le \frac{a_1 + a_2}{2}\,,\\[8pt] P(\omega_1|x) = \dfrac{p(x|\omega_1)P(\omega_1)}{p(x)} & \text{if } x > \frac{a_1 + a_2}{2}\,. \end{cases}$$
Therefore, the probability of error is
$$P(\text{error}) = \int_{-\infty}^{\frac{a_1 + a_2}{2}} p(x|\omega_2)P(\omega_2)\,dx + \int_{\frac{a_1 + a_2}{2}}^{\infty} p(x|\omega_1)P(\omega_1)\,dx = \frac{1}{2\pi b}\int_{-\infty}^{\frac{a_1 + a_2}{2}} \frac{dx}{1 + \left(\frac{x - a_2}{b}\right)^2} + \frac{1}{2\pi b}\int_{\frac{a_1 + a_2}{2}}^{\infty} \frac{dx}{1 + \left(\frac{x - a_1}{b}\right)^2}\,.$$
We substitute $y = \frac{x - a_2}{b}$ and $z = \frac{x - a_1}{b}$ into the above and get
$$\begin{aligned}
P(\text{error}) &= \frac{1}{2\pi}\int_{-\infty}^{\frac{a_1 - a_2}{2b}} \frac{1}{1 + y^2}\,dy + \frac{1}{2\pi}\int_{\frac{a_2 - a_1}{2b}}^{\infty} \frac{1}{1 + z^2}\,dz\\
&= \frac{1}{2\pi}\Big[\tan^{-1}(y)\Big]_{-\infty}^{\frac{a_1 - a_2}{2b}} + \frac{1}{2\pi}\Big[\tan^{-1}(z)\Big]_{\frac{a_2 - a_1}{2b}}^{\infty}\\
&= \frac{1}{2\pi}\left[\tan^{-1}\left(\frac{a_1 - a_2}{2b}\right) + \frac{\pi}{2}\right] + \frac{1}{2\pi}\left[\frac{\pi}{2} - \tan^{-1}\left(\frac{a_2 - a_1}{2b}\right)\right]\\
&= \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left(\frac{a_2 - a_1}{2b}\right)\,.
\end{aligned}$$
Similarly, if $a_1 > a_2$, we have $P(\text{error}) = \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left(\frac{a_1 - a_2}{2b}\right)$. Therefore, we have shown that
$$P(\text{error}) = \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left|\frac{a_1 - a_2}{2b}\right|\,.$$
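As a quick numerical sanity check (not part of the original assignment), the closed form can be compared against direct numerical integration of the two error integrals; the parameter values below are arbitrary illustrative choices.

```python
import numpy as np
from scipy import integrate

a1, a2, b = 0.0, 2.0, 1.0  # arbitrary illustrative parameters

def cauchy(x, a):
    """Cauchy density p(x|omega_i) with location a and scale b."""
    return 1.0 / (np.pi * b * (1.0 + ((x - a) / b) ** 2))

mid = (a1 + a2) / 2.0  # minimum-error decision boundary

# P(error) = 1/2 * integral of p(x|w2) left of the boundary
#          + 1/2 * integral of p(x|w1) right of the boundary
left, _ = integrate.quad(cauchy, -np.inf, mid, args=(a2,))
right, _ = integrate.quad(cauchy, mid, np.inf, args=(a1,))
numeric = 0.5 * left + 0.5 * right

closed_form = 0.5 - np.arctan(abs(a1 - a2) / (2.0 * b)) / np.pi
print(numeric, closed_form)  # both print 0.25 for these parameters
```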

Problem 2: (21 pts) Let $\omega_{\max}(x)$ be the state of nature for which $P(\omega_{\max}|x) \ge P(\omega_i|x)$ for all $i$, $i = 1, \ldots, c$.
1. (7 pts) Show that $P(\omega_{\max}|x) \ge \frac{1}{c}$.
2. (7 pts) Show that for the minimum-error-rate decision rule the average probability of error is given by
$$P(\text{error}) = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx\,.$$
3. (7 pts) Show that $P(\text{error}) \le \frac{c-1}{c}$.
Answer:
1. Since $P(\omega_{\max}|x) \ge P(\omega_i|x)$ for all $i$, we have
$$c\,P(\omega_{\max}|x) \ge \sum_{i=1}^{c} P(\omega_i|x) = 1\,,$$
which implies that $P(\omega_{\max}|x) \ge \frac{1}{c}$.
2. By definition,
$$P(\text{error}) = \int P(\text{error}|x)\,p(x)\,dx = \int \left[1 - P(\omega_{\max}|x)\right] p(x)\,dx = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx\,.$$
3. From 1 and 2, it is clear that
$$P(\text{error}) = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx \le 1 - \int \frac{1}{c}\,p(x)\,dx = 1 - \frac{1}{c} = \frac{c-1}{c}\,.$$
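The bound is easy to confirm numerically. The sketch below (not part of the original solution) evaluates $P(\text{error}) = 1 - \int P(\omega_{\max}|x)\,p(x)\,dx$ for the three-category Gaussian setup that appears later in Question 5 and checks it against $(c-1)/c$.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

priors = np.array([0.5, 0.25, 0.25])  # P(omega_1), P(omega_2), P(omega_3)
means = np.array([0.0, 0.5, 1.0])     # unit-variance class conditionals

def integrand(x):
    """P(omega_max|x) * p(x) = max_j p(x|omega_j) P(omega_j)."""
    return (priors * norm.pdf(x, loc=means, scale=1.0)).max()

val, _ = integrate.quad(integrand, -np.inf, np.inf)
p_error = 1.0 - val
c = len(priors)
print(p_error, (c - 1) / c)  # P(error) stays well below the (c-1)/c bound
```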

Problem 3: (22 pts) In many machine learning applications, one has the option either to assign the pattern to one of $c$ classes, or to reject it as being unrecognizable. If the cost for rejects is not too high, rejection may be a desirable action. Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j\,, \quad i, j = 1, \ldots, c\,,\\ \lambda_r & i = c + 1\,,\\ \lambda_s & \text{otherwise}\,, \end{cases}$$
where $\lambda_r$ is the loss incurred for choosing the $(c+1)$-th action, rejection, and $\lambda_s$ is the loss incurred for making any substitution error.
1. (10 pts) Please derive the decision rule with the minimum risk.
2. (6 pts) What happens if $\lambda_r = 0$?
3. (6 pts) What happens if $\lambda_r > \lambda_s$?
Answer:
1. For $i = 1, \ldots, c$,
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x) = \lambda_s \sum_{\substack{j=1\\ j\neq i}}^{c} P(\omega_j|x) = \lambda_s\left[1 - P(\omega_i|x)\right].$$
For $i = c + 1$,
$$R(\alpha_{c+1}|x) = \lambda_r\,.$$
Therefore, the minimum risk is achieved if we decide $\omega_i$ when $R(\alpha_i|x) \le R(\alpha_{c+1}|x)$, i.e., $P(\omega_i|x) \ge 1 - \frac{\lambda_r}{\lambda_s}$, and reject otherwise.
2. If $\lambda_r = 0$, we always reject.
3. If $\lambda_r > \lambda_s$, we will never reject.
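A minimal sketch of this reject-option rule, assuming class posteriors are already available (the function name and the numbers in the example are illustrative, not from the original):

```python
import numpy as np

def classify_with_reject(posteriors, lambda_r, lambda_s):
    """Return the chosen class index, or -1 to signal rejection.

    posteriors: array of P(omega_i|x), i = 1..c.
    Decide the maximum-posterior class when its posterior is at
    least 1 - lambda_r/lambda_s; otherwise rejecting is cheaper.
    """
    best = int(np.argmax(posteriors))
    if posteriors[best] >= 1.0 - lambda_r / lambda_s:
        return best
    return -1  # reject

# A confident posterior is classified; an ambiguous one is rejected.
print(classify_with_reject(np.array([0.9, 0.05, 0.05]), 0.3, 1.0))  # 0
print(classify_with_reject(np.array([0.4, 0.35, 0.25]), 0.3, 1.0))  # -1
```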
Problem 4: (12 pts + 10 extra pts) Let the components of the vector $\mathbf{x} = [x_1, \ldots, x_d]^T$ be binary-valued (0 or 1), and let $P(\omega_j)$ be the prior probability for the state of nature $\omega_j$, $j = 1, \ldots, c$. We define
$$p_{ij} = P(x_i = 1|\omega_j)\,, \quad i = 1, \ldots, d\,, \quad j = 1, \ldots, c\,,$$
with the components $x_i$ being statistically independent for all $\mathbf{x}$ in $\omega_j$.
1. (12 pts) Show that the minimum probability of error is achieved by the following decision rule: Decide $\omega_k$ if $g_k(\mathbf{x}) \ge g_j(\mathbf{x})$ for all $j$ and $k$, where
$$g_j(\mathbf{x}) = \sum_{i=1}^{d} x_i \ln\frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln P(\omega_j)\,.$$
2. (10 extra pts) If the components of $\mathbf{x}$ are ternary-valued ($-1$, 0, or 1), show that a minimum probability of error decision rule can be derived that involves discriminant functions $g_j(\mathbf{x})$ that are quadratic functions of the components $x_i$.

Answer:
1. Consider the following discriminant function:
$$g_j(\mathbf{x}) = \ln\left[p(\mathbf{x}|\omega_j)P(\omega_j)\right] = \ln p(\mathbf{x}|\omega_j) + \ln P(\omega_j)\,.$$
Since the components of $\mathbf{x}$ are statistically independent for all $\mathbf{x}$ in $\omega_j$, we can write the density as a product:
$$p(\mathbf{x}|\omega_j) = \prod_{i=1}^{d} p(x_i|\omega_j) = \prod_{i=1}^{d} p_{ij}^{x_i}(1 - p_{ij})^{1 - x_i}\,.$$
Thus we have the discriminant function
$$g_j(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i \ln p_{ij} + (1 - x_i)\ln(1 - p_{ij})\right] + \ln P(\omega_j) = \sum_{i=1}^{d} x_i \ln\frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln P(\omega_j)\,.$$
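As an illustrative check (all probabilities below are made up), the linear form of $g_j$ can be verified against the log joint probability computed directly from the product density:

```python
import numpy as np

P_w = np.array([0.6, 0.4])    # made-up priors P(omega_j)
p = np.array([[0.8, 0.3],     # p[i, j] = P(x_i = 1 | omega_j)
              [0.5, 0.9],
              [0.2, 0.4]])

def g(x, j):
    """Linear discriminant g_j(x) from the solution above."""
    return (x @ np.log(p[:, j] / (1 - p[:, j]))
            + np.log(1 - p[:, j]).sum() + np.log(P_w[j]))

x = np.array([1, 0, 1])
for j in range(2):
    direct = np.log(np.prod(p[:, j]**x * (1 - p[:, j])**(1 - x)) * P_w[j])
    assert np.isclose(g(x, j), direct)  # the two forms agree

print("decide class", max(range(2), key=lambda j: g(x, j)))
```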

2. Consider the following discriminant function:
$$g_j(\mathbf{x}) = \ln\left[p(\mathbf{x}|\omega_j)P(\omega_j)\right] = \ln p(\mathbf{x}|\omega_j) + \ln P(\omega_j)\,.$$
The components of $\mathbf{x}$ are statistically independent for all $\mathbf{x}$ in $\omega_j$, therefore
$$p(\mathbf{x}|\omega_j) = \prod_{i=1}^{d} p(x_i|\omega_j)\,.$$
Let
$$p_{ij} = P(x_i = 1|\omega_j)\,, \quad q_{ij} = P(x_i = 0|\omega_j)\,, \quad r_{ij} = P(x_i = -1|\omega_j)\,.$$
It is not hard to check that
$$p(x_i|\omega_j) = p_{ij}^{\frac{1}{2}x_i + \frac{1}{2}x_i^2}\, q_{ij}^{1 - x_i^2}\, r_{ij}^{-\frac{1}{2}x_i + \frac{1}{2}x_i^2}\,,$$
since the exponents select exactly one of the three factors for each of $x_i = 1, 0, -1$.

Thus the discriminant functions can be written as
$$\begin{aligned}
g_j(\mathbf{x}) &= \sum_{i=1}^{d}\left[\left(\tfrac{1}{2}x_i + \tfrac{1}{2}x_i^2\right)\ln p_{ij} + (1 - x_i^2)\ln q_{ij} + \left(-\tfrac{1}{2}x_i + \tfrac{1}{2}x_i^2\right)\ln r_{ij}\right] + \ln P(\omega_j)\\
&= \frac{1}{2}\sum_{i=1}^{d} x_i^2 \ln\frac{p_{ij}\,r_{ij}}{q_{ij}^2} + \frac{1}{2}\sum_{i=1}^{d} x_i \ln\frac{p_{ij}}{r_{ij}} + \sum_{i=1}^{d} \ln q_{ij} + \ln P(\omega_j)\,,
\end{aligned}$$
which are quadratic functions of the components $x_i$.
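A quick sketch (with made-up probabilities for a single component) verifying the exponent identity used above for all three ternary values:

```python
import numpy as np

# Made-up probabilities for one (i, j) pair; they sum to 1
p_ij, q_ij, r_ij = 0.5, 0.3, 0.2  # P(x=1), P(x=0), P(x=-1)

def density(x):
    """p(x_i|omega_j) written via the exponent trick from the solution."""
    return (p_ij ** (0.5 * x + 0.5 * x**2)
            * q_ij ** (1 - x**2)
            * r_ij ** (-0.5 * x + 0.5 * x**2))

# The exponents select exactly one factor for each value of x_i
assert np.isclose(density(1), p_ij)
assert np.isclose(density(0), q_ij)
assert np.isclose(density(-1), r_ij)
```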
Question 5: (23 pts) Suppose we have three categories with prior probabilities $P(\omega_1) = 0.5$, $P(\omega_2) = P(\omega_3) = 0.25$ and the class-conditional probability distributions
$$p(x|\omega_1) \sim N(0, 1)\,, \quad p(x|\omega_2) \sim N(0.5, 1)\,, \quad p(x|\omega_3) \sim N(1, 1)\,,$$
where $N(\mu, \sigma^2)$ represents the normal distribution with density function
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,.$$
We sample the following sequence of four points: $x = 0.6, 0.1, 0.9, 1.1$.
1. (9 pts) Calculate explicitly the probability that the sequence actually came from $\omega_1, \omega_3, \omega_3, \omega_2$.
2. (6 pts) Repeat for the sequence $\omega_1, \omega_2, \omega_2, \omega_3$.
3. (8 pts) Find the sequence of states having the maximum probability.
Answer: It is straightforward to compute that
$$\begin{array}{lll}
p(0.6|\omega_1) = 0.333225 & p(0.6|\omega_2) = 0.396953 & p(0.6|\omega_3) = 0.368270\\
p(0.1|\omega_1) = 0.396953 & p(0.1|\omega_2) = 0.368270 & p(0.1|\omega_3) = 0.266085\\
p(0.9|\omega_1) = 0.266085 & p(0.9|\omega_2) = 0.368270 & p(0.9|\omega_3) = 0.396953\\
p(1.1|\omega_1) = 0.217852 & p(1.1|\omega_2) = 0.333225 & p(1.1|\omega_3) = 0.396953\,.
\end{array}$$
We denote $X = (x_1, x_2, x_3, x_4)$ and $\omega = (\omega(1), \omega(2), \omega(3), \omega(4))$. Clearly, there are $3^4 = 81$ possible values of $\omega$, such as
$$(\omega_1, \omega_1, \omega_1, \omega_1)\,,\ (\omega_1, \omega_1, \omega_1, \omega_2)\,,\ (\omega_1, \omega_1, \omega_1, \omega_3)\,,\ (\omega_1, \omega_1, \omega_2, \omega_1)\,,\ \ldots,\ (\omega_3, \omega_3, \omega_3, \omega_3)\,.$$
For each possible value of $\omega$, we calculate $P(\omega)$ and $p(X|\omega)$ using the following expressions, which assume that the $x_i$ and the $\omega(i)$ are independent across positions:
$$p(X|\omega) = \prod_{i=1}^{4} p(x_i|\omega(i))\,, \qquad P(\omega) = \prod_{i=1}^{4} P(\omega(i))\,.$$

For example, if $\omega = (\omega_1, \omega_3, \omega_3, \omega_2)$ and $X = (0.6, 0.1, 0.9, 1.1)$, then we have
$$\begin{aligned}
p(X|\omega) &= p((0.6, 0.1, 0.9, 1.1)|(\omega_1, \omega_3, \omega_3, \omega_2))\\
&= p(0.6|\omega_1)\,p(0.1|\omega_3)\,p(0.9|\omega_3)\,p(1.1|\omega_2)\\
&= 0.333225 \times 0.266085 \times 0.396953 \times 0.333225\\
&= 0.01173
\end{aligned}$$
and
$$P(\omega) = P(\omega(1))\,P(\omega(2))\,P(\omega(3))\,P(\omega(4)) = \frac{1}{2}\cdot\frac{1}{4}\cdot\frac{1}{4}\cdot\frac{1}{4} = 0.0078125\,.$$

1. Given $X = (0.6, 0.1, 0.9, 1.1)$ and $\omega = (\omega_1, \omega_3, \omega_3, \omega_2)$, we have
$$\begin{aligned}
p(X) &= p(x_1 = 0.6, x_2 = 0.1, x_3 = 0.9, x_4 = 1.1)\\
&= \sum_{\omega} p(x_1 = 0.6, x_2 = 0.1, x_3 = 0.9, x_4 = 1.1|\omega)\,P(\omega)\\
&= p(0.6|\omega_1)\,p(0.1|\omega_1)\,p(0.9|\omega_1)\,p(1.1|\omega_1)\,P(\omega_1)P(\omega_1)P(\omega_1)P(\omega_1)\\
&\quad + p(0.6|\omega_1)\,p(0.1|\omega_1)\,p(0.9|\omega_1)\,p(1.1|\omega_2)\,P(\omega_1)P(\omega_1)P(\omega_1)P(\omega_2)\\
&\quad + \cdots\\
&\quad + p(0.6|\omega_3)\,p(0.1|\omega_3)\,p(0.9|\omega_3)\,p(1.1|\omega_3)\,P(\omega_3)P(\omega_3)P(\omega_3)P(\omega_3)\\
&= 0.012083\,.
\end{aligned}$$
Therefore,
$$P(\omega|X) = P(\omega_1, \omega_3, \omega_3, \omega_2|0.6, 0.1, 0.9, 1.1) = \frac{p(0.6, 0.1, 0.9, 1.1|\omega_1, \omega_3, \omega_3, \omega_2)\,P(\omega_1, \omega_3, \omega_3, \omega_2)}{p(X)} = \frac{0.01173 \times 0.0078125}{0.012083} = 0.007584\,.$$

2. Following the steps in part 1, we have
$$P(\omega_1, \omega_2, \omega_2, \omega_3|0.6, 0.1, 0.9, 1.1) = \frac{p(0.6, 0.1, 0.9, 1.1|\omega_1, \omega_2, \omega_2, \omega_3)\,P(\omega_1, \omega_2, \omega_2, \omega_3)}{p(X)} = \frac{0.01794 \times 0.0078125}{0.012083} = 0.01160\,.$$

3. The sequence $\omega = (\omega_1, \omega_1, \omega_1, \omega_1)$ is the one most likely to have generated $X = (0.6, 0.1, 0.9, 1.1)$; its posterior probability is $P(\omega|X) = 0.03966$.
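A short sketch (using only the quantities defined in the problem) that enumerates all $3^4 = 81$ state sequences and reproduces the numbers above:

```python
import itertools
import math

priors = [0.5, 0.25, 0.25]  # P(omega_1), P(omega_2), P(omega_3)
means = [0.0, 0.5, 1.0]     # class-conditional means, all sigma = 1
X = [0.6, 0.1, 0.9, 1.1]

def pdf(x, mu):
    """Normal density N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def joint(seq):
    """p(X|omega) * P(omega) for a 0-indexed state sequence."""
    return math.prod(pdf(x, means[j]) * priors[j] for x, j in zip(X, seq))

# Evidence p(X): sum over all 81 sequences
p_X = sum(joint(seq) for seq in itertools.product(range(3), repeat=4))

print(joint((0, 2, 2, 1)) / p_X)  # part 1: ~0.007584
print(joint((0, 1, 1, 2)) / p_X)  # part 2: ~0.011600
best = max(itertools.product(range(3), repeat=4), key=joint)
print(best, joint(best) / p_X)    # part 3: (0, 0, 0, 0), ~0.03966
```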
