
6.867 Machine learning

Your name and MIT ID:

Final exam

December 3, 2004

J. D. 00000000

(Optional) The grade you would give to yourself + a brief justification.

A, why not?


Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Problem 1

[Plot omitted: labeled points on a grid with x from 0 to 6 and y from 0 to 5; the solution draws the two decision stumps, labeled (1) and (2), on the figure, each with its +/− side marked.]

Figure 1: Labeled training points for problem 1.

Consider the labeled training points in Figure 1, where 'x' and 'o' denote positive and negative labels, respectively. We wish to apply AdaBoost with decision stumps to solve the classification problem. In each boosting iteration, we select the stump that minimizes the weighted training error, breaking ties arbitrarily.

1. (3 points) In Figure 1, draw the decision boundary corresponding to the first decision stump that the boosting algorithm would choose. Label this boundary (1), and also indicate the +/− side of the decision boundary.

2. (2 points) In the same Figure 1, also circle the point(s) that have the highest weight after the first boosting iteration.

3. (2 points) What is the weighted error of the first decision stump after the first boosting iteration, i.e., after the points have been reweighted?

0.5. (After reweighting, the stump that was just selected always has weighted error exactly 1/2.)

4. (3 points) Draw the decision boundary corresponding to the second decision stump, again in Figure 1, and label it with (2), also indicating the +/− side of the boundary.

5. (3 points) Would some of the points be misclassified by the combined classifier after the two boosting iterations? Provide a brief justification. (The points will be awarded for the justification, not whether your y/n answer is correct.)

Yes. For example, the circled point in the figure is misclassified by the first decision stump and could be classified correctly by the combination only if the weight (vote) of the second stump were higher than that of the first. If it were higher, however, then the points misclassified by the second stump would be misclassified by the combination.
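For reference, a minimal Python sketch of AdaBoost with axis-aligned decision stumps, following the procedure described above; the arrays X and y below are a hypothetical stand-in for the points in Figure 1, not the actual data. The printout also illustrates the fact used in part 3: immediately after reweighting, the stump that was just selected has weighted error exactly 1/2.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the axis-aligned stump minimizing the weighted training error."""
    best = None
    for dim in range(X.shape[1]):
        for thresh in np.unique(X[:, dim]):
            for sign in (+1, -1):
                pred = np.where(X[:, dim] > thresh, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:     # ties broken arbitrarily (first found)
                    best = (err, dim, thresh, sign)
    return best

def adaboost(X, y, rounds=2):
    w = np.ones(len(y)) / len(y)                      # uniform initial weights
    for t in range(rounds):
        err, dim, thresh, sign = best_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / err)         # vote of this stump
        pred = np.where(X[:, dim] > thresh, sign, -sign)
        w *= np.exp(-alpha * y * pred)                # up-weight the points the stump got wrong
        w /= w.sum()
        print(f"round {t + 1}: stump error {err:.3f} before reweighting, "
              f"{np.sum(w * (pred != y)):.3f} after reweighting")

# Hypothetical stand-in for the labeled points of Figure 1 (labels are +/-1).
X = np.array([[1., 1.], [2., 2.], [3., 1.], [4., 2.], [5., 1.]])
y = np.array([+1, +1, -1, +1, -1])
adaboost(X, y)
```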


Problem 2

1. (2 points) Consider a linear SVM trained with n labeled points in R^2 without slack penalties, resulting in k = 2 support vectors (k < n). By adding one additional labeled training point and retraining the SVM classifier, what is the maximum number of support vectors in the resulting solution?

(   ) k
(   ) k + 1
(   ) k + 2
( X ) n + 1

2. We train two SVM classifiers to separate points in R^2. The classifiers differ only in terms of the kernel function. Classifier 1 uses the linear kernel K1(x, x') = x^T x', and classifier 2 uses K2(x, x') = p(x)p(x'), where p(x) is a 3-component Gaussian mixture density, estimated on the basis of related other problems.

(a)

(3 points) What is the VC-dimension of the second SVM classifier that uses kernel K2(x, x')?

2

The feature space is one-dimensional; each point x ∈ R^2 is mapped to a non-negative number p(x).

(b)

(T/F – 2 points) The second SVM classifier can only separate points that are likely according to p(x) from those that have low probability under p(x).

T

(c)

(4 points) If both SVM classifiers achieve zero training error on n labeled points, which classifier would have a better generalization guarantee? Provide a brief justification.

The first classifier has VC-dimension 3 while the second one has VC-dimension 2. The complexity penalty for the first one is therefore higher. When the number of training errors is the same for the two classifiers, the bound on the expected error is smaller for the second classifier.
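For reference, a minimal sketch of classifier 2, assuming scikit-learn and NumPy; the mixture-fitting data and the labeled training set below are hypothetical, not from the exam. The kernel is K2(x, x') = p(x)p(x') with p a 3-component Gaussian mixture fit on separate related data, so the implicit feature is the single number p(x) and the learned classifier can only threshold the density, which is what parts (a) and (b) rely on.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical "related" unlabeled data used only to fit the 3-component mixture p(x).
X_related = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))
p = GaussianMixture(n_components=3, random_state=0).fit(X_related)

def density(X):
    return np.exp(p.score_samples(X))                 # p(x): one non-negative number per point

def K2(A, B):
    return density(A)[:, None] * density(B)[None, :]  # K2(x, x') = p(x) p(x')

# Hypothetical labeled training set for classifier 2.
X_pos = rng.normal(scale=0.5, size=(20, 2))           # near one mode of p
X_neg = rng.uniform(-8, 8, size=(20, 2))              # scattered, mostly low density
X_train = np.vstack([X_pos, X_neg])
y_train = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel=K2, C=1.0).fit(X_train, y_train)     # SVC accepts a callable Gram-matrix kernel
print(clf.predict(X_train[:5]))
```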


Problem 3

[Plots omitted: panel (a) and panel (b), each with axes x1 (horizontal) and x2 (vertical); the solution marks the resulting clusters on the figures.]

Figure 2: Data sets for clustering. Points are located at integer coordinates.

1. (4 points) First consider the data plotted in Figure 2a, which consist of two rows of equally spaced points. If k-means clustering (k = 2) is initialized with the two points whose coordinates are (9, 3) and (11, 3), indicate the final clusters obtained (after the algorithm converges) on Figure 2a.

2. (4 points) Now consider the data in Figure 2b. We will use spectral clustering to divide these points into two clusters. Our version of spectral clustering uses a neighborhood graph obtained by connecting each point to its two nearest neighbors (breaking ties randomly), and by weighting the resulting edges between points x_i and x_j by W_ij = exp(−||x_i − x_j||). Indicate on Figure 2b the clusters that we will obtain from spectral clustering. Provide a brief justification.

The random walk induced by the weights can switch between the clusters in the figure in only two places, (0,−1) and (2,0). Since the weights decay with distance, the weights corresponding to transitions within clusters are higher than those going across in both places. The random walk would therefore tend to remain within the clusters indicated in the figure.

3. (4 points) Can the solution obtained in the previous part for the data in Figure 2b also be obtained by k-means clustering (k = 2)? Justify your answer.

No. In the k-means algorithm points are assigned to the closest mean (cluster centroid). The centroids of the left and right clusters in the figure are (0,0) and (5,0), respectively. Point (2,0), for example, is closer to the left cluster centroid (0,0) and wouldn't be assigned to the right cluster. The two clusters in the figure therefore cannot be fixed points of the k-means algorithm.
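For reference, a minimal NumPy/SciPy sketch of the spectral construction used in part 2; the coordinates below are a hypothetical stand-in for Figure 2b, since the exact integer points are not reproduced here. It builds the symmetrized 2-nearest-neighbour graph with weights exp(−||x_i − x_j||) and splits the points by the sign of the second eigenvector of the normalized graph Laplacian, one standard way of extracting two clusters from such a graph.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical stand-in for Figure 2b: a left group around (0, 0) and a right group
# around (5, 0), joined by a short bridge of points.
X = np.array([[0, 0], [0, 1], [0, -1], [1, 0],
              [2, 0], [3, 0],
              [4, 0], [5, 0], [5, 1], [5, -1]], dtype=float)

D = cdist(X, X)

# 2-nearest-neighbour graph, symmetrized, with edge weights exp(-||xi - xj||).
W = np.zeros_like(D)
for i in range(len(X)):
    nn = np.argsort(D[i])[1:3]        # two nearest neighbours (index 0 is the point itself)
    W[i, nn] = np.exp(-D[i, nn])
W = np.maximum(W, W.T)

# Normalized graph Laplacian; the sign pattern of its second eigenvector gives a two-way split.
deg = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(deg, deg))
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
print("spectral split:", labels)
```

On the exam's Figure 2b, the clusters are the ones described in the part 2 solution; part 3 then argues that reassigning points to the nearest centroid, as k-means would, moves the point (2, 0), so that split cannot be a fixed point of k-means.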


[Scatter plot omitted: y versus x, with x between 0 and 1 and y roughly between −0.2 and 1.2.]

Figure 3: Training sample from a mixture of two linear models.

Problem 4

The data in Figure 3 comes from a mixture of two linear regression models with Gaussian noise:

P(y | x; θ) = p1 N(y; w10 + w11 x, σ1^2) + p2 N(y; w20 + w21 x, σ2^2)
where p1 + p2 = 1 and θ = (p1, p2, w10, w11, w20, w21, σ1, σ2). We hope to estimate θ from such data via the EM algorithm.

To this end, let z ∈ {1, 2} be the mixture index variable indicating which of the regression models is used to generate y given x.
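For reference, a minimal NumPy sketch (not part of the exam; all variable names are mine) of one EM iteration for this mixture of two linear regressions. The E-step computes the responsibility of each component for each point from the two Gaussian likelihoods above; the M-step refits each line by responsibility-weighted least squares and re-estimates σ1, σ2 and p1. The synthetic data at the bottom is a hypothetical stand-in for Figure 3.

```python
import numpy as np

def gaussian(y, mean, sigma):
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def em_step(x, y, p1, w, sigma):
    """One EM iteration; w is a (2, 2) array of rows [w_k0, w_k1], sigma has length 2."""
    means = w[:, 0][:, None] + w[:, 1][:, None] * x[None, :]        # (2, n) predicted means
    priors = np.array([p1, 1 - p1])

    # E-step: responsibility of each component for each point.
    lik = priors[:, None] * gaussian(y[None, :], means, sigma[:, None])
    resp = lik / lik.sum(axis=0, keepdims=True)

    # M-step: responsibility-weighted least squares per component, then sigma_k and p1.
    A = np.column_stack([np.ones_like(x), x])
    new_w, new_sigma = np.zeros_like(w), np.zeros_like(sigma)
    for k in range(2):
        r = resp[k]
        WA = A * r[:, None]
        new_w[k] = np.linalg.solve(A.T @ WA, WA.T @ y)              # weighted normal equations
        residual = y - A @ new_w[k]
        new_sigma[k] = np.sqrt(np.sum(r * residual ** 2) / np.sum(r))
    return resp[0].mean(), new_w, new_sigma

# Hypothetical data in the spirit of Figure 3: two noisy lines over x in [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
z = rng.random(100) < 0.5
y = np.where(z, 0.5 + 0.05 * rng.normal(size=100),
                0.85 - 0.7 * x + 0.15 * rng.normal(size=100))

p1, w, sigma = 0.5, np.array([[0.2, 0.5], [0.8, -0.5]]), np.array([0.2, 0.2])
for _ in range(20):
    p1, w, sigma = em_step(x, y, p1, w, sigma)
print(p1, w, sigma)
```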

1. (6 points) Connect the random variables X, Y, and Z with directed edges so that the graphical model on the left represents the mixture of linear regression models described above, and the one on the right represents a mixture-of-experts model. For both models, Y denotes the output variable, X the input, and Z is the choice of the linear regression model or expert.

mixture of linear regressions: [diagram omitted, nodes X, Z, Y with edges to be drawn in]

mixture of experts: [diagram omitted, nodes X, Z, Y with edges to be drawn in]


We use a single plot to represent the model parameters (see the figure below). Each linear regression model appears as a solid line (y = wi0 + wi1 x) in between two parallel dotted lines at vertical distance 2σi to the solid line. Thus each regression model "covers" the data that falls between the dotted lines. When w10 = w20 and w11 = w21 you would only see a single solid line in the figure; you may still see two different sets of dotted lines corresponding to different values of σ1 and σ2. The solid bar to the right represents p1 (and p2 = 1 − p1).

For example, if θ = (p1, p2, w10, w11, w20, w21, σ1, σ2) = (0.35, 0.65, 0.5, 0, 0.85, −0.7, 0.05, 0.15), the plot is: [plot omitted].

2. (6 points) We are now ready to estimate the parameters θ via EM. There are, however, many ways to initialize the parameters for the algorithm.

Below, you are asked to connect 3 different initializations (left column) with the parameters that would result after one EM iteration (right column). Different initializations may lead to the same set of parameters. Your answer should consist of 3 arrows, one from each initialization.


[Plots omitted: a left column of three "Initialization" parameter plots and a right column of "Next iteration" parameter plots; the answer connects each initialization to the parameters obtained after one EM iteration.]

Problem 5

Assume that the following sequences are very long and the pattern highlighted with spaces is repeated:

Sequence 1: 100 100 100 100 100

Sequence 2: 1 100 100 100 100

1. (4 points) If we model each sequence with a different first-order HMM, what is the number of hidden states that a reasonable model selection method would report?

No. of hidden states:
HMM for Sequence 1: 3
HMM for Sequence 2: 4
(See the sketch at the end of this problem.)

2. (2 points) The following Bayesian network depicts a sequence of 5 observations from an HMM, where s1, s2, s3, s4, s5 is the hidden state sequence.

[Diagram omitted: the hidden chain s1 → s2 → s3 → s4 → s5, with each si emitting the corresponding observation xi.]

Are x1 and x5 independent given x3? Briefly justify your answer.

They are not independent. The moralized ancestral graph corresponding to x1, x3, and x5 is the same graph with arrows replaced by undirected edges. x1 and x5 are not separated given x3, and thus not independent.

3. (3 points) Does the order of Markov dependencies in the observed sequence always determine the number of hidden states of the HMM that generated the sequence? Provide a brief justification.

No. The answer to the previous question implies that observations corresponding to (typical) HMMs have no Markov properties (of any order). This holds, for example, when there are only two possible hidden states. Thus Markov properties of the observation sequence cannot in general determine the number of hidden states.
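Returning to part 1, here is a minimal NumPy sketch of deterministic HMMs that generate the two sequences with 3 and 4 hidden states, respectively. The matrices and the reading of Sequence 2 (a single leading 1 followed by the repeating pattern 100) are my own illustration, not part of the exam; symbols are encoded as {0, 1}.

```python
import numpy as np

def generate(pi, T, E, n):
    """Sample n symbols from an HMM with initial distribution pi, transitions T, emissions E."""
    rng = np.random.default_rng(0)
    s = rng.choice(len(pi), p=pi)
    out = []
    for _ in range(n):
        out.append(int(rng.choice(E.shape[1], p=E[s])))
        s = rng.choice(len(pi), p=T[s])
    return out

# Sequence 1 ("100" repeated): a deterministic 3-state cycle, one emitted symbol per state.
pi1 = np.array([1.0, 0.0, 0.0])
T1 = np.array([[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]], dtype=float)   # state k -> state (k + 1) mod 3
E1 = np.array([[0, 1],                    # state 0 always emits 1
               [1, 0],                    # states 1 and 2 always emit 0
               [1, 0]], dtype=float)
print("sequence 1:", generate(pi1, T1, E1, 12))

# Sequence 2 (a single leading 1, then "100" repeated): one extra start state, hence 4 states.
pi2 = np.array([1.0, 0.0, 0.0, 0.0])
T2 = np.array([[0, 1, 0, 0],              # start state, visited only once
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [0, 1, 0, 0]], dtype=float)
E2 = np.array([[0, 1],
               [0, 1],
               [1, 0],
               [1, 0]], dtype=float)
print("sequence 2:", generate(pi2, T2, E2, 13))
```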


Problem 6

We wish to develop a graphical model for the following transportation problem. A transport company is trying to choose between two alternative routes for commuting between Boston and New York. In an experiment, two identical buses leave Boston at the same, but otherwise random, time T_B. The buses take different routes, arriving at their (common) destination at times T_N1 and T_N2.

Transit time for each route depends on the congestion along the route, and the two congestions are unrelated. Let us represent the random delays introduced along the routes by variables C1 and C2. Finally, let F represent the identity of the bus which reaches New York first. We view F as a random variable that takes values 1 or 2.
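To make these relationships concrete before drawing the graph, here is a hypothetical simulation sketch in Python. The specific distributions and base transit times are invented; only the structure comes from the problem statement: a shared random departure time T_B, independent route delays C1 and C2, arrival times determined by the departure time and the route's own delay, and F determined by which arrival is earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trip():
    t_b = rng.uniform(8.0, 10.0)        # common, otherwise random, departure time T_B
    c1 = rng.exponential(0.5)           # congestion delay on route 1
    c2 = rng.exponential(0.5)           # congestion delay on route 2 (independent of c1)
    t_n1 = t_b + 4.0 + c1               # arrival time of bus 1 depends on T_B and C1
    t_n2 = t_b + 4.2 + c2               # arrival time of bus 2 depends on T_B and C2
    return 1 if t_n1 < t_n2 else 2      # F: which bus reaches New York first

wins = np.array([simulate_trip() for _ in range(10000)])
print("fraction of trips won by route 1:", np.mean(wins == 1))
```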

1. (6 points) Complete the following directed graph (Bayesian network) with edges so that it captures the relationships between the variables in this transportation problem.

[Diagram omitted: nodes T_B, C1, C2, T_N1, T_N2, and F, with the edges to be drawn in as the answer.]


2. (3 points) Consider the following directed graph as a possible representation of the independences between the variables T_N1, T_N2, and F only:

[Diagram omitted.]

Which of the following factorizations of the joint are consistent with the graph?

( X ) P(T_N1) P(T_N2) P(F | T_N1, T_N2)
( X ) P(T_N1) P(T_N2) P(F | T_N1)
( X ) P(T_N1) P(T_N2) P(F)
