
6.867 Machine learning

Your name and MIT ID:

Final exam

December 3, 2004

J. D. 00000000

(Optional) The grade you would give to yourself + a brief justification.

A why not?


Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Problem 1

[Figure not reproduced in this text version: scatter plot of the labeled training points, x-axis 0–6, y-axis 0–5, on which the stump boundaries (1) and (2) and their +/− sides are drawn.]

Figure 1: Labeled training points for problem 1.

Consider the labeled training points in Figure 1, where 'x' and 'o' denote positive and negative labels, respectively. We wish to apply AdaBoost with decision stumps to solve the classification problem. In each boosting iteration, we select the stump that minimizes the weighted training error, breaking ties arbitrarily.

1. (3 points) In Figure 1, draw the decision boundary corresponding to the first decision stump that the boosting algorithm would choose. Label this boundary (1), and also indicate the +/− side of the decision boundary.

2. (2 points) In the same Figure 1, also circle the point(s) that have the highest weight after the first boosting iteration.

3. (2 points) What is the weighted error of the first decision stump after the first boosting iteration, i.e., after the points have been reweighted? 0.5

4. (3 points) Draw the decision boundary corresponding to the second decision stump, again in Figure 1, and label it with (2), also indicating the +/− side of the boundary.

5. (3 points) Would some of the points be misclassified by the combined classifier after the two boosting iterations? Provide a brief justification. (The points will be awarded for the justification, not whether your y/n answer is correct.)

Yes. For example, the circled point in the figure is misclassified by the first decision stump and could be classified correctly in the combination only if the weight/votes of the second stump were higher than the first. If it were higher, however, then the points misclassified by the second stump would be misclassified in the combination.
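The reweighting fact used in part 3 (the stump just chosen always has weighted error exactly 1/2 on the updated weights) can be checked numerically. Below is a minimal AdaBoost-with-stumps sketch; it is an illustration, not the official course code, and the dataset in the test is a made-up 1-D example since the exact Figure 1 coordinates are not reproduced here.

```python
import numpy as np

def best_stump(X, y, w):
    """Pick the axis-parallel decision stump (feature j, threshold, sign)
    minimizing the weighted training error."""
    n, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):
        vals = np.unique(X[:, j])
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for thr in thresholds:
            for sign in (1.0, -1.0):
                pred = sign * np.sign(X[:, j] - thr)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best_err, best

def adaboost(X, y, rounds):
    n = len(y)
    w = np.ones(n) / n
    stumps, alphas = [], []
    for _ in range(rounds):
        err, (j, thr, sign) = best_stump(X, y, w)
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = sign * np.sign(X[:, j] - thr)
        # Reweighting rule: misclassified points gain weight; afterwards
        # the stump just chosen has weighted error exactly 1/2 (part 3).
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas, w
```

Running one round on any non-separable stump problem and re-scoring the chosen stump against the new weights reproduces the 0.5 answer above.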



Problem 2

1. (2 points) Consider a linear SVM trained with n labeled points in R² without slack penalties and resulting in k = 2 support vectors (k < n). By adding one additional labeled training point and retraining the SVM classifier, what is the maximum number of support vectors in the resulting solution?

( ) k    ( ) k + 1    ( ) k + 2    (X) n + 1

2. We train two SVM classifiers to separate points in R². The classifiers differ only in terms of the kernel function. Classifier 1 uses the linear kernel K1(x, x′) = xᵀx′, and classifier 2 uses K2(x, x′) = p(x)p(x′), where p(x) is a 3-component Gaussian mixture density, estimated on the basis of related other problems.

(a) (3 points) What is the VC-dimension of the second SVM classifier that uses kernel K2(x, x′)?

2. The feature space is 1-dimensional; each point x ∈ R² is mapped to a non-negative number p(x).

(b) (T/F – 2 points) The second SVM classifier can only separate points that are likely according to p(x) from those that have low probability under p(x). T

(c) (4 points) If both SVM classifiers achieve zero training error on n labeled points, which classifier would have a better generalization guarantee? Provide a brief justification.

The first classifier has VC-dimension 3 while the second one has VC-dimension 2. The complexity penalty for the first one is therefore higher. When the number of training errors is the same for the two classifiers, the bound on the expected error is smaller for the second classifier.



Problem 3

[Figures not reproduced in this text version: panel (a) shows points with x1 ∈ [0, 18], x2 ∈ [−4, 8]; panel (b) shows points with x1 ∈ [−2, 10], x2 ∈ [−6, 6].]

Figure 2: Data sets for clustering. Points are located at integer coordinates.

1. (4 points) First consider the data plotted in Figure 2a, which consist of two rows of equally spaced points. If k-means clustering (k = 2) is initialised with the two points whose coordinates are (9, 3) and (11, 3), indicate the final clusters obtained (after the algorithm converges) on Figure 2a.

2. (4 points) Now consider the data in Figure 2b. We will use spectral clustering to divide these points into two clusters. Our version of spectral clustering uses a neighbourhood graph obtained by connecting each point to its two nearest neighbors (breaking ties randomly), and by weighting the resulting edges between points x_i and x_j by W_ij = exp(−||x_i − x_j||). Indicate on Figure 2b the clusters that we will obtain from spectral clustering. Provide a brief justification.

The random walk induced by the weights can switch between the clusters in the figure in only two places, (0,−1) and (2,0). Since the weights decay with distance, the weights corresponding to transitions within clusters are higher than those going across in both places. The random walk would therefore tend to remain within the clusters indicated in the figure.
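The two-way spectral cut can be illustrated in code. The sketch below is an assumption-laden stand-in: it uses two hypothetical well-separated point sets (the exact Figure 2b coordinates are not reproduced here) and, for brevity, applies the problem's weights W_ij = exp(−||x_i − x_j||) on the complete graph rather than the 2-nearest-neighbor graph. The sign pattern of the Laplacian's second eigenvector then recovers the two groups.

```python
import numpy as np

# Hypothetical well-separated point sets standing in for Figure 2b.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                [10, 0], [11, 0], [10, 1], [11, 1]], dtype=float)

# Edge weights as in the problem, W_ij = exp(-||x_i - x_j||), but on the
# complete graph rather than the 2-nearest-neighbor graph, for brevity.
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
W = np.exp(-dist)
np.fill_diagonal(W, 0.0)

# Unnormalized graph Laplacian; the sign pattern of its second-smallest
# eigenvector (the Fiedler vector) gives the two-way spectral cut.
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
```

Because cross-cluster weights decay exponentially with distance, the Fiedler vector is nearly constant within each group and changes sign between them, matching the random-walk argument above.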

3. (4 points) Can the solution obtained in the previous part for the data in Figure 2b also be obtained by k-means clustering (k = 2)? Justify your answer.

No. In the k-means algorithm points are assigned to the closest mean (cluster centroid). The centroids of the left and right clusters in the figure are (0,0) and (5,0), respectively. Point (2,0), for example, is closer to the left cluster centroid (0,0) and wouldn't be assigned to the right cluster. The two clusters in the figure therefore cannot be fixed points of the k-means algorithm.
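The distance comparison in this argument is easy to verify with a one-step k-means assignment check, using the centroids named in the justification above:

```python
import numpy as np

# Centroids from the argument above: left cluster (0, 0), right cluster (5, 0).
centroids = np.array([[0.0, 0.0], [5.0, 0.0]])
point = np.array([2.0, 0.0])

dists = np.linalg.norm(centroids - point, axis=1)  # distances 2.0 and 3.0
assigned = int(np.argmin(dists))                   # k-means assigns to nearest mean
# assigned == 0: (2, 0) goes to the left centroid, so the spectral
# clusters cannot be a fixed point of k-means.
```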



[Figure not reproduced in this text version: scatter of training points with x ∈ [0, 1], y ∈ [−0.2, 1.2].]

Figure 3: Training sample from a mixture of two linear models.

Problem 4

The data in Figure 3 comes from a mixture of two linear regression models with Gaussian noise:

P(y | x; θ) = p1 N(y; w10 + w11 x, σ1²) + p2 N(y; w20 + w21 x, σ2²)

where p1 + p2 = 1 and θ = (p1, p2, w10, w11, w20, w21, σ1, σ2). We hope to estimate θ from such data via the EM algorithm.

To this end, let z ∈ {1, 2} be the mixture index variable indicating which of the regression models is used to generate y given x.
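A minimal EM sketch for this two-component mixture of linear regressions is given below. It is an illustration under stated assumptions, not the official solution code: the E-step computes responsibilities P(z = k | x, y; θ), and the M-step solves a weighted least-squares problem per component and re-estimates p and σ (with a small floor on σ, my own safeguard against degenerate collapse).

```python
import numpy as np

def em_mix_linreg(x, y, iters=30, seed=0):
    """EM for P(y|x) = p1 N(y; w10 + w11 x, s1^2) + p2 N(y; w20 + w21 x, s2^2)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(x), x])      # design matrix rows (1, x_i)
    p = np.array([0.5, 0.5])                       # mixing proportions (p1, p2)
    W = rng.normal(size=(2, 2))                    # row k holds (w_k0, w_k1)
    s = np.array([1.0, 1.0])                       # noise std devs (s1, s2)
    lls = []
    for _ in range(iters):
        # E-step: responsibilities g_ik = P(z = k | x_i, y_i; theta)
        mu = X @ W.T                               # (n, 2) component means
        dens = p * np.exp(-(y[:, None] - mu) ** 2 / (2 * s ** 2)) \
                 / (s * np.sqrt(2 * np.pi))
        lls.append(np.log(dens.sum(axis=1)).sum()) # log-likelihood at current theta
        g = dens / dens.sum(axis=1, keepdims=True)
        # M-step: per-component weighted least squares, then sigma_k and p
        for k in range(2):
            Xw = X * g[:, k:k + 1]
            W[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
            resid2 = (g[:, k] * (y - X @ W[k]) ** 2).sum() / g[:, k].sum()
            s[k] = max(np.sqrt(resid2), 1e-3)      # floor avoids degenerate collapse
        p = g.mean(axis=0)
    return p, W, s, lls
```

One EM property worth checking: the log-likelihood recorded at each E-step is non-decreasing across iterations, regardless of the initialization.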

1. (6 points) Connect the random variables X, Y, and Z with directed edges so that the graphical model on the left represents the mixture of linear regression models described above, and the one on the right represents a mixture-of-experts model. For both models, Y denotes the output variable, X the input, and Z is the choice of the linear regression model or expert.

[Diagrams to complete: left, "mixture of linear regressions" with nodes X, Z, Y; right, "mixture of experts" with nodes X, Z, Y; edges to be drawn in.]


We use a single plot to represent the model parameters (see the figure below). Each linear regression model appears as a solid line (y = w_i0 + w_i1 x) in between two parallel dotted lines at vertical distance 2σ_i to the solid line. Thus each regression model "covers" the data that falls between the dotted lines. When w10 = w20 and w11 = w21 you would only see a single solid line in the figure; you may still see two different sets of dotted lines corresponding to different values of σ1 and σ2. The solid bar to the right represents p1 (and p2 = 1 − p1). For example, if

θ = (p1, p2, w10, w11, w20, w21, σ1, σ2) = (0.35, 0.65, 0.5, 0, 0.85, −0.7, 0.05, 0.15)

the plot is

[Plot not reproduced in this text version.]

2. (6 points) We are now ready to estimate the parameters θ via EM. There are, however, many ways to initialize the parameters for the algorithm.

On the next page you are asked to connect 3 different initializations (left column) with the parameters that would result after one EM iteration (right column). Different initializations may lead to the same set of parameters. Your answer should consist of 3 arrows, one from each initialization.



Initialization: [three parameter plots, not reproduced in this text version.]

Next iteration: [four parameter plots, not reproduced in this text version.]


Problem 5

Assume that the following sequences are very long and the pattern highlighted with spaces is repeated:

Sequence 1: 100 100 100 100 100
Sequence 2: 1 100 100 100 100

1. (4 points) If we model each sequence with a different first-order HMM, what is the number of hidden states that a reasonable model selection method would report?

                      HMM for Sequence 1    HMM for Sequence 2
No. of hidden states          3                     4

2. (2 points) The following Bayesian network depicts a sequence of 5 observations from an HMM, where s1, s2, s3, s4, s5 is the hidden state sequence.

[Diagram: Markov chain s1 → s2 → s3 → s4 → s5, with an emission edge si → xi for each i.]

Are x1 and x5 independent given x3? Briefly justify your answer.

They are not independent. The moralized ancestral graph corresponding to x1, x3, and x5 is the same graph with arrows replaced with undirected edges. x1 and x5 are not separated given x3, and thus not independent.

3. (3 points) Does the order of Markov dependencies in the observed sequence always determine the number of hidden states of the HMM that generated the sequence? Provide a brief justification.

No. The answer to the previous question implies that observations corresponding to (typical) HMMs have no Markov properties (of any order). This holds, for example, when there are only two possible hidden states. Thus Markov properties of the observation sequence cannot in general determine the number of hidden states.



Problem 6

We wish to develop a graphical model for the following transportation problem. A transport company is trying to choose between two alternative routes for commuting between Boston and New York. In an experiment, two identical buses leave Boston at the same, but otherwise random, time T_B. The buses take different routes, arriving at their (common) destination at times T_N1 and T_N2.

Transit time for each route depends on the congestion along the route, and the two congestions are unrelated. Let us represent the random delays introduced along the routes by variables C1 and C2. Finally, let F represent the identity of the bus which reaches New York first. We view F as a random variable that takes values 1 or 2.

1. (6 points) Complete the following directed graph (Bayesian network) with edges so that it captures the relationships between the variables in this transportation problem.

[Diagram: unconnected nodes T_B, C1, C2, T_N1, T_N2, F; edges to be drawn in.]
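Once the edges are drawn in, the model can be sanity-checked by ancestral sampling. The sketch below assumes one plausible completion of the graph (T_B → T_N1 ← C1, T_B → T_N2 ← C2, T_N1 → F ← T_N2) and made-up distributions for T_B, C1, and C2; it is an illustration, not the official answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed completion of the graph (not the official answer):
# T_B -> T_N1 <- C1,   T_B -> T_N2 <- C2,   T_N1 -> F <- T_N2.
# Distributions below are made up purely for illustration.
T_B = rng.uniform(0, 60, n)        # common random departure time (minutes)
C1 = rng.exponential(20, n)        # congestion delay on route 1 (mean 20)
C2 = rng.exponential(30, n)        # congestion delay on route 2 (mean 30)
T_N1 = T_B + 240 + C1              # arrival times, 4 h base transit assumed
T_N2 = T_B + 240 + C2
F = np.where(T_N1 <= T_N2, 1, 2)   # identity of the first bus to arrive
```

With these assumed exponential delays, route 1 has the smaller mean delay, so the simulation finds F = 1 more often than F = 2; the common parent T_B cancels out of the comparison, which is exactly the structure the graph is meant to capture.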



2. (3 points) Consider the following directed graph as a possible representation of the independences between the variables T_N1, T_N2, and F only: [diagram not reproduced in this text version]. Which of the following factorizations of the joint are consistent with the graph?

(X) P(T_N1) P(T_N2) P(F | T_N1, T_N2)
(X) P(T_N1) P(T_N2) P(F | T_N1)
(X) P(T_N1) P(T_N2) P(F)

