
Homework 4
Multivariate Analysis, S. S. Mukherjee, Fall 2018


Topics: Fisher’s LDA, CART, k-NN classifiers. Due on October 10, 2018
Name of student: Rohan Hore
Roll number: MB1812
Note: Attach your code with your submission.
1. Fisher’s LDA and least squares in a two-class problem. [5 points]
Recall that Fisher’s LDA for a two-class problem results in a discriminant function of the form

$$h_{LDA}(x) = a^\top x + b,$$

where $a = \hat{\Sigma}_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2)$ (one classifies a new example $x$ as class 1 if $h_{LDA}(x) > 0$, and class 2 otherwise). Suppose
we encode the two categories by two distinct real numbers $\alpha$ and $\beta$, i.e. we dream up a response variable $y_i$ which takes
value α if xi is from class 1, and β otherwise. Now consider the following least squares problem:
$$(\hat{\theta}_0, \hat{\theta}) = \arg\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$

(a) Show that θ̂ ∝ a.


(b) Let $h_{LS}^{(\alpha,\beta)}(x) = \hat{\theta}_0 + \hat{\theta}^\top x$. How do the classification rules corresponding to $h_{LS}^{(\alpha,\beta)}(x)$ compare for different $(\alpha, \beta)$? When are the classification rules corresponding to $h_{LS}^{(\alpha,\beta)}(x)$ and $h_{LDA}(x)$ equivalent?

Solution.
Ans.(a) Consider the minimization problem
$$\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$
Taking the derivative of the objective with respect to $\theta_0$ and setting it to 0, we have
$$2\sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i) = 0.$$

Hence $\bar{y} = \theta_0 + \theta^\top \bar{x}$, and our minimization problem can be translated into
$$\min_{\theta} \sum_{i=1}^{N} \big((y_i - \bar{y}) - \theta^\top (x_i - \bar{x})\big)^2.$$

Hence, defining $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{y}_i = y_i - \bar{y}$, we consider the problem
$$\min_{\theta} \sum_{i=1}^{N} (\tilde{y}_i - \theta^\top \tilde{x}_i)^2.$$

Written in matrix form, with $\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N)^\top$ and $\tilde{X}$ the matrix whose $i$th row is $\tilde{x}_i^\top$, this is
$$\min_{\theta} \|\tilde{Y} - \tilde{X}\theta\|^2.$$

Hence, from standard multiple regression results, the optimizer is
$$\hat{\theta} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}. \tag{1}$$

   
Now $\tilde{X} = \left(I_n - \frac{1}{n}J_n\right)X$, where $X$ is the matrix with $i$th row $x_i^\top$ and $J_n$ is the $n \times n$ all-ones matrix. Similarly $\tilde{Y} = \left(I_n - \frac{1}{n}J_n\right)Y$.
Also, taking $X_i$ to be the sub-matrix of rows belonging to class $i = 1, 2$, we can write
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} \alpha 1_{n_1} \\ \beta 1_{n_2} \end{pmatrix},$$

where $n_i$ is the number of observations in class $i$. Also $\bar{x} = (n_1\bar{x}_1 + n_2\bar{x}_2)/n$ and $\bar{y} = (n_1\alpha + n_2\beta)/n$, where $\bar{x}_i$ is the mean of class $i$. Now observe,
  
$$\begin{aligned}
\tilde{X}^\top \tilde{Y} &= X^\top \left(I_n - \tfrac{1}{n}J_n\right)^\top \left(I_n - \tfrac{1}{n}J_n\right) Y \\
&= X^\top \left(I_n - \tfrac{1}{n}J_n\right) Y \\
&= \begin{pmatrix} X_1^\top & X_2^\top \end{pmatrix} \begin{pmatrix} \tfrac{n_2(\alpha - \beta)}{n} 1_{n_1} \\[2pt] \tfrac{n_1(\beta - \alpha)}{n} 1_{n_2} \end{pmatrix} \\
&= \frac{n_1 n_2}{n}(\alpha - \beta)\bar{x}_1 + \frac{n_2 n_1}{n}(\beta - \alpha)\bar{x}_2 \\
&= \frac{n_1 n_2}{n}(\alpha - \beta)(\bar{x}_1 - \bar{x}_2).
\end{aligned}$$
Now observe
$$\tilde{X}^\top \tilde{X} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top = \sum_{i:\, y_i = \alpha} (x_i - \bar{x})(x_i - \bar{x})^\top + \sum_{i:\, y_i = \beta} (x_i - \bar{x})(x_i - \bar{x})^\top.$$

Further,
$$\begin{aligned}
\sum_{i:\, y_i = \alpha} (x_i - \bar{x})(x_i - \bar{x})^\top &= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1 + \bar{x}_1 - \bar{x})(x_i - \bar{x}_1 + \bar{x}_1 - \bar{x})^\top \\
&= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + n_1 (\bar{x}_1 - \bar{x})(\bar{x}_1 - \bar{x})^\top \\
&= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + \frac{n_1 n_2^2}{n^2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top,
\end{aligned}$$
where the cross terms vanish because $\sum_{i:\, y_i = \alpha}(x_i - \bar{x}_1) = 0$, and the last line uses $\bar{x}_1 - \bar{x} = \frac{n_2}{n}(\bar{x}_1 - \bar{x}_2)$.

Similarly, we can do this for the other sum. Combining, we get
$$\tilde{X}^\top \tilde{X} = \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + \sum_{i:\, y_i = \beta} (x_i - \bar{x}_2)(x_i - \bar{x}_2)^\top + \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top = (n-2) S_u + \frac{n_1 n_2}{n} d d^\top,$$
where $S_u$ is the pooled (within-class) sample covariance matrix and $d = \bar{x}_1 - \bar{x}_2$.

Now, we recall the Sherman-Morrison formula: if $A$ is an invertible square matrix and $u$, $v$ are column vectors, then, provided $1 + v^\top A^{-1} u \neq 0$,
$$(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}.$$
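A quick numerical sanity check of this rank-one update identity (a minimal NumPy sketch; the matrix $A$ and vectors $u$, $v$ are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)) + 5 * np.eye(5)  # a well-conditioned invertible matrix
u = rng.normal(size=(5, 1))
v = rng.normal(size=(5, 1))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / (1 + v.T @ Ainv @ u)
print(np.allclose(lhs, rhs))  # True: the rank-one update formula holds
```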
Hence, applying the Sherman-Morrison formula with $A = (n-2)S_u$, $u = \frac{n_1 n_2}{n} d$ and $v = d$, we have
$$(\tilde{X}^\top \tilde{X})^{-1} = (n-2)^{-1} S_u^{-1} - \frac{n_1 n_2 (n-2)^{-1} S_u^{-1} d d^\top (n-2)^{-1} S_u^{-1}}{n + n_1 n_2 d^\top (n-2)^{-1} S_u^{-1} d}$$
$$\implies \hat{\theta} = \frac{n_1 n_2 (\alpha - \beta)}{n(n-2)} S_u^{-1} d - \frac{n_1^2 n_2^2 (\alpha - \beta)}{n(n-2)^2} \cdot \frac{S_u^{-1} d \,(d^\top S_u^{-1} d)}{n + n_1 n_2 d^\top (n-2)^{-1} S_u^{-1} d}$$
$$\implies \hat{\theta} \propto S_u^{-1} d.$$

Now, noting that by definition $S_u = \hat{\Sigma}_{pooled}$ and $d = \bar{x}_1 - \bar{x}_2$, we conclude that $\hat{\theta} \propto a$.
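A minimal numerical sketch of this conclusion (assuming NumPy; the simulated two-class data and the encoding $\alpha = 1$, $\beta = -1$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 40, 60, 3
X1 = rng.normal(loc=0.0, size=(n1, p))  # class 1 sample
X2 = rng.normal(loc=1.0, size=(n2, p))  # class 2 sample
X = np.vstack([X1, X2])
y = np.concatenate([np.full(n1, 1.0), np.full(n2, -1.0)])  # alpha = 1, beta = -1

# Least-squares fit with an intercept; theta_hat holds the slope coefficients.
Z = np.column_stack([np.ones(len(X)), X])
theta_hat = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

# Fisher/LDA direction: a = pooled-covariance^{-1} (xbar1 - xbar2).
S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
            (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
a = np.linalg.solve(S_pooled, X1.mean(axis=0) - X2.mean(axis=0))

print(theta_hat / a)  # entrywise ratio is (numerically) constant, i.e. theta_hat is proportional to a
```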


2. The smallest optimally pruned subtrees corresponding to different α’s. [5 points]
Let $T$ be a tree. As defined in class, $T(\alpha)$ is the smallest pruned subtree of $T$ optimizing $R_\alpha(T') = R(T') + \alpha|\tilde{T}'|$ over all pruned subtrees $T'$ of $T$. Show that if $0 \leq \alpha_1 \leq \alpha_2$, then $T(\alpha_2) \preceq T(\alpha_1)$.

Solution. First, recall from class that
$$T(\alpha) = \begin{cases} \{t\} \cup T_L(\alpha) \cup T_R(\alpha) & \text{if } R_\alpha(t) > R_\alpha(T_L(\alpha)) + R_\alpha(T_R(\alpha)), \\ \{t\} & \text{otherwise,} \end{cases}$$
where $t$ is the root node and $T_L$, $T_R$ are the left and right branches of $T$.
We will show $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \leq \alpha_1 \leq \alpha_2$ by induction on the height of sub-trees of $T$. Let $h$ denote the height of a tree.
First consider the base case $h = 1$: the tree consists of a root node $t$ with two leaf children $t_L$ and $t_R$.
The only pruned sub-trees of $T$ are $\{t\}$ and $T$ itself. We need to show that $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \leq \alpha_1 \leq \alpha_2$.
Suppose the result is not true. Then, since there are only two candidate sub-trees, we must have $T(\alpha_2) = T$ and $T(\alpha_1) = \{t\}$. Also observe that here $T_L(\alpha) = \{t_L\}$ and $T_R(\alpha) = \{t_R\}$.

Now, from the optimality property defining $T(\alpha)$, we have
$$R_{\alpha_2}(t) > R_{\alpha_2}(t_L) + R_{\alpha_2}(t_R) \quad \text{and} \quad R_{\alpha_1}(t) \leq R_{\alpha_1}(t_L) + R_{\alpha_1}(t_R).$$
Now $t$, $t_L$ and $t_R$ are individually single leaf nodes, so by the definition of $R_\alpha(\cdot)$ the above conditions translate into
$$R(t) > R(t_L) + R(t_R) + \alpha_2 \quad \text{and} \quad R(t) \leq R(t_L) + R(t_R) + \alpha_1.$$
Since $\alpha_2 \geq \alpha_1$, these two conditions cannot hold together, a contradiction.

Thus for $h = 1$, $T(\alpha_2) \preceq T(\alpha_1)$.

Now suppose the result holds for all trees of height at most $n$; we will show it holds for height $n + 1$.
First observe the following chain of implications (whose first inequality is exactly the condition for $T(\alpha_1) = \{t\}$):

$$R_{\alpha_1}(t) \leq R_{\alpha_1}(T_L(\alpha_1)) + R_{\alpha_1}(T_R(\alpha_1))$$
$$\implies R_{\alpha_1}(t) \leq R_{\alpha_1}(T_L(\alpha_2)) + R_{\alpha_1}(T_R(\alpha_2))$$
(since $T_L(\alpha_1)$ minimizes $R_{\alpha_1}$ over all pruned subtrees of $T_L$, of which $T_L(\alpha_2)$ is one, and similarly for $T_R$)
$$\implies R(t) - R(T_L(\alpha_2)) - R(T_R(\alpha_2)) \leq \alpha_1 \left(|\widetilde{T_L(\alpha_2)}| + |\widetilde{T_R(\alpha_2)}| - 1\right)$$
$$\implies R(t) - R(T_L(\alpha_2)) - R(T_R(\alpha_2)) \leq \alpha_2 \left(|\widetilde{T_L(\alpha_2)}| + |\widetilde{T_R(\alpha_2)}| - 1\right) \quad \text{since } \alpha_2 \geq \alpha_1$$
$$\implies R_{\alpha_2}(t) \leq R_{\alpha_2}(T_L(\alpha_2)) + R_{\alpha_2}(T_R(\alpha_2)).$$

Hence, from the definition of the optimal $T(\alpha)$, if $T(\alpha_1) = \{t\}$ then also $T(\alpha_2) = \{t\}$, and trivially $T(\alpha_2) \preceq T(\alpha_1)$.

So let's consider the case $T(\alpha_1) \neq \{t\}$.

Now,if Rα (t) > Rα (TL )+Rα (TR ) doesn’t hold for α2 ,then we have T (α2 ) = t which is always a sub-tree of T (α1 ).Hence,T (α2 ) 4
T (α1 ) holds in this case.

Now, if $R_{\alpha_2}(t) > R_{\alpha_2}(T_L(\alpha_2)) + R_{\alpha_2}(T_R(\alpha_2))$ does hold (and the corresponding condition holds for $\alpha_1$ since $T(\alpha_1) \neq \{t\}$), then both $T(\alpha_1)$ and $T(\alpha_2)$ have non-trivial branches $T_L(\alpha) \preceq T_L$ and $T_R(\alpha) \preceq T_R$, for both values of $\alpha$.
Now $T_L$ and $T_R$ are trees of height at most $n$, hence by the induction hypothesis we have
$$T_L(\alpha_2) \preceq T_L(\alpha_1) \quad \text{and} \quad T_R(\alpha_2) \preceq T_R(\alpha_1),$$


and since $T(\alpha_i) = \{t\} \cup T_L(\alpha_i) \cup T_R(\alpha_i)$ in this case, clearly $T(\alpha_2) \preceq T(\alpha_1)$.

Hence $T(\alpha_2) \preceq T(\alpha_1)$ for all $0 \leq \alpha_1 \leq \alpha_2$.
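The nesting property can also be observed empirically via scikit-learn's minimal cost-complexity pruning. A minimal sketch (the dataset and the use of node counts as a proxy for tree size are illustrative assumptions): fitting the optimally pruned tree at each $\alpha$ on the pruning path gives tree sizes that are non-increasing in $\alpha$, a necessary consequence of $T(\alpha_2) \preceq T(\alpha_1)$.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alpha values at which the optimally pruned subtree changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

sizes = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    sizes.append(pruned.tree_.node_count)

# Larger alpha never yields a larger optimally pruned tree.
print(all(s2 <= s1 for s1, s2 in zip(sizes, sizes[1:])))  # True
```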

3. Splits decrease impurity of nodes. [5 points]
Recall that a function $f : S \subseteq \mathbb{R}^d \to \mathbb{R}$ is called concave if, for any $x, y \in S$ and $\alpha \in [0, 1]$, one has
$$f(\alpha x + (1 - \alpha) y) \geq \alpha f(x) + (1 - \alpha) f(y).$$

Suppose that we have an impurity measure $Imp : \Delta_K \to \mathbb{R}_+$ which is concave.


(a) Show that a split of $t$ into $(t_L, t_R)$ always (weakly) decreases the impurity of node $t$, i.e.
$$Imp(\hat{p}) - p_L\, Imp(\hat{p}_L) - p_R\, Imp(\hat{p}_R) \geq 0.$$
Here $\hat{p}$, $\hat{p}_L$ and $\hat{p}_R$ are the label proportion vectors in the nodes $t$, $t_L$, and $t_R$ respectively, and $p_L = |t_L|/|t|$, $p_R = |t_R|/|t|$ are the fractions of the observations that fall into the two child nodes.
(b) Show that the three impurity measures defined in class, namely the Gini index, the entropy, and the misclassification error, are all concave, and thus conclude (a) for each of these measures. When do you have equality?
(c) Establish the same result for a regression tree and squared error loss, i.e. show that
$$\min_c \frac{1}{|t|}\sum_{i \in t}(y_i - c)^2 - p_L \min_{c_L} \frac{1}{|t_L|}\sum_{i \in t_L}(y_i - c_L)^2 - p_R \min_{c_R} \frac{1}{|t_R|}\sum_{i \in t_R}(y_i - c_R)^2 \geq 0.$$

Solution.
Ans.(a) We have $|t|$ training instances at node $t$. Define $n_i$ as the number of instances at $t$ falling in class $i$, so that $\sum_{i=1}^{k} n_i = |t|$.

Similarly, we have $|t_L|$ and $|t_R|$ training instances at nodes $t_L$ and $t_R$. Define $n_{iL}$ and $n_{iR}$ as the number of instances at $t_L$ and $t_R$ falling in class $i$, so that $\sum_{i=1}^{k} n_{iL} = |t_L|$ and $\sum_{i=1}^{k} n_{iR} = |t_R|$.

It is easy to observe that $n_{iL} + n_{iR} = n_i$ for all $i = 1, 2, \ldots, k$, and that $|t_L| + |t_R| = |t|$. Now we observe


 
$$\hat{p} = \left(\frac{n_1}{|t|}, \frac{n_2}{|t|}, \ldots, \frac{n_k}{|t|}\right), \quad \hat{p}_L = \left(\frac{n_{1L}}{|t_L|}, \frac{n_{2L}}{|t_L|}, \ldots, \frac{n_{kL}}{|t_L|}\right), \quad \hat{p}_R = \left(\frac{n_{1R}}{|t_R|}, \frac{n_{2R}}{|t_R|}, \ldots, \frac{n_{kR}}{|t_R|}\right).$$

So, from the above expressions we can infer that
$$\hat{p} = p_L \hat{p}_L + p_R \hat{p}_R,$$
where $p_L$ and $p_R$ are as defined in the problem.


Since $Imp$ is a concave function, taking $x = \hat{p}_L$, $y = \hat{p}_R$ and $\alpha = p_L$ (so that $1 - \alpha = p_R$) in the definition of concavity gives
$$Imp(\hat{p}) = Imp(p_L \hat{p}_L + p_R \hat{p}_R) \geq p_L\, Imp(\hat{p}_L) + p_R\, Imp(\hat{p}_R).$$
(Hence, proved.)
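A quick numerical illustration of this inequality for the Gini index (a minimal sketch; the class counts at $t_L$ and $t_R$ are made up for illustration):

```python
import numpy as np

def gini(p):
    """Gini impurity of a proportion vector p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical per-class counts in the two children after a split of node t.
n_L = np.array([30, 5, 5])
n_R = np.array([10, 25, 25])
n = n_L + n_R  # per-class counts in the parent node t

p_L, p_R = n_L.sum() / n.sum(), n_R.sum() / n.sum()
decrease = gini(n / n.sum()) - p_L * gini(n_L / n_L.sum()) - p_R * gini(n_R / n_R.sum())
print(decrease >= 0, decrease)  # True, strictly positive here since p_hat_L != p_hat_R
```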
Ans.(b) Here we need to show that the three impurity measures are concave.
First we show that the Gini index is concave. The Gini index is
$$f(p) = Imp(p) = \sum_{i \neq i'} p_i p_{i'} = \sum_i p_i(1 - p_i) = 1 - \sum_i p_i^2.$$

Now, to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
we need to show
$$\begin{aligned}
& 1 - \sum_i (\alpha p_i + (1 - \alpha) q_i)^2 \geq \alpha\Big(1 - \sum_i p_i^2\Big) + (1 - \alpha)\Big(1 - \sum_i q_i^2\Big) \\
\iff\ & \alpha \sum_i p_i^2 + (1 - \alpha)\sum_i q_i^2 \geq \sum_i (\alpha p_i + (1 - \alpha) q_i)^2 \\
\iff\ & \alpha \sum_i p_i^2 + (1 - \alpha)\sum_i q_i^2 \geq \alpha^2 \sum_i p_i^2 + (1 - \alpha)^2 \sum_i q_i^2 + 2\alpha(1 - \alpha)\sum_i p_i q_i \\
\iff\ & \alpha(1 - \alpha)\Big(\sum_i p_i^2 + \sum_i q_i^2\Big) \geq 2\alpha(1 - \alpha)\sum_i p_i q_i \\
\iff\ & \sum_i p_i^2 + \sum_i q_i^2 \geq 2\sum_i p_i q_i \\
\iff\ & \sum_i (p_i - q_i)^2 \geq 0.
\end{aligned}$$
(The step dividing by $\alpha(1-\alpha)$ assumes $\alpha \in (0,1)$; the cases $\alpha \in \{0, 1\}$ are trivial.)

The last inequality is obviously true, hence the concavity of the Gini index holds, and thus (a) also holds for the Gini index.
Since all steps in the proof of the concavity inequality are 'iff' statements (for $\alpha \in (0,1)$), equality in the concavity inequality holds iff $p_i = q_i$ for all $i = 1, 2, \ldots, k$.

Hence, equality in part (a) for the Gini index holds iff $\hat{p}_L = \hat{p}_R$.

Next we show the concavity of the misclassification error, which is
$$f(p) = Imp(p) = 1 - \max_i p_i.$$

We need to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
i.e. we need to show
$$\begin{aligned}
& 1 - \max_i(\alpha p_i + (1 - \alpha) q_i) \geq \alpha\big(1 - \max_i p_i\big) + (1 - \alpha)\big(1 - \max_j q_j\big) \\
\iff\ & 1 - \max_i(\alpha p_i + (1 - \alpha) q_i) \geq 1 - \alpha \max_i p_i - (1 - \alpha)\max_j q_j \\
\iff\ & \max_i \alpha p_i + \max_j (1 - \alpha) q_j \geq \max_i(\alpha p_i + (1 - \alpha) q_i).
\end{aligned}$$

Suppose $\arg\max_i(\alpha p_i + (1 - \alpha) q_i) = i^*$; then observe
$$\max_i \alpha p_i + \max_i (1 - \alpha) q_i \geq \alpha p_{i^*} + (1 - \alpha) q_{i^*} = \max_i(\alpha p_i + (1 - \alpha) q_i).$$
So the last desired inequality holds, and thus the concavity of the misclassification error holds. Hence (a) also holds for the misclassification error.
Now, again all the steps of the proof of the concavity inequality are 'iff' statements.
We claim that equality in concavity holds iff
$$\arg\max_i p_i = \arg\max_j q_j = i^* \ \text{(say)}.$$
We prove this claim below.

Proof of 'if': here
$$\max_i \alpha p_i = \alpha p_{i^*} \quad \text{and} \quad \max_j (1 - \alpha) q_j = (1 - \alpha) q_{i^*}.$$
Hence, for every $i$, $\alpha p_i + (1 - \alpha) q_i \leq \alpha p_{i^*} + (1 - \alpha) q_{i^*}$, so $\max_i(\alpha p_i + (1 - \alpha) q_i) = \alpha p_{i^*} + (1 - \alpha) q_{i^*} = \max_i \alpha p_i + \max_j (1 - \alpha) q_j$, i.e. equality holds.
Proof of 'only if':

We show the contrapositive, i.e. if $\arg\max_i p_i \neq \arg\max_j q_j$, then equality is not attained.
Suppose
$$\max_i \alpha p_i = \alpha p_{i^*} \quad \text{and} \quad \max_j (1 - \alpha) q_j = (1 - \alpha) q_{j^*} \quad \text{with } i^* \neq j^*.$$
Then, for every index $i$, at least one of $\alpha p_i < \alpha p_{i^*}$ and $(1 - \alpha) q_i < (1 - \alpha) q_{j^*}$ holds (taking the argmaxes to be unique and $\alpha \in (0,1)$), so
$$\alpha p_i + (1 - \alpha) q_i < \max_i \alpha p_i + \max_j (1 - \alpha) q_j \implies \max_i(\alpha p_i + (1 - \alpha) q_i) < \max_i \alpha p_i + \max_j (1 - \alpha) q_j.$$
So, equality in concavity holds iff $\arg\max_i p_i = \arg\max_j q_j$.

Finally, we show the concavity of the Shannon entropy, which is
$$f(p) = Imp(p) = -\sum_i p_i \log(p_i).$$
We need to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
i.e. we need to show
$$\begin{aligned}
& -\sum_i (\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) \geq -\alpha \sum_i p_i \log(p_i) - (1 - \alpha)\sum_i q_i \log(q_i) \\
\iff\ & \sum_i \Big[(\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) - \alpha p_i \log(p_i) - (1 - \alpha) q_i \log(q_i)\Big] \leq 0.
\end{aligned}$$
Now we show that this last desired inequality holds.

First, we show that $g(u) = u\log(u)$ is a convex function of $u$ for $u \geq 0$ (with $g(0) = 0$). Since $g$ is twice differentiable for $u > 0$, it is enough to check $g''(u) \geq 0$ there; indeed $g''(u) = 1/u > 0$, and continuity of $g$ at $0$ extends convexity to $[0, \infty)$.
Hence, for every $i = 1, 2, \ldots$,
$$(\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) \leq \alpha p_i \log(p_i) + (1 - \alpha) q_i \log(q_i).$$
Summing over $i$, the last desired inequality follows. Thus (a) also holds for the Shannon entropy.

Now, each term in the left-hand sum of the last desired inequality is $\leq 0$ by convexity of $u\log(u)$; hence equality in concavity holds iff each term of that sum is zero.

Also, note that $g''(u) = 1/u > 0$ for $u > 0$, so $g$ is strictly convex; hence each term is zero iff $p_i = q_i$ (for $\alpha \in (0,1)$), and equality in part (a) for the entropy holds iff $\hat{p}_L = \hat{p}_R$.
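A small numerical check of the concavity of all three impurity measures (a minimal sketch; $p$, $q$ are random probability vectors and $\alpha$ a random mixing weight, drawn purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

gini = lambda p: 1.0 - np.sum(p ** 2)
entropy = lambda p: -np.sum(p * np.log(p))  # assumes strictly positive entries
misclass = lambda p: 1.0 - np.max(p)

ok = True
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    a = rng.uniform()
    for f in (gini, entropy, misclass):
        # f(a*p + (1-a)*q) >= a*f(p) + (1-a)*f(q), up to floating-point tolerance
        ok &= f(a * p + (1 - a) * q) >= a * f(p) + (1 - a) * f(q) - 1e-12
print(ok)  # True: all three measures behave concavely on these random draws
```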
Ans.(c) Observe that the instances in $t$ split into the disjoint sets of instances of $t_L$ and instances of $t_R$. So
$$\sum_{i \in t}(y_i - c)^2 = \sum_{i \in t_L}(y_i - c)^2 + \sum_{i \in t_R}(y_i - c)^2.$$
So, for all $c \in \mathbb{R}$,
$$\sum_{i \in t}(y_i - c)^2 \geq \min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2.$$

Hence,
$$\begin{aligned}
\min_c \sum_{i \in t}(y_i - c)^2 &\geq \min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2 \\
\implies \frac{1}{|t|}\min_c \sum_{i \in t}(y_i - c)^2 &\geq \frac{1}{|t|}\min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \frac{1}{|t|}\min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2 \\
\implies \frac{1}{|t|}\min_c \sum_{i \in t}(y_i - c)^2 &\geq \frac{p_L}{|t_L|}\min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \frac{p_R}{|t_R|}\min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2,
\end{aligned}$$
where the last line follows from the definitions $p_L = |t_L|/|t|$ and $p_R = |t_R|/|t|$, and this is what we wanted to prove.
(Hence, proved.)
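A quick numerical check of this decomposition (a minimal sketch; the responses are simulated, and the inner minimizers are just the node means):

```python
import numpy as np

rng = np.random.default_rng(2)
y_L = rng.normal(loc=0.0, size=12)  # responses falling into t_L
y_R = rng.normal(loc=2.0, size=8)   # responses falling into t_R
y = np.concatenate([y_L, y_R])      # responses at the parent node t

# min_c (1/|node|) * sum_i (y_i - c)^2 is attained at c = node mean.
node_mse = lambda v: np.mean((v - v.mean()) ** 2)
p_L, p_R = len(y_L) / len(y), len(y_R) / len(y)

decrease = node_mse(y) - p_L * node_mse(y_L) - p_R * node_mse(y_R)
print(decrease >= 0, decrease)  # True: the split weakly decreases squared-error impurity
```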

4. Bagged classifiers. [5 points]
Suppose you have training data $(X, y)$. Generate $B$ bootstrap samples $(X^{(1)}, y^{(1)}), \ldots, (X^{(B)}, y^{(B)})$ of the training data and train a classifier $\phi^{(i)}$ on each. The bagged classifier $\phi^{(bagged)}$ then predicts the majority class among the predictions of the $B$ classifiers $\phi^{(i)}$, i.e.
$$\phi^{(bagged)}(x) = \arg\max_a \sum_{i=1}^{B} 1_{\{\phi^{(i)}(x) = a\}}.$$

Consider the MNIST handwritten digits data and take $n_{train} = 1000$ images (and their labels) as your training data. Train bagged (with $B = 100$) and non-bagged versions of CART, $k$-NN and Fisher's LDA. Take a validation set of size $n_{validation} = 1000$ and select the best performer among these. Report the error of the selected classifier on a separate test set of size $n_{test} = 1000$.

Solution.
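A minimal sketch of the experiment (assuming scikit-learn and the OpenML copy of MNIST; the choice of $k = 5$ for $k$-NN, the random subsampling, and the hand-rolled majority-vote bagging are illustrative assumptions, not prescribed by the problem):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import fetch_openml
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Load MNIST and carve out disjoint train / validation / test subsets of size 1000 each.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)
idx = rng.permutation(len(X))
tr, va, te = idx[:1000], idx[1000:2000], idx[2000:3000]

def bagged_predict(base, X_tr, y_tr, X_new, B=100):
    """Majority vote over B copies of `base`, each fit on a bootstrap resample."""
    votes = np.zeros((len(X_new), 10), dtype=int)
    for _ in range(B):
        boot = rng.integers(0, len(X_tr), size=len(X_tr))
        preds = clone(base).fit(X_tr[boot], y_tr[boot]).predict(X_new)
        votes[np.arange(len(X_new)), preds] += 1
    return votes.argmax(axis=1)

bases = {
    "CART": DecisionTreeClassifier(random_state=0),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
}

val_err = {}
for name, base in bases.items():
    plain = clone(base).fit(X[tr], y[tr]).predict(X[va])
    bagged = bagged_predict(base, X[tr], y[tr], X[va])
    val_err[name] = np.mean(plain != y[va])
    val_err[name + " (bagged)"] = np.mean(bagged != y[va])

best = min(val_err, key=val_err.get)
print("validation errors:", val_err, "| selected:", best)
# The selected classifier would then be refit on the training data (with bagging if
# the bagged version was selected) and its error reported on the test indices `te`.
```

One would expect bagging to help mainly the unstable CART base learner, while the bagged $k$-NN and LDA classifiers typically change little relative to their non-bagged versions.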
