
Homework 4
Multivariate Analysis, S. S. Mukherjee, Fall 2018


Topics: Fisher’s LDA, CART, k-NN classifiers. Due on October 10, 2018
Name of student: Rohan Hore
Roll number: MB1812
Note: Attach your code with your submission.
1. Fisher’s LDA and least squares in a two-class problem. [5 points]
Recall that Fisher’s LDA for a two-class problem results in a discriminant function of the form

$$h_{LDA}(x) = a^\top x + b,$$

where $a = \hat{\Sigma}_{pooled}^{-1}(\bar{x}_1 - \bar{x}_2)$ (one classifies a new example $x$ as class 1 if $h_{LDA}(x) > 0$, and class 2 otherwise). Suppose
we encode the two categories by two distinct real numbers $\alpha$ and $\beta$, i.e. we dream up a response variable $y_i$ which takes
value α if xi is from class 1, and β otherwise. Now consider the following least squares problem:
$$(\hat{\theta}_0, \hat{\theta}) = \arg\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$

(a) Show that θ̂ ∝ a.


(b) Let $h_{LS}^{(\alpha,\beta)}(x) = \hat{\theta}_0 + \hat{\theta}^\top x$. How do the classification rules corresponding to $h_{LS}^{(\alpha,\beta)}(x)$ compare for different $(\alpha, \beta)$? When are the classification rules corresponding to $h_{LS}^{(\alpha,\beta)}(x)$ and $h_{LDA}(x)$ equivalent?

Solution.
Ans.(a) Consider the minimization problem
$$\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$
Taking the derivative of the objective with respect to $\theta_0$ and setting it to 0, we have
$$2\sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i) = 0.$$

Hence $\bar{y} = \theta_0 + \theta^\top \bar{x}$, and our minimization problem can be translated into
$$\min_{\theta} \sum_{i=1}^{N} \big((y_i - \bar{y}) - \theta^\top (x_i - \bar{x})\big)^2.$$

Hence, defining $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{y}_i = y_i - \bar{y}$, we consider the problem
$$\min_{\theta} \sum_{i=1}^{N} (\tilde{y}_i - \theta^\top \tilde{x}_i)^2.$$

Written in matrix form, with $\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N)^\top$ and $\tilde{X}$ the matrix whose $i$th row is $\tilde{x}_i^\top$, this is
$$\min_{\theta} \|\tilde{Y} - \tilde{X}\theta\|^2.$$

Hence, from standard multiple regression results, the optimizer is
$$\hat{\theta} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{Y}. \tag{1}$$

   
Now $\tilde{X} = \left(I_n - \frac{1}{n}J_n\right)X$, where $X$ is the matrix with $i$th row $x_i^\top$ and $J_n$ is the $n \times n$ all-ones matrix. Similarly $\tilde{Y} = \left(I_n - \frac{1}{n}J_n\right)Y$.
Also, taking $X_i$ to be the sub-matrix of rows belonging to class $i = 1, 2$, we can write
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} \alpha 1_{n_1} \\ \beta 1_{n_2} \end{pmatrix},$$

where $n_i$ is the number of observations in class $i$. Also $\bar{x} = (n_1\bar{x}_1 + n_2\bar{x}_2)/n$ and $\bar{y} = (n_1\alpha + n_2\beta)/n$, where $\bar{x}_i$ is the mean of class $i$. Now observe,
  
$$\begin{aligned}
\tilde{X}^\top \tilde{Y} &= X^\top \left(I_n - \tfrac{1}{n}J_n\right)^\top \left(I_n - \tfrac{1}{n}J_n\right) Y \\
&= X^\top \left(I_n - \tfrac{1}{n}J_n\right) Y \\
&= \begin{pmatrix} X_1^\top & X_2^\top \end{pmatrix} \begin{pmatrix} \tfrac{n_2(\alpha - \beta)}{n} 1_{n_1} \\[2pt] \tfrac{n_1(\beta - \alpha)}{n} 1_{n_2} \end{pmatrix} \\
&= \frac{n_1 n_2}{n}(\alpha - \beta)\bar{x}_1 + \frac{n_2 n_1}{n}(\beta - \alpha)\bar{x}_2 \\
&= \frac{n_1 n_2}{n}(\alpha - \beta)(\bar{x}_1 - \bar{x}_2).
\end{aligned}$$
Now observe
$$\tilde{X}^\top \tilde{X} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top = \sum_{i:\, y_i = \alpha} (x_i - \bar{x})(x_i - \bar{x})^\top + \sum_{i:\, y_i = \beta} (x_i - \bar{x})(x_i - \bar{x})^\top.$$

Further,
$$\begin{aligned}
\sum_{i:\, y_i = \alpha} (x_i - \bar{x})(x_i - \bar{x})^\top &= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1 + \bar{x}_1 - \bar{x})(x_i - \bar{x}_1 + \bar{x}_1 - \bar{x})^\top \\
&= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + n_1 (\bar{x}_1 - \bar{x})(\bar{x}_1 - \bar{x})^\top \\
&= \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + \frac{n_1 n_2^2}{n^2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top,
\end{aligned}$$
where the cross terms vanish because $\sum_{i:\, y_i = \alpha}(x_i - \bar{x}_1) = 0$, and the last line uses $\bar{x}_1 - \bar{x} = \frac{n_2}{n}(\bar{x}_1 - \bar{x}_2)$.

Similarly, we can do this for the other sum. Combining, we get
$$\tilde{X}^\top \tilde{X} = \sum_{i:\, y_i = \alpha} (x_i - \bar{x}_1)(x_i - \bar{x}_1)^\top + \sum_{i:\, y_i = \beta} (x_i - \bar{x}_2)(x_i - \bar{x}_2)^\top + \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^\top = (n-2) S_u + \frac{n_1 n_2}{n} d d^\top,$$
where $S_u$ is the pooled (within-class) sample covariance matrix and $d = \bar{x}_1 - \bar{x}_2$.

Now, we recall the Sherman-Morrison formula: if $A$ is an invertible square matrix and $u$, $v$ are column vectors, then, provided $1 + v^\top A^{-1} u \neq 0$,
$$(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}.$$
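A quick numerical sanity check of this rank-one update identity (a minimal NumPy sketch; the matrix $A$ and vectors $u$, $v$ are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)) + 5 * np.eye(5)  # a well-conditioned invertible matrix
u = rng.normal(size=(5, 1))
v = rng.normal(size=(5, 1))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / (1 + v.T @ Ainv @ u)
print(np.allclose(lhs, rhs))  # True: the rank-one update formula holds
```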
Hence, applying the Sherman-Morrison formula with $A = (n-2)S_u$, $u = \frac{n_1 n_2}{n} d$ and $v = d$, we have
$$(\tilde{X}^\top \tilde{X})^{-1} = (n-2)^{-1} S_u^{-1} - \frac{n_1 n_2 (n-2)^{-1} S_u^{-1} d d^\top (n-2)^{-1} S_u^{-1}}{n + n_1 n_2 d^\top (n-2)^{-1} S_u^{-1} d}$$
$$\implies \hat{\theta} = \frac{n_1 n_2 (\alpha - \beta)}{n(n-2)} S_u^{-1} d - \frac{n_1^2 n_2^2 (\alpha - \beta)}{n(n-2)^2} \cdot \frac{S_u^{-1} d \,(d^\top S_u^{-1} d)}{n + n_1 n_2 d^\top (n-2)^{-1} S_u^{-1} d}$$
$$\implies \hat{\theta} \propto S_u^{-1} d.$$

Now, noting that by definition $S_u = \hat{\Sigma}_{pooled}$ and $d = \bar{x}_1 - \bar{x}_2$, we conclude that $\hat{\theta} \propto a$.
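A minimal numerical sketch of this conclusion (assuming NumPy; the simulated two-class data and the encoding $\alpha = 1$, $\beta = -1$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 40, 60, 3
X1 = rng.normal(loc=0.0, size=(n1, p))  # class 1 sample
X2 = rng.normal(loc=1.0, size=(n2, p))  # class 2 sample
X = np.vstack([X1, X2])
y = np.concatenate([np.full(n1, 1.0), np.full(n2, -1.0)])  # alpha = 1, beta = -1

# Least-squares fit with an intercept; theta_hat holds the slope coefficients.
Z = np.column_stack([np.ones(len(X)), X])
theta_hat = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

# Fisher/LDA direction: a = pooled-covariance^{-1} (xbar1 - xbar2).
S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
            (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
a = np.linalg.solve(S_pooled, X1.mean(axis=0) - X2.mean(axis=0))

print(theta_hat / a)  # entrywise ratio is (numerically) constant, i.e. theta_hat is proportional to a
```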


2. The smallest optimally pruned subtrees corresponding to different α’s. [5 points]
Let $T$ be a tree. As defined in class, $T(\alpha)$ is the smallest pruned subtree of $T$ optimizing $R_\alpha(T') = R(T') + \alpha|\tilde{T}'|$ over all pruned subtrees $T'$ of $T$. Show that if $0 \leq \alpha_1 \leq \alpha_2$, then $T(\alpha_2) \preceq T(\alpha_1)$.

Solution. First, recall from class that
$$T(\alpha) = \begin{cases} \{t\} \cup T_L(\alpha) \cup T_R(\alpha) & \text{if } R_\alpha(t) > R_\alpha(T_L(\alpha)) + R_\alpha(T_R(\alpha)), \\ \{t\} & \text{otherwise,} \end{cases}$$
where $t$ is the root node and $T_L$, $T_R$ are the left and right branches of $T$.
We will show $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \leq \alpha_1 \leq \alpha_2$ by induction on the height of sub-trees of $T$. Let $h$ denote the height of a tree.
First consider the base case $h = 1$: the tree consists of a root node $t$ with two leaf children $t_L$ and $t_R$.
The only pruned sub-trees of $T$ are $\{t\}$ and $T$ itself. We need to show that $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \leq \alpha_1 \leq \alpha_2$.
Suppose the result is not true. Then, since there are only two candidate sub-trees, we must have $T(\alpha_2) = T$ and $T(\alpha_1) = \{t\}$. Also observe that here $T_L(\alpha) = \{t_L\}$ and $T_R(\alpha) = \{t_R\}$.

Now, from the optimality property defining $T(\alpha)$, we have
$$R_{\alpha_2}(t) > R_{\alpha_2}(t_L) + R_{\alpha_2}(t_R) \quad \text{and} \quad R_{\alpha_1}(t) \leq R_{\alpha_1}(t_L) + R_{\alpha_1}(t_R).$$
Now $t$, $t_L$ and $t_R$ are individually single leaf nodes, so by the definition of $R_\alpha(\cdot)$ the above conditions translate into
$$R(t) > R(t_L) + R(t_R) + \alpha_2 \quad \text{and} \quad R(t) \leq R(t_L) + R(t_R) + \alpha_1.$$
Since $\alpha_2 \geq \alpha_1$, these two conditions cannot hold together, a contradiction.

Thus for $h = 1$, $T(\alpha_2) \preceq T(\alpha_1)$.

Now suppose the result holds for all trees of height at most $n$; we will show it holds for height $n + 1$.
First observe the following chain of implications (whose first inequality is exactly the condition for $T(\alpha_1) = \{t\}$):

$$R_{\alpha_1}(t) \leq R_{\alpha_1}(T_L(\alpha_1)) + R_{\alpha_1}(T_R(\alpha_1))$$
$$\implies R_{\alpha_1}(t) \leq R_{\alpha_1}(T_L(\alpha_2)) + R_{\alpha_1}(T_R(\alpha_2))$$
(since $T_L(\alpha_1)$ minimizes $R_{\alpha_1}$ over all pruned subtrees of $T_L$, of which $T_L(\alpha_2)$ is one, and similarly for $T_R$)
$$\implies R(t) - R(T_L(\alpha_2)) - R(T_R(\alpha_2)) \leq \alpha_1 \left(|\widetilde{T_L(\alpha_2)}| + |\widetilde{T_R(\alpha_2)}| - 1\right)$$
$$\implies R(t) - R(T_L(\alpha_2)) - R(T_R(\alpha_2)) \leq \alpha_2 \left(|\widetilde{T_L(\alpha_2)}| + |\widetilde{T_R(\alpha_2)}| - 1\right) \quad \text{since } \alpha_2 \geq \alpha_1$$
$$\implies R_{\alpha_2}(t) \leq R_{\alpha_2}(T_L(\alpha_2)) + R_{\alpha_2}(T_R(\alpha_2)).$$

Hence, from the definition of the optimal $T(\alpha)$, if $T(\alpha_1) = \{t\}$ then also $T(\alpha_2) = \{t\}$, and trivially $T(\alpha_2) \preceq T(\alpha_1)$.

So let's consider the case $T(\alpha_1) \neq \{t\}$.

Now,if Rα (t) > Rα (TL )+Rα (TR ) doesn’t hold for α2 ,then we have T (α2 ) = t which is always a sub-tree of T (α1 ).Hence,T (α2 ) 4
T (α1 ) holds in this case.

Now, if $R_{\alpha_2}(t) > R_{\alpha_2}(T_L(\alpha_2)) + R_{\alpha_2}(T_R(\alpha_2))$ does hold (and the corresponding condition holds for $\alpha_1$ since $T(\alpha_1) \neq \{t\}$), then both $T(\alpha_1)$ and $T(\alpha_2)$ have non-trivial branches $T_L(\alpha) \preceq T_L$ and $T_R(\alpha) \preceq T_R$, for both values of $\alpha$.
Now $T_L$ and $T_R$ are trees of height at most $n$, hence by the induction hypothesis we have
$$T_L(\alpha_2) \preceq T_L(\alpha_1) \quad \text{and} \quad T_R(\alpha_2) \preceq T_R(\alpha_1),$$


and since $T(\alpha_i) = \{t\} \cup T_L(\alpha_i) \cup T_R(\alpha_i)$ in this case, clearly $T(\alpha_2) \preceq T(\alpha_1)$.

Hence $T(\alpha_2) \preceq T(\alpha_1)$ for all $0 \leq \alpha_1 \leq \alpha_2$.
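The nesting property can also be observed empirically via scikit-learn's minimal cost-complexity pruning. A minimal sketch (the dataset and the use of node counts as a proxy for tree size are illustrative assumptions): fitting the optimally pruned tree at each $\alpha$ on the pruning path gives tree sizes that are non-increasing in $\alpha$, a necessary consequence of $T(\alpha_2) \preceq T(\alpha_1)$.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alpha values at which the optimally pruned subtree changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

sizes = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    sizes.append(pruned.tree_.node_count)

# Larger alpha never yields a larger optimally pruned tree.
print(all(s2 <= s1 for s1, s2 in zip(sizes, sizes[1:])))  # True
```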

3. Splits decrease impurity of nodes. [5 points]
Recall that a function $f : S \subseteq \mathbb{R}^d \to \mathbb{R}$ is called concave if, for any $x, y \in S$ and $\alpha \in [0, 1]$, one has
$$f(\alpha x + (1 - \alpha) y) \geq \alpha f(x) + (1 - \alpha) f(y).$$

Suppose that we have an impurity measure $Imp : \Delta_K \to \mathbb{R}_+$ which is concave.


(a) Show that a split of $t$ into $(t_L, t_R)$ always (weakly) decreases the impurity of node $t$, i.e.
$$Imp(\hat{p}) - p_L\, Imp(\hat{p}_L) - p_R\, Imp(\hat{p}_R) \geq 0.$$
Here $\hat{p}$, $\hat{p}_L$ and $\hat{p}_R$ are the label proportion vectors in the nodes $t$, $t_L$, and $t_R$ respectively, and $p_L = |t_L|/|t|$, $p_R = |t_R|/|t|$ are the fractions of the observations that fall into the two child nodes.
(b) Show that the three impurity measures defined in class, namely the Gini index, the entropy, and the misclassification error, are all concave, and thus conclude (a) for each of these measures. When do you have equality?
(c) Establish the same result for a regression tree and squared error loss, i.e. show that
$$\min_c \frac{1}{|t|}\sum_{i \in t}(y_i - c)^2 - p_L \min_{c_L} \frac{1}{|t_L|}\sum_{i \in t_L}(y_i - c_L)^2 - p_R \min_{c_R} \frac{1}{|t_R|}\sum_{i \in t_R}(y_i - c_R)^2 \geq 0.$$

Solution.
Ans.(a) We have $|t|$ training instances at node $t$. Define $n_i$ as the number of instances at $t$ falling in class $i$, so that $\sum_{i=1}^{k} n_i = |t|$.

Similarly, we have $|t_L|$ and $|t_R|$ training instances at nodes $t_L$ and $t_R$. Define $n_{iL}$ and $n_{iR}$ as the number of instances at $t_L$ and $t_R$ falling in class $i$, so that $\sum_{i=1}^{k} n_{iL} = |t_L|$ and $\sum_{i=1}^{k} n_{iR} = |t_R|$.

It is easy to observe that $n_{iL} + n_{iR} = n_i$ for all $i = 1, 2, \ldots, k$, and that $|t_L| + |t_R| = |t|$. Now we observe


 
$$\hat{p} = \left(\frac{n_1}{|t|}, \frac{n_2}{|t|}, \ldots, \frac{n_k}{|t|}\right), \quad \hat{p}_L = \left(\frac{n_{1L}}{|t_L|}, \frac{n_{2L}}{|t_L|}, \ldots, \frac{n_{kL}}{|t_L|}\right), \quad \hat{p}_R = \left(\frac{n_{1R}}{|t_R|}, \frac{n_{2R}}{|t_R|}, \ldots, \frac{n_{kR}}{|t_R|}\right).$$

So, from the above expressions we can infer that
$$\hat{p} = p_L \hat{p}_L + p_R \hat{p}_R,$$
where $p_L$ and $p_R$ are as defined in the problem.


Since $Imp$ is a concave function, taking $x = \hat{p}_L$, $y = \hat{p}_R$ and $\alpha = p_L$ (so that $1 - \alpha = p_R$) in the definition of concavity gives
$$Imp(\hat{p}) = Imp(p_L \hat{p}_L + p_R \hat{p}_R) \geq p_L\, Imp(\hat{p}_L) + p_R\, Imp(\hat{p}_R).$$
(Hence, proved.)
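A quick numerical illustration of this inequality for the Gini index (a minimal sketch; the class counts at $t_L$ and $t_R$ are made up for illustration):

```python
import numpy as np

def gini(p):
    """Gini impurity of a proportion vector p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical per-class counts in the two children after a split of node t.
n_L = np.array([30, 5, 5])
n_R = np.array([10, 25, 25])
n = n_L + n_R  # per-class counts in the parent node t

p_L, p_R = n_L.sum() / n.sum(), n_R.sum() / n.sum()
decrease = gini(n / n.sum()) - p_L * gini(n_L / n_L.sum()) - p_R * gini(n_R / n_R.sum())
print(decrease >= 0, decrease)  # True, strictly positive here since p_hat_L != p_hat_R
```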
Ans.(b) Here we need to show that the three impurity measures are concave.
First we show that the Gini index is concave. The Gini index is
$$f(p) = Imp(p) = \sum_{i \neq i'} p_i p_{i'} = \sum_i p_i(1 - p_i) = 1 - \sum_i p_i^2.$$

Now, to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
we need to show
$$\begin{aligned}
& 1 - \sum_i (\alpha p_i + (1 - \alpha) q_i)^2 \geq \alpha\Big(1 - \sum_i p_i^2\Big) + (1 - \alpha)\Big(1 - \sum_i q_i^2\Big) \\
\iff\ & \alpha \sum_i p_i^2 + (1 - \alpha)\sum_i q_i^2 \geq \sum_i (\alpha p_i + (1 - \alpha) q_i)^2 \\
\iff\ & \alpha \sum_i p_i^2 + (1 - \alpha)\sum_i q_i^2 \geq \alpha^2 \sum_i p_i^2 + (1 - \alpha)^2 \sum_i q_i^2 + 2\alpha(1 - \alpha)\sum_i p_i q_i \\
\iff\ & \alpha(1 - \alpha)\Big(\sum_i p_i^2 + \sum_i q_i^2\Big) \geq 2\alpha(1 - \alpha)\sum_i p_i q_i \\
\iff\ & \sum_i p_i^2 + \sum_i q_i^2 \geq 2\sum_i p_i q_i \\
\iff\ & \sum_i (p_i - q_i)^2 \geq 0.
\end{aligned}$$
(The step dividing by $\alpha(1-\alpha)$ assumes $\alpha \in (0,1)$; the cases $\alpha \in \{0, 1\}$ are trivial.)

The last inequality is obviously true, hence the concavity of the Gini index holds, and thus (a) also holds for the Gini index.
Since all steps in the proof of the concavity inequality are 'iff' statements (for $\alpha \in (0,1)$), equality in the concavity inequality holds iff $p_i = q_i$ for all $i = 1, 2, \ldots, k$.

Hence, equality in part (a) for the Gini index holds iff $\hat{p}_L = \hat{p}_R$.

Next we show the concavity of the misclassification error, which is
$$f(p) = Imp(p) = 1 - \max_i p_i.$$

We need to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
i.e. we need to show
$$\begin{aligned}
& 1 - \max_i(\alpha p_i + (1 - \alpha) q_i) \geq \alpha\big(1 - \max_i p_i\big) + (1 - \alpha)\big(1 - \max_j q_j\big) \\
\iff\ & 1 - \max_i(\alpha p_i + (1 - \alpha) q_i) \geq 1 - \alpha \max_i p_i - (1 - \alpha)\max_j q_j \\
\iff\ & \max_i \alpha p_i + \max_j (1 - \alpha) q_j \geq \max_i(\alpha p_i + (1 - \alpha) q_i).
\end{aligned}$$

Suppose $\arg\max_i(\alpha p_i + (1 - \alpha) q_i) = i^*$; then observe
$$\max_i \alpha p_i + \max_i (1 - \alpha) q_i \geq \alpha p_{i^*} + (1 - \alpha) q_{i^*} = \max_i(\alpha p_i + (1 - \alpha) q_i).$$
So the last desired inequality holds, and thus the concavity of the misclassification error holds. Hence (a) also holds for the misclassification error.
Now, again all the steps of the proof of the concavity inequality are 'iff' statements.
We claim that equality in concavity holds iff
$$\arg\max_i p_i = \arg\max_j q_j = i^* \ \text{(say)}.$$
We prove this claim below.

Proof of 'if': here
$$\max_i \alpha p_i = \alpha p_{i^*} \quad \text{and} \quad \max_j (1 - \alpha) q_j = (1 - \alpha) q_{i^*}.$$
Hence, for every $i$, $\alpha p_i + (1 - \alpha) q_i \leq \alpha p_{i^*} + (1 - \alpha) q_{i^*}$, so $\max_i(\alpha p_i + (1 - \alpha) q_i) = \alpha p_{i^*} + (1 - \alpha) q_{i^*} = \max_i \alpha p_i + \max_j (1 - \alpha) q_j$, i.e. equality holds.
Proof of 'only if':

We show the contrapositive, i.e. if $\arg\max_i p_i \neq \arg\max_j q_j$, then equality is not attained.
Suppose
$$\max_i \alpha p_i = \alpha p_{i^*} \quad \text{and} \quad \max_j (1 - \alpha) q_j = (1 - \alpha) q_{j^*} \quad \text{with } i^* \neq j^*.$$
Then, for every index $i$, at least one of $\alpha p_i < \alpha p_{i^*}$ and $(1 - \alpha) q_i < (1 - \alpha) q_{j^*}$ holds (taking the argmaxes to be unique and $\alpha \in (0,1)$), so
$$\alpha p_i + (1 - \alpha) q_i < \max_i \alpha p_i + \max_j (1 - \alpha) q_j \implies \max_i(\alpha p_i + (1 - \alpha) q_i) < \max_i \alpha p_i + \max_j (1 - \alpha) q_j.$$
So, equality in concavity holds iff $\arg\max_i p_i = \arg\max_j q_j$.

Finally, we show the concavity of the Shannon entropy, which is
$$f(p) = Imp(p) = -\sum_i p_i \log(p_i).$$
We need to show
$$f(\alpha p + (1 - \alpha)q) \geq \alpha f(p) + (1 - \alpha) f(q),$$
i.e. we need to show
$$\begin{aligned}
& -\sum_i (\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) \geq -\alpha \sum_i p_i \log(p_i) - (1 - \alpha)\sum_i q_i \log(q_i) \\
\iff\ & \sum_i \Big[(\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) - \alpha p_i \log(p_i) - (1 - \alpha) q_i \log(q_i)\Big] \leq 0.
\end{aligned}$$
Now we show that this last desired inequality holds.

First, we show that $g(u) = u\log(u)$ is a convex function of $u$ for $u \geq 0$ (with $g(0) = 0$). Since $g$ is twice differentiable for $u > 0$, it is enough to check $g''(u) \geq 0$ there; indeed $g''(u) = 1/u > 0$, and continuity of $g$ at $0$ extends convexity to $[0, \infty)$.
Hence, for every $i = 1, 2, \ldots$,
$$(\alpha p_i + (1 - \alpha) q_i)\log(\alpha p_i + (1 - \alpha) q_i) \leq \alpha p_i \log(p_i) + (1 - \alpha) q_i \log(q_i).$$
Summing over $i$, the last desired inequality follows. Thus (a) also holds for the Shannon entropy.

Now, each term in the left-hand sum of the last desired inequality is $\leq 0$ by convexity of $u\log(u)$; hence equality in concavity holds iff each term of that sum is zero.

Also, note that $g''(u) = 1/u > 0$ for $u > 0$, so $g$ is strictly convex; hence each term is zero iff $p_i = q_i$ (for $\alpha \in (0,1)$), and equality in part (a) for the entropy holds iff $\hat{p}_L = \hat{p}_R$.
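A small numerical check of the concavity of all three impurity measures (a minimal sketch; $p$, $q$ are random probability vectors and $\alpha$ a random mixing weight, drawn purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

gini = lambda p: 1.0 - np.sum(p ** 2)
entropy = lambda p: -np.sum(p * np.log(p))  # assumes strictly positive entries
misclass = lambda p: 1.0 - np.max(p)

ok = True
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    a = rng.uniform()
    for f in (gini, entropy, misclass):
        # f(a*p + (1-a)*q) >= a*f(p) + (1-a)*f(q), up to floating-point tolerance
        ok &= f(a * p + (1 - a) * q) >= a * f(p) + (1 - a) * f(q) - 1e-12
print(ok)  # True: all three measures behave concavely on these random draws
```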
Ans.(c) Observe that the instances in $t$ split into the disjoint sets of instances of $t_L$ and instances of $t_R$. So
$$\sum_{i \in t}(y_i - c)^2 = \sum_{i \in t_L}(y_i - c)^2 + \sum_{i \in t_R}(y_i - c)^2.$$
So, for all $c \in \mathbb{R}$,
$$\sum_{i \in t}(y_i - c)^2 \geq \min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2.$$

Hence,
$$\begin{aligned}
\min_c \sum_{i \in t}(y_i - c)^2 &\geq \min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2 \\
\implies \frac{1}{|t|}\min_c \sum_{i \in t}(y_i - c)^2 &\geq \frac{1}{|t|}\min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \frac{1}{|t|}\min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2 \\
\implies \frac{1}{|t|}\min_c \sum_{i \in t}(y_i - c)^2 &\geq \frac{p_L}{|t_L|}\min_{c_L}\sum_{i \in t_L}(y_i - c_L)^2 + \frac{p_R}{|t_R|}\min_{c_R}\sum_{i \in t_R}(y_i - c_R)^2,
\end{aligned}$$
where the last line follows from the definitions $p_L = |t_L|/|t|$ and $p_R = |t_R|/|t|$, and this is what we wanted to prove.
(Hence, proved.)
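A quick numerical check of this decomposition (a minimal sketch; the responses are simulated, and the inner minimizers are just the node means):

```python
import numpy as np

rng = np.random.default_rng(2)
y_L = rng.normal(loc=0.0, size=12)  # responses falling into t_L
y_R = rng.normal(loc=2.0, size=8)   # responses falling into t_R
y = np.concatenate([y_L, y_R])      # responses at the parent node t

# min_c (1/|node|) * sum_i (y_i - c)^2 is attained at c = node mean.
node_mse = lambda v: np.mean((v - v.mean()) ** 2)
p_L, p_R = len(y_L) / len(y), len(y_R) / len(y)

decrease = node_mse(y) - p_L * node_mse(y_L) - p_R * node_mse(y_R)
print(decrease >= 0, decrease)  # True: the split weakly decreases squared-error impurity
```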

4. Bagged classifiers. [5 points]
Suppose you have training data $(X, y)$. Generate $B$ bootstrap samples $(X^{(1)}, y^{(1)}), \ldots, (X^{(B)}, y^{(B)})$ of the training data and train a classifier $\phi^{(i)}$ on each. The bagged classifier $\phi^{(bagged)}$ then predicts the majority class among the predictions of the $B$ classifiers $\phi^{(i)}$, i.e.
$$\phi^{(bagged)}(x) = \arg\max_a \sum_{i=1}^{B} 1_{\{\phi^{(i)}(x) = a\}}.$$

Consider the MNIST handwritten digits data and take $n_{train} = 1000$ images (and their labels) as your training data. Train bagged (with $B = 100$) and non-bagged versions of CART, $k$-NN and Fisher's LDA. Take a validation set of size $n_{validation} = 1000$ and select the best performer among these. Report the error of the selected classifier on a separate test set of size $n_{test} = 1000$.

Solution.
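A minimal sketch of the experiment (assuming scikit-learn and the OpenML copy of MNIST; the choice of $k = 5$ for $k$-NN, the random subsampling, and the hand-rolled majority-vote bagging are illustrative assumptions, not prescribed by the problem):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import fetch_openml
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Load MNIST and carve out disjoint train / validation / test subsets of size 1000 each.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)
idx = rng.permutation(len(X))
tr, va, te = idx[:1000], idx[1000:2000], idx[2000:3000]

def bagged_predict(base, X_tr, y_tr, X_new, B=100):
    """Majority vote over B copies of `base`, each fit on a bootstrap resample."""
    votes = np.zeros((len(X_new), 10), dtype=int)
    for _ in range(B):
        boot = rng.integers(0, len(X_tr), size=len(X_tr))
        preds = clone(base).fit(X_tr[boot], y_tr[boot]).predict(X_new)
        votes[np.arange(len(X_new)), preds] += 1
    return votes.argmax(axis=1)

bases = {
    "CART": DecisionTreeClassifier(random_state=0),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
}

val_err = {}
for name, base in bases.items():
    plain = clone(base).fit(X[tr], y[tr]).predict(X[va])
    bagged = bagged_predict(base, X[tr], y[tr], X[va])
    val_err[name] = np.mean(plain != y[va])
    val_err[name + " (bagged)"] = np.mean(bagged != y[va])

best = min(val_err, key=val_err.get)
print("validation errors:", val_err, "| selected:", best)
# The selected classifier would then be refit on the training data (with bagging if
# the bagged version was selected) and its error reported on the test indices `te`.
```

One would expect bagging to help mainly the unstable CART base learner, while the bagged $k$-NN and LDA classifiers typically change little relative to their non-bagged versions.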
