where $a = \hat\Sigma_{\mathrm{pooled}}^{-1}(\bar x_1 - \bar x_2)$ (one classifies a new example $x$ as class 1 if $h_{\mathrm{LDA}}(x) > 0$, and class 2 otherwise). Suppose we encode the two categories by two distinct real numbers $\alpha$ and $\beta$, i.e. we dream up a response variable $y_i$ which takes value $\alpha$ if $x_i$ is from class 1, and $\beta$ otherwise. Now consider the following least squares problem:
$$(\hat\theta_0, \hat\theta) = \arg\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$
Solution.
Ans. (a) Consider the minimization problem
$$\min_{\theta_0, \theta} \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i)^2.$$
Taking the derivative of the objective with respect to $\theta_0$ and setting it to zero, we have
$$-2 \sum_{i=1}^{N} (y_i - \theta_0 - \theta^\top x_i) = 0.$$
Hence $\bar y = \theta_0 + \theta^\top \bar x$, and our minimization problem can be translated to the following problem:
$$\min_{\theta} \sum_{i=1}^{N} \big((y_i - \bar y) - \theta^\top (x_i - \bar x)\big)^2,$$
written in matrix form, with $\tilde Y = (\tilde y_1, \tilde y_2, \dots, \tilde y_N)^\top$ (where $\tilde y_i = y_i - \bar y$) and $\tilde X$ the matrix whose $i$th row is $\tilde x_i^\top = (x_i - \bar x)^\top$, as $\min_\theta \|\tilde Y - \tilde X\theta\|^2$, whose solution satisfies the normal equations $(\tilde X^\top \tilde X)\hat\theta = \tilde X^\top \tilde Y$.
where $n_i$ is the number of observations in class $i$. Also, $\bar x = (n_1 \bar x_1 + n_2 \bar x_2)/n$ and $\bar y = (n_1 \alpha + n_2 \beta)/n$, where $\bar x_i$ is the mean of class $i$. Now observe,
$$\begin{aligned}
\tilde X^\top \tilde Y &= X^\top \Big(I_n - \tfrac{1}{n}J_n\Big)^\top \Big(I_n - \tfrac{1}{n}J_n\Big) Y \\
&= X^\top \Big(I_n - \tfrac{1}{n}J_n\Big) Y \\
&= \begin{pmatrix} X_1^\top & X_2^\top \end{pmatrix}
   \begin{pmatrix} \tfrac{n_2(\alpha-\beta)}{n}\,\mathbf 1_{n_1} \\ \tfrac{n_1(\beta-\alpha)}{n}\,\mathbf 1_{n_2} \end{pmatrix} \\
&= \frac{n_1 n_2}{n}(\alpha-\beta)\,\bar x_1 + \frac{n_2 n_1}{n}(\beta-\alpha)\,\bar x_2 \\
&= \frac{n_1 n_2}{n}(\alpha-\beta)(\bar x_1 - \bar x_2),
\end{aligned}$$
where $J_n$ is the $n \times n$ all-ones matrix, $\mathbf 1_m$ is the all-ones vector of length $m$, $X_1$ and $X_2$ stack the rows of the two classes, and the second equality uses the fact that the centering matrix $I_n - \tfrac{1}{n}J_n$ is symmetric and idempotent.
Now observe
$$\tilde X^\top \tilde X = \sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^\top = \sum_{i:\,y_i=\alpha} (x_i - \bar x)(x_i - \bar x)^\top + \sum_{i:\,y_i=\beta} (x_i - \bar x)(x_i - \bar x)^\top.$$
Further,
$$\begin{aligned}
\sum_{i:\,y_i=\alpha} (x_i - \bar x)(x_i - \bar x)^\top
&= \sum_{i:\,y_i=\alpha} (x_i - \bar x_1 + \bar x_1 - \bar x)(x_i - \bar x_1 + \bar x_1 - \bar x)^\top \\
&= \sum_{i:\,y_i=\alpha} (x_i - \bar x_1)(x_i - \bar x_1)^\top + n_1 (\bar x_1 - \bar x)(\bar x_1 - \bar x)^\top \\
&= \sum_{i:\,y_i=\alpha} (x_i - \bar x_1)(x_i - \bar x_1)^\top + \frac{n_1 n_2^2}{n^2} (\bar x_1 - \bar x_2)(\bar x_1 - \bar x_2)^\top,
\end{aligned}$$
where the cross terms vanish because $\sum_{i:\,y_i=\alpha}(x_i - \bar x_1) = 0$, and the last step uses $\bar x_1 - \bar x = \tfrac{n_2}{n}(\bar x_1 - \bar x_2)$.
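The identities above are easy to confirm numerically. The following sketch is our own sanity check, not part of the assignment: all variable names and the random data are ours. It verifies the expression for $\tilde X^\top \tilde Y$, and also the standard consequence that the least squares direction $\hat\theta$ is proportional to the LDA direction $\hat\Sigma_{\mathrm{pooled}}^{-1}(\bar x_1 - \bar x_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d = 30, 50, 4
n = n1 + n2
alpha, beta = 1.0, -2.0                    # arbitrary distinct class codes

X = rng.normal(size=(n, d))
y = np.concatenate([np.full(n1, alpha), np.full(n2, beta)])

Xc = X - X.mean(axis=0)                    # X-tilde: centered rows
yc = y - y.mean()                          # Y-tilde
xbar1, xbar2 = X[:n1].mean(axis=0), X[n1:].mean(axis=0)

# Check: X-tilde^T Y-tilde = (n1 n2 / n)(alpha - beta)(xbar1 - xbar2)
lhs = Xc.T @ yc
rhs = (n1 * n2 / n) * (alpha - beta) * (xbar1 - xbar2)
assert np.allclose(lhs, rhs)

# Least squares direction vs. the LDA direction a = S_pooled^{-1}(xbar1 - xbar2)
theta = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
S1 = (X[:n1] - xbar1).T @ (X[:n1] - xbar1)
S2 = (X[n1:] - xbar2).T @ (X[n1:] - xbar2)
a = np.linalg.solve((S1 + S2) / (n - 2), xbar1 - xbar2)
cos = theta @ a / (np.linalg.norm(theta) * np.linalg.norm(a))
assert abs(cos - 1.0) < 1e-6               # same direction
```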
Solution. First, recall from class that
$$T(\alpha) = \begin{cases} t \cup T_L(\alpha) \cup T_R(\alpha) & \text{if } R_\alpha(t) > R_\alpha(T_L(\alpha)) + R_\alpha(T_R(\alpha)), \\ t & \text{otherwise.} \end{cases}$$
We will show $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \le \alpha_1 \le \alpha_2$ by induction on the height of the subtrees of $T$. Let $h$ denote the height of a tree.

Base case ($h = 1$): the tree consists of a root node $t$ whose two children $t_L$ and $t_R$ are leaves, so $T_L(\alpha) = t_L$ and $T_R(\alpha) = t_R$, and the only subtrees of $T$ are $t$ and $T$ itself. We need to show $T(\alpha_2) \preceq T(\alpha_1)$ for $0 \le \alpha_1 \le \alpha_2$. Suppose not; then, since there are only two candidate subtrees, we must have $T(\alpha_2) = T$ and $T(\alpha_1) = t$. But observe from the definition of the optimal $T(\alpha)$ that if $T(\alpha_1) = t$, then also $T(\alpha_2) = t$, since the condition for keeping the branch only becomes harder to satisfy as $\alpha$ increases, a contradiction. Hence $T(\alpha_2) \preceq T(\alpha_1)$.

Inductive step: suppose the result holds for $h = n$; we now show it holds for $h = n + 1$. If $R_\alpha(t) > R_\alpha(T_L) + R_\alpha(T_R)$ does not hold for $\alpha_2$, then $T(\alpha_2) = t$, which is always a subtree of $T(\alpha_1)$; hence $T(\alpha_2) \preceq T(\alpha_1)$ holds in this case. If instead $R_\alpha(t) > R_\alpha(T_L) + R_\alpha(T_R)$ holds for $\alpha_2$ (and also for $\alpha_1$), then both $T(\alpha_1)$ and $T(\alpha_2)$ keep non-trivial branches, given by $T_L(\alpha) \preceq T_L$ and $T_R(\alpha) \preceq T_R$ for both values of $\alpha$. Now $T_L$ and $T_R$ are trees of height $n$, so by the induction hypothesis we have $T_L(\alpha_2) \preceq T_L(\alpha_1)$ and $T_R(\alpha_2) \preceq T_R(\alpha_1)$, and hence $T(\alpha_2) \preceq T(\alpha_1)$.
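The nesting can be exercised on a toy tree. Below is a minimal sketch, assuming the usual cost-complexity risk $R_\alpha(T') = R(T') + \alpha \cdot |\mathrm{leaves}(T')|$; the tuple encoding of trees and the node error values are our own made-up example, not part of the assignment. It applies the pruning rule recursively and checks that the pruned trees are nested as $\alpha$ grows.

```python
def prune(tree, alpha):
    """tree is ('leaf', id, R) or ('node', id, R, left, right).
    Returns (pruned_tree, R_alpha, set_of_surviving_node_ids)."""
    if tree[0] == 'leaf':
        return tree, tree[2] + alpha, {tree[1]}
    _, nid, R, left, right = tree
    pl, RL, idsL = prune(left, alpha)
    pr, RR, idsR = prune(right, alpha)
    if R + alpha <= RL + RR:              # collapsing t to a leaf is no worse
        return ('leaf', nid, R), R + alpha, {nid}
    return ('node', nid, R, pl, pr), RL + RR, {nid} | idsL | idsR

# A height-2 tree: root t, leaf tL, and an internal node tR with two leaves.
T = ('node', 't', 20.0,
     ('leaf', 'tL', 2.0),
     ('node', 'tR', 5.0, ('leaf', 'tRL', 1.0), ('leaf', 'tRR', 1.0)))

ids = [prune(T, a)[2] for a in (0.1, 5.0, 15.0)]
# Larger alpha prunes more, and the pruned trees are nested (subset of nodes):
assert ids[2] <= ids[1] <= ids[0]
```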
3. Splits decrease impurity of nodes. [5 points]
Recall that a function $f : S \subseteq \mathbb{R}^d \to \mathbb{R}$ is called concave if, for any $x, y \in S$ and $\alpha \in [0, 1]$, one has $f(\alpha x + (1-\alpha)y) \ge \alpha f(x) + (1-\alpha) f(y)$.
Here p̂, p̂L and p̂R are the label proportion vectors in the nodes t, tL , and tR respectively, and pL = |tL |/|t|, pR =
|tR |/|t| are the fractions of the observations that fall into the two child nodes.
(b) Show that the three impurity measures defined in class, namely the Gini index, the entropy, and the misclassification error, are all concave, and thus conclude (a) for each of these measures. When do you have equality?
(c) Establish the same result for a regression tree and squared error loss, i.e. show that
$$\min_{c} \frac{1}{|t|}\sum_{i\in t} (y_i - c)^2 \;-\; p_L \min_{c_L} \frac{1}{|t_L|}\sum_{i\in t_L} (y_i - c_L)^2 \;-\; p_R \min_{c_R} \frac{1}{|t_R|}\sum_{i\in t_R} (y_i - c_R)^2 \;\ge\; 0.$$
Solution.
Ans. (a) We have $|t|$ training instances at node $t$. Define $n_i$ as the number of instances at $t$ falling in class $i$, so that $\sum_{i=1}^k n_i = |t|$. Similarly, we have $|t_L|$ and $|t_R|$ training instances at nodes $t_L$ and $t_R$; define $n_{iL}$ and $n_{iR}$ as the number of instances at $t_L$ and $t_R$ falling in class $i$, so that $\sum_{i=1}^k n_{iL} = |t_L|$ and $\sum_{i=1}^k n_{iR} = |t_R|$. It is easy to observe that for all $i = 1, 2, \dots, k$,
$$n_{iL} + n_{iR} = n_i.$$
Dividing by $|t|$ and noting $\hat p_i = n_i/|t|$ and $p_L \hat p_{L,i} = \tfrac{|t_L|}{|t|} \cdot \tfrac{n_{iL}}{|t_L|} = \tfrac{n_{iL}}{|t|}$ (likewise for the right child), this gives
$$\hat p = p_L \hat p_L + p_R \hat p_R.$$
(Hence, proved.)
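The identity is just bookkeeping with counts, as a quick check illustrates (the class counts below are our own made-up numbers):

```python
import numpy as np

n_L = np.array([10,  4, 6])        # class counts in t_L
n_R = np.array([ 2, 14, 4])        # class counts in t_R
n   = n_L + n_R                    # class counts in t

p_hat   = n / n.sum()
p_hat_L = n_L / n_L.sum()
p_hat_R = n_R / n_R.sum()
pL, pR  = n_L.sum() / n.sum(), n_R.sum() / n.sum()

# p_hat = p_L * p_hat_L + p_R * p_hat_R holds exactly
assert np.allclose(p_hat, pL * p_hat_L + pR * p_hat_R)
```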
Ans. (b) Here we need to show that the impurity measures are concave. First we show that the Gini index is concave. The Gini index is
$$f(p) = \mathrm{Imp}(p) = \sum_{i \ne i'} p_i p_{i'} = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2.$$
We need to show
$$f(\alpha p + (1-\alpha) q) \ge \alpha f(p) + (1-\alpha) f(q),$$
i.e., we need to show
$$1 - \sum_i (\alpha p_i + (1-\alpha) q_i)^2 \ge \alpha\Big(1 - \sum_i p_i^2\Big) + (1-\alpha)\Big(1 - \sum_i q_i^2\Big)$$
$$\iff \alpha \sum_i p_i^2 + (1-\alpha) \sum_i q_i^2 \ge \sum_i (\alpha p_i + (1-\alpha) q_i)^2$$
$$\iff \alpha \sum_i p_i^2 + (1-\alpha) \sum_i q_i^2 \ge \alpha^2 \sum_i p_i^2 + (1-\alpha)^2 \sum_i q_i^2 + 2\alpha(1-\alpha) \sum_i p_i q_i$$
$$\iff \alpha(1-\alpha)\Big(\sum_i p_i^2 + \sum_i q_i^2\Big) \ge 2\alpha(1-\alpha) \sum_i p_i q_i$$
$$\iff \sum_i p_i^2 + \sum_i q_i^2 \ge 2 \sum_i p_i q_i \qquad (\text{for } \alpha \in (0,1); \text{ the cases } \alpha \in \{0,1\} \text{ are trivial})$$
$$\iff \sum_i (p_i - q_i)^2 \ge 0.$$
The last inequality is obviously true, hence the concavity of the Gini index holds, and thus (a) also holds for the Gini index. Since all steps in the proof of the concavity inequality are "iff" statements, equality in the concavity inequality holds iff $p_i = q_i$ for all $i = 1, 2, \dots, k$. Hence equality in part (a) for the Gini index holds iff $\hat p_L = \hat p_R$.
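The Gini concavity inequality can also be checked numerically. This is our own sanity check, not part of the proof; the random proportion vectors $p$ and $q$ (drawn from a Dirichlet distribution) play the roles of $\hat p_L$ and $\hat p_R$, and $a$ plays the role of $p_L$.

```python
import numpy as np

def gini(p):
    # Gini index: 1 - sum_i p_i^2
    return 1.0 - np.sum(p ** 2)

rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    a = rng.uniform()
    # concavity: f(a p + (1-a) q) >= a f(p) + (1-a) f(q)
    assert gini(a * p + (1 - a) * q) >= a * gini(p) + (1 - a) * gini(q) - 1e-12
```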
Next we consider the misclassification error, $f(p) = 1 - \max_i p_i$. We need to show
$$f(\alpha p + (1-\alpha) q) \ge \alpha f(p) + (1-\alpha) f(q),$$
i.e., we need to show
$$\max_i\,(\alpha p_i + (1-\alpha) q_i) \le \alpha \max_i p_i + (1-\alpha) \max_i q_i.$$
Clearly the last desired inequality holds (the maximum of a sum is at most the sum of the maxima), and thus the concavity of the misclassification error also holds. Hence (a) also holds for the misclassification error.
Now, again, all the steps of the proof of the concavity inequality are "iff" statements. We claim that equality in concavity holds iff $\arg\max_i p_i = \arg\max_i q_i$.
Proof of "if": let $k$ be the common maximizing index. Then for every $i$ we have
$$\alpha p_i + (1-\alpha) q_i \le \alpha p_k + (1-\alpha) q_k \le \max_i\,(\alpha p_i + (1-\alpha) q_i).$$
Thus $\max_i\,(\alpha p_i + (1-\alpha) q_i) = \alpha p_k + (1-\alpha) q_k = \alpha \max_i p_i + (1-\alpha) \max_i q_i$.
Proof of "only if": we show the contrapositive, i.e., if $\arg\max_i p_i \ne \arg\max_j q_j$, then equality is not attained. Suppose
$$\max_i \alpha p_i = \alpha p_k \quad\text{and}\quad \max_j\,(1-\alpha) q_j = (1-\alpha) q_l \quad\text{s.t. } k \ne l.$$
Then, for all $k = 1, 2, \dots$
Finally, for the entropy, we need to show
$$-\sum_i (\alpha p_i + (1-\alpha) q_i) \log(\alpha p_i + (1-\alpha) q_i) \ge -\alpha \sum_i p_i \log p_i - (1-\alpha) \sum_i q_i \log q_i$$
$$\iff \sum_i \Big[(\alpha p_i + (1-\alpha) q_i) \log(\alpha p_i + (1-\alpha) q_i) - \alpha p_i \log p_i - (1-\alpha) q_i \log q_i\Big] \le 0.$$
First, we show that $g(u) = u \log u$ is a convex function of $u$ for $u \ge 0$. Since $g$ is continuous, it is enough to show $g''(u) \ge 0$, and indeed $g''(u) = 1/u \ge 0$. Hence, by convexity, for all $i = 1, 2, \dots$
$$(\alpha p_i + (1-\alpha) q_i) \log(\alpha p_i + (1-\alpha) q_i) \le \alpha p_i \log p_i + (1-\alpha) q_i \log q_i.$$
From here it is quite clear that the last desired inequality also holds; thus (a) also holds for Shannon's entropy.
Now, observe that each of the terms in the left-hand sum of the last desired inequality is $\le 0$ by convexity of $u \log u$; hence equality in concavity holds iff each term of the left-hand sum is zero. Also, note that $g''(u) = 1/u > 0$ for $u > 0$, so $g$ is strictly convex there and each term is zero iff $p_i = q_i$.
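The same concavity inequality can be checked numerically for the entropy and the misclassification error, again as our own sanity check with random proportion vectors:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                       # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def miscls(p):
    return 1.0 - np.max(p)

rng = np.random.default_rng(2)
for f in (entropy, miscls):
    for _ in range(1000):
        p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
        a = rng.uniform()
        # concavity: f(a p + (1-a) q) >= a f(p) + (1-a) f(q)
        assert f(a * p + (1 - a) * q) >= a * f(p) + (1 - a) * f(q) - 1e-12
```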
Ans. (c) Observe that the instances in $t$ split into the disjoint sets of instances in $t_L$ and instances in $t_R$. So
$$\sum_{i\in t} (y_i - c)^2 = \sum_{i\in t_L} (y_i - c)^2 + \sum_{i\in t_R} (y_i - c)^2.$$
So, for all $c \in \mathbb{R}$,
$$\sum_{i\in t} (y_i - c)^2 \ge \min_{c_L} \sum_{i\in t_L} (y_i - c_L)^2 + \min_{c_R} \sum_{i\in t_R} (y_i - c_R)^2.$$
Hence,
$$\min_c \sum_{i\in t} (y_i - c)^2 \ge \min_{c_L} \sum_{i\in t_L} (y_i - c_L)^2 + \min_{c_R} \sum_{i\in t_R} (y_i - c_R)^2$$
$$\implies \frac{1}{|t|} \min_c \sum_{i\in t} (y_i - c)^2 \ge \frac{1}{|t|} \min_{c_L} \sum_{i\in t_L} (y_i - c_L)^2 + \frac{1}{|t|} \min_{c_R} \sum_{i\in t_R} (y_i - c_R)^2$$
$$\implies \frac{1}{|t|} \min_c \sum_{i\in t} (y_i - c)^2 \ge \frac{p_L}{|t_L|} \min_{c_L} \sum_{i\in t_L} (y_i - c_L)^2 + \frac{p_R}{|t_R|} \min_{c_R} \sum_{i\in t_R} (y_i - c_R)^2,$$
where the last line follows from the definitions $p_L = |t_L|/|t|$ and $p_R = |t_R|/|t|$, and this is what we wanted to prove.
(Hence, proved.)
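The chain of inequalities can be confirmed numerically. The sketch below is our own check, with arbitrary data and an arbitrary split: it compares the parent's normalized squared error with the weighted children's errors, using the fact that $\min_c \sum_i (v_i - c)^2$ is attained at the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=40)
y_L, y_R = y[:15], y[15:]              # an arbitrary split of node t
pL, pR = len(y_L) / len(y), len(y_R) / len(y)

def sse(v):
    # min_c sum_i (v_i - c)^2 is attained at c = mean(v)
    return np.sum((v - v.mean()) ** 2)

parent   = sse(y) / len(y)
children = pL * sse(y_L) / len(y_L) + pR * sse(y_R) / len(y_R)
assert parent >= children - 1e-12      # splitting never increases the loss
```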
4. Bagged classifiers. [5 points]
Suppose you have training data $(X, y)$. Generate $B$ bootstrap samples $(X^{(1)}, y^{(1)}), \dots, (X^{(B)}, y^{(B)})$ of the training data and train a classifier $\phi^{(i)}$ on each. The bagged classifier $\phi^{(\mathrm{bagged})}$ then predicts the majority class among the predictions of the $B$ classifiers $\phi^{(i)}$, i.e.
$$\phi^{(\mathrm{bagged})}(x) = \arg\max_a \sum_{i=1}^{B} \mathbf 1\{\phi^{(i)}(x) = a\}.$$
Consider the MNIST handwritten digits data and take ntrain = 1000 images (and their labels) as your training data.
Train bagged (with B = 100) and non-bagged versions of CART, k-NN and Fisher’s LDA. Take a validation set of size
nvalidation = 1000 and select the best performer among these. Report the error of the selected classifier on a separate
test set of size ntest = 1000.
Solution.
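The full MNIST experiment is not reproduced here. As a minimal sketch of the bagged majority vote itself, the following uses a nearest-centroid base classifier on synthetic Gaussian blobs in place of MNIST; the data sizes, the base learner, and all names are our own simplifications, not the assignment's setup.

```python
import numpy as np

rng = np.random.default_rng(4)
K, d, n_per = 3, 2, 60                        # classes, dimension, points/class
centers = rng.normal(scale=4.0, size=(K, d))
Xtr = np.vstack([c + rng.normal(size=(n_per, d)) for c in centers])
ytr = np.repeat(np.arange(K), n_per)

def fit_nearest_centroid(X, y):
    # base classifier: assign each point to the nearest class centroid
    mus = np.vstack([X[y == k].mean(axis=0) for k in range(K)])
    return lambda Z: np.argmin(((Z[:, None, :] - mus) ** 2).sum(-1), axis=1)

B, n = 100, len(ytr)
preds = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
    preds.append(fit_nearest_centroid(Xtr[idx], ytr[idx])(Xtr))
preds = np.array(preds)                       # shape (B, n)

# phi_bagged(x) = argmax_a sum_i 1{phi^(i)(x) = a}
votes = np.array([np.bincount(preds[:, j], minlength=K) for j in range(n)])
y_bagged = votes.argmax(axis=1)
print("bagged training error:", np.mean(y_bagged != ytr))
```

For the actual assignment one would plug in CART, $k$-NN, and LDA as base learners on the MNIST subsets and compare validation errors as described above.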