
Lecture 7: The VC Dimension

0/26

The VC Dimension

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 6: Theory of Generalization
Eout ≈ Ein possible
if mH(N) breaks somewhere and N large enough

Lecture 7: The VC Dimension
Definition of VC Dimension
VC Dimension of Perceptrons
Physical Intuition of VC Dimension
Interpreting VC Dimension

3 How Can Machines Learn?
4 How Can Machines Learn Better?

1/26

Theory of Generalization

Bounding Function: Inductive Cases

Bounding Function: The Theorem


B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i)

(highest term: N^{k−1})

simple induction using boundary and inductive formula


for fixed k, B(N, k) is upper bounded by poly(N)

⟹ mH(N) is poly(N) if break point exists

(the "≤" can actually be "="; go prove it if you are a math lover! :-))
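To make the theorem concrete, here is a minimal Python sketch (my own illustration, not part of the lecture): it evaluates the bounding function through the boundary and inductive formula from Lecture 6 and checks it against the binomial-sum bound above.

```python
# Minimal sketch (not from the lecture): compare the bounding function, computed
# from the boundary and inductive formula, with the binomial-sum bound above.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N: int, k: int) -> int:
    """Bounding function via B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    if k == 1:
        return 1          # boundary: break point 1 leaves only one dichotomy
    if N < k:
        return 2 ** N     # boundary: break point not reached, all dichotomies possible
    return B(N - 1, k) + B(N - 1, k - 1)

def binom_sum(N: int, k: int) -> int:
    """The theorem's bound: sum_{i=0}^{k-1} C(N, i), highest term N^(k-1)."""
    return sum(comb(N, i) for i in range(k))

for N in range(1, 12):
    for k in range(1, 6):
        assert B(N, k) <= binom_sum(N, k)   # holds, in fact with equality
```

Running the loop shows the two sides agree exactly, which is the "≤ can actually be =" remark above.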

Theory of Generalization

Bounding Function: Inductive Cases

The Three Break Points


B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i)

(highest term: N^{k−1})

positive rays:
mH(N) = N + 1 ≤ N + 1
mH(2) = 3 < 2² : break point at 2

positive intervals:
mH(N) = ½N² + ½N + 1 ≤ ½N² + ½N + 1
mH(3) = 7 < 2³ : break point at 3

2D perceptrons:
mH(N) = ? ≤ (1/6)N³ + (5/6)N + 1
mH(4) = 14 < 2⁴ : break point at 4

can bound mH (N) by only one break point
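These growth-function values can also be checked by brute force. The sketch below (my own illustration with hypothetical helper names, not from the lecture) enumerates the dichotomies that positive rays and positive intervals realize on a few 1D points, reproducing mH(2) = 3 < 2² and mH(3) = 7 < 2³.

```python
# Brute-force dichotomy counting for two simple hypothesis sets (illustrative sketch).
from itertools import combinations

def ray_dichotomies(xs):
    """Sign patterns of h(x) = sign(x - a) over the given points, for all thresholds a."""
    cuts = [min(xs) - 1.0] + [x + 0.5 for x in sorted(xs)]
    return {tuple(+1 if x > a else -1 for x in xs) for a in cuts}

def interval_dichotomies(xs):
    """Patterns that are +1 inside some interval (l, r] and -1 outside."""
    cuts = [min(xs) - 1.0] + [x + 0.5 for x in sorted(xs)]
    pats = {tuple(-1 for _ in xs)}                            # empty interval
    for l, r in combinations(cuts, 2):
        pats.add(tuple(+1 if l < x <= r else -1 for x in xs))
    return pats

print(len(ray_dichotomies([1.0, 2.0])), "of", 2 ** 2)             # 3 of 4: break point at 2
print(len(interval_dichotomies([1.0, 2.0, 3.0])), "of", 2 ** 3)   # 7 of 8: break point at 3
```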

Theory of Generalization

BAD Bound for General H

want:
P[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε] ≤ 2 · mH(N) · exp(−2ε²N)

actually, when N large enough,
P[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε] ≤ 2 · 2mH(2N) · exp(−2 · (1/16) · ε²N)

The VC Dimension

Definition of VC Dimension

Vapnik-Chervonenkis (VC) Bound


For any g = A(D) ∈ H and "statistically large" D,

P_D[|Ein(g) − Eout(g)| > ε]
  ≤ P_D[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε]
  ≤ 4 mH(2N) exp(−(1/8)ε²N)
  ≤ 4 (2N)^{k−1} exp(−(1/8)ε²N)   if k exists

if (1) mH(N) breaks at k          (good H)
and (2) N large enough            (good D)
⟹ probably generalized: Eout ≈ Ein, and
if (3) A picks a g with small Ein (good A)
⟹ probably learned!
3/26

The VC Dimension

Definition of VC Dimension

VC Dimension
the formal name of maximum non-break point

Definition
VC dimension of H, denoted dVC(H), is the
largest N for which mH(N) = 2^N

the most inputs H can shatter
dVC = (minimum break point k) − 1

N ≤ dVC  ⟹  H can shatter some N inputs
k > dVC  ⟹  k is a break point for H

4/26

The VC Dimension

Definition of VC Dimension

The Four VC Dimensions


positive rays:        mH(N) = N + 1               dVC = 1
positive intervals:   mH(N) = ½N² + ½N + 1        dVC = 2
convex sets:          mH(N) = 2^N                 dVC = ∞
2D perceptrons:       mH(N) ≤ N³ for N ≥ 2        dVC = 3

good: finite dVC


5/26

The VC Dimension

Definition of VC Dimension

VC Dimension and Learning


finite dVC ⟹ g "will" generalize (Eout(g) ≈ Ein(g)):
  regardless of learning algorithm A
  regardless of input distribution P
  regardless of target function f

[learning-flow diagram: unknown target function f: X → Y (ideal credit approval formula)
and unknown P on X generate training examples D: (x1, y1), ..., (xN, yN) (historical
records in bank); the learning algorithm A, using hypothesis set H, outputs the final
hypothesis g ≈ f (learned formula to be used)]

worst-case guarantee on generalization
6/26

The VC Dimension

Definition of VC Dimension

Fun Time
Suppose there is a set of N inputs that cannot be shattered by H. Based
only on this information, what can we conclude about dVC (H)?
1  dVC(H) > N
2  dVC(H) = N
3  dVC(H) < N
4  no conclusion can be made

Reference Answer: 4
It is possible that there is another set of N
inputs that can be shattered, which means
dVC ≥ N. It is also possible that no set of N
inputs can be shattered, which means dVC < N.
Neither case can be ruled out by one
non-shattering set.
7/26

The VC Dimension

VC Dimension of Perceptrons

2D PLA Revisited
linearly separable D (with xn ~ P and yn = f(xn))

PLA can converge (T large)  ⟹  Ein(g) = 0

P[|Ein(g) − Eout(g)| > ε] small by dVC = 3 (N large)  ⟹  Eout(g) ≈ Ein(g)

⟹  Eout(g) ≈ 0 :-)

general PLA for x with more than 2 features?

8/26

The VC Dimension

VC Dimension of Perceptrons

VC Dimension of Perceptrons
1D perceptron (pos/neg rays): dVC = 2
2D perceptrons: dVC = 3

dVC ≥ 3: some 3 inputs can be shattered
dVC ≤ 3: no 4 inputs can be shattered

d-D perceptrons: dVC = d + 1?

two steps:
dVC ≥ d + 1
dVC ≤ d + 1

9/26

The VC Dimension

VC Dimension of Perceptrons

Extra Fun Time


What statement below shows that dVC ≥ d + 1?

1  There are some d + 1 inputs we can shatter.
2  We can shatter any set of d + 1 inputs.
3  There are some d + 2 inputs we cannot shatter.
4  We cannot shatter any set of d + 2 inputs.

Reference Answer: 1
dVC is the maximum N for which mH(N) = 2^N, and
mH(N) is the largest number of dichotomies on N
inputs. So if we can find 2^{d+1} dichotomies on
some d + 1 inputs, mH(d + 1) = 2^{d+1} and
hence dVC ≥ d + 1.
10/26

The VC Dimension

VC Dimension of Perceptrons

dVC ≥ d + 1

There are some d + 1 inputs we can shatter.


some "trivial" inputs (each row prepends the constant coordinate x0 = 1):

X = [ x1ᵀ; x2ᵀ; x3ᵀ; ...; x_{d+1}ᵀ ]

  = [ 1 0 0 ... 0
      1 1 0 ... 0
      1 0 1 ... 0
      ...
      1 0 ... 0 1 ]

(visually in 2D: picture omitted)

note: X is invertible!

11/26

The VC Dimension

VC Dimension of Perceptrons

Can We Shatter X?
X = [ x1ᵀ; x2ᵀ; ...; x_{d+1}ᵀ ]

  = [ 1 0 0 ... 0
      1 1 0 ... 0
      ...
      1 0 ... 0 1 ]     (invertible)

to shatter: for any y = (y1, ..., y_{d+1})ᵀ, find w such that

    sign(Xw) = y

⟸  Xw = y
⟸  w = X⁻¹y     (X invertible!)

"special" X can be shattered  ⟹  dVC ≥ d + 1
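A quick numerical check of this construction (a sketch using NumPy, assuming X is built exactly as above): for every dichotomy y, solving w = X⁻¹y indeed reproduces sign(Xw) = y.

```python
# Sketch: verify that the special inputs above are shattered by d-D perceptrons.
import numpy as np
from itertools import product

d = 4
X = np.eye(d + 1)
X[:, 0] = 1.0                                  # first coordinate is the constant x0 = 1
assert np.linalg.matrix_rank(X) == d + 1       # X is invertible

X_inv = np.linalg.inv(X)
for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = X_inv @ y                              # then X w = y, so sign(X w) = y
    assert np.array_equal(np.sign(X @ w), y)

print(f"all {2 ** (d + 1)} dichotomies realized on d+1 = {d + 1} inputs: dVC >= d + 1")
```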


12/26

The VC Dimension

Physical Intuition of VC Dimension

Degrees of Freedom
[figure omitted: illustration of "degrees of freedom" with many adjustable dials
(modified from the work of Hugues Vermeiren on http://www.texample.net)]

hypothesis parameters w = (w0, w1, ..., wd):

creates degrees of freedom

hypothesis quantity M = |H|:

analog degrees of freedom

hypothesis power dVC = d + 1:

effective binary degrees of freedom


dVC (H): powerfulness of H
17/26

The VC Dimension

Physical Intuition of VC Dimension

Two Old Friends


Positive Rays (dVC = 1)
  h(x) = −1 to the left of threshold a, h(x) = +1 to the right (over x1, x2, ..., xN)
  free parameters: a (1)

Positive Intervals (dVC = 2)
  h(x) = −1, then h(x) = +1 inside the interval, then h(x) = −1 again (over x1, x2, ..., xN)
  free parameters: ℓ, r (2)

practical rule of thumb:
dVC ≈ #free parameters (but not always)
18/26

The VC Dimension

Physical Intuition of VC Dimension

M and dVC
copied from Lecture 5 :-)

1. can we make sure that Eout(g) is close enough to Ein(g)?
2. can we make Ein(g) small enough?

              question 1                                 question 2
small M       Yes! P[BAD] ≤ 2 · M · exp(...)             No! too few choices
large M       No!  P[BAD] ≤ 2 · M · exp(...)             Yes! many choices
small dVC     Yes! P[BAD] ≤ 4 (2N)^dVC exp(...)          No! too limited power
large dVC     No!  P[BAD] ≤ 4 (2N)^dVC exp(...)          Yes! lots of power

using the right dVC (or H) is important


19/26

The VC Dimension

Physical Intuition of VC Dimension

Fun Time
Origin-crossing hyperplanes are essentially perceptrons with w0
fixed at 0. Make a guess about the dVC of origin-crossing
hyperplanes in R^d.

Reference Answer: 2 (dVC = d)
The proof is almost the same as the one for the
dVC of the usual perceptrons, but it is the intuition
(dVC ≈ #free parameters, here the d parameters w1, ..., wd)
that you should use to answer this quiz.
20/26

The VC Dimension

Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and "statistically large" D,

P_D[|Ein(g) − Eout(g)| > ε]  ≤  4(2N)^dVC exp(−(1/8)ε²N)
        (BAD)

Rephrase:
..., with probability ≥ 1 − δ, GOOD: |Ein(g) − Eout(g)| ≤ ε

set δ = 4(2N)^dVC exp(−(1/8)ε²N)

⟺  δ / (4(2N)^dVC) = exp(−(1/8)ε²N)
⟺  ln(4(2N)^dVC / δ) = (1/8)ε²N
⟺  ε = √((8/N) ln(4(2N)^dVC / δ))
21/26

The VC Dimension

Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and "statistically large" D, for N ≥ 2, dVC ≥ 2,

P_D[|Ein(g) − Eout(g)| > ε]  ≤  4(2N)^dVC exp(−(1/8)ε²N) = δ
        (BAD)

Rephrase:
..., with probability ≥ 1 − δ, GOOD:

gen. error  |Ein(g) − Eout(g)|  ≤  √((8/N) ln(4(2N)^dVC / δ))

Ein(g) − √((8/N) ln(4(2N)^dVC / δ))  ≤  Eout(g)  ≤  Ein(g) + √((8/N) ln(4(2N)^dVC / δ))

√(...) : penalty for model complexity, Ω(N, H, δ)
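As a numerical illustration (a sketch; the function name and default values are mine, not from the lecture), the penalty can be evaluated directly from the rephrased bound: with probability at least 1 − δ, Eout(g) ≤ Ein(g) + Ω(N, H, δ).

```python
# Sketch: the model-complexity penalty Omega(N, H, delta) from the VC bound.
import math

def vc_penalty(N: int, dvc: int, delta: float) -> float:
    """sqrt((8/N) * ln(4 * (2N)^dvc / delta))"""
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** dvc / delta))

# e.g. dvc = 3, delta = 0.1: the penalty shrinks only slowly as N grows
for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:>7,d}: Omega = {vc_penalty(N, dvc=3, delta=0.1):.3f}")
```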
21/26

The VC Dimension

Interpreting VC Dimension

THE VC Message
with a high probability,

Eout(g) ≤ Ein(g) + √((8/N) ln(4(2N)^dVC / δ))

where the square-root term is Ω(N, H, δ)

[plot: Error versus VC dimension dVC; the in-sample error Ein decreases as dVC grows,
the model complexity Ω increases, and the out-of-sample error Eout is lowest in between]

dVC ↑ : Ein ↓ but Ω ↑
dVC ↓ : Ω ↓ but Ein ↑
best dVC in the middle

powerful H not always good!


22/26

The VC Dimension

Interpreting VC Dimension

VC Bound Rephrase: Sample Complexity


For any g = A(D) ∈ H and "statistically large" D,

P_D[|Ein(g) − Eout(g)| > ε]  ≤  4(2N)^dVC exp(−(1/8)ε²N)
        (BAD)

given specs ε = 0.1, δ = 0.1, dVC = 3, want 4(2N)^dVC exp(−(1/8)ε²N) ≤ δ

    N          bound
    100        2.82 × 10^7
    1,000      9.17 × 10^9
    10,000     1.19 × 10^8
    100,000    1.65 × 10^−38
    29,300     9.99 × 10^−2

sample complexity:
need N ≈ 10,000 · dVC in theory

practical rule of thumb:
N ≈ 10 · dVC often enough!
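The table can be reproduced by evaluating the bound directly. A small sketch (my own helper, not from the lecture) for ε = 0.1, δ = 0.1, dVC = 3:

```python
# Sketch: evaluate the VC bound 4 (2N)^dvc exp(-(1/8) eps^2 N) for the specs above.
import math

def vc_bound(N: int, dvc: int = 3, eps: float = 0.1) -> float:
    return 4.0 * (2.0 * N) ** dvc * math.exp(-(eps ** 2) * N / 8.0)

for N in (100, 1_000, 10_000, 100_000, 29_300):
    print(f"N = {N:>7,d}: bound = {vc_bound(N):.3g}")
# the bound first drops below delta = 0.1 around N = 29,300, i.e. roughly 10,000 * dvc
```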
23/26

The VC Dimension

Interpreting VC Dimension

Looseness of VC Bound
P_D[|Ein(g) − Eout(g)| > ε]  ≤  4(2N)^dVC exp(−(1/8)ε²N)

theory: N ≈ 10,000 · dVC; practice: N ≈ 10 · dVC

Why so loose?
  Hoeffding for unknown Eout: works for any distribution, any target
  mH(N) instead of |H(x1, ..., xN)|: works for "any" data
  N^dVC instead of mH(N): works for any H of the same dVC
  union bound on worst cases: works for any choice made by A

but hardly better, and similarly loose for all models
⟹ philosophical message of VC bound
important for improving ML
24/26

The VC Dimension

Interpreting VC Dimension

Fun Time
Consider the VC Bound below. How can we decrease the
probability of getting BAD data?
P_D[|Ein(g) − Eout(g)| > ε]  ≤  4(2N)^dVC exp(−(1/8)ε²N)

1  decrease model complexity dVC
2  increase data size N a lot
3  increase generalization error tolerance ε
4  all of the above

Reference Answer: 4
Congratulations on being a
Master of the VC bound! :-)
25/26

The VC Dimension

Interpreting VC Dimension

Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 6: Theory of Generalization

Lecture 7: The VC Dimension
Definition of VC Dimension: maximum non-break point
VC Dimension of Perceptrons: dVC(H) = d + 1
Physical Intuition of VC Dimension: dVC ≈ #free parameters
Interpreting VC Dimension: loosely, model complexity & sample complexity

next: more than noiseless binary classification?

3 How Can Machines Learn?
4 How Can Machines Learn Better?
26/26
