
CSE 575: Statistical Machine Learning

Jingrui He
CIDSE, ASU

Instance-based Learning

1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   One
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the same output as the nearest neighbor.
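The four ingredients above can be sketched in a few lines of NumPy (a minimal illustration; the function name `predict_1nn` and the toy data are my own, not from the slides):

```python
import numpy as np

def predict_1nn(X_train, y_train, x):
    """Predict the label of x as that of its single nearest
    training point under Euclidean distance (ingredient 1 and 2);
    no weighting (3), and the fit (4) just copies the neighbor's output."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

# Toy data: two well-separated clusters on a line
X = np.array([[0.0], [1.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1])
print(predict_1nn(X, y, np.array([0.4])))  # nearest point is [0.0] -> 0
```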

Consistency of 1-NN
Consider an estimator f_n trained on n examples
  e.g., 1-NN, regression, ...
Estimator is consistent if true error goes to zero as the amount of data increases
  e.g., for noise-free data, consistent if:
Regression is not consistent!
  Representation bias
1-NN is consistent (under some mild fine print)
What about variance???


1-NN overfits?

k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   k
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the average output among the k nearest neighbors.
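The only change from 1-NN is ingredient 4: average over the k nearest outputs instead of copying one. A minimal sketch (names and toy data are illustrative):

```python
import numpy as np

def predict_knn(X_train, y_train, x, k=3):
    """Average the outputs of the k nearest training points
    (regression version; use a majority vote for classification)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest points
    return y_train[nearest].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
print(predict_knn(X, y, np.array([1.0]), k=3))  # averages 0, 1, 2 -> 1.0
```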

k-Nearest Neighbor (here k=9)

K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?

Curse of dimensionality for instance-based learning
Must store and retrieve all data!
  Most real work done during testing
  For every test sample, must search through the whole dataset: very slow!
  There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ...
Instance-based learning often poor with noisy or irrelevant features
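One concrete example of the tree-based speedups mentioned above is a k-d tree, e.g. SciPy's `cKDTree` (my choice of library, not one named on the slides): build once, then each query avoids scanning the whole dataset.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 3))   # 10k stored instances in 3-d

tree = cKDTree(X)                     # build the tree once

query = np.zeros(3)
dist, idx = tree.query(query, k=5)    # 5 nearest neighbors of the query
print(idx)                            # indices of the 5 closest rows of X
```

Note the caveat behind the "curse": in high dimensions these trees degrade toward brute-force scans, which is part of why instance-based learning scales poorly with dimensionality.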

Support Vector Machines

Linear classifiers: Which line is better?

Data:
Example i:

w·x = Σ_j w(j) x(j)

Pick the one with the largest margin!

w·x = Σ_j w(j) x(j)

[Figure: separating line w·x + b = 0 with its margin]

Maximize the margin

[Figure: hyperplane w·x + b = 0 and its margin]

But there are many planes

[Figure: hyperplane w·x + b = 0]

Review: Normal to a plane

[Figure: hyperplane w·x + b = 0 with its normal vector w]

Normalized margin: Canonical hyperplanes

[Figure: points x+ and x- on the canonical hyperplanes w·x + b = +1 and w·x + b = -1, separated by the decision boundary w·x + b = 0; the margin is the distance between them]


Margin maximization using canonical hyperplanes

[Figure: canonical hyperplanes w·x + b = +1 and w·x + b = -1 around w·x + b = 0; margin 2/‖w‖]
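Spelled out, margin maximization with canonical hyperplanes is the following quadratic program (a standard reconstruction; the slide's own equations did not survive extraction):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^2
\qquad \text{s.t.}\qquad
y_i\,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i
```

Minimizing ‖w‖² under these constraints is equivalent to maximizing the margin 2/‖w‖.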

Support vector machines (SVMs)

Solve efficiently by quadratic programming (QP)
  Well-studied solution algorithms
Hyperplane defined by support vectors

[Figure: max-margin hyperplane with support vectors lying on w·x + b = +1 and w·x + b = -1]

What if the data is not linearly separable?
Use features of features of features of features.

What if the data is still not linearly separable?
Minimize w·w and number of training mistakes
  Tradeoff two criteria?
Tradeoff #(mistakes) and w·w

0/1 loss
Slack penalty C
Not QP anymore
Also doesn't distinguish near misses and really bad mistakes

Slack variables: Hinge loss
If margin ≥ 1, don't care
If margin < 1, pay linear penalty
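With slack variables, the tradeoff between w·w and mistakes becomes a QP again (standard soft-margin form, reconstructed since the slide's equations were lost):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\,\|w\|^2 \;+\; C \sum_i \xi_i
\qquad \text{s.t.}\qquad
y_i\,(w \cdot x_i + b) \;\ge\; 1 - \xi_i, \quad \xi_i \ge 0
```

Each ξ_i is exactly the hinge penalty for point i: zero when its margin is at least 1, growing linearly as the margin shrinks below 1.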

Side note: What's the difference between SVMs and logistic regression?
SVM:
Logistic regression:
  Log loss:
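For labels y ∈ {-1, +1}, the two losses being contrasted here are standardly written as:

```latex
\text{hinge (SVM):}\quad \max\bigl(0,\ 1 - y\,(w \cdot x + b)\bigr)
\qquad\qquad
\text{log loss (logistic regression):}\quad \ln\bigl(1 + e^{-y\,(w \cdot x + b)}\bigr)
```

Both penalize small or negative margins; the hinge is exactly zero past margin 1, while the log loss is smooth and strictly positive everywhere.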

Constrained optimization

Lagrange multipliers: Dual variables
Moving the constraint to the objective function
Lagrangian:
Solve:
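For the separable-case SVM, moving the constraints into the objective gives the Lagrangian (standard form, reconstructed):

```latex
L(w, b, \alpha) \;=\; \tfrac{1}{2}\,\|w\|^2 \;-\; \sum_i \alpha_i \bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr],
\qquad \alpha_i \ge 0
```

Solve min over (w, b) of max over α ≥ 0 of L(w, b, α); the α_i are the dual variables, one per constraint.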

Lagrange multipliers: Dual variables
Solving:

Dual SVM derivation (1): the linearly separable case

Dual SVM derivation (2): the linearly separable case

Dual SVM interpretation

[Figure: hyperplane w·x + b = 0 with the support vectors highlighted]

Dual SVM formulation: the linearly separable case
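The resulting dual QP (standard form, reconstructed since the slide's equations were lost):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\qquad \text{s.t.}\qquad \alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0
```

The primal solution is recovered as w = Σ_i α_i y_i x_i; the points with α_i > 0 are the support vectors.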

Dual SVM derivation: the non-separable case

Dual SVM formulation: the non-separable case
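The non-separable dual is the same program, except that the slack penalty C boxes each dual variable (standard form, reconstructed):

```latex
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\qquad \text{s.t.}\qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0
```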

Why did we learn about the dual SVM?
There are some quadratic programming algorithms that can solve the dual faster than the primal
But, more importantly, the kernel trick!!!
Another little detour

Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features.
Feature space can get really large really quickly!

Higher order polynomials
m input features, d degree of polynomial

[Figure: number of monomial terms vs. number of input dimensions, curves for d = 2, 3, 4]

grows fast!
  d = 6, m = 100: about 1.6 billion terms

Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
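The key identity can be checked numerically: for a suitably weighted degree-2 monomial map φ in two dimensions, the dot product of the expanded features equals (x·z)² computed directly in the original space (a small self-check; the function name `phi2` is mine):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 monomial features of a 2-d input.
    The sqrt(2) weight on the cross term makes the dot products match."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi2(x) @ phi2(z)   # dot product in the 3-d feature space
kernel = (x @ z) ** 2          # (x.z)^2, computed in the original 2-d space
print(explicit, kernel)        # the two values agree
```

The same pattern holds for any degree d: computing (x·z)^d costs O(d) work after one m-dimensional dot product, while the explicit feature space has dimension growing like m^d.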

Finally: the kernel trick!
Never represent features explicitly
  Compute dot products in closed form
Constant-time high-dimensional dot-products for many classes of features
Very interesting theory: Reproducing Kernel Hilbert Spaces

Polynomial kernels
All monomials of degree d in O(d) operations:
How about all monomials of degree up to d?
  Solution 0:
  Better solution:

Common kernels
Polynomials of degree d

Polynomials of degree up to d
Gaussian kernels
Sigmoid

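The slide's formulas were lost in extraction; the standard forms of these kernels can be sketched directly (parameter choices like `sigma` and the sigmoid's `a`, `b` defaults are illustrative):

```python
import numpy as np

def poly_kernel(x, z, d):
    """All monomials up to degree d: K(x, z) = (x.z + 1)^d."""
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma):
    """Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, b=-1.0):
    """Sigmoid kernel: K(x, z) = tanh(a * x.z + b)."""
    return np.tanh(a * (x @ z) + b)

x = np.array([1.0, 0.0])
print(gaussian_kernel(x, x, sigma=1.0))  # distance 0 -> kernel value 1.0
```

Dropping the `+ 1.0` in `poly_kernel` gives only the monomials of exactly degree d.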

Overfitting?
Huge feature space with kernels, what about overfitting???
Maximizing margin leads to a sparse set of support vectors
Some interesting theory says that SVMs search for simple hypotheses with large margin
Often robust to overfitting

What about at classification time?
For a new input x, if we need to represent φ(x), we are in trouble!
Recall classifier: sign(w·φ(x) + b)
Using kernels we are cool!

SVMs with kernels
Choose a set of features and kernel function
Solve dual problem to obtain support vectors α_i
At classification time, compute:
Classify as
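The classification-time computation is the standard kernelized decision rule, sign(Σ_i α_i y_i K(x_i, x) + b). A minimal sketch (here the support vectors, α's, and b are set by hand for illustration, not obtained by actually solving the dual):

```python
import numpy as np

def kernel_classify(x, support_X, support_y, alpha, b, kernel):
    """Kernelized SVM decision: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alpha, support_y, support_X)) + b
    return 1 if score >= 0 else -1

linear = lambda u, v: u @ v            # linear kernel for the toy example
support_X = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
support_y = [1, -1]
alpha = [1.0, 1.0]
b = 0.0
print(kernel_classify(np.array([2.0, 0.5]),
                      support_X, support_y, alpha, b, linear))
```

Note that φ(x) is never formed: only kernel evaluations against the support vectors are needed.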

What's the difference between SVMs and Logistic Regression?

                          SVMs         Logistic Regression
  Loss function           Hinge loss   Log-loss
  High-dimensional
  features with kernels   Yes!         No

Kernels in logistic regression
Define weights in terms of support vectors:
Derive simple gradient descent rule on α_i
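One way the gradient rule on α_i can look, obtained by substituting w = Σ_i α_i φ(x_i) into the log loss so that everything reduces to the kernel matrix K (a sketch of the idea under my own choice of step details, not the slides' exact derivation):

```python
import numpy as np

def klr_gradient_step(alpha, K, y, lr=0.1):
    """One gradient-descent step for kernel logistic regression.
    Model scores on the training set: f = K @ alpha;
    loss = mean_j log(1 + exp(-y_j * f_j))."""
    f = K @ alpha
    p = 1.0 / (1.0 + np.exp(y * f))   # p_j = sigmoid(-y_j * f_j)
    grad = K @ (-y * p) / len(y)       # chain rule; K is symmetric
    return alpha - lr * grad

# Two points with opposite labels, linear kernel matrix
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.zeros(2)
for _ in range(50):
    alpha = klr_gradient_step(alpha, K, y)
print(alpha)   # both training points now scored with the correct sign
```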

What's the difference between SVMs and Logistic Regression? (Revisited)

                          SVMs         Logistic Regression
  Loss function           Hinge loss   Log-loss
  High-dimensional
  features with kernels   Yes!         Yes!
  Solution sparse         Often yes!   Almost always no!
  Semantics of output     Margin       Real probabilities