
Last updated: Oct 22, 2012

LINEAR CLASSIFIERS
J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Problems
Please do Problem 8.3 in the textbook. We will discuss this in class.


Classification: Problem Statement
In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t.

In classification, the input variable x may still be continuous, but the target variable is discrete. In the simplest case, t can have only 2 values.

e.g., let t = +1 ⇒ x assigned to C1
      t = −1 ⇒ x assigned to C2


Example Problem
Animal or Vegetable?


Discriminative Classifiers
If the conditional distributions are normal, the best thing to do is to estimate the parameters of these distributions and use Bayesian decision theory to classify input vectors. Decision boundaries are then generally quadratic.

However, if the conditional distributions are not exactly normal, this generative approach will yield sub-optimal results.

An alternative is to build a discriminative classifier, which models the decision boundary directly rather than the conditional distributions themselves.
Linear Models for Classification
Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.

Example:
  2D input vector x
  Two discrete classes C1 and C2
  (Figure: the two classes plotted in the (x1, x2) plane.)


Two-Class Discriminant Function

y(x) = w^T x + w_0

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

Thus y(x) = 0 defines the decision boundary.

(Figure: the boundary y = 0 in the (x1, x2) plane separates region R1 (y > 0) from region R2 (y < 0); w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from the boundary, and −w_0/‖w‖ is the distance of the boundary from the origin.)


Two-Class Discriminant Function
y(x) = w^T x + w_0

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

For convenience, let

w̃ = (w_0, w_1, …, w_M)^T and x̃ = (1, x_1, …, x_M)^T

so we can express y(x) = w̃^T x̃.
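To make the augmented notation concrete, here is a tiny NumPy sketch (the array names and example values are mine, not from the slides) that evaluates y(x) = w̃^T x̃ and applies the decision rule:

```python
import numpy as np

# Augmented weight vector w~ = (w_0, w_1, ..., w_M)^T and input x~ = (1, x_1, ..., x_M)^T
w_tilde = np.array([-1.0, 2.0, 0.5])        # w_0 = -1, w = (2, 0.5)
x = np.array([0.3, 1.2])                     # raw input
x_tilde = np.concatenate(([1.0], x))         # prepend the 1 that multiplies w_0

y = w_tilde @ x_tilde                        # y(x) = w~^T x~
assigned_class = "C1" if y >= 0 else "C2"    # decision rule from the slide
```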


Generalized Linear Models
For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1.

For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically will constrain y to lie between −1 and 1 or between 0 and 1:

y(x) = f(w^T x + w_0)


The Perceptron
y(x) = f(w^T x + w_0)

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

A classifier based upon this simple generalized linear model is called a (single-layer) perceptron.

It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.


End of Lecture
Oct 15, 2012
Parameter Learning
How do we learn the parameters of a perceptron?


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Case 1. Linearly Separable Inputs
For starters, let's assume that the training data are in fact perfectly linearly separable.

In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error.

We seek an algorithm that can automatically find such a hyperplane.


The Perceptron Algorithm
The perceptron algorithm was invented by Frank Rosenblatt (1962).

The algorithm is iterative.

The strategy is to start with a random guess at the weights w, and to then iteratively change the weights to move the hyperplane in a direction that lowers the classification error.

(Photo: Frank Rosenblatt, 1928-1971.)


The Perceptron Algorithm
Note that as we change the weights continuously, the classification error changes in a discontinuous, piecewise-constant fashion. Thus we cannot use the classification error per se as our objective function to minimize.

What would be a better objective function?


The Perceptron Criterion
Note that we seek w such that

w^T x ≥ 0 when t = +1
w^T x < 0 when t = −1

In other words, we would like

w^T x_n t_n ≥ 0 ∀ n

Thus we seek to minimize

E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.
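As a concrete illustration, here is a minimal NumPy sketch of this criterion (the function and variable names are my own; the inputs are assumed to be in augmented form, with a leading 1 for the bias):

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """E_P(w) = -sum over misclassified n of w^T x_n t_n.

    X : (N, D+1) array of augmented inputs (leading 1 for the bias term).
    t : (N,) array of targets in {-1, +1}.
    """
    scores = X @ w                                  # w^T x_n for every input
    misclassified = (scores >= 0) != (t == 1)       # the set M under the decision rule
    return -np.sum(scores[misclassified] * t[misclassified])
```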


The Perceptron Criterion
E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.

Observations:
  E_P(w) is always non-negative.
  E_P(w) is continuous and piecewise linear, and thus easier to minimize.

(Figure: E_P(w) plotted against a weight w_i; it is continuous and piecewise linear.)
The Perceptron Algorithm
E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.

dE_P(w)/dw = −Σ_{n∈M} x_n t_n, where the derivative exists.

Gradient descent:

w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n


The Perceptron Algorithm
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n

Why does this make sense?

If an input from C1 (t = +1) is misclassified, we need to make its projection on w more positive.

If an input from C2 (t = −1) is misclassified, we need to make its projection on w more negative.


The Perceptron Algorithm
The algorithm can be implemented sequentially. Repeat until convergence:

  For each input (x_n, t_n):
    If it is correctly classified, do nothing.
    If it is misclassified, update the weight vector to be
      w^(τ+1) = w^(τ) + η x_n t_n
    Note that this will lower the contribution of input n to the objective function:
      −(w^(τ+1))^T x_n t_n = −(w^(τ))^T x_n t_n − η (x_n t_n)^T (x_n t_n) < −(w^(τ))^T x_n t_n
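To make the sequential procedure concrete, here is a minimal NumPy sketch (the function name, the initialization, and the max_epochs cap are my own choices, not part of the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Sequential perceptron learning on augmented inputs.

    X : (N, D+1) augmented inputs (leading 1 for the bias term w_0).
    t : (N,) targets in {-1, +1}.
    Returns the learned weights (zero training error only if the data are separable).
    """
    rng = np.random.default_rng(0)
    w = 0.01 * rng.standard_normal(X.shape[1])        # random initial guess at the weights
    for _ in range(max_epochs):
        n_mistakes = 0
        for x_n, t_n in zip(X, t):
            if (w @ x_n >= 0) != (t_n == 1):          # misclassified under the decision rule
                w = w + eta * x_n * t_n               # move the hyperplane toward x_n
                n_mistakes += 1
        if n_mistakes == 0:                           # converged: all inputs classified correctly
            break
    return w
```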


Not Monotonic
While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase.

Also, inputs that had been classified correctly may now be misclassified.

The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.


The Perceptron Convergence Theorem
Despite this non-monotonicity, if the data are in fact linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).


Example
(Figure: four panels showing successive updates of the perceptron algorithm on a two-class example in 2D.)


The First Learning Machine
Mark 1 Perceptron hardware (c. 1960):
  visual inputs;
  a patch board allowing configuration of inputs;
  a rack of adaptive weights w (motor-driven potentiometers).


Practical Limitations
The Perceptron Convergence Theorem is an important result. However, there are practical limitations:

  Convergence may be slow.
  If the data are not separable, the algorithm will not converge.
  We will only know that the data are separable once the algorithm converges.
  The solution is in general not unique, and will depend upon initialization, the scheduling of input vectors, and the learning rate η.


Generalization to inputs that are not linearly separable.
The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.

Example: the Pocket Algorithm (Gallant, 1990).

Idea:
  Run the perceptron algorithm.
  Keep track of the weight vector w* that has produced the best classification error achieved so far.
  It can be shown that w* will converge to an optimal solution with probability 1.
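A minimal sketch of the pocket idea, reusing the perceptron update above (the function and variable names are mine, and the fixed epoch budget is an arbitrary choice):

```python
import numpy as np

def train_pocket(X, t, eta=1.0, max_epochs=100):
    """Pocket algorithm sketch: run perceptron updates, but keep ('pocket') the
    weight vector with the fewest training errors seen so far."""
    w = np.zeros(X.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):
            if (w @ x_n >= 0) != (t_n == 1):               # misclassified: perceptron update
                w = w + eta * x_n * t_n
                errors = np.sum((X @ w >= 0) != (t == 1))  # training errors of the new w
                if errors < best_errors:
                    best_w, best_errors = w.copy(), errors
    return best_w
```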


Generalization to Multiclass Problems
How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?


K>2 Classes
Idea #1: Just use K−1 discriminant functions, each of which separates one class C_k from the rest (one-versus-the-rest classifier).

Problem: ambiguous regions.

(Figure: two one-versus-the-rest discriminants (not C1, not C2) define regions R1, R2, R3; the region claimed by both or neither discriminant is ambiguous.)


K>2 Classes
Idea #2: Use K(K−1)/2 discriminant functions, each of which separates two classes C_j, C_k from each other (one-versus-one classifier). Each point is classified by majority vote.

Problem: ambiguous regions.

(Figure: three pairwise discriminants for classes C1, C2, C3 define regions R1, R2, R3, leaving a central region where the majority vote is ambiguous.)


K>2 Classes
Idea #3: Use K discriminant functions y_k(x), and use the magnitude of y_k(x), not just the sign:

y_k(x) = w_k^T x

x assigned to C_k if y_k(x) > y_j(x) ∀ j ≠ k

The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.,

(w_k − w_j)^T x + (w_k0 − w_j0) = 0

This results in decision regions that are simply connected and convex.

(Figure: regions R_i, R_j, R_k; any two points x_A, x_B within one region are joined by a line segment that stays inside the region.)
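As an illustration, a small NumPy sketch of this rule (assuming augmented inputs and a weight matrix whose columns are the w_k; names are mine):

```python
import numpy as np

def classify_multiclass(W, X):
    """Assign each augmented input x to the class k with the largest y_k(x) = w_k^T x.

    W : (D+1, K) matrix whose k-th column is the augmented weight vector w_k.
    X : (N, D+1) augmented inputs.
    Returns an (N,) array of class indices in {0, ..., K-1}.
    """
    Y = X @ W                     # Y[n, k] = y_k(x_n)
    return np.argmax(Y, axis=1)   # largest discriminant wins
```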
Example: Kesler's Construction

The perceptron algorithm can be generalized to K-class classification problems.

Example: Kesler's construction
  Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.
  Inputs are then classified in class i if and only if
    w_i^T x > w_j^T x ∀ j ≠ i
  The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.


1-of-K Coding Scheme
When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme:

Input from class C_i ⇒ t = [0 … 0 1 0 … 0]^T, with the 1 in element i.
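A small NumPy sketch of this coding (the function name and zero-based class indices are my own conventions):

```python
import numpy as np

def one_of_k(labels, K):
    """1-of-K coding: class i maps to a length-K vector with a 1 in element i.

    labels : sequence of N class indices in {0, ..., K-1}.
    Returns an (N, K) target matrix T with T[n, labels[n]] = 1.
    """
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1
    return T

# e.g. one_of_k([2, 0], K=4) -> [[0, 0, 1, 0], [1, 0, 0, 0]]
```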


Computational Limitations of Perceptrons
Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing.

However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function.

This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.

(Figure: the XOR pattern in the (x1, x2) plane. Photo: Marvin Minsky, b. 1927.)


Multi-Layer Perceptrons
Minsky & Papert's book was widely misinterpreted as showing that artificial neural networks were inherently limited.

This contributed to a decline in the reputation of neural network research through the 70s and 80s.

However, their findings apply only to single-layer perceptrons. Multi-layer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Dealing with Non-Linearly Separable Inputs
The perceptron algorithm fails when the training data are not perfectly linearly separable.

Let's now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.


The Least Squares Method
In the least-squares method, we simply fit the (x, t) observations with a hyperplane y(x).

Note that this is kind of a weird idea, since the t values are binary (when K = 2), e.g., 0 or 1.

However, it can work pretty well.


Least Squares: Learning the Parameters
Assume D-dimensional input vectors x.

For each class k ∈ {1, …, K}:

y_k(x) = w_k^T x + w_k0, i.e., y(x) = W̃^T x̃

where x̃ = (1, x^T)^T and W̃ is the (D+1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T.


Learning the Parameters
Method #2: Least squares

y(x) = W̃^T x̃

Training dataset: (x_n, t_n), n = 1, …, N, where we use the 1-of-K coding scheme for t_n.

Let T be the N × K matrix whose nth row is t_n^T.
Let X̃ be the N × (D+1) matrix whose nth row is x̃_n^T.
Let R_D(W̃) = X̃ W̃ − T.

Then we define the error as

E_D(W̃) = (1/2) Σ_{i,j} R_ij² = (1/2) Tr{ R_D(W̃)^T R_D(W̃) }

Setting the derivative with respect to W̃ to 0 yields:

W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃† T  (X̃† is the pseudo-inverse of X̃)

Recall: ∂ Tr(AB) / ∂A = B^T.
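A minimal NumPy sketch of this solution via the pseudo-inverse (function names are mine; raw inputs are augmented with a leading column of ones):

```python
import numpy as np

def fit_least_squares(X, T):
    """Least-squares classifier: W~ = (X~^T X~)^{-1} X~^T T, computed via the pseudo-inverse.

    X : (N, D) raw inputs.
    T : (N, K) 1-of-K target matrix.
    Returns the (D+1, K) augmented weight matrix W~.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the 1 that multiplies w_k0
    return np.linalg.pinv(X_tilde) @ T

def predict_least_squares(W_tilde, X):
    """Classify by the largest component of y(x) = W~^T x~."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```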


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Fisher's Linear Discriminant

Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

Let m_1 = (1/N_1) Σ_{n∈C1} x_n and m_2 = (1/N_2) Σ_{n∈C2} x_n.

For example, we might choose w to maximize w^T (m_2 − m_1), subject to ‖w‖ = 1.

This leads to w ∝ m_2 − m_1.

However, if the conditional distributions are not isotropic, this is typically not optimal.

(Figure: two class distributions projected onto the direction m_2 − m_1.)


Fisher's Linear Discriminant

Let m_1′ = w^T m_1 and m_2′ = w^T m_2 be the conditional (class) means on the 1D subspace.

Let s_k² = Σ_{n∈C_k} (y_n − m_k′)² be the within-class variance on the subspace for class C_k.

The Fisher criterion is then

J(w) = (m_2′ − m_1′)² / (s_1² + s_2²)

This can be rewritten as

J(w) = (w^T S_B w) / (w^T S_W w)

where

S_B = (m_2 − m_1)(m_2 − m_1)^T is the between-class variance, and

S_W = Σ_{n∈C1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C2} (x_n − m_2)(x_n − m_2)^T is the within-class variance.

J(w) is maximized for w ∝ S_W^{−1} (m_2 − m_1).
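A small NumPy sketch of computing this direction (names are mine; the two classes are passed as separate arrays):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1 : (N1, D) inputs from class C1.
    X2 : (N2, D) inputs from class C2.
    Returns a unit-length projection direction w.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                          # S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)
```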


Connection to MVN Maximum Likelihood
J(w) is maximized for w ∝ S_W^{−1} (m_2 − m_1).

Recall that if the two distributions are normal with the same covariance Σ, the maximum likelihood classifier is linear, with

w ∝ Σ^{−1} (m_2 − m_1)

Further, note that S_W is proportional to the maximum likelihood estimator for Σ.

Thus FLD is equivalent to assuming MVN distributions with common covariance.
Connection to Least-Squares
Change the coding scheme used in the least-squares method to

t_n = N / N_1 for C1
t_n = −N / N_2 for C2

Then one can show that the ML w satisfies

w ∝ S_W^{−1} (m_2 − m_1)


End of Lecture
October 17, 2012
Problems with Least Squares
Problem #1: Sensitivity to outliers

(Figure: two scatter plots of the same two-class data, without and with additional outlying points; the least-squares decision boundary shifts substantially when the outliers are added.)


Problems with Least Squares
Problem #2: A linear activation function is not a good fit to binary data. This can lead to problems.

(Figure: an example illustrating this problem.)


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Logistic Regression (K = 2)
p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)

where σ(a) = 1 / (1 + exp(−a))

(Figure: the logistic regression model in 1D and 2D.)
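A minimal NumPy sketch of these two equations (names are mine; Phi is the design matrix with one feature vector φ_n per row):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posteriors(w, Phi):
    """Return p(C1 | phi) and p(C2 | phi) for each row phi of Phi."""
    p_c1 = sigmoid(Phi @ w)       # p(C1 | phi) = sigma(w^T phi)
    return p_c1, 1.0 - p_c1       # p(C2 | phi) = 1 - p(C1 | phi)
```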
Logistic Regression
p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)

where σ(a) = 1 / (1 + exp(−a))

Number of parameters (for an M-dimensional feature space):
  Logistic regression: M
  Gaussian model: 2M + 2·M(M+1)/2 + 1 = M² + 3M + 1


ML for Logistic Regression
p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}

where t = (t_1, …, t_N)^T and y_n = p(C1 | φ_n).

We define the error function to be E(w) = −log p(t | w).

Given y_n = σ(a_n) and a_n = w^T φ_n, one can show that

∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n

Unfortunately, there is no closed-form solution for w.
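As an illustration, a short NumPy sketch of the error and its gradient (names are mine; targets are assumed to be coded 0/1):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_and_gradient(w, Phi, t):
    """E(w) = -log p(t | w) and its gradient sum_n (y_n - t_n) phi_n.

    Phi : (N, M) design matrix whose nth row is phi_n^T.
    t   : (N,) binary targets in {0, 1}.
    """
    y = sigmoid(Phi @ w)                                      # y_n = sigma(w^T phi_n)
    E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))      # cross-entropy error
    grad = Phi.T @ (y - t)                                    # sum_n (y_n - t_n) phi_n
    return E, grad
```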


ML for Logistic Regression:
Iterative Reweighted Least Squares

Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex. Thus an appropriate iterative method is guaranteed to find the exact solution.

A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):

w^(new) = w^(old) − H^{−1} ∇E(w)

where H is the Hessian matrix of E(w).


The Hessian Matrix H
H = ∇_w ∇_w E(w), i.e., H_ij = ∂²E(w) / ∂w_i ∂w_j

H_ij describes how the ith component of the gradient varies as we move in the w_j direction.

Let u be any unit vector. Then:
  Hu describes the variation in the gradient as we move in the direction u.
  u^T H u describes the projection of this variation onto u.

Thus u^T H u measures how much the gradient is changing in the u direction as we move in the u direction.


ML for Logistic Regression
w^(new) = w^(old) − H^{−1} ∇E(w)

where H is the Hessian matrix of E(w):

H = Φ^T R Φ

where R is the N × N diagonal weight matrix with R_nn = y_n (1 − y_n), and Φ is the N × M design matrix whose nth row is given by φ_n^T.

(Note that, since R_nn ≥ 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

Thus

w^(new) = w^(old) − (Φ^T R Φ)^{−1} Φ^T (y − t)

See Problem 8.3 in the text!
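To make the iteration concrete, here is a minimal sketch of the IRLS / Newton-Raphson loop (names and the fixed iteration count are my own; no convergence check or regularization is included):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iters=10):
    """Newton-Raphson (IRLS) for logistic regression:
    w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t), with R_nn = y_n (1 - y_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                   # current predictions y_n
        R = np.diag(y * (1 - y))               # N x N diagonal weight matrix
        H = Phi.T @ R @ Phi                    # Hessian of E(w)
        grad = Phi.T @ (y - t)                 # gradient of E(w)
        w = w - np.linalg.solve(H, grad)       # Newton-Raphson step
    return w
```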


ML for Logistic Regression
Iterative Reweighted Least Squares

p(C1 | φ) = y(φ) = σ(w^T φ)

(Figure: the fitted logistic regression model in 1D and 2D.)


Logistic Regression
For K > 2, we can generalize the activation function by modeling the posterior probabilities as

p(C_k | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)

where the activations a_k are given by

a_k = w_k^T φ
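A small NumPy sketch of this softmax activation (names are mine; the max is subtracted for numerical stability, which does not change the result):

```python
import numpy as np

def softmax_posteriors(W, Phi):
    """p(C_k | phi) = exp(a_k) / sum_j exp(a_j), with a_k = w_k^T phi.

    W   : (M, K) matrix whose k-th column is w_k.
    Phi : (N, M) design matrix with rows phi_n^T.
    Returns an (N, K) matrix of posterior probabilities (rows sum to 1).
    """
    A = Phi @ W                                   # activations a_k for each input
    A = A - A.max(axis=1, keepdims=True)          # shift for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)
```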


Example
(Figure: two panels comparing the decision boundaries obtained on the same data by least-squares (left) and logistic (right) classifiers.)


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers