
Last updated: Oct 22, 2012

LINEAR CLASSIFIERS
J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Problems
Please do Problem 8.3 in the textbook. We will discuss this in class.


Classification: Problem Statement
In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t.

In classification, the input variable x may still be continuous, but the target variable is discrete. In the simplest case, t can have only 2 values.

e.g., let t = +1 ⇒ x assigned to C1
      t = −1 ⇒ x assigned to C2


Example Problem
Animal or Vegetable?


Discriminative Classifiers
If the conditional distributions are normal, the best thing to do is to estimate the parameters of these distributions and use Bayesian decision theory to classify input vectors. Decision boundaries are then generally quadratic.

However, if the conditional distributions are not exactly normal, this generative approach will yield sub-optimal results.

An alternative is to build a discriminative classifier, which models the decision boundary directly rather than the conditional distributions themselves.
Linear Models for Classification
Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.

Example:
  2D input vector x
  Two discrete classes C1 and C2
  (Figure: the two classes plotted in the (x1, x2) plane.)


Two-Class Discriminant Function

y(x) = w^T x + w_0

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

Thus y(x) = 0 defines the decision boundary.

(Figure: the boundary y = 0 in the (x1, x2) plane separates region R1 (y > 0) from region R2 (y < 0); w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from the boundary, and −w_0/‖w‖ is the distance of the boundary from the origin.)


Two-Class Discriminant Function
y(x) = w^T x + w_0

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

For convenience, let

w̃ = (w_0, w_1, …, w_M)^T and x̃ = (1, x_1, …, x_M)^T

so we can express y(x) = w̃^T x̃.
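To make the augmented notation concrete, here is a tiny NumPy sketch (the array names and example values are mine, not from the slides) that evaluates y(x) = w̃^T x̃ and applies the decision rule:

```python
import numpy as np

# Augmented weight vector w~ = (w_0, w_1, ..., w_M)^T and input x~ = (1, x_1, ..., x_M)^T
w_tilde = np.array([-1.0, 2.0, 0.5])        # w_0 = -1, w = (2, 0.5)
x = np.array([0.3, 1.2])                     # raw input
x_tilde = np.concatenate(([1.0], x))         # prepend the 1 that multiplies w_0

y = w_tilde @ x_tilde                        # y(x) = w~^T x~
assigned_class = "C1" if y >= 0 else "C2"    # decision rule from the slide
```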


Generalized Linear Models
For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1.

For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically will constrain y to lie between −1 and 1 or between 0 and 1:

y(x) = f(w^T x + w_0)


The Perceptron
y(x) = f(w^T x + w_0)

y(x) ≥ 0 ⇒ x assigned to C1
y(x) < 0 ⇒ x assigned to C2

A classifier based upon this simple generalized linear model is called a (single-layer) perceptron.

It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.


End of Lecture
Oct 15, 2012
Parameter Learning
How do we learn the parameters of a perceptron?


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Case 1. Linearly Separable Inputs
For starters, let's assume that the training data are in fact perfectly linearly separable.

In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error.

We seek an algorithm that can automatically find such a hyperplane.


The Perceptron Algorithm
The perceptron algorithm was invented by Frank Rosenblatt (1962).

The algorithm is iterative.

The strategy is to start with a random guess at the weights w, and to then iteratively change the weights to move the hyperplane in a direction that lowers the classification error.

(Photo: Frank Rosenblatt, 1928-1971.)


The Perceptron Algorithm
Note that as we change the weights continuously, the classification error changes in a discontinuous, piecewise-constant fashion. Thus we cannot use the classification error per se as our objective function to minimize.

What would be a better objective function?


The Perceptron Criterion
Note that we seek w such that

w^T x ≥ 0 when t = +1
w^T x < 0 when t = −1

In other words, we would like

w^T x_n t_n ≥ 0 ∀ n

Thus we seek to minimize

E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.
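As a concrete illustration, here is a minimal NumPy sketch of this criterion (the function and variable names are my own; the inputs are assumed to be in augmented form, with a leading 1 for the bias):

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """E_P(w) = -sum over misclassified n of w^T x_n t_n.

    X : (N, D+1) array of augmented inputs (leading 1 for the bias term).
    t : (N,) array of targets in {-1, +1}.
    """
    scores = X @ w                                  # w^T x_n for every input
    misclassified = (scores >= 0) != (t == 1)       # the set M under the decision rule
    return -np.sum(scores[misclassified] * t[misclassified])
```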


The Perceptron Criterion
E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.

Observations:
  E_P(w) is always non-negative.
  E_P(w) is continuous and piecewise linear, and thus easier to minimize.

(Figure: E_P(w) plotted against a weight w_i; it is continuous and piecewise linear.)
The Perceptron Algorithm
E_P(w) = −Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.

dE_P(w)/dw = −Σ_{n∈M} x_n t_n, where the derivative exists.

Gradient descent:

w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n


The Perceptron Algorithm
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n

Why does this make sense?

If an input from C1 (t = +1) is misclassified, we need to make its projection on w more positive.

If an input from C2 (t = −1) is misclassified, we need to make its projection on w more negative.


The Perceptron Algorithm
The algorithm can be implemented sequentially. Repeat until convergence:

  For each input (x_n, t_n):
    If it is correctly classified, do nothing.
    If it is misclassified, update the weight vector to be
      w^(τ+1) = w^(τ) + η x_n t_n
    Note that this will lower the contribution of input n to the objective function:
      −(w^(τ+1))^T x_n t_n = −(w^(τ))^T x_n t_n − η (x_n t_n)^T (x_n t_n) < −(w^(τ))^T x_n t_n
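To make the sequential procedure concrete, here is a minimal NumPy sketch (the function name, the initialization, and the max_epochs cap are my own choices, not part of the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Sequential perceptron learning on augmented inputs.

    X : (N, D+1) augmented inputs (leading 1 for the bias term w_0).
    t : (N,) targets in {-1, +1}.
    Returns the learned weights (zero training error only if the data are separable).
    """
    rng = np.random.default_rng(0)
    w = 0.01 * rng.standard_normal(X.shape[1])        # random initial guess at the weights
    for _ in range(max_epochs):
        n_mistakes = 0
        for x_n, t_n in zip(X, t):
            if (w @ x_n >= 0) != (t_n == 1):          # misclassified under the decision rule
                w = w + eta * x_n * t_n               # move the hyperplane toward x_n
                n_mistakes += 1
        if n_mistakes == 0:                           # converged: all inputs classified correctly
            break
    return w
```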


Not Monotonic
While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase.

Also, inputs that had been classified correctly may now be misclassified.

The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.


The Perceptron Convergence Theorem
Despite this non-monotonicity, if the data are in fact linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).


Example
(Figure: four panels showing successive updates of the perceptron algorithm on a two-class example in 2D.)


The First Learning Machine
Mark 1 Perceptron hardware (c. 1960):
  visual inputs;
  a patch board allowing configuration of inputs;
  a rack of adaptive weights w (motor-driven potentiometers).


Practical Limitations
The Perceptron Convergence Theorem is an important result. However, there are practical limitations:

  Convergence may be slow.
  If the data are not separable, the algorithm will not converge.
  We will only know that the data are separable once the algorithm converges.
  The solution is in general not unique, and will depend upon initialization, the scheduling of input vectors, and the learning rate η.


Generalization to inputs that are not linearly separable.
The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.

Example: the Pocket Algorithm (Gallant, 1990).

Idea:
  Run the perceptron algorithm.
  Keep track of the weight vector w* that has produced the best classification error achieved so far.
  It can be shown that w* will converge to an optimal solution with probability 1.
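A minimal sketch of the pocket idea, reusing the perceptron update above (the function and variable names are mine, and the fixed epoch budget is an arbitrary choice):

```python
import numpy as np

def train_pocket(X, t, eta=1.0, max_epochs=100):
    """Pocket algorithm sketch: run perceptron updates, but keep ('pocket') the
    weight vector with the fewest training errors seen so far."""
    w = np.zeros(X.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):
            if (w @ x_n >= 0) != (t_n == 1):               # misclassified: perceptron update
                w = w + eta * x_n * t_n
                errors = np.sum((X @ w >= 0) != (t == 1))  # training errors of the new w
                if errors < best_errors:
                    best_w, best_errors = w.copy(), errors
    return best_w
```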


Generalization to Multiclass Problems
How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?


K>2 Classes
Idea #1: Just use K−1 discriminant functions, each of which separates one class C_k from the rest (one-versus-the-rest classifier).

Problem: ambiguous regions.

(Figure: two one-versus-the-rest discriminants (not C1, not C2) define regions R1, R2, R3; the region claimed by both or neither discriminant is ambiguous.)


K>2 Classes
Idea #2: Use K(K−1)/2 discriminant functions, each of which separates two classes C_j, C_k from each other (one-versus-one classifier). Each point is classified by majority vote.

Problem: ambiguous regions.

(Figure: three pairwise discriminants for classes C1, C2, C3 define regions R1, R2, R3, leaving a central region where the majority vote is ambiguous.)


K>2 Classes
Idea #3: Use K discriminant functions y_k(x), and use the magnitude of y_k(x), not just the sign:

y_k(x) = w_k^T x

x assigned to C_k if y_k(x) > y_j(x) ∀ j ≠ k

The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.,

(w_k − w_j)^T x + (w_k0 − w_j0) = 0

This results in decision regions that are simply connected and convex.

(Figure: regions R_i, R_j, R_k; any two points x_A, x_B within one region are joined by a line segment that stays inside the region.)
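As an illustration, a small NumPy sketch of this rule (assuming augmented inputs and a weight matrix whose columns are the w_k; names are mine):

```python
import numpy as np

def classify_multiclass(W, X):
    """Assign each augmented input x to the class k with the largest y_k(x) = w_k^T x.

    W : (D+1, K) matrix whose k-th column is the augmented weight vector w_k.
    X : (N, D+1) augmented inputs.
    Returns an (N,) array of class indices in {0, ..., K-1}.
    """
    Y = X @ W                     # Y[n, k] = y_k(x_n)
    return np.argmax(Y, axis=1)   # largest discriminant wins
```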
Example: Kesler's Construction

The perceptron algorithm can be generalized to K-class classification problems.

Example: Kesler's construction
  Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.
  Inputs are then classified in class i if and only if
    w_i^T x > w_j^T x ∀ j ≠ i
  The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.


1-of-K Coding Scheme
When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme:

Input from class C_i ⇒ t = [0 … 0 1 0 … 0]^T, with the 1 in element i.
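A small NumPy sketch of this coding (the function name and zero-based class indices are my own conventions):

```python
import numpy as np

def one_of_k(labels, K):
    """1-of-K coding: class i maps to a length-K vector with a 1 in element i.

    labels : sequence of N class indices in {0, ..., K-1}.
    Returns an (N, K) target matrix T with T[n, labels[n]] = 1.
    """
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1
    return T

# e.g. one_of_k([2, 0], K=4) -> [[0, 0, 1, 0], [1, 0, 0, 0]]
```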


Computational Limitations of Perceptrons
Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing.

However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function.

This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.

(Figure: the XOR pattern in the (x1, x2) plane. Photo: Marvin Minsky, b. 1927.)


Multi-Layer Perceptrons
Minsky & Papert's book was widely misinterpreted as showing that artificial neural networks were inherently limited.

This contributed to a decline in the reputation of neural network research through the 70s and 80s.

However, their findings apply only to single-layer perceptrons. Multi-layer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Dealing with Non-Linearly Separable Inputs
The perceptron algorithm fails when the training data are not perfectly linearly separable.

Let's now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.


The Least Squares Method
In the least-squares method, we simply fit the (x, t) observations with a hyperplane y(x).

Note that this is kind of a weird idea, since the t values are binary (when K = 2), e.g., 0 or 1.

However, it can work pretty well.


Least Squares: Learning the Parameters
Assume D-dimensional input vectors x.

For each class k ∈ {1, …, K}:

y_k(x) = w_k^T x + w_k0, i.e., y(x) = W̃^T x̃

where x̃ = (1, x^T)^T and W̃ is the (D+1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T.


Learning the Parameters
Method #2: Least squares

y(x) = W̃^T x̃

Training dataset: (x_n, t_n), n = 1, …, N, where we use the 1-of-K coding scheme for t_n.

Let T be the N × K matrix whose nth row is t_n^T.
Let X̃ be the N × (D+1) matrix whose nth row is x̃_n^T.
Let R_D(W̃) = X̃ W̃ − T.

Then we define the error as

E_D(W̃) = (1/2) Σ_{i,j} R_ij² = (1/2) Tr{ R_D(W̃)^T R_D(W̃) }

Setting the derivative with respect to W̃ to 0 yields:

W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃† T  (X̃† is the pseudo-inverse of X̃)

Recall: ∂ Tr(AB) / ∂A = B^T.
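A minimal NumPy sketch of this solution via the pseudo-inverse (function names are mine; raw inputs are augmented with a leading column of ones):

```python
import numpy as np

def fit_least_squares(X, T):
    """Least-squares classifier: W~ = (X~^T X~)^{-1} X~^T T, computed via the pseudo-inverse.

    X : (N, D) raw inputs.
    T : (N, K) 1-of-K target matrix.
    Returns the (D+1, K) augmented weight matrix W~.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the 1 that multiplies w_k0
    return np.linalg.pinv(X_tilde) @ T

def predict_least_squares(W_tilde, X):
    """Classify by the largest component of y(x) = W~^T x~."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```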


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Fisher's Linear Discriminant

Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

Let m_1 = (1/N_1) Σ_{n∈C1} x_n and m_2 = (1/N_2) Σ_{n∈C2} x_n.

For example, we might choose w to maximize w^T (m_2 − m_1), subject to ‖w‖ = 1.

This leads to w ∝ m_2 − m_1.

However, if the conditional distributions are not isotropic, this is typically not optimal.

(Figure: two class distributions projected onto the direction m_2 − m_1.)


Fisher's Linear Discriminant

Let m_1′ = w^T m_1 and m_2′ = w^T m_2 be the conditional (class) means on the 1D subspace.

Let s_k² = Σ_{n∈C_k} (y_n − m_k′)² be the within-class variance on the subspace for class C_k.

The Fisher criterion is then

J(w) = (m_2′ − m_1′)² / (s_1² + s_2²)

This can be rewritten as

J(w) = (w^T S_B w) / (w^T S_W w)

where

S_B = (m_2 − m_1)(m_2 − m_1)^T is the between-class variance, and

S_W = Σ_{n∈C1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C2} (x_n − m_2)(x_n − m_2)^T is the within-class variance.

J(w) is maximized for w ∝ S_W^{−1} (m_2 − m_1).
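A small NumPy sketch of computing this direction (names are mine; the two classes are passed as separate arrays):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1 : (N1, D) inputs from class C1.
    X2 : (N2, D) inputs from class C2.
    Returns a unit-length projection direction w.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                          # S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)
```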


Connection to MVN Maximum Likelihood
J(w) is maximized for w ∝ S_W^{−1} (m_2 − m_1).

Recall that if the two distributions are normal with the same covariance Σ, the maximum likelihood classifier is linear, with

w ∝ Σ^{−1} (m_2 − m_1)

Further, note that S_W is proportional to the maximum likelihood estimator for Σ.

Thus FLD is equivalent to assuming MVN distributions with common covariance.
Connection to Least-Squares
Change the coding scheme used in the least-squares method to

t_n = N / N_1 for C1
t_n = −N / N_2 for C2

Then one can show that the ML w satisfies

w ∝ S_W^{−1} (m_2 − m_1)


End of Lecture
October 17, 2012
Problems with Least Squares
Problem #1: Sensitivity to outliers

(Figure: two scatter plots of the same two-class data, without and with additional outlying points; the least-squares decision boundary shifts substantially when the outliers are added.)


Problems with Least Squares
Problem #2: A linear activation function is not a good fit to binary data. This can lead to problems.

(Figure: an example illustrating this problem.)


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers


Logistic Regression (K = 2)
p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)

where σ(a) = 1 / (1 + exp(−a))

(Figure: the logistic regression model in 1D and 2D.)
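A minimal NumPy sketch of these two equations (names are mine; Phi is the design matrix with one feature vector φ_n per row):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posteriors(w, Phi):
    """Return p(C1 | phi) and p(C2 | phi) for each row phi of Phi."""
    p_c1 = sigmoid(Phi @ w)       # p(C1 | phi) = sigma(w^T phi)
    return p_c1, 1.0 - p_c1       # p(C2 | phi) = 1 - p(C1 | phi)
```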
Logistic Regression
p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)

where σ(a) = 1 / (1 + exp(−a))

Number of parameters (for an M-dimensional feature space):
  Logistic regression: M
  Gaussian model: 2M + 2·M(M+1)/2 + 1 = M² + 3M + 1


ML for Logistic Regression
p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}

where t = (t_1, …, t_N)^T and y_n = p(C1 | φ_n).

We define the error function to be E(w) = −log p(t | w).

Given y_n = σ(a_n) and a_n = w^T φ_n, one can show that

∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n

Unfortunately, there is no closed-form solution for w.
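As an illustration, a short NumPy sketch of the error and its gradient (names are mine; targets are assumed to be coded 0/1):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_and_gradient(w, Phi, t):
    """E(w) = -log p(t | w) and its gradient sum_n (y_n - t_n) phi_n.

    Phi : (N, M) design matrix whose nth row is phi_n^T.
    t   : (N,) binary targets in {0, 1}.
    """
    y = sigmoid(Phi @ w)                                      # y_n = sigma(w^T phi_n)
    E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))      # cross-entropy error
    grad = Phi.T @ (y - t)                                    # sum_n (y_n - t_n) phi_n
    return E, grad
```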


ML for Logistic Regression:
Iterative Reweighted Least Squares

Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex. Thus an appropriate iterative method is guaranteed to find the exact solution.

A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):

w^(new) = w^(old) − H^{−1} ∇E(w)

where H is the Hessian matrix of E(w).


The Hessian Matrix H
H = ∇_w ∇_w E(w), i.e., H_ij = ∂²E(w) / ∂w_i ∂w_j

H_ij describes how the ith component of the gradient varies as we move in the w_j direction.

Let u be any unit vector. Then:
  Hu describes the variation in the gradient as we move in the direction u.
  u^T H u describes the projection of this variation onto u.

Thus u^T H u measures how much the gradient is changing in the u direction as we move in the u direction.


ML for Logistic Regression
w^(new) = w^(old) − H^{−1} ∇E(w)

where H is the Hessian matrix of E(w):

H = Φ^T R Φ

where R is the N × N diagonal weight matrix with R_nn = y_n (1 − y_n), and Φ is the N × M design matrix whose nth row is given by φ_n^T.

(Note that, since R_nn ≥ 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

Thus

w^(new) = w^(old) − (Φ^T R Φ)^{−1} Φ^T (y − t)

See Problem 8.3 in the text!
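To make the iteration concrete, here is a minimal sketch of the IRLS / Newton-Raphson loop (names and the fixed iteration count are my own; no convergence check or regularization is included):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iters=10):
    """Newton-Raphson (IRLS) for logistic regression:
    w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t), with R_nn = y_n (1 - y_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                   # current predictions y_n
        R = np.diag(y * (1 - y))               # N x N diagonal weight matrix
        H = Phi.T @ R @ Phi                    # Hessian of E(w)
        grad = Phi.T @ (y - t)                 # gradient of E(w)
        w = w - np.linalg.solve(H, grad)       # Newton-Raphson step
    return w
```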


ML for Logistic Regression
Iterative Reweighted Least Squares

p(C1 | φ) = y(φ) = σ(w^T φ)

(Figure: the fitted logistic regression model in 1D and 2D.)


Logistic Regression
For K > 2, we can generalize the activation function by modeling the posterior probabilities as

p(C_k | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)

where the activations a_k are given by

a_k = w_k^T φ
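A small NumPy sketch of this softmax activation (names are mine; the max is subtracted for numerical stability, which does not change the result):

```python
import numpy as np

def softmax_posteriors(W, Phi):
    """p(C_k | phi) = exp(a_k) / sum_j exp(a_j), with a_k = w_k^T phi.

    W   : (M, K) matrix whose k-th column is w_k.
    Phi : (N, M) design matrix with rows phi_n^T.
    Returns an (N, K) matrix of posterior probabilities (rows sum to 1).
    """
    A = Phi @ W                                   # activations a_k for each input
    A = A - A.max(axis=1, keepdims=True)          # shift for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)
```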


Example
(Figure: two panels comparing the decision boundaries obtained on the same data by least-squares (left) and logistic (right) classifiers.)


Outline
The Perceptron Algorithm
Least-Squares Classifiers
Fisher's Linear Discriminant
Logistic Classifiers