
Linear Discriminant Functions

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Generative vs Discriminant Approach

• Generative approaches estimate the discriminant function by first estimating the probability distribution of the patterns belonging to each class.

• Discriminant approaches estimate the discriminant function explicitly, without assuming a probability distribution.
Generative Approach
(case of two categories)
• It is more common to use a single discriminant function (dichotomizer) instead of two:

  Example: $g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$

• If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
Linear Discriminants
(case of two categories)
• The first step in the discriminative approach is to specify the form of the discriminant.

• A linear discriminant has the following form:

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$

• Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.

• If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
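
A minimal sketch of this two-class rule in Python (function and variable names are illustrative, not from the slides): it evaluates g(x) = wᵗx + w0 and returns the decided class.

import numpy as np

def linear_discriminant(x, w, w0):
    # g(x) = w^t x + w0
    return np.dot(w, x) + w0

def decide(x, w, w0):
    # Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0; g(x) = 0 is the boundary.
    g = linear_discriminant(x, w, w0)
    if g > 0:
        return 1   # omega_1
    elif g < 0:
        return 2   # omega_2
    return 0       # on the decision boundary (either class)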
Linear Discriminants (cont’d)
(case of two categories)

• The decision boundary g(x) = 0 is a hyperplane.

• The orientation of the hyperplane is determined by w and its location by w0:

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$

  – w is the normal to the hyperplane.
  – If w0 = 0, the hyperplane passes through the origin.

• Estimate w and w0 using a set of training examples xk.

Linear Discriminants (cont’d)
(case of two categories)
• The solution can be found by minimizing an error function (e.g., the "training error" or "empirical risk"):

  $J(\mathbf{w}, w_0) = \frac{1}{n} \sum_{k=1}^{n} [z_k - \hat{z}_k]^2$

  true class label:      $z_k = \begin{cases} +1 & \text{if } \mathbf{x}_k \in \omega_1 \\ -1 & \text{if } \mathbf{x}_k \in \omega_2 \end{cases}$

  predicted class label: $\hat{z}_k = \begin{cases} +1 & \text{if } g(\mathbf{x}_k) \ge 0 \\ -1 & \text{if } g(\mathbf{x}_k) < 0 \end{cases}$

• Use "learning" algorithms to find the solution.
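
As an illustrative sketch (array and function names are assumptions, not from the slides), the empirical risk above can be computed directly with NumPy:

import numpy as np

def empirical_risk(X, z, w, w0):
    # X: n x d matrix of samples, z: true labels in {+1, -1}
    g = X @ w + w0                     # g(x_k) for all samples
    z_hat = np.where(g >= 0, 1, -1)    # predicted labels
    return np.mean((z - z_hat) ** 2)   # J(w, w0)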
Geometric Interpretation of g(x)

• g(x) provides an algebraic measure of the distance of x from the hyperplane.

• x can be expressed as its projection xp onto the hyperplane plus r times the unit normal, where r is the signed distance:

  $\mathbf{x} = \mathbf{x}_p + r \,\frac{\mathbf{w}}{\|\mathbf{w}\|}$

• Substitute this expression for x in $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$.
Geometric Interpretation of g(x) (cont’d)
• Substitute x in g(x):

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left(\mathbf{x}_p + r \,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0 = \mathbf{w}^t \mathbf{x}_p + w_0 + r \,\frac{\mathbf{w}^t \mathbf{w}}{\|\mathbf{w}\|} = r \,\|\mathbf{w}\|$

  since $\mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2$ and $\mathbf{w}^t \mathbf{x}_p + w_0 = 0$

  $g(\mathbf{x}) = r \,\|\mathbf{w}\|$
Geometric Interpretation of g(x) (cont’d)

• The distance of x from the hyperplane is given by:

  $r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$

• Setting x = 0 gives the distance of the origin from the hyperplane:

  $r = \frac{w_0}{\|\mathbf{w}\|}$
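
A small sketch (names assumed) of the signed-distance computation:

import numpy as np

def signed_distance(x, w, w0):
    # r = g(x) / ||w||: positive on the omega_1 side, negative on the omega_2 side
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

# Distance of the origin from the hyperplane: r = w0 / ||w||
origin_distance = signed_distance(np.zeros(3), np.array([1.0, 2.0, 2.0]), 3.0)  # = 3/3 = 1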
Linear Discriminant Functions:
multi-category case
• There are several ways to devise multi-category
classifiers using linear discriminant functions:

(1) One against the rest

problem: ambiguous regions
Linear Discriminant Functions:
multi-category case (cont’d)
(2) One against another (i.e., c(c-1)/2 pairs of classes)

problem: ambiguous regions
Linear Discriminant Functions:
multi-category case (cont’d)
• To avoid the problem of ambiguous regions:
  – Define c linear discriminant functions.
  – Assign x to ωi if gi(x) > gj(x) for all j ≠ i.

• The resulting classifier is called a linear machine (see Chapter 2).
Linear Discriminant Functions:
multi-category case (cont’d)
• A linear machine divides the feature space into c convex decision regions.
  – If x is in region Ri, then gi(x) is the largest discriminant.

• Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.
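
A minimal linear-machine sketch in NumPy (the weight matrix W and bias vector w0 are assumed to have been learned already):

import numpy as np

def linear_machine(x, W, w0):
    # W: c x d matrix of weight vectors, w0: length-c vector of biases
    # g_i(x) = w_i^t x + w_i0; assign x to the class with the largest g_i(x)
    g = W @ x + w0
    return int(np.argmax(g))   # index i of the winning class omega_i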
Geometric Interpretation:
multi-category case
• The decision boundary between adjacent
regions Ri and Rj is a portion of the hyperplane
Hij given by:
  $g_i(\mathbf{x}) = g_j(\mathbf{x})$  or  $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$
  or  $(\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$

• (wi − wj) is normal to Hij, and the signed distance from x to Hij is:

  $r = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$
Higher Order Discriminant Functions

• Higher order discriminants yield more complex decision boundaries than linear discriminant functions.
Linear Discriminants – Alternative
Definition
• Augmented feature/parameter space:
  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + x_0 w_0 = \sum_{i=0}^{d} w_i x_i = \boldsymbol{\alpha}^t \mathbf{y}$   (with $x_0 = 1$)

  $\mathbf{w} = (w_1, w_2, \ldots, w_d)^t$,  $\mathbf{x} = (x_1, x_2, \ldots, x_d)^t$
  $\boldsymbol{\alpha} = (w_0, w_1, \ldots, w_d)^t$  (d+1 parameters),  $\mathbf{y} = (x_0, x_1, \ldots, x_d)^t$  (d+1 features)
Linear Discriminants – Alternative
Definition (cont’d)

• Discriminant: $g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$

  It separates points in the (d+1)-dimensional space by a hyperplane which passes through the origin.

• Classification rule:
  If $\boldsymbol{\alpha}^t \mathbf{y}_i > 0$, assign yi to ω1;
  else if $\boldsymbol{\alpha}^t \mathbf{y}_i < 0$, assign yi to ω2.
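
A sketch of the augmented representation (helper names are assumptions, not from the slides):

import numpy as np

def augment(x):
    # y = (1, x_1, ..., x_d)^t  (x_0 = 1)
    return np.concatenate(([1.0], x))

def classify_augmented(y, alpha):
    # g(x) = alpha^t y; the hyperplane passes through the origin of the (d+1)-space
    g = np.dot(alpha, y)
    return 1 if g > 0 else 2   # omega_1 if g > 0, omega_2 if g < 0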
Generalized Discriminants
• A generalized discriminant can be obtained by first mapping the data to a space of higher dimensionality:

  d → d̂, where d̂ >> d

• This is done by transforming the data through properly chosen functions yi(x), i = 1, 2, …, d̂ (called φ functions):

  $\mathbf{x} = (x_1, x_2, \ldots, x_d)^t \;\xrightarrow{\;\varphi\;}\; \mathbf{y} = (y_1(\mathbf{x}), y_2(\mathbf{x}), \ldots, y_{\hat d}(\mathbf{x}))^t$
Generalized Discriminants (cont’d)
• A generalized discriminant is defined as a linear discriminant in the d̂-dimensional space:

  $g(\mathbf{x}) = \sum_{i=1}^{\hat d} a_i y_i(\mathbf{x})$   or   $g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$

  where the new features are $\mathbf{x} = (x_1, \ldots, x_d)^t \;\xrightarrow{\;\varphi\;}\; \mathbf{y} = (y_1(\mathbf{x}), \ldots, y_{\hat d}(\mathbf{x}))^t$
Generalized Discriminants (cont’d)


  $g(\mathbf{x}) = \sum_{i=1}^{\hat d} a_i y_i(\mathbf{x})$   or   $g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$

• Why are generalized discriminants attractive?

• By properly choosing the φ functions, a problem which is not linearly separable in the d-dimensional space might become linearly separable in the d̂-dimensional space!
Example
  $g(x) > 0$ if $x < -1$ or $x > 0.5$

• The corresponding decision regions R1, R2 in the 1D space are not simply connected (not linearly separable).

• Consider the following mapping and parameters:

  $\mathbf{y} = (y_1(x),\, y_2(x),\, y_3(x))^t = (1,\, x,\, x^2)^t$,   $\boldsymbol{\alpha} = (-1,\, 1,\, 2)^t$

  Discriminant: $g(x) = \boldsymbol{\alpha}^t \mathbf{y}$  or  $g(x) = -1 + x + 2x^2$

  d = 1  →  d̂ = 3
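
A quick numeric check of this example (a sketch, assuming the reconstructed values α = (−1, 1, 2)ᵗ):

import numpy as np

def phi(x):
    # Map scalar x to y = (1, x, x^2)^t, i.e., d = 1 -> d_hat = 3
    return np.array([1.0, x, x * x])

alpha = np.array([-1.0, 1.0, 2.0])

for x in [-2.0, -0.5, 0.0, 1.0]:
    g = np.dot(alpha, phi(x))          # g(x) = -1 + x + 2x^2
    print(x, g, "omega_1" if g > 0 else "omega_2")
# g > 0 (omega_1) for x < -1 or x > 0.5; g < 0 (omega_2) in between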
Example (cont’d)

• The mapping φ takes the line (the d-space) to a parabola in the d̂-space.

• The problem has now become linearly separable!

• The plane $\boldsymbol{\alpha}^t \mathbf{y} = 0$ divides the d̂-space into two decision regions $\hat{R}_1, \hat{R}_2$.
Learning: linearly separable case
(two categories)

• Given a linear discriminant function

  $g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}$

  the goal is to "learn" the parameters (weights) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.
Learning: effect of training examples

• Every training sample yi places a constraint on the weight vector α.

• Visualize the solution in "feature space":
  – αᵗy = 0 defines a hyperplane in the feature space, with α being the normal vector.
  – Given n examples, the solution α must lie within a certain region.
Learning: effect of training examples
(cont’d)

• Visualize the solution in "parameter space" (a1, a2):
  – αᵗy = 0 defines a hyperplane in the parameter space, with y being the normal vector.
  – Given n examples, the solution α must lie on the intersection of n half-spaces.
Uniqueness of Solution

• The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, e.g.:

  "Find the unit-length weight vector α that maximizes the minimum distance from the training examples to the separating plane."
“Learning” Using Iterative
Optimization
• Minimize an error function J(α) (e.g., the classification error) with respect to α:

  $J(\boldsymbol{\alpha}) = \frac{1}{n} \sum_{k=1}^{n} [z_k - \hat{z}_k]^2$

• Minimize iteratively:

  $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k$

  where pk is the search direction and η(k) is the learning rate (search step).

• How should we choose pk?
Choosing pk using Gradient Descent
  $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k$
  $\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))$
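
A generic gradient-descent sketch for this update rule (the gradient function, starting point, fixed learning rate, and stopping threshold are assumptions for illustration):

import numpy as np

def gradient_descent(grad_J, alpha0, eta=0.1, max_iter=1000, tol=1e-6):
    # alpha(k+1) = alpha(k) - eta * grad J(alpha(k)), with a fixed eta here
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(max_iter):
        p = -grad_J(alpha)             # search direction p_k
        if np.linalg.norm(p) < tol:    # stop when the gradient is (almost) zero
            break
        alpha = alpha + eta * p
    return alpha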
Gradient Descent (cont’d)
(Figure: contours of J(α) in the search space, with the trajectory from α(0) to α(k).)
Gradient Descent (cont’d)
• What is the effect of the learning rate η(k)?

  (Figure: two gradient-descent runs on the same J(α).)
  η = 0.37: slow, but converges to the solution.
  η = 0.39: fast, but overshoots the solution.

Gradient Descent (cont’d)
• How can we choose the learning rate η(k)?
  – We need the Taylor series expansion of J(α).

• Taylor expansion of f(x) around x0:

  $f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2 + \cdots$
Gradient Descent (cont’d)
• Expand J(α) around α(k) using a Taylor series (up to second derivatives):

  $J(\boldsymbol{\alpha}) \approx J(\boldsymbol{\alpha}(k)) + \nabla J^t (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k)) + \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))^t \mathbf{H}\, (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))$

  where H is the Hessian (matrix of second derivatives).

• Evaluating J(α) at α = α(k+1) = α(k) − η(k)∇J and minimizing with respect to η(k) gives the optimum learning rate:

  $\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H}\, \nabla J}$

  This is expensive in practice (it requires the Hessian)!
Choosing pk using Newton’s Method
  $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k$
  $\eta(k)\, \mathbf{p}_k = -\mathbf{H}^{-1} \nabla J(\boldsymbol{\alpha}(k))$

  (requires inverting H)
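
A sketch of a single Newton step (grad_J and hessian_J are assumed user-supplied callables); solving the linear system is preferred to explicitly inverting H:

import numpy as np

def newton_step(alpha, grad_J, hessian_J):
    # alpha(k+1) = alpha(k) - H^{-1} grad J(alpha(k))
    g = grad_J(alpha)
    H = hessian_J(alpha)
    return alpha - np.linalg.solve(H, g)   # solve H p = g instead of inverting H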
Newton’s method (cont’d)

• If J(α) is quadratic, Newton's method converges in one iteration!

  (Figure: contours of a quadratic J(α).)
Gradient descent vs Newton’s method

(Figure: trajectory of gradient descent vs. trajectory of Newton's method on the same error surface.)
“Dual” Classification Problem
• Original problem: if αᵗyi > 0, assign yi to ω1; else if αᵗyi < 0, assign yi to ω2. Seek a hyperplane that separates patterns from different categories.

• "Dual" (normalized) problem: if yi ∈ ω2, replace yi by −yi, and find α such that αᵗyi > 0. Seek a hyperplane that puts the normalized patterns on the same (positive) side.
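
A sketch of this normalization trick (array names assumed):

import numpy as np

def normalize_samples(Y, labels):
    # Y: n x (d+1) matrix of augmented samples; labels: array of 1 (omega_1) or 2 (omega_2)
    # Replace y_i by -y_i for omega_2 samples, so a solution alpha satisfies alpha^t y_i > 0 for all i.
    Y = np.array(Y, dtype=float)
    labels = np.asarray(labels)
    Y[labels == 2] *= -1
    return Y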
Perceptron rule
• The perceptron rule minimizes the following error:

  $J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})$

  where Y(α) is the set of samples misclassified by α.

• If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.

• Goal: find α such that αᵗyi > 0 for all i.
Perceptron rule (cont’d)
• Apply gradient descent using Jp(α). The gradient of Jp(α) is:

  $J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})$   $\Rightarrow$   $\nabla J_p = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})$

• The update rule becomes:

  $\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k) \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} \mathbf{y}$
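
A batch-perceptron sketch of this update (names and the fixed learning rate are assumptions); Y holds the normalized augmented samples, so a sample is misclassified whenever αᵗy ≤ 0:

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    # Y: n x (d+1) normalized augmented samples (omega_2 samples already negated)
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ alpha <= 0]          # Y(alpha): samples with alpha^t y <= 0
        if len(misclassified) == 0:                # all samples on the positive side
            break
        alpha = alpha + eta * misclassified.sum(axis=0)
    return alpha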
Perceptron rule (cont’d)

(Figure: misclassified examples relative to the current hyperplane.)
Perceptron rule (cont’d)

• Keep changing the orientation of the hyperplane until all training samples are on its positive side.

  (Figure: example of successive weight vectors in the (a1, a2) parameter space.)
Perceptron rule (cont’d)

• Single-sample variant (η(k) = 1): update on one misclassified example at a time,

  $\boldsymbol{\alpha} \leftarrow \boldsymbol{\alpha} + \mathbf{y}^k$

• Perceptron Convergence Theorem: if the training samples are linearly separable, then the perceptron algorithm will terminate at a solution vector in a finite number of steps.
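
A single-sample perceptron sketch with η = 1, cycling through the normalized samples (helper names assumed):

import numpy as np

def single_sample_perceptron(Y, max_epochs=100):
    # Y: n x (d+1) normalized augmented samples; update on one misclassified sample at a time
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:
            if np.dot(alpha, y) <= 0:   # y is misclassified
                alpha = alpha + y       # alpha <- alpha + y_k  (eta = 1)
                errors += 1
        if errors == 0:                 # converged: all samples on the positive side
            return alpha
    return alpha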
Perceptron rule (cont’d)

  (Figure: single-sample updates with the examples presented in the order y2, y3, y1, y3.)

• The "batch" algorithm leads to a smoother trajectory in solution space.
Quiz
• Next quiz on “Linear Discriminant Functions”
• When: Tuesday, April 23rd
