LINEAR CLASSIFIERS
J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Problems
Animal or Vegetable?
[Figure: training data plotted in the (x_1, x_2) plane.]
y(x) = w^t x + w_0

y(x) ≥ 0 → x assigned to C_1

Thus y(x) = 0 defines the decision boundary.

[Figure: the boundary y = 0 in the (x_1, x_2) plane separates region R_1 (y > 0) from R_2 (y < 0); w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from the boundary, and w_0/‖w‖ sets the boundary's offset from the origin.]
y(x) = w^t x + w_0

y(x) ≥ 0 → x assigned to C_1
y(x) < 0 → x assigned to C_2

In augmented notation, with x = (x_1, …, x_M)^t, define x̃ = (1, x_1, …, x_M)^t and w̃ = (w_0, w^t)^t, so that y(x) = w̃^t x̃.
More generally, y(x) = f(w^t x + w_0), where f(·) is a nonlinear activation function.

y(x) ≥ 0 → x assigned to C_1
y(x) < 0 → x assigned to C_2
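To make the decision rule concrete, here is a minimal NumPy sketch (not from the slides; the weight values are made up for illustration):

```python
import numpy as np

# Hypothetical weights for a 2D example (made up for illustration)
w = np.array([1.0, -2.0])   # weight vector, normal to the decision boundary
w0 = 0.5                    # bias term

def y(x):
    """Linear discriminant y(x) = w^t x + w_0."""
    return w @ x + w0

def classify(x):
    """Assign x to C1 if y(x) >= 0, otherwise to C2."""
    return "C1" if y(x) >= 0 else "C2"

def signed_distance(x):
    """Signed distance of x from the decision boundary y(x) = 0."""
    return y(x) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(classify(x), signed_distance(x))
```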
Logistic Classifiers
The perceptron algorithm provides a way of finding such a hyperplane (a linear decision boundary separating the two classes).
E_P(w) = −Σ_{n∈M} w^t x_n t_n,  where M is the set of misclassified training inputs

Observations:
- E_P(w) is always non-negative.
- E_P(w) is continuous and piecewise linear, and thus easier to minimize.

[Plot: E_P(w) as a function of a weight component w_i.]
The Perceptron Algorithm
E_P(w) = −Σ_{n∈M} w^t x_n t_n

Gradient descent:

w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n
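A minimal NumPy sketch of the sequential perceptron update implied by this gradient (the function name, learning rate, and stopping rule are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Sequential perceptron learning on augmented inputs.

    X : (N, D) array of augmented input vectors (first column = 1 for the bias).
    t : (N,) array of targets in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:        # point is misclassified
                w += eta * x_n * t_n        # perceptron update
                misclassified += 1
        if misclassified == 0:              # all points correct: stop
            break
    return w
```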
Idea:
- Run the perceptron algorithm.
- Keep track of the weight vector w* that has produced the best classification error achieved so far.
- It can be shown that w* will converge to an optimal solution with probability 1 (a sketch follows below).
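A minimal sketch of this keep-the-best idea (often called the pocket algorithm), assuming the same perceptron update as above; names and the epoch limit are illustrative:

```python
import numpy as np

def pocket(X, t, eta=1.0, max_epochs=100):
    """Pocket variant of the perceptron: keep the best weights seen so far.

    X : (N, D) augmented inputs (first column = 1); t : (N,) targets in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:
                w = w + eta * x_n * t_n        # ordinary perceptron update
        errors = np.sum(t * (X @ w) <= 0)      # classification errors of current w
        if errors < best_errors:               # "pocket": remember the best w so far
            best_w, best_errors = w.copy(), errors
    return best_w
```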
[Figure: combining binary classifiers for K > 2 classes. Left: one-versus-the-rest discriminants ("not C_1", "not C_2") leave an ambiguous region. Right: one-versus-one discriminants between C_1, C_2, C_3 also leave an ambiguous region (marked "?"). Decision regions R_1, R_2, R_3.]
y_k(x) = w_k^t x + w_{k0}

Decision boundary between classes k and j: y_k(x) = y_j(x), i.e., (w_k − w_j)^t x + (w_{k0} − w_{j0}) = 0

[Figure: decision region R_k, with two points x_A and x_B and a point x on the line segment joining them.]
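As an illustration of the resulting decision rule, assign x to the class with the largest y_k(x); a minimal sketch with made-up weights:

```python
import numpy as np

# Hypothetical weights for K = 3 classes in augmented form (made up for illustration):
# row k holds (w_k0, w_k^t), so y_k(x) = W[k] @ x_tilde.
W = np.array([[ 0.1,  1.0, -0.5],
              [-0.2, -1.0,  0.3],
              [ 0.0,  0.2,  0.8]])

def classify(x):
    """Assign x to the class k with the largest discriminant y_k(x)."""
    x_tilde = np.concatenate(([1.0], x))   # prepend 1 for the bias term
    return int(np.argmax(W @ x_tilde))

print(classify(np.array([0.5, -1.0])))
```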
Example: Kesler's Construction
Kesler's Construction:
- Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.
- Inputs are then classified in Class i if and only if w_i^t x > w_j^t x for all j ≠ i.
- The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.

[Diagram: the expanded training vector, with the input placed in element (block) i; a sketch of the construction follows below.]
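A minimal sketch of the construction under its usual formulation: the K weight vectors are stacked into one long vector, and each training input of class i is expanded into K−1 vectors with the input in block i and its negation in block j, all given target +1. Function names are illustrative, not from the slides:

```python
import numpy as np

def kesler_expand(x, i, K):
    """Expand augmented input x (true class i) into K-1 vectors of length K*D.

    Each expanded vector z has x in block i and -x in block j (j != i); then
    w_stacked @ z = w_i^t x - w_j^t x, so a binary perceptron on the expanded
    samples enforces w_i^t x > w_j^t x for all j != i.
    """
    D = len(x)
    out = []
    for j in range(K):
        if j == i:
            continue
        z = np.zeros(K * D)
        z[i * D:(i + 1) * D] = x
        z[j * D:(j + 1) * D] = -x
        out.append(z)
    return out

def kesler_perceptron(X, labels, K, eta=1.0, max_epochs=100):
    """Multiclass perceptron via Kesler's construction (illustrative sketch)."""
    D = X.shape[1]
    w = np.zeros(K * D)
    for _ in range(max_epochs):
        updates = 0
        for x, i in zip(X, labels):
            for z in kesler_expand(x, i, K):
                if w @ z <= 0:          # constraint violated: perceptron update
                    w += eta * z
                    updates += 1
        if updates == 0:
            break
    return w.reshape(K, D)              # row k is the learned w_k
```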
Logistic Classifiers
y(x) = W̃^t x̃

where x̃ = (1, x^t)^t and W̃ is a (D + 1) × K matrix whose kth column is w̃_k = (w_{k0}, w_k^t)^t.

Training dataset: {x_n, t_n},  n = 1, …, N
The least-squares approach minimizes the sum-of-squares error E_D(W̃) = ½ Tr{(X̃W̃ − T)^t (X̃W̃ − T)}, where the nth row of X̃ is x̃_n^t and the nth row of T is the target vector t_n^t.
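A minimal sketch of fitting W̃ by least squares, assuming 1-of-K (one-hot) target coding; the pseudo-inverse is the standard closed-form solution, and the names are illustrative:

```python
import numpy as np

def least_squares_classifier(X, labels, K):
    """Fit W_tilde by least squares against one-hot targets T.

    X : (N, D) raw inputs; labels : (N,) integer class labels in {0, ..., K-1}.
    Returns the (D+1, K) weight matrix; classify new x by argmax of W^t x_tilde.
    """
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend 1 for the bias
    T = np.eye(K)[labels]                       # one-hot (1-of-K) target matrix
    # Normal-equation solution via the pseudo-inverse of the design matrix
    W = np.linalg.pinv(X_tilde) @ T
    return W

def predict(W, x):
    """Assign x to the class with the largest element of W^t x_tilde."""
    return int(np.argmax(W.T @ np.concatenate(([1.0], x))))
```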
Logistic Classifiers
Let m_1 = (1/N_1) Σ_{n∈C_1} x_n,  m_2 = (1/N_2) Σ_{n∈C_2} x_n

For example, we might choose w to maximize w^t (m_2 − m_1), subject to ‖w‖ = 1.

This leads to w ∝ m_2 − m_1.
Let s_k² = Σ_{n∈C_k} (y_n − m_k)² be the within-class variance on the subspace for class C_k.

The Fisher criterion is then

J(w) = (m_2 − m_1)² / (s_1² + s_2²)
This can be rewritten as

J(w) = (w^t S_B w) / (w^t S_W w)

where

S_B = (m_2 − m_1)(m_2 − m_1)^t is the between-class variance

and

S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^t + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^t is the within-class variance.
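Maximizing J(w) yields the standard Fisher solution w ∝ S_W^{-1}(m_2 − m_1); a minimal NumPy sketch (assumes S_W is invertible):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2 : arrays of shape (N1, D) and (N2, D) holding the samples of C1 and C2.
    Returns the unit vector w maximizing J(w) = (w^t S_B w) / (w^t S_W w).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)      # w proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)
```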
Logistic Classifiers
p(C_1 | φ) = y(φ) = σ(w^t φ)
p(C_2 | φ) = 1 − p(C_1 | φ)

where σ(a) = 1 / (1 + exp(−a))
[Figure 8.3: Logistic regression model p(C_1 | x) = σ(w^t x) in 1D and 2D. a) One-dimensional fit.]
Logistic Regression
p(C_1 | φ) = y(φ) = σ(w^t φ)
p(C_2 | φ) = 1 − p(C_1 | φ)

where σ(a) = 1 / (1 + exp(−a))

Number of parameters:
- Logistic regression: M
- Gaussian model: 2M + 2[M(M+1)/2] + 1 = M² + 3M + 1
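A minimal sketch of the two-class logistic prediction (weights and feature vector are made up for illustration):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def p_c1(w, phi):
    """Posterior p(C1 | phi) = sigma(w^t phi); p(C2 | phi) = 1 - p(C1 | phi)."""
    return sigmoid(w @ phi)

# Made-up weights and a 3-dimensional feature vector (first element = 1 for bias)
w = np.array([0.2, -1.0, 0.5])
phi = np.array([1.0, 0.3, -0.7])
print(p_c1(w, phi))
```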
p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n}

where t = (t_1, …, t_N)^t and y_n = p(C_1 | φ_n).

Taking E(w) = −ln p(t | w) (the cross-entropy error), the gradient is

∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n
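This gradient suggests a simple first-order training loop; a minimal sketch, with an arbitrary step size and iteration count:

```python
import numpy as np

def train_logistic_gd(Phi, t, eta=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy error for logistic regression.

    Phi : (N, M) design matrix whose nth row is phi_n^t.
    t   : (N,) binary targets in {0, 1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^t phi_n)
        grad = Phi.T @ (y - t)               # grad E(w) = sum_n (y_n - t_n) phi_n
        w -= eta * grad
    return w
```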
The Hessian:

H = ∇_w ∇_w E(w),  i.e.,  H_ij = ∂²E(w) / (∂w_i ∂w_j)

H = Φ^t R Φ

where R is the N × N diagonal weight matrix with R_nn = y_n (1 − y_n),
and Φ is the N × M design matrix whose nth row is given by φ_n^t.

Thus the Newton-Raphson update is

w^(new) = w^(old) − (Φ^t R Φ)^{-1} Φ^t (y − t)

See Problem 8.3 in the text!
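A minimal sketch of this Newton-Raphson (IRLS) update (the iteration count is arbitrary; in practice a small regularizer may be needed if R becomes nearly singular, which is an implementation detail not from the slides):

```python
import numpy as np

def train_logistic_irls(Phi, t, n_iters=10):
    """Newton-Raphson (IRLS) updates for logistic regression.

    Phi : (N, M) design matrix; t : (N,) binary targets in {0, 1}.
    Each step: w_new = w_old - (Phi^t R Phi)^{-1} Phi^t (y - t).
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = np.diag(y * (1.0 - y))           # diagonal weights R_nn = y_n (1 - y_n)
        H = Phi.T @ R @ Phi                  # Hessian
        w = w - np.linalg.solve(H, Phi.T @ (y - t))
    return w
```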
p(C_k | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)

where a_k = w_k^t φ.
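A minimal sketch of the softmax posterior; the max-subtraction is a standard numerical-stability trick, not something stated on the slides:

```python
import numpy as np

def softmax_posteriors(W, phi):
    """p(C_k | phi) = exp(a_k) / sum_j exp(a_j), with activations a_k = w_k^t phi.

    W : (K, M) matrix whose kth row is w_k^t; phi : (M,) feature vector.
    """
    a = W @ phi
    a = a - a.max()                 # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()
```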
[Figure: two panels comparing decision boundaries on the same data, titled "Least-Squares" and "Logistic".]
Logistic Classifiers