Geometric Interpretation of g(x)
• Express x as $x = x_p + r \frac{w}{\|w\|}$, where $x_p$ is the projection of x onto the hyperplane, $\frac{w}{\|w\|}$ is the unit vector in the direction of w, and r is the signed distance of x from the hyperplane.
• Substitute x in $g(x) = w^t x + w_0$.
Geometric Interpretation of g(x) (cont’d)
• Substitute x in g(x):

$g(x) = w^t x + w_0 = w^t \left( x_p + r \frac{w}{\|w\|} \right) + w_0 = w^t x_p + w_0 + r \frac{w^t w}{\|w\|}$

(since $g(x_p) = w^t x_p + w_0 = 0$ and $w^t w = \|w\|^2$)

$g(x) = r \|w\|$
Geometric Interpretation of g(x) (cont’d)
• The distance of x from the hyperplane is

$r = \frac{g(x)}{\|w\|}$

• Setting x = 0: the distance of the origin from the hyperplane is

$r = \frac{w_0}{\|w\|}$
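A minimal NumPy sketch of this distance formula (the vectors below are illustrative, not from the slides):

```python
import numpy as np

# Signed distance of x from the hyperplane g(x) = w^t x + w0 = 0
# is r = g(x) / ||w||; illustrative w, w0.
def signed_distance(x, w, w0):
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0                     # ||w|| = 5
print(signed_distance(np.array([0.0, 0.0]), w, w0))    # origin: w0/||w|| = -1.0
print(signed_distance(np.array([3.0, 4.0]), w, w0))    # (25 - 5)/5 = 4.0
```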
Linear Discriminant Functions:
multi-category case
• There are several ways to devise multi-category classifiers using linear discriminant functions:
(1) One against the rest (i.e., c two-class problems)
problem:
ambiguous regions
Linear Discriminant Functions:
multi-category case (cont’d)
(2) One against another (i.e., c(c-1)/2 pairs of classes)
problem:
ambiguous regions
Linear Discriminant Functions:
multi-category case (cont’d)
• To avoid the problem of ambiguous regions:
– Define c linear discriminant functions
– Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
• The resulting classifier is called a linear machine
(see Chapter 2)
Linear Discriminant Functions:
multi-category case (cont’d)
• A linear machine divides the feature space into c convex decision regions.
– If x is in region Ri, then gi(x) is the largest.
– The boundary between contiguous regions Ri and Rj is a portion of the hyperplane Hij defined by gi(x) = gj(x); the distance from x to Hij is $\frac{g_i(x) - g_j(x)}{\|w_i - w_j\|}$.
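A minimal sketch of a linear machine (the weights below are made up for illustration):

```python
import numpy as np

# One linear discriminant g_i(x) = w_i^t x + w_i0 per class;
# assign x to the class whose g_i(x) is largest.
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  1.0],    # w_2
              [ 0.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.5, -0.5])

def classify(x):
    g = W @ x + w0               # g_i(x) for i = 1..c
    return int(np.argmax(g))     # index of the winning class

print(classify(np.array([2.0, 0.0])))  # -> 0 (class 1)
```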
Higher Order Discriminant Functions
$g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j$ (e.g., a quadratic discriminant)
Linear Discriminants – Alternative
Definition
• Augmented feature/parameter space:

$g(x) = w^t x + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \alpha^t y \qquad (x_0 = 1)$

$\alpha = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$ (d+1 parameters) $\qquad y = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}$ (d+1 features)
Linear Discriminants – Alternative
Definition (cont’d)
• Discriminant: $g(x) = \alpha^t y$ separates points in (d+1)-space by a hyperplane which passes through the origin.
• Classification rule:
If $\alpha^t y_i > 0$ assign $y_i$ to $\omega_1$
else if $\alpha^t y_i < 0$ assign $y_i$ to $\omega_2$
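A minimal sketch of the augmented notation and classification rule (reusing the earlier illustrative hyperplane w = (3, 4), w0 = −5):

```python
import numpy as np

# y = [1, x_1, ..., x_d] and alpha = [w_0, w_1, ..., w_d], so that
# g(x) = w^t x + w0 = alpha^t y: a hyperplane through the origin in (d+1)-space.
def augment(x):
    return np.concatenate(([1.0], x))    # prepend x0 = 1

alpha = np.array([-5.0, 3.0, 4.0])       # [w0, w1, w2]

x = np.array([3.0, 4.0])
y = augment(x)
label = 1 if alpha @ y > 0 else 2        # assign to omega_1 if alpha^t y > 0
print(alpha @ y, "-> class", label)      # 20.0 -> class 1
```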
Generalized Discriminants
• A generalized discriminant can be obtained by first mapping the data to a space of higher dimensionality:

$\phi: x = (x_1, \ldots, x_d)^t \mapsto y(x) = (y_1(x), \ldots, y_{\hat{d}}(x))^t, \qquad \hat{d} > d$
Generalized Discriminants (cont’d)
• A generalized discriminant is defined as a linear discriminant in the d̂-dimensional space:

$g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x) \quad \text{or} \quad g(x) = \alpha^t y$

where each $y_i(x)$ is a function of x, obtained through the mapping

$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \xrightarrow{\;\phi\;} y(x) = \begin{bmatrix} y_1(x) \\ y_2(x) \\ \vdots \\ y_{\hat{d}}(x) \end{bmatrix}$
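A minimal sketch of such a mapping (the quadratic φ and the weights are illustrative): a discriminant that is linear in y(x) can realize a nonlinear (here circular) boundary in the original x-space.

```python
import numpy as np

# Map d = 2 inputs to d_hat = 6 features: all monomials up to degree 2.
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

alpha = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # g(x) = x1^2 + x2^2 - 1

for x in [np.array([0.2, 0.3]), np.array([1.5, 0.0])]:
    print(x, "->", "w1" if alpha @ phi(x) > 0 else "w2")
```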
Generalized Discriminants (cont’d)
$g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x) \quad \text{or} \quad g(x) = \alpha^t y$

• Visualize solution in
“feature space”:
– $\alpha^t y = 0$ defines a hyperplane
in the feature space with α
being the normal vector.
– Given n examples, the
solution α must lie within a
certain region.
Learning: effect of training examples
(cont’d)
Gradient Descent

$\alpha(k+1) = \alpha(k) - \eta(k)\,\nabla J(\alpha(k))$

where η(k) is the learning rate (search step size).

[Figure: trajectory α(0), α(1), …, α(k+1) on the contours of J(α).]
Gradient Descent (cont’d)
• What is the effect of the learning rate η(k)?

[Figure: gradient-descent trajectories on the contours of J(α) for two learning rates, η = 0.37 and η = 0.39.]

• Choosing the optimal η(k) at every step is expensive in practice!
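A small illustrative sketch (my own quadratic criterion and rates, not the figure's): on $J(\alpha) = \frac{1}{2}\alpha^t H \alpha$ the iterates converge for a small η but blow up once η exceeds 2 / λmax(H).

```python
import numpy as np

# Gradient descent on J(alpha) = 0.5 * alpha^t H alpha,
# whose gradient is H @ alpha and whose minimum is alpha = 0.
H = np.array([[1.0, 0.0],
              [0.0, 5.0]])

for eta in (0.1, 0.5):                  # one safe rate, one too-large rate
    a = np.array([2.0, 1.0])
    for k in range(20):
        a = a - eta * (H @ a)           # alpha(k+1) = alpha(k) - eta * grad J
    print(f"eta = {eta}: ||alpha(20)|| = {np.linalg.norm(a):.3f}")
```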
Choosing pk using Newton’s Method
$\alpha(k+1) = \alpha(k) + \eta(k)\, p_k$

$\eta(k)\, p_k = -H^{-1} \nabla J(\alpha(k))$

requires inverting the Hessian H!
Newton’s method (cont’d)
• If J(α) is quadratic, Newton's method converges in one iteration!

[Figure: a single Newton step on the contours of J(α).]
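A one-step sketch under the same illustrative quadratic criterion, confirming the one-iteration claim:

```python
import numpy as np

# For J(alpha) = 0.5 * alpha^t H alpha, one Newton step
# alpha - H^{-1} grad J(alpha) lands exactly on the minimum.
H = np.array([[1.0, 0.0],
              [0.0, 5.0]])
grad = lambda a: H @ a                   # grad J = H alpha, minimum at 0

a = np.array([2.0, 1.0])
a = a - np.linalg.solve(H, grad(a))      # one Newton step (no explicit inverse)
print(a)                                 # [0. 0.]: converged in one iteration
```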
Gradient descent vs Newton’s method
[Figure: trajectories of gradient descent and Newton's method on the same contours of J(α).]
“Dual” Classification Problem
If $\alpha^t y_i > 0$ assign $y_i$ to $\omega_1$
else if $\alpha^t y_i < 0$ assign $y_i$ to $\omega_2$
• If $y_i \in \omega_2$, replace $y_i$ by $-y_i$
• Find α such that: $\alpha^t y_i > 0$ for all i
Perceptron rule
• Minimize the Perceptron criterion

$J_p(\alpha) = \sum_{y \in Y(\alpha)} (-\alpha^t y)$

where Y(α) is the set of examples misclassified by α.
• Gradient descent on $J_p$ gives the update rule:

$\alpha(k+1) = \alpha(k) + \eta(k) \sum_{y \in Y(\alpha)} y$
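A minimal sketch of the batch Perceptron rule on made-up, linearly separable data (ω2 examples already negated, as above):

```python
import numpy as np

# Augmented, "normalized" examples: rows from omega_2 are negated,
# so the goal is alpha^t y_i > 0 for every row.
Y = np.array([[ 1.0, 1.0,  2.0],    # examples from omega_1
              [ 1.0, 2.0,  1.0],
              [-1.0, 1.0, -1.0],    # examples from omega_2, negated
              [-1.0, 2.0, -2.0]])

alpha, eta = np.zeros(3), 1.0
for k in range(100):
    mis = Y[Y @ alpha <= 0]                # Y(alpha): misclassified examples
    if len(mis) == 0:
        break
    alpha = alpha + eta * mis.sum(axis=0)  # alpha(k+1) = alpha(k) + eta * sum
print(alpha, (Y @ alpha > 0).all())
```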
Perceptron rule (cont’d)
[Figure: the update is driven by the misclassified examples.]
Perceptron rule (cont’d)
• Example:

[Figure: trajectory of α in (a1, a2) parameter space.]
Perceptron rule (cont’d)
• η(k) = 1, one misclassified example at a time:

$\alpha \leftarrow \alpha + y_k$

order of examples: y2, y3, y1, y3, …
• The “batch” algorithm leads to a smoother trajectory in solution space.
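A minimal sketch of the single-sample variant with η(k) = 1, on the same made-up data:

```python
import numpy as np

# Cycle through the examples; update alpha <- alpha + y_k whenever
# y_k is misclassified (fixed-increment rule, eta = 1).
Y = np.array([[ 1.0, 1.0,  2.0],
              [ 1.0, 2.0,  1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, 2.0, -2.0]])

alpha = np.zeros(3)
changed = True
while changed:                      # stop after a full pass with no updates
    changed = False
    for y in Y:
        if alpha @ y <= 0:          # y is misclassified
            alpha = alpha + y
            changed = True
print(alpha)
```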
Quiz
• Next quiz on “Linear Discriminant Functions”
• When: Tuesday, April 23rd