Beruflich Dokumente
Kultur Dokumente
Machine Learning I:
Supervised Learning
© CSE AI Faculty
What’s on our menu today?
Supervised Learning
• Classification
– Decision trees
– Cross validation
– K-nearest neighbor
– Neural networks
+ Perceptrons
– Support Vector Machines (SVMs)
• Regression
– Backpropagation networks
2
Why Learning?
Learning is essential for unknown environments
• e.g., when designer lacks omniscience
4
Inductive learning
We will focus on one form of supervised learning called
Inductive Learning:
Learn a function from examples
6
Inductive learning example
h = Straight line?
7
Inductive learning example
What about a quadratic function?
What about
this little
fella?
8
Inductive learning example
Finally, a function that satisfies all!
9
Inductive learning example
But so does this one…
10
Ockham’s Razor Principle
To play or
not to play?
http://www.sfgate.com/blogs/images/sfgate/sgreen/2007/09/05/2240773250x321.jpg 12
Example data for learning the concept
“Good day for tennis”
Day Outlook Humid Wind PlayTennis?
d1 s h w n
d2 s h s n • Outlook =
d3 o h w y sunny,
d4 r h w y overcast,
d5 r n w y rain
d6 r n s y
d7 o n s y • Humidity =
d8 s h w n high, normal
d9 s n w y
d10 r n w y • Wind = weak,
d11 s n s y
strong
d12 o h s y
d13 o n w y
d14 r h s n
13
A Decision Tree for the Same Data
Decision Tree for “PlayTennis?”
Leaves = classification output
Outlook
Arcs = choice of value
for parent attribute
Sunny Overcast Rain
Yes No No Yes
Decision tree is equivalent to logic in disjunctive normal form
PlayTennis ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
14
Decision Trees
Input: Description of an object or a situation through a
set of attributes
Output: a decision that is the predicted output value
for the input
Both input and output can be discrete or continuous
Discrete-valued functions lead to classification
problems
15
Example: Decision Tree for Continuous
Valued Features and Discrete Output
Input real number attributes (x1,x2), Classification output: 0 or 1
x2
x1
16
Example: Classification of Continuous Valued Inputs
Decision Tree
x2
4 x1
17
Expressiveness of Decision Trees
Decision trees can express any function of the input
attributes.
E.g., for Boolean functions, truth table row = path to leaf:
19
Example Decision tree
A decision tree for Wait? based on personal “rules of thumb”:
20
Input Data for Learning
Past examples when I did/did not wait for a table:
21
Decision Tree Learning
Aim: find a small tree consistent with training examples
Idea: (recursively) choose "most significant" attribute as
root of (sub)tree
22
Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty
• E.g., splits the examples into subsets that are
(ideally) "all positive" or "all negative"
23
How do we quantify uncertainty?
http://a.espncdn.com/media/ten/2006/0306/photo/g_mcenroe_195.jpg
Using information theory to
quantify uncertainty
25
Using information theory
Imagine we have p examples with Wait = True (positive)
and n examples with Wait = false (negative).
26
Entropy is
1.0
highest
when
uncertainty
is greatest
Entropy I
0.5
Wait = F
Wait = T
P(Wait = T)
.00 .50 1.00
27
Choosing an attribute to split on
Idea: a good attribute should reduce uncertainty
and result in “gain in information”
Answer:
uncertainty before – uncertainty after
28
Back at the Restaurant
29
Back at the Restaurant
32
Information gain in our example
2 4 6 2 4
IG ( Patrons ) = 1 − [ I (0,1) + I (1,0) + I ( , )] = .541 bits
12 12 12 6 6
2 1 1 2 1 1 4 2 2 4 2 2
IG (Type) = 1 − [ I ( , ) + I ( , ) + I ( , ) + I ( , )] = 0 bits
12 2 2 12 2 2 12 4 4 12 4 4
35
Generalization
How do we know the classifier function we have
learned is good?
• Look at generalization error on test data
– Method 1: Split data in training vs test set
(the “hold out” method)
– Method 2: Cross-Validation
36
Cross-validation
K-fold cross-validation:
• Divide data into k subsets of equal size
• Train learning algorithm K times, leaving out
one of the subsets. Compute error on
left-out subset
• Report average error over all subsets
Leave-1-out cross-validation:
• Train on all but 1 data point, test on that
data point; repeat for each point
• Report average error over all points
37
Decision trees are for girlie men
– let’s move on to more powerful
learning algorithms
http://www.ipjnet.com/schwarzenegger2/pages/arnold_01.htm
Example Problem: Face Detection
How do we build a classifier to distinguish
between faces and other objects?
39
Images as Vectors
Binary handwritten characters Treat an image as a high-
dimensional vector
(e.g., by reading pixel values
left to right, top to bottom row)
⎡ p1 ⎤
⎢ p ⎥
⎢ 2 ⎥
I=⎢ M ⎥
⎢ ⎥
Greyscale images p
⎢ N −2 ⎥
62 79 23 119 120 105 4 0
⎢⎣ p N ⎥⎦
10 10 9 62 12 78 34 0
or 1 (binary image) or
2 1 1 29 26 37 0 77
0 89 144 147 187 102 62 208
0 to 255 (greyscale)
255 252 0 166 123 62 0 31
166 63 127 17 1 0 99 30
40
The human brain is extremely
good at classifying images
Inputs
Output spike
(electrical pulse)
vi = Θ(∑ w ji u j − μi )
j
Θ(x) = 1 if x > 0 and -1 if x ≤ 0
44
“Perceptrons” for Classification
Fancy name for a type of layered “feed-forward”
networks (no loops)
Multilayer
Single-layer
45
Perceptrons and Classification
Consider a single-layer perceptron
• Weighted sum forms a linear hyperplane
∑w j
ji u j − μi = 0
• Everything on one side of this hyperplane is in
class 1 (output = +1) and everything on other
side is class 2 (output = -1)
Any function that is linearly separable can be
computed by a perceptron
46
Linear Separability
Example: AND is linearly separable
Linear hyperplane
u1 u2 AND
-1 -1 -1 u2
(1,1) v
1 -1 -1 1 μ = 1.5
-1 1 -1 u1
-1 1
1 1 1
-1 u1 u2
v = Θ( ∑ w j u j − μ ) = Θ( w u − μ )
T
49
Perceptron Learning Rule
3. Change w and μ according to error (vd – v) :
If input is positive and error is positive,
then w not large enough ⇒ increase w
If input is positive and error is negative,
then w too large ⇒ decrease w
Similar reasoning for other cases yields:
w → w + α (v d − v)u
A → B means replace A with B
μ → μ − α (v − v )
d
50
What about the XOR function?
u1 u2 XOR ?
u2
-1 -1 1 (1,1)
1
1 -1 -1 1
u1
-1 1 -1 -1
1 1 1 -1
51
Linear Inseparability
Perceptron with threshold units fails if classification
task is not linearly separable
• Example: XOR
• No single line can separate the “yes” (+1)
outputs from the “no” (-1) outputs!
u2
(1,1)
Minsky and Papert’s book 1
showing such negative
results put a damper on
X 1
u1
-1
neural networks research
for over a decade! -1
52
How do we deal with linear
inseparability?
Idea 1: Multilayer Perceptrons
Removes limitations of single-layer networks
• Can solve XOR
Example: Two-layer perceptron that computes XOR
x y
?
−
2
1
1 2
2 1
−1
−1 −1
1
x y
x
1 2
55
Multilayer Perceptron: What does it do?
out 1 =-1
y 1+ x − y < 0
2 2
=1
1
1+ x − y > 0
y = 1+
1
x 2
1 2
1
2 1
−1
1
x y
x
1 2
56
Multilayer Perceptron: What does it do?
out
y =-1
2
=1
1
2
−1 =1 =-1
−1
1 2− x− y > 0 2− x− y < 0
x y
x
1 2
57
Multilayer Perceptron: What does it do?
y =-1
out
2
=1
1 −1
1
− 1
2 - − >0
1 2
=1 =-1
1
x y x
1 2
58
Perceptrons as Constraint
Satisfaction Networks
y =-1
out
2
=1
1
1 −1
1 1+ x − y > 0
−
2
2
1
1 2 =1 =-1
1
2
−1 −1
−1 2− x− y < 0
1
x y x
1 2
59
Back to Linear Separability
• Recall: Weighted sum in perceptron
forms a linear hyperplane
∑w x +b = 0
i
i i
60
Separating Hyperplane
∑wx i i +b =0
Class 1 i
denotes +1 output
denotes -1 output
Class 2
61
Separating Hyperplanes
Different choices of w and b give different hyperplanes
Class 1
denotes +1 output
denotes -1 output
Class 2
Class 1
denotes +1 output
denotes -1 output
Class 2
63
How about the one right in the middle?
Avoids misclassification of
new test points if they are
generated from the same
distribution as training points
64
Margin
65
Maximum Margin and Support Vector Machine
The maximum
margin classifier is
Support Vectors called a Support
are those Vector Machine (in
datapoints that this case, a Linear
the margin SVM or LSVM)
pushes up
against
66
Why Maximum Margin?
• Robust to small
perturbations of data
points near boundary
• There exists theory
showing this is best for
generalization to new
points
• Empirically works great
67
Support Vector Machines: The Math
Suppose the training data points ( x i , y i ) satisfy :
w ⋅ x i + b ≥ + 1 for y i = + 1 We can always do this by rescaling
w ⋅ x i + b ≤ − 1 for y i = − 1 w and b, without affecting the
separating hyperplane:
This can be rewritten as w⋅x +b = 0
y i (w ⋅ x i + b ) ≥ + 1
68
Estimating the Margin
The margin is given by (see Burges tutorial online):
Class 2
Class 1
m
Margin can be calculated based on expression for distance from a point to a line, see,
e.g., http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
69
Learning the Maximum Margin Classifier
Want to maximize margin:
2/ w subject to y i (w ⋅ x i + b ) ≥ + 1, ∀ i
70
Learning the Maximum Margin Classifier
Using Lagrange formulation and Lagrangian
multipliers αi, we get (see Burges tutorial online):
w = ∑α
i
i yi x i
71
Geometrical Interpretation
xi with non-zero αi are called support vectors
α8=0.6 α10=0
α7=0
α2=0
α5=0
α1=0.8
α4=0
α6=1.4
α9=0
α3=0
72
What if data is not linearly separable?
Outliers (due to noise)
73
Approach 1: Soft Margin SVMs
1
Minimize: w + C ∑ ξ i subject to :
2
2 i
y i (w ⋅ x i + b ) ≥ 1 − ξ i and ξ i ≥ 0 , ∀ i
74
What if data is not linearly
separable: Other ideas?
u1 u2 XOR u2
-1 -1 1 (1,1)
1
1 -1 -1 X 1
u1
-1 1 -1 -1
1 1 1 -1
75
Another Example
76
What if data is not linearly separable?
Approach 2: Map original input space to higher-
dimensional feature space; use linear classifier in
higher-dim. space
x → φ(x)
77
Problem with high dimensional spaces
x → φ(x)
Insight:
The data points only appear as inner product
• No need to compute high-dimensional φ(x)
explicitly! Just replace inner product xi⋅xj with
a kernel function K(xi,xj) = φ(xi) ⋅ φ(xj)
• E.g., Gaussian kernel
K(xi,xj) = exp(-||xi-xj||2/2σ2)
• E.g., Polynomial kernel
K(xi,xj) = (xi⋅xj+1)d
79
An Example for φ(.) and K(.,.)
Suppose φ(.) is given as follows
80
Summary: Steps for Classification
using SVMs
Prepare the data matrix
Select the kernel function to use
Select parameters of the kernel function
• You can use the values suggested by the
SVM software, or use cross-validation
Execute the training algorithm and obtain the
parameters αi
Classify new data using the learned parameters
81
Face Detection using
SVMs
82
Support Vectors
83
K-Nearest Neighbors
A simple non-parametric classification algorithm
Idea:
• Look around you to see how your neighbors
classify data
• Classify a new data-point according to a majority
vote of your k nearest neighbors
84
Distance Metric
How do we measure what it means to be a neighbor
(what is “close”)?
Appropriate distance metric depends on the problem
Examples:
x discrete (e.g., strings): Hamming distance
d(x1,x2) = # features on which x1 and x2 differ
85
Example
Input Data: 2-D points (x1,x2)
Two classes: C1 and C2. New Data Point +
Some points
near the
boundary may
be misclassified
(but maybe noise)
87
What if we want to learn
continuous-valued functions?
Output
Input
Example: Learning to Drive
90
Sigmoidal Networks
a
Non-linear “squashing” function: Squashes input to be between 0
and 1. The parameter β controls the slope.
91
Gradient-Descent Learning
(“Hill-Climbing”)
1
E (w ) = ∑ (d − v )
m m 2
2 m
where v = g (w u )
m T m
92
Gradient-Descent Learning
(“Hill-Climbing”)
Derivative of sigmoid
93
“Stochastic” Gradient Descent
What if the inputs only arrive one-by-one?
Stochastic gradient descent approximates
sum over all inputs with an “on-line” running
sum:
dE1
w → w −ε
dw
Also known as
dE1
= −(d m − v m ) g ′(w T u m )u m the “delta rule”
dw or “LMS (least
delta = error mean square)
rule”
94
But wait….
Delta rule tells us how to modify the connections
from input to output (one layer network)
• One layer networks are not that interesting
(remember XOR?)
What if we have multiple layers?
95
Learning Multilayer Networks
96
Backpropagation: Output Weights
vi = g (∑ W ji x j )
1
E ( W, w ) = ∑ i i
2 i
( d − v ) 2 j
xj
uk
Learning rule for hidden-output weights W:
dE
W ji → W ji − ε {gradient descent}
dW ji
dE
= −(d i − vi ) g ′(∑ W ji x j ) x j {delta rule}
dW ji j
97
Backpropagation: Hidden Weights
1 vim = g (∑ W ji x j )
E ( W, w ) = ∑ (d i − vi ) 2 j
2 i
x mj = g (∑ wkj ukm )
k
u km
Learning rule for input-hidden weights w:
dE dE dE dx j
wkj → wkj − ε But : = ⋅ {chain rule}
dw kj dw kj dx j dw kj
dE ⎡ ⎤ ⎡ m⎤
= ⎢ − ∑ (d i − vi ) g ′( ∑ W ji x j )W ji ⎥ ⋅ ⎢ g ′( ∑ wkj u k )u k ⎥
m m m m
dw kj ⎣ m ,i j ⎦ ⎣ k ⎦
98
Learning to Drive using Backprop
99
ALVINN (Autonomous Land Vehicle in a Neural
Network)
(Pomerleau, 1992)
100
Another Example: Face Detection
101
Face Detection Results
102
(Rowley, Baluja & Kanade, 1998)
Demos: Pole Balancing and Backing up a Truck
(courtesy of Keith Grochow, CSE 599)
vpole
) Neural network learns to balance a pole on a cart
¼ System:
¼ 4 state variables: xcart, vcart, θpole, vpole xcart
¼ 1 input: Force on cart θpole
¼ Backprop Network: vcart
¼ Input: State variables
¼ Output: New force on cart
) NN learns to back a truck into a loading dock
¼ System (Nyugen and Widrow, 1989):
¼ State variables: xcab, ycab, θcab
¼ 1 input: new θsteering
¼ Backprop Network:
¼ Input: State variables
¼ Output: Steering angle θsteering
103