Machine Learning
1. Linear Regression

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute BW/WI & Institute for Computer Science, University of Hildesheim
Course on Machine Learning, winter term 2007
Machine Learning
1. The Regression Problem
2. Simple Linear Regression
3. Multiple Regression
4. Variable Interactions
5. Model Selection
6. Case Weights
Machine Learning / 1. The Regression Problem
Example
weekly measurements of
• average external temperature
• total gas consumption
(in 1000 cubic feet)
A third variable encodes two heating
seasons, before and after wall
insulation.
[Figure: scatter plot of gas consumption (1000 cubic feet) versus average external temperature (deg. C), with a fitted linear model]
Machine Learning / 1. The Regression Problem
Example
[Figure: two scatter plots of gas consumption (1000 cubic feet) versus average external temperature (deg. C), one per heating season (before and after wall insulation), each with a fitted linear model]
Machine Learning / 1. The Regression Problem
Variable Types and Coding
Replace
one variable X with 3 levels: red, green, blue
by
two variables δred(X) and δgreen(X) with 2 levels each: 0, 1
    X       δred(X)   δgreen(X)
    red     1         0
    green   0         1
    blue    0         0
    —       1         1
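This coding can also be produced automatically; a small R sketch (not from the slides; the factor levels are chosen here so that blue is the reference level):

    X <- factor(c("red", "green", "blue", "red"), levels = c("blue", "green", "red"))
    model.matrix(~ X)   # columns: (Intercept), Xgreen, Xred, i.e. the two indicator variables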
Let
X1, X2, . . . , Xp be random variables called predictors (or inputs, covariates).
Let 𝒳1, 𝒳2, . . . , 𝒳p be their domains.
We write shortly
    X := (X1, X2, . . . , Xp)
for the vector of random predictor variables and
    𝒳 := 𝒳1 × 𝒳2 × · · · × 𝒳p
for its domain.
Machine Learning / 2. Simple Linear Regression
Simple Linear Regression Model
Make it simple:
• the predictor X is simple, i.e., one-dimensional (X = X1).
• the model is linear: Y = β0 + β1X + ε with E(ε) = 0 and Var(ε) = σ².
• 3 parameters:
    β0  intercept (sometimes also called bias)
    β1  slope
    σ²  variance
parameter estimates
    β̂0, β̂1, σ̂²
fitted line
    r̂(x) := β̂0 + β̂1x
residuals
    ε̂i := yi − ŷi = yi − (β̂0 + β̂1xi)

Example:
Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.

[Figure: the three data points in the (x, y) plane]
Example:
Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.

Line through the first two points:
    β̂1 = (y2 − y1) / (x2 − x1) = 1
    β̂0 = y1 − β̂1 x1 = 1

RSS:
    i   yi   ŷi   (yi − ŷi)²
    1   2    2    0
    2   3    3    0
    3   6    5    1
    Σ             1

    r̂(3) = 4

[Figure: data points and the fitted line]
Machine Learning / 2. Simple Linear Regression
How to estimate the parameters?
Example:
Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.

Line through the first and the last point:
    β̂1 = (y3 − y1) / (x3 − x1) = 4/3 ≈ 1.333
    β̂0 = y1 − β̂1 x1 = 2/3 ≈ 0.667

RSS:
    i   yi   ŷi      (yi − ŷi)²
    1   2    2        0
    2   3    3.333    0.111
    3   6    6        0
    Σ                 0.111

    r̂(3) = 4.667

[Figure: data points and the fitted line]
Machine Learning / 2. Simple Linear Regression
Least Squares Estimates / Proof
Proof (1/2):

    RSS = Σ_{i=1}^n (yi − (β̂0 + β̂1 xi))²

    ∂RSS / ∂β̂0 = Σ_{i=1}^n 2 (yi − (β̂0 + β̂1 xi)) (−1)  =!  0

    ⟹  n β̂0 = Σ_{i=1}^n (yi − β̂1 xi),   i.e.   β̂0 = ȳ − β̂1 x̄
Proof (2/2):

    RSS = Σ_{i=1}^n (yi − (β̂0 + β̂1 xi))²
        = Σ_{i=1}^n (yi − (ȳ − β̂1 x̄) − β̂1 xi)²
        = Σ_{i=1}^n (yi − ȳ − β̂1 (xi − x̄))²

    ∂RSS / ∂β̂1 = Σ_{i=1}^n 2 (yi − ȳ − β̂1 (xi − x̄)) (−1) (xi − x̄)  =!  0

    ⟹  β̂1 = Σ_{i=1}^n (yi − ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)²
Machine Learning / 2. Simple Linear Regression
Least Squares Estimates / Example
Example:
Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Assume the simple linear model.

x̄ = 7/3, ȳ = 11/3.

    i   xi − x̄   yi − ȳ   (xi − x̄)²   (xi − x̄)(yi − ȳ)
    1   −4/3      −5/3      16/9         20/9
    2   −1/3      −2/3      1/9          2/9
    3   5/3       7/3       25/9         35/9
    Σ                       42/9         57/9

    β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = 57/42 ≈ 1.357

    β̂0 = ȳ − β̂1 x̄ = 11/3 − (57/42) · (7/3) = 63/126 = 0.5

[Figure: data points and the fitted least squares line]
Example:
Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Assume the simple linear model.

    β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = 57/42 ≈ 1.357
    β̂0 = ȳ − β̂1 x̄ = 11/3 − (57/42) · (7/3) = 63/126 = 0.5

RSS:
    i   yi   ŷi      (yi − ŷi)²
    1   2    1.857    0.020
    2   3    3.214    0.046
    3   6    5.929    0.005
    Σ                 0.071

    r̂(3) = 4.571

[Figure: data points and the fitted least squares line]
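The same numbers can be reproduced with R's lm (a quick check, not part of the slides):

    x <- c(1, 2, 4); y <- c(2, 3, 6)
    fit <- lm(y ~ x)
    coef(fit)                          # intercept 0.5, slope 1.357
    predict(fit, data.frame(x = 3))    # 4.571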
Machine Learning / 2. Simple Linear Regression
A Generative Model
The generative model: the targets are generated as
    Y = β0 + β1 X + ε,   ε ∼ N(0, σ²),
i.e., y is normally distributed around the value of the fitted line (this is the model behind the conditional likelihood below).

Likelihood:
    L_D(θ) := Π_{i=1}^n p̂(xi, yi | θ)
Machine Learning / 2. Simple Linear Regression
Least Squares Estimates and Maximum Likelihood Estimates
Likelihood:
    L_D(β̂0, β̂1, σ̂²) := Π_{i=1}^n p̂(xi, yi) = Π_{i=1}^n p̂(yi | xi) p(xi) = ( Π_{i=1}^n p̂(yi | xi) ) ( Π_{i=1}^n p(xi) )

Conditional likelihood:
    L_D^cond(β̂0, β̂1, σ̂²) := Π_{i=1}^n p̂(yi | xi) = Π_{i=1}^n (1 / (√(2π) σ̂)) e^{−(yi − ŷi)² / (2σ̂²)}
                           = (1 / ((√(2π))^n σ̂^n)) e^{−(1 / (2σ̂²)) Σ_{i=1}^n (yi − ŷi)²}

Conditional log-likelihood:
    log L_D^cond(β̂0, β̂1, σ̂²) ∝ −n log σ̂ − (1 / (2σ̂²)) Σ_{i=1}^n (yi − ŷi)²

Maximizing the conditional (log-)likelihood in β̂0 and β̂1 thus is the same as minimizing RSS:
the least squares estimates coincide with the maximum likelihood estimates.
Machine Learning / 2. Simple Linear Regression
Implementation Details

The least squares estimates can be computed in two passes over the data:

    simple-regression(D):
        sx := 0, sy := 0
        for i = 1, . . . , n do
            sx := sx + xi
            sy := sy + yi
        od
        x̄ := sx/n, ȳ := sy/n
        a := 0, b := 0
        for i = 1, . . . , n do
            a := a + (xi − x̄)(yi − ȳ)
            b := b + (xi − x̄)²
        od
        β1 := a/b
        β0 := ȳ − β1 x̄
        return (β0, β1)
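A direct translation of the pseudocode into R (a sketch; the function and variable names are chosen here, not taken from the slides):

    simple_regression <- function(x, y) {
      n <- length(x)
      xbar <- sum(x) / n                    # first pass: means
      ybar <- sum(y) / n
      a <- sum((x - xbar) * (y - ybar))     # second pass: centered cross products
      b <- sum((x - xbar)^2)
      beta1 <- a / b
      beta0 <- ybar - beta1 * xbar
      c(beta0 = beta0, beta1 = beta1)
    }

    simple_regression(c(1, 2, 4), c(2, 3, 6))   # 0.5, 1.357, as in the example above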
Machine Learning / 3. Multiple Regression
Several predictors
The linear model with several predictors is
    Ŷ = β0 + β1X1 + β2X2 + · · · + βpXp = ⟨X, β⟩
where
    β := (β0, β1, . . . , βp)^T,    X := (1, X1, X2, . . . , Xp)^T.
Thus, the intercept is handled like any other parameter, for the artificial constant variable X0 ≡ 1.
Machine Learning / 3. Multiple Regression
Simultaneous equations for the whole dataset
Stacking the n cases gives the target vector Y ∈ R^n and the design matrix X ∈ R^{n×(p+1)} (one row per case, first column the constant 1); the least squares estimates minimize ||Y − Xβ̂||².

Proof:
    ||Y − Xβ̂||² = ⟨Y − Xβ̂, Y − Xβ̂⟩

    ∂||Y − Xβ̂||² / ∂β̂ = 2⟨−X, Y − Xβ̂⟩ = −2(X^T Y − X^T X β̂)  =!  0

    ⟹  X^T X β̂ = X^T Y,   i.e.   β̂ = (X^T X)^{−1} X^T Y   (the normal equations)
Machine Learning / 3. Multiple Regression
The normal equations can be solved by standard methods:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition
Machine Learning / 3. Multiple Regression
[Figure: the example data plotted against x1 (left) and against x2 (right), each panel with the simple regression line of y on that single predictor]

    ŷ(x1 = 3) = 3.25
    ŷ(x2 = 4) = 1.571
    X = ( 1  1  2        Y = ( 3
          1  2  3              2
          1  4  1              7
          1  5  5 ),           1 )

    X^T X = (  4  12  11       X^T Y = ( 13
              12  46  37                 40
              11  37  39 ),              24 )
Machine Learning / 3. Multiple Regression
Gaussian elimination on the augmented system (X^T X | X^T Y):

    ( 4 12 11 |  13 )     ( 4 12 11 |  13 )     ( 4 12  11 |   13 )
    (12 46 37 |  40 )  ~  ( 0 10  4 |   1 )  ~  ( 0 10   4 |    1 )
    (11 37 39 |  24 )     ( 0 16 35 | −47 )     ( 0  0 143 | −243 )

       ( 4   12  11 |   13 )     ( 286    0    0 | 1597 )
    ~  ( 0 1430   0 | 1115 )  ~  (   0 1430    0 | 1115 )
       ( 0    0 143 | −243 )     (   0    0  143 | −243 )

i.e.,
    β̂ = ( 1597/286, 1115/1430, −243/143 )^T ≈ ( 5.583, 0.779, −1.699 )^T
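The elimination result can be checked numerically in R (a sketch, not from the slides):

    X <- matrix(c(1, 1, 2,
                  1, 2, 3,
                  1, 4, 1,
                  1, 5, 5), ncol = 3, byrow = TRUE)
    y <- c(3, 2, 7, 1)
    solve(t(X) %*% X, t(X) %*% y)   # 5.583, 0.779, -1.699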
[Figure: the fitted model as a plane over the (x1, x2) plane]
Machine Learning / 3. Multiple Regression
[Figure: residuals ŷ − y of the example, all within about ±0.02]
Machine Learning / 3. Multiple Regression
The Normal Distribution

written as:
    X ∼ N(µ, σ²)
with parameters:
    µ  mean,
    σ  standard deviation.

probability density function (pdf):
    φ(x) := (1 / (√(2π) σ)) e^{−(x−µ)² / (2σ²)}

cumulative distribution function (cdf):
    Φ(x) := ∫_{−∞}^{x} φ(t) dt

[Figure: pdf φ(x) and cdf Φ(x) of the standard normal distribution]

Machine Learning / 3. Multiple Regression
The t Distribution

written as:
    X ∼ t_p
with parameter:
    p  degrees of freedom.

probability density function (pdf):
    p(x) := (Γ((p+1)/2) / (√(pπ) Γ(p/2))) · (1 + x²/p)^{−(p+1)/2}

For p → ∞ the t distribution approaches the standard normal distribution:
    t_p → N(0, 1)

[Figure: pdf f(x) and cdf F(x) of the t distribution for p = 5, 10, 50]
Machine Learning / 3. Multiple Regression
The χ2 Distribution
written as:
    X ∼ χ²_p
with parameter:
    p  degrees of freedom.

probability density function (pdf):
    f(x) := x^{p/2 − 1} e^{−x/2} / (Γ(p/2) 2^{p/2}),   x > 0

If X1, . . . , Xp ∼ N(0, 1) independently, then
    Y := Σ_{i=1}^p Xi² ∼ χ²_p

[Figure: pdf f(x) and cdf F(x) of the χ² distribution for p = 5, 7, 10]
Machine Learning / 3. Multiple Regression
Parameter Variance
Proof:
    β̂ = (X^T X)^{−1} X^T Y = (X^T X)^{−1} X^T (Xβ + ε) = β + (X^T X)^{−1} X^T ε
As E(ε) = 0:  E(β̂) = β, i.e., the least squares estimator is unbiased.

Furthermore, with σ̂² := (1 / (n − p)) Σ_{i=1}^n ε̂i²,
    (n − p) σ̂² ∼ σ² χ²_{n−p}
Machine Learning / 3. Multiple Regression
Parameter Variance / Standardized coefficient
The standardized coefficient (z-score) of the i-th parameter is ẑi := β̂i / ŝe(β̂i); under the hypothesis βi = 0 it follows a t distribution with n − p degrees of freedom.
As Φ^{−1}(1 − 0.05/2) ≈ 1.95996 ≈ 2, the rule of thumb for a 95% confidence interval (5% significance level) is
    β̂i ± 2 ŝe(β̂i)
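In R (an illustration, not from the slides; the numeric values are the estimate and standard error of X1 from the example that follows):

    qnorm(1 - 0.05 / 2)                    # 1.959964, i.e. roughly 2
    beta1 <- 0.779; se1 <- 0.0207
    c(beta1 - 2 * se1, beta1 + 2 * se1)    # rule-of-thumb 95% interval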
Machine Learning / 3. Multiple Regression
Parameter Variance / Example
    σ̂² = (1 / (n − p)) Σ_{i=1}^n ε̂i² = (1 / (4 − 3)) · 0.00350 = 0.00350

    (X^T X)^{−1} σ̂² = (  0.00520  −0.00075  −0.00076
                         −0.00075   0.00043  −0.00020
                         −0.00076  −0.00020   0.00049 )

    covariate     β̂i       ŝe(β̂i)   z-score   p-value
    (intercept)   5.583     0.0721    77.5      0.0082
    X1            0.779     0.0207    37.7      0.0169
    X2           −1.699     0.0221   −76.8      0.0083
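These quantities can be recomputed from the example's design matrix (R sketch, not from the slides; small deviations are rounding):

    X <- matrix(c(1, 1, 2,  1, 2, 3,  1, 4, 1,  1, 5, 5), ncol = 3, byrow = TRUE)
    y <- c(3, 2, 7, 1)
    n <- nrow(X); p <- ncol(X)
    beta   <- solve(t(X) %*% X, t(X) %*% y)
    sigma2 <- sum((y - X %*% beta)^2) / (n - p)         # 0.0035
    se     <- sqrt(diag(solve(t(X) %*% X)) * sigma2)    # 0.072, 0.021, 0.022
    z      <- drop(beta) / se                           # 77.5, 37.7, -76.8
    2 * pt(-abs(z), df = n - p)                         # 0.008, 0.017, 0.008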
[Figure: scatterplot matrix of the US states dataset (state.x77), including the variables Illiteracy (1970), Life Exp, and HS Grad (1976)]
Machine Learning / 4. Variable Interactions
Need for higher orders
Example: a body moving with constant acceleration a covers the distance
    s(t) = (1/2) a t²

[Figure: measured distances s versus time t for this example]
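Such higher-order terms can simply be added as derived predictor variables. A minimal R sketch with made-up measurements (the acceleration a = 6 and the noise level are arbitrary assumptions, not slide data):

    set.seed(1)
    t <- 0:8
    s <- 0.5 * 6 * t^2 + rnorm(length(t), sd = 5)   # hypothetical data, a = 6
    fit <- lm(s ~ I(t^2))                           # regress on the derived variable t^2
    coef(fit)                                       # the t^2 coefficient is close to a/2 = 3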
Machine Learning / 4. Variable Interactions
Need for variable interactions
Machine Learning / 5. Model Selection
Underfitting

[Figure: quadratic data with a fitted straight line that systematically misses the points]

If a model does not explain the data well,
e.g., if the true model is quadratic but we try to fit a linear model,
one says that the model underfits.
Machine Learning / 5. Model Selection
Overfitting / Fitting Polynomials of High Degree
[Figure: the example data with fitted polynomials of high degree]
Machine Learning / 5. Model Selection
Overfitting / Fitting Polynomials of High Degree
If to data
    (x1, y1), (x2, y2), . . . , (xn, yn)
consisting of n points we fit
    Ŷ = β0 + β1X + β2X² + · · · + βn−1X^{n−1},
i.e., a polynomial of degree n − 1, then this results in an interpolation of the data points
(if there are no repeated measurements, i.e., points with the same X),
as the polynomial
    r(X) = Σ_{i=1}^n yi Π_{j≠i} (X − xj) / (xi − xj)
has degree n − 1 and passes exactly through all n data points.
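This can be checked directly by fitting a polynomial of degree n − 1 to n points (R sketch with hypothetical data, not from the slides):

    set.seed(1)
    x <- 1:8
    y <- sin(x) + rnorm(8, sd = 0.3)       # hypothetical data, n = 8 points
    fit <- lm(y ~ poly(x, degree = 7))     # polynomial of degree n - 1 = 7
    max(abs(fitted(fit) - y))              # essentially 0: the fit interpolates the data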
If we just look at fit measures such as RSS, then the larger p, the better the fit,
as any model with p parameters can be reparametrized as a model with p′ > p parameters by setting
    β′i = βi   for i ≤ p,
    β′i = 0    for i > p.
Machine Learning / 5. Model Selection
Model Selection Measures
The smaller the complexity, the simpler and thus better the
model.
Machine Learning / 5. Model Selection
Variable Backward Selection
Start with the full variable set:
    { A, F, H, I, J, L, P }                          AIC = 63.01

Remove one variable (three of the candidate sets shown):
    without A:  { F, H, I, J, L, P }                 AIC = 63.87
    ...
    without I:  { A, F, H, J, L, P }                 AIC = 61.11
    ...
    without P:  { A, F, H, I, J, L }                 AIC = 70.17

Remove a second variable, starting from the best set above (I removed):
    without A, I:  { F, H, J, L, P }                 AIC = 61.88
    ...
    without H, I:  { A, F, J, L, P }                 AIC = 59.40
    ...
    without I, P:  { A, F, H, J, L }                 AIC = 68.70

Remove a third variable, starting from the best set above (H, I removed):
    without A, H, I:  { F, J, L, P }                 AIC = 63.23
    without F, H, I:  { A, J, L, P }                 AIC = 61.50
    ...
    without H, I, P:  { A, F, J, L }                 AIC = 66.71

The AIC is lowest after removing two variables (59.40); all displayed candidates for a third removal would increase it again.
Machine Learning / 5. Model Selection
Variable Backward Selection
full model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.222e+02 1.789e+01 6.831 2.54e-08 ***
Population 1.880e-04 6.474e-05 2.905 0.00584 **
Income -1.592e-04 5.725e-04 -0.278 0.78232
Illiteracy 1.373e+00 8.322e-01 1.650 0.10641
‘Life Exp‘ -1.655e+00 2.562e-01 -6.459 8.68e-08 ***
‘HS Grad‘ 3.234e-02 5.725e-02 0.565 0.57519
Frost -1.288e-02 7.392e-03 -1.743 0.08867 .
Area 5.967e-06 3.801e-06 1.570 0.12391
library(datasets);
library(MASS);
st = as.data.frame(state.x77);
mod.full = lm(Murder ~ ., data = st);   # full model; the response is inferred from the coefficient table above
mod.opt = stepAIC(mod.full);
summary(mod.opt);
Machine Learning / 5. Model Selection
Shrinkage
shrinkage operates by
• including a penalty term directly in the model equation and
Solutions of ridge regression are not equivariant under scaling of the predictors.

Example (ridge regression with λ = 5 for the data from the multiple regression example):

    X = ( 1  1  2        Y = ( 3        I = ( 1  0  0
          1  2  3              2              0  1  0
          1  4  1              7              0  0  1 )
          1  5  5 ),           1 ),

    X^T X = (  4  12  11       X^T X + 5I = (  9  12  11       X^T Y = ( 13
              12  46  37                      12  51  37                 40
              11  37  39 ),                   11  37  44 ),              24 )
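Solving these penalized normal equations in R (a sketch, not from the slides; note that, as in the matrices above, the penalty here is also applied to the intercept):

    X <- matrix(c(1, 1, 2,  1, 2, 3,  1, 4, 1,  1, 5, 5), ncol = 3, byrow = TRUE)
    y <- c(3, 2, 7, 1)
    lambda <- 5
    solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)   # (X^T X + lambda I)^(-1) X^T y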
Machine Learning / 6. Case Weights
Cases of Different Importance
Sometimes cases are of different importance, e.g., because some measurements are less reliable than others. Thus, the model does not need to fit such a point equally as well as it needs to fit the remaining points.

[Figure: data points and a fitted line]
Machine Learning / 6. Case Weights
Weighted Least Squares Estimates
The weighted least squares estimates minimize the weighted residual sum of squares
    Σ_{i=1}^n wi (yi − ŷi)²
and are given by
    β̂ = (X^T W X)^{−1} X^T W Y
with
    W := diag(w1, w2, . . . , wn)
Example data:

    w    x      y
    1    5.65   3.54
    1    3.37   1.75
    1    1.97   0.04
    1    3.70   4.42
    1    8.14   8.75
    1    7.42   8.11
    1    6.59   5.64

[Figure: the data points in the (x, y) plane]
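Weighted fitting with these data in R (a sketch, not from the slides; up-weighting the third case with w = 10 is an arbitrary choice to show the effect of case weights):

    d <- data.frame(
      w = c(1, 1, 1, 1, 1, 1, 1),
      x = c(5.65, 3.37, 1.97, 3.70, 8.14, 7.42, 6.59),
      y = c(3.54, 1.75, 0.04, 4.42, 8.75, 8.11, 5.64)
    )
    fit_ols <- lm(y ~ x, data = d)                  # equal weights: ordinary least squares
    d$w[3]  <- 10                                   # hypothetical: make the third case more important
    fit_wls <- lm(y ~ x, data = d, weights = w)     # weighted least squares
    coef(fit_ols); coef(fit_wls)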
Machine Learning / 6. Case Weights
Summary
• The ordinary least squares estimates (OLS) are the parameters with
minimal residual sum of squares (RSS). They coincide with the
maximum likelihood estimates (MLE).