
Lecture 2: Linear Regression
CSC 84020 - Machine Learning
Andrew Rosenberg
February 5, 2009
Today
Linear Regression
Linear Regression
Linear Regression is a regression algorithm, a supervised technique.
In one dimension:
Goal: identify $y : \mathbb{R} \rightarrow \mathbb{R}$.
In D dimensions:
Goal: identify $y : \mathbb{R}^D \rightarrow \mathbb{R}$.
Given: a set of training data $\{x_0, x_1, \ldots, x_N\}$ with targets $\{t_0, t_1, \ldots, t_N\}$.
Recall Regression
Define the problem
In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables.

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D$$

Or, simplified:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j x_j$$

w is a vector of weights which define the M parameters of the model.
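A minimal NumPy sketch of this model (an illustrative example, not from the slides): folding the bias w_0 into a dot product by prepending a 1 to the input vector.

```python
import numpy as np

def predict(x, w):
    """Linear model y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D.

    x : array of shape (D,)   -- one input vector
    w : array of shape (D+1,) -- weights, w[0] is the bias w_0
    """
    x_aug = np.concatenate(([1.0], x))  # prepend 1 so the bias folds into the dot product
    return x_aug @ w

# Example: D = 2, w_0 = 0.5, w_1 = 2.0, w_2 = -1.0
w = np.array([0.5, 2.0, -1.0])
print(predict(np.array([3.0, 4.0]), w))  # 0.5 + 2*3 - 1*4 = 2.5
```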
Optimization
How can we evaluate the performance of a regression solution?
Error Functions
(aka Loss Functions)
Simplest error: Squared Error from Target
$$E(t_i, y(x_i, \mathbf{w})) = \frac{1}{2}(t_i - y(x_i, \mathbf{w}))^2$$

Other options: Linear error

$$E(t_i, y(x_i, \mathbf{w})) = |t_i - y(x_i, \mathbf{w})|$$

Total Error

$$E(\mathbf{t}, y(\mathbf{x}, \mathbf{w})) = R_{\mathrm{emp}} = \frac{1}{N}\sum_{i=1}^{N} E(t_i, y(x_i, \mathbf{w}))$$
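A short sketch of these error functions on NumPy arrays (names and data here are illustrative, not from the lecture):

```python
import numpy as np

def squared_error(t_i, y_i):
    """Per-example squared error: (1/2)(t_i - y_i)^2."""
    return 0.5 * (t_i - y_i) ** 2

def absolute_error(t_i, y_i):
    """Per-example linear (absolute) error: |t_i - y_i|."""
    return np.abs(t_i - y_i)

def empirical_risk(t, y, error_fn=squared_error):
    """Average the per-example error over all N training points."""
    return np.mean(error_fn(t, y))

t = np.array([1.0, 2.0, 3.0])   # targets
y = np.array([1.1, 1.9, 2.5])   # model predictions y(x_i, w)
print(empirical_risk(t, y))     # mean of 0.5*(t - y)^2
```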
Likelihood of t
If we can describe the likelihood of a guess t, given a function y and training data x, we can minimize this risk by setting its derivative to zero.

$$R_{\mathrm{emp}} = \frac{1}{N}\sum_{i=1}^{N} E(t_i, y(x_i, \mathbf{w})) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}(t_i - y(x_i, \mathbf{w}))^2$$

$$\nabla_{\mathbf{w}} R = 0$$
Likelihood and Risk
Brief Aside
The relationship between model likelihood and Empirical Risk.
The likelihood of a target given a model is:

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$$

where $\beta = \frac{1}{\sigma^2}$, the inverse variance.
So...

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=0}^{N-1} \mathcal{N}(t_i \mid y(x_i, \mathbf{w}), \beta^{-1})$$

assuming Independent Identically Distributed (iid) data.
Likelihood and Risk
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=0}^{N-1} \mathcal{N}(t_i \mid y(x_i, \mathbf{w}), \beta^{-1})$$

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{i=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}\,(y(x_i, \mathbf{w}) - t_i)^2\right)$$

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \ln \prod_{i=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}\,(y(x_i, \mathbf{w}) - t_i)^2\right)$$

$$= -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, \mathbf{w}) - t_i)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi$$

$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, \mathbf{w}) - t_i)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi$$

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \nabla_{\mathbf{w}} \left[-\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, \mathbf{w}) - t_i)^2\right]$$

To maximize the log likelihood:

$$\nabla_{\mathbf{w}} \left[-\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, \mathbf{w}) - t_i)^2\right] = 0$$

Maximizing the (log) likelihood (under a Gaussian) is equivalent to minimizing the sum-of-squares error.
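A quick numerical illustration of this equivalence (my own sketch, with assumed data and a fixed β): the log likelihood is just −(β/2) times the sum-of-squares error plus terms independent of w, so the same weights optimize both.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.5 * x + 0.3 + rng.normal(scale=0.1, size=50)  # noisy line

def sse(w0, w1):
    return np.sum((t - (w0 + w1 * x)) ** 2)

def log_likelihood(w0, w1, beta=100.0):
    n = len(x)
    return -0.5 * beta * sse(w0, w1) + 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi)

# Over a grid of candidate weights, the pair that minimizes SSE
# is the same pair that maximizes the log likelihood.
grid = [(w0, w1) for w0 in np.linspace(0, 1, 21) for w1 in np.linspace(1, 2, 21)]
best_sse = min(grid, key=lambda w: sse(*w))
best_ll = max(grid, key=lambda w: log_likelihood(*w))
print(best_sse == best_ll)  # True
```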
Maximize the log likelihood
Optimize the weights in one dimension.
In one dimension, $\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$.

$$\nabla_{\mathbf{w}} R = 0 \;\Rightarrow\; \begin{bmatrix} \frac{\partial R}{\partial w_0} \\ \frac{\partial R}{\partial w_1} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$R(\mathbf{w}) = \frac{1}{2N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2$$
Maximize the log likelihood
$$\nabla_{\mathbf{w}} R(\mathbf{w}) = \nabla_{\mathbf{w}}\left[\frac{1}{2N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2\right]$$

Set each partial to 0. First $w_0$:

$$\frac{\partial R}{\partial w_0} = \frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1)$$

$$\frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1) = 0$$

$$\frac{1}{N}\sum_{i=0}^{N-1} w_0 = \frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i)$$

$$w_0 = \frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i)$$

$$w_0 = \frac{1}{N}\sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N}\sum_{i=0}^{N-1} x_i$$
Maximize the log likelihood
$$\nabla_{\mathbf{w}} R(\mathbf{w}) = \nabla_{\mathbf{w}}\left[\frac{1}{2N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2\right]$$

Set each partial to 0. Now $w_1$:

$$\frac{\partial R}{\partial w_1} = \frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i)$$

$$\frac{1}{N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i) = 0$$

$$\frac{1}{N}\sum_{i=0}^{N-1} (t_i x_i - w_1 x_i^2 - w_0 x_i) = 0$$

$$\frac{1}{N}\sum_{i=0}^{N-1} w_1 x_i^2 = \frac{1}{N}\sum_{i=0}^{N-1} t_i x_i - \frac{1}{N}\sum_{i=0}^{N-1} w_0 x_i$$

$$w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i$$
Maximize the log likelihood
Substitute in $w_0^*$ and simplify.

$$w_0^* = \frac{1}{N}\sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N}\sum_{i=0}^{N-1} x_i$$

$$w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i$$

$$w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - \left(\frac{1}{N}\sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N}\sum_{i=0}^{N-1} x_i\right)\sum_{i=0}^{N-1} x_i$$

$$w_1 \left(\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N}\sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i\right) = \sum_{i=0}^{N-1} t_i x_i - \frac{1}{N}\sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i$$

$$w_1 = \frac{\sum_{i=0}^{N-1} t_i x_i - \frac{1}{N}\sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i}{\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N}\sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i}$$
Maximized Log likelihood
Thus:
$$\begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix} = \begin{bmatrix} \dfrac{1}{N}\sum_{i=0}^{N-1} t_i - w_1^* \dfrac{1}{N}\sum_{i=0}^{N-1} x_i \\[10pt] \dfrac{\sum_{i=0}^{N-1} t_i x_i - \frac{1}{N}\sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i}{\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N}\sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i} \end{bmatrix}$$
Done.
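As a sanity check, a short sketch (an assumed example, not from the slides) that plugs synthetic data into these closed-form expressions and compares against np.polyfit:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-2, 2, size=N)
t = -0.7 + 1.3 * x + rng.normal(scale=0.2, size=N)  # noisy line t = w0 + w1*x

# Closed-form solution for w1, then back-substitute for w0.
w1 = (np.sum(t * x) - np.sum(t) * np.sum(x) / N) / \
     (np.sum(x ** 2) - np.sum(x) * np.sum(x) / N)
w0 = np.mean(t) - w1 * np.mean(x)

print(w0, w1)                    # close to -0.7 and 1.3
print(np.polyfit(x, t, deg=1))   # returns [w1, w0] -- should agree
```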
But this is a little clunky. Let's use linear algebra to generalize.
Extend to multiple dimensions
Maximum Log Likelihood calculation as vectors and matrices.
$$R_{\mathrm{emp}}(\mathbf{w}) = \frac{1}{2N}\sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2$$

$$= \frac{1}{2N}\sum_{i=0}^{N-1} \left(t_i - \begin{bmatrix} 1 & x_i \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}\right)^2$$

$$= \frac{1}{2N}\left\| \begin{bmatrix} t_0 \\ t_1 \\ \vdots \\ t_{N-1} \end{bmatrix} - \begin{bmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{N-1} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \right\|^2$$

$$= \frac{1}{2N}\,\|\mathbf{t} - \mathbf{X}\mathbf{w}\|^2$$
Extend to multiple dimensions
Now that we have a general form of the empirical Risk, we can
easily extend to higher dimensions.
$$R_{\mathrm{emp}}(\mathbf{w}) = \frac{1}{2N}\,\|\mathbf{t} - \mathbf{X}\mathbf{w}\|^2$$

Now...

$$\nabla_{\mathbf{w}} R_{\mathrm{emp}}(\mathbf{w}) = 0$$

$$\nabla_{\mathbf{w}} \left[\frac{1}{2N}\,\|\mathbf{t} - \mathbf{X}\mathbf{w}\|^2\right] = 0$$
General form of Risk minimization
Solve the Gradient = 0
$$\nabla_{\mathbf{w}} R_{\mathrm{emp}}(\mathbf{w}) = 0$$

$$\nabla_{\mathbf{w}} \left[\frac{1}{2N}\,\|\mathbf{t} - \mathbf{X}\mathbf{w}\|^2\right] = 0$$

$$\frac{1}{2N}\,\nabla_{\mathbf{w}} \left[(\mathbf{t} - \mathbf{X}\mathbf{w})^T(\mathbf{t} - \mathbf{X}\mathbf{w})\right] = 0$$

$$\frac{1}{2N}\,\nabla_{\mathbf{w}} \left[\mathbf{t}^T\mathbf{t} - \mathbf{t}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{t} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right] = 0$$

$$\frac{1}{2N}\left[-\mathbf{X}^T\mathbf{t} - \mathbf{X}^T\mathbf{t} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}^*\right] = 0$$

$$\frac{1}{2N}\left[-2\mathbf{X}^T\mathbf{t} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}^*\right] = 0$$

$$\mathbf{X}^T\mathbf{X}\mathbf{w}^* = \mathbf{X}^T\mathbf{t}$$

$$\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$
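A compact NumPy sketch of this closed-form solution (assumed example data); in practice, np.linalg.solve or np.linalg.lstsq is preferred to explicitly inverting X^T X:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 3
X_raw = rng.normal(size=(N, D))
true_w = np.array([0.5, 2.0, -1.0, 0.25])            # [w0, w1, w2, w3]
X = np.hstack([np.ones((N, 1)), X_raw])              # design matrix with a column of 1s
t = X @ true_w + rng.normal(scale=0.1, size=N)

# Normal equations: X^T X w* = X^T t
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print(w_star)                                        # close to true_w

# Equivalent, more numerically robust:
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)
print(np.allclose(w_star, w_lstsq))                  # True
```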
Extension to fitting a line to a curve
Polynomial Regression
Polynomial Regression in one dimension.

$$y(x, \mathbf{w}) = \sum_{d=1}^{D} w_d x^d + w_0$$

Risk:

$$R = \frac{1}{2}\left\| \begin{bmatrix} t_0 \\ t_1 \\ \vdots \\ t_{n-1} \end{bmatrix} - \begin{bmatrix} 1 & x_0 & \ldots & x_0^p \\ 1 & x_1 & \ldots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & \ldots & x_{n-1}^p \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{bmatrix} \right\|^2$$
But this is just the same as linear regression in P dimensions.
Polynomial Regression as Linear Regression
To fit a P-dimensional polynomial, create a P-element vector from $x_i$:

$$\mathbf{x}_i = \begin{bmatrix} x_i^0 & x_i^1 & \ldots & x_i^P \end{bmatrix}^T$$
Then linear regression in P dimensions.
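A brief sketch of this trick (illustrative data and degree assumed): build the polynomial feature vectors, then reuse exactly the same least-squares machinery as before.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
t = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.05, size=100)  # cubic target

P = 3
# Each row is [x_i^0, x_i^1, ..., x_i^P]: the polynomial features of one input.
X = np.vander(x, N=P + 1, increasing=True)

# Same linear-regression solution as before, applied to the new features.
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w_star)   # close to [1.0, -2.0, 0.0, 0.5]
```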
How is this Linear regression?
The regression is linear in the parameters, despite manipulating $x_i$ from one dimension to P dimensions.
Now we fit a plane (or hyperplane) to a representation of $x_i$ in a higher-dimensional feature space.
How else can we use this method?
This generalizes to any set of functions $\phi_i : \mathbb{R} \rightarrow \mathbb{R}$:

$$\mathbf{x}_i = \begin{bmatrix} \phi_0(x_i) & \phi_1(x_i) & \ldots & \phi_P(x_i) \end{bmatrix}^T$$
Basis functions as Feature Extraction
These $\phi_i(x)$ functions are called basis functions, as they define the bases of the feature space.
This allows us to fit a linear decomposition of any type of function to data points.
Common choices include: polynomials, Gaussians, sigmoids (we'll cover them) and wave (sine wave) functions.
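For instance, a small sketch using Gaussian basis functions (the centers and width here are assumed choices, not from the lecture):

```python
import numpy as np

def gaussian_basis(x, centers, width=0.3):
    """Map each scalar x_i to [phi_0(x_i), ..., phi_P(x_i)] with Gaussian bumps."""
    # Column 0 is a constant 1 (the bias term); the rest are Gaussian bumps.
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=200)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=200)

centers = np.linspace(0, 1, 9)          # 9 evenly spaced basis centers
Phi = gaussian_basis(x, centers)
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predictions are still linear in w, even though the fit is nonlinear in x.
t_hat = gaussian_basis(x, centers) @ w_star
print(np.mean((t - t_hat) ** 2))        # small mean-squared error
```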
Training data v. testing data
Evaluation.
Evaluating the performance of a classifier on training data is meaningless.
With enough parameters, a model can simply memorize
(encode) every training point.
Therefore data is typically divided into two sets, training data and
testing or evaluation data.
Training data is used to learn model parameters.
Testing data is used to evaluate the model.
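A minimal sketch of this workflow (assumed data and model size): fit on the training portion, report error on the held-out testing portion.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=120)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=120)

# Hold out the last 40 points for testing; train on the first 80.
x_train, t_train = x[:80], t[:80]
x_test, t_test = x[80:], t[80:]

P = 9  # a fairly flexible polynomial
w = np.polyfit(x_train, t_train, deg=P)

train_err = np.mean((t_train - np.polyval(w, x_train)) ** 2)
test_err = np.mean((t_test - np.polyval(w, x_test)) ** 2)
print(train_err, test_err)   # test error is typically the larger of the two
```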
Overfitting 1/2
Overfitting 2/2
Overfitting
What is the correct model size?
The best model size is the one that generalizes best to unseen data.
We approximate this by the testing error.
One way to optimize the parameters is to minimize the testing
error.
This makes the testing data a tuning set.
However, this reduces the amount of training data in favor of
parameter optimization.
Can we do this directly without sacrificing training data?
Regularization.
Context
Who cares about Linear Regression?
It's a simple modeling approach that learns efficiently.
Through extensions to the basis functions, it's very extensible.
With regularization we can construct efficient models.
Bye
Next
Regularization in Linear Regression.
