Linear Regression
Piyush Rai
Machine Learning (CS771A)
Aug 10, 2016
Learning as Optimization

Consider a supervised learning problem with training data \{(x_n, y_n)\}_{n=1}^N

Goal: learn a function f that captures the x \to y relationship

Let \ell(y, f(x)) denote the loss of f on an example (x, y)

We would like to find f that minimizes the true loss or risk, defined as

    L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))] = \int \ell(y, f(x)) \, dP(x, y)

Problem: We (usually) don't know the true distribution P(x, y); we only have a finite set of samples from it, in the form of the N training examples \{(x_n, y_n)\}_{n=1}^N

Solution: Work with the empirical risk defined on the training data

    L_{emp}(f) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n))
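As a quick sketch of how the empirical risk is computed (a toy example of my own, assuming the squared loss and a simple candidate predictor):

```python
import numpy as np

# Toy training data {(x_n, y_n)}: 1-D inputs, N = 4 examples
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

def f(x):
    # An illustrative candidate predictor: the identity map
    return x

def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# L_emp(f) = (1/N) * sum_n loss(y_n, f(x_n))
L_emp = np.mean(squared_loss(y, f(x)))
print(L_emp)  # 0.01, since every residual here is +/- 0.1
```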
Learning as Optimization

To find the best f, we can now minimize the empirical risk w.r.t. f

    \hat{f} = \arg\min_f L_{emp}(f) = \arg\min_f \sum_{n=1}^N \ell(y_n, f(x_n))

To discourage overly complex solutions, we typically also add a regularizer R(f) and minimize the regularized objective

    \hat{f} = \arg\min_f \sum_{n=1}^N \ell(y_n, f(x_n)) + R(f)
Regularization: Pictorially

[Figure: two curves fit to the same set of training points]

Both curves have the same (zero) empirical error L_{emp}

The green curve has a smaller R(f) (thus smaller complexity)
Learning as Optimization

Our goal is to solve the optimization problem

    \hat{f} = \arg\min_f \sum_{n=1}^N \ell(y_n, f(x_n)) + R(f)

We will revisit these questions later. First, let's look at an example problem.
Linear Regression

Given: Training data with N examples \{(x_n, y_n)\}_{n=1}^N, with x_n \in \mathbb{R}^D and y_n \in \mathbb{R}

Assume the following linear model with model parameters w \in \mathbb{R}^D

    y_n \approx w^\top x_n = \sum_{d=1}^D w_d x_{nd}

A simple and interpretable linear model. Can also write it more compactly as

    y \approx Xw    (with w unknown)
Using the squared loss, the total (empirical) error on the training data is

    L_{emp}(w) = \sum_{n=1}^N \ell(y_n, w^\top x_n) = \sum_{n=1}^N (y_n - w^\top x_n)^2

We'll estimate w by minimizing L_{emp}(w). Taking the derivative w.r.t. w and setting it to zero:

    \frac{\partial L_{emp}(w)}{\partial w} = -\sum_{n=1}^N 2(y_n - w^\top x_n) x_n = 0
    \implies \sum_{n=1}^N x_n (y_n - x_n^\top w) = 0

Solving for w gives the least-squares solution

    w = \Big(\sum_{n=1}^N x_n x_n^\top\Big)^{-1} \sum_{n=1}^N y_n x_n = (X^\top X)^{-1} X^\top y

Note: x_n is D x 1, X is N x D, and y is N x 1

Minimizing \sum_{n=1}^N (y_n - w^\top x_n)^2 alone may overfit, so we can add the regularizer R(w) = \sum_{d=1}^D w_d^2 = \|w\|^2, giving the ridge-regression solution

    w = \Big(\sum_{n=1}^N x_n x_n^\top + \lambda I_D\Big)^{-1} \sum_{n=1}^N y_n x_n = (X^\top X + \lambda I_D)^{-1} X^\top y

where \lambda > 0 controls the strength of the regularization.
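The two closed-form solutions can be sketched in NumPy as follows (the synthetic data and all variable names are my own; solving the linear system with `np.linalg.solve` rather than forming an explicit inverse is just the numerically safer way to apply the same formulas):

```python
import numpy as np

# Synthetic data: N examples, D features, known true weights
rng = np.random.default_rng(0)
N, D = 100, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, D))                 # X is N x D
y = X @ w_true + 0.1 * rng.normal(size=N)   # y is N x 1, with small noise

# Least squares: w = (X^T X)^{-1} X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print(w_ls)     # close to w_true
print(w_ridge)  # slightly shrunk toward zero relative to w_ls
```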
n=1
Learning as Optimization: Linear Regression

We saw that the regularized least-squares problem has the closed-form solution

    w = (X^\top X + \lambda I_D)^{-1} X^\top y

Alternatively, we can minimize the objective iteratively using gradient descent:

    w^{(t)} = w^{(t-1)} - \eta \left.\frac{\partial L}{\partial w}\right|_{w = w^{(t-1)}}

where \eta is the learning rate. Repeat until convergence.

For the squared loss, the gradient is

    \frac{\partial L}{\partial w} = -2 \sum_{n=1}^N x_n (y_n - x_n^\top w)
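A minimal gradient-descent sketch for this objective (unregularized squared loss; the data, step size, and iteration count are illustrative choices of mine):

```python
import numpy as np

# Synthetic data
rng = np.random.default_rng(1)
N, D = 200, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + 0.1 * rng.normal(size=N)

# Gradient descent on L(w) = sum_n (y_n - w^T x_n)^2
w = np.zeros(D)
eta = 0.001  # learning rate
for t in range(500):
    grad = -2 * X.T @ (y - X @ w)  # dL/dw = -2 * sum_n x_n (y_n - x_n^T w)
    w = w - eta * grad

# The iterates approach the closed-form least-squares solution
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.abs(w - w_closed).max())  # very small after 500 iterations
```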
Learning as Optimization: Unsupervised Learning

Unsupervised learning can also be cast in this framework: learn a new representation z_n for each input x_n, together with a function f that reconstructs x_n from z_n, i.e., x_n \approx f(z_n)

In this case, we can define a loss function \ell(x_n, f(z_n)) that measures how well f can reconstruct the original x_n from its new representation z_n

This generic unsupervised learning problem can thus be written as

    \hat{f} = \arg\min_{f, Z} \sum_{n=1}^N \ell(x_n, f(z_n)) + R(f, Z)
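As one concrete instance of this template (my own illustration, not from the slides): with a linear f, squared loss, K-dimensional representations, and no regularizer, the joint minimization over f and Z is solved by the rank-K truncated SVD of the data matrix, which is essentially PCA:

```python
import numpy as np

# Data that is close to K-dimensional, plus noise
rng = np.random.default_rng(2)
N, D, K = 100, 5, 2
Z_true = rng.normal(size=(N, K))
W_true = rng.normal(size=(K, D))
X = Z_true @ W_true + 0.05 * rng.normal(size=(N, D))

# With a linear f and squared loss, minimizing sum_n ||x_n - f(z_n)||^2
# jointly over Z and f is solved by the rank-K truncated SVD of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :K] * s[:K]   # learned representations z_n (rows of Z, N x K)
W = Vt[:K]             # learned linear map (K x D)
X_hat = Z @ W          # reconstructions f(z_n)

recon_err = np.mean((X - X_hat) ** 2)
print(recon_err)  # small, since X is approximately rank K
```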