CS434
Supervised learning
Regression analysis
“In statistical modeling, regression analysis is a set of
statistical processes for estimating the relationships
among variables … focus is on the relationship
between a dependent variable and one or
more independent variables (or 'predictors'). ”
--- Wikipedia
Our Training Data
[Figure: two scatter plots of Height against Knee height and against Arm span for the training data; a few suspected outliers are marked.]
Linear prediction function
• We will only consider linear functions (thus the name linear regression):

  y = w1·x1 + w2·x2 + b
One-dimensional Regression
Example: y = x + 3

[Figure: Height vs. Arm span scatter plot with the line y = x + 3 drawn through the data.]
One-dimensional regression
[Figure: Height vs. Arm span data with two candidate regression lines; one is blue.]
• Which line is better?
• The blue line seems better, but in what way?
• How can we define this goodness precisely?
Let's formalize it a bit more

• Given a set of training examples {(x_i, y_i) : i = 1, …, n}

• Goal: learn w and b from the training data, so that y = wx + b predicts y_i from x_i accurately

• In mathematical terms, we would like to find the w and b that minimize the following objective:

  E(w, b) = Σ_{i=1}^n (y_i − (w·x_i + b))²

  Sum of Squared Error (SSE)

Training data:

  ID | arm span (x) | height (y)
   1 | 166          | 170
   2 | 196          | 191
   3 | 191          | 189
   4 | 180.34       | 180.34
   5 | 174          | 171
   6 | 176.53       | 176.53
   7 | 177          | 187
   8 | 208.28       | 185.42
   9 | 199          | 190
  10 | 181          | 181
  11 | 178          | 180
  12 | 172          | 175
  13 | 185          | 188
  14 | 188          | 185
  15 | 165          | 170
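The SSE objective is easy to compute directly. A minimal sketch in plain Python (the helper name `sse` and the three-point toy data are my own, not from the slides):

```python
def sse(w, b, xs, ys):
    """Sum of squared errors of the line y = w*x + b over the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

# Toy data lying exactly on y = 2x + 1, so that line has zero error.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(sse(2.0, 1.0, xs, ys))  # 0.0 (perfect fit)
print(sse(1.0, 0.0, xs, ys))  # 29.0 (a worse line has larger SSE)
```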
Optimization 101

• Given a function f(t), finding the t* that minimizes f(t) can be challenging or impossible in many situations
  – f(t) could be unbounded, without a minimizer
• For the SSE objective E(w, b), however, we can find the minimizer analytically:

1. Take partial derivatives w.r.t. w and b respectively:

  ∂E/∂w = Σ_{i=1}^n −2 (y_i − w·x_i − b) x_i ;   ∂E/∂b = Σ_{i=1}^n −2 (y_i − w·x_i − b)
2. Set both derivatives to zero and solve for w and b, which gives the closed form:

  w* = (mean(xy) − x̄·ȳ) / (mean(x²) − x̄²),   b* = ȳ − w*·x̄

For our training data:

  x̄ = (1/n) Σ_i x_i = 182.477     ȳ = (1/n) Σ_i y_i = 181.286
  mean(x²) = 33436.53    mean(xy) = 33148.91    x̄·ȳ = 33080.46

  w* = (33148.91 − 33080.46) / (33436.53 − 182.477²) = 0.493
  b* = 181.3 − 0.493 × 182.5 = 91.30

[Figure: the fitted line plotted over the Height vs. Arm span data.]
Height = 91.30 + 0.493 × arm span
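As a sanity check on the arithmetic above, the closed-form solution can be recomputed directly from the table in plain Python (variable names are my own):

```python
# Arm span (x) and height (y) for the 15 training examples from the table.
x = [166, 196, 191, 180.34, 174, 176.53, 177, 208.28, 199, 181,
     178, 172, 185, 188, 165]
y = [170, 191, 189, 180.34, 171, 176.53, 187, 185.42, 190, 181,
     180, 175, 188, 185, 170]

n = len(x)
x_bar = sum(x) / n                              # mean of x, ~182.477
y_bar = sum(y) / n                              # mean of y, ~181.286
xx_bar = sum(v * v for v in x) / n              # mean of x^2
xy_bar = sum(a * b for a, b in zip(x, y)) / n   # mean of x*y

# Closed form from setting dE/dw = dE/db = 0:
w = (xy_bar - x_bar * y_bar) / (xx_bar - x_bar ** 2)
b = y_bar - w * x_bar
print(w, b)  # approx 0.493 and 91.30, matching the slide
```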
Extending to more features
• Having more features will mean our objective has more
variables to optimize over
• One can solve this similarly by
– Taking partial derivative of each variable
– Setting them to zero
– Solving the system of equations simultaneously
Definition: vector
• A vector is a one-dimensional array.
• We usually denote vectors by boldface lowercase letters, e.g. x, and use x (italic, non-bold) to denote a single variable
• If we don't specify otherwise, assume x is a column vector
• Or, alternatively, multiplying a column vector by a row vector:

  [x1]                  [x1·y1  x1·y2  x1·y3]
  [x2] [y1 y2 y3]   =   [x2·y1  x2·y2  x2·y3]
  [x3]                  [x3·y1  x3·y2  x3·y3]

  – This is often called the outer product of two vectors, written as x ⊗ y = xyᵀ
Useful operations: vector norm
• Given a d-dimensional vector x = (x1, x2, …, xd)ᵀ, the (L-2, or Euclidean) norm of x is represented as

  ‖x‖₂ = √(x1² + x2² + ⋯ + xd²) = √⟨x, x⟩

• More generally, the L-p norm is

  ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^(1/p)
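For concreteness, NumPy computes both of these (a small sketch with a vector of my choosing):

```python
import numpy as np

x = np.array([3.0, 4.0])

# L2 (Euclidean) norm: sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(x))         # 5.0

# Same value via the inner product <x, x>
print(np.sqrt(x @ x))            # 5.0

# General Lp norm, e.g. L1: |3| + |4| = 7
print(np.linalg.norm(x, ord=1))  # 7.0
```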
Useful operations: Matrix Inversion
• The inverse of a square matrix A is a matrix A⁻¹ such that AA⁻¹ = I, where I is called an identity matrix. For example:

  A = [1  0]      A⁻¹ = [  1    0 ]
      [1  2]            [−1/2  1/2]
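Reconstructing the slide's 2×2 example in NumPy and checking that the product really is the identity:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 2.0]])

A_inv = np.linalg.inv(A)
print(A_inv)      # [[ 1.   0. ]
                  #  [-0.5  0.5]]
print(A @ A_inv)  # the 2x2 identity matrix
```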
Objective

Previous objective:  E(w, b) = Σ_{i=1}^n (y_i − w·x_i − b)²

Updated form (absorbing b into w by appending a constant feature 1 to each x_i):

  E(w) = Σ_{i=1}^n (y_i − wᵀx_i)²
• Let y = (y1, y2, …, yn)ᵀ
• Let X = (x1, x2, …, xn)ᵀ, i.e., each row of X is one training example:

  X = [x1ᵀ]      y = [y1]    Example 1
      [ ⋮ ]          [ ⋮ ]
      [xnᵀ]          [yn]    Example n

• Then E(w) = (y − Xw)ᵀ(y − Xw); setting the gradient to zero:

  ⇒ XᵀXw = Xᵀy
  ⇒ w = (XᵀX)⁻¹Xᵀy

The Matrix Cookbook is a good resource to help you with this type of manipulation.
With both features (columns of X: constant 1, knee height, arm span; Y is height):

      [1  50     166   ]        [170   ]
      [1  57     196   ]        [191   ]
      [1  50     191   ]        [189   ]
      [1  53.34  180.34]        [180.34]
      [1  54     174   ]        [171   ]
      [1  55.88  176.53]        [176.53]
      [1  57     177   ]        [187   ]
  X = [1  55.88  208.28]    Y = [185.42]
      [1  57     199   ]        [190   ]
      [1  54     181   ]        [181   ]
      [1  55     178   ]        [180   ]
      [1  53     172   ]        [175   ]
      [1  57     185   ]        [188   ]
      [1  49.5   165   ]        [170   ]
      [1  57     188   ]        [185   ]

         [15      815.6    2737.2  ]         [70.19]
  XᵀX =  [815.6   44451.6  149081  ]     w = [0.656]
         [2737.2  149081   501547.9]         [0.413]
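This worked example can be reproduced with NumPy. A sketch (data transcribed from the slide; I solve the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is the numerically preferred route):

```python
import numpy as np

# Columns: constant 1, knee height, arm span; Y is height.
X = np.array([
    [1, 50.0,  166.0],  [1, 57.0,  196.0],  [1, 50.0,  191.0],
    [1, 53.34, 180.34], [1, 54.0,  174.0],  [1, 55.88, 176.53],
    [1, 57.0,  177.0],  [1, 55.88, 208.28], [1, 57.0,  199.0],
    [1, 54.0,  181.0],  [1, 55.0,  178.0],  [1, 53.0,  172.0],
    [1, 57.0,  185.0],  [1, 49.5,  165.0],  [1, 57.0,  188.0],
])
Y = np.array([170, 191, 189, 180.34, 171, 176.53, 187, 185.42,
              190, 181, 180, 175, 188, 170, 185.0])

# Normal equations: (X^T X) w = X^T Y
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)  # approx [70.19, 0.656, 0.413], matching the slide

# Training SSE with both features vs. arm span alone
sse2 = np.sum((Y - X @ w) ** 2)
X1 = X[:, [0, 2]]  # keep only the constant and arm span columns
w1 = np.linalg.solve(X1.T @ X1, X1.T @ Y)
sse1 = np.sum((Y - X1 @ w1) ** 2)
print(sse1, sse2)  # adding a feature never increases training SSE
```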
[Figure: geometric view of least squares, showing y, the residual y − Xw, and the feature vectors x1, x2.]
What is the effect of adding one feature?
• By using both arm span and knee height, can we do better than using just arm span?
• How do we compare?
  – Training SSE with only arm span: 257.445
  – Training SSE with arm span and knee height: 225.680
• Does this mean the model with two features is necessarily better than the one-feature model?
• More generally, is it always better to have more features?
  – Effect on training?
  – Effect on testing?
Summary
• We introduced linear regression, which assumes that the function mapping x to y is linear
• Sum of Squared Error objective:

  E(w) = (y − Xw)ᵀ(y − Xw)

• The solution is given by:

  w = (XᵀX)⁻¹Xᵀy

• There are other objectives, leading to different solutions
• Although we make the linear assumption, it is easy to use this to learn nonlinear functions as well
  – Introduce nonlinear features, e.g., x1², x2², x1·x2
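A small illustration of the last point (toy data of my own): data generated from y = x² cannot be fit by a straight line, but adding a squared feature makes the same closed-form solution recover it exactly, because the model is still linear in w.

```python
import numpy as np

# Data generated from y = x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Nonlinear feature map: each row is (1, x, x^2).
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Same normal-equations solution as before.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approx [0, 0, 1]: the fit recovers y = x^2
```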