
LECTURE 2

SIMPLE LINEAR REGRESSION MODEL

Xi Qu
Fall 2017
CASE: PRICING A DIAMOND RING

- Reference: S. Chu, "Diamond Ring Pricing Using Linear Regression," Journal of Statistics Education, v. 4, n. 3, 1996
- Source of data: full-page ad in the Straits Times, 2/29/1992
- Data: 48 ladies' rings of varying designs in 20K gold, each mounted with a single diamond stone. The diamond weights ranged from 0.12 to 0.35 carats (1 carat = 0.2 gram = 100 points), and the rings were priced between Singapore $223 and $1086.
- The price of diamond jewelry depends on the four C's: caratage, cut, color, and clarity of the diamond stone.
DATA (prices in Singapore $)

Obs Carats Price   Obs Carats Price
  1  0.17   355     25  0.17   353
  2  0.16   328     26  0.18   438
  3  0.17   350     27  0.17   318
  4  0.18   325     28  0.18   419
  5  0.25   642     29  0.17   346
  6  0.16   342     30  0.15   315
  7  0.15   322     31  0.17   350
  8  0.19   485     32  0.32   918
  9  0.21   483     33  0.32   919
 10  0.15   323     34  0.15   298
 11  0.18   462     35  0.16   339
 12  0.28   823     36  0.16   338
 13  0.16   336     37  0.23   595
 14  0.20   498     38  0.23   553
 15  0.23   595     39  0.17   345
 16  0.29   860     40  0.33   945
 17  0.12   223     41  0.25   655
 18  0.26   663     42  0.18   443
 19  0.25   750     43  0.25   678
 20  0.27   720     44  0.25   675
 21  0.18   468     45  0.15   287
 22  0.16   345     46  0.26   693
 23  0.17   352     47  0.15   316
 24  0.16   332     48  0.35  1086
https://www.bluenile.com/diamond-search?track=NavDiaSeaRD
FITTED REGRESSION

Y = -250.568 + 3671.397 X
    (18.11663)  (87.17442)

where Y is the price in Singapore $ and X is the diamond weight in carats (standard errors in parentheses).
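This fit can be checked numerically. Below is a minimal sketch in Python (assuming numpy and statsmodels are installed); the data arrays are transcribed from the table above, so the estimates should come out close to the fitted line reported here.

```python
# Reproduce the diamond-ring regression from the data table above.
import numpy as np
import statsmodels.api as sm

carats = np.array([
    0.17, 0.16, 0.17, 0.18, 0.25, 0.16, 0.15, 0.19, 0.21, 0.15, 0.18, 0.28,
    0.16, 0.20, 0.23, 0.29, 0.12, 0.26, 0.25, 0.27, 0.18, 0.16, 0.17, 0.16,
    0.17, 0.18, 0.17, 0.18, 0.17, 0.15, 0.17, 0.32, 0.32, 0.15, 0.16, 0.16,
    0.23, 0.23, 0.17, 0.33, 0.25, 0.18, 0.25, 0.25, 0.15, 0.26, 0.15, 0.35])
prices = np.array([
    355, 328, 350, 325, 642, 342, 322, 485, 483, 323, 462, 823,
    336, 498, 595, 860, 223, 663, 750, 720, 468, 345, 352, 332,
    353, 438, 318, 419, 346, 315, 350, 918, 919, 298, 339, 338,
    595, 553, 345, 945, 655, 443, 678, 675, 287, 693, 316, 1086])

X = sm.add_constant(carats)      # prepend a column of ones for the intercept
fit = sm.OLS(prices, X).fit()    # ordinary least squares
print(fit.params)                # [intercept, slope]
print(fit.bse)                   # standard errors
```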
The Simple Regression Model

- Definition of the simple linear regression model: explain variable y in terms of a single variable x:

  y = β0 + β1 x + u

- β0: intercept (constant term); β1: slope parameter (coefficient)
- y: dependent variable, explained variable, response variable, regressand, left-hand-side variable, ...
- x: independent variable, explanatory variable, control variable, regressor, covariate, right-hand-side variable, ...
- u: error term, disturbance, unobservables, ...
QUESTION

- 1. A dependent variable is also known as a(n) _____.
  a. explanatory variable
  b. control variable
  c. predictor variable
  d. response variable

- 2. In the equation y = β0 + β1 x + u, β0 is the _____.
  a. dependent variable
  b. independent variable
  c. slope parameter
  d. intercept parameter
Interpretation

- Interpretation of the simple linear regression model: it studies how y varies with changes in x. Ideally, we want to interpret it as

  Δy/Δx = β1   as long as   Δu/Δx = 0

- By how much does the dependent variable change if the independent variable is increased by one unit? This interpretation is only correct if all other things remain equal when the independent variable is increased by one unit (infeasible!).
- The simple linear regression model is rarely applicable in practice, but its discussion is useful for pedagogical reasons.
Examples

- Crop yield and fertilizer:

  yield = β0 + β1 fertilizer + u

  β1 measures the effect of fertilizer on yield, holding all other factors (rainfall, land quality, presence of parasites, ...) fixed; those factors are collected in u.

- A simple wage equation:

  wage = β0 + β1 educ + u

  β1 measures the change in hourly wage given another year of education, holding all other factors (labor force experience, tenure with current employer, work ethic, intelligence, ...) fixed.
Causal Interpretation

- When is there a causal interpretation? Under the conditional mean independence assumption:

  E(u | x) = 0

  The explanatory variable must not contain information about the mean of the unobserved factors.

- This is stronger than E(u) = 0 and cov(u, x) = 0.
- Example: the wage equation wage = β0 + β1 educ + u, where u includes e.g. intelligence. Here the conditional mean independence assumption is unlikely to hold, because individuals with more education will also be more intelligent on average.
Population Regression Function

- Population regression function (PRF): the conditional mean independence assumption implies that

  E(y | x) = E(β0 + β1 x + u | x) = β0 + β1 x + E(u | x) = β0 + β1 x

- This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable.
- Formally (compare to the ideal case above):

  ΔE(y | x)/Δx = β1   and   β0 = E(y | x = 0)
[Figure: the population regression function E(y | x) = β0 + β1 x. For individuals with x = x2, the average value of y is β0 + β1 x2.]
A random sample

- In order to estimate the regression model, one needs data.
- A random sample of n observations: {(xi, yi): i = 1, ..., n}, where yi is the value of the dependent variable and xi the value of the explanatory variable for the i-th observation.
- Fit a regression line through the data points as well as possible. The fitted regression line is

  ŷ = β̂0 + β̂1 x

  For example, the i-th data point (xi, yi) lies at some vertical distance from the line. Compare the fitted line to the PRF: the PRF is fixed but unknown, while the fitted line depends on the sample.

- "As good as possible" means: define the regression residuals

  ûi = yi − ŷi

  and minimize the sum of squared regression residuals, min Σi ûi².

- The solutions are the Ordinary Least Squares (OLS) estimates (see the numerical sketch below):

  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,   β̂0 = ȳ − β̂1 x̄
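As a minimal illustration of these formulas (my own sketch, not part of the original slides), the OLS estimates can be computed directly with numpy:

```python
# OLS slope and intercept from the closed-form formulas:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
import numpy as np

def ols_fit(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    xbar, ybar = x.mean(), y.mean()
    b1 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()
    b0 = ybar - b1 * xbar
    return b0, b1

# Tiny made-up example: y is roughly 1 + 2x plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
b0, b1 = ols_fit(x, y)
print(b0, b1)   # close to 1 and 2
```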
Example 1

- CEO salary and return on equity:

  salary = β0 + β1 roe + u

  where salary is in thousands of dollars and roe is the return on equity of the CEO's firm (in percent).

- Fitted regression:

  predicted salary = 963.191 + 18.501 roe

- Intercept: the predicted salary when roe = 0. Slope: if the return on equity increases by 1 percentage point, then salary is predicted to increase by 18.501 thousand dollars, i.e. $18,501.
- Causal interpretation?
[Figure: the fitted regression line (which depends on the sample) versus the unknown population regression line.]
Example 2

- Wage and education:

  wage = β0 + β1 educ + u

  where wage is the hourly wage in dollars and educ is years of education.

- Fitted regression:

  predicted wage = −0.90 + 0.54 educ

- Slope: in the sample, one more year of education was associated with an increase in hourly wage of $0.54.
- Causal interpretation?
Example 3

- Voting outcomes and campaign expenditures (two parties):

  voteA = β0 + β1 shareA + u

  where voteA is the percentage of the vote for candidate A and shareA is the percentage of campaign expenditures accounted for by candidate A.

- Fitted regression:

  predicted voteA = 26.81 + 0.464 shareA

- Slope: if candidate A's share of spending increases by one percentage point, he or she receives 0.464 percentage points more of the total vote.
- Causal interpretation?
Properties of OLS

- Fitted or predicted values: ŷi = β̂0 + β̂1 xi. Residuals (deviations from the regression line): ûi = yi − ŷi.
- Note the difference between the error ui (the unobserved deviation from the population line) and the residual ûi (the computable deviation from the fitted line).
- Algebraic properties of OLS regression (verified numerically in the sketch below):

  Σi ûi = 0        (deviations from the regression line sum to zero)
  Σi ûi xi = 0     (sample covariance between residuals and regressor is zero)
  ȳ = β̂0 + β̂1 x̄   (the sample averages of y and x lie on the regression line)

- For example, CEO number 12's salary was $526,023 lower than predicted using the information on his firm's return on equity.
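These three algebraic properties are easy to check numerically; here is a small sketch with simulated data (my own illustration, not from the slides):

```python
# Verify: residuals sum to zero, are orthogonal to x, and (xbar, ybar) is on the line.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)            # OLS residuals

print(u_hat.sum())                                 # ~ 0 (up to floating point)
print((u_hat * x).sum())                           # ~ 0
print(np.isclose(y.mean(), b0 + b1 * x.mean()))    # True
```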
How Well Does the Model Fit?

- Goodness-of-fit: how well does the explanatory variable explain the dependent variable?
- Measures of variation:

  SST = Σi (yi − ȳ)²   total sum of squares: the total variation in the dependent variable
  SSE = Σi (ŷi − ȳ)²   explained sum of squares: the variation explained by the regression
  SSR = Σi ûi²         residual sum of squares: the variation not explained by the regression
- Decomposition of total variation:

  SST = SSE + SSR
  (total variation = explained part + unexplained part)

- Goodness-of-fit measure (R-squared):

  R² = SSE/SST = 1 − SSR/SST

  R-squared measures the fraction of the total variation that is explained by the regression.

- Recall the purpose of OLS: it minimizes SSR. (A numerical check of the decomposition follows below.)
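A quick numerical check of the decomposition and of the two expressions for R² (simulated data, my own sketch):

```python
# Check SST = SSE + SSR and R^2 = SSE/SST = 1 - SSR/SST on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 2.0, size=200)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
sse = ((y_hat - y.mean()) ** 2).sum()  # explained sum of squares
ssr = (u_hat ** 2).sum()               # residual sum of squares

print(np.isclose(sst, sse + ssr))      # True
print(sse / sst, 1 - ssr / sst)        # the same R-squared both ways
```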


Goodness of fit

- CEO salary and return on equity: R² = 0.0132, i.e. the regression explains only 1.3% of the total variation in salaries.
- Voting outcomes and campaign expenditures: R² = 0.856, i.e. the regression explains 85.6% of the total variation in election outcomes.
- Caution: a high R-squared does not necessarily mean that the regression has a causal interpretation!
QUESTION

- 3. In the equation c = β0 + β1 i + u, c denotes consumption and i denotes income. What is the residual for the 5th observation if c5 = $500 and ĉ5 = $475?
  a. $975  b. $300  c. $25  d. $50

- 4. What does the equation ŷ = β̂0 + β̂1 x1 denote if the regression equation is y = β0 + β1 x1 + u?
  a. The explained sum of squares
  b. The total sum of squares
  c. The sample regression function
  d. The population regression function
- 5. Consider the following regression model: y = β0 + β1 x1 + u. Which of the following is a property of Ordinary Least Squares (OLS) estimates of this model and their associated statistics?
  a. The sum, and therefore the sample average of the OLS residuals, is positive.
  b. The sum of the OLS residuals is negative.
  c. The sample covariance between the regressors and the OLS residuals is positive.
  d. The point (x̄, ȳ) always lies on the OLS regression line.

- 6. The explained sum of squares for the regression function, yi = β0 + β1 xi + ui, is defined as _____.
  a. Σi (yi − ȳ)²
  b. Σi (ŷi − ȳ)²
  c. Σi ûi
  d. Σi (ûi)²
Incorporating nonlinearities

- Semi-logarithmic form: regression of log wages on years of education:

  log(wage) = β0 + β1 educ + u

  where log denotes the natural logarithm.

- This changes the interpretation of the coefficient:

  β1 = Δlog(wage)/Δeduc = (Δwage/wage)/Δeduc

  i.e. β1 is the (approximate) percentage change of wage if years of education are increased by one year.

- Fitted regression: the wage increases by 8.3% for every additional year of education (= the return to education).
- For example, at a wage of $10, one more year of education raises the wage by about $0.83:

  (Δwage/wage)/Δeduc = (0.83$/10$)/(1 year) = 0.083 = 8.3% per year of education
- Log-logarithmic form: CEO salary and firm sales:

  log(salary) = β0 + β1 log(sales) + u

  with the natural logarithm of CEO salary and of his/her firm's sales.

- This changes the interpretation of the coefficient:

  β1 = Δlog(salary)/Δlog(sales) = (Δsalary/salary)/(Δsales/sales)

  i.e. β1 is the percentage change of salary if sales increase by 1%. Logarithmic changes are (approximately) percentage changes.
- CEO salary and firm sales: in the fitted regression, a 1% increase in sales is associated with a 0.257% increase in salary.
- The log-log form postulates a constant elasticity model; the semi-log form assumes a semi-elasticity model.
Expected values and variances of the OLS estimators

- The estimated regression coefficients are random variables because they are calculated from a random sample: the data are random and depend on the particular sample that has been drawn.
- The question is what the estimators estimate on average and how large their variability in repeated samples is.
Standard Assumptions

- Assumption SLR.1 (Linear in parameters):

  y = β0 + β1 x + u

  In the population, the relationship between y and x is linear.

- Assumption SLR.2 (Random sampling):

  {(xi, yi): i = 1, ..., n}

  The data are a random sample drawn from the population. Each data point therefore follows the population equation:

  yi = β0 + β1 xi + ui
Discussion of random sampling

- Example: wage and education.
- The population consists, for example, of all workers of country A.
- In the population, a linear relationship between wages (or log wages) and years of education holds.
- Draw completely randomly a worker from the population.
- The wage and years of education of the worker drawn are random, because one does not know beforehand which worker is drawn.
- Throw the worker back into the population and repeat the random draw n times.
- The wages and years of education of the sampled workers are used to estimate the linear relationship between wages and education.
[Figure: the values (xi, yi) drawn for the i-th worker, and the implied deviation from the population relationship for the i-th worker: ui = yi − β0 − β1 xi. Notice the difference between the error ui and the residual ûi.]
Assumptions (Cont'd)

- Assumption SLR.3 (Sample variation in the explanatory variable):

  Σi (xi − x̄)² > 0

  The values of the xi are not all the same (otherwise it would be impossible to study how different values of x lead to different values of y).

- Assumption SLR.4 (Zero conditional mean):

  E(ui | xi) = 0

  The value of the explanatory variable must contain no information about the mean of the unobserved factors.
Unbiasedness of OLS

- Theorem 2.1 (Unbiasedness of OLS): under assumptions SLR.1–SLR.4,

  E(β̂0) = β0   and   E(β̂1) = β1

- Interpretation of unbiasedness:
  - The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw.
  - However, on average, they will be equal to the values that characterize the true relationship between y and x in the population.
  - "On average" means: if sampling were repeated, i.e. if drawing the random sample and doing the estimation were repeated many times (see the simulation sketch below).
  - In a given sample, estimates may differ considerably from the true values.
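Unbiasedness can be illustrated by simulation: draw many samples from a population satisfying SLR.1–SLR.4, estimate β1 in each, and average. A minimal sketch (my own, with made-up parameter values):

```python
# Monte Carlo illustration of unbiasedness: the average of the OLS slope
# estimates across many random samples is close to the true beta1.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 1.0, 2.0
n, reps = 50, 5000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 10.0, size=n)
    u = rng.normal(0.0, 3.0, size=n)      # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    estimates[r] = (((x - x.mean()) * (y - y.mean())).sum()
                    / ((x - x.mean()) ** 2).sum())

print(estimates.mean())   # close to 2.0
```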
- Variances of the OLS estimators:
  - Depending on the sample, the estimates will be nearer to or farther from the true population values.
  - How far can we expect our estimates to be from the true population values on average (= sampling variability)?
  - Sampling variability is measured by the estimators' variances.

- Assumption SLR.5 (Homoskedasticity):

  Var(ui | xi) = E(ui² | xi) = σ²

  The value of the explanatory variable must contain no information about the variability of the unobserved factors.
[Figure: graphical illustration of homoskedasticity. The variability of the unobserved influences does not depend on the value of the explanatory variable.]
[Figure: an example of heteroskedasticity, wage and education. The variance of the unobserved determinants of wages increases with the level of education.]
CAR CONSUMPTION VS INCOME OR WEALTH

[Photos: Mark Zuckerberg, Bill Gates, Trump.]
QUESTION

- 7. Which of the following is a nonlinear regression model?
  a. y = β0 + β1 x^(1/2) + u
  b. log y = β0 + β1 log x + u
  c. y = 1/(β0 + β1 x) + u
  d. y = β0 + β1 x + u

- 8. Which of the following is assumed for establishing the unbiasedness of Ordinary Least Squares (OLS) estimates?
  a. The error term has an expected value of 1 given any value of the explanatory variable.
  b. The regression equation is linear in the explained and explanatory variables.
  c. The sample is drawn randomly.
  d. The error term has the same variance given any value of the explanatory variable.
- Theorem 2.2 (Variances of the OLS estimators): under assumptions SLR.1–SLR.5,

  Var(β̂1) = σ² / Σi (xi − x̄)²   and   Var(β̂0) = σ² (n⁻¹ Σi xi²) / Σi (xi − x̄)²

- Conclusion: the sampling variability of the estimated regression coefficients is larger the greater the variability of the unobserved factors, and smaller the greater the variation in the explanatory variable. (A simulation check follows below.)
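The slope-variance formula can also be checked by simulation: with a fixed design, the sample variance of the slope estimates across replications should match σ²/Σi (xi − x̄)². A sketch under those assumptions (my own illustration):

```python
# Compare the simulated sampling variance of the OLS slope with the
# theoretical value sigma^2 / sum((x - xbar)^2), holding x fixed.
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma = 1.0, 2.0, 3.0
n, reps = 50, 20000
x = rng.uniform(0.0, 10.0, size=n)        # fixed across replications
sxx = ((x - x.mean()) ** 2).sum()

estimates = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    estimates[r] = ((x - x.mean()) * (y - y.mean())).sum() / sxx

print(estimates.var())        # simulated Var(b1)
print(sigma ** 2 / sxx)       # theoretical Var(b1)
```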
- Estimating the error variance:
  - The variance of u does not depend on x, i.e. it is equal to the unconditional variance:

    σ² = Var(u)

  - One could estimate the variance of the errors by calculating the variance of the residuals in the sample; unfortunately, this estimate would be biased.
  - An unbiased estimate of the error variance can be obtained by subtracting the number of estimated regression coefficients from the number of observations:

    σ̂² = SSR / (n − 2) = (1/(n − 2)) Σi ûi²
Unbiasedness of the error variance

- Theorem 2.3 (Unbiasedness of the error variance): under assumptions SLR.1–SLR.5,

  E(σ̂²) = σ²

- Calculation of standard errors for the regression coefficients: plug σ̂ in for the unknown σ, e.g.

  se(β̂1) = σ̂ / ( Σi (xi − x̄)² )^(1/2)

- The estimated standard deviations of the regression coefficients are called "standard errors". They measure how precisely the regression coefficients are estimated. (See the sketch below.)
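Putting the pieces together, the standard error of the slope combines σ̂² = SSR/(n − 2) with the variance formula above. A minimal sketch on simulated data (my own illustration):

```python
# Standard error of the OLS slope: se(b1) = sigma_hat / sqrt(sum((x - xbar)^2)),
# where sigma_hat^2 = SSR / (n - 2).
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 3.0, size=n)

sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

sigma2_hat = (u_hat ** 2).sum() / (n - 2)   # unbiased error-variance estimate
se_b1 = np.sqrt(sigma2_hat / sxx)
print(b1, se_b1)
```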
UNBIASEDNESS OF OLS (PROOF)

- We need to rewrite our estimator in terms of the population parameter.
- Think: what is unbiasedness?
- Start with a simple rewrite of the formula:

  β̂1 = Σi (xi − x̄) yi / sx²,   where sx² = Σi (xi − x̄)²

- The numerator:

  Σi (xi − x̄) yi = Σi (xi − x̄)(β0 + β1 xi + ui)
                 = β0 Σi (xi − x̄) + β1 Σi (xi − x̄) xi + Σi (xi − x̄) ui
UNBIASEDNESS OF OLS (CONT)

- Since Σi (xi − x̄) = 0 and Σi (xi − x̄) xi = Σi (xi − x̄)², we get

  β̂1 = β1 + Σi (xi − x̄) ui / sx²

- Then, using E(ui | xi) = 0,

  E(β̂1) = β1 + Σi (xi − x̄) E(ui | xi) / sx² = β1

- E(β̂0) = ?
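A sketch of the answer to the closing question, using β̂0 = ȳ − β̂1 x̄ and ȳ = β0 + β1 x̄ + ū:

```latex
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
            = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat\beta_1 \bar{x},
\qquad\text{so}\qquad
E[\hat\beta_0]
  = \beta_0 + \bigl(\beta_1 - E[\hat\beta_1]\bigr)\,\bar{x} + E[\bar{u}]
  = \beta_0 ,
```

since E(β̂1) = β1 (just shown) and E(ū) = 0 under SLR.4. Hence OLS is unbiased for the intercept as well.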
