
LECTURE 2

SIMPLE LINEAR REGRESSION MODEL

Xi Qu
Fall 2017
CASE: PRICING A DIAMOND RING

- Reference: S. Chu, "Diamond Ring Pricing Using Linear Regression," Journal of Statistics Education, v. 4, n. 3, 1996
- Source of data: full-page ad in the Straits Times, 2/29/1992
- Data: 48 ladies' rings of varying designs in 20K gold, each mounted with a single diamond stone. The diamond weights ranged from 0.12 to 0.35 carats (1 carat = 0.2 gram = 100 points), and the rings were priced between Singapore $223 and $1086.
- The price of diamond jewelry depends on the four C's: caratage, cut, color, and clarity of the diamond stone.
DATA (prices in Singapore $)

Obs Carats Price   Obs Carats Price
  1  0.17   355     25  0.17   353
  2  0.16   328     26  0.18   438
  3  0.17   350     27  0.17   318
  4  0.18   325     28  0.18   419
  5  0.25   642     29  0.17   346
  6  0.16   342     30  0.15   315
  7  0.15   322     31  0.17   350
  8  0.19   485     32  0.32   918
  9  0.21   483     33  0.32   919
 10  0.15   323     34  0.15   298
 11  0.18   462     35  0.16   339
 12  0.28   823     36  0.16   338
 13  0.16   336     37  0.23   595
 14  0.20   498     38  0.23   553
 15  0.23   595     39  0.17   345
 16  0.29   860     40  0.33   945
 17  0.12   223     41  0.25   655
 18  0.26   663     42  0.18   443
 19  0.25   750     43  0.25   678
 20  0.27   720     44  0.25   675
 21  0.18   468     45  0.15   287
 22  0.16   345     46  0.26   693
 23  0.17   352     47  0.15   316
 24  0.16   332     48  0.35  1086
https://www.bluenile.com/diamond-search?track=NavDiaSeaRD
FITTED REGRESSION

Y = -250.568 + 3671.397 X
    (18.11663)  (87.17442)

where Y is the price in Singapore $ and X is the diamond weight in carats (standard errors in parentheses).
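This fit can be checked numerically. Below is a minimal sketch in Python (assuming numpy and statsmodels are installed); the data arrays are transcribed from the table above, so the estimates should come out close to the fitted line reported here.

```python
# Reproduce the diamond-ring regression from the data table above.
import numpy as np
import statsmodels.api as sm

carats = np.array([
    0.17, 0.16, 0.17, 0.18, 0.25, 0.16, 0.15, 0.19, 0.21, 0.15, 0.18, 0.28,
    0.16, 0.20, 0.23, 0.29, 0.12, 0.26, 0.25, 0.27, 0.18, 0.16, 0.17, 0.16,
    0.17, 0.18, 0.17, 0.18, 0.17, 0.15, 0.17, 0.32, 0.32, 0.15, 0.16, 0.16,
    0.23, 0.23, 0.17, 0.33, 0.25, 0.18, 0.25, 0.25, 0.15, 0.26, 0.15, 0.35])
prices = np.array([
    355, 328, 350, 325, 642, 342, 322, 485, 483, 323, 462, 823,
    336, 498, 595, 860, 223, 663, 750, 720, 468, 345, 352, 332,
    353, 438, 318, 419, 346, 315, 350, 918, 919, 298, 339, 338,
    595, 553, 345, 945, 655, 443, 678, 675, 287, 693, 316, 1086])

X = sm.add_constant(carats)      # prepend a column of ones for the intercept
fit = sm.OLS(prices, X).fit()    # ordinary least squares
print(fit.params)                # [intercept, slope]
print(fit.bse)                   # standard errors
```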
The Simple Regression Model

- Definition of the simple linear regression model: explain variable y in terms of a single variable x:

  y = β0 + β1 x + u

- β0: intercept (constant term); β1: slope parameter (coefficient)
- y: dependent variable, explained variable, response variable, regressand, left-hand-side variable, ...
- x: independent variable, explanatory variable, control variable, regressor, covariate, right-hand-side variable, ...
- u: error term, disturbance, unobservables, ...
QUESTION

- 1. A dependent variable is also known as a(n) _____.
  a. explanatory variable
  b. control variable
  c. predictor variable
  d. response variable

- 2. In the equation y = β0 + β1 x + u, β0 is the _____.
  a. dependent variable
  b. independent variable
  c. slope parameter
  d. intercept parameter
Interpretation

- Interpretation of the simple linear regression model: it studies how y varies with changes in x. Ideally, we want to interpret it as

  Δy/Δx = β1   as long as   Δu/Δx = 0

- By how much does the dependent variable change if the independent variable is increased by one unit? This interpretation is only correct if all other things remain equal when the independent variable is increased by one unit (infeasible!).
- The simple linear regression model is rarely applicable in practice, but its discussion is useful for pedagogical reasons.
Examples

- Crop yield and fertilizer:

  yield = β0 + β1 fertilizer + u

  β1 measures the effect of fertilizer on yield, holding all other factors (rainfall, land quality, presence of parasites, ...) fixed; those factors are collected in u.

- A simple wage equation:

  wage = β0 + β1 educ + u

  β1 measures the change in hourly wage given another year of education, holding all other factors (labor force experience, tenure with current employer, work ethic, intelligence, ...) fixed.
Causal Interpretation

- When is there a causal interpretation? Under the conditional mean independence assumption:

  E(u | x) = 0

  The explanatory variable must not contain information about the mean of the unobserved factors.

- This is stronger than E(u) = 0 and cov(u, x) = 0.
- Example: the wage equation wage = β0 + β1 educ + u, where u includes e.g. intelligence. Here the conditional mean independence assumption is unlikely to hold, because individuals with more education will also be more intelligent on average.
Population Regression Function

- Population regression function (PRF): the conditional mean independence assumption implies that

  E(y | x) = E(β0 + β1 x + u | x) = β0 + β1 x + E(u | x) = β0 + β1 x

- This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable.
- Formally (compare to the ideal case above):

  ΔE(y | x)/Δx = β1   and   β0 = E(y | x = 0)
[Figure: the population regression function E(y | x) = β0 + β1 x. For individuals with x = x2, the average value of y is β0 + β1 x2.]
A random sample

- In order to estimate the regression model, one needs data.
- A random sample of n observations: {(xi, yi): i = 1, ..., n}, where yi is the value of the dependent variable and xi the value of the explanatory variable for the i-th observation.
- Fit a regression line through the data points as well as possible. The fitted regression line is

  ŷ = β̂0 + β̂1 x

  For example, the i-th data point (xi, yi) lies at some vertical distance from the line. Compare the fitted line to the PRF: the PRF is fixed but unknown, while the fitted line depends on the sample.

- "As good as possible" means: define the regression residuals

  ûi = yi − ŷi

  and minimize the sum of squared regression residuals, min Σi ûi².

- The solutions are the Ordinary Least Squares (OLS) estimates (see the numerical sketch below):

  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,   β̂0 = ȳ − β̂1 x̄
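As a minimal illustration of these formulas (my own sketch, not part of the original slides), the OLS estimates can be computed directly with numpy:

```python
# OLS slope and intercept from the closed-form formulas:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
import numpy as np

def ols_fit(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    xbar, ybar = x.mean(), y.mean()
    b1 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()
    b0 = ybar - b1 * xbar
    return b0, b1

# Tiny made-up example: y is roughly 1 + 2x plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
b0, b1 = ols_fit(x, y)
print(b0, b1)   # close to 1 and 2
```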
Example 1

- CEO salary and return on equity:

  salary = β0 + β1 roe + u

  where salary is in thousands of dollars and roe is the return on equity of the CEO's firm (in percent).

- Fitted regression:

  predicted salary = 963.191 + 18.501 roe

- Intercept: the predicted salary when roe = 0. Slope: if the return on equity increases by 1 percentage point, then salary is predicted to increase by 18.501 thousand dollars, i.e. $18,501.
- Causal interpretation?
[Figure: the fitted regression line (which depends on the sample) versus the unknown population regression line.]
Example 2

- Wage and education:

  wage = β0 + β1 educ + u

  where wage is the hourly wage in dollars and educ is years of education.

- Fitted regression:

  predicted wage = −0.90 + 0.54 educ

- Slope: in the sample, one more year of education was associated with an increase in hourly wage of $0.54.
- Causal interpretation?
Example 3

- Voting outcomes and campaign expenditures (two parties):

  voteA = β0 + β1 shareA + u

  where voteA is the percentage of the vote for candidate A and shareA is the percentage of campaign expenditures accounted for by candidate A.

- Fitted regression:

  predicted voteA = 26.81 + 0.464 shareA

- Slope: if candidate A's share of spending increases by one percentage point, he or she receives 0.464 percentage points more of the total vote.
- Causal interpretation?
Properties of OLS

- Fitted or predicted values: ŷi = β̂0 + β̂1 xi. Residuals (deviations from the regression line): ûi = yi − ŷi.
- Note the difference between the error ui (the unobserved deviation from the population line) and the residual ûi (the computable deviation from the fitted line).
- Algebraic properties of OLS regression (verified numerically in the sketch below):

  Σi ûi = 0        (deviations from the regression line sum to zero)
  Σi ûi xi = 0     (sample covariance between residuals and regressor is zero)
  ȳ = β̂0 + β̂1 x̄   (the sample averages of y and x lie on the regression line)

- For example, CEO number 12's salary was $526,023 lower than predicted using the information on his firm's return on equity.
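These three algebraic properties are easy to check numerically; here is a small sketch with simulated data (my own illustration, not from the slides):

```python
# Verify: residuals sum to zero, are orthogonal to x, and (xbar, ybar) is on the line.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)            # OLS residuals

print(u_hat.sum())                                 # ~ 0 (up to floating point)
print((u_hat * x).sum())                           # ~ 0
print(np.isclose(y.mean(), b0 + b1 * x.mean()))    # True
```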
How Well Does the Model Fit?

- Goodness-of-fit: how well does the explanatory variable explain the dependent variable?
- Measures of variation:

  SST = Σi (yi − ȳ)²   total sum of squares: the total variation in the dependent variable
  SSE = Σi (ŷi − ȳ)²   explained sum of squares: the variation explained by the regression
  SSR = Σi ûi²         residual sum of squares: the variation not explained by the regression
- Decomposition of total variation:

  SST = SSE + SSR
  (total variation = explained part + unexplained part)

- Goodness-of-fit measure (R-squared):

  R² = SSE/SST = 1 − SSR/SST

  R-squared measures the fraction of the total variation that is explained by the regression.

- Recall the purpose of OLS: it minimizes SSR. (A numerical check of the decomposition follows below.)
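A quick numerical check of the decomposition and of the two expressions for R² (simulated data, my own sketch):

```python
# Check SST = SSE + SSR and R^2 = SSE/SST = 1 - SSR/SST on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 2.0, size=200)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
sse = ((y_hat - y.mean()) ** 2).sum()  # explained sum of squares
ssr = (u_hat ** 2).sum()               # residual sum of squares

print(np.isclose(sst, sse + ssr))      # True
print(sse / sst, 1 - ssr / sst)        # the same R-squared both ways
```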


Goodness of fit

- CEO salary and return on equity: R² = 0.0132, i.e. the regression explains only 1.3% of the total variation in salaries.
- Voting outcomes and campaign expenditures: R² = 0.856, i.e. the regression explains 85.6% of the total variation in election outcomes.
- Caution: a high R-squared does not necessarily mean that the regression has a causal interpretation!
QUESTION

- 3. In the equation c = β0 + β1 i + u, c denotes consumption and i denotes income. What is the residual for the 5th observation if c5 = $500 and ĉ5 = $475?
  a. $975  b. $300  c. $25  d. $50

- 4. What does the equation ŷ = β̂0 + β̂1 x1 denote if the regression equation is y = β0 + β1 x1 + u?
  a. The explained sum of squares
  b. The total sum of squares
  c. The sample regression function
  d. The population regression function
- 5. Consider the following regression model: y = β0 + β1 x1 + u. Which of the following is a property of Ordinary Least Squares (OLS) estimates of this model and their associated statistics?
  a. The sum, and therefore the sample average of the OLS residuals, is positive.
  b. The sum of the OLS residuals is negative.
  c. The sample covariance between the regressors and the OLS residuals is positive.
  d. The point (x̄, ȳ) always lies on the OLS regression line.

- 6. The explained sum of squares for the regression function, yi = β0 + β1 xi + ui, is defined as _____.
  a. Σi (yi − ȳ)²
  b. Σi (ŷi − ȳ)²
  c. Σi ûi
  d. Σi (ûi)²
Incorporating nonlinearities

- Semi-logarithmic form: regression of log wages on years of education:

  log(wage) = β0 + β1 educ + u

  where log denotes the natural logarithm.

- This changes the interpretation of the coefficient:

  β1 = Δlog(wage)/Δeduc = (Δwage/wage)/Δeduc

  i.e. β1 is the (approximate) percentage change of wage if years of education are increased by one year.

- Fitted regression: the wage increases by 8.3% for every additional year of education (= the return to education).
- For example, at a wage of $10, one more year of education raises the wage by about $0.83:

  (Δwage/wage)/Δeduc = (0.83$/10$)/(1 year) = 0.083 = 8.3% per year of education
- Log-logarithmic form: CEO salary and firm sales:

  log(salary) = β0 + β1 log(sales) + u

  with the natural logarithm of CEO salary and of his/her firm's sales.

- This changes the interpretation of the coefficient:

  β1 = Δlog(salary)/Δlog(sales) = (Δsalary/salary)/(Δsales/sales)

  i.e. β1 is the percentage change of salary if sales increase by 1%. Logarithmic changes are (approximately) percentage changes.
- CEO salary and firm sales: in the fitted regression, a 1% increase in sales is associated with a 0.257% increase in salary.
- The log-log form postulates a constant elasticity model; the semi-log form assumes a semi-elasticity model.
Expected values and variances of the OLS estimators

- The estimated regression coefficients are random variables because they are calculated from a random sample: the data are random and depend on the particular sample that has been drawn.
- The question is what the estimators estimate on average and how large their variability in repeated samples is.
Standard Assumptions

- Assumption SLR.1 (Linear in parameters):

  y = β0 + β1 x + u

  In the population, the relationship between y and x is linear.

- Assumption SLR.2 (Random sampling):

  {(xi, yi): i = 1, ..., n}

  The data are a random sample drawn from the population. Each data point therefore follows the population equation:

  yi = β0 + β1 xi + ui
Discussion of random sampling

- Example: wage and education.
- The population consists, for example, of all workers of country A.
- In the population, a linear relationship between wages (or log wages) and years of education holds.
- Draw completely randomly a worker from the population.
- The wage and years of education of the worker drawn are random, because one does not know beforehand which worker is drawn.
- Throw the worker back into the population and repeat the random draw n times.
- The wages and years of education of the sampled workers are used to estimate the linear relationship between wages and education.
[Figure: the values (xi, yi) drawn for the i-th worker, and the implied deviation from the population relationship for the i-th worker: ui = yi − β0 − β1 xi. Notice the difference between the error ui and the residual ûi.]
Assumptions (Cont'd)

- Assumption SLR.3 (Sample variation in the explanatory variable):

  Σi (xi − x̄)² > 0

  The values of the xi are not all the same (otherwise it would be impossible to study how different values of x lead to different values of y).

- Assumption SLR.4 (Zero conditional mean):

  E(ui | xi) = 0

  The value of the explanatory variable must contain no information about the mean of the unobserved factors.
Unbiasedness of OLS

- Theorem 2.1 (Unbiasedness of OLS): under assumptions SLR.1–SLR.4,

  E(β̂0) = β0   and   E(β̂1) = β1

- Interpretation of unbiasedness:
  - The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw.
  - However, on average, they will be equal to the values that characterize the true relationship between y and x in the population.
  - "On average" means: if sampling were repeated, i.e. if drawing the random sample and doing the estimation were repeated many times (see the simulation sketch below).
  - In a given sample, estimates may differ considerably from the true values.
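Unbiasedness can be illustrated by simulation: draw many samples from a population satisfying SLR.1–SLR.4, estimate β1 in each, and average. A minimal sketch (my own, with made-up parameter values):

```python
# Monte Carlo illustration of unbiasedness: the average of the OLS slope
# estimates across many random samples is close to the true beta1.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 1.0, 2.0
n, reps = 50, 5000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 10.0, size=n)
    u = rng.normal(0.0, 3.0, size=n)      # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    estimates[r] = (((x - x.mean()) * (y - y.mean())).sum()
                    / ((x - x.mean()) ** 2).sum())

print(estimates.mean())   # close to 2.0
```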
- Variances of the OLS estimators:
  - Depending on the sample, the estimates will be nearer to or farther from the true population values.
  - How far can we expect our estimates to be from the true population values on average (= sampling variability)?
  - Sampling variability is measured by the estimators' variances.

- Assumption SLR.5 (Homoskedasticity):

  Var(ui | xi) = E(ui² | xi) = σ²

  The value of the explanatory variable must contain no information about the variability of the unobserved factors.
[Figure: graphical illustration of homoskedasticity. The variability of the unobserved influences does not depend on the value of the explanatory variable.]
[Figure: an example of heteroskedasticity, wage and education. The variance of the unobserved determinants of wages increases with the level of education.]
CAR CONSUMPTION VS INCOME OR WEALTH

[Photos: Mark Zuckerberg, Bill Gates, Trump.]
QUESTION

- 7. Which of the following is a nonlinear regression model?
  a. y = β0 + β1 x^(1/2) + u
  b. log y = β0 + β1 log x + u
  c. y = 1/(β0 + β1 x) + u
  d. y = β0 + β1 x + u

- 8. Which of the following is assumed for establishing the unbiasedness of Ordinary Least Squares (OLS) estimates?
  a. The error term has an expected value of 1 given any value of the explanatory variable.
  b. The regression equation is linear in the explained and explanatory variables.
  c. The sample is drawn randomly.
  d. The error term has the same variance given any value of the explanatory variable.
- Theorem 2.2 (Variances of the OLS estimators): under assumptions SLR.1–SLR.5,

  Var(β̂1) = σ² / Σi (xi − x̄)²   and   Var(β̂0) = σ² (n⁻¹ Σi xi²) / Σi (xi − x̄)²

- Conclusion: the sampling variability of the estimated regression coefficients is larger the greater the variability of the unobserved factors, and smaller the greater the variation in the explanatory variable. (A simulation check follows below.)
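The slope-variance formula can also be checked by simulation: with a fixed design, the sample variance of the slope estimates across replications should match σ²/Σi (xi − x̄)². A sketch under those assumptions (my own illustration):

```python
# Compare the simulated sampling variance of the OLS slope with the
# theoretical value sigma^2 / sum((x - xbar)^2), holding x fixed.
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma = 1.0, 2.0, 3.0
n, reps = 50, 20000
x = rng.uniform(0.0, 10.0, size=n)        # fixed across replications
sxx = ((x - x.mean()) ** 2).sum()

estimates = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    estimates[r] = ((x - x.mean()) * (y - y.mean())).sum() / sxx

print(estimates.var())        # simulated Var(b1)
print(sigma ** 2 / sxx)       # theoretical Var(b1)
```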
- Estimating the error variance:
  - The variance of u does not depend on x, i.e. it is equal to the unconditional variance:

    σ² = Var(u)

  - One could estimate the variance of the errors by calculating the variance of the residuals in the sample; unfortunately, this estimate would be biased.
  - An unbiased estimate of the error variance can be obtained by subtracting the number of estimated regression coefficients from the number of observations:

    σ̂² = SSR / (n − 2) = (1/(n − 2)) Σi ûi²
Unbiasedness of the error variance

- Theorem 2.3 (Unbiasedness of the error variance): under assumptions SLR.1–SLR.5,

  E(σ̂²) = σ²

- Calculation of standard errors for the regression coefficients: plug σ̂ in for the unknown σ, e.g.

  se(β̂1) = σ̂ / ( Σi (xi − x̄)² )^(1/2)

- The estimated standard deviations of the regression coefficients are called "standard errors". They measure how precisely the regression coefficients are estimated. (See the sketch below.)
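Putting the pieces together, the standard error of the slope combines σ̂² = SSR/(n − 2) with the variance formula above. A minimal sketch on simulated data (my own illustration):

```python
# Standard error of the OLS slope: se(b1) = sigma_hat / sqrt(sum((x - xbar)^2)),
# where sigma_hat^2 = SSR / (n - 2).
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 3.0, size=n)

sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

sigma2_hat = (u_hat ** 2).sum() / (n - 2)   # unbiased error-variance estimate
se_b1 = np.sqrt(sigma2_hat / sxx)
print(b1, se_b1)
```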
UNBIASEDNESS OF OLS (PROOF)

- We need to rewrite our estimator in terms of the population parameter.
- Think: what is unbiasedness?
- Start with a simple rewrite of the formula:

  β̂1 = Σi (xi − x̄) yi / sx²,   where sx² = Σi (xi − x̄)²

- The numerator:

  Σi (xi − x̄) yi = Σi (xi − x̄)(β0 + β1 xi + ui)
                 = β0 Σi (xi − x̄) + β1 Σi (xi − x̄) xi + Σi (xi − x̄) ui
UNBIASEDNESS OF OLS (CONT)

- Since Σi (xi − x̄) = 0 and Σi (xi − x̄) xi = Σi (xi − x̄)², we get

  β̂1 = β1 + Σi (xi − x̄) ui / sx²

- Then, using E(ui | xi) = 0,

  E(β̂1) = β1 + Σi (xi − x̄) E(ui | xi) / sx² = β1

- E(β̂0) = ?
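A sketch of the answer to the closing question, using β̂0 = ȳ − β̂1 x̄ and ȳ = β0 + β1 x̄ + ū:

```latex
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
            = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat\beta_1 \bar{x},
\qquad\text{so}\qquad
E[\hat\beta_0]
  = \beta_0 + \bigl(\beta_1 - E[\hat\beta_1]\bigr)\,\bar{x} + E[\bar{u}]
  = \beta_0 ,
```

since E(β̂1) = β1 (just shown) and E(ū) = 0 under SLR.4. Hence OLS is unbiased for the intercept as well.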
