Training Deck OLS

A professor wants to show his students the importance of the mid-term test. He believes that the higher the mid-term grade, the higher the final grade. A random sample of 15 students in his class was selected, with the data given below:

Mid-term grade   59  92  72  90  95  87  89  77  76  65  97  42  94  62  91
Final grade      65  84  77  80  77  81  80  84  80  69  83  40  78  65  90
A dependent variable is the variable to be predicted or explained in a regression model. This variable is assumed to be functionally related to the independent variable.
An independent variable is the variable related to the dependent variable in a regression equation. The independent variable is used in a regression model to estimate the value of the dependent variable.
Scatter plots
[Figure: scatter diagram of the final grades of 15 students, with mid-term grades on the x-axis and final grades on the y-axis.]
A scatter plot, also referred to as a scatter diagram, is a graph used to represent the relationship between the independent and dependent variables.
The dependent variable is plotted on the y-axis and the independent variable is plotted on the x-axis.
Two Variable Relationships
[Figure: five scatter plot panels]
(a) Linear (positive slope)
(b) Linear (negative slope)
(c) Curvilinear (negative slope)
(d) Curvilinear (positive slope)
(e) No relationship
Simple Linear Regression Analysis
Simple linear regression analysis analyzes the linear relationship that exists between a dependent variable and a single independent variable.
[Figure: four scatter plots, each shown with its line of best fit (least squares line): positive linear relationship, negative linear relationship, relationship not linear, and no relationship.]
Equation for a straight line
\[ Y = a + bX \]

Here Y is the dependent variable, X is the independent variable, a is the y-intercept and b is the slope of the line.

The y-intercept (a) is the value of the dependent variable (Y) when the value of the independent variable (X) is zero. It is the point at which the line cuts the y-axis.

The slope (b) is the change in the dependent variable for a unit increase in the independent variable. It is the tangent of the angle the line makes with the x-axis.
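As a minimal sketch of these two definitions, the Python snippet below evaluates a straight line for the values a = 10 and b = 2, which are chosen purely for illustration. It checks that the intercept is the value of Y at X = 0, that a one-unit increase in X changes Y by b, and that b equals the tangent of the angle the line makes with the x-axis.

```python
import math

def line(x, a=10.0, b=2.0):
    """Evaluate Y = a + b*X; a and b are illustrative values only."""
    return a + b * x

a, b = 10.0, 2.0
intercept = line(0)                # value of Y when X = 0
unit_change = line(6) - line(5)    # change in Y for a one-unit increase in X
angle = math.atan(b)               # angle (in radians) between the line and the x-axis

print(intercept)          # 10.0
print(unit_change)        # 2.0
print(math.tan(angle))    # approximately 2.0
```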
Slope and y-intercept of a straight line
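[Figure: the line \( Y = a + bX \) drawn on x- and y-axes. It cuts the y-axis at the y-intercept \( a \), where \( X = 0 \), and has slope \( b = \tan\theta \), where \( \theta \) is the angle the line makes with the x-axis.]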
Simple Linear Regression: Example
You wish to examine the linear dependency of the annual sales of grocery stores on their sizes in square footage. Sample data for 7 stores were obtained.
Store                  1      2      3      4      5      6      7
Square feet            1,726  1,542  2,816  5,555  1,292  2,208  1,313
Annual sales ($1000)   3,681  3,395  6,653  9,543  3,318  5,563  3,760
Scatter plot
[Figure: scatter plot of annual sales ($1000) on the y-axis against square feet on the x-axis for the 7 stores.]
Equation of the regression line
\[ \hat{Y} = a + bX \]
(\( \hat{Y} \) is pronounced "Y hat".)
Slope and y-intercept of the regression line
Slope:
\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \]
Y-intercept:
\[ a = \bar{Y} - b\bar{X} \]
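The slope and intercept formulas can be sketched in a few lines of Python. The function name `least_squares` is a hypothetical helper, not something from the deck. The data are the seven grocery stores from the example. With them the result should come out close to the coefficients quoted later in the deck (about 1636.415 and 1.487).

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # b = (sum XY - n * X-bar * Y-bar) / (sum X^2 - n * X-bar^2)
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
    # a = Y-bar - b * X-bar
    a = y_bar - b * x_bar
    return a, b

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

a, b = least_squares(square_feet, annual_sales)
print(round(a, 3), round(b, 3))
```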
The least squares criterion is used for determining a regression line that minimizes the sum of squared residuals.
Least Square Criterion

What is the intuition behind the least squares formula for b?
We know that the fitted line goes through the point of means \( (\bar{X}, \bar{Y}) \), so we only have to pick the best line from all the possible lines rotating through that point.
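The "rotating lines" intuition can be tried numerically. The sketch below, using the grocery store data, forces every candidate line through the point of means and searches an arbitrary grid of slopes for the one with the smallest sum of squared residuals. The grid (1.000 to 2.000 in steps of 0.001) and the helper name `sse_through_means` are assumptions made for illustration.

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

def sse_through_means(slope):
    """Sum of squared residuals for the line with this slope through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(square_feet, annual_sales))

# Candidate slopes on an illustrative grid; each one is a different "rotation"
candidates = [1.0 + i / 1000 for i in range(1001)]
best_slope = min(candidates, key=sse_through_means)

print(best_slope)   # the grid point closest to the least squares slope
```

On this grid the winning slope should agree with the least squares slope to three decimals, which is the sense in which the formula for b picks the best rotation.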
Interpretation of Results: Example
\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units. Because sales are recorded in thousands of dollars, the equation estimates that for each increase of 1 square foot in the size of the store, expected annual sales increase by $1,487.
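A short sketch of how the fitted equation is used for prediction follows. The store size of 2,000 square feet is a hypothetical input chosen for illustration, and `predict_sales` is an illustrative helper name.

```python
def predict_sales(square_feet, a=1636.415, b=1.487):
    """Predicted annual sales in $1000 from the fitted line Y-hat = a + b*X."""
    return a + b * square_feet

base = predict_sales(2000)      # hypothetical 2,000 sq ft store
bigger = predict_sales(2001)    # the same store with one extra square foot

print(round(base, 3))                   # 4610.415, i.e. about $4.61 million
print(round((bigger - base) * 1000))    # 1487 dollars per extra square foot
```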
A residual is the difference between the actual value of the dependent variable and the value predicted by the regression model.
\[ \text{Residual} = y - \hat{y} \]
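[Figure: sales in thousands (y) plotted against years with company (x), with the fitted regression line. For the highlighted observation the actual value is 312 and the value predicted by the line is 390, so Residual = 312 - 390 = -78.]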
Standard Error of Estimate
\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \]
The standard error of estimate measures the reliability of the estimating equation that is developed.
It measures the variation, or scatter, of the observed values around the regression line.
Standard error of estimate

Fitted line: \( \hat{Y} = 1636.415 + 1.487X \)

Store   Square feet (X)   Annual sales, $1000 (Y)   Y-hat      Y - Y-hat   (Y - Y-hat)^2
1       1,726             3,681                     4202.977   -522        272460
2       1,542             3,395                     3929.369   -534        285550.2
3       2,816             6,653                     5823.807   829         687561
4       5,555             9,543                     9896.7     -354        125103.7
5       1,292             3,318                     3557.619   -240        57417.27
6       2,208             5,563                     4919.711   643         413820.7
7       1,313             3,760                     3588.846   171         29293.69
Sum                                                                        1871207

\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{1871207}{7 - 2}} = \sqrt{\frac{1871207}{5}} = 611.7527 \]

Here \( n - 2 = n - k - 1 \) with \( k = 1 \) independent variable.
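The table calculation can be reproduced with the sketch below. It assumes the rounded coefficients 1636.415 and 1.487 that the deck uses for the fitted values, so the result matches the table rather than an unrounded least squares fit.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000
a, b = 1636.415, 1.487    # rounded coefficients, as used in the deck's table

fitted = [a + b * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(annual_sales, fitted)]
sse = sum(e ** 2 for e in residuals)      # sum of squared residuals

n = len(annual_sales)
s_e = math.sqrt(sse / (n - 2))            # standard error of estimate

print(round(sse), round(s_e, 2))
```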

Regression line: \( \hat{Y} = 10 + 2X \)

X    Y     Y-hat   Y - Y-hat
10   30    30      0
20   50    50      0
30   70    70      0
40   90    90      0
50   110   110     0

Hence \( s_e = 0 \).

[Figure: the five points plotted with the X variable on the x-axis and the Y variable on the y-axis.]

For a perfectly fitting regression line, the standard error of estimate is equal to zero. The closer \( s_e \) is to 0, the better the reliability of the regression line.

Another way of checking the reliability of the regression line is to find the standard deviation of the Y variable, \( \sigma(y) \):
- If \( s_e < \sigma(y) \), the regression line is a reliable estimate of the data.
- If \( s_e > \sigma(y) \), the regression line is not a reliable estimate of the data.
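Both checks can be sketched as follows. The perfect-fit data are the five points on the line 10 + 2X, and the comparison uses the grocery example with the deck's rounded coefficients. The deck does not say whether the standard deviation of Y is the population or the sample version, so the use of `statistics.pstdev` here is an assumption. The helper name `standard_error` is also illustrative.

```python
import math
import statistics

def standard_error(ys, fitted, k=1):
    """Standard error of estimate with k independent variables."""
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return math.sqrt(sse / (len(ys) - k - 1))

# Perfect fit: every point lies exactly on Y-hat = 10 + 2X
xs = [10, 20, 30, 40, 50]
ys = [30, 50, 70, 90, 110]
perfect_se = standard_error(ys, [10 + 2 * x for x in xs])
print(perfect_se)           # 0.0

# Grocery example: compare s_e with the spread of Y itself
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
s_e = standard_error(annual_sales, [1636.415 + 1.487 * x for x in square_feet])
sigma_y = statistics.pstdev(annual_sales)   # assumption: population standard deviation
print(s_e < sigma_y)        # True
```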
Measures of Variation: The Sum of Squares
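\[ SST = \sum (Y_i - \bar{Y})^2 \]
\[ SSE = \sum (Y_i - \hat{Y}_i)^2 \]
\[ SSR = \sum (\hat{Y}_i - \bar{Y})^2 \]
[Figure: the regression line plotted against X, with the deviations behind SST, SSE and SSR marked for a single observation at \( X_i \) relative to the mean \( \bar{Y} \).]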

SST = Total Sum of Squares: measures the variation of the \( Y_i \) values around their mean, \( \bar{Y} \).
SSR = Regression Sum of Squares: explained variation attributable to the relationship between X and Y.
SSE = Error Sum of Squares: variation attributable to factors other than the relationship between X and Y.
\[ SST = SSR + SSE \]
Total sample variability = explained variability + unexplained variability
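The decomposition can be checked numerically. It holds exactly only for the unrounded least squares coefficients, so this sketch refits the line to the grocery data instead of using the rounded values 1636.415 and 1.487. Variable names are illustrative.

```python
import math

xs = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
ys = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Unrounded least squares slope and intercept
b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
    (sum(x * x for x in xs) - n * x_bar ** 2)
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained variation
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation

print(math.isclose(sst, ssr + sse, rel_tol=1e-9))      # True
```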
Coefficient of Determination
Sample coefficient of determination:
\[ r^2 = \frac{SSR}{SST} \quad \text{or} \quad r^2 = 1 - \frac{SSE}{SST} \]
In terms of the observed and fitted values, the sample coefficient of determination is
\[ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]
The population coefficient of determination is denoted as \( R^2 \).
The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by the regression line. It is also called r-squared and is denoted as \( r^2 \).
X       Y       Y-hat      Y - Y-hat   (Y - Y-hat)^2   Y - Y-bar   (Y - Y-bar)^2
1,726   3,681   4202.977   -522        272460          -1,449      2100843
1,542   3,395   3929.369   -534        285550.2        -1,735      3011712
2,816   6,653   5823.807   829         687561          1,523       2318224
5,555   9,543   9896.7     -354        125103.7        4,413       19470787
1,292   3,318   3557.619   -240        57417.27        -1,812      3284897
2,208   5,563   4919.711   643         413820.7        433         187118
1,313   3,760   3588.846   171         29293.69        -1,370      1878074
Sum                                    1871207                     32251656

Mean of Y: \( \bar{Y} = 5{,}130 \)

\[ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{1871207}{32251656} = 0.9419811 \]

Conclusion: 94% of the variation in annual sales (the Y variable) is explained by the size of the stores (the X variable), and the remaining 6% is explained by other external factors.
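The same calculation as a short sketch, again assuming the deck's rounded coefficients for the fitted values:

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

y_bar = sum(annual_sales) / len(annual_sales)
fitted = [1636.415 + 1.487 * x for x in square_feet]   # rounded coefficients from the deck

sse = sum((y - f) ** 2 for y, f in zip(annual_sales, fitted))   # unexplained variation
sst = sum((y - y_bar) ** 2 for y in annual_sales)               # total variation

r_squared = 1 - sse / sst
print(round(r_squared, 3))   # 0.942
```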
Multiple regression
Estimating equation describing the relationship among three variables:
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 \]
Multiple regression estimating equation when there are k independent variables:
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k \]
Standard error of estimate when there are 2 independent variables:
\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 3}} \]
Standard error of estimate when there are k independent variables:
\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - (k + 1)}} \]
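As a rough sketch of the two-variable case, the code below fits \( \hat{Y} = a + b_1 X_1 + b_2 X_2 \) by solving the normal equations in plain Python and then applies the \( n - (k + 1) \) divisor. The data are invented for illustration and do not come from the deck. The helper names `fit_two_predictors` and `standard_error` are hypothetical, and in practice a statistics package would do this step.

```python
import math

def fit_two_predictors(x1, x2, y):
    """Least squares fit of Y-hat = a + b1*X1 + b2*X2 via the normal equations."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Augmented 3x4 system (X'X | X'y)
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]   # [a, b1, b2]

def standard_error(y, fitted, k):
    """s_e = sqrt(sum (Y - Y-hat)^2 / (n - (k + 1)))."""
    sse = sum((t - f) ** 2 for t, f in zip(y, fitted))
    return math.sqrt(sse / (len(y) - (k + 1)))

# Hypothetical data, invented for illustration only
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9, 8, 19, 18, 29, 27]

a, b1, b2 = fit_two_predictors(x1, x2, y)
fitted = [a + b1 * u + b2 * v for u, v in zip(x1, x2)]
s_e = standard_error(y, fitted, k=2)    # divisor is n - 3 = 3
print(round(s_e, 3))
```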
F-Test
Testing how good the model is at predicting y by conducting individual t-tests on each of the \( \beta \)s is not a good idea. Instead, we use a global test that encompasses all the \( \beta \)s and tests the following overall hypothesis:
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad H_1: \text{at least one } \beta_i \neq 0 \]
The test statistic for this hypothesis is called the F statistic and is calculated as:
\[ F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \]
The F statistic is the ratio of the explained variability (as reflected by \( R^2 \)) to the unexplained variability (as reflected by \( 1 - R^2 \)), each divided by the corresponding degrees of freedom. The larger the F statistic, the more useful the model.
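A minimal sketch of the F statistic follows, assuming the usual degrees of freedom: k for the explained part and n - k - 1 for the unexplained part. It is applied to the grocery store example, where the coefficient of determination is 0.9419811 with n = 7 stores and k = 1 independent variable. The function name `f_statistic` is illustrative.

```python
def f_statistic(r_squared, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    explained = r_squared / k                      # explained variability per degree of freedom
    unexplained = (1 - r_squared) / (n - k - 1)    # unexplained variability per degree of freedom
    return explained / unexplained

f_value = f_statistic(0.9419811, n=7, k=1)
print(round(f_value, 1))
```

The resulting value would then be compared with a critical value of the F distribution with k and n - k - 1 degrees of freedom.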
Assumptions underlying classical linear regression model

1. The regression model is linear in the parameters
2. X values are fixed in repeated sampling
3. Mean value of the disturbance is zero
4. Homoscedasticity, or equal variance of the disturbances
5. No autocorrelation between the disturbances
6. Zero covariance between the disturbance term and the X values
7. The number of observations must be greater than the number of parameters to be estimated
8. Variability in X values
9. The regression model is correctly specified
10. There is no perfect multicollinearity
Types of Regression Models

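Regression models are classified by the number of explanatory variables:
- 1 explanatory variable: simple regression (linear or non-linear)
- 2 or more explanatory variables: multiple regression (linear or non-linear)
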
Different estimation techniques

- Enter method
- Forward inclusion
- Backward elimination
- Stepwise
- Hierarchical
Standard Regression Procedure

1. Define the research problem
   - Select the dependent variable
   - Select the independent variables
2. Do the variables meet the assumptions of normality, linearity, homoscedasticity and independence of error terms?
   - NO: create additional variables
     - Transformations to meet assumptions
     - Dummy variables for non-metric data
     - Interaction terms for moderation effects
3. Is there multicollinearity?
   - YES: use Principal Component Analysis or Factor Analysis
   - NO: continue
4. Divide the sample into development and validation sets
5. Remove outliers
6. Select an estimation technique
7. Does the regression variate meet the assumptions of regression analysis?
   - NO: return to the point marked (1) in the original flowchart
   - YES: continue
8. Examine statistical and practical significance
   - Model fit
   - Adjusted R2
   - Standard error of estimate
   - Statistical significance of beta coefficients
9. Validate the model
Software packages
In SPSS, select Analyze, Regression, Linear; select your dependent and independent variables; click Statistics; select Estimates, Confidence Intervals, Model Fit; click Continue; click OK.

In Excel, select Tools, Data Analysis, Regression; select your dependent and independent variables; click OK.

In SAS, proc reg is used to run a regression. A sample code:

    /* Stepwise regression of A30_3 on the B90_* variables, weighted by the variable "weight" */
    proc reg data=libraryname.datasetname;
        model A30_3 = B90_1 B90_4 B90_6 B90_11 B90_12 B90_14 B90_15 B90_17 B90_19
              / selection=stepwise slentry=0.05 slstay=0.05 details;
        weight weight;
        /* save the parameter estimates to the dataset "est" */
        ods output ParameterEstimates=est;
    run;
    quit;
Thank You