for mid-term, higher the final grade. A random sample of 15 students in his class was selected with the data given below:
65 84 77 80 77 81 80 84 80 69 83 40 78 65 90
A dependent variable is the variable to be predicted or explained in a regression model. This variable is assumed to be functionally related to the independent variable.
An independent variable is the variable related to the dependent variable in a regression equation. The independent variable is used in a regression model to estimate the value of the dependent variable.
Scatter plots
[Figure: Scatter diagram of final grades of 15 students. x-axis: mid-term grades; y-axis: final grades.]
A scatter plot, also referred to as a scatter diagram, is a graph that may be used to represent the relationship between the independent and dependent variables.
The dependent variable is plotted on the y-axis and the independent variable is plotted on the x-axis.
[Figure, panel (e): No relationship]
Simple linear regression analysis analyzes the linear relationship that exists between a dependent variable and a single independent variable.
\[ Y = a + bX \]
where Y is the dependent variable, X is the independent variable, a is the Y-intercept, and b is the slope of the line.
The Y-intercept (a) is the value of the dependent variable (Y) when the value of the independent variable (X) is zero. It is the point at which the line cuts the y-axis. The slope (b) is the change in the dependent variable for a unit increase in the independent variable. It is the tangent of the angle made by the line with the x-axis.
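[Figure: regression line drawn against the x-axis and y-axis, showing the y-intercept a at X = 0 and the slope b = tan θ, where θ is the angle between the line and the x-axis.]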
You wish to examine the linear dependency of the annual sales of grocery stores on their sizes in square footage. Sample data for 7 stores were obtained.
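Store                    1      2      3      4      5      6      7
Square feet (X)          1,726  1,542  2,816  5,555  1,292  2,208  1,313
Annual sales ($000) (Y)  3,681  3,395  6,653  9,543  3,318  5,563  3,760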
[Figure: Scatter plot of annual sales ($000, y-axis) against store size in square feet (x-axis).]
\[ \hat{Y} = a + bX \]
(\(\hat{Y}\) is pronounced "Y hat")
Slope:
\[ b = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \]
Y-intercept:
\[ a = \bar{Y} - b\bar{X} \]
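As a rough sketch, the slope and intercept can be computed directly from the sums, with b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²) and a = Ȳ − b·X̄. The Python below is illustrative only: the function name `least_squares` is an assumption of this sketch, and the X and Y values are the seven-store square footage and annual sales ($000) tabulated later in this deck.

```python
# Seven-store sample: X = size in square feet, Y = annual sales ($000).
X = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
Y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def least_squares(x, y):
    """Return (a, b) for the fitted line y_hat = a + b * x.

    Illustrative helper (not from the original deck) that applies
    b = (sum(xy) - n*x_bar*y_bar) / (sum(x^2) - n*x_bar^2) and a = y_bar - b*x_bar.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(X, Y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

On this sample the sketch should land close to the intercept of 1636.415 and slope of 1.487 reported for the fitted line.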
The least squares criterion is used for determining a regression line that minimizes the sum of squared residuals.
We know that the fitted line goes through the point of means \((\bar{X}, \bar{Y})\), so we now have to pick the best line from all possible lines rotating through that point.
\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units. Because annual sales are measured in thousands of dollars, the equation estimates that for each increase of 1 square foot in the size of the store, the expected annual sales increase by $1,487.
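The interpretation of the slope can be checked with a small sketch that evaluates the fitted line at two sizes one square foot apart. The 2,000 square foot store is a hypothetical input chosen for illustration, and `predict_sales` is a name assumed here, not something from the deck.

```python
# Rounded coefficients of the fitted line (annual sales in $000).
A = 1636.415  # Y-intercept
B = 1.487     # slope: change in sales ($000) per extra square foot


def predict_sales(square_feet):
    """Predicted annual sales in $000 for a store of the given size."""
    return A + B * square_feet


# Hypothetical store sizes, one square foot apart.
sales_2000 = predict_sales(2000)
sales_2001 = predict_sales(2001)

# Convert the difference from $000 to dollars.
increase_dollars = (sales_2001 - sales_2000) * 1000
print(f"Predicted sales at 2,000 sq ft: {sales_2000:.3f} ($000)")
print(f"Increase for one extra sq ft: about ${increase_dollars:,.0f}")
```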
A residual is the difference between the actual value of the dependent variable and the value predicted by the regression model.
Residual \(= y - \hat{y}\)
[Figure: a residual illustrated on a plot of sales in thousands (y-axis) against x.]
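A minimal sketch of the residual calculation for the seven-store sample, assuming the rounded coefficients 1636.415 and 1.487 of the fitted line. The variable names are illustrative.

```python
# Seven-store sample: X = size in square feet, Y = annual sales ($000).
X = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
Y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Rounded coefficients of the fitted line y_hat = A + B * x.
A, B = 1636.415, 1.487

# Predicted value for each store, then residual = actual - predicted.
y_hat = [A + B * x for x in X]
residuals = [y - yh for y, yh in zip(Y, y_hat)]

for store, (y, yh, e) in enumerate(zip(Y, y_hat, residuals), start=1):
    print(f"store {store}: actual {y}, predicted {yh:.1f}, residual {e:.0f}")
```

A negative residual means the line over-predicts sales for that store; a positive one means it under-predicts.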
\[ s_e = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - 2}} \]
Standard error of estimate measures the reliability of the estimating equation that is developed.
Standard error of estimate measures the variation or scatter of the observed values around the regression line.
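No.    X      Y      \(Y - \hat{Y}\)   \((Y - \hat{Y})^2\)
1      1,726  3,681  -522              272,460
2      1,542  3,395  -534              285,550.2
3      2,816  6,653   829              687,561
4      5,555  9,543  -354              125,103.7
5      1,292  3,318  -240              57,417.27
6      2,208  5,563   643              413,820.7
7      1,313  3,760   171              29,293.69
Total                                  1,871,207

With n = 7 observations and k = 1 independent variable:
\[ s_e = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - k - 1}} = \sqrt{\dfrac{1871207}{7 - 2}} = \sqrt{\dfrac{1871207}{5}} = 611.7527 \]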
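The standard error calculation can be sketched in a few lines, assuming the rounded coefficients 1636.415 and 1.487 of the fitted line. The function name and its `k` parameter (number of independent variables) are illustrative choices rather than anything defined in the deck.

```python
import math

# Seven-store sample: X = size in square feet, Y = annual sales ($000).
X = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
Y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
A, B = 1636.415, 1.487  # rounded coefficients of the fitted line


def standard_error_of_estimate(y, y_hat, k=1):
    """sqrt(SSE / (n - k - 1)), where k is the number of independent variables."""
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(sse / (len(y) - k - 1))


y_hat = [A + B * x for x in X]
se = standard_error_of_estimate(Y, y_hat)
print(f"standard error of estimate = {se:.4f}")
```

With k = 1 the divisor reduces to n − 2, which is the simple regression case.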
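[Figure: Y variable plotted against X, showing for an observation at \(X_i\) the deviation of \(Y_i\) from the mean \(\bar{Y}\) and from the fitted value \(\hat{Y}_i\).]
Total sum of squares: \(SST = \sum (Y_i - \bar{Y})^2\)
Error sum of squares: \(SSE = \sum (Y_i - \hat{Y}_i)^2\)
\[ SST = SSR + SSE \]
where \(SSR = \sum (\hat{Y}_i - \bar{Y})^2\) is the regression (explained) sum of squares.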
Coefficient of Determination
Sample coefficient of determination:
\[ r^2 = \dfrac{SSR}{SST} \]
or
\[ r^2 = 1 - \dfrac{SSE}{SST} \]
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by the regression line. The coefficient of determination is also called r-squared and is denoted as \(r^2\).
4,413 19470787
\[ r^2 = 1 - \dfrac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \dfrac{1871207}{32251656} \approx 0.94 \]
Conclusion: 94% of the variation in annual sales (the Y variable) is explained by the size of the stores (the X variable); the remaining 6% is due to other factors not included in the model.
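As a sketch of why the two expressions for the coefficient of determination agree, the snippet below fits the line to the seven-store sample, splits the total variation into explained and unexplained parts, and computes r-squared both ways. It uses the deviation form of the slope, which is algebraically equivalent to the computational formula; all names are illustrative.

```python
# Seven-store sample: X = size in square feet, Y = annual sales ($000).
X = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
Y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Least squares slope (deviation form) and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar
y_hat = [a + b * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))   # unexplained variation

r2_explained = ssr / sst
r2_unexplained = 1 - sse / sst
print(f"SST = {sst:.0f}, SSR + SSE = {ssr + sse:.0f}")
print(f"r^2 = {r2_explained:.4f} (SSR/SST), {r2_unexplained:.4f} (1 - SSE/SST)")
```

Because this sketch uses unrounded coefficients, its SSE can differ slightly from a total computed with the rounded values 1636.415 and 1.487.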
Multiple regression
Estimating equation describing the relationship among three variables:
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 \]
Multiple regression estimating equation when there are k independent variables:
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k \]
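One way to estimate the coefficients of a multiple regression equation is to solve the least squares normal equations. The sketch below does this in plain Python with Gauss-Jordan elimination; it is an illustration under stated assumptions, not the procedure used in the deck. The data are hypothetical and generated exactly from Y = 1 + 2·X1 + 3·X2 so that the recovered coefficients can be checked, and both function names are assumptions.

```python
def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, y):
    """Return [a, b1, ..., bk] from the normal equations (X'X) beta = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept a
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: each row is (X1, X2); Y is built from known coefficients.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

a, b1, b2 = fit_multiple(rows, y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

In practice a statistics package would be used instead; the point here is only that the k + 1 coefficients come from one linear system.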
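Standard error of estimate with two independent variables:
\[ s_e = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - 3}} \]
Standard error of estimate with k independent variables:
\[ s_e = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - (k + 1)}} \]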
F-Test
Testing how well the model predicts y by conducting individual t-tests on each of the βs is not a good idea. Instead, we use a global test that encompasses all the βs and tests the following overall hypothesis:
\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]
\[ H_1: \text{at least one } \beta_i \neq 0 \]
The test statistic for this hypothesis is called the F statistic and is calculated as:
\[ F = \dfrac{R^2 / k}{(1 - R^2) / (n - k - 1)} \]
The F statistic is the ratio of the explained variability (as reflected by \(R^2\)) and the unexplained variability (as reflected by \(1 - R^2\)), each divided by the corresponding degrees of freedom. The larger the F statistic, the more useful the model.
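A minimal sketch of the F statistic as described: explained variability divided by k, over unexplained variability divided by n − k − 1. It is applied here to the grocery-store example, taking the sums 1871207 and 32251656 from the coefficient of determination calculation with n = 7 stores and k = 1 independent variable; the function name is an assumption.

```python
def f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Grocery-store example: unexplained and total variation from the deck.
r2 = 1 - 1871207 / 32251656

f_value = f_statistic(r2, n=7, k=1)
print(f"R^2 = {r2:.4f}, F = {f_value:.2f}")
```

Judging whether the value is large enough still requires comparing it with a critical value of the F distribution with k and n − k − 1 degrees of freedom, which this sketch does not do.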
8. Variability in X values
9. The regression model is correctly specified
10. There is no perfect multicollinearity
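For a model with two independent variables, one simple way to screen for multicollinearity is to look at the correlation between them. The sketch below is an assumption-laden illustration: the data are hypothetical, and the variance inflation factor 1 / (1 − r²) is a common diagnostic brought in here for illustration, not something defined in the deck.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def vif_two_predictors(r):
    """Variance inflation factor for a two-predictor model: 1 / (1 - r^2)."""
    if abs(1 - r * r) < 1e-12:
        return math.inf  # perfect multicollinearity: coefficients not estimable
    return 1 / (1 - r * r)


# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5]
x2_ok = [2, 1, 4, 3, 6]            # related to x1, but not perfectly
x2_perfect = [2 * v for v in x1]   # exact multiple of x1

r_ok = pearson(x1, x2_ok)
r_perfect = pearson(x1, x2_perfect)
print(f"r = {r_ok:.3f}, VIF = {vif_two_predictors(r_ok):.2f}")
print(f"r = {r_perfect:.3f}, VIF = {vif_two_predictors(r_perfect)}")
```

With more than two predictors a pairwise correlation can miss a linear dependence among several variables, so this check is only a first look.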
[Diagram: types of regression model. Simple regression may be linear or nonlinear; multiple regression (2 or more explanatory variables) may be linear or nonlinear.]
[Flowchart: building a regression model]
Do the variables meet the assumptions of normality, linearity, homoscedasticity, and independence of error terms?
NO: Create additional variables
- Transformations to meet assumptions
- Dummy variables for non-metric data
- Interaction terms for moderation effects
Is there multicollinearity?
YES: Principal Component Analysis or Factor Analysis
NO: Remove outliers
Examine statistical and practical significance
- Model fit
- Adjusted \(R^2\)
- Standard error of estimate
- Statistical significance of beta coefficients
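The "create additional variables" step can be sketched as follows: a dummy variable turns a non-metric category into 0/1 columns, and an interaction term is the product of two columns. The location categories and the choice of "rural" as the baseline are hypothetical, as are the function names.

```python
def dummy_encode(values, baseline):
    """Return one 0/1 column per category, omitting the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def interaction(col_a, col_b):
    """Element-wise product of two columns, used as an interaction term."""
    return [a * b for a, b in zip(col_a, col_b)]


# Hypothetical non-metric variable for four stores, plus their sizes.
location = ["urban", "rural", "urban", "suburban"]
size = [1726, 1542, 2816, 5555]

dummies = dummy_encode(location, baseline="rural")
size_x_urban = interaction(size, dummies["urban"])

print(dummies)
print(size_x_urban)
```

Omitting the baseline category avoids the perfect multicollinearity that would arise if every category had its own column alongside the intercept.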
Software packages
In SPSS, select Analyze, Regression, Linear; select your dependent and independent variables; click Statistics; select Estimates, Confidence Intervals, Model Fit; Continue; OK.
In Excel, select Tools, Data Analysis, Regression; select your dependent and independent variables; OK.
In SAS, proc reg is used to run regression. A sample code:
    /* libraryname.datasetname is a placeholder for your own library and dataset */
    proc reg data=libraryname.datasetname;
        /* stepwise selection: variables enter and stay at the 0.05 level */
        model A30_3 = B90_1 B90_4 B90_6 B90_11 B90_12 B90_14 B90_15 B90_17 B90_19
            / selection=stepwise slentry=0.05 slstay=0.05 details;
        /* save the parameter estimates to the dataset est */
        ods output ParameterEstimates=est;
        weight weight;
    run;
    quit;
Thank You