
A professor is trying to show his students the importance of the mid-term test. He believes that the higher the mid-term grade, the higher the final grade. A random sample of 15 students in his class was selected, with the data given below:

Mid-term Grade   Final Grade
59               65
92               84
72               77
90               80
95               77
87               81
89               80
77               84
76               80
65               69
97               83
42               40
94               78
62               65
91               90

A dependent variable is the variable to be predicted or explained in a regression model. This variable is assumed to be functionally related to the independent variable.

An independent variable is the variable related to the dependent variable in a regression equation. The independent variable is used in a regression model to estimate the value of the dependent variable.

Scatter plots
[Figure: Scatter diagram of final grades of 15 students. x-axis: mid-term grades; y-axis: final grades.]

A scatter plot is a graph that may be used to represent the relationship between the independent and dependent variables. It is also referred to as a scatter diagram.

The dependent variable is plotted on the y-axis and the independent variable is plotted on the x-axis.

Two Variable Relationships

(a) Linear (positive slope)
(b) Linear (negative slope)
(c) Curvilinear (negative slope)
(d) Curvilinear (positive slope)
(e) No relationship

Simple Linear Regression Analysis

Simple linear regression analysis analyzes the linear relationship that exists between a dependent variable and a single independent variable.

[Figure: four scatter plots, each with its line of best fit (least squares line): positive linear relationship, negative linear relationship, relationship not linear, no relationship.]

Equation for a straight line

$Y = a + bX$

where Y is the dependent variable, X is the independent variable, a is the Y-intercept and b is the slope of the line.

The Y-intercept (a) is the value of the dependent variable (Y) when the value of the independent variable (X) is zero. It is the point at which the line cuts the y-axis.

The slope (b) is the change in the dependent variable for a unit increase in the independent variable. It is the tangent of the angle made by the line with the x-axis.

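Slope and y-intercept of a straight line

[Figure: the line $Y = a + bX$ drawn against the x-axis and y-axis. The y-intercept a is the value of Y at X = 0. The slope is $b = \tan\theta$, where $\theta$ is the angle the line makes with the x-axis.]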

Simple Linear Regression: Example

You wish to examine the linear dependency of the annual sales of grocery stores on their sizes in square footage. Sample data for 7 stores were obtained.

Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Scatter plot

[Figure: scatter plot of annual sales ($000, y-axis) against square feet (x-axis) for the 7 stores.]

Equation of the regression line

$\hat{Y} = a + bX$
($\hat{Y}$ is pronounced "Y hat")

Slope and y-intercept of the regression line

Slope

$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$

Y-intercept

$a = \bar{Y} - b\bar{X}$
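As a quick check, the slope and intercept formulas can be applied to the seven-store data from the example. This is a minimal sketch; the function and variable names are illustrative and not part of the original slides. The results should come out close to the a = 1636.415 and b = 1.487 used in the following slides.

```python
# Store data from the example: size in square feet (X), annual sales in $1000 (Y).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def least_squares_line(x, y):
    """Return (a, b) for the line Y-hat = a + b*X using the slide formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # b = (sum XY - n * Xbar * Ybar) / (sum X^2 - n * Xbar^2)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # a = Ybar - b * Xbar
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(square_feet, annual_sales)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```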

Simple Linear Regression Analysis

The least squares criterion is used for determining a regression line that minimizes the sum of squared residuals.


Least Squares Criterion

What is the intuition behind the least squares formula for b?

We know that the fitted line goes through the point of means $(\bar{X}, \bar{Y})$, since $a = \bar{Y} - b\bar{X}$. So we have to pick one line from all the possible lines rotating about the point of means.
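The rotating-lines picture can be tried numerically. The sketch below, which is an illustration and not the method used in the slides, forces every candidate line through the point of means of the store data and scans a grid of slopes (the grid range and step are arbitrary choices). The slope with the smallest sum of squared residuals should land close to the least squares value of about 1.487.

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n


def sse_for_slope(slope):
    """Sum of squared residuals for the line through (x_bar, y_bar) with this slope."""
    return sum(
        (y - (y_bar + slope * (x - x_bar))) ** 2
        for x, y in zip(square_feet, annual_sales)
    )


# Candidate slopes 0.000, 0.001, ..., 3.000 (an arbitrary search grid).
candidate_slopes = [i / 1000 for i in range(0, 3001)]
best_slope = min(candidate_slopes, key=sse_for_slope)
print(f"slope with the smallest SSE on the grid: {best_slope}")
```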

Interpretation of Results: Example

$\hat{Y}_i = 1636.415 + 1.487 X_i$

The slope of 1.487 means that for each one-unit increase in X, we predict the average of Y to increase by an estimated 1.487 units. Because annual sales are measured in $1000, the equation estimates that for each increase of 1 square foot in the size of the store, the expected annual sales increase by $1,487.
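To make the interpretation concrete, the fitted equation can be used for prediction. In this small sketch the 2,000 square foot store is a hypothetical input chosen for illustration, and the function name is made up.

```python
INTERCEPT = 1636.415  # a, in $1000
SLOPE = 1.487         # b, in $1000 per square foot


def predict_annual_sales(square_feet):
    """Predicted annual sales in $1000 for a store of the given size."""
    return INTERCEPT + SLOPE * square_feet


predicted = predict_annual_sales(2000)  # hypothetical store size
extra_per_square_foot = predict_annual_sales(2001) - predict_annual_sales(2000)
print(f"predicted sales: {predicted:.3f} ($1000)")
print(f"change for one extra square foot: {extra_per_square_foot:.3f} ($1000)")
```

The second figure is the slope itself: one extra square foot adds about 1.487 thousand dollars, that is $1,487, to predicted annual sales.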

Simple Linear Regression Analysis

A residual is the difference between the actual value of the dependent variable and the value predicted by the regression model.

Residual = $y - \hat{y}$

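[Figure: sales in thousands (y-axis) plotted against years with company (x-axis), with the fitted regression line. For the highlighted observation the actual value is y = 312 and the predicted value is $\hat{y}$ = 390.]

Residual = 312 - 390 = -78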

Standard Error of Estimate

$s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$

Standard error of estimate measures the reliability of the estimating equation that is developed.
Standard error of estimate measures the variation or scatter of the observed values around the regression line.

Store No.   Square Feet (X)   Annual Sales ($1,000) (Y)   $\hat{Y}$ = 1636.415 + 1.487(X)   (Y - $\hat{Y}$)   (Y - $\hat{Y}$)^2
1           1,726             3,681                       4202.977                          -522              272460
2           1,542             3,395                       3929.369                          -534              285550.2
3           2,816             6,653                       5823.807                          829               687561
4           5,555             9,543                       9896.7                            -354              125103.7
5           1,292             3,318                       3557.619                          -240              57417.27
6           2,208             5,563                       4919.711                          643               413820.7
7           1,313             3,760                       3588.846                          171               29293.69
Total                                                                                                         1871207

$s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - k - 1}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{1871207}{7 - 2}} = \sqrt{\frac{1871207}{5}} = 611.7527$
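The table calculation can be reproduced in a few lines. This sketch assumes the rounded coefficients 1636.415 and 1.487 used in the table, so it should give a value very close to 611.75; the variable names are illustrative.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Rounded coefficients, as used in the table.
a, b = 1636.415, 1.487

predicted = [a + b * x for x in square_feet]
residuals = [y - y_hat for y, y_hat in zip(annual_sales, predicted)]
sum_squared_residuals = sum(r ** 2 for r in residuals)

n = len(annual_sales)
k = 1  # one independent variable, so the divisor is n - k - 1 = n - 2
standard_error = math.sqrt(sum_squared_residuals / (n - k - 1))
print(f"sum of squared residuals = {sum_squared_residuals:.1f}")
print(f"standard error of estimate = {standard_error:.4f}")
```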

Regression line: $\hat{Y} = 10 + 2X$

X     Y     $\hat{Y}$   Y - $\hat{Y}$
10    30    30          0
20    50    50          0
30    70    70          0
40    90    90          0
50    110   110         0

Hence $s_e = 0$.

[Figure: plot of the Y variable against the X variable; all five points lie on the regression line.]

For a perfectly fitting regression line, the standard error of estimate is equal to zero. The closer $s_e$ is to 0, the better the reliability of the regression line.

Another way of checking the reliability of the regression line: find the standard deviation of the Y variable, $\sigma(y)$.
If $s_e < \sigma(y)$, the regression line is a reliable estimate of the data.
If $s_e > \sigma(y)$, the regression line is not a reliable estimate of the data.
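Applied to the store example, this check might look like the sketch below. It assumes the sample standard deviation (divisor n - 1) is the one intended; the slides do not say which version to use. The value 611.7527 is the standard error of estimate from the worked example.

```python
import statistics

annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

standard_error = 611.7527  # s_e from the worked example
# Assumption: sample standard deviation of Y (n - 1 divisor).
sd_y = statistics.stdev(annual_sales)

is_reliable = standard_error < sd_y
print(f"s_e = {standard_error:.2f}, standard deviation of Y = {sd_y:.2f}")
print("regression line is a reliable estimate" if is_reliable
      else "regression line is not a reliable estimate")
```

Here the standard error is well below the standard deviation of Y, so by this rule the regression line counts as a reliable estimate of the data.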

Measures of Variation: The Sum of Squares

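SST = $\sum (Y_i - \bar{Y})^2$

SSE = $\sum (Y_i - \hat{Y}_i)^2$

SSR = $\sum (\hat{Y}_i - \bar{Y})^2$

[Figure: the three deviations shown on a plot of Y against X at the point $X_i$, relative to the mean $\bar{Y}$.]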

SST = Total Sum of Squares
Measures the variation of the $Y_i$ values around their mean, $\bar{Y}$

SSR = Regression Sum of Squares
Explained variation attributable to the relationship between X and Y

SSE = Error Sum of Squares
Variation attributable to factors other than the relationship between X and Y

SST = SSR + SSE

Total sample variability = Explained variability + Unexplained variability

Coefficient of Determination
Sample coefficient of determination:

$r^2 = \frac{SSR}{SST}$

or

$r^2 = 1 - \frac{SSE}{SST}$
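The two forms of $r^2$ agree because SST = SSR + SSE for a least squares line. The sketch below checks this on the store data. It fits the line with unrounded coefficients, using the deviation form of the slope (algebraically the same as the slope formula given earlier), so the sums differ slightly from tables built on the rounded 1.487. All names are illustrative.

```python
x = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
y = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept (deviation form of the slope formula).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(f"SST = {sst:.1f}, SSR + SSE = {ssr + sse:.1f}")
print(f"r^2 = {r2_from_ssr:.4f} (SSR/SST), {r2_from_sse:.4f} (1 - SSE/SST)")
```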

Sample coefficient of determination:

$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$

The population coefficient of determination is denoted as $R^2$.

The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by the regression line. The coefficient of determination is also called r-squared and is denoted as $r^2$.

X       Y       $\hat{Y}$   (Y - $\hat{Y}$)   (Y - $\hat{Y}$)^2   (Y - $\bar{Y}$)   (Y - $\bar{Y}$)^2
1,726   3,681   4202.977    -522              272460              -1,449            2100843
1,542   3,395   3929.369    -534              285550.2            -1,735            3011712
2,816   6,653   5823.807    829               687561              1,523             2318224
5,555   9,543   9896.7      -354              125103.7            4,413             19470787
1,292   3,318   3557.619    -240              57417.27            -1,812            3284897
2,208   5,563   4919.711    643               413820.7            433               187118
1,313   3,760   3588.846    171               29293.69            -1,370            1878074

$\bar{Y}$ = 5,130
$\sum (Y - \hat{Y})^2$ = 1871207
$\sum (Y - \bar{Y})^2$ = 32251656

$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{1871207}{32251656} = 0.9419811$

Conclusion: 94% of the variation in the annual sales (Y variable) is explained by the size of the stores (X variable), and the remaining 6% is attributable to other factors.

Multiple regression

Estimating equation describing the relationship among three variables:

$\hat{Y} = a + b_1X_1 + b_2X_2$

Multiple regression estimating equation when there are k independent variables:

$\hat{Y} = a + b_1X_1 + b_2X_2 + b_3X_3 + \dots + b_kX_k$
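One way to estimate a, b1 and b2 is to solve the least squares normal equations. The slides do not show how the coefficients are computed, so both the method and the data here are assumptions: the data is hypothetical and was generated from Y = 1 + 2*X1 + 3*X2, so the fit should recover those values. A statistical package would normally be used instead of hand-written elimination.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * size
    for r in range(size - 1, -1, -1):  # back substitution
        known = sum(aug[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (aug[r][size] - known) / aug[r][r]
    return coeffs


def fit_multiple_regression(predictors, y):
    """Return [a, b1, ..., bk] by solving the normal equations (X'X) beta = X'y."""
    rows = [[1.0] + list(obs) for obs in zip(*predictors)]  # leading 1 for the intercept
    width = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(width)] for i in range(width)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(width)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 (not from the slides).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

coefficients = fit_multiple_regression([x1, x2], y)
print([round(c, 6) for c in coefficients])
```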

Standard error of estimate when there are 2 independent variables

$s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 3}}$

Standard error of estimate when there are k independent variables

$s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - (k + 1)}}$

F-Test
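Testing how good the model is at predicting Y by conducting individual t-tests on each of the $\beta$s is not a good idea. Instead, we use a global test that encompasses all the $\beta$s and tests the following overall hypothesis:

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1$: at least one $\beta_i \neq 0$

The test statistic for this hypothesis is called the F statistic and is calculated as:

$F = \frac{R^2 / k}{(1 - R^2) / (n - (k + 1))}$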

The F statistic is the ratio of the explained variability (as reflected by $R^2$) to the unexplained variability (as reflected by $1 - R^2$), each divided by the corresponding degrees of freedom. The larger the F statistic, the more useful the model.
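As an illustration, the F statistic can be computed for the store example. This sketch assumes the usual form F = (R^2 / k) / ((1 - R^2) / (n - (k + 1))), which matches the description above, and uses the r-squared of 0.9419811 from the worked example with n = 7 stores and k = 1 independent variable. The function name is made up.

```python
def f_statistic(r_squared, n, k):
    """F = explained variability per df / unexplained variability per df."""
    explained = r_squared / k
    unexplained = (1 - r_squared) / (n - (k + 1))
    return explained / unexplained


# Store example: r^2 = 0.9419811, n = 7 observations, k = 1 independent variable.
f_value = f_statistic(0.9419811, n=7, k=1)
print(f"F = {f_value:.2f}")
```

The result would then be compared with a critical value of the F distribution with k and n - (k + 1) degrees of freedom.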

Assumptions underlying classical linear regression model


1. The regression model is linear in the parameters
2. X values are fixed in repeated sampling
3. Mean value of the disturbance is zero
4. Homoscedasticity, or equal variance of the disturbances
5. No autocorrelation between the disturbances
6. Zero covariance between the disturbance term and the X values
7. The number of observations must be greater than the number of parameters to be estimated
8. Variability in X values
9. The regression model is correctly specified
10. There is no perfect multicollinearity

Types of Regression Models


Regression Model
- Simple (1 explanatory variable)
  - Linear
  - Non-linear
- Multiple (2 or more explanatory variables)
  - Linear
  - Non-linear

Different estimation techniques

- Enter method
- Forward inclusion
- Backward elimination
- Stepwise
- Hierarchical

Standard Regression Procedure


Define research problem
- Select dependent variable
- Select independent variable

Do the variables meet the assumptions of normality, linearity, homoscedasticity and independence of error terms?

NO: Create additional variables
- Transformations to meet assumptions
- Dummy variables for non-metric data
- Interaction terms for moderation effects

Is there multicollinearity?

YES: Principal Component Analysis or Factor Analysis

NO: Divide the sample into development and validation sets

Remove outliers

Select an estimation technique

Does the regression variate meet the assumptions of regression analysis?

NO: go to connector 1

YES: Examine statistical and practical significance
- Model fit
- Adjusted $R^2$
- Standard error of estimate
- Statistical significance of beta coefficients

Validation of the model

Software packages
In SPSS, select Analyze, Regression, Linear; select your dependent and independent variables; click Statistics; select Estimates, Confidence Intervals, Model Fit; click Continue; click OK.

In Excel, select Tools, Data Analysis, Regression; select your dependent and independent variables; click OK.

In SAS, proc reg is used to run regression. A sample code:

    /* libraryname.datasetname is a placeholder for your own library and data set */
    proc reg data = libraryname.datasetname;
        /* stepwise selection with 0.05 thresholds for entry and for staying */
        model A30_3 = B90_1 B90_4 B90_6 B90_11 B90_12 B90_14 B90_15 B90_17 B90_19
              / selection=STEPWISE slentry=0.05 slstay=0.05 details;
        ods output parameterestimates=est;
        weight weight;
    run;
    quit;


Thank You
