
Data Scientist

Module 3.2 : Regression

Module 3.2 Copyrights 2017 © ProcessWhirl Management Consulting Pvt. Ltd.


Module Objectives

• Introduction



M3.2 Regression - Introduction

• Relationship between one dependent variable and one or more explanatory variables
• An equation is used to set up the relationship
• Numerical dependent (response) variable
• 1 or more numerical or categorical independent (explanatory) variables
• Used mainly for prediction & estimation



M3.2 Regression - Introduction
• Relation between variables where changes in some variables
may “explain” or possibly “cause” changes in other variables.
• Explanatory variables are termed the independent variables
and the variables to be explained are termed the dependent
variables.
• Regression model estimates the nature of the relationship
between the independent and dependent variables.
– Change in dependent variables that results from changes
in independent variables, i.e. size of the relationship.
– Strength of the relationship.
– Statistical significance of the relationship.



M3.2 Regression - Introduction
• Dependent variable is the retail price of gasoline in a region –
independent variable is the price of crude oil.
• Dependent variable is employment income – independent
variables might be hours of work, education, occupation,
sex, age, years of experience, unionization status, etc.

• Price of a product and quantity produced or sold:


– Quantity sold affected by price. Dependent variable is
quantity of product sold – independent variable is price.
– Price affected by quantity offered for sale. Dependent
variable is price – independent variable is quantity sold.





M3.2 Regression - Introduction

Bivariate or simple regression model:

  x (Education) → y (Income)

Multivariate or multiple regression model:

  x1 (Education), x2 (Sex), x3 (Experience), x4 (Age) → y (Income)


M3.2 Regression – Simple Relationship

• x is the independent variable
• y is the dependent variable
• The regression model is

  y = β0 + β1x + ε

• The model has two variables: the independent or explanatory variable, x, and the dependent variable, y, the variable whose variation is to be explained.
• The relationship between x and y is a linear or straight-line relationship.
• Two parameters to estimate – the slope of the line, β1, and the y-intercept, β0 (where the line crosses the vertical axis).
• ε is the unexplained, random, or error component.



M3.2 Regression – Simple Relationship
(Example)
Income hrs/week Income hrs/week
8000 38 8000 35
6400 50 18000 37.5
2500 15 5400 37
3000 30 15000 35
6000 50 3500 30
5000 38 24000 45
8000 50 1000 4
4000 20 8000 37.5
11000 45 2100 25
25000 50 8000 46
4000 20 4000 30
8800 35 1000 200
5000 30 2000 200
7000 43 4800 30

Data Cleanup?



M3.2 Regression – Simple Relationship
(Example)
[Scatter plot: Summer Income as a Function of Hours Worked – Income (0–30,000) vs. Hours per Week (0–60)]



M3.2 Regression – Simple Relationship
(Example)

ŷ = −2461 + 297x

R2 = 0.311
Significance = 0.0031



M3.2 Regression – Simple Relationship
(Outliers)

Outliers:
• Rare, extreme values may distort the
outcome.
• Could be an error.
• Could be a very important observation.
• More than 3 standard deviations from the
mean.
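The 3-standard-deviation rule of thumb above can be sketched in Python (the hands-on sections later in this module use R; this is only an illustration, applied to the hrs/week column of the earlier summer-income table):

```python
from statistics import mean, stdev

def outliers_3sd(values):
    """Flag observations more than 3 standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > 3 * s]

# hrs/week column from the summer-income table on the earlier slide
hours = [38, 50, 15, 30, 50, 38, 50, 20, 45, 50, 20, 35, 30, 43,
         35, 37.5, 37, 35, 30, 45, 4, 37.5, 25, 46, 30, 200, 200, 30]
print(outliers_3sd(hours))   # the two impossible 200-hour weeks are flagged
```

Note a design caveat: extreme outliers inflate the standard deviation itself, so this simple rule can miss moderate outliers (here the 4-hour week survives); inspecting a box plot or scatter plot, as the example slides do, is still essential.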
M3.2 Regression – Simple Relationship
(Non Linear Relationship)
[Scatter plot: U-Shaped Relationship – Y vs. X; correlation = +0.12]
• Be wary of the scatter graph; it is not always perfect.
• Close to zero correlation, but there really is a more complex relationship.
M3.2 Regression – Types of Regression
Models

Regression models:
• 1 explanatory variable → Simple regression (Linear or Non-Linear)
• 2+ explanatory variables → Multiple regression (Linear or Non-Linear)



Linear Regression Model



M3.2 Regression – Linear Regression

• Relationship between variables is a linear function:

  Yi = β0 + β1Xi + εi

  where β0 is the population Y-intercept, β1 is the population slope, and εi is the random error.

• Yi is the dependent (response) variable (e.g., CD4+ count) and Xi is the independent (explanatory) variable (e.g., years since seroconversion).


Estimating Parameters:
Least Squares Method

M3.2 Regression – Linear Regression
(Scatter plot)
• 1. Plot of All (Xi, Yi) Pairs
• 2. Suggests How Well Model Will Fit

[Scatter plot of all (Xi, Yi) pairs: Y (0–60) vs. X (0–60)]
M3.2 Regression – Linear Regression
(Thinking Challenge)
How would you draw a line through the points? How do you determine which
line ‘fits best’?



M3.2 Regression – Linear Regression
(Least Squares)
• 1. ‘Best fit’ means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors!

  Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²

• 2. Least squares (LS) minimizes the sum of the squared differences (errors), SSE.
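A tiny numeric illustration of why the errors are squared: positive and negative residuals cancel in a plain sum, but not in a sum of squares (Python, with hypothetical residuals):

```python
residuals = [2.0, -2.0, 1.5, -1.5]   # hypothetical errors y_i - yhat_i
print(sum(residuals))                # 0.0 -- the signs offset
sse = sum(e ** 2 for e in residuals)
print(sse)                           # 12.5 -- squaring exposes the misfit
```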



M3.2 Regression – Linear Regression
(Coefficient Equations)

Prediction equation:

  ŷi = β̂0 + β̂1xi

Sample slope:

  β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Sample y-intercept:

  β̂0 = ȳ − β̂1x̄
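A minimal Python sketch of these two formulas (the deck's hands-on sections use R; the data here are hypothetical, purely to exercise the arithmetic):

```python
def least_squares(x, y):
    """Slope b1 = SSxy / SSxx and intercept b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([10, 20, 30, 40], [3, 7, 8, 12])  # hypothetical data
print(round(b0, 2), round(b1, 2))   # 0.5 0.28
```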
M3.2 Regression – Linear Regression
(Computation Table)

Interpretation of Coefficients

• 1. Slope (β̂1)
  – Estimated Y changes by β̂1 for each 1-unit increase in X
    • If β̂1 = 2, then Y is expected to increase by 2 for each 1-unit increase in X
• 2. Y-intercept (β̂0)
  – Average value of Y when X = 0
    • If β̂0 = 4, then average Y is expected to be 4 when X is 0



M3.2 Regression – Linear Regression (Example)
• What is the relationship between Mother’s Estriol level &
Birthweight using the following data?

Estriol Birthweight
(mg/24h) (g/1000)
1 1
2 1
3 2
4 2
5 4



M3.2 Regression – Linear Regression (Example)

β̂1 = [ΣXiYi − (ΣXi)(ΣYi)/n] / [ΣXi² − (ΣXi)²/n]
   = (37 − (15)(10)/5) / (55 − (15)²/5)
   = 7 / 10
   = 0.70

β̂0 = Ȳ − β̂1X̄ = 2 − 0.70(3) = −0.10
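The arithmetic on this slide can be checked directly; a Python sketch of the same raw-sum formula applied to the estriol data (illustration only; the deck's hands-on sections use R):

```python
x = [1, 2, 3, 4, 5]   # estriol (mg/24h)
y = [1, 1, 2, 2, 4]   # birthweight (g/1000)
n = len(x)

ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 37 - 15*10/5 = 7
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 55 - 15**2/5 = 10
b1 = ss_xy / ss_xx                                               # slope
b0 = sum(y) / n - b1 * sum(x) / n                                # intercept
print(round(b1, 2), round(b0, 2))   # 0.7 -0.1
```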



M3.2 Regression – Linear Regression
(Example)
• Many factors affect the wages of workers: the industry they work
in, their type of job, their education and their experience, and
changes in general levels of wages. We will look at a sample of 59
married women who hold customer service jobs in Indiana banks.
The following table gives their weekly wages at a specific point in
time, as well as their length of service with their employer, in months.
The size of the place of work is recorded simply as “large” (100 or
more workers) or “small.” Because industry, job type, and the time
of measurement are the same for all 59 subjects, we expect to see a
clear relationship between wages and length of service.



M3.2 Regression – Linear Regression
(Example)

[Figures: scatter plot and fitted regression output for the wages vs. length-of-service example]


M3.2 Regression – Linear Regression
(Interpretation)
• R2 is called the coefficient of determination.

0 ≤ R² ≤ 1
• We may interpret R2 as the proportionate reduction of total variability in y
associated with the use of the independent variable x.
• The larger R² is, the more the total variation of y is reduced by including the
variable x in the model.
• If all the observations fall on the fitted regression line, SSE = 0 and R2 = 1.
• If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0.
• The closer R2 is to 1, the greater is said to be the degree of linear association
between x and y.
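These bullets can be verified numerically; a small Python sketch computing R² = 1 − SSE/SST, using the estriol data from the earlier example and its least-squares fitted values (illustration only):

```python
def r_squared(y, yhat):
    """R^2 = 1 - SSE/SST."""
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))   # residual sum of squares
    sst = sum((a - ybar) ** 2 for a in y)              # total sum of squares
    return 1 - sse / sst

y = [1, 1, 2, 2, 4]
yhat = [-0.1 + 0.7 * x for x in [1, 2, 3, 4, 5]]   # least-squares fit for these data
print(round(r_squared(y, yhat), 3))

print(r_squared(y, list(y)))   # perfect fit: SSE = 0, so R^2 = 1.0
```

The second call illustrates the boundary case from the slide: when every observation falls on the fitted line, SSE = 0 and R² = 1.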



M3.2 Regression – Linear Regression
(Interpretation)
11-4.1 Use of t-Tests
An important special case of the hypotheses of Equation 11-18 is

  H0: β1 = 0 vs. H1: β1 ≠ 0

These hypotheses relate to the significance of regression.

Failure to reject H0 is equivalent to concluding that there
is no linear relationship between x and Y.



M3.2 Regression – Linear Regression
(Interpretation)
11-4.2 Analysis of Variance Approach to Test
Significance of Regression

The test statistic is F0 = MSR/MSE, where MSR = SSR/1 and
MSE = SSE/(n − 2); the quantities MSR and MSE are called mean squares.

[Analysis of variance table]



M3.2 Regression – Linear Regression
(Adequacy of Model)
• Fitting a regression model requires several
assumptions.
1. Errors are uncorrelated random variables with mean zero;
2. Errors have constant variance; and
3. Errors are normally distributed.
• The analyst should always consider the validity of these
assumptions to be doubtful and conduct analyses to examine
the adequacy of the model.


M3.2 Regression – Linear Regression
(Adequacy of Model)
11-7.1 Residual Analysis

[Figure 11-9: Patterns for residual plots – (a) satisfactory, (b) funnel, (c) double bow, (d) nonlinear]



M3.2 Regression – Linear Regression
(Adequacy of Model)

[Figure 11-10: Normal probability plot of residuals]



Multiple Regression



M3.2 Regression – Multiple Regression
(Example)
• Typically, we want to use more than a single predictor
(independent variable) to make predictions
• Regression with more than one predictor is called “multiple
regression”

  yi = α + β1x1i + β2x2i + ⋯ + βpxpi + εi
• the β’s are coefficients for the independent variables in the true
or population equation and the x’s are the values of the
independent variables for the member of the population.



M3.2 Regression – Multiple Regression
(Example)

The data in the table on the following slide are:


Dependent Variable
y = BMI
Independent Variables
x1 = Age in years
x2 = FFNUM, a measure of fast food usage,
x3 = Exercise, an exercise intensity score
x4 = Beers per day



M3.2 Regression – Multiple Regression
(Example)
OBS AGE BMI FFNUM EXERCISE BEER
1 26 23.2 0 621 3
2 30 30.2 9 201 6
3 32 28.1 17 240 10
4 27 22.7 1 669 5
5 33 28.9 7 1,140 12
6 29 22.4 3 445 9
7 32 23.2 1 710 15
8 33 20.3 0 783 11
9 31 25.6 1 454 0
10 33 21.2 3 432 2
11 26 22.3 5 1,562 13
12 34 23.0 2 697 1
13 33 26.3 4 280 2
14 31 22.2 1 449 5
15 31 19.0 0 689 4
16 27 20.8 2 785 3
17 36 20.9 2 350 7
18 35 36.4 14 48 11
19 31 28.6 11 285 12
20 36 27.5 8 85 5
Total 626 492.8 91 10,925 136
Mean 31.3 24.6 4.6 546.3 6.8





M3.2 Regression – Multiple Regression
(Example)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.7932 and C(p) = 5.0000

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 4    273.74877    68.43719      14.38    <.0001
Error                15     71.37923     4.75862
Corrected Total      19    345.12800

(One df for each independent variable in the model.)

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      18.47774     6.45406      39.00436       8.20    0.0119   (b0)
age             0.08424     0.18931       0.94239       0.20    0.6627   (b1)
ffnum           0.42292     0.13671      45.53958       9.57    0.0074   (b2)
exercise       -0.00107     0.00170       1.87604       0.39    0.5395   (b3)
beer            0.32601     0.11518      38.12111       8.01    0.0127   (b4)


M3.2 Regression – Multiple Regression
(Example)
We have,
b0 = 18.478, b1 = 0.084, b2 = 0.422, b3 = −0.001, b4 = 0.326

So,

ŷ = 18.478 + 0.084x1 + 0.422x2 − 0.001x3 + 0.326x4
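As a sketch, the fitted equation can be evaluated for a hypothetical subject (Python, illustration only; the subject's values below are made up):

```python
b = [18.478, 0.084, 0.422, -0.001, 0.326]   # b0..b4 from the fitted model above

def predict_bmi(age, ffnum, exercise, beer):
    """yhat = b0 + b1*age + b2*ffnum + b3*exercise + b4*beer."""
    return b[0] + b[1] * age + b[2] * ffnum + b[3] * exercise + b[4] * beer

# hypothetical subject: age 30, 4 fast-food meals, exercise score 500, 5 beers/day
print(round(predict_bmi(30, 4, 500, 5), 2))   # 23.82
```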



M3.2 Regression – Multiple Regression
(Model Verification)
The first step is to test the global hypothesis:

  H0: β1 = β2 = β3 = β4 = 0
  vs. H1: at least one βi ≠ 0

The ANOVA highlighted in the green box at the top of the next slide tests
this hypothesis:

  F = 14.38 > F0.95(4, 15) = 3.06,

so the hypothesis is rejected. Thus, we have evidence that at least one of
the βi ≠ 0.



M3.2 Regression – Multiple Regression
(Model Verification)

The amount of variation in the dependent variable, BMI, explained
by its regression relationship with the four independent variables is

  R² = SS(Model)/SS(Total) = 273.75/345.13 = 0.79, or 79%
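Both R² and the global F statistic follow from the ANOVA table shown earlier; a quick Python check of the arithmetic:

```python
ss_model, ss_total = 273.74877, 345.12800   # from the ANOVA table
df_model, df_error = 4, 15

ss_error = ss_total - ss_model                        # 71.37923
r2 = ss_model / ss_total                              # coefficient of determination
f = (ss_model / df_model) / (ss_error / df_error)     # F = MSR / MSE
print(round(r2, 4), round(f, 2))   # 0.7932 14.38
```

The values reproduce the R-Square and F Value reported by PROC REG.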



M3.2 Regression – Multiple Regression
(Model Verification)

If the global hypothesis is rejected, it is then
appropriate to examine hypotheses for the
individual parameters, such as

  H0: β1 = 0 vs. H1: β1 ≠ 0.

The P-value for this test, P = 0.6627, is greater than α = 0.05,
so we fail to reject H0: β1 = 0.



M3.2 Regression – Multiple Regression
(Model Verification)

From the ANOVA, we have

b1 = 0.084, P = 0.66
b2 = 0.422, P = 0.01
b3 = - 0.001, P = 0.54
b4 = 0.326, P = 0.01

So b2 = 0.422 and b4 = 0.326 appear to represent terms that should be
explored further.



M3.2 Regression – Multiple Regression
(Approaches)

Backward elimination
Start with all independent variables, test the global hypothesis and, if it is
rejected, eliminate step by step those independent variables whose
coefficients β are not significantly different from 0.

Forward selection
Start with a “core” subset of essential variables and add others step by
step.



Backward Elimination



M3.2 Regression – Multiple Regression
(Backward Elimination)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.7932 and C(p) = 5.0000

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 4    273.74877    68.43719      14.38    <.0001
Error                15     71.37923     4.75862
Corrected Total      19    345.12800

(The Model row of the ANOVA tests the global hypothesis.)

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      18.47774     6.45406      39.00436       8.20    0.0119
age             0.08424     0.18931       0.94239       0.20    0.6627
ffnum           0.42292     0.13671      45.53958       9.57    0.0074
exercise       -0.00107     0.00170       1.87604       0.39    0.5395
beer            0.32601     0.11518      38.12111       8.01    0.0127



M3.2 Regression – Multiple Regression
(Backward Elimination)
Backward Elimination: Step 1

Variable age Removed: R-Square = 0.7904 and C(p) = 3.1980

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 3    272.80638    90.93546      20.12    <.0001
Error                16     72.32162     4.52010
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      21.28788     1.30004    1211.98539     268.13    <.0001
ffnum           0.42963     0.13243      47.57610      10.53    0.0051
exercise       -0.00140     0.00149       4.00750       0.89    0.3604
beer            0.32275     0.11203      37.51501       8.30    0.0109

Bounds on condition number: 1.7883, 14.025


M3.2 Regression – Multiple Regression
(Backward Elimination)
Backward Elimination: Step 2

Variable exercise Removed: R-Square = 0.7788 and C(p) = 2.0402

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 2    268.79888   134.39944      29.93    <.0001
Error                17     76.32912     4.48995
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      20.29360     0.75579    3237.09859     720.97    <.0001
ffnum           0.46380     0.12693      59.94878      13.35    0.0020
beer            0.33375     0.11105      40.55414       9.03    0.0080

Bounds on condition number: 1.654, 6.6161

All variables left in the model are significant at the 0.0500 level.

Summary of Backward Elimination

         Variable    Number     Partial      Model
Step     Removed    Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
   1     age              3      0.0027      0.7904    3.1980       0.20    0.6627
   2     exercise         2      0.0116      0.7788    2.0402       0.89    0.3604


Forward Stepwise Regression



M3.2 Regression – Multiple Regression
(Forward Addition)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Stepwise Selection: Step 1

Variable ffnum Entered: R-Square = 0.6613 and C(p) = 8.5625

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 1    228.24473   228.24473      35.15    <.0001
Error                18    116.88327     6.49351
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      21.43827     0.78506    4842.33895     745.72    <.0001
ffnum           0.70368     0.11869     228.24473      35.15    <.0001

Stepwise Selection: Step 2

Variable beer Entered: R-Square = 0.7788 and C(p) = 2.0402

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 2    268.79888   134.39944      29.93    <.0001
Error                17     76.32912     4.48995
Corrected Total      19    345.12800


M3.2 Regression – Multiple Regression
(Forward Addition)
Model: MODEL1
Dependent Variable: bmi

Stepwise Selection: Step 2

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      20.29360     0.75579    3237.09859     720.97    <.0001
ffnum           0.46380     0.12693      59.94878      13.35    0.0020
beer            0.33375     0.11105      40.55414       9.03    0.0080

Bounds on condition number: 1.654, 6.6161

All variables left in the model are significant at the 0.0500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

         Variable    Variable    Number     Partial      Model
Step     Entered     Removed    Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
   1     ffnum                      1       0.6613      0.6613    8.5625      35.15    <.0001
   2     beer                       2       0.1175      0.7788    2.0402       9.03    0.0080


Hands on – Using R (Simple Linear Regression)



M3.2 Regression R
(Simple Linear)
• Use the “cars” data for the Regression Analysis
• > head(cars)  # display the first 6 observations
    speed dist
  1     4    2
  2     4   10
  3     7    4
  4     7   22
  5     8   16
  6     9   10
• Before doing the regression analysis it's good practice to
understand the data
– Box Plot (for any outliers)
– Scatter Plot (for relationship)
– Density Plot (understand distribution, skewed etc.,)

• linearMod <- lm(dist ~ speed, data=cars)  # build linear regression model on full data
• print(linearMod)
• #> Call:
• #> lm(formula = dist ~ speed, data = cars)
• #>
• #> Coefficients:
• #> (Intercept) speed
• #> -17.579 3.932



M3.2 Regression R
(Simple Linear)
• Use the “alligator” data for the Regression Analysis
• First create a data frame to fit the simple linear regression model
• > alligator = data.frame( lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81,
3.83, 3.46, 3.76, 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78), lnWeight =
c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, 3.58, 3.64, 5.90, 4.43,
4.38, 4.42, 4.25) )
• Perform exploratory data analysis (EDA); xyplot() comes from the lattice package, so load it first
• > library(lattice)
> xyplot(lnWeight ~ lnLength, data = alligator, xlab = "Snout vent
length (inches) on log scale", ylab = "Weight (pounds) on log scale",
main = "Alligators in Central Florida" )
• > alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
• > summary(alli.mod1)



M3.2 Regression R
(Simple Linear)
• Problem: Apply the simple linear regression model for the data set faithful, and estimate next eruption duration if waiting time is 80
minutes.
• Apply the lm function to a formula that describes the variable eruptions by the variable waiting.
• > eruption.lm = lm(eruptions ~ waiting, data=faithful)

• Extract the parameters of the estimated regression equation with the coefficients function.
• > coeffs = coefficients(eruption.lm)
• > coeffs
(Intercept) waiting
-1.874016 0.075628

• Fit the eruption duration using the estimated regression equation.


• > waiting = 80 # the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
4.1762
• Answer: Based on the simple linear regression model, if the waiting time since the last eruption has been 80 minutes, we expect the
next one to last 4.1762 minutes.
• Alternative Solution: Wrap the waiting parameter value inside a new data frame named newdata.
• > newdata = data.frame(waiting=80) # wrap the parameter
• > predict(eruption.lm, newdata) # apply predict
1
4.1762



M3.2 Regression R
(Simple Linear)
• Extract the coefficient of determination from the r.squared attribute of its summary.
• > summary(eruption.lm)$r.squared
[1] 0.81146
• Answer: The coefficient of determination of simple linear regression model for data set faithful is 0.81146.

• Print out the F-statistics of the significance test with the summary function.
• > summary(eruption.lm)
• Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.2992 -0.3769 0.0351 0.3491 1.1933

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402 0.16014 -11.7 <2e-16 ***
waiting 0.07563 0.00222 34.1 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16

• Answer: As the p-value is much less than 0.05, we reject the null hypothesis that the slope β1 = 0. Hence there is a significant relationship
between the variables in the linear regression model of the data set faithful.



M3.2 Regression R
(Simple Linear)
• Plot the residuals against the observed values of the variable waiting:
• > eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0)  # the horizon

• Normal probability plot of the residuals: compute the standardized residuals, create the plot with the qqnorm function, and add the qqline for further comparison:
• > eruption.stdres = rstandard(eruption.lm)
> qqnorm(eruption.stdres,
+ ylab="Standardized Residuals",
+ xlab="Normal Scores",
+ main="Old Faithful Eruptions")
> qqline(eruption.stdres)



Logistic Regression



M3.2 Regression – Logistic Regression
(Introduction)
• There are many important research topics for which the dependent variable is "limited."
• For example: voting, morbidity or mortality, and participation data are not continuous or
distributed normally.
• Binary logistic regression is a type of regression analysis where the dependent variable
is a dummy variable: coded 0 (did not vote) or 1 (did vote)
• Logistic regression is the type of regression we use for a response variable (Y) that
follows a binomial distribution
• Linear regression is the type of regression we use for a continuous, normally distributed
response variable (Y)

• Why can't we fit an ordinary linear regression, Y = α + βX + ε, where Y ∈ {0, 1}?
  – Response Y is not normally distributed
  – The error terms are heteroskedastic
  – ε is not normally distributed because Y takes on only two values
  – The predicted probabilities can be greater than 1 or less than 0



M3.2 Regression – Logistic Regression
(Introduction)
The "logit" model solves these problems:

  ln[p/(1 − p)] = α + βX + ε

• p is the probability that the event Y occurs, p(Y = 1)
• p/(1 − p) is the odds
• ln[p/(1 − p)] is the log odds, or "logit"
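The odds and logit transforms are easy to sanity-check in Python (illustrative probability values; the rest of this module's hands-on work is in R):

```python
from math import log

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return log(p / (1 - p))

print(logit(0.5))              # odds 1:1 -> logit 0.0
print(round(logit(0.75), 4))   # odds 3:1 -> ln(3), about 1.0986
print(round(logit(0.25), 4))   # odds 1:3 -> symmetric, about -1.0986
```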



M3.2 Regression – Logistic Regression
(Introduction)
• The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
• The estimated probability is:

  p = 1/[1 + exp(−α − βX)]

• If α + βX = 0, then p = 0.50
• As α + βX gets really big, p approaches 1
• As α + βX gets really small, p approaches 0
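These three limiting cases can be checked directly; a small Python sketch of the inverse-logit (logistic) function with hypothetical coefficients:

```python
from math import exp

def p_hat(alpha, beta, x):
    """Estimated probability p = 1 / (1 + exp(-alpha - beta*x))."""
    return 1 / (1 + exp(-alpha - beta * x))

alpha, beta = 0.0, 1.0   # hypothetical coefficients, for illustration
print(p_hat(alpha, beta, 0))    # alpha + beta*x = 0  -> p = 0.5
print(p_hat(alpha, beta, 10))   # large linear predictor -> p near 1
print(p_hat(alpha, beta, -10))  # small linear predictor -> p near 0
```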

