
Data Scientist

Module 3.2 : Regression

Module 3.2 Copyrights 2017 © ProcessWhirl Management Consulting Pvt. Ltd.


Module Objectives

• Introduction



M3.2 Regression - Introduction

• Relationship between one dependent variable and one or more explanatory variables
• An equation is used to set up the relationship
• Numerical dependent (response) variable
• 1 or more numerical or categorical independent (explanatory) variables
• Used mainly for prediction & estimation



M3.2 Regression - Introduction
• Relation between variables where changes in some variables
may “explain” or possibly “cause” changes in other variables.
• Explanatory variables are termed the independent variables
and the variables to be explained are termed the dependent
variables.
• Regression model estimates the nature of the relationship
between the independent and dependent variables.
– Change in dependent variables that results from changes
in independent variables, i.e. size of the relationship.
– Strength of the relationship.
– Statistical significance of the relationship.



M3.2 Regression - Introduction
• Dependent variable is the retail price of gasoline in a region –
independent variable is the price of crude oil.
• Dependent variable is employment income – independent
variables might be hours of work, education, occupation,
sex, age, years of experience, unionization status, etc.

• Price of a product and quantity produced or sold:


– Quantity sold affected by price. Dependent variable is
quantity of product sold – independent variable is price.
– Price affected by quantity offered for sale. Dependent
variable is price – independent variable is quantity sold.





M3.2 Regression - Introduction

Bivariate or simple regression model:

  x (Education) → y (Income)

Multivariate or multiple regression model:

  x1 (Education), x2 (Sex), x3 (Experience), x4 (Age) → y (Income)


M3.2 Regression – Simple Relationship

• x is the independent variable
• y is the dependent variable
• The regression model is

  y = β0 + β1x + ε

• The model has two variables: the independent or explanatory variable, x, and the dependent variable, y, the variable whose variation is to be explained.
• The relationship between x and y is a linear or straight-line relationship.
• Two parameters to estimate – the slope of the line, β1, and the y-intercept, β0 (where the line crosses the vertical axis).
• ε is the unexplained, random, or error component.



M3.2 Regression – Simple Relationship
(Example)
Income hrs/week Income hrs/week
8000 38 8000 35
6400 50 18000 37.5
2500 15 5400 37
3000 30 15000 35
6000 50 3500 30
5000 38 24000 45
8000 50 1000 4
4000 20 8000 37.5
11000 45 2100 25
25000 50 8000 46
4000 20 4000 30
8800 35 1000 200
5000 30 2000 200
7000 43 4800 30

Data Cleanup?



M3.2 Regression – Simple Relationship
(Example)
[Scatter plot: Summer Income as a Function of Hours Worked – Income (0–30,000) vs. Hours per Week (0–60)]



M3.2 Regression – Simple Relationship
(Example)

ŷ = −2461 + 297x

R2 = 0.311
Significance = 0.0031



M3.2 Regression – Simple Relationship
(Outliers)

Outliers:
• Rare, extreme values may distort the
outcome.
• Could be an error.
• Could be a very important observation.
• More than 3 standard deviations from the
mean.
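The 3-standard-deviation rule of thumb above can be sketched in Python (the hands-on sections later in this module use R; this is only an illustration, applied to the hrs/week column of the earlier summer-income table):

```python
from statistics import mean, stdev

def outliers_3sd(values):
    """Flag observations more than 3 standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > 3 * s]

# hrs/week column from the summer-income table on the earlier slide
hours = [38, 50, 15, 30, 50, 38, 50, 20, 45, 50, 20, 35, 30, 43,
         35, 37.5, 37, 35, 30, 45, 4, 37.5, 25, 46, 30, 200, 200, 30]
print(outliers_3sd(hours))   # the two impossible 200-hour weeks are flagged
```

Note a design caveat: extreme outliers inflate the standard deviation itself, so this simple rule can miss moderate outliers (here the 4-hour week survives); inspecting a box plot or scatter plot, as the example slides do, is still essential.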
M3.2 Regression – Simple Relationship
(Non Linear Relationship)
[Scatter plot: U-Shaped Relationship – Y vs. X; correlation = +0.12]
• Be wary of the scatter graph; it is not always perfect.
• Close to zero correlation, but there really is a more complex relationship.
M3.2 Regression – Types of Regression
Models

Regression models:
• 1 explanatory variable → Simple regression (Linear or Non-Linear)
• 2+ explanatory variables → Multiple regression (Linear or Non-Linear)



Linear Regression Model



M3.2 Regression – Linear Regression

• Relationship between variables is a linear function:

  Yi = β0 + β1Xi + εi

  where β0 is the population Y-intercept, β1 is the population slope, and εi is the random error.

• Yi is the dependent (response) variable (e.g., CD4+ count) and Xi is the independent (explanatory) variable (e.g., years since seroconversion).


Estimating Parameters:
Least Squares Method

M3.2 Regression – Linear Regression
(Scatter plot)
• 1. Plot of All (Xi, Yi) Pairs
• 2. Suggests How Well Model Will Fit

[Scatter plot of all (Xi, Yi) pairs: Y (0–60) vs. X (0–60)]
M3.2 Regression – Linear Regression
(Thinking Challenge)
How would you draw a line through the points? How do you determine which
line ‘fits best’?



M3.2 Regression – Linear Regression
(Least Squares)
• 1. ‘Best fit’ means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors!

  Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²

• 2. Least squares (LS) minimizes the sum of the squared differences (errors), SSE.
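A tiny numeric illustration of why the errors are squared: positive and negative residuals cancel in a plain sum, but not in a sum of squares (Python, with hypothetical residuals):

```python
residuals = [2.0, -2.0, 1.5, -1.5]   # hypothetical errors y_i - yhat_i
print(sum(residuals))                # 0.0 -- the signs offset
sse = sum(e ** 2 for e in residuals)
print(sse)                           # 12.5 -- squaring exposes the misfit
```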



M3.2 Regression – Linear Regression
(Coefficient Equations)

Prediction equation:

  ŷi = β̂0 + β̂1xi

Sample slope:

  β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Sample y-intercept:

  β̂0 = ȳ − β̂1x̄
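A minimal Python sketch of these two formulas (the deck's hands-on sections use R; the data here are hypothetical, purely to exercise the arithmetic):

```python
def least_squares(x, y):
    """Slope b1 = SSxy / SSxx and intercept b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([10, 20, 30, 40], [3, 7, 8, 12])  # hypothetical data
print(round(b0, 2), round(b1, 2))   # 0.5 0.28
```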
M3.2 Regression – Linear Regression
(Computation Table)

Interpretation of Coefficients

• 1. Slope (β̂1)
  – Estimated Y changes by β̂1 for each 1-unit increase in X
    • If β̂1 = 2, then Y is expected to increase by 2 for each 1-unit increase in X
• 2. Y-intercept (β̂0)
  – Average value of Y when X = 0
    • If β̂0 = 4, then average Y is expected to be 4 when X is 0



M3.2 Regression – Linear Regression (Example)
• What is the relationship between Mother’s Estriol level &
Birthweight using the following data?

Estriol Birthweight
(mg/24h) (g/1000)
1 1
2 1
3 2
4 2
5 4



M3.2 Regression – Linear Regression (Example)

β̂1 = [ΣXiYi − (ΣXi)(ΣYi)/n] / [ΣXi² − (ΣXi)²/n]
   = (37 − (15)(10)/5) / (55 − (15)²/5)
   = 7 / 10
   = 0.70

β̂0 = Ȳ − β̂1X̄ = 2 − 0.70(3) = −0.10
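The arithmetic on this slide can be checked directly; a Python sketch of the same raw-sum formula applied to the estriol data (illustration only; the deck's hands-on sections use R):

```python
x = [1, 2, 3, 4, 5]   # estriol (mg/24h)
y = [1, 1, 2, 2, 4]   # birthweight (g/1000)
n = len(x)

ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 37 - 15*10/5 = 7
ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 55 - 15**2/5 = 10
b1 = ss_xy / ss_xx                                               # slope
b0 = sum(y) / n - b1 * sum(x) / n                                # intercept
print(round(b1, 2), round(b0, 2))   # 0.7 -0.1
```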



M3.2 Regression – Linear Regression
(Example)
• Many factors affect the wages of workers: the industry they work
in, their type of job, their education and their experience, and
changes in general levels of wages. We will look at a sample of 59
married women who hold customer service jobs in Indiana banks.
The following table gives their weekly wages at a specific point in
time, as well as their length of service with their employer, in months.
The size of the place of work is recorded simply as “large” (100 or
more workers) or “small.” Because industry, job type, and the time
of measurement are the same for all 59 subjects, we expect to see a
clear relationship between wages and length of service.



M3.2 Regression – Linear Regression
(Example)

[Figures: scatter plot and fitted regression output for the wages vs. length-of-service example]


M3.2 Regression – Linear Regression
(Interpretation)
• R2 is called the coefficient of determination.

0 ≤ R² ≤ 1
• We may interpret R2 as the proportionate reduction of total variability in y
associated with the use of the independent variable x.
• The larger R² is, the more the total variation of y is reduced by including the
variable x in the model.
• If all the observations fall on the fitted regression line, SSE = 0 and R2 = 1.
• If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0.
• The closer R2 is to 1, the greater is said to be the degree of linear association
between x and y.
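These bullets can be verified numerically; a small Python sketch computing R² = 1 − SSE/SST, using the estriol data from the earlier example and its least-squares fitted values (illustration only):

```python
def r_squared(y, yhat):
    """R^2 = 1 - SSE/SST."""
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))   # residual sum of squares
    sst = sum((a - ybar) ** 2 for a in y)              # total sum of squares
    return 1 - sse / sst

y = [1, 1, 2, 2, 4]
yhat = [-0.1 + 0.7 * x for x in [1, 2, 3, 4, 5]]   # least-squares fit for these data
print(round(r_squared(y, yhat), 3))

print(r_squared(y, list(y)))   # perfect fit: SSE = 0, so R^2 = 1.0
```

The second call illustrates the boundary case from the slide: when every observation falls on the fitted line, SSE = 0 and R² = 1.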



M3.2 Regression – Linear Regression
(Interpretation)
11-4.1 Use of t-Tests
An important special case of the hypotheses of Equation 11-18 is

  H0: β1 = 0 vs. H1: β1 ≠ 0

These hypotheses relate to the significance of regression.

Failure to reject H0 is equivalent to concluding that there
is no linear relationship between x and Y.



M3.2 Regression – Linear Regression
(Interpretation)
11-4.2 Analysis of Variance Approach to Test
Significance of Regression

The test statistic is F0 = MSR/MSE, where MSR = SSR/1 and
MSE = SSE/(n − 2); the quantities MSR and MSE are called mean squares.

[Analysis of variance table]



M3.2 Regression – Linear Regression
(Adequacy of Model)
• Fitting a regression model requires several
assumptions.
1. Errors are uncorrelated random variables with mean zero;
2. Errors have constant variance; and
3. Errors are normally distributed.
• The analyst should always consider the validity of these
assumptions to be doubtful and conduct analyses to examine
the adequacy of the model.


M3.2 Regression – Linear Regression
(Adequacy of Model)
11-7.1 Residual Analysis

[Figure 11-9: Patterns for residual plots – (a) satisfactory, (b) funnel, (c) double bow, (d) nonlinear]



M3.2 Regression – Linear Regression
(Adequacy of Model)

[Figure 11-10: Normal probability plot of residuals]



Multiple Regression



M3.2 Regression – Multiple Regression
(Example)
• Typically, we want to use more than a single predictor
(independent variable) to make predictions
• Regression with more than one predictor is called “multiple
regression”

  yi = α + β1x1i + β2x2i + ⋯ + βpxpi + εi
• the β’s are coefficients for the independent variables in the true
or population equation and the x’s are the values of the
independent variables for the member of the population.



M3.2 Regression – Multiple Regression
(Example)

The data in the table on the following slide are:


Dependent Variable
y = BMI
Independent Variables
x1 = Age in years
x2 = FFNUM, a measure of fast food usage,
x3 = Exercise, an exercise intensity score
x4 = Beers per day



M3.2 Regression – Multiple Regression
(Example)
OBS AGE BMI FFNUM EXERCISE BEER
1 26 23.2 0 621 3
2 30 30.2 9 201 6
3 32 28.1 17 240 10
4 27 22.7 1 669 5
5 33 28.9 7 1,140 12
6 29 22.4 3 445 9
7 32 23.2 1 710 15
8 33 20.3 0 783 11
9 31 25.6 1 454 0
10 33 21.2 3 432 2
11 26 22.3 5 1,562 13
12 34 23.0 2 697 1
13 33 26.3 4 280 2
14 31 22.2 1 449 5
15 31 19.0 0 689 4
16 27 20.8 2 785 3
17 36 20.9 2 350 7
18 35 36.4 14 48 11
19 31 28.6 11 285 12
20 36 27.5 8 85 5
Total 626 492.8 91 10,925 136
Mean 31.3 24.6 4.6 546.3 6.8





M3.2 Regression – Multiple Regression
(Example)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.7932 and C(p) = 5.0000

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 4    273.74877    68.43719      14.38    <.0001
Error                15     71.37923     4.75862
Corrected Total      19    345.12800

(One df for each independent variable in the model.)

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      18.47774     6.45406      39.00436       8.20    0.0119   (b0)
age             0.08424     0.18931       0.94239       0.20    0.6627   (b1)
ffnum           0.42292     0.13671      45.53958       9.57    0.0074   (b2)
exercise       -0.00107     0.00170       1.87604       0.39    0.5395   (b3)
beer            0.32601     0.11518      38.12111       8.01    0.0127   (b4)


M3.2 Regression – Multiple Regression
(Example)
We have,
b0 = 18.478, b1 = 0.084, b2 = 0.422, b3 = −0.001, b4 = 0.326

So,

ŷ = 18.478 + 0.084x1 + 0.422x2 − 0.001x3 + 0.326x4
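As a sketch, the fitted equation can be evaluated for a hypothetical subject (Python, illustration only; the subject's values below are made up):

```python
b = [18.478, 0.084, 0.422, -0.001, 0.326]   # b0..b4 from the fitted model above

def predict_bmi(age, ffnum, exercise, beer):
    """yhat = b0 + b1*age + b2*ffnum + b3*exercise + b4*beer."""
    return b[0] + b[1] * age + b[2] * ffnum + b[3] * exercise + b[4] * beer

# hypothetical subject: age 30, 4 fast-food meals, exercise score 500, 5 beers/day
print(round(predict_bmi(30, 4, 500, 5), 2))   # 23.82
```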



M3.2 Regression – Multiple Regression
(Model Verification)
The first step is to test the global hypothesis:

  H0: β1 = β2 = β3 = β4 = 0
  vs. H1: at least one βi ≠ 0

The ANOVA highlighted in the green box at the top of the next slide tests
this hypothesis:

  F = 14.38 > F0.95(4, 15) = 3.06,

so the hypothesis is rejected. Thus, we have evidence that at least one of
the βi ≠ 0.



M3.2 Regression – Multiple Regression
(Model Verification)

The amount of variation in the dependent variable, BMI, explained
by its regression relationship with the four independent variables is

  R² = SS(Model)/SS(Total) = 273.75/345.13 = 0.79, or 79%
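Both R² and the global F statistic follow from the ANOVA table shown earlier; a quick Python check of the arithmetic:

```python
ss_model, ss_total = 273.74877, 345.12800   # from the ANOVA table
df_model, df_error = 4, 15

ss_error = ss_total - ss_model                        # 71.37923
r2 = ss_model / ss_total                              # coefficient of determination
f = (ss_model / df_model) / (ss_error / df_error)     # F = MSR / MSE
print(round(r2, 4), round(f, 2))   # 0.7932 14.38
```

The values reproduce the R-Square and F Value reported by PROC REG.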



M3.2 Regression – Multiple Regression
(Model Verification)

If the global hypothesis is rejected, it is then
appropriate to examine hypotheses for the
individual parameters, such as

  H0: β1 = 0 vs. H1: β1 ≠ 0.

The P-value for this test, P = 0.6627, is greater than α = 0.05,
so we fail to reject H0: β1 = 0.



M3.2 Regression – Multiple Regression
(Model Verification)

From the ANOVA, we have

b1 = 0.084, P = 0.66
b2 = 0.422, P = 0.01
b3 = - 0.001, P = 0.54
b4 = 0.326, P = 0.01

So b2 = 0.422 and b4 = 0.326 appear to represent terms that should be
explored further.



M3.2 Regression – Multiple Regression
(Approaches)

Backward elimination
Start with all independent variables, test the global hypothesis and, if it is
rejected, eliminate step by step those independent variables whose
coefficients β are not significantly different from 0.

Forward selection
Start with a “core” subset of essential variables and add others step by
step.



Backward Elimination



M3.2 Regression – Multiple Regression
(Backward Elimination)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.7932 and C(p) = 5.0000

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 4    273.74877    68.43719      14.38    <.0001
Error                15     71.37923     4.75862
Corrected Total      19    345.12800

(The Model row of the ANOVA tests the global hypothesis.)

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      18.47774     6.45406      39.00436       8.20    0.0119
age             0.08424     0.18931       0.94239       0.20    0.6627
ffnum           0.42292     0.13671      45.53958       9.57    0.0074
exercise       -0.00107     0.00170       1.87604       0.39    0.5395
beer            0.32601     0.11518      38.12111       8.01    0.0127



M3.2 Regression – Multiple Regression
(Backward Elimination)
Backward Elimination: Step 1

Variable age Removed: R-Square = 0.7904 and C(p) = 3.1980

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 3    272.80638    90.93546      20.12    <.0001
Error                16     72.32162     4.52010
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      21.28788     1.30004    1211.98539     268.13    <.0001
ffnum           0.42963     0.13243      47.57610      10.53    0.0051
exercise       -0.00140     0.00149       4.00750       0.89    0.3604
beer            0.32275     0.11203      37.51501       8.30    0.0109

Bounds on condition number: 1.7883, 14.025


M3.2 Regression – Multiple Regression
(Backward Elimination)
Backward Elimination: Step 2

Variable exercise Removed: R-Square = 0.7788 and C(p) = 2.0402

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 2    268.79888   134.39944      29.93    <.0001
Error                17     76.32912     4.48995
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      20.29360     0.75579    3237.09859     720.97    <.0001
ffnum           0.46380     0.12693      59.94878      13.35    0.0020
beer            0.33375     0.11105      40.55414       9.03    0.0080

Bounds on condition number: 1.654, 6.6161

All variables left in the model are significant at the 0.0500 level.

Summary of Backward Elimination

         Variable    Number     Partial      Model
Step     Removed    Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
   1     age              3      0.0027      0.7904    3.1980       0.20    0.6627
   2     exercise         2      0.0116      0.7788    2.0402       0.89    0.3604


Forward Stepwise Regression



M3.2 Regression – Multiple Regression
(Forward Addition)
The REG Procedure
Model: MODEL1
Dependent Variable: bmi

Stepwise Selection: Step 1

Variable ffnum Entered: R-Square = 0.6613 and C(p) = 8.5625

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 1    228.24473   228.24473      35.15    <.0001
Error                18    116.88327     6.49351
Corrected Total      19    345.12800

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      21.43827     0.78506    4842.33895     745.72    <.0001
ffnum           0.70368     0.11869     228.24473      35.15    <.0001

Stepwise Selection: Step 2

Variable beer Entered: R-Square = 0.7788 and C(p) = 2.0402

Analysis of Variance

                              Sum of        Mean
Source               DF      Squares      Square    F Value    Pr > F
Model                 2    268.79888   134.39944      29.93    <.0001
Error                17     76.32912     4.48995
Corrected Total      19    345.12800


M3.2 Regression – Multiple Regression
(Forward Addition)
Model: MODEL1
Dependent Variable: bmi

Stepwise Selection: Step 2

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      20.29360     0.75579    3237.09859     720.97    <.0001
ffnum           0.46380     0.12693      59.94878      13.35    0.0020
beer            0.33375     0.11105      40.55414       9.03    0.0080

Bounds on condition number: 1.654, 6.6161

All variables left in the model are significant at the 0.0500 level.

No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection

         Variable    Variable    Number     Partial      Model
Step     Entered     Removed    Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
   1     ffnum                      1       0.6613      0.6613    8.5625      35.15    <.0001
   2     beer                       2       0.1175      0.7788    2.0402       9.03    0.0080


Hands on – Using R (Simple Linear Regression)



M3.2 Regression R
(Simple Linear)
• Use the “cars” data for the Regression Analysis
• > head(cars)  # display the first 6 observations
    speed dist
  1     4    2
  2     4   10
  3     7    4
  4     7   22
  5     8   16
  6     9   10
• Before doing the regression analysis it's good practice to
understand the data
– Box Plot (for any outliers)
– Scatter Plot (for relationship)
– Density Plot (understand distribution, skewed etc.,)

• linearMod <- lm(dist ~ speed, data=cars)  # build linear regression model on full data
• print(linearMod)
• #> Call:
• #> lm(formula = dist ~ speed, data = cars)
• #>
• #> Coefficients:
• #> (Intercept) speed
• #> -17.579 3.932



M3.2 Regression R
(Simple Linear)
• Use the “alligator” data for the Regression Analysis
• First create a data frame to fit the simple linear regression model
• > alligator = data.frame( lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81,
3.83, 3.46, 3.76, 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78), lnWeight =
c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, 3.58, 3.64, 5.90, 4.43,
4.38, 4.42, 4.25) )
• Perform exploratory data analysis (EDA); xyplot() comes from the lattice package, so load it first
• > library(lattice)
> xyplot(lnWeight ~ lnLength, data = alligator, xlab = "Snout vent
length (inches) on log scale", ylab = "Weight (pounds) on log scale",
main = "Alligators in Central Florida" )
• > alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
• > summary(alli.mod1)



M3.2 Regression R
(Simple Linear)
• Problem: Apply the simple linear regression model for the data set faithful, and estimate next eruption duration if waiting time is 80
minutes.
• Apply the lm function to a formula that describes the variable eruptions by the variable waiting.
• > eruption.lm = lm(eruptions ~ waiting, data=faithful)

• Extract the parameters of the estimated regression equation with the coefficients function.
• > coeffs = coefficients(eruption.lm)
• > coeffs
(Intercept) waiting
-1.874016 0.075628

• Fit the eruption duration using the estimated regression equation.


• > waiting = 80 # the waiting time
> duration = coeffs[1] + coeffs[2]*waiting
> duration
(Intercept)
4.1762
• Answer: Based on the simple linear regression model, if the waiting time since the last eruption has been 80 minutes, we expect the
next one to last 4.1762 minutes.
• Alternative Solution: Wrap the waiting parameter value inside a new data frame named newdata.
• > newdata = data.frame(waiting=80) # wrap the parameter
• > predict(eruption.lm, newdata) # apply predict
1
4.1762



M3.2 Regression R
(Simple Linear)
• Extract the coefficient of determination from the r.squared attribute of its summary.
• > summary(eruption.lm)$r.squared
[1] 0.81146
• Answer: The coefficient of determination of simple linear regression model for data set faithful is 0.81146.

• Print out the F-statistics of the significance test with the summary function.
• > summary(eruption.lm)
• Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
Min 1Q Median 3Q Max
-1.2992 -0.3769 0.0351 0.3491 1.1933

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.87402 0.16014 -11.7 <2e-16 ***
waiting 0.07563 0.00222 34.1 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.497 on 270 degrees of freedom
Multiple R-squared: 0.811, Adjusted R-squared: 0.811
F-statistic: 1.16e+03 on 1 and 270 DF, p-value: <2e-16

• Answer: As the p-value is much less than 0.05, we reject the null hypothesis that the slope β1 = 0. Hence there is a significant relationship
between the variables in the linear regression model of the data set faithful.



M3.2 Regression R
(Simple Linear)
• Plot the residuals against the observed values of the variable waiting:
• > eruption.lm = lm(eruptions ~ waiting, data=faithful)
> eruption.res = resid(eruption.lm)
> plot(faithful$waiting, eruption.res,
+ ylab="Residuals", xlab="Waiting Time",
+ main="Old Faithful Eruptions")
> abline(0, 0)  # the horizon

• Normal probability plot of the residuals: compute the standardized residuals, create the plot with the qqnorm function, and add the qqline for further comparison:
• > eruption.stdres = rstandard(eruption.lm)
> qqnorm(eruption.stdres,
+ ylab="Standardized Residuals",
+ xlab="Normal Scores",
+ main="Old Faithful Eruptions")
> qqline(eruption.stdres)



Logistic Regression



M3.2 Regression – Logistic Regression
(Introduction)
• There are many important research topics for which the dependent variable is "limited."
• For example: voting, morbidity or mortality, and participation data are not continuous or
distributed normally.
• Binary logistic regression is a type of regression analysis where the dependent variable
is a dummy variable: coded 0 (did not vote) or 1 (did vote)
• Logistic regression is the type of regression we use for a response variable (Y) that
follows a binomial distribution
• Linear regression is the type of regression we use for a continuous, normally distributed
response variable (Y)

• Why can't we fit an ordinary linear regression, Y = α + βX + ε, where Y ∈ {0, 1}?
  – Response Y is not normally distributed
  – The error terms are heteroskedastic
  – ε is not normally distributed because Y takes on only two values
  – The predicted probabilities can be greater than 1 or less than 0



M3.2 Regression – Logistic Regression
(Introduction)
The "logit" model solves these problems:

  ln[p/(1 − p)] = α + βX + ε

• p is the probability that the event Y occurs, p(Y = 1)
• p/(1 − p) is the odds
• ln[p/(1 − p)] is the log odds, or "logit"
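The odds and logit transforms are easy to sanity-check in Python (illustrative probability values; the rest of this module's hands-on work is in R):

```python
from math import log

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return log(p / (1 - p))

print(logit(0.5))              # odds 1:1 -> logit 0.0
print(round(logit(0.75), 4))   # odds 3:1 -> ln(3), about 1.0986
print(round(logit(0.25), 4))   # odds 1:3 -> symmetric, about -1.0986
```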



M3.2 Regression – Logistic Regression
(Introduction)
• The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
• The estimated probability is:

  p = 1/[1 + exp(−α − βX)]

• If α + βX = 0, then p = 0.50
• As α + βX gets really big, p approaches 1
• As α + βX gets really small, p approaches 0
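These three limiting cases can be checked directly; a small Python sketch of the inverse-logit (logistic) function with hypothetical coefficients:

```python
from math import exp

def p_hat(alpha, beta, x):
    """Estimated probability p = 1 / (1 + exp(-alpha - beta*x))."""
    return 1 / (1 + exp(-alpha - beta * x))

alpha, beta = 0.0, 1.0   # hypothetical coefficients, for illustration
print(p_hat(alpha, beta, 0))    # alpha + beta*x = 0  -> p = 0.5
print(p_hat(alpha, beta, 10))   # large linear predictor -> p near 1
print(p_hat(alpha, beta, -10))  # small linear predictor -> p near 0
```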

