PREPARED FOR :
PREPARED BY :
GS48233
MULTIPLE REGRESSION
HATCO management has long been interested in more accurately predicting the level
of business obtained from its customers in the attempt to provide a better basis for
production control and marketing efforts.
In doing multiple regression, the objective the researchers focus on is to predict the
level of product usage obtained from customers. To this end, researchers at HATCO
proposed that a multiple regression analysis be carried out, with product usage as the
dependent variable and the seven perception measures as independent variables.
The relationship among the seven independent variables and product usage was
assumed to be statistical, not functional, because it involved perceptions of
performance and may have had levels of measurement error.
a) Procedure for Standard Multiple Regression
1) From the menu at the top of the screen click on: Analyze, then click on Regression,
then on Linear.
2) Click on your dependent variable (e.g. Product Usage Level) and move it into the
Dependent box.
3) Click on your independent variables (Perception of HATCO - Delivery Speed, Price
Level, Price Flexibility, Manufacturers Image, Service, Salesforce Image and Product
Quality) and move them into the Independent box.
4) For Method, make sure Enter is selected (this will give you standard multiple
regression).
5) Click on the Statistics button.
a. Tick the boxes marked Estimates, Model fit and Descriptives.
b. Click on Continue.
6) Click on OK.
Descriptive Statistics
Correlation
The Descriptives option also gives you a correlation matrix, showing you the Pearson r
values between the variables (in the top part of this table).
Correlation between sets of data is a measure of how well they are related. The most common
measure of correlation in statistics is the Pearson correlation. The full name is the Pearson
Product Moment Correlation, or PPMC.
Pearson's correlation coefficient is a test statistic that measures the statistical
relationship, or association, between two continuous variables. It is widely regarded as the
best method of measuring the association between variables of interest because it is based
on the method of covariance. It gives information about the magnitude of the association, or
correlation, as well as the direction of the relationship.
The Pearson's r for the correlation between Service (IV) and Usage Level (DV) in our
example is 0.701.
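Pearson's r can be computed directly from two columns of ratings. The sketch below uses plain Python and small made-up data, not the actual HATCO file:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Illustrative (hypothetical) ratings, not the HATCO data:
service = [4.1, 5.0, 3.2, 6.0, 4.8]
usage = [32.0, 43.0, 21.0, 55.0, 40.0]
r = pearson_r(service, usage)  # always falls between -1 and +1
```

A value of +1 means a perfect positive linear relationship, -1 a perfect negative one, and 0 no linear relationship, which is exactly how the SPSS correlation matrix entries are read below.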
This means that there is a strong relationship between the two variables: changes in one
variable are strongly correlated with changes in the second variable. In our example,
Pearson's r is 0.701. This number is very close to 1, so we can conclude that there is a
strong relationship between Usage Level and Service. This means that the perception of
HATCO's service is strongly associated with the usage level. However, we cannot draw
any other conclusions about this relationship from this number alone.
An r close to 0, by contrast, means that there is a weak relationship between two
variables: changes in one variable are barely correlated with changes in the second
variable. If Pearson's r were 0.01, we would conclude that our variables were not
strongly correlated. In this example, Price Level is not correlated with changes in
Usage Level.
A positive correlation means that as one variable increases in value, the second variable
also increases in value; similarly, as one variable decreases in value, the second variable
also decreases in value. In our example, the Pearson's r value of 0.701 was positive. We
know this value is positive because SPSS did not put a negative sign in front of it, so
positive is the default. Since our example's Pearson's r is positive, we can conclude that
when Service (our first variable) increases, Usage Level (our second variable) also
increases.
A negative correlation means that as one variable increases in value, the second variable
decreases in value. In our example, the Pearson's r between Product Quality and Usage
Level is -0.192, so we could conclude that when Product Quality (our first variable)
increases, Usage Level (our second variable) tends to decrease.
Multicollinearity. The correlations between the variables in your model are provided
in the table labelled Correlations. Check that your independent variables show at
least some relationship with your dependent variable (above .3 preferably). In this
case 3 of the 7 scales (Delivery Speed, Price Flexibility and Service) correlate
substantially with Usage Level (.676, .559 and .701 respectively). Meanwhile, 4 of the
7 scales (Price Level, Manufacturer Image, Salesforce Image and Product Quality)
show relationships with the dependent variable below .3 (.082, .224, .256 and -.192
respectively).
Also check that the correlation between each of your independent variables is not too
high. Tabachnick and Fidell (2001, p. 84) suggest that you think carefully before
including two variables with a bivariate correlation of, say, .7 or more in the same
analysis. If you find yourself in this situation you may need to consider omitting
one of the variables or forming a composite variable from the scores of the two
highly correlated variables. In the example presented here the correlation between
the Salesforce Image variable and the Manufacturer Image variable is .788, which is
more than .7; strictly this suggests omitting one of the two or combining them,
although for illustration all variables are retained in the analyses that follow.
Model Summary
ANOVAa

Model 1          Sum of Squares    df    Mean Square    F         Sig.
  Regression     6198.677           7    885.525        45.252    .000b
  Residual       1800.323          92     19.569
  Total          7999.000          99

a. Dependent Variable: Usage Level
b. Predictors: (Constant), Product Quality, Service, Salesforce Image, Price Flexibility, Price Level, Manufacturer Image, Delivery Speed
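The F statistic in the ANOVA table is just the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. A quick check using the table's own numbers:

```python
# Recomputing the F statistic from the ANOVA table's sums of squares
ss_reg, df_reg = 6198.677, 7    # Regression row
ss_res, df_res = 1800.323, 92   # Residual row

ms_reg = ss_reg / df_reg        # mean square = SS / df  -> 885.525
ms_res = ss_res / df_res        # -> 19.569
f_stat = ms_reg / ms_res        # -> about 45.25, matching the SPSS output
```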
Coefficientsa
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficients)

Model 1               B          Std. Error    Beta     t         Sig.
  (Constant)          -10.187    4.977                  -2.047    .044
  Delivery Speed        -.058    2.013         -.008     -.029    .977
  Price Level           -.697    2.090         -.093     -.333    .740
  Price Flexibility     3.368     .411          .520     8.191    .000
  Manufacturer Image    -.042     .667         -.005     -.063    .950
  Service               8.369    3.918          .699     2.136    .035
  Salesforce Image      1.281     .947          .110     1.352    .180
  Product Quality        .567     .355          .100     1.595    .114

a. Dependent Variable: Usage Level
Variables Entered/Removeda

Model   Variables Entered    Variables Removed    Method
1       Service              .                    Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2       Price Flexibility    .                    Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
3       Salesforce Image     .                    Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).

a. Dependent Variable: Usage Level
Model Summary
ANOVAa

Model            Sum of Squares    df    Mean Square    F          Sig.
1  Total         7999.000          99
2  Regression    6036.513           2    3018.256       149.184    .000c
   Residual      1962.487          97      20.232
   Total         7999.000          99
3  Regression    6145.700           3    2048.567       106.115    .000d
   Total         7999.000          99
Coefficientsa
Variables Entered/Removeda

Model   Variables Entered                          Variables Removed     Method
1       Product Quality, Service, Salesforce
        Image, Price Flexibility, Price Level,
        Manufacturer Image, Delivery Speedb        .                     Enter
2       .                                          Delivery Speed        Backward (criterion: Probability of F-to-remove >= .100).
3       .                                          Manufacturer Image    Backward (criterion: Probability of F-to-remove >= .100).
4       .                                          Price Level           Backward (criterion: Probability of F-to-remove >= .100).
5       .                                          Product Quality       Backward (criterion: Probability of F-to-remove >= .100).
Model Summary
ANOVAa

Model            Sum of Squares    df    Mean Square    F          Sig.
1  Total         7999.000          99
2  Regression    6198.661           6    1033.110        53.367    .000c
   Residual      1800.339          93      19.358
   Total         7999.000          99
3  Regression    6198.591           5    1239.718        64.726    .000d
   Residual      1800.409          94      19.153
   Total         7999.000          99
4  Regression    6176.787           4    1544.197        80.506    .000e
   Residual      1822.213          95      19.181
   Total         7999.000          99
5  Regression    6145.700           3    2048.567       106.115    .000f
   Total         7999.000          99
Coefficientsa
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficients)

Model                 B         Std. Error    Beta    t         Sig.
   Product Quality    .403       .317         .071    1.273     .206
5  (Constant)        -6.520     3.247                 -2.008    .047
Excluded Variablesa
b) Interpretation of Output from Standard Multiple Regression
As with the output from most of the SPSS procedures, there are lots of rather confusing
numbers generated as output from regression.
Coefficientsa
(columns: unstandardized B and Std. Error; standardized Beta; t; Sig.; 95% confidence interval for B; zero-order, partial and part correlations; collinearity Tolerance and VIF)

                       B       Std. Error   Beta      t        Sig.    CI Lower   CI Upper   Zero-order   Partial   Part    Tolerance   VIF
Delivery Speed         .240    .180         .370     1.333     .186     -.118       .598       .651         .138     .062    .028      35.747
Price Level            .176    .187         .246      .942     .349     -.195       .547       .028         .098     .044    .032      31.597
Price Flexibility      .290    .037         .470     7.882     .000      .217       .363       .525         .635     .366    .608       1.645
Manufacturer Image     .429    .060         .567     7.183     .000      .310       .547       .476         .599     .334    .347       2.879
Service                .132    .351         .116      .376     .708     -.565       .828       .631         .039     .017    .023      43.834
Salesforce Image      -.196    .085        -.177    -2.315     .023     -.364      -.028       .341        -.235    -.108    .371       2.697
Product Quality       -.046    .032        -.085    -1.446     .152     -.109       .017      -.283        -.149    -.067    .623       1.606
The Tolerance value for each variable is calculated using the formula 1 - R squared,
where R squared comes from regressing that independent variable on all the other
independent variables. If this value is very small (less than .10), it indicates that the
multiple correlation with the other variables is high, suggesting the possibility of
multi-collinearity. The other value given is the VIF (Variance Inflation Factor), which
is just the inverse of the Tolerance value (1 divided by Tolerance). VIF values above
10 would be a concern here, indicating multi-collinearity.
I have quoted commonly used cut-off points for determining the presence of multi-
collinearity (tolerance value of less than .10, or a VIF value of above 10). These
values, however, still allow for quite high correlations between independent variables
(above .9), so you should take them only as a warning sign, and check the correlation
matrix.
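The two diagnostics are simple transformations of one another, so either can be recomputed from the other. A minimal sketch, using the Tolerance value SPSS reports for Delivery Speed:

```python
def tolerance_and_vif(r_squared):
    # Tolerance = 1 - R^2, where R^2 comes from regressing this predictor
    # on all the other predictors; VIF is simply the reciprocal of Tolerance.
    tol = 1 - r_squared
    return tol, 1 / tol

# Delivery Speed has Tolerance .028 in the table, so its R^2 on the other
# predictors is about .972:
tol, vif = tolerance_and_vif(0.972)   # tol = .028, vif = about 35.7
```

The VIF of roughly 35.7 agrees (up to rounding of the Tolerance value) with the 35.747 shown in the Coefficients table.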
In this example the Tolerance values for the Delivery Speed (.028), Price Level
(.032) and Service (.023) variables are less than .10; therefore, these variables
violate the multi-collinearity assumption. Meanwhile, the Tolerance values for the
Price Flexibility (.608), Manufacturer Image (.347), Salesforce Image (.371) and
Product Quality (.623) variables are not less than .10; therefore, these variables
do not violate the multi-collinearity assumption.
This is also supported by the VIF values: the Delivery Speed (35.747), Price Level
(31.597) and Service (43.834) variables show results above 10. Meanwhile, the
Price Flexibility (1.645), Manufacturer Image (2.879), Salesforce Image (2.697)
and Product Quality (1.606) variables show good results, below the cut-off of 10.
If you exceed these values in your own results, you should seriously consider
removing one of the highly inter-correlated independent variables from the model.
Therefore, you should consider removing the Delivery Speed, Price Level and
Service variables from the model.
One of the ways these assumptions can be checked is by inspecting the residuals
scatterplot and the Normal Probability Plot of the regression standardised residuals
that were requested as part of the analysis. These are presented at the end of the output. In
the Normal Probability Plot you are hoping that your points will lie in a reasonably straight
diagonal line from bottom left to top right. This would suggest no major deviations from
normality. In the Scatterplot of the standardised residuals (the second plot displayed) you are
hoping that the residuals will be roughly rectangularly distributed, with most of the scores
concentrated in the centre (along the 0 point). What you don't want to see is a clear or
systematic pattern to your residuals (e.g. curvilinear, or higher on one side than the other).
The presence of outliers can also be detected from the scatterplot. Tabachnick and Fidell
(2001) define outliers as cases that have a standardised residual (as displayed in the
scatterplot) of more than 3.3 or less than -3.3. With large samples, it is not uncommon to find
a number of outlying residuals. If you find only a few, it may not be necessary to take any
action. The resulting scatterplot is shown below:
The other information in the output concerning unusual cases is in the table titled
Casewise Diagnostics. This presents information about cases that have standardised
residual values above 3.0 or below -3.0. In a normally distributed sample we would
expect only about 1 per cent of cases to fall outside this range.
To check whether this strange case is having any undue influence on the results for
our model as a whole, we can check the value for Cook's Distance given towards the
bottom of the Residuals Statistics table. According to Tabachnick and Fidell (2001,
p. 69), cases with values larger than 1 are a potential problem. In our example the
maximum value for Cook's Distance is .100, suggesting no major problems. In your
own data, if you obtain a maximum value above 1 you will need to go back to your
data file and sort cases by the new variable that SPSS created at the end of your file.
Check each of the cases with values above 1; you may need to consider removing the
offending case or cases.
Residuals Statisticsa
Minimum Maximum Mean Std. Deviation N
Predicted Value 3.129 6.495 4.771 .7658 100
Std. Predicted Value -2.145 2.252 .000 1.000 100
Standard Error of Predicted Value    .070    .240    .108    .028    100
Adjusted Predicted Value 3.097 6.446 4.771 .7668 100
Residual -.9393 .7193 .0000 .3815 100
Std. Residual -2.374 1.818 .000 .964 100
Stud. Residual -2.519 1.884 .000 .999 100
Deleted Residual -1.0577 .7726 .0003 .4103 100
Stud. Deleted Residual -2.596 1.911 -.003 1.011 100
Mahal. Distance 2.109 35.390 6.930 5.043 100
Cook's Distance .000 .100 .009 .015 100
Centered Leverage Value .021 .357 .070 .051 100
a. Dependent Variable: Satisfaction Level
Look in the Model Summary box and check the value given under the heading R
Square. This tells you how much of the variance in the dependent variable
(Satisfaction Level) is explained by the model (which includes the variables of
Perception of HATCO - Delivery Speed, Price Level, Price Flexibility, Manufacturer
Image, Service, Salesforce Image and Product Quality).
Model Summaryb
The Adjusted R square statistic corrects this value to provide a better estimate of the
true population value. If you have a small sample you may wish to consider reporting
this value, rather than the normal R Square value. To assess the statistical significance
of the result it is necessary to look in the table labelled ANOVA. This tests the null
hypothesis that multiple R in the population equals 0. The model in this example
reaches statistical significance (Sig = .000, this really means p<.0005).
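Both values can be recomputed from the ANOVA sums of squares. A sketch using the standard-regression ANOVA table from earlier (SS regression 6198.677, SS total 7999.000, n = 100, seven predictors):

```python
def r_square(ss_regression, ss_total):
    # Proportion of variance in the DV explained by the model
    return ss_regression / ss_total

def adjusted_r_square(r2, n, k):
    # n = sample size, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = r_square(6198.677, 7999.000)     # about .775
adj = adjusted_r_square(r2, 100, 7)   # about .758, always a little lower
```

The adjustment penalises models with many predictors relative to the sample size, which is why it is the better value to report with small samples.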
ANOVAa
The next thing we want to know is which of the variables included in the model contributed
to the prediction of the dependent variable. We find this information in the output box
labelled Coefficients. Look in the column labelled Beta under Standardised Coefficients.
To compare the different variables it is important that you look at the standardised
coefficients, not the unstandardised ones. Standardised means that these values for each of
the different variables have been converted to the same scale so that you can compare them.
If you were interested in constructing a regression equation, you would use the
unstandardized coefficient values listed as B.
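Using the B values from the first (standard) regression on Usage Level, the prediction equation can be written out directly. The ratings passed in below are hypothetical; treating predictors that are left out as 0 is only to keep the illustration short (in practice all seven ratings would be supplied):

```python
# Unstandardized B weights from the Coefficients table (DV = Usage Level)
b = {
    "constant": -10.187,
    "delivery_speed": -0.058,
    "price_level": -0.697,
    "price_flexibility": 3.368,
    "manufacturer_image": -0.042,
    "service": 8.369,
    "salesforce_image": 1.281,
    "product_quality": 0.567,
}

def predict_usage(perceptions):
    """perceptions: dict mapping predictor names to a customer's ratings."""
    return b["constant"] + sum(b[name] * value
                               for name, value in perceptions.items())

# Hypothetical customer rating two of the predictors:
estimate = predict_usage({"service": 4.0, "price_flexibility": 5.0})
```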
Coefficientsa
(columns: unstandardized B and Std. Error; standardized Beta; t; Sig.; 95% confidence interval for B; zero-order, partial and part correlations; collinearity Tolerance and VIF)

                       B       Std. Error   Beta      t        Sig.    CI Lower   CI Upper   Zero-order   Partial   Part    Tolerance   VIF
Delivery Speed         .240    .180         .370     1.333     .186     -.118       .598       .651         .138     .062    .028      35.747
Price Level            .176    .187         .246      .942     .349     -.195       .547       .028         .098     .044    .032      31.597
Price Flexibility      .290    .037         .470     7.882     .000      .217       .363       .525         .635     .366    .608       1.645
Manufacturer Image     .429    .060         .567     7.183     .000      .310       .547       .476         .599     .334    .347       2.879
Service                .132    .351         .116      .376     .708     -.565       .828       .631         .039     .017    .023      43.834
Salesforce Image      -.196    .085        -.177    -2.315     .023     -.364      -.028       .341        -.235    -.108    .371       2.697
Product Quality       -.046    .032        -.085    -1.446     .152     -.109       .017      -.283        -.149    -.067    .623       1.606
The Beta value for the Product Quality variable was comparatively low (-.085),
indicating that it made less of a contribution. For each of these variables, check the
value in the column marked Sig. This tells you whether the variable is making a
statistically significant unique contribution to the equation. This is very dependent
on which variables are included in the equation, and how much overlap there is
among the independent variables. If the Sig. value is less than .05 (.01, .001, etc.),
the variable is making a significant unique contribution to the prediction of the
dependent variable. If it is greater than .05, you can conclude that the variable is
not making a significant unique contribution to the prediction of your dependent
variable. This may be due to overlap with other independent variables in the model.
In this case, the Price Flexibility (.000), Manufacturer Image (.000) and Salesforce
Image (.023) variables made a unique, and statistically significant, contribution
to the prediction of Satisfaction Level scores. Meanwhile, Delivery Speed (.186),
Price Level (.349), Service (.708) and Product Quality (.152) did not make a
significant unique contribution to the prediction of the dependent variable. This
may be due to overlap with other independent variables in the model.
The other potentially useful piece of information in the Coefficients table is the Part
correlation coefficients. Just to confuse matters, you will also see these coefficients
referred to as semi-partial correlation coefficients (see Tabachnick and Fidell, 2001, p.
140). If you square this value (whatever it is called) you get an indication of the
contribution of that variable to the total R squared. In other words, it tells you how
much of the total variance in the dependent variable is uniquely explained by that
variable and how much R squared would drop if it wasn't included in your model. In
this example the Product Quality scale has a part correlation coefficient of -.067.
If we square this (multiply it by itself) we get .0045, indicating that Product Quality
uniquely explains only about 0.4 per cent of the variance in Satisfaction Level scores.
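The arithmetic is a one-liner, and worth doing explicitly because squaring small correlations gives much smaller variance shares than intuition suggests:

```python
# Squaring a part (semi-partial) correlation gives the share of total variance
# in the dependent variable uniquely explained by that predictor.
part_product_quality = -0.067          # Part column of the Coefficients table
unique_variance = part_product_quality ** 2   # about .0045, i.e. 0.4 per cent
```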
1) Hierarchical multiple regression
In hierarchical regression (also called sequential) the independent variables are entered into
the equation in the order specified by the researcher based on theoretical grounds. Variables
or sets of variables are entered in steps (or blocks), with each independent variable being
assessed in terms of what it adds to the prediction of the dependent variable, after the
previous variables have been controlled for. For example, if you wanted to know how well
optimism predicts life satisfaction, after the effect of age is controlled for, you would enter
age in Block 1 and then Optimism in Block 2. Once all sets of variables are entered, the
overall model is assessed in terms of its ability to predict the dependent measure. The relative
contribution of each block of variables is also assessed.
1) From the menu at the top of the screen click on: Analyze, and then click on
Regression, then on Linear.
2) Choose your continuous dependent variable (e.g. Satisfaction Level) and move it into
the Dependent box.
3) Move the variables you wish to control for into the Independent box (e.g. Usage
Level). This will be the first block of variables to be entered in the analysis (Block 1
of 1).
4) Click on the button marked Next. This will give you a second independent variables
box to enter your second block of variables into (you should see Block 2 of 2).
5) Choose your next block of independent variables (e.g. Perception of HATCO).
6) In the Method box make sure that this is set to the default (Enter). This will give you
standard multiple regressions for each block of variables entered.
7) Click on the Statistics button. Tick the boxes marked Estimates, Model fit, R squared
change, Descriptives, Part and partial correlations and Collinearity diagnostics. Click
on Continue.
8) Click on the Options button. In the Missing Values section click on Exclude cases
pairwise.
9) Click on the Save button. Click on Mahalanobis and Cook's. Click on Continue and
then OK.
Model Summaryc

Model   R       R Square   Adjusted    Std. Error of    R Square    F Change    df1    df2    Sig. F
                           R Square    the Estimate     Change                                Change
1       .711a   .505       .500        .6049            .505        100.016     1      98     .000
2       .895b   .801       .784        .3979            .296         19.363     7      91     .000

a. Predictors: (Constant), Usage Level
b. Predictors: (Constant), Usage Level, Price Level, Salesforce Image, Product Quality, Price Flexibility, Manufacturer Image, Delivery Speed, Service
c. Dependent Variable: Satisfaction Level
The output generated from this analysis is similar to the previous output, but with some extra
pieces of information. In the Model Summary box there are two models listed. Model 1
refers to the first block of variables that were entered (Usage Level), while Model 2 includes
all the variables that were entered in both blocks (Perception of HATCO : X1-X7 Variable).
Check the R Square values in the first Model Summary box. After the variables in Block 1
(Usage Level) have been entered, the overall model explains 50.5 per cent of the variance
(.505 × 100). After the Block 2 variables (Perception of HATCO) have also been included, the
model as a whole explains 80.1 per cent (.801 × 100). It is important to note that this second
R square value includes all the variables from both blocks, not just those included in the
second step. To find out how much of this overall variance is explained by our variables of
interest (Perception of HATCO) after the effects of Usage Level are removed, you need to
look in the column labelled R Square change.
In the output presented above you will see, on the line marked Model 2, that the R square
change value is .296. This means that the X1-X7 variables explain an additional 29.6 per
cent (.296 × 100) of the variance in Satisfaction Level, even when the effects of Usage
Level are statistically controlled for. This is a statistically significant contribution, as indicated
by the Sig. F change value for this line (.000). The ANOVA table indicates that the model as a
whole (which includes both blocks of variables) is significant [F (8, 91) = 45.84, p < .0005].
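The F change statistic reported in the Model Summary can be reproduced from the two R square values. A sketch using the numbers above (exact agreement depends on unrounded R squares, so the result is approximate):

```python
def f_change(r2_full, r2_reduced, n, k_full, k_added):
    """F statistic for the R-square change when k_added predictors join a model
    that ends up with k_full predictors in total (sample size n)."""
    df1 = k_added
    df2 = n - k_full - 1
    return ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)

# Block 1 (Usage Level): R^2 = .505; Blocks 1+2 (8 predictors): R^2 = .801
f_chg = f_change(0.801, 0.505, 100, 8, 7)   # about 19.3; SPSS reports 19.363
                                            # from the unrounded R-square values
```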
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 36.602 1 36.602 100.016 .000b
Residual 35.864 98 .366
Total 72.466 99
2 Regression 58.060 8 7.257 45.843 .000c
Residual 14.406 91 .158
Total 72.466 99
a. Dependent Variable: Satisfaction Level
b. Predictors: (Constant), Usage Level
c. Predictors: (Constant), Usage Level, Price Level, Salesforce Image, Product Quality, Price
Flexibility, Manufacturer Image, Delivery Speed, Service
2) Stepwise multiple regression
In stepwise regression the researcher provides SPSS with a list of independent variables and
then allows the program to select which variables it will enter, and in which order they go
into the equation, based on a set of statistical criteria. There are three different versions of this
approach: forward selection, backward deletion and stepwise regression. There are a number
of problems with these approaches, and some controversy in the literature concerning their
use (and abuse). Before using these approaches I would strongly recommend that you read up
on the issues involved (see p. 138 in Tabachnick & Fidell, 2001). It is important that you
understand what is involved, how to choose the appropriate variables and how to interpret the
output that you receive.