represent different groups of people, not different quantities, are incorporated into regression analyses,
allowing comparison of means of the groups.
We'll discover that IVs with only 2 values can be treated as if they are continuous IVs in any regression.
But IVs with 3 or more values must be treated specially. Once that's done, they also can be included in
regression analyses.
(Partial listing of the example data: case number, Training Program code (TP), and performance (PERF), in two columns.)

Case TP PERF    Case TP PERF
 25   1   37     38   2   56
 26   2   53     39   2   61
 27   2   62     40   2   62
 28   2   56     41   2   72
 29   2   61     42   2   46
 30   2   63     43   2   64
 31   2   34     44   2   60
 32   2   56     45   2   58
 33   2   54     46   2   73
 34   2   60     47   2   57
 35   2   59     48   2   53
 36   2   67     49   2   43
 37   2   42     50   2   61
When one of the groups has whatever the other has plus something else, my practice is to give it the larger of
the two values, often 0 for the group with less and 1 for the group with more.
When one is a control and the other is an experimental group, my practice is to use 0 for the control and 1 for
the experimental.
When an IV is a dichotomy, the scatterplot takes on an unusual appearance. It will be two columns of points,
one over one of the values of the IV and the other over the other value. It can be interpreted in the way all
scatterplots are interpreted, although if the values of the IV are arbitrary, the sign of the relationship may not be
a meaningful characteristic. For example, in the following scatterplot, it would not make any sense to say that
performance was positively related to training program. It would make sense, however, to say that performance
was higher in the Lecture+CAI program than in the Lecture-only program.
In the graph of the example data, the best fitting straight line has been drawn through the scatterplot. When the
independent variable is a dichotomy, the line will always go through the mean value of the dependent variable
at each of the two independent variable values.
We'll notice that the regression coefficient, the B value, for Training Program is equal to the difference between
the means of performance in the two programs. This will always be the case if the values used to code the two
groups differ by one (1 vs. 2 in this example).
[Scatterplot of PERF (vertical axis, 30 to 80) against TP (L Only vs. L+CAI), with the best-fitting straight line passing through the mean PERF for Method 1 and the mean PERF for Method 2.]
ANOVA(b)

Model           Sum of Squares    df   Mean Square   F       Sig.
1  Regression   729.620            1   729.620       7.795   .007(a)
   Residual     4492.880          48   93.602
   Total        5222.500          49

a. Predictors: (Constant), TP
b. Dependent Variable: PERF

As was the case with simple regression with a continuous predictor, the information in the ANOVA summary
table is redundant with the information in the Coefficients box below.
Coefficients(a)

                 Unstandardized     Standardized
                 Coefficients       Coefficients
Model            B        Std. Error    Beta    t       Sig.
1  (Constant)    42.040   4.327                 9.716   .000
   TP            7.640    2.736         .374    2.792   .007

a. Dependent Variable: PERF
Interpretation of (Constant): This is the expected value of the dependent variable when the independent
variable = 0. If one of the groups had been coded as 0, then the y-intercept would have been the expected value
of Y in that group. In this example, neither group is coded 0, so the value of the y-intercept has no special
meaning.
B = (Mean of Group 2 - Mean of Group 1) / (X2 - X1), that is, the difference in group means divided by the
difference in X-values for the two groups.
If the X-values for the groups differ by 1, as they do here, then B = Difference in group means.
The sign of the B coefficient associated with a dichotomous variable depends on how the groups were labeled.
In this case, the L Only group was labeled 1 and the L+CAI group was labeled 2.
If the sign of the B coefficient is positive, this means that the group with the larger IV value had a larger
mean.
If the sign of the B coefficient is negative, this means that the group with the larger IV value had a
SMALLER mean.
The fact that B is positive means that the L+CAI group mean (coded 2) was larger than the L group mean
(coded 1). If the labeling had been reversed, with L+CAI coded as 1 and L-only coded as 2, the sign of the b
coefficient would have been negative.
The t-value
The t values test the hypothesis that each coefficient equals 0. In the case of the Constant, we don't care.
In the case of the B coefficient, the t value tells us whether the B coefficient, and equivalently, the
difference in means, is significantly different from 0. The p-value of .007 suggests that the B value is
significantly different from 0.
The bottom line
This means that regression of the dependent variable onto a dichotomous independent variable is a comparison
of the means of the two groups.
You may be thinking that another way to compare the performance in the two groups would be to perform an
independent groups t-test. This might then lead you to ask whether you'd get a result different from the
regression analysis.
Note that the t-value is 2.792, the same as the t-value from the regression analysis. This indicates a very
important relationship between the independent groups t-test and simple regression analysis:
When the independent variable is a dichotomy, the simple regression of Y onto the dichotomy gives the
same test of difference in group means as the equal variances assumed independent groups t-test.
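This equivalence can be checked numerically. The following is a minimal Python/NumPy sketch (not part of the SPSS-based lecture; the ten data values are made up for illustration, not taken from the example above):

```python
import numpy as np

# Illustrative data: two training groups coded 1 and 2, with a
# performance score for each of 10 people (made-up values).
x = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype=float)
y = np.array([52, 47, 60, 55, 49, 61, 58, 66, 57, 63], dtype=float)

# Simple regression: slope B, its standard error, and the t statistic
n = len(x)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)           # slope
a = y.mean() - b * x.mean()                                   # intercept
resid = y - (a + b * x)
sxx = (n - 1) * np.var(x, ddof=1)                             # sum of squares of x
se_b = np.sqrt((resid @ resid) / (n - 2) / sxx)
t_regression = b / se_b

# Equal-variances independent-groups t-test, computed by hand
y1, y2 = y[x == 1], y[x == 2]
sp2 = (((len(y1) - 1) * y1.var(ddof=1) + (len(y2) - 1) * y2.var(ddof=1))
       / (len(y1) + len(y2) - 2))                             # pooled variance
t_ttest = (y2.mean() - y1.mean()) / np.sqrt(sp2 * (1 / len(y1) + 1 / len(y2)))

# Because the codes differ by 1, B equals the difference in group means,
# and the regression t equals the equal-variances t-test t.
print(b, y2.mean() - y1.mean(), t_regression, t_ttest)
```

Any two distinct group codes would give the same t; only the value of B changes with the coding.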
As we'll see when we get to multiple regression, when independent variables represent several groups, the
regression of Y onto those independent variables gives the same test of differences in group means as does the
analysis of variance. That is, every test that can be conducted using analysis of variance can be conducted
using multiple regression analysis.
Yes, it is. No self-respecting computer program would use the ANOVA formulae taught in many (but fewer
each year) older statistical textbooks. All well-written computer programs convert the problem to a regression
analysis and conduct the analysis as if it were a regression, using the techniques to be shown in the following.
But statistics is littered with dinosaurs. Among many analysts, regression analysis itself has been replaced by
structural equation modeling, a much more inclusive technique.
Among other analysts, the kinds of regression analyses we're doing have been replaced by multilevel analyses,
again a more inclusive technique in a different context.
Consider comparing mean religiosity scores among three religious groups: Protestants, Catholics, and Jews.
One seemingly logical approach would be to assign successive integers to the religion groups and perform a
simple regression.
In the above, the variable RELCODE is a numeric variable representing the 3 religions.
Because it is NOT the appropriate way to represent a three-category variable in a regression analysis, we'll call
it the Naïve RELCODE.
[Scatterplot of Religiosity STRENGTH against the Naïve RELCODE, with the best-fitting straight line.]

This is mostly a page of crap because the analysis is completely inappropriate.
Regression

Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       RELCODE(a)          .                   Enter

a. All requested variables entered.
b. Dependent Variable: STRENGTH

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .767(a)   .589       .567                2.152

a. Predictors: (Constant), RELCODE

Coefficients(a)

                 Unstandardized     Standardized
                 Coefficients       Coefficients
Model            B        Std. Error    Beta    t        Sig.
1  (Constant)    14.000   1.243                 11.267   .000
   RELCODE       -3.000   .575          -.767   -5.216   .000

a. Dependent Variable: STRENGTH
For this analysis, I assigned the numbers 1, 2, and 3 to the religions Prot, Cath, and Jew respectively.
But I could just as well have used a different assignment. How about Cath = 1, Prot=2, and Jew=3?
(Partial listing: religion, new RELCODE, and STRENGTH.)

Prot 2 14
Prot 2 12
Cath 1 5
Cath 1 7
Cath 1 8
Cath 1 9
Cath 1 10
Cath 1 8
Cath 1 9
Jew 3 4
Jew 3 3
Jew 3 6
Jew 3 5
Jew 3 7
Jew 3 8
Jew 3 2

[Scatterplot of Religiosity STRENGTH against the new Naïve RELCODE, with the best-fitting straight line.]
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .384(a)   .147       .102                3.099

a. Predictors: (Constant), RELCODE

Coefficients(a)

                 Unstandardized     Standardized
                 Coefficients       Coefficients
Model            B        Std. Error    Beta    t        Sig.
1  (Constant)    11.000   1.789                 6.148    .000
   RELCODE       -1.500   .828          -.384   -1.811   .086

a. Dependent Variable: STRENGTH
Whoops! What's going on? Two analyses of the same data yield two VERY different results. Which is
correct? Answer: Neither is correct. In fact, there is nothing of use in either analysis.
Qualitative Factors, such as religion, race, type of graduate program, etc. with 3 or more values, cannot be
analyzed using simple regression techniques in which the factor is used as-is as a predictor.
That's because the numbers assigned to qualitative factors are simply names. Any set of numbers will do. The
problem is that each different set of numbers will yield a different result in a simple regression.
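This coding-dependence is easy to demonstrate outside SPSS. The Python/NumPy sketch below (an illustration, not the lecture's analysis) regresses strength on two different "logical" assignments of 1, 2, 3, using only the cases visible in the listing above; since that listing appears to be partial, the slopes need not match the SPSS output, but the point survives: same data, different coding, different result.

```python
import numpy as np

# Religiosity scores for the cases visible in the listing above
# (the full lecture data set may contain additional cases).
strength = np.array([14, 12,                    # Prot
                     5, 7, 8, 9, 10, 8, 9,      # Cath
                     4, 3, 6, 5, 7, 8, 2],      # Jew
                    dtype=float)
religion = np.array(["Prot"] * 2 + ["Cath"] * 7 + ["Jew"] * 7)

def slope(codes):
    """Simple-regression slope of strength on a numeric coding of religion."""
    x = np.array([codes[r] for r in religion], dtype=float)
    return np.cov(x, strength, ddof=1)[0, 1] / np.var(x, ddof=1)

# Two equally "logical" assignments of the integers 1, 2, 3
b1 = slope({"Prot": 1, "Cath": 2, "Jew": 3})
b2 = slope({"Cath": 1, "Prot": 2, "Jew": 3})
print(b1, b2)   # same data, two very different slopes
```

Relabel the religions any way you like and the simple-regression slope changes with you, which is exactly why a 3-category factor cannot be used as-is.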
Note: If the qualitative factor has only 2 values, i.e., it's a dichotomy, it CAN be used as-is in the regression.
(So everything on the first couple of pages of this lecture is still true.) But if it has 3 or more values, it cannot.
Does this mean that regression analysis is useful only for continuous or dichotomous variables? How limiting!!
1. Represent each value of the qualitative factor with a combination of two or more values of specially selected
Group Coding Variables.
They're called group coding variables because each value of a qualitative factor represents a
group of people.
If there are K groups, then K-1 group coding variables are required.
2. Regress the dependent variable onto the set of group coding variables in a multiple regression.
The question arises: What actually are the group coding variables? How are they created?
There are 3 common types of group coding variables. (There are several other less common types.)
1. Dummy Coding.
2. Effects Coding.
3. Contrast Coding. (We won't cover this technique this semester. Covered in Advanced SPSS.)
One group, the Comparison Group, is assigned the value 0 on every Dummy Variable.
Each other group is assigned the value 1 on a different Dummy Variable and 0 on the remaining ones.
Examples . . .
Two Groups (Even though we don't actually need special techniques for two groups.)
Group DV1
G1 1
G2 0 = The Comparison Group
Three Groups
Group DV1 DV2
G1 1 0
G2 0 1
G3 0 0 The Comparison Group
Four Groups
Group DV1 DV2 DV3
G1 1 0 0
G2 0 1 0
G3 0 0 1
G4 0 0 0 The Comparison Group
Five Groups
Group DV1 DV2 DV3 DV4
G1 1 0 0 0
G2 0 1 0 0
G3 0 0 1 0
G4 0 0 0 1
G5 0 0 0 0 The Comparison Group
Etc.
Because, as will be shown below, the regression results in a comparison of the means of the groups with 1
codes with the mean of the Comparison Group, this coding scheme is most often used in situations in which
there is a natural comparison group, for example, a control group to be compared with several experimental
groups.
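The construction and the interpretation to come can be sketched in Python/NumPy (an illustration, not the lecture's SPSS run), using the JS data listed later in this handout, with Group 3 as the Comparison Group:

```python
import numpy as np

# The JS data: 7 scores in each of three job groups,
# with Group 3 (Mailroom) as the Comparison Group.
js = np.array([6, 7, 8, 11, 9, 7, 7,        # Group 1
               5, 7, 8, 9, 10, 8, 9,        # Group 2
               4, 3, 6, 5, 7, 8, 2],        # Group 3 (comparison)
              dtype=float)
group = np.repeat([1, 2, 3], 7)

# Dummy coding: DC1 = 1 for Group 1, DC2 = 1 for Group 2,
# and the Comparison Group is 0 on both.
dc1 = (group == 1).astype(float)
dc2 = (group == 2).astype(float)
X = np.column_stack([np.ones(len(js)), dc1, dc2])

# Multiple regression of JS onto the two dummy variables
coef, *_ = np.linalg.lstsq(X, js, rcond=None)
const, b_dc1, b_dc2 = coef

# The intercept is the Comparison-Group mean; each B is the difference
# between a dummy-coded group's mean and the Comparison-Group mean.
print(round(const, 3), round(b_dc1, 3), round(b_dc2, 3))
# Matches the Coefficients box below: 5.000, 2.857, 3.000
```

Note that no special regression machinery is needed: ordinary least squares applied to the dummy columns reproduces the group-mean comparisons exactly.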
Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       DC2, DC1(a)         .                   Enter

a. All requested variables entered.
b. Dependent Variable: JS
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .630(a)   .397       .330                1.84

a. Predictors: (Constant), DC2, DC1

When the predictors are group coding variables, we often say that R Square is the proportion of variance
related to group membership.

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   40.095            2   20.048        5.930   .011(a)
   Residual     60.857           18   3.381
   Total        100.952          20

a. Predictors: (Constant), DC2, DC1
b. Dependent Variable: JS

This F tests the overall null hypothesis that there are no differences between the 3 population means. It's the
same value we would have obtained had we conducted an ANOVA. The F is significant, so reject the
hypothesis that the population means are equal.
Interpretation of the Coefficients Box.
Each Dummy Variable compares the mean of the group coded 1 on that variable to the mean of the Comparison
group. The value of the B coefficient is the difference in means.
So, for DC1, the B of 2.857 means that the mean of Group1 was 2.857 larger than the Comparison group mean.
For DC2, the B of 3.000 means that the mean of Group2 was 3.000 larger than the Comparison group mean.
Coefficients(a)

                 Unstandardized     Standardized
                 Coefficients       Coefficients
Model            B       Std. Error    Beta    t       Sig.
1  (Constant)    5.000   .695                  7.194   .000
   DC1           2.857   .983          .614    2.907   .009
   DC2           3.000   .983          .645    3.052   .007

a. Dependent Variable: JS

When is dummy coding used? When one of the groups is a natural control group for all the other groups.
Each t tests the significance of the difference between a group mean and the reference group mean.
t=2.907 tests the significance of the difference between Group 1 mean and the Reference group mean.
t = 3.052 tests the significance of the difference between Group 2 mean and the Reference group mean.
So the mean of Group1 is significantly different from the Reference group mean and the mean of Group2 is also
significantly different from the Reference Group mean.
Two Groups (Remember, special coding is not actually needed, since there are two groups.)
Group Code
G1 1
G2 -1
Three Groups Special coding IS needed when you are comparing means of 3 or more groups.
Group GCV1 GCV2
G1 1 0
G2 0 1
G3 -1 -1
Four Groups
Group GCV1 GCV2 GCV3
G1 1 0 0
G2 0 1 0
G3 0 0 1
G4 -1 -1 -1
Etc.
Now, rather than representing a comparison of the mean of a "1" group with the mean of a comparison group,
the B coefficient represents a comparison of the mean of a "1" group with the mean of ALL groups.
JS  JOB  EC1  EC2
 6   1    1    0
 7   1    1    0
 8   1    1    0    Group 1
11   1    1    0
 9   1    1    0
 7   1    1    0
 7   1    1    0
 5   2    0    1    Group 2
 7   2    0    1
 8   2    0    1
 9   2    0    1
10   2    0    1
 8   2    0    1
 9   2    0    1
 4   3   -1   -1
 3   3   -1   -1
 6   3   -1   -1    Group 3: Comparison Group
 5   3   -1   -1
 7   3   -1   -1
 8   3   -1   -1
 2   3   -1   -1
Report
JS

JOB              Mean   N    Std. Deviation
1 Clerks         7.86    7   1.68
2 Receptionist   8.00    7   1.63
3 Mailroom       5.00    7   2.16
Total            6.95   21   2.25

Alas, we can use REGRESSION to compare means, but it won't report them for us. We have to use some
other procedure, such as the REPORT procedure, if we want to actually see the values of the means.
Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       EC1, EC2(a)         .                   Enter

a. All requested variables entered.
b. Dependent Variable: JS

Everything in the top three boxes is the same as in the dummy variable analysis.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .630(a)   .397       .330                1.84

a. Predictors: (Constant), EC2, EC1

The means are still significantly different. The F of 5.930 is EXACTLY the same value as we obtained using
dummy coding and EXACTLY the same value we'd have obtained had we done an Analysis of Variance.

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   40.095            2   20.048        5.930   .011(a)
   Residual     60.857           18   3.381
   Total        100.952          20

a. Predictors: (Constant), EC2, EC1
b. Dependent Variable: JS
Interpretation of the Coefficients Box.
In Effects coding, each B coefficient represents a comparison of the mean of the group coded 1 on the variable
with the mean of ALL the groups.
So, for EC1, the B of .905 indicates that the mean of Group 1 was .905 larger than the mean of all the groups.
For EC2, the B of 1.048 indicates that the mean of Group 2 was 1.048 larger than the mean of all the groups.
There is no B coefficient for Group 3.
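The same numerical check as before can be run for effects coding (again a Python/NumPy illustration, not the SPSS run): with equal group sizes, the intercept should be the mean of the three group means, and each B should be its group's deviation from that mean.

```python
import numpy as np

# Same JS data, now with effects coding: Group 3 gets -1 on both variables.
js = np.array([6, 7, 8, 11, 9, 7, 7,
               5, 7, 8, 9, 10, 8, 9,
               4, 3, 6, 5, 7, 8, 2], dtype=float)
group = np.repeat([1, 2, 3], 7)
ec1 = np.where(group == 1, 1.0, np.where(group == 3, -1.0, 0.0))
ec2 = np.where(group == 2, 1.0, np.where(group == 3, -1.0, 0.0))
X = np.column_stack([np.ones(len(js)), ec1, ec2])

coef, *_ = np.linalg.lstsq(X, js, rcond=None)
const, b_ec1, b_ec2 = coef

# With equal group sizes the intercept is the mean of the group means,
# and each B is that group's deviation from it.
print(round(const, 3), round(b_ec1, 3), round(b_ec2, 3))
# Matches the Coefficients box below: 6.952, .905, 1.048
```

Notice that R Square, F, and the fitted group means are identical to the dummy-coding run; only the meaning of the individual coefficients changes.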
Coefficients(a)

                 Unstandardized     Standardized
                 Coefficients       Coefficients
Model            B       Std. Error    Beta    t        Sig.
1  (Constant)    6.952   .401                  17.327   .000
   EC1           .905    .567          .337    1.594    .128
   EC2           1.048   .567          .390    1.846    .081

a. Dependent Variable: JS
The t of 1.594 indicates that the mean of Group 1 was not significantly different from the mean of all groups.
The t of 1.846 indicates that the mean of Group 2 was not significantly different from the mean of all groups.
Remember that these are the same data as above. This shows that one form of analysis of the data may be more
informative than another form. In this case, the dummy variable analysis was more informative.
JS

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   40.095            2   20.048        5.930   .011
Within Groups    60.857           18   3.381
Total            100.952          20
Note that the F value (5.930) is exactly the same as the F value from the ANOVA table from the regression
procedure.
The answer is that if the comparison of a single set of group means were all that there was to the analysis, you
would NOT use the regression procedure - you'd use the analysis of variance procedure.
But here are four reasons for using, or at least being familiar with, regression-based means comparisons and the
group coding variable schemes upon which they're based.
1. Whenever you have a mixture of qualitative and quantitative variables in the analysis, regression
procedures are the overwhelming choice. Example: Are there differences in the means of three groups
controlling for cognitive ability? Can't do that without including cognitive ability, a quantitative variable, in
the analysis. Traditional analysis of variance formulas don't easily incorporate quantitative variables. Once
you're familiar with group coding schemes, it's pretty easy to perform analyses with both quantitative and
qualitative variables.
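As a sketch of reason 1, the following Python/NumPy illustration adds a hypothetical quantitative covariate to the dummy-coded JS regression (the "ability" scores here are entirely made up for illustration); the dummy coefficients then become group differences adjusted for the covariate:

```python
import numpy as np

# JS data from the handout; "ability" is a made-up quantitative covariate.
js = np.array([6, 7, 8, 11, 9, 7, 7,
               5, 7, 8, 9, 10, 8, 9,
               4, 3, 6, 5, 7, 8, 2], dtype=float)
group = np.repeat([1, 2, 3], 7)
ability = np.array([100, 103, 98, 110, 105, 99, 101,
                    97, 104, 102, 106, 108, 103, 105,
                    95, 92, 100, 98, 101, 104, 90], dtype=float)

# Dummy-code the 3-group factor, then put the dummies and the
# quantitative covariate side by side in one design matrix.
dc1 = (group == 1).astype(float)
dc2 = (group == 2).astype(float)
X = np.column_stack([np.ones(len(js)), dc1, dc2, ability])

coef, *_ = np.linalg.lstsq(X, js, rcond=None)
# coef[1] and coef[2] are now group-vs-comparison differences
# *adjusted for* ability; coef[3] is the ability slope.
print(coef)
```

Once the factor is represented by group coding variables, mixing it with quantitative predictors is just another multiple regression.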
2. Most statistical packages perform ALL analyses of both qualitative and quantitative and mixtures
using regression formulas. When analyzing only qualitative variables they will print output that looks like
they've used the analysis of variance formulas, but behind your back, they've actually done regression analyses.
Some of that output may reference the behind-your-back regression that was actually performed. So
knowing about the regression approach to comparison of group means will help you understand the output of
statistical packages performing analysis of variance. We'll see that in the GLM procedure below.
3. Other analyses, for example Logistic Regression and Survival Analyses, to name two in SPSS, have very
regression-like output when qualitative factors are analyzed. That is, they're quite up-front about the fact
that they do regression analyses. If you don't understand the regression approach to analysis of variance, it'll be
very hard for you to understand the output of these procedures.
Put names of quantitative factors in the Covariates field.
Between-Subjects Factors

        N
JOB 1   7
    2   7
    3   7

Descriptive Statistics
Dependent Variable: JS

Job     Mean   Std. Deviation   N
1       7.86   1.676             7
2       8.00   1.633             7
3       5.00   2.160             7
Total   6.95   2.247            21

Levene's Test of Equality of Error Variances(a)
Dependent Variable: JS

F      df1   df2   Sig.
.572     2    18   .574
Tests of Between-Subjects Effects
Dependent Variable: JS

Source   Type III Sum of Squares   df   Mean Square   F   Sig.   Partial Eta Squared   Noncent. Parameter   Observed Power(b)
…
Total    1116.000                  21

What's this? (the Partial Eta Squared and Observed Power columns are explained below)
GLM regresses the dependent variable onto ALL of the group coding variables and quantitative
variables, if there are any. This is the report of the significance of that regression.
Intercept: This is the report on the Y-intercept of the "All predictors" regression reported on in the line
immediately above.
These are signs of the behind-your-back regression analysis that's actually been conducted.
Note that no mention is made of the fact that two group-coding variables were created to represent JOB.
The only indication that something is up is the 2 in the df column. That 2 is the number of actual
independent variables used to represent the JOB factor.
Partial Eta squared: A measure of effect size appropriate for analysis of variance.
Observed Power: Probability of a significant F if experiment were conducted again with population means
equal to these sample means.
Tukey B

              Subset
JOB    N      1       2
3      7      5.00
1      7              7.86
2      7              8.00
Regression coding scheme   GLM contrast name
Dummy                      Simple
Effects                    Deviation
UNIANOVA JS BY Job
  /CONTRAST(Job)=Deviation
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /PRINT=OPOWER ETASQ DESCRIPTIVE PARAMETER
  /CRITERIA=ALPHA(.05)
  /DESIGN=Job.

Checking the Parameter Estimates box tells GLM to print out any regression parameters it might have
computed. These are regression parameters for any quantitative independent variables and for group-coding
variables that are created automatically by GLM.

Univariate Analysis of Variance

Between-Subjects Factors

        N
Job 1   7
    2   7
    3   7
Descriptive Statistics
Dependent Variable: JS

Job     Mean   Std. Deviation   N
1       7.86   1.676             7
2       8.00   1.633             7
3       5.00   2.160             7
Total   6.95   2.247            21
Custom Hypothesis Tests - These are the results for the deviation group coding scheme we asked for.

Contrast Results (K Matrix)

                                                           Dependent Variable: JS
Job Deviation Contrast(a)
Level 1 vs. Mean   Contrast Estimate                        .905
                   Hypothesized Value                       0
                   Difference (Estimate - Hypothesized)     .905
                   Std. Error                               .567
                   Sig.                                     .128
                   95% Confidence Interval   Lower Bound   -.287
                   for Difference            Upper Bound   2.097
Level 2 vs. Mean   Contrast Estimate                        1.048
                   Hypothesized Value                       0
                   Difference (Estimate - Hypothesized)     1.048
                   Std. Error                               .567
                   Sig.                                     .081
                   95% Confidence Interval   Lower Bound   -.145
                   for Difference            Upper Bound   2.240

a. Omitted category = 3

The p-values are the same as those obtained using the REGRESSION procedure on p. 14.
What's this???
Test Results
Dependent Variable: JS

Source     Sum of Squares   df   Mean Square   F       Sig.   Partial Eta Squared   Noncent. Parameter   Observed Power(a)
Contrast   40.095            2   20.048        5.930   .011   .397                  11.859               .815
Error      60.857           18   3.381

a. Computed using alpha = .05