
TRAINING

Introduction to the
Fundamentals of
Statistical Data Analysis

SPSS (Schweiz) AG, Schneckenmannstr. 25, 8044 Zürich, Phone 01 266 90 30,
Fax 01 266 90 39

info@spss.ch

www.spss.ch

www.spss.com

SPSS DecisionTime, SPSS Clementine, SPSS Neural Connection, SPSS QI Analyst, SPSS for Windows, SPSS Data Entry, SPSS-X, SCSS,
SPSS/PC, SPSS/PC+, SPSS Categories, SPSS Graphics, SPSS Regression Models, SPSS Advanced Models, SPSS Tables, SPSS Trends, SPSS
Exact Tests, SPSS Missing Values, SPSS Maps, SPSS AnswerTree, SPSS Report Writer and SPSS TextSmart are registered trademarks
of SPSS Inc. CHAID for Windows is the registered trademark of SPSS Inc. and Statistical Innovations Inc.
Material that mentions the names of the software listed may not be produced or redistributed without the written consent of the owner of the
registered trademarks, of the license rights to the software, and of the copyright of published products.
General notice: Other product names mentioned are used for identification purposes only and may be registered trademarks
of other companies.
Copyright 2002 by SPSS Inc. and SPSS (Schweiz) AG.
All rights reserved.
Printed in Switzerland.
This printed work may not be copied, electronically stored, or passed on without the written consent of the authors.

SPSS Training
STATISTICAL ANALYSIS
USING SPSS
TABLE OF CONTENTS
Chapter 1

Introduction
Samples and the Population
Level of Measurement
A Special Case: Rating Scales
Independent and Dependent Variables
Data Access
A Note about Variable Names and Labels in Dialog Boxes
Summary

Chapter 2

The Influence of Sample Size


Precision of Percentages
Sample Size and Precision
Precision of Means
Statistical Power Analysis
Types of Statistical Errors
Statistical Significance and Practical Importance
Summary
Appendix: Precision of Percentage Estimates

Chapter 3

Data Checking
Viewing a Few Cases
Minimum, Maximum and Number of Valid Cases
Identifying Inconsistent Responses
When Errors are Discovered
SPSS Missing Values Option
Summary

Chapter 4

Describing Categorical Data

Frequency Tables
Frequencies Output
Standardizing the Chart Axis
Pie Charts
Summary

Chapter 5

Comparing Groups: Categorical Data

A Basic Two-Way Table
Chi-Square Test of Independence
Requesting the Chi-Square Test
Different Tests, Different Results?
Ecological Significance
Small Sample Considerations
Additional Two-Way Tables
Why is the Significance Criterion Typically Set at .05?
Association Measures
Association Measures Available within Crosstabs
Graphing Cross Tabulation Results
Three-Way Tables
Extensions
Summary

Chapter 6

Exploratory Data Analysis: Interval Scale Data


Frequency Tables and Histograms
Histograms
Average Satisfaction Variable
Exploratory Data Analysis
Average Satisfaction Variable
Options with Missing Values
Measures of Central Tendency
Variability Measures
Confidence Band for Mean
Shape of the Distribution
Stem & Leaf Plot
Box & Whisker Plot
Exploring Age When First Married
Saving an Updated Copy of the Data
Summary

Chapter 7

Mean Differences Between Groups I: Simple Case

Logic of Testing for Mean Differences
Assumptions
Sample Size
Exploring the Different Groups
T Test
T Test Results for Age When First Married
T Test for Overall Satisfaction
Displaying Mean Differences
Summary
Appendix: Paired T Test
Appendix: Normal Probability Plots

Chapter 8

Mean Differences Between Groups II: One-Factor ANOVA

Logic of Testing for Mean Differences
Factors
Exploring the Data
Running One-Factor ANOVA
One-Factor ANOVA Results
The Bad News - Homogeneity
Post Hoc Testing of Means
Graphing the Results
Summary
Appendix: Group Differences on Ranks
Appendix: Help in Choosing a Statistical Method
Appendix: Help in Interpreting Statistical Results

Chapter 9

Mean Differences Between Groups III: Two-Factor ANOVA


Logic of Testing and Assumptions
How Many Factors?
Interactions
Exploring the Data
Two-Factor ANOVA
The ANOVA Table
Observed Means
Ecological Significance
Presenting the Results
Summary of Analysis
Summary
Appendix: Post Hoc Tests Using GLM Univariate

Chapter 10

Bivariate Plots and Statistics

Reading the Data
Exploring the Data
Scatterplots
Correlations
Summary

Chapter 11

Introduction to Regression
Introduction and Basic Concepts
The Regression Equation and Fit Measure
Residuals and Outliers
Assumptions
Simple Regression
Multiple Regression
Residual Plots
Multiple Regression Results
Residuals and Outliers
Summary of Regression Results
Stepwise Regression
Stepwise Regression Results
Stepwise Summary
Summary

References

Exercises

Chapter 1 Introduction

Objective

Describe the goals and method of the course; review a few important
statistical terms and concepts; provide a framework for choosing a
statistical procedure within SPSS; briefly discuss some analyses we will
perform in the course to provide a research frame of reference.

Method

Discussion

Data

The General Social Survey 1994 (performed by the National Opinion
Research Center, NORC, Chicago) is a survey involving demographic,
attitudinal and behavioral items that include views on government and
satisfaction with various facets of life. Approximately 3,200 US adults
were included in the study. However, not all questions were asked of each
respondent, so most analyses will be based on half the sample (about
1,600 individuals). The survey has been administered nearly every year
since 1972.

INTRODUCTION

SPSS is an easy-to-use yet powerful tool for data analysis. In this
course we will cover a number of statistical procedures that SPSS
performs. This is an application-oriented course and the approach
will be practical; we will discuss the situations in which you would use
each technique, the assumptions made by the method, how to set up the
analysis using SPSS, and interpretation of the results. We will not derive
proofs, but rather focus on the practical matters of data analysis in
support of answering research questions. For example, we will discuss
what correlation coefficients are, when to use them, and how to produce
and interpret them, but will not formally derive their properties. This
course is not a substitute for a course in statistics. It presupposes you
have had such a course in the past and wish to apply statistical methods
to data using SPSS.

We will cover descriptive statistics and exploratory data analysis,
then examine relationships between categorical variables using
crosstabulation tables and chi-square tests. Testing for mean differences
between groups using T Tests and analysis of variance (ANOVA) will be
considered. Correlation and regression will be used to investigate the
relationships between interval scale variables. Graphics comprise an
integral part of the analyses. More advanced statistical techniques (for
example multivariate statistics) are covered in our Advanced Statistics
and Market Segmentation courses.

This course assumes you have a working knowledge of SPSS in your
computing environment. Thus the basic use of menu systems, data
definition and labeling will not be considered in any detail. The actual
steps you take to request an analysis within SPSS differ across
computing environments: pull-down menus for Microsoft Windows,
Macintosh and UNIX; syntax commands in batch-oriented mainframe
environments. The analyses in this course will show the relevant dialog
boxes and the SPSS syntax commands for those who prefer to use syntax.
In addition, the locations of the menu choices or dialog boxes within the
overall menu system are cited in the text. The dialog box selections will
be detailed and the resulting dialog box and syntax command shown.

Scenario

Many of the analyses we perform early in the course are based on a
national US adult survey done in 1994. We are interested in determining
if relationships exist between various demographics (sex, marital status
and educational degree) and some attitudinal and belief measures (belief
in an afterlife, support of gun registration). Study of such
crosstabulation tables is the core of survey research. In addition we will
investigate whether demographic groups show significant mean
differences in hours of TV watched per day and in an overall satisfaction
scale score. We choose these few questions to explore from a large number
of possibilities. As we review and apply statistical methods to these data,
think about how these methods might be used with information you
collect. Although survey data make up our early examples, the same
methods can be used with experimental and archival data.
We will produce a variety of statistics from this single survey. This is
in part to simplify the course through minimizing the number of data sets
you will see. For this reason we may well produce more types of analyses
on one data set than are typically done. In practice the number and
variety of analyses performed in a study is a function of the number of
questions you, the analyst, want to ask of the data.

SAMPLES AND THE POPULATION

In studies involving statistical analysis it is important to be able to
accurately characterize the population under investigation. The
population is the group to which you wish to generalize your conclusions,
while the sample is the group you directly study. In some instances the
sample and population are identical or nearly identical; consider the US
Census in this regard. In the majority of studies, the sample represents a
small proportion of the population. Some common examples of this are:
membership surveys in which a small percentage of members are mailed
questionnaires, medical experiments in which samples of patients with a
disease are given different treatments, marketing studies in which users
and non users of a product are compared, and political polling. The
problem is to draw valid inferences from data summaries in the sample
so that they apply to the larger population. In some sense you have
complete information about the sample, but you want conclusions that
are valid for the population. An important component of statistics and a
large part of what we cover in the course involves statistical tests used in
making such inferences. However, before this is done you should give
thought to defining the population of interest to you and making certain
that the sample reflects this population. The survey research literature,
for example Sudman (1976) or Rossi, Wright and Anderson (1983),
reviews these issues in detail. To state it in a simple way, statistical
inference provides a method of drawing conclusions about a population of
interest based on sample results.

LEVEL OF MEASUREMENT

In practice your choice of statistical method depends on the questions you
are interested in asking of the data and the nature of the measurements
you make. For example, different test statistics would apply to an
outcome variable that is categorical (pass, fail) than to one that is
continuous (a test score). Because measurement type is important when
choosing test statistics, we briefly review the common taxonomy of level
of measurement. For an interesting discussion of level of measurement
and statistics see Velleman and Wilkinson (1993).
Level of measurement deals with the properties and meaning of
numbers when used in measurement. The major classifications that
follow are found in many introductory statistics texts. They are
presented beginning with the weakest and ending with those having the
strongest measurement properties. Statistics are available for variables
at all measurement levels, and it is important to match the proper
statistic to a given level of measurement. At the end of the chapter we
present a table linking type of statistic and measurement levels of the
variables. We hope you will find this useful. A far more detailed table can
be found in Andrews et al. (1981). The chapters that follow will
demonstrate the use of a number of these statistics.

Nominal: In nominal measurement each numeric value
represents a category or group identifier. An example would
be region of the country coded 1 (Northeast), 2 (South), 3
(Midwest) and 4 (West); each number represents a category
and the matching of specific numbers to categories is
arbitrary. Counts and percentages of observations falling into
each category are appropriate summary statistics. Such
statistics as means (the average region?) would not be
appropriate.

Ordinal: For ordinal measures the data values represent
ranking or ordering information. For example, you might
rank 10 television shows in preference order where 1 is the
most and 10 the least preferred. In comparing the top two
ranked alternatives, we know that the first is preferred to the
second, but not by how much. There are specific statistics
associated with ranks; SPSS provides a number of them,
mostly within the Nonparametric and Ordinal Regression
procedures (see also the SPSS GOLDMineR product).

Interval: In interval measurement, a unit increase in
numeric value represents the same change in quantity
regardless of where it occurs on the scale. For interval scale
variables such summaries as means and standard deviations
are meaningful. Statistical techniques such as regression and
analysis of variance assume that the dependent (or outcome)
variable is measured on an interval scale. Examples might be
temperature in Fahrenheit degrees and IQ.

Ratio: Ratio measures have interval scale properties
with the addition of a meaningful zero point, that is, zero
indicates complete absence of the characteristic measured.
The ratio of two values measured on a ratio scale can thus
be directly interpreted. Money is an example of a ratio scale;
someone with $10,000 has ten times as much as someone
with $1,000. For statistics such as ANOVA and regression
only interval scale properties are assumed, so ratio scales
have stronger properties than necessary for most statistical
analyses. Health care researchers often use ratio scale
variables (number of deaths, admissions, discharges) to
calculate rates.

A SPECIAL CASE: RATING SCALES

A common scale used in social science and market research is an ordered
five point rating scale (such ordered scales are also called Likert scales)
coded 1 (Strongly Agree, or Very Satisfied), 2 (Agree, or Satisfied), 3
(Neither agree nor disagree, or Neutral), 4 (Disagree, or Dissatisfied), and
5 (Strongly Disagree, or Very Dissatisfied). There is an ongoing debate
among researchers as to whether such scales should be considered
ordinal or interval. SPSS contains procedures capable of handling such
variables under either assumption. When in doubt about the
measurement scale, some researchers run their analyses using two
separate methods, each making different assumptions about the nature of
the measurement. If the results agree, the researcher has greater
confidence in the conclusion.
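
As a rough illustration of running the same analysis under two measurement assumptions, the Python sketch below compares a Pearson correlation (which treats the 1 to 5 codes as interval data) with a Spearman rank correlation (which uses only their order). The choice of these two statistics and the ten pairs of ratings are assumptions made for illustration; the ratings are hypothetical and are not taken from the General Social Survey.

```python
import math

def pearson(x, y):
    """Pearson correlation, which treats the codes as interval-scale numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (ordinal assumption)."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical 1-5 ratings from ten respondents on two items (illustration only).
item_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
item_b = [1, 1, 2, 2, 3, 4, 3, 4, 4, 5]

r_interval = pearson(item_a, item_b)
r_ordinal = spearman(item_a, item_b)
print(f"Pearson r = {r_interval:.3f}, Spearman rho = {r_ordinal:.3f}")
```

When the two coefficients are close, as they are for these made-up ratings, the conclusion does not hinge on whether the scale is treated as ordinal or interval.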

INDEPENDENT AND DEPENDENT VARIABLES

In research the dependent (sometimes referred to as outcome) variable
is the one we wish to study as a function of other variables. Within an
experiment, the dependent variable is the measure expected to change as
a result of the experimental manipulation. For example, a drug
experiment designed to test the effectiveness of different sleeping pills
might employ the number of hours slept at night as the dependent
variable. In surveys and other non-experimental studies, the dependent
variable is also studied as a function of other variables. However, here no
direct experimental manipulation is performed; rather the dependent
variable is hypothesized to vary as a result of changes in the other
(independent) variables. Thus terms (dependent, independent) that are
reasonably applied to experiments have taken on more general meanings
within statistics. Whether such relations are viewed causally, or as
merely predictive, is a matter of belief and reasoning. As such, it is not
something that statistical analysis alone can resolve. To illustrate, we
might investigate the relationship between starting salary (dependent)
and years of education, based on survey data, then develop an equation
predicting starting salary from years of education. Here starting salary
would be considered the dependent variable although no experimental
manipulation of education has been performed.
Correspondingly, independent variables are those used to measure
features manipulated by the experimenter in an experiment. In a
non-experimental study, they represent variables believed to influence or
predict a dependent measure. In summary, the dependent variable is
believed to be influenced by, or be predicted by, the independent
variable(s).
Finally, in some studies, or parts of studies, the emphasis is on
exploring or characterizing relationships among variables with no causal
view or focus on prediction. In such situations there is no designation of
dependent and independent. For example, in crosstabulation tables and
correlation matrices the distinction between dependent and independent
variables is not necessary. It rather resides in the eye, or worldview, of
the beholder (researcher).
The table below suggests which statistical techniques are most
appropriate, based on the measurement level of the variables. Much more
extensive diagrams and discussion are found in Andrews et al. (1981).
Recall that ratio variables can be considered as interval scale for analysis
purposes. If in doubt about the measurement properties of your variables,
you can apply a statistical technique that assumes weaker measurement
properties and compare the results to methods making stronger
assumptions. A consistent answer provides greater confidence in the
conclusions.
Figure 1.1 Statistical Methods and Level of Measurement

* Covered in Advanced Statistics with SPSS, Market Segmentation
Using SPSS and Data Mining: Modeling training courses

DATA ACCESS

Data taken from the General Social Survey 1994 are used in Chapters 1
through 9. The General Social Survey contains several hundred
demographic, attitudinal and behavioral questions. The data are stored
in an SPSS portable file named Gss94.por: a text file containing data,
labels, and missing values. A portable file can be read by SPSS on any
type of computer supporting SPSS (for example PC, Macintosh, and
UNIX).

Note on Course Data Files

All files for this class are located in the c:\Train\Stats folder on your
training machine. If you are not working in an SPSS Training center, the
training files can be copied from the floppy disk or CD that accompanies
this guide. If you are running SPSS Server (click File..Switch Server to
check), then you should copy these files to the server or a machine that
can be accessed from (mapped from) the computer running SPSS Server.

A Note about Variable Names and Labels in Dialog Boxes

SPSS can display either variable names or variable labels in dialog boxes.
In this course we display the variable names in alphabetical order. In
order to match the dialog boxes shown here, from within SPSS:
Click Edit..Options
Within the General tab sheet of the Options dialog box:
Click the Display names option button
Click the Alphabetical option button
Click OK, then click OK to confirm
To read the data file from the menus:
Click File..Open..Data
Switch to the c:\Train\Stats folder
Select SPSS Portable (*.por) from the Files of Type: drop-down list
Double-click on Gss94.por
Those using SPSS syntax commands can read a portable file with the
IMPORT command shown below.
IMPORT FILE='C:\Train\Stats\Gss94.por'.
The IMPORT command reads an SPSS portable file (it is called
import because such files usually come from SPSS on a different type of
computer and the data are thus imported from another machine type). Once
the data file is imported, you can immediately proceed to the analysis. We
see the data below.

Figure 1.2 Data After Importing

The data and labels are now available for manipulation and analysis.
We demonstrated here how to read the data file. Since our emphasis is on
statistical analysis, each of the remaining chapters will assume this step
has already been performed.

SUMMARY

In this chapter we reviewed the scope of the course, discussed levels of
measurement and defined several statistical terms, including
independent and dependent variable. A table to suggest appropriate
statistical analysis based on the measurement level was provided and a
reference for a more detailed discussion given. We also covered how to
read an SPSS data file in portable format.


Chapter 2 The Influence of Sample Size

Objective

Demonstrate the relationship between sample size and precision of
measurement; explain types of statistical errors and statistical power;
differentiate between statistical significance and practical importance
(ecological significance).

Method

Display a series of analyses in which only the sample size varies and see
which outcome measures change. Discuss scenarios in which statistical
significance and practical importance do not coincide.

Data

Data files showing the same survey percentages based on samples of 100,
400 and 1,600. A data file containing 10,000 observations drawn from a
normal population with mean 70 and standard deviation of 10.

INTRODUCTION

In statistical analysis sample size plays an important role, but one
that can easily be overlooked since a minimum sample size
is not required for the most commonly used statistical tests. Workers
in some areas of applied statistics (engineering, medical research)
routinely estimate the effects of sample size on their analyses (termed
power analysis). This is less frequently done in social science and market
research. Statistics texts present the formulas for standard errors that
describe the effect of sample size. Here we will demonstrate the effect in
two common data analysis situations: crosstabulation tables and mean
summaries.

PRECISION OF PERCENTAGES

When reporting the results of national polls, typically both the
percentage and precision are given. For example, we might hear that
candidate A is preferred by 55 percent (candidate B 45%) of those
sampled, measured with a precision of plus or minus 2 percent. This
means that 55 percent of the sample chose candidate A, but since the
sample is only a partial reflection of the population the precision measure
indicates the uncertainty surrounding the sample value. For example, we
would have much less confidence in candidate A being preferred by the
population if in the sample 55 percent chose A with a precision of plus or
minus 20 percent. In this instance, candidate B might very well be
preferred by the entire population.
Precision is strongly influenced by the sample size. In the figures
below we present a series of crosstabulation tables containing identical
percentages, but with varying sample sizes. We will observe how the test
statistics change with sample size and relate this result to the precision
of the measurement. The results below assume a population of infinite
size, or at least one much larger than the sample. For precision
calculations involving percentages with finite populations see Kish (1963)
or other survey sampling texts.

Note

The chi-square test of independence will be presented for each table as
part of the presentation of the effect of changing sample size. This test
assumes that each sample is representative of the entire population. A
detailed discussion of the chi-square statistic, its assumptions and
interpretation, can be found in Chapter 5.

SAMPLE SIZE OF 100

The table below displays responses of men and women to a question
asking for which candidate they would vote. The table was constructed by
adjusting case weights to reflect a sample of 100.
Figure 2.1 Crosstab Table with Sample of 100

We see that 46 percent of the men and 54 percent of the women
choose candidate A, resulting in an 8% difference between the two gender
groups. Since this sample of 100 people incompletely reflects the
population we turn to the chi-square test to assess whether men differ
from women in the population. As noted above, the chi-square test will be
examined closely in Chapter 5. Here we simply cite the chi-square (.640)
and state that the significance value of .424 indicates that we cannot
conclude that men and women differ (they do not differ significantly)
concerning candidate choice. The .424 value suggests that if men and women
in the population had identical attitudes toward the candidates, with a
sample of 100 we would observe a gender difference of 8 or more percentage
points about 42% of the time. Thus we are fairly likely to find such a
difference (8%) in a small sample even if there is no gender difference.

SAMPLE SIZE OF 400

Now we view a table with percentages identical to the previous one, but
based on a sample of 400 people, four times as large as before.
Figure 2.2 Crosstabulation Table with Sample of 400

The gender difference remains at 8% with fewer men choosing
Candidate A. Although the percentages are identical, the chi-square
value has increased by a factor of four (from .640 to 2.56) and the
significance value is smaller (.11). This significance value of .11 suggests
that if men and women in the population had identical attitudes toward
the candidates, with a sample of 400 we would observe a gender
difference of 8 or more percentage points about 11% of the time. Thus
with a bigger sample, we are much less likely to find such a large (8%)
percentage difference when none exists in the population. Since much
statistical testing uses a cutoff value of
.05 when judging whether a difference is significant, this result is close to
being judged statistically significant.

SAMPLE SIZE OF 1,600

Finally we present the same table of percentages, but increase the sample
size to 1,600; the increase is once again by a factor of four.
Figure 2.3 Crosstabulation Table with Sample of 1,600

The percentages are identical to the previous tables and so the gender
difference remains at 8%. The chi-square value (10.24) is four times that
of the previous table and sixteen times that of the first table. Notice that
the significance value is quite small (.001) indicating a statistically
significant difference between men and women. With a sample as large as
1,600 it is very unlikely (.001 or 1 chance in 1000) that we would observe
a difference of 8 or more percentage points between men and women if
they did not differ in the population.
Thus the 8% sample difference between two groups is highly
significant if the sample is 1,600, but not significant (testing at .05 level)
with a sample of 100. This is because the precision with which we
measure the percents increases with the sample size, and as our
measurement grows more precise the 8% sample difference looms large.
This relationship is quantified in the next section.
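
The three chi-square results can be checked with a few lines of Python. This is only a sketch: it assumes each sample is split evenly between men and women (the text gives the percentages but not the group sizes), uses the Pearson chi-square without continuity correction, and obtains the significance value from the identity that, for one degree of freedom, the upper-tail probability equals erfc(sqrt(chi-square / 2)). The function names are illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def candidate_table(n):
    """Cell counts for a sample of n, ASSUMING half men and half women."""
    half = n // 2
    men_a = round(0.46 * half)      # 46% of men choose candidate A
    women_a = round(0.54 * half)    # 54% of women choose candidate A
    return men_a, half - men_a, women_a, half - women_a

results = {}
for n in (100, 400, 1600):
    chi2 = chi_square_2x2(*candidate_table(n))
    results[n] = (chi2, p_value_df1(chi2))
    print(f"n={n:5d}  chi-square={chi2:6.2f}  significance={results[n][1]:.3f}")
```

Under the equal-split assumption the sketch gives the chi-square values cited in the text (.640, 2.56 and 10.24): with the percentages held fixed, the statistic grows in direct proportion to the sample size.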

SAMPLE SIZE AND PRECISION

In the series of crosstabulation tables we saw that as the sample size
increased we were more likely to conclude there was a statistically
significant difference between two groups when the magnitude of the
sample difference was constant (8%). This is because the precision with
which we estimate the population percentage increases with increasing
sample size. This relation can be approximated (see appendix to this
chapter for the exact relationship) by a simple equation; the precision of a
sample proportion is approximately equal to one divided by the square
root of the sample size. We display the precision for the sample sizes used
in our examples below.

Sample Size    Precision                   Value

100            1/sqrt(100) = 1/10          .1 or 10%
400            1/sqrt(400) = 1/20          .05 or 5%
1,600          1/sqrt(1,600) = 1/40        .025 or 2.5%

Also, to obtain 1% precision:

10,000         1/sqrt(10,000) = 1/100      .01 or 1%

Since precision increases as the square root of the sample size, in
order to double the precision we must increase the sample size by a factor
of four. This is an unfortunate and expensive fact of survey research. In
practice, samples of 400 or 1,600 are often chosen to obtain precision of
+/- 5% or +/- 2.5%.
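
The rule of thumb can be compared with a more explicit calculation. The Python sketch below assumes that "precision" refers to the half-width of a 95% interval for a proportion under the normal approximation, 1.96 * sqrt(p(1 - p) / n), evaluated at p = .5 where it is widest; the chapter appendix may state the relationship differently, so treat the second function as an assumption.

```python
import math

def approx_precision(n):
    """Rule of thumb from the text: one over the square root of the sample size."""
    return 1.0 / math.sqrt(n)

def margin_95(n, p=0.5):
    """Half-width of a normal-approximation 95% interval for a proportion (assumed formula)."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n)

for n in (100, 400, 1600, 10000):
    print(f"n={n:6d}  1/sqrt(n)={approx_precision(n):.4f}  "
          f"1.96*sqrt(.25/n)={margin_95(n):.4f}")
```

At p = .5 the assumed formula reduces to .98/sqrt(n), which is why 1/sqrt(n) serves as a convenient approximation, and quadrupling n halves either figure.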

PRECISION OF MEANS

The same basic relation, that precision increases with the square root of
the sample size, applies to sample means as well. To illustrate this we
display histograms based on different samples from a normally
distributed population with mean 70 and standard deviation 10. We first
view a histogram based on a sample of 10,000 individual observations.
Next we will view a histogram of 1,000 sample means where each mean is
composed of 10 observations. The third histogram is composed of 100
sample means, but here each mean is based on 100 observations. We will
focus our attention on how the standard deviation changes when sample
means are the units of observation. To aid such comparisons the scale is
kept constant across histograms.

A LARGE SAMPLE OF INDIVIDUALS

Below is a histogram of 10,000 observations drawn from a normal
distribution of mean 70 and standard deviation 10.
Figure 2.4 Histogram of 10,000 Observations

We see that a sample of this size closely matches its population. The
sample mean is very close to 70, the sample standard deviation is near
10, and the shape of the distribution is normal.

MEANS BASED ON SAMPLES OF 10

The second histogram displays 1,000 sample means drawn from the same
population (mean 70, standard deviation 10). Here each observation is a
mean based on 10 data points. In other words we pick samples of ten each
and plot their means in the histogram below.
Figure 2.5 Histogram of Means Based on Samples of 10

The overall average of the sample means is about 70, while the
standard deviation of the sample means is reduced to 3.11. Comparing
the two histograms we see there is less variation (standard deviation of
3.11 versus 10) among means based on groups of observations than
among the observations themselves. Recall the rule of thumb that
precision is a function of the square root of the sample size. If the
population standard deviation were 10, we would expect the standard
deviation of means based on samples of 10 to be the population figure
reduced by a factor of 1/square root (N) or 1/square root (10), or .316. If
we multiply this factor (.316) by the population standard deviation (10),
the theoretical value we get (3.16) is very close to what we observe in our
sample (3.11). Thus by increasing the sample size by a factor of ten (from
single observations to means of ten observations each) we reduce the
imprecision (increase the precision) by the factor 1/square root (10). The
shape of the distribution remains normal.

MEANS BASED ON SAMPLES OF 100

To carry this a step further, the next histogram is based on a sample of
100 means, where each mean represents 100 observations.
Figure 2.6 Histogram of Means Based on Samples of 100

While quite compressed, the distribution still resembles a normal
curve. The overall mean remains at 70 while the standard deviation is
very close to 1 (1.00). This is what we expect: with samples of 100,
the expected standard deviation of the sample mean is the population
standard deviation divided by SQRT(N), here the square root of 100.
Thus the theoretical standard deviation of the sample means is
10 / SQRT(100) or 1.00, which is quite close to our observed value.
Thus with means as well as percents, precision increases with the
square root of the sample size.
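The square-root relationship can be tabulated directly. This is a minimal Python sketch, assuming the population standard deviation of 10 used in this chapter; the function name standard_error is chosen for illustration only.

```python
import math


def standard_error(pop_sd, n):
    """Expected standard deviation of a mean based on n observations."""
    return pop_sd / math.sqrt(n)


# Each tenfold increase in sample size shrinks the standard deviation
# of the mean by the factor 1/SQRT(10); quadrupling n halves it.
for n in (1, 10, 100, 400):
    print(f"N = {n:>3}: standard deviation of the mean = "
          f"{standard_error(10, n):.2f}")
```

For N = 10 and N = 100 the values are 3.16 and 1.00, matching the theoretical figures quoted for the two histograms of means.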

STATISTICAL POWER ANALYSIS

With increasing precision we are better able to detect small differences
that exist between groups and small relationships between variables.
Power analysis was developed to aid researchers in determining the
minimum sample size required in order to have a specified chance of
detecting a true difference or relationship of a given size. For example,
suppose a researcher hopes to find a mean difference of .8 standard
deviation units between two populations. A power calculation can
determine the sample size necessary to have a 90% chance that a
significant difference will be found between the sample means when
performing a statistical test at a specified significance level. Thus a
researcher can evaluate whether the sample is large enough for the
purpose of the study. A Power Analysis option (SPSS SamplePower) is
available through SPSS. Also, books by Cohen (1988) and Kraemer &
Thiemann (1987) discuss power analysis and present tables used to
perform the calculation for common statistical tests. In addition specialty
software is available for such analyses. Power analysis can be very useful
when planning a study, but does require such information as the
magnitude of the hypothesized effect and an estimate of the error
variation.
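As a rough illustration of the example above (a difference of .8 standard deviation units, 90% power), the Python sketch below uses the common normal-approximation formula for comparing two group means with a two-sided test. The choice of approximation and the function name n_per_group are assumptions made here; this is not the algorithm used by SPSS SamplePower, and exact t-based calculations give slightly larger sample sizes.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, power=0.90, alpha=0.05):
    """Approximate cases needed in each of two groups.

    effect_size is the mean difference in standard deviation units;
    the test is two-sided at significance level alpha.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(n_per_group(0.8))   # large effect: 33 per group
print(n_per_group(0.5))   # smaller effect needs more cases: 85 per group
```

Note how quickly the required sample grows as the hypothesized effect shrinks, which is why an estimate of the effect size is needed before the calculation can be done.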

TYPES OF STATISTICAL ERRORS

Recall that when performing statistical tests we are generally attempting
to draw conclusions about the larger population based on information
collected in the sample. There are two major types of errors in this
process. False positives, or Type I errors, occur when no difference (or
relation) exists in the population, but the sample tests indicate there are
significant differences (or relations). Thus the researcher falsely
concludes a positive result. This type of error is explicitly taken into
account when performing statistical tests. When testing for statistical
significance using a .05 criterion (alpha level), we acknowledge that if
there is no effect in the population then the sample statistic will exceed
the criterion on average 5 times in 100 (.05).
Type II errors, or false negatives, are mistakes in which there is a
true effect in the population (difference or relation) but the sample test
statistic is not significant, leading to a false conclusion of no effect. To put
it briefly, a true effect remains undiscovered. It is helpful to note that
statistical power, the probability of detecting a true effect, equals 1 minus
the probability of a Type II error.
When other factors are held constant there is a tradeoff between the
two types of errors; thus Type II error can be reduced at the price of
increasing Type I error. In certain disciplines, for example in statistical
quality control when destructive testing is done, the relationship between
the two error types is explicitly taken into account and an optimal
balance determined based on cost considerations. In social science
research, the tradeoff is acknowledged but rarely taken into account (the
exception being power analysis); instead emphasis is usually placed on
maintaining a steady Type I error rate of 5% (testing at .05 level). This
discussion merely touches the surface of these issues; researchers
working with small samples or studying small effects should be very
aware of them.
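The tradeoff between the two error types can be made concrete with a short calculation. The Python sketch below uses a normal approximation for a two-sided comparison of two group means; the effect size (.5 standard deviation units) and group size (30) are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Holding the effect and sample size fixed, a stricter Type I error
# rate (smaller alpha) raises the Type II error rate (1 - power).
for alpha in (0.01, 0.05, 0.10):
    power = power_two_group(0.5, 30, alpha)
    print(f"alpha = {alpha:.2f}  power = {power:.2f}  "
          f"Type II error = {1 - power:.2f}")
```

When the effect size is set to zero the function returns alpha itself, which is the Type I error rate described above.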

STATISTICAL SIGNIFICANCE AND PRACTICAL IMPORTANCE

A related issue involves drawing a distinction between statistical
significance and practical importance. When an effect is found to be
statistically significant we conclude that the population effect (difference
or relation) is not zero. However, this allows for a statistically significant
effect that is not quite zero, yet so small as to be insignificant from a
practical or policy perspective. This notion of practical or real world
importance is also called ecological significance. Recalling our discussion
of precision and sample size, very large samples yield increased precision,
and in such samples very small effects may be found to be statistically
significant. In such situations, the question arises as to whether the
effects make any practical difference. For example, suppose a company is
interested in customer ratings of one of its products and obtains rating
scores from several thousand customers. Furthermore, suppose mean
ratings on a 1 to 5 satisfaction scale are calculated for male (mean 3.25)
and female (mean 3.20) customers and are found to be significantly
different. Would such a small difference be of any practical interest or
use?
When sample sizes are small (say under 30), precision tends to be
poor and so only relatively large (and ecologically significant) effects are
found to be statistically significant. With moderate samples (say 30 to one
or two hundred) small effects tend to show modest significance while
large effects are highly significant. For very large samples, several
hundreds or thousands, small effects can be highly significant; thus an
important aspect of the analysis is to examine the effect size and
determine if it is important from a practical, policy or ecological
perspective. In summary, the statistical tests we cover in this course
provide information as to whether there are non-zero effects. Estimates of
the effect size should be examined to determine whether the effects are
important.
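The customer rating example can be worked through numerically. The text gives only the two means (3.25 and 3.20) and "several thousand" customers, so the standard deviation (1.0) and the group sizes (4,000 each) in this Python sketch are assumptions made purely for illustration.

```python
from statistics import NormalDist

mean_male, mean_female = 3.25, 3.20   # means quoted in the text
sd = 1.0             # ASSUMED common standard deviation of the ratings
n_per_group = 4000   # ASSUMED number of customers in each group

difference = mean_male - mean_female
std_error = sd * (2 / n_per_group) ** 0.5     # SE of the mean difference
z = difference / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
effect_size = difference / sd                 # in standard deviation units

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"effect size = {effect_size:.2f} standard deviations")
```

Under these assumptions the difference is statistically significant at the .05 level, yet it amounts to only one twentieth of a standard deviation, which most analysts would consider negligible in practical terms.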

SUMMARY

In this chapter we demonstrated and quantified the relation between
sample size and precision for summaries involving percentages and
means. In addition we discussed statistical power analysis, the two major
types of errors when performing statistical testing, and differentiated
between statistical significance and practical or ecological significance.

APPENDIX: PRECISION OF PERCENTAGE ESTIMATES

In this chapter we suggested, as a rule of thumb, that the precision of a
sample proportion is approximately equal to one divided by the square
root of the sample size. Formally, for a binomial or multinomial
distribution the standard deviation of the sample proportion (P) is equal
to SQRT((P * (1 - P)) / N). Thus the standard deviation is a maximum
when P = .5 and reaches a minimum of 0 when P = 0 or 1. A 95%
confidence band is usually determined by taking the sample estimate
plus or minus twice the standard deviation. Precision here is simply two
times the standard deviation. Thus precision is 2 * SQRT ((P * (1-P))/N).
If we substitute for P the value .5 which maximizes the expression (and is
therefore conservative) we have 2 * SQRT(( .5 * (1-.5))/N). Since SQRT( .5
* .5) = 1/2, the previous expression simplifies to SQRT(1/N), the rule of
thumb used in the chapter. Because the rule of thumb uses the value of
P (.5) that maximizes the standard deviation, in practice greater
precision is obtained whenever P departs from .5.
It is important to note that this calculation assumes the population
size is infinite, or as an approximation, much larger than the sample.
Formulations that take finite population values into account can be found
in Kish (1965) and other texts discussing sampling. When applied to
survey data, the calculation also assumes that the survey was carried out
in a methodologically sound manner. Otherwise, the validity of the
sample proportion itself is called into question.
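The formulas in this appendix translate directly into code. The Python sketch below simply restates 2 * SQRT((P * (1 - P)) / N) and the rule of thumb SQRT(1/N); the function names are illustrative, and the calculation carries the same infinite-population assumption noted above.

```python
import math


def precision(p, n):
    """Half-width of the approximate 95% band: 2 * SQRT(P * (1 - P) / N)."""
    return 2 * math.sqrt(p * (1 - p) / n)


def rule_of_thumb(n):
    """Conservative precision obtained by setting P = .5: SQRT(1 / N)."""
    return 1 / math.sqrt(n)


n = 400
print(f"rule of thumb: {rule_of_thumb(n):.3f}")
for p in (0.5, 0.3, 0.1):
    print(f"P = {p}: precision = {precision(p, n):.3f}")
```

With N = 400 the rule of thumb gives .05 (five percentage points), which equals the exact value at P = .5; at P = .1 the exact value falls to .03.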


Chapter 3 Data Checking

Objective

Review the importance of checking data before performing formal
analysis and see the methods available in SPSS.

Method

Use the Data Editor in SPSS for Windows and the Case Summaries
procedure to view the data; check for out-of-range values using the
Descriptives procedure; apply If statements to test for consistency across
questions.

Data

General Social Survey 1994 (done by the National Opinion Research
Center, Chicago).

INTRODUCTION

When working with data it is important to verify their validity
before proceeding with the analysis. Major mail survey
organizations apply such methods as double-entry verification,
a technique in which two people independently enter the data onto a
computer, after which the results are compared and discrepancies
resolved. Survey organizations often use specialized programs, such as
SPSS Data Entry, that automatically check the validity (for example, is
the response within an acceptable range) of an answer and its consistency
with previous information. If you do not have a data entry program, it is
still possible to implement simple data validity and consistency checks
using SPSS. Although mundane, time spent examining data in this way
early on will reduce false starts, misleading analyses, and makeup work
later. For this reason data checking is a critical prelude to statistical
analysis.

VIEWING A FEW CASES

Often the first step in checking data previously entered on the computer
is to view the first few observations and compare their data values to the
original data sheets or survey forms. This will detect many gross errors of
data definition (incorrect columns specified for an ASCII text file, reading
alpha characters as numeric data fields). Viewing the first few cases can
be easily accomplished using the Data Editor Window in SPSS or the
Case Summaries procedure. Below we view part of the 1994 General
Social Survey data in SPSS.
Click File..Open..Data (move to the C:\Train\Stats folder if necessary)
Select SPSS Portable (*.por) from the Files of Type drop-down list
Click GSS94.POR and click Open
Figure 3.1 General Social Survey 1994 Data in SPSS for Windows

The first few responses can be compared to the original data sheets or
surveys as a preliminary test of data entry. If errors are found,
corrections can be made directly within the Data Editor window.
The Case Summaries procedure can list values of individual cases for
selected variables. For example, we can display values for several
questions related to education: educ (respondent's education), speduc
(spouse's education), maeduc (mother's education), and paeduc (father's
education).

Click Analyze..Reports..Case Summaries
Move educ, speduc, maeduc and paeduc into the Variables list box
Type 10 into the Limit cases to first text box
Figure 3.2 Case Summaries Dialog Box

Click OK
Note we limit the listing to the first ten cases (the default is 100). The
Case Summaries procedure can also display group summaries.
The Summarize syntax command below instructs SPSS to list the
four education variables for the first ten cases. In addition a title is
provided for the pivot table containing the case listing and counts are
requested as summary statistics.
SUMMARIZE
  /TABLES=educ speduc maeduc paeduc
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=10
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT .

Below we see the requested variables for the first ten observations.
Figure 3.3 Case Summary List of First Ten Cases

By default, SPSS displays value labels in case listings; this can be
changed within the SPSS Options dialog box (click Edit..Options, then
move to the Output Labels tab). Note that the high incidence of
NAP (not applicable) for some variables (see father's education)
probably arises because few questions were asked of all respondents
in the General Social Survey 1994. Ordinarily, this much missing data
would be of concern.

MINIMUM, MAXIMUM, AND NUMBER OF VALID CASES

A second simple data check that can be done within SPSS is to request
descriptive statistics on all numeric variables. By default, the
Descriptives procedure will report the mean, standard deviation,
minimum, maximum and number of valid cases for each numeric
variable. While the mean and standard deviation are not relevant for
nominal variables (see Chapter 1), the minimum and maximum values
will signal any out-of-range data values. In addition, if the number of
valid observations is suspiciously small for a variable, it should be
explored carefully. Since Descriptives provides only summary statistics, it
will not indicate which observation contains an out-of-range value, but
that can be easily determined once the data value is known. The Case
Summaries procedure can also be used for this purpose.
The SPSS Descriptives syntax command below will request
summaries for all variables (although summaries will print only for
numeric variables).

DESCRIPTIVES /VARIABLES ALL
/STATISTICS=MEAN STDDEV MIN MAX.
We request the same analysis in SPSS by choosing
Analyze..Descriptive Statistics.. Descriptives and selecting all variables
in the Descriptives dialog box (shown below).
Click Analyze..Descriptive Statistics..Descriptives
Move all variables into the Variable(s) list box
Figure 3.4 Descriptives Dialog Box

Only numeric variables will appear in the list box. Running the
Descriptives syntax command or clicking the OK button in SPSS will
produce the summaries shown below.
Click OK

Figure 3.5 Descriptives Output (Beginning)

Figure 3.6 Descriptives Output (End) Showing Valid Listwise

We can see the minimum, maximum and number of valid cases for
each variable in the data set. By examining such variables as EDUC
(highest year of school completed), TVHOURS (hours per day watching
TV) and AGE (age of respondent) we can determine if there are any
out-of-range values. The maximum for TVHOURS looks rather high (24). As
an exercise, examine the value labels (click Utilities..Variables) for a few
of the variables and discover the valid range of values. Compare these
ranges to the results in the figure.
The valid number of observations (Valid N) is listed for each variable.
The number of valid observations listwise indicates how many
observations have complete data for all variables, a useful bit of
information. Here it is zero because not all GSS questions are asked of,
nor are relevant to, any single individual. If odd values are discovered in
these summaries we can locate the problem observations with data
selection statements or the Find function (under Edit menu) in the Data
Editor window.
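The logic of this check (minimum, maximum, valid N, and the listwise count) can be sketched outside SPSS as well. The Python example below uses a handful of made-up cases, with None standing for a missing value; the data and the valid ranges are hypothetical and are not taken from the General Social Survey.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"educ": 12, "tvhours": 2, "age": 34},
    {"educ": 16, "tvhours": None, "age": 51},
    {"educ": 99, "tvhours": 24, "age": 27},   # 99 is not a valid year count
    {"educ": 10, "tvhours": 3, "age": None},
]

# ASSUMED valid ranges, for illustration only.
valid_range = {"educ": (0, 20), "tvhours": (0, 24), "age": (18, 89)}


def describe(cases, name):
    """Return (minimum, maximum, valid N) for one variable."""
    values = [c[name] for c in cases if c[name] is not None]
    return min(values), max(values), len(values)


def out_of_range(cases, name, low, high):
    """Case numbers (starting at 1) whose value lies outside low..high."""
    return [i for i, c in enumerate(cases, start=1)
            if c[name] is not None and not low <= c[name] <= high]


for name, (low, high) in valid_range.items():
    print(name, describe(cases, name), out_of_range(cases, name, low, high))

# Valid N (listwise): cases with complete data on every variable.
listwise_n = sum(all(v is not None for v in c.values()) for c in cases)
print("Valid N (listwise):", listwise_n)
```

The summary flags the problem through the maximum for educ, and the second helper then identifies which case holds the offending value, mirroring the two-step approach described above.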

IDENTIFYING INCONSISTENT RESPONSES

In most data sets certain relations must hold among variables if the data
are recorded properly. This is especially true with surveys containing
filter questions or skip patterns. Some examples from the GSS are: if a
respondent has never been married then his/her age when first married
should have a missing code; age when first married should not be greater
than current age. Such relations cannot be easily checked by scanning the
data or in single variable summaries such as frequency tables. However,
these relations can be examined by using SPSS data transformation
instructions to identify cases violating the expected patterns.
We will demonstrate how to test for such relations among variables
using the two examples mentioned in the paragraph above. First if never
married (MARITAL = 5) then age when first married (AGEWED) should
be coded as a missing value. Secondly, age when first married (AGEWED)
should be less than or equal to current age (AGE). The basic approach is
to create a new variable that will be set to a specific number (say 1) if the
expected relation does not hold.
From the Data Editor window,
Click Transform..Compute
Click If... to transfer to the Compute Variable: If Cases dialog box
Click Include if case satisfies condition option button
In the text box of the Compute If dialog box we indicate the condition
we want identified: never married (Marital=5) and having a valid age
when first married ( ~ MISSING(AGEWED) - the tilde (~) means NOT).
Enter (type or build) the expression
marital=5 & ~MISSING(agewed) into the text box

Figure 3.7 Defining the Error Condition

Click Continue
If this condition is met, a new variable (ERRMARIT) will be set equal
to one, as shown in the Compute dialog box below.
Type errmarit in the Target Variable box
Type 1 in the Numeric Expression: box
Figure 3.8 Setting the Error Indicator Variable to 1

Click OK

We see that if the error condition holds then the new variable
ERRMARIT will be set equal to 1. A frequency analysis can be run to
determine whether any cases have ERRMARIT equal to 1. The problem cases
can be selected and listed, or located with the Find function in the Data
Editor window.
The same operation can be applied using an IF syntax command. The
first If statement performs the same function as the dialog boxes just
viewed, setting ERRMARIT to 1 if a respondent was never married yet
reports a valid age when first married. The second If assigns 1 to
ERRAGE if age when first married exceeds the respondent's current age.
IF (MARITAL = 5 AND NOT (MISSING(AGEWED)))
ERRMARIT=1.
IF (AGEWED > AGE) ERRAGE=1.
A frequency analysis is then run to see if these errors occurred.
Click Analyze..Descriptive Statistics..Frequencies
Move errmarit and errage (if calculated) into the Variable(s) list
box
Click OK
The Frequencies syntax is shown below.
FREQUENCIES /VARIABLES ERRMARIT ERRAGE .
Figure 3.9 Frequency Tables of Error Variables

Since the data from the General Social Survey are very carefully
checked it is not surprising that no discrepancies (error values equal to 1)
were found.
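For readers who want to see the two consistency rules spelled out step by step, here is a Python sketch of the same logic. The four cases are invented so that each rule is violated once; None plays the role of a missing value, and leaving the error variable as None when no error is found mirrors the way the IF commands leave it missing.

```python
NEVER_MARRIED = 5   # MARITAL code for never married, as in the text

# Hypothetical cases constructed to trigger each check once.
cases = [
    {"marital": 1, "agewed": 24, "age": 40},
    {"marital": 5, "agewed": None, "age": 30},
    {"marital": 5, "agewed": 22, "age": 35},   # never married, yet has AGEWED
    {"marital": 1, "agewed": 45, "age": 38},   # married after current age
]

for case in cases:
    agewed, age = case["agewed"], case["age"]
    # Rule 1: never married respondents should have a missing AGEWED.
    case["errmarit"] = (
        1 if case["marital"] == NEVER_MARRIED and agewed is not None else None
    )
    # Rule 2: age when first married should not exceed current age.
    case["errage"] = (
        1 if agewed is not None and age is not None and agewed > age else None
    )

flagged = [i for i, c in enumerate(cases, start=1)
           if c["errmarit"] == 1 or c["errage"] == 1]
print("cases needing review:", flagged)
```

Listing the flagged case numbers corresponds to selecting and listing the problem cases after the frequency tables show that errors exist.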

WHEN DATA ERRORS ARE DISCOVERED

If errors are found the first step is to return to the original survey or data
form. Simple clerical errors are merely corrected. In some instances
errors on the part of respondents can be corrected based on their answers
to other questions. If neither of these approaches is possible the offending
items can be coded as missing responses and will be excluded from SPSS
analyses. While beyond the scope of this course, there are techniques that
substitute values for missing responses in survey work. For a discussion
of such methods see Burke and Clark (1992) or Babbie (1973). Also note
the SPSS Missing Values option can perform this function.
Having cleaned the data we can now move to the more interesting
part of the process, data analysis.

SPSS MISSING VALUES OPTION

The SPSS Missing Values option will produce various reports describing
the frequency and pattern of missing data. It also provides methods for
estimating (imputing) values for missing data.

SUMMARY

In this chapter we stressed data cleaning as an important prelude to data
analysis and reviewed ways of identifying and locating errors in data
files.


Chapter 4 Describing Categorical Data

Objective

Review the ways of summarizing and displaying data from categorical
(nominal) measures with each variable considered separately. Consider
useful ways of reorganizing such data.

Method

Run Frequencies to create frequency tables; graphically represent the
summaries with bar and pie charts (use the Graphs menu).

Data

The General Social Survey 1994.

Scenario

We are interested in exploring relationships between some demographic
variables (highest educational degree attained, gender) and some belief/
attitudinal/behavioral variables (belief in afterlife, attitude towards gun
permits, gun ownership). Prior to running these two-way analyses
(considered in Chapter 5) we will look at the distribution of responses for
several of these variables. This can be regarded as a preliminary step
before performing the main cross tabulation analysis of interest, or as an
analysis in its own right. There might be considerable interest in
documenting what percentage of the U.S. (non institutionalized) adult
population believes in an afterlife. In addition we will look at the
frequency distributions of marital status and marital happiness.

INTRODUCTION

Summaries of individual variables provide the basis for more
complex analyses. There are a number of reasons for performing
single variable (univariate) analyses. One would be to establish
base rates for the population sampled. These rates may be of immediate
interest: what percentage of our customers is satisfied with services this
year? In addition, studying a frequency table containing many categories
might suggest ways of collapsing groups for a more succinct and striking
table. When studying relationships between variables, the base rates of
the separate variables indicate whether there is a sufficient sample size
(recall the discussion in Chapter 2) in each group to proceed with the
analysis. A second use of such summaries would be as a data checking
device: unusual values would be apparent in a frequency table.
A frequency analysis provides a summary table indicating the
number and percentage of cases falling into each category of a variable.
To represent this information graphically we use bar or pie charts. In this
chapter we run frequency analyses on several questions from the General
Social Survey 1994 and construct charts to accompany the tables. We
discuss the information in the tables and consider the advantages and
disadvantages in standardizing bar charts when making comparisons
across charts.

FREQUENCY TABLES

We begin by requesting frequency tables and bar charts for marital
status (MARITAL), frequency of attending religious services (ATTEND),
highest education degree earned (DEGREE), attitude toward gun permits
(GUNLAW), gun ownership (OWNGUN), and marital happiness
(HAPMAR). Requests in SPSS for Windows for bar charts can be made
from the Frequencies dialog box, or through the Graphs menu.
If the 1994 General Social Survey Data are not in the SPSS Data
Editor, then:
Click File..Open..Data (move to the C:\Train\Stats folder if necessary)
Select SPSS Portable (*.por) from the Files of Type drop-down list
Click GSS94.POR and click Open
Click Analyze..Descriptive Statistics..Frequencies
Move marital, attend, degree, gunlaw, owngun, hapmar into the
Variable(s) list box
Figure 4.1 Frequencies Dialog Box

After placing the desired variables in the list box, we use the Charts
button and request bar charts based on percentages (see figure below).
Click the Charts pushbutton
Click the Bar charts option button in the Chart Type box.
Click the Percentages option button in the Chart Values box

Figure 4.2 Frequencies: Charts Dialog Box

Click Continue
Click the Format pushbutton
Click Organize output by variables in the Multiple Variables box
(not shown)
Click Continue
Click OK
To request this analysis with command syntax, use the Frequencies
command below:
FREQUENCIES
/VARIABLES=marital attend degree gunlaw owngun hapmar
/BARCHART PERCENT
/ORDER VARIABLES.
We now examine the tables and charts looking for anything
interesting or unusual.

FREQUENCIES OUTPUT

We begin with a table based on marital status.
Figure 4.3 Marital Status

By default, value labels appear in the first column and, if labels were
not supplied, the data values display. Tables involving nominal and
ordinal variables usually benefit from the inclusion of value labels.
Without value labels we wouldn't be able to tell from the output which
number stands for which marital status category. The Frequency column
contains counts or the number of occurrences of each data value. The
Percent column shows the percentage of cases in each category relative to
the number of cases in the entire data set, including those with missing
values. Cases with missing values for marital status would be excluded
from the Valid Percent calculation. Thus the Valid Percent column
contains the percentage of cases in each category relative to the number
of valid (non-missing) cases. Cumulative percentage, the percentage of
cases whose values are less than or equal to the indicated value, appears
in the cumulative percent column. With only one case containing a
missing value, the percent and valid percent columns are
indistinguishable. Note we can edit the frequencies pivot table to display
the percentages to greater precision.
Examine the table. Note the disparate category sizes. What are some
meaningful ways in which you might combine or compare categories?
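To make the four columns concrete, the Python sketch below rebuilds them for a tiny set of invented marital status responses (None marks a missing value). It illustrates the definitions given above and does not reproduce Figure 4.3; note also that categories here appear in order of first occurrence, not in order of their data values as they would in SPSS.

```python
from collections import Counter

# Hypothetical responses; None marks a missing value.
responses = ["married", "married", "widowed", "divorced", "married",
             "never married", None, "divorced", "married", "never married"]


def frequency_table(values):
    """Rows of (category, frequency, percent, valid percent, cumulative)."""
    total = len(values)                         # all cases, missing included
    valid = [v for v in values if v is not None]
    rows, cumulative = [], 0.0
    for category, count in Counter(valid).items():
        percent = 100 * count / total           # base: all cases
        valid_percent = 100 * count / len(valid)  # base: valid cases only
        cumulative += valid_percent
        rows.append((category, count, percent, valid_percent, cumulative))
    return rows


for category, count, pct, valid_pct, cum_pct in frequency_table(responses):
    print(f"{category:<14}{count:>3}{pct:>8.1f}{valid_pct:>8.1f}{cum_pct:>8.1f}")
```

With one missing case in ten, Percent and Valid Percent differ visibly (40.0 versus 44.4 for the first category); with one missing case in a large sample the two columns would be nearly identical, as noted for the marital status table.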

Figure 4.4 Bar Chart of Marital Status

The disparities among the marital status categories are brought into
focus by the bar chart. Notice the small proportion of individuals
separated from their spouse.
We next turn to attendance at religious services.
Figure 4.5 Frequency of Attendance at Religious Services

How would you summarize the information in this table? If you
wished to reduce the number of categories, which would you collapse?
Decisions about collapsing categories usually have to do with which
groups need be kept distinct in order to answer the research question
asked, and the sample sizes for the groups. Below we view a bar chart
based on the church attendance variable.
Figure 4.6 Bar Chart of Attendance at Religious Services

Does the picture make it easier to understand the distribution of
ATTEND?

Figure 4.7 Frequency Table of Educational Degree

Figure 4.8 Bar Chart of Highest Educational Degree

There are some interesting peaks and valleys in the distribution of
the respondents' highest degree. Can you think of a sensible way of
collapsing DEGREE into fewer categories?

Figure 4.9 Frequency Table of Attitude Toward Gun Permits

Figure 4.10 Bar Chart of Attitude Toward Gun Permits

The GUNLAW variable is a dichotomy. Note the relative percentage
of responses in the two groups. Given the sample information above, do
you believe that people are as likely to favor gun permits as to oppose
them? If not, characterize GUNLAW's distribution. The three missing
categories for GUNLAW are often used in large-scale survey work. The
first category, not applicable (NAP), is coded if the question is never
asked of the respondent. This could be because it is not relevant to the
individual or because not all questions are asked of all individuals in the
sample. A second missing code (DK) represents a response of Don't
Know. The third missing code, NA, indicates no answer is recorded (No
Answer) probably because of a refusal, but possibly because the question
wasn't asked. These three different missing codes are used to provide
information about why there isn't a valid response to the question. These
codes are excluded from consideration in the Valid Percent column of
the frequency table, as well as from the bar chart, and would also be
ignored if any additional statistics were requested.
Figure 4.11 Frequency Table of Gun Ownership

Figure 4.12 Bar Chart of Gun Ownership

Respondents are more evenly divided in the question asking about
the presence of a gun in the home. Are respondents as likely to have a
gun at home as not? If there is a need to perform a statistical significance
test on this question, the NPAR TESTS procedure within SPSS can do so
(using a chi-square test). Recalling the earlier question regarding gun
permits it might be interesting to look at the relationship between gun
ownership and attitude toward gun permits; we might ask to what extent
is gun ownership related to whether one favors gun permits? The
frequency tables we are viewing display each variable independently. To
investigate the relationship between two categorical (nominal) variables
we will turn to crosstabulation tables in Chapter 5.
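The chi-square test mentioned above, which asks whether respondents are as likely to have a gun at home as not, involves very little arithmetic. The Python sketch below computes it for two hypothetical counts (420 and 580); these are not the General Social Survey results, and the function name is illustrative. With two categories the test has one degree of freedom, so the p-value can be obtained from the normal distribution.

```python
from statistics import NormalDist


def chi_square_equal_split(count_a, count_b):
    """One-sample chi-square test that two categories are equally likely."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # With 1 degree of freedom, chi-square is the square of a standard
    # normal value, so the p-value follows from the normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(chi2 ** 0.5))
    return chi2, p_value


chi2, p_value = chi_square_equal_split(420, 580)   # hypothetical counts
print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}")
```

For these made-up counts the chi-square value is 25.6 and the p-value is far below .05, so an even split would be rejected; equal counts would give a chi-square of 0.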
A note about the missing value codes in the table above: Refused
implies that the respondent refused to answer the question, NA (no
answer) means no answer was recorded, probably because the question
was inadvertently skipped, and NAP (not applicable) was coded if the
question was not asked (recall that not every question is asked of every
respondent in the General Social Survey).
Figure 4.13 Frequency Table - Happiness of Marriage

Figure 4.14 Bar Chart - Happiness of Marriage

About two-thirds of those married are very happily married. Of the
rest, most say they are pretty happy. A very small percentage of
respondents are not too happy. Then there are those whose marriage
dissolved! Which category are they in? How might this influence your
interpretation of the percentages? Figure 4.3 (frequency table of marital
status) shows how many people are in each marital status group.

STANDARDIZING THE CHART AXIS

If we glance back over the last few bar charts we notice that the scale
axis, which displays percents, varies across charts. This is because the
maximum value displayed in each bar chart depends on the percentage of
respondents in the most popular category. Such scaling permits better
use of the space within each bar chart but makes comparison across
charts more difficult. Percentaging is itself a form of standardization, and
bar charts displaying percentages as the scale axis were requested in our
analyses. Charts can be further normed by forcing the scale axis (the axis
showing the percents) in each chart to have the same maximum value.
This facilitates comparisons across charts, but can make the details of
individual charts more difficult to see. We will illustrate this by reviewing
three of the previous bar charts and requesting that the maximum scale
value is set to 100 (100%). We accomplish this by editing each chart
individually.
To force the scale axis maximum to 100%
Double click the chart to open the SPSS Chart Editor
Select Chart..Axis, then Scale
Set the Maximum value to 100
Click OK

Rotate the chart 90 degrees by clicking on the rotate tool
Select File..Close to exit from the Chart Editor
If we apply this rescaling to all three variables: attitude toward gun
permits (GUNLAW), having a gun at home (OWNGUN) and frequency of
church attendance (ATTEND), we obtain the results below.
Figure 4.15 Percentage Bar Chart for Gun Permits Question

Figure 4.16 Percentage Bar Chart for Having Gun in Home

Note that the horizontal axes of the bar charts are now in comparable
units so we can make direct percentage comparisons based on the bar
length. This is the advantage of the percentage standardization.
However, note the result when we apply the same technique to the
frequency of church attendance variable.
Figure 4.17 Percentage Bar Chart for Church Attendance

The percentage bar chart of church attendance has the same general
shape as the one shown previously. The horizontal axis is scaled 0 to 100,
and church attendance has eight categories, so the bars are shrunken
down compared to the earlier plot of church attendance. As a result some
detail is lost. We can rescale this chart by setting the maximum below
100%, but would lose the ability to directly compare bar length across the
series of charts. Thus the advantage of standardizing the percentage
scale must be traded off against potential loss of detail. In practice it is
usually quite easy to decide which approach is better.

Hint

If you use the Graphs menu (Graphs..Interactive..Bar) or the IGRAPH
command to create your charts, you can set the maximum value of the
scale axis percentages to 100% initially without having to go back to edit
your charts. For example, to create an interactive chart using the
ATTEND variable,
Click Graphs..Interactive..Bar
Drag and Drop How often R attends religious services
[attend] from the source list to the horizontal axis arrow
box
Drag and Drop Percent [$pct] from the source list to the
vertical axis arrow box
Click the Options tab
Select Percent from the Variable: pull-down menu in the Scale
Range box
Uncheck the Auto check box in the Scale Range area
Set the Minimum to 0 and the Maximum to 100
Click OK
To request the same chart with command syntax, use the IGRAPH
command below.
IGRAPH /VIEWNAME='Bar Chart'
  /X1 = VAR(attend) TYPE = CATEGORICAL
  /Y = $pct
  /COORDINATE = VERTICAL
  /X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
  /SCALERANGE = $pct MIN=0.000000 MAX=100.000000
  /CATORDER VAR(attend) (ASCENDING VALUES OMITEMPTY)
  /BAR KEY=ON SHAPE = RECTANGLE BASELINE = AUTO.
Figure 4.18 Interactive Bar Chart

PIE CHARTS

Pie charts provide a second way of picturing information in a frequency
table. Such charts are produced using the Graphs menu or the
Frequencies..Chart dialog box. To create a pie chart for church
attendance from the Graphs menu,
Click Graphs..Interactive..Pie..Simple
Drag and Drop the variable How often R attends religious
services [attend] into the Slice By box
Click OK
The syntax command to produce a pie chart based on church
attendance appears below.
IGRAPH /VIEWNAME='Simple Pie Chart'
  /SUMMARYVAR = $count
  /COLOR = VAR(attend) TYPE = CATEGORICAL
  /X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
  /CATORDER VAR(attend) (ASCENDING VALUES OMITEMPTY)
  /PIE KEY = ON START 90 CW.
Figure 4.19 Pie Chart of Church Attendance

While the pie and bar charts are based on the same information, the
structure of the pie chart draws attention to the relation between a given
slice (here a group) and the whole. On the other hand, a bar chart leads
one to make comparisons among the bars, rather than any single bar to
the total. You might keep these different emphases in mind when
deciding which to use in your presentations.

SUMMARY

We reviewed the use of frequency tables and bar charts to examine
individual categorical variables as an analysis in its own right and as a
preliminary step before performing more complex analyses. In addition
we discussed the implications of standardizing a series of bar charts.


Chapter 5 Comparing Groups: Categorical Data

Objective

Learn how to compare different groups when the outcome measure is
categorical (nominal). Understand the assumptions and interpret the
results of chi-square tests of independence. See how to display the
summaries in crosstabulation tables using bar charts. Explore three-way
crosstabulation relationships.

Method

Use the Crosstabs procedure to construct a basic two-way table. Add
percentages along with some intermediate statistics used in calculating
the chi-square. Request that some strength of association measures
appear with the table. Create a clustered bar chart to graph the
percentages shown in a crosstabulation table. Build a three-way table by
specifying a layer variable.

Data

The General Social Survey 1994. We investigate possible relationships
between some demographic variables (gender, highest education degree)
and some attitudinal (belief in an afterlife, attitude towards gun permits)
and behavioral (gun in home) measures.

INTRODUCTION

Thus far we have examined each variable isolated from the others.
A main component of many studies is to look for relationships
among variables or to compare groups on some measure. Using the
General Social Survey 1994 data, our interest is in investigating whether
men differ from women in their belief in an afterlife and in their attitude
toward gun permits. In addition, we will explore whether education
relates to these measures. Our choice of these variables, and not others,
is based on our view of which questions might be interesting to
investigate. More often a study is designed to answer specific questions of
interest to the researcher. These may be theoretical as in an academic
project, or quite applied as often found in market research.

The crosstabulation table is the basic technique used to examine
relationships among nominal (categorical) variables. Crosstabs are used
in practically all areas of research. A crosstabulation table is a
co-frequency table of counts, where each row or column is a frequency table
of one variable for observations falling within the same category of the
other variable. When one of the variables identifies groups of interest (for
example, a demographic variable) crosstabulation tables permit
comparisons between groups. In survey work, two attitudinal measures
are often displayed in a crosstab to assess relationship. While the most
common tables involve two variables, crosstabulations are general
enough to handle additional variables and we will discuss a simple
three-variable analysis.
A crosstabulation table can serve several purposes. It might be used
descriptively, that is, the emphasis is on providing some information
about the state of things and not on inferential statistical testing. For
example, demographic information of members of an organization
(company employees, students at a college, members of a professional
group) or recipients of a service (hospital patients, season ticket holders)
can be displayed using crosstabulation tables. Here the point is to provide
summary information describing the groups and not to make explicit
comparisons that generalize to larger populations. For example, an
educational institution might publish a crosstabulation table reporting
student outcome (dropout, return) for its different divisions. For this
purpose, the crosstabulation table is descriptive.
Crosstabulation tables are also used in research studies where the
goal is to draw conclusions about relationships in the population based on
sample data (recall our discussion in Chapter 1). Many survey studies
and all experiments have this as their goal. In order to make such
inferences, statistical tests (usually the chi-square test of independence)
are applied to the tables. In this chapter we will begin by discussing a
simple table displaying gender and belief in the afterlife. We will then
outline the logic of applying a statistical test to the data, perform the test,
and interpret the results. To provide reinforcement, several other
two-way tables will be considered.
In addition to the statistical tests, researchers occasionally desire a
numeric summary of the strength of the association between the two
variables in a crosstabulation table. We provide a brief review of some of
these measures.
Another aspect of data analysis involves graphical display of the
results. We will see how bar charts can be used to present the data in
crosstabulation tables. Finally, we will explore a three-way table and
point in the direction of more advanced methods. We begin however, with
a simple table.

A BASIC TWO-WAY TABLE

In SPSS, to request a crosstabulation table of counts we need only
indicate the row and column variables. We will specify POSTLIFE as the
row variable and SEX for the column variable. Note that multiple
variables can be given for both.
Click File...Open..Data (move to the C:\Train\Stats folder if
necessary)
Select SPSS Portable (*.por) from the Files of Type drop-down
list
Click GSS94.POR and click Open
Click Analyze..Descriptive Statistics..Crosstabs
Move postlife into the Row(s): box
Move sex into the Column(s): box
Figure 5.1 Crosstabs Dialog Box

A checkbox option is available to graph the crosstabulation table
results as a clustered bar chart based on counts. Rather than request a
bar chart of counts now, we will later use the Graphs menu to construct a
clustered bar chart based on percents. The Suppress tables option is
available if you want to see the crosstabulation statistical measures but
not the crosstabulation tables. A button labeled Exact will appear if the
SPSS Exact Tests option is installed.
Because SEX is designated as the Column variable, each gender
group will appear as a separate column in the table. The Layer box can be
used to build three-way and higher-order tables; we will see this feature
later in the chapter. By default the Crosstabs procedure will display only
counts in the cells of the table. For interpretive purposes we want
percentages as well. The Cells pushbutton controls the summaries
appearing in the cells of the table.
Click the Cells pushbutton
Click Column check box in order to obtain column percentages.
Figure 5.2 Crosstab Cell Display Dialog

Click Continue
Row, column and total table percentages can be requested. Row
percentages are computed within each row of the table so that the
percentages across a row sum to 100%. Column percentages would sum to
100% down each column, and total percentages sum to 100% across all
cells of the table. While we can request any or all of these percentages,
the column percent best suits our purpose. Since SEX is our column
variable, column percentages allow immediate comparison of the
percentages of men and women who believe in an afterlife: the question of
interest. We will not request row percents because we are not directly
interested in them and wish to keep the table simple.
Notice that Counts is checked by default. The other choices (Expected
Count, Residuals) are more technical summaries and will be considered
in the next example.
Click OK
The Crosstabs syntax command that can be used to run this analysis
appears below.
CROSSTABS
  /TABLES postlife BY sex
  /CELLS COUNT COLUMN .
In SPSS syntax the row variable(s) (here POSTLIFE) precedes the
keyword BY and the column variable(s) follows it on the TABLES
subcommand. We request that counts and column percents appear in the
cells of the table.
Figure 5.3 Crosstabulation of Belief in an Afterlife by Gender

The statistics labels appear in the row dimension of the table. The
two numbers in each cell are counts and column percentages. We see that
about 79% of the men and 83% of the women said they believe in an
afterlife. Also note that many observations are missing since this
question was not asked of all respondents in 1994. On the descriptive
level we can say that most of those sampled believed in the afterlife. If we
wish to draw conclusions about the population, for example differences
between men and women, we would need to perform statistical tests.
Row percents, if requested, would indicate what percentage of
believers is male and what percentage of believers is female. In other
words, the percentages would sum to 100% across each row. Your choice
of row versus column percents determines your view of the data. In
survey research, independent variables, such as demographics, are often
positioned as column variables (or banner variables, in the stub-and-banner
tables of market research), and since there is much interest in
comparing these groups, column percents are displayed. If you prefer to
interpret row percentages in this context, or wish both percentages to
appear, feel free to do so. The important point is that the percentages
help answer the question of interest in a direct fashion.
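The three kinds of percentages can be checked by hand. The Python sketch below is only an illustration outside SPSS. It uses the counts quoted later in this chapter (565 of the 714 men, and 1425 of all 1752 respondents, believe in an afterlife); the other three cell counts are derived from those margins by subtraction and are an assumption of the sketch, as is the helper name percent_tables.

```python
# Counts: 565, 714, 1425 and 1752 are quoted in this chapter; the other
# three cells follow by subtraction (an assumption of this sketch).
table = {
    "believe":     {"male": 565, "female": 860},
    "not believe": {"male": 149, "female": 178},
}

def percent_tables(table):
    """Return (row, column, total) percentage tables for a two-way table of counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_tot = {r: sum(table[r].values()) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_tot.values())
    row_pct = {r: {c: 100 * table[r][c] / row_tot[r] for c in cols} for r in rows}
    col_pct = {r: {c: 100 * table[r][c] / col_tot[c] for c in cols} for r in rows}
    tot_pct = {r: {c: 100 * table[r][c] / n for c in cols} for r in rows}
    return row_pct, col_pct, tot_pct

row_pct, col_pct, tot_pct = percent_tables(table)
for r in table:
    for c in table[r]:
        print(f"{r:12s} {c:7s} count={table[r][c]:4d} "
              f"row%={row_pct[r][c]:5.1f} col%={col_pct[r][c]:5.1f} "
              f"total%={tot_pct[r][c]:5.1f}")
```

Column percentages answer "what share of men (or women) believe?", while row percentages answer "what share of believers are men?", which is why the choice between them depends on the question being asked.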
Having examined the basic two-way table, we move on to ask
questions about the larger population.

CHI-SQUARE TEST OF INDEPENDENCE

In the table viewed above, 79% of the men in the sample and 83% of the
women believed in an afterlife. There is a difference in the sample of
about 4 percentage points, with a higher proportion of women believing. Can
we conclude from this that there is a population difference between men and
women on this issue (statistical significance)? And if there is a difference
in the population, is it large enough to be important to us (ecological, or
practical, significance)?
The difficulty we face is that the sample is an incomplete and
imperfect reflection of the population. We use statistical tests to draw
conclusions about the population from the sample data. The basic logic of
such tests follows. We first assume there is no effect (null hypothesis) in
the population (here that men and women show no differences in belief in
an afterlife). We then calculate how likely it is that a sample could show
as large (or larger) an effect as what we observe (here a 4% difference), if
there were truly no effect in the population. If the probability of obtaining
so large a sample effect by chance alone is very small (often less than 5
chances in 100 or 5% is used) we reject the null hypothesis and conclude
there is an effect in the population. While this approach may seem
backward, that is, we assume no effect when we wish to demonstrate an
effect, it provides a method of forming conclusions about the population.
The details of how this logic is applied will vary depending on the type of
data (counts, means, other summary measures) and the question asked
(differences, association). So we will use a chi-square test in this chapter,
but t and F tests later.
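The testing logic can be imitated directly by simulation. The sketch below is a randomization (permutation) test, not the chi-square test used in this chapter, and is offered only to make the reasoning concrete. It assumes the cell counts derived from the margins quoted later in the chapter (565 believers among 714 men, 860 among 1038 women); the seed and the number of simulated samples are arbitrary choices.

```python
import random

# Assumed counts, derived from the margins quoted in this chapter.
N_MEN, N_WOMEN = 714, 1038
MEN_BELIEVE, WOMEN_BELIEVE = 565, 860

def pct_gap(men_believe, women_believe):
    """Absolute difference between the two groups' proportions of believers."""
    return abs(men_believe / N_MEN - women_believe / N_WOMEN)

observed_gap = pct_gap(MEN_BELIEVE, WOMEN_BELIEVE)

# Under the null hypothesis gender is irrelevant, so we pool all answers
# (1 = believes, 0 = does not) and deal them out to "men" at random.
total_believe = MEN_BELIEVE + WOMEN_BELIEVE
pooled = [1] * total_believe + [0] * (N_MEN + N_WOMEN - total_believe)

rng = random.Random(1994)   # fixed seed so the run is repeatable
n_sims = 2000
as_extreme = 0
for _ in range(n_sims):
    men = sum(rng.sample(pooled, N_MEN))
    # Count simulated samples whose gap is at least as large as the observed one.
    if pct_gap(men, total_believe - men) >= observed_gap - 1e-12:
        as_extreme += 1

p_estimate = as_extreme / n_sims
print(f"observed gap: {100 * observed_gap:.1f} percentage points")
print(f"share of simulated samples with a gap at least this large: {p_estimate:.3f}")
```

The printed share plays the role of the significance value: it estimates how often chance alone would produce a sample difference as large as the one observed.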
Applying the testing logic to the crosstabulation table, we calculate
the number of people expected to fall into each cell of the table assuming
no relationship between gender and belief in an afterlife, then compare
these numbers to what we actually obtained in the sample. If there is a
close match we retain (fail to reject) the null hypothesis of no effect. If the actual cell
counts differ dramatically from what is expected under the null
hypothesis we will conclude there is a gender difference in the population.
The chi-square statistic summarizes the discrepancy between what is
observed and what we expect under the null hypothesis. In addition, the
sample chi-square value can be converted into a probability that can be
readily interpreted by the analyst. To demonstrate how this works in
practice, we will rerun the same analysis as before, but request the
chi-square statistic. We will also ask that some supplementary information
appear in the cells of the table to better envision the actual chi-square
calculation. In practice, you would rarely ask for this latter information
to be displayed.

REQUESTING THE CHI-SQUARE TEST

We return to the previous Crosstabs dialog box to request the chi-square
statistic.
Click on the Dialog Recall tool, and then click Crosstabs
Click on the Statistics pushbutton
Click the Chi-square check box
Figure 5.4 Crosstab Statistics Dialog Box

The first choice is the chi-square test of independence of the row and
column variables. Most of the remaining statistics are association
measures that attempt to assign a single number to represent the
strength of the relationship between the two variables. We will briefly
discuss them later in this chapter. The McNemar statistic is used to test
for equality of correlated proportions, as opposed to general independence
of the row and column variables (as does the chi-square test). For
example, if we ask people, before and after viewing a political
commercial, whether they would vote for candidate A, the McNemar test
would test whether the proportion choosing candidate A changed. The
Cochran's and Mantel-Haenszel statistics test whether a dichotomous
response variable is conditionally independent of a dichotomous
explanatory variable when adjusting for a control variable. For
example, is there an association between instruction method (treatment
vs. control) and exam performance (pass vs. fail), controlling for school
area (urban vs. rural).
Click Continue
To illustrate the chi-square calculation we also request that some
technical results (expected values and residuals) appear in the cells of the
table. Once again, you would not typically display these statistics. To
proceed we return to the Crosstab Cell Display dialog box (click the Cells
pushbutton in the Crosstabs dialog box), then check Expected Counts and
Unstandardized Residuals.
Click the Cells pushbutton
Check Expected Counts and Unstandardized Residuals
Figure 5.5 Displaying Technical Information in Crosstab Tables

By displaying the expected counts we can see how many observations
are expected to fall into each cell assuming no relationship between the
row and column variables. The unstandardized residual is the difference
between the observed count and the expected count in the cell. As such it
is a measure of the discrepancy between what we expect under the null
hypothesis and what we actually observe. We will not explore the other
residuals listed in this course, but note they can be used with large tables
to quickly identify cells that exhibit the greatest deviations from
independence.
Click Continue
Click OK
The command below will produce the same analysis.
CROSSTABS
  /TABLES postlife BY sex
  /STATISTIC CHISQ
  /CELLS COUNT COLUMN EXPECTED RESID .
The chi-square test is specified on the STATISTIC subcommand. In
addition to the cell counts and column percents, we request expected
values and residuals. The crosstabulation table appears below.
Figure 5.6 Crosstab with Expected Values and Residuals

The counts and percentages are the same as before; the expected
counts and residuals will aid in explaining the calculation of the chi-square
statistic. Recall that our testing logic assumes no relation between
the row and column variables (here gender and belief in an afterlife) in
the population, and then determines how consistent the data are with
this assumption. In the table above there are 565 males who say they
believe in an afterlife. We now need to calculate how many observations
should fall into this cell if there were no relation between gender and
belief in an afterlife. First, note (we calculate this from the counts in the
cells and in the margins of the table) that 40.8% (714 of 1752, or .4075) of
the sample is male and 81.3% (1425 of 1752, or .8133) of the sample
believes in an afterlife. If gender is unrelated to belief in an afterlife, the
probability of picking someone from the sample who is both a male and a
believer would be the product of the probability of picking a male and the
probability of picking a believer, that is, .4075 * .8133 or .3314 (33.1
percent). This is based on the probability of the joint event equaling the
product of the probabilities of the separate events when the events are
independent; for example, the probability of obtaining two heads when
flipping two fair coins is .5 * .5 = .25. Taking this a step further, if the
probability of picking a male believer is 33.14% and our sample is composed
of 1752 people, then we would expect to find 580.7 male believers in the
sample (714 * 1425 / 1752, carrying the unrounded proportions). This number
is the expected count for the male-believer cell, assuming no relation
between gender and belief. We observed 565 male believers while we
expected to find 580.7, and so the discrepancy or residual is -15.7. Small
residuals indicate agreement between the data and the null hypothesis of
no relationship; large residuals suggest the data are inconsistent with the
null hypothesis.
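The arithmetic for the male-believer cell can be reproduced in a few lines. The Python sketch below is an illustration outside SPSS and uses only the four numbers quoted above; the variable names are our own.

```python
n_total = 1752      # respondents in the table
n_male = 714        # column total for men
n_believe = 1425    # row total for believers
observed = 565      # observed count of male believers

p_male = n_male / n_total
p_believe = n_believe / n_total

# Under independence the joint probability is the product of the two
# marginal probabilities; multiplying by n gives the expected count.
expected = p_male * p_believe * n_total   # same as n_male * n_believe / n_total
residual = observed - expected            # unstandardized residual

print(f"share male = {100 * p_male:.1f}%, share believing = {100 * p_believe:.1f}%")
print(f"expected count = {expected:.1f}, residual = {residual:.1f}")
```

Because the two proportions are carried at full precision, the expected count comes out at 580.7 rather than the 580.6 that the rounded value .3314 would give.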
Expected counts and residuals are calculated for each cell in the table
and we wish to obtain an overall summary of the agreement between the
two. Simply summing the residuals has the disadvantage of negative and
positive residuals (discrepancies) canceling each other out. To avoid this
(and for more technical statistical reasons) residuals are squared so all
values are positive. A second consideration is that a residual of 50 would
be large relative to an expected count of 15, but small relative to an
expected count of 2,000. To compensate for this the squared residual from
each cell is divided by the expected count of the cell. The sum of these cell
summaries ((Observed count - Expected count)**2 / Expected count)
constitutes the Pearson chi-square statistic. One final consideration is
that since the chi-square statistic is the sum of positive values from each
cell in the table, other things being equal, it will have greater values in
larger tables. The chi-square value itself is not adjusted for this, but an
accompanying statistic called degrees of freedom, based on the number of
cells (technically the number of rows minus one multiplied by the number
of columns minus one), is taken into account when the statistic is
evaluated.
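The calculation just described can be sketched as follows. This Python illustration sits outside SPSS; the counts 565, 714, 1425 and 1752 come from the text, while the remaining three cells are derived from those margins by subtraction and are an assumption of the sketch.

```python
# Observed counts: rows = belief (yes, no), columns = gender (male, female).
# The cells other than 565 are derived from the quoted margins (assumption).
observed = [[565, 860],
            [149, 178]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi_square += (obs - exp) ** 2 / exp      # squared residual scaled by expected count

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"Pearson chi-square = {chi_square:.3f} with {df} degree(s) of freedom")
```

The same loop works unchanged for larger tables; only the list of observed counts needs to change.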
Figure 5.7 Chi-Square Test Results

The chi-square is a measure of the discrepancy between the observed
cell counts and what we expect if the row and column variables were
unrelated. Clearly a chi-square of zero would indicate perfect agreement
(and no relationship between the variables); a small chi-square would
indicate agreement while a large chi-square would signal disagreement
between the data and the null hypothesis. In order to assess the
magnitude of the sample chi-square, the calculated value is compared to
the theoretical chi-square distribution and an easily interpreted
probability is returned (column labeled Asymp. Significance (2-Sided)).
The chi-square we have been discussing, Pearson's chi-square, has a
significance value that displays as .05 (the unrounded value falls just
below it). This means that if there were no relation
between gender and belief in an afterlife in the population, the
probability of obtaining discrepancies as large as (or larger than) those we
see in our sample (percentage differences of 4% between men and women) would
be about 5%. In other words, it is unlikely that we would obtain this large a
sample difference between men and women if there were no differences in
the population. If we consider as significant those effects that would occur
less than 5% of the time by chance alone (as many researchers do), we
would claim this is a statistically significant effect: U.S. adult women are
more likely to believe in an afterlife than men.
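For a table with one degree of freedom, the conversion from a chi-square value to a significance value can be sketched with the Python standard library. This is an illustration only, not how SPSS computes it internally; the function name is our own, and the shortcut relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi_square_p_value_1df(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom.

    With 1 df the chi-square variable is the square of a standard normal
    variable, so the tail area can be written with the complementary
    error function. This shortcut does not apply to larger tables.
    """
    return math.erfc(math.sqrt(chi_square / 2))

# 3.841 and 6.635 are the familiar 5% and 1% critical values for 1 df.
for value in (0.0, 1.0, 3.841, 6.635):
    p = chi_square_p_value_1df(value)
    print(f"chi-square = {value:5.3f}  ->  significance = {p:.3f}")
```

A chi-square of zero gives a significance of 1 (perfect agreement with the null hypothesis), and larger chi-square values give ever smaller significance values.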
The Continuity correction will appear only in two-row by two-column
tables when the chi-square test is requested. In such small tables it was
known that the standard chi-square calculation did not closely
approximate the theoretical distribution, which meant that the
significance value was not quite correct. A statistician named Frank
Yates published an adjusted chi-square calculation specifically for
two-row by two-column tables and it typically appears labeled as the
"Continuity correction" or as "Yates' correction". It was applied routinely
for many years, but more recent Monte Carlo simulation work indicates
that it overadjusts. As a result it is no longer automatically used in two
by two tables, but it is certainly useful to compare the two significance
values to make sure they agree (here notice the significance value for the
continuity correction is slightly above .05).
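The usual textbook form of the continuity correction subtracts one half from the absolute value of each residual before squaring. The Python sketch below compares the corrected and uncorrected statistics; it assumes the cell counts derived from the margins quoted earlier (only 565, 714, 1425 and 1752 appear in the text), and the helper name p_value_1df is our own.

```python
import math

# Rows = belief (yes, no), columns = gender (male, female).
# Cells other than 565 are derived from the quoted margins (assumption).
observed = [[565, 860],
            [149, 178]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0
corrected = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        gap = abs(obs - exp)
        pearson += gap ** 2 / exp
        # Yates' correction: shrink each absolute residual by 0.5 before squaring.
        corrected += max(gap - 0.5, 0.0) ** 2 / exp

def p_value_1df(chi_square):
    """Upper-tail chi-square probability, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi_square / 2))

print(f"Pearson:              {pearson:.3f}  significance {p_value_1df(pearson):.4f}")
print(f"Continuity corrected: {corrected:.3f}  significance {p_value_1df(corrected):.4f}")
```

The correction always lowers the chi-square value, so its significance value is always somewhat larger than the uncorrected one.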
The Pearson chi-square was the test used originally with
crosstabulation tables. A more recent chi-square approximation is the
likelihood ratio chi-square test. Here it tests the same null hypothesis,
independence of the row and column variables, but uses a different
chi-square formulation. It has some technical advantages that largely show
up when dealing with higher-order tables (three-way and up). In the vast
majority of cases, both the Pearson and likelihood ratio chi-square tests
lead to identical conclusions. In most introductory statistics courses, and
when reporting results of two-variable crosstab tables, the Pearson
chi-square is commonly used. For more complex tables, and more advanced
statistical applications, the likelihood ratio chi-square is almost
exclusively applied. Note that here the likelihood ratio result is slightly
above .05, leading to a different conclusion than the Pearson chi-square.
This will be discussed below.
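The likelihood ratio chi-square is commonly written as twice the sum, over all cells, of the observed count times the natural log of the ratio of observed to expected count. The Python sketch below computes it next to the Pearson statistic. The cell counts are again the ones derived from the quoted margins, which is an assumption of the sketch.

```python
import math

# Rows = belief (yes, no), columns = gender (male, female).
# Cells other than 565 are derived from the quoted margins (assumption).
observed = [[565, 860],
            [149, 178]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0
likelihood_ratio = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        pearson += (obs - exp) ** 2 / exp
        if obs > 0:   # an empty cell contributes nothing to the sum
            likelihood_ratio += 2 * obs * math.log(obs / exp)

print(f"Pearson chi-square          = {pearson:.3f}")
print(f"Likelihood ratio chi-square = {likelihood_ratio:.3f}")
```

The two statistics are close, as expected in a large sample, but they are not identical, which is how two tests of the same hypothesis can land on opposite sides of a fixed cutoff.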
The Linear by Linear chi-square tests the very specific hypothesis of
linear association between the row and column variables. This assumes
that both variables are interval scale measures and you are interested in
testing for straight-line association. This is rarely the case in
crosstabulation tables (unless working with rating scales) and the test is
not often used.
Finally, Fisher's exact test will appear for crosstabulation tables
containing two rows and two columns (a 2x2 table); exact tests are
available for larger tables through the SPSS Exact Tests option. Fisher's
test calculates the proportion of all table arrangements that have
percentages at least as extreme as those observed in the cells, while keeping
the same marginal proportions. Exact tests have the advantage of not depending
on approximations (as do the Pearson and likelihood ratio chi-square
tests). However, the computational effort required to evaluate exact tests
in all but simple situations (for example a 2x2 table) has been large.
Recent improvements in algorithms have resulted in exact tests
calculated more efficiently. You should consider using exact tests when
your sample size is small, or when some cells in large crosstabulation
tables are empty or have small cell counts. As the sample size increases
(for all cells), exact tests and asymptotic (Pearson, likelihood ratio)
results converge.
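For a 2x2 table the exact calculation is small enough to sketch directly. The Python code below enumerates every table with the same margins and sums the probabilities of those no more likely than the observed table, which is one common definition of the two-sided test; it is an illustration, not the SPSS implementation. The function name is our own, and the GSS cell counts are derived from the quoted margins (an assumption).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact probability for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Probability that the top-left cell equals k, with margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9   # guards against floating-point ties
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= p_observed * tolerance)

# Rows = belief (yes, no), columns = gender (male, female); derived counts.
p_gss = fisher_exact_two_sided(565, 860, 149, 178)
print(f"Fisher exact significance (two-sided): {p_gss:.4f}")

# A tiny hypothetical table, where the asymptotic tests would be doubtful.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Small-table example: {p_small:.4f}")
```

The enumeration has at most a few hundred terms here, but the number of possible tables grows very quickly with more rows and columns, which is why exact tests for larger tables need specialized algorithms.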

DIFFERENT TESTS, DIFFERENT RESULTS?

Here we are faced with our Pearson result disagreeing with the other
tests. It is not a major problem in that the probability results are very
similar. However since we are testing at the .05 level, we would draw
different conclusions from the different tests. That is, while the probability
values from each test are very close in value, some fall just above, and
another just below, the .05 cutoff we chose. In this case it might be best to
say there is a suggestion of a male-female difference, but the test result is
not conclusive. Additional data, if available, would help resolve the issue.
Such disagreements among test results occur relatively rarely in practice.

ECOLOGICAL SIGNIFICANCE

While our significance tests were not definitive, suppose we did conclude
from the Pearson chi-square test that U.S. adult men and women differ in
their belief in an afterlife. We now ask the question of practical
importance. Recall that majorities of both men and women believe and
the sample difference between them was about 4%. At this point the
researcher should consider whether a 4% difference is large enough to be
of practical importance. For example, if these were dropout rates for
students in two groups (no intervention, a dropout intervention program),
would a 4% difference in dropout rate justify the cost of the program?
These are the more practical and policy decisions that often have to be
made during the course of an applied statistical analysis.

SMALL SAMPLE CONSIDERATIONS

The crosstabulation table viewed above was based on a large sample.
When sample sizes are small and expected cell counts approach zero, the
chi-square approximation may break down with the result that the
probability (significance values) cannot be trusted. Because of this, rules
of thumb have been developed to warn the analyst away from potentially
misleading results. A conservative rule of thumb regarding expected cell
counts is that they should be on the order of 4 or 5 or greater. Studies
have shown that the minimum expected cell count could be as low as 1 or
2 without adverse results in some situations. In the presence of many
small expected cell counts, you should be concerned that the chi-square
test is no longer behaving as it should. In SPSS, Crosstabs can display
the expected cell counts, and will show the minimum expected cell count
whenever the chi-square test is requested.
Regarding the observed cell counts, you should monitor the number
and proportion of zero cells (cells with no observations). Some researchers
say that no more than 20% of your observed cell counts should be less
than 5 in the situation where your expected counts are well behaved (see
above). If Crosstabs performs the chi-square test and any expected count
is less than 5, then the percentage of cells with expected counts less than
5 will be displayed. Too many zero cells, or a particular pattern of zero
cells, invalidate the usual interpretation of many measures of association.
Zero cells also contribute to a loss of sensitivity in your analysis. Two
subgroups, which might be distinguishable given enough data, are not
when a small sample makes both cells zero.
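The two diagnostics mentioned above, the minimum expected count and the percentage of cells with an expected count below 5, are easy to compute for any table. The Python sketch below is an illustration outside SPSS; the function name is our own, the first table uses the cell counts derived from the margins quoted earlier, and the second table is entirely hypothetical.

```python
def expected_count_check(observed, threshold=5):
    """Return (minimum expected count, percent of cells with expected < threshold)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    pct_small = 100 * sum(e < threshold for e in expected) / len(expected)
    return min(expected), pct_small

large_table = [[565, 860],
               [149, 178]]   # derived from the margins quoted in this chapter
small_table = [[12, 3],
               [9, 1]]       # hypothetical small-sample table

for name, tab in (("large", large_table), ("small", small_table)):
    minimum, pct_small = expected_count_check(tab)
    print(f"{name}: minimum expected count = {minimum:.1f}, "
          f"cells with expected count < 5 = {pct_small:.0f}%")
```

In the hypothetical small table half of the cells have expected counts below 5, the situation in which collapsing categories or an exact test would be worth considering.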
In practice, when expected or observed counts become small,
researchers often, if it makes conceptual sense, collapse several rows or
columns together to increase the sample sizes for the now broader groups.
Another possibility is to drop a row or column category from the analysis,
essentially giving up information about that group in order to obtain
stability when investigating the others. In recent years efficient
algorithms have been developed to perform exact tests which permit low
or zero expected cell counts in crosstabulation tables. SPSS has
implemented such algorithms in its Exact Tests module.

ADDITIONAL TWO-WAY TABLES

We will examine several additional two-variable crosstabulation tables
and apply the chi-square test. In particular, we will look at the
relationship between education degree (DEGREE) and attitude toward
gun permits (GUNLAW), education degree and belief in an afterlife, and
also gender and attitude toward gun permits. We return to the Crosstabs
dialog box to request the additional tables.
Click the Dialog Recall tool, and then click Crosstabs
Move gunlaw into the Row(s) box
Move degree into the Column(s) box
Click on the Cells pushbutton
Click to uncheck the Expected cell counts and Unstandardized
Residuals
Click Continue
Figure 5.8 Multiple Crosstab Tables

Multiple tables can be obtained by naming several row or column
variables. In addition (although not shown) we drop our previous request
that the expected counts and residuals appear (in the Cells dialog box).
Click OK
The final command appears below.
CROSSTABS
  /TABLES postlife gunlaw BY sex degree
  /STATISTIC CHISQ
  /CELLS COUNT COLUMN .
Each variable before the keyword BY will be matched with each one
following it, constructing four tables. Since we have already viewed belief
in an afterlife by gender, we skip it here.
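The pairing rule is simply every row variable crossed with every column variable. A short Python sketch (an illustration only; the list names are our own) enumerates the tables requested by the command above:

```python
from itertools import product

row_vars = ["postlife", "gunlaw"]   # variables before the keyword BY
col_vars = ["sex", "degree"]        # variables after the keyword BY

# Every row variable is paired with every column variable.
tables = [f"{row} BY {col}" for row, col in product(row_vars, col_vars)]
for number, name in enumerate(tables, start=1):
    print(number, name)
```

With two row variables and two column variables this gives 2 x 2 = 4 tables; adding a third row variable would raise the count to six.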

Note

Some of the pivot tables shown in this chapter have been edited in the
Pivot Table editor so they are easier to read in this document.
Figure 5.9 Belief in Afterlife by Education Degree

Across different education degrees the belief in an afterlife ranges
from a high of 84% (Bachelor's degree) to a low of 75% (Less than high
school degree). The Pearson and likelihood ratio chi-squares indicate a
nonsignificant result (a sample with differences this large would occur
about 6 times in 100 (.058) by chance alone if there were no differences in
the population). No continuity correction appears because this is not a
two-row by two-column table. The minimum expected frequency is above
5, the value suggested by the rule of thumb reviewed earlier.
Figure 5.10 Gun Permits and Gender

We see there is a 15 percentage point difference between men and women in the
percentage favoring gun permits and it is highly significant. Notice that
the significance value displays as zero (.000) to three decimal places.
This is due to rounding when the number is displayed; the actual
probability is therefore less than .0005, or less than 5 chances in ten
thousand. We can display the significance value to greater precision by
double clicking on the pivot table to open the Pivot Table editor, then
double clicking on the significance value (or selecting its cell and
formatting the cells to display greater precision). On the practical level,
more women than men favor gun permits. At the same time, majorities of
both groups favor gun permits. We will return to this table later adding a
third variable, whether or not there is a gun in the home. The minimum
expected cell count is a quite comforting 180.

Figure 5.11 Gun Permits and Education Degree

We see the differences across degree groups are not statistically
significant (significance value greater than .05). The Pearson chi-square
significance value (.132) indicates that if degree were not related to
attitude toward gun permits in the population, there is a 13% chance of
obtaining differences as large as (or larger than) those we found in this
table. Thus this result is too likely to have occurred by chance alone to
be considered statistically significant. It is
interesting to note that the General Social Survey data from 1991 showed
a significant relation between attitude toward gun permits and education
degree, and higher degrees were associated with greater support of gun
permits. This might suggest that attitudes have become more uniform
across degree groups over time.
In this set of four tables, one was statistically significant and at least
one other was suggestive. To repeat, if a significance value were above
.05, say for example .60, it would imply that, under the null hypothesis of
independence between the row and column variables in the population, it
is quite likely (60%) that we could obtain the differences observed in our
sample. In other words the sample is consistent with the assumption of
independence of the variables. This occurred most recently in the gun
permits by education degree table.
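To make the logic of the test concrete, here is a minimal Python sketch of the Pearson chi-square calculation for a two-by-two table. The counts are hypothetical (they are not the GSS figures), the function name `pearson_chi_square` is an illustrative assumption, and the statistic is the uncorrected Pearson value, without the continuity correction.

```python
import math

def pearson_chi_square(table):
    """Return (chi-square, expected counts) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return chi2, expected

# Hypothetical counts, NOT the GSS data: rows = favor/oppose, columns = male/female.
counts = [[300, 450],
          [120, 80]]

chi2, expected = pearson_chi_square(counts)
# For a 2x2 table (1 degree of freedom) the upper-tail chi-square probability
# has the closed form erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
min_expected = min(min(row) for row in expected)

print(f"chi-square = {chi2:.2f}, p = {p_value:.6f}, min expected = {min_expected:.1f}")
```

The closed-form p-value used here holds only for one degree of freedom; larger tables need the general chi-square distribution, which SPSS computes for you.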

WHY IS THE SIGNIFICANCE CRITERION TYPICALLY SET AT .05?

In previous discussions we took a significance value of less than .05 to be
statistically significant. This means that if there is no effect in the
population, the measured effect in the sample must be so large as to occur
less than 5% of the time by chance alone. Choosing 5% (or 1%) as the
cutoff for statistical significance is statistically arbitrary and stems from
tradition. During the early years (which preceded computers) the critical
or cutoff values for the chi-square test were obtained by very labor
intensive calculations. As a result, tables for very few cutoff values were
calculated (including .10, .05 and .01). These values have become the
standards when performing statistical tests, but there is nothing
sacrosanct about them. As we discussed in Chapter 1, the significance
cutoff value (or alpha value) reflects the false positive rate you are willing
to tolerate. However, since the .05 cutoff is widely adopted, if you decide
to use a different value (say .10 or .15) you should be prepared to justify
your selection.

ASSOCIATION MEASURES

We have discussed the concept of statistical significance in
crosstabulation tables and examined several tables with this in mind.
Recall that in this context a claim of statistical significance implies there
is a relationship between the row and column variables in the population.
We viewed the percentages in the table to describe the relation and
determine the magnitude of the differences. Researchers felt it would be
useful to have a single number to quantify the strength of the association
between the row and column variables. With such a measure one could
compare different tables and speak of relative strength of association or
effect. To this end many association measures were developed. They are
typically normed so as to range between 0 (no association) and 1 (perfect
association). Those assuming ordinal measurement in the variables are
scaled from -1 to +1, the extremes representing perfect negative and
positive association, respectively; here zero would indicate no ordinal
association. One reason for the large number of measures developed is
that there are many ways two variables can be related in a large
crosstabulation table. In addition, depending on the level of measurement
(for example, nominal versus ordinal), different aspects of association
might be relevant. Association measures tend to be used in academic and
medical research studies, less so in applied work such as market
research. In market research you typically display the crosstabulation
table for examination, rather than focus on a single summary.
We will review some general characteristics of the association
measures, but not consider them in great detail. For more involved
discussion of association measures for nominal variables see Gibbons
(1993), while a more complete but technical reference is Bishop, Fienberg
and Holland (1975).
First, some general points:

• Some measures of association are based on the chi-square
  values; others are based on probabilistic considerations. The
  latter class is usually preferred, since chi-square based values
  have no direct, intuitive interpretation.

• Some measures of association assume a certain level of
  measurement (for example, dichotomous, nominal, ordinal).
  Consider this when choosing a particular measure.

• Some measures are symmetric, that is, do not vary if the row
  and column variables are interchanged. Others are
  asymmetric and must be interpreted in light of a causal or
  predictive ordering that you conceive between your variables.

• Measures of association for crosstabulation tables are
  bivariate (two-variable). In general, multivariate (three or
  more variables) extensions do not exist. To explore association
  in higher order tables you must turn to a method called loglinear
  modeling (implemented in the SPSS Genlog and Hiloglinear
  procedures of the SPSS Advanced Models option: see the
  Loglinear choice under the Analyze menu). Such analyses are
  briefly mentioned at the end of this chapter, but are beyond
  the scope of this course.

ASSOCIATION MEASURES AVAILABLE WITHIN CROSSTABS

Chi-Square Based - Phi, Cramer's V, and the Contingency Coefficient are
measures of association based on the chi-square value. Their early
advantage was convenience: they could be readily derived from the already
calculated chi-square. Values range from 0 to 1. Their disadvantage is
that there is no simple, intuitive interpretation of the numeric values.
Lambda and Goodman & Kruskal's Tau are probabilistic or PRE
(proportional reduction in error) measures suitable for nominal scale
data. They are measures attempting to summarize the extent to which
the category value of one variable can be predicted from the value of the
other. These measures are asymmetric and are reported with each
variable predicted from the other.
Ordinal Probabilistic Measures - Kendall's Tau-b, Tau-c, gamma and
Somers' d are all probabilistic measures appropriate for ordinal tables.
Values range from -1 to +1, and reflect the extent to which higher
categories (based on the data codes used) of one variable are associated
with higher categories of the second variable.
Pearson's r is the standard correlation coefficient, which assumes both
variables are interval scaled. If this association were the main interest in
the analysis, such correlations can be obtained directly from the
correlation procedure.
Eta is asymmetric and assumes the dependent measure is interval scale
while the independent variable is nominal. It measures the reduction in
variation of the dependent measure when the value of the independent
variable is known. It can also be produced by the Means
(Analyze..Compare Means..Means) and GLM (Analyze..General Linear
Model..Univariate) procedures.
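As an illustration of how the chi-square based measures follow from the chi-square value, here is a minimal Python sketch using the standard textbook formulas. The counts are hypothetical (not the GSS data) and the function name `chi_square_measures` is an illustrative assumption; SPSS produces these values directly through the Crosstabs Statistics dialog.

```python
import math

def chi_square_measures(table):
    """Phi, Cramer's V and the contingency coefficient for a table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals))
    return {
        "chi2": chi2,
        "phi": math.sqrt(chi2 / n),
        "cramers_v": math.sqrt(chi2 / (n * (smaller_dim - 1))),
        "contingency": math.sqrt(chi2 / (chi2 + n)),
    }

# Hypothetical counts, NOT the GSS data: 2 rows by 3 columns.
table = [[40, 60, 30],
         [160, 140, 170]]
measures = chi_square_measures(table)

for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Because the table has only two rows, phi and Cramer's V coincide here; they differ only when both dimensions exceed two.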

Association measures often used in health research are kappa and
relative risk. Strictly speaking, kappa is a normed measure of agreement,
while relative risk compares the risk of an outcome (often a negative
one) between two groups and is not bounded as the other association
measures are.
These association measures are found in the Crosstabs Statistics
dialog box. We will request several measures for a new two-way table:
gun ownership by education degree. Here both nominal and ordinal
measures of association might be desirable.

Click on the Dialog Recall tool, and then click Crosstabs

Click the Reset pushbutton


Move owngun into the Row(s) list box
Move degree into the Column(s) list box
Click the Cells pushbutton
Click the Column check box in the Percentages area
Click Continue
Click the Statistics pushbutton
Click to check Chi-square, Lambda, Gamma, and Kendall's
tau-c
Figure 5.12 Association Measures in Crosstabs

Click Continue
Click OK

The association measures are grouped by level of measurement
assumed for the variables. We checked lambda (which will also produce
Goodman & Kruskal's Tau) along with Kendall's tau-c and the gamma
coefficient. The SPSS command to run this analysis is shown below.
CROSSTABS
/TABLES owngun BY degree
/STATISTIC CHISQ LAMBDA GAMMA CTAU
/CELLS COUNT COLUMN .
The desired association measures are listed on the STATISTICS
subcommand. First we review the crosstab table.
Figure 5.13 Gun in the Home and Education Degree

There is a statistically significant relationship: both the highest and
the lowest degree levels are associated with lower rates of gun ownership
(yes is coded 1). However, note that for every education
degree category, the majority of respondents report no gun in the home.
We view the association measures below.

Figure 5.14 Association Measures - Gun in Home and Education Degree

The column labeled Value contains the actual association measures.
The most striking aspect is that they are all very near zero (very modest,
if any, association). This is explained in part by the fact that all degree
groups had majorities with no gun in the home. If some degree groups
had majorities with guns in the home, the probabilistic association
measures would be higher since your prediction of having a gun in the
home would vary depending on which degree group was involved. Thus
we have a situation in which there is a statistically significant result, but
the level of association is modest. Note that the ordinal measures (gamma
and Tau-c) are also near zero. For an ordinal measure to be substantial,
the proportion of respondents having a gun in the home would need to
increase (or decrease) as education degree increases. Recall that the
middle (high school, community college) education groups had greater
proportions of respondents with guns in the home than any of the more
extreme (less than high school, bachelor, and graduate) groups. Thus
there is no consistent ordered relationship, as reflected in the ordinal
association measures. The other columns are somewhat technical and we
will not pursue them here (see the references cited earlier in this section).
However they are used when you wish to perform statistical significance
tests on the association measures themselves: that is, you wish to test
whether an association measure differs from zero in the population.
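The two points made above (lambda stays at zero when every group has the same majority answer, and gamma stays near zero when the pattern is not consistently ordered) can be checked with a minimal Python sketch. The counts are hypothetical and only mimic the shape described in the text; the function names are illustrative assumptions, and the formulas are the usual textbook definitions of Goodman and Kruskal's lambda and gamma.

```python
def lambda_row_dependent(table):
    """Goodman-Kruskal lambda predicting the row variable from the column variable."""
    n = sum(sum(row) for row in table)
    # Prediction errors when always guessing the overall modal row category.
    errors_without = n - max(sum(row) for row in table)
    # Prediction errors when guessing the modal row category within each column.
    errors_with = n - sum(max(col) for col in zip(*table))
    return (errors_without - errors_with) / errors_without

def gamma(table):
    """Goodman-Kruskal gamma from concordant and discordant pairs."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical counts, NOT the GSS data. Rows: gun in home (yes, no).
# Columns: five degree groups ordered from lowest to highest.
middle_peak = [[30, 45, 45, 30, 30],
               [70, 55, 55, 70, 70]]
steady_rise = [[10, 20, 30, 40, 45],
               [90, 80, 70, 60, 55]]

print(lambda_row_dependent(middle_peak))   # every column has a "no" majority
print(round(gamma(middle_peak), 3))        # no consistent ordering
print(round(gamma(steady_rise), 3))        # ownership rises with degree
```

In the second hypothetical table gamma is negative simply because "yes" is the lower-coded row, so rising ownership pairs higher degree codes with the lower row code.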

GRAPHING CROSSTABULATION RESULTS

Percentages in a crosstabulation table can be displayed using clustered
bar charts. You can request bar charts based on counts directly from the
Crosstabs dialog box, but since we wish to display percentages, we
instead use the Graphs menu. A simple rule to apply for standard bar
charts is that the percents will sum to 100% within each subgroup of the
Cluster variable. Thus if in a standard bar chart we wish the percentages
to sum to 100% for the men, and separately for the women, then SEX
should be the Cluster variable. Since this is how we presented the
percentages in the crosstabulation table, we want to be consistent in the
representation. Here we will graph the percentages for gender and attitude
toward gun permits.
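The rule that percentages sum to 100% within each category of the Cluster variable can be sketched in a few lines of Python. The responses below are hypothetical (not the GSS data), and the variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical (sex, gunlaw) responses, NOT the GSS data.
responses = (
    [("male", "favor")] * 30 + [("male", "oppose")] * 10 +
    [("female", "favor")] * 54 + [("female", "oppose")] * 6
)

cell_counts = Counter(responses)
cluster_totals = Counter(sex for sex, _ in responses)

# Percentage of each answer within each cluster (sex) group.
percents = {
    (sex, answer): 100 * count / cluster_totals[sex]
    for (sex, answer), count in cell_counts.items()
}

for (sex, answer), pct in sorted(percents.items()):
    print(f"{sex:6s} {answer:6s} {pct:5.1f}%")
```

Each bar height is a share of its own cluster total, which is why the bars for men and the bars for women each add up to 100%.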
Click Graphs..Bar
Click the Cluster icon, and then click Define
Click % of Cases in the Bars Represent area
Move gunlaw into the Category Axis box
Move sex into the Define Clusters by box
Click the Options pushbutton
Deselect the Display groups defined by missing values check
box
Click Continue
Figure 5.15 Define Clustered Bar Dialog Box

Click OK
The command to obtain the same chart in SPSS would be:
GRAPH
/BAR(GROUPED)=PCT BY gunlaw BY sex .

Figure 5.16 Bar Chart of Attitude Toward Gun Permits by Gender

We now have a direct visual comparison between the men and women
to supplement the crosstabulation table and significance tests. This graph
might be useful in a final presentation or report.

Hint

You can create a bar chart directly from the values in the crosstabs pivot
table. To do so, double-click on the crosstabs pivot table to activate the
Pivot Table Editor, then select (Ctrl-click) all table values, for example
column percents except for totals, that you wish to plot. Then right-click
and select Create Graph..Bar from the Context menu. A bar chart will be
inserted in the Viewer window, following the pivot table.

THREE-WAY TABLES

Thus far we have examined several two-variable tables. To explore more
complex interactions we turn to three- and higher-way tables. Within the
Crosstabs procedure a three-way table is composed of a series of two-way
tables, each individual table based on responses from a single category of
the third variable. This third variable is sometimes called the control
variable since it determines the composition of each subtable. If a
complete analysis of a multi-way (three-way or higher) table is desired, a
statistical technique called loglinear modeling can be used. This advanced
technique is beyond the scope of this presentation, but procedures
performing loglinear analysis are available within SPSS.

We will illustrate a three-way table using the table of attitude toward
gun permits by gender as a basis. Suppose we are interested in seeing
how gun ownership might interact with the previously observed
relationship between gender and attitude toward gun permits. To explore
this question we specify the gun-in-home question (OWNGUN) as the
control (or layer) variable in the crosstabulation analysis. In this way we
can view a table of attitude toward gun permits by gender separately for
those who do, and then for those who don't, have a gun in the home. We
will request a chi-square test of independence for each subtable.

Click on the Dialog Recall tool, and then click Crosstabs

Click the Reset pushbutton


Move gunlaw into the Row(s) list box
Move sex into the Column(s) list box
Move owngun into the Layer list box
Click on the Cells pushbutton and click the Column check box
under Percentages
Click Continue
Click Statistics and click the Chi-square check box
Click Continue
Figure 5.17 Crosstabs Dialog Box for Three-Way Table

Click OK
As before, GUNLAW (attitude toward gun permits) and SEX are,
respectively, the row and column variables, but OWNGUN (gun in the
home) is added as a layer (or control) variable. Note that OWNGUN is in
the first layer. If additional control variables are to be used, they can be
added at higher-level layers. Although not shown, we asked for Column
percents in the Cells dialog box and the Chi-square test from the
Statistics dialog box.
The following syntax command will do this analysis.
CROSSTABS
/TABLES gunlaw BY sex BY owngun
/STATISTIC CHISQ
/CELLS COUNT COLUMN .
The second occurrence of the keyword BY separates the layer variable
(OWNGUN) from the column variable (SEX). To expand to a four-way
table we would add the keyword BY and an additional control variable to
the end of the TABLES subcommand. We view the three-way table below.
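The idea that a three-way table is a series of two-way tables, one per category of the layer variable, can be sketched in Python. The records are hypothetical (not the GSS data) and the name `subtables` is an illustrative assumption.

```python
from collections import defaultdict

# Hypothetical (owngun, sex, gunlaw) records, NOT the GSS data.
records = [
    ("yes", "male", "favor"), ("yes", "male", "oppose"),
    ("yes", "female", "favor"), ("no", "male", "favor"),
    ("no", "female", "favor"), ("no", "female", "oppose"),
]

# One two-way table (gunlaw by sex) per category of the layer variable.
subtables = defaultdict(lambda: defaultdict(int))
for owngun, sex, gunlaw in records:
    subtables[owngun][(gunlaw, sex)] += 1

for layer, cells in subtables.items():
    print(f"owngun = {layer}: {dict(cells)}")
```

Each subtable is then summarized and tested on its own, which is what Crosstabs does when a layer variable is supplied.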
Figure 5.18 Gun Permits and Gender and Gun in Home

Figure 5.19 Chi-Square Statistics for Three-Way Table

We see that for respondents who have a gun in the home, there is a
large (19%) and significant difference between men and women. Recall
from Figure 5.10 that the original attitude toward gun permits by gender
crosstab table showed a 15% difference. The result here is consistent
with, but looks more pronounced than, the original table. For
respondents in homes without guns a somewhat different pattern
emerges. Here the percentages of men and women favoring gun permits
differ significantly, yet seem closer (a 5%
difference: 84.5% versus 89.8%) than the male-female difference for those
with guns in the home (a 19% difference). Thus there is a suggestion that
the male-female difference in attitude toward gun permits is more
pronounced in households with guns. This could be formally tested (test
for presence of a three-way interaction) using a loglinear model (note: we
ran this test and it was not significant, p = .0625).
Considering the two tables, we conclude there is a gender difference
in attitude toward gun permits regardless of whether a gun is in the
home. In addition, the difference seems more pronounced in homes with
guns, although the Crosstabs output does not test this (note: when we
tested it with a loglinear model, it was not significant). To carry this
point further, suppose there was a significant male-female difference in
homes with guns, but no male-female difference in homes without guns. This
would modify our original conclusion based on the two-way table, since
we would know a third factor is relevant. If this occurred, the next step
for the analyst would be to explain it. As an exercise, can you suggest
reasons that might account for the patterns seen in this three-way table
(greater male-female difference in households with guns)?

EXTENSIONS

Analysis of multi-way tables (higher than two-way) can be performed
using a technique called loglinear modeling. This method requires
statistical sophistication and is beyond the domain of our course. SPSS
has several procedures (Genlog, Loglinear and Hiloglinear) to perform
such analyses. They provide a way of determining which variables relate
to which others in the context of a multi-way crosstab (also called
contingency) table. These procedures could be used to explicitly test for
the three-way interaction suggested above (it was found to be not
significant). For an introduction to this methodology see Fienberg (1977).
Academic researchers often use such models to test hypotheses based on
survey data.
A second, related technique is used by data analysts who need to
predict, as accurately as possible, into which outcome group an individual
will fall, based on potentially many nominal background variables. For
example, an insurance company is interested in the combination of
demographics that best predict whether a client is likely to make a claim.
Or a direct mail analyst is interested in the combinations of background
characteristics that yield the highest return rates. Here the emphasis is
less on testing a hypothesis and more on a heuristic method of finding the
optimal set of characteristics for prediction purposes. One methodology
taking this approach is called CHAID (chi-square automatic interaction
detection), a type of decision-tree methodology, which is available along
with other decision-tree methods in the SPSS AnswerTree product.
Occasionally there is interest in testing whether a frequency table
based on sample data is consistent with a distribution specified by the
analyst. This test (one sample chi-square) is available within the SPSS
nonparametric procedure (click Analyze..Nonparametric Tests..Chi-Square).

SUMMARY

We explored the relationships between nominal variables using
crosstabulation tables. Statistical tests of independence were discussed,
the results interpreted, and potential problems with sample size
considered. Association measures were touched upon, as were graphics to
represent the crosstabulation tables. A three-way table was examined
and direction was given for more complex analyses.


Chapter 6 Exploratory Data Analysis: Interval Scale Data

Objective

Examine interval scale variables using methods of exploratory data
analysis.

Method

Use Frequencies to build a frequency table of age first married and
request a histogram plot. Run the Explore procedure to produce
summaries and plots of several interval scale variables (age when first
married, satisfaction over several areas of life, and number of hours per
day spent watching TV).

Data

The General Social Survey, 1994.

Scenario

One of the aims of the overall analysis is to compare demographic groups
on age first married, a measure of satisfaction with life, and on the
amount of time spent watching TV. Before proceeding with the group
comparisons, we look at summaries of these measures across the entire
sample.

INTRODUCTION

In Chapter 4 we chose frequency tables containing counts and
percentages as the appropriate summaries for individual categorical
(nominal) variables. If the variables of interest are interval scale we
can expand the summaries to include means, medians and standard
deviations. Counts and percentages may still be of interest, especially
when the variables can take only a limited number of distinct values. For
example, when working with a one to five point rating scale we might be
very interested in knowing the percentage of respondents who reply
Strongly Agree. However, as the number of possible response values
increases, frequency tables based on interval scale variables become less
useful. Suppose we asked respondents for their family income to the
nearest dollar? It is likely that each response would have a different
value and so a frequency table would be quite lengthy and not
particularly helpful. In short, while there is nothing incorrect about a
frequency table based on an interval scale variable with many values, it
is not a very effective or efficient summary. We will illustrate this point
by looking at a frequency table of age when first married.
For interval scale variables such statistics as means, medians and
standard deviations are often used. Several procedures within SPSS
(Frequencies, Case Summaries and Examine) can produce them; we will
examine them in conjunction with other numeric summaries. In addition
such graphs as histograms, stem & leaf and box & whisker plots, are
designed to display information about interval scale variables. We will
see examples of each.

Exploratory Data Analysis: Interval Scale Data 6 - 1

FREQUENCY TABLES AND HISTOGRAMS

First we request a frequency table of respondent age when first married
(AGEWED), along with some summary statistics and a histogram plot.
Click File..Open..Data (and move to c:\Train\Stats)
Select SPSS Portable (*.por) from the Files of Type drop-down
list
Double click on GSS94.por
Click Analyze..Descriptive Statistics..Frequencies
Move agewed into the Variable(s) list box
Figure 6.1 Frequencies Dialog Box - Age First Married

This choice will produce a frequency table for AGEWED. In order to
add summary statistics we click on the Statistics pushbutton. Here we
indicate (by check boxes) that we wish the mean, median and standard
deviation of age when first married. Many statistics, including
percentiles, are available. If the Frequencies procedure were our primary
analysis we would have asked for additional summaries (minimum,
maximum). Recall that our purpose here is to illustrate the limitations of
frequency tables for variables taking on many values.
Click the Statistics pushbutton
Check Std. deviation, Mean and Median

Figure 6.2 Frequencies: Statistics Dialog Box

Click Continue

HISTOGRAMS

The histogram is designed to display the distribution (range and
concentration) of an interval or ratio variable that takes many different
values. A bar chart contains one bar for each distinct data value. When
there are many possible data values and few observations at any given
value, a bar chart is less useful than a histogram. In a histogram,
adjacent data values are grouped together so that each bar represents the
same range of data values (for age when first married, perhaps 5 years).
With this chart we can see the general distribution of data regardless of
how many distinct data values are present. As we discussed earlier, for a
one to five or one to seven point rating scale, a bar chart is appropriate.
On the other hand a bar chart of age when first married would contain
many bars and gaps in ages (ages at which no one was married) would
not be displayed. For these reasons, a histogram is a better choice.
Histograms can be requested from the Frequencies dialog box (or
Frequencies command) or directly from the Graphs menu.
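The grouping step that distinguishes a histogram from a bar chart can be sketched in Python. The ages are hypothetical (not the GSS data), and the 5-year bin width follows the example mentioned above; SPSS chooses its own bin boundaries when it draws the chart.

```python
from collections import Counter

# Hypothetical ages at first marriage, NOT the GSS data.
ages = [17, 18, 19, 19, 20, 21, 21, 22, 23, 24, 26, 29, 33, 41, 52]
bin_width = 5  # each bar covers the same 5-year range

# Map each age to the lower edge of its bin, e.g. 17 -> 15, 21 -> 20.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

for lower in range(min(bin_counts), max(bin_counts) + bin_width, bin_width):
    # Empty bins are still printed, so gaps in the ages remain visible.
    print(f"{lower:2d}-{lower + bin_width - 1:2d} | {'*' * bin_counts[lower]}")
```

A bar chart of the same values would show one bar per distinct age and would silently skip the empty ranges.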
Click the Charts pushbutton
Click the Histograms option button

Figure 6.3 Frequencies: Chart Dialog Box

In addition to obtaining the histogram, we can ask that the normal
bell-shaped curve be superimposed on the plot. Since we are not
interested in the normality of age when first married (can you suggest
why we should not expect age when first married to follow a normal
distribution?) we skip this option.
Click Continue
Click OK
To run the same analysis using SPSS command syntax we use the
Frequencies command below.
FREQUENCIES
VARIABLES agewed
/STATISTICS STDDEV MEAN MEDIAN
/HISTOGRAM
/ORDER ANALYSIS.
We now view the frequency table, summary statistics and histogram.

Figure 6.4 Frequency Table of Age First Married (Beginning)

Beware of frequency tables for continuous variables or variables with
many values, for they can take many pages to print. Looking at the
numbers, is it easy to see the distribution of age first married? On the
other hand, if our interest is in knowing what percentage of the sample
are married at a given age, then the frequency table is quite useful.
Similarly the frequency table can be used to obtain cumulative
percentages and to consider cutoff points for collapsing categories. Also
note the beginning of the frequency table; as a data check do the first few
values seem reasonable?
Figure 6.5 Summary Statistics

The mean age when first married is 22.6. The median (50th percentile
value) is 22. The reason for this discrepancy between the two measures of
central tendency is that a few respondents married relatively late in life.
Such relatively extreme values influence the mean more than the
median. Medians are known to be resistant (robust) to extreme scores
and are sometimes preferred because of this. The standard deviation is
about 5 years, which indicates there is a fair amount of variation among
respondents in age when first married.
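The different sensitivity of the mean and the median to a few late marriages can be seen in a minimal Python sketch. The ages are hypothetical (not the GSS data).

```python
import statistics

# Hypothetical ages at first marriage, NOT the GSS data; one late marriage at 48.
ages = [18, 19, 20, 21, 22, 22, 23, 24, 25, 48]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Dropping the single extreme value moves the mean but not the median.
mean_without_late = statistics.mean(ages[:-1])
median_without_late = statistics.median(ages[:-1])

print(f"all cases:      mean = {mean_age:.1f}, median = {median_age:.1f}")
print(f"without age 48: mean = {mean_without_late:.1f}, median = {median_without_late:.1f}")
```

The mean is pulled upward by the one extreme value, while the median stays at the middle of the ordered ages.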
Figure 6.6 Histogram of Age When First Married

Does this plot seem useful in describing age when first married? We
see the most common ages when first married concentrate in the late
teens to early twenties. The distribution is not symmetric: no one was
married before their early teens, while at the high end, some respondents
married in their forties or fifties. To the eye, the plot seems truncated on
the left-hand side. Technically speaking, this distribution would be
described as skewed to the right or as having positive skewness. We will
discuss skewness in more detail shortly. In summary, we might say that
unless we are interested in the exact ages when first married, the
frequency table contained too much detail, while the statistical
summaries and histogram were more useful and succinct.

EXPLORATORY DATA ANALYSIS

Bar charts and histograms, as well as such summaries as means and
standard deviations have been used in statistical work for many years.
Sometimes such summaries are ends in their own right; other times they
constitute a preliminary look at the data before proceeding with more
formal methods. Seeing limitations in this standard set of techniques,
John Tukey, a statistician at Princeton and Bell Labs, devised a collection
of statistics and plots designed to reveal data features that might not be
readily apparent from standard statistical summaries. In his book
describing these methods, entitled Exploratory Data Analysis, Tukey
(1977) described the work of a data analyst to be similar to that of a
detective, the goal being to discover surprising, interesting and unusual
things about the data. To further this effort Tukey developed both plots
and data summaries. These methods, called exploratory data analysis
and abbreviated EDA, have become very popular in applied statistics and
data analysis. Exploratory data analysis can be viewed either as an
analysis in its own right, or as a set of data checks and investigations
performed before applying inferential testing procedures.
These methods are best applied to variables with at least ordinal
(more commonly interval) scale properties and which can take many
different values. The plots and summaries would be less helpful for a
variable that takes on only a few values (for example, one to five scales).
We will apply EDA techniques to three items based on the General
Social Survey: an average satisfaction score, age when first married, and
number of hours of TV viewed per day.

AVERAGE SATISFACTION VARIABLE

The General Social Survey 1994 contains five questions asking about
respondent satisfaction with various aspects of life. The questions pertain
to satisfaction with the city or place lived in (SATCITY), family life
(SATFAM), friendships (SATFRND), health and physical condition
(SATHEALT), and non-working activities and hobbies (SATHOBBY).
Responses are made on a one to seven point scale measuring level of
satisfaction, where 1 = A Very Great Deal and 7 = None. To create an
overall or average satisfaction measure we take the average score across
the five questions for each respondent. In SPSS for Windows, this is done
within the Compute dialog box.
Click Transform..Compute
Type satmean in the Target Variable box
Select Mean(numexpr,numexpr,...) from the Function menu
and move it to the Numeric Expression box
Move the variables satcity, satfam, satfrnd, sathealt, and
sathobby to the Numeric Expression box
Make sure the variable names are separated by commas (,)
Figure 6.7 Computing the Average Satisfaction Score

Click OK
The resulting command appears below.
COMPUTE satmean = MEAN(satcity, satfam, satfrnd, sathealt,
sathobby) .
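Note that the SPSS MEAN function averages over the valid (non-missing) values it is given, so a respondent who skipped one of the five questions still receives a score. The following minimal Python sketch mimics that behavior; the function name `mean_of_valid` is a hypothetical helper, and `None` stands in for a missing value.

```python
def mean_of_valid(*values):
    """Average the non-missing values; return None if every value is missing."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

# Hypothetical respondents answering satcity, satfam, satfrnd, sathealt, sathobby.
complete = mean_of_valid(2, 1, 3, 2, 2)
one_missing = mean_of_valid(2, None, 3, 2, 1)   # averaged over four answers
all_missing = mean_of_valid(None, None, None, None, None)

print(complete, one_missing, all_missing)
```

If a minimum number of valid answers should be required, SPSS offers the MEAN.n form of the function (for example MEAN.3), which this sketch does not reproduce.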
After creating the variable satmean, we will perform exploratory
data analysis on the three variables of interest.
Click Analyze..Descriptive Statistics..Explore
Move satmean, agewed, and tvhours to the Dependent List:
box
Figure 6.8 Explore Dialog Box

The variables to be summarized (here SATMEAN, AGEWED,
TVHOURS) appear in the Dependent list box. The Factor list box can
contain one or more categorical (for example, demographic) variables, and
if used would cause the procedure to present summaries for each
subgroup based on the factor variable(s). We will use this feature in later
chapters when we look at mean differences between groups. By default,
both plots and statistical summaries will appear. We can request specific
statistical summaries and plots using the Statistics and Plots
pushbuttons. While not discussed here, the Explore procedure can
produce robust mean estimates (M-estimators) and lists of extreme
values, as well as normal probability and homogeneity plots.

OPTIONS WITH MISSING VALUES

Ordinarily SPSS excludes any observations with missing values when
running a procedure like Explore. When several variables are used (as
here) you have a choice as to whether the analysis should be based on
only those observations with valid values for all variables in the analysis
(called listwise deletion), or whether missing values should be excluded
separately for each variable (called pairwise deletion). When only a single
variable is considered both methods yield the same result, but they will
not give identical answers when multiple variables are analyzed in the
presence of missing values. The default method is listwise deletion, and
we will specifically request via the Options pushbutton that the
alternative method (pairwise) be used. This makes sense when we
consider that one of the variables, AGEWED, is asked only of those who
have been married. Why should we exclude responses to SATMEAN or
TVHOURS for those never married, and who thus have missing values
for AGEWED?
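The difference between the two deletion methods can be sketched by counting the cases each one keeps. The following Python fragment is an illustration only; the four cases are hypothetical and None marks a missing value.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"satmean": 2.4, "agewed": 22, "tvhours": 3},
    {"satmean": 3.0, "agewed": None, "tvhours": 1},     # never married
    {"satmean": None, "agewed": 25, "tvhours": 2},
    {"satmean": 1.8, "agewed": None, "tvhours": None},  # never married
]
variables = ["satmean", "agewed", "tvhours"]

def listwise_n(cases, variables):
    """Number of cases with valid values on every variable."""
    return sum(all(c[v] is not None for v in variables) for c in cases)

def pairwise_n(cases, variables):
    """Number of valid cases counted separately for each variable."""
    return {v: sum(c[v] is not None for c in cases) for v in variables}

print(listwise_n(cases, variables))
print(pairwise_n(cases, variables))
```

Under listwise deletion only the first case survives for all three variables, while pairwise deletion keeps every valid value of SATMEAN and TVHOURS regardless of whether AGEWED is missing.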
Click the Options pushbutton
Click the Exclude cases pairwise option button
Figure 6.9 Missing Value Options in Explore

The rarely used Report choice has SPSS include user-defined missing
values in frequency analyses but exclude them from summary statistics
and charts.
Click Continue
Click OK
The SPSS command to do this analysis appears below.
EXAMINE
VARIABLES=satmean agewed tvhours
/MISSING PAIRWISE.
EXAMINE is the syntax command name given to the procedure that
performs exploratory data analysis.

Note

Although SPSS presents statistics for all three variables within one pivot
table, we will present and discuss the summaries and plots for each
variable separately.

Figure 6.10 EDA Summaries for Average Satisfaction

The Explore procedure first provides a group of statistical summaries
for the average satisfaction score (SATMEAN). Recall that the higher the
score the less the satisfaction (1=Very Great Deal, 7=None). Explore
begins with information about missing data. The Case Processing Summary
pivot table (not shown) displays the number of valid and missing
observations. Here 510 cases (respondents) had valid values for the
composite satisfaction variable, while 2,482 (83%) were missing.
Ordinarily such a large percentage of missing data would set off alarm
bells for the analyst. However, in the General Social Survey, not all
questions are asked of all subjects, and this was the case for the
satisfaction questions in 1994.

MEASURES OF CENTRAL TENDENCY

Next several measures of central tendency appear. Such statistics
attempt to describe with a single number where values are typically
found, or the center of the distribution. The mean is the arithmetic
average. The median is the data value at the center of the distribution,
that is, half the data values are greater than, and half the data values
are less than, the median. Medians are resistant to extreme scores, and
so are considered robust measures of central tendency. The 5% trimmed
mean is the mean calculated after the extreme upper 5% and the extreme
lower 5% of the data values are dropped. Such a measure would be
resistant to small numbers of extreme or wild scores. Here the mean,
median and 5% trimmed mean are very close and their values (2.4 - 2.52)
suggest people are fairly satisfied with aspects of their lives (as measured
by SATMEAN). If the mean were considerably above or below the median
and trimmed mean, it would suggest a skewed or asymmetric
distribution. A perfectly symmetric distribution, for example the normal,
would produce identical expected means, medians and trimmed means.
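As a small illustration of how the three measures react to an extreme score, consider the Python sketch below. The twenty scores are hypothetical (they are not the satisfaction data), and the trimmed mean shown is a simplified version that drops only whole cases from each end; the SPSS calculation may treat fractional case counts differently.

```python
def mean(values):
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:          # odd number of values
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def trimmed_mean(values, percent=5):
    """Mean after dropping `percent` of the cases from each end (whole cases only)."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100
    return mean(ordered[k:len(ordered) - k])

# Hypothetical scores with one wild value (40).
scores = [1] + [2] * 6 + [3] * 7 + [4] * 4 + [5, 40]

print(mean(scores), median(scores), trimmed_mean(scores))
```

The single wild score pulls the mean well above the median and the trimmed mean, which stay together near the bulk of the data. A mean that sits apart from the other two in this way is the sign of skewness described above.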

VARIABILITY MEASURES

Explore provides several measures of the amount of variation across
individuals. They indicate to what degree observations tend to cluster
near the center of the distribution. Both the standard deviation and
variance (standard deviation squared) appear. For example, a standard
deviation of 0 would imply all observations had the same value (the
variation is zero), while a standard deviation of 2 (recall this variable is
scaled 1 to 7) would indicate considerable variation from individual to
individual. The standard error is an estimate of the standard deviation of
the mean if repeated samples of the same size were taken. It is used in
calculating the 95% confidence band for the sample mean discussed
below. Also appearing is the interquartile range, which is essentially the
range between the 75th and the 25th percentile values. Thus the
interquartile range represents the range including the middle 50 percent
of the sample. It is a variability measure more resistant to extreme scores
than the standard deviation. The interquartile range of 1.2 indicates that
the middle 50% of the sample lie within a range of 1.2 units on the
average satisfaction scale. We also see the minimum, maximum and
range. It is useful to check the minimum and maximum in order to make
sure no impossible data values are recorded (here values below 1 or above
7).

CONFIDENCE BAND FOR MEAN

The 95% confidence band (labeled confidence interval) has a technical
definition: if we were to repeatedly perform the study, on average we
would expect the 95% confidence bands to include the true population
mean 95% of the time. It is useful in that it combines measures of both
central tendency (mean) and variation (standard error of mean) to
provide information about where we should expect the population mean
to fall. Here the confidence band for the mean is very narrow (2.43 - 2.6)
so we have a fairly precise idea of the population mean for the average
satisfaction variable.
The 95% confidence band for the mean can be easily obtained from
the sample mean, standard deviation and sample size. The confidence
band is based on the sample mean, plus or minus 1.96 times the standard
error of the mean (1.96 is used because 95% of the area under a normal
curve is within 1.96 standard deviations of the mean). Since the sample
standard error of the mean (discussed in the preceding paragraph) is
simply the sample standard deviation divided by the square root of the
sample size, the 95% confidence band for the mean is equal to the sample
mean plus or minus 1.96 * (sample standard deviation/(square root
(sample size))). Thus if you have the sample mean, standard deviation
and number of observations, you can easily calculate the 95% confidence
band.
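The calculation just described can be written out in a few lines. This Python sketch is an illustration only; the mean, standard deviation and sample size passed to the function are made-up round numbers, not the values from Figure 6.10.

```python
import math

def confidence_band_95(sample_mean, sample_sd, n):
    """95% confidence band: mean plus or minus 1.96 standard errors."""
    standard_error = sample_sd / math.sqrt(n)
    half_width = 1.96 * standard_error
    return sample_mean - half_width, sample_mean + half_width

# Illustrative (assumed) values: mean 2.5, standard deviation 1.0, 400 cases.
low, high = confidence_band_95(sample_mean=2.5, sample_sd=1.0, n=400)
print(round(low, 3), round(high, 3))
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of cases halves the width of the band.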

SHAPE OF THE DISTRIBUTION

Skewness and Kurtosis provide numeric summaries about the shape of
the distribution of the data. While many analysts are content to view
histograms in order to make judgments regarding the distribution of a
variable, these measures quantify the shape. Skewness is a measure of
the symmetry of a distribution. It is normed so that a symmetric
distribution has zero skewness. Positive skewness indicates bunching on
the left and a longer tail on the right (for example, income distribution in
the U.S.); negative skewness follows the reverse pattern. The standard
error of skewness also appears, and we can use it to determine if the data
are significantly skewed. Consider average satisfaction. Since the
skewness value is positive and several standard errors from zero, the
distribution will exhibit this bunching to the left and a longer tail to the
right. We will see this shortly.
Kurtosis also has to do with the shape of a distribution and is a
measure of how the data are divided between the center and the tails of
the distribution. It is normed so that the normal curve has a kurtosis of
zero. As an example, a distribution with heavier tails and a sharper
central peak than a normal would have a positive kurtosis measure. A
standard error for kurtosis also appears. Since the kurtosis
value is .94, which is beyond two standard errors (2 * .22) of zero, average
satisfaction would be considered slightly non-normal. The shape of the
distribution can be of interest in its own right. In addition,
assumptions are made about the shape of the data distribution within
each group when performing significance tests on mean differences
between groups. This aspect will be covered in later chapters.
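The two-standard-error rule of thumb used above can be checked by hand. The Python sketch below uses the commonly published formulas for the standard errors of skewness and kurtosis, which depend only on the sample size; that these are exactly the formulas behind the Explore output is an assumption here, although for the 510 valid cases they reproduce the .22 quoted in the text.

```python
import math

def se_skewness(n):
    """Standard error of skewness for a sample of n cases."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of kurtosis for a sample of n cases."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

n = 510                   # valid cases for SATMEAN
kurtosis = 0.94           # value reported in the text
ratio = kurtosis / se_kurtosis(n)

print(round(se_kurtosis(n), 2), ratio > 2)
```

A ratio beyond 2 in absolute value is the informal signal, used in this chapter, that the statistic differs from the value of zero expected under a normal curve.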

STEM & LEAF PLOT

The stem & leaf plot is modeled after the histogram, but is designed to
provide more information. Instead of using a standard symbol (for
example, an asterisk * or block character) to display a case or group of
cases, the stem & leaf plot uses data values as the plot symbols. Thus the
shape of the distribution appears and the plot can be read to obtain
specific data values. The stem & leaf for average satisfaction appears
below.
Figure 6.11 Stem & Leaf Plot of Average Satisfaction

In a stem & leaf plot the stem is the vertical axis and the leaves
branch horizontally from the stem (Tukey devised the stem & leaf). The
stem width indicates how to interpret the units in the stem; in this case a
stem unit represents one point on a seven-point satisfaction rating scale.
A stem width of 10 would indicate that the stem value must be multiplied
by 10 to reproduce the original units of analysis. The actual numbers in
the chart (leaves) provide an extra decimal place of information about the
data values. To illustrate, one of the bottom rows of the stem & leaf
contains a stem value of 4 with several leaves of value 6. These represent
individuals whose average satisfaction score was 4.6. Thus besides
viewing the shape of the distribution we can pick out individual scores.
Below the diagram a note indicates that each leaf represents one case.
For large samples a leaf may represent multiple cases. In such
situations, an ampersand (&) is used to denote a partial leaf.
The last line identifies outliers. These are data points far enough
from the center (defined more exactly under Box & Whisker plots below)
that they might merit more careful checking. Extreme points might be
data errors or possibly represent a separate subgroup. The nearest outlier
(the one closest to the median) is listed. If the stem & leaf plot were
extended to include the outliers, then the positive skewness would be
apparent. These extreme values may contribute to the kurtosis as well.
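To show how stems and leaves are read, here is a minimal Python sketch that builds a plain stem & leaf display with a stem width of 1, where each leaf is the first decimal digit and stands for one case. The scores are hypothetical, and the sketch does not reproduce the extra features of the SPSS plot (frequencies, partial leaves, or the outlier line).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: leaves} with whole units as stems and tenths as leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)           # 4.6 -> 46
        rows[tenths // 10].append(tenths % 10)
    return {stem: "".join(str(d) for d in leaves) for stem, leaves in rows.items()}

# Hypothetical average satisfaction scores (non-negative, one decimal place).
scores = [1.8, 2.0, 2.2, 2.4, 2.4, 3.0, 3.6, 4.6, 4.6]

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} . {leaves}")
```

The row with stem 4 and leaves 66 represents two respondents whose average satisfaction score was 4.6, exactly as in the reading of the SPSS plot described above.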

BOX & WHISKER PLOT

The stem & leaf plot attempts to describe data by showing every
observation. In comparison, the box & whisker plot conveys information
about the distribution of a variable by displaying only a few summary
measures. The box & whisker plot also identifies outliers (data values
far from the center of the distribution). Below we see the box & whisker
plot (also called box plot) for average satisfaction.
Figure 6.12 Box & Whisker Plot of Average Satisfaction

The vertical axis represents the average satisfaction scale. In the
plot, the solid line inside the box represents the median. The hinges
provide the top and bottom borders to the box; they correspond to the
75th and 25th percentile values of average satisfaction, and thus define
the interquartile range (IQR). In other words, the middle 50% of data
values fall within the box. The whiskers are the last data values that lie
within 1.5 box lengths (or IQRs) of the respective hinges (edges of box).
Tukey considers data points more than 1.5 box lengths from a hinge to be
far enough from the center to be noted as outliers. Such points are
marked with a circle. Points more than 3 box lengths from a hinge are
viewed by Tukey as far out (extreme) points and are marked with an asterisk
symbol. This plot has several outliers and one extreme point. If a single outlier
exists at a data value, the case sequence number appears beside it (an ID
variable can be substituted), which aids data checking.
If the distribution were symmetric, then the median would be
centered within the hinges and the whiskers. In the plot above, the
disparate lengths of the whiskers and the outliers at the high end show
the skewness. Such plots are also useful when comparing several groups,
as we will see in later chapters.
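The outlier rules just described amount to measuring how far a point lies beyond the nearer hinge, in units of box length. The Python sketch below applies the 1.5 and 3 box-length cutoffs; the hinge values are hypothetical and are not read from Figure 6.12.

```python
def classify_point(value, lower_hinge, upper_hinge):
    """Label a data value using the 1.5 and 3 box-length rules."""
    box_length = upper_hinge - lower_hinge
    if value < lower_hinge:
        distance = lower_hinge - value
    elif value > upper_hinge:
        distance = value - upper_hinge
    else:
        distance = 0.0                    # inside the box
    if distance > 3 * box_length:
        return "extreme"                  # plotted with an asterisk
    if distance > 1.5 * box_length:
        return "outlier"                  # plotted with a circle
    return "within whiskers"

# Hypothetical hinges: 25th percentile 2.0, 75th percentile 3.0.
for score in (4.0, 5.0, 6.5):
    print(score, classify_point(score, 2.0, 3.0))
```

With a box length of 1.0, a score of 4.0 is within whisker range, 5.0 lies more than 1.5 box lengths above the upper hinge, and 6.5 lies more than 3 box lengths above it.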

EXPLORING AGE WHEN FIRST MARRIED

We now consider age when first married.


Figure 6.13 Exploratory Summaries of Age When First Married

The mean age first married is greater than the median. This suggests
a positive skew to the data, confirmed by the skewness statistic. Examine
the minimum and maximum values; do they suggest data errors? Which
other variables might you look at in order to investigate the validity of
these responses? We have valid data for 1,189 observations with 1,803
missing (these numbers appear in the Case Processing Summary pivot
table, not shown). If we turn back to the frequency table of marital status
in Chapter 4 (Figure 4.3), we find 614 people have never been married,
which accounts for roughly 1/3 of the missing data. Almost all the
remaining (all but 13 cases) missing data are due to the fact that the
question was not asked of all respondents in 1994. Thus, although about
60% of the responses to this question are missing, we have accounted for
them in a satisfactory manner.
Figure 6.14 Stem & Leaf Diagram for Age When First Married

Almost all leaves are zero (except for age 13) because the ages fall
within a fairly restricted range and were recorded in whole years. Age
13 is denoted with an & because it is a partial leaf. Thus except for
the outlier identification, the plot is equivalent to a histogram. Notice
that all the extreme values occur at the older ages; the individual first
married at age 13 is not identified as an outlier. This is because age 13 is
not that far from the bulk of the observations (many are married in their
late teens) while age 34 is. From a statistical perspective, age 13 is not an
outlier; however, from a social perspective it may very well be considered
unusual, and the case should be examined more closely for this reason.
Finally, do you notice any pattern of peaks and valleys to the plot?

Figure 6.15 Box & Whisker Plot for Age When First Married

The skewness is apparent from the outliers at the high end. Some of
these are marked as extreme points. While unusual relative to the data,
certainly people can first marry at these ages. If outliers appear in your
data you should check whether they are data errors. If not, consider
whether you wish them included in your analysis (some references were
given to this issue in Chapter 3). This is especially problematic when
dealing with a small sample (not the case here), since an outlier can
substantially influence the analysis. We now move to hours of TV
watched per day.
Figure 6.16 Exploratory Summaries of Daily TV Hours

The mean (2.82) is very near 3 hours, the trimmed mean is at 2.6 and
the median is 2. This suggests skewness. Do you notice anything
surprising about the minimum, maximum or range? Watching 24 hours
of TV a day is possible (?), but unlikely, so perhaps it is a result of
misunderstanding the question. The trimmed mean (2.6) is closer to the
mean (2.82) than to the median (2), indicating that the difference between
the mean and median is not solely due to the presence of outliers. The
stem & leaf diagram below, showing a heavy concentration of
respondents at 1 and 2 hours of TV viewing, suggests why the median is
at 2.
Figure 6.17 Stem & Leaf Diagram of Daily TV Hours

The stem & leaf identifies outliers on the high side. Other than that it
is of limited use since TVHOURS is recorded to the integer number of
hours and a relatively small number of values are chosen. This
consideration would apply when considering use of Explore for five-point
rating scales.

Figure 6.18 Box & Whisker Plot of Daily TV Hours

In addition to the asymmetry created by the large outliers, we see the
median is not centered in the box: it is closer to the lower edge (25th
percentile value). This is due to the heavy concentration of those viewing
0 through 2 hours of TV per day.
Notice how the box is squeezed into a small area of the chart due to
the outliers. The vertical scale shows negative values (here -10) in order
that the lower whisker is visually distinct from the horizontal axis. This
scale can be edited if desired.
We would not argue that something of interest always appears
through use of the methods of exploratory data analysis. However, you
can quickly glance over these results, and if anything strikes your
attention, then pursue it in more detail. The possibility of detecting
something unusual encourages the use of these techniques.

SAVING AN UPDATED COPY OF THE DATA

Because we will be doing further analysis on the variable satmean in the
next chapter, we need to save an updated copy of the GSS94 file.
Switch to the Data Editor window (click the Goto Data tool)

Click File..Save As and type GSS94 in the File name text box
(switch to the c:\Train\Stats folder if necessary)
Click Save

SUMMARY

In this chapter we individually examined interval scale (or stronger)
variables using the methods of exploratory data analysis. We suggested
that this is an important step before performing formal statistical tests.
The use and interpretation of Stem & Leaf diagrams and Box & Whisker
plots were discussed.


Chapter 7 Mean Differences Between Groups I: Simple Case

Objective

Understand the logic and procedure of testing for mean differences
between two or more population groups. Perform a t test analysis
comparing two groups and interpret the results.

Method

Discuss the concepts involved in testing for mean differences between
groups. Run the T Test procedure to compare men and women on two
measures: average age when first married, and an overall satisfaction
measure.

Data

General Social Survey, 1994.

Scenario

We wish to explore any differences between men and women on two
measures: overall satisfaction and age when first married. Since both
measures are at least interval scale and can take on many values, we will
summarize the groups using means. Our goal is to draw conclusions
about population differences based on our sample.

INTRODUCTION

In Chapter 5 we performed statistical tests in order to draw
conclusions about population group differences (gender, formal
education degree groups) on several attitude/belief questions
(attitude toward gun permits, belief in an afterlife). These measures had
two possible responses (favor/oppose, yes/no) that represented categories
or ordered categories, so we summarized the data in crosstabulation
tables and applied the chi-square test of independence. When our purpose
is to examine group differences on interval scale outcome measures, we
turn to means as the summary statistic. As discussed earlier, means
provide measures of central tendency, and so we use differences in
sample means to draw conclusions about differences in the populations.
For example, we will compare men and women in their mean age when
first married. We use the mean because it provides a simple and
comprehensible measure of central tendency. Also, from a statistical
perspective, the properties of sample means are well known, which
facilitates testing. If we displayed a crosstabulation table of age when
first married by gender, we would have a rather large table with few
observations in many of the cells; using means we have simple, concise,
and useful summaries.
When performing tests of mean differences we need to make more
assumptions than we did in our chi-square analysis of crosstabulation
tables. Specifically, we assume the dependent measure follows the same
normal bell-shaped curve within each comparison group. Also, the
distributions used when testing will be the t and F, rather than
chi-square. This is because the properties of sample means are different from
those of counts appearing in a crosstabulation table.
In this chapter, we outline the logic involved when testing for mean
differences between groups, then perform an analysis comparing two
groups. Later chapters will generalize the method to cases involving
additional groups.

LOGIC OF TESTING FOR MEAN DIFFERENCES

The aim of statistical tests on means is to draw conclusions about
population differences based on the observed sample means. To provide a
context for this discussion, we view a box & whisker plot showing three
groups (A, B and C) sampled from populations that are distinctly
different in mean level.
Figure 7.1 Samples from Three Very Different Populations

We see that the groups are well separated: there is no overlap
between members of any sample group and either of the remaining two.
In one sense a statistical test is almost superfluous since the groups are
so disparate, but if performed we would find highly significant
differences.
Next we turn to a case in which the groups are samples from the
same population and show no differences.

Figure 7.2 Three Samples from Same Population

Here there is considerable overlap of the three samples; the medians
and other summaries match almost identically across the groups. If there
are any true differences between the population groups they are likely to
be extremely small and not have any practical importance.
When there are modest population differences, we might obtain the
result below.
Figure 7.3 Samples from Three Modestly Different Populations

There is some overlap among the three groups, but the sample means
(medians here) are different. In this instance a statistical test would be
valuable to assess whether the sample mean differences are large enough
to justify the conclusion that the population means differ. This last plot
represents the typical situation facing a data analyst.
As we did when we performed the chi-square test, we formulate a null
hypothesis and use the data to evaluate it. First assume the population
means are identical, and then determine if the differences in sample
means are consistent with this assumption. If the probability of obtaining
sample means as far (or further) apart as we find in our sample is very
small (less than 5 chances in 100 or .05), assuming no population
differences, we reject our null hypothesis and conclude the populations
are different.
In order to implement this logic, we compare the variation among
sample means with the variation of individuals within each sample
group. The core idea is that if there were no differences between the
population means, then the only source for differences in the sample
means would be the variation among individual observations (since the
samples contain different observations), which we assume is random. If
we then compute a ratio of the variance among sample means (scaled by
the group sample size) to the variance among individual observations
within each group, we would expect this ratio to be about 1 if there are
no population differences.
When there are true population differences, the variation among sample
means would be due to two sources: variation among individuals, and the
true population difference in means. In this latter case we would expect
the ratio of variances to be greater than 1. Under the assumptions made
in analysis of variance, this variation ratio follows a known statistical
distribution (F). Thus the result of performing the test will be a
probability indicating how likely we are to obtain sample means as far
apart (or further) as we observe in our sample if the null hypothesis were
true. If this probability is very small, we reject the null hypothesis and
conclude there are true population differences.
This concept of taking a ratio of the variation among group means
(between-group) to the variation of individuals within groups (within-group) is
fundamental to the statistical method called analysis of variance. It is
implicit in the simple two-group case (t test), and appears explicitly in
more complex analyses (general ANOVA).
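The between-group to within-group ratio can be computed directly for a small example. The Python sketch below follows the usual one-way analysis of variance arithmetic (sums of squares divided by their degrees of freedom); the three groups are hypothetical, and the sketch stops at the ratio itself without looking up the probability from the F distribution.

```python
def f_ratio(groups):
    """Ratio of between-group to within-group mean squares."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individuals around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical samples A, B and C.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(f_ratio(groups))
```

Here the group means (2, 3 and 4) differ by as much as the spread of individuals within a group, and the ratio comes out at 3. Whether a ratio of that size is significant depends on the F distribution with the corresponding degrees of freedom, which SPSS evaluates for you.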

ASSUMPTIONS

When statistical tests of mean differences are applied (t test, F test), at
least two assumptions are made. First, that the distribution of the
dependent measure within each population subgroup follows the normal
distribution (normality). Second, that its variation is the same within
each population subgroup (homogeneity of variance). When these
assumptions are met, the t and F tests can be used to draw inferences
about population means. We will discuss each of these assumptions as it
applies in practice and see whether they hold in our data.
Normality of the dependent measure within each group is formally
required when statistical tests (t, F) involving mean differences are
performed. However, these tests are not much influenced by moderate
departures from normality. This robustness of the significance tests holds
especially when the sample sizes are moderate to large (say over 25) and
the dependent measure has the same distribution (for example, skewed to
the right) within each comparison group. Thus while normality is
assumed when performing the significance tests, the results are not much
affected by moderate departures from normality (for discussion and
references, see Kirk (1968) and for an opposing view see Wilcox (1996,
1997)). In practice, researchers often examine histograms, stem & leaf
plots, and box & whisker plots to view each group in order to make this
determination. If a more formal approach is preferred, the Explore
procedure can produce more technical plots (normal probability plots) and
statistical tests of normality (see the second appendix to this chapter). In
situations where the sample sizes are small or there are gross deviations
from normality, researchers often shift to nonparametric tests. These
tests are not emphasized in this course, but many are available in SPSS.
The second assumption, homogeneity of variance, indicates that the
variance of the dependent measure is the same for each population
subgroup. Under the null hypothesis we assume the variation in sample
means is due to the variation of individual scores, and if different groups
show disparate individual variation, it is difficult to interpret the overall
ratio of between-group to pooled within-group variation. This directly
affects significance tests. Based on simulation work, it is known that
significance tests of mean differences are not much influenced by
moderate lack of homogeneity of variance if the sample sizes of the
groups are about the same. If the sample sizes are quite different, then
lack of homogeneity (heterogeneity) is a problem in that the significance
test probabilities are not correct. When comparing means from two
groups (t test) and one-factor ANOVA (see Chapter 8) there are
corrections for lack of homogeneity. In the more general analysis a simple
correction does not exist. It is beyond the scope of this course, but it
should be mentioned that if there is a relationship or pattern between the
group means and standard deviations (for example, if groups with higher
mean levels also have larger standard deviations), there are sometimes
data transformations that when applied to the dependent variable will
result in homogeneity of variance. Such transformations can entail
additional complications, but provide a method of meeting the
homogeneity of variance requirement. The Explore procedure's Spread &
Level plot can provide information as to whether this approach is
appropriate and can suggest the optimal data transformation to apply to
the dependent measure.
To oversimplify, when dealing with moderate or large samples and
testing for mean differences, normality is not always important. Gross
departures from homogeneity of variance do affect significance tests when
the sample sizes are disparate.

SAMPLE SIZE

Generally speaking, tests involving comparisons of sample means do not
require any specific minimal sample size. There must be at least one
observation in each group of interest and at least one group with two or
more observations in order to obtain an estimate of the within-group
variation. While these requirements are quite modest, the more
important point regarding sample size is that of statistical power: your
ability to detect differences that truly exist in the population. As your
sample size increases, the precision with which means and standard
deviations are estimated increases as well, as does the probability of
finding true population differences (power). Thus larger samples are
desirable from the perspectives of statistical power and robustness (recall
our discussion of normality), but are not formally required.
Also, these analyses do not demand that the group sizes be equal.
However, analyses involving tests of mean differences are more resistant
to certain assumption violations (homogeneity of variance) when the
sample sizes are equal (or near equal). In more complex analysis of
variance (covered in later chapters) equal (or proportional) group sample
sizes bring assurance that the various factors under investigation can be
looked at independently. Finally, equal sample size conveys greater
statistical power when looking for any differences among groups. So, in
summary, equal group sample sizes are not required, but do carry
advantages. This is not to suggest that you should drop observations from
the analysis in order to obtain equal numbers in each group, since this
would throw away information. Rather, think of equal group sample size
as an advantageous situation you should avail yourself of when possible.
In experiments equal sample size is usually part of the design, while in
survey work it is rarely seen.

EXPLORING THE DIFFERENT GROUPS

In this analysis we wish to determine if there are population gender
differences in overall satisfaction and in age when first married. Before
performing tests, you are advised to run the exploratory data analysis
procedures we discussed in Chapter 6. The only change is that the
different groups will be explicitly compared. We first retrieve the data file
saved at the end of the last chapter.
Click File..Open..Data (move to the c:\Train\Stats directory, if
necessary)
Select GSS94 and click Open
Click Analyze..Descriptive Statistics..Explore
Move agewed and satmean into the Dependent List: box
Move sex into the Factor List: box
Figure 7.4 Explore Dialog Box Comparing Groups

AGEWED (age when first married) and SATMEAN (overall
satisfaction) are both named as dependent variables. Explore will
perform a separate analysis on each. The variable defining the groups to
be compared, in this instance SEX, is given as the Factor variable. Thus
SPSS will produce summaries for each gender group. Finally, while not
shown in the dialog box above, we also used the Options pushbutton to
request that missing values should be treated separately for each
dependent variable (Pairwise option). We mentioned in Chapter 6 that
Explore's default is to exclude a case from analysis if it contains a missing
value for any of the dependent variables. Here we want to avoid
excluding the overall satisfaction scores for those never married (and who
are coded as missing on the AGEWED variable).
Click the Options pushbutton
Click the Exclude cases pairwise option button
Click Continue
Click OK
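The difference between the two missing-value treatments can be illustrated outside SPSS. The short Python sketch below is not SPSS code; it uses a made-up four-case data set (None marks a missing value) and hypothetical function names to count how many cases each variable would contribute under listwise and under pairwise deletion.

```python
# Hypothetical mini data set; None marks a missing value.
cases = [
    {"agewed": 22, "satmean": 2.5},
    {"agewed": None, "satmean": 3.0},   # never married: agewed is missing
    {"agewed": 30, "satmean": None},
    {"agewed": 25, "satmean": 2.0},
]
variables = ["agewed", "satmean"]


def listwise_counts(cases, variables):
    """Keep only cases with valid values on every listed variable."""
    complete = [c for c in cases if all(c[v] is not None for v in variables)]
    return {v: len(complete) for v in variables}


def pairwise_counts(cases, variables):
    """For each variable, keep every case with a valid value on that variable."""
    return {v: sum(c[v] is not None for c in cases) for v in variables}


listwise = listwise_counts(cases, variables)
pairwise = pairwise_counts(cases, variables)
print("listwise:", listwise)
print("pairwise:", pairwise)
```

In this toy example the never-married case still contributes its satisfaction score under pairwise deletion, which is the behavior we want here.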
The command in SPSS to perform this analysis appears below.
EXAMINE
VARIABLES=agewed satmean BY sex
/MISSING PAIRWISE
/NOTOTAL.
AGEWED and SATMEAN are named as the dependent variables.
Variables following the keyword BY are treated as independent variables
(or Factors). The MISSING subcommand requests pairwise case deletion
(explained earlier). While not required, the NOTOTAL subcommand
instructs SPSS to display only the subgroup summaries and plots,
suppressing results for the entire (total) sample. If NOTOTAL were
dropped, we would first view the results for the entire sample (as in
Chapter 6), followed by each subgroup. First we consider age when first
married.

Figure 7.5 Summaries of Age When First Married

Note

The original output for Figure 7.5 was edited using the Pivot Table editor
to facilitate the male to female comparisons (steps outlined below).
Right click on the Descriptives pivot table and select SPSS Pivot
Table Object..Open from the Context menu
Click Pivot..Pivoting Trays to activate the Pivoting Trays
window (if necessary)
Drag the pivot tray icon for sex from the Row dimension tray to
the Column dimension tray
Click File..Close to close the Pivot Table Editor
Notice that the mean (male 23.93; female 21.82) is higher than both
the median and trimmed mean for each gender, which suggests some
skewness to the data. This is confirmed by the positive skewness
measures and the stem & leaf diagrams. Note that the mean age for females
(21.82) is about 2 years lower than the male mean. Also the sample
standard deviation of age first married is 4.81 for the females and 4.72
for the males, suggesting the standard deviations in each population are
about the same, and that the homogeneity of variance assumption has
probably been met.

Figure 7.6 Males: Stem & Leaf Plot of Age When First Married

Figure 7.7 Females: Stem & Leaf Plot of Age When First Married

Viewing the stem & leaf diagrams with normality in mind, we might
say each is unimodal (a single peak) but skewed to the right, and thus not
normal. However, keeping in mind our earlier discussion of assumptions,
we will not be concerned: the sample sizes are fairly large and both
gender groups show a similar skewed pattern.
Figure 7.8 Box & Whisker Plot of Age First Married for Males and
Females

The box & whisker plot provides visual confirmation of the mean
(actually median) differences between the two samples. The side-by-side
comparison shows that the groups have a similar pattern of positive
skewness. Outliers are identified and might be checked against the
original data for errors; we considered this issue when we performed
exploratory data analysis on age first married for the entire sample.
Based on these plots and summaries we might expect to find a significant
mean difference in age when first married between men and women.
Also, since the two groups have a similar distribution of data values
(positively skewed) with large samples, we feel comfortable about the
normality assumption to be made when testing for mean differences.
Next we turn to the summaries for the overall satisfaction measure.

Figure 7.9 Summaries of Overall Satisfaction

Figure 7.10 Males: Stem & Leaf Plot of Overall Satisfaction

Figure 7.11 Females: Stem & Leaf Plot of Overall Satisfaction

The mean of overall satisfaction was 2.58 for the males and 2.47 for
the females (1=Satisfied a Very Great Deal, 7=Not at all Satisfied)
indicating, on the whole, they were satisfied with life. The means are
slightly above their respective medians and trimmed means; the
skewness measures are several standard errors from zero; the stem &
leaf diagrams show outliers at the high end. All these signs indicate we
again have moderate positive skewness in this sample. The stem & leaf
plot for the females shows a slight positive skewness, similar to that of
the male sample. Notice also that the standard deviations for the groups
were about the same (.91, .94) so we are unlikely to have a problem with
the homogeneity of variance assumption. To compare the groups directly,
we move to the box & whisker plot.

Figure 7.12 Box & Whisker Plot of Overall Satisfaction

To the eye, the medians appear to be almost identical. The
interquartile ranges look the same; we can confirm the numbers in
Figure 7.9. This is consistent with the standard deviations being similar.
Both groups show some positive skewness. Based on this plot we would
not expect to find (or expect to find a very small) mean difference between
the groups. Since both groups follow a similar skewed distribution and
the samples are large, the normality assumption will not be a problem.
Given the similarity in standard deviations (.91 versus .94), we expect the
homogeneity of variance assumption to be satisfied.
Having explored the data focusing on group comparisons, we now
perform tests for mean differences between the populations.

T TEST

As mentioned earlier in this chapter, the t test is commonly used to
obtain a probability statement about differences in means between two
populations. If more than two populations are involved, a generalization
of this method, called analysis of variance (ANOVA), can be used.
Analysis of variance will be considered in later chapters. In SPSS, the
way we request a t test analysis is very similar to what we did with the
Explore procedure.
Click Analyze..Compare Means

Figure 7.13 Compare Means Menu

Notice there are three available t tests: one-sample t test,
independent-samples t test, and paired-samples t test. The one-sample t
test compares a value you supply (it can be the known value for some
population, or a target value) to the sample mean in order to conclude
whether the population represented by your sample differs from the
specified value. The other t tests involve comparison of two sample
means. The independent-samples t test applies when there are two
separate populations to compare (for example, males and females). An
observation can only fall into one of the two groups. The paired-samples t
test is appropriate when there are two measures to be compared for a
single population. For example, a paired t test would be used to compare
pre-treatment to post-treatment scores in a medical study. This
distinction is important because a slightly different statistical model is
used in each situation. Broadly speaking, if the same observation
contributes to both means, the paired t test takes advantage of this fact
and can provide a more powerful analysis. An example applying the
paired t test to compare the formal education of the respondent's mother
and father appears in the appendix at the end of this chapter. In our
example, an observation (individual interviewed) can fall into only one of
the two groups (you are male or female), so we choose the
independent-samples t test.
Click Analyze..Compare Means..Independent-Samples T Test
Move agewed and satmean into the Test Variable(s): box
Move sex into the Grouping Variable: box

Using SPSS, after clicking Independent-Samples T Test, we first
indicate the dependent measure(s) or Test variable. We specify both
AGEWED and SATMEAN, which will yield two separate analyses. The
Group or independent variable is SEX. Thus we wish to compare men
and women in their mean age when first married and mean overall
satisfaction measure.
Figure 7.14 Independent-Samples T Test Dialog Box

We have provided the basic information to SPSS, but notice the two
question marks following the variable SEX in the Grouping Variable box.
SPSS requires that you indicate which groups are to be compared, which
is usually done by providing the data values for the two groups. Since
gender is coded 1 for males and 2 for females, we must supply these
numbers using the Define Groups dialog box.
Click the Define Groups pushbutton
Enter 1 as the first and 2 as the second group code
Figure 7.15 T Test Define Groups Dialog Box

We have identified the values defining the two groups to be
compared. The cut point choice is rarely used, but if the independent
(grouping) variable is numeric, then you can give a single cut point value
to define the two groups. Cases with values less than the cut point form
one group, and cases with values greater than or equal to the cut point
form the other group.
Click Continue
Figure 7.16 Completed T Test Dialog Box

Our specifications are complete. By default, the procedure will use all
valid responses for each dependent variable in the analysis.
Click OK
The SPSS T-Test command is shown below.
T-TEST
GROUPS=sex(1 2)
/MISSING=ANALYSIS
/VARIABLES=agewed satmean
/CRITERIA=CIN(.95) .
The GROUPS subcommand instructs SPSS that an independent
groups t test is to be performed comparing SEX groups 1 (males) and 2
(females). AGEWED and SATMEAN are named as the dependent
variables on the VARIABLES subcommand. We now advance to interpret
the results.

T TEST RESULTS FOR AGE FIRST MARRIED

We will first look at the output for age when first married. Please note
that in the original output, the test results for both dependent variables
were displayed in a single pivot table, but for discussion purposes we
present the agewed and satmean results separately.
Figure 7.17 Summaries for Age First Married

Figure 7.18 T Test Output for Age When First Married

We can see some of the same summaries as Explore displayed:
sample sizes, means, standard deviations, and standard errors for the
two groups (Figure 7.17). Note also that Figure 7.18 displays the sample
mean difference between males and females (2.11 years).

Homogeneity

The next piece of information to consider is the Levene test of
homogeneity of variance in the two groups. Thus we can test the
homogeneity assumption before examining the t test results. There are
several tests of homogeneity (Bartlett-Box, Cochran's C, Levene)
available. Levene's test has the advantage of being sensitive to lack of
homogeneity, but relatively insensitive to nonnormality. Bartlett-Box and
Cochran's C are sensitive to both lack of homogeneity and nonnormality.
Since nonnormality (recall our discussion in the assumptions section) is
not necessarily an important problem for t tests and analysis of variance,
the Levene test is directed toward the more critical issue.
Homogeneity tests evaluate the null hypothesis that the dependent
measure standard deviations are the same in the two populations. Since
homogeneity of variance is assumed when performing the t test, the
analyst hopes to find this test to be nonsignificant. The P value (or
probability) from Levene's test indicates that the probability of obtaining
sample standard deviations (technically, variances are tested) as far
apart (4.72 versus 4.81) as we observe in our data, if the standard
deviations were identical in the two populations, is about 8 chances in
100 (or .08). This is above the common (.05) cut-off, so we have no
evidence that the standard deviations differ in the two population
groups, and we treat the homogeneity requirement as met. If this seems
too complicated, some authors suggest the following simplified rules:
(1) If the sample sizes are about the same, don't worry about the
homogeneity of variance assumption; (2) If the sample sizes are quite
different, then take the ratio of the standard deviations in the two
groups and round it to the nearest whole number. If this rounded number
is 1, don't worry about lack of homogeneity of variance.
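The second simplified rule is easy to check by hand or with a few lines of code. The Python sketch below (not SPSS code) applies it to the two sample standard deviations reported above. The function name is hypothetical, and dividing the larger standard deviation by the smaller is our assumption, since the rule as stated does not say which group goes in the numerator.

```python
def sd_ratio_rule(sd_a, sd_b):
    """Return the larger-to-smaller SD ratio and that ratio rounded to a whole number."""
    ratio = max(sd_a, sd_b) / min(sd_a, sd_b)
    return ratio, round(ratio)


# Sample standard deviations of age first married (males 4.72, females 4.81).
ratio, rounded = sd_ratio_rule(4.72, 4.81)
print(f"ratio = {ratio:.3f}, rounded = {rounded}")
```

Here the rounded ratio is 1, so by this rule of thumb the homogeneity of variance assumption would not be a concern.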

T TEST

Finally two versions of the t test appear. The row labeled Equal
variances assumed contains results of the standard t test, which
assumes homogeneity of variance. The second row labeled Equal
variances not assumed contains an adjusted t test that corrects for lack
of homogeneity (heterogeneity of variance) in the data. You would choose
one or the other based on your evaluation of the homogeneity of variance
question. The actual t value and df (degrees of freedom) are technical
summaries measuring the magnitude of the group differences and a value
related to the sample sizes, respectively. To interpret the results, move to
the column labeled Sig. (2-tailed). This is the probability (rounded to
.000, meaning it is less than .0005), of our obtaining sample means as far
or further apart (2.1 years), by chance alone, if the two populations
(males and females) actually have the same mean age when first married.
Thus the probability of obtaining such a large difference by chance alone
is quite small (less than 5 in 10,000), so we would conclude there is a
significant difference in age first married between men and women.
Notice we would draw the same conclusion if the unequal variance t test
were applied.
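For readers who want to see what lies behind the two rows, the Python sketch below (not SPSS code) computes a pooled-variance t statistic and a separate-variance t statistic from group summaries. The separate-variance version uses the Welch-Satterthwaite adjustment, which we assume corresponds to the Equal variances not assumed row. The means and standard deviations are those reported above; the group sizes of 500 are made-up placeholders, because the actual sample sizes appear only in the figures.

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Standard t test: pools the two sample variances (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def welch_t(m1, s1, n1, m2, s2, n2):
    """Adjusted t test: separate variances with Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (m1 - m2) / se, df


# Means and SDs are from the text; n = 500 per group is a hypothetical placeholder.
t_eq, df_eq = pooled_t(23.93, 4.72, 500, 21.82, 4.81, 500)
t_uneq, df_uneq = welch_t(23.93, 4.72, 500, 21.82, 4.81, 500)
print(f"equal variances assumed:     t = {t_eq:.2f}, df = {df_eq}")
print(f"equal variances not assumed: t = {t_uneq:.2f}, df = {df_uneq:.1f}")
```

When the group sizes are equal the two t values coincide and only the degrees of freedom differ, which is one reason equal group sizes make the homogeneity question less pressing.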
The term two-tailed test indicates that we are interested in testing
for any differences in age first married between men and women, that is,
either in the positive or negative direction (ergo the two tails).
Researchers with hypotheses that are directional, for example,
specifically that men marry at older ages than women, can use one-tailed
tests to address such questions in a more sensitive fashion. Broadly
speaking, two-tailed tests look for any difference between groups, while a
one-tailed test focuses on a difference in a specific direction. Two-tailed
tests are most commonly done since the researcher is usually interested
in any differences between the groups, regardless of which is higher.
If interested, you can obtain the one-tailed t test result directly from
the two-tailed significance value that SPSS displays. For example,
suppose you wish to test the directional hypothesis that in the population
men first marry at an older age than women, the null hypothesis being
that either women first marry at an older age than men or there is no
gender difference. You would simply divide the two-tailed significance
value by 2 to obtain the one-tailed probability, and verify that the pattern
of sample means is consistent with your hypothesized direction (that men
first marry at an older age). Thus if the two-tailed significance value were
.0005, then the one-tailed significance value would be half that value
(.00025) if the direction of the sample means agrees with the directional
hypothesis (otherwise it is 1 - p/2, where p is the two-tailed value). To learn more
about the differences and logic behind one and two-tailed testing, see
SPSS Guide to Data Analysis (Norusis, 2001) or an introductory statistics
book.
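The conversion from a two-tailed to a one-tailed significance value is simple arithmetic. The Python sketch below (not SPSS code; the function name is hypothetical) halves the two-tailed value when the sample means fall in the hypothesized direction and returns one minus that half otherwise.

```python
def one_tailed_p(p_two_tailed, direction_as_hypothesized):
    """Convert a two-tailed significance value to a one-tailed value.

    direction_as_hypothesized is True when the sample means differ in the
    direction stated by the directional hypothesis.
    """
    half = p_two_tailed / 2
    return half if direction_as_hypothesized else 1 - half


# Two-tailed value of .0005 with men marrying later in the sample, as hypothesized.
print(one_tailed_p(0.0005, True))
```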

CONFIDENCE BAND FOR MEAN DIFFERENCE

The T Test procedure provides an additional bit of useful information: the
95% confidence band for the difference between means. It has a
technical definition, but let us say that it attempts to provide an idea of
the precision with which we have estimated the true population
difference. In the output above the 95% confidence band for the mean
difference between groups is from 1.56 to 2.67 years. Thus we expect that
the population mean difference could easily be a number like 1.9 years or
2.1 years, but would not be a number like 5 or 6 years. So the 95%
confidence band indicates the likely range within which we expect the
population mean difference to fall. Speaking in the technically correct
fashion, if we were to continually repeat this study, we would expect the
confidence bands computed from the repeated samples to contain the true
population difference 95% of the time. While the technical definition is
not illuminating, the 95%
confidence band provides a useful precision indicator of our estimate of
the group difference.
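The reported band can be unpacked with a little arithmetic. The Python sketch below (not SPSS code) takes the two limits given above and recovers the midpoint, the half-width, and the standard error they imply. It uses a normal critical value (about 1.96) as a large-sample approximation; the T Test procedure itself works from the t distribution, so the implied standard error is approximate. The function name is hypothetical.

```python
from statistics import NormalDist


def summarize_ci(lower, upper, level=0.95):
    """Midpoint, half-width, and large-sample implied standard error of a confidence band."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% band
    midpoint = (lower + upper) / 2
    half_width = (upper - lower) / 2
    return {
        "midpoint": midpoint,
        "half_width": half_width,
        "implied_se": half_width / z,
        "contains_zero": lower <= 0 <= upper,
    }


# 95% confidence band for the male-female difference in age first married.
age_ci = summarize_ci(1.56, 2.67)
for name, value in age_ci.items():
    print(name, value)
```

The midpoint agrees with the sample mean difference (about 2.1 years), and the band excludes 0, which matches the significant t test.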

SUMMARY FOR AGE FIRST MARRIED

Our analysis indicated that the assumption of homogeneity of variance is
satisfied and that there is a significant difference in age when first
married between men and women. Our sample indicates that on average
men marry 2.1 years later than women, and the 95% confidence band on
this difference ranges from 1.56 to 2.67 years.

T TEST FOR OVERALL SATISFACTION

We will now compare men and women on the overall satisfaction
measure.
Figure 7.19 Summaries for Overall Satisfaction

Figure 7.20 T Test Output for Overall Satisfaction

Here the sample means are very close, as are the standard deviations.
The standard errors are the expected standard deviations of the sample
means if the study were to be repeated with the same sample sizes. The
difference between men and women is tiny (.10 units on the 1 to 7 scale).
Note that the standard deviations of the two groups are fairly close (.91
and .94) and the Levene test returns a probability value of about .62,
well above our .05 cut-off. It is a good idea to keep the sample size
in mind when evaluating the homogeneity test(s), because with
increasing sample size the population standard deviations are estimated
more precisely, and so smaller differences become statistically
significant. Thus if the Levene test were significant, but the sample sizes
were large and the ratio of the sample standard deviations were near 1,
then the equal variance t test should be quite adequate.
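The link between sample size and precision can be made concrete with the usual formula for the standard error of a mean, the standard deviation divided by the square root of the sample size. The Python sketch below (not SPSS code) uses the standard deviation of .91 reported above; the sample sizes in the loop are hypothetical and chosen only to show the pattern.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)


# SD of .91 is from the text; the sample sizes are hypothetical.
for n in (25, 100, 400, 1600):
    print(n, standard_error(0.91, n))
```

Each fourfold increase in sample size halves the standard error, which is why even small differences can reach significance in large samples.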
Proceeding to the t test itself, the significance value of .207 indicates
that if the null hypothesis of no gender difference in overall satisfaction
in the population were true, then there is about a 21% chance of
obtaining sample means as far (or further) apart as we observe in our
data (difference of .10 units). This is not significant (well above .05) and
we conclude there is no evidence of men differing from women in overall
satisfaction. Notice that the 95% confidence band of the male-female
difference includes 0. This is another reflection of the fact that we cannot
conclude the populations are different on the satisfaction measure.
In summary, we found no indication of a gender difference in overall
satisfaction.

DISPLAYING MEAN DIFFERENCES

The T Test procedure displays the means and appropriate statistical test
information. When presenting these results a summary chart is
desirable. Bar charts can be easily produced in which the height of each
bar (group) represents the sample mean. Note there is no simple
mechanism to display the precision with which the mean was estimated
(95% confidence band) on an SPSS standard bar chart, although
Interactive Graph bar charts can display standard errors. A type of chart
called the error bar shows both the group means and precision. We will
produce an error bar chart showing the gender difference in age when
first married.
Click Graphs..Interactive..Error Bar
Click Reset button, then click OK to confirm
Drag and drop Age When First Married [agewed] to the vertical arrow box
Drag and drop Respondent's Sex [sex] to the horizontal arrow box

Figure 7.21 Create Error Bar Chart Dialog Box

By default, the error bars will represent the 95% confidence band
applied to the sample means.
Click OK
The SPSS syntax command that produces the chart appears below.
IGRAPH /VIEWNAME='Error Bar Chart'
/X1 = VAR(sex) TYPE = CATEGORICAL
/Y = VAR(agewed) TYPE = SCALE
/COORDINATE = VERTICAL
/X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
/CATORDER VAR(sex) (ASCENDING VALUES OMITEMPTY)
/ERRORBAR KEY ON CI(95.0) DIRECTION BOTH CAPSTYLE = T SYMBOL = ON.

Figure 7.22 Error Bar Chart of Age First Married by Gender

The small square in the middle of each error bar represents the
sample group mean of age when first married, and the attached bars are
the upper and lower limits for the 95% confidence band on the sample
mean. Thus we can directly compare groups and view the precision with
which the group means have been estimated. Notice that the lower bound
for men is well above the upper bound for women indicating these groups
are well separated and that the population difference is statistically
significant. Such charts are especially useful when more than two groups
are displayed, since one can quickly make informal comparisons between
any groups of interest.
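The informal comparison of error bars can be mimicked numerically. The Python sketch below (not SPSS code) builds an approximate 95% confidence band around each group mean and checks whether the two bands overlap. The means and standard deviations are those reported earlier; the group sizes of 500 are hypothetical placeholders, and a normal critical value is used as a large-sample approximation in place of the t value.

```python
import math
from statistics import NormalDist


def mean_ci(mean, sd, n, level=0.95):
    """Approximate confidence band for a group mean (normal critical value)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when the two (lower, upper) bands share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Means and SDs are from the text; n = 500 per group is a hypothetical placeholder.
male_ci = mean_ci(23.93, 4.72, 500)
female_ci = mean_ci(21.82, 4.81, 500)
print("male band:  ", male_ci)
print("female band:", female_ci)
print("overlap:", intervals_overlap(male_ci, female_ci))
```

Bands that do not overlap point to a significant difference. The reverse does not hold: two bands can overlap somewhat while the t test is still significant, so the chart supplements the test and does not replace it.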

SUMMARY

In this chapter we introduced the logic and assumptions used to test for
mean differences between population groups. We saw how exploratory
data analysis methods contribute information directly relevant to such
tests, then performed t tests comparing men to women on two measures:
age first married and overall satisfaction. Finally, error bar charts were
offered as a tool to visually portray the results of the analysis.

APPENDIX: PAIRED T TEST

The paired t test is used to test for statistical significance between two
population means when each observation contributes to both means. In
medical research a paired t test would be used to compare means on a
measure administered both before and after some type of treatment. Here
each patient is tested twice and is used in calculating both the pre- and
post-treatment means. In market research, if a subject were to rate the
product they usually purchase and a competing product on some
attribute, a paired t test would be needed to compare the means. In an
industrial experiment, the same operators might run their machines
using two different sets of guidelines in order to compare average
performance scores. Again, the paired t test is appropriate. Each of these
examples differs from the independent groups t test in which an
observation falls into one and only one of the two groups. The paired t
test entails a slightly different statistical model since when a subject
appears in each condition, the subject acts as his or her own control. To
the extent that
an individual's outcomes across the two conditions are related, the paired
t test provides a more powerful statistical analysis (greater probability of
finding true effects) than the independent groups t test.
To demonstrate a paired t test using the General Social Survey data
from 1994 we will compare mean education levels of the mothers and
fathers of the respondents. The paired t test is appropriate because we
will obtain data from a single respondent as to his/her parents' education.
We are interested in testing whether there is a significant difference in
education between fathers and mothers in the population. Keep in mind
that while the population we sample from is a U.S. adult population, the
questions pertain to their parents' education. Thus the population our
conclusion directly generalizes to would be parents of U.S. adults. To test
directly for differences between men and women in the U.S. population,
we could run an independent-groups t test comparing mean education
level for men and women.
While not pursued here, we would recommend running exploratory
data analysis on the two variables to be tested. The homogeneity of
variance assumption does not apply since we are dealing with but one
group. Normality is assumed, but technically it applies to the difference
scores, obtained by subtracting for each observation the two measures to
be compared. In SPSS this issue can be investigated by computing a new
variable that is the difference between the two measures, then running
Explore on this variable.
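The role of the difference scores can be seen in a small worked example. The Python sketch below (not SPSS code) uses made-up education values for five respondents' parents, computes the difference score for each pair, and forms the paired t statistic as the mean difference divided by its standard error. The variable names echo maeduc and paeduc, but the data are hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical years of education for the mothers and fathers of five respondents.
maeduc = [12, 16, 10, 12, 14]
paeduc = [12, 14, 11, 12, 16]

# One difference score per respondent; normality is assessed on these values.
diffs = [m - p for m, p in zip(maeduc, paeduc)]

mean_diff = mean(diffs)
se_diff = stdev(diffs) / math.sqrt(len(diffs))
t_stat = mean_diff / se_diff
df = len(diffs) - 1

print("differences:", diffs)
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.3f}, df = {df}")
```

Because the test works only with the difference scores, there is a single set of values to examine, which is why homogeneity of variance does not arise.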
Click Analyze..Compare Means..Paired-Samples T Test
Click on maeduc and next on paeduc (in the Current
Selections box, maeduc will be listed as Variable 1 and
paeduc as Variable 2)
Click the arrow to move the two variables to the Paired
Variables: box
The Paired-Samples T Test dialog box appears below.

Figure 7.23 Paired-Samples T Test Dialog Box

Click OK
In SPSS both the independent groups and the paired samples t test
are produced from the same command.
T-TEST
PAIRS= maeduc WITH paeduc (PAIRED)
/CRITERIA=CIN(.95)
/MISSING=ANALYSIS.
Thus mother's and father's education in years are the variables to be
tested for mean differences using the paired sample t test.
Figure 7.24 Summaries of Differences in Parents' Education

Figure 7.25 Paired T Test of Differences in Parents' Education

What might first attract our attention is that the means for mothers
and fathers are extremely close (within .1 years of each other). This might
indicate very close educational matching of people who marry. Another
possibility might be incorrect reporting of parents' formal education by
the respondent with a bias toward reporting the same value for both. The
sample size (number of pairs) appears along with the correlation between
mothers' and fathers' education. Correlations and their significance tests
will be studied in a later chapter, but we note that the correlation (.643)
is positive, substantial and statistically significant (differs from zero in
the population). This result (significant correlation of mothers' and fathers'
education) supports our choice of the paired t test in place of the
independent-samples t test. The mean formal education difference, .04
years, is reported along with the sample standard deviation and standard
error (based on the parents' education difference score computed for each
respondent). The t statistic is very small and the significance value (.572)
indicates that if mothers and fathers in the population had the same
formal education (null hypothesis) then there is a 57% chance of
obtaining as large (or larger) a difference as we obtained in our sample
(.04 years). Thus the data are quite consistent with the null hypothesis of
no difference between mothers and fathers in education. In summary,
there is no evidence of a significant difference.

APPENDIX: NORMAL PROBABILITY PLOTS

The Examine procedure (Explore menu choice) will display a stem & leaf
diagram useful for evaluating the shape of the distribution of the
dependent measure within each group. Since one of the t test
assumptions is that these distributions are normal, we implicitly compare
the stem & leaf plots to the well-known normal bell-shaped curve. If a
more direct consideration of normality is desired, the Examine procedure
can produce a normal probability plot and a fit test of normality. In this
section we return to the Explore dialog box and request these features.
Earlier in the chapter we used the Explore dialog box to explore age
when first married (AGEWED) and overall satisfaction (SATMEAN) for
the two gender groups. If we return to this dialog box by clicking the
Dialog Recall tool, then Explore (alternatively, click
Analyze..Descriptive Statistics..Explore), we note it retains the settings
from our last analysis.
Click the Dialog Recall tool, and then click Explore

Figure 7.26 Explore Dialog Box Comparing Groups

To request the normal probability plot:
Click the Plots pushbutton
Check Normality plots with tests
Figure 7.27 Explore: Plots Dialog Box

As mentioned in the discussion concerning homogeneity of variance,
the spread & level plot can be used to find a variance stabilizing
transformation for the dependent measure. Also, note that a histogram
can be requested in addition to the stem & leaf plot.
Click Continue
Click OK
To request the same analysis using SPSS syntax we use the following
instruction.
EXAMINE
VARIABLES=agewed satmean BY sex
/PLOT BOXPLOT STEMLEAF NPPLOT
/COMPARE GROUP
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
By default, box & whisker (BOXPLOT) and stem & leaf (STEMLEAF)
plots are produced. Adding the NPPLOT (normal probability plot)
keyword to the PLOT subcommand will have SPSS build normal
probability plots, detrended normal plots and perform normality tests.
Figure 7.28 Normal Probability Plot - Females

To produce a normal probability plot, the data values (here age first
married) are first ranked in ascending order. Then the normal deviate
corresponding to each rank (compared to the sample size) is calculated
and plotted against the observed value. Thus the vertical axis of the
normal probability plot presents normal deviates (based on the rank of
the observation) while the actual data values appear along the horizontal
axis. The individual points (squares) represent the female data, while the
straight line indicates the pattern we would see if the data were perfectly
normal. If age first married followed a normal distribution for females,
the plotted values would closely approximate the straight line.
The advantage of a normal probability plot is that instead of
comparing a histogram or stem & leaf plot to the normal curve (more
complicated), you need only compare the plot to a straight line (a simple
comparison indeed). The plot above confirms what we concluded earlier
from the stem & leaf diagram: that for females, age when first married
does not follow a normal distribution.
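The construction of the plot can be sketched in a few lines. The Python code below (not SPSS code) sorts a small made-up set of ages, converts each rank to a proportion, and looks up the matching normal deviate. The proportion formula used, (rank - 3/8) / (n + 1/4), is Blom's plotting position; we assume it here for illustration, and other conventions give slightly different values. Tied values simply receive consecutive ranks in this sketch.

```python
from statistics import NormalDist


def normal_scores(values):
    """Pair each sorted data value with its expected normal deviate."""
    n = len(values)
    std_normal = NormalDist()
    pairs = []
    for rank, value in enumerate(sorted(values), start=1):
        # Blom's plotting position (an assumption; other formulas exist).
        proportion = (rank - 0.375) / (n + 0.25)
        pairs.append((value, std_normal.inv_cdf(proportion)))
    return pairs


# Hypothetical, right-skewed ages at first marriage.
ages = [18, 19, 19, 20, 21, 22, 23, 25, 29, 38]
for value, deviate in normal_scores(ages):
    print(value, round(deviate, 3))
```

Plotting the deviates against the data values gives the points of the normal probability plot. With skewed data such as these, the points bend away from a straight line at the upper end.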

TEST OF NORMALITY

Accompanying the normal probability plot are a modified version of the
Kolmogorov-Smirnov test (Lilliefors test) and the Shapiro-Wilk test,
which address whether the sample can be viewed as originating from a
population following a normal distribution. The null hypothesis is that
the sample comes from a normal population with unknown mean and
variance. The significance value is the probability that we can obtain a
sample as far (or further) from the normal as what we observe in our
data, if our sample truly came from a normal population. This result will
appear in the Viewer window as shown below.
Figure 7.29 Tests of Normality

For both tests the significance value is at .000 (rounded to 3 decimals)
in all cases. Thus if we assume we have sampled from a normal
population, the probability of obtaining a sample as far (or further) from
a normal as what we have found is less than .0005 (or 5 chances in
10,000). So we would conclude that for females in the population, the
distribution of age first married is not normal. Please recall our
discussion during which we outlined when normality might not be that
important. Also keep in mind that since our sample is large, we have a
powerful test of normality and relatively small departures from normality
would be significant.

DETRENDED NORMAL PLOT

If you wish to focus attention on those areas of the data exhibiting
greatest deviation from the normal, the detrended normal probability plot will
perform this function. This plot displays the deviation of each point in the
normal probability plot from the straight line corresponding to the
normal. In other words, the difference between each point in the normal
probability plot and the straight line representing the perfect normal is
plotted against the observed value. This serves to visually magnify the
areas where there is greatest deviation from the normal. If the data in
the sample were normal, the detrended normal plot would appear as a
horizontal line centered at 0. Below we view the detrended normal plot of
age first married for the females.
Figure 7.30 Detrended Normal Plot of Age First Married -Females

We see that the major deviations from the normal occur in the tails of
the distribution. It might be useful to mention that the same conclusion
could have been drawn from the stem & leaf diagram, a histogram, or the
normal probability plot. The detrended normal plot is a more technical
plot that allows the researcher to focus in detail on the specific locus and
form of deviations from normality.
A normal probability plot and a detrended normal plot also appear for
the males. These will not be displayed here since our aim was to
demonstrate the purpose and use of these charts, and not to repeat the
investigation of normality.


Chapter 8 Mean Differences Between Groups II: One-Factor ANOVA

Objective

Apply the principles of testing for population mean differences to
situations involving more than two comparison groups. Understand the
concept behind and practical use of post hoc tests applied to a set of
sample means.

Method

Use the Explore (Examine) procedure to produce summaries of the groups
involved in the analysis. Run a one-factor (Oneway procedure) analysis of
variance comparing different education degree groups on average daily
TV viewing. Rerun the analysis requesting multiple comparison (post
hoc) tests to see specifically which population groups differ. Plot the
results using an error bar chart. The appendix contains a nonparametric
analysis of the same data.

Data

General Social Survey 1994.

Scenario

We wish to investigate the relation between level of education and
amount of TV viewing. One approach is to group people according to their
education degree, and then compare these groups on average amount of
daily TV watched. In the General Social Survey the question about
highest degree completed (DEGREE) contains five categories: less than
high school, high school, junior college, bachelor, and graduate. Assuming
we retain these categories we might first ask if there are any population
differences in TV viewing among these groups. If there are significant
mean differences overall, we next want to know specifically which groups
differ from which others.

INTRODUCTION

Analysis of variance (ANOVA) is a general method of drawing
conclusions regarding differences in population means when two
or more comparison groups are involved. The independent-groups
t test (Chapter 7) applies only to the simplest instance (two groups), while
ANOVA can accommodate more complex situations. It is worth
mentioning that the t test (in its equal-variance form) can be viewed as a
special case of ANOVA and they yield the same result in a two-group
situation (same significance value, and the t statistic squared is equal to
ANOVA's F).
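
The link between the two-group t test and ANOVA can be checked numerically. The sketch below is an illustration in Python rather than SPSS output; the two groups of daily TV hours are invented and the function names are our own. It computes the pooled-variance t statistic and the one-factor F for the same data.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Independent-groups t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def oneway_f(groups):
    """One-factor ANOVA F: between-group over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Invented daily TV hours for two small groups.
group_1 = [2, 3, 3, 4, 6]
group_2 = [1, 1, 2, 2, 3, 3]

t = pooled_t(group_1, group_2)
f = oneway_f([group_1, group_2])
print(t ** 2, f)  # the two printed values agree up to rounding error
```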
We will compare five groups composed of people with different
education degrees and evaluate whether the populations they represent
differ in average amount of daily TV viewing. Before performing the
analysis we will look at an exploratory data analysis plot.

LOGIC OF TESTING FOR MEAN DIFFERENCES

The basic logic of significance testing is the same as that for the t test: we
assume the population groups have the same means (null hypothesis),
then determine the probability of obtaining a sample with group mean
differences as large as (or larger than) what we find in our data. To make this
assessment the amount of variation among group means (between-group
variation) is compared to the amount of variation among observations
within each group (within-group variation). Assuming in the population
that the group means are identical (null hypothesis), the only source of
variation among sample means would be the fact that the groups are
composed of different individual observations. Thus a ratio of the two
sources of variation (between group / within group) should be about 1
when there are no population differences. When the distribution of
individual observations within each group follows the normal curve, the
statistical distribution of this ratio is known (F distribution) and we can
make a probability statement about the consistency of our data with the
null hypothesis. The final result is the probability of obtaining sample
differences as large as (or larger than) what we found if there were no
population differences. If this probability is sufficiently small (usually
less than 5 chances in 100, or .05) we conclude the population groups
differ.

FACTORS

When performing a t test comparing two groups there is only one
comparison that can be made: group 1 versus group 2. For this reason,
the groups are constructed so their members systematically vary in only
one aspect: for example, males versus females, or drug A versus drug B. If
the two groups differed on more than one characteristic (for example,
males given drug A versus females given drug B), it would be impossible
to differentiate between the two effects (gender, drug).
When the data can be partitioned into more than two groups,
additional comparisons can be made. These might involve one aspect or
dimension, for example, four groups each representing a region of the
country. Or the groups might vary along several dimensions, for example
eight groups each composed of a gender (two categories) by region (four
categories) combination. In this latter case, we can ask additional
questions. (1) Is there a gender difference? (2) Is there a region
difference? (3) Do gender and region interact? Each aspect or dimension
the groups differ on is called a factor. Thus one might discuss a study or
experiment involving one, two, even three or more factors. A factor is
represented in the data set as a categorical (nominal) variable and would
be considered an independent variable. SPSS allows for multiple factors
to be analyzed, and has different procedures available based on how
many factors are involved and their degree of complexity. If only one
factor is to be studied, use the Oneway (or One-factor ANOVA) procedure.
When two or more factors are involved simply shift to the GLM
Univariate (Unianova) or, if needed, the Linear Mixed Models (Mixed)
procedure. In this chapter we consider a one-factor study (education
degree relating to average daily TV viewed), but will review a multiple
factor ANOVA in Chapter 9.

EXPLORING THE DATA

Our goal is to determine if there are differences in amount of daily TV
viewing across several educational degree groups. As before, we begin by
applying exploratory data analysis procedures to the variables of interest.
In practice, you would check each group's summary statistics, looking at
the pattern of the data and noting any unusual points. For brevity in our
presentation we will examine only the box & whisker plot.
Click File..Open..Data (move to the c:\Train\Stats folder if necessary)
Select SPSS Portable (*.por) from the Files of Type drop-down list
Double click on GSS94.por
Click Analyze..Descriptive Statistics..Explore
Move tvhours to the Dependent List: box
Move degree to the Factor List: box
Figure 8.1 Explore Dialog Box to Compare TV Hours for Degree Groups

Since we are comparing different formal education degree groups, we
designate DEGREE as the factor (or nominal independent variable).
Notice that several variables can be named in the Factor list box; we will
use this feature in a later chapter discussing two factor ANOVA.
Click OK
The SPSS command to perform this analysis appears below.
EXAMINE
VARIABLES=tvhours BY degree
/NOTOTAL.

An exploratory analysis of TV hours will appear for each degree
group. The NOTOTAL keyword suppresses overall summaries for the
entire sample. For brevity in this presentation we move directly to the
box and whisker plot.
Figure 8.2 Box & Whisker Plot of TV Hours by Degree Groups

The median hours of daily TV watched seems higher for those with a
high school degree or lesser degree than for those with at least some
college. Each group exhibits a positive skew that is more exaggerated for
those with a high school or lesser degree. Some individuals report
watching rather large amounts of daily TV; one might want to examine
the original survey to check for data errors or evidence of
misunderstanding the question. Also, based on the box heights
(interquartile ranges), it looks as though those with a high school degree
or less show greater within-group variation than the others. This suggests a
potential problem with homogeneity of variance, especially since the
sample sizes are quite disparate. However, we might also note there
doesn't seem to be any simple pattern between the median level and the
interquartile range (for example as one increases so does the other) that
might suggest a data transformation to stabilize the within-group
variance. We will come back to this point after testing for homogeneity of
variance. Let's move on to the actual analysis.

RUNNING ONE-FACTOR ANOVA

To run the analysis using SPSS:
Click Analyze..Compare Means..One-Way ANOVA
Move tvhours to the Dependent List: box
Move degree to the Factor: box

Figure 8.3 One-Way ANOVA Dialog Box

Enough information has been provided to run the basic analysis. The
Contrasts pushbutton allows users to request statistical tests for planned
group comparisons of interest. The Post Hoc pushbutton will produce
multiple comparison tests that can test each group mean against every
other one. Such tests facilitate determination of just which groups differ
from which others and are usually performed after the overall analysis
establishes that some significant differences exist. We will examine such
tests in the next section. Finally, the Options pushbutton controls such
diverse features as missing value handling and whether descriptive
statistics, means plots, and homogeneity tests are desired. We want both
descriptive statistics (although having just run Explore, they are not
necessary) and the homogeneity of variance test.
Click Options pushbutton
Check Descriptive check box
Check Homogeneity of variance test check box
Check Brown-Forsythe and Welch check boxes
As mentioned earlier, ANOVA assumes homogeneity of within-group
variance. However, when homogeneity does not hold there are several
adjustments that can be made to the F test. We request these optional
statistics because the box & whisker plots and the homogeneity of
variance test (not shown here) indicate that the homogeneity of variance
assumption does not hold. Note that these tests still assume normality of
the residuals.

Figure 8.4 One-Way ANOVA Options Dialog Box

The missing value choices deal with how missing data are to be
handled when several dependent measures are given. By default, cases
with missing values on a particular dependent variable are dropped only
for the specific analysis involving that variable. Since we are looking at a
single dependent variable, the choice has no relevance to our analysis.
The Means plot option will produce a line chart displaying the group
means; we will request an error bar plot later.
Click Continue
Click OK
The same analysis can be performed in SPSS syntax using the
ONEWAY procedure. TVHOURS is the dependent measure and the
keyword BY separates the dependent variable from the factor variable.
We request descriptive statistics, a homogeneity of variance test, and two
tests that do not make the homogeneity assumption (Brown-Forsythe and
Welch tests).
ONEWAY
tvhours BY degree
/STATISTICS DESCRIPTIVES HOMOGENEITY
BROWNFORSYTHE WELCH .
We now turn to interpret the results.

ONE-FACTOR ANOVA RESULTS

The Oneway output includes the analysis of variance summary table,
robust tests that do not assume homogeneity of variance, and the
probability value(s) we will use to judge statistical significance.
Figure 8.5 One-Factor ANOVA Summary Table

Most of the information in the ANOVA table is technical in nature
and is not directly interpreted. Rather the summaries are used to obtain
the F statistic and, more importantly, the probability value we use in
evaluating the population differences. Notice that in the first column
there is a row for the between-groups and a row for within-groups
variation. The df column contains information about degrees of freedom,
related to the number of groups and the number of individual
observations within each group. The degrees of freedom are not
interpreted directly, but are used in calculating the between-group and
within-group variation (variances). Similarly, the sums of squares are
intermediate summary numbers used in calculating the between- and
within-group variances. Technically they represent the sum of the
squared deviations of the individual group means around the total
sample mean, each weighted by its group size (between), and the sum of
the squared deviations of individual observations around their
respective sample group mean (within). These numbers are never
interpreted and are reported because it is traditional to do so. The mean
squares are measures of the between-group and within-group variation
(variances). Recall in our discussion of
the logic of testing that under the null hypothesis both variances should
have the same source and the ratio of between to within would be about
1. This ratio, the sample F statistic, is a far cry from 1 (F = 43.77).
Finally, and most readily interpretable, the column labeled Sig.
provides the probability of obtaining a sample F ratio as large as (or
larger than) 43.77 (taking into account the number of groups and sample size),
assuming the null hypothesis that in the population all degree groups
watch the same amount of TV. The probability of obtaining an F this
large (in other words, of obtaining sample means as far apart as we
have), if the null hypothesis were true, is about .000. This number is
rounded when displayed so the actual probability is less than .0005, or
less than 5 chances in 10,000 of obtaining sample mean differences so far
apart by chance alone. Thus we have a highly significant difference.
In practice, most researchers move directly to the significance value
since the columns containing the sums of squares, degrees of freedom,
mean squares and F statistic are all necessary for the probability
calculation but are rarely interpreted in their own right. To interpret the
results we move to the descriptive information.
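
For readers who want to see how the columns of the summary table fit together, the sketch below rebuilds them for a tiny invented data set. It is a Python illustration, not the Oneway procedure itself; the function name, dictionary keys and data are assumptions made for this example.

```python
from statistics import mean

def anova_table(groups):
    """Return the pieces of a one-factor ANOVA summary table."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    # Between: squared deviations of group means, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: squared deviations of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }

# Three invented groups of three observations each.
table = anova_table([[4, 5, 6], [2, 3, 4], [1, 2, 3]])
for name, value in table.items():
    print(f"{name:>10}: {value:.2f}")
```

For these made-up numbers the sums of squares are 14 (between) and 6 (within) on 2 and 6 degrees of freedom, so the mean squares are 7 and 1 and F is 7. The significance value additionally requires the F distribution, which is not computed here.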

Figure 8.6 Descriptive Statistics for Groups

The pattern of means is largely consistent with the box & whisker
plot in that those with less formal education watch more TV than those
with more formal education. The 95% confidence bands for the degree
group means gauge the precision with which we have estimated the
means and we can informally compare groups by comparing their
confidence bands. The minimum and maximum values for each group are
valuable as a data check; we again note some surprisingly large numbers.
Often at this point there is interest in making a statement about just
which of the five groups differ significantly from which others. This is
because the overall F statistic simply tested the null hypothesis that all
population means were the same. Typically, you now want to make more
specific statements than merely that the five groups are not identical.
Post Hoc tests permit these pairwise group comparisons and we will
pursue them.

THE BAD NEWS HOMOGENEITY

We also requested the Levene test of homogeneity of variance.


Figure 8.7 Homogeneity of Within-Group Variance

Unfortunately the null hypothesis assuming homogeneity of within-group
variance is rejected at the rounded .000 (less than .0005) level. Our
sample sizes are quite disparate (see Figures 8.6 or 8.2) so we cannot
count on robustness due to equal sample sizes. For this reason we turn to
the Brown-Forsythe and Welch tests, which test for equality of group
means without assuming homogeneity of variance. Since these results
will not be calculated by default, you would request them based on
homogeneity tests done in the Explore or Oneway procedures.
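
The idea behind the Levene test can be sketched in a few lines: replace each observation by its absolute distance from its own group mean, then run an ordinary one-factor F on those distances. The Python illustration below assumes mean-centered deviations (a median-centered variant also exists). The function names and data are invented, and no significance value is computed.

```python
from statistics import mean

def oneway_f(groups):
    """One-factor ANOVA F: between-group over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))

def levene_w(groups):
    """Levene statistic: one-factor F computed on absolute deviations
    of each observation from its own group mean."""
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    return oneway_f(deviations)

# Invented data: same spread in every group versus one widely spread group.
similar_spread = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [6, 7, 8, 9, 10]]
unequal_spread = [[1, 2, 3, 4, 5], [1, 6, 11, 16, 21], [3, 4, 5, 6, 7]]

print(round(levene_w(similar_spread), 3))
print(round(levene_w(unequal_spread), 3))
```

Note that the first data set has very different group means but identical spreads, so its statistic is zero. The test responds to differences in spread, not in level.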

Figure 8.8 Robust Tests of Mean Differences

When calculating the between-group to within-group variance ratio,
the Welch test adjusts each group's contribution to the
between-group variation by a weight related to its within-group variation
(the group size divided by the group variance), thus explicitly adjusting
for heterogeneity of variance. The Brown-Forsythe test
adjusts the denominator of the F ratio so it has the same expectation as
the numerator when the null hypothesis is true, despite the heterogeneity
of within-group variance.
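
The two robust statistics can be written out in their usual textbook forms. In that form the Brown-Forsythe statistic keeps the ordinary between-group sum of squares and replaces the denominator with a sum of group variances weighted by (1 - n/N), while the Welch statistic weights each group by its size divided by its variance. The Python sketch below is an illustration under those assumptions, not the SPSS code; the function names and data are invented, and the adjusted degrees of freedom and significance values are not computed.

```python
from statistics import mean, variance

def classical_f(groups):
    """Standard one-factor ANOVA F for comparison."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))

def brown_forsythe_f(groups):
    """Between-group sum of squares over a variance-weighted denominator."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    numerator = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    denominator = sum((1 - len(g) / n_total) * variance(g) for g in groups)
    return numerator / denominator

def welch_f(groups):
    """Welch statistic: groups weighted by size divided by variance."""
    k = len(groups)
    weights = [len(g) / variance(g) for g in groups]
    total_w = sum(weights)
    weighted_mean = sum(w * mean(g) for w, g in zip(weights, groups)) / total_w
    numerator = sum(w * (mean(g) - weighted_mean) ** 2
                    for w, g in zip(weights, groups)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (len(g) - 1)
              for w, g in zip(weights, groups))
    return numerator / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))

# Invented data: equal group sizes, clearly unequal variances.
data = [[1, 2, 3, 4, 5], [2, 6, 10, 14, 18], [4, 5, 6, 7, 8]]
print(classical_f(data), brown_forsythe_f(data), welch_f(data))
```

With equal group sizes the Brown-Forsythe statistic equals the classical F (only its degrees of freedom change). The two part ways when group sizes differ, as they do in the TV viewing data.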
Both tests indicate there are highly significant differences in average
TV viewing between the education degree groups, which are consistent
with the conclusions we drew from the standard ANOVA.
These robust tests, as noted by the caption, are asymptotic tests,
meaning their properties improve as the sample size increases.
Simulation work (Brown and Forsythe, 1974) indicates the tests perform
well with group sample sizes as small as 10 and possibly even 5.
As an alternative, a statistically sophisticated analyst might attempt
to apply transformations to the dependent measure in order to stabilize
the within-group variances (variance stabilizing transforms). These are
beyond the scope of this course, but interested readers might turn to
Emerson in Hoaglin, Mosteller and Tukey (1991) for a discussion from
the perspective of exploratory data analysis, and note that the spread &
level plot in Explore (EXAMINE) will suggest a variance stabilizing
transform. Box, Hunter and Hunter (1978) contains a brief discussion of
such transformations and the original (technical) paper was by Box and
Cox (1964). Yet another alternative would be to perform the analysis
using a statistical method that assumes neither normality nor
homogeneity of variance (recall the Brown-Forsythe and Welch tests
assume normality of error). A one-factor analysis of group differences
assuming that the dependent measure is only an ordinal (rank) variable
is available as a nonparametric procedure within SPSS. When this
analysis was run (see appendix to this chapter if interested), the group
differences were found to be highly significant. This serves as another
confirmation of our result, but corresponding nonparametric procedures are
not available for all analysis of variance models. In situations in which
robust or nonparametric equivalents are not available, many researchers
accept the ANOVA results with a caveat that the reported probability
levels are not exactly correct. In our example, since the significance value
was less than .0005, even if we discount the value by an order or two of
magnitude, the result would still be significant at the .05 level. While
these approaches are not entirely satisfactory, and statisticians may
disagree as to which would be best in a given situation, they do constitute
the common ways of dealing with the problem.
Having concluded that there are differences in amount of TV viewed
among different educational degree groups, we probe to find specifically
which groups differ from which others.

POST HOC TESTING OF MEANS

Post hoc tests are typically performed only after the overall F test
indicates that population differences exist, although for a broader view
see Milliken and Johnson (1984). At this point there is usually interest in
discovering just which group means differ from which others. In one
aspect, the procedure is quite straightforward: every possible pair of
group means is tested for population differences and a summary table
produced. However, a problem exists in that as more tests are performed,
the probability of obtaining at least one false-positive result increases. As
an extreme example, if there are ten groups, then 45 pairwise group
comparisons (n*(n-1)/2) can be made. If we are testing at the .05 level and
no population means actually differ, we would expect to obtain on average
about 2 (.05 * 45 = 2.25) false-positive tests.
In an attempt to reduce the false-positive rate when multiple tests of this
type are done, statisticians have developed a number of methods.
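
The arithmetic in the ten-group example can be reproduced with a short Python sketch. The function names are our own. The last function assumes the tests are independent, which pairwise comparisons sharing group means are not, so its result is only a rough guide.

```python
from math import comb

def pairwise_comparisons(groups):
    """Number of distinct pairs among the groups: n*(n-1)/2."""
    return comb(groups, 2)

def expected_false_positives(groups, alpha=0.05):
    """Average number of false positives if no population means differ."""
    return alpha * pairwise_comparisons(groups)

def chance_of_any_false_positive(groups, alpha=0.05):
    """Probability of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** pairwise_comparisons(groups)

print(pairwise_comparisons(10))                    # 45 pairs for ten groups
print(round(expected_false_positives(10), 2))      # 2.25
print(round(chance_of_any_false_positive(10), 2))  # about 0.9
print(pairwise_comparisons(5))                     # 10 pairs for five groups
```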

WHY SO MANY TESTS?

The ideal post hoc test would demonstrate tight control of Type I
(false-positive) error, have good statistical power (probability of detecting
true population differences), and be robust over assumption violations (failure
of homogeneity of variance, nonnormal error distributions).
Unfortunately, there are implicit tradeoffs involving some of these
desired features (Type I error and power) and no current post hoc
procedure is best in all these areas. Couple this with the facts that there are
different statistical distributions that pairwise tests can be based on (t, F,
studentized range, and others) and that there are different levels at
which Type I error can be controlled (per individual test, per family of
tests, variations in between), and you have a large collection of post hoc
tests.
We will briefly compare post hoc tests from the perspective of being
liberal or conservative regarding control of the false-positive rate and
apply several to our data. There is a full literature (and several books)
devoted to the study of post hoc (also called multiple comparison or
multiple range tests, although there is a technical distinction between the
two) tests. More recent books (Toothaker, 1991) summarize simulation
studies that compare multiple comparison tests on their power
(probability of detecting true population differences) as well as
performance under different scenarios of patterns of group means, and
assumption violations (homogeneity of variance).
The existence of numerous post hoc tests suggests that there is no
single approach that statisticians agree will be optimal in all situations.
In some research areas, publication reviewers require a particular post
hoc method, simplifying the researcher's decision. For more detailed
discussion and recommendations, short books by Klockars and Sax
(1986), Toothaker (1991) or Hsu (1996) are useful. Also, for some thinking
on what post hoc tests ought to be doing see Tukey (1991) or Milliken and
Johnson (1984).
Below we present some tests available within SPSS, roughly ordered
from the most liberal (greater statistical power and greater false-positive
rate) to the most conservative (smaller false-positive rate, less statistical
power), and also mention some designed to adjust for lack of homogeneity
of variance.

LSD

The LSD or least significant difference method simply applies standard t
tests to all possible pairs of group means. No adjustment is made based
on the number of tests performed. The argument is that since an overall
difference in group means has already been established at the selected
criterion level (say .05), no additional control is necessary. This is the
most liberal of the post hoc tests.

SNK, REGWF, REGWQ & Duncan

The SNK (Student-Newman-Keuls), REGWF (Ryan-Einot-Gabriel-Welsch
F), REGWQ (Ryan-Einot-Gabriel-Welsch Q (based on studentized range
statistic)) and Duncan methods involve sequential testing. After ordering
the group means from lowest to highest, the two most extreme means are
tested for a significant difference using a critical value adjusted for the
fact that these are the extremes from a larger set of means. If these
means are found not to be significantly different, the testing stops; if they
are different then the testing continues with the next most extreme set,
and so on. All are more conservative than the LSD. REGWF and REGWQ
improve on the traditionally used SNK in that they adjust for the slightly
elevated false-positive rate (Type I error) that SNK has when the set of
means tested is much smaller than the full set.

Bonferroni & Sidak

The Bonferroni (also called the Dunn procedure) and Sidak (also called
Dunn-Sidak) perform each test at a stringent significance level to ensure
that the family-wise (applying to the set of tests) false-positive rate does
not exceed the specified value. They are based on inequalities relating the
probability of a false-positive result on each individual test to the
probability of one or more false positives for a set of independent tests.
For example, the Bonferroni is based on an additive inequality, so the
criterion level for each pairwise test is obtained by dividing the original
criterion level (say .05) by the number of pairwise comparisons made.
Thus with five means, and therefore ten pairwise comparisons, each
Bonferroni test will be performed at the .05/10 or .005 level.
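
The per-test criterion levels for the five-group example can be computed directly. The Python sketch below uses the standard formulas: Bonferroni divides the family-wise level by the number of tests, and Sidak uses 1 - (1 - alpha)^(1/m). The function names are our own.

```python
from math import comb

def bonferroni_level(alpha, tests):
    """Per-test level from the additive (Bonferroni) inequality."""
    return alpha / tests

def sidak_level(alpha, tests):
    """Per-test level from the multiplicative (Sidak) inequality."""
    return 1 - (1 - alpha) ** (1 / tests)

tests = comb(5, 2)  # five group means give ten pairwise comparisons
b = bonferroni_level(0.05, tests)
s = sidak_level(0.05, tests)
print(tests, round(b, 5), round(s, 5))
```

The Sidak level is always slightly larger than the Bonferroni level, so the Sidak test is marginally less conservative.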

Tukey (b)

The Tukey (b) test is a compromise test, combining the Tukey (see below)
and the SNK criterion producing a test result that falls between the two.

Tukey

(also called Tukey HSD, WSD, or Tukey(a) test): Tukey's HSD (Honestly
Significant Difference) controls the false-positive rate family-wise. This
means if you are testing at the .05 level, that when performing all
pairwise comparisons, the probability of obtaining one or more false
positives is .05. It is more conservative than the Duncan and SNK. If all
pairwise comparisons are of interest, which is usually the case, Tukey's
test is more powerful than the Bonferroni and Sidak.

Scheffe

Scheffe's method also controls the family-wise error rate. It adjusts not
only for the pairwise comparisons, but also for any possible comparison
the researcher might ask. As such it is the most conservative of the
available methods (false-positive rate is least), but has less statistical
power.

SPECIALIZED POST HOCS

Unequal Ns: Hochberg's GT2 & Gabriel

Most post hoc procedures mentioned above (excepting LSD, Bonferroni &
Sidak) were derived assuming equal group sample sizes in addition to
homogeneity of variance and normality of error. When the subgroup sizes
are unequal, SPSS substitutes a single value (the harmonic mean) for the
sample size. Hochberg's GT2 and Gabriel's post hoc test explicitly allow
for unequal sample sizes.
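
As a small illustration of the harmonic mean substitution, the Python sketch below computes the single sample size that would stand in for five unequal groups. The group sizes are invented for the example and are not the GSS counts.

```python
from statistics import harmonic_mean, mean

def harmonic_n(sizes):
    """Harmonic mean: number of groups divided by the sum of 1/n."""
    return len(sizes) / sum(1 / n for n in sizes)

# Invented group sizes for five degree groups.
sizes = [300, 1000, 120, 320, 140]

print(round(harmonic_n(sizes), 1))
print(round(harmonic_mean(sizes), 1))  # library version, same value
print(round(mean(sizes), 1))           # ordinary mean, for comparison
```

The harmonic mean is pulled toward the smaller groups, so it is lower than the ordinary average whenever the sizes differ.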

Waller-Duncan

The Waller-Duncan takes an approach (Bayesian) that adjusts the
criterion value based on the size of the overall F statistic in order to be
sensitive to the types of group differences associated with the F (for
example, large or small). Also, you can specify the ratio of Type I (false
positive) to Type II (false negative) error in the test. This feature allows
for adjustments if there are differential costs to the two types of errors.

Unequal Variances and Unequal Ns: Tamhane T2, Dunnett's T3,
Games-Howell, Dunnett's C

Each of these post hoc tests adjusts for unequal variances and sample
sizes in the groups. Simulation studies (summarized in Toothaker, 1991)
suggest that although Games-Howell can be too liberal when the group
variances are equal and sample sizes are unequal, it is more powerful
than the others.

An approach some analysts take is to run both a liberal (say LSD)
and a conservative (Scheffe or Tukey HSD) post hoc test. Group
differences that show up under both criteria are considered solid findings,
while those found different only under the liberal criterion are viewed as
tentative results.
To illustrate the differences among the post hoc tests we will request
that three be done: one liberal (LSD), one midrange (REGWF), and one
conservative (Scheffe). In addition, since homogeneity of variance does
not hold in the data, we request the Games-Howell and would pay serious
attention to its results. Ordinarily, of course, a researcher would not run
all these different tests. For this data, due to the homogeneity of variance
violation, in practice only the Games-Howell might be run.
Click on the Dialog Recall tool, then click One-Way ANOVA
Click on the Post Hoc pushbutton
Click LSD (Least Significant Difference), R-E-G-W-F (Ryan-Einot-Gabriel-Welsch F), Scheffe and Games-Howell check boxes

Figure 8.9 Post Hoc Testing Dialog Box

Click Continue
Click OK
By default, statistical tests will be done at the .05 level. If you prefer
to use a different alpha value (for example, .01), you can specify it in the
Significance level box. The command to run the post hoc analysis
appears below.
ONEWAY
tvhours BY degree
/STATISTICS DESCRIPTIVES HOMOGENEITY
/POSTHOC = SCHEFFE LSD FREGW GH ALPHA(.05).
Post hoc tests are requested using the POSTHOC subcommand. The
STATISTICS subcommand need not be included here since we have
already viewed the means and discussed the homogeneity test.
The beginning part of the Oneway output contains the ANOVA table,
robust tests of mean differences, descriptive statistics, and homogeneity
test. We will move directly to the post hoc test results.

Note

Some of the pivot tables shown below were edited (changed column
widths; only one post hoc method shown in some figures) to better display
in this course guide.

Figure 8.10 Least Significant Difference Post Hoc Results

The rows are made of every possible combination of groups. For
example, at the top of the pivot table the Less than High School group
is paired with each of the other four. The column labeled Mean
Difference (I-J) contains the sample mean difference between each
pairing of groups. We see the Less than High School and Graduate
groups have a mean difference of 2.00 hours of daily TV. If this difference
is statistically significant at the specified level after applying the post hoc
adjustments (none for LSD), then an asterisk (*) appears beside the mean
difference. Notice the actual significance value for the test appears in the
column labeled Sig.
Thus, the first block of LSD results indicates that in the population
those with less than high school degrees differ significantly in daily TV
viewing from each of the other four degree groups. In addition, the
standard errors and 95% confidence intervals for each mean difference
appear. These provide information on the precision with which we have
estimated the mean differences. Note that, as you would expect, if a mean
difference is not significant, the confidence interval includes 0.
Also notice that each pairwise comparison appears twice (for
example: high school - college degree and also college degree - high
school). For each such duplicate pair the significance value is the same,
but the signs are reversed for the mean difference and confidence interval
values.

Summarizing the entire table, we would say that almost all degree
groups differ in amount of TV viewed daily and those with higher degrees
watch less TV. The only groups not different (graduates versus bachelors;
bachelors versus junior college) were adjacent degree categories. Since
LSD is the most liberal of the post hoc tests, we are interested in whether
the same results hold using more conservative criteria.
Figure 8.11 Homogeneous Subsets Results for REGWF Post Hoc Tests

The REGWF results are not presented in the same format as we saw
for the LSD. This is because for some of the post hoc test methods (for
example, the sequential or multiple-range tests) standard errors and 95%
confidence intervals for all pairwise comparisons are not defined. Rather
than display pivot tables with empty columns in such situations, a
different format, homogeneous subsets, is used. A homogeneous subset is
a set of groups for which no pair of group means differs significantly. This
format is closer in spirit to the nature of the sequential tests actually
performed by REGWF. Depending on the post hoc test requested, SPSS
will display a multiple comparison table, a homogeneous subset table, or
both. Recall that REGWF tests first the most extreme, then the less
extreme means, adjusting for the number of means in the comparison set.
Viewing the REGWF portion of the table, we see four homogeneous
subsets (four columns). The first is composed of graduate and bachelor
degree groups; they do not differ, but one or the other differs from the
three remaining degree groups. Bachelor and junior college degree groups
are the second homogeneous subset (they do not differ significantly).
Notice that the third and fourth homogeneous subsets contain one group
each. This is because the high school group differs from each of the
others, as does the group with less than a high school degree. The
homogeneous subset pivot table thus displays where population
differences do not exist (and by inference, where they do). The results for
REGWF match what we obtained using LSD.

A homogeneous subset summary appears for the Scheffe test as well.
We will discuss the Scheffe shortly. Here we just point out that the
results are similar, except for subset 3, in which junior college does not
differ from high school as it did for the REGWF. This is consistent with
the Scheffe being a more conservative test.
Figure 8.12 Scheffe Post Hoc Results

The Scheffe test is more conservative (smaller false-positive rate)
than the LSD or REGWF tests. Notice that under the Scheffe test the
junior college population is NOT found to be significantly different from
the graduate or high school groups. The LSD and REGWF tests indicated
the junior college population did differ from these groups. Also, a careful
observer will notice that the Scheffe multiple comparison result in Figure
8.12 indicates that the junior college group does not differ significantly
from the graduate group (significance .075) while the homogeneous
subset results indicate they do (Figure 8.11). The two tables adjust for
unequal sample sizes slightly differently: for homogeneous subsets, the
sample size is set to the harmonic mean of all groups, while for multiple
comparison tables the default is to compute harmonic means on a
two-group (pairwise) basis. Here that difference produces a different
conclusion for one of the comparisons.
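
To see how the two sample size adjustments can diverge, the following Python sketch (an illustration outside SPSS) compares the harmonic mean of all group sizes with the harmonic mean for a single pair. The group sizes are hypothetical and are not the counts from the survey data.

```python
from statistics import harmonic_mean

# Hypothetical group sizes, NOT the counts from the survey data.
group_sizes = {
    "less than high school": 150,
    "high school": 1000,
    "junior college": 100,
    "bachelor": 300,
    "graduate": 120,
}

# Homogeneous subsets: one common n, the harmonic mean of all group sizes.
n_all = harmonic_mean(list(group_sizes.values()))

# Multiple comparison table (default): harmonic mean of the two groups in the pair.
n_pair = harmonic_mean([group_sizes["junior college"], group_sizes["graduate"]])

print(f"all groups: {n_all:.1f}, junior college vs graduate: {n_pair:.1f}")
```

When the two groups being compared are among the smaller ones, the pairwise value is lower than the all-groups value, so the same mean difference is judged against a larger standard error in the multiple comparison table.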

Figure 8.13 Games-Howell Post Hoc Results

The Games-Howell multiple comparison test adjusts for both
unequal variances (determined to be present by the Levene test earlier)
and unequal sample sizes. The overall results are consistent with the
LSD and REGWF tests and differ from Scheffe in that the junior college
group is found to be different from the high school and graduate groups.
What is the true situation? We don't know. Your original choice of a
post hoc test would be based on how you want to balance power and the
false-positive rate. Here under more liberal false-positive rates we would
conclude those with junior college degrees watch more TV than graduate
degree holders and less TV than those with just a high school degree.
Under a more conservative false-positive rate we would not judge them
different. Also, only Games-Howell adjusts for the unequal variances
across degree groups and it didn't find these two junior college
comparisons to be significant. Some researchers would state these two
junior college comparisons as tentative results, having more confidence
in the others. However, the bottom line is that your choice in post hoc
tests should reflect your preference for the power/false-positive tradeoff
and your evaluation of how well the data meet the assumptions of the
analysis, and you live with the results of that choice. Such books as
Toothaker (1991) or Hsu (1996) and their references evaluate the
various post hoc tests on the basis of theoretical and Monte Carlo
considerations.

GRAPHING THE RESULTS

For presentations it is helpful to display the sample group means along
with their 95% confidence bands. We saw such error bar charts in the
two-group case and they are useful here as well. To create an error bar
chart for TV hours grouped by respondent's highest degree:
Click Graphs..Interactive..Error Bar
Click Reset button, then click OK to confirm
Move Hours Per Day Watching TV [tvhours] into the vertical
arrow box
If prompted to convert tvhours into a scale variable (which it
should be), click Convert
Move R's Highest Degree [degree] into the horizontal arrow
box
Click OK
Figure 8.14 Create Error Bar Chart Dialog Box

The setup for the error bar chart is very straightforward. Although not
shown, we used the Titles tab sheet to give a title to the chart. The
command below produces the chart using SPSS syntax.
IGRAPH /VIEWNAME='Error Bar Chart'
  /X1 = VAR(degree) TYPE = CATEGORICAL
  /Y = VAR(tvhours) TYPE = SCALE
  /COORDINATE = VERTICAL /TITLE = 'Error Bar Chart'
  /X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
  /CATORDER VAR(degree) (ASCENDING VALUES OMITEMPTY)
  /ERRORBAR KEY ON CI(95.0) DIRECTION BOTH
    CAPSTYLE = T SYMBOL = ON.

We've requested an error bar chart with a 95% confidence band on
the sample means. A title is included. The final chart appears below.
Figure 8.15 Error Bar Chart of TV Hours by Degree Group

The chart provides a visual sense of how far the groups are separated.
The confidence bands are determined for each group separately and no
adjustment is made based on the number of groups that are compared.
From the graph we have a clear sense of the relation between formal
education degree and TV viewing.
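
As a rough illustration of what "determined for each group separately" means, the Python sketch below (outside SPSS) computes a 95% confidence band for each group from that group's own mean and standard error. The data values are hypothetical, and the sketch assumes a normal approximation (about 1.96 standard errors); an interval based on the t distribution would be somewhat wider for small groups.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) using a normal approximation (an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% band
    m = mean(values)
    se = stdev(values) / sqrt(len(values))     # standard error from this group only
    return m, m - z * se, m + z * se

# Hypothetical daily TV hours for two degree groups, for illustration only.
tv = {
    "high school": [3, 2, 4, 5, 3, 2, 4, 3],
    "graduate": [1, 2, 2, 1, 3, 2, 1, 2],
}
for group, hours in tv.items():
    m, lo, hi = mean_ci(hours)
    print(f"{group}: mean={m:.2f}, 95% band=({lo:.2f}, {hi:.2f})")
```

Because each band uses only its own group's data, no correction for the number of groups enters the calculation, which is the point made above.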

SUMMARY

In this chapter we extended testing for population mean differences to
the case where more than two groups are compared and these groups
constitute a single factor. We examined the data to check for assumption
violations, discussed alternatives, and interpreted the ANOVA results.
Having found significant differences, we performed post hoc tests to
determine which specific groups differed from which others and
summarized the analysis with an error bar chart. The appendix contains
a nonparametric analysis of the same data.

APPENDIX: GROUP DIFFERENCES ON RANKS

Analysis of variance assumes that the distribution of the dependent
measure within each group follows a normal curve, and that the
within-group variation is homogeneous across groups. If any of these
assumptions fail in a gross way, one may be able to apply techniques that
make fewer assumptions about the data. We saw such a variation when
we applied tests that didn't assume homogeneity of variance but did
assume normality (Brown-Forsythe and Welch). However, what if neither
the homogeneity nor the normality assumption were met? In this case,
nonparametric statistics are available; they don't assume specific data
distributions described by parameters such as the mean and standard
deviation. Since these methods make few if any distributional
assumptions, they can often be applied when the usual assumptions are
not met. If you are tempted to think that something is obtained for
nothing, the downside of such methods is that if the stronger data
assumptions hold, then nonparametric techniques are generally less
powerful (probability of finding true differences) than the appropriate
parametric method. Second, there are some parametric statistical
analyses that currently have no corresponding nonparametric method. I
think it is fair to say that boundaries separating where one would use
parametric versus nonparametric methods are in practice somewhat
vague, and statisticians can and often do disagree about which approach
is optimal in a specific situation. For more discussion of the common
nonparametric tests see Daniel (1978), Siegel (1956) or Wilcox (1996).
Because of our concerns about the lack of homogeneity of variance
and normality of TV hours viewed for our different degree groups, we will
perform a nonparametric procedure, which only assumes that the
dependent measure has ordinal (rank order) properties. The basic logic
behind this test, the Kruskal-Wallis test, follows. If we rank order the
dependent measure throughout the entire sample, we would expect under
the null hypothesis (no population differences) that the mean rank
(technically the sum of the ranks adjusted for sample size) should be the
same for each sample group. The Kruskal-Wallis test calculates the
ranks, the sample group mean ranks, and the probability of obtaining
average ranks (weighted summed ranks) as far apart (or further) as what
are observed in the sample, if the population groups were identical.
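
The ranking logic just described can be sketched in a few lines of Python (an illustration outside SPSS, not the SPSS implementation). The data are hypothetical, the function names are our own, and the sketch omits the correction for tied values that statistical packages typically apply, so it yields only the uncorrected H statistic and no probability value.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return (mean rank per group, H statistic without tie correction)."""
    pooled = [v for g in groups.values() for v in g]
    ranks = average_ranks(pooled)          # ranks over the entire sample
    n_total = len(pooled)
    mean_ranks, total, start = {}, 0.0, 0
    for name, g in groups.items():
        r = ranks[start:start + len(g)]    # ranks belonging to this group
        start += len(g)
        mean_ranks[name] = sum(r) / len(g)
        total += sum(r) ** 2 / len(g)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return mean_ranks, h

# Hypothetical daily TV hours, for illustration only.
groups = {
    "graduate": [1, 2, 3],
    "high school": [4, 5, 6],
    "less than high school": [7, 8, 9],
}
mean_ranks, h = kruskal_wallis_h(groups)
print(mean_ranks, round(h, 2))
```

Under the null hypothesis the mean ranks would be close to one another and H would be small; widely separated mean ranks, as in this contrived example, produce a large H.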
To run the Kruskal-Wallis test in SPSS, we will declare tvhours as
the Test Variable (from which ranks are calculated) and degree as the
independent or grouping variable.
Click Analyze..Nonparametric Tests..K Independent
Samples
Move tvhours into the Test Variable List: box
Move degree into the Grouping Variable: box
Note that the minimum and maximum value of the grouping variable
must be specified using the Define Range pushbutton.
Click the Define Range pushbutton
Enter 0 as the Minimum and 4 as the Maximum
Click Continue
Click OK

Figure 8.16 Analysis of Ranks

By default, the Kruskal-Wallis test will be performed. The
organization of this dialog box closely resembles that of the One-Way
ANOVA. The command to run this analysis using SPSS syntax follows.
NPAR TESTS
/K-W=tvhours BY degree(0 4) /MISSING ANALYSIS.
The K-W subcommand instructs the nonparametric testing routine to
perform the Kruskal-Wallis analysis of ranks on the dependent variable
TVHOURS with DEGREE as the independent or grouping variable.
Figure 8.17 Results of Kruskal-Wallis Nonparametric Analysis

We see the pattern of mean ranks (remember smaller ranks imply
less TV watched) follows that of the original means of TVHOURS in that
the higher the degree, the less TV watched. The chi-square statistic used
in the Kruskal-Wallis test indicates that we are very unlikely (less than
.0005, or 5 chances in 10,000; we can edit the pivot table to obtain more
precision) to obtain samples with average ranks so far apart if the null
hypothesis (same distribution of TV hours in each group) were true. We
feel much more confident in our original analysis because we were able to
confirm that population differences exist without making all the
assumptions required for analysis of variance.

APPENDIX: HELP IN CHOOSING A STATISTICAL METHOD

SPSS contains a help facility that suggests the appropriate statistical
analysis within the SPSS Base system, given the questions you want to
ask about your data and the type of measurements made. Since this
Statistics Coach is limited to procedures within the SPSS Base system, it
will not suggest analyses contained in optional modules (for example,
SPSS Advanced Models), nor will it suggest analyses unavailable in
SPSS. After invoking the Statistics Coach (Help..Statistics Coach), you
first provide broad, then more specific information about what you want
to do, until the search narrows to a particular statistical analysis. We
demonstrate the Statistics Coach by asking about group comparisons
when normality cannot be assumed: the situation we found in this
chapter.

Note

The Statistics Coach is an optional component listed when installing
SPSS for Windows. For this reason, it may not have been placed on your
machine. If not, it can be added by rerunning the SPSS installation setup
and checking the Statistics Coach as a component for installation.
Click Help..Statistics Coach
Click the Compare groups for significant differences option
button

Figure 8.18 Statistics Coach Main Dialog Box

The first Statistics Coach dialog box describes broad types of
analyses.
Click the Next pushbutton
Click the Continuous numeric data divided into groups
option button under the What kind of data do you want to
compare? question
Click the Next pushbutton
Click the Three or more groups option button in the How
many groups or variables do you want to compare?
dialog
Click the Next pushbutton
Click the One option button under the How many grouping
(factor) variables do you have? question
Click the Next pushbutton

Figure 8.19 Comparing Groups Help Dialog Box

A Finish button replaces the Next button since no additional dialogs
are required. If we indicate data are normally distributed within groups,
then One-Way ANOVA is suggested. If we select a test that does not
assume normality, then nonparametric tests (notice that ranks and
chi-square results appear in the examples) are suggested. If we decided to
check the data for normality, a histogram with a normal distribution
overlay is chosen. Since we found earlier that the data were not normally
distributed within groups and didn't exhibit homogeneity of variance, we
pick the second choice.
Click Test that does not assume data are normally
distributed under the Which test do you want? question
Click Finish

Figure 8.20 How to Run Nonparametric Tests

Thus the Statistics Coach leads us to the same analysis
(nonparametric tests for several independent samples) that we chose
earlier. The "How-to" window describes the menu choices required to
open the analysis dialog box (Tests for K-Independent Samples) and
contains instructions about the information you must provide. Click the
Tell me more pushbutton to provide additional information about the
analysis dialog box.
In this way, the Statistics Coach can be very useful in suggesting
which SPSS Base procedure is most appropriate to address a specific
research or analysis question. It can serve to choose a statistical method
or confirm your initial choice.
Click the Close box to close the How To window

Hint

SPSS provides definitions of statistical terms appearing in pivot tables
through the Help..What's this? menu choice made after selecting the term
within the Pivot Table editor.

HELP IN INTERPRETING STATISTICAL RESULTS

SPSS contains a Results Coach that helps to explain results appearing in
SPSS pivot tables. This is especially useful when interpreting statistical
results. We will demonstrate the Results Coach by applying it to the Test
Statistics pivot table produced in our last analysis (Kruskal-Wallis test of
degree group differences in daily hours of TV viewing).

Return to the Viewer window and scroll to the bottom
Right-click on the Test Statistics pivot table
Click Results Coach on the Context menu
(Alternatively, double-click on the pivot table to open the Pivot
Table editor, and then click the Results Coach tool on the
Format toolbar)
Click Next pushbutton three (3) times to display additional
information
Figure 8.21 Results Coach for Kruskal Wallis Tests Pivot Table

After calling the Results Coach, help appears that describes the
contents of the selected type of pivot table and how they are typically
used. Here, the Results Coach states what the Kruskal-Wallis test does
and indicates the meaning of a low significance value, which we recently
discussed in this chapter. This is a useful aid when interpreting
statistical results appearing in SPSS pivot tables.
Click the Close box to close the Results Coach


Chapter 9 Mean Differences Between Groups III: Two-Factor ANOVA
Objective

Apply the principles of testing for mean differences between populations
to situations involving more than a single factor.

Method

We wish to test whether there are any regional or gender differences in
average years of formal education. First, use the Explore (Examine)
procedure to explore the subgroups involved in the analysis. Next, make
the Analyze..General Linear Model menu choices to run a univariate
analysis of variance (Unianova procedure), specifying education (EDUC)
as the dependent variable with region (REGION) and gender (SEX) as
factors. Display the results using an error bar chart and a table. In the
appendix we perform post-hoc tests based on the results of our analysis.

Data

General Social Survey, 1994.

Scenario

We wish to see whether there are regional and gender differences in the
amount of formal education U.S. adults have received. In the previous
chapter, highest education degree obtained (DEGREE) was considered an
independent variable; here we view highest year of school completed
(EDUC) as a dependent measure. While debatable, in this chapter we
argue that the amount of formal education one obtains may be viewed as
potentially influenced by one's region and gender. Previously we viewed
education degree as a factor that might influence the amount of TV
watched. Claims about just what causes what cannot be fully resolved
using survey studies (followers of Hume would argue they cannot be
resolved at all), and reside in the way we view the world (that region may
influence education, rather than education influencing the region you live
in). In summary, we will study mean differences in the dependent
measure (education) as a function of two independent variables (region,
gender).

INTRODUCTION

Analysis of variance (ANOVA) is a general method for drawing
conclusions about differences in population means when two or
more comparison groups are involved. We have already discussed
how the t test is used to contrast two groups, and how one-factor ANOVA
compares more than two groups differing along a single factor. In this
chapter, we expand our consideration of ANOVA to allow multiple factors
in a single analysis. Such an approach is efficient in that several
questions are addressed within one study. The assumptions and issues
considered earlier (normality of dependent measure within each group,
homogeneity of variance and the importance of each) apply to general
ANOVA and will not be repeated here.
We will investigate whether there are differences in average years of
formal education for men and women living in different regions of the
country. Since two factors, region and gender, are under consideration,
we can ask three different questions. (1) Are there regional differences?
(2) Is there a gender difference? (3) Do region and gender interact?
As in earlier chapters, we begin by running an exploratory data
analysis, then proceed with more formal testing.

LOGIC OF TESTING, AND ASSUMPTIONS

As before, we wish to draw conclusions about the populations from which
we sample. The main difference in moving from t tests and one-factor
ANOVA to general ANOVA is that more questions can be asked about the
populations. However, the results will be stated in the same terms: how
likely is it that we could obtain means as far apart as what we observe in
our sample, if there were no mean differences in the populations?
Comparisons are again framed as a ratio of the variation among group
means (between-group variation) to the variation among observations
within groups (within-group variation). When statistical tests are
performed, homogeneity of variance and normality of the dependent
measure within each group are assumed. Comments made earlier
regarding robustness of the means analysis when these assumptions are
violated apply directly.
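
The ratio described above can be made concrete with a short Python sketch (outside SPSS). It covers only the simpler one-factor case with hypothetical data; the function name is our own, and it returns the F ratio without a probability value.

```python
from statistics import mean

def f_ratio(groups):
    """One-factor F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation among group means, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation among observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical years of education in three groups, for illustration only.
print(f_ratio([[11, 12, 13], [12, 13, 14], [13, 14, 15]]))
```

In a two-factor analysis the same within-group term serves as the denominator, while separate between-group terms are formed for each main effect and for the interaction.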

HOW MANY FACTORS?

The new aspect we consider is how to include several factors, or ask
several different questions of the data, within a single analysis of
variance. We will test whether there is a gender difference in education,
whether there are regional differences, and finally, whether gender and
region interact concerning education. The interpretation of an interaction
is discussed in the next section.
Although our example involves only two factors (gender and region),
ANOVA can accommodate more. Usually the number of factors is limited
by either the interests of the researcher, who might wish to examine a
few specific issues, or by sample size considerations. Sample size plays a
role in that the greater the number of factors, the greater the number of
cell means that must be computed, and the smaller the sample for each
mean. For example, if we have a sample of 400 subjects and wish to look
at education differences due to gender (2 levels), race (3 levels), region (9
levels) and age group (4 levels), there are 2 * 3 * 9 * 4 or 216 subgroup
means involved. If the data were distributed evenly across the
demographics, each subgroup mean would be based on approximately 2
observations, which would not produce a very powerful analysis. Such
analyses can be performed, and technically, questions involving single
effects like gender or region would be based on means involving fairly
large samples. Also, in some planned experiments many subgroups are
dropped (for example, incomplete designs). Still, the fact remains that
with smaller samples, there are practical limitations in the number of
questions you can ask of the data.
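
The arithmetic in this example is easy to reproduce for your own design. The Python sketch below (outside SPSS) uses the factor levels and sample size from the example above and assumes, as the text does, that cases are spread evenly across cells.

```python
from math import prod

# Factor levels and sample size taken from the example in the text.
levels = {"gender": 2, "race": 3, "region": 9, "age group": 4}
sample_size = 400

cells = prod(levels.values())                         # number of subgroup means
per_cell = sample_size / cells                        # cases per cell if spread evenly
per_gender_level = sample_size / levels["gender"]     # cases behind each gender mean

print(cells, round(per_cell, 2), per_gender_level)
```

The contrast between the last two figures shows why main effects can still rest on fairly large samples even when the individual cell means do not.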

INTERACTIONS

When moving beyond oneway or one-factor ANOVA, the distinction
between main effects and interactions becomes relevant. A main effect is
an effect (or group difference) attributable to a single factor (independent
variable). For example, when we study education differences across
regions and gender, the effect of region alone, and the effect of gender
alone, would each be considered a main effect. The two-way interaction
would test whether the effect of one factor is the same within each level of
the other factor.
In our example, this can be phrased in either of two ways. We can say
the interaction tests whether the gender difference (which might be zero)
is the same for each region. Alternatively, we can say the two-way
interaction tests whether the regional differences are identical for each
gender group. While these two phrasings are mathematically equivalent,
it can be simpler (based on the number of levels in each factor) for you to
present the information from one perspective instead of the other. The
presence of a two-way interaction is important to report, since it qualifies
our interpretation of a main effect. For example, a gender by region
interaction implies that the magnitude of a gender difference varies
across regions (in fact there may be no difference or a reversal in the
pattern of the gender means for some regions), so statements about
gender differences must be qualified by regional information.
Since we are studying two factors, there can be only one interaction.
If we expand our analysis to three factors (say, gender, age group and
region) we can ask both two-way (gender by age group, gender by region,
age group by region) and three-way (gender by age group by region)
interaction questions. As the number of factors increases, so does the
possible complexity of the interactions. In practice, significant high-order
(three, four, five way, etc.) interactions are relatively rare compared to
the number of significant main effects.
Interpretation of an interaction can be done directly from a table of
relevant subgroup means, but it is both more convenient and common to
view a multiple-line chart of the means. We illustrate this below under
several scenarios.
Suppose that in our population, women are more highly educated
than men, there are regional differences in education, and that the
gender difference is the same across regions. The line chart below plots a
set of means consistent with this pattern.

Figure 9.1 Main Effects, No Interaction

In the chart we see that the mean line for women is above that of the
men. In addition, there are differences among the four regions. However,
note that the gender difference is nearly identical for each region. This
equal distance between the lines (parallelism of lines) indicates there is
no interaction present.
Figure 9.2 No Main Effects, Strong Interaction

Here the overall means for men and women are about the same, as
are the means for each region (pooling the two gender groups). However,
the gender difference varies dramatically across the different regions: in
region B women have higher education, in regions A and D there is no
gender difference, and in region C males have higher education. We
cannot make a statement about gender differences without qualifying it
with region information, nor can we make regional claims without
mentioning gender. Strong interactions are marked by this crossover
pattern in the multiple-line chart.
Figure 9.3 One Main Effect, Weak Interaction

We see a gender difference for each of the four regions, but the
magnitude of this difference varies across regions (substantially greater
for region D). This difference in magnitude of the gender effect would
constitute an interaction between gender and region. It would be termed
a weak interaction because there is no crossover of the mean lines.
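
The three scenarios can be summarized as a comparison of the gender difference across regions. The Python sketch below (outside SPSS) classifies a table of cell means accordingly. The means and the function names are hypothetical, and the sketch only describes the pattern in the sample means; it is not a significance test.

```python
# Hypothetical cell means (years of education), invented for illustration.
cell_means = {
    "A": {"female": 13.5, "male": 13.0},
    "B": {"female": 14.0, "male": 13.5},
    "C": {"female": 12.5, "male": 12.0},
    "D": {"female": 13.0, "male": 12.5},
}

def gender_gaps(means):
    """Female minus male mean within each region."""
    return {region: m["female"] - m["male"] for region, m in means.items()}

def pattern(means, tol=1e-9):
    """Describe the interaction pattern shown by the sample cell means."""
    gaps = list(gender_gaps(means).values())
    if max(gaps) - min(gaps) <= tol:
        return "no interaction (parallel lines)"
    if min(gaps) < 0 < max(gaps):
        return "crossover interaction"
    return "interaction without crossover"

print(gender_gaps(cell_means))
print(pattern(cell_means))
```

A constant gap corresponds to the parallel lines of Figure 9.1, a gap that changes sign to the crossover of Figure 9.2, and a gap that varies in size but not in sign to Figure 9.3.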
Additional scenarios can be charted, and we have not mentioned
three-way and higher interactions. Such topics are discussed in
introductory statistics books (see the references for suggestions). We will
now proceed to analyze our data set.

EXPLORING THE DATA

We begin by applying exploratory data analysis to education within
subgroups defined by combinations of region and gender. In practice, you
would check each group's summaries, look for patterns in the data, and
note any unusual points. Also, we will request that the Explore procedure
perform a homogeneity of variance test. As in the last chapter, for brevity
in our presentation we will skip the individual group summaries and
move immediately to the box & whisker plot.
Click File..Open..Data (switch to the c:\Train\Stats folder)
Select SPSS Portable (*.por) from the Files of Type drop-down
list
Select Gss94.por and click Open
Click Analyze..Descriptive Statistics..Explore
Move educ to the Dependent List: box
Move sex and region to the Factor List: box
Click the Plots pushbutton
Click the Untransformed option button from the Spread vs.
Level with Levene Test area
Figure 9.4 Requesting a Homogeneity of Variance Test

By default, no homogeneity test is performed (None option button).
Each of the remaining choices will lead to homogeneity being tested. The
second (Power Estimation) and third (Transformed) choices are used by
more technical analysts to investigate power transformations of the
dependent measure that would yield greater homogeneity of variance.
These issues are of interest to serious practitioners of ANOVA, but are
beyond the scope of this course (see Emerson in Hoaglin, Mosteller and
Tukey (1991), also the brief discussion in Box, Hunter and Hunter (1978),
and the original (technical) paper by Box and Cox (1964)). The
Untransformed choice builds a plot without transforming the scale of the
dependent measure and is easier to interpret.
Click Continue

Figure 9.5 Explore Dialog Box with Two Factors

Since we are comparing region by gender subgroups, we designate
SEX and REGION as the factors (or nominal independent variables).
However, if we were to run this analysis, it would produce a set of
summaries for each gender group, then a set of summaries for each
regional group. In other words, each of the two factors would be treated
separately, instead of being combined as we desire. To instruct SPSS to
treat each gender by region combination as a subgroup, we must use
SPSS syntax. The easiest way to accomplish this would be to click the
Paste pushbutton, which opens a syntax window and builds an Examine
command that will perform an analysis for each factor.
Click Paste
Figure 9.6 Examine Command in Syntax Window

The Examine command requires only the Variables subcommand in
order to run. We must also include the Plot subcommand since we desire
the homogeneity test (controlled by SPREADLEVEL keyword; the 1 in
parentheses indicates a power transformation of 1, thus no
transformation, is to be applied to the dependent measure). The other
subcommands specify default values; they appear in order to make it
simple for the analyst to modify the command when necessary. Note the
keyword BY separates the dependent variable (EDUC) from the factors
(SEX and REGION). Currently, both SEX and REGION follow the BY
keyword, and so have the same status, that is, an analysis will be run for
each separately. To indicate we wish a joint analysis, we insert an
additional BY between SEX and REGION on the Variables subcommand.
Type by between sex and region in the Syntax Editor window
Figure 9.7 Examine Command Requesting Subgroup Analysis

SPSS now interprets the factor groupings to be based on each gender
by region combination (SEX BY REGION).
To run the analysis, click the Run button
Now we proceed to review the Explore output. As mentioned earlier,
to keep the presentation brief we will not view the individual group
summaries, but move directly to the box & whisker plot.

Figure 9.8 Box & Whisker Plot of Education

We see variation in the lengths of the boxes, which suggests that the
variation of education within groups is not homogeneous. Also, to the eye,
in most of the regions the median education for males is equal to that of
females. Finally, it seems that median education is lowest for those living
in the east central regions of the U.S. There are outliers at both the high
and low ends. Do any of them seem so extreme as to suggest data errors?

Note

The spread and level plot will be reformatted as a sunflower plot within
the Chart editor window. This is done because some of the subgroup
points fell on top of each other and could not be distinguished. To obtain a
sunflower plot:
Double click on the spread & level chart
Click Chart..Options
Check the Show Sunflowers check box
Click OK
Click File..Close to close the Chart editor

Figure 9.9 Spread & Level Plot of Education

A spread and level plot is a scatterplot in which the spread
(interquartile range) is plotted against the level (median) for each
subgroup. If the assumption of homogeneity of variance holds true, then
the spread should be at about the same value regardless of the level,
presenting a horizontal or flat pattern. The first caption confirms that no
data transformation was performed: a blank value for P (Power) indicates
that education was not transformed. In addition the second caption
displays the slope. If the Power estimation choice were made, the slope
from that analysis (natural logs are taken) would suggest a
variance-stabilizing power transformation. There is no clear upward or
downward pattern to the points, and we see the concentration of subgroup
medians
at 13 years. The actual tests of homogeneity appear separately.
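The Power estimation variant mentioned above could be requested in syntax roughly as follows. This is a sketch only: it assumes the same region by sex subgroups used for the plot, and it relies on the SPREADLEVEL keyword, given without a power value, producing the log-log plot from which the slope and suggested power are estimated.

```spss
* Sketch: spread-versus-level plot with power estimation for educ.
* Cells are the region by sex subgroups.
EXAMINE VARIABLES=educ BY region BY sex
  /PLOT=SPREADLEVEL
  /STATISTICS=NONE
  /NOTOTAL.
```

Specifying SPREADLEVEL(1) instead plots the untransformed data, which corresponds to the blank Power value reported in the figure.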
Figure 9.10 Levene Tests of Homogeneity of Variance

The Levene tests all indicate that, if the subgroup variances were
identical in the population, the probability of obtaining sample variances
as disparate as (or more disparate than) those we observe is very small
(.000, meaning less than .0005). Thus we cannot assume homogeneity of
variance. The spread and level plot did not reveal any obvious relation
between the interquartile range and the median (a common pattern is for
the spread to increase as the level increases), so no simple adjustment is
apparent. One mitigating factor is that because our sample size is so
large, we have a fairly sensitive test of homogeneity, so even modest
differences in variance can be flagged as significant. However, short of
pursuing the technical route of variance-stabilizing data transformations
(referred to earlier), we will proceed with the analysis of variance,
taking the probability values with a grain of salt.

TWO-FACTOR ANOVA

To run the analysis, we choose Analyze..General Linear Model. Please
note that the General Linear Model submenu choices will vary depending
on whether you have the Advanced Models option installed.
Click Analyze..General Linear Model
Figure 9.11 General Linear Model Submenu

The Univariate (single dependent variable) choice can handle designs
from the simple to the more complex (incomplete block, Latin square,
etc.) and also provides the user with control over various aspects of the
analysis. The other submenu items lead to GLM (General Linear Model)
procedures that are part of the Advanced Models option. The
Multivariate submenu performs multivariate (multiple dependent
measure) analysis of variance, and the Repeated Measures submenu is
used for studies in which an observation contributes to several group
means. These are commonly called split-plot or repeated measure
designs, and are extensions of the paired t test we discussed in Chapter 7.
These latter analyses are reviewed in the Advanced Techniques: ANOVA
and Advanced Statistical Analysis Using SPSS courses, and discussed in
the SPSS Advanced Models manual. These more complex analyses are
run using the General Linear Model (GLM) procedure. The Mixed Models
menu choice (under Analyze) extends these analyses to include more
complex repeated measures analysis and nested random effects models.
Click Analyze..General Linear Model..Univariate
Move educ to the Dependent Variable: box
Move region and sex to the Fixed Factor(s): box
Figure 9.12 GLM Univariate Dialog Box

Fixed factors have a limited (finite) number of levels, which are
included in the study, and we wish to draw population conclusions about
only these levels. Our analysis does not include random factors or
covariates. Briefly, random factors are those in which a random sample of
a few levels from all those possible are included in the study, but
population conclusions are to be applied to all levels. For example, an
institutional researcher might randomly select schools from a large school
district to be included in a study investigating gender differences in
learning mathematics. Here, gender is a fixed factor while school is a
random factor. It is important to distinguish between fixed and random
factors since error terms differ. Covariates are interval-scale independent
variables, whose relationships with the dependent measure you wish to
statistically control before performing the ANOVA itself. While we do not
cover analyses with covariates in this course, our discussion of regression
is directly relevant to the topic.
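To make the fixed versus random distinction concrete, the school example could be specified along the following lines. The variable names (mathscore, gender, school) are hypothetical and do not come from the course data; the sketch assumes only that naming a factor on the RANDOM subcommand is what changes the error terms GLM uses.

```spss
* Hypothetical example: gender is fixed, school is a random factor.
* Variable names are illustrative and not part of the course files.
UNIANOVA
  mathscore BY gender school
  /RANDOM = school
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /CRITERIA = ALPHA(.05)
  /DESIGN = gender school gender*school .
```

A covariate would be named after the keyword WITH on the variable list, for example educ BY region sex WITH age, where age is again an assumed variable name.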
The OK button is active, so we can run the analysis. However, first
we will request some additional information.
Click on the Model pushbutton

Figure 9.13 GLM Univariate Model Dialog Box

Within the Univariate Model dialog you can specify the model you
want applied to the data. By default, a full factorial model (all main
effects, interactions, and covariates) is fit and the various effects tested
using Type III sums of squares (each effect is tested after statistically
adjusting for all other effects in the model). For balanced or unbalanced
models with no missing cells, Type III sums of squares is most commonly
used. If there are any missing cells in your analysis, we recommend you
switch to Type IV sums of squares, which better adjusts for them. If only
a subset of the main effects and interactions are of interest in your
analysis, and you want to specify an incomplete design, you would click
the Custom option button in the Specify Model area and indicate which
main effects and interactions should be included in the model. A custom
model is sometimes used if there is no interest in, or the design does not
allow, the testing of high-order interaction effects. Because we want to
examine the full factorial model, there is no reason to modify this dialog.
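As a sketch of the Custom alternative described above, the command below fits a model containing only the two main effects. It is shown for illustration only, since the analysis in this chapter keeps the default full factorial model.

```spss
* Custom (main effects only) model with the region*sex interaction omitted.
UNIANOVA
  educ BY region sex
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /CRITERIA = ALPHA(.05)
  /DESIGN = region sex .
```

If the design had missing cells, changing the METHOD subcommand to SSTYPE(4) would request Type IV sums of squares.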
Click the Cancel button
Click the Plots pushbutton
We next examine the Profile Plots dialog box. The Profile Plots dialog
produces line charts that display means at different factor levels. You can
view main effects with such plots, but they are most helpful in
interpreting two- and three-way interactions (note that up to three factor
variables can be included). The dependent variable does not appear in
this dialog box. Multiple plots can be requested, which is useful in
complex analyses where there may be several significant interactions. We
will request profile plots for each main effect (region and sex) and for
their interaction (region * sex). Some analysts would request such plots
only for significant main effects and interactions, as determined by the
initial analysis.
Move region into the Horizontal Axis: box
Click Add
Move sex into the Horizontal Axis: box
Click Add
Move region into the Horizontal Axis: box and sex into the
Separate Lines: box
Figure 9.14 GLM Univariate Profile Plots Dialog Box

Click Add
Click Continue
Click the Options pushbutton
The Options dialog is used to request means, homogeneity of variance
tests, residual plots, and other diagnostic information pertaining to the
analysis. We will ask GLM to provide us with descriptive statistics for the
cells in the analysis. This will display means, standard deviations and
sample sizes for the factors and the two-way interaction. Estimated
marginal means are means estimated for each level of a factor averaging
across all levels of other factors (marginals), based on the specified model
(estimated). These means can differ from the observed means if
covariates are included or if an incomplete model (not all main effects and
interactions) is used. Post hoc analyses can be applied to the observed
means using the Post Hoc pushbutton (See Appendix to this chapter).
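In syntax, these Options choices correspond to keywords on the PRINT and EMMEANS subcommands. The variant below is a sketch that goes slightly beyond what we request in this chapter: it adds the Levene test, eta-squared estimates, and a table of estimated marginal means for region.

```spss
* Descriptives, Levene homogeneity test and effect-size estimates.
* EMMEANS requests estimated marginal means for each region.
UNIANOVA
  educ BY region sex
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /EMMEANS = TABLES(region)
  /PRINT = DESCRIPTIVE HOMOGENEITY ETASQ
  /CRITERIA = ALPHA(.05)
  /DESIGN = region sex region*sex .
```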
Click the Descriptive statistics check box in the Display area

Figure 9.15 GLM Univariate Options Dialog Box

Click Continue
Click OK
The following syntax will also run the analysis:
UNIANOVA
educ BY region sex
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/PLOT = PROFILE( region sex region*sex )
/PRINT=DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/DESIGN = region sex region*sex .
The first piece of output describes the factors involved in the analysis.
They are labeled between-subject factors.

Figure 9.16 Between-Subject Factors

THE ANOVA TABLE

The ANOVA table contains the information, much of it technical,
necessary to evaluate whether there are significant differences in
education across regions, between genders, and whether the two factors
interact.
Figure 9.17 ANOVA Table

The first column lists the different sources of variation. We are most
interested in the region and gender main effects, as well as the region by
gender interaction. The source labeled Error contains summaries of the
within-group variation (or residual term), which will be used when
calculating the F ratios (ratios of between-group to within-group
variation). The remaining sources in the list are simply totals involving
the sources already described and, as such, are generally not of interest.
The Sums of Squares column contains a technical summary (sums of the
squared deviations of group means around the overall mean or of
individual observations around the group means) that is not interpreted
directly, but is used in calculating the later column values. The df
(degrees of freedom) column contains values that are functions of the
number of levels of the factors (for region, sex and region by sex) or the
number of observations (for error). Although this is a gross
oversimplification, you might think of degrees of freedom as measuring
the number of independent values (whether means or observations) that
contribute to the sums of squares in the previous column. As with sums of
squares, degrees of freedom are technical measures, not interpreted
themselves, but used in later calculations.
Mean Square values are variance measures attributable to the
various effects (region, gender, and region by gender) and to the variation
of individuals within groups (error). The ratio of an effect mean square to
the mean square of the error provides the between-group to within-group
variance ratio, or F statistic. If there were no group differences in the
population, then the ratio of the between-group variation to the
within-group variation should be about 1. The Sig. column contains the most
interpretable numbers in the table: the probabilities that one can obtain
F ratios as large or larger (group means being as far or farther apart) as
what we find in our sample, if there are no mean differences in the
population.
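As a worked illustration of how the columns relate, take a factor A with a levels, a factor B with b levels, and N cases in total, with every cell containing data. The symbols a, b and N are introduced here for illustration only; with nine regions and two genders, a = 9 and b = 2.

```latex
\begin{aligned}
df_{A} &= a - 1 = 8, \qquad df_{B} = b - 1 = 1, \\
df_{A \times B} &= (a - 1)(b - 1) = 8, \qquad df_{\text{error}} = N - ab = N - 18, \\
MS_{\text{effect}} &= \frac{SS_{\text{effect}}}{df_{\text{effect}}}, \qquad
F = \frac{MS_{\text{effect}}}{MS_{\text{error}}}.
\end{aligned}
```

Each F ratio in the table is therefore the effect's mean square divided by the single error mean square.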
Gender does not show a significant difference in formal education
(Sig.= .185). In other words, there is about a 1 in 5 (.185) chance of
obtaining as large a sample difference in formal education between men
and women as we observe here if there is no sex difference in the
population. Similarly, the gender by region interaction is not significant,
so we have no evidence that the pattern of regional differences differs
between men and women. On the other hand, region differences are highly significant
(probability rounded to three decimals is .000, and is thus less than
.0005, or 5 chances in 10,000). Despite this, the r-square measures
indicate that the model accounts for only about 3% of the variance in
education.
Earlier we found that the homogeneity of variance assumption was
not met. However, the region effect is so highly significant that even if we
were to inflate the probability by two orders of magnitude (.0005 to .05)
there still would be a significant difference. Thus we are confident that
there are regional differences despite the assumption violation. While
this informal adjustment of probability values is not a real solution to the
problem, it is better than entirely ignoring it. We could also perform a
nonparametric test of regional differences in education.
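One such nonparametric check is the Kruskal-Wallis test. A minimal sketch follows; the range (1 9) assumes that the nine regions are coded 1 through 9, which should be verified against the value labels before running.

```spss
* Kruskal-Wallis test of education across regions.
* The code range 1 to 9 is an assumption about how region is coded.
NPAR TESTS
  /K-W=educ BY region(1 9).
```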

OBSERVED MEANS

In the Options dialog box we asked for the descriptive statistics, which
include means for each main effect and the interaction. We are especially
interested in the regional means because only region had a significant
effect on the level of education in the ANOVA table.

Note

The pivot table below was edited in the Pivot Table editor so that gender
was placed in the layer dimension with the gender totals in the top layer,
which permits us to focus on regional differences in education.

Figure 9.18 Mean Education Level by Region

The average level of education ranges from 11.71 years in the East
South Central region to 13.77 years in the Pacific region. We can view
these means graphically in the profile plots. The profile plot for region
appears below.
Figure 9.19 Profile Plot of Region (Estimated Marginal Means)

The Profile plot displays the estimated marginal means for the nine
regions.

ECOLOGICAL SIGNIFICANCE

We have established that there are significant differences across regions
in highest year of school completed. However, we should ask whether a
two-year difference (Pacific versus East South Central) has practical
importance. Is it large enough that one might base policy, or take action,
on it? It is again important to recall that a statistically significant mean
difference implies only that the population difference is not zero.
Differences can be small, even inconsequential, and yet statistically
significant when the sample size is large.

PRESENTING THE RESULTS

To show the results graphically we could simply use the profile plot, but
an error bar chart would present the subgroup means along with their
95% confidence bands. Below we request an error bar chart displaying
each region.
Click Graphs..Interactive..Error Bar
Click Reset button, then click OK to confirm
Drag and drop Region of Interview [region] into the
horizontal arrow box
Drag and drop Highest Year of School Completed [educ] into
the vertical arrow box
Click Convert if asked to convert educ to a scale variable

Hint

Interactive Graphs make use of the measurement type (scale or
categorical) of variables when assigning statistics to the Y (here the
vertical) axis. Since we began with an SPSS portable file, which does not
contain measurement type information, SPSS uses rules to classify each
variable as nominal, ordinal or scale (interval/ratio). EDUC was classified
as an ordinal variable, and if we retained this measurement type, the
statistical summary produced by the Interactive Graph would be a mode,
rather than the mean we want. Thus we changed the measurement type
for EDUC to scale. A saved SPSS data file retains measurement type
information and has an advantage once measurement types have been
declared. We could have declared EDUC to be a scale variable in the
Variable tab within the Data Editor window, or through a Context menu
invoked by right-clicking on the variable within an Interactive Graph
dialog box.
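The measurement level can also be declared in syntax. The sketch below assumes the variable name educ used throughout this chapter; the file name on the SAVE command is hypothetical.

```spss
* Declare educ as a scale variable so graphs summarize it with a mean.
VARIABLE LEVEL educ (SCALE).
* Saving as an SPSS data file keeps the declaration (hypothetical path).
SAVE OUTFILE='C:\Train\Stats\gss_scale.sav'.
```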
If both region and gender were significant, or if there was an
interaction, we would have created a Clustered Error bar chart, which
would display one bar for each region by gender combination.
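A clustered error bar chart of that kind can also be requested from the standard (non-interactive) GRAPH command. This is a sketch, not the syntax pasted by the Interactive Graph dialog used here.

```spss
* Clustered error bars with a 95% confidence interval per region by sex cell.
GRAPH
  /ERRORBAR(CI 95)=educ BY region BY sex.
```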

Figure 9.20 Create Error Bar Chart Dialog Box

Click OK
The SPSS command to produce the error bar chart appears below.
IGRAPH /VIEWNAME='Error Bar Chart'
/X1 = VAR(region) TYPE = CATEGORICAL
/Y = VAR(educ) TYPE = SCALE
/COORDINATE = VERTICAL /X1LENGTH = 3.0
/YLENGTH = 3.0 /X2LENGTH = 3.0
/CATORDER VAR(region) (ASCENDING VALUES OMITEMPTY)
/ERRORBAR KEY ON CI(95.0) DIRECTION BOTH
CAPSTYLE = T SYMBOL = ON.

Figure 9.21 Error Bar Chart of Education by Region

The error bar chart displays where the various region sample means
fall. The 95% confidence bands provide information about which
population group means are expected to differ. The post hoc tests
performed in the appendix give a more exact answer to this question.

SUMMARY OF ANALYSIS

We found highly significant differences in formal education (highest year
of school completed) across regions. There was neither a gender difference
nor a gender by region interaction. This result is qualified by the finding
of significant differences in variance (violation of the homogeneity of
variance assumption) among the subgroups. Although the region
difference is highly significant statistically, the maximum regional
difference was two years (East South Central versus Pacific).

SUMMARY

In this chapter we performed a two-factor analysis of variance, looking at
education differences as a function of gender and region. Before
proceeding with the ANOVA, an exploratory analysis was done, and the
results were presented in table and graphical form. The appendix applies
post hoc tests to the regional means.

APPENDIX: POST HOC TESTS USING GLM UNIVARIATE

At this point of the analysis it is natural to ask which regions differ from
which others in terms of mean education level. The GLM procedure in
SPSS will perform separate post hoc tests on each dependent variable to
address this question. These tests are usually performed to
investigate which levels within a factor differ after the overall main effect
has been established. To request a post hoc test we will return to the
GLM Univariate dialog box.

Click the Dialog Recall tool, then click Univariate

Figure 9.22 GLM Univariate Dialog Box

Click Post Hoc pushbutton


We will select the Games-Howell test. Given the homogeneity of
variance results discussed earlier, Games-Howell, or another test that
does not assume homogeneity of variance, may be preferred. Here GLM
will apply Games-Howell post hoc tests to the observed regional means.
Click region in the Factor(s): list box and click the arrow to
move it into the Post Hoc Tests for: list box
Click the Games-Howell check box

Figure 9.23 Post Hoc Tests in GLM Univariate

Click Continue
Click OK
The command below will run this analysis. On the POSTHOC
subcommand we request that the Games-Howell test (GH) be applied to
region.
UNIANOVA
educ BY region sex
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/POSTHOC = region ( GH )
/PLOT = PROFILE( region sex region*sex )
/PRINT=DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/DESIGN = region sex region*sex .
We view part of the post hoc results below.

Figure 9.24 GLM Post Hoc Results - Beginning

Looking at the Games-Howell results we see the New England region
differs only from the East South Central region. The table can be scanned
for additional regional differences. Recall this analysis (the error term) is
based on the region by gender ANOVA.

Note

Since the gender and gender by region effects were not found to be
significant, most analysts would rerun the analysis with region as the
only factor when performing post hocs. This would be a one-factor
ANOVA, which was discussed in Chapter 8. Performing the post hocs
from the two-factor analysis, as we do here, serves to demonstrate how it
would be done if more than a single factor were significant.
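Such a rerun might look like the sketch below, which uses the ONEWAY procedure with the Games-Howell test. It assumes the same variable names (educ, region) and is not part of the analysis reported in this appendix.

```spss
* One-factor ANOVA of education by region with Games-Howell post hoc tests.
ONEWAY educ BY region
  /STATISTICS=DESCRIPTIVES HOMOGENEITY
  /POSTHOC=GH ALPHA(.05).
```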

Chapter 10 Bivariate Plots and Statistics

Objective

To understand the techniques used to display, and measures used to
quantify, relationships between interval scale variables.

Method

First use the Explore procedure (Examine) to study variables
individually, then produce a scatterplot to view them jointly. Quantify the
relationship using the Correlation procedure.

Data

A personnel file containing demographic, salary and work related data
from 474 employees at a bank in the early 1970s. The salary information
has not been converted to current dollars. Demographic variables include
gender, race, age, and education (in years-EDLEVEL). Work related
variables are job classification (JOBCAT), previous work experience
recorded in years (WORK), time (in months) spent in current job position
(TIME). Current salary (SALNOW) and starting salary (SALBEG) are
also available. The data are stored as an SPSS portable file called
Bank.por.

Scenario

We wish to study the relationship between starting salary and several
background variables: education, prior work experience and age. Since
our entire sample is composed of employees from a single bank, the
population to which we generalize our conclusions is restricted. However,
it is of interest to examine the relative strength of the relationships of
education, work experience and age with starting salary. In addition, we
wish to quantify the relation between education and starting salary and
obtain a prediction equation. Such equations are constructed in salary
compensation studies. After exploring the data we will view the relation
between beginning salary and education in a scatterplot, then quantify
the linear association between beginning salary and each of several
variables in a correlation analysis.

INTRODUCTION

In previous chapters we explored relations among categorical
variables (using crosstab tables), and between categorical variables
and interval scale variables (t-test, ANOVA). Here we focus on
studying two interval scale measures: starting salary and formal
education. We wish to determine if there is a relationship, and if so,
quantify it. Starting salary is recorded in dollars and formal education is
reported in years; thus both variables are interval scales or stronger
(actually ratio scales). In addition, each variable can take on many
different values. If we tried to present these variables (beginning salary
and education) in a crosstabulation table, the table could contain
hundreds of rows. In order to view the relation between these measures
we must either recode salary and education into categories and run a
crosstab (the appropriate graph is a clustered bar chart), or alternatively,
present the original variables in a scatterplot. Both approaches are valid
and you would choose one or the other depending on your interests. Since
we hope to build an equation relating amount of education to beginning
salary we will stick to the original scales and begin with a scatterplot.
But first we will take a quick look at the relevant variables using
exploratory data analysis methods.

READING THE DATA

The data are stored as an SPSS portable file named Bank.por.

Click File..Open..Data
Switch to the c:\Train\Stats folder (if necessary)
Select SPSS Portable (*.por) from the Files of Type: drop-down
menu
Click Bank.por
Figure 10.1 Reading the Bank Portable File

The portable file contains both the data and dictionary information
(formats, labels, missing values). For those using SPSS command syntax,
the following command will read the portable file.
IMPORT FILE='C:\Train\Stats\Bank.por'.
Click Open

Figure 10.2 Bank Data

We see the data values for several employees in the Data Editor
window.

EXPLORING THE DATA

As in earlier chapters, we will explore the data before performing more
formal analysis (for example, regression). While the scatterplot itself
provides much useful information about each of the variables displayed,
we begin by examining each variable separately. We will run exploratory
data analysis on beginning salary (SALBEG) and education (EDLEVEL).
The ID variable will be used to label cases (outliers).
Click Analyze..Descriptive Statistics..Explore
Move salbeg and edlevel into the Dependent List: box
Move id into the Label Cases by: box

Figure 10.3 Explore Dialog Box

There are no Factor variables in this analysis since we are looking at
the two variables over the entire sample. Outliers in the box & whisker
plot will be identified by their employee ID number (ID).
Click OK
The syntax below will run this analysis in SPSS.
EXAMINE
VARIABLES=salbeg edlevel /ID=id.
We request analyses on SALBEG and EDLEVEL and specify the
variable id as the case identification variable (using the ID
subcommand). The equal signs are not required by SPSS. Proceeding to
the output, we first examine the results for beginning salary.

Figure 10.4 Statistics for Beginning Salary

The mean ($6,806) is considerably higher than the median ($6,000),
suggesting a skewed distribution. This is confirmed by the skewness
value compared to its standard error. Starting salaries range from $3,600
to $31,992 (recall that these are salaries from the 1960s and early 1970s
in unadjusted dollars).
Figure 10.5 Stem & Leaf Plot for Beginning Salary

The extreme values at the high salary end result in a skewed
distribution. Since several different job classifications are represented in
this data, the skewness may be due to a relatively small number of people
in high paying jobs. In the plot above each leaf represents 3 employees,
and the ampersand & (called a partial leaf) symbolizes the presence of
fewer than 3 employees with the same leaf value.
Figure 10.6 Box & Whisker Plot of Beginning Salary

All outliers are at the high end, and the employee numbers for some
of them can be read from the plot (changing the font size of these
numbers in the Chart Editor window would make more of them legible).
It might be useful to look at the job classification (JOBCAT) of some of
the higher salaried individuals as a check for data errors.
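One way to make that check is to list the job category of the highest-paid cases. In the sketch below the cutoff of 15000 is an arbitrary illustrative value, not one taken from the output.

```spss
* List ID, job category and starting salary for high starting salaries.
* TEMPORARY limits the selection to the next procedure only.
TEMPORARY.
SELECT IF (salbeg > 15000).
LIST VARIABLES=id jobcat salbeg.
```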

Figure 10.7 Statistics for Formal Education (in years)

The mean is again above the median, but the skewness value is very
near zero (suggesting a symmetric distribution). Here the mean
exceeding the median is not due to the presence of outliers, as will be
made apparent in the stem & leaf diagram below.
Figure 10.8 Stem & Leaf Plot of Education

Since education was recorded in whole years and the range is small,
all the leaves are zero. Notice there are only a few extreme observations,
and they are at high education values. An oddity is the gap in education
between 8th and 12th grade. If this occurred in your data you might
investigate further to determine if it might be the result of the sampling
procedure used to select individuals for inclusion in the study or a data
coding problem. The mean is above the median because of the
concentration of employees with education of 15 to 19 years (compare to
the single block of employees at 8 years). This imbalance is revealed in
the stem & leaf.
Figure 10.9 Box & Whisker Plot of Education

The median or 50th percentile (dark line within box) falls on the
lower edge of the box (25th percentile) indicating a large number of
people with 12th grade education.
Having explored each variable separately, we will now view them
jointly with a scatterplot.

SCATTERPLOTS

A scatterplot displays individual observations in an area determined by a
vertical and a horizontal axis, each of which represents an interval scale
variable of interest. In a scatterplot one looks for a relationship between
the two variables, and notes any patterns or extreme points. The
scatterplot visually presents the relation between two variables, while
correlation and regression summaries quantify certain types of relations.
To request a standard scatterplot in SPSS:
Click Graphs..Scatter

Figure 10.10 Scatterplot Dialog Box

A Simple, or standard, scatterplot displays just two variables. The
Overlay plot would be picked if we wanted to display more than a single
variable along the vertical axis. An example would be to graph both
beginning and current salary (using different symbols) against education.
A Matrix plot would present one image containing all possible
scatterplots based on a list of variables. Thus you can view many
relations within a single chart, but in a reduced scale. A 3-D scatterplot
adds a third dimension to the standard plot, and allows you to view, from
any angle, relations among three variables. Controls to rotate the 3-D
scatterplot are available in the Chart Editor window. Since we are
dealing with only two variables, a Simple scatterplot will suffice. Note
that scatterplots are also available via the Interactive Graphs submenu of
the Graphs menu.
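For comparison, a Matrix plot of the variables named in the scenario could be requested with the sketch below. It assumes the variable names salbeg, edlevel and work from the data description, and it assumes that the age variable is simply named age.

```spss
* Scatterplot matrix of starting salary and three background variables.
* The variable name age is assumed and should be checked in the Data Editor.
GRAPH
  /SCATTERPLOT(MATRIX)=salbeg edlevel work age.
```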
Click Define
Next we indicate that beginning salary (SALBEG) and education
(EDLEVEL) are the Y and X variables, respectively.
Move salbeg into the Y Axis: box
Move edlevel into the X Axis: box

Figure 10.11 Simple Scatterplot Dialog Box

Traditionally, the vertical axis is called the Y axis, while the
horizontal axis is referred to as the X axis. Also, if one of the variables is
viewed as dependent and the other as independent (in the sense
discussed in Chapter 1), by convention the dependent variable is specified
as the Y axis variable.
If we wished different groups within the sample, say those belonging
to different job categories, to be represented by different symbols (or
colors) in the plot, we would name this grouping variable as the Marker
variable. You would use the Label Cases By box if you wished a label
(perhaps a name or ID number) to appear beside each point (case) in the
chart. Since the label would take up much more space than the point
itself, this option should not be used when there are many cases plotted.
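In syntax, the Marker and Label Cases By choices appear as BY clauses on the SCATTERPLOT subcommand. The commands below are a sketch of both options, using the jobcat and id variables from the data description; neither is run in this chapter.

```spss
* Use a different marker for each job category.
GRAPH
  /SCATTERPLOT(BIVAR)=edlevel WITH salbeg BY jobcat.
* Label each point with the employee ID instead.
GRAPH
  /SCATTERPLOT(BIVAR)=edlevel WITH salbeg BY id (NAME).
```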
Click OK
The command below will produce the graph in SPSS.
GRAPH
/SCATTERPLOT(BIVAR)=edlevel WITH salbeg.

Figure 10.12 Scatterplot of Beginning Salary and Education

Each square represents at least one observation. We see there are
many points (fairly dense) at 8 and 12 years of education. Overall, there
seems to be a positive relation between the two variables. Notice there is
no one with little education and a high salary, nor is there anyone with
high education and a very low salary. This will be explored in more detail
shortly. There is one individual at a salary considerably higher than the
rest. If this were your study, you might check this observation to make
sure it wasn't in error.
While we can describe the pattern to an interested party by saying
that to some extent greater education is associated with higher salary
levels, or simply show them the chart, there would be an advantage if we
could quantify the relation using some simple function. We will pursue
this aspect later in this chapter and in the regression chapter.
If we wish to overlay our plot with a best-fitting straight line, we can
do so using the Chart Editor.
Double click on the chart to open the Chart editor
Click Chart..Options
Click the Total check box in the Fit Line area
Click the Show Sunflowers check box in the Sunflowers area

Figure 10.13 Scatterplot Options Dialog Box

Since there are no subgroups in our analysis (we did not name a
Marker variable), we can only fit a line to the entire (Total) sample. The
other areas in the dialog box allow adding a mean reference line and
creating a sunflower plot: a scatterplot representing the density of cases
at a location by the number of petals on a sunflower symbol. This is
useful here since we are plotting many points with a limited number of
education values. While the default Fit options are fine for our purpose,
we will click the Fit Options pushbutton to see the possibilities.
Click the Fit Options pushbutton
Figure 10.14 Scatterplot: Fit Options

By default a straight line (linear) will be fit to the data, although
other simple curves (quadratic, cubic) are also available. The Lowess
choice applies a robust regression technique to the data. Such methods
produce a result that is more resistant to outliers than the traditional
least-squares regression. While not invoked here, note that 95%
confidence bands around the best-fitting line can be added to the plot.
Finally, we will define the r-square measure when we consider
regression, but please note that it can be displayed on the chart
(Interactive Graph scatterplots can also display the line's equation).
Because we did not change any Fit option settings, we will exit this dialog
box and process the requested Scatterplot options.
Click Cancel
Click OK
Click File..Close to close the Chart Editor
Figure 10.15 Scatterplot with Best Fitting Line

The straight line tracks the positive relationship between beginning
salary and education. It would be helpful to quantify the strength of
the relationship and, furthermore, to describe it mathematically. If a
simple function (for instance, a straight line) does a fair job of
representing the relationship, then we can describe it with the
equation Y = b * X + a. Here b is the slope (the average change in Y
per unit change in X) and a is the intercept. Methods are available to
perform both tasks: correlation assigns a number to the strength of the
straight-line relationship, and regression describes the best-fitting
straight line. We first obtain the correlations, and will consider
regression in the next chapter.

CORRELATIONS

It is helpful to be able to quantify the strength of the relationship
between variables in a scatterplot, if for no other reason than having a
succinct summary. The common correlation coefficient (formally named
the Pearson product-moment correlation coefficient) is a measure of the
extent to which there is a linear (or straight line) relationship between
two variables. It is normed so that a correlation of +1 indicates that the
data fall on a perfect straight line sloping upwards (positive relationship),
while a correlation of -1 would represent data forming a straight line
sloping downwards (negative relationship). A correlation of 0 indicates
there is no straight-line relationship at all. Correlations falling between 0
and either extreme (-1,+1) indicate some degree of linear relation: the
closer to +1 or -1, the stronger the relation. In social science and market
research, when straight-line relationships are found, significant
correlation values are often in the range of .3 to .6.
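As a minimal sketch of how the coefficient is computed, the Python function below divides the sum of cross-products of deviations by the square root of the product of the two sums of squared deviations. The function name and the small data set are illustrative assumptions; the values are made up and are not taken from Bank.por.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (not the bank file)
education = [8, 12, 12, 15, 16, 16, 18, 19]
salary = [4800, 5400, 6300, 6000, 8400, 7800, 11000, 13500]

r_up = pearson_r(education, [2 * e + 1 for e in education])     # perfect upward line
r_down = pearson_r(education, [-3 * e + 5 for e in education])  # perfect downward line
r_mixed = pearson_r(education, salary)                          # scattered, positive
print(round(r_up, 3), round(r_down, 3), round(r_mixed, 3))
```

Data lying exactly on an upward or downward line give the two extremes of the scale, while the scattered data give a value between 0 and 1.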
Below we display four scatterplots with their accompanying
correlations, all based on simulated data following normal distributions.
Four different correlations appear (1.0, .8, .4, 0). None is negative, and
together they span the full range in strength of positive linear
association (from 0 to 1). As a benchmark aid, a best-fitting straight
line is superimposed on each chart.
Figure 10.16 Scatterplots Based on Various Correlations.

For the perfect correlation of 1.0, all points fall on the straight line
trending upwards. In the scatterplot with a correlation of .8 the strong
positive relation is apparent, but there is some variation around the line.
Looking at the plot of data with correlation of .4, the positive relation is
suggested by the absence of points in the upper left and lower right of the
plot area. The association is clearly less pronounced than with the data
correlating .8 (note greater scatter of points around the line). The final
chart displays a correlation of 0: there is no linear association present.
This is fairly clear to the eye (the plot most resembles a blob), and the
best-fitting straight line is a horizontal line.
While we have stressed the importance of looking at the relationships
between variables using scatterplots, you should be aware that human
judgment studies indicate that people tend to overestimate the degree of
correlation when viewing scatterplots. Thus obtaining the numeric
correlation is a useful adjunct to viewing the plot. Correspondingly, since
correlations only capture the linear relation between variables, viewing a
scatterplot allows you to detect nonlinear relationships present.
Additionally, statistical significance tests can be applied to
correlation coefficients. Assuming the variables follow normal
distributions, you can test whether the correlation differs from zero (zero
indicates no linear association) in the population, based on your sample
results. The significance value is the probability that you would obtain as
large (or larger in absolute value) a correlation as you find in your
sample, if there were no linear association (zero correlation) between the
two variables in the population.
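One common way to carry out this test converts the sample correlation into a t statistic with n - 2 degrees of freedom. The sketch below assumes that textbook formula; it is not a statement of how SPSS computes the result internally, and the function name is hypothetical. The inputs are the rounded correlations and the sample size of 474 quoted in this course.

```python
import math

def correlation_t(r, n):
    """t statistic (df = n - 2) for testing that the population correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Rounded correlations quoted in the text, with the sample of 474 employees
t_education = correlation_t(0.63, 474)   # beginning salary with education
t_age = correlation_t(-0.01, 474)        # beginning salary with age
print(round(t_education, 1), round(t_age, 2))
```

With a sample this large, an absolute t value beyond roughly 2 corresponds to two-tailed significance at about the .05 level, so the first correlation is far into the significant range and the second is nowhere near it.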
In SPSS, correlations (Pearson product-moment correlations) can be
easily obtained along with an accompanying significance test. If one has
grossly nonnormal data, or only ordinal scale data, the Spearman rank
correlation coefficient (or Spearman correlation) can be calculated. It
evaluates the linear relationship between two variables after ranks have
been substituted for the original scores. Another, less common, rank
association measure is Kendall's rank correlation coefficient (Kendall's
tau-b). We will obtain the correlation (Pearson) between beginning
salary and education, and will also include age, current salary, and work
experience in the analysis.
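The idea of substituting ranks for scores can be sketched in a few lines of Python: rank each variable (tied values share the average of their ranks), then apply the Pearson formula to the ranks. The helper names and data are illustrative assumptions.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x, but along a curve rather than a line
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Because the ranks of x and y agree perfectly here, the Spearman value reaches 1 even though the Pearson correlation of the raw scores falls short of it.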
Click Analyze..Correlate..Bivariate
Move salbeg, salnow, edlevel, age and work to the Variables:
list box
Figure 10.17 Correlation Dialog Box

Notice that we simply list the variables to be analyzed; there is no
designation of dependent and independent. Correlations are simply
measures of straight-line association.
By default, Pearson correlations will be calculated, which is what we
want. However, the alternative types can be easily requested. A
two-tailed significance test will be performed on each correlation. This
will posit as the null hypothesis that in the population there is no
linear association between the two variables. Thus any straight-line
relationship, either positive or negative, is of interest. If you prefer
a one-tailed test, one in which you specify the direction (or sign) of
the relation you expect and any relation in the opposite direction
(opposite sign) is bundled with the zero (or null) effect, you can obtain
it through the One-tailed option button. This issue was discussed earlier
in the context of one- versus two-tailed t tests. A one-tailed test gives
you greater power to detect a correlation of the sign you propose, at the
price of giving up the ability to detect a significant correlation of the
opposite sign. In practice, researchers are usually interested in all
linear relations, positive and negative, and so two-tailed tests are very
common. The Flag significant correlations check box is checked by
default. When checked, significant correlations will be identified by
asterisks appearing beside the correlations.
The Options pushbutton leads to a dialog box in which you can
request that descriptive statistics appear for the variables used in the
analysis. There is also a choice for missing values. The default missing
setting is Pairwise, which means that if a case has missing values for one
or more of the analysis variables, SPSS will still use the valid information
from other variables in that case. The alternative is Listwise, in which a
case is dropped from the correlation analysis if any of its analysis
variables have missing values. Neither method provides an ideal solution;
in practice, pairwise deletion is often chosen when a large number of
cases are dropped by the listwise method. This is an area of statistics in
which considerable progress has been made in the last decade, and the
SPSS Missing Values option incorporates some of these improvements.
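The difference between the two settings comes down to which cases are counted. The Python sketch below uses a few made-up records, with None marking a missing value, to compare the number of cases available under each rule; the function names are hypothetical.

```python
# None marks a missing value; the records are hypothetical
cases = [
    {"salbeg": 6000, "edlevel": 12, "age": 30},
    {"salbeg": 7200, "edlevel": None, "age": 41},
    {"salbeg": None, "edlevel": 16, "age": 28},
    {"salbeg": 8400, "edlevel": 16, "age": None},
    {"salbeg": 5400, "edlevel": 8, "age": 52},
]
variables = ["salbeg", "edlevel", "age"]

def pairwise_n(cases, var_a, var_b):
    """Cases usable for one correlation: both of its two variables are present."""
    return sum(1 for c in cases if c[var_a] is not None and c[var_b] is not None)

def listwise_n(cases, variables):
    """Cases usable under listwise deletion: every analysis variable is present."""
    return sum(1 for c in cases if all(c[v] is not None for v in variables))

print(pairwise_n(cases, "salbeg", "edlevel"), listwise_n(cases, variables))
```

Under pairwise deletion each correlation can rest on a different subset of cases, which is why the sample size is reported separately for every pair of variables.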
Click OK
The CORRELATIONS syntax command below will run a Pearson
correlation analysis. To request Spearman correlations or Kendall's
tau-b coefficients, use the NONPAR CORR (nonparametric correlation)
command.

CORRELATIONS
/VARIABLES=salbeg salnow edlevel age work
/PRINT NOSIG.
The NOSIG keyword instructs SPSS to mark significant correlations
with asterisks.

Figure 10.18 Correlation Matrix

SPSS displays the correlations, sample sizes and significance values
together in a cell. Looking at the correlation pivot table, we see that the
variable names run down the first column and across the top row. Each
cell (intersection of a row and column) in the matrix contains the
correlation (also significance value and sample size) between the relevant
row and column variable. The correlation (Pearson Correlation) is listed
first in each cell, followed by the probability value of the significance test
(Sig. (2-tailed)), and finally the sample size (N).
Note all correlations along the major (upper left to lower right)
diagonal are 1. This is because each variable perfectly correlates with
itself (no significance tests are performed for these correlations). Also, the
correlation matrix is symmetric, that is, the correlation between
beginning salary and education is the same as the correlation between
education and beginning salary. Thus you need only view part of the
matrix to the upper right of the diagonal (or to the lower left of the
diagonal) to see all the correlations.
There is, not surprisingly, a strong (.88) correlation between
beginning salary and current salary. Its significance value rounded to
three decimals is .000 (thus less than .0005). This means that if
beginning salary and current salary had no linear association in the
population, then the probability of obtaining a sample with such a strong
(or stronger) linear association is less than .0005. The sample size is
nearly 500, which should provide fairly sensitive (powerful) tests of the
correlations being nonzero.
Formal education and beginning salary have a substantial (.63)
positive correlation, while age has no linear association with beginning
salary (correlation -.01; probability value of .81, that is, an 81% chance
of obtaining a sample correlation at least this far from zero if it were
truly zero in the population). Do you see any other large correlations in
the table, and if so can you explain them? Also note that the significant
correlations are marked by asterisks.
A correlation provides a concise numerical summary of the degree of
linear association between pairs of variables. However, a correlation can
be influenced by outliers without warning signs appearing in the
correlation. Such outliers would probably be visible in a scatterplot. Also,
a scatterplot might suggest that a function other than a straight line be
fit to the data, whereas a correlation simply provides a measure of
straight-line fit. For these reasons, serious analysts look at scatterplots.
If the number of variables is so large as to render looking at all
scatterplots unfeasible, then at least view those involving important
variables.

SUMMARY

In this chapter we considered how to display and quantify relations
between pairs of interval scale variables. We took a personnel file
containing salary and education. We first examined the variables using
exploratory data analysis, then looked at the relation between them with
a scatterplot, and added a best-fitting straight line. Finally, we quantified
the strength of the linear association with a correlation coefficient.


Chapter 11 Introduction to Regression


Objective

Learn when to use regression, the assumptions involved, and how to
interpret the standard results.

Method

Use the Regression procedure to run a simple regression analysis, then
add additional variables and perform multiple regression. Finally,
explore stepwise regression: an automated method of selecting predictor
variables.

Data

A personnel file containing demographic, salary and work related data
from 474 employees at a bank in the early 1970s. The salary information
has not been converted to current dollars. Demographic variables include
gender, race, age, and education (in years, EDLEVEL). Work related
variables are job classification (JOBCAT), previous work experience
recorded in years (WORK), and time (in months) spent in current job
position (TIME). Current salary (SALNOW) and starting salary (SALBEG)
are also available. The data are stored as an SPSS portable file called
Bank.por.

Scenario

We found, based on a scatterplot and correlation coefficient, that
beginning salary and education are positively related for the bank
employees. We wish to further quantify this relation by developing an
equation predicting starting salary based on education. Compensation
analysts build such equations using several predictor variables (multiple
regression). Additionally we want to assess the degree to which the
equation fits the data and verify that there are no oddities in the results.
Since we have measured several variables that might be related to
beginning salary, we will add additional predictor (independent)
variables to the equation, evaluate the improvement and interpret the
equation coefficients. In a successful analysis we would obtain an
equation useful in predicting starting salary based on background
information, and understand the relative contribution of each predictor
variable.

INTRODUCTION AND BASIC CONCEPTS

Regression analysis is a statistical method used to predict a
variable (an interval scale dependent measure) from one or more
predictor (interval scale) variables. Commonly, simple functions
(straight lines) are used, although forms of regression allow nonlinear
functions, and even (in robust regression) smoothed functions of the data
that have no straightforward equation description. In this chapter we will
focus on linear regression, which typically, but not necessarily, involves
linear or straight-line relations between variables. To aid our discussion,
let's revisit the scatterplot of beginning salary and education (Figure
10.15).

Figure 11.1 Scatterplot of Beginning Salary and Education (with
Sunflower Option)

Earlier we pointed out that to the eye there seems to be a positive
relation between education and beginning salary, that is, higher
education is associated with greater starting salaries. This was confirmed
by the two variables having a significant positive correlation (.63). While
the correlation does provide a single numeric summary of the relation,
something that would be more useful in practice is some form of
prediction equation. Specifically, if some simple function can approximate
the pattern shown in the plot, then the equation for the function would
concisely describe the relation, and could be used to predict values of one
variable given knowledge of the other. A straight line is a very simple
function, and is usually what researchers start with, unless there are
reasons (theory, previous findings, or a poor linear fit) to suggest another.
Also, since the point of much research involves prediction, a prediction
equation is valuable. However, the value of the equation would be linked
to how well it actually describes or fits the data, and so part of the
regression output includes fit measures.

THE REGRESSION EQUATION AND FIT MEASURE

In the plot above, beginning salary is placed on the Y (or vertical) axis
and education appears along the X (or horizontal) axis. Since education is
typically completed before starting a career, we consider beginning salary
to be the dependent variable and education the independent or predictor
variable. A straight line is superimposed on the scatterplot along with the
general form of the equation, Y = B * X + A. Here, B is the slope (the
change in Y per one unit change in X) and A is the intercept (the value of
Y when X is zero).

Given this, how would one go about finding a best-fitting straight
line? In principle, there are various criteria that might be used:
minimizing the mean deviation, mean absolute deviation, or median
deviation. Due to technical considerations, and with a dose of tradition,
the best-fitting straight line is the one that minimizes the sum of the
squared deviations of the points about the line. If these sums of squared
deviations remind you of our discussion of ANOVA, it is the same
concept.
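The least-squares criterion can be illustrated with a short Python sketch. It uses the standard closed-form solution for a single predictor (slope equals the sum of cross-products divided by the sum of squares of X), then checks that nudging the slope or intercept only increases the sum of squared deviations. The function names and data are illustrative assumptions, not values from Bank.por.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the line minimizing the sum of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_deviations(x, y, slope, intercept):
    """Sum of squared vertical distances of the points from the line."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))

# Hypothetical illustration data
education = [8, 12, 12, 15, 16, 16, 18, 19]
salary = [4800, 5400, 6300, 6000, 8400, 7800, 11000, 13500]

slope, intercept = least_squares_line(education, salary)
best = sum_squared_deviations(education, salary, slope, intercept)
steeper = sum_squared_deviations(education, salary, slope * 1.1, intercept)
shifted = sum_squared_deviations(education, salary, slope, intercept + 100)
print(best < steeper, best < shifted)
```

Any other line, however close, produces a larger sum of squared deviations than the fitted one.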
Returning to the plot of beginning salary and education, we might
wish to quantify the extent to which the straight line fits the data. The fit
measure most often used, the r-square measure, has the dual advantages
of falling on a standardized scale and having a practical interpretation.
The r-square measure (which is the correlation squared, or r², when there
is a single predictor variable, and thus its name) is on a scale from 0 (no
linear association) to 1 (perfect linear prediction). Also, the r-square value
can be interpreted as the proportion of variation in one variable that can
be predicted from the other. Thus an r-square of .50 indicates that we can
account for 50% of the variation in one variable if we know values of the
other. You can think of this value as a measure of the improvement in
your ability to predict one variable from the other (or others if there are
multiple independent variables).
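Both readings of r-square, the squared correlation and the proportion of variation accounted for, can be checked against each other. The Python sketch below computes r-square as 1 minus the ratio of residual to total sum of squares and compares it with the squared Pearson correlation. Names and data are illustrative assumptions.

```python
import math

def r_squared(x, y):
    """Proportion of variation in y accounted for by the least-squares line on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return 1 - ss_residual / ss_total

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data
education = [8, 12, 12, 15, 16, 16, 18, 19]
salary = [4800, 5400, 6300, 6000, 8400, 7800, 11000, 13500]

fit = r_squared(education, salary)
r = pearson_r(education, salary)
print(round(fit, 4), round(r ** 2, 4))
```

With a single predictor the two printed values agree, which is where the measure gets its name.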

RESIDUALS AND OUTLIERS

Viewing the plot, we see that many points fall near the line, but some are
quite a distance from it. For each point, the difference between the value
of the dependent variable and the value predicted by the equation (value
on the line) is called the residual. Points above the line have positive
residuals (they were under-predicted), those below the line have negative
residuals (they were over-predicted), and a point falling on the line has a
residual of zero (perfect prediction). Points having relatively large
residuals are of interest because they represent instances where the
prediction line did poorly. For example, one case has a beginning salary of
about $30,000 while the predicted value (based on the line) is about
$10,000, yielding a residual, or miss, of about $20,000. If budgets were
based on such predictions, this is a substantial discrepancy. In SPSS, the
Regression procedure can provide information about large residuals, and
also present them in standardized form. Outliers, or points far from the
mass of the others, are of interest in regression because they can exert
considerable influence on the equation (especially if the sample size is
small). Also, outliers can have large residuals and would be of interest for
this reason as well. While not covered in this class, SPSS can provide
influence statistics to aid in judging whether the equation was strongly
affected by an observation and, if so, to identify the observation.

ASSUMPTIONS

Regression is usually performed on data for which the dependent and
independent variables are interval scale. In addition, when statistical
significance tests are performed, it is assumed that the deviations of
points around the line (residuals) follow the normal bell-shaped curve.
Also, the residuals are assumed to be independent of the predicted
values (the values on the line), which implies that the variation of the
residuals around the line is homogeneous (homogeneity of variance).
SPSS can provide summaries and plots useful in evaluating these latter
issues. One special case of the assumptions involves the interval scale
nature of the independent variable(s). A variable coded as a dichotomy
(say 0 and 1) can technically be considered as an interval scale. An
interval scale assumes that a one-unit change has the same meaning
throughout the range of the scale. If a variable's only possible codes are 0
and 1 (or 1 and 2, etc.), then a one-unit change does mean the same
change throughout the scale. Thus dichotomous variables, for example
gender, can be used as predictor variables in regression. It also permits
the use of nominal predictor variables if they are converted into a series
of dichotomous variables; this technique is called dummy coding and is
considered in most regression texts (Draper and Smith (1998), Cohen and
Cohen (1983)). When we perform multiple regression (multiple
independent variables) later in this chapter we will use a dichotomous
predictor (gender).
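Dummy coding can be sketched as follows in Python. One category is chosen as the reference, and each remaining category gets its own 0/1 variable. The category labels below are hypothetical stand-ins for job classification values, and the function name is an assumption made for illustration.

```python
def dummy_code(values, reference):
    """Return {category: list of 0/1 codes} for every category except the reference."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical job categories for four employees
jobcat = ["clerical", "custodial", "manager", "clerical"]
dummies = dummy_code(jobcat, reference="clerical")
for name, codes in dummies.items():
    print(name, codes)
```

A nominal variable with three categories yields two dichotomous predictors; cases in the reference category are coded 0 on both.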

SIMPLE REGRESSION

A regression involving a single independent variable is the simplest case
and is called simple regression. We will develop a regression equation
predicting beginning salary based on education.
Click File..Open..Data (move to the c:\Train\Stats directory)
Select SPSS Portable (*.por) from the Files of Type: dropdown menu
Double click on Bank.por
Click Analyze..Regression
Figure 11.2 Regression Submenu

This chapter will focus on the first choice, linear regression, which
performs simple and multiple linear regression. Curve Estimation will
invoke the Curvefit procedure, which can apply up to 16 different
functions relating two variables. Binary logistic regression is used when
the dependent variable is a dichotomy (for example, when predicting
whether a medical patient survives or not). Multinomial logistic
regression is appropriate when you have a categorical dependent variable
with more than two values. Ordinal regression can be applied when the
measurement level of the dependent variable is ordinal (rank ordered).
Probit analysis is traditionally used in medical dosage response studies in
which one records at different drug dosage levels the number of
experimental animals that survive and the number that die. Nonlinear
regression will apply a user-specified nonlinear equation to the variables.
Weight estimation will compute weight factors, which when later used by
the Regression procedure will result in areas where the points show less
variation (reflect greater precision) being weighted more heavily. 2-Stage
Least Squares is a method used in econometrics to evaluate
regression-like models in which variables can appear in two separate
equations. Finally, Optimal Scaling analysis (Regression with Optimal
Scaling) is used to predict the values of a categorical, ordinal or interval
dependent variable from a combination of categorical, ordinal or interval
independent variables, but does so by performing scaling on the original
variables. Binary and Multinomial Logistic, Probit, Nonlinear, 2-Stage
Least Squares and Weight Estimation are part of the SPSS Regression
Models option, Ordinal Regression is part of the SPSS Advanced Models
option, and Optimal Scaling comes with the SPSS Categories option.
We will select Linear to perform simple linear regression, then name
beginning salary (SALBEG) as the dependent variable and education
(EDLEVEL) as the independent variable.
Click Linear from the Regression menu
Move salbeg to the Dependent: box
Move edlevel to the Independent(s): box
Figure 11.3 Linear Regression Dialog Box

In this first analysis we will limit ourselves to producing the standard
regression output. In our next regression example, we will ask for
residual plots and information about cases with large residuals. Also, the
Regression dialog box allows many specifications; here we will discuss the
most important features. However, if you will be running regression
often, some time spent reviewing the additional features and controls
mentioned in the manual and Help system will be well worth it.
Beginning salary (SALBEG) is the dependent variable and education
(EDLEVEL) is the sole independent variable. Notice the Independent(s)
list box will permit more than one independent variable, and so this
dialog box can be used for both simple and multiple regression. The block
controls permit an analyst to build a series of regression models with the
variables entered at each stage (block), as specified by the user.
By default, the Method is Enter, which means that all independent
variables in the block will be entered into the regression equation
simultaneously. This method is chosen to run one regression based on all
variables you specify. If you wish the program to select, from a larger set
of independent variables, those that in some statistical sense are the best
predictors, you can request the Stepwise method. We will review this
method later in the chapter.
The Selection Variable option permits cross-validation of regression
results. Only cases whose values meet the rule specified for a selection
variable will be used in the regression analysis, yet the resulting
prediction equation will be applied to the other cases. Thus you can
evaluate the regression on cases not used in the analysis, or apply the
equation derived from one subgroup of your data to other groups.
While SPSS will present standard regression output by default, many
additional (and some of them quite technical) statistics can be requested
via the Statistics dialog box. The Plots dialog box is used to generate
various diagnostic plots used in regression, including residual plots. We
will request such plots in the next analysis. The Save dialog box permits
you to add new variables to the data file. These variables contain such
statistics as the predicted values from the regression equation, various
residuals and influence measures. Finally, the Options dialog box controls
the criteria when running stepwise regression and choices in handling
missing data (the SPSS Missing Values option provides more
sophisticated methods of handling missing values). By default, SPSS
excludes a case from regression if it has one or more values missing for
the variables used in the analysis.
Click OK
The command below will produce the analysis in SPSS.
REGRESSION
/DEPENDENT salbeg
/METHOD=ENTER edlevel .
We indicate that beginning salary (SALBEG) is the dependent
variable and we wish to enter education (EDLEVEL) as the single
predictor.

Figure 11.4 Model Summary and Overall Significance Tests

After listing the dependent and independent variables, Regression
provides several measures of how well the model fits the data. First is the
multiple R, which is a generalization of the correlation coefficient. If
there is a single predictor variable (as in our case) then the multiple R is
simply the unsigned (positive) correlation between the independent and
dependent variable; recall the correlation between beginning salary and
education was .63 (Figure 10.18). If there are several independent
variables then the multiple R represents the unsigned (positive)
correlation between the dependent measure and the optimal linear
combination of the independent variables. Thus the closer the multiple R
is to 1, the better the fit. As mentioned earlier, the r-square measure can
be interpreted as the proportion of variance of the dependent measure
that can be predicted from the independent variable(s). Here it is about
40%, which is far from perfect prediction, but still substantial. The
adjusted r-square represents a technical improvement over the r-square
in that it explicitly adjusts for the number of predictor variables relative
to the sample size, and as such is preferred by many analysts. However,
it is a more recently developed statistic and so is not as well known as the
r-square. Generally, they are very close in value; in fact, if they differ
dramatically in multiple regression, it is a sign that you have used too
many predictor variables for your sample size, and the adjusted r-square
value should be more trusted. In our results, they are very close.
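The adjustment can be sketched with the usual textbook formula, adjusted r-square = 1 - (1 - r-square) * (n - 1) / (n - k - 1), where n is the sample size and k the number of predictors. The sketch assumes this standard formula, and the second call uses made-up numbers to show what happens with too many predictors for a small sample.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust r-square for k predictor variables and a sample of n cases."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Values close to the simple regression in the text: r-square about .40, 474 cases
close = adjusted_r_squared(0.40, n=474, k=1)

# Hypothetical contrast: the same r-square from 10 predictors and only 20 cases
far = adjusted_r_squared(0.40, n=20, k=10)
print(round(close, 4), round(far, 4))
```

With one predictor and a large sample the adjustment is negligible. With many predictors relative to the sample size the adjusted value drops sharply and can even fall below zero.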
The standard error of the estimate is a standard deviation type
summary of the dependent variable that measures the deviation of
observations around the best fitting straight line. As such it provides, in
the scale of the dependent variable, an estimate of how much variation
remains to be accounted for after the line is fit. The reference number for
comparison is the original standard deviation of the dependent variable,
which measures the original amount of unaccounted variation.

Regression can display such descriptive statistics as the standard
deviation, but since we didn't request this, we will note that the original
standard deviation of beginning salary was $3,148 (Figure 10.4). Thus the
uncertainty surrounding individual beginning salaries has been reduced
from $3,148 (standard deviation) to $2,439 (standard error). If the
straight line perfectly fit the data, the standard error would be 0.
While the fit measures indicate how well we can expect to predict the
dependent variable or how well the line fits the data, they do not tell
whether there is a statistically significant relationship between the
dependent and independent variables. The analysis of variance table
presents technical summaries (sums of squares and mean square
statistics) similar to what we found in ANOVA, but here we refer to
variation accounted for by the prediction equation. Our main interest is
in determining whether there is a statistically significant (non zero)
linear relation between the dependent variable and the independent
variable(s) in the population. Since in simple regression there is a single
independent variable, we are testing a single relationship; in multiple
regression, we test whether any linear relation differs from zero. The
significance value accompanying the F test gives us the probability that
we could obtain one or more sample slope coefficients (which measure the
straight line relationships) as far from zero as what we obtained, if there
were no linear relations in the population. The result is highly significant
(significance probability less than .0005 or 5 chances in 10,000). Now that
we have established there is a significant relationship between the
beginning salary and education, and obtained fit measures, we turn to
interpret the regression coefficients.
Figure 11.5 Regression Coefficients

The first column contains a list of the independent variables (here
one) plus the intercept (constant). The column labeled B contains the
estimated regression coefficients we would use in a prediction equation.
The coefficient for formal education level indicates that on average, each
year of education was associated with a beginning salary increase of
$691. The constant or intercept of -2,516 indicates that the predicted
beginning salary of someone with 0 years of education is negative $2,516,
that is, they would pay the bank to work. This is clearly impossible. This
odd result stems in part from the fact that no one in the sample had
fewer than 8 years of education, so the intercept projects well beyond the
region containing data. When using regression it can be risky to
extrapolate beyond where the data are observed; the assumption is that
the same pattern continues. Here it clearly cannot! The Standard Error
(of B) column contains standard errors of the regression coefficients.
These provide a measure of the precision with which we estimate the B
coefficients. The standard errors can be used to create a 95% confidence
interval around each B coefficient (available as a Statistics option). In
our example, the regression coefficient is $691 and the standard error is
about $39. Thus we would not be surprised if in the population the true
regression coefficient were $650 or $710 (within two standard errors of
our sample estimate), but it is very unlikely that the true population
coefficient would be $300 or $2,000.
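The "within two standard errors" reasoning can be written out as a small Python sketch. It uses the coefficient and standard error quoted above and a normal-approximation multiplier of 1.96, which is an assumption suited to a large sample; the interval SPSS reports is based on the t distribution and may differ slightly.

```python
def approximate_interval(b, standard_error, multiplier=1.96):
    """Approximate 95% interval: estimate plus or minus about two standard errors."""
    return b - multiplier * standard_error, b + multiplier * standard_error

low, high = approximate_interval(691, 39)
print(round(low, 2), round(high, 2))

# Candidate population values mentioned in the text
for candidate in (650, 710, 300, 2000):
    print(candidate, low <= candidate <= high)
```

The values 650 and 710 fall inside the interval, while 300 and 2,000 fall far outside it.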
Betas are standardized regression coefficients and are used to judge
the relative importance of each of several independent variables. We will
use these measures when discussing multiple regression. Finally, the t
statistics provide a significance test for each B coefficient, testing
whether it differs from zero in the population. Since we have but one
independent variable, this is the same result as what the F test provided
earlier. In multiple regression, the F statistic tests whether any of the
independent variables are significantly related to the dependent variable,
while the t statistic is used to test each independent variable separately.
The significance test on the constant assesses whether the intercept
coefficient is different from zero in the population (it is).
Thus if we wish to predict beginning salary based on education for
new employees, the formula would use the B coefficients: Beginning
Salary = $691 * Education - $2,516. Even when running simple
regression, the analyst would probably take a look at some residual plots
and check for outliers; we will follow through on this aspect in the next
example.
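As a small illustration of the prediction formula, the sketch below applies the rounded B coefficients from the text to a few education levels. The function name and the choice of education values are ours; the lower bound of 8 years is the minimum the text says was observed in the sample.

```python
def predict_beginning_salary(education_years):
    """Simple-regression prediction using the rounded B coefficients from the text."""
    return 691 * education_years - 2516

# Lowest education level observed in the sample, per the text.
MIN_OBSERVED_EDUCATION = 8

for years in (0, 8, 12, 16):
    note = ""
    if years < MIN_OBSERVED_EDUCATION:
        note = "  (extrapolation: below the observed data)"
    print(f"{years:2d} years -> {predict_beginning_salary(years):,} dollars{note}")
```

The prediction for 0 years reproduces the impossible negative salary, which is why predictions are best limited to the range of education values actually observed.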

MULTIPLE REGRESSION

Multiple regression represents a direct extension of simple regression.
Instead of a single predictor variable (Y = B * X + A), multiple regression
allows for more than one independent variable (Y = B1 * X1 + B2 * X2 + B3
* X3 + . . . + A) in the prediction equation. While we are limited in the
number of dimensions we can view in a single plot (SPSS can build a
3-dimensional scatterplot), the regression equation allows for many
independent variables. When we run multiple regression we will again be
concerned with how well the equation fits the data, whether there are
any significant linear relations, and estimating the coefficients for the
best-fitting prediction equation. In addition, we are interested in the
relative importance of the independent variables in predicting the
dependent measure.
In our example, we expand our prediction model of beginning salary
to include formal education (EDLEVEL), years of previous work
experience (WORK), age, and gender (SEX). Gender is a dichotomous
variable coded 0 for males and 1 for females. As such (recall our earlier
discussion), it can be included as an independent variable in regression.
Its regression coefficient will indicate the relation between gender and
beginning salary, adjusting for the effects of the other independent
variables.
To run the analysis, we will return to the Linear Regression dialog
box and add the additional independent variables (SEX, WORK, AGE) to
the Independent Variables list.

Click the Dialog Recall tool, and click Linear Regression
Move sex, work, and age into the Independent(s): box (edlevel
should already be there)
Figure 11.6 Setting Up Multiple Regression

Since the four independent variables will be entered as a single block
(we are at block 1 of 1), the order in which we list the variables will not
affect the analysis, but Regression will maintain this order when
presenting results.

RESIDUAL PLOTS

While we can run the multiple regression at this point, we will request
some diagnostic plots involving residuals and information about outliers.
By default no residual plots will appear. These options are explained
below.
Click the Plots pushbutton
Within the Plots dialog box:
Click the Histogram check box in the Standardized Residual
Plots area
Move *ZRESID into the Y: box
Move *ZPRED into the X: box

Figure 11.7 Regression Plots Dialog Box

The options in the Standardized Residual Plots area of the dialog box
all involve plots of standardized residuals. Ordinary residuals are useful
if the scale of the dependent variable is meaningful, as it is here
(beginning salary in dollars). Standardized residuals are helpful if the
scale of the dependent variable is not familiar (say a 1 to 10 customer
satisfaction scale). By this we mean that it may not be clear to the analyst
just what constitutes a large residual; is an overprediction of 1.5 units a
large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals
expressed in standard deviation units) are very useful because large
prediction errors can be easily identified. If the errors follow a normal
distribution, then standardized residuals greater than 2 (in absolute
value) should occur about 5% of the time, and those greater than 3 (in
absolute value) should happen less than 1% of the time. Thus
standardized residuals provide a norm against which one can judge what
constitutes a large residual. We requested a histogram of the
standardized residuals; note that a normal probability plot is available as
well. Recall that the F and t tests in regression assume that the residuals
follow a normal distribution.
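The 5% and 1% figures quoted above follow directly from the normal distribution. The short sketch below computes the two-sided tail probabilities with Python's standard library; the helper name is our own.

```python
from statistics import NormalDist

def two_sided_tail(cutoff):
    """P(|Z| > cutoff) for a standard normal variable Z."""
    return 2 * (1 - NormalDist().cdf(cutoff))

for cutoff in (2, 3):
    print(f"P(|z| > {cutoff}) = {two_sided_tail(cutoff):.4f}")
```

Under normality, a standardized residual beyond 2 in absolute value occurs a little under 5% of the time, and one beyond 3 occurs well under 1% of the time.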
Regression can produce summaries concerning various types of
residuals. Without going into all these possibilities, we request a
scatterplot of the standardized residuals (*ZRESID) versus the
standardized predicted values (*ZPRED). An assumption of regression is
that the residuals are independent of the predicted values, so if we see
any patterns (as opposed to a random blob) in this plot, it might suggest a
way of adjusting and improving the analysis.
Click Continue
Next we will look at the Statistics dialog box. The Casewise
Diagnostics choice appears here. When this option is checked, Regression
will list information about all cases whose standardized residuals are
more than 3 standard deviations from the line. This outlier criterion is
under your control.
Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals
area
Figure 11.8 Regression Statistics Dialog Box

Statistics such as the 95% confidence interval for the B (regression)
coefficients can be requested.
Click Continue
Click OK
To perform this analysis from SPSS command syntax, use the
following command.
REGRESSION
  /DEPENDENT salbeg
  /METHOD=ENTER edlevel sex work age
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HIST(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).
We have added the three additional independent variables to the
Method subcommand. Next a scatterplot of the standardized residuals
and standardized predicted values is requested. The asterisks must
appear before ZRESID and ZPRED; ordinary variable names can be
placed in the scatterplot, and the asterisk designates ZRESID and
ZPRED as special variables internal to Regression. In the Residuals
subcommand, we request a histogram of the standardized residuals. The
Casewise subcommand will produce a casewise plot or list of those
observations more than three standard deviations from the line (actually
from the plane, since there are several predictor variables). To run just a
basic multiple regression analysis, the command would end with the
Method subcommand.
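To make the outlier criterion concrete, here is a minimal sketch of the kind of screening the Casewise subcommand performs. It assumes that a standardized residual is the raw residual divided by the standard error of the estimate; the residuals and the standard error of 2,000 are hypothetical values chosen for illustration, not output from the bank data.

```python
def flag_outliers(residuals, std_error_of_estimate, cutoff=3.0):
    """Return (case_number, standardized_residual) pairs beyond the cutoff.

    Case numbers start at 1, mirroring a casewise listing.
    """
    flagged = []
    for case_number, residual in enumerate(residuals, start=1):
        z = residual / std_error_of_estimate
        if abs(z) > cutoff:
            flagged.append((case_number, round(z, 2)))
    return flagged

# Hypothetical raw residuals in dollars (not taken from the course data).
hypothetical_residuals = [500, -1200, 7500, -300, 9100]
outliers = flag_outliers(hypothetical_residuals, std_error_of_estimate=2000)
print(outliers)
```

Changing the cutoff argument plays the same role as changing the value in OUTLIERS(3).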

MULTIPLE REGRESSION RESULTS

We now turn to the results of our multiple regression run.


Figure 11.9 Variable Summary and Fit Measures

Recall that listwise deletion of missing data has occurred, that is, if a
case is missing data on any of the five variables used in the regression it
will be dropped from the analysis. If this results in heavy data loss, other
choices for handling missing values are available in the Regression
Options dialog box (see also the SPSS Missing Values option). The
dependent and independent variables are listed. The r-square measure is
about .49, indicating that with these four predictor variables we can
account for about 49% of the variation in beginning salaries. Education
alone had an r-square of .40, so the additional set of three predictors
added only an additional 9%: an improvement, but a modest
improvement. The adjusted r-square is quite close to the r-square. The
standard error has dropped from $2,439 (with just education as a
predictor) to $2,260: an improvement, but not especially large.
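The closeness of the adjusted r-square to the r-square depends on the sample size relative to the number of predictors. The sketch below uses the standard adjustment formula; the sample sizes of 50 and 500 are hypothetical and chosen only to show the effect, since the number of cases is not restated in this section.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Standard adjustment of r-square for the number of predictors."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# r-square of .49 with four predictors, as in the text; n is hypothetical.
for n_cases in (50, 500):
    adj = adjusted_r_square(0.49, n_cases, n_predictors=4)
    print(f"n = {n_cases:3d}: adjusted r-square = {adj:.3f}")
```

With several hundred cases and only four predictors the adjustment is small, which is consistent with the two values being close in the output.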

Figure 11.10 ANOVA Table

Since there are four independent variables, the F statistic tests
whether any of the variables have a linear relationship with beginning
salary. Not surprisingly, since we already know that education is
significantly related to beginning salary, the result is highly significant.
Figure 11.11 Multiple Regression Coefficients, Bs and Betas

The independent variables appear in the order they were given in the
Regression dialog box, and not in order of importance. Although the B
coefficients are important for prediction and interpretive purposes,
analysts usually look first to the t test at the end of each line to
determine which independent variables are significantly related to the
outcome measure. Since four variables are in the equation, we are testing
if there is a linear relationship between each independent variable and
the dependent measure after adjusting for the effects of the three other
independent variables. Looking at the significance values we see that
education and gender are highly significant (less than .0005), age is
significant at the .05 level, while work experience is not linearly related
to beginning salary (after controlling the other predictors). Thus we can
drop work experience as a predictor. It may seem odd that work
experience is not related to salary, but since many of the positions were
clerical, work experience may not play a large role. Typically, you would
rerun the regression after removing variables not found to be significant,
but we will proceed and interpret this output.
The estimated regression (B) coefficient for education is $651, similar
but not identical to the coefficient ($691) found in the simple regression
using formal education alone. In the simple regression we estimated the
B coefficient for education ignoring any other effects, since none were
included in the model. Here we evaluate the effect of education after
adjusting for age, work experience and gender. If the independent
variables are correlated, the change in B coefficient from simple to
multiple regression can be substantial. So, after adjusting for age, work
experience and gender, meaning if they were held constant, a year of
formal education, on average, was worth $651 in starting salary. The
gender variable has a B coefficient of -1526. This means that a one-unit
change in gender (moving from male status to female status), controlling
for the other independent variables in the equation, is associated with a
drop (negative coefficient) in beginning salary of $1,526. Age has a B
coefficient of $33, so a one-year change in age (controlling for the other
variables) was associated with a $33 beginning salary increase. Since we
found the coefficient for work experience not to be significantly different
from zero, we treat it as if it were zero. The constant or intercept term is
still negative, and
would correspond to the predicted salary for a male (sex=0) with 0 years
of education, 0 years of work experience and whose age is 0 - not a likely
or realistic combination. The standard errors again provide precision
measures for the regression coefficient estimates.
If we simply look at the estimated B coefficients we might think that
gender is the most important variable. However, the magnitude of the B
coefficient is influenced by the standard deviation of the independent
variable. For example, gender (SEX) takes on only two values (0 and 1),
while education values range from 8 years to over 20 years. The Beta
coefficients explicitly adjust for such standard deviation differences in the
independent variables. They indicate what the regression coefficients
would be if all variables were standardized to have means of 0 and
standard deviations of 1. A Beta coefficient thus indicates the expected
change (in standard deviation units) of the dependent variable per one
standard deviation unit increase in the independent variable (after
adjusting for other predictors). This provides a means of assessing
relative importance of the different predictor variables in multiple
regression. The Betas are normed so that the maximum should be less
than or equal to one in absolute value (if any Betas are above 1 in
absolute value, it suggests a problem with the data: multicollinearity).
Examining the Betas, we see that education is the most important
predictor, followed by gender, and then age. The Beta for work experience
is very near zero. If we needed to predict beginning salary from these
background variables (dropping work experience) we would use the B
coefficients. Rounding to whole numbers, we would say: Salbeg = 651 *
Edlevel - 1526 * Gender + 33 * Age - 2666.
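The link between a B coefficient and its Beta can be illustrated with a single predictor. The sketch below uses a small set of hypothetical education and salary values (not the bank data) and shows that rescaling B by the ratio of standard deviations gives the same number as refitting the regression on standardized variables.

```python
from statistics import mean, stdev

def slope(x, y):
    """Least-squares slope of y on x (simple regression)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data for illustration only.
education = [8, 12, 12, 15, 16, 16, 19]
salary = [4000, 5400, 6000, 7800, 8400, 9600, 12000]

b = slope(education, salary)
beta_from_b = b * stdev(education) / stdev(salary)
beta_direct = slope(standardize(education), standardize(salary))

print(f"B = {b:.1f} dollars per year of education")
print(f"Beta via rescaling B:            {beta_from_b:.4f}")
print(f"Beta from standardized variables: {beta_direct:.4f}")
```

Because Beta no longer depends on the units of the predictor, a two-valued variable such as gender and a many-valued variable such as education can be compared on the same footing.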

RESIDUALS AND OUTLIERS

We now examine the residual plots and summaries.


Figure 11.12 Casewise List of Outliers

We see those observations more than three standard deviations from
the line; assuming a normal distribution, this would happen less than 1%
of the time by chance alone. It is interesting to note that all the large
residuals are positive. Some of them are quite substantial. The next step
would be to see if these observations have anything in common (same job
category perhaps, which may be out of line with the others regarding
salary). Since we know their case numbers (an ID variable can be
substituted), we could look at them more closely.
Figure 11.13 Histogram of Residuals

We see the distribution of the residuals with a normal bell-shaped
curve superimposed. The residuals are a bit too concentrated in the
center (notice the peak) and are skewed; notice the long tail to the right.
Given this pattern, a technical analyst might try a data transformation
on the dependent measure (taking logs), which might improve the
properties of the residual distribution. Overall, the distribution is not too
bad, but there are clearly some outliers in the tail; these also show up in
the casewise outlier summary.
Figure 11.14 Scatterplot of Residuals and Predicted Values

Here we hope to see a horizontally oriented blob of points with the
residuals showing the same spread across different predicted values.
Unfortunately, we see a hint of a curving pattern: the residuals seem to
slowly decrease then swing up at the end. This type of pattern can
emerge if the relationship is curvilinear, but a straight line is fit to the
data. Also, the spread of the residuals is much more pronounced at higher
predicted values than at the lower ones. This suggests lack of
homogeneity of variance. Such a pattern is common with financial data:
there is greater variation at high levels. At this point, the analyst would
think about adjustments to the equation. Given the lack of homogeneity,
the suggestion of curvilinearity and knowing that the dependent measure
is income, an experienced regression user would probably perform a log
transform on beginning salary and rerun the analysis. This is not to
suggest that such an adjustment should occur to you at this stage; the
point is that it is worth looking at residual plots to check the
assumptions, and you may find hints on improving your equation.
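Why a log transform helps with this kind of fan-shaped spread can be seen in a small sketch. It assumes, purely for illustration, that salaries vary by the same percentage (plus or minus 10%) around three hypothetical salary levels; none of these numbers come from the bank data.

```python
import math

# Hypothetical salary levels, each with the same +/-10% relative variation.
levels = [5000, 10000, 20000]
relative_factors = [0.9, 1.0, 1.1]

def spread(values):
    """Range (maximum minus minimum) as a simple measure of spread."""
    return max(values) - min(values)

raw_spreads = []
log_spreads = []
for level in levels:
    salaries = [level * factor for factor in relative_factors]
    raw_spreads.append(spread(salaries))
    log_spreads.append(spread([math.log(s) for s in salaries]))

for level, raw, logged in zip(levels, raw_spreads, log_spreads):
    print(f"level {level:>6}: raw spread {raw:8.1f}, log spread {logged:.4f}")
```

In dollars the spread grows with the salary level, while on the log scale it is the same at every level, which is the homogeneity of variance the residual plot is checking for.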

SUMMARY OF REGRESSION RESULTS

Overall, the regression analysis was successful in that we can predict
about 49% of the variation in beginning salary from education, gender
and age, and we obtained a prediction equation. We found that education
is the best predictor, but that gender played a substantial role.
Examination of residual summaries suggested that a straight line may
not be the best function to fit, and there were several large positive
residuals that should be checked more carefully.

STEPWISE REGRESSION

In the previous regression analysis we provided instructions about which
variables to enter in the model. The analyst may be faced with many
potentially useful independent variables, but without any guidance about
which should be used in the prediction equation. As an aid in such
situations, stepwise regression provides a method of selecting, from a set
of independent variables, those that in some limited sense produce the
best equation. The results of stepwise regression should be used with
caution, especially if many independent variables were used, since with
increasing numbers of variables, there is an increasing chance of false
positives (variables that appear to be significant predictors, but are not so
in the population). As a filtering device to select promising predictors, or
as an equation building method when the analyst has no model in mind,
stepwise regression is a frequently employed technique.
The algorithm first computes the correlations between each
independent variable and the dependent measure, then selects the
variable with the highest correlation as the first variable in the equation
(assuming it is statistically significant), and evaluates the equation. It
then selects the independent variable that has the highest partial
correlation (correlation after adjusting for the variable already in the
equation) with the dependent measure, and if significant, it too is added.
This process continues until there are no variables remaining that have a
significant linear relation to the dependent measure (after adjusting for
those in the equation). The algorithm also has a check after each step to
make sure each variable in the equation still bears a significant linear
relation to the dependent variable (this can change as additional
variables are added to the regression); if not, the variable is dropped from
the equation.
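The entry logic just described can be sketched in a few lines. This is a simplified illustration, not the SPSS algorithm: it uses a fixed cutoff on the absolute partial correlation (the hypothetical parameter min_abs_r) in place of a significance test, and it omits the check that removes variables already entered. The data are artificial and built so the outcome is easy to verify by hand.

```python
import math

def center(v):
    """Subtract the mean so that correlations reduce to dot products."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    """Correlation of two centered vectors (0.0 if either has no variation)."""
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom > 1e-12 else 0.0

def project_out(v, basis):
    """Remove from v the part that is linearly predictable from basis."""
    scale = dot(v, basis) / dot(basis, basis)
    return [a - scale * b for a, b in zip(v, basis)]

def forward_select(y, predictors, min_abs_r=0.3):
    """Enter predictors one at a time by largest absolute partial correlation."""
    y_res = center(y)
    x_res = {name: center(col) for name, col in predictors.items()}
    entered = []
    while x_res:
        # Correlating residuals gives the partial correlation with y,
        # adjusting for every variable already entered.
        partial = {name: corr(y_res, col) for name, col in x_res.items()}
        best = max(partial, key=lambda name: abs(partial[name]))
        if abs(partial[best]) < min_abs_r:
            break  # no remaining candidate is strong enough to enter
        entered.append((best, partial[best]))
        chosen = x_res.pop(best)
        y_res = project_out(y_res, chosen)
        x_res = {name: project_out(col, chosen) for name, col in x_res.items()}
    return entered

# Artificial data: y depends on x1 and x2 plus noise; x3 is unrelated.
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
x3 = [-1, 1, -1, 1, -1, 1, -1, 1]
noise = [a * b * c for a, b, c in zip(x1, x2, x3)]
y = [3 * a + 2 * b + e for a, b, e in zip(x1, x2, noise)]

entered = forward_select(y, {"x1": x1, "x2": x2, "x3": x3})
for name, r in entered:
    print(f"entered {name} (partial correlation {r:.3f})")
```

Here x1 enters first because it has the highest correlation with y, x2 enters next on the strength of its partial correlation, and x3 never enters.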
We will demonstrate stepwise regression using the same set of four
variables as in the previous analysis.
Click the Dialog Recall tool, and click Linear Regression

Select Stepwise from the Method: drop-down menu

Figure 11.15 Requesting a Stepwise Regression

While stepwise is the most commonly used variable selection method,
SPSS also offers a backward selection method in which all independent
variables are entered into the equation, and the one with the weakest
relation to the dependent measure is dropped, and so on.
The command below runs stepwise regression in SPSS.
REGRESSION
  /DEPENDENT salbeg
  /METHOD=STEPWISE edlevel sex work age
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HIST(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).
We retain the residual plots. The only change is that now the Method
is stepwise.
Click OK to run the analysis

STEPWISE REGRESSION RESULTS

Note: Some of the pivot tables appearing below were edited in the Pivot
Table editor so they would fit in this document as figures.
Figure 11.16 Step History and Model Summary

The Variables Entered/Removed pivot table simply describes which
variables are entered or removed at each step of the stepwise analysis.
We see that formal education was entered in model 1 (step 1), followed by
gender (model 2), and age (model 3). Work experience was not included
and there was no need to drop, at a later stage, any variables entered
earlier.
The Model Summary provides fit measures for each stage in the
stepwise regression. Model 1 (education alone) accounted for 40% of the
variation in beginning salary; this is identical to what we found earlier
with a simple regression involving education and beginning salary.
Adding gender (model 2) increases the r-square about 6% (.40 to .46), so
gender accounts for an additional 6% of the variance in beginning salary.
Note there is quite a drop-off between the first and second variable. The
last variable entered (age in model 3) improves the r-square less than 3%.
While there was an unusually dramatic drop-off in the r-square
contribution from the first to the second stage, a substantial drop-off
frequently occurs.
Figure 11.17 ANOVA Summary

The ANOVA pivot table provides significance tests of the overall
model at each stage of the stepwise analysis. Footnotes identify the
variables included in each model. It is no surprise, given that statistical
significance is a criterion for inclusion of a variable in stepwise
regression, that each model shows some significant linear relation(s).
Because of this, the F tests are more of interest when running planned
regression analyses in which you choose the variables to be included.

Figure 11.18 Coefficients Pivot Table

Model 1 contains formal education alone. The summaries for this
model are identical to what we saw in Figure 11.4. Can you explain why
this should occur? Notice that the B coefficient for formal education
changes with the addition of gender (model 2) to the equation, and again
when age is added (model 3). Again, this is because the education
coefficient reflects the expected change in beginning salary corresponding
to a one-unit change in education, holding the other variables in the
model constant. Examining model 3, the final model containing formal
education, gender and age, we see the Beta coefficients indicate that
formal education is more strongly related to beginning salary than gender
(.59 versus -.25). The B coefficients from model 3 (the final model) would
be used to predict beginning salary.
Figure 11.19 Excluded Variables Summary

Each model (step) in the stepwise regression is accompanied by a
summary of the variables not included in the model at that point: the
remaining candidate variables. We see the Beta coefficient each
independent variable would have if it were entered into the equation at
that point. The partial correlation measures the correlation between the
independent variable and the dependent measure (beginning salary) after
statistically controlling for any predictor variables already in the
equation. For model 1 gender has the largest partial correlation
(adjusting for formal education), so it will be entered next. Tolerance
measures the proportion of variation in each independent variable that is
unique, that is, not shared with the other predictors. A tolerance of 1
indicates the predictor is uncorrelated with the other independent
variables. If any tolerance values approach zero, the regression results
may become unstable (see the regression references mentioned earlier in
this chapter). The t statistic tests whether the independent variable
would have a statistically significant linear relation to the dependent
measure if added to the equation at this point. Gender (SEX) would be
significant, so there is no barrier for its inclusion.
If we examine model 2 (formal education and gender included) an
interesting result emerges. Notice that the partial correlations for work
experience and age are very close in magnitude, and if either one were
entered at this point, it would be significant. Age will be selected because
it has a slightly larger partial correlation. You might wonder how work
experience would be significant if added here, when it was not significant
in the original four-variable equation, and was not entered using stepwise
regression. Work experience correlates highly with age (.80; see Figure
10.18), and so if one of them is already in the equation, the second does
not provide much more information.
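The overlap between work experience and age can be expressed as a tolerance value. The sketch below covers only the simplest case, in which one predictor is compared with a single other predictor, so that tolerance is one minus the squared correlation. In the actual models tolerance is computed against all the other predictors, so the value in the SPSS output will differ.

```python
def tolerance_two_predictors(r):
    """Tolerance of one predictor given a single other predictor.

    With only two predictors, the shared variation is the squared correlation.
    """
    return 1 - r ** 2

r_work_age = 0.80  # correlation between work experience and age, from the text
tolerance = tolerance_two_predictors(r_work_age)
print(f"shared variation: {r_work_age ** 2:.2f}")
print(f"tolerance:        {tolerance:.2f}")
```

About 64% of the variation in work experience is shared with age, which leaves only about 36% that is unique and able to add new information to the equation.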

STEPWISE SUMMARY

We see that three variables (formal education, gender and age) were
entered into the stepwise prediction equation. Only work experience was
left out, since its B coefficient would not be statistically significant if
added to the equation. The coefficients from this three-variable model are
close to those we observed with our earlier four-variable model (see
Figure 11.11). The prediction equation for beginning salary would be
(from Figure 11.18): Beginning Salary = $643 * EDLEVEL - $1,614 * SEX
+ $45 * AGE - $2,801. At this point, Regression produces the residual plots
and summaries, which we will not view here.
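To see how close the two sets of coefficients are in practice, the sketch below applies both rounded prediction equations from this chapter to a few hypothetical employees. The function names and the employee profiles are ours; work experience is left out of the four-variable equation because its coefficient was treated as zero.

```python
def four_variable_model(edlevel, sex, age):
    """Rounded equation from the earlier four-variable run (work treated as 0)."""
    return 651 * edlevel - 1526 * sex + 33 * age - 2666

def stepwise_model(edlevel, sex, age):
    """Rounded equation from the final stepwise model (model 3)."""
    return 643 * edlevel - 1614 * sex + 45 * age - 2801

# Hypothetical employees; sex is coded 0 for males and 1 for females.
employees = [
    {"edlevel": 16, "sex": 0, "age": 30},
    {"edlevel": 16, "sex": 1, "age": 30},
    {"edlevel": 12, "sex": 1, "age": 45},
]

for person in employees:
    full = four_variable_model(**person)
    step = stepwise_model(**person)
    print(f"{person}: four-variable {full:,}, stepwise {step:,}, "
          f"difference {step - full:,}")
```

For these profiles the two equations differ by a couple of hundred dollars at most, which is small relative to the standard error of the estimate reported earlier.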
Thus stepwise regression supplies a means of automatically selecting
variables to be used in a regression equation. Despite the ease with which
it runs, the analyst must still inspect the results to ensure that they
make sense (for example, we expect that education is positively related to
salary). Also, let me repeat the reminder that running stepwise with
many variables increases the probability of variables improperly being
included in the equation (false positives).

SUMMARY

In this chapter we discussed simple and multiple regression, and
demonstrated stepwise regression. We reviewed how to interpret the
results of regression, and examined residual plots to detect problems and
see if improvements might be made.


References
INTRODUCTORY STATISTICS BOOKS
Freund, Rudolf J. and Wilson, William J., Statistical Methods, Academic
Press, New York, 1993.
Hays, William L., Statistics, 4th Edition, Harcourt Brace Jovanovich,
New York, 1988.
Norusis, Marija J., SPSS 11.0 Guide to Data Analysis, Prentice-Hall, New
York, 2001.
Wilcox, Rand R., Statistics for the Social Sciences, Academic Press, New
York, 1996.

ADDITIONAL REFERENCES
Agresti, Alan, An Introduction to Categorical Data Analysis, Wiley, New
York, 1996.
Andrews, Frank M., Klem, L., Davidson, T.N., O'Malley, P.M. and
Rodgers, W.L., A Guide for Selecting Statistical Techniques for Analyzing
Social Science Data, Institute for Social Research, University of
Michigan, Ann Arbor, 1981.
Babbie, Earl, Survey Research Methods, Wadsworth, Belmont CA, 1973.
Box, George E. P. and Cox, D.R., An Analysis of Transformations, J.
Royal Statistical Society, Series B, 26, p 211, 1964.
Box, George E. P., Hunter, W.G. and Hunter, J.S., Statistics for
Experimenters, Wiley, New York, 1978.
Bishop, Yvonne M.M., Fienberg, S. and Holland, P.W., Discrete
Multivariate Analysis, MIT Press, Cambridge, MA, 1975.
Brown, Morton B. and Forsythe, A., The Small Sample Behavior of Some
Statistics Which Test the Equality of Several Means, Technometrics, p
129-132, 1974.
Burke, Linda B. and Clark, V.L., Processing Data, Sage Publications,
Newbury Park, CA, 1992.
Cohen, Jacob, Statistical Power Analysis for the Behavioral Sciences,
Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.
Cohen, Jacob and Cohen, P., Applied Multiple Regression/Correlation
Analysis for the Behavioral Sciences (2nd ed.), Lawrence Erlbaum
Associates, Hillsdale, NJ, 1983.

Daniel, Cuthbert and Wood, Fred S., Fitting Equations to Data (2nd ed.),
Wiley, New York, 1980.
Daniel, Wayne W., Applied Nonparametric Statistics, Houghton Mifflin,
Boston, 1978.
Draper, Norman and Smith, Harry, Applied Regression Analysis (3rd
ed.), Wiley, New York, 1998.
Fienberg, Stephen E., The Analysis of Cross-Classified Categorical Data,
MIT Press, Cambridge, MA, 1977.
Gibbons, Jean D., Nonparametric Measures of Association, Sage
Publications, Newbury Park, CA, 1993.
Härdle, Wolfgang, Applied Nonparametric Regression, Cambridge
University Press, Cambridge, 1990.
Hoaglin, David C., Mosteller, F. and Tukey, J.W., Exploring Data Tables,
Trends and Shapes, Wiley, New York, 1985.
Hoaglin, David C., Mosteller, F. and Tukey, J.W., Fundamentals of
Exploratory Analysis of Variance, Wiley, New York, 1991.
Hsu, Jason C., Multiple Comparisons: Theory and Methods, Chapman &
Hall, London, 1996.
Kish, Leslie, Survey Sampling, Wiley, New York, 1965.
Kraemer, H.C. and Thiemann, S., How Many Subjects? Statistical Power
Analysis in Research, Sage Publications, Newbury Park, CA, 1987.
Kirk, Roger E., Experimental Design: Procedures for the Behavioral
Sciences, Brooks/Cole Publishing Co., Belmont, CA, 1968.
Klockars, Alan J. and Sax, G., Multiple Comparisons, Sage Publications,
Newbury Park, CA, 1986.
Milliken, George A. and Johnson, Dallas E., Analysis of Messy Data,
Volume 1: Designed Experiments, Van Nostrand Reinhold, New York,
1984.
Mosteller, Frederick and Tukey, John W., Data Analysis and Regression,
Addison-Wesley, Reading, MA, 1977.
National Opinion Research Center (NORC), General Social Survey,
1972-1991: Cumulative Codebook, National Opinion Research Center, Chicago,
1991.
Rossi, Peter H., Wright, J.D. and Anderson, A.B., Handbook of Survey
Research, Academic Press, New York, 1983.

Searle, Shayle R., Linear Models for Unbalanced Data, Wiley, New York,
1987.
Siegel, S., Nonparametric Statistics, McGraw-Hill, New York, 1956.
Sudman, Seymour, Applied Sampling, Academic Press, New York, 1976.
Toothaker, Larry E., Multiple Comparisons for Researchers, Sage
Publications, Newbury Park, CA, 1991.
Tukey, John W., Exploratory Data Analysis, Addison-Wesley, Reading,
MA, 1977.
Tukey, John W., The Philosophy of Multiple Comparisons, Statistical
Science, v 6, 1, p 100-116, 1991.
Wilcox, Rand R., Introduction to Robust Estimation and Hypothesis
Testing, Academic Press, New York, 1997.
Velleman, Paul F. and Wilkinson, L., Nominal, Ordinal and Ratio
Typologies are Misleading for Classifying Statistical Methodology, The
American Statistician, v 47, p 65-72, 1993.


Exercises

Note on Exercise Data

The exercises use the General Social Survey 1994 data (GSS94.POR)
located in the c:\Train\Stats folder on your training machine. If you are
not working in an SPSS Training center, the training files can be copied
from the floppy disk that accompanies this course guide. If you are
running SPSS Server (click File..Switch Server to check), then you should
copy these files to the server or a machine that can be accessed (mapped)
from the computer running SPSS Server.

Chapter 3

Checking Data
People who have been married should have an age at which they were
first married. Use the Transform..Compute dialog to create a variable
named NOAGEWED. Code it so that for respondents who have been
married (MARITAL 1,2,3,4), code 1 indicates they are missing their age
first married (AGEWED is missing), while 0 indicates they reported an
age first married. Run a frequency table on the NOAGEWED variable.

Chapter 4

Describing Categorical Data


Suppose we are interested in looking for relations between social class
(CLASS) and several attitudes and beliefs: belief in an afterlife
(POSTLIFE), view of government performance in the area of health care
(NATHEAL), and whether life seems exciting or dull (LIFE). In addition,
we wish to determine if smoking behavior (SMOKE) is related to these
same beliefs. First run a frequency analysis on these variables. Look at
the distributions. Do you see any difficulties using these variables in a
cross tabulation analysis? If so, is there an adjustment you can make?
For those with extra time: Choose some variables of interest to you and
run the same analysis. If there is any reason or need to collapse
categories (in anticipation of running later analyses), do so.

Chapter 5

Comparing Groups: Categorical Data


You will run crosstabulations of social class (CLASS) against three
attitude and behavioral measures: belief in an afterlife (POSTLIFE), view
of government performance in the area of health care (NATHEAL), and
whether life seems exciting or dull (LIFE). Then repeat the analysis after
substituting whether the respondent smokes (SMOKE) in place of social
class (CLASS). Before running the analysis, think about the variables
involved in these tables. What relations would you expect to find here and
why? Obtain crosstabs with appropriate percentages and request the
chi-square test of independence. How would you summarize each finding in a
paragraph? Create a bar chart displaying the results of one of your
tables.
For those with extra time: Are there any problems in the analysis with
small cell counts? If so, can you suggest a way to avoid the difficulty? If
you chose additional variables in the previous exercise (Chapter 4), run
them in a crosstab analysis. For a few of your crosstab tables, rerun
requesting appropriate measures of association. Are the results
consistent with your interpretation up to this point? Based on either the
association measures, or percentage differences, would you say the
results have practical (or ecological) significance? Run a three-way
crosstab of social class (CLASS) by gender (SEX) by belief in the afterlife
(POSTLIFE) and interpret the results.

Chapter 6

Exploratory Data Analysis: Interval Scale Data


We will later compare different groups on the average number of children
they have. In anticipation of this, run an exploratory data analysis on
number of children (CHILDS). Also, perform an exploratory analysis on
age (AGE). Review the results. Keep in mind that this is a U.S. adult
sample; do you see anything unusual?
For those with extra time: Number of children (CHILDS) is coded 0
through 8, where 8 indicates eight or more children. Look at the
exploratory output, or run a frequency analysis on CHILDS. Would you
expect the truncation of the CHILDS variable (8 or higher is coded 8) to
have much influence on an analysis?

Chapter 7

Mean Differences Between Groups I: Simple Case


One of the questions (ANOMIA6) in the General Social Survey asks
whether the respondent agrees or disagrees with the statement that it is
not fair to bring a child into the world. It might be interesting to see if
those who agree with this statement actually have fewer children than
those who disagree. First run an exploratory data analysis on number of
children (CHILDS) for the two groups (ANOMIA6). Do you see any
problems with the data relevant to testing for mean differences? Perform
a t-test looking at mean differences in number of children for the two
belief groups (ANOMIA6). Produce a plot of the results.
For those with extra time: Can you think of any other factors that might
relate to both the anomie question and the number of children? If so, is
there a way of adjusting the analysis to account for the factor(s)? If time
permits, try to implement your method.
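
To see what the t-test is computing, the equal-variance (pooled) version
of the statistic can be sketched in Python (standard library only; not
SPSS syntax). The two lists of child counts are invented for
illustration, and the names agree, disagree and pooled_t are assumptions
of this sketch. No p-value is computed here; it would come from the t
distribution with the stated degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical numbers of children for two belief groups (invented data).
agree = [0, 1, 2, 2, 0]
disagree = [2, 3, 4, 4, 2]


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for the equal-variance t-test."""
    n1, n2 = len(a), len(b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


t_stat, t_df = pooled_t(agree, disagree)
print(f"mean difference = {mean(agree) - mean(disagree):.2f}, t = {t_stat:.3f}, df = {t_df}")
```

A negative t here simply means the first group has the smaller mean. If
the exploratory analysis shows very different spreads in the two groups,
the pooled variance is a questionable summary and an unequal-variance
test would be the safer choice.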

Chapter 8

Mean Differences Between Groups II: One-Factor ANOVA


We wish to explore regional differences in the average number of
children. As before, first run an exploratory data analysis on number of
children (CHILDS) grouped by region (REGION). Perform a one-factor
ANOVA, testing for regional differences in number of children. If
differences are found, perform post hoc tests to explore these differences
in more detail. How would you summarize the results? Create a summary
plot displaying the means (error bar). Are the differences large enough to
be of practical importance?
For those with extra time: Look at the box & whisker plot of number of
children by region. The distribution of number of children in each group
is clearly not normal. Why might this not be a major problem in the
analysis? This analysis looks only at region. If there were another factor
(say race) that is related to the number of children, how might it
influence this analysis? How might you adjust for it?
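
The one-factor ANOVA compares the variation between group means with the
variation within groups. A minimal Python sketch of the F ratio
(standard library only; not SPSS syntax) is given here. The three groups
and their values are invented for illustration and do not correspond to
the REGION categories; the function name one_way_f is an assumption of
this sketch.

```python
from statistics import mean

# Hypothetical numbers of children in three groups (invented data).
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [4, 5, 6],
}


def one_way_f(samples):
    """Return (F, df_between, df_within) for a one-factor ANOVA."""
    all_values = [x for values in samples for x in values]
    grand_mean = mean(all_values)
    # Between-groups sum of squares: group size times squared distance
    # of the group mean from the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in samples)
    # Within-groups sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in samples for x in v)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_f(list(groups.values()))
print(f"F({df_between}, {df_within}) = {f_ratio:.3f}")
```

A large F says only that at least one group mean differs from the
others; the post hoc tests are what identify which groups differ.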

Chapter 9

Mean Differences Between Groups III: Two-Factor ANOVA


Let's do a broader analysis comparing differences in average number of
children (CHILDS) between men and women (SEX) in different regions
(REGION). You might begin with an exploratory data analysis, then
proceed to the two-factor analysis of variance. In viewing the results, pay
particular attention to the test of significance for the interaction
term. How would you interpret this result? Construct a plot summarizing
the results.
For those with extra time: The analysis suggests that women have a
greater number of children than men. Can you suggest reasons for this
seemingly odd result?

Chapter 10

Bivariate Plots and Statistics


Suppose you are interested in looking at the relation between the
education of an individual and the education of his/her spouse and
parents. First, what would you expect to find in terms of direction and
relative strength of the relationships? Request a scatterplot of education
of the respondent (EDUC) and that of his/her spouse (SPEDUC). How
would you characterize the relationship? Rerun the plot with social class
(CLASS) as a marker variable. Discuss the pattern you observe. Request
correlations among the four education measures (respondent - EDUC,
spouse - SPEDUC, mother - MAEDUC, father - PAEDUC). Briefly
summarize your findings.
For those with extra time: Look at the number of cases each correlation is
based on. Why are they so disparate and can you think of any potential
difficulties this might cause? Can you think of any adjustment to the
analysis that might address these problems? Secondly, suppose you are
asked to look at the correlations between sons and their fathers, and
daughters with their mothers. How would you investigate this issue in
the current data set? Apply your approach and describe the results.
For those with more extra time: Open the SPSS portable file, Bank.por,
used in the chapter. Create scatterplots of job seniority (time) by work
experience (work) and work experience (work) by age. Which, if either,
seems to have a strong relationship? Check your visual interpretation by
running bivariate correlations on the set of variables.
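
The extra-time question about the number of cases behind each
correlation can be illustrated with a small Python sketch (standard
library only; not SPSS syntax). It computes Pearson correlations using
only the cases that have both values present, so each coefficient can
rest on a different N. The data are invented, None stands in for a
missing value, and the function name pearson_pairwise is an assumption
of this sketch.

```python
from math import sqrt

# Hypothetical years of education; None marks a missing value (invented data).
educ = [12, 16, 14, 10, 18, 12]
speduc = [12, 14, None, None, 16, 12]   # missing when there is no spouse
paeduc = [8, 12, 12, 8, None, 10]       # missing when father's education is unknown


def pearson_pairwise(xs, ys):
    """Return (Pearson r, number of cases) using cases with both values present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy), n


r_spouse, n_spouse = pearson_pairwise(educ, speduc)
r_father, n_father = pearson_pairwise(educ, paeduc)
print(f"EDUC with SPEDUC: r = {r_spouse:.3f} (N = {n_spouse})")
print(f"EDUC with PAEDUC: r = {r_father:.3f} (N = {n_father})")
```

Because the two coefficients are based on different subsets of cases,
they are not strictly comparable. Restricting the analysis to cases with
complete data on all variables is one possible adjustment, at the cost
of a smaller sample.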

Chapter 11

Introduction to Regression


We know there is a relationship between the education of the
respondent's parents and the education of the respondent. Develop a
regression equation predicting respondent's education (EDUC) from the
education of one parent (you choose between MAEDUC and PAEDUC).
Does the equation, in your opinion, adequately account for the variation
in respondent's education? If the parent had 16 years of education, what
is your prediction for the education of the respondent? Add the remaining
parent's education as a second predictor to the equation. Does this
substantially improve the prediction? Does there seem to be much
difference between the two predictors?
For those with extra time: Compare the coefficients from the one- and
two-variable prediction equations and note the very small difference in
r-square. Looking at the two analyses, can you explain the change in the
regression coefficient of the variable (MAEDUC or PAEDUC) present in
both equations (hint: look at the coefficient of the other predictor
variable)?
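
The one-predictor part of this exercise (fit the line, judge the fit,
predict for a parent with 16 years of education) can be sketched in
Python (standard library only; not SPSS syntax). The five data points
are invented for illustration, so the coefficients say nothing about the
General Social Survey; the names fit_line and predict are assumptions of
this sketch.

```python
# Hypothetical years of education for one parent and the respondent (invented data).
parent_educ = [8, 10, 12, 14, 16]
resp_educ = [10, 12, 13, 13, 16]


def fit_line(xs, ys):
    """Return (intercept, slope, r_square) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, r-square is the squared Pearson correlation.
    r_square = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_square


intercept, slope, r_square = fit_line(parent_educ, resp_educ)


def predict(x):
    """Predicted respondent education for a parent with x years of education."""
    return intercept + slope * x


print(f"EDUC = {intercept:.2f} + {slope:.2f} * parent education, r-square = {r_square:.3f}")
print(f"Prediction for 16 years: {predict(16):.1f}")
```

The slope is the predicted change in respondent's education for each
additional year of the parent's education. With two correlated
predictors in the equation, each coefficient describes the effect of one
parent holding the other constant, which is why a coefficient can change
when the second parent is added.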
