Introduction to the Fundamentals of Statistical Data Analysis
SPSS (Schweiz) AG, Schneckenmannstr. 25, 8044 Zürich, Phone 01 266 90 30,
Fax 01 266 90 39
info@spss.ch
www.spss.ch
www.spss.com
SPSS DecisionTime, SPSS Clementine, SPSS Neural Connection, SPSS QI Analyst, SPSS for Windows, SPSS Data Entry, SPSS-X, SCSS, SPSS/PC, SPSS/PC+, SPSS Categories, SPSS Graphics, SPSS Regression Models, SPSS Advanced Models, SPSS Tables, SPSS Trends, SPSS Exact Tests, SPSS Missing Values, SPSS Maps, SPSS AnswerTree, SPSS Report Writer and SPSS TextSmart are registered trademarks of SPSS Inc. CHAID for Windows is the registered trademark of SPSS Inc. and Statistical Innovations Inc.
Material mentioning the names of the software listed above may be neither reproduced nor redistributed without the written consent of the owner of the registered trademarks, of the license rights to the software, and of the copyright in the published products.
General note: Other product names mentioned are used for identification purposes only and may be registered trademarks of other companies.
Copyright 2002 by SPSS Inc. and SPSS (Schweiz) AG.
All rights reserved.
Printed in Switzerland.
This printed work may not be copied, stored electronically, or passed on without the written consent of the authors.
SPSS Training
STATISTICAL ANALYSIS
USING SPSS
TABLE OF CONTENTS

Chapter 1   Introduction
    Samples and the Population
    Level of Measurement
    A Special Case: Rating Scales
    Independent and Dependent Variables
    Data Access
    A Note about Variable Names and Labels in Dialog Boxes
    Summary

Chapter 2

Chapter 3   Data Checking
    Viewing a Few Cases
    Minimum, Maximum and Number of Valid Cases
    Identifying Inconsistent Responses
    When Errors are Discovered
    SPSS Missing Values Option
    Summary

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11  Introduction to Regression
    Introduction and Basic Concepts
    The Regression Equation and Fit Measure
    Residuals and Outliers
    Assumptions
    Simple Regression
    Multiple Regression
    Residual Plots
    Multiple Regression Results
    Residuals and Outliers
    Summary of Regression Results
    Stepwise Regression
    Stepwise Regression Results
    Stepwise Summary
    Summary

References

Exercises
SPSS Training
Chapter 1 Introduction

Objective
Describe the goals and method of the course; review a few important statistical terms and concepts; provide a framework for choosing a statistical procedure within SPSS; briefly discuss some analyses we will perform in the course to provide a research frame of reference.

Method
Discussion

Data
INTRODUCTION
SPSS is an easy-to-use yet powerful tool for data analysis. In this course we will cover a number of statistical procedures that SPSS performs. This is an application-oriented course and the approach will be practical; we will discuss the situations in which you would use each technique, the assumptions made by the method, how to set up the analysis using SPSS, and the interpretation of the results. We will not derive proofs, but rather focus on the practical matters of data analysis in support of answering research questions. For example, we will discuss what correlation coefficients are, when to use them, and how to produce and interpret them, but will not formally derive their properties. This course is not a substitute for a course in statistics. It presupposes that you have had such a course in the past and wish to apply statistical methods to data using SPSS.
This course assumes you have a working knowledge of SPSS in your
computing environment. Thus the basic use of menu systems, data
definition and labeling will not be considered in any detail. The actual
steps you take to request an analysis within SPSS differ across
computing environments: pull-down menus for Microsoft Windows,
Macintosh and UNIX; syntax commands in batch-oriented mainframe
environments. The analyses in this course will show the relevant dialog
boxes and the SPSS syntax commands for those who prefer to use syntax.
In addition, the locations of the menu choices or dialog boxes within the
overall menu system are cited in the text. The dialog box selections will
be detailed and the resulting dialog box and syntax command shown.
Scenario
SAMPLES AND THE POPULATION
for example Sudman (1976) or Rossi, Wright and Anderson (1983),
reviews these issues in detail. To state it in a simple way, statistical
inference provides a method of drawing conclusions about a population of
interest based on sample results.
LEVEL OF MEASUREMENT
A SPECIAL CASE: RATING SCALES

INDEPENDENT AND DEPENDENT VARIABLES
and years of education, based on survey data, then develop an equation
predicting starting salary from years of education. Here starting salary
would be considered the dependent variable although no experimental
manipulation of education has been performed.
Correspondingly, independent variables are those used to measure
features manipulated by the experimenter in an experiment. In a nonexperimental study, they represent variables believed to influence or
predict a dependent measure. In summary, the dependent variable is
believed to be influenced by, or be predicted by, the independent
variable(s).
Finally, in some studies, or parts of studies, the emphasis is on
exploring or characterizing relationships among variables with no causal
view or focus on prediction. In such situations there is no designation of
dependent and independent. For example, in crosstabulation tables and
correlation matrices the distinction between dependent and independent
variables is not necessary. It rather resides in the eye, or worldview, of
the beholder (researcher).
The table below suggests which statistical techniques are most
appropriate, based on the measurement level of the variables. Much more
extensive diagrams and discussion are found in Andrews et al. (1981).
Recall that ratio variables can be considered as interval scale for analysis
purposes. If in doubt about the measurement properties of your variables,
you can apply a statistical technique that assumes weaker measurement
properties and compare the results to methods making stronger
assumptions. A consistent answer provides greater confidence in the
conclusions.
Figure 1.1 Statistical Methods and Level of Measurement
DATA ACCESS
Data taken from the General Social Survey 1994 are used in Chapters 1
through 9. The General Social Survey contains several hundred
demographic, attitudinal and behavioral questions. The data are stored
in an SPSS portable file named Gss94.por: a text file containing data,
labels, and missing values. A portable file can be read by SPSS on any
type of computer supporting SPSS (for example PC, Macintosh, and
UNIX).
Note on Course Data Files
All files for this class are located in the c:\Train\Stats folder on your
training machine. If you are not working in an SPSS Training center, the
training files can be copied from the floppy disk or CD that accompanies
this guide. If you are running SPSS Server (click File..Switch Server to check), then you should copy these files to the server or to a machine that can be accessed (mapped) from the computer running SPSS Server.
A Note about Variable Names and Labels in Dialog Boxes
SPSS can display either variable names or variable labels in dialog boxes.
In this course we display the variable names in alphabetical order. In
order to match the dialog boxes shown here, from within SPSS:
Click Edit..Options
Within the General tab sheet of the Options dialog box:
Click the Display names option button
Click the Alphabetical option button
Click OK, then click OK to confirm
Click File..Open..Data
Switch to the c:\Train\Stats folder
Select SPSS Portable (*.por) from the Files of Type: drop-down
list
Double-click on Gss94.por
Those using SPSS syntax commands can read a portable file with the
IMPORT command shown below.
IMPORT FILE='C:\Train\Stats\Gss94.por'.
The IMPORT command reads an SPSS portable file (it is called
import because such files usually come from SPSS on a different type of
computer and the data thus imported from another machine type). Once
the data file is imported, you can immediately proceed to the analysis. We
see the data below.
Figure 1.2 Data After Importing
The data and labels are now available for manipulation and analysis.
We demonstrated here how to read the data file. Since our emphasis is on
statistical analysis, each of the remaining chapters will assume this step
has already been performed.
SUMMARY
Objective

Method
Display a series of analyses in which only the sample size varies and see which outcome measures change. Discuss scenarios in which statistical significance and practical importance do not coincide.

Data
Data files showing the same survey percentages based on samples of 100, 400 and 1,600. A data file containing 10,000 observations drawn from a normal population with mean 70 and standard deviation of 10.
INTRODUCTION
PRECISION OF PERCENTAGES

Note
SAMPLE SIZE OF 100

SAMPLE SIZE OF 400
Now we view a table with percentages identical to the previous one, but
based on a sample of 400 people, four times as large as before.
Figure 2.2 Crosstabulation Table with Sample of 400
SAMPLE SIZE OF 1,600
Finally we present the same table of percentages, but increase the sample
size to 1,600; the increase is once again by a factor of four.
Figure 2.3 Crosstabulation Table with Sample of 1,600
The percentages are identical to the previous tables and so the gender
difference remains at 8%. The chi-square value (10.24) is four times that
of the previous table and sixteen times that of the first table. Notice that
the significance value is quite small (.001) indicating a statistically
significant difference between men and women. With a sample as large as
1,600 it is very unlikely (.001 or 1 chance in 1000) that we would observe
a difference of 8 or more percentage points between men and women if
they did not differ in the population.
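The pattern in these chi-square values is no accident. For a two-by-two table, the Pearson chi-square can be written as the sample size times the squared phi coefficient, and phi depends only on the table's percentages. Holding all percentages fixed while multiplying the sample size by four therefore multiplies chi-square by four (here 0.64, 2.56 and 10.24):

```latex
\chi^{2} \;=\; N\,\phi^{2},
\qquad \phi \text{ unchanged when all percentages are held fixed.}
```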
Thus the 8% sample difference between two groups is highly
significant if the sample is 1,600, but not significant (testing at .05 level)
with a sample of 100. This is because the precision with which we
measure the percents increases with the sample size, and as our
measurement grows more precise the 8% sample difference looms large.
This relationship is quantified in the next section.
SAMPLE SIZE AND PRECISION

Sample Size     Precision                  Value
100             1/sqrt(100) = 1/10         .1 or 10%
400             1/sqrt(400) = 1/20         .05 or 5%
1,600           1/sqrt(1,600) = 1/40       .025 or 2.5%
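The precision values in the table correspond to the half-width of an approximate 95% confidence interval for a percentage in the worst case, P = .5 (the appendix to this chapter discusses this formula). As a sketch:

```latex
\text{precision} \;\approx\; 2\sqrt{\frac{P(1-P)}{n}}
\;\le\; 2\sqrt{\frac{(.5)(.5)}{n}} \;=\; \frac{1}{\sqrt{n}}
```

For n = 100 this gives .1, for n = 400 it gives .05, and for n = 1,600 it gives .025, reproducing the table.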
PRECISION OF MEANS
The same basic relation, that precision increases with the square root of
the sample size, applies to sample means as well. To illustrate this we
display histograms based on different samples from a normally
distributed population with mean 70 and standard deviation 10. We first
view a histogram based on a sample of 10,000 individual observations.
Next we will view a histogram of 1,000 sample means where each mean is
composed of 10 observations. The third histogram is composed of 100
sample means, but here each mean is based on 100 observations. We will
focus our attention on how the standard deviation changes when sample
means are the units of observation. To aid such comparisons the scale is
kept constant across histograms.
A LARGE SAMPLE OF INDIVIDUALS
We see that a sample of this size closely matches its population. The
sample mean is very close to 70, the sample standard deviation is near
10, and the shape of the distribution is normal.
MEANS BASED ON SAMPLES OF 10
The second histogram displays 1,000 sample means drawn from the same
population (mean 70, standard deviation 10). Here each observation is a
mean based on 10 data points. In other words we pick samples of ten each
and plot their means in the histogram below.
Figure 2.5 Histogram of Means Based on Samples of 10
The overall average of the sample means is about 70, while the
standard deviation of the sample means is reduced to 3.11. Comparing
the two histograms we see there is less variation (standard deviation of
3.11 versus 10) among means based on groups of observations than
among the observations themselves. Recall the rule of thumb that
precision is a function of the square root of the sample size. If the
population standard deviation were 10, we would expect the standard
deviation of means based on samples of 10 to be the population figure
reduced by a factor of 1/square root (N) or 1/square root (10), or .316. If
we multiply this factor (.316) by the population standard deviation (10),
the theoretical value we get (3.16) is very close to what we observe in our
sample (3.11). Thus by increasing the sample size by a factor of ten (from
single observations to means of ten observations each) we reduce the
imprecision (increase the precision) by the factor 1/square root (10). The
shape of the distribution remains normal.
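The calculation just performed is an instance of the standard error of the mean: the standard deviation of sample means equals the population standard deviation divided by the square root of the sample size.

```latex
\sigma_{\bar{x}} \;=\; \frac{\sigma}{\sqrt{n}} \;=\; \frac{10}{\sqrt{10}} \;\approx\; 3.16
```

This theoretical value (3.16) is the figure compared above to the observed standard deviation of 3.11.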
MEANS BASED ON SAMPLES OF 100
STATISTICAL POWER ANALYSIS

TYPES OF STATISTICAL ERRORS
STATISTICAL SIGNIFICANCE AND PRACTICAL IMPORTANCE

SUMMARY

APPENDIX: PRECISION OF PERCENTAGE ESTIMATES
precision would be obtained when P departs from .5.
It is important to note that this calculation assumes the population
size is infinite, or as an approximation, much larger than the sample.
Formulations that take finite population values into account can be found
in Kish (1965) and other texts discussing sampling. When applied to
survey data, the calculation also assumes that the survey was carried out
in a methodologically sound manner. Otherwise, the validity of the
sample proportion itself is called into question.
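For reference, one common finite-population formulation of the kind discussed in Kish (1965) multiplies the infinite-population standard error by a finite population correction, where N is the population size and n the sample size (a standard textbook form, not a formula taken from this appendix):

```latex
SE(p) \;=\; \sqrt{\frac{P(1-P)}{n}\cdot\frac{N-n}{N-1}}
```

As N grows large relative to n, the correction factor approaches 1 and the infinite-population formula is recovered.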
Objective

Method
Use the Data Editor in SPSS for Windows and the Case Summaries procedure to view the data; check for out-of-range values using the Descriptives procedure; apply If statements to test for consistency across questions.

Data

INTRODUCTION
VIEWING A FEW CASES
Often the first step in checking data previously entered on the computer
is to view the first few observations and compare their data values to the
original data sheets or survey forms. This will detect many gross errors of
data definition (incorrect columns specified for an ASCII text file, reading
alpha characters as numeric data fields). Viewing the first few cases can
be easily accomplished using the Data Editor Window in SPSS or the
Case Summaries procedure. Below we view part of the 1994 General
Social Survey data in SPSS.
Click File..Open..Data (move to the C:\Train\Stats folder if necessary)
Select SPSS Portable (*.por) from the Files of Type drop-down
list
Click GSS94.POR and click Open
Figure 3.1 General Social Survey 1994 Data in SPSS for Windows
The first few responses can be compared to the original data sheets or
surveys as a preliminary test of data entry. If errors are found,
corrections can be made directly within the Data Editor window.
The Case Summaries procedure can list values of individual cases for selected variables. For example, we can display values for several questions related to education: educ (respondent's education), speduc (spouse's education), maeduc (mother's education), and paeduc (father's education).
Click Analyze..Reports..Case Summaries
Move educ, speduc, maeduc and paeduc into the Variables
list box.
Type 10 into the Limit cases to first text box
Figure 3.2 Case Summaries Dialog Box
Click OK
Note we limit the listing to the first ten cases (the default is 100). The
Case Summaries procedure can also display group summaries.
The Summarize syntax command below instructs SPSS to list the
four education variables for the first ten cases. In addition a title is
provided for the pivot table containing the case listing and counts are
requested as summary statistics.
SUMMARIZE
/TABLES=educ speduc maeduc paeduc
/FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=10
/TITLE='Case Summaries'
/MISSING=VARIABLE
/CELLS=COUNT .
Below we see the requested variables for the first ten observations.
Figure 3.3 Case Summary List of First Ten Cases
By default, SPSS will display value labels in case listings; this can be
modified within the SPSS Options dialog box (click Edit..Options, then
move to the Output Labels tab). Please note that the high incidence of
NAP (not applicable) for some variables (see father's education) is probably due to the fact that few questions were asked of all respondents in the General Social Survey 1994. Ordinarily, this much missing data would be of concern.
MINIMUM, MAXIMUM, AND NUMBER OF VALID CASES
A second simple data check that can be done within SPSS is to request
descriptive statistics on all numeric variables. By default, the
Descriptives procedure will report the mean, standard deviation,
minimum, maximum and number of valid cases for each numeric
variable. While the mean and standard deviation are not relevant for
nominal variables (see Chapter 1), the minimum and maximum values
will signal any out-of-range data values. In addition, if the number of
valid observations is suspiciously small for a variable, it should be
explored carefully. Since Descriptives provides only summary statistics, it
will not indicate which observation contains an out-of-range value, but
that can be easily determined once the data value is known. The Case
Summaries procedure can also be used for this purpose.
The SPSS Descriptives syntax command below will request
summaries for all variables (although summaries will print only for
numeric variables).
DESCRIPTIVES /VARIABLES ALL
/STATISTICS=MEAN STDDEV MIN MAX.
We request the same analysis in SPSS by choosing
Analyze..Descriptive Statistics.. Descriptives and selecting all variables
in the Descriptives dialog box (shown below).
Click Analyze..Descriptive Statistics..Descriptives
Move all variables into the Variable(s) list box
Figure 3.4 Descriptives Dialog Box
Only numeric variables will appear in the list box. Running the
Descriptives syntax command or clicking the OK button in SPSS will
lead to the summaries shown below.
Click OK
Figure 3.5 Descriptives Output (Beginning)
We can see the minimum, maximum and number of valid cases for
each variable in the data set. By examining such variables as EDUC
(highest year of school completed), TVHOURS (hours per day watching
TV) and AGE (age of respondent) we can determine if there are any out-of-range values. The maximum for TVHOURS looks rather high (24). As
an exercise, examine the value labels (click Utilities..Variables) for a few
of the variables and discover the valid range of values. Compare these
ranges to the results in the figure.
The valid number of observations (Valid N) is listed for each variable.
The number of valid observations listwise indicates how many
observations have complete data for all variables, a useful bit of
information. Here it is zero because not all GSS questions are asked of,
nor are relevant to, any single individual. If odd values are discovered in
these summaries we can locate the problem observations with data
selection statements or the Find function (under Edit menu) in the Data
Editor window.
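As a sketch of such a selection (the variable names are from this data set, but the cutoff of 20 hours and the use of LIST are our illustration, not a step from the course exercises), a temporary selection can display the suspect cases:

```spss
* Temporarily select and list suspect cases; TEMPORARY limits
* the SELECT IF to the next procedure only.
TEMPORARY.
SELECT IF (TVHOURS > 20).
LIST VARIABLES=TVHOURS AGE.
```

Because the selection is temporary, the full data set remains available for subsequent procedures.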
IDENTIFYING
INCONSISTENT
RESPONSES
In most data sets certain relations must hold among variables if the data
are recorded properly. This is especially true with surveys containing
filter questions or skip patterns. Some examples from the GSS are: if a
respondent has never been married then his/her age when first married
should have a missing code; age when first married should not be greater
than current age. Such relations cannot be easily checked by scanning the
data or in single variable summaries such as frequency tables. However,
these relations can be examined by using SPSS data transformation
instructions to identify cases violating the expected patterns.
We will demonstrate how to test for such relations among variables
using the two examples mentioned in the paragraph above. First if never
married (MARITAL = 5) then age when first married (AGEWED) should
be coded as a missing value. Secondly, age when first married (AGEWED)
should be less than or equal to current age (AGE). The basic approach is
to create a new variable that will be set to a specific number (say 1) if the
expected relation does not hold.
From the Data Editor window,
Click Transform..Compute
Click If... to transfer to the Compute Variable: If Cases dialog box
Click Include if case satisfies condition option button
In the text box of the Compute If dialog box we indicate the condition
we want identified: never married (Marital=5) and having a valid age
when first married ( ~ MISSING(AGEWED) - the tilde (~) means NOT).
Enter (type or build) the expression
marital=5 & ~MISSING(agewed) into the text box
Figure 3.7 Defining the Error Condition
Click Continue
If this condition is met, a new variable (ERRMARIT) will be set equal
to one, as shown in the Compute dialog box below.
Type errmarit in the Target Variable box
Type 1 in the Numeric Expression: box
Figure 3.8 Setting the Error Indicator Variable to 1
Click OK
We see that if the error condition holds then the new variable ERRMARIT will be set equal to 1. A frequency analysis can be run to determine if any cases have ERRMARIT equal to 1. The problem cases can be selected and listed, or located with the Find function in the Data Editor window.
The same operation can be applied using an IF syntax command. The first If statement performs the same function as the dialog boxes just viewed, setting ERRMARIT to 1 if a respondent was never married yet reports a valid age when first married. The second If assigns 1 to ERRAGE if age when first married exceeds the respondent's current age.
IF (MARITAL = 5 AND NOT (MISSING(AGEWED)))
ERRMARIT=1.
IF (AGEWED > AGE) ERRAGE=1.
A frequency analysis is then run to see if these errors occurred.
Click Analyze..Descriptive Statistics..Frequencies
Move errmarit and errage (if calculated) into the Variable(s) list
box
Click OK
The Frequencies syntax is shown below.
FREQUENCIES /VARIABLES ERRMARIT ERRAGE .
Figure 3.9 Frequency Tables of Error Variables
Since the data from the General Social Survey are very carefully
checked it is not surprising that no discrepancies (error values equal to 1)
were found.
WHEN DATA ERRORS ARE DISCOVERED
If errors are found the first step is to return to the original survey or data
form. Simple clerical errors are merely corrected. In some instances
errors on the part of respondents can be corrected based on their answers
to other questions. If neither of these approaches is possible the offending
items can be coded as missing responses and will be excluded from SPSS
analyses. While beyond the scope of this course, there are techniques that
substitute values for missing responses in survey work. For a discussion
of such methods see Burke and Clark (1992) or Babbie (1973). Also note
the SPSS Missing Values option can perform this function.
Having cleaned the data we can now move to the more interesting
part of the process, data analysis.
SPSS MISSING VALUES OPTION
The SPSS Missing Values option will produce various reports describing the frequency and pattern of missing data. It also provides methods for estimating (imputing) values for missing data.

SUMMARY
Objective
Method
Data
Scenario
INTRODUCTION
FREQUENCY TABLES
After placing the desired variables in the list box, we use the Charts
button and request bar charts based on percentages (see figure below).
Click the Charts pushbutton
Click the Bar charts option button in the Chart Type box.
Click the Percentages option button in the Chart Values box
Figure 4.2 Frequencies: Charts Dialog Box
Click Continue
Click the Format pushbutton
Click Organize output by variables in the Multiple
Variables box (not shown)
Click Continue
Click OK
To request this analysis with command syntax, use the Frequencies
command below:
FREQUENCIES
/VARIABLES=marital attend degree gunlaw owngun hapmar
/BARCHART PERCENT
/ORDER VARIABLES.
We now examine the tables and charts looking for anything
interesting or unusual.
FREQUENCIES OUTPUT
By default, value labels appear in the first column and, if labels were
not supplied, the data values display. Tables involving nominal and
ordinal variables usually benefit from the inclusion of value labels.
Without value labels we wouldn't be able to tell from the output which
number stands for which marital status category. The Frequency column
contains counts or the number of occurrences of each data value. The
Percent column shows the percentage of cases in each category relative to
the number of cases in the entire data set, including those with missing
values. Cases with missing values for marital status would be excluded
from the Valid Percent calculation. Thus the Valid Percent column
contains the percentage of cases in each category relative to the number
of valid (non-missing) cases. Cumulative percentage, the percentage of
cases whose values are less than or equal to the indicated value, appears
in the cumulative percent column. With only one case containing a
missing value, the percent and valid percent columns are
indistinguishable. Note we can edit the frequencies pivot table to display
the percentages to greater precision.
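In other words, the two percentage columns share the same category counts but use different bases:

```latex
\text{Percent}_i \;=\; 100\cdot\frac{n_i}{N_{\text{total}}},
\qquad
\text{Valid Percent}_i \;=\; 100\cdot\frac{n_i}{N_{\text{valid}}}
```

where n_i is the count in category i, N_total includes cases with missing values, and N_valid excludes them.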
Examine the table. Note the disparate category sizes. What are some
meaningful ways in which you might combine or compare categories?
Figure 4.4 Bar Chart of Marital Status
The disparities among the marital status categories are brought into
focus by the bar chart. Notice the small proportion of individuals
separated from their spouse.
We next turn to attendance at religious services.
Figure 4.5 Frequency of Attendance at Religious Services
How would you summarize the information in this table? If you
wished to reduce the number of categories, which would you collapse?
Decisions about collapsing categories usually have to do with which
groups need be kept distinct in order to answer the research question
asked, and the sample sizes for the groups. Below we view a bar chart
based on the church attendance variable.
Figure 4.6 Bar Chart of Attendance at Religious Services
Figure 4.7 Frequency Table of Educational Degree
Figure 4.9 Frequency Table of Attitude Toward Gun Permits
asked of the respondent. This could be because it is not relevant to the
individual or because not all questions are asked of all individuals in the
sample. A second missing code (DK) represents a response of Don't Know. The third missing code, NA, indicates no answer is recorded (No Answer), probably because of a refusal, but possibly because the question wasn't asked. These three different missing codes are used to provide information about why there isn't a valid response to the question. These
codes are excluded from consideration in the Valid Percent column of
the frequency table, as well as from the bar chart, and would also be
ignored if any additional statistics were requested.
Figure 4.11 Frequency Table of Gun Ownership
Respondents are more evenly divided in the question asking about
the presence of a gun in the home. Are respondents as likely to have a
gun at home as not? If there is a need to perform a statistical significance
test on this question, the NPAR TEST procedure within SPSS can do so
(using a chi-square test). Recalling the earlier question regarding gun
permits it might be interesting to look at the relationship between gun
ownership and attitude toward gun permits; we might ask to what extent
is gun ownership related to whether one favors gun permits? The
frequency tables we are viewing display each variable independently. To
investigate the relationship between two categorical (nominal) variables
we will turn to crosstabulation tables in Chapter 5.
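The NPAR TEST chi-square mentioned above could be requested with syntax along the following lines. This is a sketch rather than a step from the course exercises; the /EXPECTED=EQUAL subcommand tests whether the categories of the variable are equally likely:

```spss
* One-sample chi-square test of whether gun ownership is a 50/50 split.
NPAR TESTS
  /CHISQUARE=owngun
  /EXPECTED=EQUAL.
```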
A note about the missing value codes in the table above: Refused implies that the respondent refused to answer the question, NA (no answer) means no answer was recorded, probably because the question was inadvertently skipped, and NAP (not applicable) was coded if the question was not asked (recall that not every question is asked of every respondent in the General Social Survey).
Figure 4.13 Frequency Table - Happiness of Marriage
Figure 4.14 Bar Chart - Happiness of Marriage
STANDARDIZING THE CHART AXIS
If we glance back over the last few bar charts we notice that the scale
axis, which displays percents, varies across charts. This is because the
maximum value displayed in each bar chart depends on the percentage of
respondents in the most popular category. Such scaling permits better
use of the space within each bar chart but makes comparison across
charts more difficult. Percentaging is itself a form of standardization, and
bar charts displaying percentages as the scale axis were requested in our
analyses. Charts can be further normed by forcing the scale axis (the axis
showing the percents) in each chart to have the same maximum value.
This facilitates comparisons across charts, but can make the details of
individual charts more difficult to see. We will illustrate this by reviewing
three of the previous bar charts and requesting that the maximum scale
value is set to 100 (100%). We accomplish this by editing each chart
individually.
To force the scale axis maximum to 100%
Double click the chart to open the SPSS Chart Editor
Select Chart..Axis, then Scale
Set the Maximum value to 100
Click OK
Rotate the chart 90 degrees by clicking on the rotate tool
Select File..Close to exit from the Chart Editor
If we apply this rescaling to all three variables: attitude toward gun
permits (GUNLAW), having a gun at home (OWNGUN) and frequency of
church attendance (ATTEND), we obtain the results below.
Figure 4.15 Percentage Bar Chart for Gun Permits Question
Note that the horizontal axes of the bar charts are now in comparable
units so we can make direct percentage comparisons based on the bar
length. This is the advantage of the percentage standardization.
However, note the result when we apply the same technique to the
frequency of church attendance variable.
Figure 4.17 Percentage Bar Chart for Church Attendance
The percentage bar chart of church attendance has the same general
shape as the one shown previously. The horizontal axis is scaled 0 to 100,
and church attendance has eight categories, so the bars are shrunken
down compared to the earlier plot of church attendance. As a result some
detail is lost. We can rescale this chart by setting the maximum below
100%, but would lose the ability to directly compare bar length across the
series of charts. Thus the advantage of standardizing the percentage
scale must be traded off against potential loss of detail. In practice it is
usually quite easy to decide which approach is better.
Hint
Click the Options tab
Select Percent from the Variable: pull-down menu in the Scale
Range box
Uncheck the Auto check box in the Scale Range area
Set the Minimum to 0 and the Maximum to 100
Click OK
To request the same chart with command syntax, use the IGRAPH
command below.
IGRAPH /VIEWNAME='Bar Chart'
/X1 = VAR(attend) TYPE = CATEGORICAL
/Y = $pct
/COORDINATE = VERTICAL
/X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
/SCALERANGE = $pct MIN=0.000000 MAX=100.000000
/CATORDER VAR(attend) (ASCENDING VALUES
OMITEMPTY)
/BAR KEY=ON SHAPE = RECTANGLE BASELINE = AUTO.
Figure 4.18 Interactive Bar Chart
PIE CHARTS
While the pie and bar charts are based on the same information, the
structure of the pie chart draws attention to the relation between a given
slice (here a group) and the whole. On the other hand, a bar chart leads
one to make comparisons among the bars, rather than any single bar to
the total. You might keep these different emphases in mind when
deciding which to use in your presentations.
SUMMARY
Objective
Method
Data
INTRODUCTION
Thus far we have examined each variable isolated from the others.
A main component of many studies is to look for relationships
among variables or to compare groups on some measure. Using the
General Social Survey 1994 data, our interest is in investigating whether
men differ from women in their belief in an afterlife and in their attitude
toward gun permits. In addition, we will explore whether education
relates to these measures. Our choice of these variables, and not others,
is based on our view of which questions might be interesting to
investigate. More often a study is designed to answer specific questions of
interest to the researcher. These may be theoretical as in an academic
project, or quite applied as often found in market research.
When one of the variables defines groups (for example, a demographic
variable), crosstabulation tables permit comparisons between groups. In
survey work, two attitudinal measures are often displayed in a crosstab to
assess their relationship. While the most common tables involve two
variables, crosstabulations are general enough to handle additional
variables, and we will discuss a simple three-variable analysis.
A crosstabulation table can serve several purposes. It might be used
descriptively, that is, the emphasis is on providing some information
about the state of things and not on inferential statistical testing. For
example, demographic information of members of an organization
(company employees, students at a college, members of a professional
group) or recipients of a service (hospital patients, season ticket holders)
can be displayed using crosstabulation tables. Here the point is to provide
summary information describing the groups and not to make explicit
comparisons that generalize to larger populations. For example, an
educational institution might publish a crosstabulation table reporting
student outcome (dropout, return) for its different divisions. For this
purpose, the crosstabulation table is descriptive.
Crosstabulation tables are also used in research studies where the
goal is to draw conclusions about relationships in the population based on
sample data (recall our discussion in Chapter 1). Many survey studies
and all experiments have this as their goal. In order to make such
inferences, statistical tests (usually the chi-square test of independence)
are applied to the tables. In this chapter we will begin by discussing a
simple table displaying gender and belief in the afterlife. We will then
outline the logic of applying a statistical test to the data, perform the test,
and interpret the results. To provide reinforcement, several other two-way tables will be considered.
In addition to the statistical tests, researchers occasionally desire a
numeric summary of the strength of the association between the two
variables in a crosstabulation table. We provide a brief review of some of
these measures.
Another aspect of data analysis involves graphical display of the
results. We will see how bar charts can be used to present the data in
crosstabulation tables. Finally, we will explore a three-way table and
point in the direction of more advanced methods. We begin however, with
a simple table.
Click Analyze..Descriptive Statistics..Crosstabs
Move postlife into the Row(s): box
Move sex into the Column(s): box
Figure 5.1 Crosstabs Dialog Box
Click the Cells pushbutton
Click Column check box in order to obtain column percentages.
Figure 5.2 Crosstab Cell Display Dialog
Click Continue
Row, column and total table percentages can be requested. Row
percentages are computed within each row of the table so that the
percentages across a row sum to 100%. Column percentages would sum to
100% down each column, and total percentages sum to 100% across all
cells of the table. While we can request any or all of these percentages,
the column percent best suits our purpose. Since SEX is our column
variable, column percentages allow immediate comparison of the
percentages of men and women who believe in an afterlife: the question of
interest. We will not request row percents because we are not directly
interested in them and wish to keep the table simple.
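The three kinds of percentages differ only in which total each cell count is divided by. A Python sketch (illustrative only, not SPSS syntax; the counts here are made up):

```python
# Row, column, and total percentages from a 2x2 table of counts
# (hypothetical numbers, for illustration only).
counts = [[30, 50],    # row 1: cell counts
          [20, 100]]   # row 2

row_totals = [sum(row) for row in counts]           # [80, 120]
col_totals = [sum(col) for col in zip(*counts)]     # [50, 150]
grand = sum(row_totals)                             # 200

# Column percentages: each cell as a percent of its column total,
# so every column sums to 100%.
col_pct = [[100 * counts[i][j] / col_totals[j] for j in range(2)]
           for i in range(2)]
# Row percentages sum to 100% across each row.
row_pct = [[100 * counts[i][j] / row_totals[i] for j in range(2)]
           for i in range(2)]
# Total percentages sum to 100% over all cells of the table.
tot_pct = [[100 * counts[i][j] / grand for j in range(2)]
           for i in range(2)]
```

Which divisor you choose is exactly the choice between comparing down columns, across rows, or against the whole table.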
Notice that Counts is checked by default. The other choices (Expected
Count, Residuals) are more technical summaries and will be considered
in the next example.
Click OK
The Crosstabs syntax command that can be used to run this analysis
appears below.
CROSSTABS
/TABLES postlife BY SEX
/CELLS COUNT COLUMN .
In SPSS syntax the row variable(s) (here POSTLIFE) precedes the
keyword BY and the column variable(s) follows it on the TABLES
subcommand. We request that counts and column percents appear in the
cells of the table.
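Under the hood, a crosstab is simply a tally of joint counts. A quick Python sketch (not SPSS; the raw responses are hypothetical):

```python
# Tally a crosstab of joint counts from raw case data.
from collections import Counter

# Each case is a (row value, column value) pair, e.g. (postlife, sex).
cases = [("yes", "male"), ("yes", "female"), ("no", "female"),
         ("yes", "female"), ("no", "male"), ("yes", "male")]

table = Counter(cases)  # maps (row value, column value) -> cell count
# table[("yes", "female")] is the count for that cell of the table
```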
Figure 5.3 Crosstabulation of Belief in an Afterlife by Gender
The statistics labels appear in the row dimension of the table. The
two numbers in each cell are counts and column percentages. We see that
about 79% of the men and 83% of the women said they believe in an
afterlife. Also note that many observations are missing since this
question was not asked of all respondents in 1994. On the descriptive
level we can say that most of those sampled believed in the afterlife. If we
wish to draw conclusions about the population, for example differences
between men and women, we would need to perform statistical tests.
Row percents, if requested, would indicate what percentage of
believers is male and what percentage of believers is female. In other
words, the percentages would sum to 100% across each row. Your choice
of row versus column percents determines your view of the data. In
survey research, independent variables, such as demographics, are often
positioned as column variables (or banner variable in the stub and
banner tables of market research), and since there is much interest in
comparing these groups, column percents are displayed. If you prefer to
interpret row percentages in this context, or wish both percentages to
appear, feel free to do so. The important point is that the percentages
help answer the question of interest in a direct fashion.
Having examined the basic two-way table, we move on to ask
questions about the larger population.
CHI-SQUARE
TEST OF
INDEPENDENCE
In the table viewed above, 79% of the men in the sample and 83% of the
women believed in an afterlife. There is a difference in the sample of
about 4% with a higher proportion of women believing. Can we conclude
from this that there is a population difference between men and women
on this issue (statistical significance)? And if there is a difference in the
population, is it large enough to be important to us (ecological
significance)?
The difficulty we face is that the sample is an incomplete and
imperfect reflection of the population. We use statistical tests to draw
conclusions about the population from the sample data. The basic logic of
such tests follows. We first assume there is no effect (null hypothesis) in
the population (here that men and women show no differences in belief in
an afterlife). We then calculate how likely it is that a sample could show
as large (or larger) an effect as what we observe (here a 4% difference), if
there were truly no effect in the population. If the probability of obtaining
so large a sample effect by chance alone is very small (often less than 5
chances in 100 or 5% is used) we reject the null hypothesis and conclude
there is an effect in the population. While this approach may seem
backward, that is, we assume no effect when we wish to demonstrate an
effect, it provides a method of forming conclusions about the population.
The details of how this logic is applied will vary depending on the type of
data (counts, means, other summary measures) and the question asked
(differences, association). So we will use a chi-square test in this chapter,
but t and F tests later.
Applying the testing logic to the crosstabulation table, we calculate
the number of people expected to fall into each cell of the table assuming
no relationship between gender and belief in an afterlife, then compare
these numbers to what we actually obtained in the sample. If there is a
close match we accept the null hypothesis of no effect. If the actual cell
counts differ dramatically from what is expected under the null
hypothesis we will conclude there is a gender difference in the population.
The chi-square statistic summarizes the discrepancy between what is
observed and what we expect under the null hypothesis. In addition, the
sample chi-square value can be converted into a probability that can be
readily interpreted by the analyst. To demonstrate how this works in
practice, we will rerun the same analysis as before, but request the chi-square statistic. We will also ask that some supplementary information
appear in the cells of the table to better envision the actual chi-square
calculation. In practice, you would rarely ask for this latter information
to be displayed.
REQUESTING
THE CHI-SQUARE
TEST
Figure 5.4 Crosstab Statistics Dialog Box
The first choice is the chi-square test of independence of the row and
column variables. Most of the remaining statistics are association
measures that attempt to assign a single number to represent the
strength of the relationship between the two variables. We will briefly
discuss them later in this chapter. The McNemar statistic is used to test
for equality of correlated proportions, as opposed to general independence
of the row and column variables (as does the chi-square test). For
example, if we ask people, before and after viewing a political
commercial, whether they would vote for candidate A, the McNemar test
would test whether the proportion choosing candidate A changed. The
Cochran's and Mantel-Haenszel statistics test whether a dichotomous
response variable is conditionally independent of a dichotomous
explanatory variable when adjusting for the control variable. For
example, is there an association between instruction method (treatment
vs. control) and exam performance (pass vs. fail), controlling for school
area (urban vs. rural)?
Click Continue
To illustrate the chi-square calculation we also request that some
technical results (expected values and residuals) appear in the cells of the
table. Once again, you would not typically display these statistics. To
proceed we return to the Crosstab Cell Display dialog box (click the Cells
pushbutton in the Crosstabs dialog box), then check Expected Counts and
Unstandardized Residuals.
Click the Cells pushbutton
Check Expected Counts and Unstandardized Residuals
Figure 5.5 Displaying Technical Information in Crosstab Tables
Figure 5.6 Crosstab with Expected Values and Residuals
The counts and percentages are the same as before; the expected
counts and residuals will aid in explaining the calculation of the chi-square statistic. Recall that our testing logic assumes no relation between
the row and column variables (here gender and belief in an afterlife) in
the population, and then determines how consistent the data are with
this assumption. In the table above there are 565 males who say they
believe in an afterlife. We now need to calculate how many observations
should fall into this cell if there were no relation between gender and
belief in an afterlife. First, note (we calculate this from the counts in the
cells and in the margins of the table) that 40.8% (714 of 1752, or .4075) of
the sample is male and 81.3% (1425 of 1752, or .8133) of the sample
believes in an afterlife. If gender is unrelated to belief in an afterlife, the
probability of picking someone from the sample who is both a male and a
believer would be the product of the probability of picking a male and the
probability of picking a believer, that is, .4075 * .8133 or .3314 (33.14
percent). This is based on the probability of the joint event equaling the
product of the probabilities of the separate events when the events are
independent; for example, the probability of obtaining two heads when
flipping coins. Taking this a step further, if the probability of picking a
male believer is 33.14% and our sample is composed of 1752 people, then
we would expect to find 580.7 male believers in the sample. This number
is the expected count for the male-believer cell, assuming no relation
between gender and belief. We observed 565 male believers while we
expected to find 580.7, and so the discrepancy or residual is -15.7. Small
residuals indicate agreement between the data and the null hypothesis of
no relationship; large residuals suggest the data are inconsistent with the
null hypothesis.
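The expected-count arithmetic just described can be reproduced in a few lines. A Python sketch (not SPSS), using the marginal counts quoted above:

```python
# Expected count for the male-believer cell under independence,
# using the marginals from the afterlife-by-gender table.
n_total = 1752
n_male = 714        # 40.75% of the sample
n_believer = 1425   # 81.33% of the sample

p_male = n_male / n_total
p_believer = n_believer / n_total

# Under the null hypothesis, P(male and believer) = P(male) * P(believer).
expected = p_male * p_believer * n_total   # about 580.7

observed = 565
residual = observed - expected             # about -15.7
```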
Expected counts and residuals are calculated for each cell in the table
and we wish to obtain an overall summary of the agreement between the
two. Simply summing the residuals has the disadvantage of negative and
positive residuals (discrepancies) canceling each other out. To avoid this
(and for more technical statistical reasons) residuals are squared so all
values are positive. A second consideration is that a residual of 50 would
be large relative to an expected count of 15, but small relative to an
expected count of 2,000. To compensate for this the squared residual from
each cell is divided by the expected count of the cell. The sum of these cell
summaries ((Observed count - Expected count)**2 / Expected count)
constitutes the Pearson chi-square statistic. One final consideration is
that since the chi-square statistic is the sum of positive values from each
cell in the table, other things being equal, it will have greater values in
larger tables. The chi-square value itself is not adjusted for this, but an
accompanying statistic called degrees of freedom, based on the number of
cells (technically the number of rows minus one multiplied by the number
of columns minus one), is taken into account when the statistic is
evaluated.
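The full calculation can be sketched in Python (illustrative, not SPSS). The four cell counts are reconstructed from the figures quoted in the text: 565 male believers, 714 males, 1425 believers, 1752 respondents in total.

```python
# Pearson chi-square for the afterlife-by-gender 2x2 table.
observed = [[565, 860],   # believe: male, female
            [149, 178]]   # do not believe: male, female

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n
        chi_square += (observed[i][j] - expected) ** 2 / expected

df = (2 - 1) * (2 - 1)  # (rows - 1) * (columns - 1)
```

The statistic comes to roughly 3.86 with 1 degree of freedom, just above the 3.84 critical value for the .05 level, which is why the Pearson test reaches significance while some of the other tests in the output do not.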
Figure 5.7 Chi-Square Test Results
If we use the criterion of rejecting the null hypothesis when the sample
result would occur less than 5% of the time by chance alone (as many
researchers do), we would claim this is a statistically significant effect:
U.S. adult women are more likely to believe in an afterlife than men.
The Continuity correction will appear only in two-row by two-column
tables when the chi-square test is requested. In such small tables it was
known that the standard chi-square calculation did not closely
approximate the theoretical distribution, which meant that the
significance value was not quite correct. A statistician named Frank
Yates published an adjusted chi-square calculation specifically for two-row
by two-column tables, and it typically appears labeled as the
Continuity correction or as Yates' correction. It was applied routinely
for many years, but more recent Monte Carlo simulation work indicates
that it over adjusts. As a result it is no longer automatically used in two
by two tables, but it is certainly useful to compare the two significance
values to make sure they agree (here notice the significance value for the
continuity correction is slightly above .05).
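Yates' adjustment is easy to see in the same arithmetic: 0.5 is subtracted from the absolute value of each residual before squaring. A Python sketch (not SPSS) using the same reconstructed cell counts as above:

```python
# Yates continuity-corrected chi-square for a 2x2 table:
# sum of (|observed - expected| - 0.5)^2 / expected over the cells.
observed = [[565, 860],   # believe: male, female
            [149, 178]]   # do not believe: male, female
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

corrected = 0.0
for i in range(2):
    for j in range(2):
        e = row_tot[i] * col_tot[j] / n
        corrected += (abs(observed[i][j] - e) - 0.5) ** 2 / e
```

The corrected value is about 3.61, below the 3.84 cutoff, which is why the continuity-corrected significance value sits slightly above .05 while the uncorrected Pearson value falls just below it.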
The Pearson chi-square was the test used originally with
crosstabulation tables. A more recent chi-square approximation is the
likelihood ratio chi-square test. Here it tests the same null hypothesis,
independence of the row and column variables, but uses a different chi-square formulation. It has some technical advantages that largely show
up when dealing with higher-order tables (three-way and up). In the vast
majority of cases, both the Pearson and likelihood ratio chi-square tests
lead to identical conclusions. In most introductory statistics courses, and
when reporting results of two-variable crosstab tables, the Pearson chi-square is commonly used. For more complex tables, and more advanced
statistical applications, the likelihood ratio chi-square is almost
exclusively applied. Note that here the likelihood ratio result is slightly
above .05, leading to a different conclusion than the Pearson chi-square.
This will be discussed below.
The Linear by Linear chi-square tests the very specific hypothesis of
linear association between the row and column variables. This assumes
that both variables are interval scale measures and you are interested in
testing for straight-line association. This is rarely the case in
crosstabulation tables (unless working with rating scales) and the test is
not often used.
Finally, Fisher's exact test will appear for crosstabulation tables
containing two rows and two columns (a 2x2 table); exact tests are
available for larger tables through the SPSS Exact Tests option. Fisher's
test calculates the proportion of all table arrangements that have more
extreme percentages than observed in the cells, while keeping the same
marginal proportions. Exact tests have the advantage of not depending
on approximations (as do the Pearson and likelihood ratio chi-square
tests). However, the computational effort required to evaluate exact tests
in all but simple situations (for example a 2x2 table) has been large.
Recent improvements in algorithms have resulted in exact tests
calculated more efficiently. You should consider using exact tests when
your sample size is small, or when some cells in large crosstabulation
tables are empty or have small cell counts. As the sample size increases
(for all cells), exact tests and asymptotic (Pearson, likelihood ratio)
results converge.
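For a 2x2 table, Fisher's calculation needs nothing more than binomial coefficients. Below is an illustrative Python implementation of the common two-sided convention (sum the probabilities of all tables no more probable than the observed one), not SPSS's own code:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one."""
    r1, r2 = a + b, c + d     # row totals
    c1 = a + c                # first column total
    n = r1 + r2

    def prob(x):
        # P(a table with x in the upper-left cell, margins fixed)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo = max(0, c1 - r2)      # smallest feasible upper-left count
    hi = min(r1, c1)          # largest feasible upper-left count
    # tiny tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

Enumeration like this is cheap for a 2x2 table but grows expensive for larger ones, which is the computational burden mentioned above.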
DIFFERENT
TESTS,
DIFFERENT
RESULTS?
Here we are faced with our Pearson result disagreeing with the other
tests. It is not a major problem in that the probability results are very
similar. However since we are testing at the .05 level, we would draw
different conclusions from the different tests. That is, while the
probability values from each test are very close, some fall just above, and
another just below, the .05 cutoff we chose. In this case it might be best to
say there is a suggestion of a male-female difference, but the test result is
not conclusive. Additional data, if available, would help resolve the issue.
Such disagreements among test results occur relatively rarely in practice.
ECOLOGICAL
SIGNIFICANCE
While our significance tests were not definitive, suppose we did conclude
from the Pearson chi-square test that U.S. adult men and women differ in
their belief in an afterlife. We would then ask the question of practical
importance. Recall that majorities of both men and women believe and
the sample difference between them was about 4%. At this point the
researcher should consider whether a 4% difference is large enough to be
of practical importance. For example, if these were dropout rates for
students in two groups (no intervention, a dropout intervention program),
would a 4% difference in dropout rate justify the cost of the program?
These are the more practical and policy decisions that often have to be
made during the course of an applied statistical analysis.
SMALL SAMPLE
CONSIDERATIONS
When expected cell counts are very small, one remedy is to combine
sparse categories or drop a very small group, essentially giving up
information about that group in order to obtain stability when
investigating the others. In recent years, efficient
algorithms have been developed to perform exact tests which permit low
or zero expected cell counts in crosstabulation tables. SPSS has
implemented such algorithms in its Exact Tests module.
ADDITIONAL
TWO-WAY
TABLES
Multiple tables can be obtained by naming several row or column
variables. In addition (although not shown) we drop our previous request
that the expected counts and residuals appear (in the Cells dialog box).
Click OK
The final command appears below.
CROSSTABS
/TABLES postlife gunlaw BY sex degree
/STATISTIC CHISQ
/CELLS COUNT COLUMN .
Each variable before the keyword BY will be matched with each one
following it, constructing four tables. Since we have already viewed belief
in an afterlife by gender, we skip it here.
Note
Some of the pivot tables shown in this chapter have been edited in the
Pivot Table editor so they are easier to read in this document.
Figure 5.9 Belief in Afterlife by Education Degree
The chi-square significance value indicates that a difference this large
would occur about 6 times in 100 (.058) by chance alone if there were no
differences in the population. No continuity correction appears because this is not a
two-row by two-column table. The minimum expected frequency is above
5: the value suggested by the rule of thumb reviewed earlier.
Figure 5.10 Gun Permits and Gender
Figure 5.11 Gun Permits and Education Degree
WHY IS THE
SIGNIFICANCE
CRITERION
TYPICALLY SET
AT .05?
ASSOCIATION
MEASURES
ASSOCIATION
MEASURES
AVAILABLE
WITHIN
CROSSTABS
Some measures are symmetric, that is, do not vary if the row
and column variables are interchanged. Others are
asymmetric and must be interpreted in light of a causal or
predictive ordering that you conceive between your variables.
Association measures often used in health research are kappa and
relative risk. Strictly speaking, kappa is a normed association measure,
while relative risk compares the relative risk of a negative outcome
occurring between two groups and is not bounded as the other association
measures are.
These association measures are found in the Crosstabs Statistics
dialog box. We will request several measures for a new two-way table:
gun ownership by education degree. Here both nominal and ordinal
measures of association might be desirable.
Click Continue
Click OK
The association measures are grouped by level of measurement
assumed for the variables. We checked lambda (which will also produce
Goodman & Kruskal's tau) along with Kendall's tau-c and the gamma
coefficient. The SPSS command to run this analysis is shown below.
CROSSTABS
/TABLES owngun BY degree
/STATISTIC CHISQ LAMBDA GAMMA CTAU
/CELLS COUNT COLUMN .
The desired association measures are listed on the STATISTICS
subcommand. First we review the crosstab table.
Figure 5.13 Gun in the Home and Education Degree
Figure 5.14 Association Measures - Gun in Home and Education Degree
GRAPHING
CROSSTABULATION
RESULTS
Click OK
The command to obtain the same chart in SPSS would be:
GRAPH
/BAR(GROUPED)=PCT BY gunlaw BY sex .
Figure 5.16 Bar Chart of Attitude Toward Gun Permits by Gender
We now have a direct visual comparison between the men and women
to supplement the crosstabulation table and significance tests. This graph
might be useful in a final presentation or report.
Hint
THREE-WAY
TABLES
You can create a bar chart directly from the values in the crosstabs pivot
table. To do so, double-click on the crosstabs pivot table to activate the
Pivot Table Editor, then select (Ctrl-click) all table values, for example
column percents except for totals, that you wish to plot. Then right-click
and select Create Graph..Bar from the Context menu. A bar chart will be
inserted in the Viewer window, following the pivot table.
We will illustrate a three-way table using the table of attitude toward
gun permits by gender as a basis. Suppose we are interested in seeing
how gun ownership might interact with the previously observed
relationship between gender and attitude toward gun permits. To explore
this question we specify the gun-in-home question (OWNGUN) as the
control (or layer) variable in the crosstabulation analysis. In this way we
can view a table of attitude toward gun permits by gender separately for
those who do, and then for those who don't, have a gun in the home. We
will request a chi-square test of independence for each subtable.
Click OK
As before, GUNLAW (attitude toward gun permits) and SEX are,
respectively, the row and column variables, but OWNGUN (gun in the
home) is added as a layer (or control) variable. Note that OWNGUN is in
the first layer. If additional control variables are to be used, they can be
added at higher-level layers. Although not shown, we asked for Column
percents in the Cells dialog box and the Chi-square test from the
Statistics dialog box.
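Conceptually, the layer variable simply splits the cases into subgroups, and a separate two-way table is tallied within each. A Python sketch (not SPSS; the raw cases are hypothetical):

```python
# A layer (control) variable produces one two-way count table
# per value of the control variable.
from collections import Counter, defaultdict

# hypothetical raw cases: (gunlaw, sex, owngun)
cases = [("favor", "male", "yes"), ("oppose", "male", "yes"),
         ("favor", "female", "no"), ("favor", "male", "no"),
         ("favor", "female", "yes")]

subtables = defaultdict(Counter)
for gunlaw, sex, owngun in cases:
    subtables[owngun][(gunlaw, sex)] += 1
# subtables["yes"] and subtables["no"] are the two layered crosstabs,
# each of which could get its own chi-square test
```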
The following syntax command will do this analysis.
CROSSTABS
/TABLES gunlaw BY sex BY owngun
/STATISTIC CHISQ
/CELLS COUNT COLUMN .
The second occurrence of the keyword BY separates the layer variable
(OWNGUN) from the column variable (SEX). To expand to a four-way
table we would add the Keyword BY and an additional control variable to
the end of the TABLES subcommand. We view the three-way table below.
Figure 5.18 Gun Permits and Gender and Gun in Home
Figure 5.19 Chi-Square Statistics for Three-Way Table
We see that for respondents who have a gun in the home, there is a
large (19%) and significant difference between men and women. Recall
from Figure 5.10, that the original attitude toward gun permits by gender
crosstab table showed a 15% difference. The result here is consistent
with, but more pronounced than, the pattern in the original table. For
respondents in homes without guns a somewhat different pattern
emerges. Here the percentages of men and women favoring gun permits
are significantly different in the population, yet seem closer (a 5%
difference: 84.5% versus 89.8%) than the male-female difference for those
with guns in the home (a 19% difference). Thus there is a suggestion that
the male-female difference in attitude toward gun permits is more
pronounced in households with guns. This could be formally tested (test
for presence of a three-way interaction) using a loglinear model (note: we
ran this test and it was not significant, p= .0625).
Considering the two tables, we conclude there is a gender difference
in attitude toward gun permits regardless of whether a gun is in the
home. In addition, the difference seems more pronounced in homes with
guns, although we did not test this (note: when we did test using
loglinear, it was not significant). To carry this point further, suppose
there was a significant male-female difference in homes with guns, but
there was not a male-female difference in homes without guns? This
would modify our original conclusion based on the two-way table, since
we would know a third factor is relevant. If this occurred, the next step
for the analyst would be to explain it. As an exercise, can you suggest
reasons that might account for the patterns seen in this three-way table
(greater male-female difference in households with guns)?
EXTENSIONS
SUMMARY
Method
Data
Scenario
INTRODUCTION
FREQUENCY
TABLES AND
HISTOGRAMS
Figure 6.2 Frequencies: Statistics Dialog Box
Click Continue
HISTOGRAMS
Figure 6.3 Frequencies: Chart Dialog Box
Figure 6.4 Frequency Table of Age First Married (Beginning)
The mean age when first married is 22.6. The median (50th percentile
value) is 22. The reason for this discrepancy between the two measures of
central tendency is that a few respondents married relatively late in life.
Such relatively extreme values influence the mean more than the
median. Medians are known to be resistant (robust) to extreme scores
and are sometimes preferred because of this. The standard deviation is
about 5 years, which indicates there is a fair amount of variation among
respondents in age when first married.
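The resistance of the median to extreme scores is easy to demonstrate. A small Python sketch with hypothetical ages:

```python
# A few hypothetical ages at first marriage, mostly early twenties.
ages = [19, 20, 21, 22, 22, 23, 24]
mean = sum(ages) / len(ages)            # about 21.6
median = sorted(ages)[len(ages) // 2]   # 22, the middle value

# One late marriage pulls the mean up by several years,
# while the median barely moves.
ages2 = ages + [58]
mean2 = sum(ages2) / len(ages2)         # about 26.1
s = sorted(ages2)
median2 = (s[3] + s[4]) / 2             # still 22: mean of the middle two
```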
Figure 6.6 Histogram of Age When First Married
Does this plot seem useful in describing age when first married? We
see the most common ages when first married concentrate in the late
teens to early twenties. The distribution is not symmetric: no one was
married before his early teens, while at the high end, some respondents
married in their forties or fifties. To the eye, the plot seems truncated on
the left-hand side. Technically speaking, this distribution would be
described as skewed to the right or as having positive skewness. We will
discuss skewness in more detail shortly. In summary, we might say that
unless we are interested in the exact ages when first married, the
frequency table contained too much detail, while the statistical
summaries and histogram were more useful and succinct.
EXPLORATORY
DATA ANALYSIS
things about the data. To further this effort Tukey developed both plots
and data summaries. These methods, called exploratory data analysis
and abbreviated EDA, have become very popular in applied statistics and
data analysis. Exploratory data analysis can be viewed either as an
analysis in its own right, or as a set of data checks and investigations
performed before applying inferential testing procedures.
These methods are best applied to variables with at least ordinal
(more commonly interval) scale properties and which can take many
different values. The plots and summaries would be less helpful for a
variable that takes on only a few values (for example, one to five scales).
We will apply EDA techniques to three items based on the General
Social Survey: an average satisfaction score, age when first married, and
number of hours of TV viewed per day.
AVERAGE
SATISFACTION
VARIABLE
The General Social Survey 1994 contains five questions asking about
respondent satisfaction with various aspects of life. The questions pertain
to satisfaction with the city or place lived in (SATCITY), family life
(SATFAM), friendships (SATFRND), health and physical condition
(SATHEALT), and non-working activities and hobbies (SATHOBBY).
Responses are made on a one to seven point scale measuring level of
satisfaction, where 1 = "A Very Great Deal" and 7 = "None". To create an
overall or average satisfaction measure we take the average score across
the five questions for each respondent. In SPSS for Windows, this is done
within the Compute dialog box.
Click Transform..Compute
Type satmean in the Target Variable box
Select Mean(numexpr,numexpr,...) from the Function menu
and move it to the Numeric Expression box
Move the variables satcity, satfam, satfrnd, sathealt, and
sathobby to the Numeric Expression box
Make sure the variable names are separated by commas (,)
Figure 6.7 Computing the Average Satisfaction Score
Click OK
The resulting command appears below.
COMPUTE satmean = MEAN(satcity, satfam, satfrnd, sathealt,
sathobby) .
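For each respondent, MEAN averages whatever valid (non-missing) values are present among its arguments. A Python sketch of that behavior (not SPSS; None stands in for a missing response):

```python
def mean_of_valid(values):
    """Average the non-missing values; None if every value is missing.
    (SPSS's MEAN likewise uses only the valid arguments for each case.)"""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

# A hypothetical respondent who skipped one satisfaction item:
satmean = mean_of_valid([2, 3, None, 1, 2])  # averages the four valid scores
```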
After creating the variable satmean, we will perform exploratory
data analysis on the three variables of interest.
Click Analyze..Descriptive Statistics..Explore
Move satmean, agewed, and tvhours to the Dependent List:
box
Figure 6.8 Explore Dialog Box
OPTIONS WITH
MISSING VALUES
Summaries can be based only on cases having valid values on all
variables analyzed (called listwise deletion), or they can be computed
separately for each variable (called pairwise deletion). When only a single
variable is considered both methods yield the same result, but they will
not give identical answers when multiple variables are analyzed in the
presence of missing values. The default method is listwise deletion, and
we will specifically request via the Options pushbutton that the
alternative method (pairwise) be used. This makes sense when we
consider that one of the variables, AGEWED, is asked only of those who
have been married. Why should we exclude responses to SATMEAN or
TVHOURS for those never married, and who thus have missing values
for AGEWED?
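The difference between the two deletion methods can be illustrated with a toy data set. The variable names follow the text; the Python representation and the three made-up respondents are for illustration only:

```python
# Three respondents; None marks a missing value.
rows = [
    {"satmean": 2.0,  "agewed": 21,   "tvhours": 3},
    {"satmean": 2.4,  "agewed": None, "tvhours": 2},  # never married
    {"satmean": None, "agewed": 25,   "tvhours": 1},
]

# Listwise: a case is dropped from every analysis if it is missing
# on any variable in the list.
listwise_n = sum(all(v is not None for v in r.values()) for r in rows)

# Pairwise: each variable keeps all of its own valid cases.
pairwise_n = {k: sum(r[k] is not None for r in rows) for k in rows[0]}

print(listwise_n)   # 1
print(pairwise_n)   # {'satmean': 2, 'agewed': 2, 'tvhours': 3}
```

Under listwise deletion only one of the three cases survives, while pairwise deletion keeps two valid cases for SATMEAN and AGEWED and all three for TVHOURS.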
Click the Options pushbutton
Click the Exclude cases pairwise option button
Figure 6.9 Missing Value Options in Explore
Rarely used, the Report choice has SPSS include user-defined missing
values in frequency analyses but exclude them from summary statistics
and charts.
Click Continue
Click OK
The SPSS command to do this analysis appears below.
EXAMINE
VARIABLES=satmean agewed tvhours
/MISSING PAIRWISE.
EXAMINE is the syntax command name given to the procedure that
performs exploratory data analysis.
Note
Although SPSS presents statistics for all three variables within one pivot
table, we will present and discuss the summaries and plots for each
variable separately.
Figure 6.10 EDA Summaries for Average Satisfaction
MEASURES OF
CENTRAL
TENDENCY
VARIABILITY
MEASURES
CONFIDENCE
BAND FOR MEAN
SHAPE OF THE
DISTRIBUTION
When the skewness value is positive and several standard errors from
zero, the distribution exhibits a bunching of values to the left and a
longer tail to the right. We will see this shortly.
Kurtosis also has to do with the shape of a distribution and is a
measure of how much of the data is concentrated near the center, as
opposed to the tails, of the distribution. It is normed to the normal
curve, whose kurtosis is zero. As an example, a distribution with long,
thick tails that is less peaked in the middle than the normal curve
would have a positive kurtosis measure. A standard error for kurtosis
also appears. Since the kurtosis value is .94, which is beyond two
standard errors (2 * .22) from zero, average satisfaction would be
considered slightly non-normal. The shape of the distribution can be of
interest in its own right. In addition, assumptions are made about the
shape of the data distribution within each group when performing
significance tests on mean differences between groups. This aspect will
be covered in later chapters.
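Both shape measures can be computed by hand. The sketch below uses the simple population versions (mean of cubed and fourth-power z-scores); SPSS reports sample-adjusted statistics and their standard errors, which differ slightly for small samples:

```python
from statistics import mean, pstdev

def skewness(x):
    """Population skewness g1: the mean of the cubed z-scores.
    Positive values indicate a longer tail to the right."""
    m, s, n = mean(x), pstdev(x), len(x)
    return sum(((v - m) / s) ** 3 for v in x) / n

def excess_kurtosis(x):
    """Population excess kurtosis g2: the mean of z-scores to the
    fourth power, normed so the normal curve scores zero."""
    m, s, n = mean(x), pstdev(x), len(x)
    return sum(((v - m) / s) ** 4 for v in x) / n - 3

# A long right tail yields positive skewness and, here, positive
# kurtosis (heavy tail relative to a normal curve):
data = [1, 1, 2, 2, 3, 7]
print(skewness(data) > 0)         # True
print(excess_kurtosis(data) > 0)  # True
```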
The stem & leaf plot is modeled after the histogram, but is designed to
provide more information. Instead of using a standard symbol (for
example, an asterisk * or block character) to display a case or group of
cases, the stem & leaf plot uses data values as the plot symbols. Thus the
shape of the distribution appears and the plot can be read to obtain
specific data values. The stem & leaf plot for average satisfaction
appears below.
Figure 6.11 Stem & Leaf Plot of Average Satisfaction
In a stem & leaf plot the stem is the vertical axis and the leaves
branch horizontally from the stem (Tukey devised the stem & leaf plot). The
stem width indicates how to interpret the units in the stem; in this case a
stem unit represents one point on a seven-point satisfaction rating scale.
A stem width of 10 would indicate that the stem value must be multiplied
by 10 to reproduce the original units of analysis. The actual numbers in
the chart (leaves) provide an extra decimal place of information about the
data values. To illustrate, one of the bottom rows of the stem & leaf
contains a stem value of 4 with several leaves of value 6. These represent
individuals whose average satisfaction score was 4.6. Thus besides
viewing the shape of the distribution we can pick out individual scores.
Below the diagram a note indicates that each leaf represents one case.
For large samples a leaf may represent multiple cases. In such
situations, an ampersand (&) is used to denote a partial leaf.
The last line identifies outliers. These are data points far enough
from the center (defined more exactly under Box & Whisker plots below)
that they might merit more careful checking. Extreme points might be
data errors or possibly represent a separate subgroup. The nearest outlier
(the one closest to the median) is listed. If the stem & leaf plot were
extended to include the outliers, then the positive skewness would be
apparent. These extreme values may contribute to the kurtosis as well.
The stem & leaf plot attempts to describe data by showing every
observation. By comparison, the box & whisker plot conveys information
about the distribution of a variable by displaying only a few summary
measures. The box & whisker plot also identifies outliers (data values
far from the center of the distribution). Below we see the box &
whisker plot (also called a box plot) for average satisfaction.
Figure 6.12 Box & Whisker Plot of Average Satisfaction
The vertical axis represents the average satisfaction scale. In the
plot, the solid line inside the box represents the median. The hinges
provide the top and bottom borders to the box; they correspond to the
75th and 25th percentile values of average satisfaction, and thus define
the interquartile range (IQR). In other words, the middle 50% of data
values fall within the box. The whiskers are the last data values that lie
within 1.5 box lengths (or IQRs) of the respective hinges (edges of the
box). Tukey considers data points more than 1.5 box lengths from a hinge
to be far enough from the center to be noted as outliers. Such points are
marked with a circle. Points more than 3 box lengths from a hinge are
considered by Tukey to be far-out points and are marked with an asterisk
symbol. This plot has several outliers and one extreme point. If a single
outlier exists at a data value, the case sequence number appears beside it
(an ID variable can be substituted), which aids data checking.
If the distribution were symmetric, then the median would be
centered within the hinges and the whiskers. In the plot above, the
disparate lengths of the whiskers and the outliers at the high end show
the skewness. Such plots are also useful when comparing several groups,
as we will see in later chapters.
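The fences Tukey uses to flag outliers and far-out points can be sketched directly. This uses the standard library's inclusive quantile method, which may differ slightly from the exact hinge computation SPSS uses; the sample ages are made up for illustration:

```python
from statistics import quantiles

def tukey_fences(x):
    """Classify points as outliers (beyond 1.5 IQRs from a hinge) or
    far-out/extreme points (beyond 3 IQRs), following Tukey's rule."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    outliers = [v for v in x if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    extremes = [v for v in x if v < q1 - 3.0 * iqr or v > q3 + 3.0 * iqr]
    return outliers, extremes

# Hypothetical ages at first marriage; 48 lies well beyond both fences:
ages = [18, 19, 20, 21, 22, 23, 24, 25, 48]
out, ext = tukey_fences(ages)
print(out, ext)  # [48] [48]
```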
EXPLORING AGE
WHEN FIRST
MARRIED
The mean age first married is greater than the median. This suggests
a positive skew to the data, confirmed by the skewness statistic. Examine
the minimum and maximum values; do they suggest data errors? Which
other variables might you look at in order to investigate the validity of
these responses? We have valid data for 1,189 observations and 1,803
missing (these counts appear in the Case Processing Summary pivot
table, not shown). If we turn back to the frequency table of marital status
in Chapter 4 (Figure 4.3), we find 614 people have never been married,
which accounts for roughly 1/3 of the missing data. Almost all the
remaining (all but 13 cases) missing data are due to the fact that the
question was not asked of all respondents in 1994. Thus, although about
60% of the responses to this question are missing, we have accounted for
them in a satisfactory manner.
Figure 6.14 Stem & Leaf Diagram for Age When First Married
Almost all leaves are zero (except for age 13) because the ages fall
within a fairly restricted range and were recorded in whole years. Age
13 is denoted with an & because it is a partial leaf. Thus except for
the outlier identification, the plot is equivalent to a histogram. Notice
that all the extreme values occur at the older ages; the individual first
married at age 13 is not identified as an outlier. This is because age 13 is
not that far from the bulk of the observations (many are married in their
late teens) while age 34 is. From a statistical perspective, age 13 is not an
outlier; however, from a social perspective it may very well be considered
unusual, and the case should be examined more closely for this reason.
Finally, do you notice any pattern of peaks and valleys to the plot?
Figure 6.15 Box & Whisker Plot for Age When First Married
The skewness is apparent from the outliers at the high end. Some of
these are marked as extreme points. While unusual relative to the data,
certainly people can first marry at these ages. If outliers appear in your
data, you should first check whether they are data errors. If not, consider
whether you wish them included in your analysis (some references were
given to this issue in Chapter 3). This is especially problematic when
dealing with a small sample (not the case here), since an outlier can
substantially influence the analysis. We now move to hours of TV
watched per day.
Figure 6.16 Exploratory Summaries of Daily TV Hours
The mean (2.82) is very near 3 hours, the trimmed mean is at 2.6 and
the median is 2. This suggests skewness. Do you notice anything
surprising about the minimum, maximum or range? Watching 24 hours
of TV a day is possible, but highly unlikely, so perhaps it is the result of
misunderstanding the question. The trimmed mean (2.60) is closer to the
mean (2.82) than to the median (2), indicating that the difference between
the mean and median is not solely due to the presence of outliers. The
stem & leaf diagram below, showing a heavy concentration of
respondents at 1 and 2 hours of TV viewing, suggests why the median is
at 2.
Figure 6.17 Stem & Leaf Diagram of Daily TV Hours
The stem & leaf identifies outliers on the high side. Other than that it
is of limited use here, since TVHOURS is recorded as a whole number of
hours and relatively few distinct values occur. The same consideration
applies when using Explore with five-point rating scales.
Figure 6.18 Box & Whisker Plot of Daily TV Hours
SAVING AN
UPDATED COPY
OF THE DATA
Click File..Save As and type GSS94 in the File name text box
(switch to the c:\Train\Stats folder if necessary)
Click Save
SUMMARY
Method
Data
Scenario
INTRODUCTION
distributions used when testing will be the t and F, rather than
chi-square. This is because the properties of sample means are different
from those of counts appearing in a crosstabulation table.
In this chapter, we outline the logic involved when testing for mean
differences between groups, then perform an analysis comparing two
groups. Later chapters will generalize the method to cases involving
additional groups.
LOGIC OF
TESTING FOR
MEAN
DIFFERENCES
Figure 7.2 Three Samples from Same Population
There is some overlap among the three groups, but the sample means
(medians here) are different. In this instance a statistical test would be
valuable to assess whether the sample mean differences are large enough
to justify the conclusion that the population means differ. This last plot
represents the typical situation facing a data analyst.
As we did when we performed the chi-square test, we formulate a null
hypothesis and use the data to evaluate it. First assume the population
means are identical, and then determine if the differences in sample
means are consistent with this assumption. If the probability of obtaining
sample means as far (or further) apart as we find in our sample is very
small (less than 5 chances in 100 or .05), assuming no population
differences, we reject our null hypothesis and conclude the populations
are different.
In order to implement this logic, we compare the variation among
sample means relative to the variation of individuals within each sample
group. The core idea is that if there were no differences between the
population means, then the only source for differences in the sample
means would be the variation among individual observations (since the
samples contain different observations), which we assume is random. If
we then compute a ratio of the variance among sample means divided by
the variance among individual observations within each group, we would
expect this ratio to be about 1 if there are no population differences.
When there are true population differences, the variation among sample
means would be due to two sources: variation among individuals, and the
true population difference in means. In this latter case we would expect
the ratio of variances to be greater than 1. Under the assumptions made
in analysis of variance, this variation ratio follows a known statistical
distribution (F). Thus the result of performing the test will be a
probability indicating how likely we are to obtain sample means as far
apart (or further) as we observe in our sample if the null hypothesis were
true. If this probability is very small, we reject the null hypothesis and
conclude there are true population differences.
This concept of taking a ratio of the between-group variation of means
to the within-group variation of individuals is fundamental to the
statistical method called analysis of variance. It is implicit in the
simple two-group case (t test), and appears explicitly in more complex
analyses (general ANOVA).
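The variance ratio described above can be computed directly. A minimal sketch (the two toy group configurations are illustrative):

```python
from statistics import mean

def f_ratio(groups):
    """Between-group mean square divided by within-group mean square:
    the F statistic of a one-way analysis of variance."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Identical group means: all variation is within groups, so F = 0 here.
print(f_ratio([[1, 2, 3], [1, 2, 3]]))     # 0.0
# Well-separated means: between-group variation dominates, F >> 1.
print(f_ratio([[1, 2, 3], [11, 12, 13]]))  # 150.0
```

With true population differences the ratio grows well beyond 1, which is exactly what the F test evaluates.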
ASSUMPTIONS
Although normality of the dependent measure within each group is
assumed when performing the significance tests, the results are not much
affected by moderate departures from normality (for discussion and
references, see Kirk (1968); for an opposing view see Wilcox (1996,
1997)). In practice, researchers often examine histograms, stem & leaf
plots, and box & whisker plots to view each group in order to make this
determination. If a more formal approach is preferred, the Explore
procedure can produce more technical plots (normal probability plots) and
statistical tests of normality (see the second appendix to this chapter). In
situations where the sample sizes are small or there are gross deviations
from normality, researchers often shift to nonparametric tests. These
tests are not emphasized in this course, but many are available in SPSS.
The second assumption, homogeneity of variance, indicates that the
variance of the dependent measure is the same for each population
subgroup. Under the null hypothesis we assume the variation in sample
means is due to the variation of individual scores, and if different groups
show disparate individual variation, it is difficult to interpret the overall
ratio of between-group to pooled within-group variation. This directly
affects significance tests. Based on simulation work, it is known that
significance tests of mean differences are not much influenced by
moderate lack of homogeneity of variance if the sample sizes of the
groups are about the same. If the sample sizes are quite different, then
lack of homogeneity (heterogeneity) is a problem in that the significance
test probabilities are not correct. When comparing means from two
groups (t test) and one-factor ANOVA (see Chapter 8) there are
corrections for lack of homogeneity. In the more general analysis a simple
correction does not exist. It is beyond the scope of this course, but it
should be mentioned that if there is a relationship or pattern between the
group means and standard deviations (for example, if groups with higher
mean levels also have larger standard deviations), there are sometimes
data transformations that when applied to the dependent variable will
result in homogeneity of variance. Such transformations can entail
additional complications, but provide a method of meeting the
homogeneity of variance requirement. The Explore procedure's Spread &
Level plot can provide information as to whether this approach is
appropriate and can suggest the optimal data transformation to apply to
the dependent measure.
To oversimplify, when dealing with moderate or large samples and
testing for mean differences, normality is not always important. Gross
departures from homogeneity of variance do affect significance tests when
the sample sizes are disparate.
SAMPLE SIZE
Large samples are helpful when testing for mean differences (recall
our discussion of normality), but are not formally required.
Also, these analyses do not demand that the group sizes be equal.
However, analyses involving tests of mean differences are more resistant
to certain assumption violations (homogeneity of variance) when the
sample sizes are equal (or near equal). In more complex analysis of
variance (covered in later chapters) equal (or proportional) group sample
sizes bring assurance that the various factors under investigation can be
looked at independently. Finally, equal sample size conveys greater
statistical power when looking for any differences among groups. So, in
summary, equal group sample sizes are not required, but do carry
advantages. This is not to suggest that you should drop observations from
the analysis in order to obtain equal numbers in each group, since this
would throw away information. Rather, think of equal group sample size
as an advantageous situation you should avail yourself of when possible.
In experiments equal sample size is usually part of the design, while in
survey work it is rarely seen.
EXPLORING THE
DIFFERENT
GROUPS
AGEWED (age when first married) and SATMEAN (overall
satisfaction) are both named as dependent variables. Explore will
perform a separate analysis on each. The variable defining the groups to
be compared, in this instance SEX, is given as the Factor variable. Thus
SPSS will produce summaries for each gender group. Finally, while not
shown in the dialog box above, we also used the Options pushbutton to
request that missing values should be treated separately for each
dependent variable (Pairwise option). We mentioned in Chapter 6 that
Explore's default is to exclude a case from analysis if it contains a missing
value for any of the dependent variables. Here we want to avoid
excluding the overall satisfaction scores for those never married (and who
are coded as missing on the AGEWED variable).
Click the Options pushbutton
Click the Exclude cases pairwise option button
Click Continue
Click OK
The command in SPSS to perform this analysis appears below.
EXAMINE
VARIABLES=agewed satmean BY sex
/MISSING PAIRWISE
/NOTOTAL.
AGEWED and SATMEAN are named as the dependent variables.
Variables following the keyword BY are treated as independent variables
(or Factors). The MISSING subcommand requests pairwise case deletion
(explained earlier). While not required, the NOTOTAL subcommand
instructs SPSS to display only the subgroup summaries and plots,
suppressing results for the entire (total) sample. If NOTOTAL were
dropped, we would first view the results for the entire sample (as in
Chapter 6), followed by each subgroup. First we consider age when first
married.
Figure 7.5 Summaries of Age When First Married
Note
The original output for Figure 7.5 was edited using the Pivot Table editor
to facilitate the male to female comparisons (steps outlined below).
Right click on the Descriptives pivot table and select SPSS Pivot
Table Object..Open from the Context menu
Click Pivot..Pivoting Trays to activate the Pivoting Trays
window (if necessary)
Drag the pivot tray icon for sex from the Row dimension tray to
the Column dimension tray
Click File..Close to close the Pivot Table Editor
Notice that the mean (male 23.93; female 21.82) is higher than both
the median and trimmed mean for each gender, which suggests some
skewness to the data. This is confirmed by the positive skewness
measures and the stem & leaf diagrams. Note that the mean for females
(21.82) is about 2 years lower than the male average. Also the sample
standard deviation of age first married is 4.81 for the females and 4.72
for the males, suggesting the standard deviations in each population are
about the same, and that the homogeneity of variance assumption has
probably been met.
Figure 7.6 Males: Stem & Leaf Plot of Age When First Married
Figure 7.7 Females: Stem & Leaf Plot of Age When First Married
Viewing the stem & leaf diagrams with normality in mind, we might
say each is unimodal (a single peak) but skewed to the right, and thus not
normal. However, keeping in mind our earlier discussion of assumptions,
since both gender groups show a similar skewed pattern, we will not be
concerned since the sample sizes are fairly large and the distributions are
similar in the two groups.
Figure 7.8 Box & Whisker Plot of Age First Married for Males and
Females
The box & whisker plot provides visual confirmation of the mean
(actually median) differences between the two samples. The side-by-side
comparison shows that the groups have a similar pattern of positive
skewness. Outliers are identified and might be checked against the
original data for errors; we considered this issue when we performed
exploratory data analysis on age first married for the entire sample.
Based on these plots and summaries we might expect to find a significant
mean difference in age when first married between men and women.
Also, since the two groups have a similar distribution of data values
(positively skewed) with large samples, we feel comfortable about the
normality assumption to be made when testing for mean differences.
Next we turn to the summaries for the overall satisfaction measure.
Figure 7.9 Summaries of Overall Satisfaction
Figure 7.11 Females: Stem & Leaf Plot of Overall Satisfaction
The mean of overall satisfaction was 2.58 for the males and 2.47 for
the females (1=Satisfied a Very Great Deal, 7=Not at all Satisfied)
indicating, on the whole, they were satisfied with life. The means are
slightly above their respective medians and trimmed means; the
skewness measures are several standard errors from zero; the stem &
leaf diagrams show outliers at the high end. All these signs indicate we
again have moderate positive skewness in this sample. The stem & leaf
plot for the females shows a slight positive skewness, similar to that of
the male sample. Notice also that the standard deviations for the groups
were about the same (.91, .94) so we are unlikely to have a problem with
the homogeneity of variance assumption. To compare the groups directly,
we move to the box & whisker plot.
Figure 7.12 Box & Whisker Plot of Overall Satisfaction
T TEST
Figure 7.13 Compare Means Menu
Using SPSS, after clicking Independent-Samples T Test, we first
indicate the dependent measure(s) or Test variable. We specify both
AGEWED and SATMEAN, which will yield two separate analyses. The
Group or independent variable is SEX. Thus we wish to compare men
and women in their mean age when first married and mean overall
satisfaction measure.
Figure 7.14 Independent-Samples T Test Dialog Box
We have provided the basic information to SPSS, but notice the two
question marks following the variable SEX in the Grouping Variable box.
SPSS requires that you indicate which groups are to be compared, which
is usually done by providing the data values for the two groups. Since
gender is coded 1 for males and 2 for females, we must supply these
numbers using the Define Groups dialog box.
Click the Define Groups pushbutton
Enter 1 as the first and 2 as the second group code
Figure 7.15 T Test Define Groups Dialog Box
We have identified the values defining the two groups to be
compared. The cut point choice is rarely used, but if the independent
(grouping) variable is numeric, then you can give a single cut point value
to define the two groups. Those cases less than or equal to the cut point
go into the first group, and those greater than the cut point fall into the
second group.
Click Continue
Figure 7.16 Completed T Test Dialog Box
Our specifications are complete. By default, the procedure will use all
valid responses for each dependent variable in the analysis.
Click OK
The SPSS T-Test command is shown below.
T-TEST
GROUPS=sex(1 2)
/MISSING=ANALYSIS
/VARIABLES=agewed satmean
/CRITERIA=CIN(.95).
The GROUPS subcommand instructs SPSS that an independent
groups t test is to be performed comparing SEX groups 1 (males) and 2
(females). AGEWED and SATMEAN are named as the dependent
variables on the VARIABLES subcommand. We now advance to interpret
the results.
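Both versions of the t test that SPSS will report (equal variances assumed and not assumed) can be reproduced outside SPSS. A hedged sketch using scipy; the two samples are made-up numbers standing in for the gender groups, not GSS data:

```python
from scipy import stats

# Illustrative samples standing in for the two gender groups:
group1 = [20, 22, 25, 23, 27, 24, 26, 21, 28, 24]  # e.g. males
group2 = [19, 21, 20, 22, 23, 20, 21, 22, 19, 23]  # e.g. females

# Standard t test (the "Equal variances assumed" row):
t_eq = stats.ttest_ind(group1, group2, equal_var=True)
# Welch-adjusted t test (the "Equal variances not assumed" row):
t_uneq = stats.ttest_ind(group1, group2, equal_var=False)

print(round(t_eq.pvalue, 4), round(t_uneq.pvalue, 4))
```

With these well-separated samples both versions return a small significance value, and you would choose between them based on your evaluation of the homogeneity of variance question.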
T TEST RESULTS
FOR AGE FIRST
MARRIED
We will first look at the output for age when first married. Please note
that in the original output, the test results for both dependent variables
were displayed in a single pivot table, but for discussion purposes we
present the agewed and satmean results separately.
Figure 7.17 Summaries for Age First Married
Homogeneity
The Levene test yields a probability of about 8 chances in 100 (or .08).
This is above the common (.05) cut-off, so we conclude there is no
evidence that the standard deviations differ in the two population
groups, and the homogeneity requirement is met. If this seems too
complicated, some
authors suggest the following simplified rules: (1) If the sample sizes are
about the same, don't worry about the homogeneity of variance
assumption; (2) If the sample sizes are quite different, then take the ratio
of the standard deviations in the two groups and round it to the nearest
whole number. If this rounded number is 1, don't worry about lack of
homogeneity of variance.
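The second rule of thumb is easy to sketch in code (the sample values are illustrative):

```python
from statistics import stdev

def sd_ratio_rule(a, b):
    """Rule-of-thumb check: round the ratio of the larger to the smaller
    sample standard deviation; a result of 1 suggests homogeneity of
    variance is unlikely to be a problem."""
    s1, s2 = stdev(a), stdev(b)
    return round(max(s1, s2) / min(s1, s2))

print(sd_ratio_rule([20, 21, 22], [23, 24, 25]))  # 1 -> no concern
print(sd_ratio_rule([20, 21, 22], [18, 24, 30]))  # well above 1
```

Applied to the age-first-married data (standard deviations 4.81 and 4.72), the ratio rounds to 1, consistent with the Levene result.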
T TEST
Finally, two versions of the t test appear. The row labeled "Equal
variances assumed" contains results of the standard t test, which
assumes homogeneity of variance. The second row, labeled "Equal
variances not assumed," contains an adjusted t test that corrects for lack
of homogeneity (heterogeneity of variance) in the data. You would choose
one or the other based on your evaluation of the homogeneity of variance
question. The actual t value and df (degrees of freedom) are technical
summaries: the first measures the magnitude of the group difference,
the second is a value related to the sample sizes. To interpret the
results, move to the column labeled "Sig. (2-tailed)". This is the
probability (rounded to
.000, meaning it is less than .0005), of our obtaining sample means as far
or further apart (2.1 years), by chance alone, if the two populations
(males and females) actually have the same mean age when first married.
Thus the probability of obtaining such a large difference by chance alone
is quite small (less than 5 in 10,000), so we would conclude there is a
significant difference in age first married between men and women.
Notice we would draw the same conclusion if the unequal variance t test
were applied.
The term two-tailed test indicates that we are interested in testing
for any difference in age first married between men and women, that is,
in either the positive or negative direction (hence the two tails).
Researchers with hypotheses that are directional, for example,
specifically that men marry at older ages than women, can use one-tailed
tests to address such questions in a more sensitive fashion. Broadly
speaking, two-tailed tests look for any difference between groups, while a
one-tailed test focuses on a difference in a specific direction. Two-tailed
tests are most commonly done since the researcher is usually interested
in any difference between the groups, regardless of which is higher.
If interested, you can obtain the one-tailed t test result directly from
the two-tailed significance value that SPSS displays. For example,
suppose you wish to test the directional hypothesis that in the population
men first marry at an older age than women, the null hypothesis being
that either women first marry at an older age than men or there is no
gender difference. You would simply divide the two-tailed significance
value by 2 to obtain the one-tailed probability, and verify that the pattern
of sample means is consistent with your hypothesized direction (that men
first marry at an older age). Thus if the two-tailed significance value were
.0005, then the one-tailed significance value would be half that value
(.00025), provided the sample means differ in the hypothesized direction
(otherwise it is 1 - p/2, where p is the two-tailed value). To learn more
about the differences and logic behind one and two-tailed testing, see
SPSS Guide to Data Analysis (Norusis, 2001) or an introductory statistics
book.
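The conversion described above can be sketched as a small helper (the function name and argument convention are illustrative):

```python
def one_tailed_p(p_two_tailed, observed_diff, expect_positive=True):
    """Convert a two-tailed significance value to a one-tailed one.
    Halve p when the sample difference lies in the hypothesized
    direction; otherwise the one-tailed value is 1 - p/2."""
    in_direction = (observed_diff > 0) == expect_positive
    return p_two_tailed / 2 if in_direction else 1 - p_two_tailed / 2

# Men minus women = +2.1 years, consistent with "men marry older":
print(one_tailed_p(0.0005, 2.1))   # 0.00025
# Had the sample difference pointed the other way:
print(one_tailed_p(0.0005, -2.1))  # 0.99975
```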
CONFIDENCE
BAND FOR MEAN
DIFFERENCE
SUMMARY FOR
AGE FIRST
MARRIED
T TEST FOR
OVERALL
SATISFACTION
Here the sample means are very close, as are the standard deviations.
The standard errors are the expected standard deviations of the sample
means if the study were repeated with the same sample sizes. The
difference between men and women is tiny (.10 units on the 1 to 7 scale).
Note that the standard deviations of the two groups are fairly close (.91
and .94) and the Levene test returns a probability value of about 62%
(.62), well above our .05 cut-off. It is a good idea to keep the sample size
in mind when evaluating the homogeneity test(s), because with
increasing sample size the sample standard deviations are estimated
more precisely, and so smaller differences become statistically
significant. Thus if the Levene test were significant, but the sample sizes
were large and the ratio of the sample standard deviations were near 1,
then the equal variance t test should be quite adequate.
Proceeding to the t test itself, the significance value of .207 indicates
that if the null hypothesis of no gender difference in overall satisfaction
in the population were true, then there is about a 21% chance of
obtaining sample means as far (or further) apart as we observe in our
data (difference of .10 units). This is not significant (well above .05) and
we conclude there is no evidence of men differing from women in overall
satisfaction. Notice that the 95% confidence band of the male-female
difference includes 0. This is another reflection of the fact that we cannot
conclude the populations are different on the satisfaction measure.
In summary, we found no indication of a gender difference in overall
satisfaction.
DISPLAYING
MEAN
DIFFERENCES
The T Test procedure displays the means and appropriate statistical test
information. When presenting these results, a summary chart is often
desirable. Bar charts can be easily produced in which the height of each
mechanism to display the precision with which the mean was estimated
(95% confidence band) on an SPSS standard bar chart, although
Interactive Graph bar charts can display standard errors. A type of chart
called the error bar shows both the group means and precision. We will
produce an error bar chart showing the gender difference in age when
first married.
Click Graphs..Interactive..Error Bar
Click Reset button, then click OK to confirm
Drag and drop Age When First Married [agewed] to the
vertical arrow box
Drag and drop Respondent's Sex [sex] to the horizontal
arrow box
Figure 7.21 Create Error Bar Chart Dialog Box
By default, the error bars will represent the 95% confidence band
applied to the sample means.
Click OK
The SPSS syntax command that produces the chart appears below.
IGRAPH /VIEWNAME='Error Bar Chart'
/X1 = VAR(sex) TYPE = CATEGORICAL
/Y = VAR(agewed) TYPE = SCALE
/COORDINATE = VERTICAL
/X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
/CATORDER VAR(sex) (ASCENDING VALUES OMITEMPTY)
/ERRORBAR KEY ON CI(95.0) DIRECTION BOTH CAPSTYLE
= T SYMBOL = ON.
Figure 7.22 Error Bar Chart of Age First Married by Gender
The small square in the middle of each error bar represents the
sample group mean of age when first married, and the attached bars are
the upper and lower limits for the 95% confidence band on the sample
mean. Thus we can directly compare groups and view the precision with
which the group means have been estimated. Notice that the lower bound
for men is well above the upper bound for women, indicating that these
groups are well separated and that the population difference is statistically
significant. Such charts are especially useful when more than two groups
are displayed, since one can quickly make informal comparisons between
any groups of interest.
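The bands the chart displays can be computed directly. The sketch below uses the normal approximation (z = 1.96), which is adequate for large samples; SPSS uses the exact t value, and the two samples here are made-up stand-ins for the gender groups:

```python
from math import sqrt
from statistics import mean, stdev

def ci95(x, z=1.96):
    """95% confidence band for a sample mean, normal approximation."""
    m = mean(x)
    se = stdev(x) / sqrt(len(x))
    return m - z * se, m + z * se

men   = [22, 24, 23, 25, 26, 24, 23, 25, 22, 26]
women = [20, 21, 22, 21, 20, 22, 23, 21, 20, 22]

lo_m, hi_m = ci95(men)
lo_w, hi_w = ci95(women)
# Non-overlapping bands, as in the chart, signal a clear separation:
print(lo_m > hi_w)  # True
```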
SUMMARY
In this chapter we introduced the logic and assumptions used to test for
mean differences between population groups. We saw how exploratory
data analysis methods contribute information directly relevant to such
tests, then performed t tests comparing men to women on two measures:
age first married and overall satisfaction. Finally, error bar charts were
offered as a tool to visually portray the results of the analysis.
APPENDIX:
PAIRED T TEST
The paired t test is used to test for statistical significance between two
population means when each observation contributes to both means. In
medical research a paired t test would be used to compare means on a
measure administered both before and after some type of treatment. Here
each patient is tested twice and is used in calculating both the pre- and
post-treatment means. In market research, if a subject were to rate the
product they usually purchase and a competing product on some
attribute, a paired t test would be needed to compare the means. In an
industrial experiment, the same operators might run their machines
using two different sets of guidelines in order to compare average
performance scores. Again, the paired t test is appropriate. Each of these
examples differs from the independent groups t test in which an
observation falls into one and only one of the two groups. The paired t
test entails a slightly different statistical model since when a subject
appears in each condition, he acts as his own control. To the extent that
an individual's outcomes across the two conditions are related, the paired
t test provides a more powerful statistical analysis (greater probability of
finding true effects) than the independent groups t test.
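The defining feature of the paired test is that it operates on per-subject difference scores, so each subject acts as his own control. A minimal stdlib sketch of that computation follows; this is not SPSS's T-TEST procedure, and the before/after measurements are invented for illustration.

```python
# Paired t statistic: form a difference score per subject, then
# t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom.
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t, df) for paired samples x and y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Hypothetical pre- and post-treatment scores for eight subjects
before = [12, 14, 11, 13, 15, 12, 14, 13]
after = [14, 15, 12, 15, 16, 12, 16, 14]

t, df = paired_t(after, before)
print(f"t = {t:.3f} with {df} degrees of freedom")
# prints: t = 5.000 with 7 degrees of freedom
```

Because the subtraction removes stable between-subject differences, the standard error in the denominator is smaller when the two measures are correlated, which is the source of the extra power mentioned above.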
To demonstrate a paired t test using the General Social Survey data
from 1994 we will compare mean education levels of the mothers and
fathers of the respondents. The paired t test is appropriate because we
will obtain data from a single respondent as to his/her parents' education.
We are interested in testing whether there is a significant difference in
education between fathers and mothers in the population. Keep in mind
that while the population we sample from is a U.S. adult population, the
questions pertain to their parents' education. Thus the population our
conclusion directly generalizes to would be parents of U.S. adults. To test
directly for differences between men and women in the U.S. population,
we could run an independent-groups t test comparing mean education
level for men and women.
While not pursued here, we would recommend running exploratory
data analysis on the two variables to be tested. The homogeneity of
variance assumption does not apply since we are dealing with but one
group. Normality is assumed, but technically it applies to the difference
scores, obtained by subtracting for each observation the two measures to
be compared. In SPSS this issue can be investigated by computing a new
variable that is the difference between the two measures, then running
Explore on this variable.
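The check recommended above can be sketched outside SPSS as well: compute the per-case difference, then summarize it before judging normality. The maeduc/paeduc values below are invented for illustration.

```python
# Form the difference score (the variable one would run Explore on)
# and summarize it with mean, standard deviation, and quartiles.
from statistics import mean, stdev, quantiles

# Hypothetical mother's and father's education (years) per respondent
maeduc = [12, 12, 14, 10, 16, 12, 11, 12, 13, 12]
paeduc = [12, 13, 12, 10, 16, 14, 10, 12, 12, 11]

diff = [m - p for m, p in zip(maeduc, paeduc)]  # difference scores
q1, med, q3 = quantiles(diff, n=4)
print(f"mean={mean(diff):.2f} sd={stdev(diff):.2f} "
      f"quartiles=({q1}, {med}, {q3})")
```

A roughly symmetric, bell-shaped distribution of these differences is what the normality assumption of the paired t test requires.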
Click Analyze..Compare Means..Paired-Samples T Test
Click on maeduc and next on paeduc (in the Current
Selections box, maeduc will be listed as Variable 1 and
paeduc as Variable 2)
Click the arrow to move the two variables to the Paired
Variables: box
The Paired-Samples T Test dialog box appears below.
Figure 7.23 Paired-Samples T Test Dialog Box
Click OK
In SPSS both the independent groups and the paired samples t test
are produced from the same command.
T-TEST
PAIRS= maeduc WITH paeduc (PAIRED)
/CRITERIA=CIN(.95)
/MISSING=ANALYSIS.
Thus mother's and father's education in years are the variables to be
tested for mean differences using the paired sample t test.
Figure 7.24 Summaries of Differences in Parents' Education
Figure 7.25 Paired T Test of Differences in Parents' Education
What might first attract our attention is that the means for mothers
and fathers are extremely close (within .1 years of each other). This might
indicate very close educational matching of people who marry. Another
possibility might be incorrect reporting of parents' formal education by
the respondent with a bias toward reporting the same value for both. The
sample size (number of pairs) appears along with the correlation between
mother's and father's education. Correlations and their significance tests
will be studied in a later chapter, but we note that the correlation (.643)
is positive, substantial and statistically significant (differs from zero in
the population). This result (significant correlation of mother's and father's
education) supports our choice of the paired t test in place of the
independent sample t test. The mean formal education difference, .04
years, is reported along with the sample standard deviation and standard
error (based on the parents' education difference score computed for each
respondent). The t statistic is very small and the significance value (.572)
indicates that if mothers and fathers in the population had the same
formal education (null hypothesis) then there is a 57% chance of
obtaining as large (or larger) a difference as we obtained in our sample
(.04 years). Thus the data are quite consistent with the null hypothesis of
no difference between mothers and fathers in education. In summary,
there is no evidence of a significant difference.
APPENDIX:
NORMAL
PROBABILITY
PLOTS
The Examine procedure (Explore menu choice) will display a stem & leaf
diagram useful for evaluating the shape of the distribution of the
dependent measure within each group. Since one of the t test
assumptions is that these distributions are normal, we implicitly compare
the stem & leaf plots to the well-known normal bell-shaped curve. If a
more direct consideration of normality is desired, the Examine procedure
can produce a normal probability plot and a fit test of normality. In this
section we return to the Explore dialog box and request these features.
Earlier in the chapter we used the Explore dialog box to explore age
when first married (AGEWED) and overall satisfaction (SATMEAN) for
the two gender groups. If we return to this dialog box by clicking the
Dialog Recall tool (or choosing Analyze..Descriptive Statistics..Explore),
we note it retains the settings from our last analysis.
Click the Dialog Recall tool
As mentioned in the discussion concerning homogeneity of variance,
the spread & level plot can be used to find a variance stabilizing
transformation for the dependent measure. Also, note that a histogram
can be requested in addition to the stem & leaf plot.
Click Continue
Click OK
To request the same analysis using SPSS syntax we use the following
instruction.
EXAMINE
VARIABLES=agewed satmean BY sex
/PLOT BOXPLOT STEMLEAF NPPLOT
/COMPARE GROUP
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
By default, box & whisker (BOXPLOT) and stem & leaf (STEMLEAF)
plots are produced. Adding the NPPLOT (normal probability plot)
keyword to the PLOT subcommand will have SPSS build normal
probability plots, detrended normal plots and perform normality tests.
Figure 7.28 Normal Probability Plot - Females
To produce a normal probability plot, the data values (here age first
married) are first ranked in ascending order. Then the normal deviate
corresponding to each rank (compared to the sample size) is calculated
and plotted against the observed value. Thus the vertical axis of the
normal probability plot presents normal deviates (based on the rank of
the observation) while the actual data values appear along the horizontal
axis. The individual points (squares) represent the female data, while the
straight line indicates the pattern we would see if the data were perfectly
normal. If age first married followed a normal distribution for females,
the plotted values would closely approximate the straight line.
The advantage of a normal probability plot is that instead of
comparing a histogram or stem & leaf plot to the normal curve (more
complicated), you need only compare the plot to a straight line (a simple
comparison indeed). The plot above confirms what we concluded earlier
from the stem & leaf diagram: that for females, age when first married
does not follow a normal distribution.
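The ranking-and-deviate construction just described can be sketched in a few lines of plain Python. This is illustrative only; it uses Blom's plotting position, one common choice (and SPSS's default), and the data values are invented.

```python
# Build normal probability plot coordinates: sort the data, convert
# each rank to an expected normal deviate via a plotting position,
# and pair it with the observed value.
from statistics import NormalDist

data = sorted([19, 21, 22, 24, 25, 27, 30, 35])  # hypothetical ages
n = len(data)

points = []
for i, value in enumerate(data, start=1):
    p = (i - 0.375) / (n + 0.25)        # Blom plotting position
    z = NormalDist().inv_cdf(p)         # expected normal deviate
    points.append((value, z))           # x = data value, y = deviate

for value, z in points:
    print(f"value={value:3d}  expected z={z:+.3f}")
# If the data were perfectly normal, these (value, z) pairs would
# fall on a straight line.
```

Curvature in the plotted points away from the reference line is exactly the departure from normality the text describes.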
TEST OF
NORMALITY
DETRENDED
NORMAL PLOT
We see that the major deviations from the normal occur in the tails of
the distribution. It might be useful to mention that the same conclusion
could have been drawn from the stem & leaf diagram, a histogram, or the
normal probability plot. The detrended normal plot is a more technical
plot that allows the researcher to focus in detail on the specific locus and
form of deviations from normality.
A normal probability plot and a detrended normal plot also appear for
the males. These will not be displayed here since our aim was to
demonstrate the purpose and use of these charts, and not to repeat the
investigation of normality.
Method
Data
Scenario
INTRODUCTION
LOGIC OF
TESTING FOR
MEAN
DIFFERENCES
The basic logic of significance testing is the same as that for the t test: we
assume the population groups have the same means (null hypothesis),
then determine the probability of obtaining a sample with group mean
differences as large (or larger) as what we find in our data. To make this
assessment the amount of variation among group means (between-group
variation) is compared to the amount of variation among observations
within each group (within-group variation). Assuming in the population
that the group means are identical (null hypothesis), the only source of
variation among sample means would be the fact that the groups are
composed of different individual observations. Thus a ratio of the two
sources of variation (between group / within group) should be about 1
when there are no population differences. When the distribution of
individual observations within each group follows the normal curve, the
statistical distribution of this ratio is known (F distribution) and we can
make a probability statement about the consistency of our data with the
null hypothesis. The final result is the probability of obtaining sample
differences as large (or larger) as what we found if there were no
population differences. If this probability is sufficiently small (usually
less than 5 chances in 100, or .05) we conclude the population groups
differ.
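The between/within ratio described above can be computed directly. The sketch below (plain Python, invented data) builds the F ratio from its two sums of squares; under the null hypothesis it should be near 1, and large values indicate group differences.

```python
# One-factor ANOVA F ratio: between-group variation over
# within-group variation. Group data are invented.
from statistics import mean

groups = [
    [2, 3, 4, 3, 3],   # e.g. TV hours, group 1
    [4, 5, 5, 6, 5],   # group 2
    [6, 7, 8, 7, 7],   # group 3
]

grand = mean(v for g in groups for v in g)
k = len(groups)
n_total = sum(len(g) for g in groups)

ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

ms_between = ss_between / (k - 1)        # df = k - 1
ms_within = ss_within / (n_total - k)    # df = N - k
F = ms_between / ms_within               # near 1 under the null
print(f"F = {F:.2f}")
# prints: F = 40.00
```

The significance value SPSS reports is the probability of an F this large or larger under the F distribution with (k - 1, N - k) degrees of freedom.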
FACTORS
EXPLORING THE
DATA
An exploratory analysis of TV hours will appear for each degree
group. The NOTOTAL keyword suppresses overall summaries for the
entire sample. For brevity in this presentation we move directly to the
box and whisker plot.
Figure 8.2 Box & Whisker Plot of TV Hours by Degree Groups
The median hours of daily TV watched seems higher for those with a
high school or lesser degree than for those with at least some
college. Each group exhibits a positive skew that is more exaggerated for
those with a high school or lesser degree. Some individuals report
watching rather large amounts of daily TV; one might want to examine
the original survey to check for data errors or evidence of
misunderstanding the question. Also, based on the box heights
(interquartile ranges), it looks as if those with a high school degree or less
show greater within-group variation than the others. This suggests a
potential problem with homogeneity of variance, especially since the
sample sizes are quite disparate. However, we might also note there
doesn't seem to be any simple pattern between the median level and the
interquartile range (for example as one increases so does the other) that
might suggest a data transformation to stabilize the within-group
variance. We will come back to this point after testing for homogeneity of
variance. Let's move on to the actual analysis.
Figure 8.3 One-Way ANOVA Dialog Box
Enough information has been provided to run the basic analysis. The
Contrasts pushbutton allows users to request statistical tests for planned
group comparisons of interest. The Post Hoc pushbutton will produce
multiple comparison tests that can test each group mean against every
other one. Such tests facilitate determination of just which groups differ
from which others and are usually performed after the overall analysis
establishes that some significant differences exist. We will examine such
tests in the next section. Finally, the Options pushbutton controls such
diverse features as missing value handling and whether descriptive
statistics, means plots, and homogeneity tests are desired. We want both
descriptive statistics (although having just run Explore, they are not
necessary) and the homogeneity of variance test.
Click Options pushbutton
Check Descriptive check box
Check Homogeneity of variance test check box
Check Brown-Forsythe and Welch check boxes
As mentioned earlier, ANOVA assumes homogeneity of within-group
variance. However, when homogeneity does not hold there are several
adjustments that can be made to the F test. We request these optional
statistics because the box & whisker plots and the homogeneity of
variance test (not shown here) indicate that the homogeneity of variance
assumption does not hold. Note that these tests still assume normality of
the residuals.
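The idea behind the homogeneity check these options address can be sketched outside SPSS. The Brown-Forsythe approach replaces each value by its absolute deviation from the group median and then compares those spread scores across groups (a formal test runs an ANOVA on them); the sketch below, with invented data, shows only the transformation and the informal comparison.

```python
# Brown-Forsythe-style spread check: absolute deviation from the
# group median measures within-group spread on a comparable scale.
from statistics import mean, median

groups = [
    [1, 2, 2, 3, 2],      # low spread
    [0, 3, 6, 9, 12],     # high spread
]

spread_scores = [[abs(v - median(g)) for v in g] for g in groups]

for i, s in enumerate(spread_scores, 1):
    print(f"group {i}: mean absolute deviation from median = {mean(s):.2f}")
# Very different mean deviations suggest unequal within-group variance.
```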
Figure 8.4 One-Way ANOVA Options Dialog Box
The missing value choices deal with how missing data are to be
handled when several dependent measures are given. By default, cases
with missing values on a particular dependent variable are dropped only
for the specific analysis involving that variable. Since we are looking at a
single dependent variable, the choice has no relevance to our analysis.
The Means plot option will produce a line chart displaying the group
means; we will request an error bar plot later.
Click Continue
Click OK
The same analysis can be performed in SPSS syntax using the
ONEWAY procedure. TVHOURS is the dependent measure and the
keyword BY separates the dependent variable from the factor variable.
We request descriptive statistics, a homogeneity of variance test, and two
tests that do not make the homogeneity assumption (Brown-Forsythe and
Welch tests).
ONEWAY
tvhours BY degree
/STATISTICS DESCRIPTIVES HOMOGENEITY
BROWNFORSYTHE WELCH .
We now turn to interpret the results.
ONE-FACTOR
ANOVA RESULTS
Figure 8.6 Descriptive Statistics for Groups
The pattern of means is largely consistent with the box & whisker
plot in that those with less formal education watch more TV than those
with more formal education. The 95% confidence bands for the degree
group means gauge the precision with which we have estimated the
means and we can informally compare groups by comparing their
confidence bands. The minimum and maximum values for each group are
valuable as a data check; we again note some surprisingly large numbers.
Often at this point there is interest in making a statement about just
which of the five groups differ significantly from which others. This is
because the overall F statistic simply tested the null hypothesis that all
population means were the same. Typically, you now want to make more
specific statements than merely that the five groups are not identical.
Post Hoc tests permit these pairwise group comparisons and we will
pursue them.
Unfortunately the null hypothesis assuming homogeneity of within-group
variance is rejected at the rounded .000 (less than .0005) level. Our
sample sizes are quite disparate (see Figures 8.6 or 8.2) so we cannot
count on robustness due to equal sample sizes. For this reason we turn to
the Brown-Forsythe and Welch tests, which test for equality of group
means without assuming homogeneity of variance. Since these results
will not be calculated by default, you would request them based on
homogeneity tests done in the Explore or Oneway procedures.
Figure 8.8 Robust Tests of Mean Differences
among different educational degree groups, we probe to find specifically
which groups differ from which others.
POST HOC
TESTING OF
MEANS
Post hoc tests are typically performed only after the overall F test
indicates that population differences exist, although for a broader view
see Milliken and Johnson (1984). At this point there is usually interest in
discovering just which group means differ from which others. In one
aspect, the procedure is quite straightforward: every possible pair of
group means is tested for population differences and a summary table
produced. However, a problem exists in that as more tests are performed,
the probability of obtaining at least one false-positive result increases. As
an extreme example, if there are ten groups, then 45 pairwise group
comparisons (n*(n-1)/2) can be made. If we are testing at the .05 level, we
would expect to obtain on average about 2 (.05 * 45) false-positive tests.
In an attempt to reduce the false-positive rate when multiple tests of this
type are done, statisticians have developed a number of methods.
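The arithmetic behind that concern is easy to verify. The sketch below computes the number of pairwise tests for ten groups, the expected count of false positives at the .05 level, and the family-wise error rate under the simplifying assumption (made here only for illustration) that the tests are independent.

```python
# Multiple-testing inflation: k groups yield k*(k-1)/2 pairwise tests.
k = 10
alpha = 0.05

n_tests = k * (k - 1) // 2                # 45 pairwise comparisons
expected_fp = alpha * n_tests             # expected false positives
# Family-wise rate below assumes independent tests (an approximation).
familywise = 1 - (1 - alpha) ** n_tests   # P(at least one false positive)

print(f"{n_tests} tests, ~{expected_fp:.2f} expected false positives")
print(f"family-wise error rate if uncorrected: {familywise:.2f}")
# prints: 45 tests, ~2.25 expected false positives
#         family-wise error rate if uncorrected: 0.90
```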
WHY SO MANY
TESTS?
The ideal post hoc test would demonstrate tight control of Type I (false-positive)
error, have good statistical power (probability of detecting true
population differences), and be robust over assumption violations (failure
of homogeneity of variance, nonnormal error distributions).
Unfortunately, there are implicit tradeoffs involving some of these
desired features (Type I error and power) and no current post hoc
procedure is best in all these areas. Couple to this the facts that there are
different statistical distributions that pairwise tests can be based on (t, F,
studentized range, and others) and that there are different levels at
which Type I error can be controlled (per individual test, per family of
tests, variations in between), and you have a large collection of post hoc
tests.
We will briefly compare post hoc tests from the perspective of being
liberal or conservative regarding control of the false-positive rate and
apply several to our data. There is a full literature (and several books)
devoted to the study of post hoc (also called multiple comparison or
multiple range tests, although there is a technical distinction between the
two) tests. More recent books (e.g., Toothaker, 1991) summarize simulation
studies that compare multiple comparison tests on their power
(probability of detecting true population differences) as well as
performance under different scenarios of patterns of group means, and
assumption violations (homogeneity of variance).
The existence of numerous post hoc tests suggests that there is no
single approach that statisticians agree will be optimal in all situations.
In some research areas, publication reviewers require a particular post
hoc method, simplifying the researcher's decision. For more detailed
discussion and recommendations, short books by Klockars and Sax
(1986), Toothaker (1991) or Hsu (1996) are useful. Also, for some thinking
on what post hoc tests ought to be doing see Tukey (1991) or Milliken and
Johnson (1984).
Below we present some tests available within SPSS, roughly ordered
from the most liberal (greater statistical power and greater false-positive
rate) to the most conservative (smaller false-positive rate, less statistical
power), and also mention some designed to adjust for lack of homogeneity
of variance.
LSD
SNK, REGWF,
REGWQ &
Duncan
Bonferroni &
Sidak
The Bonferroni (also called the Dunn procedure) and Sidak (also called
Dunn-Sidak) perform each test at a stringent significance level to ensure
that the family-wise (applying to the set of tests) false-positive rate does
not exceed the specified value. They are based on inequalities relating the
probability of a false-positive result on each individual test to the
probability of one or more false positives for a set of independent tests.
For example, the Bonferroni is based on an additive inequality, so the
criterion level for each pairwise test is obtained by dividing the original
criterion level (say .05) by the number of pairwise comparisons made.
Thus with five means, and therefore ten pairwise comparisons, each
Bonferroni test will be performed at the .05/10 or .005 level.
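The adjustment just described is simple enough to compute directly. The sketch below enumerates the pairwise comparisons for five degree groups (labels invented for illustration) and divides the criterion level accordingly.

```python
# Bonferroni adjustment: divide the criterion level by the number
# of pairwise comparisons made.
from itertools import combinations

groups = ["<HS", "HS", "JunCol", "Bachelor", "Graduate"]
alpha = 0.05

pairs = list(combinations(groups, 2))   # all pairwise comparisons
per_test = alpha / len(pairs)           # criterion for each test

print(f"{len(pairs)} pairwise comparisons; each tested at {per_test:.4f}")
# prints: 10 pairwise comparisons; each tested at 0.0050
```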
Tukey (b)
The Tukey (b) test is a compromise test, combining the Tukey (see below)
and the SNK criterion producing a test result that falls between the two.
Tukey
Scheffe
(also called Tukey HSD, WSD, or Tukey(a) test): Tukey's HSD (Honestly
Significant Difference) controls the false-positive rate family-wise. This
means if you are testing at the .05 level, that when performing all
pairwise comparisons, the probability of obtaining one or more false
positives is .05. It is more conservative than the Duncan and SNK. If all
pairwise comparisons are of interest, which is usually the case, Tukey's
test is more powerful than the Bonferroni and Sidak.
Scheffe's method also controls the family-wise error rate. It adjusts not
only for the pairwise comparisons, but also for any possible comparison
the researcher might ask. As such it is the most conservative of the
available methods (false-positive rate is least), but has less statistical
power.
SPECIALIZED
POST HOCS
Unequal Ns:
Hochberg's GT2
& Gabriel
Most post hoc procedures mentioned above (excepting LSD, Bonferroni &
Sidak) were derived assuming equal group sample sizes in addition to
homogeneity of variance and normality of error. When the subgroup sizes
are unequal, SPSS substitutes a single value (the harmonic mean) for the
sample size. Hochberg's GT2 and Gabriel's post hoc tests explicitly allow
for unequal sample sizes.
Waller-Duncan
Unequal
Variances and
Unequal Ns:
Tamhane T2,
Dunnett's T3,
Games-Howell,
Dunnett's C
Each of these post hoc tests adjusts for unequal variances and sample
sizes in the groups. Simulation studies (summarized in Toothaker, 1991)
suggest that although Games-Howell can be too liberal when the group
variances are equal and sample sizes are unequal, it is more powerful
than the others.
ANOVA
Click on the Post Hoc pushbutton
Click the LSD (Least Significant Difference), R-E-G-W-F
(Ryan-Einot-Gabriel-Welsch F), Scheffe and Games-Howell check
boxes
Figure 8.9 Post Hoc Testing Dialog Box
Click Continue
Click OK
By default, statistical tests will be done at the .05 level. If you prefer
to use a different alpha value (for example, .01), you can specify it in the
Significance level box. The command to run the post hoc analysis
appears below.
ONEWAY
tvhours BY degree
/STATISTICS DESCRIPTIVES HOMOGENEITY
/POSTHOC = SCHEFFE LSD FREGW GH ALPHA(.05).
Post hoc tests are requested using the POSTHOC subcommand. The
STATISTICS subcommand need not be included here since we have
already viewed the means and discussed the homogeneity test.
The beginning part of the Oneway output contains the ANOVA table,
robust tests of mean differences, descriptive statistics, and homogeneity
test. We will move directly to the post hoc test results.
Note
Some of the pivot tables shown below were edited (changed column
widths; only one post hoc method shown in some figures) to better display
in this course guide.
Figure 8.10 Least Significant Difference Post Hoc Results
Summarizing the entire diagram, we would say that almost all degree
groups differ in amount of TV viewed daily and those with higher degrees
watch less TV. The only groups not different (graduates versus bachelors;
bachelors versus junior college) were adjacent degree categories. Since
LSD is the most liberal of the post hoc tests, we are interested if the same
results hold using more conservative criteria.
Figure 8.11 Homogeneous Subsets Results for REGWF Post Hoc Tests
The REGWF results are not presented in the same format as we saw
for the LSD. This is because for some of the post hoc test methods (for
example, the sequential or multiple-range tests) standard errors and 95%
confidence intervals for all pairwise comparisons are not defined. Rather
than display pivot tables with empty columns in such situations, a
different format, homogeneous subsets, is used. A homogeneous subset is
a set of groups for which no pair of group means differs significantly. This
format is closer in spirit to the nature of the sequential tests actually
performed by REGWF. Depending on the post hoc test requested, SPSS
will display a multiple comparison table, a homogeneous subset table, or
both. Recall that REGWF tests first the most extreme, then the less
extreme means, adjusting for the number of means in the comparison set.
Viewing the REGWF portion of the table, we see four homogeneous
subsets (four columns). The first is composed of graduate and bachelor
degree groups; they do not differ, but one or the other differs from the
three remaining degree groups. Bachelor and junior college degree groups
are the second homogeneous subset (they do not differ significantly).
Notice that the third and fourth homogeneous subsets contain one group
each. This is because the high school group differs from each of the
others, as does the group with less than a high school degree. The
homogeneous subset pivot table thus displays where population
differences do not exist (and by inference, where they do). The results for
REGWF match what we obtained using LSD.
A homogeneous subset summary appears for the Scheffe test as well.
We will discuss the Scheffe shortly. Here we just point out that the
results are similar, except for subset 3, in which junior college does not
differ from high school as it did for the REGWF. This is consistent with
the Scheffe being a more conservative test.
Figure 8.12 Scheffe Post Hoc Results
Figure 8.13 Games-Howell Post Hoc Results
GRAPHING THE
RESULTS
The setup for the error bar chart is very straightforward. While not
shown, we used the Titles tab sheet to give a title to the chart. The
command below produces the chart using SPSS syntax.
IGRAPH /VIEWNAME=Error Bar Chart
/X1 = VAR(degree) TYPE = CATEGORICAL
/Y = VAR(tvhours) TYPE = SCALE
/COORDINATE = VERTICAL /TITLE= 'Error Bar Chart'
/X1LENGTH = 3.0 /YLENGTH = 3.0 /X2LENGTH = 3.0
/CATORDER VAR(degree) (ASCENDING VALUES
OMITEMPTY)
/ERRORBAR KEY ON CI(95.0) DIRECTION BOTH
CAPSTYLE = T SYMBOL = ON.
We've requested an error bar chart with a 95% confidence band on
the sample means. A title is included. The final chart appears below.
Figure 8.15 Error Bar Chart of TV Hours by Degree Group
The chart provides a visual sense of how far the groups are separated.
The confidence bands are determined for each group separately and no
adjustment is made based on the number of groups that are compared.
From the graph we have a clear sense of the relation between formal
education degree and TV viewing.
SUMMARY
APPENDIX:
GROUP
DIFFERENCES
ON RANKS
nothing, the downside of such methods is that if the stronger data
assumptions hold, then nonparametric techniques are generally less
powerful (probability of finding true differences) than the appropriate
parametric method. Second, there are some parametric statistical
analyses that currently have no corresponding nonparametric method. I
think it is fair to say that boundaries separating where one would use
parametric versus nonparametric methods are in practice somewhat
vague, and statisticians can and often do disagree about which approach
is optimal in a specific situation. For more discussion of the common
nonparametric tests see Daniel (1978), Siegel (1956) or Wilcox (1996).
Because of our concerns about the lack of homogeneity of variance
and normality of TV hours viewed for our different degree groups, we will
perform a nonparametric procedure, which only assumes that the
dependent measure has ordinal (rank order) properties. The basic logic
behind this test, the Kruskal-Wallis test, follows. If we rank order the
dependent measure throughout the entire sample, we would expect under
the null hypothesis (no population differences) that the mean rank
(technically the sum of the ranks adjusted for sample size) should be the
same for each sample group. The Kruskal-Wallis test calculates the
ranks, the sample group mean ranks, and the probability of obtaining
average ranks (weighted summed ranks) as far apart (or further) as what
are observed in the sample, if the population groups were identical.
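The ranking step at the heart of this test can be sketched in plain Python. The implementation below assigns tied values their average rank (as the Kruskal-Wallis procedure does), then compares mean ranks per group; the TV-hours values are invented for illustration.

```python
# Kruskal-Wallis preliminaries: pool all values, rank them with ties
# given the average rank, then compute each group's mean rank.
def average_ranks(values):
    """Rank values 1..n; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

groups = {"no degree": [5, 4, 6], "HS": [3, 4, 2], "college": [1, 2, 1]}
pooled = [v for g in groups.values() for v in g]
ranks = average_ranks(pooled)

pos = 0
for name, g in groups.items():
    r = ranks[pos:pos + len(g)]
    print(f"{name}: mean rank = {sum(r) / len(r):.2f}")
    pos += len(g)
# prints mean ranks 7.83, 5.00, 2.17 for the three groups
```

Under the null hypothesis the mean ranks would be roughly equal; the chi-square statistic of the test measures how far apart they are.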
To run the Kruskal-Wallis test in SPSS, we will declare tvhours as
the Test Variable (from which ranks are calculated) and degree as the
independent or grouping variable.
Click Analyze..Nonparametric Tests..K Independent
Samples
Move tvhours into the Test Variable List: box
Move degree into the Grouping Variable: box
Note that the minimum and maximum value of the grouping variable
must be specified using the Define Range pushbutton.
Click the Define Range pushbutton
Enter 0 as the Minimum and 4 as the Maximum
Click Continue
Click OK
Figure 8.16 Analysis of Ranks
We see the pattern of mean ranks (remember smaller ranks imply
less TV watched) follows that of the original means of TVHOURS in that
the higher the degree, the less TV watched. The chi-square statistic used
in the Kruskal-Wallis test indicates that we are very unlikely (less than
.0005 or 5 chances in 10,000; we can edit the pivot table to obtain more
precision) to obtain samples with average ranks so far apart if the null
hypothesis (same distribution of TV hours in each group) were true. We
feel much more confident in our original analysis because we were able to
confirm that population differences exist without making all the
assumptions required for analysis of variance.
APPENDIX: HELP
IN CHOOSING A
STATISTICAL
METHOD
Note
Figure 8.18 Statistics Coach Main Dialog Box
Figure 8.19 Comparing Groups Help Dialog Box
Figure 8.20 How to Run Nonparametric Tests
Hint
HELP IN
INTERPRETING
STATISTICAL
RESULTS
Return to the Viewer window and scroll to the bottom
Right-click on the Test Statistics pivot table
Click Results Coach on the Context menu
(Alternatively, double-click on the pivot table to open the Pivot
Table editor, and then click the Results Coach tool on the
Format toolbar)
Click Next pushbutton three (3) times to display additional
information
Figure 8.21 Results Coach for Kruskal Wallis Tests Pivot Table
After calling the Results Coach, help appears that describes the
contents of the selected type of pivot table and how they are typically
used. Here, the Results Coach states what the Kruskal-Wallis test does
and indicates the meaning of a low significance value, which we recently
discussed in this chapter. This is a useful aid when interpreting
statistical results appearing in SPSS pivot tables.
Click the Close button
Method
Data
Scenario
INTRODUCTION
We wish to see whether there are regional and gender differences in the
amount of formal education U.S. adults have received. In the previous
chapter, highest education degree obtained (DEGREE) was considered an
independent variable; here we view highest year of school completed
(EDUC) as a dependent measure. While debatable, in this chapter we
argue that the amount of formal education one obtains may be viewed as
potentially influenced by one's region and gender. Previously we viewed
education degree as a factor that might influence the amount of TV
watched. Claims about just what causes what cannot be fully resolved
using survey studies (followers of Hume would argue they cannot be
resolved at all), and reside in the way we view the world (that region may
influence education, rather than education influencing the region you live
in). In summary, we will study mean differences in the dependent
measure (education) as a function of two independent variables (region,
gender).
ANOVA and will not be repeated here.
We will investigate whether there are differences in average years of
formal education for men and women living in different regions of the
country. Since two factors, region and gender, are under consideration,
we can ask three different questions. (1) Are there regional differences?
(2) Is there a gender difference? (3) Do region and gender interact?
As in earlier chapters, we begin by running an exploratory data
analysis, then proceed with more formal testing.
LOGIC OF
TESTING, AND
ASSUMPTIONS
HOW MANY
FACTORS?
INTERACTIONS
Figure 9.1 Main Effects, No Interaction
In the chart we see that the mean line for women is above that of the
men. In addition, there are differences among the four regions. However,
note that the gender difference is nearly identical for each region. This
equal distance between the lines (parallelism of lines) indicates there is
no interaction present.
Figure 9.2 No Main Effects, Strong Interaction
Here the overall means for men and women are about the same, as
are the means for each region (pooling the two gender groups). However,
the gender difference varies dramatically across the different regions: in
region B women have higher education, in regions A and D there is no
gender difference, and in region C males have higher education. We
cannot make a statement about gender differences without qualifying it
with region information, nor can we make regional claims without
mentioning gender. Strong interactions are marked by this crossover
pattern in the multiple-line chart.
Figure 9.3 One Main Effect, Weak Interaction
We see a gender difference for each of the four regions, but the
magnitude of this difference varies across regions (substantially greater
for region D). This difference in magnitude of the gender effect would
constitute an interaction between gender and region. It would be termed
a weak interaction because there is no crossover of the mean lines.
Additional scenarios can be charted, and we have not mentioned
three-way and higher interactions. Such topics are discussed in
introductory statistics books (see the references for suggestions). We will
now proceed to analyze our data set.
EXPLORING THE
DATA
Select Gss94.por and click Open
Click Analyze..Descriptive Statistics..Explore
Move educ to the Dependent List: box
Move sex and region to the Factor List: box
Click the Plots pushbutton
Click the Untransformed option button from the Spread vs.
Level with Levene Test area
Figure 9.4 Requesting a Homogeneity of Variance Test
Figure 9.5 Explore Dialog Box with Two Factors
The Examine command requires only the Variables subcommand in
order to run. We must also include the Plot subcommand since we desire
the homogeneity test (controlled by the SPREADLEVEL keyword; the 1 in
parentheses indicates a power transformation of 1, that is, no
transformation, is to be applied to the dependent measure). The other
subcommands specify default values; they appear in order to make it
simple for the analyst to modify the command when necessary. Note the
keyword BY separates the dependent variable (EDUC) from the factors
(SEX and REGION). Currently, both SEX and REGION follow the BY
keyword, and so have the same status, that is, an analysis will be run for
each separately. To indicate we wish a joint analysis, we insert an
additional BY between SEX and REGION on the Variables subcommand.
Type by between sex and region in the Syntax Editor window
Figure 9.7 Examine Command Requesting Subgroup Analysis
Figure 9.8 Box & Whisker Plot of Education
We see variation in the lengths of the boxes, which suggests that the
variation of education within groups is not homogeneous. Also, to the eye,
in most of the regions the median education for males is equal to that of
females. Finally, it seems that median education is lowest for those living
in the east central regions of the U.S. There are outliers at both the high
and low ends. Do any of them seem so extreme as to suggest data errors?
Note
The spread and level plot will be reformatted as a sunflower plot within
the Chart editor window. This is done because some of the subgroup
points fell on top of each other and could not be distinguished. To obtain a
sunflower plot:
Double click on the spread & level chart
Click Chart..Options
Check the Show Sunflowers check box
Click OK
Click File..Close to close the Chart editor
Figure 9.9 Spread & Level Plot of Education
The Levene tests all indicate that the probability of obtaining sample
variances as disparate (or more) as what we observe is very small (.000,
meaning less than .0005) if the subgroup variances were identical in the
population. Thus we cannot assume homogeneity of variance. The spread
and level plot did not reveal any obvious relation between the
interquartile range and the median (a common pattern is for the spread
to increase as the level increases), so no simple adjustment is apparent.
One mitigating factor is that because our sample size is so large, we have
a fairly sensitive test of homogeneity. However, short of pursuing the
technical route of variance-stabilizing data transformations (referred to
earlier), we will proceed with the analysis of variance, taking the
probability values with a grain of salt.
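For readers who want to see the arithmetic behind the Levene statistic, here is a minimal sketch in Python (not SPSS syntax): absolute deviations from each group's own mean are computed, and a one-way ANOVA-style F ratio is formed on those deviations. The data values are invented for illustration, not the GSS figures.

```python
from statistics import mean

def levene_w(groups):
    """Levene statistic (centering on group means): an ANOVA-style F
    computed on absolute deviations from each group's own mean.
    A large value suggests unequal spreads across groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    z = []
    for g in groups:
        mg = mean(g)
        z.append([abs(x - mg) for x in g])       # absolute deviations
    zbar_i = [mean(zi) for zi in z]              # per-group mean deviation
    zbar = mean(v for zi in z for v in zi)       # grand mean deviation
    between = sum(len(zi) * (zb - zbar) ** 2 for zi, zb in zip(z, zbar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, zbar_i) for v in zi)
    return ((n - k) / (k - 1)) * (between / within)

# invented data: the second group is far more spread out than the first
W = levene_w([[10, 12, 11, 13], [5, 20, 2, 25]])
```

A large W corresponds to a small significance value, as reported by the Levene test in the Explore output.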
TWO-FACTOR
ANOVA
The General Linear Model menu choice (under Analyze) extends these analyses to include more
complex repeated measures analysis and nested random effects models.
Click Analyze..General Linear Model..Univariate
Move educ to the Dependent Variable: box
Move region and sex to the Fixed Factor(s): box
Figure 9.12 GLM Univariate Dialog Box
Figure 9.13 GLM Univariate Model Dialog Box
Within the Univariate Model dialog you can specify the model you
want applied to the data. By default, a full factorial model (all main
effects, interactions, and covariates) is fit and the various effects tested
using Type III sums of squares (each effect is tested after statistically
adjusting for all other effects in the model). For balanced or unbalanced
models with no missing cells, Type III sums of squares is most commonly
used. If there are any missing cells in your analysis, we recommend you
switch to Type IV sums of squares, which better adjusts for them. If only
a subset of the main effects and interactions are of interest in your
analysis, and you want to specify an incomplete design, you would click
the Custom option button in the Specify Model area and indicate which
main effects and interactions should be included in the model. A custom
model is sometimes used if there is no interest in, or the design does not
allow, the testing of high-order interaction effects. Because we want to
examine the full factorial model, there is no reason to modify this dialog.
Click the Cancel button
Click the Plots pushbutton
We next examine the Profile Plots dialog box. The Profile Plots dialog
produces line charts that display means at different factor levels. You can
view main effects with such plots, but they are most helpful in
interpreting two- and three-way interactions (note that up to three factor
variables can be included). The dependent variable does not appear in
this dialog box. Multiple plots can be requested, which is useful in
complex analyses where there may be several significant interactions. We
will request profile plots for each main effect (region and sex) and for
their interaction (region * sex). Some analysts would request such plots
only for significant main effects and interactions, as determined by the
initial analysis.
Move region into the Horizontal Axis: box
Click Add
Move sex into the Horizontal Axis: box
Click Add
Move region into the Horizontal Axis: box and sex into the
Separate Lines: box
Figure 9.14 GLM Univariate Profile Plots Dialog Box
Click Add
Click Continue
Click the Options pushbutton
The Options dialog is used to request means, homogeneity of variance
tests, residual plots, and other diagnostic information pertaining to the
analysis. We will ask GLM to provide us with descriptive statistics for the
cells in the analysis. This will display means, standard deviations and
sample sizes for the factors and the two-way interaction. Estimated
marginal means are means estimated for each level of a factor averaging
across all levels of other factors (marginals), based on the specified model
(estimated). These means can differ from the observed means if
covariates are included or if an incomplete model (not all main effects and
interactions) is used. Post hoc analyses can be applied to the observed
means using the Post Hoc pushbutton (See Appendix to this chapter).
Click the Descriptive statistics check box in the Display area
Figure 9.15 GLM Univariate Options Dialog Box
Click Continue
Click OK
The following syntax will also run the analysis:
UNIANOVA
educ BY region sex
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/PLOT = PROFILE( region sex region*sex )
/PRINT=DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/DESIGN = region sex region*sex .
The first piece of output describes the factors involved in the analysis.
They are labeled between-subject factors.
Figure 9.16 Between-Subject Factors
THE ANOVA
TABLE
The first column lists the different sources of variation. We are most
interested in the region and gender main effects, as well as the region by
gender interaction. The source labeled Error contains summaries of the
within-group variation (or residual term), which will be used when
calculating the F ratios (ratios of between-group to within-group
variation). The remaining sources in the list are simply totals involving
the sources already described and, as such, are generally not of interest.
The Sums of Squares column contains a technical summary (sums of the
squared deviations of group means around the overall mean or of
individual observations around the group means) that is not interpreted
directly, but is used in calculating the later column values. The df
(degrees of freedom) column contains values that are functions of the
number of levels of the factors (for region, sex and region by sex) or the
number of observations (for error). Although this is a gross
oversimplification, you might think of degrees of freedom as measuring
the number of independent values (whether means or observations) that
contribute to the sums of squares in the previous column. As with sums of
squares, degrees of freedom are technical measures, not interpreted
themselves, but used in later calculations.
Mean Square values are variance measures attributable to the
various effects (region, gender, and region by gender) and to the variation
of individuals within groups (error). The ratio of an effect mean square to
the mean square of the error provides the between-group to within-group
variance ratio, or F statistic. If there were no group differences in the
population, then the ratio of the between-group variation to the within-group variation should be about 1. The Sig. column contains the most
interpretable numbers in the table: the probabilities that one can obtain
F ratios as large or larger (group means being as far or farther apart) as
what we find in our sample, if there are no mean differences in the
population.
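The arithmetic linking these columns can be sketched in a few lines of Python (not SPSS syntax); the sums of squares and degrees of freedom below are hypothetical, not the values from this analysis.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """Build the Mean Square and F columns of an ANOVA table:
    mean square = sum of squares / degrees of freedom, and
    F = effect mean square / error mean square."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect / ms_error

# hypothetical region effect: SS = 800 on 8 df; error SS = 29000 on 2900 df
F = f_ratio(800, 8, 29000, 2900)   # (800/8) / (29000/2900) = 10.0
```

The Sig. value is then obtained by referring F to an F distribution with the effect and error degrees of freedom.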
Gender does not show a significant difference in formal education
(Sig.= .185). In other words, there is about a 1 in 5 (.185) chance of
obtaining as large a sample difference in formal education between men
and women as we observe here if there is no sex difference in the
population. Similarly, the gender by region interaction is not significant,
indicating that the pattern of regional differences is the same for men
and women. On the other hand, region differences are highly significant
(probability rounded to three decimals is .000, and is thus less than
.0005, or 5 chances in 10,000). Despite this, the r-square measures
indicate that the model accounts for only about 3% of the variance in
education.
Earlier we found that the homogeneity of variance assumption was
not met. However, the region effect is so highly significant that even if we
were to inflate the probability by two orders of magnitude (.0005 to .05)
there still would be a significant difference. Thus we are confident that
there are regional differences despite the assumption violation. While
this informal adjustment of probability values is not a real solution to the
problem, it is better than entirely ignoring it. We could also perform a
nonparametric test of regional differences in education.
OBSERVED
MEANS
In the Options dialog box we asked for the descriptive statistics, which
include means for each main effect and the interaction. We are especially
interested in the regional means because only region had a significant
effect on the level of education in the ANOVA table.
Note
The pivot table below was edited in the Pivot Table editor so that gender
was placed in the layer dimension with the gender totals in the top layer,
which permits us to focus on regional differences in education.
Figure 9.18 Mean Education Level by Region
The average level of education ranges from 11.71 years in the East
South Central region to 13.77 years in the Pacific region. We can view
these means graphically in the profile plots. The profile plot for region
appears below.
Figure 9.19 Profile Plot of Region (Estimated Marginal Means)
The Profile plot displays the estimated marginal means for the nine
regions.
ECOLOGICAL
SIGNIFICANCE
PRESENTING
THE RESULTS
To show the results graphically we could simply use the profile plot, but
an error bar chart would present the subgroup means along with their
95% confidence bands. Below we request an error bar chart displaying
each region.
Click Graphs..Interactive..Error Bar
Click the Reset button, then click OK to confirm
Drag and drop Region of Interview [region] into the
horizontal arrow box
Drag and drop Highest Year of School Completed [educ] into
the vertical arrow box
Click Convert if asked to convert educ to a scale variable
Hint
Figure 9.20 Create Error Bar Chart Dialog Box
Click OK
The SPSS command to produce the error bar chart appears below.
IGRAPH /VIEWNAME='Error Bar Chart'
/X1 = VAR(region) TYPE = CATEGORICAL
/Y = VAR(educ) TYPE = SCALE
/COORDINATE = VERTICAL /X1LENGTH = 3.0
/YLENGTH = 3.0 /X2LENGTH = 3.0
/CATORDER VAR(region) (ASCENDING VALUES OMITEMPTY)
/ERRORBAR KEY ON CI(95.0) DIRECTION BOTH
CAPSTYLE = T SYMBOL = ON.
Figure 9.21 Error Bar Chart of Education by Region
The error bar chart displays where the various region sample means
fall. The 95% confidence bands provide information about which
population group means are expected to differ. The post hoc tests
performed in the appendix give a more exact answer to this question.
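The band drawn around each mean can be sketched in Python (not SPSS syntax). Here the normal-approximation multiplier 1.96 stands in for the exact t multiplier that the chart uses, and the data values are invented, not the GSS education figures.

```python
from math import sqrt
from statistics import mean, stdev

def ci95(values):
    """Approximate 95% confidence band for a group mean:
    mean +/- 1.96 * (s / sqrt(n)). The 1.96 multiplier is the
    large-sample normal shortcut; SPSS uses the exact t value."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))   # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se

# invented years-of-education values for one region
lo, hi = ci95([12, 13, 11, 14, 12, 13, 12, 11])
```

Roughly speaking, two group means whose bands do not overlap are likely to differ in the population.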
SUMMARY OF
ANALYSIS
SUMMARY
APPENDIX: POST
HOC TESTS
USING GLM
UNIVARIATE
At this point of the analysis it is natural to ask which regions differ from
which others in terms of mean education level. The GLM procedure in
SPSS will perform separate post hoc tests on each dependent variable in
order to answer this question. These tests are usually performed to
investigate which levels within a factor differ after the overall main effect
has been established. To request a post hoc test we will return to the
GLM Univariate dialog box.
Figure 9.23 Post Hoc Tests in GLM Univariate
Click Continue
Click OK
The command below will run this analysis. On the POSTHOC
subcommand we request that the Games-Howell test (GH) be applied to
region.
UNIANOVA
educ BY region sex
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/POSTHOC = region ( GH )
/PLOT = PROFILE( region sex region*sex )
/PRINT=DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/DESIGN = region sex region*sex .
We view part of the post hoc results below.
Figure 9.24 GLM Post Hoc Results - Beginning
Note
Since the gender and gender by region effects were not found to be
significant, most analysts would rerun the analysis with region as the
only factor when performing post hocs. This would be a one-factor
ANOVA, which was discussed in Chapter 8. Performing the post hocs
from the two-factor analysis, as we do here, serves to demonstrate how it
would be done if more than a single factor were significant.
Objective
Method
Data
Scenario
INTRODUCTION
we must either recode salary and education into categories and run a
crosstab (the appropriate graph is a clustered bar chart), or alternatively,
present the original variables in a scatterplot. Both approaches are valid
and you would choose one or the other depending on your interests. Since
we hope to build an equation relating amount of education to beginning
salary we will stick to the original scales and begin with a scatterplot.
But first we will take a quick look at the relevant variables using
exploratory data analysis methods.
READING THE
DATA
Click File..Open..Data
Switch to the c:\Train\Stats folder (if necessary)
Select SPSS Portable (*.por) from the Files of Type: drop-down
menu
Click Bank.por
Figure 10.1 Reading the Bank Portable File
The portable file contains both the data and dictionary information
(formats, labels, missing values). For those using SPSS command syntax,
the following command will read the portable file.
IMPORT FILE='C:\Train\Stats\Bank.por'.
Click Open
Figure 10.2 Bank Data
We see the data values for several employees in the Data Editor
window.
EXPLORING THE
DATA
Figure 10.3 Explore Dialog Box
Figure 10.4 Statistics for Beginning Salary
The extreme values at the high salary end result in a skewed
distribution. Since several different job classifications are represented in
this data, the skewness may be due to a relatively small number of people
in high paying jobs. In the plot above each leaf represents 3 employees,
and the ampersand & (called a partial leaf) symbolizes the presence of
fewer than 3 employees with the same leaf value.
Figure 10.6 Box & Whisker Plot of Beginning Salary
All outliers are at the high end, and the employee numbers for some
of them can be read from the plot (changing the font size of these
numbers in the Chart Editor window would make more of them legible).
It might be useful to look at the job classification (JOBCAT) of some of
the higher salaried individuals as a check for data errors.
Figure 10.7 Statistics for Formal Education (in years)
The mean is again above the median, but the skewness value is very
near zero (suggesting a symmetric distribution). Here the mean
exceeding the median is not due to the presence of outliers, as will be
made apparent in the stem & leaf diagram below.
Figure 10.8 Stem & Leaf Plot of Education
Since education was recorded in whole years and the range is small,
all the leaves are zero. Notice there are only a few extreme observations,
and they are at high education values. An oddity is the gap in education
between 8th and 12th grade. If this occurred in your data you might
investigate further to determine if it might be the result of the sampling
procedure used to select individuals for inclusion in the study or a data
coding problem. The mean is above the median because of the
concentration of employees with education of 15 to 19 years (compare to
the single block of employees at 8 years). This imbalance is revealed in
the stem & leaf.
Figure 10.9 Box & Whisker Plot of Education
The median or 50th percentile (dark line within box) falls on the
lower edge of the box (25th percentile) indicating a large number of
people with 12th grade education.
Having explored each variable separately, we will now view them
jointly with a scatterplot.
SCATTERPLOTS
Figure10.10 Scatterplot Dialog Box
Figure 10.11 Simple Scatterplot Dialog Box
Figure 10.12 Scatterplot of Beginning Salary and Education
Figure 10.13 Scatterplot Options Dialog Box
Since there are no subgroups in our analysis (we did not name a
Marker variable), we can only fit a line to the entire (Total) sample. The
other areas in the dialog box allow adding a mean reference line and
creating a sunflower plot: a scatterplot representing the density of cases
at a location by the number of petals on a sunflower symbol. This is
useful here since we are plotting many points with a limited number of
education values. While the default Fit options are fine for our purpose,
we will click the Fit Options pushbutton to see the possibilities.
Click the Fit Options pushbutton
Figure 10.14 Scatterplot: Fit Options
By default a straight line (linear) will be fit to the data, although
other simple curves (quadratic, cubic) are also available. The Lowess
choice applies a robust regression technique to the data. Such methods
produce a result that is more resistant to outliers than the traditional
least-squares regression. While not invoked here, note that 95%
confidence bands around the best-fitting line can be added to the plot.
Finally, we will define the r-square measure when we consider
regression, but please note that it can be displayed on the chart
(Interactive Graph scatterplots can also display the line's equation).
Because we did not change any Fit option settings, we will exit this dialog
box and process the requested Scatterplot options.
Click Cancel
Click OK
Click File..Close to close the Chart editor
Figure 10.15 Scatterplot with Best Fitting Line
CORRELATIONS
For the perfect correlation of 1.0, all points fall on the straight line
trending upwards. In the scatterplot with a correlation of .8 the strong
positive relation is apparent, but there is some variation around the line.
Looking at the plot of data with correlation of .4, the positive relation is
suggested by the absence of points in the upper left and lower right of the
plot area. The association is clearly less pronounced than with the data
correlating .8 (note greater scatter of points around the line). The final
chart displays a correlation of 0: there is no linear association present.
This is fairly clear to the eye (the plot most resembles a blob), and the
best-fitting straight line is a horizontal line.
While we have stressed the importance of looking at the relationships
between variables using scatterplots, you should be aware that human
judgment studies indicate that people tend to overestimate the degree of
correlation when viewing scatterplots. Thus obtaining the numeric
correlation is a useful adjunct to viewing the plot. Correspondingly, since
correlations only capture the linear relation between variables, viewing a
scatterplot allows you to detect nonlinear relationships present.
Additionally, statistical significance tests can be applied to
correlation coefficients. Assuming the variables follow normal
distributions, you can test whether the correlation differs from zero (zero
indicates no linear association) in the population, based on your sample
results. The significance value is the probability that you would obtain as
large (or larger in absolute value) a correlation as you find in your
sample, if there were no linear association (zero correlation) between the
two variables in the population.
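The computation underlying this test can be sketched in Python (not SPSS syntax): the correlation is converted to a t statistic on n − 2 degrees of freedom, from which the two-tailed probability is read off a t distribution. The data are toy values for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for testing a correlation against zero; the
    two-tailed Sig. value comes from a t distribution on n - 2
    degrees of freedom (a table or distribution function supplies it)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# toy data, five cases
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])   # about 0.77
t = t_for_r(r, 5)
```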
In SPSS, correlations (Pearson product-moment correlations) can be
easily obtained along with an accompanying significance test. If one has
grossly nonnormal data, or only ordinal scale data, the Spearman rank
correlation coefficient (or Spearman correlation) can be calculated. It
evaluates the linear relationship between two variables after ranks have
been substituted for the original scores. Another, less common, rank
association measure is Kendall's tau-b. We will obtain the correlation
(Pearson) between beginning salary and education, and will also include
age, current salary, and work experience in the analysis.
Click Analyze..Correlate..Bivariate
Move salbeg, salnow, edlevel, age and work to the Variables:
list box
Figure 10.17 Correlation Dialog Box
Notice that we simply list the variables to be analyzed; there is no
designation of dependent and independent. Correlations are simply
measures of straight-line association.
By default, Pearson correlations will be calculated, which is what we
want. However, the alternative types can be easily requested. A two-tailed significance test will be performed on each correlation. This will
posit as the null hypothesis that in the population there is no linear
association between the two variables. Thus any straight-line
relationship, either positive or negative, is of interest. If you prefer a one-tailed test, one in which you specify the direction (or sign) of the relation
you expect and any relation in the opposite direction (opposite sign) is
bundled with the zero (or null) effect, you can obtain it through the One-tailed option button. This issue was discussed earlier in the context of one-
versus two-tailed t tests. A one-tailed test gives you greater power to
detect a correlation of the sign you propose, at the price of giving up the
ability to detect a significant correlation of the opposite sign. In practice,
researchers are usually interested in all linear relations, positive and
negative, and so two-tailed tests are very common. The Flag significant
correlations check box is checked by default. When checked, significant
correlations will be identified by asterisks appearing beside the
correlations.
The Options pushbutton leads to a dialog box in which you can
request that descriptive statistics appear for the variables used in the
analysis. There is also a choice for missing values. The default missing
setting is Pairwise, which means that if a case has missing values for one
or more of the analysis variables, SPSS will still use the valid information
from other variables in that case. The alternative is Listwise, in which a
case is dropped from the correlation analysis if any of its analysis
variables have missing values. Neither method provides an ideal solution;
in practice, pairwise deletion is often chosen when a large number of
cases are dropped by the listwise method. This is an area of statistics in
which considerable progress has been made in the last decade, and the
SPSS Missing Values option incorporates some of these improvements.
Click OK
The CORRELATIONS syntax command will also run this Pearson
correlation analysis. To request Spearman correlations or Kendall's
coefficients, use the NPAR CORR (nonparametric correlation) command.
Figure 10.18 Correlation Matrix
if so, can you explain them? Also note that the significant correlations are
marked by asterisks.
A correlation provides a concise numerical summary of the degree of
linear association between pairs of variables. However, a correlation can
be influenced by outliers without warning signs appearing in the
correlation. Such outliers would probably be visible in a scatterplot. Also,
a scatterplot might suggest that a function other than a straight line be
fit to the data, whereas a correlation simply provides a measure of
straight-line fit. For these reasons, serious analysts look at scatterplots.
If the number of variables is so large as to make looking at all
scatterplots infeasible, then at least view those involving important
variables.
SUMMARY
Method
Data
Scenario
INTRODUCTION
AND BASIC
CONCEPTS
Introduction to Regression 11 - 1
Figure 11.1 Scatterplot of Beginning Salary and Education (with
Sunflower Option)
THE
REGRESSION
EQUATION AND
FIT MEASURE
In the plot above, beginning salary is placed on the Y (or vertical axis)
and education appears along the X (horizontal) axis. Since education is
typically completed before starting a career, we consider beginning salary
to be the dependent variable and education the independent or predictor
variable. A straight line is superimposed on the scatterplot along with the
general form of the equation, Y = B * X + A. Here, B is the slope (the
change in Y per one unit change in X) and A is the intercept (the value of
Y when X is zero).
Given this, how would one go about finding a best-fitting straight
line? In principle, there are various criteria that might be used:
minimizing the mean deviation, mean absolute deviation, or median
deviation. Due to technical considerations, and with a dose of tradition,
the best-fitting straight line is the one that minimizes the sum of the
squared deviation of each point about the line. If these sums of squared
deviations remind you of our discussion of ANOVA, it is the same
concept.
Returning to the plot of beginning salary and education, we might
wish to quantify the extent to which the straight line fits the data. The fit
measure most often used, the r-square measure, has the dual advantages
of falling on a standardized scale and having a practical interpretation.
The r-square measure (which is the correlation squared, or r², when there
is a single predictor variable, hence its name) is on a scale from 0 (no
linear association) to 1 (perfect linear prediction). Also, the r-square value
can be interpreted as the proportion of variation in one variable that can
be predicted from the other. Thus an r-square of .50 indicates that we can
account for 50% of the variation in one variable if we know values of the
other. You can think of this value as a measure of the improvement in
your ability to predict one variable from the other (or others if there are
multiple independent variables).
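The slope, intercept, and r-square just described can be computed directly; here is a minimal Python sketch (not SPSS syntax) with invented education and salary values, not the bank data.

```python
from statistics import mean

def least_squares(x, y):
    """Slope B and intercept A of the line minimizing the sum of
    squared vertical deviations, plus the r-square fit measure
    (share of the variation in y accounted for by the line)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx        # slope: change in Y per one-unit change in X
    a = my - b * mx      # intercept: predicted Y when X is zero
    syy = sum((c - my) ** 2 for c in y)
    ss_resid = sum((c - (a + b * d)) ** 2 for d, c in zip(x, y))
    return b, a, 1 - ss_resid / syy

# invented education (years) and beginning salary values
B, A, r2 = least_squares([12, 14, 16, 18], [9000, 11000, 12000, 15000])
```

With these values the fitted line is Y = B * X + A, and r2 reports how much of the salary variation the line accounts for.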
RESIDUALS AND
OUTLIERS
ASSUMPTIONS
Viewing the plot, we see that many points fall near the line, but some are
quite a distance from it. For each point, the difference between the value
of the dependent variable and the value predicted by the equation (value
on the line) is called the residual. Points above the line have positive
residuals (they were under-predicted), those below the line have negative
residuals (they were over-predicted), and a point falling on the line has a
residual of zero (perfect prediction). Points having relatively large
residuals are of interest because they represent instances where the
prediction line did poorly. For example, one case has a beginning salary of
about $30,000 while the predicted value (based on the line) is about
$10,000, yielding a residual, or miss, of about $20,000. If budgets were
based on such predictions, this is a substantial discrepancy. In SPSS, the
Regression procedure can provide information about large residuals, and
also present them in standardized form. Outliers, or points far from the
mass of the others, are of interest in regression because they can exert
considerable influence on the equation (especially if the sample size is
small). Also, outliers can have large residuals and would be of interest for
this reason as well. While not covered in this class, SPSS can provide
influence statistics to aid in judging whether the equation was strongly
affected by an observation and, if so, to identify the observation.
nature of the independent variable(s). A variable coded as a dichotomy
(say 0 and 1) can technically be considered as an interval scale. An
interval scale assumes that a one-unit change has the same meaning
throughout the range of the scale. If a variable's only possible codes are 0
and 1 (or 1 and 2, etc.), then a one-unit change does mean the same
change throughout the scale. Thus dichotomous variables, for example
gender, can be used as predictor variables in regression. This also permits
the use of nominal predictor variables if they are converted into a series
of dichotomous variables; this technique is called dummy coding and is
considered in most regression texts (Draper and Smith (1998), Cohen and
Cohen (1983)). When we perform multiple regression (multiple
independent variables) later in this chapter we will use a dichotomous
predictor (gender).
SIMPLE
REGRESSION
This chapter will focus on the first choice, linear regression, which
performs simple and multiple linear regression. Curve Estimation will
invoke the Curvefit procedure, which can apply up to 16 different
functions relating two variables. Binary logistic regression is used when
the dependent variable is a dichotomy (for example, when predicting
whether a medical patient survives or not). Multinomial logistic
regression is appropriate when you have a categorical dependent variable
with more than two values. Ordinal regression can be applied when the
measurement level of the dependent variable is ordinal (rank ordered).
Probit analysis is traditionally used in medical dosage response studies in
which one records at different drug dosage levels the number of
experimental animals that survive and the number that die. Nonlinear
regression will apply a user-specified nonlinear equation to the variables.
Weight estimation will compute weight factors, which when later used by
the Regression procedure will result in areas where the points show less
variation (reflect greater precision) being weighted more heavily. 2-Stage
Least Squares is a method used in econometrics to evaluate regression-like models in which variables can appear in two separate equations.
Finally, Optimal Scaling analysis (Regression with Optimal Scaling) is
used to predict the values of a categorical, ordinal or interval dependent
variable from a combination of categorical, ordinal or interval
independent variables, but does so by performing scaling on the original
variables. Binary and Multinomial Logistic, Probit, Nonlinear, 2-Stage
Least Squares and Weight Estimation are part of the SPSS Regression
Models option, Ordinal Regression is part of the SPSS Advanced Models
option, and Optimal Scaling comes with the SPSS Categories option.
We will select Linear to perform simple linear regression, then name
beginning salary (SALBEG) as the dependent variable and education
(EDLEVEL) as the independent variable.
Click Linear from the Regression menu
Move salbeg to the Dependent: box
Move edlevel to the Independent(s): box
Figure 11.3 Linear Regression Dialog Box
In this first analysis we will limit ourselves to producing the standard
regression output. In our next regression example, we will ask for
residual plots and information about cases with large residuals. Also, the
Regression dialog box allows many specifications; here we will discuss the
most important features. However, if you will be running regression
often, some time spent reviewing the additional features and controls
mentioned in the manual and Help system will be well worth it.
Beginning salary (SALBEG) is the dependent variable and education
(EDLEVEL) is the sole independent variable. Notice the Independent(s)
list box will permit more than one independent variable, and so this
dialog box can be used for both simple and multiple regression. The block
controls permit an analyst to build a series of regression models with the
variables entered at each stage (block), as specified by the user.
By default, the Method is Enter, which means that all independent
variables in the block will be entered into the regression equation
simultaneously. This method is chosen to run one regression based on all
variables you specify. If you wish the program to select, from a larger set
of independent variables, those that in some statistical sense are the best
predictors, you can request the Stepwise method. We will review this
method later in the chapter.
The Selection Variable option permits cross-validation of regression
results. Only cases whose values meet the rule specified for a selection
variable will be used in the regression analysis, yet the resulting
prediction equation will be applied to the other cases. Thus you can
evaluate the regression on cases not used in the analysis, or apply the
equation derived from one subgroup of your data to other groups.
While SPSS will present standard regression output by default, many
additional (and some of them quite technical) statistics can be requested
via the Statistics dialog box. The Plots dialog box is used to generate
various diagnostic plots used in regression, including residual plots. We
will request such plots in the next analysis. The Save dialog box permits
you to add new variables to the data file. These variables contain such
statistics as the predicted values from the regression equation, various
residuals and influence measures. Finally, the Options dialog box controls
the criteria when running stepwise regression and choices in handling
missing data (the SPSS Missing Values option provides more
sophisticated methods of handling missing values). By default, SPSS
excludes a case from regression if it has one or more values missing for
the variables used in the analysis.
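Listwise deletion is easy to picture as a filter over cases; a minimal Python sketch, with hypothetical data values, of the default rule just described:

```python
# Listwise deletion: drop any case that is missing (None) on any of
# the analysis variables. The cases below are hypothetical.

cases = [
    {"salbeg": 8400, "edlevel": 12},
    {"salbeg": None, "edlevel": 16},    # missing salary -> dropped
    {"salbeg": 9600, "edlevel": None},  # missing education -> dropped
    {"salbeg": 10200, "edlevel": 15},
]

variables = ["salbeg", "edlevel"]
complete = [c for c in cases if all(c[v] is not None for v in variables)]
# len(complete) -> 2: only the fully observed cases enter the analysis
```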
Click OK
The command below will produce the analysis in SPSS.
REGRESSION
/DEPENDENT salbeg
/METHOD=ENTER edlevel .
We indicate that beginning salary (SALBEG) is the dependent
variable and we wish to enter education (EDLEVEL) as the single
predictor.
Figure 11.4 Model Summary and Overall Significance Tests
Regression can display such descriptive statistics as the standard
deviation, but since we didn't request this, we will note that the original
standard deviation of beginning salary was $3,148 (Figure 10.4). Thus the
uncertainty surrounding individual beginning salaries has been reduced
from $3,148 (standard deviation) to $2,439 (standard error). If the
straight line perfectly fit the data, the standard error would be 0.
While the fit measures indicate how well we can expect to predict the
dependent variable or how well the line fits the data, they do not tell
whether there is a statistically significant relationship between the
dependent and independent variables. The analysis of variance table
presents technical summaries (sums of squares and mean square
statistics) similar to what we found in ANOVA, but here we refer to
variation accounted for by the prediction equation. Our main interest is
in determining whether there is a statistically significant (nonzero)
linear relation between the dependent variable and the independent
variable(s) in the population. Since in simple regression there is a single
independent variable, we are testing a single relationship; in multiple
regression, we test whether any linear relation differs from zero. The
significance value accompanying the F test gives us the probability that
we could obtain one or more sample slope coefficients (which measure the
straight line relationships) as far from zero as what we obtained, if there
were no linear relations in the population. The result is highly significant
(significance probability less than .0005 or 5 chances in 10,000). Now that
we have established there is a significant relationship between the
beginning salary and education, and obtained fit measures, we turn to
interpret the regression coefficients.
Figure 11.5 Regression Coefficients
the same pattern continues. Here it clearly cannot! The Standard Error
(of B) column contains standard errors of the regression coefficients.
These provide a measure of the precision with which we estimate the B
coefficients. The standard errors can be used to create a 95% confidence
band around the B coefficients (available as a Statistics option). In our
example, the regression coefficient is $691 and the standard error is
about $39. Thus we would not be surprised if in the population the true
regression coefficient were $650 or $710 (within two standard errors of
our sample estimate), but it is very unlikely that the true population
coefficient would be $300 or $2,000.
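The two-standard-error band described above is simple arithmetic; a quick Python check using the $691 coefficient and $39 standard error from this example:

```python
# Approximate 95% confidence band for a regression coefficient:
# roughly B +/- 2 standard errors (the exact multiplier comes from
# the t distribution). Values are taken from the example above.

b, se = 691, 39
low, high = b - 2 * se, b + 2 * se
# (low, high) -> (613, 769)

# $650 and $710 fall inside the band; $300 and $2,000 do not:
inside = [v for v in (650, 710, 300, 2000) if low <= v <= high]
# inside -> [650, 710]
```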
Betas are standardized regression coefficients and are used to judge
the relative importance of each of several independent variables. We will
use these measures when discussing multiple regression. Finally, the t
statistics provide a significance test for each B coefficient, testing
whether it differs from zero in the population. Since we have but one
independent variable, this is the same result as what the F test provided
earlier. In multiple regression, the F statistic tests whether any of the
independent variables are significantly related to the dependent variable,
while the t statistic is used to test each independent variable separately.
The significance test on the constant assesses whether the intercept
coefficient is different from zero in the population (it is).
Thus if we wish to predict beginning salary based on education for
new employees, the formula would use the B coefficients: Beginning
Salary = $691 * Education - $2,516. Even when running simple
regression, the analyst would probably take a look at some residual plots
and check for outliers; we will follow through on this aspect in the next
example.
MULTIPLE
REGRESSION
Click the Dialog Recall tool
Click Linear Regression
Move sex, work, and age into the Independent(s): box (edlevel
should already be there)
Figure 11.6 Setting Up Multiple Regression
RESIDUAL
PLOTS
While we can run the multiple regression at this point, we will request
some diagnostic plots involving residuals and information about outliers.
By default no residual plots will appear. These options are explained
below.
Click the Plots pushbutton
Within the Plots dialog box:
Click the Histogram check box in the Standardized Residual
Plots area
Move *ZRESID into the Y: box
Move *ZPRED into the X: box
Figure 11.7 Regression Plots Dialog Box
The options in the Standardized Residual Plots area of the dialog box
all involve plots of standardized residuals. Ordinary residuals are useful
if the scale of the dependent variable is meaningful, as it is here
(beginning salary in dollars). Standardized residuals are helpful if the
scale of the dependent is not familiar (say a 1 to 10 customer satisfaction
scale). By this I mean that it may not be clear to the analyst just what
constitutes a large residual; is an over prediction of 1.5 units a large miss
on a 1 to 10 scale? In such situations, standardized residuals (residuals
expressed in standard deviation units) are very useful because large
prediction errors can be easily identified. If the errors follow a normal
distribution, then standardized residuals greater than 2 (in absolute
value) should occur about 5% of the time, and those greater than 3 (in
absolute value) should happen less than 1% of the time. Thus
standardized residuals provide a norm against which one can judge what
constitutes a large residual. We requested a histogram of the
standardized residuals; note that a normal probability plot is available as
well. Recall that the F and t tests in regression assume that the residuals
follow a normal distribution.
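As a rough sketch of how standardized residuals flag large misses, the Python fragment below (with hypothetical residuals on a dollar scale) divides each residual by the residual standard deviation and flags those beyond 2:

```python
import statistics

# Standardized residuals: residuals expressed in standard deviation
# units, so "large" has a common meaning across scales. The residual
# values below are hypothetical; with an intercept in the model,
# residuals average to zero, as they do here.

residuals = [300, -600, 0, 300, -300, 600, -300, -600, 2700, -2100]
s = statistics.pstdev(residuals)
zres = [round(r / s, 2) for r in residuals]
large = [i for i, z in enumerate(zres) if abs(z) > 2]
# large -> [8]: only the $2,700 miss exceeds 2 standard deviations
```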
Regression can produce summaries concerning various types of
residuals. Without going into all these possibilities, we request a
scatterplot of the standardized residuals (*ZRESID) versus the
standardized predicted values (*ZPRED). An assumption of regression is
that the residuals are independent of the predicted values, so if we see
any patterns (as opposed to a random blob) in this plot, it might suggest a
way of adjusting and improving the analysis.
Click Continue
Next we will look at the Statistics dialog box. The Casewise
Diagnostics choice appears here. When this option is checked, Regression
will list information about all cases whose standardized residuals are
more than 3 standard deviations from the line. This outlier criterion is
under your control.
Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals
area
Figure 11.8 Regression Statistics Dialog Box
The Casewise subcommand will produce a casewise plot or list of those
observations more than three standard deviations from the line (actually
from the plane, since there are several predictor variables). To run just a
basic multiple regression analysis, the command would end with the
Method subcommand.
MULTIPLE
REGRESSION
RESULTS
Recall that listwise deletion of missing data has occurred, that is, if a
case is missing data on any of the five variables used in the regression it
will be dropped from the analysis. If this results in heavy data loss, other
choices for handling missing values are available in the Regression
Options dialog box (see also the SPSS Missing Values option). The
dependent and independent variables are listed. The r-square measure is
about .49, indicating that with these four predictor variables we can
account for about 49% of the variation in beginning salaries. Education
alone had an r-square of .40, so the additional set of three predictors added only 9%: an improvement, but a modest one. The adjusted r-square is quite close to the r-square. The
standard error has dropped from $2,439 (with just education as a
predictor) to $2,260: an improvement, but not especially large.
Figure 11.10 ANOVA Table
The independent variables appear in the order they were given in the
Regression dialog box, and not in order of importance. Although the B
coefficients are important for prediction and interpretive purposes,
analysts usually look first to the t test at the end of each line to
determine which independent variables are significantly related to the
outcome measure. Since four variables are in the equation, we are testing
if there is a linear relationship between each independent variable and
the dependent measure after adjusting for the effects of the three other
independent variables. Looking at the significance values we see that
education and gender are highly significant (less than .0005), age is
significant at the .05 level, while work experience is not linearly related
to beginning salary (after controlling the other predictors). Thus we can
drop work experience as a predictor. It may seem odd that work
experience is not related to salary, but since many of the positions were
clerical, work experience may not play a large role. Typically, you would
rerun the regression after removing variables not found to be significant,
but we will proceed and interpret this output.
The estimated regression (B) coefficient for education is $651, similar
but not identical to the coefficient ($691) found in the simple regression
using formal education alone. In the simple regression we estimated the
B coefficient for education ignoring any other effects, since none were
included in the model. Here we evaluate the effect of education after
adjusting for age, work experience and gender. If the independent
variables are correlated, the change in B coefficient from simple to
multiple regression can be substantial. So, after adjusting for age, work
experience and gender, meaning if they were held constant, a year of
formal education, on average, was worth $651 in starting salary. The
gender variable has a B coefficient of -1526. This means that a one-unit
change in gender (moving from male status to female status), controlling
for the other independent variables in the equation, is associated with a
drop (negative coefficient) in beginning salary of $1,526. Age has a B
coefficient of $33, so a one-year change in age (controlling for the other
variables) was associated with a $33 beginning salary increase. Since we
found work experience not to be significantly different from zero, we treat
it as if it were zero. The constant or intercept term is still negative, and
would correspond to the predicted salary for a male (sex=0) with 0 years
of education, 0 years of work experience and whose age is 0 - not a likely
or realistic combination. The standard errors again provide precision
measures for the regression coefficient estimates.
If we simply look at the estimated B coefficients we might think that
gender is the most important variable. However, the magnitude of the B
coefficient is influenced by the standard deviation of the independent
variable. For example, gender (SEX) takes on only two values (0 and 1),
while education values range from 8 years to over 20 years. The Beta
coefficients explicitly adjust for such standard deviation differences in the
independent variables. They indicate what the regression coefficients
would be if all variables were standardized to have means of 0 and
standard deviations of 1. A Beta coefficient thus indicates the expected
change (in standard deviation units) of the dependent variable per one
standard deviation unit increase in the independent variable (after
adjusting for other predictors). This provides a means of assessing
relative importance of the different predictor variables in multiple
regression. The Betas are normed so that the maximum should be less
than or equal to one in absolute value (if any Betas are above 1 in
absolute value, it suggests a problem with the data: multicollinearity).
Examining the Betas, we see that education is the most important
predictor, followed by gender, and then age. The Beta for work experience
is very near zero. If we needed to predict beginning salary from these
background variables (dropping work experience) we would use the B
coefficients. Rounding to whole numbers, we would say: Salbeg = 651 *
Edlevel - 1526 * Gender + 33 * Age - 2666.
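Applying the rounded prediction equation above is straightforward; a small Python sketch (the employee profiles are hypothetical):

```python
# Apply the rounded multiple-regression equation from the text:
# Salbeg = 651*Edlevel - 1526*Sex + 33*Age - 2666  (sex: 0=male, 1=female)

def predict_salbeg(edlevel, sex, age):
    return 651 * edlevel - 1526 * sex + 33 * age - 2666

# A hypothetical 25-year-old male with 16 years of education:
predict_salbeg(16, 0, 25)   # -> 8575
# The same profile, female: lower by the $1,526 gender coefficient.
predict_salbeg(16, 1, 25)   # -> 7049
```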
RESIDUALS AND
OUTLIERS
We see the distribution of the residuals with a normal bell-shaped
curve superimposed. The residuals are a bit too concentrated in the
center (notice the peak) and are skewed; notice the long tail to the right.
Given this pattern, a technical analyst might try a data transformation
on the dependent measure (taking logs), which might improve the
properties of the residual distribution. Overall, the distribution is not too
bad, but there are clearly some outliers in the tail; these also show up in
the casewise outlier summary.
Figure 11.14 Scatterplot of Residuals and Predicted Values
SUMMARY OF
REGRESSION
RESULTS
STEPWISE
REGRESSION
Figure 11.15 Requesting a Stepwise Regression
STEPWISE
REGRESSION
RESULTS
Note: Some of the pivot tables appearing below were edited in the Pivot
Table editor so they would fit in this document as figures.
Figure 11.16 Step History and Model Summary
The drop-off in the r-square contribution from the first step to the second is unusually dramatic here, but some substantial drop-off frequently occurs.
Figure 11.17 ANOVA Summary
Figure 11.18 Coefficients Pivot Table
Each model (step) in the stepwise regression is accompanied by a
summary of the variables not included in the model at that point: the
remaining candidate variables. We see the Beta coefficient each
independent variable would have if it were entered into the equation at
that point. The partial correlation measures the correlation between the
independent variable and the dependent measure (beginning salary) after
statistically controlling for any predictor variables already in the
equation. For model 1 gender has the largest partial correlation
(adjusting for formal education), so it will be entered next. Tolerance
measures the proportion of variation in each independent variable that is
unique, that is, not shared with the other predictors. A tolerance of 1
indicates the predictor is uncorrelated with the other independent
variables. If any tolerance values approach zero, the regression results
may become unstable (see the regression references mentioned earlier in
this chapter). The t statistic tests whether the independent variable
would have a statistically significant linear relation to the dependent
measure if added to the equation at this point. Gender (SEX) would be
significant, so there is no barrier for its inclusion.
If we examine model 2 (formal education and gender included) an
interesting result emerges. Notice that the partial correlations for work
experience and age are very close in magnitude, and if either one were
entered at this point, it would be significant. Age will be selected because
it has a slightly larger partial correlation. You might wonder how work
experience would be significant if added here, when it was not significant
in the original four-variable equation, and was not entered using stepwise
regression. Work experience correlates highly with age (.80 - Figure
10.18), and so if one of them is already in the equation, the second does
not provide much more information.
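The partial correlation and tolerance ideas can be illustrated with the standard first-order formulas; in the Python sketch below, the .80 correlation between age and work experience comes from the text, while the .30 correlations with salary are hypothetical:

```python
import math

# First-order partial correlation: the correlation between x and y
# after removing the linear effect of a control variable z.

def partial_corr(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Suppose work experience correlates .30 with salary (hypothetical)
# but .80 with age (from the text), and age is already in the
# equation; little unique information remains:
r_partial = round(partial_corr(0.30, 0.80, 0.30), 3)   # -> 0.105

# Tolerance of a predictor = 1 - R^2 from regressing it on the other
# predictors; with one other predictor, that R^2 is just r squared:
tolerance = 1 - 0.80 ** 2   # about 0.36, well below 1
```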
STEPWISE
SUMMARY
We see that three variables (formal education, gender and age) were
entered into the stepwise prediction equation. Only work experience
remains, since its B coefficient would not be statistically significant if
added to the equation. The coefficients from this three-variable model are
close to those we observed with our earlier four-variable model (see
Figure 11.10). The prediction equation for beginning salary would be
(from Figure 11.18): Beginning Salary = $643 * EDLEVEL - $1,614 * SEX + $45 * AGE - $2,801. At this point, Regression produces the residual plots
and summaries, which we will not view here.
Thus stepwise regression supplies a means of automatically selecting
variables to be used in a regression equation. Despite the ease with which
it runs, oversight is required to inspect the results to ensure that they
make sense (for example, we expect that education is positively related to
salary). Also, remember that running stepwise regression with many variables increases the probability of variables being improperly included in the equation (false positives).
SUMMARY
References
INTRODUCTORY STATISTICS BOOKS
Freund, Rudolf J. and Wilson, William J., Statistical Methods, Academic
Press, New York, 1993.
Hays, William L., Statistics, 4th Edition, Harcourt Brace Jovanovich,
New York, 1988.
Norusis, Marija J., SPSS 11.0 Guide to Data Analysis, Prentice-Hall, New
York, 2001.
Wilcox, Rand R., Statistics for the Social Sciences, Academic Press, New
York, 1996.
ADDITIONAL REFERENCES
Agresti, Alan, An Introduction to Categorical Data Analysis, Wiley, New
York, 1996.
Andrews, Frank M., Klem, L., Davidson, T.N., O'Malley, P.M. and Rodgers, W.L., A Guide for Selecting Statistical Techniques for Analyzing Social Science Data, Institute for Social Research, University of Michigan, Ann Arbor, 1981.
Babbie, Earl, Survey Research Methods, Wadsworth, Belmont CA, 1973.
Box, George E. P. and Cox, D.R., An Analysis of Transformations, J. Royal Statistical Society, Series B, 26, p 211-252, 1964.
Box, George E. P., Hunter, W.G. and Hunter, J.S., Statistics for
Experimenters, Wiley, New York, 1978.
Bishop, Yvonne M.M., Fienberg, S. and Holland, P.W., Discrete
Multivariate Analysis, MIT Press, Cambridge, MA, 1975.
Brown, Morton B. and Forsythe, A., The Small Sample Behavior of Some Statistics Which Test the Equality of Several Means, Technometrics, p 129-132, 1974.
Burke, Linda B. and Clark, V.L., Processing Data, Sage Publications,
Newbury Park, CA, 1992.
Cohen, Jacob, Statistical Power Analysis for the Behavioral Sciences,
Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.
Cohen, Jacob and Cohen, P., Applied Multiple Regression/Correlation
Analysis for the Behavioral Sciences (2nd ed.), Lawrence Erlbaum
Associates, Hillsdale, NJ, 1983.
References R - 1
Daniel, Cuthbert and Wood, Fred S., Fitting Equations to Data (2nd ed.),
Wiley, New York, 1980.
Daniel, Wayne W., Applied Nonparametric Statistics, Houghton Mifflin,
Boston, 1978.
Draper, Norman and Smith, Harry, Applied Regression Analysis (3rd
ed.), Wiley, New York, 1998.
Fienberg, Stephen E., The Analysis of Cross-Classified Categorical Data,
MIT Press, Cambridge, MA, 1977.
Gibbons, Jean D., Nonparametric Measures of Association, Sage
Publications, Newbury Park, CA, 1993.
Hardle, Wolfgang, Applied Nonparametric Regression, Cambridge
University Press, Cambridge, 1990.
Hoaglin, David C., Mosteller, F. and Tukey, J.W., Exploring Data Tables,
Trends and Shapes, Wiley, New York, 1985.
Hoaglin, David C., Mosteller, F. and Tukey, J.W., Fundamentals of
Exploratory Analysis of Variance, Wiley, New York, 1991.
Hsu, Jason C., Multiple Comparisons: Theory and Methods, Chapman &
Hall, London, 1996.
Kish, Leslie, Survey Sampling, Wiley, New York, 1965.
Kraemer, H.K. and Thiemann, S., How Many Subjects? Statistical Power
Analysis in Research, Sage Publications, Newbury Park, CA, 1987.
Kirk, Roger E., Experimental Design: Procedures for the Behavioral
Sciences, Brooks/Cole Publishing Co., Belmont, CA, 1968.
Klockars, Alan J. and Sax, G., Multiple Comparisons, Sage Publications, Newbury Park, CA, 1986.
Milliken, George A. and Johnson, Dallas E., Analysis of Messy Data,
Volume 1: Designed Experiments, Van Nostrand Reinhold, New York,
1984.
Mosteller, Frederick and Tukey, John W., Data Analysis and Regression,
Addison-Wesley, Reading, MA, 1977.
National Opinion Research Center (NORC), General Social Survey, 1972-1991: Cumulative Codebook, National Opinion Research Center, Chicago, 1991.
Rossi, Peter H, Wright, J.D. and Anderson, A.B., Handbook of Survey
Research, Academic Press, New York, 1983.
Searle, Shayle R., Linear Models for Unbalanced Data, Wiley, New York,
1987.
Siegel, S., Nonparametric Statistics, McGraw-Hill, New York, 1956.
Sudman, Seymour, Applied Sampling, Academic Press, New York, 1976.
Toothaker, Larry E., Multiple Comparisons for Researchers, Sage
Publications, Newbury Park, CA, 1991.
Tukey, John W., Exploratory Data Analysis, Addison-Wesley, Reading,
MA, 1977.
Tukey, John W., The Philosophy of Multiple Comparisons, Statistical
Science, v 6, 1, p 100-116, 1991.
Velleman, Paul F. and Wilkinson, L., Nominal, Ordinal and Ratio Typologies are Misleading for Classifying Statistical Methodology, The American Statistician, v 47, p 65-72, 1993.
Wilcox, Rand R., Introduction to Robust Estimation and Hypothesis Testing, Academic Press, New York, 1997.
Exercises
Note on Exercise
Data
Chapter 3
The exercises use the General Social Survey 1994 data (GSS94.POR)
located in the c:\Train\Stats folder on your training machine. If you are
not working in an SPSS Training center, the training files can be copied
from the floppy disk that accompanies this course guide. If you are
running SPSS Server (click File..Switch Server to check), then you should
copy these files to the server or to a machine that can be accessed (mapped) from the computer running SPSS Server.
Checking Data
People who have been married should have an age at which they were
first married. Use the Transform..Compute dialog to create a variable
named NOAGEWED. Code it so that for respondents who have been
married (MARITAL 1,2,3,4), code 1 indicates they are missing their age
first married (AGEWED is missing), while 0 indicates they reported an
age first married. Run a frequency table on the NOAGEWED variable.
Chapter 4
Chapter 5
Exercises E - 1
If you chose additional variables in the previous exercise (Chapter 4), run
them in a crosstab analysis. For a few of your crosstab tables, rerun
requesting appropriate measures of association. Are the results
consistent with your interpretation up to this point? Based on either the
association measures, or percentage differences, would you say the
results have practical (or ecological) significance? Run a three-way
crosstab of social class (CLASS) by gender (SEX) by belief in the afterlife
(POSTLIFE) and interpret the results.
Chapter 6
Chapter 7
Chapter 8
plot displaying the means (error bar). Are the differences large enough to
be of practical importance?
For those with extra time: Look at the box & whisker plot of number of
children by region. The distribution of number of children in each group
is clearly not normal. Why might this not be a major problem in the
analysis? This analysis looks only at region. If there were another factor
(say race) that is related to the number of children, how might it
influence this analysis? How might you adjust for it?
Chapter 9
Chapter 10
Chapter 11
Introduction to Regression
We know there is a relationship between the education of the
respondent's parents and the education of the respondent. Develop a regression equation predicting respondent's education (EDUC) from the
education of one parent (you choose between MAEDUC and PAEDUC).
Does the equation, in your opinion, adequately account for the variation
in respondent's education? If the parent had 16 years of education, what is your prediction for the education of the respondent? Add the remaining parent's education as a second predictor to the equation. Does this
substantially improve the prediction? Does there seem to be much
difference between the two predictors?
For those with extra time: Compare the coefficients from the one and two
variable prediction equations and note the very small difference in r-square. Looking at the two analyses, can you explain the change in the
regression coefficient of the variable (MAEDUC or PAEDUC) present in
both equations (hint: look at the coefficient of the other predictor
variable)?