
The Oxford Handbook of Quantitative Methods
OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief

Peter E. Nathan

Area Editors:

Clinical Psychology
David H. Barlow

Cognitive Neuroscience
Kevin N. Ochsner and Stephen M. Kosslyn

Cognitive Psychology
Daniel Reisberg

Counseling Psychology
Elizabeth M. Altmaier and Jo-Ida C. Hansen

Developmental Psychology
Philip David Zelazo

Health Psychology
Howard S. Friedman

History of Psychology
David B. Baker

Methods and Measurement


Todd D. Little

Neuropsychology
Kenneth M. Adams

Organizational Psychology
Steve W. J. Kozlowski

Personality and Social Psychology


Kay Deaux and Mark Snyder
OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief: Peter E. Nathan

The Oxford Handbook of Quantitative Methods
Edited by
Todd D. Little

Volume 1: Foundations

Oxford University Press, Inc., publishes works that further
Oxford University’s objective of excellence in
research, scholarship, and education.

Oxford New York


Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto

With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam

© 2013 by Oxford University Press, Inc.

Published by Oxford University Press, Inc.


198 Madison Avenue, New York, New York 10016
www.oup.com

Oxford is a registered trademark of Oxford University Press

All rights reserved. No part of this publication may be reproduced,


stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or
otherwise, without the prior permission of Oxford University Press

Library of Congress Cataloging-in-Publication Data


The Oxford handbook of quantitative methods / edited by Todd D. Little.
v. cm. – (Oxford library of psychology)
ISBN 978–0–19–993487–4
ISBN 978–0–19–993489–8
1. Psychology–Statistical methods. 2. Psychology–Mathematical models. I. Little, Todd D.
BF39.O927 2012
150.72/1—dc23
2012015005

9 8 7 6 5 4 3 2 1
Printed in the United States of America
on acid-free paper
SHORT CONTENTS

Oxford Library of Psychology vii

About the Editor ix

Contributors xi

Table of Contents xvii

Chapters 1–479

Index 480

OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world's oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.

Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level are a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact on the field. Adding to its accessibility and ease of use, the Library will be published in print and, later, electronically.

The Library surveys psychology's principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. This initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology's major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook's table of contents and to offer informed anticipations of significant future developments in that field.

An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation's and world's most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise.

For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested.

Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as a print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.

In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology's dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.

Peter E. Nathan
Editor-in-Chief
Oxford Library of Psychology



ABOUT THE EDITOR

Todd D. Little
Todd D. Little, Ph.D., is a Professor of Psychology, Director of the Quantitative training program, Director of the undergraduate Social and Behavioral Sciences Methodology minor, and a member of the Developmental training program. Since 2010, Todd has been Director of the Center for Research Methods and Data Analysis (CRMDA) at Kansas University. Little is internationally recognized for his quantitative work on various aspects of applied SEM (e.g., indicator selection, parceling, modeling developmental processes) as well as his substantive developmental research (e.g., action-control processes and motivation, coping, and self-regulation). In 2001, Little was elected to membership in the Society for Multivariate Experimental Psychology. In 2009, he was elected President of APA's Division 5 (Evaluation, Measurement, and Statistics), and in 2010 he was elected Fellow of the division. In 2012, he was elected Fellow in the Association for Psychological Science. He founded, organizes, and teaches in the internationally renowned KU "Stats Camps" each June (see crmda.KU.edu for details of the summer training programs). Little has edited five books related to methodology, including The Oxford Handbook of Quantitative Methods and the Guilford Handbook of Developmental Research Methods (with Brett Laursen and Noel Card). Little has been principal investigator or co-principal investigator on more than 15 grants and contracts, statistical consultant on more than 60 grants, and he has guided the development of more than 10 different measurement tools.

CONTRIBUTORS

Leona S. Aiken, Department of Psychology, Arizona State University, Tempe, AZ
Rawni A. Anderson, Center for Research Methods and Data Analysis, University of Kansas, Lawrence, KS
Luc Anselin, GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ
Amanda N. Baraldi, Department of Psychology, Arizona State University, Tempe, AZ
David E. Bard, Department of Pediatrics, University of Oklahoma Health Sciences Center, Oklahoma City, OK
Theodore P. Beauchaine, Department of Psychology, Washington State University, Pullman, WA
Gabriëlla A.M. Blokland, Genetic Epidemiology Laboratory, Queensland Institute of Medical Research, and School of Psychology and Centre for Advanced Imaging, University of Queensland, Brisbane, Australia
Annette Brose, Max Planck Institute for Human Development, Berlin, Germany, and Max Planck Institute for Human Cognitive and Brain Sciences
Timothy A. Brown, Department of Psychology, Boston University, Boston, MA
Trent D. Buskirk, Department of Community Health, Biostatistics Division, Saint Louis University, Saint Louis, MO
Noel A. Card, Family Studies and Human Development, University of Arizona, Tucson, AZ
Deborah M. Casper, Family Studies and Human Development, University of Arizona, Tucson, AZ
Daniel R. Cavagnaro, Department of Psychology, The Ohio State University, Columbus, OH
Rand D. Conger, Department of Human and Community Development, University of California at Davis, Davis, CA
David Cook, Abt Associates Inc.
Thomas D. Cook, Institute for Policy Research, Northwestern University, Evanston, IL
Stefany Coxe, Department of Psychology, Arizona State University, Tempe, AZ
R.J. de Ayala, Department of Educational Psychology, University of Nebraska-Lincoln, Lincoln, NE
Pascal R. Deboeck, Department of Psychology, University of Kansas, Lawrence, KS
Sarah Depaoli, Department of Educational Psychology, University of Wisconsin-Madison, Madison, WI
Cody S. Ding, College of Education, University of Missouri-Saint Louis, Saint Louis, MO
M. Brent Donnellan, Department of Psychology, Michigan State University, East Lansing, MI
Dawnté R. Early, Department of Human and Community Development, University of California at Davis, Davis, CA
Craig K. Enders, Department of Psychology, Arizona State University, Tempe, AZ
David M. Erceg-Hurn, School of Psychology, University of Western Australia, Crawley, WA, Australia
Aurelio José Figueredo, Department of Psychology, School of Mind, Brain, & Behavior, Division of Family Studies and Human Development, College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ
Rafael Antonio Garcia, Department of Psychology, School of Mind, Brain, & Behavior, University of Arizona, Tucson, AZ
Amanda C. Gottschall, Department of Psychology, Arizona State University, Tempe, AZ
Michael J. Greenacre, Department of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain
Brian D. Haig, Department of Psychology, University of Canterbury, Canterbury, New Zealand
Kelly Hallberg, Institute for Policy Research, Northwestern University, Evanston, IL
Lisa L. Harlow, Department of Psychology, University of Rhode Island, Kingston, RI
Emily J. Hart, Department of Psychology, University at Buffalo, The State University of New York, Buffalo, NY
Kit-Tai Hau, The Chinese University of Hong Kong, Hong Kong
Joop J. Hox, Department of Methodology and Statistics, Utrecht University, Utrecht, The Netherlands
James Jaccard, Department of Psychology, Florida International University, Boca Raton, FL
Paul E. Johnson, Department of Political Science, Kansas University, Lawrence, KS
Kelly M. Kadlec, Department of Psychology, University of Southern California, Los Angeles, CA
David Kaplan, Department of Educational Psychology, University of Wisconsin-Madison, Madison, WI
Ken Kelley, Department of Management, University of Notre Dame, Notre Dame, IN
Harvey J. Keselman, Department of Psychology, University of Manitoba, Winnipeg, Canada
Neal M. Kingston, School of Education, University of Kansas, Lawrence, KS
Yasemin Kisbu-Sakarya, Department of Psychology, Arizona State University, Tempe, AZ
Laura B. Kramer, School of Education, University of Kansas, Lawrence, KS
Todd D. Little, Center for Research Methods and Data Analysis, Department of Psychology, University of Kansas, Lawrence, KS
Richard E. Lucas, Department of Psychology, Michigan State University, East Lansing, MI
David P. MacKinnon, Department of Psychology, Arizona State University, Tempe, AZ
Patrick Mair, Institute for Statistics and Mathematics, Vienna University of Economics and Business, Vienna, Austria
Herbert W. Marsh, Department of Education, University of Oxford, Oxford, UK
Katherine E. Masyn, Graduate School of Education, Harvard University, Cambridge, MA
John J. McArdle, Department of Psychology, University of Southern California, Los Angeles, CA
Roderick P. McDonald†, Sydney University, Sydney, Australia; Professor Emeritus, University of Illinois at Urbana-Champaign; Professor Emeritus, Macquarie University († April 16, 1928 – October 29, 2011)
Sarah E. Medland, Genetic Epidemiology Laboratory, Queensland Institute of Medical Research, and School of Psychology, University of Queensland, Brisbane, Australia
Peter C. M. Molenaar, Department of Human Development and Family Studies, Pennsylvania State University, University Park, PA
Alexandre J.S. Morin, Department of Psychology, University of Sherbrooke, Sherbrooke, Quebec, Canada
Miriam A. Mosing, Genetic Epidemiology Laboratory, Queensland Institute of Medical Research, and School of Psychology, University of Queensland, Brisbane, Australia
Keith E. Muller, Department of Health Outcomes and Policy, University of Florida, Gainesville, FL
Eun-Young Mun, Center of Alcohol Studies, Rutgers University, Piscataway, NJ
Alan T. Murray, GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ
Jay I. Myung, Department of Psychology, The Ohio State University, Columbus, OH
Benjamin Nagengast, Department of Education, Oxford University, Oxford, UK
Sally Gayle Olderbak, Department of Psychology, School of Mind, Brain, & Behavior, University of Arizona, Tucson, AZ
Jamie M. Ostrov, Department of Psychology, University at Buffalo, The State University of New York, Buffalo, NY
Trond Peterson, Department of Sociology, University of California-Berkeley, Berkeley, CA
Mark A. Pitt, Department of Psychology, The Ohio State University, Columbus, OH
Larry R. Price, College of Education and College of Science, Texas State University-San Marcos, San Marcos, TX
Nilam Ram, Department of Human Development and Family Studies, Pennsylvania State University, University Park, PA, and Max Planck Institute for Human Development, Berlin, Germany
Sergio J. Rey, GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ
Joseph L. Rodgers, Department of Psychology, University of Oklahoma, Norman, OK
Robert Rosenthal, Department of Psychology, University of California, Riverside, Riverside, CA
Ralph L. Rosnow, Department of Psychology, Temple University, Philadelphia, PA
André A. Rupp, Department of Measurement, Statistics, and Evaluation (EDMS), University of Maryland, College Park, MD
Gabriel Lee Schlomer, Division of Family Studies and Human Development, College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ
Christof Schuster, Department of Psychology, Justus-Liebig-Universität Giessen, Giessen, Germany
James P. Selig, Department of Psychology, University of New Mexico, Albuquerque, NM
Paul E. Spector, Department of Psychology, University of South Florida, Tampa, FL
Peter M. Steiner, Department of Educational Psychology, University of Wisconsin-Madison, Madison, WI
Carolin Strobl, Ludwig-Maximilians-Universität München, Faculty of Mathematics, Informatics and Statistics, Institute of Statistics, Munich, Germany
Bruce Thompson, Baylor College of Medicine, Austin, TX
Terry T. Tomazic, Department of Sociology and Criminal Justice, Saint Louis University, Saint Louis, MO
James T. Townsend, Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Trisha Van Zandt, Department of Psychology, The Ohio State University, Columbus, OH
Alexander von Eye, Departments of Psychology, Michigan State University, East Lansing, MI, and University of Vienna, Vienna, Austria
Stefan von Weber, Department of Mechanical Engineering, Hochschule Furtwangen University, Furtwangen im Schwarzwald, Germany
Karin J.H. Verweij, Genetic Epidemiology Laboratory, Queensland Institute of Medical Research, and School of Psychology, University of Queensland, Brisbane, Australia
Theodore A. Walls, Department of Behavioral Science, University of Rhode Island, Kingston, RI
Lihshing Leigh Wang, Educational Studies Program, University of Cincinnati, Cincinnati, OH
Amber S. Watts, Center for Research Methods and Data Analysis and Lifespan Institute, Gerontology Center, University of Kansas, Lawrence, KS
William W.S. Wei, Department of Statistics, Temple University, Philadelphia, PA
Zhonglin Wen, South China Normal University
Stephen G. West, Department of Psychology, Arizona State University, Tempe, AZ
Keith F. Widaman, Department of Psychology, University of California at Davis, Davis, CA
Rand R. Wilcox, Department of Psychology, University of Southern California, Los Angeles, CA
Lisa M. Willoughby, Department of Psychology, Saint Louis University, Saint Louis, MO
Coady Wing, Institute for Policy Research, Northwestern University, Evanston, IL
Pedro Sofio Abril Wolf, Department of Psychology, University of Cape Town, Cape Town, South Africa
Vivian Wong, School of Education and Social Policy, Northwestern University, Evanston, IL
Carol M. Woods, Center for Research Methods and Data Analysis and Department of Psychology, University of Kansas, Lawrence, KS
Wei Wu, Center for Research Methods and Data Analysis and Department of Psychology, University of Kansas, Lawrence, KS
Ke-Hai Yuan, Department of Psychology, University of Notre Dame, Notre Dame, IN
CONTENTS

1. Introduction 1
Todd D. Little
2. The Philosophy of Quantitative Methods 7
Brian D. Haig
3. Quantitative Methods and Ethics 32
Ralph L. Rosnow and Robert Rosenthal
4. Special Populations 55
Keith F. Widaman, Dawnté R. Early, and Rand D. Conger
5. Theory Construction, Model Building, and Model Selection 82
James Jaccard
6. Teaching Quantitative Psychology 105
Lisa L. Harlow
7. Modern Test Theory 118
Roderick P. McDonald
8. The IRT Tradition and its Applications 144
R.J. de Ayala
9. Survey Design and Measure Development 170
Paul E. Spector
10. High-Stakes Test Construction and Test Use 189
Neal M. Kingston and Laura B. Kramer
11. Effect Size and Sample Size Planning 206
Ken Kelley
12. Experimental Design for Causal Inference: Clinical Trials and
Regression Discontinuity Designs 223
Kelly Hallberg, Coady Wing, Vivian Wong, and Thomas D. Cook
13. Matching and Propensity Scores 237
Peter M. Steiner and David Cook
14. Designs for and Analyses of Response Time Experiments 260
Trisha Van Zandt and James T. Townsend
15. Observational Methods 286
Jamie M. Ostrov and Emily J. Hart

16. A Primer of Epidemiologic Methods, Concepts, and Analysis with
Examples and More Advanced Applications within Psychology 305
David E. Bard, Joseph L. Rodgers, and Keith E. Muller
17. Program Evaluation: Principles, Procedures, and Practices 332
Aurelio José Figueredo, Sally Gayle Olderbak, Gabriel Lee Schlomer,
Rafael Antonio Garcia, and Pedro Sofio Abril Wolf
18. Overview of Statistical Estimation Methods 361
Ke-Hai Yuan and Christof Schuster
19. Robust Statistical Estimation 388
David M. Erceg-Hurn, Rand R. Wilcox, and Harvey J. Keselman
20. Bayesian Statistical Methods 407
David Kaplan and Sarah Depaoli
21. Mathematical Modeling 438
Daniel R. Cavagnaro, Jay I. Myung, and Mark A. Pitt
22. Monte Carlo Analysis in Academic Research 454
Paul E. Johnson

Index 480

CHAPTER 1

Introduction

Todd D. Little

Abstract

In this introductory chapter to The Oxford Handbook of Quantitative Methods, I provide an overview of
the two volumes. More specifically, I describe the rationale and motivation for the selected topics that
are presented in the volumes. I also list my instructions to the chapter authors and then describe how
the chapters fit together into thematic groupings. I also extend my sincerest gratitude to the persons
who assisted me along the way, as no work this comprehensive can be done without the considerable
help and assistance of many persons. I conclude with how pleased I am with the quality and
comprehensiveness of the chapters that are included.
Key Words: Overview; Quantitative Methods; Methodology; Statistics

Introduction

Handbooks provide a crucial venue to communicate the current state of the field. They also provide a one-stop source for learning and reviewing current best practices in a field. The Oxford Handbook of Quantitative Methods serves both of these functions. The field of quantitative methods is quite broad, as you can probably imagine. I have tried to be thorough in my selection of topics to be covered. As with any handbook of this magnitude, some topics were all set to have a contribution submitted, only to have some unforeseen hindrance preclude its inclusion at the last minute (e.g., graphical representations of data, ecological inference, history of quantitative methods). Some topics overlap with others and may not have found their way to become a separate chapter, but their fundamental elements are found in parts of other chapters.

This handbook is one of many that Oxford University Press (OUP) is assembling but will be the capstone methodology handbook. As many of you know, OUP is building a comprehensive and synthetic Oxford Library of Handbooks covering the field of psychology (the Editor-in-Chief of the library is Peter Nathan, University of Iowa Foundation Distinguished Professor of Psychology and Public Health). The library comprises handbooks in the truest sense of the word: books that summarize and synthesize a topic, define the current scholarship, and set the agenda for future research. Each handbook is published as a bound book, and it will also be developed for electronic delivery. In this format, the content will be integrated across topics and available as a fully integrated electronic library. I think the idea of a comprehensive electronic library is very forward-thinking. This format is a very attractive opportunity to have a fully comprehensive and up-to-date handbook of methods in our field. Hence, I agreed to take on the role of editor of The Oxford Handbook of Quantitative Methods.

I am very pleased with the quality of the work that each author provided. As per my request to the contributing authors, each chapter is meant to be both accessible and comprehensive; nearly all the

authors were very responsive to my requests. The guidelines I asked authors to consider were:

• Handbook chapters should be comprehensive and authoritative; readers will rely heavily on these chapters, particularly when they move to the online format.
• Handbook chapters should present not only the strengths of the topic covered but also any limitations.
• Handbook chapters should make all assumptions underlying the topic explicit.
• Regarding citations, handbook chapters should cover the historical origins as well as the recent renditions of a given key topic.
• Handbook chapters should not present one-sided views on any debate; rather, they should report the issues and present the arguments—both pro and con. Authors can direct readers to other platforms where a position piece is presented.
• To facilitate the online linkages, handbook chapters should point to other online resources related to the topic presented.
• Every element of every formula presented must be explicitly explained; assume no knowledge of how to read formulae.
• Examples, examples, examples, and, when in doubt, provide an example! Concrete examples are absolutely critical to communicate quantitative content.
• Avoid jargon and acronyms. Please spell out acronyms, and if you use jargon, please remind the reader of the meaning or definition of the jargon every three to four times it is used; similarly, if you use an acronym, then remind the reader of what it means every three to four times it is used.
• Use active voice, and do not shy away from the use of I/me or we/us. Channel how you lecture on the topic. It will create a crisp and enjoyable read.
• Do not start a sentence with "This" followed by a verb. The referent to "this" must be restated because of the ambiguity this creates. This general guideline should be followed as a rule!

Authors, like editors, have preferences and habits, so you will find places, chapters, and so on where some of my admonitions were not followed. But the quality of the product that each chapter provides is nonetheless uncompromised. We have established a Wiki-based resource page for the handbook, which can be found at crmda.KU.edu/oxford. Each author has been asked to maintain and upload materials to support his or her chapter contribution. At the top of that page is a link that encourages you to offer comments and suggestions on the topics and coverage of the handbook. These comments will be reviewed and integrated into future editions of this handbook. I encourage you, therefore, to take advantage of this opportunity to help shape the directions and content coverage of this handbook.

Statistical software has blossomed with the advent of hardware that provides the necessary speed and memory and programming languages coupled with numerical algorithms that are more efficient and optimized than yesteryear. These software advances have allowed many of the advances in modern statistics to become accessible to the typical end-user. Modern missing data algorithms and Bayesian estimation procedures, for example, have been the beneficiaries of these advances. Of course, some of the software developments have included simplified interfaces with slick graphic user interfaces. The critical options are usually prefilled with default settings. These latter two aspects of advancing software are unfortunate because they lead to mindless applications of the statistical techniques. I would prefer that options not be set as default but, rather, have the software prompt the user to make a choice (and give good help for what each choice means). I would prefer that a complete script of the GUI choices and the order in which steps were taken be automatically saved and displayed.

I have organized the handbook by starting with some basics. It begins with the philosophical underpinnings associated with science and quantitative methods (Haig, Chapter 2, Volume 1) followed by a discussion of how to construct theories and models so that they can be tested empirically and the best model selected (Jaccard, Chapter 5, Volume 1). I then turn to an enlightened discussion of ethics in the conduct of quantitative research (Rosnow & Rosenthal, Chapter 3, Volume 1) and related issues when quantitative methods are applied in special populations (Widaman, Early, & Conger, Chapter 4, Volume 1). Harlow (Chapter 6, Volume 1) follows with an encompassing and impassioned discussion of teaching quantitative methods.

The theme in the next grouping of chapters centers on measurement issues. First, the late McDonald (Chapter 7, Volume 1) provides a thorough overview of Modern Test Theory.1 De Ayala (Chapter 8, Volume 1) adds a detailed discussion of Item Response Theory as an essential measurement and analysis tool. After these principles
of measurement are discussed, the principles and practices surrounding survey design and measure development are presented (Spector, Chapter 9, Volume 1). Kingston and Kramer (Chapter 10, Volume 1) further this discussion in the context of high-stakes testing.

A next grouping of chapters covers various design issues. Kelley (Chapter 11, Volume 1) begins this section by covering issues of power, effect size, and sample size planning. Hallberg, Wing, Wong, and Cook (Chapter 12, Volume 1) then address key experimental designs for causal inference: the gold standard randomized clinical trials (RCT) design and the underutilized regression discontinuity design. Some key quasi-experimental procedures for comparing groups are discussed in Steiner and Cook's (Chapter 13, Volume 1) chapter on using matching and propensity scores. Finally, Van Zandt and Townsend (Chapter 14, Volume 1) provide a detailed discussion of the designs for and analyses of response time experiments. I put observational methods (Ostrov & Hart, Chapter 15, Volume 1), epidemiological methods (Bard, Rodgers, & Muller, Chapter 16, Volume 1), and program evaluation (Figueredo, Olderbak, & Schlomer, Chapter 17, Volume 1) in with these chapters because they address more collection and design issues, although the discussion of program evaluation also addresses the unique analysis and presentation issues.

I have a stellar group of chapters related to estimation issues. Yuan and Schuster (Chapter 18, Volume 1) provide an overview of statistical estimation methods; Erceg-Hurn, Wilcox, and Keselman (Chapter 19, Volume 1) provide a nice complement with a focus on robust estimation techniques. Bayesian statistical estimation methods are thoroughly reviewed in the Kaplan and Depaoli (Chapter 20, Volume 1) contribution. The details of mathematical modeling are synthesized in this section by Cavagnaro, Myung, and Pitt (Chapter 21, Volume 1). This section is completed by Johnson (Chapter 22, Volume 1), who discusses the many issues and nuances involved in conducting Monte Carlo simulations to address the what-would-happen-if questions that we often need to answer.

The foundational techniques for the statistical analysis of quantitative data start with a detailed overview of the traditional methods that have marked social and behavioral sciences (i.e., the General Linear Model; Thompson, Chapter 2, Volume 2). Coxe, West, and Aiken (Chapter 3, Volume 2) then extend the General Linear Model to discuss the Generalized Linear Model. This discussion is easily followed by Woods (Chapter 4, Volume 2), who synthesizes the various techniques of analyzing categorical data. After the chapter on configural frequency analysis by Von Eye, Mun, Mair, and von Weber (Chapter 5, Volume 2), I then segue into nonparametric techniques (Buskirk, Tomazic, & Willoughby, Chapter 6, Volume 2) and the more specialized techniques of correspondence analysis (Greenacre, Chapter 7, Volume 2) and spatial analysis (Anselin, Murray, & Rey, Chapter 8, Volume 2). This section is capped with chapters dedicated to special areas of research—namely, techniques and issues related to the analysis of imaging data (e.g., fMRI; Price, Chapter 9, Volume 2). The closely aligned worlds of behavior genetics (i.e., twin studies; Blokland, Mosing, Verweij, & Medland, Chapter 10, Volume 2) and genes (Medland, Chapter 11, Volume 2) follow.

The foundations of multivariate techniques are grouped beginning with Ding's (Chapter 12, Volume 2) presentation of multidimensional scaling and Brown's (Chapter 13, Volume 2) summary of the foundations of latent variable measurement models. Hox (Chapter 14, Volume 2) layers in the multilevel issues as handled in both the manifest regression framework and the latent variable work of structural equation modeling. McArdle and Kadlec (Chapter 15, Volume 2) detail, in broad terms, different structural equation models and their utility. MacKinnon, Kisbu-Sakarya, and Gottschall (Chapter 16, Volume 2) address the many new developments in mediation analysis, while Marsh, Hau, Wen, and Nagengast (Chapter 17, Volume 2) do the same for analyses of moderation.

The next group of chapters focuses on repeated measures and longitudinal designs. It begins with a chapter I co-wrote with Wu and Selig that provides a general overview of longitudinal models (Wu, Selig, & Little, Chapter 18, Volume 2). Deboeck (Chapter 19, Volume 2) takes things further into the burgeoning world of dynamical systems and continuous-time models for longitudinal data. Relatedly, Walls (Chapter 20, Volume 2) provides an overview of designs for doing intensive longitudinal collection and analysis designs. The wonderful world of dynamic-factor models (a multivariate model for single-subject data) is presented by Ram, Brose, and Molenaar (Chapter 21, Volume 2). Wei (Chapter 22, Volume 2) covers all the issues of traditional time-series models and Peterson (Chapter 23,
Volume 2) rounds out this section with a thorough coverage of event history models.

The volume finishes with two small sections. The first focuses on techniques dedicated to finding heterogeneous subgroups in one's data. Rupp (Chapter 24, Volume 2) covers traditional clustering and classification procedures. Masyn and Nylund-Gibson (Chapter 25, Volume 2) cover the model-based approaches encompassed under the umbrella of mixture modeling. Beauchaine (Chapter 26, Volume 2) completes this first group with his coverage of the nuances of taxometrics. The second of the final group of chapters covers issues related to secondary analyses of extant data. I put the chapter on missing data in here because it generally is applied after data collection occurs, but it is also a little out of order here because of the terrific and powerful features of planned missing data designs. In this regard, Baraldi and Enders (Chapter 27, Volume 2) could have gone into the design section. Donnellan and Lucas (Chapter 28, Volume 2) cover the issues associated with analyzing the large-scale archival data sets that are available via federal funding agencies such as NCES, NIH, NSF, and the like. Data mining can also be classified as a set of secondary modeling procedures, and Strobl's (Chapter 29, Volume 2) chapter covers the techniques and issues in this emerging field of methodology. Card and Casper (Chapter 30, Volume 2) cover the still-advancing world of meta-analysis and current best practices in the quantitative synthesis of published studies. The final chapter of The Oxford Handbook of Quantitative Methods is one I co-authored with Wang, Watts, and Anderson (Wang, Watts, Anderson, & Little, Chapter 31, Volume 2). In this capstone chapter, we address the many pervasive fallacies that still permeate the world of quantitative methodology.

A venture such as this does involve the generous and essential contributions of expert reviewers. Many of the chapter authors also served as reviewers for other chapters, and I won't mention them by name here. I do want to express gratitude to a number of ad hoc reviewers who assisted me along the way (in arbitrary order): Steve Lee, Kris Preacher, Mijke Rhemtulla, Chantelle Dowsett, Jason Lee, Michael Edwards, and David Johnson (I apologize now if I have forgotten that you reviewed a chapter for me!). I also owe a debt of gratitude to Chad Zimmerman at OUP, who was relentless in guiding us through the incremental steps needed to herd us all to a final and pride-worthy end product, and to Anne Dellinger, who was instrumental in bringing closure to this mammoth project.

Author Note
Partial support for this project was provided by grant NSF 1053160 (Todd D. Little & Wei Wu, co-PIs) and by the Center for Research Methods and Data Analysis at the University of Kansas (Todd D. Little, director). Correspondence concerning this work should be addressed to Todd D. Little, Center for Research Methods and Data Analysis, University of Kansas, 1425 Jayhawk Blvd., Watson Library 470, Lawrence, KS 66045. E-mail: yhat@ku.edu. Web: crmda.ku.edu.

Note
1. This chapter was completed shortly before Rod's unexpected passing. His legacy and commitment to quantitative methods were uncompromising, and we will miss his voice of wisdom and his piercing intellect; R.I.P., Rod McDonald, and, as you once said, pervixi….

References
Anselin, L., Murray, A. T., & Rey, S. J. (2012). Spatial analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 154–174). New York: Oxford University Press.
Baraldi, A. N., & Enders, C. K. (2012). Missing data methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 635–664). New York: Oxford University Press.
Bard, D. E., Rodgers, J. L., & Muller, K. E. (2012). A primer of epidemiology methods with applications in psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 305–349). New York: Oxford University Press.
Beauchaine, T. P. (2012). Taxometrics. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 612–634). New York: Oxford University Press.
Brown, T. A. (2012). Latent variable measurement models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 257–280). New York: Oxford University Press.
Buskirk, T. D., Tomazic, T. T., & Willoughby, L. (2012). Nonparametric statistical techniques. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 106–141). New York: Oxford University Press.
Card, N. A., & Casper, D. M. (2012). Meta-analysis and quantitative research synthesis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 701–717). New York: Oxford University Press.
Cavagnaro, D. R., Myung, J. I., & Pitt, M. A. (2012). Mathematical modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 438–453). New York: Oxford University Press.
Coxe, S., West, S. G., & Aiken, L. S. (2012). Generalized linear models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 26–51). New York: Oxford University Press.
De Ayala, R. J. (2012). The IRT tradition and its applications. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 144–169). New York: Oxford University Press.
Deboeck, P. R. (2012). Dynamical systems and models of continuous time. In T. D. Little (Ed.), The Oxford Handbook

of Quantitative Methods (Vol. 2, pp. 411–431). New York: Oxford University Press.
Ding, C. S. (2012). Multidimensional scaling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford University Press.
Donnellan, M. B., & Lucas, R. E. (2012). Secondary data analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 665–677). New York: Oxford University Press.
Erceg-Hurn, D. M., Wilcox, R. R., & Keselman, H. H. (2012). Robust statistical estimation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 388–406). New York: Oxford University Press.
Figueredo, A. J., Olderbak, S. G., & Schlomer, G. L. (2012). Program evaluation: Principles, procedures, and practices. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 332–360). New York: Oxford University Press.
Greenacre, M. J. (2012). Correspondence analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 142–153). New York: Oxford University Press.
Haig, B. D. (2012). The philosophy of quantitative methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 7–31). New York: Oxford University Press.
Hallberg, K., Wing, C., Wong, V., & Cook, T. D. (2012). Experimental design for causal inference: Clinical trials and regression discontinuity designs. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 223–236). New York: Oxford University Press.
Harlow, L. (2012). Teaching quantitative psychology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 105–117). New York: Oxford University Press.
Hox, J. J. (2012). Multilevel regression and multilevel structural equation modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 281–294). New York: Oxford University Press.
Jaccard, J. (2012). Theory construction, model building, and model selection. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 82–104). New York: Oxford University Press.
Johnson, P. E. (2012). Monte Carlo analysis in academic research. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 454–479). New York: Oxford University Press.
Kaplan, D., & Depaoli, S. (2012). Bayesian statistical methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 407–437). New York: Oxford University Press.
Kelley, K. (2012). Effect size and sample size planning. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 206–222). New York: Oxford University Press.
Kingston, N. M., & Kramer, L. B. (2012). High stakes test construction and test use. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 189–205). New York: Oxford University Press.
MacKinnon, D. P., Kisbu-Sakarya, Y., & Gottschall, A. C. (2012). Developments in mediation analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 338–360). New York: Oxford University Press.
Marsh, H. W., Hau, K.-T., Wen, Z., & Nagengast, B. (2012). Moderation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 361–386). New York: Oxford University Press.
Masyn, K. E., & Nylund-Gibson, K. (2012). Mixture modeling. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 551–611). New York: Oxford University Press.
McArdle, J. J., & Kadlec, K. M. (2012). Structural equation models. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 295–337). New York: Oxford University Press.
McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.
Medland, S. E. (2012). Quantitative analysis of genes. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 219–234). New York: Oxford University Press.
Ostrov, J. M., & Hart, E. J. (2012). Observational methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 286–304). New York: Oxford University Press.
Peterson, T. (2012). Analyzing event history data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 486–516). New York: Oxford University Press.
Price, L. R. (2012). Analysis of imaging data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 175–197). New York: Oxford University Press.
Ram, N., Brose, A., & Molenaar, P. C. M. (2012). Dynamic factor analysis: Modeling person-specific process. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 441–457). New York: Oxford University Press.
Rosnow, R. L., & Rosenthal, R. (2012). Quantitative methods and ethics. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 32–54). New York: Oxford University Press.
Rupp, A. A. (2012). Clustering and classification. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 517–611). New York: Oxford University Press.
Spector, P. E. (2012). Survey design and measure development. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 170–188). New York: Oxford University Press.
Steiner, P. M., & Cook, D. (2012). Matching and propensity scores. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 237–259). New York: Oxford University Press.
Strobl, C. (2012). Data mining. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 678–700). New York: Oxford University Press.
Thompson, B. (2012). Overview of traditional/classical statistical approaches. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 7–25). New York: Oxford University Press.
Van Zandt, T., & Townsend, J. T. (2012). Designs for and analyses of response time experiments. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 260–285). New York: Oxford University Press.
von Eye, A., Mun, E. U., Mair, P., & von Weber, S. (2012). Configural frequency analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 73–105). New York: Oxford University Press.

Walls, T. A. (2012). Intensive longitudinal data. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 432–440). New York: Oxford University Press.
Wang, L. L., Watts, A. S., Anderson, R. A., & Little, T. D. (2012). Common fallacies in quantitative research methodology. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 718–758). New York: Oxford University Press.
Wei, W. W. S. (2012). Time series analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 458–485). New York: Oxford University Press.
Widaman, K. F., Early, D. R., & Conger, R. D. (2012). Special populations. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 55–81). New York: Oxford University Press.
Woods, C. M. (2012). Categorical methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 52–73). New York: Oxford University Press.
Wu, W., Selig, J. P., & Little, T. D. (2012). Longitudinal data analysis. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 2, pp. 387–410). New York: Oxford University Press.
Yuan, K.-H., & Schuster, C. (2012). Overview of statistical estimation methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 361–387). New York: Oxford University Press.

CHAPTER 2

The Philosophy of Quantitative Methods

Brian D. Haig

Abstract

This chapter provides a philosophical examination of a number of different quantitative research
methods that are prominent in the behavioral sciences. It begins by outlining a scientific realist
methodology that can help illuminate the conceptual foundations of behavioral research methods. The
methods selected for critical examination are exploratory data analysis, statistical significance testing,
Bayesian confirmation theory, meta-analysis, exploratory factor analysis, and causal modeling. Typically,
these methods contribute to either the detection of empirical phenomena or the construction of
explanatory theory. The chapter concludes with a brief consideration of directions that might be taken
in future philosophical work on quantitative methods.
Key Words: scientific realism, methodology, exploratory data analysis, statistical significance testing,
Bayesianism, meta-analysis, exploratory factor analysis, causal modeling, latent variables, phenomena
detection, hypothetico-deductive method, inference to the best explanation

Introduction

Historically, philosophers of science have given research methods in science limited attention, concentrating mostly on the nature and purpose of theory in the physical sciences. More recently, however, philosophers of science have shown an increased willingness to deal with methodological issues in sciences other than physics—particularly biology, but also psychology to a limited extent. There is, then, a developing literature in contemporary philosophy of science that can aid both our understanding and use of a variety of research methods and strategies in psychology (e.g., Trout, 1998).

At the same time, a miscellany of theoretically oriented psychologists, and behavioral and social scientists more generally, have produced work on the conceptual foundations of research methods that helps illuminate those methods. The work of both professional philosophers of science and theoretical scientists deserves to be included in a philosophical examination of behavioral research methods.

This chapter undertakes a philosophical examination of a number of different quantitative research methods that are prominent in the behavioral sciences. It begins by outlining a scientific realist methodology that can help illuminate the conceptual foundations of behavioral research methods. The methods submitted to critical examination are exploratory data analysis, statistical significance testing, Bayesian confirmation theory, meta-analysis, exploratory factor analysis, and causal modeling methods. The chapter concludes with a brief and selective consideration of directions that might be taken in future philosophical work on quantitative methods.

Quantitative Methods and Scientific Realism

The three major philosophies of science that bear on psychology are empiricism, social constructionism, and scientific realism (Greenwood,

1992; Manicas & Secord, 1983). Nineteenth century British empiricism had a major influence on the development of British statistics in the first half of the twentieth century (e.g., Mulaik, 1985). The statistical methods developed in that intellectual milieu remain an important part of psychology's statistical research practice. For example, Karl Pearson's product moment correlation coefficient was taken by its founder to be the quantitative expression of a causal relation viewed in empiricist terms. Similarly, Fisher's endorsement of inductive methods as the proper view of scientific method stemmed from a commitment to the empiricism of his day. Even in the current postpositivist philosophical climate, authors of research methods textbooks sometimes portray quantitative research as essentially positivist in its empiricist commitments (Yu, 2006). Among other things, positivism restricts its attention to what can be observed and regards theories as instruments that organize claims about observables but that do not explain them by appeal to hidden causes.

Qualitative methodologists also often bolster their preferred conception of qualitative research by comparing it with an unflattering positivist picture of quantitative research. They tend to adopt the philosophy of social constructionism, which is opposed to the traditional notions of truth, objectivity, and reason, maintaining that our understanding of the world is determined by social negotiation. In one or another of its various forms, it is the philosophy of choice for many qualitative researchers, and it tends to be employed by those who are opposed, or indifferent, to quantitative methods. I shall not consider it further in this chapter.

In what follows, I will adopt a scientific realist perspective on research methods. Although the subject of considerable debate, and opposed by many antirealist positions, scientific realism is the dominant philosophy of science today. It is also the tacit philosophy of most working scientists. This fact, combined with its current heavy emphasis on the nature of scientific practice, makes scientific realism a philosophy for science—not just a philosophy of science.

Scientific Realism

The philosophies of positivism, social constructionism, and scientific realism just mentioned are really family positions. This is especially true of scientific realism, which comes in many forms. Most versions of scientific realism display a commitment to at least two doctrines: (1) that there is a real world of which we are part and (2) that both the observable and unobservable features of that world can be known by the proper use of scientific methods. Some versions of scientific realism incorporate additional theses (e.g., the claims that truth is the primary aim of science and that successive theories more closely approximate the truth), and some will also nominate optional doctrines that may, but need not, be used by scientific realists (e.g., the claim that causal relations are relations of natural necessity; see Hooker, 1987). Others who opt for an "industrial strength" version of scientific realism for the physical sciences are more cautious about its successful reach in the behavioral sciences. Trout (1998), for example, subscribes to a modest realism in psychology, based on his skepticism about the discipline's ability to produce deeply informative theories like those of the physical sciences.

Given that this chapter is concerned with the philosophical foundations of quantitative methods, the remaining characterization of scientific realism will limit its attention to research methodology.

Scientific Realist Methodology

Scientific realism boasts a rich conception of methodology, which is of considerable help in understanding and guiding research. The resourcefulness of realist methodology is suggested in the following description of its major characteristics (see Hooker, 1987; Nickles, 1987). First, realist methodology has three major tasks: to describe how methods function; to evaluate methods critically against their rivals; and to recommend how to use particular methods to pursue chosen research goals.

Second, realist methodology is critically aim-oriented. At a broad level, it recommends the pursuit of valuable truth, explanatory understanding, and effective control as primary research goals; and it is concerned with the mutual adjustment of methods and research goals.

Third, realist methodology is naturalistic—that is, it is a substantive domain that uses the methods of the various sciences to study method itself. Proctor and Capaldi (2001) advocate a naturalistic approach to methodology in psychology in which the empirical justification of methodological ideas is emphasized.

A fourth feature of realist methodology is that it is both generative and consequentialist. Generative methodology involves reasoning to, and accepting, knowledge claims in question from warranted premises. Exploratory factor analysis is a prominent example of a method in psychology that involves

a generative justification of the factorial hypotheses to which it gives rise. By contrast, consequentialist methodology focuses on reasoning from knowledge claims in question to their testable consequences. The widely used hypothetico-deductive method, with its emphasis on predictive accuracy, clearly exhibits a consequentialist approach to justifying knowledge claims.

Fifth, realist methodology acknowledges the need for two quite different approaches to justifying knowledge claims. In philosophy these are commonly known as reliabilism and coherentism. With reliabilism, a belief is justified to the extent that it is acquired by reliable processes. In general, the innumerable methods that contribute to the detection of empirical phenomena are concerned with reliabilist justification. With coherentism, a belief is justified in virtue of its coherence with other beliefs. Thagard's (1992) theory of explanatory coherence, used for the comparative evaluation of scientific theories, embodies an illuminating coherentist perspective on knowledge justification. These two forms of justification are different, complementary, and of equal importance.

As a sixth feature, realist methodology regards science as a problem-oriented endeavor in which problems are conceptualized as constraints on their effective solution (Haig, 1987; Nickles, 1981). On this formulation, the constraints are actually constitutive of the problem itself; they characterize the problem and give it structure. Further, by including all the constraints in the problem's articulation, the problem enables the researcher to direct inquiry effectively by pointing the way to its own solution. In a real sense, stating the problem is half the solution!

Finally, realist methodology takes the researcher's makeup as a "knowing subject" seriously. Among other things, the researcher is regarded as a satisficer who makes heavy use of heuristics to guide her inquiries. For example, McGuire (1997) discusses many useful heuristics that can be employed to facilitate the generation of hypotheses in psychological research.

Scientific realist methodology undergirds a wide variety of methods, strategies, and heuristics that have been successfully used to produce worthwhile knowledge about both empirical phenomena and explanatory theories. If quantitative researchers in psychology engage this literature seriously, then they will find resources for enhancing their understanding of research methods.

I turn now to a philosophical consideration of the selected research methods.

Exploratory Data Analysis

In psychological research, the major emphasis in data analysis is placed on statistical inference, where the task is to find out whether a data set exhibits a designated feature of interest characterized with reference to a probabilistic model. Unfortunately, the dominance of this goal has had the effect of discouraging a concerted examination of data sets in terms of their quality and structure. Detailed explorations of data are important in science, and it often makes good sense to conduct them instead of adopting a probabilistic model, or before the model is formulated and adopted.

Consistent with this emphasis on the close examination of data, the last 30 years have witnessed the strong development of an empirical, data-oriented approach to statistics. One important part of this movement is exploratory data analysis, which contrasts with the more familiar traditional statistical methods, with their characteristic emphasis on the confirmation of knowledge claims.

Exploratory Data Analysis and John Tukey

Spelling out a philosophy of exploratory data analysis is difficult, and few methodologists have attempted to do so (for an initial attempt to do this from a Bayesian perspective, see Good, 1983). However, the intellectual progenitor of modern exploratory data analysis, John Tukey, has developed a systematic perspective on the subject that has helped to highlight its importance to research. It deserves to be considered as a philosophy of data analysis in its own right. Therefore, this brief examination of the philosophy of exploratory data analysis pays particular attention to Tukey's thinking on the topic.

According to Tukey (1980), data analysis should be treated as a two-stage compound process in which the patterns in the data are first suggested by exploratory data analysis and then critically checked through the use of confirmatory data analysis procedures. Exploratory data analysis involves descriptive—and frequently quantitative—detective work designed to reveal structure or pattern in the data sets under scrutiny. The data analyst is encouraged to undertake an open-eyed investigation of the data and perform multiple analyses using a variety of intuitively appealing and easily used techniques.

The compendium of methods for the exploration of data, many of which were developed by Tukey (1977), is designed to facilitate both discovery and communication of information. These

methods are concerned with the effective organization of data, the construction of graphical and semi-graphical displays, and the examination of distributional assumptions and functional dependencies. Two additional attractive features of Tukey's methods are their robustness to changes in underlying distributions and their resistance to outliers in data sets. Exploratory methods with these two features are particularly suited to data analysis in psychology, where researchers are frequently confronted with ad hoc sets of data on amenable variables, which have been acquired in convenient circumstances.

Exploratory Data Analysis and Scientific Method

In his writings on data analysis, Tukey (1969) has emphasized the related ideas that psychology is without an agreed-upon model of data analysis and that we need to think more broadly about scientific inquiry. In an invited address to the American Psychological Association in 1968, Tukey presented the following excerpt from a prominent psychologist for his audience to ponder. I quote in part:

I have the feeling that psychology is currently without a dominant viewpoint concerning a model for data analysis. In the forties and early fifties, a hypothetico-deductive framework was popular, and our mentors were keen on urging the design of "crucial" experiments for the refutation of specific predictions made from one or another theory. Inductive empiricism was said to be disorderly and inefficient. You and I knew then, as we know now, that no one approach is uniformly most powerful. (Tukey, 1969, p. 90)

Consider the hypothetico-deductive and inductive conceptions of scientific method, which are mentioned here as candidate models for data analysis. Most psychological researchers continue to undertake their research within the confines of the hypothetico-deductive method. Witness their heavy preoccupation with theory testing, where confirmatory data analyses are conducted on limited sets of data gathered in accord with the dictates of the test predictions of theories. In this regard, psychologists frequently employ tests of statistical significance to obtain binary decisions about the credibility of the null hypothesis and its substantive alternatives. However, the use of statistical significance tests in this way strongly blunts our ability to look for more interesting patterns in the data. Indeed, the continued neglect of exploratory data analysis in psychological research occurs in good part because there is no acknowledged place for such work in the hypothetico-deductive conception of inquiry (Wilkinson & the Task Force on Statistical Inference, 1999).

I think the worth of the inductive method as a model for data analysis is dismissed too quickly in the above quotation. The major failing of the inductive account of scientific method lies not so much with its perspective on data analysis, but with its prohibition of the formulation of explanatory theories. A modern conception of inductive method is embedded in the important scientific process of phenomena detection. Phenomena are relatively stable, recurrent general features of the world that we seek to explain (Woodward, 1989), and their detection frequently involves an inductive process of empirical generalization. With its emphasis on phenomena detection, inductive method reserves an important place for the exploratory analysis of data. In detecting phenomena, one is concerned to extract a signal from the noise of data, and for this the intensive search of large amounts of data is frequently essential. Such searching is valuable precisely because securing a heavy information yield from our data is likely to throw up potentially interesting data patterns that might turn out to be genuine phenomena. In this context, data mining is encouraged, and the capabilities of exploratory techniques in this regard often make them the appropriate methods of choice.

By contrast, Behrens and Yu (2003) suggest that the inferential foundations of exploratory data analysis are to be found in the idea of abduction, or explanation (and, by implication, not in the notions of hypothetico-deductive testing and inductive generalization). However, exploratory data analysis is a descriptive pattern-detection process that is a precursor to the inductive generalizations involved in phenomena detection. As will be seen later in the consideration of exploratory factor analysis, abductive inference is reserved for the construction of causal explanatory theories that are introduced to explain empirical phenomena. Behrens and Yu's suggestion conflates the quite different ideas of descriptive and explanatory inference.

Exploratory Data Analysis and a Model of Data Analysis

In the spirit of Tukey's (1962, 1980) push for breadth of vision in data analysis, one might usefully take a perspective on data analysis that extends Tukey's two-stage model (Haig, 2005b). Before exploring data for patterns of potential interest, researchers should assiduously screen their data for their quality. This initial data analysis involves

10 the philosophy of quantitative methods
checking for the accuracy of data entries, identifying and dealing with missing and outlying data, and examining the data for their fit to the assumptions of the data analytic methods to be used. This important, and time-consuming, preparatory phase of data analysis has not received the amount of explicit attention that it deserves in behavioral science education and research practice. Fidell and Tabachnick (2003) provide a useful overview of the task and techniques of initial data analysis.

Confirmation of the initial data patterns suggested by exploratory data analysis is a "just checking" strategy and as such should be regarded as a process of close replication. However, it is essential to go further and undertake constructive replications to ascertain the extent to which results hold across different methods, treatments, subjects, and occasions. Seeking results that are reproducible through constructive replications requires data analytic strategies that are designed to achieve significant sameness rather than significant difference (Ehrenberg & Bound, 1993). Exploratory data analysis, then, can usefully be regarded as the second in a four-stage sequence of activities that, in turn, attend to data quality, pattern suggestion, pattern confirmation, and generalization.

Resampling Methods and Reliabilist Justification

Since the 1980s, statisticians have been able to exploit the massive computational power of the modern computer and develop a number of computer-intensive resampling methods, such as the jackknife, the bootstrap, and cross-validation (Efron & Tibshirani, 1993). These methods constitute one important set of confirmatory procedures that are well suited to the task of checking on the data patterns thrown up by exploratory data analysis. By exploiting the computer's computational power, these resampling methods free us from the restrictive assumptions of modern statistical theory, such as the belief that the data are normally distributed, and permit us to gage the reliability of chosen statistics by making thousands, even millions, of calculations on many data points.

It is important to appreciate that the resampling methods just mentioned make use of a reliabilist approach to justification. Here, the reliability checks on emergent data patterns are provided by consistency of test outcomes, which are time-honored validating strategies. Our willingness to accept the results of such checks is in accord with what Thagard (1992) calls the principle of data priority. This principle asserts that statements about observational data, including empirical generalizations, have a degree of acceptability on their own. Such claims are not indubitable, but they do stand by themselves better than claims justified solely in terms of what they explain. What justifies the provisional acceptance of data statements is that they have been achieved by reliable methods; what strengthens our provisional belief in the patterns thrown up by exploratory data analysis is their confirmation through use of computer-based resampling methods.

Further, it is important to appreciate that the acceptability of claims provided by the reliabilist justification of computer-intensive resampling methods can be enhanced by making appropriate use of a coherentist approach to justification. One important form of coherence is explanatory coherence, and one method that delivers judgments of explanatory coherence is the theory of explanatory coherence (Thagard, 1992). According to this theory, data claims, including empirical generalizations, receive an additional justification if and when they enter into, and cohere with, the explanatory relations of the theory that explains them.

A Philosophy for Teaching Data Analysis

An underappreciated, but important, feature of Tukey's writings on exploratory data analysis is the illuminating remarks on the teaching of data analysis that they contain. These remarks can be assembled into a constructive philosophy for teaching data analysis, which can properly be regarded as an aspect of an overall philosophy of exploratory data analysis. This philosophy of teaching advises us to think about and teach data analysis in a way that is quite different from the prevailing custom.

Provocatively, Tukey (1980) maintained that the proper role of statistics teachers is to teach that which is most difficult and leave that which is more manageable to good textbooks and computers. He recommended teaching data analysis the way he understood biochemistry was taught, concentrating on what the discipline of statistics has learned, perhaps with a discussion of how such things were learned. The detail of methods should be assigned to laboratory work, and the practice of learning data analytic techniques should be assigned to a different course in which problems arose. He foresaw that such a redirection in teaching data analysis would have to be introduced in phases. In Tukey's (1962) words, "The proposal is really to go in the opposite direction from cookbookery; to teach not

haig 11
‘what to do,’ nor ‘how we learned what to do,’ but has undertaken a useful extensive review of the con-
rather, ‘what we have learned”’ (p. 63). This advice troversy since its beginning. Despite the plethora
is broadly consistent with the idea that we should of critiques of statistical significance testing, most
teach research methods in terms of their accompa- psychologists understand them poorly, frequently
nying methodology, a recommendation considered use them inappropriately, and pay little attention
at the end of this chapter. to the controversy they have generated (Gigerenzer,
Another prominent feature of Tukey’s philosophy Krauss, & Vitouch, 2004).
of teaching data analysis is his recommendation that The significance testing controversy is multi-
we should teach both exploratory and confirmatory faceted. This section will limit its attention to a
data analysis and that we have an obligation to do so. consideration of the two major schools of signifi-
Tukey’s strong promotion of the value of exploratory cance testing, their hybridization and its defects, and
data analysis was intended as a counter to the dom- the appropriateness of testing scientific hypotheses
inance of confirmatory data analysis in statistical and theories using tests of statistical significance.
practice. However, for Tukey, exploratory data anal- Psychologists tend to assume that there is a sin-
ysis was not to be understood as more important gle unified theory of tests of statistical significance.
than confirmatory data analysis because both are However, there are two major schools of thought
essential to good data analysis. regarding significance tests: Fisherian and Neyman-
Tukey also suggested that exploratory data anal- Pearson. Initially, Neyman and Egon Pearson sought
ysis should probably be taught before confirma- to build on and improve Fisher’s theory, but they
tory data analysis. There are several reasons why subsequently developed their own theory as an alter-
this recommendation has merit. Properly taught, native to Fisher’s theory. There are many points of
exploratory data analysis is probably easier to learn, difference between the two schools, which adopt
and it promotes a healthy attitude to data anal- fundamentally different outlooks on the nature of
ysis (encouraging one to be a dataphile without scientific method. The uncritical combination of
becoming a data junkie). It requires the investiga- the two schools in psychology has led to a confused
tor to get close to the data, analyze them in various understanding of tests of statistical significance and
ways, and seek to extract as much as possible poten- to their misuse in research.
tially important information from the data. This is
done to detect indicative patterns in the data before
establishing through confirmatory data analysis that The Fisherian Significance Testing School
they are genuine patterns. The Fisherian school of significance testing (e.g.,
Tukey emphasized that learning exploratory data Fisher, 1925) tests a hypothesis or theory of sub-
analysis centrally involves acquiring an appropri- stantive interest against the null hypothesis that the
ate attitude toward the data, which includes the experimental effect to be demonstrated is in fact
following elements: exploratory data analysis is suf- absent. Fisher argued that an experiment is per-
ficiently important to be given a great deal of time; formed solely to give the data an opportunity to
exploratory data analysis should be carried out flex- disprove the null hypothesis. No alternative hypoth-
ibly with multiple analyses being performed; and esis is specified, and the null hypothesis is the
exploratory data analysis should employ a multiplic- hypothesis to be nullified; it need not be the hypoth-
ity of methods that enhance visual display. esis of zero difference. Because one cannot accept
the null hypothesis, no provision is made for Type
II error, and relatedly, there is no place for a sta-
Statistical Significance Testing tistical concept of power. Most importantly, and as
It is well known that tests of statistical significance noted earlier, Fisher subscribed to an inductive con-
are the most widely used methods for evaluating ception of scientific method and maintained that
hypotheses in psychology (e.g., Hubbard & Ryan, significance tests are vehicles of inductive reasoning.
2000). These tests have been popular in psychology As such they are concerned with evidence for beliefs.
for nearly 50 years and in statistics for about 75 years.
Since the 1960s, there has developed a massive crit-
ical literature in psychology regarding their worth. Should We Use Fisher’s Significance Tests?
Important early contributions to this debate are col- The question of whether behavioral scientists
lected in Morrison and Henkel (1970; see also Giere, should use Fisherian significance tests as a defensible
1972). Cohen (1994) provides a short perceptive form of hypothesis testing largely centers on whether
review of the controversy, whereas Nickerson (2000) p−values are good measures of scientific evidence.
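Before turning to those arguments, it may help to fix what a p-value actually is. The following minimal simulation (not from the chapter; the data and numbers are hypothetical) estimates a Fisherian p-value as the probability, computed under the null hypothesis, of data at least as extreme as those observed:

```python
# Hypothetical example: 36 trials, 25 "successes"; null hypothesis: chance
# performance (success probability 0.5). The two-sided p-value is estimated
# as the proportion of simulated null-world experiments whose outcome is at
# least as far from the null expectation as the observed outcome.
import random

random.seed(1)

n_trials, observed = 36, 25
expected = n_trials / 2

n_sims = 100_000
extreme = 0
for _ in range(n_sims):
    successes = sum(random.random() < 0.5 for _ in range(n_trials))
    if abs(successes - expected) >= abs(observed - expected):
        extreme += 1

p_value = extreme / n_sims
print(f"estimated two-sided p-value: {p_value:.3f}")
```

On these made-up numbers the estimate comes out near 0.03. Note what the figure reports: how rare such data would be if the null were true. It is not the probability that the null hypothesis is false, a distinction central to the criticisms that follow.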

Although many psychological researchers think, and some methodologists argue, that p-values can be used to measure strength of evidence, others hold them to be deeply problematic in this respect (e.g., Royall, 1997; Hubbard & Lindsay, 2008).

Some have argued that the concept of evidence adopted by Fisher is defective. For one thing, it is widely agreed by philosophers of science that theory or hypothesis evaluation is a comparative affair in which evidence against one hypothesis is evidence for another hypothesis (e.g., Sober, 2008). However, as just noted, Fisher only countenanced the null, and without an alternative hypothesis with which to compare the null, the logic of his significance testing is importantly incomplete; one cannot have evidence against the null without it being evidence for another hypothesis. The idea that one might allow for alternative hypotheses by joining Fisher's perspective with that of Neyman and Pearson will be seen by many to make matters worse. As will be noted shortly, Neyman and Pearson were concerned with the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment.

Others contend that a further major problem with Fisher's p-value is that it doesn't measure evidence properly (e.g., Goodman, 1993). In this regard, four claims about p-values are thought to disqualify the p-value as a proper measure of evidence. First, the p-value is not a direct measure of the probability that the null is false; it is a conditional probability of obtaining the data, calculated on the assumption that the null hypothesis is true. Second, to take a number that quantifies rare data under the null is to confuse the strength of evidence with the probability of its occurrence (Royall, 1986). These two things are different because the probability is an indication of the long-run Type I error rate, which is separate from strength of evidence. Third, the calculation of a p-value combines the rarity of an obtained result with the probability of results that didn't happen. As Jeffreys (1939) stated long ago, "What the use of P implies . . . is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (p. 136). Finally, it is claimed that the p-value can exaggerate the strength of evidence against the point null and small interval hypotheses (Berger & Sellke, 1987), which are frequently tested in psychology.

These criticisms are not convincing to all. For example, Hurlbert and Lombardi (2009) recently considered these arguments and recommended a shift in focus from the classical Fisherian framework to a neo-Fisherian alternative. Key elements of this alternative are that the probability of Type I error is not specified, p-values are not misleadingly described as "significant" or "nonsignificant," judgment is suspended about accepting the null hypothesis on the basis of high p-values, the "three-valued logic" that gives information about the direction of the effect being tested is adopted, accompanying effect size information is provided, and use is made of adjunct information such as confidence intervals, where appropriate. Two things to note here about this neo-Fisherian perspective are that it is concerned with significance assessments but not null hypothesis significance tests and that it is concerned with statistical tests as distinct from the tests of scientific hypotheses. There are empirical studies in psychology that approximate this modified Fisherian perspective on significance tests.

The Neyman-Pearson Hypothesis Testing School

Neyman and Pearson rejected Fisher's notion of a significance test and its use of a threshold p-value as a basis for rejecting the null hypothesis. In this regard, they added the requirement of the specification of an alternative hypothesis as well as the null hypothesis, and they replaced Fisher's evidential p-value with the Type I error rate, α (e.g., Neyman & Pearson, 1933). In addition, Neyman and Pearson permitted a more liberal formulation of the null hypothesis than did Fisher and regarded it as legitimate to speak of its acceptance. Thus, Type II error was admitted, and explicit provision was made for a statistical concept of power. To capture these differences, Neyman and Pearson spoke of their approach as hypothesis testing rather than significance testing.

However, the Neyman-Pearson school differs from the Fisherian school most fundamentally in maintaining that significance tests are rules of inductive behavior rather than vehicles for inductive reasoning. On this view, significance testing is regarded as a theory of prudential decision-making; accepting or rejecting a hypothesis amounts to adopting an appropriate course of action, rather than believing it to be probably true or false. At root, Neyman and Pearson held different views from Fisher about the nature of science.

Should We Use Neyman and Pearson's Hypothesis Tests?

It might seem that by focusing on hypothesis testing and making provision for alternative hypotheses

and a statistical concept of power, Neyman and Pearson's approach marks an improvement over Fisher's significance tests. However, as noted earlier, Neyman and Pearson were concerned with measuring the reliability of errors in decision making in the long run rather than with evidence for believing hypotheses in a particular experiment; Type I and Type II error are objective probabilities, understood as long-run relative frequencies. As such, they belong to reference classes of endless series of trials that might have happened but never did. They are not about single events such as individual experiments. Further, being concerned with decision making understood as behavioral courses of action, Neyman-Pearson statistics do not measure strength of evidence for different hypotheses and thus do not tell us how confident we should be in our beliefs about those hypotheses.

It would seem, then, that the Neyman-Pearson approach is suitable for use only when the focus is on controlling errors in the long run (e.g., quality control experiments). However, this does not happen often in psychology.

The Hybrid

Although the Neyman-Pearson theory of testing is the official theory of statistical testing in the field of professional statistics, textbooks in the behavioral sciences ensure that researchers are instructed to indiscriminately adopt a hybrid account of tests of statistical significance, one that is essentially Fisherian in its logic but that is often couched in the decision-theoretic language of Neyman and Pearson. The hybrid logic is a confused and inconsistent amalgam of the two different schools of thought (Acree, 1979; Gigerenzer, 1993; Spielman, 1974; but see Lehmann, 1993, for the suggestion that the best elements of both can be combined in a unified position). To the bare bones of Fisherian logic, the hybrid adds the notion of Type II error (opposed by Fisher) and the associated notion of statistical power (which Fisher thought could not be quantified) but only at the level of rhetoric (thereby ignoring Neyman and Pearson), while giving a behavioral interpretation of both Type I and Type II errors (vigorously opposed by Fisher)! Because the authors of statistical textbooks in the behavioral sciences tend to present hybrid accounts of significance testing, aspiring researchers in these sciences almost always acquire a confused understanding of such tests. It is most unfortunate that many writers of statistics textbooks in the behavioral sciences have unwittingly perpetuated these basic misunderstandings.

To make matters worse, this confusion is compounded by a tendency of psychologists to misrepresent the cognitive accomplishments of significance tests in a number of ways. For example, levels of statistical significance are taken as measures of confidence in research hypotheses, likelihood information is taken as a gage of the credibility of the hypotheses being tested, and reported levels of significance are taken as measures of the replicability of findings (Gigerenzer, Krauss, & Vitouch, 2004).

Significance Tests and Theory Testing

Meehl (1967, 1978, 1997) has made one of the strongest criticisms of the use of tests of statistical significance in psychology. He argued that the widespread use of tests of statistical significance to test substantive hypotheses and theories is deeply flawed because the support for a hypothesis or theory obtained by rejecting the null hypothesis is very weak.

Sometimes psychological researchers test a hypothesis of substantive interest against the point null hypothesis that the difference between the relevant population parameters is exactly zero. But a fact, long known to professional statisticians and appreciated by Meehl, is that the point null hypothesis is virtually always false in the behavioral and social sciences. The reason for this is that in these sciences, most things are related to one another at least to some small extent. In many parts of psychology, "everything in the brain is connected with everything else," resulting in a large positive manifold in which many variables correlate positively with one another to a significant degree. Thus, in the "softer" precincts of psychology, where "true" experiments are often not possible, obtaining a reasonable sample size makes the achievement of a statistically significant result the likely outcome of an empirical study. Meehl (1967) reasoned that if the null hypothesis of zero group differences is almost always false, then with sufficient power, directional hypotheses in these parts of psychology have a 50:50 chance of achieving statistical significance! Meehl and Lykken provided some empirical evidence for this claim more than 40 years ago (Meehl, 1967; see also Meehl, 1997). A recent simulation study on real data carried out by Waller (2004) confirmed Meehl's claim.

One can better appreciate what is wrong with using tests of statistical significance to appraise psychological theories by considering the logic involved in such testing. It is helpful to begin by observing the important distinction between scientific

hypotheses and the statistical hypotheses that may be derived from them (Bolles, 1962). Often in psychology, scientific theories about psychological processes or structures are formulated, and then statistical hypotheses are derived to facilitate their empirical testing. The former will characteristically invoke causal mechanisms for explanatory purposes, whereas the latter will appeal to statistical tests of null hypotheses about the population parameters of observed variables. Meehl has argued that psychological researchers tend to conflate the substantive theory and the statistical hypothesis and unwarrantedly take the successful refutation of the null hypothesis as grounds for concluding that the substantive theory has been strongly confirmed. However, if we have good grounds for believing that the point null hypothesis is probably false at the outset, and we use the null as the observation hurdle for our theories to surmount, then support for a theory by rejecting this implausible null alternative is quite feeble.

For good reason, then, Meehl (1990) has urged psychologists to abandon tests of statistical significance for purposes of substantive theory testing in psychology. He suggests that psychologists should replace them with a strategy that is adapted from the philosopher of science, Imre Lakatos (1970), who argued against Popper's strict falsificationist position on theory testing for the comparative theory appraisal of research programs over time. Meehl has maintained that one should defend and amend a theory only if it has a good track record of successful or near-miss predictions of low prior probability.

In conclusion, it is clear that there are fundamental philosophical differences between the Fisherian and Neyman-Pearson schools of statistical thought. Fisher's statistical contributions can be seen as a deliberate attempt to develop an objective alternative to Bayesian statistical thinking popular in Europe at the time, whereas those of Neyman and Pearson can be seen as an attempt to develop a position that is even more objective than Fisher's.

Nevertheless, Bayesian thinking today is an attractive option for many statisticians who hold misgivings about one or both of these schools of thought. It is also the focus of much attention in the philosophy of science. It is to the elements of the Bayesian position that we now turn.

Bayesian Confirmation Theory

What is it for empirical evidence to provide confirmation or disconfirmation of a scientific hypothesis or theory? Methodologists of science have worked long and hard to answer this important and challenging question by developing theories of scientific confirmation. Despite the considerable fruits of their labors, there is widespread disagreement about which theory of confirmation we should accept. In recent times, a large number of philosophers of science have contributed to Bayesian confirmation theory (e.g., Earman, 1992; Howson & Urbach, 2006). Many philosophical methodologists now believe that Bayesianism, including Bayesian philosophy of science, holds the best hope for building a comprehensive and unified theory of scientific inference.

Bayesianism is a comprehensive position. It comprises a theory of statistical inference, an account of scientific method, and a perspective on a variety of challenging methodological issues. Today, it also boasts a fully fledged philosophy of science. In this section, attention is limited to a consideration of the strengths and weaknesses of Bayesian statistical inference, the ability of Bayesian confirmation theory to improve upon the hypothetico-deductive method, and the question of whether Bayesianism provides an illuminating account of the approach to theory evaluation known as inference to the best explanation.

Bayesian Statistical Inference

The Bayesian approach to statistical inference is so called because it makes central use of a theorem of the mathematical calculus of probability known as Bayes' theorem. This theorem can be written in a simple form as:

Pr(H/D) = [Pr(H) × Pr(D/H)] / Pr(D)

With the proviso that Pr(D) and Pr(H) cannot be zero, the theorem says that the posterior probability of the hypothesis is obtained by multiplying the prior probability of the hypothesis by the probability of the data, given the hypothesis (the likelihood), and dividing the product by the prior probability of the data. It is through use of this and other versions of Bayes' Theorem that Bayesians are able to implement their view of statistical inference, which is the orderly revision of opinion in the light of new information.

For Bayesians, a couple of features of this gloss on Bayesian statistical inference recommend themselves. Most importantly, the Bayesian approach squares with the stated purpose of scientific inquiry noted above—namely, securing the probability of a hypothesis in the light of the relevant evidence. The

informational output of a traditional test of significance is the probability of the data, given the truth of our hypothesis, but it is just one input in the Bayesian scheme of things. A second stated desirable feature of the Bayesian view is its willingness to make use of relevant information about the hypothesis before the empirical investigation is conducted and new data are obtained, explicitly in the form of a prior probability estimate of our hypothesis. Traditional tests of statistical significance are premised on the assumption that inferences should be based solely on present data, without any regard for what we might bring to a study in the way of belief or knowledge about the hypothesis to be tested—a position that Bayesians contend is hardly designed to maximize our chances of learning from experience.

To achieve their goal of the systematic revision of opinion on the basis of new information, Bayesians are able to employ Bayes' theorem iteratively. Having obtained a posterior probability assignment for their hypothesis via Bayes' theorem, they can then go on and use that posterior probability as the new prior probability in a further use of Bayes' theorem designed to yield a revised posterior probability, and so on. In this way, the Bayesian researcher learns from experience.

Criticisms of Bayesian Hypothesis Testing

Although my consideration of the merits of conventional significance tests and their Bayesian alternative is both sketchy and selective, some readers will sense that the Bayesian view provides an attractive alternative to the traditional approach, particularly when the latter assumes its hybrid form. However, as with all theories of confirmation, the Bayesian approach has come in for its share of criticism. These criticisms have tended to focus on the alleged problematic nature of prior probabilities. In this regard, it is objected that because Bayesians adopt a subjectivist conception of probability and resort to personal estimates of the prior probabilities of their hypotheses, they introduce an ineliminable, but highly undesirable, subjective element into their calculations. To this objection, the Bayesians have two plausible replies: they can concede that personal estimates of prior probabilities are subjective, that they may differ markedly from person to person, and that they are often very rough estimates and then go on to point out that when prior estimates err, they are brought into line by freshly obtained sets of data; or, they may appeal to the failure of strictly empiricist theories of confirmation, which hold that one may obtain an adequate test of a hypothesis solely on the basis of evidence and logic, and assert that in real-life situations, there is no alternative to relying on a subjective component in our testing efforts.

In deciding whether to adopt a Bayesian position on statistical inference, it should be kept in mind that one does not have to embrace a general Bayesian theory of scientific confirmation rather than, say, the hypothetico-deductive alternative. One might be a Bayesian when dealing with problems of statistical inference but remain wedded to a general hypothetico-deductive conception of scientific method. Or, more plausibly, one might employ Bayesian statistical methods when concerned with inferential problems about hypotheses for which we have the relevant probabilistic information, but adopt a non-probabilistic account of theory evaluation such as Thagard's theory of explanatory coherence, which will be referred to later in the chapter. The general point to be made here is that Bayes' theorem can help us deal with some problems of statistical inference, but clearly, a great deal of scientific work will be done with the use of other methods—some of them statistical and some of them not.

Bayesianism and the Hypothetico-Deductive Method

One of the clear achievements of Bayesianism is its ability to improve on the unsatisfactory approach to hypothesis and theory appraisal taken by the hypothetico-deductive method. The hypothetico-deductive method has long been the method of choice for the evaluation of scientific theories (Laudan, 1981), and it continues to have a dominant place in psychology. Despite its popularity, it is usually characterized in an austere manner: The researcher takes a hypothesis or theory of interest and tests it indirectly by deriving from it one or more observational predictions that are themselves directly tested. Predictions borne out by the data are taken to confirm the theory to some degree; those predictions that do not square with the data count as disconfirming instances of the theory. Normally, the theory is not compared with rival theories in respect of the data, only with the data themselves.

The hypothetico-deductive method, in something like this form, has been strongly criticized by methodologists on a number of counts (e.g., Glymour, 1980; Rozeboom, 1997). One major criticism of the method is that it is confirmationally lax. This laxity arises from the fact that any positive confirming instance of a hypothesis submitted to empirical test can confirm any hypothesis that is

conjoined with the test hypothesis, regardless of how plausible it might be. This state of affairs is known as the fallacy of irrelevant conjunction, or the tacking problem, because confirmation of a test hypothesis also confirms any conjunct that is attached to the test hypothesis. The fallacy of irrelevant conjunction arises with the hypothetico-deductive method because predictions are deduced from hypotheses only by making use of auxiliary hypotheses drawn from background knowledge.

Clearly, this is an unacceptable state of affairs. Bayesians have challenged the assumption that the occurrence of the consequences of a theory confirms the theory and its conjuncts holistically. They argue that the Bayesian approach enables the differential support of the elements of a theory, specifying conditions showing that E never increases the probability of H conjoined with any additional hypothesis by more than it increases the probability of H.

Another major criticism of the hypothetico-deductive method is that it tests a single hypothesis or theory of interest against the empirical evidence; it does not test a hypothesis or theory in relation to rivals in respect of the evidence. This is held to be a major flaw because it is widely agreed that theory evaluation is a comparative affair involving simultaneous evaluation of two or more hypotheses or theories.

The comparative nature of theory evaluation is straightforwardly handled by the Bayesian position by rewriting the simple form of Bayes' theorem given earlier to deal with two or more hypotheses. Here, Bayes' theorem is presented for the case of two hypotheses, where the theorem can be written for each hypothesis in turn. For the first hypothesis,

Pr(H1/D) = Pr(H1) × Pr(D/H1) / [Pr(H2) × Pr(D/H2) + Pr(H1) × Pr(D/H1)]

This says that the posterior probability of the first hypothesis is obtained by multiplying its prior probability by the probability of the data, given that hypothesis (the likelihood), and dividing the product by the value that results from adding the prior probability of the second hypothesis, multiplied by the likelihood for that hypothesis, to the prior probability of the first hypothesis, multiplied by its likelihood. Bayes' theorem for the second hypothesis is written in a similar way.
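The two-hypothesis calculation can be sketched numerically. The snippet below is illustrative only: the function name is ours, and the priors and likelihoods are invented numbers, not values from the chapter.

```python
# Illustrative sketch of Bayes' theorem for two rival hypotheses.
# All numbers below are invented for the example.

def posterior_two_hypotheses(prior_h1, prior_h2, lik_d_h1, lik_d_h2):
    """Return Pr(H1/D) and Pr(H2/D) for two competing hypotheses."""
    evidence = prior_h1 * lik_d_h1 + prior_h2 * lik_d_h2  # Pr(D)
    return prior_h1 * lik_d_h1 / evidence, prior_h2 * lik_d_h2 / evidence

# Equal priors; the data are three times as probable under H1 as under H2.
p1, p2 = posterior_two_hypotheses(0.5, 0.5, 0.6, 0.2)
print(round(p1, 2), round(p2, 2))  # 0.75 0.25
```

Note that the denominator is the same for both hypotheses, so the posteriors sum to one: the data reallocate probability between the rivals rather than evaluating either in isolation, which is the comparative point made above.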
haig 17
remains a genuine explanatory virtue. Not only is there no guarantee that consilience will be concerned with an explanation of the evidence, there is no way that probabilistic translations of the explanatory virtues can refer to the causal connections that are often appealed to in scientific explanations. Further, Weisberg (2009) has recently argued that the explanatory loss incurred in such translations will occur for any distinctively explanatory virtue that is given such probabilistic treatment. In short, it would seem that Bayesianism cannot capture the intuitively important notion of explanatory power without significant loss.

What Should We Think About Bayesianism?

Philosophical assessments of the worth of Bayesianism range from claims that it is without peer as a theory of scientific reasoning to the view that it is fundamentally wrong-headed. Howson and Urbach (2006) exemplify the former view, claiming that scientific reasoning is both inductive and probabilistic and that the axioms of probability suffice to articulate such reasoning. The latter view is exemplified by Bunge (2008), who has argued that Bayesianism is fundamentally wrong for three reasons: (1) it assigns probabilities to statements rather than taking them as objective features of the world; (2) it conceives of probabilities as subjective; and (3) it appeals to probabilities in the absence of randomness.

To add to this mix of views, many statisticians take Bayesian statistical inference to be a superior alternative to classical statistical inference, for the reasons stated earlier. Finally, some advocates of Bayesianism see it as a comprehensive theory of confirmation, whereas others see it as having only context-specific application.

The difficulties of deciding just what to think about Bayesianism are captured well by the ambivalence of John Earman (1992), a Bayesian philosopher of science. He confesses to being an enthusiastic Bayesian on Mondays, Wednesdays, and Fridays. But on Tuesdays, Thursdays, and Saturdays, he holds doubts about the totalizing ambitions of Bayesianism and indeed whether it can serve as a proper basis for scientific inference. Faced with such difficulty, it is probably prudent to settle for a contextual application of Bayesian thinking, as indicated earlier in this section. For example, in particular domains such as medical diagnosis, where the relevant probabilistic information is often available, scientists sometimes appeal to the Bayesian corpus to justify the selective use of its methods. By contrast, in domains where the evaluation of explanatory hypotheses and theories is of primary concern, scientists have, for good reason, often employed something like inference to the best explanation. Like it or not, the intending Bayesian scientist will have to consult the relevant philosophical literature, among other methodological literatures, to furnish an informed justification for their Bayesian practices.

Meta-Analysis

In the space of three decades meta-analysis has become a prominent methodology in behavioral science research, with the major developments coming from the fields of education and psychology (Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 2004). Meta-analysis is an approach to data analysis that involves the quantitative analysis of the data analyses of primary empirical studies; hence the term meta-analysis, coined by Glass (1976). Meta-analysis, which comes in a variety of forms (Bangert-Drowns, 1986), is concerned with the statistical analyses of the results from many individual studies in a given domain for the purpose of integrating or synthesizing those research findings.

The following selective treatment of meta-analysis considers its possible roles in scientific explanation and evaluation research before critically examining one extended argument for the conclusion that meta-analysis is premised on a faulty conception of science.

Meta-Analysis and Explanation

Meta-analysis is a prominent example of a distinctive use of statistical methods by behavioral scientists to aid in the detection of empirical phenomena. By calculating effect sizes across primary studies in a common domain, meta-analysis helps us detect robust empirical generalizations (cf. Schmidt, 1992). By using statistical methods to ascertain the existence of such regularities, meta-analysis can be usefully viewed as the statistical analog of direct experimental replication. It is in this role that meta-analysis currently performs its most important work in science.

However, given that the detection of empirical phenomena and the construction of explanatory theories are quite different research tasks, the recent suggestion that meta-analysis can directly contribute to the construction of explanatory theory (Cook et al., 1992; Schmidt, 1993) is an arresting methodological claim. In approving this
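The pooling of effect sizes across primary studies described above can be made concrete with a minimal fixed-effect (inverse-variance) summary. This is a sketch under simplifying assumptions, not a full meta-analytic procedure; the function name is ours, and the study effect sizes and standard errors below are invented.

```python
# Minimal inverse-variance (fixed-effect) pooling of study effect sizes,
# the basic computation behind meta-analytic detection of a robust effect.
# The effect sizes and standard errors below are invented for illustration.

def pooled_effect(effects, std_errors):
    """Inverse-variance weighted mean effect size and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]  # more precise studies weigh more
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return mean, se

# Five hypothetical primary studies reporting standardized mean differences.
d = [0.42, 0.55, 0.30, 0.61, 0.48]
se = [0.20, 0.15, 0.25, 0.18, 0.22]
mean, pooled_se = pooled_effect(d, se)
print(f"pooled d = {mean:.2f} (SE {pooled_se:.2f})")  # pooled d = 0.50 (SE 0.09)
```

The pooled estimate is both more stable and more precise than any single study, which is why meta-analysis can function as the statistical analog of direct experimental replication noted above.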
extension of meta-analysis beyond a concern with phenomena detection, Schmidt has acknowledged that scientific explanation normally involves the causal explanation of observed phenomena. Nevertheless, he maintains that it is appropriate to take scientific explanation to include "all research processes that contribute ultimately to theory building, including the first step of determining what the relationships are among important variables or constructs and how stable these relationships are" (Schmidt, 1993, p. 1164). Thus, the demonstration of a general effect, such as the pervasive influence of psycho-educational treatments on adult surgical patients, is deemed to be a meta-analysis at the "lowest level of explanation." On the other hand, the use of meta-analysis to test competing theories of how patients cope with the stress of surgery is viewed as higher level explanatory meta-analysis.

However, this attempt to extend the role of meta-analytic methods beyond phenomena detection to explanation obscures the basic methodological distinction between phenomena detection and scientific explanation. As noted earlier in the chapter, the stable general effects gleaned from meta-analysis are empirical phenomena, and statements about phenomena are the objects of scientific explanations; they are not the explanations themselves. The question, "What do statements of empirical phenomena explain?" occasions no natural reply. This is not surprising, for the successful detection of phenomena is essentially a descriptive achievement that involves investigative practices that are, for the most part, quite different from explanatory endeavors. In psychology, these methods are often statistical in kind. By contrast, scientific explanation is often causal-mechanistic in nature (Salmon, 1984). On this view, explanation requires the identification of the mechanisms that underlie and give rise to empirical phenomena, along with a detailing of the ways in which those mechanisms produce the phenomena we seek to understand.

When meta-analysis enters into the process of testing explanatory theories, it contributes to an evaluation of those theories in terms of predictive success. However, this common strategy for evaluating scientific theories is not directly concerned with their explanatory adequacy. To repeat, it is not being denied that meta-analytic methods can be employed in the course of testing theories, but meta-analysis itself is not an approach to theory testing (Chow, 1996). To employ meta-analysis to assist in the predictive testing of an explanatory theory does not thereby confer an explanatory role on meta-analysis itself. One does not assign status simply on the basis of association.

Meta-Analysis and Evaluative Inquiry

It is surprising that methodological discussions of meta-analysis and its applications have shown little regard for the rationale that Glass originally provided for its use. Glass claims that many researchers misunderstand meta-analyses of outcome research because they fail to take cognizance of his rationale. Specifically, this failure is offered by him as the reason for the widespread misunderstanding of Smith, Glass, and Miller's (1980) original meta-analysis of psychotherapy outcome studies.

In a number of different publications, Glass insists that meta-analysis should be understood as an exercise in evaluative research rather than in scientific research (Glass, 1972; Smith, Glass, & Miller, 1980; Glass & Kleigl, 1983). The core of Glass's underlying rationale for meta-analysis involves drawing a strong distinction between scientific and evaluative inquiry. Glass's position is that researchers as scientists are concerned to satisfy their curiosity by seeking truthful conclusions in the form of theories comprising explanatory laws. By contrast, evaluators undertake research on behalf of a client that is aimed at producing useful decisions based on descriptive determinations of the worth of particular products or programs. For Glass, the meta-analysis of outcome studies properly involves the integration of the products of evaluative research only.

The methodology for this conception of meta-analysis fashions the distinction between scientific and evaluative inquiry in terms of the relevance for each of the concepts of truth, explanation, values, problems, and generalizations. Because of space limitations, I will consider just one of these contrasts—that of explanation. Glass contends that scientific inquiry involves the continual search for subsurface explanations of surface phenomena. Evaluative inquiry, on the other hand, does not seek explanations:

"A fully proper and useful evaluation can be conducted without producing an explanation of why the product or program being evaluated is good or bad or of how it operates to produce its effects . . . [It] is usually enough for the evaluator to know that something attendant upon the [product or program] is responsible for the valued outcomes." (Glass, 1972, pp. 5–6)
Glass’s position seems to be that although pro- as “. . . a two-tiered research enterprise [where] one
gram treatments can be causally responsible for their group of researchers will specialize in conducting
measured outcomes, it matters little that knowl- individual studies [and] another group will apply
edge of this gleaned from evaluation studies does complex and sophisticated meta-analysis methods to
not tell us how programs produce their effects, those cumulative studies and will make the scientific
because such knowledge is not needed for policy discoveries” (p. 1180). Sohn’s primary concerns are to
action. challenge the claim that meta-analysis is an impor-
Glass is surely correct in asserting that scientists tant vehicle of scientific discovery and to identify the
are centrally concerned with the construction of major problems besetting mainstream psychological
causal theories to explain phenomena, for this is the research.
normal way in which they achieve understanding Sohn (1996) has questioned the basic idea of
of the empirical regularities they discover. However, meta-analysis as a standalone literature review capa-
he is wrong to insist that proper evaluations should ble of discovering truths, whereas traditionally sci-
deliberately ignore knowledge of underlying causal entific discoveries were contained in the empirical
mechanisms. The reason for this is that the effective findings of the primary studies themselves. For
implementation and alteration of social programs Sohn, the idea that meta-analytic literature reviews
often benefits from knowledge of the relevant causal can make discoveries about nature rests on the
mechanisms involved (Gottfredson, 1984), and assumption that the primary research literature is
strategic intervention in respect of these is often a proxy for nature. It is an assumption that he has
the most effective way to bring about social change. roundly rejected.
Although standard versions of scientific realism are Noting the tendency of meta-analysts to paint a
wrong to insist that the relevant causal mechanisms bleak picture of progress in twentieth century psy-
are always unobserved mechanisms, it is the case that chology, Sohn (1996) has suggested that although
appeal to knowledge of covert causal mechanisms meta-analysis has been introduced to improve mat-
will frequently be required for understanding and ters in this regard, it is in fact symptomatic of its
change. poor progress. In his judgment, this lack of good
To conclude this highly selective evaluation of progress is a consequence of psychology adopting a
Glass’s rationale for meta-analysis, science itself is hypothesis-testing view of science. For Sohn, this
best understood as a value-laden, problem-oriented view of science seeks knowledge by testing research
human endeavor that tries to construct causal hypotheses about the relationship of descriptive
explanatory theories of the phenomena it discovers. variables without regard for causal mediating vari-
There is no sound way of drawing a principled con- ables. Essentially, the approach amounts to the
trast between scientific and evaluative inquiry. These hypothetico-deductive testing of outcome studies
critical remarks are not directed against the worth of through use of significance tests and effect size
evaluation as such or against the use of meta-analysis measures. Sohn maintains that there are, in fact,
in evaluating program or product effectiveness. They two deleterious consequences of such an approach
are leveled specifically at the conception of eval- to research: one is the lack of agreement about
uation research that appears to undergird Glass’s outcomes, and the other is the absence of knowl-
approach to meta-analysis. edge of the causal mechanisms that are responsi-
ble for those alleged outcomes. Meta-analysis is
indicted by Sohn for failing to remedy both types of
Meta-Analysis and the Nature of Science defect.
Proponents of meta-analysis often justify the However, Sohn has supported his claim that
use of these methods by pointing out the need to meta-analysis does not produce demonstrable evi-
glean valuable knowledge from the information in a dence for treatment effects in a curious way. He
domain that lies dormant in the cornucopia of scat- has acknowledged that Smith, Glass, and Miller’s
tered primary studies. However, Sohn (1996) has (1980) well-known meta-analytic treatment of the
expressed concern about the quality of empirical psy- benefits of psychotherapy has been corroborated
chological studies that are used in meta-analysis. He by subsequent meta-analyses yet has maintained
has urged resistance to the generally accepted view that this does not constitute evidence for replica-
that meta-analysis is a form of research rather than ble effects. He has expressed a distrust of research
a review of research, and he has balked at Schmidt’s that relies on statistical methods for making claims
(1992) revisionist model of possible future science about replicable effects. This distrust appears to

be founded in part on an extension of the view attributed to Lord Rutherford that if an experimental study requires statistics, then the experiment is in need of improvement. For Sohn, "If one's science needs [meta-analysis], one should have done better science."

However, this view flies in the face of widely accepted scientific practice. Woodward's (1989) detailed examination of the practice of phenomena detection in science strongly supports the view that different parts of the various sciences, from physics to anthropology, appropriately make extensive use of statistical methods in the detection of empirical phenomena. It is hard to imagine that statistics would exist as we currently know it unless it provided a necessary armament for science. In this regard, it is worth noting that Sohn has acknowledged the claim made by Hedges and Olkin (1985) that meta-analysis in some form or other has a long history of use in the hard sciences. Sohn has stated his disagreement with this position, but he has not argued against it. Space limitations preclude further analysis of Sohn's (1996) argument, but perhaps enough has been said to suggest that it should be regarded with some skepticism.

More work on the philosophical foundations of meta-analysis is clearly needed. However, from this highly selective examination of its conceptual foundations, it can be concluded that meta-analysis receives its primary justification in scientific research by articulating one, but only one, way in which researchers can fashion empirical generalization from the findings of primary studies. It derives its importance in this role directly from the importance accorded the goal of phenomena detection in science.

Exploratory Factor Analysis

Despite the advanced statistical state and frequent use of exploratory factor analysis in the behavioral sciences, debate about its basic nature and worth abounds. Thurstone (1947) has appropriately emphasized the exploratory nature of the method, and many methodologists take it to be a method for postulating latent variables that are thought to underlie patterns of correlations. Some, however, understand exploratory factor analysis as a method of data reduction that provides an economical description of correlational data. The present section considers this and other important foundational issues that have figured prominently in discussions of the method.

Factor Analytic Inference

Alongside the debate between the fictionalist and realist interpretations of factors, there is a difference of view about whether the basic nature of factor analytic inference is inductive or abductive. Expositions of exploratory factor analysis seldom consider its inferential nature, but when they do, the method is usually said to be inductive in character. This is not surprising, given that exploratory factor analysis can be plausibly located historically within seventeenth- and eighteenth-century empiricist philosophy of science and its inductive conception of inquiry (Mulaik, 1987). However, even if one relaxes the Baconian ideal that inductive method is an algorithm that produces incorrigible knowledge, an inductive characterization of exploratory factor analysis seems inappropriate. This is because inductive inference, being descriptive inference, cannot take the researcher from manifest effects to theoretical entities that are different in kind from those effects. However, abductive inference, which is concerned with the generation and evaluation of explanatory hypotheses, can do so. For this reason, exploratory factor analysis is better understood as an abductive method of theory generation (Haig, 2005a), a characterization that coheres well with its general acceptance as a latent variable method. With exploratory factor analysis, abductive inference is explanatory inference that leads back from presumed effects to underlying causes.

There are different forms of abductive reasoning. Exploratory factor analysis is a method that can facilitate the drawing of explanatory inferences that are known as existential abductions. Existential abductions enable researchers to hypothesize the existence, but not the nature, of entities previously unknown to them. The innumerable examples of existential abduction in science include the initial postulation of hidden entities such as atoms, genes, tectonic plates, and personality traits. In cases like these, the primary thrust of the initial abductive inferences is to claims about the existence of theoretical entities to explain empirical facts or phenomena. Similarly, the hypotheses given to us through the use of exploratory factor analysis postulate the existence of latent variables such as Spearman's g and extraversion. It remains for further research to elaborate on the first rudimentary conception of these variables and their interrelation.

The factor analytic use of existential abduction to infer the existence of, say, the theoretical entity
g, can be coarsely reconstructed in accordance with the following argument schema:

The surprising empirical phenomenon of the positive correlations among different tests of ability is identified.

If g exists, and it is validly and reliably measured by a Wechsler intelligence scale (and/or some other objective test), then the positive manifold would follow as a matter of course.

Hence, there are grounds for judging the hypothesis of g to be initially plausible and worthy of further pursuit.

Note that the schema for abductive inference and its application to the generation of the hypothesis of g is concerned with the form of the arguments involved, rather than with the actual generation of the explanatory hypotheses. The explanatory hypothesis is given in the second premise of the argument. An account of the genesis of the explanatory hypothesis must, therefore, be furnished by some other means. It is plausible to suggest that reasoning to explanatory hypotheses trades on our evolved cognitive ability to abductively generate such hypotheses (Carruthers, 2002). Whatever its origin, an informative methodological characterization of the abductive nature of factor analytic inference must appeal to the scientist's own psychological resources as well as those of logic. This injunction is motivated by the realist methodological thesis of naturalism stated near the beginning of the chapter.

Although exploratory factor analysis exemplifies well the character of existential abduction, it is clearly not an all-purpose method for abductively generating explanatory hypotheses and theories. With its focus on common factors, it can properly serve as a generator of elementary theories only in those multivariate domains that have common causal structures.

The Principle of the Common Cause

It is well known that exploratory factor analysis is a common factor analytic model in which the latent factors it postulates are referred to as common factors. Less well known is the fact that there is an important principle of scientific inference, known as the principle of the common cause (e.g., Sober, 1988), that can be used to drive the nature and shape of the existential abductive inferences involved in exploratory factor analysis. The principle of the common cause can be formulated concisely as follows: "Whenever two or more events are improbably, or significantly, correlated, infer one or more common causes unless there is good reason not to." Clearly, the principle should not be taken as a hard-and-fast rule, for in many cases, proper inferences about correlated events will not be in terms of common causes. The qualifier, "unless there is good reason not to," should be understood as an injunction to consider causal interpretations of the correlated events other than the common causal kind. For example, in a given research situation, the correlated events might be related as direct causes, or their relationship might be mediated by a third variable in a causal sequence.

Although exploratory factor analysis is used to infer common causes, expositions of common factor analysis that explicitly acknowledge the importance of the principle of the common cause are rare. Kim and Mueller's (1978) textbook exposition of factor analysis is a noteworthy exception. In discussing the conceptual foundations of factor analysis, these authors evince the need to rely on what they call the postulate of factorial causation. The postulate of factorial causation is characterized by them as "the assumption that the observed variables are linear combinations of underlying factors and that the covariation between observed variables solely results from their common sharing of one or more of the common factors" (p. 78). The authors make clear that the common factors mentioned in the assumption are to be regarded as underlying causal variables. Understood as a methodological injunction, this postulate functions as a variant of the principle of the common cause. Without appeal to this principle, factor analysts could not identify the underlying factor pattern from the observed covariance structure.

There are two features of the principle of the common cause that make it particularly suitable for use in exploratory factor analysis. First, it can be applied in situations where we do not know how likely it is that the correlated effects result from a common cause. The abductive inference to common causes is a basic explanatory move that is non-probabilistic and qualitative in nature. It is judgments about the soundness of the abductive inferences, rather than the assignment of probabilities, that confer initial plausibility on the factorial hypotheses spawned by exploratory factor analysis. Second, the principle can also be used in situations where we are essentially ignorant of the nature of the common cause. With this second feature, the principle of the common cause accommodates the fact that exploratory factor analysis trades in existential abductions.
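The inferential pattern at issue, reasoning from a correlation pattern back to a postulated common cause, can be sketched with a one-factor model. The snippet below is illustrative: the loadings are invented, and the function name is ours. Under such a model each off-diagonal correlation is the product of the two variables' loadings, so a single postulated common factor reproduces the whole positive manifold, and Spearman-style "vanishing tetrads" give a checkable signature of a single common cause.

```python
# Sketch of how one postulated common cause reproduces a correlation
# pattern: under a one-factor model, corr(i, j) = loading_i * loading_j
# for i != j. The loadings below are invented for illustration.

loadings = [0.8, 0.7, 0.6, 0.5]  # hypothetical ability-test loadings on g

def implied_correlations(loadings):
    """Correlation matrix implied by a one-factor (common cause) model."""
    n = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j]
             for j in range(n)] for i in range(n)]

R = implied_correlations(loadings)

# Spearman's "vanishing tetrad": for any four variables the one-factor
# model implies r01 * r23 - r02 * r13 = 0.
tetrad = R[0][1] * R[2][3] - R[0][2] * R[1][3]
print(abs(tetrad) < 1e-12)  # True
```

The direction of use matters for the abductive point above: the model-to-correlations computation is trivial, whereas the researcher's inference runs the other way, from an observed pattern of correlations back to the existence of an unobserved common cause.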
Further, it is important to appreciate that the principle of the common cause does not function in isolation from other methodological constraints. Embedded in exploratory factor analysis, the principle helps one limit existential abductive inference to those situations where we reason back from correlated effects to one or more common causes. Although covariation is an important basic datum in science, not all effects are expressed as correlations, and of course, not all causes are of the common causal variety. It follows from this that one should not always look for common causal interpretations of multivariate data, for there are numerous alternative latent variable models.

The Underdetermination of Factors

The methodological literature on exploratory factor analysis has given considerable attention to the indeterminacy of factors in the common factor model. Factor indeterminacy arises from the fact that the common factors are not uniquely determined by their related manifest variables. As a consequence, a number of different common factors can be produced to fit the same pattern of correlations in the manifest variables.

Although typically ignored by factor analytic researchers, factor indeterminacy is an epistemic fact of life that continues to challenge factor analytic methodologists. Some methodologists regard factor indeterminacy as a serious problem for common factor analysis and recommend the use of alternative methods such as component analysis methods because they are considered to be determinate methods.

One constructive perspective on the issue of factor indeterminacy has been suggested by Mulaik and McDonald (Mulaik & McDonald, 1978; McDonald & Mulaik, 1979; Mulaik, 1987). Their position is that the indeterminacy involved in interpreting the common factors in exploratory factor analysis is just a special case of the general indeterminacy of theory by empirical evidence widely encountered in science, and it should not, therefore, be seen as a debilitating feature that forces us to give up on

not determined by the relevant evidence (Duhem, 1954). In effect, methodological indeterminacy arises from our inability to justify accepting one theory among alternatives on the basis of empirical evidence alone.

Mulaik (1987) sees underdetermination in exploratory factor analysis as involving inductive generalizations that go beyond the data. However, inductive underdetermination should be seen as applying specifically to the task of establishing factorial invariance where one seeks constructive or external replication of factor patterns. However, for exploratory factor analysis there is also need to acknowledge and deal with abductive underdetermination involved in the generation of explanatory factorial theories. The sound abductive generation of hypotheses is essentially educated guesswork. Thus, drawing from background knowledge, and constrained by correlational empirical evidence, the use of exploratory factor analysis can reasonably be expected to yield a plurality of factorial hypotheses or theories that are thought to be in competition. This contrasts strongly with the unrealistic expectation held by many earlier users of exploratory factor analysis that the method would deliver them strongly justified claims about the one best factorial hypothesis or theory.

How, then, can exploratory factor analysis deal with the specter of underdetermination in the context of theory generation? One plausible answer is that exploratory factor analysis narrows down the space of a potential infinity of candidate theories to a manageable subset by facilitating judgments of initial plausibility (Haig, 2005a). It seems clear enough that scientists often make judgments about the initial plausibility of the explanatory hypotheses and theories that they generate. However, it is less clear just to what this evaluative criterion amounts (cf. Whitt, 1992). With an abductive conception of exploratory factor analysis, judgments of the initial plausibility of theories are judgments about the soundness of the abductive arguments employed in generating those theories. It seems reasonable to suppose that those who employ exploratory factor analysis as an abductive
common factor analysis. method of theory generation often make com-
Indeterminacy is pervasive in science. It occurs in pressed judgments of initial plausibility. By confer-
semantic, metaphysical, and epistemological forms ring judgments of initial plausibility on the theories
(McMullin, 1995). Factor indeterminacy is essen- it spawns, exploratory factor analysis deems them
tially epistemological in nature. The basic idea of worthy of further pursuit, whereupon it remains
epistemological or, more precisely, methodological for the factorial theories to be further developed
indeterminacy is that the truth or falsity (better, and evaluated, perhaps through the use of confir-
acceptance or rejection) of a hypothesis or theory is matory factor analysis. It should be emphasized that

haig 23
using exploratory factor analysis to facilitate judgments about the initial plausibility of hypotheses will still leave the domains being investigated in a state of considerable theoretical underdetermination. It should also be stressed that the resulting plurality of competing theories is entirely to be expected and should not be thought of as an undesirable consequence of employing exploratory factor analysis. To the contrary, it is essential for the growth of scientific knowledge that we vigorously promote theoretical pluralism (Hooker, 1987), develop theoretical alternatives, and submit them to critical scrutiny.

Exploratory Factor Analysis and Confirmatory Factor Analysis
The aforementioned consideration of exploratory factor analysis supports the conclusion that there is an important role for its use in factor analytic research. However, this conclusion raises the question of how exploratory factor analysis relates to its confirmatory namesake. In contrast to popular versions of the classical inductivist view of science that inductive method can generate secure knowledge claims, the use of exploratory factor analysis as an abductive method of theory generation can only furnish researchers with a weak logic of discovery—one that gives them educated guesses about underlying causal factors. It is for this reason that those who use exploratory factor analysis to generate theories need to supplement their generative assessments of the initial plausibility of those theories with additional consequentialist justification in the form of confirmatory factor analytic testing or some alternative approach to theory appraisal.

However, in the factor analytic literature, there is a division of opinion about whether exploratory factor analysis and confirmatory factor analysis should be viewed as complementary or competing methods of common factor analysis. Quite a number of factor analytic methodologists have expressed views that discourage their complementary use in factor analytic research. For example, Gorsuch (1983), in his well-known book on factor analysis, has expressed a view about the relative importance of exploratory and confirmatory factor analysis that seems to be quite widely held today:

Although the next three chapters [of Factor analysis] are primarily concerned with exploratory factor analysis, the space and time given to that technique is a function of the complexity of resolving its problems, not of its theoretical importance. On the contrary, confirmatory factor analysis is the more theoretically important—and should be the much more widely used—of the two major factor analytic approaches. (p. 134)

Although Gorsuch makes his claim in emphatic terms, he provides no justification for it. He seems to assume that theory testing is more important than theory generation. However, this belief is difficult to defend, given the fact that there are many other important phases of scientific inquiry that together demand most of the researcher's methodological time. Recall, for example, the importance to science of the detection of empirical phenomena and the generation, development, and comparative appraisal of theories. Viewed in this light, theory testing is just one, albeit important, part of scientific method (cf. Simon, 1968). Given the fact that science is as much concerned with theory generation as it is with theory testing, and acknowledging that exploratory factor analysis is a useful abductive method of theory generation, exploratory factor analysis deserves to be regarded as important as confirmatory factor analysis in the theory constructor's toolkit.

To conclude, despite the fact that exploratory factor analysis has been frequently employed in psychological research, the extant methodological literature on the method seldom acknowledges the explanatory and ontological import of the method's inferential nature. Abduction is a major form of creative reasoning in science, and the principle of the common cause is a maxim of scientific inference with important application in research. By incorporating these two related elements into its fold, exploratory factor analysis is ensured an important, albeit circumscribed, role in the construction of explanatory theories in psychology and other sciences. By generating structural models about common causes, exploratory factor analysis can serve as a valuable precursor to confirmatory factor analysis.

Causal Modeling
During the last 50 years, social and behavioral science methodologists have developed a variety of increasingly sophisticated statistical methods to help researchers draw causal conclusions from correlational data. These causal modeling methods, as they have sometimes been called, include path analysis, confirmatory factor analysis, and full structural equation modeling.

Despite the fact that psychological researchers are increasingly employing more sophisticated causal

24 the philosophy of quantitative methods
modeling methods in place of simple regression and partial correlation procedures, worries about both their accompanying methodology and their misuse have been expressed (e.g., Cliff, 1983). In this section, I consider some philosophical aspects of three foundational issues that have been discussed in the literature on causal modeling: the different ideas of causation presupposed by causal modeling; the suggestion that causal modeling can be viewed as a form of inference to the best explanation; and the contested nature of latent variables.

Causal Modeling and Theories of Causation
One central methodological issue in the debates about causal modeling has to do with the appropriateness of the nature of causation involved in various causal modeling procedures. A popular view of the matter is clearly expressed by Kenny (1979), who points out that three conditions must be satisfied for a researcher to claim that one variable is the cause of another. The first condition is that the relationship be asymmetric. The second condition is that a functional relationship be present between cause and effect. The third condition is that the causal relationship be direct or non-spurious. These three conditions are exactly those of the regularity theory of causation, which depicts the causal relationship between events in terms of their regular succession, covariation, and contiguity. The regularity theory, which is more or less Humean in character, provides an important part of the epistemic backdrop against which traditional causal modeling methods like path analysis have been understood.

However, like other parts of the standard empiricist enterprise, this theory has received strong criticism. Its claimed limitations can best be appreciated by contrasting it with a scientific realist alternative known as the generative theory of causation (Harré & Madden, 1975). Briefly stated, the generative theory depicts causation as a relationship where, under appropriate conditions, a causal mechanism produces its effect. For this to happen, the causal mechanism must connect to its effect and have the power to generate that effect, usually when stimulated by the appropriate causal condition. It is the productivity of a generative mechanism that makes it a causal mechanism, and for this to occur, there must be a naturally necessary connection that allows for the transmission of power from cause to effect. This causal power exists irrespective of whether it is currently being exercised. As such, it is properly viewed as a tendency—that is, an existing state of an object, which, if unimpeded, will produce its effect. We are, therefore, able to infer abductively the presence of the causal mechanism on the basis of knowledge of the triggering condition and/or its presumed effect.

Advocates of the generative theory of causation claim it has a number of important advantages over the regularity theory. One advantage of the generative theory is that it is able to accommodate deep structural, explanatory theories that postulate unobserved generative mechanisms. It is argued that we need a theory of causation that affords us the conceptual space to do this, because many of the world's causal mechanisms are not open to direct inspection. The latent variables of many of our causal modeling methods are thought by many to be precisely of this kind. A related advantage of the generative theory is that it is needed for enlightened social policy because, as noted in the discussion of evaluation research earlier, the possibility of ameliorative action depends on effecting change based on an understanding of how things work, and for this, knowledge of the relevant underlying causal mechanisms is often essential.

A third, and significant, advantage of the theory of generative causation is that it enables us to draw the important distinction between empirical regularities and genuine causal laws. An adequate methodology of causal modeling must be able to draw the distinction between empirical regularities and causal laws, because the ability to do so is a conceptual requirement of being able to differentiate properly direct causal relations from spurious correlations. By collapsing this distinction, empiricists, with their regularity theory of causation, are unable to articulate a satisfactory notion of spuriousness. For example, Simon's (1985) influential analysis of spurious correlation explicitly rejects the generative theory of causation and endeavors to ground the distinction between true and spurious correlations on a commitment to an empiricist view of causation. The common or intervening causes that bring about spurious correlations will typically be unobserved. However, for a statistical treatment of these variables to be consistent with the regularity theory, Simon's view of causation forces researchers to focus on altogether different variables at the manifest level. But this cavalier ontological slide wrecks our efforts to obtain worthwhile causal knowledge, because the manifest replacement variables cannot act as effective surrogates for their
presumed common and intervening causes. They are ontologically distinct from such causes and, although as causal conditions they may trigger their latent counterparts, they do not function as major causal mechanisms that can bring about spurious correlations.

Although it can plausibly be argued that a generative view of causation is required to make sense of research that embraces hidden causal mechanisms, it does not follow, as is often supposed (e.g., Manicas, 1989; Sayer, 1992), that the regularity theory has no place in a realist conception of science. With its emphasis on the ideas of regularity, it would seem to be a suitable account of causation for claims about phenomena that take the form of empirical generalizations. Nor should it be thought that the regularity theory and the generative theory together give one a full understanding of causation in science. For example, structural equation modeling provides knowledge of causal networks. As such, it does not so much encourage the development of detailed knowledge of the nature of latent variables as it specifies the range and order of causal relations into which latent and manifest variables enter. For this type of research, a network theory of causation is needed (Thagard, 1999).

The suggestion that different conceptions of causation are relevant to causal modeling fits with a philosophy of causal pluralism, which is increasingly being recommended in contemporary methodological studies of the nature of causation (Godfrey-Smith, 2009).

Structural Equation Modeling and Inference to the Best Explanation
The guess-and-test strategy of the hypothetico-deductive method takes predictive accuracy as the sole criterion of theory goodness. However, it seems to be the case that in research practice, the hypothetico-deductive method is sometimes combined with the use of supplementary evaluative criteria such as simplicity, scope, and fruitfulness. When this happens, and one or more of the supplementary criteria have to do with explanation, the combined approach can appropriately be regarded as a version of inference to the best explanation, rather than just an augmented account of the hypothetico-deductive method (Haig, 2009). This is because the central characteristic of the hypothetico-deductive method is a relationship of logical entailment between theory and evidence, whereas with inference to the best explanation the relationship is also one of explanation. The hybrid version of inference to the best explanation being considered here will allow the researcher to say that a good explanatory theory will rate well on the explanatory criteria and, at the same time, boast a measure of predictive success. Most methodologists and scientists will agree that an explanatory theory that also makes accurate predictions will be a better theory for doing so.

Although the use of structural equation modeling in psychology often involves testing models in hypothetico-deductive fashion, it also contains a minority practice that amounts to inference to the best explanation in the sense just noted. This latter practice involves the explicit comparison of models or theories in which an assessment of their goodness-of-fit to the empirical evidence is combined with the weighting of the fit statistics in terms of parsimony indices (Kaplan, 2000). Here goodness-of-fit provides information about the empirical adequacy of the model, whereas parsimony functions as a criterion having to do with the explanatory value of the model. Both are used in judgments of model goodness. Markus, Hawes, and Thasites (2008) recently have suggested that in structural equation modeling, model fit can be combined with model parsimony, understood as explanatory power, to provide an operationalized account of inference to the best explanation. They discussed the prospects of using structural equation modeling in this way to evaluate the comparative merits of two- and three-factor models of psychopathy.

Do Latent Variables Exist?
Many causal modeling methods are latent variable methods, whose conceptual foundations are to be found in the methodology of latent variable theory (Borsboom, 2005; 2008). Central to this theory is the concept of a latent variable itself. However, the notion of a latent variable is a contested concept, and there are fundamental philosophical differences in how it should be understood.

A clear example of the contested nature of the concept of a latent variable is to be found in the two quite different interpretations of the nature of the factors produced by exploratory factor analysis. One view, known as fictionalism, maintains that the common factors, the output of exploratory factor analysis, are not theoretical entities invoked to explain why the observed variables correlate the way that they do. Rather, these factors are taken to be summary expressions of the way manifest variables co-vary. Relatedly, theories that marshal
descriptions of such factors are properly considered to serve the instrumentalist function of economically redescribing the original correlational data. This interpretation of exploratory factor analysis has been quite influential in psychometrics (Block, 1976) and has been taught to generations of psychology students through textbooks on psychological testing (e.g., Anastasi & Urbina, 1997). Fictionalism seems to be the preferred option of many factor analytic researchers in the domains of personality and intelligence.

However, fictionalism is a difficult position to defend, and it seems to fail in factor analysis for the reason it fails in science generally: it inappropriately grants ontological significance to a sharp distinction between observation and theory to buttress the claim that only observable, or manifest, entities exist, when observability is really a matter of degree. Fictionalists argue that because we do not have perceptual experience of theoretical entities, we do not have grounds for saying they exist; we only have grounds for claiming that observable entities exist. But realist philosophers of science (e.g., Maxwell, 1962) assert in reply that fictionalists cannot maintain a sharp distinction between what is observable and what is unobservable. What cannot be seen directly by the unaided eye might be observable through a magnifying glass, and what cannot be observed through a magnifying glass might be observed through a microscope. Importantly, how we draw the observable/unobservable distinction at a particular time is a function of prior knowledge, our physiological make-up, and available instrumentation. Thus, the distinction provides no basis for deciding what entities do, and do not, exist. To assert that factors are theoretical entities is not to regard them as having a special existence; rather, it is to acknowledge that we come to know them indirectly in terms of their correlated effects. On this realist interpretation, the factors are regarded as latent variables that underlie, and give rise to, the correlated manifest variables. Borsboom (2005) has made a strong case for adopting a realist attitude to latent variables more generally by combining an argument similar to Maxwell's with other foundational considerations in philosophy of science and psychometrics.

This general argument against fictionalism simultaneously supports the doctrine of realism in science, but it does not by itself establish that the factors of exploratory factor analysis should be given a realist interpretation. Whether this should happen depends also on whether exploratory factor analysis can facilitate the drawing of sound abductive inference about the existence of latent factors.

This highly selective consideration of the philosophy of causal modeling points to three conclusions: (1) that causation in causal modeling manifests itself in a number of different ways; (2) that causal modeling can transcend the limitations of the hypothetico-deductive method and adopt the practice of inference to the best explanation; and (3) that latent variables deserve to be given a realist interpretation as genuine theoretical entities.

Conclusion
The philosophy of research methods is an aspect of research methodology that receives limited attention in behavioral science education. The majority of students and research practitioners in the behavioral sciences obtain the bulk of their knowledge of research methods from textbooks. However, a casual examination of these texts shows that they tend to pay little, if any, serious regard to the philosophy of science and its bearing on the research process. As Kuhn pointed out nearly 50 years ago (Kuhn, 1962; 1996), textbooks play a major role in dogmatically initiating students into the routine practices of normal science. Serious attention to the philosophy of research methods would go a considerable way toward overcoming this uncritical practice. As contemporary philosophy of science increasingly focuses on the contextual use of research methods in the various sciences, it is to be hoped that research methodologists and other behavioral scientists will avail themselves of the genuine methodological insights that it contains.

Future Directions
In this final section of the chapter, I suggest a number of directions that future work in the philosophy of quantitative methods might take. The first three suggestions are briefly discussed; the remaining suggestions are simply listed.

Understand Quantitative Methods Through Methodology
A proper understanding of research methods cannot be had without an appreciation of their accompanying methodology (see Proctor & Capaldi, 2001). Methodology is the interdisciplinary field that studies methods. It draws from the disciplines
of statistics, philosophy of science, and cognitive science, among others. And yet, the professional literature of these disciplines does not figure in the content of research methods courses. Further, it is important to appreciate that methodology has descriptive, critical, and advisory dimensions; again, the typical methods curriculum does not systematically deal with research methods with these considerations in mind. It is not surprising, therefore, that psychologists' understanding of research methods often leaves a lot to be desired.

A realist-oriented methods curriculum would profitably consider methods in the light of the primary characteristics of realist methodology outlined early in the chapter. To mention just three of these: Greater prominence would be given to generative methodology, in which reasoning well to hypotheses and theories would figure in the assessment of those knowledge claims. The sound abductive reasoning to factorial hypotheses using exploratory factor analysis is perhaps psychology's best example of generative justification. Similarly, the coherentist justification of explanatory theories using methods of inference to the best explanation would feature much more prominently than it does at present. Finally, in adopting methods that are apt for us as knowing subjects, heuristic procedures would receive much more explicit attention in the methods curriculum as realistic guides to our thinking.

The British Psychological Society now takes conceptual and historical issues as one of psychology's seven core areas. Teaching methods through methodology is the appropriate way to employ this core area in the teaching of research methods. The American Psychological Association and the Association for Psychological Science would do well to follow suit, for it is only by making considered use of methodology that a genuine education in research methods can be achieved.

Rethink the Quantitative/Qualitative Distinction
A major feature of the methodological landscape has been the discussion of the distinction between quantitative and qualitative methods. Although perhaps necessary in establishing a legitimate role for the use of qualitative methods in research, the distinction is now the subject of critical scrutiny. The way the original distinction was drawn has been questioned (e.g., Michell, 2004), and the combination of qualitative and quantitative methods in mixed methods strategies has been strongly promoted in recent times.

However, the quantitative/qualitative debate has not considered the possibility that most methods have both quantitative and qualitative dimensions. In many cases, we are likely to gain a better understanding of the research methods we use not by viewing them as either qualitative or quantitative but by regarding them as having both qualitative and quantitative dimensions. Three examples are mentioned here. First, grounded theory (e.g., Strauss, 1987), the most prominent extant qualitative methodology, is in good part the product of a translation from some sociological quantitative methods of the 1950s. Moreover, there is nothing in principle to stop researchers using quantitative methods within the fold of grounded theory. Exploratory factor analysis, for example, could sometimes be used for generating grounded theory.

Second, although exploratory factor analysis itself is standardly characterized as a multivariate statistical method, the inferential heart of the method is the important scientific heuristic known as the principle of the common cause. Importantly, this principle, which guides the factor analytic inference from correlations to underlying common factors, can be effectively formulated in qualitative terms.

Finally, the theory of explanatory coherence (Thagard, 1992), which evaluates theories in terms of their explanatory power, is a qualitative method of theory appraisal, but it is implemented by a computer program that is part of the method proper, and that has a connectionist architecture that is mathematically constrained.

It is recommended, then, that methodologists and researchers seriously entertain the prospect that individual methods are likely to have a mix of qualitative and quantitative features—that is, that individual methods are themselves mixed methods.

Evaluate the Philosophical Critiques of Quantitative Research Methods
Most of the occasional references to scientific realism in psychology are to Bhaskar's (1975; 1979) critical realism (e.g., Manicas & Secord, 1983), a philosophy that has had considerable impact on the social sciences (e.g., Sayer, 1992). Interestingly, critical realists have expressed strong reservations about the use of statistical methods in quantitative research. Bhaskar himself goes so far as to say that causal models should be "totally discarded." There are various reasons for this attitude (see Pratschke, 2003), but perhaps the most fundamental one is the claim that statistical models themselves do not
provide researchers with the substantive models that are sought in causal modeling research.

However, this claim rests on a mistaken conception of the relation between statistical models and substantive theoretical models. It is hard to deny that consideration of much more than the statistical machinery of causal modeling is needed to ground substantive conclusions. Indeed, it is difficult to see how any statistical method could be properly understood and used in research without appeal to suprastatistical matters. Consider, for example, the oft-made claim that factors of exploratory factor analysis are statistical entities and that the method cannot, therefore, be used to favor one substantive factorial theory of intelligence over another (e.g., Gould, 1996). This claim is false because factor analysts typically transcend the statistical level of the method and make use of the relevant part of latent variable theory to generate plausible hypotheses about the existence of latent variables. Of central relevance here is the fact that exploratory factor analysis exploits the so-called "principle of the common cause" to sanction inferences to the initial plausibility of interpreted latent variables. We saw earlier that inferences from manifest to latent variables made in accordance with this principle are abductive, or explanatory, in nature and are made by factor analysts themselves. Although the statistical machinery of multiple regression and partial correlation theory is obviously an important part of exploratory factor analysis, its primary function is to facilitate researchers' suprastatistical inferences to latent factors.

It is important to appreciate that the interpretive dimension of causal modeling methods is a proper part of their methodology. There is nothing in critical realism, or other variants of scientific realism, that prevents one from taking such an outlook on causal modeling. Indeed, scientific realism comports well with causal modeling methods that countenance latent variables.

Additional Directions
Space considerations prevent discussion of additional future directions in the philosophy of quantitative methods. However, the following points deserve to be on an agenda for future study.

• Develop a modern interdisciplinary conception of research methodology.
• Give more attention to investigative strategies in psychological research.
• Take major philosophical theories of scientific method seriously.
• Apply insights from the "new experimentalism" in the philosophy of science to the understanding of quantitative research methods.
• Develop the philosophical foundations of theory construction methods in the behavioral sciences.
• Assess the implications of different theories of causality for research methods.
• Examine the philosophical foundations of "new" research methods such as data mining, structural equation modeling, and functional neuroimaging.

Author note
Brian D. Haig, Department of Psychology, University of Canterbury, Christchurch, New Zealand. Email: brian.haig@canterbury.ac.nz

References
Acree, M. C. (1979). Theories of statistical inference in psychological research: A historico-critical study. Dissertation Abstracts International, 39, 5073B. (UMI No. 7907000).
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99, 388–399.
Behrens, J. T., & Yu, C-H. (2003). Exploratory data analysis. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology, Vol. 2 (pp. 33–64). New York: Wiley.
Berger, J. O., & Selke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence (with comments). Journal of the American Statistical Association, 82, 112–139.
Bhaskar, R. (1975). A realist philosophy of science. Brighton, England: Harvester.
Bhaskar, R. (1979). The possibility of naturalism. Brighton, England: Harvester.
Block, N. J. (1976). Fictionalism, functionalism, and factor analysis. In R. S. Cohen, C. A. Hooker, & A. C. Michalos (Eds.), Boston Studies in the Philosophy of Science, Vol. 32 (pp. 127–141). Dordrecht, the Netherlands: Reidel.
Bolles, R. C. (1962). The difference between statistical and scientific hypotheses. Psychological Reports, 11, 629–645.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge, England: Cambridge University Press.
Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research and Perspectives, 6, 25–53.
Bunge, M. (2008). Bayesianism: Science or pseudoscience? International Review of Victimology, 15, 165–178.
Carruthers, P. (2002). The roots of scientific reasoning: Infancy, modularity, and the art of tracking. In P. Carruthers, S. Stich, & M. Siegal (Eds.), The cognitive basis of science (pp. 73–95). Cambridge, England: Cambridge University Press.
Chow, S. L. (1996). Statistical significance: Rationale, validity, and utility. London, England: Sage.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
30 the philosophy of quantitative methods
CHAPTER 3

Quantitative Methods and Ethics

Ralph L. Rosnow and Robert Rosenthal

Abstract

The purpose of this chapter is to provide a context for thinking about the role of ethics in quantitative
methodology. We begin by reviewing the sweep of events that led to the creation and expansion of
legal and professional rules for the protection of research subjects and society against unethical
research. The risk–benefit approach has served as an instrument of prior control by institutional
review boards. After discussing the nature of that approach, we sketch a model of the costs and
utilities of the “doing” and “not doing” of research. We illustrate some implications of the expanded
model for particular data analytic and reporting practices. We then outline a 5 × 5 matrix of general
ethical standards crossed with general data analytic and reporting standards to encourage thinking
about opportunities to address quantitative methodological problems in ways that may have mutual
ethical and substantive rewards. Finally, we discuss such an opportunity in the context of problems
associated with risk statistics that tend to exaggerate the absolute effects of therapeutic interventions
in randomized trials.
Key Words: Accountability, American Psychological Association (APA) Ethics Code, Belmont Report, ethical principles, health statistics, institutional review board (IRB), moral dilemmas, Nuernberg (Nuremberg) Code, quantitative methodology, risk–benefit assessment, statistical illiteracy, transparency, volunteer bias
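
The abstract’s closing point—that common risk statistics can exaggerate the absolute effects of therapeutic interventions—can be sketched numerically. The trial counts below are invented for illustration, not taken from the chapter; the sketch only shows how relative risk (RR) relates to absolute risk reduction (ARR) and number needed to treat (NNT):

```python
# Hypothetical randomized-trial counts, invented for illustration: a rare
# adverse outcome occurs in 2 of 1,000 treated patients and 4 of 1,000
# controls.  The same relative risk can accompany very different absolute
# benefits, which is the misconception discussed in this chapter.

def risk_measures(events_treated, n_treated, events_control, n_control):
    """Return relative risk (RR), absolute risk reduction (ARR), and
    number needed to treat (NNT) for a two-group trial."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    rr = risk_t / risk_c                        # relative risk
    arr = risk_c - risk_t                       # absolute risk reduction
    nnt = 1 / arr if arr > 0 else float("inf")  # number needed to treat
    return rr, arr, nnt

rr, arr, nnt = risk_measures(2, 1000, 4, 1000)
print(f"RR = {rr:.2f} (a {1 - rr:.0%} relative risk reduction), "
      f"ARR = {arr:.4f} ({arr:.2%} in absolute terms), NNT = {nnt:.0f}")
```

An RR of 0.50 sounds dramatic, yet here the absolute risk falls by only 0.2 percentage points and roughly 500 patients must be treated to prevent one event; the identical RR for baseline risks of 40% versus 20% would give an NNT of only 5.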

Introduction

In this chapter we sketch an historic and heuristic framework for assessing certain ethical implications of the term quantitative methods. We use this term in the broadest sense to include not only statistical procedures but also what is frequently described as quantitative research (in contrast to qualitative research) in psychology and some other disciplines. As defined in the APA Dictionary of Psychology, the traditional distinction between these two general types of research rests on whether “the approach to science” does (quantitative research) or does not (qualitative research) “employ the quantification (expression in numerical form) of the observations made” (VandenBos, 2007, pp. 762–763). Of course, quantitative and qualitative methods should not be seen as mutually exclusive, as it can often be illuminating to use both types in the same research. For example, in the typical psychological experiment in which the observations take a numerical form, it may be edifying to ask some of the participants in postexperimental interviews to reflect on the context in which the experiment was conducted and to speculate on the ways in which it may have influenced their own and other participants’ behaviors (Orne, 1962, 1969). By the same token, it is usually possible to quantify nonquantitative observations by, for example, decomposing the qualitative subject matter element by element and then numerically and visually analyzing and summarizing the results. Blogs and online discussion groups are currently a popular source of qualitative subject
matter, which researchers have trolled for patterns or relationships that can be quantified by the use of simple summary statistics (e.g., Bordia & Rosnow, 1995) or coded and visually mapped out using social network analysis to highlight links and nodes in the observed relationships (e.g., Kossinets & Watts, 2006; see also Wasserman & Faust, 1994). Whether blogs and online discussion groups’ data are treated quantitatively or qualitatively, their use may raise ethical questions regarding the invasion of privacy. The fact that bloggers and participants in online discussion groups are typically fully aware that their communications are quite public minimizes the risk of invasion of privacy.

The term ethics was derived from the Greek ethos, meaning “character” or “disposition.” We use the term here to refer to the dos and don’ts of codified and/or culturally ingrained rules by which morally “right” and “wrong” conduct can be differentiated. Conformity to such rules is usually taken to mean morality, and our human ability to make ethical judgments is sometimes described as a moral sense (a tradition that apparently goes back to David Hume’s A Treatise of Human Nature in the eighteenth century). Philosophers and theologians have frequently disagreed over the origin of the moral sense, but on intuitive grounds it would seem that morality is subject to societal sensitivities, group values, and social pressures. It is not surprising that researchers have documented systematic biases in ethical judgments. For example, in a study by Kimmel (1991), psychologists were asked to make ethical judgments about hypothetical research cases. Kimmel reported that those psychologists who were more (as compared to less) approving in their ethical judgments were more often men; had held an advanced degree for a longer period of time; had received the advanced degree in an area such as experimental, developmental, or social psychology rather than counseling, school, or community psychology; and were employed in a research-oriented context as opposed to a service-oriented context. Citing this work of Kimmel’s (1991), an American Psychological Association (APA) committee raised the possibility that inconsistent implementation of ethical standards by review boards might result not only from the expanded role of review boards but also from the composition of particular boards (Rosnow, Rotheram-Borus, Ceci, Blanck, & Koocher, 1993). Assuming that morality is also predicated on people’s abilities to figure out the meaning of other people’s actions and underlying intentions, it might be noted that there is also empirical evidence of (1) individual differences in this ability (described as interpersonal acumen) and (2) a hierarchy of intention–action combinations ranging from the least to most cognitively taxing (Rosnow, Skleder, Jaeger, & Rind, 1994).

Societal sensitivities, group values, and situational pressures are subject to change in the face of significant events. On the other hand, some moral values seem to be relatively enduring and universal, such as the golden rule, which is frequently expressed as “Do unto others as you would have them do unto you.” In the framework of quantitative methods and ethics, a categorical imperative might be phrased as “Thou shalt not lie with statistics.” Still, Huff, in his book, How to Lie with Statistics, first cautioned the public in 1954 that the reporting of statistical data was rife with “bungling and chicanery” (Huff, 1982, p. 6). The progress of science depends on the good faith that scientists have in the integrity of one another’s work and the unbiased communication of findings and conclusions. Lying with statistics erodes the credibility of the scientific enterprise, and it can also present an imminent danger to the general public. “Lying with statistics” can refer to a number of more specific practices: for example, reporting only the data that agree with the researcher’s bias, omitting any data not supporting the researcher’s bias, and, most serious of all, fabricating the results of the research. For example, there was a case reported in 2009 in which an anesthesiologist fabricated the statistical data that he had published in 21 journal articles purporting to give the results of clinical trials of a pain medicine marketed by the company that funded much of the doctor’s research (Harris, 2009). Another case, around the same time, involved a medical researcher whose accounts of a blood test for diagnosing prostate cancer had generated considerable excitement in the medical community, but who was now being sued for scientific fraud by his industry sponsor (Kaiser, 2009). As the detection of lying with statistics is often difficult in the normal course of events, there have been calls for the public sharing of raw data so that, as one scientist put it, “Anyone with the skills can conduct their own analyses, draw their own conclusions, and share those conclusions with others” (Allison, 2009, p. 522). That would probably help to reduce some of the problems of biased data analysis, but it would not help much if the shared data had been fabricated to begin with.
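Two of the quantification routes mentioned earlier in this section—simple summary statistics over coded qualitative material, and network-style links and nodes among codes—can be sketched briefly. The posts, content codes, and co-occurrence structure below are invented for illustration; they are not data from the studies cited:

```python
from collections import Counter
from itertools import combinations

# Invented content codes assigned to six hypothetical discussion-group posts.
coded_posts = [
    {"rumor", "health"},
    {"rumor"},
    {"health", "advice"},
    {"rumor", "advice"},
    {"health"},
    {"rumor", "health"},
]

# Simple summary statistics: how often each code appears across posts.
code_counts = Counter(code for post in coded_posts for code in post)

# Network-style quantification: codes are nodes; codes that co-occur in the
# same post form a link, weighted by how many posts contain both.
edge_weights = Counter()
for post in coded_posts:
    for pair in combinations(sorted(post), 2):
        edge_weights[pair] += 1

print(code_counts.most_common())
print(dict(edge_weights))
```

In practice the weighted pair counts would be handed to a network analysis package for visualization of the links and nodes; the point here is only that qualitative codings become countable objects that ordinary summary statistics can describe.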
In the following section, we review the sweep of events that led to the development and growth of restraints for the protection of human subjects and society against unethical research.1 A thread running throughout the discussion is the progression of the APA’s code of conduct for psychological researchers who work with human subjects. We assume that many readers of this Handbook will have had a primary or consulting background in some area of psychology or a related research area. The development of the APA principles gives us a glimpse of the specific impact of legal regulations and societal sensitivities in an area in which human research has been constantly expanding into new contexts, including “field settings and biomedical contexts where research priorities are being integrated with the priorities and interests of nonresearch institutions, community leaders, and diverse populations” (Sales & Folkman, 2000, p. ix). We then depict an idealized risk–benefit approach that review boards have used as an instrument of prior control of research, and we also describe an expanded model focused on the costs and utilities of “doing” and “not doing” research. The model can also be understood in terms of the cost–utility of adopting versus not adopting particular data analytic and reporting practices. We then outline a matrix of general ethical standards crossed with general data analytic and reporting standards as (1) a reminder of the basic distinction between ethical and technical mandates and (2) a framework for thinking about promising opportunities for ethical and substantive rewards in quantitative methodology (cf. Blanck, Bellack, Rosnow, Rotheram-Borus, & Schooler, 1992; Rosenthal, 1994; Rosnow, 1997). We discuss such an opportunity in the context of the way in which a fixation on relative risk (RR) in large sample randomized trials of therapeutic interventions can lead to misconceptions about the practical meaning to patients and health-care providers of the particular intervention tested.

The Shaping of Principles to Satisfy Ethical and Legal Standards

If it can be said that a single historical event in modern times is perhaps most responsible for initially galvanizing changes in the moral landscape of science, then it would be World War II. On December 9, 1946 (the year after the surrender of Germany on May 8, 1945 and the surrender of Japan on August 14, 1945), criminal proceedings against Nazi physicians and administrators who had participated in war crimes and crimes against humanity were presented before a military tribunal in Nuernberg, Germany. For allied atomic scientists, Hiroshima had been an epiphany that vaporized the old iconic image of a morally neutral science. For researchers who work with human participants, the backdrop to the formation of ethical and legal principles to protect the rights and welfare of all research participants was the shocking revelations of the war crimes documented in meticulous detail at the Nuernberg Military Tribunal. Beginning with the German invasion of Poland at the outbreak of World War II, Jews and other ethnic minority inmates of concentration camps had been subjected to sadistic tortures and other barbarities in “medical experiments” by Nazi physicians in the name of science. As methodically described in the multivolume report of the trials, “in every one of the experiments the subjects experienced extreme pain or torture, and in most of them they suffered permanent injury, mutilation, or death” (Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, p. 181). Table 3.1 reprints the principles of the Nuernberg Code, which have resonated to varying degrees in all ensuing codes for biomedical research with human participants as well as having had a generative influence on the development of principles for the conduct of behavioral and social research.

We pick up the story again in the 1960s in the United States, a period punctuated by the shocking assassinations of President John F. Kennedy in 1963 and then of Dr. Martin Luther King, Jr., and Senator Robert F. Kennedy in 1968. The 1960s were also the beginning of the end of what Pattullo (1982) called “the hitherto sacrosanct status” of the human sciences, which moved “into an era of uncommonly active concern for the rights and welfare of segments of the population that had traditionally been neglected or exploited” (p. 375). One highly publicized case in 1963 involved a noted cancer researcher who had injected live cancer cells into elderly, noncancerous patients, “many of whom were not competent to give free, informed consent” (Pattullo, p. 375). In 1966, the U.S. Surgeon General issued a set of regulations governing the use of subjects by researchers whose work was funded by the National Institutes of Health (NIH). Most NIH grants funded biomedical research, but there was also NIH support for research in the behavioral and social sciences. In 1969, following the exposure of further instances in which the welfare of subjects had been ignored or endangered in biomedical research (cf. Beecher, 1966, 1970; Katz,

Table 3.1. The Nuernberg Principles of 1946–1949 for Permissible Medical Experiments∗
1. The voluntary consent of the human subject is absolutely essential.
This means that the person involved should have legal capacity to give consent; should be so situated as to be able to
exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching,
or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the
elements of the subject matter involved as to enable him to make an understanding and enlightened decision. This
latter element requires that before the acceptance of an affirmative decision by the experimental subject there should
be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is
to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person
which may possibly come from his participation in the experiment.
The duty and responsibility for ascertaining the quality of the consent rests upon each individual who initiates,
directs or engages in the experiment. It is a personal duty and responsibility which may not be delegated to another
with impunity.
2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or
means of study, and not random and unnecessary in nature.
3. The experiment should be so designed and based on the results of animal experimentation and a knowledge of the
natural history of the disease or other problem under study that the anticipated results will justify the performance
of the experiment.
4. The experiment should be so conducted as to avoid all unnecessary physical and mental suffering and injury.
5. No experiment should be conducted where there is an a priori reason to believe that death or disabling injury will
occur; except, perhaps, in those experiments where the experimental physicians also serve as subjects.
6. The degree of risk to be taken should never exceed that determined by the humanitarian importance of the problem
to be solved by the experiment.
7. Proper preparations should be made and adequate facilities provided to protect the experimental subject against even
remote possibilities of injury, disability, or death.
8. The experiment should be conducted only by scientifically qualified persons. The highest degree of skill and care
should be required through all stages of the experiment of those who conduct or engage in the experiment.
9. During the course of the experiment the human subject should be at liberty to bring the experiment to an end if he
has reached the physical or mental state where continuation of the experiment seems to him to be impossible.
10. During the course of the experiment the scientist in charge must be prepared to terminate the experiment at any
stage, if he has probable cause to believe, in the exercise of the good faith, superior skill and careful judgment required
of him that a continuation of the experiment is likely to result in injury, disability, or death to the experimental
subject.
∗ Reprinted from pp. 181–182 in Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, October
1946–April 1949, Vol. II. Washington, DC: U.S. Government Printing Office.

1972), the Surgeon General extended the earlier safeguards to all human research. In a notorious case (not made public until 1972), a study conducted by the U.S. Public Health Service (USPHS) simply followed the course of syphilis in more than 400 low-income African-American men residing in Tuskegee, Alabama, from 1932 to 1972 (Jones, 1993). Recruited from churches and clinics with the promise of free medical examinations and free health care, the men who were subjects in this study were never informed they had syphilis but only told they had “bad blood.” They also were not offered penicillin when it was discovered in 1943 and became widely available in the 1950s, and they were warned not to seek treatment elsewhere or they would be dropped from the study. The investigators went so far as to have local doctors promise not to treat the men in the study with antibiotics (Stryker, 1997). As the disease progressed in its predictable course without any treatment, the men experienced damage to
rosnow, rosenthal 35
their skeletal, cardiovascular, and central nervous systems and, in some cases, death. In 1972, the appalling details were finally made public by a lawyer who had been an epidemiologist for the USPHS, and the study was halted (Fairchild & Bayer, 1999). The following year, the Senate Health Subcommittee (chaired by Senator Edward Kennedy) aired the issue of scientific misconduct in public hearings.

The early 1960s was also a period when emotions about invasions of privacy were running high in the United States after a rash of reports of domestic wiretapping and other clandestine activities by federal agencies. In the field of psychology, the morality of the use of deception was being debated. As early as the 1950s, there had been concerned statements issued about the use of deception in social psychological experiments (Vinacke, 1954). The spark that lit a fuse in the 1960s in the field of psychology was the publication of Stanley Milgram’s studies on obedience to authority, in which he had used an elaborate deception and found that a majority of ordinary research subjects were willing to administer an allegedly dangerous level of shock to another person when “ordered” to do so by a person in authority, although no shock was actually administered (cf. Blass, 2004; Milgram, 1963, 1975). Toward the end of the 1960s, there were impassioned pleas by leading psychologists for the ethical codification of practices commonly used in psychological research (Kelman, 1968; Smith, 1969). As there were new methodological considerations and federal regulations since the APA had formulated a professional code of ethics in 1953, a task force was appointed to draft a set of ethical principles for research with human subjects. Table 3.2 shows the final 10 principles adopted by the APA’s Council of Representatives in 1972, which were elucidated in a booklet that was issued the following year, Ethical Principles in the Conduct of Research with Human Participants (APA, 1973). An international survey conducted 1 year later found there were by then two dozen codes of ethics that had been either adopted or were under review by professional organizations of social scientists (Reynolds, 1975). Although violations of such professional codes were supported by penalties such as loss of membership in the organization, the problem was that many researchers engaged in productive, rewarding careers did not belong to these professional organizations.

By the end of the 1970s, the pendulum had swung again, as accountability had become the watchword of the decade (National Commission on Research, 1980). In 1974, the guidelines provided by the Department of Health, Education, and Welfare (DHEW) 3 years earlier were codified as government regulations by the National Research Act of July 12, 1974 (Pub. L. 93–348). Among the requirements instituted by the government regulations was that institutions that received federal funding establish an institutional review board (IRB) for the purpose of making prior assessments of the possible risks and benefits of proposed research.2 This federal act also created the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. Following hearings that were held over a 3-year period, the document called “The Belmont Report” was issued in April, 1979 (available online and also reprinted in Sales & Folkman, 2000). Unlike other reports of the Commission, the Belmont Report did not provide a list of specific recommendations for administrative action by the DHEW, but the Belmont Report recommended that the report be adopted in its entirety as a statement of DHEW policy. In the preamble, the report mentioned the standards set by the Nuernberg (“Nuremberg”) Code as the prototype of many later codes consisting of rules, some general and others specific, to guide researchers and assure that research involving human participants would be carried out in an ethical manner. Noting that the rules were often inadequate to cover complex situations, that they were often difficult to apply or interpret, and that they often came into conflict with one another, the National Commission had decided to issue broad ethical principles to provide a basis on which specific rules could then be formulated, criticized, and interpreted. As we track the development of the APA principles in this discussion, we will see that there has been a similar progression, and later we will emphasize some broad ethical principles when we discuss the interface of ethical and technical standards in quantitative methodology. For now, however, it can be noted that the Belmont Report proposed that (1) respect for persons, (2) beneficence, and (3) justice provide the foundation for research ethics. The report also proposed norms for scientific conduct in six major areas: (1) the use of valid research designs, (2) the competence of researchers, (3) the identification of risk–benefit consequences, (4) the selection of research participants, (5) the importance of obtaining informed voluntary consent, and (6) compensation for injury.3

In 1982, the earlier APA code was updated, and a new version of Ethical Principles in the Conduct of Research with Human Participants was published

36 quantitative methods and ethics
Table 3.2. The Ethical Principles Adopted in December, 1972, by the Council of Representatives of
the American Psychological Association∗
The decision to undertake research rests upon a considered judgment by the individual psychologist
about how best to contribute to psychological science and to human welfare. The responsible psychologist
weighs alternative directions in which personal energies and resources might be invested. Having made
the decision to conduct research, psychologists must carry out their investigation with respect for the
people who participate and with concern for their dignity and welfare. The Principles that follow make
explicit the investigator’s ethical responsibilities toward participants over the course of research, from the
initial decision to pursue a study to the steps necessary to protect the confidentiality of research data.
These Principles should be interpreted in terms of the context provided in the complete document offered
as a supplement to these Principles.
1. In planning a study the investigator has the personal responsibility to make a careful evaluation of
its ethical acceptability, taking into account these Principles for research with human beings. To
the extent that this appraisal, weighing of scientific and humane values, suggests a deviation from
any Principle, the investigator incurs an increasingly serious obligation to seek ethical advice and to
observe more stringent safeguards to protect the rights of the human research participants.
2. Responsibility for the establishment and maintenance of acceptable ethical practice in research always
remains with the individual investigator. The investigator is also responsible for the ethical treatment
of research participants by collaborators, assistants, students, and employees, all of whom, however,
incur parallel obligations.
3. Ethical practice requires the investigator to inform the participant of all features of the research that
reasonably might be expected to influence willingness to participate and to explain all other aspects of
the research about which the participant inquires. Failure to make full disclosure gives added emphasis
to the investigator’s responsibility to protect the welfare and dignity of the research participant.
4. Openness and honesty are essential characteristics of the relationship between investigator and research
participant. When the methodological requirements of a study necessitate concealment or deception,
the investigator is required to ensure the participant’s understanding of the reasons for this action
and to restore the quality of the relationship with the investigator.
5. Ethical research practice requires the investigator to respect the individual’s freedom to decline to
participate in research or to discontinue participation at any time. The obligation to protect this
freedom requires special vigilance when the investigator is in a position of power over the participant.
The decision to limit this freedom increases the investigator’s responsibility to protect the participant’s
dignity and welfare.
6. Ethically acceptable research begins with the establishment of a clear and fair agreement between the
investigator and the research participant that clarifies the responsibilities of each. The investigator
has the obligation to honor all promises and commitments included in that agreement.
7. The ethical investigator protects participants from physical and mental discomfort, harm, and danger.
If the risk of such consequences exists, the investigator is required to inform the participant of that
fact, secure consent before proceeding, and take all possible measures to minimize distress. A research
procedure may not be used if it is likely to cause serious and lasting harm to participants.
8. After the data are collected, ethical practice requires the investigator to provide the participant
with a full clarification of the nature of the study and to remove any misconceptions that may have
arisen. Where scientific or human values justify delaying or withholding information, the investigator
acquires a special responsibility to assure that there are no damaging consequences for the participant.

9. Where research procedures may result in undesirable consequences for the participant, the investigator
has the responsibility to detect and remove or correct these consequences, including, where relevant,
long-term aftereffects.
10. Information obtained about the research participants during the course of an investigation is con-
fidential. When the possibility exists that others may obtain access to such information, ethical
research practice requires that this possibility, together with the plans for protecting confidentiality,
be explained to the participants as a part of the procedure for obtaining informed consent.
∗ Quoted from pp. 1–2 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American
Psychological Association. Copyright © 1973 by the American Psychological Association.

by the APA. In the earlier version and in the 1982 version, the principles were based on actual ethical problems that researchers had experienced, and extensive discussion throughout the profession was incorporated in each edition of Ethical Principles. The principles in the 1982 code are reprinted in Table 3.3. Notice that there were several new terms (subject at risk and subject at minimal risk) and also an addendum sentence to informed consent (referring to “research with children or with participants who have impairments that would limit understanding and/or communication”). The concept of minimal risk (which came out of the Belmont Report) means that the likelihood and extent of harm to the participants are presumed to be no greater than what may be typically experienced in everyday life or in routine physical or psychological examinations (Scott-Jones & Rosnow, 1998, p. 149). In actuality, the extent of harm may not be completely anticipated, and estimating the likelihood of harm is frequently difficult or impossible. Regarding the expanded statement on deception, the use of deception in research had been frowned upon for some years although there had long been instances in which active and passive deceptions had been used routinely. An example was the withholding of information (passive deception). Randomized clinical trials would be considered of dubious value in medical research had the experimenters and the participants not been deprived of information regarding which condition was assigned to each participant. On the other hand, in some areas of behavioral experimentation, the use of deception has been criticized as having “reached a ‘taken-for-granted’ status” (Smith, Kimmel, & Klein, 2009, p. 486).4

Given the precedence of federal (and state) regulations since the guidelines developed by the DHEW were codified by the National Research Act in 1974 (and revised as of November 6, 1975), researchers were perhaps likely to take their ethical cues from the legislated morality and its oversight by IRBs as opposed to the aspirational principles embodied in professional codes, such as the APA code. Another complication in this case is that there was a fractious splintering of the APA in the late-1980s, which resulted in many members resigning from the APA and the creation of the rival American Psychological Society, subsequently renamed the Association for Psychological Science (APS). For a time in the 1990s, a joint task force of the APA and the APS attempted to draft a revised ethics code, but the APS then withdrew its participation following an apparently irresolvable disagreement. In 2002, after a 5-year revision process, APA adopted a reworked ethics code that emphasized the five general principles defined (by APA) in Table 3.4 and also “specific standards” that fleshed out these principles.5 The tenor of the final document was apparently intended to reflect the remaining majority constituency of the APA (practitioners) but also the residual constituency of psychological scientists who perform either quantitative or qualitative research in fundamental and applied contexts. Of the specific principles with some relevance to data analysis or quantitative methods, there were broadly stated recommendations such as sharing the research data for verification by others (Section 8.14), not making deceptive or false statements (Section 8.10), using valid and reliable instruments (Section 9.02), and drawing on current knowledge for design, standardization, validation, and the reduction or elimination of bias when constructing any psychometric instruments (Section 9.05). We turn next to the risk–benefit process, but we should also note that ethical values with relevance to statistical practices are embodied in the codes developed by statistical organizations (e.g., American Statistical Association, 1999; see also Panter & Sterba, 2011).

Table 3.3. The Revised Ethical Principles Adopted in August, 1982, by the Council of Representatives of the
American Psychological Association for Research with Human Participants∗
The decision to undertake research rests upon a considered judgment by the individual psychologist about how best to
contribute to psychological science and human welfare. Having made the decision to conduct research, the psychologist
considers alternative directions in which research energies and resources might be invested. On the basis of this consid-
eration, the psychologist carries out the investigation with respect and concern for the dignity and welfare of the people
who participate and with cognizance of federal and state regulations and professional standards governing the conduct
of research with human participants.
A. In planning a study, the investigator has the responsibility to make a careful evaluation of its ethical acceptability. To
the extent that the weighing of scientific and human values suggests a compromise of any principle, the investigator
incurs a correspondingly serious obligation to seek ethical advice and to observe stringent safeguards to protect the
rights of human participants.
B. Considering whether a participant in a planned study will be a “subject at risk” or a “subject at minimal risk,” according
to recognized standards, is of primary ethical concern to the investigator.
C. The investigator always retains the responsibility for ensuring ethical practice in research. The researcher is also
responsible for the ethical treatment of research participants by collaborators, assistants, students, and employees, all
of whom, however, incur similar obligations.
D. Except in minimal-risk research, the investigator establishes a clear and fair agreement with research participants, prior
to their participation, that clarifies the obligations and responsibilities of each. The investigator has the obligation
to honor all promises and commitments included in that agreement. The investigator informs the participants of all
aspects of the research that might reasonably be expected to influence willingness to participate and explains all other
aspects of the research about which the participants inquire. Failure to make full disclosure prior to obtaining informed
consent requires additional safeguards to protect the welfare and dignity of the research participants. Research with
children or with participants who have impairments that would limit understanding and/or communication requires
special safeguarding procedures.
E. Methodological requirements of a study may make the use of concealment or deception necessary. Before conducting
such a study, the investigator has a special responsibility to (1) determine whether the use of such techniques is justified
by the study’s prospective scientific, educational, or applied value; (2) determine whether alternative procedures are
available that do not use concealment or deception; and (3) ensure that the participants are provided with sufficient
explanation as soon as possible.
F. The investigator respects the individual’s freedom to decline to participate in or to withdraw from the research at any
time. The obligation to protect this freedom requires careful thought and consideration when the investigator is in
a position of authority or influence over the participant. Such positions of authority include, but are not limited to,
situations in which research participation is required as part of employment or in which the participant is a student,
client, or employee of the investigator.
G. The investigator protects the participant from physical and mental discomfort, harm, and danger that may arise
from research procedures. If risks of such consequences exist, the investigator informs the participant of that fact.
Research procedures likely to cause serious or lasting harm to a participant are not used unless the failure to use
these procedures might expose the participant to risk of greater harm or unless the research has great potential benefit
and fully informed and voluntary consent is obtained from each participant. The participant should be informed
of procedures for contacting the investigator within a reasonable time period following participation should stress,
potential harm, or related questions or concerns arise.
H. After the data are collected, the investigator provides the participant with information about the nature of the study
and attempts to remove any misconceptions that may have arisen. Where scientific or human values justify delaying
or withholding this information, the investigator incurs a special responsibility to monitor the research and to ensure
that there are no damaging consequences for the participant.

I. Where research procedures result in undesirable consequences for the individual participant, the investigator has the
responsibility to detect and remove or correct these consequences, including long-term effects.
J. Information obtained about a research participant during the course of an investigation is confidential unless otherwise
agreed upon in advance. When the possibility exists that others may obtain access to such information, this possibility,
together with the plans for protecting confidentiality, is explained to the participant as part of the procedure for
obtaining informed consent.
∗ Quoted from pp. 5–7 in Ethical Principles in the Conduct of Research with Human Participants. Washington, DC: American Psychological Association. Copyright © 1982 by the American Psychological Association.

Expanding the Calculation of Risks and Benefits

After the Belmont Report, it seemed that everything changed permanently for scientists engaged in human subject research, and it made little difference whether they were engaged in biomedical, behavioral, or social research. As the philosopher John E. Atwell (1981) put it, the moral dilemma was to defend the justification of using human subjects as the means to an end that was beneficial in some profoundly significant way (e.g., the progression of science, public health, or public policy) while protecting the moral “ideals of human dignity, respect for persons, freedom and self-determination, and a sense of personal worth” (p. 89). Review boards were now delegated the responsibility of making prior assessments of the future consequences of proposed research on the basis of the probability that a certain magnitude of psychological, physical, legal, social, or economic harm might result, weighed against the likelihood that “something of positive value to health or welfare” might result. Quoting the Belmont Report, “risk is properly contrasted to probability of benefits, and benefits are properly contrasted with harms rather than risks of harms,” where the “risks and benefits of research may affect the individual subjects, the families of the individual subjects, and society at large (or special groups of subjects in society).” The moral calculus of benefits to risks was said to be “in a favorable ratio” when the anticipated risks were outweighed by the anticipated benefits to the subjects (assuming this was applicable) and the anticipated benefit to society in the form of the advancement of knowledge. Put into practice, however, researchers and members of review boards found it difficult to “exorcize the devil from the details” when challenged by ethical guidelines that frequently conflicted with traditional technical criteria (Mark, Eyssell, & Campbell, 1999, p. 48). As human beings are not omniscient, there was also the problem that “neither the risks nor the benefits . . . can be perfectly known in advance” (Mark et al., 1999, p. 49).

These complications notwithstanding, another catch-22 of the risk–benefit assessment is that it focuses only on the doing of research. Some years ago, we proposed a way of visualizing this predicament—first, in terms of an idealized representation of the risk–benefit assessment and, second, in terms of an alternative model focused on the costs and benefits of both the doing and not doing of research (Rosenthal & Rosnow, 1984). The latter model also has implications for the risk–benefit (we prefer the term cost–utility) of using or not using particular quantitative methods (we return to this idea in a moment). First, however, Figure 3.1 shows an idealized representation of the traditional risk–benefit assessment. Risk (importance or probability of harm) is plotted from low (C) to high (A) on the vertical axis, and the benefit is plotted from low (C) to high (D) on the horizontal axis. In other words, studies in which the risk–benefit assessment is close to A would presumably be less likely to be approved; studies close to D would be more likely to be approved; and studies falling along the B–C “diagonal of indecision” exist in a limbo of uncertainty until relevant information nudges the assessment to either side of the diagonal. The idea of “zero risk” is a methodological conceit, however, because all human subject research can be understood as carrying some degree of risk. The potential risk in the most benign behavioral and social research, for example, is the “danger of violating someone’s basic rights, if only the right of privacy” (Atwell, 1981, p. 89). However, the fundamental problem of the traditional model represented in Figure 3.1 is that it runs the risk of ignoring the “not doing of research.” Put another way, there are also moral costs when potentially useful research is forestalled, or if the design or implementation is

Table 3.4. General Principles Adopted in 2003 by the American Psychological Association∗
General Principles
General Principles, as opposed to Ethical Standards, are aspirational in nature. Their intent is to guide and inspire
psychologists toward the very highest ethical ideals of the profession. General Principles, in contrast to Ethical Standards,
do not represent obligations and should not form the basis for imposing sanctions. Relying upon General Principles for
either of these reasons distorts both their meaning and purpose.
Principle A: Beneficence and Nonmaleficence
Psychologists strive to benefit those with whom they work and take care to do no harm. In their professional actions,
psychologists seek to safeguard the welfare and rights of those with whom they interact professionally and other affected
persons, and the welfare of animal subjects of research. When conflicts occur among psychologists’ obligations or
concerns, they attempt to resolve these conflicts in a responsible fashion that avoids or minimizes harm. Because
psychologists’ scientific and professional judgments and actions may affect the lives of others, they are alert to and
guard against personal, financial, social, organizational, or political factors that might lead to misuse of their influence.
Psychologists strive to be aware of the possible effect of their own physical and mental health on their ability to help
those with whom they work.
Principle B: Fidelity and Responsibility
Psychologists establish relationships of trust with those with whom they work. They are aware of their professional
and scientific responsibilities to society and to the specific communities in which they work. Psychologists uphold
professional standards of conduct, clarify their professional roles and obligations, accept appropriate responsibility for
their behavior, and seek to manage conflicts of interest that could lead to exploitation or harm. Psychologists consult
with, refer to, or cooperate with other professionals and institutions to the extent needed to serve the best interests
of those with whom they work. They are concerned about the ethical compliance of their colleagues’ scientific and
professional conduct. Psychologists strive to contribute a portion of their professional time for little or no compensation
or personal advantage.
Principle C: Integrity
Psychologists seek to promote accuracy, honesty, and truthfulness in the science, teaching, and practice of psychology.
In these activities psychologists do not steal, cheat, or engage in fraud, subterfuge, or intentional misrepresentation of
fact. Psychologists strive to keep their promises and to avoid unwise or unclear commitments. In situations in which
deception may be ethically justifiable to maximize benefits and minimize harm, psychologists have a serious obligation
to consider the need for, the possible consequences of, and their responsibility to correct any resulting mistrust or other
harmful effects that arise from the use of such techniques.
Principle D: Justice
Psychologists recognize that fairness and justice entitle all persons to access to and benefit from the contributions of
psychology and to equal quality in the processes, procedures, and services being conducted by psychologists. Psychol-
ogists exercise reasonable judgment and take precautions to ensure that their potential biases, the boundaries of their
competence, and the limitations of their expertise do not lead to or condone unjust practices.
Principle E: Respect for People’s Rights and Dignity
Psychologists respect the dignity and worth of all people, and the rights of individuals to privacy, confidentiality, and
self-determination. Psychologists are aware that special safeguards may be necessary to protect the rights and welfare
of persons or communities whose vulnerabilities impair autonomous decision making. Psychologists are aware of and
respect cultural, individual, and role differences, including those based on age, gender, gender identity, race, ethnicity,
culture, national origin, religion, sexual orientation, disability, language, and socioeconomic status and consider these
factors when working with members of such groups. Psychologists try to eliminate the effect on their work of biases based
on those factors, and they do not knowingly participate in or condone activities of others based upon such prejudices.
∗ Quoted from the American Psychological Association’s Ethical Principles of Psychologists and Code of Conduct (http://www.apa.org/
ethics/code2002.html). Effective date June 1, 2003, copyrighted in 2002 by the American Psychological Association.

compromised in a way that jeopardizes the integrity of the research (cf. Haywood, 1976).

Figure 3.2 shows an alternative representing a cost–utility assessment of both the doing and not doing of research. In Part A, the decision plane model on the left corresponds to a cost–utility appraisal of the “doing of research,” and the model on the right corresponds to an appraisal of the “not doing of research.” We use the terms cost and utility each in a collective sense. That is, the cost of doing and the cost of not doing a particular research study include more than only the risk of psychological or

physical harm; they also include the cost to society, funding agencies, and to scientific knowledge when imagination and new scientifically based solutions are stifled. As one scientist observed, “Scientists know that questions are not settled; rather, they are given provisional answers for which it is contingent upon the imagination of followers to find more illuminating solutions” (Baltimore, 1997, p. 8). We also use utility in a collective sense, not just in the way that a “tool” can immediately be instrumentally useful, but in a way that may have no immediate application and instead “speaks to our sense of wonder and paves the way for future advances” (Committee on Science, Engineering, and Public Policy, 2009, p. 3).

Figure 3.1 Idealized decision-plane model representing the relative risks and benefits of research submitted to a review board for prior approval (after Rosenthal & Rosnow, 1984; Rosnow & Rosenthal, 1997). [The figure plots risk of doing, low to high, on the vertical axis against benefit of doing, low to high, on the horizontal axis; the corners are labeled A (high risk, low benefit), B (high risk, high benefit), C (low risk, low benefit), and D (low risk, high benefit), with the B–C “diagonal of indecision” between them.]

These figurative definitions of cost and utility aside, Part B of Figure 3.2 suggests a way of transforming the three dimensions of Part A to a two-dimensional model. Suppose an A–D “decision diagonal” for each of the decision planes

Figure 3.2 Decision-planes representing the ethical assessment of the costs and utilities of doing and not doing research (after Rosenthal & Rosnow, 1984, 2008). (A) Costs and utilities of doing (left plane) and not doing (right plane) research. (B) Composite plane representing both cases in Part A. [In Part A, each plane plots cost, low to high, on the vertical axis against utility, low to high, on the horizontal axis. The doing plane has corners A, B, C, D, with “don’t do” toward A and “do” toward D; the not-doing plane has corners A’, B’, C’, D’, with those labels reversed; each plane carries a diagonal of indecision. In Part B, the decision diagonal A–D of the doing plane forms the horizontal axis (“decision to do”) and the decision diagonal A’–D’ of the not-doing plane forms the vertical axis (“decision not to do”).]
in Part A (in contrast to B–C and B′–C′, the diagonals of indecision). For any point in the plane of doing, there would be a location on the cost axis and on the utility axis, where any point could be translated to an equivalent position on the decision diagonal. Thus, if a point were twice as far from A as from D, the transformed point would then be located two-thirds of the way on the decision diagonal A–D (closer to D than to A). Similar reasoning is applicable to not doing, with the exception that closeness to A would mean “do” rather than “not do.” Points near D tell us the research should be done, and points near D′ tell us the research should not be done.6

Figure 3.2 can also be a way of thinking about cost–utility dilemmas regarding quantitative methods and statistical reporting practices. In the 2009 edition of the U.S. National Academy of Sciences (NAS) guide to responsible conduct in scientific research, there are several hypothetical scenarios, including one in which a pair of researchers (a postdoctoral and a graduate student) discuss how they should deal with two anomalous data points in a graph they are preparing to present in a talk (Committee on Science, Engineering, and Public Policy, 2009). They want to put the best face on their research, but they fear that discussing the two outliers will draw people’s attention away from the bulk of the data. One option would be to drop the outliers, but, as one researcher cautions, this could be viewed as “manipulating” the data, which is unethical. The other person comments that if they include the anomalous points, and if a senior person then advises them to include the anomalous data in a paper they are drafting for publication, this could make it harder to have the paper accepted by a top journal. That is, the reported results will not be unequivocal (a potential reason for rejection), and the paper will also then be too wordy (another reason to reject it?). In terms of Figure 3.2, not including the two anomalous data points is analogous to the “not doing of research.” There are, of course, additional statistical options, which can also be framed in cost–utility terms, such as using a suitable transformation to pull in the outlying stragglers and make them part of the group (cf. Rosenthal & Rosnow, 2008, pp. 310–311). On the other hand, outliers that are not merely recording errors or instrument errors can sometimes provide a clue as to a plausible moderator variable. Suppressing this information could potentially impede scientific progress (cf. Committee on Science, Engineering, and Public Policy, 2009, p. 8).

Unfortunately, there are also cases involving the suppression of data where the cost is not only that it impedes progress in the field, but it also undermines the authority and trustworthiness of scientific research and, in some instances, can cause harm to the broader society, such as when public policy is based on only partial information or when there is selective outcome reporting of the efficacy of clinical interventions in published reports of randomized trials (Turner, Matthews, Linardatos, Tell, & Rosenthal, 2008; Vedula, Bero, Scherer, & Dickersin, 2009). In an editorial in Science, Cicerone (2010), then president of the NAS, stated that his impression—based on information from scattered public opinion polls and various assessments of leaders in science, business, and government—was that “public opinion has moved toward the view that scientists often try to suppress alternative hypotheses and ideas and that scientists will withhold data and try to manipulate some aspects of peer review to prevent dissent” (p. 624). Spielmans and Parry (2010) described a number of instances of “marketing-based medicine” by pharmaceutical firms. Cases included the “cherry-picking” of data for publication, the suppression or understatement of negative results, and the publication (and distribution to doctors) of journal articles that were not written by the academic authors who lent their names, titles, and purported independence to the papers but instead had been written by ghost writers hired by pharmaceutical and medical-device firms to promote company products. Spielmans and Parry displayed a number of screen shots of company e-mails, which we do not usually get to see because they go on behind the curtain. In an editorial in PLoS Medicine (2009) lamenting the problem of ghost writers and morally dubious practices in the medical marketing of pharmaceutics, the editors wrote:

How did we get to the point that falsifying the medical literature is acceptable? How did an industry whose products have contributed to astounding advances in global health over the past several decades come to accept such practices as the norm? Whatever the reasons, as the pipeline for new drugs dries up and companies increasingly scramble for an ever-diminishing proportion of the market in “me-too” drugs, the medical publishing and pharmaceutical industries and the medical community have become locked into a cycle of mutual dependency, in which truth and a lack of bias have come to be seen as optional extras. Medical journal editors need to decide whether they want to

rosnow, rosenthal 43
roll over and just join the marketing departments of pharmaceutical companies. Authors who put their names to such papers need to consider whether doing so is more important than having a medical literature that can be believed in. Politicians need to consider the harm done by an environment that incites companies into insane races for profit rather than for medical need. And companies need to consider whether the arms race they have started will in the end benefit anyone. After all, even drug company employees get sick; do they trust ghost authors?

Ethical Standards and Quantitative Methodological Standards
We turn now to Table 3.5, which shows a matrix of general ethical standards crossed with quantitative methodological standards (after Rosnow & Rosenthal, 2011). We do not claim that the row and column standards are either exhaustive or mutually exclusive but only that they are broadly representative of (1) aspirational ideals in the society as a whole and (2) methodological, data analytic, and reporting standards in science and technology. The matrix is a convenient way of reminding ourselves of the distinction between (1) and (2), and it is also a way of visualizing a potential clash between (1) and (2) and, frequently, the opportunity to exploit this situation in a way that could have rewarding ethical and scientific implications. Before we turn specifically to the definitions of the row and column headings in Table 3.5, we will give a quick example of what we mean by “rewarding ethical and scientific implications” in the context of the recruitment of volunteers. For this example, we draw on some of our earlier work on specific threats to validity (collectively described as artifacts) deriving from the volunteer status of the participants for research participation. Among our concerns when we began to study the volunteer was that ethical sensitivities seemed to be propelling psychological science into a science of informed volunteers (e.g., Rosenthal & Rosnow, 1969; Rosnow & Rosenthal, 1970). It was long suspected that people who volunteered for behavioral and social research might not be fully adequate models for the study of behavior in general. To the extent that volunteers differ from nonvolunteers on dimensions of importance, the use of volunteers could have serious effects on such estimated parameters as means, medians, proportions, variances, skewness, and kurtosis. The estimation of parameters such as these is the principal goal in survey research, whereas in experimental research the focus is usually on the magnitude of the difference between the experimental and control group means. Such differences, we and other investigators observed, were sometimes affected by the use of volunteers (Rosenthal & Rosnow, 1975, 2009).

With problems such as these serving as beginning points for empirical and meta-analytic investigations, we explored the characteristics that differentiated volunteers and nonvolunteers, the situational determinants of volunteering, some possible interactions of volunteer status with particular treatment effects, the implications for predicting the direction and, sometimes, the magnitude of the biasing effects in research situations, and we also thought about the broader ethical implications of these findings (Rosenthal & Rosnow, 1975; Rosnow & Rosenthal, 1997). For example, in one aspect of our meta-analytic inquiry, we put the following question to the research literature: What are the variables that tend to increase or decrease the rates of volunteering obtained? Our preliminary answers to this question may have implications for both the theory and practice of behavioral science. That is, if we continue to learn more about the situational determinants of volunteering, we can learn more about the social psychology of social influence processes. Methodologically, once we learn more about the situational determinants of volunteering, we should be in a better position to reduce the bias in our samples that derives from the volunteer subjects being systematically different from nonvolunteers in a variety of characteristics. For example, one situational correlate was that the more important the research was perceived, the more likely people were to volunteer for it. Thus, mentioning the importance of the research during the recruitment phase might coax more of the “nonvolunteers” into the sampling pool. It would be unethical to exaggerate or misrepresent the importance of the research. By being honest, transparent, and informative, we are treating people with respect and also giving them a well-founded justification for asking them to volunteer their valuable time, attention, and cooperation. In sum, the five column headings of Table 3.5 frequently come precorrelated in the real world of research, often with implications for the principles in the row headings of the table.

Turning more specifically to the row headings in Table 3.5, rows A, B, C, and E reiterate the three “basic ethical principles” in the Belmont Report, which were described there as respect for persons, beneficence, and justice. Beneficence (the ethical

Table 3.5. General Ethical Standards Crossed with Quantitative Methodology Standards (after Rosnow &
Rosenthal, 2011)
Quantitative methodological standards

Ethical standards 1. Transparency 2. Informativeness 3. Precision 4. Accuracy 5. Groundedness

A. Beneficence

B. Nonmaleficence

C. Justice

D. Integrity

E. Respect

ideal of “doing good”) was conflated with the principle (b) of nonmaleficence (“not doing harm”), and the two were also portrayed as obligations assimilating two complementary responsibilities: (1) do not harm and (2) maximize possible benefits and minimize possible harms. Next in Table 3.5 is justice, by which we mean a sense of “fairness in distribution” or “what is deserved” (quoting from the Belmont Report). As the Belmont Report went on to explain: “Injustice occurs when some benefit to which a person is entitled is denied without good reason or when some burden is imposed unduly.” Conceding that “what is equal?” and “what is unequal?” are often complex, highly nuanced questions in a specific research situation (just as they are when questions of justice are associated with social practices, such as punishment, taxation, and political representation), justice was nonetheless considered a basic moral precept relevant to the ethics of research involving human subjects. Next in Table 3.5 is integrity, an ethical standard that was not distinctly differentiated in the Belmont Report but that was discussed in detail in the NAS guide (Committee on Science, Engineering, and Public Policy, 2009). Integrity implies honesty and truthfulness; it also implies a prudent use of research funding and other resources and, of course, the disclosure of any conflicts of interest, financial or otherwise, so as not to betray public trust. Finally, respect was described in the Belmont Report as assimilating two obligations: “first, that individuals should be treated as autonomous agents, and second, that persons with diminished autonomy are entitled to protection.” In the current APA code, respect is equated with civil liberties—that is, privacy, confidentiality, and self-determination.

Inspecting the column headings in Table 3.5, first by transparency, we mean here that the quantitative results are presented in an open, frank, and candid way, that any technical language used is clear and appropriate, and that visual displays do not obfuscate the data but instead are as crystal clear as possible. Elements of graphic design are explained and illustrated in a number of very useful books and articles, particularly the work of Tufte (1983, 1990, 2006) and Wainer (1984, 1996, 2000, 2009; Wainer & Thissen, 1981), and there is a burgeoning literature in every area of science on the visual display of quantitative data. Second, by informativeness, we mean that there is enough information reported to enable readers to make up their own minds on the basis of the primary results and enough to enable others to re-analyze the summary results for themselves. The development of meta-analysis, with emphasis on effect sizes and moderator variables, has stimulated ways of recreating summary data sets and vital effect size information, often from minimal raw ingredients. Third, the term precision is used not in a statistical sense (the likely spread of estimates of a parameter) but rather in a more general sense to mean that quantitative results should be reported to the degree of exactitude required by the given situation. For example, reporting the average scores on an attitude questionnaire to a high number of decimal places is psychologically meaningless (false precision), and reporting the weight of mouse subjects to six decimal places is pointless (needless precision). Fourth, accuracy means that a conscientious effort is made to identify and correct mistakes in measurements, calculations, and the reporting of numbers. Accuracy also means not exaggerating results by, for example, making false claims that applications of the results are unlikely to achieve. Fifth, groundedness implies that the method of choice is appropriate to the question of interest, as opposed to using whatever is fashionable or having

a computer program repackage the data in a one-size-fits-all conceptual framework. The methods we choose must be justifiable on more than just the grounds that they are what we were taught in graduate school, or that “this is what everyone else does” (cf. Cohen, 1990, 1994; Rosnow & Rosenthal, 1995, 1996; Zuckerman, Hodgins, Zuckerman, & Rosenthal, 1993).

Clinical Significance and the Consequences of Statistical Illiteracy
To bring this discussion of quantitative methods and ethics full circle, we turn finally to a problem that has been variously described as innumeracy (Paulos, 1990) and statistical illiteracy. The terms are used to connote a lack of knowledge or understanding of the meaning of numbers, statistical concepts, or the numeric expression of summary statistics. As the authors of a popular book, The Numbers Game, put it: “Numbers now saturate the news, politics, life. . . . For good or for evil, they are today’s preeminent public language—and those who speak it rule” (Blastland & Dilnot, 2009, p. x). To be sure, even people who are most literate in the language of numbers are prone to wishful thinking and fearful thinking and, therefore, sometimes susceptible to those who use numbers and gimmicks to sway, influence, or even trick people. The mathematician who coined the term innumeracy told of how his vulnerability to whim “entrained a series of ill-fated investment decisions,” which he still found “excruciating to recall” (Paulos, 2003, p. 1). The launching point for the remainder of our discussion was an editorial in a medical journal several years ago, in which the writers of the editorial lamented “the premature dissemination of research and the exaggeration of medical research findings” (Schwartz & Woloshin, 2003, p. 153). A large part of the problem is an emphasis on RR statistics that hook general readers into making unwarranted assumptions, a problem that may often begin with researchers, funders, and journals that “court media attention through press releases” (Woloshin, Schwartz, Casella, Kennedy, & Larson, 2009, p. 613). Confusion about risk and risk statistics is not limited to the general public (cf. Prasad, Jaeschke, Wyer, Keitz, & Guyatt, 2008), but it is the susceptible public (Carling, Kristoffersen, Herrin, Treweek, Oxman, Schünemann, Akl, & Montori, 2008) that must ultimately pay the price of the accelerating costs of that confusion. Stirring the concept of statistical significance into this mix can frequently produce a truly astonishing amount of confusion. For example, writing in the Journal of the National Cancer Institute, Miller (2007) mentioned that many doctors equate the level of statistical significance of cancer data with the “degree of improvement a new treatment must make for it to be clinically meaningful” (p. 1832).7

In the space remaining, we concentrate on misconceptions and illusions regarding the concepts of RR and statistical significance when the clinical significance of interventions is appraised through the lens of these concepts in randomized clinical trials (RCTs). As a case in point, a highly cited report on the management of depression, a report that was issued by the National Institute for Health and Clinical Excellence (NICE), used RR of 0.80 or less as a threshold indicator of clinical significance in RCTs with dichotomous outcomes and statistically significant results.8 We use the term clinical significance here in the way that it was defined in an authoritative medical glossary, although we recognize that it is a hypothetical construct laden with surplus meaning as well (cf. Jacobson & Truax, 1991). In the glossary, clinical significance was taken to mean that “an intervention has an effect that is of practical meaning to patients and health care providers” (NICHSR, 2010; cf. Jeans, 1992; Kazdin, 1977, 2008). By intervention, we mean a treatment or involvement such as a vaccine used in a public health immunization program to try to eradicate a preventable disease (e.g., the Salk poliomyelitis vaccine), or a drug that can be prescribed for a patient in the doctor’s office, or an over-the-counter medicine (e.g., aspirin) used to reduce pain or lessen the risk of an adverse event (e.g., heart attack), or a medication and/or psychotherapy to treat depression. By tradition, RCTs are the gold standard in evidence-based medicine when the goal is to appraise the clinical significance of interventions in a carefully controlled scientific manner. Claims contradicted by RCTs are not always immediately rejected in evidence-based medicine, as it has been noted that some “claims from highly cited observational studies persist and continue to be supported in the medical literature despite strong contradictory evidence from randomized trials” (Tatsioni, Bonitsis, & Ioannidis, 2007). Of course, just as gold can fluctuate in value, so can conclusions based on the belief that statistical significance is a proxy for clinical significance, or when it is believed that given statistical significance, clinical significance is achieved only if the reduction in RR reaches some arbitrary fixed magnitude (recall, for example, NICE, 2004). The challenge

Table 3.6. Final results for myocardial infarction (MI) and hemorrhagic stroke (HS) for the aspirin (325 mg every other day) component of the Physicians’ Health Study (Steering Committee of the Physicians’ Health Study Research Group, 1989). The increase in relative risk (RRI) for HS was more than twice the reduction in relative risk (RRR) for MI. Having one more case of HS in the aspirin group would have yielded a chi-square significant at p < 0.05, RR = 2.0, and RRI = 100%. In the combined samples, the event rate of MI (378/22,071 = 0.0171, or 1.71%) exceeded the event rate of HS (35/22,071 = 0.0016, or 0.16%) by a ratio of about 10:1, and a difference of 1.71% − 0.16% = 1.55%. In the hemorrhagic stroke subtable, RRI is the relative risk increase, computed as RRR (see Table 3.7), but indicated as RRI when the treatment increases the risk of the adverse outcome.

Myocardial infarction (heart attack):
            MI     No MI     Total
Aspirin     139    10,898    11,037
Placebo     239    10,795    11,034
Total       378    21,693    22,071
Chi-square = 26.9, p = 2.1 × 10⁻⁷; RR = 0.58; RRR = 42%; r(phi) = 0.035

Hemorrhagic stroke:
            HS     No HS     Total
Aspirin     23     11,014    11,037
Placebo     12     11,022    11,034
Total       35     22,036    22,071
Chi-square = 3.46, p = 0.063; RR = 1.92; RRI = 92%; r(phi) = –0.013
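The RR, RRR/RRI, and chi-square values reported in Table 3.6 can be recomputed directly from the 2 × 2 counts. A minimal sketch in Python (the function names are ours, not from the original sources):

```python
# Reproduce the relative-risk indices in Table 3.6 from the 2x2 counts.
def relative_risk(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)]: treated counts (a = adverse, b = not)
    over control counts (c = adverse, d = not)."""
    return (a / (a + b)) / (c / (c + d))

def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 table: N(AD - BC)^2 over the product
    of the row and column totals."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Myocardial infarction: aspirin 139/10,898 vs. placebo 239/10,795.
rr_mi = relative_risk(139, 10898, 239, 10795)
print(round(rr_mi, 2), f"RRR = {round((1 - rr_mi) * 100)}%")  # 0.58 RRR = 42%
print(round(chi_square(139, 10898, 239, 10795), 1))           # 26.9

# Hemorrhagic stroke: aspirin 23/11,014 vs. placebo 12/11,022.
rr_hs = relative_risk(23, 11014, 12, 11022)
print(round(rr_hs, 2), f"RRI = {round((rr_hs - 1) * 100)}%")  # 1.92 RRI = 92%
```

An RR below 1.0 converts to a relative risk reduction of (1 − RR) × 100%, and an RR above 1.0 to a relative risk increase of (RR − 1) × 100%, which is how the 42% and 92% figures in the table arise.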

is to reverse the accelerating cost curve of statistical illiteracy in an area that affects us all (see, for example, Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2008).

Table 3.6 helps us illustrate the folly of a delicate balancing act that is sometimes required between statistical significance and RR. The table shows a portion of the results from the aspirin component of a highly cited double-blind, placebo-controlled, randomized trial to test whether 325 milligrams of aspirin every other day reduces the mortality from cardiovascular disease and whether beta-carotene decreases the incidence of cancer (Steering Committee of the Physicians’ Health Study Research Group, 1989). The aspirin component of the study was terminated earlier than planned on finding “a statistically significant, 44 [sic] percent reduction in the risk of myocardial infarction for both fatal and nonfatal events . . . [although] there continued to be an apparent but not significantly increased risk of stroke” (p. 132). RR (for relative risk) refers to the ratio of the incidence rate of the adverse event (the illness) in the treated sample to the control sample; RRR is the relative risk reduction; and RRI is the relative risk increase (the computation of these indices is described in Table 3.7). When tables of independent counts are set up as shown in Tables 3.6 and 3.7, an RR less than 1.0 indicates that the treated sample fared better than the control sample (thereby implying RRR), and an RR greater than 1.0 indicates the treated sample did more poorly than the control (thereby implying RRI). Observe that the “slightly increased risk of stroke” (RRI = 92%) was actually more than twice the reduction in risk of heart attack (RRR = 42%)! Suppose the study had continued, and one more case of stroke had turned up in the aspirin group. The p-value would have reached the 0.05 level, and the researchers might have arrived at a different conclusion, possibly that the benefit with respect to heart attack was more than offset by the increased risk in stroke. Apparently, a p-value only a hair’s-breadth greater than 0.05 can trump an RR increase of 92%. On the other hand, the event rate of stroke in the study as a whole was only 0.16%, less than one-tenth the magnitude of the event rate of 1.7% of heart attack in the study as a whole.9 However, we would never know this from the RR alone.

The fact is that RR statements are oblivious to event rates in the total N. To give a quick example, suppose in a study with 100 people each in the treated and control samples that 1 treated person and 5 untreated people (controls) became ill. RR and RRR would be 0.20 and 80%, respectively. Stating there was an 80% reduction in risk of the adverse event conveys hope. However, suppose we increase each sample size to 1,000 but still assume 1 case of illness in the treated sample and 5 cases

Table 3.7. Six studies each with total sample size (N) of 2,000 and 1% event rates in Studies 1 and 4 (20 cases out of 2,000), 25% event rates in Studies 2 and 5 (500 cases out of 2,000), and 50% event rates in Studies 3 and 6 (1,000 cases out of 2,000). RR, the relative risk or risk ratio, indicates the ratio of the level of risk in the treated group to the level of risk in the control group. With cells labeled A, B, C, D from upper left (A) to upper right (B), to lower left (C), to lower right (D), RR = [A/(A+B)]/[C/(C+D)], where RR < 1.0 favors the treatment effect (risk reduction) and RR > 1.0 favors the control effect (risk increase). OR, the odds ratio, also called relative odds or the cross-product ratio, is the ratio of A/B to C/D, or the cross-product AD/BC. RRR, the relative risk reduction, is the reduction in risk of the adverse outcome (e.g., illness) in the treated sample relative to the control, which is indicated as a percentage by dividing RD (defined next) by [C/(C+D)] and then multiplying by 100. RD, the risk difference, also called the absolute risk reduction (ARR), is the reduction in risk of the particular adverse outcome (e.g., cancer, heart attack, stroke) in the treated group compared with the level of baseline risk in the control—that is, [A/(A+B)]–[C/(C+D)]. Multiplying RD (or ARR) times 10,000 estimates the number of people in a group of 10,000 that are predicted to benefit from the treatment. NNT = 1/RD = 1/ARR, is the number needed to treat to prevent a single case of the particular adverse outcome.

Studies 1–3 (N = 2,000 each; adverse outcome rates of 1%, 25%, and 50%):

                      Study 1 (1%)         Study 2 (25%)        Study 3 (50%)
Condition             Yes      No          Yes      No          Yes      No
Treatment             1        999         25       975         50       950
Control               19       981         475      525         950      50
Chi-square            16.4 (p < 0.0001)    540.0 (p < 0.0001)   1,620.0 (p < 0.0001)
RR                    0.05                 0.05                 0.05
OR                    0.05                 0.03                 0.003
RRR                   94.7%                94.7%                94.7%
r(phi)                0.090                0.52                 0.90
RD = ARR              0.018                0.45                 0.90
NNT = 1/ARR = 1/RD    55.6                 2.2                  1.1
ARR(10,000)           180                  4,500                9,000

Studies 4–6 (N = 2,000 each; adverse outcome rates of 1%, 25%, and 50%):

                      Study 4 (1%)         Study 5 (25%)        Study 6 (50%)
Condition             Yes      No          Yes      No          Yes      No
Treatment             9        991         225      775         450      550
Control               11       989         275      725         550      450
Chi-square            0.2 (p = 0.89)       6.7 (p = 0.01)       20.0 (p < 0.0001)
RR                    0.82                 0.82                 0.82
OR                    0.82                 0.77                 0.67
RRR                   18.2%                18.2%                18.2%
r(phi)                0.01                 0.06                 0.10
RD = ARR              0.002                0.05                 0.10
NNT = 1/ARR = 1/RD    500                  20                   10
ARR(10,000)           20                   500                  1,000
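The formulas in the caption of Table 3.7 translate directly into code. A sketch that reproduces the Study 1 column (function and variable names are ours; signs follow the table’s convention of reporting risk reduction as a positive quantity, and phi carries the sign of AD − BC):

```python
import math

# Indices from the caption of Table 3.7, for a 2x2 table with cells
# A, B (treated: adverse, no adverse) and C, D (control: adverse, no adverse).
def risk_indices(a, b, c, d):
    p_t = a / (a + b)               # event rate in the treated group
    p_c = c / (c + d)               # event rate in the control group
    rr = p_t / p_c                  # RR = [A/(A+B)] / [C/(C+D)]
    odds_ratio = (a * d) / (b * c)  # OR = cross-product ratio AD/BC
    arr = p_c - p_t                 # RD = ARR, the absolute risk reduction
    rrr = 100 * arr / p_c           # RRR as a percentage of baseline risk
    nnt = 1 / arr                   # NNT = 1/RD = 1/ARR
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return {"RR": rr, "OR": odds_ratio, "RRR%": rrr, "RD": arr,
            "NNT": nnt, "phi": phi, "ARR(10,000)": 10000 * arr}

# Study 1: treatment 1/999 vs. control 19/981.
s1 = risk_indices(1, 999, 19, 981)
print(round(s1["RR"], 2), round(s1["RRR%"], 1),
      round(s1["NNT"], 1), round(s1["ARR(10,000)"]))
# 0.05 94.7 55.6 180
```

Swapping in the counts for the other five studies reproduces the remaining columns of the table.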

[Figure 3.3: six histograms, one per study in Table 3.7, plotting Number of Patients (0–1,000) for the Treatment and Control group assignments; darkened bar areas mark the adverse outcomes at event rates of 1%, 25%, and 50%. Annotations: RR = 0.05 and RRR = 95% for Studies 1–3 (top row); RR = 0.82 and RRR = 18% for Studies 4–6 (bottom row). See caption below.]

Figure 3.3 Histograms based on the six studies in Table 3.7, in which the total sample size (N ) was 2,000 in each study. Darkened
areas of the bars indicate the number of adverse outcomes (event rates), which increased from 1% (20 cases out of 2,000) in Studies 1
and 4, to 25% (500 cases out of 2,000) in Studies 2 and 5, to 50% (1,000 cases out of 2,000) in Studies 3 and 6. However, the relative
risk (RR) and relative risk reduction (RRR) were insensitive to these vastly different event rates. In Studies 1, 2, and 3, the RR and
RRR remained constant at 0.05 and 94.7%, respectively, whereas in Studies 4, 5, and 6, the RR and RRR remained constant at 0.82
and 18.2%, respectively.
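The invariance that Figure 3.3 illustrates is easy to confirm numerically. A small sketch using the treated/control counts of Studies 1–3 from Table 3.7: RR and RRR come out identical across event rates of 1%, 25%, and 50%, while the risk difference (RD) tracks the event rate.

```python
# RR and RRR are blind to the event rate in the total N: Studies 1-3 of
# Table 3.7 share RR = 0.05 and RRR = 94.7% despite vastly different event
# rates; only the risk difference (RD) moves.
studies = {1: (1, 999, 19, 981), 2: (25, 975, 475, 525), 3: (50, 950, 950, 50)}
for label, (a, b, c, d) in studies.items():
    p_t, p_c = a / (a + b), c / (c + d)  # event rates, treated and control
    rr = p_t / p_c                       # relative risk: constant across studies
    rrr = 100 * (p_c - p_t) / p_c        # relative risk reduction, in percent
    rd = p_c - p_t                       # risk difference: tracks the event rate
    print(label, round(rr, 2), round(rrr, 1), round(rd, 3))
# 1 0.05 94.7 0.018
# 2 0.05 94.7 0.45
# 3 0.05 94.7 0.9
```

The same holds for the 1-versus-5-cases example in the text: multiplying both sample sizes by any constant leaves RR and RRR untouched, because both are ratios of the two event rates.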

of illness in the control sample. We would still find RR = 0.20 and RRR = 80%. It makes no difference how large we make the sample sizes, as RR and RRR will not budge from 0.20 and 80% so long as we assume 1 case of illness in the treated sample and 5 cases of illness in the control sample. Suppose we now hold the N constant and see what happens to the RR and RRR when the event rate in the overall N changes from one study to another. In Figure 3.3, we see the results of six hypothetical studies in which the event rates increased from 1% in Studies 1 and 4, to 25% in Studies 2 and 5, to 50% in Studies 3 and 6. Nonetheless, in Studies 1, 2, and 3, RR remained constant at 0.05 and RRR remained constant at an attention-getting 95%. In Studies 4, 5, and 6, RR and RRR stayed constant at 0.82 and 18%, respectively.

Further details of the studies in Figure 3.3 are given in Table 3.7. The odds ratio (OR), for the ratio of two odds, was for a time widely promoted as a measure of association in 2 × 2 tables of counts (Edwards, 1963; Mosteller, 1968) and is still frequently reported in epidemiological studies (Morris & Gardner, 2000). As Table 3.7 shows, OR and RR are usually highly correlated. The absolute risk reduction (ARR), also called the risk difference (RD), refers to the absolute reduction in risk of the adverse event (illness) in the treated patients compared with the level of baseline risk in the control group. Gigerenzer et al. (2008) recommended using the absolute risk reduction (RD) rather than the RR. As Table 3.7 shows, RD (or ARR) is sensitive to the differences in the event rates. There are other advantages as well to RD, which are discussed elsewhere (Rosenthal & Rosnow, 2008, pp. 631–632). Phi is the product-moment correlation (r) when the two correlated variables are dichotomous, and Table 3.7 shows it is sensitive to the event rates and natural frequencies. Another useful index is NNT, for the number of patients that need to be treated to prevent a single case of the adverse event. Relative risk may be an easy-to-handle description, but it is only an alerting indicator that tells us that something happened and we need to explore the data further. As Tukey (1977), the consummate exploratory data analyst, stated: “Anything

that makes a simpler description possible makes the description more easily handleable; anything that looks below the previously described surface makes the description more effective” (p. v). And, we can add, that any index of the magnitude of effect that is clear enough, transparent enough, and accurate enough to inform the nonspecialist of exactly what we have learned from the quantitative data increases the ethical value of those data (Rosnow & Rosenthal, 2011).

Conclusion
In a cultural sphere in which so many things compete for our attention, it is not surprising that people seem to gravitate to quick, parsimonious forms of communication and, in the case of health statistics, to numbers that appear to speak directly to us. For doctors with little spare time to do more than browse abstracts of clinical trials or the summaries of summaries, the emphasis on parsimonious summary statistics such as RR communications in large sample RCTs may seem heavily freighted with clinical meaning. For the general public, reading about a 94.7% reduction in the risk of some illness, either in a pharmaceutical advertisement or in a news story about a “miracle drug that does wonders,” is attention-riveting. It is the kind of information that is especially likely to arouse an inner urgency in patients but also in anyone who is anxious and uncertain about their health. Insofar as such information exaggerates the absolute effects, it is not only the patient or the public that will suffer the consequences; the practice of medicine and the progress of science will as well. As Gigerenzer et al. (2008) wrote, “Statistical literacy is a necessary precondition for an educated citizenship in a technological democracy” (p. 53). There are promising opportunities for moral (and societal) rewards for quantitative methodologists who can help us educate our way out of statistical illiteracy. And that education will be beneficial, not only to the public but to many behavioral, social, and medical researchers as well. As that education takes place, there will be increased clarity, transparency, and accuracy of the quantitative methods employed, thereby increasing their ethical

degree to which each procedure in a particular study meets the methodological standards of transparency, informativeness, precision, accuracy, and groundedness? The research program called for to address these psychometric questions of reliability will surely find that specific research contexts, specific disciplinary affiliations, and other specific individual differences (e.g., years of experience) will be moderators of the magnitudes of agreement (i.e., reliabilities) achieved. We believe that the results of such research will demonstrate that there will be some disagreement (that is, some unreliability) in quantitative methodologists’ evaluations of various standards of practice. And, as we noted above, that is likely to be associated with some disagreement (that is, some unreliability) in their evaluations of the ethical value of various quantitative procedures.

Another important question would be addressed by research asking the degree to which the specific goals and specific sponsors of the research may serve as causal factors in researchers’ choices of quantitative procedures. Teams of researchers (e.g., graduate students in academic departments routinely employing quantitative procedures in their research) could be assigned at random to analyze the data of different types of sponsors with different types of goals. It would be instructive to learn that choice of quantitative procedure was predictable from knowing who was paying for the research and what results the sponsors were hoping for. Recognition of the possibility that the choice of quantitative procedures used might be affected by the financial interests of the investigator is reflected in the increased frequency with which scientific journals (e.g., medical journals) require a statement from all co-authors of their financial interest in the company sponsoring the research (e.g., pharmaceutical companies).

Finally, it would be valuable to quantify the costs and utilities of doing and not doing a wide variety of specific studies, including classic and not-so-classic studies already conducted, and a variety of studies not yet conducted. Over time, there may develop a disciplinary consensus over the costs and utilities of a wide array of experimental procedures. And, although such a consensus is building over
value.
time, it will be of considerable interest to psychol-
ogists and sociologists of science to study disci-
Future Directions plinary differences in such consensus-building. Part
An important theoretical and practical question of such a program of self-study of disciplines doing
remains to be addressed: To what extent is there quantitative research would focus on the quantita-
agreement among quantitative methodologists in tive procedures used, but the primary goal would
their evaluation of quantitative procedures as to the be to apply survey research methods to establish

50 quantitative methods and ethics
the degree of consensus on research ethics of the behavioral, social, educational, and biomedical sciences. The final product of such a program of research would include the costs and utilities of doing, and of not doing, a wide variety of research studies.

Notes

1. Where we quote from a document but do not give the page numbers of the quoted material, it is because either there was no pagination or there was no consistent pagination in the online and hard copy versions that we consulted. Tables 3.1–3.4 reprint the original material, as there were slight discrepancies between original material and online versions.

2. Pattullo (1982) described the logical basis on which "rule-makers" (like DHEW) had proceeded in terms of a syllogism emphasizing not the potential benefits of research but only the avoidance of risks of harm: "(1) Research can harm subjects; (2) Only impartial outsiders can judge the risk of harm; (3) Therefore, all research must be approved by an impartial outside group" (p. 376).

3. Hearings on the recommendations in the Belmont Report were conducted by the President's Commission for the Study of Ethical Problems in Medicine and Biomedical and Behavioral Research. Proceeding on the basis of the information provided at these hearings and on other sources of advice, the Department of Health and Human Services (DHHS) then issued a set of regulations in the January 26, 1981, issue of the Federal Register. A compendium of regulations and guidelines that now govern the implementation of the National Research Act and subsequent amendments can be found in the DHHS manual known as the "Gray Booklet," specifically titled Guidelines for the Conduct of Research Involving Human Subjects at the National Institutes of Health (available online at http://ohsr.od.nih.gov/guidelines/index.html).

4. Smith, Kimmel, and Klein (2009) reported that 43.4% of the articles on consumer research in leading journals in the field in 1975 through 1976 described some form of deception in the research. By 1989 through 1990, the number of such articles increased to 57.7%, where it remained steady at 56% in 1996 through 1997, increased to 65.7% in 2001 through 2002, and jumped to 80.4% in 2006 through 2007. The issue of deception is further complicated by the fact that active and passive deceptions are far from rare in our society. Trial lawyers manipulate the truth in court on behalf of their clients; prosecutors surreptitiously record private conversations; journalists get away with using hidden cameras and undercover practices to obtain stories; and the police use sting operations and entrapment procedures to gather incriminating evidence (cf. Bok, 1978, 1984; Saxe, 1991; Starobin, 1997).

5. The document, titled "Ethical Principles of Psychologists and Code of Conduct," is available online at http://www.apa.org/ETHICS/code2002.html.

6. Adaptations of the models in Figures 3.1 and 3.2 have been used to cue students about possible ethical dilemmas in research and data analysis (cf. Bragger & Freeman, 1999; Rosnow, 1990; Strohmetz & Skleder, 1992).

7. The confusion of statistical significance with practical importance may be a more far-reaching problem in science. In a letter in Science, the writers noted that "almost all reviews and much of the original research [about organic foods] report only the statistical significance of the differences in nutrient levels—not whether they are nutritionally important" (Clancy, Hamm, Levine, & Wilkins, 2009, p. 676).

8. NICE (2004) also recommended that researchers use a standardized mean difference (SMD) of half a standard deviation or more (i.e., d or g ≥ 0.5) with continuous outcomes as the threshold of clinical significance for initial assessments of statistically significant summary statistics. However, effects far below the 0.5 threshold for SMDs have been associated with important interventions. For example, in the classic Salk vaccine trial (Brownlee, 1955; Francis, Korns, Voight, Boisen, Hemphill, Napier, & Tolchinsky, 1955), phi = 0.011, which only has a d-equivalent of 0.022 (Rosnow & Rosenthal, 2008). It is probably the case across the many domains in which clinical significance is studied that larger values of d or g are in fact generally associated with greater intervention benefit, efficacy, or clinical importance. But it is also possible for large SMDs to have little or no clinical significance. Suppose a medication was tested on 100 pairs of identical twins with fever, and in each and every pair, the treated twin loses exactly one-tenth of 1 degree more than the control twin. The SMD will be infinite, inasmuch as the variability (the denominator of d or g) will be 0, but few doctors would consider this ES clinically significant. As Cohen (1988) wisely cautioned, "the meaning of any given ES is, in the final analysis, a function of the context in which it is embedded" (p. 535).

9. The high RR of HS in this study, in which participants (male physicians) took 325 milligrams every other day, might explain in part why the current dose for MI prophylaxis is tempered at only 81 milligrams per day.

Author Note

Ralph L. Rosnow is Thaddeus Bolton Professor Emeritus at Temple University, Philadelphia, PA (rosnow@temple.edu). Robert Rosenthal is a Distinguished Professor at the University of California at Riverside (robert.rosenthal@ucr.edu) and Edgar Pierce Professor of Psychology, Emeritus, Harvard University.

References

Allison, D. B. (2009). The antidote to bias in research. Science, 326, 522.
American Psychological Association. (1973). Ethical principles in the conduct of research with human participants. Washington, DC: Author.
American Psychological Association. (1982). Ethical principles in the conduct of research with human participants. Washington, DC: Author.
American Statistical Association. (1999). Ethical guidelines for statistical practice. http://www.amstat.org/profession/index.cfm?fusaction=ethical
Atwell, J. E. (1981). Human rights in human subjects research. In A. J. Kimmel (Ed.), New directions for methodology of social and behavioral science: Ethics of human subject research (No. 10, pp. 81–90). San Francisco, CA: Jossey-Bass.
Baltimore, D. (January 27, 1997). Philosophical differences. The New Yorker, p. 8.
Beecher, H. K. (July 2, 1966). Documenting the abuses. Saturday Review, 45–46.
Beecher, H. K. (1970). Research and the individual. Boston: Little, Brown.
Blanck, P. D., Bellack, A. S., Rosnow, R. L., Rotheram-Borus, M. J., & Schooler, N. R. (1992). Scientific rewards and conflicts of ethical choices in human subjects research. American Psychologist, 47, 959–965.
Blass, T. (2004). The man who shocked the world: The life and legacy of Stanley Milgram. New York: Basic Books.
Blastland, M., & Dilnot, A. (2009). The numbers game. New York: Gotham Books.
Bok, S. (1978). Lying: Moral choice in public and private life. New York: Pantheon.
Bok, S. (1984). Secrets: On the ethics of concealment and revelation. New York: Vintage Books.
Bordia, P., & Rosnow, R. L. (1995). Rumor rest stops on the information highway: A naturalistic study of transmission patterns in a computer-mediated rumor chain. Human Communication Research, 25, 163–179.
Bragger, J. D., & Freeman, M. A. (1999). Using a cost–benefit analysis to teach ethics and statistics. Teaching of Psychology, 26, 34–36.
Brownlee, K. A. (1955). Statistics of the 1954 polio vaccine trials. Journal of the American Statistical Association, 50, 1005–1013.
Carling, C., Kristoffersen, D. T., Herrin, J., Treweek, S., Oxman, A. D., Schünemann, H., et al. (2008). How should the impact of different presentations of treatment effects on patient choice be evaluated? PLoS One, 3(11), e3693.
Cicerone, R. J. (2010). Ensuring integrity in science. Science, 327, 624.
Clancy, K., Hamm, M., Levine, A. S., & Wilkins, J. (2009). Organics: Evidence of health benefits lacking. Science, 325, 676.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Committee on Science, Engineering, and Public Policy. (2009). On being a scientist: A guide to responsible conduct in research (3rd ed.). Washington, DC: National Academies Press.
Edwards, A. W. F. (1963). The measure of association in a 2×2 table. Journal of the Royal Statistical Society, 126, 109–114.
Fairchild, A. L., & Bayer, R. (1999). Uses and abuses of Tuskegee. Science, 284, 918–921.
Francis, T., Jr., Korns, R. F., Voight, R. B., Boisen, M., Hemphill, F., Napier, J., & Tolchinsky, E. (1955). An evaluation of the 1954 poliomyelitis vaccine trials: A summary report. American Journal of Public Health, 45(5), 1–63.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2008). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8, 53–96.
Harris, G. (March 10, 2009). Doctor admits pain studies were frauds, hospital says. The New York Times. Retrieved March 20, 2009, www.nytimes.com/2009/03/11/health/research/11pain.html?ref=us.
Haywood, H. C. (1976). The ethics of doing research . . . and of not doing it. American Journal of Mental Deficiency, 81, 311–317.
Huff, D. (1982). How to lie with statistics. New York: Norton.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian Journal of Nursing, 24, 1–2.
Jones, J. H. (1993). Bad blood: The Tuskegee syphilis experiment (rev. ed.). New York: Free Press.
Kaiser, J. (September 18, 2009). Researcher, two universities sued over validity of prostate cancer test. Science, 325, 1484.
Katz, J. (1972). Experimentation with human beings. New York: Russell Sage.
Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1, 427–451.
Kazdin, A. E. (2008). Evidence-based treatment and practice: New opportunities to bridge clinical research and practice, enhance the knowledge base, and improve patient care. American Psychologist, 63, 146–159.
Kelman, H. C. (1968). A time to speak: On human values and social research. San Francisco, CA: Jossey-Bass.
Kimmel, A. J. (1991). Predictable biases in the ethical decision making of psychologists. American Psychologist, 46, 786–788.
Kossinets, G., & Watts, D. J. (2006). Empirical analysis of an evolving social network. Science, 311, 88–90.
Mark, M. M., Eyssell, K. M., & Campbell, B. (1999). The ethics of data collection and analysis. In J. L. Fitzpatrick & M. Morris (Eds.), Ethical issues in program evaluation (pp. 47–56). San Francisco, CA: Jossey-Bass.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378.
Milgram, S. (1975). Obedience to authority: An experimental view. New York: Harper Colophon Books.
Miller, J. D. (2007). Finding clinical meaning in cancer data. Journal of the National Cancer Institute, 99(24), 1832–1835.
Morris, J. A., & Gardner, M. J. (2000). Epidemiological studies. In D. A. Altman, D. Machin, T. N. Bryant, & M. J. Gardner (Eds.), Statistics with confidence (2nd ed., pp. 57–72). London: British Medical Journal Books.
Mosteller, F. (1968). Association and estimation in contingency tables. Journal of the American Statistical Association, 63, 1–28.
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (April 18, 1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Retrieved April 8, 2009, http://ohsr.od.nih.gov/guidelines/belmont.html.
National Commission on Research. (1980). Accountability: Restoring the quality of the partnership. Science, 207, 1177–1182.
NICE. (2004). Depression: Management of depression in primary and secondary care (Clinical practice guideline No. 23). London: National Institute for Health and Clinical Excellence.
NICHSR. (2010). HTA 101: Glossary. Retrieved March 3, 2010 from http://www.nlm.nih.gov/nichsr/hta101/ta101014.html
Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17, 776–783.
Orne, M. T. (1969). Demand characteristics and the concept of quasi-control. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 143–179). New York:

Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 110–137).
Panter, A. T., & Sterba, S. K. (2011). Handbook of ethics in quantitative methodology. Taylor & Francis.
Pattullo, E. L. (1982). Modesty is the best policy: The federal role in social research. In T. L. Beauchamp, R. R. Faden, R. J. Wallace, Jr., & L. Walters (Eds.), Ethical issues in social research (pp. 373–390). Baltimore, MD: Johns Hopkins University Press.
Paulos, J. A. (1990). Innumeracy: Mathematical illiteracy and its consequences. New York: Vintage Books.
Paulos, J. A. (2003). A mathematician plays the stock market. New York: Basic Books.
PLoS Medicine Editors. (September, 2009). Ghostwriting: The dirty little secret of medical publishing that just got bigger. PLoS Medicine, 6(9), e1000156. Accessed September 19, 2009 from http://www.plosmedicine.org/static/ghostwriting.action.
Prasad, K., Jaeschke, R., Wyer, P., Keitz, S., & Guyatt, G. (2008). Tips for teachers of evidence-based medicine: Understanding odds ratios and their relationship to risk ratios. Journal of General Internal Medicine, 23(5), 635–640.
Reynolds, P. D. (1975). Value dilemmas in the professional conduct of social science. International Social Science Journal, 27, 563–611.
Rosenthal, R. (1994). Science and ethics in conducting, analyzing, and reporting psychological research. Psychological Science, 5, 127–134.
Rosenthal, R., & Rosnow, R. L. (1969). The volunteer subject. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 59–118). New York: Academic Press. (Reissued in Rosenthal & Rosnow, 2009, pp. 48–92).
Rosenthal, R., & Rosnow, R. L. (1975). The volunteer subject. New York: Wiley-Interscience. (Reissued in Rosenthal & Rosnow, 2009, pp. 667–862).
Rosenthal, R., & Rosnow, R. L. (1984). Applying Hamlet's question to the ethical conduct of research. American Psychologist, 45, 775–777.
Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis (3rd ed.). New York: McGraw-Hill.
Rosenthal, R., & Rosnow, R. L. (2009). Artifacts in behavioral research. New York: Oxford University Press.
Rosnow, R. L. (1990). Teaching research ethics through role-play and discussion. Teaching of Psychology, 17, 179–181.
Rosnow, R. L. (1997). Hedgehogs, foxes, and the evolving social contract in psychological science: Ethical challenges and methodological opportunities. Psychological Methods, 2, 345–356.
Rosnow, R. L., & Rosenthal, R. (1970). Volunteer effects in behavioral research. In K. H. Craik, B. Kleinmuntz, R. L. Rosnow, R. Rosenthal, J. A. Cheyne, & R. H. Walters, New directions in psychology (pp. 211–277). New York: Holt, Rinehart & Winston.
Rosnow, R. L., & Rosenthal, R. (1995). "Some things you learn aren't so": Cohen's paradox, Asch's paradigm, and the interpretation of interaction. Psychological Science, 6, 3–9.
Rosnow, R. L., & Rosenthal, R. (1996). Contrasts and interactions redux: Five easy pieces. Psychological Science, 7, 253–257.
Rosnow, R. L., & Rosenthal, R. (1997). People studying people: Artifacts and ethics in behavioral research. New York: W. H. Freeman.
Rosnow, R. L., & Rosenthal, R. (2008). Assessing the effect size of outcome research. In A. M. Nezu & C. M. Nezu (Eds.), Evidence-based outcome research (pp. 379–401). New York: Oxford University Press.
Rosnow, R. L., & Rosenthal, R. (2011). Ethical principles in data analysis: An overview. In A. T. Panter & S. K. Sterba (Eds.), Handbook of ethics in quantitative methodology (pp. 37–58). New York: Routledge.
Rosnow, R. L., Rotheram-Borus, M. J., Ceci, S. J., Blanck, P. D., & Koocher, G. P. (1993). The institutional review board as a mirror of scientific and ethical standards. American Psychologist, 48, 821–826.
Rosnow, R. L., Skleder, A. A., Jaeger, M. E., & Rind, B. (1994). Intelligence and the epistemics of interpersonal acumen: Testing some implications of Gardner's theory. Intelligence, 19, 93–116.
Sales, B. D., & Folkman, S. (Eds.). (2000). Ethics in research with human participants. Washington, DC: American Psychological Association.
Saxe, L. (1991). Lying: Thoughts of an applied social psychologist. American Psychologist, 46, 409–415.
Schwartz, L. M., & Woloshin, S. (2003). On the prevention and treatment of exaggeration. Journal of General Internal Medicine, 18(2), 153–154.
Scott-Jones, D., & Rosnow, R. L. (1998). Ethics and mental health research. In H. Friedman (Ed.), Encyclopedia of mental health (Vol. 2, pp. 149–160). San Diego, CA: Academic Press.
Smith, M. B. (1969). Social psychology and human values. Chicago, IL: Aldine.
Smith, N. C., Kimmel, A. J., & Klein, J. G. (2009). Social contract theory and the ethics of deception in consumer research. Journal of Consumer Psychology, 19, 486–496.
Spielmans, G. I., & Parry, P. I. (2010). From evidence-based medicine to marketing-based medicine: Evidence from internal industry documents. Journal of Bioethical Inquiry. doi:10.1007/s11673-010-9208-8.
Starobin, P. (January 28, 1997). Why those hidden cameras hurt journalism. The New York Times, p. A21.
Steering Committee of the Physicians' Health Study Research Group. (1989). Final report on the aspirin component of the ongoing Physicians' Health Study. New England Journal of Medicine, 321, 129–135.
Strohmetz, D. B., & Skleder, A. A. (1992). The use of role-play in teaching research ethics: A validation study. Teaching of Psychology, 19, 106–108.
Stryker, J. (April 13, 1997). Tuskegee's long arm still touches a nerve. The New York Times, p. 4.
Tatsioni, A., Bonitsis, N. G., & Ioannidis, J. P. A. (2007). Persistence of contradicted claims in the literature. Journal of the American Medical Association, 298, 2517–2526.
Trials of War Criminals before the Nuernberg Military Tribunals under Control Council Law No. 10, October 1946–April 1949, Vol. II. Washington, DC: U.S. Government Printing Office.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. R. (1990). Envisioning information. Cheshire, CT: Graphics Press.
Tufte, E. R. (2006). Beautiful evidence. Cheshire, CT: Graphics Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252–260.
VandenBos, G. R. (Ed.). (2007). APA dictionary of psychology. Washington, DC: American Psychological Association.
Vedula, S. S., Bero, L., Scherer, R. W., & Dickersin, K. (2009). Outcome reporting in industry-sponsored trials of gabapentin for off-label use. New England Journal of Medicine, 361(20), 1963–1971.
Vinacke, W. E. (1954). Deceiving experimental subjects. American Psychologist, 9, 155.
Wainer, H. (1984). How to display data badly. American Statistician, 38, 137–147.
Wainer, H. (1996). Depicting error. American Statistician, 50(2), 101–111.
Wainer, H. (2000). Visual revelations: Graphical tales of fate and deception from Napoleon Bonaparte to Ross Perot. Mahwah, NJ: Erlbaum.
Wainer, H. (2009). Picturing the uncertain world. Princeton, NJ: Princeton University Press.
Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual Review of Psychology, 32, 191–241.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge, England: Cambridge University Press.
Woloshin, S., Schwartz, L. M., Casella, S. L., Kennedy, A. T., & Larson, R. J. (2009). Press releases by academic medical centers: Not so academic? Annals of Internal Medicine, 150(9), 613–618.
Zuckerman, M., Hodgins, H. S., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists. Psychological Science, 4, 49–53.
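Two quantitative points made in this chapter reward a quick check. The Conclusion's warning about relative-risk framing (a "94.7% reduction" headline can coexist with a tiny absolute effect) and Note 8's phi-to-d conversion for the Salk vaccine trial can both be sketched in a few lines of Python. The rates in the risk example are hypothetical, chosen only to reproduce a 94.7% relative reduction, and the function names are ours; the conversion d = 2φ/√(1 − φ²) is the standard r-to-d formula applied to a phi coefficient.

```python
import math

def risk_summaries(risk_control, risk_treated):
    """Return relative risk reduction (RRR) and absolute risk reduction (ARR)."""
    arr = risk_control - risk_treated
    rrr = arr / risk_control
    return rrr, arr

def phi_to_d(phi):
    """Cohen's d equivalent of a phi coefficient: d = 2*phi / sqrt(1 - phi**2)."""
    return 2 * phi / math.sqrt(1 - phi ** 2)

# Hypothetical rates: 19 cases per 100,000 untreated vs. 1 per 100,000 treated.
rrr, arr = risk_summaries(19 / 100_000, 1 / 100_000)
print(f"relative risk reduction: {rrr:.1%}")   # the attention-riveting 94.7%
print(f"absolute risk reduction: {arr:.5f}")   # only 18 fewer cases per 100,000

# Note 8: Salk vaccine trial, phi = 0.011.
print(f"d-equivalent of phi = 0.011: {phi_to_d(0.011):.3f}")
```

Note 8's twin example, with zero within-pair variability, is the limiting case this conversion cannot express: the denominator of d is 0, so the SMD is undefined (infinite) even though the effect is clinically trivial.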

CHAPTER 4

Special Populations

Keith F. Widaman, Dawnté R. Early, and Rand D. Conger

Abstract

Special populations offer unique opportunities and challenges for mathematical/statistical modeling of
data. First, we discuss several ways of construing the notion of special populations, including the basis
on which we argue that the general notion of special populations is a rather recent one. Then, we
discuss four key methodological implications when considering special populations: (1) properly
defining and accessing participants from the special population; (2) ensuring that the same dimensions
are present across multiple populations; (3) assessing whether empirical implications of psychological
theories hold across populations; and (4) exploiting unusual variation in special populations that may
allow tests of unique hypotheses. Next, we provide examples that illustrate how to deal with each of
the methodological issues. We close with a discussion of issues occasioned by our discussion of
special populations, issues that represent topics for future developments.
Key Words: Special populations, group differences, factorial invariance, measurement invariance,
structural invariance, individual differences

Introduction

The topic of the current chapter is the place or importance of special populations, particularly with regard to how quantitative methods or techniques can be used to understand or characterize special populations or inform about the nature of special populations. Research on special populations has burgeoned during the past quarter-century, and the pace of development of quantitative methods has also expanded rapidly during this period. In this chapter, we deal with the intersection of these two streams of research—special populations and quantitative methods—to illuminate both. That is, we discuss ways in which the use of state-of-the-art quantitative techniques can help explain the nature of special populations in unique and informative ways. In turn, we hope that consideration of special populations may provide feedback that will lead to interesting developments in quantitative methods to capture better behavioral phenomena in these groups.

We develop several goals for the chapter based on our considerations of the application of quantitative methods to special populations. An initial goal is to identify the nature or conception of special populations. Here, we discuss our observations on how special populations are identified. A second goal is to explore how research on special populations offers challenges to or ready application of methodological or quantitative approaches. To meet this goal, we discuss four major implications we draw when thinking about conducting research with special populations, and we discuss techniques that would be particularly appropriate in pursuing these implications. Our third goal is to describe a series of applications of quantitative techniques in the study of special populations to provide a substantive instantiation of how quantitative techniques can be used to clarify

the nature of special populations and the dynamics of psychological and biological processes in these groups. We close with conclusions and a series of questions that represent issues for future research.

In pursuing the nature and implications of special populations for quantitative methods, we performed several literature searches to help bound or circumscribe our thinking about special populations. A PsycINFO search in August 2010 using "special population" and "special populations" as possible title words yielded 397 citations. The oldest citation was to a publication by Uehling in 1952, the next oldest was to a publication in 1975, and 390 (or more than 98%) of the 397 citations were to publications in 1980 or later. Thus, the clear majority of publications listing "special population" as a title word have occurred relatively recently, within the last 30 years.

More inclusively, we next conducted a PsycINFO search using "special population" and "special populations" as possible keywords. This search turned up 1546 citations, a far broader net of citations than provided by the title-word search. Here, the oldest citation was to a chapter in a volume resulting from a White House Conference on Child Health and Protection, published in 1931 during the Hoover administration (see Folks, 1931). The second oldest citation was the Uehling (1952) paper that was the oldest publication in the title-word search, and 1503 (or more than 97%) of the 1546 citations were to publications in 1980 or later.

Finally, a PsycINFO search using "special population" or "special populations" as phrases to be found anywhere in the database led to 4266 citations, our search that led to the most inclusive list of reference citations. The two oldest citations were the Folks (1931) and Uehling (1952) papers uncovered in the earlier searches, and a full 4221 (or more than 98.9%) of the citations were to works published in 1980 or later. Consistent with the preceding two searches, the vast bulk of citations under the explicit title or heading of "special populations" occurred within the past three decades.

Each of these three searches outlined above supports the conclusion that work on or thinking about special populations under that specific rubric is a fairly recent phenomenon. However, work on special populations has been a hallmark of research in psychology for more than 100 years, if not more, even if this research has not been published under the heading of special populations. For example, research and theory on persons with mental illness can be traced back more than 200 years, as can research on persons with mental retardation or intellectual disability. Clinical methods in use more than 200 years ago for dealing with persons with mental illness or intellectual disability appear barbaric to the twenty-first-century practicing scientist or informed citizen; indeed, many clinical methods in use only 50 or 75 years ago seem rather unfortunate and misguided. Thus, research on special populations has long been pursued in psychology, allied behavioral and social sciences, and medical sciences. But, the time is ripe for renewed exploration of behavioral phenomena in special populations to provide a fuller understanding of persons in these populations, which should lead to improved ways of treating and improving the lives of persons in special populations.

Conceptions of Special Populations

To consider methodological implications and applications related to special populations, one must consider first what a special population is. The term special population has no obvious and ready referent. That is, if one asked 10 different psychologists what was meant by "special populations," one might get 10 different answers with little in the way of overlap aside from the indication that "special population" implies deviation from general population norms. To confront this definitional problem, we read a random sample of publications identified in our literature searches to identify the ways different researchers used the term "special populations." In the rest of this section, we discuss several ways in which investigators have used the term, admitting that our categorization is not exhaustive.

Disability Groups

One common use of the term special population appears, implicitly, to represent a less pejorative way of referring to one or another disability group. Research on persons with a slowed rate of mental development was long published under the heading of research on mentally retarded persons, with a clear implication that this was a special population. More recently, the term mentally retarded persons was replaced by persons with mental retardation, which was a less pejorative term; still more recently, the accepted term is now persons with intellectual disability. Historical changes in these terms are reflected in the name of the leading professional organization for research and practice in this domain, which has changed its name from the American Association on Mental Deficiency to the American Association on Mental Retardation and then to the American

Association on Intellectual and Developmental Disabilities. Regardless of the precise term used to refer to this population, the key defining feature is that a person is considered to be a member of the special population if he or she has exhibited a significantly slower rate of mental development relative to persons in the general population.

As noted above, persons with mental illness are another group that is treated as a special population, a special population defined on the basis of a perceived disability. Many additional disability groups can be identified, including persons with visual deficits or blindness, persons with hearing loss or deafness, persons with physical disabilities, and individuals with learning disabilities, to name only a few. The one family resemblance shared by all of these uses of the term “special population” is that the generic label is applied to a group to denote the presence of a behavioral disability that is common to all members of the group.

One problem with the use of the term “special population” in connection with disability groups arises in the context of comorbidity, or the presence of more than one identifier of disability. For example, consider the case of a person with intellectual disability who is also deaf. Is this person a member of two different special populations—that is, is such a person a member of the special population of persons with intellectual disability and also a member of the special population of persons who are deaf? Or, is the person a member of a new and still more special population, the population of persons with dual diagnoses of intellectual disability and deafness? No resolution of this issue is proposed here. Rather, we merely raise it to highlight a vexing problem in evolving notions of disability groups.

Research on disability groups takes a number of forms, and we survey a few of these here. As one example, Martz and Daniel (2010) collected data in the United States and Kenya to determine whether disability prototypes for four disability groups (persons with AIDS, hearing impairment, mental illness, and spinal cord injury) differed across the four groups and whether this was moderated by country of residence. As a second example, Or, Cohen, and Tirosh (2010) used discriminant analysis to determine whether measures from a parent-rated questionnaire could discriminate among three groups of students, including those with attention-deficit hyperactivity disorder (ADHD), those with a learning disability, and those with a combination of ADHD and learning disability. Finally, Cobb, Lehman, Newman-Gonchar, and Alwell (2009) synthesized results from prior meta-analyses of self-determination interventions for various categories of persons with disability. In some meta-analyses, research on students from any disability group was included, whereas other meta-analyses focused on more restricted groups, such as students with intellectual disability or developmental disabilities or students with ADHD. The synthesis by Cobb et al. ended on an unfortunate note, concluding that interventions to enhance levels of self-determination appear to have relatively weak effects, perhaps because of heterogeneity of the special populations combined in the meta-analyses.

Echoing issues identified above, Barton (2009) has decried the lack of a common definition of the basic term disability by persons with disabilities or their advocates. In 1990, the U.S. Congress passed the landmark Americans with Disabilities Act, which outlawed discrimination on the basis of disability and outlined many ways in which accommodations for persons with disabilities must be provided. Given this legislation and the relatively large numbers of persons in the United States who fall in one or another disability category, the population of persons with disability could and should be a potent political force. But, without a common and inclusive definition of the term disability, the political clout of this population typically is watered down, as advocates for particular disability groups (e.g., persons with intellectual disability, students with deafness) follow their own special agendas. If persons with disability and their advocates could unite subgroups under a commonly accepted and inclusive definition of disability, this would almost certainly lead to a more united stance on issues, leading to greater power. Until that time, the political power of persons with disabilities is likely to remain fragmented and therefore weaker.

“Superability” Groups
In contrast to the notion of disability grouping, “special population” can also be used to define individuals with superior levels of ability or performance, which can be termed superability groups to contrast with the term disability group. One of the most common “superability” groups is the intellectually gifted, often defined as persons with intelligence quotients (IQs) greater than or equal to 130. But, giftedness has many dimensions; some individuals are considered gifted in general, whereas others may be deemed gifted in narrower domains such as the arts or specific academic domains.
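The conventional cutoff of IQ ≥ 130 sits two standard deviations above the mean of the usual IQ metric (mean = 100, SD = 15). Under a normal model, the expected proportion of the population at or above this cutoff can be computed directly; the short Python sketch below is our own illustration (the scale constants come from the standard IQ metric, not from any study cited here):

```python
import math

MEAN, SD, CUTOFF = 100.0, 15.0, 130.0

# Upper-tail probability of a normal distribution via the complementary
# error function: P(X >= c) = 0.5 * erfc(z / sqrt(2)), z the standardized cutoff
z = (CUTOFF - MEAN) / SD                       # 2.0 SDs above the mean
p_above = 0.5 * math.erfc(z / math.sqrt(2))    # about 0.0228

print(f"z = {z:.1f}; expected proportion with IQ >= {CUTOFF:.0f} = {p_above:.4f}")
```

About 2.3% of a normally distributed population (roughly 1 person in 44) meets this cutoff, which is why large-scale nomination and screening procedures, such as Terman's described below, are needed to assemble sizable gifted samples.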

widaman, early, conger 57


One issue in the study of superability groups is the early identification of such individuals. Two notable attempts at early identification were undertaken during the twentieth century. In the first of these, Terman established what became known as the Genetic Studies of Genius. Terman had recently developed the Stanford-Binet Scale of Intelligence, which was published in 1916. Then, from 1921 to 1923, he asked fifth-grade teachers in California to nominate the three brightest children in their classes and the youngest child. After testing these children with the new Stanford-Binet scale, children who scored 130 or above (i.e., two or more SDs above the population mean) were invited into the study. Initially, more than 1450 children were enrolled in the study, and later additions resulted in a final sample size of 1528 children.

The second large-scale study of early identification was the work begun in 1971 by Stanley to identify junior high school students with very high levels of math skill, an undertaking subsequently dubbed the Study of Mathematically Precocious Youth (SMPY). These youth were identified at age 12 or 13 years and had to score within the top 3% on a test of school achievement. Then, the young people were given the SAT, a standard test for college admission, and had to score within the top 1% on the test. Once selected into a cohort, the young students were given a variety of accelerated academic experiences to facilitate their learning in domains of science and math. As described by Lubinski and Benbow (2006), who are now the co-directors of SMPY, a total of five cohorts of SMPY youth were recruited. Recruitment for the various cohorts took place between 1972 and 1992, and a total of more than 5300 youth have participated in the program.

A more recent example of early identification of gifted individuals is a study by Kuo, Maker, Su, and Hu (2010), who described an early identification protocol used in Taiwan to identify gifted preschoolers. Children so identified were then enrolled in an enrichment program to offer an optimal environment for them to increase their problem-solving abilities in multiple modalities. The protocol used various kinds of information—from interviews, checklists, portfolios, intelligence tests, and observer ratings—to identify giftedness in several domains. Although most children were deemed gifted in one or another of the domains, almost 20% of children were identified as gifted in more than one domain. Whether these very early identified children will remain characterized as gifted at later points in their lifespan will be an interesting result to track in the future.

In an interesting twist, a superior level of ability or performance in one area can be exhibited in the presence of rather low performance in other areas. For example, Olson, Berryhill, Drowos, Brown, and Chatterjee (2010) reported on a patient who had rather severe impairments in episodic memory that presumably arose as a result of anoxia during birth. The impairments in episodic memory were quite general, resulting in rather poor performance on many memory measures. However, the patient had an extremely accurate ability to recall calendar data with regard to day, month, and year, and this unusually high skill enables the patient to recall the precise date of many of his personal experiences. Another example is that of hyperlexia, or precocious development of single-word reading, which has typically been identified only in persons who have a developmental disability (Grigorenko, Klin, & Volkmar, 2003). As a result, the identification of a particular form of performance as a “superability” does not ensure that the person displaying such an ability is thereby a member of a generally advantaged group. Rather, the superability may merely represent an unusually superior level of performance in the context of a generally depressed level of functioning.

Demographic Groups
Another way of using the “special population” term is to refer to groups of individuals who are identified on the basis of demographic characteristics. The most common characteristics used to classify individuals in the United States are sex, age, ethnic (or racial) status, and socioeconomic status (SES) grouping, although other demographic variables are also used. For example, Sussman (2006) has discussed the prevention of adolescent alcohol problems in various special populations, citing variation across gender, ethnicity, region of the country, and SES groups.

The issue of special populations often arises when investigating the differential validity of psychological tests with persons from different ethnic groups. Cleary (1968) offered a series of statistical tests using multiple regression analysis to determine whether intercept bias and/or slope bias existed in the use of test scores when evaluating students from different ethnic groups for college admission. If neither intercept nor slope bias were found to occur, then a test would be deemed unbiased for use as a selection device. If fewer applicants from a special population

(e.g., African-American) were selected using such a test, then one could justify the result by claiming that the lower level of selection resulted from lower levels of scores obtained by members of the special population on unbiased tests. Although so-called “Cleary tests” have been used routinely for more than four decades in college admissions and in personnel selection, concerns about the utility of the approach remain (e.g., Meade & Tonidandel, 2010; Colarelli, Han, & Yang, 2010).

Retrieving literature that references “special populations,” we were struck by the frequency with which authors who referred to special populations cited groups such as women, children, the elderly, and minority ethnic groups. This designation of special populations implies that non-women, non-children, non-elderly, non-minority individuals—that is, adult White males and, perhaps, college sophomores—are “the” populations that serve as common reference groups, and anyone who deviates from these norms is a member of a special population.

Why White adult males and college sophomores became standard reference groups is anyone's guess. Most likely, the emergence of these reference groups resulted from not just one but a combination of factors. Since the founding of the United States of America, with few notable exceptions, adult White males have tended to occupy the highest positions of power in business, academia, and government. Regardless of the optimality of this distribution of power, the mere presence of this demographic in positions of power may have led, explicitly or implicitly, to acceptance of adult White males as the reference group against which outcomes for other groups would be compared. Additionally, the use of a single demographic, such as adult White males, might have been considered a way to reduce heterogeneity that might otherwise cloud research results. Assuming that results for different demographic groups should not differ dramatically, reduction of heterogeneity might allow trends in data to be seen more clearly.

Other reasons are, almost surely, responsible for the use of college sophomores as a standard or reference group. The top reason for selecting college sophomores as “typical” research participants must be convenience. When taking introductory psychology courses in college, students often must participate in experiments so that they learn about how studies are conducted, and college sophomores are a common demographic in introductory psychology classes. A quarter-century ago, Sears (1986, updated by Henry, 2008; Sears, 2008) decried the use of college sophomores as typical research participants, arguing that college sophomores may provide systematically different responses than members of the general population on many, if not most, experimental questionnaires and paradigms. College sophomores are in a stage of life when they are attempting to “find” or define their identities and thus may be much more susceptible to various influences, such as experimental inductions, than would others. Despite the potentially limited utility of college sophomores as research participants, given their unrepresentativeness relative to the population, convenience in obtaining their participation in research is a leading cause of their continued predominance in studies in social and personality psychology.

Political pressure often appears to be another factor related to the choice of research participants, either restricting or promoting research on a given topic. If a researcher chooses to study a topic (e.g., romantic love) or a group (e.g., homosexual males) that a member of Congress finds objectionable, the research project might be highlighted as a waste of taxpayer money. Senator William Proxmire (D-Wisconsin) made headlines in the 1970s and 1980s when announcing his Golden Fleece Awards, which derided programs of research he personally opposed. On the other hand, the decades-long support for research on persons with intellectual disability by the members of the Kennedy family in Congress led to far more research funding in this area than otherwise would have occurred. Indeed, continued research support for any special population usually requires the presence of a special champion in Congress for that population, given conflicting funding priorities at the national level.

Yet another reason why certain groups are not often represented as research participants is the sheer difficulty in finding and recruiting participants in these groups. Masten and colleagues (e.g., Obradović, Long, Cutuli, Chan, Hinz, Heistad, & Masten, 2009) have been studying homeless and highly mobile children, and identifying members of this group and then tracking them longitudinally has been difficult. Other challenges often face researchers who study court-related samples, such as victims of physical or sexual abuse or children or adolescents in foster care, where concurrence of legal entities is yet another impediment to research.

Regardless of the basis for the designation of reference populations, signs of the breakup of the hegemony of White adult males and college sophomores as reference groups are clearly in evidence.
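The Cleary (1968) procedure described above amounts to a moderated multiple regression: the criterion is regressed on the test score, a group indicator, and their product, and intercept bias and slope bias are tested via the group and group-by-score coefficients, respectively. A minimal sketch in Python with numpy follows; the simulated data, sample size, and variable names are our own hypothetical illustration, not data from any study cited here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical selection data: criterion y (e.g., first-year GPA) and
# test score x for two groups (g = 0 reference, g = 1 focal). The data
# are generated from a single common regression line, so neither
# intercept bias nor slope bias is actually present.
n = 400
g = rng.integers(0, 2, n)
x = rng.normal(50 + 5 * g, 10, n)
y = 1.0 + 0.03 * x + rng.normal(0, 0.4, n)

# Moderated regression: columns = intercept, score, group, score x group
X = np.column_stack([np.ones(n), x, g, x * g])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# t ratios: the 'group' coefficient tests intercept bias,
# the 'score x group' coefficient tests slope bias
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
for name, b, t in zip(["intercept", "score", "group", "score x group"],
                      beta, beta / se):
    print(f"{name:14s} b = {b:7.3f}   t = {t:6.2f}")
```

With data generated from a common regression line, the group and interaction terms estimate true zeros; in an application, a significant group coefficient would signal intercept bias, a significant interaction would signal slope bias, and the regression would then be estimated separately by group.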



For example, when applying for grants through the National Institutes of Health, researchers must carefully describe the projected sample with regard to sex, age, and ethnicity, and any exclusion of persons from a particular demographic category must be justified in convincing fashion. Further, concerted efforts have been made to correct the demographic imbalance in prior research on health. The Women's Health Initiative (http://www.nhlbi.nih.gov/whi/) was a major program of research funded by the government to investigate various outcomes in postmenopausal women, focusing on cardiovascular disease, cancer, and osteoporosis as common causes of disability, morbidity, and death of women. Furthermore, the Office of Minority Health of the U.S. Department of Health and Human Services has pursued Minority Health Initiatives (http://minorityhealth.hhs.gov/) to investigate causes of disease, morbidity, and death in minority populations that deviate from patterns common in the majority (i.e., White) population. We look forward to the day when women, children, the elderly, and ethnic minorities are not considered special populations but are considered major portions of the general population that deserve just as much attention as research subjects as any other demographic group.

Functional Groups
The term special population may also be applied to identify persons who, in our terminology, are members of identifiable functional groups. By functional groups, we refer to individuals who have particular combinations of behavioral profiles or life situations that may have unique importance for understanding their behavior. That is, individuals can be characterized by the full repertoire of positive and negative behaviors they exhibit and the life situations they have selected or that have been imposed on them. Any of these factors, particularly in combination, may define groups that function in unique fashion to determine their behavior and their susceptibility to environmental presses, such as treatments.

In our literature searches on special populations, we found that a great many of the articles retrieved concerned treatment outcomes in special populations, where these special populations met our description of functional groups. Over a decade ago, Polinsky, Hser, and Grella (1998) described the extremely varied types of clients who received services from drug treatment programs in Los Angeles County, highlighting characteristics such as the health status, ethnicity, language needs, and gender-related issues of clients. A key issue Polinsky et al. discussed was the application of types of treatments to types of clients, under the assumption that client characteristics may alter the effectiveness of particular treatments. Taking the issue further, Polinsky et al. proposed that treatments might be specially adapted to the characteristics of the client to obtain maximal success. More recently, Diaz, Horton, McIlveen, Weiner, and Nelson (2009), using data from substance abuse programs, found that almost half (48%) of the clients in their sample had dysthymia. Building on this finding, Diaz et al. recommended that treatment programs consider whether clients have psychological disorders such as dysthymia when treatments are formulated, because treatment success may depend on the presence of significant comorbid characteristics.

Expanding the reach of functional groups beyond personal traits or characteristics, aspects of personal life situations may also be used to define functional groups. Individuals are born into families, these families live in communities, and communities are nested within larger geographical entities. In many publications, Bronfenbrenner (e.g., 1977, 1986a, 1986b, 1999) laid out the ever-expanding circles of embedded environments from micro- to macrosystems. A recent book edited by Little, Bovaird, and Card (2007) was dedicated to presenting statistical and other methodological solutions to the modeling of contextual effects, such as embedded social systems, that are rife in studies conducted in representative, everyday contexts. In one contribution to the Little et al. volume, Widaman (2007) discussed the integration of embedded versions of both the social environment and the physical environment as these combine with personal characteristics to influence behavior. Although any of these embedded levels may play a role in moderating behavioral change, variables associated with the more proximal environments of family and community probably play a larger role than do more distal variables. In a recent study, Chassin, Knight, Vargas-Chanes, Losoya, and Naranjo (2009) found that treatments to reduce certain forms of negative behavior were effective only if families were involved in the treatment. The upshot of this finding is that personal characteristics and aspects of both the social and physical life situations within which an individual functions should be considered when attempting to understand the behavior and adaptability of the individual.
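Contextual effects of the kind just described are often quantified by decomposing an individual-level predictor into a community mean and a within-community deviation and letting the two components take different slopes; a gap between the between-community and within-community slopes is the contextual effect. The Python/numpy sketch below is purely illustrative (the communities, effect sizes, and variable names are invented, and a full treatment would use a multilevel model of the sort covered in the Little, Bovaird, and Card volume):

```python
import numpy as np

rng = np.random.default_rng(1)

# 40 hypothetical communities with 25 persons each. Each person's
# predictor x has a community component; the outcome y depends on the
# person's own x and, additionally, on the community component of x.
n_com, n_per = 40, 25
com = np.repeat(np.arange(n_com), n_per)
com_part = rng.normal(0.0, 1.0, n_com)
x = com_part[com] + rng.normal(0.0, 1.0, n_com * n_per)
y = 0.5 * x + 0.8 * com_part[com] + rng.normal(0.0, 1.0, n_com * n_per)

# Decompose x into observed community means and within-community deviations
xbar = np.bincount(com, weights=x) / n_per        # mean of x per community
x_within = x - xbar[com]

# Separate slopes for the two components; their difference estimates
# the contextual (community-level) effect
X = np.column_stack([np.ones_like(x), x_within, xbar[com]])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"within-community slope  = {b[1]:.2f}")
print(f"between-community slope = {b[2]:.2f}")
print(f"contextual effect       = {b[2] - b[1]:.2f}")
```

In this simulation the between-community slope exceeds the within-community slope, reflecting an influence of the community context over and above each person's own standing on the predictor.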

Biological or Genetic Markers of Group Membership
A fifth and final type of special population is perhaps the most current and state-of-the-art way of defining special populations—by the presence of specific biological or genetic markers that have ties to behavior. Since its rebirth in the early 1970s, behavior genetic research has accumulated at a seemingly ever-increasing rate. The general thrust of findings in this field is that most, if not all, behavioral traits have some heritability, and certain important traits (e.g., intelligence) have high levels of heritability. Consistent with this focus, research continues to focus on biological markers related to behavior, where biological markers are broadly defined, from the molecular level of particular genes or single-nucleotide polymorphisms (SNPs) to the molar level of performance on experimental tasks.

Searching for the molecular genetic bases of behavioral traits is a major focus at the present time (Petrill, 2010). The successful identification of particular genes related to a certain condition is often touted in the press. For example, Chakrabarti et al. (2010) recently reported that certain genes associated with sex steroids and neural growth are related to autistic traits and Asperger syndrome. But, the mere presence of certain SNPs may not be the key. Rather, gene expression at the molecular level may be the important feature, as Simunovic et al. (2009) reported in their study of the pathology underlying Parkinson's disease. Although successes in finding certain SNPs or particular forms of gene expression related to a disease or behavior have been reported, most of these involve very small portions of variance explained and suffer from lack of replication in follow-up investigations.

Perhaps more promising is research at a more molar level. For example, Pennington et al. (2008) reported synaptic and metabolic abnormalities in the prefrontal cortex in persons with schizophrenia or bipolar disorder. Focusing on the anterior cingulate, Eastwood and Harrison (2010) found increased synaptic transmission and plasticity in persons with bipolar disorder. At a still more molar level, Koychev, El-Deredy, Haenschel, and Deakin (2010) reported visual information processing deficits in persons with schizotypy, which is a marker of vulnerability to schizophrenia. And Reichenberg, Caspi, Harrington, Houts, Keefe, Murray, Poulton, and Moffitt (2010) argued that they had identified patterns in performance on standard psychological tests that reflected cognitive deficits related to childhood schizophrenia. The varied levels at which this research is undertaken—from the level of synaptic processes to molar patterns in behavior—are remarkable, yet the patterns uncovered all point to the biological nature of the processes involved.

Two additional ways of understanding biological markers and their effects deserve mention here. First, the search for a solitary SNP or a small number of SNPs responsible for a particular psychological or behavioral trait is almost certainly a rather unrealistic goal. A more likely outcome is reflected in research on phenylketonuria (PKU). As reported on the Phenylalanine Hydroxylase Locus Knowledgebase website (http://www.pahdb.mcgill.ca/), more than 500 mutations on the phenylalanine hydroxylase (PAH) gene have been identified, and all of these lead to reduced metabolism of phenylalanine into tyrosine, the underlying problem in PKU. Although a small number of mutations (e.g., 5 or 6) may account for the majority of mutations in particular populations (e.g., European), any of the mutations can cause PKU. Further, many of the mutations have been categorized with regard to the severity of the mutation, indexed by the degree of disruption of phenylalanine metabolism. If a large number of mutations are found to underlie a single, rather restricted phenotype such as PKU, we should expect that very large numbers of mutations or SNPs are related to broader phenotypes such as intelligence, intellectual disability, or personality disorders.

Second, we think that researchers must pay at least as much attention to the environment as to gene SNPs when searching for genes that affect behavior, an approach that has been characterized as the search for Gene X Environment, or G X E, interactions. Specific genes or SNPs are important in the G X E approach, but the SNPs alone do not directly herald the emergence of a behavioral trait. Rather, behavioral differences among groups identified with different genetic alleles may arise only in particular environmental circumstances or may be clearly exacerbated in such environments, so main effects of genes or environments supply less-than-complete information. Rather than main effects, the G X E interaction indicates that the effects of genetic alleles are moderated by environmental circumstances, so environments modulate how genetic factors are expressed in behavior. Two early papers on the G X E approach, by Caspi et al. (2002) and by Caspi et al. (2003), were both published in Science, the leading general journal in all of science. In the second of these papers, Caspi et al. (2003) investigated the 5-HTTLPR region, which is associated with



serotonin function. The 5-HTTLPR region is characterized by either short (s) or long (l) alleles; because individuals obtain one copy from mother and one from father, individuals can be characterized as s/s, s/l, or l/l based on whether they have (1) two short, (2) one short and one long, or (3) two long alleles, respectively. Caspi et al. presented data suggesting that persons with the l/l allele were relatively impervious to stressful environments and therefore tended to have lower levels of negative outcomes (e.g., depressive symptoms, suicide ideation/attempts) in the most stressful environments. In contrast, individuals with the s/s allele tended to have the most negative outcomes in the most stressful environments, and persons with the s/l allele had outcomes that fell midway between the l/l and s/s groups. However, these differences did not hold in all environmental situations. Indeed, in the least stressful environments, essentially no differences across allele groups were found. Thus, in low-stress environments or in environments with no maltreatment, the allele groups did not differ in depressive outcomes, but differences across groups appeared only as the stressfulness of the environment increased.

More recently, researchers have been investigating the notion of a genetic basis for differential susceptibility to the environment. The studies by Caspi et al. (2002, 2003) support the contention of G X E interactions, but the allele group that fared worst in the most stressful environments rarely exhibited any benefit versus the other groups in the least stressful environments. However, as Belsky, Bakermans-Kranenburg, and van IJzendoorn (2007) argued, the Caspi et al. studies may not have investigated the widest possible range of environmental circumstances, instead generally looking only at stressful versus average environments. If one studied the entire range of environmental circumstances—from worst through average to superior environments—then a true cross-over G X E interaction may be found. That is, persons with certain genetic alleles (e.g., the l/l allele from the 5-HTTLPR) may do relatively well in very poor environments but also may not do much better in superior environments, representing a group of persons who are relatively impervious to environmental circumstances and therefore have low susceptibility to effects of the environment. In contrast, persons with other alleles (e.g., the s/s allele from the 5-HTTLPR) may indeed perform rather poorly in the worst environments but might show the best outcomes of all groups in superior environments. If this were to occur, these individuals would be the most susceptible to environmental influence, with their behavioral outcomes tracking the negativity or positivity of the environments within which they have developed. Although firm conclusions about the presence of differential susceptibility and the resulting cross-over G X E interactions have not yet been reached, many experimental results published in the last few years seem to support this idea. If G X E interactions—particularly cross-over G X E interactions consistent with the differential susceptibility notion—are present in many behavioral domains, a more nuanced picture must be drawn, with genes and environments having co-equal status as the basis for behavioral phenotypes. Groups may still be identified by their genetic alleles, but the implications of their genes for these special populations can only be understood by considering the environments in which the persons have developed.

Summary
As a reading of this section demonstrates, the notion of special populations is often invoked but can refer to very different kinds of groupings of individuals. We have identified five ways of characterizing special populations, based on disability status, superability status, demographic characteristics, functional characteristics, and genetic markers. Others might be able to identify additional classes of variables that might be used to designate special populations. Although the distinctions among different types of special populations underscore the heterogeneous nature of the alternate bases for groupings of persons, all of the distinctions among groups have an important familial resemblance: Researchers must investigate whether patterns of empirical results vary in important ways across special populations or when moving from the general population to a special population. If results vary importantly across groups, then special population status is a moderator of results, and conclusions about “the ways that things work” do not generalize across groups. Thus, special populations constitute a crucible for research in the social sciences, and we must guard against unwarranted generalization of findings across groups unless research supports such conclusions.

Methodological Implications of Special Populations
Having established some guidelines for distinguishing among special populations, we turn next to the methodological or quantitative implications that arise when considering special populations. In

most graduate education in psychology, quantitative experts teach classes in which students are prepared to obtain data from a sample (typically described in nebulous terms); take standard, off-the-shelf statistical methods; estimate population parameters in a particular analytic model, and use their results to make conjectures about the population, as if the population were a single, monolithic entity. But, once one acknowledges the presence of special populations (which represent specially identified subsets of the larger population) our thinking about methodological and statistical procedures must be amended. Rather than estimating "the" population parameter in a monolithic population, we should begin trying to understand the nature of special populations, how parameter estimates might vary across populations, when it is possible to compare parameter estimates across groups, and similar difficult questions. In this section, we slice up this task under four headings—identifying and accessing special populations, measuring the same constructs across groups, exploring the bounds of psychological theories, and exploiting unusual variation in special populations—and discuss the methodological implications of the substantive issues that arise.

Identifying and Accessing Participants in Special Populations
The first task of any investigation into participants from special populations is gaining access to the participants. But, this characterization of "gaining access to the participants" masks several necessary steps in the design and conduct of an empirical investigation. To provide some cognitive structuring of this first concern, we have broken down the research process associated with drawing a sample from a special population into four steps or issues.

The first task is the clear identification of the special population to be studied. As earlier sections of this chapter have shown, a single, overarching notion of what constitutes a special population is not present in the research literature. Rather, various ways of defining special populations can be used, and some of these will provide partially overlapping subsets of possible participants. Because of the seemingly murky nature of some special populations, we offer only the most general recommendations of how to deal with the problem of identification of the population. Researchers should be careful to develop a clear statement of the population to be studied. If the target population is all persons receiving mental health services in a catchment area, then no explicit or implicit exclusion criteria should be used or one might inadvertently draw an unrepresentative sample. Of course, a more restricted population could be the topic of a study, such as persons from minority groups who seek mental health services with regard to alcohol problems but without attendant drug use problems. We hope the reader is clear on the issue of identification of a special population: Both inclusion and exclusion criteria for participants must be clearly elucidated and justified by the nature of the study, and the research team should remain vigilant to ensure that no unexpected factors are biasing the nature of the population defined.

A second issue is the development of a plan to access participants from the special population. If one is interested in studying clinical populations, access to potential participants will often be sought through professional agencies, such as mental health centers, regional centers that offer services for members of particular populations (e.g., intellectual disability), state agencies, or the like. If the goal is to study students in elementary school, developing ties to schools is the most obvious approach. But, members of some populations are much harder to access. For example, researchers who study child physical or sexual abuse or neglect must often access their participants through arrangements with court systems, child welfare organizations, and so forth. In these research endeavors, investigators frequently find that a great deal of time must be spent and a large amount of red tape must be surmounted to gain access to research subjects. Still, one should not let the difficulty of the access alter the research goals of a program of research. To provide the most valid data from the most representative settings, proper and optimal access must be developed.

A third step, once the special population has been identified and modes of accessing participants have been developed, is to develop a plan for sampling participants from the population. In some situations, a sampling plan can be relatively straightforward. For example, a researcher might select each birth (or every second birth) from a set of hospitals during a given time period as the sample to be drawn, and this will often lead to the drawing of a representative sample from the population that is served by the hospitals in question. Studying birth outcomes in low SES families might be approached in this fashion, if the hospitals selected tended to serve the low SES families in a community. But, if a researcher intends to investigate differences across certain strata (e.g., African-American, Hispanic [or Latino]) and participants from these

widaman, early, conger 63


strata are unequally represented in the population, then oversampling the low-frequency participants and then re-weighting in the analysis stage can ensure statistical analyses that have greater external validity or generalizability. Full discussion of sampling plans is beyond the scope of this chapter; here, we merely emphasize the need to consider the sampling plan with as much attention and concern as other aspects of study design.

The final step in an empirical study is the recruitment of participants from the special population. This step involves the initial contact with potential participants and their recruitment into the study. The recruitment rate into a study is a crucial statistic and should be reported in every published report. Recruitment rates vary across types of studies, so a low recruitment rate for a given study may not be a fatal flaw if this is representative of the studies in the domain. However, researchers should collect as much basic data as possible on potential participants so that trends in participation versus nonparticipation might be discerned. If nonparticipants differ systematically from participants on any variables, this may limit the generalizations to be drawn from the study. Researchers should report the recruitment rate in any published paper, so readers will have a basis for placing the research results in context.

Implication 1: Informed identification of special populations and members of such groups is often a difficult task. But, the failure to identify, sample, and recruit members of special populations in appropriate ways may render any empirical results of dubious value.

The history of research in psychology tends to reflect the conduct of studies on samples of convenience. Researchers often are careful to describe the basic demographic characteristics of their participants but infrequently discuss whether the sample is representative of any larger population. As psychology matures as a science, greater attention will be paid to the issues of drawing appropriate samples from well-defined populations. Our hope is that as researchers pay greater attention to this set of issues, research results will begin demonstrating greater replication across studies than has often been the case to date.

Measuring the Same Constructs Across Groups
Perhaps the initial analytic task to undertake when studying special populations is to determine the meaning and the metrics of dimensions of individual differences within and among populations or subpopulations. Researchers typically assess individual differences on key dimensions in a domain in their research projects. For example, a study on mental abilities may include measures of fluid and crystallized intelligence, and a study of personality is likely to assess the Big Five dimensions of Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Further, the listing of certain dimensions of intelligence and personality represents only a meager sampling of the many different types of characteristics we assess in our work as researchers in psychology. However, using a standard measure developed and normed on the general population to assess individuals from a special population is fraught with problems, and a researcher cannot assume that the individual differences reflected in scores on the measure are directly comparable across populations.

Consider the use of a widely employed measuring device in a special population. For example, a researcher might want to assess dimensions of personality in a large sample of persons with intellectual disability. The researcher may intend to determine whether persons with intellectual disability have different mean levels on the personality dimensions (e.g., higher levels of Agreeableness) and whether individual differences on the personality traits relate in meaningful ways to success in community placements. To do so, the researcher might select the Big Five Inventory (BFI) (John, Donahue, & Kentle, 1991), a 44-item measure of the major dimensions of personality that contains 8 to 10 items for each of the five dimensions. On each item, the respondent is asked to indicate, using a 1-to-5 rating scale, his or her degree of agreement with certain adjectives describing personal behaviors or descriptions.

In a situation like this, a researcher may blithely assume that responses by persons with intellectual disability can easily be compared at the scale level with responses by persons who do not have intellectual disability. Thus, one might sum up the item scores on the Extraversion scale and compare mean and variance differences on scale scores between relevant samples of persons with and without intellectual disability. However, psychometric investigations over the past 50 years and more have demonstrated that incorrect conclusions may be drawn in such situations unless one is confident that one is assessing "the same constructs" across the different populations. Methods of verifying that one is assessing "the same constructs" across groups

have been published under several rubrics, including methods to assess measurement invariance, factorial invariance, or lack of differential item functioning (DIF). The upshot of this concern leads to an important implication when researching special populations:

Implication 2: Establishing the measurement invariance of instruments across groups is a key result in the investigation of any special population. Measurement invariance must hold for meaningful comparisons to be made across samples from different populations.

Measurement invariance is a broad topic, one that subsumes research and statistical models that cross many boundaries. For example, some work on measurement invariance has looked at prediction models, which may be used to predict some behavioral outcome (e.g., college grade point average) from relevant predictors (e.g., high school grade point average, admission test scores). Multiple models can be employed in such research, and regression analysis is often used. In regression analysis, questions arise regarding any potential intercept bias or slope bias across groups when predicting the outcome.

When measurement invariance is considered within the context of factor analysis models, the term applied is usually factorial invariance, which is a restricted form of measurement invariance (Meredith, 1993). In factor models, we typically begin with a data model of the form:

Yji = τj + λj1η1i + · · · + λjrηri + εji, (1)

where Yji is the score of person i (i = 1, . . . , N) on manifest variable j (j = 1, . . . , p), τj is the intercept for manifest variable j, λjk is the factor loading (or regression weight) for predicting manifest variable j from latent variable k (k = 1, . . . , r), ηki is the factor score for person i on latent variable k, and εji is the score of person i on the unique factor for manifest variable j. The model in Equation 1 is termed the linear factor analysis data model to signify the fact that the linear model was developed as a model for understanding persons' scores on manifest variables as linear functions of their scores on latent variables. Writing Equation 1 in matrix notation results in:

Y = τ + Λη + ε, (2)

where Y is a (p × 1) vector of scores for person i on the p manifest variables, τ is a (p × 1) vector of intercepts for the p manifest variables, Λ is a (p × r) matrix of loadings of the p manifest variables on the r latent variables, η is an (r × 1) vector of scores for person i on the r latent variables, and ε is a (p × 1) vector of scores of person i on the p unique factors. One can use the model in Equation 2 to develop moment expectations for the manifest variables, moment expectations that yield expressions for the covariance structure and the mean structure of the manifest variables. In a single-group case, these expectations are:

Σ = ΛΨΛ′ + Θ (3a)
μ = τ + Λα, (3b)

where Σ is the (p × p) matrix of population covariances among manifest variables, Ψ is the (r × r) matrix of covariances among the latent variables, Θ is the (p × p) matrix (usually diagonal) of covariances among unique factors, μ is a (p × 1) vector of population means on manifest variables, α is an (r × 1) vector of means on the latent variables, and other symbols were defined above. Equation 3a is the population covariance structure model, and Equation 3b is the population mean structure model.

In any sample from a population, we observe sample covariances among manifest variables, which we signify as S, and sample means on the manifest variables, which we can signify as Ȳ. Given these sample estimators of population values, we can write the sample covariance and mean models as:

S ≅ Λ̂Ψ̂Λ̂′ + Θ̂ = Σ̂ (4a)
Ȳ ≅ μ̂ = τ̂ + Λ̂α̂, (4b)

where carets (ˆ) are added to matrices to indicate that sample estimates of population parameters are contained in the matrices, and all symbols are defined above.

The covariance structure model in Equation 4a signifies that the matrix of sample covariances among manifest variables S is approximated by the covariance structure model, Λ̂Ψ̂Λ̂′ + Θ̂; with estimates in the three parameter matrices Λ̂, Ψ̂, and Θ̂, the matrix expression yields an estimate of the population covariances among manifest variables, Σ̂, under the assumption that the model is correct in the population. The mean structure model in Equation 4b shows that the sample means are estimators of population means, and these are approximated as a function of intercepts, factor loadings, and means of the latent variables.
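The moment structures in Equations 3a and 3b are simply matrix arithmetic, and it may help to see them computed directly. The following sketch uses a hypothetical two-factor, six-indicator model; all parameter values are invented purely for illustration and are not drawn from this chapter.

```python
import numpy as np

# Hypothetical 2-factor, 6-indicator model; all values invented for illustration.
Lambda = np.array([[0.8, 0.0],   # Lambda: p x r factor loading matrix;
                   [0.7, 0.0],   # indicators 1-3 load on factor 1,
                   [0.6, 0.0],   # indicators 4-6 on factor 2
                   [0.0, 0.8],
                   [0.0, 0.7],
                   [0.0, 0.6]])
Psi = np.array([[1.0, 0.4],      # Psi: r x r latent-variable covariance matrix
                [0.4, 1.0]])
Theta = np.diag([0.36, 0.51, 0.64, 0.36, 0.51, 0.64])  # unique-factor variances
tau = np.array([3.0, 2.5, 2.0, 3.0, 2.5, 2.0])         # measurement intercepts
alpha = np.array([0.0, 0.5])                           # latent-variable means

Sigma = Lambda @ Psi @ Lambda.T + Theta  # Equation 3a: covariance structure
mu = tau + Lambda @ alpha                # Equation 3b: mean structure
```

Because Θ here is diagonal, every off-diagonal element of Σ is reproduced entirely by ΛΨΛ′, which is the usual common-factor account of covariation among manifest variables.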



If we generalize the model in Equations 4a and 4b to the multiple-group context, we arrive at:

Sg ≅ Λ̂gΨ̂gΛ̂′g + Θ̂g = Σ̂g (5a)
Ȳg ≅ μ̂g = τ̂g + Λ̂gα̂g, (5b)

where the subscript g (g = 1, . . . , G) has been added to each matrix or vector to indicate that the elements of equations are derived from group g, and all other symbols are defined above.

Levels of factorial invariance. Given the multiple-group model shown in Equations 5a and 5b, consideration of factorial invariance can commence. A simple rendition of factorial invariance is this: The same factors should be present in multiple groups. But, this simple statement masks key issues. How can we tell if the same factors are present in different groups? What empirical results would give us confidence that we have identified the same factors in different groups?

One way to get a bit more definite about how to verify that invariant factors have been identified in multiple groups is to consider a mathematical statement regarding the expectations of the manifest variables, which can be written as E(Y|η, g). This equation states that the expected values of the manifest variables in Y are a function of the common latent variables in η and the group g to which a person belongs. Now, if E(Y|η, g) = E(Y|η), or if the expectations of the manifest variables given η and g equal the expectations of the manifest variables given just η, then the expectations are not dependent on the group of which a person is a member. If this identity holds, then the same latent variables are present in the different groups.

To translate the expectation equations into implications regarding data, consider the factor model shown in Equations 5a and 5b above. The expectation equality E(Y|η, g) = E(Y|η) will hold only if the common factors are "translated" into manifest variable scores in the same fashion in each group. This requirement has led researchers to discuss several levels of factorial invariance. Many researchers (e.g., Byrne, Shavelson, & Muthén, 1989; Chen, Sousa, & West, 2005; Cheung & Rensvold, 1999; Ferrer, Balluerka, & Widaman, 2008; Hancock, Kuo, & Lawrence, 2001; Little, 1997; McArdle, 1988; Meredith & Horn, 2001; Millsap & Meredith, 2007; Nesselroade, 1983; Rensvold & Cheung, 1998) have written on the topic of factorial invariance, and the preceding listing of contributions merely scratches the surface of work in this area and therefore necessarily misses many contributions that should also be cited.

In our presentation here, we will follow the summary of approaches to factorial invariance research provided by Widaman and Reise (1997), who synthesized prior work by Jöreskog (1971), Horn, McArdle, and Mason (1983), and Meredith (1993) to arrive at four levels of factorial invariance. These four levels constitute levels of increasing restriction on parameters of the factor analysis model.

Horn et al. (1983) termed the first level of invariance configural invariance. By configural invariance, we mean that the same pattern of fixed and free loadings in Λ is observed in each group. This form of invariance states merely that within each group, each manifest variable is predicted by the same latent variable(s) as occurs in other groups. Under configural invariance, the g subscript is still employed for each Λ matrix, because different estimates can be found for particular factor loadings across groups and the stipulation of configural invariance will still be satisfied.

The second level of factorial invariance was termed weak factorial invariance by Widaman and Reise (1997) and refers to a model in which factor loadings are constrained to be invariant, or identical, on a one-by-one basis across groups. Thus, not only must the factor loadings display the same pattern of fixed and free loadings across groups, but the free loadings across groups are constrained to invariance. If the invariance constraint on factor loadings is supported, then regression weights for predicting manifest variables from latent variables are invariant across groups. This equality is one key element in showing that latent variables are translated into manifest variables in the same fashion. If this holds and the factor loading matrices can be constrained to invariance across groups, the g subscript can be deleted from the loading matrices, leading to:

Sg ≅ Λ̂Ψ̂gΛ̂′ + Θ̂g = Σ̂g (6a)
Ȳg ≅ μ̂g = τ̂g + Λ̂α̂g, (6b)

where all symbols in Equations 6a and 6b are defined above.

The third level of invariance, strong factorial invariance, adds invariance across groups of the measurement intercepts to the weak factorial invariance model. Thus, under strong factorial invariance, the regression equation for predicting each manifest variable from the set of latent variables is invariant across groups, satisfying a key criterion for measurement invariance—that the latent variables are related to

manifest variables in the same fashion across groups. The resulting equations are:

Sg ≅ Λ̂Ψ̂gΛ̂′ + Θ̂g = Σ̂g (7a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g, (7b)

where all symbols are defined above. Note that both the factor loading matrix and the vector of intercepts have the g subscript deleted, because these matrices have invariant estimates across groups. Two key outcomes accompany strong factorial invariance: (1) all differences across groups in the means on manifest variables are due to mean differences on the latent variables; and (2) group differences in means and variances on the latent variables are identified in a comparable metric across groups, enabling comparisons at the latent variable level across groups. Thus, the latent variable model in Equations 7a and 7b identifies the latent variables as the sources of differences across groups in mean levels on the manifest variables, and variances on latent variables can also be compared.

The fourth and most restricted level of invariance is strict factorial invariance. Under this level of invariance, the unique factor variances are additionally constrained to invariance across groups. The resulting equations for the covariance structure and mean structure, respectively, are:

Sg ≅ Λ̂Ψ̂gΛ̂′ + Θ̂ = Σ̂g (8a)
Ȳg ≅ μ̂g = τ̂ + Λ̂α̂g, (8b)

where all symbols are defined above. As shown in Equation 8a, the g subscript is deleted from the unique factor covariance matrix Θ̂ because estimates in this matrix are invariant across groups. Under strict factorial invariance, all group differences in mean levels and in variances on the manifest variables are due to mean and variance differences, respectively, in the latent variables. Thus, under strict factorial invariance, we have a concise representation of all between-group differences on manifest variables. Although strict factorial invariance is the most concise of the levels of factorial invariance, researchers often find that equality constraints on unique factor variances across groups are too restrictive. This is not a problem, because comparisons across groups on the latent variables are justified if strong factorial invariance holds.

Representing mean and variance/covariance differences across groups. If at least strong factorial invariance is satisfied for a set of data, then latent variables are identified in a form that allows investigation of group differences on the latent variables. These differences across groups are contained in particular matrices in Equations 8a and 8b. Specifically, group differences in mean levels are obtained from the α̂g matrices in Equation 8b. Models are often identified by fixing factor means to zero in one group (e.g., group 1, which serves as the reference group), so mean values for the other groups (e.g., groups 2, . . . , G) are estimated as mean differences from the reference group.

Group differences in variance on the latent variables or covariances among latent variables are obtained from the Ψ̂g matrices in Equation 8a. If latent variables are identified by fixing latent variable variances to unity in the reference group (e.g., group 1), then variances on the latent variables are estimated relative to this reference group.

In summary, if strong factorial invariance holds for a set of data, latent variables are identified in an invariant fashion across groups. Given this, between-group comparisons on mean and/or variance on the latent variables are justified, and these comparisons can be made at the error-free, latent variable level.

Exploring the Bounds of Psychological Theories
As discussed above, obtaining strong factorial invariance allows the researcher to assume that mean and/or variance differences on dimensions of individual difference are interpretable across populations. However, psychological theories frequently lead to predicted relations among constructs. For example, many researchers have sought to outline dimensions of parenting styles, and the isolation and replication of such dimensions—particularly across groups (e.g., across ethnic groups)—is an important matter. But, once these initial steps have been taken, researchers are typically interested in whether parenting styles have impacts on, or at least consistent patterns of asymmetric relations with, other variables, such as child behavior. To pursue such research, we must focus on the relations among latent variables, determining whether the patterns and strength of relations among latent variables are similar or different across groups.

Implication 3: Results of studies of the general population should not be generalized to special populations without research that supports such generalization. Investigations of the structural relations among latent variables across populations hold the key to determining whether conclusions regarding relations among variables generalize to special populations.
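Before turning to structural relations, the measurement-level claim above can be made concrete. In the hypothetical two-group sketch below, the strong factorial invariance model of Equations 7a and 7b holds: loadings and intercepts are shared across groups, while latent means, latent variances, and unique variances differ. All parameter values are invented for illustration.

```python
import numpy as np

# Hypothetical two-group, 1-factor model under strong factorial invariance
# (Equations 7a and 7b): Lambda and tau invariant; alpha_g, Psi_g, Theta_g free.
Lambda = np.array([[0.8], [0.7], [0.6]])   # invariant loadings (3 indicators)
tau = np.array([2.0, 1.5, 1.0])            # invariant measurement intercepts

alpha = {1: np.array([0.0]), 2: np.array([0.5])}           # latent means
Psi = {1: np.array([[1.0]]), 2: np.array([[1.4]])}         # latent variances
Theta = {1: np.diag([0.4, 0.5, 0.6]), 2: np.diag([0.3, 0.5, 0.7])}

Sigma = {g: Lambda @ Psi[g] @ Lambda.T + Theta[g] for g in (1, 2)}  # Eq. 7a
mu = {g: tau + Lambda @ alpha[g] for g in (1, 2)}                   # Eq. 7b

# With Lambda and tau invariant, manifest mean differences are carried
# entirely by the latent means: mu_2 - mu_1 = Lambda (alpha_2 - alpha_1).
manifest_diff = mu[2] - mu[1]
```

The last line illustrates outcome (1) noted above: under strong factorial invariance, the manifest mean contrast μ̂2 − μ̂1 reduces to Λ̂(α̂2 − α̂1), so between-group differences in manifest means trace back solely to differences in latent means.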



In the structural modeling literature, researchers Equation 9 for η leads to:
often distinguish between the measurement model
η = (I − B)−1 ζ, (10)
and the structural model. The measurement model
consists of the relations of the latent variables to where I is an (r × r) identify matrix, the superscript
the manifest variables, and the structural model −1 indicates the inverse of the associated matrix,
contains relations among the latent variables. In and all other symbols were defined above.
the foregoing section, we described several levels of The covariance expectations for Equation 10 are:
factorial invariance—such as weak and strong facto-    −1
rial invariance—and these concerned invariance of E ηη = (I − B)−1 Ψ I − B , (11)
parameter estimates in the measurement model. where all symbols are defined above. To ease inter-
Once at least strong factorial invariance is estab- pretation of two parameter matrices, we will dis-
lished, we may pursue other forms of invariance tinguish between independent latent variables and
that are of great importance for generalizing the- dependent latent variables, using superscript (i) and
oretical conclusions across groups. These additional (d ), respectively. Thus, we will write the matrix of
forms of invariance fall under the rubric we are covariances among latent variables as:
calling structural invariance, because they involve  (i) 
invariance of the parameter estimates in the struc- Ψ 0
Ψ= (12)
tural model. A great deal of work has been done 0 Ψ(d )
on measurement or factorial invariance, but rela-
and the matrix of factor means as
tively little research has been published under the  (i) 
heading of structural invariance. To document this, α
α= , (13)
PsycINFO searches were made with “measurement α(d )
invariance” and “factorial invariance” as keywords;
these searches returned 412 and 402 citations, where Ψ(i) contains free covariances among the
respectively. A search with “structural invariance” as independent latent variables, Ψ(d ) contains covari-
keyword returned only 101 citations, a much lower ances among residuals of the dependent latent vari-
number. ables, the independent latent variables are assumed
In addition, the term structural invariance to be uncorrelated with the residuals of the depen-
seemed to be used in a much less consistent, more dent latent variables, α(i) is a vector of means of
confused fashion in prior research. Many authors the independent latent variables, α(d ) is a vector
used the term “structural invariance” synonymously of intercepts of the dependent latent variables, and
with what we have called factorial invariance, other symbols are as defined above.
arguing that satisfying tests of factorial invariance Placing Equations 11, 12, and 13 into Equations
implied that the factors were structurally invari- 7a and 7b for a multiple-group version of the strong
ant across groups. When using the term “structural factorial invariance model yields:

invariance” to refer to elements in the structural  −1 Ψ̂(i) 0
model, “structural invariance” has typically been Sg =∼ Λ̂ I − B̂g g
0 Ψ̂(d
g
)
interpreted as denoting the invariance of the pattern
of significant and nonsignificant directed relations  −1
among latent variables. I − B̂ g Λ̂ + Θ̂g = Σ̂g (14a)
To discuss levels of structural invariance, we need 
α̂g(i)
to revise slightly the structural model shown in Ȳg ∼
= μ̂g = τ̂ + Λ̂ , (14b)
Equations 8a and 8b. If we return to Equation 2 α̂g(d )
and allow directed relations among latent vari- where g subscripts on vectors or matrices indicate
ables, we can write an equation for the latent the presence of differing parameter estimates across
variables as: groups, and all symbols are defined above.
Levels of structural invariance. Here, we propose
η = Bη + ζ, (9)
that different levels of invariance can be distin-
guished for the structural model, levels that are
where B is an (r × r) matrix of regression weights analogous to the levels of factorial invariance. In
for predicting latent variables from other latent offering this proposal, we hope we can lead to
variables, ζ is an (r × 1) vector of latent variable research on structural invariance that is as illumi-
residuals, and other terms are defined above. Solving nating as work on factorial invariance and, in the

68 s p e c i a l p o p u l at i o n s
process, help establish a common nomenclature for be placed on raw score regression weights (or equiva-
discussing structural invariance. lents of these), rather than standard score regression
The first and most basic form of structural invari- weights. As in typical multiple regression analy-
ance is configural structural invariance of the pattern ses, raw score regression weights are presumed to
of fixed and free regression weights in the B̂g matrices be invariant across samples from a common pop-
in Equation 14a. If a restricted pattern of directed ulation, whereas standardized regression weights
paths is estimated from independent latent variables are expected to vary as a result of range restric-
to dependent latent variables and among depen- tion resulting from sampling. Now, with latent
dent latent variables, configural structural invariance variable models, the scale of each latent variable
implies that the same latent variables have directed may be fixed in any of several ways, and the “raw
effects on the same outcome latent variables in each score regression weights” among the latent η vari-
group. ables will vary as a result. However, if the scale
The second level of structural invariance is invari- of each latent variable is fixed in one group and
ance of the regression weight estimates for predicting strong factorial invariance constraints are placed on
certain latent η variables from other latent variables. the Λ̂ and τ̂ matrices, then the latent η variables
Paralleling distinctions made above for factorial are on the same scale across groups. As a result,
invariance, placing invariance on the regression the regression weights in the B̂ matrices are anal-
weights leads to what may be called weak structural ogous to raw score regression weights and are on
invariance. These regression weights are contained in the B̂g matrices in Equation 14a. If parameter estimates in the B̂g matrices are constrained to invariance across groups, the g subscript would be dropped from these matrices, leading to

Sg ≅ Λ̂ (I − B̂)⁻¹ [Ψ̂g(i) 0; 0 Ψ̂g(d)] (I − B̂)⁻¹′ Λ̂′ + Θ̂g = Σ̂g,  (15)

where all terms are defined above, and the lack of subscripts on the B̂ matrices indicates that across-groups invariance constraints have been imposed on parameter estimates in this matrix.

At least two issues must be mentioned with regard to constraints on regression weights in the B̂ matrices. First, across-group constraints on the B̂ matrices are interpretable only if weak or strong factorial invariance has been established. Thus, if only configural factorial invariance holds for a given set of data, no clear and convincing substantive interpretation can be placed on constraints on elements in the B̂ matrices. Data must support at least the hypothesis of weak factorial invariance to yield interpretable constraints on the B̂ matrices, and still stronger interpretations of such constraints accompany the successful specification of strong factorial invariance. For this reason, we placed the restricted B̂ matrices in the strong factorial invariance model in Equation 15, because strong factorial invariance allows a more adequate basis for discussing these constraints.

Second, if across-group constraints are imposed on the B̂ matrices, the constraints should generally be imposed only if the latent variables are identified in a comparable metric, so invariance constraints on these weights are reasonable a priori hypotheses to test.

A third form of structural invariance involves the intercepts for the dependent latent variables, contained in the αg(d) matrices. Placing invariance constraints on these latent intercepts leads to strong structural invariance. If across-group constraints on latent intercepts are imposed, then the resulting equation for mean expectations has the form:

Ȳg ≅ μ̂g = τ̂ + Λ̂ [α̂g(i); α̂(d)],  (16)

where all terms are defined above, and the lack of subscripts on the α(d) matrix indicates the presence of cross-group invariance constraints on parameter estimates in this matrix.

More importantly, if these latent intercept terms are constrained to invariance across groups with no significant loss in model fit, then any group differences in mean level on the dependent latent variables result from group differences in mean level on the independent latent variables. This condition is analogous to the distinction at the measured variable level that distinguishes weak factorial invariance from strong factorial invariance; hence, our distinction here between weak and strong structural invariance.

The fourth and final type of structural invariance to be represented and tested involves the residual covariances among the dependent latent variables, contained in the Ψ̂g(d) matrices. If these residual covariances are constrained to invariance, the resulting model is characterized as conforming to strict structural invariance.

widaman, early, conger 69
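The covariance structure in Equation 15 is straightforward to compute numerically. The sketch below, in Python with NumPy, uses small made-up matrices for one independent and one dependent latent variable, each measured by two indicators; none of the values are estimates from any study discussed in this chapter.

```python
import numpy as np

# Illustrative (invented) parameter matrices, NOT estimates from the chapter:
Lam = np.array([[1.0, 0.0],
                [0.8, 0.0],
                [0.0, 1.0],
                [0.0, 0.7]])           # factor loadings, Lambda-hat (4 x 2)
B = np.array([[0.0, 0.0],
              [0.5, 0.0]])             # B-hat: dependent factor regressed on independent
Psi = np.diag([1.0, 0.4])              # block-diagonal Psi: Psi(i) = 1.0, Psi(d) = 0.4
Theta = np.diag([0.3, 0.4, 0.3, 0.5])  # unique factor variances, Theta-hat

I = np.eye(2)
Binv = np.linalg.inv(I - B)            # (I - B)^-1

# Equation 15: Sigma = Lambda (I - B)^-1 Psi (I - B)^-1' Lambda' + Theta
Sigma = Lam @ Binv @ Psi @ Binv.T @ Lam.T + Theta
print(np.round(Sigma, 3))              # Sigma[0, 0] = 1.0 * 1.0 + 0.3 = 1.3
```

Imposing the invariance constraints discussed in the text amounts to using one common B (or Psi(d)) across groups while the group-subscripted matrices remain free.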


Under this model, Equation 15 becomes:

Sg ≅ Λ̂ (I − B̂)⁻¹ [Ψ̂g(i) 0; 0 Ψ̂(d)] (I − B̂)⁻¹′ Λ̂′ + Θ̂g = Σ̂g,  (17)

where all terms are defined above, and the lack of subscripts on certain parameter matrices indicates cross-group invariance constraints on parameter estimates in these matrices.

As a summary of issues in structural invariance, distinctions among levels of constraints similar to those made for factorial invariance may be drawn. Specifically, invariance constraints on the α̂(d) and B̂ matrices are the most important for tests of substantive theory. Once cross-group constraints are imposed on the α̂(d) and B̂ matrices, identical raw score regression models—both the intercepts and regression weights—hold in each group at the latent variable level. Invariance of the α̂(d) and B̂ matrices is a reasonable a priori hypothesis; if sustained, then the lack of group differences at this level is an important finding. By comparison, additional constraints on the Ψ̂g(d) matrices are nice but not necessary. In fact, there are reasonable bases for expecting that the Ψ̂g(d) matrices will vary significantly across groups under sampling from a population, although the α̂(d) and B̂ matrices may display invariance across groups.

Finally, we reiterate our earlier statement that invoking strong or strict structural invariance constraints makes no sense substantively unless at least strong factorial invariance constraints have been placed on the model. Only if invariant latent variables are identified in an identical metric across samples does it make sense to test whether invariant regression parameters hold among these latent variables. Moreover, if the α̂(d) and B̂ are invariant across the general population and specific special populations, then processes bringing about the dependent latent variables are the same across populations, allowing one to generalize theoretical conclusions across groups. On the other hand, if the α̂(d) and B̂ are not invariant across the general population and special populations, then results from the general population cannot be extended to the special populations, and theories regarding the nature of relations among latent variables would require modifications in connection with special populations.

Exploiting Unusual Variation in Special Populations
A final issue that arises in the study of special populations is the frequent finding of unique forms of variability in the special populations. That is, relative to the general population, individuals in a given special population may exhibit substantial variation on key variables that does not exist in marked form in the general population. For example, adaptive behaviors are forms of behavior that enable one to live independently in the community. Because of the way that adaptive behaviors are measured, persons with IQ scores that are at or above the mean of the population may exhibit little variability on a measure of adaptive behavior, as they score at the highest level on every item. This is a quite reasonable outcome, as persons with IQ scores at or above the population mean do not have problems living independently in the community. One tends to find substantially more variability in ratings of adaptive behavior in samples of persons with intellectual disability; again, this is a reasonable outcome, as many persons with intellectual disability have difficulties in one or more domains of adaptive functioning, limiting the quality of their independent living in the community. Of course, if a special population is a subset of the general population, then the marked variation on key variables does exist in the general population. But, if the special population is a rather small subset of the population, then extreme variability on certain variables may be submerged in data on the full population and therefore go unnoticed.

This issue of unusual variation within special populations leads to our fourth methodological implication of the presence of special populations:

Implication 4: Special populations may exhibit unique forms of variability that can be exploited to test theoretical conjectures in ways unavailable in the general population.

We have not uncovered any particular methodological techniques that are unique to the issue of unusual variation within special populations. In the preceding sections, we outlined how quantitative techniques could provide informative ways of answering questions regarding measurement equivalence and structural invariance. If unusual variation were present on one or more variables in a special population, then this unusual variation could well be uncovered in particular matrices in the general structural equation model we described.

Here, we merely highlight the issue of potential unusual variation in a special population.

70 special populations
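One simple way to see such "submerged" variability is to compare group variances directly. The sketch below (Python, standard library only; all numbers are invented for illustration, not drawn from any study discussed here) mimics a ceiling-limited adaptive-behavior scale: scores cluster at the top of the scale in a general-population sample but spread widely in a special-population sample.

```python
import random
import statistics

random.seed(1)

# Hypothetical 0-10 adaptive-behavior scale with a ceiling at 10.
# General population: most people score at or near the ceiling.
general = [min(10, round(random.gauss(9.7, 0.5))) for _ in range(200)]
# Special population: wide individual differences on the same scale.
special = [max(0, min(10, round(random.gauss(6.0, 2.0)))) for _ in range(200)]

var_g = statistics.variance(general)
var_s = statistics.variance(special)
print(f"general-population variance: {var_g:.2f}")
print(f"special-population variance: {var_s:.2f}")
print(f"variance ratio (special / general): {var_s / var_g:.1f}")
```

In pooled data dominated by the general population, this excess variance would be largely invisible, which is exactly the point made in the text.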
In emphasizing this issue, we encourage researchers to be vigilant for the presence of unusual variation. Unusual variation may take any of several forms. Persons in a special population may exhibit variation on dimensions on which members of the general population may show no variability at all. An example of this is special artistic abilities, such as individual differences in composing symphonies or directing movies. No one doubts the presence of individual differences among composers and movie directors, and preference for certain music or movies is often a lively topic of discussion. But, we rarely reflect on the fact that relatively few individuals compose symphonies or direct movies, so no individual differences on these dimensions are evident outside the small group of persons who pursue these distinctive endeavors.

Alternately, members of special populations may exhibit unusually large or small variation on behavioral traits relative to the general population. Persons in a special population may show a particular symptomatic behavior and so exhibit essentially no variation in this behavior, which may fall at one end of a behavioral continuum on which the general population exhibits wide individual variability. For example, persons with Down syndrome often have facial features, including small chin, round face, almond eye shape, and so forth, that are an unusual combination in the general population. Thus, the common facial appearance of persons with Down syndrome, which is distinct from other members of the population, is an early signal that the individual is a member of this special population. Or, persons in a special population may exhibit enhanced or enlarged variability on a dimension relative to the general population. Regardless of the form of unusual variability, such variability may offer unique ways to understand differences across groups.

Examples of Quantitative Explorations of Special Populations
Having outlined various methodological issues and approaches that can be used to characterize special populations, we now describe empirical studies that embody each of the methodological issues that we have discussed. The approaches we have taken and the results obtained can be extended readily by other researchers to the study of special populations.

Identifying and Accessing Participants in Special Populations
One example of the problems that arise when recruiting members of a special population was reported by Nary, Froehlich-Grobe, and Aaronson (2011), who sought to recruit wheelchair users into a randomized controlled exercise trial. Nary et al. set out to recruit 180 participants for their study, or 60 persons in each of three cohorts. The research team used a wide variety of recruiting strategies, beginning with the traditional approach of contacting hospitals, health-care organizations, health-care providers, and disability agencies. When these sources of recruitment led to discouraging results, Nary et al. began using more innovative approaches, including putting flyers in movie theaters and public transportation (e.g., busses), advertising in newspapers, newsletters, and so forth, and employing direct mail coupon packets (identified as ValPak). These recruitment efforts were extremely varied, as Nary et al. listed approximately 30 different locations/activities associated with their recruitment. When participants were recruited, they were asked where they heard about the project. Somewhat surprisingly, the single most effective method was the ValPak direct mail coupon packet and other contacts through the media (e.g., radio and TV advertisements, newspaper advertisements); a full third of the sample of participants was recruited through the media contacts. Although the research team did not fully meet their recruitment goal of 180 participants, they did come close to the goal—and only did so by employing a wide array of approaches that cost much more in time and effort than the research team had anticipated. The Nary et al. paper is an entertaining, if sobering, accounting of the myriad approaches the researchers took to recruit participants and the relative value of the different approaches. Nary et al. have summarized their experiences in a series of five "lessons learned," such as needing to be cognizant that recruiting of persons with disabilities will be more difficult than recruiting persons without disabilities and that more time, effort, and money will be required in recruiting members of special populations than initially thought. The most unfortunate lesson learned was the fourth lesson, which related to the research team finding that health-care agencies were not especially helpful in recruiting participants, although the special population of wheelchair users has, in general, greater reliance on such agencies. The Nary et al. paper is a wonderful place to start when planning to recruit members of populations with disabilities, and learning from the efforts of these researchers might ease recruitment in future studies.

A second example of ways researchers have dealt with identifying and accessing participants in a

special population comes from our California Families Project (CFP), a study of 674 Mexican-origin families in California (e.g., Conger, Song, Stockdale, Ferrer, Widaman, & Cauce, 2012; Conger, Stockdale, Song, Robins, & Widaman, in press). We worked through the school districts in two cities in Northern California to identify potential members of our Mexican-origin population, which we defined as the population of fifth grade public or Catholic school children whose four grandparents were born in Mexico or whose forebears were born in Mexico. Children were drawn at random from the student rosters for the school districts of these two cities. Families of these children were then recruited by telephone or, for cases in which there was no listed phone number, by a recruiter who went to their home. Of the 982 families contacted, 68.6% of these eligible families (N = 674) agreed to participate in the study. All family members were of Mexican origin as determined by their ancestry and their self-identification as being of Mexican heritage. First-, second-, and third-generation children of Mexican origin were eligible for the study. Also, the focal child had to be living with his or her biological mother. Either two-parent (82% of the sample) or single-parent (18% of the sample) families were eligible to participate. In two-parent families, the father had to be the child's biological father.

In addition to the initial recruitment, we face challenges in tracking our families over time, especially given high mobility of this population during times of economic downturn and resulting economic pressure on families. We are currently in the sixth year of assessing families in our longitudinal study and are happy to report that we have a retention rate over 90% for the families in our study. This has taken a number of special approaches to retaining families, including going door-to-door if attempts to contact a family by phone are unsuccessful, contacting persons who know a family (especially if the family has moved), and following families for assessments as they move to other states (e.g., Texas, Arizona) or back to Mexico. In the future, we should document all of the different strategies we have used to keep contact with our sample. An article detailing the different approaches and their relative success would provide a useful comparison to the recruitment problems confronted by Nary et al. (2011) in their study of wheelchair users. Only by keeping recruitment and retention at the highest possible levels will the data generated by our study and by others (e.g., Nary et al., 2011) be optimal for drawing the conclusions we wish to draw. Optimal recruitment and retention is an unexpectedly time-consuming and expensive proposition, and we have our own list of lessons learned in this necessary, thankless task.

Measuring the Same Constructs Across Groups
Economic pressure and depression. As one example of measuring the same constructs across groups, we present an example of measurement invariance across groups using data from three ethnically diverse samples living in the United States. Data for these samples were obtained from the following studies: (1) The Family and Community Health Study (FACHS), a study of 889 African-American children and their families; (2) the CFP, a study of 674 Mexican-origin families and children; and (3) the Family Transitions Project (FTP), a study of 550 European American children and families. For these analyses, we focused our attention on the impact of economic pressure on husbands' and wives' depressive symptoms. Because we were interested in impacts on two-parent families, all three samples were reduced in number, yielding 300 families from the FACHS study, 482 families from the CFP study, and 281 families from the FTP study, for a total of 1063 families across the three groups.

The confirmatory factor model fit to the data for the three samples contained seven manifest variables and three latent variables and is illustrated in Figure 4.1. In the figure, standard figural notation is used: Triangles denote the unit constant used to estimate means (αk) or intercepts (τj), squares or rectangles represent manifest variables, circles or ellipses stand for latent variables, single-headed arrows reflect directed effects (e.g., factor loadings λj, regression weights), and double-headed arrows represent latent variable (ψkk) and unique factor (θj) variances or covariances (ψkk′).

In Figure 4.1, three latent variables are shown: Economic Pressure, Husband Depression, and Wife Depression. All three studies included the same three indicators of Economic Pressure: (1) Unmet Material Needs (U), the average of 4 items assessing unmet material needs in different domains, including "having enough money to afford the kind of home, clothing and food your family needs" (response scale: 1 = "strongly agree" to 4 = "strongly disagree"); (2) Can't Make Ends Meet (C), the average of 2 items measuring difficulties in having

money to cover expenses, including "difficulty paying bills" and "the amount of money left at the end of the month" (response scale: 1 = "no difficulty at all" to 4 = "a great deal of difficulty"); and (3) Financial Cutbacks (F), the number of 11 areas in which the family had to make cutbacks, including "the family postponing major household purchases, changing residence, and/or eliminating medical insurance" (each area scored dichotomously, 0 = "no cutbacks" and 1 = "cutbacks"). Husbands and wives provided answers on the preceding scales, and average responses across husbands and wives were used as analyzed scores. For families in which husbands did not participate, scores on the economic pressure variables were treated as missing data. Scores on all three indicators were reverse-scored, when necessary, so that higher scores indicated greater economic pressure on the family.

Across the three ethnic groups, the depression latent construct was obtained from the Mini-Mood and Anxiety Questionnaire (Mini-MASQ; Clark & Watson, 1997). For this analysis, we used the five items that measure general distress—depression. These items ask both husband and wife to self-report on how much they had felt depressed, discouraged, and/or worthless in the past week. Each item was answered on a scale ranging from 1 = "not at all" to 4 = "very much." Two parcels (one two-item and one three-item parcel) were formed, and these are denoted M1H, M2H, M1W, and M2W for parcels 1 and 2 from husband and wife, respectively.

[Figure 4.1 appears here.]
Figure 4.1 A figural presentation of the relations between latent variables for Economic Pressure, Husband Depression and Wife Depression and their respective indicators (Note: triangle stands for the unit constant, squares for manifest variables, circles or ellipses for latent variables, single-headed arrows for direct effects, and double-headed arrows for symmetric effects such as variances or covariances)

Assessing factorial invariance across groups. We then fit a series of three-group confirmatory factor models to test factorial invariance of the model across the three samples. The first model, Model 1, was a three-factor model that had the same pattern of fixed and free factor loadings in each group. This configural invariance model fit the data well, χ²(33, N = 1063) = 41.99, with RMSEA of 0.028, and CFI and TLI values of 0.996 and 0.992, respectively.

Next, we imposed invariance constraints on factor loadings in the Λ matrix across gender and ethnicity. For example, the loadings for the two indicators of husband depression were constrained to be equal to the loadings for the two indicators of wife depression within each sample; these loadings were then constrained to equality across all three samples. The fit of the resulting model, Model 2, was slightly worse; however, indicators of model fit remained in an acceptable range, χ²(42, N = 1063) = 61.25, p = 0.02, with RMSEA of 0.036, and CFI and TLI scores of 0.991 and 0.987, respectively. The latter indices of practical fit suggest that the invariance constraints imposed on Model 2, which represent weak factorial invariance, are appropriate. In prior research, several authors have discussed changes in TLI or CFI values that might be considered practically significant when comparing nested models. Thus, Widaman (1985) considered a difference in the TLI of less than 0.01 when comparing two models to be a practically unimportant difference in fit, and Cheung and Rensvold (1999) and

Little, Card, Slegers, and Ledford (2007) argued that the CFI must meet this criterion (i.e., a difference in CFI of 0.01 or more between models) to be deemed a difference in fit worthy of note. Using these standards, the changes in fit when moving from the configural invariance model to the weak factorial invariance model were small and unimportant.

Because fit indices for Model 2 were acceptable, we next invoked invariance constraints on the manifest variable intercepts in τ, leading to Model 3. Invariance constraints on all but one τ value in each sample were acceptable; the τ value allowed to vary freely across the three samples was the intercept for the third economic pressure indicator. This partial strong invariance model resulted in poorer statistical fit than prior models, χ²(49, N = 1063) = 95.29, p < 0.001, reflecting a significant worsening of fit relative to the weak invariance model, Δχ²(7, N = 1063) = 34.04, p < 0.001. However, practical fit indices were only modestly worse, with an RMSEA of 0.052, and CFI and TLI values of 0.979 and 0.973, respectively. Despite having worse fit that slightly exceeded the "change of 0.01" criterion for a practically significant difference in fit, the practical fit indexes for Model 3 were still acceptable (i.e., the CI for the RMSEA included 0.05, and both TLI and CFI values were above 0.95), indicating that the partial strong invariance constraints applied were appropriate.

The final confirmatory model, Model 4, imposed invariance constraints on the unique factor variances in Θ and therefore corresponded to strict factorial invariance. Model 4 resulted in relatively poor fit, χ²(65, N = 1063) = 317.84, p < 0.001, with RMSEA of 0.105, and CFI and TLI scores of 0.886 and 0.890, respectively. This represented a significant worsening of both the statistical index of fit, Δχ²(16, N = 1063) = 222.55, p < 0.001, and practical indices of fit when compared to Model 3. Based on the poor practical fit indexes of the strict factorial invariance model, we accepted the partial strong factorial invariance model (Model 3) as the optimal model for the data.

Group differences in means and variances on latent variables. Given the fit of the partial strong factorial invariance model, mean and variance differences across groups could be evaluated in an unbiased fashion. For the latent variables of Economic Pressure and Husband Depression, means were fixed at 0 and variances at unity for the FTP sample, and means and variances were estimated for all other latent variables. In the FTP sample, wives had significantly higher levels of depression (M = 0.40 [SE = 0.10], SD = 1.46) relative to their husbands. The FACHS sample had elevated means of 0.12 [SE = 0.10] (SD = 1.03), 0.28 [SE = 0.13] (SD = 1.37), and 0.45 [SE = 0.12] (SD = 1.27) on the Economic Pressure, Husband Depression, and Wife Depression latent variables, respectively, and the CFP sample also showed higher means on the three factors, respectively, of 0.60 [SE = 0.09] (SD = 0.97), 0.36 [SE = 0.11] (SD = 1.47), and 0.69 [SE = 0.11] (SD = 1.52). Thus, relative to European American families, African-American and Mexican-origin families showed both higher mean levels and greater variability on the three factors assessed in these confirmatory factor models.

Exploring Bounds of Psychological Theories
Ethnic differences in effects of economic pressure on depression. We turn next to ways of exploring the bounds of psychological theories. A substantial body of prior research supports the conclusion that family economic pressure has negative effects on husband and wife depression. Much of this research has been based on data from European American families, and we wanted to see whether the effect of economic pressure on husband and wife depression was similar across ethnic groups.

To answer this question, we returned to the three-group data used to test factorial invariance but examined models that substituted direct effects of economic pressure on husband and wife depression in place of the correlations estimated in prior models. In the first of these models, we imposed invariance constraints on the path coefficients leading from economic pressure to depression across husbands and wives within each sample. This cross-gender constraint resulted in a significant worsening in fit compared to the partial strong invariance model, Δχ²(3, N = 1063) = 12.96, p = 0.01, with moderately worse practical fit indexes, RMSEA of 0.055, and CFI and TLI values of 0.975 and 0.969, respectively. Next, we freed the cross-gender constraints, but imposed equality constraints on the path coefficients of economic pressure on husband depression across groups and on wife depression across groups. This model also resulted in a significant worsening of fit when compared to the partial strong factorial invariance model, Δχ²(4, N = 1063) = 19.65, p < 0.001, with worse practical fit indexes, RMSEA of 0.057, and CFI and TLI values of 0.972 and 0.967, respectively.

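The nested-model comparisons reported above all rest on the chi-square difference (likelihood-ratio) test. The sketch below (Python, standard library only) reproduces the Model 2 versus Model 3 comparison from the text; chi2_sf is a small survival-function routine for integer degrees of freedom, built from the standard recurrence for the regularized upper incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variate with integer df, via the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function Q."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    return (chi2_sf(x, df - 2)
            + (x / 2) ** ((df - 2) / 2) * math.exp(-x / 2) / math.gamma(df / 2))

# Model 2 (weak invariance) vs. Model 3 (partial strong invariance),
# chi-square values as reported in the text:
d_chisq = 95.29 - 61.25   # delta chi-square = 34.04
d_df = 49 - 42            # 7 degrees of freedom
p = chi2_sf(d_chisq, d_df)
print(f"delta chi-square({d_df}) = {d_chisq:.2f}, p = {p:.2g}")  # p well below .001
```

The same function gives, for example, p ≈ 0.47 for a Δχ² of 1.52 on 2 degrees of freedom, matching the nonsignificant comparison reported below.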
In our final model, we imposed invariance constraints on the path coefficients for economic pressure on husband and wife depression across the two minority samples, allowing the corresponding coefficients to be free for the European American sample. This resulted in a nonsignificant change in model fit when compared to the partial strong invariance model, Δχ²(2, N = 1063) = 1.52, p = 0.47. The practical fit indexes for this model were essentially identical to those for the strong factorial invariance model, so we chose this model as the optimal representation of our data.

Parameter estimates in this model suggest that the impact of economic pressure on the level of depression for minority couples differs from that for European American couples. Specifically, economic pressure appears to have the strongest effect (β = 0.69) on depression in minority husbands (both African-American and Mexican-American husbands), and a significantly weaker effect (β = 0.47) on depression in minority wives. In contrast, economic pressure had a stronger effect (β = 0.54) on European American wives than on European American husbands (β = 0.23), who were least affected by economic pressure.

Parenting styles and child behavior. A second example of exploring the bounds of psychological theories developed on the general population is worthy of note. In this study, Widaman and Borthwick-Duffy (April, 1990) reported results from 109 families with a child with intellectual disability. Key findings on parenting styles, or styles of parenting behaviors, by researchers such as Baumrind (1968, 1971, 1991; Baumrind, Larzelere, & Owens, 2010) and Hoffman (1975, 1979, 1994) supported the presence of several dimensions of parenting behavior. The general finding of this research was that authoritative parenting (high control, plus high nurturance/warmth) was associated with optimal forms of child behavior and that authoritarian parenting (high control and power assertion, plus low nurturance and warmth) and permissive parenting (low control, plus high nurturance and warmth) both led to less optimal levels of child behavior.

Widaman and Borthwick-Duffy (April, 1990) isolated a total of seven dimensions of parenting behavior in their study, including (1) nurturance/warmth, (2) induction, (3) maturity demands, (4) promoting autonomy, (5) firm control, (6) love withdrawal, and (7) power assertion. Consistent with research on the general population, Widaman and Borthwick-Duffy found that the positive parenting behaviors of induction, maturity demands, and promoting autonomy had the strongest, positive effects on longitudinal changes in different forms of adaptive functioning, such as practical skills (e.g., dressing oneself), conceptual competence (e.g., functional academics), and social competence (e.g., making and keeping friends). However, contrary to research on the general population, Widaman and Borthwick-Duffy found that the authoritarian dimension of power assertion consistently had the strongest effects on longitudinal changes in negative behaviors such as social maladaption (e.g., aggression, property destruction) and personal maladaption (e.g., self-injurious behavior). Although the standardized regression weights were not large (βs ranged from 0.15 to 0.24), power assertion was the only parenting style to impact changes in maladaptive behaviors. Thus, the proscription of power assertion as a less useful form of parenting behavior was based on research on families in the general population. However, to reduce the levels of maladaptive behaviors in children with intellectual disability, power assertion appears to be the only viable parenting option.

Exploiting Unusual Variability in Special Populations
Adaptive behavior in persons with intellectual disability. The final analytic issue to illustrate is the exploiting of unusual variability in special populations. One example of unusual variability in special populations involves the domain of adaptive behavior, which attains significance in the population of persons with intellectual disability. The three key dimensions of adaptive behavior are practical competence (or independent living skills), conceptual competence (or cognitive skills), and social competence (or social skills); all three dimensions represent everyday behaviors that enable a person to live independently in the community.

Widaman, Gibbs, and Geary (1987) utilized a database maintained by the Department of Developmental Services (DDS) of the State of California. DDS required all persons receiving state services for developmental disability to be assessed on a 66-item instrument of adaptive behavior. The instrument assessed the three key dimensions of practical, conceptual, and social competence and also assessed three additional dimensions of motor competence, social maladaption, and personal maladaption. Widaman et al. extracted 14 groups of individuals based on a crossing of age (children,

adolescents, and adults), levels of intellectual disability (mild, moderate, and severe), and placement (home, community). Thus, these 14 groups are special subgroups defined by demographic and functional characteristics from the special population of persons with intellectual disability. Widaman et al. confirmed essentially identical six-factor solutions across all 14 groups; interested readers are referred to the published article for details. Of importance here is the fact that, given the way the 66 items are phrased, persons who do not have an intellectual disability would likely score at the highest scale point on each item, therefore failing to exhibit any variability in responses on the items. Only persons with intellectual disability exhibit substantial variability on the items, so the dimensions of adaptive behavior only have ready application in this special population.

However, this research on adaptive behavior offers a chance to make recommendations for future research. The instrument developed by the California DDS contained items that exhibit much variability only in samples of persons with developmental disabilities, because persons without developmental disabilities would have ceiling effects on all items. This does not mean, however, that persons without developmental disabilities do not display individual differences in adaptive forms of behavior. More likely, the lack of variance in a sample of persons without developmental disabilities is a measurement issue or problem. This measurement problem could be confronted by developing a set of harder items for each dimension of adaptive behavior, so that items would have a higher ceiling and persons without developmental disabilities might not "top out" on every item. If this were done, then a computerized adaptive testing (CAT) approach could be developed to administer a unique set of items to each examinee, presenting items that would enable precise measurement of the individual's standing on each dimension, although only a minimal number of items would be needed to do so. Waller and Reise (1989) described how to apply the CAT approach to personality measurement, and Reise and Waller (2003) discussed application of sophisticated techniques to assessing psychopathology. Similar approaches could undoubtedly be used to assess adaptive behaviors across the entire population in a more adequate way.

Effects of prenatal exposure to phenylalanine. A second example of unusual variability in special populations that provides a unique opportunity to explore relations among variables is derived from the Maternal PKU Collaborative (MPKUC) study. Phenylketonuria is a well-known genetically based disorder, which results in disrupted metabolism of phenylalanine (PHE) into tyrosine. If left untreated, infants with PKU who are normal at birth suffer permanent brain damage that leads to severe mental retardation (mean IQ of 50) by age 2 years. However, with early identification and placement of infants on a diet low in PHE, the negative effects of PKU can be circumvented, and children with PKU can grow to adulthood with no evidence of intellectual disability (Koch & de la Cruz, 1999a, 1999b).

However, the story of maternal PKU is more complex. If a woman with PKU does not maintain a low-PHE diet when pregnant, then the increased PHE in her blood crosses the placental barrier and exposes the developing fetus to high levels of PHE. For persons with PKU, levels of endogenous PHE over 6 mg/dL have been found to be teratogenic, which means that such levels lead to negative effects on behavior. However, for the developing fetus, levels of exogenous PHE (i.e., PHE from the mother) that were sufficient to cause teratogenic effects had never been identified. The MPKUC study (Koch, de la Cruz, & Azen, 2003) began in 1984 as an intervention study to help pregnant women maintain low-PHE diets and therefore maintain low blood PHE levels during pregnancy. A secondary goal of the MPKUC study was to monitor maternal PHE levels throughout pregnancy to study the relation between PHE levels and child outcomes.

Table 4.1 shows descriptive statistics for participants from the MPKUC study for three mother variables (Full Scale IQ, PHE level when on an unrestricted diet, and average PHE level during pregnancy) and three child variables (Verbal, Performance, and Full Scale IQ). Various forms of unusual variability are contained in this table. First, all four of the IQ variables exhibit substantial deviation from the population mean: The mean mother IQ was approximately a full standard deviation below the population mean of 100, and the three mean child IQ scores are more than one-half a standard deviation below the population mean. Furthermore, the three child IQ scores show markedly greater variability than in the population, with SDs greater than 21 relative to a population SD of 15. The excessive variability in child IQ scores is the result of a larger-than-expected number of observations with rather low IQ, presumably resulting from higher levels of prenatal exposure to PHE. The indices of mother PHE levels also exhibit unusual variability. Persons who do not have the PKU genetic defect

76 special populations
Table 4.1. Descriptive Statistics on Variables from the Maternal PKU Study

Variable                                  N     Mean    SD     Min    Max
Mother full-scale IQ                      379   85.90   13.65  40     130
Mother PHE level on unrestricted diet     413   22.03    9.18   3.30   51.10
Mother PHE level during pregnancy         412    8.23    4.49   1.30   28.30
Child verbal IQ, 7 years                  284   92.06   22.39  40     142
Child performance IQ, 7 years             285   92.00   21.86  40     133
Child full-scale IQ, 7 years              284   91.35   23.21  35     139

Note: Mother full-scale IQ based on Wechsler Adult Intelligence Scale—Revised; Mother PHE levels on regular diet and during pregnancy are in mg/dL; Child verbal, performance, and full-scale IQ based on Wechsler Intelligence Scale for Children—Revised.
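The two models fit to these data in the text, a simple linear regression and a two-piece linear spline with a freely estimated knot, can be sketched as follows. This is a minimal illustration fit to simulated data patterned loosely on the reported estimates; it does not use the MPKUC data, and all variable names and numeric values are assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(42)

# Simulated data loosely patterned on the reported results: child IQ is
# roughly flat below a knot near 5.5 mg/dL of PHE and declines about
# 4 IQ points per mg/dL above it.  These are NOT the MPKUC data.
phe = rng.uniform(1.3, 28.3, 400)
iq = 104 - 4.1 * np.clip(phe - 5.5, 0, None) + rng.normal(0, 12, phe.size)

# Model 1: ordinary linear regression, predicted IQ = b0 + b1 * PHE.
b1, b0 = np.polyfit(phe, iq, 1)

# Model 2: two-piece linear spline with a freely estimated knot,
# predicted IQ = intercept + slope * max(0, PHE - knot).
def spline(x, intercept, slope, knot):
    return intercept + slope * np.clip(x - knot, 0, None)

params, cov = curve_fit(spline, phe, iq, p0=[100.0, -4.0, 5.0])
ses = np.sqrt(np.diag(cov))

# 95% confidence intervals as estimate +/- 1.96 standard errors, the same
# construction used for the interval estimates reported in the text.
cis = [(est - 1.96 * se, est + 1.96 * se) for est, se in zip(params, ses)]
```

With enough observations on both sides of the knot, the spline recovers both the threshold and the post-threshold slope, which is what distinguishes it from the single-slope model.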

would exhibit extremely low PHE values, below 1.0 and near 0, on these two measures. In contrast to this, mother PHE levels on an unrestricted diet varied between 3.3 and 51.1, and mother PHE levels during pregnancy varied between 1.3 and 28.3.

To model the effect of prenatal PHE exposure on child Full Scale IQ at 7 years, one could use a linear regression model. The result of fitting this model is the following equation:

predicted IQ = 119.5 − 3.53 (PHE)

The standard errors for the intercept and regression slope were 2.13 and 0.24, respectively, so interval estimates (95% CIs) for the two parameter estimates were approximately 117.2 to 123.7 and −4.00 to −3.05 for the intercept and slope, respectively. The above equation represents the unrealistic expectation that a child's IQ at age 7 years would be approximately 120 if his/her mother had maintained a PHE level of 0 during pregnancy, and that IQ would drop about 3.5 points for every 1 mg/dL increase above this value of 0.

An alternative regression model is a two-piece linear spline, with three parameter estimates: an intercept, a knot point that estimates the PHE exposure at which a teratogenic effect begins to occur, and the linear slope representing the teratogenic effect after the knot point. The results for this model were:

predicted IQ = 103.9 − 4.14 (PHE), knot = 5.50.

In this equation, children who experienced prenatal PHE levels between 0 and 5.50 all had an expected 7-year IQ of about 104, near the population mean. Moreover, after the knot point of 5.50, the predicted IQ drops over 4.1 IQ points for every 1 mg/dL increase. Because the standard errors for the three coefficients were 1.70, 0.32, and 0.62, the point estimates and interval estimates for the coefficients are: intercept = 103.9, 95% CI [100.5, 107.3], regression slope for PHE = −4.14, 95% CI [−4.78, −3.50], and knot point = 5.50, 95% CI [4.25, 6.75].

Exploiting the unusual variability of mother PHE levels during pregnancy allowed us to reach several important goals. First, we could verify the nature of the teratogenic effect of prenatal exposure to PHE on child cognitive outcomes, which is nonlinear in nature with a threshold for teratogenic effects. Second, we could estimate the level of exogenous PHE at which the teratogenic effect begins to occur, which is close to the level of endogenous PHE often assumed to have teratogenic effects for persons with PKU. Third, these results could be used as the basis for recommendations for monitoring PHE levels in mothers with PKU, attempting to ensure that mothers with PKU keep their PHE levels below the level at which teratogenic effects are likely to occur. Additionally, all of these goals were accomplished only because of the unusual variability in PHE levels exhibited by mothers with PKU in the MPKUC study. Readers are referred to Widaman (2009) for an informative summary of findings of the MPKUC study.

Conclusions
The presence of special populations—however delineated—provides opportunities beyond those afforded by drawing repeated samples that are representative of the general population. If we disregarded the presence of special populations, then we might develop a corpus of scientific findings that
applies to “the” population—whatever that is—but groups, but how should we characterize the
in actuality applies to very few individuals or, at magnitudes of effects, what magnitude of effects
least, fails to apply to members of important special should be considered practically important, and
populations. The mere existence of special popula- would the magnitude of effects considered important
tions, however these are defined, challenges us to vary across domains of research?
verify that the most important of our findings are Question 5: When does unusual variability represent a
not moderated substantially as a function of special valid measurement outcome, and when does unusual
population status. (or different) variability across groups represent a
We outlined four primary methodological issues failure to assess individual differences adequately or in
that arise when investigating phenomena in the comparable fashion across groups? Measurement is
presence of special populations. These four issues the basis on which the whole enterprise of science is
involve identifying and assessing participants in spe- erected, and concerted attention to accurate
cial populations, verifying that we are measuring the measurement of individual differences across the
same things across populations, discovering whether entire span of a dimension is crucial to answering
sociobehavioral processes unfold in the same ways questions such as these. Use of the most up-to-date
across populations, and searching for unusual vari- measurement approaches, such as CAT, would go a
ability that might provide unique ways of viewing long way to resolving issues of differences in variance
behavioral phenomena and testing theoretical con- across groups.
jectures. Given the rather recent upsurge of research
pursued under the explicit rubric of special popula- The future is bright for methodological and
tions, the implications of special populations on the quantitative innovations in the study of special pop-
way we do science will only increase in the future. ulations. As we documented at the start of this
Furthermore, we expect that the recognition of the chapter, labeling groups as special populations is
presence of special populations will only enrich the a relatively recent phenomenon, largely a product
way we view social and behavioral science as we move of the past three decades. Moreover, the majority
further into the twenty-first century. of statistical methods and techniques for studying
differences across groups are also of recent origin.
Many of the most advanced techniques have been
Future Directions available for, at most, 30 years, and new ways in
Several future directions for research and the- which these methods can be used to illuminate sim-
ory with regard to quantitative methodology as ilarities and differences across populations are being
applied to special populations can be drawn from developed on an almost daily basis. Improved, more
the material presented earlier in this chapter. These sophisticated understanding of the nature of special
are provided as a series of questions to guide future populations is occurring at the genetic, biologi-
research. cal, and psychological/behavioral levels, and optimal
use of methodological approaches and quantitative
Question 1: What is the best way to conceptualize special techniques is a crucial element that will push this
populations? Or, are alternative bases for identifying endeavor forward. Indeed, new quantitative tech-
special populations the best way to proceed? niques or innovative use of existing techniques may
Question 2: Is a “one-size-fits-all” conception of special well be the key that unlocks the door to advanced
populations possible, or will the definition of special understanding of why special populations deserve
populations vary as a function of the research their distinctive status.
question asked?
Question 3: How should a researcher proceed if full
strong factorial invariance does not hold but only Author Note
partial strong invariance is exhibited by data? How Support for this work was provided by grants
seriously does partial strong factorial invariance from the National Institute of Child Health and
impede scientific conclusions relative to full Human Development, the National Institute on
invariance of all factor loadings and intercepts in the Drug Abuse, and the National Institute of Men-
strong factorial invariance model? tal Health (DA017902, HD047573, HD051746,
Question 4: How large must differences be across special and MH051361).
populations before differences are considered Correspondence regarding this manuscript should
important? Large sample sizes lead to increased power be addressed to Keith Widaman, Department
to detect statistically significant differences across of Psychology, University of California at Davis,

One Shields Avenue, Davis, CA 95616. E-mail: kfwidaman@ucdavis.edu

References
Barton, B. (2009). Dreams deferred: Disability definitions, data, models, and perspectives. Journal of Sociology and Social Welfare, 36(4), 13–24.
Baumrind, D. (1968). Authoritarian vs. authoritative parent control. Adolescence, 3, 255–272.
Baumrind, D. (1971). Current patterns of parental authority. Developmental Psychology, 4(1, Pt. 2), 1–103.
Baumrind, D. (1991). The influence of parenting style on adolescent competence and substance abuse. The Journal of Early Adolescence, 11, 56–95.
Baumrind, D., Larzelere, R. E., & Owens, E. B. (2010). Effects of preschool parents' power assertive patterns and practices on adolescent development. Parenting: Science and Practice, 10, 157–201.
Belsky, J., Bakermans-Kranenburg, M. J., & van IJzendoorn, M. H. (2007). For better and for worse: Differential susceptibility to environmental influences. Current Directions in Psychological Science, 16, 300–304.
Bronfenbrenner, U. (1977). Toward an experimental ecology of human development. American Psychologist, 32, 513–531.
Bronfenbrenner, U. (1986a). Ecology of the family as a context for human development: Research perspectives. Developmental Psychology, 22, 723–742.
Bronfenbrenner, U. (1986b). Recent advances in research on the ecology of human development. In R. K. Silbereisen, K. Eyferth, & G. Rudinger (Eds.), Development as action in context: Problem behavior and normal youth development (pp. 287–309). Heidelberg: Springer-Verlag.
Bronfenbrenner, U. (1999). Environments in developmental perspective: Theoretical and operational models. In S. L. Friedman & T. D. Wachs (Eds.), Measuring environment across the life span: Emerging methods and concepts (pp. 3–28). Washington, DC: American Psychological Association.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.
Caspi, A., McClay, J., Moffitt, T. E., Mill, J., Martin, J., Craig, I. W., et al. (2002). Role of genotype in the cycle of violence in maltreated children. Science, 297, 851–854.
Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., et al. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science, 301, 386–389.
Chakrabarti, B., Dudbridge, E., Kent, L., Wheelwright, S., Hill-Cawthorne, G., Allison, C., et al. (2010). Genes related to sex steroids, neural growth, and social-emotional behavior are associated with autistic traits, empathy, and Asperger syndrome. Autism Research, 2, 157–177.
Chassin, L., Knight, G., Vargas-Chanes, D., Losoya, S. H., & Naranjo, D. (2009). Substance use treatment outcomes in a sample of male serious juvenile offenders. Journal of Substance Abuse Treatment, 36, 183–194.
Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equation Modeling, 12, 471–492.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.
Clark, L. A., & Watson, D. (1997). The Mini Mood and Anxiety Symptom Questionnaire (Mini-MASQ). Unpublished manuscript, University of Iowa.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Cobb, B., Lehmann, J., Newman-Gonchar, R., & Alwell, M. (2009). Self-determination for students with disabilities: A narrative metasynthesis. Career Development for Exceptional Individuals, 32, 108–114.
Colarelli, S. M., Han, K., & Yang, C. (2010). Biased against whom? The problems of "group" definition and membership in test bias analyses. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 228–231.
Conger, R. D., Song, H., Stockdale, G. D., Ferrer, E., Widaman, K. F., & Cauce, A. M. (2012). Resilience and vulnerability of Mexican origin youth and their families: A test of a culturally-informed model of family economic stress. In P. K. Kerig, M. S. Schulz, & S. T. Hauser (Eds.), Adolescence and beyond: Family processes and development (pp. 268–286). New York: Oxford University Press.
Conger, R. D., Stockdale, G. D., Song, H., Robins, R. W., & Widaman, K. F. (in press). Predicting change in substance use and substance use cognitions of Mexican origin youth during the transition from childhood to early adolescence. In Y. F. Thomas, L. N. Price, & A. V. Lybrand (Eds.), Drug use trajectories among African American and Hispanic Youth. New York: Springer.
Diaz, N., Horton, E. G., McIlveen, J., Weiner, M., & Nelson, J. (2009). Dysthymia among substance abusers: An exploratory study of individual and mental health factors. International Journal of Mental Health and Addiction, 7, 357–367.
Eastwood, S. L., & Harrison, P. J. (2010). Markers of glutamate synaptic transmission and plasticity are increased in the anterior cingulate cortex in bipolar disorder. Biological Psychiatry, 67, 1010–1016.
Ferrer, E., Balluerka, N., & Widaman, K. F. (2008). Factorial invariance and the specification of second-order latent growth models. Methodology, 4, 22–36.
Folks, H. (1931). Committee C-1—Socially handicapped—Dependency and neglect. In White House Conference on Child Health and Protection, White House conference 1930: Addresses and abstracts of committee reports, White House conference on child health and protection called by President Hoover (pp. 319–340). New York: The Century Co.
Grigorenko, E. L., Klin, A., & Volkmar, F. (2003). Annotation: Hyperlexia: disability or superability? Journal of Child Psychology and Psychiatry, 44, 1079–1091.
Hancock, G. R., Kuo, W.-L., & Lawrence, F. R. (2001). An illustration of second-order latent growth models. Structural Equation Modeling, 8, 470–489.
Henry, P. J. (2008). College sophomores in the laboratory redux: Influences of a narrow data base on social psychology's view of the nature of prejudice. Psychological Inquiry, 19, 49–71.
Hoffman, M. L. (1975). Altruistic behavior and the parent-child relationship. Journal of Personality and Social Psychology, 31, 937–943.
Hoffman, M. L. (1979). Development of moral thought, feeling, and behavior. American Psychologist, 34, 958–966.
Hoffman, M. L. (1994). Discipline and internalization. Developmental Psychology, 30, 26–28.
Horn, J. L., McArdle, J. J., & Mason, R. (1983). When is invariance not invariant: A practical scientist's look at the
ethereal concept of factorial invariance. Southern Psychologist, 1, 179–188.
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory-Versions 4a and 54. Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Koch, R., & de la Cruz, F. (1999a). Historical aspects and overview of research on phenylketonuria. Mental Retardation and Developmental Disabilities Research Reviews, 5, 101–103.
Koch, R., & de la Cruz, F. (Eds.). (1999b). Phenylketonuria. Mental Retardation and Developmental Disabilities Research Reviews, 5, 101–161.
Koch, R., de la Cruz, F., & Azen, C. G. (Eds.). (2003). The Maternal Phenylketonuria Collaborative Study: New developments and the need for new strategies. Pediatrics, 112, 1513–1587.
Koychev, I., El-Deredy, W., Haenschel, C., & Deakin, J. F. W. (2010). Visual information processing deficits as biomarkers of vulnerability to schizophrenia: An event-related potential study in schizotypy. Neuropsychologia, 48, 2205–2214.
Kuo, C.-C., Maker, J., Su, F.-L., & Hu, C. (2010). Identifying young gifted children and cultivating problem solving abilities and multiple intelligences. Learning and Individual Differences, 20, 365–379.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Little, T. D., Bovaird, J. A., & Card, N. A. (Eds.) (2007). Modeling contextual effects in longitudinal studies. Mahwah, NJ: Erlbaum.
Little, T. D., Card, N. A., Slegers, D. W., & Ledford, E. C. (2007). Representing contextual effects in multiple-group MACS models. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 121–147). Mahwah, NJ: Erlbaum.
Lubinski, D., & Benbow, C. P. (2006). Study of Mathematically Precocious Youth after 35 years: Uncovering antecedents for the development of math-science expertise. Perspectives on Psychological Science, 1, 316–345.
Martz, E., & Daniel, S. (2010). Comparing disability prototypes in the United States and Kenya. Journal of Applied Rehabilitation Counseling, 41, 19–25.
McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated measures data. In J. R. Nesselroade & R. B. Cattell (Eds.), Handbook of multivariate experimental psychology (2nd ed., pp. 561–614). New York: Plenum.
Meade, A. W., & Tonidandel, S. (2010). Not seeing clearly with Cleary: What test bias analyses do and do not tell us. Industrial and Organizational Psychology: Perspectives on Science and Practice, 3, 192–205.
Meredith, W., & Horn, J. (2001). The role of factorial invariance in modeling growth and change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 203–240). Washington, DC: American Psychological Association.
Meredith, W. M. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and new directions (pp. 131–152). Mahwah, NJ: Erlbaum.
Nary, D. E., Froehlich-Grobe, K., & Aaronson, L. (2011). Recruitment issues in a randomized controlled exercise trial targeting wheelchair users. Contemporary Clinical Trials, 32, 188–195.
Nesselroade, J. R. (1983). Temporal selection and factor invariance in the study of development and change. In P. B. Baltes & O. G. Brim, Jr. (Eds.), Life-span development and behavior (Vol. 5, pp. 59–87). New York: Academic Press.
Obradović, J., Long, J. D., Cutuli, J. J., Chan, C.-K., Hinz, E., Heistad, D., & Masten, A. S. (2009). Academic achievement of homeless and highly mobile children in an urban school district: Longitudinal evidence on risk, growth, and resilience. Development and Psychopathology, 21, 493–518.
Olson, I. R., Berryhill, M. E., Drowos, D. B., Brown, L., & Chatterjee, A. (2010). A calendar savant with episodic memory impairments. Neurocase, 16, 208–218.
Or, D., Cohen, A., & Tirosh, E. (2010). The reliability and validity of the Aggregate Neurobehavioral Student Health and Educational Review Parent's Questionnaire (ANSER-PQ). Journal of Child Neurology, 25, 157–164.
Pennington, K., Beasley, C. L., Dicker, P., Fagan, A., English, J., Pariante, C. M., et al. (2008). Prominent synaptic and metabolic abnormalities revealed by proteomic analysis of the dorsolateral prefrontal cortex in schizophrenia and bipolar disorder. Molecular Psychiatry, 13, 1102–1117.
Petrill, S. A. (2010). Editorial: Identifying the genetic and environmental influences on psychological outcomes. Journal of Child Psychology and Psychiatry, 51, 745–746.
Polinsky, M. L., Hser, Y.-I., & Grella, C. E. (1998). Consideration of special populations in the drug treatment system of a large metropolitan area. Journal of Behavioral Health Services & Research, 25, 7–21.
Reichenberg, A., Caspi, A., Harrington, H., Houts, R., Keefe, R. S. E., Murray, R. M., et al. (2010). Static and dynamic cognitive deficits in childhood schizophrenia: A 30-year study. The American Journal of Psychiatry, 167, 160–169.
Reise, S. P., & Waller, N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184.
Rensvold, R. B., & Cheung, G. W. (1998). Testing measurement models for factorial invariance: A systematic approach. Educational and Psychological Measurement, 58, 1017–1034.
Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow database on social psychology's view of human nature. Journal of Personality and Social Psychology, 57, 515–530.
Sears, D. O. (2008). College student-itis redux. Psychological Inquiry, 19, 72–77.
Simunovic, F., Yi, M., Wang, Y., Macey, L., Brown, L. T., Krichevsky, A. M., et al. (2009). Gene expression profiling of substantia nigra dopamine neurons: Further insights into Parkinson's disease pathology. Brain, 132, 1795–1809.
Sussman, S. (2006). Prevention of adolescent alcohol problems in special populations. In M. Galanter (Ed.), Alcohol problems in adolescents and young adults: Epidemiology, neurobiology, prevention, and treatment (pp. 225–253). New York: Springer Science & Business Media.
Uehling, H. F. (1952). Rorschach "shock" for two special populations. Journal of Consulting Psychology, 16, 224–225.

Waller, N. G., & Reise, S. P. (1989). Computerized adaptive personality assessment: An illustration with the Absorption scale. Journal of Personality and Social Psychology, 57, 1051–1058.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Widaman, K. F. (2007). Intrauterine environment affects infant and child intellectual outcomes: Environment as direct effect. In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.), Modeling contextual effects in longitudinal studies (pp. 387–436). Mahwah, NJ: Erlbaum.
Widaman, K. F. (2009). Phenylketonuria in children and mothers: Genes, environments, behavior. Current Directions in Psychological Science, 18, 48–52.
Widaman, K. F., & Borthwick-Duffy, S. A. (April, 1990). Parental influences on the development of adaptive behaviors. Paper presented at the Gatlinburg Conference on Research and Theory in Mental Retardation and Developmental Disabilities, Brainerd, MN.
Widaman, K. F., Gibbs, K. W., & Geary, D. C. (1987). The structure of adaptive behavior: I. Replication across fourteen samples of nonprofoundly retarded persons. American Journal of Mental Deficiency, 91, 348–360.
Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association.



CHAPTER 5

Theory Construction, Model Building, and Model Selection

James Jaccard

Abstract

General issues involved in (1) building causal theories, (2) translating those theories into a set of
mathematical representations, (3) choosing an analytic strategy to estimate parameters in the
equations implied by the theory, and (4) choosing the “best” model from a set of competing models
are discussed. Theory construction fundamentally relies on six relationship types, including direct
effects, indirect effects, moderated effects, reciprocal effects, spurious effects and unanalyzed
relationships. Once specified, each type of effect can be represented mathematically, thereby
translating a path diagram into a set of (linear) equations. Parameters in the equations are then
estimated using either limited information estimation approaches or full information estimation
approaches, taking into account measurement properties, population distributions, and matters of
robustness. Choices between competing models are based on indices of relative fit with the data and
relative fit with competing models, as well as more general theoretical criteria (e.g., parsimony,
consistency with related theories).
Key Words: Theory; theory construction; modeling; parameter estimation
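The translation described in the abstract, from a path diagram to a set of linear equations whose parameters are then estimated, can be sketched with a minimal hypothetical three-variable model containing one direct and one indirect effect. The variable names, coefficients, and data below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Hypothetical path diagram:  X -> M -> Y, plus a direct path X -> Y.
# Each endogenous variable yields one linear equation:
#   M = a*X + e1        Y = b*M + c*X + e2
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.4 * m + 0.3 * x + rng.normal(size=n)

def ols(predictors, outcome):
    """Fit one equation by ordinary least squares (an intercept is added)."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    return np.linalg.lstsq(design, outcome, rcond=None)[0]

a_hat = ols([x], m)[1]
b_hat, c_hat = ols([m, x], y)[1:]
indirect_effect = a_hat * b_hat   # effect of X on Y transmitted through M
direct_effect = c_hat
```

Estimating each equation separately in this way corresponds to a limited information strategy; full information approaches instead estimate all equations of the system simultaneously.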

Few people dispute that theory is at the heart of the scientific enterprise. We use theories to explain phenomena and to help solve important applied problems. With theoretical propositions in hand, we design research to gain perspectives on the viability of the theory and imbue those propositions with a certain degree of confidence based on the results of research and the quality of the research design. Theory is fundamental to the social sciences.

Advances in statistical analysis have been considerable during the past 30 years. Our statistical toolbox was at one time somewhat limited, and it was not uncommon for analysts to adopt practices that forced data to conform to the assumptions of our statistical tools so that we could apply inferential methods of analysis to those data. For example, although we knew that a construct like depression was non-normally distributed in the population and that our measures of depression and data reflected this, we would transform data on depression so that it would approximate a normal distribution and be amenable to analysis using statistical methods that assumed normality. Such days, fortunately, are over. The statistical tools available now allow us to give precedence to theory and model testing without being slave to many of the traditional assumptions made by methods of analysis. The present chapter discusses issues that analysts need to take into account as they move from theory to analysis. I focus first on the nature of theory in the social sciences, with an emphasis on describing theories that invoke the language of causality. Although there are many other approaches to theory and modeling, one cannot deny the prominence and pervasiveness of causal thinking in social science research. Hence, causal frameworks capture the bulk of my attention. Next, I discuss issues involved in moving from a well-specified causal

theory to a mathematical representation of that theory for purposes of statistical analysis. I then discuss a range of issues that must be considered as one moves from a set of equations representing a theory to formal data analysis, including measurement considerations, full versus limited information estimation strategies, distributional assumptions, and the possible presence of outliers. Finally, I consider issues in model selection, which refers to the process of choosing one model as being the “best” from a set of candidate models.

In the present chapter, I use the terms theory and model interchangeably, although some scientists do not do so. As examples, various authorities contend that models are a special type of theory (e.g., Coombs, Dawes, & Tversky, 1970, p. 4; Kaplan, 1964, p. 263), models are portions of theories (Sheth, 1967, p. 444; Torgerson, 1958, p. 4), models are derived from theories (e.g., Pap, 1962, p. 355), models are simplified versions of theories (e.g., Carnap, 1971, p. 54), models represent correspondence between two or more theories (Brodbeck, 1968), or theories represent specific interpretations of (i.e., are derived from) models (e.g., Green & Tull, 1975, p. 42). Others consider the terms to be synonymous (cf. Dubin, 1976; Simon & Newell, 1956). Although there may indeed be meaningful distinctions between theories and models, it also is the case that models, like theories, involve variables and relationships between variables, usually invoking the concept of causality. Accordingly, I will use the terms theory and model interchangeably.

Specifying a Causal Theory
The Nature of Causality
Theories take many forms in the social sciences. One common form involves specifying presumed relationships between variables while invoking the concept of causality. The nature of causality has been debated extensively by philosophers of science (e.g., Bunge, 1961; Cartwright, 2007; Frank, 1961; Morgan & Winship, 2007; Pearl, 2000; Russell, 1931; Rubin, 1974, 1978; Shadish, Cook, & Campbell, 2002), and most agree that causality is an elusive concept that is fraught with ambiguities. It is beyond the scope of this chapter to delve into the underlying philosophical issues (see Jaccard & Jacoby, 2010). Rather, I emphasize here a “working model” of causality that is adopted by most social scientists.

The concept of causality is usually thought of in terms of change—that is, X is a cause of Y if changes in X produce changes in Y (but see Sowa, 2000, and Lewis, 2000, for alternative conceptualizations). Four properties of causality are typically emphasized. First, a cause always must precede an effect in time. Second, the time that it takes for a change in X to produce a change in Y can vary, ranging from almost instantaneous change to years, decades, centuries, or millennia. Third, the nature and/or strength of the effect of X on Y can vary depending on context. X may influence Y in one context but not another context. Finally, cause and effect must be in some form of spatial contact or must be connected by a chain of intermediate events. We return to these ideas in later sections of this chapter.

The Nature of Theories that Use Causality
The focus of most causal theories is on explaining why variation in one or more variables exists. Some people make a great deal of money and others are poor. Why? What can account for this variation? Some people are able to remember complex material easily, whereas for other people it is difficult to do so. Why? What explains this variability? We answer these questions by specifying the presumed causes of the variability, and then we seek to test our theoretical proposition(s).

In any given causal theory, there are six fundamental types of relationships that can be referenced; these are illustrated in Figure 5.1. These six relationship types are the cornerstone of causal theories and define the universe of causal relationships that theorists draw upon. In Figure 5.1, a variable is indicated by a box, and a causal influence is represented by a straight arrow emanating from the cause and pointing to the effect. I discuss the bidirectional curved arrow in Figure 5.1 shortly.

Referring to Figure 5.1, a direct causal relationship is one in which a given cause is assumed to have a direct causal impact on some outcome variable. For example, exposure to violence in the media is assumed to influence aggressive behavior. Or, the quality of the relationship between a mother and her adolescent child is assumed to influence whether the child uses drugs. By contrast, an indirect causal relationship is when a variable influences another variable indirectly through its impact on an intermediary variable (see Fig. 5.1). For example, failing to accomplish a goal may lead to frustration that, in turn, causes someone to aggress against another. In this case, the failure to obtain a goal is an indirect cause of aggression. It influences aggression through its impact on frustration. Frustration is formally called a mediating variable or,

jaccard 83
[Figure 5.1 Relationships in Causal Models: six panels showing Direct Causal Relationship (X → Y); Indirect Causal Relationship (X → Z → Y); Spurious Relationship (Z → X and Z → Y); Bidirectional Causal Relationship (X ⇄ Y); Unanalyzed Relationship (X and Y joined by a curved double-headed arrow); and Moderated Causal Relationship (Z pointing at the X → Y path).]

more informally, a mediator, because other variables “work through” it to influence the outcome. Indirect relationships are sometimes called mediated relationships.

Whenever a theorist posits an indirect relationship, the issue of whether to specify partial or complete mediation presents itself. In partial mediation, the distal variable has a direct effect on the outcome variable over and above its effect on the mediator. In complete mediation, all of the impact of the distal variable on the outcome variable is accounted for by the mediator. With partial mediation, in addition to the two causal paths characterizing the indirect effect illustrated in Figure 5.1, one adds an additional causal path linking the distal variable (X) and the outcome variable (Y) directly. As an example, we might posit that the quality of the relationship between a mother and her adolescent child (X) impacts the child’s motivation to perform well in school (Z) and that one’s motivation to do well in school, in turn, impacts (negatively) the tendency for an adolescent to use drugs (Y). As we think about the matter of specifying partial versus complete mediation, we might decide that there are other mechanisms by which the quality of the relationship between parent and adolescent can impact drug use, such as by lessening the attention that adolescents give to peers who use drugs. We therefore decide to posit partial mediation to reflect this fact and add a direct causal path from X to Y.

If we are able to specify another mechanism by which X influences Y over and beyond Z, then why not just incorporate that additional mediator into the theory? Of course, we could very well do this, but then the issue becomes whether the two mediators in the theory, considered together, are complete or partial mediators of the causal effect of X on Y. This might lead us to speculate about yet a third mechanism, and once we have specified it, the issue of partial or complete mediation will present itself yet again. At some point, we must decide to close out the system and just let a direct path between X and Y stand so as to reflect partial mediation without formally bringing additional mediators into the model. If pressed, we could articulate one, but we simply do not want to complicate the theory further.

A spurious relationship is one in which two variables are related because they share a common cause but not because either causes the other (see Fig. 5.1). As an example, if we select a random sample of people in the United States and calculate the association between height and length of hair, then we would find a moderate relationship between the two variables: People with shorter hair grow

84 theory construction, model building


taller. Does this mean that a causal relationship exists between these variables—that is, that cutting one’s hair will make one grow taller? Of course not. The reason the variables are correlated is because they share a common cause: gender. Males tend to have shorter hair than females, and males tend to grow taller than females. The common cause of gender produces a correlation between length of hair and height, but it is spurious.

A moderated causal relationship, like spurious and indirect relationships, involves at least three variables (see Fig. 5.1). In this case, the causal relationship between two variables, X and Y, differs depending on the value of a third variable, Z. For example, it might be found that a given type of psychotherapy (X) is effective for reducing depression (Y) for females but not for males. In this case, the causal relationship between being exposed to psychotherapy and depression is moderated by gender. When gender has the value “female,” X impacts Y. However, when gender has the value “male,” X does not impact Y. Gender is called a moderator variable because the relationship between the presence or absence of psychotherapy (X) and depression (Y) changes as a function of (or is “moderated by”) the levels of gender.

A bidirectional causal relationship exists when two variables are conceptualized as influencing each other (see Fig. 5.1). For example, in the area of reproductive health, a theorist might posit a bidirectional influence between a woman’s belief that the rhythm method is effective at preventing pregnancy (X) and her attitude toward the rhythm method (Y). A woman may have a positive attitude toward the rhythm method because she believes it is effective. Simultaneously, she may believe it is effective, in part, because she has a positive attitude toward it, via a mechanism that involves rationalization of behavior.

Technically, there can never be simultaneous reciprocal causation because there always must be a time interval, no matter how infinitesimally small, between the cause and the effect that follows from that cause. If we observed the causal dynamics within the appropriate time frames, then the true dynamic underlying a reciprocal causal relationship would appear as follows:

Xt1 → Yt2 → Xt3 → Yt4,

where Xt1 is variable X at time 1, Yt2 is variable Y at time 2, Xt3 is variable X at time 3, and Yt4 is variable Y at time 4. As an example, suppose we conduct a cross-sectional study and at a given point in time we measure adolescent drug use and grade point averages in school. It is likely that the measured drug use reflects the influence of prior poor performance in school because adolescents who do poorly in school might turn to drugs as a coping mechanism or as a way of spending free time that normally would have been directed to school work. Similarly, the measured school performance likely reflects the effects of any prior drug use, which can cause students to lose interest in school and to not concentrate on tests and studying. It would be wrong to assume there is unidirectional causality from one construct to the other in this study. More realistically, the two measures reflect a more fine-grained process that has played itself out—that is, poor performance in school at time t influenced drug use at time t + 1, which in turn influenced school performance at time t + 2, which in turn influenced drug use at time t + 3, and so on. It is only when we are unable to capture the more fine-grained time intervals and we must instead work with coarser time intervals that the dynamic of the reciprocal causal relationship as illustrated in Figure 5.1 applies. By working with coarser time units, the more fine-grained temporal causal dynamics are assumed to have already played themselves out (often referred to as the equilibrium assumption). In this sense, there exists reciprocal causality per Figure 5.1.

The final type of relationship that can occur in a causal model is an unanalyzed relationship. In Figure 5.1, the two variables for this type of relationship are connected by a double-headed curved arrow. This arrow signifies that the two variables are possibly correlated but that the theorist is not going to specify why they are correlated. The correlation may be spurious or it may result from a causal connection of some kind. The theorist wants to recognize the possible correlation between the variables, but trying to explain it is beyond the scope of the theoretical effort. The relationship will remain unanalyzed.

Most causal models have more than one of these six types of relationships in them. An example of a multivariate causal model appears in Figure 5.2, which was based on an endeavor that developed an intervention to reduce unprotected sexual intercourse to reduce the spread of sexually transmitted diseases (STDs). The intervention is represented as a two-level, qualitative variable in which individuals are randomly assigned to either an intervention group or a control group. The intervention is designed to influence (1) the belief that having unprotected sex increases the risk of contracting an STD (see path a in Fig. 5.2) and (2) the belief

[Figure 5.2 A Multivariate Causal Model: Intervention versus Control → (path a) Belief Unprotected Sex Leads to STDs and (path b) Belief STDs Are Serious; the two beliefs → (paths c and d) Engages in Unprotected Sex → (path e) Has an STD.]

that contracting an STD is bad for one’s health (see path b in Fig. 5.2). These two beliefs, in turn, are assumed to influence the tendency for a person to engage in unprotected sex (paths c and d). The tendency to engage in unprotected sex, in turn, is thought to impact the extent to which the person contracts STDs (path e). Paths a through e each represent direct causal relationships. There also is a spurious relationship in this model, as seen by the fact that the two beliefs share a common cause, the intervention (paths a and b). Because of this common cause, we expect the two beliefs to be correlated to some extent, although there is no presumed causal connection between them. There also are several indirect causal relationships in the model. For example, the intervention indirectly affects the occurrence of STDs through the two belief mediators and, in turn, their influence on the tendency to engage in unprotected sex.

In causal theories, distinctions are made between exogenous and endogenous variables. Any variable that has a straight arrow going to it in a path diagram is called an endogenous variable. Endogenous variables, essentially, are outcome variables—that is, they are presumed to be influenced by another variable in the theoretical system. Variables that do not have a straight arrow going to them are called exogenous variables. They do not have presumed causes that are elaborated upon in the theory.

In sum, the fundamental orientation to constructing a causal theory is to explain variation in one or more outcomes. This is accomplished with reference to six types of relationships: direct effects, indirect effects, reciprocal effects, moderated effects, spurious effects, and unanalyzed relationships. In the theory construction process, it is not uncommon for the theorist to first identify the outcome variable he or she wishes to explain and then to specify a few direct causes of that variable. One or more of the direct causes can then be turned into an indirect effect by elaborating the mechanisms by which the cause produces the effect. The “mechanisms” are, essentially, mediators. The theory might be further elaborated by specifying the boundary conditions of effects, thereby introducing moderated relationships into the framework (e.g., the effect holds when condition A is operative but not when condition B is operative; or the effect holds for one type of individual but not another type of individual). For a discussion of the many heuristics scientists use to identify mechanisms and boundary conditions when developing causal theories, see Jaccard and Jacoby (2010).

Theories can be represented in the form of path diagrams using the graphical schemes illustrated in Figures 5.1 and 5.2. Such representations are possible no matter what research design is used to test a theory—that is, we can represent the underlying theory with path diagrams for experiments just as readily as for purely observational designs. Figure 5.2 is an example of a causal theory that incorporates an experimental design.

Causal Theories with Explicit Temporal Dynamics
An important structure of many (but not all) causal theories is a focus on longitudinal dynamics. Causal theories that focus on longitudinal dynamics contain one or more of the six types of relationships described above, but there also is an explicit temporal dynamic that is of theoretical interest. A nomenclature has emerged around such theories, which I briefly describe here.

One popular type of causal model that includes longitudinal dynamics is called a panel model, in which multiple variables are modeled at multiple points in time, also called waves. Figure 5.3



[Figure 5.3 A Panel Model: Alcohol Use and Delinquency each measured at Time t through Time t + 3; paths a and b are the autoregressive effects for alcohol use and delinquency, paths c are contemporaneous effects, and paths d are lagged effects of alcohol use on delinquency.]

presents an example of a two-variable, four-wave panel model. This model examines the relationship between adolescent alcohol use and delinquency across the 4 years of high school. Heavy alcohol use at time t is thought to have a direct effect on alcohol use at time t + 1, and the same type of dynamic is thought to hold for delinquency. A path that specifies a variable at time t + 1 as being a function of that same variable at time t is said to reflect an autoregressive effect (see paths a and b in Fig. 5.3). In the model in Figure 5.3, the autoregressive effects are first-order effects, because the variable at a given time, t, is assumed to be a function of that same variable at the time just prior to it. A second-order autoregressive effect is one where the variable at time t is impacted (also) by the same variable at time t − 2. A third-order autoregressive effect implies a direct effect between a variable at time t and that same variable at time t − 3. And so on.

The theory in Figure 5.3 also specifies that drinking alcohol at time t predisposes one to delinquency at that same time period, under the rationale that people often commit acts of delinquency when they are drunk. When a causal relationship exists between two variables at the same period of time, it is said to represent a contemporaneous causal effect (see path c in Fig. 5.3). Finally, when a variable at time t has an effect on a different variable at time t + 1, it is referred to as a lagged effect (see path d in Fig. 5.3). In this case, alcohol use at time t is thought to have a delayed effect on delinquency at time t + 1 independent of the other indirect causal chains that link alcohol use at time t to delinquency at time t + 1.

In sum, it is common in longitudinal models to theorize about autoregressive effects, contemporaneous effects, and lagged effects. These effects are common features of panel models (Collins, 2006; Finkel, 2008).

A second type of theory that formally incorporates temporal dynamics is theory based on growth processes. This approach views variation in outcomes across time as arising from an unobserved “growth” process that causes changing values of an outcome over time. These models are typically associated with latent growth curve models in the social science literature or, more simply, latent curve models (LCMs). The classic form of these models is presented in Figure 5.4, as applied to the alcohol-delinquency example considered earlier. Observed variables or measures are represented by boxes, and latent (unobserved) variables are represented by circles. This model parameterizes a latent “growth” or “maturation” process for alcohol use as students progress through high school (represented by the variable called “Alcohol Slope” in Fig. 5.4) as well as a latent growth or maturation process for delinquency as students progress through high school (see the variable “Delinquency Slope” in Fig. 5.4). The “growth process” for alcohol use is assumed to impact the “growth process” for delinquency across the four time periods. One often will encounter longitudinal causal theories expressed in this form instead of in the more traditional panel model form (Collins, 2006).

Of course, it is possible that both types of dynamics in the two types of models operate. When the two types of processes are integrated into a single model, we obtain what is called an autoregressive latent trajectory model (Bollen & Curran, 2006). Figure 5.5 presents this model.

In sum, when constructing theories that incorporate longitudinal dynamics, one will explicitly take into account the possible causal dynamics described by panel models, by LCMs, or by autoregressive latent trajectory models.

Multilevel Causal Theories
Another common type of causal model that has received considerable attention in the social sciences is one that incorporates multiple levels of analysis, or what is called a multilevel model. These models deal with scenarios where there is nesting—for example, where individuals are nested within different, higher

[Figure 5.4 A Latent Curve Model: latent Baseline Alcohol and Alcohol Slope variables underlying the four Alcohol Use measures (Time t through Time t + 3), and latent Baseline Delinquency and Delinquency Slope variables underlying the four Delinquency measures; path a links the alcohol growth process to the delinquency growth process.]
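The growth-process idea behind Figure 5.4 can be made concrete with a small simulation. The sketch below generates person-specific baselines and slopes for alcohol use, lets the alcohol slope partly drive the delinquency slope (the role of path a in the figure), and then recovers each adolescent's slope from the four observed waves. All numbers here (sample size, path values, error variances) are invented for illustration; the chapter specifies the model's structure, not these values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, waves = 500, 4
t = np.arange(waves)  # times t, t+1, t+2, t+3

# Latent growth factors for alcohol use: baseline and slope
alc_base = rng.normal(2.0, 1.0, n)
alc_slope = rng.normal(0.5, 0.3, n)

# Delinquency growth factors; the slope is partly a function of the
# alcohol slope, mimicking path a in Figure 5.4 (value hypothetical)
del_base = rng.normal(1.0, 1.0, n)
del_slope = 0.8 * alc_slope + rng.normal(0.0, 0.2, n)

# Observed waves = latent trajectory + occasion-specific error
alcohol = alc_base[:, None] + alc_slope[:, None] * t + rng.normal(0, 0.5, (n, waves))
delinq = del_base[:, None] + del_slope[:, None] * t + rng.normal(0, 0.5, (n, waves))

# Recover each person's slope by least squares across the four waves;
# the slope estimates correlate because of the latent path between them
est_alc = np.polyfit(t, alcohol.T, 1)[0]
est_del = np.polyfit(t, delinq.T, 1)[0]
r = np.corrcoef(est_alc, est_del)[0, 1]
print(round(r, 2))
```

In practice one would estimate such a model with structural equation modeling software rather than person-by-person regressions; the point is only that a single latent slope per person can generate the systematic change observed across waves.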

[Figure 5.5 An Autoregressive Latent Trajectory Model: the latent Baseline and Slope variables of Figure 5.4 combined with the autoregressive (a, b), contemporaneous (c), and lagged (d) paths of Figure 5.3, plus a path e linking the two growth processes.]



level organizational units. For example, students are nested within schools, and both characteristics of the students and characteristics of the schools can influence an outcome variable, such as performance by students on a standardized math test. Employees are nested within organizations, and both characteristics of the employees and characteristics of the organization can influence employee behavior. Patients are nested within hospitals, and both characteristics of the patients and characteristics of the hospitals can influence patient recovery.

Summary of Causal Model Forms
In sum, causal theories seek to explain variation in one or more outcome variables by identifying causes of those variables. Social scientists posit theories or models to represent the presumed causal relationships among variables, and these theories typically have one or more of the six fundamental relationship types in them—namely, direct effects, indirect effects, spurious effects, moderated effects, reciprocal causality, and unanalyzed relationships. The theories can focus on a single time period or explicitly deal with temporal dynamics. If they deal with temporal dynamics, then this usually takes the form of a panel model, an LCM, or an autoregressive latent trajectory model. Theorizing also can occur at a single level of analysis or at multiple levels of analysis in which lower order units are nested within higher order units, with characteristics of the units at both levels influencing outcomes.

Theories and Disturbance Terms
There is a more subtle facet of theory construction beyond those elucidated thus far, and this concerns the concept of disturbance terms. Consider the simple theory in Figure 5.6a. This theory has two direct causes: variables X and Z are assumed to influence variable Y. A fourth “variable” is represented in the system by a circle. This “variable” reflects all unspecified variables that influence Y other than X and Z. It formally recognizes that the theory is incomplete and that we have not specified every cause of the outcome variable. This “variable” is called a disturbance term, and it represents the totality of all unspecified causal effects on the endogenous variable it is associated with. The presence of a disturbance term explicitly recognizes that not all causal influences on a variable have been specified in the model. Traditionally, each endogenous variable in a theory has a disturbance term associated with it.

[Figure 5.6 Theories with Disturbance Terms: (a) Theory with Disturbance Term, X and Z → Y with disturbance δ on Y; (b) Smoking and Drug Example with Uncorrelated Disturbance Terms, Gender → Smoking and Gender → Drug Use, each with its own disturbance; (c) Smoking and Drug Example with Correlated Disturbance Terms, the same model with disturbances δ1 and δ2 correlated.]

Consider another example in Figure 5.6b. There are two endogenous variables, and they share a common cause. One of the endogenous variables is adolescent tobacco use, and the other is adolescent drug use. The common cause is gender. The theory posits that boys are more likely than girls to smoke cigarettes and that boys also are more likely than girls to use drugs. There is a disturbance term for each of the endogenous variables. These terms recognize that factors other than gender impact tobacco use and drug use.

But there is a problem with this theory. According to the theory, the only reason that smoking cigarettes

and drug use in adolescence are correlated is because they share the common cause of gender. In reality, there are many other common causes of these two constructs. For example, social class impacts both tobacco use and drug use during adolescence, with more economically disadvantaged youth having an increased tendency to smoke cigarettes and to use drugs. Essentially, social class resides within the disturbance term for smoking cigarettes, and it also resides within the disturbance term for drug use. If the same unspecified cause is in each disturbance term, then we would expect the two disturbance terms to be correlated. Figure 5.6c presents a more plausible theory that includes this correlation between disturbances. According to this theory, there are two reasons why adolescent cigarette smoking and adolescent drug use are correlated. One reason is because they share the common cause of gender. Another reason is that they share other common causes that are unspecified by the theory and that reside in both disturbance terms.

A well-developed theory provides explicit statements about which disturbance terms in the framework are correlated and which disturbance terms are not. The lazy way out for a theorist is to simply assume all disturbance terms are correlated. But this is not satisfactory, and it can create considerable difficulties for testing the theory empirically. A better approach is to carefully consider every pair of disturbance terms and to articulate a possible common cause that resides in each of the separate disturbance terms. If such a common cause can be articulated, then it makes sense to posit correlated disturbances. If one cannot articulate any such variable, or if its effects are thought to be trivial, then one does not posit correlated disturbances.

For models with a longitudinal component, many theorists have a “knee-jerk” reaction that disturbances directed at the same variable at two different points in time must be correlated. Again, if one can articulate a compelling rationale for correlated disturbances, then by all means, correlated disturbances should be incorporated into the theory. Otherwise, correlated disturbances should be viewed with theoretical skepticism.

If a theorist is able to articulate a variable that resides in two disturbance terms to create correlated disturbances, then why not explicitly incorporate the variable into the theoretical system? For example, for the cigarette smoking and drug use example in Figure 5.6, why not explicitly bring social class into the theoretical system? This, of course, is desirable. But at some point, we want to close out the theoretical system and work just with the variables we have specified. By including disturbance terms and correlated disturbances, we are explicitly recognizing the operation of other variables, but we choose not to give them central focus in our theory.

Latent Variables, Structural Models, and Measurement Models
Some researchers take matters a step further and also incorporate a measurement theory into their conceptual frameworks when they are performing an empirical test of the theory. This goes beyond the typical province of theory construction per se, but I mention the ideas here as they ultimately impact data analysis and the choice of statistical methods. The integration of conceptual and measurement theories is something that should be standard practice for social science research.

An empirical test of a theory necessarily requires developing and using measures of the theoretical constructs in the theory. Just as one can build a theory linking one concept to another concept, so too can one build a theory linking a construct to a measure of that construct. Some theorists combine both types of theories into a single overarching framework. Traditional measurement models make a distinction between a latent variable and an observed measure of that variable. The latent variable is the true construct about which one is interested in making statements, such as depression. Although we can directly observe many of the symptoms of depression, we can’t directly observe the “seat” of depression in a person’s mind. Rather, we rely on some observable response(s) to assess the latent variable, such as a multi-item inventory of depression that a person completes. Figure 5.7a presents one representation of a measurement model. The latent variable of depression is contained in a circle, and the observed measure thought to reflect depression is contained in a square (the label “AR” stands for adolescent report of depression). A causal path is drawn from the latent variable to the observed measure, under the assumption that how depressed a person is influences how he or she responds to the questions on the inventory. There also is an error term that reflects measurement error—that is, there are factors other than depression that influence a person’s responses on the inventory. Ideally, measurement error is minimal, but it is a fact of life for many research endeavors. The relationship between the latent construct and the observed indicator is



(a) (b) happens to her is beyond her control. The theory
ε ε1 ε2 is that the more a mother thinks that what happens
is not under her control, the more likely she will
be to keep smoking and drinking during pregnancy.
These two latent exogenous variables are assumed to
AR AR MR be correlated. The three latent endogenous variables
each have a disturbance term, indicated by a circle
with a δ inside of it. The disturbances are assumed
to be uncorrelated.
The portion of the diagram with arrows from
Depression Depression the latent constructs to the observed measures con-
stitutes the measurement model. Each of the latent
variables has multiple indicators. In other words, the
Figure 5.7 Measurement Models. (a) Single Indicator (b) Mul- researcher obtains three measures of each construct,
tiple Indicators with the exception of birth weight, which is mea-
sured using two different indicators. In the interest
of space, we do not describe these measures, but note
usually assumed to be linear, but it could also be nonlinear.
Sometimes we obtain multiple indicators of a construct. For example, a researcher might obtain a self-report of depression from an adolescent as well as a report from the adolescent's mother about how depressed the child is (MR). Figure 5.7b presents a measurement model for this scenario. The latent variable of depression is assumed to influence both of the observed measures, and each measure is assumed to have some measurement error as reflected by the presence of error terms. The errors are assumed to be uncorrelated because we cannot articulate any viable reason why we would expect them to be correlated. However, one can introduce correlated measurement error, if appropriate.
Figure 5.8 presents an example of a more elaborate theoretical framework that incorporates a theory about the relationship between constructs as well as a measurement theory. Although it appears somewhat intimidating, it is a straightforward model. There are five latent constructs, and the main substantive theory is focused on them. The portion of the diagram focused on the causal relations among the latent variables is called the structural model. The primary outcome variable in this model is the birth weight of a newborn. Birth weight is thought to be influenced by two factors: how much alcohol the mother consumes during her pregnancy and how much she smokes during her pregnancy. Both of these variables are thought to be influenced by two other variables. The first determinant is the extent of support the mother has from friends and relatives who can help her quit smoking and drinking. The second is the mother's locus of control. Locus of control refers to the extent to which the mother believes that what […]
[…] that each is assumed to be fallible (i.e., subject to some measurement error; see the circles ranging from ε1 to ε14). The measurement errors are assumed to be uncorrelated. Figure 5.8 provides an explicit roadmap for a researcher to test the combined structural theory and measurement theory.
In experiments that involve a formal manipulation, the manipulation typically is considered to be an observed variable in its own right, with no latent construct underlying it (see Fig. 5.2). In some cases, the manipulation is designed to reflect or produce differences in an underlying psychological state (e.g., one's mood when studying the effects of mood on memory). In this case, a measurement model may be introduced into the causal system, treating the manipulation as a formative rather than a reflective indicator of the construct (see Schumacker & Lomax, 2004, for elaboration).
Many social scientists view measurement models as interesting causal theories in their own right rather than just a methodological feature of theory testing. The most common case is when the conceptual focus is on specifying the facets or dimensions of a construct vis-à-vis factor analysis. The causal theory underlying factor analysis is that one or more (possibly correlated) unmeasured latent variables are each a common cause of a set of observed measures of constructs that are of interest in their own right. For example, theorists have suggested there are four facets of social support: informational support, emotional support, tangible support, and companionship support. Each of these facets is conceptualized as a latent variable that impacts distinct manifestations of social support. For elaboration of theory-based expressions of measurement models, see Brown (2006).
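The common-cause logic of factor analysis can be made concrete with a small simulation sketch. The latent variable, loadings, error variances, and sample size below are all invented for illustration and are not from the chapter:

```python
import numpy as np

# Hypothetical sketch of the factor-analytic causal theory: one latent
# variable (e.g., a facet of social support) is a common cause of two
# observed indicators, so the indicators correlate even though neither
# causes the other. All numeric values are invented.
rng = np.random.default_rng(42)
n = 50_000

latent = rng.normal(size=n)          # unmeasured common cause
lam1, lam2 = 0.9, 0.6                # invented factor loadings
ind1 = lam1 * latent + rng.normal(scale=0.5, size=n)  # indicator 1 + error
ind2 = lam2 * latent + rng.normal(scale=0.5, size=n)  # indicator 2 + error

# With uncorrelated errors (each with variance 0.25 here), the
# model-implied indicator correlation is the product of the
# standardized loadings.
implied = (lam1 * lam2) / np.sqrt((lam1**2 + 0.25) * (lam2**2 + 0.25))
observed = np.corrcoef(ind1, ind2)[0, 1]
print(observed, implied)
```

The observed correlation tracks the model-implied value closely, which is the kind of constraint a confirmatory factor model tests against data.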

jaccard 91
[Figure 5.8 Example of Integrated Structural and Measurement Model. Latent constructs Support (indicators S1–S3), Control (C1–C3), Alcohol (A1–A3), Smoking (K1–K3), and Birthweight (W1–W2), with measurement errors ε1–ε14 and disturbances δ1–δ3.]

In sum, a well-developed theory will not only include causal relationships between constructs but also will include disturbance terms and a theory of the correlational structure of those disturbance terms. As one moves to an empirical test of a theory, one also specifies a measurement model that links theoretical constructs to measures obtained in the empirical test.

From Theories to Mathematical Representations
Specifying Core Model Equations
With a well-articulated theory in hand, the next step for choosing a form of statistical analysis is to translate the theory into a set of equations that then guide the statistical analysis. The basic strategy for doing so can be illustrated by making some simplifying assumptions, which I relax later. First, unless otherwise specified, I will assume that all variables in the theoretical system are continuous. Second, I will assume that the measures of all variables have at least interval level properties. Finally, I will assume that all relationships between variables are linear. I adopt a strategy whereby the theory under consideration is expressed in the form of a set of linear equations. This does not mean, however, that researchers always parameterize data analysis in terms of the linear model. Many well-known methods of analysis, such as t tests, analysis of variance, and analysis of covariance, evolved outside of the context of the linear model. However, these methods can be re-expressed as linear equations, and in this sense, they are compatible with the current approach. Complex models of dichotomous, nominal, and count variables also can be approached from the perspective of linear equations using what is known as the generalized linear model (see Yuan & Schuster, Chapter 18, Volume 1). Finally, newer methods of analysis that focus on robust indices of central tendency

92 theory construction, model building


[Figure 5.9 Path Model for Defining Equations. X influences Y3 (β4) and Y4 (β5); Y3 (β2) and Y4 (β3) influence Y2; Y2 influences Y1 (β1); disturbances ε1–ε4.]

and variability can be viewed in the context of linear equations but where the parameterization shifts from means and variances to trimmed means, M estimators, and the like (see Erceg-Hurn, Wilcox, & Keselman, Chapter 19, Volume 1).
Consider the case where the initial theory is represented as a path or influence diagram, such as the theory in Figure 5.9. A path diagram can be viewed as a pictorial representation of a set of equations. There is a separate equation for each endogenous variable in the theory. More specifically, given the aforementioned assumptions, each endogenous variable can be expressed as being a linear function of all variables with arrows going directly to and explicitly touching the box representing the endogenous variable. For the model in Figure 5.9, there are four linear equations that are of primary theoretical interest because there are four endogenous variables. Using the above rule, the relevant equations are:

Y1 = α1 + β1 Y2 + ε1,
Y2 = α2 + β2 Y3 + β3 Y4 + ε2,
Y3 = α3 + β4 X + ε3, and
Y4 = α4 + β5 X + ε4,

where the various α are intercepts, the various β are linear coefficients, and the various ε are disturbance terms. Primary interest of the research is estimating and interpreting the parameters α1 through α4, β1 through β5, and the variances of ε1 through ε4 relative to the variances of Y1 through Y4. We select statistical methods of analysis that provide the best and most well-behaved estimates of these parameters.
The rule for expressing a path diagram in terms of a set of core equations also applies to models with latent and observed variables, such as the model in Figure 5.10. For this model, the structural model has the following core equations:

LY = α1 + β1 LM + δ1 and
LM = α2 + β2 LX + β3 LZ + δ2,

with each equation focusing on a latent endogenous variable as an outcome. We are interested in estimating the parameters in these equations, but the task is more challenging statistically because we do not have direct access to a person's scores on the latent variables. Nevertheless, statisticians have derived methods for obtaining such estimates (see McDonald, Chapter 7, Volume 1).
The measurement model for the theory in Figure 5.10 implies the following equations (again, using the rule of specifying an equation for each endogenous variable, in this case, the observed endogenous variables):

X1 = α3 + β4 LX + ε1,
X2 = α4 + β5 LX + ε2,
Z1 = α5 + β6 LZ + ε3,
Z2 = α6 + β7 LZ + ε4,
M1 = α7 + β8 LM + ε5,
M2 = α8 + β9 LM + ε6,
Y1 = α9 + β10 LY + ε7, and
Y2 = α10 + β11 LY + ε8.
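The four core equations for Figure 5.9 can be estimated one at a time with ordinary least squares once data are in hand. The sketch below simulates data consistent with the diagram and recovers the coefficients; every numeric parameter value is invented for illustration:

```python
import numpy as np

# Illustrative sketch (not from the chapter): estimating the four core
# equations for Figure 5.9 one at a time by ordinary least squares.
# All "true" parameter values below are made up for the simulation.
rng = np.random.default_rng(0)
n = 5000

X = rng.normal(size=n)
Y3 = 1.0 + 0.5 * X + rng.normal(size=n)              # Y3 = a3 + b4*X + e3
Y4 = 2.0 - 0.4 * X + rng.normal(size=n)              # Y4 = a4 + b5*X + e4
Y2 = 0.5 + 0.6 * Y3 + 0.3 * Y4 + rng.normal(size=n)  # Y2 = a2 + b2*Y3 + b3*Y4 + e2
Y1 = -1.0 + 0.8 * Y2 + rng.normal(size=n)            # Y1 = a1 + b1*Y2 + e1

def ols(y, *predictors):
    """Return intercept and slopes from a least-squares fit."""
    design = np.column_stack([np.ones_like(y)] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

a1, b1 = ols(Y1, Y2)          # equation for Y1
a2, b2, b3 = ols(Y2, Y3, Y4)  # equation for Y2
a3, b4 = ols(Y3, X)           # equation for Y3
a4, b5 = ols(Y4, X)           # equation for Y4
print(b1, b2, b3, b4, b5)
```

Estimating equation by equation like this is one instance of the limited information strategy discussed later; the slopes land close to the values used to generate the data.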

[Figure 5.10 Path Model with Latent Variables to Define Equations. Latent exogenous LX (indicators X1, X2) and LZ (indicators Z1, Z2) influence latent LM (indicators M1, M2) via β2 and β3; LM influences LY (indicators Y1, Y2) via β1; loadings β4–β11, measurement errors ε1–ε8, disturbances δ1 and δ2.]
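One way to see why estimation with latent variables is statistically harder is a hypothetical simulation: if the structural coefficient β1 relating LM to LY is estimated by simply regressing the fallible indicator Y1 on the fallible indicator M1, the slope is attenuated by measurement error. All numbers below are invented:

```python
import numpy as np

# Hypothetical sketch: the structural effect of LM on LY is set to 0.7,
# but an OLS regression of indicator Y1 on indicator M1 recovers a
# smaller slope because M1 contains measurement error (attenuation).
rng = np.random.default_rng(7)
n = 100_000

LM = rng.normal(size=n)
LY = 0.7 * LM + rng.normal(scale=0.7, size=n)  # structural equation
M1 = LM + rng.normal(scale=0.7, size=n)        # fallible indicator of LM
Y1 = LY + rng.normal(scale=0.7, size=n)        # fallible indicator of LY

# Classical attenuation: slope is roughly 0.7 * 1 / (1 + 0.49), i.e.
# about 0.47 rather than the true 0.7.
slope = np.polyfit(M1, Y1, 1)[0]
print(slope)
```

Methods that model the latent variables directly (as in SEM) are designed to recover the unattenuated structural coefficient.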

It is not uncommon to express the set of core equations in a theory in more succinct form than the above using matrix algebra. Readers unfamiliar with matrix algebra can skip to the summary paragraph at the end of this section. As an example, the above structural model is expressed by making distinctions between the latent endogenous variables and the latent exogenous variables, defining a vector of latent endogenous variables (η) and a vector of latent exogenous variables (ξ) as

η = [η1]    ξ = [ξ1]
    [η2]        [ξ2],

and a vector of intercepts (α) and a vector of disturbance terms (δ) for the latent η as

δ = [δ1]    α = [α1]
    [δ2]        [α2].

In the present example, there are m = 2 latent endogenous variables, q = 2 latent exogenous variables, r = 4 observed endogenous measures with respect to the two latent η, and p = 4 observed endogenous variables with respect to the two latent ξ. We further specify an m × m matrix (B) representing the linear (regression) coefficients regressing the η onto the η and an m × q matrix (Γ) representing the linear (regression) coefficients regressing the η onto the ξ:

B = [β11 β12]    Γ = [γ11 γ12]
    [β21 β22]        [γ21 γ22].

In the example in Figure 5.10, let LY = η1, LM = η2, LX = ξ1, and LZ = ξ2. Then

B = [0 β12]    Γ = [ 0    0 ]
    [0  0 ]        [γ21  γ22]

and the structural model is defined as

η = α + Bη + Γξ + δ.

The measurement model for the latent endogenous and exogenous variables defines separate matrices of factor loadings for the linear (regression) coefficients from the latent endogenous variables to



the observed endogenous variables/indicators with respect to them (λY, which is an r × m matrix) and for the linear (regression) coefficients from the latent exogenous variables to the observed endogenous variables/indicators of them (λX, which is a p × q matrix):

λY = [λY11 λY12]    λX = [λX11 λX12]
     [λY21 λY22]         [λX21 λX22]
     [λY31 λY32]         [λX31 λX32]
     [λY41 λY42]         [λX41 λX42].

For the present example, let the four observed endogenous variables with respect to the latent endogenous variables be Y1, Y2, M1, and M2, respectively, and the four observed endogenous variables with respect to the latent exogenous variables be X1, X2, Z1, and Z2, respectively. Then the above matrices are

λY = [λY11   0 ]    λX = [λX11   0 ]
     [λY21   0 ]         [λX21   0 ]
     [  0  λY32]         [  0  λX32]
     [  0  λY42]         [  0  λX42].

A vector of intercepts (τY) and a vector of error terms (εY) are defined for the observed indicators of the latent endogenous variables:

τY = [τY1]    εY = [εY1]
     [τY2]         [εY2]
     [τY3]         [εY3]
     [τY4]         [εY4],

and the observed indicators of the latent exogenous variables:

τX = [τX1]    εX = [εX1]
     [τX2]         [εX2]
     [τX3]         [εX3]
     [τX4]         [εX4],

and the measurement models are defined as:

Y = τY + λY η + εY
X = τX + λX ξ + εX,

where Y is a column vector of the observed endogenous variables with respect to the latent endogenous variables, and X is a column vector of the observed endogenous variables with respect to the latent exogenous variables.
In addition to these matrices and vectors, one also typically specifies a matrix of covariances between the latent exogenous variables, a matrix of covariances for the disturbance terms, a matrix of covariances for the error terms, and a vector of means for the latent exogenous variables, although these are not typically part of the "core" equations in the model. Nevertheless, statisticians make use of these matrices in the analysis of data and model evaluation.
In sum, given a well-developed causal theory that takes the form of a path or influence diagram, one can translate that diagram into a set of core equations that guide statistical analysis. This process is central to building correspondence between our theoretical models and our statistical models. The core equations can be represented in traditional algebraic terms or in matrix form, but either way, they guide the choice of the statistical model for purposes of data analysis.

Some Qualifications
For models with moderated causal effects, the rule for translating a path diagram into a set of core equations must be slightly modified. Moderated effects are traditionally (but not always) represented by including product terms in the equation. Consider a simple causal model of the following form:

[Diagram: X → Y, with Z moderating the X → Y path.]

The core equation in this case describes the outcome variable Y as a linear function of the two variables involved in the moderated relationship and the product of the two variables:

Y = α + β1 X + β2 Z + β3 XZ + δ.

The coefficient associated with the product term reflects the degree of moderation that is operating. Traditionally, the component parts of the product term are included in the core equation as separate predictors in addition to the product term because doing so protects against arbitrary scaling affecting conclusions. For elaboration, see Blanton and Jaccard (2006) and Yuan and Schuster (Chapter 18, Volume 1).
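The product-term specification can be sketched as follows. The data are simulated and the coefficient values invented, but the fitting step mirrors the equation Y = α + β1 X + β2 Z + β3 XZ + δ with both component parts entered alongside the product:

```python
import numpy as np

# Hypothetical moderated-regression sketch: Z moderates the X -> Y
# effect, represented by the product term XZ. True values are invented.
rng = np.random.default_rng(3)
n = 10_000

X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = 1.0 + 0.5 * X + 0.2 * Z + 0.4 * X * Z + rng.normal(size=n)

# Include X and Z as separate predictors in addition to XZ, as the
# chapter recommends, so conclusions are not artifacts of scaling.
design = np.column_stack([np.ones(n), X, Z, X * Z])
alpha, b1, b2, b3 = np.linalg.lstsq(design, Y, rcond=None)[0]

# The simple slope of X on Y at a given value of Z is b1 + b3 * Z.
print(b1, b2, b3)
```

A nonzero b3 signals moderation: the X slope shifts with the level of Z.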

The above equation suggests that another way of diagramming a moderated effect is to include three separate direct effects, one from each of the component parts of the product and a third from the product term. Interaction analysis does not have to be pursued through product terms in a linear equation (Wilcox, 2005), but this is the most common approach.
Another qualification to specifying core equations is whether the relationship between variables is nonlinear rather than linear. For example, the relationship between an outcome variable and a causal variable might be characterized by a logarithmic function, an exponential function, a power function, a polynomial function, or a trigonometric function, among others. In such cases, the form of the core equation differs from those specified earlier. Nonlinear modeling can be complex, and consideration of such models is beyond the scope of this chapter. Readers are referred to Jaccard and Jacoby (2010) and Yuan and Schuster (Chapter 18, Volume 1).
Although there are other nuances that can further shape the nature of the equations one derives from a causal theory, the general point I stress here is that (1) it is possible to represent most causal models in the form of one or more equations and (2) translating a model into equation form is a fundamental step for choosing an appropriate analytic method for testing the viability of a causal model.

Additional Considerations for Choosing Analytic Strategy
With a carefully specified set of equations that are theoretically guided, the next task of the researcher is to choose a data analytic method to estimate the parameters in those equations. The present section considers the role of facets that must be taken into account when making such decisions, including measurement/metric considerations, the use of full versus limited information estimation, distributional assumptions of estimation strategies, and the anticipated presence of outliers. Of course, there are a host of more nuanced considerations that researchers must account for, and these are developed in the different chapters in this volume. My focus is on categories of more general considerations.

Measurement and Metric Considerations
Strategies for estimating parameters in core equations, or the act of defining the parameters themselves, are influenced by the psychometric properties of the measures of the variables that comprise the theory. One important metric consideration is the level of measurement of the measures. The classic distinction between nominal, ordinal, interval, and ratio level measurement is particularly important, because different methods of analysis often are called for depending on the operative levels of measurement. Nominal measurement involves the assignment of numbers to levels of a variable that are categorical in nature and that have no inherent ordering on an underlying dimension. For the variable gender, a researcher might assign the values of 0 to females and 1 to males. The fact that one number is larger than the other has no substantive interest. The numbers function much like non-numeric labels.
Ordinal, interval, and ratio level measures can be defined relative to the measurement equations between quantitative latent variables and observed measures described earlier. In describing the properties of these levels of measurement, I will assume no measurement error is operating, to simplify the presentation. Assuming no measurement error, an interval level measure is one that is a linear function of the underlying latent variable—that is,

X = α + βLX,

where X is the observed measure and LX is the latent variable. The intercept and the linear coefficient can take on any value. For ratio level measures, the underlying latent variable, LX, should have a meaningful conceptual zero-point (such as zero weight when measuring how heavy an object is) and the intercept in the measurement equation must equal zero (as would be the case for a weight measure in pounds or a measure in grams). For ordinal measures, the relationship between X and LX is nonlinear but monotonic. For example, X might be a power function of LX, in which case the measure has ordinal properties.
As discussed in the other chapters in this volume, analytic strategies for parameter estimation vary as a function of these measurement levels. For example, when a measure of an exogenous variable in a causal model is nominal, it is common to represent that variable in a core equation using dummy variables (Chapter 18). If one wants to compare two groups on their average heights, then a measure of height that has ordinal properties is potentially problematic and not well-suited to comparing means on the underlying latent construct of height.
An important but underappreciated point when taking measurement levels into account is the fact



that the distinction between the different measurement levels is best viewed as approximate rather than all-or-none. Metric properties are not inherent in scales but, rather, are inherent in data and, hence, are influenced by all of the facets of data collection. The extent to which a measure has interval properties not only depends on the scale used to make observations but also possibly on the particular set of individuals on which the observations are made, the time at which the data are collected, the setting in which the data are collected, and so on. Consider the following simplistic yet pedagogically useful example. The height of five individuals is measured on two different metrics, inches and a rank order of height:

Individual   Height in Inches   Rank Order Height
A            72"                5
B            71"                4
C            70"                3
D            69"                2
E            67"                1

As is well known, the measures taken in inches have interval level properties. For example, a difference of 1 between any two scores corresponds to the same physical difference on the underlying dimension of height. The actual height difference between individuals A and B corresponds to the same true underlying height difference between individuals C and D, and the metric reflects this (i.e., 72 − 71 = 1 and 70 − 69 = 1). Similarly, the difference between D and E is 69 − 67 = 2, and the difference between A and C is 2. These differences also reflect the same amount on the underlying dimension of height. Note, however, that these properties do not hold for the rank order measure. The difference in scores between individuals A and B is 1 (i.e., 5 − 4), and the difference in scores for individuals D and E is also 1 (i.e., 2 − 1). These identical differences correspond to differing degrees of height disparities on the underlying dimension of height (i.e., the true difference between individuals D and E is larger than the true difference between individuals A and B, as is evident for the measure using inches). For these individuals, the rank order measures have ordinal properties but not interval properties.
Now consider five different individuals with the following scores:

Individual   Height in Inches   Rank Order Height
A            72"                5
B            71"                4
C            70"                3
D            69"                2
E            68"                1

Note that for these five individuals, the rank order measure has interval level properties. The difference in scores between individuals A and B is 1, as is the difference between individuals D and E. These differences correspond to the exact same amount on the underlying physical dimension. In this case, what we think of as traditionally being an ordinal "scale" actually yields measures with interval level properties. Suppose that individual E was not 68" tall but instead was 67.9" tall. In this case, the rank order measure is not strictly interval. But it is close and probably can be treated as if it is interval level without adverse effects.
This example illustrates that the crucial issue is not whether a set of measures is interval or ordinal. Rather, the critical issue is the extent to which a set of measures approximates interval level characteristics. If the approximation is close, then the data often can be effectively analyzed using statistical methods that assume interval level properties. If the approximation is poor, an alternative analytic strategy is called for. In this sense, we often can apply statistical strategies that assume interval level measures to ordinal data without consequence as long as the ordinal data reasonably approximate interval level properties.
Some researchers confuse the concept of measurement level with measurement precision. Precision refers to the number of scale points of a measure, such as a 5-point scale, a 10-point scale, or a 21-point scale. Measures can have interval level properties, for example, but be imprecise, or they can have ordinal properties yet be relatively precise. Precision of a measure may shape the way that one chooses to model data to evaluate a theory, but to the extent it does, it is not because of levels of measurement of the measures. With coarse and imprecise measures of continuous constructs, analytic methods that assume high levels of precision can be problematic and alternative analytic strategies are required (see Yuan & Schuster, Chapter 18, Volume 1).
Another measurement matter that shapes the method of data analysis is whether the outcome measure is discrete and zero-bounded (such as measures of counts, like the number of times an adolescent has

smoked marijuana) and whether measures are censored. Censoring occurs when a value occurs outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 300 pounds. If a 350-pound individual is weighed using the scale, then we would only know that the individual's weight is at least 300 pounds. For details of how these metric qualities affect analytic strategy, see Long (1997) and Yuan and Schuster (Chapter 18, Volume 1).

Full Information versus Limited Information Estimation
Given a set of equations derived from a causal theory, one approach to estimating parameters in those equations is to use a full information estimation approach. In full information estimation, the coefficients in all equations in the model are estimated simultaneously. This can be contrasted with a limited information estimation approach where the coefficients in the equations are estimated one equation at a time or, alternatively, focusing on only a portion of the larger model rather than the entire model. Full information estimation is common in structural equation modeling (SEM), but there also are limited information variants of SEM (see Bollen, 2001). Full information estimation has the advantage that parameter estimates typically (but not always) are more efficient (in a strict statistical sense of the term) than limited information estimators as long as the model being tested is correctly specified. Limited information estimation has the advantage that it often is less susceptible to adverse effects of specification error, because the effects of specification error are limited to the particular portion of the model where the error occurs. It also allows one to use the strongest methods of analysis available dependent on the properties of measures and variables in different portions of the model rather than applying the same homogenous algorithm throughout the model. A decision point for analysts when choosing a statistical method to estimate parameters defined by a theoretical model is whether to adopt a full information or limited information estimation strategy.

Statistical Assumptions
Another issue that analysts must consider when moving from equations to estimation is the distributional assumptions of the estimation method and whether they are appropriate for the task at hand. Three assumptions are typical: (1) normality of scores in the population, (2) homogeneity of variances in the population, and (3) independence of replicates. The ways these assumptions are instantiated vary by analytic method, and other assumptions also can come into play (Maxwell & Delaney, 2004). I focus on the above assumptions primarily to make general points about approaches to distributional assumptions. I will make reference to the robustness of a test to violations of underlying assumptions. A statistical test is said to be robust to violations of assumptions if (1) the nominal Type I error rate (alpha level) set by the investigator a priori (usually 0.05) is maintained in the face of assumption violations and (2) the statistical power of the test is relatively unaffected by assumption violations. For a more technical and nuanced discussion of robustness, see Wilcox (2005).
A common perception of many researchers is that traditional F and t tests in analysis of variance (ANOVA) and regression are quite robust to violations of normality and homogeneity of variance assumptions. Because of this, these methods are often applied to data where the population assumptions are tenuous. Studies have shown that ANOVA and regression are not necessarily robust to assumption violations of normality and variance heterogeneity (see Keselman et al., 1998; Maxwell & Delaney, 2004; Wilcox, 2005). One strategy for dealing with assumption violations is to perform a preliminary test of the viability of the assumption in question and, if the test suggests a problem, perform a metric transformation or use a robust analytic alternative. This two-step strategy is controversial for several reasons. First, many preliminary tests lack power without large sample sizes and yield nonsignificant results for testing an assumption violation, even when the violation is problematic (Wilcox, Charlin, & Thompson, 1986; Wilcox, 2003). Second, the crucial issue is not whether the null hypothesis of normality or variance homogeneity can be rejected but, rather, estimating the degree to which the assumption is violated and making a decision as to whether the degree of violation is consequential. This requires documenting the magnitude of the assumption violation in the sample data and then taking sampling error into account when making decisions. For example, we might find that a variance ratio comparing the variances of two groups is 4.0, with a margin of error of plus or minus 3.0. The margin of error suggests that the variance ratio could be as large as 7.0, which could be problematic. Unfortunately, it is rare for researchers to take



margins of error into account when evaluating preliminary tests. Third, many tests of non-normality are based on asymptotic theory and only perform adequately with large sample sizes (Shapiro & Wilk, 1965). However, with large N, such tests tend to detect minor departures from normality that may be of little consequence. In addition, normality tests can be differentially sensitive to different types of non-normality. Some tests are sensitive mostly to skew, whereas others are sensitive mostly to kurtosis. Fourth, the preliminary tests often make assumptions in their own right and may perform poorly when their assumptions are violated. For example, many tests of variance homogeneity assume the population data are normally distributed (Carroll & Schneider, 1985; Keyes & Levy, 1997; Parra-Frutos, 2009). If the population data are non-normal, then the preliminary test of variance homogeneity may be invalid. Fifth, using preliminary tests as a screen can change the sampling distribution of F tests and t tests in unpredictable ways. Although it seems reasonable, the strategy of conducting preliminary tests of model assumptions faces numerous difficulties.
Transformation strategies for dealing with assumption violations have also been criticized. For example, Budescu and Appelbaum (1981) found that transformations to address variance heterogeneity can create more problems than they solve in inferential tests because they can adversely impact normality (see also Blaylock et al., 1980; Milligan, 1987; Doksum & Wong, 1983; Wilcox, 1996, 1998). Transformed variables often are difficult to interpret (e.g., the mean log of annual income is not easily interpreted). In models with multiple predictors, transformations of the dependent variable can create specification error that undermines covariate control because it alters the relationships between the outcome variable and all predictors in the equation. Years ago, before high-speed computers were widespread, analysts had little choice but to use transformations to make measures conform to the assumptions of the limited number of parametric statistical strategies available. Such practices are rarely needed today given the array of modern methods of analysis that are available.
A growing number of statisticians recommend that analysts simply abandon the more traditional tests that make strong population assumptions unless they are confident in assumption viability based on theory or extant research. Rather, analysts should routinely use modern-day robust methods of analysis or, at the very least, routinely supplement traditional methods with modern robust methods (Keselman et al., 2008; Wilcox, 2005). These scientists recognize that cases may occur where defaulting to robust analytic strategies will result in some loss of statistical power and inaccurate probability coverage of confidence intervals (CIs). However, the argument is that in the long run, the use of robust methods will result in better Type I error protection, increased power to detect effects, and CIs that more accurately reflect the desired probability coverage (Wilcox, 1998). Of course, it is always important to explore the shapes of distributions and dispersions of data. However, the recommendation is to view traditional tests of model assumptions and remedial strategies based on transformations with caution, deferring instead to the use of more modern robust analytic methods.
Earlier, I discussed causal theories that are multilevel in nature, such as theories of how characteristics of organizations as well as characteristics of individuals affect the behavior of individuals within an organization. Research that tests multilevel models often strategically samples organizations (called Level 2 units) and individuals nested within those organizations (called Level 1 units). In such cases, the statistical assumption of independent residuals/errors often is untenable because of the impact that individuals within an organization have on one another, either directly or indirectly. In such cases, specialized statistical methods must be used to deal with the dependencies (see Yuan & Schuster, Chapter 18, Volume 1).
In sum, the choice of an analytic method to use with data is impacted by the equations used to represent a theory, the psychometric properties of the measures, whether one seeks full information estimation or limited information estimation, and the population distributional assumptions that the statistical tools available to the researcher make relative to the actual population distributions. A growing number of scientists suggest adopting robust methods in favor of traditional methods because of the complexities of two-step strategies that rely on preliminary tests of assumptions.

Outliers
Outliers are unusually small or large scores that distort basic trends in the data. For example, for the scores 2, 3, 4, 5, 6, 7, 8, 9, 10, and 50, the last score is an outlier that distorts the mean and makes the use of the mean suspect as a way to characterize the central tendency of the data. Simple methods for outlier detection compare the results

results of an analysis when the case is deleted. Such approaches, however, can be nondiagnostic when multiple outliers are present. For example, if there are two individuals in an analysis who distort a mean upward, deleting only one of them may not reveal an "outlier" effect as long as the second outlier is still in the data. Only when both outliers are removed is their distorting character revealed. Outlier identification is a complex enterprise, with some of the most sophisticated work being pursued in the literature on robust statistics (see Wilcox, 2003, 2005, for elaboration).

Wilcox (1998, 2006) objects to applying traditional inferential methods to data that have eliminated outliers based on simple outlier detection methods. He argues that doing so invalidates the statistical theory on which the inferential tests are based because of dependencies that outlier elimination creates. Others recommend conducting analyses with and without outliers to determine if conclusions change. If conclusions do change, then one moves forward with any conclusions on a tentative basis. Probably the most effective strategy for dealing with outliers is to focus on parameter estimation methods that are resistant to outliers. For an introduction to these methods, see Wilcox (2005).

Model Selection
Model selection refers to the process of choosing what one believes is the "best" model for describing a phenomenon from a set of candidate models. The set of plausible models might consist of many models or it might consist of just one or two models. Criteria for model selection can be considered in general terms using a mindset of specifying criteria to evaluate the overall quality of a theory in general; or they can be discussed in specific, quantitative terms when choosing between competing models within an experiment. I consider both perspectives.

General Criteria for Evaluating Theories
Consensual validation is one basis by which theories are evaluated. This refers to the degree of consensus among the scientific community about the validity of the theory. If a theory enjoys widespread acceptance, then it is seen as being a "good" theory. Shaw and Costanzo (1982) have argued that three criteria are crucial for the acceptance of a theory. First, the theory must be logically consistent—that is, the theoretical statements within the conceptual system must not be contradictory. Second, the theory must be in agreement with known data and facts. Third, the theory must be testable—that is, a theory must ultimately be subject to empirical evaluation.

In addition to these criteria, Shaw and Costanzo (1982) have specified six additional features of a theory that are desirable but not necessarily critical. First, a theory should be stated in terms that can be understood and communicated to most other scientists. Second, the theory should be parsimonious in that it adequately explains a phenomenon but with a minimum of concepts and principles. All other things being equal, preference is given to theories that make fewer assumptions. Third, although we recognize that theories are occasionally so novel that they upset the theoretical apple cart, a theory should be consistent with other accepted theories that have achieved consensus among the scientific community—that is, it should be able to be integrated into existing bodies of theory. A fourth desideratum is scope. Other things being equal, the greater the range of the theory, the better it is thought to be. That said, there are times when narrow range theories tend to hold up better over time than broad range theories. Creativity or novelty is a fifth criterion sometimes suggested for evaluating a theory. A theory that explains the obvious is generally not as highly valued by the scientific community as one that provides a novel insight into an interesting phenomenon. Finally, many scientists suggest that a good theory is one that generates research activity.

Brinberg and McGrath (1985) have noted that the various theory desiderata sometimes conflict with each other. For example, parsimonious theories tend to be more limited in scope. As such, theorists often make trade-offs as they construct theories to maximize what is valued by the scientific community.

Choosing Between Models in a Given Study
As one pursues the theory construction process, one may develop competing theories that either make opposite predictions or that account for the same phenomena but using different explanations and assumptions. Faced with such scenarios, we design research to help us choose between the competing theories. Many scientists consider research that chooses between two or more logical and plausible theories to be inherently more interesting than studies that yield results regarding a single theory (Platt, 1964).

In some cases, competing models make qualitatively opposite predictions about how data should

100 theory construction, model building


pattern themselves in a given study. In these cases, deciding which theory better accounts for the data is reasonably straightforward. As an example, suppose we describe the personal qualities of a political candidate to a person who has not heard of him or her by providing the person with three pieces of information about the candidate. Suppose that the three pieces of information all are quite positive (e.g., the candidate is said to be honest, smart, and empathic). For purposes of developing this example, we characterize how positive each piece of information is considered to be using a metric that ranges from 0 to 10, with higher numbers reflecting higher degrees of positivity. Suppose we want to predict how favorable a person will feel toward the candidate based on the three pieces of information. One plausible model states that the overall degree of favorability is a function of the sum of the positivity of each individual piece of information—that is, that people "tally up" the affective implications of each piece of information when forming their overall impression. Another model, by contrast, specifies a different function—namely, an averaging function. In this case, the overall feeling of favorability is thought to be a function of the average positivity of the information presented.

What are the implications of specifying the function as being summative versus averaging in form? It turns out, they are considerable. Let's explore the summation model first. Suppose the positivity values of the three pieces of information are 8, 8, and 8, respectively. The overall feeling of favorability toward the candidate will be a scaled function of 8 + 8 + 8 = 24. Now suppose we describe a second candidate to this person using the same three pieces of information but we add a fourth descriptor to them (cunning) that has a positivity value of 4. According to the summation model, the overall feeling of favorability toward this new candidate will be a scaled function of 8 + 8 + 8 + 4 = 28, and the person will prefer the second candidate to the first candidate. Psychologically, it is as if the second candidate has all the same qualities as the first candidate, and then "as a bonus," you get a fourth positive attribute as well. Hence, the person prefers the second candidate to the first candidate.

Now consider instead the averaging function. The overall feeling toward the first candidate is predicted to be (8 + 8 + 8)/3 = 8.0, and the overall feeling toward the second candidate is said to be (8 + 8 + 8 + 4)/4 = 7.0. In the averaging model, exactly the reverse prediction is made in terms of candidate preference—namely, the person now will prefer the first candidate to the second candidate. Psychologically, the first candidate has nothing but very positive qualities, whereas the second candidate has very positive qualities but also some qualities that are only somewhat positive. The person prefers the first candidate, who has nothing but very positive qualities, to the second candidate, who has very positive qualities but also moderately positive qualities.

In the above example, the summation and averaging models make opposite predictions about candidate preference, and it is straightforward to choose between the theories based on an empirical study that asks people which of the two candidates they prefer. Of course, if the study found that neither candidate tended to be preferred, then this would question both models.

Although studies like the above are compelling, it is common to conduct studies where competing theories do not make qualitatively opposite predictions, but, rather, they make predictions about how data should pattern themselves that allow researchers to choose between them. As an example, in the area of intelligence testing, theorists agree that there are different kinds of intelligence and cognitive abilities, such as math skills, vocabulary breadth, spatial skills, motor skills, and memory. Studies suggest that measures of these constructs are moderately correlated with one another. Some theorists believe that the correlations among them result from the common influence of general intelligence, sometimes called g. We refer to this as a "single factor model" because it posits that the correlations among the measures can be accounted for by a single underlying factor. Other theorists believe that g does not exist and that the correlations among the different abilities are a result of a more complex constellation of determinants consisting of multiple factors.

Suppose in a population of individuals the correlations among the five variables are as presented in Table 5.1. It can be shown using somewhat involved,

Table 5.1. Correlations for Intelligence Example

        Y1      Y2      Y3      Y4      Y5
Y1    1.00    0.72    0.63    0.54    0.45
Y2    0.72    1.00    0.56    0.48    0.40
Y3    0.63    0.56    1.00    0.42    0.35
Y4    0.54    0.48    0.42    1.00    0.30
Y5    0.45    0.40    0.35    0.30    1.00
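Under a one-factor model, corresponding entries in any two columns of Table 5.1 form constant ratios across rows (ignoring rows containing a 1.00 diagonal entry). The short Python sketch below is my own illustration, not part of the chapter (the dictionary layout and function names are assumptions); it reproduces the constant ratios 1.125 and 1.600 for the column pairs worked out in the text.

```python
# Population correlations from Table 5.1 (upper triangle; the matrix is symmetric).
R = {
    ("Y1", "Y2"): 0.72, ("Y1", "Y3"): 0.63, ("Y1", "Y4"): 0.54, ("Y1", "Y5"): 0.45,
    ("Y2", "Y3"): 0.56, ("Y2", "Y4"): 0.48, ("Y2", "Y5"): 0.40,
    ("Y3", "Y4"): 0.42, ("Y3", "Y5"): 0.35,
    ("Y4", "Y5"): 0.30,
}

def r(a, b):
    """Correlation between two variables, order-free (1.00 on the diagonal)."""
    return 1.0 if a == b else R.get((a, b), R.get((b, a)))

def column_ratios(col_a, col_b):
    """Row-wise ratios r(row, col_a) / r(row, col_b), skipping rows with a 1.00 entry."""
    rows = [v for v in ("Y1", "Y2", "Y3", "Y4", "Y5") if v not in (col_a, col_b)]
    return [round(r(row, col_a) / r(row, col_b), 3) for row in rows]

print(column_ratios("Y1", "Y2"))  # rows Y3, Y4, Y5 -> [1.125, 1.125, 1.125]
print(column_ratios("Y2", "Y5"))  # rows Y1, Y3, Y4 -> [1.6, 1.6, 1.6]
```

Any pair of columns can be passed to `column_ratios`; under a one-factor model every returned list is constant, which is the regularity a fit index measures departure from.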

but tedious, algebra that the one-factor model predicts that the correlations should pattern themselves in a certain way—that is, they should follow certain mathematical regularities. For example, if we choose any two columns and ignore entries where a 1.00 occurs, then the ratios of the respective row entries for the two columns should be equal in value to the comparable ratio for any row in the matrix. For example,

For columns 1 and 2:

.63/.56 = .54/.48 = .45/.40 = 1.125

and

For columns 2 and 5:

.72/.45 = .56/.35 = .48/.30 = 1.600.

By contrast, models that posit more complex patterns of underlying determinants (e.g., a two-factor model, a three-factor model) predict a different set of regularities, which we do not elaborate here in the interest of space (see McDonald, 1985). The model selection task is to choose which of the competing models (e.g., a one-factor model, a two-factor model, etc.) is most consistent with the data.

Statisticians have developed quantitative indices of fit that represent the overall degree of correspondence between the pattern of data predicted by a model and the observed data pattern. When there are competing models, a fit index is derived for each model and then the fit indices are compared to identify the model that best accounts for the data. The technical details of such comparisons can be quite complex, and these are elaborated throughout the many chapters in this volume. The point I am emphasizing here is that even if two models do not make qualitatively opposite predictions, as was the case for the example on summation versus averaging, models often can be compared in terms of their relative consistency with the data, with one model ultimately being declared as being more consistent with the data than another model based on the comparison of quantitatively defined fit indices.

There is, however, an additional complication in model selection. In practice, it would be surprising if sample data yielded a set of correlations that perfectly followed the predicted regularities of, say, a one-factor model, even if the one-factor model was operative in the population. This is because sample correlations are subject to sampling error and will randomly deviate to a greater or lesser extent from the true population correlations. Even if this is the case, one expects that if the one-factor model is true in the population, then the sample correlation matrix should at least reasonably approximate the regularities predicted by the one-factor model. A challenge for scientists when comparing models vis-à-vis indices of fit (or, for that matter, evaluating a single model using an index of fit) is to take into account such sampling error.

A final strategy that one encounters when scientists evaluate competing models is the case where scientists prefer the model that explains the most variation in an outcome variable. In these cases, one typically cannot differentiate models in terms of degree of fit to the data, as in the previous examples we considered. Nevertheless, one can differentiate between them in terms of whether one model accounts for significantly more variation in an outcome than the other model. For example, when explaining adolescent drug use, one might compare a model that assumes drug use is a simple additive function of gender and grade in school versus a model that assumes drug use is an additive function of gender, grade, plus the interaction effect between gender and grade. If the models account for about the same amount of variation in drug use, then the first model will take precedence on grounds of parsimony and because taking into account the interaction effect does not seem to matter. By contrast, if the three-parameter model explains substantially more variation in drug use than the two-parameter model, then the three-parameter model is preferred.

As scientists consider the complex issues surrounding model selection, a paramount concern is not to mistakenly embrace a model that is misspecified. A misspecified model is a model that is wrong because it (1) left out an important variable from the model whose omission biases parameter estimates and leads to faulty conclusions (also called left-out-variable-error or LOVE problems); (2) assumed one type of function between variables (e.g., linear) when, in fact, a different type of function was operative (e.g., a threshold function), with such misspecification leading to faulty conclusions; and/or (3) incorrectly modeled the temporal dynamics within a model in such a way that parameter estimates and conclusions are non-trivially biased. Throughout this volume, chapters elaborate on issues of model misspecification and ways to gain perspectives on it. Cudeck (1989) has argued that the crucial issue is not whether a model is misspecified because misspecification is inevitable in so much of the research we conduct. Rather, the more central issue is whether the degree of misspecification is



sufficiently large that it leads to conclusions that are incorrect or misplaced.

In sum, model selection can involve two processes: (1) evaluating a model by and of itself to determine whether it meets criteria the scientific community judges to be important (such as being logical, testable, consistent with known facts/data, etc.), and (2) quantitatively comparing a model with other competing models within a given study to identify which model best accounts for the observed data or explains the most variation in an outcome. Three strategies are typically used to choose between models. First, one designs a study where the competing models make qualitatively opposite predictions. Second, one deduces from the model how data should pattern themselves and then derives a quantitative (fit) index representing the correspondence between the predicted patterns and the observed patterns. The model that "fits" the data best is the preferred model, everything else being equal. Third, one selects a model that can explain the most meaningful variation in a targeted outcome variable. There are many nuances surrounding the above strategies, and these are elaborated in other chapters in this volume.

Concluding Comments
This chapter has described general issues involved in (1) building causal theories, (2) translating those theories into a set of mathematical representations, (3) choosing an analytic strategy to estimate parameters and sampling error vis-à-vis those equations, and (4) choosing the "best" model from a set of competing models. There is a vast array of technical issues to consider as one approaches these four tasks, and these technicalities are elaborated throughout the chapters in this volume. My purpose here was to provide a "big picture" view of the broader enterprise so that we can "see the forest through the trees" as we approach the task of building cumulative bodies of knowledge and solving social problems.

References
Blanton, H. & Jaccard, J. (2006). Tests of multiplicative models in psychology: A case study using the unified theory of implicit attitudes, stereotypes, self-esteem and self-concept. Psychological Review, 113, 155–169.
Blaylock, J., Salathe, L. & Green, R. (1980). A note on the Box-Cox transformation under heteroskedasticity. Western Journal of Agricultural Economics, 45, 129–135.
Bollen, K. A. (2001). Two-stage least squares and latent variable models: Simultaneous estimation and robustness to misspecifications. In R. Cudeck, S. Du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future. A festschrift in honor of Karl Jöreskog (pp. 119–138). Lincoln, IL: Scientific Software.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation approach. Hoboken, NJ: Wiley.
Brinberg, D. & McGrath, J. (1985). Validity and the research process. Newbury Park, CA: Sage.
Brodbeck, M. (1968). Readings in the philosophy of the social sciences. New York: Macmillan.
Brown, T. (2006). Confirmatory factor analysis for applied research. New York: Guilford.
Budescu, D. & Appelbaum, M. (1981). Variance stabilizing transformations and the power of the F test. Journal of Educational and Behavioral Statistics, 6, 55–74.
Bunge, M. (1961). Causality, chance, and law. American Scientist, 69, 432–488.
Carnap, R. (1971). A basic system of inductive logic, Part 1. In R. Carnap & R. C. Jeffrey (Eds.), Studies in inductive logic and probability (Vol. I, pp. 18–39). Berkeley, CA: University of California Press.
Carroll, R. J., & Schneider, H. (1985). A note on Levene's test for equality of variances. Statistics & Probability Letters, 3, 191–194.
Cartwright, N. (2007). Hunting causes and using them. New York: Cambridge University Press.
Collins, L. (2006). Analysis of longitudinal data: The integration of theoretical model, temporal design, and statistical model. Annual Review of Psychology, 57, 505–528.
Coombs, C., Dawes, R. & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice Hall.
Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. Psychological Bulletin, 105, 317–327.
Doksum, K. A., & Wong, C.-W. (1983). Statistical tests based on transformed data. Journal of the American Statistical Association, 78, 411–417.
Dubin, R. (1976). Theory building in applied areas. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 17–39). Chicago: Rand McNally.
Erceg-Hurn, D. M., Wilcox, R. R., & Keselman, H. H. (2012). Robust statistical estimation. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 388–406). New York: Oxford University Press.
Finkel, S. (2008). Linear panel analysis. In S. Menard (Ed.), Handbook of longitudinal research (pp. 475–504). New York: Elsevier Press.
Frank, P. (1961). Modern science and its philosophy. New York: Collier Books.
Green, P. & Tull, D. S. (1975). Research for marketing decisions. Englewood Cliffs, NJ: Prentice-Hall.
Jaccard, J. & Jacoby, J. (2010). Theory construction and model building skills: A practical guide for social scientists. New York: Guilford.
Kaplan, A. (1964). The conduct of inquiry. San Francisco, CA: Chandler.
Keselman, H., Algina, J., Lix, L., Wilcox, R. & Deering, K. (2008). A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychological Methods, 13, 110–129.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R., Donahue, B., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350–386.
Keyes, T. & Levy, M. S. (1997). Analysis of Levene's test under design imbalance. Journal of Educational and Behavioral Statistics, 22, 227–236.
Lewis, D. (2000). Causation as influence. Journal of Philosophy, 97, 182–197.
Long, S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Maxwell, S. E. & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective. Mahwah, NJ: Erlbaum.
McDonald, R. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (2012). Modern test theory. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 118–143). New York: Oxford University Press.
Milligan, G. (1987). The use of the arc-sine transformation in the analysis of variance. Educational and Psychological Measurement, 47, 563–573.
Morgan, S. & Winship, C. (2007). Counterfactuals and causal inference. New York: Cambridge University Press.
Pap, A. (1962). An introduction to the philosophy of science. Glencoe, IL: The Free Press of Glencoe (Macmillan).
Parra-Frutos, I. (2009). The behaviour of the modified Levene's test when data are not normally distributed. Computational Statistics, 24, 671–693.
Pearl, J. (1999). Causality: Models, reasoning, and inference. New York: Cambridge University Press.
Platt, J. (1964). Strong inference. Science, 146, 347–353.
Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.
Russell, B. (1931). The scientific outlook. London: George Allen & Unwin.
Schumacker, R. E. & Lomax, R. G. (2004). A beginner's guide to structural equation modeling. Mahwah, NJ: Erlbaum.
Shadish, W. R., Cook, T. & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Shapiro, S. S. & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591–611.
Shaw, M. & Costanzo, P. (1982). Theories of social psychology. New York: McGraw-Hill.
Sheth, J. N. (1967). A review of buyer behavior. Management Science, 13, B718–B758.
Simon, H. A. & Newell, A. (1956). Models: Their uses and limitations. In L. D. White (Ed.), The state of the social sciences (pp. 61–83). Chicago, IL: University of Chicago Press.
Sowa, J. (2000). Processes and causality. From www.jfsowa.com/ontology/causal.htm. Accessed August 15, 2011.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Wilcox, R. R. (1996). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods. American Psychologist, 53, 300–314.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press.
Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). San Diego, CA: Academic Press.
Wilcox, R. R. (2006). Graphical methods for assessing effect size: Some alternatives to Cohen's d. Journal of Experimental Education, 74, 353–367.
Wilcox, R. R., Charlin, V. L., & Thompson, K. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F* statistics. Communications in Statistics: Simulation and Computation, 15, 933–944.
Yuan, K.-H., & Schuster, C. (2012). Overview of statistical estimation methods. In T. D. Little (Ed.), The Oxford Handbook of Quantitative Methods (Vol. 1, pp. 360–386). New York: Oxford University Press.


CHAPTER 6

Teaching Quantitative Psychology

Lisa L. Harlow

Abstract

This chapter strives to enliven quantitative psychology teaching and encourage statistical literacy.
Summaries of quantitative training, at the undergraduate and graduate levels, offer guidelines to
improve instruction and call for greater emphasis on measurement, research, and quantitative
methods. Strategies for effectively teaching quantitative psychology are suggested, including active,
hands-on learning for engaging students, e-learning and Web-based instruction, mentoring and role
models, and encouraging conceptual understanding. Quantitative students are encouraged to focus on
the nature of research questions, similarities and differences in statistical methods, and interpreting
findings with statistical tests, effect sizes, and confidence intervals. Future directions are offered
regarding model building, quantitative learning beyond the classroom through workshops, membership
in quantitative societies, and reading the quantitative literature. The reach of quantitative training
should be widened to more readily include those from disadvantaged, early career, and
under-represented groups to further strengthen the field and enlighten greater numbers about the
wonders of quantitative psychology.
Key Words: quantitative training, statistical literacy, engaging students, strategies for teaching, active
learning, mentoring, underlying concepts, widening quantitative reach

Introduction
Readers of this volume are part of a rare and unique subset of individuals who resonate with the term quantitative psychology. When the topic of psychology is discussed, whether by students, faculty, or the general public, the qualifier "quantitative" does not always enter the conversation. Students in psychology often delay taking required statistics and research courses, short-changing their ability to understand and develop scientific skills necessary to open up their career options and further the field of psychology (Rajecki, Appleby, Williams, Johnson, & Jeschke, 2005). In more than 25 years of teaching quantitative psychology, I have learned not to be too surprised by cringing and flinching, if only subtle, in students enrolled in my courses. Experience and research make it all too apparent that students often approach quantitative courses with little interest or confidence, coupled with anxiety and misperceptions about learning statistical material (Ashcraft & Kirk, 2001; DeVaney, 2010; Harlow, Burkholder, & Morrow, 2006; Onwuegbuzie, 2000; Onwuegbuzie & Wilson, 2003; Piotrowski, Bagui, & Hemasinha, 2002). What compounds the problem is that students with high anxiety tend to have poorer attitudes toward and lower performance in quantitative studies (Budé, Van De Wiel, Imbos, Candel, Broers, & Berger, 2007; Harlow, Burkholder, & Morrow, 2002; Mills, 2004; Rodarte-Luna & Sherry, 2008; Tremblay, Gardner, & Heipel, 2000). Fortunately, quantitative attitudes have been shown to predict quantitative performance just as strongly as
pre-course quantitative skill in a class with engaging activities to get students involved, suggesting that it is worthwhile to try to reduce quantitative anxiety by making quantitative learning more appealing (Harlow, Burkholder, & Morrow, 2002).

Thus, it is probably not extraordinary that I find it my greatest joy to try to engage and enlighten students with the remarkable perspective of quantitative thinking and analysis. I go so far as to argue that any student interested in psychology has the making of a quantitative scientist, if only at some latent, implicit level. What student has not posed a possible theory for why something is, or is not, a certain way? Doesn't everyone wake up each morning with one or more hypotheses about how the day will emerge and which variables or factors will help bring about the desired effect, whether small, moderate, or large? And don't we all think about how much confidence we can place on our hypotheses or expectations, depending on specific mitigating or confounding circumstances? I believe that we all engage in these activities, however formal or informal, in which quantitative psychology can play a pivotal role. Quantitative psychology has the potential to empower and enlighten with the training, skills, reasoning, and capabilities to formalize these kinds of questions, helping us to describe and understand the essence of myriad data that come our way. With more advanced quantitative training, we could further develop and analyze intricate theoretical models that help explain and predict complex processes and behaviors, integrating information necessary to inform and improve interventions, policies, and the human condition (Harlow, 2010; Rodgers, 2010).

In this chapter, I highlight some major issues and topics that go into teaching quantitative psychology. First, I provide an overview of quantitative training to get an overarching picture of the field. Second, I suggest a number of issues that should be taken into account when teaching quantitative courses, along with research that investigates ways to address these concerns. Third, I offer several themes that run through many statistical methods to help tie together the numerous, seemingly isolated and obscure quantitative procedures. Finally, I summarize the current field of quantitative teaching and give recommendations for other options to supplement, enrich, and expand statistical learning.

Overview of Quantitative Training
Aiken et al. (1990, 2008) surveyed more than 200 graduate training programs regarding statistics, measurement, and methodology. Only 15% of the graduate programs had a quantitative area, although about 50% offered a minor in quantitative studies. Almost all of the graduate programs required students to take at least one or two quantitative courses during their first year, although more than 25% sent students to other departments to get this training. Required quantitative courses usually included basic analysis of variance (ANOVA) and some regression, with half to two-thirds of programs including limited training in measurement or research. Computer training in SPSS and SAS is often part of the curriculum, with the possibility of EQS, AMOS, Mplus, LISREL, and other software exposure in more advanced courses (e.g., structural equation modeling) in about half of the programs. Over a span of almost two decades between surveys, quantitative training had not improved much; over time there was slightly more emphasis on measurement and deterioration in coverage of research design. Much improvement in advanced methods training is still needed in most departments to impart the expertise needed to effectively compete in quantitative studies and fill needed positions in the workforce.

It is becoming readily apparent that efforts are needed to increase the number of quantitatively trained individuals. A recent report from an American Psychological Association (APA) Task Force to Increase the Quantitative Pipeline (Aiken et al., 2007) reported that in recent history, whereas there were approximately two applicants for each non-quantitative doctoral-level job (e.g., cognitive, developmental, social), there were almost 2.5 jobs for every quantitatively trained PhD student. Further, the focus of quantitative methodology is moving away from emphasis on learning a set of specific methods and procedures and instead placing greater priority on developing a broader vision of quantitative science through theory building, modeling underlying processes, and integrating information across meta-analytic studies (e.g., Harlow, 2010; Jaccard & Jacoby, 2009; McGrath, 2011; Rodgers, 2010; Rosenthal & DiMatteo, 2001). Consistent with this vision is a focus on encouraging greater statistical reasoning, thinking, and literacy, rather than rote learning (Gal, 2003; Garfield & delMas, 2010; Ridgway, Nicholson, & McCusker, 2007). Researchers are realizing that quantitative literacy is a needed goal in undergraduate studies, beginning with the most basic, introductory statistics courses (Ben-Zvi & Garfield, 2004; Mulhern & Wylie, 2004, 2006; Rumsey, 2002; Watson, 2006). Resources are becoming available to help encourage

106 t e a c h i n g q u a n t i tat i v e p s yc h o l o g y
quantitative thinking (Garfield & Ben-Zvi, 2008; Saville, Zinn, Lawrence, Barron, & Andre, 2008; Stark & Krause, 2009), along with a realization of the challenges involved (Ben-Zvi & Garfield, 2004; Sen, 2004).

Garfield and colleagues from the University of Minnesota developed a program in Quantitative Methods in Education that is at the forefront on improving undergraduate statistical training. Much of their research focuses on training statistics teachers (Garfield & Everson, 2009), reforming the way introductory statistics is taught (Garfield, Hogg, Schau, & Whittinghill, 2002), and investigating how to teach and learn statistics (Garfield & Ben-Zvi, 2007). Other researchers (Friedrich, Buday, & Kerr, 2000) also help outline the field by surveying undergraduate programs across the country with respect to quantitative training. In the next section, I review research focusing on strategies for teaching quantitative psychology (e.g., Gelman & Nolan, 2002), including a set of guidelines developed by Garfield and colleagues (Franklin & Garfield, 2006).

Strategies for Teaching Quantitative Psychology
There is an abundance of resources to aid instructors of statistics courses, including best practice volumes (e.g., Dunn, Smith, & Beins, 2007; Hulme, 2007; Hulsizer & Woolf, 2009), compilations of research on teaching statistics (e.g., Bjornsdottir & Garfield, 2010; Zieffler, Garfield, Alt, Dupuis, Holleque, & Chang, 2008), learning objectives for introductory statistics and research (Tomcho, Rice, Foels, Folmsbee, Vladescu, Lissman, Matulewicz, & Bopp, 2009), and suggestions for teaching quantitative courses (Garfield & Everson, 2009; Ware & Johnson, 2000). In this section, I present several strategies that have been researched and recommended to improve statistical literacy.

Active Learning
Research has demonstrated that actively involving students significantly improves performance in quantitative courses (e.g., Helman & Horswill, 2002). Guidelines to improve statistical learning (e.g., encouraging statistical thinking, using examples with technology and real data, emphasizing concepts, promoting active learning) reiterate this approach of engaging students in the process (Everson, Zieffler, & Garfield, 2008; Franklin & Garfield, 2006). Humor is also a great icebreaker, catching students' attention (e.g., Cobb, 1999, 2007; Zeedyk, 2006), helping to diffuse tension, and surreptitiously calling students back to quantitative learning.

Creating a lively environment is essential for capturing the interest of students. A number of researchers have emphasized the importance of hands-on and interactive learning (Dolinsky, 2001; Kolar & McBride, 2003; Wulff & Wulff, 2004). A simple strategy could be to invite students to write down what is clear and not clear at the end of each lecture, with the faculty clarifying unclear points at the beginning of the next lecture (Harlow, Burkholder, & Morrow, 2006). It takes little time, and students get fairly immediate feedback on how to clear up misunderstandings, whereas faculty get a clearer idea of what the students understood and what needs to be clarified. Moreover, involving students with creative examples (Chew, 2007; Schwartz & Martin, 2004) and analyzing data (e.g., Nie & Lau, 2010; Watkins, Scheaffer, & Cobb, 2004) bring about more in-depth learning than traditional lecture-based approaches. Demonstrating visual images and graphs of procedures also helps improve understanding (e.g., Peden, 2001).

If there is more time, then students can be arranged into small groups and be given a research scenario in which they need to consult among themselves to recommend statistical procedures to address the research question (Harlow, Burkholder, & Morrow, 2006). For a basic example, students could be asked how to assess whether men and women differed on hours of exercise per week (i.e., with a two-sample independent t-test) or whether the number of hours worked at an outside job was related to GPA (i.e., with a correlation). A number of researchers have recommended engaging students in these cooperative small groups (DaRos-Voseles, Collins, Onwuegbuzie, & Jiao, 2008; Onwuegbuzie, Collins, & Jiao, 2009; Peterson & Miller, 2004) or learning communities (e.g., Barron, Benedict, Saville, Serdikoff, & Zinn, 2007) where students work together to understand quantitative material and immerse themselves in the process. Krause, Stark, and Mandl (2009), on the other hand, have found that cooperative groups did not directly improve statistics performance, although students reported greater perceived efficacy when working with others. It may be that group learning is not effective for all students, with more advanced students possibly benefiting the least. For example, Harlow, Burkholder,

harlow 107
and Morrow (2002) found that learning activities that included working in groups and with peer mentors were viewed more favorably when students were more anxious and had lower confidence about quantitative learning. Future research could investigate whether ability or achievement level is a moderator for group learning and performance to clarify who benefits from group learning.

Technology and Learning
Other research has revealed the value of technology in heightening quantitative learning using computer-assisted analysis (Bartz & Sabolik, 2001), Web-based tutorials (Bliwise, 2005), and specific online learning programs such as Estimating Statistics (EStat; Britt, Sellinger, & Stillerman, 2002), Simulation-Assisted Learning Statistics (SALS; Liu, Lin, & Kinshuk, 2010), the Utah Virtual Lab (Malloy & Jensen, 2001), Statistical Understanding Made Simple (SUMS; Swingler, Bishop, & Swingler, 2009), or Web Interface for Statistics Education (WISE; Berger, n.d., http://wise.cgu.edu/). In a meta-analysis of 45 studies, there was a small- to medium-sized Cohen's d (i.e., 0.33) performance benefit effect size attributed to learning statistics with computer-assisted instruction versus learning in a lecture-based control group that did not provide such input (Sosa, Berger, Saw, & Mary, 2010). Thus, students who had technology-enhanced instruction demonstrated one-third of a standard deviation higher performance than those students without such advantage; further, the improvement was even more salient when involving those who were more advanced (i.e., graduate students) and when more time was allotted for instruction.

Still, whereas research has demonstrated the benefits of e-learning approaches (e.g., Fillion, Limayem, Laferrière, & Mantha, 2008; Hanley, 2004; Sosa et al., 2010; Wender & Muehlboeck, 2003), more research is needed, as others remain unconvinced of the merits of adding technology to the classroom environment (e.g., Härdle, Klinke, & Ziegenhagen, 2007). Even the value of online discussions and whether or how much faculty should facilitate or contribute is not entirely clear (e.g., Mazzolini & Maddison, 2007). Instructors are encouraged to consider literature on cognitive learning and assessment to improve Web-based materials for students, particularly in quantitative learning (e.g., Romero, Berger, Healy, & Aberson, 2000).

Mentors and Role Models
Mentoring can help students get the extra input needed to understand quantitative concepts (e.g., Ferreira, 2001) and can help to supplement class lectures and faculty input (e.g., Katayama, 2001). Fortunately, graduate teaching assistants (TAs) are often provided for undergraduate- and graduate-level quantitative courses. When TAs are not funded, I have found it very effective to invite one or more top achievers from a previous semester to serve as volunteer peer mentors or TAs, offering independent study or teaching practicum credit. Students in the course benefit from having a peer of similar age demonstrating and facilitating expertise in quantitative methods. TAs or peer mentors gain efficacy and greater confidence, often choosing to become even more involved with other quantitative courses and research to continue building their skills (e.g., Harlow, Burkholder, & Morrow, 2002, 2006).

Mentoring can be particularly valuable for women and individuals from under-represented groups who have few role models in the quantitative field, for providing direction, support, and encouragement (e.g., Kosoko-Lasaki, Sonnino, & Voytko, 2006; Neal-Barnett, Mitchell, & Boeltar, 2002; Zirkel, 2002).

Conceptual Approach to Teaching
Perhaps the most effective idea for conveying complex quantitative material is to focus on the concepts rather than using a strictly mathematical approach (e.g., Atkinson, Catrambone, & Merrill, 2003). Chiou (2009) found that encouraging students to collaborate on conceptually mapping statistical material significantly improved performance compared to having students complete textbook exercises and calculations. In another study, Aberson, Berger, Healy, and Romero (2003) demonstrated that an interactive approach to hypothesis testing concepts was received more positively and improved performance over traditional laboratory exercises. Similarly, Meletiou-Mavrotheris and Lee (2002) found that helping students to understand concepts, improve statistical reasoning, and build intuitions about statistical ideas was more facilitating than using a traditional teaching approach.

Presenting and encouraging understanding of core concepts—particularly through hands-on engagement and research—can help foster more in-depth insight and greater involvement in inquiry-based future learning (Aulls & Shore, 2008; Dewey, 1997). Inviting students to seek out solutions to quantitative research problems promotes greater statistical awareness and literacy. In the next section, I present a set of conceptual themes that are common

to many statistical methods and that help provide a foundation for understanding introductory and advanced quantitative learning.

Themes that Run Through Quantitative Psychology
Quantitative understanding is increased when common ideas are revealed that occur in many statistical procedures. Harlow (2005) has emphasized a number of themes that run through multivariate methods and that can be extended to encompass univariate methods as well. Three basic themes are presented below to facilitate approaching, analyzing, and interpreting quantitative methods.

Considering the Research Question
First, it is helpful to encourage students to consider the kind of research question that needs to be addressed. Group difference questions can be analyzed with basic z-tests when information is known about the mean and the variability in the population. For example, we could investigate whether an exceptional students' class evinced an IQ that was different from the known average of 100, with a standard deviation of 15. When only a population mean is known or there are two groups involved, a t-test would be appropriate, requiring that a researcher estimate the population standard deviation(s) from the sample(s). For studies with two or more groups, an ANOVA would be useful for assessing whether there were differences among multiple groups. For example, a teacher may want to compare the level of interest in taking math courses for male and female students, which could be examined with a two-sample independent t-test. To investigate potential differences in math interest among students from three different college majors, an ANOVA would be helpful. To examine whether several groups differ on math interest, after accounting for the number of previous math classes taken, analysis of covariance (ANCOVA) could be used. Further, group differences across several groups could be assessed on several measures (e.g., math interest, math efficacy, and math anxiety) using multivariate analysis of variance (MANOVA). Similarly to the way that ANCOVA extended ANOVA by taking into account another predictor that correlated or covaried with the outcome, a multivariate analysis of covariance (MANCOVA) can extend a MANOVA when wanting to examine whether several groups differ on several outcomes (e.g., the three math attitudes suggested for the MANOVA) after taking into account one or more covariates such as number of previous math courses or GPA.

For those who are not specifically interested in mean differences across groups, a correlational question may be asked. Researchers interested in finding what is associated with achievement could conduct a simple correlation analysis, assessing whether math interest is related to college GPA. This bivariate procedure (between just two variables) could expand to canonical correlation (CC) that allows an examination of two sets of variables. For example, a researcher could examine whether several math attitude scores (e.g., math interest, math efficacy, and math anxiety) are related to several achievement outcomes (e.g., GPA, the number of conferences attended, and the number of memberships in honor societies). Correlation could also be extended to multiple regression (MR) to relate a linear combination of several continuous predictors (e.g., math ability, verbal ability, previous GPA, achievement motivation) to a single outcome such as current GPA. If the outcome were dichotomous (e.g., success or failure in a course or grade level), then the merit of similar predictors could be examined with logistic regression (LR) or discriminant function analysis (DFA), depending on whether there was interest in conveying the odds or the degree of correlation with the outcome, respectively. Multilevel modeling (MLM) would be useful to predict an outcome such as achievement, when there is reason to believe that participants are clustered in separate groups (e.g., classrooms, school districts). Thus, MLM allows, and even models, heterogeneity of variance across several groups in the data, contrary to more restrictive prediction methods (e.g., MR, LR, and DFA) that assume that data are drawn from a homogeneous group.

Other, more advanced, research questions could be addressed with more sophisticated methods. Factor analysis (FA) or principal components analysis (PCA) would be useful when faced with a large set of measures with a goal of identifying a few key dimensions that explain the underlying structure or associations (e.g., quantitative and verbal intelligence dimensions in an IQ test of multiple subtests). Factor analysis would allow the delineation of unique or error variance in measures before forming factors with the variance that was in common among the measures. In contrast, PCA analyzes all of the variance in the variables when forming components. In practice, there may not be much actual difference in results across these two

methods if the loadings are high for relevant variables on their respective factors, even with the initial difference in whether unique variance is included in the analysis (Velicer & Jackson, 1990). Structural equation modeling (SEM) could be used to examine whether latent factors, each with several measures, could be theoretically modeled to explain hypothesized underlying processes. For example, using Bandura's (1997) social learning theory, we could test whether different factors of intelligence (i.e., quantitative and verbal) would predict degree of self-efficacy in learning, which in turn could predict an achievement factor (measured by homework, quiz, and exam scores). In testing the SEM, other covariates (e.g., socioeconomic status, previous GPA) could be added as possible predictors of the mediator (i.e., self-efficacy) to see if they are important, or to rule them out as predictors of achievement. Further, multiple group analyses could examine whether such a predictive SEM of achievement held in different groups (e.g., men versus women, different ethnicities). If findings differed across groups, then it would indicate that the grouping variable moderated the prediction. Longitudinal modeling (e.g., latent growth curve modeling) could examine whether achievement changed over time in mean level and in the rate of change and whether predictors (e.g., previous GPA, IQ, gender) could predict the level (called the intercept) and rate of change (called the slope) across time. Other methods discussed in this volume could also be examined to further fine-tune the nature of the question being assessed.

Therefore, the main point of the first theme (type of research question asked) is that it is the nature of the research and the questions asked of the data that drive the kind of quantitative method that is selected. There is a whole world of methods to consider, and the choice is not so daunting when realizing that certain methods readily lend themselves to different kinds of research questions (e.g., group difference, correlation or prediction, underlying structure, longitudinal, etc.). In the next section, similarities and differences among methods is the second theme that is discussed. Here, it will become apparent that although there are distinctions among methods that make them more likely to be applied to specific research questions, many quantitative procedures share similarities in how they reveal the essence of the data. For example, researchers who find a significant difference in well being between groups who exercise regularly and those who do not will also find that there is a significant relationship between well being and exercise. Thus, both group difference and correlational methods can examine how much variance is shared between independent and dependent variables. In this regard, Cohen, Cohen, West, and Aiken (2003) have described how categorical grouping variables can be coded for use in correlational and regression procedures.

Noting Similarities and Differences in Quantitative Methods
Examining research questions and seeing specific methods that seem to be relevant to such questions leads into a second theme, which is to notice the similarities and differences among various quantitative procedures. Quilici and Mayer (2002) helped students notice the underlying similarities in different statistical word problems. This helped students to know how to analyze specific problems by noting their similarities to other statistical problems that were addressed by certain methods. Similarly, Derryberry, Schou, and Conover (2010) helped students understand how to conduct rank-based statistical tests by revealing their resemblance to already studied parametric tests.

Quantitative methods can be classified in several ways to help delineate common aspects or distinguishing differences. Thus, group-difference methods (e.g., z-test, t-test, ANOVA, ANCOVA, and MANOVA) are similar in allowing an examination of potential mean differences between groups on one or more outcome variables. In contrast, correlational methods (e.g., Pearson's r, CC) do not tend to focus on means but, rather, on assessing association between independent and dependent variables. Prediction methods (e.g., MR, LR, DFA, and MLM) all focus on predicting an outcome from a set of predictors and may have some emphasis on grouping variables—particularly LR and DFA, which explicitly include a categorical dependent variable, and MLM, which takes into account different levels or groups in the data. However, each of these prediction methods differs somewhat from ANOVA-based methods in focusing more on weights linking independent and dependent variables and less on mean scores such as group averages or centroids. Dimensional or structural methods (e.g., FA, PCA) are similar by involving one or more sets of measures with a smaller set of underlying factors or components posited or revealed to help explain the patterns of relationships among variables. Thus, dimensional methods are similar to correlational methods in their

focus on associations among variables and different from group difference methods that are more concerned with mean differences among groups. SEM combines the best of group difference, correlation/prediction, and structural methods by allowing an investigation of the means of latent factors across different samples or groups, while still allowing an examination or confirmation of the structural or correlational relationships among variables. Longitudinal methods (e.g., time series, latent growth modeling) add the element of time to allow an examination of temporal ordering that can help in assessing causal precedence among variables.

The main point of this second theme is to encourage students to see how various quantitative methods are similar and how they are different. Discerning these distinctions and similarities will go a long way toward highlighting the overriding features and underlying aspects of quantitative methods and in selecting an appropriate method to address a specific kind of research question.

Interpreting Findings from Quantitative Methods
Third, after identifying a research question and noticing similarities and differences that lead to selecting a method to analyze one's data, it is important to examine and interpret findings from multiple perspectives. Although the field of quantitative psychology has traditionally focused largely on significance testing, there is growing emphasis on considering other alternatives such as effect sizes and confidence intervals (e.g., Harlow, Mulaik, & Steiger, 1997; Kline, 2004; Wilkinson & the Task Force on Statistical Inference, 1999). Initially, it can be helpful to assess whether the overall variation in the data is significantly different from chance. Significance tests usually involve some ratio of variances. We can think of group difference models as explaining the ratio of how scores vary among groups, relative to how much scores vary within each group. This ratio can be readily recognized as the F-test. The closer this ratio is to 1.0, the less we are apt to consider between-groups variance as anything more meaningful than the within-group error variance. Correlational methods examine whether the covariance between variables is salient when contrasted with the variances within each variable. Thus, a correlation is simply a ratio of the covariance between variables over the product of the standardized variance (i.e., standard deviations) within each variable. Using an analogy relating individuals, when the covariance between two people is salient, despite their individual variance (or uniqueness), a meaningful relationship emerges.

After assessing significance, it is important to examine the magnitude of an overall finding, called an effect size (ES). For group-difference methods, an ES can indicate the number of standard deviation units of difference there is among group means (i.e., Cohen's d; Cohen, 1988), with small, medium, and large effects having values of about 0.20 (almost a quarter of a standard deviation), 0.50 (half a standard deviation), and 0.80 (almost a full standard deviation) (Cohen, 1988). For correlational methods, we would hope to show a meaningful relationship between pertinent variables, with values of 0.1, 0.3, and 0.5 indicating small, medium, and large Pearson product moment correlations, respectively. Particularly for prediction methods, as well as other methods, proportions of shared variance effect sizes (e.g., η² or R²) are useful for showing how much the independent and dependent variables have in common, ranging in size from 0 to 1.0. Proportion of variance ES values of 0.01, 0.09, and 0.25 indicate small, medium, and large univariate effects, respectively (obtained by squaring 0.1, 0.3, and 0.5 correlations, respectively); and 0.02, 0.13, and 0.26 or more refer to small, medium, and large multivariate effect sizes, respectively (e.g., Cohen, 1988; Harlow, 2005).

When interpreting ES, it is also important to provide an indication of the margin of error with confidence intervals (e.g., Cumming & Fidler, 2009; Gilliland & Melfi, 2010; Odgaard & Fowler, 2010). Confidence intervals provide a range of values, with narrower intervals indicating more precision for an estimated effect. Research journals such as the Journal of Consulting and Clinical Psychology and others are beginning to require that statistical findings be supplemented with effect sizes and confidence intervals (e.g., La Greca, 2005; Odgaard & Fowler, 2010).

Thus, the third theme, concerned with interpreting results, involves providing an indication as to whether a result is significantly different from chance, the magnitude of the effect, and the degree of certainty about the result.

Ultimately, understanding quantitative themes and concepts (e.g., Abelson, 1995), including framing initial research questions; noticing similarities and differences when selecting statistical methods; and interpreting the significance, extent, and precision of one's finding will go a long way toward moving psychology more in the forefront of quantitative science.
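The three-part interpretation above (a significance test, an effect size, and a confidence interval) can be sketched in a few lines of Python, using the kind of classroom scenario suggested earlier (exercise hours by group, and hours worked versus GPA). The data, the pooled-variance t-test, and the large-sample 1.96 critical value are all illustrative assumptions for a demonstration, not results from the chapter:

```python
import math

# Hypothetical weekly exercise hours for two groups (made-up classroom data)
group1 = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 2.0, 5.0]
group2 = [4.0, 5.5, 3.0, 6.0, 4.5, 5.0, 3.5, 6.5]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # Unbiased sample variance (divide by n - 1)
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(group1), len(group2)
diff = mean(group1) - mean(group2)

# Pooled standard deviation across the two groups
sp = math.sqrt(((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2))

# 1. Significance: two-sample independent t-statistic (df = n1 + n2 - 2)
se = sp * math.sqrt(1 / n1 + 1 / n2)
t = diff / se

# 2. Effect size: Cohen's d, the group difference in standard deviation units
d = diff / sp

# 3. Precision: approximate 95% CI for the mean difference
#    (1.96 is the large-sample normal value; small samples call for t critical values)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# A correlational effect size: Pearson's r as the ratio of the covariance
# to the product of the standard deviations (again, made-up numbers)
hours = [0, 5, 10, 15, 20, 25, 30, 35]          # hours worked per week
gpa = [3.9, 3.8, 3.6, 3.7, 3.2, 3.0, 2.9, 2.6]  # current GPA
mh, mg = mean(hours), mean(gpa)
cov = sum((h - mh) * (g - mg) for h, g in zip(hours, gpa)) / (len(hours) - 1)
r = cov / math.sqrt(var(hours) * var(gpa))

print(f"t = {t:.2f}, d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], r = {r:.2f}")
```

With these made-up numbers, t works out to -3.0 and Cohen's d to -1.5, a large group difference on the benchmarks above, and r is strongly negative; in practice the t-statistic would be referred to a t distribution with n1 + n2 - 2 degrees of freedom for an exact p-value rather than judged by eye.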

Conclusions
This chapter features a number of issues regarding the teaching of quantitative psychology. First, research on quantitative training underscores the need to provide greater guidelines and opportunities for learning more about measurement, research, and statistical methods (e.g., Franklin & Garfield, 2006; Garfield & Everson, 2009). Job opportunities are more available than are applicants for quantitative careers (e.g., Aiken et al., 1990, 2008). An APA task force on increasing the quantitative pipeline (Aiken et al., 2007) is aimed at bringing more people into this field to fill this need.

Second, strategies for effectively teaching quantitative psychology emphasize the advantages of active learning (e.g., Onwuegbuzie, Collins, & Jiao, 2009), technology and Web-based instruction (e.g., Bliwise, 2005; Fillion, Limayem, Laferrière, & Mantha, 2008; Sosa et al., 2010), mentoring and role models (e.g., Ferreira, 2001), and conceptual understanding (e.g., Mulhern & Wylie, 2004; Swingler, Bishop, & Swingler, 2009). All of these strategies implicitly invite students to become more immersed in the quantitative learning process and ultimately to become more statistically literate.

Third, several themes have been offered to bring about greater clarity in quantitative learning (Harlow, 2005). One of the themes includes considering the nature of the research question. Students should be encouraged to ask whether their research question involves group differences, correlation, prediction, underlying structure, or longitudinal studies. Another theme encourages students to notice the similarities and differences among quantitative procedures, as this kind of awareness is helpful in selecting appropriate methods to analyze data and address research questions. Still another theme presents suggestions for interpreting quantitative findings using statistical tests, ES, and confidence intervals.

Being aware of quantitative training practices, encouraging more engaged learning, and focusing on conceptual thinking and underlying themes can help us feature and spotlight quantitative psychology as a highly worthwhile and exciting field in which to take part.

Future Directions
A number of future directions are suggested to point out areas in which quantitative teaching is emerging and needs to grow. First, teaching should focus on actively engaging students and helping them to understand basic, underlying concepts. This would also involve more emphasis on modeling overarching constructs and processes rather than limiting teaching to narrow and isolated methodological procedures (Embretson, 2010; McGrath, 2011; Rodgers, 2010). Building quantitative models of behavior will generate more integrated and explanatory understanding that is necessary to keep psychology scientifically rigorous.

Second, students, faculty, and professionals should be encouraged to explore options for furthering quantitative learning outside of the classroom. For example, quantitative journals (e.g., Multivariate Behavioral Research, Psychological Assessment, Psychological Methods, Structural Equation Modeling) as well as quantitative handbooks (e.g., Cooper et al., 2012), statistical dictionaries (e.g., Dodge, 2003; Everitt, 2002; Upton & Cook, 2008), and even online resources on quantitative methods (e.g., Journal of Statistics Education, http://www.amstat.org/publications/jse/) provide opportunities for further quantitative studies. There are also numerous quantitative volumes that provide input and guidelines for understanding the main aspects of various quantitative methods (Hancock & Mueller, 2010), including multiple regression/correlation (Cohen, Cohen, West, & Aiken, 2003), multivariate statistics (Stevens, 2009), statistical mediation (MacKinnon, 2008), missing data analysis (Enders, 2010), and structural equation modeling (e.g., Kline, 2011), among others. For those interested in more advanced statistical material, they could venture into the British Journal of Mathematical and Statistical Psychology and Psychometrika, or even the Millsap and Maydeu-Olivares (2009) volume on Contemporary Psychometrics, realizing that these latter sources would not be as accessible as the more translational articles in the former outlets.

Those interested in additional training could consider workshops (e.g., the University of Kansas Summer Institute held by Todd Little), annual conferences featuring quantitative presentations (e.g., Division 5 talks at the APA convention, Division D talks at the American Educational Research Association convention, and presentations and workshops at the Association for Psychological Science). Websites focusing on quantitative material (e.g., UCLA statistical website, http://www.ats.ucla.edu/stat/, and a similar one at Georgetown, http://statpages.org/) and online tutorials in quantitative methods (Dinov & Christou, 2009; Garfield & delMas, 2010) offer additional occasions to heighten quantitative proficiency. Further, there are quantitative societies that provide
useful opportunities and contacts (e.g., American Psychological Association, Division 5: Evaluation, Measurement, and Statistics; American Statistical Association; Psychometric Society; Society for Mathematical Psychology, and the Society of Multivariate Experimental Psychology), including international statistical societies (e.g., European Association of Methodology, International Association for Statistical Education, International Statistical Institute, Society for Multivariate Analysis in the Behavioral Sciences). Participating in quantitative forums helps to enlarge quantitative networks, providing a stronger foundation to further one's skills.

Third, quantitative teaching should be widened to include individuals from low-income (Kitchen, DePree, Celedón-Pattichis, & Brinkerhoff, 2007) and under-represented groups (Kosoko-Lasaki, Sonnino, & Voytko, 2006). We also need to consider whether traditional approaches to quantitative teaching have helped to shrink or amplify the differences among mainstream and marginalized students (Ceci & Papierno, 2005). Other offerings can be encouraged (e.g., Quantitative Training for Under-represented Groups) to help bring about greater equity and effectiveness that can open doors to more career options in the field of quantitative science to an even larger and more diverse group of individuals.

Figure 6.1 Bringing about statistical literacy through quantitative teaching, other quantitative learning, and widening the quantitative reach
Fourth, it is important to reach out to quantitative researchers who are newly graduated to encourage them to become involved in the larger field. Early career psychologists make up about 10% of the 150,000 members of the APA (2010), and Division 5 (measurement, statistics and evaluation) involves approximately 1% of APA members (i.e., Gruber, 2010). The APA Quantitative Task Force (Aiken et al., 2007) helps in this regard, with a website (http://www.apa.org/research/tools/quantitative/index.aspx) that provides definitions of the areas of focus within quantitative methods, recommendations for preparing for quantitative training, and lists of quantitative graduate programs that can help in recruiting individuals into quantitative psychology and in retaining those that are already participating.

Finally, continuing efforts to generate greater interest, participation, and performance in quantitative literacy is an important goal. Unfortunately, psychology is not always considered as having the scientific credentials of other disciplines such as physics, chemistry, and biology (Simonton, 2004). Particularly, recent efforts to improve technological fields (science, technology, engineering, and mathematics) sometimes exclude psychology. Further, research reveals that students from the United States fall in the bottom half regarding quantitative performance when compared to other countries (Gonzales, Guzmán, Partelow, Pahlke, Jocelyn, Kastberg, & Williams, 2003; McQuillan & Kennelly, 2005).

Figure 6.1 depicts the main focus of this chapter, where innovative quantitative teaching (with active learning, e-learning, mentoring, and concepts learning), other quantitative learning (e.g., model building, workshops, quantitative societies, and quantitative literature), and widening the quantitative reach (to the disadvantaged, early career individuals, and the under-represented) can help bring about greater statistical literacy. More emphasis needs to be made on involving others in the marvels of quantitative psychology, whether at the undergraduate or graduate level, in the larger profession of psychol-
cal education in the United States (Chopra, 2010) ogy or in the general public. Ideas presented in this
have focused more on traditionally defined science, chapter, and in the larger handbook, are offered to

harlow 113
help in teaching quantitative psychology, within the classroom, in research centers, and in every facet of life that involves quantitative thinking.

Author Note
Lisa L. Harlow is Professor of Quantitative Psychology at the University of Rhode Island. Please address correspondence to Lisa L. Harlow, Department of Psychology, University of Rhode Island, 142 Flagg Rd, Chafee, Kingston, RI 02881-0808.

Acknowledgments
I wish to acknowledge support from the National Science Foundation (NSF) Grant (#0720063) on Quantitative Training for Underrepresented Groups; Lisa L. Harlow, Principal Investigator (PI). I would like to thank students, faculty, and staff at the University of Rhode Island and the Cancer Prevention Research Center in Rhode Island for their ongoing support, inspiration, and encouragement. Similarly, I would like to thank Dr. Lisa Dierker (PI) and her colleagues at the Quantitative Analysis Center at Wesleyan University. Serving on the advisory board for their NSF Grant (#0942246) on An Inquiry-Based, Supportive Approach to Statistical Reasoning and Application has opened up my thinking and that of my students on how to productively engage students in the wonders of quantitative inquiry.

References
Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Aberson, C. L., Berger, D. E., Healy, M. R., & Romero, V. L. (2003). Evaluation of an interactive tutorial for teaching hypothesis testing concepts. Teaching of Psychology, 30, 75–78.
Aiken, L. S., Aguinis, H., Appelbaum, M., Boodoo, G., Edwards, M. C., Gonzalez, R. D., Panter, A., et al. (2007). Report of the task force for increasing the number of quantitative psychologists. Washington, DC: American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/research/tools/quantitative/quant-task-force-report.pdf.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50.
Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721–734.
APA (2010). Early career psychologists. Washington, DC: American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/membership/index.aspx.
Ashcraft, M. H., & Kirk, E. P. (2001). The relationship among working memory, math anxiety, and performance. Journal of Experimental Psychology: General, 130, 224–237.
Atkinson, R. K., Catrambone, R., & Merrill, M. M. (2003). Aiding transfer in statistics: Examining the use of conceptually oriented equations and elaborations during subgoal learning. Journal of Educational Psychology, 95, 762–773.
Aulls, M. W., & Shore, B. M. (2008). Inquiry in education, volume 1: The conceptual foundations for research as a curricular imperative. New York: Taylor & Francis/Erlbaum.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.
Barron, K. E., Benedict, J. O., Saville, B. K., Serdikoff, S. L., & Zinn, T. E. (2007). Innovative approaches to teaching statistics and research methods: Just-in-time teaching, interteaching, and learning communities. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 143–158). Mahwah, NJ: Erlbaum.
Bartz, A. E., & Sabolik, M. A. (2001). Computer and software use in teaching the beginning statistics course. Teaching of Psychology, 28, 147–149.
Ben-Zvi, D., & Garfield, J. (Eds.) (2004). The challenge of developing statistical literacy, reasoning, and thinking. Dordrecht, Netherlands: Kluwer.
Berger, D. (n.d.). WISE: Web Interface for Statistics Education. Retrieved December 20, 2010 from http://wise.cgu.edu/.
Bjornsdottir, A., & Garfield, J. (2010). Teaching bits: Statistics education articles from 2009. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/garfieldtb.pdf.
Bliwise, N. G. (2005). Web-based tutorials for teaching introductory statistics. Journal of Educational Computing Research, 33, 309–325.
Britt, M. A., Sellinger, J., & Stillerman, L. M. (2002). A review of ESTAT: An innovative program for teaching statistics. Teaching of Psychology, 29, 73–75.
Budé, L., Van De Wiel, M. W. J., Imbos, T., Candel, M. J. J. M., Broers, N. J., & Berger, M. P. F. (2007). Students' achievements in a statistics course in relation to motivational aspects and study behaviour. Statistics Education Research Journal [Online], 6, 5–21. Retrieved May 9, 2012 from http://www.stat.auckland.ac.nz/~iase/serj/SERJ6(1)_Bude.pdf.
Ceci, S. J., & Papierno, P. B. (2005). The rhetoric and reality of gap closing: When the "have nots" gain but the "haves" gain even more. American Psychologist, 60, 149–160.
Chew, S. L. (2007). Designing effective examples and problems for teaching statistics. In Dunn, D. S., Smith, R. A., & Beins, B. (Eds.), Best practices for teaching statistics and research methods in the behavioral sciences (pp. 73–91). Mahwah, NJ: Erlbaum.
Chiou, C.-C. (2009). Effects of concept mapping strategy on learning performance in business and economics statistics. Teaching in Higher Education, 14, 55–69.
Chopra, A. (2010). Building a future for technological innovation. White House Office of Science & Technology Policy. Retrieved January 13, 2011 from http://www.whitehouse.gov/sites/default/files/microsites/ostp/CommClubr081710FINAL.pdf.
Cobb, G. (1999). Move over, Bill Nye: George Cobb makes learning about statistics practical and fun. Mount Holyoke College news and events. Retrieved May 9, 2012 from http://www.mtholyoke.edu/offices/comm/csj/990409/gummy.html.
Cobb, G. W. (2007). The introductory statistics course: A ptolemaic curriculum? Technology Innovations in Statistics Education, 1, 1–15. Retrieved May 9, 2012 from http://escholarship.org/uc/item/6hb3k0nz#page-15.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd edition). Hillsdale, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd edition). Mahwah, NJ: Erlbaum.
Cooper, H., Camic, P. M., Long, D. L., Panter, A. T., Rindskopf, D., & Sher, K. J. (Eds.). (2012). APA handbook of research methods in psychology (Vols. 1–3). Washington, DC: American Psychological Association.
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie/Journal of Psychology, 217, 15–26.
DaRos-Voseles, D. A., Collins, K. M. T., Onwuegbuzie, A. J., & Jiao, Q. G. (2008). Effect of self-perception on performance of graduate-level cooperative groups in research methodology courses. Journal of Instructional Psychology, 35, 254–259.
Derryberry, D. R., Schou, S. B., & Conover, W. J. (2010). Teaching rank-based tests by emphasizing structural similarities to corresponding parametric tests. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/derryberry.pdf.
DeVaney, T. A. (2010). Anxiety and attitude of graduate students in on-campus vs. online statistics courses. Journal of Statistics Education, 18. Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/devaney.pdf.
Dewey, J. (1997). How we think. New York: Dover Publications.
Dinov, I. D., & Christou, N. (2009). Statistics online computational resource for education. Teaching Statistics, 31, 49–51.
Dodge, Y. (Ed.) (2003). The Oxford dictionary of statistical terms. New York: Oxford University Press.
Dolinsky, B. (2001). An active learning approach to teaching statistics. Teaching of Psychology, 28, 55–56.
Dunn, D. S., Smith, R. A., & Beins, B. (Eds.) (2007). Best practices for teaching statistics and research methods in the behavioral sciences. Mahwah, NJ: Erlbaum.
Embretson, S. E. (Ed.) (2010). Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford.
Everitt, B. S. (2002). The Cambridge dictionary of statistics (2nd edition). Cambridge, UK: Cambridge University Press.
Everson, M., Zieffler, A., & Garfield, J. (2008). Implementing new reform guidelines in teaching introductory college statistics courses. Teaching Statistics, 30, 66–69.
Ferreira, M. (2001). Building communities through role models, mentors, and hands-on-science. The School Community Journal, 11, 27–37.
Fillion, G., Limayem, M., Laferrière, T., & Mantha, R. (2008). Integrating ICT into higher education: A study of onsite vs. online students' and professors' perceptions. International Journal of Web-Based Learning and Teaching Technologies, 3, 48–72.
Franklin, C., & Garfield, J. (2006). The guidelines for assessment and instruction in statistics education (GAISE) project: Developing statistics education guidelines for grades pre K-12 and college courses. In G. F. Burrill (Ed.), Thinking and reasoning with data and chance: Sixty-eighth annual NCTM yearbook (pp. 345–375). Reston, VA: National Council of Teachers of Mathematics.
Friedrich, J., Buday, E., & Kerr, D. (2000). Statistical training in psychology: A national survey and commentary on undergraduate programs. Teaching of Psychology, 27, 248–257.
Gal, I. (2003). Teaching for statistical literacy and services of statistics agencies. The American Statistician, 57, 80–84.
Garfield, J., & Ben-Zvi, D. (2007). How students learn statistics revisited: A current review of research on teaching and learning statistics. International Statistical Review, 75, 372–396.
Garfield, J., & Ben-Zvi, D. (2008). Developing students' statistical reasoning: Connecting research and teaching practice. New York: Springer.
Garfield, J., & delMas, R. (2010). A web site that provides resources for assessing students' statistical literacy, reasoning and thinking. Teaching Statistics, 32, 2–7.
Garfield, J., & Everson, M. (2009). Preparing teachers of statistics: A graduate course for future teachers. Journal of Statistics Education, 17(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v17n2/garfield.html.
Garfield, J., Hogg, B., Schau, C., & Whittinghill, D. (2002). First courses in statistical science: The status of educational reform efforts. Journal of Statistics Education, 10(2). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v10n2/garfield.html.
Gelman, A., & Nolan, D. (2002). Teaching statistics: A bag of tricks. New York: Oxford University Press.
Gilliland, D., & Melfi, V. (2010). A note on confidence interval estimation and margin of error. Journal of Statistics Education, 18(1). Retrieved May 9, 2012 from www.amstat.org/publications/jse/v18n1/gilliland.pdf.
Gonzales, P., Guzmán, J. C., Partelow, L., Pahlke, E., Jocelyn, L., Kastberg, D., & Williams, T. (2003). Highlights from the trends in international mathematics and science study (TIMSS). U.S. Department of Education, Institute of Education Sciences, National Center for Educational Statistics.
Gruber, C. (Ed.) (2010). The score newsletter, Vol. XXXII, No. 3. Division 5, American Psychological Association. Retrieved July 1, 2010 from http://www.apa.org/divisions/div5/pdf/July10Score.pdf.
Hancock, G. R., & Mueller, R. O. (Eds.). (2010). The reviewer's guide to quantitative methods in the social sciences. New York: Taylor & Francis.
Hanley, G. L. (2004). E-learning and the science of instruction. Applied Cognitive Psychology, 18, 123–124.
Härdle, W., Klinke, S., & Ziegenhagen, U. (2007). On the utility of e-learning in statistics. International Statistical Review, 75, 355–364.
Harlow, L. L. (2005). The essence of multivariate thinking: Basic themes and methods. Mahwah, NJ: Erlbaum.
Harlow, L. L. (2010). On scientific research: Invited commentary on the role of statistical modeling and hypothesis testing. Journal of Modern Applied Statistical Methods, 9, 348–358.
Harlow, L. L., Burkholder, G., & Morrow, J. (2002). Evaluating attitudes, skill and performance in a learning enhanced quantitative methods course: A structural modeling approach. Structural Equation Modeling Journal, 9, 413–430.
Harlow, L. L., Burkholder, G., & Morrow, J. (2006). Engaging students in learning: An application with quantitative psychology. Teaching of Psychology, 33, 231–235.
Harlow, L. L., Mulaik, S., & Steiger, J. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
Helman, S., & Horswill, M. S. (2002). Does the introduction of non-traditional teaching techniques improve psychology undergraduates' performance in statistics? Psychology Learning & Teaching, 2, 12–16.
Hulme, J. (2007). Review of Handbook of demonstrations and activities in the teaching of psychology, volume 1: Introductory, statistics, research methods and history. Psychology Learning & Teaching, 6, 164–165.
Hulsizer, M. R., & Woolf, L. M. (2009). Guide to teaching statistics: Innovations and best practices. New York: Wiley-Blackwell.
Jaccard, J., & Jacoby, J. (2009). Theory construction and model building skills: A practical guide for social scientists. New York: Guilford Press.
Katayama, A. D. (2001). Bi-modal instructional practices in educational psychology: Mentoring and traditional instruction. Journal of Instructional Psychology, 28, 171–177.
Kitchen, R. S., DePree, J., Celedón-Pattichis, S., & Brinkerhoff, J. (2007). Mathematics education at highly effective schools that serve the poor: Strategies for change. Mahwah, NJ: Erlbaum.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd edition). New York: Guilford.
Kolar, D. W., & McBride, C. A. (2003). Creating problems to solve problems: An interactive teaching technique for statistics courses. Teaching of Psychology, 30, 67–68.
Kosoko-Lasaki, O., Sonnino, R. E., & Voytko, M. L. (2006). Mentoring for women and underrepresented minority faculty and students: Experience at two institutions of higher education. Journal of the National Medical Association, 98(9), 1449–1459.
Krause, U.-M., Stark, R., & Mandl, H. (2009). The effects of cooperative learning and feedback on e-learning in statistics. Learning and Instruction, 19, 158–170.
La Greca, A. M. (2005). Editorial. Journal of Consulting and Clinical Psychology, 73, 3–5.
Liu, T.-C., Lin, Y.-C., & Kinshuk (2010). The application of Simulation-Assisted Learning Statistics (SALS) for correcting misconceptions and improving understanding of correlation. Journal of Computer Assisted Learning, 26, 143–158.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Taylor & Francis.
Malloy, T. E., & Jensen, G. C. (2001). Utah Virtual Lab: JAVA interactivity for teaching science and statistics on line. Behavior Research Methods, Instruments & Computers, 33, 282–286.
Mazzolini, M., & Maddison, S. (2007). When to jump in: The role of the instructor in online discussion forums. Computers & Education, 49, 193–213.
McGrath, R. E. (2011). Quantitative models in psychology. Washington, DC: American Psychological Association.
McQuillan, L., & Kennelly, L. (2005). New study finds U.S. math students consistently behind their peers around the world: Findings challenge conventional wisdom about U.S. math success in early grades. Washington, DC: American Institutes for Research.
Meletiou-Mavrotheris, M., & Lee, C. (2002). Teaching students the stochastic nature of statistical concepts in an introductory statistics course. Statistics Education Research Journal, 1, 22–37.
Mills, J. D. (2004). Students' attitudes toward statistics: Implications for the future. College Student Journal, 38, 349–361.
Millsap, R. E., & Maydeu-Olivares, A. (Eds.) (2009). The Sage handbook of quantitative methods in psychology. Thousand Oaks, CA: Sage Publications.
Mulhern, G., & Wylie, J. (2004). Changing levels of numeracy and other core mathematical skills among psychology undergraduates between 1992 and 2002. British Journal of Psychology, 95, 355–370.
Mulhern, G., & Wylie, J. (2006). Mathematical prerequisites for learning statistics in psychology: Assessing core skills of numeracy and mathematical reasoning among undergraduates. Psychology Learning & Teaching, 5, 119–132.
Neal-Barnett, A., Mitchell, M., & Boeltar, C. (2002). Faculty of color serving students, serving self: The psychology group. Teaching of Psychology, 29, 44–45.
Nie, Y., & Lau, S. (2010). Differential relations of constructivist and didactic instruction to students' cognition, motivation, and achievement. Learning and Instruction, 20, 411–423.
Odgaard, E. C., & Fowler, R. L. (2010). Confidence intervals for effect sizes: Compliance and clinical significance in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 78, 287–297.
Onwuegbuzie, A. J. (2000). Attitudes toward statistics assessments. Assessment and Evaluation in Higher Education, 25, 321–339.
Onwuegbuzie, A. J., Collins, K. M. T., & Jiao, Q. G. (2009). Performance of cooperative learning groups in a postgraduate education research methodology course: The role of social interdependence. Active Learning in Higher Education, 10, 265–277.
Onwuegbuzie, A. J., & Wilson, V. A. (2003). Statistics anxiety: Nature, etiology, antecedents, effects, and treatments - a comprehensive review of literature. Teaching in Higher Education, 8, 195–209.
Peden, B. F. (2001). Correlational analysis and interpretation: Graphs prevent gaffes. Teaching of Psychology, 28, 129–131.
Peterson, S. E., & Miller, J. A. (2004). Quality of college students' experiences during cooperative learning. Social Psychology of Education, 7(2), 161–183.
Piotrowski, C., Bagui, S. C., & Hemasinha, R. (2002). Development of a measure of statistics anxiety in graduate-level psychology students. Journal of Instructional Psychology, 29, 97–100.
Quilici, J. L., & Mayer, R. E. (2002). Teaching students to recognize structural similarities between statistics word problems. Applied Cognitive Psychology, 16, 325–342.
Rajecki, D. W., Appleby, D., Williams, C. C., Johnson, K., & Jeschke, M. P. (2005). Statistics can wait: Career plans activity and course preferences of American psychology undergraduates. Psychology Learning & Teaching, 4, 83–89.
Ridgway, J., Nicholson, J., & McCusker, S. (2007). Teaching statistics despite its applications. Teaching Statistics, 29, 44–48.
Rodarte-Luna, B., & Sherry, A. (2008). Sex differences in the relation between statistics anxiety and cognitive/learning strategies. Contemporary Educational Psychology, 33, 327–344.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12.
Romero, V. L., Berger, D. E., Healy, M. R., & Aberson, C. L. (2000). Using cognitive learning theory to design effective on-line statistics tutorials. Behavior Research Methods, Instruments & Computers, 32, 246–249.
Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59–82.
Rumsey, D. J. (2002). Statistical literacy as a goal for introductory statistics courses. Journal of Statistics Education, 10(3). Retrieved May 9, 2012 from http://www.amstat.org/publications/jse/v10n3/rumsey2.html.
Saville, B. K., Zinn, T. E., Lawrence, N. K., Barron, K. E., & Andre, J. (2008). Teaching critical thinking in statistics and research methods. In Dunn, D. S., Halonen, J. S., & Smith, R. A. (Eds.), Teaching critical thinking in psychology: A handbook of best practices (pp. 149–160). New York: Wiley-Blackwell.
Schwartz, D., & Martin, T. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22, 129–184.
Sen, S. (2004). Teaching quantitative thinking: Comment. Science, 304(5670), 518.
Simonton, D. K. (2004). Psychology's status as a scientific discipline: Its empirical placement within an implicit hierarchy of the sciences. Review of General Psychology, 8, 59–67.
Sosa, S., Berger, D. E., Saw, A. T., & Mary, J. C. (2010). Effectiveness of computer-assisted instruction in statistics: A meta-analysis. Review of Educational Research, published online August 12, 2010. DOI: 10.3102/0034654310378174.
Stark, R., & Krause, U.-M. (2009). Effects of reflection prompts on learning outcomes and learning behaviour in statistics education. Learning Environments Research, 12, 209–223.
Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York: Routledge.
Swingler, M. V., Bishop, P., & Swingler, K. M. (2009). SUMS: A flexible approach to the teaching and learning of statistics. Psychology Learning & Teaching, 8, 39–45.
Tomcho, T. J., Rice, D., Foels, R., Folmsbee, L., Vladescu, J., Lissman, R., Matulewicz, R., et al. (2009). APA's learning objectives for research methods and statistics in practice: A multimethod analysis. Teaching of Psychology, 36, 84–89.
Tremblay, P. F., Gardner, R. C., & Heipel, G. (2000). A model of the relationships among measures of affect, aptitude, and performance in introductory statistics. Canadian Journal of Behavioural Science, 32, 40–48.
Upton, G., & Cook, I. (2008). A dictionary of statistics. New York: Oxford University Press.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis vs. common factor analysis: Some issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25, 1–28.
Ware, M. E., & Johnson, D. E. (Eds.) (2000). Handbook of demonstrations and activities in the teaching of psychology: Introductory, statistics, research methods, and history, Vol. I (2nd ed.). Mahwah, NJ: Erlbaum.
Watkins, A. E., Scheaffer, R. L., & Cobb, G. W. (2004). Statistics in action: Understanding a world of data. New York: Key College Publishing.
Watson, J. M. (2006). Statistical literacy at school: Growth and goals. Mahwah, NJ: Erlbaum.
Wender, K. F., & Muehlboeck, J.-S. (2003). Animated diagrams in teaching statistics. Behavior Research Methods, Instruments & Computers, 35, 255–258.
Wilkinson, L., & The Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wulff, S. S., & Wulff, D. H. (2004). "Of course I'm communicating; I lecture every day": Enhancing teaching and learning in introductory statistics. Communication Education, 53, 92–103.
Zeedyk, M. S. (2006). Detective work on statistics street: Teaching statistics through humorous analogy. Psychology Learning & Teaching, 5, 97–109.
Zieffler, A., Garfield, J., Alt, S., Dupuis, D., Holleque, K., & Chang, B. (2008). What does research suggest about the teaching and learning of introductory statistics at the college level? A review of literature. Journal of Statistics Education, 16(2). Retrieved May 9, 2012 from http://www.amstat.org/publications/jse/v16n2/zieffler.html.
Zirkel, S. (2002). Is there a place for me? Role models and academic identity among white students and students of color. Teachers College Record, 104, 357–376.
CHAPTER 7

Modern Test Theory

Roderick P. McDonald†

Abstract

This chapter provides a unified treatment of seven major topics in test theory. It relates Modern Test Theory based on item response modeling to classical linear modeling through the use of the common factor model. The topics include choosing a metric; measurement and measurement error (reliability); item selection; homogeneity and dimensionality; validity; equating tests; and comparing populations. The treatment of these problems makes no distributional assumptions.
Key Words: Test theory, metric, homogeneity, dimensionality, validity, equating, differential item
functioning

Introduction

The subject of this chapter is very large, so the approach to it requires a broad brush and has to be somewhat selective. I apologize for this in advance.

We begin, as is commonly the case, with questions of definition. Test theory consists of mathematical models for item scores and theory for the test scores derived from them by a scoring formula. Further, a test (a set of items) is designed to measure a quantifiable attribute or set of related attributes of examinees. The term "attributes" has its ordinary dictionary definition—properties or characteristics of the examinees. There is an inevitable looseness in these three statements, but they should serve as a platform for the following account.

The topics of test theory have developed in a piecemeal fashion, with little attempt to examine the relationships between them. We might date the beginnings of the "classical" period of test theory to Spearman's papers on True Score Theory (1904a) and on the general factor model (1904b). The two models were illustrated by the same data set. However, for decades Psychometric Theory has treated them as foundations of two major, unrelated topics.

The primary problems of classical test theory were the homogeneity, the reliability, and the validity of a test. These remain major topics in the modern era. The treatment of these problems rested on simple linear additive models—nothing could be simpler than Spearman's true score model—but with an uneasy recognition that linear models were inappropriate for binary items. The interested reader would find Gulliksen's (1950) neglected classic an excellent account of thinking in the classical period.

The incorporation of Alan Birnbaum's rigorous treatment of item response models in Lord and Novick (1968) marks the beginning of the modern era. Lord and Novick's text, accompanied by Lord (1980), should still be required reading for any student or researcher who wishes to have a general understanding of the field.

† Editor’s Note: Sadly, Rod passed away in October of 2011. His contributions to quantitative methods are inestimable
and he will be dearly missed. I would like to thank Aaron Boulton who completed the tables, figures, and proofs for
this chapter.

118
The Lord and Novick text contains the necessary foundations for a unified treatment of test theory but leaves implicit the relationship between the linear models of classical theory and item response models for binary test items. A general misconception is still apparently widely held that test theory is made up of two parts—Classical Test Theory for quantitative data, and Item Response Theory (IRT) for binary data, with a great gulf fixed between them.

The possibility of a unified treatment of linear and nonlinear models for item scores was adumbrated by McDonald (1967), who showed how an item response model could be approximated by a linear model using standard procedures, common in physics, for linearizing a nonlinear model. A more rigorous unified treatment, based on the general linear model of Nelder and Wedderburn (1972), was provided by Bartholomew (see Bartholomew & Knott, 1999, and Skrondal & Rabe-Hesketh, 2004). However, these accounts do not address the central problems of Test Theory. McDonald (1999) sought to present a "unified treatment" by applying the same psychometric concepts to a linear (unidimensional or multidimensional) model for quantitative data and to the parallel nonlinear model for binary data. This treatment shows how the linear model serves as a first approximation to the nonlinear model. Other authors have contributed to this unification of theory. See, for example, Jones and Thissen (2007) and Thissen and Wainer (2001).

This chapter contains a brief introduction to the treatment given by McDonald (1999), with some revision and with restructuring to exhibit parallels between linear and nonlinear models. I rely on more specialized accounts in the present volume to fill in details that I must omit here. Specifically, I will not examine standard estimation methods for item parameters. (See Hallberg, Wing, Wong, & Cook, Chapter 12, Volume 1; Steiner and Cook, Chapter 13, Volume 1.)

Following the next section, which sets out the properties of the linear (factor) model and the parallel nonlinear (item response) model that we need, I will discuss the application of these models to seven problems in test theory. These are: (1) imposing a metric on the measured attribute, (2) measurement and error of measurement, (3) item selection, (4) homogeneity and dimensionality, (5) validity, (6) equating tests, and (7) comparing populations. The last section, not surprisingly, is general discussion.

The Models

With reference to notation, I will use uppercase italics for random variables, and lowercase Roman for the values they take or the scale on which they are distributed. I assume the reader is familiar with the algebra of expectations and with variances, covariances, and of course correlations. I will write an expected value—a mean—as E{ }, covariance as Cov{ }, and variance as Var{ }. This allows us to have E{Y | X = x} for the conditional mean of random Y when random X takes the value x, and Var{Y | X = x} for its conditional variance. Any change from, say, X to x signals a change from representing a random variable to representing a specific value for an individual or points on the scale of x. When a sentence defines a new term, the term will be written in italics.

The models we consider are mathematical idealizations and can be regarded only as approximations to the behavior of any real measures (see McDonald, 2010). A set of m items is given to a large sample of examinees. For our first model, we suppose that the responses to them can be coded by item scores that range over enough numerical values to apply a linear model as an acceptable approximation. The item scores might be subtest scores, or Likert-scaled scores: for example, coding Strongly agree, Agree, Indifferent, Disagree, and Strongly disagree as the integers 5, 4, 3, 2, and 1, respectively.

We take as a suitable model Spearman's general factor model, written in the form

Xj = μj + λj F + Uj,  (1)

where Xj is the score on the jth item of a randomly chosen subject from a defined population, F is the (unobserved) value of the attribute given by the model, and Uj is a random interaction between the item and the examinee. (To simplify the presentation, it will be understood that the subscript j, wherever it appears, ranges from 1 to m. We leave out the formal j = 1, 2, . . . , m. All summations are over this range, unless otherwise stated.)

Equation 1 is a simple regression model, with F the "explanatory" variable, Xj the response variable, and Uj the residual. Accordingly, F and Uj are uncorrelated. We assume that the interaction terms Uj of distinct items are uncorrelated and write ψj² for their variances. Then by the algebra of expectations, the variance of Xj is given by

Var{Xj} = λj² Var{F} + ψj²,  (2)

and the covariance of two distinct item scores Xj and Xk by

Cov{Xj, Xk} = λj λk Var{F}.  (3)

(The reader who is familiar with the factor analysis of correlations needs to be told that for test theory applications we use covariances.)

mcdonald 119
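The moment structure in Equations 2 and 3 is easy to verify numerically. A minimal sketch, using hypothetical loadings and unique variances (illustrative values only, not estimates from any real test):

```python
# One-factor model of Equation 1: X_j = mu_j + lambda_j * F + U_j.
# Hypothetical parameters, chosen only for illustration.
lam  = [0.8, 0.6, 0.5]     # loadings lambda_j
psi2 = [0.36, 0.64, 0.75]  # unique variances psi_j^2
phi  = 1.0                 # Var{F}

# Equation 2: Var{X_j} = lambda_j^2 * Var{F} + psi_j^2
var_x = [l * l * phi + p for l, p in zip(lam, psi2)]

# Equation 3: Cov{X_j, X_k} = lambda_j * lambda_k * Var{F}, for j != k
cov_12 = lam[0] * lam[1] * phi

print([round(v, 6) for v in var_x])  # [1.0, 1.0, 1.0]
print(round(cov_12, 6))              # 0.48
```

With Var{F} = 1, as under the standardization adopted later, each item variance is just λj² + ψj², the decomposition the chapter builds on throughout.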
(The reader who is familiar with the factor analysis of correlations needs to be told that for test theory applications we use covariances.) In this mathematical model, F is the common factor that accounts for the covariance of the item scores through the relation given by Equation 3. The common factor F is linked to the real world of applications by identifying it with a measure of the abstract attribute that the items share as their common property. The interaction terms Uj—the unique components—are linked to applications by identifying them with measures of specific properties of the items. This identification requires the strong assumption that their specific properties are strictly unique to the items. The assumption is realized only approximately in practice. (The unique components may also contain an error of replication that cannot be separated in a single set of observations.) The regression constant μj is the item score mean, representing item difficulty in cognitive applications. (Strictly, this should be item facility—the easier the item, the higher the mean score.) The regression slope λj measures the extent to which the item discriminates between subjects at different levels of the attribute. It is traditionally termed a factor loading. (For reasons that will appear, I would like to call it the discrimination parameter, the counterpart term in item response models, but tradition is too strong.) The parameter ψj² is the unique variance of the item score.

We can write Equation 1 as

E{Xj | F = f} = μj + λj f,    (4)

the alternative way to write a regression—that is, as the expected value of Xj conditioning on a value f of F. Defining

Uj = Xj − E{Xj | F = f}    (5)

returns us to Equation 1, but in the form of Equation 4 it allows us to write

Var{Xj | F = f} = ψj²,    (6)

and

Cov{Xj, Xk | F = f} = 0.    (7)

That is, for a fixed value of the attribute the item scores are uncorrelated. Equation 7 is a weak version of the Principle of Local Independence, which governs item response models. A strong form of this principle requires that X1, X2, . . . , Xm are mutually statistically independent for fixed f. The strong form of the principle implies the weak form, and if the data have a multivariate normal distribution, then the weak form implies the strong form. Fitting and testing the model using just limited information from the item covariances is commonly good enough. Higher moments of the joint distribution are extremely unstable in sampling and cannot be relied on to improve estimates. It is implicit in applications of factor models that the weak form of the principle of local independence is used to fit the model, but the strong form is intended. That is, we do not suppose that the common factor accounts for the covariances between the items but leaves other forms of statistical dependence unaccounted for.

Equation 4 gives a simple linear relationship between the item scores and the factor. We might instead suppose that each item score has a distinct relationship to the factor, writing

E{Xj | F = f} = γj(f),    (8)

where γj is intended to represent a different nonlinear function for each item. There has been very little work on this kind of generality (see, for example, McDonald, 1967; Yalcin & Amemiya, 2001).

A generalized linear model, following Nelder and Wedderburn (1972), is obtained by substituting a common nonlinear link function γ for a set of distinct functions γj, relating the item scores to a linear function of f, with

E{Xj | F = f} = γ(μj + λj f).    (9)

In principle there is a wide choice of link functions. In applications the choice is motivated by metric properties of the item scores (see Skrondal & Rabe-Hesketh, 2004, for a fairly comprehensive list). For our purposes, it must suffice to consider just two. The first, which we already have, is the simple linear relationship given by Equation 4, with the link function just the identity function. This would be chosen whenever the item score ranges appear to allow a linear model, especially if we can suppose—approximately—multivariate normality.

The only alternative we will consider is motivated by binary data, items scored with just two values—0 and 1. This will include multiple-choice cognitive items, with one answer—the correct answer, we hope—scored 1 and the rest 0. It also includes noncognitive items with one response keyed for the attribute to be measured, and the other response(s) nonkeyed—for example, a checklist of symptoms of a behavior disorder. In such a case, it is easily shown that E{Xj | F = f} is the probability of giving the keyed response, conditioned on f. To avoid a clash of notations I will write this as:

E{Xj | F = f} = P{Xj = 1 | F = f} = γ(aj + bj f),    (10)

where aj replaces μj and bj replaces λj.

120 modern test theory
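The generalized linear structure of Equations 9 and 10 can be sketched in a few lines. The parameter values here are hypothetical, and the standard logistic cdf is used only as one convenient example of a monotone link γ:

```python
import math

def identity_link(z):
    # identity link: Equation 4, E{X_j | F = f} = mu_j + lambda_j * f
    return z

def logistic_link(z):
    # a standard logistic cdf, one possible monotone link for binary items
    return 1.0 / (1.0 + math.exp(-z))

a_j, b_j = 0.5, 1.2  # hypothetical intercept and slope, as in Equation 10
for f in (-2.0, 0.0, 2.0):
    p = logistic_link(a_j + b_j * f)  # P{X_j = 1 | F = f}
    print(round(p, 3))
```

Whatever monotone cdf is chosen, the conditional expectation stays between 0 and 1, which is the motivation given in the next section for replacing the identity link when the items are binary.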


Figure 7.1 [Figure not reproduced: item response curve(s), P(F) plotted against F over the range −3 to 3, with P(F) running from 0 to 1.]

The former notation is set in stone for the factor model and will be needed again. The notation in Equation 10 is standard for regression coefficients.

The choice of a link function is motivated by the fact that for binary data, E{Xj | F = f} becomes a probability, which is bounded below by 0 and above by 1. If we suppose that the link function should be monotone, then we naturally choose a cumulative distribution function for γ. There are many distribution functions, but theory has been developed for just two—namely, the cumulative normal curve (the normal ogive) and the cumulative form of the logistic distribution (the logistic function). I will just write N(z) = N(aj + bj f) for the normal ogive—given by the integral from negative infinity to z of the normal density function:

n(t) = [1/(2π)^(1/2)] exp[−(1/2)t²],    (11)

and L(aj + bj f) for the logistic function—given by:

L(z) = 1/[1 + exp(−Dz)].    (12)

With D set equal to 1.701, these functions are virtually indistinguishable, but each has distinct and useful mathematical properties, and I will switch between them as needed. When I wish to refer to both, I will just write P(aj + bj f). Figure 7.1 can be regarded as a graphical representation of either.

There are three accepted ways to define the parameters of this model. The first we already have in Equation 10. I will refer to it as the regression parameterization. The second—Lord's (1980) parameterization—is conventionally written as:

P{Xj = 1 | F = f} = γ(aj*(θ − bj*)).    (13)

(But it is written without the stars, which I have added to distinguish it from my regression notation.) Lord's notation is firmly established—even to the use of θ for "ability"—but will not be used here. It has the advantage that the constant bj* corresponds to the position on the attribute scale where P(f) is 0.5. Its disadvantage is that it does not generalize to multidimensional models (or allow a simple notational distinction between a random variable and values of it). The regression parameterization in Equation 10 easily generalizes, as we will see.

The third parameterization comes from the work of Christoffersson (1975). He developed the normal ogive form of the model by supposing that an underlying tendency Xj* to give the keyed response follows the linear model (Equation 1) with a normal density function. The keyed response is given when Xj* is greater than a threshold value τj. This assumption leads to the parameterization:

P{Xj = 1 | F = f} = N[(λj f − τj)/ψj].    (14)

The chief advantage of this form of the model is that with a proper choice of origin and unit for F (see next section), the factor loadings λj are bounded by +1 and −1. They are much more stable than the corresponding parameters in the other forms, which range over the entire real line, and the loadings can be interpreted by the established standards of the common factor model. I will refer to Equation 14 as giving the item factor parameterization.
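Under the standardization used here (Var{F} = 1 and a unit-variance underlying tendency, so that ψj² = 1 − λj²), the item factor parameterization of Equation 14 and the regression parameterization of Equation 10 describe the same curve, with bj = λj/ψj and aj = −τj/ψj. A sketch, with illustrative (hypothetical) values of λj and τj:

```python
import math

def Phi(z):
    # standard normal cdf (the normal ogive N, cumulating the density of Equation 11)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

lam, tau = 0.5, -1.0             # hypothetical item factor parameters
psi = math.sqrt(1.0 - lam ** 2)  # unique standard deviation psi_j

b = lam / psi   # regression slope of Equation 10
a = -tau / psi  # regression intercept of Equation 10

# Both parameterizations give identical response probabilities:
for f in (-2.0, 0.0, 2.0):
    p_factor = Phi((lam * f - tau) / psi)  # Equation 14
    p_regr   = Phi(a + b * f)              # Equation 10 with gamma = N
    assert abs(p_factor - p_regr) < 1e-12
print(round(a, 4), round(b, 4))  # 1.1547 0.5774
```

The same relations connect the two pairs of columns in parameter tables of the kind reported for the LSAT6 items below.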
The functions in Equations 10, 13, and 14 are variously referred to as item characteristic functions, item characteristic curves, and item response functions or curves. Although not an established convention, we can apply the same terms to the linear function in Equation 4. The constant in these models is referred to as the difficulty parameter (although this hardly applies to noncognitive items), and the slope parameter—the multiplier of F—as the discrimination parameter. The threshold parameter τj actually measures in the direction of difficulty for cognitive items, as does Lord's bj*, whereas aj measures in the direction of easiness. In Lord's parameterization, bj* can be called a location parameter. It is the location on the scale of the point of inflexion of the response curve. These parameters serve the same functions as in the counterpart linear model (Equation 4).

The model

P(f) = L(aj + bj f)    (15)

is referred to as the two-parameter (2PL) model, a term pointing to its two item parameters and the choice of (L)ogistic function. We can write a one-parameter (1PL) model by equating the slope parameters, giving

P(f) = L(aj + bf).    (16)

This is a member of the family of models identified by the work of Rasch (1960). It is often referred to as "the" Rasch model. A three-parameter (3PL) model

P(f) = cj + (1 − cj)L(aj + bj f)    (17)

allows for the effects of guessing in multiple-choice items. It will not be considered further here.¹

As in the linear case, we can fit the model by the weak form of the Principle of Local Independence, using just bivariate relations between items. (See McDonald, 1982, 1997; Muthén, 1984.) These estimation procedures can be referred to as limited or bivariate information methods. Alternatively we can fit the model using the strong form of the principle, which for binary items reduces to the statement:

Prob{X1 = 1, X2 = 1, . . . , Xm = 1 | F = f} = Prob{X1 = 1 | F = f} × Prob{X2 = 1 | F = f} × . . . × Prob{Xm = 1 | F = f}.    (18)

This is a way of saying that the items are related in probability only through their relations with the attribute. Methods of estimating the item parameters using this strong form of the principle are referred to as full information methods (e.g., Bock & Aitkin, 1981). There is at present no clear advantage to fitting these models using the limited information from pairwise relations or the full information from patterns of responses. Each has advantages and disadvantages. (See McDonald, 1999, Chapter 12.) We certainly assume the strong form in applications of the fitted model.

Because the common factor model and the item response models developed independently, established terminology for the former refers to F as the common factor of the items whereas Item Response Theory calls it a latent trait or latent variable. From here I will refer to it as a latent trait in both the linear factor model and the item response models and regard it as a measure of the attribute in the metric supplied by the model. (See Metric section below.)

It might be desirable to give an example of an application of Equation 1 to a suitable set of quantitative data, and an application of Equation 10 to a set of binary data. But for brevity, and to exhibit parallels, I will deliberately apply both to a single, much-analyzed set: the LSAT6 data set. Responses from 1000 examinees to a section of a Law School Admissions Test (items 11–15 of Section 6) have been reanalyzed by a number of psychometric theorists (e.g., Christoffersson, 1975; Bock & Aitkin, 1981; Muthén, 1978). The items themselves have been lost. Table 7.1 gives the proportion pj passing each item, and the item variances and covariances. Table 7.2 gives the fitted parameters and the discrepancies between the matrix of sample covariances S and the matrix of fitted covariances Σ̂ (given by Equations 2 and 3), using a standard factor analysis program.²

Table 7.1. LSAT-6—Difficulties and Covariance Matrix

Item   pj      1       2       3       4       5
1      .924    .0702   .664    .524    .710    .806
2      .708    .0089   .2063   .418    .553    .630
3      .553    .0130   .0259   .2472   .445    .490
4      .763    .0050   .0120   .0231   .1808   .678
5      .870    .0021   .0132   .0089   .0142   .1131

Note. Diagonal entries are item variances, entries below the diagonal are covariances, and entries above the diagonal are joint proportions passing both items.

The corresponding normal ogive model, fitted to the same data set by the NOHARM program (McDonald, 1982, 1997), gives the parameters in Table 7.3, in the item factor and the regression parameterizations.
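For 0/1 item scores, the sample moments in Table 7.1 hang together arithmetically: each item variance is pj(1 − pj), and each joint proportion equals the product of the marginal proportions plus the covariance. A quick check (small discrepancies reflect rounding of the published values):

```python
# Difficulties and variances from Table 7.1 (LSAT6).
p    = [0.924, 0.708, 0.553, 0.763, 0.870]
diag = [0.0702, 0.2063, 0.2472, 0.1808, 0.1131]

# Var{X_j} = p_j * (1 - p_j) for a binary item score
for pj, vj in zip(p, diag):
    assert abs(pj * (1.0 - pj) - vj) < 1e-3

# Joint proportion for items 1 and 2: p_12 = p_1 * p_2 + Cov{X_1, X_2}
cov_12 = 0.0089
print(round(p[0] * p[1] + cov_12, 3))  # 0.663, against .664 in the table
```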


Table 7.2. LSAT-6—Spearman Analysis

Item   Loadings λ   Unique Variances ψ²   Discrepancy Matrix (Sample − Fitted Covariance Matrix, S − Σ̂)
1      .0605        .0665                 .0      .0008   .0017   .0021   −.0024
2      .1345        .1882                 .0008   .0      .0009   .0038   .0032
3      .1861        .2126                 .0017   .0009   .0      .0012   .0050
4      .1174        .1670                 .0021   .0038   .0012   .0      .0054
5      .0745        .1076                 −.0024  .0032   .0050   .0054   .0

Table 7.3. LSAT-6—NOHARM Analysis

Item   τ̂j       λ̂j     aj      bj
1      −1.433   .381   1.549   .412
2      −0.550   .379   0.595   .410
3      −0.133   .478   0.152   .544
4      −0.716   .377   0.773   .406
5      −1.126   .345   1.200   .368

Any direct relations between the parameters of the factor model in Table 7.2 and those of the normal ogive are not visible to inspection, but Table 7.4 gives the values of the fitted normal ogive item response functions N(aj + bj f), with the corresponding values of the approximating linear functions μj + λj f from the factor Equation 4, in parentheses, for six values of f. In this example, the approximation is remarkably good.

Some Test Theory Problems

We turn now to a series of problems that can be solved with the use of the models.

Metric

The attribute that we first aim to model, and then aim to measure, must be quantifiable in principle. By this I mean that it must have ordinal properties, admitting of "more" or "less." However, its metric—not only the origin and unit of measurement, but its entire calibration—is not given by data and generally must be imposed by the model. Imagine a meter stick drawn on a rubber sheet, with the millimeters able to be stretched or compressed at will. The units in which the attribute is measured are determined by the choice of (1) a link function and (2) an origin and unit. The distribution of F in the calibration population will depend on the choice of link function and be determined by it. As we will see, it is also possible to determine a metric by choosing a formula score—a function of the item scores. The simple sum of the item scores is an obvious choice but not the only possibility.

To calibrate the model, we fit it to the responses of a (hopefully large, hopefully random) sample of subjects from a suitable population. At least initially we choose the origin of the scale for the latent trait as the mean of F and the unit as its standard deviation. That is, we standardize the latent trait in the population chosen. (This simplifies Equations 2 and 3, setting Var{F} = 1.) The entire metric, and consequently the distribution of F, is then determined by the link function.³

If, in the course of developing the test items, a set is chosen that fits a model with equal coefficients of F, and the link function is monotone, then it can be shown that the difference f1 − f2 in F between two subpopulations is independent of the items chosen to measure them. This property was termed specific objectivity by Rasch (1960). It is sometimes thought of as a special and valuable property of the 1PL model and somewhat hyperbolically referred to as "item-free measurement" (e.g., Wright & Stone, 1979). However, specific objectivity is a property of any model, linear or nonlinear, in which the item slope parameters do not vary (see McDonald, 1999, p. 425). The claim is sometimes also made that the 1PL model, identified as the Rasch model, gives interval scale measurement—equality of attribute differences as measured by the model. However, the choice of distinct link functions for 1PL models gives distinct and mutually contradictory "interval" scales. To repeat, the metric of the measurements is imposed by the model chosen, and alternative metrics are always available. Note that specific objectivity does not make the items of a test interchangeable measures of the attribute. The error of measurement in a 1PL model depends on the difficulty parameter of the item, a fact that makes a problem for equating tests (see Alternate Forms and Test Equating).
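Specific objectivity is easy to exhibit for a 1PL model with a logistic link: on the logit scale, the difference between two trait values is recovered identically from every item, whatever its difficulty. A small sketch (all parameter values hypothetical):

```python
import math

def p_1pl(a_j, b, f):
    # 1PL response function, Equation 16: common slope b, item intercept a_j
    return 1.0 / (1.0 + math.exp(-(a_j + b * f)))

def logit(p):
    return math.log(p / (1.0 - p))

b = 1.0                        # common discrimination
intercepts = [-1.0, 0.0, 2.5]  # items of very different difficulty
f1, f2 = 0.7, -0.4

for a_j in intercepts:
    diff = logit(p_1pl(a_j, b, f1)) - logit(p_1pl(a_j, b, f2))
    print(round(diff, 6))      # b*(f1 - f2) = 1.1 from every item
```

With unequal slopes bj, the item-wise differences would instead be bj(f1 − f2), no longer item-free, which is the point made in the text.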
Table 7.4. LSAT-6—Normal-Ogive Item Response Functions

Item   f = −3      −2          −1          0           1           2
1      .62 (.74)   .76 (.80)   .87 (.86)   .94 (.92)   .97 (.98)   .99 (1.04)
2      .26 (.30)   .41 (.44)   .57 (.57)   .72 (.71)   .84 (.84)   .92 (0.98)
3      .07 (−.01)  .17 (.18)   .35 (.37)   .56 (.55)   .76 (.74)   .89 (0.93)
4      .33 (.41)   .48 (.53)   .64 (.65)   .78 (.76)   .88 (.88)   .94 (1.00)
5      .54 (.65)   .68 (.72)   .80 (.80)   .88 (.87)   .94 (.94)   .97 (1.02)

Note. Values of the approximating linear functions μj + λj f are given in parentheses.

To calibrate the model, we need to sample a carefully chosen population, but we do not need to use the population mean and standard deviation to define the scale. Alternatives are available that do not rest on the population chosen for their origin and unit. A simple alternative measure—long antedating IRT—is the raw sum or the mean of the item scores. The test score, the raw sum of the item scores,

Y = ΣXj,    (19)

and the mean score,

M = Y/m,    (20)

provide alternatives that many users of tests may justifiably prefer. For example, a psychiatrist may prefer a count of symptoms of a behavior disorder to a score on the latent trait.

In the linear model, from Equation 4,

E{Y | F = f} = Σμj + (Σλj)f,    (21)

or, from Equation 1,

Y = Σμj + (Σλj)F + ΣUj.    (22)

We can rewrite Equation 21 as

t = μY + λ.f,    (23)

where μY is the test score mean, and λ. the sum of the loadings λj. We can rewrite Equation 22 as:

Y = T + E = [μY + λ.F] + ΣUj.    (24)

Here, T is the true score of classical test theory, and E the error of measurement of the attribute by the total test score.

By Equation 23,

f = (t − μY)/λ.,    (25)

we can define

F^Y = (Y − μY)/λ.,    (26)

and rescale E into E^Y = E/λ., so that

F^Y = F + E^Y.    (27)

That is, the rescaled test score F^Y is a measure of the latent trait F with error of measurement E^Y, the error in F from a rescaling of Y. Note that the correspondence between T and F fails, and consequently the linear model fails, when values of F would give values of T outside the range of possible values of Y.

Before proceeding we should note that in Spearman's original true score model, as in Equation 24, the distinction between true score and error remained undefined for decades, and various devices were proposed to deal with the lack of definition (see McDonald, 1999, Ch. 5). If, as is usual, F has been standardized in the calibration population, Equation 23 gives a simple relation between the attribute so measured and the true score in sum score units, whereas Equation 24 gives an appropriate foundation for the classical model. This relationship justifies using the raw sum score or mean score as a measure of the attribute, each with an origin and unit that can have a direct and important meaning. Consider, as an example, a set of items with the same Likert-scale format, where a respondent's mean score might represent, say, modal disagreement with all of the statements measuring an attitude.

In any model with a nonlinear link function, we have, correspondingly, from Equation 19,

E{Y | F = f} = Σγ(aj + bj f).    (28)

For the case of binary items, Equation 28 becomes

t = E{Y | F = f} = ΣP(aj + bj f).    (29)

Equation 29 will give a nonlinear monotone relationship, with f unbounded on the real line and t bounded by 0 and the number of items. The graph of t on f is known as the test characteristic curve. The relation given by Equation 23 for the linear model is also a test characteristic curve, although not always recognized as such.
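Both test characteristic curves for the LSAT6 can be reproduced from the parameter tables: Equation 23 with μY = Σpj and λ. = Σλj from Tables 7.1 and 7.2, and Equation 29 with the normal ogive and the aj, bj of Table 7.3. (The results agree with the tabled values to about .01, reflecting rounding of the published estimates.)

```python
import math

def Phi(z):
    # standard normal cdf, used as the ogive N
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p   = [0.924, 0.708, 0.553, 0.763, 0.870]       # difficulties, Table 7.1
lam = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]  # loadings, Table 7.2
a   = [1.549, 0.595, 0.152, 0.773, 1.200]       # intercepts, Table 7.3
b   = [0.412, 0.410, 0.544, 0.406, 0.368]       # slopes, Table 7.3

mu_Y, lam_dot = sum(p), sum(lam)

def tcc_linear(f):   # Equation 23: t = mu_Y + lambda. * f
    return mu_Y + lam_dot * f

def tcc_ogive(f):    # Equation 29 with the normal-ogive link
    return sum(Phi(aj + bj * f) for aj, bj in zip(a, b))

for f in (-3, -2, -1, 0, 1, 2):
    print(f, round(tcc_ogive(f), 2), round(tcc_linear(f), 2))
```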


Table 7.5. LSAT-6—Test Characteristic Curves

f      −3     −2     −1     0      1      2
NO     1.82   2.51   3.27   3.88   4.39   4.72
Lin    2.10   2.67   3.24   3.82   4.39   4.96

Note. NO, normal-ogive; Lin, linear.

In the LSAT6 data, as Table 7.5 shows, the test characteristic curve from the linear model gives a remarkably good approximation to that from the normal ogive model in this example. Undoubtedly cases can be found in which the approximation is poor.

The metric given by the latent trait f is unbounded on the real line. We can regard Equations 28 and 29 as applicable to any function Y of the item scores. The sum score metric is bounded by 0 and m, and, an equivalent, the proportion keyed—the mean score—is bounded by 0 and 1. The metrics are related nonlinearly by the (formula score) test characteristic curve. As we will see, an important function of the model is to supply information about measurement error in the latent trait, in the sum or mean score, or in any function of the item scores. Because the attribute to be calibrated is not itself bounded, in any sense, it might seem that we should regard the metric given by the latent trait as more fundamental than that given by a (formula) score from a set of items chosen to measure it. Following Lord (1980), we would regard the test characteristic curve as exhibiting "distortions of mental measurement" (Lord, 1980, p. 49) resulting from the test score chosen.

When the metric of the scale has been determined by the choice of a link function, the distribution of F and of any formula score is determined by that choice and is not open to arbitrary assumption. McDonald (1967) gave a method for estimating the moments of the distribution of F from the distribution of the observations. This provides a simple test of departure from normality. Bartholomew and Knott (1999) have taken the view that the distribution is arbitrary along with the choice of link function.

Measurement and Error of Measurement (Reliability)

When we have calibrated the attribute by fitting a model, we may then wish to use the created test to assign a measure of the attribute to an examinee. We also wish to obtain a standard error of measurement as scaled by the model or put confidence bounds on the measurement. We may wish to obtain a measurement of an attribute from one or more members of the calibration population or examinees from a different population. This consideration does not affect the measurement process. I suggest as an axiom that a measure of a defined quantity (which requires a calibrated scale) equals the quantity to be measured plus an error of measurement. Also the variance of the measurement should equal the variance of the measured quantity plus the variance of measurement error. Spearman's true score model essentially expresses this axiom.

In the linear (factor) model, Equation 24 satisfies the axiom, with

Var{Y} = Var{T} + Var{E},    (30)

and from

Var{T} = (Σλj)²,    (31)

and

Var{E} = Σψj²,    (32)

we have

Var{Y} = Var{T} + Var{E} = (Σλj)² + Σψj².    (33)

Thus, the fitted factor model gives us the variance of the error of measurement from the unique variances of the item scores, and the true score variance from the factor loadings. Note that in this realization of Spearman's (1904a) classical true score model, the error of measurement arises in a single administration of the test and results from specific properties of the items, although it may include confounded errors of unrealized replication.

One of the oldest concepts in Test Theory is, of course, the reliability coefficient of a total test score, defined as the ratio of the variance of the true score to the variance of the total score, and conventionally denoted by ρYY. A somewhat neglected alternative concept is the reliability index, the correlation between the true score and the total score, denoted by ρYT. The two coefficients are related by

ρYY = ρYT².    (34)

However, for decades these quantities remained undefined, like the true score itself.

From the parameters of the factor model, the reliability coefficient is given by

ρYY = ρYT² = ω = λ.²/(λ.² + Σψj²).    (35)
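Equation 35 can be checked directly against the Spearman-analysis estimates in Table 7.2; the computation reproduces the value of coefficient omega quoted for the LSAT6 in the text:

```python
lam  = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]  # loadings, Table 7.2
psi2 = [0.0665, 0.1882, 0.2126, 0.1670, 0.1076]  # unique variances, Table 7.2

lam_dot = sum(lam)                                # lambda. = 0.573
omega = lam_dot ** 2 / (lam_dot ** 2 + sum(psi2)) # Equation 35
print(round(omega, 3))  # 0.307
```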
This is also Var{F}/Var{F^Y}, the ratio of the variance of the factor score to the variance of its measure obtained by rescaling the total score in Equation 26. Coefficient omega, defined by Equation 35, was originally given by McDonald (1970). From the parameters of the linear model in Table 7.2, it is easy to compute coefficient omega. It is 0.307, which, of course, is very low.

Guttman (1945) gave a lower bound to the reliability of a total test score, which was then further studied by Cronbach (1951) and is commonly known as Cronbach's alpha. I prefer to acknowledge Guttman's original contribution and call it the Guttman-Cronbach alpha. It is defined by

α = [m/(m − 1)][1 − (ΣVar{Xj})/Var{Y}].    (36)

A sample estimate of G.-C. alpha is still commonly used for reliability, although Guttman (1945) clearly showed that it was a lower bound. Novick and Lewis (1967) showed that alpha gives reliability if the items are true score-equivalent, meaning that the items fit the model

Xj = T + Ej,    (37)

and McDonald (1970) showed further that it gives reliability if and only if the factor loadings in Equation 1 are equal. In applications, G.-C. alpha is often a very good lower bound to coefficient omega. In the LSAT6 data, from Table 7.1 we obtain a value for alpha of 0.295, very little less than omega. The case for using omega rests on the fact that it is a simple byproduct of a prior analysis that gives other useful results, as will be seen.

A reliability coefficient is not an end in itself. From its origin, it was a device for overcoming the problem of replicating test scores, to obtain a standard error of measurement. The measurement error variance can be expected to be approximately invariant over populations, whereas the reliability varies widely (with the true score variance) from population to population. This variability can be seen in lists of reliability estimates from different populations, recorded in handbooks of tests and measurements.

The simple sum score, possibly rescaled to a mean by dividing by m, or rescaled to the metric of the common factor by Equation 26, is not the best measure of the attribute. With scaling to latent-trait metric (and latent-trait variance 1), the weighted sum

F^B = Σwj(Xj − μj),    (38)

with

wj = [1/Σ(λj²/ψj²)][λj/ψj²],    (39)

gives a measure (resulting from Bartlett, 1937),

F^B = [1/Σ(λj²/ψj²)]Σ[λj/ψj²](Xj − μj) = F + E^B,    (40)

with minimum error variance

Var{E^B} = 1/Σ(λj²/ψj²),    (41)

and maximum reliability coefficient 1/[1 + Var{E^B}]. For the LSAT6 data, the maximum reliability, given by these weights, is 0.309—hardly an improvement on 0.307 from the simple sum score.

The reciprocal of the error variance in Equation 41,

I = Σ(λj²/ψj²),    (42)

is a sum of m independent terms, one for each item. Each makes a separate contribution to the reduction of the measurement error. The more informative items are those with the largest ratio of squared loading to unique variance. We can take the test information to be defined by this reciprocal and the item information to be the contribution of each term in Equation 42.⁴ The usefulness of the information concept for us rests on the additivity of these terms, enabling, as we will see, the independent selection of good items. The weights given by Equation 39 minimize the measurement error variance and maximize the information and reliability, among all possible weighted sums of the items.⁵ The maximum reliability can be written in terms of information as I/(I + 1).

The raw sum score is an equally weighted sum of the item scores. Scaled to F^Y as in Equation 26, it has error variance

Var{E^Y} = Σψj²/(Σλj)²,    (43)

and test score (sum score) information

I^Y = (Σλj)²/(Σψj²).    (44)

The ratio

RE = I^Y/I = Var{E^B}/Var{E^Y},    (45)

which is the ratio of the information in the simple sum score to the maximum information in the test (given by Equation 38 with Equation 39), is the relative efficiency of this test score, necessarily less than 1. The relative efficiency of the sum score for the LSAT6 is 0.994. There are other possibilities—for example, we can obtain the relative efficiency of sum scores from a subset of items.⁶
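The LSAT6 values quoted above are straightforward to reproduce: alpha from the sample moments of Table 7.1 (Equation 36), and the test information and maximum reliability from the Spearman estimates of Table 7.2 (Equations 41 and 42):

```python
m = 5
item_var = [0.0702, 0.2063, 0.2472, 0.1808, 0.1131]  # diagonal of Table 7.1
covs = [0.0089, 0.0130, 0.0050, 0.0021, 0.0259,
        0.0120, 0.0132, 0.0231, 0.0089, 0.0142]      # below-diagonal, Table 7.1

var_Y = sum(item_var) + 2.0 * sum(covs)              # Var{Y} of the sum score
alpha = (m / (m - 1)) * (1.0 - sum(item_var) / var_Y)  # Equation 36
print(round(alpha, 3))  # 0.295

lam  = [0.0605, 0.1345, 0.1861, 0.1174, 0.0745]      # Table 7.2
psi2 = [0.0665, 0.1882, 0.2126, 0.1670, 0.1076]
I = sum(l * l / p for l, p in zip(lam, psi2))        # Equation 42
print(round(I / (I + 1.0), 3))  # 0.309, the maximum reliability
```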


as the number of items becomes large, because it is a sum of m independent unique components. We can then put confidence bounds on an examinee's score by standard methods, without imposing distribution assumptions on the model. Using the linear model for the LSAT6 gives an error variance Var{E} = Σψ_j² of the raw score Y equal to 0.742 and a standard error of measurement of 0.861. Ninety-five percent confidence bounds on an examinee's score of 3 would be 3 ± 1.96 × 0.861—that is, 1.31 and 4.69, which nearly covers the range (0 to 5) of the scale. The usefulness of the LSAT6 clearly lies in the pleasures of analysis it has given to psychometric theorists rather than in its precision as a measuring instrument. We can rescale these numbers to the scale of the latent trait, from Σλ_j = 0.573 and the test mean μ_Y = Σμ_j = 3.818, giving f_Y = (3 − 3.818)/0.573 = −1.43, with error variance 0.742/0.573² = 2.260, standard error of measurement 1.503, and confidence bounds −1.43 ± 1.96 × 1.503—that is, −4.377 and 1.517.

On the face of it, the linear (factor) model makes the strong assumption that the errors of measurement are homoscedastic. This assumption is easily tested and will very commonly be found to be false. A classical method for testing it results from Mollenkopf (1949) and Thorndike (1951). (See also Lord, 1984; Feldt et al., 1985; and Qualls-Payne, 1992.) The principle, which is capable of refinement, is: We split the items into two parallel halves, and plot the variance of the difference (which estimates error variance) against the sum (which estimates the true test score). We can call this a practical or empirical method. A method based on IRT is given later. When the item response model fits, there should be little to choose between the methods, and the model-based method has the best theoretical justification. The resulting standard error of measurement, a function of test score, is referred to as a conditional standard error of measurement.

In a nonlinear model, the item information and the test information—and hence the corresponding measurement error variance—are, by theory, functions of the position of the examinee on the scale. We can use the model to define a true score and a conditional error of measurement for any formula score—any quantity calculated from the item scores. Let S be any such score. Then a corresponding true score t_S is given by, for example,

t_S = E{S|F = f} = g(f). (46)

The error of measurement of the formula score is E_S = S − t_S. Then

S = T_S + E_S = g(F) + E_S. (47)

If the function g(f) is invertible, then formally there is a nonlinear transformation

g⁻¹(S) = F + E_f^S = F + g⁻¹(E_S). (48)

Thus, g⁻¹(S) is a measure of F, with error of measurement E_f^S = g⁻¹(E_S). Now suppose we take

S_W = Σ w_j X_j, (49)

a weighted sum with fixed weights. (With weights equal to 1, this gives the simple raw sum.) Then,

T_W = E{S_W|F = f} = Σ w_j P_j(f), (49a)

and

E_W = Σ w_j X_j − Σ w_j P_j(F). (49b)

Here T_W and E_W represent the true and error components of the weighted sum S_W given by Equation 49. Then,

Var{E_W} = Σ w_j² P_j(1 − P_j). (50)

If we intend to measure the attribute in the metric of the chosen formula score (the chosen weighted sum), then this last result is all we need. A common choice will be the raw sum score with

Y = T_Y + E_Y, (51)

where

Var{E_Y} = Σ P_j(f)[1 − P_j(f)]. (52)

This result supplies the conditional standard error of measurement of the sum score, based on the item response model. As remarked already, when the model fits, model-based conditional standard errors will agree closely with the empirical results from classical methods.

A measure of the latent trait f for any individual can be obtained from his/her formula score s_W in Equation 49 by equating it to its expectation, writing s_W = t_W—that is,

Σ w_j x_j = Σ w_j P_j(f). (53)

Equation 53 can be solved for a measure f_W by plotting t_W against f and finding the point where the equality is satisfied—or by an equivalent computer program. This corresponds to applying Equation 47. Then,

F_W = F + E_f^W, (54)

mcdonald 127
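Equation 53 defines the trait measure f_W only implicitly. Because the right-hand side Σ w_j P_j(f) is increasing in f for positive weights and slopes, it can be solved by a simple one-dimensional root search in place of plotting t_W against f. The sketch below is illustrative only: the normal-ogive item parameters are hypothetical, not the LSAT6 estimates, and bisection is one convenient choice of search.

```python
import math

def P(f, lam, psi, tau):
    """Normal-ogive item response function, P[(lam*f - tau)/psi] (Equation 14 form)."""
    z = (lam * f - tau) / psi
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def solve_f(x, lam, psi, tau, w, lo=-6.0, hi=6.0, tol=1e-10):
    """Solve sum(w_j x_j) = sum(w_j P_j(f)) for f (Equation 53) by bisection."""
    target = sum(wj * xj for wj, xj in zip(w, x))
    def t_w(f):  # expected formula score t_W at trait level f
        return sum(wj * P(f, l, p, t) for wj, l, p, t in zip(w, lam, psi, tau))
    if target <= t_w(lo):   # score at or below the model's floor on [lo, hi]
        return lo
    if target >= t_w(hi):   # score at or above the ceiling
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if t_w(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical three-item test; unit weights give the raw sum score.
lam, psi, tau = [0.6, 0.5, 0.7], [0.80, 0.87, 0.71], [0.0, -0.3, 0.2]
f_hat = solve_f(x=[1, 1, 0], lam=lam, psi=psi, tau=tau, w=[1, 1, 1])
```

With weights of 1 this recovers the latent-trait transform of the raw sum score; the weights λ_j/ψ_j of Equation 56 would be handled identically and minimize the error variance.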
where F_W is the measure of F given by Equation 53. The corresponding error of measurement, E_f^W, has variance

Var{E_f^W} = [Σ w_j² P_j(f)(1 − P_j(f))]/[Σ w_j P′_j(f)]², (55)

where P′_j(f) is the gradient of P_j(f). Any choice of the weights will give a measure of f for each individual, from Equation 53, and a corresponding variance of measurement error, given by Equation 55.

If we set weights equal to the discrimination (slope) parameters of the items, then the variance of the measurement error is the minimum possible from all choices of weights. This is the choice w_j = b_j in the regression parameterization (Equation 10), a*_j in Lord's parameterization (Equation 13), or λ_j/ψ_j in the item factor parameterization (Equation 14). In Lord's (1980) original account of these results, Equation 13 is used. Here it is convenient to use Equation 14, to exhibit the relation to the linear model. That is, the formula score

s = Σ(λ_j/ψ_j)[x_j − μ_j] (56)

gives a measure f_b of f for any individual from the solution f_b of

Σ(λ_j/ψ_j)x_j = Σ(λ_j/ψ_j)P_j(f). (57)

It has the property that

F_b = F + E_b, (58)

with

Var{F_b} = Var{F} + Var{E_b}, (59)

where Var{F} = 1 and

Var{E_b} = Var{F_b|F = f} = 1/I(f). (60)

Here,

I(f) = Σ[P′_j(f)²]/[P_j(f)(1 − P_j(f))], (61)

and P′_j(f) is the gradient of P_j(f). As in the linear case, I(f) is the information, the reciprocal of the error variance. The choice of these weights minimizes the error variance and, equivalently, maximizes the information from the data. We notice that Equation 61 parallels Equation 41 for the linear model. The measures f_b are nonlinear counterparts of the Bartlett scores given by Equations 37 and 38. The information is the sum of terms that constitute item information—the contribution of each item to the reduction of error.

There is no counterpart of the reliability coefficient in the nonlinear model. This is not a defect, from the modern perspective. However, Raju et al. (2007) have suggested inventing a conditional reliability for a sum score, defined as the unit complement of the ratio of the conditional variance of the error of measurement to the total variance. The intended use of this index is to compare the conditional precision of two tests of the same attribute. It is not clear whether such an index has any advantages over the use of relative efficiency as suggested by Lord (1980) for such purposes.

Unlike the linear case, the item information is a function of f. Any other formula score must give less information and a greater measurement error variance than the solution of Equation 57—a counterpart of Bartlett's minimum error measure, Equation 37 with 38—at every point on the scale of f. In particular, the sum score, Y = ΣX_j, with true score T_Y = ΣP_j(F), gives a conditional error variance of Y

Var{E_Y} = Σ P_j(f)(1 − P_j(f)), (62)

and a conditional error variance of the measure of f

Var{E_f} = [Σ P_j(f)(1 − P_j(f))]/[Σ P′_j(f)]², (63)

parallel to Equation 37. At every point of the scale, this must be greater than the minimum variance given by Equation 57. In applications, the difference may be small, and for some purposes it would be reasonable to use the total test score as a measure, or its transformation onto the latent-trait scale. The information function and its reciprocal—the error variance—change radically under transformations of the scale. For example, at the floor and ceiling of the test, the error variance of the latent trait becomes infinite, whereas that of the true score becomes zero. The unbounded metric shows clearly that we cannot get good estimates for examinees for whom the test is too easy or too difficult. At these extremes, the true score is constrained from changing in response to the attribute and conveys no information about it.

For the LSAT6 data, Table 7.6 gives the item information functions. Table 7.7 summarizes the further results relevant to this section. These are: (1) TCC: the test characteristic curve—the sum of the item characteristic curves; (2) I(f): the test information function—the sum of the item information functions; (3) Var{E_b}: the minimum error variance available from the test; (4) S.E.M.(f): the standard error of measurement from f_b; (5) the TCC from the linear model; and (6) Var{E_Y}: the (constant) error variance of F from the linear model. We observe that all the items are easy for this population, and we have lower standard errors of measurement (more information) for examinees of low ability than



for examinees of high ability. For some purposes, this might be desirable, whereas for others it would be a defect of the test. We can immediately see how we could select items from a large calibrated set to form a test information curve of desired form. This is the subject of the next section.

Over the last couple of decades or so in psychometric theory, there has been a general movement away from the method of maximum likelihood to Bayesian estimators, both for parameters of the items and for predicting values of the attributes of examinees (see, for example, Bartholomew & Knott, 1999). I need to point out that there remains some confusion in terminology about the "estimation" of latent traits, with no distinction made between measurement and prediction, and some writers loosely referring to obtaining, assessing, constructing, or finding "proxies" for factor scores—values of individual latent traits. To discuss the Bayesian treatment, I find it convenient to distinguish measurement, as treated so far, and prediction.

Table 7.6. Item Information Functions

                                f
Item    −3      −2      −1      0       1       2       3
1       .114    .086    .054    .031    .016    .008    .004
2       .092    .117    .118    .095    .063    .037    .020
3       .059    .120    .192    .210    .154    .083    .038
4       .104    .119    .108    .080    .050    .028    .015
5       .097    .084    .062    .040    .024    .013    .008

My treatment of measurement and errors of measurement has been free of distribution assumptions. Rather than measures of factor scores/latent variables, in the sense employed here, we can ask for best predictors of them from any information we have about the subjects. In the Spearman model, given no information about a subject beyond his/her item scores, we can use the regression of the latent variable on the item scores as a best linear predictor, minimizing the residual variance. This predictor, given by Thomson (1934), takes the form

f_T = E{F|X_1 = x_1, . . . , X_m = x_m} = [1/(1 + I)]Σ[λ_j/ψ_j²](x_j − μ_j). (64)

Thomson's predictor decomposes the measured quantity F into orthogonal components, giving

F = F_T + E_T, (65)

with

Var{F} = Var{F_T} + Var{E_T} = I/(1 + I) + 1/(1 + I) = 1. (66)

This may be contrasted with the Bartlett measure, which decomposes the measure F_B into orthogonal components

F_B = F + E_B, (67)

with

Var{F_B} = Var{F} + Var{E_B} = 1 + 1/I. (68)

We note also that

Var{F_T} = 1/Var{F_B} < Var{F} < Var{F_B}, (69)

Var{E_T} < Var{E_B}, (70)

Table 7.7. LSAT6—Summary of 2PL Results

                               f
            −3      −2      −1      0       1       2       3

TCC         1.82    2.52    3.25    3.90    4.39    4.69    4.85
I(f)        0.466   0.526   0.534   0.455   0.307   0.170   0.084
Var{E_b}    2.146   1.901   1.873   2.198   3.257   5.882   11.904
S.E.M.(f)   1.47    1.38    1.37    1.48    1.81    2.43    3.44

Linear Approximation
TCC         2.10    2.67    3.24    3.82    4.39    4.96    5.59
Var{E_Y}    2.23    2.23    2.23    2.23    2.23    2.23    2.23
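The Var{E_b} and S.E.M.(f) rows of Table 7.7 are determined by the information row through Equation 60: Var{E_b} = 1/I(f), and the standard error of measurement is its square root. A quick check against the tabled values:

```python
import math

# I(f) row of Table 7.7, for f = -3, -2, ..., 3
info = [0.466, 0.526, 0.534, 0.455, 0.307, 0.170, 0.084]

var_eb = [1.0 / i for i in info]      # minimum error variance, 1/I(f)
sem = [math.sqrt(v) for v in var_eb]  # S.E.M.(f)

# var_eb reproduces 2.146, ..., 11.904 and sem reproduces 1.47, ..., 3.44
# to the rounding used in the table.
```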

and

E{F_T|F = f} = [I/(1 + I)]f. (71)

The regression predictor is a shrunken, conditionally biased estimator of the latent variable. The standard error of prediction is less than the standard error of measurement. It is not presently clear what would motivate the choice between measurement and prediction in applications of the model. Prediction appears to be the current "default" option.

If it is found empirically that the latent trait has a normal distribution, then the regression predictor is also a Bayes predictor (see Bartholomew & Knott, 1999). In the corresponding 2PL model, if it turns out that empirically the latent trait has a normal distribution, then we can obtain a Bayes predictor by methods described in Bartholomew and Knott (1999) and Skrondal and Rabe-Hesketh (2004). The Bayes predictor is, again, a shrunken, conditionally biased estimator.

Item Selection

In developing a test, it is common practice to try out a set of items on a calibration sample, of which only a subset will be kept for actual application—measuring examinees with the final calibrated test. We need a convenient way to choose a "best" subset. It may also happen that the items under trial seem insufficient to measure the attribute well enough, and we wish to know how many more need to be written.

The conception of our freedom to shorten or lengthen a test measuring a given attribute contains, at least implicitly, the recognition that the attribute is not "operationally" defined by just the set of items chosen to measure it. The possibility of shortening or lengthening a test for an attribute rests on an idealization. In effect, we suppose that the items written come from a quasi-infinite set of items that would, if written and administered, define and measure the attribute precisely. Such a quasi-infinite set has been called a behavior domain, or a universe of content. I prefer to call it an item domain. Although it will virtually never be the case that the m items we have are a random sample from an item domain, it is necessary to think of them as obtained from it and to take the true score or latent trait to be determined by it. The limit results justifying this do not depend on random sampling. They do depend on the strong assumption that we know how to realize indefinitely more items measuring just the attribute we intend to measure. This requires very careful conceptualization.

In Classical Test Theory, a large number of heuristic devices for item selection have been developed, with varying degrees of theoretical motivation. These are no longer needed (see McDonald, 1999, Chapter 11). In the linear case, with a set of items fitting a single-factor model, we see immediately that a best subset of m items would contain the items with the largest information values, λ_j²/ψ_j², thus yielding the smallest error variance and largest reliability for a given number of items. As an example in miniature, if we wanted the best three items from the LSAT6, using the linear model as approximating it well enough, then we would take items 3, 2, and 4, with information 0.163, 0.096, and 0.082, respectively, rejecting items 1 and 5, with information 0.055 and 0.052, respectively. Keeping the first three gives test information 0.341 and error variance 2.93, to be compared with information 0.448 and error variance 2.232 from the full set.

The nonlinear counterpart procedure is more complex. We use the item information functions to select a subset of items giving a desirable test information function, recognizing that we may not be able to minimize conditional error variance at all levels of the latent trait, and may wish to have "small" error variance in specified intervals. This depends on the purpose of the test. A careful study of the item information functions in Table 7.6 and the way they give different orders at different points of the scale will indicate the possibilities and difficulties of the task.

Of course, the principles just adduced for eliminating items from a set apply also to adding items of known characteristics. For the problem of "prophesying" the effect of adding items that are not yet available, Modern Test Theory adds little to the classical treatment. The Spearman-Brown classical "prophesy" formula (Spearman, 1910; Brown, 1910) allowed the prediction of the reliability of a lengthened test, of r items, from the reliability of a test of m items. It required the very strong condition that the items in the given test be parallel. In Classical Test Theory, this condition requires that each measures a shared true score with equal error variance. Interpreting this in terms of the linear factor model, the condition is that each of the m items has the same factor loading and the same unique variance. An assumption is also needed that the additional items have the same loading and unique variance.

The one mild advantage of applying the factor model over the classical treatment is that the condition and assumption can be weakened considerably. If the m given items fit the Spearman model, we can



define an approximate reliability for them as

ω = λ.²/(λ.² + ψ.²), (72)

where λ. is the sum of the loadings and ψ.² is the sum of the unique variances. The Spearman-Brown formula is derived in Classical Test Theory as a special case of G.-C. alpha, and can also be derived from the Spearman model with equal loadings and equal unique variances (see McDonald, 1999, Chapter 6). The formula is given by

ρ_m = mρ_1/[(m − 1)ρ_1 + 1]. (73)

This is an expression for the reliability of a test of m parallel items from the reliability of just one of them. We can use it with Equation 72 to obtain the reliability of a projected test of r items with the strong condition eliminated, and the assumption weakened to the hope that the added items have the same average loadings and the same average unique variances as the given set. This is still a strong demand on the item writer.

There is no clear strategy in the conceptually parallel model for binary data, allowing the investigator to predict the number of items needed to meet a criterion for error variance. For this purpose, it is not unreasonable to use the linear model as an approximation and apply the modernized classical formula. This is perhaps a little better than consulting a crystal ball, and perhaps the question is not of great importance.

The most important function of the behavior domain concept is that it serves to determine the latent trait or true score as the score on a test of infinite length. The behavior domain gives a clear distinction between these scores and a measure of them from a chosen set of items. In Lord and Novick's account, there is room for alternative treatments of a true score as either the score on a test of infinite length or as the mean of a "propensity distribution." A propensity distribution is the distribution of the score obtained when one examinee is brainwashed to forget previous responses and retested many times under ideal conditions (but see McDonald, 2003).

Homogeneity and the Dimensionality of Tests

An unexamined assumption of the previous discussion is that the items are measures of just one attribute. In the classical period, this was discussed as the question of test homogeneity. In terms of the Greek root, the question is whether the items are of the same (homo-) kind (genos). In that period, a remarkable number of indices or heuristic devices were invented to measure or test the extent to which a set of items is homogeneous. These were based on rather intuitive notions of what homogeneity means. Hattie (1984, 1985) studied the behavior of 87 of these and found that only one could be recommended. As expected from McDonald (1981), this exception was based on a check to see whether the item scores fit a model with a single latent trait. Indeed, we may now take it that what was always intended by the term "homogeneous test" is one whose item scores fit a model with a single latent trait. This follows from our identification of the latent trait in the mathematical model with the attribute as measured by the item responses.

In the early literature, psychometric theorists often treated the single common factor and the m unique factors as on the same conceptual level and described the model as containing m + 1 factors. This way of expressing it makes an ambiguity over what we should mean by the dimensionality of a test.7 Writing the linear model as the expected value of an item score for fixed factor score, as in Equation 4, and the item response model as the corresponding Equation 10, we regard f as the single dimension on which the attribute varies.

We now consider the possibility of writing alternative p-dimensional models. I will just illustrate this possibility with two-dimensional models, writing

E{X_j|F_1 = f_1, F_2 = f_2} = μ_j + λ_j1 f_1 + λ_j2 f_2, (74)

with

U_j = X_j − E{X_j|F_1 = f_1, F_2 = f_2}, (75)

a two-factor model for quantitative data, or

Prob{X_j = 1} = P(a_j + b_j1 f_1 + b_j2 f_2), (76)

a two-latent-trait model for binary data, in regression equation notation. This can also be written as

Prob{X_j = 1} = P[(λ_j1 f_1 + λ_j2 f_2 − τ_j)/ψ_j], (77)

the obvious extension of Equation 10.

The extension of the factor model to multiple dimensions has a long history. The history of the corresponding latent trait models is short, and unfortunately it has not always been recognized that multidimensional item response models require the techniques invented for multiple-factor models if we are to fit and interpret them in applications to real data. The latent traits in models of this kind are correlated, and each of the latent traits is standardized. The parameters of the linear model are

the item means μ_j, the factor loadings λ_j1 and λ_j2, the unique variances ψ_j², and, for this model, the correlation between the latent traits, which we will write as φ_12. Correspondingly, the parameters of the two-dimensional item response model are a_j, b_j1, b_j2, and the correlation of the latent traits φ_12. In two or more dimensions, in addition to the problem of choosing an origin and unit for each axis, we face the very old problem of choosing an orientation of the axes in the two- or p-dimensional space—the rotation problem of classical factor analysis. In exploratory studies, this problem has usually been solved by fitting the model on some convenient basis and then transforming the factor loadings to Thurstonian simple structure, with uncorrelated (orthogonal) factors or correlated (oblique) factors (see, for example, Mulaik, 2010).

In the context of test construction, we write items to measure an attribute and should not need to use exploratory methods. Even so, in the early stages of developing and calibrating a test, the conceptual denotation of the attribute may be unclear, and it may be that the attribute is conceived at a high level of abstraction from behavior, and the domain of possible items divides into subdomains. The paradigm case is the set of correlated primary mental abilities into which Thurstone divided Spearman's general "intelligence"—scholastic ability. If the items are written so that they fall neatly into the subdomains, they form clusters that, if fitted separately, are homogeneous and fit the unidimensional models given by Equations 1 or 10. Jointly, they fit multidimensional models here represented by Equations 75 and 77, with correlated factors, and an item with nonzero factor loadings/slope parameters on one latent trait has zero factor loadings on the other. The items are said to be factorially simple, belonging clearly to just one attribute. This case is commonly referred to as having independent clusters (see McDonald, 1999, Chapter 9).

We may not succeed in creating pure clusters of items measuring just one latent trait. Some items may be factorially complex, with nonzero loadings on two or more factors. Although pure clusters are a desirable goal of measurement, at least the aim should be to create enough factorially simple items for these to form a basis for analyzing the measurement properties of the complex items and possibly eliminating them. (An example following should make these statements clearer.) Without such a basis, we cannot be sure what we are measuring (McDonald, 1999, Chapter 9, calls this case an independent clusters basis).

Reckase (2009) has provided a very different treatment of multidimensional item response models. Reckase seeks to describe the multidimensional space without reference to the methods developed in common factor modeling to determine what is measured. This interesting development awaits evaluation.

In the linear Equation 4, and the nonlinear counterpart in Equation 10, if the items form independent clusters, then a measure of each latent trait with minimum error variance is given by the corresponding cluster of items, with the same expressions (Equations 37 and 65) as for the unidimensional case. Even if the factors are highly correlated, the errors of measurement are uncorrelated. I suggest that we call this the case of pure measurement, with each attribute measured by just its own items and with uncorrelated measurement errors. This is an important property, because generally we would not wish to have the measurement of one ability, say, affected by items measuring another correlated ability.

Some writers would reject the distinction made here between measurement and prediction, regarding both as "estimation." Bartholomew and Knott (1999), for example, suggest the use of Bayes predictors in place of the measures I recommend. The Bayes predictors have the property that the predicted value of one attribute of an examinee is increased or decreased by a high or low value of another. Thus, being good at English improves an examinee's mathematics score. This effect can be described as "measurement contamination" or "borrowing strength from extraneous information," depending on whether we wish to measure or to predict. If, indeed, the intention is to predict the value of an attribute, then we can use items belonging to related attributes and any other information about the examinee (e.g., socioeconomic status, educational history, etc.) that is relevant to prediction. My view is that prediction from extraneous information is not measurement, and it would require a distinct research motive. On the face of it, measurement models are designed for the purpose of measurement.

As an example of a multidimensional model, I will give a brief account of an example in McDonald (1999, Chapter 14). Fifteen items taken from the ACT Mathematics Test have the item stems listed in Table 7.8. It seems reasonable to describe items 1 through 5 as measuring geometry achievement, and items 6 through 10 as measuring algebra achievement. Items 11 through 15 are less readily


Table 7.8. Fifteen ACT Mathematics Items

Item    Item Stem Description
1       Angles in a right triangle
2       Areas of bisected triangles
3       Length hypotenuse—right triangle
4       Length adjacent—right triangle
5       Area trapezoid
6       2√28 + 3√175
7       1/(√2 − 1)
8       (−3)² + 3⁻²
9       x, for which [x(x − 2)]/[(x − 1)(x − 2)] is undefined
10      2² + 2⁰ + 2⁻²
11      Application of 73/3 + 17.85 + 61/2
12      Slope of line 2x + 3y + 6 = 0
13      Radius of circle given circumference
14      Speed given distance and time
15      Longest diagonal in box

Table 7.9. ACT Independent Clusters Basis Solution

              Loadings
Item      I       II      Uniqueness
1         .766            .413
2         .642            .588
3         .451            .814
4         .604            .636
5         .485            .765
6                 .439    .809
7                 .502    .650
8                 .386    .851
9                 .666    .556
10                .388    .849
11        .365    .367    .534
12        .355    .363    .551
13        .358    .349    .567
14        .223    .436    .615
15        −.335   .548    .859
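With correlated factors, each item's uniqueness in Table 7.9 should equal 1 − λ′Φλ = 1 − (λ₁² + λ₂² + 2φλ₁λ₂), using the factor correlation φ = 0.739 reported for these data and treating a blank loading as zero. A check on one simple and one complex item:

```python
PHI = 0.739  # correlation between the geometry and algebra latent traits

def uniqueness(l1, l2, phi=PHI):
    """Uniqueness under the correlated two-factor model: 1 - lambda' Phi lambda."""
    return 1.0 - (l1 * l1 + l2 * l2 + 2.0 * phi * l1 * l2)

u_item1 = uniqueness(0.766, 0.0)     # item 1: loads on factor I only
u_item11 = uniqueness(0.365, 0.367)  # item 11: a complex item
```

Both values reproduce the tabled uniquenesses (.413 and .534) to rounding error.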

classified. We fit the model given by Equation 78, specifying a corresponding pattern of zero and nonzero coefficients, allowing the last group of items to be, possibly, composites of geometry and algebra. The fitted parameters are given in Table 7.9.8 The correlation between the latent traits is 0.739. The last five items appear to be fairly equally balanced combinations of geometry and algebra abilities. The clusters formed by the first two sets give simple item characteristic curves and the subtest characteristic curves given in Table 7.10. The complex items in the last group give the subtest characteristic surface tabulated also in Table 7.10. The simplicity of the basis supplied by the pure clusters is what gives us an understanding of the complex items in the last group. Without the basis for interpretation supplied by the geometry and algebra items, it would be difficult, if not impossible, to determine what we are measuring, and clearly impossible in models of higher dimensionality. The information functions and error variances have the same structure, with uncorrelated errors for formula scores from the first two sets and high correlations between the errors of measurement of the two latent traits as given by the last set. We could have pure measurements by keeping only the first 10 items.

Validity

The classical problem of determining the extent to which a test score is a valid measure of an attribute remains with us, although not quite in its classical form. I accept as a definition the statement: "A test score is valid to the extent that it measures the attribute of the respondents that it is employed to measure, in the population(s) in which it is used."

Early validity theory was influenced by an extremely behaviorist psychology and a logical positivist philosophy of science. The attributes we wish to measure were regarded as invented, convenient, fictional "constructs." This view led to a recognition of three forms of validity—namely, predictive validity, concurrent validity, and content validity, with hesitant admission of a fourth—"construct" validity. The many predictive validities of a test were its abilities to predict external measures. Concurrent validity required correlating the test with another test of

Table 7.10. ACT Test Characteristic Curves

     s1            s2            s3
f1   TCC      f2   TCC      f1/f2   −2      −1      0       1       2
−2   .149     −2   .051     −2      .026    .076    .175    .325    .507
−1   .326     −1   .121     −1      .042    .110    .236    .415    .614
0    .586     0    .260     0       .085    .182    .333    .521    .708
1    .821     1    .471     1       .162    .289    .452    .622    .770
2    .938     2    .679     2       .271    .418    .580    .698    .795
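The s3 block of Table 7.10 tabulates a subtest characteristic surface: the sum, over the items in the subtest, of two-dimensional item characteristic surfaces of the form of Equation 77. The sketch below evaluates such a surface for two hypothetical complex items; the parameter values are illustrative, not the fitted ACT estimates.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def icc2(f1, f2, l1, l2, tau, psi):
    """Two-dimensional normal-ogive surface, P[(l1*f1 + l2*f2 - tau)/psi] (Equation 77)."""
    return Phi((l1 * f1 + l2 * f2 - tau) / psi)

# Hypothetical items loading on both latent traits: (lambda1, lambda2, tau, psi).
items = [(0.36, 0.37, 0.5, 0.85), (0.35, 0.36, 0.8, 0.86)]

def surface(f1, f2):
    """Subtest characteristic surface: expected subtest score at (f1, f2)."""
    return sum(icc2(f1, f2, *p) for p in items)
```

Tabulating surface over a grid of (f1, f2) values reproduces the layout of the s3 block, with rows indexed by f1 and columns by f2.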

the same name. Content validity, regarded with suspicion by behaviorists, rested on the "subjective" judgment that the item contents were indicators of a common attribute.

Largely through the work of Cronbach and Meehl (1955), and that of Messick (1989), the common view now seems to be that there is one concept of validity, still called construct validity (out of pure habit), with predictive, concurrent, and content validity seen as evidence of it. Validation would now include all forms of evidence that the score is an acceptable measure of a specified attribute. My account will be limited to those forms of evidence that rest on the models we are considering as a framework for applications of test theory.9

It is possible to take the view that primary evidence of the validity of a test score comes from establishing that the test is homogeneous in two senses—that the test is unidimensional and that the item contents are judged to be measuring an abstract attribute in common. If we regard the attribute as what is perfectly measured by a test of infinite length, then a measure of validity can be taken to be the correlation between the total test score and the true, domain score. The redundant qualifier "construct" can be omitted, and we simply refer to "validity."

In the case of the Spearman model, the validity coefficient is just the reliability index given by the square root of coefficient omega. We can also call this index a coefficient of generalizability, from the m items in the test to the infinite set given by the item domain. Thus, on this view, reliability is validity is generalizability, which is a great simplification (see McDonald, 1985, 1999).

In the case of binary items, the error variance is a function of the latent trait, so it seems impossible to define an overall reliability index. However, using the normal ogive model, we can usefully define a validity and generalizability index for a set of binary items. The biserial correlation between a standardized variable Z and a binary variable X is the correlation between Z and an underlying response tendency X*, such that X = 1 if X* exceeds a threshold value τ. It can be shown that Cov{X, Z} = n(τ)Cor{X*, Z}, where n( ) is the ordinate of the normal density function. In the normal ogive model in Equation 14,

Cov{X*_j, F} = λ_j,

so

Cov{X_j, F} = n(τ_j)λ_j,

giving

Cor{Y, F} = [Σ n(τ_j)λ_j]/[Var{Y}]^(1/2). (78)

Evidence of validity comes from an examination of what the test does not measure as well as what it does. Suppose a set of items fits the multiple factor Equation 75 with independent clusters and correlated factors. Each cluster gives a subtest sum that should be correlated with its own factor, but it may also have a high correlation with the factors belonging to other clusters. The independent clusters model supplies the natural explication of a somewhat intuitive suggestion by Campbell and Fiske (1959). They suggested that multiple measures of a construct have convergent validity if they are sufficiently highly correlated and discriminant validity if they have sufficiently low correlations with tests of other, distinct constructs. The main contribution of Campbell and Fiske (1959) was the suggestion that any item can be regarded as a trait-method unit—a union of a particular trait-content and a method of measurement. To segregate trait from method, they recommend measuring a number of traits by



a number of methods in a crossed (multitrait–multimethod) design. This rather casual suggestion has spawned a large and confusing literature. The position I take is that multitrait–multimethod designs have not yet been shown to contribute to convergent/discriminant validity.

To quantify convergent and discriminant validity by the independent clusters model, we compute (1) the correlation between each cluster sum and its own factor, and (2) the correlation between each cluster sum and the other factors. We hope to find the former high for convergent validity and the latter small for discriminant validity. The correlation of each cluster-sum with its own factor is just its reliability index, the square root of omega. The correlation of each with the other factors is just its reliability index multiplied by its correlation(s) with the other factor(s). This holds for binary items also, using Equation 79 for the correlation between the cluster-sum and its latent trait. Table 7.11 gives the correlations for the ACT data between the three cluster sums (s1 for items 1–5, s2 for items 6–10, and s3) and the geometry [I] and algebra [II] latent traits; see McDonald, 1999, pp. 322–323, for

sets of items. It is customary to refer to these as alternate forms of a test. Two tests Y and V are item-parallel if the items in each are paired to have equal parameters in a jointly unidimensional model. (They are sometimes called "strictly parallel.") A necessary and sufficient condition for the scores Y and V on test forms Y and V to have the same distribution for examinees with the same value of their latent trait is that the forms are item-parallel. This is a condition for the complete exchangeability of the test forms. In some applications, it is a condition for equity, meaning that it cannot matter to the examinee which form is administered.

Two item-parallel test forms have (1) equal test characteristic functions, (2) equal test information functions, (3) equal test-score information functions, and (4) matched true scores and matched error variances at every point on the scale. We can recognize three distinct levels of matching or equivalence between test forms—namely, (1) item-parallel, (2) equal test characteristic and test-score information functions, and (3) equal test characteristic curves—matched true scores, but possibly different error